Pitfalls of Data-Driven Models
Updated: Apr 2
Defining a system from data is the poor man's version of research. If you don't understand a complicated system, you throw in enough data and see whether you can build a predictive model for your desired output variables from a set of control variables. More often than not, what goes into the pool of control variables is subjective, though occasionally there is some reasoning behind it. Usually that reasoning comes from the desire to reduce the volatility in predicting future values of the output variables. The belief is that if adding a new control variable to this cocktail of variables, called the system model, lowers the volatility in predicting future values of the output variables, then magically there is a cause-and-effect relationship between the new variable and the output variable. More often than not, the researcher doesn't elaborate on the basis of this causal relationship, or at best provides a perfunctory explanation that is little more than an excuse for their belief in this magical causality.
This kind of data-driven research leads to an interesting set of conclusions, some of which are listed in Nate Silver's book The Signal and the Noise. However, the best example I could find on the interwebs of failing to understand the dictum "correlation doesn't imply causation" is the research by Tatu Westling of the University of Helsinki, who found an inverted-U relationship between GDP and reported average penis length. You can read about this interesting research in: Westling, T. (2011). Male organ and economic growth: Does size matter? University of Helsinki Discussion Paper (335).
In the above article the author states: “The existence and channel of causality remains obscure at this point but the correlations are robust”. This is the sort of reasoning that ends up with correlation masquerading as causality. However, this type of system modelling seems to be in vogue now. The 2003 Nobel prize in economics went to C.W.J. Granger (shared with Robert Engle) for his methods of analyzing time series data with common trends, known as cointegration; he is also the Granger behind Granger causality. The 2011 Nobel prize in economics went to Christopher Sims (shared with Thomas Sargent), the developer of vector autoregression (VAR), for his research on cause and effect between macroeconomic variables.
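To make the earlier point about control variables concrete, here is a minimal sketch of why "adding this variable lowered my prediction error" is weak evidence on its own. Everything in it is made up for illustration: the data are synthetic, and the names (`x_real`, `x_junk`, `residual_sum_of_squares`) are hypothetical. It fits ordinary least squares twice, once with the one variable that actually drives the output and once with twenty extra columns of pure noise thrown into the pool.

```python
import numpy as np


def residual_sum_of_squares(X, y):
    """Fit ordinary least squares and return the in-sample residual sum of squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return float(residuals @ residuals)


rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
n = 50

# Synthetic data (an assumption for this sketch): y depends on x_real only.
x_real = rng.normal(size=n)
y = 2.0 * x_real + rng.normal(size=n)

# Twenty "control variables" that are pure noise, unrelated to y by construction.
x_junk = rng.normal(size=(n, 20))

intercept = np.ones((n, 1))
X_small = np.hstack([intercept, x_real[:, None]])  # the honest model
X_big = np.hstack([X_small, x_junk])               # the cocktail pool

rss_small = residual_sum_of_squares(X_small, y)
rss_big = residual_sum_of_squares(X_big, y)

print(f"In-sample RSS, real variable only:     {rss_small:.2f}")
print(f"In-sample RSS, plus 20 junk variables: {rss_big:.2f}")
```

The in-sample error of a least-squares fit cannot go up when you add columns to a model that already contains the old ones, so the second number will not be larger than the first, even though the junk variables have nothing to do with the output. A drop in fitted error therefore says little about cause and effect; checking predictions on data the model has not seen is a more honest test, and even that only speaks to prediction, not causation.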