Negative Consequences of Dichotomizing Continuous Predictor Variables

The applet on this page illustrates in a regression context the negative effects of using median splits in data analysis. When the page is loaded, the applet displays a regression analysis of 25 data points randomly sampled from a bivariate normal distribution for which the population squared correlation is 0.25. The thin vertical red line marks the median. In a regression context, splitting the predictor variable at the median is equivalent to assigning all observations below the predictor median to have the same score on the predictor and all observations above the predictor median to have another score on the predictor. It is common to use 0, 1 (dummy codes) or -1, +1 (contrast codes) as these predictor scores. Equivalently, one can use the mean predictor value for the respective groups as is done in the applet below.
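The equivalence of the three codings can be checked numerically. The following is a minimal sketch, not the applet's own code: the seed, the helper name r_squared, and the choice to put the observation that falls exactly on the median into the lower group are all assumptions made for illustration.

```python
import random
import statistics


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# 25 points from a bivariate normal with population r-square 0.25 (rho = 0.5).
rng = random.Random(1)  # arbitrary seed, chosen only for reproducibility
rho = 0.5
x = [rng.gauss(0, 1) for _ in range(25)]
y = [rho * xi + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for xi in x]

# Median split. Assumption: the point equal to the median goes to the lower group.
median = statistics.median(x)
upper = [xi > median for xi in x]
mean_low = statistics.fmean([xi for xi, u in zip(x, upper) if not u])
mean_high = statistics.fmean([xi for xi, u in zip(x, upper) if u])

codings = {
    "dummy": [1.0 if u else 0.0 for u in upper],
    "contrast": [1.0 if u else -1.0 for u in upper],
    "group means": [mean_high if u else mean_low for u in upper],
}

# Each coding is a linear transformation of the others, so r-square is the same.
r2 = {name: r_squared(code, y) for name, code in codings.items()}
for name, value in r2.items():
    print(f"{name:12s} r-square = {value:.6f}")
print(f"{'continuous':12s} r-square = {r_squared(x, y):.6f}")
```

Because the three codings differ only by a shift and a positive rescaling, the printed r-square values for the split predictor agree up to rounding error.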

As the slider at the bottom of the graph is moved from left to right, the points in the graph slide towards the locations they would occupy if the predictor were split at the median. At the top of the graph are the regression equation, its r-square, and its t and p values. Note that as the slider moves towards the right, the regression line fluctuates minimally but r-square and t steadily decrease. These decreases are due almost entirely to the systematic reduction in predictor variance, displayed under the regression equation.
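The slider's effect on predictor variance can be imitated in a few lines. This sketch assumes the slider moves each point linearly from its original predictor value to its group mean; the applet's actual interpolation is not documented here, and the function name slide, the seed, and the slider positions are hypothetical.

```python
import random
import statistics


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def slide(x, s):
    """Move each value a fraction s (0 to 1) of the way to its group mean.

    Assumption: linear interpolation; the median point joins the lower group.
    """
    median = statistics.median(x)
    m_low = statistics.fmean([xi for xi in x if xi <= median])
    m_high = statistics.fmean([xi for xi in x if xi > median])
    return [(1 - s) * xi + s * (m_high if xi > median else m_low) for xi in x]


rng = random.Random(1)  # arbitrary seed
rho = 0.5               # population r-square of 0.25
x = [rng.gauss(0, 1) for _ in range(25)]
y = [rho * xi + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for xi in x]

steps = [0.0, 0.25, 0.5, 0.75, 1.0]  # slider positions, left to right
variances = []
for s in steps:
    xs = slide(x, s)
    variances.append(statistics.pvariance(xs))
    print(f"s={s:.2f}  predictor variance={variances[-1]:.4f}  "
          f"r-square={r_squared(xs, y):.4f}")
```

Under this interpolation the within-group part of the predictor variance shrinks by a factor of (1 - s) squared while the between-group part is unchanged, so the variance falls at every step. The r-square of a particular sample usually falls as well, although it is not guaranteed to do so at every single step.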

When the slider is at the far right, a representation of the two-sample Student t-test of the mean difference appears for the data split at the median. Note that the values of t and p for this statistical test are necessarily identical to those for the regression analysis. This illustrates that doing a median split reduces statistical power, primarily due to the reduction in the inherent variability of the predictor.
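The identity between the two-sample t and the regression t can be verified directly. This is a sketch under stated assumptions: the t-test is the pooled-variance (Student) version, the regression uses 0, 1 dummy codes, and the seed and variable names are illustrative only.

```python
import math
import random
import statistics

rng = random.Random(1)  # arbitrary seed
rho = 0.5               # population r-square of 0.25
x = [rng.gauss(0, 1) for _ in range(25)]
y = [rho * xi + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for xi in x]

# Median split; assumption: the point equal to the median joins the lower group.
median = statistics.median(x)
low_y = [yi for xi, yi in zip(x, y) if xi <= median]
high_y = [yi for xi, yi in zip(x, y) if xi > median]
n1, n2 = len(low_y), len(high_y)

# Two-sample Student t-test with pooled variance.
pooled_var = ((n1 - 1) * statistics.variance(low_y)
              + (n2 - 1) * statistics.variance(high_y)) / (n1 + n2 - 2)
t_test = ((statistics.fmean(high_y) - statistics.fmean(low_y))
          / math.sqrt(pooled_var * (1 / n1 + 1 / n2)))

# Simple regression of y on the 0/1 dummy code.
d = [1.0 if xi > median else 0.0 for xi in x]
n = len(d)
md, my = statistics.fmean(d), statistics.fmean(y)
sxx = sum((di - md) ** 2 for di in d)
sxy = sum((di - md) * (yi - my) for di, yi in zip(d, y))
slope = sxy / sxx                      # equals the difference in group means
intercept = my - slope * md
sse = sum((yi - intercept - slope * di) ** 2 for di, yi in zip(d, y))
t_reg = slope / math.sqrt(sse / (n - 2) / sxx)

print(f"two-sample t = {t_test:.6f}")
print(f"regression t = {t_reg:.6f}")
```

Both statistics have n - 2 degrees of freedom, so equal t values imply equal p values.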

It is interesting to observe the movement of points near each other and near the median. Moving the slider from left to right exaggerates the difference between those observations. At the same time, extreme observations are grouped together with observations near the median as the slider moves from left to right. Exaggerating the difference between observations that were originally close together while at the same time minimizing the differences between observations that were originally very far apart cannot possibly be a useful strategy for data analysis.

For further consideration of this example and other negative consequences of dichotomizing continuous variables, see:

Irwin, J.R., & McClelland, G.H. (2003). Negative consequences of dichotomizing continuous predictor variables. Journal of Marketing Research, 40, 366-371.

See also

MacCallum, R.C., Zhang, S., Preacher, K.J., & Rucker, D.D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19-40.

Browser Note: Users of recent versions of the Windows operating systems will likely need to download the Java VM from
http://java.sun.com/getjava/download.html
Also, the applet probably only works on Macs using OS 10, but this has not been thoroughly tested yet.

© 2002, Gary McClelland



