The interactive graph on this page illustrates, in a regression context, the negative effects of using median splits in data analysis. When the page is loaded, the graph displays a regression analysis of 20 data points randomly sampled from a bivariate normal distribution for which the population squared correlation is 0.25. The vertical orange line marks the median. In a regression context, splitting the predictor variable at the median is equivalent to assigning all observations below the predictor median one common score on the predictor and all observations above the predictor median another common score. It is common to use 0 and 1 (dummy codes) or -1 and +1 (contrast codes) as these predictor scores. Equivalently, one can use the mean predictor value of the respective groups, as is done in the graph below.
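The following Python sketch (not the page's own code; it assumes NumPy and SciPy are available) mimics this setup: it samples 20 points from a bivariate normal with population squared correlation 0.25, replaces each predictor value with the mean of its median-split group, and fits a simple regression to both versions of the data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Sample 20 points from a bivariate normal with population r-squared = 0.25
# (i.e., population correlation rho = 0.5).
rho = 0.5
n = 20
cov = [[1.0, rho], [rho, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

# Median split: every observation at or below the predictor median gets the
# mean predictor value of the "low" group; every other observation gets the
# mean of the "high" group.
low = x <= np.median(x)
x_split = np.where(low, x[low].mean(), x[~low].mean())

for label, xs in [("continuous", x), ("median split", x_split)]:
    fit = stats.linregress(xs, y)
    print(f"{label:>13}: slope = {fit.slope:.3f}, r^2 = {fit.rvalue**2:.3f}, p = {fit.pvalue:.3f}")
```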
As the slider at the bottom of the graph is moved from left to right, the points in the graph slide toward the locations they would occupy if the predictor were split at the median. At the top of the graph are the regression equation, its r-square, and its t and p values. Note that as the slider moves to the right, the regression line fluctuates minimally but r-square and t steadily decrease. These decreases are due almost entirely to the systematic reduction in predictor variance, which is displayed under the regression equation: in simple regression, r-square equals the squared slope times the predictor variance divided by the outcome variance, so shrinking the predictor variance while the slope and the outcome variance stay roughly constant shrinks r-square, and with it t, in roughly the same proportion.
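Continuing the sketch above, a quick numerical check of that decomposition: the slope and the outcome variance are essentially unchanged by the split, while the predictor variance, and with it r-square, drops.

```python
# r^2 = slope^2 * var(x) / var(y) in simple regression, so with the slope and
# var(y) nearly unchanged, r^2 falls roughly in proportion to var(x).
for label, xs in [("continuous", x), ("median split", x_split)]:
    fit = stats.linregress(xs, y)
    r2 = fit.slope ** 2 * xs.var(ddof=1) / y.var(ddof=1)
    print(f"{label:>13}: var(x) = {xs.var(ddof=1):.3f}, slope = {fit.slope:.3f}, r^2 = {r2:.3f}")
```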
When the slider is at the far right, the test statistics reported are identical to those that would be obtained using a two-sample Student's t-test. This illustrates that doing a median split reduces statistical power, primarily due to the reduction in the inherent variability of the predictor variable.
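Under the same assumptions as the sketches above, this equivalence can be verified directly: regressing the outcome on the dichotomized predictor yields the same t and p as a pooled-variance (Student's) two-sample t-test comparing the two groups.

```python
# The slope t-test for a regression on the dichotomized predictor matches
# a pooled-variance two-sample Student's t-test comparing the two groups.
fit_split = stats.linregress(x_split, y)
t_regression = fit_split.slope / fit_split.stderr

t_test = stats.ttest_ind(y[~low], y[low], equal_var=True)

print(f"regression: t = {t_regression:.4f}, p = {fit_split.pvalue:.4f}")
print(f"t-test:     t = {t_test.statistic:.4f}, p = {t_test.pvalue:.4f}")
```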
Double click on the graph to generate a new sample of 20 observations. Note that due to sampling variability, you may occasionally generate a sample for which doing the median split slightly increases the squared correlation. However, on average, performing the median split reduces the squared correlation to about 64% of what it otherwise would have been (see the references below).
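The roughly 64% figure can be illustrated with a small Monte Carlo sketch (again an illustration under the same assumptions, not the page's own code): for a normally distributed predictor, dichotomizing at the median multiplies the population squared correlation with the outcome by about 2/pi, roughly 0.64. A large sample per replication keeps the sample correlations close to their population values so the ratio is easy to see.

```python
# Monte Carlo check: dichotomizing a normal predictor at its median multiplies
# the squared correlation with the outcome by about 2/pi (~0.64).
def average_r2_ratio(n=10_000, rho=0.5, reps=200, seed=1):
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    ratios = []
    for _ in range(reps):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        split = np.where(x <= np.median(x), 0.0, 1.0)  # dummy-coded median split
        r2 = stats.pearsonr(x, y)[0] ** 2
        r2_split = stats.pearsonr(split, y)[0] ** 2
        ratios.append(r2_split / r2)
    return float(np.mean(ratios))

print(f"average r^2 ratio after the median split: {average_r2_ratio():.3f}")  # about 0.64
```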
It is interesting to observe the movement of points that are near each other and near the median. Moving the slider from left to right exaggerates the difference between those observations. At the same time, extreme observations are grouped together with observations near the median as the slider moves from left to right. Exaggerating the differences between observations that were originally close together while at the same time minimizing the differences between observations that were originally very far apart cannot possibly be a useful strategy for data analysis.
For further consideration of this example and other negative consequences of dichotomizing continuous variables, see: