Part II — Computer Simulations · R

Regression Artifacts — Regression to the Mean


Overview

This exercise demonstrates regression to the mean — a statistical phenomenon that occurs whenever a group is selected on the basis of an imperfectly measured variable. The simulation is designed to convince you that regression effects are real, predictable, and entirely the product of imperfect measurement. No program, no treatment, no change in true ability is needed to produce what can look like a meaningful gain or decline.

The key insight is this: when you select people who scored low on one test, you have inadvertently selected a group that contains an unusually large proportion of people who had a bad day (i.e., whose random error on that test was negative). On a second test, measured independently, those same people are unlikely to have equally bad days. So on average they appear to improve — not because anything happened to them, but because the extreme errors that put them in the low group are unlikely to repeat.
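This mechanism can be previewed in a few lines before running the full exercise. A minimal sketch (the names true, test1, test2, and low are illustrative and not part of the exercise script; the sample is raised to 10,000 so the averages are stable):

```r
set.seed(1)
n     <- 10000
true  <- rnorm(n, mean = 50, sd = 10)       # true ability
test1 <- true + rnorm(n, mean = 0, sd = 5)  # first measurement
test2 <- true + rnorm(n, mean = 0, sd = 5)  # independent second measurement

low <- test1 < 50                           # select low scorers on test 1

# Their test-1 errors are negative on average (bad luck helped select them)...
mean((test1 - true)[low])
# ...but their test-2 errors average near zero, so they appear to improve:
mean((test2 - true)[low])
mean(test2[low]) - mean(test1[low])         # positive "gain" with no treatment
```

The selected group's average error on the first test is clearly negative, while on the second test it hovers near zero, producing the apparent gain.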

Step 1 — Generate the Data

Create two test scores, X and Y, from the same true score T. The true score has mean 50 and standard deviation 10; the errors for each test have mean 0 and standard deviation 5.

library(psych)    # describe()
library(ggplot2)

set.seed(42)  # added for reproducibility; any seed will do

# Note: T is also R's abbreviation for TRUE. It works as a variable name
# here, but avoid relying on T as TRUE later in the same session.
T  <- rnorm(500, mean = 50, sd = 10)   # true scores
eX <- rnorm(500, mean = 0,  sd = 5)    # random error on test X
eY <- rnorm(500, mean = 0,  sd = 5)    # random error on test Y

SimData <- data.frame(T, eX, eY)
describe(SimData, fast = TRUE)

X <- T + eX
Y <- T + eY
XYdata  <- data.frame(X, Y)
AllData <- data.frame(T, eX, eY, X, Y)

Step 2 — Examine the Full Sample

Look at descriptive statistics and the bivariate distribution for the full sample of 500. Compute the correlation between X and Y — this value will be used later in the regression formula.

describe(XYdata)

ggplot(AllData, aes(x = X)) +
  geom_histogram(bins = 30, fill = "#0E7C6A", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of X", x = "X Score")

ggplot(AllData, aes(x = Y)) +
  geom_histogram(bins = 30, fill = "#0E7C6A", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Y", x = "Y Score")

r_xy <- cor(XYdata)
print(r_xy)

ggplot(XYdata, aes(x = X, y = Y)) +
  geom_point(alpha = 0.3, color = "#0E7C6A") +
  theme_minimal() +
  labs(title = "Bivariate Distribution: X and Y")

Step 3 — Select the Below-Average Group (X < 50)

Suppose we want to give a special remedial program to all students who scored below the mean of 50 on the X test. Select this group and compare their mean on X versus their mean on Y.

SelData <- subset(XYdata, X < 50)
SubData <- subset(AllData, X < 50)
describe(SelData, fast = TRUE)

The mean of Y for this group will be higher than their mean of X. It looks like these students improved between the two tests — even though no program was given and no time passed. This apparent gain is entirely a regression artifact.

To understand why, examine the error sign distributions. Among the full sample, negative and positive errors are approximately equally common. But in the group selected for having a low X score, negative X errors are overrepresented — many of those students scored low on X partly because of bad luck on test day.

hist(sign(AllData$eX), main = "Sign of eX — Full Sample (500)", xlab = "Sign")
hist(sign(AllData$eY), main = "Sign of eY — Full Sample (500)", xlab = "Sign")
hist(sign(SubData$eX), main = "Sign of eX — Low X Group (<50)", xlab = "Sign")
hist(sign(SubData$eY), main = "Sign of eY — Low X Group (<50)", xlab = "Sign")

Step 4 — Select the Above-Average Group (X > 50)

Now look at the change for students who scored above average on X. Their mean on Y will be lower — they appear to have regressed downward. If you didn't know the data were simulated, you might conclude that high scorers were harmed.

SelData <- subset(XYdata, X > 50)
describe(SelData, fast = TRUE)

Step 5 — Extreme Low Scorers (X < 40)

Select a very extreme low-scoring group (below 40) to see regression to the mean more dramatically. The more extreme the selection, the larger the apparent gain.

SelData <- subset(XYdata, X < 40)
SubData <- subset(AllData, X < 40)
describe(SelData, fast = TRUE)

hist(sign(SubData$eX), main = "Sign of eX — Extreme Low Group (<40)", xlab = "Sign")
hist(sign(SubData$eY), main = "Sign of eY — Extreme Low Group (<40)", xlab = "Sign")

The Regression-to-the-Mean Formula

We can predict the magnitude of the regression effect using the correlation between the two tests. The expected percent of regression toward the overall mean is:

# Percent regression = 100 * (1 - r),
# where r is the correlation between X and Y in the full sample.
#
# Example: if r = 0.80 and the selected group has mean X = 30 (overall mean = 50),
# then expected Y mean = 30 + 0.20 * (50 - 30) = 30 + 4 = 34.
# The group appears to gain 4 points purely due to regression.

r <- cor(AllData$X, AllData$Y)
pct_regression <- 100 * (1 - r)
cat("Correlation r =", round(r, 3), "\n")
cat("Expected percent regression =", round(pct_regression, 1), "%\n")
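As a check on the formula, the sketch below regenerates the data with a much larger sample (the variable names and the 50,000 sample size are my choices, not part of the exercise script) and compares the formula's predicted Y mean with the observed one for an extreme low group:

```r
set.seed(123)
n    <- 50000
true <- rnorm(n, mean = 50, sd = 10)
X    <- true + rnorm(n, mean = 0, sd = 5)
Y    <- true + rnorm(n, mean = 0, sd = 5)

r   <- cor(X, Y)   # theoretically 10^2 / (10^2 + 5^2) = 0.80
low <- X < 40      # extreme low-scoring group

# The group regresses (1 - r) of the way back toward the overall mean of 50:
predicted <- mean(X[low]) + (1 - r) * (50 - mean(X[low]))
observed  <- mean(Y[low])
c(r = r, predicted = predicted, observed = observed)  # predicted and observed agree
```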

Reflections & Variations

  1. Vary the correlation. Change the standard deviations of eX and eY to alter the correlation between X and Y. Very small error variances produce a high correlation and little regression; very large error variances produce a low correlation and large regression effects. Verify empirically that higher r means less regression to the mean, and vice versa.
  2. Perfect and zero correlation. To simulate a perfect correlation (r = 1), set X <- T and Y <- T (no error). To simulate zero correlation, set X <- eX and Y <- eY (no true score). What happens to regression to the mean in each case?
  3. Bidirectional regression. Regression occurs in both directions. Select an extreme group on the Y test and check their mean on X. Confirm that they also regress toward the mean — even though X was measured before Y.
  4. Applied interpretation. Imagine a school district that evaluates a program by selecting low posttest scorers and noting that they had scored higher on the pretest. They might conclude that their program harmed previously strong students. Use this simulation to show that the same pattern arises with no program at all.
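For variation 3, a self-contained sketch (again with illustrative names; it regenerates the data rather than reusing the objects above) that selects an extreme group on Y and checks the group's mean on X:

```r
set.seed(99)
n    <- 5000
true <- rnorm(n, mean = 50, sd = 10)
X    <- true + rnorm(n, mean = 0, sd = 5)  # measured "first"
Y    <- true + rnorm(n, mean = 0, sd = 5)  # measured "second"

lowY <- Y < 40                             # select extreme low scorers on Y
c(mean_Y = mean(Y[lowY]), mean_X = mean(X[lowY]))
# mean_X is closer to 50: the group regresses toward the mean on X,
# even though X was measured before Y.
```

Because regression to the mean is a consequence of selection on an imperfect measure, not of time order, the direction of measurement does not matter.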