Part II — Computer Simulations · R

The Regression-Discontinuity Design

Download the complete R script for this exercise to run it in RStudio.


Overview

The regression-discontinuity design (RDD) is a quasi-experimental design in which participants are assigned to groups solely on the basis of whether their pretest score falls above or below a predetermined cutoff value. This strict assignment rule creates a sharp discontinuity in the pre-post bivariate distribution at the cutoff — and that discontinuity is the estimate of the treatment effect.

Because assignment is deterministic given the cutoff, the RDD can yield a valid causal estimate under assumptions that are weaker than those required for the nonequivalent group design. The key analytic challenge is that the bivariate function between pretest and posttest must be correctly specified. This exercise demonstrates a systematic over-specification strategy: fit increasingly complex models and verify that the treatment effect estimate is stable.

We use the following notation for the design:

C  O  X  O    (program group — scores ≤ cutoff)
C  O     O    (comparison group — scores > cutoff)

where C indicates cutoff-based assignment and X the treatment.

Step 1 — Generate Data

library(psych)      # describeBy()
library(ggplot2)

set.seed(12345)                       # arbitrary seed so the simulated data are reproducible
T  <- rnorm(500, mean = 50, sd = 5)   # true scores
eX <- rnorm(500, mean = 0, sd = 5)    # pretest measurement error
eY <- rnorm(500, mean = 0, sd = 5)    # posttest measurement error
pretest <- T + eX

Step 2 — Assign Groups by Cutoff

In a compensatory program, participants who score at or below the cutoff receive the treatment. Here the cutoff is 50 (approximately the pretest mean). Participants with pretest ≤ 50 are in the program group (Z = 1).

cutoff <- 50
Z <- ifelse(pretest <= cutoff, 1, 0)   # 1 = program group, 0 = comparison group
table(Z)

Step 3 — Build the Posttest

The posttest is constructed from true score plus posttest error plus a 10-point treatment effect for program group members.

posttest <- T + eY + (10 * Z)
AllData <- data.frame(T, eX, eY, pretest, posttest, Z)
describeBy(AllData[, c("pretest", "posttest")], group = AllData$Z, fast = TRUE)

The program group should start with a lower pretest mean (selected from the lower half of the distribution) but, because of the 10-point effect, should end up with a posttest mean that is close to or exceeds the comparison group's posttest mean.
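Before fitting any models, it is instructive to see why a naive comparison of posttest means misleads in this design. The sketch below regenerates the Step 1–3 data (with an arbitrary seed, so the snippet runs on its own) and computes the raw difference in posttest means:

```r
# Regenerate the Step 1-3 data so this snippet is self-contained
# (seed chosen arbitrarily for reproducibility).
set.seed(1)
T  <- rnorm(500, mean = 50, sd = 5)
eX <- rnorm(500, mean = 0, sd = 5)
eY <- rnorm(500, mean = 0, sd = 5)
pretest  <- T + eX
Z        <- ifelse(pretest <= 50, 1, 0)
posttest <- T + eY + 10 * Z

# Naive estimate: raw difference in posttest means (program minus comparison)
naive <- mean(posttest[Z == 1]) - mean(posttest[Z == 0])
naive
```

Because the program group is selected from the lower half of the pretest distribution, the raw difference mixes the true 10-point effect with the groups' preexisting difference in true scores, and comes out well below 10. The regression models in Step 6 remove that selection difference.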

Step 4 — Visualize the Discontinuity

ggplot(AllData, aes(x = pretest, y = posttest,
                    colour = factor(Z), shape = factor(Z))) +
  geom_point(alpha = 0.5, size = 2) +
  geom_vline(xintercept = cutoff, linetype = "dashed", color = "gray40") +
  scale_colour_manual(values = c("0" = "#888888", "1" = "#0E7C6A"),
                      labels = c("Comparison", "Program")) +
  scale_shape_manual(values = c("0" = 1, "1" = 16),
                     labels = c("Comparison", "Program")) +
  theme_minimal() +
  labs(title = "Regression-Discontinuity Design",
       x = "Pretest", y = "Posttest", colour = "Group", shape = "Group")

You should see a visible jump at the cutoff of 50. The bivariate distribution "steps up" at that point for the program group — that step is the treatment effect.
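One way to make the jump easier to see is to overlay a separate linear fit on each side of the cutoff; the vertical gap between the two lines at the cutoff previews the treatment effect the regressions in Step 6 will estimate. A sketch (the data are regenerated with an arbitrary seed so the snippet stands alone):

```r
library(ggplot2)

# Regenerate the simulated data so the snippet is self-contained
# (arbitrary seed for reproducibility).
set.seed(1)
T        <- rnorm(500, mean = 50, sd = 5)
pretest  <- T + rnorm(500, 0, 5)
Z        <- ifelse(pretest <= 50, 1, 0)
posttest <- T + rnorm(500, 0, 5) + 10 * Z
AllData  <- data.frame(pretest, posttest, Z)

p <- ggplot(AllData, aes(x = pretest, y = posttest, colour = factor(Z))) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +   # one fitted line per group
  geom_vline(xintercept = 50, linetype = "dashed", colour = "gray40") +
  labs(colour = "Group")
p
```

Because `colour = factor(Z)` is a grouping aesthetic, `geom_smooth()` fits each group separately, so no extra subsetting is needed.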

Step 5 — Prepare Analysis Variables

Center the pretest at the cutoff. This ensures the regression estimates the program effect at the cutoff point (where the centered pretest equals zero), rather than at a raw pretest score of zero, a value far outside the observed range.

AllData$pre_cut <- AllData$pretest - cutoff   # centered pretest
AllData$I1   <- AllData$pre_cut * AllData$Z   # linear interaction
AllData$pre2 <- AllData$pre_cut^2             # quadratic term
AllData$I2   <- AllData$pre2 * AllData$Z      # quadratic interaction

Step 6 — Progressive Regression Analysis

Fit a series of regression models, adding higher-order terms at each step. In each model, the coefficient on Z is the estimate of the treatment effect. Because the data were generated with a linear, parallel pre-post relationship, every model here is correctly specified (if over-specified), so the estimate should stay close to 10 in all four steps.

Model 1 — linear, equal slopes (standard ANCOVA):

M1 <- lm(posttest ~ pre_cut + Z, data = AllData)
summary(M1)
cat("Model 1 treatment estimate:", coef(M1)["Z"], "\n\n")

Model 2 — linear, unequal slopes (add linear interaction I1):

M2 <- lm(posttest ~ pre_cut + Z + I1, data = AllData)
summary(M2)
cat("Model 2 treatment estimate:", coef(M2)["Z"], "\n\n")

Model 3 — quadratic, unequal linear slopes (add quadratic term pre2):

M3 <- lm(posttest ~ pre_cut + Z + I1 + pre2, data = AllData)
summary(M3)
cat("Model 3 treatment estimate:", coef(M3)["Z"], "\n\n")

Model 4 — quadratic, unequal slopes in both linear and quadratic (add quadratic interaction I2):

M4 <- lm(posttest ~ pre_cut + Z + I1 + pre2 + I2, data = AllData)
summary(M4)
cat("Model 4 treatment estimate:", coef(M4)["Z"], "\n\n")

As more terms are added, unnecessary coefficients should be near zero and the treatment effect estimate should remain stable near 10. The standard error will increase slightly with each added term (precision is lost when superfluous predictors are included), but the estimate stays unbiased.
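The stability check can be automated rather than read off four separate summaries. A sketch that refits the four models and tabulates the Z coefficient and its standard error (the data and analysis variables from Steps 1–5 are regenerated with an arbitrary seed so the snippet is self-contained):

```r
# Regenerate the simulated data and analysis variables (Steps 1-5);
# seed chosen arbitrarily for reproducibility.
set.seed(1)
T        <- rnorm(500, mean = 50, sd = 5)
pretest  <- T + rnorm(500, 0, 5)
Z        <- ifelse(pretest <= 50, 1, 0)
posttest <- T + rnorm(500, 0, 5) + 10 * Z
AllData  <- data.frame(pretest, posttest, Z)
AllData$pre_cut <- AllData$pretest - 50
AllData$I1   <- AllData$pre_cut * AllData$Z
AllData$pre2 <- AllData$pre_cut^2
AllData$I2   <- AllData$pre2 * AllData$Z

formulas <- list(
  M1 = posttest ~ pre_cut + Z,
  M2 = posttest ~ pre_cut + Z + I1,
  M3 = posttest ~ pre_cut + Z + I1 + pre2,
  M4 = posttest ~ pre_cut + Z + I1 + pre2 + I2
)

# One row per model: treatment estimate and its standard error
results <- t(sapply(formulas, function(f) {
  s <- summary(lm(f, data = AllData))$coefficients
  c(estimate = s["Z", "Estimate"], se = s["Z", "Std. Error"])
}))
round(results, 2)
```

Reading down the `estimate` column should show values hovering near 10, while the `se` column grows as superfluous terms are added.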

In a real study you would fit these models to assess whether a linear fit is adequate or whether curvature is needed. The stability of the treatment estimate across models is evidence that your specification is correct.
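Because Models 1 through 4 are nested, the "is curvature needed?" question can also be posed formally with incremental F-tests via `anova()`. A sketch (M1–M4 refit here on regenerated data, with an arbitrary seed, so the snippet runs on its own):

```r
# Regenerate the data and analysis variables (Steps 1-5); arbitrary seed.
set.seed(1)
T        <- rnorm(500, mean = 50, sd = 5)
pretest  <- T + rnorm(500, 0, 5)
Z        <- ifelse(pretest <= 50, 1, 0)
posttest <- T + rnorm(500, 0, 5) + 10 * Z
AllData  <- data.frame(pretest, posttest, Z)
AllData$pre_cut <- AllData$pretest - 50
AllData$I1   <- AllData$pre_cut * AllData$Z
AllData$pre2 <- AllData$pre_cut^2
AllData$I2   <- AllData$pre2 * AllData$Z

M1 <- lm(posttest ~ pre_cut + Z, data = AllData)
M2 <- lm(posttest ~ pre_cut + Z + I1, data = AllData)
M3 <- lm(posttest ~ pre_cut + Z + I1 + pre2, data = AllData)
M4 <- lm(posttest ~ pre_cut + Z + I1 + pre2 + I2, data = AllData)

# Incremental F-tests: does each added term significantly improve fit?
anova(M1, M2, M3, M4)
```

Since these data were generated with a linear, parallel relationship, the added interaction and quadratic terms should generally not produce significant F-tests, which is further evidence that the simplest model is adequate.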

Reflections & Variations

  1. Change pretest reliability. Vary the ratio of sd(T) to sd(eX). Try a perfectly reliable pretest (omit eX entirely: pretest <- T). How does pretest reliability affect the treatment effect estimate?
  2. Change posttest reliability. Vary sd(eY). How does posttest reliability affect the precision of the estimate?
  3. Elite program. Reverse the cutoff rule so that the highest pretest scorers receive the program (Z <- ifelse(pretest > cutoff, 1, 0)). In what real-world situations would high scorers receive a new program?
  4. Negative treatment effect. Use -10 * Z. How does the direction of the discontinuity change in the bivariate plot?
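As a starting point for Variation 1, here is a minimal sketch of the extreme case: a perfectly reliable pretest (`pretest <- T`, with `eX` omitted). The seed is arbitrary and the model centers the pretest inline rather than building a separate `pre_cut` column:

```r
# Variation 1 sketch: perfectly reliable pretest (no eX); arbitrary seed.
set.seed(1)
T  <- rnorm(500, mean = 50, sd = 5)
eY <- rnorm(500, mean = 0, sd = 5)
pretest  <- T                                  # pretest measures T without error
Z        <- ifelse(pretest <= 50, 1, 0)
posttest <- T + eY + 10 * Z

m <- lm(posttest ~ I(pretest - 50) + Z)        # ANCOVA, pretest centered at cutoff
coef(m)["Z"]
```

The estimate should again land close to 10; you can then reintroduce `eX` with varying standard deviations and watch how the pre-post slope and the estimate's precision respond.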