In this exercise we are going to create and analyze data for a
regression discontinuity design. Recall that in its simplest
form the design has a pretest, a posttest, and two groups, usually
a program and comparison group. The distinguishing feature of
the design is its procedure for assignment to groups -- persons
or units are assigned to one or the other group solely
on the basis of a cutoff score on the pre-program measure. Thus,
all persons having a pre-program score on one side of the cutoff
value are put into one group and all remaining persons are put
in the other. We can depict the design using the following notation:

C    O    X    O
C    O         O

where the C indicates that groups are assigned by a cutoff score,
the first O represents the pretest, the X depicts the administration
of some program or treatment and the second O signifies the posttest.
Notice that the top line represents the program group while the
second line indicates the comparison group.
In this simulation you will create data for a "compensatory"
program case. We assume that both the pretest and posttest are
fallible measures of ability where higher scores indicate generally
higher ability. We also assume that we want the program being
studied to be given to the low pretest scorers - those who are
low in pretest ability.
Get into MINITAB as you normally would. You should see the MINITAB
prompt MTB>. Now you are ready to enter the commands below.
The first step is to create two hypothetical tests, the pretest
and posttest. Before you can do this you need to create a measure
of true ability and separate error measures for each test:
MTB> Random 500 C1;
SUBC> Normal 50 5.0.
MTB> Random 500 C2;
SUBC> Normal 0 5.0.
MTB> Random 500 C3;
SUBC> Normal 0 5.0.
Now you can construct the pretest by adding true ability (C1)
to pretest error (C2):
MTB> Add C1 C2 C4
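If you prefer to follow these steps outside MINITAB, here is a rough
Python equivalent of the data-generation commands above (a sketch only,
assuming numpy is installed; the variable names simply mirror the
worksheet columns):

import numpy as np
rng = np.random.default_rng()        # no seed set, so your numbers will differ, as in MINITAB
true = rng.normal(50, 5.0, 500)      # C1: true ability
x_err = rng.normal(0, 5.0, 500)      # C2: pretest error
y_err = rng.normal(0, 5.0, 500)      # C3: posttest error
pretest = true + x_err               # C4: pretest = true ability + pretest error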
Before constructing the posttest it is useful to create the variable
that describes the two groups. The pretest mean will be about
50 and you will use 50 as the cutoff score in this simulation.
Because this is a compensatory case, we want all those who score
lower than or equal to 50 to be program cases, with all those
scoring above 50 to be in the comparison group. The following
two code statements will create a new dummy variable (C5) with
a value of 1 for program cases and 0 for comparison cases:
MTB> Code (0:50) 1 C4 C5
MTB> Code (50:100) 0 C5 C5
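For readers following the Python sketch, the same cutoff-based
assignment can be written in one line (an illustrative equivalent only,
not part of the MINITAB exercise):

group = (pretest <= 50).astype(int)  # C5: 1 = program (at or below cutoff), 0 = comparison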
To check on how many persons you have in each condition do:
MTB> Table C5
Notice that you probably don't have exactly 250 people in each
group (although in the long run, that is how many you would expect
if you divide a normal distribution at the mean). Now you are
ready to construct the posttest. We would like to simulate an
effective program so we will add in 10 points for all program
cases (recall that you accomplish this by multiplying 10 by the
dummy-coded treatment variable - for all program cases this product
is 10, for comparison cases, 0 -- this is then added into the
posttest):
MTB> Let C6 = C1 + C3 + (10*C5)
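In the Python sketch, the corresponding step would be:

posttest = true + y_err + 10 * group   # C6: posttest with a 10-point effect for program cases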
It is convenient to name the variables:
MTB> Name C1 = 'true' C2 = 'x error' C3 = 'y error' C4 = 'pretest' C5 = 'group' C6 = 'posttest'
To get some idea of what the data look like try:
MTB> Table C5;
SUBC> means C4 C6.
and don't forget to put the period at the end of the second line.
This command gives pre and post means for the two groups. Note
that the program group starts off at a distinct disadvantage -
we deliberately selected the lower scorers on the pre-program
measure. Notice also that the comparison group actually regresses
back toward the overall mean of 50 between the pretest and posttest.
This is to be expected because you selected both groups from
the extremes of the pretest distribution. Finally, notice that
the program group scores as well as or better than the comparison
group on the posttest. This is because of the sizable 10-point
program effect which you put in. You might examine pre and post
histograms, correlations, and the like. Now, look at the bivariate
distribution:
MTB> Plot C6 * C4;
SUBC> symbol C5.
You should be able to see that the bivariate distribution looks
like it "jumps" at the pretest value of 50 points.
This is the discontinuity that we expect in a regression-discontinuity
design when the program has an effect (note that if the program
has no effect we expect a bivariate distribution that is continuous
or does not jump).
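If you are working in the Python sketch rather than MINITAB, a
comparable plot can be drawn with matplotlib (assumed to be installed):

import matplotlib.pyplot as plt
plt.scatter(pretest, posttest, c=group)  # color-codes program vs. comparison cases
plt.axvline(50)                          # mark the cutoff
plt.xlabel('pretest')
plt.ylabel('posttest')
plt.show()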
At this point you have finished creating the data. The distribution
that you see might be what you would get if you conducted a real
study (although real data seldom behave this well). The first step
in the analysis is to examine the data and try to determine what
the "likely" pre-post function is.
We know that the true function here is linear (that the same
straight-line fit in both groups is appropriate) but with real
data it will often be difficult to tell by visual inspection alone
whether straight or curved lines are needed. Therefore even
though we might think that the most likely function or distribution
is linear, we will deliberately over-fit or over-specify this
likely function a bit to be on the safe side.
The first thing you need to do to set up the analysis is to create
a new variable that assures the program effect will be estimated
at the cutoff point. This new variable is simply the pretest minus the cutoff
score. You should see that this new variable will now be equal
to zero at the cutoff score and that all program cases will have
negative values on this score while the comparison group will
have positive ones. Since the regression program would automatically
estimate the vertical difference or "jump" between the
two groups at the intercept (i.e., where the pretest equals 0),
creating this new variable sets the cutoff at a pretest value of 0,
so the regression program will correctly estimate the jump at the
cutoff. Put this new variable in C7:
MTB> Let C7 = C4 - 50
MTB> Name C7 = 'pre-cut'
and name it appropriately. You will see that we always substitute
this variable for the pretest in the analyses.
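In the Python sketch, the corrected pretest is simply:

precut = pretest - 50                # C7: pretest minus the cutoff, zero at the cutoff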
Now you need to set up some additional variables that will enable
you to over-specify the "likely" true linear function:
MTB> Let C8 = C7 * C5
This new variable is simply the product of the corrected pretest
and the dummy assignment variable. Thus C8 will be equal to zero
for each comparison group case and equal to the corrected pretest
for each program case. When this variable is added into the analysis
we are in effect telling the regression program to see if there
is any interaction between the pretest (C7) and the program (C5).
This is equivalent to asking whether the linear slopes in the
two groups are equal or whether they are different (which implies
that the effect of the program differs depending on what pretest
score a person had). Now, construct quadratic (second-order)
terms:
MTB> Let C9 = C7 * C7
MTB> Let C10 = C9 * C5
For C9 you simply square the pretest. When this variable is entered
into the analysis we are in effect asking whether the bivariate
distribution looks curved in a quadratic pattern (consult an introductory
algebra book if you don't recall what a quadratic or squared function
looks like). The second variable, C10, allows the quadratic elements
in each group to differ, and therefore, can be considered a quadratic
interaction term. You should name the variables:
MTB> Name C8 = 'I1' C9 = 'pre2' C10 = 'I2'
where I1 stands for 'linear interaction', PRE2 for the 'squared
pretest', and I2 for the 'quadratic interaction'. You could continue
generating even higher-order terms and their interactions (cubic,
quartic, quintic, etc.) but these will suffice for this demonstration.
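For completeness, the Python-sketch equivalents of the three new
variables are:

i1 = precut * group                  # C8: linear interaction term
pre2 = precut ** 2                   # C9: squared (quadratic) pretest
i2 = pre2 * group                    # C10: quadratic interaction term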
You are now ready to begin the analysis. You will do this in a
series of regression steps, each time adding in higher-order terms.
Because of the length of the standard regression output, you
might want to request briefer output with the command
MTB> Brief 1
In the first step, you fit a model which assumes that the bivariate
distribution is best described by straight lines with the same
slopes in each group and a jump at the cutoff:
MTB> Regress C6 2 C7 C5
This is simply the standard Analysis of Covariance (ANCOVA) model.
The coefficient associated with the GROUP variable in the table
is the estimate of the program effect. Since you created the
data you know that this regression analysis exactly specifies
the true bivariate function - you created the data to have the
same slope in each group and to have a program effect of 10 points.
Is the estimate that you obtained near the true effect of ten?
You can construct a 95% confidence interval (using plus or minus
2 times the standard error of the coefficient for the GROUP
variable). Does the true effect of ten points fall within this
interval (it should for almost all of you)?
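If you are following along in the Python sketch, one way to fit the
same ANCOVA-type model and inspect the effect estimate is with
statsmodels (assumed to be installed; the 2-standard-error interval
below is the same rough 95% interval described above):

import numpy as np
import statsmodels.api as sm
X = sm.add_constant(np.column_stack([precut, group]))  # intercept, pre-cut, group
fit = sm.OLS(posttest, X).fit()
print(fit.summary())                 # the group coefficient estimates the program effect
coef, se = fit.params[2], fit.bse[2]
print(coef - 2 * se, coef + 2 * se)  # rough 95% confidence interval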
With real data we would not be sure that the model we fit in this
first step includes all the necessary terms. If we have left out
a necessary term (for instance, if there was in fact a linear
interaction) then it would be very likely that the estimate we
obtained in this first analysis would be biased (you will see
this in the next simulation). To be on the safe side, we will
add in a few more terms to the analysis in successive regression
steps. If we have already included all necessary terms (as in
the analysis above) then these additional terms should be superfluous.
They should not bias the estimate of program effect, but there
will be less precision. For the next step in the analysis you
will allow the slopes in the two groups to differ by adding in
I1, the linear interaction term:
MTB> Regress C6 3 C7 C5 C8
The coefficient for the GROUP variable is, as usual, the estimate
of program effect. We know that the new variable, C8, is unnecessary
because you set up the simulation so that the slopes in both groups
are the same. You should see that the coefficient for this I1 variable
is near zero and that a zero value almost surely falls within the
95% confidence interval of this coefficient. Because this term is
unnecessary, you should still have an unbiased estimate of the program
effect. Is the coefficient for GROUP near the true value of 10 points?
Does the value of 10 fall within the 95% confidence interval of the
coefficient? You should also note that the estimate of the program
effect is less precise in this analysis than in the previous one -
the standard error of the coefficient for the GROUP variable should
be larger in this case than in the previous run. Now, add in the
quadratic term:
MTB> Regress C6 4 C7 C5 C8 C9
Again, you should see that the coefficients for the superfluous
terms (I1 and PRE2) are near zero. Similarly, the estimate of
program effect should still be unbiased and near a value of ten.
This time the standard error of the GROUP coefficient will be
a little larger than last time - again indicating that there is
some loss of precision as higher-order terms are added in. Finally,
you will allow the quadratic terms to differ between groups by
adding in the quadratic interaction term, I2:
MTB> Regress C6 5 C7 C5 C8 C9 C10
By now you should be able to see the pattern across analyses.
Unnecessary terms will have coefficients near zero. The program
effect estimate should still be near ten, but the 95% confidence
interval will be slightly wider indicating that there is a loss
of precision as you add in more terms.
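In the Python sketch, the whole sequence of increasingly over-specified
models can be run in a short loop (same statsmodels setup as before),
printing the GROUP coefficient and its standard error at each step so
you can watch the estimate stay near 10 while the standard error grows:

import numpy as np
import statsmodels.api as sm
terms = [precut, group, i1, pre2, i2]
for k in range(2, 6):                # models with 2, 3, 4, then 5 predictors
    X = sm.add_constant(np.column_stack(terms[:k]))
    fit = sm.OLS(posttest, X).fit()
    print(k, fit.params[2], fit.bse[2])   # group coefficient and its standard error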
In an analysis of real data, you would by now be more convinced
that your initial guess that the bivariate distribution was linear
was a sensible one. You might decide to continue fitting higher
order terms or you might stop with the quadratic terms. This whole
procedure may strike you as somewhat wasteful. If we think the
correct function is linear, why not just fit that? The procedure
outlined here is a conservative one. It is designed to minimize
the chances of obtaining a biased estimate of program effect by
increasing your chances of overspecifying the true function.
At this point you should stop and consider the steps that were involved in conducting this analysis. You might want to try one or more of the following variations: