In this exercise we are going to create and analyze data for a
regression discontinuity design. Recall that in its simplest
form the design has a pretest, a posttest, and two groups, usually
a program and comparison group. The distinguishing feature of
the design is its procedure for assignment to groups -- persons
or units are assigned to one or the other group solely
on the basis of a cutoff score on the pre-program measure. Thus,
all persons having a pre-program score on one side of the cutoff
value are put into one group and all remaining persons are put
in the other. We can depict the design using the following notation:

C    O    X    O
C    O         O

where the C indicates that groups are assigned by a cutoff score,
the first O represents the pretest, the X depicts the administration
of some program or treatment and the second O signifies the posttest.
Notice that the top line represents the program group while the
second line indicates the comparison group.
In this simulation you will create data for a "compensatory"
program case. We assume that both the pretest and posttest are
fallible measures of ability where higher scores indicate generally
higher ability. We also assume that we want the program being
studied to be given to the low pretest scorers - those who are
low in pretest ability.
Get into MINITAB as you normally would. You should see the MINITAB
prompt MTB>. Now you are ready to enter the commands below.
The first step is to create two hypothetical tests, the pretest
and posttest. Before you can do this you need to create a measure
of true ability and separate error measures for each test:
MTB> Random 500 C1;
SUBC> Normal 50 5.0.
MTB> Random 500 C2;
SUBC> Normal 0 5.0.
MTB> Random 500 C3;
SUBC> Normal 0 5.0.
Now you can construct the pretest by adding true ability (C1)
to pretest error (C2):
MTB> Add C1 C2 C4
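If you prefer to follow these steps outside MINITAB, here is a rough
Python equivalent of the data-generation commands above (a sketch only,
assuming numpy is installed; the variable names simply mirror the
worksheet columns):

import numpy as np
rng = np.random.default_rng()        # no seed set, so your numbers will differ, as in MINITAB
true = rng.normal(50, 5.0, 500)      # C1: true ability
x_err = rng.normal(0, 5.0, 500)      # C2: pretest error
y_err = rng.normal(0, 5.0, 500)      # C3: posttest error
pretest = true + x_err               # C4: pretest = true ability + pretest error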
Before constructing the posttest it is useful to create the variable
that describes the two groups. The pretest mean will be about
50 and you will use 50 as the cutoff score in this simulation.
Because this is a compensatory case, we want all those who score
lower than or equal to 50 to be program cases, with all those
scoring above 50 to be in the comparison group. The following
two code statements will create a new dummy variable (C5) with
a value of 1 for program cases and 0 for comparison cases:
MTB> Code (0:50) 1 C4 C5
MTB> Code (50:100) 0 C5 C5
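For readers following the Python sketch, the same cutoff-based
assignment can be written in one line (an illustrative equivalent only,
not part of the MINITAB exercise):

group = (pretest <= 50).astype(int)  # C5: 1 = program (at or below cutoff), 0 = comparison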
To check on how many persons you have in each condition do:
MTB> Table C5
Notice that you probably don't have exactly 250 people in each
group (although in the long run, that is how many you would expect
if you divide a normal distribution at the mean). Now you are
ready to construct the posttest. We would like to simulate an
effective program so we will add in 10 points for all program
cases (recall that you accomplish this by multiplying 10 by the
dummy-coded treatment variable - for all program cases this product
is 10, for comparison cases, 0 -- this is then added into the
posttest):
MTB> Let C6 = C1 + C3 + (10*C5)
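In the Python sketch, the corresponding step would be:

posttest = true + y_err + 10 * group   # C6: posttest with a 10-point effect for program cases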
It is convenient to name the variables:
MTB> Name C1 = 'true' C2 = 'x error' C3 = 'y error' C4 = 'pretest' C5 = 'group' C6 = 'posttest'
To get some idea of what the data look like try:
MTB> Table C5;
SUBC> means C4 C6.
and don't forget to put the period at the end of the second line.
This command gives pre and post means for the two groups. Note
that the program group starts off at a distinct disadvantage -
we deliberately selected the lower scorers on the pre-program
measure. Notice also that the comparison group actually regresses
back toward the overall mean of 50 between the pretest and posttest.
This is to be expected because you selected both groups from
the extremes of the pretest distribution. Finally, notice that
the program group scores as well as or better than the comparison
group on the posttest. This is because of the sizable 10-point
program effect which you put in. You might examine pre and post
histograms, correlations, and the like. Now, look at the bivariate
distribution:
MTB> Plot C6 * C4;
SUBC> symbol C5.
You should be able to see that the bivariate distribution looks
like it "jumps" at the pretest value of 50 points.
This is the discontinuity that we expect in a regression-discontinuity
design when the program has an effect (note that if the program
has no effect we expect a bivariate distribution that is continuous
or does not jump).
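If you are working in the Python sketch rather than MINITAB, a
comparable plot can be drawn with matplotlib (assumed to be installed):

import matplotlib.pyplot as plt
plt.scatter(pretest, posttest, c=group)  # color-codes program vs. comparison cases
plt.axvline(50)                          # mark the cutoff
plt.xlabel('pretest')
plt.ylabel('posttest')
plt.show()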
At this point you have finished creating the data. The distribution
that you see might be what you would get if you conducted a real
study (although real data seldom behave this well). The first step
in the analysis is to examine the data and try to determine what
the "likely" pre-post function is.
We know that the true function here is linear (that the same
straight-line fit in both groups is appropriate) but with real
data it will often be difficult to tell by visual inspection alone
whether straight or curved lines are needed. Therefore even
though we might think that the most likely function or distribution
is linear, we will deliberately over-fit or over-specify this
likely function a bit to be on the safe side.
The first thing you need to do to set up the analysis is to create
a new variable that assures the program effect will be estimated
at the cutoff point. This new variable is simply the pretest minus the cutoff
score. You should see that this new variable will now be equal
to zero at the cutoff score and that all program cases will have
negative values on this score while the comparison group will
have positive ones. Since the regression program would automatically
estimate the vertical difference or "jump" between the
two groups at the intercept (i.e., where the pretest equals 0),
creating this new variable sets the cutoff at a pretest value of 0,
so the regression program will correctly estimate the jump at the
cutoff. Put this new variable in C7:
MTB> Let C7 = C4 - 50
MTB> Name C7 = 'pre-cut'
and name it appropriately. You will see that we always substitute
this variable for the pretest in the analyses.
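In the Python sketch, the corrected pretest is simply:

precut = pretest - 50                # C7: pretest minus the cutoff, zero at the cutoff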
Now you need to set up some additional variables that will enable
you to over-specify the "likely" true linear function:
MTB> Let C8 = C7 * C5
This new variable is simply the product of the corrected pretest
and the dummy assignment variable. Thus C8 will be equal to zero
for each comparison group case and equal to the corrected pretest
for each program case. When this variable is added into the analysis
we are in effect telling the regression program to see if there
is any interaction between the pretest (C7) and the program (C5).
This is equivalent to asking whether the linear slopes in the
two groups are equal or whether they are different (which implies
that the effect of the program differs depending on what pretest
score a person had). Now, construct quadratic (second-order)
terms:
MTB> Let C9 = C7 * C7
MTB> Let C10 = C9 * C5
For C9 you simply square the pretest. When this variable is entered
into the analysis we are in effect asking whether the bivariate
distribution looks curved in a quadratic pattern (consult an introductory
algebra book if you don't recall what a quadratic or squared function
looks like). The second variable, C10, allows the quadratic elements
in each group to differ, and therefore, can be considered a quadratic
interaction term. You should name the variables:
MTB> Name C8 = 'I1' C9 = 'pre2' C10 = 'I2'
where I1 stands for 'linear interaction', PRE2 for the 'squared
pretest', and I2 for the 'quadratic interaction'. You could continue
generating even higher-order terms and their interactions (cubic,
quartic, quintic, etc.) but these will suffice for this demonstration.
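For completeness, the Python-sketch equivalents of the three new
variables are:

i1 = precut * group                  # C8: linear interaction term
pre2 = precut ** 2                   # C9: squared (quadratic) pretest
i2 = pre2 * group                    # C10: quadratic interaction term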
You are now ready to begin the analysis. You will do this in a
series of regression steps, each time adding in higher-order terms.
Because of the length of the standard regression output, you
might want to request briefer output with the command
MTB> Brief 1
In the first step, you fit a model which assumes that the bivariate
distribution is best described by straight lines with the same
slopes in each group and a jump at the cutoff:
MTB> Regress C6 2 C7 C5
This is simply the standard Analysis of Covariance (ANCOVA) model.
The coefficient associated with the GROUP variable in the table
is the estimate of the program effect. Since you created the
data you know that this regression analysis exactly specifies
the true bivariate function - you created the data to have the
same slope in each group and to have a program effect of 10 points.
Is the estimate that you obtained near the true effect of ten?
You can construct a 95% confidence interval (using plus or minus
2 times the standard error of the coefficient for the GROUP
variable). Does the true effect of ten points fall within this
interval (it should for almost all of you)?
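If you are following along in the Python sketch, one way to fit the
same ANCOVA-type model and inspect the effect estimate is with
statsmodels (assumed to be installed; the 2-standard-error interval
below is the same rough 95% interval described above):

import numpy as np
import statsmodels.api as sm
X = sm.add_constant(np.column_stack([precut, group]))  # intercept, pre-cut, group
fit = sm.OLS(posttest, X).fit()
print(fit.summary())                 # the group coefficient estimates the program effect
coef, se = fit.params[2], fit.bse[2]
print(coef - 2 * se, coef + 2 * se)  # rough 95% confidence interval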
With real data we would not be sure that the model we fit in this
first step includes all the necessary terms. If we have left out
a necessary term (for instance, if there was in fact a linear
interaction) then it would be very likely that the estimate we
obtained in this first analysis would be biased (you will see
this in the next simulation). To be on the safe side, we will
add in a few more terms to the analysis in successive regression
steps. If we have already included all necessary terms (as in
the analysis above) then these additional terms should be superfluous.
They should not bias the estimate of program effect, but there
will be less precision. For the next step in the analysis you
will allow the slopes in the two groups to differ by adding in
I1, the linear interaction term:
MTB> Regress C6 3 C7 C5 C8
The coefficient for the GROUP variable is, as usual, the estimate
of program effect. We know that the new variable, C8, is unnecessary
because you set up the simulation so that the slopes in both groups
are the same. You should see that the coefficient for this I1 variable
is near zero and that a zero value almost surely falls within the
95% confidence interval of this coefficient. Because this term is
unnecessary, you should still have an unbiased estimate of the program
effect. Is the coefficient for GROUP near the true value of 10 points?
Does the value of 10 fall within the 95% confidence interval of the
coefficient? You should also note that the estimate of the program
effect is less precise in this analysis than in the previous one -
the standard error of the coefficient for the GROUP variable should
be larger in this case than in the previous run. Now, add in the
quadratic term:
MTB> Regress C6 4 C7 C5 C8 C9
Again, you should see that the coefficients for the superfluous
terms (I1 and PRE2) are near zero. Similarly, the estimate of
program effect should still be unbiased and near a value of ten.
This time the standard error of the GROUP coefficient will be
a little larger than last time - again indicating that there is
some loss of precision as higher-order terms are added in. Finally,
you will allow the quadratic terms to differ between groups by
adding in the quadratic interaction term, I2:
MTB> Regress C6 5 C7 C5 C8 C9 C10
By now you should be able to see the pattern across analyses.
Unnecessary terms will have coefficients near zero. The program
effect estimate should still be near ten, but the 95% confidence
interval will be slightly wider indicating that there is a loss
of precision as you add in more terms.
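In the Python sketch, the whole sequence of increasingly over-specified
models can be run in a short loop (same statsmodels setup as before),
printing the GROUP coefficient and its standard error at each step so
you can watch the estimate stay near 10 while the standard error grows:

import numpy as np
import statsmodels.api as sm
terms = [precut, group, i1, pre2, i2]
for k in range(2, 6):                # models with 2, 3, 4, then 5 predictors
    X = sm.add_constant(np.column_stack(terms[:k]))
    fit = sm.OLS(posttest, X).fit()
    print(k, fit.params[2], fit.bse[2])   # group coefficient and its standard error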
In an analysis of real data, you would by now be more convinced
that your initial guess that the bivariate distribution was linear
was a sensible one. You might decide to continue fitting higher
order terms or you might stop with the quadratic terms. This whole
procedure may strike you as somewhat wasteful. If we think the
correct function is linear, why not just fit that? The procedure
outlined here is a conservative one. It is designed to minimize
the chances of obtaining a biased estimate of program effect by
increasing your chances of overspecifying the true function.
At this point you should stop and consider the steps that were involved in conducting this analysis. You might want to try one or more of the following variations: