In this exercise you are going to look at how regression to the
mean operates. It is designed to convince you that regression
effects do occur, when they can be expected to occur, and to hint
at why they occur.
To begin, get into MINITAB as usual. You should see the MINITAB
prompt (which looks like this: MTB>). Now you are ready to
enter the following commands. You will begin by creating two
variables similar to the ones in the Generating Data exercise:
MTB> Random 500 C1;
SUBC> Normal 50 10.
MTB> Random 500 C2;
SUBC> Normal 0 5.
MTB> Random 500 C3;
SUBC> Normal 0 5.
You have created three random variables. The first will have a
mean or average of 50 and a standard deviation of 10. This will
be a measure of the true ability of people on some characteristic.
You also created two random error scores having a mean of zero
(remember that we always assume that errors have zero mean) and
a standard deviation half that of the true score. These errors
will be used to construct two separate "tests":
MTB> Add C1 C2 C4.
MTB> Add C1 C3 C5.
Thus, you have created two tests of some ability. The tests are
related or correlated because they share the same true score (i.e.,
they measure the same true ability). Name the five variables you
have created so far:
MTB> Name C1='true' C2='x error' C3='y error' C4='X' C5='Y'
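If you would like to experiment outside MINITAB, the whole construction so far can be sketched in a few lines of Python with NumPy. This is a rough equivalent, not part of the exercise; the seed and variable names are our own choices:

```python
import numpy as np

# Our own simulation of the MINITAB session so far (a sketch, not
# MINITAB output); the seed is arbitrary.
rng = np.random.default_rng(42)

true = rng.normal(50, 10, 500)   # C1: true ability, mean 50, sd 10
x_err = rng.normal(0, 5, 500)    # C2: x error, mean 0, sd 5
y_err = rng.normal(0, 5, 500)    # C3: y error, mean 0, sd 5

x = true + x_err                 # C4: the X test
y = true + y_err                 # C5: the Y test
```

The means of x and y should land near 50, just as DESCRIBE will show for C4 and C5 below.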
Check on the distributions:
MTB> Describe C1-C5
You should find that the mean of the TRUE variable is near 50,
the means of the two error variables are near 0, and the means
of X and Y are near 50. Now look at the distributions of the two
tests to see if they appear to be normal in shape:
MTB> Histogram C4
MTB> Histogram C5
The graphs should look like bell-shaped curves. You should also look at the correlation between x and y:
MTB> Correlation C4 C5
and at the bivariate distribution:
MTB> Plot C5 * C4;
SUBC> Symbol 'x'.
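Incidentally, you can work out what this correlation ought to be. Both tests share the true-score variance (10 squared = 100) and each adds an independent error variance (5 squared = 25), so the expected correlation is 100/125 = .80. Here is a quick Python check using our own simulated data (a sketch, not MINITAB output):

```python
import numpy as np

rng = np.random.default_rng(1)
true = rng.normal(50, 10, 500)
x = true + rng.normal(0, 5, 500)  # the X test
y = true + rng.normal(0, 5, 500)  # the Y test

# Expected r = var(true) / (var(true) + var(error)) = 100 / 125 = 0.80
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))  # should land close to 0.80
```

With 500 cases the sample correlation rarely strays far from .80, which matches what the CORRELATION command should report.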
Up to now, this is pretty much what you did in the first exercise.
Now let's look at regression to the mean. Remember that we said
that regression will occur whenever we select a group asymmetrically
from a distribution. Let's say that the X test is a measure of
math ability and that we would like to give a special math program
to all those children who score below average on this test. We
would then like to select a group of all those scoring below the
mean of 50:
MTB> Copy C4 C5 C1-C3 C6-C10;
SUBC> use C4 = 0:50.
Be sure to type these commands exactly as written. In words,
what you have done is to "COPY all those with scores between
0 and 50 on variable C4 (the X test) and to also COPY the scores
in C5 (the Y test) and C1 through C3 for those cases; and put
the copied cases in C6 through C10 respectively." Now, name
the new columns:
MTB> Name C6='New x' C7='New y' C8='New t' C9='New xe' C10='New ye'
Notice that the order of the variables has been rearranged. The
X test score for the group you chose is in C6 and the Y test score
is in C7.
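In Python terms, the COPY/USE step is just boolean selection on the X scores. A sketch, continuing our own simulated data from before (the names are ours, not MINITAB's):

```python
import numpy as np

rng = np.random.default_rng(2)
true = rng.normal(50, 10, 500)
x_err = rng.normal(0, 5, 500)
y_err = rng.normal(0, 5, 500)
x, y = true + x_err, true + y_err

# Keep only cases scoring below 50 on the X test, carrying along
# their Y scores (the COPY ... USE step).
low = x < 50
new_x, new_y = x[low], y[low]

print(round(new_x.mean(), 1), round(new_y.mean(), 1))
```

When you run this, the selected group's Y mean should sit a little closer to 50 than its X mean; that gap is the regression effect discussed next.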
Now, assume that you never gave your math program (your school
district lost all funding for special training - sometimes simulations
can be realistic!) but that you measured your selected group later
on a similar math test, the Y test. Look at the before-and-after
means for your selected group:
MTB> Describe C6 C7
It looks like your group improved slightly between the two tests!
Does this mean that your program might not have been necessary?
Is this improvement merely normal maturation in math ability?
Of course not! All you are witnessing is regression to the mean.
You selected a group that consisted of low-scorers on the basis
of the X test. On any other measure which is imperfectly related
to the X test this group will appear to score better simply because
of regression to the mean. Look at the distributions of the selected
group:
MTB> Histogram C6
MTB> Histogram C7
Notice that the distribution of the X test for this group looks
like one half of a bell-shaped curve. It should because you selected
the persons scoring on the lower half of the X test. But the Y
test distribution is not as clearly cut as the X test. Why not?
Persons have the same true score on both tests (remember that
you added in the same true score to the X and Y test). It must
have something to do with the errors then. Obviously, since every
person has the same true score on both tests, and since persons
appeared on average to score a little higher on the Y test than
on the X test, we should expect to see more negative errors on
the X test (i.e., x errors) than on the Y test. Let's see if that
is true. You can use the SIGN command to see how many negative,
zero, and positive values a variable has:
MTB> Sign C9
MTB> Sign C10
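The SIGN counts can be mimicked by tallying signs directly. In this sketch (our own simulation again, not the MINITAB session) the x errors of the selected group lean negative, while the y errors, which played no part in the selection, stay roughly balanced:

```python
import numpy as np

rng = np.random.default_rng(3)
true = rng.normal(50, 10, 500)
x_err = rng.normal(0, 5, 500)
y_err = rng.normal(0, 5, 500)
low = (true + x_err) < 50         # selection uses the X test only

# Tally negative vs positive errors within the selected group
neg_x, pos_x = int(np.sum(x_err[low] < 0)), int(np.sum(x_err[low] > 0))
neg_y, pos_y = int(np.sum(y_err[low] < 0)), int(np.sum(y_err[low] > 0))
print("x errors:", neg_x, "negative,", pos_x, "positive")
print("y errors:", neg_y, "negative,", pos_y, "positive")
```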
There should be more negative values among the x errors than among
the y errors. Are there? At this
point you should stop to reflect on what you've done. You created
two tests that measure the same ability. The tests are imperfectly
measured (i.e., they have error). You then selected an asymmetrical
sample on the basis of scores on one of the tests - you selected
the "low" scorers on the X test. Even though you didn't
do anything to this group, when you measured them again on the
Y test you found that they appear to improve slightly. If you
worked in a school district, people might begin to question your
suggestion that those children needed special training. Let's
say that you decided to show them that the apparent gain in this
group really isn't accurate. You decide that you will look at
the change between the X and Y test for those children who score
above the mean:
MTB> COPY C4 C5 C1-C3 C6-C10;
SUBC> use C4 = 50:100.
Now you look at the means for this above average group on the
X and Y tests:
MTB> Describe C6 C7
What happened? It appears that the above average group lost ground
between the X and Y tests. Now your critics are really convinced
(or should we say confused?). They argue that the low scorers
improved just fine without your special math program but that
the high scorers were the ones who lost ground. Maybe, they say,
you should be giving your program to the high scorers to help
prevent further decline.
What is going on here? What you have witnessed is a statistical
phenomenon called regression to the mean. It occurs because we
have imperfect measurement. Recall that a child's true ability
in math is indicated by the true score. The tests partially show
us the true score but they also have error in them. For any given
child the error could work for or against them - they could have
a good day (i.e., a positive error) or a bad one (i.e., negative
error). If they have a bad day, their test score will be below
their true ability. If they have a good day, the test score will
be above their true ability (i.e., they got lucky or guessed well).
When you selected a group that was below the overall average for
the entire population of 500 children, you chose a lot of children
who really did have below average true ability - but you also
chose a number of children who scored low because they had a bad
day (i.e., had a negative x error). When you tested them again,
their true ability hadn't changed, but the odds that they would
have as bad a day as the first time were much lower. Thus, it
was likely that on the next test (i.e., the Y test) the group
would do better on the average.
It is possible that sometimes when you do the above simulation,
the results will not turn out as we have said. This is because
you have selected a group that is really not too extreme. Let's
really stack the deck now and see a clear cut regression artifact.
We will choose a really extreme low scoring group:
MTB> COPY C4 C5 C1-C3 C6-C10;
SUBC> use C4 = 0:40.
We have selected all those who scored 40 or below on the X test.
Look at the distributions:
MTB> Histogram C6
MTB> Histogram C7
Here, the X test distribution looks like it was sharply cut, the
Y test looks much less so. Now, look at the means:
MTB> Describe C6 C7
Here we have clear regression to the mean. The selected group
scored much higher on the Y test than on the X test.
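How much a group regresses depends on how extreme the selection is. The sketch below (our own Python simulation, not MINITAB) compares the below-50 and below-40 selections side by side:

```python
import numpy as np

rng = np.random.default_rng(4)
true = rng.normal(50, 10, 500)
x = true + rng.normal(0, 5, 500)
y = true + rng.normal(0, 5, 500)

for label, cut in [("below 50", 50), ("below 40", 40)]:
    sel = x < cut
    gain = y[sel].mean() - x[sel].mean()
    print(f"{label}: X mean {x[sel].mean():.1f}, "
          f"Y mean {y[sel].mean():.1f}, apparent gain {gain:.1f}")
```

The more extreme group starts with a lower X mean and typically shows the larger apparent gain, exactly the pattern you just saw.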
To get some idea of why this occurred look at the positive and
negative values of the errors for the two tests:
MTB> Sign C9
MTB> Sign C10
There should be far more negative values on the x error than on the y error (new xe and new ye).
We can predict how much regression to the mean will occur in any
case. To do so we need to know the correlation between the two
measures for the entire population (the first correlation you
calculated above) and the means for each measure for the entire
population (in this case, both means will be near 50). The percent
of regression to the mean is simply 100(1 - r), where r is the correlation.
For example, assume that the correlation between the X and Y tests
for all 500 cases is .80 and that the means are both equal to
50. Further, assume you select a low scoring group from the X
test and when you look at the mean for this group you find that
it equals 30 points. By the formula you would expect this mean
to regress 100(1 - .8) or 20% of the way back towards the mean
on the other measure. The mean that we would expect on the other
measure (the Y test) in the absence of regression is the same
as for the X test -- 30 points. But with regression we expect
the actual mean to be 20% closer to the overall Y test mean of
50. Twenty percent of the distance between 30 and 50 would put
the Y test mean at 34 and it would appear that the group gained
four points, but this gain would instead be due to regression.
Now, assume the same situation, but this time assume the correlation
between the X and Y test is .50. Here we would expect that there
would be 100(1 - .5) = 50% regression to the mean, and if the group
which was selected had a 30 on the X test we would expect them
to get a 40 on the Y test just because of the regression phenomenon.
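The prediction rule behind these two examples is easy to capture as a small function. This is a sketch of the arithmetic only; `expected_mean` is our own name, not a MINITAB command:

```python
def expected_mean(selected_mean, population_mean, r):
    """Mean expected on the other measure after regression to the mean.

    The group regresses (1 - r) of the way from its mean on the
    selection measure back toward the population mean.
    """
    return selected_mean + (1 - r) * (population_mean - selected_mean)

# The two worked examples from the text:
print(round(expected_mean(30, 50, 0.80), 1))  # 34.0 (20% of the way to 50)
print(round(expected_mean(30, 50, 0.50), 1))  # 40.0 (50% of the way to 50)
```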
There are several variations on this basic simulation that you
might want to try: