Part II — Computer Simulations · R

Regression Artifacts — Regression to the Mean


Overview

This exercise demonstrates regression to the mean — a statistical phenomenon that occurs whenever a group is selected on the basis of an imperfectly measured variable. The simulation is designed to convince you that regression effects are real, predictable, and entirely the product of imperfect measurement. No program, no treatment, no change in true ability is needed to produce what can look like a meaningful gain or decline.

The key insight is this: when you select people who scored low on one test, you have inadvertently selected a group that contains an unusually large proportion of people who had a bad day (i.e., whose random error on that test was negative). On a second test, measured independently, those same people are unlikely to have equally bad days. So on average they appear to improve — not because anything happened to them, but because the extreme errors that put them in the low group are unlikely to repeat.
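This mechanism can be previewed in a few lines before running the full exercise. A minimal sketch (the names true, test1, test2, and low are illustrative and not part of the exercise script; the sample is raised to 10,000 so the averages are stable):

```r
set.seed(1)
n     <- 10000
true  <- rnorm(n, mean = 50, sd = 10)       # true ability
test1 <- true + rnorm(n, mean = 0, sd = 5)  # first measurement
test2 <- true + rnorm(n, mean = 0, sd = 5)  # independent second measurement

low <- test1 < 50                           # select low scorers on test 1

# Their test-1 errors are negative on average (bad luck helped select them)...
mean((test1 - true)[low])
# ...but their test-2 errors average near zero, so they appear to improve:
mean((test2 - true)[low])
mean(test2[low]) - mean(test1[low])         # positive "gain" with no treatment
```

The selected group's average error on the first test is clearly negative, while on the second test it hovers near zero, producing the apparent gain.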

Step 1 — Generate the Data

Create two test scores, X and Y, from the same true score T. The true score has mean 50 and standard deviation 10; the errors for each test have mean 0 and standard deviation 5.

library(psych)    # describe()
library(ggplot2)

set.seed(42)  # added for reproducibility; any seed will do

# Note: T is also R's abbreviation for TRUE. It works as a variable name
# here, but avoid relying on T as TRUE later in the same session.
T  <- rnorm(500, mean = 50, sd = 10)   # true scores
eX <- rnorm(500, mean = 0,  sd = 5)    # random error on test X
eY <- rnorm(500, mean = 0,  sd = 5)    # random error on test Y

SimData <- data.frame(T, eX, eY)
describe(SimData, fast = TRUE)

X <- T + eX
Y <- T + eY
XYdata  <- data.frame(X, Y)
AllData <- data.frame(T, eX, eY, X, Y)

Step 2 — Examine the Full Sample

Look at descriptive statistics and the bivariate distribution for the full sample of 500. Compute the correlation between X and Y — this value will be used later in the regression formula.

describe(XYdata)

ggplot(AllData, aes(x = X)) +
  geom_histogram(bins = 30, fill = "#0E7C6A", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of X", x = "X Score")

ggplot(AllData, aes(x = Y)) +
  geom_histogram(bins = 30, fill = "#0E7C6A", color = "white") +
  theme_minimal() +
  labs(title = "Distribution of Y", x = "Y Score")

r_xy <- cor(XYdata)
print(r_xy)

ggplot(XYdata, aes(x = X, y = Y)) +
  geom_point(alpha = 0.3, color = "#0E7C6A") +
  theme_minimal() +
  labs(title = "Bivariate Distribution: X and Y")

Step 3 — Select the Below-Average Group (X < 50)

Suppose we want to give a special remedial program to all students who scored below the mean of 50 on the X test. Select this group and compare their mean on X versus their mean on Y.

SelData <- subset(XYdata, X < 50)
SubData <- subset(AllData, X < 50)
describe(SelData, fast = TRUE)

The mean of Y for this group will be higher than their mean of X. It looks like these students improved between the two tests — even though no program was given and no time passed. This apparent gain is entirely a regression artifact.

To understand why, examine the error sign distributions. Among the full sample, negative and positive errors are approximately equally common. But in the group selected for having a low X score, negative X errors are overrepresented — many of those students scored low on X partly because of bad luck on test day.

hist(sign(AllData$eX), main = "Sign of eX — Full Sample (500)", xlab = "Sign")
hist(sign(AllData$eY), main = "Sign of eY — Full Sample (500)", xlab = "Sign")
hist(sign(SubData$eX), main = "Sign of eX — Low X Group (<50)", xlab = "Sign")
hist(sign(SubData$eY), main = "Sign of eY — Low X Group (<50)", xlab = "Sign")

Step 4 — Select the Above-Average Group (X > 50)

Now look at the change for students who scored above average on X. Their mean on Y will be lower — they appear to have regressed downward. If you didn't know the data were simulated, you might conclude that high scorers were harmed.

SelData <- subset(XYdata, X > 50)
describe(SelData, fast = TRUE)

Step 5 — Extreme Low Scorers (X < 40)

Select a very extreme low-scoring group (below 40) to see regression to the mean more dramatically. The more extreme the selection, the larger the apparent gain.

SelData <- subset(XYdata, X < 40)
SubData <- subset(AllData, X < 40)
describe(SelData, fast = TRUE)

hist(sign(SubData$eX), main = "Sign of eX — Extreme Low Group (<40)", xlab = "Sign")
hist(sign(SubData$eY), main = "Sign of eY — Extreme Low Group (<40)", xlab = "Sign")

The Regression-to-the-Mean Formula

We can predict the magnitude of the regression effect using the correlation between the two tests. The expected percent of regression toward the overall mean is:

# Percent regression = 100 * (1 - r),
# where r is the correlation between X and Y in the full sample.
#
# Example: if r = 0.80 and the selected group has mean X = 30 (overall mean = 50),
# then expected Y mean = 30 + 0.20 * (50 - 30) = 30 + 4 = 34.
# The group appears to gain 4 points purely due to regression.

r <- cor(AllData$X, AllData$Y)
pct_regression <- 100 * (1 - r)
cat("Correlation r =", round(r, 3), "\n")
cat("Expected percent regression =", round(pct_regression, 1), "%\n")
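As a check on the formula, the sketch below regenerates the data with a much larger sample (the variable names and the 50,000 sample size are my choices, not part of the exercise script) and compares the formula's predicted Y mean with the observed one for an extreme low group:

```r
set.seed(123)
n    <- 50000
true <- rnorm(n, mean = 50, sd = 10)
X    <- true + rnorm(n, mean = 0, sd = 5)
Y    <- true + rnorm(n, mean = 0, sd = 5)

r   <- cor(X, Y)   # theoretically 10^2 / (10^2 + 5^2) = 0.80
low <- X < 40      # extreme low-scoring group

# The group regresses (1 - r) of the way back toward the overall mean of 50:
predicted <- mean(X[low]) + (1 - r) * (50 - mean(X[low]))
observed  <- mean(Y[low])
c(r = r, predicted = predicted, observed = observed)  # predicted and observed agree
```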

Reflections & Variations

  1. Vary the correlation. Change the standard deviations of eX and eY to alter the correlation between X and Y. Very small error variances produce a high correlation and little regression; very large error variances produce a low correlation and large regression effects. Verify empirically that higher r means less regression to the mean, and vice versa.
  2. Perfect and zero correlation. To simulate a perfect correlation (r = 1), set X <- T and Y <- T (no error). To simulate zero correlation, set X <- eX and Y <- eY (no true score). What happens to regression to the mean in each case?
  3. Bidirectional regression. Regression occurs in both directions. Select an extreme group on the Y test and check their mean on X. Confirm that they also regress toward the mean — even though X was measured before Y.
  4. Applied interpretation. Imagine a school district that evaluates a program by selecting low posttest scorers and noting that they had scored higher on the pretest. They might conclude that their program harmed previously strong students. Use this simulation to show that the same pattern arises with no program at all.
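For variation 3, a self-contained sketch (again with illustrative names; it regenerates the data rather than reusing the objects above) that selects an extreme group on Y and checks the group's mean on X:

```r
set.seed(99)
n    <- 5000
true <- rnorm(n, mean = 50, sd = 10)
X    <- true + rnorm(n, mean = 0, sd = 5)  # measured "first"
Y    <- true + rnorm(n, mean = 0, sd = 5)  # measured "second"

lowY <- Y < 40                             # select extreme low scorers on Y
c(mean_Y = mean(Y[lowY]), mean_X = mean(X[lowY]))
# mean_X is closer to 50: the group regresses toward the mean on X,
# even though X was measured before Y.
```

Because regression to the mean is a consequence of selection on an imperfect measure, not of time order, the direction of measurement does not matter.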