Part II — Computer Simulations · R

The Sampling Distribution Simulation

Download the complete R script for this exercise to run it in RStudio.


Overview

This simulation builds an empirical picture of one of the most important concepts in statistics — the sampling distribution of the mean. The sampling distribution is the theoretical distribution you would obtain if you drew infinitely many samples from a population and plotted their means. In practice you can never draw infinitely many samples, but a computer simulation can draw hundreds and show you what the sampling distribution looks like.

Understanding the sampling distribution is the key to understanding statistical inference. Confidence intervals, p-values, and hypothesis tests all depend on knowing the shape and spread of the sampling distribution. This exercise makes those abstract properties visible.

Step 1 — Create a Population

Generate a large population of 10,000 observations from a normal distribution with mean 50 and standard deviation 10. This is your "true" population — you know its parameters exactly.

# Note: T is also R's shorthand for TRUE; the script works, but a name like pop is safer
T <- rnorm(10000, mean = 50, sd = 10)
mean(T)
sd(T)
summary(T)
hist(T, main = "Population Distribution", xlab = "Score", col = "steelblue")

The histogram should be a nearly perfect bell curve. Confirm that the mean is close to 50 and the standard deviation close to 10.

Step 2 — Draw One Sample

Draw a random sample of 100 observations from the population. In practice, this is all a researcher ever gets. Compute the sample mean and compare it to the true population mean of 50.

X <- sample(T, 100)
mean(X)
sd(X)
summary(X)
hist(X, main = "One Sample (n = 100)", xlab = "Score", col = "steelblue")

Your sample mean will be close to 50 but not exactly 50. This gap is called sampling error — it arises not from mistakes but simply from the randomness of sampling.
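To see sampling error directly, you can draw several samples before moving on. The quick sketch below regenerates the Step 1 population and draws five samples of n = 100; each mean lands near 50, but no two draws agree exactly (the seed and the name demo_means are our additions, not part of the exercise script):

```r
set.seed(42)                            # our addition, for reproducible draws
T <- rnorm(10000, mean = 50, sd = 10)   # population, as in Step 1

# Five independent samples of n = 100: each mean is near 50, none exactly 50
demo_means <- replicate(5, mean(sample(T, 100)))
round(demo_means, 3)
```

Every value in demo_means differs from 50 by a small random amount; that scatter is sampling error, and its typical size is exactly what the standard error in Step 3 measures.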

Step 3 — Standard Error

The standard error of the mean (SE) is the standard deviation of the sampling distribution. It quantifies how much sample means vary from sample to sample. Compute it from this single sample using the formula SE = sd / √n, and also compute the z-score for your sample mean.

SEx <- sd(X) / sqrt(length(X))         # SE = sd / sqrt(n)
zvalue <- (mean(X) - 50) / SEx         # how many SEs the sample mean is from 50
cat("Sample mean:", round(mean(X), 3), "\n")
cat("Standard error:", round(SEx, 3), "\n")
cat("z-value:", round(zvalue, 3), "\n")

Step 4 — Build the Sampling Distribution

Now repeat the process 500 times. Each iteration draws a fresh sample of 100 from the population and records its mean. The resulting vector of 500 means is an empirical approximation of the sampling distribution.

T <- rnorm(10000, mean = 50, sd = 10)  # fresh population
sample_means <- rep(NA, 500)
for (i in 1:500) {
  sample_means[i] <- mean(sample(T, 100))
}
mean(sample_means)
sd(sample_means)
cat("Mean of sampling distribution:", round(mean(sample_means), 3), "\n")
cat("SD of sampling distribution (= SE):", round(sd(sample_means), 3), "\n")
cat("Theoretical SE = 10/sqrt(100) =", 10 / sqrt(100), "\n")

The mean of the sampling distribution should be very close to 50 (the population mean). The standard deviation of the sampling distribution should be close to 1.0 — which is the theoretical standard error: 10 / √100 = 1.0.

Step 5 — Plot the Sampling Distribution

Compare the histogram of the population to the histogram of the 500 sample means. The sampling distribution is much narrower and more symmetric — even if the population were not perfectly normal, the sampling distribution of the mean approaches normality as sample size increases (the Central Limit Theorem).

hist(T, main = "Population Distribution (N = 10,000)",
     xlab = "Score", col = "steelblue")
hist(sample_means, main = "Sampling Distribution of the Mean (500 samples, n = 100)",
     xlab = "Sample Means", col = "darkred")

# Overlay theoretical normal curve
hist(sample_means, main = "Sampling Distribution with Normal Curve",
     xlab = "Sample Means", prob = TRUE, col = "darkred")
curve(dnorm(x, mean = mean(sample_means), sd = sd(sample_means)),
      col = "blue", lwd = 3, add = TRUE)
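The Central Limit Theorem claim above can be checked directly with a population that is not normal at all. The sketch below is our own illustration, not part of the main exercise: it uses a strongly right-skewed exponential population with mean 50 (the names pop_skew and skew_means and the choice of distribution are ours), yet the 500 sample means still pile up in a near-symmetric bell around the population mean:

```r
set.seed(1)                              # our addition, for reproducibility
pop_skew <- rexp(10000, rate = 1/50)     # right-skewed population, mean ~ 50

# Sampling distribution of the mean from a skewed population
skew_means <- replicate(500, mean(sample(pop_skew, 100)))

hist(pop_skew, main = "Skewed Population", xlab = "Score", col = "steelblue")
hist(skew_means, main = "Sampling Distribution of the Mean (n = 100)",
     xlab = "Sample Means", col = "darkred")

cat("Population mean:", round(mean(pop_skew), 3), "\n")
cat("Mean of sample means:", round(mean(skew_means), 3), "\n")
```

Even though the population histogram has a long right tail, the histogram of means looks approximately normal, which is the Central Limit Theorem at work with n = 100.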

Reflections & Variations

  1. Change sample size. Repeat the simulation with samples of n = 25 and n = 400. How does the width of the sampling distribution change? Confirm that sd(sample_means) is approximately 10 / sqrt(n) in each case.
  2. Change the population. Try generating a skewed or uniform population instead of a normal one. How many samples are needed before the sampling distribution of the mean looks approximately normal?
  3. Number of iterations. Replace 500 with 50 or 5,000 iterations. How does the number of replications affect the appearance of the sampling distribution?
  4. Confidence interval interpretation. For each of your 500 sample means, compute a 95% CI as mean ± 1.96 * SEx. Count what fraction of those intervals actually contain the true mean of 50. It should be close to 95%.
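Variation 4 can be sketched as follows. This is one possible implementation under our own choices (the seed and the names n_reps and covered are ours); the per-sample standard error follows the formula from Step 3, and 1.96 is the usual 95% normal critical value:

```r
set.seed(7)                             # our addition, for reproducibility
T <- rnorm(10000, mean = 50, sd = 10)   # population, as in Step 1

n_reps <- 500
n <- 100
covered <- logical(n_reps)

for (i in 1:n_reps) {
  X   <- sample(T, n)
  SEx <- sd(X) / sqrt(n)            # per-sample standard error (Step 3)
  lo  <- mean(X) - 1.96 * SEx       # 95% CI bounds
  hi  <- mean(X) + 1.96 * SEx
  covered[i] <- (lo <= 50) & (50 <= hi)
}

cat("Coverage:", mean(covered), "\n")   # expect a value near 0.95
```

Note that each interval uses its own sample's SE, so the intervals vary in width as well as location; the long-run fraction that captures 50 is what the 95% label promises.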