Week 3, Session 1 — Populations, samples, and the central limit theorem

Course 1

Author

R. Heller

Note

Inference labs use the five-step template: Hypothesis → Visualise → Assumptions → Conduct → Conclude.

Learning objectives

Distinguish population parameters from sample estimates and their sampling distributions.
Decompose an estimator's mean squared error into bias squared and variance.
Demonstrate the central limit theorem by simulation on a strongly skewed population.

Prerequisites

Weeks 1 and 2.

Background

A population is the set of units we wish to learn about. A sample is the set of units we actually measure. The population has a fixed parameter (a mean μ, a proportion π, and so on), and the sample yields an estimate (x̄, p̂) that is a random variable because the sample was drawn at random. Every estimate has a sampling distribution: the distribution of its value across hypothetical repetitions of the sampling process. Standard errors, confidence intervals, and p-values are all derived from sampling distributions.
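The idea of "hypothetical repetitions" can be made concrete with a few lines of base R. This is a minimal sketch for a sample proportion; the values of true_pi, n_units and n_reps are illustrative assumptions, not part of the lab, and the seed is set only so that the sketch is reproducible on its own.

```r
# Minimal sketch: sampling distribution of a sample proportion p-hat.
# true_pi, n_units and n_reps are illustrative values (assumptions).
set.seed(1)
true_pi <- 0.3    # population parameter: fixed, and normally unknown
n_units <- 50     # units measured in each sample
n_reps  <- 5000   # hypothetical repetitions of the sampling process

# Each repetition draws one sample of n_units and returns one estimate.
p_hats <- rbinom(n_reps, size = n_units, prob = true_pi) / n_units

# The parameter never moves; the estimate varies from sample to sample.
mean(p_hats)                              # centre of the sampling distribution
sd(p_hats)                                # simulated standard error
sqrt(true_pi * (1 - true_pi) / n_units)   # theoretical standard error
```

The last two values should roughly agree: the standard error is nothing more than the standard deviation of the sampling distribution.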

Error decomposes into bias and variance. The bias of an estimator is the systematic offset of its sampling distribution’s mean from the parameter. The variance is the spread of the sampling distribution. Mean squared error — bias squared plus variance — is the natural scalar summary of how wrong an estimator is on average. A good estimator is not one with zero bias; it is one with low MSE.
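The decomposition follows from adding and subtracting the estimator's expected value inside the square. The notation is an assumption made here for illustration: \hat\theta is the estimator and \theta the parameter.

```latex
\begin{aligned}
\operatorname{MSE}(\hat\theta)
  &= \mathbb{E}\big[(\hat\theta - \theta)^2\big] \\
  &= \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big]
     + 2\,\big(\mathbb{E}[\hat\theta] - \theta\big)\,
       \mathbb{E}\big[\hat\theta - \mathbb{E}[\hat\theta]\big]
     + \big(\mathbb{E}[\hat\theta] - \theta\big)^2 \\
  &= \operatorname{Var}(\hat\theta) + \operatorname{Bias}(\hat\theta)^2
\end{aligned}
```

The cross term vanishes because \mathbb{E}[\hat\theta - \mathbb{E}[\hat\theta]] = 0, which leaves variance plus squared bias.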

The central limit theorem is a large part of the reason the normal distribution is ubiquitous. However skewed the population, provided the draws are independent and the variance is finite, the sampling distribution of the standardised sample mean approaches the normal as n grows. The lab makes that convergence visible.
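For reference, the classical (Lindeberg–Lévy) form of the theorem can be written as follows. The symbols X_i, \mu and \sigma are introduced here for the statement and match the population mean and SD used later in the lab.

```latex
\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}
  \;\xrightarrow{\;d\;}\; \mathcal{N}(0, 1)
  \quad \text{as } n \to \infty,
\qquad
X_1, \dots, X_n \ \text{iid},\quad
\mathbb{E}[X_i] = \mu,\quad
0 < \operatorname{Var}(X_i) = \sigma^2 < \infty .
```

The statement concerns the standardised mean, not the individual X_i, and it is a limit: it says nothing by itself about how large n must be.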

Setup

library(tidyverse)
set.seed(42)
theme_set(theme_minimal(base_size = 12))

1. Hypothesis

Claim: the sampling distribution of the sample mean from an exponential (strongly right-skewed) population approaches a normal distribution as n increases, with shrinking standard error.

2. Visualise

Draw from an exponential population and show the distribution of individual observations.

# For an exponential distribution, the mean and the SD both equal 1 / rate.
rate <- 1
pop_mean <- 1 / rate
pop_sd   <- 1 / rate

pop_sample <- tibble(x = rexp(2000, rate = rate))

pop_sample |>
  ggplot(aes(x)) +
  geom_histogram(bins = 40, fill = "grey60", colour = "white") +
  geom_vline(xintercept = pop_mean, linetype = 2) +
  labs(x = "x", y = "Count")

3. Assumptions

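The CLT assumes iid sampling and finite variance. The exponential distribution qualifies: rexp() produces independent draws, and the variance is 1/rate². The speed of convergence depends largely on how skewed the parent is: the more skewed, the larger the n needed before the sampling distribution of the mean looks normal.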
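One way to put a number on "how skewed" is to track the skewness of the simulated sample means. The sketch below relies on a standard result, stated here as an assumption rather than derived: the mean of n exponential draws is a scaled gamma variable with shape n, whose skewness is 2/√n. The helpers skewness() and mean_skewness() are hypothetical names written for this illustration. Run it in a fresh session so that its set.seed() call does not disturb the lab's random-number stream.

```r
# Sketch: how quickly does the skewness of the sample mean decay?
# Assumption: for an exponential parent the skewness of the mean of n
# draws is 2 / sqrt(n) (skewness of a gamma with shape n).
set.seed(1)

# Moment-based sample skewness (hypothetical helper).
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Skewness of B simulated sample means of size n (hypothetical helper).
mean_skewness <- function(n, B = 20000, rate = 1) {
  xbars <- replicate(B, mean(rexp(n, rate = rate)))
  skewness(xbars)
}

ns <- c(2, 5, 30, 100)
skew_check <- data.frame(
  n         = ns,
  simulated = sapply(ns, mean_skewness),
  theory    = 2 / sqrt(ns)
)
skew_check
```

A skewness of 0 corresponds to a symmetric distribution, so the simulated column shrinking towards 0 is the numerical counterpart of the histograms becoming symmetric.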

# Bias-variance sanity check on the sample mean as an estimator of the
# population mean, across sample sizes.
summarise_means <- function(n, B = 2000) {
  xbars <- replicate(B, mean(rexp(n, rate = rate)))
  tibble(n = n,
         bias = mean(xbars) - pop_mean,
         variance = var(xbars),
         mse = mean((xbars - pop_mean)^2),
         se = sd(xbars))
}
bv <- map_dfr(c(5, 10, 30, 100, 500), summarise_means)
bv

# A tibble: 5 × 5
      n      bias variance     mse     se
  <dbl>     <dbl>    <dbl>   <dbl>  <dbl>
1     5 -0.000703  0.200   0.200   0.447 
2    10 -0.00189   0.100   0.100   0.317 
3    30 -0.00528   0.0331  0.0331  0.182 
4   100 -0.000430  0.0101  0.0101  0.100 
5   500  0.000726  0.00198 0.00198 0.0445

4. Conduct

Plot the sampling distribution of the mean across growing n.

sim_means <- function(n, B = 4000) {
  replicate(B, mean(rexp(n, rate = rate)))
}

means_df <- bind_rows(
  tibble(n = "n = 2",   mean = sim_means(2)),
  tibble(n = "n = 5",   mean = sim_means(5)),
  tibble(n = "n = 30",  mean = sim_means(30)),
  tibble(n = "n = 100", mean = sim_means(100))
) |>
  mutate(n = factor(n, levels = c("n = 2", "n = 5", "n = 30", "n = 100")))

means_df |>
  ggplot(aes(mean)) +
  geom_histogram(bins = 40, fill = "grey60", colour = "white") +
  facet_wrap(~ n, scales = "free") +
  geom_vline(xintercept = pop_mean, linetype = 2) +
  labs(x = "Sample mean", y = "Count")

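The histograms narrow as n grows. Because the facets use free scales, read the spread off each panel's x-axis rather than from the width of the plotted shape. At n = 2 the sampling distribution is strongly right-skewed; by n = 30 it is close to symmetric.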

# Check SE shrinking like pop_sd / sqrt(n).
bv |> mutate(expected_se = pop_sd / sqrt(n),
             ratio = se / expected_se)

# A tibble: 5 × 7
      n      bias variance     mse     se expected_se ratio
  <dbl>     <dbl>    <dbl>   <dbl>  <dbl>       <dbl> <dbl>
1     5 -0.000703  0.200   0.200   0.447       0.447  0.999
2    10 -0.00189   0.100   0.100   0.317       0.316  1.00 
3    30 -0.00528   0.0331  0.0331  0.182       0.183  0.996
4   100 -0.000430  0.0101  0.0101  0.100       0.1    1.00 
5   500  0.000726  0.00198 0.00198 0.0445      0.0447 0.996
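The expected_se column uses the textbook formula for the standard error of the mean. A short derivation, assuming iid draws X_1, …, X_n with variance \sigma^2 (notation introduced here):

```latex
\operatorname{Var}(\bar{X}_n)
  = \operatorname{Var}\!\Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big)
  = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i)
  = \frac{\sigma^2}{n},
\qquad
\operatorname{SE}(\bar{X}_n) = \frac{\sigma}{\sqrt{n}} .
```

Independence is what lets the variance of the sum split into a sum of variances. With \sigma = 1 here, the expected SE is simply 1/√n.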

5. Concluding statement

Sampling from an exponential population with mean 1 and SD 1, the sample mean showed no detectable bias at any n in the bias table (|bias| < 0.006 for n = 5 to 500). Its standard error shrank as 1/√n, with the ratio of observed to expected SE between 0.996 and 1.00 throughout. Its sampling distribution was visibly skewed at n = 2 but, judged by eye, approximately normal by n = 30. For this parent, the normal approximation to the sampling distribution of the mean is already usable at modest sample sizes.

“n ≥ 30” is a rule of thumb, not a theorem. For very skewed parents, a larger n is needed; for symmetric parents, a smaller n may suffice. Always check by simulation when in doubt.
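A simulation check of this kind can be sketched in base R as follows. The helper tail_check() and its arguments are hypothetical names written for this illustration. It standardises each simulated mean with the known population mean and SD, then reports how often the result falls beyond the normal 2.5% and 97.5% cutoffs. The lognormal parent is an assumed example of a more skewed population, not one used in the lab. Run it in a fresh session so that its set.seed() call does not disturb the lab's random-number stream.

```r
# Sketch: check the normal approximation for a sample mean by simulation.
# tail_check() is a hypothetical helper, not part of the lab.
set.seed(1)

tail_check <- function(sampler, mu, sigma, n, B = 10000) {
  # Standardise each simulated mean using the known population values.
  z <- replicate(B, (mean(sampler(n)) - mu) / (sigma / sqrt(n)))
  c(n     = n,
    lower = mean(z < qnorm(0.025)),   # nominal value 0.025
    upper = mean(z > qnorm(0.975)))   # nominal value 0.025
}

ns <- c(5, 30, 100)

# Exponential parent from the lab (mean 1, SD 1).
expo_tails <- t(sapply(ns, function(n) {
  tail_check(function(k) rexp(k, rate = 1), mu = 1, sigma = 1, n = n)
}))

# A more skewed parent (assumed example): lognormal, meanlog 0, sdlog 1.
ln_mu    <- exp(0.5)
ln_sigma <- sqrt((exp(1) - 1) * exp(1))
lognorm_tails <- t(sapply(ns, function(n) {
  tail_check(function(k) rlnorm(k, meanlog = 0, sdlog = 1),
             mu = ln_mu, sigma = ln_sigma, n = n)
}))

expo_tails
lognorm_tails
```

If both tail proportions sit near 0.025, the normal approximation is doing its job at that n. A lower tail well under 0.025 paired with an upper tail well over it signals that right skew has not yet washed out.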

The 2x2 facet is the single most compelling picture of the CLT; make sure to let the audience see it large, not crowded into a side slide.

Common pitfalls

Believing the CLT applies to the data, not to the mean of the data. Individual observations stay whatever they were.
Invoking the CLT to justify normality of a small skewed sample. The CLT is about the sampling distribution of the mean, and it is a large-n result.
Confusing standard error (a property of the sampling distribution) with standard deviation (a property of the data).
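The difference between the last two quantities is easy to see numerically. In this minimal sketch (sd_vs_se() is a hypothetical helper, and the sample sizes are illustrative), the typical within-sample SD stays near the population SD of 1 however large n gets, while the standard error of the mean keeps shrinking. Run it in a fresh session so that its set.seed() call does not disturb the lab's random-number stream.

```r
# Sketch: standard deviation of the data vs standard error of the mean.
# sd_vs_se() is a hypothetical helper, not part of the lab.
set.seed(1)

sd_vs_se <- function(n, B = 2000, rate = 1) {
  samples <- replicate(B, rexp(n, rate = rate))   # n x B matrix, one sample per column
  c(n          = n,
    typical_sd = mean(apply(samples, 2, sd)),     # spread of individual observations
    se_of_mean = sd(colMeans(samples)))           # spread of the sample means
}

sd_se <- t(sapply(c(10, 100, 1000), sd_vs_se))
sd_se
```

More data gives a better estimate of the population SD, not a smaller one; only the uncertainty about the mean shrinks.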

Session info

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.5 forcats_1.0.1   stringr_1.6.0   dplyr_1.2.1    
 [5] purrr_1.2.2     readr_2.2.0     tidyr_1.3.2     tibble_3.3.1   
 [9] ggplot2_4.0.3   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.4.1     tidyselect_1.2.1  
 [5] scales_1.4.0       yaml_2.3.12        fastmap_1.2.0      R6_2.6.1          
 [9] labeling_0.4.3     generics_0.1.4     knitr_1.51         htmlwidgets_1.6.4 
[13] pillar_1.11.1      RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.2.0       
[17] stringi_1.8.7      xfun_0.57          S7_0.2.2           otel_0.2.0        
[21] timechange_0.4.0   cli_3.6.6          withr_3.0.2        magrittr_2.0.5    
[25] digest_0.6.39      grid_4.4.1         hms_1.1.4          lifecycle_1.0.5   
[29] vctrs_0.7.3        evaluate_1.0.5     glue_1.8.1         farver_2.1.2      
[33] rmarkdown_2.31     tools_4.4.1        pkgconfig_2.0.3    htmltools_0.5.9
