Course 2
Note
Inference labs use the five-step template: Hypothesis → Visualise → Assumptions → Conduct → Conclude.
Sessions 1 and 2 of this week.
Multiple regression is the standard way to estimate the effect of one predictor while holding others constant. The word "holding" is doing a lot of work: the adjusted coefficient is the effect conditional on the other variables in the model, not the effect one would observe in a new study that intervened on them. Confounding, mediation, and collider bias all turn on which covariates are in the model. Statistics alone cannot say which covariates belong there; the subject-matter science must.
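To make the point about covariate choice concrete, here is a minimal sketch of collider bias in base R. It is an illustration under stated assumptions, not part of the lab's own analysis: the variable names (x, y, c_var), the seed, and the sample size are all hypothetical. By construction x has no effect on y, yet adjusting for their common effect manufactures an association.

```r
# Hypothetical simulation: x has no effect on y, and both cause c_var (a collider)
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rnorm(n)              # independent of x by construction
c_var <- x + y + rnorm(n)  # common effect of x and y

# Coefficient on x without and with adjustment for the collider
b_marginal <- coef(lm(y ~ x))["x"]          # close to zero
b_collider <- coef(lm(y ~ x + c_var))["x"]  # pulled away from zero (negative here)

c(marginal = unname(b_marginal), adjusted_for_collider = unname(b_collider))
```

Both models are fitted correctly; only the causal structure tells us that the second one answers the wrong question.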
Interactions complicate this picture. When two predictors interact, the effect of one depends on the level of the other, and the main-effect coefficient becomes the effect when the interacting variable is zero. Centring continuous predictors before fitting an interaction restores a readable interpretation: the main effect becomes the effect at the sample mean of the other variable, which is usually what the reader wants.
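The effect of centring on the main-effect coefficient can be checked numerically. The sketch below uses hypothetical simulated data (the names x, z, y, the seed, and the coefficients are assumptions made for illustration). Because the centred model is only a reparametrisation of the uncentred one, the centred main effect should equal the uncentred main effect plus the interaction coefficient times the mean of z.

```r
# Hypothetical data: binary x, continuous z, with an x-by-z interaction
set.seed(1)
n <- 200
x <- rbinom(n, 1, 0.5)
z <- rnorm(n, 50, 10)
y <- 10 + 2 * x + 0.5 * z + 0.1 * x * z + rnorm(n)

fit_raw <- lm(y ~ x * z)    # coefficient on x = effect of x at z = 0
z_c <- z - mean(z)
fit_ctr <- lm(y ~ x * z_c)  # coefficient on x = effect of x at z = mean(z)

# Compare the two main effects
coef(fit_raw)["x"]
coef(fit_ctr)["x"]

# Identity linking the two parametrisations
identity_holds <- all.equal(
  unname(coef(fit_ctr)["x"]),
  unname(coef(fit_raw)["x"] + coef(fit_raw)["x:z"] * mean(z))
)
identity_holds
```

The interaction coefficient itself is the same in both fits; only the reference point for the main effect moves.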
Collinearity and scale are often discussed together. Centring does not reduce real collinearity between X and Z, but it does reduce the algebraic collinearity between X and X·Z that otherwise makes the coefficients hard to interpret and inflates their standard errors.
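A short sketch of both halves of that claim, again on hypothetical simulated data (names, seed, and coefficients are assumptions). The correlation between the two predictors is untouched by centring, the correlation between a predictor and the product term drops sharply, and the standard error of the main effect shrinks while that of the interaction stays the same.

```r
# Hypothetical data: x and z are genuinely correlated, with means far from zero
set.seed(2)
n <- 1000
x <- rnorm(n, 10, 2)
z <- 0.5 * x + rnorm(n, 20, 3)
y <- 1 + 0.5 * x + 0.3 * z + 0.1 * x * z + rnorm(n)

x_c <- x - mean(x)
z_c <- z - mean(z)

# Real collinearity between x and z: identical before and after centring
c(raw = cor(x, z), centred = cor(x_c, z_c))

# Algebraic collinearity between x and the product term
r_raw <- cor(x, x * z)
r_ctr <- cor(x_c, x_c * z_c)
c(raw = r_raw, centred = r_ctr)

# Standard errors under each parametrisation
se_raw <- summary(lm(y ~ x * z))$coefficients[, "Std. Error"]
se_ctr <- summary(lm(y ~ x_c * z_c))$coefficients[, "Std. Error"]
se_raw
se_ctr
```

The fitted values are identical in the two models; centring changes what the lower-order coefficients mean, not how well the model fits.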
Simulate a scenario in which a confounder inflates the crude estimate of the effect of interest, then add an interaction to show effect modification.
library(tidyverse)
library(broom)

n <- 300
age <- rnorm(n, 55, 10)
# smoking is more common among older people in this simulation (confounding)
smoker <- rbinom(n, 1, plogis(-3 + 0.05 * age))
# outcome depends on both; the smoker effect is stronger at older ages (interaction)
y <- 120 + 0.6 * age + 5 * smoker + 0.2 * smoker * (age - mean(age)) +
  rnorm(n, 0, 8)
dat <- tibble(age, smoker = factor(smoker, labels = c("no", "yes")), y)
ggplot(dat, aes(age, y, colour = smoker)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Age (years)", y = "Outcome", colour = "Smoker")
Both lines climb with age; the smoker line sits higher and may slope more steeply.
The usual linear-model assumptions apply (linearity, independent errors, constant variance, approximately normal residuals), plus the implicit assumption that all relevant confounders are in the model.
Crude smoker effect, then adjusted for age, then with interaction:
crude <- lm(y ~ smoker, data = dat)
adjusted <- lm(y ~ smoker + age, data = dat)
inter <- lm(y ~ smoker * age, data = dat)
bind_rows(
  tidy(crude, conf.int = TRUE) |> mutate(model = "crude"),
  tidy(adjusted, conf.int = TRUE) |> mutate(model = "adjusted"),
  tidy(inter, conf.int = TRUE) |> mutate(model = "interaction")
) |>
  filter(term != "(Intercept)") |>
  select(model, term, estimate, conf.low, conf.high, p.value)
# A tibble: 6 × 6
  model       term          estimate conf.low conf.high  p.value
  <chr>       <chr>            <dbl>    <dbl>     <dbl>    <dbl>
1 crude       smokeryes       9.25      6.79     11.7   1.29e-12
2 adjusted    smokeryes       5.14      3.22      7.05  2.49e- 7
3 adjusted    age             0.742     0.647     0.838 2.53e-39
4 interaction smokeryes       3.29     -7.77     14.4   5.58e- 1
5 interaction age             0.729     0.605     0.853 7.05e-26
6 interaction smokeryes:age   0.0330   -0.162     0.228 7.39e- 1
Now with centred age, which makes the smokeryes coefficient the smoker effect at mean age:
# A tibble: 4 × 7
  term            estimate std.error statistic  p.value conf.low conf.high
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)     152.        0.616    247.    0         151.      154.
2 smokeryes         5.10      0.980      5.21  3.57e- 7    3.17      7.03
3 age_c             0.729     0.0629    11.6   7.05e-26    0.605     0.853
4 smokeryes:age_c   0.0330    0.0992     0.333 7.39e- 1   -0.162     0.228
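The code behind the centred fit above is not shown. A minimal sketch of how such a table might be produced follows. The variable name age_c is taken from the printed output, but the object name inter_c and the seed are assumptions, and the data are re-simulated so the block runs on its own. Because the source does not state a seed, the estimates will not reproduce the printed numbers exactly.

```r
library(tidyverse)
library(broom)

# Re-create the simulated data so this block is self-contained.
# The seed is an arbitrary assumption, not the one used for the output above.
set.seed(2024)
n <- 300
age <- rnorm(n, 55, 10)
smoker <- rbinom(n, 1, plogis(-3 + 0.05 * age))
y <- 120 + 0.6 * age + 5 * smoker + 0.2 * smoker * (age - mean(age)) +
  rnorm(n, 0, 8)
dat <- tibble(age, smoker = factor(smoker, labels = c("no", "yes")), y)

# Centre age at its sample mean
dat <- dat |> mutate(age_c = age - mean(age))

# Same interaction model as before, now with centred age
inter_c <- lm(y ~ smoker * age_c, data = dat)
tidy(inter_c, conf.int = TRUE)
```

As a consistency check on the printed tables: the uncentred smoker coefficient (3.29) plus the interaction coefficient (0.0330) times the mean age should give the centred smoker coefficient (5.10), which implies a sample mean age of roughly 55, in line with the simulation. The interaction row is identical in both tables, as expected from a reparametrisation.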
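After adjusting for age, the estimated smoker effect at the sample mean age was 5.1 units (95% CI: 3.2 to 7.0). There was no clear evidence of effect modification by age (interaction coefficient 0.03, 95% CI: -0.16 to 0.23, p = 0.74). The interval also contains the simulated value of 0.2, so the data are compatible with the true effect modification but too imprecise to detect it.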
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] broom_1.0.12 lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0
[5] dplyr_1.2.1 purrr_1.2.2 readr_2.2.0 tidyr_1.3.2
[9] tibble_3.3.1 ggplot2_4.0.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] Matrix_1.7-0 gtable_0.3.6 jsonlite_2.0.0 compiler_4.4.1
[5] tidyselect_1.2.1 splines_4.4.1 scales_1.4.0 yaml_2.3.12
[9] fastmap_1.2.0 lattice_0.22-6 R6_2.6.1 labeling_0.4.3
[13] generics_0.1.4 knitr_1.51 backports_1.5.1 htmlwidgets_1.6.4
[17] pillar_1.11.1 RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.2.0
[21] utf8_1.2.6 stringi_1.8.7 xfun_0.57 S7_0.2.2
[25] otel_0.2.0 timechange_0.4.0 cli_3.6.6 mgcv_1.9-1
[29] withr_3.0.2 magrittr_2.0.5 digest_0.6.39 grid_4.4.1
[33] hms_1.1.4 nlme_3.1-164 lifecycle_1.0.5 vctrs_0.7.3
[37] evaluate_1.0.5 glue_1.8.1 farver_2.1.2 rmarkdown_2.31
[41] tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.9