Week 1, Session 3 — Multiple regression: confounding, interaction, centring

Course 2 — #courses

Author

R. Heller

Note

Inference labs use the five-step template: Hypothesis → Visualise → Assumptions → Conduct → Conclude.

Learning objectives

Distinguish confounding from mediation and from effect modification.
Read a multiple regression coefficient as an adjusted effect.
Centre predictors before fitting interactions and explain why.

Prerequisites

Sessions 1 and 2 of this week.

Background

Multiple regression is the standard way to estimate the effect of one predictor while holding others constant. The word holding is doing a lot of work: the adjusted coefficient is the effect conditional on the other variables in the model, not the effect one would observe in a new study that intervened on them. Confounding, mediation, and collider bias all pivot on which covariates are in the model. The statistics textbook cannot answer that question; the subject-matter science must.

Interactions complicate this picture. When two predictors interact, the effect of one depends on the level of the other, and the main-effect coefficient becomes the effect when the interacting variable is zero. Centring continuous predictors before fitting an interaction restores a readable interpretation: the main effect becomes the effect at the sample mean of the other variable, which is usually what the reader wants.

Collinearity and scale are often discussed together. Centring does not reduce real collinearity between X and Z, but it does reduce the algebraic collinearity between X and X·Z that otherwise makes the coefficients hard to interpret and inflates their standard errors.

Setup

library(tidyverse)
library(broom)
set.seed(42)
theme_set(theme_minimal(base_size = 12))

1. Hypothesis

Simulate a scenario in which a confounder masks the effect of interest, then add an interaction to show effect modification.

2. Visualise

n <- 300
age <- rnorm(n, 55, 10)
# smoker more common among older people in this simulation (confounding)
smoker <- rbinom(n, 1, plogis(-3 + 0.05 * age))
# outcome depends on both; smoker effect is stronger at older ages (interaction)
y <- 120 + 0.6 * age + 5 * smoker + 0.2 * smoker * (age - mean(age)) +
  rnorm(n, 0, 8)
dat <- tibble(age, smoker = factor(smoker, labels = c("no", "yes")), y)

ggplot(dat, aes(age, y, colour = smoker)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Age (years)", y = "Outcome", colour = "Smoker")

Both lines climb with age; the smoker line sits higher and may slope more steeply.

3. Assumptions

The usual linear-model assumptions, plus an implicit assumption that all relevant confounders are in the model.

fit_full <- lm(y ~ age * smoker, data = dat)
par(mfrow = c(2, 2))
plot(fit_full)

par(mfrow = c(1, 1))

4. Conduct

Crude smoker effect, then adjusted for age, then with interaction:

crude    <- lm(y ~ smoker, data = dat)
adjusted <- lm(y ~ smoker + age, data = dat)
inter    <- lm(y ~ smoker * age, data = dat)

bind_rows(
  tidy(crude, conf.int = TRUE) |> mutate(model = "crude"),
  tidy(adjusted, conf.int = TRUE) |> mutate(model = "adjusted"),
  tidy(inter, conf.int = TRUE) |> mutate(model = "interaction")
) |>
  filter(term != "(Intercept)") |>
  select(model, term, estimate, conf.low, conf.high, p.value)

# A tibble: 6 × 6
  model       term          estimate conf.low conf.high  p.value
  <chr>       <chr>            <dbl>    <dbl>     <dbl>    <dbl>
1 crude       smokeryes       9.25      6.79     11.7   1.29e-12
2 adjusted    smokeryes       5.14      3.22      7.05  2.49e- 7
3 adjusted    age             0.742     0.647     0.838 2.53e-39
4 interaction smokeryes       3.29     -7.77     14.4   5.58e- 1
5 interaction age             0.729     0.605     0.853 7.05e-26
6 interaction smokeryes:age   0.0330   -0.162     0.228 7.39e- 1

Now with centred age, which makes the smokeryes coefficient the smoker effect at mean age:

dat_c <- dat |> mutate(age_c = age - mean(age))
inter_c <- lm(y ~ smoker * age_c, data = dat_c)
tidy(inter_c, conf.int = TRUE)

# A tibble: 4 × 7
  term            estimate std.error statistic  p.value conf.low conf.high
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)     152.        0.616    247.    0         151.      154.   
2 smokeryes         5.10      0.980      5.21  3.57e- 7    3.17      7.03 
3 age_c             0.729     0.0629    11.6   7.05e-26    0.605     0.853
4 smokeryes:age_c   0.0330    0.0992     0.333 7.39e- 1   -0.162     0.228

5. Concluding statement

After adjusting for age, the estimated smoker-effect at the sample mean age was 5.1 units (95% CI: 3.2 to 7), with evidence of effect modification by age (interaction coefficient 0.03).

The point here is that adjusted is not a synonym for causal, and that centring is a small habit with a large readability payoff.

Common pitfalls

Reading an adjusted coefficient as a population-level causal effect without thinking about what the adjustment set represents.
Fitting an interaction without centring and then trying to interpret the main effects.
Dropping covariates that look non-significant one at a time.

Session info

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] broom_1.0.12    lubridate_1.9.5 forcats_1.0.1   stringr_1.6.0  
 [5] dplyr_1.2.1     purrr_1.2.2     readr_2.2.0     tidyr_1.3.2    
 [9] tibble_3.3.1    ggplot2_4.0.3   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] Matrix_1.7-0       gtable_0.3.6       jsonlite_2.0.0     compiler_4.4.1    
 [5] tidyselect_1.2.1   splines_4.4.1      scales_1.4.0       yaml_2.3.12       
 [9] fastmap_1.2.0      lattice_0.22-6     R6_2.6.1           labeling_0.4.3    
[13] generics_0.1.4     knitr_1.51         backports_1.5.1    htmlwidgets_1.6.4 
[17] pillar_1.11.1      RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.2.0       
[21] utf8_1.2.6         stringi_1.8.7      xfun_0.57          S7_0.2.2          
[25] otel_0.2.0         timechange_0.4.0   cli_3.6.6          mgcv_1.9-1        
[29] withr_3.0.2        magrittr_2.0.5     digest_0.6.39      grid_4.4.1        
[33] hms_1.1.4          nlme_3.1-164       lifecycle_1.0.5    vctrs_0.7.3       
[37] evaluate_1.0.5     glue_1.8.1         farver_2.1.2       rmarkdown_2.31    
[41] tools_4.4.1        pkgconfig_2.0.3    htmltools_0.5.9

Learning objectives

Prerequisites

Background

Setup

1. Hypothesis

2. Visualise

3. Assumptions

4. Conduct

5. Concluding statement

Common pitfalls

Further reading

Session info

Related labs