Week 2, Session 4 — GAMs with mgcv

Course 2 — #courses

Author

R. Heller

Note

Inference labs use the five-step template: Hypothesis → Visualise → Assumptions → Conduct → Conclude.

Learning objectives

  • Fit a generalised additive model with mgcv::gam() and a single smooth term.
  • Read effective degrees of freedom as a data-driven measure of flexibility.
  • Visualise a smooth with gratia::draw() and report its shape in plain language.

Prerequisites

Sessions 1–2 of Week 1.

Background

Generalised additive models replace the linear term β·X in a regression with a smooth function f(X) estimated from the data. The smooth is regularised so that wiggliness is penalised; the penalty strength is chosen by restricted maximum likelihood or generalised cross-validation. The effective degrees of freedom (edf) of the smooth quantifies how much of that flexibility was used: an edf near 1 means the smooth is essentially a line; an edf of 5 means something meaningfully curved.

GAMs are particularly useful when the shape of the relationship is the scientific question. They are also useful as diagnostics for linear models, because plotting a smooth against a predictor can reveal a curve that the linear term would miss.

The mgcv package is the reference implementation. gratia is a newer graphics layer on top of mgcv that produces tidy, ggplot-based plots of smooths, partial effects, and derivatives.

Setup

library(tidyverse)
library(mgcv)
library(gratia)
library(broom)
set.seed(42)
theme_set(theme_minimal(base_size = 12))

1. Hypothesis

Simulate a sine-wave-plus-noise relationship and ask whether a GAM recovers it while a linear model misses it.

2. Visualise

n <- 300
x <- runif(n, 0, 4 * pi)
y <- sin(x) + rnorm(n, 0, 0.4)
dat <- tibble(x, y)

ggplot(dat, aes(x, y)) +
  geom_point(alpha = 0.5, colour = "grey30") +
  labs(x = "x", y = "y")

3. Assumptions

GAMs assume independent observations and a correctly specified response distribution. The smooth flexibility is chosen automatically; the user chooses the basis and, if relevant, the upper limit on flexibility via k.

fit_lm  <- lm(y ~ x, data = dat)
fit_gam <- gam(y ~ s(x, bs = "cr", k = 20), data = dat, method = "REML")

plot(fit_gam, residuals = TRUE, pch = 1, shade = TRUE)

gam.check(fit_gam)


Method: REML   Optimizer: outer newton
full convergence after 4 iterations.
Gradient range [-5.838318e-05,1.726219e-06]
(score 150.8396 & scale 0.1374259).
Hessian positive definite, eigenvalue range [4.979807,149.2267].
Model rank =  20 / 20 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

       k'  edf k-index p-value
s(x) 19.0 12.4    1.11    0.99

gam.check reports k-index and effective df — an indication of whether k is large enough.

4. Conduct

summary(fit_gam)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(x, bs = "cr", k = 20)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.002187   0.021403  -0.102    0.919

Approximate significance of smooth terms:
       edf Ref.df     F p-value    
s(x) 12.43   14.8 62.12  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.755   Deviance explained = 76.5%
-REML = 150.84  Scale est. = 0.13743   n = 300
AIC(fit_lm, fit_gam)
              df      AIC
fit_lm   3.00000 631.7192
fit_gam 14.69402 271.6137
draw(fit_gam)

5. Concluding statement

A GAM with a single smooth of x captured the non-linear relationship (edf = 12.4; approximate p < 0.001). A linear model gave a near-zero slope because the underlying function is a full sine wave. The GAM is preferred when the shape, not a scalar slope, is the scientific object.

Explain that a large edf is not a problem; the penalty prevents the smooth from interpolating noise. The p-value reported for a smooth is approximate.

Common pitfalls

  • Setting k too small and constraining the smooth.
  • Comparing GAMs with AIC as if it were a likelihood-ratio test.
  • Reporting only the smooth without showing its plot.

Further reading

  • Wood SN. Generalized Additive Models: An Introduction with R.
  • Pedersen EJ et al. (2019), Hierarchical generalized additive models.
  • Simpson GL. gratia package vignette.

Session info

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] broom_1.0.12    gratia_0.11.2   mgcv_1.9-1      nlme_3.1-164   
 [5] lubridate_1.9.5 forcats_1.0.1   stringr_1.6.0   dplyr_1.2.1    
 [9] purrr_1.2.2     readr_2.2.0     tidyr_1.3.2     tibble_3.3.1   
[13] ggplot2_4.0.3   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] generics_0.1.4     stringi_1.8.7      lattice_0.22-6     hms_1.1.4         
 [5] digest_0.6.39      magrittr_2.0.5     evaluate_1.0.5     grid_4.4.1        
 [9] timechange_0.4.0   RColorBrewer_1.1-3 fastmap_1.2.0      jsonlite_2.0.0    
[13] Matrix_1.7-0       backports_1.5.1    ggokabeito_0.1.0   scales_1.4.0      
[17] cli_3.6.6          rlang_1.2.0        splines_4.4.1      withr_3.0.2       
[21] yaml_2.3.12        otel_0.2.0         tools_4.4.1        tzdb_0.5.0        
[25] nanonext_1.9.0     vctrs_0.7.3        R6_2.6.1           lifecycle_1.0.5   
[29] tweedie_3.0.19     htmlwidgets_1.6.4  pkgconfig_2.0.3    pillar_1.11.1     
[33] gtable_0.3.6       glue_1.8.1         Rcpp_1.1.1-1.1     statmod_1.5.1     
[37] xfun_0.57          tidyselect_1.2.1   knitr_1.51         farver_2.1.2      
[41] patchwork_1.3.2    htmltools_0.5.9    mirai_2.6.1        labeling_0.4.3    
[45] rmarkdown_2.31     compiler_4.4.1     S7_0.2.2           mvnfast_0.2.8