Note
Inference labs use the five-step template: Hypothesis → Visualise → Assumptions → Conduct → Conclude.
plot(lm) diagnostics and say what each is for.
Sessions 2 and 3 of this week.
Diagnostics are the step in a regression workflow that protects the report from the dataset. A coefficient is only as trustworthy as the assumptions behind it. The four classical plots — residuals vs fitted, QQ, scale-location, and residuals vs leverage — answer four questions: is the mean right, are the errors roughly normal, is the variance constant, and is any single point running the show?
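As a minimal sketch of how the four panels map onto base R, the calls below draw them for a toy model. The simulated data, the seed, and the object name `fit` are illustrative assumptions and are not the lab's dataset; the `which` argument of `plot()` for `lm` objects selects the panel.

```r
# Toy data: illustrative only, not the lab's simulation
set.seed(42)                 # arbitrary seed, chosen only for reproducibility
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

op <- par(mfrow = c(2, 2))   # 2 x 2 grid, one panel per question
plot(fit, which = 1)         # residuals vs fitted: is the mean right?
plot(fit, which = 2)         # normal QQ: are the errors roughly normal?
plot(fit, which = 3)         # scale-location: is the variance constant?
plot(fit, which = 5)         # residuals vs leverage: is one point running the show?
par(op)                      # restore the previous graphics settings
```

Calling `plot(fit)` with no `which` argument draws the same four panels, since `c(1, 2, 3, 5)` is the default selection.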
Leverage measures how unusual a point’s predictor values are; influence measures how much the fit changes when that point is removed. Cook’s distance summarises influence in a single number by combining a point’s leverage with the size of its residual. A point can have high leverage without high influence if it sits on the regression line, and it can have high influence without a spectacular residual if its predictors are extreme, because a high-leverage point pulls the fitted line toward itself.
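The distinction can be illustrated with a small base-R sketch. Everything here (the toy data, the seed, the added point at `x = 6`, and names such as `fit_on` and `fit_off`) is a hypothetical setup for illustration. The same extreme `x` value is added twice: once exactly on the line fitted to the other points, once far below it.

```r
set.seed(1)                          # arbitrary seed
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
fit0 <- lm(y ~ x)                    # fit before any extra point is added

x_new <- 6                           # far from the bulk of x: high leverage
y_on  <- unname(predict(fit0, data.frame(x = x_new)))  # exactly on the fitted line
y_off <- 0                           # same x, far from the line

fit_on  <- lm(y ~ x, data = data.frame(x = c(x, x_new), y = c(y, y_on)))
fit_off <- lm(y ~ x, data = data.frame(x = c(x, x_new), y = c(y, y_off)))

i <- n + 1                           # index of the added point
summarise_point <- function(fit) {
  c(leverage = hatvalues(fit)[[i]],
    cooks_d  = cooks.distance(fit)[[i]],
    slope    = coef(fit)[["x"]])
}
influence_summary <- rbind(on_line  = summarise_point(fit_on),
                           off_line = summarise_point(fit_off))
influence_summary
```

Leverage depends only on the predictor values, so it is identical in both rows. Only the off-line version moves the slope and earns a large Cook's distance.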
The performance::check_model() function wraps many of these ideas in a single call and is increasingly the way teams read diagnostics on screen. car::vif() gives the variance inflation factors; a rule of thumb is that values above 5 deserve attention and above 10 demand it, but rules of thumb are no substitute for thinking about which predictors are truly redundant.
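To make the variance inflation factor less of a black box, here is a hedged base-R sketch that computes it from its definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. The data, the seed, and the name `vif_by_hand` are assumptions for illustration. For a model with only numeric predictors, the result should agree with what `car::vif()` reports.

```r
set.seed(2)                          # arbitrary seed
n  <- 150
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, 0, 0.3)          # strongly correlated with x1
x3 <- rnorm(n)                       # unrelated to x1 and x2
X  <- data.frame(x1, x2, x3)

# For each predictor, regress it on the remaining predictors and
# convert the resulting R-squared into a variance inflation factor.
vif_by_hand <- sapply(names(X), function(j) {
  others <- setdiff(names(X), j)
  r2 <- summary(lm(reformulate(others, response = j), data = X))$r.squared
  1 / (1 - r2)
})
round(vif_by_hand, 2)
```

Seen this way, a VIF of 10 means that 90% of a predictor's variance is already explained by the other predictors, which is why the threshold prompts a conversation about redundancy rather than an automatic deletion.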
Simulate a regression with mild heteroscedasticity, a high-leverage point, and two correlated predictors. Ask which diagnostics flag which problem.
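library(tidyverse) # provides tibble() and ggplot()
n <- 150
x1 <- rnorm(n, 0, 1)
x2 <- x1 + rnorm(n, 0, 0.3) # x1 and x2 correlated
x3 <- rnorm(n, 0, 1)
# error SD grows with |x1|: mild heteroscedasticity
y <- 1 + 2 * x1 - 1.5 * x2 + 0.5 * x3 + rnorm(n, 0, 1 + 0.4 * abs(x1))
# add one influential point: extreme x1 and x2, response well below the trend
x1[1] <- 5; x2[1] <- 5; y[1] <- 0
dat <- tibble(y, x1, x2, x3)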
ggplot(dat, aes(x1, y)) +
  geom_point(alpha = 0.6) +
  labs(x = "x1", y = "y")
The residuals-vs-fitted plot should show no trend. The QQ plot should track the line. Scale-location should be flat. Residuals-vs-leverage should have no points outside the Cook’s-distance contours.
Influence diagnostics:
# A tibble: 5 × 5
    row .fitted .resid   .hat .cooksd
  <int>   <dbl>  <dbl>  <dbl>   <dbl>
1     1    4.00  -4.00 0.152   0.369
2     9    1.81   5.17 0.0493  0.159
3    12    2.49   3.77 0.0552  0.0958
4    19   -1.23  -3.88 0.0502  0.0909
5    81    3.21  -3.01 0.0643  0.0724
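A table like the one above could be produced along the following lines. This is a sketch under assumptions: the model formula `y ~ x1 + x2 + x3` and the seed are not shown in the lab, so the numbers will differ from the printed output. The column names mirror those of `broom::augment()`, but base R is used here so the sketch has no package dependencies.

```r
set.seed(3)                          # arbitrary seed; the lab's seed is not shown
n  <- 150
x1 <- rnorm(n, 0, 1)
x2 <- x1 + rnorm(n, 0, 0.3)
x3 <- rnorm(n, 0, 1)
y  <- 1 + 2 * x1 - 1.5 * x2 + 0.5 * x3 + rnorm(n, 0, 1 + 0.4 * abs(x1))
x1[1] <- 5; x2[1] <- 5; y[1] <- 0    # the injected point
dat <- data.frame(y, x1, x2, x3)

fit <- lm(y ~ x1 + x2 + x3, data = dat)   # assumed model formula

# One row per observation: fitted value, residual, leverage, Cook's distance
infl <- data.frame(
  row     = seq_len(nrow(dat)),
  .fitted = unname(fitted(fit)),
  .resid  = unname(residuals(fit)),
  .hat    = unname(hatvalues(fit)),
  .cooksd = unname(cooks.distance(fit))
)

# Keep the five observations with the largest Cook's distance
top5 <- head(infl[order(infl$.cooksd, decreasing = TRUE), ], 5)
top5
```

Sorting by Cook's distance rather than by residual matters here: a high-leverage point can rank first even when its raw residual is not the largest in absolute value.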
VIF:
Performance summary:
VIF for x1 and x2 should be high; x3 should be fine. The first row (the injected leverage point) should dominate Cook’s distance.
Regression diagnostics identified one high-influence observation (Cook’s D = 0.37) and collinearity between x1 and x2 (VIF = 15.6). Refitting without the influential point or after combining the collinear predictors would be the next step before reporting coefficients.
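The first of those next steps, refitting without the influential point, might look like the sketch below. The seed, the model formula, and the names `fit_all`, `fit_drop`, and `comparison` are assumptions for illustration, so the coefficients will not match the lab's own run.

```r
set.seed(4)                          # arbitrary seed; the lab's seed is not shown
n  <- 150
x1 <- rnorm(n, 0, 1)
x2 <- x1 + rnorm(n, 0, 0.3)
x3 <- rnorm(n, 0, 1)
y  <- 1 + 2 * x1 - 1.5 * x2 + 0.5 * x3 + rnorm(n, 0, 1 + 0.4 * abs(x1))
x1[1] <- 5; x2[1] <- 5; y[1] <- 0    # the injected point sits in row 1
dat <- data.frame(y, x1, x2, x3)

fit_all  <- lm(y ~ x1 + x2 + x3, data = dat)        # assumed model formula
fit_drop <- lm(y ~ x1 + x2 + x3, data = dat[-1, ])  # same model without row 1

# Side-by-side coefficients and the change caused by dropping the point
comparison <- cbind(all_rows = coef(fit_all), without_row_1 = coef(fit_drop))
comparison <- cbind(comparison,
                    change = comparison[, "without_row_1"] - comparison[, "all_rows"])
round(comparison, 3)
```

Reporting both fits side by side, rather than silently dropping the point, lets the reader judge how much the conclusions depend on a single observation.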
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] performance_0.16.0 car_3.1-5 carData_3.0-6 broom_1.0.12
[5] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.1
[9] purrr_1.2.2 readr_2.2.0 tidyr_1.3.2 tibble_3.3.1
[13] ggplot2_4.0.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.57 bayestestR_0.17.0
[4] twosamples_2.0.1 caTools_1.18.3 htmlwidgets_1.6.4
[7] insight_1.5.0 ggrepel_0.9.8 lattice_0.22-6
[10] tzdb_0.5.0 bitops_1.0-9 vctrs_0.7.3
[13] tools_4.4.1 generics_0.1.4 datawizard_1.3.1
[16] parallel_4.4.1 pbmcapply_1.5.1 DEoptimR_1.1-4
[19] pkgconfig_2.0.3 Matrix_1.7-0 RColorBrewer_1.1-3
[22] S7_0.2.2 lifecycle_1.0.5 compiler_4.4.1
[25] farver_2.1.2 codetools_0.2-20 htmltools_0.5.9
[28] yaml_2.3.12 pracma_2.4.6 Formula_1.2-5
[31] pillar_1.11.1 MASS_7.3-60.2 iterators_1.0.14
[34] foreach_1.5.2 abind_1.4-8 nlme_3.1-164
[37] robustbase_0.99-7 tidyselect_1.2.1 digest_0.6.39
[40] mvtnorm_1.3-7 stringi_1.8.7 splines_4.4.1
[43] labeling_0.4.3 fastmap_1.2.0 grid_4.4.1
[46] cli_3.6.6 qqconf_1.3.2 magrittr_2.0.5
[49] patchwork_1.3.2 withr_3.0.2 scales_1.4.0
[52] backports_1.5.1 opdisDownsampling_1.0.1 timechange_0.4.0
[55] estimability_1.5.1 rmarkdown_2.31 emmeans_2.0.3
[58] otel_0.2.0 hms_1.1.4 coda_0.19-4.1
[61] evaluate_1.0.5 qqplotr_0.0.7 knitr_1.51
[64] parameters_0.28.3 doParallel_1.0.17 mgcv_1.9-1
[67] rlang_1.2.0 Rcpp_1.1.1-1.1 xtable_1.8-8
[70] glue_1.8.1 see_0.13.0 jsonlite_2.0.0
[73] R6_2.6.1