Note: Inference lab using the five-step template: Hypothesis → Visualise → Assumptions → Conduct → Conclude.
Prerequisite: Course 2 regression basics.
The language of missingness is Rubin’s taxonomy. MCAR (missing completely at random) means the probability of being missing does not depend on anything — observed or unobserved. MAR (missing at random) means the probability depends only on observed variables. MNAR (missing not at random) means the probability depends on the missing value itself, even after conditioning on everything observed. Each is a property of the process, not the data, and you can rarely decide between MAR and MNAR from the data alone.
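The three definitions can be written compactly in symbols. This is a sketch in commonly used notation that the lab itself does not introduce: $R$ is the missingness indicator, $Y_{\text{obs}}$ and $Y_{\text{mis}}$ are the observed and missing parts of the data (fully observed covariates count as part of $Y_{\text{obs}}$), and $\phi$ collects the parameters of the missingness process.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \phi) = P(R \mid \phi) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \phi) = P(R \mid Y_{\text{obs}}, \phi) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \phi)
                    \text{ still depends on } Y_{\text{mis}} \text{ after conditioning on } Y_{\text{obs}}
\end{align*}
```

MAR and MNAR differ only in whether $Y_{\text{mis}}$ appears on the right-hand side, and $Y_{\text{mis}}$ is exactly what was never recorded, which is why the data alone rarely settle the question.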
Complete-case analysis drops any row with a missing value. It is unbiased under MCAR, usually biased under MAR, and always suspect under MNAR. Multiple imputation handles MAR correctly if the imputation model is compatible with the analysis model, but cannot rescue MNAR without explicit modelling of the missingness mechanism.
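To make the multiple-imputation idea concrete, here is a minimal hand-rolled sketch in base R for the mean of Y under a MAR mechanism. It is an illustration under stated assumptions, not the lab's method and not a substitute for a dedicated imputation package: the seed, the number of imputations `m`, and the bootstrap step used to propagate parameter uncertainty are all choices made for this example. The imputation model (`y ~ x`) contains the variable that drives missingness, which is what lets it correct the complete-case bias.

```r
# Hand-rolled multiple imputation for the mean of y under MAR (base R only).
set.seed(1)  # assumed seed, not taken from the lab

n <- 1000
x <- rnorm(n, 50, 10)
y <- 2 + 0.5 * x + rnorm(n, 0, 5)

# MAR: the probability that y is missing depends only on the observed x
p_miss <- plogis(-2 + 0.05 * (x - 50))
y_obs <- ifelse(runif(n) < p_miss, NA_real_, y)
obs <- !is.na(y_obs)

m <- 20                     # number of imputed datasets (assumed)
est <- numeric(m)           # mean of y in each completed dataset
within_var <- numeric(m)    # sampling variance of that mean

for (j in seq_len(m)) {
  # Bootstrap the complete cases so parameter uncertainty is carried through
  idx <- sample(which(obs), replace = TRUE)
  fit <- lm(y ~ x, data = data.frame(x = x[idx], y = y_obs[idx]))

  # Fill each missing y with its prediction plus residual noise
  y_imp <- y_obs
  mu <- predict(fit, newdata = data.frame(x = x[!obs]))
  y_imp[!obs] <- mu + rnorm(sum(!obs), 0, sigma(fit))

  est[j] <- mean(y_imp)
  within_var[j] <- var(y_imp) / n
}

# Rubin's rules: pool the estimates and combine the two variance components
pooled_mean <- mean(est)
W <- mean(within_var)              # within-imputation variance
B <- var(est)                      # between-imputation variance
pooled_se <- sqrt(W + (1 + 1 / m) * B)

c(truth = mean(y),
  complete_case = mean(y_obs, na.rm = TRUE),
  pooled = pooled_mean,
  se = pooled_se)
```

If the missingness probability were driven by `y` itself (MNAR), the same code would run without complaint and still be biased, because nothing in the imputation model can see the mechanism.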
Auxiliary variables — things that correlate with both the outcome and the missingness but are not in the analysis model — are the currency of MAR. If you have them, put them in the imputation model. If you do not, your MAR claim is on thinner ice than it appears.
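The role of an auxiliary variable can be sketched with a small simulation. Everything here is hypothetical: the variable `a`, its effect sizes, the seed, and the helper `impute_mean()` are invented for illustration. Missingness depends on `a` only, so MAR holds when `a` is in the imputation model and fails when it is left out. For brevity the sketch uses a single deterministic regression imputation, which is adequate for comparing point estimates but understates uncertainty.

```r
set.seed(2)  # assumed seed, not taken from the lab

n <- 5000
x <- rnorm(n, 50, 10)
a <- rnorm(n)                              # hypothetical auxiliary variable
y <- 2 + 0.5 * x + 3 * a + rnorm(n, 0, 2)  # a is related to the outcome

# a also drives missingness: larger a means y is more likely to be missing
y_obs <- ifelse(runif(n) < plogis(-1 + a), NA_real_, y)
obs <- !is.na(y_obs)
d <- data.frame(x = x, a = a, y_obs = y_obs)

# Fit the imputation model on complete cases, fill in missing y, return the mean
impute_mean <- function(formula) {
  fit <- lm(formula, data = d[obs, ])
  y_fill <- d$y_obs
  y_fill[!obs] <- predict(fit, newdata = d[!obs, ])
  mean(y_fill)
}

truth <- mean(y)
bias_without_aux <- impute_mean(y_obs ~ x) - truth      # a omitted
bias_with_aux    <- impute_mean(y_obs ~ x + a) - truth  # a included

round(c(without_aux = bias_without_aux, with_aux = bias_with_aux), 3)
```

Leaving `a` out of the imputation model leaves the estimate biased downward, because the imputed values are predicted from rows where `a` tends to be low.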
We estimate the mean of Y from each of three datasets with the same true mean, where Y has been made missing by MCAR, MAR, and MNAR processes.
library(tidyverse)

n <- 1000
df <- tibble(
  x = rnorm(n, 50, 10),
  y = 2 + 0.5 * x + rnorm(n, 0, 5)
)

# MCAR: every Y has the same 30% chance of being missing
mcar <- df |> mutate(y_obs = if_else(runif(n) < 0.30, NA_real_, y),
                     mech = "MCAR")

# MAR: the chance that Y is missing rises with the observed x
mar <- df |> mutate(p = plogis(-2 + 0.05 * (x - 50)),
                    y_obs = if_else(runif(n) < p, NA_real_, y),
                    mech = "MAR")

# MNAR: the chance that Y is missing rises with Y itself
mnar <- df |> mutate(p = plogis(-2 + 0.05 * (y - mean(y))),
                     y_obs = if_else(runif(n) < p, NA_real_, y),
                     mech = "MNAR")

all <- bind_rows(
  mcar |> select(mech, x, y, y_obs),
  mar  |> select(mech, x, y, y_obs),
  mnar |> select(mech, x, y, y_obs)
)

# Plot x against the full y, coloured by whether Y was made missing
all |>
  mutate(missing = is.na(y_obs)) |>
  ggplot(aes(x, y, colour = missing)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ mech) +
  labs(colour = "Y missing?")

Complete-case analysis is the bluntest possible method. It is justified if missingness is MCAR. We compute the mean of Y on the observed rows for each mechanism and compare it with the true mean, taken here as the mean of Y before any values are removed.
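The comparison just described can be sketched as follows. The lab does not show the chunk that produced its table, so this is a reconstruction under assumptions: the column names are copied from the printed output, the seed is invented, and the data are regenerated so the block runs on its own. The numbers will therefore differ from those reported.

```r
library(dplyr)
library(tibble)

set.seed(3)  # assumed seed; the lab's seed is not shown

n <- 1000
df <- tibble(
  x = rnorm(n, 50, 10),
  y = 2 + 0.5 * x + rnorm(n, 0, 5)
)

# One copy of the data per mechanism, each with its own missingness probability
all <- bind_rows(
  df |> mutate(p = 0.30, mech = "MCAR"),
  df |> mutate(p = plogis(-2 + 0.05 * (x - 50)), mech = "MAR"),
  df |> mutate(p = plogis(-2 + 0.05 * (y - mean(y))), mech = "MNAR")
) |>
  mutate(y_obs = if_else(runif(dplyr::n()) < p, NA_real_, y))

# Complete-case mean against the full-data mean, per mechanism
summary_tbl <- all |>
  group_by(mech) |>
  summarise(
    truth = mean(y),                           # mean of Y before deletion
    cc = mean(y_obs, na.rm = TRUE),            # complete-case mean
    bias = cc - truth,
    pct_missing = 100 * mean(is.na(y_obs))
  )

summary_tbl
```

Here `truth` is the full-sample mean rather than the population value of 27, so `bias` measures only the error introduced by the missing values.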
# A tibble: 3 × 5
  mech  truth    cc   bias pct_missing
  <chr> <dbl> <dbl>  <dbl>       <dbl>
1 MAR    26.8  26.5 -0.301        12.7
2 MCAR   26.8  27.0  0.135        28.5
3 MNAR   26.8  26.5 -0.380        14.9
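In this simulated sample, the complete-case estimate of the mean of Y was close to the truth under MCAR (error ≈ 0.13, which is sampling noise rather than systematic bias) and shifted downward under MAR (-0.30) and MNAR (-0.38). All three datasets share the same true mean, and the MAR and MNAR datasets lost fewer values (12.7% and 14.9%) than the MCAR dataset (28.5%), so the amount of missingness does not explain the bias; the mechanism does.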
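A single dataset cannot separate systematic bias from sampling noise, so it can help to repeat the simulation many times and average the error. The sketch below does this in base R under the same data-generating settings as the lab. The seed, the 500 replications, and the helper name `one_rep()` are assumptions made for this example.

```r
set.seed(4)  # assumed seed, not taken from the lab

# One simulated dataset: complete-case mean minus full-sample mean, per mechanism
one_rep <- function(n = 1000) {
  x <- rnorm(n, 50, 10)
  y <- 2 + 0.5 * x + rnorm(n, 0, 5)
  p <- list(
    MCAR = rep(0.30, n),
    MAR  = plogis(-2 + 0.05 * (x - 50)),
    MNAR = plogis(-2 + 0.05 * (y - mean(y)))
  )
  # A value is kept when its uniform draw is at least the missingness probability
  sapply(p, function(prob) mean(y[runif(n) >= prob]) - mean(y))
}

errors <- replicate(500, one_rep())          # 3 x 500 matrix, one row per mechanism
avg_bias <- rowMeans(errors)                 # Monte Carlo estimate of the bias
mc_se <- apply(errors, 1, sd) / sqrt(ncol(errors))

round(rbind(avg_bias, mc_se), 3)
```

Under these settings the average error should sit near zero for MCAR and clearly below zero for MAR and MNAR, which is the pattern the single-sample table suggests.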
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.1
[5] purrr_1.2.2 readr_2.2.0 tidyr_1.3.2 tibble_3.3.1
[9] ggplot2_4.0.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.4.1 tidyselect_1.2.1
[5] scales_1.4.0 yaml_2.3.12 fastmap_1.2.0 R6_2.6.1
[9] labeling_0.4.3 generics_0.1.4 knitr_1.51 htmlwidgets_1.6.4
[13] pillar_1.11.1 RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.2.0
[17] utf8_1.2.6 stringi_1.8.7 xfun_0.57 S7_0.2.2
[21] otel_0.2.0 timechange_0.4.0 cli_3.6.6 withr_3.0.2
[25] magrittr_2.0.5 digest_0.6.39 grid_4.4.1 hms_1.1.4
[29] lifecycle_1.0.5 vctrs_0.7.3 evaluate_1.0.5 glue_1.8.1
[33] farver_2.1.2 rmarkdown_2.31 tools_4.4.1 pkgconfig_2.0.3
[37] htmltools_0.5.9