Course 4 — #courses
Note
Testing labs use the main template: Hypothesis → Visualise → Assumptions → Conduct → Conclude.
Survival ML from Session 4; calibration from Course 1.
Validation is the point at which a prediction model either earns its keep or does not. Internal validation — CV or the bootstrap on the training data — controls optimism from the fitting step. External validation applies the fixed model to an independent cohort and measures discrimination, calibration, and clinical utility. Time- dependent Brier scores and IPA (1 − Brier/Brier_null) summarise the cost of a probabilistic prediction at a given horizon, combining discrimination and calibration into a single number per time.
Real biomedical models routinely lose performance on external cohorts. Distribution shift (different age, comorbidities, measurement protocols) is the usual reason; miscalibration is the usual failure mode. This lab simulates both and shows the diagnostic signature each leaves behind.
TRIPOD-AI asks you to report all three — discrimination, calibration, clinical utility — on both internal and external data, and to disclose the shift between them. A reader who only sees AUC on internal data has no basis for deploying the model.
A Cox model fit on a development cohort retains its calibration and discrimination on an external cohort with shifted covariate distribution.
make_cohort <- function(n, mu_x = 0, beta = 0.6) {
x <- rnorm(n, mu_x)
lp <- beta * x
t <- rexp(n, rate = exp(lp - 1))
c <- rexp(n, rate = 0.1)
time <- pmin(t, c)
event <- as.integer(t <= c)
tibble(x = x, time = time, event = event)
}
dev <- make_cohort(500, mu_x = 0)
ext <- make_cohort(500, mu_x = 1.0) # distribution shift
bind_rows(dev |> mutate(cohort = "dev"),
ext |> mutate(cohort = "ext")) |>
ggplot(aes(x, fill = cohort)) +
geom_histogram(bins = 30, alpha = 0.7, position = "identity")Independent right censoring; Cox proportional hazards in the development cohort; no time-varying covariates.
Fit on dev; predict survival at a fixed horizon on both cohorts.
fit <- coxph(Surv(time, event) ~ x, data = dev)
horizon <- 3
predict_surv <- function(fit, newdata, t) {
bh <- basehaz(fit, centered = FALSE)
H0 <- approx(bh$time, bh$hazard, xout = t, rule = 2)$y
lp <- predict(fit, newdata = newdata, type = "lp")
exp(-H0 * exp(lp))
}
s_dev <- predict_surv(fit, dev, horizon)
s_ext <- predict_surv(fit, ext, horizon)
# Observed indicator (for simple Brier w/out censoring at horizon)
obs_dev <- dev$event == 1 & dev$time <= horizon
obs_ext <- ext$event == 1 & ext$time <= horizon
brier <- function(p_event, obs) mean((p_event - obs)^2)
brier_null <- function(obs) mean((mean(obs) - obs)^2)
b_dev <- brier(1 - s_dev, obs_dev); b0_dev <- brier_null(obs_dev)
b_ext <- brier(1 - s_ext, obs_ext); b0_ext <- brier_null(obs_ext)
ipa_dev <- 1 - b_dev / b0_dev
ipa_ext <- 1 - b_ext / b0_ext
auc_dev <- as.numeric(auc(roc(obs_dev, 1 - s_dev, quiet = TRUE)))
auc_ext <- as.numeric(auc(roc(obs_ext, 1 - s_ext, quiet = TRUE)))
tibble(cohort = c("dev", "ext"),
auc = c(auc_dev, auc_ext),
brier = c(b_dev, b_ext),
ipa = c(ipa_dev, ipa_ext))# A tibble: 2 × 4
cohort auc brier ipa
<chr> <dbl> <dbl> <dbl>
1 dev 0.712 0.218 0.107
2 ext 0.689 0.186 0.0405
A calibration plot at the horizon.
mk_cal <- function(p, y, lab) {
tibble(p = p, y = y, cohort = lab) |>
mutate(bin = cut(p, quantile(p, seq(0, 1, by = 0.1)),
include.lowest = TRUE)) |>
group_by(cohort, bin) |>
summarise(mean_pred = mean(p), obs = mean(y), .groups = "drop")
}
bind_rows(mk_cal(1 - s_dev, obs_dev, "dev"),
mk_cal(1 - s_ext, obs_ext, "ext")) |>
ggplot(aes(mean_pred, obs, colour = cohort)) +
geom_point() + geom_line() +
geom_abline(slope = 1, intercept = 0, colour = "grey50") +
labs(x = "predicted event prob at horizon",
y = "observed event proportion")On a simulated external cohort with shifted covariate distribution, a Cox model fit on the development cohort retained discrimination (AUC 0.69) but lost calibration, with IPA falling from 0.11 (development) to 0.04 (external). The calibration plot showed systematic over-prediction in the shifted cohort.
Distribution shift does not always show up in AUC; it often shows up first in calibration. Reporting Brier/IPA alongside AUC catches this failure mode.
pec or riskRegression handle this properly.R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] pROC_1.19.0.1 survival_3.6-4 lubridate_1.9.5 forcats_1.0.1
[5] stringr_1.6.0 dplyr_1.2.1 purrr_1.2.2 readr_2.2.0
[9] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] Matrix_1.7-0 gtable_0.3.6 jsonlite_2.0.0 compiler_4.4.1
[5] Rcpp_1.1.1-1.1 tidyselect_1.2.1 splines_4.4.1 scales_1.4.0
[9] yaml_2.3.12 fastmap_1.2.0 lattice_0.22-6 R6_2.6.1
[13] labeling_0.4.3 generics_0.1.4 knitr_1.51 htmlwidgets_1.6.4
[17] pillar_1.11.1 RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.2.0
[21] utf8_1.2.6 stringi_1.8.7 xfun_0.57 S7_0.2.2
[25] otel_0.2.0 timechange_0.4.0 cli_3.6.6 withr_3.0.2
[29] magrittr_2.0.5 digest_0.6.39 grid_4.4.1 hms_1.1.4
[33] lifecycle_1.0.5 vctrs_0.7.3 evaluate_1.0.5 glue_1.8.1
[37] farver_2.1.2 rmarkdown_2.31 tools_4.4.1 pkgconfig_2.0.3
[41] htmltools_0.5.9