Week 4, Session 1 — Dichotomisation, change scores, regression to the mean

Course 2

Author

R. Heller

Note

Inference labs use the five-step template: Hypothesis → Visualise → Assumptions → Conduct → Conclude.

Learning objectives

Quantify the power loss of a median split on a continuous predictor.
Demonstrate regression to the mean in a simulated pre-post design.
Distinguish apparent improvement due to RTM from a true treatment effect.

Prerequisites

Basic regression; ANCOVA from Week 3 Session 2.

Background

Dichotomising a continuous predictor into high vs low discards all the variation within each half. For a normally distributed predictor with a linear effect, a median split retains only about 2/π ≈ 64% of the information, so the loss of power is roughly equivalent to throwing away a third of the sample. A dichotomy is defensible only when the clinical decision it feeds into is itself binary, and even then the underlying continuous variable should enter the analysis.
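
The "third of the sample" figure can be checked numerically. The following is a minimal sketch, assuming a standard normal predictor: the squared correlation between x and its median split approximates the fraction of information retained, which in theory is 2/π. The seed and sample size are arbitrary choices for this illustration and are not part of the lab.

```r
# Sketch only: seed and sample size are arbitrary illustration choices.
set.seed(1)
x <- rnorm(1e5)                        # assumed: normally distributed predictor
x_bin <- as.numeric(x > median(x))     # median split coded 0/1

# Squared correlation between the predictor and its dichotomised version;
# this approximates the relative efficiency of the split analysis.
r2 <- cor(x, x_bin)^2
c(simulated = r2, theoretical = 2 / pi)
```

Both values should sit near 0.64, which is where the "roughly a third lost" rule of thumb comes from. For skewed predictors or cut-points away from the median the loss is generally different, so treat the figure as a guide rather than a constant.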

Regression to the mean (RTM) is a statistical fact about imperfectly correlated pairs of measurements: participants who score unusually high on a first measurement will, on average, score less extremely on a second measurement of the same quantity, even with no intervention. In a pre-post study this produces an apparent decline in the high-baseline group (which reads as improvement when high scores are bad) and an apparent rise in the low-baseline group. ANCOVA, comparing arms with baseline as a covariate, accounts for this; a simple comparison of pre and post within the extreme group does not.
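
The contrast between the two analyses can be sketched in a few lines. This is an illustrative simulation under stated assumptions, not the lab's own analysis: participants are screened on a high baseline, randomised to two hypothetical arms ("control", "treated") with a true treatment effect of zero, and the data-generating values (slope 0.7, residual SD 11) mirror those used later in this session. The seed and screening sample size are arbitrary.

```r
# Sketch only: arm labels, seed, and n_screen are illustrative assumptions.
set.seed(7)
n_screen <- 10000
pre  <- rnorm(n_screen, 100, 15)
post <- 0.7 * (pre - 100) + 100 + rnorm(n_screen, 0, 11)  # no treatment effect

# Enrol only the top baseline quintile, then randomise to two arms.
enrolled <- data.frame(pre, post)[pre > quantile(pre, 0.8), ]
enrolled$arm <- factor(sample(rep(c("control", "treated"),
                                  length.out = nrow(enrolled))))

# Naive analysis: mean pre-post change within the treated arm alone.
treated <- subset(enrolled, arm == "treated")
naive_change <- mean(treated$post - treated$pre)

# ANCOVA: follow-up regressed on baseline and arm.
ancova <- lm(post ~ pre + arm, data = enrolled)
arm_effect <- coef(ancova)[["armtreated"]]

c(naive_change = naive_change, ancova_arm_effect = arm_effect)
```

The naive within-arm change should be clearly negative even though nothing was done, while the ANCOVA arm coefficient should be close to zero, because both arms regress toward the mean by the same amount.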

RTM is not the same as measurement error, but measurement error is one source of RTM. Any non-perfect correlation between measurements produces it. This is why baseline stratification by a continuous variable must be prespecified — selecting on a single noisy baseline guarantees RTM.

Setup

library(tidyverse)
library(broom)
set.seed(42)
theme_set(theme_minimal(base_size = 12))

1. Hypothesis

Two simulations:

A continuous predictor truly associated with a continuous outcome; compare the p-value when treated as continuous vs median-split.
A pre-post dataset with no intervention; show the apparent change (a decline toward the mean) in the high-baseline group.

2. Visualise

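n <- 200
x <- rnorm(n)                   # continuous predictor
y <- 0.3 * x + rnorm(n, 0, 1)   # true slope 0.3, residual SD 1
# median split; "high" becomes the reference level (factor levels sort alphabetically)
dat1 <- tibble(x, y, x_bin = factor(ifelse(x > median(x), "high", "low")))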

ggplot(dat1, aes(x, y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", colour = "steelblue") +
  labs(x = "x (continuous)", y = "y")

3. Assumptions

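Independent observations, a linear relationship, and approximately normal residuals with constant variance: the standard regression assumptions. Both simulations satisfy these by construction.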

4. Conduct

Power loss from dichotomisation:

tidy(lm(y ~ x, data = dat1))     # continuous

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)  0.00914    0.0669     0.136 0.892  
2 x            0.222      0.0688     3.22  0.00148

tidy(lm(y ~ x_bin, data = dat1)) # median split

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    0.191    0.0952      2.01 0.0463 
2 x_binlow      -0.376    0.135      -2.79 0.00577

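In this run the p-value for x_bin is larger than for x (0.0058 vs 0.0015) and the absolute t-statistic is smaller (2.79 vs 3.22). The two coefficients are on different scales (a slope per unit of x vs a low-minus-high difference in group means), so compare the test statistics and p-values, not the raw estimates or standard errors. A single simulated dataset can occasionally go the other way; the loss is a statement about power on average.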

Regression to the mean:

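n2 <- 1000
pre  <- rnorm(n2, 100, 15)
# no intervention: follow-up tracks baseline with slope 0.7 plus independent noise
post <- 0.7 * (pre - 100) + 100 + rnorm(n2, 0, 11)
dat2 <- tibble(pre, post) |>
  mutate(high_baseline = pre > quantile(pre, 0.8))   # top baseline quintile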

dat2 |>
  group_by(high_baseline) |>
  summarise(pre_mean = mean(pre), post_mean = mean(post),
            change = mean(post - pre))

# A tibble: 2 × 4
  high_baseline pre_mean post_mean change
  <lgl>            <dbl>     <dbl>  <dbl>
1 FALSE             94.2      96.2   2.00
2 TRUE             120.      114.   -6.30

The “high baseline” group shows a large negative mean change; the rest of the sample shows a small positive change. No intervention occurred.
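
The size of these changes follows directly from the simulation settings. As a worked check, assuming the population values used above (mean 100, SD 15, slope 0.7, cut at the 80th percentile), the expected change in a group is (slope - 1) times that group's mean distance from 100. The variable names below are hypothetical helpers for this sketch.

```r
# Population settings taken from the simulation above.
mu <- 100; sigma <- 15; slope <- 0.7; cut_p <- 0.8

# Means of a normal distribution above and below the cut-point
# (truncated-normal means).
z <- qnorm(cut_p)
mean_high <- mu + sigma * dnorm(z) / (1 - cut_p)
mean_low  <- mu - sigma * dnorm(z) / cut_p

# Expected change = E[post - pre | group] = (slope - 1) * (group mean - mu).
expected_change <- c(low  = (slope - 1) * (mean_low  - mu),
                     high = (slope - 1) * (mean_high - mu))

round(c(mean_low = mean_low, mean_high = mean_high), 1)
round(expected_change, 2)
```

The expected change for the high-baseline group comes out at about -6.3, in line with the simulated value, and about +1.6 for the rest of the sample; the simulated +2.00 differs from that only by sampling variation.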

ggplot(dat2, aes(pre, post)) +
  geom_point(alpha = 0.3) +
  geom_abline(slope = 1, intercept = 0, linetype = 2, colour = "grey50") +
  geom_smooth(method = "lm", colour = "firebrick") +
  labs(x = "Baseline", y = "Follow-up")

The fitted line (true slope 0.7 by construction) is flatter than the dashed identity line; that flattening is RTM.

5. Concluding statement

Dichotomising the continuous predictor inflated its p-value from 0.0015 to 0.0058 and reduced its absolute t-statistic from 3.22 to 2.79. In the pre-post simulation with no intervention, the top-baseline quintile showed a mean change of -6.3 units, entirely attributable to regression to the mean.

If time permits, loop the power simulation at different effect sizes to make the power loss visceral.
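
One way to run that loop, as a minimal sketch in base R: the helper names sim_once and power_at, the seed, the number of replicates, and the grid of slopes are all assumptions chosen for illustration. Each replicate simulates one dataset and records the p-value from the continuous and the median-split analysis.

```r
# Hypothetical helper: simulate one dataset, return both p-values.
sim_once <- function(n, slope) {
  x <- rnorm(n)
  y <- slope * x + rnorm(n)
  x_bin <- x > median(x)                                   # median split
  p_cont <- summary(lm(y ~ x))$coefficients[2, 4]          # Pr(>|t|) for x
  p_bin  <- summary(lm(y ~ x_bin))$coefficients[2, 4]      # Pr(>|t|) for split
  c(continuous = p_cont, median_split = p_bin)
}

# Hypothetical helper: estimated power = share of replicates with p < alpha.
power_at <- function(slope, n = 200, reps = 1000, alpha = 0.05) {
  p <- replicate(reps, sim_once(n, slope))   # 2 x reps matrix of p-values
  rowMeans(p < alpha)
}

set.seed(2024)                               # arbitrary seed for this sketch
slopes <- c(0.1, 0.2, 0.3)
power_tab <- t(sapply(slopes, power_at))     # one row per slope
rownames(power_tab) <- paste0("slope=", slopes)
power_tab
```

Because this draws new random numbers, run it after the main lab code so that the outputs shown above remain reproducible. The gap between the two columns should be widest at intermediate effect sizes, where the continuous analysis is well powered and the split analysis is not.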

Common pitfalls

Dichotomising to avoid thinking about the slope.
Comparing pre-post within a high-baseline subgroup and calling the decline an effect.
Using a paired t-test on change in a selected subgroup.

Session info

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] broom_1.0.12    lubridate_1.9.5 forcats_1.0.1   stringr_1.6.0  
 [5] dplyr_1.2.1     purrr_1.2.2     readr_2.2.0     tidyr_1.3.2    
 [9] tibble_3.3.1    ggplot2_4.0.3   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] Matrix_1.7-0       gtable_0.3.6       jsonlite_2.0.0     compiler_4.4.1    
 [5] tidyselect_1.2.1   splines_4.4.1      scales_1.4.0       yaml_2.3.12       
 [9] fastmap_1.2.0      lattice_0.22-6     R6_2.6.1           labeling_0.4.3    
[13] generics_0.1.4     knitr_1.51         backports_1.5.1    htmlwidgets_1.6.4 
[17] pillar_1.11.1      RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.2.0       
[21] utf8_1.2.6         stringi_1.8.7      xfun_0.57          S7_0.2.2          
[25] otel_0.2.0         timechange_0.4.0   cli_3.6.6          mgcv_1.9-1        
[29] withr_3.0.2        magrittr_2.0.5     digest_0.6.39      grid_4.4.1        
[33] hms_1.1.4          nlme_3.1-164       lifecycle_1.0.5    vctrs_0.7.3       
[37] evaluate_1.0.5     glue_1.8.1         farver_2.1.2       rmarkdown_2.31    
[41] tools_4.4.1        pkgconfig_2.0.3    htmltools_0.5.9
