library(tidyverse)
set.seed(42)
theme_set(theme_minimal(base_size = 12))
Week 4, Session 5: Pre-registration and statistical analysis plans
Course 3
Workflow lab: Goal → Approach → Execution → Check → Report.
Learning objectives
- Distinguish a pre-registration from a statistical analysis plan.
- Draft the core sections of each for a realistic study.
- Recognise the forms of post-hoc flexibility that pre-registration is designed to prevent.
Prerequisites
Course 3 to date.
Background
A pre-registration is a timestamped public record of a study’s hypotheses, design, and primary analysis. A statistical analysis plan (SAP) is a more detailed, often confidential companion document that specifies exactly how the data will be handled — variable definitions, handling of missing data, subgroup analyses, sensitivity analyses, and reporting conventions. Between them, these two documents make the difference between “we planned this” and “we planned this, we can prove it, and here is the file to show when it was signed.”
Flexibility during analysis, often called researcher degrees of freedom, is not fraud. It is usually well-meaning curiosity. But aggregated across a field it inflates the share of spurious findings in the literature, and for any one study it raises the chance of a result that will not replicate. A pre-registration does not forbid curiosity. It separates confirmatory analyses (specified in advance) from exploratory ones, and requires each to be labelled when reported.
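To make the cost of one common degree of freedom concrete, the sketch below simulates optional stopping: a trial with no true effect is analysed either once at the planned sample size or after every batch of participants, stopping at the first p < 0.05. The sample size, look schedule, and number of simulations are assumptions chosen for illustration, not values from Trial X.

```r
# Simulation under the null: both arms are drawn from the same distribution,
# so every "significant" result is a false positive.
set.seed(42)

n_max  <- 100                      # planned participants per arm (assumed)
looks  <- seq(20, n_max, by = 20)  # interim looks at 20, 40, ..., 100 per arm
n_sims <- 2000                     # number of simulated trials

one_trial <- function() {
  control   <- rnorm(n_max)
  treatment <- rnorm(n_max)
  # p-value of a two-sample t-test at each look, using the data so far
  p_at_look <- vapply(
    looks,
    function(n) t.test(control[1:n], treatment[1:n])$p.value,
    numeric(1)
  )
  c(
    fixed   = p_at_look[length(looks)] < 0.05,  # one test at the planned n
    peeking = any(p_at_look < 0.05)             # stop at the first p < 0.05
  )
}

results <- replicate(n_sims, one_trial())
false_positive_rate <- rowMeans(results)
print(round(false_positive_rate, 3))
```

The fixed analysis should sit near the nominal 5%, while the peeking analysis rejects noticeably more often, even though no individual test was run incorrectly. Pre-registering the stopping rule removes this choice from the analysis stage.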
Setup
1. Goal
Produce a minimal-but-complete pre-registration template for a fictional two-arm randomised trial, plus a SAP skeleton.
2. Approach
A pre-registration has four mandatory elements:
- Question and hypotheses — exact form of H₀ and H₁ for each primary outcome.
- Design — eligibility, allocation, blinding, intervention, follow-up.
- Primary analysis — the single specification whose p-value will be quoted as the headline result.
- Sample size justification — the calculation behind the target N.
A SAP adds:
- Secondary analyses — ordered and pre-specified.
- Handling of missing data — mechanism assumed, method used.
- Subgroup analyses — pre-specified, limited in number.
- Sensitivity analyses — especially for primary outcomes.
3. Execution — template
# Pre-registration — Trial X
Protocol version 1.0 | Date 2026-04-18 | PI [NAME]
## 1. Research question
Primary: Does [intervention] reduce [outcome] relative to [control]
in [population] over [time frame]?
## 2. Hypotheses
H0: difference in [outcome] = 0.
H1: difference in [outcome] != 0. Two-sided, alpha = 0.05.
## 3. Design
Two-arm, parallel-group, double-blind, placebo-controlled trial.
1:1 allocation, block randomisation (block size 4), stratified by site.
## 4. Primary analysis
Intention-to-treat (ITT). Linear regression of outcome at follow-up on arm, adjusted for
baseline value and stratification factors. Primary estimand:
adjusted mean difference with 95% CI.
## 5. Sample size
Target n = 200 per arm; power >= 0.80 for d = 0.3, two-sided alpha = 0.05.
## 6. Data handling
- Missing data: multiple imputation under MAR (m = 20).
- Outliers: retained in primary analysis.
- Adherence: per-protocol analysis as a sensitivity analysis.
4. Check
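Before working through the checklist, one practical test of whether the design and primary-analysis sections of the template are specific enough is to code them against simulated data. The sketch below does that in base R. The number of sites, the per-site sample size, the simulated effect, and all variable names are assumptions made for this illustration. Only the allocation scheme (1:1, blocks of 4, stratified by site) and the model (outcome on arm, adjusted for baseline and stratification factors) come from the template.

```r
# Sketch only: simulated data stand in for Trial X.
set.seed(42)

n_sites    <- 4    # assumed number of sites
n_per_site <- 100  # assumed; a multiple of the block size
block_size <- 4    # from the template

# Permuted-block randomisation for one site: each block of 4 holds
# 2 control and 2 treatment allocations in random order.
allocate_site <- function(n) {
  n_blocks <- ceiling(n / block_size)
  blocks <- replicate(
    n_blocks,
    sample(rep(c("control", "treatment"), block_size / 2))
  )
  as.vector(blocks)[seq_len(n)]
}

trial <- data.frame(
  site = factor(rep(seq_len(n_sites), each = n_per_site)),
  arm  = factor(
    unlist(lapply(rep(n_per_site, n_sites), allocate_site)),
    levels = c("control", "treatment")
  )
)

# Simulated baseline and follow-up values; the true effect of -0.3 SD
# is an assumption chosen for this example.
trial$baseline <- rnorm(nrow(trial))
trial$outcome  <- 0.5 * trial$baseline -
  0.3 * (trial$arm == "treatment") +
  rnorm(nrow(trial))

# Primary model: everyone is analysed in the arm they were allocated to (ITT).
fit <- lm(outcome ~ arm + baseline + site, data = trial)

estimate <- coef(fit)["armtreatment"]
ci       <- confint(fit, "armtreatment", level = 0.95)
print(round(c(estimate = unname(estimate), lower = ci[1], upper = ci[2]), 3))
```

If any line of this script requires a decision that the pre-registration text does not settle (which covariates, which reference arm, which CI level), that is a gap to close before posting.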
Three questions to ask of any pre-registration before posting:
- Could a reader reconstruct the primary analysis from the text without contacting you?
- Are exploratory analyses labelled as such?
- Is the target N justified by a calculation a reader can redo?
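The third question can be tried on the template's sample-size section directly. The sketch below uses `power.t.test()` from base R's stats package and treats the primary comparison as a simple two-sample t-test on a standardised effect. That is a simplifying assumption: the pre-registered model also adjusts for baseline, which typically increases precision, so the t-test calculation is on the conservative side.

```r
# Sample size per arm for a two-sided, two-sample t-test with
# standardised effect d = 0.3, alpha = 0.05, power = 0.80.
needed <- power.t.test(
  delta = 0.3, sd = 1, sig.level = 0.05, power = 0.80,
  type = "two.sample", alternative = "two.sided"
)
n_required <- ceiling(needed$n)

# Power achieved at the template's stated target of 200 per arm.
achieved <- power.t.test(
  n = 200, delta = 0.3, sd = 1, sig.level = 0.05,
  type = "two.sample", alternative = "two.sided"
)

print(n_required)
print(round(achieved$power, 3))
```

Under these assumptions the required n comes out somewhat below 200 per arm, so the stated target carries a margin and achieves power above 0.80. A complete pre-registration would say what that margin is for, for example an allowance for expected attrition, so that a reader redoing the calculation arrives at the same target.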
5. Report
A pre-registration for Trial X was deposited on the OSF on [date] (DOI [doi]). The primary analysis, statistical model, and sample-size justification are specified therein. Any deviation from the pre-registered plan is reported in the Deviations section of the manuscript with its rationale.
Emphasise the “deviations” norm: pre-registration does not forbid change; it requires that change be disclosed.
Common pitfalls
- Treating pre-registration as a checkbox without a concrete analysis plan.
- Pre-registering two primary outcomes and quoting the winner.
- Failing to report deviations from plan.
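The second pitfall can be quantified with a short simulation. The sketch below generates trials with no true effect and two correlated outcomes, then compares three reporting strategies: a single pre-specified primary outcome, quoting whichever outcome has the smaller p-value, and testing both with a Bonferroni correction. The between-outcome correlation of 0.5 and the other settings are assumptions for illustration.

```r
# Simulation under the null with two correlated outcomes per participant.
set.seed(42)

n_per_arm <- 200   # matches the template's target
rho       <- 0.5   # assumed correlation between the two outcomes
n_sims    <- 2000  # number of simulated trials

# One arm's data: a shared component induces correlation rho between y1 and y2.
make_arm <- function() {
  shared <- rnorm(n_per_arm)
  cbind(
    y1 = sqrt(rho) * shared + sqrt(1 - rho) * rnorm(n_per_arm),
    y2 = sqrt(rho) * shared + sqrt(1 - rho) * rnorm(n_per_arm)
  )
}

one_trial <- function() {
  control   <- make_arm()
  treatment <- make_arm()
  p <- c(
    t.test(control[, "y1"], treatment[, "y1"])$p.value,
    t.test(control[, "y2"], treatment[, "y2"])$p.value
  )
  c(
    single_primary = p[1] < 0.05,        # one outcome fixed in advance
    quote_winner   = min(p) < 0.05,      # report whichever outcome "worked"
    bonferroni     = min(p) < 0.05 / 2   # both primary, alpha split in two
  )
}

rates <- rowMeans(replicate(n_sims, one_trial()))
print(round(rates, 3))
```

Quoting the winner should push the false-positive rate clearly above the nominal 5%, while the single-primary and Bonferroni strategies stay close to it. Two primary outcomes are defensible only if the multiplicity adjustment is itself pre-registered.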
Further reading
- Nosek BA, Ebersole CR, DeHaven AC, Mellor DT (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, 115(11), 2600-2606.
- AMRC / NIHR Statistical Analysis Plan templates.
Session info
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.1
[5] purrr_1.2.2 readr_2.2.0 tidyr_1.3.2 tibble_3.3.1
[9] ggplot2_4.0.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.4.1 tidyselect_1.2.1
[5] scales_1.4.0 yaml_2.3.12 fastmap_1.2.0 R6_2.6.1
[9] generics_0.1.4 knitr_1.51 htmlwidgets_1.6.4 pillar_1.11.1
[13] RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.2.0 stringi_1.8.7
[17] xfun_0.57 S7_0.2.2 otel_0.2.0 timechange_0.4.0
[21] cli_3.6.6 withr_3.0.2 magrittr_2.0.5 digest_0.6.39
[25] grid_4.4.1 hms_1.1.4 lifecycle_1.0.5 vctrs_0.7.3
[29] evaluate_1.0.5 glue_1.8.1 farver_2.1.2 rmarkdown_2.31
[33] tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.9