Week 4, Session 5 — Pre-registration and statistical analysis plans

Course 3

Author

R. Heller

Note

Workflow lab: Goal → Approach → Execution → Check → Report.

Learning objectives

Distinguish a pre-registration from a statistical analysis plan.
Draft the core sections of each for a realistic study.
Recognise the forms of post-hoc flexibility that pre-registration is designed to prevent.

Prerequisites

Course 3 to date.

Background

A pre-registration is a timestamped public record of a study’s hypotheses, design, and primary analysis. A statistical analysis plan (SAP) is a more detailed, often confidential companion document that specifies exactly how the data will be handled — variable definitions, handling of missing data, subgroup analyses, sensitivity analyses, and reporting conventions. Between them, these two documents make the difference between “we planned this” and “we planned this, we can prove it, and here is the file to show when it was signed.”

Flexibility during analysis (researcher degrees of freedom) is not fraud. It is usually well-meaning curiosity. But aggregated across a field it inflates the rate of spurious findings, and within any one study it yields results that are less likely to replicate. A pre-registration does not forbid curiosity. It separates confirmatory analyses (specified in advance) from exploratory ones, and requires each to be labelled when reported.

Setup

library(tidyverse)
set.seed(42)
theme_set(theme_minimal(base_size = 12))

1. Goal

Produce a minimal-but-complete pre-registration template for a fictional two-arm randomised trial, plus a SAP skeleton.

2. Approach

A pre-registration has four mandatory elements:

Question and hypotheses — exact form of H₀ and H₁ for each primary outcome.
Design — eligibility, allocation, blinding, intervention, follow-up.
Primary analysis — the single specification whose p-value will be quoted as the headline result.
Sample size justification — the calculation behind the target N.
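One way to make the "single specification" concrete is to write it down as a function before any outcome data exist. The following is a minimal sketch, assuming a continuous outcome and the kind of baseline-adjusted linear model used in the Trial X template. The function name `primary_analysis()`, the column names (`arm`, `baseline`, `site`, `outcome`), and the simulated data are all hypothetical.

```r
# Hypothetical sketch: freeze the primary analysis as a function before unblinding.
# Column names (arm, baseline, site, outcome) are illustrative assumptions.
primary_analysis <- function(dat) {
  fit <- lm(outcome ~ arm + baseline + site, data = dat)
  est <- coef(summary(fit))["armtreatment", ]
  ci  <- confint(fit, "armtreatment", level = 0.95)
  c(estimate = unname(est["Estimate"]),
    lower    = unname(ci[1, 1]),
    upper    = unname(ci[1, 2]),
    p_value  = unname(est["Pr(>|t|)"]))
}

# Simulated data only, to confirm the function runs end to end.
set.seed(42)
n <- 400
sim <- data.frame(
  arm      = factor(rep(c("control", "treatment"), each = n / 2),
                    levels = c("control", "treatment")),
  site     = factor(sample(c("A", "B", "C", "D"), n, replace = TRUE)),
  baseline = rnorm(n)
)
sim$outcome <- 0.5 * sim$baseline - 0.3 * (sim$arm == "treatment") + rnorm(n)

result <- primary_analysis(sim)
print(round(result, 3))
```

Running the function on simulated data before the trial starts is a cheap way to find ambiguities in the written specification, such as an unstated reference level for the arm variable.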

A SAP adds:

Secondary analyses — ordered and pre-specified.
Handling of missing data — mechanism assumed, method used.
Subgroup analyses — pre-specified, limited in number.
Sensitivity analyses — especially for primary outcomes.
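The list above says secondary analyses should be ordered but not what the order is used for. One common use, assumed here for illustration only, is fixed-sequence testing: outcomes are tested in the pre-specified order at the full alpha, and confirmatory claims stop at the first non-significant result. The p-values and outcome names below are hypothetical.

```r
# Hypothetical p-values for secondary outcomes, listed in the pre-specified order.
p_ordered <- c(secondary_1 = 0.012, secondary_2 = 0.034,
               secondary_3 = 0.210, secondary_4 = 0.004)
alpha <- 0.05

# Fixed-sequence rule: a test counts as confirmatory only if it and every
# earlier test in the sequence are significant at alpha.
confirmatory <- cumprod(p_ordered < alpha) == 1
print(data.frame(p = p_ordered, confirmatory = confirmatory))
```

Under this rule the fourth outcome is not confirmatory despite its small p-value, because the third outcome in the sequence failed. That is the sense in which the order has to be fixed in advance.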

3. Execution — template

# Pre-registration — Trial X

Protocol version 1.0 | Date 2026-04-18 | PI [NAME]

## 1. Research question
Primary: Does [intervention] reduce [outcome] relative to [control]
in [population] over [time frame]?

## 2. Hypotheses
H0: mean difference in [outcome] between arms = 0.
H1: mean difference in [outcome] between arms != 0. Two-sided, alpha = 0.05.

## 3. Design
Two-arm, parallel-group, double-blind, placebo-controlled trial.
1:1 allocation, block randomisation (block size 4), stratified by site.

## 4. Primary analysis
ITT: all randomised participants analysed in their allocated arm.
Linear regression of outcome at follow-up on arm, adjusted for
baseline value and site (the stratification factor). Primary estimate:
adjusted mean difference with 95% CI.

## 5. Sample size
Target n = 200 per arm; power of about 0.85 for d = 0.3, two-sided
alpha = 0.05 (176 per arm would give power = 0.80).

## 6. Data handling
- Missing data: multiple imputation under MAR (m = 20 imputed datasets).
- Outliers: retained in primary analysis.
- Adherence: per-protocol analysis as a sensitivity analysis.
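The allocation scheme in section 3 of the template (1:1, block size 4, stratified by site) can be sketched in a few lines of base R. This is an illustration under stated assumptions, not a production randomisation system: the helper `make_allocation()`, the four site labels, and the choice of 25 blocks per site (which happens to give 200 per arm) are hypothetical.

```r
# Hypothetical sketch of a 1:1 permuted-block allocation list, stratified by site.
# Site labels and the number of blocks per site are illustrative assumptions.
make_allocation <- function(sites, blocks_per_site, block_size = 4) {
  stopifnot(block_size %% 2 == 0)
  # One block: equal numbers of each arm in random order.
  one_block <- function() {
    sample(rep(c("control", "treatment"), each = block_size / 2))
  }
  per_site <- lapply(sites, function(s) {
    arms <- unlist(replicate(blocks_per_site, one_block(), simplify = FALSE))
    data.frame(site     = s,
               block    = rep(seq_len(blocks_per_site), each = block_size),
               sequence = seq_along(arms),
               arm      = arms)
  })
  do.call(rbind, per_site)
}

set.seed(42)
alloc <- make_allocation(sites = c("A", "B", "C", "D"), blocks_per_site = 25)

# Each site should end up with equal numbers in each arm.
print(table(alloc$site, alloc$arm))
```

Because every block is balanced, the arms stay balanced within each site no matter where recruitment stops at a block boundary.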

4. Check

Three questions to ask of any pre-registration before posting:

Could a reader reconstruct the primary analysis from the text without contacting you?
Are exploratory analyses labelled as such?
Is the target N justified by a calculation a reader can redo?
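The third question can be answered by rerunning the calculation from the stated inputs. The sketch below uses the Trial X inputs (d = 0.3, two-sided alpha = 0.05, 200 per arm) and assumes a two-sample t-test with unit standard deviation as an approximation to the baseline-adjusted model. That approximation is an assumption made here, not something the template specifies.

```r
# Per-arm n for 80% power at d = 0.3 (sd = 1), two-sided alpha = 0.05,
# using a two-sample t-test as an approximation to the adjusted model.
needed <- power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.80,
                       type = "two.sample", alternative = "two.sided")
n_required <- ceiling(needed$n)

# Power achieved at the template's target of 200 per arm.
achieved <- power.t.test(n = 200, delta = 0.3, sd = 1, sig.level = 0.05)
print(c(n_required = n_required, power_at_200 = round(achieved$power, 3)))
```

On these assumptions roughly 176 per arm reaches power 0.80, so a target of 200 per arm leaves a margin. A pre-registration should say what any such margin is for, for example expected attrition, so that the reader's recalculation matches the stated target.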

5. Report

A pre-registration for Trial X was deposited on the OSF on [date] (DOI [doi]). The primary analysis, statistical model, and sample-size justification are specified therein. Any deviation from the pre-registered plan is reported in the Deviations section of the manuscript with its rationale.

Emphasise the “deviations” norm: pre-registration does not forbid change; it requires that change be disclosed.

Common pitfalls

Treating pre-registration as a checkbox without a concrete analysis plan.
Pre-registering two primary outcomes and quoting the winner.
Failing to report deviations from plan.
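The second pitfall can be made concrete with a small simulation. The sketch below assumes two independent outcomes, no true effect on either, and 50 participants per arm. All of these are illustrative choices. For independent outcomes the analytic false-positive rate when quoting the smaller p-value is 1 - 0.95^2 = 0.0975, nearly double the nominal 0.05.

```r
# Simulated null trials with two independent outcomes and no true effect.
# "Quoting the winner" means reporting whichever outcome has the smaller p-value.
set.seed(42)
n_sims <- 5000
n_per_arm <- 50

one_trial <- function() {
  # Both arms are drawn from the same distribution, so H0 is true for each outcome.
  p1 <- t.test(rnorm(n_per_arm), rnorm(n_per_arm))$p.value
  p2 <- t.test(rnorm(n_per_arm), rnorm(n_per_arm))$p.value
  c(single = p1, winner = min(p1, p2))
}

p_values <- replicate(n_sims, one_trial())
false_positive_rate <- rowMeans(p_values < 0.05)
print(round(false_positive_rate, 3))
```

Correlated outcomes would inflate the rate less than independent ones, but the direction of the problem is the same, which is why the template asks for a single primary specification.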

Session info

sessionInfo()

R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_Germany.utf8  LC_CTYPE=English_Germany.utf8   
[3] LC_MONETARY=English_Germany.utf8 LC_NUMERIC=C                    
[5] LC_TIME=English_Germany.utf8    

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.5 forcats_1.0.1   stringr_1.6.0   dplyr_1.2.1    
 [5] purrr_1.2.2     readr_2.2.0     tidyr_1.3.2     tibble_3.3.1   
 [9] ggplot2_4.0.3   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2     tidyselect_1.2.1  
 [5] scales_1.4.0       yaml_2.3.12        fastmap_1.2.0      R6_2.6.1          
 [9] generics_0.1.4     knitr_1.51         htmlwidgets_1.6.4  pillar_1.11.1     
[13] RColorBrewer_1.1-3 tzdb_0.5.0         rlang_1.2.0        stringi_1.8.7     
[17] xfun_0.57          S7_0.2.2           otel_0.2.0         timechange_0.4.0  
[21] cli_3.6.6          withr_3.0.2        magrittr_2.0.4     digest_0.6.39     
[25] grid_4.5.2         hms_1.1.4          lifecycle_1.0.5    vctrs_0.7.3       
[29] evaluate_1.0.5     glue_1.8.1         farver_2.1.2       rmarkdown_2.31    
[33] tools_4.5.2        pkgconfig_2.0.3    htmltools_0.5.9
