Course 1 · Week 2 — Describing data, probability, distributions

Cheatsheet — biostats_courses

Author

R. Heller

Descriptive statistics

Situation	Summary
Roughly symmetric, no outliers	mean ± SD
Skewed or has outliers	median, IQR, range
Categorical	counts and proportions
Grouped	Table 1 via `gtsummary::tbl_summary()`

table(df$x, df$y)                           # counts: rows = x, columns = y
prop.table(table(df$x, df$y), margin = 1)   # row proportions (each row sums to 1)
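For the numeric rows of the table above, a minimal base-R sketch. The vector `x` is made-up illustrative data, not course data; it shows how a single outlier drags the mean away from the median.

```r
# Hypothetical sample with one outlier (40); values are illustrative only
x <- c(2, 3, 3, 4, 5, 6, 40)

# Roughly symmetric data: report mean and SD
mean_x <- mean(x)      # 9, pulled upward by the outlier
sd_x   <- sd(x)

# Skewed data or outliers: report median, IQR, range
median_x <- median(x)  # 4, unaffected by the outlier
iqr_x    <- IQR(x)     # 2.5 with R's default quantile type
range_x  <- range(x)   # 2 and 40
```

Here the mean (9) is more than twice the median (4), which is the cue to switch to median and IQR.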

\(P(A \mid B) = \dfrac{P(B \mid A) P(A)}{P(B)}\)

In odds form: posterior odds = prior odds × likelihood ratio.
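The odds form can be sketched in a few lines of R. The helper names `prob_to_odds()` and `odds_to_prob()` and the input values (prior 0.10, likelihood ratio 9) are assumptions chosen for illustration.

```r
# Hypothetical helpers for converting between probability and odds
prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(o) o / (1 + o)

# Illustrative inputs (assumed, not from the course)
prior_prob <- 0.10   # pre-test probability
lr_pos     <- 9      # likelihood ratio of a positive result

# posterior odds = prior odds x likelihood ratio
posterior_odds <- prob_to_odds(prior_prob) * lr_pos   # (1/9) * 9 = 1
posterior_prob <- odds_to_prob(posterior_odds)        # 1 / (1 + 1) = 0.5
```

A likelihood ratio of 9 moves a 10% prior to a 50% posterior: strong evidence, but still a coin flip.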

PPV collapses at low prevalence: with 95% sensitivity and 95% specificity at 1% prevalence, PPV is only about 16%, so most positives are false positives.
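A short sketch of how PPV depends on prevalence, applying Bayes' rule directly. It assumes "95%" means 95% sensitivity and 95% specificity; the function name `ppv()` and the prevalence grid are illustrative choices.

```r
# PPV = P(disease | positive test), from Bayes' rule
ppv <- function(prevalence, sensitivity, specificity) {
  true_pos  <- sensitivity * prevalence               # P(positive and diseased)
  false_pos <- (1 - specificity) * (1 - prevalence)   # P(positive and healthy)
  true_pos / (true_pos + false_pos)
}

# Assumed prevalence grid, for illustration only
prevalence <- c(0.001, 0.01, 0.1, 0.5)
ppv_values <- ppv(prevalence, sensitivity = 0.95, specificity = 0.95)
round(ppv_values, 3)
```

Under these assumptions PPV only reaches 95% when prevalence is 50%; at 1% prevalence it is about 0.16.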

Binomial → Poisson as n grows and p shrinks with np = λ.
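The limit can be checked numerically. This sketch holds np = λ fixed and compares the two probability mass functions as n grows; λ = 2, the range of k, and the helper `max_gap()` are assumptions for illustration.

```r
lambda <- 2      # assumed rate, fixed while n grows
k      <- 0:10   # counts at which the two pmfs are compared

# Largest absolute difference between Binomial(n, lambda / n) and Poisson(lambda)
max_gap <- function(n) {
  max(abs(dbinom(k, size = n, prob = lambda / n) - dpois(k, lambda)))
}

n_values <- c(10, 100, 1000, 10000)
gaps <- sapply(n_values, max_gap)
signif(gaps, 2)   # shrinks toward 0 as n increases
```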

Distribution	Use
Normal(μ, σ)	sample means and sums of many independent effects, via the CLT
Student t(ν)	inference about means, small n
χ²(ν)	variance, counts, GoF
F(ν₁, ν₂)	variance ratios, ANOVA
Exponential(λ)	time between events, memoryless
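"Memoryless" in the last row means P(T > s + t | T > s) = P(T > t). A minimal numeric check with `pexp()`; the rate and the two time values are arbitrary assumptions.

```r
rate    <- 0.5   # assumed event rate
elapsed <- 2     # time already waited (s)
extra   <- 3     # additional waiting time (t)

# Survival function P(T > x) for the exponential distribution
surv <- function(x) pexp(x, rate = rate, lower.tail = FALSE)

conditional   <- surv(elapsed + extra) / surv(elapsed)  # P(T > s + t | T > s)
unconditional <- surv(extra)                            # P(T > t)

all.equal(conditional, unconditional)
```

Both sides equal exp(-rate × extra), so time already waited carries no information about the remaining wait.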

R uses the d / p / q / r prefix: density, CDF, quantile, random.
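The four prefixes, shown on the standard normal as a quick illustration (the same pattern applies to `t`, `chisq`, `f`, `exp`, `binom`, `pois`). The variable names and the seed are arbitrary.

```r
dens  <- dnorm(0)        # d: density at 0, equal to 1 / sqrt(2 * pi)
cum   <- pnorm(1.96)     # p: CDF, P(Z <= 1.96), about 0.975
quant <- qnorm(0.975)    # q: quantile (inverse CDF), about 1.96

set.seed(1)              # arbitrary seed so the draws are reproducible
draws <- rnorm(5)        # r: five random draws from Normal(0, 1)
```

`p` and `q` are inverses of each other: `pnorm(qnorm(0.975))` returns 0.975.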

library(ggplot2)
ggplot(df, aes(sample = x)) + stat_qq() + stat_qq_line()

S-shape → tails heavier (or, if the S is reversed, lighter) than normal. U-shape → skew. Plot before the test.