Course 4 — #courses
Note
Workflow labs use the variant template: Goal → Approach → Execution → Check → Report.
limma-voom workflow is preferable.GLMs from Course 2; multiple testing from Course 1.
Bulk RNA-sequencing yields integer counts per gene per sample. Raw counts are heteroscedastic — variance grows with the mean — and the distribution has a long tail. DESeq2 and edgeR model counts as negative-binomial, with a gene-specific dispersion shrunk toward a trend estimated across the dataset. This shrinkage is the statistical heart of modern differential-expression analysis in low-replicate studies.
limma-voom takes a different route: model the log-counts with precision weights that reflect the count-level mean–variance trend. The output is a familiar linear-model workflow with all of its tools (contrasts, interactions, F-tests). For large or complex designs, limma-voom is often the most flexible.
DESeq2 and edgeR are Bioconductor packages, not CRAN. We show their call pattern with #| eval: false and walk through a conceptually parallel simulation with base R GLMs so the ideas run end-to-end in any environment.
Simulate a small RNA-seq count matrix with two conditions and demonstrate the DESeq2/edgeR pipeline pattern alongside a base-R negative-binomial GLM that actually runs.
Simulate 500 genes × 10 samples (5 per group). Ten genes are truly differentially expressed.
n_gene <- 500; n_samp <- 10
group <- rep(c("A", "B"), each = 5)
mu <- 2 * runif(n_gene, 1, 10) # baseline mean per gene
lfc <- c(rep(log(3), 10), rep(0, n_gene - 10))
counts <- sapply(seq_len(n_samp), function(j) {
rate <- mu * exp(ifelse(group[j] == "B", lfc, 0))
rnbinom(n_gene, mu = rate, size = 5)
})
rownames(counts) <- paste0("g", seq_len(n_gene))
colnames(counts) <- paste0("s", seq_len(n_samp))
tibble(mean = rowMeans(counts), var = apply(counts, 1, var)) |>
ggplot(aes(mean, var)) + geom_point(alpha = 0.3) +
scale_x_log10() + scale_y_log10() +
geom_abline(slope = 1, intercept = 0, colour = "grey50") +
labs(title = "mean–variance (log–log)")A DESeq2 call pattern (not run).
An edgeR call pattern (not run).
A per-gene negative-binomial GLM that runs.
pvals <- numeric(n_gene)
logfcs <- numeric(n_gene)
for (g in seq_len(n_gene)) {
y <- counts[g, ]
fit <- try(glm.nb(y ~ group), silent = TRUE)
if (inherits(fit, "try-error")) { pvals[g] <- NA; next }
s <- summary(fit)
pvals[g] <- s$coefficients[2, 4]
logfcs[g] <- s$coefficients[2, 1]
}
padj <- p.adjust(pvals, method = "BH")Volcano plot with the truly DE genes flagged.
On a simulated 500-gene, 10-sample RNA-seq dataset with 10 truly differentially expressed genes (log fold change ≈ 1.1), per-gene negative-binomial GLMs followed by Benjamini–Hochberg adjustment recovered 90% of the true DE genes at FDR < 0.05 and yielded a false-discovery rate among null genes of 91.8%.
DESeq2 and edgeR would improve power by shrinking gene-wise dispersion across the experiment; the per-gene fit shown here is a transparent but low-power version of the same idea.
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MASS_7.3-60.2 lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0
[5] dplyr_1.2.1 purrr_1.2.2 readr_2.2.0 tidyr_1.3.2
[9] tibble_3.3.1 ggplot2_4.0.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.4.1 tidyselect_1.2.1
[5] scales_1.4.0 yaml_2.3.12 fastmap_1.2.0 R6_2.6.1
[9] labeling_0.4.3 generics_0.1.4 knitr_1.51 htmlwidgets_1.6.4
[13] pillar_1.11.1 RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.2.0
[17] stringi_1.8.7 xfun_0.57 S7_0.2.2 otel_0.2.0
[21] timechange_0.4.0 cli_3.6.6 withr_3.0.2 magrittr_2.0.5
[25] digest_0.6.39 grid_4.4.1 hms_1.1.4 lifecycle_1.0.5
[29] vctrs_0.7.3 evaluate_1.0.5 glue_1.8.1 farver_2.1.2
[33] rmarkdown_2.31 tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.9