Course 4 — #courses
Note
Workflow labs use the variant template: Goal → Approach → Execution → Check → Report.
uwot UMAP.UMAP from Week 1 Session 5; RNA-seq from Session 1.
Single-cell RNA-seq measures gene expression in thousands of cells simultaneously, producing a sparse, noisy, zero-inflated matrix. Clustering and cell-type annotation are exploratory steps that require a chain of careful preprocessing: drop empty droplets, normalise library size, select variable genes, reduce to principal components, build a nearest-neighbour graph, cluster on the graph, embed with UMAP.
Seurat is the dominant R toolkit for this pipeline. Every step has alternatives — SCTransform vs NormalizeData, Louvain vs Leiden clustering, Harmony for batch correction — and each choice changes the resulting partition. The good reflex is to run the same pipeline twice with a perturbed parameter and check that the biology survives.
Seurat is not on CRAN and its installation is non-trivial in a rendering environment. We sketch the pipeline with #| eval: false and run an analogous, tiny base-R + uwot pipeline on a simulated cell × gene matrix so the reasoning is visible in every render.
Walk through an scRNA-seq analysis conceptually with a small simulated cell × gene matrix, ending at a 2-D embedding with cluster labels.
300 cells, 500 genes, three cell types with partly overlapping expression programmes.
n_cell <- 300; n_gene <- 500
cell_type <- rep(c("T", "B", "myeloid"), each = n_cell / 3)
mu_base <- exp(rnorm(n_gene, 1, 1))
X <- matrix(0, n_cell, n_gene)
for (i in seq_len(n_cell)) {
program <- rep(1, n_gene)
if (cell_type[i] == "T") program[1:20] <- 5
if (cell_type[i] == "B") program[21:40] <- 5
if (cell_type[i] == "myeloid") program[41:60] <- 5
X[i, ] <- rnbinom(n_gene, mu = mu_base * program, size = 2)
}
tibble(lib = rowSums(X), ct = cell_type) |>
ggplot(aes(ct, lib)) + geom_boxplot() +
labs(x = NULL, y = "total counts per cell")Log-normalise, select variable genes, PCA, UMAP, cluster.
logX <- log1p(sweep(X, 1, rowSums(X), FUN = "/") * 1e4)
gene_var <- apply(logX, 2, var)
keep <- order(gene_var, decreasing = TRUE)[1:100]
pcs <- prcomp(logX[, keep])$x[, 1:10]
emb <- umap(pcs, n_neighbors = 20, min_dist = 0.1)
# A simple k-means cluster on the PCs as a stand-in.
km <- kmeans(pcs, centers = 3, nstart = 10)
tibble(x = emb[, 1], y = emb[, 2],
cluster = factor(km$cluster), truth = cell_type) |>
ggplot(aes(x, y, colour = cluster, shape = truth)) +
geom_point(alpha = 0.8, size = 1) +
labs(x = "UMAP1", y = "UMAP2")A Seurat call pattern (not run).
library(Seurat)
obj <- CreateSeuratObject(counts = t(X))
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj, nfeatures = 100)
obj <- ScaleData(obj)
obj <- RunPCA(obj, npcs = 10)
obj <- FindNeighbors(obj, dims = 1:10)
obj <- FindClusters(obj, resolution = 0.5)
obj <- RunUMAP(obj, dims = 1:10)
DimPlot(obj, label = TRUE)Contingency of cluster vs true cell type.
On a simulated 300-cell × 500-gene dataset with three cell types, a small pipeline (log normalisation, top-100 variable genes, 10 PCs, UMAP, k-means) recovered the three types with high agreement. The same reasoning — preprocess, reduce, graph, cluster, embed — underlies the Seurat pipeline sketched in the accompanying code.
On a real dataset the critical choices are the normalisation (SCTransform vs NormalizeData), the number of PCs (elbow or JackStraw), and the clustering resolution. Perturb each and confirm that the biological conclusions do not.
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] uwot_0.2.4 Matrix_1.7-0 lubridate_1.9.5 forcats_1.0.1
[5] stringr_1.6.0 dplyr_1.2.1 purrr_1.2.2 readr_2.2.0
[9] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.4.1 Rcpp_1.1.1-1.1
[5] tidyselect_1.2.1 FNN_1.1.4.1 scales_1.4.0 yaml_2.3.12
[9] fastmap_1.2.0 lattice_0.22-6 R6_2.6.1 labeling_0.4.3
[13] generics_0.1.4 knitr_1.51 htmlwidgets_1.6.4 pillar_1.11.1
[17] RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.2.0 stringi_1.8.7
[21] xfun_0.57 S7_0.2.2 otel_0.2.0 timechange_0.4.0
[25] cli_3.6.6 withr_3.0.2 magrittr_2.0.5 digest_0.6.39
[29] grid_4.4.1 hms_1.1.4 lifecycle_1.0.5 vctrs_0.7.3
[33] RSpectra_0.16-2 evaluate_1.0.5 glue_1.8.1 farver_2.1.2
[37] rmarkdown_2.31 tools_4.4.1 pkgconfig_2.0.3 htmltools_0.5.9