Finding Marker Genes
Introduction
Marker genes are genes differentially expressed in one cluster relative to others; they drive cell-type annotation, functional interpretation, and the connection between unsupervised clustering and known biology. Seurat’s FindMarkers and FindAllMarkers implement the standard differential-expression tests (Wilcoxon rank-sum, MAST hurdle, DESeq2) on log-normalised or SCT-transformed data, per cluster, against either all other cells or a specified comparison cluster.
Prerequisites
A working understanding of clustered scRNA-seq output, log-fold-change interpretation, and the difference between cluster-vs-all and pairwise cluster comparisons.
Theory
For each cluster, the marker test compares cells in that cluster to cells in all other clusters (or a specified subset) gene-by-gene. The output for each gene includes:
- avg_log2FC: average log2 fold change between the target cluster and the reference; positive values indicate higher expression in the target cluster.
- pct.1: fraction of cells in the target cluster expressing the gene.
- pct.2: fraction of cells in the reference expressing the gene.
- p_val_adj: adjusted \(p\)-value; Seurat applies a Bonferroni correction based on the total number of genes in the dataset.
Wilcoxon rank-sum is the default test in Seurat (fast, robust); MAST handles zero-inflation more rigorously via a hurdle model; DESeq2 applies the negative-binomial framework (slow but principled).
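The per-gene quantities above can be sketched in base R for a single cluster-versus-rest comparison. This is a minimal illustration, not Seurat's implementation: the toy matrix, the cluster labels, and the helper marker_stats are hypothetical, and the fold-change formula (pseudocount of 1 on back-transformed means) is a simplified stand-in for what Seurat computes, which differs in detail across versions. Seurat's documentation describes p_val_adj as a Bonferroni correction over all genes, so the sketch uses method = "bonferroni".

```r
# Hypothetical toy data: 4 genes x 12 cells, three clusters of 4 cells each.
set.seed(1)
counts <- matrix(rpois(4 * 12, lambda = 3), nrow = 4,
                 dimnames = list(paste0("g", 1:4), paste0("c", 1:12)))
counts[1, 1:4] <- counts[1, 1:4] + 8   # make g1 a marker of cluster "A"
expr <- log1p(counts)                  # stand-in for log-normalised data

cluster <- rep(c("A", "B", "C"), each = 4)
target <- cluster == "A"               # cluster A versus all other cells

# Per-gene statistics for one target-versus-reference comparison (hypothetical helper).
marker_stats <- function(x, in_target) {
  x1 <- x[in_target]
  x2 <- x[!in_target]
  data.frame(
    p_val = wilcox.test(x1, x2, exact = FALSE)$p.value,
    avg_log2FC = log2(mean(expm1(x1)) + 1) - log2(mean(expm1(x2)) + 1),
    pct.1 = mean(x1 > 0),              # fraction of target cells expressing the gene
    pct.2 = mean(x2 > 0)               # fraction of reference cells expressing the gene
  )
}

res <- do.call(rbind, lapply(rownames(expr), function(g) marker_stats(expr[g, ], target)))
rownames(res) <- rownames(expr)
res$p_val_adj <- p.adjust(res$p_val, method = "bonferroni")
print(res)
```

With only four cells in the target group the \(p\)-values are coarse; the point is the shape of the output, one row per gene with the same columns FindMarkers reports.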
Conventional marker criteria: \(\log_2 \mathrm{FC} > 0.5\), adjusted \(p < 0.05\), and pct.1 > 0.25 (the gene is expressed in at least a quarter of the cluster’s cells).
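As a small illustration of these criteria, the following base-R sketch filters a marker table. The data frame and its gene names are made up for the example; only the column names follow Seurat's output.

```r
# Hypothetical marker table with Seurat-style column names.
markers <- data.frame(
  gene = c("geneA", "geneB", "geneC", "geneD"),
  avg_log2FC = c(1.8, 0.3, 0.9, 2.1),
  pct.1 = c(0.80, 0.60, 0.10, 0.55),
  p_val_adj = c(1e-20, 1e-4, 1e-6, 0.20),
  stringsAsFactors = FALSE
)

# Apply the three conventional criteria jointly.
keep <- with(markers, avg_log2FC > 0.5 & p_val_adj < 0.05 & pct.1 > 0.25)
selected <- markers[keep, ]
print(selected$gene)
```

Only geneA survives: geneB fails the fold-change cutoff, geneC is expressed in too few target cells, and geneD is not significant after adjustment.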
Assumptions
Cells within a cluster are reasonably homogeneous; the reference set is a valid comparison; multiple testing is corrected across genes within each comparison (but not across the joint cluster set, which would be over-conservative).
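The difference between correcting within each comparison and correcting across the joint cluster set can be seen on a toy table. The \(p\)-values below are invented, and Bonferroni is used purely for illustration; any method accepted by p.adjust could be substituted.

```r
# Hypothetical raw p-values: 3 genes tested in each of 2 cluster comparisons.
de <- data.frame(
  cluster = rep(c("0", "1"), each = 3),
  gene = rep(c("geneA", "geneB", "geneC"), times = 2),
  p_val = c(0.001, 0.02, 0.30, 0.004, 0.01, 0.50),
  stringsAsFactors = FALSE
)

# Correct within each comparison (3 tests per cluster).
de$p_adj_within <- ave(de$p_val, de$cluster,
                       FUN = function(p) p.adjust(p, method = "bonferroni"))

# Correct across all cluster-gene pairs jointly (6 tests).
de$p_adj_joint <- p.adjust(de$p_val, method = "bonferroni")

print(de)
```

The joint correction is never smaller than the within-comparison one, and the gap grows with the number of clusters, which is why it tends to be over-conservative.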
R Implementation
library(Seurat)
set.seed(2026)
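# Simulate a 200-gene x 300-cell count matrix (rows = genes, columns = cells)
counts <- matrix(rpois(200 * 300, lambda = 2), nrow = 200, ncol = 300)
rownames(counts) <- paste0("g", 1:200)
colnames(counts) <- paste0("c", 1:300)
# Inject a marker pattern: raise genes g1-g10 by 10 counts in cells c1-c50
counts[1:10, 1:50] <- counts[1:10, 1:50] + 10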
so <- CreateSeuratObject(counts)
so <- NormalizeData(so, verbose = FALSE)
so <- FindVariableFeatures(so, nfeatures = 100, verbose = FALSE)
so <- ScaleData(so, verbose = FALSE)
so <- RunPCA(so, npcs = 10, verbose = FALSE)
so <- FindNeighbors(so, dims = 1:10, verbose = FALSE)
so <- FindClusters(so, resolution = 0.5, verbose = FALSE)
markers_all <- FindAllMarkers(so, only.pos = TRUE, min.pct = 0.25,
                              logfc.threshold = 0.25, verbose = FALSE)
head(markers_all[, c("cluster", "gene", "avg_log2FC", "p_val_adj", "pct.1")])
Output & Results
FindAllMarkers() returns a data frame with one row per gene per cluster that passes the filtering thresholds, including cluster identity, gene name, log2 fold change, adjusted \(p\)-value, and the fraction of cells expressing the gene in the target and reference groups (pct.1, pct.2). Setting only.pos = TRUE retains only up-regulated markers, which are typically the most interpretable for cell-type identification.
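A common next step is to pull the top few markers per cluster from this data frame. The base-R sketch below uses a hand-made table with the same column names; the genes and values are hypothetical, and ranking by avg_log2FC is one reasonable choice rather than the only one.

```r
# Hypothetical stand-in for a FindAllMarkers() result.
markers_all <- data.frame(
  cluster = factor(c("0", "0", "0", "1", "1", "1")),
  gene = c("geneA", "geneB", "geneC", "geneD", "geneE", "geneF"),
  avg_log2FC = c(2.0, 0.8, 1.5, 0.6, 3.1, 1.2),
  p_val_adj = c(1e-30, 1e-5, 1e-12, 1e-3, 1e-40, 1e-8),
  stringsAsFactors = FALSE
)

# Sort by cluster, then by decreasing fold change, and keep the top 2 per cluster.
ord <- markers_all[order(markers_all$cluster, -markers_all$avg_log2FC), ]
top2 <- do.call(rbind, lapply(split(ord, ord$cluster), head, n = 2))
rownames(top2) <- NULL
print(top2[, c("cluster", "gene", "avg_log2FC")])
```

Here the result keeps geneA and geneC for cluster 0, and geneE and geneF for cluster 1.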
Interpretation
A reporting sentence: “FindAllMarkers (Wilcoxon test, min.pct = 0.25, logfc.threshold = 0.25, only.pos = TRUE) identified 1,240 unique markers across 12 clusters; cluster 2’s top markers (CD8A, GZMB, PRF1) identify cytotoxic T cells, and cluster 5’s markers (MS4A1, CD79A) identify B cells. Annotation followed manual review against the Human Cell Atlas reference.” Always state the test, thresholds, and annotation source.
Practical Tips
- Use only.pos = TRUE to retain up-regulated markers; down-regulated markers are less interpretable in scRNA-seq because most genes go undetected (dropouts) in any given cluster.
- Set min.pct = 0.25 to skip genes detected in fewer than 25% of cells in both groups being compared; this is a quality gate that prevents spurious markers from low-detection genes.
- For publication-quality marker calls, use MAST (test.use = "MAST"); it handles zero-inflation more rigorously than Wilcoxon and gives more conservative \(p\)-values.
- Always annotate clusters using markers after clustering; never adjust clusters to fit pre-conceived markers, which is circular reasoning.
- Do not treat \(p\)-values from DE testing between clusters produced by the same clustering on the same data as confirmatory; this is a form of double-dipping that inflates significance. Use post-hoc validation or independent test data instead.
- For automated cell-type annotation, pair markers with reference-based tools (SingleR, scType, Azimuth); they reduce manual annotation effort substantially.
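The settings discussed in these tips can also be passed to FindMarkers for a single pairwise comparison. The sketch below is an illustration under stated assumptions: the simulated counts, the metadata column group, and the labels "spiked" and "background" are hypothetical, and the groups are assigned from the simulation design rather than from clustering, so the comparison does not reuse the data to define the groups. Swapping test.use to "MAST" should work the same way, provided the MAST package is installed.

```r
library(Seurat)

set.seed(2026)
# Hypothetical toy counts: 200 genes x 300 cells; g1-g10 are raised in c1-c50.
counts <- matrix(rpois(200 * 300, lambda = 2), nrow = 200, ncol = 300,
                 dimnames = list(paste0("g", 1:200), paste0("c", 1:300)))
counts[1:10, 1:50] <- counts[1:10, 1:50] + 10

so <- CreateSeuratObject(counts)
so <- NormalizeData(so, verbose = FALSE)

# Assign groups from known labels (hypothetical), not from clustering.
so$group <- rep(c("spiked", "background"), times = c(50, 250))
Idents(so) <- "group"

# Pairwise test: "spiked" cells against "background" cells only.
pairwise <- FindMarkers(so, ident.1 = "spiked", ident.2 = "background",
                        test.use = "wilcox", only.pos = TRUE,
                        min.pct = 0.25, logfc.threshold = 0.25,
                        verbose = FALSE)

# Rank by fold change and show the main columns.
top <- pairwise[order(-pairwise$avg_log2FC), ]
head(top[, c("avg_log2FC", "pct.1", "pct.2", "p_val_adj")])
```

FindMarkers returns genes as row names rather than in a gene column, so the injected genes should appear as the row names of the top rows.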
R Packages Used
Seurat::FindMarkers() and FindAllMarkers() for the canonical workflow; MAST for hurdle-based DE testing; DESeq2 for negative-binomial DE on pseudobulk; SingleR and Azimuth for reference-based automated annotation; presto for fast Wilcoxon implementations on large datasets.