Week 3, Session 3 — DAGs with dagitty and ggdag

R. Heller

Note

Workflow lab using the variant template: Goal → Approach → Execution → Check → Report.

Learning objectives

Express a causal scenario as a directed acyclic graph (DAG).
Identify confounders, mediators, and colliders from a DAG.
Use dagitty::adjustmentSets() to find sufficient adjustment sets.

Prerequisites

Conceptual knowledge of confounding.

Background

A DAG is a set of variables joined by arrows that encode assumed causal directions. The graph distinguishes three kinds of third variable on a path between exposure and outcome: a confounder is a common cause; a mediator lies on the causal path from exposure to outcome; a collider is a common effect of two variables on the path. The practical upshot is that adjusting for a confounder reduces bias, adjusting for a mediator removes part of the effect you are trying to estimate, and adjusting for a collider creates bias where none existed.
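The three roles can also be read off a graph programmatically. The sketch below builds one hypothetical DAG with a variable of each kind and classifies them using dagitty's parents(), children(), ancestors(), and descendants(). The graph g and the set-based heuristics are illustrative assumptions, not part of the lab's DAGs, and they only cover simple cases like this one.

```r
library(dagitty)

# Hypothetical DAG with one confounder (C), one mediator (M), one collider (Z)
g <- dagitty('dag {
  C -> X
  C -> Y
  X -> M -> Y
  X -> Y
  X -> Z
  Y -> Z
  X [exposure]
  Y [outcome]
}')

# Confounder heuristic: a direct cause of X that is also an ancestor of Y
confounders <- intersect(parents(g, "X"), ancestors(g, "Y"))

# Mediator heuristic: a descendant of X that is an ancestor of Y
# (ancestors() and descendants() include the node itself, so drop X and Y)
mediators <- setdiff(
  intersect(descendants(g, "X"), ancestors(g, "Y")),
  c("X", "Y")
)

# Collider heuristic: a direct effect of both X and Y
colliders <- intersect(children(g, "X"), children(g, "Y"))

print(list(confounders = confounders,
           mediators   = mediators,
           colliders   = colliders))
```

In larger graphs a variable's role depends on the specific path being considered, so these set operations are a starting point rather than a substitute for inspecting paths.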

dagitty and ggdag let you declare a DAG in text, visualise it, and then query it for adjustment sets: variable sets that block all back-door paths from exposure to outcome without opening new ones. By default, adjustmentSets() returns only the minimal sets. The right adjustment set depends on the question. For the total effect of X on Y, you want to close all back-door paths but leave mediators alone; for the direct effect, you also condition on the mediators.
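To make the idea of a minimal set concrete, here is a small sketch using a hypothetical DAG (g2, not one of the lab's graphs) with a single back-door path that can be blocked at either of two variables. It compares the type = "minimal" and type = "all" options of adjustmentSets().

```r
library(dagitty)

# Hypothetical DAG: one back-door path, X <- C1 -> C2 -> Y
g2 <- dagitty('dag {
  C1 -> X
  C1 -> C2
  C2 -> Y
  X -> Y
  X [exposure]
  Y [outcome]
}')

# Minimal sets: no variable can be dropped without reopening the back door
min_sets <- adjustmentSets(g2, type = "minimal", effect = "total")

# All valid sets, including non-minimal ones
all_sets <- adjustmentSets(g2, type = "all", effect = "total")

print(min_sets)
print(all_sets)
```

When several minimal sets exist, the choice between them is usually driven by which variables are measured well, not by the graph alone.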

M-bias is the classic example of collider adjustment: conditioning on a variable that is a common effect of an unmeasured cause of X and an unmeasured cause of Y opens a path and biases the estimate. This is why “adjust for everything” is bad advice.
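A short simulation can illustrate M-bias. The sketch below uses base R only; the variable names (u1, u2, z) and all coefficients are assumptions chosen for illustration, with a true effect of x on y of 0.4.

```r
set.seed(42)
n  <- 5000

u1 <- rnorm(n)                         # unmeasured cause of X
u2 <- rnorm(n)                         # unmeasured cause of Y
z  <- 0.8 * u1 + 0.8 * u2 + rnorm(n)   # collider: common effect of u1 and u2
x  <- 0.8 * u1 + rnorm(n)
y  <- 0.4 * x + 0.8 * u2 + rnorm(n)    # true effect of x on y is 0.4

# No back-door path is open, so the unadjusted model is the correct one
b_unadjusted <- unname(coef(lm(y ~ x))["x"])

# Conditioning on z opens x <- u1 -> z <- u2 -> y
b_adjusted_z <- unname(coef(lm(y ~ x + z))["x"])

print(c(unadjusted = b_unadjusted, adjusted_for_z = b_adjusted_z))
```

Under these assumed coefficients the unadjusted estimate should land near 0.4, while the estimate adjusted for z should fall noticeably below it.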

Setup

library(tidyverse)
library(dagitty)
library(ggdag)
set.seed(42)
theme_set(theme_minimal(base_size = 12))

1. Goal

Write a confounder, mediator, and collider scenario as DAGs; ask dagitty which variables to adjust for.

2. Approach

dag1 <- dagitty('dag {
  X -> Y
  C -> X
  C -> Y
  X [exposure]
  Y [outcome]
}')

dag2 <- dagitty('dag {
  X -> M -> Y
  X -> Y
  X [exposure]
  Y [outcome]
}')

dag3 <- dagitty('dag {
  X -> Z
  Y -> Z
  X -> Y
  X [exposure]
  Y [outcome]
}')

ggdag(dag1) + theme_dag()

ggdag(dag2) + theme_dag()

ggdag(dag3) + theme_dag()

3. Execution

adjustmentSets(dag1, exposure = "X", outcome = "Y",
               effect = "total")

{ C }

# Total vs direct effect when a mediator exists
adjustmentSets(dag2, exposure = "X", outcome = "Y",
               effect = "total")

{}

adjustmentSets(dag2, exposure = "X", outcome = "Y",
               effect = "direct")

{ M }

# Collider: do not adjust for Z
adjustmentSets(dag3, exposure = "X", outcome = "Y")

{}
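To see why Z is excluded, it can help to list the paths between X and Y and check which are open with and without conditioning on Z. The sketch below re-declares dag3 so that it runs on its own and uses dagitty's paths() function; the object names p_none and p_z are hypothetical.

```r
library(dagitty)

dag3 <- dagitty('dag {
  X -> Z
  Y -> Z
  X -> Y
  X [exposure]
  Y [outcome]
}')

# Paths between X and Y with no conditioning
p_none <- paths(dag3, from = "X", to = "Y")

# The same paths when conditioning on the collider Z
p_z <- paths(dag3, from = "X", to = "Y", Z = "Z")

print(data.frame(
  path         = p_none$paths,
  open_no_adj  = p_none$open,
  open_given_Z = p_z$open
))
```

The path through Z should be reported as closed without adjustment and open once Z is conditioned on, which is the graphical signature of collider bias.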

4. Check

Simulate data from dag1 with a true effect of X on Y of 0.4, then verify that adjusting for C recovers that effect and that omitting C biases the estimate.

n  <- 2000
c_ <- rnorm(n)
x  <- 0.6 * c_ + rnorm(n)
y  <- 0.4 * x + 0.7 * c_ + rnorm(n)
df <- tibble(c_, x, y)

coef(lm(y ~ x,        data = df))["x"]

        x 
0.7042359

coef(lm(y ~ x + c_,   data = df))["x"]

        x 
0.3866581

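The unadjusted coefficient (0.70) is inflated because it also picks up the back-door path through C; with these parameters its expected value is 0.4 + 0.7 × 0.6 / 1.36 ≈ 0.71. The adjusted coefficient (0.39) is close to the simulated 0.4.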
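The same kind of check can be run for the collider scenario in dag3. This is a sketch with assumed coefficients (true effect 0.4, as in the dag1 check); the data frame df3 and the coefficient names are hypothetical.

```r
set.seed(42)
n <- 2000

x <- rnorm(n)
y <- 0.4 * x + rnorm(n)             # true effect of x on y is 0.4
z <- 0.5 * x + 0.5 * y + rnorm(n)   # collider: caused by both x and y
df3 <- data.frame(x, y, z)

# Correct model for dag3: no adjustment needed
b_correct <- unname(coef(lm(y ~ x, data = df3))["x"])

# Incorrect model: conditioning on the collider z
b_collider <- unname(coef(lm(y ~ x + z, data = df3))["x"])

print(c(unadjusted = b_correct, adjusted_for_z = b_collider))
```

Here the unadjusted model is the one that should recover 0.4, and adding z should pull the coefficient well below that value.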

5. Report

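A directed acyclic graph is a compact, explicit statement of causal assumptions, and any arrows it omits imply conditional independencies that can be tested against data. dagitty identified the confounder C as the required adjustment set in the first DAG. In the second DAG the total effect needs no adjustment, while the direct effect requires adjusting for M. The collider Z in the third DAG is not in any valid adjustment set, and conditioning on it would introduce collider (selection) bias.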
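One way to see what a DAG commits you to is to ask dagitty for its implied conditional independencies. The sketch below compares dag1, in which every pair of variables is joined by an arrow, with a hypothetical sparser graph (dag_sparse, an assumption made for illustration) that omits some arrows.

```r
library(dagitty)

# dag1 from the lab: all three pairs of nodes are connected
dag1 <- dagitty('dag {
  X -> Y
  C -> X
  C -> Y
}')

# Hypothetical variant: no direct X -> Y arrow, effect runs through M only
dag_sparse <- dagitty('dag {
  C -> X
  C -> Y
  X -> M -> Y
}')

ci_full   <- impliedConditionalIndependencies(dag1)
ci_sparse <- impliedConditionalIndependencies(dag_sparse)

print(length(ci_full))   # a fully connected DAG implies no independencies
print(ci_sparse)         # missing arrows yield testable statements
```

A graph with no missing arrows makes no claims that data could contradict, so the testable content of a DAG lies in the arrows it leaves out.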

Common pitfalls

Listing “every variable we measured” as the adjustment set.
Adjusting for a mediator and calling the result a total effect.
Using automated variable selection on observational data without a DAG.
Drawing the DAG after the analysis.
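The second pitfall can be illustrated with a simulation of the mediator scenario in dag2. This is a sketch with assumed coefficients: a direct effect of 0.3 and an indirect effect of 0.5 × 0.4 = 0.2, giving a total effect of 0.5.

```r
set.seed(42)
n <- 2000

x <- rnorm(n)
m <- 0.5 * x + rnorm(n)             # mediator
y <- 0.3 * x + 0.4 * m + rnorm(n)   # direct effect 0.3, indirect 0.5 * 0.4

# Total effect: leave the mediator out of the model
b_total <- unname(coef(lm(y ~ x))["x"])

# Direct effect: condition on the mediator
b_direct <- unname(coef(lm(y ~ x + m))["x"])

print(c(total = b_total, direct = b_direct))
```

Reporting the second coefficient as the total effect would understate it by the size of the indirect path.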

Session info

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

time zone: UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggdag_0.2.13    dagitty_0.3-4   lubridate_1.9.5 forcats_1.0.1  
 [5] stringr_1.6.0   dplyr_1.2.1     purrr_1.2.2     readr_2.2.0    
 [9] tidyr_1.3.2     tibble_3.3.1    ggplot2_4.0.3   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] viridis_0.6.5      generics_0.1.4     stringi_1.8.7      hms_1.1.4         
 [5] digest_0.6.39      magrittr_2.0.5     evaluate_1.0.5     grid_4.4.1        
 [9] timechange_0.4.0   RColorBrewer_1.1-3 fastmap_1.2.0      jsonlite_2.0.0    
[13] ggrepel_0.9.8      gridExtra_2.3      viridisLite_0.4.3  scales_1.4.0      
[17] tweenr_2.0.3       cli_3.6.6          graphlayouts_1.2.3 rlang_1.2.0       
[21] polyclip_1.10-7    tidygraph_1.3.1    cachem_1.1.0       withr_3.0.2       
[25] yaml_2.3.12        otel_0.2.0         tools_4.4.1        tzdb_0.5.0        
[29] memoise_2.0.1      boot_1.3-30        curl_7.1.0         vctrs_0.7.3       
[33] R6_2.6.1           lifecycle_1.0.5    V8_8.2.0           htmlwidgets_1.6.4 
[37] MASS_7.3-60.2      ggraph_2.2.2       pkgconfig_2.0.3    pillar_1.11.1     
[41] gtable_0.3.6       glue_1.8.1         Rcpp_1.1.1-1.1     ggforce_0.5.0     
[45] xfun_0.57          tidyselect_1.2.1   knitr_1.51         farver_2.1.2      
[49] htmltools_0.5.9    igraph_2.3.1       labeling_0.4.3     rmarkdown_2.31    
[53] compiler_4.4.1     S7_0.2.2
