library(tidyverse)
library(broom)
library(psych)
library(irr)
set.seed(42)
theme_set(theme_minimal(base_size = 12))
Week 4, Session 2 — Kappa, ICC, Bland–Altman
Course 2 — #courses
Workflow labs use the variant template: Goal → Approach → Execution → Check → Report.
Learning objectives
- Compute Cohen’s kappa for categorical agreement and explain its chance-correction.
- Compute an intraclass correlation coefficient for continuous agreement and distinguish consistency from absolute agreement.
- Draw a Bland–Altman plot and report limits of agreement.
Prerequisites
Basic R and ggplot2.
Background
Measurement-agreement studies ask whether two raters, two methods, or two instruments give the same answer on the same units. The choice of statistic depends on the scale of the measurement. Cohen’s kappa adjusts simple percent agreement for the agreement expected by chance given the marginal frequencies; it ranges from −1 to 1, with the common Landis–Koch landmarks putting moderate agreement above 0.4 and substantial agreement above 0.6. Its main weakness is sensitivity to prevalence: when one category dominates, chance agreement is high and kappa can be low even though the raters agree on almost every unit.
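To make the chance correction concrete, kappa can be computed by hand from a confusion matrix: observed agreement p_o is the proportion on the diagonal, expected agreement p_e comes from the products of the marginal proportions, and kappa = (p_o − p_e) / (1 − p_e). A minimal sketch with throwaway data (not one of the lab datasets):
# hand-rolled Cohen's kappa from a two-rater confusion matrix
rater_a <- c("a", "a", "b", "b", "a")
rater_b <- c("a", "b", "b", "b", "a")
tab <- table(rater_a, rater_b)
p_o <- sum(diag(tab)) / sum(tab)                      # observed agreement
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # agreement expected by chance
(p_o - p_e) / (1 - p_e)                               # kappa, ~0.62 here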
For continuous measurements, the intraclass correlation (ICC) and the Bland–Altman plot answer complementary questions. The ICC is a single-number summary of reliability, defined in several flavours (one-way vs two-way models, consistency vs absolute agreement). The Bland–Altman plot shows the pattern of disagreement: it plots the difference between two raters against their mean and marks the limits of agreement (typically mean difference ± 1.96 SD of the differences). It reveals constant bias, proportional bias, and heteroscedasticity that an ICC hides.
Reliability is not the same as agreement. Two raters can be almost perfectly correlated (say, one always reads twice the other) and still agree terribly. Always report both a summary statistic and the plot, and let the picture tell the pattern.
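A quick demonstration of that distinction (a throwaway sketch, not one of the lab datasets): give a second rater a constant offset and correlation stays near 1, the consistency ICC stays near 1, but absolute agreement drops.
# correlation and consistency ignore a constant bias; absolute agreement does not
a <- rnorm(30, 100, 10)
b <- a + 8 + rnorm(30, 0, 1)            # second rater reads ~8 units higher
cor(a, b)                               # ~0.99: looks "reliable"
res <- ICC(cbind(a, b))$results
res[res$type %in% c("ICC2", "ICC3"), ]  # ICC2 (absolute) well below ICC3 (consistency)
mean(b - a)                             # ~8: the disagreement cor() cannot see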
Setup
1. Goal
Build two small rater datasets — one categorical, one continuous — and compute the matching agreement statistics.
2. Approach
For the categorical example, simulate 100 radiograph classifications (3 categories) by two readers with substantial but not perfect agreement. For the continuous example, simulate 60 measurements by two instruments, one with a small constant bias.
# categorical
cats <- c("normal", "mild", "severe")
truth <- sample(cats, 100, replace = TRUE, prob = c(0.5, 0.3, 0.2))
r1 <- ifelse(runif(100) < 0.2, sample(cats, 100, replace = TRUE), truth)  # reader 1 answers at random ~20% of the time
r2 <- ifelse(runif(100) < 0.25, sample(cats, 100, replace = TRUE), truth) # reader 2 answers at random ~25% of the time
kap_tbl <- tibble(r1 = factor(r1, levels = cats),
                  r2 = factor(r2, levels = cats))
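# added check (not in the original lab): inspect the confusion matrix
# before collapsing it into a single kappa value
table(r1 = kap_tbl$r1, r2 = kap_tbl$r2)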
# continuous
n <- 60
true_val <- rnorm(n, 100, 15)
inst1 <- true_val + rnorm(n, 0, 3)
inst2 <- true_val + 2 + rnorm(n, 0, 3) # small positive bias
meas <- tibble(inst1, inst2)
3. Execution
Cohen’s kappa:
kappa2(kap_tbl[, c("r1", "r2")])
Cohen's Kappa for 2 Raters (Weights: unweighted)
Subjects = 100
Raters = 2
Kappa = 0.584
z = 8.21
p-value = 2.22e-16
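As a sanity check (an added one-liner, not part of the original output), compare kappa with raw percent agreement; the raw figure is always at least as large, and the gap is exactly what the chance correction removes.
mean(kap_tbl$r1 == kap_tbl$r2)  # raw percent agreement, well above the kappa of 0.584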
ICC via psych:
ICC(as.matrix(meas))
Call: ICC(x = as.matrix(meas))
Intraclass correlation coefficients
                        type   ICC  F df1 df2       p lower bound upper bound
Single_raters_absolute  ICC1  0.95 43  59  60 1.4e-33        0.93        0.97
Single_random_raters    ICC2  0.95 44  59  59 2.1e-33        0.93        0.97
Single_fixed_raters     ICC3  0.96 44  59  59 2.1e-33        0.93        0.97
Average_raters_absolute ICC1k 0.98 43  59  60 1.4e-33        0.96        0.99
Average_random_raters   ICC2k 0.98 44  59  59 2.1e-33        0.96        0.99
Average_fixed_raters    ICC3k 0.98 44  59  59 2.1e-33        0.96        0.99
Number of subjects = 60 Number of Judges = 2
See the help file for a discussion of the other 4 McGraw and Wong estimates,
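For two instruments measuring every subject, the value to report is usually the single-rater, absolute-agreement, two-way random-effects coefficient (ICC2 in psych’s labelling, ICC(2,1) in Shrout–Fleiss notation); it can be pulled out of the result directly:
res <- ICC(as.matrix(meas))$results
res[res$type == "ICC2", c("ICC", "lower bound", "upper bound")]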
Bland–Altman:
ba <- meas |>
  mutate(mean_val = (inst1 + inst2) / 2,  # x-axis: average of the two readings
         diff_val = inst2 - inst1)        # y-axis: the disagreement
loa <- mean(ba$diff_val) + c(-1.96, 0, 1.96) * sd(ba$diff_val)  # lower LoA, mean bias, upper LoA
ggplot(ba, aes(mean_val, diff_val)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = loa[1], linetype = 2, colour = "firebrick") +
  geom_hline(yintercept = loa[2], linetype = 1, colour = "steelblue") +
  geom_hline(yintercept = loa[3], linetype = 2, colour = "firebrick") +
  labs(x = "Mean of two instruments",
       y = "Difference (inst2 − inst1)")
4. Check
The ICC should be high (> 0.9) because the raters are well correlated, but the Bland–Altman plot shows a positive bias: inst2 was simulated to read 2 units higher on average, though the realized sample mean difference will drift from that with only n = 60.
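To pin the visual impression down with a number (an added check using base R), estimate the bias and test it against zero:
mean(ba$diff_val)    # estimated bias
t.test(ba$diff_val)  # H0: mean difference is zero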
5. Report
Cohen’s kappa for the two radiograph readers was 0.58. For the two instruments, the ICC (absolute agreement, two-way random) was 0.95, but the Bland–Altman plot revealed a mean bias of 0.9 units with 95% limits of agreement from −7.9 to 9.7.
Two takeaways: bias and precision are separate components of agreement, and neither kappa nor a single correlation captures both.
Common pitfalls
- Reporting percent agreement instead of kappa.
- Using Pearson r on two raters and calling it agreement.
- Omitting the limits of agreement from a Bland–Altman plot.
Further reading
- Bland JM, Altman DG (1986). Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet, 327(8476), 307–310.
- Shrout PE, Fleiss JL (1979). Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
- McGraw KO, Wong SP (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30–46.
Session info
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] irr_0.84.1 lpSolve_5.6.23 psych_2.6.3 broom_1.0.12
[5] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0 dplyr_1.2.1
[9] purrr_1.2.2 readr_2.2.0 tidyr_1.3.2 tibble_3.3.1
[13] ggplot2_4.0.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] generics_0.1.4 stringi_1.8.7 lattice_0.22-6 lme4_2.0-1
[5] hms_1.1.4 digest_0.6.39 magrittr_2.0.5 evaluate_1.0.5
[9] grid_4.4.1 timechange_0.4.0 RColorBrewer_1.1-3 fastmap_1.2.0
[13] Matrix_1.7-0 jsonlite_2.0.0 backports_1.5.1 scales_1.4.0
[17] reformulas_0.4.4 Rdpack_2.6.6 mnormt_2.1.2 cli_3.6.6
[21] rlang_1.2.0 rbibutils_2.4.1 splines_4.4.1 withr_3.0.2
[25] yaml_2.3.12 otel_0.2.0 tools_4.4.1 parallel_4.4.1
[29] tzdb_0.5.0 nloptr_2.2.1 minqa_1.2.8 boot_1.3-30
[33] vctrs_0.7.3 R6_2.6.1 lifecycle_1.0.5 htmlwidgets_1.6.4
[37] MASS_7.3-60.2 pkgconfig_2.0.3 pillar_1.11.1 gtable_0.3.6
[41] Rcpp_1.1.1-1.1 glue_1.8.1 xfun_0.57 tidyselect_1.2.1
[45] knitr_1.51 farver_2.1.2 htmltools_0.5.9 nlme_3.1-164
[49] labeling_0.4.3 rmarkdown_2.31 compiler_4.4.1 S7_0.2.2