Course 4 · Week 2 — ML done honestly

Cheatsheet — biostats_courses

Author: R. Heller

Tree-based models

Model           R function            Strengths
CART            rpart::rpart          interpretable single tree
Random forest   ranger::ranger        robust; OOB error; variable importance
XGBoost         xgboost::xgb.train    top tabular performance; needs careful tuning
LightGBM        lightgbm::lgb.train   fast on large data
library(ranger)
fit <- ranger(y ~ ., data = df, importance = "permutation",
              num.trees = 500, mtry = floor(sqrt(ncol(df) - 1)))
fit$prediction.error   # OOB error
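The OOB error reported above comes from bagging: each tree is fit on a bootstrap sample, and the rows left out of that sample act as a free test set for that tree. A base-R sketch of the idea, using toy data and simple linear fits in place of trees (everything here is invented for illustration):

```r
# Out-of-bag (OOB) error by hand: aggregate each row's predictions
# only from models whose bootstrap sample excluded that row.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)

B <- 100
pred_sum <- numeric(n)   # running sum of OOB predictions per row
pred_cnt <- numeric(n)   # number of models that left each row out

for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)      # bootstrap sample
  oob <- setdiff(seq_len(n), idx)       # rows not drawn this round
  fit <- lm(y ~ x, data = data.frame(x = x[idx], y = y[idx]))
  pred_sum[oob] <- pred_sum[oob] + predict(fit, data.frame(x = x[oob]))
  pred_cnt[oob] <- pred_cnt[oob] + 1
}

oob_mse <- mean((y - pred_sum / pred_cnt)^2, na.rm = TRUE)
oob_mse   # honest error estimate without a separate test set
```

This is why `fit$prediction.error` needs no held-out data: every observation is evaluated only by trees that never saw it.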

Interpretability

library(DALEX); library(iml)

ex <- explain(fit, data = df[setdiff(names(df), "y")], y = df$y)
ip <- model_parts(ex)                          # permutation importance
pd <- model_profile(ex, variables = "x1")      # partial-dependence profile
sh <- Shapley$new(Predictor$new(fit, data = df, y = df$y),
                  x.interest = df[1, ])        # Shapley values for one observation

PDPs assume feature independence; when features are correlated, use ALE plots instead.
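What a partial-dependence profile computes can be written in a few lines: fix the feature of interest at each grid value and average the model's predictions over the other features as observed. A base-R sketch with a toy linear model (data, names, and grid all invented for illustration):

```r
# Partial dependence by hand: clamp x1 at grid values, average predictions.
set.seed(1)
toy  <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(100, sd = 0.1)
m    <- lm(y ~ x1 + x2, data = toy)

grid <- seq(-2, 2, length.out = 5)
pdp  <- sapply(grid, function(v) {
  tmp <- toy
  tmp$x1 <- v                    # fix x1; keep x2 at its observed values
  mean(predict(m, tmp))
})
cbind(x1 = grid, pd = pdp)       # roughly linear with slope ~2 here
```

Note the clamping step: it pairs the fixed x1 with every observed x2, including combinations that never occur in the data — exactly why correlated features make PDPs misleading and ALE preferable.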

Tabular neural networks with torch

# sketch only — training typically runs in a loop, often on GPU
library(torch)
p  <- ncol(df) - 1                 # number of predictor columns
nn <- nn_sequential(
  nn_linear(p, 64), nn_relu(),
  nn_linear(64, 1)
)
opt <- optim_adam(nn$parameters)
# per batch: opt$zero_grad(); nnf_mse_loss(nn(x), y)$backward(); opt$step()

On small biomedical tabular datasets, boosted trees usually beat neural networks. Use a neural network when you need end-to-end training on images, sequences, or text.

tidymodels pipeline

library(tidymodels)
rec  <- recipe(y ~ ., data = df) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

mod  <- rand_forest(mtry = tune(), trees = 500) |>
  set_engine("ranger") |> set_mode("classification")

wf    <- workflow() |> add_recipe(rec) |> add_model(mod)
set.seed(1)                               # the resampling splits below are random
cv    <- vfold_cv(df, v = 5, strata = y)
tuned <- tune_grid(wf, resamples = cv, grid = 20)
collect_metrics(tuned)

Decision rule for Week 2

  • Tabular, < 100k rows → gradient-boosted trees.
  • Need explanations → SHAP + PDP, validated on held-out data.
  • Images / sequences → CNN / transformer, use torch.
  • Reproducibility → tidymodels pipeline + fixed seed + locked recipe.
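The reproducibility bullet in practice: set one seed before anything stochastic (splits, resampling, tuning) and the whole pipeline becomes re-runnable. A base-R demonstration (the helper name is invented for illustration):

```r
# Same seed -> identical random split, hence identical downstream results.
make_split <- function(n, seed) {
  set.seed(seed)
  sample(n, size = floor(0.8 * n))   # 80% training indices
}

s1 <- make_split(100, seed = 42)
s2 <- make_split(100, seed = 42)
identical(s1, s2)   # TRUE — the split is fully reproducible
```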

Common pitfalls

  • Reporting feature importance from a model trained on the full data.
  • Interpreting PDPs when features are strongly correlated.
  • Claiming NN superiority without cross-validating on the same splits as the tree model.
  • Forgetting to set a seed for any tune / split operation.
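The first pitfall made concrete: anything you report (error, importance) should be estimated on data the model never saw. A base-R sketch of a simple hold-out split, with toy data and names invented for illustration:

```r
# Hold out data before fitting; report performance on the held-out part only.
set.seed(1)
n   <- 150
dat <- data.frame(x = rnorm(n))
dat$y <- dat$x + rnorm(n)

train_idx <- sample(n, size = floor(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

m <- lm(y ~ x, data = train)
mse_train <- mean((train$y - predict(m, train))^2)
mse_test  <- mean((test$y  - predict(m, test))^2)
c(train = mse_train, test = mse_test)   # report the test number, not the train one
```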

Further reading

  • Biecek & Burzykowski, Explanatory Model Analysis.
  • Molnar, Interpretable Machine Learning.
