Chapter 3: Causal inference — a methods toolkit

Handbook of Biostatistics for Medical Research

When this chapter applies

Use this chapter when you have a causal question (does treatment X cause outcome Y) and randomization is not in the cards. The decision framework here walks through choosing among the main causal-inference designs (DiD, RDD, IV, synthetic control, propensity-score methods) and stating the identifying assumption each design relies on. The chapter assumes you’ve already worked through Chapter 1 (the research question is specified in PICO and target-trial terms) and Chapter 2 (the study population is defined and the sample is in hand).

The deliverables a careful pass produces:

The causal estimand stated explicitly (ATT, ATE, LATE)
A map of the threats to identification
A design choice that addresses the dominant threat, with rationale documented
The identifying assumption stated as a single defensible sentence
The estimator that operationalizes the design
A defense plan for the identifying assumption (event-study for DiD, bandwidth sensitivity for RDD, instrument-relevance and exclusion-restriction tests for IV, balance diagnostics for propensity scores)
The pre-specified robustness checks (Chapter 4 covers these in detail)

If your data come from a randomized trial, most of this chapter doesn’t apply; randomization handles identification, and the methodological work is in Chapters 1, 2, and 4. If your data come from an observational source, this chapter is the core.

The decision framework

Choosing and defending a causal-inference design is a sequence of seven decisions.

Step 1. Specify the causal estimand

Before picking a design, name what you want to estimate. The average treatment effect (ATE) is the population-average causal effect. The average treatment effect on the treated (ATT) is the effect among those who actually received treatment. The local average treatment effect (LATE) is the effect among compliers, the population whose treatment status the instrument or design actually changes. These are different parameters; the policy implications differ.

For most policy evaluations, ATT is the right target. For most clinical-effectiveness questions, ATE is the right target. For IV-based designs, LATE is what you actually estimate. Write the estimand explicitly; it dictates the rest of the design.
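To see why the choice of estimand matters, here is a minimal simulation sketch. The data-generating process is entirely hypothetical (the `frailty` variable, the effect sizes, and the treatment rule are assumptions made for illustration): sicker units benefit more and are also treated more often, so ATE and ATT diverge and the naive treated-versus-untreated contrast matches neither.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical simulated cohort. Both potential outcomes are visible
# only because this is a simulation; real data reveal one per unit.
units = []
for _ in range(20_000):
    frailty = random.random()                 # baseline severity in [0, 1]
    y0 = 10.0 * frailty + random.gauss(0, 1)  # outcome if untreated
    effect = 1.0 + 4.0 * frailty              # sicker units benefit more
    y1 = y0 + effect                          # outcome if treated
    treated = random.random() < frailty       # sicker units are treated more often
    units.append((y0, y1, treated))

# ATE: average effect over everyone (analytically 3.0 under this process).
ate = mean(y1 - y0 for y0, y1, _ in units)

# ATT: average effect among the treated (analytically 11/3, about 3.67).
att = mean(y1 - y0 for y0, y1, t in units if t)

# Naive contrast: observed treated mean minus observed untreated mean.
# It mixes the causal effect with the baseline difference in frailty.
naive = (mean(y1 for _, y1, t in units if t)
         - mean(y0 for y0, _, t in units if not t))

print(f"ATE   = {ate:.2f}")
print(f"ATT   = {att:.2f}")
print(f"naive = {naive:.2f}")
```

The gap between `ate` and `att` comes from effect heterogeneity combined with selection into treatment; the gap between `naive` and both of them is confounding by `frailty`, which is the problem the rest of the framework is built to solve.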

Step 2. Map the threats to identification

For each of the standard threats to causal inference, ask whether it plausibly applies:

Confounding. A common cause of treatment and outcome. The most pervasive threat in observational data.
Selection bias. Who gets included in the study correlates with treatment and outcome.
Measurement error. Treatment, outcome, or confounders are mis-measured in a way that biases the estimate.
Reverse causation. The outcome causes the treatment, not the other way.
Spillover or interference. Treating one unit affects untreated units, violating the stable unit treatment value assumption (SUTVA).

The dominant threat (the one most likely to bias your estimate) is the design problem you have to solve. Each causal-inference design family is built to handle a specific dominant threat.

Step 3. Pick the design that addresses the dominant threat

The five main causal-inference design families, with their use cases:

Difference-in-differences (DiD). Treatment applied at a known date to a known group; an untreated comparison group exists; pre- and post-treatment data are available. Use when the policy is uniformly applied at a single date (the Part D insulin cap, the Medicaid expansion among states that adopted in January 2014, an FDA approval). For staggered roll-outs, use Callaway–Sant’Anna or Sun–Abraham estimators rather than plain two-way fixed-effects, which can be biased when treatment effects vary across adoption cohorts or over time.
Regression discontinuity (RDD). Treatment is assigned by a cutoff on a continuous score (an eGFR cutoff that defines CKD stage, BMI threshold, eligibility age, test score). Use when there’s a clean threshold-based assignment rule.
Instrumental variables (IV). A variable (the instrument) affects treatment but not the outcome directly. Use when you can argue for a credible instrument: distance-to-treatment, genetic variants (Mendelian randomization), policy changes that affect treatment uptake but not the outcome through other paths.
Synthetic control. One or a few treated units, many untreated units, a long pre-treatment period. Use for state-level policy evaluations (one or two treated states, many untreated states) or rare-event treatments.
Propensity-score methods (matching, weighting, doubly-robust). Many treated and untreated units, with measured confounders rich enough to address the dominant threats. Use when no clean natural experiment is available but the confounders are measured.

If the dominant threat is unmeasured confounding and you have no instrument, no natural experiment, no threshold-based assignment, and no untreated comparison group, you may not have a causal study. That’s worth knowing before you commit to one.
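The Step 3 screen can be written down as a small checklist function. This is only a sketch of the logic in the text, not a substitute for judgment; the class name, field names, and design labels are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class StudyFeatures:
    # Hypothetical flags, one per condition described for each design family.
    known_date_with_comparison_group: bool = False   # DiD
    cutoff_on_continuous_score: bool = False         # RDD
    credible_instrument: bool = False                # IV
    few_treated_many_donors_long_pre: bool = False   # synthetic control
    rich_measured_confounders: bool = False          # propensity-score methods


def candidate_designs(features: StudyFeatures) -> list[str]:
    """Return the design families whose data requirements are met."""
    checks = [
        (features.known_date_with_comparison_group, "difference-in-differences"),
        (features.cutoff_on_continuous_score, "regression discontinuity"),
        (features.credible_instrument, "instrumental variables"),
        (features.few_treated_many_donors_long_pre, "synthetic control"),
        (features.rich_measured_confounders, "propensity-score methods"),
    ]
    return [name for condition, name in checks if condition]


# A study with an eligibility threshold and good covariates has two options.
options = candidate_designs(
    StudyFeatures(cutoff_on_continuous_score=True, rich_measured_confounders=True)
)
print(options)

# No qualifying feature: the empty list is the "may not have a causal study" case.
print(candidate_designs(StudyFeatures()))
```

An empty result is the useful output here: it flags, before any estimation, that none of the five families has the data structure it needs.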

Step 4. State the identifying assumption explicitly

Every causal design rests on an assumption that cannot be tested with the data alone. The discipline is to write the assumption as a single defensible sentence in the protocol.

DiD’s identifying assumption is parallel trends: the treated and control groups would have moved together in the outcome, absent treatment.
RDD’s identifying assumption is continuity: potential outcomes are continuous in the running variable at the cutoff.
IV’s identifying assumption is the exclusion restriction: the instrument affects the outcome only through its effect on treatment.
Synthetic control’s identifying assumption is pre-treatment fit: the weighted combination of donor units that matches the treated unit pre-treatment also approximates its counterfactual post-treatment.
Propensity-score methods rest on conditional independence: given the measured confounders, treatment assignment is independent of potential outcomes.

Writing the assumption explicitly is half the work of defending it.
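One way to make each sentence precise is to write it in potential-outcomes notation. The notation below is an assumption of this sketch rather than something fixed elsewhere in the handbook: Y(d) is the potential outcome under treatment status d, D is treatment, Z is the instrument, R is the running variable with cutoff c, X is the vector of measured confounders, and unit 1 is the treated unit in the synthetic-control setting, with donor weights w_j and last pre-treatment period T_0.

```latex
\begin{align*}
\text{DiD (parallel trends):}\quad
  & \mathbb{E}[\,Y_{\text{post}}(0) - Y_{\text{pre}}(0) \mid D = 1\,]
    = \mathbb{E}[\,Y_{\text{post}}(0) - Y_{\text{pre}}(0) \mid D = 0\,] \\[4pt]
\text{RDD (continuity):}\quad
  & \lim_{r \uparrow c} \mathbb{E}[\,Y(d) \mid R = r\,]
    = \lim_{r \downarrow c} \mathbb{E}[\,Y(d) \mid R = r\,],
    \qquad d \in \{0, 1\} \\[4pt]
\text{IV (exclusion restriction):}\quad
  & Y(d, z) = Y(d) \quad \text{for all } d \text{ and } z \\[4pt]
\text{Synthetic control:}\quad
  & Y_{1t}(0) \approx \sum_{j \ge 2} w_j\, Y_{jt} \ \text{ for } t > T_0,
    \ \text{ where } w \text{ is chosen so that }
    Y_{1t} \approx \sum_{j \ge 2} w_j\, Y_{jt} \ \text{ for } t \le T_0 \\[4pt]
\text{Propensity score (conditional independence):}\quad
  & \bigl(Y(0),\, Y(1)\bigr) \perp\!\!\!\perp D \mid X
\end{align*}
```

Each line involves a potential outcome that is never observed for the relevant units, which is why none of these statements can be verified from the data alone.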

Step 5. Choose the estimator that fits the design

The design tells you the family; the data tell you which estimator inside the family.

DiD with simultaneous treatment: two-way fixed-effects (TWFE).
DiD with staggered roll-out: Callaway–Sant’Anna, Sun–Abraham, or de Chaisemartin–D’Haultfœuille.
RDD: local linear regression with a triangular kernel and an optimal bandwidth (Calonico–Cattaneo–Titiunik).
IV: two-stage least squares (2SLS) or limited-information maximum likelihood.
Synthetic control: Abadie–Diamond–Hainmueller weight optimization (or the augmented version of Ben-Michael–Feller–Rothstein).
Propensity-score methods: nearest-neighbor matching, optimal full matching, inverse-probability-of-treatment weighting (IPTW) with stabilized weights, or augmented IPW with outcome modeling for doubly-robust estimation.
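Two of the simplest cases above can be sketched with nothing beyond the standard library. With two groups and two periods, the TWFE interaction coefficient equals the difference of cell-mean changes; with one instrument and one treatment variable, 2SLS reduces to the ratio cov(Z, Y) / cov(Z, D). The simulated data, effect sizes, and function names are hypothetical, and a real analysis would use a regression package with appropriate standard errors.

```python
import random
from statistics import mean


def did_2x2(rows):
    """Two-group, two-period DiD from cell means.

    rows: list of (in_treated_group, in_post_period, outcome) tuples.
    """
    def cell_mean(group, post):
        return mean(y for g, p, y in rows if g == group and p == post)

    change_treated = cell_mean(True, True) - cell_mean(True, False)
    change_control = cell_mean(False, True) - cell_mean(False, False)
    return change_treated - change_control


def covariance_sum(a, b):
    """Sum of cross-products around the means (unnormalized covariance)."""
    abar, bbar = mean(a), mean(b)
    return sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))


def wald_iv(z, d, y):
    """Just-identified IV estimate: cov(z, y) / cov(z, d)."""
    return covariance_sum(z, y) / covariance_sum(z, d)


def ols_slope(x, y):
    """Simple-regression slope, shown only as the confounded benchmark."""
    return covariance_sum(x, y) / covariance_sum(x, x)


random.seed(1)

# Simulated DiD data: group gap 2.0, common time trend 1.0, true effect 1.5.
rows = []
for group in (False, True):
    for post in (False, True):
        for _ in range(2_000):
            y = (5.0 + 2.0 * group + 1.0 * post
                 + 1.5 * (group and post) + random.gauss(0, 1))
            rows.append((group, post, y))
did_estimate = did_2x2(rows)

# Simulated IV data: u is an unmeasured confounder, true effect of d on y is 2.0.
z, d, y = [], [], []
for _ in range(50_000):
    zi = float(random.random() < 0.5)          # binary instrument
    u = random.gauss(0, 1)                     # unmeasured confounder
    di = 0.5 * zi + u + random.gauss(0, 1)     # instrument shifts treatment
    yi = 2.0 * di + 3.0 * u + random.gauss(0, 1)
    z.append(zi)
    d.append(di)
    y.append(yi)
iv_estimate = wald_iv(z, d, y)
ols_estimate = ols_slope(d, y)

print(f"DiD estimate: {did_estimate:.2f} (true 1.5)")
print(f"IV estimate:  {iv_estimate:.2f} (true 2.0)")
print(f"OLS estimate: {ols_estimate:.2f} (biased by u)")
```

The OLS benchmark is included to show what the instrument buys you: the confounder `u` pushes the naive slope well above the true effect, while the ratio estimator uses only the variation in treatment that the instrument induces.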

Step 6. Plan the assumption-defense

For each design, there’s a standard defense move. Pre-specify it before estimating.

DiD: event-study with year-by-treatment interactions, plus a formal F-test on pre-period leads. Chapter 4 covers when and how this defense fails.
RDD: discontinuity tests in covariates at the cutoff (the absence of a jump is consistent with assignment being as good as random near the cutoff), the McCrary density test for manipulation of the running variable, bandwidth sensitivity.
IV: first-stage F-statistic for relevance, the Hansen J-test for over-identification when multiple instruments are available, a written argument for the exclusion restriction (which cannot be formally tested with a single instrument).
Synthetic control: in-time placebo (test whether the synthetic control “predicts” earlier pseudo-treatments), in-space placebo (assign the treatment to each untreated unit in turn and compare effect sizes).
Propensity-score methods: balance diagnostics (standardized differences before and after matching or weighting), overlap diagnostics (propensity-score distributions across treated and control).
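As an illustration of the propensity-score defense, the sketch below computes a standardized mean difference before and after inverse-probability weighting. Everything about the data is hypothetical: the `band` covariate, the treatment probabilities, and the `severity` variable are simulated, and the propensity score is estimated as the treated share within each band, which only works because the confounder is discrete. Using weighted variances in the denominator is one common convention, not the only one.

```python
import random
from collections import defaultdict


def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_var(values, weights):
    m = weighted_mean(values, weights)
    return sum(w * (v - m) ** 2 for v, w in zip(values, weights)) / sum(weights)


def standardized_difference(x, treated, weights=None):
    """(mean_treated - mean_control) / sqrt((var_treated + var_control) / 2)."""
    if weights is None:
        weights = [1.0] * len(x)
    xt = [v for v, t in zip(x, treated) if t]
    wt = [w for w, t in zip(weights, treated) if t]
    xc = [v for v, t in zip(x, treated) if not t]
    wc = [w for w, t in zip(weights, treated) if not t]
    pooled_sd = ((weighted_var(xt, wt) + weighted_var(xc, wc)) / 2) ** 0.5
    return (weighted_mean(xt, wt) - weighted_mean(xc, wc)) / pooled_sd


random.seed(2)

# Hypothetical cohort: treatment probability rises with a discrete risk band.
p_treat = {0: 0.2, 1: 0.5, 2: 0.8}
band, severity, treated = [], [], []
for _ in range(10_000):
    b = random.choice((0, 1, 2))
    band.append(b)
    severity.append(b + random.gauss(0, 1))   # covariate driven by the band
    treated.append(random.random() < p_treat[b])

# Estimated propensity score: treated share within each band.
counts = defaultdict(lambda: [0, 0])          # band -> [n_treated, n_total]
for b, t in zip(band, treated):
    counts[b][0] += t
    counts[b][1] += 1
ps = {b: n_treated / n_total for b, (n_treated, n_total) in counts.items()}

# Inverse-probability-of-treatment weights (unstabilized, for simplicity).
iptw = [1 / ps[b] if t else 1 / (1 - ps[b]) for b, t in zip(band, treated)]

smd_before = standardized_difference(severity, treated)
smd_after = standardized_difference(severity, treated, iptw)

print(f"SMD before weighting: {smd_before:.3f}")
print(f"SMD after weighting:  {smd_after:.3f}")
print(f"Estimated propensity range: {min(ps.values()):.2f} to {max(ps.values()):.2f}")
```

A frequently used rule of thumb treats an absolute standardized difference below 0.1 as adequate balance. The propensity range printed at the end is a crude overlap check; in practice you would plot the full score distributions for treated and control units.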

Step 7. Pre-specify the robustness checks

Chapter 4 covers sensitivity analysis in detail. The Step-7 discipline here parallels Chapter 1: while you’re designing the primary causal estimator, list the methodological choices most likely to be challenged at review, such as alternative comparator definitions, alternative outcome operationalizations, alternative bandwidths or matching specifications, and alternative instruments. Each is a candidate for pre-specified robustness analysis.

Target trial emulation as a framing layer

Across all of these design families, the target-trial-emulation framework (Hernán and Robins 2016; Hernán 2018) is the framing layer that makes the causal target explicit. The idea: write down the hypothetical randomized trial that would answer your question if it could be run, then design the observational study to approximate that trial. The framework forces explicit specification of treatment assignment, eligibility, follow-up start, and outcome ascertainment, in a way that the observational analysis can defend at review.

Target-trial emulation pairs particularly cleanly with propensity-score methods (the emulation specifies the trial, the propensity score handles the assignment) but is useful across all five design families.
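One lightweight way to enforce that explicitness is to hold the protocol in a structure that reports which components are still blank. The sketch below is an assumption about how you might organize this, not part of the target-trial framework itself; the class name, field names, and the example entries are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class TargetTrialProtocol:
    # One field per component the framework asks you to specify.
    eligibility: str = ""
    treatment_strategies: str = ""
    treatment_assignment: str = ""
    follow_up_start: str = ""
    outcome_ascertainment: str = ""

    def missing_components(self):
        """Names of protocol components that have not been written down yet."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Hypothetical draft with two components still unspecified.
draft = TargetTrialProtocol(
    eligibility="Adults with no use of the study drug in the prior 12 months",
    treatment_strategies="Initiate the study drug versus initiate the comparator",
    treatment_assignment="Emulated by adjusting for measured baseline confounders",
)

print(draft.missing_components())
```

Leaving `follow_up_start` blank is the gap worth catching early, since the framework requires the start of follow-up to be specified explicitly.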

Worked example

The Part D insulin DiD case study walks through the seven-step framework applied to a policy-evaluation question. The IRA’s $35 insulin cap was a simultaneous treatment with a clean control group, so plain TWFE is unbiased here under parallel trends. Parallel trends is the identifying claim, the event study checks it visually, and the placebo cap-years, leave-one-out, and drop-GLP-1 robustness checks each earn their place in the protocol. The case study also surfaces the harder methodological points: when the data are pre-filtered on the test statistic, when the empirical null is heavier-tailed than the textbook assumes, and when the natural experiment’s pre-period is partially contaminated by an earlier demonstration (the CMS Senior Savings Model).
