Chapter 4: Sensitivity analysis and robustness as design defenses

Handbook of Biostatistics for Medical Research

When this chapter applies

Use this chapter when you have a causal study, or any observational study whose headline claim depends on assumptions that cannot be tested with the data alone. The work covered here defends the result against the most likely methodological critiques by pre-specifying sensitivity analyses that try to break the design. Sensitivity analyses run only because a reviewer asked for them are reactive and carry little credibility; those pre-specified at design time are proactive and carry far more.

The deliverables a careful pass produces:

A map of the assumptions your design rests on
An identification of which assumption is most vulnerable
A sensitivity-analysis method matched to that assumption
Pre-specified sensitivity analyses in the protocol
Bias-adjusted estimates under plausible violation scenarios
Leave-one-out and placebo diagnostics
Transparent reporting of all sensitivity results, not just the favorable ones

The decision framework

Seven steps for designing sensitivity and robustness analyses.

Step 1. Map the assumptions

Every analysis rests on assumptions. For a causal study, the identifying assumption (parallel trends, continuity, exclusion restriction, conditional independence, pre-treatment fit) is the load-bearing one (Chapter 3). For non-causal studies, the assumptions are about the measurement model, the missing-data mechanism, the model specification, and the population definition.

Write the assumptions down explicitly. A study where you cannot enumerate the assumptions does not have a well-defined methodology.

Step 2. Identify which assumption is most vulnerable

Not all assumptions are equal. Some are routinely satisfied; others are routinely violated. The discipline is to identify which assumption is most likely to be challenged at review.

For difference-in-differences (DiD), parallel trends is the most common challenge. For regression discontinuity designs (RDD), it is manipulation around the cutoff. For instrumental variables (IV), it is the exclusion restriction. For propensity-score methods, it is unmeasured confounding. For analyses with substantial missing data, it is the missing-at-random (MAR) assumption. The most vulnerable assumption gets the most sensitivity-analysis attention.

Step 3. Choose a sensitivity-analysis method matched to the assumption

The standard sensitivity-analysis catalog:

Unmeasured confounding. E-values (VanderWeele and Ding 2017), Rosenbaum bounds (Rosenbaum 2002), SIMEX (simulation-extrapolation) for confounders measured with error.
Parallel trends in DiD. Event-study with year-by-treatment interactions, formal F-test on pre-period leads, honest DiD (Rambachan and Roth 2023), placebo treatment dates (the placebo cap-years in the Part D insulin case study).
Missing data. Multiple imputation under MAR, pattern-mixture models for missing-not-at-random (MNAR), tipping-point analysis.
Outcome misclassification. Probabilistic bias analysis (Lash, Fox, and Fink 2009), bounds analysis under non-differential misclassification.
Selection bias. Inverse-probability-of-selection weighting, Heckman selection model when the selection mechanism is identifiable.
Model misspecification. Alternative model forms (linear vs nonlinear, parametric vs nonparametric), leave-one-out estimators, specification curves (Simonsohn, Simmons, and Nelson 2020).
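
To make one catalog entry concrete, here is a minimal sketch of Rosenbaum bounds for a matched-pair sign test. It assumes the standard sensitivity model in which an unmeasured confounder can shift the odds of treatment within a pair by at most a factor Γ, so that under the null the chance that the treated unit has the higher outcome is at most Γ / (1 + Γ). The function names, the grid step, and the pair counts are illustrative assumptions, not part of any case study in this handbook.

```python
from math import comb


def sign_test_upper_p(n_pairs, n_treated_higher, gamma):
    """Upper bound on the one-sided sign-test p-value at sensitivity level gamma.

    n_pairs counts untied matched pairs; n_treated_higher counts pairs in
    which the treated unit has the higher outcome.
    """
    p = gamma / (1.0 + gamma)  # worst-case success probability under the null
    return sum(
        comb(n_pairs, k) * p**k * (1.0 - p) ** (n_pairs - k)
        for k in range(n_treated_higher, n_pairs + 1)
    )


def critical_gamma(n_pairs, n_treated_higher, alpha=0.05, step=0.05, max_gamma=10.0):
    """Smallest gamma on the grid at which the bound exceeds alpha (None if never)."""
    gamma = 1.0
    while gamma <= max_gamma:
        if sign_test_upper_p(n_pairs, n_treated_higher, gamma) > alpha:
            return gamma
        gamma = round(gamma + step, 10)
    return None


# Hypothetical matched study: 50 untied pairs, treated unit higher in 38.
print(critical_gamma(50, 38))
```

A larger critical Γ means a stronger hidden bias would be needed to overturn the finding. The sign test is used here only because its bound has a closed binomial form; bounds for other test statistics follow the same logic with different reference distributions.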

Step 4. Pre-specify the sensitivity analyses in the protocol

The single discipline that most increases sensitivity-analysis credibility is pre-specification. Document, before any data work, which sensitivity analyses you will run, under what assumed violation scenarios, and how you will report them. Reviewers and IRB members read pre-specified sensitivity-analysis plans as a credibility signal.

The protocol should specify:

The sensitivity-analysis method(s) for each vulnerable assumption
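The violation scenarios (for example, “we will report bias-adjusted estimates for an unmeasured confounder with confounder–treatment and confounder–outcome risk ratios of 1.5, 2.0, and 3.0, alongside the E-value of the headline estimate”)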
The reporting plan (a table of bias-adjusted estimates across scenarios)
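
One way to keep the protocol items above honest is to write the plan down as data before any analysis code exists. The sketch below is an assumption about how that might look, not a required format: the `PlannedAnalysis` class, the `uncovered` helper, and every plan entry are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedAnalysis:
    assumption: str   # the vulnerable assumption being probed
    method: str       # sensitivity-analysis method matched to it
    scenarios: tuple  # violation scenarios fixed before data work
    reporting: str    # how the results will be reported


# Hypothetical plan entries, for illustration only.
PLAN = (
    PlannedAnalysis(
        assumption="no unmeasured confounding",
        method="E-value",
        scenarios=("confounder RR 1.5", "confounder RR 2.0", "confounder RR 3.0"),
        reporting="table of bias-adjusted estimates across scenarios",
    ),
    PlannedAnalysis(
        assumption="missing at random",
        method="tipping-point analysis",
        scenarios=("delta grid on imputed outcomes",),
        reporting="estimate at each delta, tipping point flagged",
    ),
)


def uncovered(vulnerable_assumptions, plan):
    """Return the vulnerable assumptions that have no planned analysis."""
    covered = {p.assumption for p in plan}
    return [a for a in vulnerable_assumptions if a not in covered]


print(uncovered(
    ["no unmeasured confounding", "missing at random", "parallel trends"], PLAN
))  # ['parallel trends']
```

The frozen dataclass is a small design choice: once the plan is committed alongside the protocol, the entries cannot be mutated silently by later analysis code.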

Step 5. Compute bias-adjusted estimates under plausible violation scenarios

For unmeasured-confounding sensitivity, the E-value is the simplest tool: it is the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need to have with both the treatment and the outcome, conditional on the measured covariates, to fully explain away the observed effect. As a rough rule of thumb, E-values above 2.0 suggest the effect is fairly robust to unmeasured confounding and values below 1.5 suggest fragility, but the judgment depends on how strong known confounders in the field typically are. Rosenbaum bounds are the matched-design analog, expressing sensitivity through a parameter Γ, the factor by which an unmeasured confounder could shift the odds of treatment between matched units; the reported quantity is the smallest Γ at which the result would no longer hold.
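
The E-value calculation can be sketched in a few lines. This uses the closed-form expression from VanderWeele and Ding (2017) for a risk ratio, RR + sqrt(RR × (RR − 1)), after inverting protective estimates; the function names and the example estimate are hypothetical.

```python
import math


def e_value(rr):
    """E-value for a point estimate on the risk-ratio scale."""
    if rr <= 0:
        raise ValueError("risk ratio must be positive")
    if rr < 1:  # protective effect: work with the inverse
        rr = 1.0 / rr
    return rr + math.sqrt(rr * (rr - 1.0))


def e_value_ci(lower, upper):
    """E-value for the confidence limit closest to the null (1.0 if the CI covers 1)."""
    if lower <= 1.0 <= upper:
        return 1.0
    limit = lower if lower > 1.0 else upper
    return e_value(limit)


# Hypothetical headline estimate: RR 1.8, 95% CI 1.3 to 2.5.
print(round(e_value(1.8), 2))          # 3.0
print(round(e_value_ci(1.3, 2.5), 2))  # 1.92
```

Reporting the E-value for the confidence limit as well as for the point estimate shows how much confounding would be needed merely to move the interval to include the null, which is usually the more conservative number.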

For missing-data sensitivity, multiple imputation gives a pooled estimate that is valid under MAR; pattern-mixture models or tipping-point analyses extend the assessment to MNAR. Report the estimate under each plausible mechanism.
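
A tipping-point analysis can be illustrated with a deliberately simplified delta adjustment. The sketch assumes a continuous outcome, missingness only in the treated arm, and single mean imputation standing in for multiple imputation; it tracks where the point estimate crosses zero rather than where a confidence interval does. All names and data are hypothetical.

```python
from statistics import mean


def delta_adjusted_effect(treated_obs, control_obs, n_treated_missing, delta):
    """Difference in means when each missing treated outcome is imputed at
    (observed treated mean + delta). delta = 0 mimics a MAR-style imputation."""
    imputed_value = mean(treated_obs) + delta
    n_treated = len(treated_obs) + n_treated_missing
    treated_mean = (sum(treated_obs) + n_treated_missing * imputed_value) / n_treated
    return treated_mean - mean(control_obs)


def tipping_point(treated_obs, control_obs, n_treated_missing, deltas):
    """First delta in the supplied grid at which the estimated effect is <= 0."""
    for delta in deltas:
        effect = delta_adjusted_effect(treated_obs, control_obs, n_treated_missing, delta)
        if effect <= 0:
            return delta
    return None


# Hypothetical data: 4 observed and 4 missing outcomes in the treated arm.
treated = [5.0, 6.0, 7.0, 6.0]
control = [4.0, 5.0, 5.0, 6.0]
grid = [-0.5 * k for k in range(21)]  # 0.0, -0.5, ..., -10.0

print(delta_adjusted_effect(treated, control, 4, 0.0))  # 1.0
print(tipping_point(treated, control, 4, grid))         # -2.0
```

The interpretive question is then whether a shift of that size among the unobserved participants is clinically plausible. If it is, the MAR-based headline is fragile.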

For measurement-error sensitivity, SIMEX or probabilistic bias analysis produces bias-corrected estimates and intervals. Report both the corrected estimate and the magnitude of the correction relative to the headline.
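
As a rough illustration of probabilistic bias analysis, the sketch below corrects a cohort risk ratio for non-differential outcome misclassification by drawing sensitivity and specificity from uniform ranges. The ranges, counts, and function names are assumptions for illustration; a full analysis in the style of Lash, Fox, and Fink (2009) would also propagate random sampling error, which this sketch omits.

```python
import random
from statistics import median


def corrected_rr(a, n1, b, n0, se, sp):
    """Risk ratio after back-correcting observed case counts.

    a, b: observed cases among exposed and unexposed; n1, n0: group sizes.
    Returns None when the implied true case count is not positive.
    """
    denom = se + sp - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    true_a = (a - (1.0 - sp) * n1) / denom
    true_b = (b - (1.0 - sp) * n0) / denom
    if true_a <= 0 or true_b <= 0:
        return None
    # Note: with non-differential error the shared denominator cancels in
    # this ratio, so specificity drives the size of the correction.
    return (true_a / n1) / (true_b / n0)


def probabilistic_bias_analysis(a, n1, b, n0, se_range, sp_range, n_draws=5000, seed=0):
    """Return (2.5th percentile, median, 97.5th percentile) of corrected RRs."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        rr = corrected_rr(a, n1, b, n0, rng.uniform(*se_range), rng.uniform(*sp_range))
        if rr is not None:
            draws.append(rr)
    if not draws:
        raise ValueError("no admissible draws; check the bias-parameter ranges")
    draws.sort()
    lower = draws[int(0.025 * len(draws))]
    upper = draws[min(len(draws) - 1, int(0.975 * len(draws)))]
    return lower, median(draws), upper


# Hypothetical cohort: observed RR = (120/1000) / (80/1000) = 1.5.
print(probabilistic_bias_analysis(120, 1000, 80, 1000, (0.80, 0.95), (0.97, 0.995)))
```

In this hypothetical, every corrected estimate sits above the observed 1.5, consistent with non-differential misclassification of a binary outcome biasing the risk ratio toward the null.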

Step 6. Run leave-one-out and placebo diagnostics

Two diagnostics that earn their place across most causal designs:

Leave-one-out. Re-estimate the primary effect dropping one unit (one drug, one state, one cohort year) at a time. If a single unit drives the entire result, the design is fragile.
Placebo (or falsification) tests. Apply the treatment to a unit that should not have been treated and re-estimate. If the placebo shows a treatment effect, the design has a problem. Real placebos try to find an effect where none should exist; the diagnostic is whether the design survives the attempt.

Both are pre-specifiable. Both are credibility-positive when reported.
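
Both diagnostics can be sketched on a toy difference-in-differences panel. The estimator here is the simplest possible one (mean change in treated units minus mean change in control units), chosen only to keep the example short; the unit names and outcome values are hypothetical.

```python
from statistics import mean


def did(units):
    """Mean pre-to-post change in treated units minus that in control units."""
    treated = [mean(u["post"]) - mean(u["pre"]) for u in units.values() if u["treated"]]
    control = [mean(u["post"]) - mean(u["pre"]) for u in units.values() if not u["treated"]]
    return mean(treated) - mean(control)


def leave_one_out(units):
    """Re-estimate the effect with each unit dropped in turn."""
    return {
        name: did({k: v for k, v in units.items() if k != name})
        for name in units
    }


def placebo_units(units):
    """Pretend each control unit was treated and compare it with the other controls."""
    controls = {k: v for k, v in units.items() if not v["treated"]}
    return {
        name: did({k: {**v, "treated": k == name} for k, v in controls.items()})
        for name in controls
    }


# Hypothetical panel: three treated and three control units.
units = {
    "A": {"treated": True, "pre": [10.0, 10.0], "post": [13.0, 13.0]},
    "B": {"treated": True, "pre": [8.0, 8.0], "post": [10.0, 10.0]},
    "C": {"treated": True, "pre": [12.0, 12.0], "post": [19.0, 19.0]},
    "D": {"treated": False, "pre": [9.0, 9.0], "post": [10.0, 10.0]},
    "E": {"treated": False, "pre": [11.0, 11.0], "post": [12.5, 12.5]},
    "F": {"treated": False, "pre": [7.0, 7.0], "post": [7.5, 7.5]},
}

print(did(units))                 # 3.0
print(leave_one_out(units)["C"])  # 1.5
print(placebo_units(units))
```

In this toy panel, dropping unit C halves the estimate, which is the pattern leave-one-out is meant to expose. The placebo estimates give a sense of how large an "effect" arises among units that were never treated, a benchmark against which the headline can be judged.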

Step 7. Report all sensitivity results, not just the favorable ones

The single most common failure of sensitivity analysis in published work is selective reporting: the unfavorable sensitivity result is buried or omitted. Pre-specify the sensitivity table at design time, and report it in full at write-up time.

A transparent sensitivity table (with the headline estimate, the sensitivity-analysis-method-specific estimates, and the violation-scenario column) is more credible than a single headline number followed by a “robust to sensitivity analysis” assertion in the discussion.
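
A table of that shape is easy to generate directly from the analysis output, which removes the temptation to edit rows by hand. The sketch below is one possible layout; the column names follow the paragraph above, and every number in the example rows is made up for illustration.

```python
def format_sensitivity_table(rows):
    """Render (analysis, scenario, estimate, lower, upper) rows as aligned text."""
    header = ("Analysis", "Violation scenario", "Estimate (95% CI)")
    body = [
        (name, scenario, f"{est:.2f} ({lower:.2f}, {upper:.2f})")
        for name, scenario, est, lower, upper in rows
    ]
    widths = [max(len(r[i]) for r in [header] + body) for i in range(3)]

    def line(cells):
        return "  ".join(c.ljust(w) for c, w in zip(cells, widths)).rstrip()

    rule = ["-" * w for w in widths]
    return "\n".join([line(header), line(rule)] + [line(r) for r in body])


# Hypothetical results; every pre-specified row is printed, favorable or not.
rows = [
    ("Headline", "none", 1.80, 1.30, 2.50),
    ("Bias-adjusted", "confounder RR 2.0", 1.35, 0.98, 1.88),
    ("Tipping point", "delta = -2.0", 1.00, 0.72, 1.39),
]
print(format_sensitivity_table(rows))
```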

Worked examples

The NHANES cardiometabolic case study walks through case-definition sensitivity (NCEP ATP III, IDF 2005, JIS 2009) as a measurement-sensitivity exercise: how the headline prevalence shifts when the operationalization of the outcome shifts.

The Medicaid outliers case study walks through peer-group sensitivity (a related form of population-definition robustness) and methodological-procedure sensitivity (the opacity rule that handles “no information” answers in the BH-FDR setup, an explicit departure from textbook BH that the field note flags as v0.1’s biggest methodological stake).

The Part D insulin DiD case study is the most complete sensitivity-analysis example in the portfolio: placebo cap-years, leave-one-out, control-group composition sensitivity (drop GLP-1 RAs to absorb the concurrent demand shock), and prescriber-FE robustness as a specification check.
