Chapter 5: Monte Carlo simulation for biostatistics

Handbook of Biostatistics for Medical Research

When this chapter applies

Use this chapter when closed-form statistical formulas don’t fit your design well enough to answer a question. Monte Carlo simulation lets you generate synthetic data under a specified data-generating process (DGP), apply the estimator you plan to use, and observe how the estimator behaves over many replicates. The technique answers four kinds of questions:

How much sample size do you need (when closed-form sample-size formulas don’t apply)?
How biased is the estimator under realistic violations of its assumptions?
How well does the estimator’s nominal type-I error rate match its empirical type-I error rate?
How well-calibrated are its confidence intervals (does the 95% CI contain the true value 95% of the time)?

The deliverables a careful pass produces:

A written specification of the data-generating process
A pre-specified set of estimators to evaluate
The set of summary metrics (bias, MSE, coverage, power, type-I error rate)
A grid of DGP-parameter scenarios to vary
An annotated R or Python script that reproduces the simulation end-to-end
A table or figure reporting the metrics across scenarios

The decision framework

Seven steps for a defensible Monte Carlo simulation.

Step 1. Identify the question the simulation answers

Simulations answer specific questions. The common ones:

“How much sample size do I need for design X under DGP Y to achieve power Z?”
“How biased is estimator E when assumption A is violated by amount k?”
“What is the empirical type-I error rate of test T under DGP Y at nominal alpha 0.05?”
“How well-calibrated are the 95% confidence intervals from estimator E under realistic DGPs?”

Write the simulation question explicitly. The DGP and metrics follow from the question.

Step 2. Specify the data-generating process

The DGP is the synthetic-data recipe. Write it as a numbered procedure:

1. Sample N units.
2. Generate confounders from distribution F.
3. Assign treatment with probability p (or by an assignment rule, which may depend on the confounders).
4. Generate the outcome as a function of treatment and confounders (with a specified functional form).
5. Add noise from distribution G.

Document the parameters that will be varied across scenarios (sample size, treatment-effect magnitude, confounder strength, noise variance). The DGP is the structural model that the simulation tests the estimator against. The more realistic the DGP, the more credible the simulation; the more transparently the DGP is specified, the more reproducible the simulation.
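
A minimal Python sketch of such a recipe, assuming one standard-normal confounder, a logistic assignment rule, a linear outcome model, and Gaussian noise. The function name `simulate_dataset`, its parameters, and the specific distributions are illustrative assumptions, not taken from any particular study. Only the standard library is used, so the sketch runs without extra dependencies.

```python
import math
import random

def simulate_dataset(n, effect, confounder_strength, noise_sd, rng):
    """Draw one synthetic dataset as a list of (confounder, treatment, outcome) tuples.

    All distributional choices here are assumptions made for illustration.
    """
    rows = []
    for _ in range(n):
        # Confounder: F is taken to be N(0, 1).
        x = rng.gauss(0.0, 1.0)
        # Assignment rule: treatment probability rises with the confounder.
        p_treat = 1.0 / (1.0 + math.exp(-confounder_strength * x))
        t = 1 if rng.random() < p_treat else 0
        # Outcome: linear in treatment and confounder, plus noise G = N(0, noise_sd^2).
        y = effect * t + confounder_strength * x + rng.gauss(0.0, noise_sd)
        rows.append((x, t, y))
    return rows

# The parameters varied across scenarios are explicit arguments.
rng = random.Random(2024)  # recorded seed, so the draw can be reproduced
data = simulate_dataset(n=200, effect=0.5, confounder_strength=1.0, noise_sd=1.0, rng=rng)
print(len(data), data[0])
```

Passing the random generator in explicitly, rather than relying on global state, is one way to keep every replicate reproducible from a single recorded seed.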

Step 3. Choose the estimator(s) to evaluate

The simulation evaluates one or more estimators against the DGP. Standard practice is to compare:

The primary estimator you plan to use.
One or two alternative estimators, for benchmarking.
An oracle estimator (an estimator that knows the true DGP), where available, as the gold-standard comparator.

If the simulation is for sample-size planning, the estimator is fixed and the question is “what N produces the target power.”
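
As a rough illustration of a primary estimator alongside a benchmark, the sketch below compares a regression-adjusted treatment effect with an unadjusted difference in means. The row layout (confounder, treatment, outcome), the function names, and the toy data are assumptions made for this example. The adjusted estimate uses the standard partialling-out identity for ordinary least squares with one confounder.

```python
from statistics import fmean

def diff_in_means(rows):
    """Benchmark: unadjusted difference in mean outcome, treated minus control."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return fmean(treated) - fmean(control)

def _residualize(values, x):
    """Residuals from a least-squares fit of values on x, with an intercept."""
    mean_x, mean_v = fmean(x), fmean(values)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (vi - mean_v) for xi, vi in zip(x, values)) / sxx
    return [(vi - mean_v) - slope * (xi - mean_x) for xi, vi in zip(x, values)]

def adjusted_effect(rows):
    """Primary: coefficient on treatment from OLS of outcome on treatment and confounder.

    Computed by partialling the confounder out of both treatment and outcome
    (Frisch-Waugh-Lovell), which gives the same coefficient as the joint fit.
    """
    x = [r[0] for r in rows]
    t_res = _residualize([r[1] for r in rows], x)
    y_res = _residualize([r[2] for r in rows], x)
    return sum(a * b for a, b in zip(t_res, y_res)) / sum(a * a for a in t_res)

# Toy data with a known answer: y = 1 + 2*t + 3*x, and treated units have larger x.
toy = [(x, t, 1 + 2 * t + 3 * x) for x, t in zip([0, 1, 2, 3, 4, 5], [0, 0, 1, 0, 1, 1])]
print(diff_in_means(toy))    # confounded: far from the true effect of 2
print(adjusted_effect(toy))  # recovers the true effect of 2 up to rounding
```

Giving every estimator the same signature (rows in, point estimate out) lets the replicate loop apply all of them to each synthetic dataset without special cases.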

Step 4. Decide the number of simulation replicates

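The standard answer is 1,000 to 10,000 replicates. The right number depends on the precision you need for the summary metrics. For bias estimation, 1,000 replicates is usually adequate. For tail probabilities (type-I error rate, coverage at 95%), the Monte Carlo standard error of an estimated proportion p over R replicates is sqrt(p(1 − p)/R). Near p = 0.05 or p = 0.95 that is about 0.007 at 1,000 replicates and about 0.002 at 10,000, so 10,000 replicates is the better choice when you need to distinguish, say, a 5% error rate from a 6% one.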

A useful diagnostic: compute the Monte Carlo SE on each summary metric across replicates. If the Monte Carlo SE is large relative to the differences you’re trying to detect, increase the replicate count.
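
The diagnostic can be sketched in a few lines of Python. The helper names below are hypothetical. The formulas are the usual standard error of a mean and of a binomial proportion, applied across replicates.

```python
import math
from statistics import stdev

def mc_se_mean(values):
    """Monte Carlo SE of a metric that is a mean over replicates (e.g. bias)."""
    return stdev(values) / math.sqrt(len(values))

def mc_se_proportion(p_hat, n_reps):
    """Monte Carlo SE of a metric that is a proportion of replicates (e.g. coverage)."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n_reps)

def reps_for_target_se(p, target_se):
    """Replicates needed for a proportion near p to reach the target Monte Carlo SE."""
    return math.ceil(p * (1.0 - p) / target_se ** 2)

# Compare the precision of a type-I error estimate near 0.05 at two replicate counts.
for n_reps in (1_000, 10_000):
    print(n_reps, round(mc_se_proportion(0.05, n_reps), 4))

# Replicates needed for a Monte Carlo SE of a quarter of a percentage point.
print(reps_for_target_se(0.05, 0.0025))
```

Working backward from a target standard error, as in `reps_for_target_se`, turns the replicate count into a design decision rather than a convention.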

Step 5. Compute the simulation metrics

The standard metrics:

Bias. Mean of (estimate − true value) across replicates.
MSE. Mean of (estimate − true value)² across replicates. Decomposes as bias² + variance.
Coverage. Fraction of replicates where the 95% CI contains the true value. Should be close to 95%.
Type-I error rate. Under a null DGP, fraction of replicates that reject. Should be close to nominal alpha.
Power. Under a non-null DGP, fraction of replicates that reject. Should meet or exceed the design’s planned power.

Report metrics with their Monte Carlo SEs alongside the point estimates.
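
One possible way to compute these metrics together with their Monte Carlo SEs is sketched below. It assumes the replicate loop has already produced parallel lists of point estimates, confidence limits, and p-values. The function names and the four hand-made replicates are illustrative only.

```python
import math
from statistics import fmean, stdev

def summarize_replicates(estimates, ci_lower, ci_upper, true_value):
    """Return {metric: (point estimate, Monte Carlo SE)} across replicates."""
    n_reps = len(estimates)
    errors = [est - true_value for est in estimates]
    sq_errors = [err ** 2 for err in errors]
    covered = [lo <= true_value <= hi for lo, hi in zip(ci_lower, ci_upper)]
    coverage = sum(covered) / n_reps
    return {
        "bias": (fmean(errors), stdev(errors) / math.sqrt(n_reps)),
        "mse": (fmean(sq_errors), stdev(sq_errors) / math.sqrt(n_reps)),
        "coverage": (coverage, math.sqrt(coverage * (1.0 - coverage) / n_reps)),
    }

def rejection_rate(p_values, alpha=0.05):
    """Type-I error rate under a null DGP, power under a non-null DGP."""
    n_reps = len(p_values)
    rate = sum(p < alpha for p in p_values) / n_reps
    return rate, math.sqrt(rate * (1.0 - rate) / n_reps)

# Four hand-made replicates with true value 1.0 and CI half-width 0.15.
estimates = [0.9, 1.1, 1.0, 1.2]
summary = summarize_replicates(
    estimates,
    [e - 0.15 for e in estimates],
    [e + 0.15 for e in estimates],
    true_value=1.0,
)
print(summary)
print(rejection_rate([0.01, 0.20, 0.04, 0.50]))
```

Returning each metric as a (value, SE) pair makes it harder to report a point estimate without its uncertainty.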

Step 6. Vary the DGP parameters across realistic scenarios

Most simulations vary at least two parameters. The result is a table or heatmap of the summary metric across the scenario grid. Realistic scenarios should bracket the parameter values you expect in the actual study, plus optimistic and conservative scenarios that test the boundaries.

A sample-size simulation might cross sample size (50, 100, 200, 500) with effect-size magnitude (small, medium, large), producing a 4×3 power table. A bias-quantification simulation might cross unmeasured-confounder strength (none, modest, strong) with confounder prevalence (10%, 30%, 50%), producing a 3×3 bias table.
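
A sample-size grid of this shape could be sketched as follows. Several simplifications are assumed: a two-arm trial with equal allocation, unit-variance normal outcomes, a normal-reference test of the difference in means (slightly anti-conservative in the smallest arms), and standardized effects of 0.2, 0.5, and 0.8 standing in for small, medium, and large. The replicate count is kept deliberately low so the example runs quickly.

```python
import itertools
import math
import random
from statistics import NormalDist, fmean

def sample_var(values):
    """Unbiased sample variance."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def one_trial_rejects(n_total, effect, rng, alpha=0.05):
    """Simulate one two-arm trial and test the difference in means."""
    n_arm = n_total // 2
    control = [rng.gauss(0.0, 1.0) for _ in range(n_arm)]
    treated = [rng.gauss(effect, 1.0) for _ in range(n_arm)]
    se = math.sqrt(sample_var(control) / n_arm + sample_var(treated) / n_arm)
    z = (fmean(treated) - fmean(control)) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value < alpha

def power_table(sample_sizes, effects, n_reps, seed):
    """Return {(n_total, effect): (power, Monte Carlo SE)} over the full grid."""
    rng = random.Random(seed)
    table = {}
    for n_total, effect in itertools.product(sample_sizes, effects):
        rejections = sum(one_trial_rejects(n_total, effect, rng) for _ in range(n_reps))
        power = rejections / n_reps
        table[(n_total, effect)] = (power, math.sqrt(power * (1.0 - power) / n_reps))
    return table

sample_sizes = (50, 100, 200, 500)
effects = {"small": 0.2, "medium": 0.5, "large": 0.8}  # assumed standardized effects
table = power_table(sample_sizes, effects.values(), n_reps=200, seed=7)
for n_total in sample_sizes:
    print(n_total, [round(table[(n_total, d)][0], 2) for d in effects.values()])
```

With only 200 replicates per cell, the Monte Carlo SE of a power estimate near 50% is about 0.035, so a real planning exercise would raise `n_reps` before reading the table closely.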

Step 7. Report the simulation transparently with code

Reproducibility is the credibility test for simulation-based methodology. The simulation script (DGP, estimator, replicate loop, metric computation) should be published alongside the protocol or as supplementary material to the paper.

A transparent simulation report includes the question, the DGP specification, the estimator(s), the replicate count, the metrics, the scenario grid, and the script. A simulation reported only as “we ran a simulation that showed…” is not a simulation; it’s an assertion.
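
One lightweight way to keep those pieces together is a machine-readable manifest written next to the script. The sketch below is an assumption about how that might look: the field names, the values, and the stand-in `run_simulation` function are all hypothetical placeholders.

```python
import json
import platform
import random
from statistics import fmean

def run_simulation(seed, n_reps):
    """Stand-in for the real replicate loop: mean of n_reps standard-normal draws."""
    rng = random.Random(seed)
    return fmean([rng.gauss(0.0, 1.0) for _ in range(n_reps)])

# Every value below is a placeholder for illustration.
manifest = {
    "question": "Empirical type-I error rate of test T under DGP Y at nominal alpha 0.05",
    "dgp": {"n": 200, "effect": 0.0, "noise_sd": 1.0},
    "estimators": ["primary", "benchmark"],
    "n_reps": 1000,
    "metrics": ["type_i_error", "coverage"],
    "scenario_grid": {"n": [50, 100, 200, 500]},
    "seed": 20240501,
    "python_version": platform.python_version(),
}
manifest["result"] = run_simulation(manifest["seed"], manifest["n_reps"])

# Rerunning from the recorded seed should reproduce the result exactly.
assert run_simulation(manifest["seed"], manifest["n_reps"]) == manifest["result"]

print(json.dumps(manifest, indent=2))
```

Recording the seed and the interpreter version alongside the results lets a reader rerun the script and compare the output exactly.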

Worked example

None of the published portfolio case studies uses Monte Carlo simulation centrally. The NHANES cardiometabolic case study uses survey-weighted estimation. The Medicaid outliers case study uses BH-FDR with normal-reference p-values, which the field note flags as a calibration weakness; a permutation-based simulation would be the v2 fix and is exactly the kind of question this chapter addresses. The Part D insulin DiD case study uses cluster-robust inference on a relatively standard DiD. The full chapter walks through three worked simulations: a sample-size simulation for a cluster-randomized trial with informative dropout, a bias-quantification simulation for a propensity-score estimator under unmeasured-confounding violation, and a permutation-based null for the Medicaid BH-FDR setup that fixes the heavy-tail null-mismatch issue.
