Monte Carlo simulation for biostatistics

From Data to Bedside · the full write-up

When this applies

Use this write-up when closed-form statistical formulas don’t fit your design well enough to answer a question: a multilevel or longitudinal structure, informative dropout, an adaptive rule, a heavy-tailed null, a complex composite endpoint. The ladder already says what a simulation is and what it answers. The gap it leaves is the recipe, and the recipe is where simulations actually go wrong: a data-generating process that doesn’t match the real failure mode, too few replicates to resolve a tail probability, or a result reported without the code that produced it. This write-up is that recipe.

The deliverables a careful pass produces:

A written specification of the data-generating process
A pre-specified set of estimators to evaluate
The set of summary metrics (bias, MSE, coverage, power, type-I error rate)
A grid of DGP-parameter scenarios to vary
An annotated R or Python script that reproduces the simulation end-to-end
A table or figure reporting the metrics across scenarios

The decision framework

Seven steps for a defensible Monte Carlo simulation.

Step 1. Identify the question the simulation answers

A simulation answers one specific question, and the question fixes everything downstream: the data-generating process and the summary metric both fall out of it. “What \(N\) gives 80% power for design \(X\) under DGP \(Y\)?” fixes the metric (power) and the quantity you vary (\(N\)); “how biased is estimator \(E\) when assumption \(A\) is violated by amount \(k\)?” fixes the metric (bias) and the DGP knob (\(k\)). Write the question first, in that form, and the DGP and metric follow from it rather than being chosen ad hoc.

Step 2. Specify the data-generating process

The DGP is the synthetic-data recipe. Write it as a numbered procedure:

Sample \(N\) units.
Assign treatment with probability \(p\) (or by some assignment rule).
Generate confounders from distribution \(F\).
Generate the outcome as a function of treatment and confounders (with a specified functional form).
Add noise from distribution \(G\).

Document the parameters that will be varied across scenarios (sample size, treatment-effect magnitude, confounder strength, noise variance). The DGP is the structural model that the simulation tests the estimator against. The more realistic the DGP, the more credible the simulation; the more transparently the DGP is specified, the more reproducible the simulation.
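The numbered procedure above can be sketched in Python as a single function. This is a minimal sketch under stated assumptions, not a reference implementation: the name `simulate_dataset`, a single standard-normal confounder for \(F\), a logistic assignment rule, a linear outcome model, and normal noise for \(G\) are all illustrative choices that a real study would replace with its own.

```python
import numpy as np

def simulate_dataset(rng, n=200, p_treat=0.5, assign_slope=0.0,
                     effect=0.5, confounder_strength=1.0, noise_sd=1.0):
    """Draw one synthetic dataset (all distributional choices are assumptions)."""
    # Confounder from F, assumed standard normal. It is drawn before treatment
    # only because the assignment rule below may depend on it.
    x = rng.normal(0.0, 1.0, size=n)
    # Assignment rule: logistic in x. With assign_slope=0 this reduces to
    # plain randomization with probability p_treat.
    logit = np.log(p_treat / (1.0 - p_treat)) + assign_slope * x
    treat = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))
    # Outcome: assumed linear in treatment and confounder, plus noise from G
    # (assumed normal with standard deviation noise_sd).
    y = effect * treat + confounder_strength * x + rng.normal(0.0, noise_sd, size=n)
    return {"treat": treat, "x": x, "y": y}

rng = np.random.default_rng(2024)
data = simulate_dataset(rng, n=500, assign_slope=1.0)
print({name: values[:3] for name, values in data.items()})
```

Every keyword argument is a scenario knob, so the parameters to vary are visible in the function signature rather than buried in the body.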

Step 3. Choose the estimator(s) to evaluate

The simulation evaluates one or more estimators against the DGP. Standard practice is to compare:

The primary estimator you plan to use.
One or two alternative estimators, for benchmarking.
An oracle estimator (an estimator that knows the true DGP), where available, as the gold-standard comparator.

If the simulation is for sample-size planning, the estimator is fixed and the question is “what \(N\) produces the target power?”
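One way to keep the comparison honest is to give every estimator the same signature, so the replicate loop treats them identically. The sketch below is illustrative only: the estimator names, the toy confounded DGP (true effect 0.5, confounder coefficient 1.0), and the choice of OLS adjustment as the "primary" estimator are assumptions made for the example.

```python
import numpy as np

def diff_in_means(treat, x, y):
    """Alternative: unadjusted difference in means (ignores the confounder)."""
    return y[treat == 1].mean() - y[treat == 0].mean()

def ols_adjusted(treat, x, y):
    """Primary: coefficient on treatment from OLS of y on treatment and x."""
    design = np.column_stack([np.ones(len(y)), treat, x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

def make_oracle(true_confounder_coef):
    """Oracle: removes the confounder's true contribution, which only the
    simulation knows, then takes a difference in means."""
    def oracle(treat, x, y):
        adjusted = y - true_confounder_coef * x
        return adjusted[treat == 1].mean() - adjusted[treat == 0].mean()
    return oracle

ESTIMATORS = {
    "unadjusted": diff_in_means,
    "ols_adjusted": ols_adjusted,
    "oracle": make_oracle(1.0),
}

# Toy confounded data (assumed DGP): true effect 0.5, confounder coefficient 1.0.
rng = np.random.default_rng(7)
x = rng.normal(size=5000)
treat = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))
y = 0.5 * treat + 1.0 * x + rng.normal(size=5000)

results = {name: float(est(treat, x, y)) for name, est in ESTIMATORS.items()}
print(results)
```

In this toy setup the unadjusted estimate lands well above the true effect, while the adjusted and oracle estimates land near it. That is the benchmarking pattern the list above describes.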

Step 4. Decide the number of simulation replicates

The standard answer is 1,000 to 10,000 replicates, but the right number depends on the precision you need for the summary metrics. For bias estimation, 1,000 replicates is usually adequate. Tail probabilities (type-I error rate, coverage at 95%) are proportions, so their Monte Carlo SE is \(\sqrt{p(1-p)/R}\) for \(R\) replicates: at \(p = 0.05\) that is about 0.007 with 1,000 replicates and about 0.002 with 10,000.

A useful diagnostic: compute the Monte Carlo SE on each summary metric across replicates. If the Monte Carlo SE is large relative to the differences you’re trying to detect, increase the replicate count.
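For metrics that are proportions (type-I error, coverage, power), the diagnostic can be run before any simulation, because the Monte Carlo SE follows the binomial formula \(\sqrt{p(1-p)/R}\). The helper names below are hypothetical, and the inputs are illustrative rather than recommendations.

```python
import math

def mc_se_proportion(p, n_reps):
    """Monte Carlo SE of an estimated proportion p from n_reps replicates."""
    return math.sqrt(p * (1.0 - p) / n_reps)

def reps_for_target_se(p, target_se):
    """Smallest replicate count whose Monte Carlo SE is at most target_se."""
    return math.ceil(p * (1.0 - p) / target_se ** 2)

# Precision of a type-I error estimate near the nominal 0.05.
for n_reps in (1_000, 10_000):
    print(n_reps, round(mc_se_proportion(0.05, n_reps), 4))

# Replicates needed to estimate 95% coverage to within an SE of 0.0025.
print(reps_for_target_se(0.95, 0.0025))
```

With 1,000 replicates the SE near 0.05 is about 0.0069, so an observed rate of 0.06 is not clearly distinguishable from nominal. With 10,000 replicates the SE drops to about 0.0022.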

Step 5. Compute the simulation metrics

Each metric has a target, and the gap from it is the finding:

Metric	What it measures	What good looks like
Bias	mean of \((\text{estimate} - \text{true value})\) across replicates	\(\approx 0\); a bias that does not shrink as \(N\) grows is a real estimator flaw, not noise
MSE	mean of \((\text{estimate} - \text{true value})^2\), which decomposes as \(\text{bias}^2 + \text{variance}\)	small, and the \(\text{bias}^2\)-vs-variance split tells you whether to fix the estimator or just collect more data
Coverage	fraction of replicates whose 95% CI contains the true value	\(\approx 95\%\); materially below means the intervals are too narrow (anticonservative), materially above means they are needlessly wide (conservative)
Type-I error	under a null DGP, fraction of replicates that reject	\(\approx\) nominal \(\alpha\) (0.05); above it means the test over-rejects and the p-values can’t be trusted
Power	under a non-null DGP, fraction of replicates that reject	matches the design’s planned power (e.g. 80%) at the effect size that matters

The MSE row is the one worth writing out, because the decomposition is what tells you which fix to reach for:

\[ \text{MSE} = \text{bias}^2 + \text{variance} \]

where:

\(\text{MSE}\) is the mean squared error of the estimator across replicates, the mean of \((\text{estimate} - \text{true value})^2\)
\(\text{bias}^2\) is the squared bias, the systematic part of the error; when the estimator itself is flawed, this part does not shrink as \(N\) grows
\(\text{variance}\) is the variance of the estimate across replicates, the part that does shrink as \(N\) grows

A large \(\text{bias}^2\) term points to an estimator flaw, so the fix is a better estimator; a large \(\text{variance}\) term points to sampling noise, so the fix is more data.
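The decomposition can be checked directly from the replicate estimates. As a sketch, using notation that is not in the text above (\(\hat\theta_r\) for the estimate in replicate \(r\), \(\theta\) for the true value, \(R\) for the replicate count, and \(\bar\theta = \frac{1}{R}\sum_r \hat\theta_r\) for the mean estimate):

\[ \frac{1}{R}\sum_{r=1}^{R} (\hat\theta_r - \theta)^2 = (\bar\theta - \theta)^2 + \frac{1}{R}\sum_{r=1}^{R} (\hat\theta_r - \bar\theta)^2 \]

The cross term \(2(\bar\theta - \theta)\cdot\frac{1}{R}\sum_r(\hat\theta_r - \bar\theta)\) vanishes because deviations from the mean sum to zero. The identity is exact when the variance uses the divisor \(R\); with the usual \(R - 1\) divisor it holds only approximately.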

Report every metric with its Monte Carlo SE alongside the point estimate, so a “coverage 0.94” is read against whether that differs from 0.95 by more than simulation noise.
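A sketch of how the metrics and their Monte Carlo SEs might be computed from per-replicate results. The function name `summarize`, the toy DGP (a normal mean with true value 0.3 and \(N = 50\)), and the normal-approximation interval with critical value 1.96 are assumptions made for illustration.

```python
import numpy as np

def summarize(estimates, ci_lower, ci_upper, rejected, truth):
    """Return {metric: (value, Monte Carlo SE)} across replicates."""
    n_reps = len(estimates)
    errors = estimates - truth
    covered = (ci_lower <= truth) & (truth <= ci_upper)

    def mean_and_se(values):
        # Mean across replicates and its SE. For 0/1 indicators this is
        # essentially the binomial sqrt(p(1-p)/R).
        values = np.asarray(values, dtype=float)
        return float(values.mean()), float(values.std(ddof=1) / np.sqrt(n_reps))

    return {
        "bias": mean_and_se(errors),
        "mse": mean_and_se(errors ** 2),
        "coverage": mean_and_se(covered),
        # Type-I error under a null DGP, power under a non-null DGP.
        "rejection_rate": mean_and_se(rejected),
    }

# Toy replicates (assumed DGP): sample mean of N=50 normal draws, truth 0.3.
rng = np.random.default_rng(11)
truth, n, n_reps = 0.3, 50, 2000
samples = rng.normal(truth, 1.0, size=(n_reps, n))
estimates = samples.mean(axis=1)
se = samples.std(axis=1, ddof=1) / np.sqrt(n)
ci_lower, ci_upper = estimates - 1.96 * se, estimates + 1.96 * se
rejected = (ci_lower > 0.0) | (ci_upper < 0.0)  # two-sided test of H0: mean = 0

summary = summarize(estimates, ci_lower, ci_upper, rejected, truth)
for name, (value, mcse) in summary.items():
    print(f"{name:15s} {value:7.4f}  (MC SE {mcse:.4f})")
```

Because the toy DGP is non-null, the rejection rate here is read as power. Rerunning with `truth = 0.0` would turn the same number into a type-I error estimate.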

Step 6. Vary the DGP parameters across realistic scenarios

Most simulations vary at least two parameters, and the result is a table or heatmap of the summary metric across the scenario grid. The grid should bracket the parameter values you expect in the actual study and add optimistic and conservative scenarios that test the boundaries.

A sample-size simulation might cross sample size (50, 100, 200, 500) with effect-size magnitude (small, medium, large), producing a \(4 \times 3\) power table. A bias-quantification simulation might cross unmeasured-confounder strength (none, modest, strong) with confounder prevalence (10%, 30%, 50%), producing a \(3 \times 3\) bias table.
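The \(4 \times 3\) power table could be produced along these lines. This is a sketch with several assumptions not in the text: sample size is read as per arm, "small, medium, large" are mapped to 0.2, 0.5 and 0.8 standard deviations, outcomes are normal with unit variance, and the test is a two-sided z-type test with critical value 1.96.

```python
import itertools
import numpy as np

SAMPLE_SIZES = (50, 100, 200, 500)                          # per arm (assumption)
EFFECT_SIZES = {"small": 0.2, "medium": 0.5, "large": 0.8}  # in SD units (assumption)

def simulated_power(rng, n_per_arm, effect, n_reps=1000, crit=1.96):
    """Share of replicates in which a two-sided z-type test rejects."""
    control = rng.normal(0.0, 1.0, size=(n_reps, n_per_arm))
    treated = rng.normal(effect, 1.0, size=(n_reps, n_per_arm))
    diff = treated.mean(axis=1) - control.mean(axis=1)
    se = np.sqrt((treated.var(axis=1, ddof=1) + control.var(axis=1, ddof=1)) / n_per_arm)
    return float(np.mean(np.abs(diff / se) > crit))

rng = np.random.default_rng(42)
power = {
    (n, label): simulated_power(rng, n, effect)
    for n, (label, effect) in itertools.product(SAMPLE_SIZES, EFFECT_SIZES.items())
}

# Print the grid as a table: rows are sample sizes, columns are effect sizes.
print("n_per_arm " + " ".join(f"{label:>7s}" for label in EFFECT_SIZES))
for n in SAMPLE_SIZES:
    print(f"{n:9d} " + " ".join(f"{power[(n, label)]:7.3f}" for label in EFFECT_SIZES))
```

Keeping the grid in two named constants means the scenario definition is reported in the same place the code reads it from.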

Step 7. Report the simulation transparently with code

Reproducibility is the credibility test for simulation-based methodology. The simulation script (DGP, estimator, replicate loop, metric computation) should be published alongside the protocol or as supplementary material to the paper.

A transparent simulation report includes the question, the DGP specification, the estimator(s), the replicate count, the metrics, the scenario grid, and the script. A simulation reported only as “we ran a simulation that showed…” is not a simulation; it’s an assertion.
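In a script, much of that checklist reduces to two habits: seed every random stream from a recorded value, and write the configuration out next to the results. The sketch below assumes a deliberately trivial simulation (the sample mean of normal data). `CONFIG`, `run_simulation`, and the manifest fields are hypothetical names, not a standard.

```python
import json
import platform
import numpy as np

CONFIG = {"seed": 20240501, "n_reps": 200, "n": 100, "true_mean": 1.0, "noise_sd": 2.0}

def run_simulation(config):
    """Replicate loop with one independent random stream per replicate."""
    streams = np.random.SeedSequence(config["seed"]).spawn(config["n_reps"])
    estimates = np.empty(config["n_reps"])
    for i, stream in enumerate(streams):
        rng = np.random.default_rng(stream)
        y = rng.normal(config["true_mean"], config["noise_sd"], size=config["n"])
        estimates[i] = y.mean()  # estimator: sample mean
    return estimates

estimates = run_simulation(CONFIG)

# Manifest: everything a reader needs to rerun and compare.
manifest = json.dumps({
    "config": CONFIG,
    "python": platform.python_version(),
    "numpy": np.__version__,
    "bias": float(estimates.mean() - CONFIG["true_mean"]),
    "mc_se": float(estimates.std(ddof=1) / np.sqrt(CONFIG["n_reps"])),
}, indent=2)
print(manifest)
```

Spawning one stream per replicate means any single replicate can be regenerated on its own, which helps when one scenario produces a result that needs debugging.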

Where this shows up

None of the published portfolio case studies uses Monte Carlo simulation centrally. The NHANES cardiometabolic case study uses survey-weighted estimation. The Medicaid outliers case study uses BH-FDR with normal-reference p-values, which the field note flags as a calibration weakness; a permutation-based simulation would be the v2 fix and is exactly the kind of question this write-up walks through. The Part D insulin DiD case study uses cluster-robust inference on a relatively standard DiD. Three simulations fit this recipe directly: a sample-size simulation for a cluster-randomized trial with informative dropout, a bias-quantification simulation for a propensity-score estimator when the no-unmeasured-confounding assumption is violated, and a permutation-based null for the Medicaid BH-FDR setup that addresses the mismatch between the normal reference and the heavy-tailed null.