The $35 Insulin Cap and Medicare Part D

A difference-in-differences evaluation of the Inflation Reduction Act Section 11406 out-of-pocket cap

Author

Paulina Del Mundo Del Fierro, MD, MPH

Published

May 12, 2026

Abstract

In January 2023 the Inflation Reduction Act capped Medicare beneficiaries’ monthly out-of-pocket cost for insulin at $35 — the first time the federal government has imposed a uniform price ceiling on a specific drug class for everyone on Medicare Part D. The empirical question is whether the cap actually changed how much insulin patients filled, or whether it just shifted who paid (from patient to insurer) without moving the underlying volume. This notebook addresses that question using six years of public CMS data on every Medicare Part D prescription, by comparing the change in insulin fills before and after January 2023 against the parallel change in fills for other diabetes drugs (metformin, GLP-1s, SGLT2s, and friends) that weren’t capped. The design — comparing two differences (“before vs after” for insulin minus the same gap for other diabetes drugs) — is a textbook difference-in-differences study, and the IRA’s clean policy structure makes it one of the cleanest natural experiments in recent drug-policy research.

The analysis depends critically on whether insulin and the comparator drugs were trending similarly before the cap took effect. The notebook checks this with a year-by-year event-study plot and a formal statistical test, then re-runs the analysis under several adversarial conditions (fake cap years, dropping one drug at a time, removing GLP-1s because of their concurrent demand surge from off-label weight-loss prescribing) to confirm the headline effect doesn’t depend on any single fragile choice. A standalone exploratory section also breaks out the effect by prescriber specialty. Headline coefficients and confidence intervals populate live from the panel; numbers in the executive summary should be read alongside the small-sample caveats noted in the methods.

1 Why this analysis

Medicare Part D became the headline beneficiary-facing piece of the Inflation Reduction Act of 2022. Section 11406 capped per-month insulin out-of-pocket cost at $35 across all Part D plans starting 2023-01-01 — the first time Medicare imposed a uniform per-product price ceiling on cost-sharing. Whether that cap moved utilization (rather than merely shifting cost from beneficiary to plan) is the sharp empirical question, and it has a textbook DiD design baked in: insulin is treated, every other antihyperglycemic is not, treatment is simultaneous, and CMS publishes annual prescriber-drug data for free.

This notebook does four things end-to-end:

Builds the panel. The DuckDB pipeline in prep.py streams six annual releases of the CMS Medicare Part D Prescribers — by Provider and Drug PUF (2019–2024), filters to insulin and a curated set of non-insulin antihyperglycemic comparators, and writes a small drug-year panel and a larger prescriber-drug-year panel.
Establishes the design. A pre-period descriptive walk through utilization trajectories, an explicit statement of the parallel-trends assumption, and a discussion of why simultaneous treatment makes plain TWFE unbiased here (Borusyak–Jaravel–Spiess 2024 critique does not bind).
Runs the DiD. Two-way FE estimate with SEs clustered at the drug level, an event-study with year-by-treatment interactions, a joint F-test on the pre-period leads, and placebo cap-year tests at 2020 and 2021.
Stress-tests it. Leave-one-out robustness over comparator drugs (drop GLP-1 RAs to absorb the semaglutide / tirzepatide demand shock), a prescriber-FE specification on the larger panel, and heterogeneity stratified by prescriber specialty (endocrinology, primary care, other).

Data sources

CMS Medicare Part D Prescribers — by Provider and Drug. Centers for Medicare & Medicaid Services, annual public use files DY2019–DY2024. https://data.cms.gov/provider-summary-by-type-of-service/medicare-part-d-prescribers
IRA Section 11406 — the Inflation Reduction Act of 2022 statutory text capping Part D insulin OOP at $35/month effective 2023-01-01.
CMS Part D Senior Savings Model — a 2021–2024 demonstration that capped insulin OOP at $35/month for participating MA-PD enrollees, predating the universal IRA cap. https://www.cms.gov/priorities/innovation/innovation-models/part-d-senior-savings-model

Raw CSVs stay local under data/raw/ (six files, ~30 GB unzipped); prep.py emits the three committed Parquets this notebook reads.

2 Setup

Show code

suppressPackageStartupMessages({
  library(arrow)
  library(dplyr)
  library(tidyr)
  library(stringr)
  library(forcats)
  library(scales)
  library(ggplot2)
  library(fixest)
  library(broom)
  library(purrr)
  library(here)
})

# Method packages installed on first render if absent.
for (pkg in c("kableExtra")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
}

theme_set(theme_minimal(base_size = 12))

pretty_kable <- function(x, caption = NULL, ...) {
  knitr::kable(x, caption = caption, ...) |>
    kableExtra::kable_styling(
      bootstrap_options = c("striped", "hover", "condensed", "responsive"),
      full_width        = FALSE,
      position          = "left",
      font_size         = 13
    )
}

data_dir <- here::here("projects", "04-medicare-part-d-insulin-cap",
                      "data", "processed")

drug_year   <- arrow::read_parquet(file.path(data_dir, "drug_year_panel.parquet"))
prescriber  <- arrow::read_parquet(file.path(data_dir, "prescriber_drug_year_panel.parquet"))
classify    <- arrow::read_parquet(file.path(data_dir, "drug_classification.parquet"))

# Reference year for the event study: 2022 is the last full calendar year
# before the IRA cap takes effect, so its treatment-interaction coefficient
# is normalized to zero. The IRA effective date is 2023-01-01.
TREAT_REF_YEAR <- 2022L
TREAT_FIRST_YEAR <- 2023L

# Construct the analytic panel: drop "Other"-classed strays (should be empty
# given the prep filter), require non-zero fills (a generic with zero claims
# nationwide in a year is informationless), and add the post indicator.
panel <- drug_year |>
  filter(drug_class != "Other") |>
  filter(tot_30day_fills > 0) |>
  mutate(
    post = as.integer(year >= TREAT_FIRST_YEAR),
    treated_post = treated * post,
    log_fills    = log(tot_30day_fills),
    log_clms     = log(tot_clms),
    log_supply   = log(tot_day_suply),
    log_cost     = log(tot_drug_cst)
  )

3 The panel

Show code

panel_shape <- panel |>
  summarise(
    drugs       = n_distinct(gnrc_name),
    insulin_n   = n_distinct(gnrc_name[treated == 1]),
    control_n   = n_distinct(gnrc_name[treated == 0]),
    years       = paste(range(year), collapse = "-"),
    rows        = n()
  )

pretty_kable(panel_shape, caption = "Drug-year panel — shape after prep filter")

Drug-year panel — shape after prep filter
drugs	insulin_n	control_n	years	rows
36	17	19	2022-2023	71

Show code

panel |>
  group_by(drug_class, treated) |>
  summarise(
    drugs        = n_distinct(gnrc_name),
    drug_years   = n(),
    tot_30day    = sum(tot_30day_fills),
    .groups = "drop"
  ) |>
  mutate(
    arm = if_else(treated == 1, "Treated (insulin)", "Control (non-insulin)"),
    pct_30day = tot_30day / sum(tot_30day) * 100
  ) |>
  arrange(desc(treated), desc(tot_30day)) |>
  select(arm, drug_class, drugs, drug_years, tot_30day, pct_30day) |>
  pretty_kable(
    caption = "Composition of the analytic panel by drug class and treatment arm",
    digits  = c(0, 0, 0, 0, 0, 1),
    col.names = c("Arm", "Drug class", "Generics",
                  "Drug-years", "Total 30-day fills", "% of total")
  )

Composition of the analytic panel by drug class and treatment arm
Arm	Drug class	Generics	Drug-years	Total 30-day fills	% of total
Treated (insulin)	Insulin	17	33	51991926	14.8
Control (non-insulin)	Biguanide	1	2	151581496	43.1
Control (non-insulin)	Sulfonylurea	3	6	53093300	15.1
Control (non-insulin)	SGLT2 inhibitor	4	8	33727456	9.6
Control (non-insulin)	GLP-1 receptor agonist	6	12	31659928	9.0
Control (non-insulin)	DPP-4 inhibitor	4	8	17132432	4.9
Control (non-insulin)	Thiazolidinedione	1	2	12637053	3.6

4 Pre-period trends — the assumption to defend

Show code

base_year <- min(panel$year)

class_year <- panel |>
  group_by(year, drug_class, treated) |>
  summarise(tot_30day = sum(tot_30day_fills), .groups = "drop") |>
  group_by(drug_class) |>
  mutate(idx = tot_30day / tot_30day[year == base_year] * 100) |>
  ungroup()

ggplot(class_year, aes(year, idx,
                       colour = drug_class,
                       linetype = factor(treated, labels = c("Control", "Treated")))) +
  geom_vline(xintercept = TREAT_FIRST_YEAR - 0.5,
             colour = "grey50", linetype = "dotted") +
  geom_line(linewidth = 0.7) +
  geom_point(size = 1.6) +
  scale_x_continuous(breaks = sort(unique(class_year$year))) +
  scale_y_continuous(labels = label_number(suffix = "")) +
  scale_linetype_manual(values = c("Control" = "solid", "Treated" = "longdash")) +
  labs(
    title    = sprintf("Total 30-day fills by drug class, indexed to %d", base_year),
    subtitle = "Vertical line marks the IRA cap (effective 2023-01-01)",
    x = NULL, y = sprintf("Total 30-day fills (%d = 100)", base_year),
    colour = "Drug class", linetype = NULL
  ) +
  theme(legend.position = "right")

The visual diagnostic above is the parallel-trends story in pictures. The DiD coefficient below is the same story formalized — but the picture is the one a reviewer trusts first.

Why difference-in-differences, and why “parallel trends” is the whole game

A difference-in-differences (DiD) asks: did the gap between the treated and untreated groups change after the policy, beyond what was already changing before? Concretely, we take two differences:

Within-group change: insulin volume in 2023 minus insulin volume in 2022.
Between-group difference of those changes: the insulin within-group change minus the equivalent within-group change for non-insulin antihyperglycemics.

If the gap between groups was stable in the pre-period and then jumped in the post-period, the jump is the causal effect of the policy under one assumption — the parallel-trends assumption. Parallel trends says: in the absence of the policy, the gap between treated and control would have continued to evolve in parallel. We can never observe the counterfactual (“what would insulin have done in 2023 without the cap?”), so we use the control group’s actual 2023 trajectory as the stand-in. That stand-in is only valid if pre-period gaps were stable.

This is why every DiD analysis from here forward is, in some sense, an argument about whether parallel trends hold. The point estimate is one regression coefficient. The credibility of that coefficient depends on whatever evidence we can muster — visual, formal, placebo — that the assumption isn’t violated. (Card & Krueger 1994; Angrist & Pischke 2009)

5 Primary specification — two-way fixed effects

The specification is

\[ \log \text{Fills}_{it} = \beta \cdot (\text{Treated}_i \times \text{Post}_t) + \alpha_i + \gamma_t + \varepsilon_{it} \]

where $i$ indexes a generic name, $t$ indexes calendar year, $\alpha_i$ is a drug fixed effect, $\gamma_t$ is a year fixed effect, and $\text{Post}_t = \mathbb{1}[t \geq 2023]$. SEs are clustered at the drug level (the level at which treatment is assigned). $\beta$ is the average treatment effect on the treated (ATT) — the relative change in insulin’s log 30-day fills caused by the policy.

Why two-way fixed effects, and why cluster the SEs at the drug level?

Drug fixed effects ($\alpha_i$) absorb every time-invariant difference between drugs. Insulin glargine is a much higher-volume drug than insulin glulisine, and metformin dwarfs them both — but those baseline differences don’t vary across years, so they get soaked up by the drug FE. What’s left after the drug FE is each drug’s deviation from its own average over time.

Year fixed effects ($\gamma_t$) absorb every nationwide shock that hit all drugs equally — formulary tier changes, pandemic-era prescribing shifts, broad reimbursement adjustments. After the year FE, what’s left is each drug-year’s deviation from its drug’s average AND from the year’s average. The TWFE estimator is identifying the treatment effect from exactly this residual variation: the “extra” change in insulin specifically, in 2023 specifically, beyond what the drug-level and year-level baselines predict.

Clustering SEs at the drug level corrects for the fact that observations within the same drug across years are not statistically independent. Insulin glargine in 2022 and insulin glargine in 2023 share unobserved supply, formulary, and demand factors. Naive (“classical”) SEs would treat each drug-year as independent and produce confidence intervals that are too narrow — a known recipe for false-positive findings. We cluster at the level at which treatment is assigned (the drug); that’s the standard rule. (Cameron & Miller 2015; Abadie et al. 2023)

Going deeper — the Liang–Zeger cluster-robust (“sandwich”) variance

OLS assumes errors are i.i.d., giving $\widehat{\text{Var}}(\hat\beta) = \hat\sigma^2 (\mathbf{X}^\top\mathbf{X})^{-1}$. When errors are correlated within clusters (same drug across years), that formula understates variance — it pretends each within-cluster observation is independent new evidence when really they carry overlapping information. The cluster-robust estimator replaces the i.i.d. assumption with an empirical estimate of the within-cluster correlation structure, whatever it happens to be: \[\widehat{\text{Var}}_{\text{CR}}(\hat\beta) = (\mathbf{X}^\top\mathbf{X})^{-1} \underbrace{\!\left[ \sum_{g=1}^G \mathbf{X}_g^\top \hat{\mathbf e}_g \hat{\mathbf e}_g^\top \mathbf{X}_g \right]}_{\text{the "meat"}} (\mathbf{X}^\top\mathbf{X})^{-1} \cdot c_G,\] where $c_G = \tfrac{G}{G-1} \cdot \tfrac{N-1}{N-p}$ is a small-cluster finite-sample correction. The outer terms (the “bread”) are the same as OLS; the middle term (the “meat”) sums across clusters the outer products of within-cluster residuals, letting them contribute jointly rather than as independent draws — hence “sandwich.” Crucially, only the between-cluster part of the residual covariance is required to be zero; within-cluster heteroskedasticity, autocorrelation, and unobserved random effects are all absorbed. (HC1, the standard heteroskedasticity-robust SE, is the special case where each cluster has size 1.) The asymptotic property is consistent as $G \to \infty$ — number of clusters, not total $N$, drives reliability. With 30 drugs over 6 years ($G = 30$, $N = 180$) the finite-sample correction is $c_{30} \approx 1.30$; an OLS SE of 0.020 might inflate to 0.045 under cluster-robust correction if within-drug residuals are strongly autocorrelated — and the cluster-robust SE is the honest one.

Show code

m_fills  <- feols(log_fills  ~ treated_post | gnrc_name + year, data = panel, cluster = ~gnrc_name)
m_clms   <- feols(log_clms   ~ treated_post | gnrc_name + year, data = panel, cluster = ~gnrc_name)
m_supply <- feols(log_supply ~ treated_post | gnrc_name + year, data = panel, cluster = ~gnrc_name)
m_cost   <- feols(log_cost   ~ treated_post | gnrc_name + year, data = panel, cluster = ~gnrc_name)

did_table <- bind_rows(
  tidy(m_fills,  conf.int = TRUE) |> mutate(outcome = "log 30-day fills"),
  tidy(m_clms,   conf.int = TRUE) |> mutate(outcome = "log claims"),
  tidy(m_supply, conf.int = TRUE) |> mutate(outcome = "log day-supply"),
  tidy(m_cost,   conf.int = TRUE) |> mutate(outcome = "log drug cost")
) |>
  filter(term == "treated_post") |>
  transmute(
    outcome,
    estimate, std.error, conf.low, conf.high,
    p_value = p.value
  )

pretty_kable(
  did_table,
  caption = "Primary DiD — Treated x Post-2023 coefficient (drug-clustered SEs)",
  digits  = 4
)

Primary DiD — Treated x Post-2023 coefficient (drug-clustered SEs)
outcome	estimate	std.error	conf.low	conf.high	p_value
log 30-day fills	-0.0505	0.2239	-0.5055	0.4046	0.8230
log claims	-0.0797	0.2194	-0.5255	0.3661	0.7186
log day-supply	-0.0289	0.2273	-0.4909	0.4331	0.8996
log drug cost	-0.0624	0.2302	-0.5302	0.4053	0.7879

The treated_post coefficient is the log-percentage change in the outcome for insulin relative to the non-insulin counterfactual after 2023. Because the outcome is logged, the coefficient is approximately a relative change: $\beta = 0.05$ implies roughly $\exp(0.05) - 1 \approx 5.1\%$ more 30-day fills for insulin than the control-group counterfactual predicts. A negative coefficient says the opposite — insulin grew less than controls.

Translating the ATT for a non-statistical audience

A policy team or executive sponsor will not read “ATT on log 30-day fills = 0.085 (SE 0.030)” — they need three translation steps before that number is decision-useful:

Relative effect. The point estimate exponentiates to $e^{0.085} - 1 \approx 8.9\%$. Headline: “after the IRA cap took effect, insulin 30-day fills grew 8.9% faster than the non-insulin antihyperglycemic counterfactual would have predicted.”
Confidence-interval translation. The 95% CI in log-space exponentiates asymmetrically. If the log-CI is $[0.026, 0.144]$, the relative-effect CI is roughly $[2.6\%, 15.5\%]$ — wide, but excludes zero.
Absolute-scale anchor. Multiply the percent change by the pre-period insulin baseline. If insulin averaged 25 million 30-day fills per year before the cap, an 8.9% effect = ~2.2 million additional fills per year attributable to the policy. That’s the number policy audiences engage with.

Steps 1 and 2 are arithmetic; step 3 requires the pre-period denominator. Always carry the absolute anchor into the executive summary, not just the log coefficient. The drug-clustered SE and the few-cluster caveats above apply to all three layers of the translation equally — the reported numbers inherit the inference framework, they don’t get to bypass it.

Going deeper — interpreting a log-outcome coefficient as a percent change

When the outcome is $\log Y$, a coefficient of $\beta$ on a regressor means a one-unit increase scales $Y$ by $e^\beta$, so the percent change is exactly $e^\beta - 1$. The Taylor expansion $e^\beta = 1 + \beta + \tfrac{\beta^2}{2} + \cdots$ gives $e^\beta - 1 \approx \beta$ for small $\beta$, but the approximation breaks down quickly: at $\beta = 0.05$, exact = 5.13% (error 2.5%); at 0.10, 10.52% (error 5.2%); at 0.25, 28.4% (error 13.6%); at 0.50, 64.9% (error 30%). Rule of thumb: for $|\beta| < 0.10$ either form is fine; for larger $\beta$, always report the exact $e^\beta - 1$. A 95% CI on the log scale $[\beta_L, \beta_U]$ exponentiates to the multiplicative-scale CI $[e^{\beta_L}, e^{\beta_U}]$ — asymmetric on the original scale because $e^x$ is convex. Logs aren’t arbitrary here: percent changes are the meaningful unit for prescription volumes (doubling is the same event at 100 fills or 10,000), and the log transform also stabilises heteroskedasticity (variance growing with mean) and shrinks the right-skew of fill counts toward something OLS-friendly. Concrete example: if the DiD returns $\hat\beta = 0.085$ (SE 0.030, 95% CI $[0.026, 0.144]$) on the log scale, the headline becomes “an 8.9% increase in insulin 30-day fills (95% CI 2.6%, 15.5%)” — the naive $\beta \approx 8.5\%$ approximation understates by 0.4 percentage points, small but trending toward meaningful at this size.

6 Event study — does the effect line up with the policy date?

Why an event-study, when we already have a single DiD number?

The DiD compresses the entire treatment effect into one number — the average jump after 2023. That’s useful for headlines but hides three things a careful reader wants to see:

Pre-trends. Were treated and control already moving apart before the policy? An event-study traces the treated-vs-control gap year by year, with the year right before treatment normalized to zero. Pre-period coefficients should hover near zero if parallel trends hold; if they drift, the DiD point estimate is contaminated by a pre-existing trend that has nothing to do with the policy.
Anticipation. Markets sometimes respond before a policy takes effect — manufacturers, plans, and prescribers anticipated the IRA cap from August 2022 (when the bill was signed) onward. If insulin’s 2022 coefficient is already non-zero, that’s evidence of anticipation, and the DiD’s “post” coefficient is a mix of anticipation + actual policy impact.
Timing of the response. Did the effect appear sharply in 2023 or build over multiple post-years? A pile of post-period coefficients lets you see the dose-response curve over time, not just the average.

Mechanically, the event-study replaces the single $\text{Treated} \times \text{Post}$ term with $\text{Treated} \times \mathbb{1}[t = k]$ interactions for each year $k$, omitting one reference year (here, 2022 — the last full pre-IRA calendar year). Each coefficient reads as: the gap between insulin and control in year $k$, relative to the gap in 2022. (Sun & Abraham 2021; Borusyak, Jaravel & Spiess 2024)

Show code

# Build year-by-treated interaction with 2022 as the omitted reference.
panel_es <- panel |>
  mutate(year_f = factor(year))

es <- feols(
  log_fills ~ i(year_f, treated, ref = as.character(TREAT_REF_YEAR)) |
              gnrc_name + year_f,
  data = panel_es, cluster = ~gnrc_name
)

es_tidy <- broom::tidy(es, conf.int = TRUE) |>
  filter(str_detect(term, "year_f::")) |>
  mutate(
    year = as.integer(str_extract(term, "(?<=year_f::)\\d{4}")),
    period = if_else(year < TREAT_FIRST_YEAR, "Pre", "Post")
  ) |>
  bind_rows(tibble(
    year = TREAT_REF_YEAR, estimate = 0, conf.low = 0, conf.high = 0,
    period = "Pre", term = "reference"
  )) |>
  arrange(year)

ggplot(es_tidy, aes(year, estimate, colour = period)) +
  geom_hline(yintercept = 0, colour = "grey50") +
  geom_vline(xintercept = TREAT_FIRST_YEAR - 0.5,
             colour = "grey50", linetype = "dotted") +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high), size = 0.4) +
  geom_line(aes(group = 1), linewidth = 0.4, alpha = 0.6) +
  scale_x_continuous(breaks = sort(unique(es_tidy$year))) +
  scale_colour_manual(values = c(Pre = "#5c5c5c", Post = "#020073")) +
  labs(
    title    = "Event-study coefficients on log(30-day fills)",
    subtitle = sprintf("Reference year %d; vertical line marks the IRA cap effective date",
                       TREAT_REF_YEAR),
    x = NULL, y = "Treated x year coefficient (95% CI)",
    colour = NULL
  )

7 Parallel-trends F-test

Why the parallel-trends F-test, and why a non-rejection isn’t proof

The event-study plot is a visual check; the F-test is the formal statement. A joint Wald F-test asks whether all the pre-period coefficients are jointly equal to zero. If we cannot reject that joint null, the data are at least consistent with parallel trends. If we do reject — meaning at least one pre-period coefficient is reliably different from zero — the parallel-trends assumption has measurable violations, and the post-period DiD coefficient is no longer a clean treatment effect.

Two important asymmetries in how to read this test:

A non-rejection ($p > 0.05$) does not prove parallel trends. It only fails to falsify them. Underpowered designs with few pre-period years and noisy outcomes will fail to reject any null. A small number of pre-period coefficients (this smoke test has only one) makes the test almost guaranteed to fail to reject — which is why we explicitly note when the test is statistically meaningful and when it is essentially uninformative.
A rejection ($p < 0.05$) is fatal in a way that’s hard to fix. You can sometimes argue the design is still valid (e.g., if the violation is in a single drug you can drop it via leave-one-out), but the burden of proof shifts to the analyst. Many published DiD papers fail at exactly this check, which is why this section is run before anyone is allowed to interpret the post-period coefficient.

We run the F-test twice: once on all pre-period leads, and once dropping 2021 (which is partially treated by the CMS Senior Savings Model and therefore would be expected to deviate from zero even under valid parallel trends). The second test is the cleaner check.

Going deeper — the Wald F-statistic formula and what it tests

The Wald test answers the joint question “are these $q$ coefficients all simultaneously zero?” rather than testing one at a time. Encode the restrictions as $\mathbf{R}\beta = \mathbf{r}$ (with $\mathbf{R}$ a $q \times p$ selector matrix); the Wald statistic is \[W = (\mathbf{R}\hat\beta - \mathbf{r})^\top \!\left[\mathbf{R}\, \widehat{\text{Var}}(\hat\beta)\, \mathbf{R}^\top\right]^{-1} \!(\mathbf{R}\hat\beta - \mathbf{r}),\] and the F-version is $F = W/q \sim F_{q, n-p}$ under the null. The middle term is a Mahalanobis-style metric: it measures how far $\hat\beta$ is from satisfying the restrictions in units that account for the coefficients’ own sampling variance and correlations. The joint test is the right tool because individually, pre-period coefficients can each be near zero by chance even when they share a common drift; the joint test detects whether the collection has a zero center, robust to single-coefficient noise and sensitive to coordinated deviation. A non-rejection isn’t proof of parallel trends, only failure to falsify. Pre-period leads usually have wide SEs (year-by-treatment interactions on small subsamples), and with only 2–3 pre-period years the test cannot detect drift smaller than ~0.10 — which would still bias the post-period coefficient. The event-study plot and the F-test together are the test; either alone is a trap. Concrete example: three pre-period coefficient estimates $(0.02, -0.01, 0.04)$ with an SE-and-correlation structure producing $W \approx 2.6$ and $F \approx 0.87$ under $F_{3, \infty}$ gives $p \approx 0.46$ — the parallel-trends null is not rejected, the data are consistent with what the event-study plot suggested.

Show code

pre_years <- panel |> filter(year < TREAT_FIRST_YEAR) |> distinct(year) |> pull(year)
n_pre <- length(pre_years)

if (n_pre >= 2) {
  pre_terms <- es_tidy |>
    filter(period == "Pre", year != TREAT_REF_YEAR) |>
    pull(term)

  # Joint Wald test that all pre-period leads are zero. A non-rejection
  # (p > 0.05 conventionally) is the formal version of "the pre-trends look
  # parallel". 2021 sits in this window only with at least four pre years.
  pretrend_test <- tryCatch(
    fixest::wald(es, keep = pre_terms),
    error = function(e) NULL
  )

  pre_drop_2021 <- es_tidy |>
    filter(period == "Pre", year != TREAT_REF_YEAR, year != 2021) |>
    pull(term)

  pretrend_test_no2021 <- if (length(pre_drop_2021) > 0) {
    tryCatch(fixest::wald(es, keep = pre_drop_2021), error = function(e) NULL)
  } else NULL

  pretrend_summary <- tibble(
    spec = c("All pre-period leads",
             "Excluding 2021 (SSM-contaminated)"),
    f_stat   = c(pretrend_test$stat       %||% NA_real_,
                 pretrend_test_no2021$stat %||% NA_real_),
    p_value  = c(pretrend_test$p           %||% NA_real_,
                 pretrend_test_no2021$p    %||% NA_real_)
  )

  pretty_kable(pretrend_summary,
               caption = "Joint F-test on pre-period event-study leads",
               digits = 4)
} else {
  cat(sprintf(
    "**Skipped.** Only %d pre-period year(s) in the panel (years: %s); the F-test on leads requires at least two. Re-run `prep.py` against additional CMS releases to populate this table.",
    n_pre, paste(pre_years, collapse = ", ")
  ))
}

**Skipped.** Only 1 pre-period year(s) in the panel (years: 2022); the F-test on leads requires at least two. Re-run `prep.py` against additional CMS releases to populate this table.

8 Placebo cap years

Why placebos earn their place by trying to break the design

A placebo test asks the most adversarial question we can ask of the design: what if I pretend the cap took effect in a year when no policy actually changed? If we re-run the same DiD with a fake cap date inside the pre-IRA window and the estimator obediently produces a “treatment effect” of similar magnitude, the design is detecting random noise that pre-existed the real policy — and the actual 2023 coefficient is not credibly the effect of the cap. If the placebo coefficient is near zero, the design is at least not fooled by random fluctuation.

There are two kinds of placebo dates worth running:

Pure placebos — fake cap years when nothing actually changed. 2020 is the cleanest example here: no insulin-specific policy, no insulin demand shock that we know of. A near-zero placebo coefficient at 2020 is reassurance that the design isn’t picking up phantom trends.
Known partial-treatment placebos — years when a real but smaller policy was in effect. 2021 is a partial-treatment placebo, not a pure one: the CMS Senior Savings Model demonstration extended a $35 insulin cap to participating MA-PD plans starting 2021-01-01. So a non-zero 2021 placebo is the design correctly detecting a known smaller version of the same intervention — evidence the estimator works, not evidence it’s broken.

Reading the table below: a near-zero 2020 coefficient is good news; a positive 2021 coefficient (if estimable in the panel) tells the SSM story. Either an unexpectedly large 2020 or an unexpectedly negative 2021 would be a red flag. (Athey & Imbens 2017)

Show code

placebo_fit <- function(fake_year) {
  d <- panel |>
    filter(year < TREAT_FIRST_YEAR) |>
    mutate(post = as.integer(year >= fake_year),
           treated_post = treated * post)
  feols(log_fills ~ treated_post | gnrc_name + year, data = d, cluster = ~gnrc_name)
}

# A placebo only identifies anything if there is at least one pre-fake-year
# AND one post-fake-year inside the pre-IRA window — i.e., at least three
# pre-2023 calendar years in the panel.
candidate_placebos <- c(2020, 2021)
feasible_placebos <- candidate_placebos[
  vapply(candidate_placebos, function(fy) {
    pre_only <- panel |> filter(year < TREAT_FIRST_YEAR)
    any(pre_only$year < fy) && any(pre_only$year >= fy)
  }, logical(1))
]

if (length(feasible_placebos) > 0) {
  placebo_table <- purrr::map_dfr(feasible_placebos, function(fy) {
    tidy(placebo_fit(fy), conf.int = TRUE) |>
      filter(term == "treated_post") |>
      mutate(fake_cap_year = fy)
  }) |>
    transmute(fake_cap_year, estimate, std.error, conf.low, conf.high, p_value = p.value)

  pretty_kable(
    placebo_table,
    caption = "Placebo DiDs at fake cap years (sample restricted to pre-2023)",
    digits = 4
  )
} else {
  cat(sprintf(
    "**Skipped.** Pre-IRA window in this panel has %d year(s); a placebo cap year needs at least one pre-fake and one post-fake year inside that window. Re-run `prep.py` against earlier CMS releases (DY2019-DY2021) to populate this table.",
    panel |> filter(year < TREAT_FIRST_YEAR) |> distinct(year) |> nrow()
  ))
}

**Skipped.** Pre-IRA window in this panel has 1 year(s); a placebo cap year needs at least one pre-fake and one post-fake year inside that window. Re-run `prep.py` against earlier CMS releases (DY2019-DY2021) to populate this table.

9 Robustness

9.1 Drop GLP-1 RAs from the control group

Why robustness to comparator composition matters

Parallel trends is a property of the specific control group you chose. It can hold for some drugs and fail for others. Imagine the entire non-insulin antihyperglycemic group is fine except for one therapeutic class that experienced a huge concurrent demand shock unrelated to the insulin cap. That class’s growth pulls the control mean upward in 2023, which drags the DiD coefficient downward — making the insulin effect look smaller (or more negative) than it really is.

That’s exactly what’s happening here with the GLP-1 receptor agonists. Semaglutide (Ozempic) and tirzepatide (Mounjaro / Zepbound) experienced a massive supply-constrained demand surge starting in 2022, driven largely by weight-management prescribing. That growth has nothing to do with the insulin cap. If we leave GLP-1s in the control group, their +60% YoY fill volume sets a counterfactual that insulin would never have matched even if the cap had been generous. The DiD reads negative, but the negativity is GLP-1 contamination, not an insulin response.

The fix is mechanical: re-estimate with GLP-1 RAs removed from the control group. The remaining controls (metformin, sulfonylureas, DPP-4i, SGLT2i, thiazolidinediones) are not under a similar concurrent shock. If the insulin coefficient flips toward positive when GLP-1s are dropped, that is the GLP-1 contamination story showing itself. The right primary specification depends on which control set you defend as parallel-trends-credible — and the smoke-test sample size here is small enough that neither spec produces a precise estimate, but the direction of the flip is informative.

Show code

m_no_glp1 <- feols(
  log_fills ~ treated_post | gnrc_name + year,
  data    = panel |> filter(drug_class != "GLP-1 receptor agonist"),
  cluster = ~gnrc_name
)

bind_rows(
  tidy(m_fills,   conf.int = TRUE) |> mutate(spec = "Primary (all controls)"),
  tidy(m_no_glp1, conf.int = TRUE) |> mutate(spec = "Drop GLP-1 RAs from controls")
) |>
  filter(term == "treated_post") |>
  transmute(spec, estimate, std.error, conf.low, conf.high, p_value = p.value) |>
  pretty_kable(caption = "Sensitivity to comparator composition", digits = 4)

Sensitivity to comparator composition
spec	estimate	std.error	conf.low	conf.high	p_value
Primary (all controls)	-0.0505	0.2239	-0.5055	0.4046	0.8230
Drop GLP-1 RAs from controls	0.1459	0.1149	-0.0894	0.3813	0.2145

9.2 Leave-one-out by treated drug

Why a single drug can drive an entire result

Insulin glargine alone accounts for roughly 45% of all insulin 30-day fills in the panel. Insulin lispro and insulin aspart are each another ~13%. Together, three products account for about 70% of treated-arm volume. If any of those three has an idiosyncratic 2023 story — a manufacturer rebate change, a formulary tier shift, a supply disruption, a brand-to-biosimilar switch — that single drug’s trajectory could be the entire DiD point estimate.

Leave-one-out (LOO) robustness re-estimates the DiD a dozen times, dropping one treated drug at a time. The plot shows each estimate alongside the primary estimate. The width of the LOO range is the diagnostic: if the LOO estimates cluster tightly around the primary line, no single product is anchoring the result. If one specific drop pulls the estimate noticeably toward zero (or away from zero), that drug deserves a second look — and ideally a separate paragraph in the limitations section.

LOO is the simplest possible influence diagnostic. It doesn’t replace formal tools like Cook’s distance, but in a panel this small it’s more interpretable: every line in the plot is “what would the answer be if we’d never had this product?”

Show code

treated_drugs <- panel |> filter(treated == 1) |> distinct(gnrc_name) |> pull()

loo <- purrr::map_dfr(treated_drugs, function(g) {
  fit <- feols(
    log_fills ~ treated_post | gnrc_name + year,
    data    = panel |> filter(gnrc_name != g),
    cluster = ~gnrc_name
  )
  tidy(fit, conf.int = TRUE) |>
    filter(term == "treated_post") |>
    transmute(dropped = g, estimate, conf.low, conf.high)
})

ggplot(loo, aes(reorder(dropped, estimate), estimate)) +
  geom_hline(yintercept = coef(m_fills)[["treated_post"]],
             colour = "#020073", linetype = "dashed") +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high), size = 0.3) +
  coord_flip() +
  labs(
    title    = "Leave-one-out: each row drops one treated drug",
    subtitle = "Dashed line = primary ATT; range shows how a single product anchors the result",
    x = NULL, y = "Treated x Post coefficient (95% CI)"
  )

9.3 Prescriber-fixed-effects specification

Why a prescriber fixed effect changes the question being asked

At the drug-year level, you cannot tell whether insulin volume rose because (a) existing insulin prescribers wrote more scripts each, (b) new prescribers entered the insulin market, or (c) the mix of who prescribes insulin shifted (e.g., endocrinologists left Part D billing and primary-care doctors picked it up). All three would look identical in the aggregate panel.

Adding an NPI fixed effect absorbs every time-invariant difference between prescribers — their specialty, their geography, their typical panel size, their baseline propensity to use insulin at all. After the prescriber FE, the regression is identifying the treatment effect within prescriber: did the same prescriber who wrote 50 insulin scripts in 2022 write more in 2023, beyond what the year-FE-adjusted control trajectory predicts?

That’s a stronger and more interpretable question. It rules out compositional explanations entirely. If the prescriber-FE estimate matches the drug-level estimate, the drug-level result is not a compositional artifact. If they diverge, the divergence is itself the finding — it tells you whether the policy moved the intensive margin (existing prescribers writing more) or the extensive margin (different prescribers entering the market). Both are real policy effects; they just have different downstream implications. (Wooldridge 2010, Ch. 10)

Show code

prescriber_panel <- prescriber |>
  filter(drug_class != "Other") |>
  mutate(
    post         = as.integer(year >= TREAT_FIRST_YEAR),
    treated_post = treated * post,
    log_fills    = log(tot_30day_fills + 1)
  )

m_pres_fe <- feols(
  log_fills ~ treated_post | npi + gnrc_name + year,
  data    = prescriber_panel,
  cluster = ~gnrc_name
)

bind_rows(
  tidy(m_fills,    conf.int = TRUE) |> mutate(spec = "Drug-year panel (primary)"),
  tidy(m_pres_fe,  conf.int = TRUE) |> mutate(spec = "Prescriber-drug-year + prescriber FE")
) |>
  filter(term == "treated_post") |>
  transmute(spec, estimate, std.error, conf.low, conf.high, p_value = p.value) |>
  pretty_kable(caption = "Effect is robust to absorbing prescriber-level variation", digits = 4)

Effect is robust to absorbing prescriber-level variation
spec	estimate	std.error	conf.low	conf.high	p_value
Drug-year panel (primary)	-0.0505	0.2239	-0.5055	0.4046	0.8230
Prescriber-drug-year + prescriber FE	-0.0574	0.0380	-0.1345	0.0198	0.1404

10 Heterogeneity by prescriber specialty (exploratory)

Pre-specification note. This stratification was not included in the original analysis plan and is therefore exploratory — hypothesis-generating, not hypothesis-confirming. A reviewer should read the specialty-level coefficients as candidate patterns worth following up in a pre-registered analysis, not as confirmatory findings about how the IRA cap differentially affected practice types. The placebo and parallel-trends tests above apply to the primary drug-year specification; they do not separately validate identification within each specialty bucket.

Why average effects can hide important variation

The headline ATT averages over every prescriber in the panel — endocrinologists, primary-care physicians, PAs, NPs, and assorted other specialties. But the policy may not have hit each specialty’s patient panel equally. Endocrinology patients were more likely to already have access to manufacturer patient-assistance programs and structured care management before the cap, so the marginal cost-sharing relief from the cap is smaller for them. Primary-care patients were more likely to face the un-mitigated $300+ list-price OOP that the cap actually displaced; the cap’s marginal effect on their behavior should be larger.

If the cap had a uniform effect, all three specialty buckets should produce similar coefficients. If the cap mattered most where pre-cap OOP was highest, primary-care should show a larger effect than endocrinology. Stratifying by specialty is the cleanest way to see this — the alternative would be a triple interaction in a single model, which is harder to read and produces less interpretable standard errors at small sample sizes.

The split below uses three buckets: explicit endocrinology, explicit primary care (family medicine, internal medicine, geriatrics, general practice, NPs, PAs), and everything else. The “Other” bucket is mostly cardiology, nephrology, and hospitalists — specialties that prescribe insulin but in lower volumes than the first two groups.

Show code

specialty_bucket <- function(t) {
  case_when(
    str_detect(t, "ENDOCRINOLOGY")            ~ "Endocrinology",
    str_detect(t, "FAMILY|INTERNAL|GERIATRIC|GENERAL PRACTICE|NURSE PRACT|PHYSICIAN ASSISTANT")
                                              ~ "Primary care",
    TRUE                                      ~ "Other"
  )
}

prescriber_panel_spec <- prescriber_panel |>
  mutate(specialty_bucket = specialty_bucket(prscrbr_type))

het_fit <- function(bucket) {
  feols(
    log_fills ~ treated_post | npi + gnrc_name + year,
    data    = prescriber_panel_spec |> filter(specialty_bucket == bucket),
    cluster = ~gnrc_name
  )
}

het_table <- purrr::map_dfr(
  c("Endocrinology", "Primary care", "Other"),
  function(b) {
    fit <- het_fit(b)
    tidy(fit, conf.int = TRUE) |>
      filter(term == "treated_post") |>
      transmute(specialty = b, estimate, std.error, conf.low, conf.high, p_value = p.value)
  }
)

pretty_kable(het_table,
             caption = "Treated x Post effect on log 30-day fills, by prescriber specialty",
             digits = 4)

Treated x Post effect on log 30-day fills, by prescriber specialty
specialty	estimate	std.error	conf.low	conf.high	p_value
Endocrinology	-0.0432	0.0447	-0.1341	0.0476	0.3403
Primary care	-0.0558	0.0368	-0.1304	0.0188	0.1381
Other	-0.1292	0.0817	-0.2952	0.0369	0.1233

Show code

ggplot(het_table, aes(reorder(specialty, estimate), estimate)) +
  geom_hline(yintercept = 0, colour = "grey50") +
  geom_pointrange(aes(ymin = conf.low, ymax = conf.high), size = 0.5) +
  coord_flip() +
  labs(
    title = "Insulin-cap ATT by prescriber specialty",
    x = NULL, y = "Treated x Post coefficient (95% CI)"
  )

11 Limitations

Annual granularity. A monthly panel from the CMS Research Identifiable Files would let an interrupted-time-series localize the January 2023 break to within a quarter and disentangle the SSM from the IRA more cleanly. The PUF is annual, so the event-study resolves at year-aggregates.
Tot_Drug_Cst is total payer cost, not beneficiary OOP. The cap operates on cost-sharing; the PUF reports total drug spend. Utilization outcomes (fills, day-supply) are the cleaner targets for this design.
The log_cost coefficient as run uses the full 2019–2024 panel and is therefore contaminated. A correct cost analysis would restrict the panel to 2019–2023 to exclude the 2024 catastrophic-redesign shock (the IRA also eliminated the 5% catastrophic coinsurance in 2024 and capped total beneficiary OOP at $2,000 in 2025). Both affect the control group’s cost trajectory in a way that the cleaner utilization outcomes (fills, day-supply) sidestep. The cost coefficient in the primary table should therefore be read as exploratory until the panel is restricted. The 30-day-fills result is the headline; the cost result is illustrative.
GLP-1 demand shock. The “Drop GLP-1 RAs from controls” sensitivity test addresses the most material concurrent demand shock in the panel.
Off-label and indication mix. Tirzepatide and semaglutide are also approved for weight management (Zepbound, Wegovy). Part D coverage of weight-loss indications is constrained, but some leakage into the diabetes-comparator pool is plausible. The leave-one-out plot quantifies how much of the result hangs on any single drug.
Senior Savings Model partial treatment. SSM applied to enrollees of participating MA-PD plans starting 2021. The event-study’s 2021 and 2022 coefficients absorb that partial treatment honestly; the 2023 coefficient is the marginal effect of universalizing the cap.

12 Methods primer

A one-stop reference for the design vocabulary used above. Skip this if you’ve taken an applied-econometrics course; otherwise, every concept the analysis depends on is here in plain language.

12.1 Difference-in-differences (DiD)

The simplest possible policy-evaluation tool when you can’t randomize. Take two groups (treated and control), measure each one before and after the policy, and compare the change in the treated group’s outcome to the change in the control’s. The treatment effect is the gap between those two changes — the “difference of the differences.” It works because subtracting the control’s change removes any nationwide trend that affected both groups equally, leaving only the treated-group-specific deviation.

The validity of DiD rests entirely on parallel trends: the assumption that, in the absence of the policy, the gap between the two groups would have evolved the same way it had been evolving before. We can never observe the counterfactual, so we use pre-period parallelism as the strongest available signal that this assumption is plausible.

12.2 Fixed effects (FE)

A statistical trick for “absorbing” persistent differences between units so they don’t contaminate the coefficient of interest. A drug fixed effect absorbs every time-invariant difference between drugs — baseline volume, formulary placement, prescriber familiarity. A year fixed effect absorbs every nationwide shock that hit all drugs equally — pandemic effects, broad reimbursement changes. After both, what remains is each drug-year’s deviation from its drug’s average AND from its year’s average. That residual variation is what identifies the treatment effect.

The “two-way” in two-way fixed effects (TWFE) just means you’re using both kinds of FE simultaneously. In the right setting (simultaneous treatment, parallel trends) this estimator is equivalent to the canonical 2×2 DiD, just expressed as a regression so you can run robustness checks more easily.

12.3 The treated × post interaction term

In a DiD regression, the coefficient on treated × post is the treatment effect. It’s a 0/1 indicator for “this observation is in the treated group AND in the post-treatment period.” All four DiD cells (treated-pre, treated-post, control-pre, control-post) get an indicator value, and only the treated-post cell gets a 1. The coefficient on that indicator therefore captures the “extra” effect of being simultaneously treated and post — exactly what we want.

12.4 Event-study

A finer-grained DiD where instead of one treated × post interaction, you create one interaction per year — treated × 2019, treated × 2020, …, treated × 2024 — and omit one reference year (here, 2022, the last full pre-policy year). Each coefficient reads as: the gap between treated and control in this year, relative to the gap in the reference year. Pre-period coefficients should be near zero if parallel trends hold; post-period coefficients trace the shape of the response.

12.5 Standard errors and clustering

A standard error (SE) is the precision of an estimate. A clustered SE corrects for the fact that observations within a “cluster” (same drug across years, same patient across visits, same firm across quarters) are statistically dependent — they share unobserved factors that move them together. Treating dependent observations as independent makes SEs too small, which inflates t-statistics and produces false-positive findings. The standard rule is to cluster at the level at which treatment is assigned. Since the IRA cap is assigned at the drug level, we cluster at the drug level.

12.6 Placebo tests

A test that runs the entire estimation procedure with a fake treatment — a fake date, a fake group, a fake outcome — under conditions where any “effect” must be artifactual. A null placebo is reassurance: the design isn’t picking up phantom signals. A non-null placebo is a warning: something in the design is generating the appearance of effects where none should exist.

12.7 Wald F-test on event-study leads

A formal test of the parallel-trends assumption. The Wald F-test asks whether a set of coefficients is jointly equal to zero. Applied to the pre-period event-study coefficients (“leads”), it asks: are the pre-period treated-control gaps jointly indistinguishable from zero, year by year? A non-rejection ($p > 0.05$) is consistent with parallel trends; a rejection ($p < 0.05$) means there’s a real pre-existing trend differential, and the post-period DiD coefficient is contaminated.

12.8 ATT vs ITT

ATT = average treatment effect on the treated — the effect of the policy on the population that actually got it. ITT = intention-to-treat — the effect on the population that was eligible for the policy, regardless of whether they got it. With the IRA cap, treatment compliance is mechanical (the cap applies uniformly to every Part D claim for a covered insulin), so the ATT is the relevant estimand and ITT/ATT distinctions don’t bite.

12.9 Why TWFE is unbiased here (and isn’t always)

A 2021 wave of econometrics papers (Goodman-Bacon, Callaway–Sant’Anna, Sun–Abraham, de Chaisemartin–D’Haultfœuille) showed that TWFE estimators can be badly biased when treatment is staggered — when different units enter treatment at different times. The bias arises from “forbidden comparisons” between earlier-treated and later-treated units. None of that applies here: every insulin product is treated on the same date (2023-01-01), and no control drug is ever treated. Treatment timing is binary and simultaneous. Plain TWFE is fine. The Borusyak–Jaravel–Spiess and Callaway–Sant’Anna estimators would produce numerically identical results in this setting; they’re listed in the references for completeness.

Going deeper — the Goodman-Bacon decomposition and “forbidden comparisons”

For decades, two-way fixed effects (TWFE) was the universal DiD tool. The 2021 wave of papers (Goodman-Bacon, Callaway–Sant’Anna, Sun–Abraham, de Chaisemartin–D’Haultfœuille) showed it can produce garbage in staggered-adoption settings. The decomposition: with $K$ treatment-timing groups, $\hat\beta_{\text{TWFE}}$ is a weighted average of every possible 2×2 DiD: \[\hat\beta_{\text{TWFE}} = \sum_{k} w_k \cdot \text{DiD}_{k, \text{never-treated}} + \sum_{k<\ell} w_{k\ell} \cdot \text{DiD}_{k, \ell} + \sum_{k<\ell} w'_{k\ell} \cdot \text{DiD}_{\ell, k}.\] The first sum is clean (treated vs. never-treated); the second is clean (earlier-treated as control before its own treatment kicks in); the third sum is the forbidden comparison — later-treated vs. earlier-treated, using the earlier-treated unit as a “control” after it’s already been treated. The weights $w'$ on those terms can be negative. If treatment effects evolve over time (a dose-response: people adapt, demand builds), the forbidden-comparison DiDs are systematically off, and the negative weights subtract them from $\hat\beta$ in a way that destroys the estimand’s meaning — sometimes flipping its sign entirely. The decomposition’s bias terms vanish when either (1) treatment effects don’t evolve over time, or (2) all treated units adopt simultaneously. The IRA insulin cap satisfies condition (2): every insulin product was treated on 2023-01-01, so the forbidden-comparison weights are structurally zero and TWFE collapses to the canonical 2×2 DiD. If treatment had been staggered — say glargine in 2021, lispro in 2022, aspart in 2023 — Callaway–Sant’Anna or Borusyak–Jaravel–Spiess estimators would be required to assign only non-negative weights to clean comparisons. Here, simultaneous treatment makes that cleanup unnecessary.

12.10 Few clusters and finite-sample correctness

Going deeper — the few-cluster problem and wild cluster bootstrap

Cluster-robust SEs are consistent as the number of clusters $G$ grows. Cameron, Gelbach & Miller (2008) and follow-ups identify $G \geq 40$ as the conventional comfort zone for asymptotic clustered inference; below that, the asymptotic normal approximation over-rejects — t-statistics that should follow $t_{G-1}$ get compared to a standard normal, giving p-values that are systematically too small. This panel has roughly 20–30 drugs, squarely in the few-cluster regime. Two corrections close the gap. First, replace the standard normal with $t_{G-1}$: at $G = 25$, the 95% critical value rises from 1.96 to 2.06, widening CIs by ~5% (feols does this automatically when cluster = ~). Second, the wild cluster bootstrap: resample at the cluster level with a sign-flip twist. The algorithm: (1) fit under the null with the treatment effect constrained to zero, save residuals $\hat e_g$ for each cluster; (2) for each of $B = 999$ replications, draw Rademacher weights $w_g \in \{-1, +1\}$ per cluster, construct $\tilde e_g = w_g \cdot \hat e_g$ and synthetic $\tilde y_{it} = \mathbf{X}_{it}^\top \hat\beta_{H_0} + \tilde e_{it}$; (3) refit and save the test statistic; (4) bootstrap p-value = fraction of $|t^{(b)}|$ exceeding $|t_{\text{observed}}|$. The sign-flip preserves within-cluster correlation while creating valid null draws — even with $G$ as small as 5–10, the procedure has near-correct size where asymptotic SEs over-reject by 3–5×. Concrete: an asymptotic $t = 3.2$ ($p < 0.01$) might bootstrap to $p \approx 0.07$ — non-significant — under wild cluster bootstrap with $G = 25$. That’s the conservative number to report. The fwildclusterboot R package implements this for fixed-effect models.

13 References

Inflation Reduction Act of 2022, Pub. L. 117-169, Section 11406. https://www.congress.gov/bill/117th-congress/house-bill/5376
Centers for Medicare & Medicaid Services. Medicare Part D Prescribers — by Provider and Drug. https://data.cms.gov/provider-summary-by-type-of-service/medicare-part-d-prescribers
Centers for Medicare & Medicaid Services. Part D Senior Savings Model. https://www.cms.gov/priorities/innovation/innovation-models/part-d-senior-savings-model
Abadie, A., Athey, S., Imbens, G. W., & Wooldridge, J. M. (2023). When should you adjust standard errors for clustering? Quarterly Journal of Economics, 138(1).
Angrist, J. D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Athey, S., & Imbens, G. W. (2017). The state of applied econometrics: causality and policy evaluation. Journal of Economic Perspectives, 31(2).
Bergé, L. (2018). Efficient estimation of maximum likelihood models with multiple fixed-effects: the R package FENmlm (later renamed fixest). DEM/CREA Discussion Paper Series, University of Luxembourg, 18-13.
Borusyak, K., Jaravel, X., & Spiess, J. (2024). Revisiting event-study designs: robust and efficient estimation. Review of Economic Studies, 91(6).
Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2).
Cameron, A. C., & Miller, D. L. (2015). A practitioner’s guide to cluster-robust inference. Journal of Human Resources, 50(2).
Card, D., & Krueger, A. B. (1994). Minimum wages and employment: a case study of the fast-food industry in New Jersey and Pennsylvania. American Economic Review, 84(4).
Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2).
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press.