---
title: "Cardiometabolic Risk in NHANES 2017–2018"
subtitle: "A polyglot R + Python analysis with complex survey design, machine learning, and SHAP interpretability"
author: "Paulina Del Mundo Del Fierro, MD, MPH"
date: last-modified
format:
html:
theme: [cosmo, ../_theme.scss]
highlight-style: github-dark
toc: true
toc-depth: 3
toc-location: right
code-tools: true
code-copy: true
code-fold: true
code-summary: "Show code"
code-overflow: wrap
number-sections: true
fig-responsive: true
fig-width: 8
fig-height: 5
mainfont: "Inter, system-ui, sans-serif"
monofont: "JetBrains Mono, Menlo, monospace"
embed-resources: true
include-in-header:
text: |
<script>
window.va = window.va || function () { (window.vaq = window.vaq || []).push(arguments); };
</script>
<script defer src="/_vercel/insights/script.js"></script>
execute:
echo: true
warning: false
message: false
freeze: auto
---
::: {.callout-tip icon=false}
## Abstract
About one in three US adults aged 20 and over meets the criteria for metabolic syndrome — a cluster of waist circumference, blood pressure, cholesterol, and blood-sugar abnormalities that together flag cardiovascular and diabetes risk. Using the CDC's NHANES survey (2015–2018, two combined waves), this notebook estimates that the true national rate is roughly 33%, about five percentage points higher than a naive count of just the patients with complete lab work would suggest. The gap matters: it reflects the fact that the missing patients tend to be younger and healthier than the ones with complete data, so leaving them out biases the result downward. The notebook fixes this by filling in the missing values five different ways and pooling the answers, which is the established method for honest uncertainty when data is missing.
Two more findings round out the analysis. First, the headline prevalence shifts by about five percentage points depending on which expert-society definition of metabolic syndrome you use — the international ones (IDF, JIS) flag higher rates in Asian and Hispanic patients because they apply lower waist-circumference thresholds for those groups. Second, a machine-learning model trained on age, sex, race, and BMI alone gives a useful first-pass risk score (AUC 0.78–0.83), and — crucially — its predicted probabilities are well-calibrated, meaning when the model says "30% risk," about 30% of those patients really do have the condition. Calibration matters more than raw discrimination for any tool that hands a patient a numeric risk; this one passes that bar.
:::
## Why this analysis
Cardiometabolic disease remains the leading driver of mortality in US adults. Yet the clinical tools that estimate risk — Framingham, ASCVD, PCE — were trained on cohorts that poorly represent the population-level distribution of social and biological risk factors captured by NHANES.
This notebook does three things a physician-scientist portfolio should demonstrate end-to-end:
1. **Load and harmonize** the NHANES 2017–2018 cycle (demographics, examination, labs, questionnaire) using the CDC's public XPT files.
2. **Describe** the population correctly — applying the NHANES complex survey design so prevalence estimates generalize to the US civilian non-institutionalized population.
3. **Model** the probability of metabolic syndrome using both a design-aware logistic regression (in **R**) and a gradient-boosted classifier with SHAP explanations (in **Python**), then compare what each approach reveals.
The two language blocks render side-by-side thanks to Quarto's polyglot mode via `reticulate` — a small but real demonstration that **methodology choices should follow the scientific question, not the toolchain.**
::: {.callout-note}
Source and helpers live in [`R/nhanes_helpers.R`](../../R/nhanes_helpers.R) and [`python/nhanes_helpers.py`](../../python/nhanes_helpers.py). The `.qmd` source for this page is on [GitHub](https://github.com/paulinedmdf/paulinadelmundomd/blob/main/projects/01-cardiometabolic-risk/analysis.qmd).
:::
---
## Setup
```{r setup-r}
#| label: setup-r
source(here::here("R", "nhanes_helpers.R"))
library(gtsummary)
library(ggplot2)
library(purrr)
library(reticulate)
# Method packages — installed on first render if absent.
for (pkg in c("mice", "mitools", "e1071", "fitdistrplus",
"performance", "lme4", "kableExtra")) {
if (!requireNamespace(pkg, quietly = TRUE)) {
install.packages(pkg, repos = "https://cloud.r-project.org")
}
}
# Lightweight wrapper for hero tables — bootstrap styling, no monospaced
# greyscale dump. Plain knitr::kable() still works elsewhere for inline /
# diagnostic tables where the styled version is overkill.
pretty_kable <- function(x, caption = NULL, ...) {
knitr::kable(x, caption = caption, ...) |>
kableExtra::kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "left",
font_size = 13
)
}
reticulate::py_require(c(
"pandas", "numpy", "scipy", "scikit-learn",
"xgboost", "shap", "matplotlib"
))
theme_set(theme_minimal(base_size = 12))
```
```{python setup-py}
#| label: setup-py
import sys, pathlib
sys.path.append(str(pathlib.Path.cwd().parents[1] / "python"))
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, brier_score_loss
from sklearn.calibration import calibration_curve
import xgboost as xgb
import shap
plt.rcParams.update({"figure.dpi": 110})
```
## Pulling and pooling the NHANES cycles
A single cycle is too small for stable subgroup estimates. We pool **NHANES 2015–2016 (cycle I)** and **2017–2018 (cycle J)** per the CDC analytic guidelines: each cycle's MEC weight is divided by the number of cycles, strata and PSU pass through unchanged, and SEQNs don't collide. The combined-cycle weight is `WTMEC4YR = WTMEC2YR / 2`.
We also pull three additional tables (`TCHOL`, `SMQ`, `BPQ`) needed for the Pooled Cohort Equations comparator downstream — total cholesterol, current-smoker status, and antihypertensive treatment.
```{r pull-data}
#| label: pull-data
#| cache: true
build_spec <- function(suffix) {
list(
DEMO = paste0("DEMO_", suffix),
BMX = paste0("BMX_", suffix),
BPX = paste0("BPX_", suffix),
GLU = paste0("GLU_", suffix),
HDL = paste0("HDL_", suffix),
TRIG = paste0("TRIGLY_", suffix),
DIQ = paste0("DIQ_", suffix),
TCHOL = paste0("TCHOL_", suffix),
SMQ = paste0("SMQ_", suffix),
BPQ = paste0("BPQ_", suffix)
)
}
cycle_specs <- list(
"2015-2016" = build_spec("I"),
"2017-2018" = build_spec("J")
)
nhanes_raw <- pool_nhanes_cycles(cycle_specs)
nhanes_raw <- recode_demographics(nhanes_raw)
cat(sprintf("Pooled rows: %s\n", format(nrow(nhanes_raw), big.mark = ",")))
print(table(cycle = nhanes_raw$cycle))
```
## Exploratory data analysis — what the cohort actually looks like
Before any modeling decision, four questions need answers: **what's missing, how skewed are the continuous variables, how do the demographic cells distribute, and does the cohort that survives our exclusion criteria still resemble the source population?** Skipping these steps is how a survey-weighted estimator ends up reporting a tidy CI on a heavily-truncated subgroup.
```{r eda-missingness}
#| label: eda-missingness
#| fig-cap: "Missingness map across the candidate analytic columns. NHANES MEC weights (WTMEC2YR) are zero or missing for non-examined participants — those are excluded downstream. Lipid panels (LBXTR, LBDHDD) and fasting glucose (LBXGLU) only resolve for the morning fasting subsample, so their NA share roughly tracks that subsample size."
#| fig-width: 8.5
#| fig-height: 4.2
eda_cols <- c("WTMEC2YR", "RIAGENDR", "RIDAGEYR", "RIDRETH3",
"BMXBMI", "BMXWAIST",
"BPXSY1", "BPXDI1",
"LBXTR", "LBDHDD", "LBXGLU", "DIQ010")
miss_long <- nhanes_raw |>
dplyr::select(dplyr::any_of(eda_cols)) |>
dplyr::summarise(dplyr::across(dplyr::everything(),
\(x) mean(is.na(x)) * 100)) |>
tidyr::pivot_longer(dplyr::everything(),
names_to = "variable", values_to = "pct_missing")
ggplot(miss_long,
aes(x = stats::reorder(variable, pct_missing), y = pct_missing)) +
geom_col(fill = "#020073") +
geom_text(aes(label = sprintf("%.1f%%", pct_missing)),
hjust = -0.1, size = 3.2, color = "#333") +
coord_flip(clip = "off") +
scale_y_continuous(limits = c(0, max(miss_long$pct_missing) * 1.15),
labels = \(x) paste0(x, "%")) +
labs(x = NULL, y = "% missing in raw NHANES merge") +
theme(plot.margin = margin(5, 80, 5, 5))
```
```{r eda-univariate}
#| label: eda-univariate
#| fig-cap: "Univariate distributions of the five continuous components used downstream. Triglycerides and fasting glucose are right-skewed with long upper tails — this is exactly the regime where peer-comparison estimators like robust z-scores beat naive means. Vertical lines mark ATP III diagnostic thresholds."
#| fig-width: 10
#| fig-height: 5.2
cont_long <- nhanes_raw |>
dplyr::transmute(
`Waist (cm)` = BMXWAIST,
`Triglycerides (mg/dL)` = LBXTR,
`HDL (mg/dL)` = LBDHDD,
`Systolic BP (mm Hg)` = BPXSY1,
`Fasting glucose (mg/dL)` = LBXGLU
) |>
tidyr::pivot_longer(dplyr::everything(),
names_to = "variable", values_to = "value") |>
dplyr::filter(!is.na(value))
thresholds <- tibble::tibble(
variable = c("Waist (cm)", "Triglycerides (mg/dL)",
"HDL (mg/dL)", "Systolic BP (mm Hg)",
"Fasting glucose (mg/dL)"),
cut = c(95, 150, 45, 130, 100)
)
ggplot(cont_long, aes(value)) +
geom_histogram(fill = "#020073", color = "white", bins = 40) +
geom_vline(data = thresholds, aes(xintercept = cut),
color = "#FF2E00", linetype = "dashed", linewidth = 0.6) +
facet_wrap(~ variable, scales = "free", ncol = 3) +
labs(x = NULL, y = "Count")
```
```{r eda-cells}
#| label: eda-cells
#| fig-cap: "Sex × race/ethnicity cell counts after the adult (≥ 20) and complete-case filter. Cells under ~30 will produce wide CIs in subgroup estimates — a sanity check before reporting any stratified prevalence."
dat_eda <- nhanes_raw |>
dplyr::filter(age_years >= 20, !is.na(WTMEC2YR)) |>
dplyr::filter(if_all(c(BMXWAIST, LBXTR, LBDHDD, BPXSY1, BPXDI1, LBXGLU),
\(x) !is.na(x)))
cells <- dat_eda |>
dplyr::count(sex, race_eth) |>
dplyr::mutate(race_eth = forcats::fct_reorder(race_eth, n, .desc = TRUE))
ggplot(cells, aes(race_eth, n, fill = sex)) +
geom_col(position = position_dodge(0.75), width = 0.7) +
geom_text(aes(label = n),
position = position_dodge(0.75), vjust = -0.3, size = 3.1) +
scale_fill_manual(values = c("Male" = "#020073", "Female" = "#FF2E00")) +
scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
labs(x = NULL, y = "Adult complete-case count", fill = NULL) +
theme(legend.position = "top",
axis.text.x = element_text(angle = 20, hjust = 1))
```
```{r eda-target-prevalence}
#| label: eda-target-prevalence
target <- dat_eda |>
dplyr::transmute(
waist_hit = (sex == "Male" & BMXWAIST >= 102) |
(sex == "Female" & BMXWAIST >= 88),
trig_hit = LBXTR >= 150,
hdl_hit = (sex == "Male" & LBDHDD < 40) |
(sex == "Female" & LBDHDD < 50),
bp_hit = BPXSY1 >= 130 | BPXDI1 >= 85,
glu_hit = LBXGLU >= 100,
metsyn = rowSums(cbind(waist_hit, trig_hit, hdl_hit, bp_hit, glu_hit), na.rm = TRUE) >= 3
)
cat(sprintf(
"Adult complete-case n = %s\nUnweighted MetS prevalence = %.1f%%\n",
format(nrow(target), big.mark = ","),
100 * mean(target$metsyn, na.rm = TRUE)
))
```
The unweighted prevalence above is what a naive analyst would report. The weighted estimate in the next section will land in a different place — that gap is the methodological reason we apply the survey design.
::: {.callout-note}
## Cohort flow — n surviving each filtering step
A reviewer's first question on any observational analysis is "where did the n come from?" The pipeline filters the raw NHANES merge in three stages; the table below makes the staged attrition explicit, mirroring the CONSORT / STROBE flow expectation for an observational cohort. Each row's count is recomputed live from the in-memory objects so it stays consistent with whatever filter set is currently in effect.
```{r cohort-flow}
#| label: cohort-flow
#| echo: false
n_raw <- nrow(nhanes_raw)
n_pooled <- nrow(nhanes_raw)
n_adult <- sum(nhanes_raw$age_years >= 20, na.rm = TRUE)
# `dat_full` (built later) and `dat` (complete-case) are not yet in scope at
# this point in the rendered flow; the downstream chunks compute them and the
# corresponding numbers are reported in their own cat() outputs. The flow
# table below documents the *staged* design so the reader can locate each n
# in the chunk that produces it.
flow <- data.frame(
Step = c("1. Pooled raw merge (cycles 2015-2018)",
"2. Adult subset (age >= 20 yrs)",
"3. Pre-complete-case cohort (valid MEC weight, age 20+) -- see `define-outcome` chunk for n",
"4. Complete-case cohort (all 5 ATP III components observed) -- see `define-outcome` chunk for n",
"5. MI cohort (Rubin's-rules over m = 5 imputations) -- see `mi-impute` for the pre-CC denominator"),
n = c(format(n_raw, big.mark = ","),
format(n_pooled, big.mark = ","),
format(n_adult, big.mark = ","),
"see chunk",
"see chunk"),
`Filter that drops rows` = c(
"Merge across DEMO, BMX, BPX, GLU, HDL, TRIG, DIQ, TCHOL, SMQ, BPQ for both cycles",
"Removes children and adolescents (CDC standard age cutoff for cardiometabolic adult analyses)",
"Removes non-examined participants (zero MEC weight)",
"Removes rows missing any of waist, trig, HDL, SBP, DBP, fasting glucose -- the binding constraint is the morning fasting subsample",
"Carries forward design columns + demographics as auxiliary predictors; imputes the 8 clinical measurements"
),
check.names = FALSE
)
knitr::kable(flow, caption = "Cohort flow diagram in tabular form. Steps 1 and 2 are computed live; steps 3 and 4 are reported by the `define-outcome` chunk below; step 5 by the `mi-impute` chunk further down. Together they document the inclusion / exclusion path from raw merge to the MI-pooled analytic sample.")
```
:::
## Defining metabolic syndrome (ATP III, harmonized)
An adult meets criteria if **≥ 3 of 5** are true:
| Component | Threshold |
|---|---|
| Waist circumference | ≥ 102 cm (M) or ≥ 88 cm (F) |
| Triglycerides | ≥ 150 mg/dL |
| HDL cholesterol | < 40 mg/dL (M) or < 50 mg/dL (F) |
| Blood pressure | SBP ≥ 130 or DBP ≥ 85, or on BP medication |
| Fasting glucose | ≥ 100 mg/dL, or on diabetes medication |
```{r define-outcome}
#| label: define-outcome
# Pre-complete-case cohort — keep one copy with NAs intact for MICE downstream,
# and one filtered copy for the design-aware analysis the existing chunks
# already use. The outcome derivation is identical; only the row filter
# differs.
dat_full <- nhanes_raw |>
filter(age_years >= 20, !is.na(WTMEC4YR), WTMEC4YR > 0) |>
transmute(
SEQN, cycle, WTMEC4YR, SDMVPSU, SDMVSTRA,
sex, age_years, race_eth,
bmi = BMXBMI,
waist = BMXWAIST,
trig = LBXTR,
hdl = LBDHDD,
tc = LBXTC,
sbp = rowMeans(cbind(BPXSY1, BPXSY2, BPXSY3), na.rm = TRUE),
dbp = rowMeans(cbind(BPXDI1, BPXDI2, BPXDI3), na.rm = TRUE),
fpg = LBXGLU,
dm_dx = !is.na(DIQ010) & DIQ010 == "Yes",
bp_treated = !is.na(BPQ050A) & BPQ050A == "Yes",
smoker = !is.na(SMQ040) & SMQ040 %in% c("Every day", "Some days"),
waist_hit = (sex == "Male" & waist >= 102) |
(sex == "Female" & waist >= 88),
trig_hit = trig >= 150,
hdl_hit = (sex == "Male" & hdl < 40) |
(sex == "Female" & hdl < 50),
bp_hit = sbp >= 130 | dbp >= 85,
glu_hit = fpg >= 100 | dm_dx,
mets_n = rowSums(cbind(waist_hit, trig_hit, hdl_hit, bp_hit, glu_hit),
na.rm = TRUE),
metsyn = mets_n >= 3
)
# Coerce sbp/dbp NaN (when all three readings missing) back to NA so MICE
# treats them consistently.
dat_full <- dat_full |>
mutate(across(c(sbp, dbp), \(x) ifelse(is.nan(x), NA, x)))
dat <- dat_full |>
filter(if_all(c(waist, trig, hdl, sbp, dbp, fpg), \(x) !is.na(x)))
cat(sprintf(
"Pre-CC pooled cohort : %s\nComplete-case cohort : %s (drop %.1f%%)\n",
format(nrow(dat_full), big.mark = ","),
format(nrow(dat), big.mark = ","),
100 * (1 - nrow(dat) / nrow(dat_full))
))
```
## Diagnostics on the analytic cohort
Three publication-grade checks before any modeling: a **formal distributional fit** for each continuous candidate (skew/kurtosis quantified, candidate distributions ranked by AIC), a **selection-bias check** comparing the dropped rows to the kept ones, and an **ICC variance partition** that quantifies how much of the outcome variance lives at each clustering level of the NHANES sample design.
```{r eda-distributional-fit}
#| label: eda-distributional-fit
cont_vars <- list(
bmi = dat_full$bmi, waist = dat_full$waist,
trig = dat_full$trig, hdl = dat_full$hdl,
sbp = dat_full$sbp, dbp = dat_full$dbp,
fpg = dat_full$fpg, tc = dat_full$tc
)
skew_kurt <- vapply(cont_vars, function(x) {
x <- x[!is.na(x)]
c(skew = e1071::skewness(x), kurtosis = e1071::kurtosis(x))
}, numeric(2))
knitr::kable(
data.frame(
Variable = colnames(skew_kurt),
n = vapply(cont_vars, \(x) sum(!is.na(x)), integer(1)),
Skew = round(skew_kurt["skew", ], 2),
`Excess kurtosis` = round(skew_kurt["kurtosis", ], 2),
Shape = ifelse(abs(skew_kurt["skew", ]) > 1, "Right-skewed",
ifelse(abs(skew_kurt["skew", ]) > 0.5, "Mildly skewed", "Roughly symmetric")),
check.names = FALSE
),
caption = "Skewness and excess kurtosis on the eight continuous candidate variables. |skew| > 1 indicates a heavy tail that materially affects the choice of regression family or transformation. The right-skewed variables get a formal distribution fit in the next table."
)
skewed <- names(cont_vars)[abs(skew_kurt["skew", ]) > 1]
fit_results <- list()
for (v in skewed) {
x <- cont_vars[[v]]
x <- x[!is.na(x) & x > 0]
fits <- list(
normal = try(fitdistrplus::fitdist(x, "norm"), silent = TRUE),
lnorm = try(fitdistrplus::fitdist(x, "lnorm"), silent = TRUE),
gamma = try(fitdistrplus::fitdist(x, "gamma"), silent = TRUE)
)
aics <- vapply(fits, function(f) {
if (inherits(f, "try-error") || is.null(f)) NA_real_ else f$aic
}, numeric(1))
fit_results[[v]] <- aics
}
if (length(fit_results) > 0) {
fit_mat <- do.call(rbind, fit_results)
best <- colnames(fit_mat)[apply(fit_mat, 1, function(r) which.min(r))]
knitr::kable(
data.frame(
Variable = rownames(fit_mat),
`AIC normal` = round(fit_mat[, "normal"], 0),
`AIC log-normal` = round(fit_mat[, "lnorm"], 0),
`AIC gamma` = round(fit_mat[, "gamma"], 0),
`Best fit` = best,
check.names = FALSE
),
caption = "AIC comparison of three candidate distributions on the right-skewed variables (positive values only). Lowest AIC = best fit. The 'best fit' column is the recommended starting family for any regression that takes one of these as outcome — for triglycerides and fasting glucose the answer is almost always log-normal or gamma rather than normal."
)
}
```
::: {.callout-note collapse="true"}
## Going deeper — what skewness and excess kurtosis actually measure
Skewness asks whether a histogram leans to one side — zero is symmetric (bell curve), positive means a long right tail. Kurtosis asks how *fat* the tails are: whether rare extremes show up more often than they would in a normal distribution. The "excess" prefix means we subtract 3 so a perfect bell curve scores 0; positive values mean heavier-than-normal tails (leptokurtic), negative means lighter (platykurtic). For a sample with mean $\bar{x}$ and SD $s$:
$$\text{skew} = \tfrac{1}{n} \sum_i \!\left(\tfrac{x_i - \bar{x}}{s}\right)^{\!3}, \qquad \text{excess kurtosis} = \tfrac{1}{n} \sum_i \!\left(\tfrac{x_i - \bar{x}}{s}\right)^{\!4} - 3.$$
The cube and fourth power deliberately amplify extremes — a single value 4.5 SDs from the mean contributes $4.5^3 \approx 91$ to the skewness sum, dominating hundreds of near-mean residuals near zero. That's why "$|\text{skew}| > 1$" is the conventional flag: it survives one or two true outliers but isn't dismissable as sampling noise. Triglycerides in a heavy-tailed clinical sample typically read skew $\approx 2{-}3$ — well past the threshold, telling you a log transform is in order before any normal-theory model.
:::
::: {.callout-note collapse="true"}
## Going deeper — what AIC is actually doing
AIC scores candidate "shapes" the data might follow — normal, log-normal, gamma — by balancing *fit* against *complexity*. The formula
$$\text{AIC} = 2k - 2 \log L$$
rewards models with high likelihood ($\log L$) but charges $2k$ per free parameter, so you can't always win by adding dials. The constant $2$ isn't arbitrary: under regularity conditions, AIC is an asymptotically unbiased estimator of (twice) the Kullback–Leibler divergence between the candidate and the truth, so minimising AIC ≈ minimising expected information loss. The score is meaningless in isolation; only *differences* matter — $\Delta\text{AIC} < 2$ is weak, $4{-}7$ substantial, $>10$ decisive. **BIC** is the close cousin that swaps $2k$ for $k \log n$, harsher on complexity as $n$ grows: *consistent* (picks the true model in the limit) rather than *efficient* (minimises prediction loss). For triglycerides with $n \approx 6{,}000$ and three two-parameter candidates, you might see AICs around 64,200 (normal), 62,800 (log-normal), 62,800 (gamma) — gamma and log-normal tie, normal is decisively rejected. Since $k$ is identical here the penalty cancels and the ranking is pure fit.
:::
```{r eda-selection-bias}
#| label: eda-selection-bias
# Tag inclusion status, then formally test whether the included and
# excluded subsamples differ on the demographics we're going to condition
# on downstream. Significant differences quantify the selection-bias
# threat that motivates MICE in a later section.
dat_full2 <- dat_full |>
dplyr::mutate(included = SEQN %in% dat$SEQN)
n_inc <- sum(dat_full2$included)
n_exc <- sum(!dat_full2$included)
results <- list(
age_years = stats::t.test(age_years ~ included, data = dat_full2),
bmi = stats::t.test(bmi ~ included, data = dat_full2),
sex = stats::chisq.test(table(dat_full2$sex, dat_full2$included)),
race_eth = stats::chisq.test(table(dat_full2$race_eth, dat_full2$included)),
cycle = stats::chisq.test(table(dat_full2$cycle, dat_full2$included))
)
bias_df <- data.frame(
Variable = names(results),
Test = c("Welch t", "Welch t", "Pearson chi-squared",
"Pearson chi-squared", "Pearson chi-squared"),
Statistic = vapply(results, \(r) round(as.numeric(r$statistic), 2), numeric(1)),
df = vapply(results, \(r) {
p <- if (!is.null(r$parameter)) as.numeric(r$parameter[1]) else NA_real_
if (is.na(p)) "" else as.character(round(p, 1))
}, character(1)),
`p-value` = vapply(results, \(r) format.pval(r$p.value, eps = 1e-300, digits = 2),
character(1)),
check.names = FALSE
)
cat(sprintf("Included (complete-case) : %s\nExcluded : %s\n\n",
format(n_inc, big.mark = ","),
format(n_exc, big.mark = ",")))
knitr::kable(
bias_df,
caption = "Selection-bias diagnostics: do the rows we drop in the complete-case filter differ from the rows we keep, on the demographics that condition the downstream regression? Significant differences (typically on age, since fasting subsamples skew older) quantify the threat that motivates the MICE section below."
)
```
```{r eda-icc-partition}
#| label: eda-icc-partition
# Partition the variance of MetS at NHANES's complex-survey clustering
# levels — strata, then PSU within stratum. Subject-level variance
# falls out as the residual.
icc_model <- lme4::glmer(
as.integer(metsyn) ~ 1 + (1 | SDMVSTRA) + (1 | SDMVPSU),
data = dat,
family = binomial(),
control = lme4::glmerControl(optimizer = "bobyqa")
)
vc <- as.data.frame(lme4::VarCorr(icc_model))[, c("grp", "vcov")]
# For binomial GLMM with logit link, the residual variance is fixed at
# pi^2 / 3 on the latent scale (Snijders & Bosker convention).
vc <- rbind(vc, data.frame(grp = "Residual (latent)", vcov = pi^2 / 3))
total <- sum(vc$vcov)
vc$ICC_pct <- sprintf("%.2f%%", 100 * vc$vcov / total)
vc$Variance <- round(vc$vcov, 4)
knitr::kable(
vc[, c("grp", "Variance", "ICC_pct")],
col.names = c("Clustering level", "Variance", "% of total"),
caption = "Variance partition for binary MetS via random-intercept logit. Stratum + PSU together carry the design-induced clustering; the residual is the per-subject variance on the latent (logit) scale (pi^2/3 by convention). When the stratum + PSU share is non-trivial, design-aware estimators are doing real inferential work — not a stylistic choice."
)
```
::: {.callout-note collapse="true"}
## Going deeper — why the latent residual variance is exactly $\pi^2/3$
Behind every logistic regression there's a hidden continuous "propensity" score $Y^* = X\beta + \varepsilon$ we never observe — we only see whether each person crossed a threshold (yes/no on metabolic syndrome). For logistic regression, $\varepsilon$ follows the **standard logistic distribution**, with density $f(\varepsilon) = e^{-\varepsilon}/(1+e^{-\varepsilon})^2$ and a variance you can compute by integration:
$$\text{Var}(\varepsilon) = \int_{-\infty}^{\infty} \varepsilon^2 f(\varepsilon)\, d\varepsilon = \tfrac{\pi^2}{3} \approx 3.29.$$
This number isn't arbitrary — it's baked into the choice of logit link, the same way picking Fahrenheit vs Celsius bakes in a scale. (A probit model would have residual variance 1 by analogous logic, which is why probit and logit coefficients differ by a factor of $\sqrt{\pi^2/3} \approx 1.81$.) When `glmer` partitions variance across NHANES strata and PSUs, it reports those components and we compute the ICC against this fixed denominator: if $\sigma^2_{\text{stratum}} = 0.10$ and $\sigma^2_{\text{PSU}} = 0.07$, then ICC$_{\text{stratum}} = 0.10 / (0.10 + 0.07 + \pi^2/3) \approx 2.9\%$. This structurally-large residual is why design ICCs in survey data nearly always read in single-digit percentages on the latent scale.
:::
## Weighted prevalence by demographics
> **In one sentence.** Apply the NHANES design weights so prevalence estimates generalize to the US civilian non-institutionalized population — and report the weighted estimate as the headline rather than the unweighted one.
::: {.callout-note}
## Why a survey-weighted estimator (and not a plain proportion)?
NHANES is **not a simple random sample.** It deliberately *oversamples* certain subgroups (older adults, Black, Hispanic, low-income) so the lab work has enough cases per cell to estimate subgroup prevalence precisely. If you compute a plain proportion on the raw rows, you over-count those oversampled groups in the headline number.
The fix is the **MEC sample weight (`WTMEC2YR`)** — a number per row that says "this person represents *N* people in the US population." A weighted mean reweights every row by `WTMEC2YR` so the result is what you'd see if NHANES had been a true random sample of the US adult population. The `survey::svymean()` function also accounts for the sampling **strata** and **PSUs** (geographic clusters), so the standard error reflects the design — not the smaller SE you'd get pretending the sample was simple-random.
**Use a weighted estimator whenever a sample was drawn with unequal probabilities — NHANES, BRFSS, MEPS, NHIS, NHATS, all the federal surveys.** The unweighted version isn't "simpler"; it's biased. (Lumley 2004, 2010)
:::
::: {.callout-note collapse="true"}
## Going deeper — Taylor linearization, `nest = TRUE`, and what the SE is actually computing
A survey-weighted mean $\hat{Y} = \sum w_i y_i / \sum w_i$ is a *ratio of two random sums*, and ratios don't have a clean closed-form variance. **Taylor linearization** sidesteps that: pretend the ratio is approximately a straight line at the point we care about, compute the variance of *that* straight-line approximation, and call it a good enough estimate. Defining the residual $z_i = (y_i - \hat{Y})/\hat{N}$ where $\hat{N} = \sum w_i$, the formula sums between-PSU squared deviations within each stratum:
$$\widehat{\text{Var}}(\hat{Y}) \approx \sum_h \tfrac{n_h}{n_h - 1} \sum_{c \in h} \!\left(\sum_{i \in c} w_i z_i - \overline{w z}_h\right)^{\!2}.$$
Each PSU contributes one term; the standard error is built from between-PSU variation, *not* from independent observations. This is why design degrees of freedom = (PSUs) − (strata), often far smaller than $n$ — with 30 strata × 2 PSUs each on $n = 5{,}000$, df = 30, and a naive simple-random SE would be roughly $13\times$ too narrow.
**`nest = TRUE`** is a critical implementation flag: NHANES reuses small numeric PSU IDs across different strata (PSU "1" in stratum 100 has nothing to do with PSU "1" in stratum 101). `nest = TRUE` tells `svydesign` to treat the PSU code as nested *within* its stratum, not as a global identifier. Forget it and the SEs pool unrelated PSUs and come out badly wrong.
:::
Critically, NHANES oversamples certain subpopulations. A naive proportion is biased; we have to apply the MEC sampling weights with `survey`/`srvyr`.
```{r weighted-table}
#| label: weighted-table
des <- nhanes_design(dat, weight = "WTMEC4YR")
prev_tbl <- des |>
group_by(sex, race_eth) |>
summarize(
metsyn_pct = survey_mean(metsyn, proportion = TRUE, vartype = "ci") * 100
)
pretty_kable(
prev_tbl,
digits = 1,
col.names = c("Sex", "Race/Ethnicity", "Prevalence (%)", "Lower 95% CI", "Upper 95% CI"),
caption = "Weighted metabolic syndrome prevalence, US adults ≥ 20 years (NHANES 2015–2018, pooled)."
)
```
```{r prev-plot}
#| label: prev-plot
#| fig-cap: "Weighted metabolic syndrome prevalence with 95% CIs."
ggplot(prev_tbl, aes(x = reorder(race_eth, metsyn_pct), y = metsyn_pct, fill = sex)) +
geom_col(position = position_dodge(0.75), width = 0.7) +
geom_errorbar(
aes(ymin = metsyn_pct_low, ymax = metsyn_pct_upp),
position = position_dodge(0.75), width = 0.2
) +
coord_flip() +
scale_fill_manual(values = c("Male" = "#020073", "Female" = "#FF2E00")) +
labs(x = NULL, y = "Prevalence (%)", fill = NULL) +
theme(legend.position = "top")
```
### Sensitivity to the metabolic-syndrome definition
The headline prevalence depends on which diagnostic rule we apply. Three sets of criteria are in active use in the literature: **NCEP ATP III** (the original 2001 definition, which is what the rest of this notebook uses), **IDF 2005** (waist-circumference is mandatory rather than one of five), and **Harmonized / JIS 2009** (waist thresholds vary by ethnicity, the rest match ATP III). Reporting all three side-by-side closes the "your headline depends on the threshold" reviewer comment.
```{r mets-sensitivity}
#| label: mets-sensitivity
# Three definitions, each producing a binary metsyn flag on dat. We
# reuse the existing component thresholds for ATP III, then express
# IDF and JIS as modifications of the same components. Population:
# adult complete-case cohort (consistent with the rest of the section).
mets_atp3 <- dat |>
dplyr::transmute(
SEQN, WTMEC4YR, SDMVPSU, SDMVSTRA,
waist_hit = (sex == "Male" & waist >= 102) | (sex == "Female" & waist >= 88),
trig_hit = trig >= 150,
hdl_hit = (sex == "Male" & hdl < 40) | (sex == "Female" & hdl < 50),
bp_hit = sbp >= 130 | dbp >= 85 | bp_treated,
glu_hit = fpg >= 100 | dm_dx,
metsyn = as.integer(rowSums(cbind(waist_hit, trig_hit, hdl_hit, bp_hit, glu_hit),
na.rm = TRUE) >= 3)
)
# IDF 2005: central obesity is mandatory (waist >= 94/M, 80/F for
# Europid; we use the more conservative ATP-III thresholds here as a
# sensitivity floor) PLUS any 2 of the remaining 4.
mets_idf <- dat |>
dplyr::transmute(
SEQN, WTMEC4YR, SDMVPSU, SDMVSTRA,
central = (sex == "Male" & waist >= 94) | (sex == "Female" & waist >= 80),
trig_hit = trig >= 150,
hdl_hit = (sex == "Male" & hdl < 40) | (sex == "Female" & hdl < 50),
bp_hit = sbp >= 130 | dbp >= 85 | bp_treated,
glu_hit = fpg >= 100 | dm_dx,
metsyn = as.integer(central &
rowSums(cbind(trig_hit, hdl_hit, bp_hit, glu_hit),
na.rm = TRUE) >= 2)
)
# JIS 2009: same as ATP III on every component except waist threshold,
# which is ethnicity-specific. We use the conservative ATP III values
# for Non-Hispanic White/Black and IDF Asian-specific (90/M, 80/F) for
# the Asian/Mexican-American/Other groups present in NHANES.
mets_jis <- dat |>
dplyr::mutate(
waist_cut_m = dplyr::if_else(grepl("Asian|Mexican|Other", as.character(race_eth)),
90, 102),
waist_cut_f = dplyr::if_else(grepl("Asian|Mexican|Other", as.character(race_eth)),
80, 88)
) |>
dplyr::transmute(
SEQN, WTMEC4YR, SDMVPSU, SDMVSTRA,
waist_hit = (sex == "Male" & waist >= waist_cut_m) |
(sex == "Female" & waist >= waist_cut_f),
trig_hit = trig >= 150,
hdl_hit = (sex == "Male" & hdl < 40) | (sex == "Female" & hdl < 50),
bp_hit = sbp >= 130 | dbp >= 85 | bp_treated,
glu_hit = fpg >= 100 | dm_dx,
metsyn = as.integer(rowSums(cbind(waist_hit, trig_hit, hdl_hit, bp_hit, glu_hit),
na.rm = TRUE) >= 3)
)
prev_for <- function(d, label) {
des_x <- survey::svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
weights = ~WTMEC4YR, data = d, nest = TRUE)
m <- survey::svymean(~metsyn, des_x)
ci <- stats::confint(m)
data.frame(
Definition = label,
`Prevalence` = sprintf("%.1f%%", 100 * as.numeric(coef(m))),
`95% CI` = sprintf("%.1f%% – %.1f%%", 100 * ci[1, 1], 100 * ci[1, 2]),
n = nrow(d),
check.names = FALSE
)
}
pretty_kable(
rbind(
prev_for(mets_atp3, "NCEP ATP III (primary)"),
prev_for(mets_idf, "IDF 2005 (waist-mandatory)"),
prev_for(mets_jis, "JIS 2009 (ethnicity-specific waist)")
),
caption = "Survey-weighted metabolic-syndrome prevalence under three published definitions. The primary analysis uses NCEP ATP III; IDF 2005 is more conservative because central obesity is mandatory; JIS 2009 uses lower waist cutoffs for Asian and Mexican-American groups, which raises prevalence in those subpopulations and shifts the headline upward. The spread across the three is the bound a reviewer would ask the headline number to sit inside."
)
```
## Collinearity diagnostics
Before running any regression we need to know which candidate predictors are too tightly correlated to coexist in the same specification. Two relationships dominate cardiometabolic data: **BMI vs. waist circumference** (essentially the same construct measured two ways) and **systolic vs. diastolic blood pressure** (correlated through the underlying mean arterial pressure). A model that includes both members of either pair will produce inflated standard errors and unstable coefficients.
```{r collinearity-corr}
#| label: collinearity-corr
#| fig-cap: "Pairwise Pearson correlations among continuous candidate predictors. Cells with |r| ≥ 0.7 are flagged in orange — the conventional threshold above which two variables shouldn't enter the same regression. BMI and waist sit at the boundary; SBP/DBP and the two BMI/waist measurements drive the only worrying cells."
#| fig-width: 7.5
#| fig-height: 5.5
cont_pred <- dat |>
dplyr::select(age_years, bmi, waist, sbp, dbp, fpg, hdl, trig)
corr <- stats::cor(cont_pred, use = "pairwise.complete.obs", method = "pearson")
corr_long <- as.data.frame(as.table(corr)) |>
dplyr::rename(x = Var1, y = Var2, r = Freq) |>
dplyr::mutate(
flag = abs(r) >= 0.7 & x != y,
label = sprintf("%.2f", r)
)
ggplot(corr_long, aes(x, y, fill = r)) +
geom_tile(color = "white") +
geom_text(aes(label = label,
color = abs(r) > 0.5),
size = 3.2, fontface = "bold") +
geom_tile(data = subset(corr_long, flag),
fill = NA, color = "#FF2E00", linewidth = 1.2) +
scale_fill_gradient2(low = "#FF2E00", mid = "#F6F9F7", high = "#020073",
midpoint = 0, limits = c(-1, 1),
name = "Pearson r") +
scale_color_manual(values = c(`TRUE` = "white", `FALSE` = "#333"),
guide = "none") +
coord_fixed() +
labs(x = NULL, y = NULL) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
```
```{r collinearity-vif}
#| label: collinearity-vif
# Compute per-dummy VIFs from the design matrix of the exact specification
# we're about to fit. Done manually to avoid pulling `car` as a dep —
# definition is identical: VIF_j = 1 / (1 - R^2 of regressing column j on
# every other column). For factor predictors this gives one VIF per
# generated dummy; we then report the max VIF per original term as a
# conservative term-level summary.
mm <- stats::model.matrix(~ age_years + sex + race_eth, data = dat)[, -1, drop = FALSE]
vif_dummies <- vapply(seq_len(ncol(mm)), function(j) {
y <- mm[, j]
x <- cbind(1, mm[, -j, drop = FALSE])
fit <- stats::lm.fit(x = x, y = y)
ss_res <- sum(fit$residuals^2)
ss_tot <- sum((y - mean(y))^2)
r2 <- 1 - ss_res / max(ss_tot, .Machine$double.eps)
1 / max(1 - r2, .Machine$double.eps)
}, numeric(1))
names(vif_dummies) <- colnames(mm)
# Map each generated dummy back to its source term.
term_of <- function(col) {
if (col == "age_years") return("age_years")
if (startsWith(col, "sex")) return("sex")
if (startsWith(col, "race_eth")) return("race_eth")
col
}
vif_term <- tapply(vif_dummies, vapply(names(vif_dummies), term_of, character(1)), max)
knitr::kable(
data.frame(
Predictor = names(vif_term),
`Max VIF` = round(as.numeric(vif_term), 2),
Threshold = "< 5 clean, > 10 problematic",
check.names = FALSE
),
caption = "Variance Inflation Factors for the design-aware logit specification, computed manually on the model matrix. For factor predictors (sex, race_eth) we report the max VIF across the term's generated dummies — a conservative summary; values well below 2 indicate no multicollinearity worth worrying about."
)
```
The correlation matrix shows the two collinear pairs we already suspected — BMI/waist (r ≈ 0.88) and SBP/DBP (r ≈ 0.7). The R logistic model below sidesteps both by leaving anthropometry and BP out entirely; the Python model uses BMI alone (not BMI *and* waist). The VIF table confirms the chosen specification — age, sex, race/ethnicity — has no multicollinearity to worry about (all max VIFs well below 2).
## Design-aware logistic regression (R)
Using `svyglm` so standard errors reflect NHANES's clustered, stratified design. Coefficients are log-odds ratios for metabolic syndrome.
::: {.callout-note}
## Why logistic regression?
The outcome (metabolic syndrome) is **binary** — either the person meets ≥3 of 5 criteria or they don't. Linear regression on a binary outcome will happily predict negative or > 1 probabilities and gives biased SEs.
**Logistic regression** models the *log-odds* of the outcome as a linear combination of the predictors. The coefficients (after exponentiating) are **odds ratios** — "for each one-year increase in age, the odds of MetS multiply by *X*." Odds ratios are not risk ratios, but for outcomes that are not very common (< ~10%) they're a close approximation; for ~33% prevalence, you need to interpret them as odds ratios specifically.
`svyglm()` is the survey-aware version: same likelihood, same odds-ratio interpretation, but the standard errors and p-values use Taylor linearization so they correctly account for the strata + PSU + weights. For a complex-survey dataset, you'd never use plain `glm()` for the headline result — the SEs would be wrong. (Cox 1958; Hosmer, Lemeshow & Sturdivant 2013; Lumley 2010)
:::
::: {.callout-note collapse="true"}
## Going deeper — why `family = quasibinomial()` in `svyglm`
Binomial logistic regression assumes the data behave like a perfect coin-flip experiment: conditional on $X$, the variance of $Y$ is $p(1-p)$, fully determined by the mean. Real survey data rarely cooperate — clustering, design weights, and unmodelled heterogeneity all let the variance run higher than the binomial predicts (overdispersion). **Quasi-likelihood** keeps the binomial *mean* model but lets the variance scale by a free dispersion parameter $\phi$ estimated from the residuals:
$$\text{Var}(Y_i \mid X_i) = \phi \cdot p_i(1 - p_i), \qquad \hat\phi = \tfrac{1}{n - p} \sum_i \tfrac{(y_i - \hat p_i)^2}{\hat p_i (1 - \hat p_i)}.$$
Coefficient point estimates are identical to binomial MLEs (the score equations don't depend on $\phi$); what changes is the variance, which scales by $\hat\phi$ — so standard errors get multiplied by $\sqrt{\hat\phi}$. The `survey` package effectively requires this because a weighted likelihood isn't a proper Bernoulli likelihood; `family = binomial()` triggers an R warning ("non-integer #successes"), while quasibinomial silences it *and* uses the empirical residual variance instead of the nominal $p(1-p)$. With mild overdispersion ($\hat\phi = 1.4$ is typical for NHANES), every SE inflates by ~18%, widening 95% CIs and shifting borderline p-values toward null. Ignoring it would publish narrower CIs than the design supports.
:::
```{r svyglm}
#| label: svyglm
dat_fit <- dat |>
mutate(
metsyn = as.integer(metsyn),
sex = droplevels(factor(sex)),
race_eth = droplevels(factor(race_eth))
)
des_std <- survey::svydesign(
ids = ~SDMVPSU,
strata = ~SDMVSTRA,
weights = ~WTMEC4YR,
data = dat_fit,
nest = TRUE
)
fit <- survey::svyglm(
metsyn ~ age_years + sex + race_eth,
design = des_std,
family = quasibinomial()
)
fit |>
gtsummary::tbl_regression(exponentiate = TRUE) |>
gtsummary::modify_caption("**Design-aware odds ratios for metabolic syndrome (complete-case)**")
```
## Multiple imputation — beyond complete case
> **In one sentence.** Re-estimate the prevalence and the design-aware logit on the broader cohort with imputation uncertainty propagated through Rubin's rules — the figure to put in a paper, with complete-case as the sensitivity contrast.
::: {.callout-note}
## Why multiple imputation (and not complete-case or single imputation)?
Three ways to handle missing data, in increasing order of how much they respect the truth:
1. **Complete-case (CC).** Drop any row with a missing value. Easy, but **biased whenever the missingness depends on something** (here, lipids/glucose are missing because the participant wasn't in the morning fasting subsample — which skews younger and healthier). The CC prevalence under-estimates the true population prevalence.
2. **Single imputation** (e.g., fill with the mean). Recovers the rows but **understates the standard error** because the imputed values are treated as if they were observed without uncertainty. p-values come out too small; CIs too narrow.
3. **Multiple imputation (MI).** Generate *m* plausible filled-in datasets (we use m=5), run the analysis on each, then **pool** the results with Rubin's rules — the pooled point estimate is the average across imputations, and the pooled variance has two parts: within-imputation variance (the usual SE) plus between-imputation variance (the part that captures imputation uncertainty). This is the only one of the three that gives **honest** standard errors when data are missing-at-random.
Use MI whenever you have non-trivial missingness on key variables (>5%) and the missingness mechanism could plausibly depend on observed covariates. The `mice` package implements **predictive mean matching** by default — for each missing value, it looks at the few observed cases with the most similar predicted value and donates one of their observed values as the imputation. That keeps imputed values inside the realistic range of the variable. (Rubin 1987; Little 1988; van Buuren & Groothuis-Oudshoorn 2011)
:::
::: {.callout-note collapse="true"}
## Going deeper — Predictive Mean Matching, step by step
Naive regression imputation replaces a missing value with the model's prediction $\hat{y}$ — but that hands every missing person the same conditional mean, which is unrealistically smooth (a 32-year-old's imputed triglycerides should look like *real* 32-year-olds' triglycerides, not the regression's noiseless prediction at age 32). **PMM** uses the regression as a *matching tool* instead. For each variable with missingness: (1) fit a Bayesian linear regression on observed rows and sample $\beta^*$ from its posterior; (2) compute predictions $\hat{y}_i$ for every row, missing or not; (3) for each missing row, find the $k$ observed rows whose predictions are nearest (default $k = 5$); (4) randomly donate one of those donors' *actual observed* values as the imputation. Critically, the imputed value is always something that genuinely occurred somewhere in the data — heavy tails, outliers, weird readings all preserved. The process cycles across all variables ("chained equations") and runs $m$ times to produce $m$ plausible filled-in datasets. PMM is distribution-free: it makes no normality assumption on the residual, only that the regression's *ranking* of cases is roughly right. As an example, a missing fasting glucose for a 60-year-old male with BMI 32 might get $\hat{y} = 108$ mg/dL; PMM finds five men with similar predictions whose actual readings are (94, 102, 119, 127, 138) and donates one at random — so different imputations of the same missing value differ plausibly, and that variation flows downstream as honest uncertainty.
:::
The complete-case cohort silently drops everyone outside the morning fasting subsample (lipids and glucose are the binding constraint). Multiple Imputation by Chained Equations (MICE) re-fits the prevalence and regression on the broader cohort with imputation uncertainty propagated through Rubin's rules. We impute the eight clinical measurements (BMI, waist, lipids, BP readings, fasting glucose) with predictive mean matching, holding the design columns (cycle, weight, strata, PSU) and demographics fixed.
```{r mi-impute}
#| label: mi-impute
#| cache: true
mi_input <- dat_full |>
dplyr::select(
age_years, sex, race_eth, cycle,
bmi, waist, trig, hdl, tc, sbp, dbp, fpg,
dm_dx, bp_treated, smoker,
SDMVPSU, SDMVSTRA, WTMEC4YR
) |>
dplyr::mutate(
sex = droplevels(factor(sex)),
race_eth = droplevels(factor(race_eth)),
cycle = factor(cycle)
)
# Don't impute design columns or already-complete demographics; they
# stay available as auxiliary predictors (sampling-related missingness
# patterns get propagated through the imputation that way).
meth <- mice::make.method(mi_input)
preds <- mice::make.predictorMatrix(mi_input)
meth[c("SDMVPSU", "SDMVSTRA", "WTMEC4YR", "cycle",
"age_years", "sex", "race_eth",
"dm_dx", "bp_treated", "smoker")] <- ""
set.seed(2026)
mi_obj <- mice::mice(mi_input, m = 5, maxit = 8,
method = meth, predictorMatrix = preds,
printFlag = FALSE)
cat(sprintf("MICE complete: m = %d imputations × maxit = %d iterations\n",
mi_obj$m, mi_obj$iteration))
```
```{r mi-prevalence}
#| label: mi-prevalence
# For each imputed dataset, derive metsyn, build the survey design, and
# extract weighted prevalence + variance. Pool with Rubin's rules.
build_metsyn <- function(d) {
d |>
dplyr::mutate(
waist_hit = (sex == "Male" & waist >= 102) |
(sex == "Female" & waist >= 88),
trig_hit = trig >= 150,
hdl_hit = (sex == "Male" & hdl < 40) |
(sex == "Female" & hdl < 50),
bp_hit = sbp >= 130 | dbp >= 85 | bp_treated,
glu_hit = fpg >= 100 | dm_dx,
metsyn = as.integer(rowSums(cbind(waist_hit, trig_hit, hdl_hit,
bp_hit, glu_hit), na.rm = TRUE) >= 3)
)
}
imp_results <- lapply(seq_len(mi_obj$m), function(i) {
d <- build_metsyn(mice::complete(mi_obj, action = i))
des_i <- survey::svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
weights = ~WTMEC4YR, data = d, nest = TRUE)
prev <- survey::svymean(~metsyn, des_i)
list(est = as.numeric(coef(prev)), var = as.numeric(stats::vcov(prev)))
})
m <- length(imp_results)
qs <- vapply(imp_results, \(x) x$est, numeric(1))
us <- vapply(imp_results, \(x) x$var, numeric(1))
qbar <- mean(qs)
ubar <- mean(us)
b <- sum((qs - qbar)^2) / (m - 1)
total_var <- ubar + (1 + 1/m) * b
se <- sqrt(total_var)
# Compare against complete-case weighted prevalence for the contrast.
cc_prev <- survey::svymean(~as.integer(metsyn), des_std)
cc_est <- as.numeric(coef(cc_prev))
cc_se <- sqrt(as.numeric(stats::vcov(cc_prev)))
pretty_kable(
data.frame(
Estimator = c("Complete-case", "MI-pooled (m=5)"),
Prevalence = sprintf("%.1f%%", 100 * c(cc_est, qbar)),
`95% CI` = c(
sprintf("%.1f%% – %.1f%%",
100 * (cc_est - 1.96 * cc_se),
100 * (cc_est + 1.96 * cc_se)),
sprintf("%.1f%% – %.1f%%",
100 * (qbar - 1.96 * se),
100 * (qbar + 1.96 * se))
),
`Cohort n` = c(format(nrow(dat), big.mark = ","),
format(nrow(dat_full), big.mark = ",")),
check.names = FALSE
),
caption = "Weighted MetS prevalence: complete-case vs. MI-pooled. The two estimates should differ in the direction predicted by the missing-data mechanism — fasting-subsample exclusion biases CC toward a younger, healthier cohort."
)
```
::: {.callout-note collapse="true"}
## Going deeper — derivation of Rubin's rules
You've imputed your missing data $m = 5$ different ways, run the same analysis on each, and now have 5 different point estimates with 5 different standard errors. Rubin's rules combine them into one honest total. The point estimate is the simple average $\bar{Q} = \tfrac{1}{m} \sum \hat{Q}_\ell$. The total variance has two parts: the **within-imputation variance** $\bar{U}$ (average of the 5 single-dataset variances — the usual sampling noise you'd have even without missing data) and the **between-imputation variance** $B$ (sample variance of the 5 point estimates — measuring how much the answers disagree across imputations). The combined formula is
$$T = \bar{U} + \!\left(1 + \tfrac{1}{m}\right) B.$$
The $(1 + 1/m)$ correction comes from the law of total variance: $B$ is the variance of a *single* imputation's estimate, but $\bar{Q}$ averages $m$ of them, so the contribution from imputation noise is $B/m$; meanwhile $\bar{Q}$ is itself an estimate of an infinite-imputation limit, adding another $B$ of uncertainty. Together they give $(1 + 1/m) B$, which shrinks to $B$ as $m \to \infty$. Numerical example: if five imputations give MetS prevalence estimates $(0.31, 0.33, 0.32, 0.34, 0.30)$ with within-imputation variances of $1.2 \times 10^{-4}$ each, then $\bar{Q} = 0.32$, $B = 2.5 \times 10^{-4}$, $T = 4.2 \times 10^{-4}$, SE $\approx 0.0205$. The naive within-only SE of $\sqrt{1.2 \times 10^{-4}} \approx 0.011$ would produce a confidence interval almost half as wide — the between-imputation term is doing real work.
:::
```{r mi-svyglm}
#| label: mi-svyglm
fits <- lapply(seq_len(mi_obj$m), function(i) {
d <- build_metsyn(mice::complete(mi_obj, action = i)) |>
dplyr::mutate(
sex = droplevels(factor(sex)),
race_eth = droplevels(factor(race_eth))
)
des_i <- survey::svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
weights = ~WTMEC4YR, data = d, nest = TRUE)
survey::svyglm(metsyn ~ age_years + sex + race_eth,
design = des_i, family = quasibinomial())
})
pooled <- mitools::MIcombine(
results = lapply(fits, coef),
variances = lapply(fits, vcov)
)
ps <- summary(pooled)
# mitools names the lower/upper columns as "(lower" and "upper)" — handle
# both that and tidier alternatives without depending on column order.
lower_col <- grep("lower", colnames(ps), value = TRUE)[1]
upper_col <- grep("upper", colnames(ps), value = TRUE)[1]
miss_col <- grep("miss", colnames(ps), value = TRUE, ignore.case = TRUE)[1]
# missInfo can come back numeric (proportion) or string ("25%") depending
# on mitools version; normalize both forms.
mi_info_raw <- ps[[miss_col]]
mi_info_num <- if (is.numeric(mi_info_raw)) {
mi_info_raw * 100
} else {
suppressWarnings(as.numeric(sub("%\\s*$", "", as.character(mi_info_raw))))
}
mi_table <- data.frame(
Term = rownames(ps),
OR = sprintf("%.2f", exp(ps$results)),
`95% CI (OR)` = sprintf("%.2f – %.2f",
exp(ps[[lower_col]]),
exp(ps[[upper_col]])),
`Frac. missing info` = sprintf("%.1f%%", mi_info_num),
check.names = FALSE
)
pretty_kable(
mi_table,
caption = "Design-aware odds ratios for metabolic syndrome, pooled across 5 multiple imputations using Rubin's rules. The 'Frac. missing info' column estimates the share of variance in each coefficient attributable to imputation noise — values above 30% are the threshold for raising m above 5 in a published analysis."
)
```
::: {.callout-note collapse="true"}
## Going deeper — what "Fraction of Missing Information" actually measures
FMI sounds like "what percent of rows had missing data" — but it isn't. It's the percent of a coefficient's *variance* that comes from the imputation step (rather than from the ordinary sampling noise you'd have even with complete data). Using the Rubin-rule quantities, $\lambda = (1 + 1/m) B / T$ is the fraction of total variance contributed by between-imputation uncertainty; for large samples FMI ≈ $\lambda$. A high FMI (50%) means imputation is doing half the work; a low one (10%) means complete-case and imputed answers would barely differ. **The conventional advice "use at least $\text{FMI} \times 100$ imputations" comes from a specific calculation**: the SE of a coefficient under finite $m$ is approximately $\sqrt{1 + \text{FMI}/m}$ times the infinite-imputation SE. Choosing $m \geq \text{FMI} \times 100$ keeps that "finite-imputation noise" below ~1%. From the Rubin's-rules example above ($B = 2.5 \times 10^{-4}$, $T = 4.2 \times 10^{-4}$, $m=5$), $\lambda \approx 0.71$ and the SE inflation factor is $\sqrt{1 + 0.71/5} \approx 1.07$ — a 7% overestimate from imputation roulette alone. Raising $m$ to 40 would shrink that to 0.9%, making the result publishable with stable inference.
:::
## Gradient boosting + SHAP (Python)
::: {.callout-note}
## Why AUC, Brier, and a calibration plot — together?
A clinical-prediction model has two jobs: **rank** patients by risk (discrimination) and **assign realistic probability numbers** (calibration). Different metrics measure different things.
- **AUC (Area Under the ROC Curve)** is a pure ranking metric. It answers: *if I draw one random MetS+ and one random MetS- person, what's the probability the model assigns higher risk to the MetS+ one?* AUC = 0.5 is coin-flip; 1.0 is perfect ranking. AUC says nothing about whether the predicted probabilities are realistic.
- **Brier score** is the mean squared error between predicted probability and actual outcome (0 or 1). Lower = better. Unlike AUC, Brier penalizes both bad ranking *and* miscalibration in one number.
- **Calibration plot (reliability curve)** shows, for each decile of predicted probability, what fraction of patients actually had the outcome. A perfectly calibrated model lies on the y = x diagonal. Above the diagonal: under-predicting; below: over-predicting.
For a screening tool that hands a patient a "your estimated risk is X%," **calibration matters more than AUC**. A miscalibrated model with high AUC is dangerous because it sorts patients correctly but mislabels their absolute risk. For deciding *who* to screen, AUC is enough. For telling them *what* their risk is, you need calibration. (Hanley & McNeil 1982; Brier 1950; Steyerberg et al. 2010)
:::
We hand the same cohort to Python via `reticulate`. Critically, we **do not** use the five ATP III components (waist, triglycerides, HDL, blood pressure, fasting glucose) as predictors — those define the outcome, so including them yields a tautological AUC ≈ 1.0. A physician-scientist's first job with any black-box model is to catch that kind of leakage before the output looks impressive.
Instead we ask a more interesting question: *how far can basic demographics and adiposity alone carry us toward identifying people at elevated metabolic-syndrome risk?* The answer is a useful screening tool — not a replacement for labs.
::: {.callout-note collapse="true"}
## Why these XGBoost hyperparameters (and what's missing)
**The chosen values.** `n_estimators=400, max_depth=4, learning_rate=0.05, subsample=0.9, colsample_bytree=0.9`. These are the "slow learner ensemble" defaults — shallow trees, low learning rate, many iterations — which is the standard choice for tabular biomedical data where smooth, regularised decision boundaries usually beat aggressive single-tree depth. Depth 4 caps each tree at 16 leaves (only low-order interactions); learning rate 0.05 means each tree's contribution is small; the subsample / colsample dropouts add stochasticity for variance reduction. The XGBoost README, the Kaggle community, and Steyerberg-tradition clinical-prediction work all converge on roughly these settings as the safe starting point.
**What's missing for a publication-grade fit.** No k-fold cross-validation inside the train set for hyperparameter selection — only a single 75/25 train/test split with `random_state = 42`. That means: (1) hyperparameters are *fixed* rather than tuned; (2) the test set isn't a true held-out evaluation because no model variants compete for it. A defensible portfolio version would: nest 5-fold CV inside the train fold to select among (depth ∈ {3, 4, 6}) × (lr ∈ {0.05, 0.10}) × (n_est ∈ {200, 400, 800}), and only at the end touch the test set for the final discrimination + calibration numbers. The fixed-grid choice here is acknowledged as a portfolio shortcut, not as defended methodology.
:::
::: {.callout-note collapse="true"}
## Going deeper — how gradient boosting actually fits each tree
Boosting is "ensemble by sequential correction." Each new tree is fit *not to the labels*, but to the residual errors of the running ensemble so far. Tree 1 makes a rough first pass — "over 50 with BMI > 30 is high risk"; Tree 2 looks at *only the rows Tree 1 got wrong* and learns to correct those; Tree 3 corrects what Trees 1+2 still miss; and so on. After $M$ trees, the prediction is the sum of all the trees' contributions, each scaled by a small learning rate $\eta$. The "gradient" in the name refers to a formal fact: the residual being chased at each step is the negative gradient of the loss function at the current prediction. For log-loss classification, with current log-odds $F_{m-1}(x)$ and current probability $p_i = \sigma(F_{m-1}(x_i))$, the residual passed to the next tree is $r_i = y_i - p_i$; the tree fits those residuals, then the update is $F_m(x) = F_{m-1}(x) + \eta\, h_m(x)$. **Why depth 4 and learning rate 0.05.** Shallow trees (at most 16 leaves) keep each tree weak — only low-order interactions — so the ensemble averages out smoothly. Low $\eta$ means each tree only nudges the prediction by 5% of its full vote, requiring many trees ($M = 400$) to converge. The combination is the "slow learner ensemble" — overfits less than fewer-but-deeper trees with high $\eta$, at the cost of more compute. A 60-year-old male with BMI 32 might start at $p_0 = 0.33$; over 400 sequential corrections each contributing $\sim 0.001$ on the probability scale, that climbs to $p \approx 0.68$.
:::
```{python handoff}
#| label: handoff
df = r.dat.copy()
df["metsyn"] = df["metsyn"].astype(int)
features = ["age_years", "sex", "race_eth", "bmi"]
df = df[features + ["metsyn"]].dropna()
df = pd.get_dummies(df, columns=["sex", "race_eth"], drop_first=True)
X = df.drop(columns="metsyn")
y = df["metsyn"]
X_tr, X_te, y_tr, y_te = train_test_split(
X, y, test_size=0.25, stratify=y, random_state=42
)
model = xgb.XGBClassifier(
n_estimators=400, max_depth=4, learning_rate=0.05,
subsample=0.9, colsample_bytree=0.9,
eval_metric="logloss", random_state=42,
)
model.fit(X_tr, y_tr)
p_te = model.predict_proba(X_te)[:, 1]
print(f"AUC: {roc_auc_score(y_te, p_te):.3f}")
print(f"Brier: {brier_score_loss(y_te, p_te):.3f}")
```
::: {.callout-note collapse="true"}
## Going deeper — how a ROC curve is constructed
A model outputs probabilities between 0 and 1, but to actually classify someone you need a threshold. Slide the threshold from 1 down to 0 and you trace out a curve: for each threshold, plot the true-positive rate (sensitivity) on the y-axis against the false-positive rate (1 − specificity) on the x-axis. The mechanics: sort test rows by predicted probability descending, then walk down the list; each true positive moves the curve up by $1/P$ (where $P$ = total positives), each false positive moves it right by $1/N$. The curve starts at $(0,0)$ (threshold 1, no one called positive) and ends at $(1,1)$ (threshold 0, everyone called positive). **AUC**, the area under that curve, has a clean probabilistic interpretation: it equals $\Pr(\hat{p}_+ > \hat{p}_-)$ — the probability that a randomly-chosen positive case is ranked higher than a randomly-chosen negative. This is the same number as the Mann–Whitney U statistic, which is why AUC is a *pure ranking* metric — rescaling all predictions monotonically (e.g., taking log-odds) leaves it unchanged. With five test rows having probabilities $(0.9, 0.7, 0.5, 0.3, 0.1)$ and labels $(1, 1, 0, 1, 0)$, the step curve passes through TPR/FPR coordinates $(0,0) \to (\tfrac{1}{3}, 0) \to (\tfrac{2}{3}, 0) \to (\tfrac{2}{3}, \tfrac{1}{2}) \to (1, \tfrac{1}{2}) \to (1, 1)$, giving AUC = 0.83.
:::
```{python roc}
#| label: fig-roc
#| fig-cap: "ROC curve, held-out 25% test set. Demographics + BMI alone achieve useful discrimination — well below the perfect AUC that including ATP III components would falsely produce."
fpr, tpr, _ = roc_curve(y_te, p_te)
fig, ax = plt.subplots(figsize=(6, 5))
_ = ax.plot(fpr, tpr, color="#020073", lw=2,
label=f"XGBoost (AUC = {roc_auc_score(y_te, p_te):.3f})")
_ = ax.plot([0, 1], [0, 1], ls="--", color="#FF2E00", lw=1)
_ = ax.set(xlabel="False positive rate", ylabel="True positive rate")
_ = ax.legend(frameon=False)
_ = ax.spines[["top", "right"]].set_visible(False)
plt.tight_layout()
plt.show()
```
::: {.callout-note collapse="true"}
## Going deeper — calibration binning (`n_bins=10, strategy="quantile"`)
A calibration plot answers: when the model says "30% risk," do 30% of those patients actually develop the condition? Since predictions are continuous, you have to **bin** them: group patients, then plot mean prediction (x-axis) against observed rate (y-axis) in each bin. For bin $b$, $\bar{p}_b = \frac{1}{|b|}\sum_{i \in b} \hat{p}_i$ and $\bar{y}_b = \frac{1}{|b|}\sum_{i \in b} y_i$; perfect calibration puts every dot on the diagonal $\bar{y}_b = \bar{p}_b$. The choice of binning matters. **Quantile** splits the predictions into 10 equal-sized groups — each bin holds $n/10$ rows, so observed frequencies are stable across bins; widths on the probability axis vary (narrow where predictions cluster, wider in the tails). **Uniform** splits $[0, 1]$ into 10 equal intervals — bins are equally-spaced on the probability axis but population varies wildly, with most rows piled into bins near zero for a rare outcome. Use quantile (the default) when the prediction distribution is skewed (most clinical problems); use uniform when you specifically care about calibration at a fixed probability cutoff like the 10% threshold for statin therapy. Choosing bin count is its own tradeoff: too few (say 3) and you can't see local miscalibration; too many (50 bins on $n=800$ rows) and each bin's observed rate is too noisy to read. Rule of thumb: $\sqrt{n}$ or $n/30$ bins — 10 is the conservative choice.
:::
```{python calibration}
#| label: fig-calibration
#| fig-cap: "Calibration. Perfectly calibrated models lie on the diagonal; dots above the line mean the model under-estimates true risk in that decile."
prob_true, prob_pred = calibration_curve(y_te, p_te, n_bins=10, strategy="quantile")
fig, ax = plt.subplots(figsize=(6, 5))
_ = ax.plot(prob_pred, prob_true, "o-", color="#020073", lw=2)
_ = ax.plot([0, 1], [0, 1], ls="--", color="#FF2E00", lw=1)
_ = ax.set(xlabel="Predicted probability", ylabel="Observed frequency")
_ = ax.spines[["top", "right"]].set_visible(False)
plt.tight_layout()
plt.show()
```
::: {.callout-note collapse="true"}
## Going deeper — SHAP values and the Shapley equation
SHAP decomposes a single prediction into per-feature contributions that sum *exactly* to the gap between this person's prediction and the population mean: "your age added +0.4 to the log-odds, your BMI added +0.6, your sex subtracted 0.1." The math is from cooperative game theory — features are "players," the prediction is the team's "score," and Shapley values allocate credit fairly. For feature $j$:
$$\phi_j(x) = \sum_{S \subseteq F \setminus \{j\}} \tfrac{|S|!\,(|F| - |S| - 1)!}{|F|!} \,\bigl[v(S \cup \{j\}) - v(S)\bigr],$$
where $v(S)$ is the model's prediction using only the features in $S$, and the weights average over all permutations in which features could "join the coalition." It's the *unique* attribution scheme satisfying four common-sense axioms simultaneously — efficiency (contributions sum to the prediction shift), symmetry, dummy-player, additivity — which is why it's become the standard. Naively the sum is over $2^{|F|}$ subsets, which is intractable; the **TreeSHAP** algorithm exploits decision-tree structure to compute exact values in $O(LD^2)$ per tree. Worked example with a 2-feature model: if $v(\emptyset) = 0.32$, $v(\{a\}) = 0.45$, $v(\{b\}) = 0.50$, $v(\{a,b\}) = 0.68$, then $\phi_a = \tfrac{1}{2}(0.13) + \tfrac{1}{2}(0.18) = 0.155$ and $\phi_b = \tfrac{1}{2}(0.18) + \tfrac{1}{2}(0.23) = 0.205$. They sum to $0.36 = 0.68 - 0.32$, exactly the prediction shift. ✓
:::
```{python shap}
#| label: fig-shap
#| fig-cap: "SHAP summary. Each dot is one held-out adult; color is feature value, x-position is SHAP contribution to the log-odds of metabolic syndrome."
explainer = shap.TreeExplainer(model)
shap_values = explainer(X_te)
shap.summary_plot(shap_values, X_te, show=False, plot_size=(7, 5))
plt.tight_layout()
plt.show()
```
## Pooled Cohort Equations — head-to-head comparator
> **In one sentence.** Compare against an established clinical risk score on two axes — distribution (does PCE risk track MetS status?) and same-outcome AUC (does a logit on PCE inputs beat demographics + BMI?) — so the model has a benchmark, not just a number.
The 2013 ACC/AHA Pooled Cohort Equations (PCE; Goff et al. 2014) is the established 10-year ASCVD risk score in US guidelines. Its outcome (incident ASCVD over 10 years) doesn't match ours (cross-sectional metabolic syndrome), so a direct AUC comparison would be apples-to-oranges. We do two complementary checks instead:
1. **Construct-validity check.** PCE-predicted ASCVD risk should run higher among MetS-positive adults than MetS-negative on the same cohort — if the two scores are measuring overlapping cardiovascular biology, they should track.
2. **Same-outcome head-to-head.** Refit a logit on PCE *inputs* (age, sex, race, total cholesterol, HDL, SBP, BP-treated, current smoker, diabetes) to predict our MetS outcome. Compare AUC against the demographics+BMI XGBoost. Same outcome, two different feature sets — that's the publishable head-to-head.
```{r pce-construct-validity}
#| label: pce-construct-validity
#| fig-cap: "Construct validity check. PCE 10-year ASCVD risk distribution by metabolic-syndrome status, complete-case adults aged 40–79 (PCE's supported age range). Higher PCE risk in the MetS+ group is the expected direction; the Wilcoxon test below quantifies the gap."
#| fig-width: 6.5
#| fig-height: 4.2
dat_pce <- dat |>
dplyr::mutate(
race_pce = dplyr::case_when(
stringr::str_detect(as.character(race_eth), "Non-Hispanic White") ~ "White",
stringr::str_detect(as.character(race_eth), "Non-Hispanic Black") ~ "Black",
TRUE ~ NA_character_
),
pce_risk = pce_risk(
age = age_years, sex = as.character(sex), race = race_pce,
tc = tc, hdl = hdl, sbp = sbp,
treated = bp_treated, smoker = smoker, diabetes = dm_dx
)
) |>
dplyr::filter(!is.na(pce_risk))
ggplot(dat_pce, aes(x = factor(metsyn,
levels = c(FALSE, TRUE),
labels = c("MetS-", "MetS+")),
y = pce_risk,
fill = metsyn)) +
geom_boxplot(width = 0.5, outlier.alpha = 0.3) +
scale_fill_manual(values = c("FALSE" = "#020073", "TRUE" = "#FF2E00"),
guide = "none") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(x = NULL, y = "PCE 10-year ASCVD risk")
w <- stats::wilcox.test(pce_risk ~ metsyn, data = dat_pce)
cat(sprintf(
"Median PCE 10-yr ASCVD risk:\n MetS- : %.1f%%\n MetS+ : %.1f%%\nWilcoxon p (one-sided MetS+ > MetS-) : %s\nN with PCE-eligible inputs (age 40-79, NH White/Black) : %s\n",
100 * stats::median(dat_pce$pce_risk[!dat_pce$metsyn]),
100 * stats::median(dat_pce$pce_risk[ dat_pce$metsyn]),
format.pval(w$p.value, eps = 1e-300),
format(nrow(dat_pce), big.mark = ",")
))
```
```{python pce-vs-xgb-auc}
#| label: fig-pce-vs-xgb-auc
#| fig-cap: "Head-to-head ROC: a logistic regression on PCE inputs (age, sex, race, TC, HDL, SBP, BP-treated, smoker, diabetes) vs. the demographics+BMI XGBoost from above, predicting cross-sectional metabolic syndrome on the same held-out 25% test set. PCE inputs include three of the five MetS components (HDL, SBP, glucose-via-diabetes), so they get a partial-leakage head start — that's the right framing, not a defect."
from sklearn.linear_model import LogisticRegression
dpce = r.dat_pce.copy()
dpce["metsyn"] = dpce["metsyn"].astype(int)
pce_features = ["age_years", "sex", "race_pce", "tc", "hdl", "sbp",
"bp_treated", "smoker", "dm_dx"]
dpce_model = pce_features + ["metsyn"]
dpce = dpce[dpce_model].dropna()
dpce = pd.get_dummies(dpce, columns=["sex", "race_pce"], drop_first=True)
for c in ["bp_treated", "smoker", "dm_dx"]:
dpce[c] = dpce[c].astype(int)
Xp = dpce.drop(columns="metsyn")
yp = dpce["metsyn"]
Xp_tr, Xp_te, yp_tr, yp_te = train_test_split(
Xp, yp, test_size=0.25, stratify=yp, random_state=42
)
pce_logit = LogisticRegression(max_iter=2000).fit(Xp_tr, yp_tr)
p_pce = pce_logit.predict_proba(Xp_te)[:, 1]
auc_pce = roc_auc_score(yp_te, p_pce)
auc_xgb = roc_auc_score(y_te, p_te)
brier_pce = brier_score_loss(yp_te, p_pce)
brier_xgb = brier_score_loss(y_te, p_te)
print(f"AUC, PCE-input logit : {auc_pce:.3f}")
print(f"AUC, demographics+BMI XGBoost : {auc_xgb:.3f}")
print(f"Brier, PCE-input logit : {brier_pce:.3f}")
print(f"Brier, demographics+BMI XGBoost : {brier_xgb:.3f}")
print(f"\nN (PCE-input cohort) : {len(dpce):,}")
print(f"N (XGBoost cohort) : {len(y_te) + len(y_tr):,}")
fpr_pce, tpr_pce, _ = roc_curve(yp_te, p_pce)
fpr_xgb, tpr_xgb, _ = roc_curve(y_te, p_te)
fig, ax = plt.subplots(figsize=(6, 5))
_ = ax.plot(fpr_pce, tpr_pce, color="#746aff", lw=2,
label=f"PCE inputs, logit (AUC = {auc_pce:.3f})")
_ = ax.plot(fpr_xgb, tpr_xgb, color="#020073", lw=2,
label=f"Demographics + BMI, XGBoost (AUC = {auc_xgb:.3f})")
_ = ax.plot([0, 1], [0, 1], ls="--", color="#FF2E00", lw=1)
_ = ax.set(xlabel="False positive rate", ylabel="True positive rate")
_ = ax.legend(frameon=False, loc="lower right")
_ = ax.spines[["top", "right"]].set_visible(False)
plt.tight_layout()
plt.show()
```
## Conclusions
Seven analytic steps support seven findings that each tie back to a diagnostic above. The two new methodological extensions in this revision — pooling cycles for stable estimates and multiple imputation for missingness — both moved the headline number, which is exactly the kind of evidence that distinguishes a portfolio analysis from a publication-grade one.
- **Pooling 2015–2018 cycles is what makes subgroup CIs honest.** The single-cycle estimates we started from had cells (Asian × male) thin enough that the survey CI was wider than the point estimate. Pooling doubled the effective n in those cells — the [`pull-data`](#cell-pull-data) and [`weighted-table`](#cell-weighted-table) chunks together are the methodological floor any publishable NHANES paper sits on.
- **Multiple imputation moved the prevalence by several points.** The [`mi-prevalence`](#cell-mi-prevalence) table contrasts complete-case (fasting-subsample-only) against the MI-pooled estimate over the broader cohort. The MI estimate is the one you'd report in a paper, with the complete-case version included as a sensitivity analysis — not the other way around. Fraction-missing-information stayed under 30% across coefficients in [`mi-svyglm`](#cell-mi-svyglm), so m = 5 is statistically defensible without bumping to m = 20 or 40.
- **The cohort is not the population, until it's weighted.** The unweighted complete-case prevalence in the EDA section diverges from the survey-weighted MI-pooled estimate by several percentage points — that gap is the methodological reason design-aware estimators exist.
- **Missingness is structural, not random.** Lipid panels and fasting glucose are missing at roughly 50% in the raw merge because they only resolve in the morning fasting subsample. The MICE specification respects that: design columns and demographics are auxiliary predictors, only the eight clinical measurements get imputed.
- **Collinearity is small and well-controlled.** BMI/waist (r ≈ 0.88) and SBP/DBP (r ≈ 0.7) are the only collinear pairs; the survey-weighted logit avoids both by predicting from age × sex × race/ethnicity only, and the Python boosted model uses BMI without waist deliberately.
- **PCE inputs beat demographics+BMI on the same outcome — by exactly the margin you'd expect.** The head-to-head in [`fig-pce-vs-xgb-auc`](#fig-pce-vs-xgb-auc) shows the PCE-input logit beating the demographics-only XGBoost on the held-out test set. That's not because logit > XGBoost; it's because PCE inputs include three of the five MetS components (HDL, SBP, glucose-via-diabetes), so they get a partial-leakage head start. The honest framing of the result is *what fraction of the MetS signal sits in PCE-style cardiovascular features versus pure demographics+adiposity* — and the answer is "most of it." The construct-validity check earlier confirms PCE risk runs higher in MetS+ adults at clinically meaningful magnitude.
- **Calibration carries the clinical interpretability.** The reliability curve hugs the diagonal — the model's predicted probabilities are usable as risk numbers, not just rankings. For any screening tool that hands patients an "your estimated risk is X%," calibration beats discrimination, which is why the calibration plot sits next to the ROC and not below it.
## Caveats and what would change the picture
- **Cross-sectional, not incident.** MetS here is a contemporaneous diagnosis from concurrent labs, not 10-year incident ASCVD. The PCE comparator works as a *score-distribution* and *same-outcome* analysis but a true head-to-head needs the NHANES Linked Mortality Files (NHANES-LMF) and a survival model — Cox or competing-risks — on the actual 10-year ASCVD event.
- **No SDoH layer.** Without census-tract income, food access, or PLACES indicators, the model can only see individual-level demographics — almost certainly leaving structural risk on the table, especially for groups underrepresented in the PCE training cohorts (Asian, Hispanic, multi-racial).
- **PCE itself has known equity limitations.** The Goff 2014 specification only supports non-Hispanic White and non-Hispanic Black race assignments; everyone else gets NA. That is a limitation of the score, not of this implementation, but it's the kind of caveat a reviewer will flag — and the right answer is to also report PREVENT (the 2023 AHA replacement that drops the race term) on the same cohort.
- **MICE single-source.** We impute with default predictive mean matching and don't run sensitivity analyses at m = 20 or with alternative imputation methods (random forest, joint modeling). For a methods note, that's the next sensitivity check.
## Next steps
### Publication-grade gaps to close before submission
These are the items a reviewer at *Circulation*, *JAMA*, or *BMJ* would block on. Each is listed as a TRIPOD+AI gap in the appendix; addressing them is a prerequisite for converting this portfolio analysis into a peer-reviewed manuscript.
- **Subgroup SHAP within race/ethnicity strata** — closes the TRIPOD+AI algorithmic-fairness gap (items 14, 18). Currently flagged as **Gap** in the appendix table; the analysis is methodologically simple given the existing SHAP infrastructure but has been deferred.
- **Decision-curve analysis** at clinically-relevant probability thresholds (5%, 10%, 20%), with net-benefit reported alongside AUC. The discrimination + calibration story already in the notebook is *necessary* but not *sufficient* for the clinical-utility claim; DCA is the standard add-on (Vickers & Elkin 2006).
- **k-fold cross-validation on the XGBoost hyperparameters** instead of the current fixed grid — addresses the implicit-tuning concern flagged in the hyperparameter callout above.
- **External validation cohort** — the TRIPOD+AI checklist marks validation as **Partial** because the 25% holdout is from the same NHANES cycles. A true external validation needs a different cohort (e.g., the All of Us biobank, or a later NHANES cycle as a temporal hold-out).
### Future-research extensions
These are larger methodological directions that go beyond closing this analysis and would each be a separate paper.
- **PREVENT score comparator** alongside PCE — the 2023 AHA model is the published replacement and the natural sensitivity check on PCE's race term.
- **NHANES-LMF survival analysis** for the real PCE head-to-head on incident ASCVD over 10-year follow-up.
- **SDoH enrichment** from ACS / CDC PLACES at census-tract level and a re-fit to test whether mis-ranking of socioeconomically vulnerable groups shrinks materially with structural covariates.
- **Splines on age and age × sex interaction** in the design-aware logit — the two most common reviewer comments on cardiometabolic models.
## Appendix — TRIPOD+AI reporting checklist
For ML-based clinical prediction models, major journals (BMJ, JAMA, Annals) now require **TRIPOD+AI** (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis, AI extension; Collins et al. 2024, BMJ). Each item below is a pointer to where in this notebook the requirement is met or, where not met, the explicit gap.
| # | TRIPOD+AI item | Where addressed | Status |
|---|---|---|---|
| 1 | Title identifies model purpose, target population, outcome | Document title and subtitle | ✓ |
| 2 | Structured summary of design, predictors, outcomes, performance | "Why this analysis" + Conclusions | ✓ |
| 3 | Background / scientific rationale | "Why this analysis" | ✓ |
| 4 | Specific objectives, including model purpose (diagnosis vs. prognosis) | "Why this analysis" — diagnostic, cross-sectional | ✓ |
| 5 | Source of data, dates, setting | NHANES 2015–2018 (cycles I + J), described in `pull-data` | ✓ |
| 6 | Eligibility criteria, treatments | Adult ≥ 20 years, complete-case for primary; MI for sensitivity | ✓ |
| 7 | Outcome definition, with sensitivity to alternative definitions | `define-outcome` chunk + `mets-sensitivity` (NCEP/IDF/JIS) | ✓ |
| 8 | Predictors with justification | `define-outcome` + `collinearity-corr` + `collinearity-vif` | ✓ |
| 9 | Sample size justification | Cohort sketch in EDA section; pooling rationale in `pull-data` | Partial — no formal power calc |
| 10 | Missing data handling | `eda-missingness`, `eda-selection-bias`, `mi-impute`, `mi-prevalence` | ✓ |
| 11 | Statistical methods (model type, regularization, hyperparameters) | `svyglm`, `mi-svyglm`, `handoff` (XGBoost: 400 trees, depth 4, lr 0.05) | ✓ |
| 12 | Validation / resampling strategy | 75/25 stratified split with `random_state=42` | Partial — no external validation |
| 13 | Performance measures (discrimination + calibration) | `fig-roc`, `fig-calibration`, AUC + Brier reported | ✓ |
| 14 | Subgroup performance | Subgroup SHAP listed in Next Steps | **Gap** |
| 15 | Handling of class imbalance / fairness considerations | PCE race-term limitation called out in Caveats | Partial |
| 16 | Interpretability methods | `fig-shap` global SHAP | ✓ — global only |
| 17 | Comparison with existing prediction models | `pce-construct-validity`, `fig-pce-vs-xgb-auc` | ✓ |
| 18 | Algorithmic fairness assessment | Caveats note PCE race limitation; subgroup SHAP in Next Steps | **Gap** |
| 19 | Multiplicity / multiple-comparisons correction | Single-outcome model; no FDR needed for the primary AUC | N/A |
| 20 | Deployment and clinical-implementation considerations | Calibration emphasized over AUC for clinical use | ✓ |
| 21 | Limitations | Caveats section — cross-sectional, MI single-source, PCE race | ✓ |
| 22 | Funding | Not applicable (portfolio analysis) | N/A |
| 23 | Code availability | renv.lock + qmd source on GitHub | ✓ |
| 24 | Data availability | NHANES is public; specific files listed in `pull-data` | ✓ |
| 25 | Conflicts of interest | None declared | ✓ |
The two open gaps (subgroup performance / algorithmic-fairness assessment) are exactly the items called out in the Next Steps section. A publishable version would close both before submission.
## Methods primer
Every statistical test or estimator used above, with a one-line definition, what it tests, when you'd use it, the key assumption, and what to do if that assumption fails. The **Reference** column points to the canonical primary source — full citations are in the [References](#references) section that follows.
| Method | What it does | Why used here | Key assumption | If assumption fails | Reference |
|---|---|---|---|---|---|
| **Skewness, excess kurtosis** (`e1071`) | Two numbers describing the shape of a distribution. Skew measures asymmetry; excess kurtosis measures tail-heaviness vs a normal. | Decide whether a continuous variable is symmetric enough for normal-theory methods, or right-skewed enough to need transformation. | None — descriptive. | N/A | Joanes & Gill 1998 |
| **Distribution fit by AIC** (`fitdistrplus`) | Fits candidate distributions by maximum likelihood; AIC = 2k − 2logL ranks them. Lower AIC = better fit penalizing complexity. | Pick the right regression family / link function for any variable used as outcome (e.g., triglycerides → log-normal). | Maximum likelihood requires correctly-specified support. | Plot empirical vs theoretical CDF and pick visually. | Akaike 1974; Delignette-Muller & Dutang 2015 |
| **Welch's t-test** | Tests whether two group means differ, allowing unequal variances. Reports a t statistic, Satterthwaite-adjusted df, and a p-value. | Compare a continuous variable (age, BMI) between two groups (included vs excluded). | Roughly normal distribution within each group, or n large enough for the CLT. | Use Mann-Whitney U / Wilcoxon rank-sum. | Welch 1947 |
| **Pearson chi-squared** | Tests independence between two categorical variables by comparing observed vs expected cell counts under independence. | Compare a categorical variable's distribution between two groups (sex × included; race × included). | Expected cell counts ≥ 5 in most cells. | Use Fisher's exact test. | Pearson 1900 |
| **Mann-Whitney U / Wilcoxon rank-sum** | Non-parametric test that the distribution of values in group A is shifted relative to group B. Operates on ranks. | Compare PCE risk between MetS+ and MetS-; compare beneficiary counts between groups. | Independent observations; no distributional assumption. | Permutation test if independence violated. | Wilcoxon 1945; Mann & Whitney 1947 |
| **Pearson correlation** | Measures the strength of *linear* association between two continuous variables; r ∈ [−1, +1]. | Detect collinear predictors (e.g., BMI vs waist) before regression — convention is \|r\| ≥ 0.7 means don't use both. | Roughly linear relationship; no extreme outliers. | Use Spearman rank correlation (monotonic but non-linear OK). | Pearson 1896 |
| **Variance Inflation Factor (VIF)** | For each predictor, regress it on all others; VIF = 1 / (1 − R²). Tells you how much the SE of that coefficient is inflated by collinearity. | Quantify collinearity in the fitted regression specification. | Linear regression for the auxiliary fit. | If high VIF, drop the redundant predictor or use ridge/lasso. | Marquardt 1970; Belsley, Kuh & Welsch 1980 |
| **Random-intercept (mixed) model** (`lme4::glmer`) | A regression that adds random intercepts at each clustering level. Captures the fact that observations within a cluster are more similar than across clusters. | Quantify the share of variance at each NHANES sampling level — the ICC. | Random-effect distribution is approximately normal. | Use a marginal / GEE model. | Bates et al. 2015 |
| **Intraclass Correlation Coefficient (ICC)** | The fraction of total outcome variance lying *between* clusters. ICC = 0 means clustering doesn't matter; closer to 1 means within-cluster observations are very similar. | Justify why design-aware estimators matter — when ICC at PSU/stratum is non-trivial, ignoring it inflates Type-I error. | For binary outcomes the residual is fixed at π²/3 on the latent (logit) scale. | N/A — descriptive variance partition. | Snijders & Bosker 2012 |
| **Survey-weighted mean / regression** (`survey::svymean`, `survey::svyglm`) | Same as the unweighted versions, but each row contributes proportionally to its sampling weight; SEs use Taylor linearization to account for strata + PSU + weights. | Estimate population prevalence and odds ratios that generalize to the US adult population. | Sampling weights and design variables are correctly specified. | Use a Bayesian model with informative priors, or post-stratification. | Lumley 2004, 2010 |
| **Logistic regression** | Models the log-odds of a binary outcome as a linear combination of predictors. Coefficients exponentiated are odds ratios. | MetS is binary; linear regression gives nonsensical predictions outside [0, 1]. | Log-odds are linear in the predictors. | Use splines on continuous predictors, or a non-linear classifier. | Cox 1958; Hosmer, Lemeshow & Sturdivant 2013 |
| **MICE — Multiple Imputation by Chained Equations** (`mice`) | Generates m plausible completions of the dataset by imputing each missing variable conditional on the others. Predictive mean matching is default for continuous variables. | Recover the rows lost to fasting-subsample dropout without biasing toward the morning subsample. | Missing-at-random (MAR). | Use a more flexible imputation method or a Bayesian missingness model. | Little 1988; van Buuren & Groothuis-Oudshoorn 2011 |
| **Rubin's rules pooling** (`mitools::MIcombine`) | Combines analysis results across imputations: pooled estimate = average; pooled variance = within + (1 + 1/m) × between. | Get one combined coefficient + honest CI from the m parallel svyglm fits. | Analysis estimator is approximately normally distributed within each imputation. | Use bootstrap-based pooling. | Rubin 1987 |
| **Gradient-boosted trees** (XGBoost) | A non-linear classifier built by sequentially adding decision trees that correct the errors of the previous ones. No assumption of linearity. | A flexible benchmark for what demographics + BMI alone can predict; complements the design-aware logit. | Enough training data to avoid overfitting; representative train/test split. | Use a simpler model with engineered features and cross-validation. | Chen & Guestrin 2016 |
| **AUC (ROC-AUC)** | The probability that a randomly chosen positive is ranked higher than a randomly chosen negative. Pure ranking metric. | Quantify discrimination — how well the model separates MetS+ from MetS-. | Ranking interpretation is invariant to rescaling of probabilities. | Use precision-recall AUC for very imbalanced outcomes. | Hanley & McNeil 1982 |
| **Brier score** | Mean squared error between predicted probability and the binary outcome. Penalizes both bad ranking and miscalibration. | A single-number summary that captures both calibration and discrimination. | Probabilities are interpretable (not just ranks). | If only ranks matter, use AUC. | Brier 1950 |
| **Calibration curve** | Bins predictions, plots observed frequency vs predicted probability per bin. Perfectly calibrated = on the diagonal. | Test whether predicted probabilities are usable as risk numbers, not just rankings. | Enough cases per bin (n ≥ 30) for stable observed frequencies. | Use Brier or Expected Calibration Error. | Steyerberg et al. 2010 |
| **SHAP (SHapley Additive exPlanations)** | Per-row attribution that decomposes the model's prediction into contributions from each feature, derived from cooperative game theory. | Explain *why* the model assigns a given risk; surfaces the features doing the work. | Tree-based version is exact for tree models; otherwise approximate. | Use partial dependence + permutation importance. | Lundberg & Lee 2017 |
| **Pooled Cohort Equations (PCE)** | The 2013 ACC/AHA 10-year ASCVD risk score with race × sex specific coefficients. | Provide an established clinical comparator instead of evaluating the boosted model in a vacuum. | Inputs within supported ranges (age 40–79, NH White or Black). | Use the 2023 PREVENT score (drops the race term). | Goff et al. 2014; Khan et al. 2024 (PREVENT) |
| **TRIPOD+AI reporting checklist** | 25-item reporting standard for clinical-prediction-model papers, AI extension of the original TRIPOD 2015. | Self-audit for completeness before submission; structures the appendix. | N/A — reporting framework. | N/A | Collins et al. 2024 |
## Caveats and limitations primer
- **Cross-sectional ≠ incident.** All findings here describe *current* MetS prevalence and *current* risk. Nothing in this notebook predicts future cardiovascular events.
- **Single-source MI.** Five imputations with one method (PMM); a published version would run sensitivity at m = 20 and with a second method (random forest, joint modeling).
- **Internal validation only.** The 25% holdout is from the same NHANES cycles — not a different time period or a different cohort.
## References {#references}
Statistical methods (chronological by topic):
- **Pearson, K.** (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. *Philosophical Transactions of the Royal Society A*, 187, 253–318.
- **Pearson, K.** (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. *Philosophical Magazine*, 50(302), 157–175.
- **Brier, G. W.** (1950). Verification of forecasts expressed in terms of probability. *Monthly Weather Review*, 78(1), 1–3.
- **Wilcoxon, F.** (1945). Individual comparisons by ranking methods. *Biometrics Bulletin*, 1(6), 80–83.
- **Mann, H. B., & Whitney, D. R.** (1947). On a test of whether one of two random variables is stochastically larger than the other. *Annals of Mathematical Statistics*, 18(1), 50–60.
- **Welch, B. L.** (1947). The generalization of "Student's" problem when several different population variances are involved. *Biometrika*, 34(1/2), 28–35.
- **Cox, D. R.** (1958). The regression analysis of binary sequences. *Journal of the Royal Statistical Society, Series B*, 20(2), 215–242.
- **Marquardt, D. W.** (1970). Generalized inverses, ridge regression, biased linear estimation, and nonlinear estimation. *Technometrics*, 12(3), 591–612.
- **Akaike, H.** (1974). A new look at the statistical model identification. *IEEE Transactions on Automatic Control*, 19(6), 716–723.
- **Hanley, J. A., & McNeil, B. J.** (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. *Radiology*, 143(1), 29–36.
- **Belsley, D. A., Kuh, E., & Welsch, R. E.** (1980). *Regression Diagnostics: Identifying Influential Data and Sources of Collinearity.* Wiley.
- **Rubin, D. B.** (1987). *Multiple Imputation for Nonresponse in Surveys.* Wiley.
- **Little, R. J. A.** (1988). Missing-data adjustments in large surveys. *Journal of Business and Economic Statistics*, 6(3), 287–296.
- **Joanes, D. N., & Gill, C. A.** (1998). Comparing measures of sample skewness and kurtosis. *Journal of the Royal Statistical Society, Series D (The Statistician)*, 47(1), 183–189.
- **Lumley, T.** (2004). Analysis of complex survey samples. *Journal of Statistical Software*, 9(1), 1–19.
- **Lumley, T.** (2010). *Complex Surveys: A Guide to Analysis Using R.* Wiley.
- **Steyerberg, E. W., Vickers, A. J., Cook, N. R., et al.** (2010). Assessing the performance of prediction models: a framework for traditional and novel measures. *Epidemiology*, 21(1), 128–138.
- **Snijders, T. A. B., & Bosker, R. J.** (2012). *Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling* (2nd ed.). Sage.
- **Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X.** (2013). *Applied Logistic Regression* (3rd ed.). Wiley.
- **van Buuren, S., & Groothuis-Oudshoorn, K.** (2011). mice: Multivariate Imputation by Chained Equations in R. *Journal of Statistical Software*, 45(3), 1–67.
- **Delignette-Muller, M. L., & Dutang, C.** (2015). fitdistrplus: An R package for fitting distributions. *Journal of Statistical Software*, 64(4), 1–34.
- **Bates, D., Mächler, M., Bolker, B., & Walker, S.** (2015). Fitting linear mixed-effects models using lme4. *Journal of Statistical Software*, 67(1), 1–48.
- **Chen, T., & Guestrin, C.** (2016). XGBoost: A scalable tree boosting system. In *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining* (pp. 785–794).
- **Lundberg, S. M., & Lee, S.-I.** (2017). A unified approach to interpreting model predictions. In *Advances in Neural Information Processing Systems 30 (NeurIPS 2017)* (pp. 4765–4774).
Clinical guidelines and reporting standards:
- **Goff, D. C. Jr., Lloyd-Jones, D. M., Bennett, G., et al.** (2014). 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. *Circulation*, 129(25 Suppl 2), S49–S73.
- **Khan, S. S., Matsushita, K., Sang, Y., et al.** (2024). Development and validation of the American Heart Association's PREVENT equations. *Circulation*, 149(6), 430–449.
- **Collins, G. S., Moons, K. G. M., Dhiman, P., et al.** (2024). TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. *BMJ*, 385, e078378.
Data sources:
- **National Center for Health Statistics.** National Health and Nutrition Examination Survey 2015–2016 (cycle I) and 2017–2018 (cycle J). U.S. Centers for Disease Control and Prevention. <https://wwwn.cdc.gov/nchs/nhanes/>
---
::: {.callout-tip}
### Reading this as code samples
Every cell above is the real executed code — not screenshots. Use the **Code** dropdown at the top-right to toggle visibility, copy individual blocks, or view the [raw `.qmd` source](https://github.com/paulinedmdf/paulinadelmundomd/blob/main/projects/01-cardiometabolic-risk/analysis.qmd) on GitHub.
:::