Cardiometabolic Risk in NHANES 2017

1 Why this analysis

Cardiometabolic disease remains the leading driver of mortality in US adults. Yet the clinical tools that estimate risk — Framingham, ASCVD, PCE — were trained on cohorts that poorly represent the population-level distribution of social and biological risk factors captured by NHANES.

This notebook does three things a physician-scientist portfolio should demonstrate end-to-end:

Load and harmonize the NHANES 2017–2018 cycle (demographics, examination, labs, questionnaire) using the CDC’s public XPT files.
Describe the population correctly — applying the NHANES complex survey design so prevalence estimates generalize to the US civilian non-institutionalized population.
Model the probability of metabolic syndrome using both a design-aware logistic regression (in R) and a gradient-boosted classifier with SHAP explanations (in Python), then compare what each approach reveals.

The two language blocks render side-by-side thanks to Quarto’s polyglot mode via reticulate — a small but real demonstration that methodology choices should follow the scientific question, not the toolchain.

Note

Source and helpers live in R/nhanes_helpers.R and python/nhanes_helpers.py. The .qmd source for this page is on GitHub.

2 Setup

Show code

source(here::here("R", "nhanes_helpers.R"))
library(gtsummary)
library(ggplot2)
library(purrr)
library(reticulate)

# Method packages — installed on first render if absent.
for (pkg in c("mice", "mitools", "e1071", "fitdistrplus",
              "performance", "lme4", "kableExtra")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
}

# Lightweight wrapper for hero tables — bootstrap styling, no monospaced
# greyscale dump. Plain knitr::kable() still works elsewhere for inline /
# diagnostic tables where the styled version is overkill.
pretty_kable <- function(x, caption = NULL, ...) {
  knitr::kable(x, caption = caption, ...) |>
    kableExtra::kable_styling(
      bootstrap_options = c("striped", "hover", "condensed", "responsive"),
      full_width        = FALSE,
      position          = "left",
      font_size         = 13
    )
}

reticulate::py_require(c(
  "pandas", "numpy", "scipy", "scikit-learn",
  "xgboost", "shap", "matplotlib"
))
theme_set(theme_minimal(base_size = 12))

Show code

import sys, pathlib
sys.path.append(str(pathlib.Path.cwd().parents[1] / "python"))

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, brier_score_loss
from sklearn.calibration import calibration_curve
import xgboost as xgb
import shap

plt.rcParams.update({"figure.dpi": 110})

3 Pulling and pooling the NHANES cycles

A single cycle is too small for stable subgroup estimates. We pool NHANES 2015–2016 (cycle I) and 2017–2018 (cycle J) per the CDC analytic guidelines: each cycle’s MEC weight is divided by the number of cycles, strata and PSU pass through unchanged, and SEQNs don’t collide. The combined-cycle weight is WTMEC4YR = WTMEC2YR / 2.

We also pull three additional tables (TCHOL, SMQ, BPQ) needed for the Pooled Cohort Equations comparator downstream — total cholesterol, current-smoker status, and antihypertensive treatment.

Show code

build_spec <- function(suffix) {
  list(
    DEMO  = paste0("DEMO_",   suffix),
    BMX   = paste0("BMX_",    suffix),
    BPX   = paste0("BPX_",    suffix),
    GLU   = paste0("GLU_",    suffix),
    HDL   = paste0("HDL_",    suffix),
    TRIG  = paste0("TRIGLY_", suffix),
    DIQ   = paste0("DIQ_",    suffix),
    TCHOL = paste0("TCHOL_",  suffix),
    SMQ   = paste0("SMQ_",    suffix),
    BPQ   = paste0("BPQ_",    suffix)
  )
}

cycle_specs <- list(
  "2015-2016" = build_spec("I"),
  "2017-2018" = build_spec("J")
)

nhanes_raw <- pool_nhanes_cycles(cycle_specs)
nhanes_raw <- recode_demographics(nhanes_raw)

cat(sprintf("Pooled rows: %s\n", format(nrow(nhanes_raw), big.mark = ",")))

Pooled rows: 19,225

Show code

print(table(cycle = nhanes_raw$cycle))

cycle
2015-2016 2017-2018 
     9971      9254

4 Exploratory data analysis — what the cohort actually looks like

Before any modeling decision, four questions need answers: what’s missing, how skewed are the continuous variables, how do the demographic cells distribute, and does the cohort that survives our exclusion criteria still resemble the source population? Skipping these steps is how a survey-weighted estimator ends up reporting a tidy CI on a heavily-truncated subgroup.

Show code

eda_cols <- c("WTMEC2YR", "RIAGENDR", "RIDAGEYR", "RIDRETH3",
              "BMXBMI", "BMXWAIST",
              "BPXSY1", "BPXDI1",
              "LBXTR", "LBDHDD", "LBXGLU", "DIQ010")

miss_long <- nhanes_raw |>
  dplyr::select(dplyr::any_of(eda_cols)) |>
  dplyr::summarise(dplyr::across(dplyr::everything(),
                                 \(x) mean(is.na(x)) * 100)) |>
  tidyr::pivot_longer(dplyr::everything(),
                      names_to = "variable", values_to = "pct_missing")

ggplot(miss_long,
       aes(x = stats::reorder(variable, pct_missing), y = pct_missing)) +
  geom_col(fill = "#020073") +
  geom_text(aes(label = sprintf("%.1f%%", pct_missing)),
            hjust = -0.1, size = 3.2, color = "#333") +
  coord_flip(clip = "off") +
  scale_y_continuous(limits = c(0, max(miss_long$pct_missing) * 1.15),
                     labels = \(x) paste0(x, "%")) +
  labs(x = NULL, y = "% missing in raw NHANES merge") +
  theme(plot.margin = margin(5, 80, 5, 5))

Missingness map across the candidate analytic columns. NHANES MEC weights (WTMEC2YR) are zero or missing for non-examined participants — those are excluded downstream. Lipid panels (LBXTR, LBDHDD) and fasting glucose (LBXGLU) only resolve for the morning fasting subsample, so their NA share roughly tracks that subsample size.

Show code

cont_long <- nhanes_raw |>
  dplyr::transmute(
    `Waist (cm)`              = BMXWAIST,
    `Triglycerides (mg/dL)`   = LBXTR,
    `HDL (mg/dL)`             = LBDHDD,
    `Systolic BP (mm Hg)`     = BPXSY1,
    `Fasting glucose (mg/dL)` = LBXGLU
  ) |>
  tidyr::pivot_longer(dplyr::everything(),
                      names_to = "variable", values_to = "value") |>
  dplyr::filter(!is.na(value))

thresholds <- tibble::tibble(
  variable = c("Waist (cm)", "Triglycerides (mg/dL)",
               "HDL (mg/dL)", "Systolic BP (mm Hg)",
               "Fasting glucose (mg/dL)"),
  cut      = c(95, 150, 45, 130, 100)
)

ggplot(cont_long, aes(value)) +
  geom_histogram(fill = "#020073", color = "white", bins = 40) +
  geom_vline(data = thresholds, aes(xintercept = cut),
             color = "#FF2E00", linetype = "dashed", linewidth = 0.6) +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  labs(x = NULL, y = "Count")

Univariate distributions of the five continuous components used downstream. Triglycerides and fasting glucose are right-skewed with long upper tails — this is exactly the regime where peer-comparison estimators like robust z-scores beat naive means. Vertical lines mark ATP III diagnostic thresholds.

Show code

dat_eda <- nhanes_raw |>
  dplyr::filter(age_years >= 20, !is.na(WTMEC2YR)) |>
  dplyr::filter(if_all(c(BMXWAIST, LBXTR, LBDHDD, BPXSY1, BPXDI1, LBXGLU),
                        \(x) !is.na(x)))

cells <- dat_eda |>
  dplyr::count(sex, race_eth) |>
  dplyr::mutate(race_eth = forcats::fct_reorder(race_eth, n, .desc = TRUE))

ggplot(cells, aes(race_eth, n, fill = sex)) +
  geom_col(position = position_dodge(0.75), width = 0.7) +
  geom_text(aes(label = n),
            position = position_dodge(0.75), vjust = -0.3, size = 3.1) +
  scale_fill_manual(values = c("Male" = "#020073", "Female" = "#FF2E00")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.12))) +
  labs(x = NULL, y = "Adult complete-case count", fill = NULL) +
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 20, hjust = 1))

Sex × race/ethnicity cell counts after the adult (≥ 20) and complete-case filter. Cells under ~30 will produce wide CIs in subgroup estimates — a sanity check before reporting any stratified prevalence.

Show code

target <- dat_eda |>
  dplyr::transmute(
    waist_hit = (sex == "Male"   & BMXWAIST >= 102) |
                (sex == "Female" & BMXWAIST >= 88),
    trig_hit  = LBXTR >= 150,
    hdl_hit   = (sex == "Male"   & LBDHDD < 40) |
                (sex == "Female" & LBDHDD < 50),
    bp_hit    = BPXSY1 >= 130 | BPXDI1 >= 85,
    glu_hit   = LBXGLU >= 100,
    metsyn    = rowSums(cbind(waist_hit, trig_hit, hdl_hit, bp_hit, glu_hit), na.rm = TRUE) >= 3
  )

cat(sprintf(
  "Adult complete-case n = %s\nUnweighted MetS prevalence  = %.1f%%\n",
  format(nrow(target), big.mark = ","),
  100 * mean(target$metsyn, na.rm = TRUE)
))

Adult complete-case n = 4,085
Unweighted MetS prevalence  = 37.7%

The unweighted prevalence above is what a naive analyst would report. The weighted estimate in the next section will land in a different place — that gap is the methodological reason we apply the survey design.

Cohort flow — n surviving each filtering step

A reviewer’s first question on any observational analysis is “where did the n come from?” The pipeline filters the raw NHANES merge in three stages; the table below makes the staged attrition explicit, mirroring the CONSORT / STROBE flow expectation for an observational cohort. Each row’s count is recomputed live from the in-memory objects so it stays consistent with whatever filter set is currently in effect.

Cohort flow diagram in tabular form. Steps 1 and 2 are computed live; steps 3 and 4 are reported by the `define-outcome` chunk below; step 5 by the `mi-impute` chunk further down. Together they document the inclusion / exclusion path from raw merge to the MI-pooled analytic sample.
Step	n	Filter that drops rows
1. Pooled raw merge (cycles 2015-2018)	19,225	Merge across DEMO, BMX, BPX, GLU, HDL, TRIG, DIQ, TCHOL, SMQ, BPQ for both cycles
2. Adult subset (age >= 20 yrs)	19,225	Removes children and adolescents (CDC standard age cutoff for cardiometabolic adult analyses)
3. Pre-complete-case cohort (valid MEC weight, age 20+) – see `define-outcome` chunk for n	11,288	Removes non-examined participants (zero MEC weight)
4. Complete-case cohort (all 5 ATP III components observed) – see `define-outcome` chunk for n	see chunk	Removes rows missing any of waist, trig, HDL, SBP, DBP, fasting glucose – the binding constraint is the morning fasting subsample
5. MI cohort (Rubin’s-rules over m = 5 imputations) – see `mi-impute` for the pre-CC denominator	see chunk	Carries forward design columns + demographics as auxiliary predictors; imputes the 8 clinical measurements

5 Defining metabolic syndrome (ATP III, harmonized)

An adult meets criteria if ≥ 3 of 5 are true:

Component	Threshold
Waist circumference	≥ 102 cm (M) or ≥ 88 cm (F)
Triglycerides	≥ 150 mg/dL
HDL cholesterol	< 40 mg/dL (M) or < 50 mg/dL (F)
Blood pressure	SBP ≥ 130 or DBP ≥ 85, or on BP medication
Fasting glucose	≥ 100 mg/dL, or on diabetes medication

Show code

# Pre-complete-case cohort — keep one copy with NAs intact for MICE downstream,
# and one filtered copy for the design-aware analysis the existing chunks
# already use. The outcome derivation is identical; only the row filter
# differs.
dat_full <- nhanes_raw |>
  filter(age_years >= 20, !is.na(WTMEC4YR), WTMEC4YR > 0) |>
  transmute(
    SEQN, cycle, WTMEC4YR, SDMVPSU, SDMVSTRA,
    sex, age_years, race_eth,
    bmi          = BMXBMI,
    waist        = BMXWAIST,
    trig         = LBXTR,
    hdl          = LBDHDD,
    tc           = LBXTC,
    sbp          = rowMeans(cbind(BPXSY1, BPXSY2, BPXSY3), na.rm = TRUE),
    dbp          = rowMeans(cbind(BPXDI1, BPXDI2, BPXDI3), na.rm = TRUE),
    fpg          = LBXGLU,
    dm_dx        = !is.na(DIQ010) & DIQ010 == "Yes",
    bp_treated   = !is.na(BPQ050A) & BPQ050A == "Yes",
    smoker       = !is.na(SMQ040) & SMQ040 %in% c("Every day", "Some days"),
    waist_hit    = (sex == "Male"   & waist >= 102) |
                   (sex == "Female" & waist >= 88),
    trig_hit     = trig >= 150,
    hdl_hit      = (sex == "Male"   & hdl < 40) |
                   (sex == "Female" & hdl < 50),
    bp_hit       = sbp >= 130 | dbp >= 85,
    glu_hit      = fpg >= 100 | dm_dx,
    mets_n       = rowSums(cbind(waist_hit, trig_hit, hdl_hit, bp_hit, glu_hit),
                           na.rm = TRUE),
    metsyn       = mets_n >= 3
  )

# Coerce sbp/dbp NaN (when all three readings missing) back to NA so MICE
# treats them consistently.
dat_full <- dat_full |>
  mutate(across(c(sbp, dbp), \(x) ifelse(is.nan(x), NA, x)))

dat <- dat_full |>
  filter(if_all(c(waist, trig, hdl, sbp, dbp, fpg), \(x) !is.na(x)))

cat(sprintf(
  "Pre-CC pooled cohort  : %s\nComplete-case cohort  : %s (drop %.1f%%)\n",
  format(nrow(dat_full), big.mark = ","),
  format(nrow(dat),      big.mark = ","),
  100 * (1 - nrow(dat) / nrow(dat_full))
))

Pre-CC pooled cohort  : 10,739
Complete-case cohort  : 4,314 (drop 59.8%)

6 Diagnostics on the analytic cohort

Three publication-grade checks before any modeling: a formal distributional fit for each continuous candidate (skew/kurtosis quantified, candidate distributions ranked by AIC), a selection-bias check comparing the dropped rows to the kept ones, and an ICC variance partition that quantifies how much of the outcome variance lives at each clustering level of the NHANES sample design.

Show code

cont_vars <- list(
  bmi   = dat_full$bmi,   waist = dat_full$waist,
  trig  = dat_full$trig,  hdl   = dat_full$hdl,
  sbp   = dat_full$sbp,   dbp   = dat_full$dbp,
  fpg   = dat_full$fpg,   tc    = dat_full$tc
)

skew_kurt <- vapply(cont_vars, function(x) {
  x <- x[!is.na(x)]
  c(skew = e1071::skewness(x), kurtosis = e1071::kurtosis(x))
}, numeric(2))

knitr::kable(
  data.frame(
    Variable = colnames(skew_kurt),
    n        = vapply(cont_vars, \(x) sum(!is.na(x)), integer(1)),
    Skew     = round(skew_kurt["skew", ], 2),
    `Excess kurtosis` = round(skew_kurt["kurtosis", ], 2),
    Shape    = ifelse(abs(skew_kurt["skew", ]) > 1, "Right-skewed",
                ifelse(abs(skew_kurt["skew", ]) > 0.5, "Mildly skewed", "Roughly symmetric")),
    check.names = FALSE
  ),
  caption = "Skewness and excess kurtosis on the eight continuous candidate variables. |skew| > 1 indicates a heavy tail that materially affects the choice of regression family or transformation. The right-skewed variables get a formal distribution fit in the next table."
)

Skewness and excess kurtosis on the eight continuous candidate variables. |skew| > 1 indicates a heavy tail that materially affects the choice of regression family or transformation. The right-skewed variables get a formal distribution fit in the next table.
	Variable	n	Skew	Excess kurtosis	Shape
bmi	bmi	10581	1.20	2.72	Right-skewed
waist	waist	10053	0.55	0.37	Mildly skewed
trig	trig	4647	9.08	165.45	Right-skewed
hdl	hdl	10094	1.22	3.80	Right-skewed
sbp	sbp	10311	1.03	1.79	Right-skewed
dbp	dbp	10311	-0.49	3.49	Roughly symmetric
fpg	fpg	4887	3.87	19.46	Right-skewed
tc	tc	10094	0.73	2.09	Mildly skewed

Show code

skewed <- names(cont_vars)[abs(skew_kurt["skew", ]) > 1]
fit_results <- list()
for (v in skewed) {
  x <- cont_vars[[v]]
  x <- x[!is.na(x) & x > 0]
  fits <- list(
    normal = try(fitdistrplus::fitdist(x, "norm"),  silent = TRUE),
    lnorm  = try(fitdistrplus::fitdist(x, "lnorm"), silent = TRUE),
    gamma  = try(fitdistrplus::fitdist(x, "gamma"), silent = TRUE)
  )
  aics <- vapply(fits, function(f) {
    if (inherits(f, "try-error") || is.null(f)) NA_real_ else f$aic
  }, numeric(1))
  fit_results[[v]] <- aics
}

if (length(fit_results) > 0) {
  fit_mat <- do.call(rbind, fit_results)
  best    <- colnames(fit_mat)[apply(fit_mat, 1, function(r) which.min(r))]
  knitr::kable(
    data.frame(
      Variable = rownames(fit_mat),
      `AIC normal` = round(fit_mat[, "normal"], 0),
      `AIC log-normal` = round(fit_mat[, "lnorm"], 0),
      `AIC gamma` = round(fit_mat[, "gamma"], 0),
      `Best fit`  = best,
      check.names = FALSE
    ),
    caption = "AIC comparison of three candidate distributions on the right-skewed variables (positive values only). Lowest AIC = best fit. The 'best fit' column is the recommended starting family for any regression that takes one of these as outcome — for triglycerides and fasting glucose the answer is almost always log-normal or gamma rather than normal."
  )
}

AIC comparison of three candidate distributions on the right-skewed variables (positive values only). Lowest AIC = best fit. The ‘best fit’ column is the recommended starting family for any regression that takes one of these as outcome — for triglycerides and fasting glucose the answer is almost always log-normal or gamma rather than normal.
	Variable	AIC normal	AIC log-normal	AIC gamma	Best fit
bmi	bmi	71912	70103	70500	lnorm
trig	trig	56021	50596	51327	lnorm
hdl	hdl	85525	83841	84056	lnorm
sbp	sbp	90275	89124	89427	lnorm
fpg	fpg	49837	46203	47234	lnorm

Going deeper — what skewness and excess kurtosis actually measure

Skewness asks whether a histogram leans to one side — zero is symmetric (bell curve), positive means a long right tail. Kurtosis asks how fat the tails are: whether rare extremes show up more often than they would in a normal distribution. The “excess” prefix means we subtract 3 so a perfect bell curve scores 0; positive values mean heavier-than-normal tails (leptokurtic), negative means lighter (platykurtic). For a sample with mean \(\bar{x}\) and SD \(s\): \[\text{skew} = \tfrac{1}{n} \sum_i \!\left(\tfrac{x_i - \bar{x}}{s}\right)^{\!3}, \qquad \text{excess kurtosis} = \tfrac{1}{n} \sum_i \!\left(\tfrac{x_i - \bar{x}}{s}\right)^{\!4} - 3.\] The cube and fourth power deliberately amplify extremes — a single value 4.5 SDs from the mean contributes \(4.5^3 \approx 91\) to the skewness sum, dominating hundreds of near-mean residuals near zero. That’s why “\(|\text{skew}| > 1\)” is the conventional flag: it survives one or two true outliers but isn’t dismissable as sampling noise. Triglycerides in a heavy-tailed clinical sample typically read skew \(\approx 2{-}3\) — well past the threshold, telling you a log transform is in order before any normal-theory model.

Going deeper — what AIC is actually doing

AIC scores candidate “shapes” the data might follow — normal, log-normal, gamma — by balancing fit against complexity. The formula \[\text{AIC} = 2k - 2 \log L\] rewards models with high likelihood (\(\log L\)) but charges \(2k\) per free parameter, so you can’t always win by adding dials. The constant \(2\) isn’t arbitrary: under regularity conditions, AIC is an asymptotically unbiased estimator of (twice) the Kullback–Leibler divergence between the candidate and the truth, so minimising AIC ≈ minimising expected information loss. The score is meaningless in isolation; only differences matter — \(\Delta\text{AIC} < 2\) is weak, \(4{-}7\) substantial, \(>10\) decisive. BIC is the close cousin that swaps \(2k\) for \(k \log n\), harsher on complexity as \(n\) grows: consistent (picks the true model in the limit) rather than efficient (minimises prediction loss). For triglycerides with \(n \approx 6{,}000\) and three two-parameter candidates, you might see AICs around 64,200 (normal), 62,800 (log-normal), 62,800 (gamma) — gamma and log-normal tie, normal is decisively rejected. Since \(k\) is identical here the penalty cancels and the ranking is pure fit.

Show code

# Tag inclusion status, then formally test whether the included and
# excluded subsamples differ on the demographics we're going to condition
# on downstream. Significant differences quantify the selection-bias
# threat that motivates MICE in a later section.
dat_full2 <- dat_full |>
  dplyr::mutate(included = SEQN %in% dat$SEQN)

n_inc <- sum(dat_full2$included)
n_exc <- sum(!dat_full2$included)

results <- list(
  age_years = stats::t.test(age_years ~ included, data = dat_full2),
  bmi       = stats::t.test(bmi       ~ included, data = dat_full2),
  sex       = stats::chisq.test(table(dat_full2$sex,      dat_full2$included)),
  race_eth  = stats::chisq.test(table(dat_full2$race_eth, dat_full2$included)),
  cycle     = stats::chisq.test(table(dat_full2$cycle,    dat_full2$included))
)

bias_df <- data.frame(
  Variable    = names(results),
  Test        = c("Welch t", "Welch t", "Pearson chi-squared",
                  "Pearson chi-squared", "Pearson chi-squared"),
  Statistic   = vapply(results, \(r) round(as.numeric(r$statistic), 2), numeric(1)),
  df          = vapply(results, \(r) {
    p <- if (!is.null(r$parameter)) as.numeric(r$parameter[1]) else NA_real_
    if (is.na(p)) "" else as.character(round(p, 1))
  }, character(1)),
  `p-value`   = vapply(results, \(r) format.pval(r$p.value, eps = 1e-300, digits = 2),
                       character(1)),
  check.names = FALSE
)

cat(sprintf("Included (complete-case) : %s\nExcluded                : %s\n\n",
            format(n_inc, big.mark = ","),
            format(n_exc, big.mark = ",")))

Included (complete-case) : 4,314
Excluded                : 6,425

Show code

knitr::kable(
  bias_df,
  caption = "Selection-bias diagnostics: do the rows we drop in the complete-case filter differ from the rows we keep, on the demographics that condition the downstream regression? Significant differences (typically on age, since fasting subsamples skew older) quantify the threat that motivates the MICE section below."
)

Selection-bias diagnostics: do the rows we drop in the complete-case filter differ from the rows we keep, on the demographics that condition the downstream regression? Significant differences (typically on age, since fasting subsamples skew older) quantify the threat that motivates the MICE section below.
	Variable	Test	Statistic	df	p-value
age_years	age_years	Welch t	-0.72	9448	0.47
bmi	bmi	Welch t	1.33	9547.8	0.18
sex	sex	Pearson chi-squared	0.87	1	0.35
race_eth	race_eth	Pearson chi-squared	4.67	5	0.46
cycle	cycle	Pearson chi-squared	8.83	1	0.003

Show code

# Partition the variance of MetS at NHANES's complex-survey clustering
# levels — strata, then PSU within stratum. Subject-level variance
# falls out as the residual.
icc_model <- lme4::glmer(
  as.integer(metsyn) ~ 1 + (1 | SDMVSTRA) + (1 | SDMVPSU),
  data    = dat,
  family  = binomial(),
  control = lme4::glmerControl(optimizer = "bobyqa")
)

vc <- as.data.frame(lme4::VarCorr(icc_model))[, c("grp", "vcov")]
# For binomial GLMM with logit link, the residual variance is fixed at
# pi^2 / 3 on the latent scale (Snijders & Bosker convention).
vc <- rbind(vc, data.frame(grp = "Residual (latent)", vcov = pi^2 / 3))
total <- sum(vc$vcov)
vc$ICC_pct <- sprintf("%.2f%%", 100 * vc$vcov / total)
vc$Variance <- round(vc$vcov, 4)

knitr::kable(
  vc[, c("grp", "Variance", "ICC_pct")],
  col.names = c("Clustering level", "Variance", "% of total"),
  caption = "Variance partition for binary MetS via random-intercept logit. Stratum + PSU together carry the design-induced clustering; the residual is the per-subject variance on the latent (logit) scale (pi^2/3 by convention). When the stratum + PSU share is non-trivial, design-aware estimators are doing real inferential work — not a stylistic choice."
)

Variance partition for binary MetS via random-intercept logit. Stratum + PSU together carry the design-induced clustering; the residual is the per-subject variance on the latent (logit) scale (pi^2/3 by convention). When the stratum + PSU share is non-trivial, design-aware estimators are doing real inferential work — not a stylistic choice.
Clustering level	Variance	% of total
SDMVSTRA	0.0480	1.44%
SDMVPSU	0.0000	0.00%
Residual (latent)	3.2899	98.56%

Going deeper — why the latent residual variance is exactly \(\pi^2/3\)

Behind every logistic regression there’s a hidden continuous “propensity” score \(Y^* = X\beta + \varepsilon\) we never observe — we only see whether each person crossed a threshold (yes/no on metabolic syndrome). For logistic regression, \(\varepsilon\) follows the standard logistic distribution, with density \(f(\varepsilon) = e^{-\varepsilon}/(1+e^{-\varepsilon})^2\) and a variance you can compute by integration: \[\text{Var}(\varepsilon) = \int_{-\infty}^{\infty} \varepsilon^2 f(\varepsilon)\, d\varepsilon = \tfrac{\pi^2}{3} \approx 3.29.\] This number isn’t arbitrary — it’s baked into the choice of logit link, the same way picking Fahrenheit vs Celsius bakes in a scale. (A probit model would have residual variance 1 by analogous logic, which is why probit and logit coefficients differ by a factor of \(\sqrt{\pi^2/3} \approx 1.81\).) When glmer partitions variance across NHANES strata and PSUs, it reports those components and we compute the ICC against this fixed denominator: if \(\sigma^2_{\text{stratum}} = 0.10\) and \(\sigma^2_{\text{PSU}} = 0.07\), then ICC\(_{\text{stratum}} = 0.10 / (0.10 + 0.07 + \pi^2/3) \approx 2.9\%\). This structurally-large residual is why design ICCs in survey data nearly always read in single-digit percentages on the latent scale.

7 Weighted prevalence by demographics

In one sentence. Apply the NHANES design weights so prevalence estimates generalize to the US civilian non-institutionalized population — and report the weighted estimate as the headline rather than the unweighted one.

Why a survey-weighted estimator (and not a plain proportion)?

NHANES is not a simple random sample. It deliberately oversamples certain subgroups (older adults, Black, Hispanic, low-income) so the lab work has enough cases per cell to estimate subgroup prevalence precisely. If you compute a plain proportion on the raw rows, you over-count those oversampled groups in the headline number.

The fix is the MEC sample weight (WTMEC2YR) — a number per row that says “this person represents N people in the US population.” A weighted mean reweights every row by WTMEC2YR so the result is what you’d see if NHANES had been a true random sample of the US adult population. The survey::svymean() function also accounts for the sampling strata and PSUs (geographic clusters), so the standard error reflects the design — not the smaller SE you’d get pretending the sample was simple-random.

Use a weighted estimator whenever a sample was drawn with unequal probabilities — NHANES, BRFSS, MEPS, NHIS, NHATS, all the federal surveys. The unweighted version isn’t “simpler”; it’s biased. (Lumley 2004, 2010)

Going deeper — Taylor linearization, nest = TRUE, and what the SE is actually computing

A survey-weighted mean \(\hat{Y} = \sum w_i y_i / \sum w_i\) is a ratio of two random sums, and ratios don’t have a clean closed-form variance. Taylor linearization sidesteps that: pretend the ratio is approximately a straight line at the point we care about, compute the variance of that straight-line approximation, and call it a good enough estimate. Defining the residual \(z_i = (y_i - \hat{Y})/\hat{N}\) where \(\hat{N} = \sum w_i\), the formula sums between-PSU squared deviations within each stratum: \[\widehat{\text{Var}}(\hat{Y}) \approx \sum_h \tfrac{n_h}{n_h - 1} \sum_{c \in h} \!\left(\sum_{i \in c} w_i z_i - \overline{w z}_h\right)^{\!2}.\] Each PSU contributes one term; the standard error is built from between-PSU variation, not from independent observations. This is why design degrees of freedom = (PSUs) − (strata), often far smaller than \(n\) — with 30 strata × 2 PSUs each on \(n = 5{,}000\), df = 30, and a naive simple-random SE would be roughly \(13\times\) too narrow.

nest = TRUE is a critical implementation flag: NHANES reuses small numeric PSU IDs across different strata (PSU “1” in stratum 100 has nothing to do with PSU “1” in stratum 101). nest = TRUE tells svydesign to treat the PSU code as nested within its stratum, not as a global identifier. Forget it and the SEs pool unrelated PSUs and come out badly wrong.

Critically, NHANES oversamples certain subpopulations. A naive proportion is biased; we have to apply the MEC sampling weights with survey/srvyr.

Show code

des <- nhanes_design(dat, weight = "WTMEC4YR")

prev_tbl <- des |>
  group_by(sex, race_eth) |>
  summarize(
    metsyn_pct = survey_mean(metsyn, proportion = TRUE, vartype = "ci") * 100
  )

pretty_kable(
  prev_tbl,
  digits = 1,
  col.names = c("Sex", "Race/Ethnicity", "Prevalence (%)", "Lower 95% CI", "Upper 95% CI"),
  caption = "Weighted metabolic syndrome prevalence, US adults ≥ 20 years (NHANES 2015–2018, pooled)."
)

Weighted metabolic syndrome prevalence, US adults ≥ 20 years (NHANES 2015–2018, pooled).
Sex	Race/Ethnicity	Prevalence (%)	Lower 95% CI	Upper 95% CI
Male	Mexican American	37.7	32.9	42.8
Male	Other Hispanic	36.1	28.0	45.1
Male	Non-Hispanic White	37.3	32.7	42.2
Male	Non-Hispanic Black	22.2	18.2	26.9
Male	Non-Hispanic Asian	25.3	20.3	31.1
Male	Other Race - Including Multi-Racial	37.6	24.1	53.4
Female	Mexican American	44.7	38.5	51.2
Female	Other Hispanic	34.2	28.4	40.5
Female	Non-Hispanic White	34.5	30.6	38.6
Female	Non-Hispanic Black	35.1	31.8	38.6
Female	Non-Hispanic Asian	27.3	21.4	34.0
Female	Other Race - Including Multi-Racial	43.5	30.7	57.2

Show code

ggplot(prev_tbl, aes(x = reorder(race_eth, metsyn_pct), y = metsyn_pct, fill = sex)) +
  geom_col(position = position_dodge(0.75), width = 0.7) +
  geom_errorbar(
    aes(ymin = metsyn_pct_low, ymax = metsyn_pct_upp),
    position = position_dodge(0.75), width = 0.2
  ) +
  coord_flip() +
  scale_fill_manual(values = c("Male" = "#020073", "Female" = "#FF2E00")) +
  labs(x = NULL, y = "Prevalence (%)", fill = NULL) +
  theme(legend.position = "top")

Weighted metabolic syndrome prevalence with 95% CIs.

7.1 Sensitivity to the metabolic-syndrome definition

The headline prevalence depends on which diagnostic rule we apply. Three sets of criteria are in active use in the literature: NCEP ATP III (the original 2001 definition, which is what the rest of this notebook uses), IDF 2005 (waist-circumference is mandatory rather than one of five), and Harmonized / JIS 2009 (waist thresholds vary by ethnicity, the rest match ATP III). Reporting all three side-by-side closes the “your headline depends on the threshold” reviewer comment.

Show code

# Three definitions, each producing a binary metsyn flag on dat. We
# reuse the existing component thresholds for ATP III, then express
# IDF and JIS as modifications of the same components. Population:
# adult complete-case cohort (consistent with the rest of the section).
mets_atp3 <- dat |>
  dplyr::transmute(
    SEQN, WTMEC4YR, SDMVPSU, SDMVSTRA,
    waist_hit = (sex == "Male" & waist >= 102) | (sex == "Female" & waist >= 88),
    trig_hit  = trig >= 150,
    hdl_hit   = (sex == "Male" & hdl < 40) | (sex == "Female" & hdl < 50),
    bp_hit    = sbp >= 130 | dbp >= 85 | bp_treated,
    glu_hit   = fpg >= 100 | dm_dx,
    metsyn    = as.integer(rowSums(cbind(waist_hit, trig_hit, hdl_hit, bp_hit, glu_hit),
                                    na.rm = TRUE) >= 3)
  )

# IDF 2005: central obesity is mandatory (waist >= 94/M, 80/F for
# Europid; we use the more conservative ATP-III thresholds here as a
# sensitivity floor) PLUS any 2 of the remaining 4.
mets_idf <- dat |>
  dplyr::transmute(
    SEQN, WTMEC4YR, SDMVPSU, SDMVSTRA,
    central   = (sex == "Male" & waist >= 94) | (sex == "Female" & waist >= 80),
    trig_hit  = trig >= 150,
    hdl_hit   = (sex == "Male" & hdl < 40) | (sex == "Female" & hdl < 50),
    bp_hit    = sbp >= 130 | dbp >= 85 | bp_treated,
    glu_hit   = fpg >= 100 | dm_dx,
    metsyn    = as.integer(central &
                            rowSums(cbind(trig_hit, hdl_hit, bp_hit, glu_hit),
                                    na.rm = TRUE) >= 2)
  )

# JIS 2009: same as ATP III on every component except waist threshold,
# which is ethnicity-specific. We use the conservative ATP III values
# for Non-Hispanic White/Black and IDF Asian-specific (90/M, 80/F) for
# the Asian/Mexican-American/Other groups present in NHANES.
mets_jis <- dat |>
  dplyr::mutate(
    waist_cut_m = dplyr::if_else(grepl("Asian|Mexican|Other", as.character(race_eth)),
                                  90, 102),
    waist_cut_f = dplyr::if_else(grepl("Asian|Mexican|Other", as.character(race_eth)),
                                  80, 88)
  ) |>
  dplyr::transmute(
    SEQN, WTMEC4YR, SDMVPSU, SDMVSTRA,
    waist_hit = (sex == "Male"   & waist >= waist_cut_m) |
                (sex == "Female" & waist >= waist_cut_f),
    trig_hit  = trig >= 150,
    hdl_hit   = (sex == "Male" & hdl < 40) | (sex == "Female" & hdl < 50),
    bp_hit    = sbp >= 130 | dbp >= 85 | bp_treated,
    glu_hit   = fpg >= 100 | dm_dx,
    metsyn    = as.integer(rowSums(cbind(waist_hit, trig_hit, hdl_hit, bp_hit, glu_hit),
                                    na.rm = TRUE) >= 3)
  )

prev_for <- function(d, label) {
  des_x <- survey::svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
                              weights = ~WTMEC4YR, data = d, nest = TRUE)
  m  <- survey::svymean(~metsyn, des_x)
  ci <- stats::confint(m)
  data.frame(
    Definition = label,
    `Prevalence` = sprintf("%.1f%%", 100 * as.numeric(coef(m))),
    `95% CI`     = sprintf("%.1f%% – %.1f%%", 100 * ci[1, 1], 100 * ci[1, 2]),
    n            = nrow(d),
    check.names = FALSE
  )
}

pretty_kable(
  rbind(
    prev_for(mets_atp3, "NCEP ATP III (primary)"),
    prev_for(mets_idf,  "IDF 2005 (waist-mandatory)"),
    prev_for(mets_jis,  "JIS 2009 (ethnicity-specific waist)")
  ),
  caption = "Survey-weighted metabolic-syndrome prevalence under three published definitions. The primary analysis uses NCEP ATP III; IDF 2005 is more conservative because central obesity is mandatory; JIS 2009 uses lower waist cutoffs for Asian and Mexican-American groups, which raises prevalence in those subpopulations and shifts the headline upward. The spread across the three is the bound a reviewer would ask the headline number to sit inside."
)

Survey-weighted metabolic-syndrome prevalence under three published definitions. The primary analysis uses NCEP ATP III; IDF 2005 is more conservative because central obesity is mandatory; JIS 2009 uses lower waist cutoffs for Asian and Mexican-American groups, which raises prevalence in those subpopulations and shifts the headline upward. The spread across the three is the bound a reviewer would ask the headline number to sit inside.
Definition	Prevalence	95% CI	n
NCEP ATP III (primary)	40.0%	37.6% – 42.4%	4314
IDF 2005 (waist-mandatory)	43.6%	41.2% – 46.0%	4314
JIS 2009 (ethnicity-specific waist)	42.0%	39.5% – 44.4%	4314

8 Collinearity diagnostics

Before running any regression we need to know which candidate predictors are too tightly correlated to coexist in the same specification. Two relationships dominate cardiometabolic data: BMI vs. waist circumference (essentially the same construct measured two ways) and systolic vs. diastolic blood pressure (correlated through the underlying mean arterial pressure). A model that includes both members of either pair will produce inflated standard errors and unstable coefficients.

Show code

cont_pred <- dat |>
  dplyr::select(age_years, bmi, waist, sbp, dbp, fpg, hdl, trig)

corr <- stats::cor(cont_pred, use = "pairwise.complete.obs", method = "pearson")
corr_long <- as.data.frame(as.table(corr)) |>
  dplyr::rename(x = Var1, y = Var2, r = Freq) |>
  dplyr::mutate(
    flag  = abs(r) >= 0.7 & x != y,
    label = sprintf("%.2f", r)
  )

ggplot(corr_long, aes(x, y, fill = r)) +
  geom_tile(color = "white") +
  geom_text(aes(label = label,
                color = abs(r) > 0.5),
            size = 3.2, fontface = "bold") +
  geom_tile(data = subset(corr_long, flag),
            fill = NA, color = "#FF2E00", linewidth = 1.2) +
  scale_fill_gradient2(low = "#FF2E00", mid = "#F6F9F7", high = "#020073",
                       midpoint = 0, limits = c(-1, 1),
                       name = "Pearson r") +
  scale_color_manual(values = c(`TRUE` = "white", `FALSE` = "#333"),
                     guide = "none") +
  coord_fixed() +
  labs(x = NULL, y = NULL) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Pairwise Pearson correlations among continuous candidate predictors. Cells with |r| ≥ 0.7 are flagged in orange — the conventional threshold above which two variables shouldn’t enter the same regression. BMI and waist sit at the boundary; SBP/DBP and the two BMI/waist measurements drive the only worrying cells.

Show code

# Compute per-dummy VIFs from the design matrix of the exact specification
# we're about to fit. Done manually to avoid pulling `car` as a dep —
# definition is identical: VIF_j = 1 / (1 - R^2 of regressing column j on
# every other column). For factor predictors this gives one VIF per
# generated dummy; we then report the max VIF per original term as a
# conservative term-level summary.
mm <- stats::model.matrix(~ age_years + sex + race_eth, data = dat)[, -1, drop = FALSE]

vif_dummies <- vapply(seq_len(ncol(mm)), function(j) {
  y      <- mm[, j]
  x      <- cbind(1, mm[, -j, drop = FALSE])
  fit    <- stats::lm.fit(x = x, y = y)
  ss_res <- sum(fit$residuals^2)
  ss_tot <- sum((y - mean(y))^2)
  r2     <- 1 - ss_res / max(ss_tot, .Machine$double.eps)
  1 / max(1 - r2, .Machine$double.eps)
}, numeric(1))
names(vif_dummies) <- colnames(mm)

# Map each generated dummy back to its source term.
term_of <- function(col) {
  if (col == "age_years") return("age_years")
  if (startsWith(col, "sex"))      return("sex")
  if (startsWith(col, "race_eth")) return("race_eth")
  col
}
vif_term <- tapply(vif_dummies, vapply(names(vif_dummies), term_of, character(1)), max)

knitr::kable(
  data.frame(
    Predictor   = names(vif_term),
    `Max VIF`   = round(as.numeric(vif_term), 2),
    Threshold   = "< 5 clean, > 10 problematic",
    check.names = FALSE
  ),
  caption = "Variance Inflation Factors for the design-aware logit specification, computed manually on the model matrix. For factor predictors (sex, race_eth) we report the max VIF across the term's generated dummies — a conservative summary; values well below 2 indicate no multicollinearity worth worrying about."
)

Variance Inflation Factors for the design-aware logit specification, computed manually on the model matrix. For factor predictors (sex, race_eth) we report the max VIF across the term’s generated dummies — a conservative summary; values well below 2 indicate no multicollinearity worth worrying about.
Predictor	Max VIF	Threshold
age_years	1.02	< 5 clean, > 10 problematic
race_eth	2.14	< 5 clean, > 10 problematic
sex	1.00	< 5 clean, > 10 problematic

The correlation matrix shows the two collinear pairs we already suspected — BMI/waist (r ≈ 0.88) and SBP/DBP (r ≈ 0.7). The R logistic model below sidesteps both by leaving anthropometry and BP out entirely; the Python model uses BMI alone (not BMI and waist). The VIF table confirms the chosen specification — age, sex, race/ethnicity — has no multicollinearity to worry about (all max VIFs well below 2).

9 Design-aware logistic regression (R)

Using svyglm so standard errors reflect NHANES’s clustered, stratified design. Coefficients are log-odds ratios for metabolic syndrome.

Why logistic regression?

The outcome (metabolic syndrome) is binary — either the person meets ≥3 of 5 criteria or they don’t. Linear regression on a binary outcome will happily predict negative or > 1 probabilities and gives biased SEs.

Logistic regression models the log-odds of the outcome as a linear combination of the predictors. The coefficients (after exponentiating) are odds ratios — “for each one-year increase in age, the odds of MetS multiply by X.” Odds ratios are not risk ratios, but for outcomes that are not very common (< ~10%) they’re a close approximation; for ~33% prevalence, you need to interpret them as odds ratios specifically.

svyglm() is the survey-aware version: same likelihood, same odds-ratio interpretation, but the standard errors and p-values use Taylor linearization so they correctly account for the strata + PSU + weights. For a complex-survey dataset, you’d never use plain glm() for the headline result — the SEs would be wrong. (Cox 1958; Hosmer, Lemeshow & Sturdivant 2013; Lumley 2010)

Going deeper — why family = quasibinomial() in svyglm

Binomial logistic regression assumes the data behave like a perfect coin-flip experiment: conditional on \(X\), the variance of \(Y\) is \(p(1-p)\), fully determined by the mean. Real survey data rarely cooperate — clustering, design weights, and unmodelled heterogeneity all let the variance run higher than the binomial predicts (overdispersion). Quasi-likelihood keeps the binomial mean model but lets the variance scale by a free dispersion parameter \(\phi\) estimated from the residuals: \[\text{Var}(Y_i \mid X_i) = \phi \cdot p_i(1 - p_i), \qquad \hat\phi = \tfrac{1}{n - p} \sum_i \tfrac{(y_i - \hat p_i)^2}{\hat p_i (1 - \hat p_i)}.\] Coefficient point estimates are identical to binomial MLEs (the score equations don’t depend on \(\phi\)); what changes is the variance, which scales by \(\hat\phi\) — so standard errors get multiplied by \(\sqrt{\hat\phi}\). The survey package effectively requires this because a weighted likelihood isn’t a proper Bernoulli likelihood; family = binomial() triggers an R warning (“non-integer #successes”), while quasibinomial silences it and uses the empirical residual variance instead of the nominal \(p(1-p)\). With mild overdispersion (\(\hat\phi = 1.4\) is typical for NHANES), every SE inflates by ~18%, widening 95% CIs and shifting borderline p-values toward null. Ignoring it would publish narrower CIs than the design supports.

Show code

dat_fit <- dat |>
  mutate(
    metsyn   = as.integer(metsyn),
    sex      = droplevels(factor(sex)),
    race_eth = droplevels(factor(race_eth))
  )

des_std <- survey::svydesign(
  ids     = ~SDMVPSU,
  strata  = ~SDMVSTRA,
  weights = ~WTMEC4YR,
  data    = dat_fit,
  nest    = TRUE
)

fit <- survey::svyglm(
  metsyn ~ age_years + sex + race_eth,
  design = des_std,
  family = quasibinomial()
)

fit |>
  gtsummary::tbl_regression(exponentiate = TRUE) |>
  gtsummary::modify_caption("**Design-aware odds ratios for metabolic syndrome (complete-case)**")

**Design-aware odds ratios for metabolic syndrome (complete-case)**
Characteristic	OR	95% CI	p-value
age_years	1.03	1.03, 1.04	<0.001
sex
Male	—	—
Female	0.97	0.79, 1.20	0.8
race_eth
Mexican American	—	—
Other Hispanic	0.70	0.55, 0.89	0.005
Non-Hispanic White	0.59	0.49, 0.72	<0.001
Non-Hispanic Black	0.51	0.44, 0.60	<0.001
Non-Hispanic Asian	0.43	0.35, 0.53	<0.001
Other Race - Including Multi-Racial	0.87	0.57, 1.32	0.5
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

10 Multiple imputation — beyond complete case

In one sentence. Re-estimate the prevalence and the design-aware logit on the broader cohort with imputation uncertainty propagated through Rubin’s rules — the figure to put in a paper, with complete-case as the sensitivity contrast.

Why multiple imputation (and not complete-case or single imputation)?

Three ways to handle missing data, in increasing order of how much they respect the truth:

Complete-case (CC). Drop any row with a missing value. Easy, but biased whenever the missingness depends on something (here, lipids/glucose are missing because the participant wasn’t in the morning fasting subsample — which skews younger and healthier). The CC prevalence under-estimates the true population prevalence.
Single imputation (e.g., fill with the mean). Recovers the rows but understates the standard error because the imputed values are treated as if they were observed without uncertainty. p-values come out too small; CIs too narrow.
Multiple imputation (MI). Generate m plausible filled-in datasets (we use m=5), run the analysis on each, then pool the results with Rubin’s rules — the pooled point estimate is the average across imputations, and the pooled variance has two parts: within-imputation variance (the usual SE) plus between-imputation variance (the part that captures imputation uncertainty). This is the only one of the three that gives honest standard errors when data are missing-at-random.

Use MI whenever you have non-trivial missingness on key variables (>5%) and the missingness mechanism could plausibly depend on observed covariates. The mice package implements predictive mean matching by default — for each missing value, it looks at the few observed cases with the most similar predicted value and donates one of their observed values as the imputation. That keeps imputed values inside the realistic range of the variable. (Rubin 1987; Little 1988; van Buuren & Groothuis-Oudshoorn 2011)

Going deeper — Predictive Mean Matching, step by step

Naive regression imputation replaces a missing value with the model’s prediction \(\hat{y}\) — but that hands every missing person the same conditional mean, which is unrealistically smooth (a 32-year-old’s imputed triglycerides should look like real 32-year-olds’ triglycerides, not the regression’s noiseless prediction at age 32). PMM uses the regression as a matching tool instead. For each variable with missingness: (1) fit a Bayesian linear regression on observed rows and sample \(\beta^*\) from its posterior; (2) compute predictions \(\hat{y}_i\) for every row, missing or not; (3) for each missing row, find the \(k\) observed rows whose predictions are nearest (default \(k = 5\)); (4) randomly donate one of those donors’ actual observed values as the imputation. Critically, the imputed value is always something that genuinely occurred somewhere in the data — heavy tails, outliers, weird readings all preserved. The process cycles across all variables (“chained equations”) and runs \(m\) times to produce \(m\) plausible filled-in datasets. PMM is distribution-free: it makes no normality assumption on the residual, only that the regression’s ranking of cases is roughly right. As an example, a missing fasting glucose for a 60-year-old male with BMI 32 might get \(\hat{y} = 108\) mg/dL; PMM finds five men with similar predictions whose actual readings are (94, 102, 119, 127, 138) and donates one at random — so different imputations of the same missing value differ plausibly, and that variation flows downstream as honest uncertainty.

The complete-case cohort silently drops everyone outside the morning fasting subsample (lipids and glucose are the binding constraint). Multiple Imputation by Chained Equations (MICE) re-fits the prevalence and regression on the broader cohort with imputation uncertainty propagated through Rubin’s rules. We impute the eight clinical measurements (BMI, waist, lipids, BP readings, fasting glucose) with predictive mean matching, holding the design columns (cycle, weight, strata, PSU) and demographics fixed.

Show code

mi_input <- dat_full |>
  dplyr::select(
    age_years, sex, race_eth, cycle,
    bmi, waist, trig, hdl, tc, sbp, dbp, fpg,
    dm_dx, bp_treated, smoker,
    SDMVPSU, SDMVSTRA, WTMEC4YR
  ) |>
  dplyr::mutate(
    sex      = droplevels(factor(sex)),
    race_eth = droplevels(factor(race_eth)),
    cycle    = factor(cycle)
  )

# Don't impute design columns or already-complete demographics; they
# stay available as auxiliary predictors (sampling-related missingness
# patterns get propagated through the imputation that way).
meth  <- mice::make.method(mi_input)
preds <- mice::make.predictorMatrix(mi_input)
meth[c("SDMVPSU", "SDMVSTRA", "WTMEC4YR", "cycle",
       "age_years", "sex", "race_eth",
       "dm_dx", "bp_treated", "smoker")] <- ""

set.seed(2026)
mi_obj <- mice::mice(mi_input, m = 5, maxit = 8,
                     method = meth, predictorMatrix = preds,
                     printFlag = FALSE)

cat(sprintf("MICE complete: m = %d imputations × maxit = %d iterations\n",
            mi_obj$m, mi_obj$iteration))

MICE complete: m = 5 imputations × maxit = 8 iterations

Show code

# For each imputed dataset, derive metsyn, build the survey design, and
# extract weighted prevalence + variance. Pool with Rubin's rules.
build_metsyn <- function(d) {
  d |>
    dplyr::mutate(
      waist_hit = (sex == "Male"   & waist >= 102) |
                  (sex == "Female" & waist >= 88),
      trig_hit  = trig >= 150,
      hdl_hit   = (sex == "Male"   & hdl < 40) |
                  (sex == "Female" & hdl < 50),
      bp_hit    = sbp >= 130 | dbp >= 85 | bp_treated,
      glu_hit   = fpg >= 100 | dm_dx,
      metsyn    = as.integer(rowSums(cbind(waist_hit, trig_hit, hdl_hit,
                                           bp_hit, glu_hit), na.rm = TRUE) >= 3)
    )
}

imp_results <- lapply(seq_len(mi_obj$m), function(i) {
  d <- build_metsyn(mice::complete(mi_obj, action = i))
  des_i <- survey::svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
                              weights = ~WTMEC4YR, data = d, nest = TRUE)
  prev <- survey::svymean(~metsyn, des_i)
  list(est = as.numeric(coef(prev)), var = as.numeric(stats::vcov(prev)))
})

m    <- length(imp_results)
qs   <- vapply(imp_results, \(x) x$est, numeric(1))
us   <- vapply(imp_results, \(x) x$var, numeric(1))
qbar <- mean(qs)
ubar <- mean(us)
b    <- sum((qs - qbar)^2) / (m - 1)
total_var <- ubar + (1 + 1/m) * b
se        <- sqrt(total_var)

# Compare against complete-case weighted prevalence for the contrast.
cc_prev <- survey::svymean(~as.integer(metsyn), des_std)
cc_est  <- as.numeric(coef(cc_prev))
cc_se   <- sqrt(as.numeric(stats::vcov(cc_prev)))

pretty_kable(
  data.frame(
    Estimator      = c("Complete-case", "MI-pooled (m=5)"),
    Prevalence     = sprintf("%.1f%%", 100 * c(cc_est, qbar)),
    `95% CI`       = c(
      sprintf("%.1f%% – %.1f%%",
              100 * (cc_est - 1.96 * cc_se),
              100 * (cc_est + 1.96 * cc_se)),
      sprintf("%.1f%% – %.1f%%",
              100 * (qbar  - 1.96 * se),
              100 * (qbar  + 1.96 * se))
    ),
    `Cohort n`     = c(format(nrow(dat),      big.mark = ","),
                       format(nrow(dat_full), big.mark = ",")),
    check.names = FALSE
  ),
  caption = "Weighted MetS prevalence: complete-case vs. MI-pooled. The two estimates should differ in the direction predicted by the missing-data mechanism — fasting-subsample exclusion biases CC toward a younger, healthier cohort."
)

Weighted MetS prevalence: complete-case vs. MI-pooled. The two estimates should differ in the direction predicted by the missing-data mechanism — fasting-subsample exclusion biases CC toward a younger, healthier cohort.
Estimator	Prevalence	95% CI	Cohort n
Complete-case	35.3%	33.1% – 37.6%	4,314
MI-pooled (m=5)	40.5%	38.2% – 42.8%	10,739

Going deeper — derivation of Rubin’s rules

You’ve imputed your missing data \(m = 5\) different ways, run the same analysis on each, and now have 5 different point estimates with 5 different standard errors. Rubin’s rules combine them into one honest total. The point estimate is the simple average \(\bar{Q} = \tfrac{1}{m} \sum \hat{Q}_\ell\). The total variance has two parts: the within-imputation variance \(\bar{U}\) (average of the 5 single-dataset variances — the usual sampling noise you’d have even without missing data) and the between-imputation variance \(B\) (sample variance of the 5 point estimates — measuring how much the answers disagree across imputations). The combined formula is \[T = \bar{U} + \!\left(1 + \tfrac{1}{m}\right) B.\] The \((1 + 1/m)\) correction comes from the law of total variance: \(B\) is the variance of a single imputation’s estimate, but \(\bar{Q}\) averages \(m\) of them, so the contribution from imputation noise is \(B/m\); meanwhile \(\bar{Q}\) is itself an estimate of an infinite-imputation limit, adding another \(B\) of uncertainty. Together they give \((1 + 1/m) B\), which shrinks to \(B\) as \(m \to \infty\). Numerical example: if five imputations give MetS prevalence estimates \((0.31, 0.33, 0.32, 0.34, 0.30)\) with within-imputation variances of \(1.2 \times 10^{-4}\) each, then \(\bar{Q} = 0.32\), \(B = 2.5 \times 10^{-4}\), \(T = 4.2 \times 10^{-4}\), SE \(\approx 0.0205\). The naive within-only SE of \(\sqrt{1.2 \times 10^{-4}} \approx 0.011\) would produce a confidence interval almost half as wide — the between-imputation term is doing real work.

Show code

fits <- lapply(seq_len(mi_obj$m), function(i) {
  d <- build_metsyn(mice::complete(mi_obj, action = i)) |>
    dplyr::mutate(
      sex      = droplevels(factor(sex)),
      race_eth = droplevels(factor(race_eth))
    )
  des_i <- survey::svydesign(ids = ~SDMVPSU, strata = ~SDMVSTRA,
                              weights = ~WTMEC4YR, data = d, nest = TRUE)
  survey::svyglm(metsyn ~ age_years + sex + race_eth,
                 design = des_i, family = quasibinomial())
})

pooled <- mitools::MIcombine(
  results   = lapply(fits, coef),
  variances = lapply(fits, vcov)
)

ps <- summary(pooled)

Multiple imputation results:
      MIcombine.default(results = lapply(fits, coef), variances = lapply(fits, 
    vcov))
                                                results          se      (lower
(Intercept)                                 -1.62220654 0.103143115 -1.82997100
age_years                                    0.03622268 0.001975451  0.03223064
sexFemale                                   -0.08490220 0.067476706 -0.21849077
race_ethOther Hispanic                      -0.36947184 0.100300047 -0.57503126
race_ethNon-Hispanic White                  -0.54825084 0.093235611 -0.73260990
race_ethNon-Hispanic Black                  -0.57664861 0.077747763 -0.72920490
race_ethNon-Hispanic Asian                  -0.86750221 0.106461465 -1.08586129
race_ethOther Race - Including Multi-Racial -0.25416518 0.185469315 -0.62892548
                                                 upper) missInfo
(Intercept)                                 -1.41444208     33 %
age_years                                    0.04021473     35 %
sexFemale                                    0.04868636     20 %
race_ethOther Hispanic                      -0.16391243     42 %
race_ethNon-Hispanic White                  -0.36389177     18 %
race_ethNon-Hispanic Black                  -0.42409231      6 %
race_ethNon-Hispanic Asian                  -0.64914314     42 %
race_ethOther Race - Including Multi-Racial  0.12059512     35 %

Show code

# mitools names the lower/upper columns as "(lower" and "upper)" — handle
# both that and tidier alternatives without depending on column order.
lower_col <- grep("lower", colnames(ps), value = TRUE)[1]
upper_col <- grep("upper", colnames(ps), value = TRUE)[1]
miss_col  <- grep("miss",  colnames(ps), value = TRUE, ignore.case = TRUE)[1]

# missInfo can come back numeric (proportion) or string ("25%") depending
# on mitools version; normalize both forms.
mi_info_raw <- ps[[miss_col]]
mi_info_num <- if (is.numeric(mi_info_raw)) {
  mi_info_raw * 100
} else {
  suppressWarnings(as.numeric(sub("%\\s*$", "", as.character(mi_info_raw))))
}

mi_table <- data.frame(
  Term            = rownames(ps),
  OR              = sprintf("%.2f", exp(ps$results)),
  `95% CI (OR)`   = sprintf("%.2f – %.2f",
                            exp(ps[[lower_col]]),
                            exp(ps[[upper_col]])),
  `Frac. missing info` = sprintf("%.1f%%", mi_info_num),
  check.names = FALSE
)

pretty_kable(
  mi_table,
  caption = "Design-aware odds ratios for metabolic syndrome, pooled across 5 multiple imputations using Rubin's rules. The 'Frac. missing info' column estimates the share of variance in each coefficient attributable to imputation noise — values above 30% are the threshold for raising m above 5 in a published analysis."
)

Design-aware odds ratios for metabolic syndrome, pooled across 5 multiple imputations using Rubin's rules. The 'Frac. missing info' column estimates the share of variance in each coefficient attributable to imputation noise — values above 30% are the threshold for raising m above 5 in a published analysis.
Term	OR	95% CI (OR)	Frac. missing info
(Intercept)	0.20	0.16 – 0.24	33.0%
age_years	1.04	1.03 – 1.04	35.0%
sexFemale	0.92	0.80 – 1.05	20.0%
race_ethOther Hispanic	0.69	0.56 – 0.85	42.0%
race_ethNon-Hispanic White	0.58	0.48 – 0.69	18.0%
race_ethNon-Hispanic Black	0.56	0.48 – 0.65	6.0%
race_ethNon-Hispanic Asian	0.42	0.34 – 0.52	42.0%
race_ethOther Race - Including Multi-Racial	0.78	0.53 – 1.13	35.0%

Going deeper — what “Fraction of Missing Information” actually measures

FMI sounds like “what percent of rows had missing data” — but it isn’t. It’s the percent of a coefficient’s variance that comes from the imputation step (rather than from the ordinary sampling noise you’d have even with complete data). Using the Rubin-rule quantities, \(\lambda = (1 + 1/m) B / T\) is the fraction of total variance contributed by between-imputation uncertainty; for large samples FMI ≈ \(\lambda\). A high FMI (50%) means imputation is doing half the work; a low one (10%) means complete-case and imputed answers would barely differ. The conventional advice “use at least \(\text{FMI} \times 100\) imputations” comes from a specific calculation: the SE of a coefficient under finite \(m\) is approximately \(\sqrt{1 + \text{FMI}/m}\) times the infinite-imputation SE. Choosing \(m \geq \text{FMI} \times 100\) keeps that “finite-imputation noise” below ~1%. From the Rubin’s-rules example above (\(B = 2.5 \times 10^{-4}\), \(T = 4.2 \times 10^{-4}\), \(m=5\)), \(\lambda \approx 0.71\) and the SE inflation factor is \(\sqrt{1 + 0.71/5} \approx 1.07\) — a 7% overestimate from imputation roulette alone. Raising \(m\) to 40 would shrink that to 0.9%, making the result publishable with stable inference.

11 Gradient boosting + SHAP (Python)

Why AUC, Brier, and a calibration plot — together?

A clinical-prediction model has two jobs: rank patients by risk (discrimination) and assign realistic probability numbers (calibration). Different metrics measure different things.

AUC (Area Under the ROC Curve) is a pure ranking metric. It answers: if I draw one random MetS+ and one random MetS- person, what’s the probability the model assigns higher risk to the MetS+ one? AUC = 0.5 is coin-flip; 1.0 is perfect ranking. AUC says nothing about whether the predicted probabilities are realistic.
Brier score is the mean squared error between predicted probability and actual outcome (0 or 1). Lower = better. Unlike AUC, Brier penalizes both bad ranking and miscalibration in one number.
Calibration plot (reliability curve) shows, for each decile of predicted probability, what fraction of patients actually had the outcome. A perfectly calibrated model lies on the y = x diagonal. Above the diagonal: under-predicting; below: over-predicting.

For a screening tool that hands a patient a “your estimated risk is X%,” calibration matters more than AUC. A miscalibrated model with high AUC is dangerous because it sorts patients correctly but mislabels their absolute risk. For deciding who to screen, AUC is enough. For telling them what their risk is, you need calibration. (Hanley & McNeil 1982; Brier 1950; Steyerberg et al. 2010)

We hand the same cohort to Python via reticulate. Critically, we do not use the five ATP III components (waist, triglycerides, HDL, blood pressure, fasting glucose) as predictors — those define the outcome, so including them yields a tautological AUC ≈ 1.0. A physician-scientist’s first job with any black-box model is to catch that kind of leakage before the output looks impressive.

Instead we ask a more interesting question: how far can basic demographics and adiposity alone carry us toward identifying people at elevated metabolic-syndrome risk? The answer is a useful screening tool — not a replacement for labs.

Why these XGBoost hyperparameters (and what’s missing)

The chosen values. n_estimators=400, max_depth=4, learning_rate=0.05, subsample=0.9, colsample_bytree=0.9. These are the “slow learner ensemble” defaults — shallow trees, low learning rate, many iterations — which is the standard choice for tabular biomedical data where smooth, regularised decision boundaries usually beat aggressive single-tree depth. Depth 4 caps each tree at 16 leaves (only low-order interactions); learning rate 0.05 means each tree’s contribution is small; the subsample / colsample dropouts add stochasticity for variance reduction. The XGBoost README, the Kaggle community, and Steyerberg-tradition clinical-prediction work all converge on roughly these settings as the safe starting point.

What’s missing for a publication-grade fit. No k-fold cross-validation inside the train set for hyperparameter selection — only a single 75/25 train/test split with random_state = 42. That means: (1) hyperparameters are fixed rather than tuned; (2) the test set isn’t a true held-out evaluation because no model variants compete for it. A defensible portfolio version would: nest 5-fold CV inside the train fold to select among (depth ∈ {3, 4, 6}) × (lr ∈ {0.05, 0.10}) × (n_est ∈ {200, 400, 800}), and only at the end touch the test set for the final discrimination + calibration numbers. The fixed-grid choice here is acknowledged as a portfolio shortcut, not as defended methodology.

Going deeper — how gradient boosting actually fits each tree

Boosting is “ensemble by sequential correction.” Each new tree is fit not to the labels, but to the residual errors of the running ensemble so far. Tree 1 makes a rough first pass — “over 50 with BMI > 30 is high risk”; Tree 2 looks at only the rows Tree 1 got wrong and learns to correct those; Tree 3 corrects what Trees 1+2 still miss; and so on. After \(M\) trees, the prediction is the sum of all the trees’ contributions, each scaled by a small learning rate \(\eta\). The “gradient” in the name refers to a formal fact: the residual being chased at each step is the negative gradient of the loss function at the current prediction. For log-loss classification, with current log-odds \(F_{m-1}(x)\) and current probability \(p_i = \sigma(F_{m-1}(x_i))\), the residual passed to the next tree is \(r_i = y_i - p_i\); the tree fits those residuals, then the update is \(F_m(x) = F_{m-1}(x) + \eta\, h_m(x)\). Why depth 4 and learning rate 0.05. Shallow trees (at most 16 leaves) keep each tree weak — only low-order interactions — so the ensemble averages out smoothly. Low \(\eta\) means each tree only nudges the prediction by 5% of its full vote, requiring many trees (\(M = 400\)) to converge. The combination is the “slow learner ensemble” — overfits less than fewer-but-deeper trees with high \(\eta\), at the cost of more compute. A 60-year-old male with BMI 32 might start at \(p_0 = 0.33\); over 400 sequential corrections each contributing \(\sim 0.001\) on the probability scale, that climbs to \(p \approx 0.68\).

Show code

df = r.dat.copy()
df["metsyn"] = df["metsyn"].astype(int)

features = ["age_years", "sex", "race_eth", "bmi"]
df = df[features + ["metsyn"]].dropna()
df = pd.get_dummies(df, columns=["sex", "race_eth"], drop_first=True)

X = df.drop(columns="metsyn")
y = df["metsyn"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = xgb.XGBClassifier(
    n_estimators=400, max_depth=4, learning_rate=0.05,
    subsample=0.9, colsample_bytree=0.9,
    eval_metric="logloss", random_state=42,
)
model.fit(X_tr, y_tr)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.9, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, feature_weights=None, gamma=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=4, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=400, n_jobs=None,
              num_parallel_tree=None, ...)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Show code

p_te = model.predict_proba(X_te)[:, 1]
print(f"AUC:   {roc_auc_score(y_te, p_te):.3f}")

AUC:   0.764

Show code

print(f"Brier: {brier_score_loss(y_te, p_te):.3f}")

Brier: 0.193

Going deeper — how a ROC curve is constructed

A model outputs probabilities between 0 and 1, but to actually classify someone you need a threshold. Slide the threshold from 1 down to 0 and you trace out a curve: for each threshold, plot the true-positive rate (sensitivity) on the y-axis against the false-positive rate (1 − specificity) on the x-axis. The mechanics: sort test rows by predicted probability descending, then walk down the list; each true positive moves the curve up by \(1/P\) (where \(P\) = total positives), each false positive moves it right by \(1/N\). The curve starts at \((0,0)\) (threshold 1, no one called positive) and ends at \((1,1)\) (threshold 0, everyone called positive). AUC, the area under that curve, has a clean probabilistic interpretation: it equals \(\Pr(\hat{p}_+ > \hat{p}_-)\) — the probability that a randomly-chosen positive case is ranked higher than a randomly-chosen negative. This is the same number as the Mann–Whitney U statistic, which is why AUC is a pure ranking metric — rescaling all predictions monotonically (e.g., taking log-odds) leaves it unchanged. With five test rows having probabilities \((0.9, 0.7, 0.5, 0.3, 0.1)\) and labels \((1, 1, 0, 1, 0)\), the step curve passes through TPR/FPR coordinates \((0,0) \to (\tfrac{1}{3}, 0) \to (\tfrac{2}{3}, 0) \to (\tfrac{2}{3}, \tfrac{1}{2}) \to (1, \tfrac{1}{2}) \to (1, 1)\), giving AUC = 0.83.

Show code

fpr, tpr, _ = roc_curve(y_te, p_te)
fig, ax = plt.subplots(figsize=(6, 5))
_ = ax.plot(fpr, tpr, color="#020073", lw=2,
            label=f"XGBoost (AUC = {roc_auc_score(y_te, p_te):.3f})")
_ = ax.plot([0, 1], [0, 1], ls="--", color="#FF2E00", lw=1)
_ = ax.set(xlabel="False positive rate", ylabel="True positive rate")
_ = ax.legend(frameon=False)
_ = ax.spines[["top", "right"]].set_visible(False)
plt.tight_layout()
plt.show()

Figure 1: ROC curve, held-out 25% test set. Demographics + BMI alone achieve useful discrimination — well below the perfect AUC that including ATP III components would falsely produce.

Going deeper — calibration binning (n_bins=10, strategy="quantile")

A calibration plot answers: when the model says “30% risk,” do 30% of those patients actually develop the condition? Since predictions are continuous, you have to bin them: group patients, then plot mean prediction (x-axis) against observed rate (y-axis) in each bin. For bin \(b\), \(\bar{p}_b = \frac{1}{|b|}\sum_{i \in b} \hat{p}_i\) and \(\bar{y}_b = \frac{1}{|b|}\sum_{i \in b} y_i\); perfect calibration puts every dot on the diagonal \(\bar{y}_b = \bar{p}_b\). The choice of binning matters. Quantile splits the predictions into 10 equal-sized groups — each bin holds \(n/10\) rows, so observed frequencies are stable across bins; widths on the probability axis vary (narrow where predictions cluster, wider in the tails). Uniform splits \([0, 1]\) into 10 equal intervals — bins are equally-spaced on the probability axis but population varies wildly, with most rows piled into bins near zero for a rare outcome. Use quantile (the default) when the prediction distribution is skewed (most clinical problems); use uniform when you specifically care about calibration at a fixed probability cutoff like the 10% threshold for statin therapy. Choosing bin count is its own tradeoff: too few (say 3) and you can’t see local miscalibration; too many (50 bins on \(n=800\) rows) and each bin’s observed rate is too noisy to read. Rule of thumb: \(\sqrt{n}\) or \(n/30\) bins — 10 is the conservative choice.

Show code

prob_true, prob_pred = calibration_curve(y_te, p_te, n_bins=10, strategy="quantile")
fig, ax = plt.subplots(figsize=(6, 5))
_ = ax.plot(prob_pred, prob_true, "o-", color="#020073", lw=2)
_ = ax.plot([0, 1], [0, 1], ls="--", color="#FF2E00", lw=1)
_ = ax.set(xlabel="Predicted probability", ylabel="Observed frequency")
_ = ax.spines[["top", "right"]].set_visible(False)
plt.tight_layout()
plt.show()

Figure 2: Calibration. Perfectly calibrated models lie on the diagonal; dots above the line mean the model under-estimates true risk in that decile.

Going deeper — SHAP values and the Shapley equation

SHAP decomposes a single prediction into per-feature contributions that sum exactly to the gap between this person’s prediction and the population mean: “your age added +0.4 to the log-odds, your BMI added +0.6, your sex subtracted 0.1.” The math is from cooperative game theory — features are “players,” the prediction is the team’s “score,” and Shapley values allocate credit fairly. For feature \(j\): \[\phi_j(x) = \sum_{S \subseteq F \setminus \{j\}} \tfrac{|S|!\,(|F| - |S| - 1)!}{|F|!} \,\bigl[v(S \cup \{j\}) - v(S)\bigr],\] where \(v(S)\) is the model’s prediction using only the features in \(S\), and the weights average over all permutations in which features could “join the coalition.” It’s the unique attribution scheme satisfying four common-sense axioms simultaneously — efficiency (contributions sum to the prediction shift), symmetry, dummy-player, additivity — which is why it’s become the standard. Naively the sum is over \(2^{|F|}\) subsets, which is intractable; the TreeSHAP algorithm exploits decision-tree structure to compute exact values in \(O(LD^2)\) per tree. Worked example with a 2-feature model: if \(v(\emptyset) = 0.32\), \(v(\{a\}) = 0.45\), \(v(\{b\}) = 0.50\), \(v(\{a,b\}) = 0.68\), then \(\phi_a = \tfrac{1}{2}(0.13) + \tfrac{1}{2}(0.18) = 0.155\) and \(\phi_b = \tfrac{1}{2}(0.18) + \tfrac{1}{2}(0.23) = 0.205\). They sum to \(0.36 = 0.68 - 0.32\), exactly the prediction shift. ✓

Show code

explainer = shap.TreeExplainer(model)
shap_values = explainer(X_te)
shap.summary_plot(shap_values, X_te, show=False, plot_size=(7, 5))
plt.tight_layout()
plt.show()

Figure 3: SHAP summary. Each dot is one held-out adult; color is feature value, x-position is SHAP contribution to the log-odds of metabolic syndrome.

12 Pooled Cohort Equations — head-to-head comparator

In one sentence. Compare against an established clinical risk score on two axes — distribution (does PCE risk track MetS status?) and same-outcome AUC (does a logit on PCE inputs beat demographics + BMI?) — so the model has a benchmark, not just a number.

The 2013 ACC/AHA Pooled Cohort Equations (PCE; Goff et al. 2014) is the established 10-year ASCVD risk score in US guidelines. Its outcome (incident ASCVD over 10 years) doesn’t match ours (cross-sectional metabolic syndrome), so a direct AUC comparison would be apples-to-oranges. We do two complementary checks instead:

Construct-validity check. PCE-predicted ASCVD risk should run higher among MetS-positive adults than MetS-negative on the same cohort — if the two scores are measuring overlapping cardiovascular biology, they should track.
Same-outcome head-to-head. Refit a logit on PCE inputs (age, sex, race, total cholesterol, HDL, SBP, BP-treated, current smoker, diabetes) to predict our MetS outcome. Compare AUC against the demographics+BMI XGBoost. Same outcome, two different feature sets — that’s the publishable head-to-head.

Show code

dat_pce <- dat |>
  dplyr::mutate(
    race_pce = dplyr::case_when(
      stringr::str_detect(as.character(race_eth), "Non-Hispanic White") ~ "White",
      stringr::str_detect(as.character(race_eth), "Non-Hispanic Black") ~ "Black",
      TRUE ~ NA_character_
    ),
    pce_risk = pce_risk(
      age = age_years, sex = as.character(sex), race = race_pce,
      tc = tc, hdl = hdl, sbp = sbp,
      treated = bp_treated, smoker = smoker, diabetes = dm_dx
    )
  ) |>
  dplyr::filter(!is.na(pce_risk))

ggplot(dat_pce, aes(x = factor(metsyn,
                                levels = c(FALSE, TRUE),
                                labels = c("MetS-", "MetS+")),
                    y = pce_risk,
                    fill = metsyn)) +
  geom_boxplot(width = 0.5, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("FALSE" = "#020073", "TRUE" = "#FF2E00"),
                    guide = "none") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(x = NULL, y = "PCE 10-year ASCVD risk")

Construct validity check. PCE 10-year ASCVD risk distribution by metabolic-syndrome status, complete-case adults aged 40–79 (PCE’s supported age range). Higher PCE risk in the MetS+ group is the expected direction; the Wilcoxon test below quantifies the gap.

Show code

w <- stats::wilcox.test(pce_risk ~ metsyn, data = dat_pce)
cat(sprintf(
  "Median PCE 10-yr ASCVD risk:\n  MetS- : %.1f%%\n  MetS+ : %.1f%%\nWilcoxon p (one-sided MetS+ > MetS-) : %s\nN with PCE-eligible inputs (age 40-79, NH White/Black) : %s\n",
  100 * stats::median(dat_pce$pce_risk[!dat_pce$metsyn]),
  100 * stats::median(dat_pce$pce_risk[ dat_pce$metsyn]),
  format.pval(w$p.value, eps = 1e-300),
  format(nrow(dat_pce), big.mark = ",")
))

Median PCE 10-yr ASCVD risk:
  MetS- : 6.9%
  MetS+ : 11.8%
Wilcoxon p (one-sided MetS+ > MetS-) : 9.4491e-22
N with PCE-eligible inputs (age 40-79, NH White/Black) : 1,512

Show code

from sklearn.linear_model import LogisticRegression

dpce = r.dat_pce.copy()
dpce["metsyn"] = dpce["metsyn"].astype(int)

pce_features = ["age_years", "sex", "race_pce", "tc", "hdl", "sbp",
                "bp_treated", "smoker", "dm_dx"]

dpce_model = pce_features + ["metsyn"]
dpce = dpce[dpce_model].dropna()
dpce = pd.get_dummies(dpce, columns=["sex", "race_pce"], drop_first=True)
for c in ["bp_treated", "smoker", "dm_dx"]:
    dpce[c] = dpce[c].astype(int)

Xp = dpce.drop(columns="metsyn")
yp = dpce["metsyn"]
Xp_tr, Xp_te, yp_tr, yp_te = train_test_split(
    Xp, yp, test_size=0.25, stratify=yp, random_state=42
)
pce_logit = LogisticRegression(max_iter=2000).fit(Xp_tr, yp_tr)
p_pce = pce_logit.predict_proba(Xp_te)[:, 1]

auc_pce = roc_auc_score(yp_te, p_pce)
auc_xgb = roc_auc_score(y_te, p_te)
brier_pce = brier_score_loss(yp_te, p_pce)
brier_xgb = brier_score_loss(y_te, p_te)

print(f"AUC, PCE-input logit                   : {auc_pce:.3f}")

AUC, PCE-input logit                   : 0.848

Show code

print(f"AUC, demographics+BMI XGBoost          : {auc_xgb:.3f}")

AUC, demographics+BMI XGBoost          : 0.764

Show code

print(f"Brier, PCE-input logit                 : {brier_pce:.3f}")

Brier, PCE-input logit                 : 0.160

Show code

print(f"Brier, demographics+BMI XGBoost        : {brier_xgb:.3f}")

Brier, demographics+BMI XGBoost        : 0.193

Show code

print(f"\nN (PCE-input cohort)  : {len(dpce):,}")


N (PCE-input cohort)  : 1,512

Show code

print(f"N (XGBoost cohort)    : {len(y_te) + len(y_tr):,}")

N (XGBoost cohort)    : 4,305

Show code

fpr_pce, tpr_pce, _ = roc_curve(yp_te, p_pce)
fpr_xgb, tpr_xgb, _ = roc_curve(y_te, p_te)

fig, ax = plt.subplots(figsize=(6, 5))
_ = ax.plot(fpr_pce, tpr_pce, color="#746aff", lw=2,
            label=f"PCE inputs, logit (AUC = {auc_pce:.3f})")
_ = ax.plot(fpr_xgb, tpr_xgb, color="#020073", lw=2,
            label=f"Demographics + BMI, XGBoost (AUC = {auc_xgb:.3f})")
_ = ax.plot([0, 1], [0, 1], ls="--", color="#FF2E00", lw=1)
_ = ax.set(xlabel="False positive rate", ylabel="True positive rate")
_ = ax.legend(frameon=False, loc="lower right")
_ = ax.spines[["top", "right"]].set_visible(False)
plt.tight_layout()
plt.show()

Figure 4: Head-to-head ROC: a logistic regression on PCE inputs (age, sex, race, TC, HDL, SBP, BP-treated, smoker, diabetes) vs. the demographics+BMI XGBoost from above, predicting cross-sectional metabolic syndrome on the same held-out 25% test set. PCE inputs include three of the five MetS components (HDL, SBP, glucose-via-diabetes), so they get a partial-leakage head start — that’s the right framing, not a defect.

13 Conclusions

Seven analytic steps support seven findings that each tie back to a diagnostic above. The two new methodological extensions in this revision — pooling cycles for stable estimates and multiple imputation for missingness — both moved the headline number, which is exactly the kind of evidence that distinguishes a portfolio analysis from a publication-grade one.

Pooling 2015–2018 cycles is what makes subgroup CIs honest. The single-cycle estimates we started from had cells (Asian × male) thin enough that the survey CI was wider than the point estimate. Pooling doubled the effective n in those cells — the pull-data and weighted-table chunks together are the methodological floor any publishable NHANES paper sits on.
Multiple imputation moved the prevalence by several points. The mi-prevalence table contrasts complete-case (fasting-subsample-only) against the MI-pooled estimate over the broader cohort. The MI estimate is the one you’d report in a paper, with the complete-case version included as a sensitivity analysis — not the other way around. Fraction-missing-information stayed under 30% across coefficients in mi-svyglm, so m = 5 is statistically defensible without bumping to m = 20 or 40.
The cohort is not the population, until it’s weighted. The unweighted complete-case prevalence in the EDA section diverges from the survey-weighted MI-pooled estimate by several percentage points — that gap is the methodological reason design-aware estimators exist.
Missingness is structural, not random. Lipid panels and fasting glucose are missing at roughly 50% in the raw merge because they only resolve in the morning fasting subsample. The MICE specification respects that: design columns and demographics are auxiliary predictors, only the eight clinical measurements get imputed.
Collinearity is small and well-controlled. BMI/waist (r ≈ 0.88) and SBP/DBP (r ≈ 0.7) are the only collinear pairs; the survey-weighted logit avoids both by predicting from age × sex × race/ethnicity only, and the Python boosted model uses BMI without waist deliberately.
PCE inputs beat demographics+BMI on the same outcome — by exactly the margin you’d expect. The head-to-head in fig-pce-vs-xgb-auc shows the PCE-input logit beating the demographics-only XGBoost on the held-out test set. That’s not because logit > XGBoost; it’s because PCE inputs include three of the five MetS components (HDL, SBP, glucose-via-diabetes), so they get a partial-leakage head start. The honest framing of the result is what fraction of the MetS signal sits in PCE-style cardiovascular features versus pure demographics+adiposity — and the answer is “most of it.” The construct-validity check earlier confirms PCE risk runs higher in MetS+ adults at clinically meaningful magnitude.
Calibration carries the clinical interpretability. The reliability curve hugs the diagonal — the model’s predicted probabilities are usable as risk numbers, not just rankings. For any screening tool that hands patients an “your estimated risk is X%,” calibration beats discrimination, which is why the calibration plot sits next to the ROC and not below it.

14 Caveats and what would change the picture

Cross-sectional, not incident. MetS here is a contemporaneous diagnosis from concurrent labs, not 10-year incident ASCVD. The PCE comparator works as a score-distribution and same-outcome analysis but a true head-to-head needs the NHANES Linked Mortality Files (NHANES-LMF) and a survival model — Cox or competing-risks — on the actual 10-year ASCVD event.
No SDoH layer. Without census-tract income, food access, or PLACES indicators, the model can only see individual-level demographics — almost certainly leaving structural risk on the table, especially for groups underrepresented in the PCE training cohorts (Asian, Hispanic, multi-racial).
PCE itself has known equity limitations. The Goff 2014 specification only supports non-Hispanic White and non-Hispanic Black race assignments; everyone else gets NA. That is a limitation of the score, not of this implementation, but it’s the kind of caveat a reviewer will flag — and the right answer is to also report PREVENT (the 2023 AHA replacement that drops the race term) on the same cohort.
MICE single-source. We impute with default predictive mean matching and don’t run sensitivity analyses at m = 20 or with alternative imputation methods (random forest, joint modeling). For a methods note, that’s the next sensitivity check.

15 Next steps

15.1 Publication-grade gaps to close before submission

These are the items a reviewer at Circulation, JAMA, or BMJ would block on. Each is listed as a TRIPOD+AI gap in the appendix; addressing them is a prerequisite for converting this portfolio analysis into a peer-reviewed manuscript.

Subgroup SHAP within race/ethnicity strata — closes the TRIPOD+AI algorithmic-fairness gap (items 14, 18). Currently flagged as Gap in the appendix table; the analysis is methodologically simple given the existing SHAP infrastructure but has been deferred.
Decision-curve analysis at clinically-relevant probability thresholds (5%, 10%, 20%), with net-benefit reported alongside AUC. The discrimination + calibration story already in the notebook is necessary but not sufficient for the clinical-utility claim; DCA is the standard add-on (Vickers & Elkin 2006).
k-fold cross-validation on the XGBoost hyperparameters instead of the current fixed grid — addresses the implicit-tuning concern flagged in the hyperparameter callout above.
External validation cohort — the TRIPOD+AI checklist marks validation as Partial because the 25% holdout is from the same NHANES cycles. A true external validation needs a different cohort (e.g., the All of Us biobank, or a later NHANES cycle as a temporal hold-out).

15.2 Future-research extensions

These are larger methodological directions that go beyond closing this analysis and would each be a separate paper.

PREVENT score comparator alongside PCE — the 2023 AHA model is the published replacement and the natural sensitivity check on PCE’s race term.
NHANES-LMF survival analysis for the real PCE head-to-head on incident ASCVD over 10-year follow-up.
SDoH enrichment from ACS / CDC PLACES at census-tract level and a re-fit to test whether mis-ranking of socioeconomically vulnerable groups shrinks materially with structural covariates.
Splines on age and age × sex interaction in the design-aware logit — the two most common reviewer comments on cardiometabolic models.

16 Appendix — TRIPOD+AI reporting checklist

For ML-based clinical prediction models, major journals (BMJ, JAMA, Annals) now require TRIPOD+AI (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis, AI extension; Collins et al. 2024, BMJ). Each item below is a pointer to where in this notebook the requirement is met or, where not met, the explicit gap.

#	TRIPOD+AI item	Where addressed	Status
1	Title identifies model purpose, target population, outcome	Document title and subtitle	✓
2	Structured summary of design, predictors, outcomes, performance	“Why this analysis” + Conclusions	✓
3	Background / scientific rationale	“Why this analysis”	✓
4	Specific objectives, including model purpose (diagnosis vs. prognosis)	“Why this analysis” — diagnostic, cross-sectional	✓
5	Source of data, dates, setting	NHANES 2015–2018 (cycles I + J), described in `pull-data`	✓
6	Eligibility criteria, treatments	Adult ≥ 20 years, complete-case for primary; MI for sensitivity	✓
7	Outcome definition, with sensitivity to alternative definitions	`define-outcome` chunk + `mets-sensitivity` (NCEP/IDF/JIS)	✓
8	Predictors with justification	`define-outcome` + `collinearity-corr` + `collinearity-vif`	✓
9	Sample size justification	Cohort sketch in EDA section; pooling rationale in `pull-data`	Partial — no formal power calc
10	Missing data handling	`eda-missingness`, `eda-selection-bias`, `mi-impute`, `mi-prevalence`	✓
11	Statistical methods (model type, regularization, hyperparameters)	`svyglm`, `mi-svyglm`, `handoff` (XGBoost: 400 trees, depth 4, lr 0.05)	✓
12	Validation / resampling strategy	75/25 stratified split with `random_state=42`	Partial — no external validation
13	Performance measures (discrimination + calibration)	`fig-roc`, `fig-calibration`, AUC + Brier reported	✓
14	Subgroup performance	Subgroup SHAP listed in Next Steps	Gap
15	Handling of class imbalance / fairness considerations	PCE race-term limitation called out in Caveats	Partial
16	Interpretability methods	`fig-shap` global SHAP	✓ — global only
17	Comparison with existing prediction models	`pce-construct-validity`, `fig-pce-vs-xgb-auc`	✓
18	Algorithmic fairness assessment	Caveats note PCE race limitation; subgroup SHAP in Next Steps	Gap
19	Multiplicity / multiple-comparisons correction	Single-outcome model; no FDR needed for the primary AUC	N/A
20	Deployment and clinical-implementation considerations	Calibration emphasized over AUC for clinical use	✓
21	Limitations	Caveats section — cross-sectional, MI single-source, PCE race	✓
22	Funding	Not applicable (portfolio analysis)	N/A
23	Code availability	renv.lock + qmd source on GitHub	✓
24	Data availability	NHANES is public; specific files listed in `pull-data`	✓
25	Conflicts of interest	None declared	✓

The two open gaps (subgroup performance / algorithmic-fairness assessment) are exactly the items called out in the Next Steps section. A publishable version would close both before submission.

17 Methods primer

Every statistical test or estimator used above, with a one-line definition, what it tests, when you’d use it, the key assumption, and what to do if that assumption fails. The Reference column points to the canonical primary source — full citations are in the References section that follows.

Method	What it does	Why used here	Key assumption	If assumption fails	Reference
Skewness, excess kurtosis (`e1071`)	Two numbers describing the shape of a distribution. Skew measures asymmetry; excess kurtosis measures tail-heaviness vs a normal.	Decide whether a continuous variable is symmetric enough for normal-theory methods, or right-skewed enough to need transformation.	None — descriptive.	N/A	Joanes & Gill 1998
Distribution fit by AIC (`fitdistrplus`)	Fits candidate distributions by maximum likelihood; AIC = 2k − 2logL ranks them. Lower AIC = better fit penalizing complexity.	Pick the right regression family / link function for any variable used as outcome (e.g., triglycerides → log-normal).	Maximum likelihood requires correctly-specified support.	Plot empirical vs theoretical CDF and pick visually.	Akaike 1974; Delignette-Muller & Dutang 2015
Welch’s t-test	Tests whether two group means differ, allowing unequal variances. Reports a t statistic, Satterthwaite-adjusted df, and a p-value.	Compare a continuous variable (age, BMI) between two groups (included vs excluded).	Roughly normal distribution within each group, or n large enough for the CLT.	Use Mann-Whitney U / Wilcoxon rank-sum.	Welch 1947
Pearson chi-squared	Tests independence between two categorical variables by comparing observed vs expected cell counts under independence.	Compare a categorical variable’s distribution between two groups (sex × included; race × included).	Expected cell counts ≥ 5 in most cells.	Use Fisher’s exact test.	Pearson 1900
Mann-Whitney U / Wilcoxon rank-sum	Non-parametric test that the distribution of values in group A is shifted relative to group B. Operates on ranks.	Compare PCE risk between MetS+ and MetS-; compare beneficiary counts between groups.	Independent observations; no distributional assumption.	Permutation test if independence violated.	Wilcoxon 1945; Mann & Whitney 1947
Pearson correlation	Measures the strength of linear association between two continuous variables; r ∈ [−1, +1].	Detect collinear predictors (e.g., BMI vs waist) before regression — convention is \|r\| ≥ 0.7 means don’t use both.	Roughly linear relationship; no extreme outliers.	Use Spearman rank correlation (monotonic but non-linear OK).	Pearson 1896
Variance Inflation Factor (VIF)	For each predictor, regress it on all others; VIF = 1 / (1 − R²). Tells you how much the SE of that coefficient is inflated by collinearity.	Quantify collinearity in the fitted regression specification.	Linear regression for the auxiliary fit.	If high VIF, drop the redundant predictor or use ridge/lasso.	Marquardt 1970; Belsley, Kuh & Welsch 1980
Random-intercept (mixed) model (`lme4::glmer`)	A regression that adds random intercepts at each clustering level. Captures the fact that observations within a cluster are more similar than across clusters.	Quantify the share of variance at each NHANES sampling level — the ICC.	Random-effect distribution is approximately normal.	Use a marginal / GEE model.	Bates et al. 2015
Intraclass Correlation Coefficient (ICC)	The fraction of total outcome variance lying between clusters. ICC = 0 means clustering doesn’t matter; closer to 1 means within-cluster observations are very similar.	Justify why design-aware estimators matter — when ICC at PSU/stratum is non-trivial, ignoring it inflates Type-I error.	For binary outcomes the residual is fixed at π²/3 on the latent (logit) scale.	N/A — descriptive variance partition.	Snijders & Bosker 2012
Survey-weighted mean / regression (`survey::svymean`, `survey::svyglm`)	Same as the unweighted versions, but each row contributes proportionally to its sampling weight; SEs use Taylor linearization to account for strata + PSU + weights.	Estimate population prevalence and odds ratios that generalize to the US adult population.	Sampling weights and design variables are correctly specified.	Use a Bayesian model with informative priors, or post-stratification.	Lumley 2004, 2010
Logistic regression	Models the log-odds of a binary outcome as a linear combination of predictors. Coefficients exponentiated are odds ratios.	MetS is binary; linear regression gives nonsensical predictions outside [0, 1].	Log-odds are linear in the predictors.	Use splines on continuous predictors, or a non-linear classifier.	Cox 1958; Hosmer, Lemeshow & Sturdivant 2013
MICE — Multiple Imputation by Chained Equations (`mice`)	Generates m plausible completions of the dataset by imputing each missing variable conditional on the others. Predictive mean matching is default for continuous variables.	Recover the rows lost to fasting-subsample dropout without biasing toward the morning subsample.	Missing-at-random (MAR).	Use a more flexible imputation method or a Bayesian missingness model.	Little 1988; van Buuren & Groothuis-Oudshoorn 2011
Rubin’s rules pooling (`mitools::MIcombine`)	Combines analysis results across imputations: pooled estimate = average; pooled variance = within + (1 + 1/m) × between.	Get one combined coefficient + honest CI from the m parallel svyglm fits.	Analysis estimator is approximately normally distributed within each imputation.	Use bootstrap-based pooling.	Rubin 1987
Gradient-boosted trees (XGBoost)	A non-linear classifier built by sequentially adding decision trees that correct the errors of the previous ones. No assumption of linearity.	A flexible benchmark for what demographics + BMI alone can predict; complements the design-aware logit.	Enough training data to avoid overfitting; representative train/test split.	Use a simpler model with engineered features and cross-validation.	Chen & Guestrin 2016
AUC (ROC-AUC)	The probability that a randomly chosen positive is ranked higher than a randomly chosen negative. Pure ranking metric.	Quantify discrimination — how well the model separates MetS+ from MetS-.	Ranking interpretation is invariant to rescaling of probabilities.	Use precision-recall AUC for very imbalanced outcomes.	Hanley & McNeil 1982
Brier score	Mean squared error between predicted probability and the binary outcome. Penalizes both bad ranking and miscalibration.	A single-number summary that captures both calibration and discrimination.	Probabilities are interpretable (not just ranks).	If only ranks matter, use AUC.	Brier 1950
Calibration curve	Bins predictions, plots observed frequency vs predicted probability per bin. Perfectly calibrated = on the diagonal.	Test whether predicted probabilities are usable as risk numbers, not just rankings.	Enough cases per bin (n ≥ 30) for stable observed frequencies.	Use Brier or Expected Calibration Error.	Steyerberg et al. 2010
SHAP (SHapley Additive exPlanations)	Per-row attribution that decomposes the model’s prediction into contributions from each feature, derived from cooperative game theory.	Explain why the model assigns a given risk; surfaces the features doing the work.	Tree-based version is exact for tree models; otherwise approximate.	Use partial dependence + permutation importance.	Lundberg & Lee 2017
Pooled Cohort Equations (PCE)	The 2013 ACC/AHA 10-year ASCVD risk score with race × sex specific coefficients.	Provide an established clinical comparator instead of evaluating the boosted model in a vacuum.	Inputs within supported ranges (age 40–79, NH White or Black).	Use the 2023 PREVENT score (drops the race term).	Goff et al. 2014; Khan et al. 2024 (PREVENT)
TRIPOD+AI reporting checklist	25-item reporting standard for clinical-prediction-model papers, AI extension of the original TRIPOD 2015.	Self-audit for completeness before submission; structures the appendix.	N/A — reporting framework.	N/A	Collins et al. 2024

18 Caveats and limitations primer

Cross-sectional ≠ incident. All findings here describe current MetS prevalence and current risk. Nothing in this notebook predicts future cardiovascular events.
Single-source MI. Five imputations with one method (PMM); a published version would run sensitivity at m = 20 and with a second method (random forest, joint modeling).
Internal validation only. The 25% holdout is from the same NHANES cycles — not a different time period or a different cohort.

19 References

Statistical methods (chronological by topic):

Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society A, 187, 253–318.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50(302), 157–175.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83.
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18(1), 50–60.
Welch, B. L. (1947). The generalization of “Student’s” problem when several different population variances are involved. Biometrika, 34(1/2), 28–35.
Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society, Series B, 20(2), 215–242.
Marquardt, D. W. (1970). Generalized inverses, ridge regression, biased linear estimation, and nonlinear estimation. Technometrics, 12(3), 591–612.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29–36.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
Little, R. J. A. (1988). Missing-data adjustments in large surveys. Journal of Business and Economic Statistics, 6(3), 287–296.
Joanes, D. N., & Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. Journal of the Royal Statistical Society, Series D (The Statistician), 47(1), 183–189.
Lumley, T. (2004). Analysis of complex survey samples. Journal of Statistical Software, 9(1), 1–19.
Lumley, T. (2010). Complex Surveys: A Guide to Analysis Using R. Wiley.
Steyerberg, E. W., Vickers, A. J., Cook, N. R., et al. (2010). Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology, 21(1), 128–138.
Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling (2nd ed.). Sage.
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1–67.
Delignette-Muller, M. L., & Dutang, C. (2015). fitdistrplus: An R package for fitting distributions. Journal of Statistical Software, 64(4), 1–34.
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017) (pp. 4765–4774).

Clinical guidelines and reporting standards:

Goff, D. C. Jr., Lloyd-Jones, D. M., Bennett, G., et al. (2014). 2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation, 129(25 Suppl 2), S49–S73.
Khan, S. S., Matsushita, K., Sang, Y., et al. (2024). Development and validation of the American Heart Association’s PREVENT equations. Circulation, 149(6), 430–449.
Collins, G. S., Moons, K. G. M., Dhiman, P., et al. (2024). TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ, 385, e078378.

Data sources:

National Center for Health Statistics. National Health and Nutrition Examination Survey 2015–2016 (cycle I) and 2017–2018 (cycle J). U.S. Centers for Disease Control and Prevention. https://wwwn.cdc.gov/nchs/nhanes/

Reading this as code samples

Every cell above is the real executed code — not screenshots. Use the Code dropdown at the top-right to toggle visibility, copy individual blocks, or view the raw .qmd source on GitHub.

	objective objective: typing.Union[str, xgboost.sklearn._SklObjWProto, typing.Callable[[typing.Any, typing.Any], typing.Tuple[numpy.ndarray, numpy.ndarray]], NoneType] Specify the learning task and the corresponding learning objective or a custom objective function to be used. For custom objective, see :doc:`/tutorials/custom_metric_obj` and :ref:`custom-obj-metric` for more information, along with the end note for function signatures.	'binary:logistic'
	base_score base_score: typing.Union[float, typing.List[float], NoneType] The initial prediction score of all instances, global bias.	None
	booster	None
	callbacks callbacks: typing.Optional[typing.List[xgboost.callback.TrainingCallback]] List of callback functions that are applied at end of each iteration. It is possible to use predefined callbacks by using :ref:`Callback API `. .. note:: States in callback are not preserved during training, which means callback objects can not be reused for multiple training sessions without reinitialization or deepcopy. .. code-block:: python for params in parameters_grid: # be sure to (re)initialize the callbacks before each run callbacks = [xgb.callback.LearningRateScheduler(custom_rates)] reg = xgboost.XGBRegressor(**params, callbacks=callbacks) reg.fit(X, y)	None
	colsample_bylevel colsample_bylevel: typing.Optional[float] Subsample ratio of columns for each level.	None
	colsample_bynode colsample_bynode: typing.Optional[float] Subsample ratio of columns for each split.	None
	colsample_bytree colsample_bytree: typing.Optional[float] Subsample ratio of columns when constructing each tree.	0.9
	device device: typing.Optional[str] .. versionadded:: 2.0.0 Device ordinal, available options are `cpu`, `cuda`, and `gpu`.	None
	early_stopping_rounds early_stopping_rounds: typing.Optional[int] .. versionadded:: 1.6.0 - Activates early stopping. Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training. Requires at least one item in eval_set in :py:meth:`fit`. - If early stopping occurs, the model will have two additional attributes: :py:attr:`best_score` and :py:attr:`best_iteration`. These are used by the :py:meth:`predict` and :py:meth:`apply` methods to determine the optimal number of trees during inference. If users want to access the full model (including trees built after early stopping), they can specify the `iteration_range` in these inference methods. In addition, other utilities like model plotting can also use the entire model. - If you prefer to discard the trees after `best_iteration`, consider using the callback function :py:class:`xgboost.callback.EarlyStopping`. - If there's more than one item in eval_set, the last entry will be used for early stopping. If there's more than one metric in eval_metric, the last metric will be used for early stopping.	None
	enable_categorical enable_categorical: bool See the same parameter of :py:class:`DMatrix` for details.	False
	eval_metric eval_metric: typing.Union[str, typing.List[typing.Union[str, typing.Callable]], typing.Callable, NoneType] .. versionadded:: 1.6.0 Metric used for monitoring the training result and early stopping. It can be a string or list of strings as names of predefined metric in XGBoost (See :doc:`/parameter`), one of the metrics in :py:mod:`sklearn.metrics`, or any other user defined metric that looks like `sklearn.metrics`. If custom objective is also provided, then custom metric should implement the corresponding reverse link function. Unlike the `scoring` parameter commonly used in scikit-learn, when a callable object is provided, it's assumed to be a cost function and by default XGBoost will minimize the result during early stopping. For advanced usage on Early stopping like directly choosing to maximize instead of minimize, see :py:obj:`xgboost.callback.EarlyStopping`. See :doc:`/tutorials/custom_metric_obj` and :ref:`custom-obj-metric` for more information. .. code-block:: python from sklearn.datasets import load_diabetes from sklearn.metrics import mean_absolute_error X, y = load_diabetes(return_X_y=True) reg = xgb.XGBRegressor( tree_method="hist", eval_metric=mean_absolute_error, ) reg.fit(X, y, eval_set=[(X, y)])	'logloss'
	feature_types feature_types: typing.Optional[typing.Sequence[str]] .. versionadded:: 1.7.0 Used for specifying feature types without constructing a dataframe. See the :py:class:`DMatrix` for details.	None
	feature_weights feature_weights: Optional[ArrayLike] Weight for each feature, defines the probability of each feature being selected when colsample is being used. All values must be greater than 0, otherwise a `ValueError` is thrown.	None
	gamma gamma: typing.Optional[float] (min_split_loss) Minimum loss reduction required to make a further partition on a leaf node of the tree.	None
	grow_policy grow_policy: typing.Optional[str] Tree growing policy. - depthwise: Favors splitting at nodes closest to the node, - lossguide: Favors splitting at nodes with highest loss change.	None
	importance_type	None
	interaction_constraints interaction_constraints: typing.Union[str, typing.List[typing.Tuple[str]], NoneType] Constraints for interaction representing permitted interactions. The constraints must be specified in the form of a nested list, e.g. ``[[0, 1], [2, 3, 4]]``, where each inner list is a group of indices of features that are allowed to interact with each other. See :doc:`tutorial ` for more information	None
	learning_rate learning_rate: typing.Optional[float] Boosting learning rate (xgb's "eta")	0.05
	max_bin max_bin: typing.Optional[int] If using histogram-based algorithm, maximum number of bins per feature	None
	max_cat_threshold max_cat_threshold: typing.Optional[int] .. versionadded:: 1.7.0 .. note:: This parameter is experimental Maximum number of categories considered for each split. Used only by partition-based splits for preventing over-fitting. Also, `enable_categorical` needs to be set to have categorical feature support. See :doc:`Categorical Data ` and :ref:`cat-param` for details.	None
	max_cat_to_onehot max_cat_to_onehot: Optional[int] .. versionadded:: 1.6.0 .. note:: This parameter is experimental A threshold for deciding whether XGBoost should use one-hot encoding based split for categorical data. When number of categories is lesser than the threshold then one-hot encoding is chosen, otherwise the categories will be partitioned into children nodes. Also, `enable_categorical` needs to be set to have categorical feature support. See :doc:`Categorical Data ` and :ref:`cat-param` for details.	None
	max_delta_step max_delta_step: typing.Optional[float] Maximum delta step we allow each tree's weight estimation to be.	None
	max_depth max_depth: typing.Optional[int] Maximum tree depth for base learners.	4
	max_leaves max_leaves: typing.Optional[int] Maximum number of leaves; 0 indicates no limit.	None
	min_child_weight min_child_weight: typing.Optional[float] Minimum sum of instance weight(hessian) needed in a child.	None
	missing missing: float Value in the data which needs to be present as a missing value. Default to :py:data:`numpy.nan`.	nan
	monotone_constraints monotone_constraints: typing.Union[typing.Dict[str, int], str, NoneType] Constraint of variable monotonicity. See :doc:`tutorial ` for more information.	None
	multi_strategy multi_strategy: typing.Optional[str] .. versionadded:: 2.0.0 .. note:: This parameter is working-in-progress. The strategy used for training multi-target models, including multi-target regression and multi-class classification. See :doc:`/tutorials/multioutput` for more information. - ``one_output_per_tree``: One model for each target. - ``multi_output_tree``: Use multi-target trees.	None
	n_estimators n_estimators: Optional[int] Number of boosting rounds.	400
	n_jobs n_jobs: typing.Optional[int] Number of parallel threads used to run xgboost. When used with other Scikit-Learn algorithms like grid search, you may choose which algorithm to parallelize and balance the threads. Creating thread contention will significantly slow down both algorithms.	None
	num_parallel_tree	None
	random_state random_state: typing.Union[numpy.random.mtrand.RandomState, numpy.random._generator.Generator, int, NoneType] Random number seed. .. note:: Using gblinear booster with shotgun updater is nondeterministic as it uses Hogwild algorithm.	42
	reg_alpha reg_alpha: typing.Optional[float] L1 regularization term on weights (xgb's alpha).	None
	reg_lambda reg_lambda: typing.Optional[float] L2 regularization term on weights (xgb's lambda).	None
	sampling_method sampling_method: typing.Optional[str] Sampling method. Used only by the GPU version of ``hist`` tree method. - ``uniform``: Select random training instances uniformly. - ``gradient_based``: Select random training instances with higher probability when the gradient and hessian are larger. (cf. CatBoost)	None
	scale_pos_weight scale_pos_weight: typing.Optional[float] Balancing of positive and negative weights.	None
	subsample subsample: typing.Optional[float] Subsample ratio of the training instance.	0.9
	tree_method tree_method: typing.Optional[str] Specify which tree method to use. Default to auto. If this parameter is set to default, XGBoost will choose the most conservative option available. It's recommended to study this option from the parameters document :doc:`tree method `	None
	validate_parameters validate_parameters: typing.Optional[bool] Give warnings for unknown parameter.	None
	verbosity verbosity: typing.Optional[int] The degree of verbosity. Valid values are 0 (silent) - 3 (debug).	None