From Data to Bedside
One pathway from the datapoint to the bedside, worked two ways: building each rung soundly, and tracing whether a recommendation holds.
Every clinical recommendation is the end of a pathway. It begins with a raw datapoint as it was first defined and measured, runs up through the model that turned it into an estimate, the effect size and its uncertainty, the synthesis that treated it as evidence, and the decision rule, and arrives at the sentence a clinician acts on. A clinician works at the top of that pathway. A statistician usually works near the bottom. Whether the recommendation can bear the weight a clinician puts on it depends on links most readers of either kind never see together.
This page walks that one pathway. The map of rungs comes first, each rung holding the methods it rests on. A handful of decision tools then turn a rung into a choice you can make for your own question, and a set of traces shows the whole path walked on real recommendations and real data. Read a rung downward to appraise a recommendation that already exists, or upward to design a study soundly. It is the same path either way, and walking it in one direction sharpens the other.
Start here
Before the pathway, a fork: what kind of evidence does your question even need? One question rarely fits two approaches well, and picking the wrong one is the most expensive mistake on this whole page. Open the branch that matches your question to land on the approach, and the rungs it leans on.
Are you generating new evidence, or weighing what is already known?
Weighing the studies that already exist
Systematic review or meta-analysis. When the evidence already exists across studies, the job is to synthesize it soundly, not to run another. Pathway: meta-analysis and pooling · risk-of-bias appraisal · certainty of evidence.
Generating new evidence of my own
A causal question: does X change Y?
Yes, and randomizing is ethical and feasible
Randomized controlled trial. Randomization handles confounding by design, for the highest internal validity. Pathway: choosing the design · endpoint logic · sample size.
Yes, but randomization is off the table
Observational real-world evidence. Emulate the trial you cannot run, then lean on causal-inference methods to defend the contrast. Pathway: target-trial emulation · assembling the cohort · causal-inference toolkit.
A descriptive question: how common, or distributed how?
I need to collect new individual-level data
Primary survey. Design a probability sample so the result speaks for the population, not just the responders. Pathway: data sources · survey weighting · sample size.
Existing aggregate data will answer it
Public aggregate data (ACS, CDC). Published denominators and area-level rates give context and standardized rates fast, with the ecological fallacy kept in view. Pathway: data sources · descriptive epidemiology.
The pathway
The same rungs carry every trace, from framing a study up to the recommendation a clinician acts on, with a cross-cutting set of moves for defending a result. Open a rung to see its methods, then open any item inside to read its write-up. The theory is here, stripped of any one dataset; the decision tools below turn each rung into a choice for your own question, and the traces put the whole path on real data.
Descriptive epidemiology: person, place, time
Before explaining a health event, describe it. Who is affected (age, sex, race or ethnicity, comorbidity), where (geography, clinical setting, urban or rural), and when (a secular trend, seasonality, a birth-cohort effect, the shape of an epidemic curve). The triad is hypothesis-generating: a pattern in person, place, or time is what suggests the analytic question the next item makes precise. It also fixes the frequency measure you report, whether a prevalence, an incidence, or a rate, quantified with the measures of disease frequency in the Measurement rung. The same description read with statistical rather than epidemiologic eyes is what characterizing the distribution does on a single variable. Seen in the NHANES trace →
Research question (PICO / PECO)
A study is only as clear as the sentence it answers. PICO (population, intervention, comparator, outcome) and its observational cousin PECO force that sentence to be specific enough to act on: who, given what, compared to what, measured how, over what horizon. The framework flexes with context: PICOT appends a timeframe, the convention in clinical-question teaching, and PICOS appends the study design, the convention in systematic reviews. Time is the element that moves around, written as its own T in some hands and folded into the population in others, the way descriptive epidemiology bundles person, place, and time together as the who, where, and when under study. A question that cannot be written this way is not yet ready to design around, and bundling two questions into one is the most common way the specification quietly fails. Subscriber write-up: from research question to study design → Subscriber write-up: defining populations and sample sizes →
Target-trial emulation
When randomization is impossible, the cleanest discipline is to imagine the randomized trial you would have run, write down its protocol (eligibility, treatment assignment, start of follow-up, outcome), then build the observational analysis to match it. The framework makes implicit choices explicit, and it surfaces the biases an informal observational design tends to hide, such as immortal time and prevalent-user selection.
Endpoint logic and pre-registration
The primary endpoint is the one the sample size is built on and the headline claim is read against; everything else is secondary and labelled as such. Choosing it after seeing the data is the classic route to a result that will not replicate. Pre-registration on ClinicalTrials.gov or the Open Science Framework is the public commitment that keeps a confirmatory analysis confirmatory. Size it in the chooser →
Data sources and their tradeoffs
Where the data came from bounds every question it can answer, and each source carries a characteristic strength and a characteristic bias. Survey data (NHANES, BRFSS) is a probability sample built for population estimates, so it generalizes well once its weights and design are respected, though it is cross-sectional and self-reported in places. Electronic health record data is clinically rich, with labs, vitals, and notes, but it was recorded for care rather than research, so it is messy, captured only within one health system, and missing in informative ways. Claims data covers encounters and prescriptions broadly across a payer's population, but it is billing-driven, where a code is a bill, not a diagnosis, and clinical detail like lab values is thin. Publicly available aggregate data, such as the Census Bureau's American Community Survey, CDC WONDER, and vital statistics, supplies population denominators and area-level rates useful for context and standardization, but because it is aggregated it supports only ecological analysis and invites the ecological fallacy when a group-level association is read as an individual one. Registries sit in between, purpose-built for one disease and deep but narrow. The first question of any dataset is which of these it is, because that fixes what it can and cannot support.
Operationalizing the variable
Turning a clinical idea into data means writing a definition precise enough that two analysts produce the same cases. "Adults with diabetes" is not a definition; an age range, a diagnosis code or lab threshold, and an enrollment window together make one. Loose definitions invite reviewer questions and quietly change who the study is even about.
Measures of disease frequency
Counting how often disease occurs comes before comparing groups, and the count has standard forms. Prevalence is the share of a population that has the condition at a point in time (point prevalence) or over a window (period prevalence), and it reflects both how often the disease arises and how long it lasts. Incidence is the rate of new cases: cumulative incidence, the risk, is new cases over a fixed period divided by the population at risk, while the incidence rate is new cases divided by the person-time at risk, which handles varying follow-up. The two connect by a rule of thumb, prevalence is roughly incidence times average duration, so a rise in prevalence can mean more new cases or just longer survival. A crude rate is the unadjusted whole-population figure; comparing crude rates across populations with different age structures confounds the comparison, which age-standardization fixes, either directly by applying each population's rates to one standard population or indirectly through the standardized mortality or morbidity ratio of observed over expected. These are occurrence measures, distinct from the comparative effect measures a study later estimates.
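The crude-versus-standardized contrast above can be sketched in a few lines. This is a minimal illustration, not a production routine: the age bands, counts, and function names are all made up for the example, and the two populations are deliberately given identical age-specific rates so that only their age structure differs.

```python
# Hypothetical data: (deaths, person-years) by age band. All numbers are invented.
standard_pop = {"0-39": 60_000, "40-64": 30_000, "65+": 10_000}
pop_a = {"0-39": (20, 40_000), "40-64": (90, 30_000), "65+": (600, 30_000)}  # older
pop_b = {"0-39": (35, 70_000), "40-64": (75, 25_000), "65+": (100, 5_000)}   # younger

def crude_rate(pop):
    """All deaths over all person-years, ignoring age structure."""
    return sum(d for d, _ in pop.values()) / sum(py for _, py in pop.values())

def direct_standardized_rate(pop, standard):
    """Apply the population's own age-specific rates to one standard population."""
    total = sum(standard.values())
    return sum((d / py) * standard[age] / total for age, (d, py) in pop.items())

def smr(pop, reference):
    """Indirect method: observed deaths over those expected at the reference's rates."""
    observed = sum(d for d, _ in pop.values())
    expected = sum(reference[age][0] / reference[age][1] * py
                   for age, (_, py) in pop.items())
    return observed / expected

crude_a, crude_b = crude_rate(pop_a), crude_rate(pop_b)
dsr_a = direct_standardized_rate(pop_a, standard_pop)
dsr_b = direct_standardized_rate(pop_b, standard_pop)
smr_a_vs_b = smr(pop_a, pop_b)
```

Here the crude rate in the older population is more than three times the younger one, while the directly standardized rates coincide and the SMR is 1, because the age-specific rates were set equal by construction.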
Measurement error and misclassification
No instrument is exact, and the direction of the error matters. Non-differential misclassification, where the error in one variable is unrelated to the other (exposure error that does not depend on the outcome, for example), usually biases an effect toward the null; differential error can push it either way and is harder to reason about. The first question to ask about any estimate is how well the underlying quantity was actually measured.
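A small expected-count calculation shows the pull toward the null. It is a sketch under stated assumptions: a hypothetical case-control table, a binary exposure, and the same sensitivity and specificity applied to cases and controls (which is what makes the error non-differential). The counts and function names are illustrative.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

def misclassify(exposed, unexposed, sensitivity, specificity):
    """Expected observed counts when exposure is measured with error."""
    observed_exposed = exposed * sensitivity + unexposed * (1 - specificity)
    observed_unexposed = exposed * (1 - sensitivity) + unexposed * specificity
    return observed_exposed, observed_unexposed

# Hypothetical true counts: 200/300 exposed/unexposed cases, 100/400 controls.
true_or = odds_ratio(200, 300, 100, 400)

# Same error rates in cases and controls: non-differential.
case_exp, case_unexp = misclassify(200, 300, sensitivity=0.8, specificity=0.9)
ctrl_exp, ctrl_unexp = misclassify(100, 400, sensitivity=0.8, specificity=0.9)
observed_or = odds_ratio(case_exp, case_unexp, ctrl_exp, ctrl_unexp)
```

With these inputs the true odds ratio of about 2.67 is observed as roughly 1.94: still above 1, but attenuated.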
Complex-sample design and survey weighting
Surveys like NHANES are not simple random samples; they oversample some groups and cluster others by design. Ignoring the weights and the design structure gives biased estimates and confidence intervals that are too narrow. Design-aware analysis, using survey weights, strata, and primary sampling units, is what lets a sample speak for the population it was drawn to represent. Seen in the NHANES trace →
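The point-estimate half of that claim is easy to see on a toy sample. The sketch below assumes a made-up survey in which one group is oversampled (small weights) and another undersampled (large weights); it covers only the weighted estimate. Design-based standard errors also need the strata and primary sampling units, which this example does not attempt.

```python
def weighted_mean(values, weights):
    """Survey-weighted mean: each respondent counts for `weight` people."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical binary outcome for 10 respondents.
# First five come from an oversampled group (each represents 100 people),
# last five from an undersampled group (each represents 900 people).
outcome = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
weight = [100] * 5 + [900] * 5

unweighted = sum(outcome) / len(outcome)
weighted = weighted_mean(outcome, weight)
```

The unweighted prevalence is 0.50; the weighted one is 0.26, because the oversampled high-prevalence group stops counting for more than its share of the population.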
Missing data: MCAR, MAR, MNAR
Why a value is missing decides what you can do about it. Missing completely at random is benign; missing at random can be handled by multiple imputation conditional on the observed data; missing not at random, where the missingness depends on the unseen value itself, cannot be fixed by imputation alone and needs a pattern-mixture or tipping-point sensitivity approach. Treating all missingness as ignorable is a common and consequential shortcut.
Data standards and provenance
Before any analysis, a datapoint inherits structure from how it was recorded: CDISC SDTM and ADaM for trial data, or coding ontologies like ICD, HCPCS, RxNorm, and SNOMED for claims and records. Knowing what a code does and does not capture, since a billing code is not a diagnosis, is half of real-world-data competence.
Assembling the analytic cohort
A research database is not an analysis dataset. The work between them is extract-transform-load: pull from the source tables, derive the study variables, and assemble one analysis-ready table. Two decisions govern it. First, the grain: one row per patient for a time-fixed cohort, one row per patient-interval when exposure or covariates change over time (the person-time format Cox and Poisson read), one row per unit-period for a panel design. Second, and this is where observational studies are won or lost, the time structure. Fix a single index date (time zero) for each patient and align three things at it, exactly as target-trial emulation prescribes: eligibility is assessed, exposure is assigned, and follow-up starts, all at T0. Measure confounders only in a pre-index lookback window, so you adjust for baseline causes rather than post-exposure mediators or colliders; ascertain the outcome only after T0; and apply a washout window for a new-user design so prevalent users do not contaminate the comparison. Get the time alignment wrong and you manufacture immortal time before a single model runs. The minimum variable set: a patient identifier for clustering, the index date, the exposure, the outcome with its event date, follow-up start and end with a censoring indicator, and the baseline covariates, each stamped with the window it was measured in.
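The time alignment described above can be sketched for a single patient. This is a simplified illustration, not a full cohort builder: the 365-day lookback, the field names, and the `build_row` helper are all assumptions, and a real study would also handle eligibility, washout, and patients with the outcome before the index date (here such events are simply ignored).

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 365  # assumed pre-index window for baseline covariates

def build_row(patient_id, index_date, covariate_events, outcome_dates, end_of_data):
    """Assemble one analysis row with everything aligned at the index date (T0)."""
    window_start = index_date - timedelta(days=LOOKBACK_DAYS)
    # Baseline covariates: only records inside [T0 - lookback, T0).
    baseline = sorted({name for name, d in covariate_events
                       if window_start <= d < index_date})
    # Outcome: only events strictly after T0 and within the data window.
    post_index = sorted(d for d in outcome_dates if index_date < d <= end_of_data)
    event_date = post_index[0] if post_index else None
    follow_up_end = event_date or end_of_data
    return {
        "patient_id": patient_id,
        "index_date": index_date,
        "baseline_covariates": baseline,
        "event": int(event_date is not None),       # 0 means censored
        "follow_up_days": (follow_up_end - index_date).days,
    }

row = build_row(
    "p001",
    index_date=date(2020, 3, 1),
    covariate_events=[("hypertension", date(2019, 11, 5)),
                      ("statin_use", date(2020, 4, 10))],   # post-index: excluded
    outcome_dates=[date(2020, 2, 1), date(2020, 9, 1)],      # pre-index one ignored
    end_of_data=date(2020, 12, 31),
)
```

The post-index statin record never reaches the baseline covariates, and follow-up is counted from T0 to the first post-index event, so no pre-index person-time is credited to the exposure.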
Assembling a clinical trial dataset
A clinical trial assembles its dataset the opposite way from a real-world cohort. The protocol fixes eligibility, randomized assignment, and the start of follow-up by design, so the bias-prevention that observational assembly engineers by hand is handled up front, and the work that remains is standardization and traceability. Raw case-report-form data is first mapped to CDISC SDTM, one domain per kind of data (DM for demographics, AE for adverse events, LB for labs, VS for vitals, EX for exposure), a tabulation that mirrors what was collected. From SDTM, analysis datasets are derived in ADaM: a subject-level ADSL with one row per participant, and basic-data-structure datasets carrying the derived endpoints the statistical analysis plan calls for. The governing rule is traceability, since every analysis value has to trace back through ADaM to its SDTM source and the original CRF, so a reviewer can follow any number in the results to the record it came from. This is the pipeline the CDISC pilot trace double-programs in SAS and R.
Measurement-method effects
Two devices or protocols measuring the "same" quantity can disagree systematically. Automated, rested, averaged office blood pressure reads lower than a single manual cuff, so a threshold validated under one method does not transfer cleanly to the other. When a guideline number and a clinic number were produced differently, they are not on the same scale. Seen in the 120 mmHg trace →
Characterizing the distribution
Before any model, look at what you measured. The shape, read from a histogram or density and summarized by skewness and kurtosis, tells you whether a mean is even the right summary. Center and spread are read with the median and IQR, which survive the outliers that distort the mean and SD. And the shape of a relationship is read from a scatter with a LOESS smoother before you assume it is linear. Exploratory analysis is where you catch the heavy tail, the floor-or-ceiling effect, and the nonlinearity that decide which model and which summary are honest, rather than meeting them in a reviewer's question. What it surfaces is what the Model rung's robust statistics and regression families are there to handle. Pick the summary and plot in the chooser → Seen in the Medicaid trace →
Choosing the estimand
Name the target before the method. The average treatment effect (ATE), the effect on the treated (ATT), and the local effect among compliers (LATE) are different quantities that can have different magnitudes and different policy meanings. Reporting an estimate without saying which one it is invites misreading.
Causal diagrams (DAGs) and conceptual frameworks
A conceptual framework is a picture of what causes what, and its formal version is a directed acyclic graph: variables as nodes, assumed causal effects as arrows, no cycles allowed. Drawing it is what turns "what might confound this" into a decision you can defend, because the graph sorts every covariate into a confounder (a common cause of exposure and outcome, which you adjust for), a mediator (on the causal path, which you leave alone when the total effect is the target), or a collider (a common effect, where adjusting actively opens bias rather than removing it). The back-door criterion then reads the adjustment set straight off the graph. What a DAG cannot do is prove itself: it encodes assumptions, and the arrow you left out, the unmeasured common cause, is exactly the identifying assumption the design then has to defend. Sort your variables in the chooser →
Causal designs without randomization
Difference-in-differences, regression discontinuity, instrumental variables, synthetic control, and propensity-score methods each neutralize a specific dominant threat to causal inference. The craft is matching the design to the threat that actually endangers your question, not reaching for the most familiar tool. If no design fits, you may have an associational study, which is worth knowing before you claim otherwise. Subscriber write-up: causal inference, a methods toolkit → Seen in the Part D trace →
Identifying assumptions
Every causal design rests on a claim the data cannot verify: parallel trends for difference-in-differences, continuity at the cutoff for regression discontinuity, the exclusion restriction for instrumental variables, conditional independence for propensity scores. Writing that assumption as one sentence is half the work of defending it, because it tells you exactly what a skeptic will attack.
Regression families
The outcome dictates the model. Most of these are one family underneath: a generalized linear model is a choice of outcome distribution plus a link function, and that pairing is what makes "the outcome dictates the model" a principle rather than a lookup table. Continuous outcomes call for linear regression, returning a mean difference; binary outcomes for logistic, returning an odds ratio; counts for Poisson or negative binomial, returning a rate ratio, with the choice between them set by overdispersion, since Poisson assumes the variance equals the mean and real count data usually exceed it. Time-to-event outcomes call for Cox or parametric survival, returning a hazard ratio; ordered categories for ordinal models; clustered or repeated measures for mixed-effects or GEE. Each family emits its own kind of estimate, so the family you fit decides which effect measure you are even reporting. Forcing the wrong one produces estimates that are precise and wrong. Walk it as a chooser →
Checking model assumptions
Distinct from the identifying assumption a causal design rests on, every regression carries statistical assumptions that are checkable with diagnostics, and a model whose assumptions fail produces standard errors and p-values that cannot be trusted. A few checks run on the raw data before you fit: the shape and relationship of variables (a LOESS smoother shows nonlinearity before you commit to a linear term, building on characterizing the distribution) and multicollinearity among predictors (the variance inflation factor, or the condition number). Most checks are necessarily post-fit, because they read the residuals: heteroscedasticity from a residual-versus-fitted plot, confirmed if needed with Breusch-Pagan (Koenker's studentized version) or White; non-normal residuals from a QQ plot; remaining nonlinearity from partial-residual plots or Ramsey's RESET; non-independence from Durbin-Watson or Breusch-Godfrey; influential points from Cook's distance. Cox models add proportional hazards, tested with scaled Schoenfeld residuals. One caution: formal normality and heteroscedasticity tests are underpowered at small n and oversensitive at large n, so they make poor gates. The modern default for non-constant variance is to skip the test and use heteroscedasticity-robust (sandwich) standard errors outright. The discipline is the remedy, not the test: a failed check sends you to a transformation, cluster-robust standard errors, or a different family, not to reporting the broken fit anyway. Match the checks to your model in the chooser →
Robust statistics for heavy tails
Means and standard deviations are fragile when the data have heavy tails or outliers, as cost and utilization data almost always do. Median-based summaries and MAD-scaled z-scores resist a few extreme points that would otherwise dominate, which matters when a handful of providers or patients drive the totals.
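A short comparison makes the fragility concrete. The cost values below are invented, and the 1.4826 scaling factor is the usual constant that makes the MAD comparable to a standard deviation under normality; a real implementation would also guard against a MAD of zero.

```python
from statistics import mean, median, stdev

def mad_z_scores(values):
    """Robust z-scores: distance from the median in MAD units."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [(v - med) / (1.4826 * mad) for v in values]

# Hypothetical per-patient costs with one extreme value.
costs = [120, 135, 150, 160, 175, 180, 210, 9_500]

classical_z = [(v - mean(costs)) / stdev(costs) for v in costs]
robust_z = mad_z_scores(costs)
```

The extreme value inflates the mean and SD so much that its own classical z-score stays under 3, while its MAD-scaled score is in the hundreds.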
Multiplicity control
Test enough hypotheses and some will look significant by chance. Controlling the false-discovery rate, through Benjamini-Hochberg and its relatives, keeps the expected share of false positives among the flagged results in check across many comparisons, a routine necessity in screening and any flagging exercise. The catch is that these procedures rest on assumptions, such as independent or positively dependent tests and p-values that are uniform under the null, that are not always true.
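The Benjamini-Hochberg step-up rule is short enough to write out. The sketch below uses a made-up list of p-values and a hypothetical helper name; it implements the basic procedure only, not the variants for arbitrary dependence.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return booleans marking which hypotheses are rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= cutoff_rank
    return rejected

# Hypothetical p-values from ten comparisons.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive_hits = sum(pv < 0.05 for pv in p)
bh_hits = sum(benjamini_hochberg(p, q=0.05))
```

Five of these p-values fall under 0.05 on their own, but only two survive the procedure at a 5 percent false-discovery rate.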
Prediction and machine learning
When the goal is to predict rather than to explain, flexible models like gradient boosting and regularized regression, with interpretability tools such as SHAP, earn their place. A predictive model can be useful without being causal, but it has to be judged on the right terms: out-of-sample performance and calibration, not the plausibility of its coefficients.
Effect measures
The same result reads differently depending on the scale. Risk ratios and odds ratios are relative; risk differences and number-needed-to-treat are absolute; hazard ratios are relative on the rate scale. Odds ratios in particular are routinely misread as risk ratios when the outcome is common, which overstates the effect. Getting the measure and its interpretation right is the precondition for the reporting choice the next item turns on.
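Computing all four measures from the same two-arm counts shows how the scales diverge. The counts are invented; the two scenarios share a risk ratio of 0.5 but differ in how common the outcome is.

```python
def effect_measures(events_treated, n_treated, events_control, n_control):
    """Relative and absolute effect measures from two-arm event counts."""
    risk_t = events_treated / n_treated
    risk_c = events_control / n_control
    risk_difference = risk_t - risk_c
    return {
        "risk_ratio": risk_t / risk_c,
        "odds_ratio": (risk_t / (1 - risk_t)) / (risk_c / (1 - risk_c)),
        "risk_difference": risk_difference,
        "nnt": 1 / abs(risk_difference) if risk_difference else float("inf"),
    }

rare = effect_measures(10, 1000, 20, 1000)       # risks of 1% versus 2%
common = effect_measures(300, 1000, 600, 1000)   # risks of 30% versus 60%
```

With the rare outcome the odds ratio (about 0.495) is nearly the risk ratio, and the halving of risk corresponds to a number-needed-to-treat of 100. With the common outcome the risk ratio is still 0.5 but the odds ratio is about 0.29, which is the overstatement described above if it is read as a risk ratio.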
Uncertainty and inference
A point estimate without its uncertainty is half a result. Confidence intervals show the range compatible with the data, and when observations cluster, such as patients within hospitals or repeated measures within a person, standard errors have to account for it or they will be too small. A tight interval around a biased estimate is false comfort.
Relative versus absolute
Given the measures above, which scale you lead with is a communication choice with stakes. A 50 percent relative reduction sounds dramatic and can still be a move from 2 percent to 1 percent. Patients and decisions live in the absolute scale, where the number-needed-to-treat makes the size of the benefit concrete and the baseline risk it depends on becomes visible. Reporting only the relative effect is the most common way a modest result is made to sound large.
Hazard ratios and non-proportional hazards
A single hazard ratio assumes the treatment's effect on instantaneous risk is constant over time. When it is not, through delayed effects, waning benefit, or crossing curves, the reported ratio becomes a weighted average that depends on the censoring pattern rather than the clinical story. Restricted mean survival time gives a number that survives this and that a patient can actually use. Dispatch →
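Restricted mean survival time is the area under the survival curve up to a chosen horizon, which can be sketched directly from a Kaplan-Meier estimate. The follow-up times, the horizon `tau`, and the helper name are assumptions for illustration; a real analysis would use a survival package and report a confidence interval.

```python
def km_rmst(times, events, tau):
    """Area under the Kaplan-Meier curve from 0 to tau (events: 1=event, 0=censored)."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival, last_time, area = 1.0, 0.0, 0.0
    i = 0
    while i < len(data) and data[i][0] <= tau:
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        leaving = sum(1 for tt, _ in data if tt == t)   # events plus censored at t
        area += survival * (t - last_time)              # rectangle up to this time
        if deaths:
            survival *= 1 - deaths / n_at_risk          # Kaplan-Meier step
        n_at_risk -= leaving
        last_time = t
        i += leaving
    return area + survival * (tau - last_time)          # final rectangle up to tau

# Hypothetical follow-up in months for two small arms.
rmst_a = km_rmst([2, 4, 4, 6, 8, 10], [1, 1, 0, 1, 0, 1], tau=8)
rmst_b = km_rmst([1, 3, 5, 7, 9, 10], [1, 1, 1, 1, 1, 1], tau=8)
rmst_difference = rmst_a - rmst_b
```

The difference reads as months of event-free time gained over the first eight months, a statement that does not rely on the hazards being proportional.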
Calibration versus discrimination
Discrimination, measured by the AUC, asks whether a model ranks higher-risk patients above lower-risk ones; calibration asks whether its predicted risks match observed rates. A model can discriminate well and still be badly miscalibrated, which is the failure that matters when a number drives a bedside decision. Whether a model ports to a new setting depends largely on its calibration there, and calibration is the more often neglected of the two.
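The two properties can come apart completely, as a toy example shows. The outcomes and predicted risks below are invented, and calibration is checked only in its crudest form (mean predicted risk against the observed event rate); a fuller check would look at a calibration curve.

```python
def auc(y_true, scores):
    """Chance that a random event is scored above a random non-event (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def calibration_in_the_large(y_true, scores):
    """Mean predicted risk minus observed event rate; 0 is ideal."""
    return sum(scores) / len(scores) - sum(y_true) / len(y_true)

# Hypothetical outcomes and predicted risks.
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
predicted = [0.05, 0.10, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.75]
# A monotone distortion keeps the ranking but inflates every risk.
inflated = [p ** 0.5 for p in predicted]

auc_original, auc_inflated = auc(y, predicted), auc(y, inflated)
gap_original = calibration_in_the_large(y, predicted)
gap_inflated = calibration_in_the_large(y, inflated)
```

Both sets of predictions have the same AUC, since the ranking is untouched, but the inflated risks overshoot the observed event rate by about 21 percentage points.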
Model fit and prediction error
The continuous-outcome counterpart to calibration and discrimination, and the one most often misread. R² is the share of outcome variance a model explains, but it climbs mechanically as predictors are added, so use adjusted R² or an out-of-sample R², and never read a high R² as proof the model is correct or unbiased. For prediction, the honest number is error on held-out data: the mean squared error and its root RMSE, in the outcome's own units and punishing large misses hardest, or the mean absolute error when a few big errors shouldn't dominate. In-sample fit always flatters; the figure that generalizes is the one computed on data the model never saw.
Meta-analysis and pooling
Combining studies can sharpen an estimate or average away a real difference, depending on whether the studies are estimating the same thing. Heterogeneity statistics, and a careful look at why studies differ, decide whether a pooled number is informative or a fiction. A tight pooled estimate over heterogeneous studies is a warning, not a reassurance.
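A fixed-effect inverse-variance pool with Cochran's Q and I² shows how a tight pooled interval and heavy heterogeneity can coexist. The four log risk ratios and standard errors are made up, and the fixed-effect model is used only because it is the simplest to write down, not as a recommendation for heterogeneous data.

```python
from math import sqrt

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance pooled estimate with Cochran's Q and I-squared."""
    weights = [1 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i_squared

# Hypothetical log risk ratios: two studies show benefit, two show none.
log_rr = [-0.60, -0.05, -0.55, 0.10]
se = [0.10, 0.10, 0.12, 0.12]

pooled, pooled_se, q, i_squared = pool_fixed_effect(log_rr, se)
```

The pooled standard error is about half that of any single study, yet I² is around 90 percent: the precision describes an average of studies that disagree, which is the warning sign described above.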
Risk-of-bias appraisal
Not all evidence deserves equal weight. Structured tools, RoB 2 for randomized trials and ROBINS-I for non-randomized studies of interventions, judge how a study's design and conduct threaten its result domain by domain, rather than relying on a gestalt impression. The same logic can be adapted to newer evidence sources. A framework in this vein →
Certainty of evidence (GRADE)
GRADE rates how much confidence a body of evidence warrants, separately from the size of the effect, downgrading for risk of bias, inconsistency, indirectness, imprecision, and publication bias. A large effect from low-certainty evidence and a small effect from high-certainty evidence are different things, and conflating them is a frequent error in how findings get reported.
Generalizability and transportability
An effect estimated in one population does not automatically apply to another. Generalizability asks whether the study sample represents the target; transportability formalizes when and how an estimate can be carried to a different population. The honest default is the narrower claim, with extrapolation argued rather than assumed.
Immortal time bias
When a span of follow-up during which the outcome could not have occurred is mistakenly assigned to the treated group, it manufactures a survival advantage out of bookkeeping. It is a recurring trap in observational drug studies, and a good example of why the start of follow-up has to be defined as carefully as the exposure itself.
Thresholds and cut points
Turning a continuous risk or measurement into a yes/no action is convenient and lossy. A cutoff treats a patient just below and just above as categorically different when they are nearly identical, and the choice of where to cut encodes a value judgment about the costs of acting versus waiting.
Risk calculators and prediction tools
A calculator packages a model into something usable at the bedside, but it carries its development population with it. Applied to patients who differ from that population, a well-built tool can be systematically off, which is why external validation and recalibration matter more than the elegance of the original model.
Operating characteristics
Sensitivity and specificity describe a test in the abstract; predictive values describe what a result means for the patient in front of you, and they shift with prevalence. The same test that is reassuring in a high-prevalence clinic can generate mostly false positives in a low-prevalence screening setting.
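Bayes' rule makes the prevalence dependence explicit. The sensitivity, specificity, and prevalence values below are illustrative, not taken from any particular test.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive value at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Same hypothetical test (90% sensitive, 90% specific) in two settings.
ppv_clinic, npv_clinic = predictive_values(0.90, 0.90, prevalence=0.30)
ppv_screen, npv_screen = predictive_values(0.90, 0.90, prevalence=0.01)
```

At 30 percent prevalence about 79 percent of positives are true; at 1 percent prevalence only about 8 percent are, even though the test itself has not changed.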
Decision-curve analysis
A test or model is only worth using if acting on it does more good than harm across the range of thresholds a clinician might reasonably hold. Decision-curve analysis weighs those trade-offs directly in terms of net benefit, going beyond accuracy metrics that ignore the consequences of acting.
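Net benefit at a single threshold is a one-line formula: true positives per patient, minus false positives per patient weighted by the odds of the threshold. The sketch below evaluates it for a made-up set of outcomes and predicted risks against the treat-all strategy (treat-none has net benefit zero by definition); a full decision curve repeats this across a range of thresholds.

```python
def net_benefit(y_true, risks, threshold):
    """Net benefit of treating everyone whose predicted risk is at or above threshold."""
    n = len(y_true)
    true_pos = sum(1 for y, r in zip(y_true, risks) if r >= threshold and y == 1)
    false_pos = sum(1 for y, r in zip(y_true, risks) if r >= threshold and y == 0)
    return true_pos / n - false_pos / n * threshold / (1 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Net benefit of treating every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)

# Hypothetical outcomes and model-predicted risks.
y = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
risks = [0.05, 0.10, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.75]

nb_model = net_benefit(y, risks, threshold=0.25)
nb_all = net_benefit_treat_all(y, threshold=0.25)
```

At a 25 percent threshold the model's net benefit (about 0.23) beats treating everyone (about 0.07), which in turn beats treating no one.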
Cost-effectiveness
When a decision rule has to account for resources, not just outcomes, incremental cost-effectiveness ratios and Markov models put benefit and cost on the same page. The output is sensitive to assumptions about the time horizon, discounting, and utilities, so the headline ratio means little without the sensitivity analysis behind it.
Strength of recommendation
Guideline bodies signal how firmly they are willing to speak: ACC/AHA's class and level of evidence, or GRADE's split between strong and conditional recommendations. The strength is a claim about confidence, and it should track the evidence. A confident recommendation on thin evidence is exactly the mismatch worth noticing.
Evidence-to-decision
Moving from a body of evidence to a recommendation is not automatic. Frameworks make the step explicit, weighing benefits and harms alongside values, feasibility, equity, and cost. Two panels reading the same evidence can land on different recommendations because these other inputs differ, which is legitimate but should be visible rather than buried.
The evidence–recommendation gap
The most useful thing a statistically literate reader can do with a guideline is ask which rung its confidence is actually resting on. A recommendation can be worded firmly while leaning on an extrapolated threshold, a single trial, or a measurement method the clinic does not reproduce. The gap between the phrasing and the support is where appraisal earns its keep. Seen in the 120 mmHg trace → and in the tenecteplase trace →
Reporting standards
Checklists like CONSORT for trials, STROBE for observational studies, PRISMA for systematic reviews, and TRIPOD for prediction models make a study's methods auditable by requiring the details that let a reader judge it. They are not bureaucracy; they are the difference between a result you can interrogate and one you have to take on faith.
Sensitivity analysis
A result is only as trustworthy as it is robust to the choices behind it. Pre-specified sensitivity analyses deliberately vary the assumptions most likely to be challenged and report what happens, which is far more credible than the same analyses run only after a reviewer asks. Sensitivity analyses designed in are a strength; ones bolted on afterward are a tell. Subscriber write-up: sensitivity analysis and robustness →
Bias quantification
Rather than asserting that unmeasured confounding is unlikely, quantify it. The E-value asks how strong a hidden confounder would have to be to explain away the result; Rosenbaum bounds do the analogous job for matched designs. A finding that survives a large E-value is sturdier than one a modest confounder could erase.
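For a risk ratio, the E-value of a point estimate has a closed form, RR + sqrt(RR × (RR − 1)), applied to the reciprocal when the estimate is below 1. A minimal sketch, with a hypothetical helper name, covering the point estimate only (the version for a confidence limit is not shown):

```python
from math import sqrt

def e_value(risk_ratio):
    """E-value for a risk-ratio point estimate."""
    rr = 1 / risk_ratio if risk_ratio < 1 else risk_ratio
    return rr + sqrt(rr * (rr - 1))

strong = e_value(2.0)        # about 3.41
modest = e_value(1.2)        # about 1.69
protective = e_value(0.5)    # same as for 2.0
```

An observed risk ratio of 2.0 would need a confounder associated with both exposure and outcome by a risk ratio of about 3.4 to be fully explained away; an estimate of 1.2 needs only about 1.7.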
Placebo and falsification tests
A good way to test a design is to look for an effect where none should exist: a pre-treatment period, an outcome the intervention cannot plausibly affect, an untreated group. If the design finds an effect there, something is wrong with it. Real falsification tests try to break the result rather than to confirm it.
Leave-one-out and specification curves
If dropping a single site, year, or cohort overturns the finding, the result rests on that one unit rather than on the effect. Leave-one-out re-estimation exposes that fragility, and specification-curve analysis does the same across the many defensible modeling choices, showing whether the conclusion holds broadly or only along one path.
Monte Carlo simulation
When closed-form theory does not fit the design, you can generate data under a known process, run the planned analysis, and watch how it behaves over many replicates. Simulation answers how much bias an estimator carries, whether its confidence intervals cover at the stated rate, and how much sample size a non-standard design actually needs. Subscriber write-up: Monte Carlo simulation →
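A coverage check is the simplest version of this loop. The sketch below assumes a normal data-generating process and the textbook normal-approximation interval for a mean; the sample sizes, replicate count, and seed are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def ci_coverage(n, true_mean=0.0, true_sd=1.0, replicates=2000, seed=1):
    """Share of replicates whose 95% z-interval for the mean contains the truth."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)
    hits = 0
    for _ in range(replicates):
        # 1. Generate data under the known process.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        # 2. Run the planned analysis.
        estimate = mean(sample)
        half_width = z * stdev(sample) / sqrt(n)
        # 3. Record whether the interval covered the known truth.
        hits += estimate - half_width <= true_mean <= estimate + half_width
    return hits / replicates

coverage_n50 = ci_coverage(n=50)
coverage_n5 = ci_coverage(n=5)
```

With 50 observations the interval covers close to its nominal 95 percent; with 5 it falls noticeably short, because the z-multiplier ignores the extra uncertainty in the estimated SD. The same loop, with the analysis step swapped out, answers bias and power questions.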
Decision tools
Four decisions the pathway asks of you, in rung order, each one routing your specific question into the methods above and on to the trace that shows it.
Sizing the study
How many you need depends on the design and the outcome, and one input usually dominates the answer. This is the Framing rung’s sample-size calculation as a routing decision: the formula that fits your design, the quantity that drives N, and the package that computes it. The full calculation with worked scenarios is in the populations and sample size write-up.
Are you choosing N for a planned study, or finding the power of a sample you already have?
I already have a fixed sample (registry, natural experiment, public-use file)
Reverse the calculation. The sample is fixed, so instead of solving for N you solve for the minimum detectable effect at that N and your target power, and report it as a power analysis in the methods. The reasoning is the same, the direction is flipped. Pathway: populations and sample size.
I am choosing N for a planned study
Two groups, a continuous outcome
Two-sample t-test formula. N is driven by the standardized effect size, the mean difference divided by the SD. R: pwr::pwr.t.test. Pathway: populations and sample size.
Two groups, a binary outcome (proportions)
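Chi-squared test with continuity correction, or Fisher's exact test. N is driven by the control-group event rate and the difference you want to detect. R: pwr::pwr.2p.test, which works from the arcsine-transformed effect size h and applies no continuity correction, so treat its N as a lower figure. Pathway: populations and sample size · effect measures.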
A time-to-event outcome
Freedman, Schoenfeld, or Lakatos (log-rank). N is driven by the number of events and the hazard ratio, not follow-up length alone, so you power for events and then work back to enrollment. R: powerSurvEpi or gsDesign. Pathway: populations and sample size · hazard ratios and survival.
A regression coefficient, adjusting for covariates
Observations-per-predictor and events-per-variable rules. For linear regression, Harrell's 10 to 20 observations per predictor; for logistic, the events-per-variable rule or the Hsieh formula. The binding constraint is the partial effect of your predictor net of the others, and for logistic it is the event count, not the total N. Pathway: populations and sample size.
Cluster or group randomization
Inflate the simple-RCT N by the design effect. The intracluster correlation and the cluster size decide how much larger the trial has to be. R: clusterPower. Pathway: populations and sample size.
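A short sketch of the inflation, using the standard design effect 1 + (m - 1) × ICC for equal cluster sizes. The function name and the inputs (64 per arm from an individually randomized calculation, clusters of 20, ICC of 0.05) are hypothetical; clusterPower covers unequal cluster sizes and other refinements.

```python
import math

def design_effect(cluster_size, icc):
    """Variance inflation from randomizing clusters of equal size."""
    return 1 + (cluster_size - 1) * icc

# Hypothetical inputs.
n_per_arm_individual = 64
cluster_size = 20
icc = 0.05

deff = design_effect(cluster_size, icc)
n_per_arm_cluster = math.ceil(n_per_arm_individual * deff)
clusters_per_arm = math.ceil(n_per_arm_individual * deff / cluster_size)
print(f"design effect {deff:.2f}: {n_per_arm_cluster} per arm, {clusters_per_arm} clusters per arm")
```

Even a modest ICC nearly doubles the trial here, because the design effect grows with cluster size as well as with the correlation.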
A non-standard design (multilevel, longitudinal with dropout, adaptive, composite)
Simulate it. When no closed form fits, generate data under the assumed design and effect, run the planned analysis on each replicate, and find the N at which the share of replicates detecting the effect reaches the target power. R: simr. Pathway: Monte Carlo simulation · populations and sample size.
Whichever design: how sure are the inputs?
Compute N under three scenarios. Sample size is exquisitely sensitive to the effect-size and variability assumptions, so report an optimistic, a best-estimate, and a conservative N rather than a single number; the conservative one usually goes in the protocol. Pathway: populations and sample size · sensitivity analysis.
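The three-scenario report can be sketched as a small table, here for a two-group comparison of means under a normal approximation. The scenario values for the mean difference and SD are invented for illustration, and `n_per_group` is a hypothetical helper, not a function from any package.

```python
import math
from statistics import NormalDist

def n_per_group(mean_diff, sd, alpha=0.05, power=0.80):
    """Per-group N for a two-sided, two-sample comparison of means (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * z ** 2 / (mean_diff / sd) ** 2)

# Hypothetical assumptions: (expected mean difference, outcome SD).
scenarios = {
    "optimistic": (6.0, 10.0),
    "best estimate": (5.0, 10.0),
    "conservative": (4.0, 12.0),
}

n_by_scenario = {name: n_per_group(diff, sd) for name, (diff, sd) in scenarios.items()}
for name, n in n_by_scenario.items():
    print(f"{name:>13}: {n} per group")
```

In this made-up case the conservative N is more than three times the optimistic one, which is the spread a single reported number would conceal.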
Characterizing your data
Before any model, describe what you measured, and the right description depends on the variable’s type. This is the Measurement rung’s exploratory work made concrete: the summary, the plot, and the distribution check that fit each kind of data. Open the branch that matches what you are looking at.
Are you characterizing one variable, or a relationship between two?
One variable, by its type
Continuous (a measurement)
Shape, then center and spread. A histogram or density shows the shape, summarized by skewness and kurtosis; the median and IQR give center and spread while resisting the outliers that distort the mean and SD; a QQ plot lets you eyeball normality, read as a picture rather than a pass-or-fail test. Watch for heavy tails, floor or ceiling effects, and multimodality. Pathway: characterizing the distribution · robust statistics.
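A quick illustration of why the median and IQR are the safer first summary. The ten measurements and the single outlying value are made up for the example, and `summarize` is a hypothetical helper.

```python
import statistics

def summarize(values):
    """Mean/SD next to median/IQR for one continuous variable."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Made-up measurements, then the same data with one implausible value added.
clean = [4.1, 4.4, 4.6, 4.8, 5.0, 5.1, 5.3, 5.5, 5.8, 6.0]
contaminated = clean + [48.0]

clean_summary = summarize(clean)
contaminated_summary = summarize(contaminated)
for label, s in (("clean", clean_summary), ("with outlier", contaminated_summary)):
    print(f"{label:>12}: mean {s['mean']:.2f}, SD {s['sd']:.2f}, "
          f"median {s['median']:.2f}, IQR {s['iqr']:.2f}")
```

One bad value drags the mean and inflates the SD many times over, while the median and IQR barely move.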
Binary or categorical
Counts and proportions. A frequency table and a bar chart; for a binary outcome the event rate is the summary that matters, and in a cross-sectional sample it doubles as a prevalence. Pathway: characterizing the distribution · measures of disease frequency · effect measures.
A count or a rate
Skew, then mean versus variance. The distribution is usually right-skewed; compare the mean and the variance, because a variance well above the mean (overdispersion) is what later sends you from Poisson to negative binomial, and check for excess zeros. Pathway: characterizing the distribution · measures of disease frequency.
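The mean-versus-variance check takes only a few lines. The counts below are invented for illustration; a dispersion ratio near 1 is what a Poisson model expects, and the expected share of zeros under Poisson is exp(-mean).

```python
import math
import statistics

# Made-up counts, e.g. admissions per patient over a year.
counts = [0, 0, 0, 1, 0, 2, 0, 0, 7, 1, 0, 3, 0, 12, 0, 1, 0, 0, 5, 2]

mean_count = statistics.mean(counts)
variance = statistics.variance(counts)          # sample variance
dispersion_ratio = variance / mean_count        # about 1 under Poisson

observed_zero_share = counts.count(0) / len(counts)
poisson_zero_share = math.exp(-mean_count)      # zeros a Poisson with this mean would produce

print(f"mean {mean_count:.2f}, variance {variance:.2f}, ratio {dispersion_ratio:.2f}")
print(f"zeros: observed {observed_zero_share:.2f}, Poisson-expected {poisson_zero_share:.2f}")
```

Here the variance is several times the mean and the zeros far exceed the Poisson expectation, the descriptive pattern that would later point toward a negative binomial or zero-inflated model.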
Time-to-event, with censoring
A Kaplan-Meier curve and the median survival, with the censoring pattern described. The mean cannot be estimated without extrapolation when the longest follow-up times are censored, so the median, or a restricted mean, is the summary to report. Pathway: characterizing the distribution · hazard ratios and survival.
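For readers who want to see what the curve is computing, here is a bare-bones product-limit estimator. It is a teaching sketch with made-up follow-up times; the function names are hypothetical, and a real analysis would use an established survival package with confidence bands.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct event time. events: 1 = event, 0 = censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group everyone who leaves the risk set at time t (events and censorings).
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

def median_survival(curve):
    """First time the curve reaches 0.5 or below; None if the median is not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Made-up follow-up in months.
times = [2, 3, 3, 5, 6, 8, 9, 12, 12, 15]
events = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
median = median_survival(curve)
print(curve)
print("median survival:", median)
```

The curve never reaches zero because the longest follow-up is censored, which is exactly why the median can be read off while the mean cannot.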
A relationship between two variables
Continuous against continuous
A scatter with a LOESS smoother to see the shape before assuming it is linear, plus a Pearson correlation if the relationship is linear or Spearman if it is monotone but curved. Pathway: characterizing the distribution · checking model assumptions.
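The Pearson-versus-Spearman distinction shows up clearly on a relationship that is perfectly monotone but strongly curved. The data are synthetic, the helper names are hypothetical, and the simple ranking function assumes no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation computed from sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n; assumes no ties, which holds for the data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation: Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))

# Synthetic data: y rises with x every step, but exponentially.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson {r_pearson:.2f}, Spearman {r_spearman:.2f}")
```

Spearman reports a perfect monotone relationship while Pearson understates it, so a gap between the two is itself a hint that the shape is not linear.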
Continuous against a group
Boxplots or violins by group, with the group medians, which shows differences in location and spread without assuming normality. Pathway: characterizing the distribution.
Category against category
A cross-tabulation of counts with row or column proportions, and a mosaic plot when the table is large. Pathway: characterizing the distribution · effect measures.
How much is missing, and why?
Quantify missingness per variable and inspect its pattern before deciding how to handle it, since whether the data are missing completely at random, at random, or not at random decides what is safe to do. Pathway: missing data: MCAR, MAR, MNAR.
Choosing what to adjust for
A Model-rung decision. Which variables go in the model is a causal question before it is a statistical one. For an effect estimate, the answer is not whatever improves fit; it is whatever the causal diagram says blocks confounding without opening new bias. Route each candidate variable by the role it plays.
Are you selecting variables to estimate an effect, or to predict?
To predict an outcome
This is a prediction problem, not a causal one. Select variables for out-of-sample performance, through regularization such as LASSO or ridge or a domain-driven feature set, and judge the result on held-out error rather than on causal roles. A DAG should not drive it. Pathway: prediction and machine learning · model fit and prediction error.
To estimate an effect: let the DAG sort each variable
A common cause of both exposure and outcome (confounder)
Adjust for it. This is the back-door path you have to block; leaving a confounder out biases the effect estimate. Pathway: causal diagrams · identifying assumptions.
On the causal path from exposure to outcome (mediator)
Do not adjust, if the total effect is the target. Conditioning on a mediator removes part of the very effect you are measuring. Adjust only when you are explicitly decomposing direct and indirect effects. Pathway: causal diagrams.
A common effect of two variables (collider)
Do not adjust. Conditioning on a collider opens a spurious path and induces bias where none existed, the trap that makes more adjustment actively worse. Pathway: causal diagrams.
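Collider bias is easy to demonstrate by simulation. In this sketch the exposure and outcome are generated independently, the collider is simply their sum, and "conditioning" is done by restricting to high collider values; all of these are assumptions chosen to keep the example small.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation computed from sums of squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)
n = 20000

# Exposure and outcome are independent by construction: the true association is zero.
exposure = [rng.gauss(0, 1) for _ in range(n)]
outcome = [rng.gauss(0, 1) for _ in range(n)]
# The collider is a common effect of both.
collider = [x + y for x, y in zip(exposure, outcome)]

r_all = pearson(exposure, outcome)

# Condition on the collider by keeping only records with a high value of it.
selected = [(x, y) for x, y, c in zip(exposure, outcome, collider) if c > 1.0]
r_selected = pearson([x for x, _ in selected], [y for _, y in selected])

print(f"full sample r = {r_all:.3f}; conditioned on collider r = {r_selected:.3f}")
```

The full sample shows essentially no correlation, while the selected subset shows a clear negative one that exists only because of the conditioning.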
Predicts the outcome but is unrelated to the exposure
Optional, and including it usually helps. A pure outcome predictor is not a confounder, so it is not required, but adding it tightens the precision of the estimate. Pathway: causal diagrams · uncertainty and inference.
Affects the exposure only (an instrument)
Keep it out of the outcome model. Adjusting for an instrument can amplify residual confounding rather than reduce it; its proper use is an instrumental-variable design, not a covariate. Pathway: causal diagrams · causal designs without randomization.
Measured after the exposure (post-treatment)
Generally do not adjust. A descendant of the exposure is a mediator or a collider in disguise, so adjusting risks both over-control and collider bias; the fix is to define covariates in a pre-exposure window. Pathway: causal diagrams · assembling the cohort.
Choosing a model
The Model rung in one decision. The outcome you are modeling picks the family, the family picks the effect measure it returns, and the family also fixes the assumptions you then have to check. Open the branch that matches your outcome.
Are you explaining an effect, or predicting an outcome?
Predicting: I want accurate forecasts on new data
Prediction model. Reach for flexible methods such as gradient boosting or regularized regression, and judge them on out-of-sample performance and calibration, not the plausibility of coefficients. Pathway: prediction and machine learning · calibration versus discrimination · model fit and prediction error.
Explaining: I want an interpretable effect
A continuous outcome
Linear regression, returning a mean difference. Check linearity, constant variance, normal residuals, and independence; if the data are heavy-tailed or outlier-prone, switch to robust methods or a transformation. Pathway: regression families · checking model assumptions · robust statistics.
A binary (yes / no) outcome
Logistic regression, returning an odds ratio. Check linearity in the logit and influential points, and watch for separation; there is no constant-variance assumption to check. Pathway: regression families · effect measures · checking model assumptions.
A count or a rate
Poisson regression, returning a rate ratio, or negative binomial when the variance exceeds the mean (overdispersion). Check for overdispersion and excess zeros. Pathway: regression families · checking model assumptions.
Ordered categories
Ordinal (proportional-odds) regression. Its load-bearing assumption is proportional odds, so test it before trusting the fit. Pathway: regression families · checking model assumptions.
Time-to-event, with censoring
Cox or parametric survival, returning a hazard ratio. Check proportional hazards with scaled Schoenfeld residuals; when the assumption fails, report restricted mean survival time instead. Pathway: regression families · hazard ratios and non-proportional hazards · checking model assumptions.
Whichever outcome, but observations are clustered or repeated
Layer a mixed-effects model or GEE on top of the family above. Patients within hospitals, or repeated measures within a person, violate independence, and ignoring the clustering usually makes standard errors too small. Pathway: regression families · uncertainty and inference · checking model assumptions.
Traces
Each trace is the same pathway walked on a real case, tagged to the rung it tests hardest and linked back to the pathway rung by rung, so a method above and its worked example here point at each other. They are grouped by the condition they concern, by ICD-10 code, with a methods group for the data-standards and trial work that no single condition owns.
Hypertension · I10
All hypertension traces →
Measurement rung
The 120 mmHg systolic target
The 2017 ACC/AHA guideline's lower blood-pressure targets lean on the SPRINT trial, whose intensive arm aimed for a systolic pressure below 120 mmHg and measured blood pressure with automated, rested, averaged readings that run lower than the single manual cuff reading most clinics use. The trace walks the recommendation down to that measurement choice and back.
Acute ischemic stroke · I63
All stroke traces →
Model & estimate rungs
Tenecteplase versus alteplase
The 2026 AHA/ASA guideline made the two thrombolytics co-equal first-line. The strongest head-to-head evidence is non-inferiority, not superiority, so the trace asks what a “not worse by more than a set margin” result, plus a logistical advantage, can and cannot support.
Type 2 diabetes · E11
Difference-in-differences
The $35 insulin cap · Part D
A two-way fixed-effects difference-in-differences on the IRA insulin cap: event study, parallel-trends defense, placebo years. End-to-end on real Part D data, measurement through defend-it.
Cardiometabolic risk · E88.81
Survey-weighted analysis
Cardiometabolic risk in NHANES
Survey-weighted prevalence and risk, multiple imputation, calibration versus AUC, and a Pooled Cohort Equations head-to-head. End-to-end on NHANES, framing through defend-it.
Methods
Data-standards and trial work that no single condition owns, each built end-to-end on real data.
Real-world data
Medicaid spending outliers
Peer-group robust z-scores, BH-FDR multiplicity, an isolation-forest second opinion, and a county cost atlas. Measurement through defend-it.
Trial data standards
CDISC SDTM/ADaM pilot
Double-programming an FDA-grade analysis package in SAS and R, executed on the CDISC pilot data. Measurement, framing, and reporting standards.
More conditions and analyses are in progress, each traced on the same pathway: statin primary-prevention thresholds and the pooled-risk equations behind them, the glycemic targets in type 2 diabetes, and the screening-interval recommendations for breast and prostate cancer. They publish here as each one is fully worked and sourced.
Methodology frameworks
A separate strand of work develops methodological frameworks at the boundary of evidence synthesis and clinical AI evaluation. These are public drafts intended to be redlined.
Risk-of-bias appraisal for AI training corpora (v0.1). Adapting Cochrane RoB 2 / ROBINS-I logic to the text an LLM was actually trained on. Six bias domains with inline signaling questions and a stylized worked example end-to-end.
Additional frameworks (GRADE for AI-synthesized claims, and a PRISMA-style reporting checklist for clinical AI as an evidence synthesizer) are in progress and will publish as v0.1 drafts when ready.
Across the traces
Read one at a time, a trace appraises a single recommendation. Read together, the traces start to show patterns: the rungs where guideline confidence and statistical support most often diverge, the measurement choices that quietly decide an estimate, the difference between a number that is precise and a number that is portable. As the set grows, this section will carry that synthesis: how a reader who understands both the clinical stakes and the statistics should weight the evidence behind a given recommendation.
That synthesis is methodological commentary, not clinical advice. It is about what the numbers can and cannot support, written for people who design studies, defend methods, and read the literature critically. It is not guidance for treating a patient, and it is not a substitute for the guidelines themselves or for clinical judgment.
If you have a recommendation whose statistical basis you want traced, for a manuscript, a guideline-development effort, or your own appraisal, that is the kind of work I take on. Book a discovery call →