Paulina Del Mundo — Writing

When the hazard ratio misleads

Paulina Del Mundo — Tue, 02 Jun 2026 00:00:00 GMT

The hazard ratio is the reflex summary for any time-to-event trial. One number, one confidence interval, done. It rests on an assumption that is easy to skip over and is often quietly false: proportional hazards.

What proportional hazards actually assumes

A Cox model’s hazard ratio is a single constant. It assumes the treatment’s relative effect on the instantaneous risk is the same in month two as in year four. When that holds, the HR is a clean summary. When it doesn’t, the reported HR becomes a weighted average of a time-varying effect, and the weights depend on the follow-up and censoring pattern rather than on anything clinical. The number you report is then partly an artifact of how long people were followed and when they happened to drop out, not a stable description of the treatment.
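
In symbols, for a two-arm trial with a treatment indicator x, the assumption can be written as below. The notation (h_0 for the baseline hazard, β for the log hazard ratio) is standard Cox-model notation introduced here for illustration; it does not appear elsewhere in this post.

```latex
\[
h(t \mid x) = h_0(t)\, e^{\beta x}, \quad x \in \{0, 1\}
\qquad\Longrightarrow\qquad
\mathrm{HR}(t) = \frac{h(t \mid x = 1)}{h(t \mid x = 0)} = e^{\beta}
\quad \text{for every } t .
\]
```

The ratio on the right has no t left in it. When the true ratio does vary with time, the fitted constant is forced to summarize the whole function HR(t) with one number.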

Non-proportional hazards are not exotic. They are the norm for therapies with a delayed effect (many immunotherapies do nothing for months and then separate the curves), for treatments whose benefit wanes, and for any setting where the survival curves cross. In all of these a single HR can look modest while the curves tell a dramatic story, or look impressive while the benefit is confined to a window that matters less than the average implies.

What to report instead

Two alternatives give a number that survives non-proportional hazards and that a clinician and a patient can actually interpret:

Restricted mean survival time (RMST). The area under the survival curve up to a chosen horizon. The between-arm difference reads directly as “this many more event-free months over five years,” with no proportional-hazards assumption required.
Milestone (landmark) analysis. The survival probability at a fixed, pre-specified time, for example five-year survival. Blunt, but honest and easy to communicate.

Neither replaces the survival curve itself, which should always be shown.
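
Both summaries can be read straight off a Kaplan-Meier curve. What follows is a minimal sketch in plain Python, assuming right-censored data with times in months. The toy data, the 60-month horizon, and the function names are invented for illustration; a real analysis would use an established survival library and report confidence intervals.

```python
from typing import List, Tuple

Curve = List[Tuple[float, float]]  # (event time, survival just after it)


def kaplan_meier(times: List[float], events: List[int]) -> Curve:
    """Kaplan-Meier estimate; events: 1 = event observed, 0 = censored."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve: Curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group everyone sharing this time: all are still at risk at t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def rmst(curve: Curve, horizon: float) -> float:
    """Area under the step-function survival curve from 0 to horizon."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in curve:
        if t >= horizon:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (horizon - prev_t)


def milestone_survival(curve: Curve, t_star: float) -> float:
    """Estimated survival probability at the fixed time t_star."""
    s = 1.0
    for t, surv in curve:
        if t > t_star:
            break
        s = surv
    return s


HORIZON = 60.0  # months; chosen before looking at the data
# Invented toy data: follow-up in months, 1 = event, 0 = censored.
control = kaplan_meier([6, 12, 18, 24, 30, 60], [1, 1, 1, 1, 1, 0])
treated = kaplan_meier([6, 12, 36, 48, 60, 60], [1, 1, 1, 1, 0, 0])

rmst_diff = rmst(treated, HORIZON) - rmst(control, HORIZON)
print(f"RMST difference over {HORIZON:.0f} months: {rmst_diff:.1f} months")
print(f"Milestone survival at {HORIZON:.0f} months: "
      f"{milestone_survival(control, HORIZON):.2f} vs "
      f"{milestone_survival(treated, HORIZON):.2f}")
```

On this toy data the RMST is 25.0 months in the control arm and 37.0 in the treated arm, so the difference reads as 12 more event-free months over five years. The horizon should not extend past the follow-up the data actually support.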

The discipline is in the timing

The trap is choosing RMST after seeing the curves cross. That is a post-hoc switch, and reviewers will read it as one. The fix costs nothing if you do it early: pre-specify the primary analysis (Cox, RMST, or milestone), and name the proportional-hazards diagnostic that would trigger a switch, before the data are unblinded. The cheapest place to handle non-proportional hazards is the analysis plan, not the response to reviewers.
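
One way to make the pre-specification concrete is to write the switch rule down as data before unblinding. The sketch below is illustrative only. The Grambsch-Therneau test on scaled Schoenfeld residuals is one commonly used proportional-hazards diagnostic, but naming it here, the 0.05 threshold, the 60-month horizon, and the class and function names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the plan cannot be edited after it is written
class AnalysisPlan:
    primary: str          # "cox", "rmst", or "milestone"
    fallback: str         # analysis to use if the diagnostic triggers
    ph_diagnostic: str    # named before the data are unblinded
    ph_alpha: float       # hypothetical trigger threshold
    horizon_months: float


def select_analysis(plan: AnalysisPlan, ph_p_value: float) -> str:
    """Apply the pre-specified switch rule, and nothing else."""
    if plan.primary == "cox" and ph_p_value < plan.ph_alpha:
        return plan.fallback
    return plan.primary


plan = AnalysisPlan(
    primary="cox",
    fallback="rmst",
    ph_diagnostic="Grambsch-Therneau test on scaled Schoenfeld residuals",
    ph_alpha=0.05,
    horizon_months=60.0,
)

print(select_analysis(plan, ph_p_value=0.01))  # diagnostic triggers the switch
print(select_analysis(plan, ph_p_value=0.40))  # primary analysis stands
```

The point of the frozen record is that the choice of analysis depends only on a rule that existed before anyone saw the curves.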

If you are designing a time-to-event study, the study-design chapter walks through the endpoint and analysis-plan decisions this connects to.

Risk-of-bias appraisal for AI training corpora

Paulina Del Mundo — Mon, 25 May 2026 00:00:00 GMT

Where does bias actually enter a clinical-AI system?

When people talk about bias in clinical AI, they’re almost always talking about the model’s outputs. That’s the visible layer. It isn’t the only one, and arguably not the first place to look.

A layer that gets less systematic attention is the body of evidence underneath the model’s recommendations. In a traditional systematic review, that body of evidence is the included studies, and the field has spent roughly three decades building tools to appraise it, beginning with the work of the Cochrane Collaboration. In an LLM that emits clinical advice, the analogous body of evidence is the text the model was trained on. In effect, it is the model’s evidence base.

There are relatively few organized methods for grading that evidence base. The existing risk-of-bias and reporting tools for clinical AI (formally: PROBAST-AI, CONSORT-AI, SPIRIT-AI, TRIPOD-AI) sit one step downstream: they appraise the prediction model, the trial that validated it, or the report describing either. None of them is designed to appraise the corpus that trained the model in the first place.

The claim of this v0.1 is fairly narrow: the methodology behind RoB 2 and ROBINS-I (Cochrane’s two risk-of-bias tools, for randomized and non-randomized studies respectively) is a reasonable starting point for appraising training corpora. The translation may be more straightforward than it first appears, once the corpus itself is treated as the unit of analysis.

In the rest of this draft, I’ll walk through what those two original tools actually do, where the training-corpus analogy holds and where it breaks, and propose a first set of domains for grading a corpus the way Cochrane grades a study. The goal isn’t to publish a finished tool. It’s to put a v0.1 on the table that other people can argue with.

What do RoB 2 and ROBINS-I actually do?

Both tools were built by Cochrane methodologists to do the same job: walk an appraiser through a clinical study and produce a structured judgment about where bias could have entered the study’s design or reporting. The two tools differ mainly in the kind of study they’re built for.

RoB 2 is the current Cochrane risk-of-bias tool for randomized controlled trials, released in 2019 (replacing an earlier version from 2008). It organizes bias into five domains: the randomization process, deviations from the intended interventions, missing outcome data, measurement of the outcome, and selection of the reported result. For each domain, the appraiser answers a short set of signaling questions about what the study did. Each question carries five response options (yes, probably yes, probably no, no, or no information), and a built-in algorithm collapses those answers into a judgment of low risk of bias, some concerns, or high risk of bias. The overall judgment for the study is then drawn from the per-domain judgments. RoB 2 grades at the level of a specific outcome rather than the study as a whole, since the same trial can have low risk of bias for one outcome and high risk for another.

ROBINS-I is the analogous Cochrane tool for non-randomized studies of interventions: cohort studies, case-control studies, interrupted time series, and similar observational designs. It was released in 2016 and uses a similar structure of signaling questions plus an algorithm, but with seven domains rather than five. In place of RoB 2’s randomization domain it has three domains covering confounding, selection of participants, and classification of interventions; the first two are precisely the bias sources that randomization is supposed to neutralize. Its conceptual hook is the target trial, the hypothetical randomized study the observational data is trying to approximate. The appraiser grades the real study against that imagined trial, with judgments of low, moderate, serious, or critical risk of bias.

Three design choices, common to both tools, matter for what comes next.

Signaling questions, not free-form prose. The appraiser doesn’t write a paragraph about each domain; they answer structured items. This is what makes the judgments reproducible across appraisers.
Domain-by-domain, then aggregated. Each domain gets its own judgment first, and only then is an overall judgment produced. This forces the appraiser to think about specific bias mechanisms rather than form a global impression.
Worst-case aggregation. A single domain rated high (or critical) is usually enough to drive the overall judgment to the same level. The tool is designed to surface vulnerabilities, not to average them away.

Where does the analogy hold, and where does it break?

The instinct to port RoB 2 directly to a training corpus runs into trouble, but the underlying claim — that risk-of-bias methodology can grade a corpus the way it grades a study — is roughly correct. The two halves of that need to be argued separately.

What transfers cleanly: the apparatus. All three of the design choices in §2 carry over with very little modification.

Signaling questions still work. Whether the unit being graded is a trial or a corpus, structured response items (yes, probably yes, probably no, no, or no information) keep appraiser judgments reproducible. Nothing about that depends on the underlying object being a trial.
Per-domain-then-overall aggregation still works. Decomposing bias into specific mechanisms before producing a global judgment is a sound move regardless of what’s being decomposed. It is, if anything, more important for corpora, where global impressions (“trained on a lot of medical text”) are particularly uninformative.
Worst-case aggregation still works. A corpus with one domain at critical risk of bias is critically biased overall, by the same logic that drives a trial-level RoB 2 judgment to high.

What doesn’t transfer: most of the specific domains. This is where naive porting fails. A training corpus differs from a clinical trial in several ways that matter for bias grading.

There is no protocol. A clinical trial pre-specifies its population, intervention, primary outcome, and analysis plan. A training corpus is assembled out of scraping rules, licensing decisions, and copyright filters, most of which are documented unevenly, if at all. RoB 2’s domain on “deviations from intended interventions” has no clear corpus analogue, because there is no “intended intervention” to deviate from.
There is no defined population. A trial has inclusion and exclusion criteria. A training corpus has whatever text was reachable, retained, and filtered. The corpus’s “population” is implicit in the data-curation pipeline rather than declared up front.
There is no measured outcome. Trials measure outcomes; corpora don’t. The RoB 2 domain on “measurement of the outcome” has no corpus equivalent. The model’s outputs are measured downstream, but that’s where PROBAST-AI and CONSORT-AI already do their work, not at the corpus level.
There is no allocation to randomize, and no confounding to control. RoB 2’s randomization domain and ROBINS-I’s confounding domain both assume a causal-effect estimation problem. Building a training corpus is not that kind of problem.
The appraiser usually can’t see the corpus. For most production LLMs, the appraiser is reading a model card or a technical report rather than the corpus itself. The information substrate for bias judgment is thinner than for a published trial, and the signaling questions need to be designed to accommodate that.

The cumulative effect of those five gaps is that most of the content of the RoB 2 and ROBINS-I domains has to be rebuilt for a corpus, not ported. What gets ported is the architecture: structured signaling questions, domain-level judgments, worst-case aggregation. What gets rebuilt is the list of domains and the signaling questions inside them.

What are the bias domains for a training corpus?

The previous section established that the domain content of RoB 2 won’t transfer to corpora directly. What follows is the v0.1 attempt at a corpus-specific domain set: six domains drawn from the bias surfaces that matter for clinical text and clinical AI specifically. As with RoB 2 and ROBINS-I, each domain comes with a small set of signaling questions, designed to be answerable from a model card or technical report rather than from the corpus itself.

A note before the list: signaling questions are answered yes / probably yes / probably no / no / no information, following RoB 2 convention. For each domain, the appraiser produces a domain-level judgment of low, moderate, serious, or critical, matching the four-level scale ROBINS-I uses, since corpora are conceptually closer to observational than randomized evidence.

Domain 1: Source provenance

What bias does this domain track? The mix of source types in the training corpus: peer-reviewed clinical literature, consensus guidelines, textbooks, regulatory documents, case reports, vendor-sponsored material, scraped forum threads, and AI-generated text. Each source type carries a different epistemic weight, and a corpus that combines them without explicit weighting transfers that mix to the model’s outputs.

Signaling questions.

1.1 Does the corpus documentation describe the source-type composition (e.g., approximate proportions of journal text, guideline text, forum text)?
1.2 Are sources from the peer-reviewed clinical literature explicitly represented?
1.3 Is consensus-guideline material (e.g., NICE, USPSTF, specialty-society guidelines) explicitly represented?
1.4 Is the corpus filtered to exclude or down-weight non-peer-reviewed material (forum posts, vendor marketing, AI-generated text)?
1.5 Is the proportion of AI-generated content in the corpus disclosed?

Typical paths to a high-risk judgment. Heavy reliance on undocumented web-scrape data; substantial AI-generated content; absence of any reporting on source-type composition.

Domain 2: Population coverage relative to deployment context

What bias does this domain track? Mismatch between the populations represented in the corpus and the population the model will be deployed against: demographic, geographic, disease-burden, and care-setting mismatch. This is the corpus-level analogue of external validity in trial design.

Signaling questions.

2.1 Does the documentation describe the geographic provenance of the training text (US-centric, multi-region, language coverage)?
2.2 Are pediatric, geriatric, pregnancy, and rare-disease populations explicitly represented or sub-collected?
2.3 Are non-English clinical sources represented in the corpus?
2.4 Is the deployment context (intended care settings, intended populations) documented in a way that allows comparison to the corpus composition?

Typical paths to a high-risk judgment. Deployment in a population not meaningfully represented in the documented corpus; no disclosure of language or geographic composition.

Domain 3: Temporal coverage and recency

What bias does this domain track? The time window of the corpus and the unevenness of its temporal density. Guidelines, drug indications, and standard-of-care all change; a corpus weighted toward older or newer sources will skew model outputs accordingly.

Signaling questions.

3.1 Does the documentation specify a training data cutoff date?
3.2 Are sources older than ~10 years explicitly down-weighted or filtered, where relevant to evolving clinical practice?
3.3 For clinical content where guidance has substantively changed in the last 5 years (e.g., GLP-1 indications, sepsis bundles, prostate-cancer screening), is the corpus weighted toward current guidance?
3.4 Is the temporal distribution of the corpus (e.g., proportion of post-2020 sources) disclosed?

Typical paths to a high-risk judgment. Training cutoff substantially predates major shifts in the deployment topic; no temporal-distribution reporting in a fast-moving clinical area.

Domain 4: Internal-contradiction adjudication

What bias does this domain track? How the corpus and the training pipeline handle contradictions present in the source material: older vs. newer guidelines, vendor-sponsored vs. independent reviews, US vs. European guidance, expert opinion vs. trial evidence. When contradictions are not adjudicated, or are adjudicated invisibly by the training process, model outputs reflect a weighted average rather than a defensible synthesis.

Signaling questions.

4.1 Does the documentation describe how contradictory clinical guidance was handled during corpus construction (e.g., source prioritization, recency rules)?
4.2 Were retracted or corrected publications removed from the corpus?
4.3 Were superseded guideline versions removed or down-weighted?
4.4 Is there any process for ensuring the model surfaces, rather than averages over, known disagreement in the literature?

Typical paths to a high-risk judgment. No documented retraction handling; no documented process for superseded guidelines; no evidence of any adjudication step in the corpus pipeline.

Domain 5: Labeling and preference-tuning provenance

What bias does this domain track? Bias introduced by the human labeling and reinforcement-learning steps applied after the corpus is assembled: labelers’ clinical expertise (or lack of it), the instructions they were given, and the criteria used to define a “good” output. RLHF and instruction-tuning shape what the model says even when the underlying corpus is well-curated; many clinical models use general-purpose labeling workforces with no clinical training.

Signaling questions.

5.1 Is the clinical expertise of the human labelers documented (proportion with clinical training, specialty backgrounds)?
5.2 Are the labeling instructions for clinical content publicly available, or summarized in the model card?
5.3 Were clinical safety criteria explicitly built into the preference-tuning objective, rather than left to general “helpfulness” criteria?
5.4 Is the demographic composition of the labeling workforce documented?

Typical paths to a high-risk judgment. Non-clinician labeling workforce for clinical content with no clinical-review backstop; preference-tuning objective measured only on non-clinical “helpfulness” benchmarks.

Domain 6: Epistemic monoculture

What bias does this domain track? Over-reliance on a single knowledge tradition, typically US-centric, English-language, and biased toward a small set of authoritative bodies (e.g., NIH, FDA, NCCN). Distinct from Domain 2 (population coverage): monoculture is about whose framework counts as authoritative, not whose patient data is represented.

Signaling questions.

6.1 Are guidelines from multiple regional or international bodies represented (e.g., NICE, ESMO, WHO, JSCO, alongside US guidance)?
6.2 Is non-English peer-reviewed literature represented?
6.3 Where multiple legitimate clinical paradigms exist (e.g., palliative-care framings, screening philosophies that differ across systems), is the corpus documented as representing more than one?
6.4 Are the corpus-curation decisions documented in a way that allows an appraiser to identify systematic exclusions of particular traditions?

Typical paths to a high-risk judgment. Documented or evident exclusive reliance on US/English-language sources for a model with non-US deployment; no representation of major regional guideline bodies.

Aggregation, and one departure from RoB 2

Aggregation across the six domains follows the worst-case rule introduced in §2: a single domain at critical drives the overall corpus judgment to critical, and the overall judgment cannot be lower than the worst domain-level judgment.

v0.1 also introduces one explicit departure from RoB 2 convention. RoB 2 treats no information as a neutral answer to a signaling question, neither positive nor negative for the bias judgment. For training corpora, where opacity is the norm rather than the exception, treating no information as neutral lets nearly every closed-weights model coast. v0.1 therefore proposes: a domain in which more than half of the signaling questions are answered no information receives a default judgment of serious, on the grounds that ungradable opacity is itself a bias signal. This is a methodological choice open to redlining (it might be too aggressive, or not aggressive enough), and is one of the explicit open questions in §6.

What this means for you: to apply the framework, work through the six domains in order against the model card or technical report at hand. Answer each signaling question with yes / probably yes / probably no / no / no information; resolve each domain to low / moderate / serious / critical; take the worst as the overall corpus judgment, with the opacity rule applied as described. The next section walks the framework through a stylized example end-to-end.
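
The two mechanical rules (the opacity rule and worst-case aggregation) can be sketched as code. This is a minimal illustration, not a reference implementation. v0.1 does not define an algorithm that maps individual answers to a domain judgment, so the appraiser's own judgment is passed in as an input. Reading "default judgment of serious" as a floor, so that a critical judgment is never lowered, is an assumption, as are all the names used.

```python
from enum import Enum, IntEnum
from typing import Dict, List


class Answer(Enum):
    YES = "yes"
    PROBABLY_YES = "probably yes"
    PROBABLY_NO = "probably no"
    NO = "no"
    NO_INFORMATION = "no information"


class Judgment(IntEnum):
    # Ordered so that max() returns the worst judgment.
    LOW = 0
    MODERATE = 1
    SERIOUS = 2
    CRITICAL = 3


def opacity_rule_fires(answers: List[Answer]) -> bool:
    """True when more than half of the answers are 'no information'."""
    n_ni = sum(1 for a in answers if a is Answer.NO_INFORMATION)
    return 2 * n_ni > len(answers)


def grade_domain(answers: List[Answer], appraiser_judgment: Judgment) -> Judgment:
    """Apply the opacity rule on top of the appraiser's own judgment.

    Assumption: 'serious' acts as a floor when the rule fires.
    """
    if opacity_rule_fires(answers):
        return max(appraiser_judgment, Judgment.SERIOUS)
    return appraiser_judgment


def overall_judgment(domains: Dict[str, Judgment]) -> Judgment:
    """Worst-case aggregation: the overall judgment is the worst domain."""
    return max(domains.values())


NI = Answer.NO_INFORMATION
# Hypothetical answers for two four-question domains.
mostly_opaque = [Answer.YES, NI, NI, NI]                 # 3 of 4: rule fires
half_opaque = [Answer.PROBABLY_YES, Answer.YES, NI, NI]  # 2 of 4: it does not

graded = {
    "domain A": grade_domain(mostly_opaque, Judgment.LOW),
    "domain B": grade_domain(half_opaque, Judgment.MODERATE),
}
print({name: j.name.lower() for name, j in graded.items()})
print("overall:", overall_judgment(graded).name.lower())
```

Domain A is pushed from low to serious by opacity alone, domain B keeps the appraiser's moderate, and the overall judgment takes the worse of the two.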

How does this look in practice?

Consider a hypothetical vendor (call them PrimaryCareLLM Inc.) pitching an AI assistant for US primary-care physicians. The pitch materials and public documentation contain the following:

Marketing. Trained on “the full clinical literature plus consensus guidelines”; intended to help PCPs with differential-diagnosis suggestions and triage.
Model card. Discloses training on “high-quality medical text, including peer-reviewed publications, clinical guidelines, and textbooks.” Lists PubMed Central, MedlinePlus, NIH guideline documents, and “selected medical websites” as example sources. Says proprietary filtering removed low-quality content.
Technical report. Training data cutoff: December 2024. Mentions “instruction-tuning using a workforce of medically-trained reviewers.”
Validation. Peer-reviewed study showing high agreement with PCPs on differential-diagnosis tasks; validation cohort drawn from US adult primary care.
Not disclosed. Source-type proportions; AI-generated content proportion; temporal distribution of the corpus; non-US guideline coverage; labeler demographics; “medically-trained reviewer” definition; treatment of retracted or superseded sources.

What follows is what the v0.1 framework would say about this corpus.

Domain 1 (Source provenance): Moderate

The named sources are reasonable (peer-reviewed literature, NIH guideline documents, textbooks), and the vendor reports applying a filtering step, which suggests at least nominal curation. But the source-type composition isn’t quantified, the filtering criteria aren’t disclosed, and the proportion of AI-generated content isn’t reported. Under the disclosure as written, the corpus could plausibly contain anything from no AI-generated material to a substantial share. Judgment: moderate. The source mix as described is reasonable in kind; the quantitative composition is opaque.

Domain 2 (Population coverage): Moderate

Graded against the declared deployment context (US adult primary care), population coverage is acceptable in the broad strokes: English-language US clinical content matches the use case. But the documentation gives no breakdown of pediatric, geriatric, pregnancy, or rare-disease representation in the corpus. The validation cohort is US adult primary care, which leaves the model ungraded on age extremes and on contexts a PCP routinely encounters (well-child visits, prenatal care, geriatric polypharmacy). Judgment: moderate, with the caveat that any deployment beyond US adult primary care re-grades this to serious.

Domain 3 (Temporal coverage): Serious

The training cutoff is disclosed (December 2024). That’s the only temporal-coverage signaling question answered. The corpus’s temporal distribution is undisclosed, there’s no description of whether older sources are down-weighted, and the documentation says nothing about handling fast-changing clinical areas. Three of four signaling questions are no information. Judgment: serious, by the opacity rule. A real appraisal would also note that this is a domain where opacity has direct clinical consequences: a primary-care assistant that draws disproportionately from pre-2020 sources will give pre-2020 advice, and there’s no way to tell from the documentation whether that’s happening.

Domain 4 (Contradiction adjudication): Serious

All four signaling questions in this domain are no information. The documentation says nothing about retraction handling, nothing about superseded-guideline handling, and nothing about how the training pipeline resolved contradictory clinical guidance. Judgment: serious, by the opacity rule. The opacity here is particularly load-bearing: a clinical assistant whose adjudication process is invisible cannot defend any of its recommendations against the standard methodological critique — why did you choose this guideline over that one? — because no such choice is documented to have been made.

Domain 5 (Labeling provenance): Serious

The technical report mentions “medically-trained reviewers,” which is the only positive signal in this domain. “Medically-trained” is undefined (MDs? nurses? medical assistants? people with online certification?). The labeling instructions aren’t disclosed; the preference-tuning objective isn’t described in clinical-safety terms; the demographic composition of the labeling workforce is undisclosed. Three of four signaling questions are no information. Judgment: serious, by the opacity rule. A more defensible disclosure would specify the proportion of labelers with active clinical practice, the specialty mix, and the clinical-safety criteria built into the preference-tuning objective.

Domain 6 (Epistemic monoculture): Serious

The named sources are all US-anchored: NIH guideline documents, MedlinePlus, PubMed Central (which is US/English-weighted in practice). No mention of NICE, ESMO, WHO, or any non-US regional body. “Primarily English” is the explicit language stance. Two signaling questions resolve as probably no; the other two are no information. Judgment: serious. This judgment rests on the substantive evidence rather than the opacity rule: exactly half of the questions (2 of 4) are no information, and the rule requires more than half, so it does not fire. The probably no answers do the bias-pointing work on their own.

Overall judgment: Serious

Applying the worst-case rule from §2: four of six domains land at serious, two at moderate. The overall corpus judgment is serious.
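
The tally can be checked mechanically. The sketch below encodes only what the walkthrough states: the domain judgments, and the no-information counts where they are given. The counts for Domains 1 and 2 are not stated, so they are left as None rather than guessed. The type and function names are hypothetical.

```python
from typing import List, NamedTuple, Optional

SCALE = ["low", "moderate", "serious", "critical"]  # ordered best to worst


class DomainResult(NamedTuple):
    name: str
    n_questions: int
    n_no_information: Optional[int]  # None where the walkthrough gives no count
    judgment: str


# Judgments and counts as stated in the PrimaryCareLLM walkthrough.
results: List[DomainResult] = [
    DomainResult("1 Source provenance", 5, None, "moderate"),
    DomainResult("2 Population coverage", 4, None, "moderate"),
    DomainResult("3 Temporal coverage", 4, 3, "serious"),
    DomainResult("4 Contradiction adjudication", 4, 4, "serious"),
    DomainResult("5 Labeling provenance", 4, 3, "serious"),
    DomainResult("6 Epistemic monoculture", 4, 2, "serious"),
]


def opacity_rule_fires(r: DomainResult) -> bool:
    """More than half of the domain's questions answered 'no information'."""
    if r.n_no_information is None:
        return False  # count not stated, so the rule is not evaluated
    return 2 * r.n_no_information > r.n_questions


fired = [r.name for r in results if opacity_rule_fires(r)]
tally = {level: sum(r.judgment == level for r in results) for level in SCALE}
overall = max((r.judgment for r in results), key=SCALE.index)

print("opacity rule fired in:", fired)
print("tally:", tally)
print("overall:", overall)
```

The rule fires in Domains 3, 4, and 5 but not in Domain 6, where exactly half of the questions are no information, which matches the reasoning given for each domain.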

What would an appraiser do with that judgment? In Cochrane practice, a serious RoB judgment doesn’t disqualify a study from a synthesis; it qualifies the conclusions that can be drawn from it. Translated to a corpus appraisal: a serious judgment on PrimaryCareLLM’s training data doesn’t mean the tool can’t be used; it means the tool’s outputs cannot be relied upon for clinical decisions in any area where the unaddressed bias surfaces (temporal currency, contradiction adjudication, labeling expertise, monoculture) plausibly matter. For a primary-care assistant, that’s nearly all of clinical practice. A defensible appraisal would not characterize the corpus as ready for the deployment claims being made.

What this means for you: the worked sketch isn’t meant to indict any particular vendor. PrimaryCareLLM is stylized, and the disclosure patterns described here are common in the field today rather than worst-case. The point is that the framework can take a realistic information substrate and produce a defensible domain-level judgment, including a defensible overall judgment, in a way that gives the appraiser something to argue with the vendor about. If you read a vendor’s documentation and find yourself answering no information to a majority of signaling questions across most domains, the framework is doing exactly what RoB methodology is supposed to do: it’s surfacing what isn’t known, and treating that ignorance as a graded epistemic fact rather than a neutral one.

What does v0.1 explicitly not do?

This is a v0.1, not a complete tool. What follows is the section the manifesto stance (“publish in public, revise in public”) commits the framework to: an explicit list of what’s load-bearing-and-shaky, what’s deliberately scoped out, and what the open methodological questions are for v0.2.

Limitations

The opacity-penalty rule is the framework’s biggest methodological stake. RoB 2 treats no information as neutral; v0.1 treats it as serious at the domain level when it predominates. There is no precedent for this in the Cochrane tools. It could be too aggressive (penalizing vendors who are honest about disclosure gaps rather than vendors who actually have curation gaps), or not aggressive enough (the >50% threshold is arbitrary, and a 1-in-4 threshold on a high-criticality domain might be more defensible). Both objections are open.
The framework grades documentation, not the corpus itself. An appraiser using v0.1 can only grade what the vendor discloses. Well-documented corpora with subtle problems may be graded too generously, and poorly-documented corpora with reasonable underlying curation may be graded too harshly. Better disclosure is a necessary but not sufficient condition for a defensible appraisal, and v0.1 has no mechanism for distinguishing “absent because the work wasn’t done” from “absent because it isn’t being disclosed.”
No inter-rater reliability data. RoB 2 has been refined across years of reviewer studies. v0.1 has none. Whether two appraisers applying these signaling questions to the same model card would reach the same domain-level judgment is an open empirical question, and the answer matters: a tool that produces meaningfully different judgments across appraisers is doing something other than what RoB methodology is supposed to do.
The framework treats LLM training corpora as the canonical case. Multimodal models (training on images alongside text), retrieval-augmented systems (where a second corpus is consulted at inference time), and heavily fine-tuned variants (where the bias surface shifts toward the fine-tuning data rather than the base training corpus) all have corpus structures that v0.1 doesn’t explicitly handle. The framework should still be approximately applicable to these cases, but the signaling questions weren’t designed with them in mind.
Worst-case aggregation can be too strict in narrow deployment contexts. A domain at critical drives the overall judgment to critical even when the affected bias surface is irrelevant to the declared use. RoB 2 has the same issue; the standard response is that the appraiser annotates the judgment with outcome-specific reasoning. The same workaround applies here, but it makes the overall judgment less load-bearing than it might appear.
The framework doesn’t grade the model; it grades the corpus. A well-graded corpus doesn’t guarantee a well-behaved model; a poorly-graded corpus doesn’t preclude a usable model in narrowly-scoped deployments. The corpus appraisal is one input to a fuller AI-product appraisal (the other inputs being where PROBAST-AI, CONSORT-AI, and the in-progress GRADE-for-AI framework operate), not the final word.

Open questions for v0.2

Is the >50% no-information threshold the right cutoff for the opacity penalty? A scaled rule (lower threshold for high-criticality domains, higher for lower-criticality ones) may be more defensible. v0.1 chose the flat threshold for simplicity.
Should Domain 5 (labeling provenance) be split into two domains? Labeler composition (clinical expertise, demographics) and instruction transparency (preference-tuning criteria, safety objectives) are conceptually distinct and may deserve separate domain-level grades.
Should there be a “deployment-context fit” meta-domain? Domain 2 handles one piece of this. The broader question of whether the corpus, as a whole, fits the declared use case might warrant its own domain rather than being implicit in domain-level grading.
How should retrieval-augmented systems be graded? A retrieval corpus stacked on a training corpus is a different bias structure. Two-tier grading is the obvious move, but the relationship between the two corpora is non-trivial. Handling this in v0.2 may require a companion framework rather than an extension.
What’s the right granularity for non-English source representation? Domain 2 and Domain 6 both touch this. v0.1 asks whether non-English sources are represented at all; a more defensible v0.2 might specify which guideline bodies, which languages, and in what proportions.

What I’d like redlined

This is a v0.1: explicitly first-pass, explicitly open to redlining. The fastest way to improve it is for people who appraise clinical AI for a living (or who teach the methodology behind RoB and ROBINS-I) to apply it to a real product they’re already evaluating, and tell me where it bent or broke. Send notes via the contact link on the About page. v0.2 will incorporate what comes back.

This is the first of three frameworks being drafted in public. The other two, GRADE for AI-synthesized clinical claims and a PRISMA-style reporting checklist for clinical AI as an evidence synthesizer, will appear as their own dispatches as the drafts firm up.

Why this site exists

Paulina Del Mundo — Thu, 14 May 2026 00:00:00 GMT

Most writing about clinical AI right now talks about it the way physicians talk about a new drug at a sponsored dinner: lots of mechanism, a few cherry-picked endpoints, and very little discussion of what the evidence base would have to look like to actually trust it in clinic on Monday.

That’s a strange way to talk about clinical AI, because we already know how to talk about clinical evidence. Cochrane and GRADE have been doing it for decades. PRISMA tells you how to report a synthesis. ROBINS-I extends risk-of-bias grading to non-randomized studies. PROBAST and TRIPOD-AI tell you how to read prediction models. None of these tools are new. They just haven’t been imported into the conversation about clinical AI in any organized way.

This site is my attempt to do that. I’m a physician with an MPH in epidemiology and biostatistics from Johns Hopkins. I spent a chunk of my training doing systematic reviews — including one that shaped Wilms tumor chemotherapy guidelines for the Philippines. I now work as a clinical data scientist with EHR, claims, and SDoH data at scale. Evidence synthesis is the lens I already use to read studies. I want to apply it, in public, to clinical AI.

What I plan to write

Study teardowns. A new clinical AI paper comes out — I read it three ways. With the trial-evaluation toolkit (GRADE, RoB 2). With the prediction-model toolkit (PROBAST, TRIPOD-AI). With the implementation toolkit (decision-curve analysis, calibration, cost-effectiveness). The question isn’t “is the model good.” The question is “what claim does this evidence actually support.”
Framework drafts. Cochrane RoB 2 was built for randomized trials. ROBINS-I extended risk-of-bias grading to observational studies. What does an equivalent tool look like for what an LLM “read” during training? I’m going to publish v0.1s and let the rough edges show, then revise in public.
Case-study companions. Each of the project notebooks on this site already has the analytic detail. What’s missing is the narrative — methods choices, what I’d do differently, what the analysis can and can’t claim. I’ll write those next to the code.

Who this is for

If you read clinical AI papers and find yourself wishing someone would just grade the evidence, you’re the reader I’m writing for. That includes clinicians evaluating AI vendor pitches, the PMs and operators on the other side of those pitches, MPH and med students learning evidence synthesis for the first time, and the policy and journalism people who have to translate clinical AI into something coherent for a non-specialist audience.