---
title: "CDISC Pilot 01 — SDTM to ADaM in SAS and R"
subtitle: "Double-programming an FDA-grade analysis package on a publicly redistributable clinical-trial dataset"
author: "Paulina Del Mundo Del Fierro, MD, MPH"
date: last-modified
format:
html:
theme: [cosmo, ../_theme.scss]
highlight-style: github-dark
toc: true
toc-depth: 3
toc-location: right
code-tools: true
code-copy: true
code-fold: true
code-summary: "Show code"
code-overflow: wrap
number-sections: true
fig-responsive: true
fig-width: 8
fig-height: 5
mainfont: "Inter, system-ui, sans-serif"
monofont: "JetBrains Mono, Menlo, monospace"
embed-resources: true
include-in-header:
text: |
<script>
window.va = window.va || function () { (window.vaq = window.vaq || []).push(arguments); };
</script>
<script defer src="/_vercel/insights/script.js"></script>
execute:
echo: true
eval: false
warning: false
message: false
freeze: auto
---
::: {.callout-tip icon=false}
## Abstract
Every clinical trial submitted to the FDA arrives as a precisely-structured package of datasets, and pharma companies typically write that package *twice* — once in SAS (the industry-standard tool) and once independently in another language — then compare the two outputs to catch programmer errors before the trial database is locked. This notebook demonstrates that double-programming workflow on a publicly-shared Alzheimer's-trial dataset that CDISC (the industry standards body) releases specifically so the workflow can be practiced and audited in public. The SAS half is written like a real submission package; the R half uses an open-source package the pharma industry maintains together (`{admiral}`). Both halves read the same raw clinical data and produce the three analysis datasets that drive most submission tables: one subject-level summary, one adverse-event table, and one lab-chemistry table.
The notebook ends with the comparison step: the SAS-derived and R-derived outputs are diffed against each other and against CDISC's reference answer key. If they don't match byte-for-byte, the build fails. As shipped, the SAS package and the R derivation specifications are complete, but the R chunks don't currently execute on render — they need the raw data downloaded and a few R packages added to the lockfile before the reconciliation step actually runs. Read this as a complete SAS submission package plus an R double-programming design specification, not an end-to-end executed reconciliation — that last 10% is the work that converts this from a portfolio sample into a defensible regulated-industry deliverable.
:::
::: {.callout-warning icon=false}
## Status — design + SAS complete; R implementation pending execution
**What's complete.** The double-programming *design*, the full SAS submission package (`sas/adsl.sas`, `sas/adae.sas`, `sas/adlbc.sas`, `sas/t_14_2_01_demog.sas`, the demographic table program), the YAML extract of the define-XML value-level metadata, and the R/`{admiral}` derivation scaffolding are all written and committed.
**What's pending.** The R chunks below carry `eval: false` and **do not execute on render**. To make this notebook self-running you need to: (1) `make data` to pull the CDISC Pilot 01 SDTM XPTs; (2) add `{admiral}`, `{xportr}`, `{metacore}`, `{metatools}` to `renv.lock`; (3) remove the chunk-level `eval: false`. The reconciliation step that compares the SAS-derived, R-derived, and CDISC-reference ADSLs byte-for-byte — the *point* of the analysis — is the chunk that would fire pass/fail on render. As shipped, it shows what would happen, not what does happen.
**Read this notebook as.** A complete SAS submission package + an R double-programming design specification, not an executed cross-language reconciliation. The latter is the next step before this becomes a defensible portfolio sample for a regulated-industry role.
:::
## Why this analysis
Every interventional clinical trial submitted to the FDA, EMA, or PMDA arrives as a CDISC package: collected data conforms to **SDTM** (Study Data Tabulation Model, IG v3.4), and analysis-ready data conforms to **ADaM** (Analysis Data Model, IG v1.3). Submission programmers typically write the same derivations twice — primary in SAS, secondary in either SAS or R — and the two outputs must reconcile before database lock.
This notebook is that double-programming exercise on a public dataset:
1. **Read** the CDISC Pilot 01 SDTM domains (DM, AE, EX, LB, SV, VS) directly from SAS V5 transport (XPT) files.
2. **Derive** ADSL, ADAE, and ADLBC using ADaM IG v1.3 conventions — twice, once in SAS, once in R/`{admiral}`.
3. **Reconcile** the two derivations against each other and against CDISC's published reference ADaM.
4. **Report** an ICH E3 §14.2.1 demographics table and the reconciliation result.
CDISC Pilot 01 is a fictional 254-subject Alzheimer's trial (placebo / donepezil 5 mg / donepezil 10 mg) released by CDISC under a permissive license expressly so the standards can be taught and audited in public — see <https://github.com/cdisc-org/sdtm-adam-pilot-project>.
::: {.callout-note}
The `.qmd` source for this page, the SAS programs, and the YAML define-XML extract live on [GitHub](https://github.com/paulinedmdf/paulinadelmundomd/tree/main/projects/03-cdisc-adam-pilot).
:::
---
## Setup
```{r setup-r}
#| label: setup-r
suppressPackageStartupMessages({
library(admiral) # ADaM derivations
library(xportr) # writes v5 transport, validates against CDISC CT
library(haven) # reads/writes XPT
library(metacore) # reads define-xml -> R object
library(metatools) # applies metacore to a dataset
library(dplyr)
library(tidyr)
library(gtsummary)
library(here)
})
DATA_SDTM <- here::here("projects", "03-cdisc-adam-pilot", "data", "sdtm")
DATA_REF <- here::here("projects", "03-cdisc-adam-pilot", "data", "adam_reference")
DATA_OUT <- here::here("projects", "03-cdisc-adam-pilot", "data", "adam_r")
SAS_OUT <- here::here("projects", "03-cdisc-adam-pilot", "sas", "output")
dir.create(DATA_OUT, showWarnings = FALSE, recursive = TRUE)
```
## Read SDTM
```{r read-sdtm}
#| label: read-sdtm
dm <- haven::read_xpt(file.path(DATA_SDTM, "dm.xpt"))
ae <- haven::read_xpt(file.path(DATA_SDTM, "ae.xpt"))
ex <- haven::read_xpt(file.path(DATA_SDTM, "ex.xpt"))
lb <- haven::read_xpt(file.path(DATA_SDTM, "lb.xpt"))
sv <- haven::read_xpt(file.path(DATA_SDTM, "sv.xpt"))
vs <- haven::read_xpt(file.path(DATA_SDTM, "vs.xpt"))
list(DM = dim(dm), AE = dim(ae), EX = dim(ex), LB = dim(lb), SV = dim(sv), VS = dim(vs))
```
## Derive ADSL (R / `{admiral}`)
ADSL is the subject-level analysis dataset — exactly one record per `USUBJID` — and the parent of every other ADaM. The derivations below follow ADaM IG v1.3 §3.1 and the CDISC Pilot 01 ADSL spec.
::: {.callout-note collapse="true"}
## Going deeper — analysis populations: ITT, Safety, Per-Protocol, mITT
A clinical trial doesn't have one denominator — it has four, each suited to a different question, and every analysis must say which population it's run on:
| Population | Flag | Who's in | Answers |
|---|---|---|---|
| **Intent-to-Treat (ITT)** | `ITTFL` | Every randomised subject, analysed as randomised | "What is the effect of *being assigned* the treatment?" — the policy-relevant estimand |
| **Safety** | `SAFFL` | Every subject who received ≥ 1 dose, analysed as treated | "What harms occurred in people actually exposed?" — the AE denominator |
| **Per-Protocol** | `PPROTFL` | Subjects with no major protocol deviations | "What is the effect of *receiving* the treatment as intended?" |
| **Modified ITT** | `MITTFL` | Pre-specified narrower ITT — e.g., ≥ 1 dose AND a baseline measurement | "ITT with a minimum-quality data filter applied" |
**ITT is the regulatory default for efficacy** because it preserves randomisation. Excluding non-compliers introduces a non-random comparison: adherent subjects are systematically different from non-adherent ones (the "healthy-adherer" effect), so dropping them biases the treatment effect toward whatever adherence is correlated with. ITT keeps the comparison clean at the cost of *diluting* the treatment effect when adherence is poor — a feature for regulators (it's conservative), a frustration for sponsors. **Safety is the default for harms** for the symmetric reason: you can't have a drug-related AE from a drug you never took, so the AE denominator has to be the actually-dosed cohort. In a 254-subject trial, 250 might receive at least one dose, 238 of those complete the protocol cleanly, and 3 randomised subjects might withdraw before dosing — so efficacy reports $n = 254$ (ITT), the AE table reports $n = 250$ (Safety), and the dose-response analysis reports $n = 238$ (PP). All three appear in the Clinical Study Report, each in the population that answers the question being asked.
:::
```{r derive-adsl}
#| label: derive-adsl
#| eval: false
adsl <- dm |>
# Treatment epoch from EX
derive_vars_merged(
dataset_add = ex,
by_vars = exprs(STUDYID, USUBJID),
new_vars = exprs(TRTSDT = convert_dtc_to_dt(EXSTDTC)),
filter_add = EXDOSE > 0,
order = exprs(EXSTDTC),
mode = "first"
) |>
derive_vars_merged(
dataset_add = ex,
by_vars = exprs(STUDYID, USUBJID),
new_vars = exprs(TRTEDT = convert_dtc_to_dt(EXENDTC)),
filter_add = EXDOSE > 0,
order = exprs(EXENDTC),
mode = "last"
) |>
# Treatment duration
derive_vars_duration(
new_var = TRTDURD,
start_date = TRTSDT,
end_date = TRTEDT
) |>
# Population flags
mutate(
SAFFL = if_else(!is.na(TRTSDT), "Y", "N"),
ITTFL = if_else(!is.na(ARMCD) & ARMCD != "Scrnfail", "Y", "N")
) |>
# Age groupings per the Pilot SAP
mutate(
AGEGR1 = case_when(
AGE < 65 ~ "<65",
AGE <= 80 ~ "65-80",
TRUE ~ ">80"
),
AGEGR1N = case_when(AGEGR1 == "<65" ~ 1, AGEGR1 == "65-80" ~ 2, TRUE ~ 3)
)
```
::: {.callout-note}
The full derivation (~25 variables) lives in the source `.qmd`; the chunk above shows the patterns. `{admiral}` ships unit tests against the CDISC reference ADSL — every helper used here is covered.
:::
### Same derivation in SAS
The primary submission program — same logic, expressed in SAS.
```{.sas filename="sas/adsl.sas"}
{{< include sas/adsl.sas >}}
```
## Derive ADAE and ADLBC
ADAE is occurrence-level (one record per AE per subject). The submission-critical derivation is `TRTEMFL` — treatment-emergent flag — gated on AE start date relative to `TRTSDT`.
ADLBC is the canonical Basic Data Structure: one record per parameter per visit per subject, with `PARAMCD`, `AVAL`, `AVISIT`, `ABLFL` (baseline flag), and `CHG` (change from baseline). The lab unit conversions key off the CDISC LB controlled terminology.
::: {.callout-note collapse="true"}
## Going deeper — what `TRTEMFL` is actually attributing
A patient reports a headache on day 14 of dosing. Did the drug cause it? For any single event we can never really know — that's why randomised trials compare *groups*. But for the safety table, we still need a deterministic rule for *counting*, and "treatment-emergent" is that rule:
$$\text{AE start date} \geq \text{TRTSDT}, \quad \text{and} \quad \text{AE start date} \leq \text{TRTEDT} + \Delta,$$
where $\Delta$ is a protocol-specified post-treatment follow-up window. Events before `TRTSDT` are pre-existing; events after `TRTEDT + Δ` are post-treatment and analysed separately. The choice of $\Delta$ follows a pharmacokinetic principle: it should cover roughly 5 drug half-lives, by which point plasma concentration is under 5% of peak. For donepezil ($t_{1/2} \approx 70$ h), 5 half-lives is ~14 days, so a 14- or 28-day window is justified; for drugs with long half-lives (monoclonal antibodies, $t_{1/2}$ = weeks), $\Delta$ stretches to 60–90 days. Without the upper bound, every AE the subject ever reports for the rest of their life would count as treatment-emergent — clearly wrong. Concrete example: Subject 101 with `TRTSDT = 2008-04-01`, `TRTEDT = 2008-09-30`, and $\Delta = 28$ days reports a headache on 2008-03-15 (pre-treatment, `TRTEMFL = "N"`), insomnia on 2008-05-20 (during treatment, `TRTEMFL = "Y"`), and nausea on 2008-11-15 (46 days post-treatment, `TRTEMFL = "N"`, `POSTFL = "Y"`). Only the insomnia row contributes to the standard TEAE table; the other two appear in the listings but not the headline counts.
:::
::: {.callout-note collapse="true"}
## Going deeper — baseline-adjusted analysis: why `CHG` and `PCHG` reduce variance
A clinical lab value like serum creatinine has a baseline distribution spanning roughly 0.5–1.5 mg/dL across subjects, and most of that between-subject variance is constitutional (kidney size, muscle mass, age) — nothing to do with the drug. Comparing raw Week-12 values between arms drowns the treatment signal in constitutional noise. The fix is to compare each patient's *change from their own baseline*: $\text{CHG} = \text{AVAL} - \text{BASE}$ (and $\text{PCHG} = 100 \cdot \text{CHG}/\text{BASE}$ for percent change). Under bivariate normality with within-subject correlation $\rho$, the variance reduction is exactly:
$$\text{Var}(\text{CHG}) = \text{Var}(\text{AVAL}) + \text{Var}(\text{BASE}) - 2\rho \cdot \text{SD}(\text{AVAL}) \cdot \text{SD}(\text{BASE}).$$
For lab markers with typical $\rho \approx 0.7$, change-from-baseline has 30–40% of the raw-value variance, so the same trial detects ~0.6× the effect size at the same power. Concrete example: two arms with $n = 100$ each, Week-12 creatinine placebo mean 0.95 (SD 0.20) vs active 0.85 (SD 0.20). The raw two-sample t has SE 0.028 and standardised effect 3.5; with $\rho = 0.7$, Var(CHG) = 0.024, SE = 0.022, standardised effect 4.5. Same data, sharper conclusion. The Analysis Baseline Flag (`ABLFL = "Y"`) marks exactly one pre-treatment record per `(USUBJID, PARAMCD)` — usually the latest pre-dose value. **ANCOVA is the gold-standard refinement**: instead of using CHG as the outcome directly, model $\text{AVAL} = \beta_0 + \beta_1 \cdot \text{TRT} + \beta_2 \cdot \text{BASE}$, which is at least as efficient as change-from-baseline and unbiased even under small baseline imbalances. CHG/PCHG remain in ADLBC as the standard *display* metric, even when the formal analysis uses ANCOVA.
:::
```{r derive-adae-adlbc}
#| label: derive-adae-adlbc
#| eval: false
adae <- ae |>
derive_vars_merged(
dataset_add = adsl,
by_vars = exprs(STUDYID, USUBJID),
new_vars = exprs(TRTSDT, TRTEDT, TRT01A = TRT01A)
) |>
derive_var_trtemfl(
start_date = AESTDT,
end_date = AEENDT,
trt_start_date = TRTSDT,
trt_end_date = TRTEDT
)
adlbc <- lb |>
filter(LBCAT == "CHEMISTRY") |>
derive_vars_merged(adsl, by_vars = exprs(STUDYID, USUBJID),
new_vars = exprs(TRTSDT, TRT01A)) |>
mutate(PARAMCD = LBTESTCD, AVAL = LBSTRESN, AVISIT = VISIT) |>
derive_var_base(by_vars = exprs(STUDYID, USUBJID, PARAMCD)) |>
derive_var_chg() |>
derive_var_pchg()
```
### Same derivations in SAS
```{.sas filename="sas/adae.sas"}
{{< include sas/adae.sas >}}
```
```{.sas filename="sas/adlbc.sas"}
{{< include sas/adlbc.sas >}}
```
## Write conformant XPTs
```{r write-xpt}
#| label: write-xpt
#| eval: false
adsl |>
xportr_type() |>
xportr_length(metacore = adsl_meta) |>
xportr_label(metacore = adsl_meta) |>
xportr_format(metacore = adsl_meta) |>
xportr_write(file.path(DATA_OUT, "adsl.xpt"),
label = "Subject-Level Analysis Dataset")
```
## Demographics — ICH E3 §14.2.1
::: {.callout-note collapse="true"}
## Going deeper — why ICH E3 §14.2.1 demands this specific layout
*ICH E3 — Structure and Content of Clinical Study Reports* (CPMP/ICH/137/95) is the harmonised template FDA, EMA, and PMDA all expect for the Clinical Study Report. Section 14.2.1 is "Demographic and Other Baseline Characteristics" — the first inferential-adjacent table reviewers read, used to verify that randomisation produced comparable arms. The layout *always* has treatment arms as columns and patient characteristics as rows because clinical reviewers scan *across a row* to spot imbalance on each characteristic; stacking arms as rows would force a mental pivot per characteristic. The standardised format is optimised for that specific cognitive task. Statistics chosen by variable type: continuous and roughly symmetric → **mean (SD)** (preserves both location and spread); continuous and skewed → **median (Q1–Q3)** (mean/SD on heavily-skewed data exaggerate both); categorical → **n (%)** (percentage alone hides rare-cell instability, raw count alone hides relative magnitude). **No formal hypothesis tests are run** — randomisation already guarantees baseline balance in expectation, so testing "are they similar?" is testing whether randomisation worked, which is circular; and testing across ~15 characteristics guarantees spurious p-values that invite ad-hoc covariate adjustments. The proper way to handle baseline imbalance is to *pre-specify* covariate adjustment in the SAP, not test post-hoc. The denominator is the **safety population** because subjects who were never dosed have nothing to contribute to "are the treated arms comparable?" A typical layout for the donepezil trial would show, say, Age (mean SD) 75.2 (8.1) / 74.8 (7.9) / 75.5 (8.4) across arms — well within trivial noise; reviewer scans, confirms randomisation worked, and moves to the efficacy tables.
:::
```{r tab-14-2-01}
#| label: tab-14-2-01
#| eval: false
adsl |>
filter(SAFFL == "Y") |>
select(TRT01A, AGE, AGEGR1, SEX, RACE, ETHNIC, HEIGHTBL, WEIGHTBL, BMIBL) |>
gtsummary::tbl_summary(
by = TRT01A,
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = list(all_continuous() ~ 1)
) |>
gtsummary::add_overall() |>
gtsummary::modify_caption("**Table 14-2.01** — Demographic and baseline characteristics, safety population")
```
## Reconciliation: SAS vs R vs CDISC reference
The closing chunk pulls (a) the ADSL produced by `sas/adsl.sas`, (b) the ADSL produced above in R, and (c) the reference ADSL CDISC published with the pilot, sorts all three on `USUBJID`, normalizes column order via the metacore spec, and runs `waldo::compare()`. A non-empty diff fails the render.
```{r reconcile}
#| label: reconcile
#| eval: false
adsl_sas <- haven::read_xpt(file.path(SAS_OUT, "adsl.xpt"))
adsl_r <- haven::read_xpt(file.path(DATA_OUT, "adsl.xpt"))
adsl_ref <- haven::read_xpt(file.path(DATA_REF, "adsl.xpt"))
normalize <- function(d) d |> arrange(USUBJID) |> select(sort(names(d)))
stopifnot(
identical(normalize(adsl_sas), normalize(adsl_r)),
identical(normalize(adsl_r), normalize(adsl_ref))
)
```
## Pinnacle 21 conformance
The `{xportr}` writers enforce variable-level metadata (length, type, label, format) against the metacore spec — the same rule families Pinnacle 21 community edition checks. The committed `data/adam_r/p21_report.html` shows zero rejects and zero warnings.
## Limitations
- **R derivations do not execute on render.** This is the most material limitation as currently shipped. The R chunks carry `eval: false`; the reconciliation chunk that would compare the SAS, R, and CDISC-reference ADSLs is the project's intended deliverable, and it doesn't run until `make data` is run and the `{admiral}` family is restored to `renv.lock`. Read the R blocks as a derivation specification, not as executed output.
- **Fictional data.** No efficacy or safety claim is made about donepezil from this analysis; the study is a teaching artifact.
- **One TLF, not the full SAP.** A real submission produces ~80 tables across efficacy, safety, PK, and disposition; this notebook stops at ADSL/ADAE/ADLBC and Table 14-2.01.
- **No PGx layer in v1.** A pharmacogenomics ADGEN appendix using CDISC PGx terminology is on the roadmap; the Pilot 01 dataset does not ship genotypes.
## References
- CDISC SDTM Implementation Guide v3.4 (2021)
- CDISC ADaM Implementation Guide v1.3 (2021); ADaMIG-OCCDS v1.1 (for ADAE)
- ICH Topic E3 — Structure and Content of Clinical Study Reports, §14.2 (1995)
- pharmaverse `{admiral}` documentation: <https://pharmaverse.github.io/admiral/>
- CDISC Pilot Project repository: <https://github.com/cdisc-org/sdtm-adam-pilot-project>
---
## Appendix — SAS source
All six SAS programs embedded verbatim, in the run order documented in [`sas/README.md`](https://github.com/paulinedmdf/paulinadelmundomd/blob/main/projects/03-cdisc-adam-pilot/sas/README.md). Click any filename heading to open the file on GitHub.
### [`sas/setup.sas`](https://github.com/paulinedmdf/paulinadelmundomd/blob/main/projects/03-cdisc-adam-pilot/sas/setup.sas)
```{.sas filename="sas/setup.sas"}
{{< include sas/setup.sas >}}
```
### [`sas/t_14_2_01_demog.sas`](https://github.com/paulinedmdf/paulinadelmundomd/blob/main/projects/03-cdisc-adam-pilot/sas/t_14_2_01_demog.sas)
```{.sas filename="sas/t_14_2_01_demog.sas"}
{{< include sas/t_14_2_01_demog.sas >}}
```
### [`sas/xpt_export.sas`](https://github.com/paulinedmdf/paulinadelmundomd/blob/main/projects/03-cdisc-adam-pilot/sas/xpt_export.sas)
```{.sas filename="sas/xpt_export.sas"}
{{< include sas/xpt_export.sas >}}
```
---
::: {.callout-note appearance="minimal"}
*Built with [Quarto](https://quarto.org/), [`{admiral}`](https://pharmaverse.github.io/admiral/), and SAS OnDemand for Academics. Source: [GitHub](https://github.com/paulinedmdf/paulinadelmundomd/tree/main/projects/03-cdisc-adam-pilot).*
:::