Clinical-trial programming in SAS: subject- and medication-level datasets, TFLs, and QC

From Data to Bedside · a working reference

Why this page

Clinical-trial analysis for a regulated submission is a programming discipline with its own standards: the data arrives as CDISC SDTM, the analysis datasets are derived in CDISC ADaM, every result traces back to its source, and every number is independently re-programmed before it ships. This page walks through that workflow in SAS on real CDISC domains, of the kind packaged in the pharmaversesdtm and safetyData test data. It covers the tasks that fill a programmer’s day: building the subject-level and medication-level analysis datasets, coding and deriving medication variables, generating tables, figures, and listings, and proving the output correct by double-programming.

The SDTM domains used below are the standard ones: DM (demographics), EX (exposure), CM (concomitant medications), and VS (vital signs), with their analysis counterparts ADSL (subject level) and the basic-data-structure datasets ADCM and ADVS.

SDTM in, ADaM out

The raw case-report-form data is first mapped to SDTM, one domain per kind of data, a faithful tabulation of what was collected. Analysis datasets are then derived in ADaM: a subject-level ADSL with one row per participant carrying every population flag and treatment variable, and basic-data-structure datasets carrying the derived endpoints the statistical analysis plan calls for. The governing rule is traceability: every analysis value has to trace back through ADaM to its SDTM source and the original CRF, so a reviewer can follow any number in a table to the record it came from. This is the same pipeline the pathway’s clinical-trial dataset assembly node describes in the abstract.
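
One way to make that traceability concrete, offered as a sketch under assumptions rather than as the pathway's own method: ADaM's basic data structure allows the variables SRCDOM, SRCVAR, and SRCSEQ, which name the SDTM domain, variable, and sequence number behind an analysis value. The dataset name vs_trace is hypothetical, and the sketch assumes the VS domain carries the standard VSTESTCD, VSSTRESN, and VSSEQ.

/* Sketch: carry the SDTM source of each analysis value alongside it */
data vs_trace;
  set sdtm.vs;
  length paramcd srcdom srcvar $8;
  paramcd = vstestcd;     /* analysis parameter code */
  aval    = vsstresn;     /* analysis value = standardized numeric result */
  srcdom  = "VS";         /* source domain */
  srcvar  = "VSSTRESN";   /* source variable */
  srcseq  = vsseq;        /* source record number within the subject */
  keep usubjid paramcd aval srcdom srcvar srcseq;
run;

With USUBJID and these three variables, any AVAL can be joined back to the VS record it came from.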

The subject-level dataset (ADSL)

ADSL is one row per subject and the spine every other analysis dataset merges against. It carries the planned and actual treatment, the analysis population flags, and the derived grouping variables the tables break on.

proc sort data=sdtm.dm out=dm; by usubjid; run;

data adsl;
  set dm;
  length saffl ittfl $1 agegr1 $8 trt01p trt01a $40;
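  /* Population flags from the arm variables. UPCASE guards against
     mixed-case codes such as "Scrnfail" in some test data. */
  ittfl = ifc(upcase(armcd)    not in ("", "SCRNFAIL", "NOTASSGN"), "Y", "N");
  /* NOTTRT marks a subject who was assigned but never dosed */
  saffl = ifc(upcase(actarmcd) not in ("", "SCRNFAIL", "NOTASSGN", "NOTTRT"), "Y", "N");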
  /* Planned and actual treatment for analysis */
  trt01p = arm;
  trt01a = actarm;
  /* Analysis age grouping */
  if      age = .  then agegr1 = "Missing";
  else if age < 65 then agegr1 = "<65";
  else                  agegr1 = ">=65";
run;
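
Because every later merge assumes one row per subject, it is worth checking that property once ADSL exists. A minimal sketch, with the hypothetical table name adsl_dups: list any USUBJID that appears more than once. An empty table is the expected result.

/* Sketch: any subject listed here breaks the one-row-per-subject rule */
proc sql;
  create table adsl_dups as
    select usubjid, count(*) as n_rows
    from adsl
    group by usubjid
    having count(*) > 1;
quit;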

The medication-level dataset (ADCM)

A medication-level dataset has one row per reported medication per subject, not one row per subject, so it cannot live on ADSL. It is built from the CM domain and given a treatment-emergent flag by merging the first-dose date from EX.

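/* First dose date per subject, from the exposure domain.
   Placebo doses are often recorded with EXDOSE = 0, so they are kept by name;
   filtering on EXDOSE > 0 alone would leave placebo subjects without a date. */
proc sql;
  create table firstdose as
    select usubjid,
           min(input(exstdtc, ?? yymmdd10.)) as trtsdt format=date9.
    from sdtm.ex
    where exdose > 0
       or (exdose = 0 and index(upcase(extrt), "PLACEBO") > 0)
    group by usubjid
    order by usubjid;   /* the MERGE below needs BY-sorted input */
quit;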

proc sort data=sdtm.cm out=cm; by usubjid; run;

data adcm;
  merge cm(in=incm) firstdose;
  by usubjid;
  if incm;
  length trtemfl $1;        /* IFC would otherwise set a length of 200 */
  format cmstdt date9.;
  /* Complete ISO 8601 dates only; partial dates (e.g. "2014-01") stay missing */
  if length(cmstdtc) >= 10 then cmstdt = input(cmstdtc, ?? yymmdd10.);
  /* Treatment-emergent: medication started on or after first dose */
  trtemfl = ifc(cmstdt ne . and trtsdt ne . and cmstdt >= trtsdt, "Y", "N");
run;

Dictionary coding and medication mapping

A trial records medications as free text, and analysis needs them grouped by therapeutic class. Coding maps each verbatim term to a standardized preferred term and a class, conventionally through the WHO Drug Dictionary, whose Anatomical Therapeutic Chemical (ATC) hierarchy gives the class levels a summary table groups on. The dictionary itself is licensed, but the mechanic is a lookup join, and it is the same against any reference table: match the coded term, attach the class.

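/* Verbatim CMTRT is coded to CMDECOD; attach the ATC class for grouping.
   Assumes atcref holds one row per CMDECOD, so the join cannot duplicate records. */
proc sql;
  create table adcm_coded as
    select a.*, b.atc_class
    from adcm  as a
    left join atcref as b
      on upcase(a.cmdecod) = upcase(b.cmdecod)
    order by a.usubjid, a.cmseq;
quit;

/* Replace ADCM in a separate step: a CREATE TABLE that reads its own
   target writes a recursive-reference warning to the log. */
data adcm;
  set adcm_coded;
run;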
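
A left join keeps medications that found no match in the reference table, so an uncoded or unmatched term surfaces as a missing class rather than disappearing. A minimal sketch of the follow-up check, assuming the atc_class column attached above and a hypothetical output table named uncoded:

/* Sketch: distinct reported terms that still lack an ATC class */
proc sql;
  create table uncoded as
    select distinct cmtrt, cmdecod
    from adcm
    where missing(atc_class)
    order by cmdecod, cmtrt;
quit;

Rows in this table are candidates for a coding query or a reference-table update before any summary by class is run.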

The same coding discipline applies to adverse events through MedDRA and to medical history, and it is what the pathway’s data-standards-and-provenance node means by a coded value inheriting structure before any analysis.

Tables, figures, and listings (TFLs)

The deliverables of a trial analysis are tables, figures, and listings, defined in advance by shells in the statistical analysis plan and rendered to RTF or PDF through ODS. A table summarizes (PROC REPORT over counts or statistics), a figure plots (PROC SGPLOT), and a listing prints records verbatim (PROC REPORT with no grouping).

/* TABLE: demographics by treatment, safety population */
proc freq data=adsl(where=(saffl = "Y")) noprint;
  tables trt01a * agegr1 / out=demo_n;
run;

ods rtf file="t14-1-1-demographics.rtf" style=journal;
proc report data=demo_n nowd;
  column agegr1 trt01a, count;
  define agegr1 / group  "Age group (years)";
  define trt01a / across "Treatment";
  define count  / "n";
run;
ods rtf close;

/* FIGURE: mean systolic BP over visits, by treatment, with 95% CIs */
ods graphics on;
proc sgplot data=advs(where=(paramcd = "SYSBP"));
  vline avisitn / response=aval group=trt01a stat=mean limitstat=clm;
  xaxis label="Visit";
  yaxis label="Mean systolic BP (mmHg)";
run;

/* LISTING: every concomitant medication for one subject */
proc report data=adcm(where=(usubjid = "01-701-1015")) nowd;
  column usubjid cmtrt atc_class cmstdt trtemfl;
  define cmtrt / "Reported term";
run;

Programming QC by double-programming

The credibility standard for a regulated table is that two programmers produce it independently and the results agree: exactly for character and integer values, and within a stated tolerance for computed floating-point values. The second programmer re-derives the dataset or table from the same SDTM without seeing the first program, and PROC COMPARE reconciles the two. A compare that reports no unequal values, and no observations or variables present in only one dataset, is the sign-off; any difference is a finding to resolve before the number ships.

/* Independent re-derivation (adsl_qc) reconciled against production (adsl) */
proc compare base=adsl compare=adsl_qc
             out=diffs outnoequal listall criterion=1e-8;
  id usubjid;
run;
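
For unattended reruns the same comparison can be checked programmatically. PROC COMPARE leaves a return code in the automatic macro variable SYSINFO: 0 means no differences were found, and a nonzero value is a bit mask of the kinds of difference. The sketch below assumes the same adsl and adsl_qc datasets; the macro variable name cmp_rc and the message text are illustrative only.

/* Sketch: run the compare quietly and keep its return code */
proc compare base=adsl compare=adsl_qc criterion=1e-8 noprint;
  id usubjid;
run;
%let cmp_rc = &sysinfo;   /* capture at once: a later step overwrites SYSINFO */

data _null_;
  if &cmp_rc = 0 then put "QC PASS: adsl and adsl_qc agree.";
  else put "QC FINDING: PROC COMPARE return code &cmp_rc..";
run;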

This independent re-programming is the trial-analysis form of the reproducibility the pathway’s data management and reproducibility node asks of every analysis.

Iterative updates and reruns

Trial outputs are produced many times as the data refreshes and the database moves toward lock, so the programs are written to rerun without editing. Parameterizing a step in a macro and calling it per parameter regenerates a whole family of tables on one submit, and a driver program that %INCLUDEs the dataset and output steps in order reruns the full deliverable set against the latest extract.

%macro vs_table(paramcd=, title=);
  ods rtf file="vs_&paramcd..rtf" style=journal;
  title "&title";
  proc means data=advs(where=(paramcd = "&paramcd")) n mean std;
    class trt01a avisitn;
    var aval;
  run;
  ods rtf close;
  title;   /* clear the title so it does not carry into later output */
%mend vs_table;

%vs_table(paramcd=SYSBP, title=Systolic blood pressure by visit)
%vs_table(paramcd=DIABP, title=Diastolic blood pressure by visit)
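
The driver half of that idea might look like the following sketch. The directory paths and program file names are hypothetical; the order mirrors this page, with datasets first and outputs after.

/* run_all.sas (hypothetical name): rerun every step against the latest extract */
%let progdir = /study/programs;                     /* hypothetical location */
libname sdtm "/study/data/sdtm" access=readonly;    /* hypothetical path */

%include "&progdir/adsl.sas";      /* subject-level dataset */
%include "&progdir/adcm.sas";      /* medication-level dataset */
%include "&progdir/advs.sas";      /* vital-signs analysis dataset */
%include "&progdir/t_demog.sas";   /* demographics table */
%include "&progdir/t_vs.sas";      /* vital-signs tables via the vs_table macro */

Opening the SDTM library read-only keeps a rerun from altering the source data it is meant to trace back to.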

How this maps to the pathway

This page is the SAS-side companion to two pathway nodes, clinical-trial dataset assembly and data management and reproducibility. It shares the double-programming idea with the CDISC ADaM trace, which runs in both SAS and R. For the observational and survey side of SAS programming, see survey data in SAS.

