Survey data in SAS: weights, skip patterns, and repeated-measures restructuring

From Data to Bedside · a working reference

Why this page

Complex-sample survey data does not behave like a simple random sample, and the analysis breaks in quiet ways if its design is ignored: the point estimates come out biased if the weights are dropped, and the standard errors come out wrong, usually too small, if the clustering and stratification are ignored. This page walks a full SAS workflow on real NHANES variables, from raw transport files to a design-correct estimate, covering the four tasks that recur in every survey project: pooling cycles and building the analysis weight, handling questionnaire skip patterns, restructuring repeated measures, and running the survey procedures.

The variable names below are the real ones. NHANES ships every file as a SAS transport file, so the read step is native, and the design variables are stable across cycles: SEQN (the respondent id), SDMVSTRA (the masked variance pseudo-stratum), SDMVPSU (the masked variance pseudo-PSU), WTMEC2YR (the two-year examined-sample weight), and SDDSRVYR (which cycle a record came from).

The one rule that governs everything: never subset, use a domain

The single most common survey-analysis error is to restrict to a subgroup with a WHERE clause or a subsetting IF, then run the analysis on what is left. Deleting rows changes the variance estimate, because the dropped observations still define the strata and PSUs the sample was drawn from: removing them can empty a PSU or leave a stratum with a single PSU, and the procedure then computes the variance for a design that was never fielded. The correct move is to keep every row and pass a subpopulation indicator to the procedure’s DOMAIN statement (or a TABLES cross-classification in SURVEYFREQ). Every estimation step below follows that rule.

Reading and pooling multiple cycles

A single NHANES cycle covers two years and can give unstable estimates, especially for small subgroups, so the analytic guidelines recommend pooling cycles where the analysis allows. Pooling needs a rescaled weight: when k two-year cycles are combined, the two-year weight is divided by k, so the pooled weights still sum to roughly one population rather than k of them. Here two cycles, 2015–2016 (SDDSRVYR = 9) and 2017–2018 (SDDSRVYR = 10), are stacked and given a four-year weight.

/* Read the SAS transport files NHANES distributes (one per cycle) */
libname x1516 xport "DEMO_I.XPT";   /* 2015-2016 demographics */
libname x1718 xport "DEMO_J.XPT";   /* 2017-2018 demographics */

data demo;
  set x1516.demo_i
      x1718.demo_j;
run;

/* Two two-year cycles pooled -> divide the two-year MEC weight by 2.
   Keying off SDDSRVYR keeps this correct if more cycles are added later. */
data demo;
  set demo;
  if sddsrvyr in (9, 10) then wtmec4yr = wtmec2yr / 2;
run;

The same stack-and-reweight pattern extends to six or eight years: add the cycles to the SET statement and divide by the new cycle count. Keeping SDDSRVYR on the dataset lets a later analysis check which cycles a record came from without re-reading the source.
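As a sketch of that extension, the block below adds the 2013–2014 cycle (file DEMO_H, SDDSRVYR = 8) and divides by three. The file locations, the dataset name demo6, and the weight name wtmec6yr are illustrative choices, not NHANES names. The closing PROC MEANS is only a sanity check: within each cycle, the sum of the pooled weight should be one third of the sum of the two-year weight.

```sas
/* Assumption: the three demographics transport files sit in the working directory */
libname x1314 xport "DEMO_H.XPT";   /* 2013-2014, SDDSRVYR = 8  */
libname x1516 xport "DEMO_I.XPT";   /* 2015-2016, SDDSRVYR = 9  */
libname x1718 xport "DEMO_J.XPT";   /* 2017-2018, SDDSRVYR = 10 */

data demo6;                         /* hypothetical dataset name */
  set x1314.demo_h
      x1516.demo_i
      x1718.demo_j;
  /* Three two-year cycles pooled -> divide the two-year MEC weight by 3 */
  if sddsrvyr in (8, 9, 10) then wtmec6yr = wtmec2yr / 3;
run;

/* Sanity check: per cycle, sum(wtmec6yr) should equal sum(wtmec2yr) / 3 */
proc means data=demo6 n nmiss sum maxdec=0;
  class sddsrvyr;
  var wtmec2yr wtmec6yr;
run;
```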

Skip patterns and branching logic

Questionnaires branch: a gate question routes respondents past items that do not apply to them, so the skipped items are blank by design, not missing data. Treating a by-design blank as missing throws away information and can bias a denominator. The smoking module is the canonical case. SMQ020 asks whether the respondent has smoked at least 100 cigarettes in their life; anyone who answers no skips SMQ040 (“do you now smoke?”), so SMQ040 is legitimately blank for every never-smoker. The derivation has to encode that logic rather than read the blank as unknown.

data smoke;
  set smq;   /* smq: assumed to be the SMQ_I and SMQ_J questionnaire files, read and stacked like demo */
  /* SMQ020: 1 = Yes (>=100 cigarettes), 2 = No, 7 = Refused, 9 = Don't know
     SMQ040 is ASKED ONLY when SMQ020 = 1, so its blank for SMQ020 = 2
     is a never-smoker, not a data error.
     SMQ040: 1 = every day, 2 = some days, 3 = not at all */
  length smk_status $8;
  if      smq020 = 2                      then smk_status = "Never";
  else if smq020 = 1 and smq040 in (1, 2) then smk_status = "Current";
  else if smq020 = 1 and smq040 = 3       then smk_status = "Former";
  else                                         smk_status = "";  /* refused/DK/truly missing */
run;

The discipline generalizes: for every derived variable, decide what a skipped value means and recode it explicitly, and keep the truly-missing category (refused, don’t know, not reached) separate from the skipped-by-design category. Conflating the two is one of the failures the pathway’s missing-data node warns about.
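One way to apply that discipline, sketched under assumptions: cross-tabulate the gate against the gated item with blanks shown, then store the reason for each blank in its own variable so that skipped and truly missing never share a code. The names smoke_audit and smq040_state are hypothetical, and smq is the same stacked questionnaire dataset the block above reads.

```sas
/* Gate vs. item, with blanks displayed as their own row and column.
   Every SMQ020 = 2 record should fall in the blank SMQ040 column. */
proc freq data=smq;
  tables smq020 * smq040 / missing norow nocol nopercent;
run;

data smoke_audit;                       /* hypothetical dataset name */
  set smq;
  length smq040_state $8;               /* hypothetical variable name */
  if      smq020 = 2                         then smq040_state = "Skipped";   /* blank by design */
  else if smq020 = 1 and smq040 in (1, 2, 3) then smq040_state = "Answered";
  else                                            smq040_state = "Missing";   /* refused, don't know, or blank gate */
  /* Caution: a blank SMQ020 can itself be an eligibility skip (for example an
     age gate). Check the codebook's target ages before calling it missing. */
run;

proc freq data=smoke_audit;
  tables smq040_state / missing;
run;
```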

Restructuring repeated measures

NHANES records up to four blood-pressure readings per exam (BPXSY1–BPXSY4 systolic, BPXDI1–BPXDI4 diastolic). The analytic convention is to average the readings after dropping the first, which runs high. That is a repeated-measures restructuring problem: the readings sit across columns (wide) and have to be turned into rows (long) to be filtered and summarized, then collapsed back to one value per subject. PROC TRANSPOSE is the standard tool for the reshaping.

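/* Wide -> long: one row per (subject, reading).
   bpx is assumed to be the stacked BPX_I and BPX_J exam files, sorted by SEQN
   (BY-group processing requires that order). */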
proc transpose data=bpx out=bp_long(rename=(col1=sbp)) name=reading;
  by seqn;
  var bpxsy1 bpxsy2 bpxsy3 bpxsy4;
run;

/* Recover the reading number from the variable name: "BPXSY3" -> 3 */
data bp_long;
  set bp_long;
  read_no = input(compress(reading, , "kd"), 8.);
run;

/* Long -> one row per subject: mean of readings 2-4, nonmissing only */
proc means data=bp_long noprint;
  by seqn;
  where read_no >= 2 and sbp > .;
  var sbp;
  output out=sbp_mean(drop=_type_ _freq_) mean=sbp_mean;
run;
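A quick cross-check on the transpose route, offered as a sketch: because the readings already sit in numbered columns, the MEAN function over BPXSY2-BPXSY4 gives the same per-subject average in one DATA step. The names sbp_check, sbp_mean_chk, and sbp_diff are hypothetical, and both inputs are assumed to be sorted by SEQN.

```sas
data sbp_check;                         /* hypothetical dataset name */
  set bpx;
  /* MEAN ignores missing arguments; the N() guard leaves the result missing,
     without a log note, when readings 2-4 are all missing */
  if n(of bpxsy2-bpxsy4) > 0 then sbp_mean_chk = mean(of bpxsy2-bpxsy4);
  keep seqn sbp_mean_chk;
run;

/* Rows where the two routes disagree; the expectation is an empty dataset.
   This subsetting IF filters a diagnostic table, not an estimation input. */
data sbp_diff;
  merge sbp_mean(in=a) sbp_check(in=b);
  by seqn;
  if a and b and abs(sbp_mean - sbp_mean_chk) > 1e-8;
run;
```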

The same wide-to-long step is what longitudinal and panel surveys need to stack repeated responses across waves into one record per subject-occasion, the format a mixed model or GEE reads. PROC TRANSPOSE goes the other way too (long to wide) when a procedure wants one column per occasion; the BY and ID statements control the shape.
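For the reverse direction, a minimal sketch that takes the long bp_long dataset built above back to one row per subject. BY sets the output rows, ID supplies the column suffixes, and PREFIX= names the columns. The names bp_sorted and bp_wide and the sbp_r prefix are illustrative.

```sas
/* BY-group processing needs the input ordered by subject */
proc sort data=bp_long out=bp_sorted;
  by seqn read_no;
run;

/* Long -> wide: columns sbp_r1 ... sbp_r4, one row per SEQN */
proc transpose data=bp_sorted out=bp_wide(drop=_name_) prefix=sbp_r;
  by seqn;      /* one output row per subject                 */
  id read_no;   /* reading number becomes the column suffix   */
  var sbp;      /* the value spread across the new columns    */
run;
```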

Design-correct estimation

With the analysis dataset assembled, weighted, and carrying a subpopulation flag, the survey procedures produce the estimates. Each one takes the same three design statements: STRATA, CLUSTER, and WEIGHT. The adult flag is computed for every row, then passed to DOMAIN, so the subpopulation analysis never deletes a record.

data analysis;
  merge demo smoke sbp_mean;
  by seqn;
  adult = (ridageyr >= 18);   /* computed for ALL rows, used as the domain */
run;
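Before the survey procedures run, it may be worth confirming that the merged dataset still carries a usable design. The checks below are a sketch, not part of the NHANES guidance: they list the stratum-by-PSU layout per cycle, look for any stratum with fewer than two PSUs, and summarize the pooled weight.

```sas
/* Stratum x PSU layout by cycle; missing design values would show up here */
proc freq data=analysis;
  tables sddsrvyr * sdmvstra * sdmvpsu / list missing;
run;

/* Strata with fewer than two PSUs cannot support a within-stratum variance.
   The expectation is that this query returns no rows. */
proc sql;
  select sdmvstra, count(distinct sdmvpsu) as n_psu
  from analysis
  group by sdmvstra
  having count(distinct sdmvpsu) < 2;
quit;

/* Pooled weight: how many rows have it, and its range and total */
proc means data=analysis n nmiss min max sum maxdec=0;
  var wtmec4yr;
run;
```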

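/* Mean systolic BP among adults, with a design-based 95% CI.
   NOMCAR keeps rows with a missing sbp_mean in the variance calculation
   (nonmissing cases are analyzed as a domain) instead of dropping them. */
proc surveymeans data=analysis nomcar mean clm;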
  strata  sdmvstra;
  cluster sdmvpsu;
  weight  wtmec4yr;
  domain  adult;
  var sbp_mean;
run;

/* Weighted smoking-status prevalence by adult status.
   NOMCAR: rows with a blank smk_status still count toward the variance. */
proc surveyfreq data=analysis nomcar;
  strata  sdmvstra;
  cluster sdmvpsu;
  weight  wtmec4yr;
  tables adult * smk_status / row cl;
run;

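/* Design-based regression of systolic BP on age and sex, among adults.
   RIAGENDR is a 1/2 code, so it goes in CLASS; SOLUTION prints the estimates. */
proc surveyreg data=analysis nomcar;
  strata  sdmvstra;
  cluster sdmvpsu;
  weight  wtmec4yr;
  domain  adult;
  class riagendr;
  model sbp_mean = ridageyr riagendr / solution;
run;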

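By default, PROC SURVEYMEANS and its siblings drop observations with a missing analysis variable and compute the variance as if those values were missing completely at random. The NOMCAR option on the PROC statement instead analyzes the nonmissing cases as a domain, which is consistent with the never-subset rule. When item-missingness is substantial enough to need imputation rather than available-case analysis, PROC MI does not account for the design on its own. The usual pattern is to include the design variables (stratum, cluster, weight) as predictors in the imputation model, run the survey procedure on each completed dataset with BY _Imputation_, and pool the results with PROC MIANALYZE. That is the survey-data version of the multiple imputation the NHANES cardiometabolic trace runs in R.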
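A minimal sketch of the impute, analyze, pool pattern for the systolic mean, under loud assumptions: the imputation model here is deliberately small (stratum, sex, age, and weight as predictors), the number of imputations and the seed are arbitrary, and mi_out and dom_est are hypothetical dataset names. It is a shape to adapt, not a recommended imputation model.

```sas
/* 1. Impute sbp_mean; the design variables enter the model as predictors.
      Caution: this fills EVERY missing sbp_mean, including rows where BP was
      never measured by design. A real analysis would settle which rows are
      eligible for imputation first (see the skip-pattern section). */
proc mi data=analysis nimpute=5 seed=20231 out=mi_out;
  class sdmvstra riagendr;
  fcs reg(sbp_mean);
  var sdmvstra riagendr ridageyr wtmec4yr sbp_mean;
run;

/* 2. The same design-based analysis on each completed dataset */
proc surveymeans data=mi_out mean;
  by _imputation_;
  strata  sdmvstra;
  cluster sdmvpsu;
  weight  wtmec4yr;
  domain  adult;
  var sbp_mean;
  ods output domain=dom_est;   /* one row per imputation x domain level */
run;

/* 3. Pool the adult-domain means across imputations.
      The WHERE= here filters a results table, not the survey data. */
proc mianalyze data=dom_est(where=(adult = 1));
  modeleffects mean;
  stderr stderr;
run;
```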

How this maps to the pathway

This page is the SAS-side, hands-on companion to two pathway rungs. The weighting and design logic is the method behind complex-sample design and survey weighting in the Measurement rung, and the upstream choices that produce those weights are survey sampling design and questionnaire and instrument design; the skip-pattern and restructuring work is the analytic-cohort assembly step done on survey data. For the clinical-trial side of SAS programming, see the CDISC clinical-trial programming trace, which double-programs the analysis package in SAS and R.

Sources for the NHANES specifics: NHANES analytic guidelines and the weighting tutorial, the 2017–2018 demographics file, the smoking-questionnaire codebook, and the blood-pressure codebook.