Public datasets to practice on, in R, Python, and SAS

From Data to Bedside · a working reference

Why this page

You learn a method by running it on data, and the friction is usually finding data you are actually allowed to download, share, and rerun. This page lists public, redistributable datasets that are good for practice, maps each one to the pathway rung it exercises, and shows it loading in R, Python, and SAS so you can work in whichever language a project demands.

One fact makes the cross-language part easy. The standard exchange format in this field is the SAS Transport file (.xpt), a decades-old open format that all three languages read. NHANES, the CDISC pilot, and FDA public submission packages all ship as .xpt. SAS reads it natively; R reads it with haven::read_xpt(); Python reads it with pandas.read_sas(..., format="xport") or pyreadstat.read_xport(). So one downloaded file is a practice corpus for all three at once, with no conversion step.
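As a rough illustration of that point in Python, the sketch below dispatches on file extension so the same call loads a transport file, a native SAS dataset, or a CSV. The helper name load_table is hypothetical, not part of pandas; the only library calls it relies on are pandas.read_sas and pandas.read_csv.

```python
from pathlib import Path

import pandas as pd


def load_table(path, encoding=None):
    """Load a practice dataset into a DataFrame based on its file extension.

    Hypothetical helper for illustration. With encoding=None, pandas returns
    character columns from SAS files as raw bytes; pass an encoding such as
    "latin-1" or "utf-8" to get strings instead.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".xpt":
        return pd.read_sas(path, format="xport", encoding=encoding)
    if suffix == ".sas7bdat":
        return pd.read_sas(path, format="sas7bdat", encoding=encoding)
    if suffix == ".csv":
        return pd.read_csv(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

Calling load_table("DEMO_J.XPT") on a downloaded NHANES file and load_table("csv/patients.csv") on a Synthea export would then go through the same entry point.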

A note on what is not here: MIMIC-IV is real, deidentified hospital data and is excellent, but it is credentialed and not redistributable (you sign a data use agreement and complete training), so it cannot anchor a shareable practice set. Everything below is freely downloadable.

The datasets at a glance

Dataset	What it is	Pathway rung it exercises	Format
NHANES	Real US survey, public	Complex-sample design, disease frequency	`.xpt`
CDISC SDTM/ADaM pilot	Clinical-trial tabulation + analysis data	Data standards, trial-dataset assembly	R package / `.xpt`
Synthea	Synthetic EHR, unrestricted	Data sources, cohort assembly	CSV / FHIR
OMOP CDM samples	Claims/EHR in a common data model	Data sources, time-varying confounding	SQLite / DuckDB

NHANES — real survey data

The National Health and Nutrition Examination Survey is a probability sample of the US population, released in two-year cycles. It is the obvious dataset for practicing survey-weighted estimation, because using it correctly forces you to carry its weights, strata, and primary sampling units. The CDC distributes every file as a SAS transport file. The 2017–2018 demographics file is DEMO_J.XPT.

R — the nhanesA package pulls files by name; or read the transport file directly:

install.packages(c("nhanesA", "haven", "survey"))

# By name, straight from CDC:
library(nhanesA)
demo <- nhanes("DEMO_J")

# Or read the .xpt yourself (if the download fails, look up the file's current URL on the NHANES site):
url <- "https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT"
tmp <- tempfile(fileext = ".xpt")
download.file(url, tmp, mode = "wb")
demo <- haven::read_xpt(tmp)

Python — pandas reads the transport format natively:

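import urllib.request

import pandas as pd

# If the download fails, look up the file's current URL on the NHANES site.
url = "https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT"
urllib.request.urlretrieve(url, "DEMO_J.XPT")
# Character columns load as bytes unless you pass encoding=...
demo = pd.read_sas("DEMO_J.XPT", format="xport")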

SAS — the transport engine reads it after a local download:

/* DEMO_J.XPT downloaded to C:\nhanes\ */
libname xptin xport "C:\nhanes\DEMO_J.XPT";
proc copy in=xptin out=work; run;   /* table DEMO_J now in WORK */
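To make "carry its weights, strata, and primary sampling units" concrete, here is a minimal Python sketch of a design-based mean and its standard error. It is a teaching sketch under stated assumptions, not a replacement for the R survey package or PROC SURVEYMEANS: the function survey_mean is hypothetical, the data are invented toy values, and the variance uses the usual with-replacement Taylor linearization over PSU totals within strata. The column names follow NHANES conventions (WTMEC2YR, SDMVSTRA, SDMVPSU, and BMXBMI from the body-measures file), but check them against the files you actually load.

```python
import numpy as np
import pandas as pd


def survey_mean(df, y, weight="WTMEC2YR", strata="SDMVSTRA", psu="SDMVPSU"):
    """Weighted mean of column y with a linearized (ultimate-cluster) SE.

    Assumes no missing values in y or in the design columns.
    """
    w = df[weight].to_numpy(dtype=float)
    v = df[y].to_numpy(dtype=float)
    total_w = w.sum()
    mean = (w * v).sum() / total_w

    # Linearized score for a ratio mean: each row's contribution to the error.
    z = w * (v - mean) / total_w

    # Sum the scores to PSU totals within each stratum.
    psu_totals = (
        pd.DataFrame({"h": df[strata].to_numpy(), "j": df[psu].to_numpy(), "z": z})
        .groupby(["h", "j"])["z"]
        .sum()
    )

    # Between-PSU variance within strata, summed over strata.
    var = 0.0
    for _, z_h in psu_totals.groupby(level="h"):
        n_h = len(z_h)
        if n_h > 1:  # a stratum with a single PSU contributes no estimate here
            var += n_h / (n_h - 1) * ((z_h - z_h.mean()) ** 2).sum()

    return float(mean), float(np.sqrt(var))


# Toy data: two strata, two PSUs each. Values are invented for illustration.
toy = pd.DataFrame({
    "SDMVSTRA": [1, 1, 1, 1, 2, 2, 2, 2],
    "SDMVPSU":  [1, 1, 2, 2, 1, 1, 2, 2],
    "WTMEC2YR": [100, 200, 150, 150, 300, 100, 250, 250],
    "BMXBMI":   [22, 30, 27, 25, 31, 24, 28, 26],
})

mean_bmi, se_bmi = survey_mean(toy, "BMXBMI")
print(f"weighted mean = {mean_bmi:.2f}, SE = {se_bmi:.3f}")
```

Dropping the strata and PSU columns would leave the point estimate unchanged but give a different standard error, which is why the design variables have to travel with the data.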

CDISC SDTM/ADaM — clinical-trial data

To practice the trial side of the pipeline (SDTM tabulation and ADaM analysis datasets), you want CDISC-shaped data. SDTM is the standard for the collected trial data; ADaM is the standard for the analysis-ready datasets derived from it. Together they sit behind essentially every trial submitted to the FDA. The cleanest source in R needs no external download at all: the pharmaversesdtm package (Apache 2.0) ships SDTM domains as R datasets, and safetyData (MIT) ships ready-made ADaM datasets repackaged from the PHUSE sample.

R — SDTM domains and a finished ADaM subject-level table:

install.packages(c("pharmaversesdtm", "safetyData", "admiral"))

library(pharmaversesdtm)
data("dm")    # demographics (SDTM DM domain)
data("vs")    # vital signs
data("ae")    # adverse events

adsl <- safetyData::adam_adsl   # a ready-made ADaM ADSL

The admiral package (the production-grade pharmaverse tool for deriving ADaM from SDTM) reads exactly these to build analysis datasets the way a real submission does.

Python — CDISC transport or .sas7bdat files read the same way as NHANES:

import pandas as pd

adsl = pd.read_sas("adsl.xpt", format="xport")        # PHUSE Test Data Factory .xpt
# a native SAS dataset works too:
adsl = pd.read_sas("adsl.sas7bdat", format="sas7bdat")

SAS — this is the home format; the transport file imports directly:

libname tdf xport "adsl.xpt";   /* PHUSE Test Data Factory transport file */
proc copy in=tdf out=work; run;
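For a feel of what the SDTM-to-ADaM step involves, the sketch below derives a few subject-level (ADSL-style) variables from toy DM and EX domains in Python. It is a simplified illustration, not admiral's API and not a submission-grade derivation: the function derive_adsl is hypothetical, the rows are invented, and the rule "SAFFL is Y when the subject has any exposure record" is an assumption that a real analysis plan would define for itself.

```python
import pandas as pd

# Toy SDTM-shaped inputs; values are invented for illustration.
dm = pd.DataFrame({
    "USUBJID": ["01-001", "01-002", "01-003"],
    "AGE": [64, 71, 58],
    "SEX": ["F", "M", "F"],
    "ARM": ["Placebo", "Drug A", "Screen Failure"],
})
ex = pd.DataFrame({
    "USUBJID": ["01-001", "01-001", "01-002"],
    "EXSTDTC": ["2014-01-02", "2014-01-16", "2014-02-10"],
})


def derive_adsl(dm, ex):
    """Build a minimal ADSL-like table: one row per subject in DM."""
    # First dose date per subject, from the exposure domain.
    first_dose = (
        ex.assign(EXSTDT=pd.to_datetime(ex["EXSTDTC"]))
        .groupby("USUBJID", as_index=False)["EXSTDT"]
        .min()
        .rename(columns={"EXSTDT": "TRTSDT"})
    )
    # Left join keeps every DM subject, dosed or not.
    adsl = dm.merge(first_dose, on="USUBJID", how="left", validate="one_to_one")
    adsl["TRT01P"] = adsl["ARM"]  # planned treatment taken from the arm
    # Assumed rule: safety population = any exposure record.
    adsl["SAFFL"] = adsl["TRTSDT"].notna().map({True: "Y", False: "N"})
    return adsl


adsl = derive_adsl(dm, ex)
print(adsl[["USUBJID", "TRT01P", "TRTSDT", "SAFFL"]])
```

The left join is the design choice worth noticing: ADSL carries one row per subject in DM, so subjects who were never dosed stay in the table with a missing TRTSDT and SAFFL set to N.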

Synthea — synthetic EHR

Synthea (MITRE, Apache 2.0) generates fully synthetic but clinically plausible patient records, with no privacy restrictions because no real person is in it. It is the most freely shareable EHR-style source, which makes it the right sandbox for practicing cohort assembly: the CSV output is a set of relational tables (patients.csv, conditions.csv, encounters.csv, medications.csv) that you have to join, define an index date on, and shape into one analysis table. You can download a pre-generated sample or run the generator for any cohort size.

R

library(readr)
patients   <- read_csv("csv/patients.csv")
conditions <- read_csv("csv/conditions.csv")

Python

import pandas as pd
patients   = pd.read_csv("csv/patients.csv")
conditions = pd.read_csv("csv/conditions.csv")

SAS

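proc import datafile="csv/patients.csv" out=work.patients dbms=csv replace; guessingrows=max; run;   /* scan all rows before guessing column types */
proc import datafile="csv/conditions.csv" out=work.conditions dbms=csv replace; guessingrows=max; run;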
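The cohort-assembly steps described above (join the tables, define an index date, shape one analysis table) can be sketched in Python as follows. This is a minimal example on invented in-memory rows rather than a real Synthea export. The column names Id, BIRTHDATE, GENDER, PATIENT, START, and CODE are assumed to match the Synthea CSV layout and should be checked against your generated files; the function build_cohort and the choice of condition code are illustrative only.

```python
import pandas as pd

# Toy stand-ins for patients.csv and conditions.csv; values are invented.
patients = pd.DataFrame({
    "Id": ["p1", "p2", "p3"],
    "BIRTHDATE": ["1960-05-01", "1985-11-20", "1972-02-14"],
    "GENDER": ["F", "M", "F"],
})
conditions = pd.DataFrame({
    "START": ["2015-03-10", "2012-07-01", "2019-01-05", "2018-06-30"],
    "PATIENT": ["p1", "p1", "p3", "p2"],
    "CODE": [44054006, 44054006, 44054006, 38341003],
})


def build_cohort(patients, conditions, code):
    """One row per patient with the condition; index date = first recorded START."""
    dx = conditions.loc[conditions["CODE"] == code, ["PATIENT", "START"]].copy()
    dx["START"] = pd.to_datetime(dx["START"])

    # Index date (time zero): earliest qualifying condition record per patient.
    index_dates = (
        dx.groupby("PATIENT", as_index=False)["START"]
        .min()
        .rename(columns={"PATIENT": "Id", "START": "index_date"})
    )

    # Inner join keeps only patients who meet the cohort definition.
    cohort = patients.merge(index_dates, on="Id", how="inner", validate="one_to_one")
    cohort["BIRTHDATE"] = pd.to_datetime(cohort["BIRTHDATE"])
    age_days = (cohort["index_date"] - cohort["BIRTHDATE"]).dt.days
    cohort["age_at_index"] = (age_days / 365.25).astype(int)
    return cohort[["Id", "GENDER", "index_date", "age_at_index"]]


# 44054006 is used here as an example condition code.
cohort = build_cohort(patients, conditions, code=44054006)
print(cohort)
```

Taking the minimum START per patient is what turns a many-rows-per-patient conditions table into a single index date, and validate="one_to_one" makes the merge fail loudly if that assumption is ever violated.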

OMOP CDM samples — claims and EHR in a common model

The OHDSI community standardizes observational data into the OMOP common data model, a shared table layout that lets the same analysis code run on claims or EHR data from different institutions. Several sample databases are public. Eunomia (Apache 2.0) bundles a small synthetic OMOP database as a local SQLite/DuckDB file, with no server needed, so it is the fastest way to practice querying a CDM and running OHDSI tools. A larger option is the CMS DE-SynPUF data mapped to OMOP on AWS Open Data, downloadable with no account:

aws s3 ls --no-sign-request s3://synpuf-omop/

R — Eunomia hands you a connection:

# Install DatabaseConnector explicitly; not every Eunomia release pulls it in.
install.packages(c("Eunomia", "DatabaseConnector"))
library(Eunomia)
cd   <- getEunomiaConnectionDetails()
conn <- DatabaseConnector::connect(cd)
person <- DatabaseConnector::querySql(conn, "SELECT * FROM person")
DatabaseConnector::disconnect(conn)

Python — query the same CDM tables through DuckDB or SQLite:

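import duckdb

# Path to a local OMOP DuckDB file. read_only=True makes a wrong path fail
# instead of silently creating an empty database.
con = duckdb.connect("eunomia.duckdb", read_only=True)
person = con.execute("SELECT * FROM person").df()
con.close()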

SAS — connect to the OMOP database through your ODBC source, then read tables with SQL:

/* libname omop set to your OMOP ODBC source */
proc sql;
  create table person as select * from omop.person;
quit;
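Because the OMOP table layout is shared, a cohort query written once should run against any conforming database. The Python sketch below shows the idea with the standard-library sqlite3 module and a tiny in-memory database. The rows are invented, only a handful of columns from the person and condition_occurrence tables are created, and the concept IDs (201826 and 320128 for conditions, 8507 and 8532 for gender) are used for illustration and should be verified against the OMOP vocabulary you work with.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# A cut-down subset of two OMOP CDM tables, populated with invented rows.
con.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    gender_concept_id INTEGER,
    year_of_birth INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER,
    condition_start_date TEXT
);
INSERT INTO person VALUES (1, 8532, 1960), (2, 8507, 1985), (3, 8532, 1972);
INSERT INTO condition_occurrence VALUES
    (10, 1, 201826, '2015-03-10'),
    (11, 1, 201826, '2012-07-01'),
    (12, 3, 201826, '2019-01-05'),
    (13, 2, 320128, '2018-06-30');
""")

# First occurrence of a condition concept per person: a basic index-date query.
FIRST_DX_SQL = """
SELECT p.person_id,
       p.year_of_birth,
       MIN(co.condition_start_date) AS index_date
FROM person AS p
JOIN condition_occurrence AS co
  ON co.person_id = p.person_id
WHERE co.condition_concept_id = ?
GROUP BY p.person_id, p.year_of_birth
ORDER BY p.person_id
"""

rows = con.execute(FIRST_DX_SQL, (201826,)).fetchall()
for person_id, year_of_birth, index_date in rows:
    print(person_id, year_of_birth, index_date)
```

The same SQL text could be pointed at a Eunomia file or a larger OMOP instance by changing only the connection, which is the practical payoff of the common data model.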

How to use these alongside the pathway

Each dataset above answers a different part of the pathway, so a good progression is to follow the rungs and pick the matching corpus. Use NHANES when the lesson is measurement and survey design; use Synthea or an OMOP sample when the lesson is assembling an observational cohort and defining time zero; use the CDISC data when the lesson is the trial pipeline and the SDTM-to-ADaM derivation. The applied traces already do this on real data: the cardiometabolic case study runs survey-weighted estimation on NHANES, and the CDISC pilot trace double-programs the SDTM-to-ADaM step in both SAS and R.

For end-to-end SAS workflows on this data, two companion pieces walk real code: survey data in SAS (pooling NHANES cycles, skip patterns, repeated-measures restructuring, and PROC SURVEYMEANS) and the CDISC clinical-trial programming trace (subject- and medication-level CDISC datasets, dictionary coding, TFLs, and double-programming QC in SAS and R).
