Public datasets to practice on, in R, Python, and SAS
From Data to Bedside · a working reference
Why this page
You learn a method by running it on data, and the friction is usually finding data you are actually allowed to download, share, and rerun. This page lists public, redistributable datasets that are good for practice, maps each one to the pathway rung it exercises, and shows it loading in R, Python, and SAS so you can work in whichever language a project demands.
One fact makes the cross-language part easy. The standard exchange format in this field is the SAS Transport file (.xpt), a decades-old open format that all three languages read. NHANES, the CDISC pilot, and FDA public submission packages all ship as .xpt. SAS reads it natively; R reads it with haven::read_xpt(); Python reads it with pandas.read_sas(..., format="xport") or pyreadstat.read_xport(). So one downloaded file is a practice corpus for all three at once, with no conversion step.
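Because one downloaded file feeds all three languages, it can be worth confirming that the download really is a transport file before handing it to any reader. The sketch below checks the leading bytes against the version 5 library header record. The helper name `looks_like_xpt` and the two throwaway files are hypothetical, made up for illustration; files written in the newer version 8 layout start differently and would not pass this check.

```python
import os
import tempfile

# First 48 bytes of the 80-byte library header record in a version 5 transport file.
XPT_V5_MAGIC = b"HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!"


def looks_like_xpt(path):
    """Return True if the file starts with the version 5 transport library header."""
    with open(path, "rb") as fh:
        return fh.read(len(XPT_V5_MAGIC)) == XPT_V5_MAGIC


def _write_temp(payload):
    """Write bytes to a temporary file and return its path (illustration only)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as fh:
        fh.write(payload)
    return path


# Two made-up files: one with a transport-style header, one that is an HTML error page.
good_path = _write_temp(XPT_V5_MAGIC + b"0" * 30 + b"  ")
bad_path = _write_temp(b"<!DOCTYPE html><html>Not Found</html>")

good_ok = looks_like_xpt(good_path)
bad_ok = looks_like_xpt(bad_path)

os.remove(good_path)
os.remove(bad_path)
```

A check like this mainly catches the case where a server returns an error page under the expected file name, which otherwise surfaces later as a confusing parser error.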
A note on what is not here: MIMIC-IV is real, deidentified hospital data and is excellent, but it is credentialed and not redistributable (you sign a data use agreement and complete training), so it cannot anchor a shareable practice set. Everything below is freely downloadable.
The datasets at a glance
| Dataset | What it is | Pathway rung it exercises | Format |
|---|---|---|---|
| NHANES | Real US survey, public | Complex-sample design, disease frequency | .xpt |
| CDISC SDTM/ADaM pilot | Clinical-trial tabulation + analysis data | Data standards, trial-dataset assembly | R package / .xpt |
| Synthea | Synthetic EHR, unrestricted | Data sources, cohort assembly | CSV / FHIR |
| OMOP CDM samples | Claims/EHR in a common data model | Data sources, time-varying confounding | SQLite / DuckDB |
NHANES — real survey data
The National Health and Nutrition Examination Survey is a probability sample of the US population, released in two-year cycles. It is the obvious dataset for practicing survey-weighted estimation, because using it correctly forces you to carry its weights, strata, and primary sampling units. The CDC distributes every file as a SAS transport file. The 2017–2018 demographics file is DEMO_J.XPT.
R — the nhanesA package pulls files by name, or you can read the transport file directly:
install.packages(c("nhanesA", "haven", "survey"))
# By name, straight from CDC:
library(nhanesA)
demo <- nhanes("DEMO_J")
# Or read the .xpt yourself:
url <- "https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT"
tmp <- tempfile(fileext = ".xpt")
download.file(url, tmp, mode = "wb")
demo <- haven::read_xpt(tmp)
Python — pandas reads the transport format natively:
import urllib.request
import pandas as pd
url = "https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT"
urllib.request.urlretrieve(url, "DEMO_J.XPT")
demo = pd.read_sas("DEMO_J.XPT", format="xport")
SAS — the transport engine reads it after a local download:
/* DEMO_J.XPT downloaded to C:\nhanes\ */
libname xptin xport "C:\nhanes\DEMO_J.XPT";
proc copy in=xptin out=work; run; /* table DEMO_J now in WORK */
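Carrying the weights, strata, and primary sampling units is easier to see with a small worked case. The following is a minimal sketch, not the method of any particular package: it computes a weighted mean and a Taylor-linearized standard error under the usual with-replacement approximation for PSUs. The column names match real DEMO_J variables (WTMEC2YR, SDMVSTRA, SDMVPSU, RIDAGEYR), but the eight rows are made up, and the function name `weighted_mean_with_se` is hypothetical.

```python
import numpy as np
import pandas as pd


def weighted_mean_with_se(df, y, weight, strata, psu):
    """Weighted mean and linearized SE for a stratified, clustered sample."""
    w = df[weight].to_numpy(dtype=float)
    values = df[y].to_numpy(dtype=float)
    total_w = w.sum()
    mean = (w * values).sum() / total_w

    # Linearized score of the ratio estimator, one value per respondent.
    z = pd.Series(w * (values - mean) / total_w, index=df.index)

    # Sum the scores within each PSU, keeping the stratum label in the index.
    psu_totals = z.groupby([df[strata], df[psu]]).sum()

    variance = 0.0
    for _, totals in psu_totals.groupby(level=0):
        n_h = len(totals)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        # Between-PSU variability within the stratum.
        variance += n_h / (n_h - 1) * ((totals - totals.mean()) ** 2).sum()

    return float(mean), float(np.sqrt(variance))


# Made-up rows for illustration only; real use would pass the DEMO_J data frame.
toy = pd.DataFrame({
    "SDMVSTRA": [1, 1, 1, 1, 2, 2, 2, 2],
    "SDMVPSU": [1, 1, 2, 2, 1, 1, 2, 2],
    "WTMEC2YR": [1000.0, 1500.0, 1200.0, 800.0, 2000.0, 1800.0, 900.0, 1100.0],
    "RIDAGEYR": [34, 51, 47, 29, 62, 58, 40, 45],
})

mean_age, se_age = weighted_mean_with_se(
    toy, "RIDAGEYR", "WTMEC2YR", "SDMVSTRA", "SDMVPSU"
)
```

The point estimate depends only on the weights; the strata and PSUs enter only through the standard error, which is why dropping them gives a plausible-looking mean with the wrong uncertainty.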
CDISC SDTM/ADaM — clinical-trial data
To practice the trial side of the pipeline (the SDTM tabulation structure and the ADaM analysis structure), you want CDISC-shaped data. The cleanest source in R needs no external download at all: the pharmaversesdtm package (Apache 2.0) ships SDTM domains as R datasets, and safetyData (MIT) ships ready-made ADaM datasets repackaged from the PHUSE sample.
R — SDTM domains and a finished ADaM subject-level table:
install.packages(c("pharmaversesdtm", "safetyData", "admiral"))
library(pharmaversesdtm)
data("dm") # demographics (SDTM DM domain)
data("vs") # vital signs
data("ae") # adverse events
adsl <- safetyData::adam_adsl # a ready-made ADaM ADSL
The admiral package (the production-grade pharmaverse tool for deriving ADaM from SDTM) takes SDTM domains such as these as input and builds analysis datasets the way a real submission does.
Python — CDISC transport or .sas7bdat files read the same way as NHANES:
import pandas as pd
adsl = pd.read_sas("adsl.xpt", format="xport") # PHUSE Test Data Factory .xpt
# a native SAS dataset works too:
adsl = pd.read_sas("adsl.sas7bdat", format="sas7bdat")
SAS — this is the home format; the transport file imports directly:
libname tdf xport "adsl.xpt"; /* PHUSE Test Data Factory transport file */
proc copy in=tdf out=work; run;
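To make the SDTM-to-ADaM step concrete, here is a rough pandas sketch of the shape of a subject-level derivation. It is not admiral and not a validated derivation: it builds one row per subject from DM, takes first and last exposure dates from EX, and sets a simple safety flag. The variable names (USUBJID, ARM, EXSTDTC, EXENDTC, TRTSDT, TRTEDT, TRT01P, SAFFL) follow CDISC conventions, while the rows and the function name `derive_adsl` are made up for illustration. Real data would also need rules for partial dates, which this sketch assumes away.

```python
import pandas as pd

# Made-up SDTM-shaped inputs for illustration only.
dm = pd.DataFrame({
    "USUBJID": ["01-001", "01-002", "01-003"],
    "ARM": ["Placebo", "Drug A", "Screen Failure"],
    "AGE": [63, 71, 58],
})
ex = pd.DataFrame({
    "USUBJID": ["01-001", "01-001", "01-002"],
    "EXSTDTC": ["2014-01-02", "2014-01-16", "2014-02-03"],
    "EXENDTC": ["2014-01-15", "2014-03-01", "2014-07-20"],
})


def derive_adsl(dm, ex):
    """Build a minimal ADSL-like table: one row per subject in DM."""
    ex = ex.assign(
        start=pd.to_datetime(ex["EXSTDTC"]),
        end=pd.to_datetime(ex["EXENDTC"]),
    )
    # First and last exposure dates per subject.
    trt = (
        ex.groupby("USUBJID")
        .agg(TRTSDT=("start", "min"), TRTEDT=("end", "max"))
        .reset_index()
    )
    # Left join keeps subjects who were never dosed.
    adsl = dm.merge(trt, on="USUBJID", how="left", validate="one_to_one")
    adsl["TRT01P"] = adsl["ARM"]
    # Simplified safety flag: dosed at least once.
    adsl["SAFFL"] = adsl["TRTSDT"].notna().map({True: "Y", False: "N"})
    return adsl


adsl = derive_adsl(dm, ex)
```

The left join and the `validate="one_to_one"` check carry the key structural rule: ADSL keeps every subject from DM exactly once, whether or not they appear in EX.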
Synthea — synthetic EHR
Synthea (MITRE, Apache 2.0) generates fully synthetic but clinically plausible patient records, with no privacy restrictions because no real person is in it. It is the most freely shareable EHR-style source, which makes it the right sandbox for practicing cohort assembly: the CSV output is a set of relational tables (patients.csv, conditions.csv, encounters.csv, medications.csv) that you have to join, define an index date on, and shape into one analysis table. You can download a pre-generated sample or run the generator for any cohort size.
R
library(readr)
patients <- read_csv("csv/patients.csv")
conditions <- read_csv("csv/conditions.csv")
Python
import pandas as pd
patients = pd.read_csv("csv/patients.csv")
conditions = pd.read_csv("csv/conditions.csv")
SAS
proc import datafile="csv/patients.csv" out=work.patients dbms=csv replace; run;
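Loading the tables is only the first step; the cohort-assembly work described above is the join and the index date. The sketch below shows one way it might look in pandas, using small made-up frames in place of the real files. The column names (Id, BIRTHDATE, GENDER, START, PATIENT, CODE) follow Synthea's CSV layout; the rows, the function name `build_cohort`, and the choice of SNOMED code 44054006 as the qualifying diagnosis are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for csv/patients.csv and csv/conditions.csv.
patients = pd.DataFrame({
    "Id": ["p1", "p2", "p3"],
    "BIRTHDATE": ["1960-05-01", "1985-11-20", "1972-02-14"],
    "GENDER": ["F", "M", "F"],
})
conditions = pd.DataFrame({
    "START": ["2010-03-10", "2015-07-01", "2012-01-05", "2018-09-09"],
    "PATIENT": ["p1", "p1", "p2", "p3"],
    "CODE": [44054006, 44054006, 38341003, 44054006],
})


def build_cohort(patients, conditions, code):
    """One row per patient with the first qualifying diagnosis as the index date."""
    dx = conditions.loc[conditions["CODE"] == code].copy()
    dx["START"] = pd.to_datetime(dx["START"])

    # Index date: earliest qualifying condition record per patient.
    index_dates = (
        dx.groupby("PATIENT", as_index=False)["START"]
        .min()
        .rename(columns={"PATIENT": "Id", "START": "index_date"})
    )

    # Inner join drops patients who never meet the entry criterion.
    cohort = patients.merge(index_dates, on="Id", how="inner", validate="one_to_one")
    cohort["BIRTHDATE"] = pd.to_datetime(cohort["BIRTHDATE"])
    age_days = (cohort["index_date"] - cohort["BIRTHDATE"]).dt.days
    cohort["age_at_index"] = (age_days / 365.25).astype(int)
    return cohort


cohort = build_cohort(patients, conditions, code=44054006)
```

Taking the minimum START per patient is what fixes time zero; patient p1 has two qualifying records, and only the earlier one defines cohort entry.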
OMOP CDM samples — claims and EHR in a common model
The OHDSI community standardizes observational data into the OMOP common data model, and several sample databases are public. Eunomia (Apache 2.0) bundles a small synthetic OMOP database as a local SQLite/DuckDB file, with no server needed, so it is the fastest way to practice querying a CDM and running OHDSI tools. A larger option is the CMS DE-SynPUF data mapped to OMOP on AWS Open Data, downloadable with no account:
aws s3 ls --no-sign-request s3://synpuf-omop/
R — Eunomia hands you a connection:
install.packages("Eunomia") # pulls DatabaseConnector as a dependency
library(Eunomia)
cd <- getEunomiaConnectionDetails()
conn <- DatabaseConnector::connect(cd)
person <- DatabaseConnector::querySql(conn, "SELECT * FROM person")
Python — query the same CDM tables through DuckDB or SQLite:
import duckdb
con = duckdb.connect("eunomia.duckdb")
person = con.execute("SELECT * FROM person").df()
SAS — connect to the OMOP database through your ODBC source, then read tables with SQL:
/* libname omop set to your OMOP ODBC source */
proc sql;
create table person as select * from omop.person;
quit;
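Reading a whole table is the simplest query; practicing on a CDM usually means joining across tables by `person_id` and concept IDs. The following sketch uses Python's built-in sqlite3 with a tiny in-memory database so that it runs without Eunomia installed. The table and column names follow the OMOP CDM (person, condition_occurrence), and 8507/8532 are the standard gender concepts; the rows are made up, and the condition concept IDs are used as illustrative values rather than a vetted concept set.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# A made-up two-table slice of the CDM, for illustration only.
con.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    gender_concept_id INTEGER,
    year_of_birth INTEGER
);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER,
    condition_start_date TEXT
);
INSERT INTO person VALUES (1, 8532, 1960), (2, 8507, 1985), (3, 8532, 1972);
INSERT INTO condition_occurrence VALUES
    (10, 1, 201826, '2010-03-10'),
    (11, 1, 201826, '2015-07-01'),
    (12, 3, 201826, '2018-09-09'),
    (13, 2, 320128, '2012-01-05');
""")

# First occurrence of the target condition per person, joined to demographics.
FIRST_DX_SQL = """
SELECT p.person_id,
       p.gender_concept_id,
       MIN(co.condition_start_date) AS index_date
FROM person AS p
JOIN condition_occurrence AS co
  ON co.person_id = p.person_id
WHERE co.condition_concept_id = ?
GROUP BY p.person_id, p.gender_concept_id
ORDER BY p.person_id
"""

first_dx = con.execute(FIRST_DX_SQL, (201826,)).fetchall()
con.close()
```

Because the query uses only standard CDM names, the same SQL should carry over to a real OMOP database with little change, apart from dialect details such as date types and parameter placeholders.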
How to use these alongside the pathway
Each dataset above answers a different part of the pathway, so a good progression is to follow the rungs and pick the matching corpus. Use NHANES when the lesson is measurement and survey design; use Synthea or an OMOP sample when the lesson is assembling an observational cohort and defining time zero; use the CDISC data when the lesson is the trial pipeline and the SDTM-to-ADaM derivation. The applied traces already do this on real data: the cardiometabolic case study runs survey-weighted estimation on NHANES, and the CDISC pilot trace double-programs the SDTM-to-ADaM step in both SAS and R.