Chapter 2: Defining populations and computing sample sizes
Handbook of Biostatistics for Medical Research
When this chapter applies
Use this chapter when you’re designing a study where you choose who’s in it and how many. The work covered here is the second front-end task in any study: moving from a research question (Chapter 1) to a precisely defined population that can support a sample-size calculation.
A careful pass through this chapter should leave you with:
- Precise written inclusion and exclusion criteria
- A target-population statement (who you want to generalize to) and a study-population statement (who you actually have)
- A generalizability claim written as one sentence
- An expected effect size, with the source of the estimate (prior literature, pilot, clinically meaningful difference)
- An outcome variability estimate, with the same documentation discipline
- A sample size computed under three scenarios (optimistic, best estimate, conservative)
- A sample-size sensitivity table showing how N moves with each assumption
If your data source already exists (a natural experiment, a public-use file, a registry extract), the calculations in Steps 4–7 are diagnostic rather than design-driving: you compute the minimum detectable effect size your existing sample supports and write the resulting power analysis into the methods section. The reasoning is the same; the direction is reversed.
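The reversed calculation can be sketched as follows. This is a minimal illustration, assuming a two-sided comparison of means between two equal-sized groups with a common standard deviation and the usual normal approximation; the function name, the 200-per-group sample, and the SD of 1.0 are hypothetical, not taken from any case study in this handbook.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_difference(n_per_group, sd, alpha=0.05, power=0.80):
    """Smallest true mean difference detectable with the given power.

    Normal-approximation formula for a two-sided, two-sample comparison
    of means with equal group sizes and a common standard deviation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the target power
    return (z_alpha + z_beta) * sd * sqrt(2 / n_per_group)


# Hypothetical registry extract: 200 patients per group, outcome SD of 1.0
mde = minimum_detectable_difference(n_per_group=200, sd=1.0)
print(f"Minimum detectable difference: {mde:.2f} SD units")
```

With these illustrative inputs the existing sample supports detecting a difference of about 0.28 standard deviations; anything smaller would need more data than the source contains.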
The decision framework
Defining a study population and computing the sample size to study it is a sequence of seven decisions. The first three define who; the next three define how many; the seventh checks the robustness of both.
Step 1. Define inclusion and exclusion criteria precisely
Inclusion criteria specify the positive features a participant must have to enter the study. Exclusion criteria specify the disqualifying features that remove an otherwise-eligible participant. Both should be written before any data work and should be precise enough that two independent analysts applying them produce the same study population.
Inclusion criteria typically cover demographic features (age range, sex if relevant to the question, race or ethnicity if relevant), clinical state (disease, exposure, treatment status), time frame (calendar date range), and setting (inpatient or outpatient, primary or specialty care). Exclusion criteria typically cover confounding clinical state, insufficient follow-up time, missing key variables, and competing risks that would mask the outcome of interest.
The most common failure here is loose definitions that get adjusted post hoc. “Adults with diabetes” is loose; “non-pregnant adults aged 18–75 with a documented ICD-10 E11.x code or HbA1c ≥ 6.5% during the index visit, with at least 12 months of pre-index enrollment” is precise. Loose definitions create reviewer-response liability because they leave the door open to “did you mean type 1 or type 2,” “did you exclude pregnancy,” “what about pre-diabetes,” and a dozen other reasonable questions.
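One test of whether a definition is precise enough is whether it can be written as a function that two analysts would implement identically. The sketch below encodes the diabetes example; the `Candidate` fields and the `is_eligible` name are hypothetical, and it assumes that both the diagnosis code and the HbA1c value refer to the index visit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    age: int
    pregnant: bool
    has_e11_code: bool               # ICD-10 E11.x documented at the index visit
    hba1c_percent: Optional[float]   # HbA1c at the index visit, None if not measured
    pre_index_enrollment_months: int


def is_eligible(c: Candidate) -> bool:
    """Apply the inclusion and exclusion criteria from the example definition."""
    meets_diabetes_definition = c.has_e11_code or (
        c.hba1c_percent is not None and c.hba1c_percent >= 6.5
    )
    return (
        18 <= c.age <= 75
        and not c.pregnant
        and meets_diabetes_definition
        and c.pre_index_enrollment_months >= 12
    )


# Two hypothetical candidates that differ only in pre-index enrollment
enrolled_long = Candidate(54, False, False, 7.1, 18)
enrolled_short = Candidate(54, False, False, 7.1, 6)
print(is_eligible(enrolled_long), is_eligible(enrolled_short))
```

Every question a reviewer might ask about the loose definition corresponds to a line in this function, which is the point of writing the criteria down before any data work.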
Step 2. Distinguish the target population from the study population
The target population is who you want to generalize to. The study population is who you actually have. The gap between them is generalizability.
NHANES, the survey used in the cardiometabolic-risk case study, is one of the few US data sources designed to make the two nearly equivalent: the study population, when properly weighted, approximates the US civilian non-institutionalized population. Most studies aren’t built that way. A trial of a new diabetes drug in tertiary academic centers has a study population (volunteer adults at academic referral centers) that differs systematically from the target population (US adults with diabetes). The gap matters when you write up the results.
Documenting both populations in the protocol is the discipline that lets later reviewers and readers evaluate generalizability honestly.
Step 3. Specify generalizability scope
Write a one-sentence claim: “Results from this study generalize to [target population] under [conditions].” That sentence is the public statement of what your study supports.
Default to the narrower claim. A study run in three urban academic medical centers in the northeastern US should claim results that generalize to patients similar to those in three urban academic medical centers in the northeastern US, not US adults. Generalizability beyond the study population is an extrapolation, and extrapolations should be justified with reasoning, not assumed.
Step 4. Estimate the expected effect size
Sample-size calculation needs an effect size. Three sources, in rough order of credibility:
- Prior literature. Meta-analyses, systematic reviews, well-conducted similar studies. Prefer effect sizes from populations and settings close to yours.
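- Pilot data. Your own preliminary work. Useful, but small pilot studies tend to overestimate effect sizes: their estimates are noisy, and the pilots that lead to a full study are the ones that happened to look promising (a form of regression to the mean).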
- Clinically meaningful difference. The smallest effect that would matter to a clinician or a patient. Use this when prior literature is ambiguous or absent. Powering a study to detect an effect smaller than what matters clinically is wasteful; powering it for an effect larger than is plausible yields a study too small to detect the effect that actually exists.
The discipline is to write down the source of the effect-size estimate in the protocol. “d = 0.3 from Smith 2019” is documented; “d = 0.3” alone is not.
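The tendency of small pilots to overestimate can be illustrated with a short simulation. This is a sketch under stated assumptions, not a result from the literature: the true standardized effect (0.3), the pilot size (15 per arm), and the 0.4 threshold for calling a pilot promising are all hypothetical.

```python
import random
from statistics import mean, stdev


def pilot_effect_estimate(rng, true_d, n_per_arm):
    """Simulate one two-arm pilot and return its estimated Cohen's d."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_arm)]
    pooled_sd = ((stdev(control) ** 2 + stdev(treated) ** 2) / 2) ** 0.5
    return (mean(treated) - mean(control)) / pooled_sd


rng = random.Random(2024)  # fixed seed so the run is reproducible
TRUE_D = 0.3               # hypothetical true effect
PROMISING = 0.4            # hypothetical threshold for pursuing a full study

estimates = [pilot_effect_estimate(rng, TRUE_D, n_per_arm=15) for _ in range(5000)]
followed_up = [d for d in estimates if d >= PROMISING]

print(f"Mean estimate, all pilots:       {mean(estimates):.2f}")
print(f"Mean estimate, promising pilots: {mean(followed_up):.2f}")
```

Averaged over all pilots the estimate sits close to the true effect, but the subset that clears the threshold necessarily averages above it. A sample size computed from one of those pilots inherits the inflation.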
Step 5. Estimate the outcome variability
For continuous outcomes, you need the standard deviation in the target population. For dichotomous outcomes, the expected event rate in the control group. For time-to-event outcomes, a median survival or hazard rate.
Variability estimates come from the same sources as effect-size estimates: prior literature, pilot data, sometimes administrative data on the population of interest. Variability is often the most uncertain input, and underestimating it is a common source of underpowered studies that fail to detect real effects.
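For two of the three outcome types, the variability input follows directly from a single number. A minimal sketch, assuming a 0/1 outcome for the dichotomous case and a constant hazard (exponential survival) for the time-to-event case; the 20% event rate and 24-month median are hypothetical values.

```python
from math import log, sqrt


def bernoulli_sd(event_rate):
    """Standard deviation of a 0/1 outcome, fixed once the event rate is chosen."""
    return sqrt(event_rate * (1 - event_rate))


def hazard_from_median(median_survival):
    """Constant hazard implied by a median survival time (exponential model)."""
    return log(2) / median_survival


control_sd = bernoulli_sd(0.20)             # hypothetical 20% control event rate
control_hazard = hazard_from_median(24.0)   # hypothetical 24-month median survival

print(f"SD of the binary outcome: {control_sd:.2f}")
print(f"Hazard per month:         {control_hazard:.4f}")
```

Continuous outcomes have no such shortcut, which is why their standard deviation needs the same documented sourcing as the effect size.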
Step 6. Compute the sample size
For most common designs, closed-form formulas exist:
- Two-sample t-test, equal variances: any standard statistics textbook
- Two proportions: normal-approximation formula for the chi-squared test, with a continuity correction or an exact (Fisher) calculation when expected counts are small
- Linear regression with k predictors: rules of thumb (Harrell’s 10–20 observations per candidate predictor) or formal calculation
- Logistic regression: events per variable rules; formal calculation via Hsieh, Bloch, and Larsen
- Survival analysis: Freedman, Schoenfeld, or Lakatos formulas for log-rank tests
- Cluster-randomized designs: design-effect inflation of the simple-RCT sample size
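Several of the formulas in this list can be sketched in a few lines. The versions below are the textbook normal-approximation forms for a two-sided test with 1:1 allocation, written here as illustrations; the function names are hypothetical, the two-proportion formula uses unpooled variances without a continuity correction, and the event count uses the Schoenfeld form.

```python
from math import ceil, log
from statistics import NormalDist


def _z_sum(alpha, power):
    """Sum of the two normal quantiles that appears in every formula below."""
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)


def n_per_group_means(delta, sd, alpha=0.05, power=0.80):
    """Per-group n for a difference in means (normal approximation)."""
    return ceil(2 * (_z_sum(alpha, power) * sd / delta) ** 2)


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for two proportions (unpooled, no continuity correction)."""
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil(_z_sum(alpha, power) ** 2 * variance_sum / (p1 - p2) ** 2)


def schoenfeld_events(hazard_ratio, alpha=0.05, power=0.80):
    """Total events needed for a log-rank test with 1:1 allocation."""
    return ceil(4 * _z_sum(alpha, power) ** 2 / log(hazard_ratio) ** 2)


def design_effect(cluster_size, icc):
    """Inflation factor for cluster randomization with equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


n_means = n_per_group_means(delta=0.5, sd=1.0)
n_props = n_per_group_proportions(p1=0.30, p2=0.20)
events = schoenfeld_events(hazard_ratio=0.7)
n_clustered = ceil(n_means * design_effect(cluster_size=20, icc=0.05))
print(n_means, n_props, events, n_clustered)
```

With these illustrative inputs the results are 63 per group for the means, 291 per group for the proportions, 247 events for the log-rank test, and 123 per group after cluster inflation. An exact t-test calculation for the first case gives 64 per group, a reminder that the normal approximation runs slightly low at small n.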
Closed-form formulas have limits. Simulation-based sample size is the right move when the design is non-standard (multilevel, longitudinal with informative dropout, complex composite endpoints), when the outcome is non-normal in a way that closed-form formulas approximate poorly, when the effect-size or variability assumptions are uncertain and you want a sensitivity range, or when the trial includes adaptive elements.
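The structure of a simulation-based calculation is the same whatever the design: generate data under the assumed effect, run the planned analysis, and count how often it rejects. The sketch below uses the simplest possible case, a two-arm comparison of normal outcomes analyzed with a large-sample z-test, so that the answer can be checked against a closed form. The function name and all inputs are hypothetical; a real application would replace the data-generating lines and the test with the planned design and analysis.

```python
import random
from statistics import NormalDist, mean, variance


def simulated_power(n_per_group, delta, sd, alpha=0.05, n_sims=2000, seed=1):
    """Estimate power as the share of simulated trials that reject the null."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        # Data-generating step: swap in the planned design here
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(delta, sd) for _ in range(n_per_group)]
        # Analysis step: swap in the planned analysis here
        se = ((variance(control) + variance(treated)) / n_per_group) ** 0.5
        if abs(mean(treated) - mean(control)) / se > z_crit:
            rejections += 1
    return rejections / n_sims


power = simulated_power(n_per_group=63, delta=0.5, sd=1.0)
print(f"Estimated power: {power:.2f}")
```

At 63 per group and a standardized difference of 0.5, the estimate should land near the 80% that the normal-approximation formula targets, up to simulation error. Checking a simulation against a known case like this before moving to the non-standard design is a cheap way to catch coding mistakes.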
R packages for closed-form calculations: pwr for the textbook scenarios, WebPower for a broader catalog, Hmisc::popower for power under the proportional-odds model. For simulation: simr for mixed models, clusterPower for cluster designs, or custom simulations written against the planned analysis.
Step 7. Run sensitivity analyses on the calculation
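Sample size is highly sensitive to the effect-size and variability assumptions. For a comparison of means, N scales with the square of the ratio of standard deviation to effect size, so a 10% downward shift in effect size increases N by about 23%, and a 10% upward shift in variability increases it by about 21%. Always compute N under three scenarios: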
- Optimistic. Larger effect, lower variability. The smallest plausible N.
- Best estimate. Literature-based or pilot-based effect and variability. The calculation the protocol presents as its primary justification.
- Conservative. Smaller effect, higher variability. The N that protects you against assumption error.
Report all three. Where the budget allows, the planned enrollment target is typically the conservative N; the discussion can describe the range. Reviewers and IRB members read sample-size sensitivity tables as a credibility signal: they suggest the investigator understands the fragility of the calculation.
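A three-scenario table takes only a few lines once the formula is in hand. The sketch below assumes a two-sided comparison of means with equal groups and the normal approximation; the scenario values for effect size and standard deviation are hypothetical placeholders for the documented estimates from Steps 4 and 5.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Unrounded per-group n for a difference in means (normal approximation)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return 2 * (z_sum * sd / delta) ** 2


# Hypothetical scenarios: (expected mean difference, outcome SD)
scenarios = {
    "optimistic": (0.6, 0.9),
    "best estimate": (0.5, 1.0),
    "conservative": (0.4, 1.1),
}
table = {name: ceil(n_per_group(delta, sd)) for name, (delta, sd) in scenarios.items()}
for name, n in table.items():
    print(f"{name:<14} n per group = {n}")

# How much a 10% shift in a single assumption inflates N
inflation_smaller_effect = n_per_group(0.45, 1.0) / n_per_group(0.5, 1.0)
inflation_larger_sd = n_per_group(0.5, 1.1) / n_per_group(0.5, 1.0)
print(f"10% smaller effect: x{inflation_smaller_effect:.2f}")
print(f"10% larger SD:      x{inflation_larger_sd:.2f}")
```

With these inputs the table runs from 36 to 63 to 119 per group, more than a threefold spread from modest changes in two assumptions. The single-assumption ratios come out at about 1.23 and 1.21, and they compound when both assumptions move together.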
Worked example
The NHANES cardiometabolic case study is a Step 1–3 example: NHANES has its own complex sampling design, and the case study walks through how survey weights and multiple imputation handle the target-vs-study-population gap. It is not a sample-size example (the sample is fixed by the survey design); the full Chapter 2 walks through the harder case of designing a study where you choose the sample.
The Medicaid outliers case study shows peer-group definition as a related form of population specification: the question of “outlier compared to whom” is structurally the same as the question of “which population is this study designed to address.” Peer-group construction in real-world-data work is a Step 1–3 problem dressed in claims-data clothing.
Further reading
- Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Routledge, 1988.
- Chow SC, Shao J, Wang H, Lokhnygina Y. Sample Size Calculations in Clinical Research. 3rd ed. CRC Press, 2017.
- Greenland S. Generalizability of epidemiologic studies. American Journal of Epidemiology 2012; 175(7): 645–655.
- Hernán MA, Hernández-Díaz S, Robins JM. A structural approach to selection bias. Epidemiology 2004; 15(5): 615–625.
- Harrell FE. Regression Modeling Strategies. 2nd ed. Springer, 2015. (Chapter on events-per-variable rules.)
- Champely S. pwr: Basic Functions for Power Analysis. R package documentation.