Chapter 1: From research question to study design

Handbook of Biostatistics for Medical Research

When this chapter applies

Use this chapter when the study you’re planning hasn’t been formally designed yet. The work covered here is the front end of any research project: getting from “I have a clinical hypothesis” to “I have a written protocol that an IRB or a journal reviewer can evaluate.”

A careful pass through this chapter should leave you with:

  • A research-question statement precise enough to be statistically actionable
  • A PICO or PECO table specifying population, intervention or exposure, comparator, and outcome
  • For observational work, a target-trial emulation framing
  • A primary endpoint definition with explicit alternatives and the rationale for choosing one
  • A pre-registration draft (where applicable) covering primary analysis, secondary analyses, and pre-specified subgroups
  • A first-draft statistical analysis plan (SAP) outline

If any of these is missing from your current planning, this chapter applies. If they’re all already in place, you can probably skip to Chapter 2 (population definition and sample size).

The decision framework

Moving from a clinical hypothesis to a defensible study design is a sequence of seven decisions. Each one closes off a degree of freedom that you don’t want to leave open for a post hoc choice.

Step 1. State the research question precisely

Before any methods choice, write the research question as a single sentence with no compound clauses. The PICO or PECO format makes this easier: who is the population, what is the intervention or exposure, what is the comparator, what is the outcome, and over what time horizon (the T in the extended PICOT form). A research question that cannot be written as a single PICO sentence is not yet specific enough to design a study around.

A common failure mode is bundling two questions into one. “Does treatment X reduce mortality and improve quality of life in elderly patients with condition Y?” is two questions, not one. Each may call for a different sample size, a different endpoint, and possibly a different study population. Pick one as primary; demote the other to a pre-specified secondary aim.
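One way to enforce the single-sentence rule is to treat the question as a record with one slot per PICO element and refuse to render it while any slot is empty. The sketch below is a minimal illustration in Python; the `PicoQuestion` class is hypothetical, and the comparator, outcome wording, and 12-month horizon in the usage are placeholder assumptions, not recommendations.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PicoQuestion:
    """One research question: one population, intervention, comparator,
    outcome, and time horizon (hypothetical template, not a standard API)."""
    population: str
    intervention: str
    comparator: str
    outcome: str
    time_horizon: str

    def sentence(self) -> str:
        # Refuse to render a question while any element is left blank.
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError("unspecified PICO element(s): " + ", ".join(missing))
        return (
            f"In {self.population}, does {self.intervention}, compared with "
            f"{self.comparator}, affect {self.outcome} over {self.time_horizon}?"
        )


# Primary aim only; quality of life would get its own record as a secondary aim.
primary = PicoQuestion(
    population="elderly patients with condition Y",
    intervention="treatment X",
    comparator="standard of care",      # placeholder comparator
    outcome="all-cause mortality",
    time_horizon="12 months",           # placeholder horizon
)
print(primary.sentence())
# → In elderly patients with condition Y, does treatment X, compared with standard of care, affect all-cause mortality over 12 months?
```

The bundled question above becomes two such records: one registered as the primary aim and one, with quality of life as the outcome, as a pre-specified secondary aim.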

Step 2. Specify the comparator explicitly

The C in PICO carries methodological weight that’s easy to skip. Treatment X versus what? Versus placebo, versus standard of care, versus an active comparator, versus no treatment, versus a different timing or dose of the same treatment. Each comparator answers a different real-world question, and the answers can diverge.

In observational designs especially, comparator choice determines what causal effect you are estimating. The target-trial emulation framework (Hernán and Robins 2016) is the cleanest discipline here: write down the hypothetical randomized trial whose effect you’re trying to estimate, then design the observational study to approximate that trial. The framework forces you to be explicit about treatment assignment, eligibility, follow-up start, and outcome ascertainment — all the moves a real RCT would make formally and that an observational study otherwise makes implicitly.
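A target-trial protocol can be drafted as a structured record whose empty fields mark decisions that would otherwise be made implicitly. In this sketch the field names are illustrative and only loosely follow the protocol components discussed by Hernán and Robins (2016); `TargetTrialProtocol` is a hypothetical helper, and the example entries are placeholders.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class TargetTrialProtocol:
    """Protocol of the hypothetical randomized trial being emulated
    (illustrative field names, not a published schema)."""
    eligibility: Optional[str] = None
    treatment_strategies: Optional[str] = None
    assignment: Optional[str] = None
    time_zero: Optional[str] = None        # when follow-up starts
    outcome: Optional[str] = None          # including how it is ascertained
    causal_contrast: Optional[str] = None  # e.g. intention-to-treat analogue
    analysis_plan: Optional[str] = None

    def unspecified(self) -> List[str]:
        """Elements still left implicit; each is a design decision to make now."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder draft with several elements not yet written down.
draft = TargetTrialProtocol(
    eligibility="adults meeting inclusion criteria at a qualifying visit",
    treatment_strategies="initiate treatment X vs do not initiate",
    outcome="all-cause mortality",
)
print(draft.unspecified())
# → ['assignment', 'time_zero', 'causal_contrast', 'analysis_plan']
```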

Step 3. Choose primary and secondary endpoints

A primary endpoint is the one your sample-size calculation is built on and your headline statistical claim is reported against. Secondary endpoints are also pre-specified and reported, but they are typically not the basis of the headline conclusion. Three discipline points:

  1. Pre-specify all endpoints before data collection or analysis. Picking the endpoint that looked best after seeing the data (outcome switching) is a common form of analytic flexibility, and it inflates the false-positive rate of whatever claim is reported.
  2. For composite endpoints, document the components and the counting rule. Is the composite analyzed as time to the first component event, as the total count of component events, or as a hierarchical composite (Finkelstein–Schoenfeld, win ratio)? Each rule answers a slightly different clinical question.
  3. For continuous outcomes, decide up front how change will be analyzed: change from baseline, the post-treatment value adjusted for baseline (ANCOVA), or a transformation of either. Each has different statistical power and a different interpretation, and the right choice depends on the correlation between baseline and follow-up measurements.
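The power point for continuous outcomes can be made concrete with a standard large-sample result. Assume a randomized trial, equal outcome variance σ² at baseline and follow-up, and correlation ρ between the two measurements. Relative to analyzing the follow-up value alone, the variance of the treatment-effect estimate is 2(1 − ρ) for the change score and 1 − ρ² for ANCOVA. The Python sketch below only tabulates those expressions under these assumptions; it is not a power calculation.

```python
def relative_variance(rho: float) -> dict:
    """Variance of the treatment-effect estimate for three analyses of a
    continuous outcome, relative to analyzing the follow-up value alone.

    Assumes randomization, equal variance at baseline and follow-up,
    baseline/follow-up correlation rho, and large samples.
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [-1, 1]")
    return {
        "follow_up_only": 1.0,
        "change_from_baseline": 2.0 * (1.0 - rho),  # Var(Y1 - Y0) / sigma^2
        "ancova": 1.0 - rho ** 2,                   # residual variance after adjusting for Y0
    }


for rho in (0.2, 0.5, 0.8):
    v = relative_variance(rho)
    print(f"rho={rho}: " + ", ".join(f"{name}={value:.2f}" for name, value in v.items()))
```

Under these assumptions the change score beats the unadjusted follow-up value only when ρ > 0.5, and ANCOVA is never less efficient than either, since 1 − ρ² ≤ 2(1 − ρ) whenever ρ ≤ 1.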

Step 4. Pick the study design that matches the question

For most questions in clinical and health-services research, three families of design are candidates:

  • Randomized controlled trial. Highest internal validity. Use when randomization is feasible, ethical, and the question is whether the intervention works (efficacy or effectiveness).
  • Observational study with a target-trial framing. Use when randomization is infeasible (rare conditions, population-wide policies, ethical constraints) but the question is still about a causal effect. As in Step 2, the hypothetical trial is specified first and the observational study is designed to approximate it.
  • Natural experiment or quasi-experimental design. Use when a policy change, regulatory shift, or external shock provides plausibly exogenous variation. Difference-in-differences, regression discontinuity, instrumental variables, and synthetic control are the main toolkit (Chapter 3 covers the methods in detail).

The match between question and design is the highest-stakes decision in this chapter. A well-executed analysis of a mismatched design produces precise estimates of the wrong thing.

Step 5. Specify the analysis plan before seeing the data

A pre-specified statistical analysis plan covers, at minimum: the primary outcome and its statistical test; the handling of missing data; the sensitivity analyses; the subgroup analyses; and the stopping rules where applicable. Documenting the SAP before unblinding (or, in an observational study, before anyone examines outcome data) is what distinguishes a confirmatory analysis from an exploratory one, and it heads off many of the methodological objections reviewers would otherwise raise.

Pre-registration on ClinicalTrials.gov or the Open Science Framework (OSF) is the public version of this discipline: a time-stamped, publicly registered analysis plan is much harder to challenge as cherry-picked. For observational studies, OSF and AsPredicted are commonly used registries; for clinical trials in the US, registration on ClinicalTrials.gov is generally required.
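A draft SAP can be checked mechanically against the minimum contents listed for Step 5 before it is frozen. The sketch below assumes the SAP is held as a Python dictionary keyed by section name; the section labels and the `missing_sap_sections` helper are hypothetical, and stopping rules are treated as required only when interim analyses are planned.

```python
from typing import Dict, List, Optional

# Minimum SAP contents named in Step 5 (labels are illustrative).
REQUIRED_SECTIONS = (
    "primary_outcome_and_test",
    "missing_data_handling",
    "sensitivity_analyses",
    "subgroup_analyses",
)
STOPPING_RULES = "stopping_rules"  # required only if interim analyses are planned


def missing_sap_sections(sap: Dict[str, Optional[str]], has_interim_analyses: bool) -> List[str]:
    """Return the required sections that are absent, None, or blank."""
    required = list(REQUIRED_SECTIONS)
    if has_interim_analyses:
        required.append(STOPPING_RULES)
    return [name for name in required if not (sap.get(name) or "").strip()]


# Hypothetical, incomplete draft SAP.
draft_sap = {
    "primary_outcome_and_test": "All-cause mortality at 12 months; log-rank test, two-sided alpha 0.05",
    "missing_data_handling": "",
    "sensitivity_analyses": "Alternative endpoint definition; alternative comparator",
}
print(missing_sap_sections(draft_sap, has_interim_analyses=True))
# → ['missing_data_handling', 'subgroup_analyses', 'stopping_rules']
```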

Step 6. Define the population precisely enough to compute a sample size

Sample size calculation requires at least four inputs: the expected effect size, the variability of the outcome in the target population, the desired statistical power, and the significance threshold (alpha). Without a precisely defined study population, the effect-size and variability estimates are guesses. Chapter 2 covers the calculation in detail; Step 6 of this chapter is the precondition for it.
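For the simplest case, a two-sided comparison of two means with equal allocation, the four inputs combine through the normal-approximation formula n = 2σ²(z₁₋α/₂ + z₁₋β)²/δ² per group, where δ is the effect size, σ the outcome SD, and 1 − β the power. The Python sketch below assumes exactly that setting; it ignores dropout, unequal allocation, and the small-sample t correction, any of which a real calculation may need.

```python
import math
from statistics import NormalDist


def n_per_group(delta: float, sigma: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a two-sided comparison of two means
    (equal allocation, normal approximation):

        n = 2 * sigma^2 * (z_{1-alpha/2} + z_{power})^2 / delta^2
    """
    if delta <= 0 or sigma <= 0:
        raise ValueError("delta and sigma must be positive")
    z = NormalDist().inv_cdf  # standard normal quantile function
    n = 2.0 * sigma ** 2 * (z(1 - alpha / 2) + z(power)) ** 2 / delta ** 2
    return math.ceil(n)  # round up to whole participants


# Hypothetical inputs: a 5-point difference on an outcome with SD 10.
print(n_per_group(delta=5.0, sigma=10.0))  # → 63
```

With δ = 5 and σ = 10 (a standardized effect of 0.5), the formula gives 63 per group; a t-based calculation gives a slightly larger number.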

Step 7. Plan the sensitivity analyses before you need them

Chapter 4 covers sensitivity-analysis design in detail. The discipline at this step is to list, while designing the primary analysis, the methodological choices most likely to be challenged at review. Missing-data handling, alternative endpoint definitions, alternative population specifications, alternative comparator definitions, and alternative model specifications are the usual suspects. Each is a candidate for a pre-specified sensitivity analysis. Sensitivity analyses pre-specified at design time are credible; those run only after a reviewer asks for them carry much less weight.
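One way to make that list concrete is to record the primary specification and, for each contested choice, the alternatives to be run, then generate one sensitivity analysis per departure. The sketch below varies one choice at a time so each result is interpretable against the primary analysis; all labels are hypothetical placeholders for the five usual suspects, and a full factorial grid (for example via `itertools.product`) is an alternative design that multiplies the number of analyses.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical primary specification and the alternatives to pre-specify.
PRIMARY: Dict[str, str] = {
    "missing_data": "multiple_imputation",
    "endpoint": "primary_definition",
    "population": "full_eligible_cohort",
    "comparator": "standard_of_care",
    "model": "prespecified_covariate_set",
}
ALTERNATIVES: Dict[str, List[str]] = {
    "missing_data": ["complete_case"],
    "endpoint": ["alternative_definition"],
    "population": ["exclude_prior_users"],
    "comparator": ["active_comparator"],
    "model": ["extended_covariate_set"],
}


def one_at_a_time(primary: Dict[str, str],
                  alternatives: Dict[str, List[str]]) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield specifications that change exactly one choice relative to the
    primary analysis, so each departure has a single interpretation."""
    for choice, options in alternatives.items():
        for option in options:
            if option != primary[choice]:
                spec = dict(primary)
                spec[choice] = option
                yield choice, spec


plan = list(one_at_a_time(PRIMARY, ALTERNATIVES))
for changed, spec in plan:
    print(f"vary {changed}: {spec[changed]}")
print(len(plan), "pre-specified sensitivity analyses")
```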

Worked example

The Part D insulin DiD case study walks through how this seven-step decision framework was applied to a policy-evaluation question: did the Inflation Reduction Act’s $35 insulin cap actually move utilization, or just shift cost between payers? The case study shows the target-trial-emulation reasoning, the primary endpoint choice (log 30-day fills as a clean utilization measure that survives the IRA’s contemporaneous cost-sharing redesigns), the pre-specified sensitivity analyses, and the placebo design that earned its place in the protocol.
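The core arithmetic behind a difference-in-differences estimate is short. The sketch below is the canonical 2×2 version on group means, not the case study’s specification (which is not reproduced here), and the input values are invented log-scale numbers for illustration only. In a balanced 2×2 panel, an OLS regression with group, period, and group-by-period terms returns the same point estimate.

```python
import math
from statistics import fmean
from typing import Sequence


def did_2x2(treated_pre: Sequence[float], treated_post: Sequence[float],
            control_pre: Sequence[float], control_post: Sequence[float]) -> float:
    """Canonical 2x2 difference-in-differences on group means:
    (treated post - treated pre) - (control post - control pre).
    Interpretable as a causal effect only under parallel trends."""
    treated_change = fmean(treated_post) - fmean(treated_pre)
    control_change = fmean(control_post) - fmean(control_pre)
    return treated_change - control_change


# Invented values of a log-scale outcome (not case-study data).
est = did_2x2(treated_pre=[1.50, 1.60], treated_post=[1.70, 1.80],
              control_pre=[1.40, 1.50], control_post=[1.45, 1.55])
print(round(est, 3), round(math.exp(est) - 1, 3))
```

Because the outcome is on the log scale, `exp(est) - 1` reads approximately as a proportional change relative to the counterfactual trend. The estimate is causal only if the parallel-trends assumption holds, which is one of the things a placebo design helps probe.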

The Medicaid outliers case study is a different application of the same framework: research question (which providers bill in patterns that warrant a second look), comparator (every other provider billing the same code, with the explicit caveat that the peer group is too broad), primary outcome (paid-per-beneficiary on a robust z-score scale), and the explicit pre-specification of robustness checks (isolation-forest triangulation, BH-FDR multiplicity, peer-group sensitivity).
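The two quantitative pieces of that design can be sketched in a few lines. The version below assumes the usual median/MAD definition of a robust z-score (with the 1.4826 consistency factor), converts scores to one-sided upper-tail p-values against a standard normal reference (an assumption; skewed payment data may need a log transform first), and applies the Benjamini–Hochberg step-up adjustment. The helper names are hypothetical and the peer-group figures are invented, not taken from the case study.

```python
from statistics import NormalDist, median
from typing import List, Sequence


def robust_z(values: Sequence[float]) -> List[float]:
    """Robust z-scores (x - median) / (1.4826 * MAD); 1.4826 makes the MAD
    a consistent estimate of the SD when the data are normal."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero: robust z-scores are undefined for this peer group")
    scale = 1.4826 * mad
    return [(v - med) / scale for v in values]


def upper_tail_p(z_scores: Sequence[float]) -> List[float]:
    """One-sided p-values for unusually HIGH values, assuming a N(0, 1) reference."""
    std_normal = NormalDist()
    return [1.0 - std_normal.cdf(z) for z in z_scores]


def bh_adjust(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented paid-per-beneficiary figures for one billing code's peer group.
paid_per_bene = [210.0, 195.0, 250.0, 230.0, 205.0, 240.0, 220.0, 1900.0]
q_values = bh_adjust(upper_tail_p(robust_z(paid_per_bene)))
flagged = [i for i, q in enumerate(q_values) if q < 0.05]
print(flagged)  # → [7]
```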

Further reading

  • Hernán MA, Robins JM. Using big data to emulate a target trial when a randomized trial is not available. American Journal of Epidemiology 183(8): 758–764. 2016.
  • Schulz KF, Altman DG, Moher D, for the CONSORT Group. CONSORT 2010 statement: Updated guidelines for reporting parallel group randomised trials. BMJ 340: c332. 2010.
  • von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP, STROBE Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: Guidelines for reporting observational studies. Lancet 370(9596): 1453–1457. 2007.
  • Pocock SJ, Stone GW. The primary outcome is positive — is that good enough? New England Journal of Medicine 375(10): 971–979. 2016.
  • Ioannidis JPA. Why most published research findings are false. PLoS Medicine 2(8): e124. 2005.

Get the methods by email

This chapter is part of the free methods reference on this site. The Confounder delivers the same methodological spine to your inbox, one piece at a time, alongside shorter dispatches on new research and methods. Free, roughly every other week.
