Statistics and Peer Review

Statistical Power, Sample Size, and Their Reporting in Randomized Controlled Trials

(JAMA. 1994;272:122-124)

David Moher, MSc; Corinne S. Dulberg, PhD, MPH; George A. Wells, PhD

Objective.--To describe the pattern over time in the level of statistical power and the reporting of sample size calculations in published randomized controlled trials (RCTs) with negative results.

Design.--Our study was a descriptive survey. Power to detect 25% and 50% relative differences was calculated for the subset of trials with negative results in which a simple two-group parallel design was used. Criteria were developed both to classify trial results as positive or negative and to identify the primary outcomes. Power calculations were based on results from the primary outcomes reported in the trials.

Population.--We reviewed all 383 RCTs published in JAMA, Lancet, and the New England Journal of Medicine in 1975, 1980, 1985, and 1990.

Results.--Twenty-seven percent of the 383 RCTs (n=102) were classified as having negative results. The number of published RCTs more than doubled from 1975 to 1990, with the proportion of trials with negative results remaining fairly stable. Of the simple two-group parallel design trials having negative results with dichotomous or continuous primary outcomes (n=70), only 16% and 36% had sufficient statistical power (80%) to detect a 25% or 50% relative difference, respectively. These percentages did not consistently increase over time. Overall, only 32% of the trials with negative results reported sample size calculations, but the percentage doing so has improved over time from 0% in 1975 to 43% in 1990. Only 20 of the 102 reports made any statement related to the clinical significance of the observed differences.

Conclusions.--Most trials with negative results did not have large enough sample sizes to detect a 25% or a 50% relative difference. This result has not changed over time. Few trials discussed whether the observed differences were clinically important. There are important reasons to change this practice. The reporting of statistical power and sample size also needs to be improved.



THE EFFICACY of new interventions is most readily accepted if the results are from randomized controlled trials (RCTs).[1] Essential to the planning of an RCT is estimation of the required sample size. The investigator should ensure that there is sufficient power to detect, as statistically significant, a treatment effect of an a priori specified size. Viewed from the opposite perspective, the investigator should ensure that there is a low beta value, ie, a low probability of making a type II error: concluding that the result is not statistically significant when the observed effect is clinically meaningful.

The relationship between negative findings (ie, when statistical significance was not reached) and statistical power has been well illustrated in Freiman and colleagues'[2] review of 71 RCTs with negative results published during 1960 to 1977. These RCTs were drawn from a collection of 300 simple two-group parallel design trials with dichotomous primary outcomes published in 20 journals. Freiman and colleagues were interested in assessing whether RCTs with negative results had sufficient statistical power to detect a 25% and a 50% relative difference between treatment interventions. Their review indicated that most of the trials had low power to detect these effects as statistically significant (alpha=.05, one tailed): only 7% (5/71) had at least 80% power to detect a 25% relative change between treatment groups, and only 31% (22/71) had at least 80% power to detect a 50% relative change.

Since its publication, the report of Freiman and colleagues[2] has been cited more than 700 times, possibly indicating the seriousness with which investigators have taken the findings. Given this citation record, one might expect an increase over time in the awareness of the consequences of low power in published RCTs and, hence, an increase in the reporting of sample size calculations made before clinical trials are conducted.

Our objective was to describe the pattern, over time, in the level of power and in the reporting of sample size calculations in published RCTs with negative results since the publication of Freiman and colleagues'[2] report. We did this by extending Freiman and colleagues' objectives to the period from 1975 to 1990.

METHODS

Study Sample

We reviewed RCTs published in JAMA, Lancet, and the New England Journal of Medicine. These were three of the 20 journals reviewed by Freiman and colleagues, and together they accounted for more than half (36/71) of the trials with negative results in that report.[2] To capture the denominator, each volume published in 1975, 1980, 1985, and 1990 was hand-searched to extract the RCTs. To be considered an RCT, the study being assessed had to contain an explicit statement about randomization. The identified RCTs were consecutively coded and divided into three equal groups for review by members of the study team, with use of a structured data collection form. The data collected included information on whether the trial results were positive or negative, the study design, whether a priori or post hoc sample size calculations were performed, and the elements necessary for us to calculate power (eg, observed proportions or means and SDs of the primary outcome obtained in each group).

Closely following Freiman and colleagues'[2] selection criteria, we performed our power calculations on the subset of the trials with negative results in which a simple two-group parallel design was used. However, rather than calculating power only for trials with dichotomous outcomes, we also calculated power for trials with continuous primary outcomes. We calculated each trial's power to detect a 25% and a 50% relative change, with an alpha value of .05, using a z test or a t test, as appropriate for the scale of measurement of the primary outcome. Our calculations differed from those of Freiman and colleagues in that we employed a two-tailed rather than a one-tailed alpha. A standard program[3] was used for our power calculations. We also calculated the percentages over time of trials reporting sample size calculations.
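As an illustration only, the kind of power calculation described above for a dichotomous outcome can be sketched with a normal approximation to the z test for two proportions. This is not the program cited in the text; the function name, the treatment of a "relative difference" as a proportional reduction in the control event rate, and the example values are all assumptions made for the sketch.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, relative_difference, n_per_group,
                          alpha=0.05, two_tailed=True):
    """Approximate power of a z test comparing two independent proportions.

    Assumption: the relative difference is a proportional reduction in the
    control event rate, and both groups have the same size.
    """
    std_normal = NormalDist()
    p_treat = p_control * (1 - relative_difference)

    # Critical value of the test (two-tailed splits alpha across both tails).
    tails = 2 if two_tailed else 1
    z_crit = std_normal.inv_cdf(1 - alpha / tails)

    # Standard error of the difference in proportions under the alternative.
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_treat * (1 - p_treat) / n_per_group)

    # Probability of rejecting in the expected direction; the negligible
    # probability of rejecting in the wrong direction is ignored.
    return std_normal.cdf(abs(p_control - p_treat) / se - z_crit)


if __name__ == "__main__":
    # Hypothetical trial: 30% control event rate, 100 patients per group.
    for rel in (0.25, 0.50):
        power = power_two_proportions(0.30, rel, 100)
        print(f"relative difference {rel:.0%}: power = {power:.2f}")
```

Setting `two_tailed=False` gives the one-tailed calculation, which always yields somewhat higher power than the two-tailed one at the same alpha.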

Selection of Trials and Identification of Primary Outcome

For an RCT to be classified as having negative results, there had to be an explicit statement in the text that negative results had been obtained. When an explicit statement was missing, classification of a trial as having negative results required identification of the primary outcome measure. As Pocock et al[4] observed in 1987, primary outcome measures are not usually clearly specified. Encountering the same problem, we specified a series of decision rules to select the primary outcome. If an article reported a sample size calculation, the outcome used in the calculation was taken as the primary outcome. Published descriptive statistics on this variable were then used in our power calculations. If sample size calculations were not reported and multiple outcomes were evaluated, at least 50% of the results of statistical tests had to be nonsignificant for the RCT to be classified as having negative results. Among the multiple outcomes, the most serious was identified as the primary outcome. For example, if outcomes included both disease-free survival and overall mortality, mortality was considered to be the primary outcome.

RESULTS

Description of Study Sample

A total of 393 RCTs were published in JAMA, Lancet, and the New England Journal of Medicine during the 4 years of our review. Ten trials were excluded from the analysis for the following reasons: results based on invalid statistical analyses precluded classification of the trial (n=5), randomization was not explicit or not all patients were randomized (n=2), and no statistical analysis was performed (n=3).

Twenty-seven percent (n=102) of the remaining 383 trials had negative results. Although the number of RCTs published has more than doubled between 1975 and 1990 (67 vs 148), the percentage that had negative results has remained fairly stable over time: 33%, 27%, 25%, and 25% in 1975, 1980, 1985, and 1990, respectively.

Statistical Power

We calculated power for 70 of the 102 RCTs with negative results that employed a simple two-group parallel design with dichotomous (n=52) and continuous (n=18) primary outcomes. The Table presents the distribution, over time, in the percentage of trials that had at least 80% power (alpha=.05, two tailed) to detect two effect sizes: a relative difference between treatment groups of 25% and of 50%. Overall, only 16% and 36% of the trials had at least 80% power to detect a 25% or a 50% relative difference, respectively. These figures have not consistently improved over time.
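For a continuous primary outcome, the same question (is there at least 80% power to detect a 25% or 50% relative difference?) can be sketched as follows. This is a rough illustration, not the calculation actually performed: it uses a normal approximation in place of the t test, assumes a common SD and equal group sizes, and takes the relative difference as a fraction of the control group mean. The function name and example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(mean_control, sd, relative_difference, n_per_group,
                    alpha=0.05):
    """Approximate two-tailed power for comparing two independent means.

    Assumption: normal approximation to the t test, which slightly
    overstates power when the groups are small.
    """
    std_normal = NormalDist()
    # Absolute difference implied by the relative difference.
    delta = abs(mean_control * relative_difference)
    # Standard error of the difference between two means with a common SD.
    se = sd * sqrt(2 / n_per_group)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return std_normal.cdf(delta / se - z_crit)


if __name__ == "__main__":
    # Hypothetical trial: control mean 100, SD 40, 30 patients per group.
    for rel in (0.25, 0.50):
        power = power_two_means(100.0, 40.0, rel, 30)
        verdict = "adequate" if power >= 0.80 else "inadequate"
        print(f"relative difference {rel:.0%}: power = {power:.2f} ({verdict})")
```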

Sample Size Calculations

Among the 102 trials with negative findings, only 32% (n=33) reported a sample size calculation. While this number is small, the situation has improved over time since 1980. None of the 22 trials with negative results published in 1975 was found to have included a sample size calculation, 32% (7/22) did so in 1980, 48% (10/21) did so in 1985, and 43% (16/37) did so in 1990. Only 20 (20%) of the 102 trials with negative results made any kind of statement about the clinical significance of the results with respect to the observed statistical differences between the treatment groups.

An examination of the 33 trials with negative results with sample size calculations revealed serious deficiencies in the reporting of the variables essential for these calculations. No trial indicated the statistical test on which the calculation was based. Only 45% reported the control group event rate, but 79% specified the power level. Slightly more than half (58%) reported the alpha level, but few (18%) indicated whether the alpha value was one tailed or two tailed. In fact, in only 30% of these trials was there sufficient detail to enable us to replicate the reported calculated sample size.

COMMENT

If a trial with negative results has a sufficient sample size to detect a clinically important effect, then the negative results are interpretable--the treatment did not have an effect at least as large as the effect considered to be clinically relevant. If a trial with negative results has insufficient power, a clinically important but statistically nonsignificant effect is usually ignored or, worse, is taken to mean that the treatment under study made no difference. Thus, there are important scientific reasons to report sample size and/or power calculations.

There are also ethical reasons to estimate sample size when planning a trial. Altman[5] noted that ethics committees may not want to approve the rare oversized trial because of the unnecessary costs and involvement of additional patients. More commonly, ethics committees may not want to approve trials that are too small to observe clinically important differences, because, as Altman put it, such a trial may "be scientifically useless, and hence unethical in its use of subjects and other resources."

Our results indicate that most trials with negative results had too few patients to detect a relative difference of 25% or 50% with sufficient statistical power and that this has not changed over time. Despite our unique set of decision rules to identify trials with negative results and primary outcomes, the results are similar to those originally reported by Freiman and colleagues[2] 16 years ago.

Our observation that most trials do not report sample size calculations is consistent with other descriptive surveys of RCTs that did not focus solely on trials with negative results. DerSimonian and colleagues[6] and Pocock and colleagues[4] evaluated general methodologic and statistical problems of clinical trials published in 1979 and 1985, respectively. Both reports found that statistical power was discussed in only about 12% of the published RCTs selected for review. More recently, Altman and Dore[7] reported that 39% of a convenience

Investigators might plan required sample sizes without including this information in the published reports, but this does not appear to be the case. After personally contacting principal authors, Liberati and colleagues[8] discovered that only a very small percentage had conducted sample size calculations but not included this information in the published report.

Another explanation for the absence of sample size calculations is a lack of understanding of the concept of effect size. Indeed, one of the most challenging aspects of sample size planning is determining a clinically important effect. The 25% and 50% relative differences on which our power calculations and those of Freiman and colleagues[2] were based may, in fact, represent very large differences. More modest but clinically important treatment effects would necessitate trials with substantially larger sample sizes. Reviewing the cardiovascular literature to evaluate the magnitude of treatment effects, Yusuf and colleagues[9] found that relatively small treatment effects should be expected.
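To make concrete how quickly the required sample size grows as the effect of interest shrinks, the following sketch applies the usual normal-approximation formula for comparing two proportions, n per group = (z for alpha + z for power)^2 x [p1(1-p1) + p2(1-p2)] / (p1-p2)^2. The 30% control event rate, the function name, and the interpretation of the relative difference as a reduction in the event rate are assumptions chosen for illustration, not values taken from the trials reviewed.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(p_control, relative_difference,
                         alpha=0.05, power=0.80):
    """Approximate patients needed per group for a two-tailed z test
    comparing two proportions (equal group sizes assumed)."""
    std_normal = NormalDist()
    p_treat = p_control * (1 - relative_difference)

    z_alpha = std_normal.inv_cdf(1 - alpha / 2)   # two-tailed critical value
    z_power = std_normal.inv_cdf(power)           # quantile for desired power

    variance_sum = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    n = (z_alpha + z_power) ** 2 * variance_sum / (p_control - p_treat) ** 2
    return ceil(n)


if __name__ == "__main__":
    # Hypothetical control event rate of 30%.
    for rel in (0.50, 0.25, 0.10):
        n = required_n_per_group(0.30, rel)
        print(f"relative difference {rel:.0%}: about {n} patients per group")
```

Under these assumptions, halving the relative difference from 50% to 25% more than quadruples the required number of patients per group, and a 10% relative difference requires several thousand per group.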

A third possible reason that most trials fail to report sample size calculations is the view that planning sample size is unnecessary because RCTs, whatever their outcome, are invaluable to systematic reviews (meta-analyses).[10] Even if one holds this view, it does not preclude the value of publishing post hoc calculation of the power of a study with negative results to detect a clinically important difference. This calculation would enable the reader to make a more informed judgment as to the clinical relevance of the observed absence of statistical significance. Furthermore, evidence indicates that because of publication bias,[11] studies with positive results are more likely to be published than are those with negative results. As a consequence, unpublished trials with negative results might not enter into a systematic review.

In a recent commentary, Cohen[12] found the absence of power calculations in published psychological research to be inexplicable. He suggested that this problem could exemplify the "slow movement of methodological advance" or could reflect difficulties researchers face in performing the appropriate calculations. He also commented that the "passive acceptance of this state of affairs by editors and reviewers is even more of a mystery."

Our results indicate that even when sample size calculations were published, the details necessary to replicate the calculations were missing in most cases. These deficiencies are consistent with a general concern about the quality of reporting of trials.[13] [14] Reports of RCTs should provide readers with detailed information about the design, execution, analysis, and interpretation of the trial and its findings. A minimum set of required information, ie, "structured reporting," would help readers to evaluate the validity of a trial. A similar approach, more informative abstracts, has had a positive impact on how the results of abstracts are communicated.[15]

We propose that authors should report sample size calculations and that the following information should be contained in all published reports of RCTs: (1) The primary dependent measure(s) should be clearly identified. (2) A clinically important treatment effect should be specified. (3) The treatment effect should be clearly indicated as being an absolute or a relative difference. (4) The statistical test, directionality, alpha level, and statistical power used to estimate sample size should be reported. If sample size calculations were not conducted a priori, a published report of an RCT with negative results should provide post hoc statistical power calculations to detect a clinically important difference. The benefits of including this information are clearly worth the extra space required in the publications.
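One way to picture the proposed reporting items is as a simple checklist record that flags what a published report leaves out. The sketch below is purely illustrative: the class name, field names, and example values are hypothetical and are not part of any reporting standard.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SampleSizeReport:
    """Items a reader would need to replicate a sample size calculation.

    A value of None means the item was not stated in the published report.
    """
    primary_outcome: Optional[str] = None       # item 1
    treatment_effect: Optional[float] = None    # item 2
    effect_is_relative: Optional[bool] = None   # item 3 (False = absolute)
    statistical_test: Optional[str] = None      # item 4
    tails: Optional[int] = None                 # item 4 (directionality)
    alpha: Optional[float] = None               # item 4
    power: Optional[float] = None               # item 4

    def missing_items(self) -> List[str]:
        """Names of the items the report did not state."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def is_replicable(self) -> bool:
        """True only when every item has been reported."""
        return not self.missing_items()


if __name__ == "__main__":
    # Hypothetical report that gives only the outcome, alpha, and power.
    report = SampleSizeReport(primary_outcome="overall mortality",
                              alpha=0.05, power=0.80)
    print("replicable:", report.is_replicable())
    print("missing:", ", ".join(report.missing_items()))
```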


From the Clinical Epidemiology Unit, Loeb Medical Research Institute (Mr Moher and Dr Wells), and the Faculties of Medicine (Mr Moher and Dr Wells) and Health Sciences (Dr Dulberg), University of Ottawa, Ottawa, Ontario.

Presented in part at the Second International Congress on Peer Review in Biomedical Publication, Chicago, Ill, September 10, 1993.

Reprint requests to the Clinical Epidemiology Unit, Loeb Medical Research Institute, Ottawa Civic Hospital, 1053 Carling Ave, Ottawa, Ontario, Canada K1Y 4E9 (Mr Moher).


References

1. Cook DJ, Guyatt GH, Laupacis A, Sackett DL. Rules of evidence and clinical recommendations on the use of antithrombotic agents. Chest. 1992;102(suppl):305S-311S.

2. Freiman JA, Chalmers TC, Smith H, Kuebler RR. The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial: survey of 71 'negative' trials. N Engl J Med. 1978;299:690-694.

3. Dupont WD, Plummer WD. Power and sample size calculations: a review and computer program. Controlled Clin Trials. 1990;11:116-128.

4. Pocock SJ, Hughes MD, Lee RJ. Statistical problems in the reporting of clinical trials: a survey of three medical journals. N Engl J Med. 1987;317:426-432.

5. Altman DG. Statistics and ethics in medical research, III: how large a sample? BMJ. 1980;281:1336-1338.

6. DerSimonian R, Charette LJ, McPeek B, Mosteller F. Reporting on methods in clinical trials. N Engl J Med. 1982;306:1332-1337.

7. Altman DG, Dore CJ. Randomization and baseline comparisons in clinical trials. Lancet. 1990;335:149-153.

8. Liberati A, Himel HN, Chalmers TC. A quality assessment of randomized control trials of primary treatment of breast cancer. J Clin Oncol. 1986;4:942-951.

9. Yusuf S, Collins R, Peto R. Why do we need some large, simple randomized trials? Stat Med. 1984;3:409-420.

10. Chalmers TC. Clinical trial quality needs to be improved to facilitate meta-analyses. Online J Curr Clin Trials. September 11, 1993; Doc No. 89. Serial online.

11. Dickersin K, Min YI. NIH clinical trials and publication bias. Online J Curr Clin Trials. April 28, 1993; Doc No. 50. Serial online.

12. Cohen J. A power primer. Psychol Bull. 1992;112:155-159.

13. Grant A. Reporting controlled trials. Br J Obstet Gynecol. 1989;96:397-400.

14. Mosteller F, Gilbert JP, McPeek B. Reporting standards and research strategies for controlled trials. Controlled Clin Trials. 1980;1:37-58.

15. Haynes RB, Mulrow CD, Huth EJ, Altman DG, Gardner MJ. More informative abstracts revisited. Ann Intern Med. 1990;113:69-76.
