Assessing the Quality of Randomization From Reports of
Controlled Trials Published in Obstetrics and Gynecology Journals
(JAMA. 1994;272:125-128)
Kenneth F. Schulz, MBA; Iain Chalmers, MBBS, MSc; David A. Grimes,
MD; Douglas G. Altman
Objective.--To assess the methodologic quality of
approaches used to allocate participants to comparison groups in
randomized controlled trials from one medical specialty.
Design.--Survey of published, parallel group randomized
controlled trials.
Data Sources.--All 206 reports with allocation described as
randomized from the 1990 and 1991 volumes of four journals of
obstetrics and gynecology.
Main Outcome Measures.--Direct and indirect measures of the
adequacy of randomization and baseline comparisons.
Results.--Only 32% of the reports described an adequate
method for generating a sequence of random numbers, and only 23%
contained information showing that steps had been taken to conceal
assignment until the point of treatment allocation. A mere 9%
described both sequence generation and allocation concealment. In
reports of trials that had apparently used unrestricted randomization,
the differences in sample sizes between treatment and control groups
were much smaller than would be expected due to chance. In reports of
trials in which hypothesis tests had been used to compare baseline
characteristics, only 2% of reported test results were statistically
significant, lower than the expected rate of 5%.
Conclusions.--Proper randomization is required to generate
unbiased comparison groups in controlled trials, yet the reports in
these journals usually provided inadequate or unacceptable information
on treatment allocation. Additional analyses suggest that nonrandom
manipulation of comparison groups and selective reporting of baseline
comparisons may have occurred.
(JAMA. 1994;272:125-128)
RANDOMIZATION eliminates selection biases in
controlled trials. Unfortunately, investigators often address
randomization improperly in the design and implementation phases of
trials and neglect it in published reports.[1]
[2]
[3]
Moreover, an analysis of prominent general journals revealed that among trials
in which unrestricted randomization was used, the sample sizes in the two
comparison groups were more similar than would be expected by
chance.[3] Furthermore, results of only 4% of hypothesis
tests comparing baseline characteristics were significant at the 5%
level.
We conducted a systematic evaluation of reports of randomized
controlled trials (RCTs) published in the two main US and the two main
British journals of obstetrics and gynecology. The American
Journal of Obstetrics and Gynecology (AJOG) and
Obstetrics and Gynecology (OG) are published in the
United States, and the British Journal of Obstetrics and
Gynaecology (BJOG) and the Journal of Obstetrics and
Gynaecology (JOG) are published in the United Kingdom.
Earlier research has suggested that the methodologic quality of RCTs in
this specialty may be inadequate[4]
[5]
[6]
[7]; we anticipated that
descriptions of adequate approaches to treatment assignment would be
rarer in these journals than in general journals. We also hypothesized
that (1) the reports published in the BJOG would be of better
quality than those published in the other three journals because a
concerted editorial effort had been made to improve the quality of
reporting in the BJOG,[8]
[9]
[10]
(2) the numbers of
patients in the comparison groups of trials in which unrestricted
randomization was used would be more similar than would be expected by
chance, and (3) the percentage of reported statistically significant
differences in baseline characteristics would be less than the expected 5%.
METHODS
We collected data from all reports (N=206) of trials published in the
1990 and 1991 volumes of the AJOG, the BJOG, the
JOG, and OG. To identify eligible reports, we handsearched
the journals and then cross-checked that search using the Oxford
Database of Perinatal Trials[11] (issue 8) and MEDLINE. We
included articles in which authors reported that individuals had been
randomly allocated to parallel (uncrossed) groups. A report was
included as long as it purported to refer to a randomized trial, even
if the actual method described nonrandom allocation. We included only
the first publications relating to particular trials.
We examined reports and collected data using methods similar to those
used in the analysis of general journals.[3] For consistency
of measurement across journals, one of us (K.F.S.) performed all of the
assessments. To examine the reproducibility of items on the
questionnaire, another of us (D.A.G.) assessed a sample (random number
table) of 15 trials while blinded to the initial assessments. We found
no notable differences on our main outcome measures. We entered data
into an Epi-Info questionnaire.[12]
The reduction of bias in trials depends crucially on preventing
foreknowledge of treatment assignment. Concealing assignments until the
point of allocation prevents foreknowledge, but that process has
sometimes been confusingly referred to as randomization
blinding.[13] This term, if used at all, has seldom been
distinguished clearly from other forms of blinding (masking) and is
unsatisfactory for at least three reasons. First, the rationale for
generating comparison groups at random, including the steps taken to
conceal the assignment schedule, is to eliminate selection bias. By
contrast, other forms of blinding, used after the assignment of
treatments, serve primarily to reduce ascertainment bias. Second, from
a practical standpoint, concealing treatment assignment up to the point
of allocation is always possible, regardless of the study topic,
whereas blinding after allocation is not attainable in many instances,
such as in trials conducted to compare surgical and medical treatments.
Third, control of selection bias pertains to the trial as a whole, and
thus to all outcomes being compared, whereas control of ascertainment
bias may be accomplished successfully for some outcomes but not for
others. Thus, concealment up to the point of allocation of treatment
and blinding after that point address different sources of bias and
differ in their practicability. In light of those considerations, we
refer to the former as allocation concealment and reserve the
term blinding for measures taken to conceal group identity
after allocation.
We considered the following approaches to the generation of an
allocation sequence as adequate: computer, random number table,
shuffled cards or tossed coins, and minimization. We considered the
following approaches to allocation concealment as adequate: central
randomization (eg, by telephone to a trials office), a pharmacy,
numbered or coded containers, and sequentially numbered, opaque, sealed
envelopes. Nonrandom (often called systematic) approaches included
alternate assignment and assignment by odd/even birth date or hospital
number. Other terms are described elsewhere.[14] [15]
Restriction forces sample sizes in comparison groups to be more similar
than would occur by simple randomization.[14] Blocking is the
most commonly used form. Our analyses of the differences in reported
sample sizes of comparison groups has been limited to two-group,
unrestricted trials. We categorized trials as "unrestricted" if
the trial had not been reported as restricted or stratified (which are
more likely to be restricted).
To assess whether authors reported appropriate measures of variability
for means or medians when reporting baseline comparisons, we looked for
the SD, range, or raw data. Unless otherwise indicated, we used
chi2 tests to compare nominally scaled variables. The
Greenland and Robins approach was used to obtain confidence intervals
for relative risks.[12]
RESULTS
We found 206 reports of trials in four journals. More than three
quarters (78%) failed to provide information about the type of
randomization. Despite purporting to be randomized trials, 11 reports
(5%) described the use of a nonrandom method of assignment. Only 29
(14%) of the reports described the use of restriction (23 of the 29
descibed blocking). None reported the use of replacement randomization.
Reports published in the BJOG stated the type of randomization
more frequently than reports published in the other journals (48% vs
14%, P<.001, 1 df).
Only 32% of the reports specified an adequate method for generating
random numbers, and the rates were similar among the four journals
(P=.27, 3 df; Table). A computer random number generator was the most frequently specified method
(18%), followed by a random number table (11%).
Almost half (48%) of the reports did not describe the mechanism used
to allocate treatments. Authors specified use of envelopes most
frequently (25%), but only one quarter of those (6% of all) stated
that the envelopes had been sequentially numbered, opaque, and sealed.
Fifteen reports (7%) specified the pharmacy, another 15 (7%)
specified numbered bottles or containers, and five (2%) described
central randomization. Ten (5%) stated that a list, table, or schedule
had been used for allocation, and the other 11 used nonrandom,
unconcealed assignment. Overall, only 23% stated an adequate approach
to allocation concealment (Table). The proportion of reports describing
adequate concealment varied markedly among the four journals
(P<.001, 3 df). The BJOG
had a rate 2.6 times higher than the other three journals combined
(95% confidence interval, 1.6 to 4.1; P<.001). Only 9%
described an adequate method for both sequence generation and
allocation concealment (Table).
In 96 reports of apparently unrestricted trials, sample sizes of the
treatment and control groups differed by less than would be expected
due to chance alone. The Figure illustrates those differences in relation to total trial size. About five trials should
fall outside the outer pair of straight lines--none did; about 48
should fall outside the inner pair of lines--only eight did
(P<.001; chi2 goodness of fit, 2
df). That 54% of the unrestricted trials had
differences in group sizes of zero or one further indicates the
similarity of group sizes. Surprisingly, only 36% of the blocked
trials had differences in group sizes of zero or one.
Authors presented comparisons of baseline characteristics in 84% of
the reports. Comparisons presented as continuous variables were
reported in 78% of the trials, and among those only 68% were
accompanied by appropriate measures of variability. Reports in the
BJOG were more likely than those in the other three specialty
journals to present appropriate measures of variability, but the
differences among the four journals were not statistically significant
(P=.22, 3 df). In 41% of the 206
reports, either authors did not present baseline characteristics or did
not report appropriate measures of variability.
Authors used hypothesis tests for baseline comparisons in 125 reports
(61%) and presented results of 1076 tests. Of those results, only 2%
were statistically significant at the 5% level, a departure from
expectation (P<.001, z test).
In 50 (24%) of the reports, authors reported sample sizes to be based
on prior statistical power calculations. The rates were 0% for the
JOG, 18% for OG, 19% for the AJOG, and
52% for the BJOG. Trials published in the BJOG
reported power calculations 3.3 times more frequently than those from
the other three journals combined (95% confidence interval, 2.1 to
5.2; P<.001).
COMMENT
Randomized controlled trials provide the most valid basis for the
comparison of interventions in health care. If improperly conducted,
however, trials purporting to be "randomized" can yield biased
results. Indeed, bias has been detected in trials not reporting
adequate allocation concealment.[13] Thus, for readers to
have justifiable confidence in the internal validity of a trial, the
report should demonstrate adequate randomization. Considering its
central importance, we are surprised that authors have not been more
meticulous in publishing clear reports of the randomization process.
Our estimate of 32% for adequate sequence generation may be
generous, as it includes processes such as shuffled cards and tossed
coins as "adequate." Because those methods open the production of
assignment schedules to human perturbations and result in
unreproducible results, we consider them less than optimal; others
consider them unacceptable.[15] We recommend tables and
computers not only because of reproducibility but also because of ease
and speed.
Allocation concealment generally outweighs the importance of generating
assignments per se,[16] [17] yet only about half of the reports
provided information adequate to assess that aspect of trial design and
conduct. We judged that less than one quarter of the reports described
adequate allocation concealment, but, even with many of those, further
clarifying information should have been provided. Few reports stated
who had prepared the randomization scheme; those who prepared the
scheme should not have been involved in determining eligibility,
administering treatment, or assessing outcome.
Reports of trials published in the BJOG provided more
information than those in the other three journals. They more
frequently included information about the type of randomization,
reported an adequate approach to randomization concealment, and
reported statistical power calculations. Also, the quality of reports
in the BJOG matched or exceeded that found in the four general
journals.[3] Even so, those of us (I.C. and D.G.A.) who had
been involved in editorial efforts to improve the quality of reports in
the BJOG were disappointed at how much room for improvement
remains. The reports from the two US journals were comparable with each
other but superior to those in the JOG. Editorial efforts
similar to those made at the BJOG in the mid-1980s are now
occurring at OG,[18] and those, too, may result in
improved quality.
The relative sizes of comparison groups in the unrestricted trials
should have reflected random variation. In other words, some
discrepancy between the numbers in the comparison groups would be
expected. We found the contrary, however, which supported an earlier
finding.[3] The strong tendency for the comparison groups to
be of equal or similar sizes may be explained by unreported use of (1)
restriction, usually blocking; (2) replacement randomization; (3) a
nonrandom method of assignment; or (4) nonrandom manipulation of
assignments or data to balance sample sizes. Use of restriction would
be the most palatable of these possible explanations, and it likely
explains some instances. It probably does not explain most, however,
because few trials reported restriction and because blocked trials
yielded differences more disparate than those found in the unrestricted
trials. We found no evidence of replacement randomization. We found
evidence of nonrandom allocation; thus, its unidentified use in other
trials may explain some of the similarities. This is hardly reassuring,
however, given the risk of bias due to nonrandomness and difficulties
with concealment.
The fourth potential explanation, nonrandom manipulation, has serious
implications because it is the most likely to introduce selection bias.
Our findings provide indirect evidence that it could have happened.
Some investigators may have believed that they would increase the
credibility of their trial if they presented comparison groups of equal
size. Unfortunately for good science, but fortunately for those
investigators, most readers probably shared their misconception.
Paradoxically, the results of those possible manipulations have had
exactly the opposite effect when analyzed in aggregate in our study.
While our results indicate clearly that the set of trials that had
supposedly used unrestricted randomization were not what they purported
to be, the identification of any particular trial as suspect is
impossible, as some trials would be expected to achieve similar numbers
simply by chance.
While randomization assigns treatments without bias, it does not
necessarily produce balanced groups with respect to prognostic factors.
On strictly theoretical grounds, if randomization is properly
implemented, establishment of comparability at baseline is unnecessary.
Random assignment eliminates bias, even though, in a particular study,
the groups compared may never be perfectly balanced for important
prognostic variables. The process of randomization underlies
significance testing, and that process is independent of prognostic
factors, known or unknown.[19]
Baseline characteristics in RCTs should be addressed by authors, but
the common, inappropriate use of hypothesis tests to compare
characteristics concerns us.[20] [21] That process assesses the
probability that differences observed could have occurred by chance. In
properly randomized trials, however, any observed differences have
occurred by chance. As noted elsewhere,[21] "Such a procedure is clearly absurd." Hypothesis tests are superfluous, and
their use in comparisons of baseline characteristics can mislead
investigators and their readers. Rather, comparisons should be based on
consideration of the prognostic strength of the variables measured and
the magnitude of any chance imbalances that have
occurred.[21]
Hypothesis tests in these reports resulted in many fewer statistically
significant comparisons than expected. One plausible explanation for
that discrepancy is that a few investigators may have decided not to
report statistically significant comparisons, believing that by
withholding that information they would increase the credibility of
their reports. In fact, the opposite has occurred in this aggregated
analysis. Investigators should report baseline comparisons on
important prognostic variables, regardless of statistical significance.
Not only are hypothesis tests superfluous and potentially misleading,
but they can be harmful if they lead investigators to suppress any
baseline imbalances.
None of our findings are particularly reassuring. As a whole, the
trials from this medical specialty fared somewhat worse than the poor
showing of the general journals.[3] Furthermore, these
results probably represent what would be found in many other
specialties as well. Although failure to report steps to reduce bias
does not constitute direct evidence that those steps have not been
taken, at least one study, in which clarification was sought from the
authors of reports, has shown that inadequate reporting usually
reflects inadequate methods.[22] Thus, while
reporting must clearly be improved, deficiencies in the design and conduct of
trials must also be addressed.
Omission of randomization details to date has probably been primarily
an author-based phenomenon rather than a result of journal editors
extracting important material from manuscripts. Moreover, refereeing
and editorial work cannot improve what was actually done in a
trial--only how well it was reported. Thus, the burden for improvement
should fall primarily on investigators, although editors could
stimulate that process.
Protestations from authors about lack of space do not constitute
acceptable excuses for omission. Space will always be a limitation; the
issue is the relative importance of the topics addressed. Authors
frequently include information that has little bearing on scientific
validity, while they omit critical elements of the randomization
process. In a double-blind RCT, however, aspects other than
randomization may be scientifically inconsequential to the analysis
because they would have been applied equally to unbiased comparison
groups. Certainly, we would not wish to promote a cavalier attitude
toward other methodologic elements: surely some have to be adequately
described for readers to interpret findings and extrapolate results.
Yet, proper reporting of the randomization procedures should have the
highest priority, and those trials that fail to provide such
information should be interpreted cautiously.
At a minimum, reports of RCTs should include descriptions of (1) the
type of randomization, (2) the method of sequence generation, (3) the
method of allocation concealment, (4) the persons generating and
executing the scheme, and (5) the comparative baseline characteristics,
with proper interpretation. Furthermore, tolerance for groups of
unequal sizes in unrestricted trials should be cultivated in addition
to intolerance for hypothesis testing of baseline characteristics.
From the London (England) School of Hygiene and Tropical Medicine
(Mr Schulz); The United Kingdom Cochrane Centre, Oxford, England (Mr
Schulz and Dr Chalmers); the Division of STD/HIV Prevention, National
Center for Prevention Services, Centers for Disease Control and
Prevention, Atlanta, Ga (Mr Schulz); the Department of Obstetrics,
Gynecology, and Reproductive Sciences, University of California-San
Francisco (Dr Grimes); and the Medical Statistics Laboratory, Imperial
Cancer Research Fund, London (Mr Altman).
Presented in part at the Second International Congress on Peer Review
in Biomedical Publication, Chicago, Ill, September 10, 1993.
This work was supported by the Centers for Disease Control and
Prevention. Dr Chalmers was supported by the National Health Service
Research and Development Programme.
Reprint requests to the Division of STD/HIV Prevention, Centers for
Disease Control and Prevention, NCPS, Mailstop E-02, Atlanta, Ga 30333
(Mr Schulz).
References
1. Mosteller F, Gilbert JP, McPeek B. Reporting standards
and research strategies for controlled trials: agenda for the editor.
Controlled Clin Trials. 1980;1:37-58.
2. DerSimonian R, Charette LJ, McPeek B, Mosteller F.
Reporting on methods in clinical trials. N Engl J Med.
1982;306:1332-1337.
3. Altman DG, Dore CJ. Randomisation and baseline
comparisons in clinical trials. Lancet. 1990;335:149-153.
4. Tyson JE, Furzan JA, Reisch JS, Mize SG. An evaluation
of the quality of therapeutic studies in perinatal medicine. J
Pediatr. 1983;102:10-13.
5. Thacker SB. The efficacy of intrapartum electronic fetal
monitoring. Am J Obstet Gynecol. 1987;156:24-30.
6. Keirse MJNC. Amniotomy or oxytocin for induction of
labor: re-analysis of a randomized controlled trial. Acta Obstet
Gynecol Scand. 1988;67:731-735.
7. Grimes DA, Schulz KF. Randomized controlled trials of
home uterine activity monitoring: a review and critique. Obstet
Gynecol. 1992;79:137-142.
8. Bracken MB. Reporting observational studies. Br J
Obstet Gynaecol. 1989;96:383-388.
9. Wald N, Cuckle H. Reporting the assessment of screening
and diagnostic tests. Br J Obstet Gynaecol. 1989;96:389-396.
10. Grant A. Reporting controlled trials. Br J Obstet
Gynaecol. 1989;96:397-400.
11. Chalmers I, Hetherington J, Newdick M, et al. The
Oxford Database of Perinatal Trials: developing a register of published
reports of controlled trials. Controlled Clin Trials.
1986;7:306-324.
12. Dean AG, Dean JA, Burton AH, Dicker RC. Epi Info
Version 5: A Word Processing, Database, and Statistics Program for
Epidemiology on Microcomputers. Atlanta, Ga: Centers for Disease
Control; 1990.
13. Chalmers TC, Celano P, Sacks HS, Smith H Jr. Bias in
treatment assignment in controlled clinical trials. N Engl J
Med. 1983;309:1358-1361.
14. Pocock SJ. Clinical Trials: A Practical
Approach. Chichester, England: John Wiley & Sons; 1983.
15. Meinert CL. Clinical Trials: Design, Conduct, and
Analysis. New York, NY: Oxford University Press; 1986.
16. Hill AB. The clinical trial. N Engl J Med.
1952;247:113-119.
17. Chalmers TC, Levin H, Sacks HS, Reitman D, Berrier J,
Nagalingam R. Meta-analysis of clinical trials as a scientific
discipline, I: control of bias and comparison with large co-operative
trials. Stat Med. 1987;6:315-325.
18. Grimes DA. Randomized controlled trials: 'it ain't necessarily so.' Obstet Gynecol. 1991;78:703-704.
19. Fisher RA. The Design of Experiments. 8th ed.
Edinburgh, Scotland: Oliver & Boyd Ltd; 1966.
20. Rothman KJ. Epidemiologic methods in clinical trials.
Cancer. 1977;39 (suppl 4):1771-1775.
21. Altman DG. Comparability of randomised groups.
Statistician. 1985;34:125-136.
22. Liberati A, Himel HN, Chalmers TC. A quality assessment
of randomized control trials of primary treatment of breast cancer.
J Clin Oncol. 1986;4:942-951.
Table of Contents