Instruments for Assessing the Quality of Drug Studies
Published in the Medical Literature
(JAMA. 1994;272:101-104)
Mildred K. Cho, PhD, Lisa A. Bero, PhD
Objectives.--To develop valid and reliable instruments
to assess the methodologic quality and clinical relevance of drug
studies.
Design.--We developed an instrument to assess the
methodologic quality of articles reporting clinical research and an
instrument to measure nonmethodologic measures of quality, such as
clinical relevance, generalizability, and adherence to ethical
standards. Each instrument was pretested by seven independent, masked
reviewers and modified based on interrater agreement and content
validity of individual items. We determined correlational validity of
the final methodologic quality instrument by comparing quality scores
assigned to 10 articles by means of our instrument and a previously
published one.
Participants.--Clinical drug studies published in symposium
proceedings and peer reviewed biomedical literature.
Main Outcome Measures.--Interrater reliability of overall
quality scores, measured by intraclass correlation (r)
and Kendall's coefficient of concordance (W), and interrater
reliability of individual items, by percentage agreement.
Main Results.--The interrater reliability of the pretest
methodologic quality instrument was high (r=.89 [95%
confidence interval, .73 to .96]; W=0.64). Correlational validity of
the final instrument was suggested by the high degree of concordance
with another previously published one (W=0.74). The interrater
reliability of the pretest clinical relevance instrument was moderate
(r=.41 [95% confidence interval, .18 to .64]; W=0.47).
Reviewers confirmed the content validity of both instruments.
Conclusions.--The two instruments we developed, one
measuring methodologic quality and one measuring clinical relevance of
articles reporting clinical research, are reliable, valid, and
applicable to a variety of research designs.
TO CLINICIANS and scientists, the medical
literature is an important means for acquiring new information to guide
clinical decision making and research. Assessing the quality of the
literature is an important activity of readers, meta-analysts, and
participants in the peer review process.
Several instruments to assess methodologic quality of biomedical
literature have been developed; most assess the quality of randomized
control trials only.[1] [2] [3] [4] [5] [6] [7] [8] [9] Although such instruments are
generally designed to be used by more than one rater, interrater
reliability has been infrequently reported. External validity of these
instruments cannot be determined for lack of a "gold standard."
Other forms of validity, such as content validity or correlational
validity, have also generally not been reported. Few instruments in the
literature measure nonmethodologic aspects of quality, such as external
validity,[10] interest to readers,[11] and
importance of the study to the problem.[11] [12] [13]
Therefore, we developed an instrument to measure methodologic quality
that was applicable not just to randomized controlled trials but also
to studies of other experimental or observational designs. We also
developed an instrument to assess the clinical relevance and
generalizability of a study and its adherence to ethical standards. To
the extent possible, we tested the interrater reliability, content
validity, and correlational validity of both instruments.
MATERIALS AND METHODS
We tested our instruments on clinical drug studies published in the
biomedical literature. We defined a "study" as a publication that
(1) contained a separate section describing methods, (2) did not
specifically state that it was a review, and (3) did not contain tables
or figures outside the text acknowledged to be reprinted from other
sources. We defined a study as "clinical" if it was performed on
humans.
Article Selection
We tested the instruments on articles from symposium proceedings and
peer reviewed journals because the instruments were initially designed to
examine the quality of articles from these two sources. Using a
computer-generated list of random numbers from 1 to 625, we randomly
selected symposium articles that were published between January 1, 1980, and
December 30, 1989, and that reported clinical drug studies from the 625 symposium
proceedings that were identified for a previous study.[14]
From each symposium, we randomly selected one clinical drug study by
means of a random number generated by a calculator.
Date of publication, journal, and therapeutic class of drug could
confound the association between the source of publication (ie,
symposium vs peer reviewed journal) and quality.[4] [15] [16] We
matched each symposium article to an article from the peer reviewed
parent journal that was published up to 2 years before or after the
symposium article. We matched articles by the therapeutic class of drug
based on the Food and Drug Administration classification[17]
because, for certain types of drugs, the randomized or masked study
designs that would be considered the highest in quality are not always
possible.
We identified peer reviewed articles in MEDLINE, with the help of a
professional librarian, by the following search strategy: (xjo
<journal name>) and (py <year>) and (exp xs <drug class>)
and (lan eng) and (su human) and not (pt review or pt letter or pt
editorial). We randomly selected one matching article for each
symposium article by means of random numbers generated by a calculator.
For the methodologic quality instrument, we selected three symposium
and three matched peer reviewed articles. For the clinical relevance
instrument, we selected five symposium and five matched peer reviewed
articles. All reviewers received the articles in the same randomized
order.
Pretest of Methodologic Quality Instrument
We defined "methodologic quality" of a study as minimization
of systematic bias and consistency of conclusions with results. We are
able to determine methodologic quality of a study only to the
extent that study design and analytic methods are reported.
However, because our instrument measures criteria critical to achieving
high methodologic quality, these criteria should be adequately reported
in any article describing a clinical study.
Instrument Development.--Our pretest methodologic
quality instrument was based on that of Spitzer et al,[9]
which was the only comprehensive methodologic quality assessment
instrument we found in the medical literature applicable to both
observational and experimental studies. We eliminated items from the
Spitzer et al instrument that did not address systematic bias and
consistency.
Using our pretest instrument, reviewers were first asked to determine
the type of study design used (responses could range from case reports
to randomized controlled trials). They could respond "Yes,"
"Partial," "No," or "Not applicable" to how well the
article fulfilled each of the remaining 15 criteria. A space for
comments was also provided. The pretest instrument was revised on the
basis of interrater agreement and reviewers' comments on content
validity of individual items.
Scoring System.--To derive an overall quality score for each
article, we assigned 2, 1, 0, and 0 points to "Yes," "Partial,"
"No," and "Not applicable" responses, respectively, for each
item except for study design. One to five points were given for study
design (one for case reports, two for time series or uncontrolled
experiments, three for cohort or case-control studies, four for
unrandomized control trials, and five for randomized control trials),
according to the system used by the US Preventive Services Task
Force.[8] Total points were divided by the total possible
points (the sum of the maximum points for each item, except "Not
applicable" items) to yield a fraction between 0 and 1. A score of 1
represents the highest quality.
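The scoring rule above can be illustrated with a minimal Python sketch. The function and constant names are hypothetical, and the sketch assumes that study design always counts toward the possible points with a maximum of 5; the text does not state this explicitly.

```python
# Points per response for the 15 non-design items, as described in the text.
POINTS = {"Yes": 2, "Partial": 1, "No": 0}
MAX_ITEM_POINTS = 2

# Points for study design (one to five), as described in the text.
DESIGN_POINTS = {
    "case report": 1,
    "time series or uncontrolled experiment": 2,
    "cohort or case-control study": 3,
    "unrandomized control trial": 4,
    "randomized control trial": 5,
}
MAX_DESIGN_POINTS = 5  # assumption: design is always part of the denominator


def quality_score(design, responses):
    """Return total points divided by total possible points (0 to 1)."""
    earned = DESIGN_POINTS[design]
    possible = MAX_DESIGN_POINTS
    for response in responses:
        if response == "Not applicable":
            continue  # excluded from both numerator and denominator
        earned += POINTS[response]
        possible += MAX_ITEM_POINTS
    return earned / possible
```

For example, a cohort study rated "Yes," "Partial," and "Not applicable" on three items would earn 6 of 9 possible points under these assumptions.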
As there is little empiric evidence on the relative importance of
the individual quality criteria to the control of systematic bias, the
effects of three different weighting schemes on the absolute values and
interrater reliability of overall scores were tested. Weighting scheme
1 gave all items equal weight. Weighting scheme 2 gave more weight to
study design by multiplying the points awarded for study design by an
arbitrarily selected weighting factor of 3. In weighting scheme 3,
points for study design, randomization, blinding, statistical analysis,
and support of conclusions by the results were given more weight,
similar to the scheme proposed by Chalmers et al.[1]
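As a rough illustration of how weighting schemes 1 and 2 change an overall score, the following Python sketch applies per-item multipliers. It assumes that a weight is applied to both the points earned and the maximum points for an item, so that the score stays between 0 and 1; the text does not spell this out. The item names and the example article are hypothetical, and the weights for scheme 3 are not given in the text, so they are not shown.

```python
def weighted_score(earned, maximum, weights=None):
    """Return weighted earned points divided by weighted possible points.

    earned and maximum map item names to points; items answered
    "Not applicable" are assumed to be left out of both dicts.
    weights maps item names to multipliers (any item not listed gets 1).
    """
    weights = weights or {}
    total = sum(weights.get(item, 1) * pts for item, pts in earned.items())
    possible = sum(weights.get(item, 1) * maximum[item] for item in earned)
    return total / possible


# Hypothetical article: a randomized trial with two other applicable items.
earned = {"study design": 5, "item A": 1, "item B": 0}
maximum = {"study design": 5, "item A": 2, "item B": 2}

scheme_1 = weighted_score(earned, maximum)                       # equal weights
scheme_2 = weighted_score(earned, maximum, {"study design": 3})  # design x 3
```

Under these assumptions, the hypothetical article scores 6/9 with scheme 1 and 16/19 with scheme 2, showing how a strong design pulls the weighted score upward.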
Reviewer Background, Training, and Masking.--For the
pretest, we recruited seven reviewers with research experience in
health sciences (two biostatisticians, three epidemiologists, and two
clinical pharmacologists). The reviewers were given a six-page set of
written instructions, worked independently, were masked to whether an
article was from a symposium, and were given photocopies of articles
from which author names, institutions, journal names, dates, and all
other reference information had been obliterated. Reviewers were also
unaware of the purpose of our study.
Interrater Reliability.--Interrater reliability of
individual items was assessed by percentage agreement. Interrater
reliability of overall scores was assessed by Kendall's coefficient of
concordance (W) with adjustment for tied ranks.[18] As others
have more commonly reported the intraclass correlation (r) as
a measure of interrater reliability of similar instruments, we also
determined r (treating both reviewers and articles as random
effects[19]). For tests of significance, one-tailed alpha=.05 was used for W and two-tailed alpha=.05 for r.
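For readers unfamiliar with Kendall's W, the following Python sketch computes it with the usual correction for tied ranks. It is an illustration only and not the authors' analysis code: the function names are hypothetical, and it assumes the input is one list of scores per reviewer, with the articles in the same order in every list.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(scores_by_rater):
    """Kendall's coefficient of concordance, corrected for tied ranks."""
    m = len(scores_by_rater)       # number of raters
    n = len(scores_by_rater[0])    # number of articles
    ranks = [average_ranks(row) for row in scores_by_rater]
    rank_sums = [sum(r[i] for r in ranks) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    # Tie correction: sum of (t^3 - t) over each group of t tied scores.
    tie_term = sum(
        sum(t ** 3 - t for t in Counter(row).values())
        for row in scores_by_rater
    )
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * tie_term)
```

W is 1 when all reviewers rank the articles identically and 0 when the rank sums are the same for every article. The sketch does not handle the degenerate case in which every reviewer gives all articles the same score.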
Correlational Validity of Final Methodologic Quality Instrument
Correlational validity of the final methodologic quality instrument was
assessed by comparing the rank order of quality scores previously
assigned to 10 articles by means of a modified version of the
instrument designed by Chalmers et al.[20] This instrument
was the only one we found for which both the references and exact
quality scores of articles were reported.[21] Two reviewers
(both clinical pharmacologists) were given photocopies of the 10
articles, which were masked as described above. Because the interrater
reliability between the two reviewers was high (r=.79; 95%
confidence interval [CI], .39 to .94), we used the mean of the two
reviewers' scores for each article to represent the score assigned by
our instrument. The rank order of the mean scores was compared by means
of Kendall's W.
Pretest of Clinical Relevance Instrument
To assess nonmethodologic measures of quality, such as clinical
relevance, generalizability, and ethics, we designed a seven-item
instrument that we call the "clinical relevance" instrument. We
awarded points to each response as follows: two for "Yes," one for
"Partial" or "Insufficient evidence," and zero for "No" or
"No controls." The overall quality score was the total of awarded
points divided by 14, the total possible points. We weighted all items
equally.
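The clinical relevance score described above can be sketched in a few lines of Python. The names are hypothetical, and the sketch assumes that all seven items are answered.

```python
# Points per response for the seven-item clinical relevance instrument.
RELEVANCE_POINTS = {
    "Yes": 2,
    "Partial": 1,
    "Insufficient evidence": 1,
    "No": 0,
    "No controls": 0,
}
TOTAL_POSSIBLE = 14  # seven items, two points each


def clinical_relevance_score(responses):
    """Return awarded points divided by 14 for seven equally weighted items."""
    if len(responses) != 7:
        raise ValueError("expected one response for each of the seven items")
    return sum(RELEVANCE_POINTS[r] for r in responses) / TOTAL_POSSIBLE
```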
Seven reviewers with research experience in health sciences (one
biostatistician and six physicians, including two general internists,
one anesthesiologist, one ophthalmologist, one pediatrician, and one
adult critical care physician) pretested this instrument. They received
four pages of written instructions, worked independently, and were
masked as described above.
RESULTS
Pretest of Methodologic Quality Instrument
The pretest of the methodologic quality instrument required
approximately 30 minutes per article per reviewer. There were no
missing data. The average of reviewers' scores for each article ranged
from 0.15 to 0.65 (SD, 0.18). The interrater reliability of overall
quality scores differed among the three weighting schemes
used to derive the scores. For weighting scheme 1, W was 0.60 and
r was .81 (95% CI, .58 to .93). For weighting scheme 2, W was
0.64 and r was .89 (95% CI, .73 to .96). For weighting scheme
3, W was 0.73 and r was .85 (95% CI, .65 to .94). The
differences between r values were not significant (z test;
P>.5 for all pairwise combinations of the three r
values). We report only the results obtained with weighting scheme 2
because it generally resulted in the highest interrater reliability for
overall scores and is simple to interpret.
Table 1 shows interrater agreement for
individual items. We considered an item to have "good agreement" if
five of seven reviewers gave the same response. For each item, we
counted the fraction of articles (of a total of six) that had good
agreement. In general, reviewers found that individual items had
content validity but suggested that the questions be more specific.
(See Table 2 for the final instrument.) To achieve
more specificity, we split compound questions (eg, item 13 in the
pretest instrument became items 15 and 16 in the final instrument) or
items that represented overgeneralized concepts (eg, item 3 in the
pretest instrument became items 7 through 9 in the final instrument).
Reviewers suggested clarifying the difference between whether something
was done vs whether it was reported (items 6 and 10 through 13 in the
final instrument). Reviewers also suggested adding several items (items
1, 2, 18, and 19 in the final instrument).
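The "good agreement" tally used for individual items (five of seven reviewers giving the same response, counted across the six articles) could be computed along the following lines. This is a sketch with hypothetical names, and it reads "five of seven" as "at least five," which is an assumption.

```python
from collections import Counter


def has_good_agreement(responses, threshold=5):
    """True if at least `threshold` reviewers gave the same response."""
    most_common_count = Counter(responses).most_common(1)[0][1]
    return most_common_count >= threshold


def fraction_with_good_agreement(responses_by_article, threshold=5):
    """Fraction of articles on which an item reached good agreement."""
    good = sum(has_good_agreement(r, threshold) for r in responses_by_article)
    return good / len(responses_by_article)
```

Here `responses_by_article` would hold, for one item, the seven reviewers' responses to each of the six pretest articles.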
Because low percentage agreement was generally the result of
discrepancies in the use of the responses "Partial" and
"Not applicable," we added illustrative examples to the
instruction manual and listed criteria for "Not applicable"
responses for each item on the instrument itself.
Correlational Validity of Final Methodologic Quality Instrument
The mean (+/-SD) quality score for the 10 articles used to assess
correlational validity was 0.60+/-0.13 with our instrument (range, 0.36
to 0.74; possible range, 0 to 1). The mean of the scores obtained by
Detsky et al[21] with their instrument was 0.23+/-0.15
(range, 0.04 to 0.54; possible range, 0 to 1). The mean scores differed
significantly (t test; P<.001). However, the rank
orders of the scores were similar, as indicated by a high value of
Kendall's W (0.74).
Pretest of Clinical Relevance Instrument
The pretest of this instrument required approximately 30 minutes per
article per reviewer. There were only two instances of missing data,
both from one reviewer. The average of reviewer scores for each paper
ranged from 0.30 to 0.66 (SD, 0.10). The interrater reliability for
overall scores was moderate (r=.41 [95% CI, .18 to .64];
W=0.47). The percentage agreement of individual items was high (Table 3).
Reviewers generally thought that the instrument had content validity.
We changed the phrase "control group" to "comparison group" in
item 3 and added the words "As far as could be determined from the
article" in item 7 to improve interrater agreement, as suggested by
the reviewers. All other items were left unchanged. Table 4 shows the wording of items in the final
instrument.
COMMENT
We have developed valid and reliable instruments to assess the
methodologic quality and clinical relevance of drug studies published
in the biomedical literature. Because the instruments measure different
aspects of quality, they complement each other and, together, define a
broad range of criteria by which to assess overall quality.
Our instruments improve on those reported in the literature in
four ways. First, our methodologic quality instrument is applicable to
a variety of study designs and is relatively short and easy to use.
Second, both instruments have a high interrater reliability
compared with most similar, published instruments.[15,21-23]
The moderate interrater reliability of the overall scores of our
clinical relevance instrument may have resulted from the low
variability of scores among the 10 articles in our sample. However, the
interrater agreement for individual items of the clinical relevance
instrument was surprisingly high, considering the high proportion of
subjective items as compared with the methodologic quality instrument.
Third, the methodologic quality instrument was used successfully
to assess articles that were not about drugs,[21] suggesting
that it is applicable to a variety of topics. The good correlation of
the rank order of the quality scores assigned by means of our
instrument and the one used by Detsky et al[21] indicates
that the instruments are similar in measuring methodologic quality.
Because there is no accepted gold standard of methodologic quality
against which to measure external validity, we used correlational
validity as the best available alternative.
Fourth, our instruments had high reliability and validity when used by
reviewers from diverse backgrounds who worked completely independently
and without lengthy training sessions. Complete independence of
reviewers avoids the systematic bias that can be introduced by
consensus conferencing if one of the reviewers is persuasive or has
strong opinions. We tested the reliability of our instruments under
the conditions in which we thought they would most likely be used in
real-life situations, such as in the peer review of manuscripts.
We will be able to use both instruments for the purpose for which they were originally
designed. Namely, we will have independent reviewers determine whether
the type of sponsorship or type of review process of drug studies is
associated with methodologic or nonmethodologic quality. The
instruments will also be useful for separating high- and low-quality
articles for meta-analysis. Finally, for health care providers,
such as clinical pharmacists or pharmacologists, who must screen vast
amounts of literature in providing drug information services, our
instruments can augment available guidelines on how to read medical
literature[24] [25] by providing specific criteria by which to
evaluate articles.
From the Institute for Health Policy Studies, School of Medicine
(Drs Cho and Bero), and Division of Clinical Pharmacy, School of
Pharmacy (Dr Bero), University of California-San Francisco; Center for
Health Care Evaluation, Department of Veterans Affairs, Veterans
Affairs Medical Center, Palo Alto, Calif (Dr Cho); and Department of
Health Research and Policy, Stanford (Calif) University (Dr Cho).
Presented in part at the Second International Congress on Peer Review
in Biomedical Publication, Chicago, Ill, September 9, 1993.
This investigation was supported by funds provided by the American
Association for Retired Persons (Dr Bero), the Cigarette and Tobacco
Surtax Fund of the State of California through the Tobacco-Related
Disease Research Program of the University of California under award
2KT0072 (Dr Bero), the Pew Charitable Trusts (Dr Cho), and the Veterans
Affairs Office of Academic Affairs and Health Services Research and
Development Service Research Funds (Dr Cho).
We gratefully acknowledge our reviewers for their time and critique of
our instruments: Neal Benowitz, MD, John Flynn, MD, Piero Gepetti, MD,
Stan Glantz, PhD, Peter Lurie, MD, MPH, Haim Mayan, MD, Mitchell
Pelter, PharmD, John Piette, PhD, MPH, Adrienne Randolph, MD, Susan
Rosenkranz, PhD, Gordon Rubenfeld, MD, Serena Seifer, MD, MPH, Dan
Stryer, MD, and Tracey Woodruff, PhD, MPH. We also thank Phillip Lollar
for administrative help.
Reprint requests to Institute for Health Policy Studies, 1388 Sutter
St, 11th Floor, San Francisco, CA 94109 (Dr Cho).
References
1. Chalmers TC, Smith H Jr, Blackburn B, et al. A
method for assessing the quality of a randomized control trial.
Controlled Clin Trials. 1981;2:31-49.
2. Cooper GS, Zangwill L. An analysis of the quality of
research reports in the Journal of General Internal Medicine.
J Gen Intern Med. 1989;4:232-236.
3. Detsky AS, Naylor CD, O'Rourke K, McGeer AJ, L'Abbe
KA. Incorporating variations in the quality of individual randomized
trials into meta-analysis. J Clin Epidemiol. 1992;45:255-265.
4. DerSimonian R, Charette LJ, McPeek B, Mosteller F.
Reporting on methods in clinical trials. N Engl J
Med. 1982;306:1332-1337.
5. Bailar JC, Mosteller F. Guidelines for statistical
reporting in articles for medical journals. Ann Intern Med.
1988;108:266-273.
6. Berlin JA, Goodman SN, Fletcher SW, Fletcher RH. An
instrument for assessing the quality of reporting of clinical research.
Read before the Second International Congress on Peer Review in
Biomedical Publication, September 9, 1993, Chicago, Ill.
7. Canadian Task Force for the Periodic Health Examination.
The periodic health examination. Can Med Assoc J.
1979;121:1193-1197.
8. US Preventive Services Task Force. Guide to Clinical
Preventive Services: An Assessment of the Effectiveness of 169
Interventions. Baltimore, Md: Williams & Wilkins; 1989.
9. Spitzer WO, Lawrence V, Dales R, et al. Links between
passive smoking and disease: a best evidence synthesis: a report of the
working group on passive smoking. Clin Invest Med.
1990;13:17-42.
10. Liberati A, Himel HN, Chalmers TC. A quality assessment
of randomized control trials of primary treatment of breast cancer.
J Clin Oncol. 1986;4:942-951.
11. Scott WA. Interreferee agreement on some characteristics
of manuscripts submitted to the Journal of Personality and Social
Psychology. Am Psychol. 1974;29:698-702.
12. Wolff WM. A study of criteria for journal manuscripts.
Am Psychol. 1970;25:636-639.
13. McReynolds P. Reliability of ratings of research papers.
Am Psychol. 1971;26:400-401.
14. Bero LA, Galbraith A, Rennie D. The publication of
sponsored symposia in medical journals. N Engl J
Med. 1992;327:1135-1140.
15. Andrew E, Eide P, Fuglerud EK, et al. Publications on
clinical trials with X-ray contrast media: differences in quality
between journals and decades. Eur J Radiol. 1990;10:92-96.
16. Hemminki E. Quality of clinical trials: a concern of
three decades. Methods Inform Med. 1982;21:81-85.
17. Food and Drug Administration. Offices of Drug
Evaluation Statistical Report. Washington, DC: Center for Drug
Evaluation and Research; 1990.
18. Siegel S, Castellan NJ. Nonparametric Statistics
for the Behavioral Sciences. 2nd ed. New York, NY: McGraw-Hill
International Book Co; 1988.
19. Haggard EA. Intraclass Correlation and the Analysis
of Variance. New York, NY: Dryden Press Inc; 1958.
20. L'Abbe KA, Detsky AS, O'Rourke K. Meta-analysis in
clinical research. Ann Intern Med. 1987;107:224-233.
21. Detsky AS, Baker JP, O'Rourke K, Goel V. Perioperative
parenteral nutrition: a meta-analysis. Ann Intern Med.
1987;107:195-203.
22. Feurer I, Becker G, Picus D, Ramires E, Darcy M, Hicks
M. Evaluating peer reviews: pilot testing of a grading instrument. Read
before the Second International Congress on Peer Review in Biomedical
Publication, September 9, 1993, Chicago, Ill.
23. Andrew E. Method for assessment of the reporting
standard of clinical trials with roentgen contrast media. Acta
Radiol Diagn. 1984;25:55-58.
24. Guyatt GH, Sackett DL, Cook DJ. Users' guides to the
medical literature, II: how to use an article about therapy or
prevention, A: are the results of the study valid? JAMA.
1993;270:2598-2601.
25. Guyatt GH, Sackett DL, Cook DJ. Users' guides to the
medical literature, II: how to use an article about therapy or
prevention, B: what were the results and will they help me in caring
for my patients? JAMA. 1994;271:59-63.