Peer Review and Quality Control

Instruments for Assessing the Quality of Drug Studies Published in the Medical Literature

(JAMA. 1994;272:101-104)

Mildred K. Cho, PhD, Lisa A. Bero, PhD

Objectives.--To develop valid and reliable instruments to assess the methodologic quality and clinical relevance of drug studies.

Design.--We developed an instrument to assess the methodologic quality of articles reporting clinical research and an instrument to measure nonmethodologic measures of quality, such as clinical relevance, generalizability, and adherence to ethical standards. Each instrument was pretested by seven independent, masked reviewers and modified based on interrater agreement and content validity of individual items. We determined correlational validity of the final methodologic quality instrument by comparing quality scores assigned to 10 articles by means of our instrument and a previously published one.

Participants.--Clinical drug studies published in symposium proceedings and peer reviewed biomedical literature.

Main Outcome Measures.--Interrater reliability of overall quality scores, measured by intraclass correlation (r) and Kendall's coefficient of concordance (W), and interrater reliability of individual items, by percentage agreement.

Main Results.--The interrater reliability of the pretest methodologic quality instrument was high (r=.89 [95% confidence interval, .73 to .96]; W=0.64). Correlational validity of the final instrument was suggested by the high degree of concordance with another previously published one (W=0.74). The interrater reliability of the pretest clinical relevance instrument was moderate (r=.41 [95% confidence interval, .18 to .64]; W=0.47). Reviewers confirmed the content validity of both instruments.

Conclusions.--The two instruments we developed, one measuring methodologic quality and one measuring clinical relevance of articles reporting clinical research, are reliable, valid, and applicable to a variety of research designs.


TO CLINICIANS and scientists, the medical literature is an important means for acquiring new information to guide clinical decision making and research. Assessing the quality of the literature is an important activity of readers, meta-analysts, and participants in the peer review process.

Several instruments to assess methodologic quality of biomedical literature have been developed; most assess the quality of randomized control trials only.[1] [2] [3] [4] [5] [6] [7] [8] [9] Although such instruments are generally designed to be used by more than one rater, interrater reliability has been infrequently reported. External validity of these instruments cannot be determined for lack of a "gold standard." Other forms of validity, such as content validity or correlational validity, have also generally not been reported. Few instruments in the literature measure nonmethodologic aspects of quality, such as external validity,[10] interest to readers,[11] and importance of the study to the problem.[11] [12] [13]

Therefore, we developed an instrument to measure methodologic quality that was applicable not just to randomized controlled trials but also to studies of other experimental or observational designs. We also developed an instrument to assess the clinical relevance and generalizability of a study and its adherence to ethical standards. To the extent possible, we tested the interrater reliability, content validity, and correlational validity of both instruments.

MATERIALS AND METHODS

We tested our instruments on clinical drug studies published in the biomedical literature. We defined a "study" as a publication that (1) contained a separate section describing methods, (2) did not specifically state that it was a review, and (3) did not contain tables or figures outside the text acknowledged to be reprinted from other sources. We defined a study as "clinical" if it was performed on humans.

Article Selection

We tested the instruments on articles from symposium proceedings and peer reviewed journals because the instruments were initially designed to examine the quality of articles from these two sources. Using a computer-generated list of random numbers from 1 to 625, we randomly selected symposium articles that were published between January 1, 1980, and December 30, 1989, and that reported clinical drug studies from 625 symposia proceedings that were identified for a previous study.[14] From each symposium, we randomly selected one clinical drug study by means of a random number generated by a calculator.

Date of publication, journal, and therapeutic class of drug could confound the association between the source of publication (ie, symposium vs peer reviewed journal) and quality.[4] [15] [16] We matched each symposium article to an article from the peer reviewed parent journal that was published up to 2 years before or after the symposium article. We matched articles by the therapeutic class of drug based on the Food and Drug Administration classification[17] because, for certain types of drugs, randomized or masked study designs that would be considered the highest in quality are not always possible.

We identified peer reviewed articles in MEDLINE, with the help of a professional librarian, by the following search strategy: (xjo <journal name>) and (py <year>) and (exp xs <drug class>) and (lan eng) and (su human) and not (pt review or pt letter or pt editorial). We randomly selected one matching article for each symposium article by means of random numbers generated by a calculator. For the methodologic quality instrument, we selected three symposium and three matched peer reviewed articles. For the clinical relevance instrument, we selected five symposium and five matched peer reviewed articles. All reviewers received the articles in the same randomized order.

Pretest of Methodologic Quality Instrument

We defined "methodologic quality" of a study as minimization of systematic bias and consistency of conclusions with results. We are able to determine methodologic quality of a study only to the extent that study design and analytic methods are reported. However, because our instrument measures criteria critical to achieving high methodologic quality, these criteria should be adequately reported in any article describing a clinical study.

Instrument Development.--Our pretest methodologic quality instrument was based on that of Spitzer et al,[9] which was the only comprehensive methodologic quality assessment instrument we found in the medical literature applicable to both observational and experimental studies. We eliminated items from the Spitzer et al instrument that did not address systematic bias and consistency.

Using our pretest instrument, reviewers were first asked to determine the type of study design used (responses could range from case reports to randomized controlled trials). They could respond "Yes," "Partial," "No," or "Not applicable" to how well the article fulfilled each of the remaining 15 criteria. A space for comments was also provided. The pretest instrument was revised on the basis of interrater agreement and reviewers' comments on content validity of individual items.

Scoring System.--To derive an overall quality score for each article, we assigned 2, 1, 0, and 0 points to "Yes," "Partial," "No," and "Not applicable" responses, respectively, for each item except for study design. One to five points were given for study design (one for case reports, two for time series or uncontrolled experiments, three for cohort or case-control studies, four for unrandomized control trials, and five for randomized control trials), according to the system used by the US Preventive Services Task Force.[8] Total points were divided by the total possible points (the sum of the maximum points for each item, except "Not applicable" items) to yield a fraction between 0 and 1. A score of 1 represents the highest quality.
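The scoring rule in the preceding paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used in the study: the function name `methodologic_quality_score`, the lowercase response labels, and the design labels used as dictionary keys are all hypothetical, and the example article is invented.

```python
# Points per response, as described in the Scoring System paragraph.
RESPONSE_POINTS = {"yes": 2, "partial": 1, "no": 0, "not applicable": 0}

# Study-design points (1 to 5); the key names are illustrative labels.
DESIGN_POINTS = {
    "case report": 1,
    "time series or uncontrolled experiment": 2,
    "cohort or case-control": 3,
    "unrandomized control trial": 4,
    "randomized control trial": 5,
}


def methodologic_quality_score(design, responses):
    """Return awarded points / possible points as a fraction.

    "Not applicable" items earn no points and are left out of the
    denominator; the study-design item always contributes up to 5.
    """
    awarded = DESIGN_POINTS[design]
    possible = max(DESIGN_POINTS.values())
    for response in responses:
        key = response.lower()
        awarded += RESPONSE_POINTS[key]
        if key != "not applicable":
            possible += 2  # maximum for any applicable item
    return awarded / possible


# Hypothetical article: a cohort study rated on the 15 remaining criteria.
example = ["Yes"] * 6 + ["Partial"] * 4 + ["No"] * 3 + ["Not applicable"] * 2
score = methodologic_quality_score("cohort or case-control", example)
print(round(score, 3))
```

Because "Not applicable" items are removed from the denominator, an article is not penalized for criteria that cannot apply to its design.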

As there is little empiric evidence on the relative importance of the individual quality criteria to the control of systematic bias, the effects of three different weighting schemes on the absolute values and interrater reliability of overall scores were tested. Weighting scheme 1 gave all items equal weight. Weighting scheme 2 gave more weight to study design by multiplying the points awarded for study design by an arbitrarily selected weighting factor of 3. In weighting scheme 3, points for study design, randomization, blinding, statistical analysis, and support of conclusions by the results were given more weight, similar to the scheme proposed by Chalmers et al.[1]
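The effect of a weighting scheme on one article's score can be illustrated as follows. This is a sketch only: the function `weighted_score`, the item names, and the example points are hypothetical, and it assumes that a weight multiplies both the awarded and the maximum points for an item so that the score remains a fraction of the total possible. The weights used in scheme 3 are not given in the text, so only schemes 1 and 2 are shown.

```python
def weighted_score(items, weights=None):
    """Weighted fraction of possible points.

    items maps item name -> (awarded points, maximum points); items rated
    "Not applicable" are assumed to have been dropped by the caller.
    weights maps item name -> multiplier; unlisted items get weight 1.
    """
    weights = weights or {}
    awarded = sum(weights.get(name, 1) * pts for name, (pts, _) in items.items())
    possible = sum(weights.get(name, 1) * mx for name, (_, mx) in items.items())
    return awarded / possible


# Hypothetical article: a randomized trial (5 of 5 design points) that
# meets few of the other criteria.
items = {
    "study design": (5, 5),
    "randomization": (1, 2),
    "blinding": (0, 2),
    "statistical analysis": (1, 2),
}
scheme_1 = weighted_score(items)                       # all items equal
scheme_2 = weighted_score(items, {"study design": 3})  # design weighted by 3
print(round(scheme_1, 3), round(scheme_2, 3))
```

In this invented case the heavier weight on study design raises the overall score, which shows why absolute scores are not comparable across weighting schemes.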

Reviewer Background, Training, and Masking.--For the pretest, we recruited seven reviewers with research experience in health sciences (two biostatisticians, three epidemiologists, and two clinical pharmacologists). The reviewers were given a six-page set of written instructions, worked independently, were masked to whether an article was from a symposium, and were given photocopies of articles from which author names, institutions, journal names, dates, and all other reference information had been obliterated. Reviewers were also unaware of the purpose of our study.

Interrater Reliability.--Interrater reliability of individual items was assessed by percentage agreement. Interrater reliability of overall scores was assessed by Kendall's coefficient of concordance (W) with adjustment for tied ranks.[18] As others have more commonly reported the intraclass correlation (r) as a measure of interrater reliability of similar instruments, we also determined r (treating both reviewers and articles as random effects[19] ). For tests of significance, one-tailed alpha=.05 was used for W and two-tailed alpha=.05 for r.
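For readers who wish to reproduce these two statistics, the following Python sketch computes Kendall's W with the standard correction for tied ranks and a two-way random-effects intraclass correlation for single ratings. It is an illustration under assumptions: the function names and example scores are hypothetical, and the intraclass formula shown (the form often labeled ICC(2,1)) is assumed to correspond to the random-effects analysis cited above rather than confirmed to be the exact computation used.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kendalls_w(scores):
    """Kendall's W with tie correction; scores[r][a] is rater r, article a."""
    m, n = len(scores), len(scores[0])
    ranked = [average_ranks(row) for row in scores]
    totals = [sum(row[a] for row in ranked) for a in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    # Tie correction: sum of (t^3 - t) over every group of tied ranks.
    ties = 0
    for row in ranked:
        for rank in set(row):
            t = row.count(rank)
            ties += t ** 3 - t
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)


def icc_two_way_random(scores):
    """Single-rating ICC, raters and articles both treated as random."""
    k, n = len(scores), len(scores[0])  # k raters, n articles
    grand = sum(map(sum, scores)) / (n * k)
    article_means = [sum(row[a] for row in scores) / k for a in range(n)]
    rater_means = [sum(row) / n for row in scores]
    ss_articles = k * sum((x - grand) ** 2 for x in article_means)
    ss_raters = n * sum((x - grand) ** 2 for x in rater_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ms_articles = ss_articles / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = (ss_total - ss_articles - ss_raters) / ((n - 1) * (k - 1))
    return (ms_articles - ms_error) / (
        ms_articles + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )


# Hypothetical overall scores: three reviewers (rows), five articles (columns).
scores = [
    [0.20, 0.45, 0.60, 0.35, 0.70],
    [0.25, 0.40, 0.65, 0.40, 0.60],
    [0.15, 0.50, 0.55, 0.30, 0.75],
]
print(round(kendalls_w(scores), 2), round(icc_two_way_random(scores), 2))
```

W depends only on the rank order of the scores, whereas the intraclass correlation is also sensitive to differences in the absolute level of scores between reviewers, so the two measures can diverge on the same data.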

Correlational Validity of Final Methodologic Quality Instrument

Correlational validity of the final methodologic quality instrument was assessed by comparing the rank order of quality scores assigned to 10 articles by our instrument with the rank order of scores previously assigned to the same articles by means of a modified version of the instrument designed by Chalmers et al.[20] This instrument was the only one we found for which both the references and exact quality scores of articles were reported.[21] Two reviewers (both clinical pharmacologists) were given photocopies of the 10 articles, which were masked as described above. Because the interrater reliability between the two reviewers was high (r=.79; 95% confidence interval [CI], .39 to .94), we used the mean of the two reviewers' scores for each article to represent the score assigned by our instrument. The rank order of the mean scores was compared by means of Kendall's W.

Pretest of Clinical Relevance Instrument

To assess nonmethodologic measures of quality, such as clinical relevance, generalizability, and ethics, we designed a seven-item instrument that we call the "clinical relevance" instrument. We awarded points to each response as follows: two for "Yes," one for "Partial" or "Insufficient evidence," and zero for "No" or "No controls." The overall quality score was the total of awarded points divided by 14, the total possible points. We weighted all items equally.
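The point assignment for the clinical relevance instrument can be sketched as below. The function name `clinical_relevance_score`, the lowercase response labels, and the example ratings are hypothetical; only the point values and the divisor of 14 come from the description above.

```python
# Points per response for the seven-item clinical relevance instrument.
CLINICAL_POINTS = {
    "yes": 2,
    "partial": 1,
    "insufficient evidence": 1,
    "no": 0,
    "no controls": 0,
}


def clinical_relevance_score(responses):
    """Total awarded points divided by 14, the maximum for seven items."""
    if len(responses) != 7:
        raise ValueError("the instrument has seven items")
    return sum(CLINICAL_POINTS[r.lower()] for r in responses) / 14


# Hypothetical ratings of one article by one reviewer.
ratings = [
    "Yes", "Yes", "No controls", "Partial",
    "Insufficient evidence", "Yes", "No",
]
print(round(clinical_relevance_score(ratings), 3))
```

Unlike the methodologic quality score, the denominator here is fixed because every item is scored for every article.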

Seven reviewers with research experience in health sciences (one biostatistician and six physicians, including two general internists, one anesthesiologist, one ophthalmologist, one pediatrician, and one adult critical care physician) pretested this instrument. They received four pages of written instructions, worked independently, and were masked as described above.

RESULTS

Pretest of Methodologic Quality Instrument

The pretest of the methodologic quality instrument required approximately 30 minutes per article per reviewer. There were no missing data. The average of reviewers' scores for each article ranged from 0.15 to 0.65 (SD, 0.18). The interrater reliability of overall quality scores differed between the three different weighting schemes used to derive the scores. For weighting scheme 1, W was 0.60 and r was .81 (95% CI, .58 to .93). For weighting scheme 2, W was 0.64 and r was .89 (95% CI, .73 to .96). For weighting scheme 3, W was 0.73 and r was .85 (95% CI, .65 to .94). The differences between r values were not significant (z; P>.5 for all pairwise combinations of the three r values). We report only the results obtained with weighting scheme 2 because it generally resulted in the highest interrater reliability for overall scores and is simple to interpret.

Table 1 shows interrater agreement for individual items. We considered an item to have "good agreement" if five of seven reviewers gave the same response. For each item, we counted the fraction of articles (of a total of six) that had good agreement. In general, reviewers found that individual items had content validity but suggested that the questions be more specific. (See Table 2 for the final instrument.) To achieve more specificity, we split compound questions (eg, item 13 in the pretest instrument became items 15 and 16 in the final instrument) or items that represented overgeneralized concepts (eg, item 3 in the pretest instrument became items 7 through 9 in the final instrument). Reviewers suggested clarifying the difference between whether something was done vs whether it was reported (items 6 and 10 through 13 in the final instrument). Reviewers also suggested adding several items (items 1, 2, 18, and 19 in the final instrument).
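The "good agreement" tally described at the start of the preceding paragraph can be sketched as follows. The function names and the example responses are hypothetical, and the sketch assumes that "five of seven" means at least five reviewers giving the same response.

```python
from collections import Counter


def has_good_agreement(responses, threshold=5):
    """True if at least `threshold` reviewers gave the same response."""
    return Counter(responses).most_common(1)[0][1] >= threshold


def fraction_with_good_agreement(responses_by_article, threshold=5):
    """Fraction of articles on which one item reached good agreement."""
    flags = [has_good_agreement(r, threshold) for r in responses_by_article]
    return sum(flags) / len(flags)


# Hypothetical responses of seven reviewers to one item, for six articles.
# Y = Yes, P = Partial, N = No, NA = Not applicable.
item_responses = [
    ["Y", "Y", "Y", "Y", "Y", "N", "P"],       # 5 agree
    ["Y", "Y", "Y", "Y", "N", "N", "P"],       # 4 agree
    ["N", "N", "N", "N", "N", "N", "N"],       # 7 agree
    ["P", "P", "P", "Y", "Y", "N", "N"],       # 3 agree
    ["Y", "Y", "Y", "Y", "Y", "Y", "N"],       # 6 agree
    ["NA", "NA", "NA", "NA", "NA", "Y", "N"],  # 5 agree
]
print(fraction_with_good_agreement(item_responses))
```

In this invented example, four of the six articles reach good agreement on the item.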

Because low percentage agreement was generally the result of discrepancies in the use of the responses "Partial" and "Not applicable," we added illustrative examples to the instruction manual and listed criteria for "Not applicable" responses for each item on the instrument itself.

Correlational Validity of Final Methodologic Quality Instrument

The mean (+/-SD) quality score for the 10 articles used to assess correlational validity was 0.60+/-0.13 with our instrument (range, 0.36 to 0.74; possible range, 0 to 1). The mean of the scores obtained by Detsky et al[21] with their instrument was 0.23+/-0.15 (range, 0.04 to 0.54; possible range, 0 to 1). The mean scores differed significantly (t test; P<.001). However, the rank order of the scores was similar, as indicated by a high value of Kendall's W (0.74).

Pretest of Clinical Relevance Instrument

The pretest of this instrument required approximately 30 minutes per article per reviewer. There were only two instances of missing data, both from one reviewer. The average of reviewer scores for each paper ranged from 0.30 to 0.66 (SD, 0.10). The interrater reliability for overall scores was moderate (r=.41 [95% CI, .18 to .64]; W=0.47). The percentage agreement of individual items was high (Table 3).

Reviewers generally thought that the instrument had content validity. We changed the phrase "control group" to "comparison group" in item 3 and added the words "As far as could be determined from the article" in item 7 to improve interrater agreement, as suggested by the reviewers. All other items were left unchanged. Table 4 shows the wording of items in the final instrument.

COMMENT

We have developed valid and reliable instruments to assess the methodologic quality and clinical relevance of drug studies published in the biomedical literature. Because the instruments measure different aspects of quality, they complement each other and, together, define a broad range of criteria by which to assess overall quality.

Our instruments improve on those reported in the literature in four ways. First, our methodologic quality instrument is applicable to a variety of study designs and is relatively short and easy to use.

Second, both instruments have a high interrater reliability compared with most similar, published instruments.[15] [21] [22] [23] The moderate interrater reliability of the overall scores of our clinical relevance instrument may have resulted from the low variability of scores among the 10 articles in our sample. However, the interrater agreement for individual items of the clinical relevance instrument was surprisingly high, considering the high proportion of subjective items as compared with the methodologic quality instrument.

Third, the methodologic quality instrument was used successfully to assess articles that were not about drugs,[21] suggesting that it is applicable to a variety of topics. The good correlation of the rank order of the quality scores assigned by means of our instrument and the one used by Detsky et al[21] indicates that the instruments are similar in measuring methodologic quality. Because there is no accepted gold standard of methodologic quality against which to measure external validity, we used correlational validity as the best available alternative.

Fourth, our instruments had high reliability and validity when used by reviewers from diverse backgrounds who worked completely independently and without lengthy training sessions. Complete independence of reviewers avoids the systematic bias that can be introduced by consensus conferencing if one of the reviewers is persuasive or has strong opinions. We tested the reliability of our instruments under the conditions in which we thought they would most likely be used in real-life situations, such as in the peer review of manuscripts.

We will be able to use both instruments for the purposes for which they were originally designed. Namely, we will have independent reviewers determine whether the type of sponsorship or type of review process of drug studies is associated with methodologic or nonmethodologic quality. The instruments will also be useful for separating high- and low-quality articles for meta-analysis. Finally, for health care providers, such as clinical pharmacists or pharmacologists, who must screen vast amounts of literature in providing drug information services, our instruments can augment available guidelines on how to read medical literature[24] [25] by providing specific criteria by which to evaluate articles.


From the Institute for Health Policy Studies, School of Medicine (Drs Cho and Bero), and Division of Clinical Pharmacy, School of Pharmacy (Dr Bero), University of California-San Francisco; Center for Health Care Evaluation, Department of Veterans Affairs, Veterans Affairs Medical Center, Palo Alto, Calif (Dr Cho); and Department of Health Research and Policy, Stanford (Calif) University (Dr Cho).

Presented in part at the Second International Congress on Peer Review in Biomedical Publication, Chicago, Ill, September 9, 1993.

This investigation was supported by funds provided by the American Association for Retired Persons (Dr Bero), the Cigarette and Tobacco Surtax Fund of the State of California through the Tobacco-Related Disease Research Program of the University of California under award 2KT0072 (Dr Bero), the Pew Charitable Trusts (Dr Cho), and the Veterans Affairs Office of Academic Affairs and Health Services Research and Development Service Research Funds (Dr Cho).

We gratefully acknowledge our reviewers for their time and critique of our instruments: Neal Benowitz, MD, John Flynn, MD, Piero Gepetti, MD, Stan Glantz, PhD, Peter Lurie, MD, MPH, Haim Mayan, MD, Mitchell Pelter, PharmD, John Piette, PhD, MPH, Adrienne Randolph, MD, Susan Rosenkranz, PhD, Gordon Rubenfeld, MD, Serena Seifer, MD, MPH, Dan Stryer, MD, and Tracey Woodruff, PhD, MPH. We also thank Phillip Lollar for administrative help.

Reprint requests to Institute for Health Policy Studies, 1388 Sutter St, 11th Floor, San Francisco, CA 94109 (Dr Cho).


References

1. Chalmers TC, Smith H Jr, Blackburn B, et al. A method for assessing the quality of a randomized control trial. Controlled Clin Trials. 1981;2:31-49.

2. Cooper GS, Zangwill L. An analysis of the quality of research reports in the Journal of General Internal Medicine. J Gen Intern Med. 1989;4:232-236.

3. Detsky AS, Naylor CD, O'Rourke K, McGeer AJ, L'Abbe KA. Incorporating variations in the quality of individual randomized trials into meta-analysis. J Clin Epidemiol. 1992;45:255-265.

4. DerSimonian R, Charette LJ, McPeek B, Mosteller F. Reporting on methods in clinical trials. N Engl J Med. 1982;306:1332-1337.

5. Bailar JC, Mosteller F. Guidelines for statistical reporting in articles for medical journals. Ann Intern Med. 1988;108:266-273.

6. Berlin JA, Goodman SN, Fletcher SW, Fletcher RH. An instrument for assessing the quality of reporting of clinical research. Read before the Second International Congress on Peer Review in Biomedical Publication, September 9, 1993, Chicago, Ill.

7. Canadian Task Force for the Periodic Health Examination. The periodic health examination. Can Med Assoc J. 1979;121:1193-1197.

8. US Preventive Services Task Force. Guide to Clinical Preventive Services: An Assessment of the Effectiveness of 169 Interventions. Baltimore, Md: Williams & Wilkins; 1989.

9. Spitzer WO, Lawrence V, Dales R, et al. Links between passive smoking and disease: a best evidence synthesis: a report of the working group on passive smoking. Clin Invest Med. 1990;13:17-42.

10. Liberati A, Himel HN, Chalmers TC. A quality assessment of randomized control trials of primary treatment of breast cancer. J Clin Oncol. 1986;4:942-951.

11. Scott WA. Interreferee agreement on some characteristics of manuscripts submitted to the Journal of Personality and Social Psychology. Am Psychol. 1974;29:698-702.

12. Wolff WM. A study of criteria for journal manuscripts. Am Psychol. 1970;25:636-639.

13. McReynolds P. Reliability of ratings of research papers. Am Psychol. 1971;26:400-401.

14. Bero LA, Galbraith A, Rennie D. The publication of sponsored symposia in medical journals. N Engl J Med. 1992;327:1135-1140.

15. Andrew E, Eide P, Fuglerud EK, et al. Publications on clinical trials with X-ray contrast media: differences in quality between journals and decades. Eur J Radiol. 1990;10:92-96.

16. Hemminki E. Quality of clinical trials: a concern of three decades. Methods Inform Med. 1982;21:81-85.

17. Food and Drug Administration. Offices of Drug Evaluation Statistical Report. Washington, DC: Center for Drug Evaluation and Research; 1990.

18. Siegel S, Castellan NJ. Nonparametric Statistics for the Behavioral Sciences. 2nd ed. New York, NY: McGraw-Hill International Book Co; 1988.

19. Haggard EA. Intraclass Correlation and the Analysis of Variance. New York, NY: Dryden Press Inc; 1958.

20. L'Abbe KA, Detsky AS, O'Rourke K. Meta-analysis in clinical research. Ann Intern Med. 1987;107:224-233.

21. Detsky AS, Baker JP, O'Rourke K, Goel V. Perioperative parenteral nutrition: a meta-analysis. Ann Intern Med. 1987;107:195-203.

22. Feurer I, Becker G, Picus D, Ramires E, Darcy M, Hicks M. Evaluating peer reviews: pilot testing of a grading instrument. Read before the Second International Congress on Peer Review in Biomedical Publication, September 9, 1993, Chicago, Ill.

23. Andrew E. Method for assessment of the reporting standard of clinical trials with roentgen contrast media. Acta Radiol Diagn. 1984;25:55-58.

24. Guyatt GH, Sackett DL, Cook DJ. Users' guides to the medical literature, II: how to use an article about therapy or prevention, A: are the results of the study valid? JAMA. 1993;270:2598-2601.

25. Guyatt GH, Sackett DL, Cook DJ. Users' guides to the medical literature, II: how to use an article about therapy or prevention, B: what were the results and will they help me in caring for my patients? JAMA. 1994;271:59-63.
