Mechanisms of Peer Review

Do Readers and Peer Reviewers Agree on Manuscript Quality?

(JAMA. 1994;272:117-119)

Amy C. Justice, MD; Jesse A. Berlin, ScD; Suzanne W. Fletcher, MD; Robert H. Fletcher, MD; Steven N. Goodman, MD, PhD

Objective.--To study readers' judgments of manuscript quality and the degree to which readers agreed with peer reviewers.

Design.--Cross-sectional study.

Setting.--Annals of Internal Medicine.

Subjects.--One hundred thirteen consecutive manuscripts reporting original research and selected for publication. Each of two manuscript versions (one before and one after revision) was judged by two readers, randomly sampled from those who said (based on the title) that they would read the article; one peer reviewer (peer), chosen in the usual way for Annals; and one expert in clinical research methods (expert). Each judge completed an instrument that included a 10-point subjective summary grade of manuscript quality.

Main Outcome Measures.--Agreement on the 10-point summary grade of manuscript quality between reader-expert, reader-peer, and reader-reader.

Results.--Readers and peers gave high grades (77% and 73% gave a grade of 5 or better, respectively), while experts were more critical (52% gave a grade of 5 or better; P<.0001). Agreement was relatively high among judge groups (in all cases, >69%) but agreement beyond chance was poor (kappa<0.04). One third of readers (33%) thought that the manuscript had little relevance to their work.

Conclusion.--Readers, like most peer reviewers, are generally satisfied with the quality of manuscripts but would like research articles to be more relevant to their clinical practice.


MEDICAL JOURNALS serve two different audiences: clinical researchers and practicing physicians. Researchers with specialized knowledge of either the methods or the content area of a manuscript regularly participate in the peer review process, but practicing physicians do not. To better understand the reader's view of manuscript quality and to determine whether peer reviewers represent the opinions of journal readers, we used data collected as part of a larger study of manuscript quality[1] to assess whether readers and peer reviewers agree on manuscript quality.

MATERIALS AND METHODS

Setting

Annals of Internal Medicine has a circulation of 100,000; about half the subscribers are general internists and half are subspecialists in internal medicine. Approximately 10% of readers are full-time academicians. Annals receives more than 2400 manuscript submissions per year, of which half are reports of original research. Half of the manuscripts are rejected after internal review, and the remaining half are sent for peer review by at least two independent reviewers who are chosen from a computerized list of 7000, most of whom are clinical researchers. For each paper, approximately six reviewers with expertise in the content and/or methods of the paper are chosen from the list and are consecutively contacted until two agree to review the paper. Approximately 15% of submitted original research manuscripts are published.

Manuscript Selection

One hundred thirteen consecutive reports of original research were accepted for publication between March 1, 1992, and March 1, 1993. After authors agreed to participate, the submitted and final manuscript versions were obtained on a diskette and printed in the same format, so that their appearance was comparable. Authors' names and affiliations were removed. Manuscripts were selected and prepared in this fashion so that judgments could be compared before and after editorial revision; the results of this comparison are reported elsewhere.[1] For most manuscripts, reviewers and readers judged the manuscripts before they were published.

Judge Selection

Readers were identified from a 2.7% random sample of Annals subscribers. Everyone in the sample was sent a letter asking them to rate a list of titles for how likely they would be to read the corresponding article. A random sample of readers who agreed to participate and were "likely" or "highly likely" to read a manuscript was sent that manuscript. Two hundred eight peers were selected by the usual peer review procedure at Annals; none had participated in the initial review of the manuscript in question. Thirty experts were selected from a panel of researchers identified by us for their expertise in clinical research methods. No judge read both the prerevision and revised versions of the same manuscript.

Measurements

Although readers, peers, and experts received different surveys, all surveys included the statement "this is a study of manuscript quality" and an identical 10-point subjective summary grade for manuscript quality, with the prompts of poor, fair, acceptable, good, and superb. Readers received a survey with 18 five-point ordinal questions (Table 1). Peers received a survey containing five general (five-point) questions on the importance of the question, the written presentation, the originality of the research, its scientific validity, and its appropriateness for Annals. Experts received a longer survey containing 34 (five-point) specific questions primarily about reporting of research methods. The expert survey and its results are reported separately.[1]

Analysis

We first plotted the univariate distribution of summary grades given by readers, peers, and experts and used the χ² test to determine whether the percentage of grades dichotomized as "acceptable" (a grade of ≥5) or "not acceptable" (≤4) differed by judge type. Next we measured weighted agreement among the judges on the overall manuscript quality in paired comparisons and tested for agreement beyond chance with the kappa index.[2] We explored the five-point component questions on the reader's survey to identify questions that correlated with the summary grade and to identify areas of reader dissatisfaction. Finally, we compared areas of reader dissatisfaction with those of peers and experts.
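
The first analysis step, dichotomizing the 10-point grade at 5 and comparing judge types, might be sketched as follows. This is an illustrative sketch only, not the authors' code: the function names and the grades are hypothetical, and the statistic is the ordinary Pearson chi-square for a table of counts.

```python
def dichotomize(grades, cutoff=5):
    """Count grades at or above the cutoff (acceptable) and below it (not acceptable)."""
    acceptable = sum(1 for g in grades if g >= cutoff)
    return acceptable, len(grades) - acceptable


def chi2_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if grade category were independent of judge type.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical grades, invented for illustration only (not the study data).
grades_by_judge = {
    "reader": [7, 8, 6, 9, 4, 7, 8, 5],
    "peer": [8, 8, 7, 9, 3, 6, 8, 7],
    "expert": [6, 4, 5, 7, 3, 6, 4, 8],
}
table = [list(dichotomize(g)) for g in grades_by_judge.values()]
stat, dof = chi2_statistic(table)
print(f"chi-square = {stat:.2f} on {dof} degrees of freedom")
```

The P value would then be read from the chi-square distribution with the stated degrees of freedom; a library routine such as `scipy.stats.chi2_contingency` performs both steps on the same table.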

RESULTS

Summary grades were completed by 81% (364/452) of the readers, 93% (211/226) of the peers, and 98% (221/226) of the experts.

Distribution of Manuscript Summary Grades

Judges generally gave high grades; the median grades given by readers, peers and experts were 7, 8, and 6, respectively (Figure). Although all three types of judges used most of the scale, experts were the toughest graders. When manuscript grades were dichotomized by "acceptable" or "not acceptable," 84% of readers and 79% of peers graded the manuscripts as acceptable, while only 61% of the grades given by experts were as high.

Agreement Among Judges on Individual Manuscripts

We obtained 371 (81%) of the 456 possible reader-expert pairs, 352 (77%) of the 456 reader-peer pairs, and 159 (70%) of the 226 reader-reader pairs. Agreement among readers and peers, readers and experts, and readers and readers was high (Table 2). However, agreement did not exceed that which would have been expected by chance alone. The confidence intervals around the kappa index, a test of agreement beyond chance, included the possibility of "fair" disagreement (a negative kappa index) and excluded the possibility of better than "fair" agreement (kappa=0.4 or better).[2]
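
As a rough illustration of how weighted agreement and the kappa index relate for paired grades, consider the sketch below. It assumes linear disagreement weights on the 10-point scale, which the text does not specify, and the function name and grade pairs are hypothetical.

```python
from collections import Counter


def weighted_kappa(pairs, n_categories=10):
    """Return (observed, expected, kappa) for paired grades on a 1..n_categories scale.

    Assumption: linear weights, so identical grades score 1 and grades at
    opposite ends of the scale score 0.
    """
    n = len(pairs)
    span = n_categories - 1

    def weight(a, b):
        return 1.0 - abs(a - b) / span

    # Weighted agreement actually observed across the pairs.
    observed = sum(weight(a, b) for a, b in pairs) / n

    # Weighted agreement expected by chance, from each judge's marginal distribution.
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    expected = sum(
        weight(a, b) * first[a] * second[b] for a in first for b in second
    ) / (n * n)

    if expected == 1.0:
        return observed, expected, float("nan")  # kappa is undefined without variation
    return observed, expected, (observed - expected) / (1.0 - expected)


# Hypothetical reader-peer grade pairs, invented for illustration only.
pairs = [(7, 8), (8, 7), (6, 8), (9, 7), (7, 9), (8, 6), (5, 8), (8, 5)]
observed, expected, kappa = weighted_kappa(pairs)
print(f"observed={observed:.2f} expected={expected:.2f} kappa={kappa:.2f}")
```

Kappa rescales the observed agreement so that 0 means no better than chance and 1 means perfect agreement; negative values mean agreement worse than chance.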

Component Quality Questions

In general, reader component scores were high and correlated with the summary grade (in all cases, r=.25 to .71; P<.001). The component question most correlated with the summary grade concerned the enjoyability of the article (r=.71; confidence interval, .61 to .81). Only 20% of readers gave any component question a score below 3, with one exception: 33% of readers believed that the manuscript had little to very little relevance to their work (Table 1). Readers were also relatively less satisfied with the explanation of the strengths and weaknesses of the article, the clarity of the figures and tables, and how enjoyable the article was to read.

Peers completed a much shorter survey (not shown). They had a high level of satisfaction for all component questions (83% to 97% gave a grade of 3 or better). In particular, 97% of peers were satisfied with the importance of the study question and 90% were satisfied with the clarity of the study presentation. Experts were critical of how various aspects of the manuscript's methods were reported.[1] Experts (93%) were generally satisfied that quantitative results were reported in a manner that most could understand.

COMMENT

Readers and peers thought that the overall quality of manuscripts selected for publication at Annals was "acceptable" or better. Experts were somewhat more critical. Agreement was high but did not exceed that expected by chance for any of the comparisons.

Agreement can be both high and not beyond chance whenever there is insufficient variation in grades.[2] The 10-point summary grades that we studied demonstrated a strong tendency toward high grades for all manuscripts. Thus, lack of agreement beyond chance may be the result of a ceiling effect. Only 15% of the original-research manuscripts submitted to Annals are selected for publication, and all the manuscripts in this study were from this select group. The strong tendency toward high grades may reflect a uniformly high level of quality among these manuscripts, and the remaining variation around these grades may be simply "noise." Another possible explanation is that the small variation that was present resulted from unadjusted differences in judges' personal grading style or level of expertise with a given manuscript. In contrast, lack of agreement beyond chance was not caused by limited sample size; the 95% confidence intervals around kappa excluded the possibility of better than fair agreement.
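
A small worked example may make the first point concrete. The sketch below uses hypothetical counts (not the study data) and unweighted kappa on dichotomized grades for simplicity: when both judges call 90% of manuscripts acceptable, raw agreement is 82% yet kappa is zero, because that much agreement is expected by chance alone.

```python
def kappa_2x2(both_yes, a_only, b_only, both_no):
    """Return (observed, expected, kappa) for two judges making yes/no calls."""
    n = both_yes + a_only + b_only + both_no
    observed = (both_yes + both_no) / n
    a_yes = both_yes + a_only
    b_yes = both_yes + b_only
    # Chance agreement: both say yes or both say no, given their marginal rates.
    expected = (a_yes * b_yes + (n - a_yes) * (n - b_yes)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return observed, expected, kappa


# Hypothetical counts: little variation, nearly every manuscript graded acceptable.
ceiling = kappa_2x2(both_yes=81, a_only=9, b_only=9, both_no=1)

# Hypothetical counts: balanced grades with somewhat higher raw agreement.
balanced = kappa_2x2(both_yes=45, a_only=5, b_only=5, both_no=45)

for label, (observed, expected, kappa) in (("ceiling", ceiling), ("balanced", balanced)):
    print(f"{label}: observed={observed:.2f} expected={expected:.2f} kappa={kappa:.2f}")
```

In the balanced case, 90% raw agreement sits well above the 50% expected by chance, so kappa is high; in the ceiling case, the skewed marginals leave almost no room for agreement beyond chance.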

When we analyzed component survey questions, we found that readers were more likely to rate a manuscript highly if they found it enjoyable. Furthermore, readers were relatively dissatisfied with the relevance of the manuscript to their medical practice, even though only readers who responded that they were "likely" or "highly likely" to read the manuscript were included in the study.

Readers also did not agree with peer or expert assessment of clinical importance or manuscript comprehensibility. These differences of opinion may have several causes: original research articles are chosen partly because they present new information, but clinicians tend to avoid unestablished treatments; readers are not as familiar with other articles on the same question, which form the context in which the article at hand is judged; and readers may be less prepared to assess the scientific strengths and weaknesses of the study. From the readers' perspective, researchers may be less prepared to assess clinical relevance or the comprehensibility and enjoyability of articles.

Our study must be interpreted in light of several limitations. It involved only one medical journal, and its results may not be generalizable to journals that are aimed at other medical specialties, have fewer or more homogeneous subscribers, or receive fewer submissions. We studied only manuscripts selected for publication; it is possible that readers would more consistently agree with peer reviewers and other readers if they were asked to decide which manuscripts should be published from among all manuscripts submitted. Each category of judge completed a different survey with only one identical question, the global quality grade; the differences between judge groups may be partially explained by a framing effect related to the differences between the survey instruments.

CONCLUSIONS

General medical journals are intended to serve clinicians as well as researchers. Practicing physicians require clinical relevance--an element that may not be well assessed by traditional peer review. In the future, editors should find ways to incorporate the reader's perspective into the peer review process and study the effects of their efforts.


From the Center for Clinical Epidemiology and Biostatistics, University of Pennsylvania Medical Center, Philadelphia (Drs Justice and Berlin); Leonard Davis Institute of Health Economics, University of Pennsylvania, Philadelphia (Dr Justice); Editorial Offices, Annals of Internal Medicine, Philadelphia (Drs S. Fletcher and R. Fletcher); and Department of Oncology, The Johns Hopkins University School of Medicine, Baltimore, Md (Dr Goodman). Drs S. Fletcher and R. Fletcher are now with the Department of Ambulatory Care and Prevention, Harvard Medical School and Harvard Community Health Plan, Boston, Mass.

Presented in part at the Second International Congress on Peer Review in Biomedical Publication, Chicago, Ill, September 9, 1993.

We thank Tori Ransome and Grace Lobb for their indefatigable efforts in coordinating this project and Andrew Langman for coordinating the computer formatting of manuscripts in the study.

Reprint requests to Center for Clinical Epidemiology and Biostatistics, 420 Service Dr, Floor 2L, Nursing Education Bldg, University of Pennsylvania, Philadelphia, PA 19104-6095 (Dr Justice).


References

1. Goodman SN, Berlin JA, Fletcher SW, Fletcher RH. Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Ann Intern Med. 1994;121:11-21.

2. Kramer MS, Feinstein AR. Clinical biostatistics, LIV: the biostatistics of concordance. Clin Pharmacol Ther. 1981;29:111-123.
