Do Readers and Peer Reviewers Agree on Manuscript Quality?
(JAMA. 1994;272:117-119)
Amy C. Justice, MD; Jesse A. Berlin, ScD; Suzanne W. Fletcher,
MD; Robert H. Fletcher, MD; Steven N. Goodman, MD, PhD
Objective.--To study readers' judgments of manuscript
quality and the degree to which readers agreed with peer reviewers.
Design.--Cross-sectional study.
Setting.--Annals of Internal Medicine.
Subjects.--One hundred thirteen consecutive manuscripts
reporting original research and selected for publication. Each of two
manuscript versions (one before and one after revision) was judged by
two readers, randomly sampled from those who said (based on the title)
that they would read the article; one peer reviewer (peer), chosen in
the usual way for Annals; and one expert in clinical research
methods (expert). Each judge completed an instrument that included a
10-point subjective summary grade of manuscript quality.
Main Outcome Measures.--Agreement on the 10-point summary
grade of manuscript quality between reader-expert, reader-peer, and
reader-reader.
Results.--Readers and peers gave high grades (84% and 79%
gave a grade of 5 or better, respectively), while experts were more
critical (61% gave a grade of 5 or better; P<.0001).
Agreement was relatively high among judge groups (in all cases,
>69%) but agreement beyond chance was poor (kappa<0.04). One third
of readers (33%) thought that the manuscript had little relevance to
their work.
Conclusion.--Readers, like most peer reviewers, are
generally satisfied with the quality of manuscripts but would like
research articles to be more relevant to their clinical practice.
MEDICAL JOURNALS serve two different
audiences: clinical researchers and practicing physicians.
Researchers with specialized knowledge of either the methods or the
content area of a manuscript regularly participate in the peer review
process, but practicing physicians do not. To understand better the
reader's view of manuscript quality and to determine whether peer
reviewers represent the opinions of journal readers, we used data
collected as part of a larger study of manuscript quality[1]
to compare readers' and peer reviewers' judgments of manuscript
quality.
MATERIALS AND METHODS
Setting
Annals of Internal Medicine has a circulation of
100,000; about half the subscribers are general internists and
half are subspecialists in internal medicine. Approximately 10% of
readers are full-time academicians. Annals receives more than
2400 manuscript submissions per year, of which half are reports of
original research. Half of the manuscripts are rejected after internal
review, and the remaining half are sent for peer review by at least two
independent reviewers who are chosen from a computerized list of 7000,
most of whom are clinical researchers. For each paper, approximately
six reviewers with expertise in the content and/or methods of the paper
are chosen from the list and are consecutively contacted until two
agree to review the paper. Approximately 15% of submitted original
research manuscripts are published.
Manuscript Selection
One hundred thirteen consecutive reports of original research were
accepted for publication between March 1, 1992, and March 1, 1993.
After authors agreed to participate, the submitted and final manuscript
versions were obtained on a diskette and printed in the same format, so
that their appearance was comparable. Authors' names and affiliations
were removed. Manuscripts were selected and prepared in this fashion so
that judgments could be compared before and after editorial revision;
the results of this comparison are reported elsewhere.[1] For most manuscripts, reviewers and readers judged the
manuscripts before they were published.
Judge Selection
Readers were identified from a 2.7% random sample of Annals
subscribers. Everyone in the sample was sent a letter asking them to
rate a list of titles for how likely they would be to read the
corresponding article. Each manuscript was sent to a random sample of
the readers who had agreed to participate and had rated themselves
"likely" or "highly likely" to read it. Two hundred eight peers were
chosen by the usual peer review procedure at Annals from among
reviewers who had not participated in the initial review of the
manuscript in question. Thirty experts were selected from a panel of
researchers identified by us for their expertise in clinical research
methods. No judge read both the prerevision and revised versions of the
same manuscript.
Measurements
Although readers, peers, and experts received different surveys, all
surveys included the statement "this is a study of manuscript
quality" and an identical 10-point subjective summary grade for the
manuscript quality, with the prompts of poor, fair, acceptable, good,
and superb. Readers received a survey with 18 five-point ordinal
questions (Table 1). Peers received a survey containing five general (five-point) questions on the importance of the
question, the written presentation, the originality of the research,
its scientific validity, and its appropriateness for Annals.
Experts received a longer survey containing 34 (five-point) specific
questions primarily about reporting of research methods. The expert
survey and its results are reported separately.[1]
Analysis
We first plotted the univariate distribution of summary grades given by
readers, peers, and experts and used the chi-square test to assess
whether the percentage of grades dichotomized as "acceptable" (a grade
of ≥5) rather than "not acceptable" (≤4) differed by judge type.
Next we measured weighted agreement among the
judges on the overall manuscript quality in paired comparisons and
tested for agreement beyond chance with the kappa index.[2]
We explored the five-point component questions on the reader's survey
to identify questions that correlated with the summary grade and to
identify areas of reader dissatisfaction. Finally, we compared areas of
reader dissatisfaction with those of peers and experts.
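The agreement step can be illustrated with a minimal sketch. It assumes linear agreement weights on the 10-point scale; the article does not state which weights were used, so the weighting scheme, the function name, and the example pairs below are all assumptions made for illustration rather than the study's procedure or data.

```python
from collections import Counter


def weighted_agreement_and_kappa(pairs, n_levels=10):
    """Return (observed, expected, kappa) using linear agreement weights.

    pairs: list of (grade_a, grade_b) tuples with integer grades 1..n_levels.
    A pair's weight is 1 - |a - b| / (n_levels - 1), so identical grades
    score 1 and grades at opposite ends of the scale score 0.
    The linear weighting is an assumption, not taken from the article.
    """
    n = len(pairs)
    span = n_levels - 1

    def weight(a, b):
        return 1.0 - abs(a - b) / span

    # Observed weighted agreement: mean weight over the actual pairs.
    observed = sum(weight(a, b) for a, b in pairs) / n

    # Chance-expected agreement: mean weight if the two judges' grades were
    # independent draws from their own marginal distributions.
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    expected = sum(
        weight(a, b) * (ca / n) * (cb / n)
        for a, ca in marg_a.items()
        for b, cb in marg_b.items()
    )

    # Kappa is undefined when chance alone already guarantees full agreement.
    if expected == 1.0:
        return observed, expected, float("nan")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, expected, kappa


# Hypothetical pairs clustered near the top of the scale (not study data).
example_pairs = [(7, 8), (8, 7), (7, 7), (8, 8), (6, 8), (8, 6), (7, 6), (6, 7)]
observed, expected, kappa = weighted_agreement_and_kappa(example_pairs)
print(f"observed={observed:.2f} expected={expected:.2f} kappa={kappa:.2f}")
```

In this made-up example the weighted agreement is high because every grade falls between 6 and 8, yet kappa is not above zero, since judges grading independently within that narrow band would agree about as often.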
RESULTS
Summary grades were completed by 81% (364/452) of the readers, 93%
(211/226) of the peers, and 98% (221/226) of the experts.
Distribution of Manuscript Summary Grades
Judges generally gave high grades; the median grades given by readers,
peers, and experts were 7, 8, and 6, respectively
(Figure). Although all three types of judges used most of the scale, experts were the toughest graders. When manuscript
grades were dichotomized by "acceptable" or "not acceptable,"
84% of readers and 79% of peers graded the manuscripts as acceptable,
while only 61% of the grades given by experts were as high.
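The comparison of dichotomized grades can be approximated from the figures reported here. The sketch below rebuilds rough counts from the percentages above and the numbers of completed summary grades (364, 211, and 221); these counts are rounded reconstructions, not the authors' raw data, and the calculation treats every grade as an independent observation, which is a simplifying assumption.

```python
import math

# Completed summary grades per judge type, as reported in the Results.
completed = {"readers": 364, "peers": 211, "experts": 221}
# Reported share of grades that were "acceptable" (grade of 5 or better).
share_acceptable = {"readers": 0.84, "peers": 0.79, "experts": 0.61}


def chi_square_acceptable(completed, share_acceptable):
    """Pearson chi-square for a 2 x k table of acceptable vs not acceptable.

    Counts are reconstructed by rounding n * share, so they are approximate.
    """
    acceptable = {g: round(completed[g] * share_acceptable[g]) for g in completed}
    total = sum(completed.values())
    total_acceptable = sum(acceptable.values())

    statistic = 0.0
    for group, n in completed.items():
        cells = (
            (acceptable[group], total_acceptable),
            (n - acceptable[group], total - total_acceptable),
        )
        for observed, column_total in cells:
            expected = n * column_total / total
            statistic += (observed - expected) ** 2 / expected

    df = len(completed) - 1
    return statistic, df


statistic, df = chi_square_acceptable(completed, share_acceptable)
# With exactly 2 degrees of freedom the chi-square upper tail has the
# closed form exp(-x / 2); this shortcut does not hold for other df.
p_value = math.exp(-statistic / 2) if df == 2 else float("nan")
print(f"chi-square = {statistic:.1f}, df = {df}, P = {p_value:.1e}")
```

Under these assumptions the P value falls well below .0001, which is in line with the significance level reported in the abstract.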
Agreement Among Judges on Individual Manuscripts
We obtained 371 (81%) of the 456 possible reader-expert pairs, 352
(77%) of the 456 reader-peer pairs, and 159 (70%) of the 226
reader-reader pairs. Agreement among readers and peers, readers and
experts, and readers and readers was high (Table 2).
However, agreement did not exceed that which would have been expected
by chance alone. The confidence intervals around the kappa index, a
measure of agreement beyond chance, included the possibility of "fair"
disagreement (a negative kappa index) and excluded the possibility of
better than "fair" agreement (kappa of 0.4 or greater).[2]
Component Quality Questions
In general, reader component scores were high and correlated with the
summary grade (in all cases, r=.25 to .71; P<.001).
The component question most correlated with the summary grade concerned
the enjoyability of the article (r=.71; confidence interval,
.61 to .81). Only 20% of readers gave any component question a score
below 3, with one exception: 33% of readers believed that the
manuscript had little to very little relevance to their work (Table 1).
Readers were also relatively less satisfied with the explanation of the
strengths and weaknesses of the article, the clarity of the figures and
tables, and how enjoyable the article was to read.
Peers completed a much shorter survey (not shown). They had a high
level of satisfaction for all component questions (83% to 97% gave a
grade of 3 or better). In particular, 97% of peers were satisfied
with the importance of the study question and 90% were satisfied with
the clarity of the study presentation. Experts were critical of how
various aspects of the manuscript's methods were
reported.[1] However, most experts (93%) were satisfied that quantitative results were reported in a manner that most could understand.
COMMENT
Readers and peers thought that the overall quality of manuscripts
selected for publication at Annals was "acceptable" or
better. Experts were somewhat more critical. Agreement was high but did
not exceed that expected by chance for any of the comparisons.
Agreement can be both high and not beyond chance whenever there is
insufficient variation in grades.[2] The 10-point summary grades that we studied demonstrated a strong tendency toward high
grades for all manuscripts. Thus, lack of agreement beyond chance may
be the result of a ceiling effect. Only 15% of the original-research
manuscripts submitted to Annals are selected for publication,
and all the manuscripts in this study were from this select group. The
strong tendency toward high grades may reflect a uniformly high level
of quality among these manuscripts, and the remaining variation around
these grades may be simply "noise." Another possible explanation is
that the small variation that was present resulted from unadjusted
differences in judges' personal grading style or level of expertise
with a given manuscript. In contrast, lack of agreement beyond chance
was not caused by limited sample size; the 95% confidence intervals
around kappa excluded the possibility of better than fair agreement.
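The arithmetic behind high agreement without agreement beyond chance can be sketched with the dichotomized grades. This is a simplification: the study measured weighted agreement on the 10-point scale, whereas the sketch uses only the reported proportions of "acceptable" grades, and the observed agreement of 0.70 is a hypothetical value chosen for illustration.

```python
def chance_agreement(p_a, p_b):
    """Expected agreement on a yes/no grade if two judges grade independently.

    p_a, p_b: each judge group's proportion of "acceptable" grades.
    """
    return p_a * p_b + (1.0 - p_a) * (1.0 - p_b)


def kappa(observed, expected):
    """Agreement beyond chance as a share of the maximum possible beyond chance."""
    return (observed - expected) / (1.0 - expected)


# Proportions of "acceptable" grades reported in the Results.
readers, peers, experts = 0.84, 0.79, 0.61

reader_peer_chance = chance_agreement(readers, peers)      # about 0.70
reader_expert_chance = chance_agreement(readers, experts)  # about 0.57
balanced_chance = chance_agreement(0.5, 0.5)               # 0.50

# Hypothetical observed agreement, chosen only for illustration.
hypothetical_observed = 0.70
print(round(kappa(hypothetical_observed, reader_peer_chance), 3))
print(round(kappa(hypothetical_observed, balanced_chance), 3))
```

When both groups call most manuscripts acceptable, independent judges would already agree about 70% of the time, so an observed agreement of 0.70 gives a kappa near zero. The same observed agreement would give a kappa of 0.4 if grades were split evenly between the two categories.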
When we analyzed component survey questions, we found that readers were
more likely to rate a manuscript highly if they found it enjoyable.
Furthermore, readers were relatively dissatisfied with the relevance of
the manuscript to their medical practice, even though only readers who
responded that they were "likely" or "highly likely" to read
the manuscript were included in the study.
Readers also did not agree with peer or expert assessment of clinical
importance or manuscript comprehensibility. These differences of
opinion may have several causes: original research articles are chosen
partly because they present new information, but clinicians tend to
avoid unestablished treatments; readers are not as familiar with other
articles on the same question, which form the context in which the
article at hand is judged; and readers may be less prepared to assess
the scientific strengths and weaknesses of the study. From the
readers' perspective, researchers may be less prepared to assess
clinical relevance or the comprehensibility and enjoyability of
articles.
Our study must be interpreted in light of several limitations. It
involved only one medical journal, and its results may not be
generalizable to journals that are aimed at other medical specialties,
have fewer or more homogeneous subscribers, or receive fewer
submissions. We studied only manuscripts selected for publication; it
is possible that readers would more consistently agree with peer
reviewers and other readers if they were asked to decide which
manuscripts should be published from among all manuscripts submitted.
Each category of judge completed a different survey with only one
identical question, the global quality grade; the differences between
judge groups may be partially explained by a framing effect related to
the differences between the survey instruments.
CONCLUSIONS
General medical journals are intended to serve clinicians as well as
researchers. Practicing physicians require clinical relevance--an
element that may not be well assessed by traditional peer review. In
the future, editors should find ways to incorporate the reader's
perspective into the peer review process and study the effects of their
efforts.
From the Center for Clinical Epidemiology and Biostatistics,
University of Pennsylvania Medical Center, Philadelphia (Drs Justice
and Berlin); Leonard Davis Institute of Health Economics, University of
Pennsylvania, Philadelphia (Dr Justice); Editorial Offices, Annals
of Internal Medicine, Philadelphia (Drs S. Fletcher and R.
Fletcher); and Department of Oncology, The Johns Hopkins University
School of Medicine, Baltimore, Md (Dr Goodman). Drs S. Fletcher and R.
Fletcher are now with the Department of Ambulatory Care and Prevention,
Harvard Medical School and Harvard Community Health Plan, Boston, Mass.
Presented in part at the Second International Congress on Peer Review
in Biomedical Publication, Chicago, Ill, September 9, 1993.
We thank Tori Ransome and Grace Lobb for their indefatigable
efforts in coordinating this project and Andrew Langman for
coordinating the computer formatting of manuscripts in the study.
Reprint requests to Center for Clinical Epidemiology and Biostatistics,
420 Service Dr, Floor 2L, Nursing Education Bldg, University of
Pennsylvania, Philadelphia, PA 19104-6095 (Dr Justice).
References
1. Goodman SN, Berlin JA, Fletcher SW, Fletcher RH. Manuscript quality before and after peer review and editing at Annals of Internal Medicine. Ann Intern Med. 1994;121:11-21.
2. Kramer MS, Feinstein AR. Clinical biostatistics, LIV: the biostatistics of concordance. Clin Pharmacol Ther. 1981;29:111-123.