Evaluating Peer Reviews
Pilot Testing of a Grading Instrument
(JAMA. 1994;272:98-100)
Irene D. Feurer, MSEd; Gary J. Becker, MD; Daniel Picus, MD; Estella
Ramirez; Michael D. Darcy, MD; Marshall E. Hicks, MD
Objective.--To measure the reliability and preliminary
validity of a grading instrument for editors to evaluate the quality of
peer reviews.
Design.--The consecutive sample design included 53 reviews
of 23 manuscripts. Reviews were systematically assigned to interrater
reliability (n=41; power greater than 0.90 to detect a difference of
greater than one point) and preliminary criterion-related validity
(n=12) subsamples. Content validity was closely examined.
Setting.--Nonclinical.
Participants.--Three graders evaluated reliability. One
individual examined content validity and two editors tested preliminary
criterion-related validity.
Intervention (Instrument).--Attributes reflecting two basic
dimensions, review content and format, were identified and scored
(values are possible points/percent contribution): timeliness, 3/21%;
grade sheet, 1/7%; etiquette, 1/7%; sectional narratives, 3/21%;
citations, 2/14%; narrative summary, 2/14%; and insights, 2/14%. A
scoring guide was provided.
Main Outcome Measures.--Statistical analyses used to test
the interrater reliability of the total score included the intraclass
correlation coefficient and analysis of variance, with the expectation
that the null hypothesis would be upheld. Kendall's coefficient of concordance
was used to test preliminary criterion-related validity.
Results.--The intraclass correlation coefficient was .84
(P<.001) and a lack of difference between mean scores was
demonstrated by analysis of variance (P=.46). Content validity
was confirmed and preliminary criterion-related validity was indicated
(Kendall's coefficient of concordance=.94, P=.038).
Conclusions.--The instrument is reliable. Content validation
has been completed, and further criterion-related validation is
warranted.
PEER REVIEW of scholarly manuscripts by qualified
outside referees is the cornerstone of the editorial review process.
Its primary purpose is to identify those manuscripts that are worthy of
publication[1]; its important secondary roles have been well
described.[2-6]
Theoretically, a grading instrument providing data about the quality of
individual peer reviews could aid editors by (1) identifying those
reviewers who are consistently helpful, outstanding, or weak, timely or
late, and exacting or lenient, and those who contribute the most and
least to the review process, and (2) providing objective data to use
when updating the reviewer list, making merit-based promotions to the
editorial board, and providing feedback to reviewers. Reliability and
content validity should be demonstrated early in the testing of any
such instrument.[7]
Although several studies have focused on the reliability of
manuscript reviews,[8-11] few objective evaluations of review quality
have been conducted. Of 35 original reports and
posters presented at The First International Congress on Peer Review,
only seven directly addressed the testing of
reviews.[1,11-16] Of these, only one involved the editor's
assessment of reviews via specific content-related
questions.[12]
We examined the reliability, content validity, and preliminary
criterion-related validity of an instrument designed for editors of the
Journal of Vascular and Interventional Radiology
(JVIR) to evaluate manuscript reviews along two basic
dimensions: content and format. Content refers to the substance and
scientific scrutiny evident in the review, while format addresses
practicalities of the peer review and editorial processes; both
dimensions affect the quality of the review and the feedback provided
to the authors and editor.
METHODS
Background
The study was conducted by individuals involved in the peer review
process for JVIR, including the editor in chief (G.J.B.), the
deputy editor (D.P.), the editorial assistant (E.R.), and three
experienced reviewers (I.D.F., M.D.D., and M.E.H.).
The JVIR employs a parallel refereeing system,[17]
with manuscript assignment on the basis of reviewers' areas of
expertise, history with respect to preparing rigorous reviews and
meeting deadlines, and availability. Reviewers are provided a summary
grade sheet and detailed general instructions regarding the scientific
elements to be evaluated, review format, and turnaround time, as well
as reviewing etiquette and ethics. These elements provided the
foundation for instrument development.
Subjects and Sample
Interrater reliability of the instrument was evaluated by a panel of
three graders (D.P., M.D.D., and M.E.H.) not involved in its
development. (G.J.B. and D.P. participated in the preliminary
criterion-related validation.)
The sample was drawn from the pool of manuscripts mailed to reviewers
between October 1 and November 30, 1992. The period was arbitrary, with
the stipulations that (1) the resultant sample provide
sufficient power to evaluate interrater reliability and (2) reviews be
sufficiently old to permit maximal score ranges on all attributes
(including timeliness). The cross-sectional, consecutive sample design
reduced the likelihood of encountering more than one review by a
particular individual. Reviews prepared by participants in this study
were excluded, and if an individual had prepared two reviews during the
target period, only the most recent was included (to avoid a
potentially confounded design).
The final sample included 53 reviews of 23 manuscripts. Forty-one
reviews of 16 manuscripts and 12 reviews of seven manuscripts were
systematically assigned to the reliability and preliminary
criterion-validation subsamples, respectively. Subsample formation was
employed to avoid the necessity of making multiple comparisons with one
grader's (D.P.) observations and resultant inflation of the presumed
alpha level.[18]
Because a reliable instrument requires that the null hypothesis be
sustained when assessing agreement among graders' scores (ie, that
differences would not reach statistical significance), power was a
fundamental consideration. Power was estimated before setting the
sample size and was confirmed post hoc by standard
methods.[18-20] Power was preliminarily estimated to be
greater than 0.85 at the .05 alpha level. Post hoc power calculations
were comparable: 0.91 to detect a score difference of at least one
point (the intended precision of the instrument) and 0.86 to detect a
10% score difference (less than one point at the observed mean).
Instrument
Seven variables addressing both scientific content and review format
were identified, and their relative contributions (scoring) were
refined (Table) during the content validation process.
Content validation, which involves the subjective "rational
analysis" of an instrument's structure[7] (an essential
first step in validation), was conducted in conjunction with the editor
in chief by one of us (I.D.F.) who had not participated in the
instrument's original development. This process, accomplished through
several iterations, entailed (1) examination of the instructions to
reviewers, (2) identification of instrument attributes, and (3)
inspection of each attribute's weight as defined by its scoring. An
explicit scoring guide was developed to provide clear guidance to
individuals applying the instrument and to avoid scoring bias.
Data Collection
Data were collected during February 1993. Graders were each provided a
copy of the manuscript, the accompanying reviews and grade sheets (with
the attributes "timeliness" and "grade sheet" precoded), and
the scoring guide. Manuscripts and accompanying reviews were shuffled
before distribution, and reviews were evaluated in a different order by
each grader.
Data Analysis
Interrater reliability was evaluated in two fashions: (1) as the level
of agreement among graders' mean scores via analysis of variance and
(2) via the intraclass correlation coefficient.[21]
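As an illustration of the second approach, the following is a minimal sketch of one intraclass correlation form described by Shrout and Fleiss,[21] the two-way random-effects, single-rater coefficient ICC(2,1). Which form was used in this study is not stated here, so the choice is an assumption, and the function name `icc_2_1` is hypothetical. The ratings passed in the usage note are a small illustrative matrix, not the study data.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per review (target), with one
    column per grader. Every grader is assumed to have scored every review.
    """
    n = len(ratings)        # number of reviews
    k = len(ratings[0])     # number of graders
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for the two-way layout without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)                 # between-reviews mean square
    jms = ss_cols / (k - 1)                 # between-graders mean square
    ems = ss_error / ((n - 1) * (k - 1))    # residual mean square

    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Illustrative 6 x 4 matrix (6 targets, 4 raters); NOT data from this study.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
example_icc = icc_2_1(example)
```

For the illustrative matrix the coefficient is about 0.29, far below the .84 reported for the instrument; large systematic differences between raters (column means) pull this form of the coefficient down, which is why it is a stricter test than a simple correlation.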
Validity was examined from qualitative and quantitative
perspectives. Detailed content validation was conducted as described
above. Preliminary criterion-related validity was assessed by comparing
rankings derived from an editor's (D.P.) grades obtained with the
new instrument to those of the editor in chief obtained with an
earlier version. Kendall's coefficient, a nonparametric test of rank
ordering concordance, was calculated. Absent a superior criterion or
valid external measure of review quality, the editor in chief's
ranking was considered to be the best available item of data to hold as
a preliminary criterion standard.
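A minimal sketch of Kendall's coefficient of concordance follows, assuming complete rankings without ties; the function names are hypothetical, and the chi-square approximation shown is one common way to attach a P value, not necessarily the one used in this study.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for untied rankings.

    `rankings` holds one list per rater; each list assigns the ranks
    1..n to the same n items, in the same item order.
    """
    m = len(rankings)       # number of raters
    n = len(rankings[0])    # number of items ranked
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    expected = m * (n + 1) / 2          # mean rank sum under no agreement
    s = sum((rs - expected) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def concordance_chi_square(w, m, n):
    """Large-sample chi-square statistic for W, on n - 1 degrees of freedom."""
    return m * (n - 1) * w


# Two hypothetical raters ranking five hypothetical reviews.
agree = kendalls_w([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]])     # perfect agreement
oppose = kendalls_w([[1, 2, 3, 4, 5], [5, 4, 3, 2, 1]])    # reversed order

# Statistic implied by W = .94 with 2 raters and 12 reviews (assumed inputs).
chi_sq = concordance_chi_square(0.94, 2, 12)
```

Under those assumptions, two raters and 12 reviews with a coefficient of .94 give a chi-square of about 20.7 on 11 degrees of freedom, which exceeds the .05 critical value of 19.68.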
All analyses were performed with a microcomputer-based statistical
software package (SPSS/PC+, SPSS Inc, Chicago, Ill). The primary
dependent variable for the quantitative analyses was the total score
(TOTAL), although statistical conclusions are identical for all
relevant subscores and transformations.
RESULTS
Interrater reliability was demonstrated on the basis of an intraclass
correlation coefficient of .84 (P<.001) and the lack of a
statistically significant difference across graders' mean scores by
analysis of variance. Data for TOTAL were as follows (values are group
mean [SD]): 7.80 (2.50), 7.80 (2.27), and 7.22 (2.53) (F=.787,
P=.457).
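The F ratio can be approximated from the summary statistics alone. The sketch below assumes that each of the three graders scored all 41 reviews in the reliability subsample and treats the graders as independent groups in a one-way layout; the study's actual analysis of variance design is not described in that detail, so this is a rough check rather than a reconstruction. The function name `one_way_f` is hypothetical.

```python
def one_way_f(means, sds, n_per_group):
    """F ratio for a balanced one-way ANOVA from group means and SDs."""
    k = len(means)
    grand_mean = sum(means) / k     # valid because group sizes are equal
    ms_between = n_per_group * sum((m - grand_mean) ** 2 for m in means) / (k - 1)
    ms_within = sum(sd ** 2 for sd in sds) / k   # pooled variance, equal n
    return ms_between / ms_within


# Reported group means and SDs for TOTAL; n = 41 per grader is an assumption.
f_stat = one_way_f([7.80, 7.80, 7.22], [2.50, 2.27, 2.53], 41)
```

With these inputs the ratio is roughly 0.77 on 2 and 120 degrees of freedom, close to the reported F of .787; the small gap could reflect rounding of the published means and SDs or a different analysis of variance design.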
The instrument was intentionally developed to maximize content validity
("face validity" or "logical validity")[7] through
the content validation process. The quantitative assessment of
preliminary criterion-related validity indicated a significant level of
rank ordering concordance (Kendall's coefficient of concordance=.94;
P=.038).
COMMENT
A manuscript review grading instrument was developed to help
JVIR editors evaluate the quality of peer reviews. The purpose
of our study was to examine the interrater reliability, content
validity, and preliminary criterion-related validity of the instrument.
The JVIR publishes original clinical and laboratory research
in addition to technical notes, case reports, subject reviews, works in
progress, and special-format articles. The reviewer's instructions
accompanying each manuscript explicitly describe the criteria
to be followed for each type of report. For simplicity and
practicality, we developed an instrument that could be applied in the
evaluation of any manuscript review. Our 14-point grading system can be
used to evaluate review content and format via seven individually
scored attributes (Table). While precise distinctions cannot be
made, those attributes emphasizing review content include the substance
and thoroughness of the section-by-section narrative, use of supporting
references, the narrative summary and recommendation, and any new
insights/perspectives offered. Timeliness and etiquette are primarily
format related. Completion of the grade sheet addresses both dimensions
in that reviewers are requested to evaluate scientific merit, practical
value, and overall manuscript quality in a standardized fashion (via a
rating scale).
The grading instrument itself does not apply a uniform rating scale
across all attributes but rather awards varying point values according
to defined criteria. Scoring for each attribute thereby defines its
relative contribution to the total score. Decisions regarding attribute
weighting were made from a practical standpoint in consideration of the
instrument's intended application. For instance, timeliness and
content-thoroughness of the section-by-section review each contribute
up to three points (21% each of total score). While the importance of
the latter is intuitive, timeliness was considered to be of comparable
value, as inordinately late reviews sometimes necessitate rendering
editorial dispositions with only two of three reviews in hand.
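The weighting can be checked with simple arithmetic. The sketch below uses the maximum point values listed in the abstract; the function names, and the assumption that each attribute is scored from zero up to its maximum, are illustrative only and do not come from the scoring guide.

```python
# Maximum points per attribute, as listed in the abstract.
MAX_POINTS = {
    "timeliness": 3,
    "grade sheet": 1,
    "etiquette": 1,
    "sectional narratives": 3,
    "citations": 2,
    "narrative summary": 2,
    "insights": 2,
}

TOTAL_POINTS = sum(MAX_POINTS.values())


def percent_contribution(attribute):
    """Percentage of the maximum total score carried by one attribute."""
    return 100.0 * MAX_POINTS[attribute] / TOTAL_POINTS


def total_score(scores):
    """Sum awarded points after checking each value against its maximum.

    `scores` maps attribute names to awarded points. A lower bound of 0
    is an assumption; the scoring guide itself is not reproduced here.
    """
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name}: {value} is outside 0..{MAX_POINTS[name]}")
    return sum(scores.values())


weights = {name: round(percent_contribution(name)) for name in MAX_POINTS}
```

The maximum points sum to 14, and the rounded shares reproduce the 21%, 14%, and 7% contributions quoted for the three-point, two-point, and one-point attributes.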
Instrument development entails establishing reliability and
validity, with the latter being impossible absent the former. Both
characteristics may be defined (and demonstrated) in several fashions.
In our case, the instrument is intended to be used by more than one
individual (editors), so interrater reliability was essential.
Reliability was demonstrated via the absence of a statistically
significant disagreement among graders' mean scores (P=.46)
and the presence of a statistically significant intraclass correlation
(intraclass correlation coefficient=.84; P<.001). Content
validation, although subjective by definition, was conducted
systematically and provided an essential first step toward overall
instrument validation. The test of preliminary criterion-related
validity, although statistically significant, is not definitive because
a gold standard criterion for review quality is lacking. If an
external, previously validated criterion were identified, additional
validation could be accomplished. Definitive criterion-related
validation could lead to the identification of score ranges that
discriminate among reviewers: those who are generally helpful (good or
outstanding) and those at both extremes (outstanding and poor).
After an experienced editor has studied a manuscript and its
accompanying reviews, use of the grading instrument requires only about
1 minute per review. Scores are entered into the JVIR reviewer
and manuscript tracking system and are currently reported in a
preliminary fashion. A personalized letter/report outlining reviewers'
strengths and weaknesses has been developed in an effort to provide
feedback and to improve the peer review process. After further
validation, it is hoped that the grading system can be of assistance
when revising the reviewer list and making merit-based promotions to
the editorial board.
From the Journal of Vascular and Interventional Radiology,
Nashville, Tenn (Ms Feurer); the Journal of Vascular and
Interventional Radiology Editorial Office and the Miami Vascular
Institute, Baptist Hospital of Miami, Miami, Fla (Dr Becker and Ms
Ramirez); and the Mallinckrodt Institute of Radiology, Washington
University School of Medicine, St Louis, Mo (Drs Picus, Darcy, and
Hicks).
Presented at the Second International Congress on Peer Review in
Biomedical Publication, Chicago, Ill, September 9, 1993.
Reprint requests to Journal of Vascular and Interventional
Radiology, Miami Vascular Institute, Baptist Hospital of Miami,
8900 N Kendall Dr, Miami, FL 33176 (Dr Becker).
References
1. Gallagher EB, Ferrante J. Agreement among peer reviewers
for a middle-sized biomedical journal. In: Peer Review in
Scientific Publishing: Papers From the First International Congress on
Peer Review in Biomedical Publication. Chicago, Ill: Council of
Biology Editors Inc; 1991:153-158.
2. Rennie D. Preface. In: Peer Review in Scientific
Publishing: Papers From the First International Congress on Peer Review
in Biomedical Publication. Chicago, Ill: Council of Biology Editors
Inc; 1991:1-3.
3. Sharp DW. What can and should be done to reduce
publication bias? the perspective of an editor. JAMA.
1990;263:1390-1391.
4. Chalmers TC, Frank CS, Reitman D. Minimizing the three
stages of publication bias. JAMA. 1990;263:1392-1395.
5. Garfield E, Welljams-Dorof A. The impact of
fraudulent research on the scientific literature: the Steven E.
Breuning case. JAMA. 1990;263:1424-1426.
6. Korn D. Scientific integrity and scientific misconduct:
the interface between research institutions and journals. In: Peer
Review in Scientific Publishing: Papers From the First International
Congress on Peer Review in Biomedical Publication. Chicago, Ill:
Council of Biology Editors Inc; 1991:205-212.
7. Allen MJ, Yen WM. Introduction to Measurement
Theory. Monterey, Calif: Brooks/Cole Publishing Co; 1979.
8. Strayhorn J Jr, McDermott JF Jr, Tanguay P. An
intervention to improve the reliability of manuscript reviews for the
Journal of the American Academy of Child and Adolescent
Psychiatry. Am J Psychiatry. 1993;150:947-952.
9. Peters DP, Ceci SJ. Peer-review practices of
psychological journals: the fate of unpublished articles, submitted
again. Behav Brain Sci. 1982;5:187-255.
10. Cicchetti DV. Reliability of reviews for the
American Psychologist: a biostatistical assessment of the
data. Am Psychol. 1980;35:300-303.
11. Garfunkel JM, Ulshen MH, Hamrick HJ, Lawson EE. Problems
identified by secondary review of accepted manuscripts. JAMA.
1990;263:1369-1371.
12. McNutt RA, Evans AT, Fletcher RH, Fletcher SW. The
effects of blinding on the quality of peer review: a randomized trial.
JAMA. 1990;263:1371-1376.
13. Garfunkel JM, Lawson EE, Hamrick HJ, Ulshen MH. Effect
of acceptance or rejection on the author's evaluation of peer review
of medical manuscripts. JAMA. 1990;263:1376-1378.
14. Chalmers I, Adams M, Dickersin K, et al. A cohort study
of summary reports of controlled trials. JAMA.
1990;263:1401-1405.
15. McNabe SM. The effects of peer review on two groups of
Dutch doctoral candidates. In: Peer Review in Scientific
Publishing: Papers From the First International Congress on Peer Review
in Biomedical Publication. Chicago, Ill: Council of Biology Editors
Inc; 1991:159-163.
16. Solberg LI. Does the quality of manuscript preparation
affect editorial and reviewer judgments? In: Peer Review in
Scientific Publishing: Papers From the First International Congress on
Peer Review in Biomedical Publication. Chicago, Ill: Council of
Biology Editors Inc; 1991:164-168.
17. Hargens LL. Variation in journal peer review systems:
possible causes and consequences. JAMA. 1990;263:1348-1352.
18. Keppel G. Design and Analysis: A Researcher's
Handbook. 2nd ed. Englewood Cliffs, NJ: Prentice Hall International
Inc; 1982.
19. Stevens J. Applied Multivariate Statistics for the
Social Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates
Publishers; 1986.
20. Pearson ES, Hartley HO. Charts of the power function for
analysis of variance tests, derived from the non-central F
distribution. Biometrika. 1951;38:112-130.
21. Shrout PE, Fleiss JL. Intraclass correlations: uses in
assessing rater reliability. Psychol Bull. 1979;86:420-428.