Peer Review and Quality Control

Evaluating Peer Reviews

Pilot Testing of a Grading Instrument

(JAMA. 1994;272:98-100)

Irene D. Feurer, MSEd; Gary J. Becker, MD; Daniel Picus, MD; Estella Ramirez; Michael D. Darcy, MD; Marshall E. Hicks, MD

Objective.--To measure the reliability and preliminary validity of a grading instrument for editors to evaluate the quality of peer reviews.

Design.--The consecutive sample design included 53 reviews of 23 manuscripts. Reviews were systematically assigned to interrater reliability (n=41; power greater than 0.90 to detect a difference of greater than one point) and preliminary criterion-related validity (n=12) subsamples. Content validity was closely examined.

Setting.--Nonclinical.

Participants.--Three graders evaluated reliability. One individual examined content validity and two editors tested preliminary criterion-related validity.

Intervention (Instrument).--Attributes reflecting two basic dimensions, review content and format, were identified and scored (values are possible points/percent contribution): timeliness, 3/21%; grade sheet, 1/7%; etiquette, 1/7%; sectional narratives, 3/21%; citations, 2/14%; narrative summary, 2/14%; and insights, 2/14%. A scoring guide was provided.

Main Outcome Measures.--Statistical analyses used to test the interrater reliability of the total score included the intraclass correlation coefficient and analysis of variance, with the expectation that the null hypothesis (no difference among graders' mean scores) would be upheld. Kendall's coefficient of concordance was used to test preliminary criterion-related validity.

Results.--The intraclass correlation coefficient was .84 (P<.001) and a lack of difference between mean scores was demonstrated by analysis of variance (P=.46). Content validity was confirmed and preliminary criterion-related validity was indicated (Kendall's coefficient of concordance=.94, P=.038).

Conclusions.--The instrument is reliable. Content validation has been completed, and further criterion-related validation is warranted.


PEER REVIEW of scholarly manuscripts by qualified outside referees is the cornerstone of the editorial review process. Its primary purpose is to identify those manuscripts that are worthy of publication[1]; its important secondary roles have been well described.[2] [3] [4] [5] [6]

Theoretically, a grading instrument providing data about the quality of individual peer reviews could aid editors by (1) identifying those reviewers who are consistently helpful, outstanding, or weak, timely or late, and exacting or lenient, and those who contribute the most and least to the review process, and (2) providing objective data to use when updating the reviewer list, making merit-based promotions to the editorial board, and providing feedback to reviewers. Reliability and content validity should be demonstrated early in the testing of any such instrument.[7]

Although several studies have focused on the reliability of manuscript reviews,[8] [9] [10] [11] few objective evaluations of review quality have been conducted. Of 35 original reports and posters presented at the First International Congress on Peer Review in Biomedical Publication, only seven directly addressed the testing of reviews.[1,11] [12] [13] [14] [15] [16] Of these, only one involved the editor's assessment of reviews via specific content-related questions.[12]

We examined the reliability, content validity, and preliminary criterion-related validity of an instrument designed for editors of the Journal of Vascular and Interventional Radiology (JVIR) to evaluate manuscript reviews along two basic dimensions: content and format. Content refers to the substance and scientific scrutiny evident in the review, while format addresses practicalities of the peer review and editorial processes; both dimensions affect the quality of the review and the feedback provided to the authors and editor.

METHODS

Background

The study was conducted by individuals involved in the peer review process for JVIR, including the editor in chief (G.J.B.), the deputy editor (D.P.), the editorial assistant (E.R.), and three experienced reviewers (I.D.F., M.D.D., and M.E.H.).

The JVIR employs a parallel refereeing system,[17] with manuscript assignment on the basis of reviewers' areas of expertise, history with respect to preparing rigorous reviews and meeting deadlines, and availability. Reviewers are provided a summary grade sheet and detailed general instructions regarding the scientific elements to be evaluated, review format, and turnaround time, as well as reviewing etiquette and ethics. These elements provided the foundation for instrument development.

Subjects and Sample

Interrater reliability of the instrument was evaluated by a panel of three graders (D.P., M.D.D., and M.E.H.) not involved in its development. (G.J.B. and D.P. participated in the preliminary criterion-related validation.)

The sample was drawn from the pool of manuscripts mailed to reviewers between October 1 and November 30, 1992. The period was arbitrary, with the stipulations that (1) the resultant sample provide sufficient power to evaluate interrater reliability and (2) reviews be sufficiently old to permit maximal score ranges on all attributes (including timeliness). The cross-sectional, consecutive sample design reduced the likelihood of encountering more than one review by a particular individual. Reviews prepared by participants in this study were excluded, and if an individual had prepared two reviews during the target period, only the most recent was included (to avoid a potentially confounded design).

The final sample included 53 reviews of 23 manuscripts. Forty-one reviews of 16 manuscripts and 12 reviews of seven manuscripts were systematically assigned to the reliability and preliminary criterion-validation subsamples, respectively. Subsample formation was employed to avoid the necessity of making multiple comparisons with one grader's (D.P.) observations and resultant inflation of the presumed alpha level.[18]

Because a reliable instrument requires that the null hypothesis be sustained when assessing agreement among graders' scores (ie, that differences would not reach statistical significance), power was a fundamental consideration. Power was estimated before setting the sample size and was confirmed post hoc by standard methods.[18] [19] [20] Power was preliminarily estimated to be greater than 0.85 at the .05 alpha level. Post hoc power calculations were comparable: 0.91 to detect a score difference of at least one point (the intended precision of the instrument) and 0.86 to detect a 10% score difference (less than one point at the observed mean).

Instrument

Seven variables addressing both scientific content and review format were identified, and their relative contributions (scoring) were refined (Table) during the content validation process.

Content validation, which involves the subjective "rational analysis" of an instrument's structure[7] (an essential first step in validation), was conducted in conjunction with the editor in chief by one of us (I.D.F.) who had not participated in the instrument's original development. This process, accomplished through several iterations, entailed (1) examination of the instructions to reviewers, (2) identification of instrument attributes, and (3) inspection of each attribute's weight as defined by its scoring. An explicit scoring guide was developed to provide clear guidance to individuals applying the instrument and to avoid scoring bias.

Data Collection

Data were collected during February 1993. Graders were each provided a copy of the manuscript, the accompanying reviews and grade sheets (with the attributes "timeliness" and "grade sheet" precoded), and the scoring guide. Manuscripts and accompanying reviews were shuffled before distribution, and reviews were evaluated in a different order by each grader.

Data Analysis

Interrater reliability was evaluated in two fashions: (1) as the level of agreement among graders' mean scores via analysis of variance and (2) via the intraclass correlation coefficient.[21] Validity was examined from qualitative and quantitative perspectives. Detailed content validation was conducted as described above. Preliminary criterion-related validity was assessed by comparing rankings derived from an editor's (D.P.) grades obtained with the new instrument to those of the editor in chief obtained with an earlier version. Kendall's coefficient of concordance, a nonparametric measure of agreement between rank orderings, was calculated. Absent a superior criterion or valid external measure of review quality, the editor in chief's ranking was considered the best available preliminary criterion standard.
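
The article does not state which of the intraclass correlation forms described by Shrout and Fleiss[21] was computed. As a minimal sketch only, the Python code below assumes a complete reviews-by-graders table of total scores and computes the two-way random-effects, single-rater form, often written ICC(2,1), from the analysis-of-variance mean squares. The function names and the example scores are hypothetical; they are not the study data, and this is not the procedure run in the authors' statistical package.

```python
def two_way_mean_squares(ratings):
    """Return (MSR, MSC, MSE) for a complete table of scores.

    ratings: list of rows, one row per review, one score per grader.
    MSR = between-reviews mean square, MSC = between-graders mean square,
    MSE = residual mean square.
    """
    n = len(ratings)        # number of reviews (rows)
    k = len(ratings[0])     # number of graders (columns)
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with >= 2 reviews and >= 2 graders")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_error / ((n - 1) * (k - 1))
    return msr, msc, mse


def icc_2_1(ratings):
    """Two-way random-effects, single-rater intraclass correlation, ICC(2,1)."""
    n, k = len(ratings), len(ratings[0])
    msr, msc, mse = two_way_mean_squares(ratings)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical total scores (0 to 14) for five reviews from three graders.
example_scores = [
    [9, 10, 9],
    [5, 6, 4],
    [12, 11, 12],
    [7, 7, 6],
    [3, 4, 4],
]
print(round(icc_2_1(example_scores), 2))
```

In this layout the ratio MSC/MSE is the F statistic for a grader effect in a two-way design; whether the reported analysis of variance used that design or a one-way comparison of graders is not specified in the text.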

All analyses were performed with a microcomputer-based statistical software package (SPSS/PC+, SPSS Inc, Chicago, Ill). The primary dependent variable for the quantitative analyses was the total score (TOTAL), although statistical conclusions are identical for all relevant subscores and transformations.

RESULTS

Interrater reliability was demonstrated on the basis of an intraclass correlation coefficient of .84 (P<.001) and the lack of a statistically significant difference across graders' mean scores by analysis of variance. Data for TOTAL were as follows (values are group mean [SD]): 7.80 (2.50), 7.80 (2.27), and 7.22 (2.53) (F=.787, P=.457).

The instrument was intentionally developed to maximize content validity ("face validity" or "logical validity")[7] through the content validation process. The quantitative assessment of preliminary criterion-related validity indicated a significant level of rank ordering concordance (Kendall's coefficient of concordance=.94; P=.038).
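
Kendall's coefficient of concordance, W, ranges from 0 (no agreement among rankings) to 1 (identical rankings). A minimal Python sketch of the calculation follows. It assumes that each editor's grades are first converted to ranks, with tied grades sharing their average rank, and it applies the standard tie-corrected formula. Whether the original analysis applied a tie correction is not stated, and the function names and example grades are hypothetical rather than study data.

```python
def average_ranks(scores):
    """Rank scores from 1 (lowest) upward; tied scores share their mean rank.

    Returns (ranks, tie_term), where tie_term is the sum of t**3 - t
    over each group of t tied scores.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    tie_term = 0
    i = 0
    while i < len(order):
        j = i
        # Extend j to the last position holding the same score as position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    return ranks, tie_term


def kendalls_w(score_sets):
    """Tie-corrected Kendall's W for m raters scoring the same n items.

    W = 12 * S / (m**2 * (n**3 - n) - m * T), where S is the sum of squared
    deviations of the item rank sums from their mean and T is the total
    tie term across raters.
    """
    m = len(score_sets)
    n = len(score_sets[0])
    if m < 2 or n < 2 or any(len(s) != n for s in score_sets):
        raise ValueError("need >= 2 raters scoring the same >= 2 items")

    ranked = [average_ranks(s) for s in score_sets]
    rank_sums = [sum(ranks[i] for ranks, _ in ranked) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((rs - mean_rank_sum) ** 2 for rs in rank_sums)
    total_ties = sum(tie_term for _, tie_term in ranked)

    denominator = m ** 2 * (n ** 3 - n) - m * total_ties
    if denominator == 0:
        raise ValueError("W is undefined when every rater ties all items")
    return 12 * s / denominator


# Hypothetical grades given to the same six reviews by two editors.
editor_a = [12, 9, 7, 11, 4, 8]
editor_b = [11, 10, 6, 12, 5, 7]
print(round(kendalls_w([editor_a, editor_b]), 2))
```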

COMMENT

A manuscript review grading instrument was developed to help JVIR editors evaluate the quality of peer reviews. The purpose of our study was to examine the interrater reliability, content validity, and preliminary criterion-related validity of the instrument.

The JVIR publishes original clinical and laboratory research in addition to technical notes, case reports, subject reviews, works in progress, and special-format articles. The reviewer's instructions accompanying each manuscript explicitly describe the criteria to be followed for each type of report. For simplicity and practicality, we developed an instrument that could be applied in the evaluation of any manuscript review. Our 14-point grading system can be used to evaluate review content and format via seven individually scored attributes (Table). While precise distinctions cannot be made, those attributes emphasizing review content include the substance and thoroughness of the section-by-section narrative, use of supporting references, the narrative summary and recommendation, and any new insights/perspectives offered. Timeliness and etiquette are primarily format related. Completion of the grade sheet addresses both dimensions in that reviewers are requested to evaluate scientific merit, practical value, and overall manuscript quality in a standardized fashion (via a rating scale).

The grading instrument itself does not apply a uniform rating scale across all attributes but rather awards varying point values according to defined criteria. Scoring for each attribute thereby defines its relative contribution to the total score. Decisions regarding attribute weighting were made from a practical standpoint in consideration of the instrument's intended application. For instance, timeliness and content-thoroughness of the section-by-section review each contribute up to three points (21% each of total score). While the importance of the latter is intuitive, timeliness was considered to be of comparable value, as inordinately late reviews sometimes necessitate rendering editorial dispositions with only two of three reviews in hand.
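
The weighting just described can be illustrated with a short Python sketch. The maximum point values are those reported in the abstract. The attribute keys, the function names, and the assumption that each attribute receives a score between zero and its maximum are illustrative only, since the scoring guide itself is not reproduced here.

```python
# Maximum points per attribute, as reported in the abstract.
# The dictionary keys are hypothetical labels chosen for this sketch.
MAX_POINTS = {
    "timeliness": 3,
    "grade_sheet": 1,
    "etiquette": 1,
    "sectional_narratives": 3,
    "citations": 2,
    "narrative_summary": 2,
    "insights": 2,
}

TOTAL_POINTS = sum(MAX_POINTS.values())


def percent_contribution(attribute):
    """Share of the maximum total score carried by one attribute, in percent."""
    return round(100 * MAX_POINTS[attribute] / TOTAL_POINTS)


def total_score(scores):
    """Sum one review's attribute scores after checking them against the maxima.

    Assumes (hypothetically) that every attribute must be scored and that a
    valid score lies between 0 and the attribute's maximum.
    """
    missing = set(MAX_POINTS) - set(scores)
    if missing:
        raise ValueError(f"missing attributes: {sorted(missing)}")
    for name, value in scores.items():
        if name not in MAX_POINTS:
            raise ValueError(f"unknown attribute: {name}")
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    return sum(scores.values())


# A hypothetical review: on time and thorough, but with no supporting citations.
example_review = {
    "timeliness": 3,
    "grade_sheet": 1,
    "etiquette": 1,
    "sectional_narratives": 3,
    "citations": 0,
    "narrative_summary": 2,
    "insights": 1,
}
print(total_score(example_review), "of", TOTAL_POINTS)
```

Because the weights are expressed only through the maximum points, changing an attribute's relative contribution in this sketch means editing a single entry in `MAX_POINTS`.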

Instrument development always entails establishing reliability and validity, with the latter being impossible absent the former. Both characteristics may be defined (and demonstrated) in several fashions. In our case, the instrument is intended to be used by more than one individual (editors), so interrater reliability was essential. Reliability was demonstrated via the absence of a statistically significant disagreement among graders' mean scores (P=.46) and the presence of a statistically significant intraclass correlation (intraclass correlation coefficient=.84; P<.001). Content validation, although subjective by definition, was conducted systematically and provided an essential first step toward overall instrument validation. The test of preliminary criterion-related validity, although statistically significant, is not definitive because a gold standard criterion for review quality is lacking. If an external, previously validated criterion were identified, additional validation could be accomplished. Definitive criterion-related validation could lead to the identification of score ranges that discriminate among reviewers: those who are generally helpful (good or outstanding) and those at both extremes (outstanding and poor).

After an experienced editor has studied a manuscript and its accompanying reviews, use of the grading instrument requires only about 1 minute per review. Scores are entered into the JVIR reviewer and manuscript tracking system and are currently reported in a preliminary fashion. A personalized letter/report outlining reviewers' strengths and weaknesses has been developed in an effort to provide feedback and to improve the peer review process. After further validation, it is hoped that the grading system can be of assistance when revising the reviewer list and making merit-based promotions to the editorial board.


From the Journal of Vascular and Interventional Radiology, Nashville, Tenn (Ms Feurer); Journal of Vascular and Interventional Radiology Editorial Office and the Miami Vascular Institute, Baptist Hospital of Miami, Miami, Fla (Dr Becker and Ms Ramirez); the Mallinckrodt Institute of Radiology, Washington University School of Medicine, St Louis, Mo (Drs Picus, Darcy, and Hicks).

Presented at the Second International Congress on Peer Review in Biomedical Publication, Chicago, Ill, September 9, 1993.

Reprint requests to Journal of Vascular and Interventional Radiology, Miami Vascular Institute, Baptist Hospital of Miami, 8900 N Kendall Dr, Miami, FL 33176 (Dr Becker).


References

1. Gallagher EB, Ferrante J. Agreement among peer reviewers for a middle-sized biomedical journal. In: Peer Review in Scientific Publishing: Papers From the First International Congress on Peer Review in Biomedical Publication. Chicago, Ill: Council of Biology Editors Inc; 1991:153-158.

2. Rennie D. Preface. In: Peer Review in Scientific Publishing: Papers From the First International Congress on Peer Review in Biomedical Publication. Chicago, Ill: Council of Biology Editors Inc; 1991:1-3.

3. Sharp DW. What can and should be done to reduce publication bias? the perspective of an editor. JAMA. 1990;263:1390-1391.

4. Chalmers TC, Frank CS, Reitman D. Minimizing the three stages of publication bias. JAMA. 1990;263:1392-1395.

5. Garfield E, Welljams-Dorof A. The impact of fraudulent research on the scientific literature: the Steven E. Breuning case. JAMA. 1990;263:1424-1426.

6. Korn D. Scientific integrity and scientific misconduct: the interface between research institutions and journals. In: Peer Review in Scientific Publishing: Papers From the First International Congress on Peer Review in Biomedical Publication. Chicago, Ill: Council of Biology Editors Inc; 1991:205-212.

7. Allen MJ, Yen WM. Introduction to Measurement Theory. Monterey, Calif: Brooks/Cole Publishing Co; 1979.

8. Strayhorn J Jr, McDermott JF Jr, Tanguay P. An intervention to improve the reliability of manuscript reviews for the Journal of the American Academy of Child and Adolescent Psychiatry. Am J Psychiatry. 1993;150:947-952.

9. Peters DP, Ceci SJ. Peer-review practices of psychological journals: the fate of unpublished articles, submitted again. Behav Brain Sci. 1982;5:187-255.

10. Cicchetti DV. Reliability of reviews for the American Psychologist: a biostatistical assessment of the data. Am Psychol. 1980;35:300-303.

11. Garfunkel JM, Ulshen MH, Hamrick HJ, Lawson EE. Problems identified by secondary review of accepted manuscripts. JAMA. 1990;263:1369-1371.

12. McNutt RA, Evans AT, Fletcher RH, Fletcher SW. The effects of blinding on the quality of peer review: a randomized trial. JAMA. 1990;263:1371-1376.

13. Garfunkel JM, Lawson EE, Hamrick HJ, Ulshen MH. Effect of acceptance or rejection on the author's evaluation of peer review of medical manuscripts. JAMA. 1990;263:1376-1378.

14. Chalmers I, Adams M, Dickersin K, et al. A cohort study of summary reports of controlled trials. JAMA. 1990;263:1401-1405.

15. McNabe SM. The effects of peer review on two groups of Dutch doctoral candidates. In: Peer Review in Scientific Publishing: Papers From the First International Congress on Peer Review in Biomedical Publication. Chicago, Ill: Council of Biology Editors Inc; 1991:159-163.

16. Solberg LI. Does the quality of manuscript preparation affect editorial and reviewer judgments? In: Peer Review in Scientific Publishing: Papers From the First International Congress on Peer Review in Biomedical Publication. Chicago, Ill: Council of Biology Editors Inc; 1991:164-168.

17. Hargens LL. Variation in journal peer review systems: possible causes and consequences. JAMA. 1990;263:1348-1352.

18. Keppel G. Design and Analysis: A Researcher's Handbook. 2nd ed. Englewood Cliffs, NJ: Prentice Hall International Inc; 1982.

19. Stevens J. Applied Multivariate Statistics for the Social Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates Publishers; 1986.

20. Pearson ES, Hartley HO. Charts of the power function for analysis of variance tests, derived from the non-central F distribution. Biometrika. 1951;38:112-130.

21. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86:420-428.
