Blinded Peer Review

Multiple Blinded Reviews of the Same Two Manuscripts

Effects of Referee Characteristics and Publication Language

(JAMA. 1994;272:149-151)

Magne Nylenna, MD; Povl Riis, MD; Yngve Karlsson

Objective.--To study the association between referee characteristics and their manuscript assessments, the influence of manuscript language on referees' judgments, and the usefulness, quality, and extent of referees' free-text comments.

Design.--Two nonauthentic, but realistic, short manuscripts with a number of common methodological flaws were sent to 180 Scandinavian referees. Through randomization, each referee received one of the manuscripts in English and the other manuscript in the national language. A structured assessment of the manuscript quality was expressed on a 5-point scale, and the impact of referee characteristics (age, gender, experience, and so on) was analyzed by multiple linear regression.

Main Outcome.--Manuscript quality assessed by referees.

Results.--A total of 312 reviews from 156 referees could be used for the study of referee characteristics and language. With increasing experience, the referees gave lower quality scores (P<.05). A tendency toward stricter assessment with younger age was seen (P<.05). No influence of referees' gender, specialty, or nationality was found. For the test manuscript of the poorest quality, the English version was assessed to be better than the national-language version (P<.05). A total of 159 of 312 reviews included free-text comments applicable for the methodological study. In 54 reviews (34%), no methodological comments accompanied the assessment, and in six reviews the comments were incomplete. Wrong sampling unit was mentioned by one fourth of 80 referees. Only one referee mentioned the incorrect use of a parametric test in the analysis of data whose distribution was nonparametric.

Conclusions.--Experienced and young referees gave a stricter assessment of the manuscripts than their less experienced and older colleagues. An English version seemed to be accepted more easily than a national-language version of the same manuscript. Most referees spontaneously mentioned the shortcomings of the manuscripts only as part of their overall judgment.


PEER reviewing is subjective, and the quality of a manuscript may be assessed differently by different referees. Referees differ in many respects, such as time spent on reviewing and experience as a referee.[1] [2] It is, however, unclear whether these differences affect the assessment of a given manuscript.

To study the variation within a group of referees and to examine the possible association between referee characteristics and manuscript assessment, we presented the same two manuscripts to a group of referees from the three most common medical specialties in Scandinavia (family physicians, internists, and surgeons).

In non-English-speaking countries, scientific manuscripts in the national language are sometimes considered inferior to English-language manuscripts of the same scientific quality.[3] [4] To study the impact of publication language in identical manuscripts, the manuscripts were presented to the referees in an English or a national-language version via randomization. The project also included a study of the usefulness of referees' statements to editors and their ability to detect and comment spontaneously on common flaws and shortcomings.

METHODS

Two nonauthentic, but realistic, short manuscripts of relevance to family physicians, internists, and surgeons were produced. Both manuscripts presented cross-sectional studies comparing patient characteristics with a control group. The manuscripts included a few common methodological flaws known from the international literature, such as wrong sampling unit (using biopsies in different numbers from each patient as units), inadequate choice of statistical methods (parametric vs nonparametric), and inappropriate use of control groups (ie, groups with potential aberrations from normal subjects). The manuscripts also contained suboptimal standards of problem formulation, interpretation, language style, and so on. These qualitative characteristics could, for obvious reasons, only be measured semiquantitatively. For safety reasons, both manuscripts described only pathogenetic aspects of the two conditions to avoid even a hypothetical influence on the poststudy behavior of referees, who were all clinicians.

Manuscript A ("Special Personality Traits in Patients With Alcohol-Related Acute Conditions") described personality traits in 46 patients with known alcohol abuse attending a surgical or medical department.

Manuscript B ("The Concentration of Acetylcholinesterase Inhibitor in the Colonic Mucosal Membrane in Patients With the Irritable Colon Syndrome") described the concentration of the enzyme in rectal biopsies from 92 patients.

Both manuscripts were produced in an English version and a national-language version (in Danish, Norwegian, or Swedish). The authors' names and addresses, names in the text, and the list of references were all blinded.

Each referee received two manuscripts, one in English and the other in the national language. We used a table of random numbers to determine which manuscript they reviewed in which of the two languages. Twenty family physicians, 20 internists, and 20 surgeons in each of the Scandinavian countries (Denmark, Norway, and Sweden) were included (ie, a total of 180 physicians). They were selected among physicians used as referees or regarded as competent referees by the national medical journals. Physicians who refused to participate were replaced consecutively according to a premade list of substitutes for each specialty in each country.
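The allocation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study used a table of random numbers, whereas the sketch substitutes a seeded pseudorandom generator, and the function name `assign_languages` and the referee identifiers are hypothetical.

```python
import random


def assign_languages(referee_ids, seed=None):
    """Decide at random, per referee, which manuscript is read in English.

    The other manuscript is read in the national language, so every referee
    sees both manuscripts and both languages exactly once.
    """
    # Assumption: a seeded generator stands in for the table of random numbers.
    rng = random.Random(seed)
    assignment = {}
    for referee in referee_ids:
        english = rng.choice(["A", "B"])
        national = "B" if english == "A" else "A"
        assignment[referee] = {english: "English", national: "national"}
    return assignment


if __name__ == "__main__":
    # Hypothetical referee identifiers, for illustration only.
    plan = assign_languages(["ref-001", "ref-002", "ref-003"], seed=1994)
    for referee, languages in plan.items():
        print(referee, languages)
```

Because the choice is made independently for each referee, the two language versions of a manuscript are not guaranteed to be read by equal numbers of referees; a blocked scheme would be needed to force exact balance.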

Enclosed with each manuscript was a review form designed for the study in which the referees were asked to give an overall assessment of the manuscript on a scale from 1 (very bad) to 5 (excellent), as well as to score a number of other qualities of content and presentation (Table 1). Free-text comments were given optionally on a separate sheet. The referees received a cover letter and a questionnaire asking them to evaluate the design and usefulness of the review form and to state their age and gender, the time spent on refereeing in general and for each of the actual manuscripts, and their own experience as referees (given as number of reviews performed during the last 12 months). To classify referees on a simple scale, they were divided into three categories: least experienced (fewer than three reviews during the last 12 months), medium experienced (three through nine reviews during the last 12 months), and most experienced (more than nine reviews during the last 12 months).
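The three experience categories amount to a simple threshold rule on the self-reported number of reviews. A minimal sketch follows; the function name `experience_category` is hypothetical, while the cut points are those stated above.

```python
def experience_category(reviews_last_12_months: int) -> str:
    """Map the number of reviews in the last 12 months to an experience category."""
    if reviews_last_12_months < 0:
        raise ValueError("number of reviews cannot be negative")
    if reviews_last_12_months < 3:
        return "least experienced"   # fewer than three reviews
    if reviews_last_12_months <= 9:
        return "medium experienced"  # three through nine reviews
    return "most experienced"        # more than nine reviews


if __name__ == "__main__":
    for count in (0, 3, 9, 10):
        print(count, experience_category(count))
```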

The free-text comments were classified according to a 4-point qualitative scale (4, all major flaws were mentioned; 3, at least one was mentioned; 2, none was mentioned yet the comments were still usable; and 1, comments were of minor importance).

The unit of analysis was each review.

The scores on the 5-point scale were analyzed by both parametric and nonparametric methods. No important discrepancies appeared. These scores are presented as means, though comparison of scores was done by a Mann-Whitney test. Analyses of other differences between groups were performed with 95% confidence intervals (CIs), χ² tests, and Student's t tests. The impact of referee characteristics was analyzed by multiple linear regression.
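To make the nonparametric comparison concrete, the following sketch computes a two-sided Mann-Whitney test for two groups of ordinal scores. It rests on several assumptions: it uses the normal approximation with a correction for ties and no continuity correction, which may differ from the software the authors used; the function name `mann_whitney_u` is hypothetical; and the scores in the usage example are invented for illustration, not study data.

```python
from math import sqrt
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided P value).

    Uses midranks for tied values and the normal approximation with the
    standard tie correction to the variance of U.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values pooled[i:j].
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t**3 - t
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence of a difference
    z = (u_x - mean_u) / sqrt(var_u)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_x, p


if __name__ == "__main__":
    # Invented 5-point scores, for illustration only (not data from the study).
    english = [2, 3, 2, 1, 3, 2, 4, 2]
    national = [1, 2, 2, 1, 2, 3, 1, 2]
    u, p = mann_whitney_u(english, national)
    print(f"U = {u}, two-sided P = {p:.3f}")
```

On a 5-point scale most values are tied, so the tie correction matters; for small groups an exact test would be preferable to the normal approximation.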

RESULTS

Referees

A total of 156 referees (136 men, 17 women, and three with sex not stated; mean age, 52 years; range, 35 to 70 years) completed the review forms for both manuscripts, ie, a total of 312 reviews. Nationalities and specialties are given in Table 2.

The referees had on average reviewed for 4.3 journals, and the mean number of reviews during the last 12 months was 12.9 for men (95% CI, 10.0 to 15.8) and 6.8 for women (95% CI, 3.3 to 10.2). The mean time spent on reviewing was 96 minutes (95% CI, 90 to 103). Surgeons reported significantly shorter time than internists with a mean of 81 minutes (95% CI, 73 to 88) vs 115 minutes (95% CI, 101 to 128), with family physicians in between with a mean of 94 minutes (95% CI, 81 to 106). Women spent a shorter time per manuscript than men with a mean of 77 minutes (95% CI, 63 to 90) vs 99 minutes (95% CI, 91 to 106). The mean time spent on reviewing the two short manuscripts was 37 minutes on manuscript A (95% CI, 32 to 42) and 40 minutes on manuscript B (95% CI, 35 to 45). The time spent decreased with increasing experience (P<.05). In all, 89% of the referees principally preferred a structured form when reviewing, but the number of those finding such forms useful decreased with increasing experience (P<.05).

Referee Characteristics and Quality Assessment

We planned for both manuscripts to be of below-average quality to ensure a sufficient spread of assessments. The distribution of judgments of the total quality of manuscripts reflects such variation (Table 3). The overall quality of manuscript B ("The Irritable Colon Project") was considered higher than that of manuscript A ("The Alcohol and Personality Project") with a mean score of 2.3 on a 5-point scale vs 2.0 (P=.02). For both manuscripts the quality of presentation was considered higher than the scientific quality of content (P<.01). The overall quality of the manuscripts was closer to the quality of content than to the quality of presentation.

No systematic association between the referees' nationality or specialty and their assessments was found.

Bivariate associations were found between the assessment of the quality of the manuscripts and the referee's age, gender, and experience as a referee: younger or more experienced reviewers gave stricter assessments, and men gave lower quality scores than did women. A multiple linear regression analysis was performed with overall quality as the dependent variable and the referee characteristics that turned out to be most important (age, gender, and referee experience) as the independent variables. In this analysis the sex difference disappeared, probably because of differences in experience between men and women. For both manuscripts assessment of scientific quality of content was significantly and independently related to age (P<.05) and experience (P<.01). For manuscript B, age and experience were also significantly associated with quality of presentation and total quality.

The effect of age and experience as a referee is illustrated by the mean score of overall manuscript quality of 3.2 among reviews given by referees older than 60 years who had reviewed fewer than three manuscripts during the last 12 months (n=9), compared with a mean score of 1.8 among referees younger than 50 years who had reviewed 10 or more manuscripts during the last 12 months (n=40) (P<.01).

Publication Language

The assessment of the English version compared with the national-language version of the two manuscripts is shown in Table 1. For manuscript A the overall quality of the English version was considered higher than that of the national-language version with mean scores of 2.2 vs 1.8, respectively (P=.01). No such difference was found for manuscript B. The difference between the overall quality of the English and the national-language versions was greatest among referees with the least experience (2.7 vs 2.3; P=.1; n=42) compared with medium experienced (2.3 vs 2.1; P=not significant; n=59) and most experienced (1.9 vs 1.9; P=not significant; n=53).

Free-Text Comments

This part of the study comprised the optional free-text comments, so the total number of methodological assessments was expected to be lower than the number of referees. A total of 159 reviews were applicable for the methodological study, yet in 54 (34%) of the reviews no detailed methodological comments accompanied an overall assessment, and six contained only incomplete comments. In all, 200 comments on individual manuscripts were usable for analysis, reviewed manuscripts being the units.

On a 4-point qualitative scale (4 meaning that all major flaws were mentioned) treated like an index, the average was 1.7, without any significant differences between specialties, countries, or languages. Wrong sampling unit was mentioned by one fourth of 80 referees, and only one referee spontaneously mentioned the incorrect use of a parametric test in the analysis of nonparametric data.

COMMENT

The study was presented to the referees with emphasis on evaluating a new version of a structured form for referee judgment, and this, as well as the rather poor quality of both manuscripts, may have affected the assessments. More referees might have noticed the methodological errors without mentioning them explicitly, because such errors could be reported only in the optional free-text comments.

Our Scandinavian referees resemble a group of British referees described by Lock and Smith[2] by the strong predominance of men, by the number of journals they serve (4.3 vs 5), and by their workload (on average one manuscript reviewed per month for both groups). The mean time spent per manuscript is also similar to the British study (1.5 hours vs 1.4 hours). The time taken for assessing the two manuscripts presented herein was less than half of this, probably reflecting the short length of the manuscripts but perhaps also a difference between "estimated average time" and "time used."

Only age and experience were significantly associated with strictness of assessment. The stricter assessment with younger age should be seen in relation to the findings by Evans et al[5] that younger reviewers produce the best reviews.

The results partly confirmed our hypothesis that an English version of a manuscript is generally considered more acceptable than a national-language version and that this is not caused by a difference in comprehensibility. Though only the language difference of the total quality of manuscript A was statistically significant at a 5% level, the majority of different aspects of scientific content was assessed to be better in English than in the national-language version for both manuscripts (Table 1). Even in regard to the presentation quality, for which a national-language version would be expected to be most acceptable, the contrary was found, especially for manuscript A. The closing "language gap" with increasing experience seems to indicate a reduction of awe with greater experience as a referee. In the Scandinavian countries, as in many other non-English-language countries, publishing in the mother tongue has become a handicap to physicians with academic ambitions.[3] [4]


From the Journal of the Norwegian Medical Association, Lysaker, Norway (Dr Nylenna); Medical Department C, Herlev (Denmark) University Hospital (Dr Riis); and the Journal of the Swedish Medical Association, Stockholm, Sweden (Mr Karlsson).

Presented at the Second International Congress on Peer Review in Biomedical Publication, Chicago, Ill, September 11, 1993.

Reprint requests to Journal of the Norwegian Medical Association, Fjellveien 5, N-1324 Lysaker, Norway (Dr Nylenna).


References

1. Yankauer A. Who are the peer reviewers and how much do they review? JAMA. 1990;263:1338-1340.

2. Lock S, Smith J. What do peer reviewers do? JAMA. 1990;263:1341-1343.

3. Vandenbroucke JP. On not being born a native speaker of English. BMJ. 1989;298:1461-1462.

4. Bakewell D. Publish in English, or perish? Nature. 1992;356:648.

5. Evans AT, McNutt RA, Fletcher SW, Fletcher RH. Characteristics of peer reviewers who produce good reviews. Presented at the Second International Congress on Peer Review in Biomedical Publication; September 9, 1993; Chicago, Ill.
