Multiple Blinded Reviews of the Same Two Manuscripts
Effects of Referee Characteristics and Publication Language
(JAMA. 1994;272:149-151)
Magne Nylenna, MD; Povl Riis, MD; Yngve Karlsson
Objective.--To study the association between referee
characteristics and their manuscript assessments, the influence of
manuscript language on referees' judgments, and the usefulness,
quality, and extent of referees' free-text comments.
Design.--Two nonauthentic, but realistic, short manuscripts
with a number of common methodological flaws were sent to 180
Scandinavian referees. Through randomization, each referee received one
of the manuscripts in English and the other manuscript in the national
language. A structured assessment of the manuscript quality was
expressed on a 5-point scale, and the impact of referee characteristics
(age, gender, experience, and so on) was analyzed by multiple linear
regression.
Main Outcome.--Manuscript quality assessed by referees.
Results.--A total of 312 reviews from 156 referees could be
used for the study of referee characteristics and language. With
increasing experience, the referees gave lower quality scores
(P<.05). A tendency toward stricter assessment with younger
age was seen (P<.05). No influence of referees' gender,
specialty, or nationality was found. For the test manuscript of the
poorest quality, the English version was assessed to be better than the
national-language version (P<.05). A total of 159 of 312
reviews included free-text comments applicable for the methodological
study. In 54 reviews (34%), no methodological comments accompanied the
assessment, and in six reviews they were only incomplete. Wrong
sampling unit was mentioned by one fourth of 80 referees. Only one
referee mentioned the incorrect use of a parametric test in the
analysis of data that called for a nonparametric test.
Conclusions.--Experienced and young referees gave a stricter
assessment of the manuscripts than their less experienced and older
colleagues. An English version seemed to be accepted more easily than a
national-language version of the same manuscript. Most referees
spontaneously mentioned the shortcomings of the manuscripts only as
part of their overall judgment.
PEER reviewing is subjective, and the quality of a manuscript
may be assessed differently by different referees. Referees differ in
many respects, such as time spent on reviewing and experience as a
referee.[1,2]
It is, however, unclear whether these differences affect the assessment of a
given manuscript.
To study the variation within a group of referees and to examine the
possible association between referee characteristics and manuscript
assessment, we presented the same two manuscripts to a group of
referees from the three most frequent medical specialties in
Scandinavia (family physicians, internists, and surgeons).
In non-English-speaking countries, scientific manuscripts in the
national language are sometimes considered inferior to English-language
manuscripts of the same scientific quality.[3,4] To study the
impact of publication language in identical manuscripts, the
manuscripts were presented to the referees in an English or a
national-language version via randomization. The project also included
a study of the usefulness of referees' statements to editors and their
ability to detect and comment spontaneously on common flaws and
shortcomings.
METHODS
Two nonauthentic, but realistic, short manuscripts of relevance to
family physicians, internists, and surgeons were produced. Both
manuscripts presented cross-sectional studies comparing patient
characteristics with a control group. The manuscripts included a few
common methodological flaws known from the international literature, such as
wrong sampling unit (using biopsies in different numbers from each
patient as units), inadequate choice of statistical methods (parametric
vs nonparametric), and inappropriate use of control groups (ie, groups
with potential aberrations from normal subjects). The manuscripts
also contained suboptimal standards of problem formulation,
interpretation, language style, and so on. These qualitative
characteristics could, for obvious reasons, only be measured
semiquantitatively. For safety reasons, both manuscripts only described
pathogenetic aspects of the two conditions to avoid even a hypothetical
influence on the poststudy behavior of referees, who were all
clinicians.
Manuscript A ("Special Personality Traits in Patients With
Alcohol-Related Acute Conditions") described personality traits in 46
patients with known alcohol abuse attending a surgical or medical
department.
Manuscript B ("The Concentration of Acetylcholinesterase Inhibitor in
the Colonic Mucosal Membrane in Patients With the Irritable Colon
Syndrome") described the concentration of the enzyme in rectal
biopsies from 92 patients.
Both manuscripts were produced in an English version and a
national-language version (in Danish, Norwegian, or Swedish). The
authors' names and addresses, names in the text, and the list of
references were all blinded.
Each referee received two manuscripts, one in English and the other in
the national language. We used a table of random numbers to determine
which manuscript they reviewed in which of the two languages. Twenty
family physicians, 20 internists, and 20 surgeons in each of the
Scandinavian countries (Denmark, Norway, and Sweden) were included (ie,
a total of 180 physicians). They were selected among physicians used as
referees or regarded as competent referees by the national medical
journals. Physicians who refused to participate were replaced
consecutively according to a premade list of substitutes for each
specialty in each country.
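The allocation described above can be sketched in code. This is only an illustration: the study used a printed table of random numbers, whereas the sketch swaps in a seeded pseudorandom generator, and the referee identifiers, the seed, and the function name are hypothetical.

```python
import random

MANUSCRIPTS = ("A", "B")


def allocate_languages(referee_ids, seed=1993):
    """Randomly decide, per referee, which manuscript is sent in English.

    The other manuscript is sent in the referee's national language.
    The seed is a hypothetical value used only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    allocation = {}
    for referee in referee_ids:
        english = rng.choice(MANUSCRIPTS)
        national = "B" if english == "A" else "A"
        allocation[referee] = {"english": english, "national": national}
    return allocation


# 3 countries x 3 specialties x 20 referees = 180 referees, as in the design.
# The identifier format is an assumption made for this sketch.
referees = [
    f"{country}-{specialty}-{n:02d}"
    for country in ("DK", "NO", "SE")
    for specialty in ("family", "internal", "surgery")
    for n in range(1, 21)
]
allocation = allocate_languages(referees)
print(len(allocation))  # 180
```

Each referee ends up with exactly one manuscript per language, so language and manuscript are never confounded within a referee.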
Enclosed with each manuscript was a review form designed for the study
in which the referees were asked to give an overall consideration of
the manuscript on a scale from 1 (very bad) to 5 (excellent), as well
as to score a number of other qualities of content and presentation
(Table 1). Free-text comments were given optionally
on a separate sheet. The referees received a cover letter and a questionnaire
asking them to evaluate the design and usefulness of the review form
and to state their age and gender, the time spent on refereeing in
general and for each of the actual manuscripts, and their own
experience as referees (given as number of reviews performed during the
last 12 months). To classify referees on a simple scale, they were
divided into three categories: least experienced (fewer than three
reviews during the last 12 months), medium experienced (three through
nine reviews during the last 12 months), and most experienced (more
than nine reviews during the last 12 months).
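The three experience categories translate directly into a small classification rule. The sketch below uses the cut-offs stated above; the function name is hypothetical.

```python
def experience_category(reviews_last_12_months: int) -> str:
    """Classify a referee by the number of reviews done in the last 12 months."""
    if reviews_last_12_months < 0:
        raise ValueError("number of reviews cannot be negative")
    if reviews_last_12_months < 3:
        return "least experienced"   # fewer than three reviews
    if reviews_last_12_months <= 9:
        return "medium experienced"  # three through nine reviews
    return "most experienced"        # more than nine reviews


print(experience_category(2))   # least experienced
print(experience_category(9))   # medium experienced
print(experience_category(10))  # most experienced
```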
The free-text comments were classified according to a 4-point
qualitative scale (4, all major flaws were mentioned; 3, at least one
was mentioned; 2, none was mentioned yet the comments were still
usable; and 1, comments were of minor importance).
The unit of analysis was each review.
The scores on the 5-point scale were analyzed by both parametric and
nonparametric methods. No important discrepancies appeared. These
scores are presented as means, although scores were compared with the
Mann-Whitney test. Other differences between groups were analyzed
with 95% confidence intervals (CIs), chi-square tests,
and Student's t tests. The impact of referee characteristics
was analyzed by multiple linear regression.
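The rank-based comparison named above can be sketched as follows. This is a minimal illustration of the Mann-Whitney U statistic with midranks for ties, not the authors' actual computation; the function names and the two short score lists are hypothetical and are not study data.

```python
def midranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U_x, U_y) for two independent samples of ordinal scores."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[: len(x)])
    u_x = rank_sum_x - len(x) * (len(x) + 1) / 2
    u_y = len(x) * len(y) - u_x
    return u_x, u_y


# Hypothetical 5-point quality scores for two groups of reviews.
english_scores = [2, 3, 2, 1, 3]
national_scores = [1, 2, 1, 2, 1]
print(mann_whitney_u(english_scores, national_scores))  # (19.5, 5.5)
```

Because scores on a 5-point scale produce many ties, the midrank step matters: each tied score receives the mean of the ranks it occupies. A P value would additionally require the tie-corrected null distribution of U, which this sketch leaves out.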
RESULTS
Referees
A total of 156 referees (136 men, 17 women, and three with sex not
stated; mean age, 52 years; range, 35 to 70 years) completed the review
forms for both manuscripts, ie, a total of 312 reviews. Nationalities
and specialties are given in Table 2.
The referees had on average reviewed for 4.3 journals, and the mean
number of reviews during the last 12 months was 12.9 for men (95% CI,
10.0 to 15.8) and 6.8 for women (95% CI, 3.3 to 10.2). The mean time
spent on reviewing was 96 minutes (95% CI, 90 to 103). Surgeons
reported significantly shorter time than internists with a mean of 81
minutes (95% CI, 73 to 88) vs 115 minutes (95% CI, 101 to 128), with
family physicians in between with a mean of 94 minutes (95% CI, 81 to
106). Women spent a shorter time per manuscript than men with a mean of
77 minutes (95% CI, 63 to 90) vs 99 minutes (95% CI, 91 to 106). The
mean time spent on reviewing the two short manuscripts was 37 minutes
on manuscript A (95% CI, 32 to 42) and 40 minutes on manuscript B
(95% CI, 35 to 45). The time spent decreased with increasing
experience (P<.05). In all, 89% of the referees in principle
preferred a structured form when reviewing, but the number of those
finding such forms useful decreased with increasing experience
(P<.05).
Referee Characteristics and Quality Assessment
We planned for both manuscripts to be of below-average quality to ensure a
sufficient spread of assessments. The distribution of judgments of
the total quality of manuscripts reflects such variation
(Table 3). The overall quality of
manuscript B ("The Irritable Colon Project") was considered higher than that of
manuscript A ("The Alcohol and Personality Project") with a mean
score of 2.3 on a 5-point scale vs 2.0 (P=.02). For both
manuscripts the quality of presentation was considered higher than the
scientific quality of content (P<.01). The overall quality of
the manuscripts was closer to the quality of content than to the
quality of presentation.
No systematic association between the referees' nationality or
specialty and their assessments was found.
Bivariate associations were found between the assessment of the quality
of the manuscripts and the referee's age and experience as a referee,
revealing that younger or more experienced reviewers gave stricter
assessments and men gave lower quality scores than did women. A
multiple linear regression analysis with overall quality as the
dependent variable and the referee characteristics that turned out to
be most important (age, gender, and referee experience) as the
independent variables was performed. In this analysis the sex
difference disappeared, probably because of differences in experience
between men and women. For both manuscripts assessment of scientific
quality of content was significantly and independently related to age
(P<.05) and experience (P<.01). For manuscript B,
significant associations were also found between age and experience and both
quality of presentation and total quality.
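One way to write down the regression described above is given here as a sketch only. The coding of the variables (for example, age in years, sex as a 0/1 indicator, experience as the number of reviews in the last 12 months) is an assumption, since the article does not state it.

```latex
\[
Q_i = \beta_0 + \beta_1\,\mathrm{age}_i + \beta_2\,\mathrm{sex}_i
      + \beta_3\,\mathrm{experience}_i + \varepsilon_i
\]
```

Here $Q_i$ is the overall quality score given in review $i$ and $\varepsilon_i$ is the residual. In this notation, the disappearance of the sex difference corresponds to $\beta_2$ no longer differing significantly from zero once age and experience are in the model.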
The effect of age and experience as a referee is illustrated by the
mean score of overall manuscript quality of 3.2 among reviews given by
referees who were older than 60 years and had reviewed fewer than three
manuscripts during the last 12 months (n=9), compared with a mean score
of 1.8 among referees who were younger than 50 years and had reviewed
10 or more manuscripts during the last 12 months (n=40)
(P<.01).
Publication Language
The assessment of the English version compared with the
national-language version of the two manuscripts is shown in Table 1.
For manuscript A the overall quality of the English version was
considered higher than that of the national-language version with mean
scores of 2.2 vs 1.8, respectively (P=.01). No such difference
was found for manuscript B. The difference between the overall quality
of the English and the national-language versions was greatest among
referees with the least experience (2.7 vs 2.3; P=.1; n=42)
compared with medium experienced (2.3 vs 2.1; P=not
significant; n=59) and most experienced (1.9 vs 1.9; P=not
significant; n=53).
Free-Text Comments
This part of the study was based on the optional free-text comments;
the total number of methodological assessments was therefore expected to be lower
than the number of referees. A total of 159 reviews were applicable for
the methodological study, yet in 54 (34%) of the reviews no detailed
methodological comments accompanied an overall assessment, and six
contained only incomplete comments. In all, 200 comments on individual
manuscripts were usable for analysis, reviewed manuscripts being the
units.
On a 4-point qualitative scale (4 meaning that all major flaws were
mentioned) treated like an index, the average was 1.7, without any
significant differences between specialties, countries, or languages.
Wrong sampling unit was mentioned by one fourth of 80 referees, and
only one referee spontaneously mentioned the incorrect use of a
parametric test on data that called for a nonparametric test.
COMMENT
The study was presented to the referees with emphasis on evaluating a
new version of a structured form for referee judgment, and this, as
well as the rather poor quality of both manuscripts, may have affected
the assessments. More referees might have noticed the methodological
errors without mentioning them explicitly, because such errors could be
reported only in the optional free-text comments.
Our Scandinavian referees resemble a group of British referees
described by Lock and Smith[2] in the strong predominance of
men, in the number of journals they serve (4.3 vs 5), and in their
workload (on average one manuscript reviewed per month for both
groups). The mean time spent per manuscript is also similar to that in the
British study (1.5 hours vs 1.4 hours). The time taken for assessing
the two manuscripts presented herein was less than half of this,
probably reflecting the short length of the manuscripts but perhaps
also a difference between "estimated average time" and "time
used."
Only age and experience were significantly associated with strictness
of assessment. The stricter assessment with younger age should be seen
in relation to the findings by Evans et al[5] that younger reviewers produce the best reviews.
The results partly confirmed our hypothesis that an English version of
a manuscript is generally considered more acceptable than a
national-language version, this not being caused by a difference in
understandability. Although only the language difference in the overall
quality of manuscript A was statistically significant at the 5% level,
most aspects of scientific content were assessed to
be better in the English than in the national-language version for both
manuscripts (Table 1). Even for presentation quality, for
which a national-language version would be expected to be more
acceptable, the contrary was found, especially for manuscript A. The
closing "language gap" with increasing experience seems to indicate
a reduction of awe with greater experience as a referee. In the
Scandinavian countries, as in many other non-English-language
countries, publishing in the mother tongue has become a handicap to
physicians with academic ambitions.[3,4]
From the Journal of the Norwegian Medical Association, Lysaker, Norway (Dr Nylenna); Medical Department C, Herlev (Denmark) University Hospital (Dr Riis); and the Journal of the Swedish Medical Association, Stockholm, Sweden (Mr Karlsson).
Presented at the Second International Congress on Peer Review in Biomedical Publication, Chicago, Ill, September 11, 1993.
Reprint requests to Journal of the Norwegian Medical Association, Fjellveien 5, N-1324 Lysaker, Norway (Dr Nylenna).
References
1. Yankauer A. Who are the peer reviewers and how much do they review? JAMA. 1990;263:1338-1340.
2. Lock S, Smith J. What do peer reviewers do? JAMA. 1990;263:1341-1343.
3. Vandenbroucke JP. On not being born a native speaker of English. BMJ. 1989;298:1461-1462.
4. Bakewell D. Publish in English, or perish? Nature. 1992;356:648.
5. Evans AT, McNutt RA, Fletcher SW, Fletcher RH. Characteristics of peer reviewers who produce good reviews. Presented at the Second International Congress on Peer Review in Biomedical Publication; September 9, 1993; Chicago, Ill.