
Journal of Applied Psychology, 1996, Vol. 81, No. 5, 557-574

Copyright 1996 by the American Psychological Association, Inc. 0021-9010/96/$3.00

Comparative Analysis of the Reliability of Job Performance Ratings

Chockalingam Viswesvaran, Florida International University

Deniz S. Ones, University of Houston

Frank L. Schmidt, University of Iowa

This study used meta-analytic methods to compare the interrater and intrarater reliabilities of ratings of 10 dimensions of job performance used in the literature; ratings of overall job performance were also examined. There was mixed support for the notion that some dimensions are rated more reliably than others. Supervisory ratings appear to have higher interrater reliability than peer ratings. Consistent with H. R. Rothstein (1990), mean interrater reliability of supervisory ratings of overall job performance was found to be .52. In all cases, interrater reliability is lower than intrarater reliability, indicating that the inappropriate use of intrarater reliability estimates to correct for biases from measurement error leads to biased research results. These findings have important implications for both research and practice.

Several measures of job performance have been used over the years as criterion measures (cf. Campbell, 1990; Campbell, Gasser, & Oswald, 1996; Cleveland, Murphy, & Williams, 1989). Attempts have also been made to identify the specifications for these criteria (Blum & Naylor, 1968; Brogden, 1946; Dunnette, 1963; Stuit & Wilson, 1946; Toops, 1944). For example, Blum and Naylor (1968) identified 11 dimensions or characteristics on which the different criteria can be evaluated, whereas Brogden (1946) identified relevance, reliability, and practicality as desired characteristics for criteria. Reliability of criteria has been included as an important consideration by all authors writing about job performance measurement.

Indeed, for a measure to have any research or administrative use, it must have some reliability.

Chockalingam Viswesvaran, Department of Psychology, Florida International University; Deniz S. Ones, Department of Management, University of Houston; Frank L. Schmidt, Department of Management and Organizations, University of Iowa. The order of authorship is arbitrary, and all three authors contributed equally. Deniz S. Ones is now at the Department of Psychology, University of Minnesota.

An earlier version of this article was presented at a symposium called "Reliability and Accuracy Issues in Measuring Job Performance," which was chaired by Frank L. Schmidt at the 10th Annual Meeting of the Society for Industrial and Organizational Psychology, Orlando, Florida, May 1995.

Correspondence concerning this article should be addressed to Chockalingam Viswesvaran, Department of Psychology, Florida International University, Miami, Florida 33199. Electronic mail may be sent via Internet to [email protected].

Low reliability results in systematic reduction in the magnitude of observed relationships and can therefore distort theory testing. The recent resurgence of interest in criteria (Austin & Villanova, 1992) and in developing a theory of job performance (e.g., Campbell, 1990; Campbell, McCloy, Oppler, & Sager, 1993; McCloy, Campbell, & Cudeck, 1994; Schmidt & Hunter, 1992) also emphasizes the importance of the reliability of criterion measures. A thorough investigation of the criterion domain ought to include an examination of the reliability of dimensions of job performance. The focus of this article is the reliability of job performance ratings.

Of the different ways to measure job performance, performance ratings are the most prevalent. Ratings are subjective evaluations that can be obtained from supervisors, peers, subordinates, self, or customers, with supervisors being the most commonly used source (Cascio, 1991; Cleveland et al., 1989) and peers constituting the second most commonly used source. For example, Bernardin and Beatty (1984) found in a survey of human resource managers that over 90% of the respondents used supervisory ratings as their primary source of performance ratings and that peers were the second most widely used source of ratings.

Constructing comprehensive and valid theories of human motivation and work behavior is predicated on the reliable measurement of constructs. Given the centrality of the construct of job performance in industrial and organizational psychology (Campbell et al., 1996), and given that ratings are the most commonly used source for measuring job performance, it is important to estimate precisely the reliability of job performance ratings. Furthermore, competing cognitive process mechanisms have been postulated (e.g., Borman, 1979; Wohlers & London, 1989) to explain the convergence in ratings between two raters. An accurate evaluation of these competing mechanisms will facilitate and enhance understanding of the psychology underlying the rating or evaluative judgment process in general, and job performance in particular. Finally, many human resource practices recommended to practitioners in organizations are predicated on the reliable measurement of job performance. As such, both from a theoretical perspective (i.e., to analyze and build theories contributing to the science of industrial and organizational psychology) and from a practice perspective, a comparative analysis of the reliability of job performance ratings is warranted.

The primary purpose of this study was to investigate the reliability of peer and supervisory ratings of various job performance dimensions. Meta-analytic principles (Hunter & Schmidt, 1990) were used to cumulate reliability estimates across studies. Reliability estimates of various job performance dimensions can be compared to identify which dimensions are rated reliably and, in turn, to identify the dimensions rated with low reliability so that they can be improved through training. A second objective of this study was to compare interrater agreement and intrarater consistency in the reliability of ratings. A third and final objective of this study was to compile and present reliability distributions that can be used in future meta-analyses involving job performance ratings.

Comparing Reliability Across Dimensions

Supervisory and peer ratings have been used to assess individuals on many dimensions of job performance. Comparing the reliability of ratings of different dimensions enables an empirical test of the hypothesis that certain dimensions of job performance are easier to evaluate than others (cf. Wohlers & London, 1989). In essence, the thrust of this hypothesis is that some dimensions of job performance are easier than others to evaluate because they are easier to observe and clearer standards of evaluation are available. Wohlers and London (1989) suggested that dimensions of performance such as administrative competence, leadership, and communication competence are more difficult to evaluate than dimensions such as output and errors.

Similarly, Borman (1979) found that "raters evaluated ratees significantly more accurately on some dimensions than on others, and that for most part these differences were consistent across formats" (p. 419). Borman (1979) also stated that the rank order accuracy on the different dimensions in his study was similar to the rank order accuracy in an earlier study by Borman, Hough, and Dunnette (1976). The rank order correlation was .88 for assessing managers and .54 for recruiters. That is, the rank order accuracy of different dimensions was consistent across rating formats, studies and samples, and jobs. Borman (1979) noted that this consistent dimension effect, even across a variety of formats (and, if we may add, across jobs and samples), may be due to something inherent in the nature of the dimensions that makes them either difficult or easy for raters. Furthermore, Borman suggested that "accuracy is highest on those dimensions for which actors provided the least ambiguous, most consistent performances, perhaps because they, as well as the student raters, understood those particular dimensions better than some of the other dimensions" (p. 420).

The hypothesis that certain dimensions of job performance are easier to evaluate than others is also found in the personality literature (e.g., Christensen, 1974). This line of thought is found in the social psychology literature as well. For example, Bandura (1977), as well as Salancik and Pfeffer (1978), posited from a social information-processing framework that when there are no clear interpretable signs of behavior or when the standards of evaluation are ambiguous, interrater agreement will be lower than when there are clear interpretable signs and the standards are unambiguous. This is also hypothesized to be true when certain dimensions of job performance have a rare occurrence (low base rate) or have greater salience in memory (e.g., accidents).

Comparing the reliability of dimensions facilitates an empirical test of the hypothesis (Borman, 1979; Wohlers & London, 1989) of a gradient in reliabilities across job performance dimensions. Such knowledge will facilitate an understanding of the rating processes.

Comparing Different Types of Reliability Estimates

Comparing the different types of reliability estimates (coefficient of equivalence, coefficient of stability, etc.) for each dimension of job performance is also valuable. Reliability of a measure is defined as the ratio of the true to observed variance (Nunnally, 1978). Different types of reliability coefficients assign different sources of variance to measurement error. In general, the most frequently used reliability coefficients associated with criterion ratings can be broadly classified into two categories: interrater and intrarater. In the context of performance measurement, interrater reliability assesses the extent to which different raters agree on the performance of different individuals. As such, individual raters' idiosyncratic perceptions of job performance are considered to be part of measurement error.


Intrarater reliability, on the other hand, assigns any specific error unique to the individual rater to true variance. That is, each rater's idiosyncratic perceptions of job performance are relegated to the true variance component. Both coefficient alpha and the coefficient of stability (rate-rerate reliability with the same rater) are forms of intrarater reliability. Intrarater reliability is most frequently indexed by coefficient alpha computed on ratings from a single rater on the basis of the correlations or covariances among different rating items or dimensions. Coefficient alpha assesses the extent to which the different items used to measure a criterion are indeed assessing the same criterion.1 Rate-rerate reliability computed using data from the same rater at two points in time assesses the extent to which there is consistency in the performance appraisal ratings of a given rater over time. Both of these indices of intrarater reliability, coefficient alpha and the coefficient of stability (over short periods of time when it is assumed that true performance does not change), estimate what the correlation would be if the same rater rerated the same employees (Cronbach, 1951).2
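Stated in symbols (our notation, added for clarity; the article gives the definition only in words), reliability is the proportion of observed variance that is true variance,

\[
r_{xx} \;=\; \frac{\sigma^2_{T}}{\sigma^2_{X}} \;=\; \frac{\sigma^2_{T}}{\sigma^2_{T} + \sigma^2_{E}},
\]

and the interrater and intrarater coefficients differ only in which sources of variance are counted in \(\sigma^2_{E}\): rater idiosyncrasy is error for interrater reliability but is folded into \(\sigma^2_{T}\) for intrarater reliability.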

Thus, different types of reliability estimates assign different sources of variance to measurement error variance. When a single judge rates a job performance dimension with a set of items, coefficient alpha may be computed on the basis of the correlations or covariances among the items. Coefficient alpha, which is a measure of equivalence of the items, assigns the item-specific variance and variance because of random responses in ratings to measurement error variance. Influences unique to the particular rating occasion and unique to the rater are not assigned to measurement error but are incorporated into the true variance. When job performance is assessed by the same rater with the same set of items at two different points in time, the resulting coefficient of stability (rate-rerate reliability) assigns variance because of transient errors in rating (i.e., variance from mental states and other factors in raters that vary over days) to measurement error variance (Schmidt & Hunter, 1996).

Thus, by comparing the different reliability estimates for the same dimension of job performance, one can gauge the magnitude of a particular source of error in ratings involving that dimension of job performance. Such knowledge can be valuable in designing rating formats and rater training programs.

Constructing Artifact Distributions

Constructing artifact distributions for different dimensions of job performance also serves meta-analytic cumulations involving ratings of job performance. The reliability distributions reported here could be used in meta-analyses of studies involving ratings of performance.3 Also, some published meta-analyses involving ratings have erroneously combined estimates of interrater and intrarater reliability in one artifact distribution as if these estimates were equivalent. With the increasing emphasis on the precision of estimates to be used in theory testing (Schmidt & Hunter, 1996; Viswesvaran & Ones, 1995), it is imperative that future meta-analyses use the appropriate reliability estimates. By providing estimates of different reliability coefficients for each dimension of job performance, this article aims to provide a useful source of reference for researchers.

Thus, the primary purpose in this article is to cumulate the reliabilities of job performance ratings with the principles of psychometric meta-analysis (Hunter & Schmidt, 1990) and to compare the reliability of the ratings of different dimensions. Comparing the reliability of different dimensions enables an evaluation of the hypothesis (Borman, 1979; Wohlers & London, 1989) that evaluation difficulty varies across dimensions. A secondary purpose in this article is to compare the magnitude of the different sources of errors (by comparing interrater reliabilities, coefficient alphas, and coefficients of stability) that exist in ratings of each dimension of job performance. A third purpose is to provide reliability distributions that could be used in future meta-analytic cumulations involving ratings of performance.

Method

Database

We searched the literature for articles that reported reliability coefficients either for job performance dimensions or for overall job performance. Only studies that were based on actual job performance were included. Interviewer ratings, assessment center ratings, and ratings of performance in simulated exercises were excluded. We searched all issues of the following 15 journals, from the first issue of each journal through January 1994: Journal of Applied Psychology, Personnel Psychology, Academy of Management Journal, Human Relations, Journal of Business and Psychology, Journal of Management, Organizational Behavior and Human Decision Processes, Accident Analysis and Prevention, International Journal of Intercultural Relations, Journal of Vocational Behavior, Journal of Applied Behavioral Analysis, Human Resources Management Research, Journal of Occupational Psychology, Psychological Reports, and Journal of Organizational Behavior.

1 Coefficient alpha computed on ratings from a single rater is an estimate of the rate-rerate reliability with the same rater. As such, it is a form of intrarater reliability. However, it should be noted that a different coefficient alpha can be used to index interrater reliability. This is possible if the variance-covariance matrix across raters is used in the computations. In this study, we did not examine coefficient alphas obtained by using data across raters.

2 For a recent discussion of these and other reliabilities in industrial and organizational psychology research, see Schmidt and Hunter (1996).

3 Frequency distributions of the reliabilities contributing to the analyses reported in this article may be obtained by writing to Chockalingam Viswesvaran.

Analyses

In cumulating results across studies, the same job performance dimensions can be referred to with different labels. Any grouping of the different labels as measuring the same criteria has to be guided by theoretical considerations. That is, we need to theoretically define the criteria first (Campbell et al., 1993). The broader the definition, the more general and possibly more useful the criteria are; on the other hand, the narrower the definition (up to a point), the more unambiguous the criteria become. The delineation of the job performance domain into its component dimensions was undertaken as part of a study examining whether a general job performance factor is responsible for the covariation among job performance dimensions (Viswesvaran, 1993). Viswesvaran (1993) identified 10 job performance dimensions that comprehensively represented the entire job performance domain. In this study, all the job performance measures used in the individual studies were listed and then grouped into one of the conceptually similar categories by the authors. That is, the definition of the job performance dimensions and the classification of the job performance ratings into these 10 dimensions preceded the coding of the reliability estimates. We read all the articles making up our database and then classified the reliabilities. In other words, we took into account not only the definitions but also the context (and all other information) provided in each article in classifying the reliabilities into the dimensions. Interrater agreement was 93%. Disagreements were resolved through mutual discussion until consensus was reached. Definitions for the 10 groups of ratings for which analyses are reported here are provided in Table 1.

Given 10 dimensions of job performance and three types of reliabilities (interrater, stability, and equivalence), there were potentially 30 reliability distributions to be investigated. Because our interest was in examining the reliability of both supervisory and peer ratings, there were potentially 60 distributions to be meta-analyzed. Of these, some combinations have not been assessed in the literature. The reliability values obtained from the individual studies were coded into 1 of the 60 distributions.

Next, in cumulating the reliability of any particular criterion across several studies, the length of the measuring instrument (number of raters for interrater reliability estimates and number of items for coefficient alpha estimates) varied across the studies. One option was to use the Spearman-Brown formula to bring all estimates to a common length. We reduced all interrater reliability estimates to that of one rater. In many organizations, there will almost never be more than one supervisor doing the rating, but there will almost never be an instrument with only one item (i.e., one performance dimension rated). As such, we did not correct the coefficient alphas for the number of items. Furthermore, most rating instruments had a range of items over which Spearman-Brown adjustments did not make a practical difference.
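As a sketch of that adjustment (standard Spearman-Brown algebra in our notation; the numerical values are illustrative, not taken from the database), a reliability based on k raters steps down to a single rater as

\[
r_{kk} = \frac{k\,r_{11}}{1 + (k-1)\,r_{11}}
\quad\Longleftrightarrow\quad
r_{11} = \frac{r_{kk}}{k - (k-1)\,r_{kk}},
\]

so a reported two-rater reliability of .68, for example, would be entered as \(r_{11} = .68/(2 - .68) \approx .52\).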

For the coefficient of stability, without knowing the functional relationship between estimates of stability and the time interval between the measurements, corrections to bring estimates of stability to the same interval are impossible. All we can say, intuitively speaking, is that as the time interval increases, the reliability estimate generally decreases. The function that captures this decrease is unknown. Jensen (1980), based on fitting curves to empirical data, reported that the stability of IQ test scores is a function of the square root of the ratio of the chronological ages at the two points of measurement. Another possibility is to assume an asymptotic function in which the reliability estimate falls as the time interval increases (at an infinite time interval, the reliability estimate will be zero, or an asymptote at zero). This is similar to Rothstein (1990), who presented empirical evidence that as the opportunity to observe (indexed by number of years supervised or observed) increases, the interrater reliability increases but reaches an asymptotic maximum of .60. Lacking information on the functional relationship between reliability estimates and time intervals between measurements, we made no corrections to bring all estimates of coefficients of stability included in a meta-analysis to the same interval.

Note that in our intrarater reliability analyses, we were careful to include only the coefficients of stability that were based on ratings from the same rater. Rate-rerate correlations from different raters at two points in time are interrater reliability coefficients and will be lower than estimates in which the same rater provides the ratings at the two points in time and thus intrarater reliability is assessed (Cronbach, 1947).

A meta-analysis correcting only for sampling error was conducted for each of the 60 distributions for which there were at least four estimates to be cumulated. The sample size weighted mean, observed standard deviation, and residual standard deviation were computed for each distribution. We also computed the unweighted mean and standard deviation. The computations of the unweighted mean and standard deviation do not weight the reliability estimates by the sample size of each study contributing to the analysis; each reliability coefficient is equally weighted. The sample size weighted mean gives the best estimate of the mean reliability, whereas the unweighted mean ensures that our results are not skewed by a few large-sample estimates.
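A minimal computational sketch of this bare-bones (sampling-error-only) procedure may help. The function below is our own illustration, not code from the study, and it uses the standard Hunter-Schmidt-style approximation for the sampling error variance of a correlation-like coefficient:

```python
import numpy as np

def bare_bones_meta(r, n):
    """Sample-size-weighted bare-bones meta-analysis of reliability estimates.

    r : reliability coefficients, one per study
    n : corresponding study sample sizes
    Returns the weighted mean, observed SD, and residual SD.
    """
    r, n = np.asarray(r, dtype=float), np.asarray(n, dtype=float)
    mean_r = np.sum(n * r) / np.sum(n)                   # weighted mean
    var_obs = np.sum(n * (r - mean_r) ** 2) / np.sum(n)  # weighted observed variance
    # Sampling error variance evaluated at the mean reliability and the
    # average sample size (an approximation in the Hunter & Schmidt, 1990, style).
    var_err = (1.0 - mean_r ** 2) ** 2 / (np.mean(n) - 1.0)
    var_res = max(var_obs - var_err, 0.0)                # residual (population) variance
    return mean_r, np.sqrt(var_obs), np.sqrt(var_res)

# Illustrative (made-up) study values:
mean_r, sd_obs, sd_res = bare_bones_meta([.45, .55, .60, .50], [100, 250, 80, 150])
```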

In addition, we computed the mean and standard deviation of the square roots of the reliabilities. The mean of the square roots of the reliabilities differs slightly from the square root of the mean of the reliabilities; therefore, both quantities were assessed. Both sample size weighted and unweighted (i.e., frequency weighted) analyses were undertaken. Thus, for each of the 60 distributions, the objective was to estimate the mean and standard deviation of (a) sample size weighted reliability estimates, (b) reliability estimates (unweighted or frequency weighted), (c) sample size weighted square roots of the reliabilities, and (d) square roots of the reliabilities (unweighted or frequency weighted).
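A small numerical illustration (our numbers) of why the two quantities differ: for two reliabilities of .49 and .81,

\[
\sqrt{\frac{.49 + .81}{2}} = \sqrt{.65} \approx .81,
\qquad
\frac{\sqrt{.49} + \sqrt{.81}}{2} = \frac{.70 + .90}{2} = .80;
\]

by Jensen's inequality, the mean of the square roots can never exceed the square root of the mean.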

The sampling error variance associated with the mean of the reliability was estimated as the variance divided by the number of estimates averaged (Callender & Osburn, 1988). The sampling error of the mean was used to construct an 80% confidence interval around the mean. Assuming normality, 80% of the points in the distribution fall within this interval. That is, the probability of obtaining a value higher than the upper bound of the interval and the probability of obtaining a value lower than the lower bound is .10.

For both interrater reliability and the coefficient of stability (rate-rerate reliability with the same rater), in addition to the confidence interval, the sampling error of the correlation was computed, and credibility intervals were constructed. A residual standard deviation was computed as the square root of the difference between the observed and sampling error variance of the correlation (i.e., the interrater reliability coefficient in the former case and the rate-rerate reliability coefficient in the latter case). Note, however, that the sampling error formula for coefficient alpha is different from those of interrater reliability coefficients and coefficients of stability (i.e., correlation coefficients). Given the mean and residual standard deviation, along with the normality assumption (assuming two-tailed tests), we can compute the estimated reliability below which the population reliability value is likely to fall with a 90% chance: M + 1.28(residual standard deviation). Though different (90%, 95%, etc.) credibility intervals (and upper bound values) can be constructed, we report only the 80% credibility interval for the sample size weighted mean reliability estimate for the reliability distributions. Interested readers can compute the other credibility intervals (90%, 95%, etc.) on the basis of the mean reliability and the residual standard deviation.
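As a worked instance of both intervals (our arithmetic, using the supervisory ratings of overall job performance reported in Table 2 below: M = .52, SD = .0950, k = 40, residual SD = .0870):

\[
\text{80\% CI:}\quad .52 \pm 1.28 \times \frac{.0950}{\sqrt{40}} \approx .52 \pm .02 = (.50,\ .54);
\qquad
\text{80\% credibility:}\quad .52 \pm 1.28 \times .0870 \approx (.41,\ .63).
\]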

Table 1
Definitions of Job Performance Ratings

Overall job performance: Ratings on statements (or rankings of individuals on statements) referring to overall performance, overall effectiveness, overall job performance, overall work reputation, or the sum of all individual dimensions rated.

Job performance or productivity: Ratings of the quantity or volume of work produced. Ratings or rankings of individuals based on productivity or sales; examples include ratings of the number of accounts opened by bank tellers and the number of transactions completed by sales clerks.

Quality: Measure of how well the job was done. Ratings of (or rankings of individuals on statements referring to) the quality of tasks completed, lack of errors, accuracy to specifications, thoroughness, and amount of wastage.

Leadership: Measure of the ability to inspire, to bring out extra performance in others, to motivate others to scale great heights, and professional stature; includes performance appraisal statements such as "gets subordinates to work efficiently," "stimulates subordinates effectively," and "maintains authority easily and comfortably."

Communication competence: Skill in gathering and transmitting information (both in oral and written format). The proficiency to express, either in written or oral format, information, views, opinions, and positions. This refers to the ability to make oneself understood; includes performance appraisal statements such as "very good in making reports," "reports are clear," "reports are unambiguous," and "reports need no further clarification."

Administrative competence: Proficiency in handling the coordination of different roles in an organization. This refers to proficiency in organizing and scheduling work periods, administrative maintenance of records (note, however, that clarity is under Communication competence above), ability to place and assign subordinates, and knowledge of the job duties and responsibilities of others.

Effort: Amount of work an individual expends in striving to do a good job. Measures of initiative, attention to duty, alertness, resourcefulness, enthusiasm about work, industriousness, earnestness at work, persistence in seeking goals, dedication, personal involvement in the job, and effort and energy expended on the job characterize this dimension of job performance.

Interpersonal competence: Ability to work well with others. Ratings or rankings of individuals on cooperation with others, customer relations, working with co-workers, and acceptance by others, as well as nominations for "easy to get along with," are included in this dimension.

Job knowledge: Measure of the knowledge required to get the job done. Includes ratings or rankings of individuals on job knowledge and keeping up-to-date, as well as nominations of who knows the job best and who keeps up-to-date.

Compliance with or acceptance of authority: A generally positive perspective about rules and regulations; includes obeying rules, conforming to regulations in the workplace, having a positive attitude toward supervision, conforming to organizational norms and culture without incessantly complaining about organizational policies, and following instructions.

Results

Tables 2-4 summarize the results of the meta-analyses. Interrater reliability estimates for supervisory ratings are summarized in Table 2. Interrater reliability estimates for peer ratings are in Table 3, and the estimates of the coefficient of stability for supervisory ratings of overall job performance are also in Table 3. Estimates of coefficient alpha for supervisory and peer ratings are provided in Table 4. Notice that not all 10 dimensions are present in every table; we do not present the results of meta-analyses that were based on fewer than four reliability estimates.

In each table, Column 1 indicates the job performance dimension being meta-analyzed, Column 2 indicates the total sample size (the total number of individuals rated across studies included in that meta-analysis), and Column 3 provides the number of independent estimates included in the meta-analysis. Columns 4 and 5 provide the sample size weighted mean and standard deviation of the values meta-analyzed. The unweighted (or frequency weighted) mean and standard deviation of the values meta-analyzed are in Columns 6 and 7, respectively. The sample size weighted mean and standard deviation of the square roots of the reliabilities are in Columns 8 and 9, respectively. Finally, the unweighted (or frequency weighted) mean and standard deviation of the square roots of the reliabilities are in Columns 10 and 11, respectively. Column 12 provides the 80% confidence interval based on the sample size weighted mean reliability values. Different intervals (e.g., 95%) can be constructed on the basis of the values reported in Columns 3, 4, and 5. Similarly, different intervals can be computed for (a) unweighted (or frequency weighted) reliability values, derived from the data reported in Columns 3, 6, and 7; (b) sample size weighted square roots of the reliability estimates, derived from the data presented in Columns 3, 8, and 9; and (c) unweighted (or frequency weighted) square roots of the reliabilities, derived from the information provided in Columns 3, 10, and 11. For interrater reliability and the coefficient of stability, the residual standard deviations of the reliability distributions and the 80% credibility intervals are reported in Columns 13 and 14, respectively. The credibility interval refers to the entire distribution, not the mean value. Also, it refers to population values (the estimated distribution of population values), not observed values, which were affected by sampling error.

In discussing the results, we first compared the supervisory rating reliability of different dimensions of rated performance for each type of reliability (e.g., interrater). Then we focused on the same type of reliability (e.g., interrater) based on peer ratings of the different dimensions. Third, we compared the reliability of peer and supervisory ratings. These three steps were repeated for each type of reliability: interrater, stability, and coefficient alpha. A final section discusses the assessment of the relative influence of the different sources of error.

Table 2
Interrater Reliabilities of Supervisory Ratings of Job Performance

| Dimension | n | k | M wt | SD wt | M unwt | SD unwt | M sqwt | SD sqwt | M squnwt | SD squnwt | 80% CI | SD res | 80% cred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall job performance | 14,650 | 40 | .52 | .0950 | .68 | .1469 | .72 | .0605 | .82 | .0924 | .50-.54 | .0870 | .41-.63 |
| Productivity | 2,015 | 19 | .57 | .1540 | .57 | .1769 | .75 | .1079 | .75 | .1236 | .52-.62 | .1392 | .39-.75 |
| Quality | 1,225 | 10 | .63 | .1191 | .65 | .1406 | .79 | .0756 | .80 | .0885 | .58-.68 | .1058 | .49-.77 |
| Leadership | 2,171 | 20 | .53 | .0928 | .55 | .1124 | .73 | .0598 | .74 | .0742 | .50-.56 | .0617 | .45-.61 |
| Communication competence | 1,563 | 9 | .45 | .1282 | .43 | .1824 | .66 | .1071 | .64 | .1568 | .40-.50 | .1129 | .31-.59 |
| Administrative competence | 1,120 | 9 | .58 | .1040 | .59 | .1674 | .76 | .0659 | .76 | .1056 | .54-.62 | .0851 | .47-.69 |
| Effort | 2,714 | 24 | .55 | .1250 | .56 | .1601 | .74 | .0858 | .74 | .1113 | .52-.58 | .1062 | .41-.69 |
| Interpersonal competence | 3,006 | 31 | .47 | .1664 | .53 | .1983 | .68 | .1332 | .70 | .1711 | .43-.51 | .1461 | .28-.66 |
| Job knowledge | 14,072 | 20 | .53 | .0508 | .56 | .1976 | .73 | .0392 | .73 | .2356 | .52-.54 | .0429 | .48-.58 |
| Compliance with or acceptance of authority | 905 | 8 | .56 | .1276 | .60 | .1295 | .74 | .1548 | .77 | .0900 | .50-.62 | .1099 | .42-.70 |

Note. k = number of reliabilities included in the meta-analysis; wt = sample size weighted; unwt = unweighted or frequency weighted; sqwt = square root of the estimates, weighted; squnwt = square root of the estimates, unweighted; CI = confidence interval; cred = credibility interval; res = residual.

Table 3
Interrater Reliabilities of Peer Ratings and Coefficients of Stability for Supervisory Ratings of Job Performance

Peer ratings (interrater reliabilities)

| Performance dimension | n | k | M wt | SD wt | M unwt | SD unwt | M sqwt | SD sqwt | M squnwt | SD squnwt | 80% CI | SD res | 80% cred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall job performance | 2,389 | 9 | .42 | .1063 | .44 | .1615 | .64 | .0764 | .66 | .1141 | .37-.47 | .0935 | .30-.54 |
| Productivity | 205 | 4 | .34 | .1419 | .30 | .1775 | .57 | .1414 | .52 | .1743 | .25-.43 | .0676 | .25-.43 |
| Leadership | 434 | 4 | .38 | .1142 | .36 | .1595 | .61 | .0950 | .59 | .1387 | .31-.45 | .0789 | .28-.48 |
| Effort | 348 | 7 | .42 | .2454 | .48 | .2452 | .61 | .2332 | .66 | .2243 | .30-.54 | .2152 | .14-.70 |
| Interpersonal competence | 635 | 9 | .42 | .1451 | .50 | .1613 | .64 | .1095 | .70 | .1196 | .36-.48 | .1063 | .28-.56 |
| Job knowledge | 249 | 4 | .33 | .1012 | .33 | .1268 | .57 | .0900 | .57 | .1151 | .27-.39 | .1000 | .20-.46 |
| Compliance with or acceptance of authority | 220 | 5 | .71 | .0493 | .73 | .0740 | .84 | .0290 | .86 | .0437 | .68-.74 | .0467 | .65-.77 |

Supervisory ratings (coefficient of stability)

| Performance dimension | n | k | M wt | SD wt | M unwt | SD unwt | M sqwt | SD sqwt | M squnwt | SD squnwt | 80% CI | SD res | 80% cred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall job performance | 1,374 | 12 | .81 | .0895 | .84 | .0875 | .90 | .0685 | .91 | .0462 | .78-.84 | .0835 | .70-.92 |

Note. k = number of reliabilities included in the meta-analysis; wt = sample size weighted; unwt = unweighted or frequency weighted; sqwt = square root of the estimates, weighted; squnwt = square root of the estimates, unweighted; CI = confidence interval; cred = credibility interval; res = residual.

Table 4
Coefficient Alpha Reliabilities of Supervisory and Peer Ratings of Job Performance (Intrarater Reliabilities)

Supervisory ratings

| Dimension | n | k | M wt | SD wt | M unwt | SD unwt | M sqwt | SD sqwt | M squnwt | SD squnwt | 80% CI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall job performance | 17,899 | 89 | .86 | .1433 | .84 | .1510 | .92 | .0942 | .91 | .1089 | .84-.88 |
| Productivity | 2,697 | 17 | .82 | .1248 | .85 | .1110 | .90 | .0711 | .92 | .0630 | .78-.86 |
| Quality | 739 | 6 | .81 | .0752 | .81 | .0828 | .90 | .0413 | .90 | .0455 | .77-.85 |
| Leadership | 3,821 | 21 | .77 | .1239 | .77 | .1315 | .87 | .0735 | .87 | .0768 | .74-.80 |
| Communication competence | 943 | 8 | .73 | .1707 | .69 | .2091 | .85 | .1103 | .82 | .1359 | .65-.81 |
| Administrative competence | 4,754 | 16 | .79 | .0544 | .79 | .0965 | .89 | .0305 | .89 | .0543 | .77-.81 |
| Effort | 3,112 | 20 | .79 | .1147 | .75 | .1392 | .88 | .0678 | .86 | .0835 | .76-.82 |
| Interpersonal competence | 10,955 | 56 | .77 | .1691 | .75 | .1902 | .87 | .1185 | .85 | .1327 | .74-.80 |
| Job knowledge | 959 | 9 | .79 | .1077 | .77 | .1290 | .89 | .0645 | .87 | .0770 | .74-.84 |
| Compliance with or acceptance of authority | 3,438 | 15 | .77 | .1194 | .76 | .1858 | .87 | .0790 | .86 | .1383 | .73-.81 |

Peer ratings

| Dimension | n | k | M wt | SD wt | M unwt | SD unwt | M sqwt | SD sqwt | M squnwt | SD squnwt | 80% CI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Overall job performance | 1,270 | 10 | .85 | .1193 | .81 | .1205 | .92 | .0664 | .85 | .0677 | .80-.90 |
| Leadership | 1,082 | 5 | .61 | .1931 | .53 | .2934 | .76 | .1892 | .67 | .3014 | .50-.72 |
| Effort | 1,205 | 7 | .77 | .2372 | .72 | .2526 | .86 | .1705 | .83 | .1852 | .66-.88 |
| Interpersonal competence | 325 | 8 | .61 | .2067 | .61 | .2298 | .77 | .2054 | .76 | .1797 | .52-.70 |

Note. k = number of reliabilities included in the meta-analysis; wt = sample size weighted; unwt = unweighted or frequency weighted; sqwt = square root of the estimates, weighted; squnwt = square root of the estimates, unweighted; CI = confidence interval.


Interrater Reliability

From the results reported in Table 2, the mean interrater reliability for supervisory ratings of overall job performance was .52 (k = 40, N = 14,650). The 80% credibility interval ranged from .41 to .63. That is, it is estimated that 90% of the values of interrater reliability of supervisory ratings of overall job performance are below .63. For supervisors, the sample size weighted mean interrater reliability across the nine specific job performance dimensions (excluding overall job performance) was .53. It appears that, for supervisors, the interrater reliability of overall job performance ratings is similar to the mean interrater reliability across job performance dimensions. This is noteworthy because most interrater reliabilities for overall performance in our database were for sums of items across different job performance dimensions. Contrary to expectations, the higher intrarater reliability associated with longer rating forms (see also the intrarater reliability section below) does not appear to improve interrater reliability in the job performance domain.

A second interesting point to note is that there is variation across the 10 dimensions in the mean interrater reliabilities for supervisory ratings. Although the credibility intervals for all 10 dimensions do overlap, the 80% confidence intervals indicate that, for example, both communication competence and interpersonal competence are rated less reliably, on average, than productivity or quality. Thus, the hypothesis of Wohlers and London (1989) and Borman (1979) is partially supported.


Interrater reliabilities for peer ratings of 7 of the 10 dimensions are reported in Table 3 (there were fewer than four estimates for the other three dimensions). The estimates ranged from .34 for ratings of productivity (SD = .14) to .71 for ratings of compliance with authority (SD = .05). For peers, the sample size weighted mean interrater reliability across the six specific dimensions of job performance (i.e., excluding overall job performance) was .42. For ratings of overall job performance, the interrater reliability of peer ratings was also .42 (SD = .11). The 80% credibility interval for the interrater reliability of peer ratings of overall job performance ranged from .30 to .54. That is, 90% of the actual (population) values are estimated to be less than .54, and 90% of the values are estimated to be greater than .30. Similar to the results for supervisors, the interrater reliability of overall job performance ratings is the same as the sample size weighted mean interrater reliability across individual job performance dimensions. Even though a large portion of the peer interrater reliabilities for overall performance in our database were computed in studies by summing items across different job performance dimensions, the higher intrarater reliability associated with longer rating forms (also see Coefficient Alphas: Measures of Intrarater Reliability below) does not appear to lead to higher peer interrater reliability. This mirrors the case for supervisors.

A comparison of the results reported in Tables 2 and 3 seems to indicate that there was generally more agreement between two supervisors than there was between two peers. However, caution is needed in inferring such a conclusion. First, the interrater reliability estimates for peer ratings were based on a small number of studies (as are those for some dimensions of supervisory ratings). Second, there is considerable overlap in the credibility intervals between interrater reliability estimates of peer and supervisory ratings. Finally, two of the studies reporting interrater reliabilities of peer ratings (Borman, 1974; Hausman & Strupp, 1955) reported very low values. When these two studies were eliminated from the database as outliers, peers and supervisors had comparable levels of interrater agreement. However, similar to the overall results of this meta-analysis, we should note that a recent large-sample primary study also reported lower interrater reliability estimates for peers compared with supervisors (Scullen, Mount, & Sytsma, 1995). Furthermore, in practice, given that peer ratings are based on the average ratings of several peers, the averaged multiple peer ratings may be more reliable than the ratings from a single supervisor. The Spearman-Brown prophecy formula can be used to determine the number of peer raters required.
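To make this concrete (our arithmetic, using the mean single-peer interrater reliability of .42 from Table 3), the reliability of the average of m peers is

\[
r_{mm} = \frac{m(.42)}{1 + (m-1)(.42)},
\]

which is approximately .68 for m = 3 peers and .81 for m = 6 peers; under these assumptions, both exceed the mean single-supervisor value of .52.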

Coefficient of Stability

Compared with the number of studies reporting interrater reliabilities or coefficient alphas, very few studies reported coefficients of stability. This is consistent with the general trend of more cross-sectional than longitudinal studies among published journal articles. In fact, we were able to assess the coefficient of stability only for supervisory ratings of overall job performance. There were 12 reliabilities across 1,374 individuals contributing to this analysis. For supervisory ratings of overall job performance, the sample size weighted mean coefficient of stability was .81 (SD = .09). This analysis included only estimates in which the same rater was used at the two points in time.

Coefficient Alphas: Measures of Intrarater Reliability

Intrarater reliabilities assessed by coefficient alphas were also substantial. For supervisory ratings, overall job performance was most reliably rated (.86). The least reliably rated dimension was communication competence (.73). Although the alpha estimates for all dimensions as well as for overall ratings were higher than .70, it is important to note that these estimates are inclusive of, among other things, a halo component. Another observation is that for supervisory ratings of overall job performance, the coefficient of stability reported above and the coefficient alpha were similar in size (.81 and .86, respectively). This finding supports the inference that the variance of transient errors (variance because of rater mental states or moods that vary over days) is small. These figures suggest that this source of measurement error variance in overall job performance ratings is only 5% of the total variance (.86 - .81 = .05).

In Table 4, it can be seen that peer ratings of overall job performance had a mean alpha of .85 (k = 10, N = 1,270). The intrarater reliabilities associated with peer ratings of leadership, effort, and interpersonal competence were above .60. As with interrater reliability, comparisons of alphas for peer and supervisory ratings should be tentative. When comparing peer and supervisory ratings, it appears that intrarater reliability is lower for peer than for supervisory ratings of specific job performance dimensions, but not for overall performance.

An interesting point to note for both peer and supervisory ratings is that the alphas were higher for the overall job performance ratings than for any of the dimensional ratings. For supervisors, the coefficient alpha for overall job performance was .86, whereas the mean sample size weighted alpha across the specific job performance dimensions was .78. For peers, the coefficient alpha for overall job performance was .85, whereas the mean sample size weighted alpha across the specific job performance dimensions was .68. There are two potential explanations for this result. First, this could have been due to the greater length of the instrument used for measurement.


In a large number of the studies we coded, overall job performance was measured by summing the various dimensions of job performance into a composite. However, we should point out that the relationship between the number of items and reliability is best described as concave: the reliability increases rapidly at first as the number of items increases, but after some point the increase in reliability is very small. Most of the scales meta-analyzed in this article had enough items that further increases in length and the application of the Spearman-Brown formula did not make an appreciable difference. Note that this could indirectly explain our earlier finding that both supervisor and peer interrater reliabilities for specific dimensions of job performance are similar to the interrater reliabilities for overall job performance. The second potential explanation for higher alphas for ratings of overall job performance stems from the broadness of the construct of overall job performance compared with any of the constructs represented by the individual dimensions of job performance. Moreover, there is some evidence (at least in the personality domain) that broader constructs are more reliably rated than narrower constructs (Ones & Viswesvaran, in press). That is, this finding could reflect that broader traits or constructs are more reliably rated than narrowly defined traits (Ones & Viswesvaran, in press). Unfortunately, this meta-analytic investigation cannot determine which of the two potential explanations is correct.

Comparison of Different Types of Reliability Estimates

Conceptually, given that (a) the reliability coefficient is the ratio of true to observed variance and (b) observed variance is true plus error variance, all types of reliability estimates have the same denominator. Coefficient alpha (using a single rater) has variance specific to the rater and variance because of transient error in the numerator. The coefficient of stability, or rate-rerate reliability with the same rater, has variance specific to the rater in the numerator (assuming true performance did not change in the rate-rerate interval), but not transient variance. Thus, the difference between coefficient alpha and the coefficient of stability with the same rater gives an estimate of the transient error in that job performance dimension, as noted earlier. Interrater reliability has neither variance specific to the rater nor transient error variance in the numerator. Therefore, the difference between the interrater reliability and the coefficient of stability provides an estimate of the variance from rater idiosyncrasy.
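In symbols (our notation; a sketch of the decomposition described above, with each coefficient written over the same observed variance):

\[
\alpha \approx \frac{\sigma^2_{T} + \sigma^2_{\text{rater}} + \sigma^2_{\text{transient}}}{\sigma^2_{X}},
\qquad
r_{\text{stab}} \approx \frac{\sigma^2_{T} + \sigma^2_{\text{rater}}}{\sigma^2_{X}},
\qquad
r_{\text{inter}} \approx \frac{\sigma^2_{T}}{\sigma^2_{X}},
\]

so that \(\alpha - r_{\text{stab}}\) estimates the transient error proportion and \(r_{\text{stab}} - r_{\text{inter}}\) estimates the rater-idiosyncrasy proportion.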

For both peer and supervisory ratings, and for all dimensions as well as overall ratings, interrater reliability estimates are substantially lower than intrarater reliability estimates (coefficients of stability and coefficient alphas).

For example, consider supervisory ratings of overall job performance: the mean interrater reliability estimate is .52, the mean coefficient of stability is .81 (on the basis of ratings from the same rater at the two points in time), and the mean coefficient alpha is .86. Approximately 29% of the variance (.81 - .52 = .29) in supervisory ratings of overall job performance appears to be due to rater idiosyncrasy, whereas 5% ([.86 - .81] x 100) of the variance is estimated to be from transient error, assuming true job performance is stable. Similar analyses can be done for other dimensions on the basis of the data reported in Tables 2-4 to compare the magnitude of the sources of error in ratings of different dimensions of job performance. Intrarater reliabilities for supervisory ratings of job performance dimensions are between .70 and .90, whereas the mean interrater reliabilities range approximately between .50 and .65. The differences between the intrarater and interrater reliability estimates that we obtained indicate that 20% to 30% of the variance in job performance dimension ratings of the average rater is specific to the rater. Using coefficient alpha instead of interrater reliability of job performance ratings to correct observed validities (say, in validating interviews) will underestimate the validity. Lacking empirically derived reliability distributions, like those yielded by this study, previous meta-analysts may have combined the correct interrater and incorrect intrarater reliabilities. However, future meta-analyses involving job performance ratings should use the appropriate reliability coefficients (Schmidt & Hunter, 1996) to obtain more precise estimates of correlations that could be used for theory testing.
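To see the size of this bias (our illustrative numbers; the formula is the standard correction for attenuation), take an observed validity of .25:

\[
\hat{\rho} = \frac{r_{xy}}{\sqrt{r_{yy}}};
\qquad
\frac{.25}{\sqrt{.52}} \approx .35
\quad\text{versus}\quad
\frac{.25}{\sqrt{.86}} \approx .27,
\]

so correcting with coefficient alpha rather than interrater reliability understates the operational validity.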

Discussion

Job performance measures play a crucial role in research and practice. Ratings (both peer and supervisory) are an important method of job performance measurement in organizations. Many decisions are made on the basis of ratings. As such, the reliability of ratings is an important concern in organizational science. Depending on the objective of the researcher, different reliability estimates need to be assessed. In personnel selection, the use of intrarater reliabilities to correct criterion-related validity coefficients for unreliability in job performance ratings may result in substantial downward biases in estimates of actual operational validity. This bias arises mostly from including rater-specific error variance (variance due to rater idiosyncrasies) as true job performance variance in computing intrarater reliability. On the other hand, what is needed to assess actual job performance and its dimensions is an answer to the question: Would the same ratings be obtained if a different but equally knowledgeable judge rated the same employees? This calls for an assessment of interrater reliability.


This is why interrater reliability is the appropriate reliability for making corrections for criterion unreliability in validation research, not coefficient alpha or rate-rerate reliability with the same rater.

This article quantitatively summarizes the available evidence in the literature for use by researchers and practitioners. A question for future research is whether the interrater reliability of ratings of overall job performance can be increased by obtaining dimensional ratings before obtaining the overall ratings.4 (Note that a similar potential does not exist for intrarater reliabilities.) It is possible that when overall performance is rated after dimension ratings are made, interrater reliabilities for overall ratings are higher because all raters have a more similar frame of reference compared with when the overall performance rating is made on its own or when overall ratings precede dimensional ratings. Furthermore, the issue here is complicated by the fact that in many studies overall job performance ratings are obtained by summing the dimensional ratings, whereas in others overall ratings are obtained on a single item (or a few items) before or after dimensional ratings are provided. To the extent that frame-of-reference effects were operating, the standard deviation of the mean interrater reliability for overall ratings should be higher than that for dimensional ratings. That is, some studies would have obtained overall performance ratings prior to dimensional ratings, and others would have obtained overall ratings after dimensional ratings. If the frame-of-reference hypothesis were correct, in a meta-analytic investigation this would have been detected as greater variance in the interrater reliability of overall job performance ratings. Of course, the interrater reliability of dimensional ratings would not have this source of variance. Hence, the standard deviation of the interrater reliability for overall ratings would be high compared with the standard deviation for dimensional ratings. Our results indicate that this is not the case. However, given that the studies contributing to our overall job performance analyses were a mixture of sums of dimensional ratings and items directly assessing overall job performance, we cannot reach any definite conclusions regarding frame-of-reference effects. In any event, this is an interesting hypothesis for future research.

In cumulating results across studies, a concern exists as to whether moderating influences are obscured. The low values of the standard deviations (compared with the means) mitigate this concern to some extent. Furthermore, Churchill and Peter (1984) and Peterson (1994) examined as many as 13 moderators of reliability estimates (e.g., whether the reliabilities were obtained for research or administrative purposes). No substantial relationships were found between any hypothesized moderator and the magnitude of reliability estimates. A potentially important moderating influence may be whether the ratings were obtained for research or administrative purposes. McDaniel, Whetzel, Schmidt, and Maurer (1994) found that the purpose of the performance ratings (administrative vs. research) moderated the validities of employment interviews. In this study, we examined three moderators of job performance rating reliabilities: type of reliability (interrater vs. intrarater), source of rating (peer vs. supervisor), and job performance dimension rated. We were not able to examine the moderating influences of administrative versus research-based ratings. This was primarily because, given the number of studies, analysis of any other moderator in a fully hierarchical design (Hunter & Schmidt, 1990) would have resulted in too few studies for a robust meta-analysis. The concern for sufficient data to detect moderators, coupled with the fact that previous meta-analyses (e.g., Peterson, 1994) that included alternate moderators did not find support for those moderators, led us to focus only on these three moderators (type of reliability, source of rating, and rating content). However, future research should examine the interaction of these three moderators with other potential moderators, such as the purpose for which the ratings were obtained (administrative vs. research).

The results reported here can be used to construct reliability artifact distributions to be used in meta-analyses (Hunter & Schmidt, 1990) when correcting for unreliability in criterion ratings. For example, the report by a National Academy of Sciences (NAS) panel (Hartigan & Wigdor, 1989) evaluating the utility gains from validity generalization (Hunter, 1983) maintained that the mean interrater reliability estimate of .60 used by Hunter (1983) was too small and that the interrater reliability of supervisory ratings of overall job performance is better estimated as .80. The results reported here indicate that the average interrater reliability of supervisory ratings of job performance (cumulated across all studies available in the literature) is .52. Furthermore, this value is similar to that obtained by Rothstein (1990), although we should point out that a recent large-scale primary study (N = 2,249) obtained a lower value of .45 (Scullen et al., 1995). On the basis of our findings, we estimate that the probability of the interrater reliability of supervisory ratings of overall job performance being as high as .80 (as claimed by the NAS panel) is only .0026. These findings indicate that the reliability estimate used by Hunter (1983) is, if anything, probably an overestimate of the reliability of supervisory ratings of overall job performance. Thus, it appears that Schmidt, Ones, and Hunter (1992) were correct in concluding that the NAS panel underestimated the validity of the General Aptitude Test Battery (GATB). The estimated validity of other operational tests may be similarly rescrutinized.

4 We thank an anonymous reviewer for suggesting this.



An anonymous reviewer presented two concerns as fundamental questions that need to be addressed. First, the reviewer raised the question of whether reliability corrections should be undertaken when one does not have estimates from the same study in which the validity was estimated. Second, if the answer to the first question is affirmative, another question arises as to whether one should use the mean reliabilities reported in this article or some conservative value (e.g., the 80% upper bound values reported in this article).

There are two reasons for answering the first question in the affirmative. First, any bias introduced in the estimated true validity from using reliability estimates reported in this article will be much less than the downward bias in validity estimates if no corrections were undertaken. That is, when reliability estimates from the sample are not available, the alternative is to make no corrections. Second, the meta-analytically obtained reliability estimates reported here may be more accurate than the sample-based estimates a primary researcher could obtain, given the major effect of sampling error on reliability estimates in single studies (which typically have small sample sizes). Using the meta-analytically obtained estimates reported here instead of the sample-based estimates may result in greater accuracy. The numerous simulation studies indicating the robustness of artifact-distribution-based meta-analyses (cf. Hunter & Schmidt, 1990) support the conclusion that bias is lower when meta-analytically obtained means are used to correct for bias than if either (a) sample-based estimates are used in the corrections or (b) no corrections are made.

The answer to the second question raised by the reviewer can also be framed in terms of bias in the estimated correlations. Using conservative values for reliability results in more bias than the use of the mean values. Many researchers maintain that being conservative is good science, but conservative estimates are by definition biased estimates. We believe it is more appropriate to aim for unbiased estimates because the research goal is to maximize the accuracy of the final estimates.

Future meta-analytic research is needed to examine the reliability of criteria obtained from other sources, such as customer ratings, self-ratings, and subordinate ratings. In a large-scale primary study (N = 2,273), Scullen et al. (1995) reported that the interrater reliability of subordinate ratings is similar to those obtained for peers (ranging between .31 and .36) for various dimensions of job performance. We see the efforts of Scullen et al. (1995) as a valuable first step in reaching generalizable conclusions about the reliability of subordinate ratings. Future research is also needed to examine the process mechanisms (e.g., Campbell, 1990; DeNisi, Cafferty, & Meglino, 1984) by which the criterion data are gathered and thus improve the reliability of the obtained ratings.

There are several unique contributions of the present study. In particular, we want to delineate clearly how our study contributes over and beyond the Rothstein (1990) study, the largest-scale primary study reported to date examining the interrater reliability of supervisory ratings. First, Rothstein (1990) focused only on interrater reliabilities. Here, we investigated both interrater and intrarater reliabilities, cumulating interrater reliabilities, coefficient alphas, and test-retest reliabilities. Second, Rothstein (1990) focused on overall job performance and did not examine the reliabilities of dimensions of the job performance construct. Given the theoretical arguments and hypothesized rating processes that posit different reliabilities for different dimensions, we examined the reliability of different dimensions of job performance as well as the reliability of overall job performance. Third, whereas the Rothstein (1990) study was based on a large sample, it was nevertheless a single primary study confined to one research corporation that markets the Supervisory Profile Record (see Rothstein, 1990). Finally, Rothstein (1990) focused on reliabilities of supervisory ratings only. We analyzed both supervisory and peer ratings, and we examined whether the reliabilities of peer and supervisory ratings are similar across job performance dimensions.

However, in contrast to our study, Rothstein (1990) was able to examine the effects of length of exposure on interrater reliability with her primary data. We were not able to test this effect, as most studies did not specify how long the raters had been exposed to the ratees. (Of course, that was not the focus of many of the studies making up our database.) Future meta-analytic research should attempt to generalize the Rothstein (1990) findings with regard to length of exposure to other rating instruments.

The results of this article offer psychometric insights into the psychological and substantive characteristics of job performance measures. The construction of generalizable theories of job performance starts with an examination of the reliable measurement of job performance dimensions. Given that ratings (supervisory and peer) are used most frequently in the measurement of this central construct, it is crucial that researchers and managers be concerned about the reliability of these measurements. For research involving the construct of job performance, accurate construct measurement is predicated on reliable job performance measurement. For practice, accurate administrative decisions depend on the reliable measurement of job performance. It is our hope that the results presented here can be used to understand and improve job performance measurement in organizations.


References

The asterisk (*) indicates studies that were included in the meta-analysis.

*Albrecht, P. A., Glaser, E. M., & Marks, J. (1964). Validation of a multiple-assessment procedure for managerial personnel. Journal of Applied Psychology, 48, 351-360.

*Anderson, H. E., Jr., Roush, S. L., & McClary, J. E. (1973). Relationships among ratings, production, efficiency, and the General Aptitude Test Battery scales in an industrial setting. Journal of Applied Psychology, 58, 77-82.

*Arvey, R. D., Landon, T. E., Nutting, S. M., & Maxwell, S. E. (1992). Development of physical ability tests for police officers: A construct validation approach. Journal of Applied Psychology, 77, 996-1009.

*Ashford, S. J., & Tsui, A. S. (1991). Self-regulation for managerial effectiveness: The role of active feedback seeking. Academy of Management Journal, 34, 251-280.

Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917-1992. Journal of Applied Psychology, 77, 836-874.

*Baird, L. S. (1977). Self and superior ratings of performance: As related to self-esteem and satisfaction with supervision. Academy of Management Journal, 20, 291-300.

Bandura, A. (1977). Social learning theory. Englewood Cliffs, NJ: Prentice-Hall.

*Barrick, M. R., Mount, M. K., & Strauss, J. P. (1993). Conscientiousness and performance of sales representatives: Test of the mediating effects of goal setting. Journal of Applied Psychology, 78, 715-722.

*Bass, A. R., & Turner, J. N. (1973). Ethnic group differences in relationships among criteria of job performance. Journal of Applied Psychology, 57, 101-109.

*Becker, T. E., & Vance, R. J. (1993). Construct validity of three types of organizational citizenship behavior: An illustration of the direct product model with refinements. Journal of Management, 19, 663-682.

*Bernardin, H. J. (1987). Effect of reciprocal leniency on the relation between consideration scores from the leader behavior description questionnaire and performance ratings. Psychological Reports, 60, 479-487.

Bernardin, H. J., & Beatty, R. W. (1984). Performance appraisal: Assessing human behavior at work. Boston: Kent.

*Bhagat, R. S., & Allie, S. M. (1989). Organizational stress, personal life style, and symptoms of life strains: An examination of the moderating role of sense of competence. Journal of Vocational Behavior, 35, 231-253.

*Blank, W., Weitzel, J. R., & Green, S. G. (1990). A test of the situational leadership theory. Personnel Psychology, 43, 579-597.

*Blanz, F., & Ghiselli, E. E. (1972). The mixed standard scale: A new rating system. Personnel Psychology, 25, 185-199.

*Blau, G. (1986). The relationship of management level to effort level, direction of effort, and managerial performance. Journal of Vocational Behavior, 29, 226-239.

*Blau, G. (1988). An investigation of the apprenticeship organizational socialization strategy. Journal of Vocational Behavior, 32, 176-195.

*Blau, G. (1990). Exploring the mediating mechanisms affecting the relationship of recruitment source to employee performance. Journal of Vocational Behavior, 37, 303-320.

*Bledsoe, J. C. (1981). Factors related to academic and job performance of graduates of practical nursing programs. Psychological Reports, 49, 367-371.

Blum, M. L., & Naylor, J. C. (1968). Industrial psychology: Its theoretical and social foundations. New York: Harper & Row.

*Borman, W. C. (1974). The rating of individuals in organizations: An alternate approach. Organizational Behavior and Human Performance, 12, 105-124.

Borman, W. C. (1979). Format and training effects on rating accuracy and rater errors. Journal of Applied Psychology, 64, 410-412.

Borman, W. C., Hough, L. M., & Dunnette, M. D. (1976). Performance ratings: An investigation of reliability, accuracy, and relationship between individual differences and rater error. Minneapolis, MN: Personnel Decisions.

*Breaugh, J. A. (1981a). Predicting absenteeism from prior absenteeism and work attitudes. Journal of Applied Psychology, 66, 555-560.

*Breaugh, J. A. (1981b). Relationships between recruiting sources and employee performance, absenteeism, and work attitudes. Academy of Management Journal, 24, 142-147.

Brogden, H. E. (1946). An approach to the problem of differential prediction. Psychometrika, 11, 139-154.

*Buckner, D. N. (1959). The predictability of ratings as a function of interrater agreement. Journal of Applied Psychology, 43, 60-64.

*Buel, W. D., & Bachner, V. M. (1961). The assessment of creativity in a research setting. Journal of Applied Psychology, 45, 353-358.

*Bushe, G. R., & Gibbs, B. W. (1990). Predicting organization development consulting competence from the Myers-Briggs type indicator and stage of ego development. Journal of Applied Behavioral Science, 26, 337-357.

*Butler, M. C., & Ehrlich, S. B. (1991). Positional influences on job satisfaction and job performance: A multivariate, predictive approach. Psychological Reports, 69, 855-865.

Callender, J. C., & Osburn, H. G. (1988). Unbiased estimation of the sampling variance of correlations. Journal of Applied Psychology, 73, 312-315.

*Campbell, C. H., Ford, P., Rumsey, M. G., Pulakos, E. D., Borman, W. C., Felker, D. B., De Vera, M. V., & Riegelhaupt, B. J. (1990). Development of multiple job performance measures in a representative sample of jobs. Personnel Psychology, 43, 277-300.

Campbell, J. P. (1990). Modeling the performance prediction problem in industrial and organizational psychology. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed., Vol. 1, pp. 687-732). Palo Alto, CA: Consulting Psychologists Press.

*Campbell, J. P., Dunnette, M. D., Arvey, R. D., & Hellervik, L. V. (1973). The development and evaluation of behaviorally based rating scales. Journal of Applied Psychology, 57, 15-22.

Campbell, J. P., Gasser, M. B., & Oswald, F. L. (1996). The substantive nature of job performance variability. In K. R. Murphy (Ed.), Individual differences and behavior in organizations (pp. 258-299). San Francisco: Jossey-Bass.


Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993). A theory of performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 35-70). San Francisco: Jossey-Bass.

Cascio, W. F. (1991). Applied psychology in personnel management (4th ed.). Englewood Cliffs, NJ: Prentice-Hall.

*Cascio, W. F., & Valenzi, E. R. (1977). Behaviorally anchored rating scales: Effects of education and job experience of raters and ratees. Journal of Applied Psychology, 62, 278-282.

*Cascio, W. F., & Valenzi, E. R. (1978). Relations among criteria of police performance. Journal of Applied Psychology, 63, 22-28.

*Cheloha, R. S., & Farr, J. L. (1980). Absenteeism, job involvement, and job satisfaction in an organizational setting. Journal of Applied Psychology, 65, 467-473.

Christensen, L. (1974). The influence of trait, sex, and information accuracy of personality assessment. Journal of Personality Assessment, 38, 130-135.

Churchill, G. A., Jr., & Peter, J. P. (1984). Research design effects on the reliability of rating scales: A meta-analysis. Journal of Marketing Research, 21, 360-375.

*Cleveland, J. N., & Landy, F. J. (1981). The influence of rater and ratee age on two performance judgments. Personnel Psychology, 34, 19-29.

Cleveland, J. N., Murphy, K. R., & Williams, R. E. (1989). Multiple uses of performance appraisal: Prevalence and correlates. Journal of Applied Psychology, 74, 130-135.

*Cleveland, J. N., & Shore, L. M. (1992). Self- and supervisory perspectives on age and work attitudes and performance. Journal of Applied Psychology, 77, 469-484.

*Colarelli, S. M., Dean, R. A., & Konstans, C. (1987). Comparative effects of personal and situational influences on job outcomes of new professionals. Journal of Applied Psychology, 72, 558-566.

*Cooper, R. (1966). Leader's task relevance and subordinate behaviour in industrial work groups. Human Relations, 19, 57-84.

*Cooper, R., & Payne, R. (1967). Extraversion and some aspects of work behavior. Personnel Psychology, 20, 45-57.

*Cortina, J. M., Doherty, M. L., Schmitt, N., Kaufman, G., & Smith, R. G. (1992). The "Big Five" personality factors in the IPI and MMPI: Predictors of police performance. Personnel Psychology, 45, 119-140.

*Cotton, J., & Stoltz, R. E. (1960). The general applicability of a scale for rating research productivity. Journal of Applied Psychology, 44, 276-277.

Cronbach, L. J. (1947). Test reliability: Its meaning and determination. Psychometrika, 12, 1-16.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

*David, F. R., Pearce, J. A., II, & Randolph, W. A. (1989). Linking technology and structure to enhance group performance. Journal of Applied Psychology, 74, 233-241.

*Day, D. W., & Silverman, S. B. (1989). Personality and job performance: Evidence of incremental validity. Personnel Psychology, 42, 25-36.

*Deadrick, D. L., & Madigan, R. M. (1990). Dynamic criteria revisited: A longitudinal study of performance stability and predictive validity. Personnel Psychology, 43, 717-744.

DeNisi, A. S., Cafferty, T. P., & Meglino, B. M. (1984). A cognitive view of the performance appraisal process: A model and research propositions. Organizational Behavior and Human Performance, 33, 360-396.

*Dicken, C. F., & Black, J. D. (1965). Predictive validity of psychometric evaluations of supervisors. Journal of Applied Psychology, 49, 34-47.

*Dickinson, T. L., & Tice, T. E. (1973). A multitrait-multimethod analysis of scales developed by retranslation. Organizational Behavior and Human Performance, 9, 421-438.

*Distefano, M. K., Jr., Pryer, M. W., & Erffmeyer, R. C. (1983). Application of content validity methods to the development of a job-related performance rating criterion. Personnel Psychology, 36, 621-631.

*Dreher, G. F. (1981). Predicting the salary satisfaction of exempt employees. Personnel Psychology, 34, 579-589.

*Dunegan, K. J., Duchon, D., & Uhl-Bien, M. (1992). Examining the link between leader-member exchange and subordinate performance: The role of task analyzability and variety as moderators. Journal of Management, 18, 59-76.

Dunnette, M. D. (1963). A note on the criterion. Journal of Applied Psychology, 47, 251-253.

*Edwards, P. K. (1979). Attachment to work and absence behavior. Human Relations, 32, 1065-1080.

*Ekpo-Ufot, A. (1979). Self-perceived task-relevant abilities, rated job performance, and complaining behavior of junior employees in a government ministry. Journal of Applied Psychology, 64, 429-434.

*Farh, J., Podsakoff, P. M., & Organ, D. W. (1990). Accounting for organizational citizenship behavior: Leader fairness and task scope versus satisfaction. Journal of Management, 16, 705-721.

*Farh, J., Werbel, J. D., & Bedeian, A. G. (1988). An empirical investigation of self-appraisal-based performance evaluation. Personnel Psychology, 41, 141-156.

*Farr, J. L., O'Leary, B. S., & Bartlett, C. J. (1971). Ethnic group membership as a moderator of the prediction of job performance. Personnel Psychology, 24, 609-636.

*Flanders, J. K. (1918). Mental tests of a group of employed men showing correlations with estimates furnished by employer. Journal of Applied Psychology, 2, 197-206.

*Gardner, D. G., Dunham, R. B., Cummings, L. L., & Pierce, J. L. (1989). Focus of attention at work: Construct definition and empirical validation. Journal of Occupational Psychology, 62, 61-77.

*Gerloff, E. A., Muir, N. K., & Bodensteiner, W. D. (1991). Three components of perceived environmental uncertainty: An exploratory analysis of the effects of aggregation. Journal of Management, 17, 749-768.

*Ghiselli, E. E. (1942). The use of the Strong vocational interest blank and the Pressey senior classification test in the selection of casualty insurance agents. Journal of Applied Psychology, 26, 793-799.

*Gough, H. G., Bradley, P., & McDonald, J. S. (1991). Performance of residents in anesthesiology as related to measures of personality and interests. Psychological Reports, 68, 979-994.

*Graen, G., Dansereau, F., Jr., & Minami, T. (1972). An empirical test of the man-in-the-middle hypothesis among executives in a hierarchical organization employing a unit-set analysis. Organizational Behavior and Human Performance, 8, 262-285.

*Graen, G., Novak, M. A., & Sommerkamp, P. (1982). The effects of leader-member exchange and job design on productivity and satisfaction: Testing a dual attachment model. Organizational Behavior and Human Performance, 30, 109-131.

*Green, S. B., & Stutzman, T. (1986). An evaluation of methods to select respondents to structured job-analysis questionnaires. Personnel Psychology, 39, 543-564.

*Greenhaus, J. H., Bedeian, A. G., & Mossholder, K. W. (1987). Work experiences, job performance, and feelings of personal and family well-being. Journal of Vocational Behavior, 31, 200-215.

*Griffin, R. W. (1991). Effects of work redesign on employee perceptions, attitudes, and behaviors: A long-term investigation. Academy of Management Journal, 34, 425-435.

*Guion, R. M. (1965). Synthetic validity in a small company: A demonstration. Personnel Psychology, 18, 49-63.

*Gunderson, E. K. E., & Nelson, P. D. (1966). Criterion measures for extremely isolated groups. Personnel Psychology, 19, 67-80.

*Gunderson, E. K. E., & Ryman, D. H. (1971). Convergent and discriminant validities of performance evaluations in extremely isolated groups. Personnel Psychology, 24, 715-724.

*Hackman, J. R., & Lawler, E. E., III. (1971). Employee reactions to job characteristics [Monograph]. Journal of Applied Psychology, 55, 259-286.

*Hackman, J. R., & Porter, L. W. (1968). Expectancy theory predictions of work effectiveness. Organizational Behavior and Human Performance, 3, 417-426.

Hartigan, J. A., & Wigdor, A. K. (Eds.). (1989). Fairness in employee testing: Validity generalization, minority issues, and the General Aptitude Test Battery. Washington, DC: National Academy Press.

*Hatcher, L., Ross, T. L., & Collins, D. (1989). Prosocial behavior, job complexity, and suggestion contribution under gainsharing plans. Journal of Applied Behavioral Science, 25, 231-248.

*Hater, J. J., & Bass, B. M. (1988). Superiors' evaluations and subordinates' perceptions of transformational and transactional leadership. Journal of Applied Psychology, 73, 695-702.

*Hausman, H. J., & Strupp, H. H. (1955). Non-technical factors in supervisors' ratings of job performance. Personnel Psychology, 8, 201-217.

*Heneman, H. G., III. (1974). Comparisons of self- and superior ratings of managerial performance. Journal of Applied Psychology, 59, 638-642.

*Heron, A. (1954). Satisfaction and satisfactoriness: Complementary aspects of occupational adjustment. Occupational Psychology, 28, 140-153.

*Hilton, A. C., Bolin, S. R., Parker, J. W., Jr., Taylor, E. K., & Walker, W. B. (1955). The validity of personnel assessments by professional psychologists. Journal of Applied Psychology, 39, 287-293.

*Hoffman, C. C., Nathan, B. R., & Holden, L. M. (1991). A comparison of validation criteria: Objective versus subjective performance measures and self- versus supervisor ratings. Personnel Psychology, 44, 601-619.

*Hogan, J., Hogan, R., & Busch, C. M. (1984). How to measure service orientation. Journal of Applied Psychology, 69, 167-173.

*Hough, L. M. (1984). Development and evaluation of the "accomplishment record" method of selecting and promoting professionals. Journal of Applied Psychology, 69, 135-146.

*Huck, J. R., & Bray, D. W. (1976). Management assessment center evaluations and subsequent job performance of white and black females. Personnel Psychology, 29, 13-30.

*Hughes, G. L., & Prien, E. P. (1986). An evaluation of alternate scoring methods for the mixed standard scale. Personnel Psychology, 39, 839-847.

Hunter, J. E. (1983). Test validation for 12,000 jobs: An application of job classification and validity generalization to the General Aptitude Test Battery (U.S. Employment Service Test Research Report No. 45). Washington, DC: U.S. Department of Labor.

Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, CA: Sage.

*Ivancevich, J. M. (1980). A longitudinal study of behavioral expectation scales: Attitudes and performance. Journal of Applied Psychology, 65, 139-146.

*Ivancevich, J. M. (1983). Contrast effects in performance evaluation and reward practices. Academy of Management Journal, 26, 465-476.

*Ivancevich, J. M. (1985). Predicting absenteeism from prior absence and work attitudes. Academy of Management Journal, 28, 219-228.

*Ivancevich, J. M., & McMahon, J. T. (1977). Black-white differences in a goal-setting program. Organizational Behavior and Human Performance, 20, 287-300.

*Ivancevich, J. M., & McMahon, T. J. (1982). The effects of goal setting, external feedback, and self-generated feedback on outcome variables: A field experiment. Academy of Management Journal, 25, 359-372.

*Ivancevich, J. M., & Smith, S. V. (1981). Goal setting interview skills training: Simulated and on-the-job analyses. Journal of Applied Psychology, 66, 697-705.

*Ivancevich, J. M., & Smith, S. V. (1982). Job difficulty as interpreted by incumbents: A study of nurses and engineers. Human Relations, 35, 391-412.

*Jamal, M. (1984). Job stress and job performance controversy: An empirical assessment. Organizational Behavior and Human Performance, 33, 1-21.

*James, L. R., & Ellison, R. L. (1973). Criterion composites for scientific creativity. Personnel Psychology, 26, 147-161.

Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.

*Johnson, J. A., & Hogan, R. (1981). Vocational interests, personality and effective police performance. Personnel Psychology, 34, 49-53.

*Jones, J. W., & Terris, W. (1983). Predicting employees' theft in home improvement centers. Psychological Reports, 52, 187-201.


*Jordan, J. L. (1989). Effects of race on interrater reliability of peer ratings. Psychological Reports, 64, 1221-1222.

*Jurgensen, C. E. (1950). Intercorrelations in merit rating traits. Journal of Applied Psychology, 34, 240-243.

*Keller, R. T. (1984). The role of performance and absenteeism in the prediction of turnover. Academy of Management Journal, 27, 176-183.

*King, L. M., Hunter, J. E., & Schmidt, F. L. (1980). Halo in a multidimensional forced-choice performance evaluation scale. Journal of Applied Psychology, 65, 507-516.

*Klaas, B. S. (1989). Managerial decision making about employee grievances: The impact of the grievant's work history. Personnel Psychology, 42, 53-68.

*Klaas, B. S., & DeNisi, A. S. (1989). Managerial reactions to employee dissent: The impact of grievance activity on performance ratings. Academy of Management Journal, 32, 705-717.

*Klimoski, R. J., & Hayes, N. J. (1980). Leader behavior and subordinate motivation. Personnel Psychology, 33, 543-555.

*Knauft, E. B. (1949). A selection battery for bake shop managers. Journal of Applied Psychology, 33, 304-315.

*Kubany, A. J. (1957). Use of sociometric peer nominations in medical education research. Journal of Applied Psychology, 41, 389-394.

*Landy, F. J., & Guion, R. M. (1970). Development of scales for the measurement of work motivation. Organizational Behavior and Human Performance, 5, 93-103.

*Latham, G. P., Fay, C. H., & Saari, L. M. (1979). The development of behavioral observation scales for appraising the performance of foremen. Personnel Psychology, 32, 299-311.

*Latham, G. P., & Wexley, K. N. (1977). Behavioral observation scales for performance appraisal purposes. Personnel Psychology, 30, 255-268.

*Lawshe, C. H., & McGinley, A. D., Jr. (1951). Job performance criteria studies: I. The job performance of proofreaders. Journal of Applied Psychology, 35, 316-320.

*Lee, R., Malone, M., & Greco, S. (1981). Multitrait-multimethod-multirater analysis of performance ratings for law enforcement personnel. Journal of Applied Psychology, 66, 625-632.

*Love, K. G. (1981). Comparison of peer assessment methods: Reliability, validity, friendship bias, and user reaction. Journal of Applied Psychology, 66, 451-457.

*Love, K. G., & O'Hara, K. (1987). Predicting job performance of youth trainees under a job training partnership act program (JTPA): Criterion validation of a behavior-based measure of work maturity. Personnel Psychology, 40, 323-340.

*MacKenzie, S. B., Podsakoff, P. M., & Fetter, R. (1991). Organizational citizenship behavior and objective productivity as determinants of managerial evaluations of salespersons' performance. Organizational Behavior and Human Decision Processes, 50, 123-150.

*Matteson, M. T., Ivancevich, J. M., & Smith, S. V. (1984). Relation of type A behavior to performance and satisfaction among sales personnel. Journal of Vocational Behavior, 25, 203-214.

*Mayfield, E. C. (1970). Management selection: Buddy nominations revisited. Personnel Psychology, 23, 377-391.

*McCarrey, M. W., & Edwards, S. A. (1973). Organizational climate conditions for effective research scientist role performance. Organizational Behavior and Human Performance, 9, 439-459.

*McCauley, C. D., Lombardo, M. M., & Usher, C. J. (1989). Diagnosing management development needs: An instrument based on how managers develop. Journal of Management, 15, 389-403.

McCloy, R. A., Campbell, J. P., & Cudeck, R. (1994). A confirmatory test of a model of performance determinants. Journal of Applied Psychology, 79, 493-505.

McDaniel, M. A., Whetzel, D. L., Schmidt, F. L., & Maurer, S. D. (1994). The validity of employment interviews: A comprehensive review and meta-analysis. Journal of Applied Psychology, 79, 599-616.

*McEvoy, G. M., & Beatty, R. W. (1989). Assessment centers and subordinate appraisals of managers: A seven-year examination of predictive validity. Personnel Psychology, 42, 37-52.

*Meglino, B. M., Ravlin, E. C., & Adkins, C. L. (1989). A work values approach to corporate culture: A field test of the value congruence process and its relationship to individual outcomes. Journal of Applied Psychology, 74, 424-432.

*Meredith, G. M. (1990). Dossier evaluation in screening candidates for excellence in teaching awards. Psychological Reports, 67, 879-882.

*Meyer, J. P., Paunonen, S. V., Gellatly, I. R., Goffin, R. D., & Jackson, D. N. (1989). Organizational commitment and job performance: It's the nature of the commitment that counts. Journal of Applied Psychology, 74, 152-156.

*Miner, J. B. (1970). Executive and personnel interviews as predictors of consulting success. Personnel Psychology, 23, 521-538.

*Miner, J. B. (1970). Psychological evaluations as predictors of consulting success. Personnel Psychology, 23, 393-405.

*Mitchell, T. R., & Albright, D. W. (1972). Expectancy theory predictions of the satisfaction, effort, performance, and retention of naval aviation officers. Organizational Behavior and Human Performance, 8, 1-20.

*Morgan, R. B. (1993). Self- and co-worker perceptions of ethics and their relationships to leadership and salary. Academy of Management Journal, 36, 200-214.

*Morse, J. J., & Wagner, F. R. (1978). Measuring the process of managerial effectiveness. Academy of Management Journal, 21, 23-35.

*Mossholder, K. W., Bedeian, A. G., Norris, D. R., Giles, W. F., & Feild, H. S. (1988). Job performance and turnover decisions: Two field studies. Journal of Management, 14, 403-414.

*Motowidlo, S. J. (1982). Relationship between self-rated performance and pay satisfaction among sales representatives. Journal of Applied Psychology, 67, 209-213.

*Mount, M. K. (1984). Psychometric properties of subordinate ratings of managerial performance. Personnel Psychology, 37, 687-702.

*Nathan, B. R., Mohrman, A. M., Jr., & Milliman, J. (1991). Interpersonal relations as a context for the effects of appraisal interviews on performance and satisfaction: A longitudinal study. Academy of Management Journal, 34, 352-369.


*Nealey, S. M., & Owen, T. W. (1970). A multitrait-multimethod analysis of predictors and criteria of nursing performance. Organizational Behavior and Human Performance, 5, 348-365.

*Niehoff, B. P., & Moorman, R. H. (1993). Justice as a mediator of the relationship between methods of monitoring and organizational citizenship behavior. Academy of Management Journal, 36, 527-556.

*Noe, R. A., & Schmitt, N. (1986). The influence of trainee attitudes on training effectiveness: Test of a model. Personnel Psychology, 39, 497-523.

*Norris, D. R., & Niebuhr, R. E. (1984). Organization tenure as a moderator of the job satisfaction-job performance relationship. Journal of Vocational Behavior, 24, 169-178.

Nunnally, J. C. (1978). Psychometric theory. New York: McGraw-Hill.

*O'Connor, E. J., Peters, L. H., Pooyan, A., Weekley, J., Frank, B., & Erenkrantz, B. (1984). Situational constraint effects on performance, affective reactions, and turnover: A field replication and extension. Journal of Applied Psychology, 69, 663-672.

*Oldham, G. R. (1976). The motivational strategies used by supervisors: Relationships to effectiveness indicators. Organizational Behavior and Human Performance, 15, 66-86.

Ones, D. S., & Viswesvaran, C. (in press). Bandwidth-fidelity dilemma in personality measurement for personnel selection. Journal of Organizational Behavior.

*Organ, D. W., & Konovsky, M. (1989). Cognitive versus affective determinants of organizational citizenship behavior. Journal of Applied Psychology, 74, 157-164.

*Otten, M. W., & Kahn, M. (1975). Effectiveness of crisis center volunteers and the personal orientation inventory. Psychological Reports, 37, 1107-1111.

*Parker, J. W., Taylor, E. K., Barrett, R. S., & Martens, L. (1959). Rating scale content: III. Relationship between supervisory- and self-ratings. Personnel Psychology, 12, 49-63.

*Parsons, C. K., Herold, D. M., & Leatherwood, M. L. (1985). Turnover during initial employment: A longitudinal study of the role of causal attributions. Journal of Applied Psychology, 70, 337-341.

*Penley, L. E., & Hawkins, B. L. (1980). Organizational communication, performance, and job satisfaction as a function of ethnicity and sex. Journal of Vocational Behavior, 16, 368-384.

Peterson, R. A. (1994). A meta-analysis of Cronbach's coefficient alpha. Journal of Consumer Research, 21, 381-391.

*Podsakoff, P. M., Niehoff, B. P., MacKenzie, S. B., & Williams, M. L. (1993). Do substitutes for leadership really substitute for leadership? An empirical examination of Kerr and Jermier's situational leadership model. Organizational Behavior and Human Decision Processes, 54, 1-44.

*Podsakoff, P. M., Todor, W. D., & Skov, R. (1982). Effects of leader contingent and noncontingent reward and punishment behaviors on subordinate performance and satisfaction. Academy of Management Journal, 25, 810-821.

*Prien, E. P., & Kult, M. (1968). Analysis of performance criteria and comparison of a priori and empirically-derived keys for a forced-choice scoring. Personnel Psychology, 21, 505-513.

*Prien, E. P., & Liske, R. E. (1962). Assessments of higher level personnel: III. Rating criteria: A comparative analysis of supervisor ratings and incumbent self-ratings of job performance. Personnel Psychology, 15, 187-194.

*Puffer, S. M. (1987). Prosocial behavior, noncompliant behavior, and work performance among commission salespeople. Journal of Applied Psychology, 72, 615-621.

*Pulakos, E. D., Borman, W. C., & Hough, L. M. (1988). Test validation for scientific understanding: Two demonstrations of an approach to studying predictor-criterion linkages. Personnel Psychology, 41, 703-716.

*Pulakos, E. D., & Wexley, K. N. (1983). The relationship among perceptual similarity, sex, and performance ratings in manager-subordinate dyads. Academy of Management Journal, 26, 129-139.

*Pym, D. L. A., & Auld, H. D. (1965). The self-rating as a measure of employee satisfactoriness. Occupational Psychology, 39, 103-113.

*Rabinowitz, S., & Stumpf, S. A. (1987). Facets of role conflict, role-specific performance, and organizational level within the academic career. Journal of Vocational Behavior, 30, 72-83.

*Ronan, W. W. (1963). A factor analysis of eleven job performance measures. Personnel Psychology, 16, 255-267.

*Rosinger, G., Myers, L. B., Levy, G. W., Loar, M., Mohrman, S. A., & Stock, J. R. (1982). Development of a behaviorally based performance appraisal system. Personnel Psychology, 35, 75-88.

*Ross, P. F., & Dunfield, N. M. (1964). Selecting salesmen for an oil company. Personnel Psychology, 17, 75-84.

*Rosse, J. G. (1987). Job-related ability and turnover. Journal of Business and Psychology, 1, 326-336.

*Rosse, J. G., & Kraut, A. I. (1983). Reconsidering the vertical dyad linkage model of leadership. Journal of Occupational Psychology, 56, 63-71.

*Rosse, J. G., Miller, H. E., & Barnes, L. K. (1991). Combining personality and cognitive ability predictors for hiring service-oriented employees. Journal of Business and Psychology, 5, 431-445.

*Rothstein, H. R. (1990). Interrater reliability of job performance ratings: Growth to asymptote level with increasing opportunity to observe. Journal of Applied Psychology, 75, 322-327.

*Rothstein, H. R., Schmidt, F. L., Erwin, F. W., Owens, W. A., & Sparks, C. P. (1990). Biographical data in employment selection: Can validities be made generalizable? Journal of Applied Psychology, 75, 175-184.

*Rousseau, D. M. (1978). Relationship of work to nonwork. Journal of Applied Psychology, 63, 513-517.

*Rush, C. H., Jr. (1953). A factorial study of sales criteria. Personnel Psychology, 6, 9-24.

*Russell, C. J. (1990). Selecting top corporate leaders: An example of biographical information. Journal of Management, 16, 73-86.

*Russell, C. J., Mattson, J., Devlin, S. E., & Atwater, D. (1990). Predictive validity of biodata items generated from retrospective life experience essays. Journal of Applied Psychology, 75, 569-580.

*Sackett, P. R., Zedeck, S., & Fogli, L. (1988). Relations between measures of typical and maximum job performance. Journal of Applied Psychology, 73, 482-486.

Salancik, G. R., & Pfeffer, J. (1978). A social information processing approach to job attitudes and task design. Administrative Science Quarterly, 23, 224-253.

*Schaubroeck, J., Ganster, D. C., Sime, W. E., & Ditman, D. (1993). A field experiment testing supervisory role clarification. Personnel Psychology, 46, 1-25.

*Schippmann, J. S., & Prien, E. P. (1986). Psychometric evaluation of an integrated assessment procedure. Psychological Reports, 59, 111-122.

Schmidt, F. L., & Hunter, J. E. (1992). Development of a causal model of processes determining job performance. Current Directions in Psychological Science, 1, 89-92.

Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199-223.

Schmidt, F. L., Ones, D. S., & Hunter, J. E. (1992). Personnel selection. In M. R. Rosenzweig & L. W. Porter (Eds.), Annual review of psychology (pp. 627-670). Palo Alto, CA: Annual Reviews.

*Schuerger, J. M., Kochevar, K. F., & Reinwald, J. E. (1982). Male and female correction officers: Personality and rated performance. Psychological Reports, 51, 223-228.

Scullen, S. E., Mount, M. K., & Sytysma, M. R. (1995). Comparison of self, peer, direct report and boss ratings of managers' performance. Unpublished manuscript.

*Seybolt, J. W., & Pavett, C. M. (1979). The prediction of effort and performance among hospital professionals: Moderating effects of feedback on expectancy theory formulations. Journal of Occupational Psychology, 52, 91-105.

*Siegel, A. I., Schultz, D. G., Fischl, M. A., & Lanterman, R. S. (1968). Absolute scaling of job performance. Journal of Applied Psychology, 52, 313-318.

*Siegel, L. (1982). Paired comparison evaluations of managerial effectiveness by peers and supervisors. Personnel Psychology, 35, 843-852.

*Slocum, J. W., Jr., & Cron, W. L. (1985). Job attitudes and performance during three career stages. Journal of Vocational Behavior, 26, 126-145.

*Smircich, L., & Chesser, R. J. (1981). Superiors' and subordinates' perceptions of performance: Beyond disagreement. Academy of Management Journal, 24, 198-205.

*Sneath, F. A., White, G. C., & Randell, G. A. (1966). Validating a workshop reporting procedure. Occupational Psychology, 40, 15-29.

*Soar, R. S. (1956). Personal history as a predictor of success in service station management. Journal of Applied Psychology, 40, 383-385.

*South, J. C. (1974). Early career performance of engineers: Its composition and measurement. Personnel Psychology, 27, 225-243.

*Spector, P. E., Dwyer, D. J., & Jex, S. M. (1988). Relation of job stressors to affective, health, and performance outcomes: A comparison of multiple data sources. Journal of Applied Psychology, 73, 11-19.

*Spencer, D. G., & Steers, R. M. (1981). Performance as a moderator of the job satisfaction-turnover relationship. Journal of Applied Psychology, 66, 511-514.

*Spitzer, M. E., & McNamara, W. J. (1964). A managerial selection study. Personnel Psychology, 17, 19-40.

*Sprecher, T. B. (1959). A study of engineers' criteria for creativity. Journal of Applied Psychology, 43, 141-148.

*Springer, D. (1953). Ratings of candidates for promotion by co-workers and supervisors. Journal of Applied Psychology, 37, 347-351.

*Steel, R. P., & Mento, A. J. (1986). Impact of situational constraints on subjective and objective criteria of managerial job performance. Organizational Behavior and Human Decision Processes, 37, 254-265.

*Steel, R. P., Mento, A. J., & Hendrix, W. H. (1987). Constraining forces and the work performance of finance company cashiers. Journal of Management, 13, 473-482.

*Steel, R. P., & Ovalle, N. K. (1984). Self-appraisal based upon supervisory feedback. Personnel Psychology, 37, 667-685.

*Steel, R. P., Shane, G. S., & Kennedy, K. A. (1990). Effects of social-system factors on absenteeism, turnover, and job performance. Journal of Business and Psychology, 4, 423-430.

*Stout, S. K., Slocum, J. W., Jr., & Cron, W. L. (1987). Career transitions of superiors and subordinates. Journal of Vocational Behavior, 30, 124-137.

Stuit, D. B., & Wilson, J. T. (1946). The effect of an increasingly well defined criterion on the prediction of success at naval training school (tactical radar). Journal of Applied Psychology, 30, 614-623.

*Stumpf, S. A. (1981). Career roles, psychological success, and job attitudes. Journal of Vocational Behavior, 19, 98-112.

*Stumpf, S. A., & Rabinowitz, S. (1981). Career stage as a moderator of performance relationships with facets of job satisfaction and role perceptions. Journal of Vocational Behavior, 18, 202-218.

*Sulkin, H. A., & Pranis, R. W. (1967). Comparison of grievants with non-grievants in a heavy machinery company. Personnel Psychology, 20, 111-119.

*Swaroff, P. G., Barclay, L. A., & Bass, A. R. (1985). Recruiting sources: Another look. Journal of Applied Psychology, 70, 720-728.

*Szilagyi, A. D. (1980). Causal inferences between leader reward behaviour and subordinate performance, absenteeism, and work satisfaction. Journal of Occupational Psychology, 53, 195-204.

*Taylor, E. K., Schneider, D. E., & Symons, N. A. (1953). A short forced-choice evaluation form for salesmen. Personnel Psychology, 6, 393-401.

Taylor, R. L., & Wilsted, W. D. (1974). Capturing judgment policies: A field study of performance appraisal. Academy of Management Journal, 17, 440-449.

Taylor, R. L., & Wilsted, W. D. (1976). Capturing judgment policies in performance rating. Industrial Relations, 15, 216-224.

Taylor, S. M., & Schmidt, D. W. (1983). A process-oriented investigation of recruitment source effectiveness. Personnel Psychology, 36, 343-354.

Tenopyr, M. L. (1969). The comparative validity of selected leadership scales relative to success in production management. Personnel Psychology, 22, 77-85.

Thompson, D. E., & Thompson, T. A. (1985). Task-based performance appraisal for blue-collar jobs: Evaluation of race and sex effects. Journal of Applied Psychology, 70, 747-753.

"Thomson, H. A. (1970). Comparison of predictor and crite-rion judgments of managerial performance using themultitrait-multimethod approach. Journal of Applied Psy-chology, 54, 496-502.

Toops, H. A. (1944). The criterion. Educational and Psycho-logical Measurement, 4, 271-297.

Tsui, A. S., & Ohlott, P. (1988). Multiple assessment of man-agerial effectiveness: Interrater agreement and consensus ineffectiveness models. Personnel Psychology, 41, 779-803.

Tucker, M. F., Cline, V. B., & Schmitt, J. R. (1967). Predictionof creativity and other performance measures from biograph-ical information among pharmaceutical scientists. Journal ofApplied Psychology, 51, 131-138.

Turner, W. W. (1960). Dimensions of foreman performance:A factor analysis of criterion measures. Journal of AppliedPsychology, 44, 216-223.

"Validity information exchange. (1954). No. 7-045. PersonnelPsychology, 7, 279.

"Validity information exchange. (1954). No. 7-089. PersonnelPsychology, 7, 565-566.

"Validity information exchange. (1956). No. 9-32. PersonnelPsychology, 9, 375-377.

"Validity information exchange. (1958). No. 11-10. PersonnelPsychology.il, 121-123.

"Validity information exchange. (1958). No. 11-27. PersonnelPsychology, 11, 583-584.

"Validity information exchange. (1958). No. 11-30. PersonnelPsychology, 11, 587-590.

"Validity information exchange. (1960). No. 13-03. PersonnelPsychology, 13, 449-450.

"Validity information exchange. (1963). No. 16-04. PersonnelPsychology, 16, 181-183.

"Validity information exchange. (1963). No. 16-05. PersonnelPsychology, 16, 283-288.

"Vecchio, R. P. (1987). Situational leadership theory: An ex-amination of a prescriptive theory. Journal of Applied Psy-chology, 72,444-451.

"Vecchio, R. P., & Gobdel, B. C. (1984). The vertical dyad link-age model of leadership: Problems and prospects. Organiza-tional Behavior and Human Performance, 34, 5-20.

"Villanova, P., & Bernardin, J. H. (1990). Work behavior cor-relates of interviewer job compatibility. Journal of Businessand Psychology, 5, 179-195.

Viswesvaran, C. (1993). Modeling job performance: Is there ageneral factor? Unpublished doctoral dissertation, Universityof Iowa, Iowa City.

Viswesvaran, C., & Ones, D. S. (1995). Theory testing: Com-bining psychometric meta-analysis and structural equationsmodeling. Personnel Psychology, 48, 865-887.

"Waldman, D. A., Yammarino, F. J., & Avolio, B. J. (1990).A multiple level investigation of personnel ratings. PersonnelPsychology, ¥5,811-835.

"Wanous, J. P., Stumpf, S. A., & Bedrosian, H. (1979). Job

survival of new employees. Personnel Psychology, 32, 651-662.

"Wayne, S. J., & Ferris, G. R. (1990). Influence tactics, affect,and exchange quality in supervisor-subordinate interactions:A laboratory experiment and field study. Journal of AppliedPsychology, 75, 487-499.

"Wernimont, P. F, & Kirchner, W. K. (1972). Practical prob-lems in the revalidation of tests. Occupational Psychology, 46,25-30.

"Wexley, K. N., Alexander, R. A., Greenawalt, J. P., & Couch,M. A. (1980). Attitudinal congruence and similarity as re-lated to interpersonal evaluations in manager-subordinatedyads. Academy of Management Journal, 23, 320-330.

"Wexley, K. N., & Pulakos, E. D. (1982). Sex effects on perfor-mance ratings in manager-subordinate dyads: A field study.Journal of Applied Psychology, 67,433-439.

"Wexley, K. N., & Youtz, M. A. (1985). Rater beliefs aboutothers: Their effects on rating errors and rater accuracy.Journal of Occupational Psychology, 58, 265-275.

"Williams, C. R., Labig, C. E., Jr., & Stone, T. H. (1993). Re-cruitment sources and posthire outcomes for job applicantsand new hires: A test of two hypotheses. Journal of AppliedPsychology, 78, 163-172.

"Williams, L. J., & Anderson, S. E. (1991). Job satisfaction andorganizational commitment as predictors of organizationalcitizenship and in-role behaviors. Journal of Management,77,601-617.

"Williams, W. E., & Seiler, D. A. (1973). Relationship betweenmeasures of effort and job performance. Journal of AppliedPsychology, 57, 49-54.

Wohlers, A. J., & London, M. (1989). Ratings of managerialcharacteristics: Evaluation difficulty, co-worker agreement,and self-awareness. Personnel Psychology, 42, 235-261.

"Woodmansee, J. J. (1978). Validation of the nurturance scaleof the Edwards Personal Preference Schedule. PsychologicalReports, 42, 495-498.

"Worbois, G. M. (1975). Validation of externally developed as-sessment procedures for identification of supervisory poten-tial. Personnel Psychology, 28, 77-91.

"Yammarino, F. J., & Dubinsky, A. J. (1990). Salesperson per-formance and managerially controllable factors: An investi-gation of individual and work group effects. Journal of Man-agement, 16, 87-106.

"Yukl, G. A., & Latham, G. P. (1978). Interrelationshipsamong employee participation, individual differences, goaldifficulty , goal acceptance, goal instrumentality, and perfor-mance. Personnel Psychology, 31, 305-323.

"Zedeck, S., & Baker, H. T. (1972). Nursing performance asmeasured by behavioral expectation scales: A multitrait-multirater analysis. Organizational Behavior and Human De-cision Processes, 7, 457-466.

Received October 10, 1995
Revision received March 29, 1996
Accepted April 22, 1996

