ORIGINAL PAPER

The Performance Appraisal Milieu: A Multilevel Analysis of Context Effects in Performance Ratings

J. Kemp Ellington¹ • Mark A. Wilson²

Published online: 10 February 2016

© Springer Science+Business Media New York 2016

Abstract

Purpose The purpose of this study was to take an inductive approach in examining the extent to which organizational contexts represent significant sources of variance in supervisor performance ratings, and to explore various factors that may explain contextual rating variability.

Design/Methodology/Approach Using archival field performance rating data from a large state law enforcement organization, we used a multilevel modeling approach to partition the variance in ratings due to ratees, raters, and rating contexts.

Findings Results suggest that much of what may often be interpreted as idiosyncratic rater variance may actually reflect systematic rating variability across contexts. In addition, performance-related and non-performance factors, including contextual rating tendencies, accounted for significant rating variability.

Implications Supervisor ratings represent the most common approach for measuring job performance, and understanding the nature and sources of rating variability is important for research and practice. Given the many uses of performance rating data, our findings suggest that continuing to identify contextual sources of variability is particularly important for addressing criterion problems and improving ratings as a form of performance measurement.

Originality/Value Numerous performance appraisal models suggest the importance of context; however, previous research had not partitioned the variance in supervisor ratings due to omnibus context effects in organizational settings. The use of a multilevel modeling approach allowed the examination of contextual influences while controlling for ratee and rater characteristics.

Keywords Job performance · Performance appraisal · Performance ratings · Multilevel · Rater effects · Context effects

Introduction

Job performance is considered one of the most important variables in organizational research and practice (Bennett et al. 2006; Borman 2004), yet performance “criterion problems” are well documented (Austin and Crespin 2006; Austin and Villanova 1992). The most common method for measuring performance is a supervisory rating, and an extensive literature documents the many issues associated with ratings of performance (e.g., Landy and Farr 1980; Murphy 2008; Woehr and Roch 2012). In particular, research suggests that although ratings reflect actual ratee performance to a degree, they also reflect systematic rater effects (as well as measurement error). For example, several studies have examined the structure of multisource performance ratings (MSPR), and found relatively large idiosyncratic rater effects (Hoffman et al. 2010; Mount et al. 1998; Scullen et al. 2000). With regard to supervisor ratings specifically, depending on the sample and methodology employed, estimates range from 43 % (Hoffman et al. 2010; Scullen et al. 2000) to as much as 58 % (O’Neill et al. 2012) of performance rating variance which is idiosyncratic to the rater.

J. Kemp Ellington: ellingtonjk@appstate.edu

Mark A. Wilson: mark_wilson@ncsu.edu

1 Department of Management, Walker College of Business, Appalachian State University, 4090 Peacock Hall, Boone, NC 28608, USA

2 Department of Psychology, North Carolina State University, Box 7801, Poe Hall, Raleigh, NC 27695, USA

J Bus Psychol (2017) 32:87–100

DOI 10.1007/s10869-016-9437-x

Given this evidence regarding the presence of rather large rater effects in performance ratings, one implication is that a potential solution to the criterion problem is to identify the factors that drive these effects, so that steps may be taken to lessen their impact (Murphy 2008). Consequently, researchers have investigated and identified a variety of rater and situational characteristics that influence ratings (for thorough reviews, see Landy and Farr 1980; Levy and Williams 2004; Murphy 2008; Murphy and Cleveland 1995). However, pertinent questions remain regarding the nature and sources of performance rating variance. More specifically, although both theory and existing research suggest that context plays a significant role in performance appraisal (Ilgen and Feldman 1983; Judge and Ferris 1993; Levy and Williams 2004; Murphy and Cleveland 1995), it is currently unclear to what extent context may be a systematic source of variance in ratings. Raters are nested within rating contexts, and as noted by Murphy and DeShon (2000), what is often viewed as idiosyncratic rater variance is more likely “a combination of the effects of rater characteristics and the effects of the context in which the rater operates” (p. 879).

In studying contextual effects in performance appraisal, it is important to first define what is meant by the term “context.” Here, we conceptualize context similarly to Johns (2006), as “situational opportunities and constraints that affect the occurrence and meaning of organizational behavior as well as functional relationships between variables” (p. 386). In addition, context can be defined broadly as omnibus context (i.e., who, where, when, and why), as well as in terms of the discrete contextual characteristics of the social, task, and physical environment (Hattrup and Jackson 1996; Johns 2006; Mowday and Sutton 1993). With regard to the context of performance appraisal specifically, Murphy and Cleveland (1995) call for research on “levels of context,” and note that several intra-organizational units may be salient. Organizational units and work groups (e.g., divisions, departments, offices, stores) can be viewed as omnibus contexts, particularly when they are distinctive with respect to their discrete contextual features (i.e., social, task, and physical characteristics). If supervisors do indeed vary systematically in their ratings across these organizational contexts, this not only raises further concerns regarding the validity of supervisor ratings (Murphy 2008; O’Neill et al. 2012), but also has implications for performance appraisal research and practice.

With these issues in mind, the current research sought to determine the extent to which work contexts account for variance in supervisory task performance ratings, and to explore characteristics that are potentially responsible for this variation. To address these goals, we take an inductive approach in investigating sources of performance rating variance, and incorporate a multilevel modeling methodology in analyzing archival field data from a large state law enforcement agency. In examining the influence of ratees, raters, and rating contexts, we first partition the variability in supervisory performance ratings due to each source, operationalizing omnibus rating contexts (Johns 2006) as distinct organizational units. Second, in order to more thoroughly study the nature of the rating variability associated with each source, we include both performance-related and non-performance factors as ratee- and rater-level control variables. This not only provides critical information regarding the degree to which these variables account for both rater and contextual variance, but also gives an estimate of the remaining variability associated with each source that is yet to be explained. Finally, we also investigate several discrete contextual characteristics as potential predictors of between-context rating variance.

Multilevel Model of Supervisory Rating Variance

The majority of research to date examining sources of

variance in performance ratings has incorporated a con-

firmatory factor analytic (CFA) approach, and has focused

on MSPRs (Hoffman et al. 2010; Mount et al. 1998;

Scullen et al. 2000). However, linear mixed models

(LMM), including the more specific case of multilevel

random coefficient (MRC) models, have also been pro-

posed as an alternative approach for decomposing rating

variance, which may offer certain advantages (LaHuis and

Avis 2007; O’Neill et al. 2012; Putka et al. 2008). For

example, O’Neill et al. (2012) recently applied the MRC

modeling approach to partition the variance in perfor-

mance ratings due to ratees, raters, and rater–ratee inter-

actions, and found substantial rater effects (i.e., 58 %), as

well as influential predictors such as familiarity with the

ratee and the number of ratees evaluated. Despite the

potential benefits of the approach, MRC models have not

yet been extensively applied in the case of fully “nested”

rating systems in field settings (for an exception see

LaHuis and Avis 2007). As mentioned previously,

supervisor performance ratings are the most common

method for measuring job performance (Murphy 2008;

Woehr and Roch 2012), and in most cases each supervisor

evaluates a unique group of ratees (i.e., the employees

within their span of control). Ratees can therefore be

viewed as nested within raters, and both ratees and raters

are often nested within work contexts such as organiza-

tional units, thereby creating a hierarchical or multilevel

data structure.
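
The nesting described above (ratees within raters within contexts) can be written as an unconditional three-level model. The notation below is a conventional sketch, not the authors' own; every symbol is an assumption introduced here for illustration.

```latex
% Rating y for ratee i, evaluated by rater j, in context k
y_{ijk} = \gamma_{000} + v_{00k} + u_{0jk} + e_{ijk},
\qquad
v_{00k} \sim N(0, \tau_{\text{context}}),\quad
u_{0jk} \sim N(0, \tau_{\text{rater}}),\quad
e_{ijk} \sim N(0, \sigma^{2})

% Share of total rating variance attributable to contexts and to raters
\rho_{\text{context}} =
  \frac{\tau_{\text{context}}}{\tau_{\text{context}} + \tau_{\text{rater}} + \sigma^{2}},
\qquad
\rho_{\text{rater}} =
  \frac{\tau_{\text{rater}}}{\tau_{\text{context}} + \tau_{\text{rater}} + \sigma^{2}}
```

Under this formulation, the residual term absorbs ratee differences together with measurement error, because each ratee is rated once by a single supervisor.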

Initial Partitioning of Rater and Contextual

Variance

Before examining potential explanatory variables at each

level of analysis, it is necessary to first partition the variability

due to groups/clusters, to provide a preliminary estimate

of the rating variance associated with raters and

work contexts. With regard to rater variance, numerous

models of the performance appraisal process suggest that

rater characteristics, tendencies, biases, goals, and/or

intentions are likely to result in the presence of rater effects

in job performance ratings (DeCotiis and Petit 1978;

DeNisi et al. 1984; Ilgen and Feldman 1983; Judge and

Ferris 1993; Landy and Farr 1980; Levy and Williams

2004; Murphy and Cleveland 1995; Spence and Keeping

2013; Wherry and Bartlett 1982). And, as discussed pre-

viously, empirical support for this proposition is well

established, with several studies demonstrating relatively

large rater effects in performance ratings (Hoffman et al.

2010; LaHuis and Avis 2007; Mount et al. 1998; O’Neill

et al. 2012; Scullen et al. 2000).

In addition to variability between raters (within con-

texts), there are also reasons to expect systematic rating

differences across work environments. From a theoretical

standpoint, numerous scholars in performance appraisal

have proposed that ratings must be considered in context,

and that both proximal and distal contextual influences are

likely to shape rating behaviors (Ilgen and Feldman 1983;

Judge and Ferris 1993; Levy and Williams 2004; Murphy

and Cleveland 1995). Furthermore, empirical research has

identified several specific situational variables that are

influential in performance appraisal (e.g., rating purpose,

rater accountability, climate; Greguras et al. 2003; Jawahar

and Williams 1997; Mero et al. 2003; Murphy et al. 2003).

Although researchers to date have not attempted to parti-

tion the variance in supervisory ratings due to omnibus

context effects while controlling for ratee and rater char-

acteristics, several other related empirical findings also

suggest the likely importance of work context sources of

rating variability. For example, Dierdorff and Surface

(2007) examined sources of variance in peer ratings, and

found significant rating variability associated with the sit-

uations (i.e., defined as distinct training exercises) in which

peers performed and evaluated one another. Although these

contexts are different in many respects than organizational

units, importantly the performance situations varied in

terms of environmental cues, required tasks, as well as

normative expectations (Dierdorff and Surface 2007).

With regard to the rating contexts created by organiza-

tional units, Waldman et al. (1990) examined supervisor

performance ratings collected as part of a training needs

analysis, and found significant rating differences between

organizational “departments” for several dimensions of

performance. An average of 10 % of variance was asso-

ciated with departments (Waldman et al. 1990), providing

some evidence that supervisors’ rating behavior may be

influenced by the intra-organizational contexts in which

they work. Moreover, variability across organizational

units may be even greater with appraisals that have been in

use for longer durations, and that are formal components of

organizations’ performance management systems (e.g., as

opposed to one-time appraisals for assessing training

needs). For example, Ilgen and Feldman (1983) noted that

rating norms are more likely to develop once an appraisal

system has been in place for some time. In this study, we

focus specifically on examining an appraisal that has been

in place in a field setting for a considerable duration (i.e.,

over 10 years).

Accounting for Ratee- and Rater-Level

Characteristics

Although MRC models are well suited to nested perfor-

mance rating data structures, it should be noted that the

approach assumes that ratees are comparable across raters

or higher level groups such as contexts, and if this

assumption is violated (as is likely the case with field data)

then efforts should be made to control for ratee characteristics

(LaHuis and Avis 2007). Moreover, LaHuis and Avis (2007)

note that “a major advantage of MRC modeling is the ability

to study how the attributes of raters influence their ratings

while controlling for ratee characteristics

[italics added]” (p. 98). In the case of multilevel

models incorporating additional hierarchical levels such as

organizational units, this also allows the ability to examine

contextual influences while holding both ratee- and rater-

level characteristics constant. In other words, it is possible

that rater and contextual variance merely reflects that some

supervisors and organizational units have better, more

experienced subordinates. In addition, other ratee- and

rater-level characteristics may also vary across supervisors

and units, and thus, partially explain rating differences

across raters and contexts.

With MRC models, variables entered at lower levels

(e.g., ratees or raters) can explain variability at the level of

entry, as well as at higher levels (e.g., raters and contexts),

to the extent that the lower-level predictors vary system-

atically across the higher level groups (note that this is only

true when the variables are either scaled in their raw metric

or centered around their grand mean). If rater or context

effects are primarily a function of differences in ratee

objective performance outcomes, ratee job tenure, super-

visor rating tendencies, or other ratee- and rater-level fac-

tors, this would be evidenced by a non-significant rater or

context variance component after controlling for these

variables. The inclusion of ratee- and rater-level variables

therefore serves dual roles, to explain the rating variability

within each level, and to more accurately estimate the total

variance accounted for at higher levels. Accordingly, a

sequence of models was estimated that included characteristics

at each level, and at each stage the variance

components were tested to determine whether significant

rating variance due to raters and contexts remained.
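
The centering caveat above can be illustrated with a small sketch. The data and function names are hypothetical and are not taken from the study. Grand-mean centering shifts every score by the same constant, so differences between raters' group means survive and the predictor can still explain rater-level variance. Group-mean centering removes those differences.

```python
from statistics import mean

# Hypothetical ratee tenure (months) grouped by rater; values are illustrative only.
tenure_by_rater = {
    "rater_a": [24.0, 36.0, 48.0],
    "rater_b": [96.0, 108.0, 120.0],
}


def grand_mean_center(groups):
    """Subtract the overall mean from every score (between-group differences are kept)."""
    grand = mean([x for xs in groups.values() for x in xs])
    return {g: [x - grand for x in xs] for g, xs in groups.items()}


def group_mean_center(groups):
    """Subtract each group's own mean (between-group differences are removed)."""
    return {g: [x - mean(xs) for x in xs] for g, xs in groups.items()}


def group_means(groups):
    """Mean of each group's scores."""
    return {g: mean(xs) for g, xs in groups.items()}


gmc_means = group_means(grand_mean_center(tenure_by_rater))
cwc_means = group_means(group_mean_center(tenure_by_rater))
print(gmc_means)  # group means still differ after grand-mean centering
print(cwc_means)  # group means are all zero after group-mean centering
```

With these made-up numbers, the grand-mean-centered group means remain 72 months apart, while the group-mean-centered means are identical.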

Ratee-Level Variables

After the initial rating variance partitioning, ratee variables

that are likely to be predictive of performance differences

were first added to the model. In particular, at least to a

degree, objective measures of ratee effectiveness or per-

formance outcomes (Campbell et al. 1993) should reflect

some actual performance variability, and hence are likely

to be associated with supervisor ratings (Bommer et al.

1995; Deadrick and Gardner 1997; Heneman 1986; Reb

and Cropanzano 2007; Reb and Greguras 2010). In addi-

tion, to the extent that it conveys information about expe-

rience, variability in job tenure is also likely to be related to

ratee performance differences, and therefore should be

associated with ratings (McDaniel et al. 1988; Schmidt and

Hunter 1998). Importantly, the remaining variability can

therefore be interpreted as rating variance due to raters and

contexts (and ratees), after controlling for differences in

ratee objective performance outcomes and job tenure.

In addition, contextual factors at the ratee level may not

only explain rating variance across ratees, but may also

account for variability across raters and/or organizational

units. Numerous studies of performance appraisal indicate

the importance of observation, suggesting that the extent to

which raters have had opportunities to observe ratee per-

formance is likely to impact their appraisals (Ilgen et al.

1993; Kingstrom and Mainstone 1985; Kozlowski et al.

1986; O’Neill et al. 2012). In addition, the performance

appraisal “purpose effect” is also well documented, with

ratees tending to receive higher ratings when there are

administrative consequences, versus when ratings are

assigned for developmental or research purposes (Jawahar

and Williams 1997). In particular, the use or purpose of the

appraisal is believed to influence rater intentions and goals

(Murphy 2008; Murphy and Cleveland 1995; Spence and

Keeping 2013). Although in many applied settings the

rating purpose may be consistent across all ratees, in cer-

tain cases of mixed-use appraisals this contextual factor

varies at the ratee level. For example, ratings can be largely

developmental for some employees, while having admin-

istrative implications for others, such as those being con-

sidered for promotion. Therefore, we controlled for both

the number of documented performance incidents (a proxy

for the number of observations) as well as the rating pur-

pose, in order to determine the extent to which these ratee-

level situational variables are responsible for ratee, rater,

and work context rating variability.

Rater-Level Variables

Numerous supervisor characteristics may also explain rater

variance in performance ratings, and potentially contextual

variation as well. One of the most commonly suggested

sources of rater effects includes differences in the

idiosyncratic rating tendencies of the raters. More specifi-

cally, rater tendencies for leniency/severity are believed to

be pervasive concerns in applied settings (Hauenstein

1992; Murphy and Cleveland 1995; Scullen et al. 2000).

Furthermore, there is also evidence that leniency is a rel-

atively stable rater tendency (Kane et al. 1995). In addition,

raters also have varying degrees of supervisory job tenure,

which is indicative of within-organization appraisal expe-

rience. Although research on rater experience is somewhat

mixed, there is empirical evidence suggesting that experi-

ence affects the quality of rating data (Landy and Farr

1980; Zalesny and Highhouse 1992). In addition, recent

research found that raters with more experience in con-

ducting performance appraisals gave lower ratings than

those with less experience (Spence and Keeping 2010).

Finally, raters also differ in terms of the number of ratees

they supervise and evaluate (i.e., span of control), and

research by O’Neill et al. (2012) indicates that the number

of ratees evaluated explains significant rating variance.

Consequently, in order to establish the degree to which

rater-level characteristics account for both rater and con-

textual rating variance, we controlled for supervisor rating

tendencies, supervisory job tenure, and span of control.

Research Question 1 (RQ1): Is there significant work

context variability in supervisor task performance

ratings after controlling for ratee- and rater-level

characteristics?

Context-Level Characteristics

In order to potentially explain the remaining rating vari-

ability across organizational units, we also explored several

context-level characteristics. As suggested previously,

intra-organizational units often represent distinct social,

task, and physical environments (Hattrup and Jackson

1996; Johns 2006; Mowday and Sutton 1993), and vari-

ables associated with each of these dimensions of discrete

context may be influential in shaping rating behaviors. First

of all, much like individual supervisors have rating ten-

dencies, such distributional tendencies may also exist at the

work context level. If a tendency exists in a given orga-

nizational unit for higher/lower ratings, then it could be

expected that subsequent appraisals in that context would

also display corresponding higher/lower mean ratings, even

with a distinct set of ratees. In other words, if there are

contextual tendencies for rating behavior, then the mean

unit/context ratings for ratees from previous performance

cycles may explain between-context variability in subse-

quent appraisals for a different group of ratees. Given the

archival nature of our study, we cannot provide a definitive

theoretical interpretation of contextual performance rating

means; however, we can speculate as to the potential

meaning of such a variable. For example, contextual rating

tendencies may represent an aspect of the social context

(Hattrup and Jackson 1996; Johns 2006; Mowday and

Sutton 1993), which could be a function of norms,

expectations, or standards for acceptable ratee behavior as

well as performance ratings.

Discrete characteristics of the task context may also

explain variability across organizational units. Even when

employees hold the same job title, it is possible that due to

their particular work context, some task activities may be

performed more/less often than others. For example, pre-

vious research indicates that elements of the task context

shape work role requirements (Dierdorff et al. 2009). In the

law enforcement organization under study here, units are

geographically dispersed, and the frequency of certain

work activities varies based on the geographic region.

More specifically, work contexts differ with respect to the

number of accidents investigated, the number of cases

made (i.e., the number of cases brought to court), and the

number of calls for service. Therefore, supervisors in a

context in which the level of these work activities is higher

may potentially weight and evaluate task performance

differently than supervisors in contexts in which these

activities occur less frequently.

The impact of the physical work context on organiza-

tional behavior has been generally understudied (Johns

2006), and has also rarely been examined in performance

appraisal research (Murphy and Cleveland 1995). However,

previous scholars have concluded that “physical

environments play a major role in facilitating and constraining

organizational action” (Elsbach and Pratt 2008,

p. 182). For instance, the physical work context has been

shown to impact technical role requirements (Dierdorff

et al. 2009), and has been suggested to constrain task

performance (Peters and O’Connor 1980). In addition to

occupying distinct physical spaces geographically, contexts

within a state law enforcement agency differ in terms of the

physical presence of an interstate highway. Consequently,

work contexts that include an interstate may differ from

contexts consisting of only rural state roads, in terms of the

types of circumstances often encountered (e.g., contact

with out-of-state travelers) and/or the frequency or

importance of specific task performance dimensions. In

other words, as discrete context dimensions are not

orthogonal (Dierdorff et al. 2009), this physical context

characteristic may shape ratings via its influence on the

task context. Therefore, in order to further explore potential

predictors of between-unit rating variation, we examined

several work context characteristics.

Research Question 2 (RQ2): Do contextual rating

tendencies, work activities, and/or the presence of an

interstate highway explain work context variability in

supervisor task performance ratings, after controlling

for ratee- and rater-level characteristics?

Methods

Participants

Performance ratings were collected from members of a

large state law enforcement agency as part of the organi-

zation’s annual performance management process, and

these archival performance records provided the data for

this study. Although all ratees had the same basic job title,

in order to ensure the most comparable ratee sample pos-

sible, ratees were excluded if their primary job responsi-

bilities were not typical patrol activities (e.g., training,

aviation, etc.). In addition, in order to calculate supervisor

and contextual rating tendencies, raters and contexts (and

their corresponding ratees) were excluded if they did not

have sufficient data from previous performance cycles.

Complete data were available for a sample of 804 ratees and

119 supervisors/raters from 58 organizational units/contexts

(i.e., “districts”). Ratees were nested within raters,

and both ratees and raters were nested within units/con-

texts, which were geographically distinct and consisted of

their own respective unit offices. The number of ratees per

supervisor ranged from 1 to 16 (M = 6.76, SD = 2.71),

and the number of supervisors per unit/context ranged from

1 to 4 (M = 2.05, SD = .78). The majority of the sample

was male (97.5 %) as well as Caucasian (84.7 %).

Procedure

The organization’s performance management process

stipulates that supervisors provide performance ratings

annually for all of their respective subordinates. Further-

more, policy dictates that supervisors document behavioral

observations of their subordinates’ performance throughout

the course of each performance cycle (i.e., 1 year). All

supervisors are provided rater training on how to record

these observations, in addition to frame-of-reference

training in performance ratings (Bernardin and Buckley

1981; Roch et al. 2012). Furthermore, refresher training

was provided annually to all supervisors. The organization

also maintains ongoing records regarding objective per-

formance data (e.g., the number of accidents investigated,

the number of cases made, etc.) by year and unit/context.

Finally, in any given year, approximately 10 % of the

ratees participate in the organization’s annual promotion

process. For those participants, the performance ratings

have a stronger administrative impact, as they must receive

a rating of average or above across specific performance

dimensions in order to be eligible to remain in considera-

tion for promotion. The ratings for the majority are more

developmental in nature, in that they have no direct impact

on promotions or raises.

Ratee-Level Measures

Task Performance Ratings

A job analysis conducted in the organization of interest

identified 10 task performance dimensions. These dimen-

sions are incorporated into the performance management

process, with supervisors documenting behavioral obser-

vations on these dimensions of performance, and providing

ratings in the organization’s annual evaluation process.

Examples of dimensions rated include “collision investigation,”

“preventative patrol,” and “arrest procedures.”

Performance dimensions are rated on a scale ranging from

1 = excellent, to 7 = well below average, which was

reverse coded in order to ease the interpretation of results.
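
The reverse coding mentioned above can be sketched as follows. The source does not give the recoding formula, so the usual "minimum plus maximum minus score" rule is assumed here, and the function name is hypothetical.

```python
SCALE_MIN, SCALE_MAX = 1, 7  # scale endpoints as described in the text


def reverse_code(rating: int) -> int:
    """Map 1 -> 7, 2 -> 6, ..., 7 -> 1 so that higher values mean better performance."""
    if not SCALE_MIN <= rating <= SCALE_MAX:
        raise ValueError(f"rating {rating} is outside {SCALE_MIN}-{SCALE_MAX}")
    # ASSUMPTION: standard linear reversal; the source does not state the formula.
    return SCALE_MIN + SCALE_MAX - rating


recoded = [reverse_code(r) for r in (1, 4, 7)]
print(recoded)
```

After recoding, a value of 7 corresponds to the original "excellent" anchor and 1 to "well below average."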

Objective Performance Outcomes

Based on organizational records regarding the number of

“cases made” per ratee, as well as the number of hours

on patrol, a variable was created reflecting the number of

cases per hour (i.e., by dividing the total number of cases

by the hours worked). Cases made consisted of a variety

of objective indicators common in law enforcement set-

tings (e.g., speeding, seatbelt, driving while impaired, and

drug violations). It should be noted that this is an

objective measure of performance “quantity,” which does

not capture performance “quality.” In addition, to

examine the potential for a curvilinear association, a

quadratic term was also calculated for objective perfor-

mance outcomes.
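
A minimal sketch of the cases-per-hour variable and its quadratic term follows. The input figures are invented for illustration. Whether the authors centered the linear term before squaring is not stated, so centering is exposed as an option and should be read as an assumption.

```python
def cases_per_hour(total_cases: float, hours_on_patrol: float) -> float:
    """Objective performance 'quantity': total cases made divided by hours worked."""
    if hours_on_patrol <= 0:
        raise ValueError("hours_on_patrol must be positive")
    return total_cases / hours_on_patrol


def with_quadratic(rates, center=True):
    """Return (linear, quadratic) pairs for each ratee.

    ASSUMPTION: centering before squaring is a common way to reduce collinearity
    between the two terms; the source does not say whether it was done.
    """
    offset = sum(rates) / len(rates) if center else 0.0
    return [(r - offset, (r - offset) ** 2) for r in rates]


# Hypothetical (cases, hours) records for three ratees.
records = [(100, 400), (300, 600), (450, 600)]
rates = [cases_per_hour(cases, hours) for cases, hours in records]
terms = with_quadratic(rates)
print(rates)
print(terms)
```

Both terms would then enter the ratee level of the model, with the squared term carrying any curvilinear association.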

Ratee Job Tenure

Ratee job tenure was operationalized based on the number

of months the individual worked for the organization. This

operationalization can therefore be interpreted as a time-

based measure (Tesluk and Jacobs 1998). As we are unable

to determine how much previous law enforcement experi-

ence participants may have had with other organizations,

our measure provides an indication of the within-organi-

zation ratee experience.

Number of Documented Incidents

As described previously, the performance management

process required supervisors to document behavioral inci-

dents of performance (corresponding with the performance

dimensions) throughout the performance cycle. The num-

ber of incidents was therefore operationalized based on the

sum total of documented performance incidents per ratee

over the performance management cycle.

Rating Purpose

In order to differentiate those with a more administrative

versus developmental appraisal purpose, a dummy-coded

variable was created. Ratees who were participating in the

organization’s annual promotion process in the same year

that the criterion performance ratings were collected were

coded “1” (i.e., a stronger administrative purpose), and all

other employees were coded “0” (i.e., a more developmental

purpose).

Rater-Level Measures

Supervisor Rating Tendency

Based on the archival performance rating records, a mean

task performance rating was calculated for all supervisors

who had provided ratings in any of the previous five per-

formance cycles. It is important to note that, similar to the

approach employed by Kane et al. (1995), any ratees who

were evaluated by a rater in the performance management

cycle used to operationalize our criterion performance data

were excluded from the calculation of that rater’s mean

tendency. This ensured that the mean value (i.e., predictor)

did not overlap in terms of the ratees who were evaluated

(i.e., criterion) by a given rater. Furthermore, any given ratee

was only included once in the calculation of a rater’s mean

tendency. The number of previous rating cycles used to

calculate supervisor mean rating tendencies ranged from 1 to

5 (M = 1.94, SD = 1.02), and the number of previous ratees

evaluated ranged from 3 to 31 (M = 10.64, SD = 5.90). The

reliability of the rater means was calculated using an intra-

class correlation coefficient (ICC; Bartko 1976; Bliese

2000). The ICC(2) for the rater mean tendencies was .73.
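
The exclusion rules for the supervisor rating tendency can be sketched as below. The record layout, identifiers, and function name are hypothetical. The source says each ratee counts only once but not which prior rating is kept, so the most recent one is used here as an assumption.

```python
from statistics import mean


def rater_tendency(prior_ratings, criterion_ratees):
    """Mean prior rating given by one rater, excluding overlap with the criterion cycle.

    prior_ratings: list of (cycle, ratee_id, rating) tuples from earlier cycles.
    criterion_ratees: set of ratee ids this rater evaluated in the criterion cycle.
    """
    latest = {}
    for cycle, ratee, rating in sorted(prior_ratings):
        if ratee in criterion_ratees:
            continue  # predictor and criterion must not share ratees
        # ASSUMPTION: when a ratee appears in several cycles, keep the latest rating.
        latest[ratee] = rating
    if not latest:
        return None  # no usable history; such raters were excluded from the study
    return mean(list(latest.values()))


# Hypothetical history for one rater.
history = [
    (2009, "r1", 5.0),
    (2010, "r1", 6.0),
    (2010, "r2", 4.0),
    (2010, "r3", 7.0),
]
tendency = rater_tendency(history, criterion_ratees={"r3"})
print(tendency)
```

The same logic applies at the unit level if ratings are pooled by context instead of by rater.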

Supervisor Job Tenure

Rater supervisory experience was operationalized as the

number of months the rater worked as a supervisor in

the organization. This measure therefore represents a

time-based measure (Tesluk and Jacobs 1998), and should

be interpreted as an indication of within-organization

supervisory experience (i.e., we cannot determine previous

supervisory experience from other organizations).

Span of Control

Similar to other studies examining the number of ratees

(LaHuis and Avis 2007; O’Neill et al. 2012), the span of

control was defined as the number of subordinates/ratees

supervised and evaluated by each supervisor/rater.

Context-Level Measures

Contextual Rating Tendency

Contextual rating tendencies were operationalized in a

manner similar to our measure of supervisor rating tenden-

cies, as the mean task performance rating in each unit/con-

text from the previous five performance cycles. Again, any

ratees who were evaluated in a given context during the cycle

used to operationalize our criterion data were excluded from

the calculation of the contextual rating tendency (i.e., mean)

for that unit. Consequently, the mean value (i.e., predictor)

did not overlap in terms of the ratees who were evaluated

(i.e., criterion) in a given context. Furthermore, any given

ratee was only included once in the calculation of a context’s

mean tendency. The majority (89 %) of units/contexts

included rating data from two or more of the five previous

performance cycles; however, several units were recently

formed, and thus only included data from the previous year.

Therefore, the number of previous rating cycles used to

calculate the contextual rating tendencies ranged from 1 to 5

(M = 4.25, SD = 1.33), and the number of previous ratees

evaluated ranged from 3 to 43 (M = 16.03, SD = 9.39). The

ICC(2) for the contextual rating tendencies was .72, indi-

cating the reliability of the group means.
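For readers unfamiliar with ICC(2), the reliability of group means can be illustrated with the one-way ANOVA formulas summarized by Bliese (2000). The sketch below is an assumption about how such a value might be computed, not a reproduction of the authors' calculation; the function name and input layout are hypothetical, and the average group size is used for k when groups are unequal.

```python
from statistics import mean


def icc_from_groups(groups):
    """Return (ICC1, ICC2) from a one-way ANOVA decomposition.

    groups: dict mapping a group id (e.g., a unit) to a list of ratings.
    """
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    n_groups = len(groups)
    n_total = len(all_values)

    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    ss_within = sum(
        (x - mean(values)) ** 2
        for values in groups.values()
        for x in values
    )
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    k = n_total / n_groups  # average group size (approximation)
    icc1 = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc2 = (ms_between - ms_within) / ms_between
    return icc1, icc2
```

With equal group sizes, ICC(2) equals the Spearman-Brown step-up of ICC(1), which is why larger groups yield more reliable means.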

Work Activity

Organizational data were obtained for three indicators of

contextual work activity: the number of accidents investi-

gated (M = 1714.86, SD = 956.90), the number of cases

made (M = 17,703, SD = 7834.24), and the number of

calls for service (M = 6798.95, SD = 2758.78). In order

to combine the indicators into a single work activity vari-

able, each indicator was first standardized, and an average

was calculated across the three standardized values.
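The composite described above can be sketched as follows. This is an illustrative example only: the function names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not state which denominator was used.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize a list of values to mean 0 and (sample) SD 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def work_activity_composite(accidents, cases, calls):
    """Average the three standardized indicators for each unit.

    Each argument is a list with one value per unit, in the same order.
    """
    standardized = zip(zscores(accidents), zscores(cases), zscores(calls))
    return [mean(triple) for triple in standardized]
```

Standardizing first keeps the indicator with the largest raw scale (here, cases made) from dominating the average.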

Interstate Highway

Each unit/context covers a different geographic region,

approximately 55 % of which contain an interstate

highway. Therefore, a dummy-coded variable was created.

Contexts that included an interstate highway were coded

‘‘1,’’ and those that did not were coded ‘‘0.’’

Analytical Approach

A multilevel modeling approach was incorporated to

address our research questions, as this method is appro-

priate for nested or hierarchical data, and allows for the

simultaneous modeling of both within- and between-

group variance (Raudenbush and Bryk 2002). This

approach allows intercepts (means) to vary as a function

of nested groups (i.e., raters and work contexts), and

therefore allows the partitioning of rating variance due to

ratees (within-rater), raters (between-rater, within-unit),

and contexts (between-unit). A staged modeling approach

was incorporated, with the first stage including the esti-

mation of an unconditional or ‘‘null model’’ with no

predictors, in order to provide the initial partitioning of

variance in ratings. More specifically, the null model

results allow the calculation of ICC(1), which indicates

the proportion of total variance explained by group

membership (Bliese 2000; Raudenbush and Bryk 2002).

For comparison purposes, a preliminary two-level, null

model was first estimated with ratees (level-1) nested

within raters (level-2), in order to determine the propor-

tion of variance assigned to the rater when ignoring

context. All subsequent analyses included three-level

models, with ratees comprising level-1, raters as level-2,

and contexts as level-3. The null model was followed by a

series of random intercept and fixed slope models

(RIFSM; Aguinis et al. 2013; Raudenbush and Bryk

2002), which included entering our various predictors

from each level in stages. Predictors were centered

around their grand mean, as our research questions were

consistent with an ‘‘incremental’’ perspective (Hofmann

and Gavin 1998). All multilevel modeling was conducted

using HLM 7 software (Raudenbush et al. 2011).
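For reference, the three-level null model can be written in the conventional notation of Raudenbush and Bryk (2002). The equations below are a sketch under that convention, with ratee i nested in rater j nested in context k; the subscripts and distributional assumptions follow the standard textbook form and are not quoted from the analysis itself.

```latex
\begin{align*}
\text{Level 1 (ratees):}   \quad & Y_{ijk} = \pi_{0jk} + e_{ijk},
    & e_{ijk} &\sim N(0, \sigma^2) \\
\text{Level 2 (raters):}   \quad & \pi_{0jk} = \beta_{00k} + r_{0jk},
    & r_{0jk} &\sim N(0, \tau_{\pi}) \\
\text{Level 3 (contexts):} \quad & \beta_{00k} = \gamma_{000} + u_{00k},
    & u_{00k} &\sim N(0, \tau_{\beta}) \\[6pt]
\text{ICC(1)}_{\text{rater}} &= \frac{\tau_{\pi}}{\sigma^2 + \tau_{\pi} + \tau_{\beta}},
    & \text{ICC(1)}_{\text{context}} &= \frac{\tau_{\beta}}{\sigma^2 + \tau_{\pi} + \tau_{\beta}}
\end{align*}
```

Under grand-mean centering, each predictor X is entered as X minus its overall sample mean, so the intercept is interpreted as the expected rating at average values of the predictors.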

Results

Table 1 presents the descriptive statistics and zero-order

correlations for all study variables. An examination of the

correlations among the ratee-level (level-1) variables

indicates that task performance ratings were positively

correlated with objective performance outcomes (r = .16,

p < .01), job tenure (r = .18, p < .01), number of documented

performance incidents (r = .08, p < .05), and an administrative

rating purpose (r = .16, p < .01). In addition, rater-level

(level-2) correlations suggest that raters’ previous rating

tendencies were negatively associated

with supervisory experience (r = -.23, p < .01). The

context-level (level-3) correlations indicated that previous

contextual rating tendencies were negatively related to

work activity (r = -.42, p < .01), and contexts with

interstate highways had higher levels of work activity

(r = .36, p < .01).

With regard to the multilevel model results, the preliminary

two-level, null model showed significant rater

variance (τ00 = .08, df = 118, χ² = 774.34, p < .001),

and suggested that 47 % of the rating variability would be

attributed to the rater when ignoring context. The three-level

model results are presented in Table 2. The null

model estimating both rater and context effects indicated

that raters (τπ0 = .03, df = 61, χ² = 189.81, p < .001) as

well as contexts (τβ00 = .05, df = 57, χ² = 187.19,

p < .001) accounted for significant variability in supervisor

task performance ratings. More specifically, 17 % of the

variance was attributable to raters, and 28 % was associated

with work contexts.
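The reported percentages follow from the null-model variance components in Table 2. As a brief worked illustration (the function and variable names are hypothetical), each level's share is its variance component divided by the sum of all three:

```python
def variance_shares(sigma2, tau_rater, tau_context):
    """Proportion of total rating variance at each level of a
    three-level null model (ratees within raters within contexts)."""
    total = sigma2 + tau_rater + tau_context
    return {
        "ratee": sigma2 / total,
        "rater": tau_rater / total,
        "context": tau_context / total,
    }


# Null-model variance components as published in Table 2.
shares = variance_shares(sigma2=0.094, tau_rater=0.030, tau_context=0.048)
for level, share in shares.items():
    print(f"{level}: {share:.3f}")
```

Because the published components are rounded to three decimals, the recomputed shares may differ from the reported ICC(1) values (.174 and .278) in the third decimal place.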

The initial RIFSM results indicated that both objective

performance outcomes (γ100 = .32, p < .01) and ratee job

tenure (γ300 = .00, p < .01) predicted rating variability. In

addition, the quadratic term for objective performance

outcomes was also significant (γ200 = -.04, p < .01),

suggesting that the initial positive linear association

between objective performance outcomes and ratings

diminishes at higher levels of cases per hour. These

predictors explained 13 % of the ratee variance, but did not

explain rater or contextual variability, with significant rater

(τπ0 = .03, df = 61, χ² = 214.26, p < .001) and context

(τβ00 = .05, df = 57, χ² = 191.92, p < .001) variance

remaining. The second model introduced ratee-level

contextual variables, and found that both the number of

documented performance incidents (γ400 = .01, p < .01) and

rating purpose (γ500 = .10, p < .01) were significant

predictors. The combination of all level-1 variables

explained 19 % of the ratee variability, and 5 % of the rater

variance in ratings. Again, significant variability remained

between raters (τπ0 = .03, df = 61, χ² = 209.29, p < .001)

and contexts (τβ00 = .05, df = 57, χ² = 219.75,

p < .001). The third model added rater-level variables, and

the results indicated that the supervisor’s rating tendency

(γ010 = .30, p < .01) was a significant predictor. However,

neither supervisory tenure (γ020 = .00, p > .05) nor span of

control (γ030 = -.00, p > .05) were significant. The

addition of the rater characteristics explained a total of

23 % of the rater variance. This model also addressed

research question 1 (RQ1), in that significant variability

remained across raters (τπ0 = .02, df = 58, χ² = 181.36,

p < .001) and contexts (τβ00 = .05, df = 57, χ² = 232.19,

p < .001) after controlling for ratee- and rater-level

characteristics.

The final model indicated that contextual rating

tendencies were positively associated with performance

ratings (γ001 = .48, p < .01), but neither contextual work

activity (γ002 = -.01, p > .05) nor the presence of an

interstate (γ003 = .01, p > .05) were significant, addressing

research question 2 (RQ2). With regard to the variance

explained across the three levels in the final model, the

respective predictors explained 19 % of the ratee

variability, 27 % of the rater variability, and 17 % of the

contextual variability in ratings. After all variables were

included, significant variation remained between raters

(τπ0 = .02, df = 58, χ² = 180.08, p < .001) and contexts

(τβ00 = .04, df = 54, χ² = 204.19, p < .001).
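The variance-explained figures can be approximated from the variance components in Table 2. The sketch below assumes the conventional proportional-reduction formula for pseudo R² (Raudenbush and Bryk 2002), in which each level's residual variance under a fitted model is compared with the null model; the text does not state the exact computation used, and the names here are hypothetical.

```python
def pseudo_r2(null_variance, model_variance):
    """Proportional reduction in a variance component relative to
    the null model (assumed formula)."""
    return (null_variance - model_variance) / null_variance


# Published variance components: null model vs. final model (RIFSM 4).
null_model = {"ratee": 0.094, "rater": 0.030, "context": 0.048}
final_model = {"ratee": 0.076, "rater": 0.022, "context": 0.040}

r2 = {
    level: pseudo_r2(null_model[level], final_model[level])
    for level in null_model
}
for level, value in r2.items():
    print(f"{level}: {value:.3f}")
```

Values recomputed this way differ slightly from those reported (.185, .268, and .165) because the published variance components are rounded to three decimals.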

Table 1 Descriptive statistics and zero-order correlations

| Variable | M | SD | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|---|
| Ratee-level (L1) variables | | | | | | |
| 1. Objective performance outcomes | 1.17 | .60 | | | | |
| 2. Ratee job tenure | 100.43 | 78.89 | -.15** | | | |
| 3. Number of documented incidents | 16.83 | 6.25 | .18** | -.04 | | |
| 4. Rating purpose | .14 | .35 | -.11** | .21** | -.02 | |
| 5. Task performance ratings | 4.70 | .41 | .16** | .18** | .08* | .16** |
| Rater-level (L2) variables | | | | | | |
| 1. Supervisor rating tendency | 5.02 | .30 | | | | |
| 2. Supervisor job tenure | 55.70 | 36.75 | -.28** | | | |
| 3. Span of control | 7.31 | 2.81 | .17 | -.16 | | |
| Context-level (L3) variables | | | | | | |
| 1. Contextual rating tendency | 4.91 | .22 | | | | |
| 2. Work activity | .02 | .93 | -.42** | | | |
| 3. Interstate highway | .55 | .50 | -.11 | .36** | | |

L1 level 1, L2 level 2, L3 level 3, L1 N = 804, L2 N = 119, and L3 N = 58

* p < .05; ** p < .01

Discussion

The goal of this research was to contribute to the existing

literature on sources of variance in job performance ratings,

by partitioning rating variability due to several sources. In

particular, our study adds evidence to the

proposition that contexts can play an important role in

shaping rating behavior (Levy and Williams 2004; Murphy

and Cleveland 1995). Although disentangling rater and

contextual rating variance is difficult (Murphy and DeShon

2000), partitioning variance due to omnibus contexts in

terms of distinct units within an organization provides a

potentially useful approach for separating additional sour-

ces of variance. Although the findings here should be

replicated in other settings/samples to ensure generaliz-

ability, our results suggest that much of what is often

considered to be rater variance may be systematic con-

textual variability. More specifically, when estimating a

two-level model using our data (i.e., ignoring context),

47 % of the rating variability would be interpreted as rater

variance. It should be noted that this estimate is very

similar to those found in other studies (i.e., 43–58 %)

examining supervisory rating variance (Hoffman et al.

2010; O’Neill et al. 2012; Scullen et al. 2000). However,

when the rating context is modeled, the data suggest that

28 % represents contextual variation, and 17 % reflects

rater variance (within context). Although rater variance

still represents a large portion of rating variability, our

findings indicate that an even greater proportion may be

due to aspects of the work environment that influence the

rating behavior of the supervisors in those contexts.

Importantly, this contextual variation remained even after

accounting for several ratee- and rater-level characteristics.

We also examined a diverse set of predictor variables

across levels, in order to determine the extent to which

these commonly cited factors in performance appraisal

Table 2 Multilevel modeling results

| Level and variable | Null | RIFSM 1 | RIFSM 2 | RIFSM 3 | RIFSM 4 |
|---|---|---|---|---|---|
| Ratee level (L1) | | | | | |
| Intercept (γ000) | 4.692** (.035) | 4.691** (.035) | 4.689** (.036) | 4.689** (.035) | 4.685** (.033) |
| Objective performance outcomes (γ100) | | .318** (.044) | .282** (.043) | .289** (.044) | .312** (.044) |
| Objective performance outcomes quadratic (γ200) | | -.036** (.007) | -.035** (.007) | -.036** (.007) | -.040** (.008) |
| Ratee job tenure (γ300) | | .001** (.000) | .001** (.000) | .001** (.000) | .001** (.000) |
| Number of documented incidents (γ400) | | | .013** (.004) | .013** (.004) | .013** (.004) |
| Rating purpose (γ500) | | | .096** (.027) | .098** (.027) | .095** (.027) |
| Rater level (L2) | | | | | |
| Supervisor rating tendency (γ010) | | | | .299** (.077) | .320** (.078) |
| Supervisor job tenure (γ020) | | | | .001 (.001) | .001 (.001) |
| Span of control (γ030) | | | | -.002 (.008) | .001 (.007) |
| Context level (L3) | | | | | |
| Contextual rating tendency (γ001) | | | | | .484** (.180) |
| Work activity (γ002) | | | | | -.006 (.037) |
| Interstate highway (γ003) | | | | | .007 (.073) |
| Variance components | | | | | |
| Within-rater (L1) variance (σ²) | .094 | .081 | .076 | .076 | .076 |
| Between-rater within-context (L2) variance (τπ0) | .030** | .031** | .028** | .023** | .022** |
| Between-context (L3) variance (τβ00) | .048** | .048** | .053** | .049** | .040** |
| Additional information | | | | | |
| Rater (L2) ICC(1) | .174 | | | | |
| Context (L3) ICC(1) | .278 | | | | |
| Ratee-level (L1) pseudo R² | – | .133 | .185 | .185 | .185 |
| Rater-level (L2) pseudo R² | – | .000 | .048 | .234 | .268 |
| Context-level (L3) pseudo R² | – | .000 | .000 | .000 | .165 |

Values in parentheses are robust standard errors; t statistics were computed as the ratio of each regression coefficient divided by its standard error

RIFSM random intercept and fixed slope model, ICC intraclass correlation coefficient, L1 level 1, L2 level 2, L3 level 3, L1 N = 804, L2 N = 119, and L3 N = 58

* p < .05; ** p < .01

research explain the rating variance associated with ratees,

raters, and contexts. First of all, the results suggest that

ratee differences in objective performance outcomes (i.e.,

quantity) and job tenure explained variance at the ratee

level (13 %), but did not account for variance across

supervisors or work contexts, suggesting that a large por-

tion of the rater and context variance may be due to other

factors. Although our measure of objective performance is

certainly an imperfect one, and result-based measures of

performance often do not correlate strongly with perfor-

mance ratings (Bommer et al. 1995; Heneman 1986),

incorporating this variable (and job tenure) allowed at least

some degree of control over potential true performance

differences across raters and contexts (LaHuis and Avis

2007). It is also interesting to note that we found evidence

of a curvilinear association between objective performance

outcomes and ratings, which to our knowledge had not

been examined in previous investigations of the relation-

ship between objective and subjective measures (Bommer

et al. 1995; Heneman 1986). The positive linear association

between outcomes and ratings plateaued and then appeared

to become negative. However, this only occurred at very

high levels of cases per hour (i.e., over 4 SDs above the

mean); therefore, caution should be taken in interpreting

this finding. With the inclusion of the ratee-level contextual

characteristics (i.e., number of documented incidents and

rating purpose), a total of 19 % of the level-1 rating vari-

ability was accounted for, and these variables explained a

small portion of the variability across raters (5 %).

With regard to rater characteristics, idiosyncratic ten-

dencies for leniency are often cited as a ubiquitous concern

in performance appraisal, and a likely driver of rating

differences across supervisors (Hauenstein 1992; Murphy

and Cleveland 1995; Scullen et al. 2000). In addition, as

noted previously, research has suggested that tendencies

for leniency are a relatively stable rater characteristic over

time (Kane et al. 1995). Following the approach employed

by Kane et al. (1995), we were able to estimate the extent

to which supervisors’ rating tendencies (i.e., mean task

performance ratings) from the past, were associated with

their subsequent ratings of a different set of ratees. The

results indicated that these previous rating tendencies

explained an additional 19 % of the between-rater vari-

ability, but did not explain contextual variation. This is

noteworthy, in that a supervisor’s mean tendency seems to

account for about one-fifth of the rater effect, after con-

trolling for ratee-level characteristics. Although several

previous studies suggested an association between perfor-

mance ratings and supervisory experience (Landy and Farr

1980; Spence and Keeping 2010; Zalesny and Highhouse

1992), as well as span of control (O’Neill et al. 2012), our

data did not support a link between these characteristics

and rating behavior. However, findings regarding these

particular rater variables have been mixed, as other

researchers also did not find a significant relationship

(Judge and Ferris 1993; Klores 1966; LaHuis and Avis

2007). The mixed results across studies suggest that other

factors may moderate the extent to which supervisory

experience and span of control predict ratings. For exam-

ple, the previously cited study which found a relationship

between the number of ratees evaluated and supervisor

ratings (O’Neill et al. 2012) used a ‘‘relative’’ appraisal

approach (Goffin et al. 2009); therefore, it may be that span

of control is only influential when explicitly making

comparisons among ratees. In addition, our measure of

supervisory tenure was a within-organization, time-based

operationalization, so supervisory experience may be more

meaningful when considering other definitions of experi-

ence (e.g., amount or density; Tesluk and Jacobs 1998).

One of the primary objectives of this research was to not

only estimate the amount of rating variability due to rating

contexts, but to also attempt to explain this variability. The

collection of ratee- and rater-level characteristics above did

not account for significant variance across contexts, sug-

gesting that other factors are driving the rating differences

across organizational units. We proposed that rating ten-

dencies may also exist at the work context level, and thus

could be influential in explaining between-unit differences

in performance ratings. Our results suggest that the rating

distributional tendencies (i.e., means) of organizational

units show some level of consistency over time, with pre-

vious rating tendencies predicting subsequent ratings of a

distinct group of ratees. Given the limitations of our data,

we are unable to determine the mechanism driving these

mean tendencies; however, we previously offered conjec-

ture that one possible explanation is that these tendencies

reflect an aspect of the social context, and are the result of

contextual norms or standards for performance and rating

behavior. If this were the case, it would be consistent with

prior theory (DeCotiis and Petit 1978; Ilgen and Feldman

1983; Murphy and Cleveland 1995; Spence and Keeping

2013) as well as previous lab-based research (Shore and

Tashchian 2002; Spence and Keeping 2010) suggesting the

importance of rating norms in performance appraisal. We

also explored characteristics of the task and physical work

contexts; however, these variables were not predictive of

between-context rating variability (discussed further in

Future Research). The inclusion of contextual rating ten-

dencies accounted for 17 % of the context effect, and it is

important to note again that this relationship was demon-

strated while holding all of the previously discussed ratee

and rater characteristics constant, and in an organization in

which supervisors are provided frame-of-reference train-

ing, along with annual refresher training (Bernardin and

Buckley 1981; Roch et al. 2012). Although many others

have cited the potential importance of the rating context in

performance appraisal (e.g., Murphy 2008; Murphy and

Cleveland 1995), our findings add valuable empirical evi-

dence as to the extent to which this may be the case in field

settings.

Study Limitations

The findings presented here should be considered in light of

several study limitations. First, given that the data analyzed

here were from a single organization (which was a pre-

dominantly Caucasian, male sample from a law enforcement

organization), this may limit the generalizability of our

results. The similarity of our estimate of rater variability (i.e.,

if ignoring context) to those found in other studies does

suggest that our results are comparable to previous research;

however, future studies should seek to partition rating vari-

ance due to context in other settings, and with more diverse

samples. Second, our within-context (i.e., raters per context)

sample size was relatively low, which may have impacted

our results. Although the organization here was a fairly large

organization, even larger samples may be needed to examine

contexts with more supervisors per unit.

Third, several potential issues with the measures incorporated

here deserve mention. For example, though raters who had

completed only one or two previous appraisals were excluded,

rater and contextual rating tendencies

were in some cases based on relatively low numbers of

previous evaluations, and thus the tendencies in those cases

may represent less stable estimates. Nonetheless, overall

the rater/context means were relatively reliable, and were

predictive of both rater and contextual variability. In

addition, as mentioned previously, our measure of objec-

tive performance was a results-based operationalization

that did not capture performance ‘‘quality,’’ and may not

have been an adequate control for true ratee performance

differences. Therefore, some degree of the remaining rater

and context variability likely still reflects actual perfor-

mance differences across groups. In addition, the nature of

our data prevents us from drawing definitive conclusions

regarding the extent to which all variables represent valid

or biasing sources of variability. Of the significant pre-

dictors in our study, we believe there are plausible reasons

to expect that objective performance outcomes (quantity)

and job tenure likely reflect at least some degree of valid

rating variability, and that rating purpose and rater/con-

textual distributional tendencies likely reflect bias; how-

ever, this is less clear for documented incidents of

performance. In our study, the number of incidents was

positively associated with ratings, but we cannot determine

if the rating variance explained represents true ratee per-

formance differences, or bias based on supervisor famil-

iarity (or lack thereof) with the ratee. Furthermore, though

we described the number of documented incidents as a

proxy measure for the opportunity to observe performance,

other factors may in fact have systematically impacted the

number of incidents recorded. For example, rater motiva-

tion or beliefs about the importance of documenting per-

formance may have more to do with the number of

incidents recorded than actual opportunity to observe per-

formance (Harris 1994). However, again, this variable was

nevertheless associated with ratee performance variability.

Practical Implications and Future Research

Given the inductive nature of our study, we believe caution

should be taken in making recommendations for practice;

however, there are a few important practical implications

of our research. First, the presence of relatively large

rater and context effects in supervisor ratings suggests that

practitioners (or researchers) utilizing ratings as criteria

when validating selection instruments should incorporate

analytic approaches which account for this nested data

structure (e.g., MRC models). Previous research demon-

strates that ignoring the nonindependence in criterion data

can have the effect of attenuating statistical power, par-

ticularly with higher levels of between-group variance

(Bliese and Hanges 2004). In other words, if doing vali-

dation with performance ratings as the criterion using a

method that does not account for the hierarchical nature of

the data (e.g., ordinary least squares regression), one may

erroneously conclude that an individual-level predictor

(e.g., selection test) is not significantly related to perfor-

mance. In addition, our results suggest that caution should

be taken when using supervisor ratings to make ratee

comparisons across supervisors and/or contexts for

administrative purposes, as inconsistent expectations and

standards may exist (Ilgen and Feldman 1983), even when

attempts have been made to impart a common frame-of-

reference among supervisors (Bernardin and Buckley 1981;

Roch et al. 2012). Furthermore, this issue may be even

more pronounced in multinational organizations, where

work contexts potentially differ more drastically in terms

of their social, task, and physical characteristics.

With regard to avenues for future research, this study

demonstrates the potential utility of the MRC modeling

approach in better understanding sources of performance

rating variability, and in evaluating performance appraisal

interventions. Previous research has incorporated this

approach in examining rater effects (LaHuis and Avis 2007;

O’Neill et al. 2012), and we believe the benefits extend to the

study of additional levels/variables such as contexts. Study-

ing sources of variability has been proposed as a useful

approach for studying the quality of rating data in field

research (Hoffman et al. 2012), as opposed to utilizing direct

measures of rating accuracy, which is typically confined to

lab settings (Murphy and Cleveland 1995). For example,

despite a historical moratorium on rating scale research

(Landy and Farr 1980), more recently scholars have proposed

innovative new performance measurement methods (e.g.,

Borman et al. 2001; Hoffman et al. 2012), and additional field

research investigating these approaches is warranted (Landy

2010). The MRC modeling approach could for instance be

used to examine the effect of scale design interventions on

systematic rater and contextual rating variability.

However, although we are certainly not the first to call

for research on the context of performance appraisal (e.g.,

Levy and Williams 2004; Murphy and Cleveland 1995),

our findings regarding significant differences associated

with organizational units suggest a particular need for

future research on contextual sources of rating variance. As

described previously, intra-organizational units may differ

in terms of discrete social, task, and physical characteristics

(Hattrup and Jackson 1996; Johns 2006; Mowday and

Sutton 1993), and additional research on all three of these

contextual dimensions may be beneficial. For example, if

future research can confirm that social norms or climates

for performance management are a mechanism driving

contextual variation, this would be in accordance with

previous suggestions that, ‘‘the interventions most likely to

improve the quality of performance appraisals in organi-

zations are likely to look more like organizational devel-

opment than scale development’’ (Murphy 2008, p. 158). In

other words, an additional approach to the criterion prob-

lem may be to continue to identify social contextual

influences (Levy and Williams 2004), and to potentially

attempt to take steps to avoid contextual norms for rating

behavior that may not support accurate evaluations of job

performance or employee development. For example,

many have called for research on rater goals and intentions

(Murphy 2008; Murphy and Cleveland 1995; Spence and

Keeping 2011), and MRC models seem particularly rele-

vant for examining the degree to which these goals are

primarily a ratee- or rater-level phenomenon, or a function

of goals which result from contextual influences.

Building on the points above, research is needed to

better understand the nature and development of norms for

rating behavior. In this study, we operationalized rating

tendencies based on actual distributional rating characteristics

(i.e., means); however, future research should directly

collect field data regarding perceived normative influences

from supervisors or peers. For example, Spence and

Keeping (2013) propose a framework based on the theory

of planned behavior (Ajzen 1991; Ajzen and Fishbein

2005), which includes several propositions regarding the

role of subjective norms in influencing rater intentions and

behavior. To the extent that we can better understand these

influences, researchers and/or practitioners may be able to

develop and test new interventions to potentially shape the

development of norms across organizational units/contexts.

Furthermore, although in our study features of the task

and physical context did not explain rating variance, addi-

tional research should examine these factors in order to

identify the circumstances under which these aspects of the

work context may be of more/less influence. We examined

the frequency of certain work activities, but other factors

such as contextual differences in task importance or diffi-

culty may shape the standards used to evaluate performance.

In addition, previous research indicates that supervisors

develop ‘‘folk theories’’ of task performance (Borman

1987), and these views may be in part a function of the task

context in which the supervisor works. Research also indi-

cates that other features of the task context influence role

requirements (e.g., accountability, autonomy, routinization;

Dierdorff et al. 2009) and performance ratings (e.g.,

accountability; Mero et al. 2003), and these factors may

explain contextual rating differences to the extent that they

vary systematically across intra-organizational work con-

texts. Furthermore, although the physical presence of an

interstate highway did not seem to influence ratings in our

study, other physical context characteristics may be relevant

depending on the organizational setting. For instance,

physical work space characteristics such as the number of

enclosures have been shown to influence performance

(Oldham et al. 1991), and differences in these physical

characteristics may influence the nature/frequency of

employee performance-related interactions or the impor-

tance of work tasks.

Conclusion

Employee performance ratings are at least in part dependent

on the supervisor/rater who produces them, as well as the

work context in which they are produced. Given the many

issues associated with ratings, there is currently a debate as to

whether ratings should be abandoned altogether, or if con-

tinued efforts should be made to improve upon them as a

component of performance management (Adler et al., in

press). Going forward, it remains to be seen whether the

latter goal can be achieved; however, if future efforts are to

be made toward improving ratings, we believe that contin-

uing to identify contextual influences in appraisal is a worthy

endeavor. Although many questions remain regarding the

nature of contextual rating variability, this line of research

(among others) may help to better understand and hence

improve ratings as a form of performance measurement.

References

Adler, S., Campion, M., Colquitt, A., Grubb, A., Murphy, K. R.,

Ollander-Krane, R., et al. (in press). Getting rid of performance

ratings: Genius or folly? A debate. Industrial and Organizational

Psychology: Perspectives on Science and Practice.

Aguinis, H., Gottfredson, R. K., & Culpepper, S. A. (2013). Best-

practice recommendations for estimating cross-level interaction

effects using multilevel modeling. Journal of Management, 39,

1490–1528. doi:10.1177/0149206313478188.

Ajzen, I. (1991). The theory of planned behavior. Organizational

Behavior and Human Decision Processes, 50, 179–211.

doi:10.1016/0749-5978(91)90020-T.

Ajzen, I., & Fishbein, M. (2005). The influence of attitudes on

behavior. In D. Albarracı́n, B. T. Johnson, & M. P. Zanna (Eds.),

The handbook of attitudes (pp. 173–221). Mahwah, NJ:

Lawrence Erlbaum Associates.

Austin, J. T., & Crespin, T. R. (2006). Problems of criteria in industrial

and organizational psychology: Progress, pitfalls, and prospects. In

W. Bennett Jr, C. E. Lance, & D. J. Woehr (Eds.), Performance

measurement: Current perspectives and future challenges (pp.

9–48). Mahwah, NJ: Lawrence Erlbaum Associates.

Austin, J. T., & Villanova, P. (1992). The criterion problem:

1917-1992. Journal of Applied Psychology, 77, 836–874.

Bartko, J. J. (1976). On various intraclass correlation reliability

coefficients. Psychological Bulletin, 83, 762–765.

Bennett, W., Jr., Lance, C. E., & Woehr, D. J. (2006). Introduction. In

W. Bennett Jr, C. E. Lance, & D. J. Woehr (Eds.), Performance

measurement: Current perspectives and future challenges (pp.

1–5). Mahwah, NJ: Lawrence Erlbaum Associates.

Bernardin, H. J., & Buckley, R. B. (1981). Strategies in rater training.

Academy of Management Review, 6, 205–212.

Bliese, P. D. (2000). Within-group agreement, non-independence, and

reliability: Implications for data aggregation and analysis. In K.

J. Klein & S. W. J. Kozlowski (Eds.), Multilevel theory,

research, and methods in organizations: Foundations, exten-

sions, and new directions (pp. 349–381). San Francisco, CA:

Jossey-Bass.

Bliese, P. D., & Hanges, P. J. (2004). Being both too liberal and too

conservative: The perils of treating grouped data as though they

were independent. Organizational Research Methods, 7,

400–417. doi:10.1177/1094428104268542.

Bommer, W. H., Johnson, J., Rich, G. A., Podsakoff, P. M., &

MacKenzie, S. B. (1995). On the interchangeability of objective

and subjective measures of employee performance: A meta-

analysis. Personnel Psychology, 48, 587–605. doi:10.1111/j.

1744-6570.1995.tb01772.x.

Borman, W. C. (1987). Personal constructs, performance schemata,

and “folk theories” of subordinate effectiveness: Explorations in

an Army officer sample. Organizational Behavior and Human

Decision Processes, 40, 307–322.

Borman, W. C. (2004). The concept of organizational citizenship.

Current Directions in Psychological Science, 13, 238–241.

Borman, W. C., Buck, D. E., Motowidlo, S. J., Hanson, M. A., Stark,

S., & Drasgow, F. (2001). An examination of the comparative

reliability, validity, and accuracy of performance ratings made

using computerized adaptive rating scales. Journal of Applied

Psychology, 86, 965–973. doi:10.1037//0021-9010.86.5.965.

Campbell, J. P., McCloy, R. A., Oppler, S. H., & Sager, C. E. (1993).

A theory of performance. In N. Schmitt & W. C. Borman (Eds.),

Personnel selection in organizations (pp. 35–70). San Francisco,

CA: Jossey-Bass.

Deadrick, D. L., & Gardner, D. G. (1997). Distributional ratings of

performance levels and variability. Group and Organization

Management, 22, 317–342.

DeCotiis, T., & Petit, A. (1978). The performance appraisal process:

A model and some testable propositions. Academy of Management Review, 3, 635–646.

DeNisi, A. S., Cafferty, T. P., & Meglino, B. M. (1984). A cognitive

view of the performance appraisal process: A model and

research propositions. Organizational Behavior & Human Per-

formance, 33, 360–396.

Dierdorff, E. C., Rubin, R. S., & Morgeson, F. P. (2009). The milieu

of managerial work: An integrative framework linking work

context to role requirements. Journal of Applied Psychology, 94,

972.

Dierdorff, E. C., & Surface, E. A. (2007). Placing peer ratings in

context: Systematic influences beyond ratee performance. Per-

sonnel Psychology, 60, 93–126. doi:10.1111/j.1744-6570.2007.

00066.x.

Elsbach, K. D., & Pratt, M. G. (2008). The physical environment in

organizations. In J. P. Walsh & A. P. Brief (Eds.), The academy

of management annals (Vol. 1, pp. 181–224). New York: Taylor

& Francis Group/Lawrence Erlbaum Associates.

Goffin, R. D., Jelley, R. B., Powell, D. M., & Johnston, N. G. (2009).

Taking advantage of social comparisons in performance

appraisal: The relative percentile method. Human Resource

Management, 48, 251–268. doi:10.1002/hrm.20278.

Greguras, G. J., Robie, C., Schleicher, D. J., & Goff, M., III (2003).

A field study of the effects of rating purpose on the quality of

multisource ratings. Personnel Psychology, 56, 1–21.

Harris, M. M. (1994). Rater motivation in the performance appraisal

context: A theoretical framework. Journal of Management, 20,

737–756.

Hattrup, K., & Jackson, S. (1996). Learning about individual

differences by taking situations seriously. In K. R. Murphy

(Ed.), Individual differences and behavior in organizations (pp.

507–547). San Francisco: Jossey-Bass.

Hauenstein, N. M. A. (1992). An information-processing approach to

leniency in performance judgments. Journal of Applied Psy-

chology, 77, 485.

Heneman, R. L. (1986). The relationship between supervisory ratings

and results-oriented measures of performance: A meta-analysis.

Personnel Psychology, 39, 811–826.

Hoffman, B. J., Gorman, C. A., Blair, C. A., Meriac, J. P., Overstreet,

B., & Atchley, E. K. (2012). Evidence for the effectiveness of an

alternative multisource performance rating methodology. Per-

sonnel Psychology, 65, 531–563. doi:10.1111/j.1744-6570.2012.

01252.x.

Hoffman, B., Lance, C. E., Bynum, B., & Gentry, W. A. (2010). Rater

source effects are alive and well after all. Personnel Psychology,

63, 119–151. doi:10.1111/j.1744-6570.2009.01164.x.

Hofmann, D. A., & Gavin, M. B. (1998). Centering decisions in

hierarchical linear models: Implications for research in organi-

zations. Journal of Management, 24, 623–641.

Ilgen, D. R., Barnes-Farrell, J. L., & McKellin, D. B. (1993).

Performance appraisal process research in the 1980s: What has it

contributed to appraisals in use? Organizational Behavior and

Human Decision Processes, 54, 321–368.

Ilgen, D. R., & Feldman, J. M. (1983). Performance appraisal: A process

focus. In L. Cummings & B. Staw (Eds.), Research in organiza-

tional behavior (Vol. 5, pp. 141–197). Greenwich, CT: JAI Press.

Jawahar, I. M., & Williams, C. R. (1997). Where all the children are

above average: The performance appraisal purpose effect.

Personnel Psychology, 50, 905–925.

Johns, G. (2006). The essential impact of context on organizational

behavior. Academy of Management Review, 31, 386–408.

Judge, T. A., & Ferris, G. R. (1993). Social context of performance

evaluation decisions. Academy of Management Journal, 36,

80–105.

Kane, J. S., Bernardin, H. J., Villanova, P., & Peyrefitte, J. (1995).

Stability of rater leniency: Three studies. Academy of Manage-

ment Journal, 38, 1036–1051.

Kingstrom, P. O., & Mainstone, L. E. (1985). An investigation of the

rater–ratee acquaintance and rater bias. Academy of Management

Journal, 28, 641–653. doi:10.2307/256119.

Klores, M. S. (1966). Rater bias in forced-distribution performance

ratings. Personnel Psychology, 19, 411–421.

Kozlowski, S. W. J., Kirsch, M. P., & Chao, G. T. (1986). Job

knowledge, ratee familiarity, conceptual similarity and halo

error: An exploration. Journal of Applied Psychology, 71, 45–49.

LaHuis, D. M., & Avis, J. M. (2007). Using multilevel random

coefficient modeling to investigate rater effects in performance

ratings. Organizational Research Methods, 10, 97–107.

Landy, F. (2010). Performance ratings: Then and now. In J. L. Outtz

(Ed.), Adverse impact: Implications for organizational staffing

and high stakes selection (pp. 227–248). New York: Routledge/

Taylor & Francis Group.

Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological

Bulletin, 87, 72–107.

Levy, P. E., & Williams, J. R. (2004). The social context of

performance appraisal: A review and framework for the future.

Journal of Management, 30, 881–905.

McDaniel, M. A., Schmidt, F. L., & Hunter, J. E. (1988). Job

experience correlates of job performance. Journal of Applied

Psychology, 73, 327–330.

Mero, N. P., Motowidlo, S. J., & Anna, A. L. (2003). Effects of

accountability on rating behavior and rater accuracy. Journal of

Applied Social Psychology, 33, 2493–2514.

Mount, M. K., Judge, T. A., Scullen, S. E., Sytsma, M. R., & Hezlett,

S. A. (1998). Trait, rater and level effects in 360-degree

performance ratings. Personnel Psychology, 51, 557–576.

Mowday, R. T., & Sutton, R. I. (1993). Organizational behavior:

Linking individuals and groups to organizational contexts.

Annual Review of Psychology, 44, 195–229. doi:10.1146/

annurev.ps.44.020193.001211.

Murphy, K. R. (2008). Explaining the weak relationship between job

performance and ratings of job performance. Industrial and

Organizational Psychology: Perspectives on Science and Prac-

tice, 1, 148–160. doi:10.1111/j.1754-9434.2008.00030.x.

Murphy, K. R., & Cleveland, J. N. (1995). Understanding perfor-

mance appraisal: Social, organizational, and goal-based per-

spectives. Thousand Oaks, CA: Sage Publications.

Murphy, K. R., Cleveland, J. N., Kinney, T. B., Skattebo, A. L.,

Newman, D. A., & Sin, H. P. (2003). Unit climate, rater goals

and performance ratings in an instructional setting. Irish Journal

of Management, 24, 48.

Murphy, K. R., & DeShon, R. (2000). Interrater correlations do not

estimate the reliability of job performance ratings. Personnel

Psychology, 53, 873–900.

O’Neill, T. A., Goffin, R. D., & Gellatly, I. R. (2012). The use of

random coefficient modeling for understanding and predicting

job performance ratings: An application with field data. Orga-

nizational Research Methods, 15, 436–462. doi:10.1177/

1094428112438699.

Oldham, G. R., Kulik, C. T., & Stepina, L. P. (1991). Physical

environments and employee reactions: Effects of stimulus-

screening skills and job complexity. Academy of Management

Journal, 34, 929–938. doi:10.2307/256397.

Peters, L. H., & O’Connor, E. J. (1980). Situational constraints and

work outcomes: The influences of a frequently overlooked

construct. Academy of Management Review, 5, 391–398. doi:10.

5465/AMR.1980.4288856.

Putka, D. J., Ingerick, M., & McCloy, R. A. (2008). Integrating

traditional perspectives on error in ratings: Capitalizing on

advances in mixed-effects modeling. Industrial & Organizational

Psychology, 1, 167–173. doi:10.1111/j.1754-9434.2008.00032.x.

Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear

models: Applications and data analysis methods (2nd ed.).

Thousand Oaks, CA: Sage Publications.

Raudenbush, S. W., Bryk, A. S., Cheong, Y. F., Congdon, R. T., & du

Toit, M. (2011). HLM 7. Lincolnwood, IL: Scientific Software

International.

Reb, J., & Cropanzano, R. (2007). Evaluating dynamic performance:

The influence of salient gestalt characteristics on performance

ratings. Journal of Applied Psychology, 92, 490–499.

Reb, J., & Greguras, G. J. (2010). Understanding performance ratings:

Dynamic performance, attributions, and rating purpose. Journal

of Applied Psychology, 95, 213–220.

Roch, S. G., Woehr, D. J., Mishra, V., & Kieszczynska, U. (2012).

Rater training revisited: An updated meta-analytic review of

frame-of-reference training. Journal of Occupational & Orga-

nizational Psychology, 85, 370–395. doi:10.1111/j.2044-8325.

2011.02045.x.

Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of

selection methods in personnel psychology: Practical and

theoretical implications of 85 years of research findings. Psy-

chological Bulletin, 124, 262–274. doi:10.1037/0033-2909.124.

2.262.

Scullen, S. E., Mount, M. K., & Goff, M. (2000). Understanding the

latent structure of job performance ratings. Journal of Applied

Psychology, 85, 956–970.

Shore, T. H., & Tashchian, A. (2002). Accountability forces in

performance appraisal: Effects of self-appraisal information,

normative information, and task performance. Journal of Business

and Psychology, 17, 261–274. doi:10.1023/A:1019689616654.

Spence, J. R., & Keeping, L. M. (2010). The impact of non-

performance information on ratings of job performance: A

policy-capturing approach. Journal of Organizational Behavior,

31, 587–608.

Spence, J. R., & Keeping, L. (2011). Conscious rating distortion in

performance appraisal: A review, commentary, and proposed

framework for research. Human Resource Management Review,

21, 85–95. doi:10.1016/j.hrmr.2010.09.013.

Spence, J. R., & Keeping, L. M. (2013). The road to performance

ratings is paved with intentions: A framework for understanding

managers’ intentions when rating employee performance. Or-

ganizational Psychology Review, 3, 360–383. doi:10.1177/

2041386613485969.

Tesluk, P. E., & Jacobs, R. R. (1998). Toward an integrated model of

work experience. Personnel Psychology, 51, 321–355.

Waldman, D. A., Yammarino, F. J., & Avolio, B. J. (1990). A

multiple level investigation of personnel ratings. Personnel

Psychology, 43, 811–835.

Wherry, R. J., & Bartlett, C. J. (1982). The control of bias in ratings:

A theory of rating. Personnel Psychology, 35, 521–551.

Woehr, D. J., & Roch, S. (2012). Supervisory performance ratings. In

N. Schmitt (Ed.), The Oxford handbook of personnel assessment

and selection (pp. 517–531). New York: Oxford University

Press.

Zalesny, M. D., & Highhouse, S. (1992). Accuracy in performance

evaluations. Organizational Behavior and Human Decision

Processes, 51, 22–50.
