Agreement Indices in MultiLevel Analysis
Ayala Cohen
Faculty of Industrial Engineering & Management
Technion-Israel Institute of Technology
May 2007
Outline
• Introduction (on interrater agreement, IRA)
• rWG(J) index of agreement
• AD (Absolute Deviation), an alternative measure of agreement
-------------------------------
Review
Our work:
(2001) with Etti Doveh and Uri Eick
(2007) with Etti Doveh and Inbal Shani
INTRODUCTION: Why we need a measure of agreement
In recent years there has been a growing number of studies based on multilevel data in applied psychology, organizational behavior, and clinical trials.
Typical data structure:
• Individuals within groups (two levels)
• Groups within departments (three levels)
Constructs
• Constructs are our building blocks in developing and in testing theory.
• Group-level constructs describe the group as a whole and are of three types (Kozlowski & Klein, 2000):
– Global, shared, or configural.
Global Constructs
• Relatively objective, easily observable, descriptive group characteristics.
• Originate and are manifest at the group level.
• Examples:
– Group function, size, or location.
• No meaningful within-group variability.
• Measurement is generally straightforward.
Shared Constructs
• Group characteristics that are common to group members.
• Originate in group members' attitudes, perceptions, cognitions, or behaviors
– Which converge as a function of socialization, leadership, shared experience, and interaction.
• Within-group variability predicted to be low.
• Examples: Group climate, norms.
Configural Group-Level Constructs
• Group characteristics that describe the array, pattern, dispersion, or variability within a group.
• Originate in group member characteristics (e.g., demographics, behaviors, personality)
– But no assumption or prediction of convergence.
• Examples:
– Diversity, team star or weakest member.
Justifying Aggregation
• Why is this essential?
– In the case of shared constructs, our construct definitions rest on assumptions regarding within- and between-group variability.
– If our assumptions are wrong, our construct “theories” and our measures are flawed, and so are our conclusions.
• So, test both:
– Within-group agreement
The construct is supposed to be shared; is it really?
– Between-group variability (reliability)
Groups are expected to differ significantly; do they really?
Chen, Mathieu & Bliese (2004) proposed a framework for conceptualizing and testing multilevel constructs. This framework includes the assessment of within-group agreement.
Assessment of agreement is a prerequisite for arguing that a higher-level construct can be operationalized.
A distinction should be made between:
• Interrater reliability (IRR) and
• Interrater agreement (IRA)
Many past studies wrongly used the two terms interchangeably in their discussions.
The term interrater agreement refers to the degree to which ratings from individuals are interchangeable; namely, it reflects the extent to which raters provide essentially the same rating (Kozlowski & Hattrup, 1992; Tinsley & Weiss, 1975).
Interrater reliability refers to the degree to which ratings of different judges are proportional when expressed as deviations from their means.
Interrater reliability (IRR) refers to relative consistency and is assessed by correlations.
Interrater agreement (IRA) refers to the absolute consensus in scores assigned by the raters and is assessed by measures of variability.
Scale of Measurement
• Questionnaire with J parallel items on a Likert scale with A categories, e.g. A=5:
1 = Strongly disagree
2 = Disagree
3 = Indifferent
4 = Agree
5 = Strongly agree
Example
k=3 raters, Likert scale with A=7 categories, J=5 items

Item   Rater 1   Rater 2   Rater 3   Dev from mean
1      7         4         3          1
2      5         2         1         -1
3      5         2         1         -1
4      6         3         2          0
5      7         4         3          1
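The example can be checked numerically. The Python sketch below uses illustrative names that are not from the source, and it assumes that “Dev from mean” is each rater's deviation from that rater's own mean over the five items. Under that reading the three raters have identical deviation profiles (perfect consistency), while their raw ratings differ by four scale points on every item (poor consensus).

```python
from statistics import mean

# Ratings from the example table: rows are items 1..5, columns are raters 1..3.
ratings = [
    (7, 4, 3),
    (5, 2, 1),
    (5, 2, 1),
    (6, 3, 2),
    (7, 4, 3),
]

def deviations_from_own_mean(rater_scores):
    """Center one rater's scores on that rater's own mean."""
    m = mean(rater_scores)
    return [x - m for x in rater_scores]

# Transpose so that each entry holds one rater's five scores.
by_rater = list(zip(*ratings))
deviations = [deviations_from_own_mean(r) for r in by_rater]

# Consistency (IRR): the centered profiles coincide across raters.
same_profile = all(d == deviations[0] for d in deviations)

# Consensus (IRA): spread of the raw scores within each item.
max_gap_per_item = [max(row) - min(row) for row in ratings]

print(deviations[0])
print(same_profile)
print(max_gap_per_item)
```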
Prior to aggregation, we assessed within-unit agreement on … To do so, we used two complementary approaches (Kozlowski & Klein, 2000):
• A consistency-based approach: computation of the intraclass correlation coefficient, ICC(1)
• A consensus-based approach (index of agreement)
How can we assess agreement?
• Variability measures, e.g.:
– Variance
– MAD (Mean Absolute Deviation)
Problem: What are “small / large” values?
The most widely used index of interrater agreement on Likert-type scales has been rWG(J), introduced by James, Demaree & Wolf (1984).
J stands for the number of items in the scale.
Examples where rWG(J) was used to assess interrater agreement:
• Group cohesiveness
• Group socialization emphasis
• Transformational and transactional leadership
• Positive and negative affective group tone
• Organizational climate
This index compares the observed within-group variance to the variance expected from “random responding”. In the particular case of one item (stimulus), J=1, this index is denoted rWG and is equal to

$$r_{WG} = 1 - \frac{S_x^2}{\sigma^2}$$

where $S_x^2$ is the variance of ratings for the single stimulus and $\sigma^2$ is the variance of some “null distribution” corresponding to no agreement.
Problem:
A limitation of rWG(J) is that there is no clear-cut definition of a random response, and the appropriate specification of the null distribution that models no agreement is debatable. If the null distribution used to define $\sigma^2$ fails to model a random response properly, then the interpretability of the index is suspect.
The most natural candidate to represent no agreement is the uniform (rectangular) distribution, which implies that for an item with A categories, the proportion of cases in each category equals 1/A.
For a uniform null
For an item with A categories:

$$\sigma^2_{null} = \frac{A^2 - 1}{12}$$

$$r_{WG} = 1 - \frac{S_x^2}{\sigma^2_{null}}$$
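As a quick check of the closed form, the following Python sketch (function names are illustrative) compares (A² - 1)/12 with the variance computed directly from the probabilities of a discrete uniform distribution on 1..A. Exact fractions are used so that the comparison does not depend on rounding.

```python
from fractions import Fraction

def uniform_null_variance(A):
    """Closed form (A^2 - 1) / 12 for a discrete uniform on 1..A."""
    return Fraction(A * A - 1, 12)

def variance_from_definition(A):
    """Variance of the same distribution computed from its probabilities."""
    p = Fraction(1, A)
    mean = sum(p * x for x in range(1, A + 1))
    return sum(p * (x - mean) ** 2 for x in range(1, A + 1))

for A in (5, 7):
    print(A, uniform_null_variance(A), variance_from_definition(A))
```

For A=5 both routes give 2, and for A=7 both give 4, the values used in the examples of this talk.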
How to calculate the sample variance $S_x^2$? We have n ratings, and suppose n is “small”:

$$S_x^2 = \frac{1}{n} \sum_{k=1}^{n} (X_k - \bar{X})^2$$
Example
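A=5, k=9 raters: 3 3 3 3 5 5 5 5 4

$$S_x^2 = 1 \quad \text{(with } n-1 \text{ in the denominator)}$$

$$\sigma^2_{null} = \frac{A^2 - 1}{12} = 2$$

$$r_{WG} = 1 - \frac{S_x^2}{\sigma^2_{null}} = 1 - \frac{1}{2} = \frac{1}{2}$$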
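The arithmetic of this example can be reproduced with a short Python sketch. The function name `r_wg` and its `unbiased` flag are illustrative choices, not part of the source; the flag switches between the n-1 and n denominators so that the effect of the choice on a small group is visible.

```python
from statistics import pvariance, variance

def r_wg(ratings, A, unbiased=True):
    """Single-item rWG against a uniform null on A categories."""
    s2 = variance(ratings) if unbiased else pvariance(ratings)
    sigma2_null = (A * A - 1) / 12
    return 1 - s2 / sigma2_null

ratings = [3, 3, 3, 3, 5, 5, 5, 5, 4]
print(r_wg(ratings, A=5))                   # n-1 in the denominator
print(r_wg(ratings, A=5, unbiased=False))   # n in the denominator
```

With n-1 in the denominator the result is 0.5, as on the slide; with n in the denominator the variance is 8/9 and the index rises to 5/9.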
James et al. (1984): “The distribution of responses could be non-uniform when no genuine agreement exists among the judges. The systematic biasing of a response distribution due to a common response style within a group of judges be considered. This distribution may be negatively skewed, yielding a smaller variance than the variance of a uniform distribution.”
Slight Skewed Null
Proportions for categories 1 to 5: .05, .15, .20, .35, .25
Yielding $\sigma^2_{null} = 1.34$
Used as a “null distribution” in several studies (e.g., Schriesheim et al., 1995; Shamir et al., 1998). Their justification for choosing this null distribution was that it appears to be a reasonably good approximation of random response to leadership and attitude questionnaires.
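The value 1.34 follows directly from the five proportions. A minimal Python sketch, with an illustrative helper name, computes the variance of any null distribution given as a list of category probabilities:

```python
def discrete_variance(probs):
    """Variance of a distribution on categories 1..A given their probabilities."""
    categories = range(1, len(probs) + 1)
    mean = sum(p * x for p, x in zip(probs, categories))
    return sum(p * (x - mean) ** 2 for p, x in zip(probs, categories))

slight_skew = [0.05, 0.15, 0.20, 0.35, 0.25]
uniform = [0.2] * 5

print(round(discrete_variance(slight_skew), 2))
print(round(discrete_variance(uniform), 2))
```

The same helper can be applied to any other candidate null distribution, provided its probabilities sum to 1.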
Null Distributions, A=5
[Figure: bar chart of the slightly skewed null distribution over categories 1 to 5; vertical axis from 0 to 0.35]
James et al. (1984) suggested several skewed distributions (which differ in their skewness and variance) to accommodate systematic bias.
Often, several null distributions (including the uniform) could be suitable to model disagreement. Thus, the following procedure is suggested. Consider the subset of likely null distributions and calculate the largest and smallest null variance specified by this subset.
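One possible reading of this procedure is to compute rWG under the smallest and the largest null variance in the subset and report the resulting interval. The Python sketch below follows that reading; the choice of the two candidate nulls (uniform and the slight skew for A=5), the sample ratings, and the function names are assumptions made for illustration.

```python
from statistics import variance

def discrete_variance(probs):
    """Variance of a distribution on categories 1..A given their probabilities."""
    categories = range(1, len(probs) + 1)
    mean = sum(p * x for p, x in zip(probs, categories))
    return sum(p * (x - mean) ** 2 for p, x in zip(probs, categories))

def r_wg_range(ratings, candidate_nulls):
    """Return (lowest, highest) rWG over the candidate null distributions."""
    s2 = variance(ratings)  # n-1 denominator, an assumption of this sketch
    null_variances = [discrete_variance(p) for p in candidate_nulls.values()]
    return 1 - s2 / min(null_variances), 1 - s2 / max(null_variances)

candidate_nulls = {
    "uniform": [0.2] * 5,
    "slight skew": [0.05, 0.15, 0.20, 0.35, 0.25],
}
low, high = r_wg_range([3, 3, 3, 3, 5, 5, 5, 5, 4], candidate_nulls)
print(round(low, 3), round(high, 3))
```

The smaller null variance always yields the lower value of the index, so the interval shows how much the conclusion depends on the choice of null.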
Additional “problem”
The index can have negative values:

$$r_{WG} = 1 - \frac{S_x^2}{\sigma^2}$$

It is negative when the observed variance is larger than expected from random response.
Bimodal distribution (extreme disagreement)
Example: A=5, half answer 1, half answer 5
Variance: 4
Uniform variance:

$$\sigma^2_{null} = \frac{A^2 - 1}{12} = 2$$

$$r_{WG} = 1 - \frac{S_x^2}{\sigma^2_{null}} = 1 - \frac{4}{2} = -1$$
What to do when rWG is negative?
James et al. (1984) recommended replacing a negative value by zero. This was criticized by Lindell et al. (1999).
For a scale of J items
$$r_{WG(J)} = \frac{J[1 - (\bar{s}^2/\sigma^2)]}{J[1 - (\bar{s}^2/\sigma^2)] + \bar{s}^2/\sigma^2}$$

where $\bar{s}^2$ is the average variance over the J items.
For a scale of J items
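$$r_1 = 1 - \frac{\bar{s}^2}{\sigma^2}$$

$$r_{WG(J)} = \frac{J[1 - (\bar{s}^2/\sigma^2)]}{J[1 - (\bar{s}^2/\sigma^2)] + \bar{s}^2/\sigma^2} = \frac{J r_1}{J r_1 + (1 - r_1)} = \frac{J r_1}{1 + (J - 1) r_1}$$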
For a scale of J items
$$r_{WG(J)} = \frac{J r_1}{1 + (J - 1) r_1}$$

Spearman-Brown reliability:

$$\frac{k r_1}{1 + (k - 1) r_1}$$
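The algebraic identity between the two forms can be checked numerically. In the Python sketch below the function names and the grid of test values are illustrative; `rwg_j` implements the original ratio form and `spearman_brown` the stepped-up form applied to r1.

```python
def rwg_j(mean_item_variance, null_variance, J):
    """rWG(J) in its original form, from the mean item variance."""
    ratio = mean_item_variance / null_variance
    numerator = J * (1 - ratio)
    return numerator / (numerator + ratio)

def spearman_brown(r1, k):
    """Spearman-Brown step-up of a single-item value r1 to k items."""
    return k * r1 / (1 + (k - 1) * r1)

null_variance = 2.0
for mean_var in (0.5, 1.0, 1.5):
    for J in (1, 3, 5, 10):
        r1 = 1 - mean_var / null_variance
        assert abs(rwg_j(mean_var, null_variance, J) - spearman_brown(r1, J)) < 1e-12

print(rwg_j(1.0, 2.0, 3), spearman_brown(0.5, 3))
```

The last line prints 0.75 for both forms: a single-item value of 0.5 is stepped up to 0.75 by three items.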
Example
3 raters, Likert scale with 7 categories, 5 items

Item   Rater 1   Rater 2   Rater 3   Var*
1      7         4         3         78/27
2      5         2         1         78/27
3      5         2         1         78/27
4      6         3         2         78/27
5      7         4         3         78/27

* Var calculated with n in the denominator

$$1 - \frac{\bar{s}^2}{\sigma^2} = 0.2777$$

$$r_{WG(J)} = \frac{J[1 - (\bar{s}^2/\sigma^2)]}{J[1 - (\bar{s}^2/\sigma^2)] + \bar{s}^2/\sigma^2} = \frac{5[1 - (\bar{s}^2/\sigma^2)]}{1 + 4[1 - (\bar{s}^2/\sigma^2)]} = 0.66$$
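The worked example can be reproduced in a few lines of Python. This is a sketch with illustrative names; it assumes the uniform null for A=7 (variance 4) and, as in the footnote, item variances computed with n in the denominator.

```python
from statistics import mean, pvariance

# Rows are items 1..5, columns are raters 1..3 (from the table above).
ratings = [
    (7, 4, 3),
    (5, 2, 1),
    (5, 2, 1),
    (6, 3, 2),
    (7, 4, 3),
]
A = 7
J = len(ratings)

item_variances = [pvariance(row) for row in ratings]  # n in the denominator
mean_variance = mean(item_variances)                  # 78/27 for every item
null_variance = (A * A - 1) / 12                      # uniform null, equals 4

r1 = 1 - mean_variance / null_variance
rwg_j = J * r1 / (1 + (J - 1) * r1)

print(round(r1, 4), round(rwg_j, 2))
```

Exactly, r1 = 5/18 and rWG(J) = 25/38, which rounds to the 0.66 shown on the slide.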
Since its introduction, the use of rWG(J) has raised several criticisms and debates. It was initially described by James et al. (1984) as a measure of interrater reliability. Schmidt & Hunter (1989) criticized this index, claiming that an index of reliability cannot be defined on a single item.
In response, Kozlowski and Hattrup (1992) argued that it is an index of agreement, not reliability. James, Demaree & Wolf (1993) concurred with this distinction, and it has now been accepted that rWG(J) is a measure of agreement.
Lindell, Brandt and Whitney (1999) suggested, as an alternative to rWG(J), a modified index which is allowed to take negative values (even beyond -1):

$$r^*_{WG(J)} = 1 - \frac{\bar{s}^2}{\sigma^2}$$
The modified index r*WG(J) addresses two of the criticisms raised against rWG(J).
First, it can take negative values when the observed agreement is less than hypothesized. Second, unlike rWG(J), it does not include a Spearman-Brown correction and thus does not depend on the number of items (J).
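The dependence on J can be made concrete with a small Python sketch. The variance values below are hypothetical and chosen only for illustration: for a fixed ratio of mean item variance to null variance, r*WG(J) stays constant while rWG(J) grows as items are added.

```python
def r_star_wg_j(mean_item_variance, null_variance):
    """Lindell et al. (1999) index: no Spearman-Brown step-up."""
    return 1 - mean_item_variance / null_variance

def rwg_j(mean_item_variance, null_variance, J):
    """James et al. (1984) multi-item index."""
    r1 = 1 - mean_item_variance / null_variance
    return J * r1 / (1 + (J - 1) * r1)

mean_var, null_var = 1.5, 2.0   # hypothetical values for illustration
r_star = r_star_wg_j(mean_var, null_var)
by_J = {J: rwg_j(mean_var, null_var, J) for J in (1, 5, 10, 20)}

print(r_star)
for J, value in by_J.items():
    print(J, round(value, 3))
```

With these inputs r*WG(J) is 0.25 regardless of J, whereas rWG(J) moves from 0.25 at J=1 to about 0.87 at J=20.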
Academy of Management Journal, 2006: “Does CEO Charisma Matter…” (Agle et al.)
128 CEOs, 770 team members
“Because of strengths and weaknesses of various interrater agreement measures, we computed both the intraclass correlation statistics ICC(1) and ICC(2), and the interrater agreement statistics r*WG(J)…”
Agle et al. 2006
Overall, the very high interrater agreement justified the combination of individual manager's responses into a single measure of charisma for each CEO…
-----------------
They display ICC(1)= , ICC(2)= , r*WG(J)=
One number?
Ensemble of groups (RMNET)
Shall we report the median or the mean?
“…Observed distributions of rWG(J) are often wildly skewed… medians are the most appropriate summary statistic…”
Ehrhart, M.G. (Personnel Psychology, 2004)
Leadership and procedural justice climate as antecedents of unit-level organizational citizenship behavior
Grocery store chain: 3914 employees in 249 departments
“…The median rwg values across the 249 departments were: 0.88 for servant leadership, …”
WHAT TO CONCLUDE?
Rule-Of-Thumb
The practice of viewing rWG values in the .70s and higher as representing acceptable convergence is widespread. For example, Zohar (2000) cited rWG values in the .70s and mid .80s as proof that judgments “were sufficiently homogeneous for within group aggregation”.
Benchmarking rWG Interrater Agreement Indices: Let's Drop the .70 Rule-Of-Thumb
R.J. Harvey and E. Hollander
Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Chicago, April 2004
“It is puzzling why many researchers and practitioners continue to rely on arbitrary rules-of-thumb to interpret rWG, especially the popular rule-of-thumb stating that rWG ≥ 0.70 denotes acceptable agreement…”
“The justification of the rule rests largely on the argument that some researchers (e.g. James et al., 1984) viewed rater agreement as being similar to reliability, reliabilities as low as .7 are useful (e.g. Nunnally, 1978), therefore rWG ≥ 0.7 implies interrater reliability…”
There is little empirical basis for a .70 cutoff, and few studies have attempted to determine how various values equate with “real world” levels of interrater agreement.
The sources of four commonly reported cutoff criteria
Lance, Butts & Michels (2006), ORM
1) GFI > .9 indicates well-fitting SEMs
2) Reliability of .7 or higher is acceptable
3) rWG values > .7 justify aggregation of individual responses to group-level measures
4) Keep the number of factors whose eigenvalues are greater than 1
Rule-Of-Thumb
A reviewer: “I believe the phrase has its origins in describing the size of the stick appropriate for beating one's wife. A stick was to be no larger in diameter than the man's thumb… Thus, use of this phrase might be offensive to some of the readership…”
Rule-Of-Thumb
Feminists often make the claim that the “rule of thumb” used to mean that it was legal to beat your wife with a rod, so long as that rod was no thicker than the husband's thumb. But it turns out to be an excellent example of what may be called fiction…
Rule-Of-Thumb
From carpentry:
The length from the tip of one's thumb to the first knuckle was used as an approximation for one inch.
As such, we apologize to readers who may be offended by the reference to “rule of thumb” but remind them of the mythology surrounding its interpretation.
Statistical Tests
Test the null hypothesis of no agreement.
Dunlop et al. (JAP, 2003) provided a table of rWG “critical values” under a uniform null distribution (J=1, one item; different numbers of Likert-scale categories A; different numbers of judges).
Critical Values of the rWG Statistic at the 5% Level of Statistical Significance

        A
n       2       3       5       11
4                       1.0
5                       .85     .78
8                       .61     .59
10                      .53     .49
Dunlop et al. (2003): “Statistical tests of rWG are useful if one's objective is to determine if any nonzero agreement exists; although useful, this reflects a qualitatively different goal from determining if reasonable consensus exists for a group to aggregate individual level data to the group level of analysis…”
Alternative index AD
Proposed by Burke, Finkelstein & Dusig (Organizational Research Methods, 1999)

$$AD_{M(j)} = \frac{1}{n} \sum_{k=1}^{n} |X_{jk} - \bar{X}_j|$$

$$AD_{M(J)} = \frac{\sum_{j=1}^{J} AD_{M(j)}}{J}$$
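The two formulas translate directly into code. The Python sketch below uses illustrative function names and applies them to the 3-rater, 5-item ratings from the earlier rWG(J) example, so the two indices can be compared on the same data.

```python
from statistics import mean

def ad_item(item_ratings):
    """AD_M(j): mean absolute deviation of one item's ratings from the item mean."""
    m = mean(item_ratings)
    return mean(abs(x - m) for x in item_ratings)

def ad_scale(ratings_by_item):
    """AD_M(J): average of the item-level AD_M(j) values over the J items."""
    return mean(ad_item(item) for item in ratings_by_item)

# Rows are items 1..5, columns are raters 1..3.
ratings = [
    (7, 4, 3),
    (5, 2, 1),
    (5, 2, 1),
    (6, 3, 2),
    (7, 4, 3),
]
print(round(ad_item(ratings[0]), 3), round(ad_scale(ratings), 3))
```

Every item has the same spread, so AD_M(j) and AD_M(J) both equal 14/9, about 1.56 scale points on the 7-point scale.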
Gonzalez-Roma, V., Peiro, J.M., & Tordera, N. (2002). JAP
“The ADM(J) index has several advantages compared with the James, Demaree and Wolf (1984) interrater agreement index rwg, see Burke et al. (1999)…”
First, the ADM(J) index does not require modeling the null response distribution; it only requires an a priori specification of a null response range of interrater agreement. Second, the index provides estimates of interrater agreement in the metric of the original response range.
We followed Burke and colleagues' (1999) specification of using a null response range equal to or less than 1 when the response scale is a Likert-type 5-point scale. This value is consistent with our judgement that any two contiguous scale points are somewhat similar for the 5-point Likert-type scales used in the present study.
…Organizational commitment was measured by three items (J=3). Respondents answered using a 5-point scale (A=5).
The mean ADM(J) was 0.68 (SD=0.32) and the ICC(1) was .22. The one-way ANOVA result, F(196, 441)=1.7, p<.01, suggests adequate between-group differentiation and supports the validity of the aggregate organizational commitment measure.
Statistical Tests
Test the null hypothesis of no agreement.
Dunlop et al. (JAP, 2003) provided a table of AD “critical values” under a uniform null distribution (one item; different numbers of Likert-scale categories A; different numbers of judges).
Critical Values of the AD Statistic at the 5% Level of Statistical Significance

        A
n       2       3       5       11
4                       0.0     .75
5                       .40     1.04
8                       .69     1.53
10                      .74     1.70
Criticism vs Reality
• Citations and applications of rWG(J) in more than 450 papers
So far, the index rWG(J) has been used much more frequently than ADM(J). We performed a systematic literature search of organizational multilevel studies published during the years 2000-2006 (ADM(J) was introduced in 1999). Among the 41 papers that included justification of the aggregation from individual to group level, 40 (98%) used the index rWG(J) and only 2 (5%) used the index ADM(J). One study used both indices.
Statistical properties of rWG(J)
• Cohen, A., Doveh, E., & Eick, U. (2001). Statistical properties of the rWG(J) index of agreement. Psychological Methods, 6, 297-310.
Studied the sampling distribution of rWG(J) under the null hypothesis.
Simulations to Study the Sampling Distribution
Depends on
• J (number of items in the scale)
• A (number of categories of the Likert scale)
• n (group size)
• The null variance $\sigma^2$
• Correlation structure between items in the scale

$$r_{WG(J)} = \frac{J[1 - (\bar{s}^2/\sigma^2)]}{J[1 - (\bar{s}^2/\sigma^2)] + \bar{s}^2/\sigma^2}$$
Simulations to Study the Sampling Distribution
“Easy” for a single item:

$$r_{WG} = 1 - \frac{S_x^2}{\sigma^2}$$

Simulate data for a discrete “null” distribution
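A minimal Python sketch of such a single-item simulation is given below. It is not the authors' software: the function names, the number of replications, the seed, the n-1 variance denominator, and the simple index-based percentile are all assumptions made for illustration. It draws groups of n ratings from a uniform null on 1..A and reads off an upper percentile of the simulated rWG values.

```python
import random
from statistics import mean, variance

def simulate_rwg_null(n, A, reps, seed=0):
    """Draw `reps` groups of n uniform ratings on 1..A; return sorted rWG values."""
    rng = random.Random(seed)
    sigma2_null = (A * A - 1) / 12
    values = []
    for _ in range(reps):
        group = [rng.randint(1, A) for _ in range(n)]
        values.append(1 - variance(group) / sigma2_null)  # n-1 denominator
    return sorted(values)

def percentile(sorted_values, q):
    """Empirical q-quantile by index into the sorted sample (no interpolation)."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]

null_values = simulate_rwg_null(n=10, A=5, reps=5000)
critical_95 = percentile(null_values, 0.95)
print(round(mean(null_values), 3), round(critical_95, 2))
```

An observed rWG above `critical_95` would be declared significant at the 5% level under this null. Because the n-1 variance is unbiased, the simulated values average close to zero.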
Dependent vs Independent
Display of E(rWG(J)):
• n = 3, 10, 100, corresponding to small, medium, and large group sizes
• J = 6, 10
• Data either uniform or slight skew
• A = 5
• Items independent or CS (Compound Symmetry) with ρ = 0.6
Compound Symmetry Structure CS
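$$V_{3 \times 3} = \begin{pmatrix} \sigma^2 & \rho\sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \rho\sigma^2 & \sigma^2 \end{pmatrix}$$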
[Figure: uniform data, uniform null (“Error of first kind”)]
[Figure: skew data, uniform null (“power”)]
[Figure: CS, ρ = 0.6]
Testing Agreement for Multi-item Scales with the Indices rWG(J) and ADM(J)
Ayala Cohen, Etti Doveh & Inbal Shani
Organizational Research Methods (2007)
Table of .95 percentiles of rWG(J), CS correlation structure, J=5, A=5

n (group size)   rho=0   rho=.3   rho=.4
3                0.87    0.90     0.90
5                0.74    0.77     0.78
6                0.70    0.73     0.75
7                0.69    0.70     0.72
8                0.66    0.66     0.67
15               0.52    0.56     0.58
20               0.49    0.50     0.53
Simulation
• Software available: Simulate the null distribution of the index for a given n, A, J, k and the correlation structure among the items.
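The following Python sketch illustrates the idea of such a simulation; it is not the software referred to above. Several choices are assumptions of this sketch: correlated items are generated from latent normal scores with a shared rater-level factor (so ρ is the latent correlation, not the correlation of the discrete ratings), the latent scores are cut into A equiprobable categories to give a uniform null, item variances use n-1, and negative single-item values are set to zero before the step-up.

```python
import random
from statistics import NormalDist, mean, variance

def simulate_group(n, A, J, rho, rng):
    """n raters x J items with compound-symmetric latent correlation rho."""
    norm = NormalDist()
    group = []
    for _ in range(n):
        common = rng.gauss(0, 1)  # rater-level factor shared by all J items
        row = []
        for _ in range(J):
            z = rho ** 0.5 * common + (1 - rho) ** 0.5 * rng.gauss(0, 1)
            # Map the latent score to one of A equiprobable categories.
            row.append(min(A, int(norm.cdf(z) * A) + 1))
        group.append(row)
    return group

def rwg_j(group, A):
    """rWG(J) against a uniform null; negative r1 is truncated to zero."""
    J = len(group[0])
    item_vars = [float(variance([row[j] for row in group])) for j in range(J)]
    ratio = mean(item_vars) / ((A * A - 1) / 12)
    r1 = max(0.0, 1 - ratio)
    return J * r1 / (1 + (J - 1) * r1)

def null_percentile(n, A, J, rho, q=0.95, reps=2000, seed=1):
    """Simulated q-quantile of rWG(J) under the null."""
    rng = random.Random(seed)
    values = sorted(rwg_j(simulate_group(n, A, J, rho, rng), A) for _ in range(reps))
    return values[min(reps - 1, int(q * reps))]

critical_95 = null_percentile(n=10, A=5, J=5, rho=0.3)
print(round(critical_95, 2))
```

Because of the assumptions listed above, the simulated percentile need not match the published table exactly; the sketch only shows how n, A, J and the correlation structure enter the computation.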
Example
Bliese et al.'s (2002) sample of 2042 soldiers in 49 U.S. Army Companies. The companies ranged in size from n=10 to n=99.
Task significance is a three-item scale (J=3) assessing a sense of task significance during the deployment (A=5).
Sample   Values         Simulation-based percentiles   Rejection area (1=in, 0=out)
size     rwg    AD      .95 rwg     .05 AD             rwg    AD
10       .00    .97     .72         .63                0      0
20       .67    .69     .61         .72                1      1
37       .44    .78     .48         .80                0      1
41       .46    .81     .47         .80                0      0
45       .00    .93     .45         .81                0      0
53       .00    .99     .43         .82                0      0
58       .39    .76     .43         .83                0      1
68       .16    .91     .40         .85                0      0
85       .51    .80     .37         .86                1      1
99       .29    .87     .34         .88                0      1
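The rejection indicators in the table follow from a simple decision rule: rWG(J) is significant when it exceeds the simulated .95 percentile, and AD is significant when it falls below the simulated .05 percentile. The Python sketch below applies that rule to the ten rows shown; the use of strict inequalities is an assumption (the table contains no ties), and the names are illustrative.

```python
# (n, rwg, ad, rwg_p95, ad_p05) for the ten companies listed in the table.
rows = [
    (10, .00, .97, .72, .63),
    (20, .67, .69, .61, .72),
    (37, .44, .78, .48, .80),
    (41, .46, .81, .47, .80),
    (45, .00, .93, .45, .81),
    (53, .00, .99, .43, .82),
    (58, .39, .76, .43, .83),
    (68, .16, .91, .40, .85),
    (85, .51, .80, .37, .86),
    (99, .29, .87, .34, .88),
]

def in_rejection_area(rwg, ad, rwg_p95, ad_p05):
    """Return (rwg_flag, ad_flag): 1 if the index is in its rejection area."""
    return int(rwg > rwg_p95), int(ad < ad_p05)

flags = [in_rejection_area(*row[1:]) for row in rows]
for (n, *_), (rwg_flag, ad_flag) in zip(rows, flags):
    print(n, rwg_flag, ad_flag)
```

Note the opposite directions of the two tests: high values of rWG(J) indicate agreement, whereas low values of AD do.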
Inference on the Ensemble
• ICC summarizes the whole structure, assuming homogeneity of within variances.
• Agreement indices are derived for each group separately.
• How to infer on the ensemble? Analogous to regression analysis: overall F test vs individual t tests
Based on the rWG(J) tests, significance was obtained for 5 companies, while based on ADM(J) it was obtained for 6 companies.
Under the null hypothesis, the probability of obtaining at least 5 (out of 49 independent tests), with a significance level α=0.05, is 0.097 and it is 0.035 for obtaining at least 6.
Thus, if we allow a global significance level of no more than 0.05, we cannot reject the null hypothesis based on rWG(J) , but will reject based on ADM(J).
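The two tail probabilities come from a binomial distribution with 49 independent tests, each with rejection probability 0.05 under the null. A short Python sketch (the function name is illustrative) reproduces them:

```python
from math import comb

def prob_at_least(k, n, alpha):
    """P(X >= k) for X ~ Binomial(n, alpha)."""
    return sum(comb(n, i) * alpha**i * (1 - alpha) ** (n - i) for i in range(k, n + 1))

p_at_least_5 = prob_at_least(5, 49, 0.05)
p_at_least_6 = prob_at_least(6, 49, 0.05)
print(round(p_at_least_5, 3), round(p_at_least_6, 3))
```

The first value is above 0.05 and the second is below it, which is why the global conclusion differs between the two indices.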
Ensemble of groups (RMNET)
What shall we do with groups that fail the threshold?
1) Toss them out because something is “wrong” with these groups. The logic is that if they do not agree, then the construct has not solidified with them; maybe they are not a collective, or they are distinct subgroups…
Ensemble of groups (RMNET)
2) KEEP them… If on average we get reasonable agreement across groups on a construct, it justifies aggregation for all. (Compare CFA: we index everyone; even if some people may have answered in a way that is inconsistent across items, we do not drop them.)
Open Questions
• Extension of slight skew to A>5
• Power analysis
• Comparison of degrees of agreement, non-null cases