Measuring Agreement
Introduction
Different types of agreement:
- Diagnosis by different methods (disease absent or disease present): do both methods give the same results?
- Staging of carcinomas: will different methods lead to the same results? Will different raters lead to the same results?
- Measurements of blood pressure: how consistent are measurements made using different devices, with different observers, or at different times?
Investigating agreement
Need to consider:
- Data type: categorical or continuous
- How are the data repeated? Measuring instrument(s), rater(s), time(s)
- The goal: are ratings consistent? Estimate the magnitude of differences between measurements; investigate factors that affect ratings
- Number of raters
Data type
Categorical:
- Binary: disease absent, disease present
- Nominal: hepatitis (viral A, B, C, D, E, or autoimmune)
- Ordinal: severity of disease (mild, moderate, severe)
Continuous:
- Size of tumour
- Blood pressure
How are data repeated?
Same person, same measuring instrument:
- Different observers: inter-rater reliability
- Same observer at different times: intra-rater reliability (repeatability)
Internal consistency: do the items of a test measure the same attribute?
Measures of agreement
Categorical:
- Kappa (unweighted, weighted, Fleiss')
Continuous:
- Limits of agreement
- Coefficient of variation (CV)
- Intraclass correlation (ICC)
Internal consistency:
- Cronbach's α
Number of raters
- Two
- Three or more
Categorical data: two raters
Kappa: magnitudes commonly quoted
- ≥0.75 excellent; 0.40 to 0.75 fair to good; <0.40 poor
- or: 0 to 0.20 slight; >0.20 to 0.40 fair; >0.40 to 0.60 moderate; >0.60 to 0.80 substantial; >0.80 almost perfect
Degree of disagreement can be included: weighted kappa
- Values close together do not count towards disagreement as much as those further apart
- Linear / quadratic weightings
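As a sketch of how the unweighted and weighted statistics are computed for two raters (pure Python; the function name, weighting scheme via category index distance, and example data are our own illustrations, not from the slides):

```python
def cohen_kappa(r1, r2, weights=None):
    """Cohen's kappa for two raters over the same subjects.

    weights=None gives unweighted kappa; "linear" or "quadratic" give
    weighted kappa, where near-misses count less than distant disagreements.
    """
    cats = sorted(set(r1) | set(r2))
    idx = {c: i for i, c in enumerate(cats)}
    n, k = len(r1), len(cats)
    # observed cell proportions of the k x k agreement table
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[idx[a]][idx[b]] += 1.0 / n
    # marginal proportions for each rater
    p1 = [sum(row) for row in obs]
    p2 = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):  # disagreement weight between categories i and j
        if weights is None:
            return 0.0 if i == j else 1.0
        d = abs(i - j)
        return float(d) if weights == "linear" else float(d * d)

    d_obs = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(w(i, j) * p1[i] * p2[j] for i in range(k) for j in range(k))
    return 1.0 - d_obs / d_exp


# illustrative ratings on a 1-5 scale (made-up data, not Example 1)
r1 = [1, 2, 2, 3, 4, 5, 5, 3]
r2 = [1, 2, 3, 3, 4, 5, 4, 3]
```

With ordinal scores whose disagreements are mostly near-misses, weighted kappa is typically larger than unweighted kappa, since distant disagreements are penalised most heavily.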
Categorical data: more than two raters
- Different tests for binomial data and for data with more than two categories
- Online calculators: http://www.vassarstats.net/kappa.html
Example 1: two raters
Scores 1 to 5:
- Unweighted kappa 0.79, 95% CI (0.62 to 0.96)
- Linear weighting 0.84, 95% CI (0.70 to 0.98)
- Quadratic weighting 0.90, 95% CI (0.77 to 1.00)
Example 2
Binomial data; three raters, two ratings each:
- Inter-rater agreement
- Intra-rater agreement
Example 2 ctd.
Inter-rater agreement:
- Kappa(1,2) = 0.865 (P<0.001)
- Kappa(1,3) = 0.054 (P=0.765)
- Kappa(2,3) = -0.071 (P=0.696)
Intra-rater agreement:
- Kappa(1) = 0.800 (P<0.001)
- Kappa(2) = 0.790 (P<0.001)
- Kappa(3) = 0.000 (P=1.000)
Continuous data
- Test for bias
- Check the differences are not related to the magnitude of the measurements
- Calculate the mean and SD of the differences
- Limits of agreement
- Coefficient of variation
- ICC
Test for bias
- Student's paired t test (mean) or Wilcoxon matched-pairs test (median)
- If there is bias, agreement cannot be investigated further
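A minimal sketch of the paired t statistic used here (pure Python; it returns the statistic and its degrees of freedom, from which a P value would be looked up — the function name and example data are illustrative):

```python
import math

def paired_t(x, y):
    """Paired t statistic for bias between two methods on the same subjects."""
    d = [a - b for a, b in zip(x, y)]  # within-subject differences
    n = len(d)
    mean_d = sum(d) / n                # mean difference (the bias)
    s = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))  # SD of differences
    return mean_d / (s / math.sqrt(n)), n - 1  # t statistic, degrees of freedom
```

A t statistic near zero (large P value) means no evidence of systematic bias, so the agreement analysis can proceed.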
Example 3: test for bias
Paired t test: P=0.362, so no evidence of bias
Check differences unrelated to magnitude
[Scatter plot of the differences against the size of the measurements: clearly no relationship]
Calculate mean and SD of the differences

                     N    Mean     Std. Deviation
Difference           17   4.9412   21.72404
Valid N (listwise)   17

(the mean difference is 4.94; s, the SD of the differences, is 21.72)
Limits of agreement
- Lower limit of agreement (LLA) = mean - 1.96×s = -37.6
- Upper limit of agreement (ULA) = mean + 1.96×s = 47.5
- 95% of differences between a pair of measurements for an individual lie in (-37.6, 47.5)
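The slide's limits follow directly from its mean and SD; a pure-Python sketch for computing them from raw paired data (the raw data behind the slide are not shown, so the example data are made up):

```python
import math

def limits_of_agreement(a, b):
    """95% (Bland-Altman) limits of agreement for paired measurements."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean_d = sum(d) / n
    s = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    return mean_d - 1.96 * s, mean_d + 1.96 * s


# the slide's figures, reproduced from its mean and SD:
lla = 4.9412 - 1.96 * 21.72404   # about -37.6
ula = 4.9412 + 1.96 * 21.72404   # about 47.5
```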
Coefficient of variation
- Measure of the variability of the differences, expressed as a proportion of the average measured value
- Suitable when the error (the differences between pairs) increases with the measured values; the other measures require this not to be the case
- CV = 100 × s ÷ mean of the measurements = 100 × 21.72 ÷ 447.88 = 4.85%
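The slide's arithmetic, plus a sketch of the same calculation from raw paired data (the function name and data layout are our own; the underlying data are again not shown):

```python
import math

def coefficient_of_variation(a, b):
    """SD of the paired differences as a percentage of the overall mean measurement."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean_d = sum(d) / n
    s = math.sqrt(sum((v - mean_d) ** 2 for v in d) / (n - 1))
    grand_mean = (sum(a) + sum(b)) / (2 * n)   # mean of all the measurements
    return 100 * s / grand_mean


cv = 100 * 21.72404 / 447.88   # the slide's figures: about 4.85%
```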
Intraclass correlation
- Continuous data; two or more sets of measurements
- A measure of correlation that adjusts for differences in scale
- Several models:
  - Absolute agreement or consistency
  - Raters chosen randomly or the same raters throughout
  - Single or average measures
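As one concrete instance of these models, a sketch of the one-way random-effects, single-measures ICC, often written ICC(1) (the other models are based on two-way ANOVA and differ in detail; the data layout here is an assumption):

```python
def icc_oneway(ratings):
    """One-way random-effects, single-measures ICC.

    ratings: one list of k ratings per subject.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(r) for r in ratings) / (n * k)
    means = [sum(r) / k for r in ratings]        # per-subject means
    # between-subjects and within-subjects mean squares
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2 for r, m in zip(ratings, means) for x in r) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

Perfect agreement gives an ICC of 1; an ICC near or below 0 means the within-subject (rater) variability swamps the between-subject variability.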
Intraclass correlation
Magnitudes commonly quoted:
- ≥0.75 excellent; 0.40 to 0.75 fair to good; <0.40 poor
Cronbach's α
- Internal consistency of a total score made up of several components
- α ≥ 0.8 good; α ≥ 0.7 adequate
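A sketch of the standard formula, α = k/(k-1) × (1 - Σ component variances ÷ variance of the totals), in pure Python (the data layout, one score list per component, is an assumption):

```python
def cronbach_alpha(items):
    """Cronbach's alpha; items is one list of scores per component (same subjects)."""
    k = len(items)       # number of components
    n = len(items[0])    # number of subjects

    def var(xs):         # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))
```

When the components all rise and fall together across subjects, α approaches 1; weakly related components pull it down.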
Investigating agreement
- Data type
  - Categorical: kappa
  - Continuous: limits of agreement, coefficient of variation, intraclass correlation
- How are the data repeated? Measuring instrument(s), rater(s), time(s)
- Number of raters
  - Two: straightforward
  - Three or more: help!