Inter-rater Reliability > Fleiss’ Kappa
What is Fleiss’ Kappa?
Fleiss’ Kappa is a way to measure agreement between three or more raters. It is recommended when you have Likert scale data or other closed-ended, ordinal scale or nominal scale (categorical) data. Like a correlation coefficient, Kappa can range from −1 to 1; in practice you will usually see values between 0 and 1, where:
- 0 is no agreement (or agreement that you would expect to find by chance),
- 1 is perfect agreement.
It is also possible to get negative values, meaning agreement is less than you would expect by chance. For practical purposes, these values can be counted as zero, or no agreement. In general, a coefficient over .75 (75%) is considered “good”, although what exactly is an “acceptable” level of agreement depends largely on your specific field. In other words, check with your supervisor, professor or previous research before concluding that a Fleiss’ kappa over .75 is acceptable.
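The calculation itself follows Fleiss (1971): average the per-subject agreement to get the observed agreement P̄, compute the chance agreement Pₑ from the overall category proportions, and take kappa = (P̄ − Pₑ) / (1 − Pₑ). A minimal sketch in Python (the function name and the example ratings are illustrative):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters who put
    subject i into category j. Every subject must be rated by the
    same number of raters."""
    N = len(counts)                           # number of subjects
    n = sum(counts[0])                        # raters per subject
    k = len(counts[0])                        # number of categories
    # p_j: proportion of all ratings that fall in category j
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    # P_i: observed agreement on subject i
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P) / N                        # mean observed agreement
    P_e = sum(pj * pj for pj in p)            # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Three raters sort four subjects into two categories with full agreement:
ratings = [[3, 0], [0, 3], [3, 0], [0, 3]]
print(fleiss_kappa(ratings))  # 1.0
```

Swapping in ratings where the raters split 2–1 on every subject (e.g. `[[2, 1], [1, 2], [2, 1], [1, 2]]`) yields a negative kappa, the "worse than chance" case described above.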
In some cases, Fleiss’ Kappa may return low values even when agreement is actually high. This is a known (and largely, unavoidable) problem and is one reason why it’s less popular than other measures of inter-rater agreement.
Similar Measures of Inter-rater Reliability
- Fleiss’ Kappa is an extension of Cohen’s kappa for three raters or more. In addition, the assumption with Cohen’s kappa is that your raters are deliberately chosen and fixed. With Fleiss’ kappa, the assumption is that your raters were chosen at random from a larger population.
- Kendall’s Tau is used when you have ranked data, like two people ordering 10 candidates from most preferred to least preferred.
- Krippendorff’s alpha is useful when you have multiple raters and multiple possible ratings.
References
Falotico, R., & Quatto, P. (2015). “Fleiss’ kappa statistic without paradoxes.” Quality & Quantity, 49, 463–470.
Fleiss, J. L., & Cohen, J. (1973). “The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability.” Educational and Psychological Measurement, 33, 613–619.