What is Inter-rater Reliability?
Inter-rater reliability is the level of agreement between raters or judges. If everyone agrees, IRR is 1 (or 100%) and if everyone disagrees, IRR is 0 (0%). Several methods exist for calculating IRR, from the simple (e.g. percent agreement) to the more complex (e.g. Cohen’s Kappa). Which one you choose largely depends on what type of data you have and how many raters are in your model.
Inter-Rater Reliability Methods
1. Percent Agreement for Two Raters
The most basic measure of inter-rater reliability is the percent agreement between raters.
To find percent agreement for two raters, a table (like the one above) is helpful.
- Count the number of ratings in agreement. In the above table, that’s 3.
- Count the total number of ratings. For this example, that’s 5.
- Divide the number in agreement by the total to get a fraction: 3/5.
- Convert to a percentage: 3/5 = 60%.
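The steps above can be sketched in a few lines of Python. The two lists of scores are hypothetical (they are not the values from the original table), chosen only so that 3 of the 5 ratings match; the function name `percent_agreement` is likewise just an illustrative choice.

```python
def percent_agreement(rater_a, rater_b):
    """Fraction of items on which two raters gave the same rating."""
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must rate the same number of items.")
    if not rater_a:
        raise ValueError("At least one rating is required.")
    # Count the ratings in agreement, then divide by the total number of ratings.
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return agreements / len(rater_a)


# Hypothetical scores from two judges for five contestants.
judge_1 = [4, 3, 5, 2, 4]
judge_2 = [4, 3, 4, 2, 5]

agreement = percent_agreement(judge_1, judge_2)
print(f"{agreement:.0%}")  # 60%
```

Exact equality is used here because the ratings are treated as categories; if near-misses should count for ordinal scores, a different measure is needed.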
The field you are working in will determine the acceptable agreement level. If it’s a sports competition, you might accept a 60% rater agreement to decide a winner. However, if you’re looking at data from cancer specialists deciding on a course of treatment, you’ll want a much higher agreement — above 90%. In general, above 75% is considered acceptable for most fields.
Percent Agreement for Multiple Raters
If you have multiple raters, calculate the percent agreement as follows:
As you can probably tell, calculating percent agreements for more than a handful of raters can quickly become cumbersome. For example, if you had 6 judges, you would have 15 combinations of pairs to calculate for each contestant (for n judges, the number of pairs is n(n − 1)/2).
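One common way to extend percent agreement to several raters is to compute it for every pair of raters and average the results. The sketch below assumes that approach; the ratings and the function name `mean_pairwise_agreement` are hypothetical. It also shows how quickly the number of pairs grows.

```python
from itertools import combinations
from math import comb


def mean_pairwise_agreement(ratings_by_rater):
    """Average percent agreement over every pair of raters.

    ratings_by_rater: one list of ratings per rater, all the same length.
    """
    pair_scores = []
    for rater_a, rater_b in combinations(ratings_by_rater, 2):
        matches = sum(a == b for a, b in zip(rater_a, rater_b))
        pair_scores.append(matches / len(rater_a))
    return sum(pair_scores) / len(pair_scores)


# Hypothetical scores from three judges for five contestants.
ratings = [
    [4, 3, 5, 2, 4],
    [4, 3, 4, 2, 5],
    [4, 2, 5, 2, 4],
]

print(f"{mean_pairwise_agreement(ratings):.0%}")  # 60%
print(comb(6, 2))  # 15 pairs for 6 judges
```

With three judges there are only 3 pairs to check, but the count rises to 15 for 6 judges and 45 for 10, which is why this is rarely done by hand.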
A major flaw with this type of inter-rater reliability is that it doesn’t take chance agreement into account, so it overestimates the level of agreement. This is the main reason why percent agreement shouldn’t be used for academic work (e.g. dissertations or academic publications).
Several methods have been developed that are easier to compute (usually they are built into statistical software packages) and take chance into account:
- If you have one or two meaningful pairs, use Interclass correlation (equivalent to the Pearson Correlation Coefficient).
- If you have more than a couple of pairs, use Intraclass correlation. This is one of the most popular IRR methods and is used for two or more raters.
- Cohen’s Kappa: commonly used for categorical variables.
- Fleiss’ Kappa: similar to Cohen’s Kappa, suitable when you have a constant number of m raters randomly sampled from a population of raters, with a different sample of m coders rating each subject.
- Gwet’s AC2 Coefficient is calculated easily in Excel with the AgreeStat add-on.
- Krippendorff’s Alpha is arguably the best measure of inter-rater reliability, but it is computationally complex.
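To illustrate what "taking chance into account" means, here is a minimal sketch of Cohen’s Kappa for two raters and categorical ratings, using the standard formula kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater’s category frequencies. The data and the function name `cohens_kappa` are hypothetical; in practice a statistical package would normally be used.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must rate the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: the plain percent agreement.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both raters pick the same category
    # if each chose independently according to their own frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2

    if p_e == 1:
        raise ValueError("Kappa is undefined when chance agreement is 1.")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical yes/no decisions from two raters on ten cases.
rater_a = ["y", "y", "y", "y", "y", "y", "n", "n", "n", "n"]
rater_b = ["y", "y", "y", "y", "n", "n", "n", "n", "n", "y"]

print(f"{cohens_kappa(rater_a, rater_b):.2f}")  # 0.40
```

In this example the raters agree on 7 of 10 cases (70%), but half of that agreement would be expected by chance alone, so kappa comes out at 0.40, noticeably lower than the raw percent agreement.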