Statistics How To

Inter-rater Reliability IRR: Definition, Calculation

Design of Experiments > Inter-rater Reliability

What is Inter-rater Reliability?

Inter-rater reliability is the level of agreement between raters or judges. If everyone agrees, IRR is 1 (or 100%) and if everyone disagrees, IRR is 0 (0%). Several methods exist for calculating IRR, from the simple (e.g. percent agreement) to the more complex (e.g. Cohen’s Kappa). Which one you choose largely depends on what type of data you have and how many raters are in your model.

Inter-Rater Reliability Methods

1. Percent Agreement for Two Raters

The basic measure for inter-rater reliability is a percent agreement between raters.

inter-rater reliability

In this competition, judges agreed on 3 out of 5 scores. Percent agreement is 3/5 = 60%.




To find percent agreement for two raters, a table (like the one above) is helpful.

  1. Count the number of ratings in agreement. In the above table, that’s 3.
  2. Count the total number of ratings. For this example, that’s 5.
  3. Divide the total by the number in agreement to get a fraction: 3/5.
  4. Convert to a percentage: 3/5 = 60%.

The field you are working in will determine the acceptable agreement level. If it’s a sports competition, you might accept a 60% rater agreement to decide a winner. However, if you’re looking at data from cancer specialists deciding on a course of treatment, you’ll want a much higher agreement — above 90%. In general, above 75% is considered acceptable for most fields.

Percent Agreement for Multiple Raters

If you have multiple raters, calculate the percent agreement as follows:

Step 1: Make a table of your ratings. For this example, there are three judges:
multiple agreement

Step 2: Add additional columns for the combinations(pairs) of judges. For this example, the three possible pairs are: J1/J2, J1/J3 and J2/J3.
multiple agreement 2

Step 3: For each pair, put a “1” for agreement and “0” for agreement. For example, contestant 4, Judge 1/Judge 2 disagreed (0), Judge 1/Judge 3 disagreed (0) and Judge 2 / Judge 3 agreed (1).
multiple agreement 3

Step 4: Add up the 1s and 0s in an Agreement column:
multiple agreement 4

Step 5: Find the mean for the fractions in the Agreement column.
Mean = (3/3 + 0/3 + 3/3 + 1/3 + 1/3) / 5 = 0.53, or 53%.
The inter-rater reliability for this example is 54%.

Disadvantages

As you can probably tell, calculating percent agreements for more than a handful of raters can quickly become cumbersome. For example, if you had 6 judges, you would have 16 combinations of pairs to calculate for each contestant (use our combinations calculator to figure out how many pairs you would get for multiple judges).

A major flaw with this type of inter-rater reliability is that it doesn’t take chance agreement into account and overestimate the level of agreement. This is the main reason why percent agreement shouldn’t be used for academic work (i.e. dissertations or academic publications).

Alternative Methods

Several methods have been developed that are easier to compute (usually they are built into statistical software packages) and take chance into account:

  • If you have one or two meaningful pairs, use Interclass correlation (equivalent to the Pearson Correlation Coefficient).
  • If you have more than a couple of pairs, use Intraclass correlation. This is one of the most popular IRR methods and is used for two or more raters.
  • Cohen’s Kappa: commonly used for categorical variables.
  • Fleiss’ Kappa: similar to Cohen’s Kappa, suitable when you have a constant number of m raters randomly sampled from a population of raters, with a different sample of m coders rating each subject.
  • Gwet’s AC2 Coefficient is calculated easily in Excel with the AgreeStats add on.
  • Krippendorff’s Alpha is arguably the best measure of inter-rater reliability, but it computationally complex.
------------------------------------------------------------------------------

If you prefer an online interactive environment to learn R and statistics, this free R Tutorial by Datacamp is a great way to get started. If you're are somewhat comfortable with R and are interested in going deeper into Statistics, try this Statistics with R track.

Comments are now closed for this post. Need help or want to post a correction? Please post a comment on our Facebook page and I'll do my best to help!
Inter-rater Reliability IRR: Definition, Calculation was last modified: November 13th, 2017 by Stephanie Glen

2 thoughts on “Inter-rater Reliability IRR: Definition, Calculation

  1. Peter Bill

    Hello dear Statistics how to team,

    if i want to calculate the IRR with Krippendorfs Alpha, Gwet’s AC2 or Kendalls W,
    and I have two cases, that I want to rate, is it then necessary, that each judge have to
    rate all both cases? Or is it possible, that every ramdomized judge just rate one of the cases?

    Thanks a lot!
    Stefan Bill

  2. Andale Post author

    All are measures of inter rater reliability. If you don’t have two judges rate the same thing, then you can’t measure agreement between them.