# Inter-rater Reliability

## What is Inter-rater Reliability?

Inter-rater reliability is the **level of agreement between raters or judges**. If everyone agrees, IRR is 1 (or 100%) and if everyone disagrees, IRR is 0 (0%). Several methods exist for calculating IRR, from the simple (e.g. percent agreement) to the more complex (e.g. Cohen’s Kappa). Which one you choose largely depends on what type of data you have and how many raters are in your model.

## Inter-Rater Reliability Methods

### 1. Percent Agreement for Two Raters

The basic measure for inter-rater reliability is a **percent agreement between raters**.

To find the **percent agreement for two raters**, start with a table of ratings, with one column per rater.

- Count the number of ratings in agreement. In this example, that’s 3.
- Count the total number of ratings. For this example, that’s 5.
- Divide the number in agreement by the total number of ratings to get a fraction: 3/5.
- Convert to a percentage: 3/5 = 60%.

The field you are working in will determine the acceptable agreement level. If it’s a sports competition, you might accept a 60% rater agreement to decide a winner. However, if you’re looking at data from cancer specialists deciding on a course of treatment, you’ll want a much higher agreement — above 90%. In general, above 75% is considered acceptable for most fields.
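As a quick sanity check, here is a minimal sketch in Python; the two raters’ labels are made up, chosen so that three of the five ratings match and reproduce the 60% figure above.

```python
# Hypothetical ratings from two raters for five contestants;
# three of the five ratings match, giving 3/5 = 60% agreement.
rater_1 = ["A", "B", "B", "C", "A"]
rater_2 = ["A", "B", "C", "C", "B"]

agreements = sum(r1 == r2 for r1, r2 in zip(rater_1, rater_2))
percent_agreement = agreements / len(rater_1)
print(f"{percent_agreement:.0%}")  # 60%
```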

### Percent Agreement for Multiple Raters

If you have multiple raters, calculate the percent agreement as follows:

Step 1: Make a table of your ratings, with one column per judge. For this example, there are three judges (J1, J2 and J3) rating five contestants.

Step 2: Add additional columns for the combinations (pairs) of judges. For this example, the three possible pairs are: J1/J2, J1/J3 and J2/J3.

Step 3: For each pair, put a “1” for agreement and “0” for disagreement. For example, for contestant 4, Judge 1 and Judge 2 disagreed (0), Judge 1 and Judge 3 disagreed (0), and Judge 2 and Judge 3 agreed (1).

Step 4: Add up the 1s and 0s for each contestant into an Agreement column, expressed as a fraction of the three pairs. For this example, the Agreement column contains 3/3, 0/3, 3/3, 1/3 and 1/3.

Step 5: Find the mean for the fractions in the Agreement column.

Mean = (3/3 + 0/3 + 3/3 + 1/3 + 1/3) / 5 = 0.53, or 53%.

The inter-rater reliability for this example is 53%.
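The same calculation is easy to script. Below is a minimal Python sketch; the judges’ ratings are made up, chosen only so that they reproduce the per-contestant agreement pattern (3/3, 0/3, 3/3, 1/3, 1/3) from the example.

```python
from itertools import combinations

# Hypothetical ratings for five contestants from judges J1, J2 and J3,
# chosen to match the example's agreement pattern.
ratings = {
    "J1": ["A", "A", "B", "A", "C"],
    "J2": ["A", "B", "B", "B", "C"],
    "J3": ["A", "C", "B", "B", "A"],
}

pairs = list(combinations(ratings, 2))  # J1/J2, J1/J3, J2/J3
n_items = len(ratings["J1"])

# Fraction of judge pairs that agree on each contestant.
per_item = [
    sum(ratings[a][i] == ratings[b][i] for a, b in pairs) / len(pairs)
    for i in range(n_items)
]

overall = sum(per_item) / n_items
print([round(p, 2) for p in per_item])  # [1.0, 0.0, 1.0, 0.33, 0.33]
print(f"{overall:.0%}")                 # 53%
```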

### Disadvantages

As you can probably tell, calculating percent agreements for more than a handful of raters can quickly become cumbersome. For example, if you had 6 judges, you would have 15 combinations of pairs to calculate for each contestant (in general, n judges give n(n − 1)/2 pairs).
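To see how fast the number of pairs grows, here is a quick check in Python (`math.comb` requires Python 3.8+):

```python
from math import comb

# Number of distinct rater pairs, C(n, 2), for a few panel sizes.
for n_judges in (3, 6, 10):
    print(f"{n_judges} judges -> {comb(n_judges, 2)} pairs")
# 3 judges -> 3 pairs
# 6 judges -> 15 pairs
# 10 judges -> 45 pairs
```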

A major flaw with this type of inter-rater reliability is that it doesn’t take chance agreement into account, so it overestimates the true level of agreement. This is the main reason why **percent agreement shouldn’t be used for academic work** (i.e. dissertations or academic publications).

### Alternative Methods

Several methods have been developed that are easier to compute (usually they are built into statistical software packages) and take chance into account:

- If you have one or two meaningful pairs, use **Interclass correlation** (equivalent to the Pearson Correlation Coefficient).
- If you have more than a couple of pairs, use **Intraclass correlation**. This is one of the most popular IRR methods and is used for two or more raters.
- **Cohen’s Kappa**: commonly used for categorical variables.
- **Fleiss’ Kappa**: similar to Cohen’s Kappa; suitable when you have a constant number of m raters randomly sampled from a population of raters, with a different sample of m raters rating each subject.
- **Gwet’s AC2 coefficient**: calculated easily in Excel with the AgreeStats add-on.
- **Krippendorff’s Alpha**: arguably the best measure of inter-rater reliability, but it is computationally complex.
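To see how a chance-corrected statistic differs from raw percent agreement, here is a minimal sketch of Cohen’s Kappa for two raters; the categorical ratings are made up for illustration.

```python
from collections import Counter

# Hypothetical categorical ratings from two raters on the same ten subjects.
rater_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
rater_2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "no"]
n = len(rater_1)

# Observed agreement: plain percent agreement.
p_o = sum(a == b for a, b in zip(rater_1, rater_2)) / n

# Expected chance agreement, from each rater's marginal category frequencies.
counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
p_e = sum((counts_1[c] / n) * (counts_2[c] / n) for c in set(rater_1) | set(rater_2))

kappa = (p_o - p_e) / (1 - p_e)
print(round(p_o, 2), round(p_e, 2), round(kappa, 2))
# 0.7 0.5 0.4 -> 70% raw agreement, but kappa = 0.4 after removing chance agreement
```

In practice you would normally use a packaged implementation (for example scikit-learn’s `cohen_kappa_score` or the R `irr` package) rather than computing it by hand.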



Hello dear Statistics How To team,

If I want to calculate the IRR with Krippendorff’s Alpha, Gwet’s AC2 or Kendall’s W, and I have two cases that I want rated, does each judge have to rate both cases? Or can each randomly assigned judge rate just one of the cases?

Thanks a lot!

Stefan Bill

*Reply:* All of these are measures of inter-rater reliability. If you don’t have two judges rate the same thing, then you can’t measure agreement between them.