Intraclass Correlation

Intraclass correlation measures the reliability of ratings or measurements for clusters (data that has been collected as groups or sorted into groups). A related term is interclass correlation, which is usually another name for Pearson correlation (other statistics, like Cohen’s kappa, are occasionally used instead, but this is rare). Pearson’s is usually used for inter-rater reliability when you only have meaningful pairs from one or two raters. For more raters, you’ll want to use the ICC. Like most correlation coefficients, the ICC ranges from 0 to 1.

  • A high Intraclass Correlation Coefficient (ICC) close to 1 indicates high similarity between values from the same group.
  • A low ICC close to zero means that values from the same group are not similar.

This is best illustrated with an example. In the image below, values from the same group are clustered fairly tightly together. For example, group 3 (on the x-axis) is clustered between about -1.3 and -0.4 on the y-axis. Most of the groups are similarly clustered, giving the entire set a high ICC of 0.91:

A dotplot of a dataset with high intraclass correlation. Image: skbkekas|Wikimedia Commons.

Compare that set to the following graph of a dataset with an extremely low ICC of 0.07. The values within groups are widely scattered without any clusters:

Dataset with a low ICC. Image: Skbkekas|Wikimedia Commons.
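
If you want to reproduce the idea behind these two plots, here is a minimal sketch in Python (with made-up group sizes and spread, not the data from the images) that simulates one tightly clustered dataset and one scattered dataset, then computes a simple one-way ICC from the between-group and within-group mean squares:

import numpy as np

def one_way_icc(groups):
    """One-way random-effects ICC(1,1): (MSB - MSW) / (MSB + (n - 1) * MSW)."""
    k = len(groups)                      # number of groups
    n = len(groups[0])                   # values per group (balanced design assumed)
    grand_mean = np.mean(np.concatenate(groups))
    group_means = np.array([np.mean(g) for g in groups])
    ms_between = n * np.sum((group_means - grand_mean) ** 2) / (k - 1)
    ms_within = sum(np.sum((g - m) ** 2) for g, m in zip(groups, group_means)) / (k * (n - 1))
    return (ms_between - ms_within) / (ms_between + (n - 1) * ms_within)

rng = np.random.default_rng(42)

# High ICC: each group gets its own mean, with little scatter around it.
high = [rng.normal(loc=mu, scale=0.3, size=5) for mu in rng.normal(0, 1, size=6)]

# Low ICC: every value is drawn from the same distribution, so group labels are arbitrary.
low = [rng.normal(loc=0, scale=1, size=5) for _ in range(6)]

print("tightly clustered groups:", round(one_way_icc(high), 2))  # close to 1
print("scattered groups:        ", round(one_way_icc(low), 2))   # near 0 (can even be slightly negative by chance)

Values within a group that hug their own group mean push the between-group mean square well above the within-group mean square, which is exactly what drives the ICC toward 1.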



Common Uses and Calculation

The ICC is used to measure a wide variety of numerical data from clusters or groups, including:

  • How closely relatives resemble each other with regard to a certain characteristic or trait.
  • Reproducibility of numerical measurements made by different people measuring the same thing.

Calculating the ICC is very complex by hand, partly because of the number of ICC formulas to choose from, and partly because the formulas themselves are complex. The main reason for all of this complexity is that the ICC is very flexible: it can be adjusted for designs where raters do not rate every ratee. For example, let’s say you have a group of 10 raters who rate 20 ratees. If 9 of the raters rate 15 of the ratees and 1 rater rates all of them, or if 10 raters rate 2 each, you can still calculate the ICC.
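
To give a flavour of that flexibility, here is a rough sketch (hypothetical data, one-way random-effects model only) of one common ANOVA estimator when the groups are unbalanced; compared with the balanced case, the common group size is replaced by an adjusted average group size, usually written n0:

import numpy as np

def one_way_icc_unbalanced(groups):
    """ANOVA estimator of the one-way ICC when groups have unequal sizes."""
    k = len(groups)
    sizes = np.array([len(g) for g in groups])
    total = sizes.sum()
    grand_mean = np.mean(np.concatenate(groups))
    means = np.array([np.mean(g) for g in groups])
    ms_between = np.sum(sizes * (means - grand_mean) ** 2) / (k - 1)
    ms_within = sum(np.sum((g - m) ** 2) for g, m in zip(groups, means)) / (total - k)
    n0 = (total - np.sum(sizes ** 2) / total) / (k - 1)   # adjusted average group size
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)

# Hypothetical unbalanced design: three ratees received 4, 2 and 5 ratings.
ratee_a = np.array([7.0, 8.0, 7.5, 8.5])
ratee_b = np.array([3.0, 3.5])
ratee_c = np.array([5.0, 5.5, 4.5, 5.0, 6.0])
print(one_way_icc_unbalanced([ratee_a, ratee_b, ratee_c]))
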
Calculating the ICC is usually performed with software, and each program has its own terminology and quirks. For example, in SPSS, you’re given three different options for calculating the ICC.

  • If you have inconsistent raters/ratees, use “One-Way Random.”
  • If you have consistent raters/ratees (e.g. 10 raters each rate 10 ratees) and your raters are a random sample from a larger population of possible raters, use “Two-Way Random.”
  • If you have consistent raters/ratees (e.g. 10 raters each rate 10 ratees) and your raters are the only raters of interest (i.e. the whole population of raters), use “Two-Way Mixed.”
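
If you work in Python rather than SPSS, a similar menu of choices is available elsewhere. As one hedged example (assuming the third-party pingouin package and a made-up long-format table of scores), pingouin.intraclass_corr reports all six Shrout-Fleiss ICC forms at once, and you pick the row that matches your design:

import pandas as pd
import pingouin as pg   # assumes pingouin is installed

# Made-up "consistent" design: every rater scores every ratee.
df = pd.DataFrame({
    "ratee": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D"],
    "rater": ["r1", "r2", "r3"] * 4,
    "score": [8, 7, 8, 5, 5, 6, 3, 2, 3, 9, 9, 8],
})

icc = pg.intraclass_corr(data=df, targets="ratee", raters="rater", ratings="score")
print(icc[["Type", "Description", "ICC"]])

# Rough mapping to the SPSS menu choices:
#   ICC1 / ICC1k -> One-Way Random (single / average measures)
#   ICC2 / ICC2k -> Two-Way Random (single / average measures)
#   ICC3 / ICC3k -> Two-Way Mixed  (single / average measures)

The plain forms correspond to SPSS’s “single measures” and the “k” forms to “average measures”; whichever software you use, report the model, type and unit you chose, since they can give quite different numbers.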

One thought on “Intraclass Correlation”

  1. Pieter

    Thank you for your insights into ICCs. However, I’m still struggling to choose the right model and type. I’m hoping you could give me a push on how to assess the uniformity of a newly developed measurement device. Specifics used to assess consistency:

    a sample size of 16 participants (random)
    2 measurement instruments: 1 x newly developed measurement device and 1 x gold-standard measurement device
    2 different conditions were (repeatedly) assessed per session
    2 sessions on different days

    On the one hand, I want to know the intra-session robustness and compare it (categorically) to a gold standard. On the other hand, the test-retest agreement is key.

    => intra-session: 3 measurements were taken per condition per session, each providing a single value
    3 measurements from the same device are compared per condition within each session. Both the two-way mixed and the two-way random model seem to be applied within my research domain for validation studies. Values do not seem to differ between the mixed and random model; however, you mentioned that mixed is typically larger than random. The interest here is how consistent these values are among participants, so the consistency type? For all comparisons, we always use the average values, so I would choose the average measures, though these values are always higher. Still, most validation-reliability studies report the single measurement value even though they want to state that 1 measurement value is representative, whereas I am nowadays using the average of 3 values and will do so later on.
    ==> two-way random, average, consistency ?

    => inter-session / test-retest: the average measures of session 1 versus session 2 are assessed.
    To exclude systematic bias in the amplitude of values over sessions, I was thinking about absolute agreement instead of consistency. Values from person to person within each session are very different; heterogeneity. Can I use the absolute agreement, or should I e.g. create a Bland-Altman plot for such an assessment?
    ==> two-way random, single, absolute ?

    I really want to use the correct methodology. Thank you in advance