Tetrachoric Correlation: Definition, Examples, Formula

What is Tetrachoric Correlation?

Tetrachoric correlation estimates the association between two underlying continuous variables that have each been dichotomized into a binary outcome (e.g., yes/no, correct/incorrect). It assumes that those latent variables (hidden variables that can’t be measured or observed directly) follow a bivariate normal distribution.
Binary data are variables with only two possible categories.

The tetrachoric correlation estimates the Pearson correlation that would exist if the variables had been measured on their original continuous scale instead of being dichotomized.

It is commonly used in Item Response Theory (IRT), psychometrics, and for converting comorbidity statistics into underlying correlation coefficients that reflect the association between latent continuous traits.

An advantage of tetrachoric correlation is that it is less influenced by marginal proportions (base rates) than some other association measures (such as the phi coefficient), because it explicitly models the underlying continuous distribution rather than depending solely on the observed cell frequencies.
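The sensitivity of phi to base rates can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the function names `phi_coefficient` and `simulate_phi`, the latent correlation of 0.5, the cut points, and the sample size are all hypothetical choices, not part of any library. It draws pairs from a bivariate normal distribution with a fixed latent correlation, dichotomizes both variables at the same cut point, and computes phi from the resulting 2×2 table.

```python
import math
import random


def phi_coefficient(a, b, c, d):
    """Phi for the 2x2 table [[a, b], [c, d]]: (ad - bc) / sqrt(product of margins)."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom


def simulate_phi(rho, cut, n=20000, seed=0):
    """Dichotomize simulated bivariate normal pairs at `cut` and return phi."""
    rng = random.Random(seed)
    counts = [[0, 0], [0, 0]]  # counts[x_above_cut][y_above_cut]
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        x = z1
        # y is built so that corr(x, y) = rho on the latent scale
        y = rho * z1 + math.sqrt(1 - rho ** 2) * z2
        counts[int(x > cut)][int(y > cut)] += 1
    (a, b), (c, d) = counts
    return phi_coefficient(a, b, c, d)


# Same latent correlation, two different base rates
phi_median_split = simulate_phi(rho=0.5, cut=0.0)   # about 50% "yes"
phi_extreme_split = simulate_phi(rho=0.5, cut=1.5)  # about 7% "yes"
print(phi_median_split, phi_extreme_split)
```

In this sketch the latent correlation never changes, yet phi should come out lower for the extreme cut point than for the median split. That dependence on the cut point is what the tetrachoric coefficient tries to avoid by modeling the latent scale directly.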

The term “tetrachoric correlation” comes from the tetrachoric series, a series expansion used to compute the coefficient by hand before the advent of computers. Today the coefficient is usually estimated numerically with methods like maximum likelihood estimation, but there is a basic formula you can use to get an approximation.
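As a rough illustration of the numerical route, the sketch below estimates the coefficient using only the Python standard library. It is a minimal sketch, not a reference implementation. The helper names (`bvn_density`, `bvn_cdf`, `tetrachoric`), the cell naming `n00`/`n01`/`n10`/`n11`, and the example counts are all hypothetical. It sets the two thresholds from the marginal proportions, then searches for the correlation at which the bivariate normal model reproduces the observed proportion in the (0, 0) cell. For a 2×2 table with no empty cells, the model has as many free parameters as the table has free proportions, so this matching solution should coincide with the maximum likelihood estimate.

```python
import math
from statistics import NormalDist


def bvn_density(h, k, r):
    """Standard bivariate normal density at (h, k) with correlation r."""
    s = 1.0 - r * r
    expo = -(h * h - 2.0 * r * h * k + k * k) / (2.0 * s)
    return math.exp(expo) / (2.0 * math.pi * math.sqrt(s))


def bvn_cdf(h, k, rho, steps=200):
    """P(X <= h, Y <= k), computed as Phi(h) * Phi(k) plus the integral of
    the density over r from 0 to rho (Simpson's rule; steps must be even)."""
    nd = NormalDist()
    base = nd.cdf(h) * nd.cdf(k)
    if rho == 0.0:
        return base
    width = rho / steps
    total = bvn_density(h, k, 0.0) + bvn_density(h, k, rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * bvn_density(h, k, i * width)
    return base + total * width / 3.0


def tetrachoric(n00, n01, n10, n11, tol=1e-8):
    """n_xy = count of cases with first variable = x and second variable = y.
    Assumes every margin is nonzero so that both thresholds are finite."""
    n = n00 + n01 + n10 + n11
    h = NormalDist().inv_cdf((n00 + n01) / n)  # threshold for first variable
    k = NormalDist().inv_cdf((n00 + n10) / n)  # threshold for second variable
    target = n00 / n                           # observed P(both = 0)
    lo, hi = -0.999, 0.999
    while hi - lo > tol:                       # bisection: the CDF increases with rho
        mid = (lo + hi) / 2.0
        if bvn_cdf(h, k, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical counts: median splits on both variables, one third of cases in (0, 0)
print(round(tetrachoric(200, 100, 100, 200), 3))  # 0.5
```

The example table was chosen because its answer is known in closed form: with both thresholds at zero, the (0, 0) cell probability is 1/4 + arcsin(ρ)/(2π), which equals 1/3 when ρ = 0.5. Established statistical packages handle empty cells and standard errors, which this sketch does not attempt.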

Formula and Example

The formula involves the cosine trigonometric function and can be applied to a 2×2 matrix or contingency table:

r_tet = cos(180° / (1 + √(BC/AD))).

For the contingency table shown, note the placement of a/b/c/d in the table. Plugging those values into the formula (BC = 32 × 17 and AD = 13 × 23), we get:

r_tet = cos(180° / (1 + √((32 × 17) / (13 × 23))))
r_tet = cos(180° / (1 + 1.34885))
r_tet = cos(180° / 2.34885)
r_tet = cos(76.63324°)
r_tet ≈ 0.23.

A value of 0.23 indicates a weak positive correlation.
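The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the cosine approximation only. The function name `tetrachoric_cos_approx` is hypothetical, and the assignment of the individual counts to a, b, c, and d is an assumption that matters only through the products BC = 32 × 17 and AD = 13 × 23 used in the worked example.

```python
import math


def tetrachoric_cos_approx(a, b, c, d):
    """Cosine approximation: r_tet = cos(180 degrees / (1 + sqrt(BC / AD)))."""
    if a * d == 0:
        raise ValueError("A and D must both be nonzero for this formula")
    ratio = (b * c) / (a * d)
    angle_deg = 180.0 / (1.0 + math.sqrt(ratio))
    # math.cos expects radians, so convert the angle from degrees first
    return math.cos(math.radians(angle_deg))


r = tetrachoric_cos_approx(a=13, b=32, c=17, d=23)
print(round(r, 2))  # 0.23
```

Note the conversion to radians: passing 76.63324 directly to `math.cos` would treat it as radians and give a different, incorrect value.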

Assumptions for the Test

The two main assumptions are:

The underlying variables come from a normal distribution. With only two binary variables, this assumption cannot be tested from the data, because a single 2×2 table has too few cells to check how well the model fits. You should, therefore, have a good theoretical reason for using this particular type of correlation; in other words, you might know that the type of data you are dealing with tends to follow a normal distribution most of the time. Rating errors should also follow a normal distribution.
There is a latent continuous scale underneath your binary data. In other words, the trait you are measuring should be continuous and not discrete.

In addition, you may want to make sure that errors are independent between raters and between cases, and that the error variance is homogeneous across levels of the independent variable.

What the Correlation Means

The tetrachoric correlation coefficient r_tet (sometimes written as r* or r_t) tells you how strong (or weak) the association is between two binary variables, such as the ratings from two raters. Like other correlation coefficients, it ranges from −1 to 1: a “0” indicates no association, a “1” represents perfect positive association (perfect agreement between raters), and negative values mean the ratings tend to disagree. Most correlations will fall somewhere in between; what constitutes an acceptable level of agreement largely depends on what type of data you’re dealing with. For example, medical ratings between medical professionals will require a higher level of agreement than most non-medical situations. In general, a value above about 0.7 is usually considered “strong enough.”
