Statistics How To

Jaccard Index / Similarity Coefficient

What is the Jaccard Index?

The Jaccard similarity index (sometimes called the Jaccard similarity coefficient) compares the members of two sets to see which members are shared and which are distinct. It’s a measure of similarity for the two sets of data, with a range from 0% to 100%. The higher the percentage, the more similar the two populations. Although it’s easy to interpret, it is extremely sensitive to small sample sizes and may give erroneous results, especially with very small samples or data sets with missing observations.

How to Calculate the Jaccard Index

The formula to find the Index is:

Jaccard Index = (the number in both sets) / (the number in either set) * 100

The same formula in notation is:
J(X,Y) = |X∩Y| / |X∪Y|

In steps, that’s:

  1. Count the number of members which are shared between both sets.
  2. Count the total number of distinct members in either set (shared and un-shared), counting each shared member only once.
  3. Divide the number of shared members (1) by the total number of members (2).
  4. Multiply the number you found in (3) by 100.

This percentage tells you how similar the two sets are.

  • Two sets that share all members would be 100% similar. The closer to 100%, the more similarity (e.g. 90% is more similar than 89%).
  • If they share no members, they are 0% similar.
  • The midway point, 50%, means that half of the distinct members across the two sets are shared.
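
The four steps above can be sketched in Python using the built-in set type. This is a minimal illustration rather than a library routine: the function name `jaccard_index_percent` is hypothetical, and raising an error for two empty sets is an assumed convention (the formula is undefined there, because the union is empty).

```python
def jaccard_index_percent(x, y):
    """Return the Jaccard index of two collections as a percentage."""
    x, y = set(x), set(y)
    shared = len(x & y)   # step 1: members found in both sets
    total = len(x | y)    # step 2: distinct members found in either set
    if total == 0:
        # Assumed convention: refuse to score two empty sets (0 / 0).
        raise ValueError("Jaccard index is undefined for two empty sets")
    ratio = shared / total   # step 3: shared members / total members
    return ratio * 100       # step 4: convert to a percentage


print(jaccard_index_percent({"a", "b"}, {"a", "b"}))            # 100.0
print(jaccard_index_percent({"a", "b"}, {"c", "d"}))            # 0.0
print(jaccard_index_percent({"a", "b", "c"}, {"a", "b", "d"}))  # 50.0
```

The three calls correspond to the three cases listed above: all members shared, no members shared, and half of the distinct members shared.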

Examples

A simple example using set notation: How similar are these two sets?

  • A = {0,1,2,5,6}
  • B = {0,2,3,4,5,7,9}

Solution: J(A,B) = |A∩B| / |A∪B| = |{0,2,5}| / |{0,1,2,3,4,5,6,7,9}| = 3/9 ≈ 0.33.

Notes:

  1. The cardinality of A, denoted |A|, is a count of the number of elements in set A.
  2. Although it’s customary to leave the answer in decimal form if you’re using set notation, you could multiply by 100 to get a similarity of 33.33%.
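
The set-notation example can be checked directly with Python's set operators, where `&` is intersection and `|` is union. This is only a quick verification sketch; the variable names are chosen for illustration.

```python
A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}

intersection = A & B   # members in both A and B
union = A | B          # distinct members in A or B

# len() gives the cardinality, i.e. |A∩B| and |A∪B|.
j = len(intersection) / len(union)

print(sorted(intersection))   # [0, 2, 5]
print(len(union))             # 9
print(round(j, 2))            # 0.33
```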

Example problem without set notation: Researchers are studying biodiversity in two rainforests. They catalog specimens from six different species: A, B, C, D, E, and F. Two species are shared between the two rainforests. What is the Jaccard coefficient?


Solution:

  1. Two species are shared between both populations.
  2. There are 6 unique species in the two populations.
  3. 2/6 = 1/3
  4. 1/3 * 100 = 33.33%.

The two rainforests are about 33% similar.
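
When only the counts are known, as in the rainforest problem, the index can be computed without building the sets at all. The sketch below assumes a hypothetical helper, `jaccard_from_counts`, that takes the number of shared members and the number of distinct members overall; it uses the standard library's `Fraction` so the exact ratio is visible before it is converted to a percentage.

```python
from fractions import Fraction


def jaccard_from_counts(shared, distinct_total):
    """Jaccard index from the shared count and the count of distinct members."""
    if distinct_total <= 0 or not 0 <= shared <= distinct_total:
        raise ValueError("need 0 <= shared <= distinct_total and distinct_total > 0")
    return Fraction(shared, distinct_total)


j = jaccard_from_counts(shared=2, distinct_total=6)
print(j)                          # 1/3
print(f"{float(j) * 100:.2f}%")   # 33.33%
```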

Jaccard Distance

A similar statistic, the Jaccard distance, is a measure of how dissimilar two sets are. It is the complement of the Jaccard index and can be found by subtracting the Jaccard index from 100%. For the above example, the Jaccard distance is 100% – 33.33% = 66.67%.

In set notation, subtract from 1 for the Jaccard Distance:
D(X,Y) = 1 – J(X,Y)
Note, though, that the decimals are usually converted to percentages, as these are easier to interpret.
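
As a rough sketch, the distance formula translates to Python as follows. The function name `jaccard_distance` is illustrative, raising an error for two empty sets is an assumed convention, and the sets A and B are reused from the set-notation example earlier on this page.

```python
def jaccard_distance(x, y):
    """Return D(X,Y) = 1 - J(X,Y) as a decimal between 0 and 1."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        # Assumed convention: the ratio is undefined when both sets are empty.
        raise ValueError("Jaccard distance is undefined for two empty sets")
    return 1 - len(x & y) / len(union)


A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}

d = jaccard_distance(A, B)
print(f"{d:.4f}")         # 0.6667
print(f"{d * 100:.2f}%")  # 66.67%
```

Equivalently, the distance is the share of distinct members that appear in only one of the two sets: here 6 of the 9 members.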

What to do with missing values

Sometimes data sets will have missing observations, which makes calculating similarity challenging. You have several options for filling in these missing data points:

Jaccard Index / Similarity Coefficient was last modified: October 12th, 2017 by Stephanie Glen

3 thoughts on “Jaccard Index / Similarity Coefficient”

  1. danny Sy

    Hi!
    Brilliant. However, I want to compare two 300-page policy documents: accounting standards (FRS 102 and IFRS). How would I be able to do it?

    Many thanks,

    Danny

  2. Meredith

    Can you explain why the Jaccard index is sensitive to small samples? I’ve seen it mentioned in several places but haven’t been able to find a good explanation.

  3. Andale Post author

    If you have a small sample, it’s hard to determine if there’s actually a similarity. As an extreme example, let’s say you have two identical bugs, one from each population. Can you say that the populations are 100% similar? No. If you had a thousand bugs from each population, and they were all identical, then you’ve got better evidence. A few here and there just doesn’t give you enough data.