Jaccard Index / Similarity Coefficient

What is the Jaccard Index?

The Jaccard similarity index (sometimes called the Jaccard similarity coefficient) compares the members of two sets to see which members are shared and which are distinct. It’s a measure of similarity for the two sets of data, with a range from 0% to 100%. The higher the percentage, the more similar the two populations. Although it’s easy to interpret, it is extremely sensitive to small sample sizes and may give erroneous results, especially with very small samples or data sets with missing observations.

How to Calculate the Jaccard Index

The formula to find the Index is:

Jaccard Index = (the number in both sets) / (the number in either set) * 100
The same formula in notation is:
J(X,Y) = |X∩Y| / |X∪Y|

In steps, that’s:

1. Count the number of members that are shared between both sets.
2. Count the total number of distinct members across the two sets (shared and unshared), counting each shared member only once.
3. Divide the number of shared members (1) by the total number of members (2).
4. Multiply the number you found in (3) by 100.

This percentage tells you how similar the two sets are.

Two sets that share all members would be 100% similar. The closer to 100%, the more similarity (e.g. 90% is more similar than 89%).
If they share no members, they are 0% similar.
The midway point, 50%, means that half of all the distinct members across the two sets are shared.
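
The four steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name jaccard_index is made up for this example, and raising an error when both sets are empty is an assumption, since the formula gives 0/0 in that case.

```python
def jaccard_index(x, y):
    """Return the Jaccard index of two collections as a percentage (0 to 100)."""
    x, y = set(x), set(y)
    shared = len(x & y)   # step 1: members in both sets
    total = len(x | y)    # step 2: distinct members in either set
    if total == 0:
        # Assumption for this sketch: treat two empty sets as undefined (0/0).
        raise ValueError("Jaccard index is undefined for two empty sets")
    return shared / total * 100   # steps 3 and 4


# 2 shared members out of 4 distinct members
print(jaccard_index({"a", "b", "c"}, {"b", "c", "d"}))  # 50.0
```

Converting the inputs with set() means duplicates in a list are ignored, which matches the set-based definition.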

Examples

A simple example using set notation: How similar are these two sets?

A = {0,1,2,5,6}
B = {0,2,3,4,5,7,9}

Solution: J(A,B) = |A∩B| / |A∪B| = |{0,2,5}| / |{0,1,2,3,4,5,6,7,9}| = 3/9 ≈ 0.33.

Notes:

The cardinality of A, denoted |A|, is a count of the number of elements in set A.
Although it’s customary to leave the answer in decimal form if you’re using set notation, you could multiply by 100 to get a similarity of 33.33%.
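
The set-notation example can be checked with Python's built-in set type, where & gives the intersection and | gives the union. The variable names below are chosen only for this sketch.

```python
A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}

intersection = A & B   # members in both sets
union = A | B          # members in either set

j = len(intersection) / len(union)

print(sorted(intersection))   # [0, 2, 5]
print(len(union))             # 9
print(round(j, 2))            # 0.33
print(round(j * 100, 2))      # 33.33
```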

Example problem without set notation: Researchers are studying biodiversity in two rainforests. They catalog specimens from six different species: A, B, C, D, E and F. Two species are shared between the two rainforests. What is the Jaccard coefficient?

Solution:

Two species are shared between both populations.
There are 6 unique species in the two populations.
2/6 = 1/3
1/3 * 100 = 33.33%.

The two rainforests are about 33% similar.
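
The same calculation works with labels instead of numbers. The problem does not say which species live in which rainforest, so the assignment below is purely hypothetical, chosen only so that two of the six species are shared.

```python
# Hypothetical species lists: the problem only states that there are
# six species in total and that two of them occur in both rainforests.
rainforest_1 = {"A", "B", "C", "D"}
rainforest_2 = {"C", "D", "E", "F"}

shared = rainforest_1 & rainforest_2        # species found in both
all_species = rainforest_1 | rainforest_2   # species found in either

similarity = len(shared) / len(all_species) * 100

print(len(shared), len(all_species))   # 2 6
print(round(similarity, 2))            # 33.33
```

Any other assignment with two shared species out of six gives the same result, because the index depends only on the two counts.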

Jaccard Distance

A related statistic, the Jaccard distance, is a measure of how dissimilar two sets are. It is the complement of the Jaccard index and can be found by subtracting the Jaccard index from 100%. For the above example, the Jaccard distance is 100% – 33.33% = 66.67%.

In set notation, subtract from 1 for the Jaccard Distance:
D(X,Y) = 1 – J(X,Y)
Note, though, that the decimals are usually converted to percentages, as these are easier to interpret.
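
As a small sketch, the distance can be computed directly from the formula D(X,Y) = 1 – J(X,Y). The function name jaccard_distance is an assumption for this example, as is the choice to raise an error for two empty sets.

```python
def jaccard_distance(x, y):
    """Return the Jaccard distance 1 - J(X, Y) as a decimal between 0 and 1."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        # Assumption for this sketch: two empty sets are treated as undefined.
        raise ValueError("Jaccard distance is undefined for two empty sets")
    return 1 - len(x & y) / len(union)


A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 4, 5, 7, 9}

d = jaccard_distance(A, B)
print(round(d, 4))   # 0.6667
print(f"{d:.2%}")    # 66.67%
```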

What to do with missing values

Sometimes data sets will have missing observations, which makes calculating similarity challenging. You have several options for filling in these missing data points:

Fill in the blank areas with zeros,
Replace the missing values with the median for the set,
Use a k-nearest neighbor or EM (expectation-maximization) algorithm to estimate the missing values.
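
The first two options can be sketched as follows, assuming the data are presence/absence records (1 = present, 0 = absent) with None marking a missing observation. The records, the helper names fill_missing and jaccard_binary, and the strategy argument are all invented for this illustration. The k-nearest neighbor and EM approaches are not shown.

```python
from statistics import median


def fill_missing(values, strategy="zero"):
    """Replace None entries with 0 or with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0
    elif strategy == "median":
        # Note: with an even number of 0/1 observations the median can be 0.5,
        # which would need a rounding rule; the data below avoid that case.
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def jaccard_binary(u, v):
    """Jaccard index (as a decimal) for two equal-length 0/1 records."""
    both = sum(1 for a, b in zip(u, v) if a and b)     # present in both
    either = sum(1 for a, b in zip(u, v) if a or b)    # present in at least one
    if either == 0:
        raise ValueError("undefined when nothing is present in either record")
    return both / either


# Hypothetical records with one missing observation each.
site_1 = [1, 1, None, 0, 1, 1]
site_2 = [1, 0, 1, 0, None, 0]

for strategy in ("zero", "median"):
    u = fill_missing(site_1, strategy)
    v = fill_missing(site_2, strategy)
    print(strategy, jaccard_binary(u, v))
# zero 0.2
# median 0.4
```

In this made-up example the choice of fill method doubles the index, which illustrates how sensitive the statistic can be to missing observations in small data sets.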
