Jaccard index
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets.
The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the sample sets: [ J(A,B) = |A \cap B|/|A \cup B|].
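As a minimal sketch of this definition, the following Python function computes the index for two finite sets. The name `jaccard_index` and the choice to return 1.0 when both sets are empty are assumptions made for illustration; the definition itself leaves that case undefined, since the union would have size 0.

```python
def jaccard_index(a, b):
    """Return |a ∩ b| / |a ∪ b| for two iterables treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


# Intersection {2, 3} has size 2, union {1, 2, 3, 4} has size 4.
print(jaccard_index({1, 2, 3}, {2, 3, 4}))  # 0.5
```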
A related measure is the Jaccard distance, which measures dissimilarity between sample sets. The Jaccard distance is obtained by subtracting the size of the intersection of the sets from the size of the union, and dividing the resulting quantity by the size of the union. This is equivalent to subtracting the Jaccard coefficient from 1: [ J'(A,B) = 1 - J(A,B) = {|A \cup B| - |A \cap B| \over |A \cup B|} ].
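The distance can be sketched in the same way: the size of the union minus the size of the intersection, divided by the size of the union. The function name `jaccard_distance` and the value 0.0 returned for two empty sets are illustrative assumptions, not part of the definition.

```python
def jaccard_distance(a, b):
    """Return (|a ∪ b| - |a ∩ b|) / |a ∪ b| for two iterables treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as having distance 0.
        return 0.0
    return (len(union) - len(a & b)) / len(union)


# Union has size 4, intersection has size 2, so the distance is (4 - 2) / 4.
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```

Identical sets give a distance of 0 and disjoint sets give a distance of 1.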
Similarity of Asymmetric Binary Attributes
Given two objects, A and B, each with n binary attributes, the Jaccard coefficient measures the overlap between the attributes of A and B. Each attribute of A and B is either 0 or 1. The number of attributes in each combination of values for A and B is denoted as follows:
- [M_{11}] represents the total number of attributes where A and B both have a value of 1.
- [M_{01}] represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
- [M_{10}] represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
- [M_{00}] represents the total number of attributes where A and B both have a value of 0.
Each attribute falls into exactly one of these four categories, so
- [M_{11} + M_{01} + M_{10} + M_{00} = n ].
The Jaccard similarity coefficient, J, is given as
- [J = {M_{11} \over M_{01} + M_{10} + M_{11}} ].
The Jaccard distance, J', is given as
- [J' = {M_{01} + M_{10} \over M_{01} + M_{10} + M_{11}} ].
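The counts described above can be tallied directly from two equal-length sequences of 0/1 values. The sketch below is illustrative only: the function name `binary_jaccard`, the sample vectors, and the handling of the case where no attribute is 1 in either object are assumptions.

```python
def binary_jaccard(a, b):
    """Return (J, J') for two equal-length sequences of 0/1 attributes."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same number of attributes")
    # Count the three combinations in which at least one object has a 1.
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    denom = m01 + m10 + m11
    if denom == 0:
        # Assumption: when every attribute is 0 in both objects,
        # treat them as identical (J = 1, J' = 0).
        return 1.0, 0.0
    return m11 / denom, (m01 + m10) / denom


a = [1, 0, 1, 1, 0, 0]
b = [1, 1, 0, 1, 0, 0]
# M11 = 2, M01 = 1, M10 = 1, so J = 2/4 and J' = 2/4.
print(binary_jaccard(a, b))  # (0.5, 0.5)
```

The count of attributes that are 0 in both objects never enters either formula, which is what makes the measure suitable for asymmetric binary attributes, where a shared 0 carries little information.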
Tanimoto Coefficient (Extended Jaccard Coefficient)
Cosine similarity measures the similarity between two vectors of n dimensions by the angle between them, and is often used to compare documents in text mining. Given two vectors of attributes, A and B, the angle between them, [\theta], is expressed using a dot product and magnitudes as
- [ \theta = \arccos \left( {A \cdot B \over \|A\| \|B\|} \right) ].
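A minimal sketch of this calculation, assuming the usual form in which the angle is the arccosine of the dot product divided by the product of the two magnitudes. The function name `cosine_angle` is hypothetical, and the clamping step is an added safeguard against floating-point rounding rather than part of the formula.

```python
import math


def cosine_angle(a, b):
    """Return the angle in radians between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("the angle is undefined for a zero vector")
    cos_theta = dot / (norm_a * norm_b)
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.acos(cos_theta)


# Orthogonal vectors are a quarter turn apart (pi/2 radians).
print(cosine_angle([1, 0], [0, 1]))
```

A smaller angle indicates more similar vectors; vectors pointing in the same direction give an angle of 0 (up to rounding).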
This cosine similarity metric may be extended such that it yields the Jaccard coefficient in the case of binary attributes. This is the Tanimoto coefficient, [T(A,B)], represented as
- [ T(A,B) = {A \cdot B \over \|A\|^2 + \|B\|^2 - A \cdot B} ].
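The sketch below assumes the commonly given form of the Tanimoto coefficient, the dot product divided by the sum of the squared magnitudes minus the dot product. The function name `tanimoto`, the sample vectors, and the value returned for two zero vectors are illustrative assumptions.

```python
def tanimoto(a, b):
    """Return A.B / (|A|^2 + |B|^2 - A.B) for two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    sq_a = sum(x * x for x in a)
    sq_b = sum(y * y for y in b)
    denom = sq_a + sq_b - dot
    if denom == 0:
        # Assumption: the denominator is 0 only when both vectors are zero;
        # they are treated here as identical.
        return 1.0
    return dot / denom


a = [1, 0, 1, 1, 0, 0]
b = [1, 1, 0, 1, 0, 0]
# A.B = 2, |A|^2 = 3, |B|^2 = 3, so T = 2 / (3 + 3 - 2).
print(tanimoto(a, b))  # 0.5
```

For 0/1 vectors the dot product counts the attributes that are 1 in both objects and each squared magnitude counts the 1s in one object, so the denominator counts the attributes that are 1 in at least one object. The result therefore matches the Jaccard coefficient of the same two binary vectors, 2/4 in this example.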
References
- Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Introduction to Data Mining (2005), ISBN 0-321-32136-7
See also
- Sorensen's quotient of similarity
- Mountford's index of similarity
- Hamming distance
- Correlation
External links
- [Jaccard's index and species diversity]
- [Jaccard's coefficient]
- [Introduction to Data Mining lecture notes from Tan, Steinbach, Kumar]
From Wikipedia, the Free Encyclopedia.
All text is available under the terms of the GNU Free Documentation License.
