jaccard
- scipy.spatial.distance.jaccard(u, v, w=None)
Compute the Jaccard dissimilarity between two boolean vectors.
Given boolean vectors \(u \equiv (u_1, \cdots, u_n)\) and \(v \equiv (v_1, \cdots, v_n)\) that are not both zero, their Jaccard dissimilarity is defined as ([1], p. 26)
\[d_\textrm{jaccard}(u, v) := \frac{c_{10} + c_{01}}{c_{11} + c_{10} + c_{01}}\]
where
\[c_{ij} := \sum_{1 \le k \le n, u_k=i, v_k=j} 1\]
for \(i, j \in \{ 0, 1\}\). If \(u\) and \(v\) are both zero, their Jaccard dissimilarity is defined to be zero. [2]
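The definition above can be sketched in plain Python by tallying the four counts \(c_{ij}\) directly. This is a minimal illustration of the formula, not SciPy's implementation; the helper name `jaccard_from_counts` is hypothetical.

```python
def jaccard_from_counts(u, v):
    """Jaccard dissimilarity computed directly from the c_ij counts."""
    if len(u) != len(v):
        raise ValueError("u and v must have the same length")
    # c[(i, j)] counts the positions k with u_k == i and v_k == j.
    c = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for uk, vk in zip(u, v):
        c[(int(bool(uk)), int(bool(vk)))] += 1
    denom = c[(1, 1)] + c[(1, 0)] + c[(0, 1)]
    if denom == 0:
        # Both vectors are all zero: the dissimilarity is defined as zero.
        return 0.0
    return (c[(1, 0)] + c[(0, 1)]) / denom


# c_11 = 1, c_10 = 0, c_01 = 2, so the result is 2/3.
print(jaccard_from_counts([1, 0, 0, 0], [1, 1, 1, 0]))
```

Note that \(c_{00}\) never enters the formula, so positions where both vectors are zero do not affect the result.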
If a (non-negative) weight vector \(w \equiv (w_1, \cdots, w_n)\) is supplied, the weighted Jaccard dissimilarity is defined similarly but with \(c_{ij}\) replaced by
\[\tilde{c}_{ij} := \sum_{1 \le k \le n, u_k=i, v_k=j} w_k\]
- Parameters:
- u : (N,) array_like of bools
Input vector.
- v : (N,) array_like of bools
Input vector.
- w : (N,) array_like of floats, optional
Weights for each pair of \((u_k, v_k)\). Default is None, which gives each pair a weight of 1.0.
- Returns:
- jaccard : float
The Jaccard dissimilarity between vectors u and v, optionally weighted by w if supplied.
Notes
The Jaccard dissimilarity satisfies the triangle inequality and qualifies as a metric. [2]
The Jaccard index, or Jaccard similarity coefficient, is equal to one minus the Jaccard dissimilarity. [3]
The dissimilarity between general (finite) sets may be computed by encoding them as boolean vectors and computing the dissimilarity between the encoded vectors. For example, subsets \(A,B\) of \(\{ 1, 2, ..., n \}\) may be encoded into boolean vectors \(u, v\) by setting \(u_k := 1_{k \in A}\), \(v_k := 1_{k \in B}\) for \(k = 1,2,\cdots,n\).
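The encoding described above can be sketched as follows. This is an illustrative example in plain Python; the helper names `encode` and `jaccard_bool` and the sample sets are hypothetical. It also checks the relation to the Jaccard index, computed here as \(|A \cap B| / |A \cup B|\).

```python
def encode(subset, n):
    """Encode a subset of {1, ..., n} as a boolean vector of length n."""
    return [k in subset for k in range(1, n + 1)]


def jaccard_bool(u, v):
    """Jaccard dissimilarity of two equal-length boolean sequences."""
    both = sum(1 for a, b in zip(u, v) if a and b)    # c_11
    either = sum(1 for a, b in zip(u, v) if a or b)   # c_11 + c_10 + c_01
    return 0.0 if either == 0 else (either - both) / either


A, B, n = {1, 2, 3}, {2, 3, 4, 5}, 6

# Dissimilarity via the boolean encoding.
d_vec = jaccard_bool(encode(A, n), encode(B, n))

# Dissimilarity via the sets: one minus the Jaccard index.
d_set = 1 - len(A & B) / len(A | B)

print(d_vec, d_set)
```

The two values agree up to floating-point rounding, and the choice of \(n\) does not matter as long as it covers every element of both sets.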
Changed in version 1.2.0: Previously, if all (positively weighted) elements in u and v were zero, the function would return nan. This was changed to return 0 instead.
Changed in version 1.15.0: Non-0/1 numeric input used to produce an ad hoc result. Since 1.15.0, numeric input is converted to Boolean before computation.
References
[1] Kaufman, L. and Rousseeuw, P. J. (1990). “Finding Groups in Data: An Introduction to Cluster Analysis.” John Wiley & Sons, Inc.
Examples
>>> from scipy.spatial import distance
Non-zero vectors with no matching 1s have dissimilarity of 1.0:
>>> distance.jaccard([1, 0, 0], [0, 1, 0])
1.0
Vectors with some matching 1s have dissimilarity less than 1.0:
>>> distance.jaccard([1, 0, 0, 0], [1, 1, 1, 0])
0.6666666666666666
Identical vectors, including zero vectors, have dissimilarity of 0.0:
>>> distance.jaccard([1, 0, 0], [1, 0, 0])
0.0
>>> distance.jaccard([0, 0, 0], [0, 0, 0])
0.0
The following example computes the dissimilarity directly from a confusion matrix by setting the weight vector to the frequencies of true positives, false negatives, false positives, and true negatives, in that order:
>>> distance.jaccard([1, 1, 0, 0], [1, 0, 1, 0], [31, 41, 59, 26])
0.7633587786259542  # (41+59)/(31+41+59)
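The weighted definition behind the confusion-matrix example can be sketched in plain Python. This is an illustration of the formula with \(\tilde{c}_{ij}\), not SciPy's implementation; the helper name `weighted_jaccard` and the variable names `tp`, `fn`, `fp`, `tn` are hypothetical.

```python
def weighted_jaccard(u, v, w):
    """Weighted Jaccard dissimilarity: sum weights instead of counting."""
    mismatch = 0.0  # weighted c_10 + c_01
    union = 0.0     # weighted c_11 + c_10 + c_01
    for uk, vk, wk in zip(u, v, w):
        if uk or vk:
            union += wk
            if bool(uk) != bool(vk):
                mismatch += wk
    # If every positively weighted pair is (0, 0), the result is zero.
    return 0.0 if union == 0 else mismatch / union


# One position per confusion-matrix cell, weighted by its frequency.
tp, fn, fp, tn = 31, 41, 59, 26
d = weighted_jaccard([1, 1, 0, 0], [1, 0, 1, 0], [tp, fn, fp, tn])
print(d, (fn + fp) / (tp + fn + fp))
```

The true-negative count `tn` has no effect on the result, because positions where both vectors are zero are excluded from numerator and denominator alike.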