scipy.spatial.distance.jaccard

scipy.spatial.distance.jaccard(u, v, w=None)

Compute the Jaccard dissimilarity between two boolean vectors.

Given boolean vectors \(u \equiv (u_1, \cdots, u_n)\) and \(v \equiv (v_1, \cdots, v_n)\) that are not both zero, their Jaccard dissimilarity is defined as ([1], p. 26)

\[d_\textrm{jaccard}(u, v) := \frac{c_{10} + c_{01}} {c_{11} + c_{10} + c_{01}}\]

where

\[c_{ij} := \sum_{1 \le k \le n, u_k=i, v_k=j} 1\]

for \(i, j \in \{ 0, 1\}\). If \(u\) and \(v\) are both zero, their Jaccard dissimilarity is defined to be zero. [2]

If a (non-negative) weight vector \(w \equiv (w_1, \cdots, w_n)\) is supplied, the weighted Jaccard dissimilarity is defined similarly but with \(c_{ij}\) replaced by

\[\tilde{c}_{ij} := \sum_{1 \le k \le n, u_k=i, v_k=j} w_k\]
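The definitions above can be evaluated by hand. The following is a minimal pure-Python sketch, not SciPy's implementation; the helper name `jaccard_from_counts` is hypothetical and exists only for illustration. It accumulates the four (weighted) counts and applies the formula, returning zero when the denominator vanishes, as stated above for two zero vectors.

```python
def jaccard_from_counts(u, v, w=None):
    """Evaluate the (weighted) Jaccard dissimilarity formula directly.

    Hypothetical helper for illustration; not part of SciPy.
    """
    if w is None:
        w = [1.0] * len(u)  # unweighted case: every pair counts once
    # c[(i, j)] accumulates the weight of positions where u_k == i and v_k == j
    c = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    for u_k, v_k, w_k in zip(u, v, w):
        c[(int(bool(u_k)), int(bool(v_k)))] += w_k
    denom = c[(1, 1)] + c[(1, 0)] + c[(0, 1)]
    if denom == 0:
        return 0.0  # both vectors are zero on all weighted positions
    return (c[(1, 0)] + c[(0, 1)]) / denom


d_unweighted = jaccard_from_counts([1, 0, 0, 0], [1, 1, 1, 0])
d_weighted = jaccard_from_counts([1, 1, 0, 0], [1, 0, 1, 0], [31, 41, 59, 26])
```

Here `d_unweighted` evaluates 2/3 (one shared 1, two mismatches) and `d_weighted` evaluates (41+59)/(31+41+59).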

Parameters:

u : (N,) array_like of bools
    Input vector.
v : (N,) array_like of bools
    Input vector.
w : (N,) array_like of floats, optional
    Weights for each pair of \((u_k, v_k)\). Default is None, which gives each pair a weight of 1.0.

Returns:

jaccard : float
    The Jaccard dissimilarity between vectors u and v, optionally weighted by w if supplied.

Notes

The Jaccard dissimilarity satisfies the triangle inequality and qualifies as a metric. [2]

The Jaccard index, or Jaccard similarity coefficient, is equal to one minus the Jaccard dissimilarity. [3]

The dissimilarity between general (finite) sets may be computed by encoding them as boolean vectors and computing the dissimilarity between the encoded vectors. For example, subsets \(A,B\) of \(\{ 1, 2, ..., n \}\) may be encoded into boolean vectors \(u, v\) by setting \(u_k := 1_{k \in A}\), \(v_k := 1_{k \in B}\) for \(k = 1,2,\cdots,n\).
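As a small illustration of this encoding, the sketch below maps two subsets of \(\{1, ..., 6\}\) to 0/1 vectors and checks that the dissimilarity of the encoded vectors agrees with one minus the Jaccard index \(|A \cap B| / |A \cup B|\) computed on the sets. The helpers `encode` and `jaccard_bool` and the example sets are assumptions made for illustration; they are not part of SciPy.

```python
def encode(subset, n):
    """Encode a subset of {1, ..., n} as a 0/1 vector of length n."""
    return [1 if k in subset else 0 for k in range(1, n + 1)]


def jaccard_bool(u, v):
    """Unweighted Jaccard dissimilarity of two 0/1 vectors (illustrative)."""
    both = sum(1 for a, b in zip(u, v) if a and b)    # c_11
    either = sum(1 for a, b in zip(u, v) if a or b)   # c_11 + c_10 + c_01
    return 0.0 if either == 0 else (either - both) / either


A, B, n = {1, 2, 3}, {2, 3, 4, 5}, 6
u, v = encode(A, n), encode(B, n)

d_vectors = jaccard_bool(u, v)            # dissimilarity of encoded vectors
d_sets = 1 - len(A & B) / len(A | B)      # one minus the Jaccard index
```

The encoded vectors `u` and `v` could equally be passed to `distance.jaccard`.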

Changed in version 1.2.0: Previously, if all (positively weighted) elements in u and v were zero, the function would return nan. This was changed to return 0 instead.

Changed in version 1.15.0: Non-0/1 numeric input used to produce an ad hoc result. Since 1.15.0, numeric input is converted to Boolean before computation.
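Code that must behave the same across SciPy versions may prefer to convert numeric input to booleans explicitly before calling the function. The sketch below shows one way to do so in plain Python; the helper `to_bool` is hypothetical, and the assumption that Python truthiness (nonzero means True) matches the conversion SciPy applies is ours, not a statement from the source.

```python
def to_bool(values):
    """Map each numeric entry to a bool: nonzero -> True, zero -> False."""
    return [bool(x) for x in values]


u = to_bool([2, 0, 0.5, 0])   # non-0/1 numeric input
v = to_bool([1, 3, 0, 0])

# Counts over the converted vectors, following the definition of c_ij.
c11 = sum(1 for a, b in zip(u, v) if a and b)
c10 = sum(1 for a, b in zip(u, v) if a and not b)
c01 = sum(1 for a, b in zip(u, v) if b and not a)
d = (c10 + c01) / (c11 + c10 + c01)
```

After conversion, `u` and `v` contain only booleans and can be passed to `distance.jaccard` directly.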

References

[1]

Kaufman, L. and Rousseeuw, P. J. (1990). “Finding Groups in Data: An Introduction to Cluster Analysis.” John Wiley & Sons, Inc.

[2]

Kosub, S. (2019). “A note on the triangle inequality for the Jaccard distance.” Pattern Recognition Letters, 120:36-38.

[3]

https://en.wikipedia.org/wiki/Jaccard_index

Examples

>>> from scipy.spatial import distance

Non-zero vectors with no matching 1s have dissimilarity of 1.0:

>>> distance.jaccard([1, 0, 0], [0, 1, 0])
1.0

Vectors with some matching 1s have dissimilarity less than 1.0:

>>> distance.jaccard([1, 0, 0, 0], [1, 1, 1, 0])
0.6666666666666666

Identical vectors, including zero vectors, have dissimilarity of 0.0:

>>> distance.jaccard([1, 0, 0], [1, 0, 0])
0.0
>>> distance.jaccard([0, 0, 0], [0, 0, 0])
0.0

The following example computes the dissimilarity directly from a confusion matrix by setting the weight vector to the frequencies of true positives, false negatives, false positives, and true negatives:

>>> distance.jaccard([1, 1, 0, 0], [1, 0, 1, 0], [31, 41, 59, 26])
0.7633587786259542  # (41+59)/(31+41+59)
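When the confusion matrix is not available up front, its four cells can be tallied from paired labels first. The following is a minimal sketch using `collections.Counter`; the label lists `y_true` and `y_pred` are made-up illustration data, and the cell order (TP, FN, FP, TN) mirrors the example above.

```python
from collections import Counter

# Hypothetical paired labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]

# Tally how often each (true, predicted) combination occurs.
counts = Counter(zip(y_true, y_pred))

# One vector position per confusion-matrix cell: TP, FN, FP, TN.
cells = [(1, 1), (1, 0), (0, 1), (0, 0)]
u = [i for i, _ in cells]
v = [j for _, j in cells]
w = [counts[cell] for cell in cells]

# Weighted Jaccard dissimilarity from the definition: (FN + FP) / (TP + FN + FP).
d = (w[1] + w[2]) / (w[0] + w[1] + w[2])
```

The vectors `u`, `v`, and `w` built this way have the same layout as the arguments in the call above, so they could be passed to `distance.jaccard(u, v, w)`.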