scipy.stats.entropy(pk, qk=None, base=None, axis=0)

Calculate the Shannon entropy/relative entropy of given distribution(s).

If only probabilities pk are given, the Shannon entropy is calculated as H = -sum(pk * log(pk)).

If qk is not None, then compute the relative entropy D = sum(pk * log(pk / qk)). This quantity is also known as the Kullback-Leibler divergence.

This routine will normalize pk and qk if they don’t sum to 1.
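As a minimal sketch of the normalization behavior described above, the snippet below passes raw counts and hand-normalized probabilities and compares the results. The counts are made-up illustration data, not taken from the documentation.

```python
import numpy as np
from scipy.stats import entropy

# Hypothetical event counts; they do not sum to 1.
counts = np.array([30, 10, 10])
# The same distribution, normalized by hand.
probs = counts / counts.sum()

h_counts = entropy(counts)  # normalized internally by entropy
h_probs = entropy(probs)

# Both inputs describe the same distribution, so the entropies should agree.
same = bool(np.isclose(h_counts, h_probs))
print(same)
```

Because normalization happens inside the function, frequency tables can be passed without dividing by their total first.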


pk : array_like

Defines the (discrete) distribution. Along each axis-slice of pk, element i is the (possibly unnormalized) probability of event i.

qk : array_like, optional

Sequence against which the relative entropy is computed. Should be in the same format as pk.

base : float, optional

The logarithmic base to use. Defaults to e (natural logarithm).

axis : int, optional

The axis along which the entropy is calculated. Default is 0.

S : {float, array_like}

The calculated entropy.
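The following sketch illustrates how base and axis interact for a 2-D input. The array layout (one distribution per column) is an assumption chosen for illustration.

```python
import numpy as np
from scipy.stats import entropy

# Two distributions stored as columns: a fair coin and a biased coin.
pk = np.array([[0.5, 0.9],
               [0.5, 0.1]])

# axis=0 (the default) reduces over rows, giving one entropy per column.
per_column = entropy(pk, base=2, axis=0)

# The transposed array stores one distribution per row, so reduce over axis=1.
per_row = entropy(pk.T, base=2, axis=1)

# Changing the base only rescales the result: nats = bits * ln(2).
bits = entropy(pk[:, 0], base=2)
nats = entropy(pk[:, 0])
print(per_column.shape, bool(np.allclose(per_column, per_row)))
```

The result has one entry per axis-slice, so the returned shape is the input shape with the reduced axis removed.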


Informally, the Shannon entropy quantifies the expected uncertainty inherent in the possible outcomes of a discrete random variable. For example, if messages consisting of sequences of symbols from a set are to be encoded and transmitted over a noiseless channel, then the Shannon entropy H(pk) gives a tight lower bound for the average number of units of information needed per symbol if the symbols occur with frequencies governed by the discrete distribution pk [1]. The choice of base determines the choice of units; e.g., e for nats, 2 for bits, etc.
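To make the coding interpretation concrete, here is a small sketch comparing a uniform distribution over eight symbols with a skewed one. The skewed probabilities are an illustrative choice; for this particular distribution a prefix code with word lengths 1, 2, 3, 4, 6, 6, 6, 6 bits meets the bound exactly.

```python
import numpy as np
from scipy.stats import entropy

# Eight equally likely symbols: log2(8) = 3 bits per symbol.
uniform = np.full(8, 1 / 8)
h_uniform = entropy(uniform, base=2)

# A skewed distribution over the same eight symbols (sums to 1).
skewed = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
h_skewed = entropy(skewed, base=2)

# Average length of a code whose word lengths are -log2(p) for each symbol.
code_lengths = np.array([1, 2, 3, 4, 6, 6, 6, 6])
avg_length = float(np.sum(skewed * code_lengths))

print(bool(np.isclose(h_uniform, 3.0)), bool(np.isclose(h_skewed, avg_length)))
```

The skewed source needs fewer bits per symbol on average than the uniform one, since its outcomes are more predictable.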

The relative entropy, D(pk|qk), quantifies the increase in the average number of units of information needed per symbol if the encoding is optimized for the probability distribution qk instead of the true distribution pk. Informally, the relative entropy quantifies the expected excess in surprise experienced if one believes the true distribution is qk when it is actually pk.
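A short sketch of two properties that follow from the definition D = sum(pk * log(pk / qk)): the relative entropy is zero when both distributions coincide, and it is generally not symmetric in its arguments. The variable names below are illustrative only.

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.5])  # fair coin
q = np.array([0.9, 0.1])  # biased coin

d_pq = entropy(p, q, base=2)  # true distribution p, code optimized for q
d_qp = entropy(q, p, base=2)  # true distribution q, code optimized for p
d_pp = entropy(p, p, base=2)  # identical distributions

# Zero for identical distributions; different values when the roles swap.
print(bool(np.isclose(d_pp, 0.0)), bool(np.isclose(d_pq, d_qp)))
```

Because the two directions differ, the order of the pk and qk arguments matters when calling the function.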

A related quantity, the cross entropy CE(pk, qk), satisfies the equation CE(pk, qk) = H(pk) + D(pk|qk) and can also be calculated with the formula CE = -sum(pk * log(qk)). It gives the average number of units of information needed per symbol if an encoding is optimized for the probability distribution qk when the true distribution is pk. It is not computed directly by entropy, but it can be computed using two calls to the function (see Examples).

See [2] for more information.



[1] Shannon, C.E. (1948), A Mathematical Theory of Communication. Bell System Technical Journal, 27: 379-423.


[2] Thomas M. Cover and Joy A. Thomas. 2006. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA.


The outcome of a fair coin is the most uncertain:

>>> import numpy as np
>>> from scipy.stats import entropy
>>> base = 2  # work in units of bits
>>> pk = np.array([1/2, 1/2])  # fair coin
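>>> H = entropy(pk, base=base)
>>> float(H)
1.0
>>> # compare with a tolerance rather than exact floating-point equality
>>> bool(np.isclose(H, -np.sum(pk * np.log(pk)) / np.log(base)))
True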

The outcome of a biased coin is less uncertain:

>>> qk = np.array([9/10, 1/10])  # biased coin
>>> round(float(entropy(qk, base=base)), 4)
0.469

The relative entropy between the fair coin and biased coin is calculated as:

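>>> D = entropy(pk, qk, base=base)
>>> round(float(D), 4)
0.737
>>> # compare with a tolerance rather than exact floating-point equality
>>> bool(np.isclose(D, np.sum(pk * np.log(pk/qk)) / np.log(base)))
True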

The cross entropy can be calculated as the sum of the entropy and the relative entropy:

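>>> CE = entropy(pk, base=base) + entropy(pk, qk, base=base)
>>> round(float(CE), 4)
1.737
>>> # compare with a tolerance rather than exact floating-point equality
>>> bool(np.isclose(CE, -np.sum(pk * np.log(qk)) / np.log(base)))
True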