Trimming and winsorization transition guide#

An outlier is an observation that differs substantially from other observations in a sample. There are many references available about identifying outliers, whether to make adjustments for outliers, and if so, what adjustments to make. This guide does not make any recommendations on these topics. Instead, the focus is on how to perform two of the most common procedures:

  • trimming: removing outliers from the dataset, and

  • winsorization: replacing outliers with a more central value,

using tools available in SciPy, with an emphasis on transitioning from use of legacy functions to the preferred approach.

Motivation#

In the past, SciPy has offered several “convenience functions” that combine trimming with computation of a statistic. Consider, for instance, scipy.stats.trim_mean.

import numpy as np
from scipy import stats
np.set_printoptions(linewidth=120)
x = np.arange(20.)
stats.trim_mean(x, proportiontocut=0.1)
np.float64(9.5)

Without carefully reading the documentation, it is not immediately obvious how the operation was performed: does proportiontocut refer to the proportion to cut from each tail, or is it the total proportion of the data to remove?

Likewise, when the proportion multiplied by the number of elements would not result in an integer, is the number to remove rounded up or down?

What if instead of removing a proportion, we want to remove data below and above specified thresholds. It turns out that we cannot achieve this with trim_mean; we would need to use tmean or (the similarly-named, but curiously relegated to a separate namespace) scipy.stats.mstats.trimmed_mean, instead.

What if we want to remove data outside a “confidence interval” estimate. Do any of these functions offer anything to facilitate that? No, not really.

One could take the mean after using one of the seven functions for trimming data. But which one(s) will do what we need?

So at the time of writing, SciPy offers at least half a dozen ways of taking a trimmed mean, but deciding which to use requires a lot of research, and none are able to perform all variations of the trimmed mean that we might reasonably expect them.

What about other statistics - should we use scipy.stats.tvar, scipy.stats.mstats.tvar, or scipy.stats.mstats.trimmed_var to compute the trimmed variance? If there are three functions for that and three for the mean, surely there is at least one function for computing the trimmed skewness? (Nope!)

In the spirit of the Zen of Python, we suggest the new “one– and preferably only one –obvious way” to calculate the trimmed version of many statistics: manually, in two or three lines of code, using more familiar, general-purpose features of NumPy and SciPy. Whether calculating a trimmed or winsorized statistic, there will be three steps:

  1. Identify the outlier threshold.

  2. Trim or winsorize.

  3. Compute the statistic.

Identifying the threshold#

SciPy now offers a quantile function with several methods useful for identifying thresholds beyond which data are to be trimmed or winsorized. The most common thresholds are those corresponding with a certain percentage of the data in each tail. To set the threshold at \(p=10\%\) of the data in each tail, rounding the number of points to trim down:

x = np.arange(22.)
p = 0.1
stats.quantile(x, [p, 1-p], method='round_outward')
array([ 2., 19.])

Rounding the number of points to trim up:

stats.quantile(x, [p, 1-p], method='round_inward')
array([ 3., 18.])

Rounding the number of points to trim to the nearest integer:

stats.quantile(x, [p, 1-p], method='round_nearest')
array([ 2., 19.])

Note that these methods are all subtly different from estimating the 10th and 90th percentiles of the data, which might be called for in different circumstances. Using the default method:

stats.quantile(x, [p, 1-p])
array([ 2.1, 18.9])

The Harrell-Davis method:

stats.quantile(x, [p, 1-p], method='harrell-davis')
array([ 1.69917358, 19.30082642])

Note that quantile is fully vectorized. Suppose we have two rows of data, and we wish to set different thresholds for each row.

x2d = np.stack([x, x])
p2d = [[p  , 1 - p  ],  # thresholds for zeroth row
       [2*p, 1 - 2*p]]  # thresholds for next row
stats.quantile(x2d, p2d, method='round_outward', axis=-1)
array([[ 2., 19.],
       [ 4., 17.]])

Missing data#

When there are missing or ignorable values but omitting them would lead to a ragged array, the two historical options have been to:

  • cover the missing/ignorable values with the “mask” of a NumPy masked array and use stats.mstats functions for analysis, or

  • replace missing/ignorable values with NaNs and use stats functions with nan_policy='omit' for analysis.

The downsides of these approaches are documented elsewhere; here, we recommend using an MArray with stats functions.

from marray import numpy as mxp
mask = ((x2d % 2) == 0)  # ignore odd elements
y = mxp.asarray(x2d, mask=mask)
print(y)
[[  _  1.   _  3.   _  5.   _  7.   _  9.   _ 11.   _ 13.   _ 15.   _ 17.   _ 19.   _ 21.]
 [  _  1.   _  3.   _  5.   _  7.   _  9.   _ 11.   _ 13.   _ 15.   _ 17.   _ 19.   _ 21.]]

quantile and many other scipy.stats functions natively support MArrays, and support grows with each version of SciPy.

stats.quantile(y, mxp.asarray(p2d), method='round_outward', axis=-1)
MArray(
    array([[ 3., 19.],
           [ 5., 17.]]),
    array([[False, False],
           [False, False]])
)

Trimming and winsorizing#

Winsorizing is cliping#

Once a numerical threshold has been identified, winsorizing is simply an application of a fundamental data operation: clip.

Suppose we have \(n=22\) observations and we wish to winsorize \(p=10\%\) from each side of the data. Rounding down with method='round_outward', we would expect to winsorize \(\left \lfloor{pn}\right \rfloor = 2\) points on each side. In this case, quantile returns the third data point from the left (at index 2) as the lower limit. cliping to this limit will winsorize the two lower observations, leaving the third observation unchanged. The story is similar on the right.

x = np.arange(22.)
p = 0.1
low, high = stats.quantile(x, [p, 1-p], method='round_outward')
x_winsorized = np.clip(x, low, high)
x_winsorized
array([ 2.,  2.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 19., 19.])

When extracting the low and high thresholds from the result of quantile, just remember that their positions correspond with the positions of values in the p argument.

stats.quantile(x2d, p2d, method='round_outward', axis=-1)
array([[ 2., 19.],
       [ 4., 17.]])

Here, we want the columns, and we want to preserve their shape so they still align with the shape of x2d.

low_high = stats.quantile(x2d, p2d, method='round_outward', axis=-1)
low2d = low_high[:, :1]
high2d = low_high[:, 1:]
low2d, high2d
(array([[2.],
        [4.]]),
 array([[19.],
        [17.]]))
x2d_winsorized = np.clip(x2d, low2d, high2d)
x2d_winsorized
array([[ 2.,  2.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 19., 19.],
       [ 4.,  4.,  4.,  4.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17., 17., 17., 17., 17.]])

Trimming#

Like winsorization, trimming is also a straightforward application of fundamental array operations.

For 1-D data, trimming is just indexing with a boolean mask that selects data to be retained.

As in the 1-D winsorization example above, suppose we want to trim two points on either side. low is the third observation from the left - the first we want to keep. Therefore, we’ll form a boolean mask with non-strict inequalities to select the data to be kept.

low, high = stats.quantile(x, [p, 1-p], method='round_outward')  # get the threshold values
mask = (x >= low) & (x <= high)
x_trimmed = x[mask]
x_trimmed
array([ 2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19.])

Rather than changing to strict inequalities, use method='round_inward' if you want to round the other way.

low, high = stats.quantile(x, [p, 1-p], method='round_inward')  # get the threshold values
mask = (x >= low) & (x <= high)
x[mask]
array([ 3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17., 18.])

In 2-D, this approach of retaining only certain data may be possible if the number of elements retained in each row happen to be identical. There is one complication: boolean indexing results in a 1-D array:

mask = (x2d >= low) & (x2d <= high)
x2d[mask]
array([ 3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17., 18.,  3.,  4.,  5.,  6.,  7.,  8.,
        9., 10., 11., 12., 13., 14., 15., 16., 17., 18.])

When the number of elements retained in each row are identical (and only then), it is safe to reshape the output to an array with the original number of rows.

x2d[mask].reshape(x2d.shape[0], -1)  # -1 means "infer from the number of elements"
array([[ 3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17., 18.],
       [ 3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17., 18.]])

The more general option is to keep the trimmed elements in their original locations but to mask them.

Note that the mask of elements to remove is the logical NOT of the elements to be kept. Since we used non-strict inequalities to retain elements, we use strict inequalities to trim them.

mask2d = (x2d < low2d) | (x2d > high2d)
x2d_masked = mxp.asarray(x2d, mask=mask2d)
print(x2d_masked)
[[  _   _  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19.   _   _]
 [  _   _   _   _  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.   _   _   _   _]]

Computing the statistic#

When the data is winsorized or trimmed data has actually been removed, working with it is no different than any other data.

np.mean(x_trimmed)
np.float64(10.5)
np.mean(x_winsorized)  # result happens to be identical *for this data*
np.float64(10.5)
np.mean(x2d_winsorized, axis=-1)  # results happen to be identical *for this data*
array([10.5, 10.5])

When the trimmed values are represented as masked elements of a masked array, only slightly more care is needed. For simple statistics like mean and var, use the corresponding function of the MArray namespace, mxp, rather than the NumPy namespace.

mxp.mean(x2d_masked, axis=-1)
MArray(array([10.5, 10.5]), array([False, False]))

Many SciPy functions also natively support masked arrays (see the documentation of the desired function).

stats.kurtosis(x2d_masked, axis=-1)
MArray(array([-1.20743034, -1.21230769]), array([False, False]))

Recipes#

Here, we reproduce the results of several legacy trimming functions using the preferred approach.

rng = np.random.default_rng(339759635264198322188079188741355346875)
x = rng.random(22)
p = 0.1

stats.trim_mean#

stats.trim_mean accepts a proportion of the total number of elements to cut from each side.

stats.trim_mean(x, proportiontocut=p)
np.float64(0.5040176765501222)

It always rounds the number of elements to cut down, so we choose round_outward.

low, high = stats.quantile(x, [p, 1-p], method='round_outward')
mask = (low <= x) & (x <= high)
np.mean(x[mask])
np.float64(0.5040176765501222)

stats.tmean#

tmean accepts absolute lower and upper limits.

limits = (0.1, 0.9)
stats.tmean(x, limits, inclusive=(True, True))
np.float64(0.47320792896720426)

The inclusive argument has no effect unless one of the limits exactly equals an observation.

stats.tmean(x, limits, inclusive=(False, False))
np.float64(0.47320792896720426)

As an example of when it would matter:

limits = (x[9], x[5])
tmean_inclusive = stats.tmean(x, limits, inclusive=(True, True))
tmean_exclusive = stats.tmean(x, limits, inclusive=(False, False))
tmean_inclusive, tmean_exclusive
(np.float64(0.5040176765501222), np.float64(0.5009240752836592))

inclusive refers to the elements to keep, so use non-strict inequality to be inclusive:

mask = (limits[0] <= x) & (x <= limits[1])
np.mean(x[mask])
np.float64(0.5040176765501222)

and strict inequality to be exclusive.

mask = (limits[0] < x) & (x < limits[1])
np.mean(x[mask])
np.float64(0.5009240752836591)

stats.mstats.trimmed_mean#

trimmed_mean is, in some ways, a mash-up of stats.tmean and stats.trim_mean. It accepts limits, like stats.tmean, but by default, these are relative, like stats.trim_mean.

stats.mstats.trimmed_mean(x, (p, p), inclusive=(True, True), relative=True)
np.float64(0.5040176765501222)

Surprisingly, inclusive has no effect when p*len(x) is integral and relative is True (default).

stats.mstats.trimmed_mean(x, (p, p), inclusive=(False, False))
np.float64(0.5040176765501222)

Rather, it affects the method of rounding p*len(x). So let’s consider a case in which it would matter.

rng = np.random.default_rng(324589234958293486245)
x = rng.random(size=19)
stats.mstats.trimmed_mean(x, (p, p), inclusive=(True, True))
np.float64(0.6172768003503452)

According to the documentation, inclusive=(True, True) means that the number of elements “masked” on each side should be rounded. In our case, p*len(x) = 1.9, so we would round to 2 on each side for a total of four.

low, high = stats.quantile(x, [p, 1-p], method='round_nearest')
mask = (x < low) | (high < x)
np.count_nonzero(mask)
np.int64(4)
mxp.mean(mxp.asarray(x, mask=mask))
MArray(array(0.62423841), array(False))

What happened? Apparently the documentation has been wrong since the function was added in 2008. If we trim less data, we get the answer we expect.

low, high = stats.quantile(x, [p, 1-p], method='round_outward')
mask = (x < low) | (high < x)
np.count_nonzero(mask)
np.int64(2)
mxp.mean(mxp.asarray(x, mask=mask))
MArray(array(0.6172768), array(False))

What about inclusive=(False, False)? According to the documentation, the number of elements “masked” should be truncated. In our case, p*len(x) = 1.9, so we would truncate to 1 on each side for a total of two. This should give us the same result as the round_outward calculation just above.

stats.mstats.trimmed_mean(x, (p, p), inclusive=(False, False))
np.float64(0.6242384075150735)

But instead, it gives us the same result as the round_nearest (or a round_inward) calculation. Apparently, the documentation is backwards.

As it turns out, stats.mstats.trimmed_mean is a convenience function equivalent to:

stats.mstats.trim(a, limits=limits, inclusive=inclusive, relative=relative).mean(axis=axis)

So the behavior is best understood in terms of scipy.stats.mstats.trim.

stats.mstats.trim#

The entire implementation of stats.mstats.trim is:

if relative:
    return trimr(a, limits=limits, inclusive=inclusive, axis=axis)
else:
    return trima(a, limits=limits, inclusive=inclusive)

So it is also convenience function that simply wraps the functions stats.mstats.trimr and stats.mstats.trima.

stats.mstats.trima#

stats.mstats.trima is itself a convenience function that simply masks values outside of a given interval.

x = np.arange(20)
low, high = (1, 18)
stats.mstats.trima(x, (low, high), inclusive=(True, True))
masked_array(data=[--, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, --],
             mask=[ True, False, False, False, False, False, False, False, False, False, False, False, False, False,
                   False, False, False, False, False,  True],
       fill_value=999999)

The definition of inclusive is ambiguous at best:

Tuple of (lower flag, upper flag), indicating whether values exactly equal to the lower (upper) limit are allowed.

What does allowed mean? Are values equal to the limits included in the data that is trimmed or retained? Apparently, since only one point has been masked on each side, it refers to elements being retained.

Generating and applying the mask manually is much more explicit. It also tends to be more concise.

mxp.asarray(x, mask=(x < low) | (x > high))  # mask values *strictly* less than 1; also mask values *strictly* greater than 18
MArray(
    array([ _,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18,  _]),
    array([ True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False,
           False, False, False,  True])
)

To reproduce the results we’d get with inclusive=(False, False):

stats.mstats.trima(x, (1, 18), inclusive=(False, False))
masked_array(data=[--, --, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, --, --],
             mask=[ True,  True, False, False, False, False, False, False, False, False, False, False, False, False,
                   False, False, False, False,  True,  True],
       fill_value=999999)

we simply adjust the inequalities.

mxp.asarray(x, mask=(x <= low) | (x >= high))  # mask values 1 or less; also mask values 18 or greater
MArray(
    array([ _,  _,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,  _,  _]),
    array([ True,  True, False, False, False, False, False, False, False, False, False, False, False, False, False, False,
           False, False,  True,  True])
)

stats.mstats.trimr#

stats.mstats.trimr is also a convenience function that trims a specified proportion of the total number of elements from each tail. inclusive has no effect when the product of the proportion and the total number of elements is integral.

p = 0.1
stats.mstats.trimr(x, (p, p), inclusive=(True, True))
masked_array(data=[--, --, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, --, --],
             mask=[ True,  True, False, False, False, False, False, False, False, False, False, False, False, False,
                   False, False, False, False,  True,  True],
       fill_value=999999)
stats.mstats.trimr(x, (p, p), inclusive=(False, False))
masked_array(data=[--, --, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, --, --],
             mask=[ True,  True, False, False, False, False, False, False, False, False, False, False, False, False,
                   False, False, False, False,  True,  True],
       fill_value=999999)

The documentation for inclusive states that it is a:

Tuple of flags indicating whether the number of data being masked on the left (right) end should be truncated (True) or rounded (False) to integers.

Suppose we have \(n=19\) and the proportion to mask is \(p = 0.1\). In that case, \(pn = 1.9\). With inclusive=(False, False), we round the number of elements to be masked to 2.

n = 19.
x = np.arange(n)
stats.mstats.trimr(x, (p, p), inclusive=(False, False))
masked_array(data=[--, --, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, --, --],
             mask=[ True,  True, False, False, False, False, False, False, False, False, False, False, False, False,
                   False, False, False,  True,  True],
       fill_value=1e+20)

This is equivalent to:

low, high = stats.quantile(x, [p, 1-p], method='round_nearest')
mxp.asarray(x, mask=(x < low) | (x > high))
MArray(
    array([  _,   _,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16.,   _,   _]),
    array([ True,  True, False, False, False, False, False, False, False, False, False, False, False, False, False, False,
           False,  True,  True])
)

With inclusive=(True, True), the number of elements to be masked is truncated to 1.

stats.mstats.trimr(x, (p, p), inclusive=(True, True))
masked_array(data=[--, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, --],
             mask=[ True, False, False, False, False, False, False, False, False, False, False, False, False, False,
                   False, False, False, False,  True],
       fill_value=1e+20)
low, high = stats.quantile(x, [p, 1-p], method='round_outward')
mxp.asarray(x, mask=(x < low) | (x > high))
MArray(
    array([  _,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17.,   _]),
    array([ True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False,
           False, False,  True])
)

stats.mstats.trimboth, stats.trimboth, stats.mstats.trimtail, and stats.trim1.#

For your convenience, stats.mstats.trimboth passes proportiontocut to stats.mstats.trimr twice!

np.all(stats.mstats.trimboth(x, p) == stats.mstats.trimr(x, (p, p)))
np.True_

stats.trimboth does the same job, but it removes (rather than masking) the trimmed data, and shuffles the elements.

a = stats.mstats.trimboth(x, p)
b = stats.trimboth(x, p)
np.all(np.ma.compressed(a) == np.sort(b))
np.True_

trimtail performs the arduous task of passing proportiontocut and None as a tuple.

np.all(stats.mstats.trimtail(x, p, tail='left') == stats.mstats.trimr(x, (p, None)))
np.True_
np.all(stats.mstats.trimtail(x, p, tail='right') == stats.mstats.trimr(x, (None, p)))
np.True_

stats.trim1 is to stats.mstats.trimtail as stats.trimboth is to stats.mstats.trimboth

a = stats.mstats.trimtail(x, p, tail='left')
b = stats.trim1(x, p, tail='left')
np.all(np.ma.compressed(a) == np.sort(b))
np.True_

Replicating these results using quantile and MArray is left as an exercise for the reader.

stats.mstats.winsorize#

In almost all cases, stats.mstats.winsorize is equivalent to:

def winsorize(x, limits, inclusive, nan_policy='propagate'):
    method = 'round_outward' if all(inclusive) else 'round_nearest'
    low_high = stats.quantile(x, [limits[0], 1-limits[1]], method=method, axis=-1, nan_policy=nan_policy)
    return np.clip(x, low_high[..., :1], low_high[..., 1:])
stats.mstats.winsorize(x, (p, p), inclusive=(False, False)).data
array([ 2.,  2.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 16., 16.])
winsorize(x, (p, p), inclusive=(False, False))
array([ 2.,  2.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 16., 16.])

Minor generalizations would be needed to support axis other than -1, different left and right values of inclusive, and the inplace argument.

Results may be different when NaNs are present and nan_policy='omit'. This is because stats.mstats.winsorize is incorrect. If we have two NaNs to be omitted, we would have 17 elements left, so \(np = 1.7\), and inclusive=(False, False) would round this number up to 2. This means that two of the values on each side would be replaced with the next more central element.

x[9:11] = np.nan
stats.mstats.winsorize(x, (p, p), inclusive=(False, False), nan_policy='omit').data
array([ 2.,  2.,  2.,  3.,  4.,  5.,  6.,  7.,  8., nan, nan, 11., 12., 13., 14., 15., 16., 17., 18.])

Clearly, no elements in the right tail have been replaced. Our simple implementation fixes the problem.

winsorize(x, (p, p), inclusive=(False, False), nan_policy='omit')
array([ 2.,  2.,  2.,  3.,  4.,  5.,  6.,  7.,  8., nan, nan, 11., 12., 13., 14., 15., 16., 16., 16.])