# A Design Specification for `nan_policy`

Many functions in `scipy.stats` have a parameter called `nan_policy` that determines how the function handles data that contains `nan`. In this section, we provide SciPy developer guidelines for how `nan_policy` is intended to be used, to ensure that as this parameter is added to new functions, we maintain a consistent API.

## The basic API

The parameter `nan_policy` accepts three possible strings: `'omit'`, `'raise'` and `'propagate'`. The meanings are:

- `nan_policy='omit'`: Ignore occurrences of `nan` in the input. Do not generate a warning if the input contains `nan` (unless the equivalent input with the `nan` values removed would generate a warning). For example, for the simple case of a function that accepts a single array and returns a scalar (and ignoring the possible use of `axis` for the moment):

  ```
  func([1.0, 3.0, np.nan, 5.0], nan_policy='omit')
  ```

  should behave the same as:

  ```
  func([1.0, 3.0, 5.0])
  ```

  More generally, for functions that return a scalar, `func(a, nan_policy='omit')` should behave the same as `func(a[~np.isnan(a)])`.

  For functions that transform a vector to a new vector of the same size and for which each entry in the output array depends on more than just the corresponding value in the input array [1] (e.g. `scipy.stats.zscore`, `scipy.stats.boxcox` *when* `lmbda` *is None*):

  ```
  y = func(a, nan_policy='omit')
  ```

  should behave the same as:

  ```
  nan_mask = np.isnan(a)
  y = np.empty(a.shape, dtype=np.float64)
  y[~nan_mask] = func(a[~nan_mask])
  y[nan_mask] = np.nan
  ```

  (In general, the dtype of `y` might depend on `a` and on the expected behavior of `func`.) In other words, a *nan* in the input gives a corresponding *nan* in the output, but the presence of that *nan* does not affect the calculation of the non-*nan* values. Unit tests for this property should be used to test functions that handle `nan_policy`.

  For functions that return a scalar and that accept two or more arguments but whose values are not related (e.g. `scipy.stats.ansari`, `scipy.stats.f_oneway`), the same idea applies to each input array. So:

  ```
  func(a, b, nan_policy='omit')
  ```

  should behave the same as:

  ```
  func(a[~np.isnan(a)], b[~np.isnan(b)])
  ```

  For inputs with *related* or *paired* values (e.g. `scipy.stats.pearsonr`, `scipy.stats.ttest_rel`) the recommended behavior is to omit all the values for which any of the related values are `nan`. For a function with two related array inputs, this means:

  ```
  y = func(a, b, nan_policy='omit')
  ```

  should behave the same as:

  ```
  hasnan = np.isnan(a) | np.isnan(b)  # Union of the isnan masks.
  y = func(a[~hasnan], b[~hasnan])
  ```

  The docstring for such a function should clearly state this behavior.

- `nan_policy='raise'`: Raise a `ValueError` if the input contains `nan`.
- `nan_policy='propagate'`: Propagate the `nan` value to the output. Typically, this means just execute the function without checking for `nan`, though there are cases where that can lead to unexpected output.
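The semantics described above can be sketched in plain NumPy. This is only an illustrative sketch, not SciPy's implementation: the helper names `scalar_stat`, `transform_omit` and `paired_omit` are hypothetical, and `np.mean` merely stands in for an arbitrary scalar reduction `func`.

```python
import numpy as np


def scalar_stat(a, nan_policy='propagate', func=np.mean):
    """Apply a scalar reduction under `nan_policy` (hypothetical helper)."""
    if nan_policy not in ('propagate', 'raise', 'omit'):
        raise ValueError("nan_policy must be 'propagate', 'raise' or 'omit'")
    a = np.asarray(a, dtype=np.float64)
    nan_mask = np.isnan(a)
    if nan_mask.any():
        if nan_policy == 'raise':
            raise ValueError("The input contains nan values")
        if nan_policy == 'omit':
            a = a[~nan_mask]  # behave as if the nan values were never there
    # 'propagate': call func without any special handling.
    return func(a)


def transform_omit(a, func):
    """'omit' for a same-size transform: nan in, nan out, others unaffected."""
    a = np.asarray(a, dtype=np.float64)
    nan_mask = np.isnan(a)
    y = np.empty(a.shape, dtype=np.float64)
    y[~nan_mask] = func(a[~nan_mask])
    y[nan_mask] = np.nan
    return y


def paired_omit(a, b, func):
    """'omit' for paired inputs: drop every pair in which either value is nan."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    hasnan = np.isnan(a) | np.isnan(b)  # union of the isnan masks
    return func(a[~hasnan], b[~hasnan])
```

With these helpers, `scalar_stat([1.0, 3.0, np.nan, 5.0], nan_policy='omit')` returns `3.0`, the same call with the default `'propagate'` returns `nan`, and `nan_policy='raise'` raises `ValueError`.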

## `nan_policy` combined with an `axis` parameter

There is nothing surprising here: the principle mentioned above still applies when the function has an `axis` parameter. Suppose, for example, `func` reduces a 1-d array to a scalar, and handles n-d arrays as a collection of 1-d arrays, with the `axis` parameter specifying the axis along which the reduction is to be applied. If, say:

```
func([1, 3, 4]) -> 10.0
func([2, -3, 8, 2]) -> 4.2
func([7, 8]) -> 9.5
func([]) -> -inf
```

then:

```
func([[  1, nan,   3,   4],
      [  2,  -3,   8,   2],
      [nan,   7, nan,   8],
      [nan, nan, nan, nan]], nan_policy='omit', axis=-1)
```

must give the result:

```
np.array([10.0, 4.2, 9.5, -inf])
```
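As a reference for unit tests, the slice-by-slice behavior can be sketched with `np.apply_along_axis`. The reduction `reduce_1d` below is a hypothetical stand-in for `func` (it returns the maximum, or `-inf` for empty input), so the numbers differ from the made-up values above. The per-slice loop is chosen for clarity, not speed.

```python
import numpy as np


def reduce_1d(x):
    """Hypothetical 1-d reduction: the maximum, or -inf for empty input."""
    return np.max(x, initial=-np.inf)


def reduce_omit(a, axis=-1):
    """Apply `reduce_1d` along `axis`, dropping nan from each 1-d slice."""
    a = np.asarray(a, dtype=np.float64)
    return np.apply_along_axis(lambda v: reduce_1d(v[~np.isnan(v)]), axis, a)


nan = np.nan
a = [[1, nan, 3, 4],
     [2, -3, 8, 2],
     [nan, 7, nan, 8],
     [nan, nan, nan, nan]]

result = reduce_omit(a, axis=-1)  # array([4., 8., 8., -inf])
```

Each row is reduced as if its `nan` entries were absent, and the all-`nan` row gives the same result as `reduce_1d([])`.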

## Edge cases

A function that implements the `nan_policy` parameter should gracefully handle the case where *all* the values in the input array(s) are `nan`. The basic principle described above still applies:

```
func([nan, nan, nan], nan_policy='omit')
```

should behave the same as:

```
func([])
```

In practice, when adding `nan_policy` to an existing function, it is not unusual to find that the function doesn’t already handle this case in a well-defined manner, and some thought and design may have to be applied to ensure that it works. The correct behavior (whether that be to return `nan`, return some other value, raise an exception, or something else) will be determined on a case-by-case basis.
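To illustrate why no single rule fits, the short sketch below shows how three common NumPy reductions already treat empty input differently. It only demonstrates NumPy behavior and is not a recommendation for any particular `scipy.stats` function.

```python
import warnings

import numpy as np

empty = np.array([], dtype=np.float64)

# The empty sum is well defined: it is the additive identity.
sum_of_empty = np.sum(empty)  # 0.0

# The mean of no values is nan; NumPy also emits a RuntimeWarning,
# which is silenced here to keep the example quiet.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    mean_of_empty = np.mean(empty)  # nan

# The maximum has no identity, so NumPy raises instead of returning a value.
try:
    np.max(empty)
    max_raised = False
except ValueError:
    max_raised = True
```

A function whose `'omit'` path reduces an all-`nan` input to an empty array inherits whichever of these behaviors its underlying computation has, so the empty case needs an explicit decision and a unit test.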

## Why doesn’t `nan_policy` also apply to `inf`?

Although we learn in grade school that “infinity is not a number”, the floating point values `nan` and `inf` are qualitatively different. The values `inf` and `-inf` act much more like regular floating point values than `nan`.

- One can compare `inf` to other floating point values and it behaves as expected, e.g. `3 < inf` is True.
- For the most part, arithmetic works “as expected” with `inf`, e.g. `inf + inf = inf`, `-2*inf = -inf`, `1/inf = 0`, etc.
- Many existing functions work “as expected” with `inf`: `np.log(inf) = inf`, `np.exp(-inf) = 0`, `np.array([1.0, -1.0, np.inf]).min() = -1.0`, etc.
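The claims above can be checked directly. The `nan` comparisons at the end are added here for contrast and are not part of the original list.

```python
import numpy as np

inf, nan = np.inf, np.nan

# Comparisons and arithmetic with inf behave as expected.
assert 3 < inf
assert inf + inf == inf
assert -2 * inf == -inf
assert 1 / inf == 0

# Many existing functions handle inf sensibly.
assert np.log(inf) == inf
assert np.exp(-inf) == 0
assert np.array([1.0, -1.0, inf]).min() == -1.0

# By contrast, nan compares False with everything (including itself)
# and contaminates reductions.
assert not (nan < 3) and not (nan > 3) and not (nan == nan)
min_with_nan = np.array([1.0, -1.0, nan]).min()  # nan
```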

So while `nan` almost always means “something went wrong” or “something is missing”, `inf` can in many cases be treated as a useful floating point value.

It is also consistent with the NumPy `nan` functions to not ignore `inf`:

```
>>> np.nanmax([1, 2, 3, np.inf, np.nan])
inf
>>> np.nansum([1, 2, 3, np.inf, np.nan])
inf
>>> np.nanmean([8, -np.inf, 9, 1, np.nan])
-inf
```

## How *not* to implement `nan_policy`

In the past (and possibly currently), some `stats` functions handled `nan_policy` by using a masked array to mask the `nan` values, and then computing the result using the functions in the `mstats` subpackage. The problem with this approach is that the masked array code might convert `inf` to a masked value, which we don’t want to do (see above). It also means that, if care is not taken, the return value will be a masked array, which will likely be a surprise to the user if they passed in regular arrays.
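Both pitfalls are easy to reproduce with NumPy's masked-array helpers. The sketch below uses `np.ma.masked_invalid` as one example of masked-array code that also masks `inf`; it is not meant to describe what any particular `mstats` function does internally.

```python
import numpy as np

a = np.array([1.0, np.inf, np.nan, 4.0])

# masked_invalid masks every non-finite value, so inf is hidden along with nan.
masked_all_invalid = np.ma.masked_invalid(a)
mask_all_invalid = masked_all_invalid.mask.tolist()  # [False, True, True, False]
max_all_invalid = masked_all_invalid.max()           # 4.0: inf was silently dropped

# Masking only the nan values keeps inf in play...
masked_nan_only = np.ma.masked_where(np.isnan(a), a)
max_nan_only = masked_nan_only.max()                 # inf

# ...but elementwise results are still masked arrays, not regular arrays.
shifted = masked_nan_only - 1.0
returns_masked_array = isinstance(shifted, np.ma.MaskedArray)  # True

# Plain boolean indexing avoids both problems.
max_plain = a[~np.isnan(a)].max()                    # inf
```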

Footnotes