scipy.stats.qmc.

LatinHypercube#

class scipy.stats.qmc.LatinHypercube(d, *, scramble=True, strength=1, optimization=None, seed=None)[source]#

Latin hypercube sampling (LHS).

A Latin hypercube sample [1] generates \(n\) points in \([0,1)^{d}\). Each univariate marginal distribution is stratified, placing exactly one point in \([j/n, (j+1)/n)\) for \(j=0,1,...,n-1\). They are still applicable when \(n << d\).

Parameters:

dint

Dimension of the parameter space.

scramblebool, optional

When False, center samples within cells of a multi-dimensional grid. Otherwise, samples are randomly placed within cells of the grid.

Note

Setting scramble=False does not ensure deterministic output. For that, use the seed parameter.

Default is True.

Added in version 1.10.0.

optimization{None, “random-cd”, “lloyd”}, optional

Whether to use an optimization scheme to improve the quality after sampling. Note that this is a post-processing step that does not guarantee that all properties of the sample will be conserved. Default is None.

random-cd: random permutations of coordinates to lower the centered discrepancy. The best sample based on the centered discrepancy is constantly updated. Centered discrepancy-based sampling shows better space-filling robustness toward 2D and 3D subprojections compared to using other discrepancy measures.
lloyd: Perturb samples using a modified Lloyd-Max algorithm. The process converges to equally spaced samples.

Added in version 1.8.0.

Changed in version 1.10.0: Add lloyd.

strength{1, 2}, optional

Strength of the LHS. strength=1 produces a plain LHS while strength=2 produces an orthogonal array based LHS of strength 2 [7], [8]. In that case, only n=p**2 points can be sampled, with p a prime number. It also constrains d <= p + 1. Default is 1.

Added in version 1.8.0.

seed{None, int, numpy.random.Generator}, optional

If seed is an int or None, a new numpy.random.Generator is created using np.random.default_rng(seed). If seed is already a Generator instance, then the provided instance is used.

Methods

`fast_forward`(n)	Fast-forward the sequence by n positions.
`integers`(l_bounds, *[, u_bounds, n, ...])	Draw n integers from l_bounds (inclusive) to u_bounds (exclusive), or if endpoint=True, l_bounds (inclusive) to u_bounds (inclusive).
`random`([n, workers])	Draw n in the half-open interval `[0, 1)`.
`reset`()	Reset the engine to base state.

See also

Quasi-Monte Carlo

Notes

When LHS is used for integrating a function \(f\) over \(n\), LHS is extremely effective on integrands that are nearly additive [2]. With a LHS of \(n\) points, the variance of the integral is always lower than plain MC on \(n-1\) points [3]. There is a central limit theorem for LHS on the mean and variance of the integral [4], but not necessarily for optimized LHS due to the randomization.

\(A\) is called an orthogonal array of strength \(t\) if in each n-row-by-t-column submatrix of \(A\): all \(p^t\) possible distinct rows occur the same number of times. The elements of \(A\) are in the set \(\{0, 1, ..., p-1\}\), also called symbols. The constraint that \(p\) must be a prime number is to allow modular arithmetic. Increasing strength adds some symmetry to the sub-projections of a sample. With strength 2, samples are symmetric along the diagonals of 2D sub-projections. This may be undesirable, but on the other hand, the sample dispersion is improved.

Strength 1 (plain LHS) brings an advantage over strength 0 (MC) and strength 2 is a useful increment over strength 1. Going to strength 3 is a smaller increment and scrambled QMC like Sobol’, Halton are more performant [7].

To create a LHS of strength 2, the orthogonal array \(A\) is randomized by applying a random, bijective map of the set of symbols onto itself. For example, in column 0, all 0s might become 2; in column 1, all 0s might become 1, etc. Then, for each column \(i\) and symbol \(j\), we add a plain, one-dimensional LHS of size \(p\) to the subarray where \(A^i = j\). The resulting matrix is finally divided by \(p\).

References

[1]

Mckay et al., “A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code.” Technometrics, 1979.

[2]

M. Stein, “Large sample properties of simulations using Latin hypercube sampling.” Technometrics 29, no. 2: 143-151, 1987.

[3]

A. B. Owen, “Monte Carlo variance of scrambled net quadrature.” SIAM Journal on Numerical Analysis 34, no. 5: 1884-1910, 1997

[4]

Loh, W.-L. “On Latin hypercube sampling.” The annals of statistics 24, no. 5: 2058-2080, 1996.

[5]

Fang et al. “Design and modeling for computer experiments”. Computer Science and Data Analysis Series, 2006.

[6]

Damblin et al., “Numerical studies of space filling designs: optimization of Latin Hypercube Samples and subprojection properties.” Journal of Simulation, 2013.

[7] (1,2)

A. B. Owen , “Orthogonal arrays for computer experiments, integration and visualization.” Statistica Sinica, 1992.

[8]

B. Tang, “Orthogonal Array-Based Latin Hypercubes.” Journal of the American Statistical Association, 1993.

[9]

Seaholm, Susan K. et al. (1988). Latin hypercube sampling and the sensitivity analysis of a Monte Carlo epidemic model. Int J Biomed Comput, 23(1-2), 97-112. DOI:10.1016/0020-7101(88)90067-0

Examples

Generate samples from a Latin hypercube generator.

>>> from scipy.stats import qmc
>>> sampler = qmc.LatinHypercube(d=2)
>>> sample = sampler.random(n=5)
>>> sample
array([[0.1545328 , 0.53664833], # random
        [0.84052691, 0.06474907],
        [0.52177809, 0.93343721],
        [0.68033825, 0.36265316],
        [0.26544879, 0.61163943]])

Compute the quality of the sample using the discrepancy criterion.

>>> qmc.discrepancy(sample)
0.0196... # random

Samples can be scaled to bounds.

>>> l_bounds = [0, 2]
>>> u_bounds = [10, 5]
>>> qmc.scale(sample, l_bounds, u_bounds)
array([[1.54532796, 3.609945 ], # random
        [8.40526909, 2.1942472 ],
        [5.2177809 , 4.80031164],
        [6.80338249, 3.08795949],
        [2.65448791, 3.83491828]])

Below are other examples showing alternative ways to construct LHS with even better coverage of the space.

Using a base LHS as a baseline.

>>> sampler = qmc.LatinHypercube(d=2)
>>> sample = sampler.random(n=5)
>>> qmc.discrepancy(sample)
0.0196...  # random

Use the optimization keyword argument to produce a LHS with lower discrepancy at higher computational cost.

>>> sampler = qmc.LatinHypercube(d=2, optimization="random-cd")
>>> sample = sampler.random(n=5)
>>> qmc.discrepancy(sample)
0.0176...  # random

Use the strength keyword argument to produce an orthogonal array based LHS of strength 2. In this case, the number of sample points must be the square of a prime number.

>>> sampler = qmc.LatinHypercube(d=2, strength=2)
>>> sample = sampler.random(n=9)
>>> qmc.discrepancy(sample)
0.00526...  # random

Options could be combined to produce an optimized centered orthogonal array based LHS. After optimization, the result would not be guaranteed to be of strength 2.

Real-world example

In [9], a Latin Hypercube sampling (LHS) strategy was used to sample a parameter space to study the importance of each parameter of an epidemic model. Such analysis is also called a sensitivity analysis.

Since the dimensionality of the problem is high (6), it is computationally expensive to cover the space. When numerical experiments are costly, QMC enables analysis that may not be possible if using a grid.

The six parameters of the model represented the probability of illness, the probability of withdrawal, and four contact probabilities. The authors assumed uniform distributions for all parameters and generated 50 samples.

Using scipy.stats.qmc.LatinHypercube to replicate the protocol, the first step is to create a sample in the unit hypercube:

>>> from scipy.stats import qmc
>>> sampler = qmc.LatinHypercube(d=6)
>>> sample = sampler.random(n=50)

Then the sample can be scaled to the appropriate bounds:

>>> l_bounds = [0.000125, 0.01, 0.0025, 0.05, 0.47, 0.7]
>>> u_bounds = [0.000375, 0.03, 0.0075, 0.15, 0.87, 0.9]
>>> sample_scaled = qmc.scale(sample, l_bounds, u_bounds)

Such a sample was used to run the model 50 times, and a polynomial response surface was constructed. This allowed the authors to study the relative importance of each parameter across the range of possibilities of every other parameter.

In this computer experiment, they showed a 14-fold reduction in the number of samples required to maintain an error below 2% on their response surface when compared to a grid sampling.