jorenham / Lmo
Trimmed L-moments and L-comoments for robust statistics.
Home Page: https://jorenham.github.io/Lmo/
License: BSD 3-Clause "New" or "Revised" License
Use the Polars extension API to register namespaces, analogous to those for Pandas
related to #184
See: https://github.com/jorenham/Lmo/actions/runs/6687308385/job/18167801378
The np.int_ type is a C long, which is 32 bits on win64, because, well.. err... logic...?
... anyway, this could be solved by replacing all np.int_'s with np.int64.
Another option could be to switch to _sh_jacobi_f in the cases where _sh_jacobi_i would overflow, given k and dtype, in sh_legendre, and similarly in sh_jacobi(k, a, b, dtype).
Note that in practice, this is only a problem in _lm._l_weights_pwm. The _lm._l_weights_ostat method is slower and slightly less precise, but a lot less likely to overflow.
For any users facing this issue: the workaround is to use e.g. trim=1e-15 instead of trim=0. This forces the ostat weights to be used, since the pwm weights can only handle integer trimming.
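To illustrate the platform dependence, a small sketch (the exact overflow threshold depends on r and trim; `comb(34, 17)` is just an example of a binomial coefficient that no longer fits in 32 bits):

```python
from math import comb

import numpy as np

# np.int_ maps to a C long: 32 bits on win64 (with NumPy < 2), 64 bits on
# most other platforms. np.int64 is 64 bits everywhere.
print(np.iinfo(np.int_).bits)
assert np.iinfo(np.int64).bits == 64

# The binomial coefficients in the weight matrices grow quickly; this one
# already exceeds the 32-bit integer range:
assert comb(34, 17) > np.iinfo(np.int32).max
```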
E.g. lmo.l_loc(x, axis=None) always returns a scalar T, but the type checker cannot narrow the return type any further than npt.NDArray[T] | T.
Implement the
See https://rdrr.io/cran/lmomco/src/R/tau34sq.normtest.R for a reference implementation in R.
When cache=None (default), cache the weights iff:

- n <= CACHE_N_MAX, with e.g. CACHE_N_MAX = (1 << 30) - 1
- r <= CACHE_R_MAX, with e.g. CACHE_R_MAX = 16
- sum(trim) <= CACHE_TRIM_MAX, with e.g. CACHE_TRIM_MAX = 4
- sum(trim) <: int (don't cache if fractional trim; it's weird anyhow)

Currently, these always return a np.ndarray, even when the input is an e.g. pd.Series.
By making these functions aware of the potential __array_ufunc__ or __array_function__ methods, they can automatically "lift" themselves, so that the types of ufunc-aware instances can pass through.
It's probably best to use np.frompyfunc for this.
Note that this will require additional @overload's.
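The mechanism in a nutshell, sketched with a toy wrapper type (pd.Series implements the same protocol):

```python
import numpy as np


# Toy ufunc-aware container: any ufunc applied to it returns the wrapper
# type instead of a bare ndarray.
class Tagged:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        raw = [i.data if isinstance(i, Tagged) else i for i in inputs]
        return Tagged(getattr(ufunc, method)(*raw, **kwargs))


t = Tagged([1.0, 2.0, 3.0])
assert isinstance(np.add(t, 1.0), Tagged)  # type passed through

# np.frompyfunc turns a plain Python function into a true ufunc, so it
# dispatches through __array_ufunc__ as well:
inc = np.frompyfunc(lambda v: v + 1.0, 1, 1)
assert isinstance(inc(t), Tagged)
```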
The current monkey-patched methods have several issues:

- they rely on scipy.stats.rv_* internals, which can break without warning
- joblib + loky (I suspect it could break, although I have yet to test this)
- no way to tell whether the installed scipy and lmo are incompatible
- they require import lmo (the first time)

Instead, it'd be better to extend the lmo.l_* moment functions, making them accept the rv_ instances directly.
- both lmo.distributions and scipy.stats distributions
- TypedDict's for **kwargs, for specifying extra options to the underlying sample- and population-L-moment estimators
- additional @overload's
- lmo.contrib.scipy_stats.l_rv_generic and the related monkey-patch machinery
and the related monkey-patch machineryThis can be solved with either np.lexsort
or np.argsort
+ structured type's. So a small performance test is requires to select the best of the two.
Alternatively, a post-sort check can quickly be done to identify (consecutive) duplicate values. These sub-array's can then individually be re-ordered, using the concomitant's ordering. But I suspect this to be a lot slower that the other two options.
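The two candidates to benchmark, side by side (np.lexsort sorts by its *last* key first, so ties in x fall back to y):

```python
import numpy as np

x = np.array([3.0, 1.0, 3.0, 2.0, 3.0])
y = np.array([0.9, 0.1, 0.2, 0.5, 0.4])

# Option 1: lexsort -- primary key x, ties broken by the concomitant y.
idx_lex = np.lexsort((y, x))

# Option 2: argsort over a structured dtype with the same key order.
s = np.empty(x.shape, dtype=[("x", x.dtype), ("y", y.dtype)])
s["x"], s["y"] = x, y
idx_struct = np.argsort(s, order=("x", "y"))

assert (idx_lex == idx_struct).all()
```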
The strict bounds should be divided by the L-scale, not by the upper bound of the L-scale. This results in bounds that are too tight.
E.g. the strict untrimmed L-moment bounds for
Analogous to the lmo.theoretical influence functions from #22, add the following methods:

- lmo.l_moment_influence(x: array_like[number], r: int, *, **) -> (T) -> T
- lmo.l_ratio_influence(x: array_like[number], r: int, k: int = 2, *, **) -> (T) -> T
L-moment ratio diagrams are the Swiss jackknife for distribution identification, and are in general a very useful tool for exploratory statistical analysis.
This requires:

- contrib.matplotlib

Some examples for inspiration:
Lmo heard some news about some spreadsheet software that he wants to play with.
But Lmo has to tame a big snake before he can do that.
Luckily, Lmo is best friends with another small snake already.
Relevant resources:
Useful for globally or temporarily overriding defaults, e.g. those used for kwargs such as dtype, trim, cache, sort, rowvar, n_extra, etc.
See polars.Config
for a similar feature in polars with a clean API.
Setting:

```python
lmo.config.set(sort='stable')
```

```python
with lmo.config(sort='stable'):
    ...
```

or

```python
@lmo.config.override(sort='stable')
def spam(...):
    ...
```
- lmo.config namespace
- class ConfigOptions(TypedDict, total=False): ...
- _CONFIG: collections.ChainMap for the global config (although maybe an alternative or subclass is needed for TypedDict support)
- initialize _CONFIG using default values
- lmo.config.override(**options: *ConfigOptions) contextmanager + decorator, by having it push/pop from _CONFIG
- lmo.config.set(*ConfigOptions) by replacing the root _CONFIG, so that any potential locally overridden options aren't affected

Environment variables, so that e.g. LMO_SORT="stable" can be used to specify sort='stable' (supersedes the root config).
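A sketch of the push/pop mechanics under these assumptions (the option names and defaults are illustrative, not Lmo's actual config):

```python
import os
from collections import ChainMap
from contextlib import contextmanager

# Root defaults at the bottom; override() pushes a child map on top.
_DEFAULTS = {"sort": "quicksort", "cache": None}
_CONFIG = ChainMap({}, _DEFAULTS)


@contextmanager
def override(**options):
    global _CONFIG
    _CONFIG = _CONFIG.new_child(options)  # push
    try:
        yield
    finally:
        _CONFIG = _CONFIG.parents  # pop


def get(key):
    # Local overrides win; env vars supersede only the root defaults.
    for layer in _CONFIG.maps[:-1]:
        if key in layer:
            return layer[key]
    env = os.environ.get(f"LMO_{key.upper()}")
    return env if env is not None else _CONFIG[key]


with override(sort="stable"):
    assert get("sort") == "stable"
assert get("sort") == "quicksort"
```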
pyproject.toml
Something like

```toml
[tool.lmo]
sort = "stable"
```

It should have a lower priority than the environment variables.
Modify lmo.l_weights to accept trim: tuple[float, float], by replacing the succession-matrix algorithm with a direct approach, so that gamma functions can be used (instead of comb).
This is already possible for population L-moments.
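The gamma-function generalization of comb that such a direct approach would rely on (a sketch; valid here for non-negative arguments):

```python
from math import comb, gamma


def gcomb(n: float, k: float) -> float:
    # Generalized binomial coefficient: gamma(n+1) / (gamma(k+1) * gamma(n-k+1)).
    # Agrees with math.comb for integer arguments, but also accepts the
    # fractional arguments that fractional trimming produces.
    return gamma(n + 1.0) / (gamma(k + 1.0) * gamma(n - k + 1.0))


assert abs(gcomb(5, 2) - comb(5, 2)) < 1e-12  # 10
assert abs(gcomb(2.5, 1.5) - 2.5) < 1e-12     # fractional arguments work
```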
Optional support for SymPy, e.g. utility functions for:

- sympy.stats distributions with symbolic L-moment methods, analogously to scipy.stats
- a symbolic alternative to scipy.integrate.quad
- l_poly (as alternative to the current naïve summation)

Currently, data with nan's are either always propagated, or an error is raised (e.g. np.asarray_chkfinite raises a ValueError).
The behaviour around nan's is currently unspecified and untested.
Luckily, SciPy has its very well-specified nan_policy (https://scipy.github.io/devdocs/dev/api-dev/nan_policy.html), which appears to be very much applicable to (at least) Lmo's sample estimators.
```pycon
>>> a = np.array([0, 1, 2, 3, 4, 5, 6, np.inf])
>>> lmo.l_stats(a, trim=0)
array([inf, inf, nan, nan])
>>> lmo.l_stats(a, trim=(0, 1))
array([nan, nan, nan, nan])
>>> lmo.l_stats(a, trim=1)
array([nan, nan, nan, nan])
```
Only the trim=0 case is correct; in the other cases, it should be:

```pycon
>>> lmo.l_stats(a, trim=(0, 1))
array([2.   , 1.125, 0.   , 0.   ])
>>> lmo.l_stats(a, trim=1)
array([3.5, 0.9, 0. , 0. ])
```
Most of the startup time comes from the numpy and scipy imports.
In some cases, scipy imports can be replaced with stdlib math functions.
In other cases, it sometimes makes sense to move the top-level imports into the relevant functions.
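Both tactics in miniature (the deferred-import target below is illustrative):

```python
from math import lgamma, log

# (1) Replace scipy with stdlib math where a scalar version suffices:
# math.lgamma can stand in for scipy.special.gammaln.
assert abs(lgamma(5.0) - log(24.0)) < 1e-12  # gammaln(5) == log(4!)


# (2) Defer a heavy import into the function that needs it; the module is
# cached after the first call, so only `import lmo` itself gets faster.
def _integrate(fn, a, b):
    from scipy.integrate import quad  # deferred import
    return quad(fn, a, b)[0]
```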
Add contrib.numba, and use a no-op @jit on the relevant functions if numba is not available.
Additionally, the scipy.integrate.quad integrand can be sped up with cfunc, see https://numba.readthedocs.io/en/stable/user/cfunc.html#example
In places where scipy.special functions are used, some trickery is needed.
For example, this snippet is used to make scipy.special.erfi work within numba-jitted functions for np.float64 input:
```python
# lmo/contrib/numba.py
import ctypes

import numba
import numba.extending
import scipy.special


def _overload_scipy_special_erfi():
    _erfi_f8 = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)(
        numba.extending.get_cython_function_address(
            'scipy.special.cython_special',
            '__pyx_fuse_1erfi',
        ),
    )

    @numba.extending.overload(scipy.special.erfi)
    def numba_erfi(*args):
        match args:
            case (numba.types.Float(),):
                def _numba_erfi(*args):
                    return _erfi_f8(*args)
                return _numba_erfi
            case _:
                return None


def overload_scipy_special():
    _overload_scipy_special_erfi()
```
```toml
# lmo/pyproject.toml
[tool.poetry.plugins.numba_extensions]
init = "lmo.contrib.numba:overload_scipy_special"
```
```pycon
>>> scipy.stats.wald.l_moment(1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-bb1131db051c> in ?()
----> 1 scipy.stats.wald.l_moment(1)

~/.local/lib/python3.12/site-packages/lmo/contrib/scipy_stats.py in ?(self, r, trim, quad_opts, *args, **kwds)
    317             # undefined mean -> distr is "pathological" (e.g. cauchy)
    318             return np.full(_r.shape, np.nan)[()]
    319
    320         # L-moments of the standard distribution (loc=0, scale=scale0)
--> 321         l0_r = self._l_moment(_r, *args, trim=_trim, quad_opts=quad_opts)
    322
    323         # shift (by loc) and scale
    324         shift_r = loc * (_r == 1)

~/.local/lib/python3.12/site-packages/lmo/contrib/scipy_stats.py in ?(self, r, trim, quad_opts, *args)
    169         of specific distributions, `r` and `trim`.
    170
    171         """
    172         cdf, ppf = self._get_xxf(*args)
--> 173         lmbda_r = l_moment_from_cdf(
    174             cdf,
    175             r,
    176             trim=trim,

~/.local/lib/python3.12/site-packages/lmo/theoretical.py in ?(cdf, r, trim, support, quad_opts, alpha, ppf)
    333             * eval_sh_jacobi(_r - 2, t + 1, s + 1, p)
    334         )
    335
    336     a, d = support or _tighten_cdf_support(cdf, support)
--> 337     b, c = (ppf(alpha), ppf(1 - alpha)) if ppf else (a, d)
    338
    339     loc0 = a if np.isfinite(a) and a > 0 else 0
    340

~/.local/lib/python3.12/site-packages/scipy/stats/_continuous_distns.py in ?(self, x)
  10207     def _ppf(self, x):
> 10208         return invgauss._ppf(x, 1.0)

~/.local/lib/python3.12/site-packages/scipy/stats/_continuous_distns.py in ?(self, x, mu)
   4537         with np.errstate(divide='ignore', over='ignore', invalid='ignore'):
   4538             x, mu = np.broadcast_arrays(x, mu)
   4539             ppf = _boost._invgauss_ppf(x, mu, 1)
   4540             i_wt = x > 0.5  # "wrong tail" - sometimes too inaccurate
-> 4541             ppf[i_wt] = _boost._invgauss_isf(1-x[i_wt], mu[i_wt], 1)
   4542             i_nan = np.isnan(ppf)
   4543             ppf[i_nan] = super()._ppf(x[i_nan], mu[i_nan])
   4544             return ppf

TypeError: 'numpy.float64' object does not support item assignment
```
Apparently, scipy.stats.wald._ppf requires the input to be a >0-d array-like (a scalar won't do). This isn't the case for most (or perhaps all) other continuous distributions.
Hössjer & Karlsson (2023) describe the framework of L-functionals, which (almost) generalize the Legendre-based (untrimmed) L-moments. They additionally re-invent Hyowon An's Gaussian-centered "HL-moments", and dub them "Hermite L-functionals". Similarly, they introduce the (novel) Laguerre L-functionals with Exp(1)
as reference distribution, which sound rather promising IMHO.
They further show how to apply these L-functionals for regression in (transformed) linear- and quantile-regression models (!), using a very flexible, yet practical approach.
Their use of conditional L-functionals might be interesting to attempt to "backport" into the familiar (Legendre- & Jacobi-based) L-moments.
For now, more research into these L-functionals is required before coming up with a concrete implementation plan.
Extend the scipy.stats
joint/multivariate distributions with L-comoment (ratio) methods, using lmo.theoretical.l_comoment_from_pdf
and lmo.theoretical.l_coratio_from_pdf.
This requires the joint PDF, and the marginal CDF's.
Unfortunately, the marginals are nowhere to be found in scipy.stats._multivariate.multi_rv_generic
or its subtypes.
So the only way to implement this feature is by figuring out the marginals of each joint distribution, and manually adding the l_co(moment|ratio|stats|loc|scale|rr|skew|kurtosis) methods to their respective scipy.stats._multivariate.{}_(gen|frozen) types:

- multivariate_normal (norm marginals)
- dirichlet (beta marginals)
- multivariate_t (t marginals)

The 13 other ones are either discrete, matrix-valued, directional, or have no .pdf() method, and are therefore out-of-scope.
This builds upon #6, which covered the continuous distributions.
... by using the (reasonable) assumption that "no. samples" > "no. variables".
This can sneakily be exploited to use e.g. lmo.l_moment for the calculation of pairwise L-comoments, e.g. lmo.l_moment(x[np.argsort(y)], ..., sort=False).
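A sketch of the trick with plain NumPy, using the untrimmed PWM estimate l2 = 2*b1 - b0 as a stand-in for lmo.l_moment: applying the L-scale weights to x ordered by y yields the second L-comoment of x with respect to y.

```python
import numpy as np


def l2_pwm(a):
    # Sample L-scale via probability-weighted moments; assumes `a` is
    # already in the desired order (i.e. sort=False semantics).
    n = len(a)
    b0 = a.mean()
    b1 = ((np.arange(n) / (n - 1)) * a).mean()
    return 2.0 * b1 - b0


rng = np.random.default_rng(0)
y = rng.normal(size=1000)
x = 0.5 * y + rng.normal(size=1000)

l2_x = l2_pwm(np.sort(x))          # ordinary sample L-scale of x
l2_xy = l2_pwm(x[np.argsort(y)])   # L-coscale of x w.r.t. y (concomitants)
```

For positively dependent x and y, l2_xy comes out positive but smaller in magnitude than l2_x.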
Currently, only integral trimming is supported. See Hosking (2015), and my "offline analogue notes", for implementation details.
Plots of the weights of
I.e. the L-moments that are listed here:
https://jorenham.github.io/Lmo/distributions/
blocker: numpy/numpy#26199
update: numpy/numpy#26199 has been fixed; now waiting for the next 2.0.0(.rc2)? release
- l_moment, l_stats, etc. to the "generic" and "frozen" distn bases (monkeypatch)
- l_moment on the specific rv_continuous instances

Asymptotic covariance is defined as
with statistical functional
Illustrate that L-moments can be a better choice than conventional moments.
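A minimal illustration of the robustness argument, using the PWM estimate of the sample L-scale (a sketch, not lmo's actual implementation): a single outlier inflates the standard deviation by a larger factor than the L-scale.

```python
import numpy as np


def l_scale(a):
    # Sample L-scale l2 = 2*b1 - b0, from probability-weighted moments.
    a = np.sort(np.asarray(a, dtype=np.float64))
    n = len(a)
    b0 = a.mean()
    b1 = ((np.arange(n) / (n - 1)) * a).mean()
    return 2.0 * b1 - b0


clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
dirty = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# One outlier: std grows by a larger factor than the L-scale does.
assert dirty.std(ddof=1) / clean.std(ddof=1) > l_scale(dirty) / l_scale(clean)
```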