jorenham / Lmo
Trimmed L-moments and L-comoments for robust statistics.
Home Page: https://jorenham.github.io/Lmo/
License: BSD 3-Clause "New" or "Revised" License
Use the Polars extension API to register namespaces, analogous to those for Pandas
related to #184
See: https://github.com/jorenham/Lmo/actions/runs/6687308385/job/18167801378
The np.int_ type is a C long, which is 32 bits on win64, because, well.. err... logic...?
... anyway, this could be solved by replacing all np.int_'s with np.int64.
Another option could be to switch to _sh_jacobi_f in the cases where _sh_jacobi_i would overflow, given k and dtype, in sh_legendre, and similarly in sh_jacobi(k, a, b, dtype).
Note that in practice, this is only a problem in _lm._l_weights_pwm. The _lm._l_weights_ostat method is slower and slightly less precise, but a lot less likely to overflow.
For any users facing this issue: the workaround is to use e.g. trim=1e-15 instead of trim=0. This forces the ostat weights to be used, since the pwm weights can only handle integer trimming.
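To illustrate the platform dependence, a small sketch (the exact overflow threshold depends on r and trim; `comb(34, 17)` is just an example of a binomial coefficient that no longer fits in 32 bits):

```python
from math import comb

import numpy as np

# np.int_ maps to a C long: 32 bits on win64 (with NumPy < 2), 64 bits on
# most other platforms. np.int64 is 64 bits everywhere.
print(np.iinfo(np.int_).bits)
assert np.iinfo(np.int64).bits == 64

# The binomial coefficients in the weight matrices grow quickly; this one
# already exceeds the 32-bit integer range:
assert comb(34, 17) > np.iinfo(np.int32).max
```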
E.g. lmo.l_loc(x, axis=None) always returns a scalar T, but the type checker cannot narrow the return type any further than npt.NDArray[T] | T.
Implement the
See https://rdrr.io/cran/lmomco/src/R/tau34sq.normtest.R for a reference implementation in R.
When cache=None (default), cache the weights iff:

- n <= CACHE_N_MAX, with e.g. CACHE_N_MAX = (1 << 30) - 1
- r <= CACHE_R_MAX, with e.g. CACHE_R_MAX = 16
- sum(trim) <= CACHE_TRIM_MAX, with e.g. CACHE_TRIM_MAX = 4
- sum(trim) <: int (don't cache if fractional trim; it's weird anyhow)

Currently, these always return a np.ndarray, even when the input is an e.g. pd.Series.
By making these functions aware of the potential __array_ufunc__ or __array_function__ methods, they can automatically "lift" themselves, so that the types of ufunc-aware instances can pass through.
It's probably best to use np.frompyfunc for this.
Note that this will require additional @overload's.
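The mechanism in a nutshell, sketched with a toy wrapper type (pd.Series implements the same protocol):

```python
import numpy as np


# Toy ufunc-aware container: any ufunc applied to it returns the wrapper
# type instead of a bare ndarray.
class Tagged:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        raw = [i.data if isinstance(i, Tagged) else i for i in inputs]
        return Tagged(getattr(ufunc, method)(*raw, **kwargs))


t = Tagged([1.0, 2.0, 3.0])
assert isinstance(np.add(t, 1.0), Tagged)  # type passed through

# np.frompyfunc turns a plain Python function into a true ufunc, so it
# dispatches through __array_ufunc__ as well:
inc = np.frompyfunc(lambda v: v + 1.0, 1, 1)
assert isinstance(inc(t), Tagged)
```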
The current monkey-patched methods have several issues:

- they rely on scipy.stats.rv_* internals, which can break without warning
- joblib + loky (I suspect it could break, although I have yet to test this)
- no way to tell whether the installed scipy and lmo are incompatible
- they require import lmo (the first time)

Instead, it'd be better to extend the lmo.l_* moment functions, making them accept the rv_ instances directly.
- both lmo.distributions and scipy.stats distributions
- TypedDict's for **kwargs, for specifying extra options to the underlying sample- and population-L-moment estimators
- additional @overload's
- lmo.contrib.scipy_stats.l_rv_generic and the related monkey-patch machinery
and the related monkey-patch machineryThis can be solved with either np.lexsort
or np.argsort
+ structured type's. So a small performance test is requires to select the best of the two.
Alternatively, a post-sort check can quickly be done to identify (consecutive) duplicate values. These sub-array's can then individually be re-ordered, using the concomitant's ordering. But I suspect this to be a lot slower that the other two options.
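The two candidates to benchmark, side by side (np.lexsort sorts by its *last* key first, so ties in x fall back to y):

```python
import numpy as np

x = np.array([3.0, 1.0, 3.0, 2.0, 3.0])
y = np.array([0.9, 0.1, 0.2, 0.5, 0.4])

# Option 1: lexsort -- primary key x, ties broken by the concomitant y.
idx_lex = np.lexsort((y, x))

# Option 2: argsort over a structured dtype with the same key order.
s = np.empty(x.shape, dtype=[("x", x.dtype), ("y", y.dtype)])
s["x"], s["y"] = x, y
idx_struct = np.argsort(s, order=("x", "y"))

assert (idx_lex == idx_struct).all()
```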
The strict bounds should be divided by the L-scale, not by the upper bound of the L-scale. This results in bounds that are too tight.
E.g. the strict untrimmed L-moment bounds for
Analogous to the lmo.theoretical influence functions from #22, add the following methods:

- lmo.l_moment_influence(x: array_like[number], r: int, *, **) -> (T) -> T
- lmo.l_ratio_influence(x: array_like[number], r: int, k: int = 2, *, **) -> (T) -> T
L-moment ratio diagrams are the Swiss jackknife for distribution identification, and are in general a very useful tool for exploratory statistical analysis.
This requires:

- contrib.matplotlib

Some examples for inspiration:
Lmo heard some news about some spreadsheet software that he wants to play with.
But Lmo has to tame a big snake before he can do that.
Luckily, Lmo is best friends with another small snake already.
Relevant resources:
Useful for globally or temporarily overriding defaults, e.g. those used for kwargs such as dtype, trim, cache, sort, rowvar, n_extra, etc.
See polars.Config
for a similar feature in polars with a clean API.
Setting:

```python
lmo.config.set(sort='stable')
```

```python
with lmo.config(sort='stable'):
    ...
```

or

```python
@lmo.config.override(sort='stable')
def spam(...):
    ...
```
- lmo.config namespace
- class ConfigOptions(TypedDict, total=False): ...
- _CONFIG: collections.ChainMap for the global config (although maybe an alternative or subclass is needed for TypedDict support)
- initialize _CONFIG using default values
- lmo.config.override(**options: *ConfigOptions) contextmanager + decorator, by having it push/pop from _CONFIG
- lmo.config.set(*ConfigOptions) by replacing the root _CONFIG, so that any potential locally overridden options aren't affected

Environment variables, so that e.g. LMO_SORT="stable" can be used to specify sort='stable' (supersedes the root config).
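A sketch of the push/pop mechanics under these assumptions (the option names and defaults are illustrative, not Lmo's actual config):

```python
import os
from collections import ChainMap
from contextlib import contextmanager

# Root defaults at the bottom; override() pushes a child map on top.
_DEFAULTS = {"sort": "quicksort", "cache": None}
_CONFIG = ChainMap({}, _DEFAULTS)


@contextmanager
def override(**options):
    global _CONFIG
    _CONFIG = _CONFIG.new_child(options)  # push
    try:
        yield
    finally:
        _CONFIG = _CONFIG.parents  # pop


def get(key):
    # Local overrides win; env vars supersede only the root defaults.
    for layer in _CONFIG.maps[:-1]:
        if key in layer:
            return layer[key]
    env = os.environ.get(f"LMO_{key.upper()}")
    return env if env is not None else _CONFIG[key]


with override(sort="stable"):
    assert get("sort") == "stable"
assert get("sort") == "quicksort"
```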
pyproject.toml
Something like

```toml
[tool.lmo]
sort = "stable"
```

It should have a lower priority than the environment variables.
Modify lmo.l_weights to accept trim: tuple[float, float], by replacing the succession-matrix algorithm with a direct approach, so that gamma functions can be used (instead of comb).
This is already possible for population L-moments.
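The gamma-function generalization of comb that such a direct approach would rely on (a sketch; valid here for non-negative arguments):

```python
from math import comb, gamma


def gcomb(n: float, k: float) -> float:
    # Generalized binomial coefficient: gamma(n+1) / (gamma(k+1) * gamma(n-k+1)).
    # Agrees with math.comb for integer arguments, but also accepts the
    # fractional arguments that fractional trimming produces.
    return gamma(n + 1.0) / (gamma(k + 1.0) * gamma(n - k + 1.0))


assert abs(gcomb(5, 2) - comb(5, 2)) < 1e-12  # 10
assert abs(gcomb(2.5, 1.5) - 2.5) < 1e-12     # fractional arguments work
```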
Optional support for SymPy, e.g. utility functions for:

- sympy.stats distributions with symbolic L-moment methods, analogously to scipy.stats
- a symbolic alternative to scipy.integrate.quad
- l_poly (as alternative to the current naïve summation)

Currently, data with nan's are either always propagated, or an error is raised (e.g. np.asarray_chkfinite raises a ValueError).
The behaviour around nan's is currently unspecified and untested.
Luckily, SciPy has its very well-specified nan_policy (https://scipy.github.io/devdocs/dev/api-dev/nan_policy.html), which appears to be very much applicable to (at least) Lmo's sample estimators.
```pycon
>>> a = np.array([0, 1, 2, 3, 4, 5, 6, np.inf])
>>> lmo.l_stats(a, trim=0)
array([inf, inf, nan, nan])
>>> lmo.l_stats(a, trim=(0, 1))
array([nan, nan, nan, nan])
>>> lmo.l_stats(a, trim=1)
array([nan, nan, nan, nan])
```
Only the trim=0 case is correct; in the other cases, it should be:

```pycon
>>> lmo.l_stats(a, trim=(0, 1))
array([2.   , 1.125, 0.   , 0.   ])
>>> lmo.l_stats(a, trim=1)
array([3.5, 0.9, 0. , 0. ])
```
Most of the startup time comes from the numpy and scipy imports.
In some cases, scipy imports can be replaced with stdlib math functions.
In other cases, it sometimes makes sense to move the top-level imports into the relevant functions.
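Both tactics in miniature (the deferred-import target below is illustrative):

```python
from math import lgamma, log

# (1) Replace scipy with stdlib math where a scalar version suffices:
# math.lgamma can stand in for scipy.special.gammaln.
assert abs(lgamma(5.0) - log(24.0)) < 1e-12  # gammaln(5) == log(4!)


# (2) Defer a heavy import into the function that needs it; the module is
# cached after the first call, so only `import lmo` itself gets faster.
def _integrate(fn, a, b):
    from scipy.integrate import quad  # deferred import
    return quad(fn, a, b)[0]
```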
Add contrib.numba, and use a no-op @jit on the relevant functions if numba is not available.
Additionally, the scipy.integrate.quad integrand can be sped up with cfunc, see https://numba.readthedocs.io/en/stable/user/cfunc.html#example
In places where scipy.special functions are used, some trickery is needed.
For example, this snippet is used to make scipy.special.erfi work within numba-jitted functions for np.float64 input:
```python
# lmo/contrib/numba.py
import ctypes

import numba
import numba.extending
import scipy.special


def _overload_scipy_special_erfi():
    _erfi_f8 = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)(
        numba.extending.get_cython_function_address(
            'scipy.special.cython_special',
            '__pyx_fuse_1erfi',
        ),
    )

    @numba.extending.overload(scipy.special.erfi)
    def numba_erfi(*args):
        match args:
            case (numba.types.Float(),):
                def _numba_erfi(*args):
                    return _erfi_f8(*args)
                return _numba_erfi
            case _:
                return None


def overload_scipy_special():
    _overload_scipy_special_erfi()
```
```toml
# lmo/pyproject.toml
[tool.poetry.plugins.numba_extensions]
init = "lmo.contrib.numba:overload_scipy_special"
```
```pycon
>>> scipy.stats.wald.l_moment(1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-bb1131db051c> in ?()
----> 1 scipy.stats.wald.l_moment(1)

~/.local/lib/python3.12/site-packages/lmo/contrib/scipy_stats.py in ?(self, r, trim, quad_opts, *args, **kwds)
    317             # undefined mean -> distr is "pathological" (e.g. cauchy)
    318             return np.full(_r.shape, np.nan)[()]
    319
    320         # L-moments of the standard distribution (loc=0, scale=scale0)
--> 321         l0_r = self._l_moment(_r, *args, trim=_trim, quad_opts=quad_opts)
    322
    323         # shift (by loc) and scale
    324         shift_r = loc * (_r == 1)

~/.local/lib/python3.12/site-packages/lmo/contrib/scipy_stats.py in ?(self, r, trim, quad_opts, *args)
    169         of specific distributions, `r` and `trim`.
    170
    171         """
    172         cdf, ppf = self._get_xxf(*args)
--> 173         lmbda_r = l_moment_from_cdf(
    174             cdf,
    175             r,
    176             trim=trim,

~/.local/lib/python3.12/site-packages/lmo/theoretical.py in ?(cdf, r, trim, support, quad_opts, alpha, ppf)
    333             * eval_sh_jacobi(_r - 2, t + 1, s + 1, p)
    334         )
    335
    336     a, d = support or _tighten_cdf_support(cdf, support)
--> 337     b, c = (ppf(alpha), ppf(1 - alpha)) if ppf else (a, d)
    338
    339     loc0 = a if np.isfinite(a) and a > 0 else 0
    340

~/.local/lib/python3.12/site-packages/scipy/stats/_continuous_distns.py in ?(self, x)
  10207     def _ppf(self, x):
> 10208         return invgauss._ppf(x, 1.0)

~/.local/lib/python3.12/site-packages/scipy/stats/_continuous_distns.py in ?(self, x, mu)
   4537         with np.errstate(divide='ignore', over='ignore', invalid='ignore'):
   4538             x, mu = np.broadcast_arrays(x, mu)
   4539             ppf = _boost._invgauss_ppf(x, mu, 1)
   4540             i_wt = x > 0.5  # "wrong tail" - sometimes too inaccurate
-> 4541             ppf[i_wt] = _boost._invgauss_isf(1-x[i_wt], mu[i_wt], 1)
   4542             i_nan = np.isnan(ppf)
   4543             ppf[i_nan] = super()._ppf(x[i_nan], mu[i_nan])
   4544             return ppf

TypeError: 'numpy.float64' object does not support item assignment
```
Apparently, scipy.stats.wald._ppf requires the input to be a >0-d array-like (a scalar won't do). This isn't the case for most (or perhaps all) other continuous distributions.
Hössjer & Karlsson (2023) describe the framework of L-functionals, which (almost) generalize the Legendre-based (untrimmed) L-moments. They additionally re-invent Hyowon An's Gaussian-centered "HL-moments", and dub them "Hermite L-functionals". Similarly, they introduce the (novel) Laguerre L-functionals with Exp(1)
as reference distribution, which sound rather promising IMHO.
They further show how to apply these L-functionals for regression in (transformed) linear- and quantile-regression models (!), using a very flexible, yet practical approach.
Their use of conditional L-functionals might be interesting to attempt to "backport" into the familiar (Legendre- & Jacobi-based) L-moments.
For now, more research into these L-functionals is required before coming up with a concrete implementation plan.
Extend the scipy.stats
joint/multivariate distributions with L-comoment (ratio) methods, using lmo.theoretical.l_comoment_from_pdf
and lmo.theoretical.l_coratio_from_pdf.
This requires the joint PDF, and the marginal CDF's.
Unfortunately, the marginals are nowhere to be found in scipy.stats._multivariate.multi_rv_generic
or its subtypes.
So the only way to implement this feature is by figuring out the marginals of each joint distribution, and manually adding the l_co(moment|ratio|stats|loc|scale|rr|skew|kurtosis) methods to their respective scipy.stats._multivariate.{}_(gen|frozen) types:

- multivariate_normal (norm marginals)
- dirichlet (beta marginals)
- multivariate_t (t marginals)

The 13 other ones are either discrete, matrix-valued, directional, or have no .pdf() method, and are therefore out-of-scope.
This builds upon #6, which covered the continuous distributions.
... by using the (reasonable) assumption that "no. samples" > "no. variables".
This can sneakily be exploited to use e.g. lmo.l_moment for the calculation of pairwise L-comoments, e.g. lmo.l_moment(x[np.argsort(y)], ..., sort=False).
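A sketch of the trick with plain NumPy, using the untrimmed PWM estimate l2 = 2*b1 - b0 as a stand-in for lmo.l_moment: applying the L-scale weights to x ordered by y yields the second L-comoment of x with respect to y.

```python
import numpy as np


def l2_pwm(a):
    # Sample L-scale via probability-weighted moments; assumes `a` is
    # already in the desired order (i.e. sort=False semantics).
    n = len(a)
    b0 = a.mean()
    b1 = ((np.arange(n) / (n - 1)) * a).mean()
    return 2.0 * b1 - b0


rng = np.random.default_rng(0)
y = rng.normal(size=1000)
x = 0.5 * y + rng.normal(size=1000)

l2_x = l2_pwm(np.sort(x))          # ordinary sample L-scale of x
l2_xy = l2_pwm(x[np.argsort(y)])   # L-coscale of x w.r.t. y (concomitants)
```

For positively dependent x and y, l2_xy comes out positive but smaller in magnitude than l2_x.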
Currently, only integral trimming is supported. See Hosking (2015), and my "offline analogue notes", for implementation details.
Plots of the weights of
I.e. the L-moments that are listed here:
https://jorenham.github.io/Lmo/distributions/
blocker: numpy/numpy#26199
update: numpy/numpy#26199 has been fixed; now waiting for the next 2.0.0(.rc2)? release
- l_moment, l_stats, etc. to the "generic" and "frozen" distn bases (monkeypatch)
- l_moment on the specific rv_continuous instances

Asymptotic covariance is defined as
with statistical functional
Illustrate that L-moments can be a better choice than conventional moments.
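A minimal illustration of the robustness argument, using the PWM estimate of the sample L-scale (a sketch, not lmo's actual implementation): a single outlier inflates the standard deviation by a larger factor than the L-scale.

```python
import numpy as np


def l_scale(a):
    # Sample L-scale l2 = 2*b1 - b0, from probability-weighted moments.
    a = np.sort(np.asarray(a, dtype=np.float64))
    n = len(a)
    b0 = a.mean()
    b1 = ((np.arange(n) / (n - 1)) * a).mean()
    return 2.0 * b1 - b0


clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
dirty = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# One outlier: std grows by a larger factor than the L-scale does.
assert dirty.std(ddof=1) / clean.std(ddof=1) > l_scale(dirty) / l_scale(clean)
```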