
python-runstats's Introduction

RunStats: Computing Statistics and Regression in One Pass

RunStats is an Apache2 licensed Python module for online statistics and online regression. Statistics and regression summaries are computed in a single pass. Previous values are not recorded in summaries.

Long running systems often generate numbers summarizing performance. It could be the latency of a response or the time between requests. It's often useful to use these numbers in summary statistics like the arithmetic mean, minimum, standard deviation, etc. When many values are generated, computing these summaries can be computationally intensive. It may even be infeasible to keep every recorded value. In such cases computing online statistics and online regression is necessary.

In other cases, you may only have one opportunity to observe all the recorded values. Python's generators work exactly this way. Traditional methods for calculating the variance and other higher moments require multiple passes over the data. With generators this is not possible, so computing statistics in a single pass is necessary.
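Such a one-pass computation can be sketched with Welford's update (a minimal illustration, not the module's actual implementation):

```python
def one_pass_stats(iterable):
    """Welford's algorithm: mean and sample variance in a single pass,
    without storing any of the values."""
    count, mean, m2 = 0, 0.0, 0.0
    for value in iterable:
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)  # second factor uses the updated mean
    variance = m2 / (count - 1) if count > 1 else float('nan')
    return count, mean, variance

# Works on a generator that can only be consumed once:
count, mean, variance = one_pass_stats(float(n) for n in range(10))
```

The generator above is exhausted after the single pass, yet the summary is complete.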

There are also scenarios where a user is not interested in a complete summary of the entire stream of data but rather wants to observe the current state of the system based on the recent past. In these cases exponential statistics are used. Instead of weighting all values uniformly in the statistics computation, an exponential decay weight is applied to older values. The decay rate is configurable and provides a mechanism for balancing recent values with past values.

The Python RunStats module was designed for these cases by providing classes for computing online summary statistics and online linear regression in a single pass. Summary objects work on sequences which may be larger than memory or disk space permit. They may also be efficiently combined together to create aggregate summaries.

Features

  • Pure-Python
  • Fully Documented
  • 100% Test Coverage
  • Numerically Stable
  • Optional Cython-optimized Extension (5-100 times faster)
  • Statistics summary computes mean, variance, standard deviation, skewness, kurtosis, minimum and maximum.
  • Regression summary computes slope, intercept and correlation.
  • Developed on Python 3.9
  • Tested on CPython 3.6, 3.7, 3.8, 3.9
  • Tested on Linux, Mac OS X, and Windows
  • Tested using GitHub Actions

Quickstart

Installing RunStats is simple with pip:

$ pip install runstats

You can access documentation in the interpreter with Python's built-in help function:

>>> import runstats
>>> help(runstats)                             # doctest: +SKIP
>>> help(runstats.Statistics)                  # doctest: +SKIP
>>> help(runstats.Regression)                  # doctest: +SKIP
>>> help(runstats.ExponentialStatistics)       # doctest: +SKIP

Tutorial

The Python RunStats module provides three types for computing running statistics: Statistics, ExponentialStatistics and Regression. The Regression object leverages Statistics internally for its calculations. Each can be initialized without arguments:

>>> from runstats import Statistics, Regression, ExponentialStatistics
>>> stats = Statistics()
>>> regr = Regression()
>>> exp_stats = ExponentialStatistics()

Statistics objects support four methods for modification. Use push to add values to the summary, clear to reset the summary, sum to combine two Statistics summaries, and multiply to weight a summary by a scalar.

>>> for num in range(10):
...     stats.push(float(num))
>>> stats.mean()
4.5
>>> stats.maximum()
9.0
>>> stats += stats
>>> stats.mean()
4.5
>>> stats.variance()
8.68421052631579
>>> len(stats)
20
>>> stats *= 2
>>> len(stats)
40
>>> stats.clear()
>>> len(stats)
0
>>> stats.minimum()
nan

Use the Python built-in len for the number of pushed values. The Python min and max built-ins cannot be used for the minimum and maximum because they expect a sequence rather than a summary object. The minimum and maximum methods are provided for that purpose:

>>> import random
>>> random.seed(0)
>>> for __ in range(1000):
...     stats.push(random.random())
>>> len(stats)
1000
>>> min(stats)
Traceback (most recent call last):
    ...
TypeError: ...
>>> stats.minimum()
0.00024069652516689466
>>> stats.maximum()
0.9996851255769114

Statistics summaries provide five measures of a series: mean, variance, standard deviation, skewness and kurtosis:

>>> stats = Statistics([1, 2, 5, 12, 5, 2, 1])
>>> stats.mean()
4.0
>>> stats.variance()
15.33333333333333
>>> stats.stddev()
3.915780041490243
>>> stats.skewness()
1.33122127314735
>>> stats.kurtosis()
0.5496219281663506

All internal calculations use Python's float type.

Like Statistics, the Regression type supports three methods for modification: push, clear and sum:

>>> regr.clear()
>>> len(regr)
0
>>> for num in range(10):
...     regr.push(num, num + 5)
>>> len(regr)
10
>>> regr.slope()
1.0
>>> more = Regression((num, num + 5) for num in range(10, 20))
>>> total = regr + more
>>> len(total)
20
>>> total.slope()
1.0
>>> total.intercept()
5.0
>>> total.correlation()
1.0

Regression summaries provide three measures of a series of pairs: slope, intercept and correlation. Note that, as a regression, the points need not exactly lie on a line:

>>> regr = Regression([(1.2, 1.9), (3, 5.1), (4.9, 8.1), (7, 11)])
>>> regr.slope()
1.5668320150154176
>>> regr.intercept()
0.21850113956294415
>>> regr.correlation()
0.9983810791694997

Both constructors accept an optional iterable that is consumed and pushed into the summary. Note that you may pass a generator as an iterable and the generator will be entirely consumed.

An ExponentialStatistics object is constructed by providing a decay rate, an initial mean, and an initial variance. The decay rate defaults to 0.9 and must be between 0 and 1. The initial mean and variance default to zero.

>>> exp_stats = ExponentialStatistics()
>>> exp_stats.decay
0.9
>>> exp_stats.mean()
0.0
>>> exp_stats.variance()
0.0

The decay rate is the weight by which the current statistics are discounted. Consequently, (1 - decay) is the weight of the new value. Like the Statistics class, there are four methods for modification: push, clear, sum and multiply.

>>> for num in range(10):
...     exp_stats.push(num)
>>> exp_stats.mean()
3.486784400999999
>>> exp_stats.variance()
11.593430921943071
>>> exp_stats.stddev()
3.4049127627507683
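The decay update described above can be sketched in a few lines (an illustration of the recurrence, not the library's code):

```python
def exponential_mean(values, decay=0.9, mean=0.0):
    """Exponentially weighted mean: each push discounts the current
    mean by `decay` and weights the new value by (1 - decay)."""
    for value in values:
        mean = decay * mean + (1.0 - decay) * value
    return mean

result = exponential_mean(range(10))  # approximately 3.4867844
```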

The decay of the exponential statistics can also be changed. The value must be between 0 and 1.

>>> exp_stats.decay
0.9
>>> exp_stats.decay = 0.5
>>> exp_stats.decay
0.5
>>> exp_stats.decay = 10
Traceback (most recent call last):
  ...
ValueError: decay must be between 0 and 1

The clear method optionally accepts a new mean, variance and decay. If none are provided, the mean and variance reset to zero while the decay is unchanged.

>>> exp_stats.clear()
>>> exp_stats.decay
0.5
>>> exp_stats.mean()
0.0
>>> exp_stats.variance()
0.0

ExponentialStatistics objects are combined by adding them together: the means and variances are summed to create the new object. To weight each ExponentialStatistics, multiply it by a constant factor first. When two ExponentialStatistics are added, the decay of the leftmost operand is used for the new object. The built-in len is not supported.

>>> alpha_stats = ExponentialStatistics(iterable=range(10))
>>> beta_stats = ExponentialStatistics(decay=0.1)
>>> for num in range(10):
...     beta_stats.push(num)
>>> exp_stats = beta_stats * 0.5 + alpha_stats * 0.5
>>> exp_stats.decay
0.1
>>> exp_stats.mean()
6.187836645

All internal calculations of the Statistics and Regression classes are based entirely on the C++ code by John Cook, as posted in his articles on accurately computing running variance and computing linear regression in one pass.

The ExponentialStatistics implementation is based on:

  • Finch, 2009, Incremental Calculation of Weighted Mean and Variance

The pure-Python version of RunStats is directly available if preferred.

>>> import runstats.core   # Pure-Python
>>> runstats.core.Statistics
<class 'runstats.core.Statistics'>

When importing from runstats, the Cython-optimized version _core is preferred and the core version is used as a fallback. Micro-benchmarking Statistics and Regression by calling push repeatedly shows the Cython-optimized extension to be 20-40 times faster than the pure-Python implementation.


License

Copyright 2013-2021 Grant Jenks

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

python-runstats's People

Contributors

alanmazankiewicz, florianwilhelm, grantjenks


python-runstats's Issues

Implement len() Function in ExponentialStatistics for Consistency

Hi. I'm currently using the len() function on a Statistics object to hold off on presenting outcomes until a minimum of 5 samples has been recorded. I would like to replicate this functionality with ExponentialStatistics, but I noticed it hasn't been implemented yet. While I realize that self._count might not be necessary for the internal operations of ExponentialStatistics, I believe it would be beneficial to maintain interface consistency across both classes.

conda-forge build?

What about adding runstats also to conda-forge?
I could help to create the build recipe, if you are interested.

The auto-generated grayskull recipe which could be used as
a starting point looks like this:

{% set name = "runstats" %}
{% set version = "2.0.0" %}

package:
  name: {{ name|lower }}
  version: {{ version }}

source:
  url: https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/runstats-{{ version }}.tar.gz
  sha256: 0f9a5e6cc9938bbac3474b17727ffc29fbf5895f33e55ce8843341e0821e77c2

build:
  noarch: python
  script: {{ PYTHON }} -m pip install . -vv --no-deps --no-build-isolation
  number: 0

requirements:
  host:
    - python >=3.6
    - cython
    - tox
    - pip
  run:
    - python >=3.6

test:
  imports:
    - runstats
  commands:
    - pip check
  requires:
    - pip

about:
  home: http://www.grantjenks.com/docs/runstats/
  summary: Compute statistics and regression in one pass
  dev_url: https://github.com/grantjenks/python-runstats
  license: Apache-2.0
  license_file: LICENSE

Improve Cython Support

Add documentation regarding Cython support and speedup.

There's an interesting permutation of install scenarios. I released the package from my MacBook using "sdist bdist_wheel --universal", so the release included both the binary I built locally and the source. On PyPI you'll see the Mac OSX binary alongside the source. So when you install, there are several options:

  • You run "pip install runstats" on a similar Mac and get my binary.
  • You run "pip install runstats" on Windows or Linux and only get the pure-Python version.
  • You run "pip install --no-binary runstats runstats" on any system and it'll try to build the package with Cython and if that fails then fallback to the pure-Python version.

I added tests/benchmark.py to observe the Cython impact. I also added types. Example benchmark output:

$ python benchmark.py 
core.Statistics: 0.0214369297028
fast.Statistics: 0.000623941421509
  Stats Speedup: 33.36x faster
core.Regression: 0.0528829097748
fast.Regression: 0.00235199928284
   Regr Speedup: 21.48x faster

So Statistics objects are 30x faster and Regression objects are 20x faster.

See #2 also.

Support Arbitrary Data Types (like Decimal)

We use Decimal instead of float due to a requirement for exact representation of amounts when using floating point arithmetic. We would like to use this package with Decimal amounts, and potentially other data types as well.

Supporting arbitrary types can be done by converting all hard-coded floats into a custom datatype which can be specified as an optional argument in the Statistics/Regression constructor.

Would this be considered a useful feature? We can submit a pull request which makes the necessary changes. This pull request will not likely include a fast Cython implementation.
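As a sketch of the idea, a Welford-style update with no hard-coded float conversion stays exact for Decimal inputs (illustrative only, not the library's code):

```python
from decimal import Decimal

def welford(values):
    """Type-generic Welford update: no hard-coded float(), so Decimal
    (or fractions.Fraction) inputs keep their exact representation."""
    count, mean, m2 = 0, None, None
    for value in values:
        if count == 0:
            mean = value - value  # zero of the input type
            m2 = value - value
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    return count, mean, m2

count, mean, m2 = welford([Decimal('1'), Decimal('2'), Decimal('3')])
```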

Statistics.stddev crashing with division by 0 error if only one value

Runstat version '1.8.0'
Python: 3.7
Using runstat with cython.

Also, it would be cool to document how to use basic Python for debugging purposes (maybe it's already there, but not in your docs). Here is a pdb session showing the bug:


-> stdevs= self.stat.stddev() if self.want_stdev else []
(Pdb) n
ZeroDivisionError: float division by zero
> /opt/handCraftedUtilityShit/eprof2/eprof/event.py(17)to_kvhf()
-> stdevs= self.stat.stddev() if self.want_stdev else []   
(Pdb) p self.stat.get_state()
(1.0, 74689.0, 0.0, 0.0, 0.0, 74689.0, 74689.0)

Cython broke pickle

In [6]: from runstats.fast import Statistics

In [7]: len(pickle.dumps(Statistics(), protocol=2))
Out[7]: 35

In [8]: pickle.dumps(Statistics(), protocol=2)
Out[8]: '\x80\x02crunstats.fast\nStatistics\nq\x00)\x81q\x01.'

In [9]: stats = Statistics()

In [10]: for val in range(10): stats.push(val)

In [11]: stats.mean()
Out[11]: 4.5

In [12]: pickle.dumps(stats)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-aaf271f92de1> in <module>()
----> 1 pickle.dumps(stats)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in dumps(obj, protocol)
   1378 def dumps(obj, protocol=None):
   1379     file = StringIO()
-> 1380     Pickler(file, protocol).dump(obj)
   1381     return file.getvalue()
   1382 

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in dump(self, obj)
    222         if self.proto >= 2:
    223             self.write(PROTO + chr(self.proto))
--> 224         self.save(obj)
    225         self.write(STOP)
    226 

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.pyc in save(self, obj)
    304             reduce = getattr(obj, "__reduce_ex__", None)
    305             if reduce:
--> 306                 rv = reduce(self.proto)
    307             else:
    308                 reduce = getattr(obj, "__reduce__", None)

/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy_reg.pyc in _reduce_ex(self, proto)
     68     else:
     69         if base is self.__class__:
---> 70             raise TypeError, "can't pickle %s objects" % base.__name__
     71         state = base(self)
     72     args = (self.__class__, base, state)

TypeError: can't pickle Statistics objects

In [13]: pickle.dumps(stats, protocol=2)
Out[13]: '\x80\x02crunstats.fast\nStatistics\nq\x00)\x81q\x01.'

In [14]: pickle.loads(pickle.dumps(stats, protocol=2))
Out[14]: <runstats.fast.Statistics at 0x110c53e40>

Calculating population statistics

runstats only seems to allow the calculation of the sample variance/stdev. Is it possible to also calculate the population variance/stdev? i.e. using a denominator of N rather than N-1

If the functionality doesn't exist, I'm happy to submit a pull request for this. numpy provides a ddof parameter for std()/var() to achieve this; I think something similar could be easily implemented in runstats.
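Until such a parameter exists, the population variance can be recovered from the sample variance and the count (plain arithmetic, not a runstats API):

```python
def population_variance(sample_variance, n):
    """Convert an (n - 1)-denominator sample variance to the
    N-denominator population variance."""
    return sample_variance * (n - 1) / n
```

For example, the seven tutorial values [1, 2, 5, 12, 5, 2, 1] have a sample variance of 92/6, so the population variance is 92/7.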

Unable to open 'fast.pyx' in vscode

When runstats throws an error (for example, trying to get stddev with only 1 item), IDEs like Visual Studio Code look for the .pyx file and, as it is not found (because the package was installed from pip, even with Cython), the IDE throws an error:

Unable to open 'fast.pyx'

This is unfortunate because it hides the original error message.

Suggested fix

  1. Generate the pyx file, or remove the dependency on it when installed from pip.
  2. Return nan instead of raising a divide-by-zero error for stddev and variance.

Division By Zero

Hi, I'm not a math expert, but I'm trying to compute the regression of a series of points and got a division by zero.

Here is an excerpt of the points that make it crash:

>>> r = Regression([(0.6875, 0.7578947368421053), (0.6875, 0.8105263157894737)])
>>> r.slope()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/venv/lib/python2.7/site-packages/runstats.py", line 216, in slope
    return self._sxy / sxx
ZeroDivisionError: float division by zero
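The slope is sxy / sxx, and sxx is zero whenever every x-value coincides; the two points above share x = 0.6875, so the fitted line is vertical and the slope is undefined. A standalone guard might look like this (a sketch, not the library's code):

```python
import math

def safe_slope(points):
    """Least-squares slope that returns nan instead of raising when
    all x-values coincide (sxx == 0, a vertical line)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return math.nan if sxx == 0.0 else sxy / sxx
```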

Remove hashing

Hashing doesn't really make sense. I don't know what I was thinking. Just remove it. For a hash, use get_state instead.

Bogus conversion

Looking at the source code I found something peculiar:

def push(self, value):
        """Add `value` to the Statistics summary."""
        values = float(value)
        self._min = value if self._min is None else min(self._min, value)
        self._max = value if self._max is None else max(self._max, value)
        delta = value - self._eta
        delta_n = delta / (self._count + 1)
        delta_n2 = delta_n * delta_n
        term = delta * delta_n * self._count
        self._count += 1

You convert value to float but save it in values (trailing s!). The rest of the function uses only value. Is this an error?
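The intended code presumably converts once and then uses the float everywhere. A standalone sketch of the corrected update (a hypothetical minimal class, not the library's actual push):

```python
class MiniStats:
    """Hypothetical minimal sketch of the corrected push: convert once,
    then use the converted float everywhere (not the library's class)."""

    def __init__(self):
        self._count = 0
        self._eta = 0.0  # running mean
        self._min = None
        self._max = None

    def push(self, value):
        value = float(value)  # the fix: rebind the converted value
        self._min = value if self._min is None else min(self._min, value)
        self._max = value if self._max is None else max(self._max, value)
        delta = value - self._eta
        self._count += 1
        self._eta += delta / self._count
```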

New release for newer Python versions

Hi,

under Python 3.9 I get the following compile error on pip install "runstats==1.8.0":
runstats/fast.c(9837): error C2039: "tp_print" ist kein Member von "_typeobject" (i.e. "tp_print is not a member of _typeobject").

Under Python 3.8, the installation works. Seems as if in Python 3.9 support of tp_print was removed. It would be very nice if you could make a new release that works under newer Python versions. The current release is from 2019.

Thank you very much in advance.

Kind regards

Boris Wiegand

Build error when using Python 3.7 + Cython

Hi, I am experiencing issues when trying to build runstats 1.7.1 under Python 3.7 using Cython. Besides some warnings about using parentheses (which are also present under Python 3.6), I get the following error:

runstats/fast.c: In function ‘__Pyx_PyCFunction_FastCall’:
runstats/fast.c:10294:5: error: too many arguments to function ‘(struct PyObject * (*)(struct PyObject *, struct PyObject * const*, Py_ssize_t))meth’
     return (*((__Pyx_PyCFunctionFast)meth)) (self, args, nargs, NULL);
     ^

Thank you for your work!
Best, Jannis

Docs: count and push

I think the two questions below should be answered in the documentation:

  1. I'm very confused by the push method. Does this mean runstats keeps all the data in memory (I hope not!)? Why is there no pop? If runstats doesn't keep all data in memory, its interface should not be container-like.
  2. How do you get the count? It turns out you use len. This is good Python standard practice, but it took me a while to figure out because I assumed runstats isn't a container. It would be great to have a count() method instead.

Fix ddof handling in Regression

Various methods in the Regression class appear to handle ddof inflexibly, inconsistently or incorrectly.

  • slope and intercept are hard-coded to use ddof=1
  • correlation uses the default ddof=0 when it calls stddev but then uses ddof=1 internally

Proposed solution:

  • Fix the Regression methods to use ddof correctly, consistently using a passed parameter ddof.

Alternative solution:

  • Fix the Regression methods to use ddof correctly, consistently using 1.0. However, this would be less flexible than a parameter.

Refactor Duplicated Test Code

test_runstats.py and test_runstats_core.py are nearly identical. One imports from runstats.core and the other from runstats. There must be a way to refactor these tests to remove the duplicated source code.

Does this package support "batch" operations?

Hi all,

I'm working on a problem where I need to compute running statistics on scalars that come in "batches." The naive implementation is something like:

s = Statistics()
for batch in batches:
    for scalar in batch:
        s.push(scalar)

where batch is a PyTorch Tensor or a numpy ndarray. Obviously, the performance of this is not good. Does this library support some kind of .update() in the optimized Cython implementation that efficiently goes through the memoryview and does the update calculations?

s = Statistics()
for batch in batches:
    s.update(batch)

Thanks!
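Since runstats supports adding summaries together, one workaround is to summarize each batch separately (possibly with vectorized NumPy) and then merge the summaries. The pairwise combination of two Welford-style summaries can be sketched as follows (Chan et al.'s parallel formula; an illustration, not the library's internals):

```python
def merge_stats(n1, mean1, m2_1, n2, mean2, m2_2):
    """Combine two one-pass summaries (count, mean, sum of squared
    deviations) using Chan et al.'s pairwise formula."""
    n = n1 + n2
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    m2 = m2_1 + m2_2 + delta * delta * n1 * n2 / n
    return n, mean, m2

# Merging the summaries of range(5) and range(5, 10) reproduces the
# summary of range(10): count 10, mean 4.5, m2 82.5.
merged = merge_stats(5, 2.0, 10.0, 5, 7.0, 10.0)
```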

Support for removing elements

Thank you for this library, I love it and I've learned a lot from it ❤️

Do you think it would be possible to support removal of elements? I'm working on a sliding-window problem, and this feature would be terrific!

I've found an implementation for this on https://lingpipe-blog.com/2009/07/07/welford-s-algorithm-delete-online-mean-variance-deviation/. It doesn't support kurtosis and skewness, but maybe it can be extended?


(Obviously, minimum and maximum cannot work in this scenario.)
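The deletion update from the linked post can be sketched for a (count, mean, m2) summary, where m2 is the running sum of squared deviations (an illustration, not library code):

```python
def welford_remove(count, mean, m2, value):
    """Invert a Welford push: remove one previously pushed value from
    a (count, mean, m2) summary. Mean and variance only; minimum,
    maximum, skewness and kurtosis are not recoverable this way."""
    if count <= 1:
        return 0, 0.0, 0.0
    new_count = count - 1
    new_mean = (count * mean - value) / new_count
    m2 -= (value - mean) * (value - new_mean)
    return new_count, new_mean, m2

# Removing 9.0 from the summary of range(10) yields the summary of
# range(9): count 9, mean 4.0, m2 60.0.
count, mean, m2 = welford_remove(10, 4.5, 82.5, 9.0)
```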

Initialize from Numpy array

Would you consider adding a constructor to Statistics that initialises the summary from a Numpy array in Cython, instead of iterating over the argument and pushing elements one by one?
I'm attempting to use this to compute stats on arrays with billions of observations, and initialisation is killing the performance.
Thanks for the great work!

Adopt a CI system like Travis

Currently one unit test, namely test_regression(), is not working, at least on my machine. A continuous integration system like Travis would allow you to automatically check all unit tests after every push to the repo.
That being said, I noticed that the test files test_runstats.py and test_runstats_core.py differ only by a few lines in order to test the cythonized code and the pure Python code. I guess there must be a way to avoid this kind of code duplication, maybe with some smart usage of py.test fixtures, but I am not 100% sure.

Allow weighting of Statistics

Let's say I have two Statistics objects and want to merge them by adding them. This works perfectly right now since __add__ is implemented. But what if I want to weight them before merging? Say Statistics object a has a count of 1000 and I want to treat those events as only 10 before adding it to Statistics object b. It would be really cool to be able to say c = 0.01*a + b.

The implementation of this with the help of __mul__ is quite easy and I will provide a PR if this feature is accepted.
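The tutorial above shows that multiplication was eventually supported. Conceptually, weighting scales a summary's effective count while leaving the mean unchanged; a sketch of what such a __mul__ could plausibly do (an assumption, not the library's actual implementation):

```python
def scale_stats(count, mean, m2, weight):
    """Weight a (count, mean, m2) summary by a scalar: the effective
    count and the sum of squared deviations scale, the mean does not.
    A plausible sketch of __mul__, not the library's implementation."""
    return count * weight, mean, m2 * weight

# Treat 1000 observations as if they were only 10 before merging:
count, mean, m2 = scale_stats(1000, 4.5, 82.5, 0.01)
```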

Improve installation process [pip-extras]

Greetings!
Straight to the point: I have some problems with installation whenever I want to use the cythonised version of this handy package. Since pip doesn't guarantee installation order, it's a little bit tricky to do everything right and keep requirements clean and tidy. I was wondering if you would mind me (or someone else) adding an "extras" section to setup.py, so it will be possible to install Cython before the package binaries only when a specific flag is given.

Pip can already install things correctly based on topological order; we want to use that feature :)
