hepstats's Introduction

scikit-hep: metapackage for Scikit-HEP

Project info

The Scikit-HEP project is a community-driven and community-oriented project with the aim of providing Particle Physics at large with an ecosystem for data analysis in Python embracing all major topics involved in a physicist's work. The project started in Autumn 2016 and its packages are actively developed and maintained.

It is not just about providing core and common tools for the community. It is also about improving the interoperability between HEP tools and the Big Data scientific ecosystem in Python, and about improving the discoverability of utility packages and projects.

As far as the overall structure of the project is concerned, it should be seen as a toolset rather than a toolkit.

Getting in touch

There are various ways to get in touch with project admins and/or users and developers.

scikit-hep package

scikit-hep is a metapackage for the Scikit-HEP project.

Installation

You can install this metapackage from PyPI with `pip`:

python -m pip install scikit-hep

or you can use Conda through conda-forge:

conda install -c conda-forge scikit-hep

All the normal best practices for Python apply; you should work in a virtual environment, etc.

Package version and dependencies

Please check the setup.cfg and requirements.txt files for the list of supported Python versions and for the list of Scikit-HEP project packages and dependencies included, respectively.

For any scikit-hep installation, the following displays the versions of all Scikit-HEP dependent packages actually installed, for example:

>>> import skhep
>>> skhep.show_versions()

System:
    python: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
executable: /srv/conda/envs/notebook/bin/python
   machine: Linux-5.15.0-72-generic-x86_64-with-glibc2.27

Python dependencies:
       pip: 23.1.2
     numpy: 1.24.3
     scipy: 1.10.1
    pandas: 2.0.2
matplotlib: 3.7.1

Scikit-HEP package version and dependencies:
        awkward: 2.2.2
boost_histogram: 1.3.2
  decaylanguage: 0.15.3
       hepstats: 0.6.1
       hepunits: 2.3.2
           hist: 2.6.3
     histoprint: 2.4.0
        iminuit: 2.21.3
         mplhep: 0.3.28
       particle: 0.22.0
          pylhe: 0.6.0
       resample: 1.6.0
          skhep: 2023.06.09
         uproot: 5.0.8
         vector: 1.0.0

Note on the versioning system:

This package uses Calendar Versioning (CalVer).

hepstats's People

Contributors

amanmdesai, dependabot[bot], eduardo-rodrigues, henryiii, jonas-eschle, marinang, pollackscience, pre-commit-ci[bot]

hepstats's Issues

Make ToyGenerator more public

I think it could be worthwhile to make the toy generator more publicly available. It is common to run toys to study the bias of a fit, and encapsulating this could help us implement possible speed-ups (such as parallelization).
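
Below is a minimal, illustrative sketch (not hepstats API) of how independent toy studies could be parallelized once such a generator is public; run_one_toy is a hypothetical user function that generates one pseudo-dataset, fits it, and returns the best-fit value of the parameter of interest.

from concurrent.futures import ProcessPoolExecutor

import numpy as np


def run_one_toy(seed: int) -> float:
    """Generate one toy dataset, fit it, and return the fitted POI value."""
    rng = np.random.default_rng(seed)
    # Placeholder for: sample from the model, build the loss, minimize.
    return rng.normal(loc=40.0, scale=5.0)  # dummy stand-in for a fit result


def run_toy_study(ntoys: int = 1000, workers: int = 4) -> np.ndarray:
    """Run ntoys independent toys in parallel and collect the fitted values."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return np.asarray(list(pool.map(run_one_toy, range(ntoys))))


if __name__ == "__main__":
    fitted = run_toy_study(ntoys=200)
    print(f"toy bias estimate: {fitted.mean() - 40.0:+.3f}")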

sPlot implementation

The sPlot algorithm could be a good fit for hepstats. There is already a working implementation in hep_ml.
I am, however, not sure about the freedom of implementation regarding licenses, etc.
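
For reference, this is not the hep_ml or hepstats implementation, just a minimal numpy sketch of the textbook sWeights formula to illustrate what such a feature would compute; pdfs and yields are illustrative names.

import numpy as np


def sweights(pdfs: np.ndarray, yields: np.ndarray) -> np.ndarray:
    """Return an (n_events, n_species) array of sWeights.

    pdfs: normalized per-species pdf values on the discriminating variable,
          shape (n_events, n_species); yields: fitted yields, shape (n_species,).
    """
    denom = pdfs @ yields                            # total pdf per event: sum_k N_k f_k(x)
    inv_V = (pdfs / denom[:, None] ** 2).T @ pdfs    # V^-1_ij = sum_e f_i f_j / denom^2
    V = np.linalg.inv(inv_V)
    return (pdfs @ V) / denom[:, None]               # w_s(x) = sum_j V_sj f_j(x) / denom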

Unexpected behaviour in `counting.ipynb` notebook

Description:

When running the notebook counting.ipynb with python==3.11, hepstats==0.7.1, and zfit==0.16.0, the calculated significance of the asymptotic calculator is reported as 0.0 with a p-value of 0.5. In addition, the rest of the notebook seems to run into a similar issue, namely that the evaluated p-value is 0.5 independent of the input.

Also, I found that the tests of the discovery calculator do not check that the significance is > 0, only that it is < 2, which might be why this slipped under the radar.

Expected behavior:

The significance of the discovery should be 1.6, as pointed out in the markdown.

Actual Behavior:

The significance is reported as 0.0.

Minimal working example (based on the notebook):

import numpy as np
import zfit
from zfit.loss import UnbinnedNLL
from zfit.minimize import Minuit
from hepstats.hypotests import Discovery, UpperLimit
from hepstats.hypotests.calculators import (AsymptoticCalculator,
                                            FrequentistCalculator)
from hepstats.hypotests.parameters import POI, POIarray

Nsig = zfit.Parameter("Nsig", 40, -100., 100)
Nbkg = zfit.Parameter("Nbkg", 340, 0, 500)
Nobs = zfit.ComposedParameter("Nobs", lambda a, b: a + b, params=[Nsig, Nbkg])

from collections import OrderedDict
import tensorflow_probability as tfp
from zfit.models.dist_tfp import WrapDistribution
from zfit.util import ztyping

class Poisson(WrapDistribution):
    _N_OBS = 1

    def __init__(self,
                 lamb: ztyping.ParamTypeInput,
                 obs: ztyping.ObsTypeInput,
                 name: str = "Poisson"):
        """
        Poisson distribution, parametrized with an event parameter (lamb).
        """
        (lamb,) = self._check_input_params(lamb)
        params = OrderedDict((('lamb', lamb),))
        dist_params = lambda: dict(rate=lamb.value())
        distribution = tfp.distributions.Poisson
        super().__init__(distribution=distribution, dist_params=dist_params,
                         obs=obs, params=params, name=name)
        
obs = zfit.Space('N', limits=(0, 800))
model = Poisson(obs=obs, lamb=Nobs) #zfit.pdf.Gauss(obs=obs, mu=Nobs, sigma=sigma)

n = 370
nbkg = 340

data = zfit.data.Data.from_numpy(obs=obs, array=np.array([n]))
Nbkg.set_value(nbkg)
Nbkg.floating = False

nll = UnbinnedNLL(model=model, data=data)
minimizer = Minuit(verbosity=0)
minimum = minimizer.minimize(loss=nll)

calculator = AsymptoticCalculator(nll, minimizer)
calculator.bestfit = minimum

discovery_test = Discovery(calculator, POI(Nsig, 0))
pnull, significance = discovery_test.result()

The output of this is then:

p_value for the Null hypothesis = 0.5
Significance (in units of sigma) = 0.0

Possible Solution:

It seems that changing fmin to fminfull in basecalculator.py:105 fixes the issue.

When to support Python 3.11?

Hello @jonas-eschle, I guess this largely depends in practice on when zfit will be able to support 3.11. Can you comment?

I believe that as soon as this is sorted out, I will be able to support 3.11 in scikit-hep (checking as we speak).

Many thanks.

Frequentist calculator does not work when loss has constraints.

The upper-limit calculation crashes when trying to use the FrequentistCalculator with a loss that has constraints. To reproduce the issue, I modified your example notebook (https://github.com/scikit-hep/hepstats/blob/master/notebooks/hypotests/upperlimit_freq_zfit.ipynb) by adding a dummy constraint.

import matplotlib.pyplot as plt
import numpy as np
import zfit

from zfit.loss import ExtendedUnbinnedNLL
from zfit.minimize import Minuit

from hepstats.hypotests import UpperLimit
from hepstats.hypotests.calculators import FrequentistCalculator
from hepstats.hypotests.parameters import POI, POIarray

import tensorflow as tf

bounds = (0.1, 3.0)

# Data and signal

np.random.seed(0)
tau = -2.0
beta = -1/tau
data = np.random.exponential(beta, 300)
peak = np.random.normal(1.2, 0.1, 10)
data = np.concatenate((data,peak))
data = data[(data > bounds[0]) & (data < bounds[1])]

obs = zfit.Space('x', limits=bounds)

lambda_ = zfit.Parameter("lambda",-2.0, -4.0, -1.0)
Nsig = zfit.Parameter("Nsig", 1., -20., len(data))
Nbkg = zfit.Parameter("Nbkg", len(data), 0., len(data)*1.1)

signal = zfit.pdf.Gauss(obs=obs, mu=1.2, sigma=0.1).create_extended(Nsig)
background = zfit.pdf.Exponential(obs=obs, lambda_=lambda_).create_extended(Nbkg)
tot_model = zfit.pdf.SumPDF([signal, background])

# Create the negative log likelihood
data_ = zfit.data.Data.from_numpy(obs=obs, array=data)

# Add a dummy constraint on the background yield
constraint = zfit.constraint.GaussianConstraint(Nbkg, observation=251.57, uncertainty=100)
nll = ExtendedUnbinnedNLL(model=tot_model, data=data_, constraints=[constraint])


# Instantiate a minuit minimizer
minimizer = Minuit()


# minimisation of the loss function
minimum = minimizer.minimize(loss=nll)
minimum.hesse()
print(minimum)




# instantiation of the calculator
calculator = FrequentistCalculator(nll, minimizer, ntoysnull=5000, ntoysalt=5000)
# calculator = FrequentistCalculator.from_yaml("toys/upperlimit_freq_zfit_toys.yml", nll, minimizer, ntoysnull=5000, ntoysalt=5000)
calculator.bestfit = minimum  # optional

# parameter of interest of the null hypothesis
poinull = POIarray(Nsig, np.linspace(0.0, 25, 15))
# parameter of interest of the alternative hypothesis
poialt = POI(Nsig, 0.)


# instantiation of the upper limit test
ul = UpperLimit(calculator, poinull, poialt)

ul.upperlimit(alpha=0.05, CLs=True);

And the traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[31], line 1
----> 1 ul.upperlimit(alpha=0.05, CLs=True);

File /srv/conda/envs/notebook/lib/python3.10/site-packages/hepstats/hypotests/core/upperlimit.py:159, in UpperLimit.upperlimit(self, alpha, CLs, printlevel)
    153 to_interpolate = [observed_key] + [
    154     f"expected{i}" for i in ["", "_p1", "_m1", "_p2", "_m2"]
    155 ]
    157 limits: dict = {}
--> 159 all_pvalues = self.pvalues(CLs)
    160 for k in to_interpolate:
    161     pvalues = all_pvalues[k]

File /srv/conda/envs/notebook/lib/python3.10/site-packages/hepstats/hypotests/core/upperlimit.py:96, in UpperLimit.pvalues(self, CLs)
     83 """
     84 Returns p-values scanned for the values of the parameters of interest
     85 in the null hypothesis.
   (...)
     92     Dictionary of p-values for CLsb, CLs, expected (+/- sigma bands).
     93 """
     94 pvalue_func = self.calculator.pvalue
---> 96 pnull, palt = pvalue_func(
     97     poinull=self.poinull, poialt=self.poialt, qtilde=self.qtilde, onesided=True
     98 )
    100 pvalues = {"clsb": pnull, "clb": palt}
    102 sigmas = [0.0, 1.0, 2.0, -1.0, -2.0]

File /srv/conda/envs/notebook/lib/python3.10/site-packages/hepstats/hypotests/calculators/basecalculator.py:153, in BaseCalculator.pvalue(self, poinull, poialt, qtilde, onesided, onesideddiscovery)
    150     self.check_pois(poialt)
    151     self.check_pois_compatibility(poinull, poialt)
--> 153 return self._pvalue_(
    154     poinull=poinull,
    155     poialt=poialt,
    156     qtilde=qtilde,
    157     onesided=onesided,
    158     onesideddiscovery=onesideddiscovery,
    159 )

File /srv/conda/envs/notebook/lib/python3.10/site-packages/hepstats/hypotests/calculators/frequentist_calculator.py:185, in FrequentistCalculator._pvalue_(self, poinull, poialt, qtilde, onesided, onesideddiscovery)
    182     p = len(qdist[qdist >= qobs]) / len(qdist)
    183     return p
--> 185 qnulldist = self.qnull(
    186     poinull=poinull,
    187     poialt=poialt,
    188     onesided=onesided,
    189     onesideddiscovery=onesideddiscovery,
    190     qtilde=qtilde,
    191 )
    192 pnull = np.empty(len(poinull))
    193 for i, p in enumerate(poinull):

File /srv/conda/envs/notebook/lib/python3.10/site-packages/hepstats/hypotests/calculators/frequentist_calculator.py:86, in FrequentistCalculator.qnull(self, poinull, poialt, onesided, onesideddiscovery, qtilde)
     59 def qnull(
     60     self,
     61     poinull: POI | POIarray,
   (...)
     65     qtilde: bool = False,
     66 ):
     67     """Computes null hypothesis values of the :math:`\\Delta` log-likelihood test statistic.
     68 
     69     Args:
   (...)
     84         >>> q = calc.qnull(poinull, poialt)
     85     """
---> 86     toysresults = self.get_toys_null(poinull, poialt, qtilde)
     87     ret = {}
     89     for p in poinull:

File /srv/conda/envs/notebook/lib/python3.10/site-packages/hepstats/hypotests/calculators/basecalculator.py:513, in ToysCalculator.get_toys_null(self, poigen, poieval, qtilde)
    492 def get_toys_null(
    493     self,
    494     poigen: POI | POIarray,
    495     poieval: POI | POIarray | None = None,
    496     qtilde: bool = False,
    497 ) -> dict[POI, ToyResult]:
    498     """
    499     Return the generated toys for the null hypothesis.
    500 
   (...)
    511         ...     calc.get_toys_alt(p, poieval=poialt)
    512     """
--> 513     return self._get_toys(
    514         poigen=poigen, poieval=poieval, qtilde=qtilde, hypothesis="null"
    515     )

File /srv/conda/envs/notebook/lib/python3.10/site-packages/hepstats/hypotests/calculators/basecalculator.py:486, in ToysCalculator._get_toys(self, poigen, poieval, qtilde, hypothesis)
    483     if ntogen > 0:
    484         print(f"Generating {hypothesis} hypothesis toys for {p}.")
--> 486         self.generate_and_fit_toys(ntoys=ntogen, poigen=p, poieval=poieval_p)
    488     ret[p] = self.get_toyresult(p, poieval_p)
    490 return ret

File /srv/conda/envs/notebook/lib/python3.10/site-packages/hepstats/hypotests/toyutils.py:286, in ToysManager.generate_and_fit_toys(self, ntoys, poigen, poieval)
    283     param_dict = next(samples)
    285 with ExitStack() as stack:
--> 286     for param, value in param_dict:
    287         stack.enter_context(param.set_value(value))
    289     for minimize_trial in range(2):

File /srv/conda/envs/notebook/lib/python3.10/site-packages/tensorflow/python/ops/variables.py:1123, in Variable.__iter__(self)
   1121 def __iter__(self):
   1122   """When executing eagerly, iterates over the value of the variable."""
-> 1123   return iter(self.read_value())

File /srv/conda/envs/notebook/lib/python3.10/site-packages/tensorflow/python/framework/ops.py:583, in Tensor.__iter__(self)
    581   raise TypeError("Cannot iterate over a tensor with unknown shape.")
    582 if not shape:
--> 583   raise TypeError("Cannot iterate over a scalar tensor.")
    584 if shape[0] is None:
    585   raise TypeError(
    586       "Cannot iterate over a tensor with unknown first dimension.")

TypeError: Cannot iterate over a scalar tensor.

readthedoc docs

Currently, the docs are published on GitHub Pages. Should we also activate them again on ReadTheDocs, to have older versions as well? It currently fails since requirements.txt is gone, but we could tell ReadTheDocs to install the package instead, which should be sufficient.

Btw @marinang, I can help with that, and you could add me as a maintainer on RTD, feel free to.

Naming of package

Hey, we've discussed a few things at CHEP, including the naming of packages. Maybe the name scikit-stats implies too broad a scope, since it is not a general scikit statistics package, but rather something like hep-stats?

Or what is your take on the name?

(don't remember all the people I discussed that with, so can't tag anyone, maybe @eduardo-rodrigues ?)

Dumping of statistical results

After a chat yesterday at CERN with @mayou36 and Carsten, a RooFit developer, we concluded that storing the results of hepstats is crucial, as they can be reused, for plotting for instance, without re-running the CPU-intensive calculations done in hepstats.

The idea is to store in a dictionary:

  • best fit result
  • likelihood scan
  • significances
  • upper limits / intervals
  • null and alternative hypothesis
  • parameter ranking (not implemented yet)

This dictionary can easily be serialised into a YAML or JSON file.
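
A minimal sketch (not an existing hepstats API) of what dumping such a dictionary could look like; all keys and numbers below are illustrative placeholders only.

import json

results = {
    "best_fit": {"Nsig": 15.2, "Nbkg": 340.0},                       # best fit result
    "likelihood_scan": {"Nsig": [0, 5, 10, 15, 20],
                        "nll": [4.1, 2.3, 0.9, 0.0, 1.2]},           # likelihood scan
    "significance": 1.6,                                             # significances
    "upper_limit": {"alpha": 0.05, "CLs": True, "observed": 10.4},   # upper limits / intervals
    "hypotheses": {"null": {"Nsig": 0.0}, "alt": {"Nsig": 15.2}},    # null and alternative hypothesis
}

with open("hepstats_results.json", "w") as f:
    json.dump(results, f, indent=2)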

Tolerance for zfit model in compute_sweights

Hi,

I have come across a strange issue when trying to use compute_sweights with a model from zfit.

This is a snippet of the error I see:

name        value    at limit
------  ---------  ----------
N_s82       619.8       False
N_b82         561       False
lamb82  -0.002628       False
Converged:  True
Valid:  True
Traceback (most recent call last):
  File "GBReweighting.py", line 282, in <module>
    folder, foldernum, mc_tos, mc_tis, data_tos, data_tis, sWeights_tos, sWeights_tis = get_data()
  File "GBReweighting.py", line 67, in get_data
    sWeights_tos = get_sWeights(run_period, 'tos', 'ref', data = data_tos, refit = True, apply_NN = False)
  File "/afs/cern.ch/work/s/sbhasin/Analysis/b2oc-AmAn_B02D0D0Kpi_Run12/utils/zfit_sWeights.py", line 136, in get_sWeights
    sweights = re_fit(tuple_path, global_vars, foldernum, trig, sig_or_ref, data, apply_NN)
  File "/afs/cern.ch/work/s/sbhasin/Analysis/b2oc-AmAn_B02D0D0Kpi_Run12/utils/zfit_sWeights.py", line 117, in re_fit
    weights = compute_sweights(model, mass)
  File "/afs/cern.ch/user/s/sbhasin/public/b02d0d0kpi_AmAn_virtualenv/lib/python3.6/site-packages/hepstats/splot/sweights.py", line 107, in compute_sweights
    "The model needs to fitted to input data in order to comput the sWeights."
hepstats.splot.exceptions.ModelNotFittedToData: The model needs to fitted to input data in order to comput the sWeights.

Clearly the model was fitted to the data, as the results are printed above, and the error only comes up for certain data and not for others (different run periods and trigger requirements).

I tried recreating the bit of hepstats code which causes the error, and I get, for example:

models = model.get_models()                      # component pdfs of the SumPDF
p = np.vstack([m.pdf(mass) for m in models]).T   # per-component pdf values, shape (n_events, n_components)
Nx = model.ext_pdf(mass)                         # extended pdf of the full model
pN = p / Nx[:, None]
print(np.array(pN).sum(axis=0))                  # each sum is expected to be close to 1 within the tolerance

which returns
[1.00044341 1.00166534]

So it looks like I'm just over the tolerance in this case; do you know why this would be happening?
FYI, the model is a double Crystal Ball plus an Exponential.

Thanks in advance!

Implementation of statistical hypothesis tests.

I have developed hypothesis tests like those in RooStats in lauztat (https://github.com/marinang/lauztat), so the idea would be to bring them here. However, it depends on zfit; is that ok?

The structure is like this:

  • tests: discovery, upper limit, confidence interval
  • calculators: asymptotic, frequentist
  • configs: takes model/pdfs, data, loss builder, a minimiser (and optionally sampling methods)

The calculator is fed by the config, and the test is fed by the calculator, very similar to RooStats; see an example here: https://nbviewer.jupyter.org/github/marinang/lauztat/blob/master/examples/notebooks/discovery_zfit_freq.ipynb. Maybe it's not ideal, and now is the right time to change it.

Finally, where should we put it? Under a test/hypotest submodule?
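
For context, the structure described above is roughly what the hepstats.hypotests chain ended up looking like; the following is an illustrative, self-contained sketch with a dummy model and dummy data (names mirror the other examples on this page), not the original lauztat design.

import numpy as np
import zfit
from zfit.loss import ExtendedUnbinnedNLL
from zfit.minimize import Minuit
from hepstats.hypotests import Discovery
from hepstats.hypotests.calculators import AsymptoticCalculator
from hepstats.hypotests.parameters import POI

# Toy model: Gaussian signal over an exponential background.
obs = zfit.Space("x", limits=(0.1, 3.0))
Nsig = zfit.Parameter("Nsig", 20.0, -20.0, 200.0)
Nbkg = zfit.Parameter("Nbkg", 250.0, 0.0, 500.0)
lambda_ = zfit.Parameter("lambda", -2.0, -4.0, -1.0)
signal = zfit.pdf.Gauss(obs=obs, mu=1.2, sigma=0.1).create_extended(Nsig)
background = zfit.pdf.Exponential(obs=obs, lambda_=lambda_).create_extended(Nbkg)
model = zfit.pdf.SumPDF([signal, background])

# Dummy data restricted to the observable range.
raw = np.random.exponential(0.5, 300)
data = zfit.data.Data.from_numpy(obs=obs, array=raw[(raw > 0.1) & (raw < 3.0)])

# loss + minimizer feed the calculator, the calculator feeds the test.
nll = ExtendedUnbinnedNLL(model=model, data=data)
minimizer = Minuit()
calculator = AsymptoticCalculator(nll, minimizer)
discovery = Discovery(calculator, POI(Nsig, 0))
pnull, significance = discovery.result()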

[Security] Workflow main.yml is using vulnerable action actions/checkout

The workflow main.yml references the action actions/checkout at reference v1. However, this reference is missing the commit a6747255bd19d7a757dbdda8c654a9f84db19839, which may contain a fix for a vulnerability.
The fix missing from this action version could be related to:
(1) a CVE fix
(2) an upgrade of a vulnerable dependency
(3) a fix for a secret leak, among others.
Please consider updating the reference to the action.

plotting

Currently, AFAIK, the plotting is not public and only helper functions are provided. However, it could be useful to expose them somehow (based on the output). Maybe it is worth integrating them into mplhep at some point?

Python 3.5 support

@marinang, is it possible to have Python 3.5 support, for now at least? I'm asking with the preparation of the scikit-hep metapackage in mind, given that all packages it depends on do support 3.5. It would be nice to add scikit-stats to the gang … Thanks.

[Security] Workflow main.yml is using vulnerable action pre-commit/action

The workflow main.yml references the action pre-commit/action at reference v2.0.2. However, this reference is missing the commit 80db042ff08cdddbbbfb6f89c06a6bfc4dddf0b7, which may contain a fix for a vulnerability.
The fix missing from this action version could be related to:
(1) a CVE fix
(2) an upgrade of a vulnerable dependency
(3) a fix for a secret leak, among others.
Please consider updating the reference to the action.

Adding 'Energy Test' two-sample test

Following discussions with @chrisburr and others, we propose adding the Energy Test to hepstats. This is a two-sample hypothesis test currently used at LHCb to observe local CP violation in 3- and 4-body decays ([1], [2], [3]).

Chris' suggestion was to use the existing C++ implementation (see here, although some minor extra features have been added in other versions since this one) and to add Python bindings for it, along with a Python executable for command-line usage. There is also a Python implementation, although it's slower and may not be preferable.
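
As a rough reference, below is a small numpy/scipy sketch of the two-sample energy test statistic with a Gaussian kernel psi(d) = exp(-d^2 / (2 sigma^2)); normalization conventions vary slightly between implementations, so treat it as illustrative rather than a drop-in replacement for the C++ code.

import numpy as np
from scipy.spatial.distance import cdist, pdist


def energy_test_statistic(sample1: np.ndarray, sample2: np.ndarray, sigma: float = 0.2) -> float:
    """Return T for two samples of shape (n_events, n_dimensions)."""
    def psi(d):
        return np.exp(-d**2 / (2 * sigma**2))

    n1, n2 = len(sample1), len(sample2)
    t11 = psi(pdist(sample1)).sum() / (n1 * (n1 - 1))      # same-sample pairs, sample 1
    t22 = psi(pdist(sample2)).sum() / (n2 * (n2 - 1))      # same-sample pairs, sample 2
    t12 = psi(cdist(sample1, sample2)).sum() / (n1 * n2)   # cross-sample pairs
    return t11 + t22 - t12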

Save result of pseudo-experiments/toys

In #14 the frequentist calculator is introduced, which uses toys to build test-statistic distributions, for instance to compute p-values.

As the generation of toys + fitting + scanning is quite CPU intensive and takes time, an improvement would be to store the results of the pseudo-experiments, for instance in an HDF5 file.
This is currently an issue for the tests, which have time limits. For each pseudo-experiment, what needs to be stored is:

  • the values of POI at generation
  • the best fitted values of the POI
  • the likelihood evaluated at the best fit POI
  • the likelihood evaluated at the POI points you want to scan (for upper limits or confidence intervals)

The stored results can be reused in hepstats.hypotests with the frequentist calculator without regenerating pseudo-experiments. This would also give users the possibility to generate pseudo-experiments outside of hepstats.

A first design I have in mind is to create a class called ToyManager, for instance, which collects and saves the results of the pseudo-experiments and can reopen them.
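
As a very rough sketch (not the eventual hepstats design), the four quantities listed above could be written per toy into an HDF5 file and read back later; the dataset names and shapes here are purely illustrative.

import h5py
import numpy as np

ntoys = 1000
toys = {
    "poi_gen": np.full(ntoys, 0.0),        # value of the POI at generation
    "poi_bestfit": np.zeros(ntoys),        # best fitted value of the POI
    "nll_bestfit": np.zeros(ntoys),        # likelihood at the best-fit POI
    "nll_scan": np.zeros((ntoys, 15)),     # likelihood at the scanned POI points
}

with h5py.File("toys_null.h5", "w") as f:
    for name, values in toys.items():
        f.create_dataset(name, data=values)

# Later, the stored toys can be read back and fed to the frequentist calculator:
with h5py.File("toys_null.h5", "r") as f:
    poi_bestfit = f["poi_bestfit"][:]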

What do you think @eduardo-rodrigues @mayou36 @HDembinski ?

Asymptotic calculator histogram dimensionality

When generating the Asimov histogram associated with a particular test, why is the generated histogram limited to a one-dimensional fit, implicitly through the call to space.limit1d here? This seems to exclude the possibility of hypothesis tests involving a likelihood function evaluated with a PDF of dimensionality greater than 1, which, in my naive understanding, is a desirable use case for this calculator.

Thanks for all your work!

New release

Hi @marinang, as mentioned, it would be good to have a new release. Is this just done by adding the tag and making a GitHub release?
0.4.0?
