
aghast's Introduction

scikit-hep: metapackage for Scikit-HEP


Project info

The Scikit-HEP project is a community-driven and community-oriented project with the aim of providing Particle Physics at large with an ecosystem for data analysis in Python embracing all major topics involved in a physicist's work. The project started in Autumn 2016 and its packages are actively developed and maintained.

It is not just about providing core and common tools for the community. It is also about improving the interoperability between HEP tools and the Big Data scientific ecosystem in Python, and about improving the discoverability of utility packages and projects.

As far as the overall structure of the project is concerned, it should be seen as a toolset rather than a toolkit.

Getting in touch

There are various ways to get in touch with project admins and/or users and developers.

scikit-hep package

scikit-hep is a metapackage for the Scikit-HEP project.

Installation

You can install this metapackage from PyPI with `pip`:

python -m pip install scikit-hep

or you can use Conda through conda-forge:

conda install -c conda-forge scikit-hep

All the normal best-practices for Python apply; you should be in a virtual environment, etc.

Package version and dependencies

Please check setup.cfg for the list of supported Python versions and requirements.txt for the list of Scikit-HEP project packages and dependencies included.

For any installed scikit-hep metapackage, the following displays the actual versions of all Scikit-HEP packages installed, for example:

>>> import skhep
>>> skhep.show_versions()

System:
    python: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
executable: /srv/conda/envs/notebook/bin/python
   machine: Linux-5.15.0-72-generic-x86_64-with-glibc2.27

Python dependencies:
       pip: 23.1.2
     numpy: 1.24.3
     scipy: 1.10.1
    pandas: 2.0.2
matplotlib: 3.7.1

Scikit-HEP package version and dependencies:
        awkward: 2.2.2
boost_histogram: 1.3.2
  decaylanguage: 0.15.3
       hepstats: 0.6.1
       hepunits: 2.3.2
           hist: 2.6.3
     histoprint: 2.4.0
        iminuit: 2.21.3
         mplhep: 0.3.28
       particle: 0.22.0
          pylhe: 0.6.0
       resample: 1.6.0
          skhep: 2023.06.09
         uproot: 5.0.8
         vector: 1.0.0

Note on the versioning system:

This package uses Calendar Versioning (CalVer).

aghast's People

Contributors

benkrikler, chrisburr, hdembinski, henryiii, jpivarski, kreczko, lovelybuggies, matthewfeickert, nsmith-


aghast's Issues

Stability of histogram types

Hi all, thanks a lot for this repo!

What is the status of the stability of the binning types? We are interested in using them as the binning definitions in zfit. Or is there a subset that you would consider "stable", especially the simpler ones?

Adding boost-histogram to tutorial

Is it possible to add boost-histogram to the tutorial on how to use Aghast? People are asking me how to convert ROOT histograms to boost-histograms and it would be nice to point them to the tutorial.
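In the meantime, here is a minimal sketch of one possible route through NumPy, assuming aghast's from_root and to_numpy converters and boost-histogram's array assignment (a sketch, not an official recipe):

import aghast
import boost_histogram as bh

# ROOT -> ghast -> (counts, edges) -> boost-histogram
ghast = aghast.from_root(root_hist)    # root_hist: an existing ROOT TH1
counts, edges = aghast.to_numpy(ghast)
h = bh.Histogram(bh.axis.Variable(edges))
h[...] = counts                        # fill the bin contents from the array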

How to import ROOT

I tried `pip install ROOT` and cannot find a ROOT wheel. How can I install ROOT's Python bindings?

Option for `+` behavior

  • "fast": ignore binning metadata and just add counts arrays naively
  • "adjust": the current behavior and future default; moves bin positions or creates bins (with zero content) to make binnings compatible, then add
  • "strict": raise an exception (ValueError) if binnings disagree (in a way that would make "fast" invalid).

(At the suggestion of @benkrikler.)
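For concreteness, a toy sketch of the three modes for a 1-d histogram (a hypothetical helper, not the aghast API):

import numpy as np

def add_counts(a, b, a_edges, b_edges, mode="adjust"):
    # Toy illustration of the proposed '+' modes (names from the list above).
    if mode == "fast":
        return a + b  # ignore binning metadata; trusts that binnings match
    if np.array_equal(a_edges, b_edges):
        return a + b  # binnings agree, nothing to adjust
    if mode == "strict":
        raise ValueError("binnings disagree")
    # mode == "adjust": align the edges, padding missing bins with zero
    # content before adding; the edge-merging logic is omitted in this sketch.
    raise NotImplementedError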

Pre-commit huge list in 'test_root.py'

We should add the pre-commit check to this repo's GitHub Actions, just like boost-histogram and hist do. It would help enforce a consistent coding style.

Update: solved by #35.

Support for non-rectangular binning?

Particularly for CategoryBinning axes, it would be nice to save a dense binning only for the tuples of categories corresponding to valid bins, rather than the entire product of category values per axis. An illustration follows below.
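For illustration, a sparse layout might store only the valid category tuples next to a correspondingly dense counts array (a hypothetical structure, not current aghast):

import numpy as np

# The full product of ("e", "mu") x ("barrel", "endcap") would need 4 bins;
# a sparse layout keeps only the tuples that actually occur.
valid_bins = [("e", "barrel"), ("mu", "endcap")]
counts = np.array([42, 17])  # one entry per valid tuple, in the same order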

aghast in C++?

A commonly requested feature for Boost.Histogram in C++ is to convert from and to ROOT histograms. In Python, we can do that already now with aghast, but not from C++. Calling into a Python library from C++ is both ghastly and awkward, so I wonder whether we could port Aghast to C++ to support this.

I also like the flatbuffer serialization very much, this could be the "native" format for Boost.Histogram (C++) and boost-histogram (Python) to talk to each other. Right now, the C++ code has only a generic serialization protocol in place that works with Boost.Serialization and compatible libraries like Cereal. The Python library uses this infrastructure to break a Boost.Histogram object into Python primitives, which are then serialized in the usual way. This produces a Python-specific format that is clunky to read from another language. Another commonly requested feature is to be able to dump a histogram in C++ to disk and read it again from Python. This is currently not supported, since the two libraries do not have a common serialization format. The flatbuffers could fill that role.
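For instance, the Python side can already dump the flatbuffers payload to a file, which a future C++ reader could parse directly. A sketch, assuming aghast's from_numpy converter and the tobuffer() method mentioned in the deserialization issue below:

import numpy as np
import aghast

# Build a small ghast and write its flatbuffers payload to disk.
ghast = aghast.from_numpy(np.histogram(np.random.normal(size=1000), bins=10))
with open("hist.aghast", "wb") as f:
    f.write(ghast.tobuffer())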

KeyError: 'sumw2'

@jpivarski There might be a bug: when I run the code

root_hist = aghast.to_root(ghastly_hist, "root_hist")

A KeyError is thrown, saying no key "sumw2" is found. I searched for "sumw2" in this file and found that this is the only occurrence:

sumw = obj.counts[tuple(slc)]
if isinstance(sumw, dict):
    sumw, sumw2 = sumw["sumw"], sumw["sumw2"]
    sumw2 = numpy.array(sumw2, dtype=numpy.float64, copy=False)
else:
    sumw2 = None

P.S. I am using pip install aghast, so the package I'm using might be a little out of date. But I checked the latest code and the latest PR #24; there are no changes with respect to to_root() in this file.
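A defensive variant of the snippet above, assuming a dict-valued counts payload may legitimately lack "sumw2" (a sketch of a possible fix, not the repo's):

sumw = obj.counts[tuple(slc)]  # obj, slc as in the snippet above
if isinstance(sumw, dict):
    sumw2 = sumw.get("sumw2")  # None instead of KeyError when missing
    sumw = sumw["sumw"]
    if sumw2 is not None:
        sumw2 = numpy.array(sumw2, dtype=numpy.float64, copy=False)
else:
    sumw2 = None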

Binomial error or Poisson error

It says in the docs:

"The error_method does not specify how the histograms or functions were filled, but how the fraction should be interpreted statistically. It may be unspecified, leaving that interpretation unspecified. The normal method (sometimes called “Wald”) is a naive binomial interpretation, in which zero passing or zero failing values are taken to have zero uncertainty. The clopper_pearson method (sometimes called “exact”) is a common choice, though it fails in some statistical criteria. The computation and meaning of the methods are described in the references below."

If Aghast is only about exchanging data, is it necessary to specify the error_method? This is not part of the data, but part of the interpretation. You follow here the assumption that histogram counts should be modeled as binomial proportions, but in HEP that usually does not make sense. We use the Poisson model not only because it is easier, but because it is the better approximation of reality.

Example: you want to measure the differential cross-section of some process $p + p \to X$ as a function of a kinematic variable, say $\eta$. You count how often a particle X falls into an $\eta$ bin; call that $\Delta N$. The differential cross-section then is

$$\frac{d\sigma}{d\eta} = \frac{1}{L} \frac{\Delta N}{\Delta\eta}$$

where $L$ is the luminosity. I use "=" here, although strictly speaking the equality requires the bin width to go to zero.

You care about the uncertainty of $d\sigma/d\eta$. You don't know the total sum of all $\Delta N$ here, because you have not observed all X, only those in the acceptance of your detector, so you cannot compute the binomial proportion error. The measurement also doesn't depend on the fraction of events in each bin, but on the absolute number, which you divide by $L$, not by the sum over $\Delta N$. The Poisson model is more correct here.

It is not exactly correct either, because the bins are not really independent for a physics reason: you observe K particles per event, and these K particles are not generated independently, they form jets and have some correlations. We usually neglect these correlations. They are not correctly represented by either Poisson or binomial proportion errors.
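As a toy numerical illustration of the Poisson model (all numbers invented), the per-bin uncertainty is just $\sqrt{\Delta N}$ propagated through the prefactors:

import numpy as np

counts = np.array([12, 30, 7])  # toy Delta N per eta bin
lumi = 100.0                    # toy integrated luminosity L
width = 0.5                     # toy bin width Delta eta

xsec = counts / (lumi * width)               # d(sigma)/d(eta) per bin
xsec_err = np.sqrt(counts) / (lumi * width)  # Poisson: sigma(N) = sqrt(N)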

C++ layer under objects

This would replace the _fromflatbuffers pure-Python infrastructure with an extension module that uses C++ Flatbuffers generated code, bound to Python with pybind11. Objects made in Python and the _toflatbuffers methods would not be affected, but properties providing data from a buffer would not cache in Python, but deliver from the fast C++ every time.

Having a C++ layer would also open the door to fast adding (in "fast", "strict", and maybe even "adjust" modes). Not all functionality would be ported to the C++ layer, only the parts that would benefit the most from this optimization.

The base C++ classes should be separately defined (header-only?), so that they can be reused in other C++ projects.
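On the Python side, the non-caching property pattern might look like this (flat_axis stands in for a hypothetical generated C++ accessor; none of these names are real aghast API):

class Axis:
    # Thin wrapper over a generated flatbuffers accessor living in C++.
    def __init__(self, flat_axis):
        self._flat = flat_axis  # accessor over the raw buffer

    @property
    def title(self):
        # No Python-side cache: every access reads through the C++ layer.
        return self._flat.Title()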

Conversion from_root fails for non-uniform binning

I am having issues converting a TH1D into an aghast histogram.
Since I am working with rootpy histograms, the example below uses 2D histograms and ProjectionX() to return the TH1D. The example includes the patch in #22 (otherwise it would stop there).

Code example

import aghast
import awkward
from rootpy.plotting import Hist2D
import numpy as np
import ROOT


def test_fill():
    pileup_bins = [0, 10, 15, 20, 30, 999]
    jet_pt_bins = [35, 90, 120]
    hist = Hist2D(pileup_bins, jet_pt_bins, name='test')

    ets = awkward.fromiter([
        np.random.poisson(30, 5),
        np.random.poisson(30, 2),
        np.random.poisson(30, 3),
    ])
    repeat = ets.stops - ets.starts

    weights = np.ones(len(ets))
    weights = np.repeat(weights, repeat, axis=0)
    pileup = np.random.poisson(50, len(ets))
    pileup = np.repeat(pileup, repeat, axis=0)
   
    hist.fill_array(np.vstack((pileup, ets.content)).T, weights)
    proX = hist.ProjectionX()
    print(type(proX)) # <class 'ROOT.TH1D'>
    assert isinstance(proX, (ROOT.TH1D, ROOT.TH2D, ROOT.TH3D))
    aghast.from_root(proX) # <-- fails with AssertionError: <class 'ROOT.TArrayD'>

Full trace

/software/miniconda/envs/hep_py3/lib/python3.6/site-packages/aghast/__init__.py:31: in from_root
    return aghast._connect._root.from_root(obj, collection=collection)
/software/miniconda/envs/hep_py3/lib/python3.6/site-packages/aghast/_connect/_root.py:286: in from_root
    sumw2array = getbincontents(sumw2obj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = <ROOT.TArrayD object at 0x55e5e196d5d0>

    def getbincontents(obj):
        if isinstance(obj, (ROOT.TH1C, ROOT.TH2C, ROOT.TH3C)):
            out = numpy.empty(obj.GetNcells(), dtype=numpy.int8)
            arraytype = "char"
        elif isinstance(obj, (ROOT.TH1S, ROOT.TH2S, ROOT.TH3S)):
            out = numpy.empty(obj.GetNcells(), dtype=numpy.int16)
            arraytype = "short"
        elif isinstance(obj, (ROOT.TH1I, ROOT.TH2I, ROOT.TH3I)):
            out = numpy.empty(obj.GetNcells(), dtype=numpy.int32)
            arraytype = "int"
        elif isinstance(obj, (ROOT.TH1F, ROOT.TH2F, ROOT.TH3F)):
            out = numpy.empty(obj.GetNcells(), dtype=numpy.float32)
            arraytype = "float"
        elif isinstance(obj, (ROOT.TH1D, ROOT.TH2D, ROOT.TH3D)):
            out = numpy.empty(obj.GetNcells(), dtype=numpy.float64)
            arraytype = "double"
        else:
>           raise AssertionError(type(obj))
E           AssertionError: <class 'ROOT.TArrayD'>

Slicing categorical axis

Unless I'm misunderstanding how the slicing should work, it seems that slicing on a categorical axis, when there is also another axis, does not have the desired effect:

h = aghast.Histogram([aghast.Axis(aghast.CategoryBinning(["cat1", "cat2"])),
                      aghast.Axis(aghast.RegularBinning(10, aghast.RealInterval(-5, 5)))],
                    aghast.UnweightedCounts(
                        aghast.InterpretedInlineBuffer.fromarray(
                            numpy.array([[  9,  25,  29,  35,  54,  67,  60,  84,  80,  94],
                                         [ 99, 119, 109, 109,  95, 104, 102, 106, 112, 122]]))))

If I wanted to select only the counts for "cat2" I assumed this would work:

h_cat2 = h.loc['cat2']

However, that just produces a histogram with the counts for both categories:

h_cat2.counts.array
array([[ 99, 119, 109, 109,  95, 104, 102, 106, 112, 122],
       [  9,  25,  29,  35,  54,  67,  60,  84,  80,  94]])
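Until this is resolved, one workaround is to index the counts array by the category's position, assuming CategoryBinning exposes its categories list (a sketch based on the attributes used above):

cats = list(h.axis[0].binning.categories)
cat2_counts = h.counts.array[cats.index("cat2")]
# array([ 99, 119, 109, 109,  95, 104, 102, 106, 112, 122])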

Deserialization support

Does Aghast have deserialization support? I only see that we can use tobuffer() to translate a histogram into a byte array, but is there a frombuffer() method?
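If a frombuffer() counterpart exists or gets added, a round trip might look like this (frombuffer here is hypothetical, named only to mirror tobuffer):

import aghast

buf = ghastly_hist.tobuffer()      # serialize to a flatbuffers byte array
restored = aghast.frombuffer(buf)  # hypothetical inverse of tobuffer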

Q: Sparse-ness and merging for IntegerBinning and CategoryBinning

I've been working on adapting our pandas-based histogram filling code to produce aghasts, and was trying to understand IntegerBinning and CategoryBinning better.

In our current version that produces pandas dataframes, if a user asks to bin on a variable but doesn't provide a binning scheme, that's interpreted as implying the variable is already discrete (i.e. integral, categorical, or a finite number of floats; the latter is a little dangerous, but has so far not caught anyone out). This works nicely, in that we can distribute the filling jobs with such configurations, produce sparsely binned dataframes, where bins are only present if at least one instance of their value occurred, and then merge these together.

Switching to aghast, I can partially solve this by deducing all possible categories, or the min/max value of an integer variable, for each chunk and only then creating the ghast, but then I can't be sure that these will be the same between distributed chunks of data.

I've had a look at the aghast code and the specification, and it looks to me like IntegerBinning isn't a sparse implementation. I've also looked at the merging of histograms (the _add method in particular) but struggled to understand the merging process for such a situation. What would happen if two Histograms are combined, both with an IntegerBinning axis but with different min/max values? Will the result be the union of these axes or the intersection? And similarly, for the sets of values in two CategoryBinning axes, would they be unioned or intersected if we add such histograms?
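For concreteness, here is the union behavior the sparse use case would want, sketched with plain Counters (this is not the aghast _add implementation):

from collections import Counter

a = Counter({3: 5, 4: 2})  # counts keyed by integer bin value
b = Counter({4: 1, 7: 9})
merged = a + b             # Counter({7: 9, 3: 5, 4: 3}) -- union of the keys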

(I wasn't sure whether this should be a question on gitter, email, or an issue, but since it's potentially a feature request, I thought this might be the best place; please say if you prefer I use somewhere for such questions in the future! 😄 )

Standard package layout

Could this follow the standard Python layout? The following line should install the package:

pip install git+https://github.com/scikit-hep/aghast.git

But it doesn't, since the repository does not follow the standard layout. We have to use:

pip install git+https://github.com/scikit-hep/aghast#subdirectory=python

Test other histograms in 'test_root.py'

def check1d(before, after):
    assert before.GetNbinsX() == after.GetNbinsX()
    for i in range(before.GetNbinsX() + 2):
        assert before.GetBinContent(i) == after.GetBinContent(i)
        assert before.GetBinError(i) == after.GetBinError(i)
    assert before.GetEntries() == after.GetEntries()
    assert before.GetMean() == after.GetMean()
    assert before.GetStdDev() == after.GetStdDev()
    assert before.GetTitle() == after.GetTitle()
    assert before.GetXaxis().GetTitle() == after.GetXaxis().GetTitle()
    assert bool(before.GetXaxis().GetLabels()) == bool(after.GetXaxis().GetLabels())
    assert before.GetXaxis().IsVariableBinSize() == after.GetXaxis().IsVariableBinSize()
    if before.GetXaxis().GetLabels():
        assert list(before.GetXaxis().GetLabels()) == list(after.GetXaxis().GetLabels())
    elif before.GetXaxis().IsVariableBinSize():
        beforeedges = numpy.full(before.GetNbinsX() + 1, 999, dtype=numpy.float64)
        before.GetXaxis().GetLowEdge(beforeedges)
        beforeedges[-1] = before.GetXaxis().GetBinUpEdge(before.GetNbinsX())
        afteredges = numpy.full(after.GetNbinsX() + 1, 123, dtype=numpy.float64)
        after.GetXaxis().GetLowEdge(afteredges)
        afteredges[-1] = after.GetXaxis().GetBinUpEdge(after.GetNbinsX())
        assert numpy.array_equal(beforeedges, afteredges)  # compare the edge arrays
    else:
        assert before.GetXaxis().GetBinLowEdge(1) == after.GetXaxis().GetBinLowEdge(1)
        assert before.GetXaxis().GetBinUpEdge(
            before.GetNbinsX()
        ) == after.GetXaxis().GetBinUpEdge(after.GetNbinsX())

'test_root.py' only checks 1-d ROOT histogram classes. We should add TH2* and TH3* checks as well. Besides, we can create some demo histograms to test, just as boost_histogram has done.

I'll do it if time permits.
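A minimal sketch of a 2-d counterpart, assuming the same ROOT accessors as check1d above (spot-checking contents and errors only):

def check2d(before, after):
    assert before.GetNbinsX() == after.GetNbinsX()
    assert before.GetNbinsY() == after.GetNbinsY()
    # ranges include the underflow (0) and overflow (N + 1) bins
    for ix in range(before.GetNbinsX() + 2):
        for iy in range(before.GetNbinsY() + 2):
            assert before.GetBinContent(ix, iy) == after.GetBinContent(ix, iy)
            assert before.GetBinError(ix, iy) == after.GetBinError(ix, iy)
    assert before.GetEntries() == after.GetEntries()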

CONTRIBUTING.md

Considering that aghast is under heavy development, I think it would be nice to add something like a CONTRIBUTING.md to involve others.

Maybe split the manual installation instructions out of the README into CONTRIBUTING. I'll work on it and open a pull request when finished. 😊
