I learned from this project that while Flatbuffers may be a very fast protocol in C++, the Python implementation exists only for functionality, not speed. All the indirection that is supposed to stay in-cache in C++ is very slow to unpack in Python.
Fundamentally, what should the serialization format for histograms be? This Flatbuffers one was an attempt at a time when I thought that a lot of histogram libraries would proliferate. Now it's settling down to just ROOT and Boost.Histogram. The Flatbuffers schema here is trying to be ultra-general, like allowing the underflow and overflow bins to be anywhere, but as you pointed out, there's only one convention now: ROOT and Boost.Histogram both put the underflow before and the overflow after the bins. Flatbuffers has three strong selling points:
1. The C++ serialization/deserialization is very fast (I'm trusting the Flatbuffers documentation on this; I haven't tried it myself).
2. It's the only rowwise/record-oriented format I know of that can selectively decompress parts of a record, for any schema.
3. That same indirection makes it very capable of schema evolution: old code doesn't have to know that it's skipping over new fields, because it's just following pointers.
So the question is, how important are these selling points to the specific problem of serializing and deserializing Boost.Histogram? That's a different question than the one we were asking before: the old question was "What's important for serializing/deserializing any histogram in any histogram library?" Surely, (1) is still important. Is (2) important? That was to accommodate a union of features from all histogram libraries, with the idea that any particular histogram library won't be using all of them. Now, if you're deserializing every field every time, there may not be as much value in it. Similarly, the flexibility in point (3) mattered when we were taking on the problem of all current and future histogramming libraries, but if we focus on Boost.Histogram, the data structures may be stable enough that you don't need such an extreme of flexibility. This can be adjusted in the Flatbuffers schema itself by making never-changing substructures into `struct`, rather than `table`, which removes a level of indirection. (In the Flatbuffers examples, `struct` is used for things like 3D points, which will always have exactly three fields, and `table` is used for the video game Monster type, which may get new attributes in the future.) A hypothetical schema fragment below illustrates the difference.
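Here's a sketch of that distinction in FlatBuffers IDL (the type and field names are made up for illustration; they're not taken from the Aghast schema):

```
// struct: fixed layout, stored inline with no indirection;
// fields can never be added or removed
struct BinInterval {
  low: double;
  high: double;
}

// table: fields are reached through a vtable, so later schema
// revisions can append fields without breaking old readers
table Axis {
  edges: [double];
  title: string;
}
```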
Flatbuffers's three selling points don't address some requirements that are important to physics users:

4. being able to store histograms and event data in the same file
5. interoperability with ROOT
6. efficiently storing thousands of histograms with mostly the same metadata, ideally by writing only one copy of that metadata.
In principle, (4) could be addressed by implementing the histogram serialization in Parquet. The Flatbuffers schema in Aghast isolates the bin contents, to keep them columnar—in Parquet, they'd automatically be columnar. Everything else (the bulk of the Aghast Flatbuffers schema) can be Parquet metadata, which is any bytestring associated with a node of an array.
Awkward 2.0 will be using this metadata, so here's a demonstration:
>>> import awkward as ak
>>> import numpy as np
>>> import pyarrow.parquet as pq
>>> contents, edges = np.histogram(np.random.normal(0, 1, 100000))
>>> histogram_array = ak._v2.highlevel.Array(
... ak._v2.contents.NumpyArray(
... contents,
... parameters={"edges": edges.tolist()},
... )
... )
>>> histogram_array
<Array [83, 846, 5408, 18452, ..., 12456, 2860, 349, 18] type='10 * int64[pa...'>
>>> histogram_array.type
ArrayType(NumpyType('int64', parameters={'edges': [-4.0401437884795435, -3.1990114867101216, -2.3578791849407, -1.5167468831712787, -0.6756145814018568, 0.16551772036756507, 1.006650022136986, 1.847782323906408, 2.68891462567583, 3.5300469274452517, 4.371179229214674]}), 10)
>>> pq.write_table(
... ak._v2.operations.convert.to_arrow_table(histogram_array),
... "tmp.parquet",
... )
>>> from_disk = ak._v2.operations.convert.from_arrow(pq.read_table("tmp.parquet"))
>>> from_disk
<Array [83, 846, 5408, 18452, ..., 12456, 2860, 349, 18] type='10 * int64[pa...'>
>>> from_disk.type
ArrayType(NumpyType('int64', parameters={'edges': [-4.0401437884795435, -3.1990114867101216, -2.3578791849407, -1.5167468831712787, -0.6756145814018568, 0.16551772036756507, 1.006650022136986, 1.847782323906408, 2.68891462567583, 3.5300469274452517, 4.371179229214674]}), 10)
Awkward Array sends its parameters through JSON (assuming that they're small), but a path that goes directly from Boost.Histogram to Arrow/Parquet could encode them as a binary array of doubles. Or Awkward could be modified to store some parameters in a more efficient format if this becomes a use case.
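For illustration, here's a minimal sketch of the binary-edges idea in plain pyarrow (not Awkward's mechanism; the file name is made up). The edges go into Arrow field metadata as raw float64 bytes, and pyarrow carries the Arrow schema, including field metadata, through a Parquet round trip:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

contents, edges = np.histogram(np.random.normal(0, 1, 100000))

# attach the edges as raw float64 bytes (native byte order), not JSON
field = pa.field("counts", pa.int64(), metadata={"edges": edges.tobytes()})
table = pa.table([pa.array(contents)], schema=pa.schema([field]))
pq.write_table(table, "tmp_binary_edges.parquet")

# round trip: the Arrow schema (with field metadata) rides along in the Parquet file
back = pq.read_table("tmp_binary_edges.parquet")
recovered_edges = np.frombuffer(back.schema.field("counts").metadata[b"edges"])
```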
It's also possible to write metadata to HDF5, and HDF5 would bring a lot of options for tuning sparse data. However, jagged event data can't be stored efficiently in HDF5, so if the goal is motivation (4), putting event data and histograms in the same file, HDF5 wouldn't satisfy it.
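For comparison, a minimal HDF5 sketch with h5py, storing the edges as an attribute on the counts dataset (file and dataset names are made up):

```python
import numpy as np
import h5py

contents, edges = np.histogram(np.random.normal(0, 1, 100000))

with h5py.File("tmp.h5", "w") as f:
    dataset = f.create_dataset("counts", data=contents)
    dataset.attrs["edges"] = edges   # HDF5 attribute: metadata stored next to the data

with h5py.File("tmp.h5", "r") as f:
    recovered_edges = f["counts"].attrs["edges"]
```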
On the other hand, maybe the edges could be part of the data, rather than metadata:
>>> histogram_array2 = ak._v2.highlevel.Array(
... ak._v2.contents.RecordArray(
... [
... ak._v2.contents.NumpyArray(edges[:-1]),
... ak._v2.contents.NumpyArray(edges[1:]),
... ak._v2.contents.NumpyArray(contents),
... ],
... ["low", "high", "value"],
... parameters={"__array__": "histogram"},
... )
... )
>>> histogram_array2
<Array [{low: -4.34, high: -3.5, value: 23}, ..., {...}] type='10 * struct[[...'>
>>> histogram_array2.show()
[{low: -4.34, high: -3.5, value: 23},
{low: -3.5, high: -2.66, value: 396},
{low: -2.66, high: -1.82, value: 3160},
{low: -1.82, high: -0.98, value: 13006},
{low: -0.98, high: -0.141, value: 28114},
{low: -0.141, high: 0.698, value: 31102},
{low: 0.698, high: 1.54, value: 17920},
{low: 1.54, high: 2.38, value: 5436},
{low: 2.38, high: 3.21, value: 769},
{low: 3.21, high: 4.05, value: 74}]
With the above, a large number of histograms could be stored in one array, satisfying (6), and it's easy to see how it could be generalized to Boost.Histogram's many storage types. If the edges are repeated frequently, they could be dictionary-encoded in Parquet; a sketch follows.
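Here's what that might look like in plain pyarrow (a sketch; the file name and column layout are made up). Many histograms share one binning, and the repeated edge columns are dictionary-encoded so each distinct value is written once per column chunk:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

edges = np.linspace(-4.0, 4.0, 11)

# a thousand histograms that share the same binning
pieces = []
for i in range(1000):
    counts, _ = np.histogram(np.random.normal(0, 1, 10000), bins=edges)
    pieces.append(pa.table({
        "hist": np.full(10, i, dtype=np.int32),   # which histogram each bin belongs to
        "low": edges[:-1],
        "high": edges[1:],
        "value": counts,
    }))

table = pa.concat_tables(pieces)

# dictionary-encode the repeated edge columns
pq.write_table(table, "many_histograms.parquet", use_dictionary=["low", "high"])
```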
Furthermore, maybe an Arrow/Parquet format that has a natural and meaningful deserialization into Pandas would also be a good idea. Arrow is supposed to be a fast backend for Pandas, and "interoperability with Pandas" may be an important criterion (7) in the future.
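For instance (a sketch, not an established convention), a binned histogram maps naturally onto a Pandas DataFrame with an IntervalIndex built from the edges:

```python
import numpy as np
import pandas as pd

contents, edges = np.histogram(np.random.normal(0, 1, 100000))

# bin intervals as the index, counts as a column
df = pd.DataFrame(
    {"value": contents},
    index=pd.IntervalIndex.from_arrays(edges[:-1], edges[1:], closed="left"),
)
```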
Then there's criterion (5), interoperability with ROOT. As we've seen, some users are already serializing Boost.Histogram with ROOT streamers, somehow, in a non-standardized way. Standardizing that by adding the ROOT/Cling dictionary-generating macros (with dummy macros for when Boost.Histogram is not used with ROOT, i.e. behind an `#ifndef...` guard) would at least ensure that a million formats don't proliferate.
But what about interoperability with ROOT's v7 histograms? Do v7 histograms have all the features of Boost.Histogram, in the sense that Boost.Histograms can be losslessly converted to and from ROOT v7 histograms? If so, then maybe that could be the serialization format. It would satisfy criteria (1), (3), (4), (5), and maybe (6), by joining histograms with like metadata into a single histogram with an extra axis (sketched below).
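A sketch of the extra-axis idea using boost_histogram's Python API, purely illustrative:

```python
import numpy as np
import boost_histogram as bh

# one histogram with an extra category axis, instead of 1000 separate histograms
h = bh.Histogram(
    bh.axis.IntCategory(list(range(1000))),   # which histogram
    bh.axis.Regular(10, -4.0, 4.0),           # the shared binning and metadata
)

for i in range(1000):
    h.fill(i, np.random.normal(0, 1, 10000))  # the scalar i broadcasts over the array
```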
I'm looking forward to the prospect of there being a standardized serialization format for Boost.Histogram. Flatbuffers was chosen for a different reason, interoperability with all histogramming libraries, which is less important now that Boost.Histogram is dominant for non-ROOT users. Flatbuffers is still an okay choice, but there are some things it doesn't do. Alternatives include Arrow/Parquet, Awkward/Arrow/Parquet, Pandas/Arrow/Parquet, HDF5, ROOT as a new class, and ROOT as v7 histograms.
What do you think?
I looked into the case of ROOT persistence of boost histograms a bit.
This doesn't work with the default storage type (I think for the reasons that @HDembinski outlined in https://root-forum.cern.ch/t/boost-1-70-released-new-library-boost-histogram/33595/12).
On the other hand, with histograms from `make_weighted_histogram` this seems to work just fine, albeit with some (maybe spurious) warnings.
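(For readers of the Python API: the equivalent of C++ `make_weighted_histogram` is a histogram with Weight storage. A sketch:)

```python
import boost_histogram as bh

# Python-side equivalent of boost::histogram::make_weighted_histogram(...)
h = bh.Histogram(bh.axis.Regular(10, 0.0, 1.0), storage=bh.storage.Weight())
h.fill([0.1, 0.2, 0.3], weight=[2.0, 3.0, 0.5])
```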
I've opened an issue for ROOT about this (root-project/root#9371).
I don't think that pre-generating dictionaries is necessarily a complete solution, since one may want to serialize a template instance that is only instantiated on the fly with PyROOT/Cling (but maybe for a subset of commonly used template instances this could make sense).