
Comments (2)

jpivarski commented on June 16, 2024

I learned from this project that while Flatbuffers may be a very fast protocol in C++, the Python implementation is there purely for functionality, not speed. All the indirection that's supposed to stay in-cache in C++ becomes very slow unpacking in Python.

Fundamentally, what should the serialization format for histograms be? This Flatbuffers one was an attempt at a time when I thought that a lot of histogram libraries would proliferate. Now it's settling down to just ROOT and Boost.Histogram. The Flatbuffers schema here is trying to be ultra-general, like allowing the underflow and overflow bins to be anywhere, but as you pointed out, there's only one convention now: ROOT and Boost.Histogram both put the underflow before and the overflow after the bins. Flatbuffers has three strong selling points:

  1. the C++ serialization/deserialization is very fast (I'm trusting the Flatbuffers documentation on this; haven't tried it myself)
  2. it's the only rowwise/record-oriented format that I know of that can selectively decompress parts of the record, for any schema
  3. this indirection also makes it very capable of schema evolution: old code doesn't have to know it's skipping over new fields, as it's just following pointers.

So the question is, how important are these selling points to the specific problem of serializing and deserializing Boost.Histogram? That's a different question than we were asking before—the old question was "What's important for serializing/deserializing any histogram in any histogram library?" Surely, (1) is still important. Is (2) important? That was to accommodate a union of features from all histogram libraries, with the idea that any particular histogram library won't be using all of them. Now, if you're deserializing every field every time, there might not be as much value in that. Similarly, the flexibility in point (3) mattered when we were taking on the problem of all current and future histogramming libraries, but focusing on Boost.Histogram, the data structures may be stable enough that you don't need such an extreme of flexibility. This can be adjusted for in the Flatbuffers schema itself by making never-changing substructures into struct, rather than table, which removes a level of indirection. (In the Flatbuffers examples, struct is used for things like 3D points, which will always have exactly three fields, and table is used for the video game Monster type, which may get new attributes in the future.)

Flatbuffers's three selling points don't address some points that are important to physics users:

  4. being able to store histograms and event data in the same file
  5. interoperability with ROOT
  6. efficiently storing thousands of histograms with mostly the same metadata—ideally by writing one copy of the metadata.

In principle, (4) could be addressed by implementing the histogram serialization in Parquet. The Flatbuffers schema in Aghast isolates the bin contents, to keep them columnar—in Parquet, they'd automatically be columnar. Everything else (the bulk of the Aghast Flatbuffers schema) can be Parquet metadata, which is any bytestring associated with a node of an array.

Awkward 2.0 will be using this metadata mechanism, so here's a demonstration:

>>> import awkward as ak
>>> import numpy as np
>>> import pyarrow.parquet as pq
>>> contents, edges = np.histogram(np.random.normal(0, 1, 100000))
>>> histogram_array = ak._v2.highlevel.Array(
...     ak._v2.contents.NumpyArray(
...         contents,
...         parameters={"edges": edges.tolist()},
...     )
... )
>>> histogram_array
<Array [83, 846, 5408, 18452, ..., 12456, 2860, 349, 18] type='10 * int64[pa...'>
>>> histogram_array.type
ArrayType(NumpyType('int64', parameters={'edges': [-4.0401437884795435, -3.1990114867101216, -2.3578791849407, -1.5167468831712787, -0.6756145814018568, 0.16551772036756507, 1.006650022136986, 1.847782323906408, 2.68891462567583, 3.5300469274452517, 4.371179229214674]}), 10)
>>> pq.write_table(
...     ak._v2.operations.convert.to_arrow_table(histogram_array),
...     "tmp.parquet",
... )
>>> from_disk = ak._v2.operations.convert.from_arrow(pq.read_table("tmp.parquet"))
>>> from_disk
<Array [83, 846, 5408, 18452, ..., 12456, 2860, 349, 18] type='10 * int64[pa...'>
>>> from_disk.type
ArrayType(NumpyType('int64', parameters={'edges': [-4.0401437884795435, -3.1990114867101216, -2.3578791849407, -1.5167468831712787, -0.6756145814018568, 0.16551772036756507, 1.006650022136986, 1.847782323906408, 2.68891462567583, 3.5300469274452517, 4.371179229214674]}), 10)

Awkward Array sends its parameters through JSON (assuming that they're small), but a path that goes directly from Boost.Histogram to Arrow/Parquet could encode them as a binary array of doubles. Or Awkward can be modified to store some parameters in a more efficient format if this becomes a use-case.

It's also possible to write metadata to HDF5, and HDF5 would bring a lot of options for tuning sparse data. However, jagged event data can't be stored in HDF5 efficiently, so criterion (4), putting event data and histograms in the same file, wouldn't be satisfied by HDF5.
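For comparison, a minimal sketch of the same contents-plus-metadata pattern in HDF5, assuming h5py as the interface (the dataset and attribute names are illustrative):

```python
import numpy as np
import h5py  # assuming h5py as the HDF5 interface

contents, edges = np.histogram(np.random.normal(0, 1, 100000))

with h5py.File("tmp_hist.h5", "w") as f:
    # The bin contents are a dataset; the edges ride along as an
    # attribute attached to that dataset.
    dset = f.create_dataset("histogram", data=contents)
    dset.attrs["edges"] = edges

with h5py.File("tmp_hist.h5", "r") as f:
    contents_back = f["histogram"][:]
    edges_back = f["histogram"].attrs["edges"]

assert np.array_equal(contents_back, contents)
assert np.array_equal(edges_back, edges)
```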

On the other hand, maybe the edges could be part of the data, rather than metadata:

>>> histogram_array2 = ak._v2.highlevel.Array(
...     ak._v2.contents.RecordArray(
...         [
...             ak._v2.contents.NumpyArray(edges[:-1]),
...             ak._v2.contents.NumpyArray(edges[1:]),
...             ak._v2.contents.NumpyArray(contents),
...         ],
...         ["low", "high", "value"],
...         parameters={"__array__": "histogram"},
...     )
... )
>>> histogram_array2
<Array [{low: -4.34, high: -3.5, value: 23}, ..., {...}] type='10 * struct[[...'>
>>> histogram_array2.show()
[{low: -4.34, high: -3.5, value: 23},
 {low: -3.5, high: -2.66, value: 396},
 {low: -2.66, high: -1.82, value: 3160},
 {low: -1.82, high: -0.98, value: 13006},
 {low: -0.98, high: -0.141, value: 28114},
 {low: -0.141, high: 0.698, value: 31102},
 {low: 0.698, high: 1.54, value: 17920},
 {low: 1.54, high: 2.38, value: 5436},
 {low: 2.38, high: 3.21, value: 769},
 {low: 3.21, high: 4.05, value: 74}]

With the above, a large number of histograms could be stored in one array, satisfying (6), and it's easy to see how it could be generalized to Boost.Histogram's many storage types. If the edges are repeated frequently, they could be dictionary-encoded in Parquet.

Furthermore, maybe an Arrow/Parquet format that has a natural and meaningful deserialization into Pandas would also be a good idea. Arrow is supposed to be a fast backend for Pandas, and "interoperability with Pandas" may be an important criterion (7) in the future.

Then there's criterion (5), interoperability with ROOT. As we've seen, some users are already serializing Boost.Histogram with ROOT streamers—somehow, in a non-standardized way. Standardizing that by adding the ROOT/Cling dictionary-generating macros (with dummy macros, i.e. #ifndef..., for when Boost.Histogram is used outside of ROOT) would at least ensure that a million formats don't proliferate.

But what about interoperability with ROOT's v7 histograms? Do v7 histograms have all the features of Boost.Histogram, in the sense that Boost.Histograms can be losslessly converted to and from ROOT v7 histograms? If so, then maybe that could be the serialization format. It would satisfy criteria (1), (3), (4), (5), and maybe (6), by joining histograms with like-metadata into a single histogram with an extra axis.

I'm looking forward to the prospect of a standardized serialization format for Boost.Histogram. Flatbuffers was chosen for a different reason (interoperability with all histogramming libraries), which is less important now that Boost.Histogram is dominant for non-ROOT users. Flatbuffers is still an okay choice, but there are some things it doesn't do. Alternatives include Arrow/Parquet, Awkward/Arrow/Parquet, Pandas/Arrow/Parquet, HDF5, ROOT as a new class, and ROOT as v7 histograms.

What do you think?


bendavid commented on June 16, 2024

I looked into the case of ROOT persistence of boost histograms a bit.

This doesn't work with the default storage type (I think for the reasons @HDembinski outlined in https://root-forum.cern.ch/t/boost-1-70-released-new-library-boost-histogram/33595/12).

On the other hand, with histograms from make_weighted_histogram, this seems to work just fine, albeit with some (maybe spurious) warnings.

I've opened an issue for ROOT about this (root-project/root#9371)

I don't think that pre-generating dictionaries is necessarily a complete solution, since one may want to serialize a template instance which is only instantiated on the fly with PyROOT/Cling (but maybe for a subset of commonly used template instances this could make sense)

