jacanchaplais / showerpipe

Pythonic data pipeline for Monte-Carlo showering and hadronisation programs.

Home Page: https://showerpipe.readthedocs.io/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

showerpipe's Issues

Move LHE functionality into heparchy

Heparchy is a dependency of showerpipe anyway, and the cohesion makes more sense given that heparchy is for IO.

Publish on conda

Set up a jacanchaplais channel on conda, and publish showerpipe. Ideally, submit it to conda-forge after it has been successfully published.

Guide here:
https://docs.conda.io/projects/conda-build/en/latest/user-guide/tutorials/build-pkgs.html

Extract helicity and status from Pythia

Implement no LHE runs

PythiaGenerator is currently untested when run without an LHE file. I suspect it would not work. This should be investigated and implemented.

PyTorch compatible dataloaders

Enable online learning with PyTorch compatible dataloaders.

These should use the same underlying observer pattern as the CLI, with options to write out asynchronously to disk, and more.

They should also provide an interface to stack pre-processing steps in a pipeline before exposing the data to NNs.

Parallel event generators

Provide an API for parallel event generation.

This may require an I/O update to heparchy, to enable distributed data writing.

mpi4py is compatible with h5py parallel I/O, and gets around the GIL. However, unsure if I can integrate this easily into PyTorch dataloaders down the line.

Remove structure of original observer pattern

Pipeline architecture ended up superseding the observer pattern #4. The functionality of providing more than one observer is made redundant by the branched pipe. Clean up and simplify the code, removing the traces of the original observer pattern.

LHE module consistency updates

Refactor the lhe module so that its routines no longer output bytestrings by default. Either output LheData objects, or possibly io.BytesIO objects (perhaps have everything use LheData and, wherever bytestrings would be exported, always do so via a BytesIO buffer?).

Also look at object instantiation, and improve the type hinting.

Add skeleton Sphinx docs

Increase type annotation coverage

Review the library and check where there are weak points. The running list is:

investigate what "library stubs" are, and add them if applicable
allow PythiaGenerator to take a Path object for the pythia settings

Edge calculation efficiency

The current method of calculating the edges uses a fairly expensive and confusing set of pandas operations, tracking inconsistently sized numbers of parents / children in tuples, exploding them, etc.

It occurs to me that this may be unnecessary, as well as ugly and inefficient. An array representing the adjacency matrix could be pre-allocated, and a simple mapping established between particle ids and the rows / columns. Each particle could then use its list of parents as a numpy fancy index, flipping the element values to True on a given row. This would likely be much more efficient, and more readable.

Efficiency could potentially be boosted further if used with scipy sparse arrays, or accelerated with numba (but I think numba may struggle with the jagged input, so preprocessing would probably be required).

Edit: acceleration using compiled libraries etc. has been abandoned in favour of the simplicity and readability of a pure Python solution. This is easier to work with and understand, and is still 10× faster than the original, unoptimised pandas solution.
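As an illustration of the pre-allocated adjacency matrix idea described above, here is a minimal pure Python sketch. It is not showerpipe's actual implementation: the function name `adjacency_matrix` and the input format (a dict mapping each particle id to its list of parent ids) are assumptions made for this example.

```python
from typing import Dict, List, Sequence


def adjacency_matrix(parents: Dict[int, Sequence[int]]) -> List[List[bool]]:
    """Build a dense adjacency matrix from a particle -> parents mapping.

    Hypothetical helper for illustration only; not part of showerpipe.
    """
    # Collect every particle id that appears, as a child or as a parent.
    ids = sorted(set(parents) | {p for ps in parents.values() for p in ps})
    # Simple mapping between particle ids and row / column positions.
    index = {pid: pos for pos, pid in enumerate(ids)}
    # Pre-allocate the square matrix with every element set to False.
    size = len(ids)
    matrix = [[False] * size for _ in range(size)]
    for child, parent_ids in parents.items():
        row = matrix[index[child]]
        # Flip the elements for this particle's parents to True on its row.
        for parent in parent_ids:
            row[index[parent]] = True
    return matrix


# Particles 1 and 2 produce 3, which then produces 4 and 5.
example = {3: [1, 2], 4: [3], 5: [3]}
adj = adjacency_matrix(example)
```

With numpy, the inner loop could be replaced by indexing a boolean array row with the list of mapped parent positions, which is the fancy indexing the issue refers to.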

Implement pipeline plugin architecture

Need to adjust the observer pattern so that observers are actually pipelines.

Pipelines are constructed of sources, filters, and sinks. The subject in the original observer pattern is the source. The data structure passing through the pipeline should present the same interface, and be of the same type, at every stage.

A plugin architecture will allow users to create their own filters and sinks, although the package will ship with generic filters to transform and cluster the data, as well as generic sinks to write the data to disk, or create visualisations.

Pipelines will then be defined by users in yaml files, and passed to the CLI.
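The source / filter / sink structure described above could be sketched as follows. This is only an illustration under assumed names: `Event`, `toy_source`, `pt_cut`, `ListSink`, and `run_pipeline` are all hypothetical, and none of them is taken from showerpipe's API. The point it shows is that every filter takes and returns the same type, so stages can be chained freely.

```python
from typing import Callable, Dict, Iterable, Iterator, List

# Hypothetical event type: the same structure flows through every stage.
Event = Dict[str, List[float]]
Filter = Callable[[Event], Event]
Sink = Callable[[Event], None]


def toy_source(n_events: int) -> Iterator[Event]:
    """Source: yields events (stands in for an event generator)."""
    for i in range(n_events):
        yield {"pt": [i + 0.5, i + 1.5]}


def pt_cut(threshold: float) -> Filter:
    """Filter factory: keeps only entries with pt above the threshold."""
    def apply(event: Event) -> Event:
        return {"pt": [pt for pt in event["pt"] if pt > threshold]}
    return apply


class ListSink:
    """Sink: collects events in memory (stands in for writing to disk)."""

    def __init__(self) -> None:
        self.events: List[Event] = []

    def __call__(self, event: Event) -> None:
        self.events.append(event)


def run_pipeline(
    source: Iterable[Event], filters: List[Filter], sinks: List[Sink]
) -> None:
    """Push each event from the source through the filters, then to sinks."""
    for event in source:
        for stage in filters:
            event = stage(event)
        for sink in sinks:
            sink(event)


sink = ListSink()
run_pipeline(toy_source(3), filters=[pt_cut(1.0)], sinks=[sink])
```

A user-defined filter or sink in this sketch is just another callable with the same signature, which is the property a plugin architecture would rely on.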

Throw a FileNotFoundError when passed a non-existent LHE path

Currently the error is an XMLSyntaxError thrown by lxml - not very professional!
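One way to get the desired behaviour is to validate the path before handing it to the XML parser. The sketch below assumes the LHE input arrives as a local filesystem path; the helper name `check_lhe_path` is hypothetical and does not come from showerpipe.

```python
from pathlib import Path
from typing import Union


def check_lhe_path(path: Union[str, Path]) -> Path:
    """Return the path as a Path object, or raise if no such file exists.

    Hypothetical helper for illustration; call it before parsing.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"LHE file not found: {path}")
    return path
```

Calling this at the start of the loading routine means the user sees a clear FileNotFoundError naming the missing file, rather than a parser error.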
