histogrammar Python implementation

histogrammar is a Python package for creating histograms. histogrammar has multiple histogram types, supports numeric and categorical features, and works with Numpy arrays and Pandas and Spark dataframes. Once a histogram is filled, it's easy to plot it, store it in JSON format (and retrieve it), or convert it to Numpy arrays for further analysis.

At its core histogrammar is a suite of data aggregation primitives designed for use in parallel processing. In the simplest case, you can use this to compute histograms, but the generality of the primitives allows much more.
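As a rough illustration of why aggregation primitives suit parallel processing, the sketch below fills one aggregator per data partition and then merges the partial results. The `MiniBin` class and the `fill_partition` helper are hypothetical, written in plain Python for this example only; they are not histogrammar's implementation, just the same fill-then-combine idea.

```python
import math
from functools import reduce


class MiniBin:
    """Hypothetical regular-binned counter (illustration only)."""

    def __init__(self, num, low, high):
        self.num, self.low, self.high = num, low, high
        self.values = [0.0] * num  # counts for the in-range bins
        self.underflow = 0.0       # entries below `low`
        self.overflow = 0.0        # entries at or above `high`
        self.nanflow = 0.0         # entries that are NaN

    def fill(self, x, weight=1.0):
        if math.isnan(x):
            self.nanflow += weight
        elif x < self.low:
            self.underflow += weight
        elif x >= self.high:
            self.overflow += weight
        else:
            index = int(math.floor(self.num * (x - self.low) / (self.high - self.low)))
            # guard against floating-point rounding at the upper edge
            self.values[min(index, self.num - 1)] += weight

    def __add__(self, other):
        # merging is only meaningful when both sides share the same binning
        if (self.num, self.low, self.high) != (other.num, other.low, other.high):
            raise ValueError("binning must match to merge")
        merged = MiniBin(self.num, self.low, self.high)
        merged.values = [a + b for a, b in zip(self.values, other.values)]
        merged.underflow = self.underflow + other.underflow
        merged.overflow = self.overflow + other.overflow
        merged.nanflow = self.nanflow + other.nanflow
        return merged


def fill_partition(partition):
    """Fill a fresh aggregator from one partition (could run on any worker)."""
    partial = MiniBin(num=5, low=0.0, high=10.0)
    for x in partition:
        partial.fill(x)
    return partial


partitions = [[1.0, 2.5, 7.0], [2.0, 9.9, 12.0], [float("nan"), -1.0]]
partials = [fill_partition(p) for p in partitions]
total = reduce(lambda a, b: a + b, partials)
print(total.values, total.underflow, total.overflow, total.nanflow)
```

Because merging is a plain element-wise sum, the partial results can be combined in any order, which is what makes this style of aggregator a natural fit for map-reduce frameworks.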

Several common histogram types can be plotted in Matplotlib, Bokeh and PyROOT with a single method call. If Numpy or Pandas is available, histograms and other aggregators can be filled from arrays ten to a hundred times more quickly via Numpy commands, rather than Python for loops. If PyROOT is available, histograms and other aggregators can be filled from ROOT TTrees hundreds of times more quickly by JIT-compiling a specialized C++ filler. Histograms and other aggregators may also be converted into CUDA code for inclusion in a GPU workflow. And if PyCUDA is available, they can also be filled from Numpy arrays by JIT-compiling the CUDA code.

This Python implementation of histogrammar has been tested to guarantee compatibility with its Scala implementation.

Latest Python release: v1.0.30 (June 2022).

Announcements

Spark 3.0

With Spark 3.0, based on Scala 2.12, make sure to pick up the correct histogrammar jar file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.config("spark.jars.packages", "io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20").getOrCreate()

For Spark 2.X compiled against Scala 2.11, simply replace "2.12" with "2.11" in the string above.

February, 2021
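To avoid editing the package string by hand when switching between Scala versions, it could be built programmatically. This is a minimal sketch: the helper name `histogrammar_spark_packages` is hypothetical, and the default jar version 1.0.20 is simply the one quoted in the Spark 3.0 note.

```python
def histogrammar_spark_packages(scala_version="2.12", jar_version="1.0.20"):
    """Hypothetical helper: build the value for the spark.jars.packages option."""
    artifacts = ("histogrammar", "histogrammar-sparksql")
    return ",".join(
        "io.github.histogrammar:{}_{}:{}".format(name, scala_version, jar_version)
        for name in artifacts
    )


# Scala 2.12 for Spark 3.0; use "2.11" for Spark 2.X built against Scala 2.11
packages = histogrammar_spark_packages("2.12")
print(packages)
```

The resulting string can then be passed as the second argument to `SparkSession.builder.config("spark.jars.packages", ...)`.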

Example notebooks

The following tutorials are available as Colab notebooks:

Basic tutorial
Detailed example (featuring configuration, Apache Spark and more)
Exercises

Documentation

See histogrammar-docs for a complete introduction to histogrammar. (A bit old but still good.) There you can also find documentation about the Scala implementation of histogrammar.

Check it out

The histogrammar library requires Python 3.6+ and is pip friendly. To get started, simply do:

$ pip install histogrammar

or check out the code from our GitHub repository:

$ git clone https://github.com/histogrammar/histogrammar-python
$ pip install -e histogrammar-python

where in this example the code is installed in editable mode (option -e), so changes to the checked-out source take effect without reinstalling.

You can now use the package in Python with:

import histogrammar

Congratulations, you are now ready to use the histogrammar library!
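If you prefer to check the installation from a script without triggering an `ImportError`, a minimal sketch using only the standard library could look like this. The `is_installed` helper is hypothetical and not part of histogrammar.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("histogrammar"):
    print("histogrammar is importable")
else:
    print("histogrammar not found; try: pip install histogrammar")
```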

Quick run

As a quick example, you can do:

import pandas as pd
import histogrammar as hg
from histogrammar import resources

# open synthetic data
df = pd.read_csv(resources.data('test.csv.gz'), parse_dates=['date'])
df.head()

# create a histogram, tell it to look for column 'age'
# fill the histogram with column 'age' and plot it
hist = hg.Histogram(num=100, low=0, high=100, quantity='age')
hist.fill.numpy(df)
hist.plot.matplotlib()

# generate histograms of all features in the dataframe using automatic binning
# (importing histogrammar automatically adds this functionality to a pandas or spark dataframe)
hists = df.hg_make_histograms()
print(hists.keys())

# multi-dimensional histograms are also supported. e.g. features longitude vs latitude
hists = df.hg_make_histograms(features=['longitude:latitude'])
ll = hists['longitude:latitude']
ll.plot.matplotlib()

# store histogram and retrieve it again
ll.toJsonFile('longitude_latitude.json')
ll2 = hg.Factory().fromJsonFile('longitude_latitude.json')
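To make the `num`, `low` and `high` arguments of the first histogram concrete, the sketch below computes the bin edges and the bin index for a single value in plain Python. It assumes regular, equal-width bins covering the half-open range [low, high); the helpers `bin_edges` and `bin_index` are hypothetical and only illustrate the arithmetic, not the library's own code.

```python
import math


def bin_edges(num, low, high):
    """Return the num + 1 edges of num equal-width bins between low and high."""
    width = (high - low) / num
    return [low + i * width for i in range(num + 1)]


def bin_index(x, num, low, high):
    """Return the bin index of x, or None when x is NaN or outside [low, high)."""
    if math.isnan(x) or x < low or x >= high:
        return None
    index = int(math.floor(num * (x - low) / (high - low)))
    return min(index, num - 1)  # guard against rounding at the upper edge


# Same binning as the 'age' histogram: 100 bins from 0 to 100
edges = bin_edges(100, 0, 100)
i = bin_index(42.5, 100, 0, 100)
print(i, edges[i], edges[i + 1])
```

With this binning each bin is one year wide, so an age of 42.5 lands in the bin whose edges are 42 and 43.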

These examples also work with Spark dataframes (sdf):

from pyspark.sql.functions import col
hist = hg.Histogram(num=100, low=0, high=100, quantity=col('age'))
hist.fill.sparksql(sdf)

For more examples please see the example notebooks and tutorials.

Project contributors

This package was originally authored by DIANA-HEP and is now maintained by volunteers.

Contact and support

Please note that histogrammar is supported only on a best-effort basis.

License

histogrammar is completely free, open-source and licensed under the Apache-2.0 license.

Contributors

jpivarski, mbaak, asvyatkovskiy, fnands, bwengals, sbrugman, vincecr0ft
