
awkward-array's Introduction

awkward-array

Manipulate arrays of complex data structures as easily as Numpy.

awkward-array is a pure Python+Numpy library for manipulating complex data structures as you would Numpy arrays. Even if your data structures

  • contain variable-length lists (jagged or ragged),
  • are deeply nested (record structure),
  • have different data types in the same list (heterogeneous),
  • are masked, bit-masked, or index-mapped (nullable),
  • contain cross-references or even cyclic references,
  • need to be Python class instances on demand,
  • are not defined at every point (sparse),
  • are not contiguous in memory,
  • should not be loaded into memory all at once (lazy),

this library can access them with the efficiency of Numpy arrays. They may be converted from JSON or Python data, loaded from "awkd" files, HDF5, Parquet, or ROOT files, or they may be views into memory buffers like Arrow.

Consider this monstrosity:

import awkward
array = awkward.fromiter([[1.1, 2.2, None, 3.3, None],
                          [4.4, [5.5]],
                          [{"x": 6, "y": {"z": 7}}, None, {"x": 8, "y": {"z": 9}}]
                         ])

It's a list of lists; the first contains numbers and None, the second contains a sub-sub-list, and the third defines nested records. If we print this out, we see that it is called a JaggedArray:

array
# returns <JaggedArray [[1.1 2.2 None 3.3 None] [4.4 [5.5]] [<Row 0> None <Row 1>]] at 79093e598f98>

and we get the full Python structure back by calling array.tolist():

array.tolist()
# returns [[1.1, 2.2, None, 3.3, None],
#          [4.4, [5.5]],
#          [{'x': 6, 'y': {'z': 7}}, None, {'x': 8, 'y': {'z': 9}}]]

But we can also manipulate it as though it were a Numpy array. We can, for instance, take the first two elements of each sub-list (slicing the second dimension):

array[:, :2]
# returns <JaggedArray [[1.1 2.2] [4.4 [5.5]] [<Row 0> None]] at 79093e5ab080>

or the last two:

array[:, -2:]
# returns <JaggedArray [[3.3 None] [4.4 [5.5]] [None <Row 1>]] at 79093e5ab3c8>

Internally, the data has been rearranged into a columnar form, with all values at a given level of hierarchy in the same array. Numpy-like slicing, masking, and fancy indexing are translated into Numpy operations on these internal arrays: they are not implemented with Python for loops!
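To make the columnar idea concrete, here is a minimal sketch in plain Numpy. It is not awkward-array's internal code: the helper names `to_columnar` and `first_n` are hypothetical, and the layout (one flat `content` array plus an `offsets` array marking sub-list boundaries) is an assumption chosen for illustration. It shows how "take the first two elements of each sub-list" can become a handful of whole-array operations.

```python
import numpy as np

def to_columnar(lists):
    """Flatten a list of lists of numbers into (content, offsets).

    Sub-list i occupies content[offsets[i]:offsets[i + 1]].
    The conversion itself loops in Python, but it happens only once.
    """
    counts = np.array([len(sub) for sub in lists], dtype=np.int64)
    offsets = np.zeros(len(lists) + 1, dtype=np.int64)
    np.cumsum(counts, out=offsets[1:])
    content = np.array([v for sub in lists for v in sub], dtype=np.float64)
    return content, offsets

def first_n(content, offsets, n):
    """Keep at most the first n items of every sub-list, with no per-item Python loop."""
    starts, stops = offsets[:-1], offsets[1:]
    counts = stops - starts
    # Position of each content item inside its own sub-list.
    local = np.arange(len(content)) - np.repeat(starts, counts)
    keep = local < n
    new_offsets = np.zeros(len(offsets), dtype=np.int64)
    np.cumsum(np.minimum(counts, n), out=new_offsets[1:])
    return content[keep], new_offsets

content, offsets = to_columnar([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
sliced_content, sliced_offsets = first_n(content, offsets, 2)
# sliced_content holds 1.1, 2.2, 4.4, 5.5 and sliced_offsets holds 0, 2, 2, 4
```

The slice never visits sub-lists one at a time: the boundaries live in `offsets`, so the work is a mask on `content` and a cumulative sum.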

To see some of this structure, ask for the content of the array:

array.content
# returns <IndexedMaskedArray [1.1 2.2 None ... <Row 0> None <Row 1>] at 79093e598ef0>

Notice that the boundaries between sub-lists are gone: they exist only at the JaggedArray level. This IndexedMaskedArray level handles the None values in the data. If we dig further, we'll find a UnionArray to handle the mixture of sub-lists and sub-sub-lists and record structures. If we dig deeply enough, we'll find the numerical data:

array.content.content.contents[0]
# returns array([1.1, 2.2, 3.3, 4.4])
array.content.content.contents[1].content
# returns array([5.5])
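The way a separate layer can carry the None values is easy to imitate with two Numpy arrays. The sketch below is an assumption-laden simplification, not the library's implementation: it covers only the first sub-list of our example, it assumes a convention where a negative entry in `index` means "missing", and the function names `masked_tolist` and `masked_add` are hypothetical.

```python
import numpy as np

# Assumed layout: -1 in `index` marks a missing value;
# every other entry is a position in `content`.
content = np.array([1.1, 2.2, 3.3])
index = np.array([0, 1, -1, 2, -1])

def masked_tolist(index, content):
    """Rebuild the Python list, inserting None wherever index is negative."""
    return [None if i < 0 else float(content[i]) for i in index]

def masked_add(index, content, value):
    """Add to every present value: only `content` changes, `index` is reused as-is."""
    return index, content + value

print(masked_tolist(index, content))
# prints [1.1, 2.2, None, 3.3, None]
```

Because the numerical data contains no placeholders for None, arithmetic on `content` never needs to test for missing values.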

Perhaps most importantly, Numpy's universal functions (operations that apply to every element in an array) can be used on our array. This, too, goes straight to the columnar data and preserves structure.

array + 100
# returns <JaggedArray [[101.1 102.2 None 103.3 None]
#                       [104.4 [105.5]]
#                       [<Row 0> None <Row 1>]] at 724509ffe2e8>

(array + 100).tolist()
# returns [[101.1, 102.2, None, 103.3, None],
#          [104.4, [105.5]],
#          [{'x': 106, 'y': {'z': 107}}, None, {'x': 108, 'y': {'z': 109}}]]

numpy.sin(array)
# returns <JaggedArray [[0.8912073600614354 0.8084964038195901 None -0.1577456941432482 None]
#                       [-0.951602073889516 [-0.70554033]]
#                       [<Row 0> None <Row 1>]] at 70a40c3a61d0>
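One way a container can take part in Numpy's universal functions is Numpy's `__array_ufunc__` protocol. The toy class below is a sketch under that assumption, not awkward-array's own code: `TinyJagged` is a hypothetical name, it handles only flat numerical content, and it assumes any other jagged operand shares the same offsets. It shows how the function can run once on the flat data while the structure is carried over unchanged.

```python
import numpy as np

class TinyJagged:
    """Hypothetical stand-in for a jagged array: flat content plus offsets."""

    def __init__(self, content, offsets):
        self.content = np.asarray(content)
        self.offsets = np.asarray(offsets)

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        if method != "__call__":
            return NotImplemented
        # Swap every TinyJagged input for its flat content, then call the ufunc once.
        flat = [x.content if isinstance(x, TinyJagged) else x for x in inputs]
        return TinyJagged(ufunc(*flat, **kwargs), self.offsets)

    def __add__(self, other):
        # Route the + operator through the same ufunc machinery.
        return np.add(self, other)

    def tolist(self):
        return [self.content[a:b].tolist()
                for a, b in zip(self.offsets[:-1], self.offsets[1:])]

j = TinyJagged([1.0, 2.0, 3.0], [0, 2, 2, 3])
print((j + 100).tolist())
# prints [[101.0, 102.0], [], [103.0]]
sines = np.sin(j)  # also a TinyJagged, with the same offsets
```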

This can do more than match the speed of compiled code: it can exceed the speed of compiled code that operates on non-columnar data, because the operation may be vectorized on awkward-array's underlying columnar arrays.

(To do: performance example to substantiate that claim.)

Installation

Install awkward-array like any other Python package:

pip install awkward

or similar (use sudo, --user, virtualenv, or pip-in-conda if you wish).

Strict dependencies:

Recommended dependencies:

  • pyarrow to view Arrow and Parquet data as awkward-arrays
  • h5py to read and write awkward-arrays in HDF5 files

(To do: integration with Dask, Pandas, and Numba.)

Tutorial

Table of contents:

(Parquet exoplanets is in the serialization section.)

Interactive tutorial

(...)

JSON log data processing example

Jaggedness

Record structure

Heterogeneous arrays

Masking

Cross-references

Class instances and methods

Indirection

Sparseness

Non-contiguousness

Laziness

Serialization, reading and writing files

Jagged Lorentz vector arrays; Z peak

Particle isolation cuts

Generator/reconstructed matching

awkward-array's People

Contributors

benkrikler, dumbmachine, guitargeek, henryiii, jayd-1234, jpivarski, nsmith-, reikdas


awkward-array's Issues

Error while loading JaggedArrayNumba object.

I use the following functions to store and load awkward arrays.

import random

import h5py
import awkward
import awkward_numba

def dataset_generator():
    h5file = h5py.File("dataset{}.hdf5".format(random.randint(1000,10000000)),"w")
    awkwd = awkward.hdf5(h5file)
    for i in range(15):
        jagged_array = awkward_numba.JaggedArray.fromiter([[random.randint(1,1000) for _ in range(random.randint(-1,10))]for _ in range(random.randint(-1,8))])
        awkwd['jagged_array_{}'.format(i)] = jagged_array
        print('jagged_array_{}'.format(i),jagged_array)
    return h5file


def dataset_reader(h5file,numba=True):
    awkwd = awkward_numba.hdf5(h5file)
    for key in h5file.keys():
        if numba:
            min = awkward_numba.JaggedArray._argminmax_general(awkwd[(key)], True)
        else:
            min = awkward_numba.JaggedArray._argminmax_general_native(awkwd[(key)], True)
        print(key,": ",awkwd[(key)]," type: ",type(awkwd[(key)]))
        print("\n argminmax",awkwd[(key)]._argminmax_general(True))

The functions work without any errors when I use them to store JaggedArrays. But with JaggedArrayNumba, they first give the following error:

RuntimeError: callable not in whitelist; add it by passing a whitelist argument:
    whitelist = awkward.persist.whitelist + [['awkward', 'JaggedArrayNumba', 'fromcounts']]

So next I added the above to the whitelist, after which running the same code gives the following error:

AttributeError: module 'awkward' has no attribute 'JaggedArrayNumba'
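To illustrate how the two errors could be related, here is a sketch of a whitelist-based lookup. It is an assumption about the mechanism, not the code in awkward.persist: it supposes that each whitelist entry is a module name followed by attribute names, as the entry in the error message suggests, and the function name `resolve` is hypothetical. The example uses the standard library's `collections` module so that it runs without awkward installed.

```python
import importlib

def resolve(spec, whitelist):
    """Look up the callable named by [module, attr, attr, ...] if it is whitelisted."""
    if spec not in whitelist:
        raise RuntimeError("callable not in whitelist: {}".format(spec))
    obj = importlib.import_module(spec[0])
    for name in spec[1:]:
        # Raises AttributeError when the module (or class) lacks this name.
        obj = getattr(obj, name)
    return obj

whitelist = [["collections", "OrderedDict", "fromkeys"]]
make = resolve(["collections", "OrderedDict", "fromkeys"], whitelist)

# A whitelisted entry that names something the module does not contain
# passes the whitelist check and then fails in getattr:
bad = ["collections", "NoSuchClass", "fromkeys"]
try:
    resolve(bad, whitelist + [bad])
except AttributeError as err:
    print("lookup failed:", err)
```

If the lookup works this way, adding an entry to the whitelist only gets past the first check. The second error would then be consistent with the entry naming the `awkward` module, which has no `JaggedArrayNumba` attribute.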
