
awkward-0.x's Introduction

scikit-hep: metapackage for Scikit-HEP


Project info

The Scikit-HEP project is a community-driven and community-oriented project with the aim of providing Particle Physics at large with an ecosystem for data analysis in Python embracing all major topics involved in a physicist's work. The project started in Autumn 2016 and its packages are actively developed and maintained.

It is not just about providing core and common tools for the community. It is also about improving the interoperability between HEP tools and the Big Data scientific ecosystem in Python, and about improving the discoverability of utility packages and projects.

As far as the project's overall structure is concerned, it should be seen as a toolset rather than a toolkit.

Getting in touch

There are various ways to get in touch with project admins and/or users and developers.

scikit-hep package

scikit-hep is a metapackage for the Scikit-HEP project.

Installation

You can install this metapackage from PyPI with `pip`:

python -m pip install scikit-hep

or you can use Conda through conda-forge:

conda install -c conda-forge scikit-hep

All the normal best-practices for Python apply; you should be in a virtual environment, etc.

Package version and dependencies

Please check the setup.cfg and requirements.txt files for the list of Python versions supported and the list of Scikit-HEP project packages and dependencies included, respectively.

For any installation of scikit-hep, the following displays the actual versions of all installed Scikit-HEP packages and dependencies, for example:

>>> import skhep
>>> skhep.show_versions()

System:
    python: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0]
executable: /srv/conda/envs/notebook/bin/python
   machine: Linux-5.15.0-72-generic-x86_64-with-glibc2.27

Python dependencies:
       pip: 23.1.2
     numpy: 1.24.3
     scipy: 1.10.1
    pandas: 2.0.2
matplotlib: 3.7.1

Scikit-HEP package version and dependencies:
        awkward: 2.2.2
boost_histogram: 1.3.2
  decaylanguage: 0.15.3
       hepstats: 0.6.1
       hepunits: 2.3.2
           hist: 2.6.3
     histoprint: 2.4.0
        iminuit: 2.21.3
         mplhep: 0.3.28
       particle: 0.22.0
          pylhe: 0.6.0
       resample: 1.6.0
          skhep: 2023.06.09
         uproot: 5.0.8
         vector: 1.0.0

Note on the versioning system:

This package uses Calendar Versioning (CalVer).

awkward-0.x's People

Contributors

benkrikler, bfis, cescott, douglasdavis, eduardo-rodrigues, escottc, guitargeek, henryiii, jayd-1234, jpivarski, kgizdov, lgray, masonproffitt, nollde, nsmith-, pfackeldey, reikdas



awkward-0.x's Issues

Jagged array reductions fail after masking on first dimension

I've noticed that when using a boolean mask over the first dimension of a jagged array, the resulting jagged array cannot be "reduced" over the second dimension. This seems to be true for prod, sum, count_nonzero, and min, at least.

Here's the code to reproduce this issue:

>>> import numpy as np

>>> import awkward

>>> a = awkward.JaggedArray([0, 2, 5], [2, 5, 6], np.random.random(6))

>>> a
<JaggedArray [[0.77199625 0.08255843] [0.78680289 0.8103746  0.7055252 ] [0.9514227]] at 7f807bd19a90>

>>> a.prod()
array([0.0637348 , 0.44984645, 0.9514227 ])

>>> a[[False, False, True]].prod()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ben/.local/lib/python2.7/site-packages/awkward/array/jagged.py", line 881, in prod
    out[:len(nonterminal)] = awkward.util.numpy.multiply.reduceat(content[self._starts[0]:self._stops[-1]], nonterminal)
IndexError: index 5 out-of-bounds in multiply.reduceat [0, 1)

The workaround seems to be to swap the order of the masking:

>>> a.prod()[[False, False, True]]
array([0.9514227])

but I feel like this is not what a user would expect.
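A loop-based sketch of what a mask-safe reduction could look like, computing each sublist's product directly from its own start/stop pair rather than via a single `reduceat` over the whole content (the helper name `jagged_prod` is hypothetical):

```python
import numpy as np

# Hypothetical helper: per-sublist products from explicit starts/stops,
# robust to sublists selected by a boolean mask on the first dimension.
def jagged_prod(starts, stops, content):
    out = np.ones(len(starts), dtype=content.dtype)
    for i, (a, b) in enumerate(zip(starts, stops)):
        out[i] = content[a:b].prod()  # prod of an empty slice is 1
    return out

starts = np.array([0, 2, 5])
stops = np.array([2, 5, 6])
content = np.array([2.0, 3.0, 5.0, 7.0, 11.0, 13.0])

mask = np.array([False, False, True])
print(jagged_prod(starts[mask], stops[mask], content))  # [13.]
```

This is non-vectorized and therefore slow for large arrays; it only illustrates the semantics a fixed `prod` would need.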

Implement nested parquet writing

Since one can construct an awkward array from a list or other iterable of lists/dicts (or indeed an arrow block), I am wondering how difficult it would be to build a function which works out the various repetition/definition levels needed to store the data in parquet format. I am assuming this could even be numba-jitted.

(this is pure curiosity on my part)
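A minimal sketch of the repetition-level computation for a single level of list nesting, assuming no null values (so definition levels are constant); real Parquet writing would also have to emit definition-level entries for empty lists, which this ignores:

```python
import numpy as np

# Repetition level 0 marks the first value of each list; 1 marks a value
# that continues the current list. Assumes no nulls and ignores the
# definition-level entries that empty lists would require.
def repetition_levels(counts):
    counts = np.asarray(counts)
    total = counts.sum()
    rep = np.ones(total, dtype=np.int8)
    starts = np.concatenate([[0], np.cumsum(counts)[:-1]])
    rep[starts[counts > 0]] = 0  # empty lists contribute no values here
    return rep

print(repetition_levels([2, 3, 1]))  # [0 1 0 1 1 0]
```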

accessing columns with JaggedArray elements without a for loop

I have data in awkward array that represents digitized pulses, and it'd be nice to have an easy way to get a container with all these traces as elements.

So, a call like

data["traces"][:, start_idx:final_idx]

would return an nxm array where n is the number of traces in the data and m is final_idx - start_idx.

Or perhaps if a call like

data["traces"][:]

returned a jagged array?

I should clarify, though, that all my data has traces that are the same length. My second suggestion is motivated by the extreme laziness of not wanting to write extra code to get final_idx.
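When all traces have the same length, as described above, the jagged content can be viewed as a regular n x m array with a plain reshape (awkward 0.x exposes this as `.regular()`); a NumPy-only sketch:

```python
import numpy as np

# 4 traces of 3 samples each, stored as flat jagged content.
content = np.arange(12.0)
n, m = 4, 3
traces = content.reshape(n, m)

# The slice from the issue then works as on any 2-D array.
start_idx, final_idx = 1, 3
window = traces[:, start_idx:final_idx]
print(window.shape)  # (4, 2)
```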

Feature request: astype

New feature request: changing the dtype of JaggedArrays is a bit clunky:

jarr.content = jarr.content.astype(np.uint16)

It would be nice to have:

jarr = jarr.astype(np.uint16)

(the savings in overall code lines is more noticeable when reading and immediately converting from an uproot file. Having a dtype= argument in uproot's .array() is another option)
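A minimal sketch of the requested behavior as a free function over the `(starts, stops, content)` triple, returning a new triple instead of mutating the content in place (`jagged_astype` is a hypothetical name, not part of awkward):

```python
import numpy as np

# Hypothetical astype for a jagged array represented as (starts, stops, content):
# the offsets are unchanged; only the content dtype converts.
def jagged_astype(starts, stops, content, dtype):
    return starts, stops, content.astype(dtype)

starts = np.array([0, 2])
stops = np.array([2, 4])
content = np.array([1.0, 2.0, 3.0, 4.0])

_, _, converted = jagged_astype(starts, stops, content, np.uint16)
print(converted.dtype)  # uint16
```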

test_union_ufunc is flaky

======================================================================
ERROR: test_union_ufunc (tests.test_union.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/build/python-awkward/src/awkward-array-0.2.0/tests/test_union.py", line 61, in test_union_ufunc
    b = UnionArray.fromtags([1, 1, 0, 1, 0], [[10.1, 20.2], [123, 456, 789]])
  File "/build/python-awkward/src/awkward-array-0.2.0/awkward/array/union.py", line 47, in fromtags
    out.index = awkward.util.numpy.empty(out._tags.shape, dtype=awkward.util.INDEXTYPE)
  File "/build/python-awkward/src/awkward-array-0.2.0/awkward/array/union.py", line 114, in index
    raise ValueError("index must be a non-negative array")
ValueError: index must be a non-negative array

----------------------------------------------------------------------

The error went away after a simple retry. There seems to be some flakiness here.

Documentation on Jagged array

The docs attached to JaggedArray seem to be the ArrayLike mixin docs instead; even a placeholder doc would probably be more useful than the wrong docs.

A mention of the other ways to make a JaggedArray might be useful.

Need a low-level type infrastructure

Low-level type information (exactly which array types are composed with each other) is required to compile functions with VirtualArrays (without simply materializing all of them). The high-level types don't have enough information. It is important to be able to compile VirtualArrays for iterating over large datasets, Parquet or ROOT, without loading them entirely into memory.

The low-level types should mirror the high-level types. Perhaps they should be named "structure". Dumping structure to the terminal, with or without array values, could be useful for diagnostics and introductions to the concepts.

Types need a suite of unit tests

Especially when coming from arrays. The Type infrastructure was touched in the 0.3.0 → 0.4.0 transition and I want to be certain that type.to vs type is consistent everywhere.

Reading hdf5 and older schemas

I updated AwkwardArray yesterday and now cannot load my awkward arrays saved in HDF5 files.

Here is the error I get:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-ff877d229c39> in <module>()
----> 1 PV = collect_truth('data/Oct03_20K_val.h5', pvs=True)
      2 print('PV.n.shape =    ',  PV.n.shape)
      3 print('PV.n[0].shape = ', *PV.n[0].shape)
      4 print('PV.x[0] =       ', *PV.x[0])
      5 print('PV.y[0] =       ', *PV.y[0])

~/git/ml/pv-finder/notebooks/model/collectdata.py in collect_truth(pvs, *files)
     43         with Timer(msg), h5py.File(XY_file, mode='r') as XY:
     44             afile = awkward.hdf5(XY)
---> 45             x_list.append(afile[f"{p}v_loc_x"])
     46             y_list.append(afile[f"{p}v_loc_y"])
     47             z_list.append(afile[f"{p}v_loc"])

/opt/anaconda/lib/python3.6/site-packages/awkward/persist.py in __getitem__(self, where)
    648 
    649     def __getitem__(self, where):
--> 650         return deserialize(self._group, name=where + self.options["schemasuffix"], whitelist=self.options["whitelist"], cache=self.options["cache"])
    651 
    652     def __setitem__(self, where, what):

/opt/anaconda/lib/python3.6/site-packages/awkward/persist.py in deserialize(storage, name, whitelist, cache)
    455             raise ValueError("unrecognized JSON object: {0}".format(repr(schema)))
    456 
--> 457     return unfill(schema["schema"])
    458 
    459 def keys(storage, name="", subschemas=True):

/opt/anaconda/lib/python3.6/site-packages/awkward/persist.py in unfill(schema)
    394             if "call" in schema and isinstance(schema["call"], list) and len(schema["call"]) > 0:
    395                 gen = spec2function(schema["call"], whitelist=whitelist)
--> 396                 args = [unfill(x) for x in schema.get("args", [])]
    397 
    398                 kwargs = {}

/opt/anaconda/lib/python3.6/site-packages/awkward/persist.py in <listcomp>(.0)
    394             if "call" in schema and isinstance(schema["call"], list) and len(schema["call"]) > 0:
    395                 gen = spec2function(schema["call"], whitelist=whitelist)
--> 396                 args = [unfill(x) for x in schema.get("args", [])]
    397 
    398                 kwargs = {}

/opt/anaconda/lib/python3.6/site-packages/awkward/persist.py in unfill(schema)
    393         if isinstance(schema, dict):
    394             if "call" in schema and isinstance(schema["call"], list) and len(schema["call"]) > 0:
--> 395                 gen = spec2function(schema["call"], whitelist=whitelist)
    396                 args = [unfill(x) for x in schema.get("args", [])]
    397 

/opt/anaconda/lib/python3.6/site-packages/awkward/persist.py in spec2function(obj, whitelist)
     79             break
     80     else:
---> 81         raise RuntimeError("callable not in whitelist; add it by passing a whitelist argument:\n\n    whitelist = awkward.persist.whitelist + [{0}]".format(obj))
     82     return gen
     83 

RuntimeError: callable not in whitelist; add it by passing a whitelist argument:

    whitelist = awkward.persist.whitelist + [['numpy', 'frombuffer']]
And, here's the schema in the file:
{
  "awkward": "0.4.4",
  "schema": {
    "id": 0,
    "call": [
      "awkward",
      "JaggedArray",
      "fromcounts"
    ],
    "args": [
      {
        "id": 1,
        "call": [
          "numpy",
          "frombuffer"
        ],
        "args": [
          {
            "call": [
              "zlib",
              "decompress"
            ],
            "args": [
              {
                "read": "1"
              }
            ]
          },
          {
            "dtype": "int64"
          },
          {
            "json": 20000
          }
        ]
      },
      {
        "id": 2,
        "call": [
          "numpy",
          "frombuffer"
        ],
        "args": [
          {
            "read": "2"
          },
          {
            "dtype": "float32"
          },
          {
            "json": 207850
          }
        ]
      }
    ]
  },
  "prefix": "pv_loc_x/"
}

Error when taking max of empty array

awkward-array gives an error if an array is completely empty and you try to take the maximum.

Minimum code to reproduce the problem:
test = awkward.fromiter([])
print test.max()

It breaks with the following error. The error can be avoided by first checking if test.size > 0.

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      3 if test.size <= 0:
      4     print "okay"
----> 5 print test.max()

/Users/ahall/.local/lib/python2.7/site-packages/numpy/core/_methods.pyc in _amax(a, axis, out, keepdims, initial)
     26 def _amax(a, axis=None, out=None, keepdims=False,
     27           initial=_NoValue):
---> 28     return umr_maximum(a, axis, None, out, keepdims, initial)
     29
     30 def _amin(a, axis=None, out=None, keepdims=False,

ValueError: zero-size array to reduction operation maximum which has no identity
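This matches NumPy's own behavior for zero-size reductions, which have no identity for `max`. Two common guards, sketched in plain NumPy:

```python
import numpy as np

a = np.array([])

# Guard the reduction explicitly...
result = a.max() if a.size > 0 else None

# ...or, on NumPy >= 1.15, supply an identity so the reduction is defined
# even for zero-size input.
result2 = np.max(a, initial=-np.inf)
print(result, result2)  # None -inf
```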

awkward-array of TLorentzVectors from numpy arrays

Maybe I missed an example, but it would be nice to have a way to create an awkward-array of ROOT objects (TLorentzVector is the obvious one) from several numpy arrays, i.e. no IO operations with ROOT, just using the TLorentzVector constructor and numpy.

Is this trivial to do? All the examples I've seen so far assume there are already TLorentzVectors stored in ROOT objects. In ATLAS we often have flatter data structures, i.e. object-wise ntuples stored as px, py, pz, e, and an event-wise ntuple which points to the first and last object in each event.
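Independent of any particular constructor, the kinematics such an array would wrap are ordinary vectorized NumPy expressions over the flat component arrays; a small sketch (the example values are made up):

```python
import numpy as np

# Flat object-wise components, as in an ATLAS-style ntuple.
px = np.array([1.0, 0.0])
py = np.array([0.0, 2.0])
pz = np.array([0.0, 0.0])
e  = np.array([2.0, 3.0])

# Derived quantities are elementwise expressions: no ROOT I/O needed.
pt   = np.hypot(px, py)
mass = np.sqrt(e**2 - px**2 - py**2 - pz**2)
print(pt, mass)
```

Grouping per-event via first/last object indices is then the same starts/stops pattern used throughout these issues.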

Still wrong return values when masking (see #50)

I saw the fix in #50 but I believe that it has not been fixed completely. The following still occurs in 0.8.4:

from awkward import JaggedArray

array = JaggedArray.fromiter([[1, 2, 3]])
empty = array[False]

print(empty.content)
# array([1, 2, 3])
print(empty.flatten())
# array([], dtype=int64)

while we also have

print(array[array != 1].content)
# array([2, 3])
print(array[array != 1].flatten())
# array([2, 3])

Should this be the behavior we expect?

Interfacing between pandas Muiltiindex dataframe and JaggedArrays

I'm finding some of the functions such as cross very useful and the conversions into pandas Multiindex dataframes (with levels "entry" and "subentry"). However, I'm struggling in reversing the conversion to pandas. What is the easiest way to convert a Multiindex to start/stops?

The use case is that I have a multiindex dataframe which I've skimmed. I save the multiindex to use it later as needed. Later I want to use the cross function so I need to apply the multiindex on a JaggedArray to be usable with cross. I could convert the cross into a pandas dataframe and then apply the multiindex, however, the skimming reduces the size of the dataframe considerably and the cross function will create an unnecessarily large dataframe.
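Assuming the "entry" level labels are sorted integers (as in uproot's DataFrame output), counts and starts/stops can be recovered from the index labels with `np.bincount` and a cumulative sum; a NumPy-only sketch:

```python
import numpy as np

# Labels of the "entry" level, e.g. df.index.get_level_values("entry").
# Entries skimmed away entirely (here entry 1) simply get count 0.
entry = np.array([0, 0, 2, 2, 2, 3])
n_entries = 4

counts = np.bincount(entry, minlength=n_entries)
offsets = np.concatenate([[0], np.cumsum(counts)])
starts, stops = offsets[:-1], offsets[1:]
print(counts, starts, stops)
```

From `counts`, the jagged array can then be rebuilt with `JaggedArray.fromcounts(counts, values)` and used with `cross` directly.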

Tuples of lists of indices does not work as in numpy

Making an issue for posterity after showing this to Jim in the hallway.
This is in awkward 0.5.3.

from awkward import JaggedArray
test = JaggedArray.fromcounts([1,2,1],[1,2,3,4])
test[1,0]

results in 2, as expected.

test[[1],[0]]

results in:

array([[1],
       [2],
       [4]])

Which is incorrect compared to its equivalent for a 3x2 array in numpy (repeating values below to fill out dimension).

import numpy as np
temp = np.array([[1,1],[2,3],[4,4]])
temp[[1],[0]]

results in:

array([2])

and

temp[[1,0],[0,0]]

results in:

array([2, 1])

and

test[[1,0],[0,0]]

results in:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/anaconda2/lib/python2.7/site-packages/awkward/array/jagged.py", line 608, in __getitem__
    index = (index.reshape(-1, len(head)) + self._starts.reshape(-1, 1)).reshape(-1)
ValueError: operands could not be broadcast together with shapes (2,2) (3,1) 
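The NumPy-equivalent semantics can be emulated on the jagged layout by offsetting each column index by the start of its row's sublist; a sketch (the helper `jagged_pick` is hypothetical, and it assumes every requested column index is in range for its row):

```python
import numpy as np

# Layout of JaggedArray.fromcounts([1, 2, 1], [1, 2, 3, 4]).
starts = np.array([0, 1, 3])
content = np.array([1, 2, 3, 4])

# Pick element cols[k] of sublist rows[k], matching numpy's
# temp[rows, cols] semantics for a regular 2-D array.
def jagged_pick(starts, content, rows, cols):
    return content[starts[np.asarray(rows)] + np.asarray(cols)]

print(jagged_pick(starts, content, [1], [0]))        # [2]
print(jagged_pick(starts, content, [1, 0], [0, 0]))  # [2 1]
```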

JaggedArray.pairs fails for n>=24

import numpy as np
import awkward

nmax = 35
counts = np.arange(nmax)
values = np.ones(nmax*(nmax-1)//2)
awk = awkward.JaggedArray.fromcounts(counts, values)
cands = awk.pairs(same=False)

cdiff = cands["0"].counts - counts*(counts-1)//2
print("Diff:", cdiff[cdiff!=0.])
print("Expected combinations:", (counts*(counts-1)//2)[cdiff!=0.])
print("Actual combinations:", cands["0"].counts[cdiff!=0.])

results in:

Diff: [-256 -256 -256 -256 -256 -256 -256 -256 -256 -512 -512]
Expected combinations: [276 300 325 351 378 406 435 465 496 528 561]
Actual combinations: [ 20  44  69  95 122 150 179 209 240  16  49]

Note: python 3

numpy functions on jagged arrays don't produce compatible arrays

I wonder if this is my misunderstanding of the API, but I'm running in to the following on a CMS NanoAOD file with a varying number of muons per event:

mu_pt = arrs[b'Muon_pt']
mu_phi = arrs[b'Muon_phi']

#select events with exactly two muons with pt>20
s = (mu_pt>20)
dimuon = (s.counts == 2) & (s.sum()==2)

#select the muon momentum components for such dimuon events
mu_pt_sel = mu_pt[dimuon]
mu_phi_sel = mu_phi[dimuon]

Now when computing the np.sin function on the selected subarray, the jaggedness (starts/stops structure) of the subarray changes.

mu_phi_sel.starts[:10]
array([ 0,  2,  4,  9, 11, 13, 16, 18, 20, 22])

np.sin(mu_phi_sel).starts[:10]
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

so I can't do mu_pt_sel * np.sin(mu_phi_sel) as would seem natural. Is this expected?

I can of course do mu_pt_sel.regular() * np.sin(mu_phi_sel).regular() but this seems to break the API.

awkward-arrays → Arrow buffers

The forward direction (Arrow → awkward, as zero-copy views) is already implemented, but we'll soon want the reverse (which can't be zero-copy). Most awkward concepts have some mapping onto Arrow, but it's not complete. At least be sure that Arrow → awkward → Arrow round trips are possible.

Cannot perform nested ufunc math on jagged arrays

In awkward 0.5.2 and 0.5.3 I encounter the following problem if I do not flatten the jagged array:

import numpy as np
import awkward
import uproot_methods

np.random.seed(12345)
nrows = 1000
counts = np.minimum(np.random.exponential(0.5, size=nrows).astype(int), 20)
offsets = np.cumsum(counts)

px = np.random.normal(loc=20.0,scale=5.0,size=np.sum(counts))
py = np.random.normal(loc=20.0,scale=5.0,size=np.sum(counts))
pz = np.random.normal(loc=0, scale=55, size=np.sum(counts))
m_pi = np.full_like(px,fill_value=0.135)
energy = np.sqrt(px*px + py*py + pz*pz + m_pi*m_pi)

mom = awkward.JaggedArray.fromoffsets(offsets,awkward.Table(x = px,
                                                            y = py,
                                                            z = pz,
                                                            E = energy))

pzs_parts = mom['z']
pts_parts = np.sqrt(mom['x']**2 + mom['y']**2)

assert( (mom['z'].counts == pts_parts.counts).all() )
assert( (pzs_parts.counts == pts_parts.counts).all() )

mom['z'] + pzs_parts
pts_parts + pts_parts
awkward.JaggedArray.fromoffsets(offsets,mom['z'].flatten() + pts_parts.flatten())
mom['z'] + pts_parts

fails with

Traceback (most recent call last):
  File "jagged_math_fail.py", line 30, in <module>
    mom['z'] + pts_parts
  File "/anaconda2/lib/python2.7/site-packages/numpy/lib/mixins.py", line 25, in func
    return ufunc(self, other)
  File "/anaconda2/lib/python2.7/site-packages/awkward/array/jagged.py", line 727, in __array_ufunc__
    inputs[i] = inputs[i]._tojagged(starts, stops, copy=False)
  File "/anaconda2/lib/python2.7/site-packages/awkward/array/jagged.py", line 706, in _tojagged
    out = self.copy(starts=starts, stops=stops, content=self._content[index])
IndexError: index 155 is out of bounds for axis 1 with size 155

This causes evaluation of pseudorapidity and other nested operations to fail when operating directly on the jagged array!

Note: this example uses addition, but the same error is encountered for multiplication, division, etc.

Max is broken

You can no longer calculate the maximum of an integer JaggedArray. This snippet illustrates the problem:

import awkward
awkward.JaggedArray.fromiter([[1,2,3]]).max()

Again, this used to work but now is broken.

(I'm actually interested in .max().max(), so will use content.max() for now)

Add a `fromsubentry` helper method to JaggedArray

Would it be possible to add a helper class method to awkward.JaggedArray similar to the other from* methods that receives a list of Subentry indices for each value in the array and deduces from this the offsets for each event? This would be the inverse operation to that used by uproot to create Pandas DataFrame outputs from JaggedArrays.

Example usage would be:

import numpy as np
from awkward import JaggedArray
contents = np.random.normal(0, 1, 10)
subentries = [0, 1, 0, 0, 0, 1, 2, 0, 1, 0]
array = JaggedArray.fromsubentries(subentries, contents)
print(array.starts, array.stops)

output:

>>> [0, 2, 3, 4, 7, 9], [1, 2, 3, 4, 8, 9]

I think it would be reasonable to assume (or at least, it might be necessary to assume) that subentries are monotonically increasing within an event, so that when the difference between two subentries is negative we have the start of a new event. This means subentries need not be contiguous nor start at 0 (unlike my example) but must be ordered in increasing value within the event.
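A NumPy sketch of the proposed helper, under the assumption stated above that subentries strictly increase within an event, so any non-positive step marks an event boundary (`fromsubentries_offsets` is a hypothetical name):

```python
import numpy as np

# Derive starts/stops from per-value subentry indices: a new event begins
# wherever the subentry sequence fails to increase.
def fromsubentries_offsets(subentries):
    subentries = np.asarray(subentries)
    breaks = np.flatnonzero(np.diff(subentries) <= 0) + 1
    starts = np.concatenate([[0], breaks])
    stops = np.concatenate([breaks, [len(subentries)]])
    return starts, stops

starts, stops = fromsubentries_offsets([0, 1, 0, 0, 0, 1, 2, 0, 1, 0])
print(starts, stops)  # [0 2 3 4 7 9] [ 2  3  4  7  9 10]
```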

Saving a jagged array

Is there a recommended way to save a jagged array in a numpy zip file? This is what I have so far, but internally it makes an object array of lists, which is not ideal. And I actually have several arrays, so I don't want to hand-save three differently named arrays per jagged array if possible:

from awkward import JaggedArray
import numpy as np

ja = JaggedArray([0,4],[4,6],[1,2,3,4,5,6])

from tempfile import TemporaryFile
outfile = TemporaryFile()
np.savez(outfile, ja=[ja.starts, ja.stops, ja.content])
outfile.seek(0)

f = np.load(outfile)

JaggedArray(*f['ja'])

If I directly try to save the jaggedarray, Python oddly hangs up instead of giving an error or working...
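One workaround is to save the three component arrays under derived key names rather than as a single object array, which generalizes to any number of jagged arrays without hand-naming each component; a sketch (the `save_jagged` helper is hypothetical):

```python
import numpy as np
from io import BytesIO

# Collect each jagged array's components under derived npz keys.
def save_jagged(npz_dict, name, starts, stops, content):
    npz_dict[name + "_starts"] = starts
    npz_dict[name + "_stops"] = stops
    npz_dict[name + "_content"] = content

arrays = {}
save_jagged(arrays, "ja",
            np.array([0, 4]), np.array([4, 6]), np.array([1, 2, 3, 4, 5, 6]))

buf = BytesIO()  # stands in for a file on disk
np.savez(buf, **arrays)
buf.seek(0)

f = np.load(buf)
starts, stops, content = f["ja_starts"], f["ja_stops"], f["ja_content"]
print(starts, stops, content)
```

awkward 0.x's own persistence layer (`awkward.persist`, seen in the HDF5 issue above) is the other option.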

Conversion of JaggedArray to numpy array broken

It seems that things broke recently as

np.linalg.norm(<a jagged array>)

No longer works.

The problem can be reproduced by

import numpy as np
from awkward import *
from awkward.type import *
a = JaggedArray([0, 3, 3, 5], [3, 3, 5, 10], [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])
np.asarray(a)

causes

<snip>
  File "/home/phxlk/.local/lib/python2.7/site-packages/awkward/array/base.py", line 39, in __array__
    return awkward.util.numpy.array(self, *args, **kwargs)

RuntimeError: maximum recursion depth exceeded while calling a Python object

EDIT: version 0.1.0 reports

  File "~/.local/lib/python2.7/site-packages/numpy/linalg/linalg.py", line 2240, in norm
    x = asarray(x)
  File "~/.local/lib/python2.7/site-packages/numpy/core/numeric.py", line 492, in asarray
    return array(a, dtype, copy=False, order=order)
  File "~/.local/lib/python2.7/site-packages/awkward/array/base.py", line 35, in __array__
    raise Exception("{0} {1}".format(args, kwargs))
Exception: () {}

Sorting

Hi!

I would need for my analysis the ability to sort the individual entries of a JaggedArray. Is there already such a functionality? Do you have any plans on implementing it or some ideas how to proceed?

Right now I'm implementing for myself the following workaround:

  • Convert the jagged array to a 2d ndarray with shape (n, max(m_i)), where m_i are the lengths of the individual entries of the jagged array.
  • If an element has fewer sub-elements than max(m_i), the remaining elements of the ndarray will be filled with np.nan
  • I'll sort the ndarray along axis=1; conveniently the nans will still be kept at the end by np.sort.
  • Convert back to jagged array

Of course that is not optimal, but for my case it's ok since the numbers of sub-entries varies within a relatively narrow range.
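The padding workaround described above can be sketched in plain NumPy (`np.sort` places NaNs last along the sorted axis):

```python
import numpy as np

# Jagged array [[3, 1], [], [9, 4, 7]] as starts/stops/content.
starts = np.array([0, 2, 2])
stops = np.array([2, 2, 5])
content = np.array([3.0, 1.0, 9.0, 4.0, 7.0])

# Expand into a NaN-padded 2-D array of shape (n, max(m_i)).
counts = stops - starts
padded = np.full((len(starts), counts.max()), np.nan)
for i, (a, b) in enumerate(zip(starts, stops)):
    padded[i, : b - a] = content[a:b]

# Sort each row; NaN padding stays at the end.
result = np.sort(padded, axis=1)
print(result)
```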

Any plans to develop in this direction? I have to make this work for the table too. Ultimately, the goal is of course to sort a TLorentzVectorArray by transverse momentum.

Oh, and does this functionality of "expanding" the jagged in a ndarray maybe exist already? I can imagine that this might be also a relatively common operation.

Have a nice Sunday!
Jonas

Slow accessor in nested jagged table

With a simple nested table structure such as

import numpy as np
from awkward import JaggedArray, Table

counts = np.random.exponential(2, size=100000).astype(int)
entries = np.sum(counts)
v = np.random.uniform(size=entries*4).reshape(-1,4)

t1 = Table({'var0': v[:,0], 'var1': v[:,1], 'var2': v[:,2]})
t2 = Table({'t1': t1, 'var3': v[:,3]})
j = JaggedArray.fromcounts(counts, t2)

a large performance hit is encountered when accessing elements from the JaggedArray instance vs. the flat array instance:

In [2]: %timeit t2.at.t1.at.var0
4.27 µs ± 32.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [3]: %timeit j.at.t1.at.var0
1.18 ms ± 9.47 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

One can access the underlying content and re-construct the jagged array with a smaller performance hit,

In [4]: %timeit JaggedArray.fromoffsets(j.offsets, j.content.at.t1.at.var0)
135 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Likely, the validation in constructing the JaggedArray is compounding in nested calls. Perhaps for these cases it can be bypassed?

make jagged_array work with numba.jit(nopython=True)

Currently, jagged_array does not work with numba.jit(nopython=true), the latter complains that it does not understand the jagged array type. I don't know what has to be done or how complicated it is to teach numba about jagged arrays, but whenever I have jagged arrays, I also want to accelerate operations on them with numba. For my case, I found a workaround, which could be documented in the README:

import numpy as np
from numba import jit

jagged = ... # has only one hierarchy layer, [[1], [1,2,3], [], ...]

@jit(nopython=True)
def go_fast(starts, stops, content):
    # starts, stops and content are all normal numpy arrays, which are understood by numba
    # do stuff
    ...

go_fast(jagged.starts, jagged.stops, jagged.content)

I have a problem though when I want to return a new jagged array with the same shape as the input. I can easily return a numpy array of the "content", but how do I generate a new jagged array from the content, the starts and the stops?
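One answer is that the jagged structure itself need not change: if the kernel returns new content of the same length, the original starts/stops can be reused to rebuild the result. A sketch of the pattern, with the `@numba.njit` decorator omitted so the example is self-contained plain Python:

```python
import numpy as np

# Kernel over the three flat arrays; would carry @numba.njit in practice.
def kernel(starts, stops, content):
    out = np.empty_like(content)
    for i in range(len(starts)):
        for j in range(starts[i], stops[i]):
            out[j] = content[j] * 2.0  # placeholder per-element work
    return out

starts = np.array([0, 1, 4])
stops = np.array([1, 4, 4])
content = np.array([1.0, 2.0, 3.0, 4.0])

new_content = kernel(starts, stops, content)
# Rebuild with the unchanged offsets: JaggedArray(starts, stops, new_content)
print(new_content)  # [2. 4. 6. 8.]
```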

Filter function on JaggedArrays

From this talk and this commit, I was expecting the filter function to be the recommended method. But in the current awkward-array, I don't get these functions anymore.

I want to apply a series of selections to a jaggedarray. But it seems jag[jag.pt > 30 and jag.pt < 50] etc. are not working as easily as expected. Is there any alternative and easy way to do this? Thanks
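The likely culprit here is Python's `and` keyword, which does not broadcast over arrays; NumPy-style boolean masks (flat or jagged) are combined elementwise with `&`, parenthesized because `&` binds tighter than comparisons. A flat-NumPy sketch:

```python
import numpy as np

pt = np.array([10.0, 35.0, 60.0, 42.0])

# `pt > 30 and pt < 50` raises; combine masks elementwise instead.
mask = (pt > 30) & (pt < 50)
print(pt[mask])  # [35. 42.]
```

If jagged arrays forward comparisons the same way (as other issues here suggest), the equivalent would read `jag[(jag.pt > 30) & (jag.pt < 50)]`.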

Wrong return values when masking?

import awkward  #version 0.5.1

class test(awkward.JaggedArray,):
    def __init__(self,jagged):        
        super(test, self).__init__(jagged.starts,
                                   jagged.stops,
                                   jagged.content)
    @property
    def stuff(self):
        return self['stuff']

a = test(awkward.JaggedArray.fromcounts([2,0,3],awkward.Table(stuff=[1,2,3,4,5])))
mask = [False,False,True]
b = a[mask]
print(type(b),b.stuff)
c = b[(b.stuff > 4)]
d = (b.stuff > 4)
print(type(c),c.stuff)
print(type(d),d)

results in the printouts:

(<class '__main__.test'>, <JaggedArray [[3 4 5]] at 000111547890>)
(<class '__main__.test'>, <JaggedArray [[1]] at 00011e4302d0>)
(<class 'awkward.array.jagged.JaggedArray'>, <JaggedArray [[False False  True]] at 00011e430a50>)

The result of print(type(c),c.stuff), according to my understanding, should be JaggedArray [[5]], so something funny is happening with indices. Otherwise the mask result is correct!

If you use lorentz vector array mixins with this results in an exception when trying to evaluate a function, but I haven't gotten that to repeat in a simple example yet. However, what's shown here is clearly a problem!

Note:

import awkward  #version 0.5.1
a = awkward.JaggedArray.fromcounts([2,0,3],awkward.Table(stuff=[1,2,3,4,5]))
mask = [False,False,True]
b = a[mask]
print(type(b),b['stuff'])
c = b[(b['stuff'] > 4)]
d = (b['stuff'] > 4)
print(type(c),c['stuff'])
print(type(d),d)

also results in:

(<class 'awkward.array.jagged.JaggedArray'>, <JaggedArray [[3 4 5]] at 0001106db190>)
(<class 'awkward.array.jagged.JaggedArray'>, <JaggedArray [[1]] at 00011e430b90>)
(<class 'awkward.array.jagged.JaggedArray'>, <JaggedArray [[False False  True]] at 00011e430590>)

Issue 51 (slowdowns) still alive for TLorentzVectors

Starting from #51 but in 0.5.3 + uproot_methods 0.2.10, the underlying issue seems to still be a problem when using TLorentzVectorArray mixins.

import numpy as np
import awkward
from awkward import JaggedArray, Table
from uproot_methods.classes.TLorentzVector import TLorentzVectorArray, ArrayMethods
JaggedTLorentzVectorArray = awkward.Methods.mixin(ArrayMethods, awkward.JaggedArray)

counts = np.random.exponential(2, size=100000).astype(int)
entries = np.sum(counts)
v = np.random.uniform(size=entries*4).reshape(-1,4)
v[:,3] = np.sqrt(np.sum(v[:,:3]**2))
tlv = TLorentzVectorArray(v[:,0],v[:,1],v[:,2],v[:,3])

t1 = Table({'var0': v[:,0], 'var1': v[:,1], 'var2': v[:,2]})
t2 = Table({'t1': t1, 'var3': v[:,3]})
t3 = Table({'tlv': tlv})
j1 = JaggedArray.fromcounts(counts, t2)
j2 = JaggedArray.fromcounts(counts, t3)

where for primitive types:

%timeit t2.at.t1.at.var0
The slowest run took 39.76 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.09 µs per loop
%timeit JaggedArray.fromoffsets(j1.offsets, j1.content.at.t1.at.var0)
The slowest run took 53.29 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 84.4 µs per loop

Which is good, repacking should be slower!

and for TLorentzVector ArrayMethods:

%timeit j2.at.tlv.eta
100 loops, best of 3: 19.8 ms per loop
%timeit JaggedArray.fromoffsets(j2.offsets, j2.content.at.tlv.eta)
The slowest run took 112.06 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 80.4 µs per loop

Not quite the case!

Proposal: jit-compilation if numba available

Certain basic operations are starting to appear in profiling that could be optimized. The first such example is offsets2parents.
For example, here is a numba re-implementation assuming offsets is monotone (which would be a good flag to track in JaggedArray):

import numpy as np
import awkward
import numba

counts = np.random.randint(8, size=100000)

offsets = awkward.array.jagged.counts2offsets(counts)
%timeit awkward.array.jagged.offsets2parents(offsets)

@numba.njit()
def o2p_fast(offsets):
    out = np.empty(offsets[-1], dtype=offsets.dtype)
    j = 0
    k = -1
    for i in offsets:
        while j < i:
            out[j] = k
            j += 1
        k += 1
    return out

print(np.all(o2p_fast(offsets)==awkward.array.jagged.offsets2parents(offsets)))
%timeit o2p_fast(offsets)

which outputs on my laptop:

7.59 ms ± 372 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
True
726 µs ± 15.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

We could swap these out if numba is importable. A more portable (but laborious) operation would be to write our own ufunc module.

Nested table filtering

Hi,

I'm playing around a bit to understand the syntax, perks and potential pitfalls of the system, so please bear with me if the question is really trivial, but I could not find any documentation.
I have the following minimal dataset:

dataset = upfile['Events'].arrays([
      'Muon_pfIsoId',
      'Muon_pt',
      'Muon_eta',
      'Muon_phi',
      'Muon_mass',
      'Muon_charge',
      'Muon_tightId',
      'Jet_jetId',
      'Jet_pt',
      'Jet_eta',
      'Jet_phi',
      'Jet_mass',
      'Jet_btagDeepFlavB',
      'HLT_IsoMu24',
      'Flag_HBHENoiseFilter',
])

Since I want to keep properties of the same objects together, I build tables and then nest them into a single table that I use as I would a pandas.DataFrame.

muons = Table.named('muons', {i.replace('Muon_', '') : j for i, j in dataset.iteritems() if i.startswith('Muon')})
jets  = Table.named('jets', {i.replace('Jet_', '') : j for i, j in dataset.iteritems() if i.startswith('Jet')})
data = Table({
      'muons' : muons,
      'jets' : jets,
      'HLT_IsoMu24' : dataset['HLT_IsoMu24'],
      'Flag_HBHENoiseFilter' : dataset['Flag_HBHENoiseFilter'],
})

If I cut on an event-wide variable (the trigger), everything works as expected and I get a smaller number of events:

wtrig = data[data['HLT_IsoMu24']]

Creating a mask for the muons also works:

mumask = wtrig['muons']['pt'] > 30

but I cannot apply it to the muons. Naively I would do:

wtrig['muons'][mumask]

but this results in an exception:

TypeError: cannot interpret dtype object as a fancy index or mask

Applying the same mask to a single array works as expected.

What is the proper way of doing it?
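In awkward 0.x a jagged boolean mask indexes JaggedArray columns, not the outer Table, so one workaround is to apply the mask to each muon column and rebuild the table. Here is a sketch of the idea using a flat content array plus per-event counts in place of the real classes (all names and values are illustrative):

```python
import numpy as np

# toy stand-in for a jagged muon table: flat per-muon columns plus per-event counts
muons = {
    "pt":  np.array([40.0, 10.0, 55.0, 25.0]),
    "eta": np.array([0.1, 1.2, -0.5, 2.0]),
}
counts = np.array([2, 2])  # two events, two muons each

# the jagged mask is a boolean over the flat content
mumask = muons["pt"] > 30

# apply the same flat mask to every column, then recount muons per event
filtered = {name: column[mumask] for name, column in muons.items()}
offsets = np.concatenate([[0], np.cumsum(counts)])
new_counts = np.add.reduceat(mumask.astype(np.int64), offsets[:-1])
# filtered["pt"] -> [40., 55.], new_counts -> [1, 1]
```

If I read the API correctly, the equivalent in awkward 0.x itself is to index each JaggedArray column (e.g. wtrig['muons']['pt'][mumask]) and rebuild the Table from the results, since JaggedArray does accept jagged boolean masks.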

README.rst installation path is wrong

Hi,

When installing the awkward package, it also installs README.rst right in the root of the prefix directory. That is, if you install awkward as a system package, the README.rst file will be installed as /usr/README.rst, which is definitely not the proper place for a readme file. This behavior is caused by the setup configuration

setup.py:78     data_files = ["README.rst"],

Maybe it is possible not to install README.rst at all (as the majority of packages do)? Or at least install it to ${PREFIX}/share/doc/awkward/README.rst?
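A sketch of the second option, assuming setuptools' data_files convention that relative paths are resolved against the installation prefix (the rest of the setup() call is elided):

```python
from setuptools import setup

setup(
    # ... existing arguments unchanged ...
    data_files=[("share/doc/awkward", ["README.rst"])],
)
```

Dropping data_files entirely and exposing the README only through long_description would also match what most packages do.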

Table: cannot create an OBJECT array from memory buffer

OK, this might be several issues I experienced, but I am putting them here together.

From the NanoAOD file format of CMS, a TLorentzVector Table is reconstructed via:

flatarray = uproot_methods.classes.TLorentzVector.TLorentzVectorArray.from_ptetaphim(
    arrays["%s_pt" % name].content, arrays["%s_eta" % name].content,
    arrays["%s_phi" % name].content, arrays["%s_mass" % name].content)
jaggedarray = awkward.Methods.mixin(uproot_methods.classes.TLorentzVector.ArrayMethods, awkward.JaggedArray).fromoffsets(
    arrays["%s_pt" % name].offsets, flatarray)

Then the other properties (like Jet_EMFrac etc.) are associated with this Table by

jaggedarray["EMFrac"] = arrays["Jet_EMFrac"]

This is OK in local tests. But if I run the same code on LPC Condor, it crashes:

File "/storage/local/data1/condor/execute/dir_20889/awkward/array/jagged.py", line 749, in __setitem__
    self._content[where] = self._broadcast(what)._content
File "/storage/local/data1/condor/execute/dir_20889/awkward/array/jagged.py", line 764, in _broadcast
    data = self._util_toarray(data, self._content.dtype)
File "/storage/local/data1/condor/execute/dir_20889/awkward/array/base.py", line 312, in _util_toarray
    return cls.numpy.frombuffer(value, dtype=getattr(value, "dtype", defaultdtype)).reshape(getattr(value, "shape", -1))
ValueError: cannot create an OBJECT array from memory buffer

I also tried another approach, assigning the attribute to the object (awkward.Table type) with

setattr(jaggedarray, "EMFrac", arrays["Jet_EMFrac"])

This is my preferred way as it keeps the code simpler (jet.EMFrac vs jet["EMFrac"]). It fixed the memory-buffer issue above, but created another one: the object selection Jet = Jet[Jet.Flag] returns a new awkward.Table and loses all the attributes.

So I am in a dilemma here ..
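The underlying error is NumPy's, not awkward's: numpy.frombuffer refuses dtype=object, which is what _util_toarray ends up requesting when the jagged content holds Python objects (the TLorentzVector ArrayMethods case). A minimal reproduction, independent of awkward:

```python
import numpy as np

# np.frombuffer cannot build object arrays: object elements are pointers
# that cannot be reinterpreted from a raw byte buffer
try:
    np.frombuffer(b"\x00" * 8, dtype=object)
except ValueError as err:
    message = str(err)
print(message)  # cannot create an OBJECT array from memory buffer
```

This suggests the local-vs-Condor difference is in what dtype the content array ends up with, not in awkward's logic itself.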

Boosting by sum

EDIT: For anyone trying to do this themselves, a boost to the rest frame should actually be p4.boost(-p4.sum().boostp3).

I have a JaggedArray of TLorentzVectors. I want to boost the vectors in each row to the rest frame of that row.

I'm using p4 = p4.boost(p4.sum().boostp3) (well, the equivalent, since hh = p4.sum()), but I get the error:

Traceback (most recent call last):
  File "./train.py", line 44, in <module>
    p4 = p4.boost(hh.boostp3)
  File "/usr/lib/python3.7/site-packages/uproot_methods/classes/TLorentzVector.py", line 236, in boost
    v = self.p3 + gamma2*bp*p3 + gamma*p3*self.t
  File "/usr/lib/python3.7/site-packages/numpy/lib/mixins.py", line 25, in func
    return ufunc(self, other)
  File "/usr/lib/python3.7/site-packages/awkward/array/jagged.py", line 924, in __array_ufunc__
    result = getattr(ufunc, method)(*inputs, **kwargs)
  File "/usr/lib/python3.7/site-packages/uproot_methods/classes/TLorentzVector.py", line 301, in __array_ufunc__
    raise TypeError("(arrays of) TLorentzVector can only be added to/subtracted from other (arrays of) TLorentzVector")
TypeError: (arrays of) TLorentzVector can only be added to/subtracted from other (arrays of) TLorentzVector
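The EDIT above can be checked numerically without uproot_methods: boosting every four-vector by minus the boost velocity of their sum should zero out the total three-momentum. A small NumPy sketch of a Lorentz boost (my own helper, not the uproot_methods API):

```python
import numpy as np

def boost(E, p, beta):
    """Actively boost four-vectors (E, p), p of shape (n, 3), by velocity beta."""
    b2 = beta @ beta
    gamma = 1.0 / np.sqrt(1.0 - b2)
    bp = p @ beta                          # beta . p for each vector
    E_out = gamma * (E + bp)
    p_out = p + ((gamma - 1.0) * bp / b2)[:, None] * beta + gamma * E[:, None] * beta
    return E_out, p_out

rng = np.random.default_rng(42)
p = rng.normal(size=(5, 3))                # five random three-momenta
E = np.sqrt(1.0 + (p ** 2).sum(axis=1))    # on-shell energies with m = 1

P_E, P_p = E.sum(), p.sum(axis=0)          # total four-momentum of the "event"
E_rest, p_rest = boost(E, p, -P_p / P_E)   # the -p4.sum().boostp3 boost
p_rest.sum(axis=0)                         # ~ [0, 0, 0]: we are in the rest frame
```

Using +P_p / P_E instead (the original code) boosts the system further away from rest, which is why the sign flip in the EDIT matters.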

Append missing

np.concatenate or np.append equivalents are missing for jagged arrays:

ja1 = awkward.JaggedArray.fromiter([[1,2,3],[4,5]])
ja2 = awkward.JaggedArray.fromiter([[7,8],[9,10,11],[12]])

These are not available but would be numpy like:

ja12 = ja1.append(ja2)                  # not available in NumPy, but common in Python
ja12 = awkward.append(ja1, ja2)         # NumPy-like, but would be nice to support *args
ja12 = awkward.concatenate([ja1, ja2])  # NumPy-like, supports multiple arrays

What is the correct way to do this?
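Until such a function is provided, outer-axis concatenation can be emulated by stacking the flat contents and per-row counts and rebuilding with JaggedArray.fromcounts, which does exist in 0.x. A sketch using plain NumPy for the flat pieces:

```python
import numpy as np

# flat representation of ja1 = [[1,2,3],[4,5]] and ja2 = [[7,8],[9,10,11],[12]]
counts1, content1 = np.array([3, 2]), np.array([1, 2, 3, 4, 5])
counts2, content2 = np.array([2, 3, 1]), np.array([7, 8, 9, 10, 11, 12])

# concatenating along the outer axis just stacks counts and contents
counts = np.concatenate([counts1, counts2])
content = np.concatenate([content1, content2])
# JaggedArray.fromcounts(counts, content) now gives
# [[1,2,3],[4,5],[7,8],[9,10,11],[12]]
```

For real JaggedArrays the counts and content are available as ja.counts and ja.content, so the same three lines apply directly.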
