coffea's People

Contributors

agoose77, andrissp, andrzejnovak, annawoodard, bengalewsky, bfis, btovar, dependabot[bot], dntaylor, dthain, gordonwatts, ikrommyd, jayjeetatgithub, jpata, jrueb, kondratyevd, kratsg, lgray, moaly98, nikoladze, nsmith-, paulgessinger, pfackeldey, pre-commit-ci[bot], pviscone, riga, samkelson, saransh-cpp, valsdav, yimuchen

coffea's Issues

JetMET txt_converters creates invalid jagged array

Describe the bug
Thanks for the nice JME tools! I wanted to report the following issue while I also debug it myself; perhaps we can find the solution faster :)

coffea.lookup_tools.extractor produces an incorrect internal jagged array on the official JME file Summer16_07Aug2017_V11_L1fix_MC_L2Relative_AK4PFchs.txt (added for reference) with the following error:
ValueError: maximum offset 2498 is beyond the length of the content (2497)

To Reproduce

Run the following

from coffea.lookup_tools import extractor

extract = extractor()
extract.add_weight_sets(['* * Summer16_07Aug2017_V11_L1fix_MC_L2Relative_AK4PFchs.txt'])
extract.finalize()
extract.make_evaluator()

To induce the crash at init-time rather than eval-time:

diff --git a/coffea/lookup_tools/txt_converters.py b/coffea/lookup_tools/txt_converters.py
index 73a49ec..26de60e 100644
--- a/coffea/lookup_tools/txt_converters.py
+++ b/coffea/lookup_tools/txt_converters.py
@@ -146,7 +146,9 @@ def _build_standard_jme_lookup(name, layout, pars, nBinnedVars, nBinColumns,
     parm_order = []
     offset_col = 2 * nBinnedVars + 1 + int(not interpolatedFunc) * 2 * nEvalVars
     for i in range(nParms):
-        parms.append(awkward.JaggedArray.fromcounts(jagged_counts, pars[columns[i + offset_col]]))
+        jag = awkward.JaggedArray.fromcounts(jagged_counts, pars[columns[i + offset_col]])
+        assert(jag._valid())
+        parms.append(jag)
         parm_order.append('p%i' % (i))

     wrapped_up = {}

Expected behavior
It works with the older supplied JME files, so it looks like either the JME file is bugged or the parser produces incorrect output.

Output

  File "test_jme.py", line 4, in <module>
    extract.add_weight_sets(['* * data/Summer16_07Aug2017_V11_L1fix_MC_L2Relative_AK4PFchs.txt'])
  File "/Users/joosep/Documents/caltech/hepaccelerate-cms/coffea/coffea/lookup_tools/extractor.py", line 75, in add_weight_sets
    self.import_file(thefile)
  File "/Users/joosep/Documents/caltech/hepaccelerate-cms/coffea/coffea/lookup_tools/extractor.py", line 100, in import_file
    self._filecache[thefile] = file_converters[theformat][thetype](thefile)
  File "/Users/joosep/Documents/caltech/hepaccelerate-cms/coffea/coffea/lookup_tools/txt_converters.py", line 163, in _convert_standard_jme_txt_file
    return _build_standard_jme_lookup(*_parse_jme_formatted_file(jmeFilePath))
  File "/Users/joosep/Documents/caltech/hepaccelerate-cms/coffea/coffea/lookup_tools/txt_converters.py", line 150, in _build_standard_jme_lookup
    assert(jag._valid())
  File "/Users/joosep/anaconda3/lib/python3.6/site-packages/awkward/array/jagged.py", line 439, in _valid
    raise ValueError("maximum offset {0} is beyond the length of the content ({1})".format(self._offsets.max(), len(self._content)))
ValueError: maximum offset 2498 is beyond the length of the content (2497)

Desktop (please complete the following information):

$ python --version
Python 3.6.7 :: Anaconda custom (64-bit)
$ uname -a
Darwin dhcp-112-136.caltech.edu 18.6.0 Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64 x86_64
>>> awkward.__version__
'0.11.1'
coffea revision e776a80762ec17ea9bc05c35f5019d4edc04f091

Additional context
I believe the issue is happening here when creating jagged_counts, but so far I haven't traced it down exactly: https://github.com/CoffeaTeam/coffea/blob/master/coffea/lookup_tools/txt_converters.py#L136
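
For anyone reproducing this in isolation, here is a minimal illustration of the failure mode (toy numbers, not the JME parser itself): if the counts handed to JaggedArray.fromcounts sum to more than the content holds, _valid() raises exactly the error above.

import numpy as np
import awkward  # awkward 0.x, as used by coffea at the time

content = np.arange(2497)
counts = np.array([1000, 1000, 498])  # sums to 2498, one more than len(content) == 2497

jag = awkward.JaggedArray.fromcounts(counts, content)
jag._valid()  # ValueError: maximum offset 2498 is beyond the length of the content (2497)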

support dataset-specific treenames

in some workflows (ATLAS), the inputs to coffea are TTrees whose names vary from file to file

one naming scheme is, e.g., a tree name per dataset:

fileset = {
    'DoubleMuon': {
        'treename': 'tree1',
        'files': [
            ....
        ]
    },
    'SingleMuon': {
        'treename': 'tree2',
        'files': [
            ....
        ]
    }
}

it would be really nice to be able to declare such an input to coffea

Different interfaces for hist.rebin() and hist.group()

Signature: hfcat.rebin(old_axis, new_axis)

vs.

Signature: hfcat.group(new_axis, old_axes, mapping, overflow='none')

Since these are conceptually similar operations, ideally the order of arguments would be the same (old_axis, new_axis, ...).

Timeout SIGALRM kills Parsl workers

The timeout decorator here is resulting in a SIGALRM being raised which kills the worker running the task.

Here is a minimal reproducible example of the error:

import time

import parsl
from parsl.app.app import python_app
from parsl.configs.htex_local import config
config.executors[0].max_workers = 1

from coffea.processor.parsl.timeout import timeout

parsl.set_stream_logger()
parsl.load(config)

@python_app
@timeout
def sleep(seconds, timeout=None):
    import time
    time.sleep(seconds)

    return 'slept {} seconds'.format(seconds)

futures = [sleep(i, timeout=2) for i in [1, 3, 1]]
for f in futures:
    try:
        print('*' * 50 + ' FIRST BATCH: ' + f.result())
    except Exception as e:
        print('caught exception: {}'.format(e))

# comment out to see expected behavior
time.sleep(5)

futures = [sleep(i, timeout=2) for i in [1, 1]]
for f in futures:
    print('*' * 50 + ' SECOND BATCH: ' + f.result())

The expected behavior is for two FIRST BATCH tasks to complete successfully and one to fail, then for the two SECOND BATCH tasks to complete successfully; this can be seen if the time.sleep(5) in the MRE is commented out:

2019-06-28 19:18:14 parsl.dataflow.dflow:464 [INFO]  Task 0 launched on executor htex_local
2019-06-28 19:18:14 parsl.dataflow.dflow:688 [INFO]  Task 1 submitted for App sleep, waiting on tasks []
2019-06-28 19:18:14 parsl.dataflow.dflow:698 [DEBUG]  Task 1 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f074d01c5f8 state=pending> parent=None>
2019-06-28 19:18:14 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f074d020378> to queue with args (3,)
2019-06-28 19:18:14 parsl.dataflow.dflow:464 [INFO]  Task 1 launched on executor htex_local
2019-06-28 19:18:14 parsl.dataflow.dflow:688 [INFO]  Task 2 submitted for App sleep, waiting on tasks []
2019-06-28 19:18:14 parsl.dataflow.dflow:698 [DEBUG]  Task 2 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f074d01c940 state=pending> parent=None>
2019-06-28 19:18:14 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f074d020378> to queue with args (1,)
2019-06-28 19:18:14 parsl.dataflow.dflow:464 [INFO]  Task 2 launched on executor htex_local
2019-06-28 19:18:15 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3841105]
2019-06-28 19:18:15 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 3 active tasks, 1/0/0 running/submitted/pending blocks, and 0 connected workers
2019-06-28 19:18:18 parsl.dataflow.dflow:280 [INFO]  Task 0 completed
************************************************** FIRST BATCH: slept 1 seconds
2019-06-28 19:18:20 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3841105]
2019-06-28 19:18:20 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:18:20 parsl.app.errors:158 [DEBUG]  Reraising exception of type <class 'Exception'>
2019-06-28 19:18:20 parsl.dataflow.dflow:250 [ERROR]  Task 1 failed
Traceback (most recent call last):
  File "/afs/hep.wisc.edu/home/awoodard/.local/lib/python3.6/site-packages/parsl-0.8.0-py3.6.egg/parsl/dataflow/dflow.py", line 247, in handle_exec_update
    res.reraise()
  File "/afs/hep.wisc.edu/home/awoodard/.local/lib/python3.6/site-packages/parsl-0.8.0-py3.6.egg/parsl/app/errors.py", line 163, in reraise
    reraise(t, v, tb)
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_95apython3/x86_64-centos7-gcc7-opt/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/afs/hep.wisc.edu/home/awoodard/.local/lib/python3.6/site-packages/parsl-0.8.0-py3.6.egg/parsl/app/errors.py", line 172, in wrapper
    return func(*args, **kwargs)
  File "/afs/hep.wisc.edu/user/awoodard/coffea/coffea/processor/parsl/timeout.py", line 23, in wrapper
    retval = func(*args, **kwargs)
  File "test.py", line 17, in sleep
    time.sleep(seconds)
  File "/afs/hep.wisc.edu/user/awoodard/coffea/coffea/processor/parsl/timeout.py", line 15, in _timeout_handler
    raise Exception("Timeout hit")
Exception: Timeout hit
2019-06-28 19:18:20 parsl.dataflow.dflow:271 [INFO]  Task 1 failed after 0 retry attempts
2019-06-28 19:18:20 parsl.app.errors:158 [DEBUG]  Reraising exception of type <class 'Exception'>
caught exception: Timeout hit
2019-06-28 19:18:22 parsl.dataflow.dflow:280 [INFO]  Task 2 completed
************************************************** FIRST BATCH: slept 1 seconds
2019-06-28 19:18:22 parsl.dataflow.dflow:688 [INFO]  Task 3 submitted for App sleep, waiting on tasks []
2019-06-28 19:18:22 parsl.dataflow.dflow:698 [DEBUG]  Task 3 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f074d01ce10 state=pending> parent=None>
2019-06-28 19:18:22 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f074d020378> to queue with args (1,)
2019-06-28 19:18:22 parsl.dataflow.dflow:464 [INFO]  Task 3 launched on executor htex_local
2019-06-28 19:18:22 parsl.dataflow.dflow:688 [INFO]  Task 4 submitted for App sleep, waiting on tasks []
2019-06-28 19:18:22 parsl.dataflow.dflow:698 [DEBUG]  Task 4 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f074d02d828 state=pending> parent=None>
2019-06-28 19:18:22 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f074d020378> to queue with args (1,)
2019-06-28 19:18:22 parsl.dataflow.dflow:464 [INFO]  Task 4 launched on executor htex_local
2019-06-28 19:18:23 parsl.dataflow.dflow:280 [INFO]  Task 3 completed
************************************************** SECOND BATCH: slept 1 seconds
2019-06-28 19:18:25 parsl.dataflow.dflow:280 [INFO]  Task 4 completed
************************************************** SECOND BATCH: slept 1 seconds
2019-06-28 19:18:25 parsl.dataflow.dflow:823 [INFO]  DFK cleanup initiated
2019-06-28 19:18:25 parsl.dataflow.dflow:734 [INFO]  Summary of tasks in DFK:
2019-06-28 19:18:25 parsl.dataflow.dflow:765 [INFO]  Tasks in state States.done: 0, 2, 3, 4
2019-06-28 19:18:25 parsl.dataflow.dflow:765 [INFO]  Tasks in state States.failed: 1
2019-06-28 19:18:25 parsl.dataflow.dflow:772 [INFO]  End of summary
2019-06-28 19:18:25 parsl.dataflow.dflow:847 [INFO]  Terminating flow_control and strategy threads
2019-06-28 19:18:25 parsl.executors.high_throughput.executor:484 [DEBUG]  [HOLD_BLOCK]: Sending hold to manager: 51fc0f805958
2019-06-28 19:18:25 parsl.executors.high_throughput.executor:452 [DEBUG]  Sent hold request to worker: 51fc0f805958
2019-06-28 19:18:25 parsl.providers.local.local:235 [DEBUG]  Terminating job/proc_id: 3841105
2019-06-28 19:18:25 parsl.executors.high_throughput.executor:618 [INFO]  Attempting HighThroughputExecutor shutdown
2019-06-28 19:18:25 parsl.executors.high_throughput.executor:622 [INFO]  Finished HighThroughputExecutor shutdown attempt
2019-06-28 19:18:25 parsl.data_provider.data_manager:98 [DEBUG]  Done with executor shutdown
2019-06-28 19:18:25 parsl.dataflow.dflow:879 [INFO]  DFK cleanup complete

Running the unedited MRE, however, we never see the SECOND BATCH tasks:

2019-06-28 19:20:49 parsl.dataflow.dflow:464 [INFO]  Task 0 launched on executor htex_local
2019-06-28 19:20:49 parsl.dataflow.dflow:688 [INFO]  Task 1 submitted for App sleep, waiting on tasks []
2019-06-28 19:20:49 parsl.dataflow.dflow:698 [DEBUG]  Task 1 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f7f8003d5c0 state=pending> parent=None>
2019-06-28 19:20:49 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f7f80041378> to queue with args (3,)
2019-06-28 19:20:49 parsl.dataflow.dflow:464 [INFO]  Task 1 launched on executor htex_local
2019-06-28 19:20:49 parsl.dataflow.dflow:688 [INFO]  Task 2 submitted for App sleep, waiting on tasks []
2019-06-28 19:20:49 parsl.dataflow.dflow:698 [DEBUG]  Task 2 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f7f8003d908 state=pending> parent=None>
2019-06-28 19:20:49 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f7f80041378> to queue with args (1,)
2019-06-28 19:20:49 parsl.dataflow.dflow:464 [INFO]  Task 2 launched on executor htex_local
2019-06-28 19:20:50 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:20:50 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 3 active tasks, 1/0/0 running/submitted/pending blocks, and 0 connected workers
2019-06-28 19:20:53 parsl.dataflow.dflow:280 [INFO]  Task 0 completed
************************************************** FIRST BATCH: slept 1 seconds
2019-06-28 19:20:55 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:20:55 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:20:55 parsl.app.errors:158 [DEBUG]  Reraising exception of type <class 'Exception'>
2019-06-28 19:20:55 parsl.dataflow.dflow:250 [ERROR]  Task 1 failed
Traceback (most recent call last):
  File "/afs/hep.wisc.edu/home/awoodard/.local/lib/python3.6/site-packages/parsl-0.8.0-py3.6.egg/parsl/dataflow/dflow.py", line 247, in handle_exec_update
    res.reraise()
  File "/afs/hep.wisc.edu/home/awoodard/.local/lib/python3.6/site-packages/parsl-0.8.0-py3.6.egg/parsl/app/errors.py", line 163, in reraise
    reraise(t, v, tb)
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_95apython3/x86_64-centos7-gcc7-opt/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/afs/hep.wisc.edu/home/awoodard/.local/lib/python3.6/site-packages/parsl-0.8.0-py3.6.egg/parsl/app/errors.py", line 172, in wrapper
    return func(*args, **kwargs)
  File "/afs/hep.wisc.edu/user/awoodard/coffea/coffea/processor/parsl/timeout.py", line 23, in wrapper
    retval = func(*args, **kwargs)
  File "test.py", line 17, in sleep
    time.sleep(seconds)
  File "/afs/hep.wisc.edu/user/awoodard/coffea/coffea/processor/parsl/timeout.py", line 15, in _timeout_handler
    raise Exception("Timeout hit")
Exception: Timeout hit
2019-06-28 19:20:55 parsl.dataflow.dflow:271 [INFO]  Task 1 failed after 0 retry attempts
2019-06-28 19:20:55 parsl.app.errors:158 [DEBUG]  Reraising exception of type <class 'Exception'>
caught exception: Timeout hit
2019-06-28 19:20:57 parsl.dataflow.dflow:280 [INFO]  Task 2 completed
************************************************** FIRST BATCH: slept 1 seconds
2019-06-28 19:21:00 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:21:00 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 0 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:21:00 parsl.dataflow.strategy:230 [DEBUG]  Executor htex_local has 0 active tasks; starting kill timer (if idle time exceeds 120s, resources will be removed)
2019-06-28 19:21:02 parsl.dataflow.dflow:688 [INFO]  Task 3 submitted for App sleep, waiting on tasks []
2019-06-28 19:21:02 parsl.dataflow.dflow:698 [DEBUG]  Task 3 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f7f8003ddd8 state=pending> parent=None>
2019-06-28 19:21:02 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f7f80041378> to queue with args (1,)
2019-06-28 19:21:02 parsl.dataflow.dflow:464 [INFO]  Task 3 launched on executor htex_local
2019-06-28 19:21:02 parsl.dataflow.dflow:688 [INFO]  Task 4 submitted for App sleep, waiting on tasks []
2019-06-28 19:21:02 parsl.dataflow.dflow:698 [DEBUG]  Task 4 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f7f8004e7f0 state=pending> parent=None>
2019-06-28 19:21:02 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f7f80041378> to queue with args (1,)
2019-06-28 19:21:02 parsl.dataflow.dflow:464 [INFO]  Task 4 launched on executor htex_local
2019-06-28 19:21:05 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:21:05 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:21:10 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:21:10 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:21:15 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:21:15 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:21:20 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:21:20 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:21:25 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:21:25 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
[ad infinitum]

And looking at the running processes one can see the dead worker:

awoodard 3802394  0.3  0.0      0     0 pts/20   Z    18:54   0:00 [process_worker_] <defunct>
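
For reference, a minimal sketch (assuming the decorated function takes its limit via a timeout keyword, as in the MRE) of how the decorator could avoid leaving a live alarm behind; this is not coffea's actual fix, just one way to cancel the alarm and restore the previous SIGALRM handler so a stray alarm can never fire in the worker after the task returns.

import functools
import signal


def timeout(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        seconds = kwargs.get('timeout', None)

        def _timeout_handler(signum, frame):
            raise Exception("Timeout hit")

        if seconds is not None:
            previous = signal.signal(signal.SIGALRM, _timeout_handler)
            signal.alarm(int(seconds))
        try:
            return func(*args, **kwargs)
        finally:
            if seconds is not None:
                signal.alarm(0)  # cancel any pending alarm
                signal.signal(signal.SIGALRM, previous)  # restore the old handler
    return wrapper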

Enable various executors to operate in memory constrained environments

A side-thought from #85.

Use lazy arrays to stay within some memory usage bound, so that we can always comfortably fit in batch slots when using uproot.

There's chunksize, which sets a reasonable number of events per job; beyond that we need either a cache when reading from uproot with lazy arrays (sub-chunking?), or some way to track how much memory is being used by awkward/numpy arrays.
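
A minimal sketch of the cache side of this, assuming uproot's lazy-array machinery can be handed any MutableMapping as its cache (class name and limit are illustrative, not a coffea API): a dict-like container that evicts entries once the total nbytes goes over a budget.

from collections import OrderedDict
from collections.abc import MutableMapping


class BoundedArrayCache(MutableMapping):
    """Evict least-recently-added arrays once the total nbytes exceeds a budget."""

    def __init__(self, limit_bytes):
        self._limit = limit_bytes
        self._data = OrderedDict()
        self._nbytes = 0

    def __setitem__(self, key, array):
        if key in self._data:
            del self[key]
        self._data[key] = array
        self._nbytes += getattr(array, "nbytes", 0)
        # evict oldest entries until we are back under budget
        while self._nbytes > self._limit and len(self._data) > 1:
            _, evicted = self._data.popitem(last=False)
            self._nbytes -= getattr(evicted, "nbytes", 0)

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        array = self._data.pop(key)
        self._nbytes -= getattr(array, "nbytes", 0)

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)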

Matching function factory

Is your feature request related to a problem? Please describe.

Feature request to improve code interface.

Describe the solution you'd like

dR_matcher = match_function_factory(deltaRCut=0.3)
jagged_gen.match(jagged_reco, dR_matcher)

Describe alternatives you've considered

No alternatives considered. I'm pretty stubborn.

Additional context

This makes it a bit more flexible and is very similar to how fastjet defines these things. This would also make the interface for function matching much easier to deal with by allowing different match functions to have different interfaces without affecting/rewriting the underlying matching code.
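
A rough sketch of the kind of factory this describes (the names come from the request above; the deltaR arithmetic is plain numpy, not an existing coffea helper):

import numpy as np


def match_function_factory(deltaRCut=0.3):
    # returns a predicate the generic matching code can apply to candidate pairs
    def matcher(eta1, phi1, eta2, phi2):
        deta = eta1 - eta2
        dphi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
        return np.hypot(deta, dphi) < deltaRCut
    return matcher


dR_matcher = match_function_factory(deltaRCut=0.3)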

make progress of futures based on callbacks

Would there be opposition to changing

with tqdm(disable=not status, unit=unit, total=len(futures), desc=desc) as pbar:
    while len(futures) > 0:
        finished = set(job for job in futures if job.done())
        for job in finished:
            accumulator += job.result()
            pbar.update(1)
        futures -= finished
        del finished
        time.sleep(1)

into

for job in tqdm(concurrent.futures.as_completed(futures), disable=not status, unit=unit, total=len(futures)):
    accumulator += job.result()

which relies on callbacks instead of polling every 1s (I guess this saves 1/2 sec on average). Also, as_completed can take a timeout parameter.

If you think it's useful, I can make a small PR later today.

Adding an option to automatically skip failed to read files in processor

Is your feature request related to a problem? Please describe.
Reading xrootd files, sometimes a file or two gets corrupted or is just temporarily unreadable via xrootd. This currently throws an error, possibly wasting time, and makes me continuously update the file dictionary I'm feeding it.

Describe the solution you'd like
Adding a flag to the processor/run_uproot_job to instead skip a file if it's unreadable, and maybe record the number and names of the skipped files to print at the end so the user can decide whether to ignore or fix them.
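
A usage sketch of the requested flag (the name skipbadfiles, its placement in executor_args, and the fileset/MyProcessor objects are assumptions for illustration, not an existing option):

output = processor.run_uproot_job(
    fileset,
    treename='Events',
    processor_instance=MyProcessor(),
    executor=processor.futures_executor,
    executor_args={'workers': 4, 'skipbadfiles': True},
)
# at the end, the count and names of skipped files would be reported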

Saving recHits for columnar processing

Another interesting use case would be iterating over a recHits collection. In our analysis we try to match two different track reconstruction algorithms based on the ID of the hits in each object. This information is only in AOD (I believe), so right now this gets done at the ntuplizer level. But if it were possible to process it with a columnar approach, then I could maybe add the recHit information to the flat ntuples, using an index offset marker to separate hits belonging to different objects, as suggested by Nick (and similar to the structure of JaggedArrays themselves).
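
A minimal sketch of the offset-marker idea using the awkward 0.x API coffea relied on at the time (branch names are made up): store the hits flat plus a per-object count, and regroup them into a jagged structure at analysis time.

import numpy as np
import awkward

hit_ids = np.array([11, 12, 13, 21, 22, 31])  # flat recHit ID branch in the ntuple
hits_per_track = np.array([3, 2, 1])          # marker branch: number of hits per track

hits_by_track = awkward.JaggedArray.fromcounts(hits_per_track, hit_ids)
# hits_by_track[0] -> [11, 12, 13]; matching two algorithms is then set logic per track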

coffea.hist.plot.clopper_pearson_interval gives wrong result for numerator==denominator

Describe the bug
If numerator == denominator, clopper_pearson_interval always returns [1,1] as the corresponding interval.

To Reproduce

from coffea.hist.plot import clopper_pearson_interval
clopper_pearson_interval(10,10,0.95)
# Result: array([1., 1.])

Expected behavior
I expected to get array([0.69150289, 1.])

Proposed fix
This is trivially fixed by changing this line from

interval[:, num == denom] = 1.

to

interval[1, num == denom] = 1.

If you agree, I'd be happy to make a PR.

Size-ordered stack plots in coffea.hist.plot1d

Is your feature request related to a problem? Please describe.
When creating a typical stack plot using hist.plot1d(histo, overlay='dataset', stack=True), the stacked histograms are drawn in whatever order histo.identifiers() delivers them (see here).

Describe the solution you'd like
I would like to be able to sort the stacked histograms so that they are plotted in increasing order of their integral (i.e. small ones first so that they are visible in a log plot).

I think this could be implemented by changing hist.identifiers() to add an optional sorting argument, which could be used to turn on integral-based sorting. This would imply that the call to the identifiers() function would trigger integration.
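
In the meantime, a hand-rolled ordering along those lines might look like this (hedged: identifiers/integrate/values are existing coffea.hist calls, assuming 'dataset' is the only sparse axis; a sorting hook inside identifiers() is exactly what is being requested here):

order = sorted(
    histo.identifiers('dataset'),
    key=lambda ident: histo.integrate('dataset', ident).values()[()].sum(),
)
# 'order' now lists datasets from smallest to largest integral, ready to be
# used once plot1d/identifiers grow a sorting option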

Describe alternatives you've considered
None.

Additional context
None.

possibly make preprocessing step more lazy

I was playing around with these tools and some local NanoAOD based on the example notebook with a DimuonProcessor and processor.run_uproot_job. The preprocessing step was taking a while since I have a large number of files and I only wanted to test one or two chunks, so I made a couple of additions in https://github.com/aminnj/fnal-column-analysis-tools/commit/61108df7fa72b12ca538ebda307200d9ce56a70d , including a maxchunks parameter which limits the number of chunks per dataset.

Currently the event chunking loops through files and calls uproot.numentries. I made it multi-threaded since it takes an executor. And if maxchunks is specified, making the chunking lazy saves a lot of time. I didn't tune the magic numbers too hard, though for single digit file counts, the threadpool overhead wasn't worth it.

Is there another feature I missed, or can this be done in a better way?
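
For concreteness, a rough sketch of what lazier chunking can look like (a hypothetical helper, not the code in the linked commit): stop calling uproot.numentries as soon as maxchunks chunks have been produced for the dataset.

import uproot


def iter_chunks(files, treename, chunksize, maxchunks=None):
    nchunks = 0
    for filename in files:
        nentries = uproot.numentries(filename, treename)
        for start in range(0, nentries, chunksize):
            if maxchunks is not None and nchunks >= maxchunks:
                return
            yield (filename, start, min(start + chunksize, nentries))
            nchunks += 1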

Specify a per-module array backend (numpy or cupy) for modules

Is your feature request related to a problem? Please describe.
It can be useful to switch to a GPU-based backend such as cupy on a per-module basis without needing the whole coffea ecosystem to be ported. For example, factorized JEC uncertainty computation is mainly based on interp1d and lookups which are highly efficient on a GPU.

Describe the solution you'd like
Each module would have to carry a configurable backend option. A backend could be as simple as setting JetCorrectionUncertainty.numpy_lib = numpy (or cupy), but given the imperfect feature parity between the two libraries and possible future alternatives, a simple backend class providing the wrapping functionality would likely be needed, e.g. coffea.backends.NumpyBackend, CupyBackend. The backend would either rely on the inputs already being located on the correct device or do the transfer on the fly, depending on the design choice.
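
A bare-bones sketch of the backend idea (class and attribute names are illustrative, not an existing coffea interface): each lookup-style module holds a pluggable array library and routes its math through it.

import numpy


class NumpyBackend:
    lib = numpy

    @staticmethod
    def to_device(array):
        return array  # already host memory


class FactorizedLookup:
    backend = NumpyBackend  # swap in a CupyBackend to run the same math on GPU

    def evaluate(self, pt):
        xp = self.backend.lib
        pt = self.backend.to_device(pt)
        return xp.clip(xp.log(pt), 0.0, None)  # stand-in for the real correction math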

Describe alternatives you've considered
A switch can be made globally in coffea.util.np, but that is non-ideal before awkward-array works completely with another backend. A physics user may want to use only a part of the coffea library on a certain device.

Additional context
I've made a first attempt at converting the JEC code to a GPU in my fork jpata#1. I've seen a performance improvement of roughly 7x (160s on a CPU vs 24s on a GPU) on computing factorized JEC on 10M random jets.

coffea interface with pyhf - statistical inference

/cc @matthewfeickert @lukasheinrich

Is your feature request related to a problem? Please describe.

Feature request.

Describe the solution you'd like

An interface/way to export to pyhf JSON to perform binned statistical fits (this should work for anything that lets you make histograms, basically). The pyhf workspaces follow the schema (v1.0.0) defined here: https://diana-hep.org/pyhf/schemas/1.0.0/workspace.json. If you see this issue in the future, we might be at a later version of the schema.
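
To make the request concrete, a hedged sketch of the glue implied here: pull bin contents out of a coffea histogram and drop them into a workspace dict following the schema linked above. The histogram, dataset, and channel names are hypothetical; only the JSON layout comes from the pyhf v1.0.0 workspace schema.

mc = histo.integrate('dataset', 'ttbar').values()[()].tolist()
obs = histo.integrate('dataset', 'data').values()[()].tolist()

workspace = {
    "channels": [{
        "name": "signal_region",
        "samples": [{
            "name": "ttbar",
            "data": mc,
            "modifiers": [{"name": "mu", "type": "normfactor", "data": None}],
        }],
    }],
    "observations": [{"name": "signal_region", "data": obs}],
    "measurements": [{"name": "fit", "config": {"poi": "mu", "parameters": []}}],
    "version": "1.0.0",
}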

Describe alternatives you've considered

None.

Additional context

A similar issue is on-going to get pyhf and zfit working together as pyhf is binned, but zfit is unbinned: zfit/zfit#120.

Yet another histogramming standard

  • Pretty stack plots
  • The error bars should be correct
  • Should be able to read in from and output to uproot
    • Maybe pandas and other things too

Log and sqrt functions in btag scale factor formulas

Functions such as log and sqrt show up in the btagging SF formulas for DeepCSV in 2017 & 2018, which break the extractor. Using a csv with these types of formulas in the lookup_tools extractor results in a KeyError for 'log' in the make_evaluator step:

from coffea.lookup_tools import extractor
ext = extractor()
ext.add_weight_sets(["btag * DeepCSV_102XSF_V1.csv"])
ext.finalize()
evaluator = ext.make_evaluator()

If the log and sqrt functions are replaced in the .csv with np.log and np.sqrt, it works fine.
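
Until the extractor handles these itself, the workaround above can be scripted in a few lines (a sketch; it assumes the formula strings in the CSV only ever use log( and sqrt( in the mathematical sense):

with open("DeepCSV_102XSF_V1.csv") as fin:
    text = fin.read()

text = text.replace("log(", "np.log(").replace("sqrt(", "np.sqrt(")

with open("DeepCSV_102XSF_V1_fixed.csv", "w") as fout:
    fout.write(text)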

iterative_executor crashes

I think that, because of

executor_args.setdefault('workers', 1)

and

executor(items, _work_function, output, **executor_args)

the call to

def iterative_executor(items, function, accumulator, status=True, unit='items', desc='Processing',

will fail, since iterative_executor does not accept a workers argument. Adding workers=1 to the argument list of iterative_executor fixes it for me.
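
The suggested fix, sketched (signature abbreviated; only the added workers=1 matters):

def iterative_executor(items, function, accumulator, workers=1,
                       status=True, unit='items', desc='Processing'):
    # 'workers' is accepted (and ignored) only so that **executor_args
    # containing 'workers' no longer raises a TypeError
    ...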

Pythonic interface for add_weight_sets

Rather than taking a particular space-separated string, add_weight_sets() could take a list of dicts, each with the appropriate parameters for add_weight_set(). Alternatively, a namedtuple could be used.
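
A sketch of what that could look like (the key names are hypothetical, mirroring the pieces of today's space-separated string: local name, name in the file, and file path; the file names are the ones used elsewhere in these issues):

ext.add_weight_sets([
    {'local_name': 'btag', 'name_in_file': '*', 'file': 'DeepCSV_102XSF_V1.csv'},
    {'local_name': '*', 'name_in_file': '*', 'file': 'Summer16_07Aug2017_V11_L1fix_MC_L2Relative_AK4PFchs.txt'},
])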

ZIP files through XRootD

@lgray @nsmith- We've talked about the need to access NPZ (and other ZIP) files through XRootD offline; here's the start of a discussion about that.

Python's zipfile module can accept any object as a "file," as long as it supports the file-like protocol. Unlike MutableMapping and such, the file-like protocol wasn't formalized in collections.abc, so I did a test to figure out what methods it needs.

To start with, here's Awkward's save/load interface, which is backed by backward and forward compatible ZIP files. This is what I'll be using for the test.

import awkward

a = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
b = awkward.fromiter([100, 200, 300])
print(a)
print(b)

awkward.save("tmp.awkd", {"a": a, "b": b})  # if a dict, save as named arrays in file
file = awkward.load("tmp.awkd")             # and then read-back file is dict-like

a2 = file["a"]                              # reads upon access; why we want XRootD
b2 = file["b"]
print(a2)
print(b2)

Now to find out how much of the file-like interface it uses (in practice, for this file).

class FileLike(object):
    def __init__(self, *args, **kwargs):
        print("__init__")
        self.f = open(*args, **kwargs)
        self.methods = set()

    def close(self):
        print("close")
        self.methods.add("close")
        return self.f.close()

    def flush(self):
        print("flush")
        self.methods.add("flush")
        return self.f.flush()

    def fileno(self):
        print("fileno")
        self.methods.add("fileno")
        return self.f.fileno()

    def isatty(self):
        print("isatty")
        self.methods.add("isatty")
        return self.f.isatty()

    def next(self):
        print("next")
        self.methods.add("next")
        return self.f.next()

    def read(self, *args):
        self.methods.add("read")
        if len(args) == 0:
            print("read")
            return self.f.read()
        else:
            print("read", args[0])
            return self.f.read(args[0])

    def readline(self, *args):
        self.methods.add("readline")
        if len(args) == 0:
            print("readline")
            return self.f.readline()
        else:
            print("readline", args[0])
            return self.f.readline(args[0])

    def readlines(self, *args):
        if len(args) == 0:
            print("readlines")
            self.methods.add("readlines")
            return self.f.readlines()
        else:
            print("readlines", args[0])
            self.methods.add("readlines")
            return self.f.readlines(args[0])

    def xreadlines(self):
        print("xreadlines")
        self.methods.add("xreadlines")
        return self.f.xreadlines()

    def seek(self, offset, *args):
        self.methods.add("seek")
        if len(args) == 0:
            print("seek", offset)
            return self.f.seek(offset)
        else:
            print("seek", offset, args[0])
            return self.f.seek(offset, args[0])

    def tell(self):
        print("tell")
        self.methods.add("tell")
        return self.f.tell()

    def truncate(self, *args):
        self.methods.add("truncate")
        if len(args) == 0:
            print("truncate")
            return self.f.truncate()
        else:
            print("truncate", args[0])
            return self.f.truncate(args[0])

    def write(self, data):
        print("write", len(data))
        self.methods.add("write")
        return self.f.write(data)

    def writelines(self, data):
        print("writelines", data)
        self.methods.add("writelines")
        return self.f.writelines(data)

    @property
    def closed(self):
        print("closed")
        self.methods.add("closed")
        return self.f.closed

    @property
    def encoding(self):
        print("encoding")
        self.methods.add("encoding")
        return self.f.encoding

    @property
    def errors(self):
        print("errors")
        self.methods.add("errors")
        return self.f.errors

    @property
    def mode(self):
        print("mode")
        self.methods.add("mode")
        return self.f.mode

    @property
    def name(self):
        print("name")
        self.methods.add("name")
        return self.f.name

    @property
    def newlines(self):
        print("newlines")
        self.methods.add("newlines")
        return self.f.newlines

    @property
    def softspace(self):
        print("softspace")
        self.methods.add("softspace")
        return self.f.softspace

    @property
    def seekable(self):
        print("seekable")
        self.methods.add("seekable")
        return self.f.seekable

    def __enter__(self):
        print("__enter__")
        self.methods.add("__enter__")
        return self.f.__enter__()

    def __exit__(self, exc_type, exc_val, exc_tb):
        print("__exit__", exc_type, exc_val, exc_tb)
        self.methods.add("__exit__")
        return self.f.__exit__(exc_type, exc_val, exc_tb)

And we run the test to see which ones are actually used.

f = FileLike("tmp.awkd", "wb")
awkward.save(f, {"a": a, "b": b})
f.methods
# uses {'flush', 'name', 'tell', 'seek', 'write'}

f = FileLike("tmp.awkd", "rb")
file = awkward.load(f)
f.methods
# uses {'read', 'seek', 'name', 'tell'}

f.methods = set()
a2 = file["a"]
f.methods
# uses {'seek', 'tell', 'read', 'seekable'}

b2 = file["b"]
f.methods
# uses {'seek', 'tell', 'read', 'seekable'}

So, in the end, we need a shim that wraps a pyxrootd.File with methods for open, read, write, seek, tell, flush, name (property), and seekable (property).
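
A hedged sketch of such a shim, exposing just the file-like methods the test above showed are needed. The open/read/write/stat calls follow the XRootD python bindings as I understand them; treat the exact signatures as assumptions, and note that status checking and error handling are omitted.

from XRootD import client  # the pyxrootd bindings


class XRootDFileLike(object):
    """Minimal file-like wrapper around an XRootD file for zipfile/awkward."""

    def __init__(self, url):
        self.name = url          # 'name' accessed as an attribute, as the test showed
        self._file = client.File()
        self._file.open(url)
        self._pos = 0

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=0):
        if whence == 0:
            self._pos = offset
        elif whence == 1:
            self._pos += offset
        elif whence == 2:
            status, statinfo = self._file.stat()
            self._pos = statinfo.size + offset
        return self._pos

    def read(self, size=-1):
        # size=0 in the XRootD bindings means "read to the end of the file"
        status, data = self._file.read(offset=self._pos, size=max(size, 0))
        self._pos += len(data)
        return data

    def write(self, data):
        self._file.write(data, offset=self._pos)
        self._pos += len(data)
        return len(data)

    def flush(self):
        pass

    def close(self):
        self._file.close()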

Adding legend settings forwarding to coffea.hist.plot1d

Is your feature request related to a problem? Please describe.
Sometimes the default legend settings are not satisfactory. Adjusting them after the legend is created requires additional code (grouping the handles), but it is easy to do when the legend is constructed, so I thought it would be convenient to add another kwarg for legend settings when calling coffea.hist.plot1d. Or does something for this already exist?

Describe the solution you'd like
Add additional kwarg to the plot1d signature, something like legend_opts=None, and forward that when calling ax.legend().
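
A usage sketch of the proposal (legend_opts is the proposed kwarg name, not an existing option; the dict contents are just ordinary matplotlib legend() keywords):

hist.plot1d(histo, overlay='dataset', stack=True,
            legend_opts={'loc': 'upper right', 'ncol': 2, 'fontsize': 'small'})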

Describe alternatives you've considered
N/A

Additional context
N/A

Take advantage of uproot(-methods) profiles.

Is your feature request related to a problem? Please describe.
Uproot has added a feature which assumes a large chunk of the functionality of JaggedCandidateArray or Initializer/Builder. In particular uproot-methods from 0.7.0 exposes a nice SoA object-like interface complete with functional cross references among objects.

The features of JaggedCandidateArray that remain pertinent are the wrappers on top of distincts(), pairs(), and choose() that handle composition of physics objects (things with four-momenta) into new physics objects, where the four-momenta are summed and extraneous properties are dropped for the parent (i.e. Z boson candidates do not have isolation).

Describe the solution you'd like
We should determine what are the pertinent remaining features of JaggedCandidateArray and Initializer/Builder and reduce the scope of the classes as needed.

As another possibility, it may also be a good idea to try to extract the best aspects of both styles of array manipulation and remove the need for two separate packages that do the same thing.

Describe alternatives you've considered
The alternative is doing nothing and ignoring this great new feature, which is silly.

Add remove function for Axis (and simplify syntax)

Is your feature request related to a problem? Please describe.
"I'm always frustrated when [...]" slicing/grouping/rebinning/removing syntax is not intuitive. Related to #90 (comment)

Take rebin, currently

h.rebin(old_axis_name_string, new_axis_object)

In reality, the most likely use case is only a change to the binning, so rebin would save a good chunk of duplication by allowing:

h.rebin(old_axis_name_string, new_bin_array)

(rename is probably rare enough to leave as is)

Also, removing elements could be improved; the current slicing is not entirely intuitive. Say I want to remove a bin from a Cat/Sparse axis:

h[["cat1","cat2"], :]

To something like

h.axis('cat').remove(items)

or

h.remove(items, axis="cat")

or even

h.rebin('cat', ["cat1", "cat2"])

A new axis copy could be created and reinserted into the hist by these functions, keeping the axis object shared in general but allowing operations on it when desired.

Describe alternatives you've considered
A UI is like a joke: if you have to explain it...

Bitwise operations in Coffea

As discussed in the last meeting, an interesting use case might be doing bitwise-arithmetic in columnar format. Sometimes I pre-compute fancier cuts when running the ntuplizer over my AOD samples (e.g to use some of the CMSSW libraries), and save each cut into one bit of a uint32_t variable. If there's a way to build cutflows out of those bits this would be very convenient. Perhaps this is already possible with just the existing infrastructure?

Thanks!
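
For what it's worth, plain numpy bitwise operators already work column-wise on such a branch; a minimal sketch (the branch contents and bit assignments are made up):

import numpy as np

cutbits = np.array([0b101, 0b111, 0b001, 0b110], dtype=np.uint32)  # precomputed cut-mask branch

CUT_TRIGGER = np.uint32(1 << 0)
CUT_ID = np.uint32(1 << 1)
CUT_ISO = np.uint32(1 << 2)

mask = CUT_TRIGGER | CUT_ISO
passing = (cutbits & mask) == mask  # boolean column: events passing both cuts
print(passing.sum(), "events pass trigger+iso")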

"cannot allocate memory" error on LPC when following documentation example

I copied the first four blocks of code from the documentation into a Jupyter notebook running on LPC nodes. Whenever I try to execute run_uproot_job, I get a "cannot allocate memory" error. After some initial debugging, I updated all the dependencies to the latest versions, but the error persists. I also tried with two different users on the LPC, with the same error. Screenshots of the error are below:

[screenshots of the "cannot allocate memory" error omitted]
