coffea's People

Contributors

agoose77, andrissp, andrzejnovak, annawoodard, bengalewsky, bfis, btovar, dependabot[bot], dntaylor, dthain, gordonwatts, ikrommyd, jayjeetatgithub, jpata, jrueb, kondratyevd, kratsg, lgray, moaly98, nikoladze, nsmith-, paulgessinger, pfackeldey, pre-commit-ci[bot], pviscone, riga, samkelson, saransh-cpp, valsdav, yimuchen

coffea's Issues

JetMET txt_converters creates invalid jagged array

Describe the bug
Thanks for the nice JME tools! I wanted to report the following issue while I also debug it myself; perhaps we can find the solution faster :)

coffea.lookup_tools.extractor produces an incorrect internal jagged array on the official JME file Summer16_07Aug2017_V11_L1fix_MC_L2Relative_AK4PFchs.txt (added for reference) with the following error:
ValueError: maximum offset 2498 is beyond the length of the content (2497)

To Reproduce

Run the following

from coffea.lookup_tools import extractor

extract = extractor()
extract.add_weight_sets(['* * Summer16_07Aug2017_V11_L1fix_MC_L2Relative_AK4PFchs.txt'])
extract.finalize()
extract.make_evaluator()

To induce the crash at init-time rather than eval-time:

diff --git a/coffea/lookup_tools/txt_converters.py b/coffea/lookup_tools/txt_converters.py
index 73a49ec..26de60e 100644
--- a/coffea/lookup_tools/txt_converters.py
+++ b/coffea/lookup_tools/txt_converters.py
@@ -146,7 +146,9 @@ def _build_standard_jme_lookup(name, layout, pars, nBinnedVars, nBinColumns,
     parm_order = []
     offset_col = 2 * nBinnedVars + 1 + int(not interpolatedFunc) * 2 * nEvalVars
     for i in range(nParms):
-        parms.append(awkward.JaggedArray.fromcounts(jagged_counts, pars[columns[i + offset_col]]))
+        jag = awkward.JaggedArray.fromcounts(jagged_counts, pars[columns[i + offset_col]])
+        assert(jag._valid())
+        parms.append(jag)
         parm_order.append('p%i' % (i))

     wrapped_up = {}

Expected behavior
It works with the older supplied JME files, so it looks like either the JME file is bugged or the parser produces incorrect output.

Output

  File "test_jme.py", line 4, in <module>
    extract.add_weight_sets(['* * data/Summer16_07Aug2017_V11_L1fix_MC_L2Relative_AK4PFchs.txt'])
  File "/Users/joosep/Documents/caltech/hepaccelerate-cms/coffea/coffea/lookup_tools/extractor.py", line 75, in add_weight_sets
    self.import_file(thefile)
  File "/Users/joosep/Documents/caltech/hepaccelerate-cms/coffea/coffea/lookup_tools/extractor.py", line 100, in import_file
    self._filecache[thefile] = file_converters[theformat][thetype](thefile)
  File "/Users/joosep/Documents/caltech/hepaccelerate-cms/coffea/coffea/lookup_tools/txt_converters.py", line 163, in _convert_standard_jme_txt_file
    return _build_standard_jme_lookup(*_parse_jme_formatted_file(jmeFilePath))
  File "/Users/joosep/Documents/caltech/hepaccelerate-cms/coffea/coffea/lookup_tools/txt_converters.py", line 150, in _build_standard_jme_lookup
    assert(jag._valid())
  File "/Users/joosep/anaconda3/lib/python3.6/site-packages/awkward/array/jagged.py", line 439, in _valid
    raise ValueError("maximum offset {0} is beyond the length of the content ({1})".format(self._offsets.max(), len(self._content)))
ValueError: maximum offset 2498 is beyond the length of the content (2497)

Desktop (please complete the following information):

$ python --version
Python 3.6.7 :: Anaconda custom (64-bit)
$ uname -a
Darwin dhcp-112-136.caltech.edu 18.6.0 Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64 x86_64
>>> awkward.__version__
'0.11.1'
coffea revision e776a80762ec17ea9bc05c35f5019d4edc04f091

Additional context
I believe the issue is happening here when creating jagged_counts, but so far I haven't traced it down exactly: https://github.com/CoffeaTeam/coffea/blob/master/coffea/lookup_tools/txt_converters.py#L136
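
For anyone reproducing this in isolation, here is a minimal illustration of the failure mode (toy numbers, not the JME parser itself): if the counts handed to JaggedArray.fromcounts sum to more than the content holds, _valid() raises exactly the error above.

import numpy as np
import awkward  # awkward 0.x, as used by coffea at the time

content = np.arange(2497)
counts = np.array([1000, 1000, 498])  # sums to 2498, one more than len(content) == 2497

jag = awkward.JaggedArray.fromcounts(counts, content)
jag._valid()  # ValueError: maximum offset 2498 is beyond the length of the content (2497)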

support dataset-specific treenames

in some workflows (ATLAS), the inputs to coffea are TTrees whose names vary from file to file

one naming scheme is, e.g., a tree name per dataset:

fileset = {
    'DoubleMuon': {
        'treename': 'tree1',
        'files': [
            ....
        ]
    },
    'SingleMuon': {
        'treename': 'tree2',
        'files': [
            ....
        ]
    }
}

it would be really nice to be able to declare such an input to coffea

Different interfaces for hist.rebin() and hist.group()

Signature: hfcat.rebin(old_axis, new_axis)

vs.

Signature: hfcat.group(new_axis, old_axes, mapping, overflow='none')

Since these are conceptually similar operations, ideally the order of arguments would be the same (old_axis, new_axis, ...).

Timeout SIGALRM kills Parsl workers

The timeout decorator here is resulting in a SIGALRM being raised which kills the worker running the task.

Here is a minimal reproducible example of the error:

import time

import parsl
from parsl.app.app import python_app
from parsl.configs.htex_local import config
config.executors[0].max_workers = 1

from coffea.processor.parsl.timeout import timeout

parsl.set_stream_logger()
parsl.load(config)

@python_app
@timeout
def sleep(seconds, timeout=None):
    import time
    time.sleep(seconds)

    return 'slept {} seconds'.format(seconds)

futures = [sleep(i, timeout=2) for i in [1, 3, 1]]
for f in futures:
    try:
        print('*' * 50 + ' FIRST BATCH: ' + f.result())
    except Exception as e:
        print('caught exception: {}'.format(e))

# comment out to see expected behavior
time.sleep(5)

futures = [sleep(i, timeout=2) for i in [1, 1]]
for f in futures:
    print('*' * 50 + ' SECOND BATCH: ' + f.result())

The expected behavior is for two FIRST BATCH tasks to complete successfully and one to fail, then for the two SECOND BATCH tasks to complete successfully; this can be seen if the time.sleep(5) in the MRE is commented out:

2019-06-28 19:18:14 parsl.dataflow.dflow:464 [INFO]  Task 0 launched on executor htex_local
2019-06-28 19:18:14 parsl.dataflow.dflow:688 [INFO]  Task 1 submitted for App sleep, waiting on tasks []
2019-06-28 19:18:14 parsl.dataflow.dflow:698 [DEBUG]  Task 1 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f074d01c5f8 state=pending> parent=None>
2019-06-28 19:18:14 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f074d020378> to queue with args (3,)
2019-06-28 19:18:14 parsl.dataflow.dflow:464 [INFO]  Task 1 launched on executor htex_local
2019-06-28 19:18:14 parsl.dataflow.dflow:688 [INFO]  Task 2 submitted for App sleep, waiting on tasks []
2019-06-28 19:18:14 parsl.dataflow.dflow:698 [DEBUG]  Task 2 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f074d01c940 state=pending> parent=None>
2019-06-28 19:18:14 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f074d020378> to queue with args (1,)
2019-06-28 19:18:14 parsl.dataflow.dflow:464 [INFO]  Task 2 launched on executor htex_local
2019-06-28 19:18:15 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3841105]
2019-06-28 19:18:15 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 3 active tasks, 1/0/0 running/submitted/pending blocks, and 0 connected workers
2019-06-28 19:18:18 parsl.dataflow.dflow:280 [INFO]  Task 0 completed
************************************************** FIRST BATCH: slept 1 seconds
2019-06-28 19:18:20 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3841105]
2019-06-28 19:18:20 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:18:20 parsl.app.errors:158 [DEBUG]  Reraising exception of type <class 'Exception'>
2019-06-28 19:18:20 parsl.dataflow.dflow:250 [ERROR]  Task 1 failed
Traceback (most recent call last):
  File "/afs/hep.wisc.edu/home/awoodard/.local/lib/python3.6/site-packages/parsl-0.8.0-py3.6.egg/parsl/dataflow/dflow.py", line 247, in handle_exec_update
    res.reraise()
  File "/afs/hep.wisc.edu/home/awoodard/.local/lib/python3.6/site-packages/parsl-0.8.0-py3.6.egg/parsl/app/errors.py", line 163, in reraise
    reraise(t, v, tb)
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_95apython3/x86_64-centos7-gcc7-opt/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/afs/hep.wisc.edu/home/awoodard/.local/lib/python3.6/site-packages/parsl-0.8.0-py3.6.egg/parsl/app/errors.py", line 172, in wrapper
    return func(*args, **kwargs)
  File "/afs/hep.wisc.edu/user/awoodard/coffea/coffea/processor/parsl/timeout.py", line 23, in wrapper
    retval = func(*args, **kwargs)
  File "test.py", line 17, in sleep
    time.sleep(seconds)
  File "/afs/hep.wisc.edu/user/awoodard/coffea/coffea/processor/parsl/timeout.py", line 15, in _timeout_handler
    raise Exception("Timeout hit")
Exception: Timeout hit
2019-06-28 19:18:20 parsl.dataflow.dflow:271 [INFO]  Task 1 failed after 0 retry attempts
2019-06-28 19:18:20 parsl.app.errors:158 [DEBUG]  Reraising exception of type <class 'Exception'>
caught exception: Timeout hit
2019-06-28 19:18:22 parsl.dataflow.dflow:280 [INFO]  Task 2 completed
************************************************** FIRST BATCH: slept 1 seconds
2019-06-28 19:18:22 parsl.dataflow.dflow:688 [INFO]  Task 3 submitted for App sleep, waiting on tasks []
2019-06-28 19:18:22 parsl.dataflow.dflow:698 [DEBUG]  Task 3 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f074d01ce10 state=pending> parent=None>
2019-06-28 19:18:22 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f074d020378> to queue with args (1,)
2019-06-28 19:18:22 parsl.dataflow.dflow:464 [INFO]  Task 3 launched on executor htex_local
2019-06-28 19:18:22 parsl.dataflow.dflow:688 [INFO]  Task 4 submitted for App sleep, waiting on tasks []
2019-06-28 19:18:22 parsl.dataflow.dflow:698 [DEBUG]  Task 4 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f074d02d828 state=pending> parent=None>
2019-06-28 19:18:22 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f074d020378> to queue with args (1,)
2019-06-28 19:18:22 parsl.dataflow.dflow:464 [INFO]  Task 4 launched on executor htex_local
2019-06-28 19:18:23 parsl.dataflow.dflow:280 [INFO]  Task 3 completed
************************************************** SECOND BATCH: slept 1 seconds
2019-06-28 19:18:25 parsl.dataflow.dflow:280 [INFO]  Task 4 completed
************************************************** SECOND BATCH: slept 1 seconds
2019-06-28 19:18:25 parsl.dataflow.dflow:823 [INFO]  DFK cleanup initiated
2019-06-28 19:18:25 parsl.dataflow.dflow:734 [INFO]  Summary of tasks in DFK:
2019-06-28 19:18:25 parsl.dataflow.dflow:765 [INFO]  Tasks in state States.done: 0, 2, 3, 4
2019-06-28 19:18:25 parsl.dataflow.dflow:765 [INFO]  Tasks in state States.failed: 1
2019-06-28 19:18:25 parsl.dataflow.dflow:772 [INFO]  End of summary
2019-06-28 19:18:25 parsl.dataflow.dflow:847 [INFO]  Terminating flow_control and strategy threads
2019-06-28 19:18:25 parsl.executors.high_throughput.executor:484 [DEBUG]  [HOLD_BLOCK]: Sending hold to manager: 51fc0f805958
2019-06-28 19:18:25 parsl.executors.high_throughput.executor:452 [DEBUG]  Sent hold request to worker: 51fc0f805958
2019-06-28 19:18:25 parsl.providers.local.local:235 [DEBUG]  Terminating job/proc_id: 3841105
2019-06-28 19:18:25 parsl.executors.high_throughput.executor:618 [INFO]  Attempting HighThroughputExecutor shutdown
2019-06-28 19:18:25 parsl.executors.high_throughput.executor:622 [INFO]  Finished HighThroughputExecutor shutdown attempt
2019-06-28 19:18:25 parsl.data_provider.data_manager:98 [DEBUG]  Done with executor shutdown
2019-06-28 19:18:25 parsl.dataflow.dflow:879 [INFO]  DFK cleanup complete

Running the unedited MRE, however, we never see the SECOND BATCH tasks:

2019-06-28 19:20:49 parsl.dataflow.dflow:464 [INFO]  Task 0 launched on executor htex_local
2019-06-28 19:20:49 parsl.dataflow.dflow:688 [INFO]  Task 1 submitted for App sleep, waiting on tasks []
2019-06-28 19:20:49 parsl.dataflow.dflow:698 [DEBUG]  Task 1 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f7f8003d5c0 state=pending> parent=None>
2019-06-28 19:20:49 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f7f80041378> to queue with args (3,)
2019-06-28 19:20:49 parsl.dataflow.dflow:464 [INFO]  Task 1 launched on executor htex_local
2019-06-28 19:20:49 parsl.dataflow.dflow:688 [INFO]  Task 2 submitted for App sleep, waiting on tasks []
2019-06-28 19:20:49 parsl.dataflow.dflow:698 [DEBUG]  Task 2 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f7f8003d908 state=pending> parent=None>
2019-06-28 19:20:49 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f7f80041378> to queue with args (1,)
2019-06-28 19:20:49 parsl.dataflow.dflow:464 [INFO]  Task 2 launched on executor htex_local
2019-06-28 19:20:50 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:20:50 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 3 active tasks, 1/0/0 running/submitted/pending blocks, and 0 connected workers
2019-06-28 19:20:53 parsl.dataflow.dflow:280 [INFO]  Task 0 completed
************************************************** FIRST BATCH: slept 1 seconds
2019-06-28 19:20:55 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:20:55 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:20:55 parsl.app.errors:158 [DEBUG]  Reraising exception of type <class 'Exception'>
2019-06-28 19:20:55 parsl.dataflow.dflow:250 [ERROR]  Task 1 failed
Traceback (most recent call last):
  File "/afs/hep.wisc.edu/home/awoodard/.local/lib/python3.6/site-packages/parsl-0.8.0-py3.6.egg/parsl/dataflow/dflow.py", line 247, in handle_exec_update
    res.reraise()
  File "/afs/hep.wisc.edu/home/awoodard/.local/lib/python3.6/site-packages/parsl-0.8.0-py3.6.egg/parsl/app/errors.py", line 163, in reraise
    reraise(t, v, tb)
  File "/cvmfs/sft.cern.ch/lcg/views/LCG_95apython3/x86_64-centos7-gcc7-opt/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/afs/hep.wisc.edu/home/awoodard/.local/lib/python3.6/site-packages/parsl-0.8.0-py3.6.egg/parsl/app/errors.py", line 172, in wrapper
    return func(*args, **kwargs)
  File "/afs/hep.wisc.edu/user/awoodard/coffea/coffea/processor/parsl/timeout.py", line 23, in wrapper
    retval = func(*args, **kwargs)
  File "test.py", line 17, in sleep
    time.sleep(seconds)
  File "/afs/hep.wisc.edu/user/awoodard/coffea/coffea/processor/parsl/timeout.py", line 15, in _timeout_handler
    raise Exception("Timeout hit")
Exception: Timeout hit
2019-06-28 19:20:55 parsl.dataflow.dflow:271 [INFO]  Task 1 failed after 0 retry attempts
2019-06-28 19:20:55 parsl.app.errors:158 [DEBUG]  Reraising exception of type <class 'Exception'>
caught exception: Timeout hit
2019-06-28 19:20:57 parsl.dataflow.dflow:280 [INFO]  Task 2 completed
************************************************** FIRST BATCH: slept 1 seconds
2019-06-28 19:21:00 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:21:00 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 0 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:21:00 parsl.dataflow.strategy:230 [DEBUG]  Executor htex_local has 0 active tasks; starting kill timer (if idle time exceeds 120s, resources will be removed)
2019-06-28 19:21:02 parsl.dataflow.dflow:688 [INFO]  Task 3 submitted for App sleep, waiting on tasks []
2019-06-28 19:21:02 parsl.dataflow.dflow:698 [DEBUG]  Task 3 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f7f8003ddd8 state=pending> parent=None>
2019-06-28 19:21:02 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f7f80041378> to queue with args (1,)
2019-06-28 19:21:02 parsl.dataflow.dflow:464 [INFO]  Task 3 launched on executor htex_local
2019-06-28 19:21:02 parsl.dataflow.dflow:688 [INFO]  Task 4 submitted for App sleep, waiting on tasks []
2019-06-28 19:21:02 parsl.dataflow.dflow:698 [DEBUG]  Task 4 set to pending state with AppFuture: <AppFuture super=<AppFuture at 0x7f7f8004e7f0 state=pending> parent=None>
2019-06-28 19:21:02 parsl.executors.high_throughput.executor:514 [DEBUG]  Pushing function <function sleep at 0x7f7f80041378> to queue with args (1,)
2019-06-28 19:21:02 parsl.dataflow.dflow:464 [INFO]  Task 4 launched on executor htex_local
2019-06-28 19:21:05 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:21:05 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:21:10 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:21:10 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:21:15 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:21:15 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:21:20 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:21:20 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
2019-06-28 19:21:25 parsl.providers.local.local:92 [DEBUG]  Checking status of: [3842571]
2019-06-28 19:21:25 parsl.dataflow.strategy:204 [DEBUG]  Executor htex_local has 2 active tasks, 1/0/0 running/submitted/pending blocks, and 1 connected workers
[ad infinitum]

And looking at the running processes one can see the dead worker:

awoodard 3802394  0.3  0.0      0     0 pts/20   Z    18:54   0:00 [process_worker_] <defunct>
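
For reference, a minimal sketch (assuming the decorated function takes its limit via a timeout keyword, as in the MRE) of how the decorator could avoid leaving a live alarm behind; this is not coffea's actual fix, just one way to cancel the alarm and restore the previous SIGALRM handler so a stray alarm can never fire in the worker after the task returns.

import functools
import signal


def timeout(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        seconds = kwargs.get('timeout', None)

        def _timeout_handler(signum, frame):
            raise Exception("Timeout hit")

        if seconds is not None:
            previous = signal.signal(signal.SIGALRM, _timeout_handler)
            signal.alarm(int(seconds))
        try:
            return func(*args, **kwargs)
        finally:
            if seconds is not None:
                signal.alarm(0)  # cancel any pending alarm
                signal.signal(signal.SIGALRM, previous)  # restore the old handler
    return wrapper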

Enable various executors to operate in memory constrained environments

A side-thought from #85.

Use lazy arrays to stay within some memory usage bound, so that we can always comfortably fit in batch slots when using uproot.

There's chunksize, which sets a reasonable number of events per job; beyond that we need either a cache when reading from uproot with lazy arrays (sub-chunking?), or some way to track how much memory is being used by awkward/numpy arrays.
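
A minimal sketch of the cache side of this, assuming uproot's lazy-array machinery can be handed any MutableMapping as its cache (class name and limit are illustrative, not a coffea API): a dict-like container that evicts entries once the total nbytes goes over a budget.

from collections import OrderedDict
from collections.abc import MutableMapping


class BoundedArrayCache(MutableMapping):
    """Evict least-recently-added arrays once the total nbytes exceeds a budget."""

    def __init__(self, limit_bytes):
        self._limit = limit_bytes
        self._data = OrderedDict()
        self._nbytes = 0

    def __setitem__(self, key, array):
        if key in self._data:
            del self[key]
        self._data[key] = array
        self._nbytes += getattr(array, "nbytes", 0)
        # evict oldest entries until we are back under budget
        while self._nbytes > self._limit and len(self._data) > 1:
            _, evicted = self._data.popitem(last=False)
            self._nbytes -= getattr(evicted, "nbytes", 0)

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        array = self._data.pop(key)
        self._nbytes -= getattr(array, "nbytes", 0)

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)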

Matching function factory

Is your feature request related to a problem? Please describe.

Feature request to improve code interface.

Describe the solution you'd like

dR_matcher = match_function_factory(deltaRCut=0.3)
jagged_gen.match(jagged_reco, dR_matcher)

Describe alternatives you've considered

No alternatives considered. I'm pretty stubborn.

Additional context

This makes it a bit more flexible and is very similar to how fastjet defines these things. This would also make the interface for function matching much easier to deal with by allowing different match functions to have different interfaces without affecting/rewriting the underlying matching code.
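
A rough sketch of the kind of factory this describes (the names come from the request above; the deltaR arithmetic is plain numpy, not an existing coffea helper):

import numpy as np


def match_function_factory(deltaRCut=0.3):
    # returns a predicate the generic matching code can apply to candidate pairs
    def matcher(eta1, phi1, eta2, phi2):
        deta = eta1 - eta2
        dphi = (phi1 - phi2 + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi)
        return np.hypot(deta, dphi) < deltaRCut
    return matcher


dR_matcher = match_function_factory(deltaRCut=0.3)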

make progress of futures based on callbacks

Would there be opposition to changing

with tqdm(disable=not status, unit=unit, total=len(futures), desc=desc) as pbar:
    while len(futures) > 0:
        finished = set(job for job in futures if job.done())
        for job in finished:
            accumulator += job.result()
            pbar.update(1)
        futures -= finished
        del finished
        time.sleep(1)

into

for job in tqdm(concurrent.futures.as_completed(futures), disable=not status, unit=unit, total=len(futures)):
    accumulator += job.result()

which relies on callbacks instead of polling every 1s (I guess this saves 1/2 sec on average). Also, as_completed can take a timeout parameter.

If you think it's useful, I can make a small PR later today.

Adding an option to automatically skip failed to read files in processor

Is your feature request related to a problem? Please describe.
Reading xrootd files, sometimes a file or two gets corrupted or is just temporarily unreadable via xrootd. This currently throws an error, possibly wasting time, and makes me continuously update the file dictionary I'm feeding it.

Describe the solution you'd like
Adding a flag to the processor/run_uproot_job to instead skip a file if it's unreadable, and maybe record the number and names of the skipped files to print at the end so the user can decide whether to ignore or fix them.
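
A usage sketch of the requested flag (the name skipbadfiles, its placement in executor_args, and the fileset/MyProcessor objects are assumptions for illustration, not an existing option):

output = processor.run_uproot_job(
    fileset,
    treename='Events',
    processor_instance=MyProcessor(),
    executor=processor.futures_executor,
    executor_args={'workers': 4, 'skipbadfiles': True},
)
# at the end, the count and names of skipped files would be reported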

Saving recHits for columnar processing

Another interesting use case would be iterating over a recHits collection. In our analysis we try to match two different track reconstruction algorithms based on the ID of the hits in each object. This information is only in AOD (I believe), so right now this gets done at the ntuplizer level. But if it were possible to process it with a columnar approach, then I could maybe add the recHit information to the flat ntuples, using an index offset marker to separate hits belonging to different objects, as suggested by Nick (and similar to the structure of JaggedArrays themselves).
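
A minimal sketch of the offset-marker idea using the awkward 0.x API coffea relied on at the time (branch names are made up): store the hits flat plus a per-object count, and regroup them into a jagged structure at analysis time.

import numpy as np
import awkward

hit_ids = np.array([11, 12, 13, 21, 22, 31])  # flat recHit ID branch in the ntuple
hits_per_track = np.array([3, 2, 1])          # marker branch: number of hits per track

hits_by_track = awkward.JaggedArray.fromcounts(hits_per_track, hit_ids)
# hits_by_track[0] -> [11, 12, 13]; matching two algorithms is then set logic per track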

coffea.hist.plot.clopper_pearson_interval gives wrong result for numerator==denominator

Describe the bug
If numerator == denominator, clopper_pearson_interval always returns [1,1] as the corresponding interval.

To Reproduce

from coffea.hist.plot import clopper_pearson_interval
clopper_pearson_interval(10,10,0.95)
# Result: array([1., 1.])

Expected behavior
I expected to get array([0.69150289, 1.])

Proposed fix
This is trivially fixed by changing this line from

interval[:, num == denom] = 1.

to

interval[1, num == denom] = 1.

If you agree, I'd be happy to make a PR.

Size-ordered stack plots in coffea.hist.plot1d

Is your feature request related to a problem? Please describe.
When creating a typical stack plot using hist.plot1d(histo, overlay='dataset', stack=True), the stacked histograms are drawn in whatever order histo.identifiers() delivers them (see here).

Describe the solution you'd like
I would like to be able to sort the stacked histograms so that they are plotted in increasing order of their integral (i.e. small ones first so that they are visible in a log plot).

I think this could be implemented by changing hist.identifiers() to add an optional sorting argument, which could be used to turn on integral-based sorting. This would imply that the call to the identifiers() function would trigger integration.
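
In the meantime, a hand-rolled ordering along those lines might look like this (hedged: identifiers/integrate/values are existing coffea.hist calls, assuming 'dataset' is the only sparse axis; a sorting hook inside identifiers() is exactly what is being requested here):

order = sorted(
    histo.identifiers('dataset'),
    key=lambda ident: histo.integrate('dataset', ident).values()[()].sum(),
)
# 'order' now lists datasets from smallest to largest integral, ready to be
# used once plot1d/identifiers grow a sorting option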

Describe alternatives you've considered
None.

Additional context
None.

possibly make preprocessing step more lazy

I was playing around with these tools and some local NanoAOD based on the example notebook with a DimuonProcessor and processor.run_uproot_job. The preprocessing step was taking a while since I have a large number of files and I only wanted to test one or two chunks, so I made a couple of additions in https://github.com/aminnj/fnal-column-analysis-tools/commit/61108df7fa72b12ca538ebda307200d9ce56a70d , including a maxchunks parameter which limits the number of chunks per dataset.

Currently the event chunking loops through files and calls uproot.numentries. I made it multi-threaded since it takes an executor. And if maxchunks is specified, making the chunking lazy saves a lot of time. I didn't tune the magic numbers too hard, though for single digit file counts, the threadpool overhead wasn't worth it.

Is there another feature I missed, or can this be done in a better way?
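
For concreteness, a rough sketch of what lazier chunking can look like (a hypothetical helper, not the code in the linked commit): stop calling uproot.numentries as soon as maxchunks chunks have been produced for the dataset.

import uproot


def iter_chunks(files, treename, chunksize, maxchunks=None):
    nchunks = 0
    for filename in files:
        nentries = uproot.numentries(filename, treename)
        for start in range(0, nentries, chunksize):
            if maxchunks is not None and nchunks >= maxchunks:
                return
            yield (filename, start, min(start + chunksize, nentries))
            nchunks += 1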

Specify a per-module array backend (numpy or cupy) for modules

Is your feature request related to a problem? Please describe.
It can be useful to switch to a GPU-based backend such as cupy on a per-module basis without needing the whole coffea ecosystem to be ported. For example, factorized JEC uncertainty computation is mainly based on interp1d and lookups which are highly efficient on a GPU.

Describe the solution you'd like
Each module would have to carry a configurable backend option. A backend could be as simple as setting JetCorrectionUncertainty.numpy_lib = numpy (or cupy), but given the imperfect feature parity between the two libraries and possible future alternatives, a simple backend class providing the wrapping functionality would likely be needed, e.g. coffea.backends.NumpyBackend, CupyBackend. The backend would either rely on the inputs already being located on the correct device or do the transfer on the fly, depending on the design choice.
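
A bare-bones sketch of the backend idea (class and attribute names are illustrative, not an existing coffea interface): each lookup-style module holds a pluggable array library and routes its math through it.

import numpy


class NumpyBackend:
    lib = numpy

    @staticmethod
    def to_device(array):
        return array  # already host memory


class FactorizedLookup:
    backend = NumpyBackend  # swap in a CupyBackend to run the same math on GPU

    def evaluate(self, pt):
        xp = self.backend.lib
        pt = self.backend.to_device(pt)
        return xp.clip(xp.log(pt), 0.0, None)  # stand-in for the real correction math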

Describe alternatives you've considered
A switch can be made globally in coffea.util.np, but that is non-ideal before awkward-array works completely with another backend. A physics user may want to use only a part of the coffea library on a certain device.

Additional context
I've made a first attempt at converting the JEC code to a GPU in my fork jpata#1. I've seen a performance improvement of roughly 7x (160s on a CPU vs 24s on a GPU) on computing factorized JEC on 10M random jets.

coffea interface with pyhf - statistical inference

/cc @matthewfeickert @lukasheinrich

Is your feature request related to a problem? Please describe.

Feature request.

Describe the solution you'd like

An interface/way to export to pyhf JSON to perform binned statistical fits (this should work for anything that lets you make histograms, basically). The pyhf workspaces follow the schema (v1.0.0) defined here: https://diana-hep.org/pyhf/schemas/1.0.0/workspace.json. If you see this issue in the future, we might be at a later version of the schema.
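
To make the request concrete, a hedged sketch of the glue implied here: pull bin contents out of a coffea histogram and drop them into a workspace dict following the schema linked above. The histogram, dataset, and channel names are hypothetical; only the JSON layout comes from the pyhf v1.0.0 workspace schema.

mc = histo.integrate('dataset', 'ttbar').values()[()].tolist()
obs = histo.integrate('dataset', 'data').values()[()].tolist()

workspace = {
    "channels": [{
        "name": "signal_region",
        "samples": [{
            "name": "ttbar",
            "data": mc,
            "modifiers": [{"name": "mu", "type": "normfactor", "data": None}],
        }],
    }],
    "observations": [{"name": "signal_region", "data": obs}],
    "measurements": [{"name": "fit", "config": {"poi": "mu", "parameters": []}}],
    "version": "1.0.0",
}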

Describe alternatives you've considered

None.

Additional context

A similar issue is on-going to get pyhf and zfit working together as pyhf is binned, but zfit is unbinned: zfit/zfit#120.

Yet another histogramming standard

  • Pretty stack plots
  • The error bars should be correct
  • Should be able to read in from and output to uproot
    • Maybe pandas and other things too

Log and sqrt functions in btag scale factor formulas

Functions such as log and sqrt show up in the btagging SF formulas for DeepCSV in 2017 & 2018, which break the extractor. Using a csv with these types of formulas in the lookup_tools extractor results in a KeyError for 'log' in the make_evaluator step:

from coffea.lookup_tools import extractor
ext = extractor()
ext.add_weight_sets(["btag * DeepCSV_102XSF_V1.csv"])
ext.finalize()
evaluator = ext.make_evaluator()

If the log and sqrt functions are replaced in the .csv with np.log and np.sqrt, it works fine.
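
Until the extractor handles these itself, the workaround above can be scripted in a few lines (a sketch; it assumes the formula strings in the CSV only ever use log( and sqrt( in the mathematical sense):

with open("DeepCSV_102XSF_V1.csv") as fin:
    text = fin.read()

text = text.replace("log(", "np.log(").replace("sqrt(", "np.sqrt(")

with open("DeepCSV_102XSF_V1_fixed.csv", "w") as fout:
    fout.write(text)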

iterative_executor crashes

I think that, because of

executor_args.setdefault('workers', 1)

and

executor(items, _work_function, output, **executor_args)

the call to

def iterative_executor(items, function, accumulator, status=True, unit='items', desc='Processing',

will fail, since iterative_executor does not accept a workers argument. Adding workers=1 to the argument list of iterative_executor fixes it for me.
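
The suggested fix, sketched (signature abbreviated; only the added workers=1 matters):

def iterative_executor(items, function, accumulator, workers=1,
                       status=True, unit='items', desc='Processing'):
    # 'workers' is accepted (and ignored) only so that **executor_args
    # containing 'workers' no longer raises a TypeError
    ...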

Pythonic interface for add_weight_sets

Rather than taking a particular space-separated string, add_weight_sets() could take a list of dicts, each with the appropriate parameters for add_weight_set(). Alternatively, a namedtuple could be used.
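
A sketch of what that could look like (the key names are hypothetical, mirroring the pieces of today's space-separated string: local name, name in the file, and file path; the file names are the ones used elsewhere in these issues):

ext.add_weight_sets([
    {'local_name': 'btag', 'name_in_file': '*', 'file': 'DeepCSV_102XSF_V1.csv'},
    {'local_name': '*', 'name_in_file': '*', 'file': 'Summer16_07Aug2017_V11_L1fix_MC_L2Relative_AK4PFchs.txt'},
])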

ZIP files through XRootD

@lgray @nsmith- We've talked about the need to access NPZ (and other ZIP) files through XRootD offline; here's the start of a discussion about that.

Python's zipfile module can accept any object as a "file," as long as it supports the file-like protocol. Unlike MutableMapping and such, the file-like protocol wasn't formalized in collections.abc, so I did a test to figure out what methods it needs.

To start with, here's Awkward's save/load interface, which is backed by backward and forward compatible ZIP files. This is what I'll be using for the test.

import awkward

a = awkward.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
b = awkward.fromiter([100, 200, 300])
print(a)
print(b)

awkward.save("tmp.awkd", {"a": a, "b": b})  # if a dict, save as named arrays in file
file = awkward.load("tmp.awkd")             # and then read-back file is dict-like

a2 = file["a"]                              # reads upon access; why we want XRootD
b2 = file["b"]
print(a2)
print(b2)

Now to find out how much of the file-like interface it uses (in practice, for this file).

class FileLike(object):
    def __init__(self, *args, **kwargs):
        print("__init__")
        self.f = open(*args, **kwargs)
        self.methods = set()

    def close(self):
        print("close")
        self.methods.add("close")
        return self.f.close()

    def flush(self):
        print("flush")
        self.methods.add("flush")
        return self.f.flush()

    def fileno(self):
        print("fileno")
        self.methods.add("fileno")
        return self.f.fileno()

    def isatty(self):
        print("isatty")
        self.methods.add("isatty")
        return self.f.isatty()

    def next(self):
        print("next")
        self.methods.add("next")
        return self.f.next()

    def read(self, *args):
        self.methods.add("read")
        if len(args) == 0:
            print("read")
            return self.f.read()
        else:
            print("read", args[0])
            return self.f.read(args[0])

    def readline(self, *args):
        self.methods.add("readline")
        if len(args) == 0:
            print("readline")
            return self.f.readline()
        else:
            print("readline", args[0])
            return self.f.readline(args[0])

    def readlines(self, *args):
        if len(args) == 0:
            print("readlines")
            self.methods.add("readlines")
            return self.f.readlines()
        else:
            print("readlines", args[0])
            self.methods.add("readlines")
            return self.f.readlines(args[0])

    def xreadlines(self):
        print("xreadlines")
        self.methods.add("xreadlines")
        return self.f.xreadlines()

    def seek(self, offset, *args):
        self.methods.add("seek")
        if len(args) == 0:
            print("seek", offset)
            return self.f.seek(offset)
        else:
            print("seek", offset, args[0])
            return self.f.seek(offset, args[0])

    def tell(self):
        print("tell")
        self.methods.add("tell")
        return self.f.tell()

    def truncate(self, *args):
        self.methods.add("truncate")
        if len(args) == 0:
            print("truncate")
            return self.f.truncate()
        else:
            print("truncate", args[0])
            return self.f.truncate(args[0])

    def write(self, data):
        print("write", len(data))
        self.methods.add("write")
        return self.f.write(data)

    def writelines(self, data):
        print("writelines", data)
        self.methods.add("writelines")
        return self.f.writelines(data)

    @property
    def closed(self):
        print("closed")
        self.methods.add("closed")
        return self.f.closed

    @property
    def encoding(self):
        print("encoding")
        self.methods.add("encoding")
        return self.f.encoding

    @property
    def errors(self):
        print("errors")
        self.methods.add("errors")
        return self.f.errors

    @property
    def mode(self):
        print("mode")
        self.methods.add("mode")
        return self.f.mode

    @property
    def name(self):
        print("name")
        self.methods.add("name")
        return self.f.name

    @property
    def newlines(self):
        print("newlines")
        self.methods.add("newlines")
        return self.f.newlines

    @property
    def softspace(self):
        print("softspace")
        self.methods.add("softspace")
        return self.f.softspace

    @property
    def seekable(self):
        print("seekable")
        self.methods.add("seekable")
        return self.f.seekable

    def __enter__(self):
        print("__enter__")
        self.methods.add("__enter__")
        return self.f.__enter__()

    def __exit__(self, exc_type, exc_val, exc_tb):
        print("__exit__", exc_type, exc_val, exc_tb)
        self.methods.add("__exit__")
        return self.f.__exit__(exc_type, exc_val, exc_tb)

And we run the test to see which ones are actually used.

f = FileLike("tmp.awkd", "wb")
awkward.save(f, {"a": a, "b": b})
f.methods
# uses {'flush', 'name', 'tell', 'seek', 'write'}

f = FileLike("tmp.awkd", "rb")
file = awkward.load(f)
f.methods
# uses {'read', 'seek', 'name', 'tell'}

f.methods = set()
a2 = file["a"]
f.methods
# uses {'seek', 'tell', 'read', 'seekable'}

b2 = file["b"]
f.methods
# uses {'seek', 'tell', 'read', 'seekable'}

So, in the end, we need a shim that wraps a pyxrootd.File with methods for open, read, write, seek, tell, flush, name (property), and seekable (property).
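
A hedged sketch of such a shim, exposing just the file-like methods the test above showed are needed. The open/read/write/stat calls follow the XRootD python bindings as I understand them; treat the exact signatures as assumptions, and note that status checking and error handling are omitted.

from XRootD import client  # the pyxrootd bindings


class XRootDFileLike(object):
    """Minimal file-like wrapper around an XRootD file for zipfile/awkward."""

    def __init__(self, url):
        self.name = url          # 'name' accessed as an attribute, as the test showed
        self._file = client.File()
        self._file.open(url)
        self._pos = 0

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=0):
        if whence == 0:
            self._pos = offset
        elif whence == 1:
            self._pos += offset
        elif whence == 2:
            status, statinfo = self._file.stat()
            self._pos = statinfo.size + offset
        return self._pos

    def read(self, size=-1):
        # size=0 in the XRootD bindings means "read to the end of the file"
        status, data = self._file.read(offset=self._pos, size=max(size, 0))
        self._pos += len(data)
        return data

    def write(self, data):
        self._file.write(data, offset=self._pos)
        self._pos += len(data)
        return len(data)

    def flush(self):
        pass

    def close(self):
        self._file.close()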

Adding legend settings forwarding to coffea.hist.plot1d

Is your feature request related to a problem? Please describe.
Sometimes the default legend settings are not satisfactory. Adjusting them after the legend is created requires additional code (grouping the handles), but it is easy to do when the legend is constructed, so I thought it would be convenient to add another kwarg for legend settings when calling coffea.hist.plot1d. Or does something for this already exist?

Describe the solution you'd like
Add additional kwarg to the plot1d signature, something like legend_opts=None, and forward that when calling ax.legend().
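
A usage sketch of the proposal (legend_opts is the proposed kwarg name, not an existing option; the dict contents are just ordinary matplotlib legend() keywords):

hist.plot1d(histo, overlay='dataset', stack=True,
            legend_opts={'loc': 'upper right', 'ncol': 2, 'fontsize': 'small'})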

Describe alternatives you've considered
N/A

Additional context
N/A

Take advantage of uproot(-methods) profiles.

Is your feature request related to a problem? Please describe.
Uproot has added a feature which assumes a large chunk of the functionality of JaggedCandidateArray or Initializer/Builder. In particular uproot-methods from 0.7.0 exposes a nice SoA object-like interface complete with functional cross references among objects.

The features of JaggedCandidateArray that remain pertinent are the wrappers on top of distincts(), pairs(), and choose() that handle composition of physics objects (things with four-momenta) into new physics objects, where the four-momenta are summed and extraneous properties are dropped for the parent (i.e. Z boson candidates do not have isolation).

Describe the solution you'd like
We should determine what are the pertinent remaining features of JaggedCandidateArray and Initializer/Builder and reduce the scope of the classes as needed.

As another possibility, it may also be a good idea to try to extract the best aspects of both styles of array manipulation and remove the need for two separate packages that do the same thing.

Describe alternatives you've considered
The alternative is doing nothing and ignoring this great new feature, which is silly.

Add remove function for Axis (and simplify syntax)

Is your feature request related to a problem? Please describe.
"I'm always frustrated when [...]" slicing/grouping/rebinning/removing syntax is not intuitive. Related to #90 (comment)

Take rebin, currently

h.rebin(old_axis_name_string, new_axis_object)

In reality, the most likely use case is only a change to the binning, so rebin would save a good chunk of duplication by allowing:

h.rebin(old_axis_name_string, new_bin_array)

(rename is probably rare enough to leave as is)

Also, removing elements could be improved; the current slicing is not entirely intuitive. Say I want to remove a bin from a Cat/Sparse axis:

h[["cat1","cat2"], :]

To something like

h.axis('cat').remove(items)

or

h.remove(items, axis="cat")

or even

h.rebin('cat', ["cat1", "cat2"])

A new axis copy could be created and reinserted into the hist by these functions, keeping the axis object shared in general but allowing operations on it when desired.

Describe alternatives you've considered
A UI is like a joke: if you have to explain it...

Bitwise operations in Coffea

As discussed in the last meeting, an interesting use case might be doing bitwise-arithmetic in columnar format. Sometimes I pre-compute fancier cuts when running the ntuplizer over my AOD samples (e.g to use some of the CMSSW libraries), and save each cut into one bit of a uint32_t variable. If there's a way to build cutflows out of those bits this would be very convenient. Perhaps this is already possible with just the existing infrastructure?

Thanks!
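
For what it's worth, plain numpy bitwise operators already work column-wise on such a branch; a minimal sketch (the branch contents and bit assignments are made up):

import numpy as np

cutbits = np.array([0b101, 0b111, 0b001, 0b110], dtype=np.uint32)  # precomputed cut-mask branch

CUT_TRIGGER = np.uint32(1 << 0)
CUT_ID = np.uint32(1 << 1)
CUT_ISO = np.uint32(1 << 2)

mask = CUT_TRIGGER | CUT_ISO
passing = (cutbits & mask) == mask  # boolean column: events passing both cuts
print(passing.sum(), "events pass trigger+iso")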

"cannot allocate memory" error on LPC when following documentation example

I copied the first four blocks of code from the documentation into a Jupyter notebook running on LPC nodes. Whenever I try to execute run_uproot_job, I get a "cannot allocate memory" error. After some initial debugging, I updated all the dependencies to the latest versions, but the error persists. I also tried with two different users on the LPC, with the same error. Screenshots of the error are below:

[screenshots of the "cannot allocate memory" error omitted]
