alphatwirl / alphatwirl
A Python library for summarizing event data into multivariate categorical data
License: BSD 3-Clause "New" or "Revised" License
At 8db62fc, a few commits after v0.24.2:
Traceback (most recent call last):
File "./bdphi-scripts/bdphiROC/twirl_mktbl_heppy.py", line 721, in <module>
main()
File "./bdphi-scripts/bdphiROC/twirl_mktbl_heppy.py", line 62, in main
run(reader_collector_pairs)
File "./bdphi-scripts/bdphiROC/twirl_mktbl_heppy.py", line 609, in run
treeName='tree'
File "./bdphi-scripts/external/atheppy/atheppy/fw.py", line 110, in run
self._run(loop)
File "./bdphi-scripts/external/atheppy/atheppy/fw.py", line 285, in _run
loop()
File "./bdphi-scripts/external/alphatwirl/alphatwirl/datasetloop/loop.py", line 60, in __call__
pickle.dump(self.reader, f, protocol=pickle.HIGHEST_PROTOCOL)
TypeError: can't pickle _thread.lock objects
The error occurs in ResumableDatasetLoop:
alphatwirl/alphatwirl/datasetloop/loop.py, lines 52 to 61 in 8db62fc
It occurs with subprocess, not with htcondor. The proc in SubprocessRunner might not be picklable in Python 3:
alphatwirl/alphatwirl/concurrently/SubprocessRunner.py, lines 44 to 50 in 8db62fc
self.reader.read(dataset)
or self.reader.begin()
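If the unpicklable attribute is indeed the cause, one common workaround is to drop it in __getstate__ and recreate it on unpickling. A minimal sketch, not alphatwirl's actual code, with a threading.Lock standing in for the subprocess handle:

```python
import pickle
import threading

class Runner(object):
    """Sketch: an object holding an unpicklable attribute."""
    def __init__(self):
        self.lock = threading.Lock()  # stands in for the unpicklable proc

    def __getstate__(self):
        state = dict(self.__dict__)
        state.pop('lock', None)  # drop the unpicklable attribute
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.lock = threading.Lock()  # recreate it after unpickling

restored = pickle.loads(pickle.dumps(Runner()))
```

Whether dropping the handle is acceptable depends on whether the resumed reader needs it; the sketch only shows the pickling mechanics.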
Related to #25: propagate the log levels to resume.py as well.
alphatwirl/tests/unit/summary/test_Summarizer.py, lines 67 to 71 in 6106bde
This test is skipped because int and str cannot be compared in Python 3.
valOutColumnNames is a misnomer; rename it to summaryColumnNames or something similar.

alphatwirl/alphatwirl/loop/EventLoop.py, lines 36 to 39 in fcf553e
self.reader.begin(events) can fail if events are built from a TChain with no files. This could happen if no files are verified here:
alphatwirl/alphatwirl/roottree/build.py, lines 24 to 28 in fcf553e
Update the test for run.py:
alphatwirl/alphatwirl/concurrently/run.py, line 23 in 6c8580b
alphatwirl/alphatwirl/concurrently/run.py, lines 85 to 99 in 611214e
Make concurrently an independent package. Related to the comment at https://github.com/alphatwirl/alphatwirl-interface/pull/7#issuecomment-369357995; these will be surgical changes.
It will help to solve #10.
Steps:
- add + (or a method merge()) to Reader; readers for the same data set are merged at the end
- add merge() to EventDatasetReader
- add merge() to ReaderComposite
- add merge() to AllwCount, AnywCount, and NotwCount
- return in end() in EventDatasetReader
(pointed out in #30 (comment))
Need at least three separate lists of requirements. They can be specified in install_requires in setup.py and in requirements files for pip.
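One possible shape for the split, sketched here as a suggestion (this is not alphatwirl's actual setup.py; the test packages are the ones named elsewhere in these issues, the other entries are placeholders):

```python
# setup.py (sketch): runtime deps in install_requires, optional groups
# in extras_require; pip requirements files can mirror these lists.
from setuptools import setup

setup(
    name='alphatwirl',
    install_requires=[],  # hard runtime dependencies
    extras_require={
        'tests': ['pytest', 'pytest-mock', 'pytest-cov',
                  'pytest-console-scripts'],
        'batch': [],      # e.g. optional batch-system helpers
    },
)
```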
When trying to run on lxplus, with my working directory under /eos
I observed the following traceback:
Traceback (most recent call last):
File "/afs/cern.ch/user/b/bkrikler/.local/bin/fast_carpenter", line 11, in <module>
load_entry_point('fast-carpenter', 'console_scripts', 'fast_carpenter')()
File "/eos/user/b/bkrikler/CHIP/fast-carpenter/fast_carpenter/__main__.py", line 67, in main
return process.run(datasets, sequence)
File "/afs/cern.ch/user/b/bkrikler/.local/lib/python2.7/site-packages/atuproot/atuproot_main.py", line 49, in run
return self._run(loop)
File "/afs/cern.ch/user/b/bkrikler/.local/lib/python2.7/site-packages/atuproot/atuproot_main.py", line 101, in _run
result = loop()
File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/datasetloop/loop.py", line 55, in __call__
self.reader.read(dataset)
File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/datasetloop/reader.py", line 27, in read
reader.read(dataset)
File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/loop/EventDatasetReader.py", line 66, in read
runids = self.eventLoopRunner.run_multiple(eventLoops)
File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/loop/MPEventLoopRunner.py", line 93, in run_multiple
return self.communicationChannel.put_multiple(eventLoops)
File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/concurrently/CommunicationChannel.py", line 131, in put_multiple
return self.dropbox.put_multiple(packages)
File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/concurrently/TaskPackageDropbox.py", line 60, in put_multiple
runids = self.dispatcher.run_multiple(self.workingArea, pkgidxs)
File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/concurrently/HTCondorJobSubmitter.py", line 129, in run_multiple
njobs = int(regex.search(stdout).groups()[0])
AttributeError: 'NoneType' object has no attribute 'groups'
The underlying problem in this case was that htcondor submission on lxplus doesn't work under /eos, only under /afs; however, the traceback hid this somewhat. It might be helpful if the match object from the regular expression on line 127 were first checked to be valid and not None. If it is None, the stdout and stderr could be printed and some exception raised. That would help a user understand more immediately why the jobs weren't submitted, I suspect.
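A sketch of the suggested guard (the regex and error text here are illustrative, not alphatwirl's actual code):

```python
import re

# condor_submit typically prints e.g. "3 job(s) submitted to cluster 1234."
regex = re.compile(r'(\d+) job\(s\) submitted to cluster (\d+)')

def parse_njobs(stdout, stderr=''):
    match = regex.search(stdout)
    if match is None:
        # surface the raw output so the user can see why submission failed
        raise RuntimeError(
            'could not parse condor_submit output\n'
            'stdout:\n{}\nstderr:\n{}'.format(stdout, stderr))
    return int(match.groups()[0])

print(parse_njobs('3 job(s) submitted to cluster 1234.'))  # -> 3
```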
At the moment, a progressReporter is given to a task as an argument:
alphatwirl/alphatwirl/concurrently/Worker.py, lines 24 to 29 in dd0a7a3
A task might want to use a progressReporter for a different purpose. Instead of Worker giving the reporter to a task, the task should find the reporter if the task wants a reporter.
A possible solution: a worker registers a reporter in a DB, e.g., ProgressReporterAgency or ProgressReporterOffice, defined in the module progressbar. A task gets the reporter from the DB.
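A minimal sketch of that idea; ProgressReporterOffice is one of the names suggested above, and everything else here is hypothetical:

```python
class ProgressReporterOffice(object):
    """Hypothetical registry: workers register a reporter, tasks look it up."""
    _reporter = None

    @classmethod
    def register(cls, reporter):
        cls._reporter = reporter  # a worker registers its reporter here

    @classmethod
    def find(cls):
        return cls._reporter      # None if no worker registered a reporter

class PrintReporter(object):
    def report(self, msg):
        print(msg)

# a worker registers its reporter...
ProgressReporterOffice.register(PrintReporter())
# ...and a task fetches it only if it wants one
reporter = ProgressReporterOffice.find()
if reporter is not None:
    reporter.report('task started')
```

This decouples Worker from the task signature: tasks that don't care about progress reporting never see the reporter.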
- within one line
- use str for long text
- for all classes
Collect results from jobs as they finish instead of collecting them after all jobs finish.
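The stdlib concurrent.futures illustrates the pattern; alphatwirl's concurrently package would do the equivalent with its own workers (this sketch is not its actual API):

```python
import concurrent.futures

def work(i):
    return i * i

# as_completed yields each future as soon as it finishes, so results can
# be handled incrementally instead of after all jobs are done
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(work, i) for i in range(5)]
    results = [f.result() for f in concurrent.futures.as_completed(futures)]

print(sorted(results))  # -> [0, 1, 4, 9, 16]
```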
When _deprecated_class_method_option() or _deprecated_func_option() is used multiple times on the same method or function, module_name of the outer decorator will be the module name of the inner decorator, that is, deprecation. Fix: set module_name to the module name of the original function.

Allow an element of binnings of KeyValueComposer to be None. binnings itself can be None, but when binnings is a list, its elements are currently not allowed to be None.
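For the stacked _deprecated_*_option() decorators, one fix is to unwrap to the original function before reading __module__. A sketch with an illustrative decorator, not the actual alphatwirl code:

```python
import functools
import inspect

def deprecated_option(msg):
    def decorate(func):
        # unwrap inner decorators (functools.wraps sets __wrapped__) so
        # module_name is that of the original function, not of the module
        # where the inner wrapper was defined
        module_name = inspect.unwrap(func).__module__
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            print('{}: {}'.format(module_name, msg))
            return func(*args, **kwargs)
        return wrapper
    return decorate

@deprecated_option('option a is deprecated')
@deprecated_option('option b is deprecated')
def f():
    return 1
```

With this, both warnings report the module of f, even though the outer decorator only ever sees the inner wrapper.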
I've no idea how much work this would involve, but to bring alphatwirl to more sites, it would be nice to support batch systems beyond HTCondor for running components in parallel. In particular, Sun Grid Engine batch systems, although somewhat old-fashioned, are still common, for example on lxplus or the Imperial College batch system.
Line 74 of HTCondorJobSubmitter calls a function and then immediately gets the first element of what it returns. However, two of the three return statements in that function return empty lists, so this can lead to the code crashing.
At the commit 7405535, the test gets stuck in some circumstances.
In the top directory of alphatwirl, run
pytest -s tests
The test starts normally.
============================================ test session starts =============================================
platform darwin -- Python 3.6.3, pytest-3.4.0, py-1.5.2, pluggy-0.6.0
rootdir: #######, inifile:
plugins: mock-1.6.3, cov-2.5.1, console-scripts-0.1.4, hypothesis-3.38.5
collected 619 items
and all tests pass.
tests/unit/summary/test_Summarizer.py .....s.... [ 0%]
tests/unit/summary/test_WeightCalculatorOne.py . [ 0%]
tests/unit/summary/test_convert_key_vals_dict_to_tuple_list.py ...... [ 0%]
tests/unit/summary/test_parse_indices_config.py .
=================================== 579 passed, 40 skipped in 5.09 seconds =================================
But the test doesn't finish; the shell prompt won't appear. Hitting ctrl-c:
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/Users/sakuma/anaconda3/envs/py3_6-pd0_20/lib/python3.6/multiprocessing/popen_fork.py", line 29, in poll
pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
$
The same problem doesn't happen, for example, if the -s option is not used or if the directory is given as tests/ with a trailing slash:
pytest tests
pytest -s tests/
The same problem doesn't happen in Python 2.7:
platform darwin -- Python 2.7.12, pytest-3.3.2, py-1.5.2, pluggy-0.6.0
rootdir: #######, inifile: pytest.ini
plugins: mock-1.6.3, cov-2.5.1, console-scripts-0.1.3
I was trying to understand why the unit tests were failing when I run pytest, and it turns out it's because I didn't have the pytest-console-scripts package installed. I realised these were needed by looking in the Travis CI config file, but it would be good to either define pytest, pytest-mock, pytest-cov, and pytest-console-scripts in the requirements.txt file or mention them explicitly in the README.
This is based on a series of posts first made to Slack. But for the TL;DR, the request is to add outFileName to ret:
ret['outFileName'] = self.createOutFileName(ret)
For the meantime, I have a hacky solution to my problem, which can be used instead of TableFileNameComposer:
class WithInsertTableFileNameComposer():
    def __init__(self, composer, inserts):
        self.inserts = inserts
        self.composer = composer
        self.frame_idx = 0

    def __call__(self, columnNames, indices, **kwargs):
        this_insert = self.inserts[self.frame_idx]
        suffix = kwargs.get("suffix", self.composer.default_suffix)
        kwargs["suffix"] = "--{}.{}".format(this_insert, suffix)
        self.frame_idx += 1
        return self.composer(columnNames, indices, **kwargs)
propagate the config for logging to batch jobs
Currently, the package concurrently uses pickle to send the results from the worker nodes. Pickle files with data frames tend to become large, and loading large pickles is slow. Develop an alternative implementation specifically optimized for data frames, for example by using dask, so that loading the results at the interactive node is fast.
In 0.16.0 (expected to appear in 0.20.0 as well), a ValueError is raised at loop/merge.py#L7 when there are duplicate datasets (two or more dataset entries with the same name attribute).
Duplicate datasets can appear by user error. The final duplicate dataset will overwrite all the others in loop/EventDatasetReader.py#L74, but loop/EventDatasetReader.py#L68 is filled with runids from all dataset duplicates. When the code reaches loop/merge.py#L7, it tries to access runids from duplicate datasets that have been overwritten in loop/EventDatasetReader.py#L74.
Everything works fine if there are no duplicate datasets. However, this could be checked or enforced somewhere earlier.
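An early check could be as simple as this sketch (the helper and the stand-in Dataset type are hypothetical, not part of alphatwirl):

```python
import collections

def assert_unique_dataset_names(datasets):
    """Raise ValueError if two dataset entries share the same name."""
    seen = set()
    for dataset in datasets:
        if dataset.name in seen:
            raise ValueError('duplicate dataset name: {!r}'.format(dataset.name))
        seen.add(dataset.name)

# example with a stand-in dataset type
Dataset = collections.namedtuple('Dataset', ['name'])
assert_unique_dataset_names([Dataset('QCD'), Dataset('TTJets')])  # passes
```

Running such a check when the dataset list is first built would turn the confusing ValueError in loop/merge.py into an immediate, explicit error.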
In version 0.18.3, the following code produces awkward results:
from qtwirl import qtwirl
from alphatwirl.binning import RoundLog

filepath = 'https://github.com/alphatwirl/qtwirl/raw/v0.03.1/tests/data/sample_chain_01.root'
results = qtwirl(
    file=filepath, tree_name='tree',
    reader_cfg=dict(keyAttrNames='met', binnings=RoundLog(0.2, 100, min=10, underflow_bin=0)))
print(results)
met | n | nvar |
---|---|---|
0.000000 | 102 | 102 |
6.309573 | 0 | 0 |
10.000000 | 53 | 53 |
15.848932 | 77 | 77 |
25.118864 | 100 | 100 |
39.810717 | 141 | 141 |
63.095734 | 153 | 153 |
100.000000 | 166 | 166 |
158.489319 | 114 | 114 |
251.188643 | 78 | 78 |
398.107171 | 14 | 14 |
630.957344 | 2 | 2 |
1000.000000 | 0 | 0 |
There shouldn't be a 2nd row, i.e., the row with met = 6.309573. The row is even wrong: it suggests there are no events with 6.309573 <= met < 10.0, which is not the case.
This happens because, in RoundLog (and Round as well), the bin next to the underflow bin is the bin for the min.
An example
>>> b = RoundLog(0.2, 100, min=10, underflow_bin=0)
>>> b(10)
6.309573444801936
>>> b(8)
0
Here is why this happens: because 10 is greater than or equal to min (in fact exactly the min), it returns the lower edge of the bin. However, because of float rounding, 10 doesn't fall into the interval [10, 10^1.2) but into the one below, [10^0.8, 10). So it returns 10^0.8 = 6.30957. On the other hand, 8 is below min, so it returns underflow_bin.
This is a feature but is very uncomfortable. 8 is greater than 6.3; if there is a bin [10^0.8, 10), 8 should fall in that bin.
As long as the bin is determined after the value is compared with the min, this can happen. This uncomfortable situation can be avoided if the minimum value is the lower edge of the bin the argument min falls in.
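The proposed rule can be sketched on a simplified log binning (this is not the actual RoundLog implementation; it only illustrates lowering the effective minimum to the lower edge of the bin that min falls in):

```python
import math

def round_log(x, width=0.2, aboundary=100, min=10, underflow_bin=0):
    def lower_edge(v):
        # index of the bin containing v, counted from aboundary
        k = math.floor((math.log10(v) - math.log10(aboundary)) / width)
        return aboundary * 10 ** (k * width)

    # proposed fix: compare x with the lower edge of the bin that `min`
    # falls in, not with `min` itself, so a value cannot end up in the
    # underflow bin while sharing a bin with an in-range value
    effective_min = lower_edge(min)
    if x < effective_min:
        return underflow_bin
    return lower_edge(x)
```

With this rule, whichever bin min lands in due to float rounding, every value in that bin is treated consistently: 8 and 10 either share a lower edge or 8 is genuinely below the lowest bin, never the inconsistent mixture shown above.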
See the comment in #51.
Localize the concept of data sets to certain sub-packages.

Make concurrently an independent package.

Currently, it is impossible for a task in concurrently to include a lambda because lambdas aren't picklable. There are several solutions for serializing lambdas, including dill. Note that, unlike results, tasks are generally small; serializing and deserializing them doesn't need to be very fast.
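A quick illustration: the stdlib pickle rejects a lambda, while dill (shown only in a comment to keep this sketch stdlib-only) can serialize it:

```python
import pickle

task = lambda x: x * 2

try:
    pickle.dumps(task)  # stdlib pickle stores functions by name, so this fails
except Exception as e:
    print('cannot pickle lambda:', e)

# with dill (third-party), this would work instead:
#   import dill
#   restored = dill.loads(dill.dumps(task))
#   restored(21)
```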
$ find ./ -name \*.py | xargs grep 'def' | grep '\[ *\]'
./parallel/build.py:def build_parallel(parallel_mode, quiet=True, processes=4, user_modules=[ ],
./concurrently/HTCondorJobSubmitter.py: def __init__(self, job_desc_extra=[ ], job_desc_dict={}):
./selection/modules/with_count.py: def __init__(self, name='All', selections=[ ]):
./selection/modules/with_count.py: def __init__(self, name='Any', selections=[ ]):
./selection/modules/basic.py: def __init__(self, name='All', selections=[ ]):
./selection/modules/basic.py: def __init__(self, name='Any', selections=[ ]):
./loop/ReaderComposite.py: def __init__(self, readers=[]):
$ find ./ -name \*.py | xargs grep 'def' | grep '{ *}'
./parallel/build.py: logger.warning('unknown parallel_mode "{}", use default "{}"'.format(
./concurrently/HTCondorJobSubmitter.py: def __init__(self, job_desc_extra=[ ], job_desc_dict={}):
./selection/factories/expand.py:def expand_path_cfg(path_cfg, alias_dict={ }, overriding_kargs={ }):
Possibly more: arguments can be listed over multiple lines, which this grep would miss.
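The usual fix is to default to None and build the list inside __init__, sketched here on ReaderComposite (simplified, not the full class):

```python
class ReaderComposite(object):
    def __init__(self, readers=None):
        # a default of readers=[] would be the same list object shared by
        # every instance created without the argument; None avoids that
        self.readers = [] if readers is None else list(readers)

a = ReaderComposite()
b = ReaderComposite()
a.readers.append('reader1')
print(b.readers)  # -> [] (no longer shared between instances)
```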
not sure how to make it picklable.
At either:
alphatwirl/alphatwirl/configure/build_counter_collector_pair.py, lines 27 to 29 in 705bcce
build_counter_collector_pair can become a class so that datasetColumnName can be changed by an option to __init__().
use deque for self.boundaries:
https://github.com/TaiSakuma/AlphaTwirl/blob/766483220649396bb3a6106059f08cd1791f2f65/alphatwirl/binning/Round.py#L25
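For reference, the relevant deque operations; this sketch only shows why deque helps when boundaries grow and shrink at the ends:

```python
import collections

# list.pop(0) and list.insert(0, x) are O(n); deque does both ends in O(1)
boundaries = collections.deque([10.0, 20.0, 30.0])
boundaries.appendleft(5.0)   # extend the lower end
boundaries.append(40.0)      # extend the upper end
boundaries.popleft()         # drop the lowest boundary
print(list(boundaries))      # -> [10.0, 20.0, 30.0, 40.0]
```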
take over https://github.com/CMSRA1/AlphaTwirl-old/issues/4
continue from https://github.com/CMSRA1/AlphaTwirl-old/issues/3
Rewrite this construction:
if key not in counts:
    counts[key] = <<initial value>>
with
counts.setdefault(key, <<initial value>>)
for example in
https://github.com/TaiSakuma/AlphaTwirl/blob/v0.9.x/AlphaTwirl/Summary/Count.py#L16-L17
It will run faster because it searches for the key once instead of twice.
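A small demonstration that the two forms are equivalent (the key and initial value are illustrative):

```python
counts = {}
counts2 = {}
key = ('bin1',)

# before: the key is looked up twice
if key not in counts:
    counts[key] = [0, 0]
counts[key][0] += 1

# after: setdefault looks the key up once and returns the stored value
counts2.setdefault(key, [0, 0])[0] += 1

print(counts == counts2)  # -> True
```

Note that setdefault still constructs the initial value on every call; collections.defaultdict would avoid even that.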
[0, 0, 0, 0, 0], [0, 0, 2, 0, 0], [0, 1, 1, 0, 1] if the number of tasks is 5.

The code in the EventLoop class handles the reader (typically a composite reader) and provides it with events from the tree(s) by:
- calling the reader.begin() method
- calling reader.event() for each event
- calling reader.end()
In each case, the return value of each of reader's methods is ignored. This makes it awkward for a reader to tell the event loop to terminate early, or not to run at all in the case of begin(). begin() can raise an exception, of course, but in some cases that is more extreme than really needed.
Have the return of each call to the reader's begin, event, and end methods checked, and execution halted if the return does not evaluate to true.
I'm running over a large heppy dataset, including the existing RA1 AlphaTools sequence as a single Reader. The heppy dataset is being split up using AlphaTwirl and submitted to the Bristol htcondor batch system. One of the datasets in this input is not supposed to be handled by AlphaTools, and AlphaTools knows this, so it would normally just ignore it; however, this causes an exception to be raised, crashing the batch job, which is then re-submitted. To enable things to finish elegantly, I would prefer to be able to signal to the EventLoop that this job should not proceed, in the begin() method of the reader that wraps AlphaTools.
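A sketch of the proposal (this is not alphatwirl's actual EventLoop; checking `is False` rather than general falsiness keeps existing readers, which return None, unaffected):

```python
class EventLoop(object):
    def __init__(self, events, reader):
        self.events = events
        self.reader = reader

    def __call__(self):
        if self.reader.begin(self.events) is False:
            return self.reader.end()  # reader asked not to run at all
        for event in self.events:
            if self.reader.event(event) is False:
                break                 # reader asked to terminate early
        return self.reader.end()

class StopAfterTwo(object):
    """Toy reader that asks to stop after two events."""
    def __init__(self):
        self.n = 0
    def begin(self, events):
        pass
    def event(self, event):
        self.n += 1
        if self.n >= 2:
            return False
    def end(self):
        return self.n

print(EventLoop(range(10), StopAfterTwo())())  # -> 2
```

A reader wrapping AlphaTools could then return False from begin() for the dataset it is not supposed to handle, and the job would finish cleanly instead of crashing.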