pyrdf's People

Contributors: javiercvilla, shravan97, vepadulano, watson-ij

pyrdf's Issues

Add "Info" operations to PyRDF

The class of operations called "Other Operations" in the RDataFrame class reference is not defined in PyRDF. These operations return useful information about the data, such as the names of the columns of the dataframe.
They should be added to the operations_dict in Operation.py, possibly under a new category called "Info".
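
A minimal sketch of what the addition could look like, assuming operations_dict maps operation names to their category (the actual structure in Operation.py may differ):

operations_dict = {
    "Define": "Transformation",
    "Filter": "Transformation",
    "Count": "Action",
    "Histo1D": "Action",
    # New category for informational operations
    "GetColumnNames": "Info",
    "GetColumnType": "Info",
}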

Improve Graph pruning code

The current implementation of the graph pruning could be improved with a more robust logic. A simple refactoring of the code changes the result of some tests, pruning more than needed.

For example, a good refactoring to improve the readability of the code could be the following:

    def _is_prunable(self):
        # Condition 1 - Not enough references
        # Explanation...
        if len(gc.get_referrers(self)) <= 3:
            return True

        # Condition 2 - Action value already computed
        # Explanation...
        if self.operation and self.operation.is_action() and self.value:
            return True

        return False

    def graph_prune(self):
        """
        ...docstrings...
        """
        children = []

        for n in self.children:
            # Select children based on pruning condition
            if not n.graph_prune():
                children.append(n)

        self.children = children

        if not self.children:
            return self._is_prunable()

        return False

However, this refactoring increases the number of references to self, so the condition len(gc.get_referrers(self)) <= 3 no longer holds.

Spark configuration may be inconsistent at runtime

With the Spark backend, the number of partitions is automatically reduced when the dataset contains fewer clusters than the requested partitions. This is reported with the following warning:

PyRDF/backend/Dist.py:258: UserWarning: Number of partitions 
is greater than number of clusters in the filelist
  Using 1 partition(s)

However, this change in the number of partitions is not propagated to the backend, so the number of workers may end up being greater than the number of ranges. Example:

PyRDF.use("spark", {'npartitions':5})
PyRDF.RDataFrame(tree, filename) # filename contains a single cluster
# PyRDF falls back to the following effective configuration:
# PyRDF.use("spark", {'npartitions':1})

[Proposal] integrate merging of PyRDF objects in the Node class

Right now, the reduce phase of the Spark distributed job takes care of merging PyRDF objects (their PyROOT pointees, actually). This means that the reduce function of the Dist backend is a long list of if...else statements. If the __add__ method were implemented in the Node class, the merging of two PyROOT objects could be handled there; the class would also know the name of the operation involved and could call the right ROOT function (e.g. Add, Merge). The main problem is that the map phase returns a list of PyROOT objects, so the executor loses the information from the PyRDF side. The code could be refactored so that map also returns the PyRDF nodes.
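
A rough sketch of the proposal; the attribute names and the operation-to-merge mapping below are assumptions, not the current PyRDF API:

class Node(object):
    # ...existing attributes: self.operation, self.value, self.children, ...

    def __add__(self, other):
        """Merge the PyROOT value held by another node of the same operation."""
        name = self.operation.name
        if name in ("Histo1D", "Histo2D", "Histo3D", "Profile1D"):
            self.value.Add(other.value)  # histogram-like results support Add
        else:
            raise NotImplementedError("No merge rule defined for " + name)
        return self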

Long integers raise a syntax error in Python 3

File "/home/PyRDF/tests/unit/backend/test_dist.py", line 163
ranges_reqd = [(0L, 10L)]
                      ^
SyntaxError: invalid syntax

In Python 3, all integers have "unlimited" precision, as documented here. As a consequence, a declaration such as a = 1L raises SyntaxError: invalid syntax, as discussed here. Depending on how crucial these declarations are, the L suffix could simply be removed to ensure compatibility with both versions of Python. Otherwise, we could add some feature detection in order to differentiate the code according to the Python version used, as suggested here.
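
A minimal sketch of the two options for the test above:

# Option 1: drop the L suffix; Python 2 promotes plain ints to long
# automatically, and Python 3 has a single unlimited-precision int type.
ranges_reqd = [(0, 10)]

# Option 2: feature-detect the interpreter where an explicit long is required.
import sys
if sys.version_info[0] < 3:
    ranges_reqd = [(long(0), long(10))]  # noqa: F821 - long exists only in Python 2
else:
    ranges_reqd = [(0, 10)]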

headers are not sent to workers when backend is distributed

Right now, when executing PyRDF on a distributed backend (such as Spark), any headers
included by the user are only declared on the local machine (the Spark driver node).

Some way of shipping the files to the Spark executors is needed so that all the headers can be declared at execution time.
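
One possible approach (a sketch only; the helper name and the way header names are passed around are assumptions): ship each header with SparkContext.addFile on the driver and declare it from the mapper running on every executor.

from pyspark import SparkFiles
import ROOT

# On the driver (sparkContext is the active SparkContext):
#     sparkContext.addFile("myHeader.h")

# Inside the mapper that runs on each executor:
def declare_headers(header_names):
    for name in header_names:
        local_path = SparkFiles.get(name)  # local path of the shipped file on the executor
        ROOT.gInterpreter.Declare('#include "{}"'.format(local_path))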

Snapshot error passing branch lists

Hi Javier,

I am trying to save a snapshot with multiple branches, but it is not working:

import PyRDF
df = PyRDF.RDataFrame("data", ['https://root.cern/files/teaching/CMS_Open_Dataset.root',])

etaCutStr = "fabs(eta1) < 2.3 && fabs(eta2) < 2.3"
df_f = df.Filter(etaCutStr)

branchList = ("eta1","eta2") # the same error with ["eta1","eta2"] 
df_f.Snapshot('filtered', 'out3.root', branchList)

I am using the latest version.

error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-c4cbf40f254f> in <module>
      6 
      7 branchList = ("eta1","eta2")
----> 8 df_f.Snapshot('filtered', 'out3.root', branchList)

/eos/home-o/ozapatam/SWAN_projects/rho3/deps/PyRDF/Proxy.py in _create_new_op(self, *args, **kwargs)
    181             from PyRDF import current_backend
    182             generator = CallableGenerator(self.proxied_node.get_head())
--> 183             current_backend.execute(generator)
    184             return newNode.value
    185         else:

/eos/home-o/ozapatam/SWAN_projects/rho3/deps/PyRDF/backend/Local.py in execute(self, generator)
     61             self.pyroot_rdf = ROOT.ROOT.RDataFrame(*generator.head_node.args)
     62 
---> 63         values = mapper(self.pyroot_rdf)  # Execute the mapper function
     64 
     65         # Get the action nodes in the same order as values

/eos/home-o/ozapatam/SWAN_projects/rho3/deps/PyRDF/CallableGenerator.py in mapper(node_cpp, node_py, rdf_range)
    131             for n in node_py.children:
    132                 # Recurse through children and get their output
--> 133                 prev_vals = mapper(parent_node, node_py=n, rdf_range=rdf_range)
    134 
    135                 # Attach the output of the children node

/eos/home-o/ozapatam/SWAN_projects/rho3/deps/PyRDF/CallableGenerator.py in mapper(node_cpp, node_py, rdf_range)
    131             for n in node_py.children:
    132                 # Recurse through children and get their output
--> 133                 prev_vals = mapper(parent_node, node_py=n, rdf_range=rdf_range)
    134 
    135                 # Attach the output of the children node

/eos/home-o/ozapatam/SWAN_projects/rho3/deps/PyRDF/CallableGenerator.py in mapper(node_cpp, node_py, rdf_range)
    107                 else:
    108                     pyroot_node = RDFOperation(*operation.args,
--> 109                                                **operation.kwargs)
    110 
    111                 # The result is a pyroot object which is stored together with

TypeError: none of the 3 overloaded methods succeeded. Full details:
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void>::Snapshot(basic_string_view<char,char_traits<char> > treename, basic_string_view<char,char_traits<char> > filename, const vector<string>& columnList, const ROOT::RDF::RSnapshotOptions& options = ROOT::RDF::RSnapshotOptions()) =>
    could not convert argument 3 (none of the 10 overloaded methods succeeded. Full details:
  vector<string>::vector<string>() =>
    takes at most 0 arguments (2 given)
  vector<string>::vector<string>(const allocator<string>& __a) =>
    takes at most 1 arguments (2 given)
  vector<string>::vector<string>(unsigned long __n, const allocator<string>& __a = std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::allocator_type()) =>
    could not convert argument 1 (an integer is required)
  vector<string>::vector<string>(unsigned long __n, const string& __value, const allocator<string>& __a = std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::allocator_type()) =>
    could not convert argument 1 (an integer is required)
  vector<string>::vector<string>(const vector<string>& __x) =>
    takes at most 1 arguments (2 given)
  vector<string>::vector<string>(vector<string>&& __x) =>
    takes at most 1 arguments (2 given)
  vector<string>::vector<string>(const vector<string>& __x, const allocator<string>& __a) =>
    could not convert argument 1
  vector<string>::vector<string>(vector<string>&& __rv, const allocator<string>& __m) =>
    could not convert argument 1 (this method can not (yet) be called)
  vector<string>::vector<string>(initializer_list<string> __l, const allocator<string>& __a = std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::allocator_type()) =>
    could not convert argument 1
  vector<string>::vector<string>(__gnu_cxx::__normal_iterator<const basic_string_view<char,char_traits<char> >*,vector<basic_string_view<char,char_traits<char> > > > __first, __gnu_cxx::__normal_iterator<const basic_string_view<char,char_traits<char> >*,vector<basic_string_view<char,char_traits<char> > > > __last, const allocator<string>& __a = std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::allocator_type()) =>
    could not convert argument 1)
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void>::Snapshot(basic_string_view<char,char_traits<char> > treename, basic_string_view<char,char_traits<char> > filename, basic_string_view<char,char_traits<char> > columnNameRegexp = "", const ROOT::RDF::RSnapshotOptions& options = ROOT::RDF::RSnapshotOptions()) =>
    could not convert argument 3
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void>::Snapshot(basic_string_view<char,char_traits<char> > treename, basic_string_view<char,char_traits<char> > filename, initializer_list<string> columnList, const ROOT::RDF::RSnapshotOptions& options = ROOT::RDF::RSnapshotOptions()) =>
    could not convert argument 3 (none of the 3 overloaded methods succeeded. Full details:
  initializer_list<string>::initializer_list<string>() =>
    takes at most 0 arguments (2 given)
  initializer_list<string>::initializer_list<string>(const initializer_list<string>&) =>
    takes at most 1 arguments (2 given)
  initializer_list<string>::initializer_list<string>(initializer_list<string>&&) =>
    takes at most 1 arguments (2 given))
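
A possible workaround to try (an untested sketch): build an explicit std::vector<string> so that PyROOT does not have to convert the Python tuple/list itself.

import ROOT

branchList = ROOT.std.vector('string')()
for b in ("eta1", "eta2"):
    branchList.push_back(b)

df_f.Snapshot('filtered', 'out3.root', branchList)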

Instant actions do not get triggered instantly

None of the three instant actions implemented in ROOT RDataFrame (Foreach, ForeachSlot and Snapshot) is currently implemented in PyRDF. Furthermore, the triggering logic of the Proxy family of classes is geared towards lazy actions, in the sense that the computational graph is executed only when the user requests the value of an action (e.g. myrdf.Count().GetValue()).

For example, the Snapshot instant action works in ROOT 6.16, but trying to take a snapshot of a PyRDF.RDataFrame results in the event loop not being triggered, so no file is written to disk. To actually trigger the snapshot, one has to explicitly call GetValue after Snapshot.

Example taken from the ROOT tutorials:

import ROOT
import PyRDF

# A simple helper function to fill a test tree.
def fill_tree(treeName, fileName):
  tdf = PyRDF.RDataFrame(10000)
  tdf.Define("b1", "(int) tdfentry_")\
     .Define("b2", "(float) tdfentry_ * tdfentry_").Snapshot(treeName, fileName)

fileName = "df007_snapshot_py.root"
treeName = "myTree"
fill_tree(treeName, fileName) # this will not trigger the snapshot, no file will be saved

The same example, with an explicit GetValue call to trigger the Snapshot:

import ROOT
import PyRDF

# A simple helper function to fill a test tree.
def fill_tree(treeName, fileName):
  tdf = PyRDF.RDataFrame(10000)
  tdf.Define("b1", "(int) tdfentry_")\
     .Define("b2", "(float) tdfentry_ * tdfentry_").Snapshot(treeName, fileName).GetValue()

fileName = "df007_snapshot_py.root"
treeName = "myTree"
fill_tree(treeName, fileName) # GetValue() triggers the event, the file will be saved

Wrong import of backend modules in Python 3.7

In __init__.py all backend modules are imported with absolute notation, e.g. from backend import Utils. Implicit relative imports are allowed in Python 2.7 but do not work in Python 3.7, as documented here. In Python 3, relative imports must be explicit, using dot notation to indicate the current and parent packages. So the right way to do it is from .backend import Utils.

Add testing step: Run some local checks before going distributed

For a given user analysis, it would be nice to run some local checks so users can discover possible issues in their code before connecting to a cluster. Common errors may be:

  • Non-serializable functions passed to the PyRDF.initialize method.
  • Incorrect C++ functions in RDataFrame operations.
  • Inaccessible input files.

These errors can easily be discovered by running a reduced version of the analysis on a few entries with the Local backend (see the sketch below), whereas on a distributed backend the error messages might not be as clear.
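
A minimal sketch of such a pre-flight check, using only the PyRDF calls that already appear in these issues (the file name and cut are placeholders):

import PyRDF

# Run the analysis locally first: bad C++ expressions, unreadable files or
# non-serializable callables fail fast here with a readable traceback.
PyRDF.use("local")
df = PyRDF.RDataFrame("data", "input.root")
print(df.Filter("pt > 10").Count().GetValue())

# Only switch to the distributed backend once the local run succeeds.
PyRDF.use("spark", {"npartitions": 32})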

new RDF is not correctly instantiated

If a user performs an analysis on an RDataFrame and then, in the same script (session), tries to create another RDataFrame, the latter is not actually created: the first one is kept as the data source for the new analysis. See the example:

import PyRDF

df1 = PyRDF.RDataFrame(10)
count1 = df1.Count()
print(count1.GetValue()) # should be 10, output is 10

df2 = PyRDF.RDataFrame(100)
count2 = df2.Count()
print(count2.GetValue()) # should be 100, output is 10

This issue occurs with the local backend.

Add existing features of RDataFrames in distributed PyRDF

Hi! As a heavy user of this module, it would be great to have access to the following features, listed in order of priority (which is why I did not create one issue per feature):

  • 1. Sum()
  • 2. Snapshot()
  • 3. AsNumpy()
  • 4. Count()
  • 5. Report()

Thanks a lot!

Snapshot fails when TTree file switchover is triggered

When running on a large data set, at some point TTree decides the file is too large and switches to a new file. This happens in TTree::ChangeFile and usually works, but when called from PyRDF the process aborts. The output is:

Fill: Switching to new file: /scratch/test2_1.root
Fatal in TFileMerger::RecursiveRemove: Output file of the TFile Merger (targeting /scratch/test2.root) has been deleted (likely due to a TTree larger than 100Gb)
aborting
...a lot of stack traces...

When I look at the files the relevant files are:

ls -lh
276 Jan 8 23:17 test2_1.root
94G Jan 8 23:17 test2.root

So the relevant file is not in excess of 100 GB (but close to it), and the second file, which the output was supposed to switch over to, exists but was barely filled.
As a workaround I will try running again with ROOT.TTree.SetMaxTreeSize(10000000000000) (10 TB), but this should be handled internally when Snapshot encounters a data set larger than the limit, so that the switchover is effectively disabled and the bug is not triggered.
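
For reference, a minimal sketch of that workaround (the limit value is arbitrary):

import ROOT

# Raise the maximum tree size to 10 TB so that TTree::ChangeFile is never
# triggered while the Snapshot is being written.
ROOT.TTree.SetMaxTreeSize(10000000000000)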

Add lazy-loading to backend modules

Currently all backend modules (Spark, Dist, Local) are loaded at startup. There is no need to load the Spark backend if the user is only interested in a local execution.
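
A possible direction (a sketch only; the module path and the body of use are assumptions about PyRDF's internals): import the backend module only when the user selects it.

import importlib

def use(backend_name, conf=None):
    # Map the user-facing name to the backend module, e.g. "spark" -> PyRDF.backend.Spark
    module_name = "PyRDF.backend." + backend_name.capitalize()
    backend_module = importlib.import_module(module_name)
    # ...instantiate the backend class from backend_module and store it
    # as the current_backend...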

Wrong graph pruning in Python 3.7

With python-3.7.2 the graph_prune function removes nodes that should not be pruned. As a result, the following code does not work:

import PyRDF
d = PyRDF.RDataFrame(1)
h = d.Histo1D("rdfentry_")
h.Draw()

The same code works with python-2.7.15.

Extend PyRDF.include to accept directories

Currently PyRDF.include only supports a file or a list of files. If multiple headers are needed, it might be useful to support something like PyRDF.include("myHeadersDir"), or perhaps to add a new function to the API.
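
A small sketch of what such support could look like, built on top of the existing PyRDF.include (the helper name and accepted extensions are assumptions):

import glob
import os

import PyRDF

def include_dir(path):
    """Collect every header in a directory and pass the list to PyRDF.include."""
    headers = sorted(glob.glob(os.path.join(path, "*.h")) +
                     glob.glob(os.path.join(path, "*.hpp")))
    PyRDF.include(headers)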

Improve docs

Documentation right now is just a list of all the classes in PyRDF with their docstrings and source code. The tutorials should be included in the docs to help users in their first approach to PyRDF.

Python lambdas or access to external variables

Dear developers,

I have recently encountered an issue using this module, concerning access to external variables from a dataframe defined in Python code (I use PyROOT in the SWAN environment).

The issue comes from the fact that I have a map or a function in Python which should take a column from the dataframe and define another column accordingly. A Python lambda would do the job, but Python lambdas are not yet supported in PyROOT, I think.
So I tried to define this map or function with ROOT.gInterpreter.Declare(), which gives me access to my map from the dataframe.Define() method.

However, this works for ROOT.RDataFrame but not for PyRDF.RDataFrame, which seems logical but is not good news for me.
So my question is the following: could you help me in that direction? Either make the ROOT environment/definitions accessible from the jobs or, even better, add support for Python lambdas in dataframes?

Bests,
Brian Ventura
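
For context, a minimal sketch of the gInterpreter.Declare approach described in the message above, which works with a plain ROOT.RDataFrame (the C++ helper is a made-up example):

import ROOT

# Declare a C++ helper once on the interpreter...
ROOT.gInterpreter.Declare('''
double remap(double x) { return 2.0 * x; }
''')

# ...and use it from a jitted Define string.
df = ROOT.RDataFrame(10).Define("y", "remap((double) rdfentry_)")
print(df.Sum("y").GetValue())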

Report more expressive error message when Numcluster = 0

During the creation of ranges (BuildRanges), the input dataset is extracted from the arguments passed to RDataFrame. If the dataset cannot be properly read (e.g. due to incorrect user input, non-existing or unreachable files), BuildRanges fails with a division by zero.
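
A sketch of the kind of check that could be added before the division (the helper name is hypothetical):

def check_clusters(nclusters):
    """Fail early with a clear message instead of a later ZeroDivisionError."""
    if nclusters == 0:
        raise RuntimeError(
            "The input dataset could not be read: check that the file(s) exist, "
            "are reachable and contain the requested tree.")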

PyRDF.use("spark") hangs in SWAN

In a SWAN notebook, PyRDF.use("spark") freezes the notebook cell when no SparkContext is preloaded, i.e. when the connection with a Spark cluster has not been established beforehand.

Add FriendTree capabilities to distributed execution

The distributed execution of the DAG starts by calling ROOT RDataFrame with the arguments passed by the user to PyRDF.RDataFrame.

If the TTree/TChain passed has some friend trees attached, the links to those other files are lost, since the Spark executor only knows about the location of the main TTree/TChain .root file(s).

Some friend-tree handling needs to be added to the mapper function of the distributed backend.
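
A rough sketch of what the mapper would need to do (the friend_info structure is an assumption about how the information could be shipped to the executors):

import ROOT

def build_chain_with_friends(treename, filenames, friend_info):
    """Rebuild the main chain on the executor and re-attach its friend trees."""
    chain = ROOT.TChain(treename)
    for f in filenames:
        chain.Add(f)

    friends = []  # keep the friend chains alive alongside the main chain
    for friend_treename, friend_files in friend_info:
        friend = ROOT.TChain(friend_treename)
        for f in friend_files:
            friend.Add(f)
        chain.AddFriend(friend)
        friends.append(friend)

    return chain, friends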

Turning to local doesn't stop running Spark context

Situation:
A user triggers a first execution of the graph with Spark as the backend, i.e. the user has previously called PyRDF.use("spark"). The user then calls PyRDF.use("local") to switch to local execution.
Issue:
PyRDF.use("local") doesn't check for an existing Spark context and doesn't stop it.

Extend PyRDF.initialize to "string code" and functions from imported modules

Currently PyRDF.initialize only supports a Python callable of this shape:

def f(a, b, c):
    import ROOT
    ROOT.myCppFunction(a, b, c)

PyRDF.initialize(f, a, b, c)

It would be useful to support any of the following calls:

PyRDF.initialize(ROOT.myCppFunction, a, b, c)
PyRDF.initialize("myCppFunction", a, b, c)
PyRDF.initialize(f, a, b, c) 

This way the code would be cleaner.
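
A minimal sketch of how the dispatch could look (the final registration step is a placeholder, not the actual PyRDF implementation):

import ROOT

def initialize(fun, *args, **kwargs):
    if isinstance(fun, str):
        # A bare C++ function name: resolve it on the ROOT namespace.
        fun = getattr(ROOT, fun)
    # PyROOT-wrapped C++ functions and plain Python callables can now be
    # handled uniformly by the rest of the machinery.
    # ...register (fun, args, kwargs) with the current backend...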

Bug in npartitions parameter; Spark backend

Dear developers,
I think I have spotted a bug in the partitioning of the dataframes done in PyRDF. Here is a simple example snippet showing it.
config: (I use the most recent PyRDF version in SWAN, loaded via the zipfile and sys.path.insert as explained in the tutorial, but this should also reproduce elsewhere)

import PyRDF
PyRDF.use('spark', {"npartitions": 50})
spark

snippet:

r = PyRDF.RDataFrame(1000)
r = r.Define("nu","2.")

lala = PyRDF.RDataFrame(20)

print(r.Count().GetValue())
print(lala.Count().GetValue())
print(r.Sum("nu").GetValue())

Here is my result:
[screenshot of the notebook output]
As I trigger the event loop three times (the 1st and 3rd for 'r', the 2nd for 'lala'), it seems that the number of partitions of lala overrides that of r in the 3rd event loop, and I do not think that is expected.

Thanks a lot for helping, and best regards,
Brian

Synchronize `AttributeError` with ROOT RDataFrame

When trying to access an attribute that is not defined on a PyRDF object, be it a HeadNode or a Node, the error output is of the form:

AttributeError: 'HeadNode'/'Node' object has no attribute 'random_attr'

These errors should be more similar to what ROOT RDataFrame outputs, so that users do not get confused by differences that stem purely from the internal implementation.

Remove dot notation from import statements

There are still some import statements using dot notation, namely the ones importing the current_backend variable. These should be changed in favor of the clearer from PyRDF import current_backend.

Add logging features to PyRDF

There are many cases in the execution of an analysis where things can go wrong, or where more information about the status of the program is simply needed. A logger would help in those situations.
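
A minimal sketch of how this could look with the standard logging module (the module path, function and message are illustrative only):

import logging

logger = logging.getLogger("PyRDF.backend.Dist")

def build_ranges(npartitions, nclusters):
    if npartitions > nclusters:
        logger.warning(
            "Requested %d partitions but the filelist only has %d clusters; "
            "using %d partition(s)", npartitions, nclusters, nclusters)
        npartitions = nclusters
    # ...build and return the ranges for npartitions partitions...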
