pyrdf's People

Contributors: javiercvilla, shravan97, vepadulano, watson-ij

pyrdf's Issues

Add "Info" operations to PyRDF

The class of operations called "Other Operations" in the RDataFrame class reference is not defined in PyRDF. These operations return useful information about the data, such as the names of the columns of the dataframe.
They should be added to the operations_dict in Operation.py, possibly under a new category called "Info".
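
A minimal sketch of what the addition could look like, assuming operations_dict maps operation names to their category (the actual structure in Operation.py may differ):

operations_dict = {
    "Define": "Transformation",
    "Filter": "Transformation",
    "Count": "Action",
    "Histo1D": "Action",
    # New category for informational operations
    "GetColumnNames": "Info",
    "GetColumnType": "Info",
}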

Improve Graph pruning code

The current implementation of the graph pruning could be improved with a more robust logic. A simple refactoring of the code changes the result of some tests, pruning more than needed.

For example, a good refactoring to improve the readability of the code could be the following:

    def _is_prunable(self):
        # Condition 1 - Not enough references
        # Explanation...
        if len(gc.get_referrers(self)) <= 3:
            return True

        # Condition 2 - Action value already computed
        # Explanation...
        if self.operation and self.operation.is_action() and self.value:
            return True

        return False

    def graph_prune(self):
        """
        ...docstrings...
        """
        children = []

        for n in self.children:
            # Select children based on pruning condition
            if not n.graph_prune():
                children.append(n)

        self.children = children

        if not self.children:
            return self._is_prunable()

        return False

However, this refactoring increases the number of references to self, so the condition len(gc.get_referrers(self)) <= 3 no longer holds.

Spark configuration may be inconsistent at runtime

With the Spark backend, the number of partitions is automatically reduced when the dataset contains fewer clusters than the requested partitions. This is reported with the following warning:

PyRDF/backend/Dist.py:258: UserWarning: Number of partitions 
is greater than number of clusters in the filelist
  Using 1 partition(s)

However, this change in the number of partitions is not propagated to the backend, so the number of workers may end up being greater than the number of ranges. Example:

PyRDF.use("spark", {'npartitions':5})
PyRDF.RDataFrame(tree, filename) # filename contains a single cluster
# PyRDF falls back to the following effective configuration:
# PyRDF.use("spark", {'npartitions':1})

[Proposal] integrate merging of PyRDF objects in the Node class

Right now, the reduce phase of the Spark distributed job takes care of merging PyRDF objects (their PyROOT pointees, actually). This means that the reduce function of the Dist backend is a long list of if...else statements. If the __add__ method were implemented in the Node class, the merging of two PyROOT objects could be handled there; the class would also know the name of the operation involved and could call the right ROOT function (e.g. Add, Merge). The main problem is that the map phase returns a list of PyROOT objects, so the executor loses the information from the PyRDF side. The code could be refactored so that map also returns the PyRDF nodes.
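
A rough sketch of the proposal; the attribute names and the operation-to-merge mapping below are assumptions, not the current PyRDF API:

class Node(object):
    # ...existing attributes: self.operation, self.value, self.children, ...

    def __add__(self, other):
        """Merge the PyROOT value held by another node of the same operation."""
        name = self.operation.name
        if name in ("Histo1D", "Histo2D", "Histo3D", "Profile1D"):
            self.value.Add(other.value)  # histogram-like results support Add
        else:
            raise NotImplementedError("No merge rule defined for " + name)
        return self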

Long integers raise a syntax error in Python 3

File "/home/PyRDF/tests/unit/backend/test_dist.py", line 163
ranges_reqd = [(0L, 10L)]
                      ^
SyntaxError: invalid syntax

In Python 3, all integers have "unlimited" precision, as documented here. As a consequence, a declaration such as a = 1L raises SyntaxError: invalid syntax, as discussed here. Depending on how crucial these declarations are, the L suffix could simply be removed to ensure compatibility with both versions of Python. Otherwise, we could add some feature detection in order to differentiate the code according to the Python version used, as suggested here.
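
A minimal sketch of the two options for the test above:

# Option 1: drop the L suffix; Python 2 promotes plain ints to long
# automatically, and Python 3 has a single unlimited-precision int type.
ranges_reqd = [(0, 10)]

# Option 2: feature-detect the interpreter where an explicit long is required.
import sys
if sys.version_info[0] < 3:
    ranges_reqd = [(long(0), long(10))]  # noqa: F821 - long exists only in Python 2
else:
    ranges_reqd = [(0, 10)]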

headers are not sent to workers when backend is distributed

Right now, when executing PyRDF on a distributed backend (such as Spark), any headers
included by the user are only declared on the local machine (the Spark driver node).

Some way of shipping the files to the Spark executors is needed so that all the headers can be declared at execution time.
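
One possible approach (a sketch only; the helper name and the way header names are passed around are assumptions): ship each header with SparkContext.addFile on the driver and declare it from the mapper running on every executor.

from pyspark import SparkFiles
import ROOT

# On the driver (sparkContext is the active SparkContext):
#     sparkContext.addFile("myHeader.h")

# Inside the mapper that runs on each executor:
def declare_headers(header_names):
    for name in header_names:
        local_path = SparkFiles.get(name)  # local path of the shipped file on the executor
        ROOT.gInterpreter.Declare('#include "{}"'.format(local_path))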

Snapshot error passing branch lists

Hi Javier,

I am trying to save a snapshot with multiple branches, but it is not working:

import PyRDF
df = PyRDF.RDataFrame("data", ['https://root.cern/files/teaching/CMS_Open_Dataset.root',])

etaCutStr = "fabs(eta1) < 2.3 && fabs(eta2) < 2.3"
df_f = df.Filter(etaCutStr)

branchList = ("eta1","eta2") # the same error with ["eta1","eta2"] 
df_f.Snapshot('filtered', 'out3.root', branchList)

I am using the latest version.

error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-c4cbf40f254f> in <module>
      6 
      7 branchList = ("eta1","eta2")
----> 8 df_f.Snapshot('filtered', 'out3.root', branchList)

/eos/home-o/ozapatam/SWAN_projects/rho3/deps/PyRDF/Proxy.py in _create_new_op(self, *args, **kwargs)
    181             from PyRDF import current_backend
    182             generator = CallableGenerator(self.proxied_node.get_head())
--> 183             current_backend.execute(generator)
    184             return newNode.value
    185         else:

/eos/home-o/ozapatam/SWAN_projects/rho3/deps/PyRDF/backend/Local.py in execute(self, generator)
     61             self.pyroot_rdf = ROOT.ROOT.RDataFrame(*generator.head_node.args)
     62 
---> 63         values = mapper(self.pyroot_rdf)  # Execute the mapper function
     64 
     65         # Get the action nodes in the same order as values

/eos/home-o/ozapatam/SWAN_projects/rho3/deps/PyRDF/CallableGenerator.py in mapper(node_cpp, node_py, rdf_range)
    131             for n in node_py.children:
    132                 # Recurse through children and get their output
--> 133                 prev_vals = mapper(parent_node, node_py=n, rdf_range=rdf_range)
    134 
    135                 # Attach the output of the children node

/eos/home-o/ozapatam/SWAN_projects/rho3/deps/PyRDF/CallableGenerator.py in mapper(node_cpp, node_py, rdf_range)
    131             for n in node_py.children:
    132                 # Recurse through children and get their output
--> 133                 prev_vals = mapper(parent_node, node_py=n, rdf_range=rdf_range)
    134 
    135                 # Attach the output of the children node

/eos/home-o/ozapatam/SWAN_projects/rho3/deps/PyRDF/CallableGenerator.py in mapper(node_cpp, node_py, rdf_range)
    107                 else:
    108                     pyroot_node = RDFOperation(*operation.args,
--> 109                                                **operation.kwargs)
    110 
    111                 # The result is a pyroot object which is stored together with

TypeError: none of the 3 overloaded methods succeeded. Full details:
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void>::Snapshot(basic_string_view<char,char_traits<char> > treename, basic_string_view<char,char_traits<char> > filename, const vector<string>& columnList, const ROOT::RDF::RSnapshotOptions& options = ROOT::RDF::RSnapshotOptions()) =>
    could not convert argument 3 (none of the 10 overloaded methods succeeded. Full details:
  vector<string>::vector<string>() =>
    takes at most 0 arguments (2 given)
  vector<string>::vector<string>(const allocator<string>& __a) =>
    takes at most 1 arguments (2 given)
  vector<string>::vector<string>(unsigned long __n, const allocator<string>& __a = std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::allocator_type()) =>
    could not convert argument 1 (an integer is required)
  vector<string>::vector<string>(unsigned long __n, const string& __value, const allocator<string>& __a = std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::allocator_type()) =>
    could not convert argument 1 (an integer is required)
  vector<string>::vector<string>(const vector<string>& __x) =>
    takes at most 1 arguments (2 given)
  vector<string>::vector<string>(vector<string>&& __x) =>
    takes at most 1 arguments (2 given)
  vector<string>::vector<string>(const vector<string>& __x, const allocator<string>& __a) =>
    could not convert argument 1
  vector<string>::vector<string>(vector<string>&& __rv, const allocator<string>& __m) =>
    could not convert argument 1 (this method can not (yet) be called)
  vector<string>::vector<string>(initializer_list<string> __l, const allocator<string>& __a = std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::allocator_type()) =>
    could not convert argument 1
  vector<string>::vector<string>(__gnu_cxx::__normal_iterator<const basic_string_view<char,char_traits<char> >*,vector<basic_string_view<char,char_traits<char> > > > __first, __gnu_cxx::__normal_iterator<const basic_string_view<char,char_traits<char> >*,vector<basic_string_view<char,char_traits<char> > > > __last, const allocator<string>& __a = std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::allocator_type()) =>
    could not convert argument 1)
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void>::Snapshot(basic_string_view<char,char_traits<char> > treename, basic_string_view<char,char_traits<char> > filename, basic_string_view<char,char_traits<char> > columnNameRegexp = "", const ROOT::RDF::RSnapshotOptions& options = ROOT::RDF::RSnapshotOptions()) =>
    could not convert argument 3
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RJittedFilter,void>::Snapshot(basic_string_view<char,char_traits<char> > treename, basic_string_view<char,char_traits<char> > filename, initializer_list<string> columnList, const ROOT::RDF::RSnapshotOptions& options = ROOT::RDF::RSnapshotOptions()) =>
    could not convert argument 3 (none of the 3 overloaded methods succeeded. Full details:
  initializer_list<string>::initializer_list<string>() =>
    takes at most 0 arguments (2 given)
  initializer_list<string>::initializer_list<string>(const initializer_list<string>&) =>
    takes at most 1 arguments (2 given)
  initializer_list<string>::initializer_list<string>(initializer_list<string>&&) =>
    takes at most 1 arguments (2 given))
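
A possible workaround to try (an untested sketch): build an explicit std::vector<string> so that PyROOT does not have to convert the Python tuple/list itself.

import ROOT

branchList = ROOT.std.vector('string')()
for b in ("eta1", "eta2"):
    branchList.push_back(b)

df_f.Snapshot('filtered', 'out3.root', branchList)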

Instant actions do not get triggered instantly

None of the three instant actions implemented in ROOT RDataFrame (Foreach, ForeachSlot and Snapshot) is currently implemented in PyRDF. Furthermore, the triggering logic of the Proxy family of classes is geared towards lazy actions, in the sense that the computational graph is executed only when the user requests the value of an action (e.g. myrdf.Count().GetValue()).

For example, the Snapshot instant action works in ROOT 6.16, but trying to take a snapshot of a PyRDF.RDataFrame results in the event loop not being triggered, so no file is written to disk. To actually trigger the snapshot, one has to explicitly call GetValue after Snapshot.

Example taken from the ROOT tutorials:

import ROOT
import PyRDF

# A simple helper function to fill a test tree.
def fill_tree(treeName, fileName):
  tdf = PyRDF.RDataFrame(10000)
  tdf.Define("b1", "(int) tdfentry_")\
     .Define("b2", "(float) tdfentry_ * tdfentry_").Snapshot(treeName, fileName)

fileName = "df007_snapshot_py.root"
treeName = "myTree"
fill_tree(treeName, fileName) # this will not trigger the snapshot, no file will be saved

The same example, with an explicit GetValue call to trigger the Snapshot:

import ROOT
import PyRDF

# A simple helper function to fill a test tree.
def fill_tree(treeName, fileName):
  tdf = PyRDF.RDataFrame(10000)
  tdf.Define("b1", "(int) tdfentry_")\
     .Define("b2", "(float) tdfentry_ * tdfentry_").Snapshot(treeName, fileName).GetValue()

fileName = "df007_snapshot_py.root"
treeName = "myTree"
fill_tree(treeName, fileName) # GetValue() triggers the event, the file will be saved

Wrong import of backend modules in Python 3.7

In __init__.py all backend modules are imported with absolute notation, e.g. from backend import Utils. Implicit relative imports are allowed in Python 2.7 but do not work in Python 3.7, as documented here. In Python 3, relative imports must be explicit, using dot notation to indicate the current and parent packages. So the right way to do it is from .backend import Utils.

Add testing step: Run some local checks before going distributed

For a given user analysis, it would be nice to run some local checks so users can discover possible issues in their code before connecting to a cluster. Common errors may be:

  • Non-serializable functions passed to the PyRDF.initialize method.
  • Incorrect C++ functions in RDataFrame operations.
  • Inaccessible input files.

These errors can easily be discovered by running a reduced version of the analysis on a few entries with the Local backend (see the sketch below), whereas on a distributed backend the error messages might not be as clear.
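
A minimal sketch of such a pre-flight check, using only the PyRDF calls that already appear in these issues (the file name and cut are placeholders):

import PyRDF

# Run the analysis locally first: bad C++ expressions, unreadable files or
# non-serializable callables fail fast here with a readable traceback.
PyRDF.use("local")
df = PyRDF.RDataFrame("data", "input.root")
print(df.Filter("pt > 10").Count().GetValue())

# Only switch to the distributed backend once the local run succeeds.
PyRDF.use("spark", {"npartitions": 32})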

new RDF is not correctly instantiated

If a user performs an analysis on an RDataFrame and then, in the same script (session), tries to create another RDataFrame, the latter is not actually created: the first one is kept as the data source for the new analysis. See the example:

import PyRDF

df1 = PyRDF.RDataFrame(10)
count1 = df1.Count()
print(count1.GetValue()) # should be 10, output is 10

df2 = PyRDF.RDataFrame(100)
count2 = df2.Count()
print(count2.GetValue()) # should be 100, output is 10

This issue occurs with the local backend.

Add existing features of RDataFrames in distributed PyRDF

Hi! As a heavy user of this module, it would be great to have access to the following features, listed in order of priority (which is why I did not create one issue per feature):

  • 1. Sum()
  • 2. Snapshot()
  • 3. AsNumpy()
  • 4. Count()
  • 5. Report()

Thanks a lot!

Snapshot fails when TTree file switchover is triggered

When running on a large data set, at some point TTree decides the file is too large and switches to a new file. This happens in TTree::ChangeFile and usually works, but when called from PyRDF the process aborts. The output is:

Fill: Switching to new file: /scratch/test2_1.root
Fatal in TFileMerger::RecursiveRemove: Output file of the TFile Merger (targeting /scratch/test2.root) has been deleted (likely due to a TTree larger than 100Gb)
aborting
...a lot of stack traces...

When I look at the files the relevant files are:

ls -lh
276 Jan 8 23:17 test2_1.root
94G Jan 8 23:17 test2.root

So the relevant file is not in excess of 100 GB (but close to it), and the second file, which the output was supposed to switch over to, exists but was barely filled.
As a workaround I will try running again with ROOT.TTree.SetMaxTreeSize(10000000000000) (10 TB), but this should be handled internally when Snapshot encounters a data set larger than the limit, so that the switchover is effectively disabled and the bug is not triggered.
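
For reference, a minimal sketch of that workaround (the limit value is arbitrary):

import ROOT

# Raise the maximum tree size to 10 TB so that TTree::ChangeFile is never
# triggered while the Snapshot is being written.
ROOT.TTree.SetMaxTreeSize(10000000000000)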

Add lazy-loading to backend modules

Currently all backend modules (Spark, Dist, Local) are loaded at startup. There is no need to load the Spark backend if the user is only interested in a local execution.
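
A possible direction (a sketch only; the module path and the body of use are assumptions about PyRDF's internals): import the backend module only when the user selects it.

import importlib

def use(backend_name, conf=None):
    # Map the user-facing name to the backend module, e.g. "spark" -> PyRDF.backend.Spark
    module_name = "PyRDF.backend." + backend_name.capitalize()
    backend_module = importlib.import_module(module_name)
    # ...instantiate the backend class from backend_module and store it
    # as the current_backend...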

Wrong graph pruning in Python 3.7

With python-3.7.2 the graph_prune function removes nodes that should not be pruned. As a result, the following code does not work:

import PyRDF
d = PyRDF.RDataFrame(1)
h = d.Histo1D("rdfentry_")
h.Draw()

The same code works with python-2.7.15.

Extend PyRDF.include to accept directories

Currently PyRDF.include only supports a file or a list of files. If multiple headers are needed, it might be useful to support something like PyRDF.include("myHeadersDir"), or perhaps to add a new function to the API.
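
A small sketch of what such support could look like, built on top of the existing PyRDF.include (the helper name and accepted extensions are assumptions):

import glob
import os

import PyRDF

def include_dir(path):
    """Collect every header in a directory and pass the list to PyRDF.include."""
    headers = sorted(glob.glob(os.path.join(path, "*.h")) +
                     glob.glob(os.path.join(path, "*.hpp")))
    PyRDF.include(headers)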

Improve docs

Documentation right now is just a list of all the classes in PyRDF with their docstrings and source code. The tutorials should be included in the docs to help users in their first approach to PyRDF.

Python lambdas or access to external variables

Dear developers,

I have recently encountered an issue using this module, concerning access to external variables from a dataframe defined in Python code (I use PyROOT in the SWAN environment).

The issue comes from the fact that I have a map or a function in Python which should take a column from the dataframe and define another column accordingly. A Python lambda would do the job, but Python lambdas are not yet supported in PyROOT, I think.
So I tried to define this map or function with ROOT.gInterpreter.Declare(), which gives me access to my map from the dataframe.Define() method.

However, this works for ROOT.RDataFrame but not for PyRDF.RDataFrame, which seems logical but is not good news for me.
So my question is the following: could you help me in that direction? Either make the ROOT environment/definitions accessible from the jobs or, even better, add support for Python lambdas in dataframes?

Bests,
Brian Ventura
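
For context, a minimal sketch of the gInterpreter.Declare approach described in the message above, which works with a plain ROOT.RDataFrame (the C++ helper is a made-up example):

import ROOT

# Declare a C++ helper once on the interpreter...
ROOT.gInterpreter.Declare('''
double remap(double x) { return 2.0 * x; }
''')

# ...and use it from a jitted Define string.
df = ROOT.RDataFrame(10).Define("y", "remap((double) rdfentry_)")
print(df.Sum("y").GetValue())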

Report more expressive error message when Numcluster = 0

During the creation of ranges (BuildRanges), the input dataset is extracted from the arguments passed to RDataFrame. If the dataset cannot be properly read (e.g. due to incorrect user input, non-existing or unreachable files), BuildRanges fails with a division by zero.
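
A sketch of the kind of check that could be added before the division (the helper name is hypothetical):

def check_clusters(nclusters):
    """Fail early with a clear message instead of a later ZeroDivisionError."""
    if nclusters == 0:
        raise RuntimeError(
            "The input dataset could not be read: check that the file(s) exist, "
            "are reachable and contain the requested tree.")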

PyRDF.use("spark") hangs in SWAN

In a SWAN notebook, PyRDF.use("spark") freezes the notebook cell when no SparkContext is preloaded, i.e. when the connection with a Spark cluster has not been established beforehand.

Add FriendTree capabilities to distributed execution

The distributed execution of the DAG starts by calling ROOT RDataFrame with the arguments passed by the user to PyRDF.RDataFrame.

If the TTree/TChain passed has some friend trees attached, the links to those other files are lost, since the Spark executor only knows about the location of the main TTree/TChain .root file(s).

Some friend-tree handling needs to be added to the mapper function of the distributed backend.
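
A rough sketch of what the mapper would need to do (the friend_info structure is an assumption about how the information could be shipped to the executors):

import ROOT

def build_chain_with_friends(treename, filenames, friend_info):
    """Rebuild the main chain on the executor and re-attach its friend trees."""
    chain = ROOT.TChain(treename)
    for f in filenames:
        chain.Add(f)

    friends = []  # keep the friend chains alive alongside the main chain
    for friend_treename, friend_files in friend_info:
        friend = ROOT.TChain(friend_treename)
        for f in friend_files:
            friend.Add(f)
        chain.AddFriend(friend)
        friends.append(friend)

    return chain, friends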

Turning to local doesn't stop running Spark context

Situation:
A user triggers a first execution of the graph with Spark as the backend, i.e. the user has previously called PyRDF.use("spark"). The user then calls PyRDF.use("local") to switch to local execution.
Issue:
PyRDF.use("local") doesn't check for an existing Spark context and doesn't stop it.

Extend PyRDF.initialize to "string code" and functions from imported modules

Currently PyRDF.initialize only supports a Python callable of this shape:

def f(a, b, c):
    import ROOT
    ROOT.myCppFunction(a, b, c)

PyRDF.initialize(f, a, b, c)

It would be useful to support any of the following calls:

PyRDF.initialize(ROOT.myCppFunction, a, b, c)
PyRDF.initialize("myCppFunction", a, b, c)
PyRDF.initialize(f, a, b, c) 

This way the code would be cleaner.
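
A minimal sketch of how the dispatch could look (the final registration step is a placeholder, not the actual PyRDF implementation):

import ROOT

def initialize(fun, *args, **kwargs):
    if isinstance(fun, str):
        # A bare C++ function name: resolve it on the ROOT namespace.
        fun = getattr(ROOT, fun)
    # PyROOT-wrapped C++ functions and plain Python callables can now be
    # handled uniformly by the rest of the machinery.
    # ...register (fun, args, kwargs) with the current backend...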

Bug in npartitions parameter; Spark backend

Dear developers,
I think I have spotted a bug in the partitioning of the dataframes done in PyRDF. Here is a simple example snippet showing it.
config: (I use the most recent PyRDF version in SWAN, loaded via the zipfile and sys.path.insert as explained in the tutorial, but this should also reproduce elsewhere)

import PyRDF
PyRDF.use('spark', {"npartitions": 50})
spark

snippet:

r = PyRDF.RDataFrame(1000)
r = r.Define("nu","2.")

lala = PyRDF.RDataFrame(20)

print(r.Count().GetValue())
print(lala.Count().GetValue())
print(r.Sum("nu").GetValue())

Here is my result:
[screenshot of the notebook output]
As I trigger the event loop three times (the 1st and 3rd for 'r', the 2nd for 'lala'), it seems that the number of partitions of lala overrides that of r in the 3rd event loop, and I do not think that is expected.

Thanks a lot for helping, and best regards,
Brian

Synchronize `AttributeError` with ROOT RDataFrame

When trying to access an attribute that is not defined on a PyRDF object, be it a HeadNode or a Node, the error output is of the form:

AttributeError: 'HeadNode'/'Node' object has no attribute 'random_attr'

These errors should be more similar to what ROOT RDataFrame outputs, so that users do not get confused by differences that stem purely from the internal implementation.

Remove dot notation from import statements

There are still some import statements using dot notation, namely the ones importing the current_backend variable. These should be changed in favor of the clearer from PyRDF import current_backend.

Add logging features to PyRDF

There are many cases in the execution of an analysis where things can go wrong, or where more information about the status of the program is simply needed. A logger would help in those situations.
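
A minimal sketch of how this could look with the standard logging module (the module path, function and message are illustrative only):

import logging

logger = logging.getLogger("PyRDF.backend.Dist")

def build_ranges(npartitions, nclusters):
    if npartitions > nclusters:
        logger.warning(
            "Requested %d partitions but the filelist only has %d clusters; "
            "using %d partition(s)", npartitions, nclusters, nclusters)
        npartitions = nclusters
    # ...build and return the ranges for npartitions partitions...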
