
func_adl_servicex

Send func_adl expressions to a ServiceX endpoint


Introduction

This package provides the objects ServiceXSourceXAOD, ServiceXSourceCMSRun1AOD, and ServiceXSourceUpROOT, each of which can be used as the root of a func_adl expression to query large LHC datasets from an active ServiceX instance on the network.

See below for simple examples.

Further Information

  • servicex documentation
  • func_adl documentation

Usage

To use func_adl with ServiceX, this is the only func_adl package you need to install. All other required packages are pulled in as dependencies of this package.
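If you are using pip:

pip install func_adl_servicex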

Using the xAOD backend

See the Further Information documentation linked above to understand how this works. Here is a quick sample that runs against an ATLAS xAOD backend in ServiceX and returns the pt of all jets with pt > 30 GeV.

from func_adl_servicex import ServiceXSourceXAOD

dataset_xaod = "mc15_13TeV:mc15_13TeV.361106.PowhegPythia8EvtGen_AZNLOCTEQ6L1_Zee.merge.DAOD_STDM3.e3601_s2576_s2132_r6630_r6264_p2363_tid05630052_00"
ds = ServiceXSourceXAOD(dataset_xaod)
data = (
    ds
    .SelectMany('lambda e: (e.Jets("AntiKt4EMTopoJets"))')
    .Where('lambda j: (j.pt()/1000)>30')
    .Select('lambda j: j.pt()')
    .AsAwkwardArray(["JetPt"])
    .value()
)

print(data['JetPt'])
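The terminal AsAwkwardArray call selects the output format. As a hedged variant (AsPandasDF is described later in this document; whether it takes a column-name argument the same way as AsAwkwardArray is an assumption), the same query can return a pandas DataFrame:

df = (
    ds
    .SelectMany('lambda e: (e.Jets("AntiKt4EMTopoJets"))')
    .Where('lambda j: (j.pt()/1000)>30')
    .Select('lambda j: j.pt()')
    .AsPandasDF(["JetPt"])  # assumed to accept column names like AsAwkwardArray
    .value()
)

print(df["JetPt"])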

Using the CMS Run 1 AOD backend

See the Further Information documentation linked above to understand how this works. Here is a quick sample that runs against a CMS Run 1 AOD backend in ServiceX. It runs against a 6 TB CMS Open Data dataset, selecting global muons with pt greater than 30 GeV.

from func_adl_servicex import ServiceXSourceCMSRun1AOD

dataset_xaod = "cernopendata://16"
ds = ServiceXSourceCMSRun1AOD(dataset_xaod)
data = (
    ds
    .SelectMany(lambda e: e.TrackMuons("globalMuons"))
    .Where(lambda m: m.pt() > 30)
    .Select(lambda m: m.pt())
    .AsAwkwardArray(['mu_pt'])
    .value()
)

print(data['mu_pt'])

Using the uproot backend

See the Further Information documentation linked above to understand how this works. Here is a quick sample that runs against a ROOT file (TTree) in the uproot backend in ServiceX and returns jet pt's. Note that the image name tag below is likely wrong; see XXX to get the current one.

from servicex import ServiceXDataset
from func_adl_servicex import ServiceXSourceUpROOT


dataset_uproot = "user.kchoi:user.kchoi.ttHML_80fb_ttbar"
uproot_transformer_image = "sslhep/servicex_func_adl_uproot_transformer:issue6"

sx_dataset = ServiceXDataset(dataset_uproot, image=uproot_transformer_image)
ds = ServiceXSourceUpROOT(sx_dataset, "nominal")
data = (
    ds.Select("lambda e: {
        'lep_pt_1': e.lep_Pt_1,
        'lep_pt_2': e.lep_Pt_2
        }")
    .value()

print(data)

Running on Local Datasets

It is possible to run on local files. This works well when testing or building out your code, but it scales poorly to a large number of files, so it is recommended only for a single file. It is, for the most part, a drop-in replacement for the ServiceX backend version.

First, you must install the local variant of func_adl_servicex. If you are using pip, you can do the following:

pip install func_adl_servicex[local]

With that installed, the following will work:

from func_adl_servicex import SXLocalxAOD

dataset_xaod = "my_local_xaod.root"
ds = SXLocalxAOD(dataset_xaod)
data = (ds
    .SelectMany('lambda e: (e.Jets("AntiKt4EMTopoJets"))')
    .Where('lambda j: (j.pt()/1000)>30')
    .Select('lambda j: j.pt()')
    .AsAwkwardArray(["JetPt"])
    .value()
)

print(data['JetPt'])

Replace SXLocalxAOD with SXLocalCMSRun1AOD to use the CMS backend (and, of course, update the query).
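For example, a minimal sketch (the local file name is hypothetical; the query mirrors the CMS Run 1 AOD example above):

from func_adl_servicex import SXLocalCMSRun1AOD

# "my_local_aod.root" is a hypothetical local CMS Run 1 AOD file
ds = SXLocalCMSRun1AOD("my_local_aod.root")
data = (ds
    .SelectMany(lambda e: e.TrackMuons("globalMuons"))
    .Where(lambda m: m.pt() > 30)
    .Select(lambda m: m.pt())
    .AsAwkwardArray(['mu_pt'])
    .value()
)

print(data['mu_pt'])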

Development

PRs are welcome! Feel free to open an issue for new features or questions.

The master branch holds the most recent commits that both pass all tests and are slated for the next release. Releases are tagged. Modifications to released versions are made on branches off those tags.

Qastle

This section is for people working on the backends that run in ServiceX.

This is the qastle produced for an xAOD dataset:

(call EventDataset 'ServiceXDatasetSource')

(The actual dataset name is passed in the ServiceX web API call.)

This is the qastle produced for a ROOT flat file:

(call EventDataset 'ServiceXDatasetSource' 'tree_name')
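To inspect the qastle for your own query without sending it, here is a minimal sketch based on the return_qastle usage that appears in the issues below (treat the flag as an assumption about the current API):

from func_adl_servicex import ServiceXSourceUpROOT

ds = ServiceXSourceUpROOT('cernopendata://dummy', 'mytree', backend='open_uproot')
ds.return_qastle = True  # assumed flag: value() now returns the qastle string
print(ds.Select("lambda e: {'lep_pt': e.lep_pt}").value())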


func_adl_servicex's Issues

The tree name isn't getting passed correctly to ServiceX

Look at the qastle generated for this query:

def apply_event_cuts (source: ObjectStream) -> ObjectStream:
    '''Event level cuts for the analysis. Keep from sending data that we aren't going to need at all in the end.
    '''
    return (source
        .Where(lambda e: e.trigE or e.trigM))

def good_leptons(source: ObjectStream) -> ObjectStream:
    '''Select out all good leptons from each event. Return their pt, eta, phi, and E, and other
    things needed downstream.

    Because uproot doesn't tie together the objects, we can't do any cuts at this point.
    '''
    return source.Select(lambda e:
        {
            'lep_pt': e.lep_pt,
            'lep_eta': e.lep_eta,
            'lep_phi': e.lep_phi,
            'lep_energy': e.lep_E,
            'lep_charge': e.lep_charge,
            'lep_ptcone30': e.lep_ptcone30,
            'lep_etcone20': e.lep_etcone20,
            'lep_typeid': e.lep_type,
            'lep_trackd0pvunbiased': e.lep_trackd0pvunbiased,
            'lep_tracksigd0pvunbiased': e.lep_tracksigd0pvunbiased,
            'lep_z0': e.lep_z0,
            'mcWeight': e.mcWeight,
            # 'scaleFactor': e.scaleFactor_ELE*e.scaleFactor_MUON*e.scaleFactor_LepTRIGGER*e.scaleFactor_PILEUP,
        }) \
        .AsParquetFiles('junk.parquet')

ds = ServiceXSourceUpROOT('cernopendata://dummy',  "mimi", backend='open_uproot')
ds.return_qastle = True
leptons = good_leptons(apply_event_cuts(ds))

See this notebook

qastle:

(Select (Where (call EventDataset 'mimi') (lambda (list e) (or (attr e 'trigE') (attr e 'trigM')))) (lambda (list e) (dict (list 'lep_pt' 'lep_eta' 'lep_phi' 'lep_energy' 'lep_charge' 'lep_ptcone30' 'lep_etcone20' 'lep_typeid' 'lep_trackd0pvunbiased' 'lep_tracksigd0pvunbiased' 'lep_z0') (list (attr e 'lep_pt') (attr e 'lep_eta') (attr e 'lep_phi') (attr e 'lep_E') (attr e 'lep_charge') (attr e 'lep_ptcone30') (attr e 'lep_etcone20') (attr e 'lep_type') (attr e 'lep_trackd0pvunbiased') (attr e 'lep_tracksigd0pvunbiased') (attr e 'lep_z0')))))

Note that the EventDataset node has only a single argument, and it is the tree name; it should be the file plus the tree name. What happened?

Crash when import the package

Hi @gordonwatts, could you have a look at the following error? Importing the package just after installation doesn't work.

...
Successfully installed func-adl-servicex-1.1.2 idna-2.10 qastle-0.11.0 servicex-2.4
(base) KE-MBP:servicex-access-examples kchoi$ python
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import func_adl_servicex
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kchoi/anaconda3/lib/python3.7/site-packages/func_adl_servicex/__init__.py", line 3, in <module>
    from .ServiceX import ServiceXSourceUpROOT, ServiceXSourceXAOD, ServiceXSourceCMSRun1AOD, FuncADLServerException  # NOQA
  File "/Users/kchoi/anaconda3/lib/python3.7/site-packages/func_adl_servicex/ServiceX.py", line 10, in <module>
    from servicex import ServiceXDataset
  File "/Users/kchoi/anaconda3/lib/python3.7/site-packages/servicex/__init__.py", line 1, in <module>
    from .servicex import ServiceXDataset, StreamInfoUrl, StreamInfoPath  # NOQA
  File "/Users/kchoi/anaconda3/lib/python3.7/site-packages/servicex/servicex.py", line 15, in <module>
    from servicex.servicex_config import ServiceXConfigAdaptor
  File "/Users/kchoi/anaconda3/lib/python3.7/site-packages/servicex/servicex_config.py", line 7, in <module>
    from .utils import ServiceXException
  File "/Users/kchoi/anaconda3/lib/python3.7/site-packages/servicex/utils.py", line 467, in <module>
    from backoff._common import (  # noqa: E402
ImportError: cannot import name '_prepare_logger' from 'backoff._common' (/Users/kchoi/anaconda3/lib/python3.7/site-packages/backoff/_common.py)
>>>

Create a backend that generates qastle

As a user/developer I would like to easily see the qastle that is sent to the backend for various func_adl queries.

Approach

Decide on one of two approaches:

  1. Add a bool parameter that causes ServiceXDatasetSourcexAOD and ServiceXDatasetSourceUpROOT to return the qastle rather than the data
  2. Create a new data source that returns the qastle

Create doctests to show off some typical queries to both the xAOD and uproot backends.

Fix up README with working examples

The current version has AsParquet, etc. These are things that might cause problems now, so make sure everything in the documentation is working properly.

Cache can't tell the difference between awkward array and not

If you do this query first:

from servicex import ignore_cache
with ignore_cache():
    electrons_async = [
        (calib_tools.query_update(ds, electron_working_point=wp)
            .SelectMany(lambda e: [ele.eta() for ele in e.Electrons()])
            .value_async())
        for wp in ['TightLHElectron', 'MediumLHElectron', 'LooseLHElectron']
    ]
    electrons = await asyncio.gather(*electrons_async)

you will get back paths to a local file, as expected. If you then do:

from servicex import ignore_cache
with ignore_cache():
    electrons_async = [
        (calib_tools.query_update(ds, electron_working_point=wp)
            .SelectMany(lambda e: [ele.eta() for ele in e.Electrons()])
            .AsAwkwardArray()
            .value_async())
        for wp in ['TightLHElectron', 'MediumLHElectron', 'LooseLHElectron']
    ]
    electrons = await asyncio.gather(*electrons_async)

you'll also get back Windows paths, which is not what you would expect. The cache sees the Windows path and doesn't see the AsAwkwardArray.

Fix deprecation warning

.venv\lib\site-packages\func_adl_servicex\ServiceX.py:5
  c:\users\gordo\code\iris-hep\servicex-backend-tests\.venv\lib\site-packages\func_adl_servicex\ServiceX.py:5: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
    from collections import Iterable

Unclear sentence in readme

To use func_adl on servicex, the only func_adl package you only need to install this package.

I think I understand what this means, but...

Dataset Group

During the last AGC challenge, the idea of a dataset group came up.

  • A “ServiceXDataSetGroup”-like structure, perhaps, that wraps multiple datasets

We should explore this, taking into account prior work done by @kyungeonchoi on his project.

Add tests for different column naming conventions

We needed these for Awkward, but they should be there for everything:

  • With a list of column names
  • With a single name
  • With nothing (and a dict in the last Select statement)

As an example:

def test_sx_xaod_awkward_no_columns(async_mock):
    'Test a request for awkward arrays from an xAOD backend'
    sx = async_mock(spec=ServiceXDataset)
    sx.first_supported_datatype.return_value = 'root'
    ds = ServiceXSourceXAOD(sx)
    q = ds.Select("lambda e: e.MET").AsAwkwardArray()

    q.value()
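Hedged sketches of the other two variants (the exact argument forms AsAwkwardArray accepts here are assumptions, to be pinned down by these tests):

def test_sx_xaod_awkward_column_list(async_mock):
    'Awkward request with an explicit list of column names'
    sx = async_mock(spec=ServiceXDataset)
    sx.first_supported_datatype.return_value = 'root'
    ds = ServiceXSourceXAOD(sx)
    ds.Select("lambda e: e.MET").AsAwkwardArray(['met']).value()

def test_sx_xaod_awkward_single_column(async_mock):
    'Awkward request with a single column name'
    sx = async_mock(spec=ServiceXDataset)
    sx.first_supported_datatype.return_value = 'root'
    ds = ServiceXSourceXAOD(sx)
    ds.Select("lambda e: e.MET").AsAwkwardArray('met').value()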

Swallow AsParquetFiles in normal uproot query

Some standard queries for Uproot do not work:

import pandas as pd
from func_adl_servicex import ServiceXSourceUpROOT

dataset_name = "data15_13TeV:data15_13TeV.00282784.physics_Main.deriv.DAOD_PHYSLITE.r9264_p3083_p4165_tid21568807_00"
src = ServiceXSourceUpROOT(dataset_name, "CollectionTree")
data = src.Select("lambda e: {'JetPT': e['AnalysisJetsAuxDyn.pt']}") \
    .AsParquetFiles('junk.parquet') \
    .value()
df = pd.read_parquet(data[0])
print(df)

The reason is that the uproot backend can't deal with AsParquetFiles, so the call should be swallowed. One should also sort out what to do with this when column names are added in.

Making Data access more uniform

It would be nice to have a more carefully thought out and straightforward way to ask for data to come back from ServiceX; in short, to normalize the access patterns for ServiceX. The current interface has grown organically, and there are now so many operations that it is hard to move between them. Time to take a step back, perhaps.

What we have now

        sx = ServiceXDataset([uproot_single_file],
                             backend_name=endpoint_uproot,
                             status_callback_factory=None)
        src = ServiceXSourceUpROOT(sx, 'mini')
        r = (src.Select(lambda e: {'lep_pt': e['lep_pt']})
                .AsAwkwardArray()
                .value())

And AsAwkwardArray can be replaced by a bunch of different things:

  • AsPandasDF, as_pandas - a pandas DataFrame (this does not support nested objects!)
  • AsROOTTTree, as_ROOT_tree - a list of file(s) that contain a ROOT TTree object
  • AsParquetFiles, as_parquet - a list of parquet file(s) built with awkward's to_parquet method
  • AsAwkwardArray, as_awkward - returns an awkward array of all the data

These methods do not return the actual data, just the request to generate it. The value() call at the end actually triggers the infrastructure to generate the data. There is another version, value_async(), that does the same thing but allows you to easily queue up many requests at once.
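For example, a minimal sketch of queuing two requests at once with value_async (reusing src from the snippet above; the bare await assumes a notebook-style async context):

import asyncio

# Build two independent requests, then await them together
r_pt = src.Select(lambda e: {'lep_pt': e['lep_pt']}).AsAwkwardArray().value_async()
r_eta = src.Select(lambda e: {'lep_eta': e['lep_eta']}).AsAwkwardArray().value_async()
lep_pt, lep_eta = await asyncio.gather(r_pt, r_eta)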

There are at least two axes here:

  • What data format should come back from the ServiceX query
  • Should the programming interface be synchronous or asynchronous?

There is yet another axis for the ROOT and parquet queries: do you want the files downloaded locally into a cache, or just a URI to access them over the web? This is only accessible via direct calls to the servicex library (e.g. see get_root_files_async, get_root_files_stream, get_data_rootfiles_uri_stream, and get_data_rootfiles_uri_async).
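A minimal sketch of the URI-streaming style, assuming get_data_rootfiles_uri_stream takes a qastle string and yields StreamInfoUrl-like objects with a url attribute (both are assumptions; check the servicex library for the exact signatures):

from servicex import ServiceXDataset

async def print_uris(qastle_query: str):
    ds = ServiceXDataset('some:dataset', backend_name='xaod')  # hypothetical dataset
    # Yields a result as each transformed file becomes available
    async for info in ds.get_data_rootfiles_uri_stream(qastle_query):
        print(info.url)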

What do users of func_adl want?

Let's look at each one and reason about why different choices are made.

  • Data Format
    • What the user is familiar with ("what is a parquet file?")
    • How the data will be consumed downstream of servicex. Input to awkward distributed processing?
  • URIs to files (for ROOT and parquet), or locally copied files?
    • At an analysis facility you probably want the files downloaded locally; on a local laptop developing downstream code, you also want the local file option.
    • Downstream code will be quite opinionated about what it can and can't use.
  • Async or sync access
    • One or many datasets? One probably wants to parallelize access to many datasets.
    • In a notebook you almost certainly want sync access to make a "demo" work well.
  • Streaming URIs
    • Only interesting if downstream access can process a stream of URIs.
    • Uses well-understood but not widely known streaming async infrastructure in Python.

Starting from Scratch

Unify the two backends

Some differences observed during this work that should be unified (these need to be turned into issues in the xAOD and uproot backends):

  • Uproot can deal with LINQ-style ASTs (like (Select xxx)), while xAOD needs (call Select xxx). Further, the xAOD backend crashes in a rather subtle way if it isn't given the right version. For the xAOD backend you can't call qastle.insert_linq_nodes first.
  • The uproot backend can deal with a dictionary as the final input to the ntupler, taking column names from the dictionary's keys. The xAOD backend cannot.
  • Neither backend can deal with dictionaries in the middle of the code (ones that get made and then destroyed).
  • The xAOD backend needs an AsROOTTTree as a final call; the uproot backend does not. We should decide whether different output types should be produced by the transformers, and if not, both should look like the uproot backend.
  • The uproot backend makes parquet files but not ROOT files, and vice versa for the xAOD backend. Should this be normalized somewhere in ServiceX's infrastructure?
  • Should we be able to tell what type of backend an endpoint is serving by querying it?

This needs to be combined with any input from the testing that is done in the servicex backend tests repo.

Queries passing through all columns fail

I am trying to do a query that passes through all columns. This works fine with func_adl_uproot:

from func_adl_uproot import UprootDataset
ds = UprootDataset("nanoaod15.root", "Events")
ds.value()

It does not work with func_adl_servicex:

import servicex
from func_adl_servicex import ServiceXSourceUpROOT

dataset_name = ["https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/nanoaod15.root"]
sx_dataset = servicex.ServiceXDataset(dataset_name, "uproot")
ds = ServiceXSourceUpROOT(sx_dataset, "Events")
ds.value()

This crashes with

Transform be73e06d-0c92-46e5-8b43-13ba009717d4 had 1 errors:
  Error transforming file: h
  -> error.

(see ssl-hep/ServiceX#408). The query succeeds when using ds.Select(lambda e: e.Jet_pt).value() instead.

The file being used is equivalent in both cases. I am using the following versions of libraries:

func-adl             3.0
func-adl-servicex    2.0
func-adl-uproot      1.9.0
servicex             2.5.4
