
analysis-grand-challenge's Introduction

Analysis Grand Challenge (AGC)


The Analysis Grand Challenge (AGC) is about performing the last steps in an analysis pipeline at scale to test workflows envisioned for the HL-LHC. This includes

  • columnar data extraction from large datasets,
  • processing of that data (event filtering, construction of observables, evaluation of systematic uncertainties) into histograms,
  • statistical model construction and statistical inference,
  • relevant visualizations for these steps,

all done in a reproducible & preservable way that can scale to HL-LHC requirements.

(Figure: overview of the AGC analysis pipeline)

The AGC has two major pieces:

  1. specification of a physics analysis using Open Data which captures relevant workflow aspects encountered in physics analyses performed at the LHC,
  2. a reference implementation demonstrating the successful execution of this physics analysis at scale.

The physics analysis task is a $t\bar{t}$ cross-section measurement with 2015 CMS Open Data (see datasets/cms-open-data-2015). The current reference implementation can be found in analyses/cms-open-data-ttbar. In addition to this, analyses/atlas-open-data-hzz contains a smaller scale $H\rightarrow ZZ^*$ analysis based on ATLAS Open Data.

See this talk given at ICHEP 2022 for more information about the AGC. Additional information is available from two workshops focused on the AGC.

We also have a dedicated webpage and a website for documentation.

AGC and IRIS-HEP

The AGC serves as an integration exercise for IRIS-HEP, allowing the testing of new services, libraries and workflows on dedicated analysis facilities in the context of realistic physics analyses.

AGC and you

We believe that the AGC can be useful in various contexts:

  • testbed for software library development,
  • realistic environment to prototype analysis workflows,
  • functionality, integration & performance test for analysis facilities.

We are very interested in seeing (parts of) the AGC implemented in different ways! Besides the implementation in this repository, take a look at the other implementations that are available.

Please get in touch if you have investigated other approaches you would like to share! There is no need to implement the full analysis task — it splits into pieces (for example the production of histograms) that can also be tackled individually.

More details: what is being investigated in the AGC context

  • New user interfaces: Complementary services that present the analyst with a notebook-based interface. Example software: Jupyter.
  • Data access: Services that provide quick access to the experiment’s official data sets, often allowing simple derivations and local caching for efficient access. Example software and services: Rucio, ServiceX, SkyHook, iDDS, RNTuple.
  • Event selection: Systems/frameworks allowing analysts to process entire datasets, select desired events, and calculate derived quantities. Example software and services: Coffea, awkward-array, func_adl, RDataFrame.
  • Histogramming and summary statistics: Closely tied to the event selection, histogramming tools provide physicists with the ability to summarize the observed quantities in a dataset. Example software and services: Coffea, func_adl, cabinetry, hist.
  • Statistical model building and fitting: Tools that translate specifications for event selection, summary statistics, and histogramming quantities into statistical models, leveraging the capabilities above, and perform fits and statistical analysis with the resulting models. Example software and services: cabinetry, pyhf, FuncX+pyhf fitting service.
  • Reinterpretation / analysis preservation: Standards for capturing the entire analysis workflow, and services to reuse the workflow which enables reinterpretation. Example software and services: REANA, RECAST.
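As a toy illustration of the statistical-inference step, the sketch below hand-rolls a single-bin Poisson likelihood scan in plain numpy. It only mimics the kind of fit that pyhf and cabinetry perform from a full HistFactory specification; all function names and numbers here are invented for illustration.

```python
import numpy as np

def nll(mu, n_obs, s, b):
    """Negative log-likelihood of a single-bin Poisson counting experiment
    with expectation mu * s + b (constant terms dropped)."""
    lam = mu * s + b
    return lam - n_obs * np.log(lam)

def fit_mu(n_obs, s, b, scan=np.linspace(0.0, 3.0, 3001)):
    """Best-fit signal strength mu via a brute-force likelihood scan."""
    values = [nll(mu, n_obs, s, b) for mu in scan]
    return float(scan[int(np.argmin(values))])

# invented toy inputs: 125 observed events, 50 expected signal, 75 background
mu_hat = fit_mu(n_obs=125, s=50.0, b=75.0)  # minimum sits at mu = 1
```

In a real analysis the likelihood has many bins, channels, and constrained nuisance parameters; pyhf builds and minimizes that model from a declarative JSON workspace instead of a hand-written scan.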

Acknowledgements

This work was supported by the U.S. National Science Foundation (NSF) cooperative agreement OAC-1836650 (IRIS-HEP).

analysis-grand-challenge's People

Contributors

alexander-held, andrew42, andriipovsten, andrzejnovak, davekch, eguiraud, ekauffma, jayjeetatgithub, kondratyevd, kyungeonchoi, mapsacosta, masonproffitt, mat-adamec, matthewfeickert, nanoemc, oshadura, rasmalais, saransh-cpp, stormsomething, talvandaalen, tatianaovsiannikova


analysis-grand-challenge's Issues

NanoAOD file access over http and performance

ServiceX transforms of NanoAOD files and direct uproot-based access via http seem to be slower than for ntuples: https://gist.github.com/alexander-held/4e58811522ed9990afb2d4b73ef9471e.

@masonproffitt pointed out an XRootD issue related to this: xrootd/xrootd#1976. Reading too much data causes a 500 error and uproot subsequently falls back to individual requests, making everything slower. A similar issue is xrootd/xrootd#2003: this is about requesting too many ranges at once, while the former is about requesting too many bytes in a range.

Related uproot issue during these investigations: scikit-hep/uproot5#881.

Impact on ServiceX

More details about the behavior of ServiceX from @masonproffitt:

the uproot backend does not set anything related to chunking; it just uses the default settings for uproot. the problems are a bit different between uproot4 (used in the current version of the servicex uproot transformer) and uproot5. in uproot4, the main problem is that uproot.lazy has an explicit iterator over branches, so the execution time scales linearly with both the number of branches accessed and the round trip latency. in uproot5, this problem should disappear thanks to uproot.dask, but there the issue is that it hits these xrootd limits and falls back to individual requests (at least for each branch, maybe even for each basket)

for uproot5, we can set the step_size in the servicex transformer, but i don't think there's a consistent way to guarantee that we don't hit these limits because there are separate limits for (1) number of byte ranges, (2) total ascii length of the Range field, and (3) total number of actual bytes requested by Range. the problem is that there's no way to know the number and size of the baskets before the code executes. handling this would require either going deep into uproot itself or inspecting a lot of metadata at runtime and modifying the generated code in very non-trivial ways
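The limits quoted above suggest a client-side workaround: split a list of (offset, length) byte ranges into batches that stay below a maximum range count and a maximum total size. A stdlib-only sketch of that idea; the limits used here are made-up placeholders (real servers do not advertise them, so in practice they would have to be tuned):

```python
def batch_ranges(ranges, max_ranges=64, max_bytes=8 * 2**20):
    """Split (offset, length) byte-range pairs into batches that respect
    per-request limits. Both limit values are invented placeholders."""
    batch, batch_bytes = [], 0
    for offset, length in ranges:
        # start a new batch if adding this range would exceed either limit
        if batch and (len(batch) >= max_ranges or batch_bytes + length > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append((offset, length))
        batch_bytes += length
    if batch:
        yield batch

# 100 ranges of 1 MiB each -> batches of at most 8 MiB with these limits
batches = list(batch_ranges([(i * 2**20, 2**20) for i in range(100)]))
```

This does not address limit (2), the total ASCII length of the Range header, which would need a similar check on the formatted header string; and as noted above, basket sizes are not known before the request is built, which is the hard part in practice.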

Impact on coffea

It is currently unclear if this would affect coffea directly ingesting the input dataset differently. Are there any tricks that may matter here @nsmith- @lgray? Currently we are still using "old" coffea, though preparing to switch to coffea 2023.

Improved func_adl query to not accidentally drop events that are accepted under systematic variations

Query including lepton and jet selection:

def get_query(source: ObjectStream) -> ObjectStream:
    return (
        source
        # == 1 lepton with pT > 25 GeV
        .Where(
            lambda e: e.electron_pt.Where(lambda pT: pT > 25).Count()
            + e.muon_pt.Where(lambda pT: pT > 25).Count() == 1
        )
        # >= 4 jets with pT > 25 GeV
        .Where(lambda e: e.jet_pt.Where(lambda pT: pT > 25).Count() >= 4)
        # >= 1 jet with pT > 25 GeV and b-tag >= 0.5
        .Where(
            lambda e: {"pT": e.jet_pt, "btag": e.jet_btag}
            .Zip()
            .Where(lambda jet: jet.btag >= 0.5 and jet.pT > 25)
            .Count() >= 1
        )
        # return columns
        .Select(
            lambda e: {
                "electron_pt": e.electron_pt,
                "muon_pt": e.muon_pt,
                "jet_pt": e.jet_pt,
                "jet_eta": e.jet_eta,
                "jet_phi": e.jet_phi,
                "jet_mass": e.jet_mass,
                "jet_btag": e.jet_btag,
            }
        )
    )

This does not account for object systematic variations in the processor that can relax acceptance requirements. The query should be loosened accordingly, though it is unclear how to do that in a way that guarantees full acceptance in a general setup.
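One conservative option (an illustration, not what the AGC currently does) is to loosen the server-side threshold by the largest expected upward variation, so that events which migrate into acceptance under a systematic shift still survive the query; the exact cuts are then reapplied per variation in the processor. The numbers below are invented placeholders:

```python
# All numbers here are invented placeholders, not AGC values.
NOMINAL_CUT = 25.0       # GeV, the analysis-level jet pT threshold
MAX_UP_VARIATION = 0.05  # assumed largest relative upward pT shift

# A jet just below the nominal cut can pass after an upward variation,
# so the server-side query must keep jets down to this looser threshold:
loose_cut = NOMINAL_CUT / (1.0 + MAX_UP_VARIATION)
```

The caveat from the text stands: in a general setup the maximum variation is not known ahead of time, so a margin chosen this way cannot guarantee full acceptance.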

Rucio-based file access

Add rucio-based file access to the AGC ttbar notebook. This requires suitable credentials to be run, given that the datasets available through rucio are experiment-restricted.

PIPELINE = "servicex_databinder" currently uses those inputs, but they should be integrated fully with the complete pipeline.

User experience and performance improvements for pipeline demonstrator

This collects various user experience and performance related aspects that the CMS Open Data pipeline demonstration at the AGC 2022 workshop revealed.

Completeness of pipeline

  • add a machine learning component (e.g. ttbar reconstruction), frequently requested and relevant for many analyses being done in practice

User experience

ServiceX+coffea

ServiceX

coffea

coffea-casa

  • dask manual scaling settings seem to not be accepted
  • ServiceX dashboard

func_adl

  • find ways to format queries in a way that helps understand the "layer" at which a given operation acts

processor design

  • avoid stacking masks of different shapes together (when built after initial filtering), hard to keep track of shapes (perhaps keepdims=True, or masking with None)
  • improve systematics loop, potentially streamline everything to use the same pattern, or find a way to automatically track which columns change when, and automatically expand observable with systematics dimensions, avoid scaling of jet properties via helper array

Performance

ServiceX+coffea

ServiceX

  • DID finder becomes a bottleneck when running over a large amount of files

coffea

servicex-databinder approach

  • avoid bottleneck with file conversion / copying (feed data straight to Skyhook?)

coffea-casa

  • understand issues showing up in dask task stream (file access?)
  • possibility of guaranteeing fixed number of workers for performance benchmarking

func_adl

  • implement full query with proper b-tagging of jets with pT > 25 GeV -> done in #85

cabinetry

  • cabinetry.templates.collect method takes a lot of time when introducing more channels (i.e. 45.3 seconds for 20 channels)
  • cabinetry.model_utils.prediction(model, fit_results=fit_results) causes notebook to crash due to memory issues on model with many channels. -> potentially related: scikit-hep/awkward#2480

Cabinetry 0.4.0 @ coffea-casa breaking CabinetryHZZAnalysis.ipynb

Using cabinetry 0.4.0 on coffea-casa makes the notebook https://github.com/iris-hep/analysis-grand-challenge/blob/main/analyses/atlas-open-data-hzz/CabinetryHZZAnalysis.ipynb die with:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_1850/4037724515.py in <module>
      2 
      3 start_time = time.time()
----> 4 cabinetry.template_builder.create_histograms(config, method="uproot")
      5 # cabinetry.template_builder.create_histograms(config, method="coffea")
      6 # Using the coffea backend requires this branch of cabinetry: https://github.com/alexander-held/cabinetry/pull/216

AttributeError: module 'cabinetry' has no attribute 'template_builder'

IRIS-HEP / AGC Demo day 16.12.2022

We'll be organizing a demo day on Friday 16.12.2022 at 11am Central, 5pm Central European Time.

Agenda

Indico agenda

See https://indico.cern.ch/event/1218004/

cc: @bbockelm

Format AGC outputs for HEPData

We should think about formatting the AGC outputs (figures, tables, etc.) for HEPData compatibility. For some outputs this will be straightforward (e.g. the HistFactory workspace in JSON format); others will probably require an intermediate formatting interface. This could also be a cabinetry feature, for example.

GAC.2: Execute IRIS-HEP AGC tools soft-launch event

Execute IRIS-HEP soft-launch event showcasing tools and workflows planned to be used for AGC. The event will feature interactive notebook talks to introduce tools and interfaces. It will start with a common track and later on split into ATLAS and CMS-specific examples. The event will be split over two afternoons (CET).

Target date: Dec 1, 2021

Requires

Stretch goal

Add cutflow tables

We should add cutflow tables for a few sensible cuts to show how they impact the selection. Ideally probably both unweighted and weighted. The implementation can heavily rely on PackedSelection as introduced in #119.

relevant coffea PR: CoffeaTeam/coffea#797
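Independently of the PackedSelection-based implementation, the bookkeeping behind a cutflow table is simple. A plain-numpy sketch with invented toy cuts (not the actual AGC selection), showing both unweighted and weighted cumulative yields:

```python
import numpy as np

def cutflow(weights, cuts):
    """Cumulative unweighted and weighted yields after each named cut.

    `cuts` is an ordered mapping of cut name -> per-event boolean mask,
    mimicking the kind of bookkeeping PackedSelection does internally.
    """
    rows = []
    mask = np.ones_like(weights, dtype=bool)
    rows.append(("initial", int(mask.sum()), float(weights.sum())))
    for name, cut in cuts.items():
        mask = mask & cut  # cuts are applied cumulatively, in order
        rows.append((name, int(mask.sum()), float(weights[mask].sum())))
    return rows

# invented toy events
weights = np.array([1.0, 0.5, 2.0, 1.0])
lep_pt = np.array([30.0, 20.0, 40.0, 26.0])
njet = np.array([4, 5, 3, 4])
table = cutflow(weights, {"lep_pt > 25": lep_pt > 25, "njet >= 4": njet >= 4})
```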

Update CMS ttbar requirements.txt

ServiceX endpoints running 1.1.4 require an updated set of frontend libraries (pip install servicex-clients works for that at the moment).

Follow-up items to ML extension of analysis

Collecting follow-up items to #122 here that are not crucial to be addressed immediately in that PR but can be revisited. cc @ekauffma

  • understand the large change in event yields with the new cuts (almost an order of magnitude fewer events); the new yields are consistent with what CMS had for the 2022 open data workshop in https://cms-opendata-workshop.github.io/workshop2022-lesson-ttbarljetsanalysis/02-coffea-analysis/index.html#plotting
  • investigate possibility of merging histogram-writing code into single function that avoids hardcoding information where possible
  • harmonize object names (includes also the ML training notebook and documentation probably) from e.g. "top_hadron jet" to "b_{had top}" etc, focusing the names of the b-tagged jet on "b" instead of "top"
  • where did the particle dependency come from?
  • the model_even, model_odd determination would probably make for a good utils function to remove that from the notebook
  • turn the first look at all the ML features into a grid of plots to save some space
  • make the func_adl query depend on whether inference is used (do not serve extra columns if they are not needed)
  • the last cabinetry part also needs an if USE_INFERENCE wrapping

PHYSLITE version of AGC pipeline

  • reading with the uproot ServiceX transformer is blocked by iris-hep/func_adl_uproot#82 (does not work with uproot.lazy; for uproot5, the uproot.dask equivalent is needed)
    • investigate xAOD transformer instead -> not possible currently (no R22 deployment)
  • reading directly with coffea/uproot through XCache at UChicago needs IAM token support at ATLAS coffea-casa AF

misc info:

Add single top modeling systematics

Use the existing tW variations to define modeling systematics. cc @ekauffma

Relevant input file excerpt:

    "single_top_tW": {
        "nominal": {
            "nevts_total": 1999400,
            "files": [
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19412_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_00000_0000.root",
                    "nevts": 847600
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19412_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_20000_0000.root",
                    "nevts": 151800
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_top_5f_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19419_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_70000_0000.root",
                    "nevts": 1000000
                }
            ]
        },
        "scaledown": {
            "nevts_total": 995200,
            "files": [
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_scaledown_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19415_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_20000_0000.root",
                    "nevts": 495200
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_top_5f_scaledown_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19422_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_10000_0000.root",
                    "nevts": 213600
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_top_5f_scaledown_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19422_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_80000_0000.root",
                    "nevts": 286400
                }
            ]
        },
        "scaleup": {
            "nevts_total": 998800,
            "files": [
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_scaleup_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19416_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_20000_0000.root",
                    "nevts": 499600
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_top_5f_scaleup_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19423_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_60000_0000.root",
                    "nevts": 499200
                }
            ]
        },
        "DS": {
            "nevts_total": 1140600,
            "files": [
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_DS_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19410_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_00000_0000.root",
                    "nevts": 260600
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_DS_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19410_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_50000_0000.root",
                    "nevts": 238600
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_DS_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19412_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_20000_0000.root",
                    "nevts": 151800
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_top_5f_DS_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19417_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_10000_0000.root",
                    "nevts": 489600
                }
            ]
        }
    }

SyntaxError: 'await' outside function

If coffea.ipynb is converted to a script using nbconvert, the resulting script will abort with the error in the title, due to the statement

all_histograms = await utils.produce_all_histograms(...)
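Top-level await is only valid in a notebook cell (via IPython's autoawait); a converted script needs an explicit event loop. A minimal sketch of a fix, with a dummy coroutine standing in for utils.produce_all_histograms:

```python
import asyncio

async def produce_all_histograms():
    """Dummy stand-in for utils.produce_all_histograms(...)."""
    return {"4j1b": [0, 1, 2]}

async def main():
    # 'await' is legal here because we are inside an async function
    return await produce_all_histograms()

# replaces the bare top-level 'await' in the converted script
all_histograms = asyncio.run(main())
```

Note that asyncio.run cannot be called from within an already-running event loop, so this form only works in the script, not back inside the notebook.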

ServiceX/coffea HZZ Opendata analysis is failing with `Cannot convert OpenFile to pyarrow.lib.NativeFile`

Notebook https://github.com/iris-hep/analysis-grand-challenge/blob/main/analyses/atlas-open-data-hzz/CoffeaHZZAnalysisWithServiceX.ipynb with ServiceX/coffea HZZ Opendata analysis is failing with Cannot convert OpenFile to pyarrow.lib.NativeFile (see traceback below).

I will try to debug, but maybe @lgray knows: could this be connected to the version of pyarrow we are using? (cc: @gordonwatts @BenGalewsky)

I see that https://github.com/CoffeaTeam/coffea/blob/master/setup.py#L63 only expects pyarrow > 1.0.0 (so with coffea I got pyarrow 5.0.0), but installing func-adl_servicex then downgrades it to 3.0.0:

>>> pyarrow.__version__
'3.0.0'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_425/3909352362.py in run_updates_stream(accumulator_stream, name)
     30         try:
---> 31             async for coffea_info in accumulator_stream:
     32                 pass

/opt/conda/lib/python3.8/site-packages/coffea/processor/servicex/executor.py in execute(self, analysis, datasource, title)
     82         async with finished_events.stream() as streamer:
---> 83             async for results in async_accumulate(streamer):
     84                 yield results

/opt/conda/lib/python3.8/site-packages/coffea/processor/accumulator.py in async_accumulate(result_stream)
    101     output = None
--> 102     async for results in result_stream:
    103         if output:

/opt/conda/lib/python3.8/site-packages/aiostream/stream/advanced.py in base_combine(source, switch, ordered, task_limit)
     58             try:
---> 59                 result = task.result()
     60 

/opt/conda/lib/python3.8/site-packages/aiostream/stream/create.py in call(func, *args, **kwargs)
     84     if asyncio.iscoroutinefunction(func):
---> 85         yield await func(*args, **kwargs)
     86     else:

/opt/conda/lib/python3.8/site-packages/coffea/processor/servicex/executor.py in inline_wait(r)
     75             "This could be inline, but python 3.6"
---> 76             x = await r
     77             return x

/opt/conda/lib/python3.8/site-packages/distributed/client.py in _result(self, raiseit)
    244                 typ, exc, tb = exc
--> 245                 raise exc.with_traceback(tb)
    246             else:

/opt/conda/lib/python3.8/site-packages/coffea/processor/servicex/executor.py in run_coffea_processor()
    158     elif data_type == "parquet":
--> 159         events = NanoEventsFactory.from_parquet(
    160             file=str(events_url),

/opt/conda/lib/python3.8/site-packages/coffea/nanoevents/factory.py in from_parquet()
    197             fs_file = fsspec.open(file, "rb")
--> 198             table_file = pyarrow.parquet.ParquetFile(fs_file, **parquet_options)
    199         elif isinstance(file, pyarrow.parquet.ParquetFile):

/opt/conda/lib/python3.8/site-packages/pyarrow/parquet.py in __init__()
    216         self.reader = ParquetReader()
--> 217         self.reader.open(source, use_memory_map=memory_map,
    218                          buffer_size=buffer_size,

/opt/conda/lib/python3.8/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.open()
    946 
--> 947         get_reader(source, use_memory_map, &rd_handle)
    948         with nogil:

/opt/conda/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.get_reader()
   1472 
-> 1473     nf = get_native_file(source, use_memory_map)
   1474     reader[0] = nf.get_random_access_file()

/opt/conda/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.get_native_file()
   1465 
-> 1466     return source
   1467 

TypeError: Cannot convert OpenFile to pyarrow.lib.NativeFile

scikit-learn missing from `requirements.txt`

It's now a mandatory dependency:

Traceback (most recent call last):
  File "/home/blue/Scratchpad/work/agc/analysis-grand-challenge/analyses/cms-open-data-ttbar/ttbar_analysis_pipeline.py", line 615, in <module>
    model_even = XGBClassifier()
                 ^^^^^^^^^^^^^^^
  File "/home/blue/Tools/miniconda3/envs/agc-py311/lib/python3.11/site-packages/xgboost/core.py", line 620, in inner_f
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/home/blue/Tools/miniconda3/envs/agc-py311/lib/python3.11/site-packages/xgboost/sklearn.py", line 1422, in __init__
    super().__init__(objective=objective, **kwargs)
  File "/home/blue/Tools/miniconda3/envs/agc-py311/lib/python3.11/site-packages/xgboost/sklearn.py", line 584, in __init__
    raise ImportError(
ImportError: sklearn needs to be installed in order to use this module

Skyhook + ServiceX integration

Deploy Skyhook + ServiceX at SSL & UNL (or elsewhere), ready to be used in technical demonstration at the next AGC workshop (#29). This is a requirement also for being able to benchmark Skyhook for the next milestone #5.

Target date: end of April 2022

Requires

  • [to be filled]

Add a way to validate output histograms against a trusted reference

In first approximation it would be enough to publish reference histogram content, in whatever easily readable format (even JSON, or serialized dictionaries of numpy arrays).

A simple CLI tool or similar that runs the comparison between two such JSON files could be provided on top.

Complications:

  • one systematic variation depends on random numbers, so bin values won't be stable (especially with multi-threaded/out-of-order execution): I think it's ok to clearly mark it as such and only expect very approximate equality there
  • bin values depend on how many files per process are used: I think we should provide values for at least 1 file per process (for quick validation) and for the full dataset (for full validation)
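A minimal stdlib sketch of such a comparison, assuming histograms are serialized as a JSON object mapping histogram name to a list of bin contents; the function name and tolerance values are invented for illustration:

```python
import math

def compare_histograms(reference, other, rtol=1e-6):
    """Return a list of human-readable mismatches between two histogram dicts.

    Both inputs map histogram name -> list of bin contents, e.g. loaded
    from JSON files. `rtol` can be loosened for RNG-dependent variations.
    """
    problems = []
    for name in sorted(set(reference) | set(other)):
        if name not in reference or name not in other:
            problems.append(f"{name}: present in only one file")
            continue
        ref, new = reference[name], other[name]
        if len(ref) != len(new):
            problems.append(f"{name}: {len(ref)} vs {len(new)} bins")
            continue
        for i, (r, n) in enumerate(zip(ref, new)):
            if not math.isclose(r, n, rel_tol=rtol, abs_tol=1e-12):
                problems.append(f"{name}[{i}]: {r} vs {n}")
    return problems
```

A CLI wrapper would just json.load the two files, call this, and exit non-zero if the returned list is non-empty.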

Handle common code between training and inference pipelines

Currently, the training and inference pipelines in the CMS open data ttbar notebook have overlapping code which is simply duplicated at this point. This will be difficult to maintain, so it would be good to import these common features from one file.

Commonalities:

  • Calculation of top mass using trijet reconstruction method
  • Cuts
  • Options for processing

Structure for processor for correctionlib adoption / refactor

  • event object contains event data
  • call to correctionlib to get event_nominal version with all objects calibrated to their nominal state
  • loop over all systematic variations that change object kinematics (+ nominal)
    • figure out what objects changed, vary those objects that did change via correctionlib -> event_for_this_systematic
    • run event selection + observable calculation (*1)
    • fill histograms
      • if nominal: also fill all weight variations (in another loop over weight-based systematics), weights obtained via correctionlib

(*1): this does not automatically avoid re-calculation of quantities that do not change (e.g. those derived from jets for muon systematic variations)

cc @Andrew42

see also Lindsey's talk at the last AGC demo day https://indico.cern.ch/event/1232470/#4-awkward-dask-integration-int
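The loop structure above can be sketched as a skeleton; every helper here is a hypothetical stand-in, not the actual AGC processor or the correctionlib API:

```python
def calibrate(event, variation):
    """Stand-in for a correctionlib-based calibration of object kinematics."""
    return event

def select_and_compute(event):
    """Stand-in for event selection + observable calculation (step *1)."""
    return event

def fill(selected, weight):
    """Stand-in for histogram filling; returns a weighted count instead."""
    return weight * len(selected)

def process(event, kinematic_variations, weight_variations):
    """Skeleton of the proposed processor structure."""
    histograms = {}
    for variation in ["nominal"] + list(kinematic_variations):
        varied = calibrate(event, variation)   # vary only objects that change
        selected = select_and_compute(varied)  # selection + observables
        histograms[variation] = fill(selected, weight=1.0)
        if variation == "nominal":
            # weight-based systematics reuse the nominal selection
            for wvar, w in weight_variations.items():
                histograms[wvar] = fill(selected, weight=w)
    return histograms

histograms = process([1, 2, 3], ["jes_up"], {"pu_up": 2.0})
```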

Local use of `FuturesExecutor`

Testing #89 revealed that FuturesExecutor locally fails with

AttributeError: Can't get attribute 'AGCSchema' on <module '__main__' (built-in)>

On coffea-casa at UNL and at SSL this works fine.

This can be avoided by converting the notebook

jupyter nbconvert --to script coffea.ipynb
mv coffea.py agc.py

where renaming is important as the name coffea collides with the module name. This then results in

AttributeError: module 'vector' has no attribute 'behavior'

which can be solved by removing import vector; vector.register_awkward(), which may no longer be needed now that vector is no longer used to construct 4-vectors manually for ServiceX.

We should track down why exactly these things fail to provide a more general solution.

Add Z+jets background to ttbar example analysis

We were missing Z+jets so far as we did not find it in the available Open Data, but the following files seem to be the right ones to use. Thanks to @tpmccauley for pointing this out!

/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/RunIIFall15MiniAODv2-PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext4-v1/MINIAODSIM
https://opendata.cern.ch/record/16458

/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/RunIIFall15MiniAODv2-PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext1-v2/MINIAODSIM
https://opendata.cern.ch/record/16457

/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/RunIIFall15MiniAODv2-PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1/MINIAODSIM
https://opendata.cern.ch/record/16456

With dask/distributed 2021.10.0 names of workers are not anymore consistent

With dask/distributed 2021.10.0, we noticed that worker names are no longer consistent, so workers are not retired correctly (see the coffea-casa patch for distributed):

distributed.deploy.adaptive - INFO - Retiring workers ['CoffeaCasaCluster-0']
distributed.scheduler - INFO - Retire worker names ('CoffeaCasaCluster-0',)
distributed.scheduler - INFO - Retire worker names ('htcondor--16824264.0--',)
distributed.scheduler - INFO - Retire workers {<WorkerState 'tls://red-c7217.unl.edu:43819', name: htcondor--16824264.0--, status: running, memory: 10, processing: 1>}
distributed.scheduler - INFO - Moving 9 keys to other workers
distributed.scheduler - INFO - Remove worker <WorkerState 'tls://red-c7217.unl.edu:43819', name: htcondor--16824264.0--, status: running, memory: 10, processing: 1>
distributed.deploy.adaptive - INFO - Retiring workers ['CoffeaCasaCluster-3']
distributed.scheduler - INFO - Retire worker names ('CoffeaCasaCluster-3',)
distributed.scheduler - INFO - Retire worker names ('htcondor--16824268.0--',)

[Discussion] Define a standard data subset for "micro" challenge

The idea is to define how to reduce the total events processed so people can compare results in microbenchmark settings.

1) pick first N files per process

This would reduce the number of files hit by the workload. Given that all files are similarly sized NanoAOD, I worry that different processes may have very different total file counts. Another problem is that reducing the number of files may change how some people prefer to parallelize the workload.

2) pick first N events per file

This hits all the files like the full version does, effectively pretending each file is smaller than it really is. Since it touches every file, it is fairer with respect to specific filesystem/network considerations, and because NanoAODs are similarly sized, this should be reasonable. One problem with this approach: opening many files is itself a bottleneck, and this approach does not reduce the time spent there.

3,4) pick first some% files / events from each process/file

Similar to 1 and 2, except done as a percentage. This should be fairer and produce higher-fidelity physics results, but is annoying to implement...
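Options 1 and 2 are straightforward to implement. A sketch using a simplified stand-in for the AGC fileset layout (the function and field names here are assumptions, not the actual AGC configuration format):

```python
def micro_subset(fileset, n_files=None, n_events=None):
    """Reduce a fileset per option 1 (first N files per process) and/or
    option 2 (first N events per file).

    `fileset` maps process name -> list of {"path": ..., "nevts": ...}
    dicts; a simplified stand-in for the AGC input specification.
    """
    reduced = {}
    for process, files in fileset.items():
        kept = files[:n_files] if n_files is not None else files
        if n_events is not None:
            # cap the events read per file without dropping any file
            kept = [{**f, "nevts": min(f["nevts"], n_events)} for f in kept]
        reduced[process] = kept
    return reduced

fileset = {"ttbar": [{"path": "a.root", "nevts": 1000},
                     {"path": "b.root", "nevts": 500}]}
```

Options 3 and 4 would replace the fixed counts with per-process or per-file fractions.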

Merge strategy for recent changes

the first two can be flipped

  • jupytext #92
  • ServiceX update #96 (once relevant deployments are confirmed to be available everywhere)
  • ServiceX update ported to ttbar notebook (= ACAT + improved ServiceX pipeline) #100
  • Improved ServiceX queries (all datasets in parallel) #107
  • update requirements.txt to reflect ServiceX client version required #112 -> then tag v0.2 -> v0.2.0
  • NanoAOD switch #102 (cc @ekauffma) -> then tag v1.0 (= v0.2 but with NanoAOD) -> v1.0.0
    • no analysis logic changes (e.g. new cuts, x-sec values), these will be introduced after the tag on the way towards v2, we should also track them in a text document to more easily update other AGC implementations

cc @oshadura

Highest priority UX improvements for next demonstration

#64 lists possible UX + performance improvements. This issue gathers the highest priority items towards a next demonstration of the pipeline, possibly at PyHEP 2022 if this gets accepted (Sep 12–16).

Serve Open Data without EOS bottleneck

File access via EOS has previously been a bottleneck when serving Open Data via ServiceX. For benchmarking purposes (#5), ensure that this bottleneck is avoided. This will likely have to be done by duplicating the data and serving it from elsewhere (e.g. through SSL).

Target date: June 1, 2022 via #5

Requires

  • [to be filled]

pyhf + funcX demonstrator

Deploy pyhf + funcX at SSL / UNL and showcase demonstrator at the next AGC workshop (#29). Workshop attendees should be able to access this service to try out the functionality.

Target date: end of April 2022

Requirements

  • [to be added]

`dask-awkward` migration & `coffea` 2023

see also #113 and #164

  • __setitem__ for floats for things like events["pt_scale_up"] = 1.03 (instead of using dak.ones_like(events.MET.pt))
  • np.random on dask-awkward arrays
    • could use dask.array (example: dask.array.random.uniform(size=(1000,), chunks=(250,))) together with dak.to_dask_array and dak.from_dask_array
    • could use np.random.normal within dak.map_partitions
  • histogram filling with float weights (AttributeError: 'float' object has no attribute 'ndim') scikit-hep/hist#493 -> done via dask-contrib/dask-histogram#69
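The dask.array-based workaround can be sketched as follows (array sizes are illustrative): random values are generated lazily in chunks, matching the chunked processing of the event data.

```python
import dask.array as da

# illustrative sketch: one random scale factor per event, generated lazily
# with dask.array in chunks; `n_events` and the chunk size are placeholders
n_events = 1000
rand = da.random.uniform(low=0.95, high=1.05, size=(n_events,), chunks=(250,))

# dask_awkward's dak.from_dask_array(rand) would turn this into a
# dask-awkward array usable alongside event collections
values = rand.compute()  # concrete NumPy array of 1000 floats
print(values.shape)
```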

GAC.3: Coordinate with AS, DOMA, SSL and operations programs to benchmark performance of prototype system components to be used for Analysis Grand Challenge

This milestone already exists as AS & SSL milestones, but should be an AGC milestone as well. This issue will be used to track progress and collect issues related to this milestone.

Target date: June 1, 2022

Requires

  • definition of analysis used for demonstrations
  • #30 to enable benchmarking with Skyhook
  • #31
  • #32

ServiceX transformation fails for a few files with specific query

From the CMS ttbar notebook, slightly adapted:

import time  # PIPELINE and SERVICEX_IGNORE_CACHE are defined earlier in the notebook

if PIPELINE == "servicex_databinder":
    from servicex_databinder import DataBinder
    t0 = time.time()

    # query for events with at least 4 jets with 25 GeV, at least one b-tag, and exactly one electron or muon with pT > 25 GeV
    # returning columns required for subsequent processing
    query_string = """Where(
        lambda event: event.electron_pt.Where(lambda pT: pT > 25).Count() + event.muon_pt.Where(lambda pT: pT > 25).Count() == 1
        ).Where(lambda event: event.jet_pt.Where(lambda pT: pT > 25).Count() >= 4
        ).Where(lambda event: event.jet_btag.Where(lambda btag: btag > 0.5).Count() >= 1
        ).Select(
             lambda e: {"electron_pt": e.electron_pt, "muon_pt": e.muon_pt,
                        "jet_pt": e.jet_pt, "jet_eta": e.jet_eta, "jet_phi": e.jet_phi, "jet_mass": e.jet_mass, "jet_btag": e.jet_btag}
    )"""
    
    # simplified variant of the query above (note: this overwrites query_string)
    query_string = """Where(
        lambda event: event.electron_pt.Where(lambda pT: pT > 25).Count() + event.muon_pt.Where(lambda pT: pT > 25).Count() == 1
        ).Select(
             lambda e: {"electron_pt": e.electron_pt, "muon_pt": e.muon_pt,
                        "jet_pt": e.jet_pt, "jet_eta": e.jet_eta, "jet_phi": e.jet_phi, "jet_mass": e.jet_mass, "jet_btag": e.jet_btag}
    )"""

    sample_names = ["ttbar__nominal"]  # 1.5 TB, 7066 files
    sample_list = []

    for sample_name in sample_names:
        sample_list.append({"Name": sample_name, "RucioDID": f"user.ivukotic:user.ivukotic.{sample_name}", "Tree": "events", "FuncADL": query_string})


    databinder_config = {
        "General": {
            "ServiceXBackendName": "uproot",
            "OutputDirectory": "outputs_databinder",
            "OutputFormat": "root",
            "IgnoreServiceXCache": SERVICEX_IGNORE_CACHE,
        },
        "Sample": sample_list,
    }

    sx_db = DataBinder(databinder_config)
    out = sx_db.deliver()
    print(f"execution took {time.time() - t0:.2f} seconds")

This fails for a single file. The failure seems to be query-related.

Add an option to run the analysis on local data

For the purpose of testing different setups and configurations, it is often useful to be able to run the analysis workload with data on local storage (e.g. to take network I/O, XRootD caches, etc. out of the picture).

It would be nice if the AGC implementations had an option to pre-download a small set of files to the specified local path and do a run on that data instead.
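A sketch of such an option (all names hypothetical), assuming a fileset of the form `{sample: {"files": [url, ...]}}`:

```python
import pathlib
import urllib.request

# hypothetical sketch: download the first few files per sample to local storage
# and return a fileset pointing at the local copies; the fileset layout
# {sample: {"files": [url, ...]}} is an assumption
LOCAL_DATA_PATH = pathlib.Path("local_data")

def localize(fileset, n_files=2):
    local_fileset = {}
    for sample, info in fileset.items():
        local_paths = []
        for url in info["files"][:n_files]:
            target = LOCAL_DATA_PATH / sample / url.rsplit("/", 1)[-1]
            target.parent.mkdir(parents=True, exist_ok=True)
            if not target.exists():
                urllib.request.urlretrieve(url, str(target))
            local_paths.append(str(target))
        local_fileset[sample] = {"files": local_paths}
    return local_fileset
```

Subsequent runs would then use the returned fileset and skip the download when the local copies already exist.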

Benchmarking epic

This gathers points related to performance and functionality for benchmarking.

  • UChicago dev instance: FileNotFoundError when running over local files (reproducer)
  • UChicago prod instance: KilledWorker exception at scale (reproducer: default notebook with N_FILES_MAX_PER_SAMPLE>=1000)
  • scaling beyond ~50 workers at UNL
  • dask.distributed scaling behavior (reproducer below): differs per site; occasionally there are long tails where the last few tasks do not get picked up for a long time, and sometimes workers only process a few tasks while tasks are still remaining
  • basket size can matter a lot for uproot; 10-100 kB per basket is good (could try re-merging with hadd -O). Small input files have num_baskets == 1, while files merged 10->1 have 10.
  • split out pre-processing since in principle it's optional (https://gist.github.com/alexander-held/4db5624ab302c423ce40dc82d65965a2)
  • bytesread bug CoffeaTeam/coffea#717

things to add to notebook (-> #85):

  • pre-processing skip
  • adjustable number of branches accessed
  • turning off processor logic (remove I/O and/or awkward)
  • optionally removing systematics

points to follow up on (thanks @nsmith-!)

if there is time:

  • investigate whether unused branches in the files matter for performance as observed by @andrzejnovak in another context

related tooling:

dask.distributed setup for simple scaling tests:

import time

from dask.distributed import Client, progress

client = Client("tls://localhost:8786")  # address of an existing scheduler

def do_something(x):
    time.sleep(5)  # stand-in for a fixed-duration task
    return x

futures = client.map(do_something, range(1000))  # submit 1,000 tasks
progress(futures)  # live progress bar to watch scaling behavior

Event rate measurement: what to use as reference (events in input vs events passing selection)? For reference (10 input files per process):

ttbar__nominal            : 442122 events in input ->  88013 passing selection (frac: 19.91%)
ttbar__scaledown          : 435118 events in input ->  81571 passing selection (frac: 18.75%)
ttbar__scaleup            : 416314 events in input ->  86154 passing selection (frac: 20.69%)
ttbar__ME_var             : 455600 events in input ->  94637 passing selection (frac: 20.77%)
ttbar__PS_var             : 406193 events in input ->  76202 passing selection (frac: 18.76%)
single_top_s_chan__nominal: 221600 events in input ->  14847 passing selection (frac: 6.70%)
single_top_t_chan__nominal: 413691 events in input ->  22346 passing selection (frac: 5.40%)
single_top_tW__nominal    : 382354 events in input ->  39945 passing selection (frac: 10.45%)
wjets__nominal            : 412269 events in input ->    438 passing selection (frac: 0.11%)

where (by file size) W+jets is ~31% of all events available, ttbar nominal is ~40%, the four ttbar variations are ~17% together.
By number of events (948 M events total), the breakdown is 46% W+jets, 30% ttbar nominal, 12% ttbar variations, 11% single top t-channel.

some benchmarking calculations for the usual notebook:

# assumes `time_taken`, `metrics`, and `NUM_WORKERS` from the usual coffea executor run
print(f"\nexecution took {time_taken:.2f} seconds")
print(f"event rate / worker: {metrics['entries'] / NUM_WORKERS / time_taken / 1000:.3f} kHz (including overhead, so pessimistic estimate)")

print(f"data read: {metrics['bytesread']/1000**3:.3f} GB")
print(f"events processed: {metrics['entries']/1000**2:.3f} M")
print(f"processtime: {metrics['processtime']:.3f} s (?!)")
print(f"processtime per worker: {metrics['processtime']/NUM_WORKERS:.3f} s (should be similar to real runtime, will be lower if opening files etc. is a significant contribution)")
print(f"processtime per chunk: {metrics['processtime']/metrics['chunks']:.3f} s")
print(metrics)

Save visualizations as PDF and PNG post axis and label manipulation

Let me first say that @alexander-held and @oshadura have done a super impressive job with all the notebook demos and this Issue should be considered a super small nitpick that is an "if there is time" thing to consider. For the current state of the HZZ_analysis_pipeline notebook everything inside the notebook looks super good. 🎉

What would be nice though is if all visualizations could be:

  • saved under the figures directory
  • as both PDF and PNG file formats (with facecolor="white")
  • have any axis label manipulations performed after initial figure generation reflected in the saved files as well.
  • At the moment, the mplhep and hist plot in cell 9 is not saved out as a figure under the figures directory.
  • The cabinetry pulls and correlation matrix (cell 15 and 16) plots look great, but are saved (by default) as PDF only, so it would be good if they could be saved as PNG as well.
  • The final 4 lepton invariant mass plot (cell 17) looks great as well in the rendered notebook

[image: plot as displayed in the rendered notebook]

however, the PDF version that is saved out to the figures directory

[image: saved PDF version, without the label edits]

doesn't have the label modifications of

# modify x-axis label
figure_dict[0]["figure"].axes[1].set_xlabel("m4l [GeV]")

applied to it. It would be great if the saved versions and the displayed versions in the notebook could be identical.
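A sketch of the requested behavior (file names hypothetical): apply the label edits first, then save the same figure object in both formats, so the saved and displayed versions match.

```python
import pathlib

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted saving
import matplotlib.pyplot as plt

pathlib.Path("figures").mkdir(exist_ok=True)

# toy figure standing in for the notebook plots
fig, ax = plt.subplots()
ax.plot([100, 125, 150], [5, 20, 3], drawstyle="steps-mid")
ax.set_xlabel("m4l [GeV]")  # label edits before saving, so they end up in the files

for ext in ("pdf", "png"):
    fig.savefig(f"figures/m4l.{ext}", facecolor="white", bbox_inches="tight")
```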

Like I said, all super small nitpicks that aren't make or break for the demos. I can't promise that I'll have time to address these tonight, but if I do have time I'll open a PR for these, but if someone else is feeling bored please feel free to beat me to it. 👍

Understand scale variation changes via correctionlib implementation

#119 changed the pt_scale_up variations; in particular, the value of the following changed:

sum(all_histograms[:, "4j2b", "ttbar", "pt_scale_up"].values())

Prior to this PR, the propagation of the pT changes to the observable itself was not taken into account, so the bin-by-bin distribution in this region is expected to differ before and after the PR. It is currently unclear, though, why the total event count in this region would change as well.

The 4j1b region does not seem to be affected by this.

cc @Andrew42

Accessing 2015 CMS Open Data

The 2015 CMS Open Data will be used as input for an AGC demonstrator analysis. This recent release provides miniAODs. To interface the remainder of the envisioned analysis pipeline, we need to be able to serve columns from this Open Data. There are two options:

  • convert miniAODs to nanoAODs ahead of time, and serve nanoAODs via ServiceX,
  • serve columns from miniAODs directly via ServiceX.

The conversion miniAOD -> nanoAOD could happen either via ServiceX ahead of time or in another fashion. To avoid bottlenecks, data should likely be duplicated and served from somewhere other than EOS (#32).

Target date: end of April 2022

Requires

  • [to be filled]

Execute the next AGC workshop

The next AGC workshop needs to take place before the 42-month IRIS-HEP review on May 16/17. The format will be similar to the previous workshop (#2).

Target date: end of April 2022

Requires
