
analysis-grand-challenge's Introduction

Analysis Grand Challenge (AGC)


The Analysis Grand Challenge (AGC) is about performing the last steps in an analysis pipeline at scale to test workflows envisioned for the HL-LHC. This includes

  • columnar data extraction from large datasets,
  • processing of that data (event filtering, construction of observables, evaluation of systematic uncertainties) into histograms,
  • statistical model construction and statistical inference,
  • relevant visualizations for these steps,

all done in a reproducible & preservable way that can scale to HL-LHC requirements.

(Figure: overview of the AGC analysis pipeline)

The AGC has two major pieces:

  1. specification of a physics analysis using Open Data which captures relevant workflow aspects encountered in physics analyses performed at the LHC,
  2. a reference implementation demonstrating the successful execution of this physics analysis at scale.

The physics analysis task is a $t\bar{t}$ cross-section measurement with 2015 CMS Open Data (see datasets/cms-open-data-2015). The current reference implementation can be found in analyses/cms-open-data-ttbar. In addition to this, analyses/atlas-open-data-hzz contains a smaller scale $H\rightarrow ZZ^*$ analysis based on ATLAS Open Data.

See this talk given at ICHEP 2022 for more information about the AGC. Additional information is available from two workshops focused on the AGC.

We also have a dedicated webpage and a website for documentation.

AGC and IRIS-HEP

The AGC serves as an integration exercise for IRIS-HEP, allowing the testing of new services, libraries and workflows on dedicated analysis facilities in the context of realistic physics analyses.

AGC and you

We believe that the AGC can be useful in various contexts:

  • testbed for software library development,
  • realistic environment to prototype analysis workflows,
  • functionality, integration & performance test for analysis facilities.

We are very interested in seeing (parts of) the AGC implemented in different ways! Besides the implementation in this repository, take a look at the other implementations that are available.

Please get in touch if you have investigated other approaches you would like to share! There is no need to implement the full analysis task — it splits into pieces (for example the production of histograms) that can also be tackled individually.

More details: what is being investigated in the AGC context

  • New user interfaces: Complementary services that present the analyst with a notebook-based interface. Example software: Jupyter.
  • Data access: Services that provide quick access to the experiment’s official data sets, often allowing simple derivations and local caching for efficient access. Example software and services: Rucio, ServiceX, SkyHook, iDDS, RNTuple.
  • Event selection: Systems/frameworks allowing analysts to process entire datasets, select desired events, and calculate derived quantities. Example software and services: Coffea, awkward-array, func_adl, RDataFrame.
  • Histogramming and summary statistics: Closely tied to the event selection, histogramming tools provide physicists with the ability to summarize the observed quantities in a dataset. Example software and services: Coffea, func_adl, cabinetry, hist.
  • Statistical model building and fitting: Tools that translate specifications for event selection, summary statistics, and histogramming quantities into statistical models, leveraging the capabilities above, and perform fits and statistical analysis with the resulting models. Example software and services: cabinetry, pyhf, FuncX+pyhf fitting service.
  • Reinterpretation / analysis preservation: Standards for capturing the entire analysis workflow, and services to reuse the workflow which enables reinterpretation. Example software and services: REANA, RECAST.
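As a toy illustration of the statistical-inference step, the sketch below hand-rolls a single-bin Poisson likelihood scan in plain numpy. It only mimics the kind of fit that pyhf and cabinetry perform from a full HistFactory specification; all function names and numbers here are invented for illustration.

```python
import numpy as np

def nll(mu, n_obs, s, b):
    """Negative log-likelihood of a single-bin Poisson counting experiment
    with expectation mu * s + b (constant terms dropped)."""
    lam = mu * s + b
    return lam - n_obs * np.log(lam)

def fit_mu(n_obs, s, b, scan=np.linspace(0.0, 3.0, 3001)):
    """Best-fit signal strength mu via a brute-force likelihood scan."""
    values = [nll(mu, n_obs, s, b) for mu in scan]
    return float(scan[int(np.argmin(values))])

# invented toy inputs: 125 observed events, 50 expected signal, 75 background
mu_hat = fit_mu(n_obs=125, s=50.0, b=75.0)  # minimum sits at mu = 1
```

In a real analysis the likelihood has many bins, channels, and constrained nuisance parameters; pyhf builds and minimizes that model from a declarative JSON workspace instead of a hand-written scan.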

Acknowledgements

This work was supported by the U.S. National Science Foundation (NSF) cooperative agreement OAC-1836650 (IRIS-HEP).

analysis-grand-challenge's People

Contributors

alexander-held, andrew42, andriipovsten, andrzejnovak, davekch, eguiraud, ekauffma, jayjeetatgithub, kondratyevd, kyungeonchoi, mapsacosta, masonproffitt, mat-adamec, matthewfeickert, nanoemc, oshadura, rasmalais, saransh-cpp, stormsomething, talvandaalen, tatianaovsiannikova


analysis-grand-challenge's Issues

NanoAOD file access over http and performance

ServiceX transforms of NanoAOD files and direct uproot-based access via http seem to be slower than for ntuples: https://gist.github.com/alexander-held/4e58811522ed9990afb2d4b73ef9471e.

@masonproffitt pointed out an XRootD issue related to this: xrootd/xrootd#1976. Reading too much data causes a 500 error and uproot subsequently falls back to individual requests, making everything slower. A similar issue is xrootd/xrootd#2003: this is about requesting too many ranges at once, while the former is about requesting too many bytes in a range.

Related uproot issue during these investigations: scikit-hep/uproot5#881.

Impact on ServiceX

More details about the behavior of ServiceX from @masonproffitt:

the uproot backend does not set anything related to chunking; it just uses the default settings for uproot. the problems are a bit different between uproot4 (used in the current version of the servicex uproot transformer) and uproot5. in uproot4, the main problem is that uproot.lazy has an explicit iterator over branches, so the execution time scales linearly with both the number of branches accessed and the round trip latency. in uproot5, this problem should disappear thanks to uproot.dask, but there the issue is that it hits these xrootd limits and falls back to individual requests (at least for each branch, maybe even for each basket)

for uproot5, we can set the step_size in the servicex transformer, but i don't think there's a consistent way to guarantee that we don't hit these limits because there are separate limits for (1) number of byte ranges, (2) total ascii length of the Range field, and (3) total number of actual bytes requested by Range. the problem is that there's no way to know the number and size of the baskets before the code executes. handling this would require either going deep into uproot itself or inspecting a lot of metadata at runtime and modifying the generated code in very non-trivial ways
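The limits quoted above suggest a client-side workaround: split a list of (offset, length) byte ranges into batches that stay below a maximum range count and a maximum total size. A stdlib-only sketch of that idea; the limits used here are made-up placeholders (real servers do not advertise them, so in practice they would have to be tuned):

```python
def batch_ranges(ranges, max_ranges=64, max_bytes=8 * 2**20):
    """Split (offset, length) byte-range pairs into batches that respect
    per-request limits. Both limit values are invented placeholders."""
    batch, batch_bytes = [], 0
    for offset, length in ranges:
        # start a new batch if adding this range would exceed either limit
        if batch and (len(batch) >= max_ranges or batch_bytes + length > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append((offset, length))
        batch_bytes += length
    if batch:
        yield batch

# 100 ranges of 1 MiB each -> batches of at most 8 MiB with these limits
batches = list(batch_ranges([(i * 2**20, 2**20) for i in range(100)]))
```

This does not address limit (2), the total ASCII length of the Range header, which would need a similar check on the formatted header string; and as noted above, basket sizes are not known before the request is built, which is the hard part in practice.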

Impact on coffea

It is currently unclear if this would affect coffea directly ingesting the input dataset differently. Are there any tricks that may matter here @nsmith- @lgray? Currently we are still using "old" coffea, though preparing to switch to coffea 2023.

Improved func_adl query to not accidentally drop events that are accepted under systematic variations

Query including lepton and jet selection:

def get_query(source: ObjectStream) -> ObjectStream:
    return (
        source
        # == 1 lepton with pT > 25 GeV
        .Where(
            lambda e: e.electron_pt.Where(lambda pT: pT > 25).Count()
            + e.muon_pt.Where(lambda pT: pT > 25).Count() == 1
        )
        # >= 4 jets with pT > 25 GeV
        .Where(lambda e: e.jet_pt.Where(lambda pT: pT > 25).Count() >= 4)
        # >= 1 jet with pT > 25 GeV and b-tag >= 0.5
        .Where(
            lambda e: {"pT": e.jet_pt, "btag": e.jet_btag}
            .Zip()
            .Where(lambda jet: jet.btag >= 0.5 and jet.pT > 25)
            .Count() >= 1
        )
        # return columns
        .Select(
            lambda e: {
                "electron_pt": e.electron_pt,
                "muon_pt": e.muon_pt,
                "jet_pt": e.jet_pt,
                "jet_eta": e.jet_eta,
                "jet_phi": e.jet_phi,
                "jet_mass": e.jet_mass,
                "jet_btag": e.jet_btag,
            }
        )
    )

This does not account for object systematic variations in the processor that can relax acceptance requirements. The query should be loosened accordingly, though it is unclear how to do that in a way that guarantees full acceptance in a general setup.
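One conservative option (an illustration, not what the AGC currently does) is to loosen the server-side threshold by the largest expected upward variation, so that events which migrate into acceptance under a systematic shift still survive the query; the exact cuts are then reapplied per variation in the processor. The numbers below are invented placeholders:

```python
# All numbers here are invented placeholders, not AGC values.
NOMINAL_CUT = 25.0       # GeV, the analysis-level jet pT threshold
MAX_UP_VARIATION = 0.05  # assumed largest relative upward pT shift

# A jet just below the nominal cut can pass after an upward variation,
# so the server-side query must keep jets down to this looser threshold:
loose_cut = NOMINAL_CUT / (1.0 + MAX_UP_VARIATION)
```

The caveat from the text stands: in a general setup the maximum variation is not known ahead of time, so a margin chosen this way cannot guarantee full acceptance.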

Rucio-based file access

Add rucio-based file access to the AGC ttbar notebook. This requires suitable credentials to be run, given that the datasets available through rucio are experiment-restricted.

PIPELINE = "servicex_databinder" currently uses those inputs, but they should be integrated fully with the complete pipeline.

User experience and performance improvements for pipeline demonstrator

This collects various user experience and performance related aspects that the CMS Open Data pipeline demonstration at the AGC 2022 workshop revealed.

Completeness of pipeline

  • add a machine learning component (e.g. ttbar reconstruction), frequently requested and relevant for many analyses being done in practice

User experience

ServiceX+coffea

ServiceX

coffea

coffea-casa

  • dask manual scaling settings seem to not be accepted
  • ServiceX dashboard

func_adl

  • find ways to format queries in a way that helps understand the "layer" at which a given operation acts

processor design

  • avoid stacking masks of different shapes together (when built after initial filtering), hard to keep track of shapes (perhaps keepdims=True, or masking with None)
  • improve systematics loop, potentially streamline everything to use the same pattern, or find a way to automatically track which columns change when, and automatically expand observable with systematics dimensions, avoid scaling of jet properties via helper array

Performance

ServiceX+coffea

ServiceX

  • DID finder becomes a bottleneck when running over a large amount of files

coffea

servicex-databinder approach

  • avoid bottleneck with file conversion / copying (feed data straight to Skyhook?)

coffea-casa

  • understand issues showing up in dask task stream (file access?)
  • possibility of guaranteeing fixed number of workers for performance benchmarking

func_adl

  • implement full query with proper b-tagging of jets with pT > 25 GeV -> done in #85

cabinetry

  • cabinetry.templates.collect method takes a lot of time when introducing more channels (i.e. 45.3 seconds for 20 channels)
  • cabinetry.model_utils.prediction(model, fit_results=fit_results) causes notebook to crash due to memory issues on model with many channels. -> potentially related: scikit-hep/awkward#2480

Cabinetry 0.4.0 @ coffea-casa breaking CabinetryHZZAnalysis.ipynb

Using cabinetry 0.4.0 on coffea-casa makes the notebook https://github.com/iris-hep/analysis-grand-challenge/blob/main/analyses/atlas-open-data-hzz/CabinetryHZZAnalysis.ipynb die with:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_1850/4037724515.py in <module>
      2 
      3 start_time = time.time()
----> 4 cabinetry.template_builder.create_histograms(config, method="uproot")
      5 # cabinetry.template_builder.create_histograms(config, method="coffea")
      6 # Using the coffea backend requires this branch of cabinetry: https://github.com/alexander-held/cabinetry/pull/216

AttributeError: module 'cabinetry' has no attribute 'template_builder'

IRIS-HEP / AGC Demo day 16.12.2022

We'll be organizing a demo day on Friday 16.12.2022 at 11am Central, 5pm Central European Time.

Agenda

Indico agenda

See https://indico.cern.ch/event/1218004/

cc: @bbockelm

Format AGC outputs for HEPData

We should think about formatting the AGC outputs (figures, tables, etc.) for HEPData compatibility. For some outputs this will be straightforward (e.g. the HistFactory workspace in JSON format); others will probably require an intermediate formatting interface. This could also be a cabinetry feature, for example.

GAC.2: Execute IRIS-HEP AGC tools soft-launch event

Execute IRIS-HEP soft-launch event showcasing tools and workflows planned to be used for AGC. The event will feature interactive notebook talks to introduce tools and interfaces. It will start with a common track and later on split into ATLAS and CMS-specific examples. The event will be split over two afternoons (CET).

Target date: Dec 1, 2021

Requires

Stretch goal

Add cutflow tables

We should add cutflow tables for a few sensible cuts to show how they impact the selection. Ideally probably both unweighted and weighted. The implementation can heavily rely on PackedSelection as introduced in #119.

relevant coffea PR: CoffeaTeam/coffea#797
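Independently of the PackedSelection-based implementation, the bookkeeping behind a cutflow table is simple. A plain-numpy sketch with invented toy cuts (not the actual AGC selection), showing both unweighted and weighted cumulative yields:

```python
import numpy as np

def cutflow(weights, cuts):
    """Cumulative unweighted and weighted yields after each named cut.

    `cuts` is an ordered mapping of cut name -> per-event boolean mask,
    mimicking the kind of bookkeeping PackedSelection does internally.
    """
    rows = []
    mask = np.ones_like(weights, dtype=bool)
    rows.append(("initial", int(mask.sum()), float(weights.sum())))
    for name, cut in cuts.items():
        mask = mask & cut  # cuts are applied cumulatively, in order
        rows.append((name, int(mask.sum()), float(weights[mask].sum())))
    return rows

# invented toy events
weights = np.array([1.0, 0.5, 2.0, 1.0])
lep_pt = np.array([30.0, 20.0, 40.0, 26.0])
njet = np.array([4, 5, 3, 4])
table = cutflow(weights, {"lep_pt > 25": lep_pt > 25, "njet >= 4": njet >= 4})
```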

Update CMS ttbar requirements.txt

ServiceX endpoints running 1.1.4 require an updated set of frontend libraries (pip install servicex-clients works for that at the moment).

Follow-up items to ML extension of analysis

Collecting follow-up items to #122 here that are not crucial to be addressed immediately in that PR but can be revisited. cc @ekauffma

  • understand the large change in event yields with the new cuts (almost an order of magnitude fewer events); the new yields are consistent with what CMS had for the 2022 open data workshop in https://cms-opendata-workshop.github.io/workshop2022-lesson-ttbarljetsanalysis/02-coffea-analysis/index.html#plotting
  • investigate possibility of merging histogram-writing code into single function that avoids hardcoding information where possible
  • harmonize object names (includes also the ML training notebook and documentation probably) from e.g. "top_hadron jet" to "b_{had top}" etc, focusing the names of the b-tagged jet on "b" instead of "top"
  • where did the particle dependency come from?
  • the model_even, model_odd determination would probably make for a good utils function to remove that from the notebook
  • turn the first look at all the ML features into a grid of plots to save some space
  • make the func_adl query depend on whether inference is used (do not serve extra columns if they are not needed)
  • the last cabinetry part also needs an if USE_INFERENCE wrapping

PHYSLITE version of AGC pipeline

  • reading with the uproot ServiceX transformer is blocked by iris-hep/func_adl_uproot#82 (does not work with uproot.lazy; for uproot5, the uproot.dask equivalent is needed)
    • investigate xAOD transformer instead -> not possible currently (no R22 deployment)
  • reading directly with coffea/uproot through XCache at UChicago needs IAM token support at ATLAS coffea-casa AF

misc info:

Add single top modeling systematics

Use the existing tW variations to define modeling systematics. cc @ekauffma

Relevant input file excerpt:

    "single_top_tW": {
        "nominal": {
            "nevts_total": 1999400,
            "files": [
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19412_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_00000_0000.root",
                    "nevts": 847600
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19412_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_20000_0000.root",
                    "nevts": 151800
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_top_5f_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19419_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_70000_0000.root",
                    "nevts": 1000000
                }
            ]
        },
        "scaledown": {
            "nevts_total": 995200,
            "files": [
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_scaledown_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19415_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_20000_0000.root",
                    "nevts": 495200
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_top_5f_scaledown_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19422_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_10000_0000.root",
                    "nevts": 213600
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_top_5f_scaledown_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19422_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_80000_0000.root",
                    "nevts": 286400
                }
            ]
        },
        "scaleup": {
            "nevts_total": 998800,
            "files": [
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_scaleup_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19416_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_20000_0000.root",
                    "nevts": 499600
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_top_5f_scaleup_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19423_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_60000_0000.root",
                    "nevts": 499200
                }
            ]
        },
        "DS": {
            "nevts_total": 1140600,
            "files": [
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_DS_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19410_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_00000_0000.root",
                    "nevts": 260600
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_DS_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19410_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_50000_0000.root",
                    "nevts": 238600
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_antitop_5f_DS_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19412_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_20000_0000.root",
                    "nevts": 151800
                },
                {
                    "path": "https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/ST_tW_top_5f_DS_inclusiveDecays_13TeV-powheg-pythia8_TuneCUETP8M1/cmsopendata2015_single_top_tW_19417_PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1_10000_0000.root",
                    "nevts": 489600
                }
            ]
        }
    }

SyntaxError: 'await' outside function

If coffea.ipynb is converted to a script using nbconvert, the resulting script will abort with the error in the title, due to the statement

all_histograms = await utils.produce_all_histograms(...)
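Top-level await is only valid in a notebook cell (via IPython's autoawait); a converted script needs an explicit event loop. A minimal sketch of a fix, with a dummy coroutine standing in for utils.produce_all_histograms:

```python
import asyncio

async def produce_all_histograms():
    """Dummy stand-in for utils.produce_all_histograms(...)."""
    return {"4j1b": [0, 1, 2]}

async def main():
    # 'await' is legal here because we are inside an async function
    return await produce_all_histograms()

# replaces the bare top-level 'await' in the converted script
all_histograms = asyncio.run(main())
```

Note that asyncio.run cannot be called from within an already-running event loop, so this form only works in the script, not back inside the notebook.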

ServiceX/coffea HZZ Opendata analysis is failing with `Cannot convert OpenFile to pyarrow.lib.NativeFile`

Notebook https://github.com/iris-hep/analysis-grand-challenge/blob/main/analyses/atlas-open-data-hzz/CoffeaHZZAnalysisWithServiceX.ipynb with ServiceX/coffea HZZ Opendata analysis is failing with Cannot convert OpenFile to pyarrow.lib.NativeFile (see traceback below).

I will try to debug, but maybe @lgray knows: could this be connected to the version of pyarrow we are using? (cc: @gordonwatts @BenGalewsky)

I see that https://github.com/CoffeaTeam/coffea/blob/master/setup.py#L63 only expects pyarrow > 1.0.0 (so with coffea I got pyarrow 5.0.0), but installing func-adl_servicex then downgrades it to 3.0.0:

>>> pyarrow.__version__
'3.0.0'
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_425/3909352362.py in run_updates_stream(accumulator_stream, name)
     30         try:
---> 31             async for coffea_info in accumulator_stream:
     32                 pass

/opt/conda/lib/python3.8/site-packages/coffea/processor/servicex/executor.py in execute(self, analysis, datasource, title)
     82         async with finished_events.stream() as streamer:
---> 83             async for results in async_accumulate(streamer):
     84                 yield results

/opt/conda/lib/python3.8/site-packages/coffea/processor/accumulator.py in async_accumulate(result_stream)
    101     output = None
--> 102     async for results in result_stream:
    103         if output:

/opt/conda/lib/python3.8/site-packages/aiostream/stream/advanced.py in base_combine(source, switch, ordered, task_limit)
     58             try:
---> 59                 result = task.result()
     60 

/opt/conda/lib/python3.8/site-packages/aiostream/stream/create.py in call(func, *args, **kwargs)
     84     if asyncio.iscoroutinefunction(func):
---> 85         yield await func(*args, **kwargs)
     86     else:

/opt/conda/lib/python3.8/site-packages/coffea/processor/servicex/executor.py in inline_wait(r)
     75             "This could be inline, but python 3.6"
---> 76             x = await r
     77             return x

/opt/conda/lib/python3.8/site-packages/distributed/client.py in _result(self, raiseit)
    244                 typ, exc, tb = exc
--> 245                 raise exc.with_traceback(tb)
    246             else:

/opt/conda/lib/python3.8/site-packages/coffea/processor/servicex/executor.py in run_coffea_processor()
    158     elif data_type == "parquet":
--> 159         events = NanoEventsFactory.from_parquet(
    160             file=str(events_url),

/opt/conda/lib/python3.8/site-packages/coffea/nanoevents/factory.py in from_parquet()
    197             fs_file = fsspec.open(file, "rb")
--> 198             table_file = pyarrow.parquet.ParquetFile(fs_file, **parquet_options)
    199         elif isinstance(file, pyarrow.parquet.ParquetFile):

/opt/conda/lib/python3.8/site-packages/pyarrow/parquet.py in __init__()
    216         self.reader = ParquetReader()
--> 217         self.reader.open(source, use_memory_map=memory_map,
    218                          buffer_size=buffer_size,

/opt/conda/lib/python3.8/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.open()
    946 
--> 947         get_reader(source, use_memory_map, &rd_handle)
    948         with nogil:

/opt/conda/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.get_reader()
   1472 
-> 1473     nf = get_native_file(source, use_memory_map)
   1474     reader[0] = nf.get_random_access_file()

/opt/conda/lib/python3.8/site-packages/pyarrow/io.pxi in pyarrow.lib.get_native_file()
   1465 
-> 1466     return source
   1467 

TypeError: Cannot convert OpenFile to pyarrow.lib.NativeFile

scikit-learn missing from `requirements.txt`

It's now a mandatory dependency:

Traceback (most recent call last):
  File "/home/blue/Scratchpad/work/agc/analysis-grand-challenge/analyses/cms-open-data-ttbar/ttbar_analysis_pipeline.py", line 615, in <module>
    model_even = XGBClassifier()
                 ^^^^^^^^^^^^^^^
  File "/home/blue/Tools/miniconda3/envs/agc-py311/lib/python3.11/site-packages/xgboost/core.py", line 620, in inner_f
    return func(**kwargs)
           ^^^^^^^^^^^^^^
  File "/home/blue/Tools/miniconda3/envs/agc-py311/lib/python3.11/site-packages/xgboost/sklearn.py", line 1422, in __init__
    super().__init__(objective=objective, **kwargs)
  File "/home/blue/Tools/miniconda3/envs/agc-py311/lib/python3.11/site-packages/xgboost/sklearn.py", line 584, in __init__
    raise ImportError(
ImportError: sklearn needs to be installed in order to use this module

Skyhook + ServiceX integration

Deploy Skyhook + ServiceX at SSL & UNL (or elsewhere), ready to be used in technical demonstration at the next AGC workshop (#29). This is a requirement also for being able to benchmark Skyhook for the next milestone #5.

Target date: end of April 2022

Requires

  • [to be filled]

Add a way to validate output histograms against a trusted reference

In first approximation it would be enough to publish reference histogram content, in whatever easily readable format (even JSON, or serialized dictionaries of numpy arrays).

A simple CLI tool or similar that runs the comparison between two such JSON files could be provided on top.

Complications:

  • one systematic variation depends on random numbers, so bin values won't be stable (especially with multi-threaded/out-of-order execution): I think it's ok to clearly mark it as such and only expect very approximate equality there
  • bin values depend on how many files per process are used: I think we should provide values for at least 1 file per process (for quick validation) and for the full dataset (for full validation)
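A minimal stdlib sketch of such a comparison, assuming histograms are serialized as a JSON object mapping histogram name to a list of bin contents; the function name and tolerance values are invented for illustration:

```python
import math

def compare_histograms(reference, other, rtol=1e-6):
    """Return a list of human-readable mismatches between two histogram dicts.

    Both inputs map histogram name -> list of bin contents, e.g. loaded
    from JSON files. `rtol` can be loosened for RNG-dependent variations.
    """
    problems = []
    for name in sorted(set(reference) | set(other)):
        if name not in reference or name not in other:
            problems.append(f"{name}: present in only one file")
            continue
        ref, new = reference[name], other[name]
        if len(ref) != len(new):
            problems.append(f"{name}: {len(ref)} vs {len(new)} bins")
            continue
        for i, (r, n) in enumerate(zip(ref, new)):
            if not math.isclose(r, n, rel_tol=rtol, abs_tol=1e-12):
                problems.append(f"{name}[{i}]: {r} vs {n}")
    return problems
```

A CLI wrapper would just json.load the two files, call this, and exit non-zero if the returned list is non-empty.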

Handle common code between training and inference pipelines

Currently, the training and inference pipelines in the CMS open data ttbar notebook have overlapping code which is simply duplicated at this point. This will be difficult to maintain, so it would be good to import these common features from one file.

Commonalities:

  • Calculation of top mass using trijet reconstruction method
  • Cuts
  • Options for processing

Structure for processor for correctionlib adoption / refactor

  • event object contains event data
  • call to correctionlib to get event_nominal version with all objects calibrated to their nominal state
  • loop over all systematic variations that change object kinematics (+ nominal)
    • figure out what objects changed, vary those objects that did change via correctionlib -> event_for_this_systematic
    • run event selection + observable calculation (*1)
    • fill histograms
      • if nominal: also fill all weight variations (in another loop over weight-based systematics), weights obtained via correctionlib

(*1): this does not automatically avoid re-calculation of quantities that do not change (e.g. those derived from jets for muon systematic variations)

cc @Andrew42

see also Lindsey's talk at the last AGC demo day https://indico.cern.ch/event/1232470/#4-awkward-dask-integration-int
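The loop structure above can be sketched as a skeleton; every helper here is a hypothetical stand-in, not the actual AGC processor or the correctionlib API:

```python
def calibrate(event, variation):
    """Stand-in for a correctionlib-based calibration of object kinematics."""
    return event

def select_and_compute(event):
    """Stand-in for event selection + observable calculation (step *1)."""
    return event

def fill(selected, weight):
    """Stand-in for histogram filling; returns a weighted count instead."""
    return weight * len(selected)

def process(event, kinematic_variations, weight_variations):
    """Skeleton of the proposed processor structure."""
    histograms = {}
    for variation in ["nominal"] + list(kinematic_variations):
        varied = calibrate(event, variation)   # vary only objects that change
        selected = select_and_compute(varied)  # selection + observables
        histograms[variation] = fill(selected, weight=1.0)
        if variation == "nominal":
            # weight-based systematics reuse the nominal selection
            for wvar, w in weight_variations.items():
                histograms[wvar] = fill(selected, weight=w)
    return histograms

histograms = process([1, 2, 3], ["jes_up"], {"pu_up": 2.0})
```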

Local use of `FuturesExecutor`

Testing #89 revealed that FuturesExecutor locally fails with

AttributeError: Can't get attribute 'AGCSchema' on <module '__main__' (built-in)>

On coffea-casa at UNL and at SSL this works fine.

This can be avoided by converting the notebook

jupyter nbconvert --to script coffea.ipynb
mv coffea.py agc.py

where renaming is important as the name coffea collides with the module name. This then results in

AttributeError: module 'vector' has no attribute 'behavior'

which can be solved by removing import vector; vector.register_awkward(), which may no longer be needed now that vector is no longer used to construct 4-vectors manually for ServiceX.

We should track down why exactly these things fail to provide a more general solution.

Add Z+jets background to ttbar example analysis

We were missing Z+jets so far as we did not find it in the available Open Data, but the following files seem to be the right ones to use. Thanks to @tpmccauley for pointing this out!

/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/RunIIFall15MiniAODv2-PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext4-v1/MINIAODSIM
https://opendata.cern.ch/record/16458

/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/RunIIFall15MiniAODv2-PU25nsData2015v1_76X_mcRun2_asymptotic_v12_ext1-v2/MINIAODSIM
https://opendata.cern.ch/record/16457

/DYJetsToLL_M-50_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/RunIIFall15MiniAODv2-PU25nsData2015v1_76X_mcRun2_asymptotic_v12-v1/MINIAODSIM
https://opendata.cern.ch/record/16456

With dask/distributed 2021.10.0 names of workers are not anymore consistent

With dask/distributed 2021.10.0, we noticed that worker names are no longer consistent, so workers are not retired correctly (see the coffea-casa patch for distributed):

distributed.deploy.adaptive - INFO - Retiring workers ['CoffeaCasaCluster-0']
distributed.scheduler - INFO - Retire worker names ('CoffeaCasaCluster-0',)
distributed.scheduler - INFO - Retire worker names ('htcondor--16824264.0--',)
distributed.scheduler - INFO - Retire workers {<WorkerState 'tls://red-c7217.unl.edu:43819', name: htcondor--16824264.0--, status: running, memory: 10, processing: 1>}
distributed.scheduler - INFO - Moving 9 keys to other workers
distributed.scheduler - INFO - Remove worker <WorkerState 'tls://red-c7217.unl.edu:43819', name: htcondor--16824264.0--, status: running, memory: 10, processing: 1>
distributed.deploy.adaptive - INFO - Retiring workers ['CoffeaCasaCluster-3']
distributed.scheduler - INFO - Retire worker names ('CoffeaCasaCluster-3',)
distributed.scheduler - INFO - Retire worker names ('htcondor--16824268.0--',)

[Discussion] Define a standard data subset for "micro" challenge

The idea is to define how to reduce the total events processed so people can compare results in microbenchmark settings.

1) pick first N files per process

This would reduce the number of files hit by the workload. Given that all files are similarly sized NanoAOD, I worry that different processes may have very different total file counts. Another problem is that reducing the number of files may change how some people prefer to parallelize the workload.

2) pick first N events per file

This hits all the files like the full version does, effectively pretending each file is smaller than it really is. Since it touches every file, it is fairer with respect to specific filesystem/network considerations, and because NanoAODs are similarly sized, this should be reasonable. One problem with this approach: opening many files is itself a bottleneck, and this approach does not reduce the time spent there.

3,4) pick first some% files / events from each process/file

Similar to 1 and 2, except done as a percentage. This should be fairer and produce higher-fidelity physics results, but is annoying to implement...
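Options 1 and 2 are straightforward to implement. A sketch using a simplified stand-in for the AGC fileset layout (the function and field names here are assumptions, not the actual AGC configuration format):

```python
def micro_subset(fileset, n_files=None, n_events=None):
    """Reduce a fileset per option 1 (first N files per process) and/or
    option 2 (first N events per file).

    `fileset` maps process name -> list of {"path": ..., "nevts": ...}
    dicts; a simplified stand-in for the AGC input specification.
    """
    reduced = {}
    for process, files in fileset.items():
        kept = files[:n_files] if n_files is not None else files
        if n_events is not None:
            # cap the events read per file without dropping any file
            kept = [{**f, "nevts": min(f["nevts"], n_events)} for f in kept]
        reduced[process] = kept
    return reduced

fileset = {"ttbar": [{"path": "a.root", "nevts": 1000},
                     {"path": "b.root", "nevts": 500}]}
```

Options 3 and 4 would replace the fixed counts with per-process or per-file fractions.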

Merge strategy for recent changes

the first two can be flipped

  • jupytext #92
  • ServiceX update #96 (once relevant deployments are confirmed to be available everywhere)
  • ServiceX update ported to ttbar notebook (= ACAT + improved ServiceX pipeline) #100
  • Improved ServiceX queries (all datasets in parallel) #107
  • update requirements.txt to reflect ServiceX client version required #112 -> then tag v0.2 -> v0.2.0
  • NanoAOD switch #102 (cc @ekauffma) -> then tag v1.0 (= v0.2 but with NanoAOD) -> v1.0.0
    • no analysis logic changes (e.g. new cuts, x-sec values), these will be introduced after the tag on the way towards v2, we should also track them in a text document to more easily update other AGC implementations

cc @oshadura

Highest priority UX improvements for next demonstration

#64 lists possible UX + performance improvements. This issue gathers the highest priority items towards a next demonstration of the pipeline, possibly at PyHEP 2022 if this gets accepted (Sep 12–16).

Serve Open Data without EOS bottleneck

File access via EOS has previously been a bottleneck when serving Open Data via ServiceX. For benchmarking purposes (#5), ensure that this bottleneck is avoided. This will likely have to be done by duplicating the data and serving it from elsewhere (e.g. through SSL).

Target date: June 1, 2022 via #5

Requires

  • [to be filled]

pyhf + funcX demonstrator

Deploy pyhf + funcX at SSL / UNL and showcase demonstrator at the next AGC workshop (#29). Workshop attendees should be able to access this service to try out the functionality.

Target date: end of April 2022

Requirements

  • [to be added]

`dask-awkward` migration & `coffea` 2023

see also #113 and #164

  • __setitem__ for floats for things like events["pt_scale_up"] = 1.03 (instead of using dak.ones_like(events.MET.pt))
  • np.random on dask-awkward arrays
    • could use dask.array (example: dask.array.random.uniform(size=(1000,), chunks=(250,))) together with dak.to_dask_array and dak.from_dask_array
    • could use np.random.normal within dak.map_partitions
  • histogram filling with float weights (AttributeError: 'float' object has no attribute 'ndim') scikit-hep/hist#493 -> done via dask-contrib/dask-histogram#69
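The dask.array-based workaround can be sketched as follows (array sizes are illustrative): random values are generated lazily in chunks, matching the chunked processing of the event data.

```python
import dask.array as da

# illustrative sketch: one random scale factor per event, generated lazily
# with dask.array in chunks; `n_events` and the chunk size are placeholders
n_events = 1000
rand = da.random.uniform(low=0.95, high=1.05, size=(n_events,), chunks=(250,))

# dask_awkward's dak.from_dask_array(rand) would turn this into a
# dask-awkward array usable alongside event collections
values = rand.compute()  # concrete NumPy array of 1000 floats
print(values.shape)
```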

GAC.3: Coordinate with AS, DOMA, SSL and operations programs to benchmark performance of prototype system components to be used for Analysis Grand Challenge

This milestone already exists as AS & SSL milestones, but should be an AGC milestone as well. This issue will be used to track progress and collect issues related to this milestone.

Target date: June 1, 2022

Requires

  • definition of analysis used for demonstrations
  • #30 to enable benchmarking with Skyhook
  • #31
  • #32

ServiceX transformation fails for a few files with specific query

From the CMS ttbar notebook, slightly adapted:

import time  # PIPELINE and SERVICEX_IGNORE_CACHE are defined earlier in the notebook

if PIPELINE == "servicex_databinder":
    from servicex_databinder import DataBinder
    t0 = time.time()

    # query for events with at least 4 jets with 25 GeV, at least one b-tag, and exactly one electron or muon with pT > 25 GeV
    # returning columns required for subsequent processing
    query_string = """Where(
        lambda event: event.electron_pt.Where(lambda pT: pT > 25).Count() + event.muon_pt.Where(lambda pT: pT > 25).Count() == 1
        ).Where(lambda event: event.jet_pt.Where(lambda pT: pT > 25).Count() >= 4
        ).Where(lambda event: event.jet_btag.Where(lambda btag: btag > 0.5).Count() >= 1
        ).Select(
             lambda e: {"electron_pt": e.electron_pt, "muon_pt": e.muon_pt,
                        "jet_pt": e.jet_pt, "jet_eta": e.jet_eta, "jet_phi": e.jet_phi, "jet_mass": e.jet_mass, "jet_btag": e.jet_btag}
    )"""
    
    # simplified variant of the query above (note: this overwrites query_string)
    query_string = """Where(
        lambda event: event.electron_pt.Where(lambda pT: pT > 25).Count() + event.muon_pt.Where(lambda pT: pT > 25).Count() == 1
        ).Select(
             lambda e: {"electron_pt": e.electron_pt, "muon_pt": e.muon_pt,
                        "jet_pt": e.jet_pt, "jet_eta": e.jet_eta, "jet_phi": e.jet_phi, "jet_mass": e.jet_mass, "jet_btag": e.jet_btag}
    )"""

    sample_names = ["ttbar__nominal"]  # 1.5 TB, 7066 files
    sample_list = []

    for sample_name in sample_names:
        sample_list.append({"Name": sample_name, "RucioDID": f"user.ivukotic:user.ivukotic.{sample_name}", "Tree": "events", "FuncADL": query_string})


    databinder_config = {
        "General": {
            "ServiceXBackendName": "uproot",
            "OutputDirectory": "outputs_databinder",
            "OutputFormat": "root",
            "IgnoreServiceXCache": SERVICEX_IGNORE_CACHE,
        },
        "Sample": sample_list,
    }

    sx_db = DataBinder(databinder_config)
    out = sx_db.deliver()
    print(f"execution took {time.time() - t0:.2f} seconds")

This fails for a single file. The failure seems to be query-related.

Add an option to run the analysis on local data

For the purpose of testing different setups and configurations, it is often useful to be able to run the analysis workload with data on local storage (e.g. to take network I/O, XRootD caches, etc. out of the picture).

It would be nice if the AGC implementations had an option to pre-download a small set of files to the specified local path and do a run on that data instead.
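A sketch of such an option (all names hypothetical), assuming a fileset of the form `{sample: {"files": [url, ...]}}`:

```python
import pathlib
import urllib.request

# hypothetical sketch: download the first few files per sample to local storage
# and return a fileset pointing at the local copies; the fileset layout
# {sample: {"files": [url, ...]}} is an assumption
LOCAL_DATA_PATH = pathlib.Path("local_data")

def localize(fileset, n_files=2):
    local_fileset = {}
    for sample, info in fileset.items():
        local_paths = []
        for url in info["files"][:n_files]:
            target = LOCAL_DATA_PATH / sample / url.rsplit("/", 1)[-1]
            target.parent.mkdir(parents=True, exist_ok=True)
            if not target.exists():
                urllib.request.urlretrieve(url, str(target))
            local_paths.append(str(target))
        local_fileset[sample] = {"files": local_paths}
    return local_fileset
```

Subsequent runs would then use the returned fileset and skip the download when the local copies already exist.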

Benchmarking epic

This gathers points related to performance and functionality for benchmarking.

  • UChicago dev instance: FileNotFoundError when running over local files (reproducer)
  • UChicago prod instance: KilledWorker exception at scale (reproducer: default notebook with N_FILES_MAX_PER_SAMPLE>=1000)
  • scaling beyond ~50 workers at UNL
  • dask.distributed scaling behavior (reproducer below): differs per site; occasionally there are long tails where the last few tasks do not get picked up for a long time, and sometimes workers only process a few tasks while tasks are still remaining
  • basket size can matter a lot for uproot; 10-100 kB per basket is good (could try re-merging with hadd -O). Small input files have num_baskets == 1, while files merged 10->1 have 10.
  • split out pre-processing since in principle it's optional (https://gist.github.com/alexander-held/4db5624ab302c423ce40dc82d65965a2)
  • bytesread bug CoffeaTeam/coffea#717

things to add to notebook (-> #85):

  • pre-processing skip
  • adjustable number of branches accessed
  • turning off processor logic (remove I/O and/or awkward)
  • optionally removing systematics

points to follow up on (thanks @nsmith-!)

if there is time:

  • investigate whether unused branches in the files matter for performance as observed by @andrzejnovak in another context

related tooling:

dask.distributed setup for simple scaling tests:

import time

from dask.distributed import Client, progress

client = Client("tls://localhost:8786")  # address of an existing scheduler

def do_something(x):
    time.sleep(5)  # stand-in for a fixed-duration task
    return x

futures = client.map(do_something, range(1000))  # submit 1,000 tasks
progress(futures)  # live progress bar to watch scaling behavior

Event rate measurement: what to use as reference (events in input vs events passing selection)? For reference (10 input files per process):

ttbar__nominal            : 442122 events in input ->  88013 passing selection (frac: 19.91%)
ttbar__scaledown          : 435118 events in input ->  81571 passing selection (frac: 18.75%)
ttbar__scaleup            : 416314 events in input ->  86154 passing selection (frac: 20.69%)
ttbar__ME_var             : 455600 events in input ->  94637 passing selection (frac: 20.77%)
ttbar__PS_var             : 406193 events in input ->  76202 passing selection (frac: 18.76%)
single_top_s_chan__nominal: 221600 events in input ->  14847 passing selection (frac: 6.70%)
single_top_t_chan__nominal: 413691 events in input ->  22346 passing selection (frac: 5.40%)
single_top_tW__nominal    : 382354 events in input ->  39945 passing selection (frac: 10.45%)
wjets__nominal            : 412269 events in input ->    438 passing selection (frac: 0.11%)

where (by file size) W+jets is ~31% of all events available, ttbar nominal is ~40%, the four ttbar variations are ~17% together.
By number of events (948 M events total), the breakdown is 46% W+jets, 30% ttbar nominal, 12% ttbar variations, 11% single top t-channel.

some benchmarking calculations for the usual notebook:

# assumes `time_taken`, `metrics`, and `NUM_WORKERS` from the usual coffea executor run
print(f"\nexecution took {time_taken:.2f} seconds")
print(f"event rate / worker: {metrics['entries'] / NUM_WORKERS / time_taken / 1000:.3f} kHz (including overhead, so pessimistic estimate)")

print(f"data read: {metrics['bytesread']/1000**3:.3f} GB")
print(f"events processed: {metrics['entries']/1000**2:.3f} M")
print(f"processtime: {metrics['processtime']:.3f} s (?!)")
print(f"processtime per worker: {metrics['processtime']/NUM_WORKERS:.3f} s (should be similar to real runtime, will be lower if opening files etc. is a significant contribution)")
print(f"processtime per chunk: {metrics['processtime']/metrics['chunks']:.3f} s")
print(metrics)

Save visualizations as PDF and PNG post axis and label manipulation

Let me first say that @alexander-held and @oshadura have done a super impressive job with all the notebook demos and this Issue should be considered a super small nitpick that is an "if there is time" thing to consider. For the current state of the HZZ_analysis_pipeline notebook everything inside the notebook looks super good. 🎉

What would be nice though is if all visualizations could be:

  • saved under the figures directory
  • as both PDF and PNG file formats (with facecolor="white")
  • have any axis label manipulations performed after initial figure generation reflected in the saved files as well.
  • At the moment, the mplhep and hist plot in cell 9 is not saved out as a figure under the figures directory.
  • The cabinetry pulls and correlation matrix (cell 15 and 16) plots look great, but are saved (by default) as PDF only, so it would be good if they could be saved as PNG as well.
  • The final 4 lepton invariant mass plot (cell 17) looks great as well in the rendered notebook

[image: plot as displayed in the rendered notebook]

however, the PDF version that is saved out to the figures directory

[image: saved PDF version, without the label edits]

doesn't have the label modifications of

# modify x-axis label
figure_dict[0]["figure"].axes[1].set_xlabel("m4l [GeV]")

applied to it. It would be great if the saved versions and the displayed versions in the notebook could be identical.
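A sketch of the requested behavior (file names hypothetical): apply the label edits first, then save the same figure object in both formats, so the saved and displayed versions match.

```python
import pathlib

import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted saving
import matplotlib.pyplot as plt

pathlib.Path("figures").mkdir(exist_ok=True)

# toy figure standing in for the notebook plots
fig, ax = plt.subplots()
ax.plot([100, 125, 150], [5, 20, 3], drawstyle="steps-mid")
ax.set_xlabel("m4l [GeV]")  # label edits before saving, so they end up in the files

for ext in ("pdf", "png"):
    fig.savefig(f"figures/m4l.{ext}", facecolor="white", bbox_inches="tight")
```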

Like I said, all super small nitpicks that aren't make or break for the demos. I can't promise that I'll have time to address these tonight, but if I do have time I'll open a PR for these, but if someone else is feeling bored please feel free to beat me to it. 👍

Understand scale variation changes via correctionlib implementation

#119 changed the pt_scale_up variations; in particular, the value of the following changed:

sum(all_histograms[:, "4j2b", "ttbar", "pt_scale_up"].values())

Prior to this PR, the propagation of the pT changes to the observable itself was not taken into account, so the bin-by-bin distribution in this region is expected to differ before and after the PR. It is currently unclear, though, why the total event count in this region would change as well.

The 4j1b region does not seem to be affected by this.

cc @Andrew42

Accessing 2015 CMS Open Data

The 2015 CMS Open Data will be used as input for an AGC demonstrator analysis. This recent release provides miniAODs. To interface the remainder of the envisioned analysis pipeline, we need to be able to serve columns from this Open Data. There are two options:

  • convert miniAODs to nanoAODs ahead of time, and serve nanoAODs via ServiceX,
  • serve columns from miniAODs directly via ServiceX.

The conversion miniAOD -> nanoAOD could happen either via ServiceX ahead of time or in another fashion. To avoid bottlenecks, data should likely be duplicated and served from somewhere other than EOS (#32).

Target date: end of April 2022

Requires

  • [to be filled]

Execute the next AGC workshop

The next AGC workshop needs to take place before the 42-month IRIS-HEP review on May 16/17. The format will be similar to the previous workshop (#2).

Target date: end of April 2022

Requires
