
agc-reana

REANA example - AGC CMS ttbar analysis with Coffea

About

This demo shows how to submit the Analysis Grand Challenge (AGC) to REANA, using Snakemake as the workflow engine.

Analysis Grand Challenge

For a full explanation, please have a look at the AGC documentation.

The Analysis Grand Challenge (AGC) is about performing the last steps in an analysis pipeline at scale to test workflows envisioned for the HL-LHC. This includes:

  • columnar data extraction from large datasets,
  • processing of that data (event filtering, construction of observables, evaluation of systematic uncertainties) into histograms,
  • statistical model construction and statistical inference,
  • relevant visualizations for these steps.

The physics analysis task is a $t\bar{t}$ cross-section measurement with 2015 CMS Open Data (see datasets/cms-open-data-2015). The current reference implementation can be found in analyses/cms-open-data-ttbar.

Analysis Structure

1. Input data

We use 2015 CMS Open Data in this demonstration to showcase an analysis pipeline. The input .root files are listed in nanoAODschema.json. The current coffea AGC version defines the coffea Processor, which contains many of the physics analysis details:

  • event filtering and the calculation of observables,
  • event weighting,
  • calculating systematic uncertainties at the event and object level,
  • filling all the information into histograms that get aggregated and ultimately returned to us by coffea.
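The four steps above can be illustrated with a small pure-Python sketch. It does not use the coffea API: the `Histogram` class, the `process` function, the jet pT threshold, the four-jet requirement and the `pt_scale_up` variation are hypothetical stand-ins chosen only to show how filtering, weighting, a systematic variation and histogram aggregation fit together.

```python
class Histogram:
    """Fixed-width 1D histogram that accumulates weighted entries."""

    def __init__(self, low, high, nbins):
        self.low, self.high, self.nbins = low, high, nbins
        self.counts = [0.0] * nbins

    def fill(self, value, weight=1.0):
        # Values outside [low, high) are dropped; this sketch has no overflow bins.
        if not (self.low <= value < self.high):
            return
        width = (self.high - self.low) / self.nbins
        index = min(int((value - self.low) / width), self.nbins - 1)
        self.counts[index] += weight

    def __add__(self, other):
        # Aggregation: histograms from different chunks are summed bin by bin.
        merged = Histogram(self.low, self.high, self.nbins)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return merged


def process(events, weight, scales=(("nominal", 1.0), ("pt_scale_up", 1.03))):
    """Filter events, build an observable and fill one histogram per variation.

    `events` is a list of events, each given as a list of jet pT values (assumed layout).
    """
    hists = {name: Histogram(0.0, 500.0, 5) for name, _ in scales}
    for jet_pts in events:
        for name, scale in scales:
            # Object-level systematic: rescale every jet pT, then redo the selection.
            jets = [pt * scale for pt in jet_pts if pt * scale > 25.0]
            if len(jets) < 4:  # event filtering
                continue
            ht = sum(jets)  # observable: scalar sum of jet pT
            hists[name].fill(ht, weight)  # event weighting
    return hists


# Two chunks processed independently, then aggregated.
chunk_a = [[40.0, 35.0, 30.0, 26.0], [100.0, 20.0, 15.0, 10.0]]
chunk_b = [[60.0, 50.0, 45.0, 25.0, 30.0]]
total = process(chunk_a, weight=0.5)["nominal"] + process(chunk_b, weight=0.5)["nominal"]
up = process(chunk_b, weight=0.5)["pt_scale_up"]
print(total.counts, up.counts)
```

In the real analysis, coffea performs these operations on columnar arrays rather than with Python loops; the loops here are only for readability.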

The analysis takes the following inputs:

  • nanoAODschema.json: the list of input .root files.
  • Snakefile: the Snakemake workflow definition.
  • ttbar_analysis_reana.ipynb: the main notebook, where files are processed and analysed.
  • file_merging.ipynb: notebook that merges the processed .root files into one file with unique keys.
  • final_merging.ipynb: notebook that merges the histograms of all samples into a single file.

2. Analysis Code

REANA supports the Snakemake workflow engine. To run the AGC ttbar workflow as fast as possible, we implement a two-level (multicascading) parallelization approach with Snakemake. In the first step, Snakemake distributes the jobs across separate nodes, and each job runs ttbar_analysis_reana.ipynb on a single .root file. Once these jobs have finished, the individual output files are merged into one file per sample, and the per-sample files are then merged into a single histogram file.

Here is a high-level view of the AGC workflow:

                                +-----------------------------------------+
                                | Take the CMS open data from nanoaod.json|
                                +-----------------------------------------+
                                                    |
                                                    |
                                                    |
                                                    v  
                                  +-----------------------------------+
                                  |rule: Process each file in parallel|
                                  +-----------------------------------+
                                                    |
                                                    |
                                                    |
                                                    v
                                +-----------------------------------------+                
                                |rule: Merge created files for each sample|
                                +-----------------------------------------+  
                                                    |
                                                    |
                                                    |
                                                    v
                                +----------------------------------------------+ 
                                |rule: Merge sample files into single histogram| 
                                +----------------------------------------------+
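The fan-out and two-stage merge in the diagram can be mimicked in plain Python. This is only a sketch of the pattern, not the project's Snakefile: the `INPUTS` mapping, the file names and the `process_file` placeholder are hypothetical, and a thread pool stands in for the separate REANA jobs.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the input JSON: sample name -> list of .root files.
INPUTS = {
    "ttbar": ["ttbar_0.root", "ttbar_1.root"],
    "wjets": ["wjets_0.root"],
}


def process_file(sample, path):
    """Level 1: one independent job per input file (placeholder for the notebook run)."""
    # Each file contributes 1.0 to a fake "histogram" keyed by the sample name.
    return sample, {f"{sample}_nominal": 1.0}


def merge(histograms):
    """Sum dictionaries of histogram-like values key by key."""
    merged = defaultdict(float)
    for hist in histograms:
        for key, value in hist.items():
            merged[key] += value
    return dict(merged)


def run(inputs, max_workers=4):
    jobs = [(sample, path) for sample, paths in inputs.items() for path in paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Fan out: every file is processed in parallel.
        results = list(pool.map(lambda job: process_file(*job), jobs))
    per_sample = defaultdict(list)
    for sample, hist in results:
        per_sample[sample].append(hist)
    # Fan in, step 1: one merged result per sample.
    merged_samples = {sample: merge(hists) for sample, hists in per_sample.items()}
    # Fan in, step 2: a single final result; keys stay unique because they carry the sample name.
    return merge(merged_samples.values())


final = run(INPUTS)
print(final)
```

Merging per sample first keeps each merge job small, and the final step only has to combine one file per sample.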

3. Compute environment

To be able to rerun the AGC after some time, we need to "encapsulate the current compute environment", for example to freeze the ROOT version our analysis is using. We shall achieve this by preparing a Docker container image for our analysis steps.

We use a modified version of the analysis-systems-base Docker image with additional packages. The most important one is papermill, which lets us run a Jupyter notebook from the command line and pass parameters to it.
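As an illustration of how a notebook can be parameterized from the command line, the sketch below builds a papermill invocation using its `-p name value` option. The output notebook name and the parameter names `sample_name` and `filename` are assumptions made for this example; the parameters actually expected by ttbar_analysis_reana.ipynb may differ.

```python
import shlex


def papermill_command(notebook, output, parameters):
    """Build the argument list for a papermill command-line run."""
    cmd = ["papermill", notebook, output]
    for name, value in parameters.items():
        # Each -p option passes one parameter to the notebook.
        cmd += ["-p", name, str(value)]
    return cmd


cmd = papermill_command(
    "ttbar_analysis_reana.ipynb",
    "ttbar_analysis_output.ipynb",  # hypothetical output notebook name
    {"sample_name": "ttbar__nominal", "filename": "input.root"},  # hypothetical parameters
)
print(shlex.join(cmd))
```

papermill injects the values into the notebook after the cell tagged `parameters`, so the notebook needs such a cell holding the default values.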

In our case, the Dockerfile creates a conda virtual environment with all necessary packages for running the AGC analysis.

$ less environment/Dockerfile

Let's change into the environment directory:

$ cd environment/

We can build our AGC environment image and give it the name docker.io/reanahub/reana-demo-agc-cms-ttbar-coffea:

$ docker build -t docker.io/reanahub/reana-demo-agc-cms-ttbar-coffea .

We can push the image to the DockerHub image registry:

$ docker push docker.io/reanahub/reana-demo-agc-cms-ttbar-coffea

Some of the data are located on eos/public, so to process a large number of files the user must be authenticated with Kerberos. We enable this in reana.yaml by setting:

workflow:
  type: snakemake
  resources:
    kerberos: true
  file: Snakefile

If you are processing a small number of files (fewer than 10), you can set this option to false. Alternatively, you can enable Kerberos authentication in individual Snakemake rules. For more details, please refer to the REANA documentation: https://docs.reana.io/advanced-usage/access-control/kerberos/

4. Analysis Workflow

The reana.yaml file describes the above analysis structure with its inputs, code, runtime environment, computational workflow steps and expected outputs:

version: 0.8.0
inputs:
  files:
    - ttbar_analysis_reana.ipynb 
    - nanoaod_inputs.json
    - fix-env.sh
    - corrections.json
    - Snakefile
    - file_merging.ipynb
    - final_merging.ipynb
    - prepare_workspace.py

  directories:
    - histograms
    - utils
workflow:
  type: snakemake
  resources:
    kerberos: true
  file: Snakefile
outputs:
  files:
    - histograms_merged.root

We can now install the REANA command-line client, run the analysis and list the resulting workspace files:

$ # create new virtual environment
$ virtualenv ~/.virtualenvs/reana
$ source ~/.virtualenvs/reana/bin/activate
$ # install REANA client
$ pip install reana-client
$ # connect to some REANA cloud instance
$ export REANA_SERVER_URL=https://reana.cern.ch/
$ export REANA_ACCESS_TOKEN=XXXXXXX
$ # run AGC workflow
$ reana-client run -w reana-agc-cms-ttbar-coffea
$ # ... should be finished in around 6 minutes if you select all files in the Snakefile
$ reana-client status
$ # list workspace files
$ reana-client ls

Please see the REANA-Client documentation for a more detailed explanation of typical reana-client usage scenarios.
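The same sequence can be scripted. The sketch below wraps the commands shown above in Python and adds a `reana-client download` call to fetch the output file. The helper names `reana_commands` and `run_all` are hypothetical, and the script defaults to a dry run that only prints the commands.

```python
import subprocess

WORKFLOW = "reana-agc-cms-ttbar-coffea"


def reana_commands(workflow, output_file):
    """Return the reana-client invocations for one run, in order."""
    return [
        ["reana-client", "run", "-w", workflow],
        ["reana-client", "status", "-w", workflow],
        ["reana-client", "ls", "-w", workflow],
        ["reana-client", "download", output_file, "-w", workflow],
    ]


def run_all(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Requires REANA_SERVER_URL and REANA_ACCESS_TOKEN in the environment.
            subprocess.run(cmd, check=True)


commands = reana_commands(WORKFLOW, "histograms_merged.root")
run_all(commands)
```

The download step only succeeds once `reana-client status` reports that the workflow has finished, so a real script would wait for that status before downloading.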

5. Output results

The output is written to histograms_merged.root, which can be further evaluated with a variety of AGC tools.

Contributors

andriipovsten
