
decaf

Dark matter Experience with the Coffea Analysis Framework

The following instructions describe how to generate the histograms directly from NanoAOD.

Initial Setup

First, log into an LPC node:

ssh -L 9094:localhost:9094 USERNAME@cmslpc-sl7.fnal.gov

This command also forwards port 9094 (or whatever number you choose) so that you can use applications like Jupyter once on the cluster. Then move into your area on uscms_data:

cd /uscms_data/d?/USERNAME/

where '?' can be 1, 2, or 3. Fork this repository on GitHub and git clone your fork:

git clone https://github.com/USERNAME/decaf.git

Then, set up the proper dependencies:

source setup_lcg.sh

This will install the necessary packages at the user level; it is a one-time setup. The next time you log in, just do:

source env_lcg.sh

Running this script will also install your grid certificate.

Listing the input NanoAOD

The list of inputs can be obtained in JSON format per year, packing the files belonging to the same dataset into batches of the desired size. The batch size affects the performance of the following steps: the smaller the batch, the larger the number of batches per dataset and the more jobs you can run in parallel with Condor. This needs to be balanced against the number of cores that can be made available through Condor. A sweet spot has been found by setting the batch size to 75 files. Similarly, the batch size affects performance when running with Spark; to run with Spark, use a size of 300 files per batch.

To create the JSON files:

cd analysis/metadata
python pack.py -y 2018 -p 300 #(or 75)

The --year option generates the list for a specific year, 2018 in the example. The --pack option sets the size of the batch. The --keep option saves .txt files that list the number of ROOT files per dataset; these are stored in the metadata/YEAR folder. The JSON files are stored in the metadata folder and include the cross section of the dataset each batch belongs to. A cross section of -1 is assigned to data batches.
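
To quickly check what pack.py produced, the JSON files can be inspected with plain Python. This is only a sketch: the output file name below ("metadata/2018.json") is an assumption based on the -y 2018 example above, and no particular internal schema is assumed beyond the file being valid JSON.

import json

# Load one of the packed JSON files produced by pack.py (assumed file name)
with open("metadata/2018.json") as f:
    batches = json.load(f)

print(len(batches), "entries")
print(list(batches)[:5])  # first few batch/dataset keys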

Generate the histograms

The generation of histograms can be launched from the analysis folder. Currently, local generation through Python futures is supported, together with Condor and Spark implementations.

Compile weights

Before running, weights from secondary inputs such as corrections, IDs, etc. need to be compiled and stored in .coffea files. To accomplish this task, simply run the following command:

sh generate_secondary_inputs.sh

The script will run the following Python modules:

python secondary_inputs/metfilters.py
python secondary_inputs/corrections.py
python secondary_inputs/triggers.py
python secondary_inputs/ids.py

These modules can also be run separately by passing the bash script an argument corresponding to the name of the module you want to run. For example, to run secondary_inputs/corrections.py, just do:

sh generate_secondary_inputs.sh corrections

Separate .coffea files will be generated, corresponding to the four python modules listed above, and stored in the secondary_inputs folder.

Generate Coffea Processor

The next step is generating the Coffea processor file corresponding to the analysis you want to run. For example, to generate the Dark Higgs processor, the following command can be used:

sh generate_processor.sh darkhiggs 2018

The script will run the following command, using the first argument to select the corresponding python module in the processors folder and the second argument to set the --year option:

python processors/darkhiggs.py --year 2018

The module will generate a .coffea file, stored in the processors folder, that contains the processor instance to be used in the following steps. The --year option will allow for the generation of the processor corresponding to a specific year.
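
If you want to double-check what was written, the generated file can be loaded back with coffea's utilities. This is a minimal sketch; the file name below assumes the <processor><year> naming used in the example above.

from coffea import util

# Load the processor instance saved by generate_processor.sh (assumed file name)
processor_instance = util.load("processors/darkhiggs2018.coffea")
print(type(processor_instance))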

Running with Python futures

Python futures allows running on multiple processors of a single node. To run the local histogram generation:

python run.py --year 2018 --dataset MET____0_ --processor darkhiggs2018

In this example, histograms for batch 0 of the 2018 MET NanoAOD are generated. The --year option is compulsory, while --dataset is optional. The --lumi option is also optional; if not provided, it defaults to the hard-coded luminosity values in run.py. Launching the script without the --dataset option makes it run over all the batches of all the datasets. If, for example, --dataset TTJets is used, the module runs over all batches of all the datasets that match the TTJets string. The --processor option is compulsory and sets the processor instance to be used; it corresponds to the name of a specific processor .coffea file, generated as described previously.

Running with Condor

Condor allows jobs to be parallelized across multiple cores:

python submit_condor.py --year 2018 --processor darkhiggs2018 -t

This submits to Condor the jobs needed to generate the full set of 2018 histograms. The -t flag tars the working environment and the dependencies needed to run on the Condor nodes. The module has a --dataset option that works as described above for run.py: it allows you to run on a single batch, a single dataset, or the batches/datasets that match the input string.

From Coffea Histograms to Fits

Generating the model

Taking the dark Higgs analysis as an example, run the following command to generate the background model:

python models/darkhiggs.py -y 2018 -f -m 40to120

The models/darkhiggs.py module extracts coffea histograms from the hists/darkhiggs201?.scaled files and uses them to generate the different templates that will later be rendered into datacards/workspaces. It also defines transfer factors for the data-driven background models. It produces .model files that are saved into the data folder. More specifically, the models/darkhiggs.py module produces one .model file for each pass/fail and recoil bin.

Running the following command will generate all the model files, and the outputs will be stored in the data directory:

./make_model.sh

Rendering the model

For the rendering step to complete successfully, and for the resulting datacards and workspaces to be usable in later steps, both the Python version and the ROOT version must correspond to the ones that will be set when running combine. It is therefore suggested to clone the decaf repository inside the src folder of the CMSSW release that will be used to perform the combine fit and to run the rendering from there, without running env_lcg.sh. You should, instead, run the regular cmsenv command. At the time of writing, the recommended version is CMSSW_10_2_13 (https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/#cc7-release-cmssw_10_2_x-recommended-version). This version uses Python 2.7 and ROOT 6.12. In addition, you should install the following packages (after running cmsenv):

pip install --user flake8
pip install --user cloudpickle
pip install --user https://github.com/nsmith-/rhalphalib/archive/master.zip

Finally, since we are using Python 2.7, you should edit the file

$HOME/.local/lib/python2.7/site-packages/rhalphalib/model.py

and change line 94 from

obsmap.insert(ROOT.std.pair('const string, RooDataHist*')(channel.name, obs))

to

obsmap.insert(ROOT.std.pair('const string, RooDataHist*')(channel.name.encode("ascii"), obs))

and also make the following modification:

import ROOT
#add this line
ROOT.v5.TFormula.SetMaxima(5000000)

The models saved in the .model files at the previous step can be rendered into datacards by running the following command:

python render.py -m model_name

The -m or --model option takes as input the name of the model to render, which corresponds to the name of the .model file where the model is stored. The render.py module launches Python futures jobs to process in parallel the different control/signal regions that belong to a single model. Different models can also be rendered in parallel by using Condor. To run rendering Condor jobs, the following command should be run:

python render_condor.py -m <model_name> -c <server_name> -t -x
python render_condor.py -m darkhiggs -c kisti -t -x

This time, all the models stored in .model files whose names contain the string passed via the -m option are going to be rendered.

The render.py module saves datacards and workspaces into folders with the following naming scheme:

datacards/model_name

The render_condor.py module produces .tgz tarballs that contain the different datacards/workspaces and are stored in the datacards folder. To untar them, simply do:

python macros/untar_cards.py -a darkhiggs 

Where the -a or --analysis option corresponds to the analysis name. The untar_cards.py script will untar all the tarballs whose names contain the string passed through the -a option. To merge the datacards, run the following script:

python macros/combine_cards.py -a darkhiggs

Where the -a or --analysis option corresponds to the analysis name. The combine_cards.py script will combine all the datacards whose names contain the string passed through the -a option. The script will create a folder inside datacards whose name corresponds to that string, move into it all the workspaces corresponding to the datacards it combined, and save there the combined datacard, whose name will also be set to that string.

You can also plot the inputs of a single workspace with the makeWorkspacePlots.C macro, which stacks all the inputs in a single histogram. From the folder you just created in the previous step, do, for example:

root -b -q ../../makeWorkspacePlots.C\('"zecr2018passrecoil4"',5000,2000\)

where the first argument (zecr2018passrecoil4) is the name of the workspace (the ROOT file must have the same name, with the .root extension), the second argument (5000) is the upper limit of the y-axis of the stack, and the third argument (2000) is the y-position of an explanatory label, which simply repeats the first argument.

Using Combine

Make sure you edit your $CMSSW/src/HiggsAnalysis/CombinedLimit/scripts/text2workspace.py module as suggested below:

import ROOT
#add this line
ROOT.v5.TFormula.SetMaxima(5000000)

In the datacards directory, a subdirectory is created based on the <model_name>. We are using the "MultiSignalModel" method so that one datacard includes all signal mass points. Please submit text2workspace jobs to Condor using the following command:

python t2w_condor.py -a darkhiggs -c <server name> -t -x

Currently, <server name> is either lpc or kisti. The output will be saved in the same directory, with the same name as the datacard. Once you have the workspace, you can run the fit over Condor by doing:

python combine_condor.py -a darkhiggs -c <server name> -m <method> -t -x
python combine_condor.py -a darkhiggs -c kisti -m cminfit -t -x

Macros

Usage examples

python macros/dump_templates.py -w datacards/darkhiggs/darkhiggs.root:w --observable fjmass -o plots/darkhiggs2018/model
python macros/hessian.py -w datacards/darkhiggs/higgsCombineTest.FitDiagnostics.mH120.root:w -f datacards/darkhiggs/fitDiagnostics.root:fit_b

Pulls plotting

Change the path to store outputs in plotConfig.py

python macros/diffNuisances.py -g pulls.root ../fitDiagnostics.root

Postfit plotting - Method 1

If you used the fast fit command, then the following command will make the prefit and postfit histograms:

PostFitShapesFromWorkspace -w path/to/darkhiggs.root -d path/to/darkhiggs.txt -f path/to/fitDiagnostics.root:fit_s --postfit --sampling --samples 300 --skip-proc-errs -o outputfile.root

To make postfit stack plots

python macros/postFitShapesFromWorkSpace.py path/to/outputfile.root

Postfit plotting - Method 2

This method uses the output from dump_templates.py.

python macros/combine_dump_outputs.py plots/darkhiggs2018/dump_postfit (or dump_prefit)

Then you get combined root files in the plots/darkhiggs2018/dump directory.

python macros/dump_darkhiggs_postfit.py plots/darkhiggs2018/dump/outputs.root

This will make postfit stack plots.


Issues

Top pT

Dear colleagues,

Over the past year, we have been working on setting new recommendations for addressing the modeling of the top quark pT in the simulation.

We are happy to inform you that the new guidelines are ready and accessible via
https://twiki.cern.ch/twiki/bin/viewauth/CMS/TWikiTopQuark
under "Analysis development support"
or with this direct link to the page
https://twiki.cern.ch/twiki/bin/view/CMS/TopPtReweighting.

We have taken into account the different types of analyses and have made categorizations which address the analyses in the TOP PAG and in most of the other PAGs as well. In this version of the page, we include more clarifications on the recommendations and expand on cases with the SM ttbar as background.

There are studies with Run II data that are expected to be added in the near future as we acquire more knowledge about the top pT. New functions will come in an extended pT range and new reweighting schemes in two dimensions (e.g. top pt vs mtt) will be introduced to mitigate the shortcomings of the 1D corrections.

We would like to thank immensely all the people who provided input and feedback!

Nadjieh, Florencia, Andy

NNLO photon corrections

From Michael:
"so the difference for the photons in this procedure is a specific isolation cut the thereticians need to do calculations which are in equations (10) (11) and (13) in https://arxiv.org/pdf/1705.04664.pdf
Then, one basically has to get the final state hadrons, see https://github.com/michaelwassmer/useful_scripts/blob/master/Vboson_Pt_Reweighting/V_boson_pt_reweighting.py#L192
Finally, one has to decide if the photon in the event is "isolated" in the sense above, see https://github.com/michaelwassmer/useful_scripts/blob/master/Vboson_Pt_Reweighting/V_boson_pt_reweighting.py#L252
If there are more than one isolated photon, then one takes the leading one.
After these steps, analogously to the other bosons, one takes the pt and retrieves the corresponding reweighting factors from the histogram
If there is no isolated photon, the event does not get an additional reweighting factor"
Also, add the usual min pT > 100 requirement.
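
As a rough illustration of the decision logic described above (a sketch only, not the actual implementation: the is_isolated and reweight_lookup callables are placeholders for the theoretical isolation test and the reweighting-histogram lookup):

def photon_nnlo_weight(photons, hadrons, is_isolated, reweight_lookup):
    """Return the NNLO reweighting factor for an event, or 1.0 if no
    isolated photon with pT > 100 GeV is found (sketch only)."""
    # Keep only photons passing the theoretical isolation criterion
    isolated = [p for p in photons if is_isolated(p, hadrons)]
    if not isolated:
        return 1.0  # no isolated photon: no additional reweighting factor
    leading = max(isolated, key=lambda p: p.pt)  # take the leading isolated photon
    if leading.pt <= 100.0:  # usual min pT > 100 requirement
        return 1.0
    return reweight_lookup(leading.pt)  # factor from the reweighting histogram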

Missing compatibility with el8 and el9 nodes

The current setup_lcg.sh and env_lcg.sh files are configured to use nodes running SL6 or CC7 operating systems.

decaf/env_lcg.sh, lines 1 to 7 (commit 8e989f5):

# http://lcginfo.cern.ch/release/95apython3/
# Try to guess SL6 vs. CC7
if uname -r | grep -q el6; then
source /cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-slc6-gcc8-opt/setup.sh
else
source /cvmfs/sft.cern.ch/lcg/views/LCG_96python3/x86_64-centos7-gcc8-opt/setup.sh
fi

This works for the time being, as the default node on the LPC cluster still runs on SL7 (which appears to be compatible); however, this may soon become deprecated, given that lxplus switched its default alias to EL9 as of December 7, 2023.

Currently the LPC offers SL7 and EL8 nodes, and likewise for their GPU nodes. Both scripts above run on SL7 nodes with no issues, but on EL8 nodes the python executable no longer exists, only python3, which these scripts can't find.

[alejands@cmslpc205 decaf]$ . env_lcg.sh
-bash: python: command not found
-bash: python: command not found
Error in find.package("readr") : there is no package called ‘readr’
Calls: cat -> find.package
Execution halted
dirname: missing operand
Try 'dirname --help' for more information.

You can use python after sourcing a CMSSW environment with cmsenv, but then you run into a different error: the LCG files are no longer available on these nodes.

[alejands@cmslpc204 decaf]$ echo $CMSSW_BASE
/uscms_data/d3/alejands/MonoZprime/CMSSW_10_6_30_patch1

[alejands@cmslpc204 decaf]$ . env_lcg.sh
/cvmfs/sft.cern.ch/lcg/releases/R/3.5.3-883db/x86_64-centos7-gcc8-opt/lib64/R/bin/exec/R: error while loading shared libraries: libreadline.so.6: cannot open shared object file: No such file or directory

I don't believe this is an immediate cause for concern and we should focus on the UL issues first, but I'm posting it as an issue so it doesn't slip our minds. This feels like it should be straightforward to update (fingers crossed 🤞).

UL: V-tagging

Similar to the top-tagging one, but this is more complicated. We need to:

  1. Identify V-tagging working points. Typically the logic is to use a working point that gives at least 90% signal efficiency. Since we do not have mono-Z' signals yet, I would do this using a mono-V or a Higgs invisible signal sample. This would be the first step to do once the mono-Z' processor is ready.

  2. Measure the scale factors. In this case we will not be measuring mis-tagging scale factors because those will be calibrated in-situ. But we do need to measure signal scale factors.

To do that, we should follow the standard procedure used in CMS for V-tagging. We need to find it by digging into TWikis. The group that typically measures these scale factors is JMAR. I suggest reaching out to the conveners in case the information is hard to find.

Then we need to make a processor, which we will call vtagsf.py, to perform such a scale factor measurement, following the example of the one used for double-b tagging:

https://github.com/mcremone/decaf/blob/master/analysis/processors/doublebsf.py

  3. As for top-tagging, a class in corrections.py must be implemented to apply the scale factors, as done here:

https://github.com/mcremone/decaf/blob/UL/analysis/utils/corrections.py#L521

NLO corrections

NNLO uncertainties

In principle, it is composed of three pieces: epsilonQCD, epsilonEW, and epsilonMIX.

epsilonQCD

It describes uncertainties related to variations of the renormalization and factorization scales which are performed to estimate the uncertainty of the theoretical prediction due to missing higher-order contributions.

The uncertainty is split into three parts:

  • epsilonQCD1, a normalization uncertainty
  • epsilonQCD2, a shape uncertainty
  • epsilonQCD3, an additional uncertainty estimating unknown correlations between the QCD uncertainties of the different vector boson plus jets processes.

All three nuisance parameters need to be treated as uncorrelated with each other; however, each parameter is correlated across all V+jets processes and all bins of boson pT.

Questions:

  • Is epsilonQCD3 a shape or a normalization uncertainty?
  • The fact that each parameter is correlated across all V+jets processes means that if it goes up for one process, it goes up for all of them. Does this mean that it goes up/down by the same amount?

epsilonEW

It describes uncertainties related to missing even higher-order (NNLO) contributions:

  • epsilon1 describes a universal effect of the missing higher-order corrections and can therefore be treated as correlated across all V+jets processes.
  • epsilon2 and epsilon3 describe subleading higher-order effects with unknown process correlation. Therefore, these parameters are treated as uncorrelated between the different processes.

Questions:

  • Is there any correlation between epsilon1, epsilon2, and epsilon3?
  • Are these shape or normalization factors?
  • Regarding epsilon1 being correlated across all V+jets processes, does it vary by the same amount for all processes?

epsilonMIX

It is an additional uncertainty regarding mixed contributions that cannot be described with the multiplicative or factorized approach. This nuisance parameter is chosen to be completely uncorrelated between the different processes.

Questions:

  • Is epsilonMIX a shape or a normalization factor?

So far I see at least three nuisances that certainly need to be accounted for in the fit: epsilon2, epsilon3, and epsilonMIX. It needs to be clarified whether they are shape or normalization effects. We are going to have a different nuisance per V+jets process, therefore we will have epsilon2^{V}, epsilon3^{V}, and epsilonMIX^{V}, where V can be W, Z, or gamma.
The remaining nuisances, epsilonQCD1, epsilonQCD2, epsilonQCD3, and epsilon1, will cancel out in the transfer factor ratio, and can therefore be excluded from the pool of uncertainties, if they not only move in a correlated fashion for all V+jets processes but also move by the same amount. If they do not, we are going to have one nuisance for all processes.
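
A minimal illustration of the cancellation argument, assuming a purely multiplicative normalization effect of common size delta on both yields entering a Z-over-W transfer factor:

R(\theta) = \frac{N_Z \, (1 + \delta\,\theta)}{N_W \, (1 + \delta\,\theta)} = \frac{N_Z}{N_W} = R(0)

If instead the shifts differ between processes (\delta_Z \neq \delta_W), the ratio keeps a residual dependence proportional to (1 + \delta_Z\,\theta)/(1 + \delta_W\,\theta) and the nuisance has to stay in the fit.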

UL: Corrections

First of all, pull the most up-to-date version from the master branch:

https://github.com/mcremone/decaf/blob/master/analysis/utils/corrections.py

then replace the corrections one by one with the ones recommended for UL. Please be mindful of a couple of things:

  1. Use 'correctionlib' (https://github.com/cms-nanoAOD/correctionlib); a minimal usage sketch is shown after this list.
  2. Use the coffea lookup tools to interface with correctionlib. If you need help, ask Nick Smith.
  3. Comment on each single correction, including links to where the correction files were taken from, etc.
  4. Clean up the 'analysis/data' folder of non-UL files and keep only the ones that are used. With correctionlib, they should be reduced to just a bunch of JSON files.
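
A minimal correctionlib usage sketch (the JSON file name and the correction key below are hypothetical placeholders; the inputs to evaluate depend on how each correction is defined):

import correctionlib

# Open a correction set and evaluate one correction (placeholder names)
cset = correctionlib.CorrectionSet.from_file("analysis/data/example_correction.json")
sf = cset["example_correction"].evaluate(1.2, 45.0)  # e.g. (eta, pt), depending on the correction
print(sf)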

UL: Processors

We will start by updating the dark Higgs processor to interface with the new data and work with the newer coffea:

https://github.com/mcremone/decaf/blob/UL/analysis/processors/darkhiggs.py

Simultaneously, the script that executes the processors also needs to be adjusted:

https://github.com/mcremone/decaf/blob/UL/analysis/run.py

This is going to be quite some work. A few things to focus on:

  1. I'm pretty sure run_uproot_job in coffea 0.7 doesn't exist anymore:

https://github.com/mcremone/decaf/blob/UL/analysis/run.py#L45

We need to learn how we can execute processors locally with Python futures; a hedged sketch is shown below.
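
A hedged sketch of what local execution with Python futures looks like with the coffea 0.7 Runner (the fileset and the processor instance are placeholders; the exact API should be checked against the coffea version actually in use):

from coffea import processor
from coffea.nanoevents import NanoAODSchema

fileset = {"TTJets": ["root://host//store/example.root"]}  # placeholder fileset

run = processor.Runner(
    executor=processor.FuturesExecutor(workers=4),
    schema=NanoAODSchema,
    chunksize=50000,
)
# processor_instance would be the instance loaded from the .coffea file,
# e.g. via coffea.util.load; then:
# out = run(fileset, treename="Events", processor_instance=processor_instance)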

  2. The new UL ROOT files have ParticleNet instead of DeepAK15, therefore this part needs to be changed accordingly:

https://github.com/mcremone/decaf/blob/UL/analysis/processors/darkhiggs.py#L449-L451

  3. We need to verify that the EE fix for 2017 is still needed for UL:

https://github.com/mcremone/decaf/blob/UL/analysis/processors/darkhiggs.py#L352

  4. We need to confirm that the triggers are still the same as pre-legacy:

https://github.com/mcremone/decaf/blob/UL/analysis/processors/darkhiggs.py#L75-L107

in principle I see no reason why they shouldn't.

  5. Check if this is the right way to apply PU weights in UL:

https://github.com/mcremone/decaf/blob/UL/analysis/processors/darkhiggs.py#L555

also, check if systematic variations are now provided and, if yes, propagate them.

  6. Check if the muon ID SFs are now a function of eta or abseta for all years and, if yes, fix this:

https://github.com/mcremone/decaf/blob/UL/analysis/processors/darkhiggs.py#L574-L576

  7. Check if this still applies:

https://github.com/mcremone/decaf/blob/UL/analysis/processors/darkhiggs.py#L590-L596

  8. Check for UL if the eeBadScFilter only applies to data:

https://github.com/mcremone/decaf/blob/UL/analysis/processors/darkhiggs.py#L638

Actually check if the met filter recipe is still the same in UL:

https://github.com/mcremone/decaf/blob/UL/analysis/processors/darkhiggs.py#L24-L51

  9. Move from coffea.hist to hist following these instructions (a minimal sketch of the hist API is shown after this list):

CoffeaTeam/coffea#705
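
A minimal sketch of the scikit-hep hist API that replaces coffea.hist (the axis and category names here are just illustrative):

import hist
import numpy as np

h = hist.Hist(
    hist.axis.StrCategory([], name="dataset", growth=True),
    hist.axis.Regular(50, 0, 500, name="recoil", label="Recoil [GeV]"),
)
h.fill(dataset="TTJets", recoil=np.random.exponential(150, size=1000))
print(h[{"dataset": hist.loc("TTJets")}])  # the 1D recoil histogram for this dataset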

I will include more items if anything comes to mind.
