
pyhf's Introduction

scikit-hep: metapackage for Scikit-HEP


Project info

The Scikit-HEP project is a community-driven and community-oriented project with the aim of providing Particle Physics at large with an ecosystem for data analysis in Python embracing all major topics involved in a physicist's work. The project started in Autumn 2016 and its packages are actively developed and maintained.

It is not just about providing core and common tools for the community. It is also about improving the interoperability between HEP tools and the Big Data scientific ecosystem in Python, and about improving the discoverability of utility packages and projects.

As far as the project's overall structure is concerned, it should be seen as a toolset rather than a toolkit.

Getting in touch

There are various ways to get in touch with project admins and/or users and developers.

scikit-hep package

scikit-hep is a metapackage for the Scikit-HEP project.

Installation

You can install this metapackage from PyPI with `pip`:
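pip install scikit-hep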

or you can use Conda through conda-forge:
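conda install -c conda-forge scikit-hep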

All the normal best-practices for Python apply; you should be in a virtual environment, etc.

Package version and dependencies

Please check the setup.cfg and requirements.txt files for, respectively, the list of supported Python versions and the list of Scikit-HEP project packages and dependencies included.

For any installed version of scikit-hep, the following displays the actual versions of all installed Scikit-HEP dependent packages, for example:
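(Assuming the metapackage's show_versions helper; check the package documentation if the entry point differs.)

>>> import skhep
>>> skhep.show_versions()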

Note on the versioning system:

This package uses Calendar Versioning (CalVer).

pyhf's People

Contributors

1nf0rmed, actions-user, alexander-held, aryan26roy, beojan, dependabot[bot], eschanet, henryiii, jzf2101, kanishk16, kratsg, lhenkelm, lnielsen, lorenzennio, lukasheinrich, marcogorelli, masonproffitt, matthewfeickert, moelf, nikoladze, ntadej, phinate, pineappleslikei, pre-commit-ci[bot], rhnsharma, saransh-cpp, sauerburger, tirkarthi, wernerd-cern, wiso


pyhf's Issues

implement staterror

Should be mostly straightforward, but requires additional bookkeeping of which samples participate. Essentially one additional constraint term per bin.

Bug Report: Optimizer not set when backend is changed

Description

If using a backend other than numpy_backend, we currently have to set the optimizer manually. However, this should be done automatically when the backend is changed.

Otherwise this causes an error, as shown below with pyhf.runOnePoint().

Expected Behavior

When pyhf.tensorlib is changed, this change should be detected automatically and pyhf.optimizer should also be updated appropriately.

Actual Behavior

When pyhf.tensorlib is changed, pyhf.optimizer must be set manually. If it is not, errors can occur, as seen below.

Steps to Reproduce

import pyhf
from pyhf.tensor.tensorflow_backend import tensorflow_backend
from pyhf.simplemodels import hepdata_like
import tensorflow as tf

if __name__ == '__main__':
    default_backend = pyhf.tensorlib
    pyhf.tensorlib = tensorflow_backend(session=tf.Session())
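    # Note: only pyhf.tensorlib is swapped here; pyhf.optimizer is left as the
    # default SciPy optimizer, which is what triggers the error shown below.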

    n_bins = 5
    binning = [n_bins, -0.5, n_bins + 0.5]
    data = [120.0] * n_bins
    bkg = [100.0] * n_bins
    bkgerr = [10.0] * n_bins
    sig = [30.0] * n_bins
    source = {
        'binning': binning,
        'bindata': {
            'data': data,
            'bkg': bkg,
            'bkgerr': bkgerr,
            'sig': sig
        }
    }

    pdf = hepdata_like(source['bindata']['sig'],
                       source['bindata']['bkg'],
                       source['bindata']['bkgerr'])
    data = source['bindata']['data'] + pdf.config.auxdata

    pyhf.runOnePoint(1.0, data, pdf,
                     pdf.config.suggested_init(),
                     pdf.config.suggested_bounds())

    # Reset backend
    pyhf.tensorlib = default_backend

Traceback:

Traceback (most recent call last):
  File "/home/mcf/anaconda3/envs/pyhf/lib/python3.6/site-packages/scipy/optimize/slsqp.py", line 380, in _minimize_slsqp
    fx = float(np.asarray(fx))
TypeError: float() argument must be a string or a number, not 'Tensor'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "runOnePoint_example.py", line 36, in <module>
    pdf.config.suggested_bounds())
  File "/home/mcf/Code/GitHub/pyhf/pyhf/__init__.py", line 412, in runOnePoint
    pdf, init_pars, par_bounds))
  File "/home/mcf/Code/GitHub/pyhf/pyhf/__init__.py", line 380, in generate_asimov_data
    loglambdav, asimov_mu, data, pdf, init_pars, par_bounds)
  File "/home/mcf/Code/GitHub/pyhf/pyhf/optimize/opt_scipy.py", line 26, in constrained_bestfit
    method='SLSQP', args=(data, pdf), bounds=par_bounds)
  File "/home/mcf/anaconda3/envs/pyhf/lib/python3.6/site-packages/scipy/optimize/_minimize.py", line 495, in minimize
    constraints, callback=callback, **options)
  File "/home/mcf/anaconda3/envs/pyhf/lib/python3.6/site-packages/scipy/optimize/slsqp.py", line 382, in _minimize_slsqp
    raise ValueError("Objective function must return a scalar")
ValueError: Objective function must return a scalar

From the traceback it can be seen that opt_scipy is being used instead of opt_tflow.

Checklist

  • Run git fetch to get the most up to date version of master
  • Searched through existing Issues to confirm this is not a duplicate issue
  • Filled out the Description, Expected Behavior, Actual Behavior, and Steps to Reproduce sections above or have edited/removed them in a way that fully describes the issue

increase coveralls tolerance to avoid false positives

Description

Coveralls fails if the test coverage drops by even 0.01%, which usually happens just because some code was removed. This makes it hard to assess a PR at first glance due to the red crosses. It should be possible for the check to fail only if the drop is larger than, say, 1%.

Make sure pipenv-based development workflow works

Description

We support a number of extras, like the various tensor backends, but as far as I know none of them are mutually exclusive and all of them are needed for running the unit tests.

We should therefore make sure pip install -e .[develop] installs all packages necessary to run the tests. I think pytest-benchmark is also missing (@kratsg mentioned that).

Add docstrings

The classes and methods of pyhf need docstrings for documentation of the code.

In anticipation of using Sphinx for the docs, the docstrings should follow something along the lines of PyTorch's style (maybe dipping into TensorFlow's style guide at times too). For an example, cf. PyTorch's Bernoulli distribution.
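As a rough reference, a hypothetical docstring in that style could look like the following (the method name and signature here are illustrative, not an existing pyhf API):

def normal_logpdf(self, x, mu, sigma):
    r"""
    Compute the log of the probability density of the Normal distribution.

    Example:

        >>> backend.normal_logpdf(0.5, mu=0.0, sigma=1.0)  # doctest: +SKIP
        -1.0439385332046727

    Args:
        x (tensor or float): The value at which to evaluate the density.
        mu (tensor or float): The mean of the Normal distribution.
        sigma (tensor or float): The standard deviation of the Normal distribution.

    Returns:
        tensor: The log density evaluated at ``x``.
    """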

Make Issue and Pull Request Templates

Having uniform templates for feature requests, bug reports, feature additions, and bug patches will make things easier for maintainers to parse and organize quickly.

Bug Report: Understand and fix TensorFlow scaling behavior

When benchmarking the performance of the TensorFlow backend and optimizer, the performance decreases (the run time of the test increases) with the number of iterations performed. This should not happen and needs to be understood and fixed.

This has been noticed in Issue #77.

Expected Behavior

The performance of the backend should be independent of the number of iterations, and should be distributed about some central value for a particular configuration of a fit.

Actual Behavior

The performance depends on the number of iterations at a given benchmark point.

(Benchmark timing histograms for n_runs=5 and n_runs=8: benchmark_times_5_log, benchmark_times_8_log.)

Steps to Reproduce

Enable the TensorFlow backend in tests/test_benchmark.py and run

pytest --benchmark-sort=mean --benchmark-histogram=benchmark_tf tests/test_benchmark.py

Checklist

  • Run git fetch to get the most up to date version of master
  • Searched through existing Issues to confirm this is not a duplicate issue
  • Filled out the Description, Expected Behavior, Actual Behavior, and Steps to Reproduce sections above or have edited/removed them in a way that fully describes the issue

create benchmarking code

We'll be interested in scaling behaviour w.r.t. the number of bins / channels / systematics / etc. As a start, it should not be too hard to measure e.g. the pdf evaluation time for a simple hepdata-like model.

For example, the code snippet below sets up a 2-bin likelihood, and we can generate N-bin likelihoods with an easy loop that spits out these JSON specs.

@matthewfeickert maybe this is something for you?

source = {
  "binning": [2,-0.5,1.5],
  "bindata": {
    "data":    [120.0, 180.0],
    "bkg":     [100.0, 150.0],
    "bkgerr":     [10.0, 10.0],
    "sig":     [30.0, 95.0]
  }
}

from pyhf.simplemodels import hepdata_like
pdf  = hepdata_like(source['bindata']['sig'], source['bindata']['bkg'], source['bindata']['bkgerr'])
data = source['bindata']['data'] + pdf.config.auxdata

#now the call we want to benchmark:
pdf.logpdf(pdf.config.suggested_init(), data)

the timeit module will be useful for this

https://docs.python.org/2/library/timeit.html
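A minimal sketch of what such a benchmark could look like, building only on the hepdata_like snippet above (generate_source and the choice of bin counts here are illustrative, not existing pyhf benchmarking code):

import timeit
from pyhf.simplemodels import hepdata_like

def generate_source(n_bins):
    # spit out an N-bin version of the JSON spec above
    return {
        'binning': [n_bins, -0.5, n_bins + 0.5],
        'bindata': {
            'data': [120.0] * n_bins,
            'bkg': [100.0] * n_bins,
            'bkgerr': [10.0] * n_bins,
            'sig': [30.0] * n_bins,
        },
    }

for n_bins in [1, 10, 100]:
    source = generate_source(n_bins)
    pdf = hepdata_like(source['bindata']['sig'], source['bindata']['bkg'],
                       source['bindata']['bkgerr'])
    data = source['bindata']['data'] + pdf.config.auxdata
    init = pdf.config.suggested_init()
    # time the call we want to benchmark: a single logpdf evaluation
    t = timeit.timeit(lambda: pdf.logpdf(init, data), number=100)
    print(n_bins, t / 100)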

allow custom setting of POI

Right now we hardcode the POI (the parameter associated with the only normfactor in the model), but in general there are multiple normfactors and we should allow annotating the model to select which parameter is the POI.

shapesys breaks in pytorch

For some reason models with shapesys break in pytorch (infinite dimension-counting loop). A good test case is hepdata_like(...).

Back-end submodule structuring should be less convoluted?

Description

The current way of importing backends is a little redundant:

from pyhf.tensor.numpy_backend import numpy_backend
from pyhf.tensor.pytorch_backend import pytorch_backend
from pyhf.tensor.tensorflow_backend import tensorflow_backend
from pyhf.tensor.mxnet_backend import mxnet_backend

Something more pythonic would be

from pyhf.tensor.backends import numpy_backend, pytorch_backend, tensorflow_backend, mxnet_backend

Or is there something technically difficult about doing it this way?
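A minimal sketch of what such a re-export module could look like (a hypothetical pyhf/tensor/backends.py, not the actual pyhf layout); one wrinkle is that the non-numpy backends pull in optional heavy dependencies, so the re-exports would have to tolerate missing packages:

# pyhf/tensor/backends.py -- hypothetical flatter import layout (sketch only)
from pyhf.tensor.numpy_backend import numpy_backend  # always available

try:
    from pyhf.tensor.tensorflow_backend import tensorflow_backend
except ImportError:  # tensorflow not installed
    tensorflow_backend = None

try:
    from pyhf.tensor.pytorch_backend import pytorch_backend
except ImportError:  # torch not installed
    pytorch_backend = None

try:
    from pyhf.tensor.mxnet_backend import mxnet_backend
except ImportError:  # mxnet not installed
    mxnet_backend = None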

Notebooks failing as result of no 'auxdata' attribute

Issue

While testing the Jupyter notebooks in Binder it was noticed that they fail in the prep_data() function as a result of the line

data = source['bindata']['data'] + pdf.auxdata

throwing an error as pdf (defined at pdf = hfpdf(spec)) does not have an 'auxdata' attribute.

Stack Trace

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-133b58fcb35a> in <module>()
     10 }
     11 
---> 12 d,pdf = prep_data(source)
     13 init_pars = pdf.config.suggested_init()
     14 par_bounds = pdf.config.suggested_bounds()

<ipython-input-2-b59dcf407366> in prep_data(source)
     30     }
     31     pdf  = hfpdf(spec)
---> 32     data = source['bindata']['data'] + pdf.auxdata
     33     return data, pdf

AttributeError: 'hfpdf' object has no attribute 'auxdata'

Fix

This can be fixed by the following line replacement:

data = source['bindata']['data'] + pdf.config.auxdata

implement shared shapesys

Unclear how the ROOT implementation deals with incompatible shapesys definitions that share a name, e.g. definitions that have different histograms of relative uncertainty per bin but the same name. Maybe this is undefined behaviour anyway and we can ignore it?
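For concreteness, a hypothetical spec fragment illustrating the clash: two samples declare a shapesys modifier with the same name but different per-bin relative-uncertainty histograms.

"samples": [
  {"name": "background1", "data": [100.0, 150.0],
   "mods": [{"type": "shapesys", "name": "shared_unc", "data": [10.0, 10.0]}]},
  {"name": "background2", "data": [50.0, 50.0],
   "mods": [{"type": "shapesys", "name": "shared_unc", "data": [5.0, 20.0]}]}
]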

speed up CI tests (do we need all conda packages?)

By using Conda, the setup phase of the CI jobs has unfortunately become a bit slower than without Conda. Maybe we can speed the jobs up again by checking whether we actually need all the packages that we install during CI.

Understand differences in hypotest() test statistic across different backends

As noted in Issue #77, when running pyhf.utils.hypotest() (formerly pyhf.runOnePoint()) the pyhf backends do not all agree on the value of the test statistic, q_μ, that they return for the same model. Specifically, the PyTorch backend returns a test statistic that seems to be consistently smaller than the NumPy and TensorFlow backends. This needs to be investigated and understood/fixed.

First items:

  • Plot the differences in the test statistic for the backends as a function of the number of nuisance parameters (bins in this specific case) in the fit
  • Test and validate the unconstrained_bestfit() method
  • Test and validate the constrained_bestfit() method

normalize tensorlib behaviour in `.sum(...)`

There are slightly different semantics between pytorch and numpy for sums that result in a scalar. I notice @pablodecm also did something specific in the TF backend. We need to check what exactly the semantics are in numpy.

In [2]: import pyhf.tensor.numpy_backend
In [3]: import pyhf.tensor.tensorflow_backend
In [4]: import pyhf.tensor.pytorch_backend
In [6]: backends = [pyhf.tensor.numpy_backend.numpy_backend(), pyhf.tensor.tensorflow_backend.tensorflow_backend(), pyhf.tensor.pytorch_backend.pytorch_backend()]
In [8]: for b in backends: print b.sum(b.astensor([1,2,3])).shape
()
()
(1L,)
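A small sketch of the numpy reference semantics that the other backends could be normalized to (plain numpy only, no pyhf specifics):

import numpy as np

a = np.asarray([1.0, 2.0, 3.0])
total = np.sum(a)

print(total.shape)   # () -- a 0-d, scalar-shaped result, not (1,)
print(float(total))  # 6.0, so float() conversion works

# A backend whose full reduction yields a 1-element tensor could normalize by
# reshaping, e.g. returning result.reshape(()) (or the backend's equivalent).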

Update schema to separate modifiers out

Need to group the definitions of modifiers/data into separate definitions, as each modifier expects different data. Something like

oneOf:
- {$ref: '#/definitions/modifier_definitions/histosysdata'}
- {$ref: '#/definitions/modifier_definitions/normsysdata'}
- {$ref: '#/definitions/modifier_definitions/shapesysdata'}

is desired. This was raised in #113 and is related to #105.

Define API pdf sampling via e.g. probabilistic frameworks like edward

Description

We want to have the ability to sample from the pdf. A nice way to do this is via native probabilistic programming frameworks like Edward that hook somewhat natively into the tensor backends (not sure if there are similar projects for PyTorch, MXNet @cranmer ?). It is not yet clear to me how to do this cleanly across numpy/TF/PyTorch/MXNet.

For reference I added this super-simplified notebook to show how to sample something like

p(n, a | α) = Pois(n | ν(α)) · Gaus(a | α)

which is the core structure of HF right now:

https://github.com/diana-hep/pyhf/blob/master/examples/experiments/edwardpyhf.ipynb

optimize loops in hfpdf

At various stages, python loops are used in the hfpdf class that could feasibly be converted into more efficient numpy operations

For example:

Right now there is a python loop where we apply the same function for each bin; this can be batched:
https://github.com/lukasheinrich/pyhf/blob/master/pyhf/__init__.py#L220

@matthewfeickert might look into this
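A minimal sketch of the kind of change meant here, with hypothetical helper names (piecewise-linear interpolation is just an illustrative per-bin operation, not the actual pyhf code at the link above):

import numpy as np

def interpolate_per_bin_loop(nominal, up, down, alpha):
    # per-bin python loop: apply the same piecewise-linear shift bin by bin
    return [n + (u - n) * alpha if alpha > 0 else n + (n - d) * alpha
            for n, u, d in zip(nominal, up, down)]

def interpolate_batched(nominal, up, down, alpha):
    # same operation expressed as vectorized numpy arithmetic over all bins at once
    nominal, up, down = (np.asarray(x, dtype=float) for x in (nominal, up, down))
    delta = np.where(alpha > 0, up - nominal, nominal - down)
    return nominal + delta * alpha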

Understand Python 3.5 Failures

In Python 3.5 only, resetting the TensorFlow backend graph and session after each run (identified in Issue #102) results in test failures. This can be seen in the Python 3.5 output in Travis CI build #333.2 for PR #92.

A brief excerpt from the trace follows:

tests/test_benchmark.py ...FFF...                                        [ 29%]
tests/test_import.py F                                                   [ 32%]
tests/test_notebooks.py .                                                [ 35%]
tests/test_optim.py ...                                                  [ 45%]
tests/test_pdf.py FFFFFFF                                                [ 67%]
tests/test_tensor.py ...                                                 [ 77%]
tests/test_validation.py FFFFF.F                                         [100%]
=================================== FAILURES ===================================
_____________________ test_runOnePoint[tensorflow-10_bins] _____________________
self = <tensorflow.python.client.session.Session object at 0x7f06b412c160>
fn = <function BaseSession._do_run.<locals>._run_fn at 0x7f0699f82ea0>
args = ({b'concat:0': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan], dtype=float32)}, [b'strided_slice_8:0'], [], None, None)
message = 'Input is not invertible.\n\t [[Node: MatrixInverse = MatrixInverse[T=DT_FLOAT, adjoint=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Reshape_1)]]'
m = <_sre.SRE_Match object; span=(27, 50), match='[[Node: MatrixInverse ='>
    def _do_call(self, fn, *args):
      try:
>       return fn(*args)
../../../miniconda/envs/test-environment/lib/python3.5/site-packages/tensorflow/python/client/session.py:1327: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

As this does not occur for Python 3.6, it would be worth understanding why. Additionally, it is worth asking whether Python 3.5 should be supported at all, given that most Python 3 users follow releases and have Python 3.6 installed.

add pythonic API for spec building

Description

We want an API similar to (or ideally the same as) the ROOT HistFactory python bindings for iteratively building up a JSON spec for pyhf.

http://ghl.web.cern.ch/ghl/html/HistFactoryDoc.html

  #!/usr/bin/env python

  #
  # A pyROOT script demonstrating
  # an example of writing a HistFactory
  # model using python
  #
  # This example was written to match
  # the example.xml analysis in
  # $ROOTSYS/tutorials/histfactory/
  #
  # Written by George Lewis
  #


  def main():

      try:
          import ROOT
      except ImportError:
          print("It seems that pyROOT isn't properly configured")
          return

      """
      Create a HistFactory measurement from python
      """

      InputFile = "./data/example.root"

      # Create the measurement
      meas = ROOT.RooStats.HistFactory.Measurement("meas", "meas")

      meas.SetOutputFilePrefix("./results/example_UsingPy")
      meas.SetPOI("SigXsecOverSM")
      meas.AddConstantParam("Lumi")
      meas.AddConstantParam("alpha_syst1")

      meas.SetLumi(1.0)
      meas.SetLumiRelErr(0.10)
      meas.SetExportOnly(False)

      # Create a channel

      chan = ROOT.RooStats.HistFactory.Channel("channel1")
      chan.SetData("data", InputFile)
      chan.SetStatErrorConfig(0.05, "Poisson")

      # Now, create some samples

      # Create the signal sample
      signal = ROOT.RooStats.HistFactory.Sample("signal", "signal", InputFile)
      signal.AddOverallSys("syst1", 0.95, 1.05)
      signal.AddNormFactor("SigXsecOverSM", 1, 0, 3)
      chan.AddSample(signal)

      # Background 1
      background1 = ROOT.RooStats.HistFactory.Sample("background1", "background1", InputFile)
      background1.ActivateStatError("background1_statUncert", InputFile)
      background1.AddOverallSys("syst2", 0.95, 1.05)
      chan.AddSample(background1)

      # Background 2
      background2 = ROOT.RooStats.HistFactory.Sample("background2", "background2", InputFile)
      background2.ActivateStatError()
      background2.AddOverallSys("syst3", 0.95, 1.05)
      chan.AddSample(background2)

      # Done with this channel
      # Add it to the measurement:

      meas.AddChannel(chan)

      # Collect the histograms from their files,
      # print some output
      meas.CollectHistograms()
      meas.PrintTree()

      # One can print XML code to an
      # output directory:
      # meas.PrintXML("xmlFromCCode", meas.GetOutputFilePrefix())

      meas.PrintXML("xmlFromPy", meas.GetOutputFilePrefix())

      # Now, do the measurement
      ROOT.RooStats.HistFactory.MakeModelAndMeasurementFast(meas)


  if __name__ == "__main__":
      main()

Add spec validation on hfpdf.__init__

Description

We now have a spec, which should allow us to do a lint-type check for validity. Passing that doesn't guarantee correctness, just that the spec is well-formed.

We might implement a validation routine whose first step would be linting; later we can add other checks (e.g. checking that all sample data have the same length). A sketch of the linting step follows below.
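A minimal sketch of the linting step using the jsonschema package (the schema path here is hypothetical):

import json
import jsonschema

def validate_spec(spec, schema_path='validation/spec.json'):
    # lint-type check: raises jsonschema.ValidationError if the spec is not
    # well-formed; it does not guarantee the spec is physically sensible
    with open(schema_path) as f:
        schema = json.load(f)
    jsonschema.validate(spec, schema)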

Pythonic Specification Generation

I strongly suggest we keep the most stringent JSON spec in #105 as the only JSON schema we have. A "lite JSON" would probably confuse newcomers and those wishing to use an API. Instead, a more user-friendly, pythonic generation layer can be built. Something like the following is certainly possible [inspired by how constructions does it]:

NPs = [NormSys("JES1"), HistoSys("JES2")]
sample = Sample(
    NormSys("JES1") / Data(.....),
    HistoSys("JES2") / Data(....)
)
chan = "singlechannel" / Channel( "signal" / sample)

then doing something like json.dumps(chan) would work out of the box, as you can define how to serialize such an object. chan can implement vars(chan), which returns the simple python structure that can be passed into hfpdf -- similar to how argparse.Namespace does it.

Caveat: the division operator is not strictly needed. One could just as easily do NormSys("JES1").Data("....") and so on.

JSON Schema / Spec discussion

As initiated in #104, there are questions raised about the spec and the way forward with two overarching goals in mind:

  • intuitive and clean for the user
  • fully-specified and documentable by a schema for an API

There are two main issues raised as described below.

Fully-Specified

A schema like

{
  "singlechannel": {
    "background": {
      "data": [1,2,3,4],
      "mods": [...]
    },
    "signal": {
      "data": [1,2,3,4],
      "mods": [...]
    }
  }
}

is not fully-specified as it contains a dictionary with variable key-names (singlechannel, background, signal). A more fully-specified spec looks like so

[
  {
    "name": "singlechannel",
    "type": "channel",
    "samples": [
      {
        "name": "background",
        "data": [1,2,3,4],
        "mods": [...]
      },
      {
        "name": "signal",
        "data": [1,2,3,4],
        "mods": [...]
      }
    ]
  }
]

where an array of channels, and samples are specified. This is a first proposal, but still has a nested array which may or may not be useful for many -- and flattening the array is a possibility, through a process of denormalization (see firebase docs).

Intuitive-ness

Currently, modifications are defined as an array

        "mods": [
          {"type": "shapesys", "name": "mod_JES1", "data": [1,2,3,4]},
          {"type": "shapesys", "name": "mod_JES2", "data": [1,2,3,4]},
          {"type": "shapesys", "name": "mod_FlavTag", "data": [1,2,3,4]}
        ]

however, one of the drawbacks is that it makes a user think of each modification as an entire "object". That is, this would define three modification objects, which is not necessarily true. In spirit, a modification refers to a nuisance parameter, such as mod_JES1, along with configurations for it.

first proposal

A first proposal to make this more intuitive was to structure the modifications as a dictionary, with each key name referring to the nuisance parameter that is of interest

        "mods": {
          "mod_JES1": {"type": "shapesys", "data": [1,2,3,4]},
          "mod_JES2": {"type": "shapesys", "data": [1,2,3,4]},
          "mod_flavTag": {"type": "shapesys", "data": [1,2,3,4]}
        }

A drawback is that we now have configurable dictionary key names, which does not help with JSON Schema / API specification.

second proposal

The second proposal separates the nuisance parameter from the actual modification for a given sample/channel:

        "NPs": [
          {"name": "mod_JES1", "mod": {"type": "shapesys", "data": [1,2,3,4]}},
          {"name": "mod_JES2", "mod": {"type": "shapesys", "data": [1,2,3,4]}},
          {"name": "mod_flavTag", "mod": {"type": "shapesys", "data": [1,2,3,4]}},
        ]

implement sampling from pdf

Description

We would like to be able to sample from the pdf. For that we need to

  1. sample values for auxiliary measurements
  2. based on aux data, derive poisson rate parameters
  3. sample from poissons

e.g.

pdf = pyhf.pdf(...)
data  = pdf.sample()
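A minimal numpy sketch of the three steps above for a hepdata_like-style counting model (the Gaussian constraint and the function name are illustrative simplifications, not the pyhf implementation):

import numpy as np

def sample_toy(signal, bkg, bkgerr, mu=1.0, rng=np.random):
    signal, bkg, bkgerr = map(np.asarray, (signal, bkg, bkgerr))
    # 1. sample values for the auxiliary measurements (per-bin constraint terms)
    gammas = rng.normal(loc=1.0, scale=bkgerr / bkg)
    # 2. based on the sampled aux data, derive the Poisson rate parameters
    rates = mu * signal + gammas * bkg
    # 3. sample the observed main measurement from the Poissons
    return rng.poisson(rates)

# e.g. one toy for the 2-bin example spec used earlier in this thread
print(sample_toy(signal=[30.0, 95.0], bkg=[100.0, 150.0], bkgerr=[10.0, 10.0]))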
