scikit-hep-testdata

A common package to provide example files (e.g. ROOT) for testing and developing packages against. The sample of files is representative of typical files found "in the wild".

In addition to including some ROOT files directly, this package provides simple helper methods to fetch larger files from common open-access data repositories.

Installing and usage

To install:

python -m pip install scikit-hep-testdata

Once installed, absolute file paths can be resolved using the helper methods:

from skhep_testdata import data_path

filename = data_path("some_file.root")

By default, requesting an unknown file raises an exception. To suppress it, pass raise_missing=False to data_path:

filename = data_path("unknown_file.root", raise_missing=False)

The files are not stored on PyPI, so if installed from SDist/wheel, the "local" files will not be present, but will be downloaded from GitHub and cached in the ~/.local/skhepdata directory. If you make an editable install from the Git repo, or if you set SKHEP_DATA=1 when building/installing from the Git repo, you will have the data files locally.

You can see all "local" files with skhep_testdata.known_files, and you can download all files at once with skhep_testdata.download_all(), optionally selecting the download cache directory.

Remote vs. Local files

Some files, particularly large ones, are not stored within this package and instead live on a remote server; we call these "remote files". To obtain one, use the same data_path method as above; this triggers the code to download and configure the remote file. This might be slow the first time but will subsequently be as fast as for a local file. WARNING: the local file caching system has not yet been applied to remote files.
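The download-once behaviour described above can be sketched with the standard library alone. This illustrates the general pattern and is not the package's implementation: fetch_once, its parameters, and the injected download callable are hypothetical names chosen for the example.

```python
from pathlib import Path
from typing import Callable


def fetch_once(name: str, cache_dir: Path, download: Callable[[str], bytes]) -> Path:
    """Return a local path for `name`, downloading it only if it is not cached.

    Hypothetical helper: `download` stands in for whatever retrieves the
    remote file's bytes (for example an HTTP GET).
    """
    target = cache_dir / name
    if target.exists():
        # Cache hit: no network access is needed.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write under a temporary name first so an interrupted download
    # never leaves a truncated file under the final name.
    partial = target.with_name(target.name + ".part")
    partial.write_bytes(download(name))
    partial.replace(target)
    return target
```

Only the first call for a given name pays the download cost; later calls return the cached path immediately.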

Command-line invocation

You can also interact with this package from the command-line:

# Print a path (download if needed)
python -m skhep_testdata cms_hep_2012_tutorial/data.root

# Show all "local" files
python -m skhep_testdata --list

# Download all files to an existing directory
python -m skhep_testdata --all --dir local

You can also use pipx run scikit-hep-testdata to access the above CLI without installing.

Adding new files

We're on the lookout for new, interesting files!

  • Large files: If the file is particularly large (for example, more than 25 MB), consider adding it to an external open-access data repository and adding a configuration here so that the internal helper methods can download it.
  • Experiment data policies: Please make sure you have permissions to add the file to this collection, and that any private or sensitive data has been appropriately masked, salted, or scrambled.

List of files

The following lists describe the files known by this package.

Files stored in this package

Known remote files

Contributors

We hereby acknowledge the contributors that made this project possible (emoji key):

  • benkrikler 💻 📖
  • Jim Pivarski 🚧 🔣 📖
  • Henry Schreiner 🚧 🔣 💻 📖
  • Eduardo Rodrigues 🚧 🔣 💻
  • Matthew Feickert 🔣 💻
  • Pratyush Das 🔣 💻
  • Jerry Ling 🔣 💻
  • Jonas Eschle 💻
  • Giordon Stark 🔣 💻
  • Dmitry Kalinkin 🔣
  • Michele Peresano 🔣
  • Luis Antonio Obis Aparicio 🔣
  • Oksana Shadura 🔣
  • Nicholas Smith 🔣
  • Beojan Stanislaus 🔣
  • Lukas 🔣
  • Johannes Schumann 🔣
  • Elliott Kauffman 🔣
  • Tom Eichlersmith 🔣
  • Alexander Puck Neuwirth 🔣
  • ioanaif 🔣

This project follows the all-contributors specification.

Acknowledgements

  • Many of the files collected directly within this package were collated originally by Jim Pivarski for uproot

Running the tests

This package uses pytest to run the unit tests. To get the testing requirements, install with pip install scikit-hep-testdata[test], or with pip install -e .[test] for a development install. Then run:

pytest

scikit-hep-testdata's Issues

Only the latest data files are accessible

In scikit-hep/uproot5#1115 we noticed that tests seemed to pass even when the newest version of scikit-hep-testdata hadn't been deployed to PyPI.

I took a look at the code and found that the data files are always downloaded using these urls:

baseurl = "https://raw.githubusercontent.com/scikit-hep/scikit-hep-testdata/main/src/skhep_testdata/data/"
zipurl = "https://github.com/scikit-hep/scikit-hep-testdata/zipball/main"

This means that if a file gets updated, then it becomes impossible to use the older version by installing an older version of the package, and instead the only option is to manually move the older version of the file into the cache directory. And if a file gets deleted, then it becomes inaccessible.

So with #144 I unintentionally made previous RNTuple files inaccessible. In this case it's fine since no one should be using the older RNTuple spec, but there might be cases where files need to be updated, but with the option of using older versions by installing previous versions of this package.

So I think it would be better if the above lines could be replaced with something like

baseurl = f"https://github.com/scikit-hep/scikit-hep-testdata/raw/v{__version__}/src/skhep_testdata/data/" 
zipurl = f"https://github.com/scikit-hep/scikit-hep-testdata/archive/refs/tags/v{__version__}.zip" 

and maybe have a fallback in case someone is using an editable version or something like that.
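A minimal sketch of that proposal, assuming release tags are named v<version> as in the URLs above and that any version string that is not a plain dotted release number (for example a development or editable install) should fall back to main. The function name data_urls and the version-matching rule are illustrative assumptions, not existing package code.

```python
import re

REPO = "https://github.com/scikit-hep/scikit-hep-testdata"
DATA_DIR = "src/skhep_testdata/data/"

# URLs currently used by the package, kept as the fallback.
MAIN_BASEURL = (
    "https://raw.githubusercontent.com/scikit-hep/scikit-hep-testdata/main/" + DATA_DIR
)
MAIN_ZIPURL = REPO + "/zipball/main"


def data_urls(version):
    """Return (baseurl, zipurl) for the given package version string."""
    # Assumption: only plain releases such as "0.4.39" have a matching tag.
    if re.fullmatch(r"\d+(\.\d+)*", version):
        tag = f"v{version}"
        return f"{REPO}/raw/{tag}/{DATA_DIR}", f"{REPO}/archive/refs/tags/{tag}.zip"
    # Development versions such as "0.4.40.dev3+g1a2b3c4" have no tag yet.
    return MAIN_BASEURL, MAIN_ZIPURL
```

With this rule, an installed release always resolves to the data files that existed when that release was tagged.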

Also, it would be nice if there was a clear_cache function.
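One possible shape for such a function, as a sketch only: clear_cache does not exist in the package at the time of this issue, and the signature below is an assumption.

```python
import shutil
from pathlib import Path


def clear_cache(cache_dir):
    """Delete the cache directory and everything in it.

    Hypothetical helper, not an existing skhep_testdata function.
    Returns True if a directory was removed, False if there was none.
    """
    cache = Path(cache_dir).expanduser()
    if not cache.is_dir():
        return False
    shutil.rmtree(cache)
    return True
```

For the default cache location named in the README, the call would be clear_cache("~/.local/skhepdata").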

YAML loader warning

Currently, the yaml load is showing a warning:

================================================================================= warnings summary ==================================================================================
tests/ak/test_demo.py::test_simple_example
  /Users/henryschreiner/git/scikit-hep/vector/.env/lib/python3.7/site-packages/skhep_testdata/remote_files.py:38: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    datasets = yaml.load(infile)

-- Docs: https://docs.pytest.org/en/latest/warnings.html

We should use one of the loaders listed here: https://github.com/yaml/pyyaml/wiki/PyYAML-yaml.load(input)-Deprecation
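For plain configuration data, the safe loader is usually sufficient. The sketch below uses PyYAML's yaml.safe_load, which is equivalent to yaml.load(..., Loader=yaml.SafeLoader); the configuration text is a made-up stand-in, not the real contents of the package's dataset file.

```python
import yaml  # PyYAML

# Hypothetical stand-in for a remote-dataset configuration.
EXAMPLE_CONFIG = """
example_dataset:
  url: https://example.org/data.tar.gz
  files:
    - a.root
    - b.root
"""


def load_config(text):
    """Parse YAML without constructing arbitrary Python objects."""
    return yaml.safe_load(text)
```

safe_load accepts an open file object as well as a string, so the warning in remote_files.py could be addressed by changing only the call that parses infile.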

Python 3.5 supported?

In setup.cfg,

Programming Language :: Python :: 3.5

but when I'm testing it in Python 3.5,

https://github.com/scikit-hep/uproot4/pull/8/checks?check_run_id=683554320#step:8:42

On the other hand, importlib_resources claims to support Python 3.5,

https://importlib-resources.readthedocs.io/en/latest/

so I don't know. This mentions a Python 3.5 bug:

https://importlib-resources.readthedocs.io/en/latest/changelog.html#v1-2-0

but I'm using the latest version.

# Name                    Version                   Build  Channel
importlib_resources       1.5.0            py37hc8dfbb8_0    conda-forge

It is the case, though, that scikit-hep-testdata is only tested for Pythons 2.7, 3.6, and 3.8. The matrix should probably be expanded if it's claiming to support everything in setup.cfg.

Seems like there are duplicate files

The test_ntuple_* files seem to play the same role as the uproot_ntuple_* files. They might not be exact duplicates since the uproot_ntuple_* files most likely use a later RNTuple version, but I don't think it makes sense to keep both sets given that the RNTuple spec is not stable yet. Should I remove the uproot_ntuple_* ones and update the test_ntuple_* ones to RNTuple RC2?

First pip release

In order for the package to be useful, e.g. having it in a requirements file, it would be nice to have it on pip.

Sure, it's to be extended, but the basis seems to be here.

Are there any plans for this? travis CI auto-deployment? Or what did you have in mind for the release process?

Error when building: setuptools-scm was unable to detect version

When trying to build packages for openSUSE, using the github tagged 0.3.6 source tarball, I get the following error:

[   10s] + /usr/bin/python3 setup.py build '--executable=/usr/bin/python3 -s'
[   10s] Traceback (most recent call last):
[   10s]   File "setup.py", line 7, in <module>
[   10s]     setup()
[   10s]   File "/usr/lib/python3.8/site-packages/setuptools/__init__.py", line 144, in setup
[   10s]     _install_setup_requires(attrs)
[   10s]   File "/usr/lib/python3.8/site-packages/setuptools/__init__.py", line 132, in _install_setup_requires
[   10s]     dist = distutils.core.Distribution(dict(
[   10s]   File "/usr/lib/python3.8/site-packages/setuptools/dist.py", line 447, in __init__
[   10s]     _Distribution.__init__(self, {
[   10s]   File "/usr/lib64/python3.8/distutils/dist.py", line 292, in __init__
[   10s]     self.finalize_options()
[   10s]   File "/usr/lib/python3.8/site-packages/setuptools/dist.py", line 740, in finalize_options
[   10s]     ep.load()(self)
[   10s]   File "/usr/lib/python3.8/site-packages/setuptools_scm/integration.py", line 48, in infer_version
[   10s]     dist.metadata.version = _get_version(config)
[   10s]   File "/usr/lib/python3.8/site-packages/setuptools_scm/__init__.py", line 148, in _get_version
[   10s]     parsed_version = _do_parse(config)
[   10s]   File "/usr/lib/python3.8/site-packages/setuptools_scm/__init__.py", line 110, in _do_parse
[   10s]     raise LookupError(
[   10s] LookupError: setuptools-scm was unable to detect version for '/home/abuild/rpmbuild/BUILD/scikit-hep-testdata-0.3.6'.
[   10s]
[   10s] Make sure you're either building from a fully intact git repository or PyPI tarballs. Most other sources (such as GitHub's tarballs, a git checkout without the .git folder) don't contain the necessary metadata and will not work.

However, PyPI doesn't have any of the source tarballs, only wheel files, so I am not sure how to work around this. Would be grateful for any help.

Spec file and sources here: https://build.opensuse.org/package/show/home:badshah400:branches:science/python-scikit-hep-testdata
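One workaround that setuptools-scm documents for builds outside a git checkout is the SETUPTOOLS_SCM_PRETEND_VERSION environment variable, which supplies the version directly. A sketch of preparing such an environment follows; the helper name scm_build_env is hypothetical, and whether this suits the openSUSE build setup is an assumption.

```python
import os


def scm_build_env(version, base_env=None):
    """Return a copy of the environment with the version pinned for setuptools-scm."""
    env = dict(os.environ if base_env is None else base_env)
    # setuptools-scm uses this value instead of inspecting git metadata.
    env["SETUPTOOLS_SCM_PRETEND_VERSION"] = version
    return env
```

The result can be passed to the build step, for example subprocess.run([sys.executable, "setup.py", "build"], env=scm_build_env("0.3.6"), check=True).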

Investigate options for large file storage

Storing large files in git repositories can be problematic. Since this repository is only really intended for testing set-ups it seems not unreasonable to do this but we might be able to work with other options as well.

  1. Use Git LFS. We need to understand whether this works once the package is installed via pip, away from git.
  2. Use existing open-access data solutions. CERN's Open Data platform is very nice, but it's not straightforward to put new files into it.
  3. Use S3, which Jim Pivarski has suggested as a cheaper alternative to Dropbox.

Views on adding test files that are not explicitly needed for Scikit-HEP testing

Given ssl-hep/ServiceX#486 (review) it would be useful for testing purposes of IRIS-HEP's ServiceX to have https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/nanoaod15.root available in scikit-hep-testdata.

$ curl -sLO https://xrootd-local.unl.edu:1094//store/user/AGC/nanoAOD/nanoaod15.root
$ file nanoaod15.root 
nanoaod15.root: ROOT file Version 61409 (Compression: 209)
$ ls -lhtra nanoaod15.root 
-rw-rw-r-- 1 feickert feickert 2.7M Nov  2 02:17 nanoaod15.root

This would be useful and so I'm inclined to go for it (I have branch feat/add-servicex-data with this in ready to PR) but we should probably discuss if we are okay hosting any test files or if we only want to accept responsibility for hosting Scikit-HEP project test files long term. This is also a > 1MB file, so that's also something to keep in mind for consideration. Though in general I could see benefit to having file formats like nanoAOD and xAOD in scikit-hep-testdata.

Thoughts?

(cc @eduardo-rodrigues @jpivarski @henryiii @alexander-held)

edit: @gordonwatts has pointed out that there's nothing particularly special about that file in particular, so a different file could be used.

Include files from uproot's github pages

To help uproot make full use of this package, we would need to incorporate the ROOT files from its github pages as well.

@jpivarski 's comment via email:

I'll need to move some new files into scikit-hep-testdata to fully move over. These will come from the gh-pages branch of uproot:

https://github.com/scikit-hep/uproot/tree/gh-pages/examples

which maps names like Event.root (60 MB! can't fit in PyPI repo!) to web URLs: http://scikit-hep.org/uproot/examples/Event.root, etc.

For files that can't fit into pypi, we probably need to resort to the discussion from #3.

second LHE testfile with easy physics

The current LHE test file has very complex event content, which is nice for testing parsing etc. but doesn't lend itself to quick tests of functionality, as in scikit-hep/pylhe#83.

It would be good to have a second test file from a simple Z->µµ process.

@matthewfeickert has one, but it's 9 MB, which might or might not be too large.

Source tarball is missing setup.py, skhep_testdata directory, etc

# wget https://files.pythonhosted.org/packages/7f/85/dc18ef24a92c8989d3884c95b0b3837556df1c7a9960c1cfa54189270c22/scikit_hep_testdata-0.4.2.tar.gz
# tar zxf scikit_hep_testdata-0.4.2.tar.gz 
# find scikit_hep_testdata-0.4.2
scikit_hep_testdata-0.4.2
scikit_hep_testdata-0.4.2/PKG-INFO
scikit_hep_testdata-0.4.2/egg-info
scikit_hep_testdata-0.4.2/setup.cfg

0.4.39: pypi source tar ball does not include data

As a consequence network-sandboxed tests fail even if scikit-hep-testdata is installed from pypi. I'll use the github sources for now, but wanted to let you know about it.

I guess this is intentional since the whole data would be too much to distribute over pypi?

Failure of scikit-hep-testdata when installed via `test_requires` in Travis

I'm migrating uproot to get all of its test data from this package (scikit-hep/uproot3#237). However, it fails like this:

____________________ ERROR collecting tests/test_jagged.py _____________________
tests/test_jagged.py:40: in <module>
    class Test(unittest.TestCase):
tests/test_jagged.py:42: in Test
    sample = uproot.open(skhep_testdata.data_path("uproot-sample-6.10.05-uncompressed.root"))["sample"]
build/bdist.linux-x86_64/egg/skhep_testdata/local_files.py:10: in data_path
    ???
build/bdist.linux-x86_64/egg/skhep_testdata/remote_files.py:95: in is_known_remote
    ???
build/bdist.linux-x86_64/egg/skhep_testdata/remote_files.py:53: in is_known
    ???
build/bdist.linux-x86_64/egg/skhep_testdata/remote_files.py:39: in load_remote_configs
    ???
E   IOError: [Errno 20] Not a directory: '/home/travis/build/scikit-hep/uproot/.eggs/scikit_hep_testdata-0.1.1-py2.7.egg/skhep_testdata/remote_datasets.yml'

This doesn't happen in my local tests, only on Travis, so far. (My local tests use scikit_hep_testdata installed via pip; Travis would be getting it from the test_requires in the setup.py.)
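The path in the error points inside a .egg file, which suggests (an assumption; the issue does not confirm it) that the package was installed as a zipped egg, where remote_datasets.yml is an archive member rather than a file on disk. A sketch of reading package data through the import system instead of opening a filesystem path, demonstrated on a standard-library package so it runs anywhere; read_package_text is a hypothetical helper name.

```python
import pkgutil


def read_package_text(package, resource):
    """Return a data file shipped inside `package` as text.

    pkgutil.get_data asks the package's loader for the bytes, so the
    resource does not have to exist as a regular file on disk.
    """
    data = pkgutil.get_data(package, resource)
    if data is None:
        # The package could not be located or its loader cannot supply data.
        raise FileNotFoundError(f"cannot load {resource!r} from package {package!r}")
    return data.decode("utf-8")


# Demonstration with a standard-library package.
source = read_package_text("json", "__init__.py")
```

Applied to this package, the call would be read_package_text("skhep_testdata", "remote_datasets.yml"), using the file name shown in the traceback.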
