tables_io's Introduction

tables_io

Input/output and conversion interfaces for tabular data formats used in DESC analysis pipelines.

People

License, Contributing etc

The code in this repo is available for re-use under the MIT license.

See the documentation on Read the Docs.

tables_io's People

Contributors

eacharles, delucchi-cmu, joselotl, olivialynn, gschwend, heather999, drewoldag, sschmidt23, sidneymau, beckermr, joezuntz

Stargazers

Glauber Costa Vila-Verde

Watchers

Johann Cohen-Tanugi, James Cloos, Seth Digel, Alex Malz, Tony Johnson, Katrin Heitmann, Zi'ang Yan (颜子昂), Jaime RZ

Forkers

hdante

tables_io's Issues

Problem with lazyImport on MacOS if module is not already imported

This is low priority as there is an easy workaround, but I've seen a strange issue with the lazyImport mechanism on macOS that seems to be triggered when h5py is compiled against MPI.

When I try to import tables_io I get this:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/jzuntz/src/tmp/conda-for-tables-io-test/lib/python3.9/site-packages/tables_io/__init__.py", line 8, in <module>
    from .lazy_modules import *
  File "/Users/jzuntz/src/tmp/conda-for-tables-io-test/lib/python3.9/site-packages/tables_io/lazy_modules.py", line 58, in <module>
    h5py = lazyImport('h5py')
  File "/Users/jzuntz/src/tmp/conda-for-tables-io-test/lib/python3.9/site-packages/tables_io/lazy_modules.py", line 49, in lazyImport
    _ = dir(module)
  File "/Users/jzuntz/src/tmp/conda-for-tables-io-test/lib/python3.9/importlib/util.py", line 245, in __getattribute__
    self.__spec__.loader.exec_module(self)
  File "/Users/jzuntz/src/tmp/conda-for-tables-io-test/lib/python3.9/site-packages/h5py/__init__.py", line 25, in <module>
    from . import _errors
  File "/Users/jzuntz/src/tmp/conda-for-tables-io-test/lib/python3.9/site-packages/h5py/__init__.py", line 45, in <module>
    from ._conv import register_converters as _register_converters, \
  File "h5py/_conv.pyx", line 1, in init h5py._conv
  File "h5py/h5t.pyx", line 146, in init h5py.h5t
  File "h5py/h5t.pyx", line 80, in h5py.h5t.lockid
  File "h5py/h5t.pyx", line 49, in h5py.h5t.typewrap
  File "h5py/defs.pyx", line 3658, in h5py.defs.H5Tget_class
RuntimeError: Unspecified error in H5Tget_class (return value <0)

If I import h5py first and then tables_io, it works fine. This didn't happen on Linux for me, just macOS (only tested with conda and on Intel). I've attached a script that recreates the issue (renamed to a .txt file to enable uploading it here).

tables_io_error.sh.txt
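For reference, the workaround mentioned above is just to import h5py eagerly before importing tables_io:

# work around the lazy-import failure by importing h5py first, so its
# MPI-linked initialization runs outside the lazyImport machinery
import h5py  # noqa: F401
import tables_io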

Use `match`

There is a lot of code in tables_io that uses a series of if statements before falling back to an error. This should probably be a match statement now that those are supported in Python (3.10+).

For example, what is currently implemented as

tType = tableType(obj)
if tType == AP_TABLE:
    return obj
if tType == NUMPY_DICT:
    return apTable.Table(obj)
if tType == NUMPY_RECARRAY:
    return apTable.Table(obj)
if tType == PD_DATAFRAME:
    # try this: apTable.from_pandas(obj)
    return dataFrameToApTable(obj)
raise TypeError(f"Unsupported TableType {tType}")  # pragma: no cover

could be rewritten as

tType = tableType(obj)
match tType:
    case AP_TABLE:
        return obj
    case NUMPY_DICT:
        return apTable.Table(obj)
    case NUMPY_RECARRAY:
        return apTable.Table(obj)
    case PD_DATAFRAME:
        # try this: apTable.from_pandas(obj)
        return dataFrameToApTable(obj)
    case _:
        raise TypeError(f"Unsupported TableType {tType}")  # pragma: no cover

I think this would be a bit clearer and more robust to any possible conflicts with the type enum, etc. One caveat: a bare name in a case pattern is a capture pattern that matches anything, so the table-type constants would need to be referenced as dotted names (e.g. members of an enum) for the comparisons to behave as intended.

Set up conditional dependency on pandas for py 3.7

I have a simple PR open on qp (LSSTDESC/qp#73) that just removes astropy from the list of dependencies, since it is truly no longer a dependency. Unfortunately, the tests against Python 3.7 fail because the qp setup installs tables_io, which installs pandas.
Recent pandas releases since 1.4.x have dropped support for Python 3.7.
While I have wondered aloud about dropping support for 3.7, Python 3.7 is still alive for now and clearly still supported on conda-forge, so I'd like to suggest modifying the requirements and setup.py in this repo to conditionally restrict the version of pandas when we're dealing with Python 3.7; otherwise there should be no restrictions.
I think I can muster a PR to make that happen.
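For illustration, a minimal sketch of what the conditional requirement might look like, using PEP 508 environment markers; the exact bound (pandas 1.4 being the first release to drop 3.7) follows the note above, and the surrounding setup.py is not reproduced:

# sketch only: conditional pandas requirement via PEP 508 environment markers
install_requires = [
    'pandas<1.4; python_version < "3.8"',   # cap pandas on Python 3.7
    'pandas; python_version >= "3.8"',      # no restriction otherwise
]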

Coverage report not being uploaded to Codecov

Bug report
I was reviewing the pyarrow PR and noticed that the usual codecov summary was missing. Looking at the Unit Tests and Coverage workflow, I see this error under the "upload coverage report to codecov" step:

[2024-05-23T13:35:49.321Z] ['error'] There was an error running the uploader: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 429 - {'detail': ErrorDetail(string='Rate limit reached. Please upload with the Codecov repository upload token to resolve issue. Expected time to availability: 696s.', code='throttled')}

I also notice that this same error is present in the last branch merged to main, so it is not unique to the pyarrow PR. I'm not sure why this is failing, but it is probably worth tracking down.

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.

Add metadata support within tables_io

Desirable features for user metadata include

  • optional per-column metadata
    • probably as a collection of keyword-value pairs, value a simple scalar
  • optional per-table metadata
    • preferably allow more complex, hierarchical
  • use of native table- or file-format-specific metadata features wherever possible (see the sketch of those hooks after this list)
  • uniform API insofar as possible across different table and file formats

Likely implementation consequences:

  • complex per-table metadata may have to be stored in an auxiliary file, say a yaml or json text file
  • a table base class or mix-in to keep API streamlined
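
For reference, the table formats tables_io already supports each expose native metadata hooks that a uniform API could wrap. The snippet below is only a sketch of those hooks, not a proposed tables_io interface; the keys and values are made up.

import h5py
import numpy as np
import pandas as pd
from astropy.table import Table

# astropy: per-table metadata (Table.meta can hold nested structures)
# and per-column key-value metadata (Column.meta)
t = Table({"flux": np.arange(3.0)})
t.meta["provenance"] = {"pipeline": "example", "version": 1}
t["flux"].meta["description"] = "example per-column metadata"

# pandas: per-table metadata via DataFrame.attrs (still marked experimental)
df = pd.DataFrame({"flux": np.arange(3.0)})
df.attrs["provenance"] = {"pipeline": "example", "version": 1}

# hdf5: attributes on files, groups, and datasets via h5py
with h5py.File("example_metadata.hdf5", "w") as f:
    d = f.create_dataset("flux", data=np.arange(3.0))
    f.attrs["pipeline"] = "example"                     # per-file / per-table metadata
    d.attrs["description"] = "per-column metadata"      # per-column metadata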

Extend supported input types in writeDictToHdf5

I've encountered a problem when running RAIL's Goldenspike: calls are being made to writeDictToHdf5 in ioUtils.py with types that are not the np.ndarray the function expects.

(The types I've personally run into are list and jaxlib.xla_extension.DeviceArray).

It would be nice to extend support for writing other kinds of dicts and dict-like types to HDF5 files.
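
One possible direction, sketched below under the assumption that the writer ultimately hands numpy arrays to h5py, is to coerce array-like values (lists, jax arrays, and anything else implementing __array__) to np.ndarray before writing; the helper name is made up:

import numpy as np

def _coerceToNdarray(val):
    """Illustrative helper: coerce lists, jax DeviceArrays, and other
    array-likes to np.ndarray before handing them to h5py."""
    if isinstance(val, np.ndarray):
        return val
    # np.asarray handles python lists and anything exposing __array__,
    # including jax arrays
    return np.asarray(val)

# hypothetical use inside writeDictToHdf5: coerce each column first
# data_dict = {key: _coerceToNdarray(val) for key, val in data_dict.items()}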

Crashing with astropy 5.1

Reading a FITS file with astropy 5.1 causes this crash:

/Users/echarles/miniconda3/envs/rail_test/lib/python3.8/site-packages/astropy/table/row.py(76)__eq__()
-> return self.as_void() == other

This is trivially fixed by adding a break statement in ioUtils.read():

        if defName in odict:
            odict = odict[defName]
            break

`getGroupInputDataLength` runtime error

Bug report
Running iterHdf5ToDict with a non-None groupname parameter results in a runtime error when it executes the getGroupInputDataLength(f) line. This is because f is defined to be the HDF5 group, not the file.

In arrayUtils.py line 88, we call hg.keys(), which is only valid for the HDF5 file.

The following code reproduces the runtime error. To run this you'll need an environment with h5py and tables_io installed.

import h5py
import json

from tables_io.ioUtils import iterHdf5ToDict

GROUPNAME = 'example_group'

# write an example hdf5 file using variable length strings
with h5py.File('bug_example.hdf5', 'w') as file:
    dicts = [
        {'a':0, 'b':[[8,3]]},
        {'a':1, 'b':[[1,2,3]]},
        {'a':2, 'b':[[1,2,3,4,5],[7,8,9]]},
    ]

    # convert the dictionaries to json strings
    data = [json.dumps(this_dict) for this_dict in dicts]
    dt = h5py.special_dtype(vlen=str)

    # add the data to the hdf5 file
    dataset = file.create_dataset(GROUPNAME, data=data, dtype=dt)


SHOW_BUG = True
if SHOW_BUG:
    # get an iterator into the hdf5 file
    buggy_iter = iterHdf5ToDict(
        "bug_example.hdf5",
        chunk_size=1,
        groupname=GROUPNAME,
        rank=0,
        parallel_size=1)

    # use the iterator to read the lines, convert to dictionaries, and print
    for start, end, data in buggy_iter:
        dicts = [json.loads(this_line) for this_line in data]
        print(f"Start, end: {start, end}")
        print(dicts)

else:
    # get an iterator into the hdf5 file
    good_iter = iterHdf5ToDict(
        "bug_example.hdf5",
        chunk_size=1,
        groupname=None,
        rank=0,
        parallel_size=1)

    # use the iterator to read the lines, convert to dictionaries, and print
    for start, end, data in good_iter:
        dicts = [json.loads(this_line) for this_line in data[GROUPNAME]]
        print(f"Start, end: {start, end}")
        print(dicts)

Produces the following error:

Traceback (most recent call last):
  File "/home/drew/code/hdf5-test/bug_example.py", line 35, in <module>
    for start, end, data in buggy_iter:
  File "/home/drew/miniconda3/envs/hdf5/lib/python3.9/site-packages/tables_io/ioUtils.py", line 379, in iterHdf5ToDict
    num_rows = getGroupInputDataLength(f)
  File "/home/drew/miniconda3/envs/hdf5/lib/python3.9/site-packages/tables_io/arrayUtils.py", line 88, in getGroupInputDataLength
    firstkey = list(hg.keys())[0]
AttributeError: 'Dataset' object has no attribute 'keys'

I believe this can be addressed with the following code in arrayUtils.py:

def getGroupInputDataLength(hg):
    if isinstance(hg, h5py.File):
        return _getHdf5FileLength(hg)
    elif isinstance(hg, h5py.Group):
        return _getHdf5GroupLength(hg)
    # fall through: hg is an h5py.Dataset (the case hit in the traceback above),
    # so its length is just the number of rows
    return len(hg)

def _getHdf5FileLength(hg):
    # the file holds one dataset per column; all columns must share a length
    firstkey = list(hg.keys())[0]
    nrows = len(hg[firstkey])
    firstname = hg[firstkey].name
    for value in hg.values():
        if len(value) != nrows:
            raise ValueError(
                f"Group does not represent a table. Length ({len(value)}) "
                f"of column {value.name} does not match length ({nrows}) of "
                f"first column {firstname}"
            )
    return nrows

def _getHdf5GroupLength(hg):
    return len(hg)

Have tables_io.read and readNative allow you to only read some tables in a file

Feature request

Before submitting
Please check the following:

  • I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
  • I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
  • If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.

support numpy.recarray as a table format

In terms of using numpy.recarray, a few questions:

  1. how would you read/write them by default (and to what type of file; you said FITS in the meeting, do you think that should be the default)?
  2. what other file types should be available?
  3. how would you convert a recarray to/from each of the other supported data types?
  4. do you imagine supporting parallel write / parallel iterated reading for them?

All in all I'm thinking that we would need to add a couple of read/write functions (one for each supported format), six conversion functions (one to/from each additional data type), and then the switches to send things to the correct function. This should all be easily doable.

Some points from Eli:
It's nice to go straight to numpy, but you can always use the conversions from astropy/pandas. I usually use FITS, but am trying to add parquet to my repertoire. I think of hdf5 as the xml of binary data formats.
The problem going from pandas to numpy is that numpy doesn't have nulls, so not all pandas dataframes can go to numpy, but you can go the other way.
And of course pandas doesn't like those multi-dimensional columns.
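
As a rough sketch of the conversions discussed above, the round trips between recarrays and the other supported table types can lean entirely on public numpy/astropy/pandas calls (the column names here are made up):

import numpy as np
import pandas as pd
from astropy.table import Table

# a small record array
rec = np.rec.fromrecords([(1, 2.5), (2, 3.5)], names=["id", "flux"])

# recarray <-> astropy Table
ap_table = Table(rec)
rec_from_ap = ap_table.as_array()      # structured ndarray; view as recarray if needed

# recarray <-> pandas DataFrame
# (pandas -> numpy loses null information, as noted above)
df = pd.DataFrame.from_records(rec)
rec_from_df = df.to_records(index=False)

# recarray <-> dict of numpy arrays
numpy_dict = {name: rec[name] for name in rec.dtype.names}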

Add CSV to supported formats

Hey, @glaubervila and I would like to contribute by adding CSV support to the read and write functions.
Although CSV is not optimal for big data, it is still a convenient option to handle lightweight ancillary files, such as pz training sets.
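
A minimal sketch of what the CSV hooks might look like if pandas is used as the backend; the function names below are illustrative, not the final tables_io API:

import pandas as pd

def readCsvToDataFrame(filepath, **kwargs):
    """Illustrative helper: read a CSV file into a pandas DataFrame."""
    return pd.read_csv(filepath, **kwargs)

def writeDataFrameToCsv(df, filepath, **kwargs):
    """Illustrative helper: write a pandas DataFrame to a CSV file."""
    df.to_csv(filepath, index=False, **kwargs)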
