tables_io's Introduction

tables_io

Input/output and conversion interfaces for tabular data formats used in DESC analysis pipelines.

People

License, Contributing etc

The code in this repo is available for re-use under the MIT license.

See the documentation on Read the Docs.

tables_io's People

Contributors

eacharles, delucchi-cmu, joselotl, olivialynn, gschwend, heather999, drewoldag, sschmidt23, sidneymau, beckermr, joezuntz

Stargazers

Glauber Costa Vila-Verde

Watchers

Johann Cohen-Tanugi, James Cloos, Seth Digel, Alex Malz, Tony Johnson, Katrin Heitmann, Zi'ang Yan (颜子昂), Jaime RZ

Forkers

hdante

tables_io's Issues

Problem with lazyImport on MacOS if module is not already imported

This is low priority as there is an easy workaround, but I've seen a strange issue with the lazyImport mechanism on macOS that seems to be triggered when h5py is compiled against MPI.

When I try to import tables_io I get this:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/jzuntz/src/tmp/conda-for-tables-io-test/lib/python3.9/site-packages/tables_io/__init__.py", line 8, in <module>
    from .lazy_modules import *
  File "/Users/jzuntz/src/tmp/conda-for-tables-io-test/lib/python3.9/site-packages/tables_io/lazy_modules.py", line 58, in <module>
    h5py = lazyImport('h5py')
  File "/Users/jzuntz/src/tmp/conda-for-tables-io-test/lib/python3.9/site-packages/tables_io/lazy_modules.py", line 49, in lazyImport
    _ = dir(module)
  File "/Users/jzuntz/src/tmp/conda-for-tables-io-test/lib/python3.9/importlib/util.py", line 245, in __getattribute__
    self.__spec__.loader.exec_module(self)
  File "/Users/jzuntz/src/tmp/conda-for-tables-io-test/lib/python3.9/site-packages/h5py/__init__.py", line 25, in <module>
    from . import _errors
  File "/Users/jzuntz/src/tmp/conda-for-tables-io-test/lib/python3.9/site-packages/h5py/__init__.py", line 45, in <module>
    from ._conv import register_converters as _register_converters, \
  File "h5py/_conv.pyx", line 1, in init h5py._conv
  File "h5py/h5t.pyx", line 146, in init h5py.h5t
  File "h5py/h5t.pyx", line 80, in h5py.h5t.lockid
  File "h5py/h5t.pyx", line 49, in h5py.h5t.typewrap
  File "h5py/defs.pyx", line 3658, in h5py.defs.H5Tget_class
RuntimeError: Unspecified error in H5Tget_class (return value <0)

If I import h5py first and then tables_io, it works fine. This didn't happen on Linux for me, just macOS (only tested with conda and on Intel). I've attached a script that recreates the issue (renamed to a .txt file to enable uploading it here).

tables_io_error.sh.txt
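For reference, the workaround mentioned above is just to import h5py eagerly before importing tables_io:

# work around the lazy-import failure by importing h5py first, so its
# MPI-linked initialization runs outside the lazyImport machinery
import h5py  # noqa: F401
import tables_io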

Use `match`

There is a lot of code in tables_io that uses a series of if statements before falling back to an error. This should probably be a match statement now that those are supported in Python (3.10+).

For example, what is currently implemented as

tType = tableType(obj)
if tType == AP_TABLE:
    return obj
if tType == NUMPY_DICT:
    return apTable.Table(obj)
if tType == NUMPY_RECARRAY:
    return apTable.Table(obj)
if tType == PD_DATAFRAME:
    # try this: apTable.from_pandas(obj)
    return dataFrameToApTable(obj)
raise TypeError(f"Unsupported TableType {tType}")  # pragma: no cover

could be rewritten as

tType = tableType(obj)
match tType:
    case AP_TABLE:
        return obj
    case NUMPY_DICT:
        return apTable.Table(obj)
    case NUMPY_RECARRAY:
        return apTable.Table(obj)
    case PD_DATAFRAME:
        # try this: apTable.from_pandas(obj)
        return dataFrameToApTable(obj)
    case _:
        raise TypeError(f"Unsupported TableType {tType}")  # pragma: no cover

I think this would be a bit clearer and more robust to any possible conflicts with the type enum, etc. One caveat: a bare name in a case pattern is a capture pattern that matches anything, so the table-type constants would need to be referenced as dotted names (e.g. members of an enum) for the comparisons to behave as intended.

Set up conditional dependency on pandas for py 3.7

I have a simple PR open on qp (LSSTDESC/qp#73) that just removes astropy from the list of dependencies, since it is truly no longer a dependency. Unfortunately, the tests against Python 3.7 fail because the qp setup installs tables_io, which installs pandas.
Recent pandas releases since 1.4.x have dropped support for Python 3.7.
While I have wondered aloud about dropping support for 3.7, Python 3.7 is still alive for now and clearly still supported on conda-forge, so I'd like to suggest modifying the requirements and setup.py in this repo to conditionally restrict the version of pandas when we're dealing with Python 3.7; otherwise there should be no restrictions.
I think I can muster a PR to make that happen.
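For illustration, a minimal sketch of what the conditional requirement might look like, using PEP 508 environment markers; the exact bound (pandas 1.4 being the first release to drop 3.7) follows the note above, and the surrounding setup.py is not reproduced:

# sketch only: conditional pandas requirement via PEP 508 environment markers
install_requires = [
    'pandas<1.4; python_version < "3.8"',   # cap pandas on Python 3.7
    'pandas; python_version >= "3.8"',      # no restriction otherwise
]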

Coverage report not being uploaded to Codecov

Bug report
I was reviewing the pyarrow PR and noticed that the usual codecov summary was missing. Looking at the Unit Tests and Coverage workflow, I see this error under the "upload coverage report to codecov" step:

[2024-05-23T13:35:49.321Z] ['error'] There was an error running the uploader: Error uploading to https://codecov.io: Error: There was an error fetching the storage URL during POST: 429 - {'detail': ErrorDetail(string='Rate limit reached. Please upload with the Codecov repository upload token to resolve issue. Expected time to availability: 696s.', code='throttled')}

I also notice that this same error is present in the last branch merged to main, so it is not unique to the pyarrow PR. I'm not sure why this is failing, but it is probably worth tracking down.

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.

Add metadata support within tables_io

Desirable features for user metadata include

  • optional per-column metadata
    • probably as a collection of keyword-value pairs, value a simple scalar
  • optional per-table metadata
    • preferably allow more complex, hierarchical
  • use of native table- or file-format-specific metadata features wherever possible (see the sketch of those hooks after this list)
  • uniform API insofar as possible across different table and file formats

Likely implementation consequences:

  • complex per-table metadata may have to be stored in an auxiliary file, say a yaml or json text file
  • a table base class or mix-in to keep API streamlined
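
For reference, the table formats tables_io already supports each expose native metadata hooks that a uniform API could wrap. The snippet below is only a sketch of those hooks, not a proposed tables_io interface; the keys and values are made up.

import h5py
import numpy as np
import pandas as pd
from astropy.table import Table

# astropy: per-table metadata (Table.meta can hold nested structures)
# and per-column key-value metadata (Column.meta)
t = Table({"flux": np.arange(3.0)})
t.meta["provenance"] = {"pipeline": "example", "version": 1}
t["flux"].meta["description"] = "example per-column metadata"

# pandas: per-table metadata via DataFrame.attrs (still marked experimental)
df = pd.DataFrame({"flux": np.arange(3.0)})
df.attrs["provenance"] = {"pipeline": "example", "version": 1}

# hdf5: attributes on files, groups, and datasets via h5py
with h5py.File("example_metadata.hdf5", "w") as f:
    d = f.create_dataset("flux", data=np.arange(3.0))
    f.attrs["pipeline"] = "example"                     # per-file / per-table metadata
    d.attrs["description"] = "per-column metadata"      # per-column metadata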

Extend supported input types in writeDictToHdf5

I've encountered a problem when running RAIL's Goldenspike: calls are being made to writeDictToHdf5 in ioUtils.py with types that are not the np.ndarray the function expects.

(The types I've personally run into are list and jaxlib.xla_extension.DeviceArray).

It would be nice to extend support for writing other kinds of dicts and dict-like types to HDF5 files.
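
One possible direction, sketched below under the assumption that the writer ultimately hands numpy arrays to h5py, is to coerce array-like values (lists, jax arrays, and anything else implementing __array__) to np.ndarray before writing; the helper name is made up:

import numpy as np

def _coerceToNdarray(val):
    """Illustrative helper: coerce lists, jax DeviceArrays, and other
    array-likes to np.ndarray before handing them to h5py."""
    if isinstance(val, np.ndarray):
        return val
    # np.asarray handles python lists and anything exposing __array__,
    # including jax arrays
    return np.asarray(val)

# hypothetical use inside writeDictToHdf5: coerce each column first
# data_dict = {key: _coerceToNdarray(val) for key, val in data_dict.items()}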

Crashing with astropy 5.1

Reading a FITS file with astropy 5.1 causes this crash:

/Users/echarles/miniconda3/envs/rail_test/lib/python3.8/site-packages/astropy/table/row.py(76)__eq__()
-> return self.as_void() == other

This is trivially fixed by adding a break statement in ioUtils.read():

        if defName in odict:
            odict = odict[defName]
            break

`getGroupInputDataLength` runtime error

Bug report
Running iterHdf5ToDict with a non-None groupname parameter results in a runtime error when it executes the getGroupInputDataLength(f) line. This is because f is defined to be the HDF5 group, not the file.

In arrayUtils.py line 88, we call hg.keys(), which is only valid for the HDF5 file.

The following code reproduces the runtime error. To run this you'll need an environment with h5py and tables_io installed.

import h5py
import json

from tables_io.ioUtils import iterHdf5ToDict

GROUPNAME = 'example_group'

# write an example hdf5 file using variable length strings
with h5py.File('bug_example.hdf5', 'w') as file:
    dicts = [
        {'a':0, 'b':[[8,3]]},
        {'a':1, 'b':[[1,2,3]]},
        {'a':2, 'b':[[1,2,3,4,5],[7,8,9]]},
    ]

    # convert the dictionaries to json strings
    data = [json.dumps(this_dict) for this_dict in dicts]
    dt = h5py.special_dtype(vlen=str)

    # add the data to the hdf5 file
    dataset = file.create_dataset(GROUPNAME, data=data, dtype=dt)


SHOW_BUG = True
if SHOW_BUG:
    # get an iterator into the hdf5 file
    buggy_iter = iterHdf5ToDict(
        "bug_example.hdf5",
        chunk_size=1,
        groupname=GROUPNAME,
        rank=0,
        parallel_size=1)

    # use the iterator to read the lines, convert to dictionaries, and print
    for start, end, data in buggy_iter:
        dicts = [json.loads(this_line) for this_line in data]
        print(f"Start, end: {start, end}")
        print(dicts)

else:
    # get an iterator into the hdf5 file
    good_iter = iterHdf5ToDict(
        "bug_example.hdf5",
        chunk_size=1,
        groupname=None,
        rank=0,
        parallel_size=1)

    # use the iterator to read the lines, convert to dictionaries, and print
    for start, end, data in good_iter:
        dicts = [json.loads(this_line) for this_line in data[GROUPNAME]]
        print(f"Start, end: {start, end}")
        print(dicts)

Produces the following error:

Traceback (most recent call last):
  File "/home/drew/code/hdf5-test/bug_example.py", line 35, in <module>
    for start, end, data in buggy_iter:
  File "/home/drew/miniconda3/envs/hdf5/lib/python3.9/site-packages/tables_io/ioUtils.py", line 379, in iterHdf5ToDict
    num_rows = getGroupInputDataLength(f)
  File "/home/drew/miniconda3/envs/hdf5/lib/python3.9/site-packages/tables_io/arrayUtils.py", line 88, in getGroupInputDataLength
    firstkey = list(hg.keys())[0]
AttributeError: 'Dataset' object has no attribute 'keys'

I believe this can be addressed with the following code in arrayUtils.py:

def getGroupInputDataLength(hg):
    if isinstance(hg, h5py.File):
        return _getHdf5FileLength(hg)
    elif isinstance(hg, h5py.Group):
        return _getHdf5GroupLength(hg)
    # fall through: hg is an h5py.Dataset (the case hit in the traceback above),
    # so its length is just the number of rows
    return len(hg)

def _getHdf5FileLength(hg):
    # the file holds one dataset per column; all columns must share a length
    firstkey = list(hg.keys())[0]
    nrows = len(hg[firstkey])
    firstname = hg[firstkey].name
    for value in hg.values():
        if len(value) != nrows:
            raise ValueError(
                f"Group does not represent a table. Length ({len(value)}) "
                f"of column {value.name} does not match length ({nrows}) of "
                f"first column {firstname}"
            )
    return nrows

def _getHdf5GroupLength(hg):
    return len(hg)

Have tables_io.read and readNative allow you to only read some tables in a file

Feature request

Before submitting
Please check the following:

  • I have described the purpose of the suggested change, specifying what I need the enhancement to accomplish, i.e. what problem it solves.
  • I have included any relevant links, screenshots, environment information, and data relevant to implementing the requested feature, as well as pseudocode for how I want to access the new functionality.
  • If I have ideas for how the new feature could be implemented, I have provided explanations and/or pseudocode and/or task lists for the steps.

support numpy.recarray as a table format

In terms of using numpy.recarray, a few questions:

  1. how would you read/write them by default (and to what type of file; you said FITS in the meeting, do you think that should be the default)?
  2. what other file types should be available?
  3. how would you convert a recarray to/from each of the other supported data types?
  4. do you imagine supporting parallel write / parallel iterated reading for them?

All in all I'm thinking that we would need to add a couple of read/write functions (one for each supported format), six conversion functions (one to/from each additional data type), and then the switches to send things to the correct function. This should all be easily doable.

Some points from Eli:
It's nice to go straight to numpy, but you can always use the conversions from astropy/pandas. I usually use FITS, but am trying to add parquet to my repertoire. I think of hdf5 as the xml of binary data formats.
The problem going from pandas to numpy is that numpy doesn't have nulls, so not all pandas dataframes can go to numpy, but you can go the other way.
And of course pandas doesn't like those multi-dimensional columns.
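
As a rough sketch of the conversions discussed above, the round trips between recarrays and the other supported table types can lean entirely on public numpy/astropy/pandas calls (the column names here are made up):

import numpy as np
import pandas as pd
from astropy.table import Table

# a small record array
rec = np.rec.fromrecords([(1, 2.5), (2, 3.5)], names=["id", "flux"])

# recarray <-> astropy Table
ap_table = Table(rec)
rec_from_ap = ap_table.as_array()      # structured ndarray; view as recarray if needed

# recarray <-> pandas DataFrame
# (pandas -> numpy loses null information, as noted above)
df = pd.DataFrame.from_records(rec)
rec_from_df = df.to_records(index=False)

# recarray <-> dict of numpy arrays
numpy_dict = {name: rec[name] for name in rec.dtype.names}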

Add CSV to supported formats

Hey, @glaubervila and I would like to contribute by adding CSV support to the read and write functions.
Although CSV is not optimal for big data, it is still a convenient option to handle lightweight ancillary files, such as pz training sets.
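
A minimal sketch of what the CSV hooks might look like if pandas is used as the backend; the function names below are illustrative, not the final tables_io API:

import pandas as pd

def readCsvToDataFrame(filepath, **kwargs):
    """Illustrative helper: read a CSV file into a pandas DataFrame."""
    return pd.read_csv(filepath, **kwargs)

def writeDataFrameToCsv(df, filepath, **kwargs):
    """Illustrative helper: write a pandas DataFrame to a CSV file."""
    df.to_csv(filepath, index=False, **kwargs)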
