pytables / pytables Goto Github PK

View Code? Open in Web Editor NEW

1.3K 62.0 270.0 39.67 MB

A Python package to manage extremely large amounts of data

Home Page: http://www.pytables.org

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.06% Python 84.49% Shell 0.48% Gnuplot 0.06% C 6.01% CMake 0.08% Cython 7.80% Jupyter Notebook 1.01%

pytables's Introduction

PyTables: hierarchical datasets in Python

Join the chat at https://gitter.im/PyTables/PyTables

URL:	http://www.pytables.org/

PyTables is a package for managing hierarchical datasets, designed to efficiently cope with extremely large amounts of data.

It is built on top of the HDF5 library and the NumPy package. It features an object-oriented interface that, combined with C extensions for the performance-critical parts of the code (generated using Cython), makes it a fast, yet extremely easy to use tool for interactively saving and retrieving very large amounts of data. One important feature of PyTables is that it optimizes memory and disk resources so that they take much less space (between 3 to 5 times and more if the data is compressible) than other solutions, like for example, relational or object-oriented databases.

State-of-the-art compression

PyTables supports the Blosc compressor out of the box. This allows for extremely high compression speed, while keeping decent compression ratios. By doing so, I/O can be accelerated by a large extent, and you may end up achieving higher performance than the bandwidth provided by your I/O subsystem. See the Tuning The Chunksize section of the Optimization Tips chapter of the user documentation for some benchmarks.

Not a RDBMS replacement

PyTables is not designed to work as a relational database replacement, but rather as a teammate. If you want to work with large datasets of multidimensional data (for example, for multidimensional analysis), or just provide a categorized structure for some portions of your cluttered RDBS, then give PyTables a try. It works well for storing data from data acquisition systems, simulation software, network data monitoring systems (for example, traffic measurements of IP packets on routers), or as a centralized repository for system logs, to name only a few possible use cases.

Tables

A table is defined as a collection of records whose values are stored in fixed-length fields. All records have the same structure, and all values in each field have the same data type. The terms "fixed-length" and strict "data types" seem to be a strange requirement for an interpreted language like Python, but they serve a useful function if the goal is to save very large quantities of data (such as generated by many scientific applications, for example) in an efficient manner that reduces demand on CPU time and I/O.

Arrays

There are other useful objects like arrays, enlargeable arrays, or variable-length arrays that can cope with different use cases on your project.

Easy to use

One of the principal objectives of PyTables is to be user-friendly. In addition, many different iterators have been implemented to make interactive work as productive as possible.

Platforms

We use Linux on top of Intel32 and Intel64 boxes as the main development platforms, but PyTables should be easy to compile/install on other UNIX (including macOS) or Windows machines.

Compiling

To compile PyTables, you will need a recent version of the HDF5 (C flavor) library, the Zlib compression library, and the NumPy and Numexpr packages. Besides, PyTables comes with support for the Blosc, LZO, and bzip2 compressor libraries. Blosc is mandatory, but PyTables comes with Blosc sources so, although it is recommended to have Blosc installed in your system, you don't absolutely need to install it separately. LZO and bzip2 compression libraries are, however, optional.

Make sure you have HDF5 version 1.10.5 or above. On Debian-based Linux distributions, you can install it with:

$ sudo apt install libhdf5-serial-dev

Installation

Install with pip <https://pip.pypa.io/en/stable/>:
```
$ python3 -m pip install tables
```
To run the test suite:
```
$ python3 -m tables.tests.test_all
```
If there is some test that does not pass, please send us the complete output using the [GitHub Issue Tracker](https://github.com/PyTables/PyTables/issues/new).

Enjoy data! -- The PyTables Team

pytables's People

Contributors

Stargazers

Watchers

Forkers

wesm avalentino ilustreous smehra joshmoore scopatz pulseenergy 87 mistobaan andreabedini tnorth continuumio joshayers dapid mwiebe umayer eisenkdr invinciblejha michalslonina dmaniloff takluyver jenshnielsen wilsaj r0k3 montefra asford endyson sjagoe jreback drakerlabs ndawe andreas-h streeto timburgess skoko dhinds crs4 mrgloom francescalted buybackoff bbudescu gidden gdementen mdxs r3m0t b-rich funkotron alimuldal bmagill1250 lbergelson goller ray2020 stevesimmons jack-pappas keszybz damonhatchett lebigot deamk migdard geofizzydrink tkofol ali-hallaji c-wilson ericr86 ospreyx mythsmith better629 thomasp6t dotsdl rdhyee kaukrise joonro jennolsen84 rabernat 0x0l dashesy rthouvenin imanojkumar grahamrjuk detrout tomkooij you13 quintusdias esss rohitjamuar flylongsky rlugojr tacaswell xlong88 cython-testbed wpi-ds minhpascal javdejong khaledto chrisburr yarikoptic underxirox shanyuhu ppc64 jaydenwhyte

pytables's Issues

ptrepack should be able to create a file consistent with 1.6.x

From PyTables 2.2 on, users can create files with HDF5 1.8.x that are not compatible with HDF5 1.6.x apps (included PyTables with 1.6.x). External links are an example of this.

It would be nice to add a flag to ptrepack so that it can remove 1.8.x objects from files during the repackaging process. That way, new files are ensured to be read from either HDF5 1.6 or 1.8 apps.

Blosc filter does not work with fletcher32

This script reproduces the issue:

import tables
import numpy

h5ft = tables.openFile('/tmp/test_earray.h5','w')

filters = tables.Filters(complevel = 1, complib = "blosc",
                         fletcher32 = True)
ea = h5ft.createEArray(h5ft.root, "foo", tables.Int16Col(),
(0,1024,1024), "earray test", filters =  filters, expectedrows =
1000000)

for ii in range(100):
    arr = numpy.random.randint(0,4096, size=(1024,1024))
    arr2 = numpy.asarray(arr, numpy.int16)[numpy.newaxis,:]
    ea.append(arr2)
    ea.flush()

h5ft.close()

and the error:

HDF5-DIAG: Error detected in HDF5 (1.8.5) thread 0:
  #000: H5Dio.c line 266 in H5Dwrite(): can't write data
    major: Dataset
    minor: Write failed
  #001: H5Dio.c line 578 in H5D_write(): can't write data
    major: Dataset
    minor: Write failed
  #002: H5Dchunk.c line 1862 in H5D_chunk_write(): unable to read raw data chunk
    major: Low-level I/O
    minor: Read failed
  #003: H5Dchunk.c line 2737 in H5D_chunk_lock(): data pipeline read failed
    major: Data filters
    minor: Filter operation failed
  #004: H5Z.c line 1116 in H5Z_pipeline(): filter returned failure during read
    major: Data filters
    minor: Read failed
  #005: blosc/blosc_filter.c line 232 in blosc_filter(): Blosc decompression error
    major: Data filters
    minor: Callback failed
Traceback (most recent call last):
  File "blosc.py", line 15, in <module>
    ea.append(arr2)
  File "/home/faltet/vepython/lib/python2.6/site-packages/tables/earray.py", line 226, in append
    self._append(nparr)
  File "hdf5Extension.pyx", line 1027, in tables.hdf5Extension.Array._append (tables/hdf5Extension.c:8984)
tables.exceptions.HDF5ExtError: Problems appending the elements
Closing remaining open files: /tmp/test_earray.h5... done

iterrows start/stop behaves in a surprising way

Normally, an iterator that gets a start, but no stop, argument, should iterate starting at start and continuing as long as possible. For example, that is the behaviour of Pythons itertools.islice. Pytables violates this and takes a default of stop=start+1

In [10]: t = tables.openFile('tutorial1.h5')

In [11]: it = t.root.detector.readout.iterrows(start=5)

In [12]: it.next()
Out[12]: /detector/readout.row (Row), pointing to row #5

In [13]: it.next()
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)

/storage4/home/gerrit/checkouts/pytables-2.2/examples/<ipython console> in <module>()

/storage4/home/gerrit/.local/lib/python2.6/site-packages/tables/tableExtension.so in tables.tableExtension.Row.__next__ (tables/tableExtension.c:7462)()

/storage4/home/gerrit/.local/lib/python2.6/site-packages/tables/tableExtension.so in tables.tableExtension.Row.__next__general (tables/tableExtension.c:8492)()

/storage4/home/gerrit/.local/lib/python2.6/site-packages/tables/tableExtension.so in tables.tableExtension.Row._finish_riterator (tables/tableExtension.c:8641)()

StopIteration:

To get the intended behaviour, one needs to write:

In [17]: it = t.root.detector.readout.iterrows(start=5, stop=t.root.detector.readout.nrows)

Stop=-1 also doesn't work because slices are not inclusive, so that would skip the last element.

This can unfortunately not be changed without breaking backward-compatibility, so I would suggest not changing it before Pytables 3.

Add szip compression support

Here is the original SourceForge report from Jeff Whitaker:

It would be nice to have the ability to create szip
compressed files. In my experience, szip produces h5
files that are about 20% smaller than zlib, and it's
faster too.

(Imported from [https://sourceforge.net/tracker/index.php?func=detail&aid=1050793&group_id=63486&atid=504147 SourceForge #1050793].)

Schedule first governance meeting

Topics:

General governance plan (See http://jenkins-ci.org/node/280)
- Where to host governance discussions? pytables-users? convore?
- How frequently to meet & how? Calls? IRC?
- Discussion of existing milestones / issues

Convert trac tickets

Since the source code changes will be made on github and not in svn, it may make sense to move the tickets from trac (http://pytables.org/trac/report/1). Searching briefly there is at least one tool to do the conversion (https://github.com/adamcik/github-trac-ticket-import) which could be used.

Alternatively, there may be another issue tracking tool, integrated with github, that would be generally preferred. If so, please mention it below.

OldFlavorTestCase test failed

Ludwig Ohl reported this one:

>>> tables.test()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version:  2.1.2
HDF5 version:      1.8.4
NumPy version:     1.3.0
Zlib version:      1.2.3.3
LZO version:       2.03 (Apr 30 2008)
BZIP2 version:     1.0.5 (10-Dec-2007)
Python version:    2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
[GCC 4.4.3]
Platform:          linux2-x86_64
Byte-ordering:     little
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Performing only a light (yet comprehensive) subset of the test suite.
If you want a more complete test, try passing the --heavy flag to this 
script
(or set the 'heavy' parameter in case you are using tables.test() call).
The whole suite will take more than 2 minutes to complete on a relatively
modern CPU and around 80 MB of main memory.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Skipping Numeric test suite.
Skipping numarray test suite.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
[clip]
======================================================================
FAIL: None (tables.tests.test_basics.OldFlavorTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
   File "/usr/lib/python2.6/dist-packages/tables/tests/common.py", line 
248, in newmethod
     return oldmethod(self, *args, **kwargs)
   File "/usr/lib/python2.6/dist-packages/tables/tests/test_basics.py", 
line 2202, in test
     self.assert_(common.allequal(node_data, data, new_flavor))
AssertionError

----------------------------------------------------------------------
Ran 6522 tests in 39.019s

FAILED (failures=1)

Provided that this is a test for dealing with really old functionality, it would be better if this is removed for the suite.

Define out-of-beta requirements

A first real release of PyTables will entail moving from 2.3b1 to 2.3 (https://github.com/PyTables/PyTables/issues/milestones/2/edit). The necessary TO-DOs for that milestone need to be defined in conjunction with Francesc.

Cannot create empty table

I dynamically generate my tables based on the columns I want to write out. I thought I could avoid a special case for "no columns", but apparently, I cannot. Is this by design or is this a bug?

Python 2.6.6 (r266:84292, Sep 15 2010, 16:22:56) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tables
>>> class Empty(tables.IsDescription): pass
... 
>>> h5 = tables.openFile('/tmp/fubar.h5', 'w')
>>> h5.createTable(h5.root, "test", Empty)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/storage4/home/gerrit/.local/lib/python2.6/site-packages/tables/file.py", line 771, in createTable
    chunkshape=chunkshape, byteorder=byteorder)
  File "/storage4/home/gerrit/.local/lib/python2.6/site-packages/tables/table.py", line 537, in __init__
    self.description = Description(descr.columns)
  File "/storage4/home/gerrit/.local/lib/python2.6/site-packages/tables/description.py", line 494, in __init__
    self._g_setPathNames()
  File "/storage4/home/gerrit/.local/lib/python2.6/site-packages/tables/description.py", line 561, in _g_setPathNames
    head = cols[0]
IndexError: list index out of range

Secure against malicious pickled objects

!PyTables could be made a little safer against malicious pickled objects. One way to do it would be to avoid unpickling attributes not explicitly requested by the user. Some changes to the AttributeSet class would be needed:

Do not load all attributes on instance construction. At least, do not load those attributes which look like a pickled object.
When loading needed pickled system attributes (FIELD_N_FILL from 1.2 and FILTERS from 0.x and 1.x), use the mechanism described in http://docs.python.org/lib/pickle-sub.html to limit visible classes (namely to tables.filters.Filters).
When calling AttributeSet.__repr__(), show supposedly-pickled attributes as ATTRIBUTE_NAME := <pickled Python object> or similar.

Of course, explicit retrieval of pickled attributes would not be limited, so some additional machinery should go into AttributeSet.__getattr__() as well. If not loading any pickled object is too strong a limitation (only for representation purposes, I guess), a clever use of the previously-referenced mechanism could be used to limit visible classes and modules to a minimum set.

Run coverage tests on PyTables' unittests

It would be nice to run coverage tests on !PyTables' unittests, e.g. using [http://nedbatchelder.com/code/modules/coverage.html coverage.py]. This would provide unvaluable information about fragments of code not exercised by unit tests, so new unit test may be written and stale code detected.

Error while reading and writing file

I have a python process running a script which writes and flushes continuously data into a hdf5-file.

When I start another python process to read concurrently already flushed data I get sometimes following exception:

HDF5-DIAG: Error detected in HDF5 library version: 1.6.7-rc2 thread 0.  Back trace follows.
  #000: \hdf\hdf5-16\vnet\release-testing\std\src\H5Dio.c line 601 in H5Dread(): can't read data
    major(15): Dataset interface
    minor(24): Read failed
  #001: \hdf\hdf5-16\vnet\release-testing\std\src\H5Dio.c line 866 in H5D_read(): can't read data
    major(15): Dataset interface
    minor(24): Read failed
  #002: \hdf\hdf5-16\vnet\release-testing\std\src\H5Dio.c line 1752 in H5D_chunk_read(): optimized read failed
    major(15): Dataset interface
    minor(24): Read failed
  #003: \hdf\hdf5-16\vnet\release-testing\std\src\H5Dselect.c line 511 in H5D_select_read(): read error
    major(14): Dataspace interface
    minor(24): Read failed
  #004: \hdf\hdf5-16\vnet\release-testing\std\src\H5Distore.c line 2060 in H5D_istore_readvv(): unable to read raw data chunk
    major(05): Low-level I/O layer
    minor(24): Read failed
  #005: \hdf\hdf5-16\vnet\release-testing\std\src\H5Distore.c line 1564 in H5D_istore_lock(): data pipeline read failed
    major(19): Data filters layer
    minor(24): Read failed
  #006: \hdf\hdf5-16\vnet\release-testing\std\src\H5Z.c line 998 in H5Z_pipeline(): filter returned failure during read
    major(19): Data filters layer
    minor(24): Read failed
  #007: \hdf\hdf5-16\vnet\release-testing\std\src\H5Zdeflate.c line 114 in H5Z_filter_deflate(): inflate() failed
    major(19): Data filters layer
    minor(29): Unable to initialize object
Traceback (most recent call last):
  File "C:\data\src\erlang\analytics\trunk\src\python\partest.py", line 6, in <module>
    for rec in h5file.root.adimp:
  File "tableExtension.pyx", line 805, in tableExtension.Row.__next__
  File "tableExtension.pyx", line 955, in tableExtension.Row.__next__general
  File "tableExtension.pyx", line 550, in tableExtension.Table._read_records
tables.exceptions.HDF5ExtError: Problems reading records.
Closing remaining open files: c:\data\adimpressions.h5... done

I used standard pytbales binaries for windows:
http://www.pytables.org/download/stable/tables-2.0.3.win32-py2.5.exe

Version infos:
Pytables: 2.0.3
hdf5: 1.6.7
Numpy: 1.0.4
Python: 2.5.1

Support for latest file format

It seems that the latest HDF5 format can give much more performance in situations when there are many groups present in hierarchy. For some figures on the improvements that can be achieved, see http://www.hdfeos.net/workshops/ws12/agenda.php, "Migrating from HDF5 1.6 to 1.8".

This can be selectable by the user with a new parameter in the openFile(). Of course, the new format is backward incompatible, so some kind of incompatibility warning should be issued when using this.

Pro liberation

From: https://sourceforge.net/mailarchive/message.php?msg_id=27598267

To make the distinction between the old Pro and 'standard' versions,
the new liberated Pro version should be released without the "Pro"
suffix and with its own incremented version number. This is the
fastest way to eliminate confusion.

For background on the change in license, see Francesc'semail (https://sourceforge.net/mailarchive/message.php?msg_id=27597311)

Hi List,

Fortunately, now that it seems like there will be some opportunities for 
PyTables to be maintained, I'm happy to announce that, hereby, PyTables 
Pro drops its original, commercial license, and acquires a BSD license.  
The new ``LICENSE.txt`` in the root directory states this.

You can find the SVN sources in:

http://www.pytables.org/svn/pytables/PyTablesPro

and browse the sources via Trac too in:

http://pytables.org/trac/browser/PyTablesPro

@ people interested in the future maintenance, please feel free to use 
Pro as a possible base for the future PyTables (with no Pro suffix 
anymore).  Some caveats about doing this though:

- The version format for the Pro version is something like 'X.Y.Zpro', 
and such a 'pro' suffix is necessary in certain parts of the code in 
order to enable the Pro-specific features (indexing and caching, 
mainly).  It should be easy to get rid of this, but needs some 
modification of the code.

- The manual is the same in PyTables Pro than PyTables standard, but if 
the new maintainers choose to put Pro as the default, the notes of the 
style "... is only available in PyTables Pro." should be obviously 
removed.

- The copyright for the years 2002-2010 should be respected, and I have 
added a new entry for 2011 which is attributed to "PyTables 
maintainers", in honor to those braves that are undertaking the 
maintenance task of PyTables.  Please change this 'author' by something 
more meaningful if you feel like.

So, enjoy data with PyTables Pro (BSD-flavored :) !

-- 
Francesc Alted

Document release process

In order to know if all the changes made as a part of "Pro Liberation" (#1) were successful, it would be good to have a list of all the steps that Francesc takes/took for performing a new version release.

Which build commands are used?
Which tests are run?
Who has to be informed?

`createparents` flag should be True by default

This would allow for more easy operation in situations like:

f.createArray('/scan0/data', 'y', y, createparents=True)

segfault when accessing row-iterator

Pytables segfaults when accessing a column from a row-iterator before the first .next() is called. When this row-iterator is from a .where() search, at least 'print row' warns not to access it (but it still segfaults, which it shouldn't), but in this example, there is no such warning.

#!/usr/bin/python

import tables
tables.print_versions()
t = tables.openFile("tutorial1.h5")
r = t.root.detector.readout.iterrows(start=5, stop=6)
print "table", t
print "row", r
print "Still here, retrieving column 'name'"
print r["name"]

$ python pytables_crash.py 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version:  2.2.2.dev
HDF5 version:      1.8.4-patch1
NumPy version:     2.0.0.dev-12d0200
Numexpr version:   1.5.dev260 (not using Intel's VML/MKL)
Zlib version:      1.2.3.4 (in Python interpreter)
Blosc version:     1.1.3.dev (2010-11-10)
Cython version:    0.13
Python version:    2.6.6 (r266:84292, Sep 15 2010, 16:22:56) 
[GCC 4.4.5]
Platform:          linux2-x86_64
Byte-ordering:     little
Detected cores:    8
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
table tutorial1.h5 (File) 'Test file'
Last modif.: 'Fri Nov 26 14:15:34 2010'
Object Tree: 
/ (RootGroup) 'Test file'
/columns (Group) 'Pressure and Name'
/columns/name (Array(3,)) 'Name column selection'
/columns/pressure (Array(3,)) 'Pressure column selection'
/detector (Group) 'Detector information'
/detector/readout (Table(10,)) 'Readout example'

row /detector/readout.row (Row), pointing to row #4
Still here, retrieving column 'name'
Segmentation fault

This should not happen. It should raise some Exception instead. I have not tried this with any other versions.

Add an 'EXPERT_MODE' in parameters.py

Such an 'EXPERT_MODE' would prevent warnings (mainly of type PerformanceWarning) to appear. That could be very nice for people that knows what they are doing and do not want to be bothered about the way they are doing things.

Of course, the default for this would be False.

Table objects should get rid of FIELD_*

This is a task reminder for next minor release (2.3).

PyTables adheres to the HDF5's Table High Level API:

http://www.hdfgroup.org/HDF5/doc/HL/RM_H5TB.html

and tries to mimic its format as much as possible. However, one of the most inefficient "features" of such a format is the existence of a couple of attributes per field, i.e. FIELD_N_NAME and FIELD_N_FILL, where N is the number of column.

These attributes a really redundant, as this information is already present in the HDF5 type definition. However, when a table has a lot of columns, this can lead to really bad performance (see #304 for example).

Hence, my intention is to remove these attributes completely from PyTables in the next minor version (2.3).

Data appended to the row of a preempted table is discarded.

This bug may be related to #94.

When appending data to the row object of a Table instance which has already been killed (unreferenced) and preempted from the node cache, flushing the table does no longer dump the aforementioned data.

The reason is similar to that of #94, however losing the reference from the table to the row is unavoidable since the table object completely disappears and the object used to invoke the flush() is then a brand new one which has no way to reach the used row (its own row object would be a newly created one).

The attached patch provides an extension to the test added for #94 which triggers this bug.

Base chunkshape of VLArray objects on the number of rows, instead of the total size

When I first implemented the chunkshape size estimation for VLArrays, I based the computation on the fact that all the VLArray was going to be compressed. This turned out not to be the case because only the header for each row is compressed. So I need to update the logic of this calculation.

This will probably imply to deprecate the expectedsizeinMB parameter and add a new one called most probably expectedrows. An strategy for doing this should be devised in order to easy the transition.

Undesired 'break' effect in update table iterators

If a break is reached in the middle of a table iterator that is doing an update, some of the rows will not get updated even if you do a Table.flush() after the process.

The next script shows the problem:

from tables import *

class Record(IsDescription):
    col1 = BoolCol()
    col2 = IntCol()
    col3 = FloatCol()

# Create a table and fill it with some values
f = openFile("break-effect.h5", "w")
t = f.createTable(f.root, "table", Record)
for i in xrange(10):
    t.row['col2'] = i
    t.row.append()
t.flush()

# Do an update iterator and break it
for row in t:
    row['col2'] = 1
    row.update()
    break  # This makes the Row._finish_riterator() to not be called!
t.flush()

col2 = t.cols.col2[:].tolist()
print "Col2:", col2
if col2 == range(10):
    print "ERROR: Incorrect update!"

f.close()

The solution of this could be rather complicated, because all the info about unsaved updated rows lives in the iterator, but after receiving the break, this will disappear!. It seems to me that the only solution would be to put such an info in the Table container instead of the Row iterator, and this is a major change :-/

Mmm, perhaps it is worth mentioning this issue in documentation until a proper solution would be devised.

Hooks for events

It would be nice if one can take advantage of the do/undo machinery for triggering hooks during the object tree manipulation. This can be useful for external libraries like Traits.

I'm attaching the patch for this (thanks to Anthony Scopatz).

Create an index for UG

Users could benefit a lot of a index for Users Manual.

Pad with zeros when reading padded types

When reading padded types (padding is the bits of a data element which are not significant as defined by the precision and offset properties), PyTables does not zero the container first, so the data ends corrupted.

Using H5T_setpad() when building the type would prevent this from happening.

saved record array does not retain shape

Hi,

When I save a numpy record array as a Table, its shape is automatically flattened. Is this intended or a bug?

I am using version 2.0.4

See attached file for a short demo.

Cheers, Gabriel

Document how to subclass PyTables Node classes

There can be situations where users want to subclass any of current Node subclasses. This should be documented in User's Guide, and an example should be useful, like for example, one based on [http://pytables.org/moin/UserDocuments/CustomDataTypes].

Large type sizes and object header message "is too large" error

When creating large type sizes, exceeding 64k, HDF5 complains about:

  #014: H5Oalloc.c line 972 in H5O_alloc(): object header message is too large  
    major: Object header                                                        
    minor: Unable to initialize object

The next script reproduces the above error:

import tables

class Rec(tables.IsDescription):
    #col = tables.Int8Col(shape=2**16-9)  # works
    col = tables.Int8Col(shape=2**16)  # raise error

f = tables.openFile("/tmp/defaults.h5", "w")
t = f.createTable(f.root, 't', Rec)
f.close()

The problem is that PyTables always makes use of the H5Pset_fill_value() HDF5 call to set the default values, and these defaults have to be saved in the object header, which has a maximum size of 64 KB in HDF5. The HDF5 team has logged the issue as a bug, but meanwhile it would be nice to find a workaround.

Wrong shape computation in some situations when using tables.Expr

The next script explains the problem:

import tables as tb
import numpy as np
from tables import numexpr as ne

f = tb.openFile('test.h5', 'w')

factor = np.array([3.])
ar = np.arange(10.)

print "tables.Expr"
ar = f.createArray(f.root, 'test1', ar)
e = tb.Expr('factor*ar')
print e.eval() # [ 0.]

print "NumPy"
print factor*ar
print "Numexpr"
print ne.evaluate('factor*ar')

f.close()

and the output:

tables.Expr
[ 0.]     # wrong!
NumPy
[  0.   3.   6.   9.  12.  15.  18.  21.  24.  27.]
Numexpr
[  0.   3.   6.   9.  12.  15.  18.  21.  24.  27.]

Add write support to FileNode

It would be nice that the FileNode module supported in-place writing (i.e. modification) of data, besides read-only and read-append. Implementing it should not be very difficult, by using EArray.__setitem__(). However, it is a problem that EArray objects do not allow truncation, to better simulate RW files.

Add support for NumPy 2.0

NumPy 2.0 should be tested with PyTables.

Synchronize git/svn/launchpad/sf.net

It should be a goal that users do not get stale code when trying to access PyTables.

There have not yet been any commits to the svn trunk or the git master since the initial migration (http://sourceforge.net/mailarchive/message.php?msg_id=27598107), but once commits start showing up on the git master, the two will begin to diverge.

There is also a Bazaar-NG branch hosted at Launchpad (bzr branch lp:pytables) as described on http://www.pytables.org/moin/Development which will need to be addressed.

Change mentions to 'record arrays' to 'structured arrays'

In the documentation, there are many references (for example in Table.read or Table.readSorted docstrings) to 'record arrays' which was a former name for 'structured arrays'. This should be updated to the new terminology.

Table.setitem and Table.modifyRows() should deal better with NumPy void types

The next describes the problem:

>>> import numpy as np
>>> import tables as tb
>>> ra = np.fromiter(((i, i*2, i*3) for i in range(100)), dtype="i1,i2,i4")
>>> f = tb.openFile("/tmp/test.h5", "w")
>>> t = f.createTable(f.root, 't', ra)
>>> r = t[1]
>>> r
(1, 2, 3)
>>> r['f1'] = 0
>>> t[1] = [r]; t[1]
(1, 0, 3)
>>> t[1] = r.tolist(); t[1]
(1, 0, 3)

# But...

>>> t[1] = r
[clip]
ValueError: Object cannot be converted into a recarray object compliant with
table format '[('f0', '()i1'), ('f1', '()i2'), ('f2', '()i4')]'. The error was: <mismatch between the number of fields and the number of arrays>

This also applies to Table.modifyRows():

>>> t.modifyRows(1, rows=[[r]]); t[1]
1
(1, 0, 3)
>>> t.modifyRows(1, rows=[r.tolist()]); t[1]
1
(1, 0, 3)

# But...

>>> t.modifyRows(1, rows=[r]); t[1]
[clip]
ValueError: Object cannot be converted into a recarray object compliant with
table format '[('f0', '()i1'), ('f1', '()i2'), ('f2', '()i4')]'. The error was: <mismatch between the number of fields and the number of arrays>

Add support for querying the number of atoms of a VLArray row

Having this feature may allow users to know whether a read would take a lot of memory and time or not. Something like:

vlarray.getNAtoms(nrow)

would work.

Too large atoms does create a MemoryError

The script below reproduces the problem:

import tables
import numpy

# ----- Writing data to file ----- #

# Open the output file for writing
fid = tables.openFile("carray_error.hdf","w")

# Create a table group
fid.createGroup("/", 'table', 'Flow table')

# The number of rows and columns in a frame, and the number of frames
n_rows = 480
n_cols = 720
n_frames = 2

# Create a numpy vector to be stored in the Carray
matrix = numpy.random.randn(n_rows,n_cols)

# The CArray shape
array_shape = (n_frames,)

# The CArray atom
array_atom = tables.Atom.from_dtype(numpy.dtype((numpy.int16, (n_rows,n_cols))))

# Create a Carray for holding horizontal flow values
fid.createCArray(fid.root.table,'flow_x',array_atom,array_shape)

# Create a Carray for holding vertical flow values. This is where we get an 
error;
# working with smaller values of n_rows and n_cols works fine though.
fid.createCArray(fid.root.table,'flow_y',array_atom,array_shape)

for m in range(n_frames):
    fid.root.table.flow_x[0] = matrix
    fid.root.table.flow_y[0] = matrix

# Close the output file
fid.close()

More info about this in:

http://www.mail-archive.com/[email protected]/msg02209.html

This has been partially addressed in r4657, but this breaks some tests for CArray defaults, so another solution must be devised.

indexing should permit uint as well as int

I have a column of type uint32 in one table containing values for row-numbers in another table. Currently, I am forced to cast this to int32 before I can use it as an index. I think that makes no sense, logically, so I set the type of this issue to defect rather than enhancement.

In [1]: import tables

In [3]: t = tables.openFile('tutorial1.h5')

In [8]: t.root.detector.readout[np.array([2, 3, 4], dtype=np.int32)]
Out[8]: 
array([(512, 2, 256.0, 2, 8, 34359738368, 'Particle:      2', 4.0),
       (768, 3, 6561.0, 3, 7, 51539607552, 'Particle:      3', 9.0),
       (1024, 4, 65536.0, 4, 6, 68719476736, 'Particle:      4', 16.0)], 
      dtype=[('ADCcount', '<u2'), ('TDCcount', '|u1'), ('energy', '<f8'), ('grid_i', '<i4'), ('grid_j', '<i4'), ('idnumber', '<i8'), ('name', '|S16'), ('pressure', '<f4')])

In [9]: t.root.detector.readout[np.array([2, 3, 4], dtype=np.uint32)]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

/storage4/home/gerrit/checkouts/pytables-2.2/examples/<ipython console> in <module>()

/storage4/home/gerrit/.local/lib/python2.6/site-packages/tables/table.pyc in __getitem__(self, key)
   1709         # Try with a boolean or point selection

   1710         elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):
-> 1711             coords = self._pointSelection(key)
   1712             return self._readCoordinates(coords, None)
   1713         else:

/storage4/home/gerrit/.local/lib/python2.6/site-packages/tables/leaf.pyc in _pointSelection(self, key)
    587                 coords = numpy.asarray(key, dtype="i8")
    588         else:
--> 589             raise TypeError("Only integer coordinates allowed.")
    590         # We absolutely need a contiguous array

    591         if not coords.flags.contiguous:

TypeError: Only integer coordinates allowed.

It's probably a very easy fix, I might have a look at it this weekend.

Support variable length strings

Here is the original SourceForge report from Vicent Mas:

The attached file contains a dataset of strings with
variable length (property STRSIZE H5T_VARIABLE). These
strings are not supported in PyTables. It would be nice
to add this support in future versions of PyTables.

vmas@rachel:~/vitables/misc/examples/generic$ h5dump strings.h5
HDF5 "strings.h5" {
GROUP "/" {
   DATASET "StringsEx" {
      DATATYPE H5T_STRING {
            STRSIZE H5T_VARIABLE;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
      DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): "A fight is a contract that takes two people to honor.",
      (1): "A combative stance means that you've accepted the contract.",
      (2): "In which case, you deserve what you get.",
      (3): " -- Professor Cheng Man-ch'ing"
      }
   }
}
}

vmas@rachel:~/vitables/misc/examples/generic$ ptdump -d strings.h5
/usr/lib/python2.3/site-packages/tables/File.py:235:
UserWarning: file ``strings.h5`` exists and it is an HDF5 file, but it does not have a PyTables format; I will try to do my best to guess what's there using HDF5 metadata
warnings.warn("""\
/ (RootGroup) ''
/StringsEx (Array(4L,)) ''
Data dump:
[0] +
[1] ��*
[2] �*
[3] �+

(Imported from [https://sourceforge.net/tracker/index.php?func=detail&aid=1298908&group_id=63486&atid=50414 SourceForge #1298908].)

Unicode keys in dictionary when creating a table

The following method of describing a table and adding it works.

Event = {  "name"     : tables.StringCol(itemsize=16)}
table = h5file.createTable(group, 'readout', Event, "Readout example")

With a Unicode dictionary key it does not work.

Event = { u"name"     : tables.StringCol(itemsize=16)}
table = h5file.createTable(group, 'readout', Event, "Readout example")

Traceback (most recent call last):
  File "<pyshell#104>", line 1, in <module>
    table = h5file.createTable(group, 'readout', Event, "Readout example")
  File "C:\Python26\lib\site-packages\tables\file.py", line 718, in createTable
    chunkshape=chunkshape, byteorder=byteorder)
  File "C:\Python26\lib\site-packages\tables\table.py", line 525, in __init__
    self.description = Description(description)
  File "C:\Python26\lib\site-packages\tables\description.py", line 487, in __init__
    newdict['_v_dtype'] = numpy.dtype(nestedDType)
TypeError: data type not understood

I think the second method should also work.

Continuous integration

A cross-platform, continuous integration server (perhaps using http://jenkins-ci.org) would be beneficial if several users will be pushing source changes.

If a hosted solution could be found, that might be preferable:

The limits for an indexed column are computed only partially in some situations.

The situation is when an variable with two limits that is usable for
indexing follows to a expression that is not usable. For example, in:

(c_extra > 0) & (c_int32 > 0) & (c_int32 < 5)

if c_extra is not indexed, but c_int32 is, the indexable
part is computed as just (c_int32 > 0), which doesn't introduce
problems, but is sub-optimal.

This is in fact a very minor problem, as the user is compelled to
always put the indexed columns first in condtion expressions, i.e.:

(c_int32 > 0) & (c_int32 < 5) & (c_extra > 0)

which has not problems in detecting both limits.

The problem is exposed by running:

python tables/tests/test_queries.py IndexedTableUsageTestCase.test06

If this is considered not worth the effort to solve, one can remove
(or comment out) test06 and relax the priority of this ticket until
this would be fixed.

See ticket #162 for a related (and solved) problem.

table.Column iteration is very slow

It is much faster to copy an entire column than to iterate over it.

These all run in about 2.5 seconds:
tmp = tbl.cols.some_field[:]
tmp = [x for x in tbl.cols.some_field[:]]
tmp0 = [x['some_field'] for x in tbl]

This runs in 32 seconds:
tmp = [x for x in tbl.cols.some_field]

The problem is that the most natural way to iterate over an entire column is using iteration (e.g. [func(x) for x in tbl.cols.myfield]). So it should be fast.

The unsuspecting user has no reason to assume that iteration has such a huge penalty in this case - iteration in python has a minor performance penalty in some cases, but rarely such a large penalty.

Add automatic tree comparison to tests

There are a lot of tests which check that some nodes are or aren't in the tree after some hierarchy manipulation operation. Checking this is repetitive and prone to be incomplete, e.g. consistence of changes to individual nodes, missing nodes, unexpected nodes...

There should be some common testing support for those checks. For instance a function to check that the tree contains an exact set of paths, maybe under some group, and maybe checking for the type of node (leaf or group).

Less than recommended numexpr version should issue a native Python warning

Now, if only prints a message like:

*Warning*: Numexpr version is lower than recommended: 1.4 < 1.4.1

That should be fixed.

Add a `size` read-only attribute to leaves

That would be a fast and easy way to get the size in bytes of a leaf.

PyTables 2.2 should be easy_install-able

Currently 2.2 is not installable via setuptools if numexpr is not installed previously. The problem is that setup.py enforces that numexpr has to be installed, and a new entry for numexpr in setup_requires key of setuptools_kwargs must be added.

This is actually too strong, as the requisite is mainly for running, not compiling. This should be fixed on some way or another.

Saving a pickled object should raise a warning

In some situations one may not want to generate pickled object under no conditions. Adding a parameter (something like PURE_HDF5_DATA) would allow a user to select the preferred behaviour:

0: No warnings. Pickled objects are allowed (default)
1: Warning. Issue a warning every time a pickled object is created
2: Error. Issue a ValueError everytime a pickled object is created.