
recon's People

Contributors

dietmarw, harmanpa, mtiller, tbeu


recon's Issues

Reduce entry overhead

If each table were given a unique id (an index, for example, assuming an order could be imposed on the tables), then that id could be used in each table entry, which would save space in wall files.
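A rough illustration of the idea (the entry layout and key names here are hypothetical, not the actual wall format): identifying the table by a small integer id, assigned from its position in the header, shaves a few bytes off every serialized entry compared to repeating the table name.

import msgpack  # the serialization format used by recon

# Hypothetical entry payloads: identify the table by name vs. by a small
# integer id derived from the table's position in the header.
by_name = msgpack.packb({"table": "T2", "row": [0.0, 1.0, 2.0]})
by_id = msgpack.packb({"t": 1, "row": [0.0, 1.0, 2.0]})

print(len(by_name), len(by_id))  # the id-based entry is several bytes shorter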

IOError on dsres.mld

I converted dsres.mat to .mld using the dsres2meld script. Then I got:

Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win32
>>> from recon.meld import MeldWriter, MeldReader
>>> fp = open(r'c:\Download\recon-master\tests\dsres.mld', 'rb')
>>> meld = MeldReader(fp, verbose=True)
Compression: True
Header = {'tabs': {'T2': {'vmeta': {'der(i_L)': {'desc': ''}, 'i_L': {'desc': ''}, 'time': {'desc': 'Simulation time [s]'}, 'i_C': {'desc': ''}, 'der(V)': {'desc': ''}, 'V': {'desc': ''}, 'i_R': {'desc': ''}}, 'vars': ['time', 'der(i_L)', 'i_L', 'i_C', 'der(V)', 'V', 'i_R'], 'tmeta': {}, 'toff': {'der(i_L)': {'i': 2056, 'l': 2444}, 'i_L': {'i': 4510, 'l': 2233}, 'time': {'i': 568, 'l': 1483}, 'i_C': {'i': 6748, 'l': 2419}, 'der(V)': {'i': 9178, 'l': 2431}, 'V': {'i': 11616, 'l': 2113}, 'i_R': {'i': 13737, 'l': 2117}}}, 'T1': {'vmeta': {'time': {'desc': 'Simulation time [s]'}, 'Vb': {'desc': 'Battery voltage'}, 'R': {'desc': ''}, 'C': {'desc': ''}, 'L': {'desc': ''}}, 'vars': ['time', 'C', 'L', 'Vb', 'R'], 'tmeta': {}, 'toff': {'Vb': {'i': 16015, 'l': 49}, 'R': {'i': 16065, 'l': 49}, 'C': {'i': 15911, 'l': 52}, 'L': {'i': 15963, 'l': 51}, 'time': {'i': 15861, 'l': 50}}}}, 'fmeta': {}, 'objs': {}, 'comp': True}
>>> meld.tables()
['T2', 'T1']
>>> table = meld.read_table('T2')
>>> table.signals()
['time', 'der(i_L)', 'i_L', 'i_C', 'der(V)', 'V', 'i_R']
>>> table.data('time')
Traceback (most recent call last):
  File "<pyshell#11>", line 1, in <module>
    table.data('time')
  File "C:\Programme\Python27\lib\site-packages\pyrecon-0.3.0-py2.7.egg\recon\meld.py", line 568, in data
    data = self.reader.ser.decode_vec(self.reader.fp, blen)
  File "C:\Programme\Python27\lib\site-packages\pyrecon-0.3.0-py2.7.egg\recon\serial.py", line 114, in decode_vec
    verbose=verbose, uncomp=uncomp)
  File "C:\Programme\Python27\lib\site-packages\pyrecon-0.3.0-py2.7.egg\recon\serial.py", line 106, in decode_obj
    data = decompress(data)
  File "C:\Programme\Python27\lib\site-packages\pyrecon-0.3.0-py2.7.egg\recon\serial.py", line 17, in decompress
    return c.decompress(data)
IOError: invalid data stream

Environment is WinXP with Python 2.7.6, setuptools 2.2 and msgpack-python 0.4.1.

Add support for type annotations

As it stands, there are no type restrictions in the API. I can see a couple of different type restriction options. First, a natural type constraint would be that every element in a column is of the same type. Whether that type is prescribed a priori could be up to the user.

At a minimum, it would be useful to do the following (see the sketch after this list):

  • Allow specifying per column type constraints
  • Specify a default column type (to be used when a per column type is not provided)
  • Check that each row after the first conforms to the types of the columns in the first row
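A minimal sketch of what such checks might look like; the names COLUMN_TYPES, DEFAULT_TYPE and check_row are illustrative, not part of the recon API:

# Illustrative sketch -- not the actual recon API.
COLUMN_TYPES = {"time": float, "i_L": float, "on": bool}  # per-column constraints
DEFAULT_TYPE = float  # used when a per-column type is not provided

def check_row(columns, row, column_types=COLUMN_TYPES, default=DEFAULT_TYPE):
    """Verify each value in a row against its column's declared type."""
    for name, value in zip(columns, row):
        expected = column_types.get(name, default)
        if not isinstance(value, expected):
            raise TypeError("column %r expects %s, got %r"
                            % (name, expected.__name__, value))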

TypeError on fullRobot.mld

I converted fullRobot.mat to .mld using the dsres2meld script. Then I got:

PythonWin 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win32.
>>> from recon.meld import MeldWriter, MeldReader
>>> fp = open(r'c:\Download\recon-master\tests\fullRobot.mld', 'rb')
>>> meld = MeldReader(fp, verbose=True)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Programme\Python27\lib\site-packages\pyrecon-0.3.0-py2.7.egg\recon\meld.py", line 444, in __init__
    self.header = self.ser.decode_obj(self.fp, length=blen)
  File "C:\Programme\Python27\lib\site-packages\pyrecon-0.3.0-py2.7.egg\recon\serial.py", line 107, in decode_obj
    x = msgpack.unpackb(data)
  File "_unpacker.pyx", line 119, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:119)
TypeError: unhashable type: 'dict'

Environment is WinXP with Python 2.7.6, setuptools 2.2 and msgpack-python 0.4.1.

Use msgpack for transformations

@choeger asked the question "Why not use msgpack for transformations?". After all, it would not only make the file format more consistent, but it would eliminate the need to parse transformation strings. It would also be well aligned with #43.

Initially, I thought that the strings would be more compact. The most common transformation, by far, would be the pure alias transformation: aff(1.0,0.0). Even if you use integers to represent this string, you can still only get it down to aff(1,0) (8 bytes). That seems pretty small. But in discussing it with Christoph, I realized that if we represented this transformation as:

{"k": "aff", s: 1, o: 0}

and then used msgpack to pack it, we would get the byte sequence:

\x83\xa1k\xa3aff\xa1s\x01\xa1o\x00

which is only 13 bytes. Yes, we pay 5 extra bytes per alias signal for using msgpack (13 vs. 8), but we make implementations easier since people don't have to include transformation parsing.
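For reference, the byte count can be checked directly with msgpack-python (key order in the output may differ between Python versions, but the total stays at 13 bytes):

import msgpack

packed = msgpack.packb({"k": "aff", "s": 1, "o": 0})
print(repr(packed))  # b'\x83\xa1k\xa3aff\xa1s\x01\xa1o\x00' (key order may vary)
print(len(packed))   # 13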

So I think this is a good idea for version 2!

Type Constraints

It would be nice to have a feature where the type of each column could be specified and somehow enforced/checked during reading and writing. The underlying format (due to its BSON foundation) doesn't really care. But it would be nice to allow the client libraries to make some basic checks.

Proper Setup Script

I need to add a setup script. Furthermore, it should include scripts for converting a) wall files into meld files and b) dsres files into meld files.
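A minimal sketch of what the setup script could look like; the entry point module paths (recon.scripts.*) are assumptions about the package layout, not the actual one:

# setup.py -- minimal sketch; metadata and entry point targets are illustrative.
from setuptools import setup, find_packages

setup(
    name="pyrecon",
    version="0.3.0",
    packages=find_packages(),
    install_requires=["msgpack-python"],
    entry_points={
        "console_scripts": [
            "wall2meld = recon.scripts.wall2meld:main",
            "dsres2meld = recon.scripts.dsres2meld:main",
        ],
    },
)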

Field updates

In thinking about the way objects are defined in wall files, it seems inefficient and a bit odd that they need to be defined one field at a time. It would be more efficient to allow the values of several fields to be updated in a single entry. But currently, the format doesn't allow this.
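To illustrate with purely hypothetical entry layouts (not the current wall format): today something like the first form is required, one entry per field, whereas a multi-field update would collapse it into a single entry.

# Hypothetical wall entries -- layouts shown only to illustrate the overhead.
single_field_entries = [
    {"obj": "settings", "field": "solver", "value": "dassl"},
    {"obj": "settings", "field": "tolerance", "value": 1e-6},
]

multi_field_entry = {
    "obj": "settings",
    "fields": {"solver": "dassl", "tolerance": 1e-6},
}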

Refactor for explicit data lengths

There are a few cases (in particular, headers) where the length of the data is maintained in the serialized data. This is a problem for several reasons. First, it requires you to know the serialization format to decode the length. Second, it presumes that the serialization format encodes the length. Third, compression makes it impossible to know how many bytes to read.

Currently, all data already includes explicit lengths in the header. But what doesn't include an explicit length is the header itself. So this needs to be changed so there is an explicit header length in the format. Then all reads can be done for precisely the required number of bytes.

This will be necessary to address #11, but it will also clean up the APIs for the serializers significantly and shouldn't impact the read count.
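A sketch of what the reader side could look like once an explicit length always precedes the payload; the 4-byte big-endian prefix is an assumption about the layout, not the defined format:

import struct

def read_block(fp):
    """Read one length-prefixed block: a 4-byte big-endian length followed
    by exactly that many payload bytes (the prefix layout is an assumption)."""
    raw_len = fp.read(4)
    if len(raw_len) < 4:
        raise EOFError("truncated length prefix")
    (length,) = struct.unpack(">I", raw_len)
    return fp.read(length)  # read precisely the required number of bytes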

Asymmetry in reading

I think a WallTableReader object would be useful. It would add a proper API around extracting signals and aliases. It would also be symmetric with how data is written.

Proper metadata

I need to add support for metadata associated with each file, signal, table or object.

To avoid excessive overhead, perhaps the metadata should be left out if empty.

Better transform names

I realized, while writing the paper, that the most common transform would be either a logical not or a sign inversion of the data. For sign inversion, a transform of "affine(-1,0)" is currently required. One thing I realized when looking at the key names is that a typical results file stores a lot of alias information, so it is important to keep the keys small. The same applies to transforms. As such, I think we should refactor the transforms as follows:

  • "inv" - Simple transform that either changes the sign of numerical data or logically inverts booleans
  • "aff(s,o)" - Applies an affine transformation for the specified scale, s and offset, o.

Look at wall2meld performance

Martin Sjölund pointed out several areas where wall to meld conversion was slow. The first thing to do is identify whether it is possible to do this conversion "in memory" (since that would probably be a fairer comparison).

We should also profile the conversion process to see where we can speed things up. Martin points out that the array packing and unpacking seems to be the big thing. I wonder if there is a way to optimize that more?
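One way to get that profile with the standard library; the import path and file names below are assumptions about the package layout, used only for illustration:

import cProfile
import pstats

# `wall2meld(wfp, mfp)` stands in for whatever callable performs the
# conversion; the module path below is an assumption, not the verified API.
from recon.scripts.wall2meld import wall2meld

with open("input.wll", "rb") as wfp, open("output.mld", "wb+") as mfp:
    cProfile.runctx("wall2meld(wfp, mfp)", globals(), locals(), "conv.prof")

pstats.Stats("conv.prof").sort_stats("cumulative").print_stats(15)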

Create nose tests

I should really create a bunch of formal nose tests that not only test the code but report on coverage.
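An illustrative nose-style test against the sample files above (the constructor call mirrors the interactive sessions in the issues; the test file location is an assumption). Running it with something like nosetests --with-coverage --cover-package=recon would also produce a coverage report.

# tests/test_meld.py -- illustrative test only.
from nose.tools import assert_equal

from recon.meld import MeldReader

def test_tables_listed():
    with open("tests/dsres.mld", "rb") as fp:
        meld = MeldReader(fp, verbose=True)
        assert_equal(sorted(meld.tables()), ["T1", "T2"])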

Support for compression

A key goal with this format is to minimize reads. If compression were supported, it would have to be pretty localized (e.g. compressing individual columns) so that it does not impact the number of reads.

Header compression is possible, but it would be a bit problematic. The ID would have to reflect the fact that it was compressed, and the length information that precedes each document couldn't be included in the compression (again...impact on reads).

Compression of columns is probably more likely to have a significant impact on storage space than compression of the header (which probably won't include a lot of repetitive data).

An open question would be...what type of compression? We'd want to use something that is typically available as part of standard libraries. For Python, zlib and bz2 seem to be easily accessible. But what about the Java and C platforms?
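As a rough sketch of the column-local approach using zlib (which ships with Python's standard library); the per-column offsets and lengths are assumed to stay in the header as they do today, so a reader can still fetch exactly the bytes of one column:

import zlib
import msgpack

# Compress one column independently so a reader can still seek to and read
# just that column's bytes.
column = [0.0, 0.1, 0.2, 0.3]
compressed = zlib.compress(msgpack.packb(column))

# Reading back only this column:
restored = msgpack.unpackb(zlib.decompress(compressed))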

Preserve column order

At the moment, there is nothing that specifies column ordering. I think this is absolutely essential.

Avoid invalid files caused by error-prone mode flags of open

From #37 and #38 we learned that opening a recon file requires mode='rb' for reading and mode='wb+' for writing, since it turned out that not setting the binary 'b' mode may lead to invalid recon files.

with open('dsres', 'rb') as wfp:
    with open('mld', 'wb+') as mfp:
        dsres2meld(wfp, mfp)

This is error-prone since developers might forget to set the binary 'b' mode. For that reason I propose to introduce a new file wrapper class, say recon.reconFile. The mode flags could then be similar to zipfile.ZipFile, with valid settings like mode='r' or mode='w'. Finally, all functions that currently take file handles (i.e. isinstance(wfp, file) yields True) shall check their arguments for type recon.reconFile (i.e. isinstance(wfp, recon.reconFile) must yield True). A sketch of such a wrapper follows the examples below.

with recon.reconFile('dsres', 'r') as wfp:
    with recon.reconFile('mld', 'w') as mfp:
        dsres2meld(wfp, mfp)

def dsres2meld(wfp, mfp):
    if isinstance(wfp, recon.reconFile) and isinstance(mfp, recon.reconFile):
        print('OK: These are the expected file type, go ahead')
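A minimal sketch of what such a wrapper might look like; the class name follows the proposal, while the rest (method set, error handling) is an assumption:

class reconFile(object):
    """Sketch of the proposed wrapper: maps 'r'/'w' to the binary modes that
    recon actually requires and supports the with-statement."""

    def __init__(self, name, mode):
        if mode not in ("r", "w"):
            raise ValueError("mode must be 'r' or 'w'")
        self._fp = open(name, "rb" if mode == "r" else "wb+")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._fp.close()

    def read(self, size=-1):
        return self._fp.read(size)

    def write(self, data):
        return self._fp.write(data)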

Simplify Meld format

At the moment, tables have a complex structure. I suspect things can be simplified quite a bit. This isn't a big deal, though, because it only affects the headers.

Basically, the question is whether the indices sub-document is required.

Ambiguous keys

At some point, we switched to single character keys to reduce the size of the files. This is reasonable but probably not very effective at reducing size. Now that I'm trying to write up descriptions, these short (and sometimes repeated) keys make explanations confusing.

We should adopt a slightly different set of keys to maintain relatively terse names but, at the same time, avoid ambiguities.

Javascript implementation

It would be nice to be able to process recon within a web app. Ideally, it should use range headers for any AJAX requests it makes.

Better Handling of Transforms

We need a scheme for defining transforms that are performed on aliases.

The obvious legacy case is flipping the sign (vs. the base signal). Even "richer" would be to scale things by some constant. This would expand the applications of such transforms from simple sign flipping (e.g. a = -b) to linear relations (e.g. V = R*i, assuming R was bound or a constant and not a variable). If we do linear scaling, we might as well support affine transformations (e.g. y = m*x + b).

But all this is centered around numeric types. Another application would be things like applying a "not" operation to a base boolean signal.

What I propose to do, as part of this ticket, is to introduce a "transform" field for all aliases. This field will be a string that contains a transform definition. To begin with, I propose only two transforms (a sketch of applying them follows the list):

  • affine(s,o) - where s is the scale factor and o is the offset. This transform can only be applied to "numeric" values (integers and floating point numbers).

  • not - This transform gives you the inverse value for boolean values.

    If the data in the "base signal" doesn't meet the requirements for applying the transform, the transform is not applied.
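A hedged sketch of how a reader might interpret these transform strings; the parsing helper and its handling of non-conforming data are illustrative, not a defined API:

import re

def apply_transform(transform, values):
    """Apply an alias transform string to base-signal values (sketch only)."""
    if transform == "not":
        # only meaningful for booleans; otherwise the transform is not applied
        if all(isinstance(v, bool) for v in values):
            return [not v for v in values]
        return values
    match = re.match(r"affine\(([^,]+),([^)]+)\)", transform)
    if match:
        s, o = float(match.group(1)), float(match.group(2))
        return [s * v + o for v in values]
    return values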

Investigate Msgpack

Based on the results in #9, I recognize that BSON is actually very inefficient for arrays.

I researched this and looked at BJSON, UBJSON, Protocol Buffers, Thrift and Smile before finally deciding that the best supported and most compact format (across Java, C and Python) appears to be msgpack.

So I'm going to investigate this by refactoring the current code to have modular serialization/deserialization capabilities for some side by side comparisons.
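As a sketch of what "modular serialization" could mean in practice (the two-method interface is an assumption, not the actual recon serializer API), every backend would implement the same pair of calls, which makes side by side size comparisons straightforward:

import msgpack

class MsgpackSerializer(object):
    """One pluggable backend; competing backends (BSON, UBJSON, ...) would
    expose the same encode/decode pair for comparison."""

    def encode(self, obj):
        return msgpack.packb(obj)

    def decode(self, data):
        return msgpack.unpackb(data)

ser = MsgpackSerializer()
payload = ser.encode([0.1 * i for i in range(1000)])
print(len(payload))  # bytes needed for a 1000-sample column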
