
recon's People

Contributors

dietmarw, harmanpa, mtiller, tbeu


recon's Issues

Reduce entry overhead

If each table were given a unique id (an index, for example, assuming an order could be imposed on the tables), then that id could be used in each table entry, which would save space in wall files.
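A rough illustration of the idea (the entry layout and key names here are hypothetical, not the actual wall format): identifying the table by a small integer id, assigned from its position in the header, shaves a few bytes off every serialized entry compared to repeating the table name.

import msgpack  # the serialization format used by recon

# Hypothetical entry payloads: identify the table by name vs. by a small
# integer id derived from the table's position in the header.
by_name = msgpack.packb({"table": "T2", "row": [0.0, 1.0, 2.0]})
by_id = msgpack.packb({"t": 1, "row": [0.0, 1.0, 2.0]})

print(len(by_name), len(by_id))  # the id-based entry is several bytes shorter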

IOError on dsres.mld

I converted dsres.mat to .mld using the dsres2meld script. Then I got:

Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win32
>>> from recon.meld import MeldWriter, MeldReader
>>> fp = open(r'c:\Download\recon-master\tests\dsres.mld', 'rb')
>>> meld = MeldReader(fp, verbose=True)
Compression: True
Header = {'tabs': {'T2': {'vmeta': {'der(i_L)': {'desc': ''}, 'i_L': {'desc': ''}, 'time': {'desc': 'Simulation time [s]'}, 'i_C': {'desc': ''}, 'der(V)': {'desc': ''}, 'V': {'desc': ''}, 'i_R': {'desc': ''}}, 'vars': ['time', 'der(i_L)', 'i_L', 'i_C', 'der(V)', 'V', 'i_R'], 'tmeta': {}, 'toff': {'der(i_L)': {'i': 2056, 'l': 2444}, 'i_L': {'i': 4510, 'l': 2233}, 'time': {'i': 568, 'l': 1483}, 'i_C': {'i': 6748, 'l': 2419}, 'der(V)': {'i': 9178, 'l': 2431}, 'V': {'i': 11616, 'l': 2113}, 'i_R': {'i': 13737, 'l': 2117}}}, 'T1': {'vmeta': {'time': {'desc': 'Simulation time [s]'}, 'Vb': {'desc': 'Battery voltage'}, 'R': {'desc': ''}, 'C': {'desc': ''}, 'L': {'desc': ''}}, 'vars': ['time', 'C', 'L', 'Vb', 'R'], 'tmeta': {}, 'toff': {'Vb': {'i': 16015, 'l': 49}, 'R': {'i': 16065, 'l': 49}, 'C': {'i': 15911, 'l': 52}, 'L': {'i': 15963, 'l': 51}, 'time': {'i': 15861, 'l': 50}}}}, 'fmeta': {}, 'objs': {}, 'comp': True}
>>> meld.tables()
['T2', 'T1']
>>> table = meld.read_table('T2')
>>> table.signals()
['time', 'der(i_L)', 'i_L', 'i_C', 'der(V)', 'V', 'i_R']
>>> table.data('time')
Traceback (most recent call last):
  File "<pyshell#11>", line 1, in <module>
    table.data('time')
  File "C:\Programme\Python27\lib\site-packages\pyrecon-0.3.0-py2.7.egg\recon\meld.py", line 568, in data
    data = self.reader.ser.decode_vec(self.reader.fp, blen)
  File "C:\Programme\Python27\lib\site-packages\pyrecon-0.3.0-py2.7.egg\recon\serial.py", line 114, in decode_vec
    verbose=verbose, uncomp=uncomp)
  File "C:\Programme\Python27\lib\site-packages\pyrecon-0.3.0-py2.7.egg\recon\serial.py", line 106, in decode_obj
    data = decompress(data)
  File "C:\Programme\Python27\lib\site-packages\pyrecon-0.3.0-py2.7.egg\recon\serial.py", line 17, in decompress
    return c.decompress(data)
IOError: invalid data stream

Environment is WinXP with Python 2.7.6, setuptools 2.2 and msgpack-python 0.4.1.

Add support for type annotations

As it stands, there are no type restrictions in the API. I can see a couple of different type restriction options. First, a natural type constraint would be that every element in a column is of the same type. Whether that type is prescribed a priori could be up to the user.

At a minimum, it would be useful to do the following (see the sketch after this list):

  • Allow specifying per column type constraints
  • Specify a default column type (to be used when a per column type is not provided)
  • Check that each row after the first conforms to the types of the columns in the first row
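A minimal sketch of what such checks might look like; the names COLUMN_TYPES, DEFAULT_TYPE and check_row are illustrative, not part of the recon API:

# Illustrative sketch -- not the actual recon API.
COLUMN_TYPES = {"time": float, "i_L": float, "on": bool}  # per-column constraints
DEFAULT_TYPE = float  # used when a per-column type is not provided

def check_row(columns, row, column_types=COLUMN_TYPES, default=DEFAULT_TYPE):
    """Verify each value in a row against its column's declared type."""
    for name, value in zip(columns, row):
        expected = column_types.get(name, default)
        if not isinstance(value, expected):
            raise TypeError("column %r expects %s, got %r"
                            % (name, expected.__name__, value))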

TypeError on fullRobot.mld

I converted fullRobot.mat to .mld using the dsres2meld script. Then I got:

PythonWin 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win32.
>>> from recon.meld import MeldWriter, MeldReader
>>> fp = open(r'c:\Download\recon-master\tests\fullRobot.mld', 'rb')
>>> meld = MeldReader(fp, verbose=True)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Programme\Python27\lib\site-packages\pyrecon-0.3.0-py2.7.egg\recon\meld.py", line 444, in __init__
    self.header = self.ser.decode_obj(self.fp, length=blen)
  File "C:\Programme\Python27\lib\site-packages\pyrecon-0.3.0-py2.7.egg\recon\serial.py", line 107, in decode_obj
    x = msgpack.unpackb(data)
  File "_unpacker.pyx", line 119, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:119)
TypeError: unhashable type: 'dict'

Environment is WinXP with Python 2.7.6, setuptools 2.2 and msgpack-python 0.4.1.

Use msgpack for transformations

@choeger asked the question "Why not use msgpack for transformations?". After all, it would not only make the file format more consistent, but it would eliminate the need to parse transformation strings. It would also be well aligned with #43.

Initially, I thought that the strings would be more compact. The most common transformation, by far, would be the pure alias transformation: aff(1.0,0.0). Even if you use integers to represent this string, you can still only get it down to aff(1,0) (8 bytes). That seems pretty small. But in discussing it with Christoph, I realized that if we represented this transformation as:

{"k": "aff", s: 1, o: 0}

and then used msgpack to pack it, we would get the byte sequence:

\x83\xa1k\xa3aff\xa1s\x01\xa1o\x00

which is only 13 bytes. Yes, we pay 5 extra bytes per alias signal for using msgpack (13 vs. 8), but we make implementations easier since people don't have to include transformation parsing.
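For reference, the byte count can be checked directly with msgpack-python (key order in the output may differ between Python versions, but the total stays at 13 bytes):

import msgpack

packed = msgpack.packb({"k": "aff", "s": 1, "o": 0})
print(repr(packed))  # b'\x83\xa1k\xa3aff\xa1s\x01\xa1o\x00' (key order may vary)
print(len(packed))   # 13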

So I think this is a good idea for version 2!

Type Constraints

It would be nice to have a feature where the type of each column could be specified and somehow enforced/checked during reading and writing. The underlying format (due to its BSON foundation) doesn't really care. But it would be nice to allow the client libraries to make some basic checks.

Proper Setup Script

I need to add a setup script. Furthermore, it should include scripts for converting a) wall files into meld files and b) dsres files into meld files.
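A minimal sketch of what the setup script could look like; the entry point module paths (recon.scripts.*) are assumptions about the package layout, not the actual one:

# setup.py -- minimal sketch; metadata and entry point targets are illustrative.
from setuptools import setup, find_packages

setup(
    name="pyrecon",
    version="0.3.0",
    packages=find_packages(),
    install_requires=["msgpack-python"],
    entry_points={
        "console_scripts": [
            "wall2meld = recon.scripts.wall2meld:main",
            "dsres2meld = recon.scripts.dsres2meld:main",
        ],
    },
)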

Field updates

In thinking about the way objects are defined in wall files, it seems inefficient and a bit odd that they need to be defined one field at a time. It would be more efficient to allow the values of several fields to be updated in a single entry. But currently, the format doesn't allow this.
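To illustrate with purely hypothetical entry layouts (not the current wall format): today something like the first form is required, one entry per field, whereas a multi-field update would collapse it into a single entry.

# Hypothetical wall entries -- layouts shown only to illustrate the overhead.
single_field_entries = [
    {"obj": "settings", "field": "solver", "value": "dassl"},
    {"obj": "settings", "field": "tolerance", "value": 1e-6},
]

multi_field_entry = {
    "obj": "settings",
    "fields": {"solver": "dassl", "tolerance": 1e-6},
}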

Refactor for explicit data lengths

There are a few cases (in particular, headers) where the length of the data is maintained in the serialized data. This is a problem for several reasons. First, it requires you to know the serialization format to decode the length. Second, it presumes that the serialization format encodes the length. Third, compression makes it impossible to know how many bytes to read.

Currently, all data already includes explicit lengths in the header. But what doesn't include an explicit length is the header itself. So this needs to be changed so there is an explicit header length in the format. Then all reads can be done for precisely the required number of bytes.

This will be necessary to address #11, but it will also clean up the APIs for the serializers significantly and shouldn't impact the read count.
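A sketch of what the reader side could look like once an explicit length always precedes the payload; the 4-byte big-endian prefix is an assumption about the layout, not the defined format:

import struct

def read_block(fp):
    """Read one length-prefixed block: a 4-byte big-endian length followed
    by exactly that many payload bytes (the prefix layout is an assumption)."""
    raw_len = fp.read(4)
    if len(raw_len) < 4:
        raise EOFError("truncated length prefix")
    (length,) = struct.unpack(">I", raw_len)
    return fp.read(length)  # read precisely the required number of bytes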

Asymmetry in reading

I think a WallTableReader object would be useful. It would add a proper API around extracting signals and aliases. It would also be symmetric with how data is written.

Proper metadata

I need to add support for metadata associated with each file, signal, table or object.

To avoid excessive overhead, perhaps the metadata should be left out if empty.

Better transform names

I realized, while writing the paper, that the most common transform would be either a logical not or a sign inversion of the data. For sign inversion, a transform of "affine(-1,0)" is currently required. One thing I realized when looking at the key names is that a typical results file stores a lot of alias information, so it is important to keep the keys small. The same applies to transforms. As such, I think we should refactor the transforms as follows:

  • "inv" - Simple transform that either changes the sign of numerical data or logically inverts booleans
  • "aff(s,o)" - Applies an affine transformation for the specified scale, s and offset, o.

Look at wall2meld performance

Martin Sjölund pointed out several areas where wall to meld conversion was slow. The first thing to do is identify whether it is possible to do this conversion "in memory" (since that would probably be a fairer comparison).

We should also profile the conversion process to see where we can speed things up. Martin points out that the array packing and unpacking seems to be the big thing. I wonder if there is a way to optimize that more?
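One way to get that profile with the standard library; the import path and file names below are assumptions about the package layout, used only for illustration:

import cProfile
import pstats

# `wall2meld(wfp, mfp)` stands in for whatever callable performs the
# conversion; the module path below is an assumption, not the verified API.
from recon.scripts.wall2meld import wall2meld

with open("input.wll", "rb") as wfp, open("output.mld", "wb+") as mfp:
    cProfile.runctx("wall2meld(wfp, mfp)", globals(), locals(), "conv.prof")

pstats.Stats("conv.prof").sort_stats("cumulative").print_stats(15)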

Create nose tests

I should really create a bunch of formal nose tests that not only test the code but report on coverage.
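An illustrative nose-style test against the sample files above (the constructor call mirrors the interactive sessions in the issues; the test file location is an assumption). Running it with something like nosetests --with-coverage --cover-package=recon would also produce a coverage report.

# tests/test_meld.py -- illustrative test only.
from nose.tools import assert_equal

from recon.meld import MeldReader

def test_tables_listed():
    with open("tests/dsres.mld", "rb") as fp:
        meld = MeldReader(fp, verbose=True)
        assert_equal(sorted(meld.tables()), ["T1", "T2"])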

Support for compression

A key goal with this format is to minimize reads. If compression were supported, it would have to be pretty localized (e.g. compressing individual columns) so that it does not impact the number of reads.

Header compression is possible, but it would be a bit problematic. The ID would have to reflect the fact that it was compressed, and the length information that precedes each document couldn't be included in the compression (again...impact on reads).

Compression of columns is probably more likely to have a significant impact on storage space than compression of the header (which probably won't include a lot of repetitive data).

An open question would be...what type of compression? We'd want to use something that is typically available as part of standard libraries. For Python, zlib and bz2 seem to be easily accessible. But what about the Java and C platforms?
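As a rough sketch of the column-local approach using zlib (which ships with Python's standard library); the per-column offsets and lengths are assumed to stay in the header as they do today, so a reader can still fetch exactly the bytes of one column:

import zlib
import msgpack

# Compress one column independently so a reader can still seek to and read
# just that column's bytes.
column = [0.0, 0.1, 0.2, 0.3]
compressed = zlib.compress(msgpack.packb(column))

# Reading back only this column:
restored = msgpack.unpackb(zlib.decompress(compressed))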

Preserve column order

At the moment, there is nothing that specifies column ordering. I think this is absolutely essential.

Avoid invalid files caused by error-prone mode flags of open

From #37 and #38 we learned that opening a recon file requires mode='rb' for reading and mode='wb+' for writing, since it turned out that not setting the binary 'b' mode may lead to invalid recon files.

with open('dsres', 'rb') as wfp:
    with open('mld', 'wb+') as mfp:
        dsres2meld(wfp, mfp)

This is error-prone since developers might forget to set the binary 'b' mode. For that reason I propose to introduce a new file wrapper class, say recon.reconFile. The mode flags could then be similar to zipfile.ZipFile, with valid settings like mode='r' or mode='w'. Finally, all functions that currently take file handles (i.e. isinstance(wfp, file) yields True) shall check their arguments for type recon.reconFile (i.e. isinstance(wfp, recon.reconFile) must yield True). A sketch of such a wrapper follows the examples below.

with recon.reconFile('dsres', 'r') as wfp:
    with recon.reconFile('mld', 'w') as mfp:
        dsres2meld(wfp, mfp)

def dsres2meld(wfp, mfp):
    if isinstance(wfp, recon.reconFile) and isinstance(mfp, recon.reconFile):
        print('OK: These are the expected file type, go ahead')
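A minimal sketch of what such a wrapper might look like; the class name follows the proposal, while the rest (method set, error handling) is an assumption:

class reconFile(object):
    """Sketch of the proposed wrapper: maps 'r'/'w' to the binary modes that
    recon actually requires and supports the with-statement."""

    def __init__(self, name, mode):
        if mode not in ("r", "w"):
            raise ValueError("mode must be 'r' or 'w'")
        self._fp = open(name, "rb" if mode == "r" else "wb+")

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._fp.close()

    def read(self, size=-1):
        return self._fp.read(size)

    def write(self, data):
        return self._fp.write(data)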

Simplify Meld format

At the moment, tables have a complex structure. I suspect things can be simplified quite a bit. This isn't a big deal, though, because it only affects the headers.

Basically, the question is whether the indices sub-document is required.

Ambiguous keys

At some point, we switched to single character keys to reduce the size of the files. This is reasonable but probably not very effective at reducing size. Now that I'm trying to write up descriptions, these short (and sometimes repeated) keys make explanations confusing.

We should adopt a slightly different set of keys to maintain relatively terse names but, at the same time, avoid ambiguities.

Javascript implementation

It would be nice to be able to process recon within a web app. Ideally, it should use range headers for any AJAX requests it makes.

Better Handling of Transforms

We need a scheme for defining transforms that are performed on aliases.

The obvious legacy case is flipping the sign (vs. the base signal). Even "richer" would be to scale things by some constant. This would expand the applications of such transforms from simple sign flipping (e.g. a = -b) to linear relations (e.g. V = R*i, assuming R was bound or a constant and not a variable). If we do linear scaling, we might as well support affine transformations (e.g. y = m*x + b).

But all this is centered around numeric types. Another application would be things like applying a "not" operation to a base boolean signal.

What I propose to do, as part of this ticket, is to introduce a "transform" field for all aliases. This field will be a string that contains a transform definition. To begin with, I propose only two transforms (a sketch of applying them follows the list):

  • affine(s,o) - where s is the scale factor and o is the offset. This transform can only be applied to "numeric" values (integers and floating point numbers).

  • not - This transform gives you the inverse value for boolean values.

    If the data in the "base signal" doesn't meet the requirements for applying the transform, the transform is not applied.
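A hedged sketch of how a reader might interpret these transform strings; the parsing helper and its handling of non-conforming data are illustrative, not a defined API:

import re

def apply_transform(transform, values):
    """Apply an alias transform string to base-signal values (sketch only)."""
    if transform == "not":
        # only meaningful for booleans; otherwise the transform is not applied
        if all(isinstance(v, bool) for v in values):
            return [not v for v in values]
        return values
    match = re.match(r"affine\(([^,]+),([^)]+)\)", transform)
    if match:
        s, o = float(match.group(1)), float(match.group(2))
        return [s * v + o for v in values]
    return values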

Investigate Msgpack

Based on the results in #9, I recognize that BSON is actually very inefficient for arrays.

I researched this and looked at BJSON, UBJSON, Protocol Buffers, Thrift and Smile before finally deciding that the best supported and most compact format (across Java, C and Python) appears to be msgpack.

So I'm going to investigate this by refactoring the current code to have modular serialization/deserialization capabilities for some side by side comparisons.
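As a sketch of what "modular serialization" could mean in practice (the two-method interface is an assumption, not the actual recon serializer API), every backend would implement the same pair of calls, which makes side by side size comparisons straightforward:

import msgpack

class MsgpackSerializer(object):
    """One pluggable backend; competing backends (BSON, UBJSON, ...) would
    expose the same encode/decode pair for comparison."""

    def encode(self, obj):
        return msgpack.packb(obj)

    def decode(self, data):
        return msgpack.unpackb(data)

ser = MsgpackSerializer()
payload = ser.encode([0.1 * i for i in range(1000)])
print(len(payload))  # bytes needed for a 1000-sample column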
