dlab-projects / marketflow

Basic Python library for working with the TAQ (US Trade and Quote) dataset

Home Page: http://marketflow.readthedocs.org/

License: BSD 2-Clause "Simplified" License

Python 18.04% Jupyter Notebook 81.96%

marketflow's People

juanshishido rdhyee yangraymond minhpascal rpatil524 glass-bead-labs mp3201

marketflow's Issues

Setup.py not allowing module import

After python setup.py install, I'm attempting to import raw_taq from taq and getting this error:

Traceback (most recent call last):
  File "taq/generate_test_data.py", line 6, in <module>
    from pytaq import raw_taq
ImportError: No module named 'pytaq'
Majora:python-taq dillon$ python taq/generate_test_data.py /Users/dillon/transfer/EQY_US_ALL_BBO_20140206.zip test_data_public.zip 10
Traceback (most recent call last):
  File "taq/generate_test_data.py", line 6, in <module>
    from taq import raw_taq
ImportError: No module named 'taq'
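A quick way to see which package name, if any, ended up importable after installation is to ask the import system directly. This is only a diagnostic sketch: the helper name is_importable is made up for illustration, and it handles top-level module names only.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module of this name can be found on sys.path."""
    return importlib.util.find_spec(module_name) is not None


# Check both names that appear in the tracebacks above.
for name in ("taq", "pytaq"):
    print(name, "importable:", is_importable(name))
```

If neither name is importable, the packages listed in setup.py are a likely place to look next.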

Add a test for generator correctness

How could we write a test that illustrates the problem @rdhyee fixed in #17?

Consider approaches to modifying TAQ2Chunks behavior

This would include how the data is chunked (rows, symbols) as well as where it's going.

Odo can convert numpy chunks to whatever. Blaze is a related technology that is good at slicing and dicing data.
Martijn Faassen (author of the lxml bindings for Python) wrote a generics library called Reg, inspired by Zope interfaces. Probably overkill, but worth thinking about.
Subclassing.

Consider trimming strings with gratuitous whitespace

Essentially varchar strings in pytables:

http://www.pytables.org/cookbook/hints_for_sql_users.html#column-type-declarations
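As a minimal sketch of the trimming itself, NumPy can strip padding from a fixed-width bytes column in one call. The sample symbols below are invented for illustration, and whether trimming should happen before or after writing to PyTables is left open here.

```python
import numpy as np

# Fixed-width symbol column padded with trailing spaces (made-up sample values).
symbols = np.array([b"AAPL  ", b"IBM   "], dtype="S6")

# np.char.strip removes leading and trailing whitespace element-wise.
trimmed = np.char.strip(symbols)
print(trimmed.tolist())
```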

H5_TIME is unsupported...

So, we shouldn't use it. This isn't a huge issue, and it actually simplifies the logic of our code: no more special handling of a time column, it's just a float64 in an HDF5 file.

I identified this in the HDF5 docs after being referred there by h5py/h5py#360. See also this.

Not sure whether this is related to dlab-projects/dlab-finance#60
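For illustration, a float64 time column could be computed along these lines. This is a sketch under assumptions, not the library's code: the function name is hypothetical, midnight_epoch is assumed to be the POSIX timestamp of midnight on the trading day, and msec is assumed to hold seconds and milliseconds combined (0 to 59999).

```python
import numpy as np


def to_epoch_seconds(midnight_epoch, hour, minute, msec):
    """Combine time-of-day fields into float64 seconds since the epoch.

    Assumes `msec` holds seconds * 1000 + milliseconds within the minute.
    """
    return (midnight_epoch
            + hour.astype(np.float64) * 3600.0
            + minute.astype(np.float64) * 60.0
            + msec.astype(np.float64) / 1000.0)


# 09:30:01.500 on a day whose midnight is taken as 0.0 for simplicity.
times = to_epoch_seconds(0.0,
                         np.array([9], dtype="i1"),
                         np.array([30], dtype="i1"),
                         np.array([1500], dtype="<u2"))
print(times)
```

The result is an ordinary float64 array, so it can be stored without any HDF5 time type.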

Retain date in string format to facilitate matching with CRSP dates

Implement splitting securities into separate chunks

It makes sense to do this for performance reasons. We also want to store separate tables in HDF5 / pytables.
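One possible approach, assuming the records arrive as a NumPy structured array already sorted by symbol, is to cut the array wherever the symbol changes. The function name and sample records below are hypothetical.

```python
import numpy as np


def split_by_symbol(records, field="Symbol_root"):
    """Yield (symbol, chunk) pairs from records that are sorted by `field`."""
    symbols = records[field]
    if len(symbols) == 0:
        return
    # Indices where the symbol differs from the previous row.
    boundaries = np.flatnonzero(symbols[1:] != symbols[:-1]) + 1
    for chunk in np.split(records, boundaries):
        yield chunk[field][0], chunk


# Made-up sample: two rows for one symbol, one row for another.
records = np.array([(b"A", 1.0), (b"A", 2.0), (b"B", 3.0)],
                   dtype=[("Symbol_root", "S6"), ("Bid_Price", "<f8")])
chunks = list(split_by_symbol(records))
for symbol, chunk in chunks:
    print(symbol, len(chunk))
```

Each chunk could then be written to its own table, one per security.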

Can't read new TAQ files

Working with EQY_US_ALL_BBO_20150731.zip results in BaseException("Can't map fields onto bytes_per_line")

raw_taq error call needs Error type

Line 179 in taq/raw_taq.py raises an undefined error class, Error:

/Users/dillon/Dropbox/dlab/python-taq/taq/raw_taq.py in check_present_fields(self)
    177                 return
    178 
--> 179         raise Error("Can't map fields onto bytes_per_line")
    180 
    181 

NameError: name 'Error' is not defined
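One option is to define a dedicated exception class rather than relying on an undefined name (or on BaseException, which most except clauses deliberately do not catch). The class name TAQFormatError, the helper, and the line lengths below are all hypothetical.

```python
class TAQFormatError(ValueError):
    """Raised when a TAQ file's line length matches no known field layout."""


def check_line_length(bytes_per_line, known_lengths):
    """Return bytes_per_line if it is a known layout, else raise TAQFormatError."""
    if bytes_per_line not in known_lengths:
        raise TAQFormatError(
            "Can't map fields onto bytes_per_line={}".format(bytes_per_line))
    return bytes_per_line


# Illustrative lengths only; real layouts would come from the TAQ specification.
try:
    check_line_length(7, known_lengths={10, 12})
except TAQFormatError as err:
    print("caught:", err)
```

Subclassing ValueError keeps the error catchable by callers that already handle ValueError.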

Can't read old TAQ files

taq2h5 EQY_US_ALL_BBO_20111101.zip results in ValueError: no field of name Retail_Interest_Indicator_RPI.

Why don't our tests catch this? Well - we don't have a test datafile for each epoch. @yangraymond I guess you won't have time to create such a thing before you go?
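As a sketch of one way to tolerate epochs that lack newer fields, the code could check which field names a chunk's dtype actually provides before accessing them. The helper name and the reduced dtype below are invented for illustration.

```python
import numpy as np


def present_fields(dtype, wanted):
    """Return the names in `wanted` that exist in the structured `dtype`."""
    return [name for name in wanted if name in dtype.names]


# Hypothetical dtype for an older epoch that lacks the RPI field.
old_epoch = np.dtype([("Bid_Price", "<f8"), ("Ask_Price", "<f8")])
found = present_fields(old_epoch,
                       ["Bid_Price", "Retail_Interest_Indicator_RPI"])
print(found)
```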

Figure out unique IDs from CRSP

We should figure out if we can legally distribute those IDs.

Clean up clean_dsenames and related files

In particular, once you're sure you got what you need from them, you can delete the "Dav" versions of your notebooks!

Upload marketflow to various repositories

Probably PyPI, but also anaconda.org. @davclark can help with conda packaging.

Strange symbol in 2014 data

We find ZXYZ.A in EQY_US_ALL_BBO_20140213. This works as a valid HDF5 identifier, so I've not changed it.

Script to pull down test data from box into test-data directory

Ask Dav for details if unclear!

cc @jaysid95

Create a basic test to understand pytest

Something like:

Create f(x) = x + 2
Test that f(x) = x + 3
See that test fails
Fix the failing test
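The end state of those steps might look like the sketch below, saved in a file such as test_basic.py (the file name is only a suggestion; pytest collects files matching test_*.py). The failing version would assert that f(1) == 4, i.e. x + 3.

```python
def f(x):
    """Add two to x."""
    return x + 2


def test_f_adds_two():
    # The deliberately wrong first draft would be: assert f(1) == 4
    assert f(1) == 3
```

Running pytest in the directory containing the file discovers and runs test_f_adds_two.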

Feel free to do that in the repo... it'll help us move forward in having a test directory in the right place. Let's start with pytest.

autodoc not working on ReadTheDocs

No idea why. Builds report no errors.

Bcolz conversion pipeline

First - see if you can use blaze / odo to convert to bcolz. If that's not easy, just use a structure similar to hdf5.py

Review of testing code

@yangraymond leads us in a tour of his testing code.

Clean up code from dlab-finance and set up CRSP-related architecture

cc @max-eddy @Jay4869

Feel free to bug me about this.

Document - probably with Sphinx or AsciiDoc

Can either of these systems incorporate type hinting?
What's a good way to publish (RTD, gh-pages, etc.)?

Create basic newbie documentation

Create basic tests

For now, these will use actual TAQ data files that we can't legally share, so make a .gitignored data directory for tests.

For now let's use pytest.

@rdhyee knows some good files to use (so does @Jay4869).

cc @juanshishido

raw_taq does not convert chunks back into TAQ format

This functionality is needed for writing test data: i.e. we need to read in TAQ data, anonymize it, and write it back to TAQ format.

The current workaround takes individual rows from the TAQ2Chunks output, converts them to strings with numpy.to_string, and appends them to a file with b'\n'. This creates data that looks like TAQ data, but causes TAQ2Chunks to throw an error about mapping fields from the file:

---------------------------------------------------------------------------
BaseException                             Traceback (most recent call last)
<ipython-input-6-dfe8991edd77> in <module>()
----> 1 generator = raw_taq.TAQ2Chunks('test.zip', do_process_chunk=True, chunksize=1000)

/Users/dillon/Dropbox/dlab/python-taq/taq/raw_taq.py in __init__(self, taq_fname, chunksize, do_process_chunk, chunk_type)
    213         if chunk_type == 'lines':
    214             self.iter_ = self._convert_taq()
--> 215             next(self.iter_) #read first line and setup attributes
    216         elif chunk_type == 'symbols':
    217             self.iter_ = self._symbol_taq() #make symbol_taq top level iter

/Users/dillon/Dropbox/dlab/python-taq/taq/raw_taq.py in _convert_taq(self)
    253                         self.bytes_spec = \
    254                             BytesSpec(bytes_per_line,
--> 255                                       computed_fields=[('Time', np.float64)])
    256                                       # We want this for making the PyTables
    257                                       # description:

/Users/dillon/Dropbox/dlab/python-taq/taq/raw_taq.py in __init__(self, bytes_per_line, computed_fields)
    105         '''
    106         self.bytes_per_line = bytes_per_line
--> 107         self.check_present_fields()
    108 
    109         # The "easy" dtypes are the "not datetime" dtypes

/Users/dillon/Dropbox/dlab/python-taq/taq/raw_taq.py in check_present_fields(self)
    177                 return
    178 
--> 179         raise BaseException("Can't map fields onto bytes_per_line")
    180 
    181 

BaseException: Can't map fields onto bytes_per_line
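Since the error is about mapping fields onto bytes_per_line, one thing that may be worth checking is whether every written line, terminator included, has the length the reader expects. The writer below is a sketch under assumptions: the function name is hypothetical, and the b"\r\n" terminator is a guess that should be checked against a real TAQ file.

```python
import io


def write_fixed_width(lines, out_file, bytes_per_line, line_end=b"\r\n"):
    """Write fixed-width byte records, verifying each total line length.

    `line_end` is an assumption; compare it with the source file's terminator.
    """
    for line in lines:
        record = line + line_end
        if len(record) != bytes_per_line:
            raise ValueError(
                "expected {} bytes per line, got {}".format(
                    bytes_per_line, len(record)))
        out_file.write(record)


# Made-up 8-byte records plus a 2-byte terminator give 10 bytes per line.
buffer = io.BytesIO()
write_fixed_width([b"AAAA0001", b"BBBB0002"], buffer, bytes_per_line=10)
print(buffer.getvalue())
```

Failing early on a wrong length makes a mismatched terminator visible at write time rather than when the file is read back.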

generator = raw_taq.TAQ2Chunks(fp, chunk_type='symbols', chunksize=100000)
next(generator)

yields:

array([], 
      dtype=[('Time', '<f8'), ('hour', 'i1'), ('minute', 'i1'), ('msec', '<u2'), ('Exchange', 'S1'), ('Symbol_root', 'S6'), ('Symbol_suffix', 'S10'), ('Bid_Price', '<f8'), ('Bid_Size', '<i4'), ('Ask_Price', '<f8'), ('Ask_Size', '<i4'), ('Quote_Condition', 'S1'), ('Market_Maker', 'S4'), ('Bid_Exchange', 'S1'), ('Ask_Exchange', 'S1'), ('Sequence_Number', '<i8'), ('National_BBO_Ind', 'S1'), ('NASDAQ_BBO_Ind', 'S1'), ('Quote_Cancel_Correction', 'S1'), ('Source_of_Quote', 'S1'), ('Retail_Interest_Indicator_RPI', 'S1'), ('Short_Sale_Restriction_Indicator', 'S1'), ('LULD_BBO_Indicator_CQS', 'S1'), ('LULD_BBO_Indicator_UTP', 'S1'), ('FINRA_ADF_MPID_Indicator', 'S1'), ('SIP_generated_Message_Identifier', 'S1'), ('National_BBO_LULD_Indicator', 'S1')])