@rgrp @tryggvib (and of course anyone else)
In my work so far on API design for the validation pipeline, an important issue has been (a) how to parse a stream into some kind of object-oriented interface, and (b) which backend to choose for this (and why). There are many ways to go about this, of course, and I won't go over them all; some earlier discussion took place here.
Anyway, the bottom line is that I decided to depend on Pandas to read data (CSV, and also JSON), and I currently work with that data as a Pandas DataFrame.
However, I do not want to expose all the DataFrame API directly, for the following reasons:
- Pandas is really just a backend here: end users and clients mostly just need to know about headers, rows and columns
- Future changes to this backend should have minimal impact on public APIs
- Pandas provides different interfaces for parsing CSV and JSON; I would like to provide a single interface and pass off to the appropriate parser as required
So, locally I'm now working with a DataTable class that wraps a Pandas DataFrame (here is a WIP of the code), and only exposes the properties we need for the validation pipeline (all DataFrame properties can still be accessed via self._frame).
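For anyone who hasn't looked at the WIP yet, the shape of the idea is roughly this. Note this is my own sketch, not the actual WIP code, and the property names are assumptions:

```python
import pandas as pd


class DataTable(object):
    """Thin wrapper over a pandas DataFrame.

    Only exposes what the validation pipeline needs (headers, rows,
    columns); the full DataFrame API remains reachable via self._frame.
    """

    def __init__(self, frame):
        self._frame = frame

    @property
    def headers(self):
        return list(self._frame.columns)

    @property
    def values(self):
        # Rows as plain tuples, hiding the pandas index from clients
        return [tuple(row) for row in self._frame.itertuples(index=False)]

    def get_column(self, name):
        return self._frame[name].tolist()
```

The public surface stays small, so a later switch away from Pandas would only touch the internals of this class.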
Each validator package will depend on this DataTable class, so it will become another package in the suite.
Any thoughts/comments before I commit to this pattern?
One reason I'm even writing this up as an issue is the number of "DataTable"-like interfaces already around in Python (csv.DictReader, Tablib, Dataset, Pandas' DataFrame), so I'm a bit wary of my own need to package up yet another one.