debrief / pepys-import

Support library for Pepys maritime data analysis environment

Home Page: https://pepys-import.readthedocs.io/

License: Apache License 2.0

Makefile 0.07% Python 87.64% Batchfile 0.19% PowerShell 0.71% PLpgSQL 1.69% Mako 0.02% Jupyter Notebook 0.11% JavaScript 4.24% CSS 4.47% HTML 0.85% Dockerfile 0.02%
python postgis spatialite open-source analysis data-science etl hacktoberfest

pepys-import's Introduction


Supported by

The development of Debrief has been supported by Oxygen XML


The development of Debrief has also been supported by the YourKit Java Profiler


pepys-import's People

Contributors

barissari, ianmayo, krsnik93, mew-nsc, rnllv, robintw, samwilliams-dev, snyk-bot, tahirs95


pepys-import's Issues

Speed Unit Failing

When a new object is pushed to the State table, the speed attribute causes the following error:

Screenshot from 2019-11-07 15-27-23

Screenshot from 2019-11-07 15-29-21

Spatialite DLL not registered for MS-Windows

When we run the spatialite tests on MS-Windows, they fail with this error:

  File "C:\git\pepys-import\pepys_import\utils\geoalchemy_utils.py", line 15, in load_spatialite
    connection.load_extension(EXTENSION_PATH)
TypeError: argument 1 must be str, not None

I now recognise that the error occurs because EXTENSION_PATH isn't being provided; we only set it for Linux and macOS.

2.2.1 GPX Importer

The clients regularly receive data in GPX format, including from:

  • GPS Wristwatches
  • GPS Trackers temporarily attached to smaller ships
  • Some types of fixed-wing aircraft

I believe it's possible to represent tracks in GPX in a range of structures, so we'll have to start off by supporting our sample data-files. I'll have a quick look for sample data on the Internet, and include some links in this issue.

Need to rename package

We're using an illegal package name - Python package names can't contain a "-", only a "_".

(screenshot)

We've probably got to rename it on PyPI and Read the Docs.

I'd rather not rename the GitHub repo, but we probably should.

Oops.

Methods to prepare/populate database

This task is to introduce methods to prepare the datastore:
https://docs.google.com/drawings/d/1CXE6pW6lvAKcCP8O1spRNSYBwl6ajJ3HX9dCSRm7nK0/edit

  • initialise() - already present
  • populateReference()
  • populateMetadata()
  • populateMeasurements()

The repo currently contains Python scripts to populate the data-store. The commands will be moved to an initialise method.

The other commands will insert data into the reference and/or metadata/measurements tables.

If the .csv files are created with a few dummy lines, I will step in and provide more data for them. The .csv files should be structured with the field names in the header row, suitable for reading with the Python csv library.
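
For reference, a file laid out like that would be read with the csv library along these lines (the filename and column name here are just illustrative, not the real schema):

    import csv

    # Illustrative only: filename and column names are not the real schema.
    with open("PlatformTypes.csv", newline="") as f:
        reader = csv.DictReader(f)  # the header row supplies the field names
        for row in reader:
            print(row["name"])      # each row is a dict keyed by the header fields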

Initially, measurements are only required for the State and Contacts tables.

3.1.1 Identify initial set of import tests

Pepys will perform automated testing of imported data.

In the longer term these tests will deliver value in verifying the content of poor quality datafiles (typically human generated). But, in the

To do this, we have to first identify a set of tests to apply. This issue will act as a temporary placeholder for these tests.

None:

For very high volume data-sets that consistently contain zero defects, QA testing can be omitted.

Basic:

  • lat >= -90 && <= +90
  • long >= -180 && <= +180
  • course/heading >=0 && < 360 (data stored in degs)

Enhanced:

Note: while they are significantly more expensive (in processing time), these enhanced tests actually serve to verify the lat/long, in addition to the course/speed/heading. Both levels are sketched in code after the list below.

  • course/heading - loosely match delta between last two locations (+/- 90 degs)
  • speed - loosely matches delta between last two locations (+/- speed * 10).
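
A rough sketch of both levels of check; the field names, helper shape and tolerances are my assumptions based on this issue, not existing Pepys code:

    import math

    def basic_checks(lat, lon, heading):
        """Return a list of failures for one imported fix (degrees throughout)."""
        failures = []
        if not -90 <= lat <= 90:
            failures.append(f"latitude {lat} outside [-90, 90]")
        if not -180 <= lon <= 180:
            failures.append(f"longitude {lon} outside [-180, 180]")
        if not 0 <= heading < 360:
            failures.append(f"heading {heading} outside [0, 360)")
        return failures

    def course_check(prev, curr, heading, tolerance=90.0):
        """Enhanced check: heading should loosely match the bearing between the
        last two (lat, lon) locations, within +/- tolerance degrees."""
        lat1, lon1 = map(math.radians, prev)
        lat2, lon2 = map(math.radians, curr)
        y = math.sin(lon2 - lon1) * math.cos(lat2)
        x = math.cos(lat1) * math.sin(lat2) - math.sin(lat1) * math.cos(lat2) * math.cos(lon2 - lon1)
        bearing = math.degrees(math.atan2(y, x)) % 360
        delta = abs((heading - bearing + 180) % 360 - 180)  # smallest angular difference
        return delta <= tolerance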

Start with new way of writing asserts in tests?

Now that we're using pytest (and it's documented in the Developer Guide, and included in requirements_dev.txt) we have the option of using a new style of assertion tests.

Currently in the code we do something like self.assertEqual(len(datafiles), 0). With pytest, we can replace this with assert len(datafiles) == 0. Not only does this save typing and look cleaner, it also gives us better output when debugging.

With the original style we get:

            states = self.store.session.query(self.store.db_classes.State).all()
>           self.assertEqual(len(states), 9999)
E           AssertionError: 0 != 9999

whereas now we get:

            states = self.store.session.query(self.store.db_classes.State).all()
>           assert len(states) == 9999
E           AssertionError: assert 0 == 9999
E            +  where 0 = len([])

which shows us where the 0 has come from. The Pytest docs show even more clever things it can do - like showing you where strings differ, what the differing elements of a set are etc.

There are also far nicer ways of testing for exceptions to be raised (see docs).
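
For example, with a made-up helper (not a real pepys-import function):

    import pytest

    def parse_speed(text):
        return float(text)

    def test_bad_speed_raises():
        # pytest.raises replaces unittest's assertRaises boilerplate
        with pytest.raises(ValueError):
            parse_speed("not-a-number")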

We don't have to go back and change old tests to use the new format, as both formats can co-exist happily and pytest will work fine with both. Therefore, I propose that new tests we write should be written in the new, cleaner, format.

What do you think?

Sensors are per-platform

In the sensors table, we will be storing the sensors for specific platforms.

So, there won't just be a single "GPS" sensor. There will be a GPS sensor for each platform that is carrying a GPS sensor (or even multiple instances per platform).

That is the role of the host field in the Sensors table.

This is a current code snippet for getting a sensor.

                sensor = platform.get_sensor(
                    session=data_store.session,
                    all_sensors=all_sensors,
                    sensor_name="E-Trac",
                    sensor_type="GPS",
                    privacy="TEST",
                )

Note that all_sensors currently includes every sensor in the database. Instead, within platform.get_sensor, all_sensors should be the list of sensors for this platform.
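
A sketch of the proposed behaviour, assuming the SQLAlchemy models used elsewhere in the repo (the attribute names are assumptions):

    def sensors_for_platform(session, db_classes, platform):
        """Return only the sensors hosted by this platform, not all sensors."""
        return (
            session.query(db_classes.Sensor)
            .filter(db_classes.Sensor.host == platform.platform_id)
            .all()
        )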

Black isn't being run as pre-commit

I've just made some changes, and pushed to a PR.

The PR failed because of Black issues.

I then ran black from command-line, and it fixed the stuff.

But, I was expecting it to automatically run on the files.

Switch Travis to 3.7

Travis is currently running the test/build on 3.8, which is the newest version of Python.

But, our deployment target is 3.7, so it would be sensible to always run our test suite against that version.

README Instructions

The README instructions are not clear at the moment. Therefore, the project should be set up in a fresh environment and each step should be written down.

Create simple example script that parses a file and adds to database

I can't see anything quite like this in the repo at the moment, but it's possible I'm not looking in the right place.

What I'm looking for is a script that will exercise a reasonable chunk of the pepys-import library, by taking a file and importing it into a SQLite database. I need some way to test a deployment of pepys-import, without having to have all of the infrastructure around the formal unit tests set up (e.g. having pytest and testing.postgresql installed, and so on).

This could be strongly related to #84 - as creating the pepys_import Python script, and providing it with an example file to import, could satisfy this issue.
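
Something along these lines is what I have in mind - a rough sketch only; the module paths and signatures are my guesses based on calls that appear elsewhere in the repo, not a working script:

    import sys

    from pepys_import.core.store.data_store import DataStore
    from pepys_import.file.file_processor import FileProcessor

    def main(path):
        # connect to (and create) a local SQLite database
        store = DataStore("", "", "", 0, "pepys_import.db", db_type="sqlite")
        store.initialise()
        processor = FileProcessor()
        processor.process(path, store)  # import one file (or folder) into SQLite

    if __name__ == "__main__":
        main(sys.argv[1])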

Incorrect number of tables created for sqlite on Windows

Seem to be getting this error for test_sqlite_initialise on Windows only (all tests passed on my Mac) - need to investigate why, possibly something to do with the setup of SQLite and spatialite creating extra tables...

self = <tests.test_data_store_initialise.DataStoreInitialiseSpatiaLiteTestCase testMethod=test_sqlite_initialise>

    def test_sqlite_initialise(self):
        """Test whether schemas created successfully on SQLite"""
        data_store_sqlite = DataStore("", "", "", 0, ":memory:", db_type="sqlite")

        # inspector makes it possible to load lists of schema, table, column
        # information for the sqlalchemy engine
        inspector = inspect(data_store_sqlite.engine)
        table_names = inspector.get_table_names()

        # there must be no table at the beginning
        self.assertEqual(len(table_names), 0)

        # creating database from schema
        data_store_sqlite.initialise()

        inspector = inspect(data_store_sqlite.engine)
        table_names = inspector.get_table_names()

        # 36 tables + 36 spatial tables must be created. A few of them tested
>       self.assertEqual(len(table_names), 72)
E       AssertionError: 74 != 72

tests\test_data_store_initialise.py:99: AssertionError

Jupyter Notebook File Failing

When the Jupyter notebook file (/examples/notebooks/data_store_sqlite.ipynb) is run, it throws the following error:

OperationalError: (sqlite3.OperationalError) no such table: Datafiles
[SQL: SELECT "Datafiles".datafile_id AS "Datafiles_datafile_id", "Datafiles".simulated AS "Datafiles_simulated", "Datafiles".reference AS "Datafiles_reference", "Datafiles".url AS "Datafiles_url", "Datafiles".privacy_id AS "Datafiles_privacy_id", "Datafiles".datafile_type_id AS "Datafiles_datafile_type_id" 
FROM "Datafiles" 
WHERE "Datafiles".reference = ?
 LIMIT ? OFFSET ?]
[parameters: ('../repl_files/missing_platform.rep', 1, 0)]

Screenshot from 2019-11-07 13-33-51

We should examine and fix it.

Pre-pend ASCII portrait in welcome banner

I've produced this ASCII version of a portrait of Pepys:

    `:ssoo+`      
   /mdsohmMN+`    
  .MM+-.-:MMMd`   
  .NMs..`/NMMMy`  
  +Mm/--../hNMMy. 
  yM:`::-.-/-NMMh 
  /o`....`.` :ooo`

It would be good if we could (somehow) prepend it to the Pepys-Import text.

Here's another:

   @@@@ @@@@@       
 @@@@@ @@..@@@@     
  @@@     @@@@@@    
  @@@     @@@@@@@@  
 @@@@.     @@@@@@@  
 @@ #@@ @   ...@@@@@
 @@   &        @@@@@

and another:

 `yMmysymMMN+`     
.NMM+:`//.NMMm.    
.MMM -. -.MMMMMd`   
.NMMo '-  mMMMMMs   
-NMMh -o- mMMMMMs  
mMMMM:....-dMMMNy.
dMy` --/`  ...sMMMM:

3.1.2 Implement initial import tests

We're developing a set of import tests in #93

We need to both implement the tests, and devise an extensible framework for running the tests.

This implementation is detailed here: https://docs.google.com/document/d/1Wjf2XUOMXBsPavzaH9yHf50h4kLSSNXmi5AXxRD9Tr4/edit#heading=h.cfx6v5jgl5es

The strategic approach taken is also being considered here: https://docs.google.com/drawings/d/1JHkjhMw4YRgET8F4u3BSeo-tc4n8kPQ0YnuGEH2rTRM/edit

Ian would like the parsers to be as dumb as possible, keeping as much functionality as possible in our tested, supported processing library (and similarly minimising the volume of code that the data scientists maintain).

So, for this reason Ian would like the test-suite to be run in file_processor.py. An issue with this is that different parsers ask for different levels of testing. So, if multiple parsers handle a file, we may be running different levels of testing on the data that's sitting in "temporary storage" before being pushed to the database.

It looks like the relevant place to do the testing is when a parser returns control to file_processor. Then file_processor could run the relevant level of testing against the records read in by that parser.

This would mean we have to do something like:

  1. capture measurements.size() before running a parser, then run the tests from that row to the end of measurements.
  2. after running tests, submit those measurements to the database - which empties the measurements array
  3. store the measurements for each parser into a hashmap indexed by the parser, then run all tests at the end, before submitting.

Maybe the strategy should be led by what happens if there's a failure. We can't submit just the records from one parser: if the tests for a subsequent parser fail, the analyst will fix the file and resubmit it, and the records from the first parser would then be submitted to the database again. Hmm, so I think we should run all of the parsers, and all of the tests, before we fail - then we can give the analyst a full to-do list of what needs fixing.
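
A sketch of how that could look - every name here is a placeholder, not a real pepys-import call:

    def process_with_deferred_checks(parsers, datafile, run_import_tests, submit, report):
        """Run every parser, test only the rows each parser produced, and only
        submit if nothing failed. All arguments are hypothetical callables."""
        measurements = []
        failures = []
        for parser in parsers:
            start = len(measurements)                # rows present before this parser ran
            parser.parse(datafile, measurements)     # parser appends its records
            failures += run_import_tests(measurements[start:], level=parser.test_level)
        if failures:
            report(failures)      # full to-do list for the analyst; submit nothing
        else:
            submit(measurements)  # one commit for the whole file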

Mac-specific issue

This is bound to be an issue with my config, rather than pepys-import, but I'm unable to review code until I fix it :-(

(well, I'm trying to get a MS-Windows VM working, so I can test/review in there).

> python3  -m unittest tests/test_parse_etrac.py 
started
Software Version :  0.0.4 
Database Type :  sqlite
Database Name :  :memory:
Database Host :  
---------------------------------
before connect
Segmentation fault: 11

The before connect debug-line comes from here:
(screenshot)

So, I'm pretty sure it's coming from self.engine.connect().

File parsers: we need to defer "load" until the end

We currently have two file parsers, nmea and replay. I'm currently working on e-trac.

The parsers follow this logic:

  • loop through lines in file
    • collate data from one line, or multiple lines
    • check we have the data we need, then:
      • open a database connection
      • get the datafile
      • get the platform
      • get the sensor
      • create the new state object
      • configure the state fields
      • if the datafile validates:
        • submit the state

But, then we're opening/closing the database connection lots of times (potentially once per line). We're also finding the datafile multiple times, even though we know we're just processing a single datafile. And lastly, we're submitting those states one at a time.

I think a better logic would be (a rough sketch in code follows the list):

  • open a database connection
  • get the datafile
  • loop through the lines:
    • collate the data
    • when we have our data:
      • get the platform
      • get the sensor
      • create the state (from the datafile)
      • configure the state fields
  • check the datafile validates:
    • submit the datafile (well, all measurements in the datafile)
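
Sketched out, with placeholder helper calls and signatures (nothing here is the real API):

    def import_datafile(data_store, path, parse_lines, validate):
        """One connection, one datafile lookup, one submit at the end."""
        with data_store.session_scope():              # open the connection once
            datafile = data_store.get_datafile(path)  # look the datafile up once
            for fields in parse_lines(path):          # collate data line by line
                platform = data_store.get_platform(fields["platform"])
                sensor = platform.get_sensor(fields["sensor"])
                datafile.create_state(sensor, fields)  # create and configure the state
            if validate(datafile):
                datafile.commit(data_store.session)   # submit all measurements together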

Pepys-Import should dynamically load user parsers

Pepys will use two kinds of parsers to import data:

  • standard parsers that are distributed with Pepys
  • custom parsers maintained by the user community

For these two sets, the parsers should be dynamically loaded in this way:

  • load all files from the pepys_import\file\parsers folder
  • check for the existence of a PEPYS_PARSERS environmental variable. If it exists, also load all the parsers from that folder.

As we're loading the parsers, we should probably do some kind of checking on them, to verify they have the import-test and parse methods we're expecting.
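
A sketch of that loading scheme; the method names we check for on a parser module are assumptions:

    import importlib.util
    import os
    from pathlib import Path

    def load_parsers(builtin_dir, env_var="PEPYS_PARSERS"):
        """Load parser modules from the built-in folder plus an optional user folder."""
        folders = [Path(builtin_dir)]
        extra = os.environ.get(env_var)        # optional user-supplied parser folder
        if extra:
            folders.append(Path(extra))

        parsers = []
        for folder in folders:
            for py_file in folder.glob("*.py"):
                spec = importlib.util.spec_from_file_location(py_file.stem, py_file)
                module = importlib.util.module_from_spec(spec)
                spec.loader.exec_module(module)  # import the candidate parser module
                # basic sanity check: does it expose the methods we expect?
                if hasattr(module, "can_load_this_file") and hasattr(module, "parse"):
                    parsers.append(module)
        return parsers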

Design logic for parsing multiple files

Library users will create importers, and register them with our library.

Then they will instruct our library to load a single file, or to recurse through a set of folders.

For each file, we will check with each importer if it can handle that kind of file. To allow this, importers will either provide some metadata describing their capability, or will implement a set of methods that enable it to determine if it can load that file.

Here are an initial set of tests:

  • suffix (.rep)
  • pattern in filename (regex)
  • first line (;;DEBRIEF DATA)
  • presence of some marker line, appearing further down the file (SERIAL 1 Data:)

This task is to produce a proposal (with pseudocode) for how to implement the above

The proposal will be discussed by the project team.
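
To seed that discussion, a rough sketch of the checks listed above - the API shape is an assumption and the defaults are just the examples from this issue:

    import re

    def can_load(path, suffix=".rep", name_pattern=None,
                 first_line=";;DEBRIEF DATA", marker="SERIAL 1 Data:"):
        """Apply the suffix, filename, first-line and marker-line checks."""
        if not path.lower().endswith(suffix):
            return False
        if name_pattern and not re.search(name_pattern, path):
            return False
        with open(path, errors="ignore") as f:
            lines = f.read().splitlines()
        if lines and not lines[0].startswith(first_line):
            return False
        return any(marker in line for line in lines)  # marker somewhere further down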

Create pepys_import command-line script

#44 says that we want to run a pepys_import batch script which will run a Python script to call the file processor with the current directory as an argument. We need to write that Python script.

Welcome banner

When the command line app opens, it would strengthen our branding if we had some ASCII banner text.

I suggest we display Pepys_import in the Doom font, like in this app:
http://patorjk.com/software/taag/#p=display&f=Doom&t=Pepys_import

It should look like this:
(screenshot)

Aah, we should also show the software build version, plus a summary of the database connection (SQLite vs Postgres, SQLite filename, Postgres hostname/db-name).

The API for creating a new data-store includes boolean flags for welcome banner & initial status.

Switch README from .md to .rst

It looks like our Sphinx doc generation (including read-the-docs) favours .rst over .md

Could you please rewrite our README in .rst and rename the file to `README.rst`?

Then the automatic docs should be formatted more nicely.

Add pytest to requirements so we can get nicer test output

pytest is now considered the standard Python testing framework to use. It is fully compatible with the built-in unittest module, although it does allow some nicer code structures for testing (e.g. assert x in y rather than self.assertIn(x, y)). More importantly, it produces far nicer output when running tests (including colour) - plus debugging information that you don't normally get in unittest.

For example:

python -m unittest tests\test_data_store_api_spatialite.py produces output like:

______                      _                            _
| ___ \                    (_)                          | |
| |_/ /__ _ __  _   _ ___   _ _ __ ___  _ __   ___  _ __| |_
|  __/ _ \ '_ \| | | / __| | | '_ ` _ \| '_ \ / _ \| '__| __|
| | |  __/ |_) | |_| \__ \ | | | | | | | |_) | (_) | |  | |_
\_|  \___| .__/ \__, |___/ |_|_| |_| |_| .__/ \___/|_|   \__|
         | |     __/ | ______          | |
         |_|    |___/ |______|         |_|

Software Version :  0.0.4


Database Type :  sqlite
Database Name :  :memory:
Database Host :
---------------------------------
.______                      _                            _
| ___ \                    (_)                          | |
| |_/ /__ _ __  _   _ ___   _ _ __ ___  _ __   ___  _ __| |_
|  __/ _ \ '_ \| | | / __| | | '_ ` _ \| '_ \ / _ \| '__| __|
| | |  __/ |_) | |_| \__ \ | | | | | | | |_) | (_) | |  | |_
\_|  \___| .__/ \__, |___/ |_|_| |_| |_| .__/ \___/|_|   \__|
         | |     __/ | ______          | |
         |_|    |___/ |______|         |_|

Software Version :  0.0.4


Database Type :  sqlite
Database Name :  :memory:
Database Host :
---------------------------------
.______                      _                            _
| ___ \                    (_)                          | |
| |_/ /__ _ __  _   _ ___   _ _ __ ___  _ __   ___  _ __| |_
|  __/ _ \ '_ \| | | / __| | | '_ ` _ \| '_ \ / _ \| '__| __|
| | |  __/ |_) | |_| \__ \ | | | | | | | |_) | (_) | |  | |_
\_|  \___| .__/ \__, |___/ |_|_| |_| |_| .__/ \___/|_|   \__|
         | |     __/ | ______          | |
         |_|    |___/ |______|         |_|

Software Version :  0.0.4


Database Type :  sqlite
Database Name :  :memory:
Database Host :
---------------------------------
<snipped>

where it is hard to see which tests ran and when.

pytest tests\test_data_store_api_spatialite.py produces output like:

================================================================================ test session starts =================================================================================
platform win32 -- Python 3.7.6, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: c:\IanMayo\pepys-import
collected 36 items

tests\test_data_store_api_spatialite.py ..............xxss.......xs.x.sss...                                                                                                    [100%]

===================================================================== 26 passed, 6 skipped, 4 xfailed in 10.50s ====================================================================== 

Command Line Resolver

Analysts will use the Command Line Resolver to interactively fill in missing fields.

The pepys-import command-line mockup included some "tricks" to save time, such as copying data from another platform.

I had a meeting with the clients today, and we agreed this wasn't necessary. A lot of the data (own platforms) will be pre-generated. The import parsers will also intelligently insert other data.

So, the command-line resolver will just work through the missing data. But, wherever possible we'll use the fzf search capability to make it quicker for the user to select existing values.

Aah, I think we do still need some logic for this, here are my first thoughts:
https://docs.google.com/spreadsheets/d/1vE1nP_ax89kqAjhYCD_1oM6u6vnzKr2v0Ie46MRJuDU/edit?usp=sharing

Pepys Admin application

In the current phase of development, we will have to introduce the Pepys-Admin application.

Its only function to start with will be to export data, but we can usefully include a database status command too.

Note: this is a python project with an interactive command-line loop: https://docs.python.org/3/library/cmd.html

The application will take an optional command-line parameter, -db. If provided, the application will open that SQLite database. Otherwise, the application will connect to the PostGres instance that's used in the rest of Pepys-Import.

When the application opens, it will display Pepys-Admin banner text, similar to how we do in Pepys-Import. It will then show the database connection settings (again, like import), then the available commands (a minimal sketch using Python's cmd module follows the list):

  • e - Export (to start the export process)
  • i - Initialise (import different sets of stock data)
  • s - Status (to report on the database contents)
  • x - Exit (to exit the application)
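
Something like this, with placeholder handlers - the banner and connection summary would be printed where the intro is set:

    import cmd

    class PepysAdmin(cmd.Cmd):
        prompt = "(pepys-admin) "
        intro = "Pepys-Admin"  # banner + database connection summary go here

        def do_e(self, line):
            """Export data"""
            print("export not implemented yet")

        def do_i(self, line):
            """Initialise (import stock data)"""
            print("initialise not implemented yet")

        def do_s(self, line):
            """Status report on the database contents"""
            print("status not implemented yet")

        def do_x(self, line):
            """Exit the application"""
            return True  # returning True ends the cmd loop

    if __name__ == "__main__":
        PepysAdmin().cmdloop()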

Export

In Release-1, we export by datafile name. So, after selecting Export we will use the fuzzy search to let the user search for a datafile by name.
In Release-2, we export by platform name and date. So, start with a fuzzy search for the platform name, then display a numbered list of datafiles/sensors that contain data for this platform. The user then selects the index number of a datafile, and we export the data for that datafile and sensor. So, I guess the view will look like:

1) GPS File_a 12/07/20 14:22-13/07/20 13:43
2) GPS File_b 13/07/20 14:22-13/07/20 16:43
3) SINS File_c 11/07/20 14:22-13/07/20 17:43
4) WECDIS-GPS File_d 10/07/20 14:22-13/07/20 18:43
5) WECDIS-GPS File_e 15/07/20 14:22-13/07/20 12:43
0) Cancel

Selecting 1 to 5 will then ask the user for the filename (offering the platform name as a default), then export that datafile to the current folder.
Selecting 0 will cancel.
After either, we return to the top-level command loop.

Initialise

This will allow the currently connected database to be configured, using a range of commands - the existing functionality is shown in the code snippet shown below.

In order to allow a "fresh" start, a command will be provided that removes all tables (and data) in the current database.

Then the user will be able to create the Pepys tables within the database. Then the user is able to pull in a range of types of background data, as shown below. But, instead of TEST_DATA_PATH, the relevant files are taken from the current directory.

   self.store.initialise()
   self.store.populate_reference(TEST_DATA_PATH)
   self.store.populate_metadata(TEST_DATA_PATH)
   self.store.populate_measurement(TEST_DATA_PATH)

  1. Clear database
  2. Create Pepys schema
  3. Import Reference data
  4. Import Metadata
  5. Import Sample Measurements
  6. Cancel

Status

Note: see the TableSummarySet class, and its usage in the unit tests.

Show a tabular summary of the database:

## Measurements
States: 2023
Contacts: 2342
Comments: 234
## Metadata
Platforms:  5
Datafiles: 12
(etc)

On completion, return to the command loop.

Exit

Exit the application.

Need command to clear/reset database

The admin tool includes the capability to initialise a database. This includes:

  • insert schemas
  • import content of Reference tables
  • import content of Metadata tables
  • import content of Measurements tables

Along with these, it would be logical to allow the admin console to clear/wipe the connected database. We need a DataStore method to implement this.
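
A sketch of what such a DataStore method could look like, assuming the SQLAlchemy engine attribute used elsewhere in the repo (the method name itself is an assumption):

    from sqlalchemy import MetaData

    def clear_db(self):
        """Drop every table (and its data) in the currently connected database."""
        metadata = MetaData()
        metadata.reflect(bind=self.engine)   # discover whatever tables currently exist
        metadata.drop_all(bind=self.engine)  # remove them all, giving a fresh start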

Introduce Spatial objects

We need to introduce spatial objects to both the PostGIS and SQLite database implementations.

This will give us PostGIS and Spatialite.

Once we've got good spatial referencing in, we can start work on spatial indexes in the database, to improve the performance of spatial search.

Fix for code coverage reporting

Submitting the code coverage reports to codecov is failing. To my untrained eye, it looks like an issue related to our CookieCutter settings. Here's a transcript:

6.72s$ coverage3 report
after_success.2
23.00s$ codecov
      _____          _
     / ____|        | |
    | |     ___   __| | ___  ___ _____   __
    | |    / _ \ / _  |/ _ \/ __/ _ \ \ / /
    | |___| (_) | (_| |  __/ (_| (_) \ V /
     \_____\___/ \____|\___|\___\___/ \_/
                                    v2.0.15
==> Detecting CI provider
    Travis Detected
==> Preparing upload
==> Processing gcov (disable by -X gcov)
    Executing gcov (find /home/travis/build/debrief/pepys-import -not -path './bower_components/**' -not -path './node_modules/**' -not -path './vendor/**' -type f -name '*.gcno'  -exec gcov -pb  {} +)
==> Collecting reports
    Generating coverage xml reports for Python
/home/travis/virtualenv/python3.8.0/lib/python3.8/site-packages/sqlalchemy/orm/query.py:196: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if entities is not ():
    + /home/travis/build/debrief/pepys-import/coverage.xml bytes=2866063
==> Appending environment variables
    + TRAVIS_PYTHON_VERSION
    + TRAVIS_OS_NAME
==> Uploading
    .url https://codecov.io
    .query commit=30bf0b732265449d531166570ba0b5fbe2817a1f&branch=develop&job=608295468&pr=false&service=travis&build=1.1&slug=debrief%2Fpepys-import&package=py2.0.15
    Pinging Codecov...
Error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Tip: See an example python repo: https://github.com/codecov/example-python
Support channels:
  Email:   [email protected]
  IRC:     #codecov
  Gitter:  https://gitter.im/codecov/support
  Twitter: @codecov
Done. Your build exited with 0.

REP export capability

In release 1 we aim to have the ability to export data from the database to REP format.

Release 1 will just export entries in the States, Contacts and Comments tables.

We will have a Python class/method with this API:

export(states, contacts, comments)

The method will return the contents of those element lists as a string. Note: many fields in the classes won't have a representation in REP format, so are ignored.

Note: when looping through the contents of the tables, the platform name to use will be retrieved via the Sensor entry in each row.

It's probably worth caching the platform names that are retrieved from sensor ids, since the same platform is likely to appear many times.
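
A sketch of that caching; the query shape and attribute names are assumptions based on the SQLAlchemy models used elsewhere in the repo:

    def make_platform_name_lookup(session, db_classes):
        cache = {}

        def platform_name(sensor_id):
            if sensor_id not in cache:
                sensor = session.query(db_classes.Sensor).get(sensor_id)
                platform = session.query(db_classes.Platform).get(sensor.host)
                cache[sensor_id] = platform.name  # remember it for the next row
            return cache[sensor_id]

        return platform_name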

The following types of element will be created:

Rename database schema

I'd been aware of this, but hadn't questioned it. Tahir pointed out:

I have noticed that we are creating schema with the name datastore_schema . But we are creating our tables in public schema.

So, we should put our tables under our schema. But, we can use a more descriptive title than datastore_schema. Let's use Pepys.

We can probably do a lot of this via find/replace.

Refactor CommandLineResolver

This is a relatively low priority task, since it doesn't bring any new features. It's suited for when @BarisSari is held up on other tasks.

We've refactored the command line resolver, and adopted prompt_toolkit to enhance the command line support.

But, there are some repeating patterns in the code, where we do similar operations on a range of table types (such as fuzzy search for sensor, platform, datafile, etc).

If possible, it would be good to refactor these to pull out the common code. This would make the code easier to maintain, and make it easier to maintain our test coverage. I expect we're starting with complete test coverage of the implementations - but we should be able to delete some tests post-refactoring.

Before and after table summaries have title in wrong place

When running on Windows (haven't tested it on other platforms yet), the before and after table summaries have their overall title ('Before' or 'After') on the same line as the column headers, making everything unaligned:

==After==| Table name   |   Number of rows | Last item added            |
|--------------|------------------|----------------------------|
| States       |             1210 | 2020-02-24 09:35:14.800510 |
| Platforms    |                6 | 2020-02-24 09:35:10.423745 |
Files got processed: 7 times

(I'm happy to fix this, this is more of a note to myself to fix it)

Refactor State, fields & parsing

The State class is being used as a temporary store, to collate fields ready to create a new record in the State database table.

The class requires these changes:

  • ensure fields match the SQL schema for State
  • parsing of REPFile data should happen in the REPFile class, not the State class
  • the constructor for a State object should include elements that are mandatory within the State schema
  • the REPFile importer would then assign the other fields using setter methods (or attributes)

Unit tests should also be modified to reflect the above

**Update**: Ian took the first steps towards this in #40.

Switch SQLite to use int IDs instead of Blobs

It would be easier to track what's happening in our development databases if we could easily read the ID numbers.

We should change the SQLite schema to use ints, instead of blobs.

Flake8 fixes

We have quite a few Flake8 fixes (sample attached).

We should work through them, and correct them. Once we've done that, we can aim for zero flake8 failures.

Note: flake8 failures can be determined by entering flake8 pepys_import

flake_8_failures.txt

Bulk import performance

An attached zip-file contains a high volume of data-files in REPLAY format.

We should use the sample data-set to investigate Pepys-Import performance bottlenecks, and see where performance can be improved.
BulkAssetRuns.zip

Refactor API to match design document

In our current API, database operations are conducted from the DataStore entity.

The Python Import Library design document has an API with a richer object model. In this model State objects get added to a DataFile and Sensor objects get added to a Platform parent.

In the separate SQL and PG database modules we do have classes like these - and maybe those classes could be extended with this new functionality. But, it would be wrong to duplicate the code across both of them.

This may lead us to introduce a new object model that supports the necessary API calls.

Hopefully in the new API we can embed the session_scope inside the objects so our library consumers don't have to think about/manage it.
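
For reference, the kind of session_scope being referred to is the standard SQLAlchemy recipe; embedding something like this inside DataStore is the idea here (the session_maker attribute name is an assumption):

    from contextlib import contextmanager

    @contextmanager
    def session_scope(self):
        """Provide a transactional scope around a series of operations."""
        session = self.session_maker()
        try:
            yield session
            session.commit()
        except Exception:
            session.rollback()
            raise
        finally:
            session.close()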

Provide command line arguments to run Pepys-Admin in headless mode

We may wish to use pepys-admin in a non-interactive way.

This could be done through the provision of a Defaults Resolver (a file of missing values). Or, by passing missing commands as parameters.

Once we can do this, it will be possible to convert a file from one format to another: ask pepys to both import it and export it, all in one operation. It could look like this:

pepys_import -input "datafile.csv" -in_format "E-Trac" -out_format "REPLAY"
    -output "NONSUCH.rep" -platform "NONSUCH" -sensor "GPS"

Once we have the above command working, it should be possible to right-click on a .csv file and choose Send to > Replay format.

Hmm, we may even wish to use pepys-admin in the same way:

pepys_admin -add "PLATFORM" -name "NONSUCH" -nationality "UK" -type "Frigate"
pepys_admin -delete "DATAFILE" -id "22EF34FE"

This command-line use of pepys_admin could allow for the bulk insertion of data into the datastore, using a Python script to parse some external datafile. Note: it wouldn't be suited to bulk loading of measurements. The high transaction cost of connecting to the database, making the change, then closing the connection wouldn't be a problem when submitting the serials/phases of an exercise, but it wouldn't realistically be able to add 10k platform states.

Note: once we support command line options, we should also support:

  • version
  • help

Note: background on "Good CLI Implementation" here
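
A sketch of the kind of argument handling the examples above imply, using argparse; the flag names are copied from this issue and nothing here is the real Pepys CLI:

    import argparse

    parser = argparse.ArgumentParser(prog="pepys_import")
    parser.add_argument("--version", action="version", version="pepys-import 0.0.4")
    parser.add_argument("-input", help="file to import, e.g. datafile.csv")
    parser.add_argument("-in_format", help='importer to use, e.g. "E-Trac"')
    parser.add_argument("-out_format", help='exporter to use, e.g. "REPLAY"')
    parser.add_argument("-output", help="file to export to, e.g. NONSUCH.rep")
    parser.add_argument("-platform", help="platform name to assign")
    parser.add_argument("-sensor", help="sensor name to assign")
    args = parser.parse_args()  # --help comes for free with argparse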

Minor - use (x,y) location objects

There's currently a bug whereby location is only expressed in degrees.

We should switch to passing self-describing completed State2 objects around.

Adopt CLI library

We currently have a hand-made command-line interaction module, in command_line_input.py

This has been acceptable so far, but now we require two refinements:

  • the ability to have some responses permanently assigned to a key (such as . for Cancel)
  • the ability to inject some key responses that would allow us to cover command-line interactions from unit tests

We should switch to the QPrompt library to support the above: https://qprompt.readthedocs.io/en/latest/

This library will also bring these benefits:

  • fuzzy search for matching string. This will be good for platforms, sensors and datafiles
  • change title of command prompt during process
  • busy cursor for lengthy operations (such as the final batch commit of measurements)

NMEA Parser should get platform from user

The NMEA parser currently uses a hard-coded platform:
(screenshot)

Instead, it should get the platform from the user. We'll do that by using the resolver to get a platform name. So, the resolver class(es) need to be able to get a platform name to use, not just to help in searching for one.

I've updated the resolver logic to support this: https://docs.google.com/spreadsheets/d/1vE1nP_ax89kqAjhYCD_1oM6u6vnzKr2v0Ie46MRJuDU/edit?usp=sharing

We should defer this issue until we have refactored the command-line-resolver class.

PowerShell scripts to create Pepys shortcuts

For the MS-Windows "Send to" integration we require Windows Shortcut files to the pepys batch file.

These have to be created after deployment, since they contain the absolute path to the batch files.

When creating them, we should assign the Pepys icon (as found in the bin folder).

A second shell script should copy the shortcuts to the AppData\Roaming\Microsoft\Windows\SendTo\ folder for the current user.

Note: the first script will only be run once, as a user with privileges to write to the deployed folder, whereas the second script would be used by any Pepys user.

Generic database API

We currently have database-specific APIs. This means our business logic has to target either SQLite or Postgres:

(screenshot)

I'm unsure of the best strategy, but I'd like us to either have both implementations extend a parent interface, or to hide the two implementations behind a single API.
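
A sketch of the "parent interface" option; the method names are assumptions, not the current API:

    from abc import ABC, abstractmethod

    class DataStoreBase(ABC):
        """Single API that both database implementations would satisfy."""

        @abstractmethod
        def initialise(self):
            """Create the Pepys schema in the backing database."""

        @abstractmethod
        def add_state(self, state):
            """Persist one State measurement."""

    class SQLiteDataStore(DataStoreBase):
        def initialise(self):
            ...  # SQLite/spatialite-specific setup

        def add_state(self, state):
            ...

    class PostgresDataStore(DataStoreBase):
        def initialise(self):
            ...  # Postgres/PostGIS-specific setup

        def add_state(self, state):
            ...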

Distribution Strategy

Once deployed on the user network, ideally an import will happen as follows:

  • user opens command prompt
  • user navigates to folder containing data
  • user enters pepys_import

Pepys_import will be a .bat file on their path. It will instruct Python to run the file_processor with the current folder as an argument.

The %cd% bat-file variable contains the current working directory: https://stackoverflow.com/a/4420078/92441

For evaluation purposes, the data will be imported into an SQLite database named pepys_import.db, in the current working folder.

Our distribution strategy should also include:

  • process to build a GitHub release (zip file) that contains all dependencies, sufficient that it can run on a PC without an Internet connection
  • process to deploy release to shared folder
  • script to make pepys_import available to user (including use of virtual environment)

Update: This morning I came across a serious improvement to the above. The MS-Windows context menu has a "Send to" sub-menu. It lists shared drives, but also apps such as "Dropbox". This would be an ideal integration for Pepys-import:

  • user navigates to the folder (or files) to import
  • right-clicks on the folder and selects Send to > Pepys Datastore
  • a command prompt opens; the user provides missing data, and/or allows/rejects the import of files from that folder

For this "Send to" support, the onsite IT team believe they can "push" the shortcut to all users. They suggested that when we deliver our first release (maybe a beta), they're happy to experiment with implementing this.
