
planchet's People

Contributors

savkov


planchet's Issues

Output registry

If multiple users or jobs are running in parallel, they may write to the same output location, which could corrupt or silently overwrite data.

Suggestion:

  • create an output-file registry and reject jobs that want to write to an already registered output file
  • implement a `force` mechanism for when overwriting is the desired effect
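A minimal sketch of the idea, using an in-memory registry (all names here are hypothetical; the real implementation would live alongside the job-creation logic):

```python
class OutputRegistry:
    """Tracks which job has claimed which output file (hypothetical sketch)."""

    def __init__(self):
        self._owners = {}  # output path -> job name

    def register(self, job_name, output_path, force=False):
        """Claim an output path; reject it if another job already owns it."""
        owner = self._owners.get(output_path)
        if owner is not None and owner != job_name and not force:
            raise ValueError(
                f"{output_path} is already registered to job {owner!r}; "
                f"pass force=True to override"
            )
        self._owners[output_path] = job_name

    def release(self, job_name):
        """Free all output paths owned by a finished or cleaned job."""
        self._owners = {p: j for p, j in self._owners.items() if j != job_name}
```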

Overwrite creates a loop of overwriting the output with each new batch

Currently, when overwrite is used as part of the /receive endpoint, it overwrites the output file. This is desired behaviour, but it is often not clear that every subsequent request will do the same. To protect users from themselves, we should act on the flag only when there are no active items in the job ledger.
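The guard could look something like this (a sketch; `count_active` is a hypothetical ledger call returning the number of items currently in flight for the job):

```python
def maybe_overwrite(ledger, job_name, output_path, overwrite):
    """Truncate the output only on the first request of a fresh job (sketch).

    If the ledger already has active items, a previous batch has been
    written, so truncating again would throw that work away.
    """
    if overwrite and ledger.count_active(job_name) == 0:
        open(output_path, "w").close()  # truncate once, not on every batch
```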

Batch readers and writers

So far Planchet has assumed that the data is stored in a single file, for both reading and writing. Sometimes it would be more convenient to split the data into smaller files, for storage or access reasons, or simply because that is how the data was stored originally.

Suggestion:
Add new readers and writers that can operate on batches of files. Ideally, we should wrap the existing ones.
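Wrapping an existing single-file reader might look like this (a sketch; `reader_cls` stands in for any of the current reader classes, whose exact interface is assumed here to be an iterable over items):

```python
import glob

class BatchReader:
    """Runs a single-file reader class over every file matching a glob (sketch)."""

    def __init__(self, reader_cls, pattern, **reader_kwargs):
        self.reader_cls = reader_cls
        self.paths = sorted(glob.glob(pattern))  # deterministic file order
        self.reader_kwargs = reader_kwargs

    def __iter__(self):
        # Chain the per-file readers into one continuous item stream.
        for path in self.paths:
            yield from self.reader_cls(path, **self.reader_kwargs)
```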

API fix for CsvReader and CsvWriter

CsvReader and CsvWriter use a list as their input/output item structure, so the order of the list matters. It may be better to force this to be a dictionary, probably through a new pair of classes that use JSON for their item structure.
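The new pair of classes could be as simple as a JSON-lines reader and writer (a sketch; the class names and the exact interface they would need to share with CsvReader/CsvWriter are assumptions):

```python
import json

class JsonlReader:
    """Reads one dict per line; fields are accessed by name, not position."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)

class JsonlWriter:
    """Appends one dict per line, preserving field names."""

    def __init__(self, path):
        self.path = path

    def write(self, items):
        with open(self.path, "a") as f:
            for item in items:
                f.write(json.dumps(item) + "\n")
```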

An example worker

  • An example of how a worker can be set up as a script, or even dockerised, would help users get started and give them a base to build on if they want to make their own.
  • Some degree of configurability might be worth investing in as well; for example, making the processing function importable and configurable would decouple the worker from the processing. That's a nice-to-have, though.
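A worker script with an importable processing function could be sketched like this (the `client.serve`/`client.receive` calls are assumptions about the Planchet client API, and `processor_path` is a hypothetical config value):

```python
import importlib

def run_worker(client, job_name, processor_path, batch_size=100):
    """Generic worker loop (sketch).

    processor_path like "mypackage.mymodule:process" lets the processing
    function be swapped without touching the worker itself.
    """
    module_name, func_name = processor_path.split(":")
    process = getattr(importlib.import_module(module_name), func_name)
    while True:
        items = client.serve(job_name, batch_size)  # hypothetical client call
        if not items:
            break  # job exhausted
        results = [(item_id, process(payload)) for item_id, payload in items]
        client.receive(job_name, results)  # hypothetical client call
```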

Served items are never processed if the processor dies

Currently, there is no mechanism for the served items to be re-served in case the processor has died.

The proposed solution:

  • introduce a new flag for continuation jobs
  • allow continuation jobs to process SERVED items before anything else

Limitation:

  • if multiple client instances of a continuation job are run, they will all end up processing the same initial SERVED items
  • this can be avoided by spinning up only one instance initially
  • it does not cause data corruption, as writing is controlled by the RECEIVED status

Allow dumping jobs

Currently, jobs run checks against served items whenever they receive items. It would be nice to have special dump-only jobs that only receive.
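In sketch form, a dump-only job would simply bypass the served-item bookkeeping (all ledger calls here are hypothetical):

```python
def receive(ledger, job_name, items, dump=False):
    """Record received items (sketch).

    A dump-only job skips the check that each item was previously SERVED;
    items are written straight through and only marked RECEIVED.
    """
    if not dump:
        ledger.check_served(job_name, [i for i, _ in items])  # hypothetical
    for item_id, _ in items:
        ledger.mark(job_name, item_id, "RECEIVED")
```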

Add an `output` parameter to /clean

Currently, /clean is a nice way to clean the ledger and restart a job that may have gone wrong. However, it does not currently remove the output file.

Suggestion:

  • add a boolean parameter `output`
  • remove the output files if it is set to `True`
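A sketch of the extended clean operation (the ledger call and the shape of the job metadata are assumptions):

```python
import os

def clean(ledger, job_name, job_metadata, output=False):
    """Clean a job's ledger entries; optionally delete its output file (sketch)."""
    ledger.delete_job(job_name)  # hypothetical ledger call
    if output:
        path = job_metadata.get("output_path")
        if path and os.path.exists(path):
            os.remove(path)
```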

Improved Ledger

The current ledger is basically the Redis client: each item is recorded 1-to-1 in Redis. This is easy but lazy, and it gets inefficient when Redis needs to sort through millions of records in order to update the 100 records it just received in a batch.

Proposed solutions:

  1. store batches instead of items; this will require quite a bit of rework due to there being no IDs for batches
  2. store items in large batches on Redis; this can be done by wrapping the Redis client and hiding the logic in a Ledger class.
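Option 2 could be sketched by deriving a bucket key from the item ID, so that ~1000 items share one Redis hash (the key convention and the Ledger interface are assumptions; a plain dict-of-dicts stands in for Redis below):

```python
BUCKET_SIZE = 1000

def bucket_key(job_name, item_id):
    """Map an item to its bucket so updates touch one hash per ~1000 items."""
    return f"{job_name}:bucket:{item_id // BUCKET_SIZE}"

class BucketedLedger:
    """Hides the batching logic behind a Ledger class, as proposed (sketch)."""

    def __init__(self, store):
        self.store = store  # bucket key -> {item id: status}

    def set_status(self, job_name, item_id, status):
        bucket = self.store.setdefault(bucket_key(job_name, item_id), {})
        bucket[str(item_id)] = status

    def get_status(self, job_name, item_id):
        bucket = self.store.get(bucket_key(job_name, item_id), {})
        return bucket.get(str(item_id))
```

With a real Redis backend the two methods would map to `hset`/`hget` on the bucket key, which keeps per-item addressing while reducing the keyspace by three orders of magnitude.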

Error status

Allow processors to send a special flag to indicate error so that items can be re-submitted for processing.
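In sketch form (status names other than RECEIVED and the ledger calls are assumptions):

```python
def receive_result(ledger, job_name, item_id, payload, error=False):
    """Record a result; the error flag marks the item for re-processing (sketch)."""
    status = "ERROR" if error else "RECEIVED"
    ledger.mark(job_name, item_id, status)

def reserve_errors(ledger, job_name):
    """Flip ERROR items back to NEW so they are served again."""
    for item_id, _ in ledger.get_by_status(job_name, "ERROR", limit=None):
        ledger.mark(job_name, item_id, "NEW")
```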

Rudimentary authentication

At the moment anyone can create jobs and do processing as part of existing jobs. As this is meant to be a tool that lives in a sandbox, this is acceptable, but the risk of users interfering with each other's work remains, so a simple solution can go a long way.

Proposal:

  • require an authentication token with each request
  • set a master token when starting Planchet (passed as env var)
  • set job tokens as part of creating a job
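The check itself could be a few lines (a sketch; how the master token and job tokens are stored is an assumption, and `hmac.compare_digest` is used so the comparison does not leak token prefixes via timing):

```python
import hmac

def check_token(master_token, job_tokens, job_name, token):
    """Accept the master token or the job's own token (sketch).

    master_token would be read from an env var at startup; job_tokens maps
    job names to the token set when the job was created.
    """
    allowed = [t for t in (master_token, job_tokens.get(job_name)) if t]
    if not any(hmac.compare_digest(token, t) for t in allowed):
        raise PermissionError("invalid token")
```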

A new `reset_redis` endpoint

We need a convenient way to clean up Redis remotely. An endpoint that cleans all entries for one job, or for all jobs, would let users do this without logging in to their Planchet server.
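The handler body could be sketched like this, assuming a `<job>:*` key convention (an assumption; `scan_iter` and `delete` are real redis-py client methods):

```python
def reset_redis(redis_client, job_name=None):
    """Delete ledger entries for one job, or for all jobs (sketch)."""
    pattern = f"{job_name}:*" if job_name else "*"
    for key in redis_client.scan_iter(match=pattern):
        redis_client.delete(key)
```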

Test coverage from inside Docker

Currently, the automated CI test coverage does not include the app and client tests, which are a significant part of the project. Ideally, we should run these tests inside the docker-compose setup and report their coverage as well.
