
planchet's People

Contributors

savkov


planchet's Issues

Output registry

If multiple users or jobs are running in parallel, they may write to the same output location, which could corrupt or silently overwrite data.

Suggestion:

  • create an output-file registry and reject jobs that want to write to an already registered output file
  • implement a `force` mechanism for when overwriting is the desired effect
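A minimal sketch of the idea, using an in-memory registry (all names here are hypothetical; the real implementation would live alongside the job-creation logic):

```python
class OutputRegistry:
    """Tracks which job has claimed which output file (hypothetical sketch)."""

    def __init__(self):
        self._owners = {}  # output path -> job name

    def register(self, job_name, output_path, force=False):
        """Claim an output path; reject it if another job already owns it."""
        owner = self._owners.get(output_path)
        if owner is not None and owner != job_name and not force:
            raise ValueError(
                f"{output_path} is already registered to job {owner!r}; "
                f"pass force=True to override"
            )
        self._owners[output_path] = job_name

    def release(self, job_name):
        """Free all output paths owned by a finished or cleaned job."""
        self._owners = {p: j for p, j in self._owners.items() if j != job_name}
```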

Overwrite creates a loop of overwriting the output with each new batch

Currently, when overwrite is used as part of the /receive endpoint, it overwrites the output file. This is desired behaviour, but it is often not clear that every subsequent request will do the same. To protect users from themselves, we should act on the flag only when there are no active items in the job ledger.
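The guard could look something like this (a sketch; `count_active` is a hypothetical ledger call returning the number of items currently in flight for the job):

```python
def maybe_overwrite(ledger, job_name, output_path, overwrite):
    """Truncate the output only on the first request of a fresh job (sketch).

    If the ledger already has active items, a previous batch has been
    written, so truncating again would throw that work away.
    """
    if overwrite and ledger.count_active(job_name) == 0:
        open(output_path, "w").close()  # truncate once, not on every batch
```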

Batch readers and writers

So far Planchet has assumed that the data is stored in a single file, for both reading and writing. Sometimes it would be more convenient to split the data into smaller files, for storage or access reasons, or simply because that is how the data was stored originally.

Suggestion:
Add new readers and writers that can operate on batches of files. Ideally, we should wrap the existing ones.
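Wrapping an existing single-file reader might look like this (a sketch; `reader_cls` stands in for any of the current reader classes, whose exact interface is assumed here to be an iterable over items):

```python
import glob

class BatchReader:
    """Runs a single-file reader class over every file matching a glob (sketch)."""

    def __init__(self, reader_cls, pattern, **reader_kwargs):
        self.reader_cls = reader_cls
        self.paths = sorted(glob.glob(pattern))  # deterministic file order
        self.reader_kwargs = reader_kwargs

    def __iter__(self):
        # Chain the per-file readers into one continuous item stream.
        for path in self.paths:
            yield from self.reader_cls(path, **self.reader_kwargs)
```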

API fix for CsvReader and CsvWriter

CsvReader and CsvWriter use a list as their input/output item structure, so the order of the list matters. It may be better to force this to be a dictionary, probably through a new pair of classes that use JSON for their item structure.
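The new pair of classes could be as simple as a JSON-lines reader and writer (a sketch; the class names and the exact interface they would need to share with CsvReader/CsvWriter are assumptions):

```python
import json

class JsonlReader:
    """Reads one dict per line; fields are accessed by name, not position."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)

class JsonlWriter:
    """Appends one dict per line, preserving field names."""

    def __init__(self, path):
        self.path = path

    def write(self, items):
        with open(self.path, "a") as f:
            for item in items:
                f.write(json.dumps(item) + "\n")
```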

An example worker

  • An example of how a worker can be set up as a script, or even dockerised, would help users get started and give them a base to build on if they want to make their own.
  • Some degree of configurability might be worth investing in as well; for example, making the processing function importable and configurable would decouple the worker from the processing. That's a nice-to-have, though.
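A worker script with an importable processing function could be sketched like this (the `client.serve`/`client.receive` calls are assumptions about the Planchet client API, and `processor_path` is a hypothetical config value):

```python
import importlib

def run_worker(client, job_name, processor_path, batch_size=100):
    """Generic worker loop (sketch).

    processor_path like "mypackage.mymodule:process" lets the processing
    function be swapped without touching the worker itself.
    """
    module_name, func_name = processor_path.split(":")
    process = getattr(importlib.import_module(module_name), func_name)
    while True:
        items = client.serve(job_name, batch_size)  # hypothetical client call
        if not items:
            break  # job exhausted
        results = [(item_id, process(payload)) for item_id, payload in items]
        client.receive(job_name, results)  # hypothetical client call
```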

Served items are never processed if the processor dies

Currently, there is no mechanism for the served items to be re-served in case the processor has died.

The proposed solution:

  • introduce a new flag for continuation jobs
  • allow continuation jobs to process SERVED items before anything else

Limitation:

  • if multiple client instances of a continuation job are run, they will all end up processing the same initial SERVED items
  • this can be avoided by spinning up only one instance initially
  • it does not cause data corruption, as writing is controlled by the RECEIVED status

Allow dumping jobs

Currently, jobs run checks against served items whenever they receive items. It would be nice to have special dump-only jobs that only receive.
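In sketch form, a dump-only job would simply bypass the served-item bookkeeping (all ledger calls here are hypothetical):

```python
def receive(ledger, job_name, items, dump=False):
    """Record received items (sketch).

    A dump-only job skips the check that each item was previously SERVED;
    items are written straight through and only marked RECEIVED.
    """
    if not dump:
        ledger.check_served(job_name, [i for i, _ in items])  # hypothetical
    for item_id, _ in items:
        ledger.mark(job_name, item_id, "RECEIVED")
```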

Add an `output` parameter to /clean

Currently, /clean is a nice way to clean the ledger and restart a job that may have gone wrong. However, it does not currently remove the output file.

Suggestion:

  • add a boolean parameter `output`
  • remove the output files if it is set to `True`
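A sketch of the extended clean operation (the ledger call and the shape of the job metadata are assumptions):

```python
import os

def clean(ledger, job_name, job_metadata, output=False):
    """Clean a job's ledger entries; optionally delete its output file (sketch)."""
    ledger.delete_job(job_name)  # hypothetical ledger call
    if output:
        path = job_metadata.get("output_path")
        if path and os.path.exists(path):
            os.remove(path)
```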

Improved Ledger

The current ledger is basically the Redis client: each item is recorded 1-to-1 in Redis. This is easy but lazy, and it gets inefficient when Redis needs to sort through millions of records in order to update the 100 records it just received in a batch.

Proposed solutions:

  1. store batches instead of items; this will require quite a bit of rework due to there being no IDs for batches
  2. store items in large batches on Redis; this can be done by wrapping the Redis client and hiding the logic in a Ledger class.
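Option 2 could be sketched by deriving a bucket key from the item ID, so that ~1000 items share one Redis hash (the key convention and the Ledger interface are assumptions; a plain dict-of-dicts stands in for Redis below):

```python
BUCKET_SIZE = 1000

def bucket_key(job_name, item_id):
    """Map an item to its bucket so updates touch one hash per ~1000 items."""
    return f"{job_name}:bucket:{item_id // BUCKET_SIZE}"

class BucketedLedger:
    """Hides the batching logic behind a Ledger class, as proposed (sketch)."""

    def __init__(self, store):
        self.store = store  # bucket key -> {item id: status}

    def set_status(self, job_name, item_id, status):
        bucket = self.store.setdefault(bucket_key(job_name, item_id), {})
        bucket[str(item_id)] = status

    def get_status(self, job_name, item_id):
        bucket = self.store.get(bucket_key(job_name, item_id), {})
        return bucket.get(str(item_id))
```

With a real Redis backend the two methods would map to `hset`/`hget` on the bucket key, which keeps per-item addressing while reducing the keyspace by three orders of magnitude.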

Error status

Allow processors to send a special flag to indicate error so that items can be re-submitted for processing.
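In sketch form (status names other than RECEIVED and the ledger calls are assumptions):

```python
def receive_result(ledger, job_name, item_id, payload, error=False):
    """Record a result; the error flag marks the item for re-processing (sketch)."""
    status = "ERROR" if error else "RECEIVED"
    ledger.mark(job_name, item_id, status)

def reserve_errors(ledger, job_name):
    """Flip ERROR items back to NEW so they are served again."""
    for item_id, _ in ledger.get_by_status(job_name, "ERROR", limit=None):
        ledger.mark(job_name, item_id, "NEW")
```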

Rudimentary authentication

At the moment anyone can create jobs and do processing as part of existing jobs. As this is meant to be a tool that lives in a sandbox, this is acceptable, but the risk of users interfering with each other's work remains, so a simple solution can go a long way.

Proposal:

  • require an authentication token with each request
  • set a master token when starting Planchet (passed as env var)
  • set job tokens as part of creating a job
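The check itself could be a few lines (a sketch; how the master token and job tokens are stored is an assumption, and `hmac.compare_digest` is used so the comparison does not leak token prefixes via timing):

```python
import hmac

def check_token(master_token, job_tokens, job_name, token):
    """Accept the master token or the job's own token (sketch).

    master_token would be read from an env var at startup; job_tokens maps
    job names to the token set when the job was created.
    """
    allowed = [t for t in (master_token, job_tokens.get(job_name)) if t]
    if not any(hmac.compare_digest(token, t) for t in allowed):
        raise PermissionError("invalid token")
```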

A new `reset_redis` endpoint

We need a convenient way to clean up Redis remotely. An endpoint that cleans all entries for one job, or for all jobs, would let users do this without logging in to their Planchet server.
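The handler body could be sketched like this, assuming a `<job>:*` key convention (an assumption; `scan_iter` and `delete` are real redis-py client methods):

```python
def reset_redis(redis_client, job_name=None):
    """Delete ledger entries for one job, or for all jobs (sketch)."""
    pattern = f"{job_name}:*" if job_name else "*"
    for key in redis_client.scan_iter(match=pattern):
        redis_client.delete(key)
```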

Test coverage from inside Docker

Currently, the automated CI test coverage does not include the app and client tests, which are a significant part of the project. Ideally, we should run these tests inside the docker-compose setup and report their coverage as well.
