aredier / chariots

versioned machine learning pipelines

License: GNU Lesser General Public License v3.0

Python 96.92% Makefile 1.76% Batchfile 0.28% Jupyter Notebook 1.04%

data-pipelines flask machine-learning project-template python

chariots's Issues

ops should pass the data they are not consuming

this is particularly a problem in training pipelines when, for instance, a vectorizer transforms the x but we need the y to be passed along
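A minimal sketch of the pass-through behaviour described above, assuming the data travels as a dict of named fields. The helper `apply_to_keys` and the `x`/`y` key names are hypothetical and are not part of chariots.

```python
from typing import Any, Callable, Dict, List


def apply_to_keys(
    func: Callable[[Any], Any], data: Dict[str, Any], consumed: List[str]
) -> Dict[str, Any]:
    """Apply `func` to the consumed keys and forward every other key untouched."""
    result = dict(data)  # shallow copy, so unconsumed entries are passed along
    for key in consumed:
        result[key] = func(data[key])
    return result


# Toy "vectorizer": maps each text to its length; only "x" is consumed.
batch = {"x": ["a", "bbb"], "y": [0, 1]}
out = apply_to_keys(lambda texts: [len(t) for t in texts], batch, consumed=["x"])
```

Here `out["y"]` is still available to a downstream training op even though the vectorizer never looked at it.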

`from_runner_iterator` should keep a "natural" version of the iterator

`Runner::from_runner_iterator` currently returns a `Runner<RunnerBatch<DataType, Op>, Op>` instead of a `Runner<DataType, Op>`

add ChariotsConfig

It would be great to have a named tuple holding all the configuration, so that it can be passed to all the different servers and clients
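One possible shape for such a named tuple, as a sketch only. Every field name and default below is a hypothetical placeholder, since the issue does not say which settings should be included.

```python
from typing import NamedTuple


class ChariotsConfig(NamedTuple):
    # All field names and defaults are hypothetical placeholders.
    op_store_path: str = "/tmp/chariots"
    server_host: str = "localhost"
    server_port: int = 5000

    @property
    def server_url(self) -> str:
        """URL that a client could use to reach the server."""
        return f"http://{self.server_host}:{self.server_port}"


config = ChariotsConfig(server_port=8080)
```

Because a named tuple is immutable, the same `config` object can be handed to a server and to several clients without one of them changing it for the others; `config._replace(...)` returns a modified copy when a variant is needed.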

refactor to a real pipeline and data separation

Pipeline base
Pipeline Iterator
Split

nodes should be able to output multiple branches

it should be possible to write something along the lines of

pipe = Pipeline(
    [
        DataLoadingNode( ..., output_nodes="loaded_data"),
        Node(train_test_split, input_nodes=["loaded_data"], output_nodes=["train_data", "test_data"]),
        ...
    ]
)

this would make it possible to express some common machine learning workflows (train/test split, x/y split, ...) and many more
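To make the idea concrete, here is a small self-contained sketch of how named multi-output nodes could be resolved. The `Node` class and `run_pipeline` function below are illustrative stand-ins, not the chariots implementation, and the toy `train_test_split` is not the scikit-learn function.

```python
class Node:
    """Hypothetical node: wraps a function and names its inputs and outputs."""

    def __init__(self, func, input_nodes=(), output_nodes=()):
        self.func = func
        self.input_nodes = list(input_nodes)
        self.output_nodes = list(output_nodes)


def run_pipeline(nodes):
    """Run nodes in order, storing each named output for downstream nodes."""
    results = {}
    for node in nodes:
        args = [results[name] for name in node.input_nodes]
        output = node.func(*args)
        if not node.output_nodes:
            continue  # nothing to expose downstream
        if len(node.output_nodes) == 1:
            results[node.output_nodes[0]] = output
        else:
            # Several declared outputs: unpack the returned tuple into branches.
            results.update(zip(node.output_nodes, output))
    return results


def load_data():
    return list(range(10))


def train_test_split(data):
    cut = len(data) * 4 // 5  # 80/20 split
    return data[:cut], data[cut:]


results = run_pipeline([
    Node(load_data, output_nodes=["loaded_data"]),
    Node(train_test_split, input_nodes=["loaded_data"],
         output_nodes=["train_data", "test_data"]),
])
```

The sketch assumes nodes are listed in an order where every input has already been produced; a real implementation would need to sort the graph first.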

refactor to Python

we should move the bulk of the code to Python

add some things to the README

client should be complete

all the endpoints in the backend should have a mapping in the client (no curl/request should ever be required to use chariots).

data set type representing a DataFrame ?

should we build a data type that represents a sort of DataFrame with fields, so that ops know which fields they are dealing with?
Would be useful once #10 is implemented

look for possible rust crates
decide implementation

refactor cookiecutter template

use the same op multiple times in the same pipeline

for now, since the graph of operations in a pipeline is represented as a dict, an operation can only appear once in its keys
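The limitation can be reproduced with plain Python dicts. The sketch below is not chariots code: it only shows why keying the graph by op loses duplicates, and one possible alternative (keying by a unique node name), which is an assumption about how this could be fixed.

```python
def double(x):
    return 2 * x


# Graph keyed by op: the second entry silently overwrites the first,
# so the op can only appear once.
graph_by_op = {double: ["input"], double: ["first_pass"]}

# Graph keyed by a unique node name: the same op can be reused freely.
graph_by_name = {
    "first_pass": (double, ["input"]),
    "second_pass": (double, ["first_pass"]),
}

values = {"input": 3}
# Assumes the insertion order of the dict is a valid execution order.
for name, (op, parents) in graph_by_name.items():
    values[name] = op(*(values[parent] for parent in parents))
```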

add trainable ops

ops that can be trained, saved, and loaded

train/predict
save/load
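As a rough illustration of the train/predict and save/load pairs, here is a toy op that learns a mean and persists it with `pickle`. The class name, method signatures, and storage format are all assumptions made for this sketch.

```python
import pickle
import tempfile
from pathlib import Path


class MeanPredictor:
    """Hypothetical trainable op: learns the mean of y and predicts it."""

    def __init__(self):
        self.mean = None

    def train(self, y):
        self.mean = sum(y) / len(y)

    def predict(self, n):
        if self.mean is None:
            raise RuntimeError("op must be trained or loaded before predicting")
        return [self.mean] * n

    def save(self, path):
        # Only the learned state is persisted, not the op object itself.
        Path(path).write_bytes(pickle.dumps(self.mean))

    @classmethod
    def load(cls, path):
        op = cls()
        op.mean = pickle.loads(Path(path).read_bytes())
        return op


op = MeanPredictor()
op.train([1.0, 2.0, 3.0])
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "mean_predictor.pkl"
    op.save(model_path)
    restored = MeanPredictor.load(model_path)
```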

restructure module structure

the idea would be to have a new, more usable structure with max one level:

chariots

Chariots
Client
Pipeline
Node
base
- BaseNode
- BaseOp
- BasePipelineCallBack
- BaseOpCallback
- BaseRunner
- BaseSaver
- BaseMlOp
ops
- LoadableOp
errors
- ...
runners
- SequentialRunner
savers
- FileSaver
sklearn
- SkLearnSupervised
- SKUnsupervised
cli
- ...

the idea is to have a structure that is different from the

ops are not persistent when app is closed

add required ancestors

today an op just assumes that a certain op was run upstream in the pipeline. We need to find a way for an op to be relatively agnostic to its ancestors and still remain reliable (e.g. A/B testing)

cross compatibility

we should be able to release for Python versions 3.5 through 3.7

new docstring type

choose between NumPy and Google style docstrings rather than rst:

- decide the type of docstrings
- use napoleon for the doc generation
- refactor all existing docstrings

Ops should be able to change data type

as of now, an op cannot change the data type between its input and its output; this should change

add a build method to the ops to reduce the compute time at

benchmark (also build a script to reproduce benchmarks in the future)

hackability

be able to inject code into the framework:

- before the pipeline is called
- after the pipeline is called
- in between every op of the Pipeline
- before a specific op is called
- after a specific op is called

The first three could later be bundled in a reusable callback.
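A sketch of what such injection points could look like, assuming a pipeline is just a sequence of callables. The hook names (`before_pipeline`, `before_op`, ...) and `run_with_callbacks` are hypothetical, not an existing chariots API.

```python
class RecordingCallback:
    """Hypothetical callback that records every hook it receives."""

    def __init__(self):
        self.events = []

    def before_pipeline(self, data):
        self.events.append("before_pipeline")

    def after_pipeline(self, data):
        self.events.append("after_pipeline")

    def before_op(self, op, data):
        self.events.append("before_op:" + op.__name__)

    def after_op(self, op, data):
        self.events.append("after_op:" + op.__name__)


def run_with_callbacks(ops, data, callbacks=()):
    """Run ops in sequence, calling every hook at the matching point."""
    for callback in callbacks:
        callback.before_pipeline(data)
    for op in ops:
        for callback in callbacks:
            callback.before_op(op, data)
        data = op(data)
        for callback in callbacks:
            callback.after_op(op, data)
    for callback in callbacks:
        callback.after_pipeline(data)
    return data


def add_one(x):
    return x + 1


def double(x):
    return 2 * x


recorder = RecordingCallback()
result = run_with_callbacks([add_one, double], 1, callbacks=[recorder])
```

A callback that only cares about one specific op can check `op` inside `before_op`/`after_op` and ignore the rest.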

split and merge runners

we should be able to merge and split runners in order to make fully functional graphs

find better way to get version hash

the version hash comes from stringifying the versioned field. This should be done more rigorously
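One more rigorous option, sketched under the assumption that the versioned values are JSON-serialisable: hash a canonical JSON encoding rather than `str(...)`. The function name `version_hash` is made up for this example.

```python
import hashlib
import json


def version_hash(fields):
    """Hash a mapping of versioned field names to JSON-serialisable values."""
    # sort_keys and fixed separators give the same text for equal mappings,
    # whatever their insertion order.
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


h1 = version_hash({"a": 1, "b": [1, 2]})
h2 = version_hash({"b": [1, 2], "a": 1})
h3 = version_hash({"a": 2, "b": [1, 2]})
```

With this encoding `h1` and `h2` are identical while `h3` differs, which is the property a version hash needs. Values that are not JSON-serialisable would need an explicit encoder.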

clean the node_versions/op_versions/version

there are too many versions being used everywhere in a fairly undocumented way. this needs to be clearer

pre commit hooks

Have pre commit hooks for:

single quotes only
pytest
linting
run on all files

test the documentation

we should test that all the code snippets in the doc work, both in the docstrings and in the rst files
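The standard library `doctest` module can cover both cases. The sketch below runs the examples found in a docstring and in a piece of rst text; the `add` function and the rst snippet are invented for the illustration.

```python
import doctest


def add(a, b):
    """Return the sum of two numbers.

    >>> add(1, 2)
    3
    """
    return a + b


# Stand-in for the content of an rst documentation file.
RST_SNIPPET = """
Usage
-----

>>> add(2, 2)
4
"""

parser = doctest.DocTestParser()
runner = doctest.DocTestRunner(verbose=False)

# Examples embedded in a docstring.
docstring_test = parser.get_doctest(add.__doc__, {"add": add}, "add", None, 0)
docstring_results = runner.run(docstring_test)

# Examples embedded in rst text.
rst_test = parser.get_doctest(RST_SNIPPET, {"add": add}, "usage.rst", None, 0)
rst_results = runner.run(rst_test)
```

In a test suite the same checks are usually run through pytest with its `--doctest-modules` and `--doctest-glob` options instead of being driven by hand.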

use a server and a DB for the Metadata

rename VersionType to SUBVERSIONTYPE for consistency and clarity

the version type enum is confusing since it actually defines the types of subversions. this should be renamed.
chariots/versioning/version_type.py::VersionType

the version fields stay the same between instances of the class

it seems that because the versioned fields become real Python objects instead of their underlying class, their behavior changes. this is not the seamless integration of versioning in the ops that we are aiming for.

Steps to reproduce:

from chariots.core.ops import BaseOp
from chariots.core.versioning import VersionField
from chariots.core.versioning import VersionType


class VersionedOp(BaseOp):
    name = "fake_op"
    versioned_field = VersionField(VersionType.MAJOR, default_value=2)

    def _main(self):
        pass


op = VersionedOp()
op.versioned_field = 3
op2 = VersionedOp()
op2.versioned_field

this outputs 3 instead of 2 (the default value)
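The source does not state the cause, but one plausible explanation is a class-level field object that stores the value on itself. The sketch below reproduces that behaviour with a plain Python descriptor and shows a per-instance variant; `SharedField` and `PerInstanceField` are hypothetical and are not the chariots `VersionField`.

```python
class SharedField:
    """Descriptor that stores its value on itself (shared across instances)."""

    def __init__(self, default):
        self.value = default

    def __get__(self, instance, owner):
        return self.value

    def __set__(self, instance, value):
        self.value = value  # every instance of the owning class sees this


class PerInstanceField:
    """Descriptor that stores the value on each instance instead."""

    def __init__(self, default):
        self.default = default

    def __set_name__(self, owner, name):
        self.storage_name = "_" + name

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return getattr(instance, self.storage_name, self.default)

    def __set__(self, instance, value):
        setattr(instance, self.storage_name, value)


class BuggyOp:
    field = SharedField(2)


class FixedOp:
    field = PerInstanceField(2)


buggy = BuggyOp()
buggy.field = 3
fixed = FixedOp()
fixed.field = 3
```

A fresh `BuggyOp()` then reports 3, as in the reproduction steps, while a fresh `FixedOp()` still reports the default 2.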

create graphql metadata api

It has become apparent that storing metadata on each op and pipeline is key. I should do this sooner rather than later

be able to see server errors in the client

errors that occur in the server are not transmitted to the client; instead we get this generic error:

ValueError: the execution of the pipeline failed, see _deployment logs for traceback

which doesn't help and is getting on my nerves.
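A possible approach, sketched without any web framework: the server catches the exception and ships its type, message, and traceback in the response payload, and the client re-raises with those details. The payload keys, `run_on_server`, `handle_response`, and `RemoteExecutionError` are all assumptions made for this example.

```python
import traceback


class RemoteExecutionError(Exception):
    """Raised client-side when the server reports a failure."""


def run_on_server(func, *args):
    """Hypothetical server-side wrapper returning a JSON-like payload."""
    try:
        return {"ok": True, "result": func(*args)}
    except Exception as err:
        return {
            "ok": False,
            "error_type": type(err).__name__,
            "message": str(err),
            "traceback": traceback.format_exc(),
        }


def handle_response(payload):
    """Hypothetical client-side handler: return the result or re-raise."""
    if payload["ok"]:
        return payload["result"]
    raise RemoteExecutionError(
        "{error_type}: {message}\n{traceback}".format(**payload)
    )


failing_payload = run_on_server(lambda: 1 / 0)
```

Calling `handle_response(failing_payload)` raises a `RemoteExecutionError` whose message names the original `ZeroDivisionError` and includes the server-side traceback. Whether full tracebacks should be exposed outside a development setup is a separate decision.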

easy local execution of pipelines

today, to execute a Pipeline locally (in a notebook for instance), you still need to set up an OpStore and a Runner. This should be hidden during the prototyping stage and left to deal with during deployment (the only actual work needed to go from one to the other)

gym pytorch RL integration

the aim of this is to create a tutorial showing how to implement an RL pipeline with chariots using Gym environments and pytorch. The aim is also to implement all the changes needed in chariots to make this process streamlined

Gym integration
pytorch support
control flow in chariots (the ability to create training loops and early stopping procedures will be central to implementing RL environments)

Add pages and update documentation

we need to add some stuff to the documentation

Components of Chariots
Roadmap

keras Ops

we need ops to support building Keras NNs, and potentially composing them
