lightwood's Introduction

Lightwood

Lightwood is an AutoML framework that enables you to generate and customize machine learning pipelines using a declarative syntax called JSON-AI.

Our goal is to make the data science/machine learning (DS/ML) life cycle easier by allowing users to focus on what they want to do with their data, without needing to write repetitive boilerplate code around machine learning and data preparation. Instead, we enable you to focus on the parts of a model that are truly unique and custom.

Lightwood works with a variety of data types such as numbers, dates, categories, tags, text, arrays and various multimedia formats. These data types can be combined together to solve complex problems. We also support a time-series mode for problems that have between-row dependencies.

Our JSON-AI syntax allows users to change any and all parts of the models Lightwood automatically generates. The syntax outlines the specific details of each step of the modeling pipeline. Users may override default values (for example, changing the type of a column) or, alternatively, entirely replace steps with their own methods (for example, using a random forest model as a predictor). Lightwood creates a "JSON-AI" object from this syntax, which can then be used to automatically generate Python code that represents your pipeline.

For details on how to generate JSON-AI syntax and how Lightwood works, check out the Lightwood Philosophy.

Lightwood Philosophy

Lightwood abstracts the ML pipeline into 3 core steps:

(1) Pre-processing and data cleaning
(2) Feature engineering
(3) Model building and training

Lightwood internals

i) Pre-processing and cleaning

For each column in your dataset, Lightwood will identify the suspected data type (numeric, categorical, etc.) via a brief statistical analysis. From this, it will generate a JSON-AI syntax.
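As a rough illustration of what a "brief statistical analysis" of a column could look like, the sketch below guesses a coarse type from a sample of values. The helper name `guess_column_type` and the 5% distinct-value threshold are assumptions made for this example; this is not Lightwood's actual type-inference code.

```python
from typing import Any, Sequence


def guess_column_type(values: Sequence[Any], max_category_ratio: float = 0.05) -> str:
    """Guess a coarse data type for one column (hypothetical helper)."""
    present = [v for v in values if v is not None and v != ""]
    if not present:
        return "empty"

    def is_number(v: Any) -> bool:
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False

    # Numeric if every non-missing value parses as a float.
    if all(is_number(v) for v in present):
        return "numeric"
    # Categorical if there are few distinct values relative to the row count.
    if len(set(present)) / len(present) <= max_category_ratio:
        return "categorical"
    return "text"
```

A real implementation would also need to recognise dates, tags, arrays and the other data types listed earlier.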

If the user keeps default behavior, Lightwood will perform a brief pre-processing approach to clean each column according to its identified data type. From there, it will split the data into train/dev/test splits.

The cleaner and splitter objects respectively refer to the pre-processing and the data splitting functions.
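To make the splitting step concrete, here is a minimal sketch of a shuffle-and-cut splitter. The function name `split_rows` and the 80/10/10 ratios are assumptions chosen for illustration; the text above does not state which ratios Lightwood's default splitter uses.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def split_rows(rows: Sequence[T], train: float = 0.8, dev: float = 0.1, seed: int = 0) -> Dict[str, List[T]]:
    """Shuffle rows and cut them into train/dev/test partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:n_train + n_dev],
        "test": shuffled[n_train + n_dev:],  # whatever remains
    }
```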

ii) Feature Engineering

Data can be converted into features via "encoders". Encoders represent the rules for transforming pre-processed data into a numerical representation that a model can use.

Encoders can be rule-based or learned. A rule-based encoder transforms data per a specific set of instructions (ex: normalized numerical data) whereas a learned encoder produces a representation of the data after training (ex: a "[CLS]" token in a language model).

Encoders are assigned to each column of data based on the data type; users can override this assignment either at the column-based level or at the data-type based level. Encoders inherit from the BaseEncoder class.
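The rule-based case can be sketched with a small min-max normaliser. This class is illustrative only: it does not inherit from Lightwood's BaseEncoder, and the method names `prepare`, `encode` and `decode` are assumptions made for the example.

```python
from typing import List, Sequence


class MinMaxEncoder:
    """Rule-based encoder: scales numbers into [0, 1] (illustrative only)."""

    def prepare(self, values: Sequence[float]) -> None:
        # A rule-based encoder only needs to record the statistics its rule uses.
        self.low = min(values)
        self.high = max(values)

    def encode(self, values: Sequence[float]) -> List[float]:
        span = (self.high - self.low) or 1.0  # avoid dividing by zero for constant columns
        return [(v - self.low) / span for v in values]

    def decode(self, encoded: Sequence[float]) -> List[float]:
        span = (self.high - self.low) or 1.0
        return [e * span + self.low for e in encoded]
```

A learned encoder would expose the same kind of interface, but its preparation step would train a model instead of recording two numbers.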

iii) Model Building and Training

We call a predictive model that intakes encoded feature data and outputs a prediction for the target of interest a mixer model. Users can either use Lightwood's default mixers or create their own approaches inherited from the BaseMixer class.

We predominantly use PyTorch based approaches, but can support other models.
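The idea of a mixer (encoded features in, target prediction out) can be shown with a deliberately tiny model. The class below is a toy nearest-neighbour predictor written for this example; it does not inherit from BaseMixer, and its `fit`/`predict` method names are assumptions rather than Lightwood's interface.

```python
from typing import List, Sequence


class NearestNeighborMixer:
    """Toy mixer: predicts the target of the closest training row."""

    def fit(self, features: Sequence[Sequence[float]], targets: Sequence[float]) -> None:
        # Memorise the encoded training rows and their targets.
        self.features = [list(row) for row in features]
        self.targets = list(targets)

    def predict(self, features: Sequence[Sequence[float]]) -> List[float]:
        def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
            return sum((x - y) ** 2 for x, y in zip(a, b))

        predictions = []
        for row in features:
            # Index of the stored row with the smallest squared distance.
            nearest = min(range(len(self.features)), key=lambda i: sq_dist(row, self.features[i]))
            predictions.append(self.targets[nearest])
        return predictions
```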

Usage

We invite you to check out our documentation for specific guidelines and tutorials! Please stay tuned for updates and changes.

Quick use cases

Lightwood works with pandas.DataFrames. Once a DataFrame is loaded, define a "ProblemDefinition" via a dictionary. The only thing a user needs to specify is the name of the column to predict (via the key target).

Create a JSON-AI syntax from the command json_ai_from_problem. Lightwood can then use this object to automatically generate python code filling in the steps of the ML pipeline via code_from_json_ai.

You can make a Predictor object, instantiated with that code via predictor_from_code.

To train a Predictor end-to-end, starting with unprocessed data, users can use the predictor.learn() command with the data.

import pandas as pd
from lightwood.api.high_level import (
    ProblemDefinition,
    json_ai_from_problem,
    code_from_json_ai,
    predictor_from_code,
)

if __name__ == '__main__':
    # Load a pandas dataset
    df = pd.read_csv(
        "https://raw.githubusercontent.com/mindsdb/benchmarks/main/benchmarks/datasets/hdi/data.csv"
    )

    # Define the prediction task by naming the target column
    pdef = ProblemDefinition.from_dict(
        {
            "target": "Development Index",  # column you want to predict
        }
    )

    # Generate JSON-AI code to model the problem
    json_ai = json_ai_from_problem(df, problem_definition=pdef)

    # OPTIONAL - see the JSON-AI syntax
    # print(json_ai.to_json())

    # Generate python code
    code = code_from_json_ai(json_ai)

    # OPTIONAL - see generated code
    # print(code)

    # Create a predictor from python code
    predictor = predictor_from_code(code)

    # Train a model end-to-end from raw data to a finalized predictor
    predictor.learn(df)

    # Make the train/test splits and show predictions for a few examples
    test_df = predictor.split(predictor.preprocess(df))["test"]
    preds = predictor.predict(test_df).iloc[:10]
    print(preds)

BYOM: Bring your own models

Lightwood supports user architectures/approaches so long as you follow the abstractions provided within each step.

Our tutorials provide specific use cases for how to introduce customization into your pipeline. Check out "custom cleaner", "custom splitter", "custom explainer", and "custom mixer". Stay tuned for further updates.

Installation

You can install Lightwood as follows:

pip3 install lightwood

Note: depending on your environment, you might have to use pip instead of pip3 in the above command.

However, we recommend creating a python virtual environment.

Setting up a dev environment

  • Python version should be in the range >=3.8, < 3.11
  • Clone lightwood
  • cd lightwood && pip install -r requirements.txt && pip install -r requirements_image.txt
  • Add it to your python path (e.g. by adding export PYTHONPATH='/where/you/cloned/lightwood':$PYTHONPATH as a newline at the end of your ~/.bashrc file)
  • Check that the unittests are passing by going into the directory where you cloned lightwood and running: python -m unittest discover tests

If python defaults to python2.x in your environment, use python3 and pip3 instead.

Currently, the preferred environment for working with Lightwood is Visual Studio Code (VSCode), a very popular Python IDE. However, any IDE should work. While we don't have guides for other IDEs, please feel free to use the following VSCode section as a template, or to contribute your own tips and tricks for setting them up.

Setting up a VSCode environment

  • Install and enable setting sync using github account (if you use multiple machines)
  • Install pylance (for types) and make sure to disable pyright
  • Go to Python > Lint: Enabled and disable everything but flake8
  • Set python.linting.flake8Path to the full path to flake8 (which flake8)
  • Set Python › Formatting: Provider to autopep8
  • Add --global-config=<path_to>/lightwood/.flake8 and --experimental to Python › Formatting: Autopep8 Args
  • Install live share and live share whiteboard

Contribute to Lightwood

We love to receive contributions from the community and hear your opinions! We want to make contributing to Lightwood as easy as it can be.

Being part of the core Lightwood team is open to anyone who is motivated and wants to be part of that journey!

Please continue reading this guide if you are interested in helping democratize machine learning.

How can you help us?

  • Report a bug
  • Improve documentation
  • Solve an issue
  • Propose new features
  • Discuss feature implementations
  • Submit a bug fix
  • Test Lightwood with your own data and let us know how it went!

Code contributions

In general, we follow the "fork-and-pull" git workflow. Here are the steps:

  1. Fork the Lightwood repository
  2. Checkout the staging branch, which is the development version that gets released weekly (there can be exceptions, but make sure to ask and confirm with us).
  3. Make changes and commit them
  4. Make sure that the CI tests pass. You can run the test suite locally with flake8 . to check style and python -m unittest discover tests to run the automated tests. This doesn't guarantee it will pass remotely since we run on multiple envs, but should work in most cases.
  5. Push your local branch to your fork
  6. Submit a pull request from your repo to the staging branch of mindsdb/lightwood so that we can review your changes. Be sure to merge the latest from staging before making a pull request!

Note: You will need to sign a CLA (Contributor License Agreement) for the code since Lightwood is under a GPL license.

Feature and Bug reports

We use GitHub issues to track bugs and features. Report them by opening a new issue and filling out all of the required inputs.

Code review process

Pull request (PR) reviews are done on a regular basis. If your PR does not address a previous issue, please make an issue first.

If your change has a chance of affecting performance, we will run our private benchmark suite to validate it.

Please, make sure you respond to our feedback/questions.

Community

If you have additional questions or you want to chat with the MindsDB core team, you can join our community: MindsDB Community.

To get updates on Lightwood and MindsDB’s latest announcements, releases, and events, sign up for our Monthly Community Newsletter.

Join our mission of democratizing machine learning and allowing developers to become data scientists!

Contributor Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project, you agree to abide by its terms.

Current contributors

License PyPI - License

lightwood's People

Contributors

abitrolly, adripo, alexandre-dz-oscore, azulgarza, btseytlin, ea-rus, george3d6, hakunanatasha, hamishfagg, jaredc07, kination, lezcano, lyndonfan, maximlopin, michaellantz, mindsdb-devops, mrandri19, noraa-july-stoke, ongspxm, paxcema, quantumplumber, rajveer43, riadhlaabidi, stpmax, surendra1472, talaathasanin, tomhuds, torrmal, vaithak, zoranpandovski


lightwood's Issues

Start training on a sub-set

Instead of training on the whole dataset at once, start training on a small sub-set (say, 5 or so batches) and, as the network starts converging on that sub-set, start feeding it the whole dataset (or a bigger sub-set).

This sort of "priming" might help us achieve convergence quicker, which could be rather good considering 0.11.8 increased training times.
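One way to picture the priming idea is as a schedule of growing sub-set sizes. The sketch below is hypothetical: the function name `subset_schedule` and the doubling growth factor are assumptions, and deciding when the network has "started converging" on a stage is left to the caller.

```python
from typing import Iterator


def subset_schedule(n_rows: int, batch_size: int, start_batches: int = 5, growth: int = 2) -> Iterator[int]:
    """Yield the number of rows to train on at each stage, ending with the full dataset."""
    size = min(n_rows, start_batches * batch_size)
    while size < n_rows:
        yield size
        size = min(n_rows, size * growth)
    yield n_rows
```

The training code would move to the next yielded size each time its convergence check fires on the current sub-set.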

Columns too large or with too many dimensions for the categorical autoencoder

Certain categorical columns seem large enough that we run out of memory (OOM) when training the categorical autoencoder. This can happen for two reasons:

a) Columns with loads of dimensions. In this case we might just want to eliminate categorical columns with 5000+ or so dimensions for mindsdb, or encode them as text when appropriate

b) [More pressing] When the column has a large but doesn't contain an excessive number of dimensions. Not sure why it would crash in this situation, but I'm too tired to check at the moment. Should look into this later.

Cuda-enabled Learning

Hi,

How do I modify the learn method to be cudified? I saw some stuff from earlier with specifying devices, etc, but wasn't sure about the details there as I couldn't find anything in the documentation

Training parallelism on multiple machines

We should start thinking about this once we're done with #63. PyTorch and various PyTorch-related frameworks should provide some support for this, but I doubt it's going to be very useful outside of very large datasets, due to the large data transfer and synchronization overhead.

Good candidate to try this out on would be the Ax Optimizer, which could run multiple trials on different machines.

Training parallelism on multi-GPU machines

We need to make lightwood able to make use of multi-GPU machines in order to train faster. Pytorch should have support for this, shouldn't be terribly hard to implement, but testing to make sure nothing is broken by this might take a bit.

Make certain encoders "learn" a correlation with the target variable

We discussed doing this with the categorical autoencoder first, but it could be done with other encoders as well.

Essentially, we should try to predict the target variable from the intermediary representation when training the autoencoder. So instead of the autoencoder being:

column_value -> IR -> column_value

it would become:

column_value -> IR -> column_value + target

This could be useful for two reasons:

a) We know that if the IR can be used to reasonably well predict the target, it can be also used by the mixer to predict the target

b) We could obtain some sort of "column correlation" score from this, which could help us determine the importance of the column. This might be interesting both for the user and for lightwood itself when it decides which column to drop out during later stages of training (see #68)

Weird segfault issue on import

Basically, there's a weird segfault happening when importing lightwood... in some cases.

Examples:

Segmentation fault:
import transformers

Segmentation fault:
import mindsdb
import lightwood

Segmentation fault:
import mindsdb
import transformers

Works:
import lightwood
import transformers

Works:
import lightwood
import mindsd

Doesn't happen on all machines. No idea of the cause, investigating now.

Training seems to stop too early on small datasets

This is on a proprietary dataset, so sadly I can't include it here.

Training seems to stop too early on very small datasets, before the algorithm is allowed to converge to an optimal solution. If I just copy-paste the dataset 10 times in the same file we reach 100% accuracy, whilst leaving it as is only allows us to reach ~94%.

We should find a way to run a dataset multiple times (or not run it fully) during an epoch based on its size, or change the number of epochs before evaluation dynamically based on dataset size.
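A minimal sketch of the first option: compute how many passes over the data should count as one "epoch" so that small datasets still see a minimum number of samples before each evaluation. The helper name `passes_per_epoch` and the 10,000-sample floor are assumptions picked for illustration.

```python
import math


def passes_per_epoch(n_rows: int, min_samples_per_epoch: int = 10_000) -> int:
    """Number of times to iterate over the dataset before each evaluation."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    return max(1, math.ceil(min_samples_per_epoch / n_rows))
```

With these defaults a 500-row dataset is repeated 20 times per evaluation, while a 50,000-row dataset is seen once.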

This could also be done in mindsdb but I'd prefer it if lightwood itself knew how to do this, as it's closely related to the training process.

ImportError: cannot import name 'Imputer' from 'sklearn.preprocessing'

Describe the bug
Installing the latest version of lightwood throws an ImportError. I guess the issue is related to scikit-learn.

To Reproduce
Steps to reproduce the behavior:

  1. Train model, the issue is not related to a specific dataset
  2. See error
from cesium import featurize

File "/home/zoran/MyProjects/lightwood/l/lib/python3.7/site-packages/cesium-0.9.9-py3.7-linux-x86_64.egg/cesium/featurize.py", line 10, in
from sklearn.preprocessing import Imputer
ImportError: cannot import name 'Imputer' from 'sklearn.preprocessing' (/home/zoran/MyProjects/lightwood/l/lib/python3.7/site-packages/scikit_learn-0.22rc3-py3.7-linux-x86_64.egg/sklearn/preprocessing/__init__.py)

Screenshots
Screenshot from 2019-12-02 14-40-15

Possible issue with encoder output size differences.

So, there's a possible issue that I can best formalize as:

Given n encoders with outputs of different size, that encode variables which are equally important for predicting the target variable, the network might have too many parameters dedicated to the larger inputs and thus learn very fast / overfit on the inputs encoded with the largest representation.

Some possible solutions:

  1. Make all mixers have a standard size input (possibly equal to the size of the output variable[s], which could also be combined with training all encoders to predict the output variable...)

  2. Add "input networks" as part of the mixer for each different input. These networks would have an in-layer equal to the size of the output variable and an out-layer of a standard size (maybe equal to the largest encoded input size)

  3. This issue would mainly happen with categorical one-hot-encoded values with very few dimensions (e.g. categories with 2-20 possible values) and numbers (which are always represented by a 4-variable input vector [isnan, iszero, sign, normalized_numerical_value]).

We might be able to "hack" around this by simply "copy-pasting" the inputs from the smaller categories up to a certain threshold (pick magic number or some function in relation to the size of the largest encoded input).
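The "copy-pasting" hack can be sketched as tiling a short encoded vector until it reaches the threshold length. The function name `tile_to_length` is hypothetical, and how the threshold is chosen (magic number or a function of the largest encoded input) is left open, as in the paragraph above.

```python
from typing import List, Sequence


def tile_to_length(encoded: Sequence[float], target_len: int) -> List[float]:
    """Repeat a short encoded vector until it has exactly target_len entries."""
    if not encoded:
        raise ValueError("cannot tile an empty vector")
    if len(encoded) >= target_len:
        return list(encoded)  # already at or above the threshold: leave untouched
    repeats = -(-target_len // len(encoded))  # ceiling division
    return (list(encoded) * repeats)[:target_len]
```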

We could also change numerical encoders so that, instead of having a single value for the numerical value, we have two values per bucket: one represents the normalized numerical value between the start and end of the bucket, the other represents how strongly we believe the value to be in this bucket. If we pick these buckets to be equal in size to the ones generated by MindsDB in the histogram, this is also helpful since lightwood itself can then give a numerical prediction plus a number of high-probability buckets for said numerical prediction. Even if the prediction is wrong, there's a high chance of the buckets being correct.
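The bucketed representation might look roughly like the sketch below, which emits a (position, membership) pair per bucket. The function name `bucket_encode` is hypothetical, and using a hard 0/1 membership is a simplifying assumption; the proposal above leaves room for a softer "how strongly we believe" score.

```python
from typing import List, Sequence


def bucket_encode(value: float, edges: Sequence[float]) -> List[float]:
    """Encode value as [position, membership] pairs, one pair per bucket.

    edges are the sorted bucket boundaries; bucket i spans edges[i]..edges[i + 1].
    """
    encoded: List[float] = []
    last = len(edges) - 2
    for i in range(len(edges) - 1):
        low, high = edges[i], edges[i + 1]
        # The final bucket also includes its upper edge.
        inside = low <= value < high or (i == last and value == high)
        position = (value - low) / (high - low) if inside else 0.0
        encoded.extend([position, 1.0 if inside else 0.0])
    return encoded
```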

@torrmal you were the one that came up with 2 and seemed to be very keen on 1, so if I'm missing anything there please add to them or correct me.

For now I'm leaning towards implementation number 2, since it would affect the least number of moving parts.

Windows Installation issue

Describe the bug
Installation failed for lightwood=0.7.6. Error:

Packages installed from PyPI cannot depend on packages which are not also hosted on PyPI. lightwood depends on torch@ https://download.pytorch.org/whl/cu100/torch-1.1.0-cp37-cp37m-win_amd64.whl

To Reproduce
Steps to reproduce the behavior:

  1. pip install lightwood=0.7.6 or just pip install lightwood since 0.7.6 is latest

Desktop (please complete the following information):

  • OS: windows 10
  • Lightwood version 0.7.6
  • Python Version 3.7.3

Support training multiple mixers and add more mixers

Support the training of multiple mixers and add a few more mixers (especially one or two boosting models, since they seem to often beat our own mixer on certain datasets, or approach its accuracy in a fraction of the time).

We could either/or: input the predictions from these mixers into the final NN mixer, use these mixers instead of the NN mixers, adopt an ensemble prediction style (where we trust the majority and give a confidence based on how the predictions align).
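The ensemble option (trust the majority, derive a confidence from agreement) can be sketched in a few lines. The function name `majority_vote` is hypothetical, and treating the agreeing fraction as the confidence is one simple choice among many.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def majority_vote(predictions: Sequence[Hashable]) -> Tuple[Hashable, float]:
    """Return the most common prediction and the fraction of mixers that agree with it."""
    if not predictions:
        raise ValueError("need at least one prediction")
    label, votes = Counter(predictions).most_common(1)[0]
    return label, votes / len(predictions)
```

Each element of `predictions` would be one mixer's output for the same row; ties go to whichever label was seen first.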

This architecture change could also be used to train multiple NN mixers (e.g. train one with selfaware on and one with selfaware off, and use the one with selfaware off if it's much more accurate on the testing data). We could even expose more mixers to the user/mindsdb and let them choose which one to use for a given prediction based on various criteria.

Travis build not failing for failed tests

Describe the bug
The Travis build passes even when there are failures in the test scripts.

To Reproduce
Run travis build

Expected behavior
If there is an error in travis unit test scripts then the build should fail

Additional context
This is just the replication of similar issue in mindsdb/mindsdb mindsdb/mindsdb#343
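The issue does not show the Travis script, so the cause is unknown here. One common reason for this symptom is a test command whose exit status never reaches the CI runner. Under that assumption, a small wrapper like the hypothetical `exit_code_for` below makes the status explicit.

```python
import unittest


def exit_code_for(suite: unittest.TestSuite) -> int:
    """Run a suite and return 0 on success, 1 on any failure or error."""
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return 0 if result.wasSuccessful() else 1
```

A CI entry point could then call `sys.exit(exit_code_for(unittest.defaultTestLoader.discover("tests")))` so that a failing test fails the build.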

TypeError: object of type 'NoneType' has no len()

Describe the bug
TypeError: object of type 'NoneType' has no len() is thrown when using lightwood as backend.

To Reproduce
Steps to reproduce the behavior:

  1. Use this example

Screenshots
Screenshot from 2019-09-02 17-31-02

Desktop (please complete the following information):

  • OS: Ubuntu 18.04
  • Lightwood version 0.9.0
  • Python Version 3.7.4

Ignore deployment for Doc type change

Describe the bug
Travis deploy is running for Docs also

To Reproduce
send any PR with a Doc change such as .md, LICENSE or .travis.yml

Expected behavior
Deployment should be skipped

Additional context
This is similar to what was requested in mindsdb/mindsdb, just a replication of mindsdb/mindsdb#352

Training on large datasets (with no cache) OOM

When training on very large datasets, even if the cache is disabled, we sometimes run OOM, especially on GPUs, even on rather large ones, since memory is usually rather limited (<12GB).

Initially I thought this was due to the gradient tensors that PyTorch accumulates during forward propagation, similar to what was happening in .predict, but I can't find any evidence of this.

I'll have to investigate this further. Any large dataset (say, > 2GB) that yields a few thousand input dimensions when encoded should do the trick for testing, except for image datasets, since we load those from disk when encoding and reduce their dimensionality by quite a lot.

Windows install via pypi repo fails

pip install lightwood will fail on windows due to us sourcing the torch dependency from outside of pypi.

Not much we can do about it at the moment, since the pypi version of torch for windows seems to fail installing most of the time (or install a version that's way too old).

We should look into fixing this when new releases of torch arrive on pypi and/or new updates for the various libraries involved with torch appear on windows.

Temporary "fix" is just asking people to install from github.

ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

Your Environment

Python version: 3.7.4
Pip version: 19
Operating system: Ubuntu 18.04
Python environment used (e.g. venv, conda): venv
Mindsdb version you tried to install: 1.6.8

Describe the bug
ValueError is thrown when training

To Reproduce
Steps to reproduce the behavior, for example:

Use this example
You should see the error: ValueError: Input contains NaN

Additional context
Screenshot from 2019-10-08 13-41-47: https://user-images.githubusercontent.com/7192539/66394176-aa3d6680-e9d4-11e9-9e1b-d61fe50b26c4.png

Disabling the cache breaks prediction functionality

If we disable encoded value caching when making predictions, lightwood crashes because it tries to encode the missing target variable column[s].

Example stack trace:

ERROR:mindsdb-logger-core-logger:libs/controllers/transaction.py:126 - Could not load module ModelInterface                                                                                   
                                                                                                                                                                                              
ERROR:mindsdb-logger-core-logger:libs/controllers/transaction.py:127 - Traceback (most recent call last):                                                                                     
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc                                                                                   
    return self._engine.get_loc(key)                                                                                                                                                          
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc                                                                                                          
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc                                                                                                          
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item                                                                             
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item                                                                             
KeyError: '<target_value>'                                                                                                                                                                       
                                                                                                                                                                                              
During handling of the above exception, another exception occurred:                                                                                                                           
                                                                                                                                                                                              
Traceback (most recent call last):                                                                                                                                                            
  File "/home/ubuntu/george_experiments/mindsdb/mindsdb/libs/controllers/transaction.py", line 123, in _call_phase_module                                                                     
    return module(self.session, self)(**kwargs)                                                                                                                                               
  File "/home/ubuntu/george_experiments/mindsdb/mindsdb/libs/phases/base_module.py", line 54, in __call__                                                                                     
    ret = self.run(**kwargs)                                                                                                                                                                  
  File "/home/ubuntu/george_experiments/mindsdb/mindsdb/libs/phases/model_interface/model_interface.py", line 33, in run                                                                      
    self.transaction.hmd['predictions'] = self.transaction.model_backend.predict()                                                                                                            
  File "/home/ubuntu/george_experiments/mindsdb/mindsdb/libs/backends/lightwood.py", line 228, in predict                                                                                     
    predictions = self.predictor.predict(when_data=run_df)                                                                                                                                    
  File "/home/ubuntu/george_experiments/lightwood/lightwood/api/predictor.py", line 353, in predict                                                                                           
    return self._mixer.predict(when_data_ds)                                                                                                                                                  
  File "/home/ubuntu/george_experiments/lightwood/lightwood/mixers/nn/nn.py", line 65, in predict                                                                                             
    for i, data in enumerate(data_loader, 0):                                                                                                                                                 
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 346, in __next__                                                                                
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration                                                                                                                      
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch                                                                                  
    data = [self.dataset[idx] for idx in possibly_batched_index]                                                                                                                              
  File "/home/ubuntu/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>                                                                             
    data = [self.dataset[idx] for idx in possibly_batched_index]                                                                                                                              
  File "/home/ubuntu/george_experiments/lightwood/lightwood/api/data_source.py", line 111, in __getitem__                                                                                     
    sample[feature_set][col_name] = self.get_encoded_column_data(col_name, feature_set, custom_data={col_name: [self.data_frame[col_name].iloc[idx]]})[0]                                     
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/frame.py", line 2995, in __getitem__                                                                                      
    indexer = self.columns.get_loc(key)                                                                                                                                                       
  File "/home/ubuntu/.local/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc                                                                                   
    return self._engine.get_loc(self._maybe_cast_indexer(key))                                                                                                                                
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc                                                                                                          
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc                                                                                                          
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item                                                                             
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item                                                                             
KeyError: '<target_value>'
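The trace above ends in a KeyError for the target column, which is absent from the DataFrame at prediction time. A possible guard, sketched here with a hypothetical helper `columns_to_encode` that is not part of lightwood's data source, is to skip target columns that are missing when predicting while still failing loudly for any other missing column.

```python
from typing import Iterable, List


def columns_to_encode(
    available: Iterable[str],
    requested: Iterable[str],
    targets: Iterable[str],
    predicting: bool,
) -> List[str]:
    """Pick the columns to encode, skipping absent target columns at prediction time."""
    available_set = set(available)
    target_set = set(targets)
    selected = []
    for col in requested:
        if col not in available_set:
            if predicting and col in target_set:
                continue  # targets are legitimately missing when predicting
            raise KeyError(col)
        selected.append(col)
    return selected
```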

Could not find a version that satisfies the requirement torch>=1.1.0.post2

Installing lightwood produces the following error:

ERROR: Could not find a version that satisfies the requirement torch>=1.1.0.post2 (from lightwood) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2, 0.4.1, 0.4.1.post2, 1.0.0, 1.0.1, 1.0.1.post2, 1.1.0)
ERROR: No matching distribution found for torch>=1.1.0.post2 (from lightwood)

The strange thing is that torch version 1.1.0.post2 is available on PyPI. Similar issues are reported on the PyTorch repo.

installation error in python 3.8.1

The installation fails like this
ERROR: Could not find a version that satisfies the requirement torch>=1.3.0 (from pyro-ppl>=0.4.1->lightwood) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.3.0 (from pyro-ppl>=0.4.1->lightwood)

Provide column importance scores via lightwood

This should probably come after #68 and #69

We could try providing some sort of column importance score from lightwood based on:

a) The results of training with drop-out (#68), i.e. if dropping out a given column yields low accuracy then it's probably very important, and vice-versa
b) The correlation that certain encoders find between the IR of the column and the target variable (see #69)
c) Maybe some analysis of the input weights corresponding to the inputs derived from the column and/or a more in-depth analysis of their importance in the graph and/or maybe even the gradient flow during training.
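Option (a) could be sketched roughly as follows. This is only an illustration, not lightwood code: `column_importance`, `toy_predict` and the row format are hypothetical, and "dropping" a column is simulated by setting it to None.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[float]]


def column_importance(
    rows: List[Row],
    targets: List[int],
    predict: Callable[[Row], int],
) -> Dict[str, float]:
    """Score each column by the accuracy lost when it is blanked out."""

    def accuracy(data: List[Row]) -> float:
        hits = sum(predict(r) == t for r, t in zip(data, targets))
        return hits / len(targets)

    baseline = accuracy(rows)
    scores = {}
    for col in rows[0]:
        # Simulate drop-out by replacing the column with a missing value.
        dropped = [{**r, col: None} for r in rows]
        scores[col] = baseline - accuracy(dropped)
    return scores


# Hypothetical stand-in for a trained model: it only ever looks at column "a".
def toy_predict(row: Row) -> int:
    return int(row["a"] is not None and row["a"] > 0)


rows = [
    {"a": 1.0, "b": 5.0},
    {"a": -1.0, "b": 5.0},
    {"a": 2.0, "b": 0.0},
    {"a": -3.0, "b": 1.0},
]
targets = [1, 0, 1, 0]
importance = column_importance(rows, targets, toy_predict)
```

A larger drop in accuracy means the column mattered more; a column the model ignores scores zero.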

Add column drop-out once the model converges

Once we reach a "maximum" accuracy for the model, start feeding it incomplete datasets by using the datasource's dropout feature. This should help the network better predict with missing input data in the future.

The important things to think about / implement:

  • Which columns do we choose to drop? Do we go through all of them one by one? Do we try to see how the awareness network reacts to certain missing columns and choose combinations of columns to drop based on that?
  • If the dropout-trained model has slightly worse accuracy than the best one trained without dropout, which one do we pick?
  • We need to change the dropout interface for the datasource so that it can be called during training, rather than only modified via the config during the setup of the datasource.
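One possible answer to the first question, sketched under assumptions: drop each column on its own first, then move on to random combinations. The names `drop_columns` and `dropout_schedule` are hypothetical and do not reflect the datasource's actual dropout interface.

```python
import random
from typing import Dict, Iterator, List, Optional, Sequence

Row = Dict[str, Optional[float]]


def drop_columns(rows: List[Row], columns: Sequence[str]) -> List[Row]:
    """Return a copy of the rows with the given columns set to None."""
    return [{k: (None if k in columns else v) for k, v in r.items()} for r in rows]


def dropout_schedule(
    columns: Sequence[str], rng: random.Random, p: float = 0.5
) -> Iterator[List[str]]:
    """Yield column subsets to drop: each column alone first, then random combinations."""
    for col in columns:
        yield [col]
    while True:
        subset = [c for c in columns if rng.random() < p]
        # Skip empty subsets and never drop every column at once.
        if 0 < len(subset) < len(columns):
            yield subset


rng = random.Random(0)
schedule = dropout_schedule(["a", "b", "c"], rng)
first_three = [next(schedule) for _ in range(3)]
batch = drop_columns([{"a": 1.0, "b": 2.0, "c": 3.0}], first_three[0])
random_subset = next(schedule)
```

Because the schedule is a generator, it can be consumed during training rather than fixed in the config at setup time.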

AX Optimization causes transformer error during training

Sometimes, when running the ax optimization (happened on a proprietary dataset, doesn't seem to happen on others), the transformer's self.feature_len_map mysteriously gets cleared before the first call to the callback function during train.

I'm not sure what's causing this, we need to look into it further.

Add generic training loop

Currently, a lot of the elements of the training loop inside the nn mixer's iter_fit, and in the callback that the predictor api passes to it, are re-used in the categorical autoencoder and the text encoder. It might be worthwhile abstracting away a few of these things and re-using them in all 3 places.
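A shared loop might look something like the sketch below. It is an assumption about what could be factored out (epoch iteration, a per-epoch callback, early stopping), not the actual iter_fit logic; `training_loop` and its parameters are hypothetical.

```python
from typing import Callable, Optional


def training_loop(
    step: Callable[[], float],
    evaluate: Callable[[], float],
    max_epochs: int,
    patience: int = 3,
    callback: Optional[Callable[[int, float, float], None]] = None,
) -> int:
    """Run `step` once per epoch and stop early when `evaluate` stops improving.

    Returns the number of epochs that were actually run.
    """
    best = float("inf")
    stale = 0
    for epoch in range(1, max_epochs + 1):
        training_error = step()
        test_error = evaluate()
        if callback is not None:
            callback(epoch, training_error, test_error)
        if test_error < best:
            best, stale = test_error, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return max_epochs


# Toy usage: the test error improves twice, then gets worse three times in a row.
errors = iter([1.0, 0.5, 0.6, 0.7, 0.8, 0.1])
log = []
epochs_run = training_loop(
    step=lambda: 0.0,
    evaluate=lambda: next(errors),
    max_epochs=10,
    callback=lambda epoch, train_err, test_err: log.append((epoch, test_err)),
)
```

The mixer, the categorical autoencoder and the text encoder would each supply their own `step` and `evaluate` callables.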

Allow disabling the encoder and transformer caches

Allow disabling the encoder and transformer caches, via a flag and/or automatically when the data in a column or in all the columns would result in caches that are too big.

Before doing this we need to implement #29
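The flag plus automatic fallback could be decided by something like the following sketch. The function name, the 512 MiB limit and the 4 bytes per value are all assumptions chosen for illustration.

```python
def should_cache(
    n_rows: int,
    encoded_length: int,
    enable_cache: bool = True,
    max_cache_bytes: int = 512 * 1024 ** 2,  # assumed limit, not a lightwood default
    bytes_per_value: int = 4,  # assumes 32-bit floats
) -> bool:
    """Decide whether to cache the encoded representation of a column."""
    if not enable_cache:
        return False
    estimated_bytes = n_rows * encoded_length * bytes_per_value
    return estimated_bytes <= max_cache_bytes


small = should_cache(n_rows=1_000, encoded_length=768)
large = should_cache(n_rows=1_000_000, encoded_length=768)
forced_off = should_cache(n_rows=10, encoded_length=10, enable_cache=False)
```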

Run unit tests as part of the travis CI tests

Currently we have a bunch of relatively quick unit tests for each file. We should run some of these (or even all of them) as part of the CI tests.

This will also force us to keep them updated. A lot of them were/are deprecated, since they were written when lightwood was first being developed.

Add more checks to the CI tests

The lightwood CI tests don't cover much ground at the moment, we should add a few things to them:

a) Try turning various flags on/off (e.g. OVERSAMPLE, SELFAWARE, PLINEAR)

b) Run on one or two datasets which are either deterministic (and should reach ~100% accuracy) or for which we have a lot of previous benchmarks (e.g. default on credit), and look at the actual accuracy obtained on them. If it's surprisingly low, don't auto-deploy to PyPI. For why this is needed, see releases 0.11.6 and 0.11.7, which are essentially "broken": they don't reach a decent accuracy on any dataset, yet they passed the CI tests

c) #65
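Points (a) and (b) could be combined along these lines. This is a sketch only: `train_and_score` stands in for whatever actually trains lightwood with a given flag configuration, and `fake_train_and_score` is a made-up example.

```python
import itertools
from typing import Callable, Dict, List

FLAGS = ["OVERSAMPLE", "SELFAWARE", "PLINEAR"]


def flag_combinations(flags: List[str]) -> List[Dict[str, bool]]:
    """Every on/off combination of the given flags."""
    return [
        dict(zip(flags, values))
        for values in itertools.product([False, True], repeat=len(flags))
    ]


def failing_configs(
    train_and_score: Callable[[Dict[str, bool]], float],
    min_accuracy: float,
) -> List[Dict[str, bool]]:
    """Configurations whose accuracy falls below the release threshold."""
    return [
        cfg for cfg in flag_combinations(FLAGS) if train_and_score(cfg) < min_accuracy
    ]


# Hypothetical scorer: pretend SELFAWARE together with PLINEAR is broken.
def fake_train_and_score(cfg: Dict[str, bool]) -> float:
    return 0.5 if cfg["SELFAWARE"] and cfg["PLINEAR"] else 0.99


failures = failing_configs(fake_train_and_score, min_accuracy=0.95)
```

A CI job would then refuse to deploy whenever `failures` is non-empty.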

Add argument for if and which gpus to use

For debugging purposes, and for people who want to try lightwood but have issues with cudnn, we should add an argument that forces lightwood to use the cpu.

We should also add (ideally via the same argument) the ability to specify a list of GPUs to use (for people wishing to keep certain GPUs dedicated to other models).
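A single argument covering both cases might be resolved as in the sketch below. `resolve_devices` and the `gpus` argument are hypothetical; in practice the number of available GPUs would presumably come from something like `torch.cuda.device_count()`.

```python
from typing import List, Sequence, Union


def resolve_devices(
    gpus: Union[bool, Sequence[int], None],
    available_gpus: int,
) -> List[str]:
    """Map a user preference onto device names such as "cpu" or "cuda:1".

    gpus=False forces the CPU, gpus=None or True uses every available GPU,
    and a list of ids restricts lightwood to those GPUs.
    """
    if gpus is False:
        return ["cpu"]
    if gpus is None or gpus is True:
        ids = list(range(available_gpus))
    else:
        ids = list(gpus)
        invalid = [i for i in ids if not 0 <= i < available_gpus]
        if invalid:
            raise ValueError(f"Unknown GPU ids: {invalid}")
    # Fall back to the CPU when no GPU was selected or none exists.
    return [f"cuda:{i}" for i in ids] or ["cpu"]


forced_cpu = resolve_devices(False, available_gpus=4)
no_gpu_machine = resolve_devices(None, available_gpus=0)
selected = resolve_devices([1, 3], available_gpus=4)
```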

Input parameters error for callback_on_iter() function

Describe the bug
A TypeError is thrown when training a new model.
In the latest version of lightwood, accuracy is sent to the callback_on_iter function, which doesn't accept that parameter:

callback_on_iter(epoch, training_error, test_error, delta_mean, self.calculate_accuracy(test_data_ds))

To Reproduce
Steps to reproduce the behavior:

  1. Train new data
  2. See error: callback_on_iter() takes 5 positional arguments but 6 were given
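The error can be reproduced in isolation with the sketch below. The `Predictor` class and the `tolerant_callback` variant are hypothetical; the point is only that a bound method declared with four values (five positional arguments counting self) fails when a fifth value is passed, and that an optional trailing parameter avoids it.

```python
class Predictor:
    # Old-style callback: self plus four values, i.e. 5 positional arguments.
    def callback_on_iter(self, epoch, training_error, test_error, delta_mean):
        return epoch, None

    # Tolerant variant: accuracy is optional and any extra values are ignored.
    def tolerant_callback(
        self, epoch, training_error, test_error, delta_mean, accuracy=None, *extra
    ):
        return epoch, accuracy


predictor = Predictor()

try:
    # Five values are passed, as in the call shown in the report.
    predictor.callback_on_iter(1, 0.4, 0.5, 0.01, 0.9)
    message = ""
except TypeError as err:
    message = str(err)

result = predictor.tolerant_callback(1, 0.4, 0.5, 0.01, 0.9)
```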


ModuleNotFoundError: No module named 'botorch.models.fidelity'

Describe the bug
There is an import error when using the latest lightwood version (lightwood==0.13.5).

To Reproduce
Steps to reproduce the behavior:

  1. Use full_test.py or any example from the mindsdb repository

Stacktrace

  File "train.py", line 5, in <module>
    import mindsdb
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/mindsdb/__init__.py", line 7, in <module>
    import lightwood
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/lightwood/__init__.py", line 8, in <module>
    import lightwood.model_building
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/lightwood/model_building/__init__.py", line 1, in <module>
    from lightwood.model_building.basic_ax_optimizer.basic_ax_optimizer import BasicAxOptimizer
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/lightwood/model_building/basic_ax_optimizer/basic_ax_optimizer.py", line 1, in <module>
    import ax
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/__init__.py", line 5, in <module>
    from ax.modelbridge import Models
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/modelbridge/__init__.py", line 6, in <module>
    from ax.modelbridge.factory import (
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/modelbridge/factory.py", line 13, in <module>
    from ax.modelbridge.discrete import DiscreteModelBridge
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/modelbridge/discrete.py", line 18, in <module>
    from ax.models.discrete_base import DiscreteModel
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/models/__init__.py", line 5, in <module>
    from ax.models.torch.botorch import BotorchModel
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/models/torch/botorch.py", line 10, in <module>
    from ax.models.torch.botorch_defaults import (
  File "/home/zoran/MyProjects/mindsdb-examples/pl/lib/python3.7/site-packages/ax/models/torch/botorch_defaults.py", line 12, in <module>
    from botorch.models.fidelity.gp_regression_fidelity import (
ModuleNotFoundError: No module named 'botorch.models.fidelity'

Encoder transfer

There are certain encoders which we might want to train whenever lightwood learns a model, such as the categorical auto-encoder, the basic RNN text encoder and various other encoders (see #6).

We might want to re-train a model but not re-train all the encoders used for that model (e.g. if a new column was added).

Partially, this requires implementing modular re-training logic (i.e. in this case start the mixer training from scratch but leave the encoders as is), but also that we're able to train encoders for new columns or for a list of columns that we think have changed in such a way as to warrant re-encoding.

Issues training on small datasets / Make sure nr parameters > nr of rows

Pretty self-explanatory: essentially, if we have more parameters in the mixer than unique rows in the input dataset, this might result in the model overfitting to predict each separate row. Then the testing dataset just becomes a selector between a number of over-fitted models.

To some extent implementing #73 might help with this.

I'm also not yet sure whether or not this is an actual issue we ran into. There were indeed cases where predicting with lightwood on n rows (where n is small, say around 800), that had a relatively large input representation (which would result in a network with dozens or hundreds of thousands of parameters), resulted in surprisingly poor accuracy and long training time.

However, copy-pasting the rows a few times seems to have fixed the issue... so I don't see how that would necessarily fit this model, since this issue would be caused by the number of distinct rows, not by the absolute number of rows.

@torrmal if you have any further opinions on this or if you think I'm misunderstanding your stance about this issue please correct me.
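The check in the title could be sketched as below, assuming a plain fully connected mixer. The layer widths and the toy rows are made up for illustration; the example also shows why copy-pasting rows leaves the comparison unchanged, since only distinct rows are counted.

```python
from typing import Hashable, Sequence


def dense_parameter_count(layer_sizes: Sequence[int]) -> int:
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(
        (n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


def has_more_parameters_than_rows(
    layer_sizes: Sequence[int], rows: Sequence[Hashable]
) -> bool:
    """Compare the parameter count against the number of distinct rows."""
    return dense_parameter_count(layer_sizes) > len(set(rows))


# Hypothetical mixer: 200 inputs, one hidden layer of 400 units, 1 output.
layer_sizes = [200, 400, 1]

# 800 distinct rows, copy-pasted three times.
rows = [(i, i % 7) for i in range(800)] * 3

n_parameters = dense_parameter_count(layer_sizes)
too_many_parameters = has_more_parameters_than_rows(layer_sizes, rows)
```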

Make text encoders predict numerical targets (and maybe other types)

Attach a head to the distilBERT (or preferably make it generic, since all of the current ones output 768-dimensional embeddings anyway) that can predict numerical targets, and train it in a similar way to what we do for categorical targets.

Maybe try using the categorical head and see if it fits the bill with the right function (maybe change/remove the last layer if it's a softmax / some other exponential normalization function).

We could also do this for text/image/sequence outputs. However, for text there are the LM (language modeling) heads that hugging-face already provides, we don't really support image outputs, and I'm not really familiar with the representations that come out of cesium, so I'm not sure how easy those would be to "predict" or what kind of loss/model one would want to use for them.

Save encoder

Since we are now training certain encoders, and that process takes time, it would be nice if we could save each encoder once training is complete. That way, if we run into issues with the other encoders or mixers, or if we want to tweak the mixer behavior but not the encoders, we don't have to re-train everything again.

This is only really an issue on very large datasets, but considering we have some datasets where it takes over a day to train the text encoders (on an undersampled version), I think this might be a time saver in the long run.

This kind of modular saving/freezing of certain components could also be a good lead into being able to partially re-train a model when new data points come.
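A minimal sketch of the idea, assuming the encoder's learned state can be serialized on its own. `ToyEncoder` and `load_or_train` are hypothetical stand-ins, and JSON is used here only because the toy state is a plain dictionary; real encoders with network weights would need a different format.

```python
import json
import os
import tempfile
from typing import Dict, List, Optional, Tuple


class ToyEncoder:
    """Hypothetical stand-in for an encoder that is expensive to train."""

    def __init__(self, vocabulary: Optional[Dict[str, int]] = None):
        self.vocabulary = dict(vocabulary or {})

    def fit(self, values: List[str]) -> None:
        for v in values:
            self.vocabulary.setdefault(v, len(self.vocabulary))

    def encode(self, values: List[str]) -> List[int]:
        return [self.vocabulary[v] for v in values]

    def save(self, path: str) -> None:
        with open(path, "w") as f:
            json.dump(self.vocabulary, f)

    @classmethod
    def load(cls, path: str) -> "ToyEncoder":
        with open(path) as f:
            return cls(json.load(f))


def load_or_train(path: str, values: List[str]) -> Tuple[ToyEncoder, bool]:
    """Reuse a saved encoder if one exists; otherwise train and save it."""
    if os.path.exists(path):
        return ToyEncoder.load(path), False
    encoder = ToyEncoder()
    encoder.fit(values)
    encoder.save(path)
    return encoder, True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "text_encoder.json")
    first, trained_first = load_or_train(path, ["a", "b", "a"])
    second, trained_second = load_or_train(path, ["a", "b", "a"])
    encoded = second.encode(["b", "a"])
```

The second call skips training entirely, which is the time saving described above.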

Numerical encoder / Predict data frame iteration issue

There was an issue in mindsdb's CI tests where it was passing a list of correct numbers as the values of predict (as the column in the dataframe), yet when encoding it the numerical encoder somehow stumbles upon a single variable with the value None.

There's a hotfix in the numerical encoder at line 76, but we need to figure out why this is happening. I suspect it's a lightwood bug, since there's no issue on the mindsdb side (the numbers being passed are all correct, not None, not infinite).

Setup linter

Describe the bug
Set up a Python linter for the project

Expected behavior
Linting code

I know this is not the main purpose of the project, but I think setting up a linter would help with project maintenance.

What do you think?

Separate encoder logic

Currently, an encoder's encode function encodes all the data in the column and creates the mapping required for this encoding (e.g. the dictionary for one-hot or the min-max range for numerical encoders) in one go.

We should separate this into something like:

create_encoding_mapping
and
encode

Where the latter should be allowed to operate with an arbitrary amount of data from the column.
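For a one-hot encoder, the split could look roughly like the sketch below. The two method names come from the proposal above; the class itself and its handling of unseen values (an all-zero vector) are assumptions made for illustration.

```python
from typing import Dict, Iterable, List


class OneHotEncoder:
    def __init__(self) -> None:
        self.mapping: Dict[str, int] = {}

    def create_encoding_mapping(self, column_values: Iterable[str]) -> None:
        """Build the value -> index dictionary from the whole column."""
        self.mapping = {v: i for i, v in enumerate(sorted(set(column_values)))}

    def encode(self, values: Iterable[str]) -> List[List[int]]:
        """Encode any number of values using the existing mapping.

        Values that were not seen when the mapping was built become all zeros.
        """
        encoded = []
        for v in values:
            vector = [0] * len(self.mapping)
            if v in self.mapping:
                vector[self.mapping[v]] = 1
            encoded.append(vector)
        return encoded


encoder = OneHotEncoder()
encoder.create_encoding_mapping(["red", "green", "red", "blue"])

single = encoder.encode(["green"])
several = encoder.encode(["red", "purple"])
```

Once the mapping exists, `encode` can be called on a single row at predict time or on arbitrary batches during training.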

Training encoders

It might be worthwhile looking into training some of the final layers for the img and sequence encoders once the mixer starts achieving decent performance.

However, this would require (I think) making the encoders part of the actual network we optimize or adding a bunch of error propagation logic to the encoder objects and creating the glue code between the error on the final layer of the mixer and the first layer of each encoder.

Implementing this dynamic encoder modification might also help us test different encoders during the initial steps of training in the future, to automatically determine the best encoders for a specific dataset.
