atm's Introduction

An open source project from the Data to AI Lab at MIT.

ATM - Auto Tune Models

Overview

Auto Tune Models (ATM) is an AutoML system designed with ease of use in mind. In short, you give ATM a classification problem and a dataset as a CSV file, and ATM will try to build the best model it can. ATM is based on a paper of the same name, and the project is part of the Human-Data Interaction (HDI) Project at MIT.

Install

Requirements

ATM has been developed and tested on Python 2.7, 3.5, and 3.6.

Also, although it is not strictly required, the use of a virtualenv is highly recommended in order to avoid interfering with other software installed on the system where ATM is run.

These are the minimum commands needed to create a virtualenv using python3.6 for ATM:

pip install virtualenv
virtualenv -p $(which python3.6) atm-venv

Afterwards, activate the virtualenv by executing:

source atm-venv/bin/activate

Remember to run this command every time you start a new console to work on ATM!

Install with pip

After creating the virtualenv and activating it, we recommend using pip in order to install ATM:

pip install atm

This will pull and install the latest stable release from PyPI.

Install from source

Alternatively, with your virtualenv activated, you can clone the repository and install it from source by running make install on the stable branch:

git clone git@github.com:HDI-Project/ATM.git
cd ATM
git checkout stable
make install

Install for Development

If you want to contribute to the project, a few more steps are required to make the project ready for development.

First, please head to the GitHub page of the project and make a fork of the project under your own username by clicking on the fork button in the upper right corner of the page.

Afterwards, clone your fork and create a branch from master with a descriptive name that includes the number of the issue that you are going to work on:

git clone git@github.com:{your username}/ATM.git
cd ATM
git branch issue-xx-cool-new-feature master
git checkout issue-xx-cool-new-feature

Finally, install the project with the following command, which will install some additional dependencies for code linting and testing.

make install-develop

Make sure to use these tools regularly while developing by running make lint and make test.

Data Format

ATM input is always a CSV file with the following characteristics:

  • It uses a single comma, ,, as the separator.
  • Its first row is a header that contains the names of the columns.
  • There is a column that contains the target variable that will need to be predicted.
  • The rest of the columns are all variables or features that will be used to predict the target column.
  • Each row corresponds to a single, complete, training sample.

Here are the first 5 rows of a valid CSV with 4 features and one target column called class as an example:

feature_01,feature_02,feature_03,feature_04,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa

This CSV can be passed to ATM as a local filesystem path, as a complete AWS S3 bucket and path specification, or as a URL.

You can find a collection of demo datasets in the atm-data S3 Bucket in AWS.

Quickstart

In this short tutorial we will guide you through a series of steps that will help you get started with ATM by exploring its Python API.

1. Get the demo data

The first step needed to run ATM is to obtain the demo dataset that will be used during the rest of the tutorial.

For this demo we will be using the pollution_1.csv file from the atm-data bucket, which you can download with your browser or by using the following command:

atm download_demo pollution_1.csv

2. Create an ATM instance

The first thing to do after obtaining the demo dataset is to create an ATM instance.

from atm import ATM

atm = ATM()

By default, if the ATM instance is created without any arguments, it will create an SQLite database called atm.db in your current working directory.

If you want to connect to a SQL database instead, or change the location of your SQLite database, please check the API Reference for the complete list of available options.

3. Search for the best model

Once you have the ATM instance ready, you can use the method atm.run to start searching for the model that best predicts the target column of your CSV file.

This method has to be given the path to your CSV file, which can be a local filesystem path or a URL pointing to an HTTP or S3 resource.

For example, if we have previously downloaded the pollution_1.csv file inside our current working directory, we can call run like this:

results = atm.run(train_path='pollution_1.csv')

Alternatively, we can use the HTTPS URL of the file to have ATM download the CSV for us:

results = atm.run(train_path='https://atm-data.s3.amazonaws.com/pollution_1.csv')

As the last option, if we have the file inside an S3 bucket, we can have ATM download it by passing a URI in the s3://{bucket}/{key} format:

results = atm.run(train_path='s3://atm-data/pollution_1.csv')

In order to make this work with a private S3 bucket, please make sure that you have configured your AWS credentials file, or that you have created your ATM instance passing it the access_key and secret_key arguments.

This run call will start what is called a Datarun, and a progress bar will be displayed while the different models are tested and tuned.

Processing dataset demos/pollution_1.csv
100%|##########################| 100/100 [00:10<00:00,  6.09it/s]

Once this process has finished, a message will be printed indicating that the Datarun has ended. We can then explore the results object.

4. Explore the results

Once the Datarun has finished, we can explore the results object in several ways:

a. Get a summary of the Datarun

The describe method will give us a summary of the Datarun execution:

results.describe()

This will print a short description of this Datarun similar to this:

Datarun 1 summary:
    Dataset: 'demos/pollution_1.csv'
    Column Name: 'class'
    Judgment Metric: 'f1'
    Classifiers Tested: 100
    Elapsed Time: 0:00:07.638668

b. Get a summary of the best classifier

The get_best_classifier method will print information about the best classifier that was found during this Datarun, including the method used and the best hyperparameters found:

results.get_best_classifier()

The output will be similar to this:

Classifier id: 94
Classifier type: knn
Params chosen:
    n_neighbors: 13
    leaf_size: 38
    weights: uniform
    algorithm: kd_tree
    metric: manhattan
    _scale: True
Cross Validation Score: 0.858 +- 0.096
Test Score: 0.714

c. Explore the scores

The get_scores method will return a pandas.DataFrame with information about all the classifiers tested during the Datarun, including their cross validation scores and the location of their pickled models.

scores = results.get_scores()

The contents of the scores dataframe should be similar to these:

  cv_judgment_metric cv_judgment_metric_stdev  id test_judgment_metric  rank
0       0.8584126984             0.0960095737  94         0.7142857143   1.0
1       0.8222222222             0.0623609564  12         0.6250000000   2.0
2       0.8147619048             0.1117618135  64         0.8750000000   3.0
3       0.8139393939             0.0588721670  68         0.6086956522   4.0
4       0.8067754468             0.0875180564  50         0.6250000000   5.0
...

5. Make predictions

Once we have found and explored the best classifier, we will want to make predictions with it.

In order to do this, we need to follow several steps:

a. Export the best classifier

The export_best_classifier method can be used to serialize and save the best classifier model using pickle in the desired location:

results.export_best_classifier('path/to/model.pkl')

If the classifier has been saved correctly, a message will be printed indicating so:

Classifier 94 saved as path/to/model.pkl

If the path that you provide already exists, you can overwrite it by adding the argument force=True.

b. Load the exported model

Once it is exported you can load it back by calling the load method from the atm.Model class and passing it the path where the model has been saved:

from atm import Model

model = Model.load('path/to/model.pkl')

Once you have loaded your model, you can pass new data to its predict method to make predictions:

import pandas as pd

# reuse the pollution_1.csv file downloaded at the beginning of the tutorial
data = pd.read_csv('pollution_1.csv')

predictions = model.predict(data.head())

What's next?

For more details about ATM and all its possibilities and features, please check the documentation site.

There you can learn more about its Command Line Interface and its REST API, as well as how to contribute to ATM in order to help us develop new features or cool ideas.

Credits

ATM is an open source project from the Data to AI Lab at MIT which has been built and maintained over the years by the team listed in the Contributors section below.

Citing ATM

If you use ATM, please consider citing the following paper:

Thomas Swearingen, Will Drevo, Bennett Cyphers, Alfredo Cuesta-Infante, Arun Ross, Kalyan Veeramachaneni. ATM: A distributed, collaborative, scalable system for automated machine learning. IEEE BigData 2017, 151-162

BibTeX entry:

@inproceedings{DBLP:conf/bigdataconf/SwearingenDCCRV17,
  author    = {Thomas Swearingen and
               Will Drevo and
               Bennett Cyphers and
               Alfredo Cuesta{-}Infante and
               Arun Ross and
               Kalyan Veeramachaneni},
  title     = {{ATM:} {A} distributed, collaborative, scalable system for automated
               machine learning},
  booktitle = {2017 {IEEE} International Conference on Big Data, BigData 2017, Boston,
               MA, USA, December 11-14, 2017},
  pages     = {151--162},
  year      = {2017},
  crossref  = {DBLP:conf/bigdataconf/2017},
  url       = {https://doi.org/10.1109/BigData.2017.8257923},
  doi       = {10.1109/BigData.2017.8257923},
  timestamp = {Tue, 23 Jan 2018 12:40:42 +0100},
  biburl    = {https://dblp.org/rec/bib/conf/bigdataconf/SwearingenDCCRV17},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Related Projects

BTB

BTB, for Bayesian Tuning and Bandits, is the core AutoML library in development under the HDI project. BTB exposes several methods for hyperparameter selection and tuning through a common API. It allows domain experts to extend existing methods and add new ones easily. BTB is a central part of ATM, and the two projects were developed in tandem, but it is designed to be implementation-agnostic and should be useful for a wide range of hyperparameter selection tasks.

Featuretools

Featuretools is a python library for automated feature engineering. It can be used to prepare raw transactional and relational datasets for ATM. It is created and maintained by Feature Labs and is also a part of the Human Data Interaction Project.

atm's People

Contributors

alfredo-cuesta, bcyphers, csala, faviovazquez, kkarrancsu, kmax12, kveerama, lauragustafson, micahjsmith, pvk-developer, rogerfitz, thswear

atm's Issues

Add database command to remove dataruns

Right now, if you create a datarun with a typo or just decide you don't want to run it, there's no simple way to remove it from the database. We should add a removal subcommand to enter_data.py. Maybe:

python enter_data.py remove --datarun 1

likewise,

python enter_data.py remove --dataset 1

REST API

A REST API would be useful to access ATM from any other language and device through the web, and also to illustrate how the code is structured and might be extended.

From the project's readme I get that the internal APIs are going to change and that it consequently might be a bit early to develop a REST API, but I wanted to see what was possible to do with the current code. The API currently serves:

  • Various GET endpoints for reading data from the four entities that are currently present in the database
  • 1 GET endpoint to run the worker.py script inside the virtualenv as a subprocess and retrieve its stdout and stderr
  • 1 POST endpoint to send a .csv file with the HTTP request, save the file to the atm/data directory and run enter_data on it.

No modifications were made outside of the rest_api_server.py file except to the requirements.txt file, adding flask and simplejson to the project dependencies.

TODOs / caveats:

  • No AWS integration
  • api.py currently does not check if the uploaded filename is already present, so a CSV upload can overwrite a previously sent file with the same name. This needs fixing, but I thought that I should ask first if the atm/data directory is the right one to put new files in, and if storing UUIDs is OK with the project's design, before using them and doing a pull request.

Example api.py usage:

After following the readme's installation instructions and running python scripts/rest_api_server.py in a separate shell under the virtualenv:

curl localhost:5000/enter_data -F file=@/path/file.csv

It should return:

{"success": true}

To see the created dataset:

curl localhost:5000/datasets/1 | jq

{
  "class_column": "class",
  "description": null,
  "size_kb": 6,
  "test_path": null,
  "k_classes": 3,
  "majority": 0.333333333,
  "d_features": 6,
  "train_path": "FULLPATH/file.csv",
  "id": 1,
  "n_examples": 150,
  "name": "file"
}

To see the created datarun:

curl localhost:5000/dataruns/1 | jq

{
  "status": "pending",
  "start_time": null,
  "description": "uniform__uniform",
  "r_minimum": 2,
  "metric": "f1",
  "budget": 100,
  "selector": "uniform",
  "priority": 1,
  "score_target": "cv_judgment_metric",
  "deadline": null,
  "budget_type": "classifier",
  "id": 1,
  "tuner": "uniform",
  "dataset_id": 1,
  "gridding": 0,
  "k_window": 3,
  "end_time": null
}

To run the worker.py script once:

curl localhost:5000/simple_worker | jq

After a while, it returns:

{
  "stderr": "",
  "stdout": "huge stdout string with worker.py's output"
}

Wrong keywords into ML models

Hello, I am trying to test run your classifiers on our data, and am getting some errors when the system tries various classifiers. The relevant portions of the error messages are pasted below:

Chose parameters for method dt:
	n_jobs = -1
	min_samples_leaf = 1
	n_estimators = 100
	criterion = entropy
	max_features = 0.950735797858
	max_depth = 6
TypeError: __init__() got an unexpected keyword argument 'n_jobs'

Chose parameters for method dt:
	C = 0.00359684119303
	tol = 0.000357435603328
	fit_intercept = True
	penalty = l2
	_scale = True
	dual = False
	class_weight = auto
TypeError: __init__() got an unexpected keyword argument 'C'

Chose parameters for method logreg:
	n_jobs = -1
	min_samples_leaf = 1
	n_estimators = 100
	criterion = gini
	max_features = 0.218919710352
	max_depth = 7
TypeError: __init__() got an unexpected keyword argument 'min_samples_leaf'

Based on these errors, it seems that the hyperparameters meant for scikit-learn's DecisionTree model are being mixed up with the hyperparameters meant for scikit-learn's LogisticRegression model. For example, LogisticRegression does not have a "min_samples_leaf" hyperparameter. Similarly, DecisionTreeClassifier does not have C or n_jobs as hyperparameters. Digging around, the methods/decision_tree.json and methods/logistic_regression.json files seem correct, so I'm not sure why this is getting mixed up.

I get similar issues when running against the example provided in the readme. Here is a copy/paste of the entire error message

Selector: <class 'btb.selection.uniform.Uniform'>
Tuner: <class 'btb.tuning.uniform.Uniform'>
Choosing hyperparameters...
Chose parameters for method knn:
	C = 0.000128015603097
	tol = 0.000148636727508
	fit_intercept = True
	penalty = l2
	_scale = True
	dual = True
	class_weight = auto
Creating classifier...
Testing classifier...
Error testing classifier: datarun=<ID = 5, dataset ID = 5, strategy = uniform__uniform, budget = classifier (100), status: running>
Traceback (most recent call last):
  File "atm/worker.py", line 440, in run_classifier
    model, performance = self.test_classifier(classifier_id, params)
  File "atm/worker.py", line 374, in test_classifier
    performance = wrapper.start()
  File "/home/kkarra/atm/atm/wrapper.py", line 97, in start
    self.make_pipeline()
  File "/home/kkarra/atm/atm/wrapper.py", line 383, in make_pipeline
    classifier = self.class_(**classifier_params)
  File "/home/kkarra/atm/venv/local/lib/python2.7/site-packages/sklearn/neighbors/classification.py", line 126, in __init__
    metric_params=metric_params, n_jobs=n_jobs, **kwargs)
TypeError: _init_params() got an unexpected keyword argument 'C'

Here, it seems that the KNN model is getting the wrong keywords. I'm not sure why models are not being optimized with appropriate keywords. I'm wondering if I should dig further to ensure that the selected model chooses the correct keywords, or if this is an already-identified bug from porting the old environment to the new one.

Change btb evaluation to include multiple trials

Change evaluate_btb.py to include multiple trials of tuner. This allows for a better evaluation as to whether a specific tuner actually leads to a performance increase. The results will be compared to the best so far in terms of mean over the trials, and standard deviation.

atm importing issue

When I run atm/enter_data.py, it pops up this error:
ImportError: No module named atm.config

Add in-memory database option

In general, we would like to better support the use of ATM as a normal library. It should be just as convenient to use pieces of ATM in another python project as it is to use the system from the command line.

To that end, we should add a database option that is purely in-memory -- and does not leave a file system footprint unless requested. For example, we could use pandas DataFrames in the back-end instead of tables in a SQL database while maintaining the same interface for the Database class.
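As a rough sketch of the idea (not ATM's actual code), the back-end could look something like this; the method names are illustrative assumptions that only mirror the kind of interface the real Database class would need to keep exposing:

import pandas as pd

class InMemoryDatasetStore:
    """Toy stand-in for a DataFrame-backed Database back-end."""

    def __init__(self):
        # one row of metadata per dataset, indexed by an auto-incremented id
        self._datasets = pd.DataFrame(columns=['name', 'train_path', 'class_column'])

    def create_dataset(self, name, train_path, class_column):
        dataset_id = len(self._datasets) + 1
        self._datasets.loc[dataset_id] = [name, train_path, class_column]
        return dataset_id

    def get_dataset(self, dataset_id):
        # returns a pandas Series with the dataset metadata
        return self._datasets.loc[dataset_id]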

Make matplotlib a conditional import

Tests fail because utilities.py is imported, which in turn imports matplotlib. It doesn't make sense to have plotting as a hard dependency, but you can make it a conditional import in utilities.py:

try:
    import matplotlib.pyplot as plt
except ImportError:
    plt = None

# then, in graph_series
if plt is None:
    raise ImportError("Unable to import matplotlib")

GaussianProcessClassifier errors with "N-th leading minor is not positive definite"

Appears to only happen when kernel == 'exp_sine_squared'. Does not happen every time. More investigation needed.

Error testing classifier: datarun=<ID = 24, dataset ID = 10, strategy = gp__bestk, budget = classifier (100), status: running>
Traceback (most recent call last):
  File "/home/bcyphers/work/fl/atm/atm/worker.py", line 401, in run_classifier
    model, performance = self.test_classifier(hyperpartition.method, params)
  File "/home/bcyphers/work/fl/atm/atm/worker.py", line 339, in test_classifier
    test_path=test_path)
  File "/home/bcyphers/work/fl/atm/atm/model.py", line 195, in train_test
    cv_scores = self.cross_validate(X_train, y_train)
  File "/home/bcyphers/work/fl/atm/atm/model.py", line 132, in cross_validate
    n_folds=self.N_FOLDS)
  File "/home/bcyphers/work/fl/atm/atm/metrics.py", line 194, in cross_validate_pipeline
    pipeline.fit(X[train_index], y[train_index])
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/pipeline.py", line 270, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 610, in fit
    self.base_estimator_.fit(X, y)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/multiclass.py", line 216, in fit
    for i, column in enumerate(columns))
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 326, in __init__
    self.results = batch()
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/multiclass.py", line 80, in _fit_binary
    estimator.fit(X, y)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 208, in fit
    self.kernel_.bounds)]
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 426, in _constrained_optimization
    fmin_l_bfgs_b(obj_func, initial_theta, bounds=bounds)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 193, in fmin_l_bfgs_b
    **opts)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 328, in _minimize_lbfgsb
    f, g = func_and_grad(x)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 278, in func_and_grad
    f = fun(x, *args)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
    return function(*(wrapper_args + args))
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 63, in __call__
    fg = self.fun(x, *args)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 200, in obj_func
    theta, eval_gradient=True)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 344, in log_marginal_likelihood
    self._posterior_mode(K, return_temporaries=True)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 397, in _posterior_mode
    L = cholesky(B, lower=True)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/linalg/decomp_cholesky.py", line 81, in cholesky
    check_finite=check_finite)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/linalg/decomp_cholesky.py", line 30, in _cholesky
    raise LinAlgError("%d-th leading minor not positive definite" % info)
LinAlgError: 31-th leading minor not positive definite

Support for regressions

Hello, if I understood correctly, ATM currently solves only classification tasks. I wonder if there is a plan for adding support for regression problems. Thanks!

Implement proper logging

Currently, there are print() statements scattered around the project, and worker.py has a simple custom _log function which prints information to stdout and a log file simultaneously. We should aim to get rid of print statements altogether and replace them with calls to python's logging module, so that output to log files and stdout is handled in a more robust way. This will make it more practical for users to run ATM in the background or to call parts of it from other programs.

If there is a third-party logging library that would do the job better, I'm open to using that as well.
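As a minimal sketch of what that could look like with the standard library (the file name and format below are illustrative choices, not ATM's actual configuration):

import logging
import sys

logger = logging.getLogger(__name__)

def setup_logging(log_file='atm.log', level=logging.INFO):
    """Send log records to both stdout and a file, replacing the custom _log helper."""
    formatter = logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s')
    for handler in (logging.StreamHandler(sys.stdout), logging.FileHandler(log_file)):
        handler.setFormatter(formatter)
        logging.getLogger().addHandler(handler)
    logging.getLogger().setLevel(level)

# then, instead of _log('Testing classifier...'):
# logger.info('Testing classifier...')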

Allow custom evaluation metrics

Right now, it's only possible to configure a datarun to use methods and metrics that are included with the library. It should be possible to pass JSON files for custom machine-learning methods and Python files/functions for custom metrics. This can be implemented in much the same way that custom tuners/selectors for BTB are handled.
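For custom metrics, a user-supplied Python file might contain nothing more than a plain scoring function. The (y_true, y_pred) signature below is an assumption that mirrors the scikit-learn convention, not a confirmed ATM interface:

import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Return 1 minus the balanced accuracy, averaged over classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [(y_pred[y_true == label] == label).mean() for label in np.unique(y_true)]
    return 1.0 - float(np.mean(recalls))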

Error happened when using model.predict(X) function

I trained a few models, picked the best one from the models directory, and was trying to predict new data when this error came up.

Traceback (most recent call last):
  File "test_best_model.py", line 21, in <module>
    preds = best_model.predict(X)
  File "/medical_data/Datasets/ATM/atm/model.py", line 209, in predict
    X, _ = self.encoder.transform(data)
  File "/medical_data/Datasets/ATM/atm/encoder.py", line 101, in transform
    features = data[self.feature_columns]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

I use pickle to load the model:

with open(classifier_p, 'rb') as f:
    best_model = pickle.load(f)

Dockerisation and parallelisations

Dockerisation of this project would be great. It makes it easy to install and deploy :) It would also be great to incorporate Celery so that jobs can be dispatched to different worker instances.

Metrics should be saved as JSON (not pickled)

Currently, the metrics dict, which is generated by Model.train_test(), is saved as a pickled Python object. This is unnecessary, because the object is entirely composed of Python dicts, lists, and numeric values. Saving metrics in JSON files instead would make it easier to eyeball the results and analyze them with other software.

This is just a matter of changing the save_metric() function in atm/utilities.py to use json.dump() instead of pickle.dump().
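A sketch of the proposed change (the real save_metric() signature in atm/utilities.py may differ slightly):

import json

def save_metric(metrics, path):
    # before: with open(path, 'wb') as f: pickle.dump(metrics, f)
    with open(path, 'w') as f:
        json.dump(metrics, f, indent=4)

def load_metric(path):
    with open(path) as f:
        return json.load(f)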

Bugs in Logging

Hello,
I believe that the logging configurations are not being passed through, resulting in models and metrics being persisted in the default directories rather than the ones specified in log.yaml.

I think the following needs to be added to atm/config.py
log_path = log_path or kwargs.get('log_config') (on line 524)

Additionally, the caller of the load_config function overwrites the log file configuration for what goes to STDOUT and the logs, which means that even if the user specifies the log level to be info for stdout, they still need to add the command line argument --verbose-metrics.

Allow testing in other environments

I'm guessing that the tester in their environment has populated the files config/test/btb/aws.yaml etc. and data/car_1.csv etc.

As it currently stands, it is not possible for someone who just clones the package to run the tests and assure themselves that the software is installed and works correctly.

Consider overriding the .gitignore to commit the needed yaml and csv files for the test/ directory only? I imagine that the AWS tests would be skipped by default. Also, at the least, there could be unit tests for some of the software that are not dependent on data and configs.

(Somewhat unrelated to this issue, I would have thought that test/method_test.py does unit tests, and was going to recommend the more conventional python -m pytest invocation. But it seems that that file is very similar to test/end_to_end_test.py, doing end-to-end tests as well. Fairly confused by that.)

Create `explorer.py` to help explore results of Dataruns

Should be a file that has helper methods for loading and testing previously-generated models. This should include some graphing and data visualization utilities.

Not sure about the best way to make the interface -- it could be a command-line tool like enter_data and worker, or it could be a Python REPL-style interface directly to the functions.

Passing arguments for computation of ROC curves

When the performance of a classifier is evaluated, the chain of function calls is as follows:

worker.py::self.test_classifier --> model.train_test --> model.test_final_model --> metrics.py::test_pipeline --> metrics.py::get_metrics

test_pipeline has a kwargs argument that allows include_curves to be set; however, no kwargs are passed through the functions higher up the chain, so include_curves always defaults to False. How do we want to address this? Should there be a command line argument (or a configuration option in run_config.yaml) which allows the user to decide whether the ROC curves should be computed and, if so, have that choice passed through the system?
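As a self-contained toy of the kwargs-forwarding pattern being proposed (the function names mirror the ATM call chain, but the bodies are stand-ins, not the real code):

def get_metrics(y_true, y_pred, include_curves=False):
    metrics = {'accuracy': sum(t == p for t, p in zip(y_true, y_pred)) / float(len(y_true))}
    if include_curves:
        metrics['curves'] = 'ROC/PR points would be computed here'
    return metrics

def test_pipeline(y_true, y_pred, **kwargs):
    return get_metrics(y_true, y_pred, **kwargs)

def train_test(y_true, y_pred, **kwargs):
    return test_pipeline(y_true, y_pred, **kwargs)

# a flag read from run_config.yaml (or a CLI argument) would decide the value here
print(train_test([0, 1, 1], [0, 1, 0], include_curves=True))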

Scientific notation in method json definitions is defined wrong

Pretty much all exponential hyperparameter ranges are defined as something like

"range": [10e-5, 10e5]

This is wrong, and it's my fault for misunderstanding how scientific notation works in Python. When I translated all the old enumeration classes to the new json, I made the mistake of turning 10**3 into 10e3 across the board (10e3 actually means 10 * 10**3, i.e. 10**4; the correct literal is 1e3). See e.g. https://github.com/HDI-Project/ATM/blob/50b592dd6a151a75470fb4120c1781ca7249d43f/atm/enumeration/classification/svm.py for how it was before. This is a quick fix that I'll push in a few minutes.
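For reference, a quick check of what those literals actually mean in Python:

assert 10e3 == 10000.0                 # "10eN" means 10 * 10**N, not 10**N
assert 10 ** 3 == 1000
assert 1e3 == 1000.0                   # the literal that actually matches 10**3
assert (10e-5, 10e5) == (1e-4, 1e6)    # what the current json ranges really mean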

Judgement Metric not passed through

Line 341 of atm/worker.py does not pass the score target, so regardless of what the user chooses in run_config.yaml, the code determines the best classifier based on the mu_sigma computation. It seems there is some inconsistency in nomenclature between the mu_sigma and the cv options.

Was the original intent to use mu_sigma to correspond to the highest lower error bound, cv to correspond to the average cv score, and test to correspond to the average test score? mu_sigma is not in the keys of the database, so it's not a valid run_config.yaml value. If we take cv to mean the same thing as mu_sigma, then one approach to deal with this may be (starting at Line 341 of atm/worker.py):

        if('cv' in self.datarun.score_target):
            score_target_in = 'mu_sigma'
        else:
            score_target_in = self.datarun.score_target
        old_best = self.db.get_best_classifier(datarun_id=self.datarun.id,
                                               score_target=score_target_in)

        cur_cv_val       = model.cv_judgment_metric
        cur_cv_err       = model.cv_judgment_metric_stdev
        cur_test_val     = model.test_judgment_metric
        
        if old_best is not None:
            old_cv_val   = old_best.cv_judgment_metric
            old_cv_err   = 2*old_best.cv_judgment_metric_stdev
            old_test_val = old_best.test_judgment_metric
        if('cv' in self.datarun.score_target):
            _log('Judgment metric (%s): %.3f +- %.3f' %
                 (self.datarun.metric, cur_cv_val, cur_cv_err))
            if old_best is not None:
                if (cur_cv_val - cur_cv_err) > (old_cv_val - old_cv_err):
                    _log('New best score! Previous best (classifier %s): %.3f +- %.3f' %
                         (old_best.id, old_cv_val, old_cv_err))
                else:
                    _log('Best so far (classifier %s): %.3f +- %.3f' %
                         (old_best.id, old_cv_val, old_cv_err))
        else:
            _log('Judgment metric (%s): %.3f' %
                 (self.datarun.metric, cur_test_val))
            if old_best is not None:
                if (cur_test_val) > (old_test_val):
                    _log('New best score! Previous best (classifier %s): %.3f' %
                         (old_best.id, old_test_val))
                else:
                    _log('Best so far (classifier %s): %.3f' %
                         (old_best.id, old_test_val))

MySQL setup not working

Hi all, any ideas re: the error message below?

Collecting mysql-python==1.2.5 (from -r requirements.txt (line 9))
  Downloading MySQL-python-1.2.5.zip (108kB)
    100% |████████████████████████████████| 112kB 9.6MB/s 
    Complete output from command python setup.py egg_info:
    sh: 1: mysql_config: not found
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-wg4Zpa/mysql-python/setup.py", line 17, in <module>
        metadata, options = get_config()
      File "setup_posix.py", line 43, in get_config
        libs = mysql_config("libs_r")
      File "setup_posix.py", line 25, in mysql_config
        raise EnvironmentError("%s not found" % (mysql_config.path,))
    EnvironmentError: mysql_config not found
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-wg4Zpa/mysql-python/

Nolearn DBN is out of date

nolearn.dbn, the library that ATM uses for its Deep Belief Network (DBN) classifier, is no longer supported. From the home page:

The nolearn.dbn module is no longer supported. Take a look at nolearn.lasagne for a more modern neural net toolkit.

We should upgrade to lasagne ASAP.

Possibly unrelated, but sometimes the DBN classifier will fail with an error like this:

Traceback (most recent call last):
  File "/home/bcyphers/work/fl/atm/atm/worker.py", line 401, in run_classifier
    model, performance = self.test_classifier(hyperpartition.method, params)
  File "/home/bcyphers/work/fl/atm/atm/worker.py", line 339, in test_classifier
    test_path=test_path)
  File "/home/bcyphers/work/fl/atm/atm/model.py", line 195, in train_test
    cv_scores = self.cross_validate(X_train, y_train)
  File "/home/bcyphers/work/fl/atm/atm/model.py", line 132, in cross_validate
    n_folds=self.N_FOLDS)
  File "/home/bcyphers/work/fl/atm/atm/metrics.py", line 194, in cross_validate_pipeline
    pipeline.fit(X[train_index], y[train_index])
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/pipeline.py", line 270, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/nolearn/dbn.py", line 407, in fit
    self.use_dropout,
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/gdbn/dbn.py", line 202, in fineTune
    err, outMB = step(inpMB, targMB, self.learnRates, self.momentum, self.L2Costs, useDropout)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/gdbn/dbn.py", line 303, in stepNesterov
    errSignals, outputActs, error = self.fpropBprop(inputBatch, targetBatch, useDropout)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/gdbn/dbn.py", line 262, in fpropBprop
    outputErrSignal = -self.outputActFunct.dErrordNetInput(targetBatch, self.state[-1], outputActs)
AttributeError: 'Tanh' object has no attribute 'dErrordNetInput'

Hopefully the upgrade will kill two birds with one stone.

Avoid creating redundant datasets

If enter_data() is called with the same train_path twice in a row and the data itself hasn't changed, a new Dataset does not need to be created.

We should add a column which stores some kind of hash of the actual data. When a Dataset is about to be created, if the metadata and data hash are exactly the same as an existing Dataset's, nothing should be added to the ModelHub database and the existing Dataset should be returned instead.
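A sketch of the kind of content hash that could be stored with each Dataset (the column name and the lookup helper below are assumptions, not existing ATM code):

import hashlib

def hash_data(train_path):
    """Return an MD5 hex digest of the raw CSV contents."""
    with open(train_path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# inside enter_data(), roughly:
# digest = hash_data(train_path)
# existing = db.find_dataset(data_hash=digest)   # hypothetical query helper
# if existing is not None:
#     return existing
# dataset = db.create_dataset(..., data_hash=digest)   # hypothetical new column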

Got error while running 'python atm/enter_data.py'

Hi all,
As a beginner with ATM,
I got the error below while running 'python atm/enter_data.py'.
Could you tell me how I can deal with this error?

Thank you!

=======================================
(myPy27) ubuntu@ip-172-31-17-79:~/workspace/git/atm$ python atm/enter_data.py
Traceback (most recent call last):
File "atm/enter_data.py", line 8, in
from .config import *
ValueError: Attempted relative import in non-package

Install failing

It seems that the install is conflicting with Anaconda. Has anyone found a workaround?

ahmets-MBP-782:atm ahmet$ virtualenv venv
Using base prefix '/Users/ahmet/anaconda'
New python executable in /Users/ahmet/Documents/GitHub/atm/venv/bin/python
dyld: Library not loaded: @rpath/libpython3.6m.dylib
  Referenced from: /Users/ahmet/Documents/GitHub/atm/venv/bin/python
  Reason: image not found
ERROR: The executable /Users/ahmet/Documents/GitHub/atm/venv/bin/python is not functioning
ERROR: It thinks sys.prefix is '/Users/ahmet/Documents/GitHub/atm' (should be '/Users/ahmet/Documents/GitHub/atm/venv')
ERROR: virtualenv is not compatible with this system or executable

How do you interpret the results across and within models?

Hi all, the log file will contain a line like the one below showing the best classifier, and from it you can find the filename for the learner (provided you made a log of the run), but how do you read the model file? The manual doesn't seem to cover this either.

Examining results across all models (e.g., by a plot) would also be helpful.

nohup python atm/worker.py 2>&1 | tee Output.txt & # log the run

...
Saving model in: models/1261aa3655008f0b9afec119e25d5aab-b585ff5423b4c095b6562b81f2dc2f63-uniform__uniform.model
Saving metrics in: models/1261aa3655008f0b9afec119e25d5aab-b585ff5423b4c095b6562b81f2dc2f63-uniform__uniform.model
Saved classifier 21.
...
Best so far (learner 21): 0.716 +- 0.035

We need more docs!

The documentation hasn't been updated for a few months, and does not reflect the state of the project. I'm removing the docs/ folder from the master repo for now, and pushing it into the branch 'bcyphers/docs' (#80) to avoid confusing newcomers to the project. The folder will be added back to master once it is closer to being ready.

ATM should be compatible with Python 2 and 3

Thanks to @cclauss in #14 for bringing this up. We definitely want to future-proof ATM and make it as widely available as possible.

This is mostly a matter of sitting down and doing it. Most of it should be easy with http://python-future.org/automatic_conversion.html. I think the biggest decision to make is whether to use unicode_literals or not.

I am for it. Since the project is new and volatile, it shouldn't matter too much whether we have to change the existing API, and I don't think there will be any major changes anyway. unicode_literals will result in cleaner code, and will make it easier to reason about strings in the future.

I've started doing a test-run of futurize in BTB, since it's a much smaller project. Once that's done, I'll start going through file-by-file in ATM and doing the same. Feel free to jump in and contribute!

Change btb_test to evaluate

Change the name of btb_test.py to evaluate_btb.py. Here we are evaluating the performance of different tuners, as opposed to testing their functionality.

naming of the hyperparameters for the methods

To match our explanation of the hyperparameters and their types in the ATM paper, we could:

  • parameters → hyperparameters
  • root_parameters → root_hyperparameters

There is a bit more categorization required in terms of hyperparameters. I will follow up on this thread with more.

Separating repeated processing from classifier models

Between different runs of ATM, the outputs of all the steps of the pipeline are "static," except for the input and output of the classifier chosen by BTB. What I mean is, for example, suppose PCA is in the pipeline; then every time ATM/BTB chooses a new model to run, it will recompute the PCA for the same dataset. Unless I'm misunderstanding the flow of data, this seems inefficient. Although the current pipeline is pretty simple (scaling/PCA), there could be more computationally intensive elements to the pipeline that people may want to add.

We can separate the pipeline into two pipelines: a "static" one whose outputs are stored to disk so they can be recalled between runs, and a "dynamic" one, which is essentially the classifier plus any blocks that change based on the ATM/BTB model being run.

If you think this is a good idea, how do we want to go about architecting this from a software perspective? One approach is to compute the static pipeline before the test_classifier method is run and save that to the data directory where the train/test dataset is being saved.
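One possible realisation (a sketch, not ATM's current code) is scikit-learn's built-in transformer caching, available in scikit-learn >= 0.19: the scaler and PCA steps are fitted once and reused from a cache directory, so only the final classifier changes between BTB proposals.

from tempfile import mkdtemp

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

cache_dir = mkdtemp()

def make_pipeline(classifier):
    # the 'static' steps (scaling, PCA) are cached on disk; only 'clf' varies
    return Pipeline(
        [('scale', StandardScaler()), ('pca', PCA(n_components=2)), ('clf', classifier)],
        memory=cache_dir,
    )

pipeline = make_pipeline(KNeighborsClassifier(n_neighbors=5))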

Encode foreign-key database relationships with the SQLAlchemy ORM

Right now, foreign-key relationships in the ModelHub database are not reflected with SQLAlchemy's ORM. Encoding them with the ORM's relationship() construct will make it easier to reference attributes of mapped objects. For example, the following code:

def foo(classifier_id):
    classifier = db.get_classifier(classifier_id)
    datarun = db.get_datarun(classifier.datarun_id)
    dataset = db.get_dataset(datarun.dataset_id)
    ...

could be simplified to this:

def foo(classifier_id):
    dataset = db.get_classifier(classifier_id).dataset
    ...

Overall, it will make things cleaner and easier to maintain.
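A sketch of what the mapping could look like (the class and column names are simplified stand-ins for the real ModelHub schema):

from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

class Dataset(Base):
    __tablename__ = 'datasets'
    id = Column(Integer, primary_key=True)

class Datarun(Base):
    __tablename__ = 'dataruns'
    id = Column(Integer, primary_key=True)
    dataset_id = Column(Integer, ForeignKey('datasets.id'))
    dataset = relationship('Dataset')

class Classifier(Base):
    __tablename__ = 'classifiers'
    id = Column(Integer, primary_key=True)
    datarun_id = Column(Integer, ForeignKey('dataruns.id'))
    datarun = relationship('Datarun')

    @property
    def dataset(self):
        # lets db.get_classifier(classifier_id).dataset work as in the example above
        return self.datarun.dataset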

Add minimal unit tests

Right now there are just a couple of actual tests in the test/ folder, and they leave huge portions of the code untouched.

Eventually, we will need a suite of unit tests and comprehensive integration tests that reasonably convince us that a new change won't break anything. We'll also want to integrate with CircleCI and github so that we can evaluate pull requests from the web (but that will come later).

I'll try to update this issue with our progress as time goes on, but a good start would be unit tests for the following (a sketch of one such test is shown after the list):

  • each hyperpartition for each classification method
  • each database.py create/query/update function
  • hyperpartition enumeration
  • hyperpartition selection/parameter tuning
  • metric computation in metrics.py
  • data loading and encoding with a variety of quirky data types
  • serialization/deserialization of models/metrics objects
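As a sketch of what one of these might look like, here is a serialization round-trip test runnable with pytest; it uses a plain scikit-learn classifier as a stand-in for an ATM Model object:

import pickle

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

def test_model_pickle_roundtrip(tmp_path):
    X, y = load_iris(return_X_y=True)
    model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

    path = tmp_path / 'model.pkl'
    with open(str(path), 'wb') as f:
        pickle.dump(model, f)
    with open(str(path), 'rb') as f:
        restored = pickle.load(f)

    assert (restored.predict(X) == model.predict(X)).all()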

Make paths relative to project root

Currently, lots of code for loading config or data depends on the initiating script running from the project root. We can fix this by defining everything relative to the location of a Python file instead of relative to ./. This will make everything more predictable and less brittle.
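A minimal sketch of the pattern (names like PROJECT_ROOT and project_path are illustrative, not existing ATM constants):

import os

# resolve paths against the package location instead of the current working directory
PROJECT_ROOT = os.path.abspath(os.path.dirname(__file__))

def project_path(*parts):
    """Build an absolute path relative to the project root."""
    return os.path.join(PROJECT_ROOT, *parts)

# e.g. find the method definitions no matter where the script was started from
METHODS_DIR = project_path('methods')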

Remove Methods table from database

At this point, I think the "methods" table in the database is vestigial, and no longer serves a purpose. We should get rid of it to reduce clutter.

Gaussian Process classifier "nu" parameter should be a float

The nu hyperparameter (defined in methods/gaussian_process.json) is currently a string categorical variable; this causes the following error whenever the Matern kernel is selected:

Traceback (most recent call last):
  File "/home/bcyphers/work/fl/atm/atm/worker.py", line 401, in run_classifier
    model, performance = self.test_classifier(hyperpartition.method, params)
  File "/home/bcyphers/work/fl/atm/atm/worker.py", line 339, in test_classifier
    test_path=test_path)
  File "/home/bcyphers/work/fl/atm/atm/model.py", line 195, in train_test
    cv_scores = self.cross_validate(X_train, y_train)
  File "/home/bcyphers/work/fl/atm/atm/model.py", line 132, in cross_validate
    n_folds=self.N_FOLDS)
  File "/home/bcyphers/work/fl/atm/atm/metrics.py", line 194, in cross_validate_pipeline
    pipeline.fit(X[train_index], y[train_index])
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/pipeline.py", line 270, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 610, in fit
    self.base_estimator_.fit(X, y)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 208, in fit
    self.kernel_.bounds)]
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 426, in _constrained_optimization
    fmin_l_bfgs_b(obj_func, initial_theta, bounds=bounds)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 193, in fmin_l_bfgs_b
    **opts)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 328, in _minimize_lbfgsb
    f, g = func_and_grad(x)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 278, in func_and_grad
    f = fun(x, *args)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
    return function(*(wrapper_args + args))
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 63, in __call__
    fg = self.fun(x, *args)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 200, in obj_func
Chose parameters for method knn:
    theta, eval_gradient=True)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 337, in log_marginal_likelihood
    K, K_gradient = kernel(self.X_train_, eval_gradient=True)
  File "/home/bcyphers/work/fl/atm/venv/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 1337, in __call__
    tmp = (math.sqrt(2 * self.nu) * K)
TypeError: a float is required

nu is passed to the constructor for a Matern kernel, which is then passed to the GaussianProcessClassifier constructor. According to the docs, nu should be a float that defaults to 1.5. I'm not sure whether the current configuration was ever correct, but it's not correct as of sklearn 0.18.

Someone should figure out what the proper range of values for nu is, and update the json to reflect that.
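For reference, a quick sanity check (assuming scikit-learn 0.18+) showing the kernel working once nu is a plain float, as the docs describe:

from sklearn.datasets import load_iris
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import Matern

X, y = load_iris(return_X_y=True)
clf = GaussianProcessClassifier(kernel=Matern(nu=1.5))  # nu must be a float, not a string
clf.fit(X, y)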
