
tabular_automl_nni's Introduction

How to use NNI to do Automatic Feature Engineering?

What is Tabular Data?

Tabular data is data arranged in rows and columns, or possibly in a more complex structure. Usually we treat columns as features and rows as samples. AutoML for tabular data includes automatic feature generation, feature selection, and hyperparameter tuning over a wide range of tabular data primitives, such as numbers, categories, multi-categories, and timestamps.

Quick Start

In this example, we show how to do automatic feature engineering with NNI.

We treat automatic feature engineering (auto-fe) as a two-step task: feature generation exploration and feature selection.

Here is a simple example.

On its first call, AutoFETuner generates a command asking the Trial for the feature importance of the original features; the Trial returns this feature_importance to the Tuner in the first iteration. AutoFETuner then estimates a feature importance ranking and decides which features to generate, according to the search space definition.

In the following iterations, AutoFETuner updates the estimated feature importance ranking.

If you are interested in contributing to the AutoFETuner algorithm, for example with Reinforcement Learning (RL) or a genetic algorithm (GA), you are welcome to open a proposal and a pull request. The interface update_candidate_probility() can be used to update the feature sampling probability, and epoch_importance maintains the feature importance of all iterations.

The Trial receives a configuration containing the selected features from the Tuner, then generates these features with fe_util, a general SDK for feature generation. After evaluating the performance gained by adding these features, the Trial reports the final metric to the Tuner.
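The Tuner-Trial loop described above can be sketched as a framework-free toy. Method names mirror NNI's Tuner interface, but the class, the importance bookkeeping, and the "pick the top feature" sampling are illustrative placeholders, not AutoFETuner's actual algorithm:

```python
# A minimal sketch of the Tuner<->Trial protocol, assuming a dict of
# per-feature importances as the Trial's report. Not the real AutoFETuner.
class SketchAutoFETuner:
    def __init__(self):
        self.epoch_importance = []       # feature importance of all iterations
        self.estimate_sample_prob = None

    def generate_parameters(self, parameter_id):
        if self.estimate_sample_prob is None:
            # first iteration: empty config, so the Trial trains on the
            # original features and reports their importance
            return {}
        # later iterations: choose candidate features by estimated importance
        best = max(self.estimate_sample_prob, key=self.estimate_sample_prob.get)
        return {"sample_feature": [best]}

    def receive_trial_result(self, parameter_id, parameters, value):
        # the Trial reports {"default": score, "feature_importance": imp}
        self.epoch_importance.append(value["feature_importance"])
        self.estimate_sample_prob = value["feature_importance"]


tuner = SketchAutoFETuner()
print(tuner.generate_parameters(0))   # first iteration: {}
tuner.receive_trial_result(0, {}, {
    "default": 0.87,
    "feature_importance": {"COUNT_col1": 0.6, "col2": 0.4},
})
print(tuner.generate_parameters(1))   # {'sample_feature': ['COUNT_col1']}
```

The real AutoFETuner replaces the `max(...)` line with probabilistic sampling driven by the estimated ranking.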

So a user who wants to write a tabular AutoML tool running on NNI should:

1) Have a Trial code to run

Trial's code could be any machine learning code. Here we use main.py as an example:

import nni
import pandas as pd

from fe_util import *   # feature generation SDK provided by this repo
from model import *     # lgb_model_train lives here


if __name__ == '__main__':
    file_name = 'train.tiny.csv'
    target_name = 'Label'
    id_index = 'Id'

    # read original data from csv file
    df = pd.read_csv(file_name)

    # get parameters from tuner
    RECEIVED_FEATURE_CANDIDATES = nni.get_next_parameter()

    if 'sample_feature' in RECEIVED_FEATURE_CANDIDATES.keys():
        sample_col = RECEIVED_FEATURE_CANDIDATES['sample_feature']
    else:
        # return 'feature_importance' to tuner in first iteration
        sample_col = []
    df = name2feature(df, sample_col)

    feature_imp, val_score = lgb_model_train(df, _epoch=1000, target_name=target_name, id_index=id_index)

    # send final result to Tuner
    nni.report_final_result({
        "default": val_score,
        "feature_importance": feature_imp
    })

2) Define a search space

The search space can be defined in a JSON file, in the following format:

{
    "1-order-op" : [
        "col1",
        "col2"
    ],
    "2-order-op" : [
        [
            "col1",
            "col2"
        ], [
            "col3",
            "col4"
        ]
    ]
}

We provide count encoding, target encoding, and embedding encoding for 1-order-op. We provide cross count encoding, aggregate statistics (min, max, var, mean, median, nunique), and histogram aggregate statistics for 2-order-op. All operations above are classic feature engineering methods; the details are here.
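To make a 2-order aggregate-statistics operation concrete, here is a small pandas sketch. The column names and the generated feature name are placeholders; fe_util's actual implementation may differ:

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b"],     # categorical key column
    "col2": [1.0, 3.0, 5.0],     # numeric column to aggregate
})

# aggregate statistics: mean of col2 within each group of col1,
# broadcast back to every row via transform
df["AGG_mean_col2_by_col1"] = df.groupby("col1")["col2"].transform("mean")
print(df["AGG_mean_col2_by_col1"].tolist())  # [2.0, 2.0, 5.0]
```

Swapping `"mean"` for `"min"`, `"max"`, `"var"`, `"median"`, or `"nunique"` gives the other statistics listed above.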

The Tuner receives this search space and generates the features by calling the generator in fe_util.

For example, to search for frequency encoding (value count) features on columns {col1, col2}, define:

{
    "COUNT" : [
        "col1",
        "col2"
    ]
}
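What the COUNT op computes can be sketched in a few lines of pandas (the column and feature names here are illustrative, not fe_util's exact output names):

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b", "a", "a", "c"]})

# frequency encoding: replace each value of col1 with how often
# that value occurs in the column
df["COUNT_col1"] = df["col1"].map(df["col1"].value_counts())
print(df["COUNT_col1"].tolist())  # [3, 1, 3, 3, 1]
```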

As another example, a cross frequency encoding (value count on crossed dimensions) on columns {col1, col2} × {col3, col4} can be defined as:

{
    "CROSSCOUNT" : [
        [
            "col1",
            "col2"
        ],
        [
            "col3",
            "col4"
        ]
    ]
}
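In the same spirit, the CROSSCOUNT op counts value pairs across two columns. A hedged pandas sketch (column and feature names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col3": ["x", "y", "x", "x"],
})

# cross count: frequency of each (col1, col3) value pair
pair = df["col1"].astype(str) + "_" + df["col3"].astype(str)
df["CROSSCOUNT_col1_col3"] = pair.map(pair.value_counts())
print(df["CROSSCOUNT_col1_col3"].tolist())  # [1, 1, 2, 2]
```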

3) Get the configuration from the Tuner

The user imports nni and calls nni.get_next_parameter() to receive the configuration.

...
RECEIVED_PARAMS = nni.get_next_parameter()
if 'sample_feature' in RECEIVED_PARAMS.keys():
    sample_col = RECEIVED_PARAMS['sample_feature']
else:
    sample_col = []
# raw_feature + sample_feature
df = name2feature(df, sample_col)
...

4) Send final metric and feature importances to tuner

Use nni.report_final_result to send the final result to the Tuner, as in the last lines of main.py above:

feature_imp, val_score = lgb_model_train(df, _epoch=1000, target_name=target_name, id_index=id_index)
nni.report_final_result({
    "default": val_score,
    "feature_importance": feature_imp
})

5) Extend the SDK of feature engineering methods

If you want to add a feature engineering operation, follow the instructions here.

6) Run the experiment

nnictl create --config config.yml
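For reference, an NNI (v1-style) config.yml that registers a custom tuner looks roughly like the following. The field values are assumptions for illustration; the exact contents of this repo's config.yml may differ:

```yaml
authorName: default
experimentName: tabular-autofe
trialConcurrency: 1
maxExecDuration: 10h
maxTrialNum: 2000
trainingServicePlatform: local
searchSpacePath: search_space.json
useAnnotation: false
tuner:
  codeDir: .
  classFileName: autofe_tuner.py
  className: AutoFETuner
trial:
  command: python3 main.py
  codeDir: .
  gpuNum: 0
```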

Test Example

We test on some binary-classification benchmarks that come from public resources.

The experiment setting for each benchmark is given in ./benchmark/benchmark_name/search_space.json.

The baselines and results are as follows:

| Dataset  | Baseline AUC | AutoML AUC | # of cat | # of num | Dataset link |
| -------- | ------------ | ---------- | -------- | -------- | ------------ |
| Criteo   | 0.7516       | 0.7760     | 13       | 26       | data link    |
| Titanic  | 0.8700       | 0.8867     | 9        | 1        | data link    |
| Heart    | 0.9178       | 0.9501     | 4        | 9        | data link    |
| Cancer   | 0.7089       | 0.7846     | 9        | 0        | data link    |
| Haberman | 0.6568       | 0.6948     | 2        | 1        | data link    |

tabular_automl_nni's People

Contributors: spongebbob, xuehui1991

tabular_automl_nni's Issues

Trial failed

I set up the environment with requirments.txt and ran the experiment with the command provided in readme.md, but the first trial failed and the dispatcher log shows:

File "/Users/admin/Desktop/tabular_automl_NNI/./autofe_tuner.py", line 65, in generate_parameters
    sample_p = np.array(self.estimate_sample_prob) / np.sum(self.estimate_sample_prob)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'NoneType'

Error due to feature name

If a feature name in the dataset includes "_" (like A_B), it will cause an error.

Solution: just rename your feature.

Nonetype Error when running train.tiny.csv

Thanks for your sharing.

I hit this problem when trying to run the demo code with train.tiny.csv. The program failed on the first trial. When I switched to other sample data like heart or haberman, the same problem happened. (I'm using Windows, lightgbm 2.3.1, nni 0.9.1.)

Could anyone tell me how to solve this problem? Thanks!

Here is the dispatcher.log:
[05/02/2020, 12:13:50 PM] INFO (nni.msg_dispatcher_base/MainThread) Start dispatcher
[05/02/2020, 12:14:05 PM] ERROR (nni.msg_dispatcher_base/Thread-1) unsupported operand type(s) for /: 'NoneType' and 'NoneType'
Traceback (most recent call last):
  File "D:\Anaconda3\lib\site-packages\nni\msg_dispatcher_base.py", line 102, in command_queue_worker
    self.process_command(command, data)
  File "D:\Anaconda3\lib\site-packages\nni\msg_dispatcher_base.py", line 160, in process_command
    command_handlers[command](data)
  File "D:\Anaconda3\lib\site-packages\nni\msg_dispatcher.py", line 106, in handle_request_trial_jobs
    params_list = self.tuner.generate_multiple_parameters(ids)
  File "D:\Anaconda3\lib\site-packages\nni\tuner.py", line 52, in generate_multiple_parameters
    res = self.generate_parameters(parameter_id, **kwargs)
  File "D:\git\tabular_automl_NNI\.\autofe_tuner.py", line 65, in generate_parameters
    sample_p = np.array(self.estimate_sample_prob) / np.sum(self.estimate_sample_prob)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'NoneType'
[05/02/2020, 12:14:09 PM] INFO (nni.msg_dispatcher_base/MainThread) Dispatcher exiting...
[05/02/2020, 12:14:11 PM] INFO (nni.msg_dispatcher_base/MainThread) Terminated by NNI manager
