
dsa2's Introduction

Status

multi: test_fast_linux

main: test_fast_linux

Install

 pip install pyarrow pandas pyro-ppl lightgbm scikit-learn scipy matplotlib

Basic usage

cd dsa2
python run.py data_profile --config_uri titanic_classifier.py::titanic_lightgbm   > zlog/log-titanic.txt 2>&1
python run.py preprocess   --config_uri titanic_classifier.py::titanic_lightgbm   > zlog/log-titanic.txt 2>&1
python run.py train        --config_uri titanic_classifier.py::titanic_lightgbm   > zlog/log-titanic.txt 2>&1
python run.py predict      --config_uri titanic_classifier.py::titanic_lightgbm   > zlog/log-titanic.txt 2>&1

Basic usage 2

python  titanic_classifier.py  data_profile
python  titanic_classifier.py  preprocess
python  titanic_classifier.py  train
python  titanic_classifier.py  check
python  titanic_classifier.py  predict
python  titanic_classifier.py  run_all

data/input : Input data format

data/input/titanic/raw/  : the raw files
data/input/titanic/raw2/ : the raw files  split manually


data/input/titanic/train/ :   features.zip, target.zip, cols_group.json  (names are FIXED)
         features.zip or features.parquet  :  the input features (zipped csv, or parquet)
         target.zip   or target.parquet    :  the label to predict (zipped csv, or parquet)


data/input/titanic/test/ :
         features.zip or features.parquet  :  the input features used for predictions

File names are FIXED; create one sub-folder per dataset (e.g. data/input/titanic/).
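
Because the file names are fixed, the train folder can be checked before a run is launched. The following is a minimal sketch, not part of dsa2: the helper name `check_train_folder` is hypothetical, and it only tests that the expected file names exist, not their content.

```python
from pathlib import Path

# Fixed names from the layout above; each tuple lists the accepted alternatives.
REQUIRED_TRAIN_FILES = [
    ("features.zip", "features.parquet"),
    ("target.zip", "target.parquet"),
    ("cols_group.json",),
]


def check_train_folder(folder):
    """Return the missing entries (hypothetical helper, not a dsa2 function)."""
    folder = Path(folder)
    missing = []
    for alternatives in REQUIRED_TRAIN_FILES:
        # One of the alternatives is enough, e.g. features.zip OR features.parquet.
        if not any((folder / name).is_file() for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing
```

Calling `check_train_folder("data/input/titanic/train/")` returns an empty list when all three entries are present.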

Model, train, inference :

Model, training and inference settings are all defined in a single model_dictionnary.

Column Group for model preprocessing / training/inference :

Titanic dataframe structure (example):
             Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
PassengerId                                                                                                                                           
1                   0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
2                   1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
3                   1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
4                   1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
5                   0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S


(1) Initial Manual Column Mapping  :   'cols_input_type' 
 From Raw data --> "colid","colnum","colcat","coldate","coltext","coly","colcross"
   
   |-"colid"    --> index or id of each row (e.g. ["PassengerId"])

   |-"coly"     --> target column or y column (e.g. ["Survived"])


   |-"colnum"   --> columns with float or integer numbers (e.g. ["Pclass", "Age", "SibSp", "Parch", "Fare"])
   |-"colcat"   --> columns with string labels (e.g. ["Sex", "Embarked"])

   |-"coldate"  --> columns with date format data
   |-"coltext"  --> columns with text data (e.g. ["Ticket", "Name"])


   |-"colcross" --> columns to be checked for feature crosses
                    (e.g. ["Name", "Sex", "Ticket", "Embarked", "Pclass", "Age", "SibSp", "Parch", "Fare"])
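
Put together, the Titanic examples above could be written as one Python dict. This is a sketch only: the exact shape dsa2 expects is an assumption, the empty `coldate` list is an assumption (no example is given for it), and `unknown_columns` is a hypothetical helper that flags names missing from the data.

```python
# Column mapping for the Titanic example, copied from the examples above.
cols_input_type = {
    "colid":    ["PassengerId"],
    "coly":     ["Survived"],
    "colnum":   ["Pclass", "Age", "SibSp", "Parch", "Fare"],
    "colcat":   ["Sex", "Embarked"],
    "coldate":  [],  # assumption: no date column in this dataset
    "coltext":  ["Ticket", "Name"],
    "colcross": ["Name", "Sex", "Ticket", "Embarked", "Pclass",
                 "Age", "SibSp", "Parch", "Fare"],
}


def unknown_columns(mapping, available):
    """Hypothetical check: columns named in the mapping but absent from the data."""
    named = {col for cols in mapping.values() for col in cols}
    return sorted(named - set(available))


# Column names of the Titanic dataframe shown earlier (index included).
titanic_columns = ["PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
                   "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"]

print(unknown_columns(cols_input_type, titanic_columns))  # prints []
```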


 
(2) Columns feature family for model training  :   "cols_model_group"
    "cols_model_group" --> column family used for Model Training 

  "colnum", "colnum_bin", "colnum_onehot", "colnum_binmap",  #### Colnum columns                        
  "colcat", "colcat_bin", "colcat_onehot", "colcat_bin_map",  #### colcat columns                        
  'colcross_single_onehot_select', "colcross_pair_onehot",  'colcross_pair',  #### colcross columns            
  'coldate',
  'coltext',            

Preprocessing pipeline dataframe ( in source/run_preprocess.py) :

Preprocessing works as follows in source/run_preprocess.py:
     Raw Columns   --->  Feature columns family

    "colnum"    --> "colnum_bin" --> "colnum_onehot" ---------------> 
        |--------------------------> "colnum_onehot" ---------------> 
        
    "colcat"    --> "colcat_bin" --> "colcat_onehot" ---------------> 
        |--------------------------> "colcat_onehot" ---------------> 
        
    "coltext"   -(bag of words)-> "dftext_tfidf" --> "dftext_svd" --> 
                                         |--------------------------> 
                                         
                                         
 The default pipeline is defined as:

 pipe_default= [
        {'uri': 'source/preprocessors.py::pd_coly',                 'pars': {}, 'cols_family': 'coly',       'cols_out': 'coly',           'type': 'coly'         },
        {'uri': 'source/preprocessors.py::pd_colnum_bin',           'pars': {}, 'cols_family': 'colnum',     'cols_out': 'colnum_bin',     'type': ''             },
        {'uri': 'source/preprocessors.py::pd_colnum_binto_onehot',  'pars': {}, 'cols_family': 'colnum_bin', 'cols_out': 'colnum_onehot',  'type': ''             },
        {'uri': 'source/preprocessors.py::pd_colcat_bin',           'pars': {}, 'cols_family': 'colcat',     'cols_out': 'colcat_bin',     'type': ''             },
        {'uri': 'source/preprocessors.py::pd_colcat_to_onehot',     'pars': {}, 'cols_family': 'colcat_bin', 'cols_out': 'colcat_onehot',  'type': ''             },
        {'uri': 'source/preprocessors.py::pd_colcross',             'pars': {}, 'cols_family': 'colcross',   'cols_out': 'colcross_pair_onehot',  'type': 'cross'}
               ]


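Each pipeline step is a dict with the following keys:

   'uri'         : 'python file address::the function for column processing'
   'pars'        : any parameters to pass to the function
   'cols_family' : column family name
   'cols_out'    : output column family (optional)
   'type'        : 'coly' or 'cross' (empty for the other steps)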
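
To make the step format concrete, here is a minimal sketch of how a 'uri' of the form `file.py::function` can be split and a pipeline summarized. It is not dsa2's loader: `split_uri` and `describe_pipeline` are hypothetical names, and falling back to `cols_family` when `cols_out` is absent is an assumption.

```python
def split_uri(uri):
    """Split 'path/to/file.py::function_name' into (file_path, function_name)."""
    file_path, sep, func_name = uri.partition("::")
    if not sep or not file_path or not func_name:
        raise ValueError(f"expected 'file.py::function', got {uri!r}")
    return file_path, func_name


def describe_pipeline(pipe):
    """Return one (input family, output family, function name) triple per step."""
    steps = []
    for step in pipe:
        _, func_name = split_uri(step["uri"])
        # Assumption: when 'cols_out' is omitted, the family name is reused.
        cols_out = step.get("cols_out", step["cols_family"])
        steps.append((step["cols_family"], cols_out, func_name))
    return steps


pipe_example = [
    {"uri": "source/preprocessors.py::pd_colnum_bin", "pars": {},
     "cols_family": "colnum", "cols_out": "colnum_bin", "type": ""},
    {"uri": "source/preprocessors.py::pd_colnum_binto_onehot", "pars": {},
     "cols_family": "colnum_bin", "cols_out": "colnum_onehot", "type": ""},
]

for row in describe_pipeline(pipe_example):
    print(row)
```

The output of one step is the input family of the next, which is how the chain "colnum" --> "colnum_bin" --> "colnum_onehot" is expressed.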


'::pd_coly'                => Input:  the target dataframe, returns filtered and labeled dataframe


'::pd_colnum_bin'          => Input:  a dataframe with selected numerical columns, creates categorical bins, returns dataframe with new columns (colnum_bin)
'::pd_colnum_binto_onehot' => Input:  a dataframe dfnum_bin, returns one hot matrix as dataframe colnum_onehot


'::pd_colcat_bin'          => Input:  a dataframe with categorical columns, returns dataframe colcat_bin with numerical values
'::pd_colcat_to_onehot'    => Input:  a dataframe with categorical columns, returns one hot matrix as dataframe colcat_onehot


'::pd_colcross'            => Input:  a dataframe of numerical and categorical one hot encoded columns with defined cross columns, returns dataframe colcross_pair_onehot
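
The two-step idea behind the numerical chain (bin first, then one-hot encode the bins) can be illustrated without any dependency. This is a pure-Python sketch and not the code in source/preprocessors.py, which works on dataframes; the function names, the bin edges and the handling of missing values are all assumptions made for the example.

```python
from bisect import bisect_right


def bin_values(values, edges):
    """Map each number to a bin index; None (missing value) maps to -1."""
    return [-1 if v is None else bisect_right(edges, v) for v in values]


def one_hot(bins, n_bins):
    """Turn bin indices into 0/1 rows; a missing value (-1) gives an all-zero row."""
    return [[1 if b == j else 0 for j in range(n_bins)] for b in bins]


ages = [22.0, 38.0, 26.0, 35.0, None]  # illustrative values, last one missing
edges = [25.0, 35.0]                   # arbitrary edges: <25, [25, 35), >=35

age_bin = bin_values(ages, edges)               # [0, 2, 1, 2, -1]
age_onehot = one_hot(age_bin, len(edges) + 1)   # one row of 3 flags per value

print(age_bin)
print(age_onehot)
```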

Command line usage advanced

cd dsa2
source activate py36 
python source/run_train.py  run_train   --n_sample 100  --model_name lightgbm  --path_config_model source/config_model.py  --path_output /data/output/a01_test/     --path_data /data/input/train/    


source activate py36 
python source/run_inference.py  run_predict  --n_sample 1000  --model_name lightgbm  --path_model /data/output/a01_test/   --path_output /data/output/a01_test_pred/     --path_data /data/input/train/

source/ : source code of the CLI to train/predict.

   run_feature_profile.py : CLI for Pandas profiling
   run_preprocess.py      : CLI for feature preprocessing
   run_train.py           : CLI to train any model on any data (model and data agnostic)
   run_inference.py       : CLI to predict with any model on any data (model and data agnostic)




source/models/ : Generic API to access models.

   One Python file per model.

   models/model_sklearn.py      :   generic module as class, which wraps any sklearn API type model.
   models/model_bayesian_pyro.py :  generic model as class, which wraps Bayesian regression in Pyro/Pytorch.

   Methods of the module/class
       .init
       .fit()
       .predict()


Contributors

akouaouchissam, arita37, deepsourcebot, elaynousse, mozin, soheil-star01, vladimir9390
