Kaggle Pipeline for Kaggle TPS August 2022

This is an open-source, Python-based pipeline for Kaggle tabular data competitions. Although it is customized for Kaggle TPS August 2022, it can serve as a pipeline for any tabular data competition with limited code changes. The project includes APIs for most ML competition tasks:

	- data processing
	- visualization
	- feature engineering
	- training
	- ensembling
	- feature selection
	- hyperparameter optimization
	- experiment tracking
	- submission of predictions to Kaggle

Project Structure

- data
    - features      - location for parquet files containing engineered features
    - processed     - location for parquet files containing raw data after initial processing
    - raw           - location for parquet files containing raw data (train, test, sample submission)
- fi                - location to store feature importances in CSV files
- fi_fig            - location to store plots capturing feature importances
- hpo               - location to save hyperparameter optimization artifacts
- logs              - location for logs generated by Python modules
- notebooks         - location for Jupyter notebooks
- oof               - location for out-of-fold predictions
- src
    - common        - package containing common utility functions
    - config        - package containing configuration related modules
    - cv            - package containing cross validation related functions
    - fe            - package containing feature engineering related functions
    - fs            - package containing feature selection related functions
    - hpo           - package containing hyperparameter optimization related functions
    - modeling      - package containing training/prediction related functions
    - munging       - package containing data processing/exploration related functions
    - pre_process   - package containing data pre-processing related functions
    - scripts       - location for feature engineering and training scripts
    - ts            - package containing time series related functions
    - viz           - package containing data visualization related functions
- submissions       - location for predictions and submission scripts
- tracking          - location of the CSV file used to track experiments

Acknowledgment

Steps to execute:

  1. Clone the source code from GitHub under the <PROJECT_HOME> directory.

     > git clone https://github.com/arnabbiswas1/kaggle_pipeline_tps_aug_22.git
    

    This will create the following directory structure:

     > <PROJECT_HOME>/kaggle_pipeline_tps_aug_22
    
  2. Go to <PROJECT_HOME>/kaggle_pipeline_tps_aug_22 and create the conda environment:

     > conda env create --file environment.yml
    
  3. Activate the conda environment:

     > conda activate py_k
    
  4. Go to the raw data directory at <PROJECT_HOME>/kaggle_pipeline_tps_aug_22/data/raw. Download the dataset from Kaggle (the Kaggle API must be configured first):

     > kaggle competitions download -c tabular-playground-series-aug-2022
    
  5. Unzip the data:

     > unzip tabular-playground-series-aug-2022.zip
    
  6. In <PROJECT_HOME>/kaggle_pipeline_tps_aug_22/src/config/constants.py, set the variable HOME_DIR to the absolute path of <PROJECT_HOME>/kaggle_pipeline_tps_aug_22.

  7. To process raw data into parquet format, go to <PROJECT_HOME>/kaggle_pipeline_tps_aug_22. Execute the following:

     > python -m src.scripts.data_processing.process_raw_data
    

    This will create 3 parquet files under <PROJECT_HOME>/kaggle_pipeline_tps_aug_22/data/processed representing train, test and sample_submission CSVs

  8. To trigger feature engineering, go to <PROJECT_HOME>/kaggle_pipeline_tps_aug_22. Execute the following:

     > python -m src.scripts.data_processing.create_features
    

    This will create a parquet file containing all the engineered features under <PROJECT_HOME>/kaggle_pipeline_tps_aug_22/data/features

  9. To train the baseline model with LGBM, go to <PROJECT_HOME>/kaggle_pipeline_tps_aug_22. Execute the following:

     > python -m src.scripts.training.lgb_baseline
    

    This will create the submission file under <PROJECT_HOME>/kaggle_pipeline_tps_aug_22/submissions, the out-of-fold predictions under <PROJECT_HOME>/kaggle_pipeline_tps_aug_22/oof, and CSVs capturing feature importances under <PROJECT_HOME>/kaggle_pipeline_tps_aug_22/fi.

The result of the experiment is tracked at <PROJECT_HOME>/kaggle_pipeline_tps_aug_22/tracking/tracking.csv.
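
The format of `tracking.csv` is not described here. As a minimal sketch of CSV-based experiment tracking, assuming one row per run, the helper below appends a record and writes the header only when the file is new or empty. The function name `log_experiment` and the columns `run_name`, `model`, and `cv_score` are hypothetical and are not the project's actual schema.

```python
import csv
from pathlib import Path

# Hypothetical columns; the real tracking file may record different fields.
FIELDS = ["run_name", "model", "cv_score"]


def log_experiment(tracking_file, run_name, model, cv_score):
    """Append one experiment record to a CSV file, creating it if needed."""
    path = Path(tracking_file)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write the header only once, when the file does not exist or is empty.
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(
            {"run_name": run_name, "model": model, "cv_score": cv_score}
        )
```

Appending rather than rewriting keeps earlier runs intact, so the file grows into a history of experiments that can be compared later.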

  10. To submit the submission file to Kaggle, go to <PROJECT_HOME>/kaggle_pipeline_tps_aug_22/submissions and execute the following:

     > python submissions_1.py
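
Steps 7 to 9 can also be chained from one place. The sketch below is an assumption about how one might do that and is not part of the project: it runs the three module names given above with `python -m` from the project root and stops at the first failure. `PIPELINE_MODULES`, `build_commands`, and `run_pipeline` are hypothetical names.

```python
import subprocess
import sys

# Module names taken from steps 7 to 9; the driver itself is illustrative.
PIPELINE_MODULES = [
    "src.scripts.data_processing.process_raw_data",
    "src.scripts.data_processing.create_features",
    "src.scripts.training.lgb_baseline",
]


def build_commands(modules=PIPELINE_MODULES):
    """Return one `python -m <module>` command per pipeline step."""
    return [[sys.executable, "-m", module] for module in modules]


def run_pipeline(project_dir, dry_run=False):
    """Run each step in order from project_dir; return the commands used."""
    commands = build_commands()
    for command in commands:
        if dry_run:
            # Only show what would be executed.
            print(" ".join(command))
            continue
        # check=True raises CalledProcessError and stops on the first failure.
        subprocess.run(command, cwd=project_dir, check=True)
    return commands
```

Running `run_pipeline("<PROJECT_HOME>/kaggle_pipeline_tps_aug_22", dry_run=True)` prints the commands without executing them, which is a cheap way to check the order before a long training run.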
    

Note:

The following JupyterLab extension is needed to visualize Optuna plots with Plotly:

jupyter labextension install [email protected]

Contributors

arnabbiswas1, hardboyvino
