# Data Science Project Template

## ML-related tasks
- data processing
- visualization
- feature engineering
- training
- ensembling
- feature selection
- hyperparameter optimization
- experiment tracking
- submission of predictions to Kaggle
## Project Structure
- data
  - features - location for parquet files containing engineered features
  - processed - location for parquet files containing raw data after initial processing
  - raw - location for parquet files containing raw data (train, test, sample submission)
- fi - location to store feature importances in CSV files
- fi_fig - location to store plots capturing feature importances
- hpo - location to save hyperparameter optimization artifacts
- logs - location for logs generated by Python modules
- notebooks - location for Jupyter notebooks
- oof - location for out-of-fold predictions
- src
  - common - package containing common utility functions
  - config - package containing configuration-related modules
  - cv - package containing cross-validation functions
  - fe - package containing feature engineering functions
  - fs - package containing feature selection functions
  - hpo - package containing hyperparameter optimization functions
  - modeling - package containing training/prediction functions
  - munging - package containing data processing/exploration functions
  - pre_process - package containing data pre-processing functions
  - scripts - location for feature engineering and training scripts
  - ts - package containing time series functions
  - viz - package containing data visualization functions
- submissions - location for predictions and submission scripts
- tracking - location for the CSV file that tracks experiments
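The directory layout above can be bootstrapped with a short script. This is only a sketch, not part of the repository; the directory names are taken from the structure listed above and are assumed to be relative to the project home:

```python
from pathlib import Path

# Directories from the project structure above (relative to the project home).
DIRS = [
    "data/features", "data/processed", "data/raw",
    "fi", "fi_fig", "hpo", "logs", "notebooks", "oof",
    "src/common", "src/config", "src/cv", "src/fe", "src/fs",
    "src/hpo", "src/modeling", "src/munging", "src/pre_process",
    "src/scripts", "src/ts", "src/viz",
    "submissions", "tracking",
]

def create_skeleton(home: Path) -> None:
    """Create the project directory skeleton under the given home directory."""
    for d in DIRS:
        (home / d).mkdir(parents=True, exist_ok=True)

if __name__ == "__main__":
    create_skeleton(Path("."))
```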
## Acknowledgment

- I have borrowed the initial project structure and framework code from arnabbiswas1's open-sourced code.
## Steps to execute
- Clone the source code from GitHub under the `<PROJECT_HOME>` directory:

  > git clone https://github.com/castillosebastian/mortality_analyses_covid.git
- Create the R and Python (/usr/local/bin/python3) environments:

  > renv::init()
  > renv::use_python()
- Download the dataset:

  > <HOME_DIR>/src/scripts/data_processing/process_raw_data.R
- Set the value of the variable `HOME_DIR`, along with the libraries, the logger, and other settings, in `<PROJECT_HOME>/main.R`.
- To train the baseline model with LGBM, go to `<PROJECT_HOME>/kaggle_pipeline_tps_aug_22` and execute:

  > python -m src.scripts.training.lgb_baseline

  This will create the submission file under `<PROJECT_HOME>/kaggle_pipeline_tps_aug_22/submissions`, out-of-fold predictions under `<PROJECT_HOME>/kaggle_pipeline_tps_aug_22/oof`, and CSVs capturing feature importances under `<PROJECT_HOME>/kaggle_pipeline_tps_aug_22/fi`. The result of the experiment will be tracked at `<PROJECT_HOME>/kaggle_pipeline_tps_aug_22/tracking/tracking.csv`.
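Tracking an experiment in `tracking/tracking.csv` amounts to appending one row per run. A minimal sketch of that pattern follows; the column names (`run_id`, `metric`, `cv_score`) are illustrative assumptions, not the pipeline's actual schema:

```python
import csv
from pathlib import Path

def track_experiment(path: Path, row: dict) -> None:
    """Append one experiment record to the tracking CSV, writing a header on first use."""
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if new_file:
            writer.writeheader()
        writer.writerow(row)

# Hypothetical usage:
# track_experiment(Path("tracking/tracking.csv"),
#                  {"run_id": "lgb_baseline_01", "metric": "binary_logloss", "cv_score": 0.1234})
```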
- To submit the submission file to Kaggle, go to `<PROJECT_HOME>/kaggle_pipeline_tps_aug_22/submissions` and run:

  > python submissions_1.py
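A submission script of this kind typically shells out to the Kaggle CLI (`kaggle competitions submit`). A hedged sketch of that approach is below; the competition slug and file name are placeholders, and actually submitting requires the `kaggle` CLI to be installed and authenticated:

```python
import subprocess

def build_submit_command(competition: str, csv_path: str, message: str) -> list:
    """Build the Kaggle CLI command for submitting a prediction file."""
    return [
        "kaggle", "competitions", "submit",
        "-c", competition,
        "-f", csv_path,
        "-m", message,
    ]

if __name__ == "__main__":
    # Placeholder competition slug and file name:
    cmd = build_submit_command("tabular-playground-series-aug-2022",
                               "submission_1.csv", "lgb baseline")
    # subprocess.run(cmd, check=True)  # uncomment to actually submit
    print(" ".join(cmd))
```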
-
Important Bib
- custom metric functions: 1,2,
- metric: binary_logloss
- hpyer parameters optimization grid: 1