# Data Science Project Template

## ML-related tasks
- data processing
- visualization
- feature engineering
- training
- ensembling
- feature selection
- hyperparameter optimization
- experiment tracking
- submission of predictions to Kaggle
## Project Structure
- data
  - features - location for parquet files containing engineered features
  - processed - location for parquet files containing raw data after initial processing
  - raw - location for parquet files containing raw data (train, test, sample submission)
- fi - location to store feature importances in CSV files
- fi_fig - location to store plots capturing feature importances
- hpo - location to save hyperparameter optimization artifacts
- logs - location for logs generated by Python modules
- notebooks - location for Jupyter notebooks
- oof - location for out-of-fold predictions
- src
  - common - package containing common utility functions
  - config - package containing configuration-related modules
  - cv - package containing cross-validation functions
  - fe - package containing feature engineering functions
  - fs - package containing feature selection functions
  - hpo - package containing hyperparameter optimization functions
  - modeling - package containing training/prediction functions
  - munging - package containing data processing/exploration functions
  - pre_process - package containing data pre-processing functions
  - scripts - location for feature engineering and training scripts
  - ts - package containing time series functions
  - viz - package containing data visualization functions
- submissions - location for predictions and submission scripts
- tracking - location for the CSV file that tracks experiments
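The directory layout above can be bootstrapped with a short script. This is only a sketch, not part of the repository; the directory names are taken from the structure listed above and are assumed to be relative to the project home:

```python
from pathlib import Path

# Directories from the project structure above (relative to the project home).
DIRS = [
    "data/features", "data/processed", "data/raw",
    "fi", "fi_fig", "hpo", "logs", "notebooks", "oof",
    "src/common", "src/config", "src/cv", "src/fe", "src/fs",
    "src/hpo", "src/modeling", "src/munging", "src/pre_process",
    "src/scripts", "src/ts", "src/viz",
    "submissions", "tracking",
]

def create_skeleton(home: Path) -> None:
    """Create the project directory skeleton under the given home directory."""
    for d in DIRS:
        (home / d).mkdir(parents=True, exist_ok=True)

if __name__ == "__main__":
    create_skeleton(Path("."))
```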
## Acknowledgment

- I have borrowed the initial project structure and framework code from arnabbiswas1's open-sourced code.
## Steps to execute
- Clone the source code from GitHub under the `<PROJECT_HOME>` directory:

  > git clone https://github.com/castillosebastian/mortality_analyses_covid.git
- Create the R and Python (/usr/local/bin/python3) environments:

  > renv::init()
  > renv::use_python()
- Download the dataset:

  > <HOME_DIR>/src/scripts/data_processing/process_raw_data.R
- Set the value of the variable `HOME_DIR`, along with the libraries, the logger, and other settings, in `<PROJECT_HOME>/main.R`.
- To train the baseline model with LGBM, go to `<PROJECT_HOME>/kaggle_pipeline_tps_aug_22` and execute:

  > python -m src.scripts.training.lgb_baseline

  This will create the submission file under `<PROJECT_HOME>/kaggle_pipeline_tps_aug_22/submissions`, out-of-fold predictions under `<PROJECT_HOME>/kaggle_pipeline_tps_aug_22/oof`, and CSVs capturing feature importances under `<PROJECT_HOME>/kaggle_pipeline_tps_aug_22/fi`. The result of the experiment will be tracked at `<PROJECT_HOME>/kaggle_pipeline_tps_aug_22/tracking/tracking.csv`.
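Tracking an experiment in `tracking/tracking.csv` amounts to appending one row per run. A minimal sketch of that pattern follows; the column names (`run_id`, `metric`, `cv_score`) are illustrative assumptions, not the pipeline's actual schema:

```python
import csv
from pathlib import Path

def track_experiment(path: Path, row: dict) -> None:
    """Append one experiment record to the tracking CSV, writing a header on first use."""
    new_file = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if new_file:
            writer.writeheader()
        writer.writerow(row)

# Hypothetical usage:
# track_experiment(Path("tracking/tracking.csv"),
#                  {"run_id": "lgb_baseline_01", "metric": "binary_logloss", "cv_score": 0.1234})
```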
- To submit the submission file to Kaggle, go to `<PROJECT_HOME>/kaggle_pipeline_tps_aug_22/submissions` and run:

  > python submissions_1.py
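A submission script of this kind typically shells out to the Kaggle CLI (`kaggle competitions submit`). A hedged sketch of that approach is below; the competition slug and file name are placeholders, and actually submitting requires the `kaggle` CLI to be installed and authenticated:

```python
import subprocess

def build_submit_command(competition: str, csv_path: str, message: str) -> list:
    """Build the Kaggle CLI command for submitting a prediction file."""
    return [
        "kaggle", "competitions", "submit",
        "-c", competition,
        "-f", csv_path,
        "-m", message,
    ]

if __name__ == "__main__":
    # Placeholder competition slug and file name:
    cmd = build_submit_command("tabular-playground-series-aug-2022",
                               "submission_1.csv", "lgb baseline")
    # subprocess.run(cmd, check=True)  # uncomment to actually submit
    print(" ".join(cmd))
```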
-
Important Bib
- custom metric functions: 1,2,
- metric: binary_logloss
- hpyer parameters optimization grid: 1