Genre Classification of Million Song Dataset
Optionally, create and activate a local Python virtual environment:

python3 -m venv music
source music/bin/activate

Make sure Docker is set up on your system, then follow these steps to install the project:

- Clone the project:
  git clone [email protected]:meetnisha/genre.git
- Run this command in the project's folder:
  git submodule update --init --recursive
- Run ./run_local.sh in the root folder of the project.
Alternatively, clone the repo and run:

docker-compose up -d --build

or:

docker-compose -f docker-compose.yaml up -d --build

To follow the API container logs:

docker logs -f core-api-container
Please check Report.docx for details about this application.
Demo: /documents/demo.gif

Screenshots:
- Home: home.png
- Prediction output: prediction.png
- Search functionality: search.png
The test output file is saved at:
/data/test_prediction.csv
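The exact columns of test_prediction.csv are not documented here; as a minimal sketch, assuming it holds a track identifier and a predicted genre, the file can be inspected with the standard csv module (the sample rows below are made up for illustration):

```python
import csv
import io

# Hypothetical contents mirroring an assumed layout of
# /data/test_prediction.csv: a track id and a predicted genre.
sample = """track_id,predicted_genre
TR0001,rock
TR0002,jazz
TR0003,rock
"""

reader = csv.DictReader(io.StringIO(sample))
rows = list(reader)

# Count how often each genre was predicted.
counts = {}
for row in rows:
    counts[row["predicted_genre"]] = counts.get(row["predicted_genre"], 0) + 1

print(counts)  # -> {'rock': 2, 'jazz': 1}
```

For the real file, replace the in-memory string with `open("/data/test_prediction.csv")` and adjust the column names to match.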
File: /app/analysis/EDA_ML.ipynb
One should not get 100% accuracy on the training dataset; reaching it means the model is overfitting.
XGBoost test accuracy: 65.71%.
- I tried to find better hyperparameters such as n_estimators and reg_lambda, but the search space was too large.
- I applied dimensionality-reduction techniques such as PCA, but accuracy got worse.
- This consumed a lot of time, so I decided to move on to deep learning.
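One standard way to tame a large hyperparameter space is to sample it randomly rather than search it exhaustively. The sketch below illustrates the idea with a made-up scoring function standing in for an actual cross-validated XGBoost run; the ranges and the score formula are illustrative assumptions, only the parameter names (n_estimators, reg_lambda) come from the text above:

```python
import random

random.seed(0)

# Candidate values for the two hyperparameters mentioned above.
space = {
    "n_estimators": [100, 200, 400, 800],
    "reg_lambda": [0.1, 1.0, 5.0, 10.0],
}

def score(params):
    # Stand-in for cross-validated accuracy of an XGBoost model;
    # a real run would train and evaluate the classifier here.
    return (0.65
            - 0.0001 * abs(params["n_estimators"] - 400)
            - 0.005 * abs(params["reg_lambda"] - 1.0))

best, best_score = None, float("-inf")
for _ in range(10):  # 10 random draws instead of all 16 combinations
    params = {k: random.choice(v) for k, v in space.items()}
    s = score(params)
    if s > best_score:
        best, best_score = params, s

print(best, round(best_score, 4))
```

With a real objective the loop body would fit the model; the sampling logic stays the same.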
Since I needed to reduce model complexity by removing features, I used recursive feature elimination, but it took almost 24 hours to run on my machine. The features selected in the ML analysis are therefore reused to build a deep learning model, where I also wanted to use the title and tags features.
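Recursive feature elimination simply drops the least useful feature, re-scores, and repeats. The framework-free sketch below uses absolute Pearson correlation with the target as a stand-in for model-based importance (scikit-learn's RFECV instead uses the fitted estimator's importances plus cross-validation, which is why it is so much more expensive); the tiny dataset is invented for illustration:

```python
import math

def pearson(xs, ys):
    # Plain Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rfe(features, target, keep):
    """Drop the weakest feature (lowest |corr| with target)
    one at a time until only `keep` features remain."""
    remaining = dict(features)
    while len(remaining) > keep:
        weakest = min(remaining,
                      key=lambda name: abs(pearson(remaining[name], target)))
        del remaining[weakest]
    return sorted(remaining)

# Tiny made-up dataset: f1 and f3 track the target, f2 is noise.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "f1": [1.1, 2.0, 2.9, 4.2, 5.1],   # strongly correlated
    "f2": [0.3, -0.1, 0.4, 0.0, 0.2],  # noise
    "f3": [5.0, 4.1, 3.0, 2.2, 0.9],   # strongly anti-correlated
}

print(rfe(features, target, keep=2))  # -> ['f1', 'f3']
```

A real implementation would refit the model after each elimination step, which is what makes the procedure slow on large feature sets.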
File: /app/analysis/EDA_DL.ipynb
- Cleaned the data and removed all null values.
- Dropped highly correlated features.
- Reused the feature selection from the ML analysis: features eliminated by recursive feature elimination (RFECV) with XGBClassifier were dropped for the DL model as well.
- Tried a few optimizers (SGD, RMSprop, Adam); the model performed best with Adam.
- The baseline model overfitted.
- Applied techniques such as regularization, dropout, and early stopping.
- Dropout performed best.
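Dropout, which gave the best results here, randomly zeroes a fraction of activations during training and rescales the survivors so their expected sum is unchanged; at inference time the layer is a no-op. A minimal framework-free sketch of inverted dropout (the numbers are illustrative, not from the actual model):

```python
import random

def dropout(activations, rate, training, rng):
    """Inverted dropout: during training, zero each unit with
    probability `rate` and scale survivors by 1/(1-rate); at
    inference time, return the activations unchanged."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(42)
acts = [0.5, 1.0, 1.5, 2.0]

train_out = dropout(acts, rate=0.5, training=True, rng=rng)
eval_out = dropout(acts, rate=0.5, training=False, rng=rng)

print(train_out)  # each unit either zeroed or doubled (1 / keep)
print(eval_out)   # identical to the input
```

Frameworks such as Keras implement exactly this behavior in their Dropout layer, toggled by the training flag.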
Results:
- XGBoost test accuracy: 65.71%
- Baseline DL model: 63.13%
- Best performing model (dropout): test accuracy 70.89%