- ⚡ This GitHub repository contains code and resources for Alura's Challenge, a project that analyzes Spotify data and predicts song popularity using Python and Machine Learning.
- 🔭 The challenge during 7DaysOfCode is to improve my skills in data manipulation, visualization, and analysis using Machine Learning. The main goal is to analyze Spotify data and apply Machine Learning techniques to predict the popularity of songs. To achieve this, I will go through different stages of a Machine Learning project, including data collection, exploratory analysis, and model validation.
The topic I am exploring is Hit Song Science (HSS), a research field that focuses on predicting the success of songs before they are released. HSS is a sub-area of Music Information Retrieval (MIR), which involves gathering information from songs (ARAUJO, CRISTO AND GIUST, 2020). In their paper “Hit Song Science Is Not Yet a Science”, Pachet and Roy describe the idea behind HSS: cultural items, such as songs, have specific technical features that make them preferred by a majority of people. These features can be extracted using algorithms to automate the prediction process for new songs.
The problem I am trying to solve is predicting the popularity of songs and artists. I want to develop data-driven models that can forecast the success of songs and artists on music streaming platforms like Spotify. With these insights, I aim to help musicians and record labels make informed decisions about which songs and artists to promote, invest in, and potentially collaborate with, so they can optimize their marketing strategies, allocate resources effectively, and increase their chances of success in a highly competitive music industry.

I analyzed a Spotify dataset of over 100,000 songs and conducted a basic EDA to identify trends. The 75th percentile of popularity was only 50 out of 100, meaning three quarters of the songs scored 50 or lower, so the dataset is skewed towards less popular songs. I also identified hidden duplicates and dropped them, which left ~80,000 data points for the final project.
Feature | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
popularity | 113549.0 | 33.324433 | 22.283855 | 0.0000 | 17.000000 | 35.000000 | 50.0000 | 100.00000 |
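The duplicate removal can be illustrated with a minimal sketch in plain Python. It assumes that a "hidden" duplicate is the same track name and artist appearing under different track IDs. That definition, the field names (`track_id`, `track_name`, `artists`), the helper `drop_hidden_duplicates`, and the sample rows are illustrative assumptions, not taken from the project's notebook.

```python
def drop_hidden_duplicates(rows, key_fields=("track_name", "artists")):
    """Keep the first row for each normalized combination of key_fields."""
    seen = set()
    unique_rows = []
    for row in rows:
        # Normalize case and surrounding whitespace so near-identical
        # spellings collapse to the same key.
        key = tuple(str(row[field]).strip().lower() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical sample rows: the first two describe the same song
# under different track IDs.
songs = [
    {"track_id": "a1", "track_name": "Song X", "artists": "Artist A", "popularity": 50},
    {"track_id": "b2", "track_name": "song x ", "artists": "Artist A", "popularity": 48},
    {"track_id": "c3", "track_name": "Song Y", "artists": "Artist B", "popularity": 17},
]

deduplicated = drop_hidden_duplicates(songs)
print(len(songs), "->", len(deduplicated))
```

With pandas, `df.drop_duplicates(subset=["track_name", "artists"], keep="first")` applies the same first-occurrence rule to exact matches.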
The modeling process started with several baseline models, including Dummy Classifiers and Logistic Regression. To address class imbalance in the dataset and improve model performance, several resampling techniques were applied, and SMOTE was found to be the most effective. The performance of Logistic Regression, KNN, Decision Tree, Random Forest, and XGBoost was then compared, and Random Forest was selected as the best performer. It was further optimized through hyperparameter tuning using Random Search and Grid Search.
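To make the resampling step more concrete: SMOTE creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows that core idea in plain Python. It is a simplified illustration, not the implementation used in this project (in practice a library such as imbalanced-learn's `SMOTE` would normally be used). The function name `smote_sketch`, its parameters, and the toy points are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_index = rng.randrange(len(minority))
        base = minority[base_index]
        # k nearest neighbours of the base point among the other minority samples
        others = [p for i, p in enumerate(minority) if i != base_index]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment from base to neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: the four corners of the unit square (made-up data).
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority_points, n_new=6)
print(len(new_points))
```

Resampling of this kind should be applied to the training split only, so that the validation and test data keep the original class distribution.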
To evaluate the models, 5-fold cross-validation was employed, and the final model was then evaluated on a held-out test set to assess its ability to generalize. The precision was ~0.11 and the F1-Score was ~0.13. These scores are low, but this was expected given the complexity of the factors that influence a song's popularity, many of which are behavioral and emotional and could not be fully explored in this dataset. Similar discussions are found in other research papers on Hit Song Science, such as the previously cited "Hit Song Science Is Not Yet a Science".
Metric | Random Forest Prediction |
---|---|
Accuracy | 0.9406619121907733 |
Precision | 0.11456859971711457 |
Recall | 0.15576923076923077 |
F1-Score | 0.13202933985330073 |
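High accuracy alongside low precision and recall is what one would expect when the positive class is rare, since predicting "not popular" is correct most of the time. The sketch below computes the four metrics from confusion counts on made-up labels (these are not the project's predictions) to show how the numbers can diverge. The helper name `classification_metrics` and the toy data are assumptions.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy imbalanced data: 5 positives and 95 negatives (made-up).
y_true = [1] * 5 + [0] * 95
# 2 true positives, 3 false negatives, 4 false positives, 91 true negatives
y_pred = [1, 1, 0, 0, 0] + [1] * 4 + [0] * 91

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities for binary labels.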
These steps are recommended for further improvement in the project:
- Conducting a more detailed feature selection process: This involves considering categorical features such as genre and artists, which may have a significant influence on the popularity of a song. Performing frequency encoding on these columns can help capture their information, and further analysis such as feature importance or correlation analysis can aid in selecting the most relevant features for the model.
- Reducing dimensionality of the data through PCA: Principal Component Analysis (PCA) is a dimensionality reduction technique that can help capture the most important information from a large number of features. By reducing the dimensionality of the data, the model may benefit from a more focused and relevant set of features, which could improve its performance.
By incorporating these steps into the project, it may be possible to improve the precision and performance of the model, and gain deeper insights into the factors that influence the popularity of songs.
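As an illustration of the frequency encoding suggested in the first step, here is a minimal sketch in plain Python. It replaces each category with its relative frequency in the training data. The function names, the `default` for unseen categories, and the genre values are hypothetical and chosen only for this example.

```python
from collections import Counter


def fit_frequency_map(train_values):
    """Map each category to its relative frequency in the training data."""
    counts = Counter(train_values)
    total = len(train_values)
    return {category: count / total for category, count in counts.items()}


def apply_frequency_map(values, frequency_map, default=0.0):
    """Encode values with a fitted map; unseen categories get `default`."""
    return [frequency_map.get(value, default) for value in values]


# Made-up genre columns for a training and a test split
train_genres = ["pop", "rock", "pop", "jazz", "pop", "rock"]
test_genres = ["rock", "pop", "metal"]

genre_frequencies = fit_frequency_map(train_genres)
train_encoded = apply_frequency_map(train_genres, genre_frequencies)
test_encoded = apply_frequency_map(test_genres, genre_frequencies)
print(test_encoded)
```

Fitting the map on the training split only, and reusing it for the test split, avoids leaking information from the test data into the features.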
You can find the detailed analysis and code here.
Day | Main Goal | Status | Completed day |
---|---|---|---|
Day One | Data Collection and EDA | ✔️ | 12-04-2023 |
Day Two | Pre-processing | ✔️ | 13-04-2023 |
Day Three | Splitting the data | ✔️ | 14-04-2023 |
Day Four | Machine Learning Models | ✔️ | 15-04-2023 |
Day Five | Evaluation | ✔️ | 16-04-2023 |
Day Six | Resampling | ✔️ | 19-04-2023 |
Day Seven | Prediction | ✔️ | 19-04-2023 |