7DaysOfCode_SpotifyML

  • ⚡ This GitHub repository contains code and resources for Alura's Challenge, a project that analyzes Spotify data and predicts song popularity using Python and Machine Learning.

  • 🔭 The challenge during 7DaysOfCode is to improve my skills in data manipulation, visualization, and analysis using Machine Learning. The main goal is to analyze Spotify data and apply Machine Learning techniques to predict the popularity of songs. To achieve this, I will go through different stages of a Machine Learning project, including data collection, exploratory analysis, and model validation.

  • Dataset

  • Data Dictionary

  • Final project

Summary

The topic I am exploring is Hit Song Science (HSS), a research field that focuses on predicting the success of songs before they are released. HSS is a sub-area of Music Information Retrieval (MIR), which involves extracting information from songs (ARAUJO, CRISTO AND GIUST, 2020). In their paper "Hit Song Science Is Not Yet a Science", Pachet and Roy describe the idea behind HSS: cultural items, such as songs, have specific technical features that make them preferred by a majority of people. These features can be extracted using algorithms to automate the prediction process for new songs.

The problem I am trying to solve is to predict the popularity of songs and artists. I want to develop data-driven models that can accurately forecast the success of songs and artists on music streaming platforms like Spotify. By leveraging data and insights, I aim to help musicians and record labels make informed decisions about which songs and artists to promote, invest in, and potentially collaborate with. This can help them optimize their marketing strategies, allocate resources effectively, and increase their chances of success in a highly competitive music industry.

I analyzed a Spotify dataset of over 100,000 songs and conducted a basic EDA to identify trends. The 75th percentile of popularity was only 50 out of 100, meaning that three quarters of the songs scored 50 or below and the dataset was skewed towards less popular songs. I also identified hidden duplicates and dropped them, so the final project worked with ~80,000 data points.

            count     mean       std        min  25%   50%   75%   max
popularity  113549.0  33.324433  22.283855  0.0  17.0  35.0  50.0  100.0
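
The write-up does not say how the hidden duplicates were detected. One common case in track tables is the same song listed under several track IDs, so the sketch below assumes that case and keeps the most popular row per (name, artists) pair. The field names `track_id`, `track_name`, `artists`, and `popularity` and the toy rows are assumptions for illustration only. With pandas, the same idea is usually written as `sort_values` followed by `drop_duplicates(subset=...)`.

```python
def drop_hidden_duplicates(rows):
    """Keep one row per (track_name, artists) pair: the most popular one."""
    best = {}
    for row in rows:
        # Normalize the key so case and stray spaces do not hide duplicates.
        key = (row["track_name"].strip().lower(), row["artists"].strip().lower())
        if key not in best or row["popularity"] > best[key]["popularity"]:
            best[key] = row
    return list(best.values())


# Toy rows (made up for illustration, not taken from the dataset).
rows = [
    {"track_id": "a1", "track_name": "Song A", "artists": "X", "popularity": 40},
    {"track_id": "a2", "track_name": "song a ", "artists": "X", "popularity": 55},
    {"track_id": "b1", "track_name": "Song B", "artists": "Y", "popularity": 10},
]
deduped = drop_hidden_duplicates(rows)
```

Here the first two rows share a normalized key, so only the more popular one (`a2`) survives alongside `b1`.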

The modeling process involved testing several baseline models, including Dummy Classifiers and Logistic Regression. To address the class imbalance in the dataset and further improve the model's performance, resampling techniques were applied, and SMOTE was found to be the most effective. The performance of different models, including Logistic Regression, KNN, Decision Tree, Random Forest, and XGBoost, was compared, and Random Forest was selected as the best performer. The model was further optimized through hyperparameter tuning using Random Search and Grid Search.
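
To make the resampling step more concrete, here is a minimal sketch of the core idea behind SMOTE: synthetic minority samples are created by interpolating between a minority point and one of its nearest minority neighbors. This is not the code used in the project (in practice a library implementation such as imbalanced-learn's `SMOTE` class is the usual choice); the function name `smote`, its parameters, and the toy points are assumptions.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Pick a random point on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Toy minority-class points (hypothetical two-feature rows).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=6)
```

Resampling of this kind should be applied to the training data only, so that the validation and test sets keep the original class distribution.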

To evaluate the models, 5-fold cross-validation was employed, and the final model was evaluated on a held-out test set to assess its ability to generalize. The precision score was ~0.11 and the F1-Score was ~0.13. These scores are low, but expected given the complexity of the factors that influence the popularity of a song, many of which are behavioral and emotional and could not be fully explored in this dataset. Similar discussions are found in other research papers on Hit Song Science, such as the previously cited "Hit Song Science Is Not Yet a Science".
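
As a rough illustration of 5-fold cross-validation on imbalanced labels, the sketch below builds folds that each keep the same share of the rare class. Stratification is an assumption here (the write-up only says 5 folds were used), shuffling is omitted for brevity, and the helper name `stratified_folds` and the toy labels are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, n_folds=5):
    """Assign each sample index to a fold, spreading every class evenly."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        # Deal the indices of one class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Toy labels: 10 "popular" songs (1) among 100 (hypothetical proportions).
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels)

split_sizes = []
for held_out in folds:
    # Every fold except the held-out one forms the training split.
    train = [i for fold in folds if fold is not held_out for i in fold]
    # A model would be fit on `train` and scored on `held_out` here.
    split_sizes.append((len(train), len(held_out)))
```

Each sample is held out exactly once, and every fold contains the same number of positive examples.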

Random Forest  Prediction
Accuracy       0.9406619121907733
Precision      0.11456859971711457
Recall         0.15576923076923077
F1-Score       0.13202933985330073
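
The gap between the high accuracy and the low precision is easier to see from confusion-matrix counts. The counts below are not reported in this write-up; they are an assumption, chosen because they reproduce the four metrics above, and are used only to show how the scores relate to each other.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Assumed counts (inferred from the reported metrics, not given in the source).
tp, fp, fn, tn = 81, 626, 439, 16802
metrics = classification_metrics(tp, fp, fn, tn)

# Accuracy of a classifier that always predicts the negative class.
baseline_accuracy = (fp + tn) / (tp + fp + fn + tn)
```

Under these assumed counts the positive class is only about 3% of the test set, so always predicting "not popular" would score roughly 0.97 accuracy. That is why precision, recall, and F1 are the more informative metrics for this problem.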

These steps are recommended for further improvement in the project:

  • Conducting a more detailed feature selection process: This involves considering categorical features such as genre and artists, which may have a significant influence on the popularity of a song. Performing frequency encoding on these columns can help capture their information, and further analysis such as feature importance or correlation analysis can aid in selecting the most relevant features for the model.

  • Reducing dimensionality of the data through PCA analysis: Principal Component Analysis (PCA) is a dimensionality reduction technique that can help capture the most important information from a large number of features. By reducing the dimensionality of the data, the model may benefit from a more focused and relevant set of features, which could improve its performance.

By incorporating these steps into the project, it may be possible to improve the precision and performance of the model, and gain deeper insights into the factors that influence the popularity of songs.
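
The frequency encoding mentioned in the first step could look roughly like the sketch below, which replaces each category with its relative frequency in the training data. The function names and the toy genre values are assumptions; the mapping is fitted on training data only so that information from the test set does not leak into the features.

```python
from collections import Counter


def fit_frequency_encoding(values):
    """Map each category to its relative frequency in the training data."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def apply_frequency_encoding(values, mapping, default=0.0):
    """Encode values; categories unseen during fitting get `default`."""
    return [mapping.get(value, default) for value in values]


# Toy genre column (made-up values).
train_genres = ["pop", "rock", "pop", "jazz"]
mapping = fit_frequency_encoding(train_genres)
encoded_test = apply_frequency_encoding(["pop", "metal"], mapping)
```

Categories that never appear in the training data, such as "metal" here, fall back to the default value of 0.0.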

You can find the detailed analysis and code here.

7 days of code through the days

Day        Main Goal                Status  Completed day
Day One    Data Collection and EDA  ✔️      12-04-2023
Day Two    Pre-processing           ✔️      13-04-2023
Day Three  Splitting the data       ✔️      14-04-2023
Day Four   Machine Learning Models  ✔️      15-04-2023
Day Five   Evaluation               ✔️      16-04-2023
Day Six    Resampling               ✔️      19-04-2023
Day Seven  Prediction               ✔️      19-04-2023
