Prediction of Song Skips on Spotify based on Sequential User and Acoustic Data

Abstract

The main goals of this project are to identify which input variables, such as acoustic features, most strongly affect song-skip prediction, and to build a predictive application with machine-learning techniques that helps the client understand user behaviour.

Background of Understanding the Problem

The music app Spotify has provided two datasets. One contains user-behaviour data, and the other contains acoustic features for each individual track. We need to find the user-behaviour and acoustic features that contribute most to a track being skipped, so that the client (Spotify) can see how the input features affect skipping and improve its service accordingly. For example, by accurately predicting song skips, Spotify can better understand user preferences and tailor personalized recommendations; by identifying songs that are more likely to be skipped, it can optimize playlist sequencing and provide a more enjoyable listening experience. Both help Spotify meet its business goals.

Data Details

  • log-mini.csv (df1) has 167880 rows and 21 columns
  • tf-mini.csv (df2) has 50704 rows and 30 columns
  • Data has no duplicates.
  • Data has missing values.
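The checks listed above (shape, duplicates, missing values) can be reproduced with a few pandas calls. The following is a minimal sketch on a small made-up frame standing in for log-mini.csv; the helper name `summarize` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def summarize(frame: pd.DataFrame) -> dict:
    """Return row/column counts, duplicate-row count and missing values per column."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_per_column": frame.isna().sum().to_dict(),
    }

# Toy stand-in; the real data would come from pd.read_csv("log-mini.csv")
toy = pd.DataFrame({
    "session_id": ["a", "a", "b"],
    "track_id_clean": ["t1", "t2", "t1"],
    "hist_user_behavior_reason_start": ["fwdbtn", None, "trackdone"],
})

summary = summarize(toy)
print(summary)
```

Running the same helper on df1 and df2 should give the row and column counts quoted in the list.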

Data preprocessing

• First, we rename the 'track_id_clean' column of df1 to 'track_id', then merge the two datasets (df1 and df2) on the common column 'track_id'

# Renaming the column track id
df1 = df1.rename(columns={'track_id_clean':'track_id'})

# Merging the two datasets
df = pd.merge(df1, df2, on='track_id')

• Then we create a new column named 'skipped' by multiplying 'skip_1', 'skip_2' and 'skip_3', and drop the original skip columns together with 'not_skipped'

df["skipped"] = df["skip_1"]*df["skip_2"]*df["skip_3"]
df.drop(["skip_1", "skip_2", "skip_3", "not_skipped"], axis=1, inplace=True)
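Multiplying boolean columns acts as a logical AND: the product is true only when all three flags are true. The sketch below illustrates this on made-up values and compares it with the more explicit `.all(axis=1)` form; the toy rows are assumptions and do not come from the dataset.

```python
import pandas as pd

# Made-up skip flags for four plays
toy = pd.DataFrame({
    "skip_1": [True, False, False, True],
    "skip_2": [True, True, False, True],
    "skip_3": [True, True, False, False],
})

# Product of the three boolean columns, as in the step above
by_product = toy["skip_1"] * toy["skip_2"] * toy["skip_3"]

# Equivalent, more explicit form: every flag must be set
by_all = toy[["skip_1", "skip_2", "skip_3"]].all(axis=1)

print(by_product.astype(bool).tolist())
print(by_all.tolist())
```

Both lines print the same list, with only the first play marked as skipped.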

Exploratory Data Analysis (EDA)

In EDA, we have examined a number of columns to learn more about the provided dataset.

EDA of Input variables with binary features

gif1

EDA of Input variables with multiple features

gif2

Observations from EDA

  1. Most songs appear in the first three sessions and the fewest in the last three. Songs in the first session are skipped least.
  2. Skips are highest for songs in sessions 4 through 10; after that, the number of skips falls along with the number of songs.
  3. The most common session length is 20, which means users generally listen to 20 tracks per session (session_length counts tracks, not minutes).
  4. Users generally do not seek forward or backward within a song.
  5. Users do not listen to much music late at night or early in the morning. Peak listening hours are 1 pm to 5 pm.
  6. The most songs were played on 15-07-2018.
  7. Most songs fall under the context user_collection, but songs in editorial_playlist and catalog are skipped less often than those in user_collection.
  8. The 'trackdone' category has the most songs that were not skipped, and the 'fwdbtn' category has the most skipped songs. This indicates that users tend to skip songs they reached with the forward button, but not songs played in the normal order.
  9. Most songs were released in 2018.
  10. Most songs are in key 1 and the fewest in key 3.
  11. The most common time signature is 4/4.

Outlier detection and outlier capping

I used sns.boxplot() to identify the outliers from the following columns

cols = ['acousticness','beat_strength', 'bounciness', 'danceability', 'dyn_range_mean',
       'energy', 'flatness', 'instrumentalness', 'liveness', 'loudness','mechanism',
       'organism', 'speechiness', 'tempo', 'valence', 'acoustic_vector_0', 'acoustic_vector_1',
       'acoustic_vector_2', 'acoustic_vector_3', 'acoustic_vector_4',
       'acoustic_vector_5', 'acoustic_vector_6', 'acoustic_vector_7']

Then we applied capping, i.e. values beyond the lower and upper IQR thresholds are set to those thresholds.

for i in cols:
    q1 = np.percentile(df[i],25)
    q3 = np.percentile(df[i],75)
    iqr = q3 - q1
    lw_lm = q1 - 1.5*iqr
    up_lm = q3 + 1.5*iqr
    df[i] = np.where(df[i]>up_lm, up_lm, np.where(df[i]<lw_lm, lw_lm, df[i]))
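The same capping rule can be written as a small reusable function with `Series.clip`. This is a sketch of an equivalent formulation, not the notebook's code; the function name `cap_iqr` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(series, [25, 75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy column with one obvious outlier
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_iqr(toy)
print(capped.tolist())
```

Here Q1 is 2 and Q3 is 4, so the limits are -1 and 7 and only the value 100 is pulled down to 7.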

Encoding

We applied one hot encoding on the following columns

df = pd.get_dummies(df,columns=['context_type','hist_user_behavior_reason_start','hist_user_behavior_reason_end'])

Then we applied label encoding on 'mode'

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['mode'] = encoder.fit_transform(df['mode'])
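To make the effect of the two encoders concrete, the sketch below applies them to a three-row toy frame. The category values shown are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "context_type": ["catalog", "user_collection", "catalog"],
    "mode": ["major", "minor", "major"],
})

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(toy, columns=["context_type"])

# Label encoding: categories are mapped to integers in sorted order
encoded["mode"] = LabelEncoder().fit_transform(encoded["mode"])

print(encoded)
```

One-hot encoding widens the frame by one column per category, while label encoding keeps a single integer column, which is enough for a two-valued feature such as 'mode'.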

Feature selection

We imported mutual_info_regression from sklearn.feature_selection to select the top 20 features.

Example Image

Example Image
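A mutual-information ranking can be sketched as follows. Note the swap: this example uses `mutual_info_classif`, the scikit-learn variant intended for a discrete target such as 'skipped', whereas the project used `mutual_info_regression`; replace the function to reproduce the original ranking. The synthetic data and the names `X_toy`, `y_toy` and `top_features` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic data: one feature determines the target, the other is noise
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "informative": rng.normal(size=500),
    "noise": rng.normal(size=500),
})
y_toy = (X_toy["informative"] > 0).astype(int)

# Score every feature against the target and sort from most to least informative
scores = pd.Series(
    mutual_info_classif(X_toy, y_toy, random_state=0),
    index=X_toy.columns,
).sort_values(ascending=False)

# With the real data, head(20) would give the top 20 features
top_features = scores.head(1).index.tolist()
print(scores)
```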

Mapping the column names for simplification

col_map = {
    
 'hist_user_behavior_reason_end_trackdone': 'end_trackdone',
 'hist_user_behavior_reason_start_trackdone': 'start_trackdone',
 'hist_user_behavior_reason_start_fwdbtn': 'start_fwdbtn',
 'hist_user_behavior_reason_end_fwdbtn': 'end_fwdbtn' ,
 'hist_user_behavior_reason_end_backbtn': 'end_backbtn',
 'no_pause_before_play': 'no_pause_before_play',
 'long_pause_before_play': 'long_pause_before_play',
 'valence': 'valence',
 'acoustic_vector_6': 'vector_6',
 'acoustic_vector_5': 'vector_5',
 'duration':'duration',
 'dyn_range_mean': 'dyn_range_mean',
 'acoustic_vector_1': 'vector_1',
 'organism': 'organism',
 'energy': 'energy',
 'acoustic_vector_2': 'vector_2',
 'us_popularity_estimate': 'us_popularity',
 'bounciness': 'bounciness',
 'short_pause_before_play': 'short_pause_before_play',
 'beat_strength': 'beat_strength'
    
}
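The mapping is presumably applied with `DataFrame.rename` after selecting the chosen columns; the README does not show that step, so the sketch below is an assumption. It uses a subset of the mapping and a made-up frame.

```python
import pandas as pd

# Subset of the mapping above, enough to show the pattern
col_map = {
    "hist_user_behavior_reason_end_trackdone": "end_trackdone",
    "acoustic_vector_6": "vector_6",
    "us_popularity_estimate": "us_popularity",
    "valence": "valence",
}

# Toy frame with one extra column that is not among the selected features
df_toy = pd.DataFrame({
    "hist_user_behavior_reason_end_trackdone": [1, 0],
    "acoustic_vector_6": [0.12, -0.40],
    "us_popularity_estimate": [99.1, 97.5],
    "valence": [0.3, 0.8],
    "tempo": [120.0, 98.0],
})

# Keep only the mapped columns, then shorten their names
X_toy = df_toy[list(col_map)].rename(columns=col_map)
print(X_toy.columns.tolist())
```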

Applying min-max scaling to numerical variables with many unique values

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

# Columns to rescale to the [0, 1] range
num_cols = ['valence', 'vector_6', 'vector_5', 'duration', 'dyn_range_mean', 'vector_1',
            'organism', 'energy', 'vector_2', 'us_popularity', 'bounciness',
            'beat_strength']

X[num_cols] = scaler.fit_transform(X[num_cols])
  • Here X holds the input features

Saving the scaler with joblib.dump

import joblib
joblib.dump(scaler, 'minmaxScaler.joblib')
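The scaler is saved so that new inputs can be rescaled with the minimum and maximum learned from the training data. A minimal round-trip sketch follows; the temporary path and the toy values are assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Fit on toy training values ranging from 0 to 10
train = np.array([[0.0], [5.0], [10.0]])
scaler = MinMaxScaler().fit(train)

# Save to a temporary location, then load it back as a prediction app would
path = os.path.join(tempfile.mkdtemp(), "minmaxScaler.joblib")
joblib.dump(scaler, path)
restored = joblib.load(path)

# New data is transformed only; the scaler is never refitted at prediction time
scaled_new = restored.transform(np.array([[2.5]]))
print(scaled_new)
```

Calling `transform` rather than `fit_transform` on new data keeps the scaling consistent with what the model saw during training.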

Modeling

Train test split

from sklearn.model_selection import train_test_split
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

Applying the DecisionTree classifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Shallow tree (max_depth=5) with Gini impurity as the split criterion
dtc = DecisionTreeClassifier(max_depth=5,
                             criterion="gini",
                             random_state=100,
                             min_samples_split=2,
                             min_samples_leaf=2)
dtc.fit(X_train, y_train)

# Accuracy on the held-out test set
y_pred = dtc.predict(X_test)
accuracy_score(y_test, y_pred)

Training accuracy is - 0.874262866333095

Testing accuracy is - 0.8765487252799619
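Both figures come from the same `accuracy_score` call, applied once to the training split and once to the test split. The sketch below shows the pattern on synthetic data from `make_classification`, so its numbers will not match the ones reported above; the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix and 'skipped' target
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=5)

# Same hyperparameters as the classifier above
model = DecisionTreeClassifier(max_depth=5, criterion="gini", random_state=100,
                               min_samples_split=2, min_samples_leaf=2)
model.fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))
print("Training accuracy:", train_acc)
print("Testing accuracy:", test_acc)
```

Training and testing accuracy that lie close together, as in the reported results, suggest the depth-limited tree is not overfitting.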

Saving the model by joblib.dump

joblib.dump(dtc, 'decisionTree.joblib')
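A prediction app can load the saved model and call it on one row of features. The following is a framework-independent sketch; the helper `predict_skip`, the temporary path and the one-feature toy model are hypothetical and only illustrate the load-then-predict pattern.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy model: low feature values are class 0 (not skipped), high values class 1 (skipped)
X_toy = np.array([[0.0], [0.1], [0.9], [1.0]])
y_toy = np.array([0, 0, 1, 1])
path = os.path.join(tempfile.mkdtemp(), "decisionTree.joblib")
joblib.dump(DecisionTreeClassifier(random_state=100).fit(X_toy, y_toy), path)

def predict_skip(model_path: str, features) -> int:
    """Load a saved classifier and return the predicted class for one row of features."""
    loaded = joblib.load(model_path)
    row = np.asarray(features, dtype=float).reshape(1, -1)
    return int(loaded.predict(row)[0])

prediction = predict_skip(path, [0.95])
print(prediction)
```

In a real app the model would be loaded once at startup rather than on every call, and the features would first pass through the saved scaler.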

Web Framework used for the prediction app - Streamlit

Cloud hosting service used for deployment - Render

The team

Here is a breakdown of the roles and responsibilities of each team member:

Team member name: Role and Tasks

Tuhin
  • Data Pre-Processing
  • Feature Engineering
  • Model Building
  • Model Deployment (Render)
  • Report Writing

Jannah
  • Data Cleaning
  • Pre-Processing
  • Feature Engineering
  • Model Building
  • Model Deployment (Streamlit)

Ashish (Team Lead)
  • Data Pre-Processing
  • Feature Engineering
  • Model Building
  • Model Deployment (Render)
  • Report Writing

Aman
  • Data Analysis
  • Pre-Processing
  • Model Building
  • Model Deployment (Flask)

Himanshu
  • Exploratory Data Analysis
  • Model Building
  • Model Deployment (Streamlit)

Param
  • Exploratory Data Analysis
  • Model Building
  • Model Deployment (Flask)
