
Hashtag Popularity Prediction

Predicting the popularity of a given hashtag using standard regression techniques such as Linear Regression, Random Forest and Non-linear SVM.

Done as part of a remote summer project at Northwestern University. The certificate for the project can be found here.

Dependencies

  1. numpy
  2. tqdm
  3. tweepy
  4. pandas
  5. scikit-learn

Data Scraping

NOTE: You need a Twitter Developer Account in order to scrape data. If you don't have access, please apply here.

The data for #DarkNetflix was scraped from 27/06/2020, up to a maximum of 20000 entries. The scraped data is in JSON format. We specifically extract the following entities from it:

  1. Number of retweets
  2. Number of mentions
  3. Number of hashtags
  4. Number of urls
  5. Number of followers
  6. Number of favourites
  7. Time at which the post was created (which will be converted to a time stamp)
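The extraction of these seven entities can be sketched as below. This is a minimal illustration, not the project's actual code: it assumes each scraped entry follows the standard Twitter API v1.1 tweet layout (`retweet_count`, `entities`, `user.followers_count`, `favorite_count`, `created_at`), and the function name `extract_features` and the output keys are hypothetical. It also assumes "favourites" refers to the tweet's own favourite count rather than the author's.

```python
from datetime import datetime

# Timestamp layout of the v1.1 tweet object, e.g. "Sat Jun 27 10:15:30 +0000 2020".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def extract_features(tweet: dict) -> dict:
    """Pull the seven entities listed above out of one tweet dictionary."""
    entities = tweet.get("entities", {})
    user = tweet.get("user", {})
    created = datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT)
    return {
        "retweets": tweet.get("retweet_count", 0),
        "mentions": len(entities.get("user_mentions", [])),
        "hashtags": len(entities.get("hashtags", [])),
        "urls": len(entities.get("urls", [])),
        "followers": user.get("followers_count", 0),
        # Assumption: "favourites" means the tweet's own favorite_count.
        "favourites": tweet.get("favorite_count", 0),
        # Creation time converted to a Unix time stamp (seconds since the epoch).
        "timestamp": created.timestamp(),
    }


# A made-up tweet, used only to show the expected input shape.
sample_tweet = {
    "created_at": "Sat Jun 27 10:15:30 +0000 2020",
    "retweet_count": 12,
    "favorite_count": 30,
    "entities": {
        "user_mentions": [{"screen_name": "a"}, {"screen_name": "b"}],
        "hashtags": [{"text": "DarkNetflix"}],
        "urls": [],
    },
    "user": {"followers_count": 450},
}

features = extract_features(sample_tweet)
```

Missing fields fall back to zero here so that a partially populated entry does not abort the whole run; whether that is the right choice depends on how the scraped data is stored.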

Once the data is scraped, we batch entities 1 to 6 with a window of one minute. The final dataset therefore has one entry per minute.

The popularity of the hashtag is measured by the number of tweets that contain it in a given minute.
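The one-minute batching and the popularity target can be sketched with pandas as follows. This is an illustration under stated assumptions rather than the project's implementation: the column names and the function `batch_per_minute` are hypothetical, and the features are summed within each minute, which the description above does not specify (a mean would be an equally plausible choice).

```python
import pandas as pd

# Hypothetical column names for entities 1 to 6.
FEATURES = ["retweets", "mentions", "hashtags", "urls", "followers", "favourites"]


def batch_per_minute(tweets: pd.DataFrame) -> pd.DataFrame:
    """Aggregate per-tweet rows into one row per minute.

    `tweets` needs a `timestamp` column (Unix seconds) plus the FEATURES columns.
    Minutes without any tweet do not appear in the result.
    """
    # Round every time stamp down to the start of its minute.
    minute = pd.to_datetime(tweets["timestamp"], unit="s").dt.floor("min").rename("minute")
    grouped = tweets.groupby(minute)
    # Assumption: features are summed inside each one-minute window.
    batched = grouped[FEATURES].sum()
    # Popularity target: number of tweets posted in that minute.
    batched["popularity"] = grouped.size()
    return batched


# Toy data: three tweets in the first minute, one in each of the next two.
sample = pd.DataFrame({
    "timestamp": [0, 10, 59, 60, 125],
    "retweets": [1, 2, 3, 4, 5],
    "mentions": [0, 1, 0, 2, 1],
    "hashtags": [1, 1, 2, 1, 3],
    "urls": [0, 0, 1, 0, 1],
    "followers": [10, 20, 30, 40, 50],
    "favourites": [0, 5, 1, 2, 0],
})

batched = batch_per_minute(sample)
```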

Usage

    python main.py

Results

[Figure: errors]

Random Forest and Non-linear SVM have the lowest cross-validation error.
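A comparison of this kind can be sketched with scikit-learn as below. This is a sketch on synthetic data, not a reproduction of the reported results: the error metric (mean squared error), the number of folds, the feature scaling in front of the SVM and the function name `cross_validation_errors` are all assumptions, since the README does not state how the errors were computed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def cross_validation_errors(X, y, folds=5):
    """Return the mean cross-validated MSE for the three regressors."""
    models = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
        # Assumption: features are standardised before the RBF-kernel SVM.
        "Non-linear SVM": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    }
    errors = {}
    for name, model in models.items():
        # scikit-learn reports negated MSE so that higher is better; flip the sign back.
        scores = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_squared_error")
        errors[name] = -scores.mean()
    return errors


# Synthetic stand-in for the per-minute feature matrix (six features) and popularity.
rng = np.random.default_rng(0)
X = rng.random((200, 6))
y = 10 * X[:, 0] + np.sin(6 * X[:, 1]) + rng.normal(scale=0.1, size=200)

errors = cross_validation_errors(X, y)
for name, mse in errors.items():
    print(f"{name}: {mse:.3f}")
```

Because the rows are ordered in time, a time-aware splitter such as `TimeSeriesSplit` could be passed as `cv` instead of a plain fold count; the plain k-fold used here is only the simplest option.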

[Figure: plt1]

Looking at the predictions on the test set, we can see that the model predicts the popularity very close to the true value. However, training and testing on the same hashtag may bias the model. We therefore have to check whether the model also captures the trend for other hashtags.

[Figure: plt2]

We train the models on #PostponeNEETandJEE, which trended in India during mid-June 2020. As the first plot shows, the model again predicts values fairly close to the true ones. The interesting part is the second plot, where the model trained on #PostponeNEETandJEE is tested on #DarkNetflix. As expected, performance drops significantly compared to the model trained on #DarkNetflix itself. Even though the predicted and actual values can differ widely, the model still predicts the peaks and valleys with a small delay. It therefore remains consistent in predicting the expected trend of the hashtag.
