COVID-19

NLP applied to recent Coronavirus / COVID19 tweets: How has sentiment progressed in recent weeks?

Aim:

  • Scrape London-based Tweets related to #Coronavirus
  • Apply VADER to run sentiment analysis
  • Chart sentiment over the period against virus case statistics: how have Londoners' attitudes changed as the virus progressed?
  • Who were the winners / losers of the lockdown period?

Approach:

  • Retrieve tweets, then post-process them with spaCy and VADER to parse entities and score sentiment.
  • Import case statistics and compare with sentiment to try to model the trend
  • Conduct hypothesis tests to look for winners / losers. Visualise the data through wordclouds, and use the wordclouds to drive new hypotheses.
  • Improve on this non-rigorous, piecemeal approach by modelling winners / losers holistically using Logistic Regression
  • Use the Logistic Regression as the basis for analysing each winner / loser individually, by moving to a Bayesian beta-binomial model and animating how the posterior beta distribution progresses over time
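
As a rough illustration of the scoring and trend steps above, here is a minimal sketch that buckets sentiment scores into labels and averages them per day. It assumes each tweet has already been given a VADER-style compound score in [-1, 1]. The ±0.05 cut-offs are the ones conventionally suggested for VADER and may not be what this project used, and the names `label_sentiment`, `daily_mean_sentiment` and the sample tweets are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Assumed thresholds: the cut-offs conventionally suggested for VADER's
# compound score; the original project may have used different ones.
POS_THRESHOLD = 0.05
NEG_THRESHOLD = -0.05


def label_sentiment(compound: float) -> str:
    """Map a compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= POS_THRESHOLD:
        return "positive"
    if compound <= NEG_THRESHOLD:
        return "negative"
    return "neutral"


def daily_mean_sentiment(scored_tweets):
    """Average the compound score per calendar day.

    `scored_tweets` is an iterable of (date, compound_score) pairs.
    Returns a dict mapping each date to its mean score, in date order.
    """
    by_day = defaultdict(list)
    for day, compound in scored_tweets:
        by_day[day].append(compound)
    return {day: mean(scores) for day, scores in sorted(by_day.items())}


# Hypothetical scored tweets, for illustration only.
tweets = [
    (date(2020, 3, 23), 0.6),
    (date(2020, 3, 23), -0.2),
    (date(2020, 3, 24), 0.0),
    (date(2020, 3, 24), -0.5),
]

labels = [label_sentiment(score) for _, score in tweets]
trend = daily_mean_sentiment(tweets)
```

The per-day means in `trend` are what would then be charted against the imported case statistics.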

Results:

Lockdown: The Winners and Losers

  • Amazon: started off in neutral territory (0.5), but jumped into positive territory almost immediately after the lockdown and stayed there.
  • Cummings: very interesting to see. Running the analysis for Cummings shows how his stock has fallen over the lockdown period. The decline had already started a week into the lockdown, which perhaps shows that the prior belief (informed by tweets from before the lockdown) was not in line with what we would later come to understand of public feeling towards him. In the last few days, with the story that he broke the lockdown, the posterior makes a dramatic shift to the left, crossing below 0.5 and rapidly becoming peaked at ~0.4.
  • SocialDistancing: very quickly became positive, and stayed positive, showing the focused mood of a nation.
  • Government: there was no prior belief about views on the Government. After immediately seeing positive sentiment in the lockdown period, we saw it normalise back towards 0.5, settling just above that level. Mixed views here: started strong but faded.
  • Keir Starmer: has been a big loser from the Coronavirus period, with many Labour supporters unhappy at his support of the government's approach, and of course those on the Conservative side bound to be against him.
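
To make the movement of a posterior like the ones described above more concrete, here is a minimal sketch of a sequential beta-binomial update. The prior, the weekly counts and the `Beta` helper class are all hypothetical and chosen only for illustration; they are not the project's actual data or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beta:
    """Beta(alpha, beta) belief over the share of positive tweets."""
    alpha: float
    beta: float

    def update(self, positives: int, negatives: int) -> "Beta":
        # Conjugate update: add the observed counts to the parameters.
        return Beta(self.alpha + positives, self.beta + negatives)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        total = self.alpha + self.beta
        return self.alpha * self.beta / (total ** 2 * (total + 1))


# Hypothetical prior and weekly (positive, negative) tweet counts.
prior = Beta(alpha=5, beta=5)  # centred on 0.5, i.e. neutral
weekly_counts = [(12, 10), (9, 14), (6, 20)]

posteriors = [prior]
for positives, negatives in weekly_counts:
    posteriors.append(posteriors[-1].update(positives, negatives))

for week, belief in enumerate(posteriors):
    print(f"week {week}: mean={belief.mean:.3f}, sd={belief.variance ** 0.5:.3f}")
```

Each week's posterior becomes the next week's prior, so a run of mostly negative tweets drags the mean below 0.5 while the shrinking standard deviation makes the distribution more peaked.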

Conclusion

  • In conclusion, we showed that it's possible to set up a feed from Twitter, apply some text processing, conduct sentiment scoring, and analyse the per-entity sentiment in a way that clarifies the winners / losers.
  • Hypothesis Testing: I used this extensively to bring some statistical clarity to my findings, which taught me a few areas to be cautious about:
    • p-values: these can easily be made statistically significant if the sample size is large enough to ensure a low enough standard error. Ask yourself whether your test is specific enough to ask a pointed question, and whether your filtering is set up correctly so that your sample isn't inflated.
    • normality assumption: this held in most cases, thanks to the Central Limit Theorem, but we can always fall back on non-parametric tests such as the Mann-Whitney U or Wilcoxon signed-rank tests.
    • mean reversion / neutralisation: for any slice of the data that's too large or not specific enough, sentiment would generally cluster around 0 and not be meaningful. View the distribution of your subsample; if it has a high peak at 0 sentiment, you likely need to filter more specifically for a meaningful hypothesis test.
  • Holistic Winners/Losers: To get an overall view, I used Logistic Regression to model positive vs negative labelled tweets, using the vectorized entity column as the predictor matrix. This gave some interesting results:
    • The overall pseudo-$R^2$ was low, suggesting I needed more features than the few I included to model accurately whether a tweet would be positive or negative.
    • However, this wasn't what I wanted. I was interested in how the features I did include contribute to positivity or negativity, which I could get from their coefficients. The nice thing about using the statsmodels package rather than sklearn was being able to see the statistical confidence of each coefficient.
    • Because we actually want to solve for $P(p \mid \text{data})$ rather than $P(\text{data} \mid p)$, and because we want to follow a constantly evolving view of $P(p \mid \text{data})$, we turned to a Bayesian approach:
  • Bayesian Winners/Losers: Here I set up an animated slider to show the prior distribution and how the posterior distribution then varied as new data came in; it can be configured for the entity of choice.
    • This showed clearly how and when sentiment towards Cummings and Starmer started to deteriorate, which goes some way to explaining the results above.
    • It also clearly showed the certainty of our posterior beliefs, by attaching a probability distribution to the sentiment at each stage.
    • The Beta-Binomial model was used, which simplified the parameters greatly. A good way to view this conjugacy is that the beta prior's two parameters act as running counts of positive and negative tweets, and the binomial likelihood simply adds the individual new data points to them. Mathematically, a $\text{Beta}(\alpha, \beta)$ prior updated with $k$ positives out of $n$ tweets gives a $\text{Beta}(\alpha + k, \beta + n - k)$ posterior.
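
The p-value caution in the hypothesis-testing notes above can be illustrated with a small sketch. It runs a two-sample z-test on the same tiny difference in mean sentiment at three sample sizes. The numbers (a 0.01 gap, a standard deviation of 0.4) and the helper `two_sample_z_pvalue` are hypothetical, and the z-test with a known, equal standard deviation is a simplifying assumption rather than the test used in this project.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_pvalue(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p-value for a difference in means between two groups
    of equal size and equal (known) standard deviation."""
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: a tiny 0.01 gap in mean sentiment, sd of 0.4.
for n in (100, 10_000, 1_000_000):
    p = two_sample_z_pvalue(mean_diff=0.01, sd=0.4, n_per_group=n)
    print(f"n={n:>9,}: p={p:.4g}")
```

The effect size never changes here; only the sample size does. That is why a significant p-value on a very large, loosely filtered slice of tweets says little about whether the difference matters.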

Future enhancements

  • This could be modularized so that it generalizes into a Twitter monitor with the capabilities below:
    • New tweets come in and are post-processed using spaCy and VADER.
    • We are able to conduct some hypothesis tests on sections of the data.
    • We can provide a holistic view through Logistic Regression.
    • We can view Bayesian Updating in an animation for a chosen entity.
  • I also want to learn a bit more about spaCy, and see if it's possible to train the Entity Recognizer better so that not every capitalized word is mistaken for an entity.

Contributors

  • noahberhe
