Twitter-Credibility-Test-with-Big-Data

Background: This project examines whether Twitter can be considered a credible source of information that reflects the emergence of important trends or topics in education.

Data Source: Approximately 100 million education-related tweets, collected through the Twitter API between April 2022 and November 2022.

Tools and Platform: The analysis was run on Google Cloud Platform with PySpark (packages: spark.sql and MLlib; the MLlib code will be added in version 2.0) and Python (packages: pandas, seaborn, numpy, scikit-learn, geopandas, etc.).

Structure

  • Data preparation and cleaning
  • EDA (Exploratory Data Analysis)
  • Influence Analysis of user accounts
  • Location and Time-series Analysis
  • Text Duplication Analysis with Jaccard Distance & LSH test on text similarity

Preview

Filtering & Cleanup: We intentionally limit the number of filtered tweets (about 7.4 million) by selecting topics such as racial equality, literacy, tech & digital, special needs, school curriculum, and higher education, to see whether the tweets reflect trending real-world online discussion*. Only tweets in English (lang='en') are considered.
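The filtering step described above can be sketched in plain Python. This is a minimal illustration, not the project's PySpark code: the keyword lists in `TOPIC_KEYWORDS` and the helper names `match_topics` and `filter_tweets` are hypothetical, and the sketch assumes each tweet is a dict with `lang` and `text` fields.

```python
import re

# Hypothetical keyword lists; the project's real topic keywords are not given here.
TOPIC_KEYWORDS = {
    "racial equality": ["racial equality", "racial equity"],
    "literacy": ["literacy"],
    "higher education": ["higher education", "college", "university"],
}


def match_topics(text):
    """Return the sorted topics whose keywords appear as whole words in the text."""
    lowered = text.lower()
    hits = []
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(re.search(r"\b" + re.escape(k) + r"\b", lowered) for k in keywords):
            hits.append(topic)
    return sorted(hits)


def filter_tweets(tweets):
    """Keep English tweets that match at least one topic, tagging each with its topics."""
    kept = []
    for tweet in tweets:
        if tweet.get("lang") != "en":
            continue  # only English tweets are considered
        topics = match_topics(tweet.get("text", ""))
        if topics:
            kept.append({**tweet, "topics": topics})
    return kept


# Toy input, not project data.
sample = [
    {"lang": "en", "text": "New literacy program launches in our district"},
    {"lang": "es", "text": "Nuevo programa de literacy"},
    {"lang": "en", "text": "Great weather today"},
]
filtered = filter_tweets(sample)
```

Only the first toy tweet survives: the second is not in English and the third matches no topic. At the scale of the full dataset, the same two conditions would more plausibly be expressed as DataFrame filters in PySpark.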

Location Analysis: The geographic distribution of tweets within our data collection period shows features that were very likely caused by a shocking news event: the school shooting at Robb Elementary School in Texas on 05/24/2022, which triggered a wave of online discontent about school gun control.

Time Analysis: One of our findings concerns time and spikes in tweet volume. Recall that we included certain heated topics when filtering the data and during EDA. There is an obvious spike in U.S. tweets in August 2022. The corresponding news is that the Biden Administration announced student debt relief of up to $10,000 per borrower. This came as a series of news announcements in August, which could explain part of the spike in tweets.
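A spike like the one described above can be located by counting tweets per month and comparing each month against a baseline. The sketch below is only an illustration under stated assumptions: the helper names `monthly_counts` and `find_spikes`, the median baseline, and the `ratio=2.0` threshold are all hypothetical choices, and the timestamps are toy values rather than project data.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def monthly_counts(timestamps):
    """Count tweets per calendar month, keyed as 'YYYY-MM'."""
    return Counter(datetime.fromisoformat(ts).strftime("%Y-%m") for ts in timestamps)


def find_spikes(counts, ratio=2.0):
    """Return months whose volume is at least `ratio` times the median month."""
    baseline = median(counts.values())
    return sorted(month for month, n in counts.items() if n >= ratio * baseline)


# Toy ISO 8601 timestamps, not project data.
toy = (
    ["2022-06-10T12:00:00"] * 3
    + ["2022-07-05T08:30:00"] * 4
    + ["2022-08-24T15:00:00"] * 12
    + ["2022-09-02T09:00:00"] * 3
)
counts = monthly_counts(toy)
spikes = find_spikes(counts)
```

With these toy values the median month has 3.5 tweets, so only August (12 tweets) clears the threshold of 7. A median baseline is used because a single large month would inflate a mean and partly hide itself.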

LSH Similarity: Given the limit on tweet length and our selection of tweets by length during data cleaning, this is a text similarity test rather than topic modeling (LDA/LSA). It turned out that all organizations had a high percentage of unique tweets, which indicates that they are posting original content. News & Media had the lowest percentage of unique content, which indicates that they might be sharing similar education information. Schools and NGOs also had lower percentages of unique content.
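To make the similarity test concrete, here is a small plain-Python sketch of Jaccard distance combined with MinHash-style LSH banding. It assumes a MinHash family, which is the usual pairing with Jaccard distance; the text above does not say which LSH family the project used. The function names, the word-level 3-shingles, and the 64 hashes split into 16 bands are illustrative assumptions, not the project's settings.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Return the set of word-level k-shingles of a text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def minhash_signature(items, num_hashes=64):
    """Build a signature holding the minimum hash value per seeded hash function."""
    if not items:
        raise ValueError("cannot build a MinHash signature for an empty set")
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(), "big"
            )
            for item in items
        ))
    return signature


def lsh_candidate_pairs(signatures, bands=16):
    """Return id pairs whose signatures are identical in at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy tweets, not project data: t2 duplicates t1, t3 is unrelated.
tweets = {
    "t1": "schools reopen next week after the storm",
    "t2": "schools reopen next week after the storm",
    "t3": "student debt relief announced this week by officials",
}
near = shingles("schools reopen next week after the holiday")
distance = jaccard_distance(shingles(tweets["t1"]), near)  # 4 shared of 6 shingles
signatures = {tid: minhash_signature(shingles(text)) for tid, text in tweets.items()}
candidates = lsh_candidate_pairs(signatures)
```

The banding step only proposes candidate pairs; the exact Jaccard distance is then computed on those candidates to decide whether two tweets count as duplicates. In PySpark, `MinHashLSH` in `pyspark.ml.feature` follows the same idea at scale.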

Contributors

hjianganthony
