Twitter-Credibility-Test-with-Big-Data

Background: The objective of this project is to assess whether Twitter can be considered a credible source of information that reflects the emergence of important trends or topics in education.

Data Source: Approximately 100 million tweets about education were extracted from the Twitter API between April 2022 and November 2022.

Tools and Platform: The analysis runs on Google Cloud Platform with PySpark (packages: spark.sql and MLlib; the MLlib code will be added in version 2.0) and Python (packages: pandas, seaborn, numpy, scikit-learn, geopandas, etc.).

Structure

Data preparation and cleaning
EDA (Exploratory Data Analysis)
Influence Analysis for User Accounts
Location and Time-series Analysis
Text Duplication Analysis with Jaccard Distance & LSH test on text similarity

Preview

Filtering & Cleanup: We intentionally limit the number of tweets retained (about 7.4 million) by filtering on topics such as racial equality, literacy, tech & digital, special needs, school curriculum, and higher education, to see whether the tweets reflect trending real-world online discussion*. Only tweets in English (lang='en') are considered.
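As a rough illustration of the filtering step, the sketch below keeps only English tweets that mention at least one topic keyword. It is written in plain Python so it runs anywhere; the `TOPIC_KEYWORDS` lists, the `lang`/`text` field names, and the helper functions are assumptions made for this example, not the project's actual keyword lists or PySpark code.

```python
import re

# Hypothetical topic keywords; the project's real keyword lists are not shown here.
TOPIC_KEYWORDS = {
    "racial equality": ["racial equality", "racial equity"],
    "literacy": ["literacy"],
    "higher education": ["higher education", "college", "university"],
}


def match_topics(text):
    """Return the sorted list of topics whose keywords occur in the text."""
    lowered = text.lower()
    return sorted(
        topic
        for topic, keywords in TOPIC_KEYWORDS.items()
        if any(re.search(r"\b" + re.escape(kw) + r"\b", lowered) for kw in keywords)
    )


def filter_tweets(tweets):
    """Keep English tweets that match at least one topic, tagging each with its topics."""
    kept = []
    for tweet in tweets:
        if tweet.get("lang") != "en":
            continue  # only English tweets are considered
        topics = match_topics(tweet.get("text", ""))
        if topics:
            kept.append({**tweet, "topics": topics})
    return kept


# Made-up sample records for illustration only.
sample = [
    {"lang": "en", "text": "New literacy program launches in our district"},
    {"lang": "es", "text": "Programa de literacy en la escuela"},
    {"lang": "en", "text": "Great weather today"},
    {"lang": "en", "text": "University tuition and racial equity in admissions"},
]
filtered = filter_tweets(sample)
```

On the sample above, the Spanish tweet and the off-topic tweet are dropped, and each remaining tweet carries a `topics` list that later steps could group by.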

Location Analysis: The geographic distribution of tweets within our data-collection period shows features that were very likely caused by the school shooting on 05/24/2022 at Robb Elementary School in Texas, which triggered widespread online discontent over school gun control.

Time Analysis: One of our findings concerns timing and a spike in tweet volume. Recall that we included certain heated topics while filtering the data and during EDA. There is an obvious spike in U.S. tweets in August 2022. The corresponding news is the Biden Administration's announcement of student debt relief of up to $10,000 per borrower. It was one of a series of news announcements in August, which could explain part of the spike in tweets.
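One simple way to flag such a spike is to compare each period's tweet count against the median count across all periods. The sketch below assumes this median-ratio rule and a threshold of 2.0; both are illustrative choices, and the counts are toy numbers, not the project's measured volumes.

```python
from statistics import median


def find_spikes(counts, threshold=2.0):
    """Return the periods whose count exceeds `threshold` times the median count."""
    baseline = median(counts.values())
    return [period for period, n in counts.items() if n > threshold * baseline]


# Toy counts for illustration only; these are NOT the project's data.
period_counts = {
    "period_1": 100, "period_2": 130, "period_3": 110, "period_4": 105,
    "period_5": 320, "period_6": 120, "period_7": 115, "period_8": 108,
}
spikes = find_spikes(period_counts)
```

The median is used as the baseline because, unlike the mean, it is barely affected by the spike itself.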

LSH Similarity: Given the limit on tweet length and our length-based selection during data cleaning, this is a text similarity test rather than topic modeling (LDA/LSA). All organization types had a high percentage of unique tweets, which indicates that they mostly post original content. News & Media had the lowest percentage of unique content, which suggests that these accounts may be sharing similar education information. Schools and NGOs also had lower percentages of unique content.
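To make the similarity test concrete, here is a minimal pure-Python sketch of Jaccard distance on word shingles, together with a MinHash signature and banded LSH to find candidate duplicate pairs. The shingle size, number of hash functions, number of bands, and all function names are assumptions for illustration; the project's own PySpark implementation may differ.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=2):
    """Split text into a set of k-word shingles (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |A & B| / |A | B|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def minhash_signature(items, num_hashes=32):
    """For each seeded hash function, record the minimum hash value over the set."""
    return [
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        )
        for seed in range(num_hashes)
    ]


def lsh_candidate_pairs(signatures, bands=16):
    """Split signatures into bands; documents sharing any identical band are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((x, y) for i, x in enumerate(ordered) for y in ordered[i + 1:])
    return pairs


# Made-up example texts, not taken from the dataset.
tweets = {
    "a": "schools reopen with new literacy program",
    "b": "schools reopen with new literacy program",
    "c": "student debt relief announced for borrowers",
}
shingle_sets = {doc_id: shingles(text) for doc_id, text in tweets.items()}
signatures = {doc_id: minhash_signature(s) for doc_id, s in shingle_sets.items()}
pairs = lsh_candidate_pairs(signatures)
```

LSH only proposes candidate pairs; computing the exact Jaccard distance on each candidate then confirms or rejects it. In PySpark, `pyspark.ml.feature.MinHashLSH` offers a comparable approximate similarity join for large datasets.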
