In this project, two NLP techniques, topic modeling and transformer-based sentiment analysis, are used to analyze the online reputation of five brands (Apple, Tesla, Amazon, Google and Microsoft) from content published on X (Twitter) between 01-06-2019 and 01-01-2020. Topic modeling uses BERTopic, a BERT-based model designed specifically for this task. Sentiment analysis uses a BERTweet model (based on RoBERTa) hosted on Hugging Face, which is suited to classifying the sentiment of English tweets.
The analysis methodology was as follows:
- Data selection
- Cleaning and pre-processing
- Descriptive analysis of N-grams (unigrams, bigrams, trigrams) using the TF-IDF algorithm
- Topic modeling
- Sentiment analysis
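As an illustration of the N-gram step above, here is a minimal, standard-library sketch of ranking n-grams by TF-IDF. It is not the repository's code, which may well rely on a library vectorizer instead. The function names, the whitespace tokenization, the smoothed IDF variant and the sample tweets are all assumptions made for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams_tfidf(docs, n=1, k=5):
    """Rank n-grams by their TF-IDF weight summed over all documents.

    Assumed weighting: tf is the raw count in a document and
    idf = ln((1 + N) / (1 + df)) + 1, one common smoothed variant.
    """
    tokenized = [ngrams(doc.lower().split(), n) for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each n-gram.
    df = Counter()
    for grams in tokenized:
        df.update(set(grams))

    # Sum tf * idf over every document in which the n-gram occurs.
    scores = Counter()
    for grams in tokenized:
        for gram, count in Counter(grams).items():
            idf = math.log((1 + n_docs) / (1 + df[gram])) + 1
            scores[gram] += count * idf

    return scores.most_common(k)


# Hypothetical, already-cleaned tweets (illustrative only, not from the dataset).
tweets = [
    "new iphone battery life is great",
    "iphone battery drains fast",
    "love the new iphone camera",
]
top_unigrams = top_ngrams_tfidf(tweets, n=1, k=3)
top_bigrams = top_ngrams_tfidf(tweets, n=2, k=3)
```

Passing `n=3` yields trigrams in the same way. On real tweets, the cleaning and pre-processing step (for example stop-word removal) would run first, since this sketch only lowercases and splits on whitespace.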
The repository contains the following files:
- Descriptive data analysis: this file loads the initial data, filters it by date range and company, and analyzes, among other things, the distribution of content by company and the evolution of the number of tweets over time.
- N-Grams Analysis Apple and Tesla: this file performs the N-Grams analysis of the sets of tweets about Apple and Tesla, applying the TF-IDF algorithm to obtain the most relevant unigrams, bigrams and trigrams. The most frequent terms are also visualized using word clouds.
- Amazon, Google and Microsoft N-Grams analysis: in this file the same N-Grams analysis procedure is repeated for the Amazon-Google-Microsoft set.
- Apple topic modeling: this file performs the topic modeling with the BERTopic model, obtaining the optimal number of the most relevant topics about Apple. It also includes several of the model's built-in visualizations, such as intertopic distance, hierarchical clustering, similarity matrix and the evolution of the topics over the time span.
- Tesla topic modeling: in this file the same topic modeling procedure is repeated for the Tesla set.
- Amazon-Google-Microsoft topic modeling: in this file the same topic modeling procedure is repeated for the Amazon-Google-Microsoft set.
- Sentiment analysis: this file contains the sentiment analysis of the 3 sets using the BERTweet model, which assigns each tweet a positive (POS), negative (NEG) or neutral (NEU) label together with the corresponding confidence score.
- Sentiment Analysis - Graphs: this file contains the code used to plot the overall distribution and temporal evolution of sentiment across sets, the evolution of the model's confidence score, and the distribution and temporal evolution of sentiment for a set of relevant topics.
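To make the sentiment outputs described above more concrete, the following sketch aggregates per-tweet labels and confidence scores into a monthly sentiment distribution and an average confidence. It is a hedged illustration rather than the repository's code: the row layout (`date`, `label`, `score`), the function names and the sample values are assumptions, chosen to mirror the label-plus-score dictionaries that Hugging Face text-classification pipelines return.

```python
from collections import Counter, defaultdict
from datetime import date

LABELS = ("POS", "NEG", "NEU")


def sentiment_share_by_month(rows):
    """Return {(year, month): {label: share}} for the three sentiment labels."""
    counts = defaultdict(Counter)
    for row in rows:
        key = (row["date"].year, row["date"].month)
        counts[key][row["label"]] += 1

    shares = {}
    for key, counter in sorted(counts.items()):
        total = sum(counter.values())
        # A label that never occurs in a month gets a share of 0.0.
        shares[key] = {label: counter[label] / total for label in LABELS}
    return shares


def mean_confidence(rows):
    """Average the model's confidence score over all classified tweets."""
    return sum(row["score"] for row in rows) / len(rows)


# Hypothetical classifier output (illustrative values, not real results).
rows = [
    {"date": date(2019, 6, 3), "label": "POS", "score": 0.98},
    {"date": date(2019, 6, 17), "label": "NEG", "score": 0.91},
    {"date": date(2019, 7, 2), "label": "NEU", "score": 0.75},
    {"date": date(2019, 7, 9), "label": "NEU", "score": 0.80},
]
monthly = sentiment_share_by_month(rows)
avg_score = mean_confidence(rows)
```

The resulting `monthly` mapping is already ordered by month, so it can be passed to a plotting library to draw the temporal evolution of each label. Grouping by topic instead of by month would only require changing the key.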
The initial data was extracted from the Kaggle dataset "Tweets about the Top Companies from 2015 to 2020".