UniUK Sentiment Dashboard

This repository contains a dashboard for exploring and visualizing data from the subreddit r/UniUK, covering February 2016 (when the subreddit was created) to December 2023. The data was collected through Pushshift, an open-source project, and includes a large number of posts and comments that offer insight into university life in the UK.

Navigating the Dashboard

The dashboard is built using Dash (Plotly) with the following components:

  • Background Page: Introduces the study motivation and research question, guides users on exploring the dashboard findings, and provides additional details such as the data source, preprocessing steps and references.
  • Topic Frequency Page: Allows users to view the frequency of selected topics over time, either as absolute counts or normalized percentages, to identify popular topics and trends over time.
  • Sentiment Analysis Page: Enables users to analyze sentiment trends for a specific topic over time, viewed either as absolute counts or as normalized percentages, to understand the emotional tone of discussions.
  • Topic Data Page: Provides a table view of the individual posts for a selected topic and year range, with sentiment indicated by cell color, allowing users to explore specific discussions.
  • Interpretation Page: Demonstrates how to use the dashboard to examine the research question through an example analysis of a specific topic, showcasing insights and conclusions.
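
The difference between the absolute and normalized views on the Topic Frequency and Sentiment Analysis pages can be illustrated with a small sketch. It uses only the Python standard library and is not the code in app.py; the function name `topic_frequencies` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def topic_frequencies(posts):
    """Return {period: {topic: (count, percent)}} for (period, topic) pairs."""
    counts = defaultdict(Counter)
    for period, topic in posts:
        counts[period][topic] += 1

    result = {}
    for period, topic_counts in counts.items():
        total = sum(topic_counts.values())
        # Absolute view: raw count. Normalized view: share of that period's posts.
        result[period] = {
            topic: (n, 100.0 * n / total) for topic, n in topic_counts.items()
        }
    return result


# Hypothetical sample: one (period, topic label) pair per post.
sample_posts = [
    ("2022", "accommodation"),
    ("2022", "accommodation"),
    ("2022", "exams"),
    ("2022", "exams"),
    ("2023", "accommodation"),
    ("2023", "exams"),
    ("2023", "exams"),
    ("2023", "exams"),
]

frequencies = topic_frequencies(sample_posts)
for period in sorted(frequencies):
    for topic, (count, percent) in sorted(frequencies[period].items()):
        print(f"{period} {topic}: {count} posts, {percent:.1f}% of the period")
```

Normalizing within each period makes topics comparable across years even when the overall posting volume changes.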

Data Source and Preprocessing

The data, spanning February 2016 (when r/UniUK was created) to December 2023, was collected by the open-source Pushshift project and obtained from academic torrents hosted online. To prepare the data for analysis and answer the research question, several preprocessing and modeling steps were carried out. First, the data was cleaned using custom stopwords and the NLTK library to remove irrelevant information and improve the quality of the dataset. Next, sentiment analysis was performed with VADER (vaderSentiment) to determine the polarity (positive, neutral, or negative) of each post. Finally, topic modeling was conducted with BERTopic to identify and categorize the main themes within the data.
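
As a rough illustration of the sentiment step, the sketch below maps a VADER compound score to one of the three polarity labels. The ±0.05 cut-offs are the thresholds conventionally suggested for VADER; whether this project used exactly these values is an assumption, and the function name and example scores are hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a polarity label.

    The default threshold follows the commonly suggested VADER convention;
    it is an assumption here, not a value confirmed by this repository.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# With the vaderSentiment package installed, a compound score can be obtained with:
#   from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
#   compound = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
# The scores below are hypothetical placeholders, not real VADER output.
example_scores = {"post_a": 0.62, "post_b": -0.41, "post_c": 0.0}

labels = {name: label_from_compound(score) for name, score in example_scores.items()}
print(labels)
```

The actual preprocessing and modeling code lives in the accompanying Kaggle notebook rather than in this repository.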

To keep the focus on visualization, the detailed data modeling steps are not covered in this repository. However, the modeling process is shared in the accompanying Kaggle notebook, providing a reproducible account of the data analysis.

Meta-Data for Processed Data

  • body: The text content within a post. It includes both initial submissions (which start discussion threads) and subsequent comments. By analyzing these elements collectively, we treat them as a unified set of social media posts for examination.
  • created_utc: The timestamp of when the post was created.
  • sentiment: The sentiment of the post as determined by VADER (positive, neutral, or negative).
  • processed_text: The post text after cleaning with custom stopwords and the NLTK library.
  • Topic: The topic number (1 to 74) that the post belongs to, as determined by BERTopic. Topic 74 contains the outliers that were not assigned to any specific topic.
  • Topic_Label: The descriptive label assigned to each topic, derived from the four most representative words identified through a class-based Term Frequency-Inverse Document Frequency analysis of the topic's content (Grootendorst, 2022).

Note: The naming conventions for Topic and Topic_Label adhere to those established by BERTopic, utilizing capitalization for consistency and to facilitate future interactions with the library.
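
To show how these columns fit together, here is a minimal sketch that reads rows with the documented column names and counts sentiment labels per topic. It uses only the standard library; the sample rows, the topic labels, the timestamp format and the function name `sentiment_by_topic` are all invented for illustration and are not taken from the real dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical rows using the documented column names; every value is invented.
SAMPLE_CSV = """body,created_utc,sentiment,processed_text,Topic,Topic_Label
"Any tips for finding a flat?",2021-09-01 10:15:00,neutral,tips finding flat,3,flat_rent_landlord_housing
"I love my course so far!",2022-10-12 08:30:00,positive,love course far,5,course_module_degree_study
"My landlord won't fix the heating.",2022-11-20 19:45:00,negative,landlord fix heating,3,flat_rent_landlord_housing
"""


def sentiment_by_topic(csv_file):
    """Count sentiment labels for each Topic_Label in an open CSV file."""
    counts = {}
    for row in csv.DictReader(csv_file):
        counts.setdefault(row["Topic_Label"], Counter())[row["sentiment"]] += 1
    return counts


summary = sentiment_by_topic(io.StringIO(SAMPLE_CSV))
for label, sentiment_counts in summary.items():
    print(label, dict(sentiment_counts))
```

For the real dataset, the same function could be given a file opened with `open("data/uniuk_sentiment_data.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.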

Repository Structure

  • assets/styles.css: The CSS file containing custom styles for the dashboard to enhance its appearance.
  • assets/future_direction.gif: The GIF file showcasing an example of AI-powered visualizations for future directions or features.
  • data/uniuk_sentiment_data.csv: The preprocessed dataset used for visualization in the dashboard.
  • fig/: The folder containing visualizations saved and used in the interpretation page (generated by interpretation.py).
  • .gitattributes: The configuration file that specifies which files are handled by Git Large File Storage (LFS); used here for the data file.
  • app.py: The main Python script that contains the Dash application code for the visualization dashboard.
  • dockerfile: The Dockerfile used to build the Docker image for the repository, with the necessary environment and dependencies. The image is published on Docker Hub as razuki/uniuk-app.
  • interpretation.py: The Python script that generates the HTML files used for the interpretation page within the dashboard.
  • requirements.txt: The file listing the required Python packages to run the application.

App.py Components

The app.py script contains the following main components:

  1. Data Processing: This section loads the data from the CSV file, removes erroneous rows, parses the dates, merges Topic 1 into the outliers, and renumbers the remaining topics sequentially.

  2. Dashboard App: This section initializes the Dash application and sets up the overall layout of the dashboard.

  3. Background Page: This section defines the layout and components for the background page, including the data table to explain the meta-data.

  4. Topic Frequency Page: This section defines the layout and components for the topic frequency analysis page, including a range slider for selecting the range of topics and tabs for displaying absolute and normalized frequencies.

  5. Sentiment Analysis Page: This section defines the layout and components for the sentiment analysis page, including a dropdown for selecting the topic of interest and tabs for displaying absolute and normalized frequencies.

  6. Topic Data Page: This section defines the layout and components for the topic data page, including a dropdown for selecting the topic of interest and a range slider for selecting the range of years.

  7. Interpretation Page: This section defines the layout and components for the interpretation page, including the three HTML files generated by interpretation.py and the example GIF.

  8. Callbacks: This section contains the callbacks that update the visualizations and data tables based on user interactions with the dashboard components.
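
The logic behind a page such as the Topic Data page can be sketched as a pure function that a callback would call. This is a standard-library illustration only, not the code in app.py: the function name `filter_topic_rows`, the `year` field, the colour palette and the sample rows are all hypothetical. In a Dash app, a function like this would typically be invoked from a callback registered with `@app.callback`, taking the dropdown and range slider values as inputs.

```python
# Hypothetical palette mapping each sentiment label to a table cell colour.
SENTIMENT_COLORS = {
    "positive": "#d4edda",
    "neutral": "#f8f9fa",
    "negative": "#f8d7da",
}


def filter_topic_rows(rows, topic_label, year_range):
    """Keep rows for one topic within an inclusive year range and attach a colour."""
    start_year, end_year = year_range
    selected = []
    for row in rows:
        if row["Topic_Label"] != topic_label:
            continue
        if not start_year <= row["year"] <= end_year:
            continue
        selected.append({**row, "color": SENTIMENT_COLORS[row["sentiment"]]})
    return selected


# Hypothetical rows; a real callback would work on the loaded dataset instead.
rows = [
    {"body": "Failed my mock.", "year": 2020, "sentiment": "negative", "Topic_Label": "exams"},
    {"body": "Results were great!", "year": 2021, "sentiment": "positive", "Topic_Label": "exams"},
    {"body": "Viewing a flat today.", "year": 2022, "sentiment": "neutral", "Topic_Label": "housing"},
    {"body": "Dreading finals.", "year": 2022, "sentiment": "negative", "Topic_Label": "exams"},
]

table = filter_topic_rows(rows, "exams", (2021, 2022))
for row in table:
    print(row["year"], row["sentiment"], row["color"])
```

Keeping the filtering in a plain function like this makes it possible to test the behaviour without starting the Dash server.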

Running Locally

Cloning the Repository

First, clone the repository to your local machine:

git clone https://github.com/sgjustino/UniUK.git
cd UniUK

Using Python

  1. Install the required Python packages listed in requirements.txt by running the following command:

    pip install -r requirements.txt
    
  2. Run app.py to start the Dash application:

    python app.py
    
  3. Access the dashboard through your web browser:

    http://127.0.0.1:8050/
    

Using Docker

Alternatively, you can run the application using the Docker image available on Docker Hub:

  1. Pull the Docker image from Docker Hub:

    docker pull razuki/uniuk-app:latest
    
  2. Run the Docker container:

    docker run -p 8050:8050 razuki/uniuk-app:latest
    

    This command maps port 8050 of the container to port 8050 on your local machine.

  3. Access the dashboard through your web browser:

    http://127.0.0.1:8050/
    

References

  • Al-Natour, S., & Turetken, O. (2020). A comparative assessment of sentiment analysis and star ratings for consumer reviews. International Journal of Information Management, 54, 102132. https://doi.org/10.1016/j.ijinfomgt.2020.102132
  • Baumgartner, J., Zannettou, S., Keegan, B., Squire, M., & Blackburn, J. (2020, May). The pushshift reddit dataset. In Proceedings of the international AAAI conference on web and social media (Vol. 14, pp. 830-839). https://doi.org/10.1609/icwsm.v14i1.7347
  • Biswas, A. (2023, October 17). AI-powered data visualizations: Introducing an app to generate charts using only a single prompt and OpenAI large language models. Databutton. https://medium.com/databutton/ai-powered-data-visualization-134e89d82d99
  • Briggs, A. R., Clark, J., & Hall, I. (2012). Building bridges: understanding student transition to university. Quality in higher education, 18(1), 3-21. https://doi.org/10.1080/13538322.2011.614468
  • Gagné, T., Schoon, I., McMunn, A., & Sacker, A. (2021). Mental distress among young adults in Great Britain: long-term trends and early changes during the COVID-19 pandemic. Social Psychiatry and Psychiatric Epidemiology, 1-12. https://doi.org/10.1007/s00127-021-02194-7
  • Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794. https://doi.org/10.48550/arXiv.2203.05794
  • Grootendorst, M. (2023, August 22). Topic modeling with Llama 2: Create easily interpretable topics with Large Language Models. Towards Data Science. https://towardsdatascience.com/topic-modeling-with-llama-2-85177d01e174
  • Guo, Y., Ge, Y., Yang, Y. C., Al-Garadi, M. A., & Sarker, A. (2022). Comparison of pretraining models and strategies for health-related social media text classification. In Healthcare (Vol. 10, No. 8, p. 1478). MDPI. https://doi.org/10.3390/healthcare10081478
  • Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014. https://doi.org/10.1609/icwsm.v8i1.14550
  • Reddy, K. J., Menon, K. R., & Thattil, A. (2018). Academic stress and its sources among university students. Biomedical and pharmacology journal, 11(1), 531-537. https://dx.doi.org/10.13005/bpj/1404
  • Samaras, L., García-Barriocanal, E., & Sicilia, M. A. (2023). Sentiment analysis of COVID-19 cases in Greece using Twitter data. Expert Systems with Applications, 230, 120577. https://doi.org/10.1016/j.eswa.2023.120577
  • Solatorio, A. V. (2024). GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning. arXiv preprint arXiv:2402.16829. https://doi.org/10.48550/arXiv.2402.16829
  • Souza, F. D., & Filho, J. B. D. O. E. S. (2022). BERT for sentiment analysis: pre-trained and fine-tuned alternatives. In International Conference on Computational Processing of the Portuguese Language (pp. 209-218). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-98305-5_20
  • Winstone, L., Mars, B., Haworth, C. M., & Kidger, J. (2021). Social media use and social connectedness among adolescents in the United Kingdom: a qualitative exploration of displacement and stimulation. BMC public health, 21, 1-15. https://doi.org/10.1186/s12889-021-11802-9

License

This project is licensed under the MIT License. See the LICENSE file for details.
