
konkyrkos / mep-tweet-clustering-classification


Clustering and Classification of Members of the European Parliament's Tweets. This repository provides my solution for the 3rd Assignment for the course of Practical Data Science for the MSc in Data Science at Athens University of Economics and Business.

License: MIT License

Jupyter Notebook 99.87% Python 0.13%
python tweeter data-preprocessing data-preparation clustering classification sklearn dummyclassfier bag-of-words tweeter-api

mep-tweet-clustering-classification

This repository provides my solution for the 3rd Assignment for the course of Practical Data Science for the MSc in Data Science at Athens University of Economics and Business.

The project investigates a dataset of tweets made by Members of the European Parliament. The data were collected by Darko Cherepnalkoski, Andreas Karpf, Igor Mozetič, and Miha Grčar for their paper Cohesion and Coalition Formation in the European Parliament: Roll-Call Votes and Twitter Activities.

Data Preparation

Downloading the dataset from https://www.clarin.si/repository/xmlui/handle/11356/1071 and using the retweets.csv file. Keeping only the records whose original tweet is in English. Using the Twitter API to fetch the text of each original tweet and adding it to the dataset as an extra column. Keeping only the records for which the tweet text could be downloaded. Grouping the records by the European Parliament group of the MEP who posted the original tweet and dropping groups with very few tweets.
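
The filtering steps above can be sketched with pandas. This is only an illustration on a toy table: the column names (`orig_tweet_id`, `lang`, `orig_group`), the `fetched_text` lookup standing in for the Twitter API, and the `min_tweets` threshold are all assumptions, not the actual schema of retweets.csv or the values used in the notebook.

```python
import pandas as pd

# Toy stand-in for retweets.csv; every column name here is hypothetical.
retweets = pd.DataFrame({
    "orig_tweet_id": [1, 2, 3, 4, 5, 6],
    "lang": ["en", "en", "fr", "en", "en", "en"],
    "orig_group": ["EPP", "EPP", "S&D", "S&D", "S&D", "ECR"],
})

# Hypothetical result of the API lookup: tweet id -> text, None when the
# tweet could not be downloaded (deleted, protected account, and so on).
fetched_text = {
    1: "vote on the trade deal",
    2: "budget debate today",
    4: "climate action now",
    5: None,
    6: "brexit talks continue",
}


def prepare(df, texts, min_tweets):
    """Keep English records with downloadable text, then drop small groups."""
    out = df[df["lang"] == "en"].copy()
    out["text"] = out["orig_tweet_id"].map(texts)
    out = out.dropna(subset=["text"])
    # Number of remaining tweets in each row's group.
    group_sizes = out.groupby("orig_group")["text"].transform("count")
    return out[group_sizes >= min_tweets].reset_index(drop=True)


prepared = prepare(retweets, fetched_text, min_tweets=2)
print(prepared)
```

In this toy run only the two EPP records survive: one record is not in English, one has no downloadable text, and the remaining groups fall below the threshold.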

Clustering

Converting the tweet text to bag-of-words matrices using CountVectorizer, stripping accents, converting everything to lowercase, removing all English stop words, and setting min_df=10 and max_df=50. Using k-means to cluster the tweets based on their text. Using both the elbow method and the silhouette score to choose the number of clusters. Visualizing the clusters using Yellowbrick's InterclusterDistance. Investigating the clusters by finding the most important features in each cluster.

Classification

Training at least two algorithms to classify an unseen tweet. The target variable is the political party of the original poster and the only training feature is the original tweet's text. Splitting the data into training and testing datasets. Comparing the algorithms with cross-validation on the training dataset and tuning the hyperparameters of the best one. To gauge the efficacy of the chosen algorithm, we also present the results of a baseline classifier, scikit-learn's DummyClassifier.
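
The workflow above might look roughly like the following sketch. The README does not name the algorithms, so MultinomialNB and LogisticRegression are assumptions chosen for illustration, as are the hyperparameter grids, the synthetic texts, and the placeholder labels `GroupA` to `GroupC`. The baseline uses DummyClassifier with the `most_frequent` strategy, which is one of several strategies it supports.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Synthetic, easily separable data: 20 short texts per placeholder group.
group_words = {
    "GroupA": ["trade", "tariff", "market"],
    "GroupB": ["climate", "energy", "emissions"],
    "GroupC": ["migration", "border", "asylum"],
}
texts, labels = [], []
for group, words in group_words.items():
    for i in range(20):
        texts.append(f"{words[i % 3]} {words[(i + 1) % 3]} debate")
        labels.append(group)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)


def make_pipeline(clf):
    """Bag-of-words features followed by the given classifier."""
    return Pipeline([("bow", CountVectorizer(stop_words="english")), ("clf", clf)])


# Candidate algorithms and grids (assumed, not taken from the notebook).
candidates = {
    "naive_bayes": (MultinomialNB(), {"clf__alpha": [0.1, 1.0]}),
    "log_reg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
}

# Cross-validated grid search on the training split only.
searches = {
    name: GridSearchCV(make_pipeline(clf), grid, cv=3, scoring="accuracy").fit(
        X_train, y_train
    )
    for name, (clf, grid) in candidates.items()
}
best_name = max(searches, key=lambda name: searches[name].best_score_)
best_accuracy = searches[best_name].score(X_test, y_test)

# Baseline that always predicts the most frequent training class.
baseline = make_pipeline(DummyClassifier(strategy="most_frequent")).fit(
    X_train, y_train
)
baseline_accuracy = baseline.score(X_test, y_test)

print(best_name, best_accuracy, baseline_accuracy)
```

Keeping the held-out test split out of the grid search means the final comparison against the baseline is made on tweets that neither the model selection nor the fitting has seen.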
