data4democracy / assemble Goto Github PK

NOT AN ACTIVE PROJECT -- Check readme for data sources

License: MIT License

Jupyter Notebook 97.52% Python 2.15% JavaScript 0.11% HTML 0.03% Julia 0.19%

assemble's Introduction

Assemble

Project Description: Assemble is a data for democracy community working to build tools and infrastructure to enable the study of online communities and their characteristics. We have several active repositories, our goal is to build a toolkit which takes care of common tasks so researchers do not have to reinvent the wheel with each new project.

Project Leads:

Maintainers: Maintainers have write access to the repository. They are responsible for reviewing pull requests, providing feedback and ensuring consistency.

@sjackson (Subject Matter Knowledge)
@wwymak (Community Detection, NLP)
@henripal (Assemble, Community Detection, NLP)
@alarcj (collect-social, onboarding, tutorials, twitter-analysis)

Project Ambassadors:

@ccarey

Getting Started:

For a list of first steps, please visit our community guide.
Read about how we use issue labels

Things you should know

"First-timers" are welcome! Whether you're trying to learn data science, hone your coding skills, or get started collaborating over the web, we're happy to help. (Sidenote: with respect to Git and GitHub specifically, our github-playground repo and the #github-help Slack channel are good places to start.)
We believe good code is reviewed code. All commits to this repository are approved by project maintainers and/or leads (listed above). The goal here is not to criticize or judge your abilities! Rather, sharing insights and achievements. Code reviews help us continually refine the project's scope and direction, as well as encourage the discussion we need for it to thrive.
This README belongs to everyone. If we've missed some crucial information or left anything unclear, edit this document and submit a pull request. We welcome the feedback! Up-to-date documentation is critical to what we do, and changes like this are a great way to make your first contribution to the project.

Currently utilized skills

Take a look at this list to get an idea of the tools and knowledge we're leveraging. If you're good with any of these, or if you'd like to get better at them, this might be a good project to get involved with!

If you would like to get started with any of these skills, check out the tutorials and chat about it in #learning.

Python 3 (scripting, web scraping, analysis, Jupyter notebooks, visualization)
Data extraction/ETL
Data cleaning

Project Areas

Infrastructure

If you like the idea of building tools that will help enable analysis across many domains these projects are a great place to start. If you have an idea for a dataset you would like to collect please file a proposal via GitHub issue with the label proposal.

Curation

Leveraging the Infrastructure group's fantastic work, the Curation team makes available repositories of information about online communities. The data is "analysis ready" and has been curated to support downstream analytical objectives, and the team works closely with the data.world staff.

Infrastructure Repositories

town-council: Tools to scrape and centralize the text of meeting agendas & minutes from local city governments.
smtk: Ambitious attempt to combine all below projects + more into single toolkit.
Discursive: Framework for searching or streaming tweets storing them in Elasticsearch and S3.
Collect-Social: This project aims to make that collection process as simple as possible, by making some common-sense assumptions about what most researchers need, and how they like to work with their data. For example, tasks like grabbing all the posts and comments from a handful of Facebook pages, and dumping the results into a sqlite database.
Reddit-Api-Miner No longer active. Reddit integration is on the roadmap for smtk.

Data pipeline

We are looking for people to take our raw data and curate it so that it is analysis ready. You will work closely with the the person(s) who gathered the data to understand methodologies for how the data was gathered to help document the end to end data cleaning process for future analysts. Eventador has gracioulsy donated infrastructure to assist with this effort.

Additional Resources:

Getting started with Eventador
See DATA GOVERNANCE.

Raw data:

Info Source
Congressional Record
Discursive
Have data to add? Check our how-to guide

Curation Projects

Oathkeepers - Militia and white nationalist twitter data

Tutorials and Example Notebooks:

We need people who would like to write tutorials or script examples on how to do common tasks.

Do not worry if you are not an expert. Tutorials from the perspective of a beginner are great for other beginners.

Examples of work that has inspired us:

Special thanks to the drug-spending team for writing such a great README we borrowed liberally from it

assemble's People

Contributors

Stargazers

Watchers

assemble's Issues

Tweet text data parsing/cleaning for nlp

Look through data available at https://data.world/data4democracy/far-right as data from the discursive project

Some of the tasks we might do are:

Stem
Tokenize
Remove stop words
List of stop words: https://pypi.python.org/pypi/stop-words (most nlp libraries also come with them)

Depending on what you want to achieve, you might not need all of the above (e.g. for training word2vec, you might not need to do any of that, but you might want to convert emojis)

Tag POS (for further sentiment analysis)

Useful libraries:
spaCy
NLTK
sklearn
TextBlob
gensim
Mallet

I'm exploring what is possible/needed at the mo with @divya -- but feel free to chip in with opinions, ideas, especially if you're an nlp expert :)

URL domain extraction

Problem:

In order to do analysis on types of links being shared we need a reliable way to extract & count domains that appear in a list of URLs.

Tasks:

There are libraries that do this but none of them are perfect (It's fine to leverage a library but try to do your own validation on the results)
Attempt to alias domains known to be associated and count together youtu.be and youtube.com are both youtube
Make sure you're capturing the actual domain ex forums.website.com the domain is website
Output should be a list of domain counts in descending order.
Sort out shortened links and publish as a separate file. Ex bit.ly, t.co

You do not need a full solution in order to submit a PR. If you have questions drop in to assemble chat and see if anyone else is interested in working on the problem.

You can download the data here or load directly to pandas via

import pandas as pd
df = pd.read_csv('https://s3.amazonaws.com/far-right/fourchan/youtube_urls.csv')

Post cleaning should generate a list of domains and their count. As well as a separate file of all shortened links where domain is not know. (Recommend you do not try to visit these shortened links)

youtube, 500
facebook, 200
twitter, 150
wikipedia, 100

warning: this work requires you deal with highly explicit and offensive content from the pol 4chan board. Please do not visit the links you find as some may contain malware/offensive content.

Consolidate text data into a single data store.

Goal:

Consolidate the article and text data gathered from various websites into a single data store. This work is to support work being done by the #far-right team.

First Step:

Agree on appropriate data store. If you are familiar with a specific tool AND are willing to help us get started please post pro/cons about how this tool(s) may handle our requirements.

Able to store/query at least ~10 million records (see examples below)
Data is readily available for our analysts.
Access can be restricted
Cost conscious but does not have to be free

Background:

Community members have collected or donated article text data from various online communities and news sources. Chosen storage should be flexible enough to allow data model to change over time but structured enough to enable analysts to search data across multiple sources. New data is uploaded to s3 daily.

Examples of data we need to store:

an average case popular news archive is about ~250,000 rows (800 mb in raw json unzipped). Adding about ~75,000-100,000 rows a year.
Our largest identified source comes from a web forum and has about 3-5 million historical posts and adding thousands per day (TBD how much of the archive we store in live data set)
If you would like to get your hands on some real data contact me for s3 access.

Some basic cleaning/standardization has already been done. Current data is in below format and stored in json files on s3.

CURRENT data model

Required
language: language of the text
url: the url of an article
text_blob: body of article/text
source: source website

Optional/If exists
authors: the authors of a particular article
pub_date: date the article was published. (Format: YYYY-MM-DD)
pub_time: time the article was published. Time should be stored in UTC if possible. (Format:HH:MM:SSZ)
title: the headline of a page/article/news item
lead: opening paragraph, initial bolded text or summary
hrefs: list of hrefs extracted from article or text
meta: Non standard field. This field contains data specific to the source. May contain embedded json objects. Analysts should make sure they understand the data model used before relying on this field as it may be different across sources

Word2Vec models

Construct word2vec model with tweets for groups of people (e.g. far right) and compare with models trained on the overall twitterverse (e.g. http://fredericgodin.com/papers/Named%20Entity%20Recognition%20for%20Twitter%20Microposts%20using%20Distributed%20Word%20Representations.pdf)

Some things to try:
clustering tweets with tSNE/kMeans/PCA
predict hashtags with tweets vectors
do regression on tweet/hashtag vectors

(notes from a chat with a colleague of mine who did some nlp research.
The following are some of his recommendations:

using word2vec is more going to give better results compared to e.g. countVectorizer
use word2vec with skipgram training for the tweets themselves
there probably is no need to remove stop words or tokenize tweets (but remove punctuation)
convert emojis into e.g. happy to get better context
convert word2vec vectors into polar coordinates
train word2vec for hashtags from tweets using cbow

His opinion is that gensim is a handy tool but he also built some extra utils etc for his work that may be useful: https://github.com/pelodelfuego/word2vec-toolbox )

I have been tinkering a bit with the our data using gensim (seems fairly easy to use although I haven't actually tried seeing what falls out of it yet)

Slack bot integration with assemble repo

I think it would be cool to test out a chat bot that sits in the #assemble channel and messages when issues/PRs are opened or a PR is merged. Bonus points if the bot gives a special congratulations for a first time merge. We can start small, if it is a success project leads can optionally add this bot to other project channels.

Next steps:

Comment if you think this is a terrible/great idea but do not want to work on it
Post below if you're interested, know how slack bots work or are willing to research to figure it out.

Data guide: Discursive tweet archive

Create a data-guide for the discursive data set. Data is uploaded to s3 bucket but undocumented.

Next steps:

Contact @hadoopjax to discuss collection methodology
Contact @bstarling to get directory structure and aws credentials
(Optional) Create some sample analysis or pose interesting questions for another community member to run with.

Community detection using pre-implemented algorithms

The idea is to list and test classic, pre-implemented algorithms for community detection:

Louvain detection using this implementation https://github.com/taynaud/python-louvain/
hierarchical clustering https://networkx.github.io/documentation/networkx-1.10/examples/algorithms/blockmodel.html
others?

Testing to be done on the #far-right twitter dataset.

See @alejandrox1 great post and tutorials: Data4Democracy/discursive#4
https://github.com/Data4Democracy/tutorials

Strip youtube ID from youtube links

The problem:

A goal of an progress data pipeline is to extract youtube links found in text blobs then poll youtube API to to get additional video metadata. In order to poll API we need to extract the youtube video ID from the URL.

Tasks

Determine if link is actually a link to a video
Isolate youtube ID
Create output CSV which contains original_url and youtube_id
Submit a PR to this repository as a stand alone file or Jupyter notebook Even if you only have a partial solution, PR is encouraged so others can pickup where you left off or provide suggestions.

Additional Info

The base case is links will look like https://www.youtube.com/watch?v=DiTECkLZ8HM the youtube ID is DiTECkLZ8HM. Create a csv file with two columns original_url, youtube_id.

Links will come in many formats.
Some examples:

https://youtu.be/C-XXXXXXXXXXX
https://www.youtube.com/watch?v=XXXXXXXXXXX&t=538s
https://www.youtube.com/watch?v=XXXXXXXXXXX&list=LLqm0Q-XmsHWX_Gklk-NAUAw&index=43
https://www.youtube.com/user/XXXXXX/videos (would be discarded since it is not a link to a video)

A sample of 40,000 URLs to be used for testing purposes can be found here

Warning: this work requires you deal with highly explicit and offensive content from the pol 4chan board.

Community detection using spectral matrix analysis and clustering

The idea here is to treat the graph matrix as a feature matrix and to use traditional dimension reduction/clustering techniques on these features.

An example workflow would be:

build the sparse follow matrix from the graph representation of the network
SVD or PCA the result http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
use classic clustering techniques on the result http://scikit-learn.org/stable/modules/clustering.html

good testing ground is the twitter #far-right data.

also check out great post and tutorial by @alejandrox1 Data4Democracy/discursive#4
https://github.com/Data4Democracy/tutorials

tokenize and analyze 2016 presidential candidate rhetoric for comparison with extremist communities

Anyone interested in doing some basic word/n-gram analysis, topic models, etc. on presidential candidate speeches and press releases? Would be really interesting to see which candidates were/weren't plugged in to the extremist communities and when/where certain extremist language creeps into more mainstream campaign discourse.

An R notebook with instructions and code for obtaining this data from The American Presidency Project will be in the exploratory_notebooks folder soon (just submitted a pull request).

Community detection using non-negative matrix factorization

Initially suggested by @hadoopjax in Data4Democracy/discursive#4

test and implement non-negative matrix factorization using both graph and textual features as described in https://arxiv.org/pdf/1608.01771v1.pdf
testing to be done on the #far-right twitter dataset

Web scraping: Pull congressional record

Looking for someone who can work with me to build a spider to pull the congressional record. This needs to be done by end of weekend so it is a tight turn around so looking for someone with time to spare.

Experienced scrapers DM me for full details. @bstarling
If you are new and I am willing to mentor on this work.

Requirements:

We're looking to parse all 2017 activity from here and return 1 json file per day per category with the following fields:

date: (congressional date of record)
category: (daily digest, senate, house, extensions)
title: 
url : (url source)
text_blob:
hrefs: links in article ex: /congressional-record/volume-163/senate-section/page/S554

4chan link extraction & cleanup

Problem:

We have collected text data from the 4chan api. Unfortunately the text returned via the API In the (com) field is pretty rough. It includes html markup and other random garbage.

Ex:

"com": "<a href=\"#p116190305\" class=\"quotelink\">&gt;&gt;116190305</a>
<br>redacted are simply jealous of redacted.<br>https://youtu.be/k4yXQkG2s1E"

Additional info:

Clean all html tags leaving only the plain text. Extract links to external sites & quoted threads.
Expect lots of malformed html and weird junk mixed in.
Err on the side of capturing too much vs too little (especially when identifying links).
I've setup a small public test dataset posted to s3 here you can load directly to pandas via df = pd.read_csv('https://s3.amazonaws.com/far-right/fourchan/chan_example.csv', parse_dates=['created_at'])

Post cleaning the above should generate something along the lines of the below (use your own judgement after playing with the data):

{
    "text": "redacted are simply jealous of redacted",
    "external_links": ["https://youtu.be/k4yXQkG2s1E"]
}

warning: this work requires you deal with highly explicit and offensive content from the pol 4chan board.

data4democracy / assemble Goto Github PK

assemble's Introduction

Assemble

Getting Started:

Things you should know

Currently utilized skills

Project Areas

Infrastructure

Curation

Infrastructure Repositories

Data pipeline

Curation Projects

Tutorials and Example Notebooks:

Examples of work that has inspired us:

assemble's People

Contributors

Stargazers

Watchers

Forkers

assemble's Issues

Problem:

Tasks:

Goal:

First Step:

Background:

CURRENT data model

The problem:

Tasks

Additional Info

Requirements:

Problem:

Additional info:

Recommend Projects

Recommend Topics

Recommend Org