A project to find and rank the top superspreaders of misinformation on Twitter

License: MIT License

Python 86.87% Shell 13.13%

top-fibers

Code to find and rank the top superspreaders of misinformation on Twitter using the FIB-index.

Creators

Top FIBers is a project of the Observatory on Social Media (OSoMe, pronounced "awesome") at Indiana University. The following individuals have contributed to this project: Matthew R. DeVerna, Pasan Kamburugamuwa, Nick Liu, Kaicheng Yang, Ben Serrette, and Filippo Menczer.

The best way to contact the team is by using the contact information found at the OSoMe website.

top-fibers's Issues

Finalize documentation

I am adding all of us, but I will likely handle most of this myself.

ToDo List

The items below are for when the repo is ready and all data has been updated.

Make sure that the README docs are correct in all directories
Update the web-based documentation (see this) which is quite outdated at this point.
- Nice to have: Can we create a diagram of the infrastructure and the data processing pipeline? (use https://app.diagrams.net/)

Update the `_all_months.sh` scripts

The logic in the two scripts listed below does not work.

They are not crucial to the monthly logic so they can either:

Be removed
Be updated to loop through the directories that we have symlinks for
Be fixed some other way

It would be nice to have the convenience of these scripts in case something happens.

Merge the two fib scripts into one

Currently, there are two scripts that are used to calculate the fib information.

They are:

They can be merged into one file if we do the following:

Update the argparse inputs to include:
- Number of spreaders (currently hardcoded)
- Type of spreader (currently hardcoded)
- Platform
Move both data extraction functions to the package/top_fibers_pkg/fib_helpers.py
- Another option would be to combine them into one function that does different things depending on a platform flag (which could be taken from above) — this would create a very large function though...
Other small things like: importing both data models, setting output files based on flags, etc.
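
As a rough illustration of the argparse change proposed above, here is a minimal sketch of what the merged script's command-line interface could look like. The flag names, the default values, and the `build_parser` helper are all hypothetical and are not taken from the existing scripts.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser for a hypothetical merged FIB script."""
    parser = argparse.ArgumentParser(
        description="Calculate FIB indices for one platform."
    )
    # Hypothetical flag names and defaults; the real scripts may differ.
    parser.add_argument(
        "--platform",
        choices=["twitter", "facebook"],
        required=True,
        help="Platform whose data model and extraction function to use.",
    )
    parser.add_argument(
        "--num-spreaders",
        type=int,
        default=50,
        help="Number of top spreaders to keep (currently hardcoded).",
    )
    parser.add_argument(
        "--spreader-type",
        default="fib",
        help="Type of spreader to rank (currently hardcoded).",
    )
    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(["--platform", "facebook", "--num-spreaders", "10"])
print(args.platform, args.num_spreaders, args.spreader_type)  # facebook 10 fib
```

With a `--platform` flag in place, the script could pick the matching data model, extraction function, and output file names from that single value.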

Right now, the code works by ingesting all tweet objects from base tweets, retweets, and quote tweets. This means that if something is retweeted from a very old post, that post could have had a year or more to gain retweets. This is not necessarily a fair comparison, so we may want to filter out tweets that did not originate during the observed time frame.
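
One way to apply such a filter is sketched below. It assumes each post has already been reduced to a dictionary whose `created_at` value is the creation time of the original post (not of the retweet). That record shape and the `originated_in_window` helper are hypothetical.

```python
from datetime import datetime, timezone


def originated_in_window(posts, start, end):
    """Keep only posts whose original creation time falls in [start, end)."""
    return [post for post in posts if start <= post["created_at"] < end]


# Observation window: January 2022 (end is exclusive).
start = datetime(2022, 1, 1, tzinfo=timezone.utc)
end = datetime(2022, 2, 1, tzinfo=timezone.utc)

posts = [
    # An old post that was retweeted during the window; it is dropped.
    {"id": "a", "created_at": datetime(2021, 1, 15, tzinfo=timezone.utc)},
    # A post created during the window; it is kept.
    {"id": "b", "created_at": datetime(2022, 1, 10, tzinfo=timezone.utc)},
]

kept = originated_in_window(posts, start, end)
print([post["id"] for post in kept])  # ['b']
```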

Download FB posts for the past year Oct '21-Dec '22

We want to be able to calculate the FIB indices for all months in 2022.

In order to do this, we will need to have one file for each month, going back three months from Jan '22. Thus, the script needs to be called 15 times, once for each month of data.
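
The count of 15 can be checked with a short sketch that lists every month in the Oct '21 to Dec '22 range given in the issue title. The `month_range` helper is hypothetical; a download script could loop over its output and make one call per month.

```python
from datetime import date


def month_range(start: date, end: date) -> list[str]:
    """Return YYYY-MM labels for every month from start to end, inclusive."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


months = month_range(date(2021, 10, 1), date(2022, 12, 1))
print(len(months), months[0], months[-1])  # 15 2021-10 2022-12
```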

Please address the cosmetic changes in #35

I merged #35 despite requesting some small cosmetic changes. I did this to get the pipeline ready for the cronjob that is going to occur over the weekend.

When you have a chance, please create a new PR to address what I left in the comments of that PR.

Thanks!

Remove date from the package documentation

top-FIBers/docs/code/top_fibers_pkg.md

Line 10 in f69d3f5

As of 2022-11-11, there are three modules which are all heavily documented.

As you can see in the most recent rendering of this page here, the date in the referenced line above is outdated.

Having the last updated date in two places is likely to lead to this issue. Use only the Last modified note from the YAML.

Re-pull all Twitter data from moe using the new list of domains

You can find the list of domains to use in data/iffy_files. Will probably be easier to do this after you merge #36 first (as the file is on that branch).

The data should be re-pulled for all months that we had previously: all of last year, plus this year up to the most recently completed month. As today is 4/28, that would be up to and including 2023-03. However, in two days, we are going to need the April data as well.

Please let me know if anything isn't clear. Thanks!

Get CrowdTangle to raise API rate limits

API requests page: https://www.facebook.com/help/contact/908993259530156

Crontab `MAILTO` variable

Once the project is finalized, do you think we need to add others to this variable? I know that many of the other projects send emails to the larger group email list and Fil gets updates.

I will let you decide how you'd like to manage this as it will ultimately become your project to manage after I am gone. That said, I would recommend adding at least one other person's email from the developers' team.

Also, I will add myself to keep an eye on this while I am still around.

Create a clean version of the iffy list

Create a clean version of the iffy list with https and wildcards removed
Update the code so it loads the clean version and doesn't need to clean the domains on the fly
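
A minimal sketch of the cleaning step is below. It assumes entries in the raw list may carry an `http://` or `https://` scheme, a leading `*.` wildcard, or a trailing `/*` wildcard. Those formats, and the `clean_domain` and `clean_domains` helpers, are assumptions and have not been checked against the files in data/iffy_files.

```python
from urllib.parse import urlparse


def clean_domain(raw: str) -> str:
    """Strip the scheme, any path, and wildcard markers from one entry."""
    entry = raw.strip().lower()
    # urlparse only fills netloc when a scheme is present, so add one if missing.
    if "://" not in entry:
        entry = "http://" + entry
    host = urlparse(entry).netloc  # drops the scheme and any "/*" path
    if host.startswith("*."):
        host = host[2:]  # drop a leading wildcard label
    return host


def clean_domains(raw_entries):
    """Clean every non-blank entry, then de-duplicate and sort the result."""
    return sorted({clean_domain(entry) for entry in raw_entries if entry.strip()})


raw = ["https://example.com", "*.example.com", "example.org/*", "  "]
print(clean_domains(raw))  # ['example.com', 'example.org']
```

The cleaned list could be written to its own file once, so the pipeline loads it directly and no longer cleans domains on the fly.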

Fix typo in script name

top-FIBers/scripts/data_processing/README.md

Line 7 in 7137538

 - `calc_crowdtanlge_fib_all_months.sh` : runs `calc_crowdtangle_fib_indices.py` for all time periods 

Write script to pull CrowdTangle data

Start by pulling only Facebook data.

See #9 for a bit more details on incorporating Instagram.

Add workflow bash script that runs the entire pipeline

Add logging aspect to iffy_get_data.sh

This script will need to be updated to log its progress and any errors.

Create a file that maps top FIBers to the iffy list used to find them

For the FB download, this could be built into the script.

For the Twitter pipeline, this will need to be thought through a bit.

Incorporate logging

Create a logging directory that has different subdirectories for each type of script running.

data collection
analysis
etc
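
A possible shape for this, using only the standard `logging` module, is sketched below. The `get_logger` helper, the subdirectory names, and the log file naming are hypothetical. A temporary directory stands in for the real logging directory.

```python
import logging
import tempfile
from pathlib import Path


def get_logger(script_type: str, script_name: str, log_root: Path) -> logging.Logger:
    """Return a logger that writes to <log_root>/<script_type>/<script_name>.log."""
    log_dir = log_root / script_type
    log_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(f"{script_type}.{script_name}")
    logger.setLevel(logging.INFO)
    # Attach the file handler only once, even if get_logger is called again.
    if not logger.handlers:
        handler = logging.FileHandler(log_dir / f"{script_name}.log")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


# A temporary directory stands in for the project's logging directory.
log_root = Path(tempfile.mkdtemp())
logger = get_logger("data_collection", "iffy_get_data", log_root)
logger.info("download started")
logger.error("example error message")
```

Each type of script (data collection, analysis, and so on) then gets its own subdirectory, with one log file per script.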

Clickable account icons

On the "Accounts" page, it looks like we can no longer click the user icon to visit an account's Twitter/Facebook page. I think that Fil had asked you to make the text no longer look like a hyperlink when we added the "Unfollow" button, but I would still like to be able to click the icon to visit the account. That way, we would be able to visit it both ways: via the unfollow button and via the icon. Can you please add that functionality back in when you get a chance?

Thanks!

Figure out zenodo upload

Figure out how to automatically upload top FIBer files to zenodo.

Update FAQ

Hi Pasan, can you please update the FAQ sections outlined below with the following text? These have changed since we've changed the list of domains slightly.

Thanks!

How do you define misinformation?

We adopt a common definition of misinformation utilized in academic research, which focuses on a source of information “that mimics news media content in form but not in organizational process or intent” (Lazer et al., Science, 2018). With this definition, we search for posts that contain at least one link to sources within a list that is curated by an independent third party, Iffy.news. Specifically, we include sources that have been marked by Media Bias Fact Check (MBFC) as having a "low" or "very-low" "MBFC Factual" score.

According to MBFC methodology, a source in these categories "rarely uses credible sources and is not trustworthy for reliable information" and "need[s] to be fact-checked for intentional fake news, conspiracy, and propaganda." Since a source's MBFC Factual score can change, we update our list of sources each month prior to releasing a new Top FIBers monthly report. These updates do not affect prior reports.

How do you collect your data?

Facebook

Facebook data are gathered using the CrowdTangle API. Specifically, we utilize the /posts/search/ endpoint. As a result of utilizing CrowdTangle, we are limited to collecting public posts (see the CrowdTangle documentation for more details).

Data for all months in 2022 as well as January through March of 2023 were gathered during the week of April 24, 2023. After that point, Facebook data are collected within the first week of the following month (depending on how long it takes for all data to download). For example, April 2023 data were collected during the first week of May 2023. We collect public posts linking to at least one of the low-credibility sources (see How do you define misinformation? for more details). Please link the bold portion to the section above.

Twitter

Twitter data are collected with the enterprise-level Decahose endpoint. The Decahose delivers a 10% random sample of all tweets in real time. From this, we then collect all tweets that link to at least one of the low-credibility sources.

As Twitter's recent API changes have made continuing this data collection virtually impossible, we will no longer be able to continue analyzing Twitter's biggest superspreaders of misinformation.

Visualization ideas

Some ideas for account visualization:

Top ten domains shared
Top hashtags utilized
Misinfo links shared per day
Mean low-credibility tweets per day/week/etc
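
The first and third ideas could be computed along these lines before plotting. This is only a sketch: it assumes each record is a `(day, url)` pair for one shared low-credibility link, and the `top_domains` and `links_per_day` helpers are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(urls, n=10):
    """Return the n most frequently shared domains with their counts."""
    counts = Counter(urlparse(url).netloc.lower() for url in urls)
    return counts.most_common(n)


def links_per_day(records):
    """Count shared links per day from (day, url) pairs."""
    return Counter(day for day, _ in records)


# Hypothetical records for one account: (day, shared link).
records = [
    ("2023-01-01", "https://a.example/story-1"),
    ("2023-01-01", "https://b.example/story-2"),
    ("2023-01-02", "https://a.example/story-3"),
]

print(top_domains([url for _, url in records], n=1))  # [('a.example', 2)]
print(links_per_day(records)["2023-01-01"])  # 2
```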

Update /scripts/data_collection/get_tweets_from_moe.sh

We need to figure out a way to accurately check for task success.

Add line for profile_links.py

top-FIBers/data-loader/database_functions/README.md

Line 8 in 5eacd16

### Contents

Just noticed that the Contents section of this README does not include a line item for the profile_links.py file. Can you please add one? Thanks!

Typo in documentation

See: https://github.com/mr-devs/top-fibers/blob/2e74079ab1581037087f3d4b40b5851375bd368e/docs/documentation.md

Towards the end there is a typo of the word "repository" that needs to be fixed.

Make the cronjob send email reminders

We need to update the master bash script so that it sends an email to Nick when something fails.

We have already added Nick's email to the crontab file using the MAILTO variable but need to make all of the exit lines return 1.

I will take care of this.

Handle both Instagram and Facebook separately?

Technically, we will have the ability to pull Instagram data as well.

Decide if we want to do this. If we do, I think it would require rewriting a lot of the code...

Add some documentation about the package

It looks like there is some documentation for the package here, but it cannot be reached from the page because nothing points to it.
