thepanacealab / covid19_twitter
Covid-19 Twitter dataset for non-commercial research use and pre-processing scripts - under active development
Home Page: http://www.panacealab.org/covid19/
Hello, according to the data description at http://www.panacealab.org/covid19/ (Table 1: Languages and their frequencies in the dataset), Indonesian (language code "id") ranks sixth among all languages by tweet count. But I could not find any Indonesian tweets in the JSON data I crawled. Have you run into the same problem?
Best wishes
Readme in version1.0 says:
"Apparently github has a bandwidth limit on free accounts for large files, so the full dataset Version 1 will be available in Zenodo: https://doi.org/10.5281/zenodo.3723940"
But this link to the dataset in Zenodo throws an Internal server error, with Error identifier: 9eb9e7c76b9348b797b692c529dc143a
Hi,
May I check if it's possible to obtain the number of likes / retweets for each tweet?
Hi,
Once I load the tsv file I can only find 3 columns. Is this correct or are there other columns as well?
What about the last few days?
Are there any updates to this repo planned?
Hi,
Thank you for sharing this dataset. I am trying to use it for my master's dissertation.
I am just trying to work through the usage tutorial and am stuck at:
!python3 get_metadata.py -i clean-dataset-filtered.tsv -o hydrated_tweets -k api_keys.json
I get the error:
ImportError: cannot import name 'TweepError' from 'tweepy'
Following a search I gleaned that TweepError has been replaced by TweepyException in the current version. I have tried that, but still no luck. Do I need to use a specific Tweepy version?
EDIT: I have tried to make changes to the get_metadata.py file and I get output files with headings but they're empty
Thanks!
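For context, Tweepy 4.x did rename TweepError to TweepyException, so scripts written against Tweepy 3.x break on import. A minimal sketch of a compatibility alias follows; to keep it self-contained it uses a stand-in namespace instead of the real tweepy module, so the names here are illustrative only:

```python
# Sketch of a version-compatibility alias (Tweepy 4.x renamed
# TweepError to TweepyException). The same pattern works for any
# library rename; here it is demonstrated on a stand-in object
# rather than the real tweepy module.
import types

# Stand-in for the tweepy module after the 4.x rename.
tweepy = types.SimpleNamespace(TweepyException=RuntimeError)

# Restore the old name if it is missing, so legacy except-clauses work.
if not hasattr(tweepy, "TweepError"):
    tweepy.TweepError = tweepy.TweepyException

try:
    raise tweepy.TweepError("rate limit")
except tweepy.TweepError as exc:
    caught = str(exc)

print("caught:", caught)  # -> caught: rate limit
```

With the real library, placing the alias right after `import tweepy` (and before `get_metadata.py` touches the name) should let the unmodified script run; pinning `tweepy<4` is the other common workaround.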
Hello, I successfully completed the pre-step, but I had a problem with the hydration step.
After running get_metadata.py and entering the parameters, the output files I get ("hydrated_tweets" and so on) are blank files (0 KB). What might be wrong? (I am sure my input file "clean-dataset-filtered.tsv" is valid.)
Finally, thank you for the code you provided, which is very inspiring for my research!
Dear creators,
The top 1000 most-used terms between 23 Jul and 25 Jul are missing. Is there a place to find them?
Hi!
First of all, thanks for all your great work. This dataset is awesome.
I was wondering if it's possible for you to make public the subset of tweet IDs that are geotagged (i.e., have the coordinates field). Most users have limited capacity for rehydration, and in my case, I am interested specifically in the geotagged tweets. Rehydrating the whole dataset and filtering that out myself would be infeasible, but I noticed that on your website there is an interactive map with tweets based on their locations, so I assume there is already a subset made.
Thanks in advance!
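As a sketch of the kind of filtering meant here: once tweets are hydrated, geotagged ones can be identified by a non-null coordinates field. The field names follow the Twitter v1.1 payload, and the sample lines below are made up for illustration:

```python
import json

# Sketch: collect the IDs of geotagged tweets from a hydrated JSONL
# stream. "coordinates" and "id_str" are Twitter v1.1 field names;
# the two sample lines are illustrative, not real data.
sample_lines = [
    '{"id_str": "1", "coordinates": {"type": "Point", "coordinates": [-73.9, 40.7]}}',
    '{"id_str": "2", "coordinates": null}',
]

geotagged_ids = []
for line in sample_lines:
    tweet = json.loads(line)
    if tweet.get("coordinates"):          # null or missing -> not geotagged
        geotagged_ids.append(tweet["id_str"])

print(geotagged_ids)  # -> ['1']
```

Running this over the maintainers' hydrated JSON archive would produce exactly the geotagged-ID subset requested above.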
Hello! I'm trying to unpack the daily .tar.gz files but running into the following error in Python:
import tarfile
t = tarfile.open('./covid19_twitter/dailies/2020-03-23/2020-03-23_clean.tar.gz')
And bash:
$ tar tvf ./covid19_twitter/dailies/2020-03-23/2020-03-23_clean.tar.gz
tar: This does not look like a tar archive
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Anyhow, really like your team's work & thanks in advance!
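A common cause of "not in gzip format" is that the downloaded file isn't actually a gzip archive: for instance a Git LFS pointer file or an HTML error page saved under the .tar.gz name. A quick sketch for checking the gzip magic bytes before untarring (the in-memory samples stand in for the real dailies file):

```python
# Sketch: check whether a downloaded file is really gzip before untarring.
# A gzip stream starts with the magic bytes 0x1f 0x8b; a Git LFS pointer
# or an HTML error page (common causes of this tar error) does not.
import gzip

def looks_like_gzip(first_two_bytes: bytes) -> bool:
    return first_two_bytes == b"\x1f\x8b"

# Demonstrate on in-memory data instead of the real dailies file.
real_gz = gzip.compress(b"hello")
fake_gz = b"version https://git-lfs.github.com/spec/v1\n"

print(looks_like_gzip(real_gz[:2]))   # True
print(looks_like_gzip(fake_gz[:2]))   # False
```

On a real file, `open(path, 'rb').read(2)` gives the bytes to test; if the check fails, re-downloading the archive (e.g. via the raw-file URL rather than the repository web page) usually fixes it.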
Hello, Mr. Banda.
Thanks to you and all co-authors for the amazing database.
To retrieve the tweets from IDs and get their text, I used the scripts processing_code/getDataset_clean.py and processing_code/parse_json_extreme_cleantweets.py in a Jupyter Notebook. In the process I ran into an issue: after running the code, the output files are empty and I could not retrieve the tweets. Could you please guide me?
Best regards,
Zarnigor Dzhuraeva!
Hi,
This is awesome work and I can see how much effort it must have taken. I am really glad about that.
We will use this dataset for our Information Networks course, and I was wondering whether you have the data in a language-filtered format. If we filter it ourselves, we will have to hydrate 1.5 billion tweets only to find the 600 million that are in English. Given the limitations of our Twitter API access, that would be a huge burden.
Is there any chance you could do that, perhaps for English only? I believe you store the hydrated tweets in JSON format.
That would be a huge help, and I believe English-language data would be used by a wider community, since the tech and research world mostly works in English.
Thanks
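For what it's worth, if the full (non-clean) dataset TSV carries a language column, the English subset of IDs can be selected before hydration, avoiding most of the API cost. The column names below are assumptions; the header of the actual file should be checked first:

```python
import csv
import io

# Sketch: keep only English tweet IDs before hydrating, assuming the
# full dataset TSV carries a "lang" column. The column names and the
# sample rows are assumptions for illustration, not the real layout.
sample_tsv = (
    "tweet_id\tdate\ttime\tlang\n"
    "111\t2020-03-22\t00:00:01\ten\n"
    "222\t2020-03-22\t00:00:02\tid\n"
)

english_ids = [
    row["tweet_id"]
    for row in csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
    if row["lang"] == "en"
]

print(english_ids)  # -> ['111']
```

On the real file, replacing `io.StringIO(sample_tsv)` with `open("full_dataset_clean.tsv")` and writing `english_ids` back out produces an input file for get_metadata.py containing only English tweets.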
Thank you very much for making available the COVID-19 Twitter dataset. While browsing the repository I have noticed that the folders for the dates "2020-12-20" and "2020-12-21" do not include the "dataset.tsv.gz" file. Only the "clean-dataset.tsv.gz" file is included. Would it be possible to add the full dataset files?
Hello,
could you please share the code for clustering the data by term frequency?
Best regards,
Zarnigor.
I ran "parse_json_extreme.py" in the shell after downloading dataset.tsv, but what I got was an empty file. I searched on Google and found that when the program reaches tweet = json.loads(line), it always fails and continues. Am I forgetting something? I don't know much about programming and Python, and I'm confused. Can you help me?
When I run the program, what I type is "python .\parse_json_extreme.py .\2020-03-22-dataset.tsv"
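One possible explanation (an assumption, not confirmed against the script): parse_json_extreme.py expects hydrated tweets, one JSON object per line, whereas dataset.tsv contains raw tweet IDs in TSV form, so every json.loads call fails and is skipped, leaving an empty output. A small sketch that makes the mismatch visible, with illustrative sample lines:

```python
import json

# Sketch: count how many lines parse as JSON versus fail. Feeding a
# raw TSV of tweet IDs to a JSON parser fails on every line, which
# would silently produce an empty output file. Sample lines are
# illustrative only.
lines = [
    '{"id_str": "1", "full_text": "hello"}',        # hydrated tweet (JSON)
    "1241234567890123456\t2020-03-22\t00:00:01",    # raw TSV row (not JSON)
]

parsed, failed = 0, 0
for line in lines:
    try:
        json.loads(line)
        parsed += 1
    except json.JSONDecodeError:
        failed += 1

print(parsed, failed)  # -> 1 1
```

If that diagnosis is right, the fix is to hydrate the IDs first (e.g. with get_metadata.py) and feed the resulting JSON file to parse_json_extreme.py.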
Hi there,
Thanks for your excellent work on collecting and sharing this dataset.
As you mentioned on the website, there is a set of tools that can be used to hydrate the dataset.
Could you provide the list of these tools?
Thanks and I really appreciate your help.
In my understanding, the set of tweet IDs from clean_language_en.tsv should be a proper subset of the set of tweet IDs from full_dataset_clean.tsv. However, this is not the case: some IDs are not present in the full clean dataset. So I am wondering how and why this happens. I can provide a list of the missing tweet IDs if needed.
Thanks in advance!
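A minimal sketch of the subset check described above, assuming the tweet ID is the first column of each TSV. The file names match the issue; the tiny ID sets are illustrative:

```python
# Sketch: verify whether the English-language IDs are a subset of the
# full clean dataset. In practice each set would be loaded from the
# first column of the corresponding TSV; these values are illustrative.
en_ids = {"101", "102", "999"}      # from clean_language_en.tsv
full_ids = {"101", "102", "103"}    # from full_dataset_clean.tsv

missing = en_ids - full_ids         # IDs absent from the full clean dataset

print(sorted(missing))  # -> ['999']
```

An empty `missing` set would confirm the expected subset relationship; a non-empty one reproduces the discrepancy reported here.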
Hello, it seems that the language and place location fields are not present in the datasets before July 2020. It was my understanding that they were added to all tweets in version 20.
I was wondering whether I misunderstood and they were only added to tweets collected from version 20 onward, or whether this is a mistake?
Thank you for your amazing dataset and work :)
Good morning!!
How is it going?
I was browsing through your data.
Do you have the label of the tweets in terms of True(Truth/Good/Correct), or False(Fake/Lie/Disinformation)?
Best Wishes,
Fernando Durier