thepanacealab / covid19_twitter
Covid-19 Twitter dataset for non-commercial research use and pre-processing scripts - under active development
Home Page: http://www.panacealab.org/covid19/
Hello, according to the data description at http://www.panacealab.org/covid19/ (Table 1: Languages and their frequencies in the dataset), Indonesian (language code "id") ranks sixth among all languages by tweet count. But I could not find any Indonesian tweets in the JSON data I crawled. Have you run into the same problem?
Best wishes
Readme in version1.0 says:
"Apparently github has a bandwidth limit on free accounts for large files, so the full dataset Version 1 will be available in Zenodo: https://doi.org/10.5281/zenodo.3723940"
But this link to the dataset in Zenodo throws an Internal server error, with Error identifier: 9eb9e7c76b9348b797b692c529dc143a
Hi,
May I check if it's possible to obtain the number of likes / retweets for each tweet?
Hi,
Once I load the tsv file I can only find 3 columns. Is this correct or are there other columns as well?
What about the last few days?
Are there any updates to this repo planned?
Hi,
Thank you for sharing this dataset. I am trying to use it for my master's dissertation.
I am just trying to work through the usage tutorial and am stuck at:
!python3 get_metadata.py -i clean-dataset-filtered.tsv -o hydrated_tweets -k api_keys.json
I get the error:
ImportError: cannot import name 'TweepError' from 'tweepy'
Following a search I gleaned that TweepError has been replaced by TweepyException in the current version. I have tried that, but still no luck. Do I need to use a specific Tweepy version?
EDIT: I have tried to make changes to the get_metadata.py file and I get output files with headings but they're empty
Thanks!
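For context, Tweepy 4.x did rename TweepError to TweepyException, so scripts written against Tweepy 3.x break on import. A minimal sketch of a compatibility alias follows; to keep it self-contained it uses a stand-in namespace instead of the real tweepy module, so the names here are illustrative only:

```python
# Sketch of a version-compatibility alias (Tweepy 4.x renamed
# TweepError to TweepyException). The same pattern works for any
# library rename; here it is demonstrated on a stand-in object
# rather than the real tweepy module.
import types

# Stand-in for the tweepy module after the 4.x rename.
tweepy = types.SimpleNamespace(TweepyException=RuntimeError)

# Restore the old name if it is missing, so legacy except-clauses work.
if not hasattr(tweepy, "TweepError"):
    tweepy.TweepError = tweepy.TweepyException

try:
    raise tweepy.TweepError("rate limit")
except tweepy.TweepError as exc:
    caught = str(exc)

print("caught:", caught)  # -> caught: rate limit
```

With the real library, placing the alias right after `import tweepy` (and before `get_metadata.py` touches the name) should let the unmodified script run; pinning `tweepy<4` is the other common workaround.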
Hello, I successfully completed the pre-step, but I had a problem with the hydration step.
After running get_metadata.py and entering the parameters, the output files I get ("hydrated_tweets" and so on) are blank files (0 KB). What might be wrong? (I am sure my input file "clean-dataset-filtered.tsv" is valid.)
Finally, thank you for the code you provided, which is very inspiring for my research!
Dear creators,
The top 1000 most-used terms between 23 Jul and 25 Jul are missing. Is there a place to find them?
Hi!
First of all, thanks for all your great work. This dataset is awesome.
I was wondering if it's possible for you to make public the subset of tweet IDs that are geotagged (i.e., have the coordinates field). Most users have limited capacity for rehydration, and in my case, I am interested specifically in the geotagged tweets. Rehydrating the whole dataset and filtering that out myself would be infeasible, but I noticed that on your website there is an interactive map with tweets based on their locations, so I assume there is already a subset made.
Thanks in advance!
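As a sketch of the kind of filtering meant here: once tweets are hydrated, geotagged ones can be identified by a non-null coordinates field. The field names follow the Twitter v1.1 payload, and the sample lines below are made up for illustration:

```python
import json

# Sketch: collect the IDs of geotagged tweets from a hydrated JSONL
# stream. "coordinates" and "id_str" are Twitter v1.1 field names;
# the two sample lines are illustrative, not real data.
sample_lines = [
    '{"id_str": "1", "coordinates": {"type": "Point", "coordinates": [-73.9, 40.7]}}',
    '{"id_str": "2", "coordinates": null}',
]

geotagged_ids = []
for line in sample_lines:
    tweet = json.loads(line)
    if tweet.get("coordinates"):          # null or missing -> not geotagged
        geotagged_ids.append(tweet["id_str"])

print(geotagged_ids)  # -> ['1']
```

Running this over the maintainers' hydrated JSON archive would produce exactly the geotagged-ID subset requested above.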
Hello! I'm trying to unpack the daily .tar.gz files but running into the following error in Python:
import tarfile
t = tarfile.open('./covid19_twitter/dailies/2020-03-23/2020-03-23_clean.tar.gz')
And bash:
$ tar tvf ./covid19_twitter/dailies/2020-03-23/2020-03-23_clean.tar.gz
tar: This does not look like a tar archive
gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Anyhow, really like your team's work & thanks in advance!
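A common cause of "not in gzip format" is that the downloaded file isn't actually a gzip archive: for instance a Git LFS pointer file or an HTML error page saved under the .tar.gz name. A quick sketch for checking the gzip magic bytes before untarring (the in-memory samples stand in for the real dailies file):

```python
# Sketch: check whether a downloaded file is really gzip before untarring.
# A gzip stream starts with the magic bytes 0x1f 0x8b; a Git LFS pointer
# or an HTML error page (common causes of this tar error) does not.
import gzip

def looks_like_gzip(first_two_bytes: bytes) -> bool:
    return first_two_bytes == b"\x1f\x8b"

# Demonstrate on in-memory data instead of the real dailies file.
real_gz = gzip.compress(b"hello")
fake_gz = b"version https://git-lfs.github.com/spec/v1\n"

print(looks_like_gzip(real_gz[:2]))   # True
print(looks_like_gzip(fake_gz[:2]))   # False
```

On a real file, `open(path, 'rb').read(2)` gives the bytes to test; if the check fails, re-downloading the archive (e.g. via the raw-file URL rather than the repository web page) usually fixes it.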
Hello, Mr. Banda.
Thanks to you and all co-authors for the amazing database.
To retrieve the tweets from IDs and get their text, I used the scripts processing_code/getDataset_clean.py and processing_code/parse_json_extreme_cleantweets.py in a Jupyter Notebook. In the process I ran into an issue: after running the code, the output files are empty and I could not retrieve the tweets. Could you please guide me?
Best regards,
Zarnigor Dzhuraeva!
Hi,
This is awesome work and I can see how much effort it must have taken. I am really glad about that.
We will use this dataset for our Information Networks course, and I was wondering whether you have the data in a language-filtered format. If we filter it ourselves, we will have to hydrate 1.5 billion tweets only to find the 600 million that are in English. Given the limitations of our Twitter API access, that would be a huge burden.
Is there any chance you could do that, perhaps for English only? I believe you store the hydrated tweets in JSON format.
That would be a huge help, and I believe English-language data would be used by a wider community, since the tech and research world mostly works in English.
Thanks
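For what it's worth, if the full (non-clean) dataset TSV carries a language column, the English subset of IDs can be selected before hydration, avoiding most of the API cost. The column names below are assumptions; the header of the actual file should be checked first:

```python
import csv
import io

# Sketch: keep only English tweet IDs before hydrating, assuming the
# full dataset TSV carries a "lang" column. The column names and the
# sample rows are assumptions for illustration, not the real layout.
sample_tsv = (
    "tweet_id\tdate\ttime\tlang\n"
    "111\t2020-03-22\t00:00:01\ten\n"
    "222\t2020-03-22\t00:00:02\tid\n"
)

english_ids = [
    row["tweet_id"]
    for row in csv.DictReader(io.StringIO(sample_tsv), delimiter="\t")
    if row["lang"] == "en"
]

print(english_ids)  # -> ['111']
```

On the real file, replacing `io.StringIO(sample_tsv)` with `open("full_dataset_clean.tsv")` and writing `english_ids` back out produces an input file for get_metadata.py containing only English tweets.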
Thank you very much for making available the COVID-19 Twitter dataset. While browsing the repository I have noticed that the folders for the dates "2020-12-20" and "2020-12-21" do not include the "dataset.tsv.gz" file. Only the "clean-dataset.tsv.gz" file is included. Would it be possible to add the full dataset files?
Hello,
could you please share the code for clustering the data by term frequency?
Best regards,
Zarnigor.
I ran "parse_json_extreme.py" in the shell after downloading dataset.tsv, but what I got was an empty file. I searched on Google and found that when the program reaches tweet = json.loads(line), it always fails and continues. Am I forgetting something? I don't know much about programming and Python, and I'm confused. Can you help me?
When I run the program, what I type is "python .\parse_json_extreme.py .\2020-03-22-dataset.tsv"
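One possible explanation (an assumption, not confirmed against the script): parse_json_extreme.py expects hydrated tweets, one JSON object per line, whereas dataset.tsv contains raw tweet IDs in TSV form, so every json.loads call fails and is skipped, leaving an empty output. A small sketch that makes the mismatch visible, with illustrative sample lines:

```python
import json

# Sketch: count how many lines parse as JSON versus fail. Feeding a
# raw TSV of tweet IDs to a JSON parser fails on every line, which
# would silently produce an empty output file. Sample lines are
# illustrative only.
lines = [
    '{"id_str": "1", "full_text": "hello"}',        # hydrated tweet (JSON)
    "1241234567890123456\t2020-03-22\t00:00:01",    # raw TSV row (not JSON)
]

parsed, failed = 0, 0
for line in lines:
    try:
        json.loads(line)
        parsed += 1
    except json.JSONDecodeError:
        failed += 1

print(parsed, failed)  # -> 1 1
```

If that diagnosis is right, the fix is to hydrate the IDs first (e.g. with get_metadata.py) and feed the resulting JSON file to parse_json_extreme.py.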
Hi there,
Thanks for your excellent work on collecting and sharing this dataset.
As you mentioned on the website, there is a set of tools that can be used to hydrate the dataset.
Could you provide the list of these tools?
Thanks and I really appreciate your help.
In my understanding, the set of tweet IDs from clean_language_en.tsv should be a proper subset of the set of tweet IDs from full_dataset_clean.tsv. However, this is not the case: some IDs are not present in the full clean dataset. So I am wondering how and why this happens. I can provide a list of the missing tweet IDs if needed.
Thanks in advance!
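A minimal sketch of the subset check described above, assuming the tweet ID is the first column of each TSV. The file names match the issue; the tiny ID sets are illustrative:

```python
# Sketch: verify whether the English-language IDs are a subset of the
# full clean dataset. In practice each set would be loaded from the
# first column of the corresponding TSV; these values are illustrative.
en_ids = {"101", "102", "999"}      # from clean_language_en.tsv
full_ids = {"101", "102", "103"}    # from full_dataset_clean.tsv

missing = en_ids - full_ids         # IDs absent from the full clean dataset

print(sorted(missing))  # -> ['999']
```

An empty `missing` set would confirm the expected subset relationship; a non-empty one reproduces the discrepancy reported here.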
Hello, it seems that the language and place location fields are not present in the datasets before July 2020. It was my understanding that they were added to all tweets in version 20.
I was wondering whether I misunderstood and they were only added to tweets collected from version 20 onward, or whether this is a mistake?
Thank you for your amazing dataset and work :)
Good morning!!
How is it going?
I was browsing through your data.
Do you have the label of the tweets in terms of True(Truth/Good/Correct), or False(Fake/Lie/Disinformation)?
Best Wishes,
Fernando Durier