lschmelzeisen / nasty
NASTY Advanced Search Tweet Yielder
License: Apache License 2.0
Hi
Does filter_="PHOTOS" mean that it retrieves only tweets containing images? If so, I guess it doesn't work.
When using
tweet_stream = nasty.Search("trump", filter_="PHOTOS", lang="en").request()
the filter is not applied.
Thank you.
Problem importing Counter. How can I fix it?
$ nasty search --query "climate"
Traceback (most recent call last):
File "c:\users\meftahzd\appdata\local\programs\python\python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\meftahzd\appdata\local\programs\python\python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\meftahzd\AppData\Local\Programs\Python\Python36\Scripts\nasty.exe\__main__.py", line 5, in <module>
File "c:\users\meftahzd\appdata\local\programs\python\python36\lib\site-packages\nasty\__init__.py", line 19, in <module>
from .batch.batch import Batch
File "c:\users\meftahzd\appdata\local\programs\python\python36\lib\site-packages\nasty\batch\batch.py", line 37, in <module>
from .batch_results import BatchResults
File "c:\users\meftahzd\appdata\local\programs\python\python36\lib\site-packages\nasty\batch\batch_results.py", line 20, in <module>
from typing import (
ImportError: cannot import name 'Counter'
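This ImportError typically means the interpreter is exactly Python 3.6.0: typing.Counter only exists from Python 3.6.1 (and 3.5.3) onwards, so upgrading to a newer patch release should fix it. A quick sketch of the version check and a fallback:

```python
import sys

# typing.Counter was only added in Python 3.6.1 (and 3.5.3),
# so a plain 3.6.0 install triggers exactly this ImportError.
print(sys.version_info)

if sys.version_info >= (3, 6, 1):
    from typing import Counter
else:
    # Fallback: the runtime class behind typing.Counter.
    from collections import Counter

word_counts = Counter(["climate", "climate", "change"])
print(word_counts["climate"])
```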
For a sentiment analysis in the context of my academic master thesis, I use the really useful tool 'nasty' to crawl several company tweets within a certain period of time (with the search command) and the users' replies to them (with the reply command).
The search command returned several tweets for each company, i.e., many Tweet-IDs, for which I now have to retrieve the respective replies. Is there a way to crawl the replies to multiple Tweet-IDs / a predefined list of Tweet-IDs at once with the 'nasty reply' command? I guess a loop might solve my problem; however, since I am a marketer and not a computer scientist, I hope for a more convenient way to get the replies for more than one Tweet-ID.
Thanks in advance for any helpful suggestions.
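A loop is indeed the straightforward way; there is no built-in multi-ID reply command. A small helper sketch (replies_for_ids is illustrative; only the nasty.Replies call mentioned in the comment is the library's actual API, as used elsewhere in this thread):

```python
def replies_for_ids(tweet_ids, fetch_replies):
    """Collect replies for many tweet IDs into one list of dicts.

    fetch_replies maps a single tweet ID to an iterable of tweet objects.
    """
    collected = []
    for tweet_id in tweet_ids:
        for tweet in fetch_replies(tweet_id):
            collected.append({"tweet_id": tweet_id, "text": tweet.text})
    return collected

# Wired up with NASTY's Python API, this would look roughly like:
#   import nasty
#   replies = replies_for_ids(my_ids, lambda i: nasty.Replies(i, max_tweets=100).request())
```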
I'm trying to get tweets from the hashtags '#COVID2019' and '#CoronavirusFrance', both return the following RuntimeError:
"Unknown entry type in entry-ID '{}'.".format(entry["entryId"])
RuntimeError: Unknown entry type in entry-ID 'novel_coronavirus_message'.
I'm using a simple python request for these tweets
nasty.Search(hashtag, lang="en").request()
but using the cmd version returns the same error
nasty search --query "#COVID2019" --lang "en"
I assume it's the automated twitter warning that shows up when you search for anything corona related.
Is there a way to skip it?
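There is no CLI switch to skip such entries; the fix would be for the retriever to ignore unknown entry types instead of raising. A rough sketch of that filtering logic (the "sq-I-t-" entry-ID prefix for tweets is an assumption about Twitter's internal search timeline, not documented behavior):

```python
def tweet_entries(entries):
    """Yield only entries whose entryId marks a tweet, skipping injected
    modules such as Twitter's COVID-19 warning instead of raising on them."""
    for entry in entries:
        # Assumption: tweet entries in the search timeline use the
        # "sq-I-t-" prefix; anything else is a non-tweet module.
        if entry["entryId"].startswith("sq-I-t-"):
            yield entry
```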
There has been a problem in the replies module of the nasty library: I cannot get all the replies to a certain tweet. Can you modify the library to include all the replies?
import nasty
import json

all_tweets = []
counter = 0
username = "Imrankhanpti"
tweet_stream = nasty.Replies("1229250933525270528", max_tweets=10000, batch_size=9999).request()
try:
    for tweet in tweet_stream:
        print(tweet.id, tweet.text)
        all_tweets.append({"user": tweet.user.name, "text": tweet.text})
        counter = counter + 1
        print(counter)
except Exception as e:
    # A bare "except: pass" would silently hide why the stream stopped.
    print("Stopped early:", e)

filename = username + "_twitter.json"
print(all_tweets)
print("\nDumping data in file " + filename)
with open(filename, "w", encoding="utf-8") as fh:
    fh.write(json.dumps(all_tweets, ensure_ascii=False))
This works as expected, great. I just wonder: what is the easiest way to get a PDF from the JSON file that includes all attached pictures?
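NASTY itself has no export feature. One stdlib-only route is to render the JSON dump as an HTML page and print that to PDF from a browser. A sketch, assuming the dump has "user" and "text" fields as in the script above (embedding pictures would additionally require saving the media URLs, which that script does not do):

```python
import html
import json

def tweets_json_to_html(json_path, html_path):
    """Render a dumped tweet list as a printable HTML page.

    Open the result in a browser and use "Print to PDF" to get a PDF.
    """
    with open(json_path, encoding="utf-8") as fh:
        tweets = json.load(fh)
    parts = ["<!DOCTYPE html><html><meta charset='utf-8'><body>"]
    for t in tweets:
        parts.append(
            "<p><b>{}</b>: {}</p>".format(html.escape(t["user"]), html.escape(t["text"]))
        )
        # If media URLs were saved as well, images could be embedded here:
        # parts.append("<img src='{}'>".format(html.escape(url)))
    parts.append("</body></html>")
    with open(html_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(parts))
```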
Hello,
I tried to install nasty via my command line. I have a Windows laptop and use Python 3.8.
pip was installed along with Python, and then I entered the following command in the command line:
pip install nasty
and I got this Warning messages:
WARNING: The script tqdm.exe is installed in 'C:xxx' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script chardetect.exe is installed in 'C:xxx' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script nasty.exe is installed in 'C:xxx' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location
What do these warnings mean, and how can I solve the problem and add the directory to PATH?
Thanks in advance!
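The warning only means that pip placed the nasty.exe launcher in a directory your shell does not search. Adding that directory to the PATH environment variable (on Windows: System Properties → Environment Variables) fixes it. The directory can be found like this (a generic sketch, not NASTY-specific):

```python
import sysconfig

# Directory where pip installs console-script launchers such as nasty.exe.
scripts_dir = sysconfig.get_path("scripts")
print(scripts_dir)
```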
I am currently in the process of using NASTY to retrieve all Tweets about the ongoing coronavirus. As presumably, many others are also doing so, in my view it is best to concentrate crawling efforts in one location and then share the results publicly.
Therefore, I here document my current methodology and am open to suggestions/criticism:
I am crawling all Tweets matching one of the queries corona, coronavirus, covid, covid19, ncov, sars, or wuhan that were authored after 1 Dec 2019 in either English or German.
Crawling is performed per day (cf. the nasty search --daily feature), i.e., a single search request would use the query corona with a time range from 1 Dec 2019 to 2 Dec 2019, using both the TOP and LATEST --filters; the next request covers the following day, and so on. Based on initial experiments this seems to yield more results and can easily be expanded later, but more investigation of Twitter's search algorithm would be useful here.
So far, I have crawled about 68.5 million English and 2.2 million German Tweets in the time span from 1 Dec 2019 to 5 Apr 2020 (about 34 GB of compressed JSON with metadata). I plan to continuously expand this collection over the upcoming months. I am not quite sure when I will be ready to share this and how I will do so (probably using NASTY's idify feature).
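For reference, the per-day request schedule described above can be generated with a short script. The CLI flags are taken from the examples in this thread; the loop structure is illustrative:

```python
from datetime import date, timedelta

queries = ["corona", "coronavirus", "covid", "covid19", "ncov", "sars", "wuhan"]
start, end = date(2019, 12, 1), date(2020, 4, 5)

# One request per day, query, and filter, as in the methodology above.
commands = []
day = start
while day < end:
    for query in queries:
        for filter_ in ("TOP", "LATEST"):
            commands.append(
                f'nasty search --query "{query}" --since {day} '
                f"--until {day + timedelta(days=1)} --filter {filter_} --lang en"
            )
    day += timedelta(days=1)

print(len(commands), "requests")
```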
If you are interested in this dataset, please leave a comment here. Preferably also leave a very short summary of what you plan to do with it and what you think of the outlined methodology.
Hello,
I'm trying to run the program as specified in the README but I'm getting the following error:
Issued command:
nasty search --query "climate change"
Results:
Traceback (most recent call last):
File "/Users/fcks/anaconda3/envs/vui/bin/nasty", line 8, in <module>
sys.exit(main())
File "/Users/fcks/anaconda3/envs/vui/lib/python3.6/site-packages/nasty/cli/main.py", line 52, in main
command.run()
File "/Users/fcks/anaconda3/envs/vui/lib/python3.6/site-packages/nasty/cli/_request_command.py", line 104, in run
for tweet in request.request():
File "/Users/fcks/anaconda3/envs/vui/lib/python3.6/site-packages/nasty/request/search.py", line 146, in request
return SearchRetriever(self).tweet_stream
File "/Users/fcks/anaconda3/envs/vui/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 175, in __init__
self._fetch_new_twitter_session()
File "/Users/fcks/anaconda3/envs/vui/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 321, in _fetch_new_twitter_session
)[0]
IndexError: list index out of range
Do you have any ideas about why is this happening?
Hi,
while running a nasty batch request via command line, one of my search terms failed with the following error message:
"exception": {
"time": "2020-05-01T20:58:12.156301",
"type": "IndexError",
"message": "IndexError: list index out of range",
"trace": [
[
" File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/batch/batch.py\", line 115, in _execute_entry",
" write_jsonl_lines(data_file, entry.request.request(), use_lzma=True)"
],
[
" File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/_util/json_.py\", line 137, in write_jsonl_lines",
" use_lzma=use_lzma,"
],
[
" File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/_util/io_.py\", line 85, in write_lines_file",
" for value in values:"
],
[
" File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/_util/json_.py\", line 135, in <genexpr>",
" (json.dumps(value.to_json()) for value in values),"
],
[
" File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/_retriever/retriever.py\", line 66, in __next__",
" if not self._update_callback():"
],
[
" File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/_retriever/retriever.py\", line 214, in _update_tweet_stream",
" self._fetch_new_twitter_session()"
],
[
" File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/_retriever/retriever.py\", line 320, in _fetch_new_twitter_session",
" )[0]"
]
]
Here are the request parameters:
"request": {
"type": "Search",
"query": "coronavirus",
"since": "2019-12-01",
"until": "2020-03-16",
"filter": "LATEST",
"lang": "en",
"max_tweets": null
}
Some other search requests from the same batch-processing file finished successfully.
To specify a maximum number of tweets in the CLI, you need to use the --max-tweets argument. In Python, the same parameter is called max_tweets (underscore instead of hyphen).
Hi @lschmelzeisen,
Thank you for this work.
I try to read the following tweet with the command
nasty thread --tweet-id 1230464833696432129
NASTY gives the following response:
2020-02-20 16:14:38,186 I [ nasty._retriever.retriever ] Received 3 consecutive empty batches.
I also tried the example in the README, but that one gave a JSON response.
What could be the problem with this Tweet?
Thanks in advance!
I'm using the Python API to 'search until', e.g.
tweet_stream = nasty.Search("trump",
                            until=datetime(2018, 5, 2),
                            lang="en").request()
for tweet in tweet_stream:
    print(tweet.created_at, tweet.text)
With some datetimes it seems to go into a loop where, instead of returning a batch of 20 new tweets, it repeatedly returns the same tweet.
I have also reproduced the same basic problem with the following search/datetime combinations:
"climate", until=datetime(2016, 5, 2)
"climate", until=datetime(2019, 5, 2)
In the latter case, it's the final 3 tweets which get repeated in the output.
I attach 3 txt files:
test_output_until_2018-5-2_trump.txt | search = trump, datetime = 18-5-2, results = OK
test_output_until_2017-5-2_trump.txt | search = trump. datetime = 17-5-2, results = repeating tweet
test_output_until_2019-5-2_climate.txt | search = climate, datetime = 19-5-2, results = repeating final 3 tweets
test_output_until_2018-5-2_trump.txt
test_output_until_2017-5-2_trump.txt
test_output_until_2019-5-2_climate.txt
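Until the underlying paging bug is fixed, a generic client-side workaround is to drop repeated tweets by ID and simply stop consuming the stream once only repeats arrive. A minimal sketch (assuming tweet objects expose an id attribute, as NASTY's do):

```python
def deduplicated(tweet_stream):
    """Yield tweets from a stream, dropping any whose ID was already seen."""
    seen = set()
    for tweet in tweet_stream:
        if tweet.id in seen:
            continue
        seen.add(tweet.id)
        yield tweet
```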
When inspecting the tweets retrieved with the nasty search command, I encountered the following problem:
If a tweet contains only emojis but no text, it cannot be crawled.
Example tweets:
https://twitter.com/McDonalds/status/1258532072634724354
https://twitter.com/McDonalds/status/1258894470055149572
Exemplary code for the command line:
nasty search --query "(from:@McDonalds)" --max-tweets -1 --since 2020-05-08 --until 2020-05-10 --filter LATEST > mcs.json
It is possible that my parser does not work as intended, although I have no problems retrieving tweets which include text plus emoticons/emojis. However, it might also be a bug in the nasty tool?
Hey everybody and thank you for this great tool.
However, I'm facing a strange issue (tested on CentOS 7 as well as MacOS):
nasty search --query "(from:realdonaldtrump)" --since 2020-10-06 --until 2020-10-07 --filter LATEST --max-tweets -1
leads to this error message:
Traceback (most recent call last):
File "/usr/local/bin/nasty", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/nasty/main.py", line 25, in main
NastyProgram.init(*args).run()
File "/usr/local/lib/python3.6/site-packages/nasty/_cli.py", line 113, in run
for tweet in request.request():
File "/usr/local/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 66, in __next__
if not self._update_callback():
File "/usr/local/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 209, in _update_tweet_stream
batch = self._fetch_batch()
File "/usr/local/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 352, in _fetch_batch
self._session_get(**self._batch_url()).json()
File "/usr/local/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 76, in __init__
self.tweets: Final = self._tweets()
File "/usr/local/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 90, in _tweets
for tweet_id in self._tweet_ids():
File "/usr/local/lib/python3.6/site-packages/nasty/_retriever/search_retriever.py", line 122, in _tweet_ids
tweet = entry["content"]["item"]["content"]["tweet"]
KeyError: 'tweet'
What makes me wonder is that other dates and accounts work fine.
Is there any solution to this problem?
Thank you.
I am trying to crawl a discussion on twitter. As an example let's pick 1299724507377274883
By running both thread and reply I get:
$ nasty r --tweet-id 1299724507377274883
Received 3 consecutive empty batches.
$ nasty t --tweet-id 1299724507377274883
Received 3 consecutive empty batches.
My intent is to download both the original post:
Restaurant and hotel workers are receiving eye-expression training as they try to deliver service with a smile while the smile is out of service
and its reply:
I'm kind of glad I don't have to smile all the time.
Is there something I am doing wrong?
PS: Thanks a lot for this wonderful package. Although it needs maintenance, it is really well designed.