nasty's People

Contributors

lschmelzeisen, mwellstein, raphaelmenges, sjuenger

nasty's Issues

Problem regarding the PHOTOS filter.

Hi,
Does filter_="PHOTOS" mean that it retrieves only tweets that contain images? If so, I don't think it works.

When using,

tweet_stream = nasty.Search("trump", filter_="PHOTOS", lang="en").request()

it doesn't apply the filter.
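
For comparison, here is the same call with the filter passed as the library's search-filter enum instead of a plain string. This is only a sketch: nasty.SearchFilter and its PHOTOS member are assumed names here, mirroring the CLI's --filter values:

import nasty

# Sketch: pass the photo filter as an enum member rather than the string "PHOTOS".
# nasty.SearchFilter and SearchFilter.PHOTOS are assumed names mirroring the
# CLI's --filter choices.
tweet_stream = nasty.Search(
    "trump", filter_=nasty.SearchFilter.PHOTOS, lang="en"
).request()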

Thank you.

ImportError: cannot import name 'Counter' (from typing)

I have a problem importing Counter. How can I fix it?

$ nasty search --query "climate"

Traceback (most recent call last):
  File "c:\users\meftahzd\appdata\local\programs\python\python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\meftahzd\appdata\local\programs\python\python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\meftahzd\AppData\Local\Programs\Python\Python36\Scripts\nasty.exe\__main__.py", line 5, in <module>
  File "c:\users\meftahzd\appdata\local\programs\python\python36\lib\site-packages\nasty\__init__.py", line 19, in <module>
    from .batch.batch import Batch
  File "c:\users\meftahzd\appdata\local\programs\python\python36\lib\site-packages\nasty\batch\batch.py", line 37, in <module>
    from .batch_results import BatchResults
  File "c:\users\meftahzd\appdata\local\programs\python\python36\lib\site-packages\nasty\batch\batch_results.py", line 20, in <module>
    from typing import (

ImportError: cannot import name 'Counter'
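
For context, typing.Counter only exists since Python 3.5.3 / 3.6.1, so a plain Python 3.6.0 installation raises exactly this ImportError. A quick check of the running interpreter (standard library only, nothing nasty-specific):

import sys

# typing.Counter was added in Python 3.6.1 (and 3.5.3); a 3.6.0 build lacks it.
print(sys.version)          # shows the exact patch level, e.g. 3.6.0 vs. 3.6.1+
from typing import Counter  # succeeds on 3.6.1 and newer, fails otherwise

Upgrading to a newer Python patch release (or to 3.7+) should make the import, and with it nasty, work.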

Retrieve replies for several tweet-IDs

For a sentiment analysis in my academic master's thesis, I use the really useful tool 'nasty' to crawl several companies' tweets within a certain period of time (with the search command) and the users' replies to them (with the replies command).

The search command returned several tweets for each company, i.e. many Tweet-IDs, for which I now have to retrieve the respective replies. Is there a way to crawl the replies to multiple Tweet-IDs / a predefined list of Tweet-IDs at once with the 'nasty replies' command? I guess a loop might solve my problem, but since I am a marketer and not a computer scientist, I hope for a more convenient way to get the replies for more than one Tweet-ID.
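
A minimal sketch of the loop approach, using only the Replies request from the Python API (the Tweet-IDs below are placeholders taken from other issues; substitute your own list):

import json

import nasty

# Collect all replies for each Tweet-ID in a predefined list.
tweet_ids = ["1229250933525270528", "1230464833696432129"]  # placeholder IDs

replies_by_id = {}
for tweet_id in tweet_ids:
    replies_by_id[tweet_id] = [
        {"id": tweet.id, "user": tweet.user.name, "text": tweet.text}
        for tweet in nasty.Replies(tweet_id).request()
    ]

# Write everything to a single JSON file, grouped by the originating Tweet-ID.
with open("replies.json", "w", encoding="utf-8") as fp:
    json.dump(replies_by_id, fp, ensure_ascii=False)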

Thanks in advance for any helpful suggestions.

Unknown entry type in entry-ID

I'm trying to get tweets for the hashtags '#COVID2019' and '#CoronavirusFrance'; both return the following RuntimeError:
"Unknown entry type in entry-ID '{}'.".format(entry["entryId"])
RuntimeError: Unknown entry type in entry-ID 'novel_coronavirus_message'.

I'm using a simple Python request for these tweets:
nasty.Search(hashtag, lang="en").request()
but the command-line version returns the same error:
nasty search --query "#COVID2019" --lang "en"

I assume it's the automated Twitter warning that shows up when you search for anything corona-related.

Is there a way to skip it?

Problem retrieving all replies of a specific Tweet

There seems to be a problem in the replies module of the nasty library: I cannot get all the replies to a certain tweet. Can you modify the library to include all the replies?

import nasty
import json

all_tweets = []
counter = 0
username = "Imrankhanpti"

# Request up to 10,000 replies to the given tweet, fetched in batches of 9,999.
tweet_stream = nasty.Replies(
    "1229250933525270528", max_tweets=10000, batch_size=9999
).request()
try:
    for tweet in tweet_stream:
        print(tweet.id, tweet.text)
        all_tweets.append({"user": tweet.user.name, "text": tweet.text})
        counter = counter + 1
        print(counter)
except Exception as e:
    # Report why the stream stopped instead of silently swallowing the error.
    print("Stopped early:", e)

filename = username + "_twitter.json"
print(all_tweets)
print("\nDumping data in file " + filename)
with open(filename, "w", encoding="utf-8") as fh:
    fh.write(json.dumps(all_tweets, ensure_ascii=False))

PDF output

Works as expected, great. I just wonder: what is the easiest way to get a PDF from the JSON file that includes all attached pictures?

Problem with `pip install nasty` under Windows: script is not on PATH

Hello,

I tried to install nasty via my command line. I have a Windows Laptop and I use Python 3.8.

I first installed pip by installing Python. And then I entered the following command in the command line:

pip install nasty

and I got these warning messages:

WARNING: The script tqdm.exe is installed in 'C:xxx' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script chardetect.exe is installed in 'C:xxx' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script nasty.exe is installed in 'C:xxx' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location

How can I solve this problem and add it to the path? What does it mean?
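
The warning means that pip installed nasty.exe (and the other scripts) into your Python installation's Scripts folder, but that folder is not listed in the PATH environment variable, so the command line cannot find the nasty command. A small sketch to print the folder in question (standard library only, assuming a regular pip install without --user):

import sysconfig

# Prints the Scripts directory of this Python installation, i.e. the
# 'C:xxx' folder mentioned in the warnings above.
print(sysconfig.get_path("scripts"))

Add the printed folder to the PATH environment variable (System Properties → Environment Variables) and reopen the command prompt; after that, the nasty command should be found.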

Thanks in advance!

Share dataset of Tweets about the novel coronavirus.

I am currently in the process of using NASTY to retrieve all Tweets about the ongoing coronavirus. As presumably many others are doing the same, in my view it is best to concentrate crawling efforts in one place and then share the results publicly.

Therefore, I here document my current methodology and am open to suggestions/criticism:

  • The main goal is finding as many on-topic Tweets (i.e., talking about the novel coronavirus) as possible while including as few off-topic Tweets as is achievable. That means striking a sensible balance between precision and recall.
  • To this end, search requests are used to find Tweets containing at least one of the keywords corona, coronavirus, covid, covid19, ncov, sars, wuhan that were authored after 1 Dec 2019 in either English or German.
    • For this I am issuing search requests per day (using the nasty search --daily feature), i.e., a single search request would use the query corona with a time range from 1 Dec 2019 to 2 Dec 2019, using both the TOP and LATEST --filters; the next request then covers the following day, and so on (see the sketch after this list). Based on initial experiments this seems to yield more results and can easily be expanded later, but more investigation of Twitter's search algorithm would be useful here.
    • This will result in some off-topic matches (for example, corona beer), but these should be negligible, as the assumption is that there have been many more on-topic Tweets in recent times (starting mid-January).
    • The December 2019 time span is included as a short period before the outbreak of the coronavirus to have a baseline of Tweet frequencies that are off-topic.
    • I assume that most people reading this are only interested in the English Tweets, but since I am retrieving German Tweets for a personal research project, I will include these anyways. Tweets will be separated by language, so non-English ones can easily be filtered out.
  • Additional ways to retrieve on-topic Tweets would be to either manually identify a number of Twitter users who mostly tweet about corona (we can't just follow anyone who has tweeted about corona at some point in time, as that presumably leads to a huge precision loss) or to retrieve replies to a known on-topic Tweet (e.g., any Tweet matching the above search criteria). However, both were deemed too expensive for now.
  • One thing I may do in the future is to look up influential hashtags for each week (e.g., #masks4all) and add search requests for these.
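
To make the per-day setup concrete, a single such request could look roughly like the following on the command line (written out manually here rather than via --daily; one invocation per keyword, day, and filter, with an arbitrary output file name):

$ nasty search --query "corona" --since 2019-12-01 --until 2019-12-02 --filter LATEST --lang "en" --max-tweets -1 > corona_2019-12-01_latest.jsonl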

So far, I have crawled about 68.5 million English and 2.2 million German Tweets in the time span from 1 Dec 2019 to 5 Apr 2020 (about 34 GB of compressed JSON with metadata). I plan to continuously expand this collection over the upcoming months. I am not quite sure when I will be ready to share this and how I will do so (probably using NASTY's idify feature).

If you are interested in this dataset, please leave a comment here. Preferably also leave a very short summary of what you plan to do with it and what you think of the outlined methodology.

Unable to run the program

Hello,
I'm trying to run the program as specified in the README but I'm getting the following error:

Issued command:
nasty search --query "climate change"

Results:

Traceback (most recent call last):
  File "/Users/fcks/anaconda3/envs/vui/bin/nasty", line 8, in <module>
    sys.exit(main())
  File "/Users/fcks/anaconda3/envs/vui/lib/python3.6/site-packages/nasty/cli/main.py", line 52, in main
    command.run()
  File "/Users/fcks/anaconda3/envs/vui/lib/python3.6/site-packages/nasty/cli/_request_command.py", line 104, in run
    for tweet in request.request():
  File "/Users/fcks/anaconda3/envs/vui/lib/python3.6/site-packages/nasty/request/search.py", line 146, in request
    return SearchRetriever(self).tweet_stream
  File "/Users/fcks/anaconda3/envs/vui/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 175, in __init__
    self._fetch_new_twitter_session()
  File "/Users/fcks/anaconda3/envs/vui/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 321, in _fetch_new_twitter_session
    )[0]
IndexError: list index out of range

Do you have any ideas about why this is happening?

IndexError: list index out of range

Hi,

While running a nasty batch request via the command line, one of my search terms failed with the following error message:

"exception": {
    "time": "2020-05-01T20:58:12.156301",
    "type": "IndexError",
    "message": "IndexError: list index out of range",
    "trace": [
      [ 
        "  File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/batch/batch.py\", line 115, in _execute_entry",
        "    write_jsonl_lines(data_file, entry.request.request(), use_lzma=True)"
      ],
      [ 
        "  File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/_util/json_.py\", line 137, in write_jsonl_lines",
        "    use_lzma=use_lzma,"
      ],
      [ 
        "  File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/_util/io_.py\", line 85, in write_lines_file",
        "    for value in values:"
      ],
      [ 
        "  File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/_util/json_.py\", line 135, in <genexpr>",
        "    (json.dumps(value.to_json()) for value in values),"
      ],
      [ 
        "  File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/_retriever/retriever.py\", line 66, in __next__",
        "    if not self._update_callback():"
      ],
      [ 
        "  File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/_retriever/retriever.py\", line 214, in _update_tweet_stream",
        "    self._fetch_new_twitter_session()"
      ],
      [ 
        "  File \"/home/beck/Data/corona-crawl/venv/lib/python3.6/site-packages/nasty/_retriever/retriever.py\", line 320, in _fetch_new_twitter_session",
        "    )[0]"
      ]
    ]
  }

Here are the request parameters:

"request": {
    "type": "Search",
    "query": "coronavirus",
    "since": "2019-12-01",
    "until": "2020-03-16",
    "filter": "LATEST",
    "lang": "en",
    "max_tweets": null
  }

Some other search requests from the same batch-processing file finished successfully.

No response using thread/replies-commands for Tweet-ID

Hi @lschmelzeisen,

Thank you for this work.

I tried to read the following tweet with the command

nasty thread --tweet-id 1230464833696432129

NASTY gives the following response:

2020-02-20 16:14:38,186 I [ nasty._retriever.retriever      ] Received 3 consecutive empty batches.

I also tried the example in the README, and that one did return a JSON response.

What could be the problem with this Tweet?

Thanks in advance!

Python API 'search until' goes into a loop in some cases and repeatedly returns the same tweet

I'm using the Python API to 'search until', e.g.

from datetime import datetime

import nasty

tweet_stream = nasty.Search(
    "trump", until=datetime(2018, 5, 2), lang="en"
).request()
for tweet in tweet_stream:
    print(tweet.created_at, tweet.text)

With some datetimes, it seems to go into a loop where, instead of returning a batch of 20 new tweets, it repeatedly returns the same tweet.

I have also reproduced the same basic problem with the following search/datetime combinations:
"climate", until=datetime(2016,5,2),
"climate", until=datetime(2019,5,2),

In the latter case, it is the final 3 tweets that get repeated in the output.

I attach 3 txt files:
test_output_until_2018-5-2_trump.txt | search = trump, datetime = 18-5-2, results = OK
test_output_until_2017-5-2_trump.txt | search = trump, datetime = 17-5-2, results = repeating tweet
test_output_until_2019-5-2_climate.txt | search = climate, datetime = 19-5-2, results = repeating final 3 tweets


Problem retrieving tweets which contain emojis only

When inspecting the tweets retrieved with the nasty search command, I encountered the following problem:

If a tweet contains only emojis but no text, it cannot be crawled.

Example tweets:
https://twitter.com/McDonalds/status/1258532072634724354
https://twitter.com/McDonalds/status/1258894470055149572

Example command-line invocation:
nasty search --query "(from:@McDonalds)" --max-tweets -1 --since 2020-05-08 --until 2020-05-10 --filter LATEST > mcs.json

It is possible that my parser does not work as intended, although I have no problems retrieving tweets that include text plus emoticons/emojis. However, it might also be a bug in the nasty tool?

tweet = entry["content"]["item"]["content"]["tweet"]

Hey everybody and thank you for this great tool.

However, I'm facing a strange issue (tested on CentOS 7 as well as MacOS):

nasty search --query "(from:realdonaldtrump)" --since 2020-10-06 --until 2020-10-07 --filter LATEST --max-tweets -1

leads to this error message:

Traceback (most recent call last):
  File "/usr/local/bin/nasty", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/site-packages/nasty/__main__.py", line 25, in main
    NastyProgram.__init__(*args).run()
  File "/usr/local/lib/python3.6/site-packages/nasty/_cli.py", line 113, in run
    for tweet in request.request():
  File "/usr/local/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 66, in __next__
    if not self._update_callback():
  File "/usr/local/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 209, in _update_tweet_stream
    batch = self._fetch_batch()
  File "/usr/local/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 352, in _fetch_batch
    self._session_get(**self._batch_url()).json()
  File "/usr/local/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 76, in __init__
    self.tweets: Final = self._tweets()
  File "/usr/local/lib/python3.6/site-packages/nasty/_retriever/retriever.py", line 90, in _tweets
    for tweet_id in self._tweet_ids():
  File "/usr/local/lib/python3.6/site-packages/nasty/_retriever/search_retriever.py", line 122, in _tweet_ids
    tweet = entry["content"]["item"]["content"]["tweet"]
KeyError: 'tweet'

What makes me wonder is that other dates and accounts work fine.

Is there any solution to this problem?

Thank you.

Extend ThreadRetriever to also retrieve parent posts in a thread

I am trying to crawl a discussion on Twitter. As an example, let's pick 1299724507377274883.

By running both thread and reply I get:

$ nasty r --tweet-id 1299724507377274883
Received 3 consecutive empty batches.
$ nasty t --tweet-id 1299724507377274883
Received 3 consecutive empty batches.

My intent is to download both the original post:
Restaurant and hotel workers are receiving eye-expression training as they try to deliver service with a smile while the smile is out of service

and its reply:
I'm kind of glad I don't have to smile all the time.

Is there something I am doing wrong?

PS: Thanks a lot for this wonderful package. Although it needs maintenance, it is really well designed 👍
