bianjiang / tweetf0rm
A Twitter crawler in Python
License: MIT License
Hi,
I am using the tweetf0rm crawler to collect data from Twitter for my thesis. I followed all the instructions and installed everything, but I get this error: "no crawler is alive... waiting to recreate all crawlers". It seems the proxies listed in proxy.json don't work for me, so I decided to ask for help: what should I do? Are there other proxies I can use, or any suggestions to solve this problem?
While using the search/tweets command, the text of retrieved tweets that are retweets is cut off in the JSON; it starts with the letters "RT". Do you know why this is happening?
For example,
'RT @elpais_cultura: Michel Ocelot: "Soy consciente de la responsabilidad moral que conlleva dirigir una película. Por eso hasta en mi estét…'
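For context, this truncation comes from Twitter itself: the REST API trims retweet text so the "RT @user: " prefix fits the length limit, while the untruncated original is carried in the tweet's retweeted_status field. A small sketch to recover the full text (plain dict access; field names are from the public tweet JSON, the helper name is mine):

```python
def full_text(tweet):
    """Return the untruncated text of a tweet dict.

    For retweets, the complete original text lives under the
    'retweeted_status' key; the top-level 'text' is truncated.
    """
    if "retweeted_status" in tweet:
        rt = tweet["retweeted_status"]
        return u"RT @{0}: {1}".format(rt["user"]["screen_name"], rt["text"])
    return tweet["text"]
```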
Hello everyone, I'm new here and I have a lot of questions.
This could be a little stupid, but I'm really stuck on it.
First, I'm trying to run the command ./bootstrap.sh -c config.json -p proxies.json
but I get this error:
ImportError: No module named futures
I don't know why. The other commands do run, but there is no output to say whether they worked or not.
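For reference, "futures" here is the Python 2 backport of the standard concurrent.futures module, installable with pip install futures; on Python 3 no extra package is needed. A quick check that the module is importable and working:

```python
# On Python 2, concurrent.futures comes from the "futures" backport
# package (pip install futures); on Python 3 it is standard library.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as ex:
    result = ex.submit(lambda: 1 + 1).result()
print(result)  # 2
```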
Next, where will the data (from Twitter) be saved? This is my config.json file:
{
"apikeys": {
"i0mf0rmer01" :{
"app_key":"xYtnU1e4bvFNxoLYVdasBsAUPjkO",
"app_secret":"hTvAgGP83R3Z23ktFnabqxiUm2vkRkRStQcpzYm9MblzwFYD0kaI",
"oauth_token":"45737364-FhIYhGKmaWcIPjJLvutonrmjlFszykdgngrmNIObH6",
"oauth_token_secret":"7PcktvTDvQU37h1jYZEYJcjYFveanBBb33SQ5IkZYhGINX"
}
},
"redis_config": {
"host": "localhost",
"port": 6379,
"db": 0
},
"verbose": "True",
"output": "./data",
"archive_output": "./data"
}
There is no 'data' folder after I run a simple command like ./client.sh -c config.json -cmd CRAWL_FRIENDS -d 1 -dt "ids" -uid 45737364.
Thanks in advance.
Hi~
I am trying to use your code to collect data from Twitter, but after following all the steps and starting ./bootstrap.sh -c config.json without proxies, the output folder is always empty and I still cannot collect any data. I don't know which part is going wrong; I hope you can help me. My platform is a MacBook and my Python version is 2.7. Here are the messages from the terminal:
➜ tweetf0rm-master ./bootstrap.sh -c config.json
INFO-[2017-03-23 16:56:20,127][bootstrap][start_server][99]: output to /Users/xiangyuanxin/Developer/collectTwitterData/data
INFO-[2017-03-23 16:56:20,128][bootstrap][start_server][100]: archived to /Users/xiangyuanxin/Developer/collectTwitterData/data/archived
INFO-[2017-03-23 16:56:21,037][scheduler][init][46]: number of crawlers: 1
INFO-[2017-03-23 16:56:21,040][twitter_api][init][26]: {'apikeys': {u'oauth_token_secret': u'jT4RL4a11zT4Zyb4icbEc1dp5rr3odrtVoeyNV1****', u'app_secret': u'oMbssGKczQxkXgs34z9Tus6wRebkfK5qBSg7AD0Bv2q5o****', u'oauth_token': u'783543426411294720-XSyiiaxAuY5723pGT3GRSujTAPIthPR', u'app_key': u'mcqIwVZq4xwi6vWw9oSuW1G9y'}, 'client_args': {'timeout': 300}}
INFO-[2017-03-23 16:56:22,474][scheduler][init][63]: number of crawlers: 1 created
INFO-[2017-03-23 16:56:22,475][bootstrap][start_server][108]: starting node_id: b54b5c1ecdff3f6f8cadefe4f28bae7c
INFO-[2017-03-23 17:02:22,524][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:08:22,546][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:14:22,589][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:20:22,634][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:26:22,702][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:26:22,704][scheduler][balancing_load][128]: crawler with max_qsize: mcqIwVZq4xwi6vWw9oSuW1G9y (0)
INFO-[2017-03-23 17:26:22,704][scheduler][balancing_load][129]: crawler with min_qsize: mcqIwVZq4xwi6vWw9oSuW1G9y (0)
INFO-[2017-03-23 17:26:22,704][scheduler][balancing_load][130]: max_qsize - min_qsize > 0.5 * min_qsize ?: False
INFO-[2017-03-23 17:32:22,863][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:38:23,006][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:44:23,161][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
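One observation on the log: qsize stays 0 the whole time, which suggests no crawl job was ever queued — bootstrap.sh only starts idle workers, and jobs are dispatched separately via client.sh (as in the CRAWL_FRIENDS example earlier on this page). A small sketch (the "./data" path is an assumption from the sample config) to check whether anything was actually written:

```python
import os

def crawled_files(output_dir="./data"):
    """List every file under the configured output directory."""
    found = []
    for dirpath, _, filenames in os.walk(output_dir):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found

print(len(crawled_files()))  # stays 0 until a job has been dispatched
```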
Greetings again. I configured all the necessary files according to my needs, but the line below appears in the console. The program does not crash and keeps trying to fetch tweets, but my search words are in Catalan and for about 2 hours it fetched nothing, so I want to know whether I did something wrong or there simply are no matching tweets.
Console Error:
(2019-06-06 12:03:42,765) [22347] ERROR: exception: file() takes at most 3 arguments (4 given)
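For what it's worth, file() is a Python 2 builtin that takes at most three arguments (name, mode, buffering), so this error usually means an encoding was passed as a fourth argument somewhere. io.open accepts an encoding keyword on both Python 2 and 3 and behaves as a drop-in replacement; a minimal sketch (the file name is hypothetical):

```python
import io

# file("out.txt", "a", -1, "utf-8") raises this TypeError on Python 2;
# io.open supports the encoding keyword on both Python 2 and 3:
with io.open("out.txt", "a", encoding="utf-8") as f:
    f.write(u"example line\n")
```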
Hi,
I am getting this error:
File "twitter_streamer.py", line 24, in <module>
import twython
ModuleNotFoundError: No module named 'twython'
But I have installed twython (pip install twython completed successfully),
and in twitter_streamer.py twython is imported correctly:
import twython
from util import full_stack, chunks, md5
class TwitterStreamer(twython.TwythonStreamer):
Can someone help me, please?
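A common cause of ModuleNotFoundError right after a successful pip install is that pip installed into a different interpreter than the one running the script (note also that ModuleNotFoundError is a Python 3 exception, so a package installed by a Python 2 pip would not be visible). A quick diagnostic sketch; installing with "python -m pip install twython", using the same "python" that runs twitter_streamer.py, avoids the mismatch:

```python
import sys

# The interpreter that must see twython is the one running the script:
print(sys.executable)          # path of the running interpreter
print(sys.version.split()[0])  # its version
```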
Hi,
First of all, thank you very much for sharing your work here. I am just trying to get started with tweetf0rm and ran into an error when trying to start it. Below I have posted the output from the terminal.
I should add that I am not very familiar with Python or the terminal.
Marks-MBP:tweetf0rm markheuer$ ./bootstrap.sh -c config.json -p proxies.json
Traceback (most recent call last):
  File "./tweetf0rm/bootstrap.py", line 170, in <module>
    config = json.load(config_f)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 381, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 3 column 3 (char 18)
I am working on OSX with Python 2.7
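"Expecting property name" at line 3 of config.json usually points to a trailing comma after the last key/value pair (or a comment), which strict JSON forbids. A minimal reproduction of the same error:

```python
import json

bad = '{\n  "a": 1,\n  }'          # trailing comma -> invalid JSON
try:
    json.loads(bad)
    err = None
except ValueError as e:            # JSONDecodeError subclasses ValueError
    err = e
print(err)
```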
Further, I am still not sure how to pass a command to tweetf0rm once it is running.
Thanks for your help!
sudo ./bootstrap.sh -c config_sample.json
INFO-[2017-03-04 22:38:57,578][bootstrap][start_server][99]: output to /home/kaz/twitsearch/Twitcrawl/tweetf0rm/data
INFO-[2017-03-04 22:38:57,578][bootstrap][start_server][100]: archived to /home/kaz/twitsearch/Twitcrawl/tweetf0rm/data/archived
INFO-[2017-03-04 22:38:57,854][scheduler][init][46]: number of crawlers: 1
INFO-[2017-03-04 22:38:57,855][twitter_api][init][26]: {'apikeys': {u'oauth_token_secret': u'ACCESS_TOKEN_SECRET', u'app_secret': u'CONSUMER_SECRET', u'oauth_token': u'ACCESS_TOKEN', u'app_key': u'CONSUMER_KEY'}, 'client_args': {'timeout': 300}}
ERROR-[2017-03-04 22:38:58,029][scheduler][init][56]: Unable to obtain OAuth 2 access token.
INFO-[2017-03-04 22:38:58,031][scheduler][init][63]: number of crawlers: 1 created
INFO-[2017-03-04 22:38:58,033][bootstrap][start_server][108]: starting node_id: 29349c7140248a452d5e090704245570
INFO-[2017-03-04 22:38:58,035][bootstrap][start_server][142]: no crawler is alive... waiting to recreate all crawlers...
^[[A^[[BINFO-[2017-03-04 22:40:58,135][bootstrap][start_server][122]: []
INFO-[2017-03-04 22:40:58,135][bootstrap][start_server][142]: no crawler is alive... waiting to recreate all crawlers...
"Unable to obtain OAuth 2 access token."
I think I gave it the wrong keys from Twitter, but I double-checked everything and made sure the keys work. I don't want to use a proxy, but I am fairly sure I have to use at least one.
Let me know what I am doing wrong here.
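One thing worth noting from the log above: the apikeys values printed are the literal placeholder strings (CONSUMER_KEY, ACCESS_TOKEN, ...), which suggests config_sample.json was run without the real credentials filled in. A small sanity-check sketch (the helper name is my own):

```python
PLACEHOLDERS = {"CONSUMER_KEY", "CONSUMER_SECRET",
                "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"}

def has_placeholders(apikeys):
    """True if any credential still holds a sample placeholder."""
    return any(v in PLACEHOLDERS for v in apikeys.values())

sample = {"app_key": "CONSUMER_KEY", "app_secret": "CONSUMER_SECRET",
          "oauth_token": "ACCESS_TOKEN", "oauth_token_secret": "ACCESS_TOKEN_SECRET"}
print(has_placeholders(sample))  # True -> edit the config first
```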
Hi,
I was wondering how other users process the data generated with tweetf0rm. Since I need the data in some kind of spreadsheet format, I was going to use OpenRefine to import the JSON file containing the tweets I crawled. The issue is that I can only import one of the nodes from each tweet, not all of them. E.g. a tweet might look like this:
{
text: (...),
retweeted_status: {...},
retweeted: (...),
(...)
}
OpenRefine is only able to extract one of these nodes at a time (e.g. 'retweeted_status: { (...) }' but not 'text: (...)').
Another method I tried is a Google Sheets script called "=importJSON()", but without success.
I further noticed that both methods work fine with a JSON file I pulled directly from the Twitter API, but I have not been able to find out exactly what the difference between the two JSON files is. If you care to have a look at this other file, you can find it here: https://www.dropbox.com/s/detv4ttvlsgbtz0/cop21.11.1301_07.txt?dl=0
Anyway, I was wondering if any other users have experience with converting the tweetf0rm data into spreadsheet format, and what your workflow is.
Best, Mark
Information on using open refine with twitter data can be found here: http://blog.ouseful.info/2012/10/02/grabbing-twitter-search-results-into-google-refine-and-exporting-conversations-into-gephi/
Information on the google spreadsheets extension can be found here: https://medium.com/@paulgambill/how-to-import-json-data-into-google-spreadsheets-in-less-than-5-minutes-a3fede1a014a
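For a code-based alternative to OpenRefine, here is a minimal sketch that flattens one-JSON-object-per-line tweet data into a CSV; the chosen field subset and the sample input are my own, not the crawler's exact output:

```python
import csv
import json

def flatten(tweet):
    """Pick a flat subset of tweet fields for a spreadsheet row."""
    return {
        "id": tweet.get("id_str"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "is_retweet": "retweeted_status" in tweet,
    }

# sample input: one JSON object per line, as in the crawled files
lines = ['{"id_str": "1", "created_at": "Mon Jan 01 00:00:00 +0000 2018", "text": "hello"}']

with open("tweets.csv", "w") as dst:
    writer = csv.DictWriter(dst, fieldnames=["id", "created_at", "text", "is_retweet"])
    writer.writeheader()
    for line in lines:
        writer.writerow(flatten(json.loads(line)))
```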
Firstly, thanks so much for sharing your work. Secondly, in the spirit of your "I would be happy to take requests to add specific functionalities" note, it would be really helpful to have some more handlers :) I see there's a stub for Mongodb (mongodb_handler.py)... and an SQL centric one (SQLAlchemy?) could be helpful too.
I'm trying to get only the tweets that a given @user posted during a certain period. Is there any way to do that, or can I only get the user's whole timeline?
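If the crawler can only return a whole timeline, one option is to filter it by date afterwards; a sketch using Twitter's created_at format (the function name is my own):

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S +0000 %Y"   # Twitter's created_at format

def in_window(tweet, start, end):
    """True if the tweet's created_at falls inside [start, end]."""
    ts = datetime.strptime(tweet["created_at"], FMT)
    return start <= ts <= end

tweet = {"created_at": "Wed Oct 10 20:19:24 +0000 2018"}
print(in_window(tweet, datetime(2018, 10, 1), datetime(2018, 10, 31)))  # True
```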
I noticed in the API documentation (https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json) that the "created_at" field of the tweet JSON is supposed to be the timestamp at which the tweet was created and published on Twitter.
On the contrary, from trial and error I realized that in the large text files every "created_at" field indicates the timestamp at which the tweet was crawled, not created.
Is this erroneous behavior, or is this how it is supposed to function? Also, whether you choose
"item_id": 1 or later, it only fetches the latest 10 days with the standard (free) API.
I've pulled this code to crawl some specific users' mentions, but I can't get further than ./bootstrap.sh.
I'm asking because it only generates empty folders.
Good afternoon, and congratulations on the work.
My question about your code: can you provide an example of the JSON-structured results saved in the output? How is a single entry depicted as JSON in the file?
Thanks
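In case it helps other readers: each entry is one Tweet object as returned by the Twitter API, serialized as JSON. A heavily trimmed sketch of what one entry might look like — the values here are invented, and real entries carry many more fields:

```python
import json

# Invented values; field names follow the public tweet JSON.
entry = {
    "id_str": "850006245121695744",
    "created_at": "Mon Apr 03 16:09:56 +0000 2017",
    "text": "example tweet text",
    "user": {"id_str": "2244994945", "screen_name": "example_user"},
}
print(json.dumps(entry, indent=1))
```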
Greetings again. I would like to ask whether it is possible for your application to fetch data based on a list of keywords combined with words from a second keyword list, so that for a tweet to be crawled it must include at least one word from the first list and one word from the second list.
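I don't know whether the crawler supports this natively, but it is easy to apply as a post-filter on already-crawled tweets; a sketch (the function name is my own):

```python
def matches_both(text, first_list, second_list):
    """True if text contains at least one word from each keyword list."""
    words = set(text.lower().split())
    return bool(words & {w.lower() for w in first_list}) and \
           bool(words & {w.lower() for w in second_list})

print(matches_both("climate summit in Paris", ["climate"], ["paris"]))  # True
```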
Hi, Prof. bianjiang,
First, thank you very much for open-sourcing this.
I tested this program on the Mac platform and it runs well.
However, when I try to run it on Windows, something seems to go wrong.
Here is the error log.
I tried many ways to fix it, but all failed.
Could you help me? T.T
Thank you! :)
ERROR-[2017-04-05 06:30:42,088][scheduler][init][56]: Can't pickle <function at 0x0000000003D7AB38>: it's not found as redis.client.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\lib\multiprocessing\forking.py", line 381, in main
    self = load(from_parent)
  File "C:\Python27\lib\pickle.py", line 1384, in load
    return Unpickler(file).load()
  File "C:\Python27\lib\pickle.py", line 864, in load
    dispatch[key](self)
  File "C:\Python27\lib\pickle.py", line 886, in load_eof
    raise EOFError
EOFError
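For what it's worth, this looks like the usual Windows multiprocessing issue: Windows spawns a fresh interpreter for each child process and pickles whatever is handed to it, and lambdas or nested functions (like the anonymous function in the error) cannot be pickled. A minimal sketch of the portable pattern, not a fix for tweetf0rm itself:

```python
import multiprocessing

def work(x):                 # module-level functions pickle fine
    return x * 2

if __name__ == "__main__":   # guard required on Windows (spawn)
    pool = multiprocessing.Pool(1)
    print(pool.map(work, [21])[0])  # 42
    pool.close()
    pool.join()
```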
Hi there,
Are the scripts in /scripts/v1 supposed to be functional? twitter_crawler.py imports UserFarm, which doesn't seem to exist in the project...
Thanks for the awesome work :)
Is it possible to use search/tweets with only a geocode and no keyword terms at all (e.g. an empty terms JSON array)? I want to fetch all the tweets from a very specific small area, and from what I read in the API documentation this remains ambiguous.
Is there an event that lets us know when a job finishes, so that we can execute another job?
Good afternoon,
I noticed that when you form a query string with multiple keywords joined by OR, the crawler fetches the same tweet more than once.
For instance, if two distinct keywords in the query string are present in the same tweet, the tweet is crawled twice; I verified this by monitoring the tweet IDs in a database.
Is there an easy way to eliminate this in the crawler, or should I apply other tactics, e.g. file storage, a Python dictionary in RAM, an existence query against the DB to discard duplicates, etc.?
Thanks for your time.
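If it isn't handled inside the crawler, a simple in-process approach is to keep a set of seen tweet IDs and drop repeats before storage (a sketch; names are my own):

```python
seen_ids = set()

def dedup(tweets):
    """Drop tweets whose id_str has been seen before."""
    fresh = []
    for t in tweets:
        if t["id_str"] not in seen_ids:
            seen_ids.add(t["id_str"])
            fresh.append(t)
    return fresh

batch = [{"id_str": "1"}, {"id_str": "2"}, {"id_str": "1"}]
print(len(dedup(batch)))  # 2
```

Note the set lives in RAM; for long crawls a persistent store (e.g. a unique index on the tweet ID in the database) is the more durable option.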
Hi,
I have run into a new problem.
Could you tell me how to find the cause and resolve it?
I started the server and sent a command to get followers, as follows:
"./client.sh -c config.json -cmd CRAWL_FOLLOWERS -d 1 -dt "ids" -uid 147163321".
Then the server console displayed the following message:
WARNING-[2017-08-17 21:51:50,514][twitter_api][rate_limit_error_occured][61]: [{u'application': u'xxxx'}] rate limit reached, sleep for 803.
But when I checked the rate limit from the Twitter API (application/rate_limit_status),
it showed that the rate limit had not been reached.
(Actually, the result shows that none of the APIs had been invoked except 'rate_limit_status'.)
Here is the result from the API:
....
'application':{
'/application/rate_limit_status':{
'limit':180,
'remaining':179,
'reset':1502978445
}
},
....
'followers':{
'/followers/ids':{
'limit':15,
'remaining':15,
'reset':1502978445
},
'/followers/list':{
'limit':15,
'remaining':15, <----- this API was never invoked
'reset':1502978445
}
},
Thank you very much.
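On the 803-second sleep: I can't say for certain how tweetf0rm computes it, but crawlers usually derive that figure from the x-rate-limit-reset response header (an epoch timestamp) rather than from rate_limit_status, and a rate-limit error on one endpoint does not show up under another endpoint's counters. A sketch of the arithmetic, with the reset value from the output above:

```python
import time

def sleep_seconds(reset_epoch, now=None):
    """Seconds to wait until the rate-limit window resets."""
    now = time.time() if now is None else now
    return max(0, int(reset_epoch - now))

print(sleep_seconds(1502978445, now=1502977642))  # 803
```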