bianjiang / tweetf0rm
A Twitter crawler in Python
License: MIT License
Hi,
I am using the tweetf0rm crawler to collect data from Twitter for my thesis. I followed all the instructions and installed everything, but I get this error: "no crawler is alive... waiting to recreate all crawlers". It seems the proxies listed in proxy.json don't work for me, so I decided to ask for help: what should I do? Are there other proxies I can use, or any suggestions to solve this problem?
While using the search/tweets command, the text of retrieved tweets that are retweets is cut off in the JSON; it starts with the letters "RT". Do you know why this is happening?
For example,
'RT @elpais_cultura: Michel Ocelot: "Soy consciente de la responsabilidad moral que conlleva dirigir una película. Por eso hasta en mi estét…'
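For context, this truncation comes from Twitter itself: the REST API trims retweet text so the "RT @user: " prefix fits the length limit, while the untruncated original is carried in the tweet's retweeted_status field. A small sketch to recover the full text (plain dict access; field names are from the public tweet JSON, the helper name is mine):

```python
def full_text(tweet):
    """Return the untruncated text of a tweet dict.

    For retweets, the complete original text lives under the
    'retweeted_status' key; the top-level 'text' is truncated.
    """
    if "retweeted_status" in tweet:
        rt = tweet["retweeted_status"]
        return u"RT @{0}: {1}".format(rt["user"]["screen_name"], rt["text"])
    return tweet["text"]
```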
Hello everyone, I'm new here and I have a lot of questions.
This could be a little stupid, but I'm really stuck on it.
First, I'm trying to run the command ./bootstrap.sh -c config.json -p proxies.json
but I get this error:
ImportError: No module named futures
I don't know why. The other commands do run, but there is no output to say whether they worked or not.
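For reference, "futures" here is the Python 2 backport of the standard concurrent.futures module, installable with pip install futures; on Python 3 no extra package is needed. A quick check that the module is importable and working:

```python
# On Python 2, concurrent.futures comes from the "futures" backport
# package (pip install futures); on Python 3 it is standard library.
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as ex:
    result = ex.submit(lambda: 1 + 1).result()
print(result)  # 2
```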
Next, where will the data (from Twitter) be saved? This is my config.json file:
{
"apikeys": {
"i0mf0rmer01" :{
"app_key":"xYtnU1e4bvFNxoLYVdasBsAUPjkO",
"app_secret":"hTvAgGP83R3Z23ktFnabqxiUm2vkRkRStQcpzYm9MblzwFYD0kaI",
"oauth_token":"45737364-FhIYhGKmaWcIPjJLvutonrmjlFszykdgngrmNIObH6",
"oauth_token_secret":"7PcktvTDvQU37h1jYZEYJcjYFveanBBb33SQ5IkZYhGINX"
}
},
"redis_config": {
"host": "localhost",
"port": 6379,
"db": 0
},
"verbose": "True",
"output": "./data",
"archive_output": "./data"
}
There is no 'data' folder after I run a simple command like ./client.sh -c config.json -cmd CRAWL_FRIENDS -d 1 -dt "ids" -uid 45737364.
Thanks in advance.
Hi~
I am trying to use your code to collect data from Twitter, but after following all the steps and starting ./bootstrap.sh -c config.json without proxies, the output folder is always empty and I still cannot collect any data. I don't know which part is going wrong; I hope you can help me. My platform is a MacBook and my Python version is 2.7. Here are the messages from the terminal:
➜ tweetf0rm-master ./bootstrap.sh -c config.json
INFO-[2017-03-23 16:56:20,127][bootstrap][start_server][99]: output to /Users/xiangyuanxin/Developer/collectTwitterData/data
INFO-[2017-03-23 16:56:20,128][bootstrap][start_server][100]: archived to /Users/xiangyuanxin/Developer/collectTwitterData/data/archived
INFO-[2017-03-23 16:56:21,037][scheduler][init][46]: number of crawlers: 1
INFO-[2017-03-23 16:56:21,040][twitter_api][init][26]: {'apikeys': {u'oauth_token_secret': u'jT4RL4a11zT4Zyb4icbEc1dp5rr3odrtVoeyNV1****', u'app_secret': u'oMbssGKczQxkXgs34z9Tus6wRebkfK5qBSg7AD0Bv2q5o****', u'oauth_token': u'783543426411294720-XSyiiaxAuY5723pGT3GRSujTAPIthPR', u'app_key': u'mcqIwVZq4xwi6vWw9oSuW1G9y'}, 'client_args': {'timeout': 300}}
INFO-[2017-03-23 16:56:22,474][scheduler][init][63]: number of crawlers: 1 created
INFO-[2017-03-23 16:56:22,475][bootstrap][start_server][108]: starting node_id: b54b5c1ecdff3f6f8cadefe4f28bae7c
INFO-[2017-03-23 17:02:22,524][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:08:22,546][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:14:22,589][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:20:22,634][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:26:22,702][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:26:22,704][scheduler][balancing_load][128]: crawler with max_qsize: mcqIwVZq4xwi6vWw9oSuW1G9y (0)
INFO-[2017-03-23 17:26:22,704][scheduler][balancing_load][129]: crawler with min_qsize: mcqIwVZq4xwi6vWw9oSuW1G9y (0)
INFO-[2017-03-23 17:26:22,704][scheduler][balancing_load][130]: max_qsize - min_qsize > 0.5 * min_qsize ?: False
INFO-[2017-03-23 17:32:22,863][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:38:23,006][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
INFO-[2017-03-23 17:44:23,161][bootstrap][start_server][122]: [{'alive?': True,
'crawler_id': u'mcqIwVZq4xwi6vWw9oSuW1G9y',
'crawler_queue_key': u'queue:b54b5c1ecdff3f6f8cadefe4f28bae7c:mcqIwVZq4xwi6vWw9oSuW1G9y',
'qsize': 0}]
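One observation on the log: qsize stays 0 the whole time, which suggests no crawl job was ever queued — bootstrap.sh only starts idle workers, and jobs are dispatched separately via client.sh (as in the CRAWL_FRIENDS example earlier on this page). A small sketch (the "./data" path is an assumption from the sample config) to check whether anything was actually written:

```python
import os

def crawled_files(output_dir="./data"):
    """List every file under the configured output directory."""
    found = []
    for dirpath, _, filenames in os.walk(output_dir):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return found

print(len(crawled_files()))  # stays 0 until a job has been dispatched
```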
Greetings again. I configured all the necessary files according to my needs, but the line below appears in the console. The program does not crash and keeps trying to fetch tweets, but my search words are in Catalan and for about 2 hours it fetched nothing, so I want to know whether I did something wrong or there simply are no matching tweets.
Console Error:
(2019-06-06 12:03:42,765) [22347] ERROR: exception: file() takes at most 3 arguments (4 given)
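For what it's worth, file() is a Python 2 builtin that takes at most three arguments (name, mode, buffering), so this error usually means an encoding was passed as a fourth argument somewhere. io.open accepts an encoding keyword on both Python 2 and 3 and behaves as a drop-in replacement; a minimal sketch (the file name is hypothetical):

```python
import io

# file("out.txt", "a", -1, "utf-8") raises this TypeError on Python 2;
# io.open supports the encoding keyword on both Python 2 and 3:
with io.open("out.txt", "a", encoding="utf-8") as f:
    f.write(u"example line\n")
```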
Hi,
I am getting this error:
File "twitter_streamer.py", line 24, in <module>
import twython
ModuleNotFoundError: No module named 'twython'
But I have installed twython (pip install twython completed successfully),
and in twitter_streamer.py twython is imported correctly:
import twython
from util import full_stack, chunks, md5
class TwitterStreamer(twython.TwythonStreamer):
Can someone help me, please?
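A common cause of ModuleNotFoundError right after a successful pip install is that pip installed into a different interpreter than the one running the script (note also that ModuleNotFoundError is a Python 3 exception, so a package installed by a Python 2 pip would not be visible). A quick diagnostic sketch; installing with "python -m pip install twython", using the same "python" that runs twitter_streamer.py, avoids the mismatch:

```python
import sys

# The interpreter that must see twython is the one running the script:
print(sys.executable)          # path of the running interpreter
print(sys.version.split()[0])  # its version
```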
Hi,
First of all, thank you very much for sharing your work here. I am just trying to get started with tweetf0rm and ran into an error when trying to start it. Below I have posted the output from the terminal.
I should add that I am not very familiar with Python or the terminal.
Marks-MBP:tweetf0rm markheuer$ ./bootstrap.sh -c config.json -p proxies.json
Traceback (most recent call last):
  File "./tweetf0rm/bootstrap.py", line 170, in <module>
    config = json.load(config_f)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 290, in load
    **kw)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 381, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 3 column 3 (char 18)
I am working on OSX with Python 2.7
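"Expecting property name" at line 3 of config.json usually points to a trailing comma after the last key/value pair (or a comment), which strict JSON forbids. A minimal reproduction of the same error:

```python
import json

bad = '{\n  "a": 1,\n  }'          # trailing comma -> invalid JSON
try:
    json.loads(bad)
    err = None
except ValueError as e:            # JSONDecodeError subclasses ValueError
    err = e
print(err)
```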
Further, I am still not sure how to pass a command to tweetf0rm once it is running.
Thanks for your help!
sudo ./bootstrap.sh -c config_sample.json
INFO-[2017-03-04 22:38:57,578][bootstrap][start_server][99]: output to /home/kaz/twitsearch/Twitcrawl/tweetf0rm/data
INFO-[2017-03-04 22:38:57,578][bootstrap][start_server][100]: archived to /home/kaz/twitsearch/Twitcrawl/tweetf0rm/data/archived
INFO-[2017-03-04 22:38:57,854][scheduler][init][46]: number of crawlers: 1
INFO-[2017-03-04 22:38:57,855][twitter_api][init][26]: {'apikeys': {u'oauth_token_secret': u'ACCESS_TOKEN_SECRET', u'app_secret': u'CONSUMER_SECRET', u'oauth_token': u'ACCESS_TOKEN', u'app_key': u'CONSUMER_KEY'}, 'client_args': {'timeout': 300}}
ERROR-[2017-03-04 22:38:58,029][scheduler][init][56]: Unable to obtain OAuth 2 access token.
INFO-[2017-03-04 22:38:58,031][scheduler][init][63]: number of crawlers: 1 created
INFO-[2017-03-04 22:38:58,033][bootstrap][start_server][108]: starting node_id: 29349c7140248a452d5e090704245570
INFO-[2017-03-04 22:38:58,035][bootstrap][start_server][142]: no crawler is alive... waiting to recreate all crawlers...
^[[A^[[BINFO-[2017-03-04 22:40:58,135][bootstrap][start_server][122]: []
INFO-[2017-03-04 22:40:58,135][bootstrap][start_server][142]: no crawler is alive... waiting to recreate all crawlers...
"Unable to obtain OAuth 2 access token."
I think I gave it the wrong keys from Twitter, but I double-checked everything and made sure the keys work. I don't want to use a proxy, but I am fairly sure I have to use at least one.
Let me know what I am doing wrong here.
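One thing worth noting from the log above: the apikeys values printed are the literal placeholder strings (CONSUMER_KEY, ACCESS_TOKEN, ...), which suggests config_sample.json was run without the real credentials filled in. A small sanity-check sketch (the helper name is my own):

```python
PLACEHOLDERS = {"CONSUMER_KEY", "CONSUMER_SECRET",
                "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET"}

def has_placeholders(apikeys):
    """True if any credential still holds a sample placeholder."""
    return any(v in PLACEHOLDERS for v in apikeys.values())

sample = {"app_key": "CONSUMER_KEY", "app_secret": "CONSUMER_SECRET",
          "oauth_token": "ACCESS_TOKEN", "oauth_token_secret": "ACCESS_TOKEN_SECRET"}
print(has_placeholders(sample))  # True -> edit the config first
```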
Hi,
I was wondering how other users process the data generated with tweetf0rm. Since I need the data in some kind of spreadsheet format, I was going to use OpenRefine to import the JSON file containing the tweets I crawled. The issue is that I can only import one of the nodes from each tweet, not all of them. E.g. a tweet might look like this:
{
text: (...),
retweeted_status: {...},
retweeted: (...),
(...)
}
OpenRefine is only able to extract one of these nodes at a time (e.g. 'retweeted_status: { (...) }' but not 'text: (...)').
Another method I tried is a Google Sheets script called "=importJSON()", but without success.
I further noticed that both methods work fine with a JSON file I pulled directly from the Twitter API, but I have not been able to find out exactly what the difference between the two JSON files is. If you care to have a look at this other file, you can find it here: https://www.dropbox.com/s/detv4ttvlsgbtz0/cop21.11.1301_07.txt?dl=0
Anyway, I was wondering if any other users have experience with converting the tweetf0rm data into spreadsheet format, and what your workflow is.
Best, Mark
Information on using open refine with twitter data can be found here: http://blog.ouseful.info/2012/10/02/grabbing-twitter-search-results-into-google-refine-and-exporting-conversations-into-gephi/
Information on the google spreadsheets extension can be found here: https://medium.com/@paulgambill/how-to-import-json-data-into-google-spreadsheets-in-less-than-5-minutes-a3fede1a014a
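For a code-based alternative to OpenRefine, here is a minimal sketch that flattens one-JSON-object-per-line tweet data into a CSV; the chosen field subset and the sample input are my own, not the crawler's exact output:

```python
import csv
import json

def flatten(tweet):
    """Pick a flat subset of tweet fields for a spreadsheet row."""
    return {
        "id": tweet.get("id_str"),
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "is_retweet": "retweeted_status" in tweet,
    }

# sample input: one JSON object per line, as in the crawled files
lines = ['{"id_str": "1", "created_at": "Mon Jan 01 00:00:00 +0000 2018", "text": "hello"}']

with open("tweets.csv", "w") as dst:
    writer = csv.DictWriter(dst, fieldnames=["id", "created_at", "text", "is_retweet"])
    writer.writeheader()
    for line in lines:
        writer.writerow(flatten(json.loads(line)))
```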
Firstly, thanks so much for sharing your work. Secondly, in the spirit of your "I would be happy to take requests to add specific functionalities" note, it would be really helpful to have some more handlers :) I see there's a stub for Mongodb (mongodb_handler.py)... and an SQL centric one (SQLAlchemy?) could be helpful too.
I'm trying to get only the tweets that a given @user posted during a certain period. Is there any way to do that, or can I only get the user's whole timeline?
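If the crawler can only return a whole timeline, one option is to filter it by date afterwards; a sketch using Twitter's created_at format (the function name is my own):

```python
from datetime import datetime

FMT = "%a %b %d %H:%M:%S +0000 %Y"   # Twitter's created_at format

def in_window(tweet, start, end):
    """True if the tweet's created_at falls inside [start, end]."""
    ts = datetime.strptime(tweet["created_at"], FMT)
    return start <= ts <= end

tweet = {"created_at": "Wed Oct 10 20:19:24 +0000 2018"}
print(in_window(tweet, datetime(2018, 10, 1), datetime(2018, 10, 31)))  # True
```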
I noticed in the API documentation (https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/intro-to-tweet-json) that the "created_at" field of the tweet JSON is supposed to be the timestamp at which the tweet was created and published on Twitter.
On the contrary, from trial and error I realized that in the large text files every "created_at" field indicates the timestamp at which the tweet was crawled, not created.
Is this erroneous behavior, or is this how it is supposed to function? Also, whether you choose
"item_id": 1 or later, it only fetches the latest 10 days with the standard (free) API.
I've pulled this code to crawl some specific users' mentions, but I can't get further than ./bootstrap.sh.
I'm asking because it only generates empty folders.
Good afternoon, and congratulations on the work.
My question about your code: can you provide an example of the JSON-structured results saved in the output? How is a single entry depicted as JSON in the file?
Thanks
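In case it helps other readers: each entry is one Tweet object as returned by the Twitter API, serialized as JSON. A heavily trimmed sketch of what one entry might look like — the values here are invented, and real entries carry many more fields:

```python
import json

# Invented values; field names follow the public tweet JSON.
entry = {
    "id_str": "850006245121695744",
    "created_at": "Mon Apr 03 16:09:56 +0000 2017",
    "text": "example tweet text",
    "user": {"id_str": "2244994945", "screen_name": "example_user"},
}
print(json.dumps(entry, indent=1))
```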
Greetings again. I would like to ask whether it is possible for your application to fetch data based on a list of keywords combined with words from a second keyword list, so that for a tweet to be crawled it must include at least one word from the first list and one word from the second list.
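I don't know whether the crawler supports this natively, but it is easy to apply as a post-filter on already-crawled tweets; a sketch (the function name is my own):

```python
def matches_both(text, first_list, second_list):
    """True if text contains at least one word from each keyword list."""
    words = set(text.lower().split())
    return bool(words & {w.lower() for w in first_list}) and \
           bool(words & {w.lower() for w in second_list})

print(matches_both("climate summit in Paris", ["climate"], ["paris"]))  # True
```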
Hi, Prof. bianjiang,
First, thank you very much for open-sourcing this.
I tested this program on the Mac platform and it runs well.
However, when I try to run it on Windows, something seems to go wrong.
Here is the error log.
I tried many ways to fix it, but all failed.
Could you help me? T.T
Thank you! :)
ERROR-[2017-04-05 06:30:42,088][scheduler][init][56]: Can't pickle <function at 0x0000000003D7AB38>: it's not found as redis.client.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\lib\multiprocessing\forking.py", line 381, in main
    self = load(from_parent)
  File "C:\Python27\lib\pickle.py", line 1384, in load
    return Unpickler(file).load()
  File "C:\Python27\lib\pickle.py", line 864, in load
    dispatch[key](self)
  File "C:\Python27\lib\pickle.py", line 886, in load_eof
    raise EOFError
EOFError
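For what it's worth, this looks like the usual Windows multiprocessing issue: Windows spawns a fresh interpreter for each child process and pickles whatever is handed to it, and lambdas or nested functions (like the anonymous function in the error) cannot be pickled. A minimal sketch of the portable pattern, not a fix for tweetf0rm itself:

```python
import multiprocessing

def work(x):                 # module-level functions pickle fine
    return x * 2

if __name__ == "__main__":   # guard required on Windows (spawn)
    pool = multiprocessing.Pool(1)
    print(pool.map(work, [21])[0])  # 42
    pool.close()
    pool.join()
```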
Hi there,
Are the scripts in /scripts/v1 supposed to be functional? twitter_crawler.py imports UserFarm, which doesn't seem to exist in the project...
Thanks for the awesome work :)
Is it possible to use search/tweets with only a geocode and no keyword terms at all (e.g. an empty terms JSON array)? I want to fetch all the tweets from a very specific small area, and from what I read in the API documentation this remains ambiguous.
Is there an event that lets us know when a job finishes, so that we can execute another job?
Good afternoon,
I noticed that when you form a query string with multiple keywords joined by OR, the crawler fetches the same tweet more than once.
For instance, if two distinct keywords in the query string are present in the same tweet, the tweet is crawled twice; I verified this by monitoring the tweet IDs in a database.
Is there an easy way to eliminate this in the crawler, or should I apply other tactics, e.g. file storage, a Python dictionary in RAM, an existence query against the DB to discard duplicates, etc.?
Thanks for your time.
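If it isn't handled inside the crawler, a simple in-process approach is to keep a set of seen tweet IDs and drop repeats before storage (a sketch; names are my own):

```python
seen_ids = set()

def dedup(tweets):
    """Drop tweets whose id_str has been seen before."""
    fresh = []
    for t in tweets:
        if t["id_str"] not in seen_ids:
            seen_ids.add(t["id_str"])
            fresh.append(t)
    return fresh

batch = [{"id_str": "1"}, {"id_str": "2"}, {"id_str": "1"}]
print(len(dedup(batch)))  # 2
```

Note the set lives in RAM; for long crawls a persistent store (e.g. a unique index on the tweet ID in the database) is the more durable option.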
Hi,
I have run into a new problem.
Could you tell me how to find the cause and resolve it?
I started the server and sent a command to get followers, as follows:
"./client.sh -c config.json -cmd CRAWL_FOLLOWERS -d 1 -dt "ids" -uid 147163321".
Then the server console displayed the following message:
WARNING-[2017-08-17 21:51:50,514][twitter_api][rate_limit_error_occured][61]: [{u'application': u'xxxx'}] rate limit reached, sleep for 803.
But when I checked the rate limit from the Twitter API (application/rate_limit_status),
it showed that the rate limit had not been reached.
(Actually, the result shows that none of the APIs had been invoked except 'rate_limit_status'.)
Here is the result from the API:
....
'application':{
'/application/rate_limit_status':{
'limit':180,
'remaining':179,
'reset':1502978445
}
},
....
'followers':{
'/followers/ids':{
'limit':15,
'remaining':15,
'reset':1502978445
},
'/followers/list':{
'limit':15,
'remaining':15, <----- this API was never invoked
'reset':1502978445
}
},
Thank you very much.
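On the 803-second sleep: I can't say for certain how tweetf0rm computes it, but crawlers usually derive that figure from the x-rate-limit-reset response header (an epoch timestamp) rather than from rate_limit_status, and a rate-limit error on one endpoint does not show up under another endpoint's counters. A sketch of the arithmetic, with the reset value from the output above:

```python
import time

def sleep_seconds(reset_epoch, now=None):
    """Seconds to wait until the rate-limit window resets."""
    now = time.time() if now is None else now
    return max(0, int(reset_epoch - now))

print(sleep_seconds(1502978445, now=1502977642))  # 803
```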