alextan-b-z / zhihuspider
Zhihu distributed crawler (Scrapy, Redis)
License: MIT License
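Neither zhihuspider0.py nor zhihuspider1.py runs; both stay stuck on the first link, which makes me doubt whether the code is usable.
The code formatting is rather messy. I suggest tidying it up a little so that it is easier to read and to submit PRs against.
I installed phantomjs with "npm install phantomjs-prebuilt", but I keep getting this error: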
====
$ scrapy list
/usr/local/lib/python2.7/dist-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 11, in <module>
sys.exit(execute())
File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 149, in execute
cmd.crawler_process = CrawlerProcess(settings)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 249, in __init__
super(CrawlerProcess, self).__init__(settings)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 137, in __init__
self.spider_loader = _get_spider_loader(settings)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 336, in _get_spider_loader
return loader_cls.from_settings(settings.frozencopy())
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiderloader.py", line 61, in from_settings
return cls(settings)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiderloader.py", line 25, in __init__
self._load_all_spiders()
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
for module in walk_modules(name):
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/misc.py", line 71, in walk_modules
submod = import_module(fullpath)
File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
__import__(name)
File "/apps/AlexTan-b-z_ZhihuSpider/zhihu/zhihu/spiders/zhihuspider.py", line 22, in <module>
class ZhihuspiderSpider(RedisSpider):
File "/apps/AlexTan-b-z_ZhihuSpider/zhihu/zhihu/spiders/zhihuspider.py", line 34, in ZhihuspiderSpider
obj = webdriver.PhantomJS(desired_capabilities=dcap)
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/phantomjs/webdriver.py", line 56, in __init__
self.service.start()
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/common/service.py", line 98, in start
self.assert_process_still_running()
File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/common/service.py", line 111, in assert_process_still_running
% (self.path, return_code)
selenium.common.exceptions.WebDriverException: Message: Service phantomjs unexpectedly exited. Status code was: -6
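A note on reading "Status code was: -6": Python's subprocess convention reports a negative return code when the child process was killed by a signal, and on Linux signal 6 is SIGABRT. So the phantomjs binary itself probably aborted at startup, before Selenium could talk to it. The snippet below is only a rough diagnostic sketch, assuming Python 3.5+ (the traceback above is from Python 2.7); `describe_exit` and `check_phantomjs` are hypothetical helper names, not part of Selenium or this repository.

```python
import shutil
import signal
import subprocess


def describe_exit(return_code):
    """Turn a subprocess return code into a readable description."""
    if return_code < 0:
        # Negative values mean "terminated by signal -return_code" on POSIX.
        try:
            name = signal.Signals(-return_code).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal %d (%s)" % (-return_code, name)
    return "exited with status %d" % return_code


def check_phantomjs(binary="phantomjs"):
    """Run the binary directly, outside Selenium, and report how it ended."""
    path = shutil.which(binary)
    if path is None:
        return "%s not found on PATH" % binary
    proc = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=30,
    )
    return "%s: %s" % (path, describe_exit(proc.returncode))


if __name__ == "__main__":
    print(describe_exit(-6))
    print(check_phantomjs())
```

If running the binary on its own also ends with a signal, the problem is likely in the phantomjs install or its environment rather than in the spider code.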
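I configured 3 accounts and left every other setting unchanged.
At first it would not run because the PhantomJS install was incomplete. That was fixed later, but no data is collected at all.
The output is as follows: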
qianzise@FengMaster-PC:~/ZhihuSpider-2.0/zhihu$ scrapy crawl zhihuspider
2017-11-27 16:55:37 [scrapy] INFO: Scrapy 1.1.0rc1 started (bot: zhihu)
2017-11-27 16:55:37 [scrapy] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.86 Safari/537.36', 'DUPEFILTER_CLASS': 'zhihu.scrapy_redis.dupefilter.RFPDupeFilter', 'SPIDER_MODULES': ['zhihu.spiders'], 'NEWSPIDER_MODULE': 'zhihu.spiders', 'DOWNLOAD_TIMEOUT': 10, 'SCHEDULER': 'zhihu.scrapy_redis.scheduler.Scheduler', 'RETRY_TIMES': 1, 'REDIRECT_ENABLED': False, 'BOT_NAME': 'zhihu'}
2017-11-27 16:55:37 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-11-27 16:55:37 [zhihuspider] INFO: Reading start URLs from redis key 'zhihuspider:start_urls' (batch size: 16, encoding: utf-8
2017-11-27 16:55:37 [zhihu.cookie] WARNING: The num of the cookies is 3
2017-11-27 16:55:37 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'zhihu.middlewares.UserAgentMiddleware',
'zhihu.middlewares.CookiesMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-11-27 16:55:37 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-11-27 16:55:37 [scrapy] INFO: Enabled item pipelines:
['zhihu.pipelines.ZhihuPipeline']
2017-11-27 16:55:37 [scrapy] INFO: Spider opened
2017-11-27 16:55:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-27 16:55:37 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-11-27 16:55:37 [zhihu.scrapy_redis.dupefilter] DEBUG: Filtered duplicate request <GET https://www.zhihu.com/api/v4/members/yun-he-shu-ju-8?include=locations,employments,industry_category,gender,educations,business,follower_count,following_count,description,badge[?(type=best_answerer)].topics> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-11-27 16:56:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-27 16:57:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-27 16:58:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-27 16:59:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-27 17:00:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-27 17:01:37 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
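One possible reading of this log (a guess, not a confirmed diagnosis): the only request that appears is reported as "Filtered duplicate request", and the crawl then stays at 0 pages. If the duplicate filter keeps its fingerprints in Redis between runs, a start request that was already recorded in an earlier run would be dropped again on the next run. The toy model below only illustrates that behaviour with a plain Python set standing in for the Redis set; `PersistentDupeFilter` and `run_crawl` are hypothetical names and this is not the scrapy_redis implementation.

```python
import hashlib


class PersistentDupeFilter:
    """Toy duplicate filter whose store outlives a single crawl run."""

    def __init__(self, store):
        # `store` stands in for a Redis set that survives between runs.
        self.store = store

    def request_seen(self, method, url):
        fingerprint = hashlib.sha1(
            ("%s %s" % (method, url)).encode("utf-8")
        ).hexdigest()
        if fingerprint in self.store:
            return True
        self.store.add(fingerprint)
        return False


def run_crawl(store, start_urls):
    """Return the start URLs that would actually be scheduled in this run."""
    dupefilter = PersistentDupeFilter(store)
    return [u for u in start_urls if not dupefilter.request_seen("GET", u)]


if __name__ == "__main__":
    shared_store = set()
    url = "https://www.zhihu.com/api/v4/members/yun-he-shu-ju-8"
    print(run_crawl(shared_store, [url]))  # first run: URL is scheduled
    print(run_crawl(shared_store, [url]))  # second run: nothing is scheduled
    shared_store.clear()                   # comparable to clearing the stored fingerprints
    print(run_crawl(shared_store, [url]))  # scheduled again
```

If this is what is happening here, clearing the stored fingerprints before restarting would be the thing to try; the exact Redis key name depends on the project's settings.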
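Some users have not asked or answered any questions, but they have upvoted content and followed certain topics.
The user's profile page has an "Activity" (动态) tab.
The link looks something like this: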
https://www.zhihu.com/api/v4/members/yang-da-yi-19/activities?limit=20&after_id=1503246015&desktop=True
Users like this are also worth analyzing.
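For anyone who wants to experiment with this, here is a small sketch of how such activity URLs could be built per user. It only mirrors the shape of the example link above; the parameter names (`limit`, `after_id`, `desktop`) are copied from that link, their exact meaning is an assumption, and `build_activities_url` is a hypothetical helper, not something in this repository. The endpoint may also have changed since this was posted.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://www.zhihu.com/api/v4/members"


def build_activities_url(url_token, limit=20, after_id=None):
    """Build an activities URL shaped like the example link in this issue."""
    params = [("limit", limit)]
    if after_id is not None:
        # Assumed to be a paging cursor taken from a previous response.
        params.append(("after_id", after_id))
    params.append(("desktop", "True"))
    return "%s/%s/activities?%s" % (
        API_BASE,
        quote(url_token, safe=""),
        urlencode(params),
    )


if __name__ == "__main__":
    print(build_activities_url("yang-da-yi-19", limit=20, after_id=1503246015))
```

A spider could yield a request for this URL alongside the existing profile request for each user, then follow the paging information in the response.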