Discussions for the spider repository from malei666

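Could you crawl news links from the major mainstream news sites?

I'm trying to write a news app with Flutter and need some news links. No content cleaning is needed: the links would jump straight to the article on the original site, which avoids copyright problems.

The app is similar to "今日热榜".

I've looked at RSSHub, which also provides some news links, but there are too few.

Do you, as the author, have any plans for this?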

$ scrapy crawl tt
2018-10-19 15:01:15 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: toutiao)
2018-10-19 15:01:15 [scrapy.utils.log] INFO: Versions: lxml 4.2.3.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 05:52:31) - [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.3, Platform Darwin-18.0.0-x86_64-i386-64bit
2018-10-19 15:01:15 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'toutiao', 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 3, 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'NEWSPIDER_MODULE': 'toutiao.spiders', 'REDIRECT_ENABLED': False, 'SPIDER_MODULES': ['toutiao.spiders']}
2018-10-19 15:01:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2018-10-19 15:01:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-19 15:01:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-19 15:01:15 [scrapy.middleware] INFO: Enabled item pipelines:
['toutiao.pipelines.ToutiaoPipeline']
2018-10-19 15:01:15 [scrapy.core.engine] INFO: Spider opened
2018-10-19 15:01:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-19 15:01:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-19 15:01:16 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.toutiao.com via http://localhost:8050/render.html> (failed 1 times): Connection was refused by other side: 61: Connection refused.
2018-10-19 15:01:19 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.toutiao.com via http://localhost:8050/render.html> (failed 2 times): Connection was refused by other side: 61: Connection refused.
2018-10-19 15:01:23 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.toutiao.com via http://localhost:8050/render.html> (failed 3 times): Connection was refused by other side: 61: Connection refused.
2018-10-19 15:01:23 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.toutiao.com via http://localhost:8050/render.html>
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 61: Connection refused.
2018-10-19 15:01:23 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-19 15:01:23 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 3,
'downloader/request_bytes': 1818,
'downloader/request_count': 3,
'downloader/request_method_count/POST': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 10, 19, 7, 1, 23, 667284),
'log_count/DEBUG': 4,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'memusage/max': 58916864,
'memusage/startup': 58912768,
'retry/count': 2,
'retry/max_reached': 1,
'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'splash/render.html/request_count': 1,
'start_time': datetime.datetime(2018, 10, 19, 7, 1, 15, 822066)}
2018-10-19 15:01:23 [scrapy.core.engine] INFO: Spider closed (finished)

Why is this happening?
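
A possible reading of the log above: every request is routed via http://localhost:8050/render.html (the scrapy_splash middlewares are enabled) and each attempt fails with "Connection refused", which usually means nothing is listening on port 8050. Assuming that is the cause, the Splash service has to be started before running `scrapy crawl tt`; one common way is `docker run -p 8050:8050 scrapinghub/splash`. The sketch below is not part of the repository; `splash_is_reachable` is a hypothetical helper that only checks whether the port accepts TCP connections.

```python
import socket


def splash_is_reachable(host="localhost", port=8050, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper: it only tests that the port is open,
    not that the listener is actually Splash.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, and name-resolution errors.
        return False


if __name__ == "__main__":
    if splash_is_reachable():
        print("Port 8050 is open; Splash may be running.")
    else:
        print("Nothing is listening on localhost:8050; "
              "start Splash before running `scrapy crawl tt`.")
```

If the check fails, the spider will hit the same ConnectionRefusedError shown in the traceback, whatever the target site is.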


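spider's Issues

Zhihu login returns 403

Could you crawl news links from the major mainstream news sites?

Can we discuss the Toutiao crawler?

connection was refused?

Which version of Python is used?

Zhihu crawler

selenium.common.exceptions.WebDriverException: Message: unknown error: TAC is not defined

Hello, could you provide the format of conf.ini used for Dianping?

I got an error on startup

Hello, could you publish the dataset under the dianping directory?

Dianping: redirected to the verification center, or page not found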
