hairlessvillager / minehotspot
License: GNU General Public License v2.0
@HairlessVillager
Steps to reproduce: run the following command in the console to crawl Tieba: scrapy crawl tiebapost -a pid=8351704896
Problem description:
2024-07-09 11:16:09 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: minehotspot)
2024-07-09 11:16:09 [scrapy.utils.log] INFO: Versions: lxml 4.9.4.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.11.9 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:40:41) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.19045-SP0
2024-07-09 11:16:09 [tiebalist] DEBUG: start='0', end='200', cookies_text='id'
2024-07-09 11:16:09 [tiebalist] DEBUG: self.cookies={'id': ''}
2024-07-09 11:16:09 [scrapy.addons] INFO: Enabled addons:
[]
2024-07-09 11:16:09 [asyncio] DEBUG: Using selector: SelectSelector
2024-07-09 11:16:09 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-07-09 11:16:09 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-07-09 11:16:09 [scrapy.extensions.telnet] INFO: Telnet Password: 735079daa16e3e8e
2024-07-09 11:16:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2024-07-09 11:16:09 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'minehotspot',
'DOWNLOAD_DELAY': 10,
'FEED_EXPORT_ENCODING': 'utf-8',
'LOG_FILE': 'logs\\minehotspot\\tiebalist\\93993b343da111ef9695744ca19748bf.log',
'NEWSPIDER_MODULE': 'minehotspot.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['minehotspot.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-09 11:16:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'minehotspot.middlewares.RandomUserAgentDownloadMiddlware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'minehotspot.middlewares.RemoveHtmlCommentDownloadMiddleware',
'minehotspot.middlewares.CheckCAPTCHADownloadMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-09 11:16:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-09 11:16:09 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-07-09 11:16:09 [scrapy.core.engine] INFO: Spider opened
2024-07-09 11:16:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-09 11:16:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-09 11:16:09 [tiebalist] INFO: r=range(0, 200, 50)
2024-07-09 11:16:09 [tiebalist] INFO: start_from_range(): pn=0
2024-07-09 11:16:09 [tiebalist] INFO: start_from_range(): pn=50
2024-07-09 11:16:09 [tiebalist] INFO: start_from_range(): pn=100
2024-07-09 11:16:09 [tiebalist] INFO: start_from_range(): pn=150
2024-07-09 11:16:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D0&timestamp=1720494970&signature=a431960e4a800f7493e6b8d6b96e37cb> from <GET https://tieba.baidu.com/f?kw=galgame&ie=utf-8&pn=0>
2024-07-09 11:16:09 [scrapy.downloadermiddlewares.offsite] DEBUG: Filtered offsite request to 'wappass.baidu.com': <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D0&timestamp=1720494970&signature=a431960e4a800f7493e6b8d6b96e37cb>
2024-07-09 11:16:09 [scrapy.core.engine] DEBUG: Signal handler scrapy.downloadermiddlewares.offsite.OffsiteMiddleware.request_scheduled dropped request <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D0&timestamp=1720494970&signature=a431960e4a800f7493e6b8d6b96e37cb> before it reached the scheduler.
2024-07-09 11:16:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D50&timestamp=1720494978&signature=027cbf20eba1abd8b5ce88d35e1a3bfb> from <GET https://tieba.baidu.com/f?kw=galgame&ie=utf-8&pn=50>
2024-07-09 11:16:17 [scrapy.core.engine] DEBUG: Signal handler scrapy.downloadermiddlewares.offsite.OffsiteMiddleware.request_scheduled dropped request <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D50&timestamp=1720494978&signature=027cbf20eba1abd8b5ce88d35e1a3bfb> before it reached the scheduler.
2024-07-09 11:16:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D100&timestamp=1720494992&signature=52f7b924b2bb72f9299c54873b51c184> from <GET https://tieba.baidu.com/f?kw=galgame&ie=utf-8&pn=100>
2024-07-09 11:16:31 [scrapy.core.engine] DEBUG: Signal handler scrapy.downloadermiddlewares.offsite.OffsiteMiddleware.request_scheduled dropped request <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D100&timestamp=1720494992&signature=52f7b924b2bb72f9299c54873b51c184> before it reached the scheduler.
2024-07-09 11:16:45 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D150&timestamp=1720495005&signature=0b6693ac47921f1de23647c442cf9798> from <GET https://tieba.baidu.com/f?kw=galgame&ie=utf-8&pn=150>
2024-07-09 11:16:45 [scrapy.core.engine] DEBUG: Signal handler scrapy.downloadermiddlewares.offsite.OffsiteMiddleware.request_scheduled dropped request <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D150&timestamp=1720495005&signature=0b6693ac47921f1de23647c442cf9798> before it reached the scheduler.
2024-07-09 11:16:45 [scrapy.core.engine] INFO: Closing spider (finished)
2024-07-09 11:16:45 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (0 items) in: scrapyd_items/
2024-07-09 11:16:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1469,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 4321,
'downloader/response_count': 4,
'downloader/response_status_count/302': 4,
'elapsed_time_seconds': 35.587994,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2024, 7, 9, 3, 16, 45, 246251, tzinfo=datetime.timezone.utc),
'log_count/DEBUG': 12,
'log_count/INFO': 16,
'offsite/domains': 1,
'offsite/filtered': 4,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2024, 7, 9, 3, 16, 9, 658257, tzinfo=datetime.timezone.utc)}
2024-07-09 11:16:45 [scrapy.core.engine] INFO: Spider closed (finished)