minehotspot's Introduction

MineHotspot

Quick Start

  1. Install Docker Desktop
  2. Start Docker Desktop
  3. git clone xxx (replace xxx with the repository URL)
  4. cd minehotspot
  5. docker-compose build
  6. docker-compose up

Useful Commands

  • cd scrapy
  • python setup.py bdist_egg
  • curl http://localhost:6800/addversion.json -F project=minehotspot -F version=1.0 -F egg=@dist/minehotspot-1.0-py3.11.egg
  • docker build -t minehotspot .
  • docker run -d --name minehotspot -p 4200:4200 minehotspot
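
Once the egg has been uploaded with addversion.json, a crawl can be started through Scrapyd's schedule.json endpoint, which passes any extra form field to the spider as an argument. The sketch below only builds the request. The spider name `tiebalist` and the `start`/`end` arguments are taken from the log in the issue further down, and the helper name `build_schedule_request` is hypothetical, not part of this repository.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(base_url, project, spider, **spider_args):
    """Build (but do not send) a POST request for Scrapyd's schedule.json."""
    # Scrapyd forwards every field other than its own options to the spider.
    fields = {"project": project, "spider": spider, **spider_args}
    body = urlencode(fields).encode("ascii")
    return Request(f"{base_url}/schedule.json", data=body, method="POST")


# Assumed values: same host/port as the addversion.json command above.
req = build_schedule_request(
    "http://localhost:6800", "minehotspot", "tiebalist", start="0", end="200"
)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

Passing `req` to `urllib.request.urlopen` would submit the job to a running Scrapyd instance.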

minehotspot's People

Contributors

hairlessvillager, bzclovexrn, ruinaxalloy

Watchers

Lucian

minehotspot's Issues

[Feature Request] Auto pass baidu captcha v2

2024-07-09 11:16:09 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: minehotspot)
2024-07-09 11:16:09 [scrapy.utils.log] INFO: Versions: lxml 4.9.4.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.3.0, Python 3.11.9 | packaged by Anaconda, Inc. | (main, Apr 19 2024, 16:40:41) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 24.1.0 (OpenSSL 3.2.2 4 Jun 2024), cryptography 42.0.8, Platform Windows-10-10.0.19045-SP0
2024-07-09 11:16:09 [tiebalist] DEBUG: start='0', end='200', cookies_text='id'
2024-07-09 11:16:09 [tiebalist] DEBUG: self.cookies={'id': ''}
2024-07-09 11:16:09 [scrapy.addons] INFO: Enabled addons:
[]
2024-07-09 11:16:09 [asyncio] DEBUG: Using selector: SelectSelector
2024-07-09 11:16:09 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-07-09 11:16:09 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-07-09 11:16:09 [scrapy.extensions.telnet] INFO: Telnet Password: 735079daa16e3e8e
2024-07-09 11:16:09 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2024-07-09 11:16:09 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'minehotspot',
 'DOWNLOAD_DELAY': 10,
 'FEED_EXPORT_ENCODING': 'utf-8',
 'LOG_FILE': 'logs\\minehotspot\\tiebalist\\93993b343da111ef9695744ca19748bf.log',
 'NEWSPIDER_MODULE': 'minehotspot.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'SPIDER_MODULES': ['minehotspot.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-07-09 11:16:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'minehotspot.middlewares.RandomUserAgentDownloadMiddlware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'minehotspot.middlewares.RemoveHtmlCommentDownloadMiddleware',
 'minehotspot.middlewares.CheckCAPTCHADownloadMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-07-09 11:16:09 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-07-09 11:16:09 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-07-09 11:16:09 [scrapy.core.engine] INFO: Spider opened
2024-07-09 11:16:09 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-07-09 11:16:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-07-09 11:16:09 [tiebalist] INFO: r=range(0, 200, 50)
2024-07-09 11:16:09 [tiebalist] INFO: start_from_range(): pn=0
2024-07-09 11:16:09 [tiebalist] INFO: start_from_range(): pn=50
2024-07-09 11:16:09 [tiebalist] INFO: start_from_range(): pn=100
2024-07-09 11:16:09 [tiebalist] INFO: start_from_range(): pn=150
2024-07-09 11:16:09 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D0&timestamp=1720494970&signature=a431960e4a800f7493e6b8d6b96e37cb> from <GET https://tieba.baidu.com/f?kw=galgame&ie=utf-8&pn=0>
2024-07-09 11:16:09 [scrapy.downloadermiddlewares.offsite] DEBUG: Filtered offsite request to 'wappass.baidu.com': <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D0&timestamp=1720494970&signature=a431960e4a800f7493e6b8d6b96e37cb>
2024-07-09 11:16:09 [scrapy.core.engine] DEBUG: Signal handler scrapy.downloadermiddlewares.offsite.OffsiteMiddleware.request_scheduled dropped request <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D0&timestamp=1720494970&signature=a431960e4a800f7493e6b8d6b96e37cb> before it reached the scheduler.
2024-07-09 11:16:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D50&timestamp=1720494978&signature=027cbf20eba1abd8b5ce88d35e1a3bfb> from <GET https://tieba.baidu.com/f?kw=galgame&ie=utf-8&pn=50>
2024-07-09 11:16:17 [scrapy.core.engine] DEBUG: Signal handler scrapy.downloadermiddlewares.offsite.OffsiteMiddleware.request_scheduled dropped request <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D50&timestamp=1720494978&signature=027cbf20eba1abd8b5ce88d35e1a3bfb> before it reached the scheduler.
2024-07-09 11:16:31 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D100&timestamp=1720494992&signature=52f7b924b2bb72f9299c54873b51c184> from <GET https://tieba.baidu.com/f?kw=galgame&ie=utf-8&pn=100>
2024-07-09 11:16:31 [scrapy.core.engine] DEBUG: Signal handler scrapy.downloadermiddlewares.offsite.OffsiteMiddleware.request_scheduled dropped request <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D100&timestamp=1720494992&signature=52f7b924b2bb72f9299c54873b51c184> before it reached the scheduler.
2024-07-09 11:16:45 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D150&timestamp=1720495005&signature=0b6693ac47921f1de23647c442cf9798> from <GET https://tieba.baidu.com/f?kw=galgame&ie=utf-8&pn=150>
2024-07-09 11:16:45 [scrapy.core.engine] DEBUG: Signal handler scrapy.downloadermiddlewares.offsite.OffsiteMiddleware.request_scheduled dropped request <GET https://wappass.baidu.com/static/captcha/tuxing.html?ak=2ef521ec36290baed33d66de9b16f625&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D150&timestamp=1720495005&signature=0b6693ac47921f1de23647c442cf9798> before it reached the scheduler.
2024-07-09 11:16:45 [scrapy.core.engine] INFO: Closing spider (finished)
2024-07-09 11:16:45 [scrapy.extensions.feedexport] INFO: Stored jsonlines feed (0 items) in: scrapyd_items/
2024-07-09 11:16:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1469,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 4321,
 'downloader/response_count': 4,
 'downloader/response_status_count/302': 4,
 'elapsed_time_seconds': 35.587994,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 7, 9, 3, 16, 45, 246251, tzinfo=datetime.timezone.utc),
 'log_count/DEBUG': 12,
 'log_count/INFO': 16,
 'offsite/domains': 1,
 'offsite/filtered': 4,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2024, 7, 9, 3, 16, 9, 658257, tzinfo=datetime.timezone.utc)}
2024-07-09 11:16:45 [scrapy.core.engine] INFO: Spider closed (finished)
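
In the log above, all four list requests are answered with a 302 to a captcha page on wappass.baidu.com, the redirect target is dropped by the offsite filter, and the spider finishes with 0 items. A first step toward handling this could be recognising such a redirect and recovering the page that was originally requested from its `backurl` query parameter. The following is a minimal sketch using only the standard library. The function name `captcha_backurl` is hypothetical, and this is not the project's `CheckCAPTCHADownloadMiddleware`, whose implementation is not shown here.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Host and path fragment as they appear in the redirects logged above.
CAPTCHA_HOST = "wappass.baidu.com"
CAPTCHA_PATH_PART = "/captcha/"


def captcha_backurl(location: str) -> Optional[str]:
    """Return the originally requested URL if `location` is a captcha redirect."""
    parts = urlsplit(location)
    if parts.hostname != CAPTCHA_HOST or CAPTCHA_PATH_PART not in parts.path:
        return None
    # parse_qs percent-decodes the value, so the nested URL comes back intact.
    values = parse_qs(parts.query).get("backurl")
    return values[0] if values else None


redirect = (
    "https://wappass.baidu.com/static/captcha/tuxing.html"
    "?ak=2ef521ec36290baed33d66de9b16f625"
    "&backurl=http%3A%2F%2Ftieba.baidu.com%2Ff%3Fkw%3Dgalgame%26ie%3Dutf-8%26pn%3D0"
    "&timestamp=1720494970&signature=a431960e4a800f7493e6b8d6b96e37cb"
)
print(captcha_backurl(redirect))
print(captcha_backurl("https://tieba.baidu.com/f?kw=galgame&ie=utf-8&pn=0"))
```

Solving the captcha itself is outside the scope of this sketch. It only separates "blocked by captcha" from an ordinary response so that the request can be retried or reported.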

Bug Report: tieba.py

@HairlessVillager
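Steps to reproduce: run the following command in a console to crawl Tieba: scrapy crawl tiebapost -a pid=8351704896

image

Problem description:

  • The original version of TiebaComment had a title field (the post title). The current version has removed it, but the spider code still sets this field

image
As a result, running the spider fails with the following error:
image

  • The current version of TiebaComment has a uname field, but the spider code never populates it, so uname is None for every item, as shown below

image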
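
Both problems in this report come from the item definition and the spider drifting apart: the spider sets a field the item no longer declares (`title`), and leaves a declared field unset (`uname`). A small check like the one below could catch this before a crawl. It is a sketch with plain sets, not Scrapy code. Only `title` and `uname` come from the report, while the `content` field and the helper name `diff_fields` are hypothetical.

```python
def diff_fields(declared, populated):
    """Compare the fields an item declares with the fields a spider sets.

    Returns (unknown, unset): fields set but not declared, and fields
    declared but never set.
    """
    declared, populated = set(declared), set(populated)
    return sorted(populated - declared), sorted(declared - populated)


# Hypothetical field sets; only "title" and "uname" come from the report.
declared_fields = {"uname", "content"}
populated_fields = {"title", "content"}

unknown, unset = diff_fields(declared_fields, populated_fields)
print("set but not declared:", unknown)
print("declared but never set:", unset)
```

In a real Scrapy project the declared names are available as the keys of the item class's `fields` attribute, so the first argument would not need to be written by hand.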
