aqua-dream / tieba_spider
A Baidu Tieba spider (based on Scrapy and MySQL).
Some threads or floors are restricted by Baidu, and their content is only visible after logging into the right account, hence the need to crawl with cookies. The cookie looks something like:
timeShow=
BAIDUID=
TIEBA_USERTYPE=
TIEBAUID=
pgv_pvi=
bdshare_firstime=
BAIDU_WISE_UID=
IS_NEW_USER=
SEENKW=
BDUSS=
BDUSS_BFESS=
BDORZ=
STOKEN=
Hm_lvt_xxx=
wise_device=
Hm_lpvt_xxx=
st_data=
st_key_id=
st_sign=
The important ones are presumably BDUSS or STOKEN. However, when I copied the entire cookie into a logged-out browser, I still could not see the content (it stayed in the logged-out state).
If anyone has made this work, please share how.
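For reference, Scrapy accepts cookies as a dict via `scrapy.Request(cookies=...)`. A small stdlib sketch (the cookie values below are made up) for converting a copied browser `Cookie:` header into that dict:

```python
# Convert a browser "Cookie:" header string into the dict form that
# scrapy.Request(cookies=...) accepts. The values here are invented.
from http.cookies import SimpleCookie

def cookie_header_to_dict(header):
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}

cookies = cookie_header_to_dict("BDUSS=abc123; STOKEN=def456")
print(cookies)  # {'BDUSS': 'abc123', 'STOKEN': 'def456'}
```

Note that as reported above, replaying the full cookie set does not necessarily reproduce a logged-in session; Baidu may bind the session to more than the cookie alone.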
Hello! I am trying to use your scraper, but I keep running into this problem:
File "/Users/.../PycharmProjects/BD_scraper/Tieba_Spider/tieba/commands/run.py", line 59, in run
cfg = config.config()
AttributeError: module 'config' has no attribute 'config'
Could you help me fix this?
Thank you!
start_time end_time elapsed_time tieba_name database_name pages etc
2020-09-24 11:19:15 2020-09-24 11:19:16 0.57 xxxxx tieba_gdupt None None
Where to set the UA, what exactly to set it to, and what the redirect address is: I have no idea about any of the three.
Originally posted by @Aqua-Dream in #14 (comment)
I just added a downloader middleware and called request.headers.setdefault in its process_request method to set the User-Agent. The request then got redirected to a Baidu search page.
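A minimal sketch of such a middleware (the class name and UA string are assumptions, not the actual project code); it would be enabled through `DOWNLOADER_MIDDLEWARES` in settings.py:

```python
# Hypothetical downloader middleware: sets a User-Agent only if the
# request does not already carry one (headers.setdefault semantics).
class UserAgentMiddleware:
    USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/96.0.4664.110 Safari/537.36")

    def process_request(self, request, spider):
        request.headers.setdefault("User-Agent", self.USER_AGENT)
        return None  # returning None lets Scrapy continue processing
```

Per the report above, a real-browser UA alone can get the first request redirected to a Baidu search page, so this may need to be combined with other measures.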
Only the first page of each 楼中楼 (nested-reply) thread and the replies that are not collapsed can be crawled.
✗ scrapy run
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 8, in <module>
sys.exit(execute())
File "/Users/noname/Library/Python/3.8/lib/python/site-packages/scrapy/cmdline.py", line 142, in execute
_run_print_help(parser, cmd.process_options, args, opts)
File "/Users/noname/Library/Python/3.8/lib/python/site-packages/scrapy/cmdline.py", line 100, in _run_print_help
func(*a, **kw)
File "/Users/noname/Library/Python/3.8/lib/python/site-packages/scrapy/commands/__init__.py", line 130, in process_options
if opts.output or opts.overwrite_output:
AttributeError: 'Values' object has no attribute 'overwrite_output'
Crawling is very fast locally, but after deploying to a Tencent Cloud machine it is extremely slow, only a few posts per minute. Has anyone hit the same problem?
After crawling just a few pages, this error starts appearing (orz):
2021-04-14 19:48:51 [scrapy.core.scraper] ERROR: Spider error processing <GET https://tieba.baidu.com/p/totalComment?tid=6423482591&fid=1&pn=33&red_tag=2930056521> (referer: None)
Traceback (most recent call last):
File "h:\python38\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
yield next(it)
File "h:\python38\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "h:\python38\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
return next(self.data)
File "h:\python38\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "h:\python38\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "h:\python38\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "h:\python38\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
return (_set_referer(r) for r in result or ())
File "h:\python38\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "h:\python38\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
return (r for r in result or () if _filter(r))
File "h:\python38\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "h:\python38\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "h:\python38\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
for r in iterable:
File "C:\Users\Administrator\Desktop\Tieba_Spider-master\tieba\spiders\tieba_spider.py", line 92, in parse_totalComment
for value in comment_list.values():
AttributeError: 'list' object has no attribute 'values'
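The totalComment endpoint appears to return `comment_list` as an empty list when a page has no sub-comments, and as a dict keyed by post id otherwise, which matches the crash above. A defensive sketch (the surrounding spider code is assumed, not shown here):

```python
# Guard against Baidu returning [] (or null) instead of {} for comment_list.
def iter_comment_lists(comment_list):
    if not isinstance(comment_list, dict):
        return  # empty list or None: nothing to iterate
    for value in comment_list.values():
        yield value

print(list(iter_comment_lists([])))                    # []
print(list(iter_comment_lists({"123": {"num": 2}})))   # [{'num': 2}]
```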
For example, MySQLdb.
I installed the latest Scrapy. The first run worked but reported an error and the crawl was incomplete; it said the make_requests_from_url method is deprecated. After that, rerunning did nothing at all: MySQL creates the database, but no data is fetched. I cannot write code myself; please help.
[py.warnings] WARNING: /usr/local/lib/python3.8/site-packages/scrapy/spiders/__init__.py:81: UserWarning: Spider.make_requests_from_url method is deprecated: it will be removed and not be called by the default Spider.start_requests method in future Scrapy releases. Please override Spider.start_requests method instead. warnings.warn(
log
2020-08-15 08:15:38 2020-08-15 08:25:33 594.7 尿毒症 ESRD 1~228 None
2020-08-15 08:26:35 2020-08-15 08:26:36 0.7146 尿毒症 NiaoduSy None None
Thank you very much for sharing this spider; I have genuinely benefited from it. Thanks again!
I am sorry to bother you as a beginner, but I have two questions:
1. I only crawled 1000 rows, and all three tables have exactly 1000 rows. The tieba I crawled has 34 pages in total, and every page was indeed visited, yet there are only 1000 rows. I am puzzled: is a limit configured somewhere, or did I do something wrong?
2. Since the code does not crawl the posting time of the main thread, how should I set that up?
Looking forward to your reply. Thank you!
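Regarding question 2, one possible approach (the exact Tieba markup here is an assumption, not taken from the project): each post node carries a `data-field` attribute containing JSON, and the first floor's date can stand in for the thread's creation time. Parsing such an attribute:

```python
# Parse a (hypothetical) data-field attribute from a post node; the
# JSON below is invented to illustrate the shape, not real Tieba output.
import json

data_field = '{"content": {"post_id": 1, "date": "2016-05-14 11:09"}}'
info = json.loads(data_field)
print(info["content"]["date"])  # 2016-05-14 11:09
```

The parsed date would then be stored as an extra field on the thread item and an extra column in the thread table.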
It started running at 16:34.
2019-11-08 16:47:09 [scrapy.core.scraper] ERROR: Spider error processing <GET https://tieba.baidu.com/p/totalComment?tid=5256877623&fid=1&pn=1&red_tag=2829333245> (referer: http://tieba.baidu.com/p/5256877623)
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/core/spidermw.py", line 84, in _evaluate_iterable
for r in iterable:
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/core/spidermw.py", line 84, in _evaluate_iterable
for r in iterable:
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/core/spidermw.py", line 84, in _evaluate_iterable
for r in iterable:
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/core/spidermw.py", line 84, in _evaluate_iterable
for r in iterable:
File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/root/Tieba_Spider/tieba/spiders/tieba_spider.py", line 82, in parse_comment
for value in comment_list.values():
AttributeError: 'list' object has no attribute 'values'
As the title says: you need to switch to a mobile-browser UA before data such as mp3 audio is returned.
Crawling Tieba triggers its anti-crawl mechanism.
E:\Pycharmprojects\tieba>scrapy run 仙五前修改 Pal5Q_Diy
Traceback (most recent call last):
File "E:\anaconda3\Scripts\scrapy-script.py", line 10, in <module>
sys.exit(execute())
File "E:\anaconda3\lib\site-packages\scrapy\cmdline.py", line 142, in execute
_run_print_help(parser, cmd.process_options, args, opts)
File "E:\anaconda3\lib\site-packages\scrapy\cmdline.py", line 100, in _run_print_help
func(*a, **kw)
File "E:\anaconda3\lib\site-packages\scrapy\commands\__init__.py", line 130, in process_options
if opts.output or opts.overwrite_output:
AttributeError: 'Values' object has no attribute 'overwrite_output'
It looks like crawling too much data triggered a captcha check. Any help appreciated.
I tried it: if I set the user-agent to a real browser's UA, the first request gets redirected to a strange address, and after that no data can be crawled.
As the title says. The intent is to avoid dredging up long-dead threads; besides, for a tieba created long ago, the crawled data would simply be too large.
The anti-crawl mechanism triggered after 22 pages.
Hi, what should I do if Baidu bans my IP?
I tried several other sites and they all worked; only this one fails.
It ends in an error. MySQL 5.7, the database has been created, and config.json is configured. Could you take a look and help me solve it?
Traceback (most recent call last):
File "c:\users\hp envy\appdata\local\programs\python\python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\users\hp envy\appdata\local\programs\python\python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\HP ENVY\AppData\Local\Programs\Python\Python36\Scripts\scrapy.exe\__main__.py", line 7, in <module>
File "c:\users\hp envy\appdata\local\programs\python\python36\lib\site-packages\scrapy\cmdline.py", line 145, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "c:\users\hp envy\appdata\local\programs\python\python36\lib\site-packages\scrapy\cmdline.py", line 100, in _run_print_help
func(*a, **kw)
File "c:\users\hp envy\appdata\local\programs\python\python36\lib\site-packages\scrapy\cmdline.py", line 153, in _run_command
cmd.run(args, opts)
File "C:\Users\HP ENVY\Downloads\tie\tieba\commands\run.py", line 58, in run
cfg = config.config()
AttributeError: module 'config' has no attribute 'config'
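One common cause of this error (an assumption; the traceback alone does not prove it) is that a third-party package named `config` on sys.path shadows the project's own config module, so `config.config()` is looked up on the wrong module. Checking which module `import config` would actually resolve to:

```python
# Locate the module that `import config` would resolve to, without
# importing it.
import importlib.util

spec = importlib.util.find_spec("config")
if spec is None:
    print("no module named 'config' on sys.path")
else:
    # Should point into the Tieba_Spider project, not site-packages.
    print(spec.origin)
```

If it points into site-packages, uninstalling the stray package or running `scrapy` from the project root usually resolves the clash.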
May I ask what the crawled data is sorted by? The amount of data crawled does not match the number of posts shown in the tieba.
Today I tried to use this program to crawl the digest (精品) threads of a tieba I follow. It went smoothly at first, but on the last page (page 25) it hung and printed: 2020-07-18 12:00:43 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://tieba.baidu.com/p/3334670701?red_tag=3462644114> (failed 3 times): 500 Internal Server Error
I then tried opening that thread URL directly, and it would not load; I also searched the HTML of the last page of the digest list for this link and found no match. In the end I had to force-quit with Ctrl+C. There is too much thread data for me to verify whether every thread was pulled down.
Console log (second half):
Crawling page 20...
Crawling page 21...
Crawling page 22...
Crawling page 23...
Crawling page 24...
Crawling page 25...
2020-07-18 12:00:43 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying
<GET https://tieba.baidu.com/p/3334670701?red_tag=3462644114> (failed 3 times):
500 Internal Server Error
[I pressed Ctrl+C here]
config.json configuration:
{
"DEFAULT_TIEBA": "ballance",
"MYSQL_PASSWD": "*****",
"MYSQL_DBNAME": {
"ballance": "tieba_backups"
},
"MYSQL_USER": "*****",
"MYSQL_HOST": "127.0.0.1",
"MYSQL_PORT": ****
}
Environment:
Command run: scrapy run -g
The server is Tencent Cloud.
The error message is as follows:
File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
for x in result:
File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\surface\Desktop\Tieba_Spider\tieba\spiders\tieba_spider.py", line 82, in parse_comment
for value in comment_list.values():
AttributeError: 'list' object has no attribute 'values'
(venv) woody@Woody-MacBookPro tieba-spider % scrapy run java woody
Traceback (most recent call last):
File "/Users/woody/workspace/life/tieba-spider/venv/bin/scrapy", line 8, in <module>
sys.exit(execute())
File "/Users/woody/workspace/life/tieba-spider/venv/lib/python3.10/site-packages/scrapy/cmdline.py", line 140, in execute
cmd.add_options(parser)
File "/Users/woody/workspace/life/tieba-spider/tieba/commands/run.py", line 19, in add_options
parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
AttributeError: 'ArgumentParser' object has no attribute 'add_option'. Did you mean: '_add_action'?
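Scrapy 2.6 moved its command-line handling from optparse to argparse, so the project's custom command crashes on `parser.add_option`. The fix is mechanical: `add_option` becomes `add_argument` (the option below is copied from the traceback):

```python
# argparse has no add_option(); add_argument() accepts the same core fields.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-a", dest="spargs", action="append", default=[],
                    metavar="NAME=VALUE",
                    help="set spider argument (may be repeated)")
opts = parser.parse_args(["-a", "name=value"])
print(opts.spargs)  # ['name=value']
```

Alternatively, pinning a pre-2.6 Scrapy release keeps the original optparse-based code working.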
2017-07-21 21:37:53 [scrapy.core.scraper] ERROR: Spider error processing <GET https://tieba.baidu.com/f?kw=%E4%BB%99%E5%89%915&pn=0> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/root/Tieba_Spider/tieba/spiders/tieba_spider.py", line 41, in parse
yield self.make_requests_from_url(next_page.extract_first())
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 87, in make_requests_from_url
return Request(url, dont_filter=True)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
self._set_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 58, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: //tieba.baidu.com/f?kw=%E4%BB%99%E5%89%915&ie=utf-8&pn=50
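The extracted next-page href is protocol-relative (`//tieba.baidu.com/...`), which `Request` rejects because it carries no scheme. Resolving it against the page URL fixes this; inside a Scrapy callback, `response.urljoin(href)` does the same thing as this stdlib sketch:

```python
# Resolve a protocol-relative href against the current page URL.
from urllib.parse import urljoin

base = "https://tieba.baidu.com/f?kw=%E4%BB%99%E5%89%915&pn=0"
href = "//tieba.baidu.com/f?kw=%E4%BB%99%E5%89%915&ie=utf-8&pn=50"
print(urljoin(base, href))
# https://tieba.baidu.com/f?kw=%E4%BB%99%E5%89%915&ie=utf-8&pn=50
```

Since `make_requests_from_url` is also deprecated, yielding `scrapy.Request(response.urljoin(href), dont_filter=True)` directly from the callback avoids both problems at once.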
https://tieba.baidu.com/f?kw=%E7%BD%AA%E6%81%B6%E8%A3%85%E5%A4%87&ie=utf-8&pn=10010
https://tieba.baidu.com/f?kw=%E7%BD%AA%E6%81%B6%E8%A3%85%E5%A4%87&ie=utf-8&pn=0
https://tieba.baidu.com/f?kw=%E7%BD%AA%E6%81%B6%E8%A3%85%E5%A4%87&ie=utf-8&pn=9999
Have a look: when pn=9999 you can still see old threads, but once it goes past 10000 it falls back to the first page and loops. So where did those older threads go, and how can they be fetched? Much appreciated.
That is: first crawl the posts I made and replied to myself, then call the delete API on them.
For the second tieba only the tables were created; they contain no data.
--- ---
/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py:246:inContext
/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py:262:<lambda>
/usr/local/lib/python2.7/dist-packages/twisted/python/context.py:118:callWithContext
/usr/local/lib/python2.7/dist-packages/twisted/python/context.py:81:callWithContext
/usr/local/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py:477:_runInteraction
/usr/local/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py:467:_runInteraction
/home/yangyilang/Tieba_Spider/tieba/pipelines.py:64:insert_thread
/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py:159:execute
/usr/lib/python2.7/dist-packages/MySQLdb/connections.py:264:literal
/usr/lib/python2.7/dist-packages/MySQLdb/connections.py:202:unicode_literal
] when dealing with item: {'author': u'LoserPanshao',
'good': False,
'id': 5009721898L,
'reply_num': 2,
'title': u'\u5982\u679c\u4e16\u754c\u6f06\u9ed1\u5176\u5b9e\u4f60\u5f88\u7f8e'}
Python 3.7 and MySQL 5.7 are installed.
It turned out to be my mistake: I had installed the wrong dependency.
It is running now; if anything else comes up I will ask again.
I also like single-player games.
Environment: Windows 10
Command: scrapy run xxx xxx
Error:
Traceback (most recent call last):
File "c:\python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Python39\Scripts\scrapy.exe\__main__.py", line 7, in <module>
File "c:\python39\lib\site-packages\scrapy\cmdline.py", line 142, in execute
_run_print_help(parser, cmd.process_options, args, opts)
File "c:\python39\lib\site-packages\scrapy\cmdline.py", line 100, in _run_print_help
func(*a, **kw)
File "c:\python39\lib\site-packages\scrapy\commands\__init__.py", line 130, in process_options
if opts.output or opts.overwrite_output:
AttributeError: 'Values' object has no attribute 'overwrite_output'
PS F:\code\Tieba_Spider> scrapy run
Traceback (most recent call last):
File "c:\python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Python39\Scripts\scrapy.exe\__main__.py", line 7, in <module>
File "c:\python39\lib\site-packages\scrapy\cmdline.py", line 142, in execute
_run_print_help(parser, cmd.process_options, args, opts)
File "c:\python39\lib\site-packages\scrapy\cmdline.py", line 100, in _run_print_help
func(*a, **kw)
File "c:\python39\lib\site-packages\scrapy\commands\__init__.py", line 130, in process_options
if opts.output or opts.overwrite_output:
AttributeError: 'Values' object has no attribute 'overwrite_output'
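`opts.overwrite_output` is read by newer Scrapy's base `process_options`, but the attribute only exists if the base `add_options` ran and registered `-O/--overwrite-output`. A custom command that registers its own option set without invoking the base implementation produces exactly this AttributeError. A sketch of the shape of the fix, using stand-in classes instead of real Scrapy imports (the real base class would be `scrapy.commands.ScrapyCommand`):

```python
import argparse

class BaseCommand:  # stand-in for scrapy.commands.ScrapyCommand
    def add_options(self, parser):
        parser.add_argument("-O", "--overwrite-output",
                            dest="overwrite_output", default=None)

class RunCommand(BaseCommand):
    def add_options(self, parser):
        BaseCommand.add_options(self, parser)  # the call that must not be skipped
        parser.add_argument("-a", dest="spargs", action="append", default=[])

parser = argparse.ArgumentParser()
RunCommand().add_options(parser)
opts = parser.parse_args([])
print(opts.overwrite_output)  # None, but the attribute now exists
```

Alternatively, pinning the Scrapy version the project was written against sidesteps the incompatibility entirely.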
2018-06-11 11:14:58 [tieba] ERROR: Insert to database error: [Failure instance: Traceback: <class '_mysql_exceptions.OperationalError'>: (2019, "Can't initialize character set utf8mb4 (path: C:\mysql\share\charsets\)")
c:\python\anaconda\lib\threading.py:801:__bootstrap_inner
c:\python\anaconda\lib\threading.py:754:run
c:\python\anaconda\lib\site-packages\twisted\_threads\_threadworker.py:46:work
c:\python\anaconda\lib\site-packages\twisted\_threads\_team.py:190:doWork
--- ---
c:\python\anaconda\lib\site-packages\twisted\python\threadpool.py:250:inContext
c:\python\anaconda\lib\site-packages\twisted\python\threadpool.py:266:<lambda>
c:\python\anaconda\lib\site-packages\twisted\python\context.py:122:callWithContext
c:\python\anaconda\lib\site-packages\twisted\python\context.py:85:callWithContext
c:\python\anaconda\lib\site-packages\twisted\enterprise\adbapi.py:464:runInteraction
c:\python\anaconda\lib\site-packages\twisted\enterprise\adbapi.py:36:__init__
c:\python\anaconda\lib\site-packages\twisted\enterprise\adbapi.py:76:reconnect
c:\python\anaconda\lib\site-packages\twisted\enterprise\adbapi.py:431:connect
c:\python\anaconda\lib\site-packages\MySQLdb\__init__.py:81:Connect
c:\python\anaconda\lib\site-packages\MySQLdb\connections.py:221:__init__
c:\python\anaconda\lib\site-packages\MySQLdb\connections.py:312:set_character_set
] when dealing with item: {'author': u'\u5706\u53c8\u5706_00',
'content': u'\u4e13\u4e1a\u7b5b\u67e5\u662f\u4ec0\u4e48\uff1f',
'id': u'89618631379',
'post_id': u'10404126902',
'time': '2016-05-14 11:09:24'}
pip show MySQL-python
Name: MySQL-python
Version: 1.2.5
Summary: Python interface to MySQL
Home-page: https://github.com/farcepest/MySQLdb1
Author: Andy Dustman
Author-email: [email protected]
License: GPL
Location: c:\python\anaconda\lib\site-packages
Requires:
Required-by:
MySQL 8.0.
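MySQL-python 1.2.5 (last released in 2014) predates utf8mb4 support in the client library it ships with, which is the likely source of the "Can't initialize character set utf8mb4" error above; the maintained fork mysqlclient handles it. A guarded check of which driver is present (an illustration, not project code):

```python
# Report which MySQLdb-compatible driver is importable, if any.
# mysqlclient (the maintained fork) reports version_info 1.3.x or 2.x;
# the abandoned MySQL-python reports 1.2.x.
import importlib.util

def mysql_driver_version():
    """Return MySQLdb's version_info tuple, or None if no driver is installed."""
    if importlib.util.find_spec("MySQLdb") is None:
        return None
    import MySQLdb
    return MySQLdb.version_info

print(mysql_driver_version())
```

Replacing the driver with mysqlclient, or falling back to `charset='utf8'` in the connection settings, are the two usual ways out.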
It jumps to a captcha verification page: https://wappass.baidu.com/static/captcha/tuxing.html
Setting the browser's cookies on the request does not get past it either.
As the title says.