
tieba_spider's People

Contributors

aqua-dream, sumrise


tieba_spider's Issues

Add cookie support

Some threads or floors are restricted by Baidu and only visible after logging into the right account, so there is a real need to crawl with cookies. The cookie looks something like:

timeShow=
BAIDUID=
TIEBA_USERTYPE=
TIEBAUID=
pgv_pvi=
bdshare_firstime=
BAIDU_WISE_UID=
IS_NEW_USER=
SEENKW=
BDUSS=
BDUSS_BFESS=
BDORZ=
STOKEN=
Hm_lvt_xxx=
wise_device=
Hm_lpvt_xxx=
st_data=
st_key_id=
st_sign=

Of these, the important ones are presumably BDUSS or STOKEN. But even after copying the entire cookie into a logged-out browser, I still could not see the content (it still appeared logged out).

If anyone has gotten this to work, please share.
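In Scrapy, a cookie string like the one above can be split into a dict and passed through the `cookies` argument of `scrapy.Request`. A minimal sketch — the helper name and the sample values below are illustrative, not from the project:

```python
def parse_cookie_string(raw):
    """Split a browser 'Cookie:' header value into a dict usable by Scrapy."""
    cookies = {}
    for pair in raw.split(";"):
        if "=" in pair:
            name, _, value = pair.strip().partition("=")
            cookies[name] = value
    return cookies

# Hypothetical usage inside a Scrapy spider (not executed here):
#   yield scrapy.Request(url, cookies=parse_cookie_string(raw), callback=self.parse)

cookies = parse_cookie_string("BDUSS=abc123; STOKEN=def456")
print(cookies["BDUSS"])  # abc123
```

Whether Baidu honors the cookie server-side for restricted floors is a separate question; this only shows how to attach it to the requests.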

AttributeError: module 'config' has no attribute 'config'

Hi! I am trying to use your scraper, but I keep running into this problem:

File "/Users/.../PycharmProjects/BD_scraper/Tieba_Spider/tieba/commands/run.py", line 59, in run
cfg = config.config()
AttributeError: module 'config' has no attribute 'config'

Can you help me solve this?
Thank you!
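This error usually means Python imported a different module named `config` than the project's own `config.py` — for example, another `config` package earlier on `sys.path`, or the command being run outside the project root (an assumption; the traceback alone cannot tell). Checking which file actually got imported narrows it down; a sketch:

```python
import importlib.util

def locate_module(name):
    """Return the file Python would import for `name`, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None

# Run this from the Tieba_Spider directory: if the printed path is not the
# project's own config.py, a different `config` module is shadowing it.
print(locate_module("config"))
```

If a shadowing module is found, renaming it or running `scrapy run` from the project root should make the project's `config.py` win.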

Is this being blocked?

start_time end_time elapsed_time tieba_name database_name pages etc
2020-09-24 11:19:15 2020-09-24 11:19:16 0.57 xxxxx tieba_gdupt None None

mac + anaconda: setting up the environment and running

Hello,

I installed the packages you listed (scrapy + mysqlclient + beautifulsoup) under Anaconda, with versions above your requirements. Opening the project in Spyder (Python 3.6) and running it fails with the error below:

screen shot 2018-07-09 at 17 13 26

If I run it from iTerm2 instead, the error is:
screen shot 2018-07-09 at 17 08 42

My MySQL database runs inside XAMPP; viewing it with Navicat shows the following:
screen shot 2018-07-09 at 17 16 12

My configuration file is:
screen shot 2018-07-09 at 17 20 47

What is causing my problem? Searching Baidu turned up nothing. Thanks for any reply.

AttributeError: 'Values' object has no attribute 'overwrite_output'

✗ scrapy run
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 8, in <module>
    sys.exit(execute())
  File "/Users/noname/Library/Python/3.8/lib/python/site-packages/scrapy/cmdline.py", line 142, in execute
    _run_print_help(parser, cmd.process_options, args, opts)
  File "/Users/noname/Library/Python/3.8/lib/python/site-packages/scrapy/cmdline.py", line 100, in _run_print_help
    func(*a, **kw)
  File "/Users/noname/Library/Python/3.8/lib/python/site-packages/scrapy/commands/__init__.py", line 130, in process_options
    if opts.output or opts.overwrite_output:
AttributeError: 'Values' object has no attribute 'overwrite_output'
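Scrapy 2.4 added the `-O/--overwrite-output` option to its base command classes. If a custom command's `add_options` does not delegate to the base class, `process_options` later reads an option that was never registered — a likely cause here, though not confirmable from the traceback alone. The delegation pattern, demonstrated with plain stand-in classes (not Scrapy's real `ScrapyCommand`):

```python
# Stand-ins for ScrapyCommand and the project's custom command: the point is
# the super().add_options(parser) call, which the failing setup presumably lacks.
class BaseCommand:
    def add_options(self, parser):
        parser["overwrite_output"] = False  # base class registers newer options

class RunCommand(BaseCommand):
    def add_options(self, parser):
        super().add_options(parser)         # <- delegate first
        parser["spargs"] = []               # then add the command's own options

opts = {}
RunCommand().add_options(opts)
print(sorted(opts))  # ['overwrite_output', 'spargs']
```

The other workaround commonly reported for unmaintained custom commands is pinning an older Scrapy release that predates the new option.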

AttributeError: 'list' object has no attribute 'values' — please help

After crawling a few pages it starts throwing this error:


2021-04-14 19:48:51 [scrapy.core.scraper] ERROR: Spider error processing <GET https://tieba.baidu.com/p/totalComment?tid=6423482591&fid=1&pn=33&red_tag=2930056521> (referer: None)
Traceback (most recent call last):
  File "h:\python38\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "h:\python38\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "h:\python38\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "h:\python38\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "h:\python38\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "h:\python38\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "h:\python38\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "h:\python38\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "h:\python38\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "h:\python38\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "h:\python38\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "h:\python38\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\Administrator\Desktop\Tieba_Spider-master\tieba\spiders\tieba_spider.py", line 92, in parse_totalComment
    for value in comment_list.values():
AttributeError: 'list' object has no attribute 'values'
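A type guard before calling `.values()` avoids this crash. Judging from the tracebacks in this and similar issues, the totalComment endpoint appears to return an empty JSON *array* instead of an object when a page has no comments — an inference from the reports, not documented Baidu behavior. A sketch (the helper name is illustrative):

```python
import json

def iter_comments(comment_list):
    """Baidu's totalComment data is normally a dict keyed by post id, but is
    an empty list when there are no comments — guard before .values()."""
    if not isinstance(comment_list, dict):
        return []                  # empty list / None -> nothing to iterate
    return list(comment_list.values())

data = json.loads('{"data": {"comment_list": []}}')
print(iter_comments(data["data"]["comment_list"]))  # []
```

In `tieba_spider.py`, the same check would wrap the `for value in comment_list.values():` loop.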

The make_requests_from_url method is deprecated in new Scrapy versions; the code no longer runs

I installed the latest Scrapy. The first run worked but reported an error and the crawl was incomplete, saying the make_requests_from_url method is deprecated. After that, re-running does nothing: MySQL creates the database but no data is fetched. I can't write code — please help.

[py.warnings] WARNING: /usr/local/lib/python3.8/site-packages/scrapy/spiders/__init__.py:81: UserWarning: Spider.make_requests_from_url method is deprecated: it will be removed and not be called by the default Spider.start_requests method in future Scrapy releases. Please override Spider.start_requests method instead. warnings.warn(

log
2020-08-15 08:15:38 2020-08-15 08:25:33 594.7 尿毒症 ESRD 1~228 None
2020-08-15 08:26:35 2020-08-15 08:26:36 0.7146 尿毒症 NiaoduSy None None
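The deprecated helper was only a thin wrapper around `Request(url, dont_filter=True)`, so overriding `start_requests` as the warning suggests is mechanical. A sketch with a stand-in `Request` class so it runs standalone; in the spider itself, `scrapy.Request` replaces the stand-in:

```python
class Request:
    """Stand-in for scrapy.Request, only so this sketch is self-contained."""
    def __init__(self, url, dont_filter=False, callback=None):
        self.url, self.dont_filter, self.callback = url, dont_filter, callback

def start_requests(start_urls, callback):
    # What Spider.make_requests_from_url did for each start URL:
    for url in start_urls:
        yield Request(url, dont_filter=True, callback=callback)

reqs = list(start_requests(["https://tieba.baidu.com/f?kw=test"], callback=None))
print(reqs[0].url)  # https://tieba.baidu.com/f?kw=test
```

Other call sites such as `yield self.make_requests_from_url(next_page.extract_first())` translate the same way, to `yield scrapy.Request(next_page.extract_first(), dont_filter=True, callback=self.parse)`.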

Two questions for you

Thank you very much for sharing this spider; I have genuinely benefited from it.
But I am a beginner and have two questions:
1. I only got 1000 rows of data — all three tables have exactly 1000 rows. The tieba I crawled has 34 pages, and every page was indeed crawled, yet only 1000 rows came out. Is there a threshold set somewhere, or is it something I did?
2. The code does not crawl the posting time of the main thread; how can I add that?
Looking forward to your reply. Thank you!

Error during the run: ERROR: Spider error processing

The run started at 16:34.

2019-11-08 16:47:09 [scrapy.core.scraper] ERROR: Spider error processing <GET https://tieba.baidu.com/p/totalComment?tid=5256877623&fid=1&pn=1&red_tag=2829333245> (referer: http://tieba.baidu.com/p/5256877623)
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/core/spidermw.py", line 84, in evaluate_iterable
    for r in iterable:
  File "/root/anaconda3/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/root/Tieba_Spider/tieba/spiders/tieba_spider.py", line 82, in parse_comment
    for value in comment_list.values():
AttributeError: 'list' object has no attribute 'values'

A problem when running the project

The first time I ran the project, after editing config.json, I got the following error. I am using Scrapy 2.4.0.

E:\Pycharmprojects\tieba>scrapy run 仙五前修改 Pal5Q_Diy
Traceback (most recent call last):
  File "E:\anaconda3\Scripts\scrapy-script.py", line 10, in <module>
    sys.exit(execute())
  File "E:\anaconda3\lib\site-packages\scrapy\cmdline.py", line 142, in execute
    _run_print_help(parser, cmd.process_options, args, opts)
  File "E:\anaconda3\lib\site-packages\scrapy\cmdline.py", line 100, in _run_print_help
    func(*a, **kw)
  File "E:\anaconda3\lib\site-packages\scrapy\commands\__init__.py", line 130, in process_options
    if opts.output or opts.overwrite_output:
AttributeError: 'Values' object has no attribute 'overwrite_output'

First day using Python; all dependencies installed; ran scrapy run 沙发 aa

It errors out. The MySQL 5.7 database is created and config.json has been edited. Please help me figure out what's wrong.

Traceback (most recent call last):
  File "c:\users\hp envy\appdata\local\programs\python\python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\hp envy\appdata\local\programs\python\python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\HP ENVY\AppData\Local\Programs\Python\Python36\Scripts\scrapy.exe\__main__.py", line 7, in <module>
  File "c:\users\hp envy\appdata\local\programs\python\python36\lib\site-packages\scrapy\cmdline.py", line 145, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "c:\users\hp envy\appdata\local\programs\python\python36\lib\site-packages\scrapy\cmdline.py", line 100, in _run_print_help
    func(*a, **kw)
  File "c:\users\hp envy\appdata\local\programs\python\python36\lib\site-packages\scrapy\cmdline.py", line 153, in _run_command
    cmd.run(args, opts)
  File "C:\Users\HP ENVY\Downloads\tie\tieba\commands\run.py", line 58, in run
    cfg = config.config()
AttributeError: module 'config' has no attribute 'config'

The crawl produces 500 Internal Server Error

Today I tried using this program to crawl the featured (精品) threads of a tieba I follow. Everything went smoothly until the last page (page 25), where it got stuck and printed: 2020-07-18 12:00:43 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://tieba.baidu.com/p/3334670701?red_tag=3462644114> (failed 3 times): 500 Internal Server Error

I then tried opening that thread link directly and it would not load. I also searched the HTML of the last page of the featured list for that link and found no match. In the end I had to force-quit with Ctrl+C. There is too much data for me to verify whether every thread was downloaded.

Console log (second half):

Crawling page 20...
Crawling page 21...
Crawling page 22...
Crawling page 23...
Crawling page 24...
Crawling page 25...
2020-07-18 12:00:43 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying
 <GET https://tieba.baidu.com/p/3334670701?red_tag=3462644114> (failed 3 times):
 500 Internal Server Error

[我在这里Ctrl+C了]

config.json:

{
    "DEFAULT_TIEBA": "ballance",
    "MYSQL_PASSWD": "*****",
    "MYSQL_DBNAME": {
        "ballance": "tieba_backups"
    },
    "MYSQL_USER": "*****",
    "MYSQL_HOST": "127.0.0.1",
    "MYSQL_PORT": ****
}

Environment:

  • Python: Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 07:18:10) [MSC v.1900 32 bit (Intel)] on win32
  • beautifulsoup4: 4.9.1
  • scrapy: 2.2.1
  • mysqlclient: 1.4.6
  • mysql: 8.0.17
  • OS: Windows Server 2008 R2 Enterprise Service Pack 1 x64

Command run: scrapy run -g

The server is on Tencent Cloud.
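If the 500 comes from a thread that has since been deleted — which the failed direct visit suggests, though it is not certain — the retries only stall the crawl. One way to move past such threads is to adjust the retry behavior in the project's settings.py; a sketch, not the project's shipped configuration:

```python
# settings.py: stop retrying 500s (Scrapy retries them by default) and hand
# the 500 response to the spider instead of silently dropping it.
RETRY_HTTP_CODES = [502, 503, 504, 408]
HTTPERROR_ALLOWED_CODES = [500]
```

With `HTTPERROR_ALLOWED_CODES`, the spider callback then sees the 500 response and should check `response.status` before parsing.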

AttributeError: 'list' object has no attribute 'values' while crawling

The error is as follows:
  File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\ProgramData\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\surface\Desktop\Tieba_Spider\tieba\spiders\tieba_spider.py", line 82, in parse_comment
    for value in comment_list.values():
AttributeError: 'list' object has no attribute 'values'

I can't see what the problem is — help

(venv) woody@Woody-MacBookPro tieba-spider % scrapy run java woody
Traceback (most recent call last):
File "/Users/woody/workspace/life/tieba-spider/venv/bin/scrapy", line 8, in
sys.exit(execute())
File "/Users/woody/workspace/life/tieba-spider/venv/lib/python3.10/site-packages/scrapy/cmdline.py", line 140, in execute
cmd.add_options(parser)
File "/Users/woody/workspace/life/tieba-spider/tieba/commands/run.py", line 19, in add_options
parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
AttributeError: 'ArgumentParser' object has no attribute 'add_option'. Did you mean: '_add_action'?
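Newer Scrapy releases hand custom commands an `argparse.ArgumentParser` instead of the old optparse parser, so the `parser.add_option(...)` calls in `tieba/commands/run.py` need to become `parser.add_argument(...)` — for this particular option, with the same keyword arguments. The failing option, translated:

```python
import argparse

# optparse's parser.add_option(...) becomes add_argument(...) here unchanged:
parser = argparse.ArgumentParser()
parser.add_argument("-a", dest="spargs", action="append", default=[],
                    metavar="NAME=VALUE")

opts = parser.parse_args(["-a", "key=value", "-a", "x=1"])
print(opts.spargs)  # ['key=value', 'x=1']
```

Not every optparse signature maps one-to-one (e.g. `type="int"` becomes `type=int`), so each `add_option` call in the command should be checked individually.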

Running Tieba_Spider gives the following error, which I have not been able to resolve. Please help me find the cause, thanks!

2017-07-21 21:37:53 [scrapy.core.scraper] ERROR: Spider error processing <GET https://tieba.baidu.com/f?kw=%E4%BB%99%E5%89%915&pn=0> (referer: None)
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/root/Tieba_Spider/tieba/spiders/tieba_spider.py", line 41, in parse
    yield self.make_requests_from_url(next_page.extract_first())
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 87, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: //tieba.baidu.com/f?kw=%E4%BB%99%E5%89%915&ie=utf-8&pn=50
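The next-page href Tieba emits here is protocol-relative (`//tieba.baidu.com/...`), which Scrapy rejects as having no scheme. Joining it against the current response URL restores the scheme; in a spider that is `response.urljoin(next_page.extract_first())`. The same resolution, shown with the standard library:

```python
from urllib.parse import urljoin

# A protocol-relative href inherits the scheme of the page it came from:
base = "https://tieba.baidu.com/f?kw=%E4%BB%99%E5%89%915&pn=0"
next_href = "//tieba.baidu.com/f?kw=%E4%BB%99%E5%89%915&ie=utf-8&pn=50"
print(urljoin(base, next_href))
# https://tieba.baidu.com/f?kw=%E4%BB%99%E5%89%915&ie=utf-8&pn=50
```

So the fix at `tieba_spider.py` line 41 is to wrap the extracted URL in `response.urljoin(...)` before handing it to the request constructor.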

Another problem: in big tiebas with many threads, once the crawl reaches pn values above 10000, the thread list starts looping — are the old threads gone?

https://tieba.baidu.com/f?kw=%E7%BD%AA%E6%81%B6%E8%A3%85%E5%A4%87&ie=utf-8&pn=10010

https://tieba.baidu.com/f?kw=%E7%BD%AA%E6%81%B6%E8%A3%85%E5%A4%87&ie=utf-8&pn=0

https://tieba.baidu.com/f?kw=%E7%BD%AA%E6%81%B6%E8%A3%85%E5%A4%87&ie=utf-8&pn=9999

Take a look: with pn=9999 you can still see old threads, but past 10000 it turns back into the front page and the crawl loops. Where did those old threads go, and how can I fetch them? Much appreciated.
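If the listing really does wrap around past pn=10000 (Baidu appears to cap the browsable depth of a forum; this is an observation from the issue, not documented behavior), the crawler can at least avoid the infinite loop by refusing to follow pagination past that depth. A sketch, with an illustrative helper name:

```python
from urllib.parse import parse_qs, urlparse

def should_follow(next_url, max_pn=10000):
    """Stop following Tieba pagination once pn reaches the observed cap,
    past which the listing silently wraps back to the front page."""
    pn = int(parse_qs(urlparse(next_url).query).get("pn", ["0"])[0])
    return pn < max_pn

print(should_follow("https://tieba.baidu.com/f?kw=x&ie=utf-8&pn=10010"))  # False
```

Threads beyond that depth are not reachable through the listing itself; they can still be fetched individually by thread id if the ids are known from elsewhere.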

MySQL blew up

--- ---
/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py:246:inContext
/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py:262:
/usr/local/lib/python2.7/dist-packages/twisted/python/context.py:118:callWithContext
/usr/local/lib/python2.7/dist-packages/twisted/python/context.py:81:callWithContext
/usr/local/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py:477:_runInteraction
/usr/local/lib/python2.7/dist-packages/twisted/enterprise/adbapi.py:467:_runInteraction
/home/yangyilang/Tieba_Spider/tieba/pipelines.py:64:insert_thread
/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py:159:execute
/usr/lib/python2.7/dist-packages/MySQLdb/connections.py:264:literal
/usr/lib/python2.7/dist-packages/MySQLdb/connections.py:202:unicode_literal
] when dealing with item: {'author': u'LoserPanshao',
'good': False,
'id': 5009721898L,
'reply_num': 2,
'title': u'\u5982\u679c\u4e16\u754c\u6f06\u9ed1\u5176\u5b9e\u4f60\u5f88\u7f8e'}

Usage on Windows 10?

Installed Python 3.7 and MySQL 5.7.

It turned out to be my own mistake: I had installed the wrong dependencies.
It is running now; if anything comes up I'll ask again.

I also like single-player games.

Error on startup

Environment: Windows 10
Command: scrapy run xxx xxx
Error:

Traceback (most recent call last):
  File "c:\python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Python39\Scripts\scrapy.exe\__main__.py", line 7, in <module>
  File "c:\python39\lib\site-packages\scrapy\cmdline.py", line 142, in execute
    _run_print_help(parser, cmd.process_options, args, opts)
  File "c:\python39\lib\site-packages\scrapy\cmdline.py", line 100, in _run_print_help
    func(*a, **kw)
  File "c:\python39\lib\site-packages\scrapy\commands\__init__.py", line 130, in process_options
    if opts.output or opts.overwrite_output:
AttributeError: 'Values' object has no attribute 'overwrite_output'
PS F:\code\Tieba_Spider> scrapy run
Traceback (most recent call last):
  File "c:\python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Python39\Scripts\scrapy.exe\__main__.py", line 7, in <module>
  File "c:\python39\lib\site-packages\scrapy\cmdline.py", line 142, in execute
    _run_print_help(parser, cmd.process_options, args, opts)
  File "c:\python39\lib\site-packages\scrapy\cmdline.py", line 100, in _run_print_help
    func(*a, **kw)
  File "c:\python39\lib\site-packages\scrapy\commands\__init__.py", line 130, in process_options
    if opts.output or opts.overwrite_output:
AttributeError: 'Values' object has no attribute 'overwrite_output'

MySQL insert error

2018-06-11 11:14:58 [tieba] ERROR: Insert to database error: [Failure instance: Traceback: <class '_mysql_exceptions.OperationalError'>: (2019, "Can't initialize character set utf8mb4 (path: C:\mysql\share\charsets\)")
c:\python\anaconda\lib\threading.py:801:__bootstrap_inner
c:\python\anaconda\lib\threading.py:754:run
c:\python\anaconda\lib\site-packages\twisted\_threads\_threadworker.py:46:work
c:\python\anaconda\lib\site-packages\twisted\_threads\_team.py:190:doWork
--- ---
c:\python\anaconda\lib\site-packages\twisted\python\threadpool.py:250:inContext
c:\python\anaconda\lib\site-packages\twisted\python\threadpool.py:266:
c:\python\anaconda\lib\site-packages\twisted\python\context.py:122:callWithContext
c:\python\anaconda\lib\site-packages\twisted\python\context.py:85:callWithContext
c:\python\anaconda\lib\site-packages\twisted\enterprise\adbapi.py:464:runInteraction
c:\python\anaconda\lib\site-packages\twisted\enterprise\adbapi.py:36:__init__
c:\python\anaconda\lib\site-packages\twisted\enterprise\adbapi.py:76:reconnect
c:\python\anaconda\lib\site-packages\twisted\enterprise\adbapi.py:431:connect
c:\python\anaconda\lib\site-packages\MySQLdb\__init__.py:81:Connect
c:\python\anaconda\lib\site-packages\MySQLdb\connections.py:221:__init__
c:\python\anaconda\lib\site-packages\MySQLdb\connections.py:312:set_character_set
] when dealing with item: {'author': u'\u5706\u53c8\u5706_00',
'content': u'\u4e13\u4e1a\u7b5b\u67e5\u662f\u4ec0\u4e48\uff1f',
'id': u'89618631379',
'post_id': u'10404126902',
'time': '2016-05-14 11:09:24'}

pip show MySQL-python
Name: MySQL-python
Version: 1.2.5
Summary: Python interface to MySQL
Home-page: https://github.com/farcepest/MySQLdb1
Author: Andy Dustman
Author-email: [email protected]
License: GPL
Location: c:\python\anaconda\lib\site-packages
Requires:
Required-by:

MySQL 8.0
