python-spider's People

Contributors

chess99, crazybunqnq, dependabot[bot], hjlarry, jack-cherish, steven7851, sys0613, zyszys


python-spider's Issues

I'm here for the perks

I want to crawl all the novels on http://www.biqukan.com and then build my own novel-reading app. Could you give me some advice?

Code error

File "1.py", line 126, in run
video_names, video_urls, nickname = self.get_video_urls(user_id)
File "1.py", line 35, in get_video_urls
aweme_count = html['user_list'][0]['user_info']['aweme_count']
KeyError: 'user_list'
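
This KeyError usually means the API answered with an error object instead of search results, so html carries no 'user_list' key (compare the "请先登录" issue further down, where the response holds a status_msg instead). A minimal defensive sketch, assuming the same html dict as in the traceback:

    # Hypothetical guard, not the original script's code: bail out with the
    # API's own message when 'user_list' is missing.
    user_list = html.get('user_list')
    if not user_list:
        raise RuntimeError('search failed: %s' % html.get('status_msg', html))
    aweme_count = user_list[0]['user_info']['aweme_count']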

12306 ticket-grabbing error

As follows:

Waiting for the captcha; enter it manually...
Ticket-booking page loaded...
Clicking Query in a loop... attempt 1
[2832:8432:0114/111719.531:ERROR:service_manager.cc(157)] Connection InterfaceProviderSpec prevented service: content_renderer from binding interface: blink::mojom::ReportingServiceProxy exposed by: content_browser
Message: stale element reference: element is not attached to the page document
(Session info: chrome=63.0.3239.132)
(Driver info: chromedriver=2.34.522940 (1a76f96f66e3ca7b8e57d503b4dd3bccfba87af1),platform=Windows NT 10.0.16299 x86_64)

Booking not started yet 1
Starting booking...
Selecting passengers...
Submitting order...
'ElementList' object has no attribute 'click'

F:\trainticket_booker-master>[5620:16128:0114/111729.132:ERROR:process_metrics.cc(105)] NOT IMPLEMENTED
[5620:16128:0114/111729.132:ERROR:process_metrics.cc(105)] NOT IMPLEMENTED
[5620:16128:0114/111729.133:ERROR:process_metrics.cc(105)] NOT IMPLEMENTED
[5620:16128:0114/111729.135:ERROR:process_metrics.cc(105)] NOT IMPLEMENTED

The browser first showed the login page; after I logged in it jumped to the booking page, but before I could get a good look it bounced back to the login page. The above is the error output from the command line.
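
The "stale element reference" message generally means the page re-rendered between looking an element up and clicking it, which fits the bounce back to the login page described above. A minimal retry sketch, assuming a Selenium driver and a hypothetical locator rather than the repo's actual code:

    # Re-locate the element on every attempt so a page re-render between
    # lookup and click doesn't leave us holding a stale handle.
    from selenium.common.exceptions import StaleElementReferenceException

    def click_fresh(driver, by, locator, attempts=3):
        for _ in range(attempts):
            try:
                driver.find_element(by, locator).click()
                return True
            except StaleElementReferenceException:
                continue  # element went stale; look it up again
        return False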

Got an error: KeyError: 'aweme_list'

Parsing video links:

Traceback (most recent call last):
  File "douyin_appsign.py", line 325, in <module>
    douyin.run()
  File "douyin_appsign.py", line 283, in run
    video_names, video_urls, share_urls, nickname = self.get_video_urls(user_id, type_flag)
  File "douyin_appsign.py", line 194, in get_video_urls
    for each in html['aweme_list']:
KeyError: 'aweme_list'

It errored out...

Incomplete Bilibili video downloads

python bilibili.py -d lex -k lexburner2009年的零 -p 1

When I download a 10-minute video this way, I only get the first 6 minutes.
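
One plausible cause, offered as a guess rather than a diagnosis: the older Bilibili playurl API returns a video as a 'durl' list of several segments, and fetching only the first segment yields just the opening minutes. A sketch of downloading every segment, with the JSON field names assumed from that API:

    # Fetch every entry in the 'durl' segment list and append them to one file.
    import requests

    def download_all_segments(playurl_json, headers, out_path):
        with open(out_path, 'wb') as f:
            for seg in playurl_json['durl']:      # one entry per video segment
                r = requests.get(seg['url'], headers=headers, stream=True)
                for chunk in r.iter_content(chunk_size=1 << 16):
                    f.write(chunk)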

12306 ticket-grabbing error

Waiting for the captcha; enter it manually...
Ticket-booking page loaded...
Clicking Query in a loop... attempt 1
no elements could be found with text "预订"
Booking not started yet
Clicking Query in a loop... attempt 2
Message: unknown error: Element ... is not clickable at point (1102, 15). Other element would receive the click: ...
  (Session info: chrome=63.0.3213.3)
  (Driver info: chromedriver=2.34.522940 (1a76f96f66e3ca7b8e57d503b4dd3bccfba87af1),platform=Windows NT 6.1.7601 SP1 x86_64)
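
"Not clickable at point" usually means an overlay or loading mask is covering the target when the click fires. A common remedy is an explicit wait for clickability, sketched here with Selenium's expected conditions (the locator is an assumption; the script's real selectors may differ):

    # Wait up to 10 s until the booking link is genuinely clickable.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    btn = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.LINK_TEXT, "预订"))
    )
    btn.click()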

Runtime error

Traceback (most recent call last):
  File "D:/Files/Python Practice/test/test.py", line 160, in <module>
    douyin.run()
  File "D:/Files/Python Practice/test/test.py", line 125, in run
    video_names, video_urls, share_urls, nickname = self.get_video_urls(user_id)
  File "D:/Files/Python Practice/test/test.py", line 46, in get_video_urls
    uid = html['user_list'][0]['user_info']['uid']
KeyError: 'user_list'

VIP videos

There's no requirements.txt in the video_downloader directory, so the dependencies can't be installed.
For iQiyi VIP videos, does the tool only produce the free 6-minute preview?

New search API

What needs to change:

search_url = 'https://api.bilibili.com/x/web-interface/search/type?jsonp=jsonp&search_type=video&keyword={}&page={}'

In the returned JSON data:

videos = html["data"]['result']

Thanks a lot for sharing; I've learned a great deal!
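
Putting the two changes together, a minimal sketch of querying the new endpoint (the User-Agent header and error handling are assumptions, not part of the original suggestion):

    # Query the new Bilibili search endpoint and unwrap the 'data' envelope.
    import requests

    search_url = ('https://api.bilibili.com/x/web-interface/search/type'
                  '?jsonp=jsonp&search_type=video&keyword={}&page={}')

    def search_videos(keyword, page=1):
        req = requests.get(search_url.format(keyword, page),
                           headers={'User-Agent': 'Mozilla/5.0'})
        html = req.json()
        return html["data"]['result']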

Spider errors in PyCharm

Hi, I found your spider tutorial on CSDN, thought it was really interesting, gave it a like, and followed along, but ran into a problem; hope you can reply.
The problem is simple. In PyCharm:

import requests

if __name__ == '__main__':
    target = 'http://gitbook.cn/'
    req = requests.get(url=target)
    print(req.text)

It throws an error:

D:\Python\Python36\python.exe "C:/Users/zxy/PycharmProjects/Python Excerise/WebSpider/random.py"
Traceback (most recent call last):
  File "C:/Users/zxy/PycharmProjects/Python Excerise/WebSpider/random.py", line 1, in <module>
    import requests
  File "D:\Python\Python36\lib\site-packages\requests\__init__.py", line 97, in <module>
    from . import utils
  File "D:\Python\Python36\lib\site-packages\requests\utils.py", line 11, in <module>
    import cgi
  File "D:\Python\Python36\lib\cgi.py", line 44, in <module>
    import tempfile
  File "D:\Python\Python36\lib\tempfile.py", line 45, in <module>
    from random import Random as _Random
ImportError: cannot import name 'Random'

But copying the same code into the Python interpreter in cmd runs without any problem.
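
A likely cause can be read off the traceback itself: the script is saved as WebSpider/random.py, so when tempfile.py runs from random import Random, Python finds the script (whose directory sits first on sys.path) instead of the standard-library random module. Renaming the file and removing any stale random.pyc beside it should fix it; the cmd interpreter starts from a different working directory, which is why the same code runs there. A quick check, run from another script in the same folder:

    # Reveals which 'random' actually gets imported; if it prints the path
    # to WebSpider/random.py, the stdlib module is being shadowed.
    import random
    print(random.__file__)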

Douyin search API changed

I just tried the Douyin spider and found that its search API has changed: it now generates two token parameters for every keyword. I don't know how to crack them; could you take a look and see if there's a good approach?
Packet capture shows the latest endpoint is http://aweme.snssdk.com/aweme/v1/general/search
It uses a cookie to check whether you are logged in.
The search parameters also change per keyword: both the mas and as parameters vary.
keyword=***&offset=0&mas=010abf2fd5bc52c15a2cccb755d17528ee1d793a0e0b26f18808de&as=a135dc16bc6edbd9022563
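
Without the signing algorithm, one stopgap is to replay values captured from the app; a sketch under that assumption (every token and cookie below is a placeholder to be copied from a packet capture, and they expire quickly):

    # Replay a captured request against the new endpoint. 'mas' and 'as' are
    # app-generated signature tokens; they vary per keyword and go stale.
    import requests

    params = {
        'keyword': 'example',                 # placeholder keyword
        'offset': 0,
        'mas': '<captured mas token>',        # placeholder
        'as': '<captured as token>',          # placeholder
    }
    cookies = {'sessionid': '<captured session cookie>'}  # placeholder

    resp = requests.get('http://aweme.snssdk.com/aweme/v1/general/search',
                        params=params, cookies=cookies)
    print(resp.json())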

Some problems are troubling me

Hello, master Cui.
I have a question I'd like to ask. My code (it auto-places orders on an e-commerce platform) runs fine on Windows, but on Linux only GET requests get a response; POST requests get none, as if the platform is blocking them. Hoping you can offer some pointers.
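
One low-cost check, offered only as a guess: make sure the Linux run sends exactly the same headers as the Windows run, since platforms often fingerprint and silently drop bare default clients on POST. A sketch (URL, header values, and payload are illustrative placeholders):

    # Send the POST with an explicit, browser-like header set so both
    # platforms present identical requests.
    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Referer': 'https://example.com/checkout',                  # placeholder
    }
    resp = requests.post('https://example.com/order',               # placeholder
                         headers=headers, data={'item_id': '123'})  # placeholder
    print(resp.status_code, resp.text[:200])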

12306 ticket-grabbing error

Message: unknown error: Element ... is not clickable at point (1112, 99). Other element would receive the click: ...
  (Session info: chrome=63.0.3239.132)
  (Driver info: chromedriver=2.35.528161 (5b82f2d2aae0ca24b877009200ced9065a772e73),platform=Windows NT 10.0.10586 x86_64)

Feels like you could start a small business with this.

Slider captcha: I built one following yours and hit some problems; could you take a look when you have time? Thanks

https://github.com/sys0613/python-spider/tree/master/geetest — I've uploaded it to my repo for now and will push it to yours once it's working. The slider on the industry-and-commerce site can't be tested right now, so I'm testing against the demo page on the geetest official site. For clicking Login and fetching the slider images, I've written up both the Fiddler captures and my notes.
I don't really understand JS and CSS, so I'd like you to briefly walk me through the flow. Right now I don't know how to move forward to obtain the complete captcha image and the captcha image with the gap.
I have two questions I'd like your help with, thanks. The details are all in the folder at my GitHub link.
Question 1: is the URL of my first request computed by a JS function on the login page? (I can't read JS at the moment.)
Question 2: should the scrambled images in requests 5 and 7 be recombined according to the JS from request 3 or the CSS from request 4 to produce one complete image and one gapped image?
Thanks. My weak spots right now are JS and CSS. I saw the captcha on the industry-and-commerce site you used has changed, so I want to get the geetest official demo working and push it so everyone can use it. That's where I'm stuck.
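
On question 2: in the classic geetest scheme the reassembly order comes from an offset array embedded in the JS (your request 3); the CSS in request 4 only positions the widget. Each downloaded image is a strip of fixed-size slices pasted back into two rows. A sketch with Pillow, where the offset list and the 10x58 slice geometry are assumptions to verify against the actual JS:

    # Reassemble a scrambled geetest-style captcha image. 'offsets' holds the
    # per-slice source coordinates recovered from the page's JS.
    from PIL import Image

    def reassemble(scrambled_path, offsets, slice_w=10, slice_h=58):
        src = Image.open(scrambled_path)
        cols = len(offsets) // 2                  # slices per row, two rows
        out = Image.new('RGB', (slice_w * cols, slice_h * 2))
        for i, (sx, sy) in enumerate(offsets):
            piece = src.crop((sx, sy, sx + slice_w, sy + slice_h))
            out.paste(piece, ((i % cols) * slice_w, 0 if i < cols else slice_h))
        return out

Running it on both the complete and the gapped image with the same offsets should yield the two pictures needed for the gap-matching step.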

NetEase Cloud Music download fails

The post_request function keeps returning "post_request error". After some analysis, it happens while get_song_url is executing. Any idea what the cause is?

"请先登录,再继续搜索吧"

After modifying douyin_pro_2.py:

			req = requests.get(search_url, headers=self.headers)
			html = json.loads(req.text)
			print(html) ###!!!!
			aweme_count = 32767 # html['user_list'][0]['user_info']['aweme_count']
			uid = html['user_list'][0]['user_info']['uid']
$ python douyin_pro_2.py
[...]
{'status_code': 2483, 'rid': '20180713080926010011047200583C2E', 'log_pb': {'impr_id': '20180713080926010011047200583C2E'}, 'status_msg': '请先登录,再继续搜索吧', 'extra': {'logid': '20180713080926010011047200583C2E', 'now': 1531440566657, 'fatal_item_ids': []}}
Traceback (most recent call last):                                                                                                                                                                                                            
  File "douyin_pro_2.py", line 153, in <module>
    douyin.run()
  File "douyin_pro_2.py", line 118, in run
    video_names, video_urls, share_urls, nickname = self.get_video_urls(user_id)
  File "douyin_pro_2.py", line 39, in get_video_urls
    uid = html['user_list'][0]['user_info']['uid']
KeyError: 'user_list'

How do I log in?
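
status_code 2483 with that status_msg indicates the API now rejects anonymous searches. The usual workaround is to attach a cookie captured from a logged-in session; a sketch under that assumption (the cookie name and the status_code-0-means-success convention are both assumptions):

    # Hypothetical fix: reuse a captured login cookie on the search request.
    cookies = {'sessionid': '<value captured from a logged-in session>'}
    req = requests.get(search_url, headers=self.headers, cookies=cookies)
    html = json.loads(req.text)
    if html.get('status_code') != 0:              # assuming 0 means success
        raise RuntimeError(html.get('status_msg', 'search rejected'))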

To the author

Could you add a spider for batch-downloading Ximalaya audio?

The novel script seems to error out on novels with longer chapter lists and doesn't show download progress

# The catalog URL I used; it errors on a novel with over a thousand chapters: http://www.biqukan.com/1_1094/
Traceback (most recent call last):
  File "D:/py17/爬虫/小说.py", line 137, in <module>
    name,numbers,url_dict = d.get_download_url()
  File "D:/py17/爬虫/小说.py", line 73, in get_download_url
    download_dict['第' + str(numbers) + '章 ' + names[1]] = download_url
IndexError: list index out of range

# The list doesn't seem to capture the whole catalog: it only reaches chapter 1259, while the novel actually runs to 1260
1277 1278 ['第1256', ' 帝戟']
1278 1279 ['第1257', ' 你敢骂我?']
1279 1280 ['第1258', ' 陨落']
1280 1281 ['第1259张 风波再起!']

Process finished with exit code 1
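
The last output line points at the cause: chapter 1259's title on the site reads 第1259张 风波再起! (张 where 章 was meant), so splitting the title on '章' yields a one-element list and names[1] raises the IndexError, which also explains why the list stops at 1259. A defensive sketch, with the raw title held in a hypothetical each_title variable:

    # Tolerate catalog entries whose titles don't split cleanly on '章',
    # e.g. the site's typo '第1259张 风波再起!'. Not the original script's code.
    parts = each_title.split('章')
    chapter_name = parts[1].strip() if len(parts) >= 2 else each_title
    download_dict['第' + str(numbers) + '章 ' + chapter_name] = download_url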
