xiazy / driveit
A New Multithreading Crawler Supports Multiple Websites
License: Do What The F*ck You Want To Public License
My command: python3 ./driveit.py http://www.dmzj.com/info/chuanlingwuyu.html
Printed output:
url http://www.dmzj.com/info/chuanlingwuyu.html header {'Referer': '', 'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/600.1.3 (KHTML, like Gecko) Version/8.0 Mobile/12A4345d Safari/600.1.4'}
Traceback (most recent call last):
File "./driveit.py", line 99, in
website_object = SiteClass(user_input_url)
File "/Users//Documents//DriveIt-master/sites.py", line 107, in init
self.flyleaf_data = self.get_data(self.flyleaf_url).decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
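A possible reading of this traceback: byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), so the server may have returned a gzip-compressed body that was then decoded as UTF-8 directly. A minimal sketch of a guard, assuming that is the cause; `decode_page` is a hypothetical helper, not part of DriveIt:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_page(raw, encoding="utf-8"):
    """Decompress the body if it looks like gzip, then decode it to text.

    Hypothetical helper: `raw` is the bytes object returned by get_data().
    """
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)
```

With something like this, the failing line could become `self.flyleaf_data = decode_page(self.get_data(self.flyleaf_url))`.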
Test URL:
http://www.dm5.com/manhua-yiquanchaoren/
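Reproducible once the download reaches 'Side Story: Chapter 8'. The images cannot be opened; after changing the file extension to .png they display normally.
Image URLs such as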
http://manhua1025.61-174-50-141.cdndm5.com/11/10684/399544/1_7450.jpg?cid=399544&key=26598c8617417b16446db25625610c9a
http://manhua1023.61-174-50-131.cdndm5.com/11/10684/145458/3_5957.png?cid=145458&key=26598c8617417b16446db25625610c9a
contain the image format, so it is easy to detect.
P.S. Your source code has been very helpful, thank you!
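Reading the extension from the URL path, as suggested above, could look roughly like this. It is only a sketch: `extension_from_url` and its `default` fallback are hypothetical names, and where DriveIt builds its file names is not shown in this thread.

```python
import os
from urllib.parse import urlsplit


def extension_from_url(url, default=".jpg"):
    """Return the file extension found in the URL path, ignoring the query string."""
    path = urlsplit(url).path            # drops "?cid=...&key=..."
    ext = os.path.splitext(path)[1].lower()
    return ext or default                # fall back when the path has no extension
```

For the two URLs quoted above this yields `.jpg` and `.png` respectively, so the saved file name can follow the real format.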
Problem links
cudianxin
gg
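I originally wanted to add proxy support... I had finished adding it, and then found during testing that the crawler was broken = =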
http://manhua.dmzj.com/fengxia/
http://www.dm5.com/manhua-reclksdysjsh/
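I was going to write a unit test and fix it myself, but it is hard even to locate the specific function where the problem occurs...
PS: this link in the README is dead: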
http://www.dmzj.com/info/shenshimenlianaizhanzheng.html is a flyleaf page of DMZJ
It seems the page structure of dm5 has changed?
```
URL?
http://www.dm5.com/manhua-cudianxinzhanzheng/
Traceback (most recent call last):
File ".\driveit.py", line 30, in
ref_box = website_object.get_parent_info()
File "C:\Users\Niu\DriveIt\sites.py", line 72, in get_parent_info
ref_title = li.a['title']
File "C:\Users\Niu\AppData\Local\Programs\Python\Python35\lib\site-packages\bs4\element.py", line 958, in getitem
return self.attrs[key]
KeyError: 'title'
```
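This error was raised just now while crawling 粗点心战争 (manhua-cudianxinzhanzheng).
It would be great if the maintainer felt like fixing it -.-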
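The `KeyError: 'title'` means the `<a>` tag found at that point has no `title` attribute. One defensive option is to read the attribute with `.get()` and fall back to the link text. A minimal sketch, assuming `li.a` is a Beautiful Soup `Tag`; `anchor_title` is a hypothetical helper, and whether the link text is an acceptable chapter title on the current dm5 pages is untested:

```python
def anchor_title(anchor):
    """Return the title of an <a> tag without raising when the attribute is missing.

    `anchor` is expected to behave like a bs4 Tag: .get(name) returns the
    attribute value or None, and .get_text(strip=True) returns the visible text.
    """
    if anchor is None:                 # the <li> had no <a> child at all
        return None
    title = anchor.get("title")
    if title:
        return title.strip()
    return anchor.get_text(strip=True)  # fall back to the link text
```

In `get_parent_info` the lookup would then read `ref_title = anchor_title(li.a)`.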
Comic URL used for the test:
http://www.dm5.com/manhua-fangxuehoudefengbaoguanxianledui/
Steps:
PS E:\LearnPython\DriveIt> python3 .\driveit.py
URL?
http://www.dm5.com/manhua-fangxuehoudefengbaoguanxianledui/
Where to save?
E:\BaiduYunDownload\漫画\
Result:
放学后的风暴管弦乐队, total 5 chapters detected.
Traceback (most recent call last):
File ".\driveit.py", line 67, in <module>
main_loop(ref_box)
File ".\driveit.py", line 12, in main_loop
website_object.down(comic_name, parent_link, link, title, page)
File "E:\LearnPython\DriveIt\sites.py", line 94, in down
img_data = self.get_data(link, 'http://www.dm5.com%s' % parent_link)
File "E:\LearnPython\DriveIt\base.py", line 23, in get_data
web_page = request.urlopen(req)
File "E:\Python34\lib\urllib\request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "E:\Python34\lib\urllib\request.py", line 463, in open
response = self._open(req, data)
File "E:\Python34\lib\urllib\request.py", line 481, in _open
'_open', req)
File "E:\Python34\lib\urllib\request.py", line 441, in _call_chain
result = func(*args)
File "E:\Python34\lib\urllib\request.py", line 1210, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "E:\Python34\lib\urllib\request.py", line 1182, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "E:\Python34\lib\http\client.py", line 1088, in request
self._send_request(method, url, body, headers)
File "E:\Python34\lib\http\client.py", line 1116, in _send_request
self.putrequest(method, url, **skips)
File "E:\Python34\lib\http\client.py", line 973, in putrequest
self._output(request.encode('ascii'))
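UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-21: ordinal not in range(128)
Problem: the URL contains Chinese characters.
I inspected the page elements with Firebug and found that the image URL is: http://manhua1014.61-147-113-113.cdndm5.com/f/放学后的风暴管弦乐队/放学后的风暴管弦队_ch01/000001_fb457d98.jpg?cid=49096&key=acd841ae172b8fa7b82c1a60d545f8ae
Solution:
I will try to put together a PR later... I do not know how to fix it myself; the principle is explained here: Zhihu, "the Chinese-character problem with urlopen".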
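The usual fix for this class of error is to percent-encode the non-ASCII parts of the URL before handing it to `urlopen`. A minimal sketch using only `urllib.parse`; `ascii_safe_url` is a hypothetical name, and it assumes the host name itself is already ASCII, as it is in the URL above:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def ascii_safe_url(url):
    """Percent-encode non-ASCII characters in the path and query of a URL.

    '%' is kept as a safe character so that parts which are already
    percent-encoded are not encoded a second time.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))
```

Calling `self.get_data(ascii_safe_url(link), ...)` in `down` would be one place to apply it; the query string `?cid=...&key=...` is left untouched.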
requests automatically decodes gzip, deflate and sdch. With the next commit, you could add
"Accept-Encoding": "gzip, deflate, sdch"
to the header while you are at it; it should help with speed.
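One caveat: the tracebacks in these issues show DriveIt calling `urllib.request.urlopen`, which does not decompress response bodies by itself, so a request that advertises compression has to decode the body afterwards. A rough sketch under that assumption; `decode_body` and `ACCEPT_ENCODING` are hypothetical names, and sdch is left out because the standard library has no decoder for it:

```python
import gzip
import zlib

# Header to merge into the request headers (sdch omitted on purpose).
ACCEPT_ENCODING = {"Accept-Encoding": "gzip, deflate"}


def decode_body(raw, content_encoding):
    """Undo the Content-Encoding that the server reported for this body."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            return zlib.decompress(raw)                   # zlib-wrapped deflate
        except zlib.error:
            return zlib.decompress(raw, -zlib.MAX_WBITS)  # raw deflate stream
    return raw                                            # identity or unknown
```

Inside `get_data` this could be used as `decode_body(web_page.read(), web_page.headers.get("Content-Encoding"))`.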
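Cause: dm5 apparently had no mobile version of its pages before; now it does, and all sorts of elements have been renamed.
Fix:
Do not spoof a mobile UA for dm5; a desktop (laptop) UA is enough.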
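from urllib import request  # required at the top of sites.py for this override

class DM5(SharedBase):
    def get_data(self, url, referrer=''):
        # Send a desktop browser User-Agent so dm5 serves the desktop page layout.
        self.webheader = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
            'Referer': referrer}
        req = request.Request(url=url, headers=self.webheader)
        web_page = request.urlopen(req)
        page_data = web_page.read()
        return page_data

    def __init__(self, url):
        ...  # body omitted in the original post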
I overrode the get_data method, but that requires `from urllib import request` at the top of the file. Is there a more elegant way to do this...?
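One way to avoid both the copied method and the extra import might be to make the User-Agent a class attribute that the base class reads when it builds the request, so a site class only overrides the string. The sketch below is an assumption about how that could be laid out: this `SharedBase` is a simplified stand-in written for illustration, not the real class from base.py, and `build_request` is a hypothetical method.

```python
from urllib import request


class SharedBase:
    """Simplified stand-in for the project's base class (hypothetical layout)."""

    # Default: mobile UA (shortened here for the example).
    user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)"

    def build_request(self, url, referrer=""):
        # Headers are assembled from the class attribute, so subclasses
        # never need to touch urllib themselves.
        headers = {"User-Agent": self.user_agent, "Referer": referrer}
        return request.Request(url=url, headers=headers)

    def get_data(self, url, referrer=""):
        with request.urlopen(self.build_request(url, referrer)) as web_page:
            return web_page.read()


class DM5(SharedBase):
    # Only the UA string changes; get_data is inherited unchanged.
    user_agent = ("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36")
```

With this layout sites.py would not need to import `urllib` at all, since the request is built entirely in the base class.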
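Hello, gentleman. I came here because of your activity on a certain website. From now on you had better delete the .git files... After all, the Two Sessions are under way, and leaving personal information lying around there is rather risky.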