qiyeboy / spiderbook

Companion source code and notes for the book 《Python爬虫开发与项目实战》.

License: MIT License

Python 10.58% HTML 89.12% JavaScript 0.30% Batchfile 0.01%

spiderbook's Introduction

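SpiderBook

Companion source code and notes for the book 《Python爬虫开发与项目实战》.
You are welcome to follow and support my WeChat official account.

The code will soon be made compatible with Python 3. If anything in the book is unclear, misspelled, or poorly described, please open an issue on GitHub. I will also post corrections here for any errors found in the book. Thank you for your support.

A friendly reminder: be sure to read the book's preface in full.

Problems in the book or the code are corrected in the errata; please check it.

The three people who report the most errors in the book on GitHub will each receive a printed copy when the book is reprinted, as a token of my thanks.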

GitHub ID	Corrections	Rank
@yaleimeng	7	1
@Judy0513	5	2
@wushicanASL	3	3
@jsqlzy	2	4
@heqingbao	2	4
@exl2	2	4
@lg-Cat73	1	5
@shaodamao	1	5
@BillWing726	1	5
@wsl-victor	1	5
@liyang610	1	5
@Dang9527	1	5
@doujanbo	1	5


spiderbook's Issues

About the basic crawler in ch06

The basic crawler in ch06 does not run correctly. Debugging shows that the problem is at
new_urls,data = self.parser.parser(new_url,html)
I hope this can be fixed.

p122-p123: extra space between # and the id

print(soup.select("p > #link1"))
print(soup.select("#link1 ~ .sister"))
print(soup.select("#link1 + .sister"))
print(soup.select("#link1"))
print(soup.select("#link2"))

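Typo on page 111

Page 111, line 11:
Original: Extract p: print soup.a

Should be: Extract p: print soup.p

<title>FireFox test</title>
<script type="text/javascript">
var a="python";
var b="crawler development";
document.write(a,b);
console.log(a,b);
console.debug(a+b);
console.error(a+b);
console.info(a+b);
console.warn(a+b);
</script>
I wrote this exactly as in the book, but only a single line, 'python crawler development', is displayed and nothing else, unlike the book, which shows five lines of 'python crawler development'.

The third screenshot is what I see when I open the page; unlike the book, only one line is displayed.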

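P29: in the text after the code block at the top of the page, "the task process has been written" should read "the service process has been written".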

Chapter 9: phantomjs

Page content obtained with phantomjs is all garbled and Chinese text is not displayed correctly, for example in the example in 9.3.5.


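P144: Baidu Baike entry links now use the percent-encoded bytes of the Chinese entry name, so the regular expression that selects the qualifying links no longer works.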
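To illustrate the point, here is a minimal sketch of how a Chinese entry name looks in percent-encoded form and how a pattern can be written to accept it. The `/item/` path prefix and the pattern itself are assumptions made for illustration; they are not the book's regular expression or a documented Baidu URL scheme.

```python
import re
from urllib.parse import quote, unquote

# Hypothetical entry name used only for illustration.
entry = "网络爬虫"

# quote() UTF-8 encodes the text and writes each byte as %XX.
encoded = quote(entry)
link = "/item/" + encoded  # assumed link shape, not taken from the book

# Assumed pattern: "/item/" followed by percent-encoded bytes or plain
# URL-safe characters. Adjust it to the links actually found on the page.
pattern = re.compile(r"/item/(?:%[0-9A-Fa-f]{2}|[\w\-.~])+")

match = pattern.fullmatch(link)

print(link)
print(match is not None, unquote(encoded) == entry)
```

An older-style link such as `/view/123.htm` does not match this pattern, so a crawler that still expects the old shape would find no links at all.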

Chapter 7: wrong condition in the data store

    def store_data(self, data):
        if data is None:
            return

        self.datas.append(data)
        if len(self.datas) > 10:  # shouldn't 10 be 0?
            self.output_html(self.filepath)

P113, fifth line from the bottom: "the content attribute of Tag" should be "the contents attribute of Tag".

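5.2 Downloading images

Original code:
per = 100.0 * blocknum * blocksize / totalsize
if per > 100:
    per = 100
print('Current progress: %d' % per)

I changed this part to the code below. The version above always prints a progress of 100%, while the version below also shows the starting and intermediate progress. I am not sure whether it is correct.
per = 100.0 * blocknum * blocksize / totalsize
per = round(per, 1)
if per > 100:
    per = 100
    print('Current progress: %d' % per)
else:
    print('Current progress: {0}'.format(per))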
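The percentage logic discussed above can be isolated in a small function and checked without downloading anything. This is a sketch in Python 3; the names `percent_done` and `report` are hypothetical, and the three arguments follow the `reporthook` callback of `urllib.request.urlretrieve`, which is called with the block number, the block size, and the total size.

```python
def percent_done(blocknum, blocksize, totalsize):
    """Return download progress as a percentage between 0.0 and 100.0."""
    if totalsize <= 0:
        # The server did not report a size, so progress is unknown.
        return 0.0
    per = 100.0 * blocknum * blocksize / totalsize
    # The last block usually overshoots the real size, so clamp to 100.
    return min(round(per, 1), 100.0)


def report(blocknum, blocksize, totalsize):
    """Callback with the reporthook signature; prints one line per block."""
    print("Current progress: %.1f%%" % percent_done(blocknum, blocksize, totalsize))


# Simulate a 20000-byte file fetched in 8192-byte blocks.
for block in range(4):
    report(block, 8192, 20000)
```

Passing `report` as the `reporthook` argument would print these lines during a real download. A file smaller than one block reaches the clamp right after the first block, which is one possible reason for seeing mostly 100%.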

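The Firebug console does not work

Firebug, recommended on page 86 of the book, does not seem to show its console in recent versions of Firefox. Switching to an older Firefox (v49.0.2, as someone on Zhihu recommended) does not help either.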

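Chapter 12: file open mode when building the item pipeline (page 287, second line from the bottom)

On page 287 the file is opened in mode wb, which leaves only the data from the last write in the JSON file; it should be ab.
The book's code is:
self.file = open('papers.json', 'wb')
It should be changed to:
self.file = open('papers.json', 'ab')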
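The difference between the two modes only shows up when the file is opened more than once, for example on each run of the crawler; within a single open, successive writes accumulate in either mode. A minimal sketch, using a temporary directory and made-up records:

```python
import os
import tempfile


def write_twice(mode):
    """Open the same file twice in the given mode and return its final bytes."""
    path = os.path.join(tempfile.mkdtemp(), "papers.json")
    for line in (b'{"n": 1}\n', b'{"n": 2}\n'):
        # Each open() in a "w" mode truncates the file first;
        # "a" modes keep the existing content and append to it.
        with open(path, mode) as f:
            f.write(line)
    with open(path, "rb") as f:
        return f.read()


print(write_twice("wb"))
print(write_twice("ab"))
```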

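Chapter 1, section 1.3.2, page 14: Mac line terminator

The introduction to the os module presents os.linesep, the line terminator used by the current platform.

The book says Windows uses '\r\n', Linux uses '\n', and Mac uses '\r'.

Since OS X, the Mac line terminator has been '\n'.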
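A quick way to check the value on any machine is to print `os.linesep` directly. The small table in the sketch below restates the values from this issue and is only for illustration:

```python
import os

# Line terminator of the platform this script runs on.
print(repr(os.linesep))

# Text-mode files translate "\n" automatically when writing, so code
# normally writes "\n" and lets Python pick the platform terminator.
terminators = {"Windows": "\r\n", "Linux": "\n", "macOS (OS X and later)": "\n"}
for system, sep in terminators.items():
    print("%-24s %r" % (system, sep))
```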

1.4.4.py

# Step 6: add tasks
for url in ["ImageUrl_"+str(i) for i in range(10)]:
    print 'put task %s ...' %url
    task.put(url)

One correction for you: ["ImageUrl_"+str(i) for i in range(10)

Page 146: wrong imports in the crawler scheduler

The imports of the other modules reference the firstSpider directory, which is wrong:
from firstSpider.DataOutput
from firstSpider.HtmlParser
from firstSpider.HtmlDownloader
from firstSpider.UrlManager

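Could the author look at the Baidu entry crawler code in the basics part? Every run lands directly on "Baidu Baike: entry locked" (百度百科：词条锁定中)

Could the author take another look at the Baidu entry crawler code in the basics part? Every run lands directly on "Baidu Baike: entry locked" (百度百科：词条锁定中).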

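Some possible errors in the book

P17, at the end of lines 6 and 9: the formatting operator should be %, but the book prints a comma (,).
P26, around the middle of the page, the code reads: except Exception,e: Should this be changed to: except Exception as e: ? I use Python 3.6, which does not support the former syntax.

P100, first sentence: (?\d+) Is an uppercase P missing after the question mark? It should be (?P\d+) (on my 3.6). Also, for the backreferences to the group on the second and third lines, the book prints \k, but it should actually be \g.

P110, line 4: it should read "the output is as follows", but the book says "the input is as follows".

P111, line 11: the book has "Extract p: print soup.a"; it should be print soup.p, that is, the letter a should be p.
[I cannot upload images, so I am describing these here. Could the author check whether these are errors in the book, my misunderstanding, or a version issue? My Python is 3.6.0.]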
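For reference, Python's `re` module writes a named group as `(?P<name>...)` and refers to it as `\g<name>` in a replacement string and as `(?P=name)` inside a pattern. A short sketch in Python 3, with made-up group names and input text:

```python
import re

text = "page 100 of 305"

# Python spells a named group (?P<name>...); the name "num" is made up here.
pattern = re.compile(r"(?P<num>\d+)")

first = pattern.search(text)
print(first.group("num"))

# Inside the pattern, a named group is referenced with (?P=name).
doubled = re.search(r"(?P<ch>\w)(?P=ch)", "crawl the book")
print(doubled.group(0))

# In a replacement string, the named group is referenced with \g<name>.
wrapped = pattern.sub(r"[\g<num>]", text)
print(wrapped)
```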

DynamicSpider

This crawler collects less data than the page shows, and fewer rows end up in sqlite as well. Running it produces:
'NoneType' object has no attribute 'get'
http://service.library.mtime.com/Movie.api?Ajax_CallBack=true&Ajax_CallBackType=Mtime.Library.Services&Ajax_CallBackMethod=GetMovieOverviewRating&Ajax_CrossDomain=1&Ajax_RequestUrl=http://movie.mtime.com/247097/&t=201707072006523282&Ajax_CallBackArgument0=247097
{u'value': {u'userLastComment': u'', u'releaseType': 1, u'isRelease': True, u'tweetId': 0, u'movieRating': {u'TitleEn': u'', u'RPictureFinal': 0, u'AttitudeCount': 22, u'MovieId': 247097, u'TitleCn': u'', u'RShowFinal': 0, u'RDirectorFinal': 0, u'EnterTime': 0, u'RTotalFinal': 0, u'UserId': 0, u'ROtherFinal': 0, u'JustTotal': 0, u'Year': u'', u'RatingFinal': -1, u'RatingCount': 0, u'RStoryFinal': 0, u'IP': 0, u'Usercount': 23}, u'movieTitle': u'\u649e\u90aa31\u53f7', u'userLastCommentUrl': u''}, u'error': None}
The log contains entries like this.

How can I quickly remove the quotes?

A lot of the code is wrapped in
'''

'''
How can I quickly remove these markers?

Chapter 6: DataOutput

Removing elements while iterating over the data makes the loop lose track of elements; iterating over a slice copy fixes this:
for data in self.datas[:]:
    count += 1
    fout.write("")
    fout.write("%s" % data['url'])
    fout.write("%s" % data['title'])
    fout.write("%s" % data['summary'])
    fout.write("")
    self.datas.remove(data)
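The effect described here is easy to reproduce with a plain list. In this sketch the function names and the sample data are made up; the first function removes items from the list it is iterating over, the second iterates over a slice copy as suggested in the issue.

```python
def drain_buggy(items):
    """Remove while iterating over the same list: elements get skipped."""
    seen = []
    for item in items:
        seen.append(item)
        items.remove(item)
    return seen


def drain_with_copy(items):
    """Iterate over a slice copy so removals do not disturb the loop."""
    seen = []
    for item in items[:]:
        seen.append(item)
        items.remove(item)
    return seen


print(drain_buggy(["a", "b", "c", "d"]))      # visits only every other item
print(drain_with_copy(["a", "b", "c", "d"]))  # visits all four items
```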

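Problem running the Chapter 17 Scrapy distributed crawler code

Hello, when I run your ch17 code from GitHub I run into the following:
With scrapy crawl yunqi.qq.com, the crawler can GET the start_urls but never parses them, then finishes and closes, as shown in the screenshot:

I tried scrapy shell to inspect the response of the page being processed, but it raised an error, as shown in the screenshot:

I have only just started learning and using Scrapy, so I do not know where to begin tracking this down. Any pointers from the author would be appreciated. Thanks.

Environment: ubuntu17.04; the redis service and the mongodb server were started before running the code.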

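5.3 Sending email: logging in to 163 through the web page works, but the code reports an authentication failure. Has anyone run into this?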

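Chapter 1, page 17: a suggestion about the join method

@qiyeboy Hello,
The join method blocks until that one process finishes. To wait for all child processes to finish, I suggest calling it on every process.

Book code:
import os
from multiprocessing import Process

def run_proc(name):
    print 'Child process %s (%s) Running...' % (name, os.getpid())

if __name__ == '__main__':
    print 'Parent process %s.' % os.getpid()
    for i in range(5):
        p = Process(target=run_proc, args=(str(i),))
        print 'Process will start.'
        p.start()
    p.join()
    print 'Process end.'

I suggest collecting the child processes in a list so that each one can be handled in a loop and "synchronized", which ensures the parent continues only after all children have finished. Modified version:
import os
from multiprocessing import Process

def run_proc(name):
    print 'Child process %s (%s) Running...' % (name, os.getpid())

if __name__ == '__main__':
    print 'Parent process %s.' % os.getpid()
    p = []
    for i in range(5):
        p.append(Process(target=run_proc, args=(str(i),)))
        print 'Process will start.'
        p[i].start()
    # Use the process list to make sure every child has finished
    # before the parent continues
    for process in p:
        process.join()
    print 'Process end.'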
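The same idea in Python 3 syntax, as a minimal sketch: the helper name `start_and_join_all` is hypothetical, and it accepts any objects that offer `start()` and `join()`, which `multiprocessing.Process` does.

```python
import os
from multiprocessing import Process


def run_proc(name):
    print("Child process %s (%s) running..." % (name, os.getpid()))


def start_and_join_all(workers):
    """Start every worker, then block until each one has finished."""
    for worker in workers:
        worker.start()
    # join() waits for one worker only, so it is called on each of them.
    for worker in workers:
        worker.join()


if __name__ == "__main__":
    print("Parent process %s." % os.getpid())
    children = [Process(target=run_proc, args=(str(i),)) for i in range(5)]
    start_and_join_all(children)
    print("All child processes ended.")
```

Starting all workers before joining any of them keeps them running in parallel; joining inside the start loop would run them one after another.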

About Chapter 4

What I see looks like this, which differs from the book

issues

from BloomfilterOnRedis import BloomFilter,
Every package I downloaded fails on import. Which third-party library do I need to install?

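Book giveaway

Would the top three readers please message me through the official account with your personal details and address, so that I can arrange the books.
Thank you for your contributions.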
@yaleimeng
@Judy0513
@wushicanASL

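P58, Table 2-4, row 7: "these elements have a lang attribute whose value is eng" should be "these elements have a lang attribute whose value is en"

As in the title.

P58, Table 2-4, row 6 of the "Effect" column: it should be "these elements have a lang attribute whose value is en"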

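A few questions for the author

I have never learned Python; I ended up here because I wanted to understand how goagent works. The book looks like a step-by-step tutorial, which seems good, but judging by the feedback it appears to contain quite a few errors. Should I buy it now, or wait for a revised edition?
Also, the source code in the Java tutorials I used before could always be run. Does the source on GitHub run? (I am lazy: I always run the source first and then read the book alongside it, and I cannot be bothered with books that have no source.)
Finally, I saw that goagent is packaged as an exe program. Does the book cover this? Thanks.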

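A question

Hello,
In Chapter 9 (dynamic website crawling), in the Mtime example, the page-parsing class HtmlParser contains
pattern = re.compile(r'(http://movie.mtime.com/(\d+)/)') where \d+ is wrapped in parentheses: (\d+)/)
The printed result is below. I do not understand why the number u'10910' is extracted separately.
[(u'http://movie.mtime.com/10910/', u'10910'), (u'http://movie.mtime.com/211901/', u'211901'), (u'http://movie.mtime.com/211901/', u'211901'), Thanks for explaining. Is this a regular-expression feature?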
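This is standard behaviour of capture groups: when a pattern contains more than one group, `re.findall` returns a tuple of all groups for each match. A small sketch in Python 3, using a made-up HTML fragment and escaping the dots in the host name:

```python
import re

html = 'a href="http://movie.mtime.com/10910/" b href="http://movie.mtime.com/211901/"'

# Two groups: the outer one captures the whole URL, the inner one the id.
# findall() then returns one (group 1, group 2) tuple per match.
both = re.findall(r'(http://movie\.mtime\.com/(\d+)/)', html)
print(both)

# One group only: findall() returns just that group's text.
ids = re.findall(r'http://movie\.mtime\.com/(\d+)/', html)
print(ids)

# No group: findall() returns the whole match.
urls = re.findall(r'http://movie\.mtime\.com/\d+/', html)
print(urls)
```

Capturing both the URL and the id in one pass is convenient when the id is needed later, for example to build a follow-up request.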

P146: error when running SpiderMan.py

Running SpiderMan.py raises:
self.manager.add_new_url(root_url)
AttributeError: 'SpiderMan' object has no attribute 'manager'

I am using py3.
The code is as follows:

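Environment setup

Hello, I have recently started learning crawling with this book and have no prior experience with crawlers. Are the two IDEs mentioned in the book required, or can I just write code in the shell that ships with Python?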

Chapter 1, page 17

The os fork code on page 17 has a string-formatting error:

    print 'I am child process(%s) and my parent process is (%s)',(os.getpid(),os.getppid())
else:
    print 'I(%s) created a chlid process (%s).',(os.getpid(),pid)

The operator between the format string and its arguments should be %:

    print 'I am child process(%s) and my parent process is (%s)' % (os.getpid(), os.getppid())
else:
    print 'I(%s) created a chlid process (%s).' % (os.getpid(), pid)
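The difference between the comma and the `%` operator can be seen without forking. A small sketch in Python 3; the variable names are made up:

```python
import os

template = "I am child process (%s) and my parent process is (%s)"
ids = (os.getpid(), os.getppid())

# With a comma, the template and the tuple stay two separate objects,
# so the %s placeholders are never filled in.
unformatted = (template, ids)

# With %, the tuple's items are substituted into the placeholders.
formatted = template % ids

print(unformatted[0])
print(formatted)
```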

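Errors in the book

I bought the e-book edition on JD yesterday (June 21, 2017).

8.2.3.2 Opening the database with the connect method of the sqlite3 module
This part is actually about the mysql database...

I noticed quite a few other errors earlier but did not report them; I will report them over time.
By the way, will the JD edition be updated when the book is revised?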

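Question: error running taskManager.py from the Chapter 1 distributed-process section

On page 31, the Chapter 1 distributed-process code taskManager.py fails with an error
pointing at: manager = QueueManager(address=('127.0.0.1', 8001), authkey='qiye')
Error message: TypeError: string argument without an encoding
I cannot tell what is wrong; the code looks correct. Could Mr. Fan reply when he has time?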
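This message is what Python 3 raises when a `str` is converted to `bytes` without an encoding, so a likely cause is that the book's Python 2 code passes `authkey` as a text string. A minimal sketch under that assumption, reusing the class name and address from the issue:

```python
from multiprocessing.managers import BaseManager


class QueueManager(BaseManager):
    pass


# In Python 3 the manager stores the key as bytes, and converting a str
# to bytes without naming an encoding fails with this TypeError.
try:
    bytes("qiye")
except TypeError as exc:
    message = str(exc)
print(message)

# Passing the key as bytes avoids the error. Creating the manager object
# does not open any socket; that only happens on start() or connect().
manager = QueueManager(address=("127.0.0.1", 8001), authkey=b"qiye")
print(type(manager).__name__)
```

The worker side would need the same bytes value for `authkey`, otherwise authentication between the two processes fails.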

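Typo on page 29

Page 29, line 15:

Original: "## The task process is now written; next, write the task process (taskWorker.py)"

Should be: "## The service process is now written; next, write the task process (taskWorker.py)"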

About the Chapter 12 cnblog crawler

1. In settings.py

IMAGES_STORE = 'F:\\cnblogs'

This fails because drive F cannot be found; a relative path under the project directory would be better.

2. In cnblogs_spider.py

content = paper.xpath(".//*[@class='postCon']/a/text()").extract()[0]

Error:
  File "/Users/lifu/pyworkspace/SpiderBook/ch12/cnblogSpider/cnblogSpider/spiders/cnblogs_spider.py", line 25, in parse
    content = paper.xpath(".//*[@class='postCon']/a/text()").extract()[0]
IndexError: list index out of range

This is probably because the content XPath no longer matches anything, which causes the index-out-of-range error.

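Chapter 12: the XPath for the article summary needs to change

Line 25 of cnblogSpider/cnblogSpider/spiders/cnblogs_spider.py in Chapter 12 currently reads:
content = paper.xpath(".//*[@class='postCon']/a/text()").extract()[0]
but nothing is scraped when I run it. After changing the XPath expression to:
.//*[@class='postCon']/div/text()
it runs. The page's source seems to have changed: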
<div class="postCon"><div class="c_b_p_desc">摘要: 熊猫烧香病毒在当年可是火的一塌糊涂，感染非常迅速，算是病毒史上比较经典的案例。不过已经比

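One correction for you

In Chapter 6, DataOutput.py works fine when producing a plain document, but when it produces an HTML document, opening the file in Firefox shows garbled text, so the following statements need to be added:

fout.write("")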
fout.write('')
fout.write("")
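The statements the reporter added are not legible in the text above. One common way to avoid garbled text in a generated HTML file is to write it as UTF-8 and declare the same charset in the page; the sketch below assumes that approach, and the file name, row data, and table layout are made up.

```python
import os
import tempfile

rows = [{"url": "http://example.com/1", "title": "标题", "summary": "摘要"}]

path = os.path.join(tempfile.mkdtemp(), "baike.html")
# Write the file as UTF-8 and declare the same charset in the page,
# so the browser does not have to guess the encoding.
with open(path, "w", encoding="utf-8") as fout:
    fout.write("<html>")
    fout.write('<head><meta charset="utf-8"></head>')
    fout.write("<body><table>")
    for row in rows:
        fout.write("<tr><td>%s</td><td>%s</td><td>%s</td></tr>"
                   % (row["url"], row["title"], row["summary"]))
    fout.write("</table></body></html>")

with open(path, encoding="utf-8") as fin:
    page = fin.read()
print('<meta charset="utf-8">' in page)
```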

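A line is missing from the printed output on page 24

On page 24, where a thread class is created by subclassing threading.Thread,
the last line of the printed output after running should be this main-thread-ended line:
MainThread ended..

It is missing from the book.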

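Page 31: wrong statement terminator in the code

In the Windows version of taskManager.py on page 31,
line 9:

# Define the send and receive queues

task_queue = Queue.Queue(task_number);
result_queue = Queue.Queue(task_number);

There should be no semicolon terminator ";" here.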

UserWarning: You provided Unicode markup but also provided a value for from_encoding.

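About the warning raised when crawling the "web crawler" entry in Chapter 6
After applying the errata from GitHub to the code from your book, running it gives:
UserWarning: You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.
warnings.warn("You provided Unicode markup but also provided a value for from_encoding. Your from_encoding will be ignored.")
I searched Baidu, and some people say to remove from_encoding='utf-8' from soup = BeautifulSoup(html_cont,'html.parser',from_encoding='utf-8'),
but the error still appears!
I went through your blog and elsewhere and found nothing about this problem. I have hit it several times and am about to give up on the book!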

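Chapter 7: simple distributed crawler

A key point is not covered: the difference between multiprocessing.Queue and the ordinary Queue.

If the ordinary Queue is imported here, communication across processes does not work at all.

The book says to refer to the service-process code in section 1.4.4. That section does use the ordinary Queue, but the later operations there do not use the externally created ordinary Queue directly.

This is easily misleading.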
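One way to see that a plain `queue.Queue` (named `Queue` in Python 2) is tied to a single process: it is built on thread locks and cannot be pickled, and pickling is how objects are passed between processes. `multiprocessing.Queue`, or a queue reached through a manager as in section 1.4.4, is meant for that case instead. A minimal sketch in Python 3:

```python
import pickle
import queue

plain = queue.Queue()
plain.put("ImageUrl_0")

# Objects are sent between processes by pickling them. A queue.Queue holds
# thread locks, which cannot be pickled, so it cannot be shipped this way.
try:
    pickle.dumps(plain)
    picklable = True
except TypeError:
    picklable = False

print("queue.Queue can be pickled:", picklable)
```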

P305: the two functions def parse_item(self, response) and def parse_body(self, response) appear to be typeset incorrectly.

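Page 14: typo when getting the directory name

Page 14, the file name variable is misspelled when getting the directory name:
Original (wrong) --> os.path.dirname(filetpah)
Corrected --> os.path.dirname(filepath)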

Chapter 9: crawling Qunar

The following error occurs when the crawler reaches page 4:
Traceback (most recent call last):
  File "/home/luna/PycharmProjects/CRAWL/Selenium/qunasite.py", line 169, in <module>
    spider.crawl('http://hotel.qunar.com/',u'上海')
  File "/home/luna/PycharmProjects/CRAWL/Selenium/qunasite.py", line 164, in crawl
    self.get_hotel(driver,to_city,today,tomorrow)
  File "/home/luna/PycharmProjects/CRAWL/Selenium/qunasite.py", line 132, in get_hotel
    htm_const = driver.page_source
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 532, in page_source
    return self.execute(Command.GET_PAGE_SOURCE)['value']
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/webdriver.py", line 256, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python2.7/dist-packages/selenium/webdriver/remote/errorhandler.py", line 194, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: bad inspector message: {"id":282,"result":{"result":{"type":"object","value":{"status":0,"value":"\u003C!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd\"\u003E\u003Chtml xmlns="http://www.w3.org/1999/xhtml\"\u003E\u003Chead\u003E\n\t\u003Cmeta http-equiv="Content-Type" content="text/html; charset=UTF-8" /\u003E\n \u003Ctitle\u003E\u4E0A\u6D77\u9152\u5E97_\u4E0A\u6D77\u9152\u5E97\u9884\u8BA2_\u4E0A\u6D77\u9152\u5E97\u4EF7\u683C\u67E5\u8BE2-\u53BB\u54EA\u513FQunar.com\u003C/title\u003E\n \u003Cmeta name="description" content="\u4E0A\u6D77\u9152\u5E97\u9884\u8BA2\u548C\u4E0A\u6D77\u9152\u5E97\u67E5\u8BE2:\u60A8\u53EF\u4EE5\u901A\u8FC7\u4EF7\u683C,\u884C\u653F\u533A,\u5546\u5708,\u661F\u7EA7\u7B49\u5B9E\u65F6\u67E5\u8BE2\u548C\u6BD4\u8F83121\u5BB6\u7F51\u7AD9,20613\u5BB6\u4E0A\u6D77\u9152\u5E97\u6700\u65B0\u4EF7\u683C\u53CA\u62A5\u4EF7! \u53BB\u54EA\u513FQunar.com\u4E3A\u60A8\u63D0\u4F9B\u4E0A\u6D77\u9152\u5E97\u9884\u5B9A\u4E00\u7AD9\u5F0F\u670D\u52A1!" /\u003E\n \u003Cmeta name="keywords" content="\u4E0A\u6D77\u9152\u5E97,

Page 137, section 5.3 "Email reminders", third line from the start: the protocol for sending mail is written as STMP (it should be SMTP).

p138: the code raises an error

Traceback (most recent call last):
  File "C:/spider/chapter5_4.py", line 25, in <module>
    server = smtplib.SMTP(smtp_server, 25)
  File "C:\Python27\Lib\smtplib.py", line 256, in __init__
    (code, msg) = self.connect(host, port)
  File "C:\Python27\Lib\smtplib.py", line 316, in connect
    self.sock = self._get_socket(host, port, self.timeout)
  File "C:\Python27\Lib\smtplib.py", line 291, in _get_socket
    return socket.create_connection((host, port), timeout)
  File "C:\Python27\Lib\socket.py", line 557, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
socket.gaierror: [Errno 11001] getaddrinfo failed

What is the problem here?
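The last line of the traceback, `socket.gaierror ... getaddrinfo failed`, means the host name in `smtp_server` could not be resolved to an address, typically because the server name is misspelled or the machine has no network or DNS access. A hedged sketch for checking this before connecting; the helper name `can_resolve` is made up:

```python
import socket


def can_resolve(host, port=25):
    """Return True if the host name can be resolved to an address."""
    try:
        socket.getaddrinfo(host, port, 0, socket.SOCK_STREAM)
        return True
    except socket.gaierror:
        # Same exception as in the traceback: the name lookup failed.
        return False


# A numeric address needs no DNS lookup, so this works offline.
print(can_resolve("127.0.0.1"))
```

Calling `can_resolve` with the value of `smtp_server` before creating `smtplib.SMTP` separates a name-resolution problem from a login or protocol problem.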
