
python-crawler's Introduction

Python-crawler

Since this code was written quite a while ago, the directory structure of the scraped sites may have changed,
so some of the code may no longer work. Everyone learning to write crawlers is welcome to open PRs against
this project to get more of the code running again!

Learn to write Python crawlers systematically, from scratch.
This repo mainly records my own experience and lessons from writing Python crawlers,
and also shares how to learn crawler writing more efficiently.
IDE: VS Code; Python version: 3.6

Detailed learning path:

Part 1: Beautiful Soup crawlers

Part 2: The Scrapy crawler framework

Part 3: Browser-simulation crawlers

Part 4: Practice projects

python-crawler's People

Contributors

ehco1996, shulincome


python-crawler's Issues

Zhihu column "Writing a Python Crawler from Scratch" --- 1.5 Crawler practice: fetching Baidu Tieba content -- problem found

Win10, PyCharm, Python 3.5
The line litags = soup.find_all('li', attrs={'class': ' j_thread_list clearfix'}) returns an empty list, so no content gets scraped.
Can anyone explain what's going on?
The script runs without reporting any errors.
My code is attached below:

import requests
from bs4 import BeautifulSoup
import time

def get_html(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = 'utf-8'  # for pages with unknown encodings use r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'ERROR'

def get_content(url):
    '''Parse the Tieba page and collect the post info into a list.'''
    # list holding the info for every post
    comments = []
    # first, download the page to be scraped
    html = get_html(url)
    soup = BeautifulSoup(html, 'lxml')

    # find all <li> tags with the class ' j_thread_list clearfix'; returns a list
    litags = soup.find_all('li', attrs={'class': ' j_thread_list clearfix'})
    print(litags)  # litags is []
    print('**************')  # nothing below this line runs
    # loop over the posts and pull out the fields we need
    for li in litags:
        # a dict to hold one post's info
        comment = {}
        print('**************')
        try:
            # extract the fields and save them into the dict
            comment['title'] = li.find('a', attrs={'class': 'j_th_tit '}).text.strip()
            comment['link'] = "http://tieba.baidu.com/" + li.find('a', attrs={'class': 'j_th_tit '})['href']
            comment['name'] = li.find('span', attrs={'class': "tb_icon_author "}).text.strip()
            comment['time'] = li.find('span', attrs={'class': 'pull-right is_show_creat_time'}).text.strip()
            comment['replayNum'] = li.find('span', attrs={'class': "threadlist_rep_num center_text"}).text.strip()
            comments.append(comment)
            print('**************')

        except:
            print('Hit a small problem')

    return comments

def out2file(dict):
    """Write the scraped content to a local file, TTBT.txt."""
    with open('TTBT.txt', 'a+') as f:
        for comment in dict:
            f.write('Writing content...\n')
            f.write('Title: {} \t Link: {} \t Author: {} \t Posted: {} \t Replies: {} '.format(
                comment['title'], comment['link'], comment['name'],
                comment['time'], comment['replayNum']))

    print("Finished scraping the current page!")

def main(base_url, deep):
    url_list = []

    # put the URL of every page to scrape into the list
    for i in range(0, deep):
        url_list.append(base_url + "&pn=" + str(50 * i))

    print("All pages downloaded! Extracting info...")

    # loop over the URLs and write out all the data
    for url in url_list:
        content = get_content(url)
        print(content)  # this line does run; it prints []
        out2file(content)
    print("All info has been saved!")

base_url = 'http://tieba.baidu.com/f?kw=%E7%94%9F%E6%B4%BB%E5%A4%A7%E7%88%86%E7%82%B8&ie=utf-8'
deep = 1  # number of pages to scrape

if __name__ == '__main__':
    main(base_url, deep)
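
Not a confirmed answer, but one hedged guess at why litags comes back empty: Tieba has been known to ship the thread list inside HTML comments (<!-- ... -->), which find_all will not look into. Stripping the comment markers before parsing is one way to test that theory, as in this sketch:

import requests
from bs4 import BeautifulSoup

url = 'http://tieba.baidu.com/f?kw=%E7%94%9F%E6%B4%BB%E5%A4%A7%E7%88%86%E7%82%B8&ie=utf-8'
html = requests.get(url, timeout=30).text
# strip the HTML comment markers so any <li> tags wrapped in
# <!-- ... --> become visible to the parser
html = html.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'lxml')
litags = soup.find_all('li', attrs={'class': ' j_thread_list clearfix'})
print(len(litags))  # non-zero if the comment wrapping was the culprit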

The GBK problem

In the first example, when scraping Baidu Tieba: if a post contains emoji (⬇️ 💄) and the like, the console reports:
UnicodeEncodeError: 'gbk' codec can't encode character '\u2b07' in position 17: illegal multibyte sequence
and nothing more gets scraped.
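
One possible workaround, assuming the real problem is the GBK-encoded Windows console rather than the scraper itself: re-wrap sys.stdout as UTF-8 (a sketch for Python 3.6; on 3.7+ sys.stdout.reconfigure does the same job):

import io
import sys

# Replace the GBK-encoded stdout wrapper with a UTF-8 one so characters
# like '\u2b07' no longer raise UnicodeEncodeError when printed
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')

print('\u2b07 \U0001f484')  # prints (or substitutes) instead of crashing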

Bill page comes back incomplete, data missing

I'm sure the login succeeds, but the bill page returns only half its markup, with none of the data markup. My guess is Alipay is blocking it, as if the browser version in the request headers were too old.
Comparing the bill page: viewed in a browser it is 3,722 lines; fetched by the script it is 338.
The bill page as the script fetched it:
<div i="">d = " J - g l o b a l - n o t i c e - s s l " c l a s s = " g l o b a l - n o t i c e - a n n o u n c e m e n t s s l - v 3 - r c 4 " s t y l e = " b a c k g r o u n d - c o l o r : # f f 6 6 0 0 ; " &gt; </div></div></body></html>

That's where it ends. How do I fix this?
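
If the cause really is user-agent sniffing, one thing worth trying is sending a modern desktop User-Agent on the logged-in session. A hedged sketch; bill_url and the cookie handling are placeholders, not the real Alipay endpoint:

import requests

session = requests.Session()
# pretend to be a current desktop Chrome instead of requests' default UA
session.headers.update({
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/96.0 Safari/537.36'),
})
# log in with this session first (so its cookies are kept), then fetch
# the bill page again and compare the line counts with the browser's
# r = session.get(bill_url)
# print(len(r.text.splitlines()))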

File mojibake, plus a few spelling errors in the comments

For example, in the Baidu Tieba crawler, on Windows it seems you must pass encoding='utf-8' when reading and writing files; otherwise the file comes out garbled.
Also, line 20 of the code,
# r.endcodding = r.apparent_endconding
contains obvious spelling errors (it should read r.encoding = r.apparent_encoding).
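
A minimal illustration of the encoding fix, assuming the Tieba example's out2file function (so TTBT.txt is the output file):

# Without an explicit encoding, open() on Windows falls back to the system
# code page (often GBK), which garbles UTF-8 text; pass it explicitly:
with open('TTBT.txt', 'a+', encoding='utf-8') as f:
    f.write('标题: 测试\n')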
