
python-crawler's Introduction

Python-crawler

Since this code was written quite a while ago, the directory structure of the scraped sites may have changed,
so some of the code may no longer work. Everyone learning to write crawlers is welcome to open PRs against
this project to get more of the code running again!

Learn to write Python crawlers systematically, from scratch.
This repo mainly records my own experience and lessons from writing Python crawlers,
and also shares how to learn crawler writing more efficiently.
IDE: VS Code; Python version: 3.6

Detailed learning path:

Part 1: Beautiful Soup crawlers

Part 2: The Scrapy crawler framework

Part 3: Browser-simulation crawlers

Part 4: Practice projects

python-crawler's People

Contributors

ehco1996, shulincome


python-crawler's Issues

Zhihu column "Writing a Python Crawler from Scratch" --- 1.5 Crawler practice: fetching Baidu Tieba content -- problem found

Win10, PyCharm, Python 3.5
The line litags = soup.find_all('li', attrs={'class': ' j_thread_list clearfix'}) returns an empty list, so no content gets scraped.
Can anyone explain what's going on?
The script runs without reporting any errors.
My code is attached below:

import requests
from bs4 import BeautifulSoup
import time

def get_html(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = 'utf-8'  # for pages with unknown encodings use r.encoding = r.apparent_encoding
        return r.text
    except:
        return 'ERROR'

def get_content(url):
    '''Parse the Tieba page and collect the post info into a list.'''
    # list holding the info for every post
    comments = []
    # first, download the page to be scraped
    html = get_html(url)
    soup = BeautifulSoup(html, 'lxml')

    # find all <li> tags with the class ' j_thread_list clearfix'; returns a list
    litags = soup.find_all('li', attrs={'class': ' j_thread_list clearfix'})
    print(litags)  # litags is []
    print('**************')  # nothing below this line runs
    # loop over the posts and pull out the fields we need
    for li in litags:
        # a dict to hold one post's info
        comment = {}
        print('**************')
        try:
            # extract the fields and save them into the dict
            comment['title'] = li.find('a', attrs={'class': 'j_th_tit '}).text.strip()
            comment['link'] = "http://tieba.baidu.com/" + li.find('a', attrs={'class': 'j_th_tit '})['href']
            comment['name'] = li.find('span', attrs={'class': "tb_icon_author "}).text.strip()
            comment['time'] = li.find('span', attrs={'class': 'pull-right is_show_creat_time'}).text.strip()
            comment['replayNum'] = li.find('span', attrs={'class': "threadlist_rep_num center_text"}).text.strip()
            comments.append(comment)
            print('**************')

        except:
            print('Hit a small problem')

    return comments

def out2file(dict):
    """Write the scraped content to a local file, TTBT.txt."""
    with open('TTBT.txt', 'a+') as f:
        for comment in dict:
            f.write('Writing content...\n')
            f.write('Title: {} \t Link: {} \t Author: {} \t Posted: {} \t Replies: {} '.format(
                comment['title'], comment['link'], comment['name'],
                comment['time'], comment['replayNum']))

    print("Finished scraping the current page!")

def main(base_url, deep):
    url_list = []

    # put the URL of every page to scrape into the list
    for i in range(0, deep):
        url_list.append(base_url + "&pn=" + str(50 * i))

    print("All pages downloaded! Extracting info...")

    # loop over the URLs and write out all the data
    for url in url_list:
        content = get_content(url)
        print(content)  # this line does run; it prints []
        out2file(content)
    print("All info has been saved!")

base_url = 'http://tieba.baidu.com/f?kw=%E7%94%9F%E6%B4%BB%E5%A4%A7%E7%88%86%E7%82%B8&ie=utf-8'
deep = 1  # number of pages to scrape

if __name__ == '__main__':
    main(base_url, deep)
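
Not a confirmed answer, but one hedged guess at why litags comes back empty: Tieba has been known to ship the thread list inside HTML comments (<!-- ... -->), which find_all will not look into. Stripping the comment markers before parsing is one way to test that theory, as in this sketch:

import requests
from bs4 import BeautifulSoup

url = 'http://tieba.baidu.com/f?kw=%E7%94%9F%E6%B4%BB%E5%A4%A7%E7%88%86%E7%82%B8&ie=utf-8'
html = requests.get(url, timeout=30).text
# strip the HTML comment markers so any <li> tags wrapped in
# <!-- ... --> become visible to the parser
html = html.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'lxml')
litags = soup.find_all('li', attrs={'class': ' j_thread_list clearfix'})
print(len(litags))  # non-zero if the comment wrapping was the culprit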

The GBK problem

In the first example, when scraping Baidu Tieba: if a post contains emoji (⬇️ 💄) and the like, the console reports:
UnicodeEncodeError: 'gbk' codec can't encode character '\u2b07' in position 17: illegal multibyte sequence
and nothing more gets scraped.
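
One possible workaround, assuming the real problem is the GBK-encoded Windows console rather than the scraper itself: re-wrap sys.stdout as UTF-8 (a sketch for Python 3.6; on 3.7+ sys.stdout.reconfigure does the same job):

import io
import sys

# Replace the GBK-encoded stdout wrapper with a UTF-8 one so characters
# like '\u2b07' no longer raise UnicodeEncodeError when printed
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')

print('\u2b07 \U0001f484')  # prints (or substitutes) instead of crashing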

Bill page comes back incomplete, data missing

I'm sure the login succeeds, but the bill page returns only half its markup, with none of the data markup. My guess is Alipay is blocking it, as if the browser version in the request headers were too old.
Comparing the bill page: viewed in a browser it is 3,722 lines; fetched by the script it is 338.
The bill page as the script fetched it:
<div i="">d = " J - g l o b a l - n o t i c e - s s l " c l a s s = " g l o b a l - n o t i c e - a n n o u n c e m e n t s s l - v 3 - r c 4 " s t y l e = " b a c k g r o u n d - c o l o r : # f f 6 6 0 0 ; " &gt; </div></div></body></html>

That's where it ends. How do I fix this?
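
If the cause really is user-agent sniffing, one thing worth trying is sending a modern desktop User-Agent on the logged-in session. A hedged sketch; bill_url and the cookie handling are placeholders, not the real Alipay endpoint:

import requests

session = requests.Session()
# pretend to be a current desktop Chrome instead of requests' default UA
session.headers.update({
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/96.0 Safari/537.36'),
})
# log in with this session first (so its cookies are kept), then fetch
# the bill page again and compare the line counts with the browser's
# r = session.get(bill_url)
# print(len(r.text.splitlines()))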

File mojibake, plus a few spelling errors in the comments

For example, in the Baidu Tieba crawler, on Windows it seems you must pass encoding='utf-8' when reading and writing files; otherwise the file comes out garbled.
Also, line 20 of the code,
# r.endcodding = r.apparent_endconding
contains obvious spelling errors (it should read r.encoding = r.apparent_encoding).
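
A minimal illustration of the encoding fix, assuming the Tieba example's out2file function (so TTBT.txt is the output file):

# Without an explicit encoding, open() on Windows falls back to the system
# code page (often GBK), which garbles UTF-8 text; pass it explicitly:
with open('TTBT.txt', 'a+', encoding='utf-8') as f:
    f.write('标题: 测试\n')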
