baiducrawler's Introduction

BaiduCrawler

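Crawls the data inside c-abstract in Baidu search results, and gets around Baidu's anti-crawler measures by continually switching proxy IPs, so that Baidu search results for on the order of 100,000 search terms can be crawled without interruption.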
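
As a rough illustration of the extraction step, the sketch below collects the text of every element whose class list contains c-abstract. It assumes c-abstract is a CSS class on the element wrapping each result's abstract, and it uses only the standard-library HTML parser to stay dependency-free; it is not the project's own parsing code, and the names AbstractExtractor and extract_abstracts are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class AbstractExtractor(HTMLParser):
    """Collect the text of every element whose class list contains c-abstract."""

    def __init__(self):
        super().__init__()
        self.abstracts = []  # finished abstracts, in document order
        self._depth = 0      # nesting depth inside the current abstract (0 = outside)
        self._chunks = []    # text pieces of the abstract being read

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1  # a nested element inside the abstract
        elif "c-abstract" in (dict(attrs).get("class") or "").split():
            self._depth = 1   # entering a new abstract element
            self._chunks = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:  # the abstract element itself just closed
            self.abstracts.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_abstracts(html):
    """Return the abstract texts found in one page of (assumed) result HTML."""
    parser = AbstractExtractor()
    parser.feed(html)
    parser.close()
    return parser.abstracts
```

Calling extract_abstracts(page_html) on a downloaded result page returns a list of plain-text abstracts; markup nested inside an abstract is dropped and only its text is kept.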

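Proxy IP acquisition strategy

1. Scrape every [ip:port] pair from the page and check that each one is usable (some proxy IPs cannot be reached at all).
2. Apply a "multi-round check": each IP goes through N rounds of connection tests spaced duration apart, and every round discards the IPs whose connection time exceeds timeout. The IPs that survive all N rounds connected within timeout every single time, which avoids the "15 minutes of fame" effect, where a proxy works briefly and then dies.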
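
The multi-round check above can be sketched as follows. This is a minimal illustration, not the code in ip_pool.py: the function name multi_round_filter, the probe callable, and the default values are all assumptions made for the example. The parameters rounds, duration, and timeout correspond to N, duration, and timeout in the description.

```python
import time


def multi_round_filter(proxies, probe, rounds=3, duration=60.0, timeout=2.0,
                       sleep=time.sleep):
    """Keep only the proxies that connect within `timeout` seconds in every round.

    `probe(proxy)` is a hypothetical callable that returns the connection time
    in seconds, or None when the proxy cannot be reached.
    """
    alive = list(proxies)
    for round_no in range(rounds):
        survivors = []
        for proxy in alive:
            elapsed = probe(proxy)
            # Drop proxies that are unreachable or slower than the timeout.
            if elapsed is not None and elapsed <= timeout:
                survivors.append(proxy)
        alive = survivors
        if not alive:
            break  # nothing left to test
        if round_no < rounds - 1:
            sleep(duration)  # wait before the next round
    return alive
```

A real probe would typically time one request sent through the proxy and return None on any connection error. Passing sleep as a parameter makes it possible to try the function without actually waiting between rounds.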

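Crawling strategy

Three rules apply:

1. Whenever a download_error occurs, switch to a new IP.
2. After every 200 texts crawled, switch to a new IP.
3. After every 20,000 crawls, refresh the IP pool.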

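All of the above parameters can be adjusted manually. At present the IP pool is single-use. If you need more high-quality IPs, see my other project, Proxy, an all-in-one tool that scrapes, tests, evaluates, and stores proxy IPs; it may help.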
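
The three switching rules can be sketched as a small bookkeeping class. This is an illustration under stated assumptions rather than the project's implementation: the names ProxyRotator, fetch_pool, and record are hypothetical, "texts" is read as the number of abstracts returned by a request, and "single-use" is read as each IP being discarded once it has been switched away from.

```python
from collections import deque


class ProxyRotator:
    """Track which proxy to use next under the three switching rules."""

    def __init__(self, fetch_pool, texts_per_ip=200, requests_per_pool=20000):
        # fetch_pool: hypothetical callable returning a fresh list of proxies.
        self.fetch_pool = fetch_pool
        self.texts_per_ip = texts_per_ip
        self.requests_per_pool = requests_per_pool
        self.pool = deque()
        self.current = None
        self.texts_on_ip = 0
        self.requests_on_pool = 0
        self._refresh()

    def _refresh(self):
        """Rule 3: replace the whole pool and start on its first proxy."""
        self.pool = deque(self.fetch_pool())
        if not self.pool:
            raise RuntimeError("fetch_pool() returned no proxies")
        self.requests_on_pool = 0
        self._switch()

    def _switch(self):
        """Rules 1 and 2: move to the next proxy; the old one is not reused."""
        if not self.pool:
            self._refresh()  # pool exhausted early, so fetch a new one
            return
        self.current = self.pool.popleft()
        self.texts_on_ip = 0

    def record(self, ok, texts=0):
        """Report one request's outcome and return the proxy to use next."""
        self.requests_on_pool += 1
        if self.requests_on_pool >= self.requests_per_pool:
            self._refresh()
        elif not ok:  # a download error
            self._switch()
        else:
            self.texts_on_ip += texts
            if self.texts_on_ip >= self.texts_per_ip:
                self._switch()
        return self.current
```

The crawler loop would call record(ok, texts) after each request and use the returned proxy for the next one. Keeping the thresholds as constructor arguments mirrors the note that all parameters are adjustable.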

TODO

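1. Re-crawl the terms that were missed because of network problems, until the crawl rate specified by the user is reached.
2. Give more weight to fast, high-quality IPs, so that the IP pool becomes prioritized.
3. Rewrite IP evaluation to be multithreaded.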

Usage

Preparation

pip install requests
pip install lxml
pip install beautifulsoup4

git clone https://github.com/fancoo/BaiduCrawler
cd BaiduCrawler

Python 2.7

python baidu_crawler.py

Python 3

This program has only been tested on the Windows build of Python 3.6.

cd Py3
python baidu_crawler.py

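Update 2017/5/4

The site previously used to check whether an IP is valid stopped working and has been replaced.
Added more proxy IP sites.
Improved configurability.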

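Update 2017/6/13

Scraped proxy IPs are now stored in MySQL. On the next run they are read from the database first, and only then scraped from the websites.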

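Update 2017/6/18

Revised parts of the PR submitted by BoBoGithub and refactored ip_pool.py.
This version in fact only saves valid IPs to the database; it does not yet rank IPs by quality or crawl with multiple threads. Because time and energy are limited, these may be added in the future.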

Update 2017/7/25

Added support for Python 3.6.

baiducrawler's Issues

Is MySQL failing to connect?

File "e:/py/baidu/baidu_crawler.py", line 78
print "总共：" + str(len(useful_proxies)) + 'IP可用'
^
SyntaxError: invalid syntax
What should I do? Are there detailed instructions for connecting to MySQL?

It stops working after changing the keyword

Your crawler stops working once the keyword is changed. Also, my question is: why not use BeautifulSoup?

Where is the multithreading handled?

I see a version among the commits that supports multithreading. Is there concrete code for it?
