
vulcan's Introduction

Vulcan Spider

A dynamic crawler framework built on gevent and a multi-threaded model, with WebKit engine support for DOM parsing.

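Features

  1. Supports two concurrency models: gevent and multi-threading
  2. Supports the WebKit engine (DOM parsing, AJAX fetching, etc.)
  3. Several customizable options
  • Maximum crawl depth limit
  • Maximum number of URLs to fetch
  • Same-origin (domain) restriction
  • Custom headers (User-Agent, Cookies, etc.)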

Dependencies

  • Python 2.7+ (required)
  • gevent 1.0 (required)
  • lxml 2.3 (required, for static parsing)
  • chardet 2.2.1 (required)
  • requests 1.2.3 (required)
  • splinter 0.6.0 (optional, WebKit framework for dynamic parsing)
  • phantomjs 1.9 (optional, WebKit engine)

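Notes

1. The framework consists of two parts:

  • fetcher: the downloader. It retrieves HTML and passes it to the crawler.
  • crawler: the parser. It parses the HTML, extracts the URLs it contains, and passes them to the fetcher.

The fetcher and crawler work independently and do not interfere with each other; they are connected through queues. The fetcher sends HTTP requests, which are blocking operations, so it is managed by a gevent pool. The crawler performs no blocking operations, but for extensibility it can be run under either a gevent pool or a thread pool.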
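The two-stage design described above can be sketched roughly as follows. This is not vulcan's implementation: it uses standard-library threads and `queue.Queue` (Python 3) in place of gevent pools, and `FAKE_SITE`, `run_pipeline`, `fetcher_num`, and `crawler_num` are hypothetical names introduced for illustration. The "download" step reads from an in-memory dictionary so the sketch runs without network access.

```python
import queue
import threading

# Hypothetical in-memory "web": maps a URL to the links found on that page.
# It stands in for real HTTP requests so the sketch runs offline.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}


def run_pipeline(seed, fetcher_num=2, crawler_num=2):
    fetch_queue = queue.Queue()  # URLs waiting to be downloaded
    crawl_queue = queue.Queue()  # downloaded pages waiting to be parsed
    seen = {seed}
    lock = threading.Lock()

    def fetcher():
        while True:
            url = fetch_queue.get()
            if url is None:  # sentinel: stop this worker
                return
            # "Download" the page and hand it to the crawler stage.
            crawl_queue.put((url, FAKE_SITE.get(url, [])))

    def crawler():
        while True:
            page = crawl_queue.get()
            if page is None:  # sentinel: stop this worker
                return
            _url, links = page
            for link in links:
                with lock:
                    if link in seen:
                        continue
                    seen.add(link)
                fetch_queue.put(link)  # feed new URLs back to the fetcher
            # A URL counts as finished only after its page has been parsed.
            fetch_queue.task_done()

    workers = [threading.Thread(target=fetcher) for _ in range(fetcher_num)]
    workers += [threading.Thread(target=crawler) for _ in range(crawler_num)]
    for worker in workers:
        worker.start()

    fetch_queue.put(seed)
    fetch_queue.join()  # wait until every queued URL has been fully processed

    for _ in range(fetcher_num):
        fetch_queue.put(None)
    for _ in range(crawler_num):
        crawl_queue.put(None)
    for worker in workers:
        worker.join()
    return seen
```

Calling `run_pipeline("http://example.com/")` returns the set of URLs that were discovered. Because new URLs are queued before the parent URL is marked done, `fetch_queue.join()` cannot return while work is still pending.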

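2. Spider options:

  • concurrent_num : number of concurrent crawlers and fetchers
  • crawl_tags : list of tags from which URLs are collected while crawling
  • depth : crawl depth limit
  • max_url_num : maximum number of URLs to collect
  • internal_timeout : timeout for internal calls
  • spider_timeout : timeout for the spider as a whole
  • crawler_mode : crawler model (0: multi-threading, 1: gevent)
  • same_origin : whether to restrict crawling to the same origin
  • dynamic_parse : whether to use WebKit for dynamic parsing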
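To make the limiting options concrete, here is a minimal sketch of how `depth`, `max_url_num`, and `same_origin` could be applied when deciding whether to collect a URL. It is an illustration under assumptions, not vulcan's code: the helper names `same_origin` and `should_collect` and the `restrict_same_origin` parameter are hypothetical, and "same origin" is assumed here to mean identical scheme, host, and port, which may differ from the project's own definition.

```python
from urllib.parse import urlsplit


def same_origin(url, seed):
    """Return True if url has the same scheme, host and port as seed.

    Assumption: this is the usual browser-style definition; explicit
    default ports (e.g. ":80") are not normalized in this sketch.
    """
    a, b = urlsplit(url), urlsplit(seed)
    return (a.scheme, a.hostname, a.port) == (b.scheme, b.hostname, b.port)


def should_collect(url, url_depth, seed, collected,
                   depth=3, max_url_num=300, restrict_same_origin=True):
    """Decide whether a newly discovered URL should be collected."""
    if url in collected:                 # already known
        return False
    if url_depth > depth:                # beyond the crawl depth limit
        return False
    if len(collected) >= max_url_num:    # URL budget exhausted
        return False
    if restrict_same_origin and not same_origin(url, seed):
        return False
    return True
```

For example, `should_collect("http://example.com/a", 1, "http://example.com/", set())` accepts the URL, while the same call with `url_depth=4` rejects it under the default `depth=3`.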

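Example

# Create a spider with 20 concurrent crawlers and fetchers, a depth limit of 3,
# at most 300 collected URLs, and the gevent crawler model (crawler_mode=1).
spider = Spider(concurrent_num=20, depth=3, max_url_num=300, crawler_mode=1)
# Seed the crawl with a starting URL, then run it.
spider.feed_url("http://www.baidu.com/")
spider.start()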


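Changelog

20140217:

  • Fixed a bug in the fetcher.
  • Added a plugin mechanism: custom classes and class methods can be defined as plugins to process spider requests and HTML.
  • Added a socket timeout mechanism in gevent mode.
  • Added a URL suffix blacklist.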
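As a rough idea of what a URL suffix blacklist does, the sketch below skips URLs whose path ends in an unwanted file extension. The blacklist contents and the name `is_blacklisted` are hypothetical; the source does not list which suffixes vulcan actually blocks or how it matches them.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical blacklist; the suffixes vulcan uses are not given in the source.
SUFFIX_BLACKLIST = {".jpg", ".png", ".gif", ".css", ".zip", ".pdf"}


def is_blacklisted(url, blacklist=SUFFIX_BLACKLIST):
    """Return True if the URL's path ends with a blacklisted suffix.

    Only the path is inspected, so suffixes that appear inside the
    query string or fragment do not trigger a match.
    """
    path = urlsplit(url).path
    suffix = posixpath.splitext(path)[1].lower()
    return suffix in blacklist
```

Matching on the parsed path rather than the raw string keeps a URL such as `http://example.com/index.php?file=a.jpg` from being skipped by mistake.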

TODO

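  • Store URLs split into separate parts (pagename, params, fragments, post data)
  • Merge similar URLs
  • The framework's runtime stability has been ensured; the project is offered as a starting point for others to build on.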
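The first two TODO items could be approached along the following lines. This is only a sketch of one possible design, not planned or existing vulcan code: `UrlParts`, `split_url`, `similarity_key`, and `merge_similar` are hypothetical names, and "similar" is assumed to mean "same page and same set of query parameter names, ignoring values".

```python
from collections import namedtuple
from urllib.parse import parse_qsl, urlsplit

# Hypothetical record holding the independent parts of a request.
UrlParts = namedtuple("UrlParts", "pagename params fragment post_data")


def split_url(url, post_data=""):
    """Split a URL into page name, query parameters, fragment and POST data."""
    parts = urlsplit(url)
    pagename = "%s://%s%s" % (parts.scheme, parts.netloc, parts.path)
    params = parse_qsl(parts.query, keep_blank_values=True)
    return UrlParts(pagename, params, parts.fragment, post_data)


def similarity_key(url):
    """Assumption: URLs are similar if page and parameter names match."""
    parts = split_url(url)
    names = tuple(sorted({name for name, _value in parts.params}))
    return (parts.pagename, names)


def merge_similar(urls):
    """Keep the first URL seen for each similarity key, in input order."""
    kept = {}
    for url in urls:
        kept.setdefault(similarity_key(url), url)
    return list(kept.values())
```

Under this definition, `item.php?id=1` and `item.php?id=2` collapse into one entry, while `list.php?page=1` is kept separately.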

LICENSE

Copyright © 2014 by pnig0s

Released under the MIT license: rem.mit-license.org
