Project Name

=========

gooseeker

GooSeeker instant-mode web crawler project

Project Background

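In the launch notes for the Python instant web crawler project, we discussed one figure: the time programmers waste debugging content-extraction rules. About 80% of the work in web data scraping goes into writing extraction rules for the different data structures of different websites.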

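That is why we started this project: to free programmers from tedious rule debugging so they can put their effort into higher-level data processing work.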

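GooSeeker has released an XSLT-based content extractor. The XSLT can be obtained through the GooSeeker API, which saves you 90% of the time otherwise spent debugging regular expressions or XPath.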
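To make the idea of an XSLT-based extractor concrete, here is a minimal sketch of applying an XSLT rule to an HTML page in Python. It assumes the third-party `lxml` library is installed. The stylesheet, the sample HTML, and the names `XSLT_RULE`, `SAMPLE_HTML`, and `extract` are all hypothetical and written by hand for this illustration; in the project the rule would come from the GooSeeker API or from a local file such as crawler/xslt_bbs.xml, and the extractor class in core/gooseeker.py may work differently.

```python
from lxml import etree

# Hypothetical hand-written XSLT rule for this sketch only.
# It turns every <li class="topic"> into a <topic> element.
XSLT_RULE = b"""<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8"/>
  <xsl:template match="/">
    <topics>
      <xsl:for-each select="//li[@class='topic']">
        <topic>
          <title><xsl:value-of select="normalize-space(a)"/></title>
          <link><xsl:value-of select="a/@href"/></link>
        </topic>
      </xsl:for-each>
    </topics>
  </xsl:template>
</xsl:stylesheet>
"""

# Hypothetical page content standing in for a downloaded web page.
SAMPLE_HTML = """
<html><body><ul>
  <li class="topic"><a href="/t/1">First topic</a></li>
  <li class="topic"><a href="/t/2">Second topic</a></li>
</ul></body></html>
"""


def extract(html_text, xslt_bytes):
    """Parse the HTML, apply the XSLT rule, and return the result tree."""
    doc = etree.HTML(html_text)                    # lenient HTML parser
    transform = etree.XSLT(etree.XML(xslt_bytes))  # compile the stylesheet
    return transform(doc)


result = extract(SAMPLE_HTML, XSLT_RULE)
titles = [node.text for node in result.getroot().iter("title")]
links = [node.text for node in result.getroot().iter("link")]
print(titles)
print(links)
print(str(result))  # the structured XML produced by the rule
```

The point of this design is that all site-specific logic lives in the XSLT rule, so the Python code stays the same from one website to the next and only the rule is swapped.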

Project Resources

Entry page

Python discussion forum

Zhihu column

GooSeeker harvest-mode web crawler

Project Directory and File Descriptions

gooseeker

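- core/gooseeker.py  Extractor class
- core/README  Documentation file

- crawler/anjuke.py  Scrapes Anjuke real-estate agents
- crawler/result1.xml  Anjuke real-estate agent result file 1
- crawler/result2.xml  Anjuke real-estate agent result file 2
- crawler/crawl_gooseeker_bbs.py  Scrapes GooSeeker forum content
- crawler/xslt_bbs.xml  Local XSLT file for extracting GooSeeker forum content
- crawler/douban.py  Scrapes Douban group discussion topics

- crawler/simpleSpider  A small spider (based on the open-source Scrapy framework)
- crawler/tmSpider  Scrapes Tmall product information (based on the open-source Scrapy framework)

- test/readPdf.py  Reads PDF documents with Python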

Contributors

fullerhua, gz51837844, ipfono
