Giter Club home page Giter Club logo

xcrawler's Introduction

xcrawler, a light-weight web crawler framework

Introduction

xcrawler, it's a light-weight web crawler framework. Some of its design concepts are borrowed from the well-known framework Scrapy. The downloader of the engine is implemented with the requests library.

I'm very interested in web crawling, however, I'm just a newbie to web scraping. I did this so that I can learn more basics of web crawling and Python language.

Features

  • Very simple;
  • Very easy to customize your own spider;
  • Process multiple requests and responses simultaneously.

TO-DO

  • Use priority queue instead;
  • Add docs and tests.

Examples

class BaiduNewsSpider(BaseSpider):
    name = 'baidu_news_spider'
    start_urls = ['http://news.baidu.com/']
    default_headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/50.0.2661.102 Safari/537.36'
    }

    def spider_started(self):
        self.file = open('items.jl', 'w')

    def spider_stopped(self):
        self.file.close()

    def spider_idle(self):
        # you can add new requests to the engine
        print('I am in idle mode')
        # self.crawler.crawl(new_request, spider=self)

    def make_requests_from_url(self, url):
        return Request(url, headers=self.default_headers)

    def parse(self, response):
        root = fromstring(response.content, base_url=response.base_url)
        for element in root.xpath('//a[@target="_blank"]'):
            title = self._extract_first(element, 'text()')
            link = self._extract_first(element, '@href').strip()
            if title:
                if link.startswith('http://') or link.startswith('https://'):
                    yield {'title': title, 'link': link}
                    yield Request(link, headers=self.default_headers, callback=self.parse_news,
                                  meta={'title': title})

    def parse_news(self, response):
        pass

    def process_item(self, item):
        print(item)
        print(json.dumps(item, ensure_ascii=False), file=self.file)

    @staticmethod
    def _extract_first(element, exp, default=''):
        r = element.xpath(exp)
        if len(r):
            return r[0]

        return default


def main():
    settings = {
        'download_delay': 1,
        'download_timeout': 6,
        'retry_on_timeout': True,
        'concurrent_requests': 16,
        'queue_size': 512
    }
    crawler = CrawlerProcess(settings, 'DEBUG')
    crawler.crawl(BaiduNewsSpider)
    crawler.start()

main()
  • log

  • results

Changelog

2017-01-14

  1. Fix some known bugs of the engine.
  2. Use a Bloom filter instead of the set container, it's helpful for general crawling.
  3. Add a new general spider, and demonstrate how to get the basic information of a website.

License

xcrawler is licensed under the MIT license, please feel free and happy crawling!

xcrawler's People

Contributors

ifaceless avatar

Watchers

James Cloos avatar 0.0 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.