🇷🇺 Grab Framework Project

Project Status

Important notice: the pycurl backend has been dropped. The only network transport now is urllib3.

The project is in a slow refactoring stage. There may be no new features.

Things that are going to happen (no time estimates):

  • Refactoring the source code while keeping most of the external API unchanged
  • Fixing bugs
  • Annotating source code with type hints
  • Improving quality of source code to comply with pylint and other linters
  • Moving some features into external packages or moving external dependencies inside Grab
  • Fixing memory leaks
  • Improving test coverage
  • Adding more platforms and Python versions to the test matrix
  • Releasing new versions on PyPI

Installation

$ pip install -U grab

For details about installing Grab on different platforms, see http://docs.grablib.org/en/latest/usage/installation.html
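
To confirm that the installation worked, one option is to ask Python's standard `importlib.metadata` module for the installed version. This is a minimal sketch, not part of Grab itself; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(package="grab"):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    version = installed_version()
    print("grab is not installed" if version is None else "grab " + version)
```

The helper returns `None` rather than raising, so it can be used in a setup script to decide whether `pip install -U grab` still needs to run.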

Documentation

The documentation is available at grab.readthedocs.io

Telegram chat groups

About Grab (very old description)

Grab is a Python web scraping framework. Grab provides a number of helpful methods to perform network requests, scrape websites and process the scraped content:

  • Automatic cookies (session) support
  • HTTPS/SOCKS proxy support with/without authentication
  • Keep-Alive support
  • IDN support
  • Tools to work with web forms
  • Easy multipart file uploading
  • Flexible customization of HTTP requests
  • Automatic charset detection
  • Powerful API to extract data from the DOM tree of HTML documents with XPath queries
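
To illustrate the idea behind XPath-based extraction from the last item, here is a small standalone sketch. It does not use Grab: it assumes only the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath and requires well-formed markup. The sample page and the function name `extract_links` are hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny well-formed document standing in for a fetched page (made-up data).
PAGE = """
<html>
  <body>
    <ul id="repos">
      <li><a href="/example/first">first</a></li>
      <li><a href="/example/second">second</a></li>
    </ul>
    <ul id="other">
      <li><a href="/example/ignored">ignored</a></li>
    </ul>
  </body>
</html>
"""


def extract_links(markup):
    """Return (text, href) pairs for links inside the list with id "repos"."""
    root = ET.fromstring(markup.strip())
    # ElementTree understands only a subset of XPath; attribute predicates work.
    anchors = root.findall(".//ul[@id='repos']/li/a")
    return [(a.text, a.get("href")) for a in anchors]


links = extract_links(PAGE)
for text, href in links:
    print(text, href)
```

Real HTML is rarely well-formed XML, which is why scraping libraries typically rely on a tolerant HTML parser instead of the strict parser used in this sketch.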

Grab provides an interface called Spider for developing multithreaded website scrapers:

  • Rules and conventions to organize crawling logic
  • Multiple parallel network requests
  • Automatic processing of network errors (failed tasks go back to task queue)
  • You can create network requests and parse responses with the Grab API (see above)
  • Different backends for the task queue (in-memory, Redis, MongoDB)
  • Tools to debug and collect statistics
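
The combination of parallel requests and re-queuing of failed tasks can be sketched with the standard library alone. The code below is an illustration of the general pattern, not Spider's actual implementation. The names `run_tasks`, `flaky_fetch` and `max_tries`, and the example URLs, are assumptions made for this sketch, and the network is simulated so the example runs offline.

```python
import queue
import threading


def run_tasks(urls, fetch, thread_number=2, max_tries=3):
    """Fetch urls in parallel; a failed task goes back to the queue until max_tries."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put((url, 1))  # (url, attempt number)
    results, failed = {}, []
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work
                tasks.task_done()
                return
            url, attempt = item
            try:
                body = fetch(url)
            except OSError:
                with lock:
                    if attempt < max_tries:
                        tasks.put((url, attempt + 1))  # back to the task queue
                    else:
                        failed.append(url)
            else:
                with lock:
                    results[url] = body
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(thread_number)]
    for thread in threads:
        thread.start()
    tasks.join()  # wait until every task, including re-queued ones, is done
    for _ in threads:
        tasks.put(None)
    for thread in threads:
        thread.join()
    return results, failed


# Simulated network: "flaky" URLs fail once, "dead" URLs always fail.
attempts = {}
attempts_lock = threading.Lock()


def flaky_fetch(url):
    with attempts_lock:
        attempts[url] = attempts.get(url, 0) + 1
        count = attempts[url]
    if "dead" in url or ("flaky" in url and count == 1):
        raise OSError("simulated network error")
    return "body of " + url


results, failed = run_tasks(
    ["http://a.example/ok", "http://a.example/flaky", "http://a.example/dead"],
    flaky_fetch,
)
print(sorted(results), failed)
```

A failed task is put back on the queue before the current one is marked done, so `tasks.join()` cannot return while a retry is still pending.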

Grab Example

    import logging

    from grab import Grab

    logging.basicConfig(level=logging.DEBUG)

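    g = Grab()

    # Open the login page, fill in the form fields and submit the form
    g.go('https://github.com/login')
    g.doc.set_input('login', '****')
    g.doc.set_input('password', '****')
    g.doc.submit()

    # Save the response body to a file for inspection
    g.doc.save('/tmp/x.html')

    # Fail if the sign-out button is missing, i.e. the login did not succeed
    g.doc('//ul[@id="user-links"]//button[contains(@class, "signout")]').assert_exists()

    # Build the URL of the repository list from the profile link
    home_url = g.doc('//a[contains(@class, "header-nav-link name")]/@href').text()
    repo_url = home_url + '?tab=repositories'

    g.go(repo_url)

    # Print the name and absolute URL of every repository on the page
    for elem in g.doc.select('//h3[@class="repo-list-name"]/a'):
        print('%s: %s' % (elem.text(),
                          g.make_url_absolute(elem.attr('href'))))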

Grab::Spider Example

    import logging

    from grab.spider import Spider, Task

    logging.basicConfig(level=logging.DEBUG)


    class ExampleSpider(Spider):
        def task_generator(self):
            # Produce one task named "search" per language
            for lang in 'python', 'ruby', 'perl':
                url = 'https://www.google.com/search?q=%s' % lang
                yield Task('search', url=url, lang=lang)

        def task_search(self, grab, task):
            # Handler for tasks named "search": print the text of the matched <cite>
            print('%s: %s' % (task.lang,
                              grab.doc('//div[@class="s"]//cite').text()))


    # Run the spider with two network threads
    bot = ExampleSpider(thread_number=2)
    bot.run()

Contributors

lorien, signaldetect, rushter, egorsmkv, michael-f-bryan, subeax, xxxxxxxxxxxxx, sashahart, yegorov-p, imbolc, shamcode, spikevlg, oiwn, kevinlondon, alxistr, dmytrokyrychuk, brabadu, usergrab, rblack, allineer, tri0l, dekat, ixtel, 2dkot, valfa14, artem279, temptask, skingreek, dpwiz, matlex
