AIOCrawler

Asynchronous web crawler built on asyncio

Installation

pip install pyaiocrawler

Usage

Generating sitemap

import asyncio
from aiocrawler import SitemapCrawler

crawler = SitemapCrawler('https://www.google.com', depth=3)
sitemap = asyncio.run(crawler.get_results())
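To illustrate what a depth limit means for a crawl, here is a minimal sketch of a breadth-first walk over an in-memory link graph. It is not AIOCrawler's implementation: the LINKS graph and the crawl function are hypothetical, and the sketch assumes the start page counts as depth 0, which the library may or may not do.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical in-memory link graph standing in for real pages
LINKS: Dict[str, List[str]] = {
    '/': ['/a', '/b'],
    '/a': ['/a/1'],
    '/b': ['/'],
    '/a/1': ['/a/1/x'],
    '/a/1/x': [],
}


def crawl(start: str, depth: int) -> Set[str]:
    '''Breadth-first walk that follows links at most `depth` hops from `start`.'''
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, level = queue.popleft()
        if level == depth:
            continue  # Do not follow links beyond the depth limit
        for link in LINKS.get(url, []):
            if link not in seen:  # Visit each page only once
                seen.add(link)
                queue.append((link, level + 1))
    return seen
```

With this graph, a depth of 1 reaches only the pages linked directly from the start page, and each extra level of depth adds one more hop.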

Configuring the crawler

from aiocrawler import SitemapCrawler

crawler = SitemapCrawler(
    init_url='https://www.google.com', # The base URL to start crawling from
    depth=3,                           # Maximum link depth to crawl to
    concurrency=300,                   # Maximum number of concurrent requests
    max_retries=3,                     # Maximum number of retries for a URL that fails to respond
    user_agent='My Crawler',           # Use a custom user agent for requests
)
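As a rough illustration of what the concurrency and max_retries options control, the sketch below caps in-flight requests with asyncio.Semaphore and retries failed fetches. It is a sketch under stated assumptions, not the library's internals: fetch_all and its fetch argument are hypothetical, and it assumes a failed request raises OSError and that max_retries counts retries after the first attempt.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],  # Hypothetical coroutine that downloads one URL
    concurrency: int,
    max_retries: int,
) -> List[Optional[str]]:
    '''Fetches every URL with at most `concurrency` requests in flight.'''
    semaphore = asyncio.Semaphore(concurrency)

    async def fetch_one(url: str) -> Optional[str]:
        async with semaphore:
            # One initial attempt plus up to `max_retries` retries
            for _ in range(max_retries + 1):
                try:
                    return await fetch(url)
                except OSError:  # Assumed failure type for this sketch
                    continue
            return None  # Give up after the final retry

    # Results come back in the same order as `urls`
    return await asyncio.gather(*(fetch_one(url) for url in urls))
```

A URL that keeps failing yields None in the result list instead of aborting the whole batch.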

Extending the crawler

To create your own crawler, subclass AIOCrawler and implement the parse method. parse is called for every crawled page with the URL of the page, the set of links found on the page, and the HTML of the page. Each return value of parse is appended to a list, which is available by calling the get_results method. The example crawler below extracts the title from each page.

import asyncio
from aiocrawler import AIOCrawler
from bs4 import BeautifulSoup          # We will use beautifulsoup to extract the title from the html
from typing import Set, Tuple


class TitleScraper(AIOCrawler):
    '''
    Subclasses AIOCrawler to extract titles for the pages on the given domain
    '''
    timeout = 10
    max_redirects = 2

    def parse(self, url: str, links: Set[str], html: bytes) -> Tuple[str, str]:
        '''
        Returns the url and the title of the url
        '''
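        soup = BeautifulSoup(html, 'html.parser')
        title_tag = soup.find('title')
        # Pages without a <title> tag would otherwise raise AttributeError
        title = title_tag.get_text(strip=True) if title_tag is not None else ''
        return url, title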


crawler = TitleScraper('https://www.google.com', 3)
titles = asyncio.run(crawler.get_results())
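Because parse receives plain arguments, the extraction logic can live in a standalone function and be checked against canned HTML without crawling anything. The sketch below does this with the standard library's html.parser instead of BeautifulSoup. TitleParser and extract_title are hypothetical names, and the sketch assumes the page is UTF-8 encoded.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    '''Collects the text inside the first <title> element.'''

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True  # Ignore any later <title> elements

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html: bytes) -> str:
    '''Returns the stripped page title, or '' when the page has none.'''
    parser = TitleParser()
    # Assumes UTF-8; undecodable bytes are replaced rather than raising
    parser.feed(html.decode('utf-8', errors='replace'))
    parser.close()
    return ''.join(parser.parts).strip()
```

A parse method in a subclass could then return url, extract_title(html), which keeps the crawler class thin and the extraction easy to test offline.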

Contributing

Installing dependencies

pipenv install --dev

Running tests

pytest --cov=aiocrawler

Contributors

tapanpandita
