
Scrapy Middleware for Crawlera Simple Fetch API

This package provides a Scrapy Downloader Middleware to transparently interact with the Crawlera Fetch API.

Requirements

  • Python 3.5+
  • Scrapy 1.6+

Installation

Not yet available on PyPI. However, it can be installed directly from GitHub:

pip install git+ssh://git@github.com/scrapy-plugins/scrapy-crawlera-fetch.git

or

pip install git+https://github.com/scrapy-plugins/scrapy-crawlera-fetch.git

Configuration

Enable the CrawleraFetchMiddleware via the DOWNLOADER_MIDDLEWARES setting:

DOWNLOADER_MIDDLEWARES = {
    "crawlera_fetch.CrawleraFetchMiddleware": 585,
}

Please note that the middleware needs to be placed before the built-in HttpCompressionMiddleware middleware (which has a priority of 590), otherwise incoming responses will be compressed and the Crawlera middleware won't be able to handle them.

Settings

  • CRAWLERA_FETCH_ENABLED (type bool, default False)

    Whether or not the middleware will be enabled, i.e. requests should be downloaded using the Crawlera Fetch API. The crawlera_fetch_enabled spider attribute takes precedence over this setting.

  • CRAWLERA_FETCH_APIKEY (type str)

    API key to be used to authenticate against the Crawlera endpoint (mandatory if enabled)

  • CRAWLERA_FETCH_URL (type str, default "http://fetch.crawlera.com:8010/fetch/v2/")

    The endpoint of a specific Crawlera instance

  • CRAWLERA_FETCH_RAISE_ON_ERROR (type bool, default True)

    Whether or not the middleware will raise an exception if an error occurs while downloading or decoding a response. If False, a warning will be logged and the raw upstream response will be returned upon encountering an error.

  • CRAWLERA_FETCH_DOWNLOAD_SLOT_POLICY (type enum.Enum - crawlera_fetch.DownloadSlotPolicy, default DownloadSlotPolicy.Domain)

    Possible values are DownloadSlotPolicy.Domain, DownloadSlotPolicy.Single and DownloadSlotPolicy.Default (the Scrapy default). If set to DownloadSlotPolicy.Domain, please consider setting SCHEDULER_PRIORITY_QUEUE="scrapy.pqueues.DownloaderAwarePriorityQueue" to make better use of concurrency options and to avoid delays.

  • CRAWLERA_FETCH_DEFAULT_ARGS (type dict, default {})

    Default values to be sent to the Crawlera Fetch API. For instance, set to {"device": "mobile"} to render all requests with a mobile profile.
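Putting the settings above together, a minimal settings.py enabling the middleware might look like the following sketch (the API key value is a placeholder):

```python
# settings.py -- a minimal sketch combining the settings described above

DOWNLOADER_MIDDLEWARES = {
    "crawlera_fetch.CrawleraFetchMiddleware": 585,
}

CRAWLERA_FETCH_ENABLED = True
CRAWLERA_FETCH_APIKEY = "<your-api-key>"  # placeholder, replace with your key
# Render every request with a mobile profile unless overridden per-request
CRAWLERA_FETCH_DEFAULT_ARGS = {"device": "mobile"}
```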

Spider attributes

  • crawlera_fetch_enabled (type bool, default False)

    Whether or not the middleware will be enabled. Takes precedence over the CRAWLERA_FETCH_ENABLED setting.
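For example, a spider can opt in regardless of the project-wide setting. A minimal sketch (the spider name and URL are illustrative):

```python
import scrapy


class MobileSpider(scrapy.Spider):
    name = "mobile"
    # Spider attribute takes precedence over the CRAWLERA_FETCH_ENABLED setting
    crawlera_fetch_enabled = True

    start_urls = ["https://example.org"]

    def parse(self, response):
        yield {"url": response.url}
```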

Log formatter

Since the URL of outgoing requests is modified by the middleware, by default the logs will show the URL of the Crawlera endpoint. To log the original URL instead, enable the provided log formatter by overriding the LOG_FORMATTER setting:

LOG_FORMATTER = "crawlera_fetch.CrawleraFetchLogFormatter"

Note that the ability to override the error messages for spider and download errors was added in Scrapy 2.0. When using a previous version, the middleware will add the original request URL to the Request.flags attribute, which is shown in the logs by default.

Usage

If the middleware is enabled, by default all requests will be redirected to the specified Crawlera Fetch endpoint, and modified to comply with the format expected by the Crawlera Fetch API. The three basic processed arguments are method, url and body. For instance, the following request:

Request(url="https://httpbin.org/post", method="POST", body="foo=bar")

will be converted to:

Request(url="<Crawlera Fetch API endpoint>", method="POST",
        body='{"url": "https://httpbin.org/post", "method": "POST", "body": "foo=bar"}',
        headers={"Authorization": "Basic <derived from APIKEY>",
                 "Content-Type": "application/json",
                 "Accept": "application/json"})

Additional arguments

Additional arguments can be specified under the crawlera_fetch.args Request.meta key. For instance:

Request(
    url="https://example.org",
    meta={"crawlera_fetch": {"args": {"region": "us", "device": "mobile"}}},
)

is translated into the following body:

'{"url": "https://example.org", "method": "GET", "body": "", "region": "us", "device": "mobile"}'

Arguments set for a specific request through the crawlera_fetch.args key override those set with the CRAWLERA_FETCH_DEFAULT_ARGS setting.
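This precedence can be illustrated with a plain-dict sketch, assuming the merge behaves like a simple dict update (per-request values winning on key collisions):

```python
# Assumed merge semantics: per-request args override the project-wide defaults
default_args = {"device": "mobile", "region": "us"}  # CRAWLERA_FETCH_DEFAULT_ARGS
request_args = {"device": "desktop"}                 # crawlera_fetch.args in Request.meta

effective_args = {**default_args, **request_args}
# effective_args == {"device": "desktop", "region": "us"}
```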

Accessing original request and raw Crawlera response

The url, method, headers and body attributes of the original request are available under the crawlera_fetch.original_request Response.meta key.

The status, headers and body attributes of the upstream Crawlera response are available under the crawlera_fetch.upstream_response Response.meta key.
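For illustration, assuming the same nested meta layout used for the args key above, these values could be read like so (the exact field names are assumptions based on this section, simulated here with a plain dict):

```python
# Simulated Response.meta contents after the middleware has run (assumed layout)
meta = {
    "crawlera_fetch": {
        "original_request": {
            "url": "https://httpbin.org/post",
            "method": "POST",
            "headers": {},
            "body": "foo=bar",
        },
        "upstream_response": {"status": 200, "headers": {}, "body": "{}"},
    }
}

original = meta["crawlera_fetch"]["original_request"]
upstream = meta["crawlera_fetch"]["upstream_response"]
print(original["url"], upstream["status"])
```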

Skipping requests

You can instruct the middleware to skip a specific request by setting the crawlera_fetch.skip Request.meta key:

Request(
    url="https://example.org",
    meta={"crawlera_fetch": {"skip": True}},
)
