Scrapy Scrapers Scrape

A repo of Web Scrapers written with Scrapy

Scrapy Quickstart

Install / Upgrade

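If you haven't yet, use Conda to create a new Python environment with Scrapy installed in it.

Activate that environment (I called mine scrapy). On Conda 4.4 and later, conda activate scrapy does the same thing as the older form shown here: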

source activate scrapy

As of writing this README, the latest Scrapy version is 1.3.

See what version of scrapy you have:

scrapy version

Update as necessary:

pip install scrapy --upgrade
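If you would rather check the version from inside Python (say, at the top of a script), something like the sketch below should work. It assumes Python 3.8 or newer, where importlib.metadata is in the standard library; installed_version is a hypothetical helper name, not part of Scrapy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        # The package is not installed in the active environment.
        return None


if __name__ == "__main__":
    found = installed_version("scrapy")
    print(found if found else "scrapy is not installed in this environment")
```

If this prints the "not installed" message, the wrong environment is probably active.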

New Scraper

Use scrapy startproject to create a new sub-directory in this repo to store a new scraper.

For example, here we start a poem scraper in the poems sub-directory:

scrapy startproject poems

New Spider

Usually I have just one spider for each scraper - but perhaps I'm doing something wrong.

We can create a new spider using genspider, which takes the spider's name and the domain it will crawl. Scrapy refuses to create a spider with the same name as the project, so I append _spider to it:

cd poems
scrapy genspider poems_spider poetryfoundation.org

Scrapy Shell

You're gonna want the shell to test out XPath and CSS selections:

scrapy shell 'https://www.poetryfoundation.org/poems-and-poets/poems#page=1&sort_by=recently_added'

In the shell, just like in your spider's parsing functions, you have access to response.

Scrapy Selections

Use .css or .xpath to pull out selections of the HTML.

Examples:

Pull out the href attributes of a bunch of links using css:

response.css('#content .title a::attr(href)')

Use the .extract() method of a selection to get the matched values out as a list of strings; .extract_first() returns just the first match.

Contributors

vlandham
