rapider's Introduction

Overview

Rapider is a web scraping micro-framework built on threads, written in Racket with async-channel, just for fun.
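
As a rough illustration of the thread plus async-channel style mentioned above, here is a minimal sketch of worker threads pulling URLs from one channel and pushing results to another. This is not rapider's actual implementation; `fake-fetch`, `jobs`, `results`, and the example URLs are hypothetical names chosen for the sketch.

```racket
#lang racket
(require racket/async-channel)

;; Illustration only: not rapider's actual implementation.
(define jobs (make-async-channel))     ; URLs waiting to be processed
(define results (make-async-channel))  ; processed results

;; Hypothetical stand-in for a real HTTP request.
(define (fake-fetch url)
  (string-append "fetched " url))

;; Each worker loops forever: take a URL, process it, report the result.
(define (start-worker)
  (thread
   (λ ()
     (let loop ()
       (define url (async-channel-get jobs))
       (async-channel-put results (fake-fetch url))
       (loop)))))

(define workers (for/list ([i (in-range 2)]) (start-worker)))

(define urls '("http://example.com/a" "http://example.com/b" "http://example.com/c"))
(for ([u urls]) (async-channel-put jobs u))

;; Collect one result per URL; arrival order depends on thread scheduling.
(define collected
  (for/list ([u urls]) (async-channel-get results)))

(for-each kill-thread workers)
(displayln (sort collected string<?))
```

Because async channels are buffered, the producer never blocks when queueing URLs, and the workers simply wait until something arrives.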

Installation

raco pkg install rapider

Usage

Item & Field

Item and Field extract data from an xexp document using XPath. I use the SXML library.

#lang racket
;;; items.rkt

(require rapider)

(provide
  quote-item
  about-item)

(define quote-item
  (item
    (item-field #:name "title" #:xpath "//*[@class='text']/text()" #:filter (λ (x) (car x)))
    (item-field #:name "author" #:xpath "//*[@class='author']/text()" #:filter (λ (x) (car x)))
    (item-field
      #:name "about-url" 
      #:xpath "//span[2]/a/@href/text()"
      #:filter (λ (x) (string-append "http://quotes.toscrape.com" (car x))))
    (item-field #:name "tags" #:xpath "//*[@class='tag']/text()")))

(define about-item 
  (item
    (item-field #:name "author" #:xpath "//*[@class='author-title']/text()" #:filter (λ (x) (car x)))
    (item-field #:name "born-date" #:xpath "//*[@class='author-born-date']/text()" #:filter (λ (x) (car x)))
    (item-field #:name "born-location" #:xpath "//*[@class='author-born-location']/text()" #:filter (λ (x) (car x)))
    (item-field #:name "description" #:xpath "//*[@class='author-description']/text()")))
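
The `#:filter` procedures above all take the first match with `car`, which suggests that a field's XPath query yields a list of matches (an assumption drawn from the examples, not from rapider's documentation). Under that assumption, `car` raises an error when nothing matches. A minimal sketch of a more defensive filter, using a hypothetical helper `first-or` that is not part of rapider:

```racket
#lang racket
;; Hypothetical helper, not part of rapider: builds a filter that returns
;; the first match, or `default` when the list of matches is empty.
(define (first-or default)
  (λ (matches)
    (if (pair? matches)
        (car matches)
        default)))

;; Stand-in data for what an XPath query might return.
((first-or #f) '("Albert Einstein" "ignored"))  ; takes the first match
((first-or #f) '())                             ; no match, falls back to the default
```

Such a filter could then be passed as `#:filter (first-or #f)` in place of `(λ (x) (car x))`.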

Spider

A spider controls the requests: which URLs are fetched and which method handles each response.

#lang racket
;;; crawler.rkt

(require 
  rapider
  "items.rkt")

(define quote-spider%
  (class spider%

    (init-field
      (pages 10)
      (header '("User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36"))
      (base-url "http://quotes.toscrape.com")
      ;; range excludes its end, so add1 is needed to include the last page
      (start-urls (map (λ (x) (string-append base-url "/page/" (number->string x))) (range 1 (add1 pages)))))

    (define/public (start)
      (for ([url start-urls])
        (request this url 'quotes-list)))

    (define/public (quotes-list rsp)
      (for ([item (extract-data (html->xexp (response-content rsp)) "//*[@class='quote']")])
        (next this item 'quote-element)))

    (define/public (quote-element rsp)
      (define quote-items (quote-item rsp))
      ;; handle the quote data here
      (displayln quote-items)
      (request this (hash-ref quote-items "about-url") 'about))

    (define/public (about rsp)
      ;; handle the author data here
      (displayln (about-item (html->xexp (response-content rsp)))))

  (super-new)))

(run-spider quote-spider%)
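
The start URLs in the spider are built with `range`, which excludes its end argument: `(range 1 10)` yields 1 through 9. A small standalone sketch of the URL generation, with a hypothetical helper `page-urls` (not part of rapider) that includes the last page:

```racket
#lang racket
;; Hypothetical helper, not part of rapider: builds the URLs for
;; pages 1 through `pages`, inclusive.
(define (page-urls base-url pages)
  (for/list ([n (range 1 (add1 pages))])  ; add1 because range excludes its end
    (string-append base-url "/page/" (number->string n))))

;; Three URLs, ending in /page/1, /page/2 and /page/3.
(page-urls "http://quotes.toscrape.com" 3)
```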

Run racket crawler.rkt:

(info) (2019-10-22 02:50:08) (rapider: http://quotes.toscrape.com/author/Mother-Teresa) (crawled)
(info) (2019-10-22 02:50:08) (rapider: http://quotes.toscrape.com/author/J-K-Rowling) (crawled)
(info) (2019-10-22 02:50:08) (rapider: http://quotes.toscrape.com/author/Charles-M-Schulz) (crawled)
(info) (2019-10-22 02:50:08) (rapider: http://quotes.toscrape.com/author/William-Nicholson) (crawled)
(info) (2019-10-22 02:50:08) (rapider: http://quotes.toscrape.com/author/Albert-Einstein) (crawled)
(info) (2019-10-22 02:50:08) (rapider: http://quotes.toscrape.com/author/Jorge-Luis-Borges) (crawled)
(info) (2019-10-22 02:50:09) (rapider: http://quotes.toscrape.com/author/George-Eliot) (crawled)
(info) (2019-10-22 02:50:09) (rapider: http://quotes.toscrape.com/author/Jane-Austen) (crawled)
(info) (2019-10-22 02:50:09) (rapider: http://quotes.toscrape.com/author/Eleanor-Roosevelt) (crawled)
(info) (2019-10-22 02:50:09) (rapider: http://quotes.toscrape.com/author/Marilyn-Monroe) (crawled)
(info) (2019-10-22 02:50:09) (rapider: http://quotes.toscrape.com/author/Albert-Einstein) (crawled)
(info) (2019-10-22 02:50:09) (rapider: http://quotes.toscrape.com/author/Haruki-Murakami) (crawled)
(info) (2019-10-22 02:50:09) (rapider: http://quotes.toscrape.com/author/Alexandre-Dumas-fils) (crawled)

Examples

Real-world spider.

Demos.
