Harvestman

Harvestman is a very simple, lightweight web crawler for Quick'n'Dirty™ web scraping.
It's quite useful for scraping search result pages:

require 'harvestman'

Harvestman.crawl 'http://www.foo.com/bars?page=*', (1..5) do
  price = css 'div.item-price a'
  ...
end

[!] Warning: this gem is in alpha stage and has no tests. Don't use it for anything serious.

Installation

Via command line:

$ gem install harvestman

Basic usage

Harvestman is fairly simple to use: you specify the URL to crawl and pass in a block. Inside the block you can call the css (or xpath) method to search the HTML document and get the inner text inside each node. See Nokogiri for more information.

Perhaps this is best understood with an example:
Harvestman.crawl "http://www.24pullrequests.com" do
  headline = xpath "//h3"
  catchy_phrase = css "div.visible-phone h3"

  puts "Headline: #{headline}"
  puts "Catchy phrase: #{catchy_phrase}"
end

One node to rule them all

Harvestman assumes there's only one node at the path you passed to css (or xpath). If there is more than one node at that path, you can pass in an additional block, which is run once for each matching node.

Another example:
Harvestman.crawl 'http://en.wikipedia.org/wiki/Main_Page' do
  # Print today's featured article
  tfa = css "div#mp-tfa"

  puts "Today's featured article: #{tfa}"

  # Print all the sister projects
  sister_projects = []

  css "div#mp-sister b" do
    sister_projects << css("a")
  end

  puts "Sister projects:"
  sister_projects.each { |sp| puts "- #{sp}" }
end

Note that inside the block we use css("a") and not css("div#mp-sister b a"). Calls to css or xpath here assume div#mp-sister b is the parent node.
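
The scoping rule described above can be illustrated without Harvestman. The sketch below is only an analogy: it uses REXML from Ruby's standard distribution instead of Nokogiri (a swap made to keep the example dependency-free), SAMPLE_HTML is a made-up fragment rather than Wikipedia's real markup, and sister_project_names is a hypothetical helper. The inner query is evaluated relative to the node yielded by the outer query, not relative to the whole document.

```ruby
require 'rexml/document'

# Hypothetical, well-formed fragment standing in for a real page.
SAMPLE_HTML = <<~HTML
  <div id="mp-sister">
    <b><a href="/commons">Commons</a></b>
    <b><a href="/wikibooks">Wikibooks</a></b>
  </div>
HTML

# Hypothetical helper (not part of Harvestman): collects the link text
# found inside every <b> under div#mp-sister.
def sister_project_names(html)
  doc = REXML::Document.new(html)
  names = []

  # Outer query: every <b> directly under div#mp-sister.
  REXML::XPath.each(doc, "//div[@id='mp-sister']/b") do |b|
    # Inner query: "a" is resolved relative to the current <b>.
    names << REXML::XPath.first(b, 'a').text
  end

  names
end

sister_project_names(SAMPLE_HTML).each { |name| puts "- #{name}" }
```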

Pages / Search results

If you want to crawl a group of similar pages (e.g. search results, as shown above), you can insert a * somewhere in the URL string. It will be replaced in turn by each element of the second argument, and one page is crawled per element.

Final example:
require 'harvestman'

Harvestman.crawl 'http://www.etsy.com/browse/vintage-category/electronics/*', (1..3) do
  css "div.listing-hover" do
    title = css "div.title a"
    price = css "span.listing-price"

    puts "* #{title} (#{price})"
  end
end

The above code crawls Etsy's electronics category pages (from 1 to 3) and outputs every item's title and price. Here we're using a range (1..3), but you could also pass an array of search queries:

"http://www.site.com?query=*", ["dogs", "cats", "birds"]
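
To make the substitution concrete, here is a minimal sketch of how a URL template could be expanded into one URL per element. expand_urls is a hypothetical helper written for illustration. It is not part of Harvestman's API, and the gem's internal implementation may differ.

```ruby
# Hypothetical helper, not part of Harvestman: builds one URL per value by
# substituting the value for the first "*" in the template.
def expand_urls(template, values)
  values.map { |value| template.sub('*', value.to_s) }
end

# Works with a range of page numbers...
puts expand_urls('http://www.foo.com/bars?page=*', 1..3)

# ...or with an array of search queries.
puts expand_urls('http://www.site.com?query=*', %w[dogs cats birds])
```

Values containing spaces or reserved characters would need escaping before being placed in a URL. This sketch leaves them untouched.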

Performance

When using the * feature described above, each page is crawled in a separate thread. You can disable multithreading by passing an additional argument, :plain, to the crawl method, like this:

require 'harvestman'

Harvestman.crawl 'http://www.store.com/products?page=*', (1..99), :plain do
  ...
end

With :plain, pages are crawled one after another, so crawling many pages will take considerably longer.
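
The difference between the two modes can be sketched with plain Ruby threads. This is an illustration only: fake_fetch, crawl_threaded, crawl_plain and elapsed are hypothetical names, the sleep call stands in for network latency, and Harvestman's real threading code is not shown here.

```ruby
# Stand-in for fetching and parsing one page; sleeps to simulate latency.
def fake_fetch(page)
  sleep 0.05
  "page #{page}"
end

# One thread per page, analogous to the default mode.
def crawl_threaded(pages)
  threads = pages.map { |page| Thread.new { fake_fetch(page) } }
  threads.map(&:value) # waits for each thread and collects its result
end

# One page after another, analogous to passing :plain.
def crawl_plain(pages)
  pages.map { |page| fake_fetch(page) }
end

# Returns the wall-clock seconds taken by the given block.
def elapsed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

pages = 1..10
threaded = elapsed { crawl_threaded(pages) }
plain = elapsed { crawl_plain(pages) }
puts format('threaded: %.2fs, plain: %.2fs', threaded, plain)
```

Because the simulated work is all waiting, the threaded run finishes in roughly the time of a single fetch, while the plain run takes about the sum of all fetches.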

License

See LICENSE.txt

Contributors

mion
