gscraper's Introduction

GScraper

  • Source
  • Issues
  • Email: postmodern.mod3 at gmail.com

Description

GScraper is a web-scraping interface to various Google Services.

Features

  • Supports the Google Search service.
    • Provides access to search results and ranks.
    • Provides access to the Sponsored Links.
  • Provides HTTP access with custom User-Agent strings.
  • Provides proxy settings for HTTP access.
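
To make the last two features concrete, here is a minimal sketch of what a custom User-Agent and a proxy look like at the HTTP level, using only Ruby's standard Net::HTTP. It does not use GScraper and is not a description of how GScraper implements these features. The helper name `build_proxied_get`, the proxy host `proxy.example.com` and port `8080` are assumptions made up for illustration. No request is actually sent.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of GScraper): builds an HTTP client that
# would go through the given proxy, plus a GET request carrying a custom
# User-Agent header. Nothing is sent over the network here.
def build_proxied_get(url, proxy_host, proxy_port, user_agent)
  uri = URI(url)

  # Net::HTTP.new takes the proxy address and port as its third and
  # fourth arguments; the connection is only opened on the first request.
  http = Net::HTTP.new(uri.host, uri.port, proxy_host, proxy_port)

  # The User-Agent is an ordinary request header.
  request = Net::HTTP::Get.new(uri)
  request['User-Agent'] = user_agent

  [http, request]
end

# Placeholder values, assumed for illustration only.
http, request = build_proxied_get(
  'http://www.google.com/search?q=ruby',
  'proxy.example.com', 8080,
  'Awesome Browser v1.2'
)

puts http.proxy_address
puts request['User-Agent']
# http.request(request) would send the query through the proxy.
```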

Requirements

Install

$ gem install gscraper

Examples

Basic query:

q = GScraper::Search.query(:query => 'ruby')

Advanced query:

q = GScraper::Search.query(:query => 'ruby') do |q|
  q.without_words = 'is'
  q.within_past_day = true
  q.numeric_range = 2..10
end

Queries from URLs:

q = GScraper::Search.query_from_url('http://www.google.com/search?as_q=ruby&as_epq=&as_oq=rails&as_ft=i&as_qdr=all&as_occt=body&as_rights=%28cc_publicdomain%7Ccc_attribute%7Ccc_sharealike%7Ccc_noncommercial%29.-%28cc_nonderived%29')

q.query # => "ruby"
q.with_words # => "rails"
q.occurs_within # => :body
q.rights # => :cc_by_nc
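
The URL above carries the whole query as ordinary form parameters. As a rough sketch of what is in it, the raw parameters can be decoded with Ruby's standard URI library. This is independent of GScraper: `search_params` is a hypothetical helper, and how GScraper translates each raw value into its own options is not shown here.

```ruby
require 'uri'

# Hypothetical helper (not part of GScraper): decodes the query string
# of a search URL into a Hash of raw parameter names and values.
def search_params(url)
  URI.decode_www_form(URI(url).query).to_h
end

url = 'http://www.google.com/search?as_q=ruby&as_epq=&as_oq=rails' \
      '&as_ft=i&as_qdr=all&as_occt=body' \
      '&as_rights=%28cc_publicdomain%7Ccc_attribute%7Ccc_sharealike' \
      '%7Ccc_noncommercial%29.-%28cc_nonderived%29'

params = search_params(url)

puts params['as_q']      # prints "ruby"
puts params['as_oq']     # prints "rails"
puts params['as_occt']   # prints "body"
puts params['as_rights'] # the percent-decoded license expression
```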

Getting the search results:

q.first_page.select do |result|
  result.title =~ /Blog/
end

q.page(2).map do |result|
  result.title.reverse
end

q.result_at(25) # => Result

q.top_result # => Result
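
Looking up a result by rank implies working out which page it sits on. The arithmetic can be sketched as follows, assuming ranks start at 1 and each page holds 10 results. Both are assumptions for illustration, not documented GScraper behaviour, and the helper names are hypothetical.

```ruby
# Assumed page size, for illustration only.
RESULTS_PER_PAGE = 10

# 1-based page number that would contain the given 1-based rank.
def page_for_rank(rank, per_page = RESULTS_PER_PAGE)
  ((rank - 1) / per_page) + 1
end

# 0-based position of the given rank within its page.
def offset_on_page(rank, per_page = RESULTS_PER_PAGE)
  (rank - 1) % per_page
end

puts page_for_rank(25)  # prints 3
puts offset_on_page(25) # prints 4
```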

A Result object contains the rank, title, summary, cached URL, similar query URL and link URL of the search result.

page = q.page(2)

page.urls # => [...]
page.summaries # => [...]
page.ranks_of { |result| result.url =~ /^https/ } # => [...]
page.titles_of { |result| result.summary =~ /password/ } # => [...]
page.cached_pages # => [...]
page.similar_queries # => [...]

Iterating over the search results:

q.each_on_page(2) do |result|
  puts result.title
end

page.each do |result|
  puts result.url
end

Iterating over the data within the search results:

page.each_title do |title|
  puts title
end

page.each_summary do |text|
  puts text
end

Selecting search results:

page.results_with do |result|
  ((result.rank > 2) && (result.rank < 10))
end

page.results_with_title(/Ruby/i) # => [...]

Selecting data within the search results:

page.titles # => [...]

page.summaries # => [...]

Selecting the data of search results based on the search result:

page.urls_of do |result|
  result.summary.length > 10
end
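
The selection helpers above follow a filter-then-project pattern. The sketch below reproduces that pattern with stand-in classes so it can be run without GScraper or network access. `SketchResult`, `SketchPage` and `sample_page` are hypothetical names with made-up data, and the method bodies are an assumption about the behaviour, not GScraper's actual implementation.

```ruby
# Stand-in for a search result (hypothetical, not GScraper's class).
SketchResult = Struct.new(:rank, :title, :url, :summary)

# Stand-in for a page of results, built on Enumerable.
class SketchPage
  include Enumerable

  def initialize(results)
    @results = results
  end

  def each(&block)
    @results.each(&block)
  end

  # Keep only the results for which the block returns true.
  def results_with(&block)
    select(&block)
  end

  # Filter the results first, then project each one to its URL.
  def urls_of(&block)
    results_with(&block).map(&:url)
  end

  # Project every result to its title.
  def titles
    map(&:title)
  end
end

# Made-up sample data for illustration.
def sample_page
  SketchPage.new([
    SketchResult.new(1, 'Ruby Blog',   'https://a.example/', 'short'),
    SketchResult.new(2, 'Rails Guide', 'http://b.example/',  'a much longer summary text'),
    SketchResult.new(3, 'Ruby Docs',   'https://c.example/', 'reference documentation')
  ])
end

page = sample_page
p page.urls_of { |result| result.summary.length > 10 }
p page.results_with { |result| result.rank > 2 && result.rank < 10 }.map(&:title)
```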

Selecting the Sponsored Links of a Query:

q.sponsored_links # => [...]

q.top_sponsored_link # => SponsoredAd

Setting the User-Agent globally:

GScraper.user_agent # => nil
GScraper.user_agent = 'Awesome Browser v1.2'

License

GScraper - A web-scraping interface to various Google Services.

Copyright (c) 2007-2012 Hal Brodigan (postmodern.mod3 at gmail.com)

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA

gscraper's People

Contributors

blue-dog-archolite, ezkl, postmodern

gscraper's Issues

Use of newer Mechanize throws warnings

You've referenced the WWW constant from xxxxx/gems/gscraper-0.2.4/lib/gscraper/gscraper.rb:156:in `web_agent', please switch the "WWW" to "Mechanize". Thanks!

Only fetches top result?

I'm using the version currently in the GitHub repo rather than the one from RubyGems, because the RubyGems version doesn't include :search_host.

https://gist.github.com/1365681

No matter how I try to query, it just returns the top result over and over.

Any guesses on what's going on?

Search result pages with video links are raising "NoMethodError: undefined method `inner_text' for nil:NilClass"

I can replicate this in irb by trying to resolve a page of a query with video results like "michael jackson":

irb(main):001:0> require 'gscraper'
=> true
irb(main):002:0> q = GScraper::Search.query(:query => "michael jackson")
irb(main):003:0> q.page(1)
=> NoMethodError: undefined method `inner_text' for nil:NilClass

It might be a good idea to treat li.g elements that also contain "videobox" in the class differently, or to ignore them by default, since it makes more sense for them not to increment the rank counter.
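
The suggestion can be sketched in plain Ruby. This is not a patch against GScraper: `SketchElement` and `ranked_titles` are hypothetical stand-ins for the parsed li.g elements and the ranking loop, and a nil title stands in for the missing node that raises the NoMethodError.

```ruby
# Stand-in for a parsed result element (hypothetical): its CSS class
# attribute and its title text, which is nil when the node is missing.
SketchElement = Struct.new(:css_class, :title)

# Returns [rank, title] pairs, skipping video boxes and elements without
# a title so that they do not increment the rank counter.
def ranked_titles(elements)
  rank = 0

  elements.each_with_object([]) do |element, ranked|
    next if element.css_class.split.include?('videobox')
    next if element.title.nil?

    rank += 1
    ranked << [rank, element.title]
  end
end

elements = [
  SketchElement.new('g', 'First result'),
  SketchElement.new('g videobox', nil),
  SketchElement.new('g', 'Second result')
]

p ranked_titles(elements)
# prints [[1, "First result"], [2, "Second result"]]
```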
