ynnsnr / google-parser

Extract large amounts of data from the Google search results page (RoR)

Dockerfile 2.39% Ruby 76.56% JavaScript 1.70% CSS 2.56% HTML 16.57% Shell 0.22%

google-parser's Introduction

Hi there 👋

I am Yoann, a French web developer and music composer living in Bangkok, Thailand.

🚀 Projects

Founded Jarvis, an AI-powered app that generates lyric ideas to help songwriters overcome writer's block.
Launched BeatHub, a web app to help music producers dig sample packs and beats more efficiently by providing a random and continuous listening experience.

google-parser's Issues

Code cleanup

Remove all unused generated files, e.g. helpers.

GoogleScraper class architecture

The current implementation of GoogleScraper is quite rough and could be improved:

Attribute accessors (e.g. attr_reader) are not used
Methods are very long, which makes them hard to read and maintain over time

Something like this would have been cleaner:

require 'selenium-webdriver'

class GoogleScraper
  def initialize
    @driver = selenium_driver
  end

  def scrape(keyword)
    search(keyword)

    {
      raw_html: driver.page_source,
      links_count: scraped_links_count,
      adwords_advertiser_count: scraped_ads_count,
      results_count: scraped_results_count
    }
  end

  def quit
    @driver.quit
  end

  private

  attr_reader :driver

  def selenium_driver
    Selenium::WebDriver.for :chrome,
                            options: driver_options,
                            desired_capabilities: driver_capabilities
  end

  def driver_options
    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--disable-popup-blocking')
    options.add_argument('--disable-translate')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-extensions')
    options.add_argument('--window-size=800,600')
    options.add_argument('--headless')
    options
  end

  def driver_capabilities
    Selenium::WebDriver::Remote::Capabilities.chrome
  end

  def search(keyword)
    driver.navigate.to('https://www.google.com/?hl=en')
    driver.find_element(name: 'q').send_keys(keyword)
    driver.find_element(name: 'q').send_keys(:return)
  end

  def scraped_links_count
    driver.find_elements(tag_name: 'a').count
  end

  def scraped_results_count
    driver.find_element(id: 'resultStats')
          .text
          .match(/(?<count>[\d,]+) results/)[:count]
          .delete(',')
          .to_i
  end

  def scraped_ads_count
    driver.find_elements(class_name: 'ads-ad').count + driver.find_elements(class_name: 'pla-unit').count
  end
end
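
One detail in the sketch above: `scraped_results_count` chains `.match(...)[:count]`, which raises `NoMethodError` when the stats text does not contain the expected pattern. Below is a minimal sketch of the same parsing step as a standalone method with a nil guard. The method name `parse_results_count` and the sample strings are hypothetical and not part of the repository.

```ruby
# Hypothetical helper: turns a stats string such as
# "About 1,230,000 results (0.52 seconds)" into an Integer.
# Returns nil when the text does not match the expected pattern.
def parse_results_count(stats_text)
  match = stats_text.match(/(?<count>[\d,]+) results/)
  return nil unless match

  # Strip the thousands separators before converting.
  match[:count].delete(',').to_i
end

parse_results_count('About 1,230,000 results (0.52 seconds)') # => 1230000
parse_results_count('No matches')                             # => nil
```

Keeping the parsing separate from the driver call also makes it possible to test this step without a browser.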

GoogleScraper tests are too simplistic

The tests do not cover the core functionality, which is extracting data from an HTML page.

This does not need to be complicated: stubbing the response of Selenium::WebDriver::Remote with a stored HTML page (e.g. a file in spec/fixtures) would work.

Overall, test coverage is far too low.
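
One possible shape for such a test is sketched below, under some loud assumptions: instead of stubbing Selenium itself, it uses a hand-written `FakeDriver` that answers from a stored HTML string, and a slimmed-down `LinkCounter` that receives its driver from the caller. Both class names are hypothetical, and the inline heredoc stands in for a file under spec/fixtures.

```ruby
# Hypothetical stand-in for a Selenium driver: it answers the calls used
# below from a stored HTML string instead of a live browser.
class FakeDriver
  def initialize(html)
    @html = html
  end

  def page_source
    @html
  end

  # Rough opening-tag matcher, good enough for a small fixture.
  # This is not a real HTML parser.
  def find_elements(tag_name:)
    @html.scan(/<#{Regexp.escape(tag_name)}[\s>]/i)
  end
end

# Hypothetical slimmed-down scraper that takes its driver as an argument,
# so a test can pass in the fake.
class LinkCounter
  def initialize(driver)
    @driver = driver
  end

  def links_count
    @driver.find_elements(tag_name: 'a').count
  end
end

# Stands in for a stored response such as a file in spec/fixtures.
FIXTURE_HTML = <<~HTML
  <html><body>
    <a href="https://example.com/1">One</a>
    <a href="https://example.com/2">Two</a>
    <p>No link here</p>
  </body></html>
HTML

counter = LinkCounter.new(FakeDriver.new(FIXTURE_HTML))
counter.links_count # => 2
```

Passing the driver in from outside is a design choice made for this sketch: it lets the counting logic run without a browser.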

Why use Selenium::WebDriver instead of Nokogiri?

Nokogiri is usually the preferred choice, as it has many built-in methods for accessing HTML nodes.

This is not a criticism; I just want to understand why you made this choice.

Keep all test configuration in `spec/support`

Prefer keeping ActiveJob::Base.queue_adapter = :test in the spec/support subdirectory, as it would then apply to all tests.

# spec/support/activejob.rb

ActiveJob::Base.queue_adapter = :test
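
Files under `spec/support` are not loaded automatically, so they need to be required somewhere, typically from `spec/rails_helper.rb`. A sketch of that line, assuming a standard rspec-rails setup (the generated `rails_helper.rb` ships with a similar line commented out):

```ruby
# spec/rails_helper.rb (assumed location)
# Require every Ruby file under spec/support so shared configuration,
# such as the ActiveJob queue adapter, applies to all specs.
Dir[Rails.root.join('spec', 'support', '**', '*.rb')].sort.each { |f| require f }
```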

Keep specs for models in the same describe block

analysis_spec.rb currently has two describe blocks; prefer using a single block:

RSpec.describe Address do
  it '' do
  end

  describe '#keywords' do
  end
end

Refer to this resource for more info.
