I am Yoann, a French web developer and music composer living in Bangkok, Thailand.
ynnsnr / google-parser
Extract large amounts of data from the Google search results page (RoR)
Remove all unused generated files, e.g. helpers
The current implementation of GoogleScraper is quite raw and could be improved.
Something like this would have been more satisfying:
require 'selenium-webdriver'

class GoogleScraper
  def initialize
    @driver = selenium_driver
  end

  # Runs a search for the keyword and returns the raw page plus the extracted counts.
  def scrape(keyword)
    search(keyword)
    {
      raw_html: driver.page_source,
      links_count: scraped_links_count,
      adwords_advertiser_count: scraped_ads_count,
      results_count: scraped_results_count
    }
  end

  # Closes the browser session.
  def quit
    @driver.quit
  end

  private

  attr_reader :driver

  # Note: newer selenium-webdriver releases deprecate :desired_capabilities
  # in favor of :capabilities.
  def selenium_driver
    Selenium::WebDriver.for :chrome,
                            options: driver_options,
                            desired_capabilities: driver_capabilities
  end

  # Headless Chrome with a fixed window size.
  def driver_options
    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--disable-popup-blocking')
    options.add_argument('--disable-translate')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-extensions')
    options.add_argument('--window-size=800,600')
    options.add_argument('--headless')
    options
  end

  def driver_capabilities
    Selenium::WebDriver::Remote::Capabilities.chrome
  end

  # Types the keyword into the search box and submits it.
  def search(keyword)
    driver.navigate.to('https://www.google.com/?hl=en')
    driver.find_element(name: 'q').send_keys(keyword)
    driver.find_element(name: 'q').send_keys(:return)
  end

  def scraped_links_count
    driver.find_elements(tag_name: 'a').count
  end

  # Extracts the integer from text such as "About 1,230,000 results".
  def scraped_results_count
    driver.find_element(id: 'resultStats')
          .text
          .match(/(?<count>[\d,]+) results/)[:count]
          .delete(',')
          .to_i
  end

  # Counts text ads plus product listing ads.
  def scraped_ads_count
    driver.find_elements(class_name: 'ads-ad').count +
      driver.find_elements(class_name: 'pla-unit').count
  end
end
The tests do not cover the basic functionality, which is to extract data from an HTML page.
It does not need to be complicated: a stubbed response for Selenium::WebDriver::Remote would work (i.e. a stored HTML response in spec/fixtures).
Overall, the test coverage is far too low.
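As a rough illustration of testing the extraction logic without a real browser, here is a minimal sketch that swaps in a hand-rolled fake driver rather than a stubbed Selenium response. FakeElement, FakeDriver and ResultCounter are hypothetical names introduced only for this example; they are not part of selenium-webdriver or of the repository.

```ruby
# Hypothetical stand-ins for Selenium objects (not part of selenium-webdriver).
FakeElement = Struct.new(:text)

class FakeDriver
  # elements maps a locator type to { locator value => [elements] }.
  def initialize(elements)
    @elements = elements
  end

  # Mimics driver.find_elements(id: '...') / (class_name: '...').
  def find_elements(locator)
    how, what = locator.first
    @elements.fetch(how, {}).fetch(what, [])
  end

  def find_element(locator)
    find_elements(locator).first || raise(KeyError, "no element for #{locator}")
  end
end

# Hypothetical slice of the scraper that only depends on the driver interface.
class ResultCounter
  def initialize(driver)
    @driver = driver
  end

  def results_count
    text = @driver.find_element({ id: 'resultStats' }).text
    match = text.match(/(?<count>[\d,]+) results/)
    match ? match[:count].delete(',').to_i : 0
  end

  def ads_count
    @driver.find_elements({ class_name: 'ads-ad' }).count +
      @driver.find_elements({ class_name: 'pla-unit' }).count
  end
end

driver = FakeDriver.new({
  id: { 'resultStats' => [FakeElement.new('About 1,230,000 results (0.45 seconds)')] },
  class_name: { 'ads-ad' => [FakeElement.new('ad 1'), FakeElement.new('ad 2')] }
})
counter = ResultCounter.new(driver)
puts counter.results_count
puts counter.ads_count
```

Injecting the driver through the constructor is what makes this possible; the same idea should carry over to a spec that loads a stored page from spec/fixtures.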
Nokogiri is usually the preferred choice here, as it has many built-in methods for accessing HTML nodes.
This is not a criticism; I just want to understand why you made this choice.
Prefer keeping ActiveJob::Base.queue_adapter = :test in the support subdirectory, as it would then apply to all tests.
# spec/support/activejob.rb
ActiveJob::Base.queue_adapter = :test
analysis_spec.rb currently has two describe blocks; prefer using a single block instead:
RSpec.describe Address do
  it '' do
  end

  describe '#keywords' do
  end
end
Refer to this resource for more info.