Giter Club home page Giter Club logo

web-scraper's Introduction

Web scraping using WWW::Mechanize(Perl) and BeautifulSoup(Python)

Corpora are the indispensable parts of the research in different levels of natural language processing. Undoubtedly Web is the best resource that may help us constructing our corpus. This repository includes a simple scraper for scraping any desired content from a search engine (here, Google).

The idea is simple:

  • We do the query in Google of what we are looking for, we grab the result links (for how many pages that we want).

  • Then we scrap again the content of the pages that contain what we're looking for via the provided links in the previous step.

  • Just a little manipulation and cleaning would be the last step of having some text to append to our corpus.

Step 1: WWW::Mechanize

WWW::Mechanize is a Perl module that provides the tools to get access to the sites and to get their content. What makes Mechanize, or Mech in short, so useful is the features that it has, including all HTTP methods, High-level hyperlink and HTML form support, SSL support, automatic cookies, custom HTTP headers, proxies and etc.

The reason that I've not integrated this step with the step 2 is due to the size of the processing. I personally have created an enormous corpus using this technique, so I've prefered to save the output in each step, in order not to repeat the same tasks in the case of having an unpredicted error (that you would see later one).

Step 2: Scrape content of the links

The python script gets the directoy of the links gathered in the first step and saves the scraped content of each one.

web-scraper's People

Contributors

sinaahmadi avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.