WebCrawler

Contents

  • Introduction
  • Running
  • Testing
  • Notes

Introduction

This webcrawler is a simple crawler that returns a sitemap of a site's internal links. While not feature-complete, it should work with most sites.

Running

Before Running the Crawler

Before running this crawler, install the dependencies by navigating to this folder and running pip install -r requirements.txt.

Running the Webcrawler

Run the crawler by executing the Python file crawler.py, which takes a URL with -u and an optional depth with -d. For example, run the following command in this directory: ./crawler.py -u 'https://en.wikipedia.org' -d 5
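
For reference, the command-line parsing inside crawler.py plausibly looks something like the following minimal sketch built on argparse; the flag names come from the usage above, while the function and variable names are assumptions:

    #!/usr/bin/env python3
    import argparse

    def parse_args():
        # -u is required; -d is optional and defaults to "no depth limit"
        parser = argparse.ArgumentParser(
            description="Crawl a site and print a sitemap of its internal links.")
        parser.add_argument("-u", "--url", required=True,
                            help="root URL to start crawling from")
        parser.add_argument("-d", "--depth", type=int, default=None,
                            help="optional maximum crawl depth")
        return parser.parse_args()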

When the crawl has finished, the sitemap is printed in the terminal. However, many sites produce a huge sitemap that won't all be visible in the terminal, so the sitemap is also written to the output.json file in this directory. This file is overwritten every time you generate a new sitemap.
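
As a rough sketch, writing the sitemap out probably amounts to little more than a json.dump call; the function and variable names here are assumptions:

    import json

    def write_sitemap(sitemap):
        # Overwrites output.json on every run, as noted above
        with open("output.json", "w") as f:
            json.dump(sitemap, f, indent=2)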

Note that you can also run ./crawler.py -h from the terminal to see usage information for the crawler.

Using Depth

Note that if a site is very big and you don't want to crawl it in its entirety (it might take quite a while to crawl the whole of Wikipedia, for example...), there is an optional depth argument, passed to the program with the -d flag. This sets a limit on how many levels of links you want to crawl.
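
To illustrate what the depth limit means, here is a minimal sketch of a depth-limited breadth-first crawl. This is not the project's actual implementation, and get_links is a hypothetical helper that fetches a page and returns its internal links:

    from collections import deque

    def crawl(root_url, max_depth=None):
        # Breadth-first crawl: every link found on a level-n page sits at level n + 1
        sitemap = {}
        seen = {root_url}
        queue = deque([(root_url, 0)])
        while queue:
            url, depth = queue.popleft()
            links = get_links(url)  # hypothetical: fetch the page, return internal links
            sitemap[url] = links
            if max_depth is not None and depth >= max_depth:
                continue  # depth limit reached: record these links but don't follow them
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        return sitemap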

Testing

There are unit tests for this crawler, which can be executed with: python crawlerTest.py
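
For flavour, a test in crawlerTest.py might look something like this minimal unittest sketch; get_links is a hypothetical function, not necessarily the project's actual API:

    import unittest
    from crawler import get_links  # hypothetical import

    class CrawlerTest(unittest.TestCase):
        def test_bad_request_returns_no_links(self):
            # Per the Notes section: an unsuccessful request yields an empty list
            self.assertEqual(get_links("https://does-not-resolve.invalid"), [])

    if __name__ == "__main__":
        unittest.main()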

Notes

This is by no means a finished crawler, and there are a few things to note:

  1. This webcrawler makes a request to each URL to get the HTML to parse. At the moment, if the request is unsuccessful, it will simply return an empty array of links for that site. If I were to have more time, I would implement a retry system and, at the very least, better error handling in the event of a bad request (see the sketch after this list).
  2. The webcrawler is quite brittle: if you pass in an initial site without http(s), it will not work. I would like to provide better URL handling.
  3. For a semi-large site, this webcrawler will take a long time. Had I more time, I would have implemented concurrency to bring this time down.
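
On point 1, a simple retry wrapper is one way such a system could look. This is a sketch using the requests library, with hypothetical names and parameters rather than code from this project:

    import time
    import requests

    def fetch_with_retries(url, retries=3, backoff=1.0):
        # Retry transient failures with exponential backoff before giving up
        for attempt in range(retries):
            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
                return response.text
            except requests.RequestException:
                if attempt == retries - 1:
                    raise
                time.sleep(backoff * 2 ** attempt)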
