
Mwmbl Crawler Script

Usage:

python main.py [-j n] [-u url1 url2 ...]

where n is the number of threads to run in parallel. If you specify URLs with the -u option, only those URLs will be crawled instead of retrieving batches from the server.

Installing

Clone this repo, install poetry if necessary, cd into the repo and type

poetry install
poetry shell

then run the main.py command as documented above.


crawler-script's Issues

Unable to decode robots file - 'utf-8' codec can't decode byte 0xe9

(crawler-script-py3.9) mwmbl@debian-pc:~/crawler-script$ python main.py
INFO:__main__:Got batch with 100 items
INFO:__main__:Crawled batch in 169.679112 seconds
INFO:__main__:Sending batch
INFO:__main__:Response status: 200, b'{"status":"ok","public_user_id":"4a22e1caa9461233746ac43c02bee5a44a109e846c41d65f9f9db104063bc81c","url":"https://f004.backblazeb2.com/file/mwmbl-crawl/1/v1/2023-01-16/1/4a22e1caa9461233746ac43c02bee5a44a109e846c41d65f9f9db104063bc81c/76989__e32bd971.json.gz"}'
INFO:__main__:Got batch with 100 items
INFO:__main__:Crawled batch in 179.940824 seconds
INFO:__main__:Sending batch
INFO:__main__:Response status: 200, b'{"status":"ok","public_user_id":"4a22e1caa9461233746ac43c02bee5a44a109e846c41d65f9f9db104063bc81c","url":"https://f004.backblazeb2.com/file/mwmbl-crawl/1/v1/2023-01-16/1/4a22e1caa9461233746ac43c02bee5a44a109e846c41d65f9f9db104063bc81c/77171__105d5086.json.gz"}'
INFO:__main__:Got batch with 100 items
ERROR:__main__:Unable to decode robots file
Traceback (most recent call last):
  File "/srv/mwmbl/crawler-script/main.py", line 100, in robots_allowed
    parse_robots.parse(content.decode('utf-8').splitlines())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 517: invalid continuation byte
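One possible fix, sketched below under the assumption that `content` holds the raw robots.txt bytes: try UTF-8 first and fall back to Latin-1, which maps every byte and so never raises (0xe9 is 'é' in Latin-1). The function name `parse_robots_bytes` is hypothetical, not from the script itself.

```python
from urllib import robotparser


def parse_robots_bytes(content: bytes) -> robotparser.RobotFileParser:
    """Parse robots.txt bytes, tolerating non-UTF-8 encodings."""
    try:
        text = content.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 assigns a character to every byte value, so this
        # fallback cannot raise; mojibake is acceptable for robots rules.
        text = content.decode("latin-1")
    parser = robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser
```

Another option is `content.decode('utf-8', errors='replace')`, which keeps the UTF-8 assumption but substitutes undecodable bytes instead of raising.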

[Feature] Adjustable requests limit per website

I have run into major IP address blocking because of the number of requests.

Recently it happened with the fixed IP at my job, and we received a question from our ISP.

Most people are unlikely to hit this, but it may harm crawling even in dynamic IP address environments.

Maybe we should have an option to limit the number of requests per unit of time to a specific website / domain / IP address.
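A minimal sketch of the idea, assuming the crawler calls a limiter before each fetch; the class name and the default interval are hypothetical choices, not part of the script:

```python
import threading
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_interval: float = 5.0):
        self.min_interval = min_interval
        self._last_request: dict[str, float] = {}
        self._lock = threading.Lock()  # the crawler runs multiple threads

    def wait(self, url: str) -> None:
        """Block until at least min_interval has passed for this domain."""
        domain = urlparse(url).netloc
        with self._lock:
            last = self._last_request.get(domain, 0.0)
            delay = self.min_interval - (time.monotonic() - last)
        if delay > 0:
            time.sleep(delay)
        with self._lock:
            self._last_request[domain] = time.monotonic()
```

Requests to different domains proceed without delay, so overall throughput is preserved while any single site sees at most one request per interval.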

[Feature] Add support for sitemaps

Ideally, given a domain, the crawler would be capable of detecting its sitemap, or would accept a sitemap as a parameter.

I would suggest the following workflow:

  • A URL is passed -> check for a sitemap -> index everything on it
  • A URL is passed + a no-sitemap parameter is passed -> index just that page
  • A URL is passed + that URL is not a path itself (it points to HTML or a file instead) -> index just that

This would apply to both main URLs and secondary URLs (there are sub-sitemaps, especially if you are using routes behind a reverse proxy server).
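The detection part of the workflow above could be sketched like this: sitemaps are conventionally advertised via `Sitemap:` lines in robots.txt, and a sitemap file is either a `<urlset>` of pages or a `<sitemapindex>` of sub-sitemaps. The helper names here are illustrative, not existing crawler functions, and fetching is left to the caller:

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def find_sitemaps_in_robots(robots_text: str) -> list[str]:
    """Return sitemap URLs advertised via 'Sitemap:' lines in robots.txt."""
    sitemaps = []
    for line in robots_text.splitlines():
        if line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return <loc> entries from a sitemap document.

    Works for both <urlset> (page URLs) and <sitemapindex>
    (sub-sitemap URLs, which the caller would fetch recursively).
    """
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    ]
```

With these two pieces, the crawler could check robots.txt first, fall back to probing `/sitemap.xml`, and recurse into sitemap indexes to cover the sub-sitemap case.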
