It comprises two main modules:
- a REST API server that shows the web page information stored in the database;
- a CLI crawler used to crawl web pages and populate the database.
Build the image.
$ docker build -t crawler .
Start the REST API server.
You need to adjust the MongoDB connection string according to your server.
$ docker run --rm -d --name crawler -p 3000:3000 \
-e MONGODB_URL="mongodb+srv://<user>:<password>@<host>/<db>?retryWrites=true&w=majority" \
crawler
Query the REST API server.
The examples below use jq to format the JSON responses in a human-readable way.
$ curl -s 'http://localhost:3000/pages?offset=0&limit=5' | jq .
$ curl -s 'http://localhost:3000/pages/1bab3f16e219f6242b86db0c18e33cfd' | jq .
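If you prefer to query the API from Node.js instead of curl, the pagination logic can be sketched as below. Only the `/pages` endpoint and its `offset`/`limit` parameters come from the examples above; the helper names are illustrative.

```javascript
// Build the paginated URL for the /pages endpoint, mirroring the
// offset/limit query parameters used in the curl examples above.
function pagesUrl(baseUrl, offset, limit) {
  const url = new URL('/pages', baseUrl);
  url.searchParams.set('offset', String(offset));
  url.searchParams.set('limit', String(limit));
  return url.toString();
}

// Yield successive offsets until `total` items are covered,
// e.g. for walking through all stored pages one batch at a time.
function* pageOffsets(total, limit) {
  for (let offset = 0; offset < total; offset += limit) {
    yield offset;
  }
}
```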
Start crawling a web page. The example below starts a crawler that visits up to 100 pages, beginning at https://www.crawler-test.com/.
You need to adjust the MongoDB connection string according to your server.
$ docker run --rm -it \
-e MONGODB_URL="mongodb+srv://<user>:<password>@<host>/<db>?retryWrites=true&w=majority" \
crawler bin/crawl --maxVisits 100 https://www.crawler-test.com/
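Conceptually, a crawler like this is a breadth-first loop over a frontier of URLs with a visited set, stopping once `maxVisits` pages have been fetched. A minimal sketch of that idea (the real crawler's implementation may differ; `fetchLinks` is an injected stand-in for fetching a page, parsing its links, and storing it in MongoDB):

```javascript
// Visit pages breadth-first from startUrl until maxVisits pages
// have been fetched. fetchLinks(url) must return the page's
// outgoing links; here it is injected so the loop stays testable.
async function crawl(startUrl, maxVisits, fetchLinks) {
  const visited = new Set();
  const queue = [startUrl];
  while (queue.length > 0 && visited.size < maxVisits) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);
    for (const link of await fetchLinks(url)) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return [...visited];
}
```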
You need Node.js version 12 or later (it will probably also work on other versions with proper async/await support).
Create a .env file containing the MONGODB_URL variable.
You need to adjust the MongoDB connection string according to your server.
$ echo MONGODB_URL="mongodb+srv://<user>:<password>@<host>/<db>?retryWrites=true&w=majority" > .env
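The application presumably loads this file through a dotenv-style mechanism and then reads the variable from the environment. A minimal sketch of validating it at startup (the function name and error message are illustrative, not taken from the project's code):

```javascript
// Read MONGODB_URL from the environment and fail fast if it is
// missing. In this project the value comes from the .env file
// created above.
function getMongoDbUrl(env = process.env) {
  const url = env.MONGODB_URL;
  if (!url) {
    throw new Error('MONGODB_URL is not set; create a .env file first');
  }
  return url;
}
```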
Install dependencies.
$ npm install
Run the unit tests continuously while you code.
$ npm run test:watch
Debug the REST API server.
$ npm start
Debug the crawler CLI.
$ npm run cli:debug -- --maxVisits 1 https://www.crawler-test.com/
You may want to run the debug commands above in a terminal within Visual Studio Code, which auto-attaches its debugger.
Visual Studio Code is automatically configured to reformat code on save. You will need its Prettier, ESLint and EditorConfig extensions.
There is also a Husky git hook that runs ESLint and Prettier before any commit.