This challenge consists of building an analytical platform on which data analysts can run queries and create basic data visualizations. The object of analysis is the book information from the BooksToScrape website. The website offers no API, nor any other method of easy data extraction, so a web scraping solution had to be built. There is also a need to compare how prices and stock change for the books over time, creating the need for a daily ETL job to be run.
Requirements: Docker and Docker Compose installed on your machine.
- Clone the repo and cd into the new directory:
git clone https://github.com/muriloxyz/cayena-challenge.git && cd cayena-challenge
- Start the containers with Docker Compose:
docker-compose up -d
- Restore the data into Postgres (a quick verification sketch follows this list):
docker exec cayena-challenge_pgsql_1 psql -h localhost -p 5432 -U cayena -d cayena -f restore_data.sql
- Access the data platform (SqlPad) at http://localhost:3000. The default user is [email protected], and its password is cayena. (Mindblown 🤯)
- Select your connection Postgres Database and start analysing your data inside the book_info table.
- Write your most beautiful queries! (an example query follows below)
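To sanity-check the setup before opening SqlPad, you can confirm that the containers are up and that the restore populated the book_info table. The container, user, and database names below are the same ones used in the restore step:
# List the running services
docker-compose ps
# Count the restored rows
docker exec cayena-challenge_pgsql_1 psql -U cayena -d cayena -c "SELECT COUNT(*) FROM book_info;"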
For more information on how to use the SqlPad analytical platform (including how to build visualizations), please refer to the SqlPad Docs.
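As a starting point, here is a sketch of the kind of query you might run: it lists books whose price changed between scrapes. The book_info column names used below (title, price) are assumptions, since the table's schema isn't documented here; adjust them to the actual columns. The same SQL can also be pasted straight into SqlPad's editor:
# Books whose price changed across scrapes (column names are assumptions)
docker exec cayena-challenge_pgsql_1 psql -U cayena -d cayena -c "SELECT title, MIN(price) AS min_price, MAX(price) AS max_price FROM book_info GROUP BY title HAVING MIN(price) <> MAX(price);"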
- If you need daily data updates, keep the Docker containers running! A cron job is scheduled to scrape the website every day at 3am UTC (00:00 Brazilian time) and store all the processed data into the book_info table, leaving it ready for analysis;
- It is also possible to trigger the job manually, but you'll need to enter the worker container to execute the Python job (a sketch follows below).
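A minimal sketch of a manual run, assuming the worker container follows the same naming pattern as the Postgres one and that the job's entry point is a script named etl.py (both are assumptions; check the repo for the real names):
# Open a shell inside the worker container (name pattern is an assumption)
docker exec -it cayena-challenge_worker_1 bash
# Inside the container, run the Python job by hand (script name is an assumption)
python etl.py
For reference, the daily 3am UTC schedule corresponds to the cron expression 0 3 * * *.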
Description of the 3 containers used for this challenge:
- Worker: Responsible for running the ETL job via the configured cron schedule. It uses 4 threads for faster web scraping (a shell sketch of this pattern follows the list). Built on a Python/Ubuntu base image.
- Pgsql: Docker image provided by the PostgreSQL team. It simply hosts the database in which all processed data is stored.
- Sqlpad: Docker image provided by the Sqlpad team. It provides a locally hosted web analytics platform that connects to the Pgsql container for its data.
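The worker itself is a Python job, but purely to illustrate the 4-way parallel fetch pattern mentioned above, here is a shell sketch. The catalogue URL scheme matches books.toscrape.com, while the page count and output naming are assumptions, not the actual worker code:
# Fetch catalogue pages with at most 4 concurrent downloads
seq 1 50 | xargs -P 4 -I{} curl -s "https://books.toscrape.com/catalogue/page-{}.html" -o "page-{}.html"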
Here are the main articles I relied on while building this challenge: