
Cayena Challenge 🌶️

This challenge consists of building an Analytical Platform on which data analysts can run queries and build basic data visualizations. The object of analysis is the book information from the BooksToScrape website. However, the website offers no API or other easy means of data extraction, so a Web Scraping solution had to be built. There is also a need to compare how prices and stock changed for the books over time, which calls for a daily ETL job to be run.

How to use the platform

Requirements: Docker and Docker Compose installed on your machine.

  1. Clone the repo and cd into the new directory:
    git clone https://github.com/muriloxyz/cayena-challenge.git && cd cayena-challenge
  2. Initialize the docker-compose script:
    docker-compose up -d
  3. Restore the data into Postgres:
    docker exec cayena-challenge_pgsql_1 psql -h localhost -p 5432 -U cayena -d cayena -f restore_data.sql
  4. Access the data platform (SqlPad) at http://localhost:3000. The default user is [email protected], and its password is cayena. (Mindblown 🤯)
  5. Select the Postgres Database connection and start analysing your data in the book_info table.
  6. Write your most beautiful queries!

For more info on how to use the SqlPad analytical platform (including how to build visualizations), please refer to the SqlPad Docs.
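As a starting point, here is the kind of query you could run in SqlPad. The column names (title, price) are assumptions about the book_info schema, not confirmed by the source; adjust them to the actual table:

```sql
-- Compare the lowest and highest recorded price per book over time.
-- Column names (title, price) are assumed; adapt to the real book_info schema.
SELECT
    title,
    MIN(price) AS lowest_price,
    MAX(price) AS highest_price,
    MAX(price) - MIN(price) AS price_range
FROM book_info
GROUP BY title
ORDER BY price_range DESC
LIMIT 10;
```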

Main takeaways

  • If you need daily data updates, keep the Docker containers running! A cron job is scheduled to scrape the website every day at 3 a.m. UTC (00:00 Brazilian time). It will scrape the website and store all the processed data in the book_info table, leaving it ready for analysis;
  • It is also possible to trigger the job manually, but you'll need to enter the worker container to execute the Python job.
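The daily 3 a.m. UTC schedule described above would correspond to a crontab entry along these lines — the script path and log location are hypothetical examples, since the repo layout isn't shown here:

```
# Run the ETL scraper daily at 03:00 UTC (00:00 Brazilian time).
# /app/etl.py and the log path are illustrative, not the actual repo paths.
0 3 * * * python3 /app/etl.py >> /var/log/etl.log 2>&1
```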

Architecture

Architecture-Diagram

Description of the 3 containers used for this challenge:

  • Worker: Responsible for running the ETL job on the configured cron schedule. It uses 4 threads for faster web scraping. Built on a Python/Ubuntu base image.
  • Pgsql: Docker image provided by the PostgreSQL team. Simply hosts a database in which all treated data is stored.
  • Sqlpad: Docker image provided by the SqlPad team. It runs a locally hosted web analytics platform that connects to the Pgsql container for data.
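To illustrate the Worker's 4-thread scraping, here is a minimal sketch using Python's concurrent.futures. The scrape_page function and page URLs are stand-ins for the real fetching/parsing logic, not the actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_page(url: str) -> dict:
    # Stand-in: the real job would request the page and parse book data here.
    return {"url": url, "books": []}

def scrape_all(urls, workers=4):
    # Four threads, matching the Worker container's configuration.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(scrape_page, urls))

urls = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 5)]
results = scrape_all(urls)
print(len(results))  # one result per page → 4
```

pool.map preserves input order, so each result lines up with its source page even though the pages are fetched concurrently.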

Reference

Here are the main articles I relied on while building this challenge:
