takehome-data-eng's Introduction

Bungalow Take Home Project for Data Engineer Role (V2. 2021-11-02)

Welcome to the Bungalow Takehome Challenge for Data Engineering! This is a barebones repo to get you started.

What to build

A common task for data engineers at Bungalow involves integrating third-party data, modelling it, storing it and making it available to downstream teams such as analytics, data science and ultimately the entire organization. For this challenge we'd like to give you a brief snapshot of what a common workload may entail. Of course, this could become a big task. Therefore, to save you time, we did some of the heavy lifting, such as the setup and some scaffolding of the environment.

For this test we will collect the current weather data from OpenWeatherMap. The free API will work for this assignment. You shouldn’t pay for the API.
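As a rough illustration of what a call to the current weather API could look like, here is a minimal sketch using only the Python standard library. The endpoint URL, the `q`/`appid`/`units` query parameters and the response field names reflect the OpenWeatherMap v2.5 current weather API as we understand it; the function names and the choice of fields to keep are hypothetical and not part of the repo's scaffolding.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the OpenWeatherMap "current weather" API (v2.5).
BASE_URL = "https://api.openweathermap.org/data/2.5/weather"


def build_request_url(city: str, api_key: str, units: str = "metric") -> str:
    """Build the request URL for one city; `api_key` is your own free-tier key."""
    query = urlencode({"q": city, "appid": api_key, "units": units})
    return f"{BASE_URL}?{query}"


def fetch_current_weather(city: str, api_key: str) -> dict:
    """Call the API and return the decoded JSON payload (performs a network request)."""
    with urlopen(build_request_url(city, api_key), timeout=10) as response:
        return json.load(response)


def flatten_observation(payload: dict) -> dict:
    """Keep only the fields we plan to store (hypothetical selection)."""
    main = payload.get("main", {})
    return {
        "city_id": payload["id"],
        "city_name": payload["name"],
        "observed_at": payload["dt"],  # Unix timestamp (UTC)
        "temperature": main.get("temp"),
        "humidity": main.get("humidity"),
    }
```

A fetcher could then call `flatten_observation(fetch_current_weather("Toronto", api_key))` and pass the result on to the storage step. Reading the API key from an environment variable rather than hard-coding it is usually the safer choice.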

Please install Docker Desktop on your laptop. It provides the environment needed for the next steps.

The Docker Compose setup contains two applications, along with the minimal configuration they require:

  • Airflow: To run your additions to the boilerplate DAGs.

  • Postgres: To maintain your tables. (You can swap it with any other database of your choice, e.g. SQLite or MySQL.)

Below are the steps in the data flow diagram:

  • The fetcher.py script, which represents the Fetcher DAG, retrieves the data from the current weather API.

  • The fetcher script processes and cleans the data, then stores it in the Postgres database, taking relationships, integrity, performance, and extensibility into account.

  • The transformer.py script, which represents the Transformer DAG, transforms the data from the previous step to prepare some derived dataset tables. You can implement the transformations in either Python or SQL.

  • The Transformer writes the datasets back to Postgres.

  • The downstream customer(s) read both the original and the derived tables. They will execute historical queries to run analytics and data science models.
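The store and transform steps above can be sketched roughly as follows. This is only one possible design, not the expected solution: the table names, columns and the daily-average dataset are hypothetical, and the sketch uses SQLite from the standard library (one of the permitted swaps) so that it runs without any setup. In Postgres the idempotent insert would typically be written with `ON CONFLICT DO NOTHING` instead of `INSERT OR IGNORE`.

```python
import sqlite3

# Hypothetical schema: one dimension table, one raw fact table, one derived table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS city (
    city_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS weather_observation (
    city_id     INTEGER NOT NULL REFERENCES city (city_id),
    observed_at INTEGER NOT NULL,       -- Unix timestamp (UTC)
    temperature REAL,
    PRIMARY KEY (city_id, observed_at)  -- re-running the fetcher adds no duplicates
);
CREATE TABLE IF NOT EXISTS daily_temperature (
    city_id  INTEGER NOT NULL REFERENCES city (city_id),
    day      TEXT NOT NULL,
    avg_temp REAL,
    PRIMARY KEY (city_id, day)
);
"""


def store_observation(conn: sqlite3.Connection, obs: dict) -> None:
    """Fetcher side: insert the city and the observation, skipping existing rows."""
    conn.execute(
        "INSERT OR IGNORE INTO city (city_id, name) VALUES (?, ?)",
        (obs["city_id"], obs["city_name"]),
    )
    conn.execute(
        "INSERT OR IGNORE INTO weather_observation (city_id, observed_at, temperature) "
        "VALUES (?, ?, ?)",
        (obs["city_id"], obs["observed_at"], obs["temperature"]),
    )


def build_daily_temperature(conn: sqlite3.Connection) -> None:
    """Transformer side: rebuild the derived per-city daily average in SQL."""
    conn.execute(
        """
        INSERT OR REPLACE INTO daily_temperature (city_id, day, avg_temp)
        SELECT city_id, date(observed_at, 'unixepoch'), AVG(temperature)
        FROM weather_observation
        GROUP BY city_id, date(observed_at, 'unixepoch')
        """
    )
```

Keeping the raw observations separate from the derived table lets downstream users run historical queries on either one, and the composite primary keys make both steps safe to re-run.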

This project is meant to be flexible so that you can showcase your decision-making capabilities and your overall technical experience.

Note: If you are uncomfortable with Docker, Postgres or Airflow, please feel free to remove or replace them. They are meant to save you time. As long as you can achieve the outcome, feel free to use any additional tooling, programming language (e.g. Java or Scala) and approach you see fit. We will ask follow-up questions about your decisions in the follow-up conversation.

We are more interested in seeing your thought process and approach to solving the problem!

Deliverables

We expect to see the following items in your GitHub pull request:

  • Your Python code for the data fetcher and transformer.

  • The SQL for your data model, along with notes on your data modelling design.

  • A Readme file with your notes.

Evaluation

We will use this project as the basis for evaluating your overall fit for a data engineering role from a technical viewpoint.

To do this, we will review your code with an eye for the following:

  • Readability, scalability and usability

  • Data processing and relational modelling

  • Python and SQL know-how

Time expectations

We know you are busy and likely have other commitments in your life, so we don't want to take too much of your time. We don't expect you to spend more than 2 hours working on this project. That being said, if you choose to put more or less time into it for whatever reason, that is your choice.

Feel free to indicate in your notes below if you worked on this for a different amount of time and we will keep that in mind while evaluating the project. You can also provide us with additional context if you would like to.

Additionally, we have left a spot below for your notes. If you have ideas for pieces that you would have done differently, or additional things you would have implemented if you had more time, you can indicate those in your notes as well, and we will use them as part of the evaluation.

Public forks

We encourage you to try this project without looking at the solutions others may have posted. This will give the most honest representation of your abilities and skills. However, we also recognize that day-to-day programming often involves looking at solutions others have provided and iterating on them. Being able to pick out the best parts and truly understand them well enough to make good choices about what to copy and what to pass on by is a skill in and of itself. As such, if you do end up referencing someone else's work and building upon it, we ask that you note that as a comment. Provide a link to the source so we can see the original work and any modifications that you chose to make.

Challenge instructions

  • Fork this repository and clone it to your local environment.

  • Prepare your environment with Python and any other tools you may need. Docker can do it for you.
    • To start the Docker Compose environment, run the following commands:
      # Initialize the folders and the non-root user for Airflow
      mkdir -p ./logs ./plugins
      echo -e "AIRFLOW_UID=$(id -u)" > .env
      # Initialize the Airflow database
      docker-compose up airflow-init
      # Start all services
      docker-compose up
      # The Airflow UI is available at http://localhost:8080 (username and password: airflow)
    • If you run into any problems with the environment, please refer to here.
  • Fill in the TODOs in the repository. There are currently 6 TODOs, but you can go above and beyond.
    • Any problems with the DAGs? They are taken from here. Please take a look at the rest of the tutorial if needed.
    • You can read about the Postgres operator here.
    • To keep it simple, let's use the Airflow database to store your dataset.
  • Write your notes in the Readme.md file.
  • Complete the challenge and push your changes back to the repo.
    • If you have any questions at any step, please reach out to your recruiter. A member of the engineering team will be involved to support you, as if you were working for Bungalow.
  • Note: If you are using Apple hardware with an M1 processor, there is a common challenge with Docker. You can read more about it here.
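When filling in the DAG TODOs, the wiring between a Python function and an Airflow task might look like the sketch below. It assumes Airflow 2.x (where `DAG` accepts `schedule_interval` and `PythonOperator` lives in `airflow.operators.python`); the DAG id, task id, schedule and function names are hypothetical and may not match the boilerplate DAGs in the repo. The Airflow imports sit inside the builder function only so that the snippet can be imported without Airflow installed.

```python
from datetime import datetime


def fetch_weather(**context) -> str:
    """Placeholder task body; a real fetcher would call the API and store the rows."""
    return "fetched"


def build_fetcher_dag():
    """Build a DAG with a single Python task (assumes Airflow 2.x is installed)."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="fetcher",                  # hypothetical DAG id
        start_date=datetime(2021, 11, 1),
        schedule_interval="@hourly",       # hypothetical schedule
        catchup=False,                     # do not backfill past intervals
    ) as dag:
        PythonOperator(task_id="fetch_weather", python_callable=fetch_weather)
    return dag


# In an actual DAG file, Airflow only discovers DAG objects bound at module level:
# dag = build_fetcher_dag()
```

Keeping the task body as a plain function makes it easy to test outside of Airflow before wiring it into the scheduler.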

Your notes (Readme.md)

@TODO: Add any additional notes / documentation in this file.

Time spent

Give us a rough estimate of the time you spent working on this. If you spent time learning in order to do this project please feel free to let us know that too. This makes sure that we are evaluating your work fairly and in context. It also gives us the opportunity to learn and adjust our process if needed.

Assumptions

Did you find yourself needing to make assumptions to finish this? If so, what were they and how did they impact your design/code?

Next steps

Provide us with some notes about what you would do next if you had more time. Are there additional features that you would want to add? Specific improvements to your code you would make?

Instructions to the evaluator

Provide any end-user documentation you think is necessary and useful here.

Contributors

saeedzareian, tsoporan