The nira-interview from spencer-sullivan

Get set up to develop for the take-home assignment.

First, fork the nira-interview repo into your own repo.

Install Pyenv to switch between different Python versions: https://github.com/pyenv/pyenv (Windows: https://github.com/pyenv-win/pyenv-win)
Install the Python version specified in /nira-interview/.python-version using Pyenv: pyenv install 3.9.6
Navigate into the /nira-interview directory and double-check that you've set up Pyenv correctly. When you run python --version, the output should match the version specified in /nira-interview/.python-version, which is 3.9.6.
Pyenv works by reading the .python-version file and automatically switching to the right Python version.
Install Poetry: https://python-poetry.org/docs/
Configure Poetry to create .venv folders inside the project: poetry config virtualenvs.in-project true
Navigate into the pipeline folder: cd /nira-interview/pipeline
Install dependencies: poetry install
Activate the virtual environment: poetry shell
Double-check that the right version of Python is being used in the virtual environment: python --version
Make sure that the dependencies were installed: poetry show
Spin up Dagit: poetry run dagit
Navigate to localhost:3000. You should see Dagster running there.
In the jobs pane on the left, click the "nira_smoke_test_job" job. Click "Launchpad" and then "Launch run". You should see the job print "Successfully ran smoketest".
Specify the Python interpreter in VSCode: open the "Python: Select Interpreter" setting in VSCode and enter your own path, which should be ./pipeline/.venv/bin/python
If you get here, you should be ready.
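
As an optional sanity check on the Pyenv steps above, the comparison between the running interpreter and the pinned version can be sketched in a few lines of Python. This is only an illustration: the helper names read_pinned_version and interpreter_matches are hypothetical and are not part of the repo, and the sketch assumes .python-version holds a single version string such as 3.9.6.

```python
import platform
from pathlib import Path


def read_pinned_version(project_dir: Path) -> str:
    """Return the version string stored in <project_dir>/.python-version.

    Assumes the file holds a single version such as "3.9.6".
    """
    return (project_dir / ".python-version").read_text().strip()


def interpreter_matches(project_dir: Path) -> bool:
    """True when the running interpreter equals the pinned version."""
    return platform.python_version() == read_pinned_version(project_dir)
```

Calling interpreter_matches(Path("/nira-interview")) from inside the activated environment should return True when Pyenv and Poetry picked up the pinned version; the path is the one used in the steps above and may differ on your machine.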

Introduction to Dagster

Dagster is an open source tool we use to orchestrate our pipelines. You can learn more about Dagster at dagster.io. They're an awesome company.

Dagster jobs are essentially a list of steps written in pipeline. Each step is called an op. If you open smoke_test_job.py, you'll see the nira_smoke_test_job Python function, which is decorated with @job.

The job is made from a series of calls to ops. The two ops are also defined there. The output of smoke_test_op1 is passed into smoke_test_op2.

It's that easy: Dagster jobs are constructed from ops.

Your assignment

Your task is to edit the interview_job defined in interview_job.py. First, let's see what's going on inside interview_job.

First, we read in a raw CSV of the buses we need to run the pipeline on in raw_buses_to_run.
Then we calculate the MW available for each bus in get_mw_available_for_each_bus_very_slow. You can see in the code that calculating this takes 5 minutes per bus! Super slow.
Then we convert MW to GW in add_gw_available_column.
Lastly, we write the final DataFrame to disk in output_interview_job.
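
As an illustration of the MW to GW step, the conversion is a division by 1000 applied to a DataFrame column. The sketch below is not the repo's add_gw_available_column: the function name add_gw_column_sketch and the column names mw_available and gw_available are assumptions made for the example, so check the real op and CSVs for the actual names.

```python
import pandas as pd

MW_PER_GW = 1000  # 1 GW = 1000 MW


def add_gw_column_sketch(
    df: pd.DataFrame,
    mw_col: str = "mw_available",  # assumed column name
    gw_col: str = "gw_available",  # assumed column name
) -> pd.DataFrame:
    """Return a copy of df with a GW column derived from the MW column."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out[gw_col] = out[mw_col] / MW_PER_GW
    return out
```

Returning a copy rather than mutating the input keeps each step free of side effects, which makes it easier to reason about what one op hands to the next.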

This pipeline has already been run the slow way with the initial set of buses, and its results are inside pipeline/interview_job/output.

Modifications needed

Sometimes, we have a new bus we need to run as well. But we don't want to rerun all the buses because that's too slow.

Your task is to figure out how to construct this pipeline so that we don't have to rerun all the buses, only the new ones, while still outputting one single CSV to disk.

A few constraints:

You can tweak get_mw_available_for_each_bus_very_slow for testing purposes, but you are not allowed to change the code inside this file in the final submission. Don't get clever and just decrease the sleep() call to one second.
We are only ever adding new buses, you do not need to worry about buses being removed.
For any given bus, the values calculated in get_mw_available_for_each_bus_very_slow will always be exactly the same (you can see this in the code).

Final deliverable:

Send over a link to the forked repo you made the modifications in.
Inside raw_buses_to_run.py, comment out line 4 and uncomment line 5. This will switch the raw buses CSV to a new CSV. You can look at the two CSVs; the only difference is one additional bus in the new one. Remember, there will only ever be bus additions in the new CSV.
There should be only one new file inside the /output folder, containing the results for all the buses defined in new_raw_buses.csv. You should delete the original CSV in the output folder that the repo started with. There should never be two CSVs in the output folder.
Any new ops you need should be added to interview_job.py and also be implemented in their own file in the /ops folder.
