- Version Control: Init the git repo
- Dependencies: Can reuse existing environment: `conda activate analytics-training-samples`
- Data: Set up `dvc`
  - Create folder `data/01_raw`: Put the data there
  - `dvc init`
  - Set up remote storage: `dvc remote add -d storage gs://dvc-data-storage` (separate steps to set up Google Storage and connect local machine to GCP)
  - Turn on autostaging: `dvc config core.autostage true`
  - `dvc add data/01_raw/online_retail.xlsx`
  - `dvc push`
Convert the notebook to script
- Example notebook: `notebooks/Original_OnlineRetail_Cohort.ipynb`
- Make a copy: `notebooks/Original_OnlineRetail_Cohort-Copy1.ipynb`
- On Jupyter notebook:
  - Cell -> All Output -> Clear
  - Run All (to make sure that all cells are in the right sequence)
  - Check all results
- On Terminal, run: `jupyter nbconvert notebooks/Original_OnlineRetail_Cohort-Copy1.ipynb --to python`
- Check the script output, remove all print/notebook statements: `print(...)`, `df.head()`, `df.describe(...)`, etc.
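
The last clean-up step can be semi-automated. The block below is a minimal sketch, not part of the original workflow: the helper name `find_notebook_statements` and the pattern list are assumptions, and the sample script is made up for illustration. It only flags candidate lines; removing them stays a manual decision.

```python
from __future__ import annotations

import re

# Statements that usually only make sense interactively.
# This pattern list is an assumption; extend it for your own notebook.
NOTEBOOK_ONLY = re.compile(
    r"^\s*(print\(|\w+\.head\(|\w+\.describe\(|get_ipython\(\))"
)


def find_notebook_statements(source: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like display-only statements."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if NOTEBOOK_ONLY.search(line):
            hits.append((number, line.strip()))
    return hits


# Made-up stand-in for the text of the converted script.
example = """\
import pandas as pd
df = pd.read_excel("data/01_raw/online_retail.xlsx")
df.head()
print(df.shape)
cohort = df.groupby("CustomerID").size()
df.describe(include="all")
"""

for number, line in find_notebook_statements(example):
    print(f"line {number}: {line}")
```

To check the real script, read the converted `.py` file into a string and pass it to the helper.
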
Read through the code, sketch out the flow, list any code smells
- Sketch out the flow
  - Ingest & process data
  - Aggregate to have the data input for plotting
  - Plotting the reports
- List code smells & Clean code enhancements
  - Leave `# TODO` comments for things to change
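
The three-stage flow with `# TODO` markers might look like the skeleton below. This is an illustrative sketch under assumptions: the function names, the `CustomerID` and `InvoiceMonth` keys, and the text bar chart are hypothetical stand-ins, and plain dictionaries replace the notebook's real data structures so the example runs on its own.

```python
from collections import Counter


def ingest(rows):
    """Stage 1: ingest & process. Here: drop rows without a customer ID."""
    # TODO: replace the hard-coded column name with a named constant
    return [row for row in rows if row.get("CustomerID") is not None]


def aggregate(rows):
    """Stage 2: aggregate into the input the plots need (rows per month)."""
    # TODO: split the cohort logic into its own function once tests exist
    return Counter(row["InvoiceMonth"] for row in rows)


def plot(counts):
    """Stage 3: plotting. A text bar chart stands in for the real charts."""
    # TODO: swap in the plotting library used by the notebook
    return [f"{month} {'#' * n}" for month, n in sorted(counts.items())]


# Made-up sample rows, only to show the stages chained together.
rows = [
    {"CustomerID": 1, "InvoiceMonth": "2011-01"},
    {"CustomerID": None, "InvoiceMonth": "2011-01"},
    {"CustomerID": 2, "InvoiceMonth": "2011-02"},
    {"CustomerID": 1, "InvoiceMonth": "2011-02"},
]
report = plot(aggregate(ingest(rows)))
print("\n".join(report))
```

Keeping each stage as a separate function makes the later split into separate modules mostly a matter of moving code.
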
Convert code into DRY functions, write tests, and create a local module (which can be imported into notebooks)
- Create the `src` folder to keep all source code
- Create the `tests` folder to keep tests for the source code
- Writing and testing follow the Test-driven Development (TDD) approach
  - `tests/test_*.py`: Write the test first, then write the code
  - `pytest`: Run the test, make sure it fails
  - Write the code, make sure it passes
  - Refactor the code, make sure it still passes
- Structure within the source code:
  - `src/`: All source code
    - `__init__.py`: Empty file to make `src` a module
    - `utils.py`: Utility functions
    - `ingest.py`: All functions to process/clean data
    - `process.py`: All functions to aggregate data
    - `plot.py`: All functions to plot data
  - `tests/`: All tests
    - `__init__.py`: Empty file to make `tests` a module
    - `test_utils.py`: Utility functions for tests
    - `test_ingest.py`: Tests for `ingest.py`
    - `test_process.py`: Tests for `process.py`
    - `test_plot.py`: Tests for `plot.py`
- Within each test file:
  - Import the functions from local modules (`src` folder)
  - Tests for each function can be organized by `Class`, with each test case as a method (named after the function under test)
- Write the code until it passes the tests
- Reformat with `black`
- Commit to git
- Open a pull request to merge into the `main` branch
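
As an illustration of the test layout described above, here is a minimal sketch of one function with its class-based tests. The function `count_orders_per_customer` and its behaviour are hypothetical, not taken from the project. Both parts sit in one block so the example runs on its own; in the real layout the function would go in `src/process.py` and the class in `tests/test_process.py`, which would start with an import from `src.process`.

```python
# Hypothetical function; in the project it would live in src/process.py.
def count_orders_per_customer(rows):
    """Count how many rows each CustomerID appears in."""
    counts = {}
    for row in rows:
        customer = row["CustomerID"]
        counts[customer] = counts.get(customer, 0) + 1
    return counts


# In the project this class would live in tests/test_process.py.
# By default pytest collects classes whose names start with "Test"
# and methods whose names start with "test_".
class TestCountOrdersPerCustomer:
    def test_counts_each_customer(self):
        rows = [{"CustomerID": 1}, {"CustomerID": 2}, {"CustomerID": 1}]
        assert count_orders_per_customer(rows) == {1: 2, 2: 1}

    def test_empty_input_returns_empty_dict(self):
        assert count_orders_per_customer([]) == {}
```

Following TDD, the test class is written first and `pytest` is run to see it fail; the function body is then filled in until both tests pass.
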
We refactor the code from the original notebooks into local modules and import them into the new notebooks. This enforces the TDD approach and makes the functions reusable across notebooks.
If you reuse the local modules a lot, consider converting them into a proper Python package that can be installed with `pip`.
See: `notebooks/Refactored_OnlineRetail_Cohort.ipynb`
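
A notebook stored under `notebooks/` cannot import `src` unless the project root is on `sys.path`. One possible way to arrange that is sketched below; the helper `add_project_root` is a hypothetical name, not something the refactored notebook is known to use, and installing the code as a package would make it unnecessary.

```python
import sys
from pathlib import Path


def add_project_root(start: Path, marker: str = "src") -> Path:
    """Walk up from `start` to the first folder containing `marker`
    and put that folder on sys.path so `import src...` works."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            if str(candidate) not in sys.path:
                sys.path.insert(0, str(candidate))
            return candidate
    raise FileNotFoundError(f"No '{marker}' folder found above {start}")


# Assumed usage in the first notebook cell (the imported name is hypothetical):
# add_project_root(Path.cwd())
# from src.process import count_orders_per_customer
```
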