
data-testing-tutorial's Introduction

Best Testing Practices for Data Science

A short tutorial for data scientists on how to write tests for your code and your data. Before the tutorial, please read through this README file, for it contains a lot of useful information that will help you best prepare for the tutorial.

How to use this repository

The tutorial notes are typed up in Jupyter notebooks, and static HTML versions are available under the docs folder. For the non-bonus material, I suggest working through the notes in order. With the exception of the Projects, the bonus material can be tackled in any order. During the tutorial, be sure to have the HTML versions open.

Pre-Requisite Knowledge

I am assuming you are the following type of coder:

  • You are a data analytics type, who knows how to read/write CSV files with Pandas, and do basic data manipulation (slicing, indexing rows + columns, using the .apply() function).
  • You are not necessarily a seasoned software developer who has experience running tests.
  • You are comfortable with operating in the Terminal environment.
  • You have some rudimentary knowledge of numpy, particularly the array.min(), array.max(), array.mean(), array.std(), and numpy.allclose(a1, a2) function calls.

In order to prepare for the tutorial, there are some pieces of Python syntax that will come in handy to know (a short refresher sketch follows this list):

  • the context manager syntax (with ....),
  • assertions (assert condition1 == condition2),
  • file I/O (with open(....) as f:...),
  • list/dict/tuple comprehensions ([a for a in container if condition(a)]),
  • checking types & attributes (isinstance(obj, type) or hasattr(obj, attr)).
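
If any of these feel rusty, here is a minimal refresher sketch; the file name data.csv is hypothetical:

# A quick tour of the syntax listed above; `data.csv` is a hypothetical file.
import numpy as np

# assertions: pass silently when True, raise AssertionError when False
arr = np.array([0.0, 0.5, 1.0])
assert np.allclose(arr.mean(), 0.5)

# context manager + file I/O, with a list comprehension
with open("data.csv") as f:
    rows = [line.strip() for line in f if line.strip()]

# checking types & attributes
assert isinstance(arr, np.ndarray)
assert hasattr(arr, "mean")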

Feedback

If you've taken a version of this tutorial, please leave feedback here. I use those suggestions to adjust the tutorial content and make it better. The changes are always released publicly on GitHub, so everybody benefits!

Environment Setup

conda setup

This installation route should work cross-platform. I recommend using the Anaconda distribution of Python because it is a good way to bootstrap your data science environment.

To get set up, create a conda environment based on the provided environment.yml spec file, then run the following command in your bash terminal.

$ bash conda-setup.sh

pip setup

The alternative way is to use a virtualenv environment:

$ bash venv-setup.sh
$ source datatest/bin/activate

Alternatively, you can pip install each of the dependencies listed in the environment.yml file. (The requirements.txt file may be less actively maintained than the environment.yml file, given my conda bias.)

Manual Setup

If you prefer having more control over your installation process, conda or pip install the dependencies listed in the environment.yml file.
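
For example, with pip (keeping in mind the caveat above that requirements.txt may lag behind environment.yml):

$ pip install -r requirements.txt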

Checks

To check whether the environment is correctly set up, run the checkenv.py script:

$ python checkenv.py

It should print to your terminal: All packages found; environment checks passed. Otherwise, conda or pip install the necessary packages it reports (they will show up one by one).

Authors

Contributors

Special thanks go to the individuals who have contributed in ways big and small to the improvement of the material.

  • Renee Chu
  • Matt Bachmann: @Bachmann1234
  • Hugo Bowne-Anderson: @hugobowne
  • Boston Python tutorial attendees:
    • @races1986
    • Thao Nguyen: @ThaoNguyen15
    • @ChrisMuir

Data Credits


data-testing-tutorial's Issues

Workshop ruminations

Hey @ericmjl, this workshop is really exciting. I have now gone through all the NBs and attempted to run everything locally.

My feedback is below. I'm on a plane and git pulled before take-off, so some of my feedback may already be out of date. I have included some check boxes, if it helps. Between now and the workshop, I am happy to help implement any of these or other changes you see fit. Happy to chat here, but it would be good to sync up and chat as well. See you in Portland!


general:

  • what are your requirements for an attendee? Intermediate Python programming skills, for example: knowing how to write functions, pandas slicing, and raising errors in functions (could be good to include in README.md to manage expectations). I could imagine that code such as the following may require explanation for many attendees:

hashes['concat'] = df.apply(lambda x: ''.join(str(x[col]) for col in df.columns), axis=1)

  • when I run checkenv.py, I get

ModuleNotFoundError: No module named 'colorama'
include it in the environment? It's funny that I have all the packages required for the workshop installed, but not all the packages required to check the environment! :D

2-pytest-intro

  • the 1st exercise says to create test_datafuncs.py, but it's already in the repo: perhaps this is the instructor repo? In general, it seems as though many of the exercises have already been done in this repo, e.g. the min-max scaler (probably intentional)
  • it isn't obvious why running py.test should do anything; perhaps explain in a few words? The terminal is a mystical place for many.
  • for the first test_min_max_scaler(), do you need to import numpy in test_datafuncs.py?
  • the following line of code in the 2nd test_min_max_scaler() makes pytest throw a SyntaxError; can you reproduce this?

assert np.allclose(tfm, np.array([0, 0.5, 1])

  • I really love the textual data example; I'm wondering if you expect attendees to know what lines of code such as the following do, or if you'll explain it? return ''.join(s for s in text if s not in exclude)
  • similarly with the tests; also, why do you run the test functions after defining them in the NB? Just wondering :)

3-file-integrity

  • cell 2: attendees may wonder what sha256(), update() and hexdigest() are
  • in cell 12 (defining hash_file), this is the 1st time you're calling the update() method several times; this is worth a few words;
  • love the tinydb :)

4-data-checks

  • typo: under Schema Checks, 'expected' is missing the 'd'
  • yaml: you could remind them that they used one to set up their system with the conda env! :D
  • you write 'Let's now switch roles, and pretend that we're on side of the "analyst" and are no longer the "data provider".' but this is the first time the idea of a "data provider" has come up in the NB
  • in the exercise to write function test_data_columns, do you want them to run py.test to see the fruits of their labour?
  • when you write ' Take the schema spec file and write a test for it.', which schema spec file are you talking about?
  • I love missingno!
  • interesting to use pandas_summary; idiomatic pandas would generally lead me down the path of using dfs.describe() and dfs.info()
  • after they write the function test_data_completeness() and add it to test_datafuncs.py, running py.test doesn't find a DataFrame df; similarly with test_data_range(); I'm probably missing something silly;
  • great ECDFs! ;)
  • you write 'We can take the EDA portion further, by doing an empirical cumulative distribution plot for each data column.' but then do something else first; explain why we need compute_dimensions()?

5-test-coverage

  • coverage seems really cool! Running py.test --cov, though, throws

py.test: error: unrecognized arguments: --cov

  • do you plan to expand on this NB? Can I help at all? One way would be to play around with the functions and test functions and see the differing results;

Code error in min_max_scaler in notebook #2

Within file 2-pytest-introduction.ipynb, you have test_min_max_scaler():

def test_min_max_scaler():
    arr = np.array([1, 2, 3])  # set up the test with necessary variables.
    tfm = dfn.min_max_scaler(arr)  # collect the result into a variable
    assert tfm == np.array([0, 0.5, 1])  # assertion statements
    assert tfm.min() == 0  
    assert tfm.max() == 0 

I believe this line

assert tfm.max() == 0

should be

assert tfm.max() == 1

It's correct in the code file test_datafuncs_soln.py, just not in the notebook file.

Also, I really enjoyed the training session today; I learned a ton. Thanks for taking the time, Eric.

Win64 Dependencies Issue

I am finding that missingno and rise are both incompatible with Windows 64 when I use the environment.yml to create my conda environment. Any recommendations?


Proposal is in!

Hey @reneighbor!

I took the liberty of putting in the proposal. It's open for modifications if you see places where it can be improved upon. Please let me know!


Basic Information

  • Title: Best Testing Practices for Data Science
  • Description:
    • So you're a data scientist wrangling with data that's continually avalanching in, and there are always errors cropping up! NaNs, strings where there are supposed to be integers, and more. Moreover, your team is writing code that is getting reused, but that code is failing in mysterious places. How do you solve this? Testing is the answer! In this tutorial, you will gain practical hands-on experience writing tests in a data science setting so that you can continually ensure the integrity of your code and data. You will learn how to use py.test, coverage.py, and hypothesis to write better tests for your code.
  • Audience:
    • This tutorial is for current and aspiring data scientists. Some experience with Python is a must, including knowledge of assertion statements; however, beginners who have not seen assert are welcome, as we will go through it. Expected knowledge includes: writing functions, for-loops, and flow control (if/else). By the end of the tutorial, we expect students to know how to write good tests for their code and for their data.

Outline

Part 1: Motivation for Testing (10 min, lecture-style)

  1. We make assumptions with our data and our code. Automated testing is crucial for continually testing those assumptions.
  2. Things that need to be checked - data science:
    1. Data types
    2. Data completeness
    3. Data structure
    4. Data integrity
  3. Things that need to be checked - both data science & software development:
    1. Function correctness
    2. Boundary cases
    3. Counter-examples
    4. Code coverage

Part 2: Introduction to Testing & Writing Tests for Functions (30 min)

Summary of section: Introduction to py.test as a simple way to get off the ground with automated testing. (A minimal example follows the outline below.)

  1. Using py.test at the command line in an empty project directory.
  2. The assertion statement is your friend!
  3. Warm-up example: write an add() function, write a test for correctness, write a counter-example.
  4. More complex example: write a function that sums up two floating point numbers and log10-transforms them. Write a test for correctness, write a counter-example.
  5. More complex example: write a function that strips punctuation, spaces, and special characters from text, and transforms it into a "bag of words". Write a test for correctness and a counter-example test.
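
As a taste of the warm-up, here is a minimal sketch; the file name test_add.py is illustrative rather than the tutorial's exact code:

# test_add.py -- run `py.test` in this directory to execute these tests.
def add(a, b):
    return a + b

def test_add():
    assert add(1, 2) == 3   # correctness
    assert add(-1, 1) == 0  # a simple boundary case

def test_add_strings():
    # a counter-example of sorts: "adding" strings concatenates them
    assert add("a", "b") == "ab"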

Part 3: Checking File Integrity (40 min)

Summary of section: Participants will write simple tests that automatically check the integrity of their files. They will also learn how to create simple change logs of data. We will provide a directory of data for this. Applied to detecting file tampering. (A minimal hashing sketch follows the list below.)

  1. Get hashes of a file (md5, sha256), and write a script to automatically record data hashes.
  2. Write tests that check that a file has not been changed since the last recorded version (i.e. check the file hash).
  3. Write tests that check other basic properties of a data file: number of rows, number of columns, number of words in a text file.
  4. What to do when file hashes change?
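
A minimal sketch of file hashing with the standard library's hashlib; the file path and recorded digest are placeholders:

# Hash a file in chunks; update() can be called repeatedly on the same hasher.
import hashlib

def hash_file(path, blocksize=65536):
    # returns the sha256 hex digest of the file at `path`
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()

def test_file_unchanged():
    recorded = "..."  # placeholder: a previously recorded sha256 digest
    assert hash_file("data/my_data.csv") == recorded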

Part 4: Checking Data Assumptions (40 min)

Summary of section: Participants will write tests that check the assumptions they may have about the data they received. Applied to pre-analysis checks, especially with streams of data coming in. (A minimal sketch follows the list below.)

  1. Write tests for column data types: int, float, str, object.
  2. Write tests that check that there are no unexpected empty data cells: finding NaN with pandas.
  3. Write tests that check for value correctness (e.g. cannot have negative numbers where log10 transform expected).
  4. Write tests that check textual data for correct lengths (e.g. biological sequence data should fall within some expected range of lengths).
  5. Write tests that check for potential outlier data points (e.g. use of Q-test).
  6. Write tests that check for normality of data (e.g. use of scipy's K-S test).
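
A minimal sketch of such checks with pandas; the CSV path and column names are made up for illustration:

# Hypothetical data-assumption checks over a pandas DataFrame.
import pandas as pd

df = pd.read_csv("data/my_data.csv")  # hypothetical data file

def test_column_dtypes():
    # check column data types
    assert pd.api.types.is_integer_dtype(df["age"])
    assert pd.api.types.is_float_dtype(df["weight"])

def test_no_missing_values():
    # no unexpected empty cells (NaNs) anywhere in the frame
    assert df.isnull().sum().sum() == 0

def test_value_ranges():
    # values must be positive if a log10 transform is expected downstream
    assert (df["weight"] > 0).all()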

Part 5: Code coverage (20 min)

Summary of section: use of coverage.py to check code coverage. (An example invocation follows the list below.)

  1. Using coverage.py with py.test.
  2. Exercise: write a test for non-covered functions to bring up coverage score.
  3. Emphasize: 100% code coverage doesn't mean 100% good coverage, but it's a good starting point!
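
For reference, the invocation might look like this; the --cov flag comes from the pytest-cov plugin, which must be installed first (its absence explains the "unrecognized arguments: --cov" error reported in the ruminations above):

$ pip install pytest-cov
$ py.test --cov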

Part 6: Property-based testing (30 min)

Summary of section: introduction to hypothesis. (A minimal example follows the list below.)

  1. Property-based testing: we don't write fixed tests, but instead test the "properties" of a function (input data range, data type) for correctness.
  2. How to use hypothesis: the decorators!
  3. Finding boundary cases using hypothesis, and fixing functions (e.g. log10 transform function: add assertion statement in function to alert function user to potential errors ==> defensive programming!).
  4. Write tests using hypothesis for checking the functions from Part 2.
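
A minimal sketch of a property-based test with hypothesis, reusing the hypothetical add() function from the Part 2 warm-up:

# hypothesis generates many inputs automatically, including boundary cases.
from hypothesis import given
from hypothesis import strategies as st

def add(a, b):
    return a + b

@given(st.integers(), st.integers())
def test_add_commutative(a, b):
    # the "property" under test: addition is commutative for all integers
    assert add(a, b) == add(b, a)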

Package - hypothesis

I have begun using Hypothesis to test my code, and I have found it to be immensely useful for reasoning about how my code ought to work.

I think we should introduce this in this tutorial.
