
data-testing-tutorial's Introduction

Best Testing Practices for Data Science

A short tutorial for data scientists on how to write tests for your code and your data. Before the tutorial, please read through this README file, for it contains a lot of useful information that will help you best prepare for the tutorial.

How to use this repository

The tutorial notes are typed up in Jupyter notebooks, and static HTML versions are available under the docs folder. For the non-bonus material, I suggest working through the notes in order. With the exception of the Projects, the bonus material can be tackled in any order. During the tutorial, be sure to have the HTML versions open.

Pre-Requisite Knowledge

I am assuming you are the following type of coder:

  • You are a data analytics type, who knows how to read/write CSV files with Pandas, and do basic data manipulation (slicing, indexing rows + columns, using the .apply() function).
  • You are not necessarily a seasoned software developer who has experience running tests.
  • You are comfortable with operating in the Terminal environment.
  • You have some rudimentary knowledge of numpy, particularly the array.min(), array.max(), array.mean(), array.std(), and numpy.allclose(a1, a2) function calls.

In order to prepare for the tutorial, there are some pieces of Python syntax that will come in handy to know (a short refresher sketch follows this list):

  • the context manager syntax (with ....),
  • assertions (assert condition1 == condition2),
  • file I/O (with open(....) as f:...),
  • list/dict/tuple comprehensions ([a for a in container if condition(a)]),
  • checking types & attributes (isinstance(obj, type) or hasattr(obj, attr)).
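
If any of these feel rusty, here is a minimal refresher sketch; the file name data.csv is hypothetical:

# A quick tour of the syntax listed above; `data.csv` is a hypothetical file.
import numpy as np

# assertions: pass silently when True, raise AssertionError when False
arr = np.array([0.0, 0.5, 1.0])
assert np.allclose(arr.mean(), 0.5)

# context manager + file I/O, with a list comprehension
with open("data.csv") as f:
    rows = [line.strip() for line in f if line.strip()]

# checking types & attributes
assert isinstance(arr, np.ndarray)
assert hasattr(arr, "mean")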

Feedback

If you've taken a version of this tutorial, please leave feedback here. I use those suggestions to adjust the tutorial content and make it better. The changes are always released publicly on GitHub, so everybody benefits!

Environment Setup

conda setup

This installation route should work cross-platform. I recommend using the Anaconda distribution of Python because it is a good way to bootstrap your data science environment.

To get set up, create a conda environment based on the provided environment.yml spec file, then run the following command in your bash terminal.

$ bash conda-setup.sh

pip setup

The alternative way is to use a virtualenv environment:

$ bash venv-setup.sh
$ source datatest/bin/activate

Alternatively, you can pip install each of the dependencies listed in the environment.yml file. (The requirements.txt file may be less actively maintained than the environment.yml file, given my conda bias.)

Manual Setup

If you prefer having more control over your installation process, conda or pip install the dependencies listed in the environment.yml file.
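
For example, with pip (keeping in mind the caveat above that requirements.txt may lag behind environment.yml):

$ pip install -r requirements.txt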

Checks

To check whether the environment is correctly set up, run the checkenv.py script:

$ python checkenv.py

It should print to your terminal: All packages found; environment checks passed. Otherwise, conda or pip install the necessary packages it reports (they will show up one by one).

Authors

Contributors

Special thanks go to the individuals who have contributed in ways big and small to the improvement of the material.

  • Renee Chu
  • Matt Bachmann: @Bachmann1234
  • Hugo Bowne-Anderson: @hugobowne
  • Boston Python tutorial attendees:
    • @races1986
    • Thao Nguyen: @ThaoNguyen15
    • @ChrisMuir

Data Credits


data-testing-tutorial's Issues

Workshop ruminations

Hey @ericmjl, this workshop is really exciting. I have now gone through all the NBs and attempted to run everything locally.

My feedback is below. I'm on a plane and git pulled before take-off, so some of my feedback may already be out of date. I have included some check boxes, if it helps. Between now and the workshop, I am happy to help implement any of these or other changes you see fit. Happy to chat here, but it would be good to sync up and chat as well. See you in Portland!


general:

  • what are your requirements for an attendee? Intermediate Python programming skills, for example: knowing how to write functions, pandas slicing, and raising errors in functions (could be good to include in README.md to manage expectations). I could imagine that code such as the following may require explanation for many attendees:

hashes['concat'] = df.apply(lambda x: ''.join(str(x[col]) for col in df.columns), axis=1)

  • when I run checkenv.py, I get

ModuleNotFoundError: No module named 'colorama'
include it in the environment? It's funny that I have all the packages required for the workshop installed, but not all the packages required to check the environment! :D

2-pytest-intro

  • the 1st exercise says to create test_datafuncs.py, but it's already in the repo: perhaps this is the instructor repo? In general, it seems as though many of the exercises have already been done in this repo, e.g. the min-max scaler (probably intentional)
  • it isn't obvious why running py.test should do anything; perhaps explain in a few words? The terminal is a mystical place for many.
  • for the first test_min_max_scaler(), do you need to import numpy in test_datafuncs.py?
  • the following line of code in the 2nd test_min_max_scaler() makes pytest throw a SyntaxError; can you reproduce this?

assert np.allclose(tfm, np.array([0, 0.5, 1])

  • I really love the textual data example; I'm wondering if you expect attendees to know what lines of code such as the following do, or if you'll explain it? return ''.join(s for s in text if s not in exclude)
  • similarly with the tests; also, why do you run the test functions after defining them in the NB? Just wondering :)

3-file-integrity

  • cell 2: attendees may wonder what sha256(), update() and hexdigest() are
  • in cell 12 (defining hash_file), this is the 1st time you're calling the update() method several times; this is worth a few words;
  • love the tinydb :)

4-data-checks

  • typo: under Schema Checks, 'expected' is missing the 'd'
  • yaml: you could remind them that they used one to set up their system with the conda env! :D
  • you write 'Let's now switch roles, and pretend that we're on side of the "analyst" and are no longer the "data provider".' but this is the first time the idea of a "data provider" has come up in the NB
  • in the exercise to write function test_data_columns, do you want them to run py.test to see the fruits of their labour?
  • when you write ' Take the schema spec file and write a test for it.', which schema spec file are you talking about?
  • I love missingno!
  • interesting to use pandas_summary; idiomatic pandas would generally lead me down the path of using dfs.describe() and dfs.info()
  • after they write the function test_data_completeness() and add it to test_datafuncs.py, running py.test doesn't find a DataFrame df; similarly with test_data_range(); I'm probably missing something silly;
  • great ECDFs! ;)
  • you write 'We can take the EDA portion further, by doing an empirical cumulative distribution plot for each data column.' but then do something else first; explain why we need compute_dimensions()?

5-test-coverage

  • coverage seems really cool! Running py.test --cov, though, throws

py.test: error: unrecognized arguments: --cov

  • do you plan to expand on this NB? Can I help at all? One way would be to play around with the functions and test functions and see the differing results;

Code error in min_max_scaler in notebook #2

Within file 2-pytest-introduction.ipynb, you have test_min_max_scaler():

def test_min_max_scaler():
    arr = np.array([1, 2, 3])  # set up the test with necessary variables.
    tfm = dfn.min_max_scaler(arr)  # collect the result into a variable
    assert tfm == np.array([0, 0.5, 1])  # assertion statements
    assert tfm.min() == 0  
    assert tfm.max() == 0 

I believe this line

assert tfm.max() == 0

should be

assert tfm.max() == 1

It's correct in the code file test_datafuncs_soln.py, just not in the notebook file.

Also, I really enjoyed the training session today; I learned a ton. Thanks for taking the time, Eric.

Win64 Dependencies Issue

I am finding that missingno and rise are both incompatible with Windows 64 when I use the environment.yml to create my conda environment. Any recommendations?


Proposal is in!

Hey @reneighbor!

I took the liberty of putting in the proposal. It's open for modifications if you see places where it can be improved upon. Please let me know!


Basic Information

  • Title: Best Testing Practices for Data Science
  • Description:
    • So you're a data scientist wrangling with data that's continually avalanching in, and there are always errors cropping up! NaNs, strings where there are supposed to be integers, and more. Moreover, your team is writing code that is getting reused, but that code is failing in mysterious places. How do you solve this? Testing is the answer! In this tutorial, you will gain practical hands-on experience writing tests in a data science setting so that you can continually ensure the integrity of your code and data. You will learn how to use py.test, coverage.py, and hypothesis to write better tests for your code.
  • Audience:
    • This tutorial is for current and aspiring data scientists. Some experience with Python is a must, including knowledge of assertion statements; however, beginners who have not seen assert are welcome, as we will go through it. Expected knowledge includes: writing functions, for-loops, and flow control (if/else). By the end of the tutorial, we expect students to know how to write good tests for their code and for their data.

Outline

Part 1: Motivation for Testing (10 min, lecture-style)

  1. We make assumptions with our data and our code. Automated testing is crucial for continually testing those assumptions.
  2. Things that need to be checked - data science:
    1. Data types
    2. Data completeness
    3. Data structure
    4. Data integrity
  3. Things that need to be checked - both data science & software development:
    1. Function correctness
    2. Boundary cases
    3. Counter-examples
    4. Code coverage

Part 2: Introduction to Testing & Writing Tests for Functions (30 min)

Summary of section: Introduction to py.test as a simple way to get off the ground with automated testing. (A minimal example follows the outline below.)

  1. Using py.test at the command line in an empty project directory.
  2. The assertion statement is your friend!
  3. Warm-up example: write an add() function, write a test for correctness, write a counter-example.
  4. More complex example: write a function that sums up two floating point numbers and log10-transforms them. Write a test for correctness, write a counter-example.
  5. More complex example: write a function that strips punctuation, spaces, and special characters from text, and transforms it into a "bag of words". Write a test for correctness and a counter-example test.
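
As a taste of the warm-up, here is a minimal sketch; the file name test_add.py is illustrative rather than the tutorial's exact code:

# test_add.py -- run `py.test` in this directory to execute these tests.
def add(a, b):
    return a + b

def test_add():
    assert add(1, 2) == 3   # correctness
    assert add(-1, 1) == 0  # a simple boundary case

def test_add_strings():
    # a counter-example of sorts: "adding" strings concatenates them
    assert add("a", "b") == "ab"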

Part 3: Checking File Integrity (40 min)

Summary of section: Participants will write simple tests that automatically check the integrity of their files. They will also learn how to create simple change logs of data. We will provide a directory of data for this. Applied to detecting file tampering. (A minimal hashing sketch follows the list below.)

  1. Get hashes of a file (md5, sha256), and write a script to automatically record data hashes.
  2. Write tests that check that a file has not been changed since the last recorded version (i.e. check the file hash).
  3. Write tests that check other basic properties of a data file: number of rows, number of columns, number of words in a text file.
  4. What to do when file hashes change?
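
A minimal sketch of file hashing with the standard library's hashlib; the file path and recorded digest are placeholders:

# Hash a file in chunks; update() can be called repeatedly on the same hasher.
import hashlib

def hash_file(path, blocksize=65536):
    # returns the sha256 hex digest of the file at `path`
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()

def test_file_unchanged():
    recorded = "..."  # placeholder: a previously recorded sha256 digest
    assert hash_file("data/my_data.csv") == recorded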

Part 4: Checking Data Assumptions (40 min)

Summary of section: Participants will write tests that check the assumptions they may have about the data they received. Applied to pre-analysis checks, especially with streams of data coming in. (A minimal sketch follows the list below.)

  1. Write tests for column data types: int, float, str, object.
  2. Write tests that check that there are no unexpected empty data cells: finding NaN with pandas.
  3. Write tests that check for value correctness (e.g. cannot have negative numbers where log10 transform expected).
  4. Write tests that check textual data for correct lengths (e.g. biological sequence data should fall within some expected range of lengths).
  5. Write tests that check for potential outlier data points (e.g. use of Q-test).
  6. Write tests that check for normality of data (e.g. use of scipy's K-S test).
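
A minimal sketch of such checks with pandas; the CSV path and column names are made up for illustration:

# Hypothetical data-assumption checks over a pandas DataFrame.
import pandas as pd

df = pd.read_csv("data/my_data.csv")  # hypothetical data file

def test_column_dtypes():
    # check column data types
    assert pd.api.types.is_integer_dtype(df["age"])
    assert pd.api.types.is_float_dtype(df["weight"])

def test_no_missing_values():
    # no unexpected empty cells (NaNs) anywhere in the frame
    assert df.isnull().sum().sum() == 0

def test_value_ranges():
    # values must be positive if a log10 transform is expected downstream
    assert (df["weight"] > 0).all()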

Part 5: Code coverage (20 min)

Summary of section: use of coverage.py to check code coverage. (An example invocation follows the list below.)

  1. Using coverage.py with py.test.
  2. Exercise: write a test for non-covered functions to bring up coverage score.
  3. Emphasize: 100% code coverage doesn't mean 100% good coverage, but it's a good starting point!
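
For reference, the invocation might look like this; the --cov flag comes from the pytest-cov plugin, which must be installed first (its absence explains the "unrecognized arguments: --cov" error reported in the ruminations above):

$ pip install pytest-cov
$ py.test --cov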

Part 6: Property-based testing (30 min)

Summary of section: introduction to hypothesis. (A minimal example follows the list below.)

  1. Property-based testing: we don't write fixed tests, but instead test the "properties" of a function (input data range, data type) for correctness.
  2. How to use hypothesis: the decorators!
  3. Finding boundary cases using hypothesis, and fixing functions (e.g. log10 transform function: add assertion statement in function to alert function user to potential errors ==> defensive programming!).
  4. Write tests using hypothesis for checking the functions from Part 2.
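
A minimal sketch of a property-based test with hypothesis, reusing the hypothetical add() function from the Part 2 warm-up:

# hypothesis generates many inputs automatically, including boundary cases.
from hypothesis import given
from hypothesis import strategies as st

def add(a, b):
    return a + b

@given(st.integers(), st.integers())
def test_add_commutative(a, b):
    # the "property" under test: addition is commutative for all integers
    assert add(a, b) == add(b, a)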

Package - hypothesis

I have begun using Hypothesis to test my code, and I have found it to be immensely useful for reasoning about how my code ought to work.

I think we should introduce this in this tutorial.
