
Hooqu - Unit Tests for Data



Documentation: https://hooqu.readthedocs.io

Source Code: https://github.com/mfcabrera/hooqu


Hooqu is a library built on top of Pandas dataframes for defining "unit tests for data", which measure data quality in datasets.

Hooqu is a "spiritual" Python port of Apache Deequ and is currently in an experimental state. I am happy to receive feedback and contributions.

The main motivation of Hooqu is to enable data science projects to verify the quality of their input/output data using an API similar to the one found in Deequ, allowing different teams to share the same vocabulary of checks.

Install

Hooqu requires Pandas >= 1.0 and Python >= 3.7. To install via pip use:

pip install hooqu

Quick Start

import pandas as pd

# data to validate
df = pd.DataFrame(
       [
           (1, "Thingy A", "awesome thing.", "high", 0),
           (2, "Thingy B", "available at http://thingb.com", None, 0),
           (3, None, None, "low", 5),
           (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
           (5, "Thingy E", None, "high", 12),
       ],
       columns=["id", "productName", "description", "priority", "numViews"]
)

Checks we want to perform:

  • there are 5 rows in total
  • values of the id attribute are never Null/None and unique
  • values of the productName attribute are never null/None
  • the priority attribute can only contain "high" or "low" as value
  • numViews should not contain negative values
  • at least half of the values in description should contain a url
  • the median of numViews should be less than or equal to 10

In code this looks as follows:

from hooqu.checks import Check, CheckLevel, CheckStatus
from hooqu.verification_suite import VerificationSuite
from hooqu.constraints import ConstraintStatus


verification_result = (
      VerificationSuite()
      .on_data(df)
      .add_check(
          Check(CheckLevel.ERROR, "Basic Check")
          .has_size(lambda sz: sz == 5)  # we expect 5 rows
          .is_complete("id")  # should never be None/Null
          .is_unique("id")  # should not contain duplicates
          .is_complete("productName")  # should never be None/Null
          .is_contained_in("priority", ("high", "low"))
          .is_non_negative("numViews")
          # at least half of the descriptions should contain a url
          .contains_url("description", lambda d: d >= 0.5)
          # half of the items should have less than 10 views
          .has_quantile("numViews", 0.5, lambda v: v <= 10)
      )
      .run()
)

After calling run, hooqu will compute some metrics on the data. Afterwards it invokes your assertion functions (e.g., lambda sz: sz == 5 for the size check) on these metrics to see if the constraints hold on the data.

We can inspect the VerificationResult to see if the test found errors:

if verification_result.status == CheckStatus.SUCCESS:
      print("Alles klar: The data passed the test, everything is fine!")
else:
      print("We found errors in the data")

for check_result in verification_result.check_results.values():
      for cr in check_result.constraint_results:
          if cr.status != ConstraintStatus.SUCCESS:
              print(f"{cr.constraint}: {cr.message}")

If we run the example, we get the following output:

We found errors in the data
CompletenessConstraint(Completeness(productName)): Value 0.8 does not meet the constraint requirement.
PatternMatchConstraint(containsURL(description)): Value 0.4 does not meet the constraint requirement.

The test found that our assumptions are violated! Only 4 out of 5 (80%) of the values of the productName attribute are non-null and only 2 out of 5 (40%) values of the description attribute contained a url. Fortunately, we ran a test and found the errors, somebody should immediately fix the data :)
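The two failing metrics can be reproduced with plain pandas as a sanity check. This is only an illustration of what the numbers 0.8 and 0.4 mean, not how hooqu computes them internally (the URL detection here is a naive substring check):

```python
import pandas as pd

df = pd.DataFrame(
    [
        (1, "Thingy A", "awesome thing.", "high", 0),
        (2, "Thingy B", "available at http://thingb.com", None, 0),
        (3, None, None, "low", 5),
        (4, "Thingy D", "checkout https://thingd.ca", "low", 10),
        (5, "Thingy E", None, "high", 12),
    ],
    columns=["id", "productName", "description", "priority", "numViews"],
)

# Completeness: fraction of non-null values in the column
completeness = df["productName"].notna().mean()
print(completeness)  # 0.8 -- only 4 of 5 product names are present

# Fraction of descriptions containing a URL (naive "http" substring check)
url_fraction = df["description"].str.contains("http", na=False).mean()
print(url_fraction)  # 0.4 -- only 2 of 5 descriptions contain a URL
```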

Contributing

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome. Please use GitHub issues: for bug reports, feature requests, install issues, RFCs, thoughts, etc.

See the full contributing guide for more information.

Why Hooqu?

  • Easy-to-use declarative API to add data verification steps to your data processing pipeline.
  • The VerificationResult lets you know not only which checks failed but also the values of the computed metrics, allowing for flexible handling of data issues.
  • Incremental metric computation will allow comparing quality metrics across time (planned).
  • Support for storing and loading computed metrics (planned).

References

This project is a "spiritual" port of Apache Deequ and thus tries to implement the declarative API described in the paper "Automating large-scale data quality verification" while remaining as Pythonic as possible. This project does not use (py)Spark but rather Pandas (and hopefully in the future it will support other compatible dataframe implementations).

Name

Jukumari (pronounced hooqumari) is the Aymara name for the spectacled bear (Tremarctos ornatus), also known as the Andean bear, Andean short-faced bear, or mountain bear.


hooqu's Issues

Option to get the errors as a DataFrame

First I want to say thanks! This project is really amazing, please don't stop updating it. It would be nice if the result of the run() method had an option for returning the test results as a pandas DataFrame. Something like:

verification_suite.on_data(df_toy).add_checks(list_checks).run(as_dataframe=True)

I have seen that AWS Deequ has this option; it would be nice to implement it in this repo, because then we could more easily save the results as a CSV file or into a database table.
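Until such an option exists, a small helper can flatten the results using only the attributes the README already demonstrates (check_results, constraint_results, constraint, status, message). The function name below is illustrative, not part of hooqu's API:

```python
import pandas as pd


def results_to_dataframe(verification_result):
    """Flatten a verification result into one row per constraint."""
    rows = []
    for check, check_result in verification_result.check_results.items():
        for cr in check_result.constraint_results:
            rows.append(
                {
                    "check": str(check),
                    "constraint": str(cr.constraint),
                    "status": str(cr.status),
                    "message": cr.message,
                }
            )
    return pd.DataFrame(rows)
```

The resulting DataFrame could then be written out with to_csv or to_sql.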

Questions and suggestions

Hey Miguel,

Great work, I think this could be very useful for many people.

I have a question:

"Unit test" for me implies that this is part of a CI suite: as a dev, I make a change to ETL code, and before it gets merged, my changes are tested on data using hooqu. But would it not make sense to use this for runtime checks as well? Maybe that's the intent; if so, it didn't become clear to me.

I could imagine this being used like this:

verification_suite = VerificationSuite().add_check(
    Check(CheckLevel.ERROR, "Basic Check")
    .has_size(lambda sz: sz == 5)  # we expect 5 rows
    .is_complete("id")  # should never be None/Null
    .is_complete("productName")  # should never be None/Null
    .has_mean("numViews", lambda mean: mean <= 10)
)

@verification_suite.check_input(lambda df, *args, **kwargs: df)
def my_fun(df, foo, bar=123):
    df = ...
    return df

The idea would be that at runtime, when my_fun is called, the verification suite is run on the input df (the same idea could apply to the output df). Through the CheckLevel, you could control whether this should raise an error or just produce an error log, for example. I know this would need a bit of redesign of the API, since at the moment VerificationSuite needs a reference to the data to be tested via on_data (but I think this isn't necessary and could prove problematic down the line).

This way, it's less of a "unit test" and more of a runtime test for data. It would not only catch errors that stem from changes in the code, but also from changes in the data. Again, maybe that's the intent, then you could make it more explicit in the README.
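The decorator idea above could be prototyped on top of the current API without changing hooqu itself. Note that check_input below is a hypothetical helper, not part of the library, and the run_suite parameter stands in for whatever builds and runs a suite against a dataframe:

```python
import functools


def check_input(run_suite, get_df=lambda df, *args, **kwargs: df,
                raise_on_error=True):
    """Run a data-quality suite on a function's input dataframe at call time.

    run_suite: callable taking a dataframe, returning True if all checks pass
    get_df:    picks the dataframe out of the wrapped function's arguments
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            df = get_df(*args, **kwargs)
            if not run_suite(df):
                msg = f"input data for {func.__name__} failed verification"
                if raise_on_error:
                    raise ValueError(msg)
                print(msg)  # or log a warning instead of failing hard
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

With hooqu, run_suite could be something like lambda df: VerificationSuite().on_data(df).add_check(check).run().status == CheckStatus.SUCCESS, so the suite is bound to the function rather than to a fixed dataframe.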

Some minor comments:

  • I saw widespread use of lambdas in your code; be careful about storing references to them, unless you think it will never make sense to pickle these objects.
  • typo dupliucatees in README
