
Tiny Timmy

A dead-simple, easy-to-use Data Quality (DQ) tool for dataframes and files, built with Python.

Tiny Timmy uses the Python bindings for Polars, a Rust-based DataFrame library.

Support includes ...

  • polars
  • pandas
  • pyspark
  • csv files
  • parquet files

Both dataframes and files are supported. Simply "point and shoot."

Installation

Install Tiny Timmy with pip

pip install tinytimmy

Usage

Create an instance of Tiny Timmy.

  • specify source_type
    • polars
    • pandas
    • pyspark
    • csv
    • parquet
  • specify either file_path or dataframe

from tinytimmy.tinytim import TinyTim
tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")

Then call either the default checks or a custom check.

results = tm.default_checks()
results = tm.run_custom_check(["{SQL filter}", "{SQL filter}"])

You can pass Tiny Timmy a dataframe while specifying its type (pandas, polars, or pyspark) and call default_checks(), or you can simply pass a file URI to a CSV or Parquet file.

You can also pass custom DQ checks as a list of SQL filter expressions, written as they would appear in a WHERE clause.

Tiny Timmy returns check results as a Polars dataframe by default; you can also request the results as a pandas or pyspark dataframe.

results = tm.default_checks(return_as='pandas')

For example:

┌───────────────────────────────────┬─────────────┐
│ check_type                        ┆ check_value │
│ ---                               ┆ ---         │
│ str                               ┆ i64         │
╞═══════════════════════════════════╪═════════════╡
│ null_check_start_station_name     ┆ 978         │
│ null_check_start_station_id       ┆ 978         │
│ …                                 ┆ …           │
│ started_at_whitespace_count       ┆ 1000        │
│ ended_at_whitespace_count         ┆ 1000        │
│ start_station_name_whitespace_co… ┆ 22          │
│ end_station_name_whitespace_coun… ┆ 22          │
└───────────────────────────────────┴─────────────┘

Current functionality ...

  • default_checks()
    • check all columns for null values
    • check if dataset is distinct or contains duplicates
    • check if columns have whitespace
    • check for leading or trailing whitespace
  • run_custom_check(["{some SQL WHERE clause}"])

Example Usage

CSV support.

from tinytimmy.tinytim import TinyTim
tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")
results = tm.default_checks()
>> Column start_station_name has 978 null values
>> Column start_station_id has 978 null values
>> Column end_station_name has 978 null values
>> Column end_station_id has 978 null values
>> Your dataset has 45 duplicates

Pandas support.

import pandas as pd
from tinytimmy.tinytim import TinyTim
df = pd.read_csv("202306-divvy-tripdata.csv")
tm = TinyTim(source_type="pandas", dataframe=df)
results = tm.default_checks()
>> Column start_station_name has 978 null values
>> Column start_station_id has 978 null values
>> Column end_station_name has 978 null values
>> Column end_station_id has 978 null values
>> Your dataset has no duplicates
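
Polars dataframes work the same way; a minimal sketch, assuming the same constructor arguments shown above:

from tinytimmy.tinytim import TinyTim
import polars as pl
df = pl.read_csv("202306-divvy-tripdata.csv")
tm = TinyTim(source_type="polars", dataframe=df)
results = tm.default_checks()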

Custom Data Quality checks are supported as a list of SQL filter expressions, given as they would appear in a WHERE clause. You can pass one or more checks in the list.

from tinytimmy.tinytim import TinyTim
tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")
tm.default_checks()
results = tm.run_custom_check(["start_station_name IS NULL", "end_station_name IS NULL"])
>> Column start_station_name has 978 null values
>> Column start_station_id has 978 null values
>> Column end_station_name has 978 null values
>> Column end_station_id has 978 null values
>> Your dataset has no duplicates
>> Column started_at has 1000 whitespace values
>> Column ended_at has 1000 whitespace values
>> Column start_station_name has 22 whitespace values
>> Column end_station_name has 22 whitespace values
>> No leading or trailing whitespace values found
shape: (10, 2)
┌───────────────────────────────────┬─────────────┐
│ check_type                        ┆ check_value │
│ ---                               ┆ ---         │
│ str                               ┆ i64         │
╞═══════════════════════════════════╪═════════════╡
│ null_check_start_station_name     ┆ 978         │
│ null_check_start_station_id       ┆ 978         │
│ null_check_end_station_name       ┆ 978         │
│ null_check_end_station_id         ┆ 978         │
│ …                                 ┆ …           │
└───────────────────────────────────┴─────────────┘
>> Your custom check start_station_name IS NULL found 978 records that match your filter statement
>> Your custom check end_station_name IS NULL found 978 records that match your filter statement

Tests / Local Setup / Contributions

To develop and work on Tiny Timmy locally, a Docker image and docker-compose setup are provided.

First, build the image: docker build --tag=tinytimmy .

To run the local unit tests, run docker-compose up test

To work inside the Docker container, run docker run -it tinytimmy /bin/bash

Contributors

danielbeach, driscollis, jcorrado76, krishnaduttpanchagnula, lordirah, nicholastanderson


Open Issues

Support for dbt-style validation configuration files

I like how dbt lets you define validations via a YAML file. Great Expectations does this too. I think it would be nice to let users provide some sort of YAML configuration of validations to be run. This would also allow users to separate non-critical validations from critical guard-rail validations.
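
A hypothetical sketch of what such a file might look like and how it could be loaded; the keys, check names, and severity levels here are all invented for illustration:

import yaml  # requires pyyaml

# Hypothetical validations file; none of these keys exist in Tiny Timmy today.
config_text = """
validations:
  - check: null_check
    column: start_station_name
    severity: critical
  - check: custom
    filter: "ended_at < started_at"
    severity: warn
"""

config = yaml.safe_load(config_text)
critical = [v for v in config["validations"] if v["severity"] == "critical"]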

Support for concurrent validations

Supposing there is some way to pass a list of validations to Tiny Timmy, validations should be able to run concurrently, like they do in dbt when you run tests with multiple threads.
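
A minimal sketch of the idea with Python threads; this assumes run_custom_check is safe to call concurrently, which is not guaranteed today:

from concurrent.futures import ThreadPoolExecutor
from tinytimmy.tinytim import TinyTim

tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")
filters = ["start_station_name IS NULL", "end_station_name IS NULL"]

# Run each custom check in its own thread, dbt-style.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda f: tm.run_custom_check([f]), filters))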

Add Support for Source freshness test

Add the support to test for source freshness. This is especially useful when the data pipeline reads data from a source and expects new data in it.

The user should be able to call TinyTimmy.source_freshness() and pass in the datetime column and how recent they expect the data to be.

Support could also be added to let the user pass in a filter to limit the amount of data scanned (similar to how dbt does it), so the test runs more efficiently.
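
A hypothetical sketch of the proposed call; source_freshness() and all of its parameters are invented here and do not exist in Tiny Timmy today:

# Hypothetical API: fail if the newest started_at value is older than one day.
tm.source_freshness(
    datetime_column="started_at",
    expected_recency="1d",
    filter="started_at >= '2023-06-01'",  # optional: limit the data scanned
)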

Keeping dependencies optional

Installing tinytimmy should not pull in pyspark, pandas, and other heavy dependencies. We should keep those optional and organize the code to handle it.
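
One common pattern for this, sketched below; the tinytimmy[pyspark] extra is illustrative and does not exist yet:

# Import heavy backends lazily and fail with a helpful message only when
# the backend is actually requested.
try:
    import pyspark  # noqa: F401
except ImportError:
    pyspark = None

def require_pyspark() -> None:
    if pyspark is None:
        # The extras name below is hypothetical.
        raise ImportError("pyspark is not installed; try `pip install 'tinytimmy[pyspark]'`")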

Change github actions lint job to use pre-commit

Instead of maintaining a separate list of linting steps in the CI/CD script (python-publish.yaml), we should list all the linting in a single location (.pre-commit-config.yaml) and then run pre-commit run in the CI/CD script. This way, what people run locally via pre-commit is the same as what is required on the CI server.

This reduces duplication between the CI configuration and the pre-commit configuration.
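
A minimal sketch of the idea: the CI lint job would simply invoke pre-commit run --all-files, making the hook list in .pre-commit-config.yaml the single source of truth for both local and CI linting.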

Auto pypi package push

@jcorrado76 Right now, to push a new version out to PyPI, I'm doing that locally with poetry, using poetry build and poetry publish. Can we automate this somehow with GitHub Actions?

I'm not sure we want a new PyPI package published every time we push to main. I'm assuming I can load some env vars for GitHub Actions so it has permission to do the push; just curious on thoughts about whether it should happen every time we merge to main, or something else.

Fix poetry.lock file

The most recent build failed because the pyproject.toml file isn't consistent with the poetry.lock file, so installing dependencies failed.

This issue is to just run poetry lock with the current pyproject.toml.

Add more default checks

Currently, only a few default checks exist, such as distinct, whitespace, and leading or trailing whitespace. Ideas for more default checks are needed.

Add the ability to DQ on s3 buckets

We should add the ability to do simple DQ checks on s3 buckets. Say something like ...

results = tm.check_s3_uri(bucket='somebucket', prefix='some_prefix', check_type='files_exist')
True

It could probably be implemented with boto3, or something along those lines, and cover checks like the following (see the sketch after this list):

  • check files exist at some location
  • count the number of files at some location
  • check for empty files
  • check for missing files over a date range
  • check file names meet some regex
  • etc.
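
A minimal sketch of the first check with boto3; the function name and return contract are illustrative, not part of Tiny Timmy today:

import boto3

def check_s3_files_exist(bucket: str, prefix: str) -> bool:
    # List at most one object under the prefix; any hit means files exist.
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0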

Switch build system to poetry

This is more of a question/suggestion. What's the appetite for using poetry instead of pip for building and installing dependencies?

Add CLI entrypoint

It would be nice to have some basic functionality to call Tiny Timmy from the CLI as well. I like Typer.
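
A hypothetical Typer entrypoint to show the shape of it; Tiny Timmy has no CLI today, and the command layout here is invented:

import typer
from tinytimmy.tinytim import TinyTim

app = typer.Typer()

@app.command()
def default_checks(source_type: str, file_path: str) -> None:
    # e.g. `tinytimmy default-checks csv 202306-divvy-tripdata.csv`
    tm = TinyTim(source_type=source_type, file_path=file_path)
    print(tm.default_checks())

if __name__ == "__main__":
    app()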
