
Tiny Timmy

A dead-simple, easy-to-use Data Quality (DQ) tool for dataframes and files, built with Python.

Tiny Timmy uses the Python bindings for Polars, a Rust-based DataFrame library.

Support includes ...

  • polars
  • pandas
  • pyspark
  • csv files
  • parquet files

Both dataframes and files are supported. Simply "point and shoot."

Installation

Install Tiny Timmy with pip

pip install tinytimmy

Usage

Create an instance of Tiny Timmy.

  • specify source_type
    • polars
    • pandas
    • pyspark
    • csv
    • parquet
  • specify either file_path or dataframe

from tinytimmy.tinytim import TinyTim
tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")

Then call either the default checks or a custom check.

results = tm.default_checks()
results = tm.run_custom_check(["{SQL filter}", "{SQL filter}"])

You can pass Tiny Timmy a dataframe while specifying its type (pandas, polars, or pyspark) and call default_checks(), or you can simply pass a file URI to a CSV or Parquet file.

You can also pass custom DQ checks as a list of SQL filter expressions, written as they would appear in a WHERE clause.

Tiny Timmy returns check results as a Polars dataframe by default; you can also request the results as a pandas or pyspark dataframe.

results = tm.default_checks(return_as='pandas')

For example:

┌───────────────────────────────────┬─────────────┐
│ check_type                        ┆ check_value │
│ ---                               ┆ ---         │
│ str                               ┆ i64         │
╞═══════════════════════════════════╪═════════════╡
│ null_check_start_station_name     ┆ 978         │
│ null_check_start_station_id       ┆ 978         │
│ …                                 ┆ …           │
│ started_at_whitespace_count       ┆ 1000        │
│ ended_at_whitespace_count         ┆ 1000        │
│ start_station_name_whitespace_co… ┆ 22          │
│ end_station_name_whitespace_coun… ┆ 22          │
└───────────────────────────────────┴─────────────┘

Current functionality ...

  • default_checks()
    • check all columns for null values
    • check if dataset is distinct or contains duplicates
    • check if columns have whitespace
    • check for leading or trailing whitespace
  • run_custom_check(["{some SQL WHERE clause}"])

Example Usage

CSV support.

from tinytimmy.tinytim import TinyTim
tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")
results = tm.default_checks()
>> Column start_station_name has 978 null values
>> Column start_station_id has 978 null values
>> Column end_station_name has 978 null values
>> Column end_station_id has 978 null values
>> Your dataset has 45 duplicates

Pandas support.

import pandas as pd
from tinytimmy.tinytim import TinyTim
df = pd.read_csv("202306-divvy-tripdata.csv")
tm = TinyTim(source_type="pandas", dataframe=df)
results = tm.default_checks()
>> Column start_station_name has 978 null values
>> Column start_station_id has 978 null values
>> Column end_station_name has 978 null values
>> Column end_station_id has 978 null values
>> Your dataset has no duplicates
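
Polars dataframes work the same way; a minimal sketch, assuming the same constructor arguments shown above:

from tinytimmy.tinytim import TinyTim
import polars as pl
df = pl.read_csv("202306-divvy-tripdata.csv")
tm = TinyTim(source_type="polars", dataframe=df)
results = tm.default_checks()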

Custom Data Quality checks are supported as a list of SQL filter expressions, given as they would appear in a WHERE clause. You can pass one or more checks in the list.

from tinytimmy.tinytim import TinyTim
tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")
tm.default_checks()
results = tm.run_custom_check(["start_station_name IS NULL", "end_station_name IS NULL"])
>> Column start_station_name has 978 null values
>> Column start_station_id has 978 null values
>> Column end_station_name has 978 null values
>> Column end_station_id has 978 null values
>> Your dataset has no duplicates
>> Column started_at has 1000 whitespace values
>> Column ended_at has 1000 whitespace values
>> Column start_station_name has 22 whitespace values
>> Column end_station_name has 22 whitespace values
>> No leading or trailing whitespace values found
shape: (10, 2)
┌───────────────────────────────────┬─────────────┐
│ check_type                        ┆ check_value │
│ ---                               ┆ ---         │
│ str                               ┆ i64         │
╞═══════════════════════════════════╪═════════════╡
│ null_check_start_station_name     ┆ 978         │
│ null_check_start_station_id       ┆ 978         │
│ null_check_end_station_name       ┆ 978         │
│ null_check_end_station_id         ┆ 978         │
│ …                                 ┆ …           │
└───────────────────────────────────┴─────────────┘
>> Your custom check start_station_name IS NULL found 978 records that match your filter statement
>> Your custom check end_station_name IS NULL found 978 records that match your filter statement

Tests / Local Setup / Contributions

To develop and work on Tiny Timmy locally, a Docker image and docker-compose setup are provided.

First, build the image: docker build --tag=tinytimmy .

To run the local unit tests, run docker-compose up test

To work inside the Docker container, run docker run -it tinytimmy /bin/bash

Contributors

danielbeach, driscollis, jcorrado76, krishnaduttpanchagnula, lordirah, nicholastanderson


Open Issues

Support for dbt-style validation configuration files

I like how dbt lets you define validations via a YAML file. Great Expectations does this too. I think it would be nice to let users provide some sort of YAML configuration of validations to be run. This would also allow users to separate non-critical validations from critical guard-rail validations.
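
A hypothetical sketch of what such a file might look like and how it could be loaded; the keys, check names, and severity levels here are all invented for illustration:

import yaml  # requires pyyaml

# Hypothetical validations file; none of these keys exist in Tiny Timmy today.
config_text = """
validations:
  - check: null_check
    column: start_station_name
    severity: critical
  - check: custom
    filter: "ended_at < started_at"
    severity: warn
"""

config = yaml.safe_load(config_text)
critical = [v for v in config["validations"] if v["severity"] == "critical"]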

Support for concurrent validations

Supposing there is some way to pass a list of validations to Tiny Timmy, validations should be able to run concurrently, like they do in dbt when you run tests with multiple threads.
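
A minimal sketch of the idea with Python threads; this assumes run_custom_check is safe to call concurrently, which is not guaranteed today:

from concurrent.futures import ThreadPoolExecutor
from tinytimmy.tinytim import TinyTim

tm = TinyTim(source_type="csv", file_path="202306-divvy-tripdata.csv")
filters = ["start_station_name IS NULL", "end_station_name IS NULL"]

# Run each custom check in its own thread, dbt-style.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda f: tm.run_custom_check([f]), filters))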

Add Support for Source freshness test

Add the support to test for source freshness. This is especially useful when the data pipeline reads data from a source and expects new data in it.

The user should be able to call TinyTimmy.source_freshness() and pass in the datetime column and how recent they expect the data to be.

Support could also be added to let the user pass in a filter to limit the amount of data scanned (similar to how dbt does it), so the test runs more efficiently.
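
A hypothetical sketch of the proposed call; source_freshness() and all of its parameters are invented here and do not exist in Tiny Timmy today:

# Hypothetical API: fail if the newest started_at value is older than one day.
tm.source_freshness(
    datetime_column="started_at",
    expected_recency="1d",
    filter="started_at >= '2023-06-01'",  # optional: limit the data scanned
)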

Keeping dependencies optional

Installing tinytimmy should not pull in pyspark, pandas, and other heavy dependencies. We should keep those optional and organize the code to handle it.
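
One common pattern for this, sketched below; the tinytimmy[pyspark] extra is illustrative and does not exist yet:

# Import heavy backends lazily and fail with a helpful message only when
# the backend is actually requested.
try:
    import pyspark  # noqa: F401
except ImportError:
    pyspark = None

def require_pyspark() -> None:
    if pyspark is None:
        # The extras name below is hypothetical.
        raise ImportError("pyspark is not installed; try `pip install 'tinytimmy[pyspark]'`")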

Change github actions lint job to use pre-commit

Instead of maintaining a separate list of linting steps in the CI/CD script (python-publish.yaml), we should list all the linting in a single location (.pre-commit-config.yaml) and then run pre-commit run in the CI/CD script. This way, what people run locally via pre-commit is the same as what is required on the CI server.

This reduces duplication between the CI configuration and the pre-commit configuration.
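
A minimal sketch of the idea: the CI lint job would simply invoke pre-commit run --all-files, making the hook list in .pre-commit-config.yaml the single source of truth for both local and CI linting.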

Auto pypi package push

@jcorrado76 Right now, to push a new version out to PyPI, I'm doing that locally with poetry, using poetry build and poetry publish. Can we automate this somehow with GitHub Actions?

I'm not sure we want a new PyPI package published every time we push to main. I'm assuming I can load some env vars for GitHub Actions so it has permission to do the push; just curious on thoughts about whether it should happen every time we merge to main, or something else.

Fix poetry.lock file

The most recent build failed because the pyproject.toml file isn't consistent with the poetry.lock file, so installing dependencies failed.

This issue is to just run poetry lock with the current pyproject.toml.

Add more default checks

Currently, only a few default checks exist, such as distinct, whitespace, and leading or trailing whitespace. Ideas for more default checks are needed.

Add the ability to DQ on s3 buckets

We should add the ability to do simple DQ checks on s3 buckets. Say something like ...

results = tm.check_s3_uri(bucket='somebucket', prefix='some_prefix', check_type='files_exist')
True

It could probably be implemented with boto3, or something along those lines, and cover checks like the following (see the sketch after this list):

  • check files exist at some location
  • count the number of files at some location
  • check for empty files
  • check for missing files over a date range
  • check file names meet some regex
  • etc.
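
A minimal sketch of the first check with boto3; the function name and return contract are illustrative, not part of Tiny Timmy today:

import boto3

def check_s3_files_exist(bucket: str, prefix: str) -> bool:
    # List at most one object under the prefix; any hit means files exist.
    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0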

Switch build system to poetry

This is more of a question/suggestion. What's the appetite for using poetry instead of pip for building and installing dependencies?

Add CLI entrypoint

It would be nice to have some basic functionality to call Tiny Timmy from the CLI as well. I like Typer.
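
A hypothetical Typer entrypoint to show the shape of it; Tiny Timmy has no CLI today, and the command layout here is invented:

import typer
from tinytimmy.tinytim import TinyTim

app = typer.Typer()

@app.command()
def default_checks(source_type: str, file_path: str) -> None:
    # e.g. `tinytimmy default-checks csv 202306-divvy-tripdata.csv`
    tm = TinyTim(source_type=source_type, file_path=file_path)
    print(tm.default_checks())

if __name__ == "__main__":
    app()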
