canimus / cuallee
Possibly the fastest DataFrame-agnostic quality check library in town.
Home Page: https://canimus.github.io/cuallee/
License: Apache License 2.0
I've been using Spark Connect for both testing and data validation tasks. Despite following the provided documentation closely, I encountered errors with every example I attempted.
These issues occurred on Apache Spark version 3.5.1. Below, I provide detailed steps to reproduce two specific errors, along with the corresponding error messages.
Environment:
Python: 3.11.8
Apache Spark: 3.5.1
Scala: 2.12
docker run -ti --name spark -p 15002:15002 bitnami/spark:latest /opt/bitnami/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.1
spark = SparkSession.builder.appName("PySpark Test") \
.remote("sc://localhost:15002") \
.getOrCreate()
Issue Reproduction:
Executing the Dates Example as provided in the documentation:
# Values of column date between a range of dates
check = Check(CheckLevel.WARNING, "CheckIsBetweenDates")
df = spark.sql(
"""
SELECT
explode(
sequence(
to_date('2022-01-01'),
to_date('2022-01-10'),
interval 1 day)) as date
""")
assert (
check.is_between("date", "2022-01-01", "2022-01-10")
.validate(df)
.first()
.status == "PASS"
)
Error Message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[8], line 14
3 df = spark.sql(
4 """
5 SELECT
(...)
10 interval 1 day)) as date
11 """)
12 df = df.toPandas()
13 assert (
---> 14 check.is_between("date", "2022-01-01", "2022-01-10")
15 .validate(df)
16 .first()
17 .status == "PASS"
18 )
File c:\Users\Dsaad\GitHub\pyspark-tools\.venv\Lib\site-packages\cuallee\__init__.py:403, in Check.is_between(self, column, value, pct)
401 def is_between(self, column: str, value: Tuple[Any], pct: float = 1.0):
402 """Validation of a column between a range"""
--> 403 Rule("is_between", column, value, CheckDataType.AGNOSTIC, pct) >> self._rule
404 return self
File <string>:13, in __init__(self, method, column, value, data_type, coverage, options, status, violations, pass_rate, ordinal)
File c:\Users\Dsaad\GitHub\pyspark-tools\.venv\Lib\site-packages\cuallee\__init__.py:100, in Rule.__post_init__(self)
99 def __post_init__(self):
--> 100 if (self.coverage <= 0) or (self.coverage > 1):
101 raise ValueError("Coverage should be between 0 and 1")
103 if isinstance(self.column, List):
TypeError: '<=' not supported between instances of 'str' and 'int'
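Possibly relevant: the signature shown in the traceback is is_between(self, column, value: Tuple[Any], pct: float = 1.0), so passing the two dates as separate positional arguments sends the second date into pct. A sketch of a call that matches that signature (this only addresses the TypeError; whether Spark Connect is supported at all is a separate question, see the second reproduction below):
# Pass both bounds as a single tuple so pct keeps its default of 1.0.
check = Check(CheckLevel.WARNING, "CheckIsBetweenDates")
assert (
    check.is_between("date", ("2022-01-01", "2022-01-10"))
    .validate(df)
    .first()
    .status == "PASS"
)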
Issue Reproduction:
Executing an example that checks for null values and uniqueness:
from datetime import date
from pyspark.sql import Row
df = spark.createDataFrame([
Row(user_id=1111, order_id=4343, preferred_store='string1', birthdate=date(1999, 1, 1), joined_date=date(2022, 6, 1)),
Row(user_id=2222, order_id=5454, preferred_store='string2', birthdate=date(2000, 2, 1), joined_date=date(2022, 7, 2)),
Row(user_id=3333, order_id=6565, preferred_store='string3', birthdate=date(2001, 3, 1), joined_date=date(2022, 8, 3))
])
# Nulls on column Id
check = Check(CheckLevel.WARNING, "Completeness")
( check
.is_complete("user_id")
.is_unique("user_id")
.validate(df)
).show()
Error Message:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Cell In[7], line 13
7 # Nulls on column Id
8 check = Check(CheckLevel.WARNING, "Completeness")
9 ( check
10 .is_complete("user_id")
11 .is_unique("user_id")
---> 12 .validate(df)
13 ).show()
File c:\Users\Dsaad\GitHub\pyspark-tools\.venv\Lib\site-packages\cuallee\__init__.py:703, in Check.validate(self, dataframe)
700 self.compute_engine = importlib.import_module("cuallee.polars_validation")
702 else:
--> 703 raise Exception(
704 "Cuallee is not ready for this data structure. You can log a Feature Request in Github."
705 )
707 assert self.compute_engine.validate_data_types(
708 self.rules, dataframe
709 ), "Invalid data types between rules and dataframe"
711 return self.compute_engine.summary(self, dataframe)
Exception: Cuallee is not ready for this data structure. You can log a Feature Request in Github.
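A possible interim workaround, which is an assumption on my part rather than guidance from the maintainers: convert the Spark Connect dataframe to pandas first, since pandas dataframes are supported:
# Assumption: the pandas engine can run the same checks while Spark Connect
# support is pending.
pandas_df = df.toPandas()

check = Check(CheckLevel.WARNING, "Completeness")
print(
    check
    .is_complete("user_id")
    .is_unique("user_id")
    .validate(pandas_df)
)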
I would appreciate any guidance or updates on resolving these errors. Thank you for your assistance.
Would it be possible to add an example of using cuallee with a pandas DataFrame?
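A minimal sketch of what such an example could look like, assuming the Check API shown in the Spark examples works unchanged on a pandas DataFrame (pandas support is mentioned elsewhere in these issues):
import pandas as pd
from cuallee import Check, CheckLevel

df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

check = Check(CheckLevel.WARNING, "PandasExample")
# validate() returns a summary dataframe with one row per rule
print(
    check
    .is_complete("id")
    .is_unique("id")
    .validate(df)
)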
The has_sum method collapses a column in a dataframe by adding up all the elements. It has been successfully implemented in pandas, pyspark and duckdb. The implementation on snowpark is missing.
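For reference, a usage sketch on pandas; the signature has_sum(column, value), with value as the expected total, is an assumption based on the description above:
import pandas as pd
from cuallee import Check, CheckLevel

df = pd.DataFrame({"amount": [1, 2, 3]})

check = Check(CheckLevel.WARNING, "SumCheck")
# 1 + 2 + 3 == 6, so this rule should PASS
print(check.has_sum("amount", 6).validate(df))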
Hi @canimus,
I can see a link to the contributing guidelines on a sidebar titled Helpful resources when creating a new issue.
However, I need help finding a link to these guidelines within the documentation.
Can you please point me to the section in the README with the guidelines for third parties wishing to:
Thank you!
We see that in the article we are trying to cite Soda Core and Great Expectations:
Line 58 in e4c1242
but it is not rendering properly.
Maybe the @software option is not working:
Lines 141 to 152 in e4c1242
We are confident that you can find a solution. Could you please consider using the @misc option or any other alternative that you think might work?
This issue is part of a JOSS REVIEW
This is probably my last issue of this review, as I have finished reviewing both the functionality and the documentation page. Only one minor thing about the recent changes in the paper:
soda and great-expectations have a core component that is open source, plus additional features / cloud offerings that you need to pay for. Could you correct this in your paper where you write:
This issue is part of a JOSS Review
JOSS wants the software papers to contain a comparison to the state of the field, i.e.
State of the field: Do the authors describe how this software compares to other commonly-used packages?
Hence, could you add a section in your paper where you briefly compare cuallee to other data testing frameworks, like soda, great_expectations, or others?
This issue is part of a JOSS Review
For example, std needs to have a tolerance between the passed value and the real value (with rounding / truncation / ceil-floor) and a number of decimals for the rounding.
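To illustrate the request (this is not an existing cuallee API, just a sketch of the desired comparison), a tolerance-aware std check might compare rounded values instead of requiring exact equality:
import statistics

def std_matches(values, expected, decimals=2):
    # Compare the observed standard deviation with the expected value after
    # rounding both to the requested number of decimals.
    # Note: whether cuallee uses sample or population std is an assumption here.
    return round(statistics.stdev(values), decimals) == round(expected, decimals)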
Describe the bug
With duckdb v0.9.2 I get the error:
Exception: Cuallee is not ready for this data structure. You can log a Feature Request in Github.
To Reproduce
Steps to reproduce the behavior:
Desktop (please complete the following information):
This issue is part of a JOSS Review
Describe the bug
The issue is probably noticeable with other are_* validations. It seems that when passing a number of columns to be checked, the total number of violations found will be divided by the number of columns given. For example, if a check will be done over 3 columns, and 12 violations are found, only 4 will be reported. This is OK if all violations were on the same row, but will underreport when the violations are on different rows.
To Reproduce
Run the following code:
from pyspark.sql import SparkSession
from cuallee import Check, CheckLevel
spark = SparkSession.builder.getOrCreate()
check = Check(CheckLevel.WARNING, "Not NULL").are_complete(["col_a", "col_b", "col_c"])
# This is fine
df1 = spark.createDataFrame([
{"col_a": 1, "col_b": 1, "col_c": 1}, {"col_a": 2, "col_b": 2, "col_c": 2}, {"col_a": 3, "col_b": 3, "col_c": 3}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df1.show(truncate=False)
check.validate(df1).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|1    |1    |1    |
|2    |2    |2    |
|3    |3    |3    |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+---------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate|pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+---------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |0.0       |1.0      |1.0           |PASS  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+---------+--------------+------+
# This is also fine
df2 = spark.createDataFrame([
{"col_a": None, "col_b": None, "col_c": None}, {"col_a": 2, "col_b": 2, "col_c": 2}, {"col_a": 3, "col_b": 3, "col_c": 3}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df2.show(truncate=False)
check.validate(df2).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|null |null |null |
|2    |2    |2    |
|3    |3    |3    |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate         |pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |1.0       |0.6666666666666666|1.0           |FAIL  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
# This should show 3 violations, but it seems that the number of violations outputted is divided by the number of columns
df3 = spark.createDataFrame([
{"col_a": None, "col_b": 1, "col_c": 1}, {"col_a": None, "col_b": 2, "col_c": 2}, {"col_a": None, "col_b": 3, "col_c": 3}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df3.show(truncate=False)
check.validate(df3).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|null |1    |1    |
|null |2    |2    |
|null |3    |3    |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate         |pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |1.0       |0.6666666666666666|1.0           |FAIL  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
# This should also show 3 violations and a 0% pass rate
df4 = spark.createDataFrame([
{"col_a": None, "col_b": 1, "col_c": 1}, {"col_a": 2, "col_b": None, "col_c": 2}, {"col_a": 3, "col_b": 3, "col_c": None}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df4.show(truncate=False)
check.validate(df4).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|null |1    |1    |
|2    |null |2    |
|3    |3    |null |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate         |pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |1.0       |0.6666666666666666|1.0           |FAIL  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
Expected behavior
Ideally, the number of violations should reflect the number of offending rows (unless I'm misunderstanding something).
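For reference, a sketch of counting offending rows directly in PySpark (my reading of the expected figure, not cuallee's implementation):
from functools import reduce
import pyspark.sql.functions as F

cols = ["col_a", "col_b", "col_c"]
# A row counts as one violation if any of the checked columns is null.
offending_rows = df4.filter(
    reduce(lambda a, b: a | b, [F.col(c).isNull() for c in cols])
).count()
print(offending_rows)  # 3 for df4 above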
Desktop (please complete the following information):
cuallee==0.4.7
Great that you have updated your documentation page. Could you now refer to it prominently, for example in the README and/or in the project description on GitHub (upper-right corner of the GitHub repository page)?
This issue is part of a JOSS Review
Define the has_sum check for snowpark. It collapses a dataframe column by adding up all its elements.
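Not the cuallee internal API, just a sketch of the underlying snowpark aggregation such an implementation would rely on (the function name here is hypothetical):
import snowflake.snowpark.functions as F

def column_sum_equals(dataframe, column, expected):
    # Collapse the column by adding all of its elements, then compare with
    # the expected total.
    total = dataframe.agg(F.sum(F.col(column))).collect()[0][0]
    return total == expected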
Hi @canimus,
The test snowpark_dataframe/test_are_complete.py::test_positive ends with an error:
ERROR test/unit/snowpark_dataframe/test_are_complete.py::test_positive - ValueError: snowpark did not yield a value
Please make the tests in this pipeline pass.
It appears that results come back in the wrong order when running a test with 1000 rules.
Test scenario is with the Taxi NYC data set with 20M Rows.
df = spark.read.parquet("temp/data/*.parquet")
c = Check(CheckLevel.WARNING, "NYC")
for i in range(1000):
c.is_greater_than("fare_amount", i)
c.validate(spark, df).show(n=1000, truncate=False)
# Displayed dataframe contains wrong order in rows
# in 995 there is a discrepancy because 10% of the rows are certainly not with `fare_amount > 995`
Describe the bug
When using Polars, if the given dataset fails validation, instead of getting a dataframe with the validation results, an exception is thrown: TypeError: '>=' not supported between instances of 'NoneType' and 'float'
The example below shows the use of the is_unique check, but this has been replicated with other checks as well.
To Reproduce
Steps to reproduce the behavior:
import polars as pl
from cuallee import Check, CheckLevel
df = pl.DataFrame(
{
"id": [1, 1, 2, 3, 4],
"bar": [6, 7, 8, 9, 10],
"ham": ["a", "b", "c", "d", "e"],
}
)
id_check = Check(CheckLevel.WARNING, "ID unique")
display(id_check.is_unique("id").validate(df))
Expected behavior
Code returns a dataframe with validation results.
Actual behavior
Code throws exception below:
TypeError Traceback (most recent call last)
Cell In[28], line 13
4 df = pl.DataFrame(
5 {
6 "id": [1, 1, 2, 3, 4],
(...)
9 }
10 )
12 id_check = Check(CheckLevel.WARNING, "ID unique")
---> 13 display(id_check.is_unique("id").validate(df))

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/__init__.py:586, in Check.validate(self, dataframe)
581 self.compute_engine = importlib.import_module("cuallee.polars_validation")
583 assert self.compute_engine.validate_data_types(
584 self.rules, dataframe
585 ), "Invalid data types between rules and dataframe"
--> 586 return self.compute_engine.summary(self, dataframe)

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/polars_validation.py:414, in summary(check, dataframe)
410 return "FAIL"
412 rows = len(dataframe)
--> 414 computation_basis = [
415 {
416 "id": index,
417 "timestamp": check.date.strftime("%Y-%m-%d %H:%M:%S"),
418 "check": check.name,
419 "level": check.level.name,
420 "column": str(rule.column),
421 "rule": rule.method,
422 "value": rule.value,
423 "rows": rows,
424 "violations": _calculate_violations(first(unified_results[hash_key]), rows),
425 "pass_rate": _calculate_pass_rate(first(unified_results[hash_key]), rows),
426 "pass_threshold": rule.coverage,
427 "status": _evaluate_status(
428 _calculate_pass_rate(first(unified_results[hash_key]), rows),
429 rule.coverage,
430 ),
431 }
432 for index, (hash_key, rule) in enumerate(check._rule.items(), 1)
433 ]
434 pl.Config.set_tbl_cols(12)
435 return pl.DataFrame(computation_basis)

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/polars_validation.py:427, in <listcomp>(.0)
410 return "FAIL"
412 rows = len(dataframe)
414 computation_basis = [
415 {
416 "id": index,
417 "timestamp": check.date.strftime("%Y-%m-%d %H:%M:%S"),
418 "check": check.name,
419 "level": check.level.name,
420 "column": str(rule.column),
421 "rule": rule.method,
422 "value": rule.value,
423 "rows": rows,
424 "violations": _calculate_violations(first(unified_results[hash_key]), rows),
425 "pass_rate": _calculate_pass_rate(first(unified_results[hash_key]), rows),
426 "pass_threshold": rule.coverage,
--> 427 "status": _evaluate_status(
428 _calculate_pass_rate(first(unified_results[hash_key]), rows),
429 rule.coverage,
430 ),
431 }
432 for index, (hash_key, rule) in enumerate(check._rule.items(), 1)
433 ]
434 pl.Config.set_tbl_cols(12)
435 return pl.DataFrame(computation_basis)

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/polars_validation.py:407, in summary.<locals>._evaluate_status(pass_rate, pass_threshold)
405 def _evaluate_status(pass_rate, pass_threshold):
--> 407 if pass_rate >= pass_threshold:
408 return "PASS"
410 return "FAIL"TypeError: '>=' not supported between instances of 'NoneType' and 'float'
Desktop (please complete the following information):
Additional context
None.
We were able to reproduce the performance benchmark manually.
However, seeing these results in the CI pipeline would be ideal.
Please add a job to the CI workflow on GitHub Actions to reproduce the performance benchmark.
This is the last issue of the review:
The class Check seems like the central entry point to the package, however it lacks documentation. Could you add docstrings to it to explain the purpose of the class and of all its parameters?
In case you have not worked with docstrings yet, they are super helpful in modern IDEs to get info about the function when hovering over it. There are some docstring format conventions and also tools like autodocstring that help to create docstrings.
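For illustration, a minimal docstring sketch in Google style, assuming the constructor parameters visible in the examples above (level and name); any other parameters of the real constructor are omitted here:
class Check:
    """Collects validation rules and evaluates them against a dataframe.

    Rules are added through fluent methods such as is_complete or is_unique,
    and validate() returns a summary dataframe with one row per rule.

    Args:
        level (CheckLevel): Severity reported for failing rules, e.g. CheckLevel.WARNING.
        name (str): Name of the check, shown in the "check" column of the summary.
    """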
This issue is part of a JOSS Review
There is a documentation page at https://canimus.github.io/cuallee/ hosting all the files in the docs folder. However, it seems not quite finished. Is it part of the official documentation, or is the Readme.md the official documentation?
This issue is part of a JOSS Review
For the JOSS review, I need to validate the performance test:
Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)
Are the temp/taxi/ parquet files open data? If yes, could you provide them somehow?
This issue is part of a JOSS Review
Describe the bug
When attempting to use the correlation function within a custom defined function decorated with @daft.udf, an AttributeError is raised due to the absence of the correlation attribute in the statistics module. This error occurs because statistics.correlation was introduced in Python 3.10, while the minimum Python version supported by cuallee is currently 3.8, where this error can be reproduced.
To Reproduce
To reproduce the behavior:
Use the correlation function within a custom function decorated with @daft.udf:
@daft.udf(return_dtype=daft.DataType.float64())
def correlation(x, y):
>       return [statistics.correlation(x.to_pylist(), y.to_pylist())]
E       AttributeError: module 'statistics' has no attribute 'correlation'

cuallee/daft_validation.py:211: AttributeError
Expected behavior
The correlation function should be successfully called within the custom defined function decorated with @daft.udf.
Additional context
This bug arises because the statistics.correlation function was introduced in Python 3.10, and the environment where the code is being executed is likely using an older Python version where this function does not exist.
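As a sketch of a possible mitigation (my assumption, not the project's actual fix), the computation could fall back to a hand-written Pearson correlation when statistics.correlation is unavailable on Python < 3.10:
import statistics

def _correlation(xs, ys):
    # Prefer the standard library helper when it exists (Python >= 3.10).
    if hasattr(statistics, "correlation"):
        return statistics.correlation(xs, ys)
    # Fallback: plain Pearson correlation for older Python versions.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5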