canimus / cuallee
Possibly the fastest DataFrame-agnostic quality check library in town.
Home Page: https://canimus.github.io/cuallee/
License: Apache License 2.0
I've been using Spark Connect for both testing and data validation tasks. Despite following the provided documentation closely, I encountered errors with every example I attempted.
These issues occurred on Apache Spark version 3.5.1. Below, I provide detailed steps to reproduce two specific errors, along with the corresponding error messages.
Environment:
Python: 3.11.8
Apache Spark: 3.5.1
Scala: 2.12
docker run -ti --name spark -p 15002:15002 bitnami/spark:latest /opt/bitnami/spark/sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.1
spark = SparkSession.builder.appName("PySpark Test") \
.remote("sc://localhost:15002") \
.getOrCreate()
Issue Reproduction:
Executing the Dates Example as provided in the documentation:
# Values of column date between a range of dates
check = Check(CheckLevel.WARNING, "CheckIsBetweenDates")
df = spark.sql(
"""
SELECT
explode(
sequence(
to_date('2022-01-01'),
to_date('2022-01-10'),
interval 1 day)) as date
""")
assert (
check.is_between("date", "2022-01-01", "2022-01-10")
.validate(df)
.first()
.status == "PASS"
)
Error Message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[8], line 14
3 df = spark.sql(
4 """
5 SELECT
(...)
10 interval 1 day)) as date
11 """)
12 df = df.toPandas()
13 assert (
---> 14 check.is_between("date", "2022-01-01", "2022-01-10")
15 .validate(df)
16 .first()
17 .status == "PASS"
18 )
File c:\Users\Dsaad\GitHub\pyspark-tools\.venv\Lib\site-packages\cuallee\__init__.py:403, in Check.is_between(self, column, value, pct)
401 def is_between(self, column: str, value: Tuple[Any], pct: float = 1.0):
402 """Validation of a column between a range"""
--> 403 Rule("is_between", column, value, CheckDataType.AGNOSTIC, pct) >> self._rule
404 return self
File <string>:13, in __init__(self, method, column, value, data_type, coverage, options, status, violations, pass_rate, ordinal)
File c:\Users\Dsaad\GitHub\pyspark-tools\.venv\Lib\site-packages\cuallee\__init__.py:100, in Rule.__post_init__(self)
99 def __post_init__(self):
--> 100 if (self.coverage <= 0) or (self.coverage > 1):
101 raise ValueError("Coverage should be between 0 and 1")
103 if isinstance(self.column, List):
TypeError: '<=' not supported between instances of 'str' and 'int'
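Possibly relevant: the signature shown in the traceback is is_between(self, column, value: Tuple[Any], pct: float = 1.0), so passing the two dates as separate positional arguments sends the second date into pct. A sketch of a call that matches that signature (this only addresses the TypeError; whether Spark Connect is supported at all is a separate question, see the second reproduction below):
# Pass both bounds as a single tuple so pct keeps its default of 1.0.
check = Check(CheckLevel.WARNING, "CheckIsBetweenDates")
assert (
    check.is_between("date", ("2022-01-01", "2022-01-10"))
    .validate(df)
    .first()
    .status == "PASS"
)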
Issue Reproduction:
Executing an example that checks for null values and uniqueness:
from datetime import date
from pyspark.sql import Row
df = spark.createDataFrame([
Row(user_id=1111, order_id=4343, preferred_store='string1', birthdate=date(1999, 1, 1), joined_date=date(2022, 6, 1)),
Row(user_id=2222, order_id=5454, preferred_store='string2', birthdate=date(2000, 2, 1), joined_date=date(2022, 7, 2)),
Row(user_id=3333, order_id=6565, preferred_store='string3', birthdate=date(2001, 3, 1), joined_date=date(2022, 8, 3))
])
# Nulls on column Id
check = Check(CheckLevel.WARNING, "Completeness")
( check
.is_complete("user_id")
.is_unique("user_id")
.validate(df)
).show()
Error Message:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Cell In[7], line 13
7 # Nulls on column Id
8 check = Check(CheckLevel.WARNING, "Completeness")
9 ( check
10 .is_complete("user_id")
11 .is_unique("user_id")
---> 12 .validate(df)
13 ).show()
File c:\Users\Dsaad\GitHub\pyspark-tools\.venv\Lib\site-packages\cuallee\__init__.py:703, in Check.validate(self, dataframe)
700 self.compute_engine = importlib.import_module("cuallee.polars_validation")
702 else:
--> 703 raise Exception(
704 "Cuallee is not ready for this data structure. You can log a Feature Request in Github."
705 )
707 assert self.compute_engine.validate_data_types(
708 self.rules, dataframe
709 ), "Invalid data types between rules and dataframe"
711 return self.compute_engine.summary(self, dataframe)
Exception: Cuallee is not ready for this data structure. You can log a Feature Request in Github.
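A possible interim workaround, which is an assumption on my part rather than guidance from the maintainers: convert the Spark Connect dataframe to pandas first, since pandas dataframes are supported:
# Assumption: the pandas engine can run the same checks while Spark Connect
# support is pending.
pandas_df = df.toPandas()

check = Check(CheckLevel.WARNING, "Completeness")
print(
    check
    .is_complete("user_id")
    .is_unique("user_id")
    .validate(pandas_df)
)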
I would appreciate any guidance or updates on resolving these errors. Thank you for your assistance.
Would it be possible to add an example of using cuallee with a pandas DataFrame?
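A minimal sketch of what such an example could look like, assuming the Check API shown in the Spark examples works unchanged on a pandas DataFrame (pandas support is mentioned elsewhere in these issues):
import pandas as pd
from cuallee import Check, CheckLevel

df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

check = Check(CheckLevel.WARNING, "PandasExample")
# validate() returns a summary dataframe with one row per rule
print(
    check
    .is_complete("id")
    .is_unique("id")
    .validate(df)
)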
The has_sum method collapses a column in a dataframe by adding up all the elements. It has been successfully implemented in pandas, pyspark and duckdb. The implementation on snowpark is missing.
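For reference, a usage sketch on pandas; the signature has_sum(column, value), with value as the expected total, is an assumption based on the description above:
import pandas as pd
from cuallee import Check, CheckLevel

df = pd.DataFrame({"amount": [1, 2, 3]})

check = Check(CheckLevel.WARNING, "SumCheck")
# 1 + 2 + 3 == 6, so this rule should PASS
print(check.has_sum("amount", 6).validate(df))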
Hi @canimus,
I can see a link to the contributing guidelines on a sidebar titled Helpful resources when creating a new issue.
However, I need help finding a link to these guidelines within the documentation.
Can you please point me to the section in the README with the guidelines for third parties wishing to:
Thank you!
We see that in the article we are trying to cite Soda Core and Great Expectations:
Line 58 in e4c1242
but it is not rendering properly.
Maybe the @software option is not working:
Lines 141 to 152 in e4c1242
We are confident that you can find a solution. Could you please consider using the @misc option or any other alternative that you think might work?
This issue is part of a JOSS REVIEW
This is probably my last issue of this review, as I have finished reviewing both the functionality and the documentation page. Only one minor thing about the recent changes in the paper:
soda and great-expectations have a core component that is open source, plus additional features / cloud offerings that you need to pay for. Could you correct this in your paper where you write:
This issue is part of a JOSS Review
JOSS wants the software papers to contain a comparison to the state of the field, i.e.
State of the field: Do the authors describe how this software compares to other commonly-used packages?
Hence, could you add a section in your paper where you briefly compare cuallee to other data testing frameworks, like soda, great_expectations, or others?
This issue is part of a JOSS Review
For example, std needs to have a tolerance between the passed value and the real value (with rounding / truncation / ceil-floor) and a number of decimals for the rounding.
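To illustrate the request (this is not an existing cuallee API, just a sketch of the desired comparison), a tolerance-aware std check might compare rounded values instead of requiring exact equality:
import statistics

def std_matches(values, expected, decimals=2):
    # Compare the observed standard deviation with the expected value after
    # rounding both to the requested number of decimals.
    # Note: whether cuallee uses sample or population std is an assumption here.
    return round(statistics.stdev(values), decimals) == round(expected, decimals)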
Describe the bug
With duckdb v0.9.2 I get the error:
Exception: Cuallee is not ready for this data structure. You can log a Feature Request in Github.
To Reproduce
Steps to reproduce the behavior:
Desktop (please complete the following information):
This issue is part of a JOSS Review
Describe the bug
The issue is probably noticeable with other are_* validations. It seems that when passing a number of columns to be checked, the total number of violations found will be divided by the number of columns given. For example, if a check will be done over 3 columns, and 12 violations are found, only 4 will be reported. This is OK if all violations were on the same row, but will underreport when the violations are on different rows.
To Reproduce
Run the following code:
from pyspark.sql import SparkSession
from cuallee import Check, CheckLevel
spark = SparkSession.builder.getOrCreate()
check = Check(CheckLevel.WARNING, "Not NULL").are_complete(["col_a", "col_b", "col_c"])
# This is fine
df1 = spark.createDataFrame([
{"col_a": 1, "col_b": 1, "col_c": 1}, {"col_a": 2, "col_b": 2, "col_c": 2}, {"col_a": 3, "col_b": 3, "col_c": 3}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df1.show(truncate=False)
check.validate(df1).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|1    |1    |1    |
|2    |2    |2    |
|3    |3    |3    |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+---------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate|pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+---------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |0.0       |1.0      |1.0           |PASS  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+---------+--------------+------+
# This is also fine
df2 = spark.createDataFrame([
{"col_a": None, "col_b": None, "col_c": None}, {"col_a": 2, "col_b": 2, "col_c": 2}, {"col_a": 3, "col_b": 3, "col_c": 3}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df2.show(truncate=False)
check.validate(df2).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|null |null |null |
|2    |2    |2    |
|3    |3    |3    |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate         |pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |1.0       |0.6666666666666666|1.0           |FAIL  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
# This should show 3 violations, but it seems that the number of violations outputted is divided by the number of columns
df3 = spark.createDataFrame([
{"col_a": None, "col_b": 1, "col_c": 1}, {"col_a": None, "col_b": 2, "col_c": 2}, {"col_a": None, "col_b": 3, "col_c": 3}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df3.show(truncate=False)
check.validate(df3).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|null |1    |1    |
|null |2    |2    |
|null |3    |3    |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate         |pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |1.0       |0.6666666666666666|1.0           |FAIL  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
# This should also show 3 violations and a 0% pass rate
df4 = spark.createDataFrame([
{"col_a": None, "col_b": 1, "col_c": 1}, {"col_a": 2, "col_b": None, "col_c": 2}, {"col_a": 3, "col_b": 3, "col_c": None}
], schema="struct<col_a: int, col_b: int, col_c: int>")
df4.show(truncate=False)
check.validate(df4).show(truncate=False)
+-----+-----+-----+
|col_a|col_b|col_c|
+-----+-----+-----+
|null |1    |1    |
|2    |null |2    |
|3    |3    |null |
+-----+-----+-----+

+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|id |timestamp          |check   |level  |column                     |rule        |value|rows|violations|pass_rate         |pass_threshold|status|
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
|1  |2023-08-07 12:23:32|Not NULL|WARNING|('col_a', 'col_b', 'col_c')|are_complete|N/A  |3   |1.0       |0.6666666666666666|1.0           |FAIL  |
+---+-------------------+--------+-------+---------------------------+------------+-----+----+----------+------------------+--------------+------+
Expected behavior
Ideally, the number of violations should reflect the number of offending rows (unless I'm misunderstanding something).
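For reference, a sketch of counting offending rows directly in PySpark (my reading of the expected figure, not cuallee's implementation):
from functools import reduce
import pyspark.sql.functions as F

cols = ["col_a", "col_b", "col_c"]
# A row counts as one violation if any of the checked columns is null.
offending_rows = df4.filter(
    reduce(lambda a, b: a | b, [F.col(c).isNull() for c in cols])
).count()
print(offending_rows)  # 3 for df4 above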
Desktop (please complete the following information):
cuallee==0.4.7
Great that you have updated your documentation page. Could you now refer to it prominently, for example in the README and/or in the project description on GitHub (upper-right corner of the GitHub repository page)?
This issue is part of a JOSS Review
Define the has_sum check for snowpark. It collapses a dataframe column by adding up all its elements.
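Not the cuallee internal API, just a sketch of the underlying snowpark aggregation such an implementation would rely on (the function name here is hypothetical):
import snowflake.snowpark.functions as F

def column_sum_equals(dataframe, column, expected):
    # Collapse the column by adding all of its elements, then compare with
    # the expected total.
    total = dataframe.agg(F.sum(F.col(column))).collect()[0][0]
    return total == expected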
Hi @canimus,
The test snowpark_dataframe/test_are_complete.py::test_positive ends with an error:
ERROR test/unit/snowpark_dataframe/test_are_complete.py::test_positive - ValueError: snowpark did not yield a value
Please make the tests in this pipeline pass.
It appears that results come back in the wrong order when running a test with 1000 rules.
Test scenario is with the Taxi NYC data set with 20M Rows.
df = spark.read.parquet("temp/data/*.parquet")
c = Check(CheckLevel.WARNING, "NYC")
for i in range(1000):
c.is_greater_than("fare_amount", i)
c.validate(spark, df).show(n=1000, truncate=False)
# Displayed dataframe contains wrong order in rows
# in 995 there is a discrepancy because 10% of the rows are certainly not with `fare_amount > 995`
Describe the bug
When using Polars, if the given dataset fails validation, instead of getting a dataframe with the validation results, an exception is thrown: TypeError: '>=' not supported between instances of 'NoneType' and 'float'
The example below shows the use of the is_unique check, but this has been replicated with other checks as well.
To Reproduce
Steps to reproduce the behavior:
import polars as pl
from cuallee import Check, CheckLevel
df = pl.DataFrame(
{
"id": [1, 1, 2, 3, 4],
"bar": [6, 7, 8, 9, 10],
"ham": ["a", "b", "c", "d", "e"],
}
)
id_check = Check(CheckLevel.WARNING, "ID unique")
display(id_check.is_unique("id").validate(df))
Expected behavior
Code returns a dataframe with validation results.
Actual behavior
Code throws exception below:
TypeError Traceback (most recent call last)
Cell In[28], line 13
4 df = pl.DataFrame(
5 {
6 "id": [1, 1, 2, 3, 4],
(...)
9 }
10 )
12 id_check = Check(CheckLevel.WARNING, "ID unique")
---> 13 display(id_check.is_unique("id").validate(df))

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/__init__.py:586, in Check.validate(self, dataframe)
581 self.compute_engine = importlib.import_module("cuallee.polars_validation")
583 assert self.compute_engine.validate_data_types(
584 self.rules, dataframe
585 ), "Invalid data types between rules and dataframe"
--> 586 return self.compute_engine.summary(self, dataframe)

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/polars_validation.py:414, in summary(check, dataframe)
410 return "FAIL"
412 rows = len(dataframe)
--> 414 computation_basis = [
415 {
416 "id": index,
417 "timestamp": check.date.strftime("%Y-%m-%d %H:%M:%S"),
418 "check": check.name,
419 "level": check.level.name,
420 "column": str(rule.column),
421 "rule": rule.method,
422 "value": rule.value,
423 "rows": rows,
424 "violations": _calculate_violations(first(unified_results[hash_key]), rows),
425 "pass_rate": _calculate_pass_rate(first(unified_results[hash_key]), rows),
426 "pass_threshold": rule.coverage,
427 "status": _evaluate_status(
428 _calculate_pass_rate(first(unified_results[hash_key]), rows),
429 rule.coverage,
430 ),
431 }
432 for index, (hash_key, rule) in enumerate(check._rule.items(), 1)
433 ]
434 pl.Config.set_tbl_cols(12)
435 return pl.DataFrame(computation_basis)

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/polars_validation.py:427, in <listcomp>(.0)
410 return "FAIL"
412 rows = len(dataframe)
414 computation_basis = [
415 {
416 "id": index,
417 "timestamp": check.date.strftime("%Y-%m-%d %H:%M:%S"),
418 "check": check.name,
419 "level": check.level.name,
420 "column": str(rule.column),
421 "rule": rule.method,
422 "value": rule.value,
423 "rows": rows,
424 "violations": _calculate_violations(first(unified_results[hash_key]), rows),
425 "pass_rate": _calculate_pass_rate(first(unified_results[hash_key]), rows),
426 "pass_threshold": rule.coverage,
--> 427 "status": _evaluate_status(
428 _calculate_pass_rate(first(unified_results[hash_key]), rows),
429 rule.coverage,
430 ),
431 }
432 for index, (hash_key, rule) in enumerate(check._rule.items(), 1)
433 ]
434 pl.Config.set_tbl_cols(12)
435 return pl.DataFrame(computation_basis)

File ~/Source/replicate-issue/.venv/lib/python3.11/site-packages/cuallee/polars_validation.py:407, in summary.<locals>._evaluate_status(pass_rate, pass_threshold)
405 def _evaluate_status(pass_rate, pass_threshold):
--> 407 if pass_rate >= pass_threshold:
408 return "PASS"
410 return "FAIL"TypeError: '>=' not supported between instances of 'NoneType' and 'float'
Desktop (please complete the following information):
Additional context
None.
We were able to reproduce the performance benchmark manually.
However, seeing these results in the CI pipeline would be ideal.
Please add a job to the CI workflow on GitHub Actions to reproduce the performance benchmark.
This is the last issue of the review:
The class Check seems like the central entry point to the package, however it lacks documentation. Could you add docstrings to it to explain the purpose of the class and of all its parameters?
In case you have not worked with docstrings yet, they are super helpful in modern IDEs to get info about the function when hovering over it. There are some docstring format conventions and also tools like autodocstring that help to create docstrings.
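For illustration, a minimal docstring sketch in Google style, assuming the constructor parameters visible in the examples above (level and name); any other parameters of the real constructor are omitted here:
class Check:
    """Collects validation rules and evaluates them against a dataframe.

    Rules are added through fluent methods such as is_complete or is_unique,
    and validate() returns a summary dataframe with one row per rule.

    Args:
        level (CheckLevel): Severity reported for failing rules, e.g. CheckLevel.WARNING.
        name (str): Name of the check, shown in the "check" column of the summary.
    """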
This issue is part of a JOSS Review
There is a documentation page at https://canimus.github.io/cuallee/ hosting all the files in the docs folder. However, it seems not quite finished. Is it part of the official documentation, or is the Readme.md the official documentation?
This issue is part of a JOSS Review
For the JOSS review, I need to validate the performance test:
Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)
Are the temp/taxi/ parquet files open data? If yes, could you provide them somehow?
This issue is part of a JOSS Review
Describe the bug
When attempting to use the correlation function within a custom defined function decorated with @daft.udf, an AttributeError is raised due to the absence of the correlation attribute in the statistics module. This error occurs because statistics.correlation was introduced in Python 3.10, while the minimum Python version supported by cuallee is currently 3.8, where this error can be reproduced.
To Reproduce
To reproduce the behavior:
Use the correlation function within a custom function decorated with @daft.udf:
@daft.udf(return_dtype=daft.DataType.float64())
def correlation(x, y):
>       return [statistics.correlation(x.to_pylist(), y.to_pylist())]
E       AttributeError: module 'statistics' has no attribute 'correlation'

cuallee/daft_validation.py:211: AttributeError
Expected behavior
The correlation function should be successfully called within the custom defined function decorated with @daft.udf.
Additional context
This bug arises because the statistics.correlation function was introduced in Python 3.10, and the environment where the code is being executed is likely using an older Python version where this function does not exist.
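As a sketch of a possible mitigation (my assumption, not the project's actual fix), the computation could fall back to a hand-written Pearson correlation when statistics.correlation is unavailable on Python < 3.10:
import statistics

def _correlation(xs, ys):
    # Prefer the standard library helper when it exists (Python >= 3.10).
    if hasattr(statistics, "correlation"):
        return statistics.correlation(xs, ys)
    # Fallback: plain Pearson correlation for older Python versions.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5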