Is your feature request related to a problem? Please describe.
When we extract columns, it would be very handy to be able to run checks against those columns. pandera is a great, lightweight tool for validating dtypes, nullability, uniqueness, and any arbitrary Check callable.
Describe the solution you'd like
Ideally this would be a decorator that works similarly to extra_columns: it would ingest a DataFrame, return the same DataFrame, and expand the nodes to include a DataFrame validation node. This could be specific to pandera, or it could be made more general. Something like:
import pandas as pd
from pandera import DataFrameSchema, Column, Check

# validate_columns is the proposed decorator; it does not exist yet.
@validate_columns({
    "user_id": Column(str, unique=True),
    "age": Column(int, Check.in_range(18, 150)),
    "shirt_size": Column(float, Check.greater_than(10), description="arm length in inches"),
    "favorite_apparel": Column(str, Check.isin(["pants", "shirts", "hats"])),
})
def users(input_file: str) -> pd.DataFrame:
    return pd.read_csv(input_file)
or, more generically:
import abc
from typing import Any, Dict

import pandas as pd


class Schema(abc.ABC):
    @abc.abstractmethod
    def validate(self, df: pd.DataFrame) -> None:
        """Raise if df does not match the schema."""


class SimpleColumnChecker(Schema):
    def __init__(self, columns: Dict[str, Dict[str, Any]]):
        self.columns = columns

    def validate(self, df: pd.DataFrame) -> None:
        for column, col_schema in self.columns.items():
            assert column in df.columns
            if col_schema.get("unique"):
                assert df[column].is_unique
            # Test for key presence so that a bound of 0 is still enforced.
            # Bounds are treated as inclusive.
            if "min" in col_schema:
                assert df[column].min() >= col_schema["min"]
            if "max" in col_schema:
                assert df[column].max() <= col_schema["max"]
            if "isin" in col_schema:
                # Every value must be one of the allowed values.
                assert df[column].isin(col_schema["isin"]).all()


# validate_columns is the proposed decorator; it does not exist yet.
@validate_columns(SimpleColumnChecker({
    "user_id": {"unique": True},
    "age": {"min": 18, "max": 150},
    "shirt_size": {"min": 10},
    "favorite_apparel": {"isin": ["pants", "shirts", "hats"]},
}))
def users(input_file: str) -> pd.DataFrame:
    return pd.read_csv(input_file)
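For discussion, here is a minimal sketch of what validate_columns could look like as a plain Python decorator. It is only an assumption about the shape of the feature: the SupportsValidate protocol is a hypothetical name, and the sketch simply wraps the function rather than adding a separate validation node to the graph, which the real implementation would presumably do. It accepts any object with a validate(df) method, so a pandera DataFrameSchema or a SimpleColumnChecker would both fit.

```python
import functools
from typing import Any, Callable, Protocol


class SupportsValidate(Protocol):
    """Hypothetical protocol: anything exposing validate(df)."""

    def validate(self, df: Any) -> Any: ...


def validate_columns(schema: SupportsValidate) -> Callable:
    """Sketch of the proposed decorator: validate the wrapped function's output."""

    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)  # keep the original name and docstring
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            df = fn(*args, **kwargs)
            # pandera's DataFrameSchema.validate returns the validated frame;
            # a checker whose validate returns None is treated as
            # "passed, keep the original frame".
            validated = schema.validate(df)
            return df if validated is None else validated

        return wrapper

    return decorator
```

Any exception raised by schema.validate propagates to the caller, so a failed check stops execution at the node that produced the bad data.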
Describe alternatives you've considered
Certainly you can add a splitting node where you validate the data yourself, but I think this pattern is common enough (or really should be, as a first-class citizen of any dataframe manipulation) that it would benefit from being easy to plug directly into a node.
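For comparison, the manual alternative looks roughly like the sketch below. It assumes a framework in which a function's parameter name refers to an upstream node; users_raw is a hypothetical node that returns the unvalidated DataFrame, and the specific checks are only illustrative.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # pandas is only needed for the type hints here
    import pandas as pd


def users(users_raw: pd.DataFrame) -> pd.DataFrame:
    """Hand-written validation node; downstream nodes depend on this one."""
    missing = {"user_id", "age"} - set(users_raw.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    if not users_raw["user_id"].is_unique:
        raise ValueError("user_id contains duplicates")
    if not users_raw["age"].between(18, 150).all():
        raise ValueError("age outside [18, 150]")
    return users_raw
```

Every DataFrame that needs checking requires another pair of nodes like this, which is the boilerplate a decorator would remove.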
Additional context