timgent / data-flare
Data quality control tool built on Spark and Deequ
Home Page: https://timgent.github.io/data-flare
License: Apache License 2.0
Some code which may or may not help...
// Renders all check results of a suite, grouped into successful and unsuccessful checks.
def showQcResults(checksSuiteResult: ChecksSuiteResult): String = {
  val (successfulQcResults, unsuccessfulQcResults) =
    checksSuiteResult.checkResults.partition(_.status == CheckStatus.Success)
  s"""
     | SUCCESSFUL QC Checks:
     |
     | - ${successfulQcResults.map(showCheckResult).mkString("\n\n - ")}
     |
     | UNSUCCESSFUL QC Checks:
     |
     | - ${unsuccessfulQcResults.map(showCheckResult).mkString("\n\n - ")}
     |
     |""".stripMargin
}

// Renders the fields of a single check result, one per line.
private def showCheckResult(checkResult: CheckResult): String = {
  import checkResult._
  s"""
     |checkDescription -> $checkDescription
     |resultDescription -> $resultDescription
     |status -> $status
     |datasourceDescription -> $datasourceDescription
     |qcType -> $qcType
     |""".stripMargin
}
Something like...
def withinXPercentComparator(percentage: Double) =
  MetricComparator[LongMetric](
    s"within${percentage}PercentComparator",
    // Passes when metric1 is within +/- percentage percent of metric2
    (metric1, metric2) => {
      metric1 <= metric2 * (100 + percentage) / 100 && metric1 >= metric2 * (100 - percentage) / 100
    }
  )
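To look at the tolerance arithmetic on its own, here is a minimal sketch that applies the same comparison to plain Long values. It does not use data-flare's MetricComparator; the object name and the function withinXPercent are assumptions made for illustration only.

```scala
object WithinPercentSketch {
  // Hypothetical stand-in for the comparator body above, working on raw Long values.
  // Returns true when metric1 lies within +/- percentage percent of metric2.
  def withinXPercent(percentage: Double)(metric1: Long, metric2: Long): Boolean = {
    val upper = metric2 * (100 + percentage) / 100
    val lower = metric2 * (100 - percentage) / 100
    metric1 <= upper && metric1 >= lower
  }

  def main(args: Array[String]): Unit = {
    println(withinXPercent(10)(105L, 100L)) // inside the 90..110 band
    println(withinXPercent(10)(111L, 100L)) // outside the 90..110 band
  }
}
```

One thing this makes visible: when metric2 is negative the upper and lower bounds swap, so the comparison as written can never pass. That is probably harmless for sizes and counts, but may matter for other metrics.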
Metrics Storage:
"metricDescriptor" : {
  "metricName" : "Size",
  "filterDescription" : "no filter",
  "complianceDescription" : null,
  "onColumns" : null
},
"metricValue" : {
  "LongMetric" : {
    "value" : 11
  }
}
This code should not compile, because the metric type of the SumValuesMetric does not match that of the MetricComparator, and the mismatch can lead to runtime errors (the check will try to cast). However, it currently does compile :(
DualMetricBasedCheck(
  SumValuesMetric[DoubleMetric](???),
  SumValuesMetric[DoubleMetric](???),
  "change in unprojected patients",
  MetricComparator[LongMetric](???, ???)
)
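One possible way to turn the mismatch described above into a compile error is to have the two metric descriptors and the comparator share a single type parameter. The sketch below uses simplified, hypothetical stand-ins (MetricValue, MetricDescriptor, DualMetricCheck and so on are defined here for illustration and are not the library's actual definitions).

```scala
object TypedCheckSketch {
  // Hypothetical, simplified stand-ins for the library's metric value types.
  sealed trait MetricValue
  final case class LongMetric(value: Long) extends MetricValue
  final case class DoubleMetric(value: Double) extends MetricValue

  // The descriptor records, at the type level, which kind of value it produces.
  final case class MetricDescriptor[MV <: MetricValue](name: String)

  final case class MetricComparator[MV <: MetricValue](
      description: String,
      fn: (MV, MV) => Boolean
  )

  // Both descriptors and the comparator share MV, so mixing types does not compile.
  final case class DualMetricCheck[MV <: MetricValue](
      dsAMetric: MetricDescriptor[MV],
      dsBMetric: MetricDescriptor[MV],
      checkDescription: String,
      comparator: MetricComparator[MV]
  ) {
    def evaluate(a: MV, b: MV): Boolean = comparator.fn(a, b)
  }

  val sumsAreEqual: MetricComparator[DoubleMetric] =
    MetricComparator[DoubleMetric]("metrics are equal", (a, b) => a.value == b.value)

  val check: DualMetricCheck[DoubleMetric] = DualMetricCheck(
    MetricDescriptor[DoubleMetric]("SumValues"),
    MetricDescriptor[DoubleMetric]("SumValues"),
    "change in unprojected patients",
    sumsAreEqual
  )

  // Passing a MetricComparator[LongMetric] instead of sumsAreEqual would be rejected
  // by the compiler, because MV cannot be both DoubleMetric and LongMetric.
}
```

The type parameters are deliberately invariant here: that is what stops the compiler from widening MV to a common supertype and accepting the mismatched comparator.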
QC Results Storage:
"MetricsBasedQualityCheck" : { }
- Done "status" : {
    "Success" : { }
  },
"checkDescription" : "Keep all customers. Comparing metric SimpleMetricDescriptor(Size,Some(no filter),None,None) to DistinctValuesMetricDescriptor(List(customer_id),MetricFilter(true,no filter)) using comparator of metrics are equal",
"datasourceDescription" : "customers compared to customerOrders"
- Done "resultDescription" : "metric comparison passed. dsAMetric of LongMetric(22) was compared to dsBMetric of LongMetric(22)",
- Done "overallStatus" : {
    "Success" : { }
  },
Issues identified so far:
It turns out that DualDatasetMetricCheck is also helpful for comparing different metrics on a single dataset, so the API and naming there could probably be improved.
The point of this ticket, though, is that it would also be helpful to compare an arbitrary number of metrics across one or many datasets.
This came up when looking at data where you can get the number of losses, the number of wins, and the net. Each of those can be calculated as a metric, and any 2 of them can be compared. However, it isn't possible to compare all 3 of them. Could that be done? Possibly with composite metrics (i.e. a MetricDescriptor that is composed of other MetricDescriptors), or possibly with something else, such as taking a list of arguments rather than a set number, or using type params to make things more flexible.
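As a rough illustration of the "list of arguments" option, the sketch below defines a check over any number of named metrics and evaluates it against values that are assumed to have been computed already. MultiMetricCheck and evaluate are hypothetical names, not part of the existing API, and metric values are simplified to Long.

```scala
object MultiMetricCheckSketch {
  // Hypothetical check over an arbitrary number of named metrics.
  final case class MultiMetricCheck(
      checkDescription: String,
      metricNames: List[String],
      comparator: Map[String, Long] => Boolean
  )

  // Evaluates the check against already-computed metric values.
  // Returns Left with a message when a required metric has not been computed.
  def evaluate(check: MultiMetricCheck, computed: Map[String, Long]): Either[String, Boolean] = {
    val missing = check.metricNames.filterNot(computed.contains)
    if (missing.nonEmpty) Left(s"missing metrics: ${missing.mkString(", ")}")
    else {
      // Only hand the comparator the metrics this check declared.
      val relevant = computed.filter { case (name, _) => check.metricNames.contains(name) }
      Right(check.comparator(relevant))
    }
  }

  // The wins / losses / net example from the ticket.
  val netIsConsistent: MultiMetricCheck = MultiMetricCheck(
    "wins minus losses equals net",
    List("wins", "losses", "net"),
    m => m("wins") - m("losses") == m("net")
  )
}
```

Looking metrics up by name trades some type safety for flexibility; a composite MetricDescriptor or a type-parameterised variant would keep more of that safety at the cost of a more involved API.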
"Spark" is an Apache trademark of some kind, I believe, so it shouldn't be used in the name of this library. I will need to come up with a new name, rename the git repo and the relevant code sections, and upload to a Maven repo under the new name, which may require me to create another ticket with them.
Work has started on the release-ci branch. However, I need to decide how I want it to work, mainly what the trigger should be.
2 main options:
Too many of the names you need to use when creating a ChecksSuite are too verbose. They should be considered as a whole and renamed. I also believe the number of input parameters to a check suite could be reduced by combining all checks that operate on a single dataset or on a pair of datasets. That would probably mean introducing a common supertype for those. Also, using a Map from datasets to lists of checks instead of wrapper types might be a bit less verbose.
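A minimal sketch of what the Map-based shape might look like, assuming a common supertype for single and dual dataset targets. Every name here (CheckTarget, SingleDataset, DatasetPair, Check and this simplified ChecksSuite) is hypothetical, and datasets are represented by their names only.

```scala
object ChecksSuiteSketch {
  // Hypothetical common supertype for whatever a group of checks runs against.
  sealed trait CheckTarget
  final case class SingleDataset(name: String) extends CheckTarget
  final case class DatasetPair(nameA: String, nameB: String) extends CheckTarget

  // Placeholder for a real check; only the description is modelled.
  final case class Check(description: String)

  // One Map parameter replaces separate parameters per category of check.
  final case class ChecksSuite(description: String, checks: Map[CheckTarget, List[Check]]) {
    def allChecks: List[Check] = checks.values.flatten.toList
  }

  val suite: ChecksSuite = ChecksSuite(
    "customer checks",
    Map(
      SingleDataset("customers") -> List(Check("size is positive")),
      DatasetPair("customers", "customerOrders") -> List(Check("keep all customers"))
    )
  )
}
```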
Get it from an env var, which can be set using git before the docs are compiled:
git describe --abbrev=0
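On the Scala side, reading such a variable might look like the sketch below. The variable name DOCS_VERSION and the fallback value "unknown" are assumptions; the ticket does not name either.

```scala
object DocsVersionSketch {
  // DOCS_VERSION is a hypothetical variable name; the ticket does not specify one.
  val EnvVarName = "DOCS_VERSION"

  // Reads the version from the given environment (the process environment by default),
  // falling back to a placeholder when the variable is unset or blank.
  def docsVersion(env: Map[String, String] = sys.env): String =
    env.get(EnvVarName).map(_.trim).filter(_.nonEmpty).getOrElse("unknown")
}
```

The variable would then be exported from the output of the git command above before the docs build runs. Taking the environment as a parameter keeps the lookup easy to test without touching the real process environment.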
This is leading to unclear check descriptions wherever the SumValues metric is used, as the description doesn't say which column is being summed!
As a data engineer
I would like to be able to run anomaly checks with deequ
So that I can identify anomalous results in my data
Example usage:
ComplianceCheck(col("someCol").isin("a", "b"), "someCol isin a,b", noFilter)
Currently it is just one big long string with the metric descriptors embedded in it, which is inconsistent with SingleMetricCheck, where the check description doesn't include the metric descriptor at all.
The check description should probably be a sealed trait with different subtypes, so that it can carry neater, more structured information.
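A minimal sketch of the sealed trait idea follows. The subtype names, their fields, and the render function are all hypothetical, and metric descriptors are simplified to strings.

```scala
object CheckDescriptionSketch {
  // Hypothetical structured replacement for the single description string.
  sealed trait CheckDescription
  final case class SimpleCheckDescription(description: String) extends CheckDescription
  final case class SingleMetricCheckDescription(description: String, metric: String)
      extends CheckDescription
  final case class DualMetricCheckDescription(
      description: String,
      dsAMetric: String,
      dsBMetric: String,
      comparator: String
  ) extends CheckDescription

  // One place decides how each kind of description is displayed,
  // so single and dual metric checks stay consistent with each other.
  def render(cd: CheckDescription): String = cd match {
    case SimpleCheckDescription(d)              => d
    case SingleMetricCheckDescription(d, m)     => s"$d (metric: $m)"
    case DualMetricCheckDescription(d, a, b, c) => s"$d (comparing $a to $b using $c)"
  }
}
```

Because the trait is sealed, the compiler can warn when a new kind of description is added but render is not updated to handle it.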
Example uses:
I want to know that 2 of my tables are the same size
I want to know that 1 table always has more records than another
....etc
Enable a user to:
It will help people avoid dependency version issues