
data-flare's Issues

Code should not compile but does issue

This code should not compile, due to the mismatch of metric types between the SumValuesMetric and the MetricComparator, which can lead to runtime errors (it will attempt a cast). However, it currently does compile :(

DualMetricBasedCheck(
  SumValuesMetric[DoubleMetric],
  SumValuesMetric[DoubleMetric],
  "change in unprojected patients",
  MetricComparator[LongMetric](???, ???)
)
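One way to rule this out at compile time is to give the check a single type parameter that both metric descriptors and the comparator must share. This is only a sketch, not data-flare's actual API: the `MetricValue` hierarchy, `MetricDescriptor` trait, and the simplified fields below are assumptions made for illustration.

```scala
sealed trait MetricValue
final case class LongMetric(value: Long) extends MetricValue
final case class DoubleMetric(value: Double) extends MetricValue

// Simplified stand-ins for the real metric descriptors
trait MetricDescriptor[MV <: MetricValue] { def name: String }
case object SizeMetric extends MetricDescriptor[LongMetric] { val name = "Size" }
case object SumValuesMetric extends MetricDescriptor[DoubleMetric] { val name = "SumValues" }

final case class MetricComparator[MV <: MetricValue](
    description: String,
    compare: (MV, MV) => Boolean
)

// Both descriptors and the comparator must agree on the metric type MV
final case class DualMetricBasedCheck[MV <: MetricValue](
    dsAMetric: MetricDescriptor[MV],
    dsBMetric: MetricDescriptor[MV],
    checkDescription: String,
    comparator: MetricComparator[MV]
)

// Compiles: everything is a LongMetric
val ok = DualMetricBasedCheck(
  SizeMetric,
  SizeMetric,
  "sizes equal",
  MetricComparator[LongMetric]("equal", _.value == _.value)
)

// Would NOT compile: DoubleMetric descriptors with a LongMetric comparator
// DualMetricBasedCheck(SumValuesMetric, SumValuesMetric, "sums equal",
//   MetricComparator[LongMetric]("equal", _.value == _.value))
```

With this shape the mismatched example above becomes a type error instead of a runtime cast failure.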

Polish the external API

Issues identified so far:

  • No need for the QualityChecker object - you should be able to run a ChecksSuite directly - MR created
    • Remove QualityChecker
    • Pass the qcResultsRepository directly to a ChecksSuite instead
    • Persist QC results in the ChecksSuite.run method
  • Could RawCheckResult be a bit neater or more nicely named? - I still can't come up with anything nicer at the moment
  • Make MetricComparator curried for neater syntax - can't do this without breaking the case class. An additional constructor results in a double-definition error, for which I can only find an ugly workaround. Just going to have to live with this for now
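The first group of bullets could land on something like the sketch below. It is hypothetical: the `QcResultsRepository` trait, the in-memory stand-in, and the simplified `CheckResult` fields are all assumptions, shown only to illustrate moving persistence into `ChecksSuite.run`.

```scala
final case class CheckResult(description: String, passed: Boolean)
final case class ChecksSuiteResult(checkResults: Seq[CheckResult])

trait QcResultsRepository { def save(result: ChecksSuiteResult): Unit }

// Simple in-memory repository for illustration
class InMemoryQcResultsRepository extends QcResultsRepository {
  var saved: List[ChecksSuiteResult] = Nil
  def save(result: ChecksSuiteResult): Unit = saved = result :: saved
}

// The suite owns its repository, so persistence happens inside run()
// rather than in a separate QualityChecker object
final case class ChecksSuite(
    checks: Seq[() => CheckResult],
    qcResultsRepository: QcResultsRepository
) {
  def run(): ChecksSuiteResult = {
    val result = ChecksSuiteResult(checks.map(_.apply()))
    qcResultsRepository.save(result)
    result
  }
}

val repo = new InMemoryQcResultsRepository
val suiteResult = ChecksSuite(
  Seq(() => CheckResult("size check", passed = true)),
  repo
).run()
```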

API Naming

There are too many overly verbose names you need to use when creating a ChecksSuite. They should be considered as a whole and renamed. I also believe the number of input parameters to a ChecksSuite could be reduced by combining all checks that operate on a single or dual dataset, which would probably mean introducing a common supertype for those. Finally, instead of wrapper types, using a Map from datasets to lists of checks might be a bit less verbose.
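For the last point, the Map-based shape might look like the sketch below. `DescribedDataset` and `SingleDatasetCheck` are placeholder names, not the library's real types.

```scala
// Stand-in for a dataset plus its human-readable description
final case class DescribedDataset(name: String)
final case class SingleDatasetCheck(description: String)

// Instead of one wrapper type per dataset/check pairing,
// map each dataset to the checks that run against it
val singleDatasetChecks: Map[DescribedDataset, Seq[SingleDatasetCheck]] = Map(
  DescribedDataset("customers")      -> Seq(SingleDatasetCheck("size over 100")),
  DescribedDataset("customerOrders") -> Seq(SingleDatasetCheck("no null customer_id"))
)
```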

Polish the QC results storage layer

QC Results Storage:

  • resultDescription is not helpful and could be removed. It just summarises info already in the list of QC Results - Done
  • QcType is persisted weirdly "MetricsBasedQualityCheck" : { } - Done
  • Status is persisted weirdly: - Done
              "status" : {
                "Success" : { }
              },
  • Check Description persistence could be nicer - Can always add more fields later to make it easier to parse. Not required right now
"checkDescription" : "Keep all customers. Comparing metric SimpleMetricDescriptor(Size,Some(no filter),None,None) to DistinctValuesMetricDescriptor(List(customer_id),MetricFilter(true,no filter)) using comparator of metrics are equal",
  • datasourceDescription could be neater: "datasourceDescription" : "customers compared to customerOrders" - Done
  • resultDescription could give the names of the datasets for metric comparison checks: "resultDescription" : "metric comparison passed. dsAMetric of LongMetric(22) was compared to dsBMetric of LongMetric(22)", - Done
  • overallStatus is persisted a bit weirdly: - Done
          "overallStatus" : {
            "Success" : { }
          },
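The `"Success" : { }` shape is what automatic ADT derivation in JSON libraries typically produces for case objects. One way to tidy both `status` and `overallStatus` is to encode the case object as a bare JSON string. The sketch below uses a hand-rolled encoder since I'm not assuming any particular JSON library, and the `CheckStatus` members are simplified.

```scala
sealed trait CheckStatus
object CheckStatus {
  case object Success extends CheckStatus
  case object Error extends CheckStatus
}

// Encode the status as "Success" rather than { "Success" : { } }
def statusToJson(status: CheckStatus): String = status match {
  case CheckStatus.Success => "\"Success\""
  case CheckStatus.Error   => "\"Error\""
}
```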

Set up CI - testing only

  • Tests should be run on any PRs and need to pass before merging
  • Tests should be run on merge to master and report build status to README

Set up CI for publishing a release

Work has started on the release-ci branch. However, I need to decide how I want it to work, mainly what the trigger should be.

2 main options:

  1. The trigger is publishing a release, which then does the whole release: pushes jars to Sonatype, sets the version, etc.
  2. The trigger is performing the release from a local machine, and CI just publishes a release on GitHub

A convincing example and demo

  • Setup example in the code for how this library might be used
  • Include a readme
  • Include instructions for setting up ELK stack
  • Include instructions for graphing metrics over time

Rename the project before publicising!!

"Spark" is an Apache trademark of some kind, I believe, so it shouldn't be used in the name of this library. I will need to come up with a new name, rename the git repo and relevant code sections, and upload to a Maven repository under the new name, which may require me to create another ticket with them.

Comparison of arbitrary number of metrics

It turns out DualDatasetMetricCheck is also helpful for comparing different metrics on a single dataset, so maybe the API/naming there could be improved.

The point of this ticket though is that it would also be helpful to compare an arbitrary number of metrics across one or many datasets.

This came up when looking at data where you have the number of losses, the number of wins, and the net. Each of those can be calculated as a metric, and two of them can be compared; however, it isn't possible to compare all three. Could that be done? Possibly with composite metrics (i.e. define a MetricDescriptor that is composed of other MetricDescriptors), possibly with something else like taking a list of arguments rather than a set number, or by using type parameters to make things more flexible.
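A sketch of the list-of-arguments option, using the losses/wins/net example. All names here are hypothetical, and the assertion is a plain function over however many metrics the check receives.

```scala
final case class LongMetric(value: Long)

// A check over an arbitrary number of metrics, rather than exactly two
final case class MultiMetricCheck(
    description: String,
    assertion: Seq[LongMetric] => Boolean
)

// Relates all three metrics at once: wins minus losses should equal the net
val lossesWinsNetCheck = MultiMetricCheck(
  "wins - losses == net",
  {
    case Seq(losses, wins, net) => wins.value - losses.value == net.value
    case _                      => false
  }
)
```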

Add ability to pretty print QC Results

Some code which may or may not help...

def showQcResults(checksSuiteResult: ChecksSuiteResult): String = {
  val (successfulQcResults, unsuccessfulQcResults) =
    checksSuiteResult.checkResults.partition(_.status == CheckStatus.Success)
  s"""
     | SUCCESSFUL QC Checks:
     |
     | - ${successfulQcResults.map(showCheckResult).mkString("\n\n - ")}
     |
     | UNSUCCESSFUL QC Checks:
     |
     | - ${unsuccessfulQcResults.map(showCheckResult).mkString("\n\n - ")}
     |
     |""".stripMargin
}

private def showCheckResult(checkResult: CheckResult): String = {
  import checkResult._
  s"""
     |checkDescription -> $checkDescription
     |resultDescription -> $resultDescription
     |status -> $status
     |datasourceDescription -> $datasourceDescription
     |qcType -> $qcType
     |""".stripMargin
}

Polish the Metrics storage layer

Metrics Storage:

  • Don't show fields which aren't relevant for a metric. e.g. complianceDescription in the below
          "metricDescriptor" : {
            "metricName" : "Size",
            "filterDescription" : "no filter",
            "complianceDescription" : null,
            "onColumns" : null
          },
  • Persist some metricKey which is a concatenation of the different settings for a metric to make working with them in Kibana easier - Not doing this because we can always add a field later once the requirement is clearer, without breaking backwards compatibility
  • Make metricValue persistence a bit neater:
          "metricValue" : {
            "LongMetric" : {
              "value" : 11
            }
          }
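For the last bullet, one option is to flatten the wrapper into a type tag plus value. The sketch below uses a hand-written encoder rather than any particular JSON library, and the `"type"` field name is an assumption.

```scala
sealed trait MetricValue
final case class LongMetric(value: Long) extends MetricValue
final case class DoubleMetric(value: Double) extends MetricValue

// Produces { "type" : "LongMetric", "value" : 11 }
// instead of { "LongMetric" : { "value" : 11 } }
def metricValueToJson(metric: MetricValue): String = metric match {
  case LongMetric(v)   => s"""{ "type" : "LongMetric", "value" : $v }"""
  case DoubleMetric(v) => s"""{ "type" : "DoubleMetric", "value" : $v }"""
}
```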

Add Comparator helper to check values within X percent of each other

Something like...

def withinXPercentComparator(percentage: Double): MetricComparator[LongMetric] =
  MetricComparator[LongMetric](
    s"within${percentage}PercentComparator",
    (metric1, metric2) =>
      metric1 >= metric2 * (100 - percentage) / 100 &&
        metric1 <= metric2 * (100 + percentage) / 100
  )

Implement metric based quality check from scratch with a simple Size metric

Enable a user to:

  • Define a quality check based on the number of rows in the DataFrame and an absolute threshold for that
  • Structure the quality check so that, with multiple similar checks, they can all be computed in one pass over the data
  • Give user ability to store the metrics calculated for this check in a metrics repository
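A sketch of the single-pass idea from the second bullet, with hypothetical names: the row count is computed once (in practice from one `df.count()`), and every size-based check is evaluated against that one value.

```scala
final case class SizeCheck(description: String, assertion: Long => Boolean)

// Compute the Size metric once, then apply all checks to it
def runSizeChecks(rowCount: Long, checks: Seq[SizeCheck]): Seq[(String, Boolean)] =
  checks.map(check => (check.description, check.assertion(rowCount)))

val results = runSizeChecks(
  rowCount = 150L, // one pass over the data yields this single number
  checks = Seq(
    SizeCheck("at least 100 rows", _ >= 100L),
    SizeCheck("fewer than 120 rows", _ < 120L)
  )
)
```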

DualMetricCheck description is hard to read

Currently it is just one big, long string with the metric descriptors embedded in it, which is also inconsistent with SingleMetricCheck, whose check description doesn't include the metric descriptor at all.

The check description should probably be a sealed trait with different subtypes, so that we can present this information more neatly.
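A sketch of that sealed trait, using the customers example from the storage-layer issue above. The type and field names are guesses, not the library's real API.

```scala
sealed trait CheckDescription

// Simple checks keep a plain string
final case class SimpleCheckDescription(description: String) extends CheckDescription

// Dual-metric checks carry structured fields instead of one long string
final case class DualMetricCheckDescription(
    description: String,
    dsAMetricDescriptor: String,
    dsBMetricDescriptor: String,
    metricComparator: String
) extends CheckDescription

val desc: CheckDescription = DualMetricCheckDescription(
  description = "Keep all customers",
  dsAMetricDescriptor = "Size (no filter)",
  dsBMetricDescriptor = "DistinctValues on customer_id",
  metricComparator = "metrics are equal"
)
```

Consumers can then pattern match on the subtype to render whichever fields they need.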
