
data-flare's Issues

Code should not compile but does issue

This code should not compile, due to the mismatch of metric types between the SumValuesMetric and the MetricComparator, which can lead to runtime errors (it will attempt a cast). However, it currently does compile :(

DualMetricBasedCheck(
  SumValuesMetric[DoubleMetric],
  SumValuesMetric[DoubleMetric],
  "change in unprojected patients",
  MetricComparator[LongMetric](???, ???)
)
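One way to rule this out at compile time is to give the check a single type parameter that both metric descriptors and the comparator must share. This is only a sketch, not data-flare's actual API: the `MetricValue` hierarchy, `MetricDescriptor` trait, and the simplified fields below are assumptions made for illustration.

```scala
sealed trait MetricValue
final case class LongMetric(value: Long) extends MetricValue
final case class DoubleMetric(value: Double) extends MetricValue

// Simplified stand-ins for the real metric descriptors
trait MetricDescriptor[MV <: MetricValue] { def name: String }
case object SizeMetric extends MetricDescriptor[LongMetric] { val name = "Size" }
case object SumValuesMetric extends MetricDescriptor[DoubleMetric] { val name = "SumValues" }

final case class MetricComparator[MV <: MetricValue](
    description: String,
    compare: (MV, MV) => Boolean
)

// Both descriptors and the comparator must agree on the metric type MV
final case class DualMetricBasedCheck[MV <: MetricValue](
    dsAMetric: MetricDescriptor[MV],
    dsBMetric: MetricDescriptor[MV],
    checkDescription: String,
    comparator: MetricComparator[MV]
)

// Compiles: everything is a LongMetric
val ok = DualMetricBasedCheck(
  SizeMetric,
  SizeMetric,
  "sizes equal",
  MetricComparator[LongMetric]("equal", _.value == _.value)
)

// Would NOT compile: DoubleMetric descriptors with a LongMetric comparator
// DualMetricBasedCheck(SumValuesMetric, SumValuesMetric, "sums equal",
//   MetricComparator[LongMetric]("equal", _.value == _.value))
```

With this shape the mismatched example above becomes a type error instead of a runtime cast failure.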

Polish the external API

Issues identified so far:

  • No need for the QualityChecker object - you should be able to run a ChecksSuite directly - MR created
    • Remove QualityChecker
    • Pass the qcResultsRepository directly to a ChecksSuite instead
    • Persist QC results in the ChecksSuite.run method
  • Could RawCheckResult be a bit neater or more nicely named? - I still can't come up with anything nicer at the moment
  • Make MetricComparator curried for neater syntax - can't do this without breaking the case class. An additional constructor results in a double-definition error, for which I can only find an ugly workaround. Just going to have to live with this for now
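The first group of bullets could land on something like the sketch below. It is hypothetical: the `QcResultsRepository` trait, the in-memory stand-in, and the simplified `CheckResult` fields are all assumptions, shown only to illustrate moving persistence into `ChecksSuite.run`.

```scala
final case class CheckResult(description: String, passed: Boolean)
final case class ChecksSuiteResult(checkResults: Seq[CheckResult])

trait QcResultsRepository { def save(result: ChecksSuiteResult): Unit }

// Simple in-memory repository for illustration
class InMemoryQcResultsRepository extends QcResultsRepository {
  var saved: List[ChecksSuiteResult] = Nil
  def save(result: ChecksSuiteResult): Unit = saved = result :: saved
}

// The suite owns its repository, so persistence happens inside run()
// rather than in a separate QualityChecker object
final case class ChecksSuite(
    checks: Seq[() => CheckResult],
    qcResultsRepository: QcResultsRepository
) {
  def run(): ChecksSuiteResult = {
    val result = ChecksSuiteResult(checks.map(_.apply()))
    qcResultsRepository.save(result)
    result
  }
}

val repo = new InMemoryQcResultsRepository
val suiteResult = ChecksSuite(
  Seq(() => CheckResult("size check", passed = true)),
  repo
).run()
```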

API Naming

There are too many overly verbose names you need to use when creating a ChecksSuite. They should be considered as a whole and renamed. I also believe the number of input parameters to a ChecksSuite could be reduced by combining all checks that operate on a single or dual dataset, which would probably mean introducing a common supertype for those. Finally, instead of wrapper types, using a Map from datasets to lists of checks might be a bit less verbose.
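For the last point, the Map-based shape might look like the sketch below. `DescribedDataset` and `SingleDatasetCheck` are placeholder names, not the library's real types.

```scala
// Stand-in for a dataset plus its human-readable description
final case class DescribedDataset(name: String)
final case class SingleDatasetCheck(description: String)

// Instead of one wrapper type per dataset/check pairing,
// map each dataset to the checks that run against it
val singleDatasetChecks: Map[DescribedDataset, Seq[SingleDatasetCheck]] = Map(
  DescribedDataset("customers")      -> Seq(SingleDatasetCheck("size over 100")),
  DescribedDataset("customerOrders") -> Seq(SingleDatasetCheck("no null customer_id"))
)
```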

Polish the QC results storage layer

QC Results Storage:

  • resultDescription is not helpful and could be removed. It just summarises info already in the list of QC Results - Done
  • QcType is persisted weirdly "MetricsBasedQualityCheck" : { } - Done
  • Status is persisted weirdly: - Done
              "status" : {
                "Success" : { }
              },
  • Check Description persistence could be nicer - Can always add more fields later to make it easier to parse. Not required right now
"checkDescription" : "Keep all customers. Comparing metric SimpleMetricDescriptor(Size,Some(no filter),None,None) to DistinctValuesMetricDescriptor(List(customer_id),MetricFilter(true,no filter)) using comparator of metrics are equal",
  • datasourceDescription could be neater: "datasourceDescription" : "customers compared to customerOrders" - Done
  • resultDescription could give the names of the datasets for metric comparison checks: "resultDescription" : "metric comparison passed. dsAMetric of LongMetric(22) was compared to dsBMetric of LongMetric(22)", - Done
  • overallStatus is persisted a bit weirdly: - Done
          "overallStatus" : {
            "Success" : { }
          },
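The `"Success" : { }` shape is what automatic ADT derivation in JSON libraries typically produces for case objects. One way to tidy both `status` and `overallStatus` is to encode the case object as a bare JSON string. The sketch below uses a hand-rolled encoder since I'm not assuming any particular JSON library, and the `CheckStatus` members are simplified.

```scala
sealed trait CheckStatus
object CheckStatus {
  case object Success extends CheckStatus
  case object Error extends CheckStatus
}

// Encode the status as "Success" rather than { "Success" : { } }
def statusToJson(status: CheckStatus): String = status match {
  case CheckStatus.Success => "\"Success\""
  case CheckStatus.Error   => "\"Error\""
}
```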

Set up CI - testing only

  • Tests should be run on any PRs and need to pass before merging
  • Tests should be run on merge to master and report build status to README

Set up CI for publishing a release

Work has started on the release-ci branch. However, I need to decide how I want it to work, mainly what the trigger should be.

2 main options:

  1. The trigger is publishing a release, which then does the whole release: pushes jars to Sonatype, sets the version, etc.
  2. The trigger is performing the release from a local machine, and CI just publishes a release on GitHub

A convincing example and demo

  • Setup example in the code for how this library might be used
  • Include a readme
  • Include instructions for setting up ELK stack
  • Include instructions for graphing metrics over time

Rename the project before publicising!!

"Spark" is an Apache trademark of some kind, I believe, so it shouldn't be used in the name of this library. I will need to come up with a new name, rename the git repo and relevant code sections, and upload to a Maven repository under the new name, which may require me to create another ticket with them.

Comparison of arbitrary number of metrics

It turns out DualDatasetMetricCheck is also helpful for comparing different metrics on a single dataset, so maybe the API/naming there could be improved.

The point of this ticket though is that it would also be helpful to compare an arbitrary number of metrics across one or many datasets.

This came up when looking at data where you have the number of losses, the number of wins, and the net. Each of those can be calculated as a metric, and two of them can be compared; however, it isn't possible to compare all three. Could that be done? Possibly with composite metrics (i.e. define a MetricDescriptor that is composed of other MetricDescriptors), possibly with something else like taking a list of arguments rather than a set number, or by using type parameters to make things more flexible.
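A sketch of the list-of-arguments option, using the losses/wins/net example. All names here are hypothetical, and the assertion is a plain function over however many metrics the check receives.

```scala
final case class LongMetric(value: Long)

// A check over an arbitrary number of metrics, rather than exactly two
final case class MultiMetricCheck(
    description: String,
    assertion: Seq[LongMetric] => Boolean
)

// Relates all three metrics at once: wins minus losses should equal the net
val lossesWinsNetCheck = MultiMetricCheck(
  "wins - losses == net",
  {
    case Seq(losses, wins, net) => wins.value - losses.value == net.value
    case _                      => false
  }
)
```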

Add ability to pretty print QC Results

Some code which may or may not help...

def showQcResults(checksSuiteResult: ChecksSuiteResult): String = {
  val (successfulQcResults, unsuccessfulQcResults) =
    checksSuiteResult.checkResults.partition(_.status == CheckStatus.Success)
  s"""
     | SUCCESSFUL QC Checks:
     |
     | - ${successfulQcResults.map(showCheckResult).mkString("\n\n - ")}
     |
     | UNSUCCESSFUL QC Checks:
     |
     | - ${unsuccessfulQcResults.map(showCheckResult).mkString("\n\n - ")}
     |
     |""".stripMargin
}

private def showCheckResult(checkResult: CheckResult): String = {
  import checkResult._
  s"""
     |checkDescription -> $checkDescription
     |resultDescription -> $resultDescription
     |status -> $status
     |datasourceDescription -> $datasourceDescription
     |qcType -> $qcType
     |""".stripMargin
}

Polish the Metrics storage layer

Metrics Storage:

  • Don't show fields which aren't relevant for a metric. e.g. complianceDescription in the below
          "metricDescriptor" : {
            "metricName" : "Size",
            "filterDescription" : "no filter",
            "complianceDescription" : null,
            "onColumns" : null
          },
  • Persist some metricKey which is a concatenation of the different settings for a metric to make working with them in Kibana easier - Not doing this because we can always add a field later once the requirement is clearer, without breaking backwards compatibility
  • Make metricValue persistence a bit neater:
          "metricValue" : {
            "LongMetric" : {
              "value" : 11
            }
          }
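For the last bullet, one option is to flatten the wrapper into a type tag plus value. The sketch below uses a hand-written encoder rather than any particular JSON library, and the `"type"` field name is an assumption.

```scala
sealed trait MetricValue
final case class LongMetric(value: Long) extends MetricValue
final case class DoubleMetric(value: Double) extends MetricValue

// Produces { "type" : "LongMetric", "value" : 11 }
// instead of { "LongMetric" : { "value" : 11 } }
def metricValueToJson(metric: MetricValue): String = metric match {
  case LongMetric(v)   => s"""{ "type" : "LongMetric", "value" : $v }"""
  case DoubleMetric(v) => s"""{ "type" : "DoubleMetric", "value" : $v }"""
}
```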

Add Comparator helper to check values within X percent of each other

Something like...

def withinXPercentComparator(percentage: Double): MetricComparator[LongMetric] =
  MetricComparator[LongMetric](
    s"within${percentage}PercentComparator",
    (metric1, metric2) =>
      metric1 >= metric2 * (100 - percentage) / 100 &&
        metric1 <= metric2 * (100 + percentage) / 100
  )

Implement metric based quality check from scratch with a simple Size metric

Enable a user to:

  • Define a quality check based on the number of rows in the DataFrame and an absolute threshold for that
  • Structure the quality check so that, with multiple similar checks, they can all be computed in one pass over the data
  • Give user ability to store the metrics calculated for this check in a metrics repository
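A sketch of the single-pass idea from the second bullet, with hypothetical names: the row count is computed once (in practice from one `df.count()`), and every size-based check is evaluated against that one value.

```scala
final case class SizeCheck(description: String, assertion: Long => Boolean)

// Compute the Size metric once, then apply all checks to it
def runSizeChecks(rowCount: Long, checks: Seq[SizeCheck]): Seq[(String, Boolean)] =
  checks.map(check => (check.description, check.assertion(rowCount)))

val results = runSizeChecks(
  rowCount = 150L, // one pass over the data yields this single number
  checks = Seq(
    SizeCheck("at least 100 rows", _ >= 100L),
    SizeCheck("fewer than 120 rows", _ < 120L)
  )
)
```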

DualMetricCheck description is hard to read

Currently it is just one big, long string with the metric descriptors embedded in it, which is also inconsistent with SingleMetricCheck, whose check description doesn't include the metric descriptor at all.

The check description should probably be a sealed trait with different subtypes, so that we can present this information more neatly.
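A sketch of that sealed trait, using the customers example from the storage-layer issue above. The type and field names are guesses, not the library's real API.

```scala
sealed trait CheckDescription

// Simple checks keep a plain string
final case class SimpleCheckDescription(description: String) extends CheckDescription

// Dual-metric checks carry structured fields instead of one long string
final case class DualMetricCheckDescription(
    description: String,
    dsAMetricDescriptor: String,
    dsBMetricDescriptor: String,
    metricComparator: String
) extends CheckDescription

val desc: CheckDescription = DualMetricCheckDescription(
  description = "Keep all customers",
  dsAMetricDescriptor = "Size (no filter)",
  dsBMetricDescriptor = "DistinctValues on customer_id",
  metricComparator = "metrics are equal"
)
```

Consumers can then pattern match on the subtype to render whichever fields they need.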
