data-validator's People

Contributors

abs428, c-horn, colindean, dependabot[bot], dougb, github-actions[bot], holdenk, jaygaynor, phpisciuneri, samratmitra-0812

data-validator's Issues

Add support for variable substitution for `minNumRows` in `rowCount` check

We should be able to support the following use case or something analogous to it:

---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true

vars:
  - name: NUM_ROWS
    value: 1000
  
tables:
  - db: census_income
    table: adult
    checks:
      - type: rowCount
        minNumRows: $NUM_ROWS

Currently, trying the above configuration yields a fairly non-descriptive DecodingFailure:

21/01/15 12:24:03 ERROR Main$: Failed to parse config file 'issue.yaml, {}
DecodingFailure(Attempt to decode value on failed cursor, List(DownField(parquetFile), DownArray, DownField(tables)))

It is noted in the documentation that this is not currently supported.
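
A sketch of what the requested substitution could look like, assuming a hypothetical helper that runs before decoding (the name `substituteNumericVar` and its shape are illustrative and not part of data-validator):

import io.circe.Json

// Hypothetical helper (not existing code): if a numeric field such as minNumRows
// holds a "$VAR" string, swap in the variable's value as a JSON number so the
// existing numeric decoder still succeeds.
def substituteNumericVar(field: Json, vars: Map[String, Long]): Json =
  field.asString
    .filter(_.startsWith("$"))
    .flatMap(name => vars.get(name.drop(1)))
    .map(Json.fromLong)
    .getOrElse(field)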

Test fails for ConfigVarSpec

When running `sbt clean assembly` in a terminal, the following tests fail:

[info]   - from Json snippet
[info]   - addEntry works *** FAILED ***
[info]     sut.addEntry(ConfigVarSpec.this.spark, varSub) was true (ConfigVarSpec.scala:70)
[info]   - asJson works
[info]   - var sub in env value *** FAILED ***
[info]     sut.addEntry(ConfigVarSpec.this.spark, varSub) was true (ConfigVarSpec.scala:83)
[info]   - var sub fails when value doesn't exist

Here is the ScalaTest output:

[info] ScalaTest
[info] Run completed in 3 minutes, 2 seconds.
[info] Total number of tests run: 330
[info] Suites: completed 25, aborted 0
[info] Tests: succeeded 328, failed 2, canceled 0, ignored 0, pending 0
[info] *** 2 TESTS FAILED ***
[error] Failed: Total 330, Failed 2, Errors 0, Passed 328
[error] Failed tests:
[error]         com.target.data_validator.validator.ConfigVarSpec
[error] (Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 892 s (14:52), completed 10 Jan, 2022 8:36:27 AM

Thresholds parsed as JSON floats are ignored

Describe the bug

When specifying a check with a threshold that will parse to a JSON float, e.g.

threshold: 0.10 # will be ignored
threshold: 10% # works
threshold: "0.10" # works

the threshold will be ignored.

To Reproduce

Configure a check with:

type: nullCheck
column: foo
threshold: 0.10

or put that into a test in NullCheckSpec.

Expected behavior

Thresholds specified as floats should work.
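
One possible direction for a fix, sketched with circe's Decoder combinators (an illustration only, not the project's actual threshold decoder):

import io.circe.Decoder
import scala.util.Try

// Sketch: accept a bare JSON number, a quoted number, or a percentage string.
val thresholdDecoder: Decoder[Double] =
  Decoder.decodeDouble.or(
    Decoder.decodeString.emapTry { s =>
      val t = s.trim
      if (t.endsWith("%")) Try(t.dropRight(1).toDouble / 100.0)
      else Try(t.toDouble)
    }
  )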

Enable a configuration check using com.target.data_validator.ConfigParser#main

There is an unused com.target.data_validator.ConfigParser#main that could be exposed somehow to enable configuration testing.

def main(args: Array[String]): Unit = {
  logger.info(s"Args[${args.length}]: $args")
  val filename = args(0)
  var error = false
  parseFile(filename, Map.empty) match {
    case Left(pe) => logger.error(s"Failed to parse $filename, ${pe.getMessage}"); error = true
    case Right(config) => logger.info(s"Config: $config")
  }
  System.exit(if (error) 1 else 0)
}

Ideally, this should be a separate mode, but at minimum we could document how to use it locally to validate a configuration.

I think it could be as simple as documenting using it like this:

spark-submit --class com.target.data_validator.ConfigParser data-validator-assembly-${version}.jar config.yaml

Unknown fields in the check section of `.yaml` should cause `WARN` log messages.

While testing stringLengthCheck I accidentally referenced minLength instead of minValue.
This caused configTest() to fail for no apparent reason and took me a really long time to debug because the program was not logging any useful information. configTest() did generate a ValidatorError() event in the eventLog, but the program doesn't write the report.json or HTML report on configTest() failures.

The new Object.fromJson() constructors should log a warning for every unknown field present in the config.
In general, I do not think that unknown fields should be an error, only a warning; this helps keep the config "compatible" across versions.
Maybe create a CLI option for strict config parsing.
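
A sketch of the proposed behavior (the helper and its wiring are assumptions, not existing code):

import io.circe.Json
import org.slf4j.LoggerFactory

// Hypothetical helper: compare the keys present in a check's JSON object against
// the fields the check understands, and log a WARN for anything unexpected.
object UnknownFieldWarner {
  private val logger = LoggerFactory.getLogger(getClass)

  def warnUnknownFields(checkType: String, json: Json, knownFields: Set[String]): Unit = {
    val present = json.asObject.map(_.keys.toSet).getOrElse(Set.empty[String])
    (present -- knownFields).foreach { field =>
      logger.warn(s"Unknown field '$field' in $checkType check; it will be ignored.")
    }
  }
}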

Migrate your Actions workflows to the new syntax

The HCL syntax in GitHub Actions will stop working on September 30, 2019. We are contacting you because you’ve run workflows using HCL syntax in the last week in your account with the following repos: target/data-validator.

To continue using workflows that you created with the HCL syntax, you'll need to migrate the workflow files to the new YAML syntax. Once you have your YAML workflows ready, visit your repositories and follow the prompts to upgrade. Once you upgrade, your HCL workflows will stop working.

https://help.github.com/en/articles/migrating-github-actions-from-hcl-syntax-to-yaml-syntax

NoSuchMethodError when running with test data and test config

$ spark-submit --master "local[*]" $(ls -t target/scala-2.11/data-validator-assembly-*.jar | head -n 1) --config local_validators.yaml --jsonReport target/testreport.json --htmlReport target/testreport.html
20/04/07 18:04:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/04/07 18:04:34 INFO Main$: Logging configured!
20/04/07 18:04:34 INFO Main$: Data Validator
20/04/07 18:04:34 INFO ConfigParser$: Parsing `local_validators.yaml`
20/04/07 18:04:34 INFO ConfigParser$: Attempting to load `local_validators.yaml` from file system
20/04/07 18:04:35 INFO ValidatorConfig: substituteVariables()
20/04/07 18:04:35 INFO Substitutable$class: Substituting filename var: ${WORKDIR}/test.json with `/Users/z003xc4/Source/OSS/target_data-validator/test.json`
20/04/07 18:04:35 INFO Main$: Checking Cli Outputs htmlReport: Some(target/testreport.html) jsonReport: Some(target/testreport.json)
20/04/07 18:04:35 INFO Main$: filename: Some(target/testreport.html) append: false
20/04/07 18:04:35 INFO Main$: CheckFile Some(target/testreport.html)
20/04/07 18:04:35 INFO Main$: Checking file 'target/testreport.html append: false failed: false
20/04/07 18:04:35 INFO Main$: filename: Some(target/testreport.json) append: true
20/04/07 18:04:35 INFO Main$: CheckFile Some(target/testreport.json)
20/04/07 18:04:35 INFO Main$: Checking file 'target/testreport.json append: true failed: false
20/04/07 18:04:35 INFO ValidatorOrcFile: Reading orc file: testData.orc
20/04/07 18:04:36 INFO Main$: Running sparkChecks
20/04/07 18:04:36 INFO ValidatorConfig: Running Quick Checks...
20/04/07 18:04:36 INFO ValidatorOrcFile: Reading orc file: testData.orc
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias$.apply$default$4(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;)Lscala/Option;
        at com.target.data_validator.ValidatorTable.createCountSelect(ValidatorTable.scala:33)
        at com.target.data_validator.ValidatorTable.quickChecks(ValidatorTable.scala:87)
        at com.target.data_validator.ValidatorConfig$$anonfun$quickChecks$1.apply(ValidatorConfig.scala:51)
        at com.target.data_validator.ValidatorConfig$$anonfun$quickChecks$1.apply(ValidatorConfig.scala:51)
        at scala.collection.immutable.List.map(List.scala:284)
        at com.target.data_validator.ValidatorConfig.quickChecks(ValidatorConfig.scala:51)
        at com.target.data_validator.Main$.runSparkChecks(Main.scala:80)
        at com.target.data_validator.Main$$anonfun$2.apply(Main.scala:106)
        at com.target.data_validator.Main$$anonfun$2.apply(Main.scala:100)
        at scala.Option.map(Option.scala:146)
        at com.target.data_validator.Main$.runChecks(Main.scala:99)
        at com.target.data_validator.Main$.loadConfigRun(Main.scala:27)
        at com.target.data_validator.Main$.main(Main.scala:170)
        at com.target.data_validator.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

local_validators.yaml:

---
numKeyCols: 2
numErrorsToReport: 5
detailedErrors: true
vars:
- name: WORKDIR
  env: PWD
tables:
- orcFile: testData.orc
  checks:
  - type: rowCount
    minNumRows: 1000
    #  - type: nullCheck
    #    column: nullCol
outputs:
- filename: ${WORKDIR}/test.json

Environment:

  • Spark version 2.4.4
  • Scala version 2.11.12

Error % not calculated correctly for ColumnBased checks

Example

CheckType: minNumRows
minValue: 1800
actualValue: 1144

In the above example error % should be (1800 - 1144) * 100/1800 = 36.44%
However, it is calculated as (1 * 100)/1144 = 0.09%

A similar issue exists for ColumnMaxCheck.
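
A minimal sketch of the expected calculation (names are illustrative, not the project's actual fields): the error percentage should be relative to the expected value, not the actual count.

// errorPercent(1800, 1144) == 36.44..., matching the example above
def errorPercent(expected: Double, actual: Double): Double =
  (expected - actual) * 100.0 / expected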

distinctCountCheck as a validator

It would be great to have a distinctCountCheck validator that checks that the number of distinct values in a column of a given table matches a user-provided value.
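
A sketch of the underlying Spark aggregation such a validator could run (not an existing data-validator API; names are illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.countDistinct

// Hypothetical evaluation: count distinct values in the column and compare
// against the user-provided expected value.
def distinctCountPasses(df: DataFrame, column: String, expected: Long): Boolean =
  df.select(countDistinct(df(column))).head.getLong(0) == expected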

SQL variable substitution fails when result is a double

Observed, for example, when combined with ColumnSumCheck. See example config below:

---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true

vars:
  - name: MAX_AGE
    sql: SELECT CAST(MAX(age) AS DOUBLE) FROM census_income.adult

outputs:
  - filename: report.json
    append: false

email:
  smtpHost: smtp.example.com
  subject: Data Validation Summary
  from: [email protected]
  to:
    - [email protected]

tables:
  - db: census_income
    table: adult
    checks:
      - type: columnSumCheck
        column: age
        minValue: $MAX_AGE
        inclusive: true

yields:

...
...
21/01/14 09:12:01 ERROR JsonUtils$: Unimplemented dataType 'double' in column: CAST(max(age) AS DOUBLE) Please report this as a bug.
21/01/14 09:12:01 INFO ValidatorConfig: substituteVariables()
21/01/14 09:12:02 INFO Substitutable$class: Substituting Json minValue Json: "$MAX_AGE" with `null`
...
...
21/01/14 09:12:02 ERROR ColumnSumCheck$$anonfun$configCheck$1: 'minValue' defined but type is not a Number, is: Null
21/01/14 09:12:02 ERROR ValidatorTable$$anonfun$1: ConfigCheck failed for HiveTable:`census_income.adult`
...
...
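
The log points at the "Unimplemented dataType" branch: the conversion of the SQL result to JSON does not cover doubles, so the substituted value becomes null. A sketch of the kind of mapping that would need to handle DoubleType (illustrative only; the project's JsonUtils differs):

import io.circe.Json
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Sketch: map one cell of a Spark Row to circe Json, covering DoubleType.
def cellToJson(row: Row, idx: Int, dataType: DataType): Json = dataType match {
  case IntegerType => Json.fromInt(row.getInt(idx))
  case LongType    => Json.fromLong(row.getLong(idx))
  case DoubleType  => Json.fromDoubleOrNull(row.getDouble(idx))
  case StringType  => Json.fromString(row.getString(idx))
  case _           => Json.Null // unhandled types currently fall through like this
}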

[SECURITY] Releases are built/executed/released in the context of insecure/untrusted code

CWE-829: Inclusion of Functionality from Untrusted Control Sphere
CWE-494: Download of Code Without Integrity Check

The build files indicate that this project is resolving dependencies over HTTP instead of HTTPS. Any of these artifacts could have been intercepted in a MITM attack and maliciously compromised, infecting the build artifacts that were produced. Additionally, if any of these JARs or other dependencies were compromised, any developers using these could continue to be infected past updating to fix this.

This vulnerability has a CVSS v3.0 Base Score of 8.1/10
https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator?vector=AV:N/AC:H/PR:N/UI:N/S:U/C:H/I:H/A:H

This isn't just theoretical

POC code has existed since 2014 to maliciously compromise a JAR file in flight.

MITM attacks are increasingly common.

Source Locations

resolvers += "Concurrent Conjars repository" at "http://conjars.org/repo"

Allow format+options to be passed before Hive query

We have a use case which necessitates constructing queries a certain way:

val df = spark.read.format("internal_format").option("database", "foo").load("select * from myTable")
// … Spark magic …
df.write.format("internal_format").option("database", "my_select_database").save()

DV doesn't have a way to let the user pass an arbitrary format or a map of options in this manner. A particular Target-internal use case requires this, and DV cannot be used for it until these are supported.

A proposed solution is to allow format: String and options: Object|Map[String, String] properties:

tables:
  - db: census_income
    table: adult
    format: internal_format
    options:
      database: census_income
      hive.vectorized.execution.reduce.enabled: "false"
    keyColumns:
      - age
      - occupation
    condition: educationNum >= 5
    checks:
      - type: rowCount
        minNumRows: 50000
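
A sketch of how the proposed fields could be consumed when the source is read (hypothetical; the function and its shape are illustrative, not current ValidatorTable behavior):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical reader: apply the configured format and options before loading.
def readSource(
  spark: SparkSession,
  query: String,
  format: Option[String],
  options: Map[String, String]
): DataFrame = {
  val base = spark.read.options(options)
  val reader = format.map(base.format).getOrElse(base)
  reader.load(query)
}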

Streamline configuration for the same test applied to multiple columns

Currently, if I wanted to check for null values in each of the columns (age, occupation) of a table, the checks: section of the configuration file would contain something like this:

- type: nullCheck
  column: age

- type: nullCheck
  column: occupation

Ideally, we should support a more streamlined config. Something like:

- type: nullCheck
  columns: age, occupation

We would need to decide how to handle optional parameters in the streamlined case. One option is that we do not support streamlining if any optional parameters are specified:

- type: nullCheck
  column: age
  threshold: 1%

- type: nullCheck
  column: occupation
  threshold: 5%

Another option would be to allow additional parameters to be streamlined and applied in the same order as the specified columns:

- type: nullCheck
  columns: age, occupation
  thresholds: 1%, 5%
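
Whichever option is chosen, the streamlined form could be expanded at parse time into the existing single-column checks. A sketch with hypothetical shapes (these are not the project's actual classes):

// Hypothetical shapes for illustration only.
final case class StreamlinedCheck(checkType: String, columns: Seq[String], thresholds: Seq[String])
final case class SingleCheck(checkType: String, column: String, threshold: Option[String])

// Expand one streamlined entry into per-column checks, pairing thresholds
// positionally with columns when both are given.
def expand(check: StreamlinedCheck): Seq[SingleCheck] =
  check.columns.zipWithIndex.map { case (col, i) =>
    SingleCheck(check.checkType, col, check.thresholds.lift(i))
  }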

Ratchet up to newer baseline

Target's internal baseline is rebasing onto the following:

  • Ubuntu
  • JDK 17
  • Scala 2.13 (or 2.12)
  • Spark 3.5.1

#166 will handle Spark 3.5.1 and sets the stage for JDK 17. It enables Scala 2.12, too, but keeps Scala 2.11. We'll want to roll off Scala 2.11 and onboard 2.13.

I don't think we've got anything that cares about the underlying distro.

After #166 is merged, we'll need to do some testing and work to ensure operability on JDK 17, including bumping CI workflows.

java.lang.IllegalArgumentException when using parquet file

When trying to run a config check on a parquet file, the following error can be seen:

root@lubuntu:/home/jyoti/Spark# /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit --num-executors 10 --executor-cores 2 data-validator-assembly-20220111T034941.jar --config config.yaml
22/01/11 11:50:53 WARN Utils: Your hostname, lubuntu resolves to a loopback address: 127.0.1.1; using 192.168.195.131 instead (on interface ens33)
22/01/11 11:50:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/01/11 11:50:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/11 11:50:59 INFO Main$: Logging configured!
22/01/11 11:51:00 INFO Main$: Data Validator
22/01/11 11:51:01 INFO ConfigParser$: Parsing `config.yaml`
22/01/11 11:51:01 INFO ConfigParser$: Attempting to load `config.yaml` from file system
Exception in thread "main" java.lang.ExceptionInInitializerError
	at com.target.data_validator.validator.RowBased.<init>(RowBased.scala:11)
	at com.target.data_validator.validator.NullCheck.<init>(NullCheck.scala:12)
	at com.target.data_validator.validator.NullCheck$.fromJson(NullCheck.scala:37)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$decoders$2.apply(JsonDecoders.scala:16)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$decoders$2.apply(JsonDecoders.scala:16)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$2.apply(JsonDecoders.scala:32)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$2.apply(JsonDecoders.scala:32)
	at scala.Option.map(Option.scala:230)
	at com.target.data_validator.validator.JsonDecoders$$anon$7.com$target$data_validator$validator$JsonDecoders$$anon$$getDecoder(JsonDecoders.scala:32)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$apply$3.apply(JsonDecoders.scala:27)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$apply$3.apply(JsonDecoders.scala:27)
	at cats.syntax.EitherOps$.flatMap$extension(either.scala:149)
	at com.target.data_validator.validator.JsonDecoders$$anon$7.apply(JsonDecoders.scala:27)
	at io.circe.SeqDecoder.apply(SeqDecoder.scala:17)
	at io.circe.Decoder$class.tryDecode(Decoder.scala:36)
	at io.circe.SeqDecoder.tryDecode(SeqDecoder.scala:6)
	at com.target.data_validator.ConfigParser$anon$importedDecoder$macro$15$1$$anon$6.apply(ConfigParser.scala:21)
	at io.circe.generic.decoding.DerivedDecoder$$anon$1.apply(DerivedDecoder.scala:13)
	at io.circe.Decoder$$anon$28.apply(Decoder.scala:178)
	at io.circe.Decoder$$anon$28.apply(Decoder.scala:178)
	at io.circe.SeqDecoder.apply(SeqDecoder.scala:17)
	at io.circe.Decoder$class.tryDecode(Decoder.scala:36)
	at io.circe.SeqDecoder.tryDecode(SeqDecoder.scala:6)
	at com.target.data_validator.ConfigParser$anon$importedDecoder$macro$81$1$$anon$10.apply(ConfigParser.scala:28)
	at io.circe.generic.decoding.DerivedDecoder$$anon$1.apply(DerivedDecoder.scala:13)
	at io.circe.Json.as(Json.scala:106)
	at com.target.data_validator.ConfigParser$.configFromJson(ConfigParser.scala:28)
	at com.target.data_validator.ConfigParser$$anonfun$parse$1.apply(ConfigParser.scala:65)
	at com.target.data_validator.ConfigParser$$anonfun$parse$1.apply(ConfigParser.scala:65)
	at cats.syntax.EitherOps$.flatMap$extension(either.scala:149)
	at com.target.data_validator.ConfigParser$.parse(ConfigParser.scala:65)
	at com.target.data_validator.ConfigParser$.parseFile(ConfigParser.scala:60)
	at com.target.data_validator.Main$.loadConfigRun(Main.scala:23)
	at com.target.data_validator.Main$.main(Main.scala:171)
	at com.target.data_validator.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to bigint, but class Integer found.
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:219)
	at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:296)
	at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:144)
	at com.target.data_validator.validator.ValidatorBase$.<init>(ValidatorBase.scala:139)
	at com.target.data_validator.validator.ValidatorBase$.<clinit>(ValidatorBase.scala)
	... 47 more

Ran a spark-submit job as follows:

spark-submit --num-executors 10 --executor-cores 2 data-validator-assembly-20220111T034941.jar --config config.yaml

The config.yaml file has the following content:

numKeyCols: 2
numErrorsToReport: 742

tables:
  - parquetFile: /home/jyoti/Spark/userdata1.parquet
    checks:
      - type: nullCheck
        column: salary

I got userdata1.parquet from the following GitHub link:
https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet

Environment Details:

  • Latest source code: data-validator-0.13.0
  • Lubuntu 18.04 LTS x64 on VMware Player
  • 4 CPU cores and 2 GB RAM

Java version:

yoti@lubuntu:~$ java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

lsb_release output:

jyoti@lubuntu:~$ lsb_release -a 2>/dev/null
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04 LTS
Release:	18.04
Codename:	bionic

uname -s:

jyoti@lubuntu:~$ uname -s
Linux

sbt -version:

root@lubuntu:/home/jyoti/Spark# sbt -version
downloading sbt launcher 1.6.1
[info] [launcher] getting org.scala-sbt sbt 1.6.1  (this may take some time)...
[info] [launcher] getting Scala 2.12.15 (for sbt)...
sbt version in this project: 1.6.1
sbt script version: 1.6.1

Please let me know if you need anything else.

Attempt to send email should be retried if it fails

Currently, if sending email fails because the email server is temporarily offline or overloaded, the only course of action is to rerun the whole validation. This can be very expensive, and it may require manual intervention if the program is running as part of an automated workflow.

It would be better if the program detected the error in sending email and did its own wait-and-retry loop. This would be pretty cheap and much better than failing.
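
A minimal sketch of such a wait-and-retry loop (illustrative; `sendEmail` stands in for the project's mailer call, and the attempt count and delay are arbitrary):

import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Retry a side-effecting send a few times with a fixed delay before giving up.
@tailrec
def sendWithRetry(sendEmail: () => Unit, attempts: Int = 3, waitMillis: Long = 30000L): Boolean =
  Try(sendEmail()) match {
    case Success(_) => true
    case Failure(_) if attempts > 1 =>
      Thread.sleep(waitMillis)
      sendWithRetry(sendEmail, attempts - 1, waitMillis)
    case Failure(_) => false
  }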

Range check configuration at debug log level

The range check configuration should be logged at debug level to be consistent with how other row-based checks are logged.

---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true

outputs:
  - filename: report.json
    append: false

tables:
  - db: census_income
    table: adult
    checks:
      - type: rangeCheck
        column: age
        minValue: 40
        maxValue: 50

yields:

21/01/14 14:34:43 INFO Main$: Logging configured!
21/01/14 14:34:43 INFO Main$: Data Validator
21/01/14 14:34:43 INFO ConfigParser$: Parsing `issue.yaml`
21/01/14 14:34:43 INFO ConfigParser$: Attempting to load `issue.yaml` from file system
21/01/14 14:34:43 INFO RangeCheck$$anonfun$fromJson$1: RangeCheckJson: {
  "type" : "rangeCheck",
  "column" : "age",
  "minValue" : 4e1,
  "maxValue" : 5e1
}
...
...

Drop env dump from JSON output

data-validator may expose secrets held in environment variables in the output JSON.

private def envToJson: Json = {
  val env = System.getenv.asScala.toList.map(x => (x._1, Json.fromString(x._2)))
  Json.obj(env: _*)
}

dumps the current environment into the output JSON. It reaches the report via

("runtimeInfo", ValidatorConfig.runtimeInfoJson(spark)),

which calls

private def runtimeInfoJson(spark: SparkSession): Json = {

and includes the environment in the runtime info.

It's safe to dump variables that data-validator accesses, but it's unwise to dump everything.
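
One safer alternative, sketched under the assumption that the set of variables data-validator actually referenced is tracked somewhere (`accessedVars` is hypothetical), is to dump only those:

import io.circe.Json
import scala.collection.JavaConverters._

// Hypothetical filtered dump: only include environment variables the run referenced.
private def envToJson(accessedVars: Set[String]): Json = {
  val env = System.getenv.asScala.toList
    .filter { case (k, _) => accessedVars.contains(k) }
    .map { case (k, v) => (k, Json.fromString(v)) }
  Json.obj(env: _*)
}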

Add a 'sum of numeric column' check

Acceptance Criteria:

  • Implement the sum of a numeric column
  • Usage should be documented in the README
  • Test coverage should be added and passing
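
The underlying aggregation is straightforward; a sketch of what such a check could compute (illustrative, not the final API), assuming a non-empty table:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

// Hypothetical helper: sum a numeric column and return it as a double for
// comparison against configured min/max bounds.
def columnSum(df: DataFrame, column: String): Double =
  df.agg(sum(df(column)).cast("double")).head.getDouble(0)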

Refactor tests using `traits`

@colindean made some good suggestions in #14 around refactoring tests using traits (See comment)
I tried to create a few new utility functions in #13, but I'd like to see if we can do something like Colin suggested, and make a pass through the tests and use any new traits or functions to make them more concise.
I suspect we can reduce the duplicate code in the tests, greatly reduce the test SLOC and make it easier to develop tests.
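
For example, a shared trait along these lines could centralize the SparkSession setup that many specs repeat (a sketch, assuming a local-mode session is acceptable for tests):

import org.apache.spark.sql.SparkSession

// Sketch of a reusable test fixture: specs mix this in instead of building
// their own SparkSession.
trait TestingSparkSession {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("data-validator-tests")
    .getOrCreate()
}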

createKeySelect log msg is potentially redundant and should not be at error level

For example, given:

---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true

outputs:
  - filename: report.json
    append: false

tables:
  - db: census_income
    table: adult
    checks:
      - type: rangeCheck
        column: age
        minValue: 40
        maxValue: 50

In the output you will see:

...
...
21/01/14 14:35:32 ERROR ValidatorTable: createKeySelect: age, workclass keyColumns: None
...
...

This is not an error; it merely reports which keyColumns will be used for ValidatorQuickCheckError details. When keyColumns are specified in the configuration, they end up being listed twice.

Unit tests failures after adding dependency on HiveWarehouseConnector

In order to enable data-validator for Hadoop 3, a dependency on HiveWarehouseConnector was added. After this, unit tests started failing with the following exception:

java.lang.SecurityException: class "org.codehaus.janino.JaninoRuntimeException"'s signer information does not match signer information of other classes in the same package
at java.lang.ClassLoader.checkCerts(ClassLoader.java:898)
at java.lang.ClassLoader.preDefineClass(ClassLoader.java:668)
at java.lang.ClassLoader.defineClass(ClassLoader.java:761)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:197)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)

As suggested in this SO thread, the HiveWarehouseConnector jar was added to the end of the classpath. After that, a NoClassDefFoundError showed up.

java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
[error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
[error]     at org.apache.spark.SparkContext.withScope(SparkContext.scala:693)
[error]     at org.apache.spark.SparkContext.parallelize(SparkContext.scala:710)

This seems like a typical jar-hell issue, and it affects only the unit tests: when the unit test runs were skipped, data-validator was successfully deployed and ran fine on both Hadoop 2 and Hadoop 3.

Rename "tables" concept

Is your feature request related to a problem? Please describe.

We've got ValidatorTable and tables in the config, but they're not really tables in the case of orc or parquet files. Let's get rid of the tables moniker and choose something else.

Describe the solution you'd like

ValidatorDataSource and sources might be more appropriate.

N.b. this would be a breaking change.

Move back to Travis CI

Recent changes to GitHub Actions disable it for paid orgs that are on older plans. We have to go back to Travis since this is not likely to be resolved amenably anytime soon.
