data-validator's People

Contributors

abs428, c-horn, colindean, dependabot[bot], dougb, github-actions[bot], holdenk, jaygaynor, phpisciuneri, samratmitra-0812

data-validator's Issues

Add support for variable substitution for `minNumRows` in `rowCount` check

We should be able to support the following use case or something analogous to it:

---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true

vars:
  - name: NUM_ROWS
    value: 1000
  
tables:
  - db: census_income
    table: adult
    checks:
      - type: rowCount
        minNumRows: $NUM_ROWS

Currently, trying the above configuration yields a fairly non-descriptive DecodingFailure:

21/01/15 12:24:03 ERROR Main$: Failed to parse config file 'issue.yaml, {}
DecodingFailure(Attempt to decode value on failed cursor, List(DownField(parquetFile), DownArray, DownField(tables)))

It is noted in the documentation that this is not currently supported.
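
A sketch of what the requested substitution could look like, assuming a hypothetical helper that runs before decoding (the name `substituteNumericVar` and its shape are illustrative and not part of data-validator):

import io.circe.Json

// Hypothetical helper (not existing code): if a numeric field such as minNumRows
// holds a "$VAR" string, swap in the variable's value as a JSON number so the
// existing numeric decoder still succeeds.
def substituteNumericVar(field: Json, vars: Map[String, Long]): Json =
  field.asString
    .filter(_.startsWith("$"))
    .flatMap(name => vars.get(name.drop(1)))
    .map(Json.fromLong)
    .getOrElse(field)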

Test fails for ConfigVarSpec

When running `sbt clean assembly` in a terminal, the following tests fail:

[info]   - from Json snippet
[info]   - addEntry works *** FAILED ***
[info]     sut.addEntry(ConfigVarSpec.this.spark, varSub) was true (ConfigVarSpec.scala:70)
[info]   - asJson works
[info]   - var sub in env value *** FAILED ***
[info]     sut.addEntry(ConfigVarSpec.this.spark, varSub) was true (ConfigVarSpec.scala:83)
[info]   - var sub fails when value doesn't exist

Here is the ScalaTest output:

[info] ScalaTest
[info] Run completed in 3 minutes, 2 seconds.
[info] Total number of tests run: 330
[info] Suites: completed 25, aborted 0
[info] Tests: succeeded 328, failed 2, canceled 0, ignored 0, pending 0
[info] *** 2 TESTS FAILED ***
[error] Failed: Total 330, Failed 2, Errors 0, Passed 328
[error] Failed tests:
[error]         com.target.data_validator.validator.ConfigVarSpec
[error] (Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 892 s (14:52), completed 10 Jan, 2022 8:36:27 AM

Thresholds parsed as JSON floats are ignored

Describe the bug

When specifying a check with a threshold that will parse to a JSON float, e.g.

threshold: 0.10 # will be ignored
threshold: 10% # works
threshold: "0.10" # works

the threshold will be ignored.

To Reproduce

Configure a check with:

type: nullCheck
column: foo
threshold: 0.10

or put that into a test in NullCheckSpec.

Expected behavior

Thresholds specified as floats should work.
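
One possible direction for a fix, sketched with circe's Decoder combinators (an illustration only, not the project's actual threshold decoder):

import io.circe.Decoder
import scala.util.Try

// Sketch: accept a bare JSON number, a quoted number, or a percentage string.
val thresholdDecoder: Decoder[Double] =
  Decoder.decodeDouble.or(
    Decoder.decodeString.emapTry { s =>
      val t = s.trim
      if (t.endsWith("%")) Try(t.dropRight(1).toDouble / 100.0)
      else Try(t.toDouble)
    }
  )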

Enable a configuration check using com.target.data_validator.ConfigParser#main

There is an unused com.target.data_validator.ConfigParser#main that could be exposed somehow to enable configuration testing.

def main(args: Array[String]): Unit = {
  logger.info(s"Args[${args.length}]: $args")
  val filename = args(0)
  var error = false
  parseFile(filename, Map.empty) match {
    case Left(pe) => logger.error(s"Failed to parse $filename, ${pe.getMessage}"); error = true
    case Right(config) => logger.info(s"Config: $config")
  }
  System.exit(if (error) 1 else 0)
}

Ideally, this should be a separate mode, but at minimum we could document how to use it locally to validate a configuration.

I think it could be as simple as documenting using it like this:

spark-submit --class com.target.data_validator.ConfigParser data-validator-assembly-${version}.jar config.yaml

Unknown fields in the check section of `.yaml` should cause `WARN` log messages.

While testing stringLengthCheck I accidentally referenced minLength instead of minValue.
This caused configTest() to fail for no apparent reason and took me a really long time to debug because the program was not logging any useful information. configTest() did generate a ValidatorError() event in the eventLog, but the program doesn't write the report.json or HTML report on configTest() failures.

The new Object.fromJson() constructors should log a warning for every unknown field present in the config.
In general, I do not think that unknown fields should be an error, only a warning; this helps keep the config "compatible" across versions.
Maybe create a CLI option for strict config parsing.
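
A sketch of the proposed behavior (the helper and its wiring are assumptions, not existing code):

import io.circe.Json
import org.slf4j.LoggerFactory

// Hypothetical helper: compare the keys present in a check's JSON object against
// the fields the check understands, and log a WARN for anything unexpected.
object UnknownFieldWarner {
  private val logger = LoggerFactory.getLogger(getClass)

  def warnUnknownFields(checkType: String, json: Json, knownFields: Set[String]): Unit = {
    val present = json.asObject.map(_.keys.toSet).getOrElse(Set.empty[String])
    (present -- knownFields).foreach { field =>
      logger.warn(s"Unknown field '$field' in $checkType check; it will be ignored.")
    }
  }
}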

Migrate your Actions workflows to the new syntax

The HCL syntax in GitHub Actions will stop working on September 30, 2019. We are contacting you because you’ve run workflows using HCL syntax in the last week in your account with the following repos: target/data-validator.

To continue using workflows that you created with the HCL syntax, you'll need to migrate the workflow files to the new YAML syntax. Once you have your YAML workflows ready, visit your repositories and follow the prompts to upgrade. Once you upgrade, your HCL workflows will stop working.

https://help.github.com/en/articles/migrating-github-actions-from-hcl-syntax-to-yaml-syntax

NoSuchMethodError when running with test data and test config

$ spark-submit --master "local[*]" $(ls -t target/scala-2.11/data-validator-assembly-*.jar | head -n 1) --config local_validators.yaml --jsonReport target/testreport.json --htmlReport target/testreport.html
20/04/07 18:04:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/04/07 18:04:34 INFO Main$: Logging configured!
20/04/07 18:04:34 INFO Main$: Data Validator
20/04/07 18:04:34 INFO ConfigParser$: Parsing `local_validators.yaml`
20/04/07 18:04:34 INFO ConfigParser$: Attempting to load `local_validators.yaml` from file system
20/04/07 18:04:35 INFO ValidatorConfig: substituteVariables()
20/04/07 18:04:35 INFO Substitutable$class: Substituting filename var: ${WORKDIR}/test.json with `/Users/z003xc4/Source/OSS/target_data-validator/test.json`
20/04/07 18:04:35 INFO Main$: Checking Cli Outputs htmlReport: Some(target/testreport.html) jsonReport: Some(target/testreport.json)
20/04/07 18:04:35 INFO Main$: filename: Some(target/testreport.html) append: false
20/04/07 18:04:35 INFO Main$: CheckFile Some(target/testreport.html)
20/04/07 18:04:35 INFO Main$: Checking file 'target/testreport.html append: false failed: false
20/04/07 18:04:35 INFO Main$: filename: Some(target/testreport.json) append: true
20/04/07 18:04:35 INFO Main$: CheckFile Some(target/testreport.json)
20/04/07 18:04:35 INFO Main$: Checking file 'target/testreport.json append: true failed: false
20/04/07 18:04:35 INFO ValidatorOrcFile: Reading orc file: testData.orc
20/04/07 18:04:36 INFO Main$: Running sparkChecks
20/04/07 18:04:36 INFO ValidatorConfig: Running Quick Checks...
20/04/07 18:04:36 INFO ValidatorOrcFile: Reading orc file: testData.orc
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias$.apply$default$4(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;)Lscala/Option;
        at com.target.data_validator.ValidatorTable.createCountSelect(ValidatorTable.scala:33)
        at com.target.data_validator.ValidatorTable.quickChecks(ValidatorTable.scala:87)
        at com.target.data_validator.ValidatorConfig$$anonfun$quickChecks$1.apply(ValidatorConfig.scala:51)
        at com.target.data_validator.ValidatorConfig$$anonfun$quickChecks$1.apply(ValidatorConfig.scala:51)
        at scala.collection.immutable.List.map(List.scala:284)
        at com.target.data_validator.ValidatorConfig.quickChecks(ValidatorConfig.scala:51)
        at com.target.data_validator.Main$.runSparkChecks(Main.scala:80)
        at com.target.data_validator.Main$$anonfun$2.apply(Main.scala:106)
        at com.target.data_validator.Main$$anonfun$2.apply(Main.scala:100)
        at scala.Option.map(Option.scala:146)
        at com.target.data_validator.Main$.runChecks(Main.scala:99)
        at com.target.data_validator.Main$.loadConfigRun(Main.scala:27)
        at com.target.data_validator.Main$.main(Main.scala:170)
        at com.target.data_validator.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

local_validators.yaml:

---
numKeyCols: 2
numErrorsToReport: 5
detailedErrors: true
vars:
- name: WORKDIR
  env: PWD
tables:
- orcFile: testData.orc
  checks:
  - type: rowCount
    minNumRows: 1000
    #  - type: nullCheck
    #    column: nullCol
outputs:
- filename: ${WORKDIR}/test.json

Environment:

  • Spark version 2.4.4
  • Scala version 2.11.12

Error % not calculated correctly for ColumnBased checks

Example

CheckType: minNumRows
minValue: 1800
actualValue: 1144

In the above example error % should be (1800 - 1144) * 100/1800 = 36.44%
However, it is calculated as (1 * 100)/1144 = 0.09%

A similar issue exists for ColumnMaxCheck.
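
A minimal sketch of the expected calculation (names are illustrative, not the project's actual fields): the error percentage should be relative to the expected value, not the actual count.

// errorPercent(1800, 1144) == 36.44..., matching the example above
def errorPercent(expected: Double, actual: Double): Double =
  (expected - actual) * 100.0 / expected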

distinctCountCheck as a validator

It would be great to have a distinctCountCheck validator that checks that the number of distinct values in a column of a given table matches a user-provided value.
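
A sketch of the underlying Spark aggregation such a validator could run (not an existing data-validator API; names are illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.countDistinct

// Hypothetical evaluation: count distinct values in the column and compare
// against the user-provided expected value.
def distinctCountPasses(df: DataFrame, column: String, expected: Long): Boolean =
  df.select(countDistinct(df(column))).head.getLong(0) == expected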

SQL variable substitution fails when result is a double

Observed, for example, when combined with ColumnSumCheck. See example config below:

---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true

vars:
  - name: MAX_AGE
    sql: SELECT CAST(MAX(age) AS DOUBLE) FROM census_income.adult

outputs:
  - filename: report.json
    append: false

email:
  smtpHost: smtp.example.com
  subject: Data Validation Summary
  from: [email protected]
  to:
    - [email protected]

tables:
  - db: census_income
    table: adult
    checks:
      - type: columnSumCheck
        column: age
        minValue: $MAX_AGE
        inclusive: true

yields:

...
...
21/01/14 09:12:01 ERROR JsonUtils$: Unimplemented dataType 'double' in column: CAST(max(age) AS DOUBLE) Please report this as a bug.
21/01/14 09:12:01 INFO ValidatorConfig: substituteVariables()
21/01/14 09:12:02 INFO Substitutable$class: Substituting Json minValue Json: "$MAX_AGE" with `null`
...
...
21/01/14 09:12:02 ERROR ColumnSumCheck$$anonfun$configCheck$1: 'minValue' defined but type is not a Number, is: Null
21/01/14 09:12:02 ERROR ValidatorTable$$anonfun$1: ConfigCheck failed for HiveTable:`census_income.adult`
...
...
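
The log points at the "Unimplemented dataType" branch: the conversion of the SQL result to JSON does not cover doubles, so the substituted value becomes null. A sketch of the kind of mapping that would need to handle DoubleType (illustrative only; the project's JsonUtils differs):

import io.circe.Json
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Sketch: map one cell of a Spark Row to circe Json, covering DoubleType.
def cellToJson(row: Row, idx: Int, dataType: DataType): Json = dataType match {
  case IntegerType => Json.fromInt(row.getInt(idx))
  case LongType    => Json.fromLong(row.getLong(idx))
  case DoubleType  => Json.fromDoubleOrNull(row.getDouble(idx))
  case StringType  => Json.fromString(row.getString(idx))
  case _           => Json.Null // unhandled types currently fall through like this
}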

[SECURITY] Releases are built/executed/released in the context of insecure/untrusted code

CWE-829: Inclusion of Functionality from Untrusted Control Sphere
CWE-494: Download of Code Without Integrity Check

The build files indicate that this project is resolving dependencies over HTTP instead of HTTPS. Any of these artifacts could have been intercepted in a MITM attack and maliciously compromised, infecting the build artifacts that were produced. Additionally, if any of these JARs or other dependencies were compromised, any developers using these could continue to be infected past updating to fix this.

This vulnerability has a CVSS v3.0 Base Score of 8.1/10
https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator?vector=AV:N/AC:H/PR:N/UI:N/S:U/C:H/I:H/A:H

This isn't just theoretical

POC code has existed since 2014 to maliciously compromise a JAR file in flight.

MITM attacks are increasingly common.

Source Locations

resolvers += "Concurrent Conjars repository" at "http://conjars.org/repo"

Allow format+options to be passed before Hive query

We have a use case which necessitates constructing queries a certain way:

val df = spark.read.format("internal_format").option("database", "foo").load("select * from myTable")
// … Spark magic …
df.write.format("internal_format").option("database", "my_select_database").save()

DV doesn't have a way to let the user pass an arbitrary format or a map of options in this manner. A particular Target-internal use case requires this, and DV cannot be used for it until these are supported.

A proposed solution is to allow format: String and options: Object|Map[String, String] properties:

tables:
  - db: census_income
    table: adult
    format: internal_format
    options:
      database: census_income
      hive.vectorized.execution.reduce.enabled: "false"
    keyColumns:
      - age
      - occupation
    condition: educationNum >= 5
    checks:
      - type: rowCount
        minNumRows: 50000
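
A sketch of how the proposed fields could be consumed when the source is read (hypothetical; the function and its shape are illustrative, not current ValidatorTable behavior):

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical reader: apply the configured format and options before loading.
def readSource(
  spark: SparkSession,
  query: String,
  format: Option[String],
  options: Map[String, String]
): DataFrame = {
  val base = spark.read.options(options)
  val reader = format.map(base.format).getOrElse(base)
  reader.load(query)
}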

Streamline configuration for the same test applied to multiple columns

Currently, if I wanted to check for null values in each of the columns (age, occupation) of a table, the checks: section of the configuration file would contain something like this:

- type: nullCheck
  column: age

- type: nullCheck
  column: occupation

Ideally, we should support a more streamlined config. Something like:

- type: nullCheck
  columns: age, occupation

We would need to decide how to handle optional parameters in the streamlined case. One option is that we do not support streamlining if any optional parameters are specified:

- type: nullCheck
  column: age
  threshold: 1%

- type: nullCheck
  column: occupation
  threshold: 5%

Another option would be to allow additional parameters to be streamlined and applied in the same order as the specified columns:

- type: nullCheck
  columns: age, occupation
  thresholds: 1%, 5%
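
Whichever option is chosen, the streamlined form could be expanded at parse time into the existing single-column checks. A sketch with hypothetical shapes (these are not the project's actual classes):

// Hypothetical shapes for illustration only.
final case class StreamlinedCheck(checkType: String, columns: Seq[String], thresholds: Seq[String])
final case class SingleCheck(checkType: String, column: String, threshold: Option[String])

// Expand one streamlined entry into per-column checks, pairing thresholds
// positionally with columns when both are given.
def expand(check: StreamlinedCheck): Seq[SingleCheck] =
  check.columns.zipWithIndex.map { case (col, i) =>
    SingleCheck(check.checkType, col, check.thresholds.lift(i))
  }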

Ratchet up to newer baseline

Target's internal baseline is rebasing onto the following:

  • Ubuntu
  • JDK 17
  • Scala 2.13 (or 2.12)
  • Spark 3.5.1

#166 will handle Spark 3.5.1 and sets the stage for JDK 17. It enables Scala 2.12, too, but keeps Scala 2.11. We'll want to roll off Scala 2.11 and onboard 2.13.

I don't think we've got anything that cares about the underlying distro.

After #166 is merged, we'll need to do some testing and work to ensure operability on JDK 17, including bumping CI workflows.

java.lang.IllegalArgumentException when using parquet file

When trying to run a config check on a parquet file, the following error can be seen:

root@lubuntu:/home/jyoti/Spark# /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit --num-executors 10 --executor-cores 2 data-validator-assembly-20220111T034941.jar --config config.yaml
22/01/11 11:50:53 WARN Utils: Your hostname, lubuntu resolves to a loopback address: 127.0.1.1; using 192.168.195.131 instead (on interface ens33)
22/01/11 11:50:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/01/11 11:50:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/11 11:50:59 INFO Main$: Logging configured!
22/01/11 11:51:00 INFO Main$: Data Validator
22/01/11 11:51:01 INFO ConfigParser$: Parsing `config.yaml`
22/01/11 11:51:01 INFO ConfigParser$: Attempting to load `config.yaml` from file system
Exception in thread "main" java.lang.ExceptionInInitializerError
	at com.target.data_validator.validator.RowBased.<init>(RowBased.scala:11)
	at com.target.data_validator.validator.NullCheck.<init>(NullCheck.scala:12)
	at com.target.data_validator.validator.NullCheck$.fromJson(NullCheck.scala:37)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$decoders$2.apply(JsonDecoders.scala:16)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$decoders$2.apply(JsonDecoders.scala:16)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$2.apply(JsonDecoders.scala:32)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$2.apply(JsonDecoders.scala:32)
	at scala.Option.map(Option.scala:230)
	at com.target.data_validator.validator.JsonDecoders$$anon$7.com$target$data_validator$validator$JsonDecoders$$anon$$getDecoder(JsonDecoders.scala:32)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$apply$3.apply(JsonDecoders.scala:27)
	at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$apply$3.apply(JsonDecoders.scala:27)
	at cats.syntax.EitherOps$.flatMap$extension(either.scala:149)
	at com.target.data_validator.validator.JsonDecoders$$anon$7.apply(JsonDecoders.scala:27)
	at io.circe.SeqDecoder.apply(SeqDecoder.scala:17)
	at io.circe.Decoder$class.tryDecode(Decoder.scala:36)
	at io.circe.SeqDecoder.tryDecode(SeqDecoder.scala:6)
	at com.target.data_validator.ConfigParser$anon$importedDecoder$macro$15$1$$anon$6.apply(ConfigParser.scala:21)
	at io.circe.generic.decoding.DerivedDecoder$$anon$1.apply(DerivedDecoder.scala:13)
	at io.circe.Decoder$$anon$28.apply(Decoder.scala:178)
	at io.circe.Decoder$$anon$28.apply(Decoder.scala:178)
	at io.circe.SeqDecoder.apply(SeqDecoder.scala:17)
	at io.circe.Decoder$class.tryDecode(Decoder.scala:36)
	at io.circe.SeqDecoder.tryDecode(SeqDecoder.scala:6)
	at com.target.data_validator.ConfigParser$anon$importedDecoder$macro$81$1$$anon$10.apply(ConfigParser.scala:28)
	at io.circe.generic.decoding.DerivedDecoder$$anon$1.apply(DerivedDecoder.scala:13)
	at io.circe.Json.as(Json.scala:106)
	at com.target.data_validator.ConfigParser$.configFromJson(ConfigParser.scala:28)
	at com.target.data_validator.ConfigParser$$anonfun$parse$1.apply(ConfigParser.scala:65)
	at com.target.data_validator.ConfigParser$$anonfun$parse$1.apply(ConfigParser.scala:65)
	at cats.syntax.EitherOps$.flatMap$extension(either.scala:149)
	at com.target.data_validator.ConfigParser$.parse(ConfigParser.scala:65)
	at com.target.data_validator.ConfigParser$.parseFile(ConfigParser.scala:60)
	at com.target.data_validator.Main$.loadConfigRun(Main.scala:23)
	at com.target.data_validator.Main$.main(Main.scala:171)
	at com.target.data_validator.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to bigint, but class Integer found.
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:219)
	at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:296)
	at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:144)
	at com.target.data_validator.validator.ValidatorBase$.<init>(ValidatorBase.scala:139)
	at com.target.data_validator.validator.ValidatorBase$.<clinit>(ValidatorBase.scala)
	... 47 more

Ran a spark-submit job as follows:

spark-submit --num-executors 10 --executor-cores 2 data-validator-assembly-20220111T034941.jar --config config.yaml

The config.yaml file has the following content:

numKeyCols: 2
numErrorsToReport: 742

tables:
  - parquetFile: /home/jyoti/Spark/userdata1.parquet
    checks:
      - type: nullCheck
        column: salary

I got userdata1.parquet from the following GitHub link:
https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet

Environment Details:

  • Latest source code: data-validator-0.13.0
  • Lubuntu 18.04 LTS x64 on VMware Player
  • 4 CPU cores and 2 GB RAM

Java version:

yoti@lubuntu:~$ java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)

lsb_release output:

jyoti@lubuntu:~$ lsb_release -a 2>/dev/null
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04 LTS
Release:	18.04
Codename:	bionic

uname -s:

jyoti@lubuntu:~$ uname -s
Linux

sbt -version:

root@lubuntu:/home/jyoti/Spark# sbt -version
downloading sbt launcher 1.6.1
[info] [launcher] getting org.scala-sbt sbt 1.6.1  (this may take some time)...
[info] [launcher] getting Scala 2.12.15 (for sbt)...
sbt version in this project: 1.6.1
sbt script version: 1.6.1

Please let me know if you need anything else.

Attempt to send email should be retried if it fails

Currently, if sending email fails because the email server is temporarily offline or overloaded, the only course of action is to rerun the whole validation. This can be very expensive, and it may require manual intervention if the program is running as part of an automated workflow.

It would be better if the program detected the error in sending email and did its own wait-and-retry loop. This would be pretty cheap and much better than failing.
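
A minimal sketch of such a wait-and-retry loop (illustrative; `sendEmail` stands in for the project's mailer call, and the attempt count and delay are arbitrary):

import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Retry a side-effecting send a few times with a fixed delay before giving up.
@tailrec
def sendWithRetry(sendEmail: () => Unit, attempts: Int = 3, waitMillis: Long = 30000L): Boolean =
  Try(sendEmail()) match {
    case Success(_) => true
    case Failure(_) if attempts > 1 =>
      Thread.sleep(waitMillis)
      sendWithRetry(sendEmail, attempts - 1, waitMillis)
    case Failure(_) => false
  }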

Range check configuration at debug log level

The range check configuration should be logged at debug level to be consistent with how other row-based checks are logged.

---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true

outputs:
  - filename: report.json
    append: false

tables:
  - db: census_income
    table: adult
    checks:
      - type: rangeCheck
        column: age
        minValue: 40
        maxValue: 50

yields:

21/01/14 14:34:43 INFO Main$: Logging configured!
21/01/14 14:34:43 INFO Main$: Data Validator
21/01/14 14:34:43 INFO ConfigParser$: Parsing `issue.yaml`
21/01/14 14:34:43 INFO ConfigParser$: Attempting to load `issue.yaml` from file system
21/01/14 14:34:43 INFO RangeCheck$$anonfun$fromJson$1: RangeCheckJson: {
  "type" : "rangeCheck",
  "column" : "age",
  "minValue" : 4e1,
  "maxValue" : 5e1
}
...
...

Drop env dump from JSON output

data-validator may expose secrets held in environment variables in the output JSON.

private def envToJson: Json = {
  val env = System.getenv.asScala.toList.map(x => (x._1, Json.fromString(x._2)))
  Json.obj(env: _*)
}

dumps the current environment into the output JSON. It reaches the report via

("runtimeInfo", ValidatorConfig.runtimeInfoJson(spark)),

which calls

private def runtimeInfoJson(spark: SparkSession): Json = {

and includes the environment in the runtime info.

It's safe to dump variables that data-validator accesses, but it's unwise to dump everything.
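
One safer alternative, sketched under the assumption that the set of variables data-validator actually referenced is tracked somewhere (`accessedVars` is hypothetical), is to dump only those:

import io.circe.Json
import scala.collection.JavaConverters._

// Hypothetical filtered dump: only include environment variables the run referenced.
private def envToJson(accessedVars: Set[String]): Json = {
  val env = System.getenv.asScala.toList
    .filter { case (k, _) => accessedVars.contains(k) }
    .map { case (k, v) => (k, Json.fromString(v)) }
  Json.obj(env: _*)
}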

Add a 'sum of numeric column' check

Acceptance Criteria:

  • Implement the sum of a numeric column
  • Usage should be documented in the README
  • Test coverage should be added and passing
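
The underlying aggregation is straightforward; a sketch of what such a check could compute (illustrative, not the final API), assuming a non-empty table:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

// Hypothetical helper: sum a numeric column and return it as a double for
// comparison against configured min/max bounds.
def columnSum(df: DataFrame, column: String): Double =
  df.agg(sum(df(column)).cast("double")).head.getDouble(0)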

Refactor tests using `traits`

@colindean made some good suggestions in #14 around refactoring tests using traits (See comment)
I tried to create a few new utility functions in #13, but I'd like to see if we can do something like Colin suggested, and make a pass through the tests and use any new traits or functions to make them more concise.
I suspect we can reduce the duplicate code in the tests, greatly reduce the test SLOC and make it easier to develop tests.
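
For example, a shared trait along these lines could centralize the SparkSession setup that many specs repeat (a sketch, assuming a local-mode session is acceptable for tests):

import org.apache.spark.sql.SparkSession

// Sketch of a reusable test fixture: specs mix this in instead of building
// their own SparkSession.
trait TestingSparkSession {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[2]")
    .appName("data-validator-tests")
    .getOrCreate()
}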

createKeySelect log msg is potentially redundant and should not be at error level

For example, given:

---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true

outputs:
  - filename: report.json
    append: false

tables:
  - db: census_income
    table: adult
    checks:
      - type: rangeCheck
        column: age
        minValue: 40
        maxValue: 50

In the output you will see:

...
...
21/01/14 14:35:32 ERROR ValidatorTable: createKeySelect: age, workclass keyColumns: None
...
...

This is not an error; it merely reports which keyColumns will be used for ValidatorQuickCheckError details. When keyColumns are specified in the configuration, they end up being listed twice.

Unit tests failures after adding dependency on HiveWarehouseConnector

In order to enable data-validator for Hadoop 3, a dependency on HiveWarehouseConnector was added. After this, unit tests started failing with the following exception:

java.lang.SecurityException: class "org.codehaus.janino.JaninoRuntimeException"'s signer information does not match signer information of other classes in the same package
at java.lang.ClassLoader.checkCerts(ClassLoader.java:898)
at java.lang.ClassLoader.preDefineClass(ClassLoader.java:668)
at java.lang.ClassLoader.defineClass(ClassLoader.java:761)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:197)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)

As suggested in this SO thread, the HiveWarehouseConnector jar was added to the end of the classpath. After that, a NoClassDefFoundError showed up.

java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
[error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
[error]     at org.apache.spark.SparkContext.withScope(SparkContext.scala:693)
[error]     at org.apache.spark.SparkContext.parallelize(SparkContext.scala:710)

This seems like a typical jar-hell issue, and it affects only the unit tests: when the unit test runs were skipped, data-validator was successfully deployed and ran fine on both Hadoop 2 and Hadoop 3.

Rename "tables" concept

Is your feature request related to a problem? Please describe.

We've got ValidatorTable and tables in the config, but they're not really tables in the case of orc or parquet files. Let's get rid of the tables moniker and choose something else.

Describe the solution you'd like

ValidatorDataSource and sources might be more appropriate.

N.b. this would be a breaking change.

Move back to Travis CI

Recent changes to GitHub Actions disable it for paid orgs that are on older plans. We have to go back to Travis since this is not likely to be resolved amenably anytime soon.
