target / data-validator
A tool to validate data, built around Apache Spark.
License: Other
I'd like to include some text in the validator email, specifically a URL linking to where the validator configuration is stored.
End-to-end testing can now include versioned JSONL data in src/test/resources, loaded through ValidatorSpecifiedFormatLoader. #82 didn't include ValidatorSpecifiedFormatLoader in the config parser tests, so add it while you're in there.
Is your feature request related to a problem? Please describe.
As a maintainer, I want to be able to validate my dependencies are what they were when I pulled them in.
Describe the solution you'd like
https://github.com/stringbean/sbt-dependency-lock
Describe alternatives you've considered
I've searched for other plugins that do this and ☝️ seems to be the currently maintained one.
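If we adopt it, the wiring would presumably be a one-liner in project/plugins.sbt (the coordinates and version below are from memory and should be checked against the plugin's README):
addSbtPlugin("software.purpledragon" % "sbt-dependency-lock" % "1.5.1")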
@samratmitra-0812 pointed out:
This behaviour of throwing an exception for unsupported type [in columnSumCheck] is different from columnMaxCheck, where it is treated as a normal check failure. I think we should make both of them consistent.
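A minimal sketch of the columnMaxCheck-style behaviour, where an unsupported type surfaces as a configuration-check failure message instead of an exception (the helper is illustrative, not the project's actual API):
import org.apache.spark.sql.types.{DataType, NumericType}

// Hypothetical helper: report an unsupported column type as a normal check failure
// message rather than throwing, mirroring columnMaxCheck's behaviour.
def unsupportedTypeFailure(dataType: DataType, column: String): Option[String] =
  dataType match {
    case _: NumericType => None
    case other          => Some(s"columnSumCheck on '$column' does not support type $other")
  }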
Setup Travis CI
Is your feature request related to a problem? Please describe.
We currently only ship for Scala 2.11 and Spark 2.3.x.
Describe the solution you'd like
We should ship for newer versions in whatever pairs are appropriate.
Additional context
https://github.com/sbt/sbt-projectmatrix is probably the right tool for the job.
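A rough sketch of what this could look like in build.sbt, assuming the sbt-projectmatrix plugin is added to project/plugins.sbt (the Scala versions below are illustrative):
// Hypothetical build.sbt fragment: sbt-projectmatrix builds one row per Scala version;
// Spark versions would be varied per row via additional settings or a custom axis.
lazy val dataValidator = (projectMatrix in file("."))
  .settings(name := "data-validator")
  .jvmPlatform(scalaVersions = Seq("2.11.12", "2.12.17"))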
We should be able to support the following use case or something analogous to it:
---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true
vars:
  - name: NUM_ROWS
    value: 1000
tables:
  - db: census_income
    table: adult
    checks:
      - type: rowCount
        minNumRows: $NUM_ROWS
Currently, trying the above configuration yields a fairly non-descriptive DecodingFailure
21/01/15 12:24:03 ERROR Main$: Failed to parse config file 'issue.yaml, {}
DecodingFailure(Attempt to decode value on failed cursor, List(DownField(parquetFile), DownArray, DownField(tables)))
It is noted in the documentation that this is not currently supported.
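One possible direction, sketched with circe rather than the project's actual decoders: accept either a JSON number or a "$VAR" string for minNumRows and defer numeric validation until after variable substitution.
import io.circe.{Decoder, Json}

// Hypothetical circe sketch: let minNumRows be either a JSON number or a "$VAR" string,
// leaving the real numeric validation until variable substitution has run.
val minNumRowsDecoder: Decoder[Json] = Decoder[Json].emap { json =>
  if (json.isNumber || json.asString.exists(_.startsWith("$"))) Right(json)
  else Left(s"minNumRows must be a number or a variable reference, got: ${json.noSpaces}")
}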
Per olafurpg/setup-scala#49, olafurpg/setup-scala is ~unmaintained and there are better options available. We do have a need to compile against JDK 8, so we can probably take the recommended route of using the actions/setup-java action.
I forgot to add thresholds to the report.json.
When running sbt clean assembly in a terminal, the following tests are failing:
[info] - from Json snippet
[info] - addEntry works *** FAILED ***
[info] sut.addEntry(ConfigVarSpec.this.spark, varSub) was true (ConfigVarSpec.scala:70)
[info] - asJson works
[info] - var sub in env value *** FAILED ***
[info] sut.addEntry(ConfigVarSpec.this.spark, varSub) was true (ConfigVarSpec.scala:83)
[info] - var sub fails when value doesn't exist
Here is the ScalaTest output:
[info] ScalaTest
[info] Run completed in 3 minutes, 2 seconds.
[info] Total number of tests run: 330
[info] Suites: completed 25, aborted 0
[info] Tests: succeeded 328, failed 2, canceled 0, ignored 0, pending 0
[info] *** 2 TESTS FAILED ***
[error] Failed: Total 330, Failed 2, Errors 0, Passed 328
[error] Failed tests:
[error] com.target.data_validator.validator.ConfigVarSpec
[error] (Test / test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 892 s (14:52), completed 10 Jan, 2022 8:36:27 AM
There's a filter that requires a tag on the repo in order for it to show up.
Describe the bug
When specifying a check with a threshold that will parse to a JSON float, e.g.
threshold: 0.10 # will be ignored
threshold: 10% # works
threshold: "0.10" # works
the threshold will be ignored.
To Reproduce
Configure a check with:
type: nullCheck
column: foo
threshold: 0.10
or put that into a test in NullCheckSpec.
Expected behavior
Thresholds specified as floats should work.
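A hedged sketch of a decoder that would accept all three spellings; this is illustrative circe code, not the project's current threshold parsing, and interpreting "10%" as a fraction is an assumption:
import scala.util.Try
import io.circe.Decoder

// Hypothetical decoder accepting a bare number (0.10), a percentage ("10%"),
// or a quoted number ("0.10").
val thresholdDecoder: Decoder[Double] = Decoder[Double].or(
  Decoder[String].emap { s =>
    val trimmed = s.trim
    val parsed =
      if (trimmed.endsWith("%")) Try(trimmed.dropRight(1).toDouble / 100.0)
      else Try(trimmed.toDouble)
    parsed.toOption.toRight(s"cannot parse threshold: '$s'")
  }
)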
Remove the Coursier pull of packages in the Makefile in exchange for sbt-scalafmt in sbt itself, by adding to project/plugins.sbt:
addSbtPlugin("org.scalameta" % "sbt-scalafmt" % "2.4.6")
There is an unused com.target.data_validator.ConfigParser#main that could be exposed somehow to enable configuration testing.
Ideally, this should be a separate mode but minimally we could document how to use it locally to validate a configuration.
I think it could be as simple as documenting using it like this:
spark-submit data-validator-assembly-${version}.jar config.yaml
While testing stringLengthCheck I accidentally referenced minLength instead of minValue. This caused configTest() to fail for no apparent reason and took me a really long time to debug because the program was not logging any useful information. configTest() did generate a ValidatorError() event in the eventLog, but the program doesn't write the report.json or HTML report on configTest() failures.
The new Object.fromJson() constructors should log a warning for every unknown field present in the config.
In general, I do not think that unknown fields should be an error, only a warning; this helps keep the config "compatible" across versions.
Maybe create a CLI option for strict config parsing.
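A rough sketch of how the warning could be produced with circe, comparing the object's keys against the set of fields a check understands (the helper object and method are hypothetical):
import com.typesafe.scalalogging.LazyLogging
import io.circe.Json

// Hypothetical helper: warn about config fields the decoder does not recognize
// instead of treating them as errors.
object UnknownFieldAudit extends LazyLogging {
  def warnOnUnknownFields(json: Json, knownFields: Set[String], context: String): Unit =
    json.asObject.foreach { obj =>
      (obj.keys.toSet -- knownFields).foreach { field =>
        logger.warn(s"Unknown field '$field' in $context config; ignoring it.")
      }
    }
}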
https://github.com/dmurvihill/courier is very attractive, missing only the retry support that would satisfy #70. Using Courier would satisfy #19 and could facilitate #5 as something to do in the process.
The HCL syntax in GitHub Actions will stop working on September 30, 2019. We are contacting you because you’ve run workflows using HCL syntax in the last week in your account with the following repos: target/data-validator.
To continue using workflows that you created with the HCL syntax, you'll need to migrate the workflow files to the new YAML syntax. Once you have your YAML workflows ready, visit your repositories and follow the prompts to upgrade. Once you upgrade, your HCL workflows will stop working.
https://help.github.com/en/articles/migrating-github-actions-from-hcl-syntax-to-yaml-syntax
$ spark-submit --master "local[*]" $(ls -t target/scala-2.11/data-validator-assembly-*.jar | head -n 1) --config local_validators.yaml --jsonReport target/testreport.json --htmlReport target/testreport.html
20/04/07 18:04:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/04/07 18:04:34 INFO Main$: Logging configured!
20/04/07 18:04:34 INFO Main$: Data Validator
20/04/07 18:04:34 INFO ConfigParser$: Parsing `local_validators.yaml`
20/04/07 18:04:34 INFO ConfigParser$: Attempting to load `local_validators.yaml` from file system
20/04/07 18:04:35 INFO ValidatorConfig: substituteVariables()
20/04/07 18:04:35 INFO Substitutable$class: Substituting filename var: ${WORKDIR}/test.json with `/Users/z003xc4/Source/OSS/target_data-validator/test.json`
20/04/07 18:04:35 INFO Main$: Checking Cli Outputs htmlReport: Some(target/testreport.html) jsonReport: Some(target/testreport.json)
20/04/07 18:04:35 INFO Main$: filename: Some(target/testreport.html) append: false
20/04/07 18:04:35 INFO Main$: CheckFile Some(target/testreport.html)
20/04/07 18:04:35 INFO Main$: Checking file 'target/testreport.html append: false failed: false
20/04/07 18:04:35 INFO Main$: filename: Some(target/testreport.json) append: true
20/04/07 18:04:35 INFO Main$: CheckFile Some(target/testreport.json)
20/04/07 18:04:35 INFO Main$: Checking file 'target/testreport.json append: true failed: false
20/04/07 18:04:35 INFO ValidatorOrcFile: Reading orc file: testData.orc
20/04/07 18:04:36 INFO Main$: Running sparkChecks
20/04/07 18:04:36 INFO ValidatorConfig: Running Quick Checks...
20/04/07 18:04:36 INFO ValidatorOrcFile: Reading orc file: testData.orc
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.expressions.Alias$.apply$default$4(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;)Lscala/Option;
at com.target.data_validator.ValidatorTable.createCountSelect(ValidatorTable.scala:33)
at com.target.data_validator.ValidatorTable.quickChecks(ValidatorTable.scala:87)
at com.target.data_validator.ValidatorConfig$$anonfun$quickChecks$1.apply(ValidatorConfig.scala:51)
at com.target.data_validator.ValidatorConfig$$anonfun$quickChecks$1.apply(ValidatorConfig.scala:51)
at scala.collection.immutable.List.map(List.scala:284)
at com.target.data_validator.ValidatorConfig.quickChecks(ValidatorConfig.scala:51)
at com.target.data_validator.Main$.runSparkChecks(Main.scala:80)
at com.target.data_validator.Main$$anonfun$2.apply(Main.scala:106)
at com.target.data_validator.Main$$anonfun$2.apply(Main.scala:100)
at scala.Option.map(Option.scala:146)
at com.target.data_validator.Main$.runChecks(Main.scala:99)
at com.target.data_validator.Main$.loadConfigRun(Main.scala:27)
at com.target.data_validator.Main$.main(Main.scala:170)
at com.target.data_validator.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
local_validators.yaml:
---
numKeyCols: 2
numErrorsToReport: 5
detailedErrors: true
vars:
  - name: WORKDIR
    env: PWD
tables:
  - orcFile: testData.orc
    checks:
      - type: rowCount
        minNumRows: 1000
      # - type: nullCheck
      #   column: nullCol
outputs:
  - filename: ${WORKDIR}/test.json
CheckType: minNumRows
minValue: 1800
actualValue: 1144
In the above example, the error % should be (1800 - 1144) * 100 / 1800 = 36.44%.
However, it is currently calculated as (1 * 100) / 1144 = 0.09%.
Similar issue exists for ColumnMaxCheck.
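A minimal sketch of the intended calculation (the helper name is made up for illustration):
// Shortfall relative to the configured bound, not relative to the actual value.
def pctError(expectedMin: Double, actual: Double): Double =
  (expectedMin - actual) * 100.0 / expectedMin

pctError(1800, 1144) // ≈ 36.44, matching the expected figure above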
It would be great to have a distinctCountCheck validator that checks the number of distinct values in a column of a given table and that this number matches a user-provided value.
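A minimal sketch of what the check's core could compute with Spark (the function name is made up):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.countDistinct

// Hypothetical core of a distinctCountCheck: compare the distinct count of a column
// against a user-provided expected value.
def distinctCountMatches(df: DataFrame, column: String, expected: Long): Boolean = {
  val actual = df.agg(countDistinct(df(column))).head.getLong(0)
  actual == expected
}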
Observed, for example, when combined with ColumnSumCheck. See example config below:
---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true
vars:
  - name: MAX_AGE
    sql: SELECT CAST(MAX(age) AS DOUBLE) FROM census_income.adult
outputs:
  - filename: report.json
    append: false
email:
  smtpHost: smtp.example.com
  subject: Data Validation Summary
  from: [email protected]
  to:
    - [email protected]
tables:
  - db: census_income
    table: adult
    checks:
      - type: columnSumCheck
        column: age
        minValue: $MAX_AGE
        inclusive: true
yields:
...
...
21/01/14 09:12:01 ERROR JsonUtils$: Unimplemented dataType 'double' in column: CAST(max(age) AS DOUBLE) Please report this as a bug.
21/01/14 09:12:01 INFO ValidatorConfig: substituteVariables()
21/01/14 09:12:02 INFO Substitutable$class: Substituting Json minValue Json: "$MAX_AGE" with `null`
...
...
21/01/14 09:12:02 ERROR ColumnSumCheck$$anonfun$configCheck$1: 'minValue' defined but type is not a Number, is: Null
21/01/14 09:12:02 ERROR ValidatorTable$$anonfun$1: ConfigCheck failed for HiveTable:`census_income.adult`
...
...
CWE-829: Inclusion of Functionality from Untrusted Control Sphere
CWE-494: Download of Code Without Integrity Check
The build files indicate that this project is resolving dependencies over HTTP instead of HTTPS. Any of these artifacts could have been MITMed to maliciously compromise them and infect the build artifacts that are produced. Additionally, if any of these JARs or other dependencies were compromised, any developers using them could continue to be infected even after updating to fix this.
This vulnerability has a CVSS v3.0 Base Score of 8.1/10
https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator?vector=AV:N/AC:H/PR:N/UI:N/S:U/C:H/I:H/A:H
POC code has existed since 2014 to maliciously compromise a JAR file in flight.
See:
Line 12 in 67eec1d
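The remediation is usually a one-line change so that any custom resolver uses an https:// URL; a hypothetical build.sbt example (the repository URL is illustrative):
// Hypothetical build.sbt remediation: fetch artifacts over HTTPS only.
resolvers += "secure-repo" at "https://repo.example.com/maven2"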
Set up publishing of release .jars to Maven Central or some other JAR hosting place.
We have a use case which necessitates constructing queries a certain way:
val df = spark.read.format("internal_format").option("database", "foo").load("select * from myTable")
// … Spark magic …
df.write.format("internal_format").option("database", "my_select_database").save()
DV doesn't have a way to allow the user to pass arbitrary format or a map of options in this manner. A particular Target-internal use case requires this and DV cannot be used for this use case until these are supported.
A proposed solution is to allow format: String and options: Object|Map[String, String] properties:
tables:
  - db: census_income
    table: adult
    format: internal_format
    options:
      database: census_income
      hive.vectorized.execution.reduce.enabled: "false"
    keyColumns:
      - age
      - occupation
    condition: educationNum >= 5
    checks:
      - type: rowCount
        minNumRows: 50000
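A rough sketch of how a table entry carrying these properties might be loaded (the helper is hypothetical; it mirrors the internal_format example above):
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical loader for a table entry with `format` and `options`; `query`
// mirrors the load("select * from myTable") usage in the example above.
def loadWithFormat(
    spark: SparkSession,
    format: String,
    options: Map[String, String],
    query: String
): DataFrame =
  spark.read.format(format).options(options).load(query)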
We have quite a few tests. We should make a pass through them and remove any that don't improve our code coverage.
Feature request for supporting grouping and then checks on grouped data.
Currently, if I wanted to check for null values in each of the columns (age, occupation) of a table, the checks: section of the configuration file would contain something like this:
- type: nullCheck
  column: age
- type: nullCheck
  column: occupation
Ideally, we should support a more streamlined config. Something like:
- type: nullCheck
  columns: age, occupation
We would need to decide how to handle optional parameters in the streamlined case. One option is that we do not support streamlining if any optional parameters are specified:
- type: nullCheck
  column: age
  threshold: 1%
- type: nullCheck
  column: occupation
  threshold: 5%
Another option would be to allow additional parameters to be streamlined and applied in the same order as the specified columns:
- type: nullCheck
  columns: age, occupation
  thresholds: 1%, 5%
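A minimal sketch of the positional-expansion idea, using hypothetical types:
// Hypothetical expansion of the streamlined form into the existing
// one-column-per-check form: thresholds are paired positionally with columns.
final case class StreamlinedNullCheck(columns: Seq[String], thresholds: Seq[String])

def expand(check: StreamlinedNullCheck): Seq[(String, Option[String])] = {
  val padded = check.thresholds.map(Option(_)).padTo(check.columns.size, None)
  check.columns.zip(padded)
}

// expand(StreamlinedNullCheck(Seq("age", "occupation"), Seq("1%", "5%")))
// == Seq(("age", Some("1%")), ("occupation", Some("5%")))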
Target's internal baseline is rebasing onto these facts:
#166 will handle Spark 3.5.1 and sets the stage for JDK 17. It enables Scala 2.12, too, but keeps Scala 2.11. We'll want to roll off Scala 2.11 and onboard 2.13.
I don't think we've got anything that cares about the underlying distro.
After #166 is merged, we'll need to do some testing and work to ensure operability on JDK 17, including bumping CI workflows.
data-validator/.github/workflows/ci.yaml
Line 25 in 2eb1389
When trying to run a config check on a parquet file, the following error can be seen:
root@lubuntu:/home/jyoti/Spark# /opt/spark/spark-3.1.2-bin-hadoop3.2/bin/spark-submit --num-executors 10 --executor-cores 2 data-validator-assembly-20220111T034941.jar --config config.yaml
22/01/11 11:50:53 WARN Utils: Your hostname, lubuntu resolves to a loopback address: 127.0.1.1; using 192.168.195.131 instead (on interface ens33)
22/01/11 11:50:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
22/01/11 11:50:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/01/11 11:50:59 INFO Main$: Logging configured!
22/01/11 11:51:00 INFO Main$: Data Validator
22/01/11 11:51:01 INFO ConfigParser$: Parsing `config.yaml`
22/01/11 11:51:01 INFO ConfigParser$: Attempting to load `config.yaml` from file system
Exception in thread "main" java.lang.ExceptionInInitializerError
at com.target.data_validator.validator.RowBased.<init>(RowBased.scala:11)
at com.target.data_validator.validator.NullCheck.<init>(NullCheck.scala:12)
at com.target.data_validator.validator.NullCheck$.fromJson(NullCheck.scala:37)
at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$decoders$2.apply(JsonDecoders.scala:16)
at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$decoders$2.apply(JsonDecoders.scala:16)
at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$2.apply(JsonDecoders.scala:32)
at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$2.apply(JsonDecoders.scala:32)
at scala.Option.map(Option.scala:230)
at com.target.data_validator.validator.JsonDecoders$$anon$7.com$target$data_validator$validator$JsonDecoders$$anon$$getDecoder(JsonDecoders.scala:32)
at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$apply$3.apply(JsonDecoders.scala:27)
at com.target.data_validator.validator.JsonDecoders$$anon$7$$anonfun$apply$3.apply(JsonDecoders.scala:27)
at cats.syntax.EitherOps$.flatMap$extension(either.scala:149)
at com.target.data_validator.validator.JsonDecoders$$anon$7.apply(JsonDecoders.scala:27)
at io.circe.SeqDecoder.apply(SeqDecoder.scala:17)
at io.circe.Decoder$class.tryDecode(Decoder.scala:36)
at io.circe.SeqDecoder.tryDecode(SeqDecoder.scala:6)
at com.target.data_validator.ConfigParser$anon$importedDecoder$macro$15$1$$anon$6.apply(ConfigParser.scala:21)
at io.circe.generic.decoding.DerivedDecoder$$anon$1.apply(DerivedDecoder.scala:13)
at io.circe.Decoder$$anon$28.apply(Decoder.scala:178)
at io.circe.Decoder$$anon$28.apply(Decoder.scala:178)
at io.circe.SeqDecoder.apply(SeqDecoder.scala:17)
at io.circe.Decoder$class.tryDecode(Decoder.scala:36)
at io.circe.SeqDecoder.tryDecode(SeqDecoder.scala:6)
at com.target.data_validator.ConfigParser$anon$importedDecoder$macro$81$1$$anon$10.apply(ConfigParser.scala:28)
at io.circe.generic.decoding.DerivedDecoder$$anon$1.apply(DerivedDecoder.scala:13)
at io.circe.Json.as(Json.scala:106)
at com.target.data_validator.ConfigParser$.configFromJson(ConfigParser.scala:28)
at com.target.data_validator.ConfigParser$$anonfun$parse$1.apply(ConfigParser.scala:65)
at com.target.data_validator.ConfigParser$$anonfun$parse$1.apply(ConfigParser.scala:65)
at cats.syntax.EitherOps$.flatMap$extension(either.scala:149)
at com.target.data_validator.ConfigParser$.parse(ConfigParser.scala:65)
at com.target.data_validator.ConfigParser$.parseFile(ConfigParser.scala:60)
at com.target.data_validator.Main$.loadConfigRun(Main.scala:23)
at com.target.data_validator.Main$.main(Main.scala:171)
at com.target.data_validator.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to bigint, but class Integer found.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:219)
at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:296)
at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:144)
at com.target.data_validator.validator.ValidatorBase$.<init>(ValidatorBase.scala:139)
at com.target.data_validator.validator.ValidatorBase$.<clinit>(ValidatorBase.scala)
... 47 more
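For context, the root cause appears to be a literal built from a boxed Integer where Spark 3 requires a Long for LongType; a minimal reproduction of the validateLiteralValue failure, assuming Spark 3.x is on the classpath:
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.LongType

// Boxed Integer vs. LongType trips validateLiteralValue's require:
Literal.create(0, LongType)   // fails: "Literal must have a corresponding value to bigint, but class Integer found."
Literal.create(0L, LongType)  // succeeds: a Long value matches LongType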
Ran a spark-submit job as follows:
spark-submit --num-executors 10 --executor-cores 2 data-validator-assembly-20220111T034941.jar --config config.yaml
The config.yaml file has the following content:
numKeyCols: 2
numErrorsToReport: 742
tables:
  - parquetFile: /home/jyoti/Spark/userdata1.parquet
    checks:
      - type: nullCheck
        column: salary
I got userdata1.parquet from the following GitHub link:
https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet
Environment Details:
latest source code: data-validator-0.13.0
Lubuntu 18.04 LTS x64 version on VMWare Player
4 CPU cores and 2GB ram
Java version
yoti@lubuntu:~$ java -version
openjdk version "1.8.0_312"
OpenJDK Runtime Environment (build 1.8.0_312-8u312-b07-0ubuntu1~18.04-b07)
OpenJDK 64-Bit Server VM (build 25.312-b07, mixed mode)
lsb_release output:
jyoti@lubuntu:~$ lsb_release -a 2>/dev/null
Distributor ID: Ubuntu
Description: Ubuntu 18.04 LTS
Release: 18.04
Codename: bionic
uname -s:
jyoti@lubuntu:~$ uname -s
Linux
sbt -version:
root@lubuntu:/home/jyoti/Spark# sbt -version
downloading sbt launcher 1.6.1
[info] [launcher] getting org.scala-sbt sbt 1.6.1 (this may take some time)...
[info] [launcher] getting Scala 2.12.15 (for sbt)...
sbt version in this project: 1.6.1
sbt script version: 1.6.1
Please let me know if you need anything else.
As discussed in our original Spark Summit presentation: See 22 min mark.
Listening to myself is awful btw.
Inspired by the nice visualization provided by Facets Overview, while leveraging Spark to handle large distributed data sets.
Currently, if sending email fails because the email server is temporarily offline or overloaded, the only recourse is to rerun the whole validation. This can be very expensive, and may require manual intervention if the program is running as part of an automated workflow.
It would be better if the program detected the error in sending email and did its own wait-and-retry loop. This would be pretty cheap and much better than failing.
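A minimal sketch of such a loop, assuming the send step can be wrapped as a callback (the function name, attempt count, and delays are illustrative, not the project's API):
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

// Hypothetical wait-and-retry wrapper around the existing email send step.
@tailrec
def sendWithRetry(sendEmail: () => Unit, attemptsLeft: Int, delayMs: Long = 30000L): Boolean =
  Try(sendEmail()) match {
    case Success(_) => true
    case Failure(_) if attemptsLeft > 1 =>
      Thread.sleep(delayMs)
      sendWithRetry(sendEmail, attemptsLeft - 1, delayMs * 2) // simple exponential backoff
    case Failure(_) => false
  }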
The range check configuration should be logged at debug level to make it consistent with how other row-based tests are logged.
---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true
outputs:
  - filename: report.json
    append: false
tables:
  - db: census_income
    table: adult
    checks:
      - type: rangeCheck
        column: age
        minValue: 40
        maxValue: 50
yields:
21/01/14 14:34:43 INFO Main$: Logging configured!
21/01/14 14:34:43 INFO Main$: Data Validator
21/01/14 14:34:43 INFO ConfigParser$: Parsing `issue.yaml`
21/01/14 14:34:43 INFO ConfigParser$: Attempting to load `issue.yaml` from file system
21/01/14 14:34:43 INFO RangeCheck$$anonfun$fromJson$1: RangeCheckJson: {
"type" : "rangeCheck",
"column" : "age",
"minValue" : 4e1,
"maxValue" : 5e1
}
...
...
data-validator may expose secrets held in environment variables in the output JSON.
It's safe to dump variables that data-validator accesses but it's unwise to dump everything.
Acceptance Criteria:
@colindean made some good suggestions in #14 around refactoring tests using traits (see comment).
I tried to create a few new utility functions in #13, but I'd like to see if we can do something like Colin suggested: make a pass through the tests and use any new traits or functions to make them more concise.
I suspect we can reduce the duplicated code in the tests, greatly reduce the test SLOC, and make it easier to develop tests.
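A rough sketch of the kind of shared trait this could converge on (the trait name and helper are hypothetical, not the project's existing test utilities):
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical shared test trait: a local SparkSession plus a small DataFrame
// builder, so individual specs can stay focused on the check under test.
trait SparkTestHelpers {
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("data-validator-tests")
    .getOrCreate()

  def mkDf(rows: Seq[(String, Int)]): DataFrame = {
    import spark.implicits._
    rows.toDF("name", "age")
  }
}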
For example, given:
---
numKeyCols: 2
numErrorsToReport: 1
detailedErrors: true
outputs:
  - filename: report.json
    append: false
tables:
  - db: census_income
    table: adult
    checks:
      - type: rangeCheck
        column: age
        minValue: 40
        maxValue: 50
In the output you will see:
...
...
21/01/14 14:35:32 ERROR ValidatorTable: createKeySelect: age, workclass keyColumns: None
...
...
This is not an error. It is merely informing you what the keyColumns are for ValidatorQuickCheckError details. If keyColumns are specified in the configuration, you will end up seeing them listed twice.
https://github.com/sbt/sbt-ghpages with perhaps a one-pager that links to docs that this plugin can also put in the gh-pages branch.
Travis CI's recent billing changes could affect this project. We moved away from GHA in #31 because our GH paid plan didn't include GHA. Apparently, it does now, so we could use it instead.
A potential inspiration for a modern GHA configuration: https://github.com/scallop/scallop/blob/develop/.github/workflows/ci.yml
We may be able to tackle #2 as a part of this…
Support connecting to the SMTP host using SSL and user authentication.
See "Sending email java ssltls auth".
Will require adding some additional options to EmailConfig.
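A rough sketch with javax.mail, assuming EmailConfig grows fields like port, username, and password (the field names are assumptions):
import java.util.Properties
import javax.mail.{Authenticator, PasswordAuthentication, Session}

// Hypothetical SSL + authenticated SMTP session; property names are standard javax.mail keys.
def sslSession(smtpHost: String, port: Int, username: String, password: String): Session = {
  val props = new Properties()
  props.put("mail.smtp.host", smtpHost)
  props.put("mail.smtp.port", port.toString)
  props.put("mail.smtp.auth", "true")
  props.put("mail.smtp.ssl.enable", "true")
  Session.getInstance(props, new Authenticator {
    override def getPasswordAuthentication: PasswordAuthentication =
      new PasswordAuthentication(username, password)
  })
}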
In order to enable data-validator for Hadoop 3, a dependency on HiveWarehouseConnector was added. After this, unit tests started failing with the following exception:
java.lang.SecurityException: class "org.codehaus.janino.JaninoRuntimeException"'s signer information does not match signer information of other classes in the same package
at java.lang.ClassLoader.checkCerts(ClassLoader.java:898)
at java.lang.ClassLoader.preDefineClass(ClassLoader.java:668)
at java.lang.ClassLoader.defineClass(ClassLoader.java:761)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:197)
at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:36)
at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1321)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2484)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3254)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3253)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
As suggested in this SO thread, the HiveWarehouseConnector jar was added to the end of the classpath. After that, a NoClassDefFoundError showed up.
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
[error] sbt.ForkMain$ForkError: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
[error] at org.apache.spark.SparkContext.withScope(SparkContext.scala:693)
[error] at org.apache.spark.SparkContext.parallelize(SparkContext.scala:710)
This seems like a typical JAR-hell issue, and it affects only the unit tests. When the unit test runs were skipped, the data-validator was successfully deployed and ran fine on both Hadoop 2 and Hadoop 3.
Is your feature request related to a problem? Please describe.
We've got ValidatorTable and tables in the config, but they're not really tables in the case of ORC or Parquet files. Let's get rid of the tables moniker and choose something else.
Describe the solution you'd like
ValidatorDataSource and sources might be more appropriate.
N.b. this would be a breaking change.
Recent changes to GitHub Actions disable it for paid orgs that are on older plans. We have to go back to Travis since this is not likely to be resolved amenably anytime soon.