zinggai / zingg
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
License: GNU Affero General Public License v3.0
We should embrace the idea from https://roundup.getdbt.com/p/devops-and-the-modern-data-experience and build a nice API for Zingg.
try https://cloud.google.com/solutions/spark and see if we can build something out with Zingg
Need to write how to integrate GE and Zingg.
didn't accept the "_" underscore
Originally posted by @premsmac2021 in #49 (comment)
During the linking phase, matching and comparison are done on rows from the same dataset.
**To Reproduce**
**Expected behavior**
During the labeling phase, I expect to see only pairs of rows from different z_source values, and I expect the output to likewise contain only pairs with different z_source values.
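If the linker's output does need post-filtering, the expectation above can be enforced downstream. A minimal sketch in plain Python, assuming each output row carries the z_source column from the report and a cluster id column assumed here to be named z_cluster (the exact output columns depend on your Zingg version and configuration):

```python
from collections import defaultdict

def cross_source_clusters(rows):
    """Keep only clusters whose members span more than one z_source.

    `rows` is a list of dicts, each with at least a z_cluster and a
    z_source key. Clusters whose rows all come from the same source
    are dropped, matching the linking expectation described above.
    """
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["z_cluster"]].append(row)
    return {
        cluster: members
        for cluster, members in by_cluster.items()
        if len({m["z_source"] for m in members}) > 1
    }
```

This is a post-processing workaround sketch, not something the linker does for you.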
Need a guide for this.
Having a prebuilt simple model can be helpful to get people started.
A user tried to run Zingg on Windows and got the following error:
Program 'zingg.sh' failed to run: No application is associated with the specified file for this operation. At line:1
See https://zinggai.slack.com/archives/C02JNH144TB/p1636045905031900
Understanding Zingg usage in an anonymous way - number of attributes, size of data, pipes, number of runs, time taken for each run will be helpful to build the right features. This should be absolutely anonymous, with no actual data leaving user environment. And user should have the ability to switch it off anytime.
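A sketch of what such an opt-out payload could look like. The ZINGG_NO_TELEMETRY environment variable and the field names here are hypothetical, not an existing Zingg feature:

```python
import os

def build_usage_payload(num_attributes, data_size, pipes, num_runs, run_seconds):
    """Assemble an anonymous usage payload: only counts and timings,
    never actual data. Returns None when the user has opted out via
    the (hypothetical) ZINGG_NO_TELEMETRY environment variable, so
    telemetry can be switched off at any time."""
    if os.environ.get("ZINGG_NO_TELEMETRY"):
        return None
    return {
        "num_attributes": num_attributes,
        "data_size": data_size,
        "pipes": pipes,          # pipe formats only (e.g. ["csv"]), never locations
        "num_runs": num_runs,
        "run_seconds": run_seconds,
    }
```

Keeping the payload to counts and durations is what makes the "no actual data leaves the user environment" guarantee checkable.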
Need to figure out how Zingg can be run from pyspark or python.
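Until a native Python API exists, one stopgap is a thin subprocess wrapper around the existing launcher. The --phase and --conf flags appear throughout these issues; the zingg_home install path below is an assumption:

```python
import subprocess

def zingg_command(phase, conf, zingg_home="/opt/zingg"):
    """Build the zingg.sh command line for a given phase and config file,
    mirroring the flags used elsewhere in this issue list."""
    return [f"{zingg_home}/scripts/zingg.sh", "--phase", phase, "--conf", conf]

def run_zingg(phase, conf, zingg_home="/opt/zingg"):
    """Run a Zingg phase and return the CompletedProcess, with stdout
    and stderr captured as text for inspection from Python."""
    return subprocess.run(zingg_command(phase, conf, zingg_home),
                          capture_output=True, text=True)
```

For example, run_zingg("match", "/tmp/config.json") would shell out the same way as invoking ./scripts/zingg.sh by hand; it is not an in-process integration with pyspark.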
What if we could take SQL from, say, a dbt model or elsewhere and use that for our model training - blocking as well as similarity? Then non-Java programmers could also code and customize Zingg without worrying about the internals.
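As a thought experiment, SQL-defined blocking could look like the following: each record's blocking key comes from a user-supplied SQL expression, evaluated here against an in-memory SQLite table. This is a sketch of the idea in the note above, not anything Zingg supports today:

```python
import sqlite3

def blocking_keys(rows, sql_expr):
    """Compute a blocking key per record from a SQL expression such as
    "substr(lname,1,3) || substr(zip,1,2)", so blocking logic can be
    written in SQL instead of Java.

    Column names are taken from the first row; everything is stored as
    TEXT. No quoting/escaping of identifiers is done - fine for a
    sketch, not for untrusted input.
    """
    cols = list(rows[0])
    con = sqlite3.connect(":memory:")
    con.execute(f"CREATE TABLE t ({', '.join(c + ' TEXT' for c in cols)})")
    con.executemany(
        f"INSERT INTO t VALUES ({', '.join('?' for _ in cols)})",
        [tuple(r[c] for c in cols) for r in rows],
    )
    return [key for (key,) in con.execute(f"SELECT {sql_expr} FROM t")]
```

Records sharing a key would land in the same block; the same pattern could in principle cover SQL-expressed similarity features too.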
https://docs.zingg.ai/docs/setup/configuration.html - pipes links are broken
explore terraform
WARN org.apache.spark.ml.util.Instrumentation - [7dacb264] All labels are the same value and fitIntercept=true, so the coefficients will be zeros. Training is not needed.
INFO org.apache.spark.ml.util.Instrumentation - [aa3fb364] training finished
INFO org.apache.spark.ml.util.Instrumentation - [e20e40e0] training finished
ERROR org.apache.spark.ml.util.Instrumentation - org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
WARN zingg.client.util.Email - Unable to send email Can't send command to SMTP host
WARN zingg.client.Client - Apologies for this message. Zingg has encountered an error. Exception thrown in awaitResult:
findTrainingData and Label completed
Right now the performance documentation is hidden under hardware sizing, which makes it a lot more difficult to find.
We can build a CLI that filters and shows results to the user, much like the labeller.
Add documentation on how to use Zingg with Azure
Hi, I hit a NullPointerException running zingg at the match phase. I had previously run the findTrainingData / label / train phases successfully. The error occurs early in the run (after ~1 minute):
I have no name!@e036d492b95b:/zingg-0.3.0-SNAPSHOT$ time ./scripts/zingg.sh --phase match --conf /tmp/research/tsmart_config.json
2021-11-04 14:52:31,814 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-11-04 14:52:32,064 [main] INFO zingg.client.Client -
2021-11-04 14:52:32,065 [main] INFO zingg.client.Client - ********************************************************
2021-11-04 14:52:32,066 [main] INFO zingg.client.Client - * Zingg AI *
2021-11-04 14:52:32,067 [main] INFO zingg.client.Client - * (C) 2021 Zingg.AI *
2021-11-04 14:52:32,068 [main] INFO zingg.client.Client - ********************************************************
2021-11-04 14:52:32,069 [main] INFO zingg.client.Client -
2021-11-04 14:52:32,070 [main] INFO zingg.client.Client - using: Zingg v0.3
2021-11-04 14:52:32,070 [main] INFO zingg.client.Client -
2021-11-04 14:52:32,289 [main] WARN zingg.client.Arguments - Config Argument is /tmp/research/tsmart_config.json
2021-11-04 14:52:32,564 [main] WARN zingg.client.Arguments - phase is match
2021-11-04 14:52:35,660 [main] WARN org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry - The function round replaced a previously registered function.
2021-11-04 14:52:35,660 [main] INFO zingg.ZinggBase - Start reading internal configurations and functions
2021-11-04 14:52:35,675 [main] INFO zingg.ZinggBase - Finished reading internal configurations and functions
2021-11-04 14:52:35,728 [main] WARN zingg.util.PipeUtil - Reading input csv
2021-11-04 14:53:13,325 [main] INFO zingg.Matcher - Read 932474
2021-11-04 14:53:13,414 [main] INFO zingg.Matcher - Blocked
2021-11-04 14:53:13,918 [main] WARN org.apache.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
2021-11-04 14:53:16,745 [Executor task launch worker for task 30] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 8.0 (TID 30)
java.lang.NullPointerException
at zingg.block.Block$BlockFunction.call(Block.java:403)
at zingg.block.Block$BlockFunction.call(Block.java:393)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
...
Several instances of the exception are reported; this is the first one.
My config file is:
{
"fieldDefinition":[
{
"fieldName" : "id",
"matchType" : "DONT_USE",
"fields" : "voterbase_id"
},
{
"fieldName" : "fname",
"matchType" : "fuzzy",
"fields" : "vb_tsmart_first_name",
"dataType": "\"string\""
},
{
"fieldName" : "mname",
"matchType" : "fuzzy",
"fields" : "vb_tsmart_middle_name",
"dataType": "\"string\""
},
{
"fieldName" : "lname",
"matchType" : "fuzzy",
"fields" : "vb_tsmart_last_name",
"dataType": "\"string\""
},
{
"fieldName" : "address1",
"matchType": "fuzzy",
"fields" : "vb_tsmart_full_address",
"dataType": "\"string\""
},
{
"fieldName" : "city",
"matchType": "fuzzy",
"fields" : "vb_tsmart_city",
"dataType": "\"string\""
},
{
"fieldName" : "state",
"matchType": "exact",
"fields" : "vb_tsmart_state",
"dataType": "\"string\""
},
{
"fieldName" : "zip",
"matchType": "exact",
"fields" : "vb_tsmart_zip",
"dataType": "\"string\""
},
{
"fieldName" : "homephone",
"matchType": "exact",
"fields" : "vb_voterbase_phone",
"dataType": "\"string\""
},
{
"fieldName" : "cellphone",
"matchType": "exact",
"fields" : "vb_voterbase_phone_wireless",
"dataType": "\"string\""
},
{
"fieldName" : "dob",
"matchType": "fuzzy",
"fields" : "vb_voterbase_dob",
"dataType": "\"string\""
},
{
"fieldName" : "gender",
"matchType": "fuzzy",
"fields" : "vb_voterbase_gender",
"dataType": "\"string\""
},
{
"fieldName" : "tier",
"matchType": "DONT_USE",
"fields" : "tier",
"dataType": "\"string\""
}
],
"output" : [{
"name":"output",
"format":"csv",
"props": {
"location": "/tmp/zinggOutput",
"delimiter": ",",
"header":true
}
}],
"data" : [{
"name":"test",
"format":"csv",
"props": {
"location": "/tmp/research/tmp/dupes/combined.csv",
"delimiter": ",",
"header":false
},
"schema":
"{\"type\" : \"struct\",
\"fields\" : [
{\"name\":\"id\", \"type\":\"string\", \"nullable\":false},
{\"name\":\"fname\", \"type\":\"string\", \"nullable\":true},
{\"name\":\"mname\", \"type\":\"string\", \"nullable\":true},
{\"name\":\"lname\",\"type\":\"string\",\"nullable\":true} ,
{\"name\":\"address1\", \"type\":\"string\", \"nullable\":true},
{\"name\":\"city\", \"type\":\"string\", \"nullable\":true},
{\"name\":\"state\", \"type\":\"string\", \"nullable\":true},
{\"name\":\"zip\", \"type\":\"string\", \"nullable\":true},
{\"name\":\"homephone\", \"type\":\"string\", \"nullable\":true},
{\"name\":\"cellphone\", \"type\":\"string\", \"nullable\":true},
{\"name\":\"dob\",\"type\":\"string\",\"nullable\":true} ,
{\"name\":\"gender\",\"type\":\"string\",\"nullable\":true}
]
}"
}],
"labelDataSampleSize" : 0.1,
"numPartitions":4,
"modelId": 100,
"zinggDir": "/tmp/research/tmp/models"
}
I am running zingg via Docker. I built my image using the release tar.gz from GitHub combined with a slightly altered Dockerfile:
FROM docker.io/bitnami/spark:3.0.2
WORKDIR /
ADD assembly/target/zingg-0.3.0-SNAPSHOT-spark-3.0.3.tar.gz .
WORKDIR /zingg-0.3.0-SNAPSHOT
I am not able to include a data sample since it includes PII, but it has fields:
voterbase_id
vb_tsmart_first_name
vb_tsmart_middle_name
vb_tsmart_last_name
vb_tsmart_full_address
vb_tsmart_city
vb_tsmart_state
vb_tsmart_zip
vb_voterbase_phone
vb_voterbase_phone_wireless
vb_voterbase_dob
vb_voterbase_gender
tier
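Since the NullPointerException above is thrown inside Block during blocking, one inexpensive thing to rule out before a run is empty or missing values in the match fields - whether that is the actual cause here is not confirmed. A small pre-flight check over a headerless CSV like the combined.csv in this report (header:false in the config):

```python
import csv
from collections import Counter

def empty_value_counts(path, fieldnames):
    """Count empty or missing values per column in a headerless CSV.

    `fieldnames` supplies the column names, in file order, since the
    file has no header row. Rows shorter than the field list yield
    None for the trailing columns, which is also counted as missing.
    """
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f, fieldnames=fieldnames):
            for name in fieldnames:
                value = row.get(name)
                if value is None or value.strip() == "":
                    counts[name] += 1
    return counts
```

Running this over the input and eyeballing the counts for the fuzzy/exact fields is a cheap sanity check before a long Spark job.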
Build a new febrl test file in XLS/XLSX format and test it
model for legal entity identifier
Hi Sonalgoya,
In the test.csv, you have the first column called REC ID which already says whether a record is Original or Duplicate. In my case, I have an input file that contains First name, last name and other similar columns that you have in test.csv. How do I use Zingg to find out the duplicates? Is there a pre-processing step I am missing?
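To the question above: no pre-processing column like REC ID is required. After the findTrainingData / label / train / match phases, Zingg writes its own cluster id into the output (assumed here to be the z_cluster column), and the duplicates are simply the clusters containing more than one record. A minimal sketch over the match output:

```python
from collections import Counter

def duplicate_groups(rows):
    """Given Zingg match-output rows, each carrying a z_cluster value,
    return the cluster ids that contain more than one record - i.e.
    the groups of probable duplicates."""
    sizes = Counter(row["z_cluster"] for row in rows)
    return [cluster for cluster, size in sizes.items() if size > 1]
```

Records whose cluster id appears only once matched nothing and can be treated as unique.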
Now that the documents are in a more consumable shape, we should add the case studies.
Need to take another look at the zingg.sh script and remove the elastic, license, and email options, as well as SPARK_MEM etc.
Explainability will be critical in a business context: being able to say why some records matched and why others did not.
Is your feature request related to a problem? Please describe.
Currently, the user has to manually set up the environments to compile and install Zingg. Instead, we should add make targets that handle all the prerequisites and setup.
I installed Spark and the JDK on my local machine. When I try to run the following script:
./scripts/zingg.sh --phase match --conf examples/febrl/config.json
I get an error. Error message:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.unsafe.array.ByteArrayMethods.(ByteArrayMethods.java:54)
at org.apache.spark.internal.config.package$.(package.scala:1095)
at org.apache.spark.internal.config.package$.(package.scala)
at org.apache.spark.deploy.SparkSubmitArguments.$anonfun$loadEnvironmentArguments$3(SparkSubmitArguments.scala:157)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:157)
at org.apache.spark.deploy.SparkSubmitArguments.(SparkSubmitArguments.scala:115)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$3.(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:85)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not "opens java.nio" to unnamed module @755c9148
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Constructor.checkCanSetAccessible(Constructor.java:188)
at java.base/java.lang.reflect.Constructor.setAccessible(Constructor.java:181)
at org.apache.spark.unsafe.Platform.(Platform.java:56)
It will improve readability quite a bit if
How similarity works - how to interpret the scores. I have already got that in an email explaining to a customer, need to put it to shape and add it to this repo.
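Until that writeup lands, one common way to act on similarity scores is a three-way triage. The sketch below assumes each output pair carries a score column, called z_score here for illustration (check your Zingg version's output columns for the actual names), and the 0.5/0.9 cutoffs are illustrative assumptions, not recommended defaults:

```python
def triage_by_score(pairs, review_cutoff=0.5, sure_cutoff=0.9):
    """Split scored pairs into sure matches, a human-review band, and
    unlikely matches. Both cutoffs are illustrative; in practice they
    are tuned against labeled data."""
    sure, review, unlikely = [], [], []
    for pair in pairs:
        score = pair["z_score"]
        if score >= sure_cutoff:
            sure.append(pair)
        elif score >= review_cutoff:
            review.append(pair)
        else:
            unlikely.append(pair)
    return sure, review, unlikely
```

The middle band is where a labeller-style review tool (as proposed elsewhere in this list) earns its keep.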
Right now there is no way to update labels if a wrong one has been specified. See https://zinggai.slack.com/archives/C02JNH144TB/p1638361934009800
Let's have multiple profiles in the pom to support different Spark versions (2.4+)
Describe the bug
Hello! I'm testing this out on Azure Databricks. I believe my job spec is correct:
{
"settings": {
"new_cluster": {
"spark_version": "7.3.x-scala2.12",
"spark_conf": {
"spark.databricks.delta.preview.enabled": "true"
},
"azure_attributes": {
"availability": "ON_DEMAND_AZURE",
"first_on_demand": 1,
"spot_bid_max_price": -1
},
"node_type_id": "Standard_DS3_v2",
"enable_elastic_disk": true,
"num_workers": 5
},
"libraries": [
{
"jar": "dbfs:/FileStore/jars/960f43e3_3c55_4c5b_a296_e69bc80ed1a6-zingg_0_3_0_SNAPSHOT-aa6ea.jar"
},
{
"maven": {
"coordinates": "org.jsoup:jsoup:1.7.2"
}
}
],
"spark_jar_task": {
"main_class_name": "zingg.client.Client",
"parameters": [
"--phase",
"findTrainingData",
"--conf",
"dbfs:/Bilbro/examples/zinggconfig.json"
],
"run_as_repl": true
},
"email_notifications": {},
"name": "zinggtest",
"max_concurrent_runs": 1
}
}
But the main class does not appear to accept a list of strings. I receive this error:
21/09/22 20:41:51 command--1:1: error: type mismatch;
found : Array[String]
required: String
zingg.client.Client.main(Array("--phase","findTrainingData","--conf","dbfs:/Bilbro/examples/zinggconfig.json"))
^
A user reported a Git LFS issue while cloning the repo. Need to verify the steps and fix the cloning instructions.
I see. The pairs to be labeled are found by looking through the entire dataset sample and do not depend on where the records are placed. How many times did you run the findTrainingData and label phases? You can pick up from where you left off until you have 30-40 pairs of actual matches.
Also, if you are not finding many matches in the label phase, and if the findTrainingData phase is fast enough on the hardware you have deployed, you can change labelDataSampleSize to 0.2 so it will look at twice the sample to fetch the pairs.
I will update the documentation to make things clearer.
Originally posted by @sonalgoyal in #46 (comment)
Explore if we can provide an easy way to build the json args - can we use something like https://github.com/json-editor/json-editor ?
I have tried to run the Zingg utility as per the steps shown in the readme, but the zingg-0.3.0-SNAPSHOT.jar referred to in zingg.sh is not present in the published code. Kindly provide it.
Need documentation or a blog post for this