
zinggai / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

License: GNU Affero General Public License v3.0

Languages: Java 74.81%, Scala 1.51%, Shell 0.31%, Dockerfile 0.07%, FreeMarker 0.01%, Python 13.95%, Makefile 0.08%, Batchfile 0.09%, HTML 9.16%

Topics: analytics, analytics-engineering, data-science, data-transformation, data-transformations, dataengineering, datalake, dataquality, dedupe, deduplication, entity-resolution, etl, fuzzy-matching, fuzzymatch, identity, identity-resolution, masterdata, ml, modern-data-stack, spark

zingg's People

Contributors

abhay447, aditya-r-chakole, akash-r-7, bagotia16, chetan453, dependabot[bot], edmondop, gnanaprakash-ravi, jgransac, manan-s0ni, morazow, navinrathore, ravirajbaraiya, semyonsinchenko, shefalika-thapa, siddharth2798, sonalgoyal, vikasgupta78


zingg's Issues

Linking phase matches from the same dataset

During the linking phase, matching and comparison are done on rows from the same dataset.

**To Reproduce**

  1. Run the findTrainingData, label, train, and link phases.
  2. During the label phase, rows from the same dataset are displayed for labeling.
  3. The resulting output has matching pairs with the same z_source.

**Expected behavior**
During the label phase, I expect to see only pairs of rows with different z_source values, and the output should likewise contain only pairs with different z_source values.
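As a workaround until this is fixed, cross-source results can be filtered out of the link output downstream. A minimal sketch in plain Python, assuming (hypothetically) that each output row carries the z_cluster and z_source columns described above:

```python
# Sketch: keep only clusters whose members span more than one z_source.
# Column names z_cluster / z_source are taken from the issue; the row layout
# here is an assumption for illustration, not Zingg's actual output format.
from collections import defaultdict

def cross_source_clusters(rows):
    """Return rows belonging to clusters that mix records from different sources."""
    sources = defaultdict(set)
    for row in rows:
        sources[row["z_cluster"]].add(row["z_source"])
    return [r for r in rows if len(sources[r["z_cluster"]]) > 1]

rows = [
    {"z_cluster": 1, "z_source": "a", "name": "jane"},
    {"z_cluster": 1, "z_source": "b", "name": "jane d"},
    {"z_cluster": 2, "z_source": "a", "name": "bob"},
    {"z_cluster": 2, "z_source": "a", "name": "bob jr"},  # same-source cluster, dropped
]
print(cross_source_clusters(rows))
```

This drops whole clusters that never cross a source boundary, which matches the expected link semantics above.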

Anonymous Telemetry

Understanding how Zingg is used, in an anonymous way (number of attributes, size of data, pipes, number of runs, time taken per run), will help us build the right features. This must be absolutely anonymous, with no actual data leaving the user's environment, and the user should be able to switch it off at any time.
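A sketch of what such an anonymous event might look like; every name here (build_telemetry_event, install_id, and the field names) is hypothetical, not an actual Zingg API:

```python
# Hypothetical anonymous telemetry payload: only counts and durations,
# never field names, data values, or file paths.
import json
import uuid

def build_telemetry_event(num_fields, num_rows, pipe_format, phase, duration_s,
                          enabled=True):
    """Return an anonymous usage event, or None when telemetry is switched off."""
    if not enabled:  # user can opt out at any time
        return None
    return {
        "install_id": str(uuid.uuid4()),  # random id, not tied to any user data
        "phase": phase,
        "num_fields": num_fields,
        "num_rows": num_rows,             # a coarse size bucket would be even safer
        "pipe_format": pipe_format,
        "duration_s": round(duration_s, 1),
    }

event = build_telemetry_event(12, 932474, "csv", "match", 61.3)
print(json.dumps(event))
```

The opt-out path returning None keeps the "switch it off anytime" requirement explicit in the design.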

SQL based blocking and distance functions

What if we could take SQL from, say, a dbt model (or elsewhere) and use it for our model training, for blocking as well as similarity? Then non-Java programmers could also customize Zingg without worrying about the internals.
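To illustrate the idea, a user-supplied SQL expression can serve as a blocking key. This sketch evaluates one with Python's stdlib sqlite3; the table, columns, and expression are invented for the example, and Zingg's actual Java-based blocking works differently:

```python
# Sketch: evaluate a user-supplied SQL expression as a blocking key.
# The user intentionally supplies raw SQL here, so the f-string interpolation
# is by design for this sketch (it is not a pattern for untrusted input).
import sqlite3

def sql_blocking_key(rows, expression):
    """Group record ids by the value of a SQL blocking expression."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER, fname TEXT, lname TEXT)")
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    blocks = {}
    for rec_id, key in conn.execute(f"SELECT id, {expression} FROM records"):
        blocks.setdefault(key, []).append(rec_id)
    return blocks

rows = [(1, "Jane", "Doe"), (2, "Jayne", "Doe"), (3, "Bob", "Smith")]
# Block on the upper-cased first three letters of the last name.
print(sql_blocking_key(rows, "SUBSTR(UPPER(lname), 1, 3)"))
# → {'DOE': [1, 2], 'SMI': [3]}
```

Only records sharing a block key would then be compared pairwise, which is the point of blocking: the expression is something a dbt user could write without touching Java.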

Issue on running --phase train

WARN org.apache.spark.ml.util.Instrumentation - [7dacb264] All labels are the same value and fitIntercept=true, so the coefficients will be zeros. Training is not needed.
INFO org.apache.spark.ml.util.Instrumentation - [aa3fb364] training finished
INFO org.apache.spark.ml.util.Instrumentation - [e20e40e0] training finished
ERROR org.apache.spark.ml.util.Instrumentation - org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)

WARN zingg.client.util.Email - Unable to send email Can't send command to SMTP host
WARN zingg.client.Client - Apologies for this message. Zingg has encountered an error. Exception thrown in awaitResult:

The findTrainingData and label phases had completed successfully.

NullPointerException during match phase

Hi, I hit a NullPointerException running Zingg in the match phase. I had previously run the findTrainingData, label, and train phases successfully. The error occurs early in the run (after about a minute):

I have no name!@e036d492b95b:/zingg-0.3.0-SNAPSHOT$ time ./scripts/zingg.sh --phase match --conf /tmp/research/tsmart_config.json
2021-11-04 14:52:31,814 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 2021-11-04 14:52:32,064 [main] INFO  zingg.client.Client -
 2021-11-04 14:52:32,065 [main] INFO  zingg.client.Client - ********************************************************
 2021-11-04 14:52:32,066 [main] INFO  zingg.client.Client - *                    Zingg AI                           *
 2021-11-04 14:52:32,067 [main] INFO  zingg.client.Client - *               (C) 2021 Zingg.AI                       *
 2021-11-04 14:52:32,068 [main] INFO  zingg.client.Client - ********************************************************
 2021-11-04 14:52:32,069 [main] INFO  zingg.client.Client -
 2021-11-04 14:52:32,070 [main] INFO  zingg.client.Client - using: Zingg v0.3
 2021-11-04 14:52:32,070 [main] INFO  zingg.client.Client -
 2021-11-04 14:52:32,289 [main] WARN  zingg.client.Arguments - Config Argument is /tmp/research/tsmart_config.json
 2021-11-04 14:52:32,564 [main] WARN  zingg.client.Arguments - phase is match
 2021-11-04 14:52:35,660 [main] WARN  org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry - The function round replaced a previously registered function.
 2021-11-04 14:52:35,660 [main] INFO  zingg.ZinggBase - Start reading internal configurations and functions
 2021-11-04 14:52:35,675 [main] INFO  zingg.ZinggBase - Finished reading internal configurations and functions
 2021-11-04 14:52:35,728 [main] WARN  zingg.util.PipeUtil - Reading input csv
 2021-11-04 14:53:13,325 [main] INFO  zingg.Matcher - Read 932474
 2021-11-04 14:53:13,414 [main] INFO  zingg.Matcher - Blocked
 2021-11-04 14:53:13,918 [main] WARN  org.apache.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
 2021-11-04 14:53:16,745 [Executor task launch worker for task 30] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 8.0 (TID 30)
 java.lang.NullPointerException
        at zingg.block.Block$BlockFunction.call(Block.java:403)
        at zingg.block.Block$BlockFunction.call(Block.java:393)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
...

Several instances of the exception are reported; this is the first one.

My config file is:

{
  "fieldDefinition":[
    {
      "fieldName" : "id",
      "matchType" : "DONT_USE",
      "fields" : "voterbase_id"
    },
    {
      "fieldName" : "fname",
      "matchType" : "fuzzy",
      "fields" : "vb_tsmart_first_name",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "mname",
      "matchType" : "fuzzy",
      "fields" : "vb_tsmart_middle_name",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "lname",
      "matchType" : "fuzzy",
      "fields" : "vb_tsmart_last_name",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "address1",
      "matchType": "fuzzy",
      "fields" : "vb_tsmart_full_address",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "city",
      "matchType": "fuzzy",
      "fields" : "vb_tsmart_city",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "state",
      "matchType": "exact",
      "fields" : "vb_tsmart_state",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "zip",
      "matchType": "exact",
      "fields" : "vb_tsmart_zip",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "homephone",
      "matchType": "exact",
      "fields" : "vb_voterbase_phone",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "cellphone",
      "matchType": "exact",
      "fields" : "vb_voterbase_phone_wireless",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "dob",
      "matchType": "fuzzy",
      "fields" : "vb_voterbase_dob",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "gender",
      "matchType": "fuzzy",
      "fields" : "vb_voterbase_gender",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "tier",
      "matchType": "DONT_USE",
      "fields" : "tier",
      "dataType": "\"string\""
    }
    ],
    "output" : [{
      "name":"output",
      "format":"csv",
      "props": {
        "location": "/tmp/zinggOutput",
        "delimiter": ",",
        "header":true
      }
    }],
    "data" : [{
      "name":"test",
      "format":"csv",
      "props": {
        "location": "/tmp/research/tmp/dupes/combined.csv",
        "delimiter": ",",
        "header":false
      },
      "schema":
        "{\"type\" : \"struct\",
        \"fields\" : [
          {\"name\":\"id\", \"type\":\"string\", \"nullable\":false},
          {\"name\":\"fname\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"mname\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"lname\",\"type\":\"string\",\"nullable\":true} ,
          {\"name\":\"address1\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"city\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"state\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"zip\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"homephone\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"cellphone\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"dob\",\"type\":\"string\",\"nullable\":true} ,
          {\"name\":\"gender\",\"type\":\"string\",\"nullable\":true}
        ]
      }"
    }],
    "labelDataSampleSize" : 0.1,
    "numPartitions":4,
    "modelId": 100,
    "zinggDir": "/tmp/research/tmp/models"
}

I am running Zingg via Docker. I built my image using the release tar.gz from GitHub combined with a slightly altered Dockerfile:

FROM docker.io/bitnami/spark:3.0.2
WORKDIR /
ADD assembly/target/zingg-0.3.0-SNAPSHOT-spark-3.0.3.tar.gz .
WORKDIR /zingg-0.3.0-SNAPSHOT

I am not able to include a data sample since it contains PII, but it has these fields:

voterbase_id
vb_tsmart_first_name
vb_tsmart_middle_name
vb_tsmart_last_name
vb_tsmart_full_address
vb_tsmart_city
vb_tsmart_state
vb_tsmart_zip
vb_voterbase_phone
vb_voterbase_phone_wireless
vb_voterbase_dob
vb_voterbase_gender
tier

Get output using our input file

Hi @sonalgoyal,

In test.csv, the first column, REC ID, already says whether a record is an original or a duplicate. In my case, I have an input file that contains first name, last name, and other columns similar to those in test.csv. How do I use Zingg to find the duplicates? Is there a pre-processing step I am missing?

Add case studies

Now that the documentation is in more consumable shape, we should add the case studies.

Clean up zingg.sh script

Need to relook at the zingg.sh script and remove the elastic, license, and email options, as well as SPARK_MEM etc.

Add makefiles and targets to compile and install the code

Is your feature request related to a problem? Please describe.
Currently, the user has to manually set up the environment to compile and install Zingg. Instead, we should add make targets that handle all the prerequisites and setup.

Zingg setup issue

I installed Spark and the JDK on my local machine.
When I try to run the following script: ./scripts/zingg.sh --phase match --conf examples/febrl/config.json
I get an error.
Error message:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.unsafe.array.ByteArrayMethods.&lt;clinit&gt;(ByteArrayMethods.java:54)
at org.apache.spark.internal.config.package$.&lt;init&gt;(package.scala:1095)
at org.apache.spark.internal.config.package$.&lt;clinit&gt;(package.scala)
at org.apache.spark.deploy.SparkSubmitArguments.$anonfun$loadEnvironmentArguments$3(SparkSubmitArguments.scala:157)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:157)
at org.apache.spark.deploy.SparkSubmitArguments.&lt;init&gt;(SparkSubmitArguments.scala:115)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$3.&lt;init&gt;(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:85)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not "opens java.nio" to unnamed module @755c9148
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Constructor.checkCanSetAccessible(Constructor.java:188)
at java.base/java.lang.reflect.Constructor.setAccessible(Constructor.java:181)
at org.apache.spark.unsafe.Platform.&lt;clinit&gt;(Platform.java:56)

Add documentation on scoring

How similarity works and how to interpret the scores. I already have this written up in an email explaining it to a customer; it needs to be put into shape and added to this repo.
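Until that documentation lands, the intuition can be shown with a toy similarity score. This uses Python's stdlib difflib, not Zingg's actual scoring, which combines learned models over many field comparisons, but the reading of the number is the same:

```python
# Toy illustration of string similarity, NOT Zingg's actual scoring.
# The scale to internalize: 1.0 means identical, values near 0 mean dissimilar,
# and small edits cost only a little.
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized similarity between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("Jane Doe", "Jane Doe"))   # identical strings score 1.0
print(similarity("Jane Doe", "Jayne Doe"))  # one small edit, still a high score
print(similarity("Jane Doe", "Bob Smith"))  # unrelated strings score low
```

The real documentation should explain how per-field scores like these are combined and thresholded into a match decision.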

Client argument list formatting

Describe the bug
Hello! I'm testing this out on Azure Databricks. I believe my job spec is correct:

{
    "settings": {
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "spark_conf": {
                "spark.databricks.delta.preview.enabled": "true"
            },
            "azure_attributes": {
                "availability": "ON_DEMAND_AZURE",
                "first_on_demand": 1,
                "spot_bid_max_price": -1
            },
            "node_type_id": "Standard_DS3_v2",
            "enable_elastic_disk": true,
            "num_workers": 5
        },
        "libraries": [
            {
                "jar": "dbfs:/FileStore/jars/960f43e3_3c55_4c5b_a296_e69bc80ed1a6-zingg_0_3_0_SNAPSHOT-aa6ea.jar"
            },
            {
                "maven": {
                    "coordinates": "org.jsoup:jsoup:1.7.2"
                }
            }
        ],
        "spark_jar_task": {
            "main_class_name": "zingg.client.Client",
            "parameters": [
                "--phase",
                "findTrainingData",
                "--conf",
                "dbfs:/Bilbro/examples/zinggconfig.json"
            ],
            "run_as_repl": true
        },
        "email_notifications": {},
        "name": "zinggtest",
        "max_concurrent_runs": 1
    }
}

But the main class does not appear to accept a list of strings. I receive this error:

21/09/22 20:41:51 command--1:1: error: type mismatch;
 found   : Array[String]
 required: String
zingg.client.Client.main(Array("--phase","findTrainingData","--conf","dbfs:/Bilbro/examples/zinggconfig.json"))
                              ^

Add graphic to denote that the findTrainingData and label phases have to be repeated

I see. The pairs to be labeled are found by looking through the entire dataset sample and do not depend on where the records are placed. How many times did you run the findTrainingData and label phases? You can continue from where you left off until you have 30-40 pairs of actual matches.

Also, if you are not finding many matches in the label phase, and if the findTrainingData phase is fast enough on the hardware you have deployed, you can change labelDataSampleSize to 0.2 so that it looks at twice the sample when fetching pairs.

I will update the documentation to make things clearer.

Originally posted by @sonalgoyal in #46 (comment)

zingg-0.3.0-SNAPSHOT.jar is not available

I tried to run the Zingg utility following the steps in the README, but the zingg-0.3.0-SNAPSHOT.jar referred to in zingg.sh is not present in the published code. Kindly provide it.
