
zinggai / zingg

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

License: GNU Affero General Public License v3.0

Languages: Java 74.81%, Scala 1.51%, Shell 0.31%, Dockerfile 0.07%, FreeMarker 0.01%, Python 13.95%, Makefile 0.08%, Batchfile 0.09%, HTML 9.16%

Topics: analytics, analytics-engineering, data-science, data-transformation, data-transformations, dataengineering, datalake, dataquality, dedupe, deduplication, entity-resolution, etl, fuzzy-matching, fuzzymatch, identity, identity-resolution, masterdata, ml, modern-data-stack, spark

zingg's People

Contributors

abhay447, aditya-r-chakole, akash-r-7, bagotia16, chetan453, dependabot[bot], edmondop, gnanaprakash-ravi, jgransac, manan-s0ni, morazow, navinrathore, ravirajbaraiya, semyonsinchenko, shefalika-thapa, siddharth2798, sonalgoyal, vikasgupta78


zingg's Issues

Linking phase matches from the same dataset

During the linking phase, matching and comparison are done on rows from the same dataset.

**To Reproduce**

  1. Run the findTrainingData, label, train, and link phases.
  2. During the label phase, rows from the same dataset are displayed for labeling.
  3. The resulting output has matching pairs with the same z_source.

**Expected behavior**
During the label phase, I expect to see only pairs of rows with different z_source values, and the output should likewise contain only pairs with different z_source values.
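As a workaround until this is fixed, cross-source results can be filtered out of the link output downstream. A minimal sketch in plain Python, assuming (hypothetically) that each output row carries the z_cluster and z_source columns described above:

```python
# Sketch: keep only clusters whose members span more than one z_source.
# Column names z_cluster / z_source are taken from the issue; the row layout
# here is an assumption for illustration, not Zingg's actual output format.
from collections import defaultdict

def cross_source_clusters(rows):
    """Return rows belonging to clusters that mix records from different sources."""
    sources = defaultdict(set)
    for row in rows:
        sources[row["z_cluster"]].add(row["z_source"])
    return [r for r in rows if len(sources[r["z_cluster"]]) > 1]

rows = [
    {"z_cluster": 1, "z_source": "a", "name": "jane"},
    {"z_cluster": 1, "z_source": "b", "name": "jane d"},
    {"z_cluster": 2, "z_source": "a", "name": "bob"},
    {"z_cluster": 2, "z_source": "a", "name": "bob jr"},  # same-source cluster, dropped
]
print(cross_source_clusters(rows))
```

This drops whole clusters that never cross a source boundary, which matches the expected link semantics above.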

Anonymous Telemetry

Understanding how Zingg is used, in an anonymous way (number of attributes, size of data, pipes, number of runs, time taken per run), will help us build the right features. This must be absolutely anonymous, with no actual data leaving the user's environment, and the user should be able to switch it off at any time.
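A sketch of what such an anonymous event might look like; every name here (build_telemetry_event, install_id, and the field names) is hypothetical, not an actual Zingg API:

```python
# Hypothetical anonymous telemetry payload: only counts and durations,
# never field names, data values, or file paths.
import json
import uuid

def build_telemetry_event(num_fields, num_rows, pipe_format, phase, duration_s,
                          enabled=True):
    """Return an anonymous usage event, or None when telemetry is switched off."""
    if not enabled:  # user can opt out at any time
        return None
    return {
        "install_id": str(uuid.uuid4()),  # random id, not tied to any user data
        "phase": phase,
        "num_fields": num_fields,
        "num_rows": num_rows,             # a coarse size bucket would be even safer
        "pipe_format": pipe_format,
        "duration_s": round(duration_s, 1),
    }

event = build_telemetry_event(12, 932474, "csv", "match", 61.3)
print(json.dumps(event))
```

The opt-out path returning None keeps the "switch it off anytime" requirement explicit in the design.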

SQL based blocking and distance functions

What if we could take SQL from, say, a dbt model (or elsewhere) and use it for our model training, for blocking as well as similarity? Then non-Java programmers could also customize Zingg without worrying about the internals.
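To illustrate the idea, a user-supplied SQL expression can serve as a blocking key. This sketch evaluates one with Python's stdlib sqlite3; the table, columns, and expression are invented for the example, and Zingg's actual Java-based blocking works differently:

```python
# Sketch: evaluate a user-supplied SQL expression as a blocking key.
# The user intentionally supplies raw SQL here, so the f-string interpolation
# is by design for this sketch (it is not a pattern for untrusted input).
import sqlite3

def sql_blocking_key(rows, expression):
    """Group record ids by the value of a SQL blocking expression."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER, fname TEXT, lname TEXT)")
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    blocks = {}
    for rec_id, key in conn.execute(f"SELECT id, {expression} FROM records"):
        blocks.setdefault(key, []).append(rec_id)
    return blocks

rows = [(1, "Jane", "Doe"), (2, "Jayne", "Doe"), (3, "Bob", "Smith")]
# Block on the upper-cased first three letters of the last name.
print(sql_blocking_key(rows, "SUBSTR(UPPER(lname), 1, 3)"))
# → {'DOE': [1, 2], 'SMI': [3]}
```

Only records sharing a block key would then be compared pairwise, which is the point of blocking: the expression is something a dbt user could write without touching Java.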

Issue on running --phase train

WARN org.apache.spark.ml.util.Instrumentation - [7dacb264] All labels are the same value and fitIntercept=true, so the coefficients will be zeros. Training is not needed.
INFO org.apache.spark.ml.util.Instrumentation - [aa3fb364] training finished
INFO org.apache.spark.ml.util.Instrumentation - [e20e40e0] training finished
ERROR org.apache.spark.ml.util.Instrumentation - org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)

WARN zingg.client.util.Email - Unable to send email Can't send command to SMTP host
WARN zingg.client.Client - Apologies for this message. Zingg has encountered an error. Exception thrown in awaitResult:

The findTrainingData and label phases had completed successfully.

NullPointerException during match phase

Hi, I hit a NullPointerException running Zingg in the match phase. I had previously run the findTrainingData, label, and train phases successfully. The error occurs early in the run (after about a minute):

I have no name!@e036d492b95b:/zingg-0.3.0-SNAPSHOT$ time ./scripts/zingg.sh --phase match --conf /tmp/research/tsmart_config.json
2021-11-04 14:52:31,814 [main] WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 2021-11-04 14:52:32,064 [main] INFO  zingg.client.Client -
 2021-11-04 14:52:32,065 [main] INFO  zingg.client.Client - ********************************************************
 2021-11-04 14:52:32,066 [main] INFO  zingg.client.Client - *                    Zingg AI                           *
 2021-11-04 14:52:32,067 [main] INFO  zingg.client.Client - *               (C) 2021 Zingg.AI                       *
 2021-11-04 14:52:32,068 [main] INFO  zingg.client.Client - ********************************************************
 2021-11-04 14:52:32,069 [main] INFO  zingg.client.Client -
 2021-11-04 14:52:32,070 [main] INFO  zingg.client.Client - using: Zingg v0.3
 2021-11-04 14:52:32,070 [main] INFO  zingg.client.Client -
 2021-11-04 14:52:32,289 [main] WARN  zingg.client.Arguments - Config Argument is /tmp/research/tsmart_config.json
 2021-11-04 14:52:32,564 [main] WARN  zingg.client.Arguments - phase is match
 2021-11-04 14:52:35,660 [main] WARN  org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry - The function round replaced a previously registered function.
 2021-11-04 14:52:35,660 [main] INFO  zingg.ZinggBase - Start reading internal configurations and functions
 2021-11-04 14:52:35,675 [main] INFO  zingg.ZinggBase - Finished reading internal configurations and functions
 2021-11-04 14:52:35,728 [main] WARN  zingg.util.PipeUtil - Reading input csv
 2021-11-04 14:53:13,325 [main] INFO  zingg.Matcher - Read 932474
 2021-11-04 14:53:13,414 [main] INFO  zingg.Matcher - Blocked
 2021-11-04 14:53:13,918 [main] WARN  org.apache.spark.sql.catalyst.util.package - Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
 2021-11-04 14:53:16,745 [Executor task launch worker for task 30] ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 8.0 (TID 30)
 java.lang.NullPointerException
        at zingg.block.Block$BlockFunction.call(Block.java:403)
        at zingg.block.Block$BlockFunction.call(Block.java:393)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.mapelements_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.deserializetoobject_doConsume_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
...

Several instances of the exception are reported; this is the first one.

My config file is:

{
  "fieldDefinition":[
    {
      "fieldName" : "id",
      "matchType" : "DONT_USE",
      "fields" : "voterbase_id"
    },
    {
      "fieldName" : "fname",
      "matchType" : "fuzzy",
      "fields" : "vb_tsmart_first_name",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "mname",
      "matchType" : "fuzzy",
      "fields" : "vb_tsmart_middle_name",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "lname",
      "matchType" : "fuzzy",
      "fields" : "vb_tsmart_last_name",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "address1",
      "matchType": "fuzzy",
      "fields" : "vb_tsmart_full_address",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "city",
      "matchType": "fuzzy",
      "fields" : "vb_tsmart_city",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "state",
      "matchType": "exact",
      "fields" : "vb_tsmart_state",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "zip",
      "matchType": "exact",
      "fields" : "vb_tsmart_zip",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "homephone",
      "matchType": "exact",
      "fields" : "vb_voterbase_phone",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "cellphone",
      "matchType": "exact",
      "fields" : "vb_voterbase_phone_wireless",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "dob",
      "matchType": "fuzzy",
      "fields" : "vb_voterbase_dob",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "gender",
      "matchType": "fuzzy",
      "fields" : "vb_voterbase_gender",
      "dataType": "\"string\""
    },
    {
      "fieldName" : "tier",
      "matchType": "DONT_USE",
      "fields" : "tier",
      "dataType": "\"string\""
    }
    ],
    "output" : [{
      "name":"output",
      "format":"csv",
      "props": {
        "location": "/tmp/zinggOutput",
        "delimiter": ",",
        "header":true
      }
    }],
    "data" : [{
      "name":"test",
      "format":"csv",
      "props": {
        "location": "/tmp/research/tmp/dupes/combined.csv",
        "delimiter": ",",
        "header":false
      },
      "schema":
        "{\"type\" : \"struct\",
        \"fields\" : [
          {\"name\":\"id\", \"type\":\"string\", \"nullable\":false},
          {\"name\":\"fname\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"mname\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"lname\",\"type\":\"string\",\"nullable\":true} ,
          {\"name\":\"address1\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"city\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"state\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"zip\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"homephone\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"cellphone\", \"type\":\"string\", \"nullable\":true},
          {\"name\":\"dob\",\"type\":\"string\",\"nullable\":true} ,
          {\"name\":\"gender\",\"type\":\"string\",\"nullable\":true}
        ]
      }"
    }],
    "labelDataSampleSize" : 0.1,
    "numPartitions":4,
    "modelId": 100,
    "zinggDir": "/tmp/research/tmp/models"
}

I am running Zingg via Docker. I built my image using the release tar.gz from GitHub combined with a slightly altered Dockerfile:

FROM docker.io/bitnami/spark:3.0.2
WORKDIR /
ADD assembly/target/zingg-0.3.0-SNAPSHOT-spark-3.0.3.tar.gz .
WORKDIR /zingg-0.3.0-SNAPSHOT

I am not able to include a data sample since it contains PII, but it has these fields:

voterbase_id
vb_tsmart_first_name
vb_tsmart_middle_name
vb_tsmart_last_name
vb_tsmart_full_address
vb_tsmart_city
vb_tsmart_state
vb_tsmart_zip
vb_voterbase_phone
vb_voterbase_phone_wireless
vb_voterbase_dob
vb_voterbase_gender
tier

Get output using our input file

Hi @sonalgoyal,

In test.csv, the first column, REC ID, already says whether a record is an original or a duplicate. In my case, I have an input file that contains first name, last name, and other columns similar to those in test.csv. How do I use Zingg to find the duplicates? Is there a pre-processing step I am missing?

Add case studies

Now that the documentation is in more consumable shape, we should add the case studies.

Clean up zingg.sh script

Need to relook at the zingg.sh script and remove the elastic, license, and email options, as well as SPARK_MEM etc.

Add makefiles and targets to compile and install the code

Is your feature request related to a problem? Please describe.
Currently, the user has to manually set up the environment to compile and install Zingg. Instead, we should add make targets that handle all the prerequisites and setup.

Zingg setup issue

I installed Spark and the JDK on my local machine.
When I try to run the following script: ./scripts/zingg.sh --phase match --conf examples/febrl/config.json
I get an error.
Error message:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.apache.spark.unsafe.array.ByteArrayMethods.&lt;clinit&gt;(ByteArrayMethods.java:54)
at org.apache.spark.internal.config.package$.&lt;init&gt;(package.scala:1095)
at org.apache.spark.internal.config.package$.&lt;clinit&gt;(package.scala)
at org.apache.spark.deploy.SparkSubmitArguments.$anonfun$loadEnvironmentArguments$3(SparkSubmitArguments.scala:157)
at scala.Option.orElse(Option.scala:447)
at org.apache.spark.deploy.SparkSubmitArguments.loadEnvironmentArguments(SparkSubmitArguments.scala:157)
at org.apache.spark.deploy.SparkSubmitArguments.&lt;init&gt;(SparkSubmitArguments.scala:115)
at org.apache.spark.deploy.SparkSubmit$$anon$2$$anon$3.&lt;init&gt;(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit$$anon$2.parseArguments(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:85)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not "opens java.nio" to unnamed module @755c9148
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Constructor.checkCanSetAccessible(Constructor.java:188)
at java.base/java.lang.reflect.Constructor.setAccessible(Constructor.java:181)
at org.apache.spark.unsafe.Platform.&lt;clinit&gt;(Platform.java:56)

Add documentation on scoring

How similarity works and how to interpret the scores. I already have this written up in an email explaining it to a customer; it needs to be put into shape and added to this repo.
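Until that documentation lands, the intuition can be shown with a toy similarity score. This uses Python's stdlib difflib, not Zingg's actual scoring, which combines learned models over many field comparisons, but the reading of the number is the same:

```python
# Toy illustration of string similarity, NOT Zingg's actual scoring.
# The scale to internalize: 1.0 means identical, values near 0 mean dissimilar,
# and small edits cost only a little.
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized similarity between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similarity("Jane Doe", "Jane Doe"))   # identical strings score 1.0
print(similarity("Jane Doe", "Jayne Doe"))  # one small edit, still a high score
print(similarity("Jane Doe", "Bob Smith"))  # unrelated strings score low
```

The real documentation should explain how per-field scores like these are combined and thresholded into a match decision.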

Client argument list formatting

Describe the bug
Hello! I'm testing this out on Azure Databricks. I believe my job spec is correct:

{
    "settings": {
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "spark_conf": {
                "spark.databricks.delta.preview.enabled": "true"
            },
            "azure_attributes": {
                "availability": "ON_DEMAND_AZURE",
                "first_on_demand": 1,
                "spot_bid_max_price": -1
            },
            "node_type_id": "Standard_DS3_v2",
            "enable_elastic_disk": true,
            "num_workers": 5
        },
        "libraries": [
            {
                "jar": "dbfs:/FileStore/jars/960f43e3_3c55_4c5b_a296_e69bc80ed1a6-zingg_0_3_0_SNAPSHOT-aa6ea.jar"
            },
            {
                "maven": {
                    "coordinates": "org.jsoup:jsoup:1.7.2"
                }
            }
        ],
        "spark_jar_task": {
            "main_class_name": "zingg.client.Client",
            "parameters": [
                "--phase",
                "findTrainingData",
                "--conf",
                "dbfs:/Bilbro/examples/zinggconfig.json"
            ],
            "run_as_repl": true
        },
        "email_notifications": {},
        "name": "zinggtest",
        "max_concurrent_runs": 1
    }
}

But the main class does not appear to accept a list of strings. I receive this error:

21/09/22 20:41:51 command--1:1: error: type mismatch;
 found   : Array[String]
 required: String
zingg.client.Client.main(Array("--phase","findTrainingData","--conf","dbfs:/Bilbro/examples/zinggconfig.json"))
                              ^

Add graphic to denote that the findTrainingData and label phases have to be repeated

I see. The pairs to be labeled are found by looking through the entire dataset sample and do not depend on where the records are placed. How many times did you run the findTrainingData and label phases? You can continue from where you left off until you have 30-40 pairs of actual matches.

Also, if you are not finding many matches in the label phase, and if the findTrainingData phase is fast enough on the hardware you have deployed, you can change labelDataSampleSize to 0.2 so that it looks at twice the sample when fetching pairs.

I will update the documentation to make things clearer.

Originally posted by @sonalgoyal in #46 (comment)

zingg-0.3.0-SNAPSHOT.jar is not available

I tried to run the Zingg utility following the steps in the README, but the zingg-0.3.0-SNAPSHOT.jar referred to in zingg.sh is not present in the published code. Kindly provide it.
