Giter Club home page Giter Club logo

databricks-industry-solutions / customer-er Goto Github PK

View Code? Open in Web Editor NEW
18.0 2.0 8.0 170 KB

Translating text attributes (like name, address, phone number) into quantifiable numerical representations Training ML models to determine if these numerical labels form a match Scoring the confidence of each match

Home Page: https://www.databricks.com/solutions/accelerators/customer-entity-resolution

License: Other

Python 100.00%
databricks-industry-solutions zingg cme cybersecurity entity-resolution fsi fuzzy-matching pubsec rcg

customer-er's Introduction

Introduction

The process of matching records to one another is known as entity-resolution. When dealing with entities such as persons, the process often requires the comparison of name and address information which is subject to inconsistencies and errors. In these scenarios, we often rely on probabilistic (fuzzy) matching techniques that identify likely matches based on degrees of similarity between these elements.

There are a wide range of techniques which can be employed to perform such matching. The challenge is not just to identify which of these techniques provide the best matches but how to compare one record to all the other records that make up the dataset in an efficient manner. Data Scientists specializing in entity-resolution often employ specialized blocking techniques that limit which customers should be compared to one another using mathematical short-cuts.

In a prior solution accelerator, we explored how some of these techniques may be employed (in a product matching scenario). In this solution accelerator, we'll take a look at how the Zingg library can be used to simplify an person matching implementation that takes advantage of a fuller range of techniques.

The Zingg Workflow

Zingg is a library that provides the building blocks for ML-based entity-resolution using industry-recognized best practices. It is not an application, but it provides the capabilities required to the assemble a robust application. When run in combination with Databricks, Zingg provides the application the scalability that's often needed to perform entity-resolution on enterprise-scaled datasets.

To build a Zingg-enabled application, it's easiest to think of Zingg as being deployed in two phases. In the first phase, candidate pairs of potential duplicates are extracted from an initial dataset and labeled by expert users. These labeled pairs are then used to train a model capable of scoring likely matches.

In the second phase, the model trained in the first phase is applied to newly arrived data. Those data are compared to data processed in prior runs to identify likely matches between in incoming and previously processed dataset. The application engineer is responsible for how matched and unmatched data will be handled, but typically information about groups of matching records are updated with each incremental run to identify all the record variations believed to represent the same entity.


© 2022 Databricks, Inc. All rights reserved. The source in this notebook is provided subject to the Databricks License [https://databricks.com/db-license-source]. All included or referenced third party libraries are subject to the licenses set forth below.

library description license source
zingg entity resolution library GNU Affero General Public License v3.0 https://github.com/zinggAI/zingg/
tabulate pretty-print tabular data in Python MIT License https://pypi.org/project/tabulate/
filesplit Python module that is capable of splitting files and merging it back MIT License https://pypi.org/project/filesplit/

To run this accelerator, clone this repo into a Databricks workspace. Attach the RUNME notebook to any cluster running a DBR 11.0 or later runtime, and execute the notebook via Run-All. A multi-step-job describing the accelerator pipeline will be created, and the link will be provided. Execute the multi-step-job to see how the pipeline runs.

The job configuration is written in the RUNME notebook in json format. The cost associated with running the accelerator is the user's responsibility.

customer-er's People

Contributors

danielsparing avatar dbbnicole avatar sonalgoyal avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

customer-er's Issues

find_training_job.run_and_wait()

Hi,
I followed the notebooks and got stuck on the notebook 01_initial_workflow

executing find_training_job.run_and_wait() gives me the error

Job in INTERNAL_ERROR state at 331 seconds since launch. Waiting 30 seconds before checking again

I am wondering because this step should not take so long. Could that an error of the API class defined in o intro and config???

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.