
Databricks ARC

Welcome to the Databricks ARC GitHub page.

Installation

The package can be installed with pip:

%pip install databricks-arc

Project Description

Databricks ARC (Automated Record Connector) is a solution accelerator by Databricks that performs highly scalable probabilistic data de-duplication without requiring any labelled data or subject matter expertise in entity resolution.

ARC's linking engine is Splink, the UK Ministry of Justice's open-source entity resolution package. ARC builds on Splink by removing the need to manually provide the parameters that calibrate an unsupervised de-duplication task, which otherwise requires both a deep understanding of entity resolution and good knowledge of the dataset itself. The way ARC achieves this is detailed in the table below, followed by an illustrative sketch of the rule statements involved:

| Parameter | Splink | ARC |
|---|---|---|
| Prior match probability | User provides SQL-like statements for "deterministic rules" plus a recall score, which Splink uses to estimate the prior probability that two records match. | Automatically sets the prior probability to $\frac{1}{N}$. |
| Training rules | User provides SQL-like statements for a series of rules which train the m probability values for each column. | Automatically generates training rule statements such that every column is trained. |
| Comparisons | User provides distance functions and thresholds for each column to compare. | Automatically optimises a multi-level parameter space of functions and thresholds. |
| Blocking rules | User provides SQL-like statements to determine the possible comparison space and reduce the number of pairs to compare. | User provides the maximum number of pairs they are willing to compare; ARC identifies all possible blocking rules within that boundary and optimises for the best one. |
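For illustration, the kinds of SQL-like rule statements the table refers to might look like the following. These are hypothetical examples using Splink's l./r. record aliases and the column names from the Getting Started example further down; they are not ARC's actual generated rules:

# Deterministic rules: used to estimate the prior match probability.
deterministic_rules = [
    "l.givenname = r.givenname and l.surname = r.surname",
]

# Training rules: each rule fixes some columns so the m probabilities
# of the remaining columns can be estimated.
training_rules = [
    "l.givenname = r.givenname",
    "l.surname = r.surname",
]

# Blocking rules: only record pairs satisfying at least one rule are compared.
blocking_rules = [
    "l.postcode = r.postcode",
    "l.suburb = r.suburb",
]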

Parameter Optimisation

ARC uses Hyperopt (http://hyperopt.github.io/hyperopt/) to perform a Bayesian search for the optimal settings, where optimality is defined as minimising the entropy of the data after clustering and standardising record values within each cluster. The intuition is that, because we are linking different representations of the same entity (e.g. Facebook == Fakebook), standardising data values within a cluster will reduce the total number of distinct data values in the dataset.
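As a rough sketch of what such a Hyperopt search looks like (the search space, the stand-in objective and the values below are illustrative assumptions, not ARC's internals):

from hyperopt import fmin, tpe, hp, STATUS_OK

# Hypothetical search space: a single similarity threshold to tune.
space = {"threshold": hp.uniform("threshold", 0.5, 1.0)}

def objective(params):
    # Stand-in for the real objective, which would run Splink with these
    # settings, cluster the predictions and measure the entropy reduction.
    # Here a simple analytic function plays the role of "entropy after clustering".
    stand_in_entropy = (params["threshold"] - 0.8) ** 2
    return {"loss": stand_in_entropy, "status": STATUS_OK}

# TPE performs the Bayesian search over the space; Hyperopt minimises the loss.
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
print(best)  # the threshold value that minimised the stand-in objective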

To achieve this, ARC optimises a custom information gain metric calculated from the clusters of duplicates that Splink predicts. Intuitively, it measures the reduction in entropy when the data is split into its predicted clusters: the greater the reduction in entropy within the predicted clusters of duplicates, the better the model is doing. Mathematically, we define the metric as follows:

Let the number of clusters in the matched subset of the data be c.

Let the maximum number of unique values in any column in the original dataset be u.

Then the "scaled" entropy of column k, N unique values with probability P is

$$E_{s,k} = -\Sigma_{i}^{N} P_{i} \log_{c}(P_{i})$$

Then the "adjusted" entropy of column k, N unique values with probability P is

$$E_{a,k} = -\Sigma_{i}^{N} P_{i} \log_{u}(P_{i})$$

The scaled information gain is

$$I_{s} = \sum_{k}^{K} \left( E_{s,k} - E'_{s,k} \right)$$

and the adjusted information gain is

$$I_{a} = \sum_{k}^{K} \left( E_{a,k} - E'_{a,k} \right)$$

where $E'_{s,k}$ and $E'_{a,k}$ are the mean scaled and adjusted entropies of column k, computed over the individual predicted clusters.

The metric to optimise for is:

$$I_{s}^{I_{a}}$$
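A minimal, single-column sketch of this metric in plain Python (the toy column, the cluster assignments and the entropy helper are illustrative assumptions; ARC computes the equivalent quantities over Spark DataFrames):

import math
from collections import Counter

def entropy(values, base):
    # Shannon entropy of a list of values, with logarithms in the given base.
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

# Toy column and a toy clustering of its row indices into predicted duplicate groups.
column = ["Facebook", "Fakebook", "Facebook", "Databricks", "Databrics", "Databricks"]
clusters = [[0, 1, 2], [3, 4, 5]]

c = len(clusters)        # number of predicted clusters
u = len(set(column))     # maximum number of unique values in any column (one column here)

# Whole-column entropies (scaled: log base c, adjusted: log base u).
E_s = entropy(column, base=c)
E_a = entropy(column, base=u)

# Mean entropies of the individual predicted clusters.
E_s_prime = sum(entropy([column[i] for i in idx], base=c) for idx in clusters) / c
E_a_prime = sum(entropy([column[i] for i in idx], base=u) for idx in clusters) / c

I_s = E_s - E_s_prime    # scaled information gain (a single column, so no sum over k)
I_a = E_a - E_a_prime    # adjusted information gain
print(I_s ** I_a)        # the quantity ARC optimises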

Getting Started

Load a Spark DataFrame of data to be deduplicated:

data = spark.read.table("my_catalog.my_schema.my_duplicated_data")

After installation, import and enable ARC:

import arc
from arc.autolinker import AutoLinker

arc.enable_arc()

Initialise an instance of the AutoLinker class:

autolinker = AutoLinker()

Run unsupervised de-duplication:

autolinker.auto_link(
  data=data,                                                         # Spark DataFrame of data to deduplicate
  attribute_columns=["givenname", "surname", "postcode", "suburb"],  # List of column names containing attributes to compare
  unique_id="uid",                                                   # Name of the unique identifier column
  comparison_size_limit=100000,                                      # Maximum number of pairs to compare
  max_evals=20                                                       # Number of trials to run during optimisation process
)

Access the clustered DataFrame; predicted duplicates will share the same cluster_id:

clusters = autolinker.best_clusters_at_threshold()
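For example, to inspect the predicted duplicate groups (assuming, as described above, that the output contains a cluster_id column alongside the original attributes):

# Show clusters containing more than one record, i.e. the predicted duplicates.
(clusters
    .groupBy("cluster_id")
    .count()
    .filter("count > 1")
    .orderBy("count", ascending=False)
    .show())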

Use Splink's built-in visualisers and dashboards:

autolinker.cluster_viewer()
