Exploring Entity Resolution with Dedupe in Python

This walk-through uses Jupyter Notebook, Pandas, and the Python library Dedupe to explore some initial approaches to deduplication and entity resolution.

Please make sure you have Jupyter and Pandas installed before we move on.

pip install jupyter
pip install pandas

Dedupe Examples

These are example scripts for dedupe, a library that uses machine learning to perform de-duplication and entity resolution quickly on structured data.

To get these examples:

git clone https://github.com/DistrictDataLabs/dedupe-examples.git
cd dedupe-examples

Now we'll launch Jupyter and open up the file called "DDRL_EntResLab.ipynb":

jupyter notebook

CSV example - early childhood locations

Testing out dedupe

Let's experiment with using the dedupe library to try cleaning up our file.

To get dedupe running, we'll need to install Unidecode, Future, and Dedupe.

In your terminal:

pip install unidecode
pip install future
pip install dedupe

Then we'll run the csv_example.py file to see what dedupe can do:

python csv_example.py

As you can see, the example script is a command-line application that prompts you to engage in active learning: it shows pairs of entities and asks whether they are the same or different.

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished

Let's start training! Use 'y', 'n' and 'u' keys to flag duplicates for active learning.

When you are finished, enter 'f' to quit.
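The prompt loop described above can be sketched in plain Python. This is only a minimal illustration, not dedupe's own code: `label_pairs`, its `ask` parameter, and the `"match"` / `"distinct"` keys are hypothetical names chosen for this sketch. A real active learner also chooses which pair to show next based on what it has learned so far, whereas this version simply walks through the pairs in the order given.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record]


def label_pairs(pairs: Iterable[Pair],
                ask: Callable[[str], str] = input) -> Dict[str, List[Pair]]:
    """Collect y/n/u/f answers for candidate pairs (hypothetical helper)."""
    labels: Dict[str, List[Pair]] = {"match": [], "distinct": []}
    for left, right in pairs:
        # Show both records, one field per line, as in the console prompt.
        for record in (left, right):
            for field, value in record.items():
                print(f"{field} : {value}")
            print()
        print("Do these records refer to the same thing?")
        answer = ""
        # Keep asking until one of the four accepted keys is entered.
        while answer not in {"y", "n", "u", "f"}:
            answer = ask("(y)es / (n)o / (u)nsure / (f)inished ").strip().lower()
        if answer == "f":
            break
        if answer == "y":
            labels["match"].append((left, right))
        elif answer == "n":
            labels["distinct"].append((left, right))
        # 'u' records nothing: an unsure answer adds no labeled example.
    return labels
```

Passing a custom `ask` function instead of `input` makes the loop easy to drive from a script or a test.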

To see how you might use dedupe with smallish data, see the annotated source code for csv_example.py.

Patent example - patent holders

This example works with Dutch inventors from the PATSTAT international patent data file.

cd patent_example
pip install unidecode
python patent_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

Record Linkage example - electronics products

This example takes two spreadsheets of electronics products and links up the matching entries. Each dataset individually has no duplicates.

cd record_linkage_example
python record_linkage_example.py

To see how you might use dedupe for linking datasets, see the annotated source code for record_linkage_example.py.
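Because neither dataset contains internal duplicates, each record can match at most one record on the other side. The sketch below illustrates that one-to-one constraint with a greedy matcher. It is not how dedupe scores pairs: `difflib.SequenceMatcher` stands in for the learned model, and the function names, the threshold of 0.6, and the product titles are all made up for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]; a stand-in for a learned model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_one_to_one(left: Dict[str, str], right: Dict[str, str],
                    threshold: float = 0.6) -> List[Tuple[str, str, float]]:
    """Greedily pair each left id with at most one right id."""
    # Score every cross-dataset pair, best first.
    scored = sorted(
        ((similarity(lt, rt), lid, rid)
         for lid, lt in left.items()
         for rid, rt in right.items()),
        reverse=True,
    )
    used_left, used_right, links = set(), set(), []
    for score, lid, rid in scored:
        if score < threshold:
            break  # everything after this point scores lower still
        if lid in used_left or rid in used_right:
            continue  # one side is already linked
        used_left.add(lid)
        used_right.add(rid)
        links.append((lid, rid, score))
    return links


# Made-up product titles, for illustration only.
store_a = {"a1": "Acme wireless mouse M100", "a2": "Acme USB keyboard K20"}
store_b = {"b1": "ACME M100 wireless mouse", "b2": "Zeta HDMI cable 2m"}
links = link_one_to_one(store_a, store_b)
for left_id, right_id, score in links:
    print(left_id, right_id, round(score, 2))
```

Scoring every cross pair is quadratic in the number of records, so this approach only suits small inputs.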

MySQL example - IL campaign contributions

See mysql_example/README.md for details

To see how you might use dedupe with biggish data, see the annotated source code for mysql_example.

PostgreSQL big dedupe example - PostgreSQL example on large dataset

See pgsql_big_dedupe_example/README.md for details

This is the same example as the MySQL IL campaign contributions dataset above, but ported to run on PostgreSQL.

Training

The secret sauce of dedupe is human input. In order to figure out the best rules to deduplicate a set of data, you must give it a set of labeled examples to learn from.

The more labeled examples you give it, the better the deduplication results will be. At minimum, you should try to provide 10 positive matches and 10 negative matches.

The results of your training will be saved in a JSON file for future runs of dedupe.
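As a rough sketch of the bookkeeping described above, the snippet below saves labeled pairs to a JSON file, reloads them on a later run, and checks them against the suggested minimum of 10 matches and 10 non-matches. The `"match"` / `"distinct"` layout and the helper names are assumptions made for this illustration; the file that dedupe itself writes may be structured differently.

```python
import json
import os
import tempfile
from typing import Dict, List

MIN_PER_CLASS = 10  # the minimum suggested in the text above


def training_is_sufficient(labels: Dict[str, List],
                           minimum: int = MIN_PER_CLASS) -> bool:
    """True once both classes have at least `minimum` labeled pairs."""
    return (len(labels.get("match", [])) >= minimum
            and len(labels.get("distinct", [])) >= minimum)


def save_labels(labels: Dict[str, List], path: str) -> None:
    """Write labeled pairs to a JSON file for future runs."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(labels, f, indent=2)


def load_labels(path: str) -> Dict[str, List]:
    """Read labeled pairs back, or start empty if no file exists yet."""
    if not os.path.exists(path):
        return {"match": [], "distinct": []}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Hypothetical usage: one labeled match, stored as a two-item list of records.
labels = {"match": [], "distinct": []}
labels["match"].append([
    {"Phone": "2850617", "Address": "3801 s. wabash"},
    {"Phone": "2850617", "Address": "3801 s wabash ave"},
])
path = os.path.join(tempfile.mkdtemp(), "training.json")
save_labels(labels, path)
reloaded = load_labels(path)
print(training_is_sufficient(reloaded))  # still below the suggested minimum
```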

Here's an example labeling operation:

Phone :  2850617
Address :  3801 s. wabash
Zip :
Site name :  ada s. mckinley st. thomas cdc

Phone :  2850617
Address :  3801 s wabash ave
Zip :
Site name :  ada s. mckinley community services - mckinley - st. thomas

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
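To see why a person would likely answer "yes" for this pair, it can help to compare the two records field by field. The sketch below does that with `difflib.SequenceMatcher` from the standard library. This is a stand-in similarity measure chosen for illustration, not the comparison dedupe uses internally, and `field_similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from typing import Optional

# The two records from the labeling example above.
record_a = {
    "Phone": "2850617",
    "Address": "3801 s. wabash",
    "Zip": "",
    "Site name": "ada s. mckinley st. thomas cdc",
}
record_b = {
    "Phone": "2850617",
    "Address": "3801 s wabash ave",
    "Zip": "",
    "Site name": "ada s. mckinley community services - mckinley - st. thomas",
}


def field_similarity(a: str, b: str) -> Optional[float]:
    """Return a ratio in [0, 1], or None when either value is missing."""
    if not a or not b:
        return None
    return SequenceMatcher(None, a, b).ratio()


scores = {
    field: field_similarity(record_a[field], record_b[field])
    for field in record_a
}
for field, score in scores.items():
    print(field, ":", "missing" if score is None else round(score, 2))
```

The phone numbers are identical and the addresses differ only in punctuation and a trailing "ave", while the empty Zip fields carry no evidence either way.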

Contributors

fgregg, derekeder, nikitsaraf, markhuberty, cathydeng, rebeccabilbro, fideln8, mekarpeles, xykev, justinmanley, dmkoch, adamferguson, davidheryanto, gburt, jpvelez, paulshannon, wleftwich, chancyk
