
Dedupe Python Library


dedupe is a Python library that uses machine learning to perform fuzzy matching, deduplication, and entity resolution quickly on structured data.

dedupe will help you:

  • remove duplicate entries from a spreadsheet of names and addresses
  • link a list with customer information to another with order history, even without unique customer IDs
  • take a database of campaign contributions and figure out which ones were made by the same person, even if the names were entered slightly differently for each record

dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.


dedupe library consulting

If you or your organization would like professional assistance in working with the dedupe library, Dedupe.io LLC offers consulting services. Read more about pricing and available services here.

Tools built with dedupe

Dedupe.io is a cloud service powered by the dedupe library for de-duplicating and finding matches in your data. It provides a step-by-step wizard for uploading your data, setting up a model, training, clustering, and reviewing the results.

Dedupe.io also supports record linkage across data sources and continuous matching and training through an API.

For more, see the Dedupe.io product site, tutorials on how to use it, and differences between it and the dedupe library.

Dedupe is well adopted by the Python community. Check out this blog post, a YouTube video on how to use dedupe with Python, and a YouTube video on how to apply dedupe at scale using Spark.

Command-line tool for de-duplicating and linking CSV files. Read about it on Source, Knight-Mozilla OpenNews.

Installation

Using dedupe

If you only want to use dedupe, install it this way:

pip install dedupe

Familiarize yourself with dedupe's API, and get started on your project. Need inspiration? Have a look at some examples.

Developing dedupe

We recommend using virtualenv and virtualenvwrapper for working in a virtualized development environment. Read how to set up virtualenv.

Once you have virtualenvwrapper set up,

mkvirtualenv dedupe
git clone https://github.com/dedupeio/dedupe.git
cd dedupe
pip install -e . --config-settings editable_mode=compat
pip install -r requirements.txt

If the following tests pass, everything should have been installed correctly:

pytest

Afterwards, whenever you want to work on dedupe,

workon dedupe

Testing

Unit tests of core dedupe functions

pytest

Test using canonical dataset from Bilenko's research

Using Deduplication

python -m pip install -e ./benchmarks
python benchmarks/benchmarks/canonical.py

Using Record Linkage

python -m pip install -e ./benchmarks
python benchmarks/benchmarks/canonical_matching.py

Team

  • Forest Gregg, DataMade
  • Derek Eder, DataMade

Credits

Dedupe is based on Mikhail Yuryevich Bilenko's Ph.D. dissertation: Learnable Similarity Functions and their Application to Record Linkage and Clustering.

Errors / Bugs

If something is not behaving intuitively, it is a bug and should be reported. Report it here.

Note on Patches/Pull Requests

  • Fork the project.
  • Make your feature addition or bug fix.
  • Send us a pull request. Bonus points for topic branches.

Copyright

Copyright (c) 2022 Forest Gregg and Derek Eder. Released under the MIT License.

Third-party copyright in this distribution is noted where applicable.

Citing Dedupe

If you use Dedupe in an academic work, please give this citation:

Forest Gregg and Derek Eder. 2022. Dedupe. https://github.com/dedupeio/dedupe.


dedupe's Issues

SemiSupervisedNonDuplicates should provide fewer examples and find them more efficiently

Our algorithm to learn blocking can only handle a modest number of examples, and we currently have code to reduce the number of examples the code looks at if it is passed too many.

At the same time, we sometimes make a pretty expensive call to recordDistances in semiSupervisedLearning to find likely distinct pairs to feed to our blocking training.

This is not a critical performance issue, because in the typical case, when we use active learning, we don't make that call in semiSupervisedLearning.

However, it slows down the dev cycle: about 20 of the 30 seconds it takes to run canonical_test.py are taken up by this function call. It also seems pretty smelly.

Other potential predicates

  • geospatial: records within a given distance radius of each other (an example of 5 miles was given)
  • phonetic: matching similar-sounding words
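Neither predicate exists yet; a sketch of what each could look like in plain Python (function names and the 5-mile threshold are illustrative, not dedupe API):

```python
import math

def geo_predicate(lat1, lon1, lat2, lon2, radius_miles=5.0):
    """True if two points lie within radius_miles of each other,
    using the haversine great-circle distance."""
    r = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a)) <= radius_miles

def soundex(word):
    """Classic Soundex code: a simple phonetic blocking key that maps
    similar-sounding words to the same 4-character code."""
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
             "l": "4", "mn": "5", "r": "6"}
    digit = {c: d for letters, d in codes.items() for c in letters}
    word = word.lower()
    out = word[0].upper()
    prev = digit.get(word[0], "")
    for c in word[1:]:
        d = digit.get(c, "")
        if d and d != prev:
            out += d
        if c not in "hw":  # h and w do not break runs of the same code
            prev = d
    return (out + "000")[:4]
```

A phonetic predicate would then block together records whose names share a Soundex code, e.g. "Robert" and "Rupert" both map to R163.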

Reduce memory usage for larger datasets

core.scoreDuplicates creates a large numpy array in memory, which blows up as the number of records increases. We are currently trying to reduce this by chunking the candidates using itertools.islice.

However, the memory used by the numpy arrays doesn't seem to be reclaimed by the garbage collector. This may be the issue: numpy/numpy#1601
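A minimal sketch of the chunking idea with itertools.islice (names are hypothetical; dedupe's actual scoring code differs):

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items, so candidate
    pairs never need to be materialized all at once."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def score_in_chunks(candidates, score_fn, chunk_size=10000):
    """Score candidate pairs chunk by chunk; only one chunk's worth of
    pairs is held in memory at a time. Yields (pair, score) lazily."""
    for chunk in chunked(candidates, chunk_size):
        for pair in chunk:
            yield pair, score_fn(pair)
```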

Investigating further with memory_profiler and valgrind:

valgrind --tool=memcheck --suppressions=valgrind-python.supp python -E -tt ./dedupe/numpy_memory_test.py
valgrind --tool=massif /usr/bin/python dedupe/numpy_memory_test.py 

Better Blocking

The big thing is that we need better blocking. To get better blocking we need to find better predicates. Part of that is making better predicates available, like tf-idf, but the major part is supplying more positive examples. The major limit there is the number of records we can calculate record distances between; right now that limit is around 700.

We should work on that bottleneck. First assignment: consolidate these three near-identical functions into one which we can begin to optimize. https://gist.github.com/3761519

Affine Gap fails to compile in OSX Mountain Lion

When running on Mountain Lion

python setup.py install

executes successfully, but when the dedupe library is called, the following error occurs:

Traceback (most recent call last):
  File "examples/canonical_example.py", line 3, in <module>
    import exampleIO
  File "/Users/derekeder/projects/open-city/deduplication/dedupe/examples/exampleIO.py", line 3, in <module>
    import dedupe.core
  File "/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/__init__.py", line 14, in <module>
    import affinegap
ImportError: dlopen(/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/affinegap.so, 2): Symbol not found: _newarrayobject
  Referenced from: /Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/affinegap.so
  Expected in: flat namespace
  in /Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/affinegap.so

Address Preprocessing

Talked to a guy who used to work for NAVTEQ. He said the way they handled address deduplication was to first match street names to a canonical list, and then check whether the addresses were on the same 'block'. With TIGER or OSM data, we could definitely do that.

Pre-processing the data

Bilenko pre-processes the data by lowercasing it and removing non-alphanumeric characters. Should we do this, or leave it to the user? Having everything in the same case helps our performance, but removing punctuation does not have much effect.
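A sketch of that style of preprocessing, assuming plain ASCII fields (illustrative, not dedupe's actual code):

```python
import re

def preprocess(field):
    """Bilenko-style normalization: lowercase, strip non-alphanumeric
    characters, and collapse runs of whitespace."""
    field = field.lower()
    field = re.sub(r"[^a-z0-9 ]", "", field)
    return re.sub(r"\s+", " ", field).strip()
```

For example, `preprocess("123 N. Main St., Apt #4")` yields `"123 n main st apt 4"`.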

Better clustering

We have implemented the Chaudhuri and hierarchical clustering algorithms, neither with the best results. We need to do more research.

Improving pairwise scoring (#3) will help regardless of which clustering approach we take.

better pattern for testing

We currently have two ways of testing each piece of dedupe:

  • Files like predicates.py and blocking.py are executable, with some tests in the init section. This is now partially broken by the new package directory structure.
  • We have two test files in dedupe/test that run unit tests on affinegap and clustering. These can be executed under the dedupe.test namespace.

We should think of a way to do tests consistently.

Guidance for model selection

Interaction terms: provide some kind of hint, perhaps based on the calculated weights, as to which fields are good candidates for interaction terms.

Remove SciPy dependency for fastcluster

option 1: ask Daniel Mullner (fastcluster author) to make SciPy import on line 27 of fastcluster.py optional

option 2: fork fastcluster, remove line, add to dedupe package (GPL license restrictions)

option 3: alternative library to fastcluster

Interaction terms

Interaction terms, i.e. affine gap distance of name field * affine gap distance of address field
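A sketch of how such a term could enter a feature vector: the product of two per-field distances becomes one extra feature, letting the learner weight their joint effect (all names and comparators here are hypothetical):

```python
def feature_vector(record_pair, comparators, interactions=()):
    """Build a distance vector for a record pair: one base distance per
    field, plus one product term per requested field interaction."""
    a, b = record_pair
    distances = {f: compare(a[f], b[f]) for f, compare in comparators.items()}
    features = list(distances.values())
    for f1, f2 in interactions:
        # interaction term: product of the two field distances
        features.append(distances[f1] * distances[f2])
    return features
```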

Clustering of duplicate pairs

Michael Wick describes a joint approach to deduplication, clustering, and canonicalization that makes an enormous amount of sense and seems to perform wonderfully: http://people.cs.umass.edu/~mwick/MikeWeb/Publications_files/wick09entity.pdf This approach is extremely attractive, but would require an understanding of probabilistic graphical models currently beyond my ken.

If we don't go that route, then we should use the approach described by Chaudhuri et al. (ftp://ftp.research.microsoft.com/users/datacleaning/dedup_icde05.pdf), also explained in Naumann's "An Introduction to Duplicate Detection" (http://www.morganclaypool.com/doi/pdf/10.2200/S00262ED1V01Y201003DTM003).

Normalization of Affine Gap Edit Distance

There are a number of approaches to normalizing an edit distance:

  1. Amortized edit distance, i.e. the minimum over edit sequences of the summed edit-operation costs divided by the edit-sequence length
  2. Division by the sum of the lengths of the two strings
  3. Division by the maximum of the lengths of the two strings
  4. Li Yujian and Li Bo's procedure: http://ieeexplore.ieee.org.proxy.uchicago.edu/xpl/articleDetails.jsp?arnumber=4160958

I haven't tried 1, as it is a more complicated algorithm. There does not seem to be any appreciable difference in performance among 2, 3, and 4. Option 4 has the advantage over 2 and 3 that if the unnormalized edit distance is a metric, so is the normalized edit distance.
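A sketch of normalizations 2 and 3, using plain Levenshtein distance in place of affine-gap distance for brevity:

```python
def levenshtein(s, t):
    """Plain dynamic-programming edit distance (not affine-gap, but
    enough to illustrate the normalizations)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def normalized_by_sum(s, t):
    """Option 2: divide by the sum of the string lengths."""
    return levenshtein(s, t) / (len(s) + len(t)) if s or t else 0.0

def normalized_by_max(s, t):
    """Option 3: divide by the maximum string length."""
    return levenshtein(s, t) / max(len(s), len(t)) if s or t else 0.0
```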

determine useful output

idea 1: a list of numbers
a list in the same order as the original dataset, with the IDs (row numbers) of found duplicates and zeros for the rest

idea 2: 2 files

  • one with the entire data_d minus the duplicates, solving canonicalization by just picking the first record of each duplicate set
  • another with just the duplicates, the row each was flagged as a duplicate of, and a score
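Idea 1 could be sketched like this (1-based row IDs assumed; purely illustrative):

```python
def duplicate_flags(n_rows, duplicate_pairs):
    """One entry per original row: each later duplicate stores the
    1-based row ID of its earlier match; everything else stays 0."""
    flags = [0] * n_rows
    for first, later in duplicate_pairs:  # pairs of 1-based row IDs
        flags[later - 1] = first
    return flags
```

So with four rows where row 3 duplicates row 1, the output would be `[0, 0, 1, 0]`.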

Create InMemory helper functions class

We want to call certain functions when running on smaller datasets that are processed entirely in memory. Abstract these functions into a helper class.

  • blocking.blockingIndex
  • sampling for training (not written yet)

invertIndex fails when no TF/IDF canopy is chosen

We sometimes get the following error when running the canonical example. It is most likely due to invertIndex failing when no TF/IDF canopy is chosen.

Traceback (most recent call last):
  File "test/canonical_test.py", line 123, in <module>
    blocked_data = dedupe.blocking.blockingIndex(data_d, blocker)
  File "/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/blocking.py", line 173, in blockingIndex
    blocker.invertIndex(data_d.iteritems())
  File "/Users/derekeder/projects/open-city/deduplication/dedupe/dedupe/blocking.py", line 74, in invertIndex
    num_docs = len(self.token_vector[field])
UnboundLocalError: local variable 'field' referenced before assignment
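The pattern behind the error is a loop variable read after a loop that never ran. A hedged sketch of the fix, with hypothetical names rather than dedupe's actual internals:

```python
def num_docs_for(token_vector, tfidf_fields):
    """If tfidf_fields is empty, a loop over it never binds its loop
    variable, so reading that variable afterwards raises
    UnboundLocalError. Guard the empty case up front instead."""
    if not tfidf_fields:  # no TF/IDF canopy chosen
        return {}
    return {field: len(token_vector[field]) for field in tfidf_fields}
```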
