Giter Club home page Giter Club logo

name-quicksearch's Introduction

QuickSearch

A tool for quickly locating the most likely match for a name (or other short natural language string) in another (very large) set of names.

Entries for the tool are in the form (Name, UID) where the unique identifier is of any arbitrary data type that satisfies both Hashable and Eq, so there's no extra matching to be done afterwards to recover results; the results are returned with both the matching name and the matching name's UID.

Use case for which it was developed: You need to match records across two data sets using only names, but have too many records to reasonably perform string distance calculations between every pair in the cartesian set. QuickSearch only performs distance calculations between strings that share an entire token, making it well suited to quickly remove the low-hanging fruit. This in turn drastically reduces the sizes of the sets requiring a costly full scan, allowing for more iteration and experimentation on the edge cases.

Able to process thousands of names against a population of a hundred thousand names in under a second (once the QuickSearch is actually built... first search always takes several seconds to build the filters).

Uses Data.Text internally, but there is an identical String interface to be found at QuickSearch.String if that suits the pipeline better (but you should probably be using Data.Text, String is much slower).

Usage:

> import QuickSearch

> names = map T.pack ["Rep. Meg Mueller","Twana Jacobs","Sammie Paucek"]

> targets = zip names [1..] --Stand-in for your UIDs
> qs = buildQuickSearch targets

-- Exposes jaro, jaroWinkler, and damerauLevenshteinNorm from Data.Text.Metrics
-- but scorer can be any func of type (T.Text -> T.Text -> Ratio Int)
> entry = T.pack "Rep. Meg Muller"
> topNMatches qs 1 jaroWinkler entry
[Match (100,Entry ("rep. meg mueller", 1))]

> entry = T.pack "Towana Jacobs"
> matchesWithThreshold qs 90 damerauLevenshteinNorm entry
[Match (92,Entry ("twana jacobs", 2))]

Batch Usage

Both topNMatches and matchesWithThreshold have an associated batch function for processing entire lists at once, named batchTopNMatches and batchMatchesWithThreshold. These take lists of pairs of names and UIDs instead of a target string, and return for each entry the results of running the associated single function.

Both are built using a helper function batch, so the definition of batchTopNMatches is batchTopNMatches = batch topNMatches. If you define your own retrieval functions on a single entry, you can in turn define your own batch version in the same way.

QuickSearch.OneShot

If you have your list of names to be matched and list of target names both in the form [(T.Text, uid)] (or [(String, uid)]) and you don't plan on re-using the QuickSearch filters, you can run it as a one-shot batch with

import QuickSearch.OneShot
-- or, import QuickSearch.String.OneShot

names, targets :: (Hashable uid, Eq uid) => [(T.Text, uid)]
scorer :: (T.Text -> T.Text -> Ratio Int)

> oneShotTopNMatches 5 names targets scorer

which will return a list of (Entry uid1, Match (Entry uid2))

topNMatches and matchesWithThreshold have one-shot batch versions, named oneShotTopNMatches and oneShotMatchesWithThreshold respectively. Like the batch versions, they are built with a helper, oneShot, so you're free to make your own One-Shot retrieval functions by defining it for a single item and decorating it.

Shout out to Charles Sommers, who wrote the Python tool I'm reimplementing the idea of.

unstableNub used in QuickSearch.Filter borrowed from Universum.Nub and modified slightly under the MIT license.

name-quicksearch's People

Contributors

osoleve avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar

name-quicksearch's Issues

Parallelize batch/oneshot functions

For large batches/oneshots run on a multicore machine, it's possible that there's lift to be gained by parallelizing the lookups. This should be investigated.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.