hybrid-jaccard's Introduction

hybrid-jaccard

Implementation of hybrid jaccard similarity

Package files:

    |
    |-> __init__.py
    |
    |-> hybrid_jaccard.py: contains the base class for hybrid jaccard string matching
    |
    |-> jaro.py & typo_tables.py: contain the methods for jaro distance calculation
    |
    |-> munkres.py: contains the hungarian matching algorithm
    |
    |-> eye_config.txt: contains the configuration info for the hybrid-jaccard class
    |
    |-> eye_reference.txt: contains the reference eye colors
    |
    |-> input.txt: a sample input file for testing the program
    |
    |-> README.md
    |
    |-> LICENSE
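The package pairs a word-level similarity (Jaro) with Hungarian matching. As a rough illustration of how those pieces can combine, here is a minimal sketch of one common soft Jaccard formulation. It is not the code in hybrid_jaccard.py: `word_similarity` uses difflib's ratio as a stand-in for Jaro, a brute-force search over permutations stands in for munkres and is only practical for short phrases, and the function names and the scoring formula are assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import permutations


def word_similarity(a, b):
    # Stand-in for Jaro similarity (assumption): difflib's ratio in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def hybrid_jaccard(reference_words, input_words, threshold=0.9):
    # Order the lists so `short` has at most as many words as `long_`.
    short, long_ = sorted((list(reference_words), list(input_words)), key=len)
    if not long_:
        return 0.0
    # Pairwise similarities; pairs below the threshold count as no match.
    sims = []
    for a in short:
        row = []
        for b in long_:
            s = word_similarity(a, b)
            row.append(s if s >= threshold else 0.0)
        sims.append(row)
    # Brute-force the best one-to-one assignment of short words to long words
    # (a Hungarian solver would do this efficiently for larger inputs).
    best_total, best_count = 0.0, 0
    for perm in permutations(range(len(long_)), len(short)):
        scores = [sims[i][j] for i, j in enumerate(perm)]
        total = sum(scores)
        if total > best_total:
            best_total = total
            best_count = sum(1 for s in scores if s > 0.0)
    # Soft Jaccard: matched similarity divided by the size of the "union".
    return best_total / (len(short) + len(long_) - best_count)


score = hybrid_jaccard(["blue"], ["light", "blue"])
```

In this sketch, one exact word match between a one-word reference and a two-word input scores 1.0 / (1 + 2 - 1).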

Usage:

Import "HybridJaccard", the main class, in your code. The class constructor takes two arguments: the paths to the reference file and the config file, respectively. The matching methods, such as "findBestMatchString", return the best match for the input string among the reference phrases, if one exists. A sample usage:

sm = HybridJaccard()

match = sm.findBestMatchString("beautiful light bluish eyes")

If the match fails, it will return the singleton value None.

Other matching calls are:

match = sm.findBestMatchStringCached("beautiful light bluish eyes")

match = sm.findBestMatchWords(["beautiful", "light", "bluish", "eyes"])

match = sm.findBestMatchWordsCached(["beautiful", "light", "bluish", "eyes"])

The "Cached" variants maintain a local cache of previously tested phrases.
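A minimal sketch of how such a cache could work, assuming a plain dictionary keyed by the input words. The `CachedMatcher` class, its method names, and `toy_match` are hypothetical stand-ins for illustration, not the package's implementation.

```python
class CachedMatcher:
    """Wraps a word-list matcher and remembers results per phrase (hypothetical)."""

    def __init__(self, match_words):
        self._match_words = match_words  # callable: list of words -> label or None
        self._cache = {}

    def find_best_match_words_cached(self, words):
        key = tuple(words)  # lists are not hashable, tuples are
        if key not in self._cache:  # membership test also caches None results
            self._cache[key] = self._match_words(list(words))
        return self._cache[key]

    def find_best_match_string_cached(self, text):
        return self.find_best_match_words_cached(text.split())


calls = []


def toy_match(words):
    # Toy matcher that records each real computation.
    calls.append(words)
    return "blue" if "blue" in words else None


matcher = CachedMatcher(toy_match)
first = matcher.find_best_match_string_cached("light blue eyes")
second = matcher.find_best_match_string_cached("light blue eyes")
```

The second call is served from the cache, so `toy_match` runs only once for the repeated phrase.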

Here is a sample configuration file ("hybrid_jaccard_config.json"):

    {
        "eyeColor": {
            "type": "hybrid_jaccard",
            "partial_method": "jaro",
            "parameters": {
                "threshold": "0.90"
            },
            "references": [
                "blue",
                "green",
                "brown",
                "hazel",
                "gray:grey"
            ]
        },
        "hairType": {
            "type": "hybrid_jaccard",
            "partial_method": "jaro",
            "parameters": {
                "threshold": "0.90"
            },
            "references": [
                "long",
                "curly",
                "blonde: blond",
                "brunette",
                "brown: chestnut",
                "black",
                "red: redhead",
                "auburn: reddish brown",
                "pink"
            ]
        }
    }

The keys of the outer dictionary select the rules used by a specific HybridJaccard instance, chosen with the "method_type" parameter at object creation. For example:

sm = HybridJaccard(method_type="eyeColor")
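To make the selection step concrete, here is a small sketch that reads a configuration of the shape shown above and picks one section by name. The `select_rules` helper is hypothetical and only illustrates the lookup; how HybridJaccard actually loads its configuration is not shown in this README.

```python
import json

CONFIG_TEXT = """
{
    "eyeColor": {
        "type": "hybrid_jaccard",
        "partial_method": "jaro",
        "parameters": {"threshold": "0.90"},
        "references": ["blue", "green", "brown", "hazel", "gray:grey"]
    }
}
"""


def select_rules(config_text, method_type):
    # Hypothetical helper: parse the JSON and pick the section for method_type.
    config = json.loads(config_text)
    rules = config[method_type]  # raises KeyError for an unknown method_type
    # The sample stores the threshold as a string, so convert it to a number.
    threshold = float(rules["parameters"]["threshold"])
    return rules["partial_method"], threshold, rules["references"]


partial_method, threshold, references = select_rules(CONFIG_TEXT, "eyeColor")
```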

The inner dictionary:

-- has a field "type", which for now is always "hybrid_jaccard",

-- has a field "partial_method", which can be "jaro" or "levenshtein",

-- has a field "threshold" (nested under "parameters" in the sample above), which determines how picky the hybrid jaccard algorithm is before doing the matching,

-- can include reference data as strings, or

-- can include the names of reference data files:

"reference_files": [ "eye_color.txt" ]

Reference data may be supplied inside the selected part of the configuration file, from files referenced in the configuration file (the reference_files list is not restricted to a single file), and in a file whose name is passed to HybridJaccard object initialization. When multiple sources are supplied, they are merged. Sample eye color references are:

    amber
    blue: azure, sapphire
    brown
    gray
    green
    hazel
    red
    violet

The first set of whitespace-separated words on a line is a reference phrase. If there is a colon, it may be followed by a comma-separated list of phrases (aliases). The aliases will be mapped to the main (left-side) phrase.
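The reference line format can be sketched as a small parser. This is an illustration of the format described above, not the package's own loader; `parse_reference_line` and `build_label_map` are hypothetical names.

```python
def parse_reference_line(line):
    # Split "main: alias1, alias2" into the main phrase and its aliases.
    main, _, alias_part = line.partition(":")
    main = " ".join(main.split())  # normalize internal whitespace
    aliases = [" ".join(a.split()) for a in alias_part.split(",")]
    return main, [a for a in aliases if a]  # drop empty entries


def build_label_map(lines):
    # Map every phrase (main or alias) to its main, left-side phrase.
    labels = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        main, aliases = parse_reference_line(line)
        for phrase in [main] + aliases:
            labels[phrase] = main
    return labels


label_map = build_label_map(["amber", "blue: azure, sapphire", "auburn: reddish brown"])
```

With this mapping, a match on "azure" or "sapphire" resolves to "blue", and the multi-word alias "reddish brown" resolves to "auburn".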

Samples:

The "samples" folder is intended to hold sample files for testing HybridJaccard. There is one sample folder at present:

hbase-dump-2015-10-01-2015-12-01-aman-hbase/

This folder contains:

2 original sample files:

    hbase-dump-2015-10-01-2015-12-01-aman-hbase-crf-hair-eyes-sample.txt
    hbase-dump-2015-10-01-2015-12-01-aman-hbase-crf-name-ethnic-sample.txt

2 intermediary files:

    hbase-dump-2015-10-01-2015-12-01-aman-hbase-crf-hair-eyes-sample.jsonl
    hbase-dump-2015-10-01-2015-12-01-aman-hbase-crf-name-ethnic-sample.jsonl

6 final sample files:

    hbase-dump-2015-10-01-2015-12-01-aman-hbase-crf-eyes-sample.jsonl
    hbase-dump-2015-10-01-2015-12-01-aman-hbase-crf-hair-sample.jsonl

    hbase-dump-2015-10-01-2015-12-01-aman-hbase-crf-B_ethnic-sample.jsonl
    hbase-dump-2015-10-01-2015-12-01-aman-hbase-crf-B_workingname-sample.jsonl
    hbase-dump-2015-10-01-2015-12-01-aman-hbase-crf-I_ethnic-sample.jsonl
    hbase-dump-2015-10-01-2015-12-01-aman-hbase-crf-I_workingname-sample.jsonl

hybrid-jaccard's People

Contributors

craigmilorogers, majidghgol, philpot, szeke

hybrid-jaccard's Issues

Duplicate Reference Data Slows Processing

hybridJaccard does not remove duplicate data as it builds its internal data structures, although it could easily do so at low cost. Duplicate input data should not cause incorrect results, but may slow down processing.
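One low-cost way to drop duplicates while keeping the original order is shown below. This is a sketch only; the helper name is hypothetical and it is not part of the package.

```python
def dedupe_preserving_order(phrases):
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(phrases))


unique_refs = dedupe_preserving_order(["blue", "green", "blue", "gray", "green"])
```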

Built-in Test Has Poor Structure and Feedback

There is some built-in self-test data in hybridJaccard.py. The file "input.txt" is read and processed line by line, but some built-in data is also processed. The input data is processed each time a line of input.txt is read, when it needn't be processed more than once. Also, the built-in data could give better feedback.

bluish is missing

The eye color reference file is missing color variants such as "bluish" and "greenish". This causes one of the test cases in "hybridJaccard.py" to give unexpected results.

Optimization: findBestMatch Recalculates Similarities

If findBestMatch is called several times with the same input, it will calculate its result each time. If this happens a lot, it could be much cheaper to cache the result of the first call for a given input and reuse it for subsequent calls.

findBestMatch Has Extra Lookups

The final line of findBestMatch:

return self.labels[self.references[similarities.index(max_sim)]]

could be more efficient if we tracked the winning index instead of just the winning value, and if labels were a parallel sequence to references instead of a hash lookup. Of course, tracking the winning index might be more expensive than looking for it at the end, so optimizing this code might require experimental justification.
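The two approaches can be compared side by side in a small sketch. The function names and the `parallel_labels` list are assumptions for illustration; only the first function mirrors the quoted line. Both return the first winner on ties, and a tool such as `timeit` could supply the experimental justification mentioned above.

```python
def best_label_by_value(similarities, references, labels):
    # Mirrors the quoted line: find the max value, search for its index,
    # then go through the references list and the labels dictionary.
    max_sim = max(similarities)
    return labels[references[similarities.index(max_sim)]]


def best_label_by_index(similarities, parallel_labels):
    # Alternative: find the winning index in one pass and read a parallel list.
    best = max(range(len(similarities)), key=similarities.__getitem__)
    return parallel_labels[best]


references = ["blue", "azure", "green"]
labels = {"blue": "blue", "azure": "blue", "green": "green"}
parallel_labels = [labels[r] for r in references]  # built once, up front
similarities = [0.2, 0.95, 0.4]

by_value = best_label_by_value(similarities, references, labels)
by_index = best_label_by_index(similarities, parallel_labels)
```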
