Giter Club home page Giter Club logo

ngramdistance-ffi's Introduction

ngramdistance-ffi

Converted to FFI by Bali for Ruby portability.

This gem originally based on ngram distance or similarity algorithm from Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005).

  • This program was ported by hand from lucene-3.0.2. (lucene is Java product)
  • It supports only ngram distance algorithm.

Tested on:

  • MRI 1.9.2

Known Issues

  • The C extension uses char* strings, and so Unicode strings will give incorrect distances.

Including in Gemfile

gem 'ngramdistance-ffi', :require => 'ngramdistance'

Original README

N-Gram version of edit distance based on paper by Grzegorz Kondrak, "N-gram similarity and distance". Proceedings of the Twelfth International Conference on String Processing and Information Retrieval (SPIRE 2005), pp. 115-126, Buenos Aires, Argentina, November 2005. http://www.cs.ualberta.ca/~kondrak/papers/spire05.pdf

This implementation uses the position-based optimization to compute partial matches of n-gram sub-strings and adds a null-character prefix of size n-1 so that the first character is contained in the same number of n-grams as a middle character. Null-character prefix matches are discounted so that strings with no matching characters will return a distance of 0.

The module has a following functions:

  • distance

require 'ngramdistance'
NGramDistance.distance("string1", "string1") # returns 1.0
NGramDistance.distance("university", "univearsitty",3) # returns 0.750000

method signature is distance(source,target,ngram=3,mode=PROPOSITIONAL)

Third argument is the ngram granularity.

  • 1 -> unigram
  • 2 -> bigram
  • 3 -> trigram

default argument is 3 (tri-gram)

Fourth argument is mode of operation. We have 3 modes defined

  • PROPOSITIONAL=0 Propositional Computes the distance between n-grams with partial matching
  • BINARY=1 Only considers 2 n-grams match when they are exactly the same
  • COMPLEX=2 Uses ngramdistance with Binary option and 1-gram match for the n-gram match.

default is PROPOSITIONAL mode.

  • token_sort_distance

require 'ngramdistance'
NGramDistance.token_sort_distance("hello bello", "bello hello") # returns 1.0

The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and then joining them back into a string. For example:

"new york mets vs atlanta braves"   →→  "atlanta braves mets new vs york" 

We then compare the transformed strings with a simple distance(). That nicely solves our ordering problem, as our helper function below indicates:

NGramDistance.token_sort_distance("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets") ⇒ 1.0

method signature is token_sort_distance(source,target,regex=' ',ngram=3,mode=PROPOSITIONAL) the arguments are same as distance function but with extra argument regex, character or string to split source and target

  • token_set_distance

require 'ngramdistance'
NGramDistance.token_sort_distance("hello", "hello hello") # returns 1.0

The token set approach is similar, but a little bit more flexible. Here, we tokenize both strings, but instead of immediately sorting and comparing, we split the tokens into two groups: intersection and remainder. We use those sets to build up a comparison string. Here is an illustrative example: s1 = "mariners vs angels" s2 = "los angeles angels of anaheim at seattle mariners"

Using the token sort method isn't that helpful, because the second (longer) string has too many extra tokens that get interleaved with the sort. We'd end up comparing:

t1 = "angels mariners vs"
t2 = "anaheim angeles angels los mariners of seattle vs"

Not very useful. Instead, the set method allows us to detect that "angels" and "mariners" are common to both strings, and separate those out (the set intersection). Now we construct and compare strings of the following form

t0 = [SORTED_INTERSECTION]
t1 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING1]
t2 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING2]

And then compare each pair.

The intuition here is that because the SORTED_INTERSECTION component is always exactly the same, the scores increase when (a) that makes up a larger percentage of the full string, and (b) the string remainders are more similar. In our example

t0 = "angels mariners"
t1 = "angels mariners vs"
t2 = "angels mariners anaheim angels"
NGramDistance.distance(t0, t1) ⇒ 0.83333
NGramDistance.distance(t0, t2) ⇒ 0.4838
NGramDistance.distance(t1, t2) ⇒ 0.54828
NGramDistance.token_set_distance(t0,t2) ⇒ 0.6521


NGramDistance.token_set_distance("hello hello world", "hello world") ⇒ 1.0

method signature is token_set_distance(source,target,regex=' ',ngram=3,mode=PROPOSITIONAL) the arguments are same as distance function but with extra argument regex, character or string to split source and target

Author

  • Copyright (c) 2012, Bali ([email protected])
  • I ported from java source code of lucene-3.0.2.

License

  • Apache 2.0 LICENSE

ngramdistance-ffi's People

Contributors

pecbali avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

ngramdistance-ffi's Issues

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.