Giter Club home page Giter Club logo

rockymadden / stringmetric Goto Github PK

View Code? Open in Web Editor NEW
485.0 41.0 82.0 2.12 MB

:dart: String metrics and phonetic algorithms for Scala (e.g. Dice/Sorensen, Hamming, Jaccard, Jaro, Jaro-Winkler, Levenshtein, Metaphone, N-Gram, NYSIIS, Overlap, Ratcliff/Obershelp, Refined NYSIIS, Refined Soundex, Soundex, Weighted Levenshtein).

Home Page: https://rockymadden.com/stringmetric/

Scala 99.68% Shell 0.32%
similarity-metric phonetic-algorithms soundex nysiis jaro-winkler metaphone levenshtein jaccard jaro overlap

stringmetric's Introduction

#stringmetric Build Status String metrics and phonetic algorithms for Scala. The library provides facilities to perform approximate string matching, measurement of string similarity/distance, indexing by word pronunciation, and sounds-like comparisons. In addition to the core library, each metric and algorithm has a command line interface.

Metrics and algorithms

Depending upon

SBT:

libraryDependencies += "com.rockymadden.stringmetric" %% "stringmetric-core" % "0.27.4"

Gradle:

compile 'com.rockymadden.stringmetric:stringmetric-core_2.10:0.27.4'

Maven:

<dependency>
	<groupId>com.rockymadden.stringmetric</groupId>
	<artifactId>stringmetric-core_2.10</artifactId>
	<version>0.27.4</version>
</dependency>

Similarity package

Useful for approximate string matching and measurement of string distance. Most metrics calculate the similarity of two strings as a double with a value between 0 and 1. A value of 0 being completely different and a value of 1 being completely similar.


Dice / Sorensen Metric:

DiceSorensenMetric(1).compare("night", "nacht") // 0.6
DiceSorensenMetric(1).compare("context", "contact") // 0.7142857142857143

Note you must specify the size of the n-gram you wish to use.


Hamming Metric:

HammingMetric.compare("toned", "roses") // 3
HammingMetric.compare("1011101", "1001001") // 2

Note the exception of integers, rather than doubles, being returned.


Jaccard Metric:

JaccardMetric(1).compare("night", "nacht") // 0.3
JaccardMetric(1).compare("context", "contact") // 0.35714285714285715

Note you must specify the size of the n-gram you wish to use.


Jaro Metric:

JaroMetric.compare("dwayne", "duane") // 0.8222222222222223
JaroMetric.compare("jones", "johnson") // 0.7904761904761904
JaroMetric.compare("fvie", "ten") // 0.0

Jaro-Winkler Metric:

JaroWinklerMetric.compare("dwayne", "duane") // 0.8400000000000001
JaroWinklerMetric.compare("jones", "johnson") // 0.8323809523809523
JaroWinklerMetric.compare("fvie", "ten") // 0.0

Levenshtein Metric:

LevenshteinMetric.compare("sitting", "kitten") // 3
LevenshteinMetric.compare("cake", "drake") // 2

Note the exception of integers, rather than doubles, being returned.


N-Gram Metric:

NGramMetric(1).compare("night", "nacht") // 0.6
NGramMetric(2).compare("night", "nacht") // 0.25
NGramMetric(2).compare("context", "contact") // 0.5

Note you must specify the size of the n-gram you wish to use.


Overlap Metric:

OverlapMetric(1).compare("night", "nacht") // 0.6
OverlapMetric(1).compare("context", "contact") // 0.7142857142857143

Note you must specify the size of the n-gram you wish to use.


Ratcliff/Obershelp Metric:

RatcliffObershelpMetric.compare("aleksander", "alexandre") // 0.7368421052631579
RatcliffObershelpMetric.compare("pennsylvania", "pencilvaneya") // 0.6666666666666666

Weighted Levenshtein Metric:

WeightedLevenshteinMetric(10, 0.1, 1).compare("book", "back") // 2
WeightedLevenshteinMetric(10, 0.1, 1).compare("hosp", "hospital") // 0.4
WeightedLevenshteinMetric(10, 0.1, 1).compare("hospital", "hosp") // 40

Note you must specify the weight of each operation. Delete, insert, and then substitute. Note that while a double is returned, it can be outside the range of 0 to 1, based upon the weights used.


Phonetic package

Useful for indexing by word pronunciation and performing sounds-like comparisons. All metrics return a boolean value indicating if the two strings sound the same, per the algorithm used. All metrics have an algorithm counterpart which provide the means to perform indexing by word pronunciation.


Metaphone Metric:

MetaphoneMetric.compare("merci", "mercy") // true
MetaphoneMetric.compare("dumb", "gum") // false

Metaphone Algorithm:

MetaphoneAlgorithm.compute("dumb") // tm
MetaphoneAlgorithm.compute("knuth") // n0

NYSIIS Metric:

NysiisMetric.compare("ham", "hum") // true
NysiisMetric.compare("dumb", "gum") // false

NYSIIS Algorithm:

NysiisAlgorithm.compute("macintosh") // mcant
NysiisAlgorithm.compute("knuth") // nnat

Refined NYSIIS Metric:

RefinedNysiisMetric.compare("ham", "hum") // true
RefinedNysiisMetric.compare("dumb", "gum") // false

Refined NYSIIS Algorithm:

RefinedNysiisAlgorithm.compute("macintosh") // mcantas
RefinedNysiisAlgorithm.compute("westerlund") // wastarlad

Refined Soundex Metric:

RefinedSoundexMetric.compare("robert", "rupert") // true
RefinedSoundexMetric.compare("robert", "rubin") // false

Refined Soundex Algorithm:

RefinedSoundexAlgorithm.compute("hairs") // h093
RefinedSoundexAlgorithm.compute("lambert") // l7081096

Soundex Metric:

SoundexMetric.compare("robert", "rupert") // true
SoundexMetric.compare("robert", "rubin") // false

Soundex Algorithm:

SoundexAlgorithm.compute("rupert") // r163
SoundexAlgorithm.compute("lukasiewicz") // l222

Convenience objects

StringAlgorithm:

StringAlgorithm.computeWithMetaphone("abcdef")
StringAlgorithm.computeWithNysiis("abcdef")

StringMetric:

StringMetric.compareWithJaccard(1)("abcdef", "abcxyz")
StringMetric.compareWithJaroWinkler("abcdef", "abcxyz")

Decorating

It is possible to decorate algorithms and metrics with additional functionality, which you can mix and match. Decorations include:

  • withMemoization: Computations and comparisons are cached. Future calls made with identical arguments will be looked up, rather than computed.

  • withTransform: Transform arguments prior to computation/comparison. A handful of pre-built transforms are located in the transform module.


Non-decorated:

MetaphoneAlgorithm.compute("abcdef")
MetaphoneMetric.compare("abcdef", "abcxyz")

Using memoization:

(MetaphoneAlgorithm withMemoization).compute("abcdef")

Using a transform so that we only examine alphabetical characters:

(MetaphoneAlgorithm withTransform filterAlpha).compute("abcdef")
(MetaphoneMetric withTransform filterAlpha).compare("abcdef", "abcxyz")

Using a functionally composed transform so that we only examine alphabetical characters, but the case will not matter:

val composedTransform = (filterAlpha andThen ignoreAlphaCase)

(MetaphoneAlgorithm withTransform composedTransform).compute("abcdef")
(MetaphoneMetric withTransform composedTransform).compare("abcdef", "abcxyz")

Making your own transform:

val myTransform: StringTransform = (ca) => ca.filter(_ == 'x')

(MetaphoneAlgorithm withTransform myTransform).compute("abcdef")
(MetaphoneMetric withTransform myTransform).compare("abcdef", "abcxyz")

Using memoization and a transform:

((MetaphoneAlgorithm withMemoization) withTransform filterAlpha).compute("abcdef")

Building the CLIs

$ git clone https://github.com/rockymadden/stringmetric.git
$ cd stringmetric
$ sbt clean package
$ ./project/build.sh
$ ./target/cli/jarometric abc xyz

Using the CLIs

Get help:

$ metaphonemetric --help
Compares two strings to determine if they are phonetically similarly, per the Metaphone algorithm.

Syntax:
  metaphonemetric [Options] string1 string2...

Options:
  -h, --help
    Outputs description, syntax, and options.

Get comparison value with metrics:

$ jarowinklermetric dog dawg
0.75

Get representation value with phonetic algorithms:

$ metaphonealgorithm dog
tk

License

The MIT License (MIT)

Copyright (c) 2013 Rocky Madden (http://rockymadden.com/)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

stringmetric's People

Contributors

rockymadden avatar totekp avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

stringmetric's Issues

Maven dependency mixup

Here's an easy one - the maven groupId / artifactId on the readme seem to be mixed up.

Via readme:
artifactId: com.rockymadden.stringmetric
groupId: stringmetric-core

Should read:
groupId: com.rockymadden.stringmetric
artifactId: stringmetric-core

edits - formatting messed up

Possible errors in metric computations

Take the following two strings:

S1: 1100000000110001100001000000000110000000000000000011000000000011
S2: 1000000000100001000001000000000100000000000000000000100000000011

S1 has 13 ones
S2 has 8 ones
S1 AND S2 has 7 ones
S1 OR S2 has 14 ones

Jaccard(S1,S2) = 7/14 = 0.5
Running stringmetric-cli:

./jaccardMetric --n=1 1100000000110001100001000000000110000000000000000011000000000011 1000000000100001000001000000000100000000000000000000100000000011
0.4609375

Dice(S1,S2) = (2 * 7) / (13 + 8) = 0.066(6)
Running stringmetric-cli:

./diceSorensenMetric --n=1 1100000000110001100001000000000110000000000000000011000000000011 1000000000100001000001000000000100000000000000000000100000000011
0.921875

Java 8 compatibility

Could you create a new release of your library with java 8 compatibility?
Thx a lot!

Baptiste

Levenshtein distance: change recursion to dynamic programming

The current implementation is too slow: it is a recursive algorithm without memoization, which means a lot of states are repeatedly computed.
Try this dynamic programming version that guarantees O(mn) performance:

  def dist(x: String, y: String): Int = {
    val d = Array.ofDim[Int](x.length + 1, y.length + 1)
    for (i <- 0 to x.length) d(i)(0) = i
    for (j <- 0 to y.length) d(0)(j) = j
    for (j <- 1 to y.length; i <- 1 to x.length) {
      if (x(i - 1) == y(j - 1)) d(i)(j) = d(i - 1)(j - 1)
      else d(i)(j) = min(d(i - 1)(j), d(i)(j - 1), d(i - 1)(j - 1)) + 1
    }
    d(x.length)(y.length)
  }

`com.rockymadden.stringmetric.transform` character ranges are exclusive

The character ranges in com.rockymadden.stringmetric.transform

    private val Ascii = NumericRange(0x00, 0x7F, 1)
    private val ExtendedAscii = NumericRange(0x00, 0x7F, 1)
    private val Latin = NumericRange(0x00, 0x24F, 1)
    private val LowerCase = NumericRange(0x61, 0x7A, 1)
    private val Numbers = NumericRange(0x30, 0x39, 1)
    private val UpperCase = NumericRange(0x41, 0x5A, 1)

do not include all the intended characters, because the apply method of Scala's NumericRange is exclusive, i.e. it does not include the upper bound of the defined range. Consequently, LowerCase does not include z, UpperCase does not include Z, Numbers does not include 9, etc.
Thus, when using withTransform with any of the predefined filters in this object, these characters will be discarded from comparison.

Cannot resolve dependency for 0.24 com.rockymadden.stringmetric#stringmetric-core_2.10;0.27.4: not found

0.27.3 works
libraryDependencies += "com.rockymadden.stringmetric" %% "stringmetric-core" % "0.27.3"

but 0.27.4 doesn't
libraryDependencies += "com.rockymadden.stringmetric" %% "stringmetric-core" % "0.27.4"

and gives the following error.

[info] Resolving com.rockymadden.stringmetric#stringmetric-core_2.10;0.27.4 ...
[warn] module not found: com.rockymadden.stringmetric#stringmetric-core_2.10;0.27.4
[warn] ==== local: tried
[warn] /Users/ssimanta/.ivy2/local/com.rockymadden.stringmetric/stringmetric-core_2.10/0.27.4/ivys/ivy.xml
[warn] ==== public: tried
[warn] http://repo1.maven.org/maven2/com/rockymadden/stringmetric/stringmetric-core_2.10/0.27.4/stringmetric-core_2.10-0.27.4.pom
[warn] ==== Typesafe Releases: tried
[warn] http://repo.typesafe.com/typesafe/releases/com/rockymadden/stringmetric/stringmetric-core_2.10/0.27.4/stringmetric-core_2.10-0.27.4.pom
[warn] ==== Typesafe Releases: tried
[warn] http://repo.typesafe.com/typesafe/releases/com/rockymadden/stringmetric/stringmetric-core_2.10/0.27.4/stringmetric-core_2.10-0.27.4.pom
[warn] ==== rediscala: tried
[warn] https://raw.github.com/etaty/rediscala-mvn/master/releases/com/rockymadden/stringmetric/stringmetric-core_2.10/0.27.4/stringmetric-core_2.10-0.27.4.pom
[warn] ==== Sonatype Snapshots: tried
[warn] https://oss.sonatype.org/content/repositories/snapshots/com/rockymadden/stringmetric/stringmetric-core_2.10/0.27.4/stringmetric-core_2.10-0.27.4.pom
[warn] ==== Sonatype Releases: tried
[warn] https://oss.sonatype.org/content/repositories/releases/com/rockymadden/stringmetric/stringmetric-core_2.10/0.27.4/stringmetric-core_2.10-0.27.4.pom
[info] Resolving com.typesafe.trace#trace-sigar-libs;0.1.1 ...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: com.rockymadden.stringmetric#stringmetric-core_2.10;0.27.4: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
sbt.ResolveException: unresolved dependency: com.rockymadden.stringmetric#stringmetric-core_2.10;0.27.4: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:115)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:115)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:103)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:48)
at sbt.IvySbt$$anon$3.call(Ivy.scala:57)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
at xsbt.boot.Using$.withResource(Using.scala:11)
at xsbt.boot.Using$.apply(Using.scala:10)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:57)
at sbt.IvySbt.withIvy(Ivy.scala:98)
at sbt.IvySbt.withIvy(Ivy.scala:94)
at sbt.IvySbt$Module.withModule(Ivy.scala:115)
at sbt.IvyActions$.update(IvyActions.scala:125)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1223)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1221)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$74.apply(Defaults.scala:1244)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$74.apply(Defaults.scala:1242)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1246)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1241)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1249)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1214)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1192)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
at sbt.std.Transform$$anon$4.work(System.scala:64)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.Execute.work(Execute.scala:244)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
error sbt.ResolveException: unresolved dependency: com.rockymadden.stringmetric#stringmetric-core_2.10;0.27.4: not found
[error] Total time: 6 s, completed Aug 3, 2014 10:27:27 PM

CLI errors for n-gram algorithms

I am receiving syntax errors "Expected valid syntax." for all CLI tools that are based off n-grams and which require the user to supply the n parameter. I think there is an error somewhere in how the opts is parsing the integer. The following are not working for me:
dicesorensenmetric
jaccardmetric
ngrammetric
overlapmetric
Commenting out the the opts and hardcoding the n value resolves the issue.

cross Compile with Scala 2.9.2

Is it possible that you could cross compile this library with Scala 2.9.2? I am stuck on Scala 2.9.2 for foreseeable future and hence on a really old version of String Metric.

I would really appreciate it if you could backport it 2.9.2

Thanks

Calculation of overlap coefficient may be incorrect

According to your tests, overlap coefficient of "context" and "contact" is 0.7142857142857143, i.e. 5/7. This means that you count character 't' twice and intersection of these words is "contt".

Now if you read definition of overlap coefficient, you will find that it's defined in terms of sets. There cannot be two identical elements in a set. There is no order is a set. You only can say that an element is in set or not. Intersection of "context" and "contact" should be "cont" then.

Moreover, if we consider arguments as sets, denominator of our ratio be 5, not 7, because the set for "contact" is "conta".

So, result should be 4/5 = 0.8

In an old issue, you gave this link http://www.planetcalc.com/1664/ as a demonstration of 'loose' intersection. Note that this calculator works just like I've described:

C, O, N, T, E, X, T and C, O, N, T, A, C, T = C, N, O, T

I'm not sure about all this stuff. Maybe you know some special rules for string to set conversion? Maybe two distinct characters should be considered as distinct elements in a set?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.