Giter Club home page Giter Club logo

rockymadden / stringmetric Goto Github PK

View Code? Open in Web Editor NEW
485.0 41.0 82.0 2.12 MB

:dart: String metrics and phonetic algorithms for Scala (e.g. Dice/Sorensen, Hamming, Jaccard, Jaro, Jaro-Winkler, Levenshtein, Metaphone, N-Gram, NYSIIS, Overlap, Ratcliff/Obershelp, Refined NYSIIS, Refined Soundex, Soundex, Weighted Levenshtein).

Home Page: https://rockymadden.com/stringmetric/

Scala 99.68% Shell 0.32%
similarity-metric phonetic-algorithms soundex nysiis jaro-winkler metaphone levenshtein jaccard jaro overlap

stringmetric's Issues

`com.rockymadden.stringmetric.transform` character ranges are exclusive

The character ranges in com.rockymadden.stringmetric.transform

    private val Ascii = NumericRange(0x00, 0x7F, 1)
    private val ExtendedAscii = NumericRange(0x00, 0x7F, 1)
    private val Latin = NumericRange(0x00, 0x24F, 1)
    private val LowerCase = NumericRange(0x61, 0x7A, 1)
    private val Numbers = NumericRange(0x30, 0x39, 1)
    private val UpperCase = NumericRange(0x41, 0x5A, 1)

do not include all the intended characters, because the apply method of Scala's NumericRange is exclusive, i.e. it does not include the upper bound of the defined range. Consequently, LowerCase does not include z, UpperCase does not include Z, Numbers does not include 9, etc.
Thus, when using withTransform with any of the predefined filters in this object, these characters will be discarded from comparison.

Cannot resolve dependency for 0.24 com.rockymadden.stringmetric#stringmetric-core_2.10;0.27.4: not found

0.27.3 works
libraryDependencies += "com.rockymadden.stringmetric" %% "stringmetric-core" % "0.27.3"

but 0.27.4 doesn't
libraryDependencies += "com.rockymadden.stringmetric" %% "stringmetric-core" % "0.27.4"

and gives the following error.

[info] Resolving com.rockymadden.stringmetric#stringmetric-core_2.10;0.27.4 ...
[warn] module not found: com.rockymadden.stringmetric#stringmetric-core_2.10;0.27.4
[warn] ==== local: tried
[warn] /Users/ssimanta/.ivy2/local/com.rockymadden.stringmetric/stringmetric-core_2.10/0.27.4/ivys/ivy.xml
[warn] ==== public: tried
[warn] http://repo1.maven.org/maven2/com/rockymadden/stringmetric/stringmetric-core_2.10/0.27.4/stringmetric-core_2.10-0.27.4.pom
[warn] ==== Typesafe Releases: tried
[warn] http://repo.typesafe.com/typesafe/releases/com/rockymadden/stringmetric/stringmetric-core_2.10/0.27.4/stringmetric-core_2.10-0.27.4.pom
[warn] ==== Typesafe Releases: tried
[warn] http://repo.typesafe.com/typesafe/releases/com/rockymadden/stringmetric/stringmetric-core_2.10/0.27.4/stringmetric-core_2.10-0.27.4.pom
[warn] ==== rediscala: tried
[warn] https://raw.github.com/etaty/rediscala-mvn/master/releases/com/rockymadden/stringmetric/stringmetric-core_2.10/0.27.4/stringmetric-core_2.10-0.27.4.pom
[warn] ==== Sonatype Snapshots: tried
[warn] https://oss.sonatype.org/content/repositories/snapshots/com/rockymadden/stringmetric/stringmetric-core_2.10/0.27.4/stringmetric-core_2.10-0.27.4.pom
[warn] ==== Sonatype Releases: tried
[warn] https://oss.sonatype.org/content/repositories/releases/com/rockymadden/stringmetric/stringmetric-core_2.10/0.27.4/stringmetric-core_2.10-0.27.4.pom
[info] Resolving com.typesafe.trace#trace-sigar-libs;0.1.1 ...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: com.rockymadden.stringmetric#stringmetric-core_2.10;0.27.4: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
sbt.ResolveException: unresolved dependency: com.rockymadden.stringmetric#stringmetric-core_2.10;0.27.4: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:115)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:115)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:103)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:48)
at sbt.IvySbt$$anon$3.call(Ivy.scala:57)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
at xsbt.boot.Using$.withResource(Using.scala:11)
at xsbt.boot.Using$.apply(Using.scala:10)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:57)
at sbt.IvySbt.withIvy(Ivy.scala:98)
at sbt.IvySbt.withIvy(Ivy.scala:94)
at sbt.IvySbt$Module.withModule(Ivy.scala:115)
at sbt.IvyActions$.update(IvyActions.scala:125)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1223)
at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1221)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$74.apply(Defaults.scala:1244)
at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$74.apply(Defaults.scala:1242)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1246)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1241)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1249)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1214)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1192)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
at sbt.std.Transform$$anon$4.work(System.scala:64)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
at sbt.Execute.work(Execute.scala:244)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
error sbt.ResolveException: unresolved dependency: com.rockymadden.stringmetric#stringmetric-core_2.10;0.27.4: not found
[error] Total time: 6 s, completed Aug 3, 2014 10:27:27 PM

cross Compile with Scala 2.9.2

Is it possible that you could cross compile this library with Scala 2.9.2? I am stuck on Scala 2.9.2 for foreseeable future and hence on a really old version of String Metric.

I would really appreciate it if you could backport it 2.9.2

Thanks

CLI errors for n-gram algorithms

I am receiving syntax errors "Expected valid syntax." for all CLI tools that are based off n-grams and which require the user to supply the n parameter. I think there is an error somewhere in how the opts is parsing the integer. The following are not working for me:
dicesorensenmetric
jaccardmetric
ngrammetric
overlapmetric
Commenting out the the opts and hardcoding the n value resolves the issue.

Java 8 compatibility

Could you create a new release of your library with java 8 compatibility?
Thx a lot!

Baptiste

Levenshtein distance: change recursion to dynamic programming

The current implementation is too slow: it is a recursive algorithm without memoization, which means a lot of states are repeatedly computed.
Try this dynamic programming version that guarantees O(mn) performance:

  def dist(x: String, y: String): Int = {
    val d = Array.ofDim[Int](x.length + 1, y.length + 1)
    for (i <- 0 to x.length) d(i)(0) = i
    for (j <- 0 to y.length) d(0)(j) = j
    for (j <- 1 to y.length; i <- 1 to x.length) {
      if (x(i - 1) == y(j - 1)) d(i)(j) = d(i - 1)(j - 1)
      else d(i)(j) = min(d(i - 1)(j), d(i)(j - 1), d(i - 1)(j - 1)) + 1
    }
    d(x.length)(y.length)
  }

Possible errors in metric computations

Take the following two strings:

S1: 1100000000110001100001000000000110000000000000000011000000000011
S2: 1000000000100001000001000000000100000000000000000000100000000011

S1 has 13 ones
S2 has 8 ones
S1 AND S2 has 7 ones
S1 OR S2 has 14 ones

Jaccard(S1,S2) = 7/14 = 0.5
Running stringmetric-cli:

./jaccardMetric --n=1 1100000000110001100001000000000110000000000000000011000000000011 1000000000100001000001000000000100000000000000000000100000000011
0.4609375

Dice(S1,S2) = (2 * 7) / (13 + 8) = 0.066(6)
Running stringmetric-cli:

./diceSorensenMetric --n=1 1100000000110001100001000000000110000000000000000011000000000011 1000000000100001000001000000000100000000000000000000100000000011
0.921875

Maven dependency mixup

Here's an easy one - the maven groupId / artifactId on the readme seem to be mixed up.

Via readme:
artifactId: com.rockymadden.stringmetric
groupId: stringmetric-core

Should read:
groupId: com.rockymadden.stringmetric
artifactId: stringmetric-core

edits - formatting messed up

Calculation of overlap coefficient may be incorrect

According to your tests, overlap coefficient of "context" and "contact" is 0.7142857142857143, i.e. 5/7. This means that you count character 't' twice and intersection of these words is "contt".

Now if you read definition of overlap coefficient, you will find that it's defined in terms of sets. There cannot be two identical elements in a set. There is no order is a set. You only can say that an element is in set or not. Intersection of "context" and "contact" should be "cont" then.

Moreover, if we consider arguments as sets, denominator of our ratio be 5, not 7, because the set for "contact" is "conta".

So, result should be 4/5 = 0.8

In an old issue, you gave this link http://www.planetcalc.com/1664/ as a demonstration of 'loose' intersection. Note that this calculator works just like I've described:

C, O, N, T, E, X, T and C, O, N, T, A, C, T = C, N, O, T

I'm not sure about all this stuff. Maybe you know some special rules for string to set conversion? Maybe two distinct characters should be considered as distinct elements in a set?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.