nak's Introduction

Nak

Nak is a Scala/Java library for machine learning and related tasks, with a focus on providing an easy-to-use API for some standard algorithms. It is formed from Breeze, Liblinear Java, and Scalabha. It is currently undergoing a major evolution, so be prepared for substantial API changes in this and probably several future versions.

We'd love to have some more contributors: if you are interested in helping out, please see the #helpwanted issues or suggest your own ideas.

What's inside

Nak currently provides implementations of k-means clustering and of supervised learning with logistic regression and support vector machines. Other models and algorithms that were formerly in breeze.learn are now in Nak.

See the Nak wiki for (some preliminary and unfortunately sparse) documentation.

The latest stable release of Nak is 1.2.1. Changes from the previous release include:

  • breeze-learn pulled into Nak
  • K-means from breeze-learn and Nak merged.
  • Added locality sensitive hashing

See the CHANGELOG for changes in previous versions.

Using Nak

In SBT:

libraryDependencies += "org.scalanlp" % "nak" % "1.2.1"
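
If you want to depend on a snapshot build rather than a release (version numbers ending in -SNAPSHOT), you will most likely also need the Sonatype snapshots resolver; a minimal sketch of the extra sbt line, assuming the snapshot is actually published there:

// Only needed for -SNAPSHOT versions; releases resolve from Maven Central.
resolvers += "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/"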

In Maven:

<dependency>
   <groupId>org.scalanlp</groupId>
   <artifactId>nak</artifactId>
   <version>1.2.1</version>
</dependency>

Example

Here's an example of how easy it is to train and evaluate a text classifier using Nak. See TwentyNewsGroups.scala for more details.

// Imports needed by this snippet; the helpers used below (fromLabeledDirs,
// trainClassifier, maxLabel) come from nak.NakContext.
import java.io.File
import nak.NakContext._
import nak.core._
import nak.data._
import nak.liblinear.LiblinearConfig
import nak.util.ConfusionMatrix

// Enclosing object added so this snippet compiles on its own (full example: TwentyNewsGroups.scala).
object TwentyNewsGroupsExample {

  def main(args: Array[String]) {
    // The path to the 20 Newsgroups data is the first command-line argument.
    val newsgroupsDir = new File(args(0))
    implicit val isoCodec = scala.io.Codec("ISO-8859-1")
    val stopwords = Set("the","a","an","of","in","for","by","on")

    // Train a liblinear classifier on the training split, using a simple
    // bag-of-words featurizer that drops the stopwords above.
    val trainDir = new File(newsgroupsDir, "20news-bydate-train")
    val trainingExamples = fromLabeledDirs(trainDir).toList
    val config = LiblinearConfig(cost=5.0)
    val featurizer = new BowFeaturizer(stopwords)
    val classifier = trainClassifier(config, featurizer, trainingExamples)

    // Evaluate on the held-out test split and print a confusion matrix.
    val evalDir = new File(newsgroupsDir, "20news-bydate-test")
    val maxLabelNews = maxLabel(classifier.labels) _
    val comparisons = for (ex <- fromLabeledDirs(evalDir).toList) yield
      (ex.label, maxLabelNews(classifier.evalRaw(ex.features)), ex.features)
    val (goldLabels, predictions, inputs) = comparisons.unzip3
    println(ConfusionMatrix(goldLabels, predictions, inputs))
  }
}

Questions or suggestions?

Post a message to the scalanlp-discuss mailing list or create an issue.

nak's People

Contributors

brollb, dlwh, eponvert, gabeos, jasonbaldridge, jfrazee, mlehman, mortimerp9, muuki88, reactormonk, rtreffer, treadstone90

nak's Issues

Support classifier serialization

It's currently possible to serialize a liblinear model, but not a classifier that wraps around it (with its associated featurizer, etc.).

Cut Nak 1.2.1

We've battle-tested v1.2.1-SNAPSHOT reasonably well. Can we cut a version before incorporating new changes? Cheers 🍻

org.scalanlp#nak;1.2.1-SNAPSHOT dependency unresolved

[info] Resolving org.scalanlp#nak;1.2.1-SNAPSHOT ...
[warn] module not found: org.scalanlp#nak;1.2.1-SNAPSHOT
[warn] ==== local: tried
[warn] /Users/scotthendrickson/.ivy2/local/org.scalanlp/nak/1.2.1-SNAPSHOT/ivys/ivy.xml
[warn] ==== opennlp sourceforge repo: tried
[warn] http://opennlp.sourceforge.net/maven2/org/scalanlp/nak/1.2.1-SNAPSHOT/nak-1.2.1-SNAPSHOT.pom
[warn] ==== Sonatype Snapshots: tried
[warn] https://oss.sonatype.org/content/repositories/snapshots/org/scalanlp/nak/1.2.1-SNAPSHOT/nak-1.2.1-SNAPSHOT.pom
[warn] ==== Sonatype Releases: tried
[warn] https://oss.sonatype.org/content/repositories/releases/org/scalanlp/nak/1.2.1-SNAPSHOT/nak-1.2.1-SNAPSHOT.pom
[warn] ==== public: tried
[warn] http://repo1.maven.org/maven2/org/scalanlp/nak/1.2.1-SNAPSHOT/nak-1.2.1-SNAPSHOT.pom
[info] Resolving gov.nist.math#jama;1.0.2 ...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.scalanlp#nak;1.2.1-SNAPSHOT: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
sbt.ResolveException: unresolved dependency: org.scalanlp#nak;1.2.1-SNAPSHOT: not found

Fail travis-ci build

[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.scalanlp#breeze-config_2.10;0.9.1: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
sbt.ResolveException: unresolved dependency: org.scalanlp#breeze-config_2.10;0.9.1: not found
    at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:217)
    at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:126)
    at sbt.IvyActions$$anonfun$update$1.apply(IvyActions.scala:125)
    at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:115)
    at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:115)
    at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:103)
    at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:48)
    at sbt.IvySbt$$anon$3.call(Ivy.scala:57)
    at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:98)
    at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:81)
    at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:102)
    at xsbt.boot.Using$.withResource(Using.scala:11)
    at xsbt.boot.Using$.apply(Using.scala:10)
    at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:62)
    at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:52)
    at xsbt.boot.Locks$.apply0(Locks.scala:31)
    at xsbt.boot.Locks$.apply(Locks.scala:28)
    at sbt.IvySbt.withDefaultLogger(Ivy.scala:57)
    at sbt.IvySbt.withIvy(Ivy.scala:98)
    at sbt.IvySbt.withIvy(Ivy.scala:94)
    at sbt.IvySbt$Module.withModule(Ivy.scala:115)
    at sbt.IvyActions$.update(IvyActions.scala:125)
    at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1223)
    at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1221)
    at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$74.apply(Defaults.scala:1244)
    at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$74.apply(Defaults.scala:1242)
    at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:35)
    at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1246)
    at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1241)
    at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:45)
    at sbt.Classpaths$.cachedUpdate(Defaults.scala:1249)
    at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1214)
    at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1192)
    at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
    at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:42)
    at sbt.std.Transform$$anon$4.work(System.scala:64)
    at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
    at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:237)
    at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:18)
    at sbt.Execute.work(Execute.scala:244)
    at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
    at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:237)
    at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:160)
    at sbt.CompletionService$$anon$2.call(CompletionService.scala:30)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
[error] (*:update) sbt.ResolveException: unresolved dependency: org.scalanlp#breeze-config_2.10;0.9.1: not found

nak.classify.LinearClassifier serialization

Hi,

Can somebody please tell me whether serializing nak.classify._ objects with DataSerialization is even possible?

Also, what is the state of the nak.classify._ namespace? It looks like there's a move toward using liblinear. Is the stuff pulled in from breeze-learn deprecated?

Thanks,
Richard

Documentation for Classifiers other than Liblinear?

Hi,

I am a graduate student very new to NLP. I really enjoy the convenience of Scala, so I decided to try out Breeze (and it turns out its NLP parts have been spun out). I need to use classifiers, so I picked up Nak :)

The documentation is good and the APIs are great. However, after spending three hours writing code and reading the documentation (plus the two examples), I can't help but notice that both examples are for the LibLinear classifier. I really want to try out NaiveBayes, Maximum Entropy (the logistic classifier), and Perceptron (well, as a student I need to do my assignments, right? :). NakContext has many utility functions like "trainClassifier" or "trainModel" that help with building classifiers, but they are, again, for Liblinear.

So what should I do if I want to use the three classifiers I mentioned above? I tried to just instantiate them directly, but of course I can't pass unlabeled, unfeaturized raw examples in. Is there a way to use them?

I also posted this message on the Google Group (at least the first half is the same):

https://groups.google.com/forum/?fromgroups#!topic/scalanlp-discuss/KeFSy841NIE

Caught an AssertionError doing K-means clustering

(David: I think clusters are getting emptied out sometimes--which I guess can happen. We need to decide what to do when that happens. Either just drop the cluster, or split another cluster in half.)
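A minimal sketch of the "just drop the cluster" option above, assuming centroids and point assignments live in plain Scala collections (illustrative only, not the actual Kmeans.scala code; dropEmptyClusters is a hypothetical helper):

    // Keep only centroids that still own at least one point after assignment.
    // `assignments(i)` is the index of the centroid nearest to point i.
    def dropEmptyClusters(
        centroids: IndexedSeq[Vector[Double]],
        assignments: Seq[Int]): IndexedSeq[Vector[Double]] = {
      val nonEmpty = assignments.toSet
      centroids.zipWithIndex.collect { case (c, i) if nonEmpty(i) => c }
    }
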

While performing K-means clustering, I got the following AssertionError:

Exception in thread "main" java.lang.AssertionError: assertion failed
    at scala.Predef$.assert(Predef.scala:165)
    at nak.cluster.Kmeans$$anonfun$5.apply(Kmeans.scala:112)
    at nak.cluster.Kmeans$$anonfun$5.apply(Kmeans.scala:108)
        ...

The assertion is at Kmeans.scala:112, but it has to do with not finding a min in a list of doubles at Kmeans.scala:111. I suspect this happens because double-precision numbers sometimes fail an == comparison due to rounding error.
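
As a small illustration of that suspicion: two mathematically equivalent distance computations can differ by a rounding error, so locating the minimum with an exact == comparison may find nothing, whereas a tolerance-based check does not have that problem (a generic sketch, not the Kmeans.scala code):

    // The same "distance" computed with a different grouping of operations:
    val a = (0.1 + 0.2) + 0.3   // 0.6000000000000001
    val b = 0.1 + (0.2 + 0.3)   // 0.6
    println(a == b)             // false, despite being mathematically equal
    // A tolerance-based comparison avoids the exact-equality trap:
    def approxEqual(x: Double, y: Double, eps: Double = 1e-12): Boolean =
      math.abs(x - y) <= eps
    println(approxEqual(a, b))  // true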

Bug in expanding cluster?

Hi, in GDBScan, the comment on the last line of the expand method says "if the neighbour point is a cluster"... but in the if clause the point is evaluated against the current neighbourhood, not against the new neighbours.

    neighbours.foldLeft(neighbours) {
      case (neighbourhood, neighbour @ Point(row)) =>
        // if not visited yet, create a new neighbourhood
        val newNeighbours = if (!(visited contains neighbour)) {
          visited add neighbour
          getNeighbours(neighbour, points.filterNot(_.row == neighbour.row))
        } else {
          Seq.empty
        }
        // Add to cluster if neighbour point isn't assigned to a  cluster yet
        if (!(clustered contains neighbour)) {
          cluster add neighbour
          clustered add neighbour
        }
        // if the neighbour point is a cluster, join the neighbourhood
        if (isCorePoint(neighbour, neighbourhood)) neighbourhood ++ newNeighbours else neighbourhood
    }

I think the last line should be:

        // if the neighbour point is a cluster, join the neighbourhood
        if (isCorePoint(neighbour, newNeighbours)) neighbourhood ++ newNeighbours else neighbourhood

If you agree I can do a PR.

Improve k-means code. #helpwanted

The current k-means implementation is something I did for homework assignments for teaching NLP courses at UT Austin. It can handle a fair amount, but it runs out of steam (in particular, memory) for larger datasets, especially if they have a lot of features. It currently uses dense vectors to represent the features for each data point, so it should be a fairly straightforward win to change this to use sparse vectors instead.
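
To make the dense-versus-sparse point concrete, here is a minimal Breeze sketch (illustrative only; the actual change would be inside Nak's k-means code, and the vocabulary size below is made up):

    import breeze.linalg.{DenseVector, SparseVector}

    val numFeatures = 1000000  // e.g. a large bag-of-words vocabulary

    // Dense: allocates numFeatures doubles per data point, even if nearly all are zero.
    val dense = DenseVector.zeros[Double](numFeatures)
    dense(42) = 1.0

    // Sparse: stores only the non-zero indices and values.
    val sparse = SparseVector.zeros[Double](numFeatures)
    sparse(42) = 1.0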
