
open-nlp's Introduction


### About

This library provides high-level Ruby bindings to Apache OpenNLP, a Java machine learning toolkit for natural language processing (NLP). The gem is compatible with Ruby 1.9.2 and 1.9.3 as well as JRuby 1.7.1, and is tested on both Java 6 and Java 7.

### Installing

First, install the gem: gem install open-nlp. Then, download the JARs and English language models in one package (80 MB).

Place the contents of the extracted archive inside the /bin/ folder of the open-nlp gem (e.g. [...]/gems/open-nlp-0.x.x/bin/).

Alternatively, from a terminal window, cd to the gem's folder and run:

wget http://www.louismullie.com/treat/open-nlp-english.zip
unzip -o open-nlp-english.zip -d bin/

Afterwards, you may download the models for other languages individually from the OpenNLP website.

### Configuring

After installing and requiring the gem (require 'open-nlp'), you may want to set some of the following configuration options.

# Set an alternative path to look for the JAR files.
# Default is gem's bin folder.
OpenNLP.jar_path = '/path_to_jars/'

# Set an alternative path to look for the model files.
# Default is gem's bin folder.
OpenNLP.model_path = '/path_to_models/'

# Pass some alternative arguments to the Java VM.
# Default is ['-Xms512M', '-Xmx1024M'].
OpenNLP.jvm_args = ['-option1', '-option2']

# Redirect VM output to log.txt
OpenNLP.log_file = 'log.txt'

# Set default models for a language.
OpenNLP.use :language

### Examples

Simple tokenizer

OpenNLP.load

sent = "The death of the poet was kept from his poems."
tokenizer = OpenNLP::SimpleTokenizer.new

tokens = tokenizer.tokenize(sent).to_a
# => %w[The death of the poet was kept from his poems .]

Maximum entropy tokenizer, chunker and POS tagger

OpenNLP.load

chunker   = OpenNLP::ChunkerME.new
tokenizer = OpenNLP::TokenizerME.new
tagger    = OpenNLP::POSTaggerME.new

sent   = "The death of the poet was kept from his poems."

tokens = tokenizer.tokenize(sent).to_a
# => %w[The death of the poet was kept from his poems .]

tags   = tagger.tag(tokens).to_a
# => %w[DT NN IN DT NN VBD VBN IN PRP$ NNS .]

chunks = chunker.chunk(tokens, tags).to_a
# => %w[B-NP I-NP B-PP B-NP I-NP B-VP I-VP B-PP B-NP I-NP O]
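
The chunker output above uses BIO tags: `B-XX` opens a phrase of type `XX`, `I-XX` continues it, and `O` marks a token outside any phrase. As a minimal sketch in plain Ruby (the `group_chunks` helper is hypothetical and not part of the gem), the tokens and chunk tags can be folded into phrases like this:

```ruby
# Hypothetical helper, not part of the open-nlp gem.
# Groups tokens into phrases according to their BIO chunk tags.
def group_chunks(tokens, chunks)
  phrases = []
  tokens.zip(chunks).each do |token, tag|
    prefix, type = tag.split('-', 2) # "B-NP" => ["B", "NP"], "O" => ["O"]
    if prefix == 'I' && phrases.any? && phrases.last[:type] == type
      # "I-XX" continues the phrase that is currently open.
      phrases.last[:tokens] << token
    else
      # "B-XX" opens a new phrase; "O" tokens get a nil type.
      phrases << { type: type, tokens: [token] }
    end
  end
  phrases
end

# Arrays copied from the example output above.
tokens = %w[The death of the poet was kept from his poems .]
chunks = %w[B-NP I-NP B-PP B-NP I-NP B-VP I-VP B-PP B-NP I-NP O]

group_chunks(tokens, chunks).each do |phrase|
  puts "#{phrase[:type] || 'O'}: #{phrase[:tokens].join(' ')}"
end
```

The helper only needs plain Ruby arrays, which is why the examples call `.to_a` on the results before further processing.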

Abstract Bottom-Up Parser

OpenNLP.load

sent   = "The death of the poet was kept from his poems."
parser = OpenNLP::Parser.new
parse  = parser.parse(sent)

parse.get_text.should eql sent

parse.get_span.get_start.should eql 0
parse.get_span.get_end.should eql 46
parse.get_child_count.should eql 1

child = parse.get_children[0]

child.get_text # => "The death of the poet was kept from his poems."
child.get_child_count # => 3
child.get_head_index # => 5
child.get_type # => "S"

Maximum Entropy Name Finder

OpenNLP.load

# Use gsub rather than gsub!, which returns nil when nothing is replaced.
text = File.read('./spec/sample.txt').gsub("\n", "")

tokenizer   = OpenNLP::TokenizerME.new
segmenter   = OpenNLP::SentenceDetectorME.new
ner_models  = ['person', 'time', 'money']

ner_finders = ner_models.map do |model|
  OpenNLP::NameFinderME.new("en-ner-#{model}.bin")
end

sentences = segmenter.sent_detect(text)
named_entities = []

sentences.each do |sentence|
  tokens = tokenizer.tokenize(sentence)

  ner_models.each_with_index do |model, i|
    finder     = ner_finders[i]
    name_spans = finder.find(tokens)
    name_probs = finder.probs
    name_spans.each_with_index do |name_span, j|
      start = name_span.get_start
      stop  = name_span.get_end - 1 # the span's end index is exclusive
      slice = tokens[start..stop].to_a
      prob  = name_probs[j]
      named_entities << [slice, model, prob]
    end
  end
end
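
The loop above subtracts one from the span's end because OpenNLP spans are half-open: the end index points one past the last token. The sketch below illustrates the same slicing with Ruby's exclusive range operator. `FakeSpan` is a hypothetical stand-in for the Java `Span` object, and the tokens and spans are invented sample data, so it runs without the gem or a JVM:

```ruby
# Hypothetical stand-in for opennlp.tools.util.Span, for illustration only.
FakeSpan = Struct.new(:get_start, :get_end)

# Invented sample data; a real run would take these from the tokenizer and name finders.
tokens = %w[Yesterday John Smith paid $ 20 .]
spans  = {
  'person' => [FakeSpan.new(1, 3)],
  'money'  => [FakeSpan.new(4, 6)]
}

entities = spans.flat_map do |model, model_spans|
  model_spans.map do |span|
    # start...end (three dots) excludes the end index, so no "- 1" is needed.
    [tokens[span.get_start...span.get_end].join(' '), model]
  end
end

p entities
```

Whether to write `start..(stop - 1)` or `start...stop` is a matter of taste; both select the same tokens.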

Loading specific models

Just pass the name of the model file to the constructor. The gem will search for the file in the OpenNLP.model_path folder.

OpenNLP.load

tokenizer = OpenNLP::TokenizerME.new('en-token.bin')
tagger = OpenNLP::POSTaggerME.new('en-pos-perceptron.bin')
name_finder = OpenNLP::NameFinderME.new('en-ner-person.bin')
# etc.
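
Because the gem looks up model files by name in the model folder, a missing file only surfaces when a constructor runs (typically as a Java FileNotFoundException). A minimal pre-flight check in plain Ruby might look like the following; `missing_models` is a hypothetical helper, and the `bin` path is an assumption to be replaced with the folder you configured:

```ruby
# Hypothetical helper, not part of the open-nlp gem.
# Returns the model file names that are absent from model_path.
def missing_models(model_path, names)
  names.reject { |name| File.file?(File.join(model_path, name)) }
end

# Assumed location; substitute the folder set via OpenNLP.model_path.
model_path = File.expand_path('bin', Dir.pwd)
wanted     = %w[en-token.bin en-pos-perceptron.bin en-ner-person.bin]

missing = missing_models(model_path, wanted)
warn "Missing models in #{model_path}: #{missing.join(', ')}" unless missing.empty?
```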

Loading specific classes

You may want to load specific classes from the OpenNLP library that are not loaded by default. The gem provides an API to do this:

# Default base class is opennlp.tools.
OpenNLP.load_class('SomeClassName')  
# => OpenNLP::SomeClassName

# Here, we specify another base class.
OpenNLP.load_class('SomeOtherClass', 'opennlp.tools.namefind')
# => OpenNLP::SomeOtherClass

### Contributing

Fork the project and send me a pull request! Config updates for other languages are welcome.

open-nlp's People

Contributors

amiryal, louismullie, pgwillia

open-nlp's Issues

MRI 1.9.3

Disregard: problem solved by adding .jar files.

/../bin/en-sent.bin (No such file or directory) (FileNotFoundException)

I have en-sent.bin in the same folder as my ruby code that I am running, but it does not seem to see it. I get this error.

/usr/local/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/open-nlp-0.1.4/lib/open-nlp/bindings.rb:126:in `new': /usr/local/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/open-nlp-0.1.4/lib/open-nlp/../../bin/en-sent.bin (No such file or directory) (FileNotFoundException)
from /usr/local/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/open-nlp-0.1.4/lib/open-nlp/bindings.rb:126:in `load_model'
from /usr/local/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/open-nlp-0.1.4/lib/open-nlp/bindings.rb:110:in `get_model'
from /usr/local/opt/rbenv/versions/1.9.3-p392/lib/ruby/gems/1.9.1/gems/open-nlp-0.1.4/lib/open-nlp/base.rb:13:in `initialize'

Chunker and Parser specs fail on JRuby

The chunker and parser specs are not working due to a problem with casting Ruby objects back to Java objects before passing them back to Java.

In the parser spec:

sent      = "The death of the poet was kept from his poems."
tokenizer = OpenNLP::TokenizerME.new
p_model   = OpenNLP.load_model(:parser)
parser    = OpenNLP::ParserFactory.create(p_model)
tokens = tokenizer.tokenize(sent)

result = parser.parse(tokens.to_java(:String))

The last line throws:

Cannot convert instance of class org.jruby.java.proxies.ArrayJavaProxy to class java.lang.String
org/jruby/java/addons/KernelJavaAddons.java:70:in `to_java'
/ruby/gems/open-nlp/spec/english_spec.rb in `(root)'

Similarly, in the chunker spec:

sent   = "The death of the poet was kept from his poems."
tokens = tokenizer.tokenize(sent)
tags   = tagger.tag(tokens)

chunks = chunker.chunk(tokens.to_java(:String), tags.to_java(:String))

The last line throws the same error.

Span with Sentence detection

Hi, firstly, thanks for the hard work on the wrapper!

Just one quick question. What is the syntax to do this...

Span sentences[] = sentenceDetector.sentDetect(" First sentence. Second sentence. ");

I need to input a sentence and get the sentence boundaries points as an index of start and stop locations.

I also need the sentences...

String sentences[] = sentenceDetector.sentDetect(" First sentence. Second sentence. ");

Looking through your examples, I was not too sure about sentence detection in this manner. Sorry if this is a basic question; I'm new to this. Your examples may even have shown it, but I might be misunderstanding them.

Thank you!
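
In the OpenNLP Java API, `sentDetect` returns the sentences as strings, while `sentPosDetect` returns the corresponding `Span` objects; whether the gem exposes the latter under a Ruby name is not confirmed by the README. As a fallback sketch that needs neither, the offsets can be recovered in plain Ruby from the detected sentences. `sentence_spans` is a hypothetical helper, and the `sentences` array stands in for what `segmenter.sent_detect(text).to_a` would return, assuming the sentences come back trimmed:

```ruby
# Hypothetical helper, not part of the open-nlp gem.
# Finds each sentence in the original text and returns [start, stop) character offsets.
def sentence_spans(text, sentences)
  cursor = 0
  sentences.map do |sentence|
    start = text.index(sentence, cursor)
    raise ArgumentError, "sentence not found: #{sentence.inspect}" if start.nil?

    # Continue searching after this sentence so repeated sentences map to distinct spans.
    cursor = start + sentence.length
    [start, cursor]
  end
end

text      = ' First sentence. Second sentence. '
# Stand-in for segmenter.sent_detect(text).to_a (assumed to be whitespace-trimmed).
sentences = ['First sentence.', 'Second sentence.']

p sentence_spans(text, sentences)
# => [[1, 16], [17, 33]]
```

Each pair follows the same half-open convention as an OpenNLP span, so `text[start...stop]` gives back the sentence.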

Running on Windows 64-bit

Thank you for making this wrapper gem. One thing I would like to mention here: if you plan to use this on a Windows 7 (64-bit) machine, change JAVA_HOME to a 32-bit JDK path before calling OpenNLP.load. Otherwise, RJB will fail to load the JVM.
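
The workaround above could be sketched as follows. `use_jdk` is a hypothetical helper and the JDK path is an invented example, so adjust it to wherever a 32-bit JDK is actually installed:

```ruby
# Hypothetical helper, not part of the open-nlp gem.
# Points JAVA_HOME at the given JDK so that it is set before the JVM is loaded.
def use_jdk(path)
  raise ArgumentError, "JDK not found at #{path}" unless Dir.exist?(path)

  ENV['JAVA_HOME'] = path
end

# Invented example path; replace with your own 32-bit JDK location.
jdk32 = 'C:/Program Files (x86)/Java/jdk1.7.0'
use_jdk(jdk32) if Dir.exist?(jdk32)

# Only afterwards load the gem and start the JVM:
# require 'open-nlp'
# OpenNLP.load
```

The important part is the ordering: the environment variable has to be in place before `OpenNLP.load` runs.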

Broken binary files link

Hi. Does anyone have a working link to the OpenNLP dependency files? The link in the README is broken.

Cannot install with 'gem install open-nlp'

I cannot install this library.

I get this error:

$ gem install open-nlp
ERROR:  Error installing open-nlp:
    invalid gem: package is corrupt, exception while verifying: undefined method `size' for nil:NilClass (NoMethodError) in /Users/sapi_mabur/.rvm/gems/ruby-2.1.5/cache/open-nlp-0.1.5.gem

Could you help me solve this problem?

Thanks

java.lang.NullPointerException in classes.rb line 13

I am excited to have this code. Thanks for supplying it!

I am trying to run the sample code provided by this gem. I have it running under JRuby, but am trying to run it under Ruby 1.9.3p448 and JDK 1.7.0-40 x586 using louismullie/open-nlp 0.1.4 and rjb 1.4.8.

The open-nlp gem fails with java.lang.NullPointerException and the Ruby error `method_missing': unknown exception (NullPointerException). The calling line in the sample file is:

tags = tagger.tag(tokens).to_a  # 

The error is thrown here:

class OpenNLP::POSTaggerME < OpenNLP::Base

  unless RUBY_PLATFORM =~ /java/
    def tag(*args)
      OpenNLP::Bindings::Utils.tagWithArrayList(@proxy_inst, args[0])  # <== Line 13
    end
  end

end

args[0] is the sentence being processed. @proxy_inst is opennlp.tools.postag.POSTaggerME@1b5080a.

I have the bin/Jar/java files in the gem's bin directory. If they are not there, they get flagged as missing. Can we solve this? Thanks again...

Java 7

Firstly, thanks for writing a JRuby wrapper for OpenNLP!

I'm very new to this, but need Java 7. Do you have any plans to update the code to work with Java 7?

thanks
