punkt-segmenter's Introduction

Punkt sentence tokenizer

This code is a Ruby 1.9.x port of the Punkt sentence tokenizer algorithm implemented by the NLTK Project (http://www.nltk.org/). Punkt is a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that many of the ambiguities in determining sentence boundaries can be eliminated once abbreviations have been identified.
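
To make that assumption concrete, here is a toy sketch. It is not the Punkt algorithm, which learns abbreviations from the text without supervision; the naive_split helper and its hand-supplied abbreviation list are hypothetical and exist only for this illustration. It shows how knowing that "min." is an abbreviation removes a false sentence boundary.

```ruby
# Toy illustration only: NOT the Punkt algorithm. The abbreviation list is
# supplied by hand here (an assumption made for the example), whereas Punkt
# derives it from the text.
def naive_split(text, abbreviations = [])
  sentences = []
  current   = []
  text.split(/\s+/).each do |token|
    current << token
    # Only tokens ending in sentence punctuation are boundary candidates.
    next unless token.end_with?(".", "?", "!")
    # A known abbreviation such as "min." does not close the sentence.
    next if abbreviations.include?(token.downcase.chomp("."))
    sentences << current.join(" ")
    current = []
  end
  # Flush whatever is left (for example, a text ending in an abbreviation).
  sentences << current.join(" ") unless current.empty?
  sentences
end

sample = "Colloquially, a min. may be longer. The symbol is min."
naive_split(sample)           # splits wrongly after the first "min."
naive_split(sample, ["min"])  # keeps "a min. may be longer." together
```

Without the abbreviation list the sample is cut into three pieces; with it, the two original sentences are recovered.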

The full description of the algorithm is presented in the following academic paper:

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection.
Computational Linguistics 32: 485-525.

Here are the credits for the original implementation:

I simply did the Ruby port and made some API changes.

Install

gem install punkt-segmenter

Currently, this gem runs only on Ruby 1.9.x (because of the unicode_utils dependency).

How to use

Let's suppose we have the following text:

"A minute is a unit of measurement of time or of angle. The minute is a unit of time equal to 1/60th of an hour or 60 seconds by 1. In the UTC time scale, a minute occasionally has 59 or 61 seconds; see leap second. The minute is not an SI unit; however, it is accepted for use with SI units. The symbol for minute or minutes is min. The fact that an hour contains 60 minutes is probably due to influences from the Babylonians, who used a base-60 or sexagesimal counting system. Colloquially, a min. may also refer to an indefinite amount of time substantially longer than the standardized length." (source: http://en.wikipedia.org/wiki/Minute)

You can split it into sentences using a Punkt::SentenceTokenizer object:

tokenizer = Punkt::SentenceTokenizer.new(text)
result    = tokenizer.sentences_from_text(text, :output => :sentences_text)

The result will be:

result    = [
    [0] "A minute is a unit of measurement of time or of angle.",
    [1] "The minute is a unit of time equal to 1/60th of an hour or 60 seconds by 1.",
    [2] "In the UTC time scale, a minute occasionally has 59 or 61 seconds; see leap second.",
    [3] "The minute is not an SI unit; however, it is accepted for use with SI units.",
    [4] "The symbol for minute or minutes is min.",
    [5] "The fact that an hour contains 60 minutes is probably due to influences from the Babylonians, who used a base-60 or sexagesimal counting system.",
    [6] "Colloquially, a min. may also refer to an indefinite amount of time substantially longer than the standardized length."
]

The algorithm uses the text passed as a parameter both to train itself and to split into sentences. Sometimes the input text is too small to produce a well-trained model, which may cause mistakes in sentence splitting. In these cases you can train the Punkt segmenter separately:

trainer = Punkt::Trainer.new()
trainer.train(training_text)

tokenizer = Punkt::SentenceTokenizer.new(trainer.parameters)
result    = tokenizer.sentences_from_text(text, :output => :sentences_text)

In this case, instead of passing the text to SentenceTokenizer, you pass the trainer parameters.

A recommended use of the trainer object is to train it on a large corpus in a specific language and then marshal the object to a file. You can then load the already trained tokenizer from that file. You can even add more texts to the training set whenever you want.
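
The marshalling step could look roughly like the sketch below. The save_trained and load_trained helpers and the file name are hypothetical (they are not part of the gem); they only wrap Ruby's standard Marshal module, and the sketch assumes the trainer object can be marshalled, as the paragraph above suggests.

```ruby
# Hypothetical helpers, not part of punkt-segmenter: they persist any
# marshalable Ruby object using the standard Marshal module.
def save_trained(object, path)
  # Binary mode keeps the marshalled bytes intact on every platform.
  File.open(path, "wb") { |file| Marshal.dump(object, file) }
end

def load_trained(path)
  File.open(path, "rb") { |file| Marshal.load(file) }
end

# Intended use with the API shown earlier (left as comments because it needs
# the gem and a training corpus; "punkt_en.dump" is an assumed file name):
#
#   trainer = Punkt::Trainer.new
#   trainer.train(training_text)
#   save_trained(trainer, "punkt_en.dump")
#
#   trainer   = load_trained("punkt_en.dump")
#   tokenizer = Punkt::SentenceTokenizer.new(trainer.parameters)
```

Only load marshalled files that you created yourself, since Marshal.load can instantiate arbitrary objects.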

The available options for the sentences_from_text method are:

  • array of sentence indexes (default)
  • array of sentence strings (:output => :sentences_text)
  • array of tokenized sentences (:output => :tokenized_sentences)
  • realigned boundaries (:realign_boundaries => true): use this if you want to realign sentences that end with, for example, parentheses, quotes, or brackets
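
As a rough sketch of how the default index output might be consumed: the example below assumes that each entry is a [start, end] pair of character offsets with an inclusive end. That shape is an assumption, not something stated above, so check it against the unit tests. The slice_sentences helper is hypothetical, and the index pairs are written by hand rather than produced by the tokenizer.

```ruby
# Hypothetical helper, not part of the gem. ASSUMPTION: each entry is a
# [start, end] pair of character offsets, with the end offset inclusive.
def slice_sentences(text, indexes)
  indexes.map { |first, last| text[first..last] }
end

text    = "First one. Second one."
indexes = [[0, 9], [11, 21]] # hand-written pairs standing in for tokenizer output

slice_sentences(text, indexes)
```

Keeping the indexes instead of the strings is useful when you need to map sentences back to positions in the original text, for example to highlight them.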

If you already have a list of tokens, you can use the sentences_from_tokens method, which takes only the list of tokens as its parameter.

Check the unit tests for more detailed examples in English and Portuguese.


This code follows the terms and conditions of the Apache License v2 (http://www.apache.org/licenses/LICENSE-2.0).

Copyright (C) Luis Cipriani

punkt-segmenter's Issues

Math::DomainError

The following expression raises a Math::DomainError (Math.log(-Infinity)):

Punkt::SentenceTokenizer.new("08. 94 01. 95")

RubyNLP

Dear Luis,

I've recently added your project to our RubyNLP list: https://github.com/arbox/nlp-with-ruby

I wonder if you want to participate in the Ruby for NLP network. You could do this in a very simple step by adding the rubynlp topic to your GitHub repository.

Thank you for the project!

Increase code coverage

Prerequisite for version 1.0.0: cover at least all the algorithm parts ported from Python.
