Detector Morse

Detector Morse is a program for sentence boundary detection (henceforth, SBD), also known as sentence segmentation. Consider the following sentence, from the Wall St. Journal portion of the Penn Treebank:

Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain
steady at about 1,200 cars in 1990.

This sentence contains 4 periods, but only the last denotes a sentence boundary. The first one in U.S. is unambiguously part of an acronym, not a sentence boundary; the same is true of expressions like $12.53. But the periods at the end of Inc. and U.S. could easily denote a sentence boundary. Humans use the local context to determine that neither period denotes a sentence boundary (e.g., the selectional properties of the verb expect are not met if there is a sentence boundary immediately after U.S.). Detector Morse uses artisanal, handcrafted contextual features and low-impact, leave-no-trace machine learning methods to automatically detect sentence boundaries.
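To see why a simple rule such as "split after every period followed by whitespace" is not enough, the minimal sketch below applies that rule to the example sentence. It is a strawman baseline for illustration only, not part of Detector Morse; the name `naive_split` is made up here.

```python
import re

SENTENCE = (
    "Rolls-Royce Motor Cars Inc. said it expects its U.S. sales to remain "
    "steady at about 1,200 cars in 1990."
)


def naive_split(text):
    """Hypothetical baseline: split after every period followed by whitespace."""
    return re.split(r"(?<=\.)\s+", text)


pieces = naive_split(SENTENCE)
for piece in pieces:
    print(piece)
```

The baseline cuts the single sentence into three pieces, breaking after Inc. and after U.S. This is the kind of error that contextual features are meant to avoid.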

SBD is one of the earliest steps in many natural language processing pipelines. Since errors at this step are likely to propagate, SBD is an important, albeit overlooked, problem in natural language processing.

Detector Morse has been tested on CPython 3.4 and PyPy3 (2.3.1, corresponding to Python 3.2); the latter is much faster. Detector Morse depends on the Python module jsonpickle to (de)serialize models. For the versions used, see requirements.txt.

Installation

    sudo pip install -r requirements.txt
    sudo python setup.py build install

Usage

 Detector Morse, by Kyle Gorman
 
 usage: python -m detectormorse [-h] [-v] [-V] (-t TRAIN | -r READ)
                                (-s SEGMENT | -w WRITE | -e EVALUATE)
                                [-E EPOCHS] [-C]

 optional arguments:
   -h, --help            show this help message and exit
   -v, --verbose         enable verbose output
   -V, --really-verbose  enable even more verbose output
   -t TRAIN, --train TRAIN
                         training data
   -r READ, --read READ  read in serialized model
   -s SEGMENT, --segment SEGMENT
                         segment sentences
   -w WRITE, --write WRITE
                         write out serialized model
   -e EVALUATE, --evaluate EVALUATE
                         evaluate on segmented data
   -E EPOCHS, --epochs EPOCHS
                         # of epochs (default: 20)
   -C, --nocase          disable case features

Files used for training (-t/--train) and evaluation (-e/--evaluate) should contain one sentence per line; newline characters are ignored otherwise.
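As a rough illustration of the one-sentence-per-line format, the sketch below turns such lines into running text plus the character offsets at which each sentence ends. This is not Detector Morse's own loader; `gold_boundaries` is a hypothetical helper, and joining sentences with a single space is an assumption made for the example.

```python
def gold_boundaries(lines):
    """Hypothetical helper: join one-sentence-per-line input into running text.

    Returns (text, offsets), where each offset is the index just past the
    last character of a sentence in `text`.
    """
    # Blank lines carry no sentence, so they are skipped.
    sentences = [line.strip() for line in lines if line.strip()]
    text = " ".join(sentences)
    offsets = []
    pos = 0
    for sentence in sentences:
        pos += len(sentence)
        offsets.append(pos)
        pos += 1  # account for the joining space
    return text, offsets


text, offsets = gold_boundaries(["It rained.\n", "We left.\n"])
print(text)
print(offsets)
```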

When segmenting a file (-s/--segment), Detector Morse simply inserts a newline after predicted sentence boundaries that aren't already marked by one. All other newline characters are passed through, unmolested.
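The output behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the program's actual code: `mark_boundaries` is a hypothetical function, the boundary offsets are supplied by the caller rather than predicted by a model, and replacing the spaces after a boundary with the inserted newline is a choice made for this example.

```python
import re

WHITESPACE = re.compile(r"\s*")


def mark_boundaries(text, boundaries):
    """Insert a newline after each boundary offset unless one is already there.

    `boundaries` holds indices just past sentence-final punctuation
    (hypothetical input; a real segmenter would predict these).
    """
    out = []
    prev = 0
    for end in sorted(boundaries):
        out.append(text[prev:end])
        gap = WHITESPACE.match(text, end)
        # Keep whitespace that already contains a newline; otherwise
        # replace the gap with a single newline.
        out.append(gap.group() if "\n" in gap.group() else "\n")
        prev = gap.end()
    out.append(text[prev:])
    return "".join(out)


sample = "It rained. We stayed in.\nThen we left. The end."
# For this toy input every period happens to be a true boundary.
ends = [m.end() for m in re.finditer(r"\.", sample)]
print(mark_boundaries(sample, ends))
```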

The included DM-wsj.json.gz is a segmenter model trained on the Wall St. Journal portion of the Penn Treebank.

Method

See this blog post.

Caveats

Detector Morse processes text by reading the entire file into memory, so it will not work with files that do not fit into the available RAM. The easiest way around this is to import the Detector class in your own Python script and feed it the text in smaller pieces.
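One way such a script might be organized is sketched below: it streams a file one blank-line-delimited paragraph at a time and hands each paragraph to a segmenter. The method names of Detector are not documented here, so the segmenter is passed in as a plain callable; `paragraphs`, `segment_stream`, and the stand-in segmenter are all hypothetical. The sketch also assumes that no sentence spans a blank line.

```python
import io


def paragraphs(handle):
    """Yield blank-line-delimited paragraphs from an open text file."""
    buf = []
    for line in handle:
        if line.strip():
            buf.append(line)
        elif buf:
            yield "".join(buf)
            buf = []
    if buf:
        yield "".join(buf)


def segment_stream(handle, segment):
    """Apply `segment` (a callable returning sentences) paragraph by paragraph."""
    for paragraph in paragraphs(handle):
        for sentence in segment(paragraph):
            yield sentence


# Stand-in segmenter for demonstration; a real script would wrap a trained
# Detector here instead.
def whole_paragraph(paragraph):
    return [paragraph.strip()]


demo = io.StringIO("a b.\nc d.\n\n\ne f.\n")
result = list(segment_stream(demo, whole_paragraph))
print(result)
```

Only one paragraph is held in memory at a time, so the peak memory use depends on the longest paragraph rather than on the file size.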

Exciting extras!

I've included a Perl script untokenize.pl which attempts to invert the Penn Treebank tokenization process. Tokenization is an inherently "lossy" procedure, so there is no guarantee that the output is exactly how it appeared in the WSJ. However, the rules appear to be correct and produce sane text, and I have used it for all experiments. Update (2015-02-10): I've removed this script; I now just use the Stanford tokenizer for this purpose.
