Giter Club home page Giter Club logo

mascara's Introduction

mascara

A natural language tokenizer.

Purpose

This is a C library and command-line tool for segmenting written texts into grapheme clusters, tokens, or sentences. It has specific support for English, French, Italian and German. A generic tokenizer is also available.

Building

The library is available in source form, as an amalgamation. Compile mascara.c together with your source code, and use the interface described in mascara.h. You'll need a C11 compiler, which means either GCC or CLang on Unix.

A command-line tool mascara is included, plus a set of sentence boundary detection models. To install all these:

$ make && sudo make install

Although the library itself is BSD-licensed and can thus be used for free in commercial software, sentence boundary detection models are derived from corpora covered by more restrictive licenses. Here are the corpora used for creating each model:

Usage

Examples

The examples directory contains concrete usage examples. Compile these files with make, and use them like so:

  • Split a text into extended grapheme clusters:

      $ examples/graphemes हृदय
      हृ
      द
      य
    
  • Split a sentence into tokens:

      $ examples/tokens "And now, Laertes, what's the news with you?"
      And
      now
      ,
      Laertes
      ,
      what
      's
      the
      news
      with
      you
      ?
    
  • Split a text into sentences:

      $ examples/sentences "Pierre Vinken, 61 years old, will join the board
      as a nonexecutive director Nov. 29. Mr. Vinken is chairman of Elsevier
      N.V., the Dutch publishing group."  
      Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .  
      Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
    

The library API is fully described in mascara.h.

Tokenization mode

Before allocating a tokenizer, you must choose whether you want to iterate over tokens or over sentences. Segmentation is slightly different depending on which mode you choose:

  • When iterating over tokens, periods that immediately follow a word are always separated from it, even if the token is an abbreviation:

    Mr . and Mrs . Smith have two children .
    
  • When iterating over sentences, all but the last period of the sentence are left attached to the token that precedes them, provided it is a word:

    Mr. and Mrs. Smith have two children .
    

Token types

During tokenization, each token is annotated with a type. This information is sometimes useful for its own sake, but it is intended to be used as feature for later processing. Existing token types are:

  • LATIN. A token principally made of Latin characters:

      Hamlet
      entr'ouvert
      willy-nilly
      AT&T
      tris(dimethylamino)bromophosphonium
    
  • PREFIX. A token at the beginning of a text segment:

      y'    => y'know
      d'    => d'entrée de jeu
      qu'   => qu'on se le dise
      dell' => dell'altro
    
  • SUFFIX. A token at the end of a text segment:

      'll   => he'll
      'd    => he'd
      -t-il => pense-t-il
    
  • SYM. A symbol. This doesn't include all Unicode symbols, only the most common ones that need to be recognized for the input text to be tokenized correctly:

    ?
    !!!
    +
    $
    
  • NUM. A numeric token. This includes numbers in decimal, hexadecimal, and exponential notation, phone numbers, and a few other types. Examples:

    1,234,567
    80's
    12.34
    0xdeadbeef
    20 000
    3e-27
    
  • ABBR. A likely abbreviation, with internal periods:

      Ph.D.
      a.m.
      J.-C.
    
  • EMAIL. An email address:

      [email protected]
      john@café.be
    
  • URI. A likely URI:

      http://www.example.com?q=fubar
      www.google.de
    
  • PATH. A path in the file system:

      /usr/bin/fubar
      ~/home_sweet_home/foo.txt
    
  • UNK. Anything but one of the above. This includes unknown symbols, as well as words not in the Latin script. The longest possible span of unknown characters is systematically selected:

    ☎
    मन्त्र
    

You can check wich type is assigned to which token with the command-line tool:

$ echo "And now, Laertes, what's the news with you?" | mascara -f "%s/%t "
And/LATIN now/LATIN ,/SYM Laertes/LATIN ,/SYM what/LATIN 's/SUFFIX the/LATIN
news/LATIN with/LATIN you/LATIN ?/SYM

Implementation

Tokenization

There are two main approaches for implementing a tokenizer: a) using finite-state automata, and b) using a supervised sequence model. The second solution is much heavier than its alternative and doesn't seem to be worth the extra work for such a light task as tokenization, so I discarded it.

Each tokenizer uses two finite-state machines, written in Ragel. The first one matches the input text from left to right, in the usual way. The second one reads it from right to left, and is used to recognize contractions at the end of a word. Using two separate machines helps to disambiguate the role of single quotes.

In my first attempt at the task, I performed a preliminary segmentation of the text on whitespace characters, and then repeatedly attempted to trim tokens (punctuation, prefixes, suffixes) from the left and the right of the delimited text chunks. I changed that because this cannot deal with tokens that contain internal whitespace characters, such as numbers in French.

Sentence boundary detection

Two sentence boundary detection modules are implemented.

The first one is a simple finite-state machine. It uses a fixed list of rules and abbreviations. It cannot disambiguate the most ambiguous use of periods (cardinal numbers, abbreviation at the end of a sentence, etc.). It is currently only used as a fallback, when no language-specific model is available.

The second one is a finite-state machine that uses a Naive Bayes classifier to disambiguate the role of periods. Only periods that are seemingly not token-internal are examined by the classifier. The remaining ones, as well as question marks, etc., are deemed to be unambiguous, and are always classified as end-of-sentence markers. We use one feature set per language, obtained through semi-automatic optimization with the help of my feature selection tool. Features are extracted from a window of three tokens, centered on the period to classify.

Despite the simplicity of this approach, its performance is close to state-of-the-art. The following tables show the performance of our classifier on four corpora. We use five-fold cross-validation, and only classify ambiguous periods.

  • Brown

                 accuracy    precision   recall      F1
      bayes      98.83       99.61       99.11       99.35
      baseline   90.83       90.83       100.00      95.19
    
  • Sequoia

                 accuracy    precision   recall      F1
      bayes      98.89       98.94       99.92       99.43
      baseline   96.28       96.28       100.00      98.11
    
  • Tiger

                 accuracy    precision   recall      F1
      bayes      99.48       99.65       99.79       99.72
      baseline   93.34       93.34       100.00      96.58
    
  • Turin University Treebank

                 accuracy    precision   recall      F1
      bayes      98.22       99.53       98.38       98.95
      baseline   85.58       85.58       100.00      92.23
    

References

mascara's People

Contributors

michaelnmmeyer avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.