Tokenizer

Tokenizer is a C++ implementation of OpenNMT tokenization and detokenization.

Dependencies

Compiling executables requires:

  • Boost (program_options)

Compiling

CMake and a compiler that supports the C++11 standard are required to compile the project.

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=<Release or Debug> ..
make

The build produces the dynamic library libtokenizer.so (.dylib on macOS, .dll on Windows) and the command-line tools cli/tokenize and cli/detokenize.

Options

  • To compile only the library, use the -DLIB_ONLY=ON flag.

Using

Clients

Run either client with --help to list its options and usage. Both clients have the same interface as their Lua counterparts.

Library

This project is also a convenient way to apply OpenNMT tokenization in existing software.

See:

  • include/onmt/Tokenizer.h to apply OpenNMT's tokenization and detokenization

Testing

make check

Adding new tests

  1. Create the input raw text file <name>_<mode>_<joiner_annotate>_<case_feature>[_<bpe_model>].raw, where:
    • <name> is the name of the test case, without underscores
    • <mode> is the value of the --mode option on cli/tokenize
    • <joiner_annotate> is the marker of the --joiner_annotate option on cli/tokenize
    • <case_feature> is the value of the --case_feature option on cli/tokenize and cli/detokenize
    • (optional) <bpe_model> is the name of the file in bpe-models/ for the --bpe_model option on cli/tokenize
  2. Create the expected tokenized output file <name>.tokenized
  3. (optional) Create the expected tokenized output file <name>.tokenized.new that will be compared to the output produced with the --joiner_new option
  4. (optional) Create the expected detokenized output file <name>.detokenized. If this file is not provided, the detokenization of <name>.tokenized and <name>.tokenized.new must match the raw input text.
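
The naming convention from step 1 can be checked mechanically. The following is a minimal C++11 sketch, not code from this project: the TestCaseSpec struct and the parse_test_filename function are hypothetical names introduced for illustration. It assumes that the first four fields contain no underscores and that everything after the fourth underscore belongs to the optional <bpe_model> field.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the fields encoded in a test file name.
struct TestCaseSpec {
  std::string name;
  std::string mode;
  std::string joiner_annotate;
  std::string case_feature;
  std::string bpe_model;  // empty when the optional field is absent
};

// Parses "<name>_<mode>_<joiner_annotate>_<case_feature>[_<bpe_model>].raw".
// Returns false if the extension is not ".raw", if fewer than four fields
// are present, or if any field is empty.
bool parse_test_filename(const std::string& filename, TestCaseSpec& spec) {
  const std::string ext = ".raw";
  if (filename.size() <= ext.size() ||
      filename.compare(filename.size() - ext.size(), ext.size(), ext) != 0)
    return false;
  const std::string stem = filename.substr(0, filename.size() - ext.size());

  // Split on the first four underscores only, so that a BPE model name
  // containing underscores stays in one piece (an assumption of this sketch).
  std::vector<std::string> fields;
  std::size_t start = 0;
  while (fields.size() < 4) {
    const std::size_t pos = stem.find('_', start);
    if (pos == std::string::npos)
      break;
    fields.push_back(stem.substr(start, pos - start));
    start = pos + 1;
  }
  fields.push_back(stem.substr(start));  // remainder after the last split
  assert(fields.size() <= 5);

  if (fields.size() < 4)
    return false;
  for (std::size_t i = 0; i < fields.size(); ++i)
    if (fields[i].empty())
      return false;

  spec.name = fields[0];
  spec.mode = fields[1];
  spec.joiner_annotate = fields[2];
  spec.case_feature = fields[3];
  spec.bpe_model = fields.size() == 5 ? fields[4] : std::string();
  return true;
}
```

For example, a file named basic_conservative_marker_nocase.raw (the field values here are made up, not taken from the test suite) would parse to the name "basic" and the mode "conservative", with an empty bpe_model.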
