Tokenizer is a C++ implementation of OpenNMT tokenization and detokenization.
Compiling executables requires:
- Boost (program_options)
CMake and a compiler that supports the C++11 standard are required to compile the project.
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=<Release or Debug> ..
make
It will produce the dynamic library libtokenizer.so (or .dylib on Mac OS, .dll on Windows), and the tokenization tools cli/tokenize and cli/detokenize.
- To compile only the library, use the -DLIB_ONLY=ON flag.
See --help on the clients to discover available options and usage. They have the same interface as their Lua counterparts.
This project is also a convenient way to apply OpenNMT tokenization in existing software.
See include/onmt/Tokenizer.h to apply OpenNMT's tokenization and detokenization.
make check
- Create the input raw text file <name>_<mode>_<joiner_annotate>_<case_feature>[_<bpe_model>].raw, where:
  - <name> is the name of the test case, without underscores
  - <mode> is the value of the --mode option on cli/tokenize
  - <joiner_annotate> is the marker of the --joiner_annotate option on cli/tokenize
  - <case_feature> is the value of the --case_feature option on cli/tokenize and cli/detokenize
  - (optional) <bpe_model> is the name of the file in bpe-models/ for the --bpe_model option on cli/tokenize
- Create the expected tokenized output file <name>.tokenized
- (optional) Create the expected tokenized output file <name>.tokenized.new that will be compared to the output produced with the --joiner_new option
- (optional) Create the expected detokenized output file <name>.detokenized. If this file is not provided, the detokenization of <name>.tokenized and <name>.tokenized.new must match the raw input text.
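
The naming convention for test input files can be checked mechanically. The following is a minimal C++11 sketch, not code from this project: the TestCaseName struct and the parse_test_case_name function are hypothetical names introduced for illustration. It assumes that everything after the fourth underscore is the BPE model file name, which the convention above does not state explicitly.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical holder for the fields encoded in a test file name.
struct TestCaseName {
  std::string name;
  std::string mode;
  std::string joiner_annotate;
  std::string case_feature;
  std::string bpe_model;  // empty when the optional field is absent
};

// Parses "<name>_<mode>_<joiner_annotate>_<case_feature>[_<bpe_model>].raw".
// Assumption: everything after the fourth underscore is the BPE model name.
inline TestCaseName parse_test_case_name(const std::string& filename) {
  const std::string ext = ".raw";
  if (filename.size() <= ext.size() ||
      filename.compare(filename.size() - ext.size(), ext.size(), ext) != 0)
    throw std::invalid_argument("expected a .raw file name: " + filename);
  const std::string stem = filename.substr(0, filename.size() - ext.size());

  // Cut at most four underscore-terminated fields,
  // then keep the remainder as the last field.
  std::vector<std::string> fields;
  std::size_t start = 0;
  while (fields.size() < 4) {
    const std::size_t pos = stem.find('_', start);
    if (pos == std::string::npos)
      break;
    fields.push_back(stem.substr(start, pos - start));
    start = pos + 1;
  }
  fields.push_back(stem.substr(start));

  if (fields.size() < 4)
    throw std::invalid_argument("expected at least four fields: " + filename);
  for (const std::string& field : fields)
    if (field.empty())
      throw std::invalid_argument("empty field in: " + filename);

  TestCaseName result;
  result.name = fields[0];
  result.mode = fields[1];
  result.joiner_annotate = fields[2];
  result.case_feature = fields[3];
  if (fields.size() == 5)
    result.bpe_model = fields[4];
  return result;
}
```

For instance, calling parse_test_case_name("basic_conservative_marker_true.raw") fills name with "basic" and mode with "conservative" and leaves bpe_model empty. The field values in that file name are made up for the example and are not taken from the project's test data.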