docker-thai-tokenizers's Introduction

Thai Word Tokenizers

This repository is a collection of almost all publicly available Thai tokenisers. Having this collection allows us to try each algorithm with ease via Docker.

Technically, each project (called a vendor) has its own Docker image with an entry script and auxiliary scripts. These scripts provide a unified interface, allowing us to run all the algorithms in the same way.

Vendors

Vendor                                Alias      Available Methods
PyThaiNLP                             pythainlp  newmm, longest
DeepCut                               deepcut    deepcut
CutKum                                cutkum     cutkum
Sertis                                sertis     sertis
Thai Language Toolkit                 tltk       mm, ngram, colloc
Smart Word Analysis for Thai (SWATH)  swath      max, long
Chrome's v8Breakiterator              chrome     v8breakiterator

Please see Usages for more details.

Setup

  • Pull the necessary Docker images. Please check Docker Hub for the available images.
    $ docker pull pythainlp/word-tokenizers:<vendor-alias>
    

Usages

  1. Put text files that you want to tokenise into ./data.
  2. Run the following command:
$ ./scripts/tokenise.sh <vendor-alias>:<method> <filename>

Please check the Vendors section for the vendors and methods included here.
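
As an illustration of how a vendor and method pair could be validated before anything is run, here is a minimal sketch. It assumes the `<vendor-alias>:<method>` form used in the Example section. The names `VENDOR_METHODS` and `parse_spec` are hypothetical and are not part of this repository's scripts; the table simply mirrors the Vendors section.

```python
from typing import Tuple

# Hypothetical lookup table copied from the Vendors section of this README.
VENDOR_METHODS = {
    "pythainlp": {"newmm", "longest"},
    "deepcut": {"deepcut"},
    "cutkum": {"cutkum"},
    "sertis": {"sertis"},
    "tltk": {"mm", "ngram", "colloc"},
    "swath": {"max", "long"},
    "chrome": {"v8breakiterator"},
}


def parse_spec(spec: str) -> Tuple[str, str]:
    """Split '<vendor-alias>:<method>' and check both parts against the table."""
    vendor, sep, method = spec.partition(":")
    if not sep:
        raise ValueError(f"expected '<vendor-alias>:<method>', got {spec!r}")
    if vendor not in VENDOR_METHODS:
        raise ValueError(f"unknown vendor alias: {vendor!r}")
    if method not in VENDOR_METHODS[vendor]:
        raise ValueError(f"vendor {vendor!r} has no method {method!r}")
    return vendor, method


if __name__ == "__main__":
    print(parse_spec("pythainlp:newmm"))
```

Failing early on an unknown pair avoids starting a container only to find that the method does not exist for that vendor.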

Example

Let's say you want to tokenise text in ./data/example.text using PyThaiNLP's newmm algorithm. You can use the following command:

$ cat ./data/example.text
อันนี้คือตัวอย่าง

$ ./scripts/tokenise.sh pythainlp:newmm example.text
# Please be aware that you don't need to have ./data in front of the filename.
# Command Output
Tokenising example.text using vendor=pythainlp and method=newmm
CMD: docker run -v /Users/heytitle/projects/tokenisers-for-thai/data:/data  thai-tokeniser:pythainlp newmm example.text
100%|██████████| 1/1 [00:00<00:00, 151.70it/s]
Tokenising /data/example.text with newmm
Tokenised text is written to /data/example_tokenised-pythainlp-newmm.text

$ cat ./data/example_tokenised-pythainlp-newmm.text
อันนี้|คือ|ตัวอย่าง
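
The output above suggests two conventions: the result file is named `<stem>_tokenised-<vendor>-<method><suffix>`, and tokens are separated by `|`. The sketch below reads such output back into Python lists under those assumptions. Both are inferred from this single example, and the helper names `tokenised_path` and `read_tokens` are hypothetical.

```python
from pathlib import Path
from typing import List


def tokenised_path(filename: str, vendor: str, method: str) -> Path:
    """Derive the output name following the pattern seen in the example output."""
    p = Path(filename)
    return p.with_name(f"{p.stem}_tokenised-{vendor}-{method}{p.suffix}")


def read_tokens(text: str, delimiter: str = "|") -> List[List[str]]:
    """Return one token list per non-empty line of tokenised text."""
    return [line.split(delimiter) for line in text.splitlines() if line]


if __name__ == "__main__":
    print(tokenised_path("example.text", "pythainlp", "newmm"))
    print(read_tokens("อันนี้|คือ|ตัวอย่าง\n"))
```

Note that if the input text itself contains a literal `|`, splitting on the delimiter becomes ambiguous; this sketch does not handle that case.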

Development

Architecture

TBD.

Build a vendor's new Docker image

$ ./scripts/build <vendor>

Push a new Docker image to Docker Hub

$ ./scripts/push <vendor>

Acknowledgements

docker-thai-tokenizers's People

Contributors

p16i, wannaphong

docker-thai-tokenizers's Issues

Current CI build isn't working

It seems the Docker CI build isn't working properly, so we probably have to build images manually via ./scripts/build.sh. This script appends the main function to the vendor file, and we must also keep the copy commands like these:

COPY vendor.py .
COPY entry .

cc: @wannaphong

Implement performance metrics

Based on the discussion in PyThaiNLP/tokenization-benchmark#8.

It would be good if we also benchmarked each tokeniser on other aspects of performance. These aspects include:

Speed

Characters per second (on a standardized machine)
This could be tested with different sizes of text (small and large) to expose the "boot time" of a tokenizer
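
A rough sketch of how characters per second could be measured in-process is shown below. The function name `chars_per_second` is hypothetical, and `str.split` stands in for a real tokeniser. Timing inside one process excludes container start-up, so any "boot time" would need to be measured around the full command instead.

```python
import time
from typing import Callable, List


def chars_per_second(tokenise: Callable[[str], List[str]], text: str, repeats: int = 5) -> float:
    """Return throughput based on the fastest of several timed runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        tokenise(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse clocks.
    return len(text) / best if best > 0 else float("inf")


if __name__ == "__main__":
    small = "a b c " * 10
    large = "a b c " * 100_000
    # str.split is only a placeholder for a real tokeniser.
    print(chars_per_second(str.split, small))
    print(chars_per_second(str.split, large))
```

Comparing the small and large figures gives a first hint of fixed per-call overhead, since that overhead dominates on short inputs.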

Memory footprint

Memory used by the tokenizer at run time, when tokenizing a given amount of text
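
For tokenisers callable from Python, peak memory could be approximated with the standard library's `tracemalloc`, as in the sketch below. The function name `peak_python_memory` is hypothetical. An important caveat is that `tracemalloc` only sees allocations made through Python's memory allocator, so memory held by native libraries or by a non-Python vendor would need a process-level or container-level measurement instead.

```python
import tracemalloc
from typing import Callable, List


def peak_python_memory(tokenise: Callable[[str], List[str]], text: str) -> int:
    """Return the peak number of bytes traced by tracemalloc during one call."""
    tracemalloc.start()
    try:
        tokenise(text)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    # str.split is only a placeholder for a real tokeniser.
    print(peak_python_memory(str.split, "a b c " * 100_000))
```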

Disk size

Total size of the tokenizer, including dictionary, models, and all non-standard dependencies (excluding runtime environment, like interpreter/VM)
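
The disk footprint of an unpacked tokeniser (dictionary, models, and dependencies in one directory) could be totalled as in the sketch below. The function name `directory_size_bytes` is hypothetical, and deciding which directories count as "non-standard dependencies" is left to the caller.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    print(directory_size_bytes("."))
```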
