docker-thai-tokenizers's Issues
Make a Docker image for Chrome's v8BreakIterator
Chrome has an API for tokenising text, and it supports Thai. Its evaluation results show promising performance in both speed and quality: its speed is comparable to PyThaiNLP's newmm, while its quality is significantly better. This could be another good baseline.
Using Puppeteer, we can expose this vendor through a CLI. I've made a first version of the program: https://gist.github.com/heytitle/02b818b02114644152bd0317e62750a7
Implement performance metrics
Based on the discussion in PyThaiNLP/tokenization-benchmark#8.
It would be good to also benchmark each tokeniser on other aspects of performance, including:
Speed
Characters per second (on standardized machine)
Could be tested with different text sizes (small and large) to reveal the "boot time" of a tokenizer
Memory footprint
Memory used by the tokenizer at run time (when tokenizing a certain amount of text)
Disk size
Total size of the tokenizer, including dictionary, models, and all non-standard dependencies (excluding runtime environment, like interpreter/VM)
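As a rough illustration of how these three metrics could be collected, here is a minimal Python sketch. It is not the project's implementation: the function names and the `whitespace_tokenize` stand-in are hypothetical, and `tracemalloc` only sees allocations made through Python's allocator, so memory used by native extensions or by a tokenizer running in a separate process would need a different measurement.

```python
import os
import time
import tracemalloc
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer; swap in a real vendor's function."""
    return text.split()


def chars_per_second(tokenize: Tokenizer, text: str, repeats: int = 3) -> float:
    """Speed: characters processed per second, using the fastest of several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        tokenize(text)
        best = min(best, time.perf_counter() - start)
    return len(text) / best if best > 0 else float("inf")


def peak_memory_bytes(tokenize: Tokenizer, text: str) -> int:
    """Memory footprint: peak Python-level allocation while tokenizing."""
    tracemalloc.start()
    try:
        tokenize(text)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def disk_size_bytes(path: str) -> int:
    """Disk size: total bytes of a file, or of all regular files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total
```

Calling `chars_per_second` once with a short text and once with a long text would give a rough view of the "boot time" effect, since a fixed start-up cost weighs more heavily on the short input.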
Current CI build isn't working
The Docker CI build does not appear to be working properly. For now, we probably have to build images manually via ./scripts/build.sh. This script appends the main function to the vendor file, so we must also keep the copy commands, like this:
COPY vendor.py .
COPY entry .
cc: @wannaphong
New vendor: KUCut
I plan to add KUCut.
New vendor: SynThai
New vendor: Multi-Candidate-Word-Segmentation (MCWS)