Thai Natural Language Processing (Thai NLP) Resource
A collection of Thai natural language processing (NLP) software libraries, dictionaries, and corpora.
Pull requests are always welcome.
Thai NLP Libraries/Services

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| JTCC | Thai Character Cluster | Java | | GPL-3.0 | Wittawat |
| TCC | Thai Character Cluster | Python | | Apache 2.0 | Wannaphong |


| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| sentiment_analysis_thai | | | | | JagerV3 |


| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| LK82 + Udom83 | Thai Soundex | Python | | | Korakot |


| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Swath | SWATH (Smart Word Analysis for THai), a word segmentation tool for Thai | C | Longest Matching, Maximal Matching, and Part-of-Speech Bigram | GPL | CMU |
| Lexto | Lexto: Thai Lexeme Tokenizer | Java | | LGPL | NECTEC |
| | | Python 2 | | LGPL | Python2 Wrapper |
| | | Python 3 | | LGPL | Python3 Wrapper |
| Wordcut | Thai word breaker for Node.js | JavaScript, Node.js | | LGPL-3.0 | veer66, GitHub |
| wordcutpy | A simple Thai word tokenizer written in a single Python file | Python 3 | | LGPL-3.0 | veer66, GitHub |
| CutKum | Thai word segmentation with deep learning (RNN) in TensorFlow | Python | 93% F-measure | MIT | Pucktada, GitHub |
| Thai Language Toolkit (tltk) | Based on a 2002 paper by Wirote Aroonmanakun. Word segmentation uses a maximum collocation approach; syllable segmentation uses trigram statistics. The dataset is included. | Python | 97.86% F-measure (measured on a different test set, so not directly comparable with the other models) | GPLv3 | awirote, the Python Package Index |
| DeepCut | A Thai word tokenization library using a deep neural network (CNN) | Python | 98.8% F-measure | MIT | rkcosmos, GitHub |
| SynThai | Thai word segmentation and part-of-speech tagging with deep learning (RNN, LSTM) | Python | 99.2% F-measure | MIT | KenjiroAI, GitHub |
| CutThai | Thai word segmentation written in CoffeeScript | CoffeeScript | | MIT | Pureexe/cutthai, GitHub |
| Multi-Candidate-Word-Segmentation | Multi-candidate word segmentation for the Thai language | Python, RNN, LSTM | 97.0% F-measure (word level), 98.95% F-measure (boundary level) | MIT | Paper, earthy123/Multi-Candidate-Word-Segmentation |

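Several of the dictionary-based tools above rely on longest matching. The idea can be sketched in a few lines of Python. This is an illustrative toy only, not the code of Swath or any other library listed here; the function name `longest_matching` and the five-word `toy_dictionary` are made up for the example, and real tools use much larger dictionaries plus handling for unknown words.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there.
    If no word matches, emit a single character and move on.
    """
    max_len = max(map(len, dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink it.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character: fall back to one character
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"ฉัน", "กิน", "ข้าว", "ข้าวผัด", "ผัด"}
print(longest_matching("ฉันกินข้าวผัด", toy_dictionary))
# → ['ฉัน', 'กิน', 'ข้าวผัด']
```

Note that the greedy choice picks "ข้าวผัด" rather than "ข้าว" followed by "ผัด"; maximal matching and statistical or neural models exist largely to resolve this kind of ambiguity more reliably.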
Part of Speech Tagging (POS Tagging)

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Jitar+NAiST | A simple trigram HMM part-of-speech tagger | Java | | | Veer66, Jitar + NAiST, 1 + NAiST, 2 |
| SynThai | Thai word segmentation and part-of-speech tagging with deep learning (RNN, LSTM) | Python | 0.9163 F-measure (RNN, LSTM) | MIT | KenjiroAI, GitHub |
| Chart-POS | Thai POS tagger | C | | All rights reserved | AIAT, KINDML, Thanaruk T., Thodsaporn C., Demo at iApp |


| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Named Entity Tagging (Thai NEST) | Thai Named Entity tagging Specification and Tools | | | GPL | KINDML, SIIT, AIAT |
| ThaiNER | Thai Named Entity Recognition for PyThaiNLP | Python | | Apache 2.0 (code) & CC BY 3.0 (dataset) | ThaiNER |


| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| News Structure Tagging Program | Thai News Structure Tagging Program | | Metadata tagging, structure tagging, automatic news title generation | GPL | AIAT |

Syntactic Parsing & Tools

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| Chart-parser | Extracts syntactic structure from a POS-tagged sentence | C | | All rights reserved | AIAT, KINDML, Thanaruk T., Thodsaporn C., Demo at iApp |
| Grammar Processing | Labelled Brackets -> Context Free Grammars (CFGs) | Python | Transform and compute probability | | Thodsaporn C. |

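To make "Labelled Brackets -> Context Free Grammars" concrete, here is a minimal sketch of the general technique: read a bracketed parse tree, collect one production per internal node, and estimate each rule's probability by relative frequency. It is not the code of the Grammar Processing tool listed above; the function names and the sample tree are hypothetical, and the sketch assumes well-formed input with whitespace-free labels and words.

```python
from collections import Counter


def parse_brackets(s):
    """Parse a labelled-bracket string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # skip ")"
            return (label, children)
        word = tokens[pos]                # a leaf (terminal word)
        pos += 1
        return word

    return read()


def productions(tree):
    """Yield one (lhs, rhs) rule for every internal node of the tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def rule_probabilities(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(r for t in trees for r in productions(t))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


# Hypothetical sample tree, for illustration only.
sample = "(S (NP (N I)) (VP (V eat) (NP (N rice))))"
for (lhs, rhs), p in rule_probabilities([parse_brackets(sample)]).items():
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

With a real treebank, the same counting over many trees yields a probabilistic CFG; the sketch does not distinguish terminals from non-terminals on the right-hand side, which a fuller implementation would.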

| Library | Description | Programming Languages | Features | License | Author & Link |
| --- | --- | --- | --- | --- | --- |
| kobkrit-word-embedding | TensorFlow implementation of Thai word embedding | Python | Source code, example, word distance graph | LGPL | Kobkrit V. |

Thai Question Answering (Machine Comprehension)

| Service | Description | License | Author & Link |
| --- | --- | --- | --- |
| Thai Machine Comprehension (ThaiMC) | Bidirectional Attention Flow | Copyright (as the service) | iApp-AI |

Dictionaries / Translation Pairs

| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| Transliteration Corpus | | 31K pairs | Thai-Eng Translation Pair | CC BY-NC-SA 3.0 TH | NECTEC |
| LEXiTRON | Thai<->English Dictionary | | TH->EN, EN->TH | LEXiTRON License | NECTEC |
| Yaitron | LEXiTRON in machine-readable format (XML) | | TH->EN, EN->TH | LEXiTRON License | Veer66 Schema, Data & Conversion Code |


| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| Thai National Corpus 2 | | 32M words | Query text by genre, domain | All rights reserved | CHULA |
| Thai Medical Document | | 3,594 docs | Document and dynamic keyword map | All rights reserved | KINDML, SIIT |
| Southeast Asian Languages Library | Thai news, web text, pop music, literature, toponyms | 20M chars | Phrases around a search text | | SEALang |
| HSE Thai Corpus | Modern texts written in Thai (mostly from news websites) | 50M tokens | Query by word form, lexeme, translation, grammatical attributes, lexical attributes | | HSE School of Linguistics |


| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| TALPCo | TUFS Asian Language Parallel Corpus | 1,327 sentences | Open parallel corpus of Japanese sentences and their translations into Burmese (Myanmar; the official language of the Republic of the Union of Myanmar), Malay (the national language of Malaysia, Singapore and Brunei), Indonesian, Thai, Vietnamese, and English | CC BY 4.0 | TALPCo |


| Pre-trained Model | Description | Size | Dimensions | License | Link |
| --- | --- | --- | --- | --- | --- |
| fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
| thai2fit | ULMFit on Wikipedia. Perplexity of 46.80959 with 60,002 embeddings. | 70MB | 300 | MIT | thai2vec / pyThaiNLP |


| model | micro_f1_public | micro_f1_private |
| --- | --- | --- |
| ULMFit | 0.59313 | 0.60322 |
| fastText | 0.5145 | 0.5109 |
| LinearSVC | 0.5022 | 0.4976 |
| Kaggle Score | 0.59139 | 0.58139 |
| BERT | 0.56612 | 0.57057 |


| Model | Macro-accuracy | Macro-F1 |
| --- | --- | --- |
| fastText | 0.9302 | 0.5529 |
| LinearSVC | 0.513277 | 0.552801 |
| ULMFit | 0.948737 | 0.744875 |


| Model | Public Accuracy | Private Accuracy |
| --- | --- | --- |
| Logistic Regression | 0.72781 | 0.7499 |
| FastText | 0.63144 | 0.6131 |
| ULMFit | 0.71259 | 0.74194 |
| ULMFit Semi-supervised | 0.73119 | 0.75859 |
| ULMFit Semi-supervised Repeated One Time | 0.73372 | 0.75968 |


| model | accuracy | micro-F1 |
| --- | --- | --- |
| fastText | 0.384116 | 0.384116 |
| LinearSVC | 0.807876 | 0.327565 |
| ULMFit | 0.834981 | 0.834981 |

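The benchmark tables report accuracy, micro-F1, and macro-F1. As a reminder of how these metrics differ, here is a minimal pure-Python sketch for single-label classification. It is not the evaluation script used for any of the benchmarks above; the function name `f1_scores` and the sample labels are made up for illustration.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Hypothetical labels, for illustration only.
y_true = ["pos", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "neg", "neg"]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro-F1 = {micro:.4f}, macro-F1 = {macro:.4f}")
# → micro-F1 = 0.5000, macro-F1 = 0.3889
```

Macro-F1 weights every class equally, so it drops sharply when a rare class is never predicted correctly, while micro-F1 is dominated by the frequent classes.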
Not found? Try another Thai NLP awesome list or resource, such as this one:
https://resources.aiat.or.th/