
coccoc-tokenizer's People

Contributors

0xflotus, anhducle98, bachan, duydo, thphuong, tranhieudev23, txdat


coccoc-tokenizer's Issues

Error when installing the Python version on Mac

I got this error; please help me fix it:

running install
running build
running build_ext
skipping 'CocCocTokenizer.cpp' Cython extension (up-to-date)
building 'CocCocTokenizer' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/lap00986/anaconda3/include -arch x86_64 -I/Users/lap00986/anaconda3/include -arch x86_64 -I. -I/Users/lap00986/Documents/product-matching/env/include -I/Users/lap00986/anaconda3/include/python3.7m -c CocCocTokenizer.cpp -o build/temp.macosx-10.7-x86_64-3.7/CocCocTokenizer.o -Wno-cpp -Wno-unused-function -O2 -march=native
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
CocCocTokenizer.cpp:610:10: fatal error: 'ios' file not found
#include "ios"
^~~~~
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1
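
The warning in the log already points at the likely cause: clang on macOS cannot find the GNU libstdc++ headers this Anaconda-configured build is asking for. A hedged, untested sketch of a workaround: force libc++, as the warning itself suggests, then re-run the setuptools build that produced the log above.

# Untested workaround sketch: use libc++ as the compiler warning suggests.
export CFLAGS="-stdlib=libc++"
export CXXFLAGS="-stdlib=libc++"
python setup.py install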

Complete guide to installing the C++ Tokenizer & ES 7.12.1 Vietnamese Analysis plugin

*** Environment: Ubuntu 18.04 (or whatever). You must install the Java JDK, not just the JRE, because javac is needed for the C++ Tokenizer. Write your own .yml files following the instructions in the respective repos. The same goes for Docker or a VM; it's as simple as this.

sudo su
apt-get update -y
apt-get upgrade -y
apt-get install build-essential cmake unzip pkg-config gcc-7 g++-7 -y
apt-get install wget curl nano git default-jdk maven -y

cd /

*** Download Elasticsearch 7.12.1
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.12.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.12.1-linux-x86_64.tar.gz
mv elasticsearch-7.12.1-linux-x86_64 /es

** Download the ES Vietnamese Analysis plugin
git clone https://github.com/duydo/elasticsearch-analysis-vietnamese.git
cd elasticsearch-analysis-vietnamese
mvn package

** Download the C++ Tokenizer
git clone https://github.com/coccoc/coccoc-tokenizer.git
cd coccoc-tokenizer
mkdir build
cd build
cmake -DBUILD_JAVA=1 ..
make install

** Install the plugin:
cd /es
echo "Y" | ./bin/elasticsearch-plugin install file:///elasticsearch-analysis-vietnamese/target/releases/elasticsearch-analysis-vietnamese-7.12.1.zip

*** Preparation
groupadd -g 999 nqrt && useradd -r -u 999 -g nqrt nqrt
usermod -aG sudo nqrt
chown nqrt:nqrt /es -R
sysctl -w vm.max_map_count=262144

su nqrt

** Run
export ES_JAVA_OPTS="-Xms2048m -Xmx2048m -Djava.library.path=/usr/local/lib"
cd /es
./bin/elasticsearch
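
*** Smoke test (a hedged addition to this guide: the analyzer name vi_analyzer is assumed from the duydo plugin's conventions, not stated here; adjust to your actual configuration)
curl -s -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d '{"analyzer": "vi_analyzer", "text": "Cộng hòa Xã hội chủ nghĩa Việt Nam"}'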

Missed tokenization of entity names

Thank you for open-sourcing one of the best and fastest Vietnamese tokenizers 💯

Today, while playing around with the CocCocTokenizer Python binding, I found that it sometimes misses tokenizing entity names.

For example:

>>> T.word_tokenize("Những lần Lam Trường - Đan Trường tái ngộ chung khung hình ở U50")
['Những', 'lần', 'Lam', 'Trường', '-', 'Đan', 'Trường', 'tái_ngộ', 'chung', 'khung_hình', 'ở', 'U50']
# Expected result : ['Những', 'lần', 'Lam_Trường', '-', 'Đan_Trường', 'tái_ngộ', 'chung', 'khung_hình', 'ở', 'U50']

What can I do to help the tokenizer perform better in these cases?
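
For reference, the same behavior can be reproduced with the command-line tool at the install path used in other issues on this page; if the names simply are not in the compiled dictionary, adding them to the dictionary sources and rebuilding (see the "Adding new Vietnamese compound words" question below) is presumably the way in.

# Reproducing with the installed CLI (path as used elsewhere on this page):
/usr/local/bin/tokenizer "Những lần Lam Trường - Đan Trường tái ngộ chung khung hình ở U50"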

Python with coccoc_tokenizer

In the README file, I found this line:

from CocCocTokenizer import PyTokenizer

but how can I install CocCocTokenizer? (I tried copying and pasting the example and got a "No module named 'CocCocTokenizer'" error.)
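
For what it's worth, the build steps that other issues on this page use enable the Python binding at CMake time. A minimal sketch, assuming a standard system-wide install:

git clone https://github.com/coccoc/coccoc-tokenizer.git
cd coccoc-tokenizer && mkdir build && cd build
cmake -DBUILD_PYTHON=1 ..
sudo make install
python -c "from CocCocTokenizer import PyTokenizer"   # should now import cleanly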

build errors on Ubuntu 18.04

I followed the build instructions in README.md and encountered this error:

In file included from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie/syllable_da_trie.hpp:10,
                 from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie.hpp:5,
                 from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/tokenizer.hpp:10,
                 from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/utils/tokenizer.cpp:3:
/home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie/da_trie.hpp: In member function ‘int DATrie<HashNode, Node>::read_from_file(const string&) [with HashNode = MultitermHashTrieNode; Node = MultitermDATrieNode]’:
/home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie/da_trie.hpp:237:8: error: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Werror=unused-result]

fread(&alphabet_size, sizeof(alphabet_size), 1, in_file);


This happens several times in the tokenizer.hpp file.
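
Since the failure comes from -Werror promoting the unused-result warning, one hedged workaround (rather than patching each fread call in da_trie.hpp to check its return value) is to demote that warning for the build; whether the project's CMake setup honors extra flags passed this way is an assumption.

# Hedged workaround sketch: stop -Wunused-result from being treated as an error.
cmake -DCMAKE_CXX_FLAGS="-Wno-error=unused-result" ..
make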

Error when running the make install command

Hi, when installing the first step on Ubuntu LTS, running the make install command fails as follows:
Scanning dependencies of target dict_compiler
[ 12%] Building CXX object CMakeFiles/dict_compiler.dir/utils/dict_compiler.cpp.o
[ 25%] Linking CXX executable dict_compiler
[ 25%] Built target dict_compiler
Scanning dependencies of target vn_lang_tool
[ 37%] Building CXX object CMakeFiles/vn_lang_tool.dir/utils/vn_lang_tool.cpp.o
[ 50%] Linking CXX executable vn_lang_tool
[ 50%] Built target vn_lang_tool
Scanning dependencies of target tokenizer
[ 62%] Building CXX object CMakeFiles/tokenizer.dir/utils/tokenizer.cpp.o
[ 75%] Linking CXX executable tokenizer
[ 75%] Built target tokenizer
Scanning dependencies of target compile_dict
[ 87%] Generating multiterm_trie.dump, syllable_trie.dump, nontone_pair_freq_map.dump
[ 87%] Built target compile_dict
Scanning dependencies of target compile_java
[100%] Generating coccoc-tokenizer.jar
../java/build_java.sh: 2: ../java/build_java.sh: : not found
../java/build_java.sh: 36: ../java/build_java.sh: Syntax error: end of file unexpected (expecting "then")
CMakeFiles/compile_java.dir/build.make:60: recipe for target 'coccoc-tokenizer.jar' failed
make[2]: *** [coccoc-tokenizer.jar] Error 2
CMakeFiles/Makefile2:215: recipe for target 'CMakeFiles/compile_java.dir/all' failed
make[1]: *** [CMakeFiles/compile_java.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

I ran with the parameter: cmake -DBUILD_JAVA=1 ..

Hope you can help me ^^
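
The overwritten ": not found" line and the unexpected end-of-file error look like classic symptoms of either CRLF line endings in build_java.sh or the script being run by sh/dash instead of bash. Both fixes below are hedged guesses, not confirmed against this repo:

# Guess 1: strip Windows line endings from the script.
sed -i 's/\r$//' ../java/build_java.sh
# Guess 2: run it under bash explicitly.
bash ../java/build_java.sh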

ERROR ON ELASTICSEARCH 7.13.1

--------------- S U M M A R Y ------------

Command Line: -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=ALL-UNNAMED -XX:+UseG1GC -Djava.io.tmpdir=/tmp/elasticsearch-847774562637708686 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Des.cgroups.hierarchy.override=/ -Xms12g -Xmx12g -XX:MaxDirectMemorySize=6442450944 -XX:InitiatingHeapOccupancyPercent=30 -XX:G1ReservePercent=25 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -Des.distribution.flavor=default -Des.distribution.type=docker -Des.bundled_jdk=true org.elasticsearch.bootstrap.Elasticsearch -Ebootstrap.memory_lock=true -Enode.name=esnode1 -Ecluster.initial_master_nodes=10.10.2.1, 10.10.2.2 -Enode.data=true -Ediscovery.seed_hosts=10.10.2.1, 10.10.2.2, 10.10.2.3 -Ecluster.name=es-docker-cluster -Enode.master=true

Host: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 8 cores, 23G, CentOS Linux release 8.4.2105
Time: Thu Aug 5 16:31:14 2021 UTC elapsed time: 65.623954 seconds (0d 0h 1m 5s)

--------------- T H R E A D ---------------

Current thread (0x00007f1e7c028e80): JavaThread "elasticsearch[esnode1][write][T#7]" daemon [_thread_in_native, id=286, stack(0x00007f1d17afb000,0x00007f1d17bfc000)]

Stack: [0x00007f1d17afb000,0x00007f1d17bfc000], sp=0x00007f1d17bf93b0, free space=1016k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libc.so.6+0x86101] cfree+0x21

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j com.coccoc.Tokenizer.initialize(Ljava/lang/String;)I+0
j com.coccoc.Tokenizer.<init>(Ljava/lang/String;)V+6
j org.apache.lucene.analysis.vi.VietnameseTokenizerImpl.lambda$new$0(Lorg/elasticsearch/analysis/VietnameseConfig;)Lcom/coccoc/Tokenizer;+8
j org.apache.lucene.analysis.vi.VietnameseTokenizerImpl$$Lambda$5942+0x0000000801994660.run()Ljava/lang/Object;+4
J 2859 c2 java.security.AccessController.doPrivileged(Ljava/security/PrivilegedAction;)Ljava/lang/Object; java.base@16 (9 bytes) @ 0x00007f1f5bb8bfd0 [0x00007f1f5bb8bfa0+0x0000000000000030]
j org.apache.lucene.analysis.vi.VietnameseTokenizerImpl.<init>(Lorg/elasticsearch/analysis/VietnameseConfig;Ljava/io/Reader;)V+73
j org.apache.lucene.analysis.vi.VietnameseTokenizer.<init>(Lorg/elasticsearch/analysis/VietnameseConfig;)V+71
j org.elasticsearch.index.analysis.VietnameseTokenizerFactory.create()Lorg/apache/lucene/analysis/Tokenizer;+8
j org.elasticsearch.index.analysis.CustomAnalyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;+4
j org.apache.lucene.analysis.AnalyzerWrapper.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;+8
j org.apache.lucene.analysis.AnalyzerWrapper.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;+8
j org.apache.lucene.analysis.Analyzer.tokenStream(Ljava/lang/String;Ljava/lang/String;)Lorg/apache/lucene/analysis/TokenStream;+58
J 8816 c1 org.apache.lucene.document.Field.tokenStream(Lorg/apache/lucene/analysis/Analyzer;Lorg/apache/lucene/analysis/TokenStream;)Lorg/apache/lucene/analysis/TokenStream; (188 bytes) @ 0x00007f1f5511d944 [0x00007f1f5511abc0+0x0000000000002d84]
j org.apache.lucene.index.DefaultIndexingChain$PerField.invert(ILorg/apache/lucene/index/IndexableField;Z)V+91
j org.apache.lucene.index.DefaultIndexingChain.processField(ILorg/apache/lucene/index/IndexableField;JI)I+113
J 10620 c2 org.apache.lucene.index.DefaultIndexingChain.processDocument(ILjava/lang/Iterable;)V (181 bytes) @ 0x00007f1f5c12b094 [0x00007f1f5c12ae20+0x0000000000000274]
J 10633 c2 org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(Ljava/lang/Iterable;Lorg/apache/lucene/index/DocumentsWriterDeleteQueue$Node;Lorg/apache/lucene/index/DocumentsWriter$FlushNotifications;)J (180 bytes) @ 0x00007f1f5c132aec [0x00007f1f5c132980+0x000000000000016c]
J 10436 c1 org.apache.lucene.index.DocumentsWriter.updateDocuments(Ljava/lang/Iterable;Lorg/apache/lucene/index/DocumentsWriterDeleteQueue$Node;)J (280 bytes) @ 0x00007f1f54d4573c [0x00007f1f54d455a0+0x000000000000019c]
j org.apache.lucene.index.IndexWriter.updateDocuments(Lorg/apache/lucene/index/DocumentsWriterDeleteQueue$Node;Ljava/lang/Iterable;)J+13
J 10062 c1 org.elasticsearch.index.engine.InternalEngine.addDocs(Ljava/util/List;Lorg/apache/lucene/index/IndexWriter;)V (49 bytes) @ 0x00007f1f5496e12c [0x00007f1f5496d960+0x00000000000007cc]
j org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(Lorg/elasticsearch/index/engine/Engine$Index;Lorg/elasticsearch/index/engine/InternalEngine$IndexingStrategy;)Lorg/elasticsearch/index/engine/Engine$IndexResult;+272
j org.elasticsearch.index.engine.InternalEngine.index(Lorg/elasticsearch/index/engine/Engine$Index;)Lorg/elasticsearch/index/engine/Engine$IndexResult;+418
J 10965 c2 org.elasticsearch.index.shard.IndexShard.index(Lorg/elasticsearch/index/engine/Engine;Lorg/elasticsearch/index/engine/Engine$Index;)Lorg/elasticsearch/index/engine/Engine$IndexResult; (316 bytes) @ 0x00007f1f5c1e258c [0x00007f1f5c1e22e0+0x00000000000002ac]
j org.elasticsearch.index.shard.IndexShard.applyIndexOperation(Lorg/elasticsearch/index/engine/Engine;JJJLorg/elasticsearch/index/VersionType;JJJZLorg/elasticsearch/index/engine/Engine$Operation$Origin;Lorg/elasticsearch/index/mapper/SourceToParse;)Lorg/elasticsearch/index/engine/Engine$IndexResult;+230
j org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(JLorg/elasticsearch/index/VersionType;Lorg/elasticsearch/index/mapper/SourceToParse;JJJZ)Lorg/elasticsearch/index/engine/Engine$IndexResult;+49
j org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(Lorg/elasticsearch/action/bulk/BulkPrimaryExecutionContext;Lorg/elasticsearch/action/update/UpdateHelper;Ljava/util/function/LongSupplier;Lorg/elasticsearch/action/bulk/MappingUpdatePerformer;Ljava/util/function/Consumer;Lorg/elasticsearch/action/ActionListener;)Z+456
j org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun()V+45
J 11088 c1 org.elasticsearch.common.util.concurrent.AbstractRunnable.run()V (32 bytes) @ 0x00007f1f5530740c [0x00007f1f55307300+0x000000000000010c]
j org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(Lorg/elasticsearch/action/bulk/BulkShardRequest;Lorg/elasticsearch/index/shard/IndexShard;Lorg/elasticsearch/action/update/UpdateHelper;Ljava/util/function/LongSupplier;Lorg/elasticsearch/action/bulk/MappingUpdatePerformer;Ljava/util/function/Consumer;Lorg/elasticsearch/action/ActionListener;Lorg/elasticsearch/threadpool/ThreadPool;Ljava/lang/String;)V+21
j org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(Lorg/elasticsearch/action/bulk/BulkShardRequest;Lorg/elasticsearch/index/shard/IndexShard;Lorg/elasticsearch/action/ActionListener;)V+71
j org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(Lorg/elasticsearch/action/support/replication/ReplicatedWriteRequest;Lorg/elasticsearch/index/shard/IndexShard;Lorg/elasticsearch/action/ActionListener;)V+7
j org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun()V+16
j org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun()V+24
J 11088 c1 org.elasticsearch.common.util.concurrent.AbstractRunnable.run()V (32 bytes) @ 0x00007f1f5530740c [0x00007f1f55307300+0x000000000000010c]
j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+92 java.base@16
j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 java.base@16
j java.lang.Thread.run()V+11 java.base@16
v ~StubRoutines::call_stub

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0xfffffffffffffff7

Register to memory mapping:

RAX=0x0 is NULL
RBX=0x00007f1e4c00a620 points into unknown readable memory: 0xffffffffffffffff | ff ff ff ff ff ff ff ff
RCX=0x00007f1f71c90999: <offset 0x0000000000080999> in /lib64/libc.so.6 at 0x00007f1f71c10000
RDX=0x00007f1e4c000080 points into unknown readable memory: 0x00007f1e4c034220 | 20 42 03 4c 1e 7f 00 00
RSP=0x00007f1d17bf93b0 is pointing into the stack for thread: 0x00007f1e7c028e80
RBP=0x00007f1e9413a040 points into unknown readable memory: 0x00007f1e40009110 | 10 91 00 40 1e 7f 00 00
RSI=0x0 is NULL
RDI=0xffffffffffffffff is an unknown value
R8 =0x00007f1e5408319e points into unknown readable memory: 07 00
R9 =0x0000000000000007 is an unknown value
R10=0x0 is NULL
R11=0x0000000000000202 is an unknown value
R12=0x0 is NULL
R13=0x00007f1e54083ea0 points into unknown readable memory: 0x00000000fbad2488 | 88 24 ad fb 00 00 00 00
R14=0x0000000000000018 is an unknown value
R15=0x00007f1d17bf9460 is pointing into the stack for thread: 0x00007f1e7c028e80

misunderstanding about segment

I have some questions about segmentation:

  • Why does it consider " " a token? I think that is meaningless.
  • With the segment method, case is preserved but punctuation is removed.
  • With the segment_original method, the text becomes case-insensitive but punctuation is kept.
  • A " " inside a token cannot be changed to "_" (space_positions is empty).

Can you explain? Thanks!

Re: keeping upper/lower case in the Java wrapper

Thank you, Cốc Cốc, for developing a word segmentation toolkit with high accuracy and very fast speed.

While trying it out, I noticed that the Java build lowercases all of the text. I looked through the Java code but could not find where this happens or how to change it.

$ LD_LIBRARY_PATH=build java -cp build/coccoc-tokenizer.jar com.coccoc.Tokenizer "một câu văn tiếng Việt"
một	 	câu_văn	 	tiếng_việt	.

Looking forward to your help!

use lib for python

I have finished installing this part:

$ mkdir build && cd build
$ cmake -DBUILD_PYTHON=1 ..
# make install

but I still cannot use:

from CocCocTokenizer import PyTokenizer

Error when building the project on Ubuntu

I get this error when running make install on Ubuntu 20.04:

CMake Error at cmake_install.cmake:105 (file):
file INSTALL cannot make directory "/usr/local/share/tokenizer/dicts_text":
No such file or directory.

make: *** [Makefile:74: install] Error 1
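
Two hedged guesses, since the message says the install step cannot create a directory under /usr/local: the install may lack privileges there, or the target tree may need to exist before the file INSTALL step runs. Neither is confirmed against this CMake setup:

sudo mkdir -p /usr/local/share/tokenizer/dicts_text
sudo make install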

Need help installing the library for Python in a conda environment

Dear all,
I have some issues when trying to install the CocCocTokenizer library for Python 3.8 in a conda environment.
I've tried:

  • activating the conda environment before building with CMake:

git clone git@github.com:coccoc/coccoc-tokenizer.git
cd coccoc-tokenizer/
mkdir build && cd build
conda activate py3.8
sudo cmake -DBUILD_JAVA=1 -DBUILD_PYTHON=1 ..
sudo make install

  • using the -DCMAKE_INSTALL_PREFIX flag and pointing it at the site-packages directory:

sudo cmake -DBUILD_JAVA=1 -DBUILD_PYTHON=1 -DCMAKE_INSTALL_PREFIX=/home/huyqnguyen/anaconda3/envs/py3.8 ..

  • inserting conda activate py3.8 into the /python/build_python.sh file

but none of those solutions let me import CocCocTokenizer in Python:

from CocCocTokenizer import PyTokenizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'PyTokenizer' from 'CocCocTokenizer' (unknown location)
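
One more thing worth trying, as a sketch: drive the extension build directly with the environment's own interpreter, so the module lands in that environment's site-packages. The python/ directory and setup.py layout are assumptions here, inferred from the /python/build_python.sh script mentioned above:

# Sketch: build and install with the conda env's interpreter active.
conda activate py3.8
cd coccoc-tokenizer/python   # assumed location of setup.py, per build_python.sh
python setup.py install
python -c "from CocCocTokenizer import PyTokenizer"   # verify the import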

Error openning file, alphabetic

Hi everyone,

The first time I installed coccoc-tokenizer it succeeded and everything went well.
When I built it a second time and ran make install, then ran
/usr/local/bin/tokenizer "Cộng hòa Xã hội chủ nghĩa Việt Nam"

it produced the error
Error openning file, alphabetic

I deleted everything by hand to rebuild, but the error persists.
OS: Ubuntu Server 20.04
I haven't found a fix yet; if anyone knows how to remove coccoc-tokenizer completely, please show me.

Thanks, everyone.
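
A hedged cleanup sketch, using only paths that appear in this page's logs (/usr/local/bin/tokenizer and /usr/local/share/tokenizer): remove the installed artifacts and the stale build directory, then rebuild from scratch. Whether make install places anything else is an assumption:

sudo rm -rf /usr/local/share/tokenizer
sudo rm -f /usr/local/bin/tokenizer
cd coccoc-tokenizer && rm -rf build && mkdir build && cd build
cmake .. && sudo make install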

Adding new Vietnamese compound words

Hello!

Could you please explain how to add new Vietnamese compound words and rebuild the lib for my own private use?

Thanks a lot!
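
A hedged sketch of the likely workflow, inferred from the build logs elsewhere on this page: the compile_dict target generates multiterm_trie.dump and friends from dictionary sources, and the install copies a dicts_text directory, so adding entries probably means editing those text sources and rebuilding. The directory name in the source tree is an assumption matching the install path:

# 1) add the new compound word to the dictionary source files
#    (assumed to live under the repo's dicts_text directory,
#    matching the /usr/local/share/tokenizer/dicts_text install path)
# 2) rebuild so compile_dict regenerates the .dump files:
cd coccoc-tokenizer/build
cmake .. && sudo make install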

Can the tokenizer produce tokens in their original case?

I've been using the coccoc-tokenizer Java bindings for the Vietnamese Elasticsearch analysis plugin, and we want to keep tokens in their original case. For example, the text "Cộng hòa Xã hội chủ nghĩa Việt Nam" is tokenized as cộng_hòa xã_hội chủ_nghĩa việt_nam,
while what we expect is Cộng_hòa Xã_hội chủ_nghĩa Việt_Nam.

Can we somehow make the tokenizer produce tokens in their original case?
