
coccoc-tokenizer's People

Contributors

0xflotus, anhducle98, bachan, duydo, thphuong, tranhieudev23, txdat


coccoc-tokenizer's Issues

Error when installing the Python version on Mac

I got this error; please help me fix it:

running install
running build
running build_ext
skipping 'CocCocTokenizer.cpp' Cython extension (up-to-date)
building 'CocCocTokenizer' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/lap00986/anaconda3/include -arch x86_64 -I/Users/lap00986/anaconda3/include -arch x86_64 -I. -I/Users/lap00986/Documents/product-matching/env/include -I/Users/lap00986/anaconda3/include/python3.7m -c CocCocTokenizer.cpp -o build/temp.macosx-10.7-x86_64-3.7/CocCocTokenizer.o -Wno-cpp -Wno-unused-function -O2 -march=native
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
CocCocTokenizer.cpp:610:10: fatal error: 'ios' file not found
#include "ios"
^~~~~
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1
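
The warning in the log already points at the likely cause: clang on macOS cannot find the GNU libstdc++ headers this Anaconda-configured build is asking for. A hedged, untested sketch of a workaround: force libc++, as the warning itself suggests, then re-run the setuptools build that produced the log above.

# Untested workaround sketch: use libc++ as the compiler warning suggests.
export CFLAGS="-stdlib=libc++"
export CXXFLAGS="-stdlib=libc++"
python setup.py install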

Complete guide to installing the C++ Tokenizer & ES 7.12.1 Vietnamese Analysis plugin

*** Environment: Ubuntu 18.04 (or whatever). You must install the Java JDK, not just the JRE, because javac is needed for the C++ Tokenizer. Write your own .yml files following the instructions in the respective repos. The same goes for Docker or a VM; it's as simple as this.

sudo su
apt-get update -y
apt-get upgrade -y
apt-get install build-essential cmake unzip pkg-config gcc-7 g++-7 -y
apt-get install wget curl nano git default-jdk maven -y

cd /

*** Download Elasticsearch 7.12.1
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.12.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.12.1-linux-x86_64.tar.gz
mv elasticsearch-7.12.1-linux-x86_64 /es

** Download the ES Vietnamese Analysis plugin
git clone https://github.com/duydo/elasticsearch-analysis-vietnamese.git
cd elasticsearch-analysis-vietnamese
mvn package

** Download the C++ Tokenizer
git clone https://github.com/coccoc/coccoc-tokenizer.git
cd coccoc-tokenizer
mkdir build
cd build
cmake -DBUILD_JAVA=1 ..
make install

** Install the plugin:
cd /es
echo "Y" | ./bin/elasticsearch-plugin install file:///elasticsearch-analysis-vietnamese/target/releases/elasticsearch-analysis-vietnamese-7.12.1.zip

*** Preparation
groupadd -g 999 nqrt && useradd -r -u 999 -g nqrt nqrt
usermod -aG sudo nqrt
chown nqrt:nqrt /es -R
sysctl -w vm.max_map_count=262144

su nqrt

** Run
export ES_JAVA_OPTS="-Xms2048m -Xmx2048m -Djava.library.path=/usr/local/lib"
cd /es
./bin/elasticsearch
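
*** Smoke test (a hedged addition to this guide: the analyzer name vi_analyzer is assumed from the duydo plugin's conventions, not stated here; adjust to your actual configuration)
curl -s -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d '{"analyzer": "vi_analyzer", "text": "Cộng hòa Xã hội chủ nghĩa Việt Nam"}'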

Missed tokenization of entity names

Thank you for open-sourcing one of the best and fastest Vietnamese tokenizers 💯

Today, while playing around with the CocCocTokenizer Python binding, I found that it sometimes misses tokenizing entity names.

For example:

>>> T.word_tokenize("Những lần Lam Trường - Đan Trường tái ngộ chung khung hình ở U50")
['Những', 'lần', 'Lam', 'Trường', '-', 'Đan', 'Trường', 'tái_ngộ', 'chung', 'khung_hình', 'ở', 'U50']
# Expected result : ['Những', 'lần', 'Lam_Trường', '-', 'Đan_Trường', 'tái_ngộ', 'chung', 'khung_hình', 'ở', 'U50']

What can I do to help the tokenizer perform better in these cases?
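
For reference, the same behavior can be reproduced with the command-line tool at the install path used in other issues on this page; if the names simply are not in the compiled dictionary, adding them to the dictionary sources and rebuilding (see the "Adding new Vietnamese compound words" question below) is presumably the way in.

# Reproducing with the installed CLI (path as used elsewhere on this page):
/usr/local/bin/tokenizer "Những lần Lam Trường - Đan Trường tái ngộ chung khung hình ở U50"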

Python with coccoc_tokenizer

In the README file, I found this line:

from CocCocTokenizer import PyTokenizer

but how can I install CocCocTokenizer? (I tried copying and pasting the example and got a "No module named 'CocCocTokenizer'" error.)
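
For what it's worth, the build steps that other issues on this page use enable the Python binding at CMake time. A minimal sketch, assuming a standard system-wide install:

git clone https://github.com/coccoc/coccoc-tokenizer.git
cd coccoc-tokenizer && mkdir build && cd build
cmake -DBUILD_PYTHON=1 ..
sudo make install
python -c "from CocCocTokenizer import PyTokenizer"   # should now import cleanly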

build errors on Ubuntu 18.04

I followed the build instructions in README.md and encountered this error:

In file included from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie/syllable_da_trie.hpp:10,
                 from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie.hpp:5,
                 from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/tokenizer.hpp:10,
                 from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/utils/tokenizer.cpp:3:
/home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie/da_trie.hpp: In member function ‘int DATrie<HashNode, Node>::read_from_file(const string&) [with HashNode = MultitermHashTrieNode; Node = MultitermDATrieNode]’:
/home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie/da_trie.hpp:237:8: error: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Werror=unused-result]

fread(&alphabet_size, sizeof(alphabet_size), 1, in_file);


This happens several times in the tokenizer.hpp file.
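
Since the failure comes from -Werror promoting the unused-result warning, one hedged workaround (rather than patching each fread call in da_trie.hpp to check its return value) is to demote that warning for the build; whether the project's CMake setup honors extra flags passed this way is an assumption.

# Hedged workaround sketch: stop -Wunused-result from being treated as an error.
cmake -DCMAKE_CXX_FLAGS="-Wno-error=unused-result" ..
make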

Error when running the make install command

Hi, when installing the first step on Ubuntu LTS, running the make install command fails as follows:
Scanning dependencies of target dict_compiler
[ 12%] Building CXX object CMakeFiles/dict_compiler.dir/utils/dict_compiler.cpp.o
[ 25%] Linking CXX executable dict_compiler
[ 25%] Built target dict_compiler
Scanning dependencies of target vn_lang_tool
[ 37%] Building CXX object CMakeFiles/vn_lang_tool.dir/utils/vn_lang_tool.cpp.o
[ 50%] Linking CXX executable vn_lang_tool
[ 50%] Built target vn_lang_tool
Scanning dependencies of target tokenizer
[ 62%] Building CXX object CMakeFiles/tokenizer.dir/utils/tokenizer.cpp.o
[ 75%] Linking CXX executable tokenizer
[ 75%] Built target tokenizer
Scanning dependencies of target compile_dict
[ 87%] Generating multiterm_trie.dump, syllable_trie.dump, nontone_pair_freq_map.dump
[ 87%] Built target compile_dict
Scanning dependencies of target compile_java
[100%] Generating coccoc-tokenizer.jar
../java/build_java.sh: 2: ../java/build_java.sh: : not found
../java/build_java.sh: 36: ../java/build_java.sh: Syntax error: end of file unexpected (expecting "then")
CMakeFiles/compile_java.dir/build.make:60: recipe for target 'coccoc-tokenizer.jar' failed
make[2]: *** [coccoc-tokenizer.jar] Error 2
CMakeFiles/Makefile2:215: recipe for target 'CMakeFiles/compile_java.dir/all' failed
make[1]: *** [CMakeFiles/compile_java.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2

I ran with the parameter: cmake -DBUILD_JAVA=1 ..

Hope you can help me ^^
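
The overwritten ": not found" line and the unexpected end-of-file error look like classic symptoms of either CRLF line endings in build_java.sh or the script being run by sh/dash instead of bash. Both fixes below are hedged guesses, not confirmed against this repo:

# Guess 1: strip Windows line endings from the script.
sed -i 's/\r$//' ../java/build_java.sh
# Guess 2: run it under bash explicitly.
bash ../java/build_java.sh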

ERROR ON ELASTICSEARCH 7.13.1

--------------- S U M M A R Y ------------

Command Line: -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=ALL-UNNAMED -XX:+UseG1GC -Djava.io.tmpdir=/tmp/elasticsearch-847774562637708686 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Des.cgroups.hierarchy.override=/ -Xms12g -Xmx12g -XX:MaxDirectMemorySize=6442450944 -XX:InitiatingHeapOccupancyPercent=30 -XX:G1ReservePercent=25 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -Des.distribution.flavor=default -Des.distribution.type=docker -Des.bundled_jdk=true org.elasticsearch.bootstrap.Elasticsearch -Ebootstrap.memory_lock=true -Enode.name=esnode1 -Ecluster.initial_master_nodes=10.10.2.1, 10.10.2.2 -Enode.data=true -Ediscovery.seed_hosts=10.10.2.1, 10.10.2.2, 10.10.2.3 -Ecluster.name=es-docker-cluster -Enode.master=true

Host: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 8 cores, 23G, CentOS Linux release 8.4.2105
Time: Thu Aug 5 16:31:14 2021 UTC elapsed time: 65.623954 seconds (0d 0h 1m 5s)

--------------- T H R E A D ---------------

Current thread (0x00007f1e7c028e80): JavaThread "elasticsearch[esnode1][write][T#7]" daemon [_thread_in_native, id=286, stack(0x00007f1d17afb000,0x00007f1d17bfc000)]

Stack: [0x00007f1d17afb000,0x00007f1d17bfc000], sp=0x00007f1d17bf93b0, free space=1016k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libc.so.6+0x86101] cfree+0x21

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j com.coccoc.Tokenizer.initialize(Ljava/lang/String;)I+0
j com.coccoc.Tokenizer.<init>(Ljava/lang/String;)V+6
j org.apache.lucene.analysis.vi.VietnameseTokenizerImpl.lambda$new$0(Lorg/elasticsearch/analysis/VietnameseConfig;)Lcom/coccoc/Tokenizer;+8
j org.apache.lucene.analysis.vi.VietnameseTokenizerImpl$$Lambda$5942+0x0000000801994660.run()Ljava/lang/Object;+4
J 2859 c2 java.security.AccessController.doPrivileged(Ljava/security/PrivilegedAction;)Ljava/lang/Object; java.base@16 (9 bytes) @ 0x00007f1f5bb8bfd0 [0x00007f1f5bb8bfa0+0x0000000000000030]
j org.apache.lucene.analysis.vi.VietnameseTokenizerImpl.<init>(Lorg/elasticsearch/analysis/VietnameseConfig;Ljava/io/Reader;)V+73
j org.apache.lucene.analysis.vi.VietnameseTokenizer.<init>(Lorg/elasticsearch/analysis/VietnameseConfig;)V+71
j org.elasticsearch.index.analysis.VietnameseTokenizerFactory.create()Lorg/apache/lucene/analysis/Tokenizer;+8
j org.elasticsearch.index.analysis.CustomAnalyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;+4
j org.apache.lucene.analysis.AnalyzerWrapper.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;+8
j org.apache.lucene.analysis.AnalyzerWrapper.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;+8
j org.apache.lucene.analysis.Analyzer.tokenStream(Ljava/lang/String;Ljava/lang/String;)Lorg/apache/lucene/analysis/TokenStream;+58
J 8816 c1 org.apache.lucene.document.Field.tokenStream(Lorg/apache/lucene/analysis/Analyzer;Lorg/apache/lucene/analysis/TokenStream;)Lorg/apache/lucene/analysis/TokenStream; (188 bytes) @ 0x00007f1f5511d944 [0x00007f1f5511abc0+0x0000000000002d84]
j org.apache.lucene.index.DefaultIndexingChain$PerField.invert(ILorg/apache/lucene/index/IndexableField;Z)V+91
j org.apache.lucene.index.DefaultIndexingChain.processField(ILorg/apache/lucene/index/IndexableField;JI)I+113
J 10620 c2 org.apache.lucene.index.DefaultIndexingChain.processDocument(ILjava/lang/Iterable;)V (181 bytes) @ 0x00007f1f5c12b094 [0x00007f1f5c12ae20+0x0000000000000274]
J 10633 c2 org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(Ljava/lang/Iterable;Lorg/apache/lucene/index/DocumentsWriterDeleteQueue$Node;Lorg/apache/lucene/index/DocumentsWriter$FlushNotifications;)J (180 bytes) @ 0x00007f1f5c132aec [0x00007f1f5c132980+0x000000000000016c]
J 10436 c1 org.apache.lucene.index.DocumentsWriter.updateDocuments(Ljava/lang/Iterable;Lorg/apache/lucene/index/DocumentsWriterDeleteQueue$Node;)J (280 bytes) @ 0x00007f1f54d4573c [0x00007f1f54d455a0+0x000000000000019c]
j org.apache.lucene.index.IndexWriter.updateDocuments(Lorg/apache/lucene/index/DocumentsWriterDeleteQueue$Node;Ljava/lang/Iterable;)J+13
J 10062 c1 org.elasticsearch.index.engine.InternalEngine.addDocs(Ljava/util/List;Lorg/apache/lucene/index/IndexWriter;)V (49 bytes) @ 0x00007f1f5496e12c [0x00007f1f5496d960+0x00000000000007cc]
j org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(Lorg/elasticsearch/index/engine/Engine$Index;Lorg/elasticsearch/index/engine/InternalEngine$IndexingStrategy;)Lorg/elasticsearch/index/engine/Engine$IndexResult;+272
j org.elasticsearch.index.engine.InternalEngine.index(Lorg/elasticsearch/index/engine/Engine$Index;)Lorg/elasticsearch/index/engine/Engine$IndexResult;+418
J 10965 c2 org.elasticsearch.index.shard.IndexShard.index(Lorg/elasticsearch/index/engine/Engine;Lorg/elasticsearch/index/engine/Engine$Index;)Lorg/elasticsearch/index/engine/Engine$IndexResult; (316 bytes) @ 0x00007f1f5c1e258c [0x00007f1f5c1e22e0+0x00000000000002ac]
j org.elasticsearch.index.shard.IndexShard.applyIndexOperation(Lorg/elasticsearch/index/engine/Engine;JJJLorg/elasticsearch/index/VersionType;JJJZLorg/elasticsearch/index/engine/Engine$Operation$Origin;Lorg/elasticsearch/index/mapper/SourceToParse;)Lorg/elasticsearch/index/engine/Engine$IndexResult;+230
j org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(JLorg/elasticsearch/index/VersionType;Lorg/elasticsearch/index/mapper/SourceToParse;JJJZ)Lorg/elasticsearch/index/engine/Engine$IndexResult;+49
j org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(Lorg/elasticsearch/action/bulk/BulkPrimaryExecutionContext;Lorg/elasticsearch/action/update/UpdateHelper;Ljava/util/function/LongSupplier;Lorg/elasticsearch/action/bulk/MappingUpdatePerformer;Ljava/util/function/Consumer;Lorg/elasticsearch/action/ActionListener;)Z+456
j org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun()V+45
J 11088 c1 org.elasticsearch.common.util.concurrent.AbstractRunnable.run()V (32 bytes) @ 0x00007f1f5530740c [0x00007f1f55307300+0x000000000000010c]
j org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(Lorg/elasticsearch/action/bulk/BulkShardRequest;Lorg/elasticsearch/index/shard/IndexShard;Lorg/elasticsearch/action/update/UpdateHelper;Ljava/util/function/LongSupplier;Lorg/elasticsearch/action/bulk/MappingUpdatePerformer;Ljava/util/function/Consumer;Lorg/elasticsearch/action/ActionListener;Lorg/elasticsearch/threadpool/ThreadPool;Ljava/lang/String;)V+21
j org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(Lorg/elasticsearch/action/bulk/BulkShardRequest;Lorg/elasticsearch/index/shard/IndexShard;Lorg/elasticsearch/action/ActionListener;)V+71
j org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(Lorg/elasticsearch/action/support/replication/ReplicatedWriteRequest;Lorg/elasticsearch/index/shard/IndexShard;Lorg/elasticsearch/action/ActionListener;)V+7
j org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun()V+16
j org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun()V+24
J 11088 c1 org.elasticsearch.common.util.concurrent.AbstractRunnable.run()V (32 bytes) @ 0x00007f1f5530740c [0x00007f1f55307300+0x000000000000010c]
j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+92 java.base@16
j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 java.base@16
j java.lang.Thread.run()V+11 java.base@16
v ~StubRoutines::call_stub

siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0xfffffffffffffff7

Register to memory mapping:

RAX=0x0 is NULL
RBX=0x00007f1e4c00a620 points into unknown readable memory: 0xffffffffffffffff | ff ff ff ff ff ff ff ff
RCX=0x00007f1f71c90999: <offset 0x0000000000080999> in /lib64/libc.so.6 at 0x00007f1f71c10000
RDX=0x00007f1e4c000080 points into unknown readable memory: 0x00007f1e4c034220 | 20 42 03 4c 1e 7f 00 00
RSP=0x00007f1d17bf93b0 is pointing into the stack for thread: 0x00007f1e7c028e80
RBP=0x00007f1e9413a040 points into unknown readable memory: 0x00007f1e40009110 | 10 91 00 40 1e 7f 00 00
RSI=0x0 is NULL
RDI=0xffffffffffffffff is an unknown value
R8 =0x00007f1e5408319e points into unknown readable memory: 07 00
R9 =0x0000000000000007 is an unknown value
R10=0x0 is NULL
R11=0x0000000000000202 is an unknown value
R12=0x0 is NULL
R13=0x00007f1e54083ea0 points into unknown readable memory: 0x00000000fbad2488 | 88 24 ad fb 00 00 00 00
R14=0x0000000000000018 is an unknown value
R15=0x00007f1d17bf9460 is pointing into the stack for thread: 0x00007f1e7c028e80

misunderstanding about segment

I have some questions about segmentation:

  • Why does it consider " " a token? I think that is meaningless.
  • With the segment method, case is preserved but punctuation is removed.
  • With the segment_original method, the text becomes case-insensitive but punctuation is kept.
  • A " " inside a token cannot be changed to "_" (space_positions is empty).

Can you explain? Thanks!

Re: keeping upper/lower case in the Java wrapper

Thank you, Cốc Cốc, for developing a word segmentation toolkit with high accuracy and very fast speed.

While trying it out, I noticed that the Java build lowercases all of the text. I looked through the Java code but could not find where this happens or how to change it.

$ LD_LIBRARY_PATH=build java -cp build/coccoc-tokenizer.jar com.coccoc.Tokenizer "một câu văn tiếng Việt"
một	 	câu_văn	 	tiếng_việt	.

Looking forward to your help!

use lib for python

I have finished installing this part:

$ mkdir build && cd build
$ cmake -DBUILD_PYTHON=1 ..
# make install

but I still cannot use:

from CocCocTokenizer import PyTokenizer

Error when building the project on Ubuntu

I get this error when running make install on Ubuntu 20.04:

CMake Error at cmake_install.cmake:105 (file):
file INSTALL cannot make directory "/usr/local/share/tokenizer/dicts_text":
No such file or directory.

make: *** [Makefile:74: install] Error 1
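
Two hedged guesses, since the message says the install step cannot create a directory under /usr/local: the install may lack privileges there, or the target tree may need to exist before the file INSTALL step runs. Neither is confirmed against this CMake setup:

sudo mkdir -p /usr/local/share/tokenizer/dicts_text
sudo make install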

Need help installing the library for Python in a conda environment

Dear all,
I have some issues when trying to install the CocCocTokenizer library for Python 3.8 in a conda environment.
I've tried:

  • activating the conda environment before building with CMake:

git clone git@github.com:coccoc/coccoc-tokenizer.git
cd coccoc-tokenizer/
mkdir build && cd build
conda activate py3.8
sudo cmake -DBUILD_JAVA=1 -DBUILD_PYTHON=1 ..
sudo make install

  • using the -DCMAKE_INSTALL_PREFIX flag and pointing it at the site-packages directory:

sudo cmake -DBUILD_JAVA=1 -DBUILD_PYTHON=1 -DCMAKE_INSTALL_PREFIX=/home/huyqnguyen/anaconda3/envs/py3.8 ..

  • inserting conda activate py3.8 into the /python/build_python.sh file

but none of those solutions let me import CocCocTokenizer in Python:

from CocCocTokenizer import PyTokenizer
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'PyTokenizer' from 'CocCocTokenizer' (unknown location)
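
One more thing worth trying, as a sketch: drive the extension build directly with the environment's own interpreter, so the module lands in that environment's site-packages. The python/ directory and setup.py layout are assumptions here, inferred from the /python/build_python.sh script mentioned above:

# Sketch: build and install with the conda env's interpreter active.
conda activate py3.8
cd coccoc-tokenizer/python   # assumed location of setup.py, per build_python.sh
python setup.py install
python -c "from CocCocTokenizer import PyTokenizer"   # verify the import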

Error openning file, alphabetic

Hi everyone,

The first time I installed coccoc-tokenizer it succeeded and everything went well.
When I built it a second time and ran make install, then ran
/usr/local/bin/tokenizer "Cộng hòa Xã hội chủ nghĩa Việt Nam"

it produced the error
Error openning file, alphabetic

I deleted everything by hand to rebuild, but the error persists.
OS: Ubuntu Server 20.04
I haven't found a fix yet; if anyone knows how to remove coccoc-tokenizer completely, please show me.

Thanks, everyone.
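
A hedged cleanup sketch, using only paths that appear in this page's logs (/usr/local/bin/tokenizer and /usr/local/share/tokenizer): remove the installed artifacts and the stale build directory, then rebuild from scratch. Whether make install places anything else is an assumption:

sudo rm -rf /usr/local/share/tokenizer
sudo rm -f /usr/local/bin/tokenizer
cd coccoc-tokenizer && rm -rf build && mkdir build && cd build
cmake .. && sudo make install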

Adding new Vietnamese compound words

Hello!

Could you please explain how to add new Vietnamese compound words and rebuild the lib for my own private use?

Thanks a lot!
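
A hedged sketch of the likely workflow, inferred from the build logs elsewhere on this page: the compile_dict target generates multiterm_trie.dump and friends from dictionary sources, and the install copies a dicts_text directory, so adding entries probably means editing those text sources and rebuilding. The directory name in the source tree is an assumption matching the install path:

# 1) add the new compound word to the dictionary source files
#    (assumed to live under the repo's dicts_text directory,
#    matching the /usr/local/share/tokenizer/dicts_text install path)
# 2) rebuild so compile_dict regenerates the .dump files:
cd coccoc-tokenizer/build
cmake .. && sudo make install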

Can the tokenizer produce tokens in their original case?

I've been using the coccoc-tokenizer Java bindings for the Vietnamese Elasticsearch analysis plugin, and we want to keep tokens in their original case. For example, the text "Cộng hòa Xã hội chủ nghĩa Việt Nam" is tokenized as cộng_hòa xã_hội chủ_nghĩa việt_nam,
while what we expect is Cộng_hòa Xã_hội chủ_nghĩa Việt_Nam.

Can we somehow make the tokenizer produce tokens in their original case?
