coccoc / coccoc-tokenizer
High performance tokenizer for the Vietnamese language
License: GNU Lesser General Public License v3.0
I got this error. Please help me to fix this:
running install
running build
running build_ext
skipping 'CocCocTokenizer.cpp' Cython extension (up-to-date)
building 'CocCocTokenizer' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/lap00986/anaconda3/include -arch x86_64 -I/Users/lap00986/anaconda3/include -arch x86_64 -I. -I/Users/lap00986/Documents/product-matching/env/include -I/Users/lap00986/anaconda3/include/python3.7m -c CocCocTokenizer.cpp -o build/temp.macosx-10.7-x86_64-3.7/CocCocTokenizer.o -Wno-cpp -Wno-unused-function -O2 -march=native
warning: include path for stdlibc++ headers not found; pass '-stdlib=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
CocCocTokenizer.cpp:610:10: fatal error: 'ios' file not found
#include "ios"
^~~~~
1 warning and 1 error generated.
error: command 'gcc' failed with exit status 1
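A likely cause is that on macOS the Anaconda-provided `gcc` front end cannot locate the C++ standard library headers, which is exactly what the `-Wstdlibcxx-not-found` warning says. One possible workaround, assuming the Xcode command-line tools are installed, is to force the build to use the system clang toolchain with libc++:

```shell
# Workaround sketch (assumption: Xcode command-line tools provide clang/clang++).
# Point the build at clang and tell it to use libc++ headers, then rebuild.
export CC=clang
export CXX=clang++
export CFLAGS="-stdlib=libc++"
export CXXFLAGS="-stdlib=libc++"
python setup.py install
```

These environment variables are only a sketch; depending on how the extension's setup script composes flags, appending `-stdlib=libc++` to `extra_compile_args` in setup.py may be needed instead.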
*** On an Ubuntu 18.04 environment (or similar), you must install the Java JDK, not just the JRE, because javac is needed for the C++ tokenizer. Write the .yml files yourself following each repo's instructions. The same goes for Docker or a VM; it's as simple as this.
sudo su
apt-get update -y
apt-get upgrade -y
apt-get install build-essential cmake unzip pkg-config gcc-7 g++-7 -y
apt-get install wget curl nano git default-jdk maven -y
cd /
*** Download Elasticsearch 7.12.1
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.12.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.12.1-linux-x86_64.tar.gz
mv elasticsearch-7.12.1-linux-x86_64 /es
** Download the Vietnamese analysis plugin for Elasticsearch
git clone https://github.com/duydo/elasticsearch-analysis-vietnamese.git
cd elasticsearch-analysis-vietnamese
mvn package
** Download the C++ tokenizer
git clone https://github.com/coccoc/coccoc-tokenizer.git
cd coccoc-tokenizer
mkdir build
cd build
cmake -DBUILD_JAVA=1 ..
make install
** Install the plugin:
cd /es
echo "Y" | ./bin/elasticsearch-plugin install file:///elasticsearch-analysis-vietnamese/target/releases/elasticsearch-analysis-vietnamese-7.12.1.zip
*** Preparation
groupadd -g 999 nqrt && useradd -r -u 999 -g nqrt nqrt
usermod -aG sudo nqrt
chown nqrt:nqrt /es -R
sysctl -w vm.max_map_count=262144
su nqrt
** Run
export ES_JAVA_OPTS="-Xms2048m -Xmx2048m -Djava.library.path=/usr/local/lib"
cd /es
./bin/elasticsearch
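Once Elasticsearch is up, the install can be sanity-checked through the `_analyze` API. A sketch follows; the tokenizer name `vi_tokenizer` is an assumption based on the duydo/elasticsearch-analysis-vietnamese plugin, and the node is assumed to listen on localhost:9200:

```shell
# Assumes Elasticsearch is running on localhost:9200 with the plugin loaded;
# "vi_tokenizer" is the name the Vietnamese plugin is believed to register (assumption)
curl -X GET "localhost:9200/_analyze?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"tokenizer": "vi_tokenizer", "text": "Cộng hòa Xã hội chủ nghĩa Việt Nam"}'
```

If the plugin and the native library are wired up correctly, the response should list multi-syllable tokens rather than one token per syllable.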
Thank you for open-sourcing one of the best and fastest Vietnamese tokenizers 💯
Today, while playing around with the CocCocTokenizer Python binding, I found that it sometimes mis-tokenizes entity names.
For example:
>>> T.word_tokenize("Những lần Lam Trường - Đan Trường tái ngộ chung khung hình ở U50")
['Những', 'lần', 'Lam', 'Trường', '-', 'Đan', 'Trường', 'tái_ngộ', 'chung', 'khung_hình', 'ở', 'U50']
# Expected result : ['Những', 'lần', 'Lam_Trường', '-', 'Đan_Trường', 'tái_ngộ', 'chung', 'khung_hình', 'ở', 'U50']
What can I do to help the tokenizer perform better in these cases?
In the README file, I found this line:
from CocCocTokenizer import PyTokenizer
but how can I install CocCocTokenizer? (I tried copying and pasting the example and got a "No module named 'CocCocTokenizer'" error.)
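For reference, the Python binding is not installed from PyPI but built from the C++ sources. The steps below follow the repository's README as I understand it (default install paths assumed):

```shell
# Build and install the C++ tokenizer together with its Python binding
git clone https://github.com/coccoc/coccoc-tokenizer.git
cd coccoc-tokenizer
mkdir build && cd build
cmake -DBUILD_PYTHON=1 ..
sudo make install   # installs the dictionaries and the CocCocTokenizer module
```

Note that `make install` targets whichever Python interpreter CMake discovered, so in a virtualenv or conda environment the module may land elsewhere than the interpreter you run.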
I followed the instructions for building this tool from README.md and encountered this error:
In file included from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie/syllable_da_trie.hpp:10,
from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie.hpp:5,
from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/tokenizer.hpp:10,
from /home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/utils/tokenizer.cpp:3:
/home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie/da_trie.hpp: In member function ‘int DATrie<HashNode, Node>::read_from_file(const string&) [with HashNode = MultitermHashTrieNode; Node = MultitermDATrieNode]’:
/home/extreme45nm/main-projects/nlp-starter/coccoc-tokenizer/tokenizer/auxiliary/trie/da_trie.hpp:237:8: error: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Werror=unused-result]
fread(&alphabet_size, sizeof(alphabet_size), 1, in_file);
This happened several times in the tokenizer.hpp file.
Hi, when I installed the first step on Ubuntu LTS and ran the make install command, I got the following error:
Scanning dependencies of target dict_compiler
[ 12%] Building CXX object CMakeFiles/dict_compiler.dir/utils/dict_compiler.cpp.o
[ 25%] Linking CXX executable dict_compiler
[ 25%] Built target dict_compiler
Scanning dependencies of target vn_lang_tool
[ 37%] Building CXX object CMakeFiles/vn_lang_tool.dir/utils/vn_lang_tool.cpp.o
[ 50%] Linking CXX executable vn_lang_tool
[ 50%] Built target vn_lang_tool
Scanning dependencies of target tokenizer
[ 62%] Building CXX object CMakeFiles/tokenizer.dir/utils/tokenizer.cpp.o
[ 75%] Linking CXX executable tokenizer
[ 75%] Built target tokenizer
Scanning dependencies of target compile_dict
[ 87%] Generating multiterm_trie.dump, syllable_trie.dump, nontone_pair_freq_map.dump
[ 87%] Built target compile_dict
Scanning dependencies of target compile_java
[100%] Generating coccoc-tokenizer.jar
../java/build_java.sh: 2: ../java/build_java.sh: : not found
../java/build_java.sh: 36: ../java/build_java.sh: Syntax error: end of file unexpected (expecting "then")
CMakeFiles/compile_java.dir/build.make:60: recipe for target 'coccoc-tokenizer.jar' failed
make[2]: *** [coccoc-tokenizer.jar] Error 2
CMakeFiles/Makefile2:215: recipe for target 'CMakeFiles/compile_java.dir/all' failed
make[1]: *** [CMakeFiles/compile_java.dir/all] Error 2
Makefile:129: recipe for target 'all' failed
make: *** [all] Error 2
I ran with the parameter: cmake -DBUILD_JAVA=1 ..
I hope you can help me ^^
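The message `Syntax error: end of file unexpected (expecting "then")` is typical of a bash-specific script being executed by a plain POSIX `sh` (on Ubuntu, `/bin/sh` is dash). Assuming that is the cause here, running the script with bash explicitly may fix it. The snippet below only demonstrates the symptom with a bash-only construct:

```shell
# A bash-only construct: works under bash, is a syntax error under dash/sh
cat > /tmp/demo_bashism.sh <<'EOF'
arr=(one two three)   # arrays are not POSIX sh
echo "${arr[1]}"
EOF

sh /tmp/demo_bashism.sh 2>/dev/null || echo "sh rejected the script"
bash /tmp/demo_bashism.sh
```

If this matches your case, try `bash ../java/build_java.sh` manually, and check that the script's `#!/bin/bash` shebang is intact and that the file has Unix line endings (both assumptions based on the error text).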
When I run the command make install, the following errors show up:
How can I fix it? Please help me, @bachan @anhducle98.
--------------- S U M M A R Y ------------
Command Line: -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT --add-opens=java.base/java.io=ALL-UNNAMED -XX:+UseG1GC -Djava.io.tmpdir=/tmp/elasticsearch-847774562637708686 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m -Des.cgroups.hierarchy.override=/ -Xms12g -Xmx12g -XX:MaxDirectMemorySize=6442450944 -XX:InitiatingHeapOccupancyPercent=30 -XX:G1ReservePercent=25 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/usr/share/elasticsearch/config -Des.distribution.flavor=default -Des.distribution.type=docker -Des.bundled_jdk=true org.elasticsearch.bootstrap.Elasticsearch -Ebootstrap.memory_lock=true -Enode.name=esnode1 -Ecluster.initial_master_nodes=10.10.2.1, 10.10.2.2 -Enode.data=true -Ediscovery.seed_hosts=10.10.2.1, 10.10.2.2, 10.10.2.3 -Ecluster.name=es-docker-cluster -Enode.master=true
Host: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 8 cores, 23G, CentOS Linux release 8.4.2105
Time: Thu Aug 5 16:31:14 2021 UTC elapsed time: 65.623954 seconds (0d 0h 1m 5s)
--------------- T H R E A D ---------------
Current thread (0x00007f1e7c028e80): JavaThread "elasticsearch[esnode1][write][T#7]" daemon [_thread_in_native, id=286, stack(0x00007f1d17afb000,0x00007f1d17bfc000)]
Stack: [0x00007f1d17afb000,0x00007f1d17bfc000], sp=0x00007f1d17bf93b0, free space=1016k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libc.so.6+0x86101] cfree+0x21
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j com.coccoc.Tokenizer.initialize(Ljava/lang/String;)I+0
j com.coccoc.Tokenizer.(Ljava/lang/String;)V+6
j org.apache.lucene.analysis.vi.VietnameseTokenizerImpl.lambda$new$0(Lorg/elasticsearch/analysis/VietnameseConfig;)Lcom/coccoc/Tokenizer;+8
j org.apache.lucene.analysis.vi.VietnameseTokenizerImpl$$Lambda$5942+0x0000000801994660.run()Ljava/lang/Object;+4
J 2859 c2 java.security.AccessController.doPrivileged(Ljava/security/PrivilegedAction;)Ljava/lang/Object; java.base@16 (9 bytes) @ 0x00007f1f5bb8bfd0 [0x00007f1f5bb8bfa0+0x0000000000000030]
j org.apache.lucene.analysis.vi.VietnameseTokenizerImpl.(Lorg/elasticsearch/analysis/VietnameseConfig;Ljava/io/Reader;)V+73
j org.apache.lucene.analysis.vi.VietnameseTokenizer.(Lorg/elasticsearch/analysis/VietnameseConfig;)V+71
j org.elasticsearch.index.analysis.VietnameseTokenizerFactory.create()Lorg/apache/lucene/analysis/Tokenizer;+8
j org.elasticsearch.index.analysis.CustomAnalyzer.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;+4
j org.apache.lucene.analysis.AnalyzerWrapper.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;+8
j org.apache.lucene.analysis.AnalyzerWrapper.createComponents(Ljava/lang/String;)Lorg/apache/lucene/analysis/Analyzer$TokenStreamComponents;+8
j org.apache.lucene.analysis.Analyzer.tokenStream(Ljava/lang/String;Ljava/lang/String;)Lorg/apache/lucene/analysis/TokenStream;+58
J 8816 c1 org.apache.lucene.document.Field.tokenStream(Lorg/apache/lucene/analysis/Analyzer;Lorg/apache/lucene/analysis/TokenStream;)Lorg/apache/lucene/analysis/TokenStream; (188 bytes) @ 0x00007f1f5511d944 [0x00007f1f5511abc0+0x0000000000002d84]
j org.apache.lucene.index.DefaultIndexingChain$PerField.invert(ILorg/apache/lucene/index/IndexableField;Z)V+91
j org.apache.lucene.index.DefaultIndexingChain.processField(ILorg/apache/lucene/index/IndexableField;JI)I+113
J 10620 c2 org.apache.lucene.index.DefaultIndexingChain.processDocument(ILjava/lang/Iterable;)V (181 bytes) @ 0x00007f1f5c12b094 [0x00007f1f5c12ae20+0x0000000000000274]
J 10633 c2 org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(Ljava/lang/Iterable;Lorg/apache/lucene/index/DocumentsWriterDeleteQueue$Node;Lorg/apache/lucene/index/DocumentsWriter$FlushNotifications;)J (180 bytes) @ 0x00007f1f5c132aec [0x00007f1f5c132980+0x000000000000016c]
J 10436 c1 org.apache.lucene.index.DocumentsWriter.updateDocuments(Ljava/lang/Iterable;Lorg/apache/lucene/index/DocumentsWriterDeleteQueue$Node;)J (280 bytes) @ 0x00007f1f54d4573c [0x00007f1f54d455a0+0x000000000000019c]
j org.apache.lucene.index.IndexWriter.updateDocuments(Lorg/apache/lucene/index/DocumentsWriterDeleteQueue$Node;Ljava/lang/Iterable;)J+13
J 10062 c1 org.elasticsearch.index.engine.InternalEngine.addDocs(Ljava/util/List;Lorg/apache/lucene/index/IndexWriter;)V (49 bytes) @ 0x00007f1f5496e12c [0x00007f1f5496d960+0x00000000000007cc]
j org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(Lorg/elasticsearch/index/engine/Engine$Index;Lorg/elasticsearch/index/engine/InternalEngine$IndexingStrategy;)Lorg/elasticsearch/index/engine/Engine$IndexResult;+272
j org.elasticsearch.index.engine.InternalEngine.index(Lorg/elasticsearch/index/engine/Engine$Index;)Lorg/elasticsearch/index/engine/Engine$IndexResult;+418
J 10965 c2 org.elasticsearch.index.shard.IndexShard.index(Lorg/elasticsearch/index/engine/Engine;Lorg/elasticsearch/index/engine/Engine$Index;)Lorg/elasticsearch/index/engine/Engine$IndexResult; (316 bytes) @ 0x00007f1f5c1e258c [0x00007f1f5c1e22e0+0x00000000000002ac]
j org.elasticsearch.index.shard.IndexShard.applyIndexOperation(Lorg/elasticsearch/index/engine/Engine;JJJLorg/elasticsearch/index/VersionType;JJJZLorg/elasticsearch/index/engine/Engine$Operation$Origin;Lorg/elasticsearch/index/mapper/SourceToParse;)Lorg/elasticsearch/index/engine/Engine$IndexResult;+230
j org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(JLorg/elasticsearch/index/VersionType;Lorg/elasticsearch/index/mapper/SourceToParse;JJJZ)Lorg/elasticsearch/index/engine/Engine$IndexResult;+49
j org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(Lorg/elasticsearch/action/bulk/BulkPrimaryExecutionContext;Lorg/elasticsearch/action/update/UpdateHelper;Ljava/util/function/LongSupplier;Lorg/elasticsearch/action/bulk/MappingUpdatePerformer;Ljava/util/function/Consumer;Lorg/elasticsearch/action/ActionListener;)Z+456
j org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun()V+45
J 11088 c1 org.elasticsearch.common.util.concurrent.AbstractRunnable.run()V (32 bytes) @ 0x00007f1f5530740c [0x00007f1f55307300+0x000000000000010c]
j org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(Lorg/elasticsearch/action/bulk/BulkShardRequest;Lorg/elasticsearch/index/shard/IndexShard;Lorg/elasticsearch/action/update/UpdateHelper;Ljava/util/function/LongSupplier;Lorg/elasticsearch/action/bulk/MappingUpdatePerformer;Ljava/util/function/Consumer;Lorg/elasticsearch/action/ActionListener;Lorg/elasticsearch/threadpool/ThreadPool;Ljava/lang/String;)V+21
j org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(Lorg/elasticsearch/action/bulk/BulkShardRequest;Lorg/elasticsearch/index/shard/IndexShard;Lorg/elasticsearch/action/ActionListener;)V+71
j org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(Lorg/elasticsearch/action/support/replication/ReplicatedWriteRequest;Lorg/elasticsearch/index/shard/IndexShard;Lorg/elasticsearch/action/ActionListener;)V+7
j org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun()V+16
j org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun()V+24
J 11088 c1 org.elasticsearch.common.util.concurrent.AbstractRunnable.run()V (32 bytes) @ 0x00007f1f5530740c [0x00007f1f55307300+0x000000000000010c]
j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+92 java.base@16
j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 java.base@16
j java.lang.Thread.run()V+11 java.base@16
v ~StubRoutines::call_stub
siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0xfffffffffffffff7
Register to memory mapping:
RAX=0x0 is NULL
RBX=0x00007f1e4c00a620 points into unknown readable memory: 0xffffffffffffffff | ff ff ff ff ff ff ff ff
RCX=0x00007f1f71c90999: <offset 0x0000000000080999> in /lib64/libc.so.6 at 0x00007f1f71c10000
RDX=0x00007f1e4c000080 points into unknown readable memory: 0x00007f1e4c034220 | 20 42 03 4c 1e 7f 00 00
RSP=0x00007f1d17bf93b0 is pointing into the stack for thread: 0x00007f1e7c028e80
RBP=0x00007f1e9413a040 points into unknown readable memory: 0x00007f1e40009110 | 10 91 00 40 1e 7f 00 00
RSI=0x0 is NULL
RDI=0xffffffffffffffff is an unknown value
R8 =0x00007f1e5408319e points into unknown readable memory: 07 00
R9 =0x0000000000000007 is an unknown value
R10=0x0 is NULL
R11=0x0000000000000202 is an unknown value
R12=0x0 is NULL
R13=0x00007f1e54083ea0 points into unknown readable memory: 0x00000000fbad2488 | 88 24 ad fb 00 00 00 00
R14=0x0000000000000018 is an unknown value
R15=0x00007f1d17bf9460 is pointing into the stack for thread: 0x00007f1e7c028e80
print(T.word_tokenize("xin chào, tôi là người Việt Nam", tokenize_option=0))
What values can I pass into tokenize_option?
I also have some misunderstandings about segmentation:
- Why is " " returned as a token? I think it is meaningless.
- The segment method keeps case sensitivity but removes punctuation.
- The segment_original method makes text case-insensitive but keeps punctuation, and a " " inside a token cannot be changed to "_" (space_positions is empty).
Can you explain? Thanks!
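For reference, based on my reading of the repository's README, the Python binding appears to accept three tokenize_option values: 0 (TOKENIZE_NORMAL, the default), 1 (TOKENIZE_HOST, for hostnames), and 2 (TOKENIZE_URL, for URLs). Treat the names and numbering as assumptions to be checked against the source. A sketch, assuming the module built with -DBUILD_PYTHON=1 is installed:

```python
# Sketch (assumption: CocCocTokenizer was built and installed for this interpreter)
from CocCocTokenizer import PyTokenizer

T = PyTokenizer(load_nontone_data=True)
text = "xin chào, tôi là người Việt Nam"
# 0 = TOKENIZE_NORMAL (default), 1 = TOKENIZE_HOST, 2 = TOKENIZE_URL (assumed names)
for option in (0, 1, 2):
    print(option, T.word_tokenize(text, tokenize_option=option))
```

The HOST/URL modes presumably change how dots, slashes, and similar separators are handled, which only matters for web-text input.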
Thank you, Cốc Cốc, for developing a word-segmentation toolkit with high accuracy and very fast speed.
While trying it out, I noticed that when built for Java the text gets entirely lowercased; I looked through the Java code but could not find where this happens or how to change it.
$ LD_LIBRARY_PATH=build java -cp build/coccoc-tokenizer.jar com.coccoc.Tokenizer "một câu văn tiếng Việt"
một câu_văn tiếng_việt .
Hoping for some help!
I have finished installing this part:
$ mkdir build && cd build
$ cmake -DBUILD_PYTHON=1 ..
# make install
but I still cannot use:
from CocCocTokenizer import PyTokenizer
I got this error when running make install on Ubuntu 20.04:
CMake Error at cmake_install.cmake:105 (file):
file INSTALL cannot make directory "/usr/local/share/tokenizer/dicts_text":
No such file or directory.
make: *** [Makefile:74: install] Error 1
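A possible cause is that the parent of the install path does not yet exist, or that `make install` lacks permission to create it under /usr/local. A workaround sketch (the path comes straight from the error message; the need for sudo is an assumption):

```shell
# Create the missing install directory, then retry the install
sudo mkdir -p /usr/local/share/tokenizer/dicts_text
sudo make install
```

If you prefer a non-root install, re-running cmake with `-DCMAKE_INSTALL_PREFIX=$HOME/.local` (or similar) avoids touching /usr/local at all.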
Dear all,
I've some issues when trying to install the CocCocTokenizer library for python 3.8 in a conda environment.
I've tried:
git clone [email protected]:coccoc/coccoc-tokenizer.git
cd coccoc-tokenizer/
mkdir build && cd build
conda activate py3.8
sudo cmake -DBUILD_JAVA=1 -DBUILD_PYTHON=1 ..
sudo make install
sudo cmake -DBUILD_JAVA=1 -DBUILD_PYTHON=1 -DCMAKE_INSTALL_PREFIX=/home/huyqnguyen/anaconda3/envs/py3.8 ..
conda activate py3.8
and making changes to the /python/build_python.sh file, but none of those solutions helped me import CocCocTokenizer in Python:
from CocCocTokenizer import PyTokenizer
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'PyTokenizer' from 'CocCocTokenizer' (unknown location)
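An "unknown location" import error usually means Python found a stale or empty package entry rather than the built extension. A small self-contained check (the module name is the one from the traceback) shows whether, and from where, the current interpreter can load it:

```python
import importlib.util

def locate(module_name):
    """Return the file a module would be loaded from, or None if it cannot be found."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec else None

# For the failing binding this prints None unless `make install`
# placed the extension on this interpreter's sys.path
print(locate("CocCocTokenizer"))
# A module that certainly exists resolves to a real path
print(locate("json"))
```

Since `sudo make install` was run outside the activated conda environment, the extension may have gone into the system Python's site-packages rather than the py3.8 environment; checking which interpreter `python/build_python.sh` invokes, and pointing `CMAKE_INSTALL_PREFIX` at the environment as tried above, are the usual suspects (assumptions).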
This is a great project!
Can you please provide documentation explaining how to use options such as -t, -u... with the Python binding?
Thank you so much!
Hi everyone,
The first time I installed coccoc-tokenizer it succeeded and everything went well.
When I built it a second time and ran make install, then ran the command
/usr/local/bin/tokenizer "Cộng hòa Xã hội chủ nghĩa Việt Nam"
it produced the error
Error openning file, alphabetic
I deleted everything by hand to rebuild, but the error persists.
OS: Ubuntu Server 20.04
I haven't found a fix yet; if anyone knows how to remove coccoc-tokenizer completely, please help me.
Thanks, everyone.
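"Error openning file, alphabetic" suggests the tokenizer cannot read its installed dictionary dumps, possibly left half-written by the second install. A removal sketch follows; the paths are assumptions based on the default /usr/local prefix seen elsewhere in this thread:

```shell
# Remove installed artifacts (default prefix /usr/local; adjust if you used another)
sudo rm -f /usr/local/bin/tokenizer /usr/local/bin/vn_lang_tool /usr/local/bin/dict_compiler
sudo rm -rf /usr/local/share/tokenizer
# Then rebuild from a clean tree
cd coccoc-tokenizer
rm -rf build && mkdir build && cd build
cmake .. && sudo make install
```

Deleting the stale build/ directory matters because CMake caches paths from the first configure, and a leftover cache can make the second install incomplete.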
Hello!
Could you please show me how to add new Vietnamese compound words and rebuild the library for my own use?
Thank you very much!
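I can only sketch the general flow, based on the build log earlier in this thread: the `compile_dict` target regenerates `multiterm_trie.dump`, `syllable_trie.dump`, and `nontone_pair_freq_map.dump` from the source dictionaries, so adding entries to the text dictionaries shipped in the repo (the exact file names under `dicts/` are an assumption) and rebuilding should pick them up:

```shell
# Assumed flow: edit the source dictionaries, then rebuild so the
# compile_dict target regenerates the *.dump files and reinstalls them
cd coccoc-tokenizer
# ... add your compound words to the appropriate file under dicts/ ...
cd build
make              # re-runs compile_dict -> regenerates the .dump files
sudo make install # reinstalls the dumps under the install prefix
```

After reinstalling, re-run the tokenizer on a sentence containing the new compound to confirm it is now joined with "_".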
I've been using the coccoc-tokenizer Java bindings for the Vietnamese Elasticsearch analysis plugin, and we want to keep tokens in their original case. For example, the text "Cộng hòa Xã hội chủ nghĩa Việt Nam" is tokenized as cộng_hòa xã_hội chủ_nghĩa việt_nam, but what we expect is Cộng_hòa Xã_hội chủ_nghĩa Việt_Nam.
Can we somehow make the tokenizer produce tokens in their original case?
Hi, have you ever measured metrics like Recall@1 or Recall@100 on any information retrieval task and compared the results with other Vietnamese tokenizers, say, VnCoreNLP?
On my own datasets, VnCoreNLP is a little better than CocCocTokenizer (I use the basic BM25 score).