
vncorenlp's Introduction

Table of contents

  1. Introduction
  2. Installation
  3. Usage for Python users
  4. Usage for Java users
  5. Experimental results

VnCoreNLP: A Vietnamese natural language processing toolkit

VnCoreNLP is a fast and accurate NLP annotation pipeline for Vietnamese, providing rich linguistic annotations through key NLP components of word segmentation, POS tagging, named entity recognition (NER) and dependency parsing. Users do not have to install external dependencies. Users can run processing pipelines from either the command-line or the API. The general architecture and experimental results of VnCoreNLP can be found in the following related papers:

  1. Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2018. VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, NAACL 2018, pages 56-60. [.bib]
  2. Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras and Mark Johnson. 2018. A Fast and Accurate Vietnamese Word Segmenter. In Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, pages 2582-2587. [.bib]
  3. Dat Quoc Nguyen, Thanh Vu, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2017. From Word Segmentation to POS Tagging for Vietnamese. In Proceedings of the 15th Annual Workshop of the Australasian Language Technology Association, ALTA 2017, pages 108-113. [.bib]

Please CITE paper [1] whenever VnCoreNLP is used to produce published results or incorporated into other software. If you are dealing in depth with either word segmentation or POS tagging, you are also encouraged to cite paper [2] or [3], respectively.

If you are looking for light-weight versions, VnCoreNLP's word segmentation and POS tagging components have also been released as the independent packages RDRsegmenter [2] and VnMarMoT [3], respectively.

Installation

  • Java 1.8+ (Prerequisite)

  • The file VnCoreNLP-1.2.jar (27MB) and the folder models (115MB), both placed in the same working folder.

  • Python 3.6+ if using a Python wrapper of VnCoreNLP. To install this wrapper, users have to run the following command:

    $ pip3 install py_vncorenlp

    A special thanks goes to Linh The Nguyen for creating this wrapper!
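Before loading the toolkit, it can help to confirm that the working folder really holds both items listed above. The following is a minimal sketch, not part of VnCoreNLP or py_vncorenlp; the helper name `missing_components` is hypothetical, and only the file and folder names come from the installation notes.

```python
from pathlib import Path

# File and folder names are taken from the installation notes above.
JAR_NAME = "VnCoreNLP-1.2.jar"
MODELS_DIR = "models"


def missing_components(working_dir):
    """Return the names of required items not found in working_dir.

    Hypothetical helper: it only checks that the jar file and the models
    folder exist; it does not verify their contents or the Java version.
    """
    root = Path(working_dir)
    missing = []
    if not (root / JAR_NAME).is_file():
        missing.append(JAR_NAME)
    if not (root / MODELS_DIR).is_dir():
        missing.append(MODELS_DIR)
    return missing


if __name__ == "__main__":
    # An empty list means both items were found.
    print(missing_components("/absolute/path/to/vncorenlp"))
```

An empty result only says the two items are present; it says nothing about whether Java 1.8+ is installed.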

Usage for Python users

import py_vncorenlp

# Automatically download VnCoreNLP components from the original repository
# and save them in some local working folder
py_vncorenlp.download_model(save_dir='/absolute/path/to/vncorenlp')

# Load VnCoreNLP from the local working folder that contains both `VnCoreNLP-1.2.jar` and `models` 
model = py_vncorenlp.VnCoreNLP(save_dir='/absolute/path/to/vncorenlp')
# Equivalent to: model = py_vncorenlp.VnCoreNLP(annotators=["wseg", "pos", "ner", "parse"], save_dir='/absolute/path/to/vncorenlp')

# Annotate a raw corpus
model.annotate_file(input_file="/absolute/path/to/input/file", output_file="/absolute/path/to/output/file")

# Annotate a raw text
model.print_out(model.annotate_text("Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."))

By default, the output is formatted with 6 columns representing word index, word form, POS tag, NER label, head index of the current word and its dependency relation type:

1       Ông     Nc      O       4       sub
2       Nguyễn_Khắc_Chúc        Np      B-PER   1       nmod
3       đang    R       O       4       adv
4       làm_việc        V       O       0       root
5       tại     E       O       4       loc
6       Đại_học N       B-ORG   5       pob
...
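For downstream processing, each output line can be turned into a small record. The sketch below assumes that the six columns are separated by whitespace and that no field contains whitespace itself (multi-syllable words are joined with underscores in the output above). The `Token` type and `parse_line` function are illustrative names, not part of py_vncorenlp.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """One row of the 6-column output (illustrative type, not a library class)."""
    index: int
    form: str
    pos_tag: str
    ner_label: str
    head: int
    dep_label: str


def parse_line(line):
    """Parse one 6-column output line into a Token."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    index, form, pos_tag, ner_label, head, dep_label = fields
    return Token(int(index), form, pos_tag, ner_label, int(head), dep_label)


sample = "2       Nguyễn_Khắc_Chúc        Np      B-PER   1       nmod"
token = parse_line(sample)
print(token.form, token.ner_label, token.head)
```

A head value of 0 marks the root of the sentence, as in the `làm_việc` row above.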

For users who use VnCoreNLP only for word segmentation:

rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir='/absolute/path/to/vncorenlp')
text = "Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."
output = rdrsegmenter.word_segment(text)
print(output)
# ['Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội .', 'Bà Lan , vợ ông Chúc , cũng làm_việc tại đây .']
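Each returned string is one sentence whose words are separated by spaces, with underscores joining the syllables of a multi-syllable word. A minimal sketch for turning such a string into a list of words follows; `to_words` is a hypothetical helper, and it assumes the raw text contains no literal underscores of its own, since those would be converted to spaces as well.

```python
def to_words(segmented_sentence):
    """Split a word-segmented sentence into words.

    Underscores inside a word are replaced by spaces, so
    'làm_việc' becomes 'làm việc'.
    """
    return [word.replace("_", " ") for word in segmented_sentence.split()]


sentence = "Ông Nguyễn_Khắc_Chúc đang làm_việc tại Đại_học Quốc_gia Hà_Nội ."
words = to_words(sentence)
print(len(words), words)
```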

Usage for Java users

Using VnCoreNLP from the command line

You can run VnCoreNLP to annotate a raw input text corpus (e.g. a collection of news content) by using the following commands:

// To perform word segmentation, POS tagging, NER and then dependency parsing
$ java -Xmx2g -jar VnCoreNLP-1.2.jar -fin input.txt -fout output.txt
// To perform word segmentation, POS tagging and then NER
$ java -Xmx2g -jar VnCoreNLP-1.2.jar -fin input.txt -fout output.txt -annotators wseg,pos,ner
// To perform word segmentation and then POS tagging
$ java -Xmx2g -jar VnCoreNLP-1.2.jar -fin input.txt -fout output.txt -annotators wseg,pos
// To perform word segmentation
$ java -Xmx2g -jar VnCoreNLP-1.2.jar -fin input.txt -fout output.txt -annotators wseg    
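When these commands are launched from a script, the argument list can be assembled programmatically. The sketch below only builds the list, using the flags shown in the commands above (`-Xmx`, `-jar`, `-fin`, `-fout`, `-annotators`); the function name `build_command` and its parameters are illustrative assumptions.

```python
def build_command(fin, fout, annotators=None,
                  jar="VnCoreNLP-1.2.jar", max_heap="2g"):
    """Build the java argument list for one of the command lines above.

    If annotators is None or empty, the -annotators flag is omitted, which
    corresponds to the first command (the full pipeline).
    """
    cmd = ["java", f"-Xmx{max_heap}", "-jar", jar, "-fin", fin, "-fout", fout]
    if annotators:
        cmd += ["-annotators", ",".join(annotators)]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command("input.txt", "output.txt", ["wseg", "pos"])))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the working folder that contains the jar file and the models folder.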

Using VnCoreNLP from the API

The following code is a simple and complete example:

import vn.pipeline.*;
import java.io.*;
public class VnCoreNLPExample {
    public static void main(String[] args) throws IOException {
    
        // "wseg", "pos", "ner", and "parse" refer to word segmentation, POS tagging, NER and dependency parsing, respectively. 
        String[] annotators = {"wseg", "pos", "ner", "parse"}; 
        VnCoreNLP pipeline = new VnCoreNLP(annotators); 
    
        String str = "Ông Nguyễn Khắc Chúc  đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."; 
        
        Annotation annotation = new Annotation(str); 
        pipeline.annotate(annotation); 
        
        System.out.println(annotation.toString());
        // 1    Ông                 Nc  O       4   sub 
        // 2    Nguyễn_Khắc_Chúc    Np  B-PER   1   nmod
        // 3    đang                R   O       4   adv
        // 4    làm_việc            V   O       0   root
        // ...
        
        //Write to file
        PrintStream outputPrinter = new PrintStream("output.txt");
        pipeline.printToFile(annotation, outputPrinter); 
    
        // You can also get a single sentence to analyze individually 
        Sentence firstSentence = annotation.getSentences().get(0);
        System.out.println(firstSentence.toString());
    }
}


See VnCoreNLP's open-source code in the folder src for API details.

Experimental results

See details in papers [1,2,3] above or at NLP-progress.

vncorenlp's People

Contributors

datquocnguyen, tienthanhdhcn, vncorenlp


vncorenlp's Issues

Chunk tagging

I realized that changing the word segmentation of VLSP 2016 NER shared task dataset makes it hard to measure the performance of the model. The issue can also be seen from the paper "Attentive Neural Network for Named Entity Recognition in Vietnamese".
As a result, I want to do POS and chunk tagging on text that is already word-segmented. Is that possible?

How to Use toolkit as a service for Python users

Hi,

Thanks for this awesome toolkit, @vncorenlp.

Could you please show me how to run your toolkit as a service so that I can send requests to it from Python?

I followed the instructions in this link, but when I enter the command vncorenlp -Xmx2g E:\Data\Projects\MobiFone5\NLP\VnCoreNLP-master\VnCoreNLP-1.1.jar -p 9000 -a "wseg,pos,ner,parse" at the command prompt, it returns: 'vncorenlp' is not recognized as an internal or external command, operable program or batch file.

How can I run this command properly?


Word misrecognized during text annotation

Input Sentences:

Schauffele háo hức tiếp tục mùa giải PGA Tour
Tay golf Xander Schauffele thừa nhận anh rất muốn thi đấu trở lại, bất kể mùa giải PGA Tour được tổ chức trong điều kiện như thế nào đi nữa. Schauffele đã 4 lần vô địch PGA Tour và cũng từng về nhì tại Masters và US Open. Tay golf số 12 thế giới khẳng định anh không thực sự quan tâm đến lịch thi đấu và trở lại của các giải đấu bởi điều quan trọng là anh được ra sân và cầm lại chiếc gậy golf quen thuộc.

During annotation, the word Schauffele in the third sentence is labeled as punctuation ("punct")!

The first occurrence of Schauffele is recognized as:

{
                    "depLabel": "sub",
                    "form": "Schauffele",
                    "head": 3,
                    "index": 1,
                    "nerLabel": "O",
                    "posTag": "N"
                }

The word Schauffele in the third sentence is labeled as punctuation:

{
                    "depLabel": "punct",
                    "form": "Schauffele",
                    "head": 6,
                    "index": 1,
                    "nerLabel": "O",
                    "posTag": "N"
                }

Annotated text:

"0": [
                {
                    "depLabel": "sub",
                    "form": "Schauffele",
                    "head": 3,
                    "index": 1,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "háo_hức",
                    "head": 1,
                    "index": 2,
                    "nerLabel": "O",
                    "posTag": "A"
                },
                {
                    "depLabel": "root",
                    "form": "tiếp_tục",
                    "head": 0,
                    "index": 3,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "dob",
                    "form": "mùa",
                    "head": 3,
                    "index": 4,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "giải",
                    "head": 4,
                    "index": 5,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "PGA",
                    "head": 4,
                    "index": 6,
                    "nerLabel": "O",
                    "posTag": "Ny"
                },
                {
                    "depLabel": "nmod",
                    "form": "Tour_Tay",
                    "head": 4,
                    "index": 7,
                    "nerLabel": "O",
                    "posTag": "Np"
                },
                {
                    "depLabel": "nmod",
                    "form": "golf",
                    "head": 4,
                    "index": 8,
                    "nerLabel": "O",
                    "posTag": "Nb"
                },
                {
                    "depLabel": "nmod",
                    "form": "Xander_Schauffele",
                    "head": 4,
                    "index": 9,
                    "nerLabel": "B-PER",
                    "posTag": "Np"
                },
                {
                    "depLabel": "vmod",
                    "form": "thừa_nhận",
                    "head": 3,
                    "index": 10,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "sub",
                    "form": "anh",
                    "head": 13,
                    "index": 11,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "adv",
                    "form": "rất",
                    "head": 13,
                    "index": 12,
                    "nerLabel": "O",
                    "posTag": "R"
                },
                {
                    "depLabel": "vmod",
                    "form": "muốn",
                    "head": 10,
                    "index": 13,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "vmod",
                    "form": "thi_đấu",
                    "head": 13,
                    "index": 14,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "vmod",
                    "form": "trở_lại",
                    "head": 14,
                    "index": 15,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "punct",
                    "form": ",",
                    "head": 10,
                    "index": 16,
                    "nerLabel": "O",
                    "posTag": "CH"
                },
                {
                    "depLabel": "nmod",
                    "form": "bất_kể",
                    "head": 18,
                    "index": 17,
                    "nerLabel": "O",
                    "posTag": "R"
                },
                {
                    "depLabel": "sub",
                    "form": "mùa",
                    "head": 22,
                    "index": 18,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "giải",
                    "head": 18,
                    "index": 19,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "PGA",
                    "head": 19,
                    "index": 20,
                    "nerLabel": "O",
                    "posTag": "Ny"
                },
                {
                    "depLabel": "nmod",
                    "form": "Tour",
                    "head": 20,
                    "index": 21,
                    "nerLabel": "O",
                    "posTag": "Np"
                },
                {
                    "depLabel": "vmod",
                    "form": "được",
                    "head": 10,
                    "index": 22,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "vmod",
                    "form": "tổ_chức",
                    "head": 22,
                    "index": 23,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "vmod",
                    "form": "trong",
                    "head": 23,
                    "index": 24,
                    "nerLabel": "O",
                    "posTag": "E"
                },
                {
                    "depLabel": "pob",
                    "form": "điều_kiện",
                    "head": 24,
                    "index": 25,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "x",
                    "form": "như_thế_nào",
                    "head": 25,
                    "index": 26,
                    "nerLabel": "O",
                    "posTag": "X"
                },
                {
                    "depLabel": "x",
                    "form": "đi_nữa",
                    "head": 25,
                    "index": 27,
                    "nerLabel": "O",
                    "posTag": "X"
                },
                {
                    "depLabel": "punct",
                    "form": ".",
                    "head": 3,
                    "index": 28,
                    "nerLabel": "O",
                    "posTag": "CH"
                }
            ],
            "1": [
                {
                    "depLabel": "punct",
                    "form": "Schauffele",
                    "head": 6,
                    "index": 1,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "đã",
                    "head": 4,
                    "index": 2,
                    "nerLabel": "O",
                    "posTag": "R"
                },
                {
                    "depLabel": "det",
                    "form": "4",
                    "head": 4,
                    "index": 3,
                    "nerLabel": "O",
                    "posTag": "M"
                },
                {
                    "depLabel": "root",
                    "form": "lần",
                    "head": 0,
                    "index": 4,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "vô_địch",
                    "head": 4,
                    "index": 5,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "PGA",
                    "head": 4,
                    "index": 6,
                    "nerLabel": "O",
                    "posTag": "Ny"
                },
                {
                    "depLabel": "nmod",
                    "form": "Tour",
                    "head": 6,
                    "index": 7,
                    "nerLabel": "O",
                    "posTag": "Np"
                },
                {
                    "depLabel": "coord",
                    "form": "và",
                    "head": 4,
                    "index": 8,
                    "nerLabel": "O",
                    "posTag": "Cc"
                },
                {
                    "depLabel": "adv",
                    "form": "cũng",
                    "head": 11,
                    "index": 9,
                    "nerLabel": "O",
                    "posTag": "R"
                },
                {
                    "depLabel": "adv",
                    "form": "từng",
                    "head": 11,
                    "index": 10,
                    "nerLabel": "O",
                    "posTag": "R"
                },
                {
                    "depLabel": "conj",
                    "form": "về",
                    "head": 8,
                    "index": 11,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "vmod",
                    "form": "nhì",
                    "head": 11,
                    "index": 12,
                    "nerLabel": "O",
                    "posTag": "A"
                },
                {
                    "depLabel": "loc",
                    "form": "tại",
                    "head": 11,
                    "index": 13,
                    "nerLabel": "O",
                    "posTag": "E"
                },
                {
                    "depLabel": "pob",
                    "form": "Masters",
                    "head": 13,
                    "index": 14,
                    "nerLabel": "O",
                    "posTag": "Np"
                },
                {
                    "depLabel": "coord",
                    "form": "và",
                    "head": 14,
                    "index": 15,
                    "nerLabel": "O",
                    "posTag": "Cc"
                },
                {
                    "depLabel": "conj",
                    "form": "US",
                    "head": 15,
                    "index": 16,
                    "nerLabel": "B-MISC",
                    "posTag": "Np"
                },
                {
                    "depLabel": "nmod",
                    "form": "Open",
                    "head": 16,
                    "index": 17,
                    "nerLabel": "I-MISC",
                    "posTag": "Np"
                },
                {
                    "depLabel": "punct",
                    "form": ".",
                    "head": 17,
                    "index": 18,
                    "nerLabel": "O",
                    "posTag": "CH"
                }
            ],
            "2": [
                {
                    "depLabel": "sub",
                    "form": "Tay",
                    "head": 6,
                    "index": 1,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "golf",
                    "head": 1,
                    "index": 2,
                    "nerLabel": "O",
                    "posTag": "Nb"
                },
                {
                    "depLabel": "nmod",
                    "form": "số",
                    "head": 1,
                    "index": 3,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "det",
                    "form": "12",
                    "head": 3,
                    "index": 4,
                    "nerLabel": "O",
                    "posTag": "M"
                },
                {
                    "depLabel": "nmod",
                    "form": "thế_giới",
                    "head": 3,
                    "index": 5,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "root",
                    "form": "khẳng_định",
                    "head": 0,
                    "index": 6,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "sub",
                    "form": "anh",
                    "head": 10,
                    "index": 7,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "adv",
                    "form": "không",
                    "head": 10,
                    "index": 8,
                    "nerLabel": "O",
                    "posTag": "R"
                },
                {
                    "depLabel": "vmod",
                    "form": "thực_sự",
                    "head": 10,
                    "index": 9,
                    "nerLabel": "O",
                    "posTag": "A"
                },
                {
                    "depLabel": "vmod",
                    "form": "quan_tâm",
                    "head": 6,
                    "index": 10,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "vmod",
                    "form": "đến",
                    "head": 10,
                    "index": 11,
                    "nerLabel": "O",
                    "posTag": "E"
                },
                {
                    "depLabel": "pob",
                    "form": "lịch",
                    "head": 11,
                    "index": 12,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "thi_đấu",
                    "head": 12,
                    "index": 13,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "coord",
                    "form": "và",
                    "head": 10,
                    "index": 14,
                    "nerLabel": "O",
                    "posTag": "Cc"
                },
                {
                    "depLabel": "conj",
                    "form": "trở_lại",
                    "head": 14,
                    "index": 15,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "vmod",
                    "form": "của",
                    "head": 10,
                    "index": 16,
                    "nerLabel": "O",
                    "posTag": "E"
                },
                {
                    "depLabel": "det",
                    "form": "các",
                    "head": 18,
                    "index": 17,
                    "nerLabel": "O",
                    "posTag": "L"
                },
                {
                    "depLabel": "pob",
                    "form": "giải",
                    "head": 16,
                    "index": 18,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "đấu",
                    "head": 18,
                    "index": 19,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "prp",
                    "form": "bởi",
                    "head": 10,
                    "index": 20,
                    "nerLabel": "O",
                    "posTag": "E"
                },
                {
                    "depLabel": "pob",
                    "form": "điều",
                    "head": 20,
                    "index": 21,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "quan_trọng",
                    "head": 21,
                    "index": 22,
                    "nerLabel": "O",
                    "posTag": "A"
                },
                {
                    "depLabel": "vmod",
                    "form": "là",
                    "head": 6,
                    "index": 23,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "dob",
                    "form": "anh",
                    "head": 23,
                    "index": 24,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "vmod",
                    "form": "được",
                    "head": 23,
                    "index": 25,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "vmod",
                    "form": "ra",
                    "head": 25,
                    "index": 26,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "dob",
                    "form": "sân",
                    "head": 26,
                    "index": 27,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "coord",
                    "form": "và",
                    "head": 23,
                    "index": 28,
                    "nerLabel": "O",
                    "posTag": "Cc"
                },
                {
                    "depLabel": "conj",
                    "form": "cầm",
                    "head": 28,
                    "index": 29,
                    "nerLabel": "O",
                    "posTag": "V"
                },
                {
                    "depLabel": "adv",
                    "form": "lại",
                    "head": 29,
                    "index": 30,
                    "nerLabel": "O",
                    "posTag": "R"
                },
                {
                    "depLabel": "dob",
                    "form": "chiếc",
                    "head": 29,
                    "index": 31,
                    "nerLabel": "O",
                    "posTag": "Nc"
                },
                {
                    "depLabel": "nmod",
                    "form": "gậy",
                    "head": 31,
                    "index": 32,
                    "nerLabel": "O",
                    "posTag": "N"
                },
                {
                    "depLabel": "nmod",
                    "form": "golf",
                    "head": 31,
                    "index": 33,
                    "nerLabel": "O",
                    "posTag": "Nb"
                },
                {
                    "depLabel": "nmod",
                    "form": "quen_thuộc",
                    "head": 31,
                    "index": 34,
                    "nerLabel": "O",
                    "posTag": "A"
                },
                {
                    "depLabel": "punct",
                    "form": ".",
                    "head": 6,
                    "index": 35,
                    "nerLabel": "O",
                    "posTag": "CH"
                }
            ]
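Inconsistencies like the one reported here can be searched for automatically. The sketch below flags tokens whose dependency label is "punct" although their POS tag is not "CH", assuming that "CH" marks punctuation as it does for "," and "." in the output above. It takes a mapping from sentence keys to token lists shaped like the output shown; the function name `suspicious_punct` and the trimmed sample are illustrative.

```python
def suspicious_punct(sentences):
    """Yield (sentence_key, token) pairs whose depLabel is 'punct'
    while posTag is not 'CH' (assumed here to be the punctuation tag)."""
    for key, tokens in sentences.items():
        for token in tokens:
            if token["depLabel"] == "punct" and token["posTag"] != "CH":
                yield key, token


# Trimmed sample built from the tokens reported in this issue.
sample = {
    "0": [
        {"depLabel": "sub", "form": "Schauffele", "head": 3,
         "index": 1, "nerLabel": "O", "posTag": "N"},
        {"depLabel": "punct", "form": ".", "head": 3,
         "index": 28, "nerLabel": "O", "posTag": "CH"},
    ],
    "1": [
        {"depLabel": "punct", "form": "Schauffele", "head": 6,
         "index": 1, "nerLabel": "O", "posTag": "N"},
    ],
}

flagged = list(suspicious_punct(sample))
for key, token in flagged:
    print(key, token["index"], token["form"], token["posTag"])
```

Such a check only detects the symptom; it does not explain why the parser assigned the label.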

Source Code

Hello,

Thank you for this awesome project. @datquocnguyen @vncorenlp

I'm very interested in your Dynamic Feature induction for NER. Can you please provide the source code? At least the dynamic feature induction part.

I can only find the compiled jar.

Add new word to VnCoreNLP

I want to add a new word to the tokenizer.
Example: FIS Bank -> FIS_Bank
Could you show me how to train VnCoreNLP?
Thank you

Retrain word segmentation guideline

Hi, I have read through VnCoreNLP but I didn't see any documentation for retraining the RDRsegmenter. Could you please provide more details on how to retrain it and generate the "rules" automatically from a "gold standard corpus" and a "raw corpus"?

Many thanks

[end and start of tokens]

Hi,

StanfordCoreNLP has a lot of very convenient information contained in tokens, including the positions of the tokens in the original sentence, materialized by .beginPosition and .endPosition.

Would it be possible to get the same kind of information for VnCoreNLP Token or Word ?

Thanks
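One possible workaround, independent of the toolkit itself, is to recover the positions by aligning the segmented words with the original text. The sketch below is not a VnCoreNLP API; `token_offsets` is a hypothetical helper that assumes the segmenter keeps every character of the input unchanged and in order, and only joins syllables with underscores.

```python
def token_offsets(text, tokens):
    """Return (token, begin, end) triples, with end exclusive.

    Each token is split on '_' into syllables, and every syllable is
    searched for in `text` from the current cursor onwards. This relies on
    the assumption that segmentation does not alter or reorder characters.
    """
    spans = []
    cursor = 0
    for token in tokens:
        begin = None
        for syllable in token.split("_"):
            start = text.find(syllable, cursor)
            if start == -1:
                raise ValueError(
                    f"cannot align {syllable!r} after position {cursor}")
            if begin is None:
                begin = start
            cursor = start + len(syllable)
        spans.append((token, begin, cursor))
    return spans


text = "Bà Lan, vợ ông Chúc, cũng làm việc tại đây."
tokens = "Bà Lan , vợ ông Chúc , cũng làm_việc tại đây .".split()
spans = token_offsets(text, tokens)
for token, begin, end in spans:
    print(token, begin, end, repr(text[begin:end]))
```

If the input contains literal underscores or the segmenter normalizes characters, the alignment may fail or drift, so the result is worth validating against `text[begin:end]`.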

dataset

I want to build a new model to improve accuracy on the NER task. ERNIE has become the most accurate model for Chinese, and I want to apply it to Vietnamese. Could I please have your dataset? Thanks.

The server has stopped working.

Hi

When trying to use your package in Python, I got this error:

RuntimeError: The server has stopped working.

Could you help me fix it?

Many thanks!

Recognizing abbreviated place names at the end of a sentence

Input string: Về ca chỉ điểm ổ dịch Covid-19 tại quán bar Buddha là bệnh nhân 91 (phi công Vietnam Airlines). Hiện, bệnh nhân đang được điều trị tại Bệnh viện Bệnh nhiệt đới TP.HCM. Tình trạng bệnh nhân không sốt, mạch huyết áp bình thường, rối loạn đông máu kiểm soát tạm ổn, chức năng phổi có cải thiện, tiếp tục thở máy và hỗ trợ ECMO.

When tokenizing, the abbreviation TP.HCM. is not split correctly:

{
    "depLabel": "nmod",
    "form": "TP.HCM.",
    "head": 8,
    "index": 10,
     "nerLabel": "O",
     "posTag": "Ny"
}

If the "." in the input passage is replaced with a comma, as in "Bệnh viện Bệnh nhiệt đới TP.HCM, tình trạng bệnh nhân ....",
then the tokenizer recognizes it correctly:

{
    "depLabel": "nmod",
    "form": "TP.",
    "head": 8,
    "index": 10,
     "nerLabel": "I-ORG",
    "posTag": "Ny"
}, {
    "depLabel": "nmod",
    "form": "HCM",
    "head": 10,
    "index": 11,
    "nerLabel": "I-ORG",
    "posTag": "Np"
}

Issue #0

I just want to say thank you very much for publishing this very useful library. :)

Connection error when using the Python wrapper to process a big text file

When processing a big text file (>= 1GB of text), a connection error may occur and stop the program. The following retry loop can be used to work around it:

import time

# annotator is an existing wrapper instance; text is the input string.
for _ in range(5):
    try:
        # Process the data:
        annotated_text = annotator.annotate(text)
        word_segmented_text = annotator.tokenize(text)
        break
    except Exception:
        # Wait, then retry if there is a connection error:
        print("Retry in 5 seconds")
        time.sleep(5)

Sentence Tokenize

When running word_tokenize on a paragraph, the annotator automatically splits the paragraph into multiple sentences. Is this sentence splitting done by a learned model or by a regular expression?

What is JAVA_HOME?

Hi, I'm currently running your code in my VSCode, and I came across this error:


Traceback (most recent call last):
  File "path\name\nlp.py", line 8, in <module>
    model = py_vncorenlp.VnCoreNLP(save_dir='/models')
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\Programs\Python\Python312\Lib\site-packages\py_vncorenlp\vncorenlp.py", line 53, in __init__
    from jnius import autoclass
  File "C:\Users\user\AppData\Local\Programs\Python\Python312\Lib\site-packages\jnius\__init__.py", line 18, in <module>        
    java = get_java_setup(sys.platform)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\Programs\Python\Python312\Lib\site-packages\jnius\env.py", line 60, in get_java_setup
    JAVA_HOME = get_jdk_home(platform)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\Programs\Python\Python312\Lib\site-packages\jnius\env.py", line 335, in get_jdk_home
    raise Exception('Unable to find JAVA_HOME')
Exception: Unable to find JAVA_HOME

I really don't get it. Please help.
P/S: The code I'm using:

import py_vncorenlp
import os
import shutil

# Load VnCoreNLP model
model = py_vncorenlp.VnCoreNLP(save_dir='/models')

# Annotate a raw text
text = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."
annotated_text = model.annotate(text)
print(annotated_text)

# Annotate a file
input_file_path = "/input.txt"
output_file_path = "/output.txt"
model.annotate_file(input_file=input_file_path, output_file=output_file_path)
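
The traceback shows that the `jnius` import fails because it cannot locate a JDK through the `JAVA_HOME` environment variable. A minimal sketch of one way to set it from Python before importing `py_vncorenlp` follows. `ensure_java_home` is a hypothetical helper and the JDK path in the usage comment is an assumption; a JDK must already be installed.

```python
import os


def ensure_java_home(candidate):
    """Set JAVA_HOME to candidate if it is not already set.

    Returns the value in effect. Raises FileNotFoundError when JAVA_HOME
    is unset and candidate is not an existing directory.
    """
    current = os.environ.get("JAVA_HOME")
    if current:
        return current
    if not os.path.isdir(candidate):
        raise FileNotFoundError(f"no JDK directory at {candidate!r}")
    os.environ["JAVA_HOME"] = candidate
    return candidate


# Hypothetical usage: replace the path with the real JDK install location
# and call this before importing py_vncorenlp.
# ensure_java_home(r"C:\Program Files\Java\jdk-17")
# import py_vncorenlp
```

Setting the variable in the operating system's environment settings has the same effect and avoids repeating the call in every script.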

What kind of model is used for the NER task?

Thank you for this awesome model. I would like to understand it in more depth.
Did you use a CRF or an LSTM, and which features does the model use? Is there a paper that describes it in detail?
Many thanks

Resources to POS corpus (mapping table)

Thanks for creating such a wonderful tool. However, I cannot seem to find the POS tag corpus and mapping table anywhere, therefore the result of POS tagging is a bit confusing to me. Could you guide me to the resources that you use to make the POS tag label?
Thanks

JVM exception occurred: vn/pipeline/VnCoreNLP java.lang.NoClassDefFoundError

I followed all the steps in the instructions for Python users and got the following error when I tried to create the model:

/usr/local/lib/python3.7/dist-packages/py_vncorenlp/vncorenlp.py in __init__(self, max_heap_size, annotators, save_dir)
52 jnius_config.set_classpath(save_dir + "/VnCoreNLP-1.1.1.jar")
53 from jnius import autoclass
---> 54 javaclass_vncorenlp = autoclass('vn.pipeline.VnCoreNLP')
55 self.javaclass_String = autoclass('java.lang.String')
56 self.annotators = annotators
/usr/local/lib/python3.7/dist-packages/jnius/reflect.py in autoclass(clsname, include_protected, include_private)
209
210 # c = Class.forName(clsname)
--> 211 c = find_javaclass(clsname)
212 if c is None:
213 raise Exception('Java class {0} not found'.format(c))
jnius/jnius_export_func.pxi in jnius.find_javaclass()
jnius/jnius_utils.pxi in jnius.check_exception()
JavaException: JVM exception occurred: vn/pipeline/VnCoreNLP java.lang.NoClassDefFoundError

How can I handle this problem? Thanks

Multiple requests for pipeline

I'm trying to use VnCoreNLP to serve multiple requests.

Suppose I have something like this:

    String[] annotators = {"wseg", "pos", "ner", "parse"};
    pipeline = new VnCoreNLP(annotators);

I tried to use the pipeline variable to annotate multiple texts at the same time. As a result, it throws two exceptions: ConcurrentModificationException and ArrayIndexOutOfBoundsException.

I would like to know whether the library supports multi-threading or is meant to be single-threaded only.

Thanks

Word segmentation without sentence segmentation?

I want to apply only word segmentation to my data, without splitting sentences on "." or other punctuation, but I don't know which option controls this. Or is sentence segmentation needed for word segmentation to work well?
Thank you

Tone marks are changed in Vietnamese

Dear @datquocnguyen, thank you for sharing your great work. I just observed an abnormality in VnCoreNLP's word segmentation. With the input "Hòa", I received "Hoà", which means the "`" tone mark is shifted by one character. Could you please fix this problem or provide a solution in the future? Thank you in advance.

Form too large error with VnCoreNLPServer

I got this problem when tokenizing data with the vncorenlp Python wrapper. It works fine until it reaches one particular line in the data file.

org.eclipse.jetty.http.BadMessageException: 400: Unable to parse form content
	at org.eclipse.jetty.server.Request.getParameters(Request.java:380)
	at org.eclipse.jetty.server.Request.getParameter(Request.java:1021)
	at javax.servlet.ServletRequestWrapper.getParameter(ServletRequestWrapper.java:194)
	at spark.Request.queryParams(Request.java:283)
	at spark.http.matching.RequestWrapper.queryParams(RequestWrapper.java:141)
	at vncorenlp.VnCoreNLPServer.handle(VnCoreNLPServer.java:247)
	at vncorenlp.VnCoreNLPServer.lambda$3(VnCoreNLPServer.java:184)
	at spark.ResponseTransformerRouteImpl$1.handle(ResponseTransformerRouteImpl.java:47)
	at spark.http.matching.Routes.execute(Routes.java:61)
	at spark.http.matching.MatcherFilter.doFilter(MatcherFilter.java:130)
	at spark.embeddedserver.jetty.JettyHandler.doHandle(JettyHandler.java:50)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1568)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.Server.handle(Server.java:530)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:347)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:256)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:279)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:124)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:247)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:140)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.IllegalStateException: Form too large: 250544 > 200000
	at org.eclipse.jetty.server.Request.extractFormParameters(Request.java:523)
	at org.eclipse.jetty.server.Request.extractContentParameters(Request.java:461)
	at org.eclipse.jetty.server.Request.getParameters(Request.java:376)

Best regards
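
The "Form too large: 250544 > 200000" message indicates that a single request body exceeded the server's form size limit. A client-side workaround is to send the text in smaller pieces. The sketch below is only an illustration: `split_by_encoded_size` is a hypothetical helper, and the default budget of 150000 is an assumption chosen to leave headroom under the 200000 limit in the log.

```python
from urllib.parse import quote_plus


def encoded_size(piece):
    """Size of piece once URL-encoded, as it would be in a form body."""
    return len(quote_plus(piece))


def split_by_encoded_size(text, limit=150_000):
    """Split text on whitespace into chunks whose URL-encoded size is <= limit.

    A single word longer than limit is emitted as its own chunk.
    """
    chunks, current, size = [], [], 0
    for word in text.split():
        word_size = encoded_size(word) + 1  # +1 for the joining space
        if current and size + word_size > limit:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(word)
        size += word_size
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be annotated separately and the results concatenated. Splitting on whitespace may cut a sentence in two, so splitting on sentence or line boundaries first is preferable whenever those pieces are already small enough.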

Training Configuration for NER task

Hi @vncorenlp,

Thanks for your amazing toolkit. After reading your publication, I'm trying to train a Vietnamese NER model, but the accuracy is not good.

So I hope you can share with me your configuration file.

I'm using VLSP dataset for training.

Thanks!

Issue when tokenizing a string that contains a single quote character

I have a string: "Nghiên cứu Đại học King's College London và Đại học Leiceste đăng trên Tạp chí Di truyền Con người Mỹ năm 2021 cũng cho thấy"

Running it through VnCoreNLP gives:

[
    "Nghiên",
    "cứu",
    "Đại_học",
    "King",
    "'",
    "s",
    "College_London",
    "và",
    "Đại_học",
    "Leiceste",
    "đăng",
    "trên",
    "Tạp_chí",
    "Di_truyền",
    "Con_người",
    "Mỹ",
    "năm",
    "2021",
    "cũng",
    "cho",
    "thấy"
]

Another sample string: "H'Hen Niê là một hoa hậu và người mẫu người Việt Nam."

Result:

[
    "H",
    "'",
    "Hen_Niê",
    "là",
    "một",
    "hoa_hậu",
    "và",
    "người_mẫu",
    "người",
    "Việt_Nam",
    "."
]

ArrayIndexOutOfBoundsException on multithreaded calls

When I submit multiple requests asynchronously to VnCoreNLPServer, I get ArrayIndexOutOfBoundsException: -1

2019-11-05 15:39:37,233 [qtp950350040-18] ERROR vncorenlp.VnCoreNLPServer.handle(VnCoreNLPServer.java:278) - -1
java.lang.ArrayIndexOutOfBoundsException: -1
	at marmot.util.Encoder.append(Encoder.java:68)
	at marmot.morph.MorphWeightVector.extractStateFeatures(MorphWeightVector.java:366)
	at marmot.core.SimpleTagger.getStates(SimpleTagger.java:181)
	at marmot.core.SimpleTagger.getSumLattice(SimpleTagger.java:322)
	at marmot.core.SimpleTagger.tag_states(SimpleTagger.java:565)
	at marmot.morph.MorphTagger.tagWithLemma(MorphTagger.java:49)
	at vn.corenlp.postagger.PosTagger.tagSentence(PosTagger.java:52)
	at vn.pipeline.Sentence.createWords(Sentence.java:60)
	at vn.pipeline.Sentence.init(Sentence.java:53)
	at vn.pipeline.Sentence.<init>(Sentence.java:30)
	at vncorenlp.VnCoreNLPServer.annotate(VnCoreNLPServer.java:236)
	at vncorenlp.VnCoreNLPServer.handle(VnCoreNLPServer.java:269)
	at vncorenlp.VnCoreNLPServer.lambda$main$2(VnCoreNLPServer.java:178)
	at spark.ResponseTransformerRouteImpl$1.handle(ResponseTransformerRouteImpl.java:47)
	at spark.http.matching.Routes.execute(Routes.java:61)
	at spark.http.matching.MatcherFilter.doFilter(MatcherFilter.java:134)
	at spark.embeddedserver.jetty.JettyHandler.doHandle(JettyHandler.java:50)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1671)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
	at org.eclipse.jetty.server.Server.handle(Server.java:505)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
	at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
	at java.lang.Thread.run(Thread.java:748)

Unable to use the toolkit.

I followed the author's instructions, but when execution reaches this line:

annotator = VnCoreNLP(address="http://127.0.0.1", port=9000)

I cannot get any further; the code hangs right there.
I changed it to

annotator = VnCoreNLP()

and it still does not run.
I have also seen many people on the machinelearningcoban forum report that they cannot get it to run.
I hope to hear back from the author.

Thank you

Add a Maven dependency

Hello, I'm glad to use VnCoreNLP, but I have an issue when importing this library into another project. I want to declare it as a Maven dependency in a Maven project. Could you suggest how to do this?

Many thanks !!

Cannot start NER and PARSE service

After cloning the repo, I tried to run vncorenlp -Xmx2g <FULL-PATH-to-VnCoreNLP-jar-file> -p 9000 -a "wseg,pos,ner,parse" and it gave me the following error

2020-06-30 22.26.18 INFO VnCoreNLPServer - Using annotators: wseg, pos, ner, parse
2020-06-30 22:26:18 INFO WordSegmenter:24 - Loading Word Segmentation model
2020-06-30 22:26:18 INFO PosTagger:21 - Loading POS Tagging model
2020-06-30 22:26:21 INFO NerRecognizer:33 - Loading NER model
2020-06-30 22.26.21 INFO GlobalLexica - Loading word clusters
2020-06-30 22.26.21 ERROR VnCoreNLPServer - null
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at vncorenlp.VnCoreNLPServer.loadVnCoreNLP(VnCoreNLPServer.java:91)
	at vncorenlp.VnCoreNLPServer.main(VnCoreNLPServer.java:175)
Caused by: java.lang.ExceptionInInitializerError
	at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
	at java.nio.file.Paths.get(Paths.java:84)
	at edu.emory.mathcs.nlp.common.util.IOUtils.createArtifactInputStream(IOUtils.java:363)
	at edu.emory.mathcs.nlp.common.util.IOUtils.createArtifactObjectInputStream(IOUtils.java:396)
	at edu.emory.mathcs.nlp.component.template.lexicon.GlobalLexica.getGlobalLexicon(GlobalLexica.java:95)
	at edu.emory.mathcs.nlp.component.template.lexicon.GlobalLexica.getGlobalLexicon(GlobalLexica.java:80)
	at edu.emory.mathcs.nlp.component.template.lexicon.GlobalLexica.<init>(GlobalLexica.java:72)
	at vn.pipeline.LexicalInitializer.initializeLexica(LexicalInitializer.java:74)
	at vn.pipeline.LexicalInitializer.initialize(LexicalInitializer.java:44)
	at vn.corenlp.ner.NerRecognizer.<init>(NerRecognizer.java:39)
	at vn.corenlp.ner.NerRecognizer.initialize(NerRecognizer.java:26)
	... 6 more
Caused by: java.lang.RuntimeException: default directory must be absolute
	at sun.nio.fs.UnixFileSystem.<init>(UnixFileSystem.java:54)
	at sun.nio.fs.LinuxFileSystem.<init>(LinuxFileSystem.java:39)
	at sun.nio.fs.LinuxFileSystemProvider.newFileSystem(LinuxFileSystemProvider.java:46)
	at sun.nio.fs.LinuxFileSystemProvider.newFileSystem(LinuxFileSystemProvider.java:39)
	at sun.nio.fs.UnixFileSystemProvider.<init>(UnixFileSystemProvider.java:56)
	at sun.nio.fs.LinuxFileSystemProvider.<init>(LinuxFileSystemProvider.java:41)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.Class.newInstance(Class.java:442)
	at sun.nio.fs.DefaultFileSystemProvider.createProvider(DefaultFileSystemProvider.java:48)
	at sun.nio.fs.DefaultFileSystemProvider.create(DefaultFileSystemProvider.java:63)
	at java.nio.file.FileSystems$DefaultFileSystemHolder.getDefaultProvider(FileSystems.java:108)
	at java.nio.file.FileSystems$DefaultFileSystemHolder.access$000(FileSystems.java:89)
	at java.nio.file.FileSystems$DefaultFileSystemHolder$1.run(FileSystems.java:98)
	at java.nio.file.FileSystems$DefaultFileSystemHolder$1.run(FileSystems.java:96)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.nio.file.FileSystems$DefaultFileSystemHolder.defaultFileSystem(FileSystems.java:96)
	at java.nio.file.FileSystems$DefaultFileSystemHolder.<clinit>(FileSystems.java:90)
	... 17 more

However, running the service with only "wseg,pos" worked as expected.

Here is my Java version:

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

Over-segmentation (conflation)

When segmenting (conflating) foreign names, the segmenter over-segments (conflates) them.
For instance, "Benjamin Franklin" in
"Khi tôi nói đến từ đó, chắc chắn trong đầu bạn sẽ không liên tưởng đến Benjamin Franklin, nhưng tôi sẽ giải thích cho bạn tại sao lại thế."
is segmented as:
"Khi tôi nói đến từ đó , chắc_chắn trong đầu bạn sẽ không liên_tưởng đến Benjamin_Franklin , nhưng tôi sẽ giải_thích cho bạn tại_sao lại thế ."

Is there a way to fix this problem?

Tokenizer changes the original word

Hi @daiquocnguyen,

I have a problem with the tokenizer: the resulting token differs from the original word. For example, Thanh Hóa is tokenized as Thanh_Hoá. Is there a way to fix this?
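
Until the tokenizer preserves the input spelling, one client-side workaround is to copy the original characters back into the tokens. The sketch below is only an illustration: `restore_surface` is a hypothetical helper, and it assumes the tokens cover every non-space character of the text in order, that only the tone mark position changed (so character counts match after NFC normalization), and that the raw text itself contains no `_`.

```python
import unicodedata


def restore_surface(text, tokens):
    """Rewrite tokens using the characters of the original text.

    Walks the non-space characters of text in order and substitutes them
    for the token characters, keeping "_" as the syllable joiner.
    """
    text = unicodedata.normalize("NFC", text)
    chars = [c for c in text if not c.isspace()]
    restored, pos = [], 0
    for token in tokens:
        out = []
        for ch in unicodedata.normalize("NFC", token):
            if ch == "_":
                out.append("_")
                continue
            if pos >= len(chars):
                raise ValueError("tokens are longer than the text")
            out.append(chars[pos])
            pos += 1
        restored.append("".join(out))
    if pos != len(chars):
        raise ValueError("tokens do not cover the whole text")
    return restored


print(restore_surface("Thanh Hóa", ["Thanh_Hoá"]))
```

The length checks make the helper fail loudly when the assumptions do not hold, for example if the tokenizer adds or drops characters.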

Unable to run the project test files

I downloaded the whole project and tried to run VNCoreNLPExample.class, but I can't get it working. Is there any way I can fix it?

Connection timed out

I get a timeout error when running with annotators="wseg,pos,ner,parse".
If I set annotators="wseg", it runs fine.

Add builders with specific paths for models

Hello,

Thanks for the great work.

When using it as a dependency of another project, I can't initialize the WordSegmenter or NerRecognizer, because my "user.dir" is different from the root of VnCoreNLP (example here), so the models cannot be found.

Would it be possible to write a Builder for classes like WordSegmenter and NerRecognizer that takes a properties file, to ease model loading and dependency management and to separate the code from the resources, a bit like StanfordCoreNLP does? I would be happy to contribute if that is possible.

Thanks,

VnCoreNLP-1.0.jar doesn't use Unicode by default

I ran the command java -Xmx2g -jar VnCoreNLP-1.0.jar -fin input.txt -fout output.txt -annotators wseg,pos on Windows, and output.txt is garbled: it shows Nguyá» instead of Nguyễn. The issue seems to come from both reading and writing the file.
Unlike Linux or macOS, Windows does not make Java use UTF-8 by default, so your code should read and write files as UTF-8 by default, or users should add the option:
-Dfile.encoding=UTF8
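
For anyone who launches the jar from a script, the option can be applied as follows. This is a minimal sketch that reuses only the flags from the command above; `build_command` and `run_vncorenlp` are hypothetical helper names, and `java` is assumed to be on the PATH.

```python
import subprocess


def build_command(jar, fin, fout, annotators="wseg,pos", heap="2g"):
    """Build the java command line with UTF-8 forced as the default encoding."""
    return [
        "java",
        "-Dfile.encoding=UTF8",  # JVM options must come before -jar
        f"-Xmx{heap}",
        "-jar", jar,
        "-fin", fin,
        "-fout", fout,
        "-annotators", annotators,
    ]


def run_vncorenlp(jar, fin, fout, **kwargs):
    """Run the pipeline, then read the output back explicitly as UTF-8."""
    subprocess.run(build_command(jar, fin, fout, **kwargs), check=True)
    with open(fout, encoding="utf-8") as handle:
        return handle.read()
```

Reading the output with an explicit `encoding="utf-8"` matters on Windows as well, because Python's default file encoding there may also differ from UTF-8.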

Questions

Which algorithm does this code use?
Thank you

Unable to create server if NER is used

Hey guys, thanks for the amazing work! I'm using the python wrapper of this project and currently having some trouble using NER. If I only use wseg and pos tagger, then it works fine, but once I added NER, it's now stuck at this line, which to my understanding means no server is available. Weird thing is it works fine if I only use wseg and pos tagger, and this only happens if I add NER. Thanks!

Does this library support thread-safety?

When calling the word_segment function from multiple threads, my service sometimes encounters the following exception:

  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.9/site-packages/py_vncorenlp/vncorenlp.py", line 95, in word_segment
    self.model.annotate(annotation)
  File "jnius/jnius_export_class.pxi", line 878, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 972, in jnius.JavaMethod.call_method
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: java.util.ConcurrentModificationException

Here is my code to call the word_segment function:

thread_pool_executor = ThreadPoolExecutor(max_workers=4)
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"])
segmented_contents = list(thread_pool_executor.map(rdrsegmenter.word_segment, contents))

So my question is whether this library is thread-safe or not. Note that this exception does not always occur, only occasionally.
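
Nothing in this thread confirms whether the library is thread-safe, but the intermittent ConcurrentModificationException is consistent with shared state being touched by several threads at once. A conservative workaround is to serialize the calls with a lock. The sketch below is only an illustration: `LockedSegmenter` is a hypothetical wrapper, and `FakeSegmenter` stands in for the real `py_vncorenlp.VnCoreNLP` object so that the example runs without Java.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedSegmenter:
    """Wrap an object with a word_segment method so calls never overlap."""

    def __init__(self, segmenter):
        self._segmenter = segmenter
        self._lock = threading.Lock()

    def word_segment(self, text):
        with self._lock:
            return self._segmenter.word_segment(text)


class FakeSegmenter:
    """Stand-in for the real segmenter; joins words with "_" for the demo."""

    def word_segment(self, text):
        return [text.replace(" ", "_")]


safe = LockedSegmenter(FakeSegmenter())
contents = ["Hà Nội", "Việt Nam"]
with ThreadPoolExecutor(max_workers=4) as pool:
    segmented_contents = list(pool.map(safe.word_segment, contents))
print(segmented_contents)
```

The lock removes any parallel speed-up inside the segmenter, so this trades throughput for stability; other work done by the worker threads still runs concurrently.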
