adobe / nlp-cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing

Home Page: http://opensource.adobe.com/NLP-Cube/index.html

License: Apache License 2.0

Python 34.69% Dockerfile 0.09% Shell 0.01% HTML 65.20%
embeddings parse nlp-cube language-pipeline tokenization sentence-splitting part-of-speech-tagger lemmatization dependency-parser dependency-parsing

nlp-cube's Introduction

News

[05 August 2021] - We are releasing version 3.0 of NLP-Cube and its models, and introducing FLAVOURS. This is a major update, but we did our best to keep the same API, so previous code should not break. The supported language list is smaller; if your language is missing, open an issue and we will do our best to add it. Alternatively, you can pin the pip package to the previous version (0.1.0.8): pip install nlpcube==0.1.0.8.

[15 April 2019] - We are releasing version 1.1 models - check all supported languages below. Both the 1.0 and 1.1 models are trained on the same UD 2.2 corpus; however, the 1.1 models do not use vector embeddings, which reduces the disk space and time required to use them. Some languages gain a little accuracy, others lose a little. By default, NLP-Cube will use the latest (at this time, 1.1) models.

To use the older 1.0 models, just specify the version in the load call: cube.load("en", 1.0) (en for English, or any other language code). This will download (if not already downloaded) and use that specific model version. The same goes for any language/version you want to use.

If you already have NLP-Cube installed and want to use the newer 1.1 models, type either cube.load("en", 1.1) or cube.load("en", "latest") to auto-download them. After this, calling cube.load("en") without a version number will automatically use the latest ones from your disk.
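Put together, version pinning looks roughly like this (a minimal sketch based on the calls above; exact behaviour may differ slightly between releases):

from cube.api import Cube       # import the Cube object
cube=Cube(verbose=True)         # initialize it
cube.load("en", 1.0)            # pin to the older 1.0 model (auto-downloads on first run)
# cube.load("en", "latest")     # or explicitly fetch the newest available version
# cube.load("en")               # afterwards, omitting the version uses the latest model on disk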


NLP-Cube

NLP-Cube is an open-source Natural Language Processing framework with support for the languages included in the UD Treebanks (the full list of available languages is below). Use NLP-Cube if you need:

  • Sentence segmentation
  • Tokenization
  • POS tagging (both language-independent (UPOS) and language-dependent (XPOS and ATTRs))
  • Lemmatization
  • Dependency parsing

Example input: "This is a test."; the output is:

1       This    this    PRON    DT      Number=Sing|PronType=Dem        4       nsubj   _
2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       cop     _
3       a       a       DET     DT      Definite=Ind|PronType=Art       4       det     _
4       test    test    NOUN    NN      Number=Sing     0       root    SpaceAfter=No
5       .       .       PUNCT   .       _       4       punct   SpaceAfter=No

If you just want to run it, here's how to set it up and use NLP-Cube in a few lines: Quick Start Tutorial.

For advanced users who want to create and train their own models, please see the Advanced Tutorials in examples/, starting with how to locally install NLP-Cube.

Simple (PIP) installation / update version

Install (or update) NLP-Cube with:

pip3 install -U nlpcube

API Usage

To use NLP-Cube programmatically (in Python), follow this tutorial. In summary:

from cube.api import Cube       # import the Cube object
cube=Cube(verbose=True)         # initialize it
cube.load("en", device='cpu')   # select the desired language (it will auto-download the model on first run)
text="This is the text I want segmented, tokenized, lemmatized and annotated with POS and dependencies."
document=cube(text)            # call with your own text (string) to obtain the annotations

The document object now contains the annotated text, one sentence at a time. To print the third word's POS (in the first sentence), just run:

print(document.sentences[0][2].upos) # [0] is the first sentence and [2] is the third word

Each token object has the following attributes: index, word, lemma, upos, xpos, attrs, head, label, deps, space_after. For detailed info about each attribute please see the standard CoNLL format.
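For example, a short loop over the document can reassemble lines close to the CoNLL output shown earlier (a sketch assuming the attribute names listed above; the exact field formatting is illustrative):

for sentence in document.sentences:                 # iterate over sentences
    for token in sentence:                          # iterate over tokens in a sentence
        fields = [token.index, token.word, token.lemma, token.upos, token.xpos,
                  token.attrs, token.head, token.label, token.deps, token.space_after]
        print("\t".join(str(f) for f in fields))    # one CoNLL-like line per token
    print()                                         # blank line between sentences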

Flavours

Previous versions of NLP-Cube were trained on individual treebanks. This meant that the same language was supported by multiple models at the same time: for instance, you could parse English (en) text with en_ewt, en_esl, en_lines, etc. The current version of NLP-Cube combines all flavours of a treebank under the same umbrella by jointly optimizing a conditioned model. You only need to load the base language, for example en, and then select which flavour to apply at runtime:

from cube.api import Cube       # import the Cube object
cube=Cube(verbose=True)         # initialize it
cube.load("en", device='cpu')   # select the desired language (it will auto-download the model on first run)
text="This is the text I want segmented, tokenized, lemmatized and annotated with POS and dependencies."


# Parse using the default flavour (in this case EWT)
document=cube(text)            # call with your own text (string) to obtain the annotations
# or you can specify a flavour
document=cube(text, flavour='en_lines') 

Webserver Usage

The current version dropped webserver support, since most people preferred to run their own service around NLP-Cube.

Cite

If you use NLP-Cube in your research we would be grateful if you would cite the following paper:

  • NLP-Cube: End-to-End Raw Text Processing With Neural Networks, Boroș, Tiberiu and Dumitrescu, Stefan Daniel and Burtica, Ruxandra, Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics. p. 171--179. October 2018

or, in bibtex format:

@InProceedings{boro-dumitrescu-burtica:2018:K18-2,
  author    = {Boroș, Tiberiu  and  Dumitrescu, Stefan Daniel  and  Burtica, Ruxandra},
  title     = {{NLP}-Cube: End-to-End Raw Text Processing With Neural Networks},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {171--179},
  abstract  = {We introduce NLP-Cube: an end-to-end Natural Language Processing framework, evaluated in CoNLL's "Multilingual Parsing from Raw Text to Universal Dependencies 2018" Shared Task. It performs sentence splitting, tokenization, compound word expansion, lemmatization, tagging and parsing. Based entirely on recurrent neural networks, written in Python, this ready-to-use open source system is freely available on GitHub. For each task we describe and discuss its specific network architecture, closing with an overview on the results obtained in the competition.},
  url       = {http://www.aclweb.org/anthology/K18-2017}
}

Languages and performance

For comparison, the performance of the 3.0 models is reported on the UD 2.2 corpus, but the distributed models are obtained from UD 2.7.

Results are reported against the test files for each language (available in the UD 2.2 corpus) using the 2018 CoNLL eval script. Please see more info about what each metric represents here.

Notes:

  • The 1.1 models no longer need the large external vector embedding files. This makes loading the 1.1 models faster and less RAM-intensive.
  • All reported results here are end-to-end (e.g., we test tagging accuracy on our own segmented text, as this is the real use-case; CoNLL results are mostly reported on "gold", pre-segmented text, leading to higher accuracy for the tagger/parser/etc.).
Language           Model   Token  Sentence  UPOS   XPOS   AllTags  Lemmas  UAS    LAS
Chinese            zh-1.0  93.03  99.10     88.22  88.15  86.91    92.74   73.43  69.52
Chinese            zh-1.1  92.34  99.10     86.75  86.66  85.35    92.05   71.00  67.04
Chinese            zh-3.0  95.88  87.36     91.67  83.54  82.74    85.88   79.15  70.08
English            en-1.0  99.25  72.8      95.34  94.83  92.48    95.62   84.7   81.93
English            en-1.1  99.2   70.94     94.4   93.93  91.04    95.18   83.3   80.32
English            en-3.0  98.95  75.00     96.01  95.71  93.75    96.06   87.06  84.61
French             fr-1.0  99.68  94.2      92.61  95.46  90.79    93.08   84.96  80.91
French             fr-1.1  99.67  95.31     92.51  95.45  90.8     93.0    83.88  80.16
French             fr-3.0  99.71  93.92     97.33  99.56  96.61    90.79   89.81  87.24
German             de-1.0  99.7   81.19     91.38  94.26  80.37    75.8    79.6   74.35
German             de-1.1  99.77  81.99     90.47  93.82  79.79    75.46   79.3   73.87
German             de-3.0  99.77  86.25     94.70  97.00  85.02    82.73   87.08  82.69
Hungarian          hu-1.0  99.8   94.18     94.52  99.8   86.22    91.07   81.57  75.95
Hungarian          hu-1.1  99.88  97.77     93.11  99.88  86.79    91.18   77.89  70.94
Hungarian          hu-3.0  99.75  91.64     96.43  99.75  89.89    91.31   86.34  81.29
Italian            it-1.0  99.89  98.14     86.86  86.67  84.97    87.03   78.3   74.59
Italian            it-1.1  99.92  99.07     86.58  86.4   84.53    86.75   76.38  72.35
Italian            it-3.0  99.92  98.13     98.26  98.15  97.34    97.76   94.07  92.66
Romanian (RO-RRT)  ro-1.0  99.74  95.56     97.42  96.59  95.49    96.91   90.38  85.23
Romanian (RO-RRT)  ro-1.1  99.71  95.42     96.96  96.32  94.98    96.57   90.14  85.06
Romanian (RO-RRT)  ro-3.0  99.80  95.64     97.67  97.11  96.76    97.55   92.06  87.67
Spanish            es-1.0  99.98  98.32     98.0   98.0   96.62    98.05   90.53  88.27
Spanish            es-1.1  99.98  98.40     98.01  98.00  96.6     97.99   90.51  88.16
Spanish            es-3.0  99.96  97.17     96.88  99.91  94.88    98.17   92.11  89.86

nlp-cube's People

Contributors

bogdananton, burtica, dumitrescustefan, filmaj, finqbucks, fluffybird2323, futurulus, keuhdall, koichiyasuoka, rscctest, ruxandraburtica, shauryauppal-1mg, spaskich, tiberiu44

nlp-cube's Issues

v1.1 models for 4 languages in performance table

The metrics table in the README reports performance numbers for v1.1 models in most languages. I'm assuming that for languages that only have v1.0 results in the table, v1.1 models are not yet available. However, for four languages, v1.1 performance numbers are included in the table, but only v1.0 models are available in the online model repository:

  • Ukrainian
  • Urdu
  • Vietnamese
  • Chinese

Just double-checking that these models weren't left out unintentionally!

Classical Chinese Model needed

I've almost finished building the UD_Classical_Chinese-Kyoto treebank, and now I'm trying to make a Classical Chinese model for NLP-Cube (please check my diary). But in my model sentence_accuracy<35 and I can't split "天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠" into sentences (check the gold standard here). How do I tune sentence splitting for Classical Chinese?

Getting stuck at "Configuring tzdata"

Describe the bug
I'm trying to build a docker image and use NLP-Cube as a web service, but it gets stuck on the "Configuring tzdata" phase, where I have to choose my geographic area. The problem is that the console is unresponsive and I can't choose any of the given options. I've tried the whole process on multiple Windows machines running Docker for Windows and I always get the same result.

To Reproduce
Steps to reproduce the behavior:

  1. Clone the repository
  2. Go to 'repo-dir/docker'
  3. Execute docker build --tag nlp-cube:1.0 .
  4. Wait until it reaches the 'Configuring tzdata' phase

Expected behavior
I want to run NLP-Cube in a docker container and use it as a web API for sentence splitting, tokenization, lemmatization, etc. According to the documentation, I should be able to do this by starting the server and accessing container:port/nlp?lang=en&text=test.
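For reference, the kind of call I expect to work against the running container would be something like this (a sketch; the host and port are placeholders for however the container is exposed):

import requests   # assumes the requests package is installed
resp = requests.get("http://localhost:5000/nlp",               # hypothetical host:port mapping
                    params={"lang": "en", "text": "test"})
print(resp.status_code, resp.text)                              # expect 200 and the annotations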

Desktop (please complete the following information):

  • OS: Windows 10 Pro 1909
  • Docker for Windows v.3.0.0(50684)

Batch size is being ignored by tagger during training

Tagger is ignoring batch-size parameter

Expected Result

Increasing the batch-size should decrease accuracy for the first epoch and increase the loss. Also, it should reduce the bad-gradient 'nan' error.

Actual Result

Regardless of the batch size, the results seem to be the same. Also, after a maximum of 20 epochs I get a bad-gradient 'nan' error.

ConllEntry should have a __repr__ function

Right now, output looks like:

text="This is the text I want segmented, tokenized, lemmatized and annotated with POS and dependencies."
sentences=cube(text)
print(sentences)
[[<cube.io_utils.conll.ConllEntry object at 0x7f5ae924b358>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252080>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae92520f0>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252160>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae92521d0>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252240>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae92522b0>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252320>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252390>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252400>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252470>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae92524e0>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252550>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae92525c0>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252630>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae92526a0>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252710>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252780>]]

when we print a sentence object. We should implement a __repr__ function so the output looks prettier.
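Something along these lines would already help (a hypothetical sketch, not the actual ConllEntry code; the fields mirror the CoNLL attributes used elsewhere in the API):

class ConllEntry:
    # ... existing fields and methods ...
    def __repr__(self):
        # hypothetical: show a few key fields instead of the default object address
        return "ConllEntry({}: {}/{} {})".format(self.index, self.word, self.lemma, self.upos)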

Update configuration examples

This is related to the previously closed issue #35.
The configuration example in examples/parser.conf has to be updated with the new use_lexical flag. Also, the config reader should raise an error and fail if both use_lexical and use_morphology are set to False.

ERROR in app: Exception on /nlp [GET]

Describe the bug
I'm getting the following errors on some of the requests when running in a docker container as a web server:

10.42.109.126 - - [04/Feb/2021 14:38:55] "GET /nlp?lang=id&text=WE+Online%2C+Jakarta+-+%0A++Indeks+Harga+Saham+Gabungan+%28IHSG%29+masih+mampu+bangkit+0.48%25+atau29.47+poin+ke+level+6%2C107.22+dengan+saham-saham+disektor+Aneka+Industri+naik+2.11%25%2C+Infrastruktur+0.96%25+dan+Keuangan+0.81%25.++%0A++Analis+Reliance+Sekuritas%2C+Lanjar+Nafi+mengungkapkan+bila+melesatnya+harga+saham+PT+Indah+Kiat+Pulp+and+Paper+Tbk+%28INKP%29+yang+sebesar+12.2%25+dan+PT+Tjiwi+Kimia+Tbk+%28TKIM%29+sebesar+19.9%25+menjadi+motor+pengerak+IHSG+pada+perdagangan+4+Februari+2021.++%0A++Baca+Juga%3A+Sempat+Anjlok+ke+Zona+Merah%2C+IHSG+Parkir+dengan+Apresiasi+0%2C48%25+pada+Penutupan+Sesi+II.++%0A++%22Ini+setelah+harga+pulp+naik+karena+Paper+Excellence+BV+yang+di+kendalikan+oleh+Jackson+Wijaya+dari+Indonesia+memenangkan+kasus+abritrase+terhadap+grub+brazil+J%26F+Investimentos+SA+untuk+menyelesaikan+akuisisi+pabrik+kertas+Eldorado+Brazil+serta.+Kemudian+juga+karena+aksi+beli+investor+asing+sseniali+Rp609%2C15+miliar%2C%22+terangnya%2C+di+Jakarta%2C+Kamis+%284%2F2%2F2021%29.++%0A++Baca+Juga%3A+Mayoritas+Saham+Bank-bank+BUMN+Sumringah+Setelah+Erick+Lempar+Pujian.++%0A++Menurutnya%2C+secara+teknikal+IHSG+bergerak+terkonsolidasi+namun+terlihat+cukup+kuat+diatas+level+rata-rata+50+hari+dan+5+hari+sebagai+support+pergerakan.+Selanjutnya+IHSG+berpotensi+menguji+resistance+rata-rata+20+hari+sebagai+indikasi+pembentukan+uptrend+jangka+pendek+menuju+upper+bollinger+bands.+Indikator+Stochastic+terkonfirmasi+alami+momentum+bullish+dengan+MACD+yang+bergerak+positif.++%0A++%22Sehingga+diperkirakan+IHSG+bergerak+melanjutkan+penguatan+diakhir+pekan+dengan+support+resistance+6051-6239.+Saham-saham+yang+dapat+dicermati+secara+teknikal+diantaranya%3B+ACST%2C+ADRO%2C+BMRI%2C+HOKI%2C+INDY%2C+PTBA%2C+PTPP%2C+RALS%2C+TBLA%2C+WSKT%2C+dan+WSBP%2C%22+tutup+Lanjar. HTTP/1.1" 500 -
[2021-02-04 14:38:55,773] ERROR in app: Exception on /nlp [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "webserver.py", line 71, in nlp
    result = thecube(query)
  File "../cube/api.py", line 184, in __call__
    sequences+=self._tokenizer.tokenize(input_line)
  File "../cube/generic_networks/tokenizers.py", line 384, in tokenize
    if np.argmax(y.npvalue()) == 1:
  File "_dynet.pyx", line 730, in _dynet.Expression.npvalue
  File "_dynet.pyx", line 745, in _dynet.Expression.npvalue
ValueError: Attempt to do tensor forward in different devices (nodes 20330 and 39)
10.42.109.126 - - [04/Feb/2021 14:38:55] "GET /nlp?lang=id&text=Grab-Gojek+Dapat+Saingan+Baru+di+Negeri+Singa HTTP/1.1" 500 -
[2021-02-04 14:38:55,872] ERROR in app: Exception on /nlp [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "webserver.py", line 71, in nlp
    result = thecube(query)
  File "../cube/api.py", line 184, in __call__
    sequences+=self._tokenizer.tokenize(input_line)
  File "../cube/generic_networks/tokenizers.py", line 410, in tokenize
    seq = self._get_tokens(w.strip(), space_after_end_of_sentence=space_after_end_of_sentence)
  File "../cube/generic_networks/tokenizers.py", line 320, in _get_tokens
    y_pred, _, _ = self._predict_tok(input_string, runtime=True)
  File "../cube/generic_networks/tokenizers.py", line 264, in _predict_tok
    word_lstm = word_lstm.add_input(dy.tanh(emb + hol))
  File "_dynet.pyx", line 6026, in _dynet.RNNState.add_input
  File "_dynet.pyx", line 6036, in _dynet.RNNState.add_input
  File "_dynet.pyx", line 4941, in _dynet._RNNBuilder.add_input_to_prev
ValueError: Using stale builder. Create .new_graph() after computation graph is renewed.

I can't reproduce the issue when running the image locally.

To Reproduce
Steps to reproduce the behavior:

  1. Build and run the image in a production docker environment
  2. Wait for the models to download.
  3. Send the following request 10.42.109.126 - - [04/Feb/2021 14:38:55] "GET /nlp?lang=id&text=Grab-Gojek+Dapat+Saingan+Baru+di+Negeri+Singa HTTP/1.1" 500 -
  4. See error

Expected behavior
When running locally, everything works fine and Cube returns the expected output.

Use only one feature instead of all of them

It would be very helpful to use just one of the features NLP-Cube offers instead of being forced to use all of them.

Due to computation times, depending on the project, NLP-Cube would be perfect if you could run only the part you want to use instead of having to load all the models. For example, I only want to use the lemmatizer, as it works well for Spanish. I need my text to be tokenized in a specific way, and I don't want to use either the NLP-Cube tokenizer or the rest of the features.

I suggest adding another way to run NLP-Cube apart from sentences = cube(text).
For example, lemmas = cube.lemmatizer(tokens) would be a nice approach, where tokens is a list containing the tokens of a sentence and the call returns another list containing the lemma of each word.
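A hypothetical usage sketch of the proposed call (not an existing API; the name follows the suggestion above):

tokens = ["Los", "gatos", "duermen", "."]     # pre-tokenized sentence supplied by the user
lemmas = cube.lemmatizer(tokens)              # hypothetical: would return one lemma per token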

The NlpCloud seems to be down

Describe the bug
Apparently the NLP cloud at http://nlp-cube.speed.pub.ro/models/ is out of service.

To Reproduce
Run the following commands

from cube.api import Cube

nlp_cube = Cube()
nlp_cube.load("en")

Expected behavior
It's expected that the cube library downloads the models from its cloud path.

Restructure entry point

Need

Now, we use cube/main.py as an entry point for testing, training and running a server. Update this entry point to have sub-commands for each of the use-cases.

Deliverables

  • Identify which parameters are needed for each of the following sub-commands:
    • train
    • test
    • runserver
  • Update the entry-point so that these commands can easily be used separately.

Enhanced compound word expander

Based on the success of the FST attention-free lemmatizer, the compound word expander can be modified to work in a similar way.

Expected Result

Faster training time and better accuracy for the compound word expander.

Error on installing NLP Cube

When I run the pip3 install -U nlpcube command I get the following error:

lineiter = (line.strip() for line in open(filename))
FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-4r3qjuhb/nlpcube/

Is it a problem with something I'm doing? I am only trying to install it via pip.

pip-ify NLP-Cube

In the near future we should be able to install this as a Python package, e.g. "pip install nlpcube". This is a major issue, as it also involves having the Python API ready for all tasks.

Expected Result

Initial results:

  1. Install by "pip install"
  2. Import it like "import nlpcube as cube"

Later results - decide on the usage API:
3. Agree on a way to use it: either in a static manner (e.g. cube.load(), cube.tokenize(), etc., like NLTK) or object-centered (e.g. instantiate a Cube object, set a pipeline from the start and then just call the wrapper function, or let the user call the individual functions of the object (sentence split/tok/tag/etc.) sequentially).

Kazakh language wrong result

Evaluated example from readme for kazakh language with no error, but result is wrong. English language works fine.

Expected Result

I expected a list of words with their correct normalized forms, but for every word the normalized form consists of only one letter.

Actual Result

[[1 Алтай а NOUN adj Case=Gen 2 nmod:poss _ _,
2 жерінің ж NOUN n _ 3 obl _ _,
3 асты а VERB adj _ 4 nsubj _ _,
4 қандай қ PRON adv _ 5 nsubj _ _,
5 қазыналы қ VERB n _ 0 root _ _,
6 болса б VERB v Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 5 cop _ SpaceAfter=No,
7 . . PUNCT sent _ 5 punct _ _],
[1 Ағаш а VERB v Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root _ SpaceAfter=No,
2 . . PUNCT sent _ 1 punct _ SpaceAfter=No]]

Reproduction Steps

from cube.api import Cube # import the Cube object
cube=Cube(verbose=True) # initialize it
cube.load("kk") # select the desired language (it will auto-download the model on first run)
text="Алтай жерінің асты қандай қазыналы болса. Ағаш."
sentences=cube(text) # call with your own text (string) to obtain the annotations
sentences

System Information

  • Python version 3.6.12
  • Operating system Ubuntu 20.04

Full training and model import example

We need to provide a jupyter example and create the associated script that will allow users to:

  1. download the UD corpus
  2. train a full model
  3. package the full model in a zip file
  4. import the model in their local model storage folder

Error while training a model

Hi, I am trying to train my own NLP-Cube model for the Uyghur language. I carefully followed all your tutorials, and the TrainSet, DevSet and TestSet are all in CoNLL-U format.

Once I run the command to train the default tokenizer, I get an error message, as you can read in the tokenizer.log file (attached below). I ran the following command:

python3 /home/chris/Documents/NLPCube/NLP-Cube/cube/main.py --train=tokenizer --train-file=/home/chris/Documents/NLPCube/My_Model/trainSet.conllu --dev-file=/home/chris/Documents/NLPCube/My_Model/devSet.conllu --raw-train-file=/home/chris/Documents/NLPCube/My_Model/trainSet_raw.txt --raw-dev-file=/home/chris/Documents/NLPCube/My_Model/devSet_raw.txt --embeddings /home/chris/Documents/NLPCube/wiki.ug.vec --store /home/chris/Documents/NLPCube/My_Model/tokenizer --batch-size 1000 --set-mem 8000 --autobatch --patience 20 &> /home/chris/Documents/NLPCube/My_Model/tokenizer.log

Information about my system:

  • GPU support disabled
  • Python version 3.8
  • Operating system Linux Ubuntu
    tokenizer.log

Thanks in advance for your help,
Chris.

Integration testing of NLP-Cube

We need tests to run to determine:

  • running status of all main.py functions
  • running status of all api.py functions
  • running status of model_store import/export/listing functions

Tokenizer training bug

Training fails if there is a sentence that has only one token and the _create_mixed_sentences function randomly tries to pick only part of that sentence.

Add notebook(s) for tokenization

Is your feature request related to a problem? Please describe.
It's related to the adoption of NLP-cube

Describe the solution you'd like
A Jupyter notebook example that emphasizes how tokenization can be done.

Issues with nn(Norwegian nynorsk) and nb(Norwegian bokmål).

When I run the code, it returns this error:
<< Exception("No model version for language ["+lang_code+"] was found in the online repository!")
Exception: No model version for language [nb] was found in the online repository!>>

My code works with other model languages like Danish, Swedish, French and Old Church Slavonic.
I use PyCharm Community with Python 3.6.
Is Norwegian actually supported?

Possible bug - parameter updating

It's possible that, due to changes in DyNet, the default for .expr() is now update=False, which freezes the parameters so no learning is performed. The correction would be to explicitly set .expr(update=True).
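In other words, wherever a parameter expression is created, the fix would look roughly like this (a sketch; self.param_w is a placeholder, the real call sites are spread across the networks code):

w = self.param_w.expr()             # old: may silently freeze the parameter under newer DyNet defaults
w = self.param_w.expr(update=True)  # proposed: request gradient updates explicitly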

CONLL 2018 Shared Task participation

We should participate in this year's shared task: Multilingual Parsing from Raw Text to Universal Dependencies ( http://universaldependencies.org/conll18/ )

This means we need to prepare the code & scripts for this task.

As far as I can see right now, we have three distinct cases:

  • we have end-to-end decoding
  • we start from udpipe tokenization and perform tagging, parsing, lemmatization
  • we start from udpipe tokenization, tagging and lemmatization and we perform parsing

TODO

  • update the runtime CLI to support taking as input a list of models that need to be run on the input, for instance: --run=[tokenization, parsing, tagging, lemmatization] or --run=[parsing, tagging]
  • implement scripts tailored for the UD Shared Task, which take as input the supplied XML with the list of input test files and run a custom NLPCube pipeline, depending on the language code
  • deploy on TIRA testing environment

Expected Result

Training and evaluation scripts, trained models, and handling of low-resourced minor languages.

Steps for setting up NLP-cube

Need

Document the steps needed for a developer to set up the development environment with NLP-cube.

Deliverables

We need a complete process to help developers onboard the project. This implies:

  • steps needed in order to use the repository
  • scripts for downloading needed resources (e.g. pre-trained models, data)

Script should download:

Does Cube support enhanced/collapsed dependency parsing?

Currently our service is based on the Java-based Stanford CoreNLP library, and we are looking to migrate our backend systems from Java to Python. Does Cube provide enhanced dependency annotations, and if yes, how would one go about getting them? Thanks

Delexicalized version for parser

Small treebanks in UD are not suitable for our model. As such, we should train delexicalized versions of the parser by combining compatible treebanks for those languages. We could also train delexicalized parsers directly on the small treebanks, because using only morphology should not cause data sparseness on small datasets.

Expected Result

We need to add another flag in the parser's config file for using lexical features or not: USE_LEXICAL
This should be similar to USE_MORPHOLOGY

Details for implementation

io_utils/config.py <- we need to add the attribute here. We also need to be backwards-compatible with older config files which don't have this flag; the default value should be True.
generic_networks/parsers.py

  • __init__(self, ...) should use this information when computing the input (lines 45--62)
  • make_input(self, ...) should use this information to create the input: I recommend branching the execution before line 260 (see the sketch after this list). I don't think dropout or any other mechanism will help prevent overfitting, so right now we should not be doing it. Just go through each entry, look up the morphological information and sum up the vectors (t_emb = upos_emb + xpos_emb + attrs_emb), then add to the sequence: x_list.append(w_emb)
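A rough sketch of the proposed branch inside make_input (illustrative only; variable names follow the description above, not the actual parser code):

if not self.config.USE_LEXICAL:                  # delexicalized path (proposed flag)
    t_emb = upos_emb + xpos_emb + attrs_emb      # sum the morphological embeddings
    x_list.append(t_emb)                         # use only morphology for this entry
else:
    x_list.append(w_emb)                         # existing lexical path stays unchanged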

For PR to pass it should:

  • Be able to train on large datasets and provide results same as before the update (USE_LEXICAL=True, USE_MORPHOLOGY=False, PREDICT_MORPHOLOGY=True)
  • Be able to load pretrained models, which don't have the USE_LEXICAL flag at all
  • Be able to train and load delexicalized models for small treebanks.

bug.tokenizer

The tokenizer performs poorly on the development and test sets for Latin languages.

Expected Result

The EOS symbol (usually the DOT) should be a separate token in the sentence. There is no reason why the tokenizer should not learn this, especially since we already know it is a sentence delimiter.

Actual Result

The EOS symbol is merged with the previous word for some sentences.

Perform tokenization and sentence splitting in two separate steps

Currently we perform sentence splitting and tokenization in a single step. We need to split them into two steps, because the best tokenization F-score does not directly correlate with the best sentence F-score.

Expected Result

An optimal model should cope with both tokenization and sentence splitting.

Actual Result

We have two models (bestSS and bestTok), which are obtained at different training epochs.

Reproduction Steps

Just train the tokenizer

System Information

Please provide some basic information about your system:

  • GPU support (enabled/disabled): anu
  • Hyperparameters used: default
  • Training set used (embeddings, language, ud-version): 2.1, wiki.ro (Facebook)
  • Dynet version
  • Python version
  • Operating system

The future of the Cube project?

I see that the latest contribution to the project was over 11 months ago. What is the status and future of the Cube project?

Is there any stable use case for it under the Adobe umbrella, or was it a side-experiment?

Thanks

Add abstraction layer for storing models

Is your feature request related to a problem? Please describe.
We need a layer that deals with model storing/loading.
This class can be used from the API directly or used separately.

Our models are stored in the cloud, so we also need an interface for getting the latest model.

Describe the solution you'd like
A ModelStore class, that handles model loading/saving.

Missing files in sdist

It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.1.0.8. You're in good company, about 5% of other
projects updated in the last year are also missing files.

+ /tmp/venv/bin/pip3 wheel --no-binary nlpcube -w /tmp/ext nlpcube==0.1.0.8
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting nlpcube==0.1.0.8
  Downloading http://10.10.0.139:9191/root/pypi/%2Bf/ec7/2f7bc13654391/nlpcube-0.1.0.8.tar.gz (86 kB)
    ERROR: Command errored out with exit status 1:
     command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-pn3bj_mg/nlpcube/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-pn3bj_mg/nlpcube/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-pn3bj_mg/nlpcube/pip-egg-info
         cwd: /tmp/pip-wheel-pn3bj_mg/nlpcube/
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-wheel-pn3bj_mg/nlpcube/setup.py", line 20, in <module>
        install_requires = parse_requirements('requirements.txt', session=False),
      File "/tmp/pip-wheel-pn3bj_mg/nlpcube/setup.py", line 4, in parse_requirements
        lineiter = (line.strip() for line in open(filename))
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Add a list of supported languages

Is your feature request related to a problem? Please describe.
There is no way of telling whether NLP-Cube supports a certain language or not. Also, custom corpora introduce some suffixes which are hard to guess.

Describe the solution you'd like
Either update the documentation or add a function to list the languages that are available in the repo. I would prefer the latter.

Add regularization to MT

The cosine distance used so far is working, but regularization will probably help. Topological regularization should bring the two vectors together. The distance (i.e., 1 - cosine_sim) should be used for the cost function.
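As a reminder, the quantity in question is simply the following (a minimal numpy sketch, not the project's actual loss code):

import numpy as np
def cosine_distance(u, v):
    # 1 - cosine similarity; small when the two vectors point the same way
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))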

Expected Result

Performance Improvement for MT.

Add pause/resume capability for the models trainer

Expected Result

Save the current trainer state (weights, costs, etc) to files on disk to be able to stop and resume training at a later time.

Actual Result

Currently, when the trainer is interrupted, no state is saved, so when you rerun the trainer it starts from scratch.
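Roughly, the desired behaviour is a small checkpoint that can be written and read back (a generic sketch, not tied to the current trainer code; paths and field names are placeholders):

import json

def save_checkpoint(path, epoch, best_score, weights_path):
    # hypothetical: persist enough trainer state to resume later
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "best_score": best_score, "weights": weights_path}, f)

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)   # gives back the epoch, best score and weights location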

Action required: Greenkeeper could not be activated 🚨

🚨 You need to enable Continuous Integration on Greenkeeper branches of this repository. 🚨

To enable Greenkeeper, you need to make sure that a commit status is reported on all branches. This is required by Greenkeeper because it uses your CI build statuses to figure out when to notify you about breaking changes.

Since we didn’t receive a CI status on the greenkeeper/initial branch, it’s possible that you don’t have CI set up yet. We recommend using Travis CI, but Greenkeeper will work with every other CI service as well.

If you have already set up a CI for this repository, you might need to check how it’s configured. Make sure it is set to run on all new branches. If you don’t want it to run on absolutely every branch, you can whitelist branches starting with greenkeeper/.

Once you have installed and configured CI on this repository correctly, you’ll need to re-trigger Greenkeeper’s initial pull request. To do this, please click the 'fix repo' button on account.greenkeeper.io.
