adobe / nlp-cube

Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing

Home Page: http://opensource.adobe.com/NLP-Cube/index.html

License: Apache License 2.0

Python 34.69% Dockerfile 0.09% Shell 0.01% HTML 65.20%
embeddings parse nlp-cube language-pipeline tokenization sentence-splitting part-of-speech-tagger lemmatization dependency-parser dependency-parsing

nlp-cube's Introduction

News

[05 August 2021] - We are releasing version 3.0 of NLP-Cube and its models, and introducing FLAVOURS. This is a major update, but we did our best to keep the same API, so previous code should not break. The supported language list is smaller; if your language is missing, open an issue and we will do our best to add it. Alternatively, you can pin the pip package to the previous version (0.1.0.8): pip install nlpcube==0.1.0.8.

[15 April 2019] - We are releasing version 1.1 models - check all supported languages below. Both the 1.0 and 1.1 models are trained on the same UD 2.2 corpus; however, the 1.1 models do not use vector embeddings, which reduces the disk space and time required to use them. Some languages gain a little accuracy, others lose a little. By default, NLP-Cube will use the latest (at this time, 1.1) models.

To use the older 1.0 models, just specify the version in the load call: cube.load("en", 1.0) (en for English, or any other language code). This will download (if not already downloaded) and use that specific model version. The same goes for any language/version you want to use.

If you already have NLP-Cube installed and want to use the newer 1.1 models, type either cube.load("en", 1.1) or cube.load("en", "latest") to auto-download them. After this, calling cube.load("en") without a version number will automatically use the latest ones from your disk.
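Put together, version pinning looks roughly like this (a minimal sketch based on the calls above; exact behaviour may differ slightly between releases):

from cube.api import Cube       # import the Cube object
cube=Cube(verbose=True)         # initialize it
cube.load("en", 1.0)            # pin to the older 1.0 model (auto-downloads on first run)
# cube.load("en", "latest")     # or explicitly fetch the newest available version
# cube.load("en")               # afterwards, omitting the version uses the latest model on disk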


NLP-Cube

NLP-Cube is an open-source Natural Language Processing framework with support for the languages included in the UD Treebanks (the full list of available languages is below). Use NLP-Cube if you need:

  • Sentence segmentation
  • Tokenization
  • POS tagging (both language-independent (UPOS) and language-dependent (XPOS and ATTRs))
  • Lemmatization
  • Dependency parsing

Example input: "This is a test."; the output is:

1       This    this    PRON    DT      Number=Sing|PronType=Dem        4       nsubj   _
2       is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   4       cop     _
3       a       a       DET     DT      Definite=Ind|PronType=Art       4       det     _
4       test    test    NOUN    NN      Number=Sing     0       root    SpaceAfter=No
5       .       .       PUNCT   .       _       4       punct   SpaceAfter=No

If you just want to run it, here's how to set it up and use NLP-Cube in a few lines: Quick Start Tutorial.

For advanced users who want to create and train their own models, please see the Advanced Tutorials in examples/, starting with how to locally install NLP-Cube.

Simple (PIP) installation / update version

Install (or update) NLP-Cube with:

pip3 install -U nlpcube

API Usage

To use NLP-Cube programmatically (in Python), follow this tutorial. In summary:

from cube.api import Cube       # import the Cube object
cube=Cube(verbose=True)         # initialize it
cube.load("en", device='cpu')   # select the desired language (it will auto-download the model on first run)
text="This is the text I want segmented, tokenized, lemmatized and annotated with POS and dependencies."
document=cube(text)            # call with your own text (string) to obtain the annotations

The document object now contains the annotated text, one sentence at a time. To print the third word's POS (in the first sentence), just run:

print(document.sentences[0][2].upos) # [0] is the first sentence and [2] is the third word

Each token object has the following attributes: index, word, lemma, upos, xpos, attrs, head, label, deps, space_after. For detailed info about each attribute please see the standard CoNLL format.
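For example, a short loop over the document can reassemble lines close to the CoNLL output shown earlier (a sketch assuming the attribute names listed above; the exact field formatting is illustrative):

for sentence in document.sentences:                 # iterate over sentences
    for token in sentence:                          # iterate over tokens in a sentence
        fields = [token.index, token.word, token.lemma, token.upos, token.xpos,
                  token.attrs, token.head, token.label, token.deps, token.space_after]
        print("\t".join(str(f) for f in fields))    # one CoNLL-like line per token
    print()                                         # blank line between sentences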

Flavours

Previous versions of NLP-Cube were trained on individual treebanks. This meant that the same language was supported by multiple models at the same time: for instance, you could parse English (en) text with en_ewt, en_esl, en_lines, etc. The current version of NLP-Cube combines all flavours of a treebank under the same umbrella by jointly optimizing a conditioned model. You only need to load the base language, for example en, and then select which flavour to apply at runtime:

from cube.api import Cube       # import the Cube object
cube=Cube(verbose=True)         # initialize it
cube.load("en", device='cpu')   # select the desired language (it will auto-download the model on first run)
text="This is the text I want segmented, tokenized, lemmatized and annotated with POS and dependencies."


# Parse using the default flavour (in this case EWT)
document=cube(text)            # call with your own text (string) to obtain the annotations
# or you can specify a flavour
document=cube(text, flavour='en_lines') 

Webserver Usage

The current version dropped webserver support, since most people preferred to run their own service around NLP-Cube.

Cite

If you use NLP-Cube in your research we would be grateful if you would cite the following paper:

  • NLP-Cube: End-to-End Raw Text Processing With Neural Networks, Boroș, Tiberiu and Dumitrescu, Stefan Daniel and Burtica, Ruxandra, Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics. p. 171--179. October 2018

or, in bibtex format:

@InProceedings{boro-dumitrescu-burtica:2018:K18-2,
  author    = {Boroș, Tiberiu  and  Dumitrescu, Stefan Daniel  and  Burtica, Ruxandra},
  title     = {{NLP}-Cube: End-to-End Raw Text Processing With Neural Networks},
  booktitle = {Proceedings of the {CoNLL} 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {October},
  year      = {2018},
  address   = {Brussels, Belgium},
  publisher = {Association for Computational Linguistics},
  pages     = {171--179},
  abstract  = {We introduce NLP-Cube: an end-to-end Natural Language Processing framework, evaluated in CoNLL's "Multilingual Parsing from Raw Text to Universal Dependencies 2018" Shared Task. It performs sentence splitting, tokenization, compound word expansion, lemmatization, tagging and parsing. Based entirely on recurrent neural networks, written in Python, this ready-to-use open source system is freely available on GitHub. For each task we describe and discuss its specific network architecture, closing with an overview on the results obtained in the competition.},
  url       = {http://www.aclweb.org/anthology/K18-2017}
}

Languages and performance

For comparison, the performance of the 3.0 models is reported on the UD 2.2 corpus, but the distributed models are obtained from UD 2.7.

Results are reported against the test files for each language (available in the UD 2.2 corpus) using the 2018 CoNLL eval script. Please see more info about what each metric represents here.

Notes:

  • The 1.1 models no longer need the large external vector embedding files. This makes loading the 1.1 models faster and less RAM-intensive.
  • All reported results here are end-to-end (e.g., we test tagging accuracy on our own segmented text, as this is the real use-case; CoNLL results are mostly reported on "gold", pre-segmented text, leading to higher accuracy for the tagger/parser/etc.).
Language           Model   Token  Sentence  UPOS   XPOS   AllTags  Lemmas  UAS    LAS
Chinese            zh-1.0  93.03  99.10     88.22  88.15  86.91    92.74   73.43  69.52
Chinese            zh-1.1  92.34  99.10     86.75  86.66  85.35    92.05   71.00  67.04
Chinese            zh-3.0  95.88  87.36     91.67  83.54  82.74    85.88   79.15  70.08
English            en-1.0  99.25  72.8      95.34  94.83  92.48    95.62   84.7   81.93
English            en-1.1  99.2   70.94     94.4   93.93  91.04    95.18   83.3   80.32
English            en-3.0  98.95  75.00     96.01  95.71  93.75    96.06   87.06  84.61
French             fr-1.0  99.68  94.2      92.61  95.46  90.79    93.08   84.96  80.91
French             fr-1.1  99.67  95.31     92.51  95.45  90.8     93.0    83.88  80.16
French             fr-3.0  99.71  93.92     97.33  99.56  96.61    90.79   89.81  87.24
German             de-1.0  99.7   81.19     91.38  94.26  80.37    75.8    79.6   74.35
German             de-1.1  99.77  81.99     90.47  93.82  79.79    75.46   79.3   73.87
German             de-3.0  99.77  86.25     94.70  97.00  85.02    82.73   87.08  82.69
Hungarian          hu-1.0  99.8   94.18     94.52  99.8   86.22    91.07   81.57  75.95
Hungarian          hu-1.1  99.88  97.77     93.11  99.88  86.79    91.18   77.89  70.94
Hungarian          hu-3.0  99.75  91.64     96.43  99.75  89.89    91.31   86.34  81.29
Italian            it-1.0  99.89  98.14     86.86  86.67  84.97    87.03   78.3   74.59
Italian            it-1.1  99.92  99.07     86.58  86.4   84.53    86.75   76.38  72.35
Italian            it-3.0  99.92  98.13     98.26  98.15  97.34    97.76   94.07  92.66
Romanian (RO-RRT)  ro-1.0  99.74  95.56     97.42  96.59  95.49    96.91   90.38  85.23
Romanian (RO-RRT)  ro-1.1  99.71  95.42     96.96  96.32  94.98    96.57   90.14  85.06
Romanian (RO-RRT)  ro-3.0  99.80  95.64     97.67  97.11  96.76    97.55   92.06  87.67
Spanish            es-1.0  99.98  98.32     98.0   98.0   96.62    98.05   90.53  88.27
Spanish            es-1.1  99.98  98.40     98.01  98.00  96.6     97.99   90.51  88.16
Spanish            es-3.0  99.96  97.17     96.88  99.91  94.88    98.17   92.11  89.86

nlp-cube's People

Contributors

bogdananton, burtica, dumitrescustefan, filmaj, finqbucks, fluffybird2323, futurulus, keuhdall, koichiyasuoka, rscctest, ruxandraburtica, shauryauppal-1mg, spaskich, tiberiu44

nlp-cube's Issues

v1.1 models for 4 languages in performance table

The metrics table in the README reports performance numbers for v1.1 models in most languages. I'm assuming that for languages that only have v1.0 results in the table, v1.1 models are not yet available. However, for four languages, v1.1 performance numbers are included in the table, but only v1.0 models are available in the online model repository:

  • Ukrainian
  • Urdu
  • Vietnamese
  • Chinese

Just double-checking that these models weren't left out unintentionally!

Classical Chinese Model needed

I've almost finished building the UD_Classical_Chinese-Kyoto treebank, and now I'm trying to make a Classical Chinese model for NLP-Cube (please check my diary). But in my model sentence_accuracy<35 and I can't split "天平二年正月十三日萃于帥老之宅申宴會也于時初春令月氣淑風和梅披鏡前之粉蘭薰珮後之香加以曙嶺移雲松掛羅而傾盖夕岫結霧鳥封縠而迷林庭舞新蝶空歸故鴈於是盖天促膝飛觴忘言一室之裏開衿煙霞之外淡然自放快然自足若非翰苑何以攄情詩紀落梅之篇古今夫何異矣宜賦園梅聊成短詠" into sentences (check the gold standard here). How do I tune sentence splitting for Classical Chinese?

Getting stuck at "Configuring tzdata"

Describe the bug
I'm trying to build a docker image and use NLP-Cube as a web service, but it gets stuck on the "Configuring tzdata" phase, where I have to choose my geographic area. The problem is that the console is unresponsive and I can't choose any of the given options. I've tried the whole process on multiple Windows machines running Docker for Windows and I always get the same result.

To Reproduce
Steps to reproduce the behavior:

  1. Clone the repository
  2. Go to 'repo-dir/docker'
  3. Execute docker build --tag nlp-cube:1.0 .
  4. Wait until it reaches the 'Configuring tzdata' phase

Expected behavior
I want to run NLP-Cube in a docker container and use it as a web API for sentence splitting, tokenization, lemmatization, etc. According to the documentation, I should be able to do this by starting the server and accessing container:port/nlp?lang=en&text=test.
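For reference, the kind of call I expect to work against the running container would be something like this (a sketch; the host and port are placeholders for however the container is exposed):

import requests   # assumes the requests package is installed
resp = requests.get("http://localhost:5000/nlp",               # hypothetical host:port mapping
                    params={"lang": "en", "text": "test"})
print(resp.status_code, resp.text)                              # expect 200 and the annotations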

Desktop (please complete the following information):

  • OS: Windows 10 Pro 1909
  • Docker for Windows v.3.0.0(50684)

Batch size is being ignored by tagger during training

Tagger is ignoring batch-size parameter

Expected Result

Increasing the batch-size should decrease accuracy for the first epoch and increase the loss. Also, it should reduce the bad-gradient 'nan' error.

Actual Result

Regardless of the batch size, the results seem to be the same. Also, after a maximum of 20 epochs I get a bad-gradient 'nan' error.

ConllEntry should have a __repr__ function

Right now, output looks like:

text="This is the text I want segmented, tokenized, lemmatized and annotated with POS and dependencies."
sentences=cube(text)
print(sentences)
[[<cube.io_utils.conll.ConllEntry object at 0x7f5ae924b358>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252080>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae92520f0>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252160>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae92521d0>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252240>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae92522b0>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252320>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252390>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252400>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252470>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae92524e0>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252550>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae92525c0>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252630>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae92526a0>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252710>, <cube.io_utils.conll.ConllEntry object at 0x7f5ae9252780>]]

when we print a sentence object. We should implement a __repr__ function so the output looks prettier.
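Something along these lines would already help (a hypothetical sketch, not the actual ConllEntry code; the fields mirror the CoNLL attributes used elsewhere in the API):

class ConllEntry:
    # ... existing fields and methods ...
    def __repr__(self):
        # hypothetical: show a few key fields instead of the default object address
        return "ConllEntry({}: {}/{} {})".format(self.index, self.word, self.lemma, self.upos)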

Update configuration examples

This is related to the previously closed issue #35.
The configuration example in examples/parser.conf has to be updated with the new use_lexical flag. Also, the config reader should raise an error and fail if both use_lexical and use_morphology are set to False.

ERROR in app: Exception on /nlp [GET]

Describe the bug
I'm getting the following errors on some of the requests when running in a docker container as a web server:

10.42.109.126 - - [04/Feb/2021 14:38:55] "GET /nlp?lang=id&text=WE+Online%2C+Jakarta+-+%0A++Indeks+Harga+Saham+Gabungan+%28IHSG%29+masih+mampu+bangkit+0.48%25+atau29.47+poin+ke+level+6%2C107.22+dengan+saham-saham+disektor+Aneka+Industri+naik+2.11%25%2C+Infrastruktur+0.96%25+dan+Keuangan+0.81%25.++%0A++Analis+Reliance+Sekuritas%2C+Lanjar+Nafi+mengungkapkan+bila+melesatnya+harga+saham+PT+Indah+Kiat+Pulp+and+Paper+Tbk+%28INKP%29+yang+sebesar+12.2%25+dan+PT+Tjiwi+Kimia+Tbk+%28TKIM%29+sebesar+19.9%25+menjadi+motor+pengerak+IHSG+pada+perdagangan+4+Februari+2021.++%0A++Baca+Juga%3A+Sempat+Anjlok+ke+Zona+Merah%2C+IHSG+Parkir+dengan+Apresiasi+0%2C48%25+pada+Penutupan+Sesi+II.++%0A++%22Ini+setelah+harga+pulp+naik+karena+Paper+Excellence+BV+yang+di+kendalikan+oleh+Jackson+Wijaya+dari+Indonesia+memenangkan+kasus+abritrase+terhadap+grub+brazil+J%26F+Investimentos+SA+untuk+menyelesaikan+akuisisi+pabrik+kertas+Eldorado+Brazil+serta.+Kemudian+juga+karena+aksi+beli+investor+asing+sseniali+Rp609%2C15+miliar%2C%22+terangnya%2C+di+Jakarta%2C+Kamis+%284%2F2%2F2021%29.++%0A++Baca+Juga%3A+Mayoritas+Saham+Bank-bank+BUMN+Sumringah+Setelah+Erick+Lempar+Pujian.++%0A++Menurutnya%2C+secara+teknikal+IHSG+bergerak+terkonsolidasi+namun+terlihat+cukup+kuat+diatas+level+rata-rata+50+hari+dan+5+hari+sebagai+support+pergerakan.+Selanjutnya+IHSG+berpotensi+menguji+resistance+rata-rata+20+hari+sebagai+indikasi+pembentukan+uptrend+jangka+pendek+menuju+upper+bollinger+bands.+Indikator+Stochastic+terkonfirmasi+alami+momentum+bullish+dengan+MACD+yang+bergerak+positif.++%0A++%22Sehingga+diperkirakan+IHSG+bergerak+melanjutkan+penguatan+diakhir+pekan+dengan+support+resistance+6051-6239.+Saham-saham+yang+dapat+dicermati+secara+teknikal+diantaranya%3B+ACST%2C+ADRO%2C+BMRI%2C+HOKI%2C+INDY%2C+PTBA%2C+PTPP%2C+RALS%2C+TBLA%2C+WSKT%2C+dan+WSBP%2C%22+tutup+Lanjar. HTTP/1.1" 500 -
[2021-02-04 14:38:55,773] ERROR in app: Exception on /nlp [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "webserver.py", line 71, in nlp
    result = thecube(query)
  File "../cube/api.py", line 184, in __call__
    sequences+=self._tokenizer.tokenize(input_line)
  File "../cube/generic_networks/tokenizers.py", line 384, in tokenize
    if np.argmax(y.npvalue()) == 1:
  File "_dynet.pyx", line 730, in _dynet.Expression.npvalue
  File "_dynet.pyx", line 745, in _dynet.Expression.npvalue
ValueError: Attempt to do tensor forward in different devices (nodes 20330 and 39)
10.42.109.126 - - [04/Feb/2021 14:38:55] "GET /nlp?lang=id&text=Grab-Gojek+Dapat+Saingan+Baru+di+Negeri+Singa HTTP/1.1" 500 -
[2021-02-04 14:38:55,872] ERROR in app: Exception on /nlp [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "webserver.py", line 71, in nlp
    result = thecube(query)
  File "../cube/api.py", line 184, in __call__
    sequences+=self._tokenizer.tokenize(input_line)
  File "../cube/generic_networks/tokenizers.py", line 410, in tokenize
    seq = self._get_tokens(w.strip(), space_after_end_of_sentence=space_after_end_of_sentence)
  File "../cube/generic_networks/tokenizers.py", line 320, in _get_tokens
    y_pred, _, _ = self._predict_tok(input_string, runtime=True)
  File "../cube/generic_networks/tokenizers.py", line 264, in _predict_tok
    word_lstm = word_lstm.add_input(dy.tanh(emb + hol))
  File "_dynet.pyx", line 6026, in _dynet.RNNState.add_input
  File "_dynet.pyx", line 6036, in _dynet.RNNState.add_input
  File "_dynet.pyx", line 4941, in _dynet._RNNBuilder.add_input_to_prev
ValueError: Using stale builder. Create .new_graph() after computation graph is renewed.

I can't reproduce the issue when running the image locally.

To Reproduce
Steps to reproduce the behavior:

  1. Build and run the image in a production docker environment
  2. Wait for the models to download.
  3. Send the following request 10.42.109.126 - - [04/Feb/2021 14:38:55] "GET /nlp?lang=id&text=Grab-Gojek+Dapat+Saingan+Baru+di+Negeri+Singa HTTP/1.1" 500 -
  4. See error

Expected behavior
When running locally, everything works fine and Cube returns the expected output.

Use only one feature instead of all of them

It would be very helpful to use just one of the features NLP-Cube offers instead of being forced to use all of them.

Due to computation times, depending on the project, NLP-Cube would be perfect if you could run only the part you want to use instead of having to load all the models. For example, I only want to use the lemmatizer, as it works well for Spanish. I need my text to be tokenized in a specific way, and I don't want to use either the NLP-Cube tokenizer or the rest of the features.

I suggest adding another way to run NLP-Cube apart from sentences = cube(text).
For example, lemmas = cube.lemmatizer(tokens) would be a nice approach, where tokens is a list containing the tokens of a sentence and the call returns another list containing the lemma of each word.
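A hypothetical usage sketch of the proposed call (not an existing API; the name follows the suggestion above):

tokens = ["Los", "gatos", "duermen", "."]     # pre-tokenized sentence supplied by the user
lemmas = cube.lemmatizer(tokens)              # hypothetical: would return one lemma per token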

The NlpCloud seems to be down

Describe the bug
Apparently the NLP cloud at http://nlp-cube.speed.pub.ro/models/ is out of service.

To Reproduce
Run the following commands

from cube.api import Cube

nlp_cube = Cube()
nlp_cube.load("en")

Expected behavior
It's expected that the cube library downloads the models from its cloud path.

Restructure entry point

Need

Now, we use cube/main.py as an entry point for testing, training and running a server. Update this entry point to have sub-commands for each of the use-cases.

Deliverables

  • Identify which parameters are needed for each of the following sub-commands:
    • train
    • test
    • runserver
  • Update the entry-point so that these commands can easily be used separately.

Enhanced compound word expander

Based on the success of the FST attention-free lemmatizer, the compound word expander can be modified to work in a similar way.

Expected Result

Faster training time and better accuracy for the compound word expander.

Error on installing NLP Cube

When I run the pip3 install -U nlpcube command I get the following error:

lineiter = (line.strip() for line in open(filename))
FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-4r3qjuhb/nlpcube/

Is it a problem with something I'm doing? I am only trying to install it via pip.

pip-ify NLP-Cube

In the near future we should be able to install this as a Python package, e.g. "pip install nlpcube". This is a major issue, as it also involves having the Python API ready for all tasks.

Expected Result

Initial results:

  1. Install by "pip install"
  2. Import it like "import nlpcube as cube"

Later results - decide on the usage API:
3. Agree on a way to use it: either in a static manner (e.g. cube.load(), cube.tokenize(), etc., like NLTK) or object-centered (e.g. instantiate a Cube object, set a pipeline from the start and then just call the wrapper function, or let the user call the individual functions of the object (sentence split/tok/tag/etc.) sequentially).

Kazakh language wrong result

Evaluated example from readme for kazakh language with no error, but result is wrong. English language works fine.

Expected Result

I expected a list of words with their correct normalized forms, but for every word the normalized form consists of only one letter.

Actual Result

[[1 Алтай а NOUN adj Case=Gen 2 nmod:poss _ _,
2 жерінің ж NOUN n _ 3 obl _ _,
3 асты а VERB adj _ 4 nsubj _ _,
4 қандай қ PRON adv _ 5 nsubj _ _,
5 қазыналы қ VERB n _ 0 root _ _,
6 болса б VERB v Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 5 cop _ SpaceAfter=No,
7 . . PUNCT sent _ 5 punct _ _],
[1 Ағаш а VERB v Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 0 root _ SpaceAfter=No,
2 . . PUNCT sent _ 1 punct _ SpaceAfter=No]]

Reproduction Steps

from cube.api import Cube # import the Cube object
cube=Cube(verbose=True) # initialize it
cube.load("kk") # select the desired language (it will auto-download the model on first run)
text="Алтай жерінің асты қандай қазыналы болса. Ағаш."
sentences=cube(text) # call with your own text (string) to obtain the annotations
sentences

System Information

  • Python version 3.6.12
  • Operating system Ubuntu 20.04

Full training and model import example

We need to provide a jupyter example and create the associated script that will allow users to:

  1. download the UD corpus
  2. train a full model
  3. package the full model in a zip file
  4. import the model in their local model storage folder

Error while training a model

Hi, I am trying to train my own NLP-Cube model for the Uyghur language. I carefully followed all your tutorials, and the TrainSet, DevSet and TestSet are all in CoNLL-U format.

Once I run the command to train the default tokenizer, I get an error message, as you can read in the tokenizer.log file (attached below). I ran the following command:

python3 /home/chris/Documents/NLPCube/NLP-Cube/cube/main.py --train=tokenizer --train-file=/home/chris/Documents/NLPCube/My_Model/trainSet.conllu --dev-file=/home/chris/Documents/NLPCube/My_Model/devSet.conllu --raw-train-file=/home/chris/Documents/NLPCube/My_Model/trainSet_raw.txt --raw-dev-file=/home/chris/Documents/NLPCube/My_Model/devSet_raw.txt --embeddings /home/chris/Documents/NLPCube/wiki.ug.vec --store /home/chris/Documents/NLPCube/My_Model/tokenizer --batch-size 1000 --set-mem 8000 --autobatch --patience 20 &> /home/chris/Documents/NLPCube/My_Model/tokenizer.log

Information about my system:

  • GPU support disabled
  • Python version 3.8
  • Operating system Linux Ubuntu
    tokenizer.log

Thanks in advance for your help,
Chris.

Integration testing of NLP-Cube

We need tests to run to determine:

  • running status of all main.py functions
  • running status of all api.py functions
  • running status of model_store import/export/listing functions

Tokenizer training bug

Training fails if there is a sentence that has only one token and the _create_mixed_sentences function randomly tries to pick only part of that sentence.

Add notebook(s) for tokenization

Is your feature request related to a problem? Please describe.
It's related to the adoption of NLP-cube

Describe the solution you'd like
A Jupyter notebook example that emphasizes how tokenization can be done.

Issues with nn(Norwegian nynorsk) and nb(Norwegian bokmål).

When I run the code, it returns this error:
<< Exception("No model version for language ["+lang_code+"] was found in the online repository!")
Exception: No model version for language [nb] was found in the online repository!>>

My code works with other model languages like Danish, Swedish, French and Old Church Slavonic.
I use PyCharm Community with Python 3.6.
Is Norwegian actually supported?

Possible bug - parameter updating

It's possible that, due to changes in DyNet, the default for .expr() is now update=False, which freezes the parameters so no learning is performed. The correction would be to explicitly set .expr(update=True).
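In other words, wherever a parameter expression is created, the fix would look roughly like this (a sketch; self.param_w is a placeholder, the real call sites are spread across the networks code):

w = self.param_w.expr()             # old: may silently freeze the parameter under newer DyNet defaults
w = self.param_w.expr(update=True)  # proposed: request gradient updates explicitly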

CONLL 2018 Shared Task participation

We should participate in this year's shared task: Multilingual Parsing from Raw Text to Universal Dependencies ( http://universaldependencies.org/conll18/ )

This means we need to prepare the code & scripts for this task.

As far as I can see right now, we have three distinct cases:

  • we have end-to-end decoding
  • we start from udpipe tokenization and perform tagging, parsing, lemmatization
  • we start from udpipe tokenization, tagging and lemmatization and we perform parsing

TODO

  • update the runtime CLI to support taking as input a list of models that need to be run on the input, for instance: --run=[tokenization, parsing, tagging, lemmatization] or --run=[parsing, tagging]
  • implement scripts tailored for the UD Shared Task, which take as input the supplied XML with the list of input test files and run a custom NLPCube pipeline, depending on the language code
  • deploy on TIRA testing environment

Expected Result

Training and evaluation scripts, trained models, and handling of low-resourced minor languages.

Steps for setting up NLP-cube

Need

Document the steps needed for a developer to set up the development environment with NLP-cube.

Deliverables

We need a complete process to help developers onboard the project. This implies:

  • steps needed in order to use the repository
  • scripts for downloading needed resources (e.g. pre-trained models, data)

Script should download:

Does Cube support enhanced/collapsed dependency parsing?

Currently our service is based on the Java-based Stanford CoreNLP library, and we are looking to migrate our backend systems from Java to Python. Does Cube provide enhanced dependency annotations, and if yes, how would one go about getting them? Thanks

Delexicalized version for parser

Small treebanks in UD are not suitable for our model. As such, we should train delexicalized versions of the parser by combining compatible treebanks for those languages. We could also train delexicalized parsers directly on the small treebanks, because using only morphology should not cause data sparseness on small datasets.

Expected Result

We need to add another flag in the parser's config file for using lexical features or not: USE_LEXICAL
This should be similar to USE_MORPHOLOGY

Details for implementation

io_utils/config.py <- we need to add the attribute here. We also need to be backwards-compatible with older config files which don't have this flag; the default value should be True.
generic_networks/parsers.py

  • __init__(self, ...) should use this information when computing the input (lines 45--62)
  • make_input(self, ...) should use this information to create the input: I recommend branching the execution before line 260 (see the sketch after this list). I don't think dropout or any other mechanism will help prevent overfitting, so right now we should not be doing it. Just go through each entry, look up the morphological information and sum up the vectors (t_emb = upos_emb + xpos_emb + attrs_emb), then add to the sequence: x_list.append(w_emb)
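A rough sketch of the proposed branch inside make_input (illustrative only; variable names follow the description above, not the actual parser code):

if not self.config.USE_LEXICAL:                  # delexicalized path (proposed flag)
    t_emb = upos_emb + xpos_emb + attrs_emb      # sum the morphological embeddings
    x_list.append(t_emb)                         # use only morphology for this entry
else:
    x_list.append(w_emb)                         # existing lexical path stays unchanged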

For PR to pass it should:

  • Be able to train on large datasets and provide results same as before the update (USE_LEXICAL=True, USE_MORPHOLOGY=False, PREDICT_MORPHOLOGY=True)
  • Be able to load pretrained models, which don't have the USE_LEXICAL flag at all
  • Be able to train and load delexicalized models for small treebanks.

bug.tokenizer

The tokenizer performs poorly on the development and test sets for Latin languages.

Expected Result

The EOS symbol (usually the DOT) should be a separate token in the sentence. There is no reason why the tokenizer should not learn this, especially since we already know it is a sentence delimiter.

Actual Result

The EOS symbol is merged with the previous word for some sentences.

Perform tokenization and sentence splitting in two separate steps

Currently we perform sentence splitting and tokenization in a single step. We need to split them into two steps, because the best tokenization F-score does not directly correlate with the best sentence F-score.

Expected Result

An optimal model should cope with both tokenization and sentence splitting.

Actual Result

We have two models (bestSS and bestTok), which are obtained at different training epochs.

Reproduction Steps

Just train the tokenizer

System Information

Please provide some basic information about your system:

  • GPU support (enabled/disabled): anu
  • Hyperparameters used: default
  • Training set used (embeddings, language, ud-version): 2.1, wiki.ro (Facebook)
  • Dynet version
  • Python version
  • Operating system

The future of the Cube project?

I see that the latest contribution to the project was over 11 months ago. What is the status and future of the Cube project?

Is there any stable use case for it under the Adobe umbrella, or was it a side-experiment?

Thanks

Add abstraction layer for storing models

Is your feature request related to a problem? Please describe.
We need a layer that deals with model storing/loading.
This class can be used from the API directly or used separately.

Our models are stored in the cloud, so we also need an interface for getting the latest model.

Describe the solution you'd like
A ModelStore class, that handles model loading/saving.

Missing files in sdist

It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.1.0.8. You're in good company, about 5% of other
projects updated in the last year are also missing files.

+ /tmp/venv/bin/pip3 wheel --no-binary nlpcube -w /tmp/ext nlpcube==0.1.0.8
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting nlpcube==0.1.0.8
  Downloading http://10.10.0.139:9191/root/pypi/%2Bf/ec7/2f7bc13654391/nlpcube-0.1.0.8.tar.gz (86 kB)
    ERROR: Command errored out with exit status 1:
     command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-pn3bj_mg/nlpcube/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-pn3bj_mg/nlpcube/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-pn3bj_mg/nlpcube/pip-egg-info
         cwd: /tmp/pip-wheel-pn3bj_mg/nlpcube/
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-wheel-pn3bj_mg/nlpcube/setup.py", line 20, in <module>
        install_requires = parse_requirements('requirements.txt', session=False),
      File "/tmp/pip-wheel-pn3bj_mg/nlpcube/setup.py", line 4, in parse_requirements
        lineiter = (line.strip() for line in open(filename))
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Add a list of supported languages

Is your feature request related to a problem? Please describe.
There is no way of telling whether NLP-Cube supports a certain language or not. Also, custom corpora introduce some suffixes which are hard to guess.

Describe the solution you'd like
Either update the documentation or add a function to list the languages that are available in the repo. I would prefer the latter.

Add regularization to MT

The cosine distance used so far is working, but regularization will probably help. Topological regularization should bring the two vectors together. The distance (i.e., 1 - cosine_sim) should be used for the cost function.
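As a reminder, the quantity in question is simply the following (a minimal numpy sketch, not the project's actual loss code):

import numpy as np
def cosine_distance(u, v):
    # 1 - cosine similarity; small when the two vectors point the same way
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))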

Expected Result

Performance Improvement for MT.

Add pause/resume capability for the models trainer

Expected Result

Save the current trainer state (weights, costs, etc) to files on disk to be able to stop and resume training at a later time.

Actual Result

Currently, when the trainer is interrupted, no state is saved, so when you rerun the trainer it starts from scratch.
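Roughly, the desired behaviour is a small checkpoint that can be written and read back (a generic sketch, not tied to the current trainer code; paths and field names are placeholders):

import json

def save_checkpoint(path, epoch, best_score, weights_path):
    # hypothetical: persist enough trainer state to resume later
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "best_score": best_score, "weights": weights_path}, f)

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)   # gives back the epoch, best score and weights location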

Action required: Greenkeeper could not be activated 🚨

🚨 You need to enable Continuous Integration on Greenkeeper branches of this repository. 🚨

To enable Greenkeeper, you need to make sure that a commit status is reported on all branches. This is required by Greenkeeper because it uses your CI build statuses to figure out when to notify you about breaking changes.

Since we didn’t receive a CI status on the greenkeeper/initial branch, it’s possible that you don’t have CI set up yet. We recommend using Travis CI, but Greenkeeper will work with every other CI service as well.

If you have already set up a CI for this repository, you might need to check how it’s configured. Make sure it is set to run on all new branches. If you don’t want it to run on absolutely every branch, you can whitelist branches starting with greenkeeper/.

Once you have installed and configured CI on this repository correctly, you’ll need to re-trigger Greenkeeper’s initial pull request. To do this, please click the 'fix repo' button on account.greenkeeper.io.
