
elit's People

Contributors

bgshin, eric-haibin-lin, hankcs, imgarylai, jdchoi77, lxucs, tlee54, zhengzheyang


elit's Issues

Get ValueError when the Input format is line

When I use line as the input format for sentiment analysis, I get this error message:

ValueError: Error when checking : expected input_2 to have 3 dimensions, but got array with shape (0, 1)

The code I executed is:

import os
from io import StringIO

from elit.decode import EnglishDecoder, DOC_DELIM
from elit.util.configure import *

# Please replace with the correct resource path.
RESOURCE_PATH = os.path.join('resources')
local_filename = "test.txt"

config = Configuration(tokenize=True, input_format='line', segment=True, sentiment=(SENTIMENT_TWITTER, SENTIMENT_MOVIE))

nd = EnglishDecoder(resource_dir=RESOURCE_PATH, config=config)

with open(local_filename, 'r', encoding='utf8') as f:
    result = nd.decode(config, StringIO(f.read()))

The file, test.txt, I tested:

Longer version of the error message:

  File "/Users/gary/Documents/research/elitrpc/application.py", line 73, in decode
    result = nd.decode(config, StringIO(f.read()))
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/elit/decode.py", line 44, in decode
    d = decode(config, istream, ostream)
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/elit/decode.py", line 89, in decode_line
    d = self.text_to_sentences(config, line, offset)
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/elit/decode.py", line 182, in text_to_sentences
    self.sentiment_analyze(config, sentences)
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/elit/decode.py", line 202, in sentiment_analyze
    y, att = analyzer.decode(sens, att=att)
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/elit/component/sentiment.py", line 73, in decode
    y = self.p_model.predict(x, batch_size=batch_size, verbose=0)
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/keras/engine/training.py", line 1695, in predict
    check_batch_axis=False)
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/keras/engine/training.py", line 132, in _standardize_input_data
    str(array.shape))
ValueError: Error when checking : expected input_2 to have 3 dimensions, but got array with shape (0, 1)
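A plausible workaround until this is fixed, assuming the (0, 1) input shape comes from blank lines in test.txt producing an empty batch (this cause is a guess from the traceback, not confirmed):

from io import StringIO

with open(local_filename, 'r', encoding='utf8') as f:
    # drop blank lines so the line-format decoder never sees an empty document
    text = '\n'.join(line.rstrip('\n') for line in f if line.strip())
result = nd.decode(config, StringIO(text))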

Unit tests

Unit tests are needed for:

  • elit.structure.*
  • elit.reader.*
  • elit.model.*

Reader example:

# assuming TSVReader is importable from elit.reader (per the module list above)
from elit.reader import TSVReader

filename = '../../resources/sample.tsv'
reader = TSVReader(filename, 1, 2, 3, 4, 5, 6, 7, 8)

# read the next graph
graph = reader.next()
print(str(graph)+'\n')

# read the rest
for graph in reader:
    print(str(graph)+'\n')
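A minimal unittest sketch for the reader bullet above; the import path and the sample file are assumptions carried over from the example:

import unittest

from elit.reader import TSVReader   # assumed module path


class TestTSVReader(unittest.TestCase):
    def test_reads_every_graph(self):
        reader = TSVReader('../../resources/sample.tsv', 1, 2, 3, 4, 5, 6, 7, 8)
        graphs = list(reader)
        self.assertGreater(len(graphs), 0)                   # sample file is non-empty
        self.assertTrue(all(g is not None for g in graphs))  # every graph parsed


if __name__ == '__main__':
    unittest.main()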

elit/nlp/dep/parser/common/data.py

word_dims = len(data)

...
elit_dep_biaffine_en_mixed.load('{}/data/model/dep/jumbo-fasttext100'.format(ELIT_PATH))
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/dep/parser/parser.py", line 168, in load
    self._parser = self._create_parser(self._config, self._vocab)
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/dep/parser/parser.py", line 209, in _create_parser
    config.mlp_rel_size, config.dropout_mlp, config.debug)
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/dep/parser/biaffine_parser.py", line 55, in __init__
    trainable=False) if vocab.has_pret_embs() else None
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/dep/parser/common/data.py", line 174, in get_pret_embs
    word_dims = len(data)
UnboundLocalError: local variable 'data' referenced before assignment
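The traceback suggests that the loop in get_pret_embs which should assign data never runs (for example, the embedding file is missing or empty), so len(data) fails. A defensive sketch, with variable names assumed from the snippet above rather than taken from the actual source:

data = None
with open(pret_file, encoding='utf-8') as f:   # pret_file: hypothetical name
    for line in f:
        data = line.rstrip().split(' ')
        break                                  # one row suffices for the dimension
if data is None:
    raise ValueError('no pretrained embeddings found in %s' % pret_file)
word_dims = len(data)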

Path should be passed to Embedding

embedding_types = [
    WordEmbeddings('data/embedding/fasttext100.vec.txt'),
    # comment in this line to use character embeddings
    # CharacterEmbeddings(),
    # comment in these lines to use contextual string embeddings
    CharLMEmbeddings('data/model/lm-news-forward'),
    CharLMEmbeddings('data/model/lm-news-backward'),
]

Since elit can be installed through pip or run on a server, the word embedding path is not always inside the project directory, so the path should be passed in explicitly.
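A sketch of the request: resolve embedding paths against a configurable root instead of the project directory. The resolve_embedding helper and the ELIT_EMBEDDING_ROOT variable are assumptions for illustration, not the actual API:

import os

def resolve_embedding(path, embedding_root=None):
    """Resolve a relative embedding path against a configurable root."""
    root = embedding_root or os.environ.get('ELIT_EMBEDDING_ROOT', '')
    return path if os.path.isabs(path) else os.path.join(root, path)

embedding_types = [
    WordEmbeddings(resolve_embedding('data/embedding/fasttext100.vec.txt')),
    CharLMEmbeddings(resolve_embedding('data/model/lm-news-forward')),
    CharLMEmbeddings(resolve_embedding('data/model/lm-news-backward')),
]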

move out nlp components

This elit project will only be the SDK for deployment to the cloud. Since we overlap heavily with GluonNLP, it may be better for us to contribute our NLP models there; then we would not have to maintain a unified API interface for each NLP component.

In the future, developers won't have to worry about how to implement their components. The only thing they should care about is the decode function.

I have created a snapshot at https://github.com/elitcloud/elit/tree/nlp for this move-out process. I'm going to move out every component after commit 9e4f185.

@hankcs would you please move your parser to https://github.com/elitcloud/parser?

Please leave your ideas/thoughts here for further discussion.

Best

`load` functions

The load functions need to be updated as follows (a sketch follows the list):

  • It takes two parameters model_path and model_root:
    • model_path indicates either the name of a model (e.g., elit_pos_flair_en_mixed_*) or the URL to a compressed file in the S3 bucket, and does not have a default value.
    • model_root points to the root directory in the local machine where all models are saved, and has the default value of ~/.elit/models/.
  • If model_path points to a remote compressed file:
    • It downloads and unzips the compressed file to model_root, which creates a directory with the same name as the remote file (e.g., ~/.elit/models/elit_pos_flair_en_mixed_*).
    • If the downloaded model has dependencies to other models indicated in config.json (e.g., elit_lm_flair-bw_en_mixed_*), it downloads all dependent files and unzips them under model_root. This dependency resolution needs to be called recursively.
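A minimal sketch of the behavior described above; the download/unzip helpers and the config.json 'dependencies' key are assumptions, not the actual API:

import json
import os
import urllib.request
import zipfile

DEFAULT_MODEL_ROOT = os.path.expanduser('~/.elit/models/')

def load(model_path, model_root=DEFAULT_MODEL_ROOT):
    """Resolve model_path to a local directory under model_root, downloading if needed."""
    os.makedirs(model_root, exist_ok=True)
    if model_path.startswith(('http://', 'https://')):       # remote compressed file
        name = os.path.basename(model_path)
        name = name[:-4] if name.endswith('.zip') else name
        local_dir = os.path.join(model_root, name)
        if not os.path.isdir(local_dir):
            zip_path = local_dir + '.zip'
            urllib.request.urlretrieve(model_path, zip_path)
            with zipfile.ZipFile(zip_path) as z:
                z.extractall(model_root)   # assumes the archive unpacks to model_root/name
            os.remove(zip_path)
    else:                                                    # name of an already-local model
        local_dir = os.path.join(model_root, model_path)
    # recursive dependency resolution via config.json ('dependencies' key is an assumption)
    config_file = os.path.join(local_dir, 'config.json')
    if os.path.isfile(config_file):
        with open(config_file) as f:
            for dep in json.load(f).get('dependencies', []):
                load(dep, model_root)
    return local_dir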

AttributeError

When I run my own data set through decode, it gives an error:

for tool in tools:
    shuf = tool.decode(docs)

AttributeError Traceback (most recent call last)
in
1 for tool in tools:
----> 2 shuf = tool.decode(docs)

~/anaconda3/lib/python3.7/site-packages/elit/component/tagger/pos_tagger.py in decode(self, docs, **kwargs)
61 if isinstance(docs, Document):
62 docs = [docs]
---> 63 samples = NLPTaskDataFetcher.convert_elit_documents(docs)
64 with self.context:
65 sentences = self.tagger.predict(samples)

~/anaconda3/lib/python3.7/site-packages/elit/component/tagger/corpus.py in convert_elit_documents(docs)
1298 dataset = []
1299 for d in docs:
-> 1300 for s in d.sentences:
1301 sentence = Sentence()
1302

AttributeError: 'str' object has no attribute 'sentences'

but it works fine for the example given in the documentation!
Please help me figure this out.
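A likely cause, judging from the traceback: decode() expects Document objects, but docs here is a list of raw strings. A sketch of the fix, reusing the tokenize-then-decode pattern shown in the "Things To Do" issue below:

from elit.component import EnglishTokenizer

tok = EnglishTokenizer()
docs = tok.decode('the raw text of my data set ...')   # produces Documents with .sentences
for tool in tools:
    shuf = tool.decode(docs)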

Lexicon

TODO

  • Test case of lexicon
  • Test case of word2vector
  • Test case of fasttext

#6

Compound words tokenization failure

Expected Behavior

Compound words (e.g. pick-me-up, hand-me-down, know-it-all, etc.) should be tokenized as single tokens.

Actual Behavior

Hyphens are treated as separators, and the components are tokenized separately.
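A minimal illustration of the report; the tokenizer usage follows the pattern shown elsewhere in this tracker, and the commented outputs paraphrase the expected and reported behavior rather than verified output:

from elit.component import EnglishTokenizer

tok = EnglishTokenizer()
docs = tok.decode('That nap was a real pick-me-up.')
# expected: 'pick-me-up' kept as a single token
# reported: split into 'pick', '-', 'me', '-', 'up'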

read_pretrained_embeddings

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/wsgi.py", line 142, in __call__
    response = self.get_response(request)
  File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/base.py", line 78, in get_response
    response = self._middleware_chain(request)
  File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 36, in inner
    response = response_for_exception(request, exc)
  File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 90, in response_for_exception
    response = handle_uncaught_exception(request, get_resolver(get_urlconf()), sys.exc_info())
  File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 128, in handle_uncaught_exception
    callback, param_dict = resolver.resolve_error_handler(500)
  File "/usr/local/lib/python3.5/dist-packages/django/urls/resolvers.py", line 546, in resolve_error_handler
    callback = getattr(self.urlconf_module, 'handler%s' % view_type, None)
  File "/usr/local/lib/python3.5/dist-packages/django/utils/functional.py", line 37, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/usr/local/lib/python3.5/dist-packages/django/urls/resolvers.py", line 526, in urlconf_module
    return import_module(self.urlconf_name)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "./config/urls.py", line 23, in <module>
    path('api/', include('api.urls')),
  File "/usr/local/lib/python3.5/dist-packages/django/urls/conf.py", line 34, in include
    urlconf_module = import_module(urlconf_module)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "./api/urls.py", line 18, in <module>
    from api import views
  File "./api/views.py", line 10, in <module>
    from api.components.elit import elit_tok_lexrule_en, elit_pos_flair_en_mixed
  File "./api/components/elit.py", line 94, in <module>
    elit_pos_flair_en_mixed.load('{}/data/model/pos/jumbo'.format(ELIT_PATH), word_embedding_path='{}/'.format(ELIT_PATH))
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/tagger/tagger.py", line 27, in load
    self.tagger = SequenceTagger.load_from_file(model_path, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/tagger/sequence_tagger_model.py", line 175, in load_from_file
    WordEmbeddings('{}data/embedding/fasttext100.vec.txt'.format(kwargs.get('word_embedding_path', ''))),
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/tagger/embeddings.py", line 99, in __init__
    self.precomputed_word_embeddings, self.__embedding_length = read_pretrained_embeddings(embedding_file)
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/tagger/corpus.py", line 43, in read_pretrained_embeddings
    for line in f:
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 153: ordinal not in range(128)
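A minimal sketch of the likely fix (an assumption from the traceback, not the actual implementation): open the embedding file with an explicit UTF-8 encoding instead of the platform default, which here falls back to ASCII and chokes on multi-byte characters:

def read_pretrained_embeddings(embedding_file):
    """Read a whitespace-separated word-vector file; return (embeddings, dim)."""
    embeddings = {}
    dim = 0
    with open(embedding_file, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            fields = line.rstrip().split(' ')
            if len(fields) <= 2:          # skip an optional "count dim" header row
                continue
            embeddings[fields[0]] = [float(v) for v in fields[1:]]
            dim = len(fields) - 1
    return embeddings, dim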

ImportError

ImportError Traceback (most recent call last)
in
----> 1 from elit.component import EnglishTokenizer
2 from elit.component import EnglishMorphAnalyzer
3 from elit.component import POSFlairTagger
4 from elit.component import NERFlairTagger
5 from elit.component import DEPBiaffineParser

ImportError: cannot import name 'EnglishTokenizer' from 'elit.component' (/home/jeev/anaconda3/lib/python3.7/site-packages/elit/component/__init__.py)


Specifications

  • Version: 18.04
  • Platform: Ubuntu
  • Subsystem:

GPU config should be explicit

def mxnet_prefer_gpu():
    '''
    If gpu available return gpu, else cpu
    :return:
    '''
    if 'cuda' not in os.environ['PATH']:
        return mx.cpu()
    gpu = int(os.environ.get('MXNET_GPU', default=0))
    if gpu == -1:
        return mx.cpu()
    return mx.gpu(gpu)

We run models on a multi-GPU server, and each model uses its own GPU. With this function, we cannot specify from outside the project which GPU will be used.
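A sketch of one way to make the GPU choice explicit; the gpu parameter added to mxnet_prefer_gpu is an assumption for illustration, not the actual API:

import os
import mxnet as mx

def mxnet_prefer_gpu(gpu=None):
    """Return mx.gpu(gpu) when a usable GPU is requested, else mx.cpu()."""
    if gpu is None:                                   # fall back to the env var
        gpu = int(os.environ.get('MXNET_GPU', default=0))
    if gpu == -1 or 'cuda' not in os.environ.get('PATH', ''):
        return mx.cpu()
    return mx.gpu(gpu)

ctx = mxnet_prefer_gpu(gpu=2)   # pin this model to GPU 2 from the caller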

Code refactoring

  • Put elit/lemmatization and elit/token_tagger under elit/nlp.
  • Create elit/model.py, move the classes in model/cnn_model.py and model/rnn_model.py to elit/model.py, and remove elit/model.
  • Remove elit/state.py (we no longer use it?)
  • Remove resources/morph_analyzer and rename resources/lemmatizer to resources/morph_analyzer. You probably need to reconfigure the resource path for EnglishMorphAnalyzer (we probably don't need EnglishLemmatizer any more).

AttributeError: 'NoneType' object has no attribute '_vocab'

I installed elit and ran the code below:
from elit.structure import Document, Sentence, TOK, POS
from elit.component import SDPBiaffineParser

tokens = ['John', 'who', 'I', 'wanted', 'to', 'meet', 'was', 'smart']
postags = ['NNP', 'IN', 'WP', 'PRP', 'VBD', 'DT', 'NN', 'VBD', 'JJ']
doc = Document()
doc.add_sentence(Sentence({TOK: tokens, POS: postags}))
#print(doc)
sdp = SDPBiaffineParser()
sdp.decode(doc)
print(doc.sentences[0])

and it throws the error shown below.

AttributeError Traceback (most recent call last)
in
8 #print(doc)
9 sdp = SDPBiaffineParser()
---> 10 sdp.decode(doc)
11 print(doc.sentences[0])

~/anaconda3/lib/python3.7/site-packages/elit/component/sdp/sdp_parser.py in decode(self, docs, num_buckets_test, test_batch_size, **kwargs)
119 :return: docs
120 """
--> 121 vocab = self._parser._vocab
122 for d in docs:
123 for s in d:

AttributeError: 'NoneType' object has no attribute '_vocab'

  • Version: 18.04
  • Platform: Ubuntu
  • Subsystem:
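A likely cause, judging from the traceback: SDPBiaffineParser() was never given a model, so self._parser is still None when decode() runs. A sketch of the fix, following the load-then-decode pattern used elsewhere in this tracker (the model path is hypothetical):

sdp = SDPBiaffineParser()
sdp.load('path/to/sdp/model')   # hypothetical path; load a pretrained model first
sdp.decode(doc)
print(doc.sentences[0])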

Things To Do

Code

  • Update requirements.txt, requirements.dev.txt, and setup.py.
  • Replace CHANGELOG.md with Github's release notes.
  • Update the installation documentation for environment setup.
  • MXComponent should take a list of Embedding instead of embed_config:
def __init__(self, ctx: mx.Context, emb_list: List[Embedding], key: str = 'tag', chunking: bool = False, label_map: LabelMap = None, ...)

CLI

  • elit download resources
    • word embeddings, models
  • elit install
  • elit task decode
  • elit all decode [tok]
  • elit task train
  • elit task eval
s = 'hello world!'
tools = [PartOfSpeechTagger(...).load(...), ..., DependencyParser(...).load(...)]
tok = EnglishTokenizer()
docs = tok.decode(s)
for tool in tools: tool.decode(docs)

Release

  • 0.2.0.
