
elit's People

Contributors

bgshin, eric-haibin-lin, hankcs, imgarylai, jdchoi77, lxucs, tlee54, zhengzheyang


elit's Issues

Get ValueError when the Input format is line

When I use line as the input format for sentiment analysis, I get this error message:

ValueError: Error when checking : expected input_2 to have 3 dimensions, but got array with shape (0, 1)

The code I executed is:

import os
from io import StringIO

from elit.decode import EnglishDecoder, DOC_DELIM
from elit.util.configure import *

# Please replace with the correct resource path.
RESOURCE_PATH = os.path.join('resources')
local_filename = "test.txt"

config = Configuration(tokenize=True, input_format='line', segment=True, sentiment=(SENTIMENT_TWITTER, SENTIMENT_MOVIE))

nd = EnglishDecoder(resource_dir=RESOURCE_PATH, config=config)

with open(local_filename, 'r', encoding='utf8') as f:
    result = nd.decode(config, StringIO(f.read()))

The file, test.txt, I tested:

Longer version of the error message:

  File "/Users/gary/Documents/research/elitrpc/application.py", line 73, in decode
    result = nd.decode(config, StringIO(f.read()))
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/elit/decode.py", line 44, in decode
    d = decode(config, istream, ostream)
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/elit/decode.py", line 89, in decode_line
    d = self.text_to_sentences(config, line, offset)
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/elit/decode.py", line 182, in text_to_sentences
    self.sentiment_analyze(config, sentences)
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/elit/decode.py", line 202, in sentiment_analyze
    y, att = analyzer.decode(sens, att=att)
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/elit/component/sentiment.py", line 73, in decode
    y = self.p_model.predict(x, batch_size=batch_size, verbose=0)
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/keras/engine/training.py", line 1695, in predict
    check_batch_axis=False)
  File "/Users/gary/.local/share/virtualenvs/elitrpc36/lib/python3.6/site-packages/keras/engine/training.py", line 132, in _standardize_input_data
    str(array.shape))
ValueError: Error when checking : expected input_2 to have 3 dimensions, but got array with shape (0, 1)
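A plausible workaround until this is fixed, assuming the (0, 1) input shape comes from blank lines in test.txt producing an empty batch (this cause is a guess from the traceback, not confirmed):

from io import StringIO

with open(local_filename, 'r', encoding='utf8') as f:
    # drop blank lines so the line-format decoder never sees an empty document
    text = '\n'.join(line.rstrip('\n') for line in f if line.strip())
result = nd.decode(config, StringIO(text))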

Unit tests

Unit tests are needed for:

  • elit.structure.*
  • elit.reader.*
  • elit.model.*

Reader example:

# assuming TSVReader is importable from elit.reader (per the module list above)
from elit.reader import TSVReader

filename = '../../resources/sample.tsv'
reader = TSVReader(filename, 1, 2, 3, 4, 5, 6, 7, 8)

# read the next graph
graph = reader.next()
print(str(graph)+'\n')

# read the rest
for graph in reader:
    print(str(graph)+'\n')
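A minimal unittest sketch for the reader bullet above; the import path and the sample file are assumptions carried over from the example:

import unittest

from elit.reader import TSVReader   # assumed module path


class TestTSVReader(unittest.TestCase):
    def test_reads_every_graph(self):
        reader = TSVReader('../../resources/sample.tsv', 1, 2, 3, 4, 5, 6, 7, 8)
        graphs = list(reader)
        self.assertGreater(len(graphs), 0)                   # sample file is non-empty
        self.assertTrue(all(g is not None for g in graphs))  # every graph parsed


if __name__ == '__main__':
    unittest.main()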

elit/nlp/dep/parser/common/data.py

word_dims = len(data)

...
elit_dep_biaffine_en_mixed.load('{}/data/model/dep/jumbo-fasttext100'.format(ELIT_PATH))
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/dep/parser/parser.py", line 168, in load
    self._parser = self._create_parser(self._config, self._vocab)
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/dep/parser/parser.py", line 209, in _create_parser
    config.mlp_rel_size, config.dropout_mlp, config.debug)
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/dep/parser/biaffine_parser.py", line 55, in __init__
    trainable=False) if vocab.has_pret_embs() else None
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/dep/parser/common/data.py", line 174, in get_pret_embs
    word_dims = len(data)
UnboundLocalError: local variable 'data' referenced before assignment
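The traceback suggests that the loop in get_pret_embs which should assign data never runs (for example, the embedding file is missing or empty), so len(data) fails. A defensive sketch, with variable names assumed from the snippet above rather than taken from the actual source:

data = None
with open(pret_file, encoding='utf-8') as f:   # pret_file: hypothetical name
    for line in f:
        data = line.rstrip().split(' ')
        break                                  # one row suffices for the dimension
if data is None:
    raise ValueError('no pretrained embeddings found in %s' % pret_file)
word_dims = len(data)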

Path should be passed to Embedding

embedding_types = [
    WordEmbeddings('data/embedding/fasttext100.vec.txt'),
    # comment in this line to use character embeddings
    # CharacterEmbeddings(),
    # comment in these lines to use contextual string embeddings
    CharLMEmbeddings('data/model/lm-news-forward'),
    CharLMEmbeddings('data/model/lm-news-backward'),
]

Since elit can be installed through pip or run on a server, the word embedding path is not always inside the project directory, so the path should be passed in explicitly.
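A sketch of the request: resolve embedding paths against a configurable root instead of the project directory. The resolve_embedding helper and the ELIT_EMBEDDING_ROOT variable are assumptions for illustration, not the actual API:

import os

def resolve_embedding(path, embedding_root=None):
    """Resolve a relative embedding path against a configurable root."""
    root = embedding_root or os.environ.get('ELIT_EMBEDDING_ROOT', '')
    return path if os.path.isabs(path) else os.path.join(root, path)

embedding_types = [
    WordEmbeddings(resolve_embedding('data/embedding/fasttext100.vec.txt')),
    CharLMEmbeddings(resolve_embedding('data/model/lm-news-forward')),
    CharLMEmbeddings(resolve_embedding('data/model/lm-news-backward')),
]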

move out nlp components

This elit project will only be the SDK for deployment to the cloud. Since we overlap heavily with GluonNLP, it may be better for us to contribute our NLP models there; then we would not have to maintain a unified API interface for each NLP component.

In the future, developers won't have to worry about how to implement their components. The only thing they should care about is the decode function.

I have created a snapshot at https://github.com/elitcloud/elit/tree/nlp for this move-out process. I'm going to move out every component after commit 9e4f185.

@hankcs would you please move your parser to https://github.com/elitcloud/parser?

Please leave your ideas/thoughts here for further discussion.

Best

`load` functions

The load functions need to be updated as follows (a sketch follows the list):

  • It takes two parameters model_path and model_root:
    • model_path indicates either the name of a model (e.g., elit_pos_flair_en_mixed_*) or the URL to a compressed file in the S3 bucket, and does not have a default value.
    • model_root points to the root directory in the local machine where all models are saved, and has the default value of ~/.elit/models/.
  • If model_path points to a remote compressed file:
    • It downloads and unzips the compressed file to model_root, which creates a directory with the same name as the remote file (e.g., ~/.elit/models/elit_pos_flair_en_mixed_*).
    • If the downloaded model has dependencies to other models indicated in config.json (e.g., elit_lm_flair-bw_en_mixed_*), it downloads all dependent files and unzips them under model_root. This dependency resolution needs to be called recursively.
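A minimal sketch of the behavior described above; the download/unzip helpers and the config.json 'dependencies' key are assumptions, not the actual API:

import json
import os
import urllib.request
import zipfile

DEFAULT_MODEL_ROOT = os.path.expanduser('~/.elit/models/')

def load(model_path, model_root=DEFAULT_MODEL_ROOT):
    """Resolve model_path to a local directory under model_root, downloading if needed."""
    os.makedirs(model_root, exist_ok=True)
    if model_path.startswith(('http://', 'https://')):       # remote compressed file
        name = os.path.basename(model_path)
        name = name[:-4] if name.endswith('.zip') else name
        local_dir = os.path.join(model_root, name)
        if not os.path.isdir(local_dir):
            zip_path = local_dir + '.zip'
            urllib.request.urlretrieve(model_path, zip_path)
            with zipfile.ZipFile(zip_path) as z:
                z.extractall(model_root)   # assumes the archive unpacks to model_root/name
            os.remove(zip_path)
    else:                                                    # name of an already-local model
        local_dir = os.path.join(model_root, model_path)
    # recursive dependency resolution via config.json ('dependencies' key is an assumption)
    config_file = os.path.join(local_dir, 'config.json')
    if os.path.isfile(config_file):
        with open(config_file) as f:
            for dep in json.load(f).get('dependencies', []):
                load(dep, model_root)
    return local_dir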

AttributeError

When I run my own data set through decode, it gives an error:

for tool in tools:
    shuf = tool.decode(docs)

AttributeError Traceback (most recent call last)
in
1 for tool in tools:
----> 2 shuf = tool.decode(docs)

~/anaconda3/lib/python3.7/site-packages/elit/component/tagger/pos_tagger.py in decode(self, docs, **kwargs)
61 if isinstance(docs, Document):
62 docs = [docs]
---> 63 samples = NLPTaskDataFetcher.convert_elit_documents(docs)
64 with self.context:
65 sentences = self.tagger.predict(samples)

~/anaconda3/lib/python3.7/site-packages/elit/component/tagger/corpus.py in convert_elit_documents(docs)
1298 dataset = []
1299 for d in docs:
-> 1300 for s in d.sentences:
1301 sentence = Sentence()
1302

AttributeError: 'str' object has no attribute 'sentences'

but it works fine for the example given in the documentation!
Please help me figure this out.
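A likely cause, judging from the traceback: decode() expects Document objects, but docs here is a list of raw strings. A sketch of the fix, reusing the tokenize-then-decode pattern shown in the "Things To Do" issue below:

from elit.component import EnglishTokenizer

tok = EnglishTokenizer()
docs = tok.decode('the raw text of my data set ...')   # produces Documents with .sentences
for tool in tools:
    shuf = tool.decode(docs)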

Lexicon

TODO

  • Test case of lexicon
  • Test case of word2vector
  • Test case of fasttext

#6

Compound words tokenization failure

Expected Behavior

Compound words (e.g. pick-me-up, hand-me-down, know-it-all, etc.) should be tokenized as single tokens.

Actual Behavior

Hyphens are treated as separators, and the components are tokenized separately.
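A minimal illustration of the report; the tokenizer usage follows the pattern shown elsewhere in this tracker, and the commented outputs paraphrase the expected and reported behavior rather than verified output:

from elit.component import EnglishTokenizer

tok = EnglishTokenizer()
docs = tok.decode('That nap was a real pick-me-up.')
# expected: 'pick-me-up' kept as a single token
# reported: split into 'pick', '-', 'me', '-', 'up'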

read_pretrained_embeddings

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/wsgi.py", line 142, in __call__
    response = self.get_response(request)
  File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/base.py", line 78, in get_response
    response = self._middleware_chain(request)
  File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 36, in inner
    response = response_for_exception(request, exc)
  File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 90, in response_for_exception
    response = handle_uncaught_exception(request, get_resolver(get_urlconf()), sys.exc_info())
  File "/usr/local/lib/python3.5/dist-packages/django/core/handlers/exception.py", line 128, in handle_uncaught_exception
    callback, param_dict = resolver.resolve_error_handler(500)
  File "/usr/local/lib/python3.5/dist-packages/django/urls/resolvers.py", line 546, in resolve_error_handler
    callback = getattr(self.urlconf_module, 'handler%s' % view_type, None)
  File "/usr/local/lib/python3.5/dist-packages/django/utils/functional.py", line 37, in __get__
    res = instance.__dict__[self.name] = self.func(instance)
  File "/usr/local/lib/python3.5/dist-packages/django/urls/resolvers.py", line 526, in urlconf_module
    return import_module(self.urlconf_name)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "./config/urls.py", line 23, in <module>
    path('api/', include('api.urls')),
  File "/usr/local/lib/python3.5/dist-packages/django/urls/conf.py", line 34, in include
    urlconf_module = import_module(urlconf_module)
  File "/usr/lib/python3.5/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 665, in exec_module
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "./api/urls.py", line 18, in <module>
    from api import views
  File "./api/views.py", line 10, in <module>
    from api.components.elit import elit_tok_lexrule_en, elit_pos_flair_en_mixed
  File "./api/components/elit.py", line 94, in <module>
    elit_pos_flair_en_mixed.load('{}/data/model/pos/jumbo'.format(ELIT_PATH), word_embedding_path='{}/'.format(ELIT_PATH))
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/tagger/tagger.py", line 27, in load
    self.tagger = SequenceTagger.load_from_file(model_path, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/tagger/sequence_tagger_model.py", line 175, in load_from_file
    WordEmbeddings('{}data/embedding/fasttext100.vec.txt'.format(kwargs.get('word_embedding_path', ''))),
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/tagger/embeddings.py", line 99, in __init__
    self.precomputed_word_embeddings, self.__embedding_length = read_pretrained_embeddings(embedding_file)
  File "/usr/local/lib/python3.5/dist-packages/elit/nlp/tagger/corpus.py", line 43, in read_pretrained_embeddings
    for line in f:
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 153: ordinal not in range(128)
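A minimal sketch of the likely fix (an assumption from the traceback, not the actual implementation): open the embedding file with an explicit UTF-8 encoding instead of the platform default, which here falls back to ASCII and chokes on multi-byte characters:

def read_pretrained_embeddings(embedding_file):
    """Read a whitespace-separated word-vector file; return (embeddings, dim)."""
    embeddings = {}
    dim = 0
    with open(embedding_file, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            fields = line.rstrip().split(' ')
            if len(fields) <= 2:          # skip an optional "count dim" header row
                continue
            embeddings[fields[0]] = [float(v) for v in fields[1:]]
            dim = len(fields) - 1
    return embeddings, dim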

ImportError

ImportError Traceback (most recent call last)
in
----> 1 from elit.component import EnglishTokenizer
2 from elit.component import EnglishMorphAnalyzer
3 from elit.component import POSFlairTagger
4 from elit.component import NERFlairTagger
5 from elit.component import DEPBiaffineParser

ImportError: cannot import name 'EnglishTokenizer' from 'elit.component' (/home/jeev/anaconda3/lib/python3.7/site-packages/elit/component/__init__.py)


Specifications

  • Version: 18.04
  • Platform: Ubuntu
  • Subsystem:

GPU config should be explicit

def mxnet_prefer_gpu():
    '''
    If gpu available return gpu, else cpu
    :return:
    '''
    if 'cuda' not in os.environ['PATH']:
        return mx.cpu()
    gpu = int(os.environ.get('MXNET_GPU', default=0))
    if gpu == -1:
        return mx.cpu()
    return mx.gpu(gpu)

We run models on a multi-GPU server, and each model uses its own GPU. With this function, we cannot specify from outside the project which GPU will be used.
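A sketch of one way to make the GPU choice explicit; the gpu parameter added to mxnet_prefer_gpu is an assumption for illustration, not the actual API:

import os
import mxnet as mx

def mxnet_prefer_gpu(gpu=None):
    """Return mx.gpu(gpu) when a usable GPU is requested, else mx.cpu()."""
    if gpu is None:                                   # fall back to the env var
        gpu = int(os.environ.get('MXNET_GPU', default=0))
    if gpu == -1 or 'cuda' not in os.environ.get('PATH', ''):
        return mx.cpu()
    return mx.gpu(gpu)

ctx = mxnet_prefer_gpu(gpu=2)   # pin this model to GPU 2 from the caller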

Code refactoring

  • Put elit/lemmatization and elit/token_tagger under elit/nlp.
  • Create elit/model.py, move the classes in model/cnn_model.py and model/rnn_model.py to elit/model.py, and remove elit/model.
  • Remove elit/state.py (we no longer use it?)
  • Remove resources/morph_analyzer and rename resources/lemmatizer to resources/morph_analyzer. You probably need to reconfigure the resource path for EnglishMorphAnalyzer (we probably don't need EnglishLemmatizer any more).

AttributeError: 'NoneType' object has no attribute '_vocab'

I installed elit and ran the code below:
from elit.structure import Document, Sentence, TOK, POS
from elit.component import SDPBiaffineParser

tokens = ['John', 'who', 'I', 'wanted', 'to', 'meet', 'was', 'smart']
postags = ['NNP', 'IN', 'WP', 'PRP', 'VBD', 'DT', 'NN', 'VBD', 'JJ']
doc = Document()
doc.add_sentence(Sentence({TOK: tokens, POS: postags}))
#print(doc)
sdp = SDPBiaffineParser()
sdp.decode(doc)
print(doc.sentences[0])

and it throws the error shown below.

AttributeError Traceback (most recent call last)
in
8 #print(doc)
9 sdp = SDPBiaffineParser()
---> 10 sdp.decode(doc)
11 print(doc.sentences[0])

~/anaconda3/lib/python3.7/site-packages/elit/component/sdp/sdp_parser.py in decode(self, docs, num_buckets_test, test_batch_size, **kwargs)
119 :return: docs
120 """
--> 121 vocab = self._parser._vocab
122 for d in docs:
123 for s in d:

AttributeError: 'NoneType' object has no attribute '_vocab'

  • Version: 18.04
  • Platform: Ubuntu
  • Subsystem:
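A likely cause, judging from the traceback: SDPBiaffineParser() was never given a model, so self._parser is still None when decode() runs. A sketch of the fix, following the load-then-decode pattern used elsewhere in this tracker (the model path is hypothetical):

sdp = SDPBiaffineParser()
sdp.load('path/to/sdp/model')   # hypothetical path; load a pretrained model first
sdp.decode(doc)
print(doc.sentences[0])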

Things To Do

Code

  • Update requirements.txt, requirements.dev.txt, and setup.py.
  • Replace CHANGELOG.md with Github's release notes.
  • Update the installation documentation for environment setup.
  • MXComponent should take a list of Embedding instead of embed_config:
def __init__(self, ctx: mx.Context, emb_list: List[Embedding], key: str = 'tag', chunking: bool = False, label_map: LabelMap = None, ...)

CLI

  • elit download resources
    • word embeddings, models
  • elit install
  • elit task decode
  • elit all decode [tok]
  • elit task train
  • elit task eval
s = 'hello world!'
tools = [PartOfSpeechTagger(...).load(...), ..., DependencyParser(...).load(...)]
tok = EnglishTokenizer()
docs = tok.decode(s)
for tool in tools: tool.decode(docs)

Release

  • 0.2.0.
