Comments (8)

jannikbertram commented on May 18, 2024

Maybe you cut off your data at the wrong point?
Check the last rows of train.txt and valid.txt and make sure that the file ends with an empty line and that the last sentence is complete (the end of a sentence is marked by an empty line after it).
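
The check described above can be sketched in a few lines of Python. This is only an illustration, not part of anago: the function name check_conll_text is made up, and it assumes one "word tag" pair per line separated by whitespace (the real files may use tabs).

```python
def check_conll_text(text):
    """Return a list of problems found in CoNLL-style text (hypothetical helper)."""
    text = text.replace("\r\n", "\n")  # tolerate Windows line endings
    if not text.strip():
        return ["file has no sentences"]
    problems = []
    # The last sentence is only terminated if the text ends with an empty line.
    if not text.endswith("\n\n"):
        problems.append("no empty line after the last sentence")
    for number, line in enumerate(text.split("\n"), start=1):
        # Every non-empty line should hold exactly a word and its tag.
        if line.strip() and len(line.split()) != 2:
            problems.append(f"line {number}: expected 'word tag', got {line!r}")
    return problems


if __name__ == "__main__":
    # The path is only an example; point it at your own train.txt or valid.txt.
    with open("train.txt", encoding="utf-8", newline="") as f:
        for problem in check_conll_text(f.read()):
            print(problem)
```

An empty result means the file at least ends correctly and has two columns per line; it does not say anything about which tags occur in it.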

from anago.

jannikbertram commented on May 18, 2024

This error happens when your validation set contains tags that do not exist in your training set.
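
To see whether this applies to your data, you can compare the two tag sets directly. A minimal sketch, assuming y_train and y_valid are lists of tag lists as returned by load_data_and_labels; the function name tags_missing_from_training is hypothetical:

```python
def tags_missing_from_training(y_train, y_valid):
    """Return the tags that occur in y_valid but never in y_train, sorted."""
    train_tags = {tag for sent in y_train for tag in sent}
    valid_tags = {tag for sent in y_valid for tag in sent}
    return sorted(valid_tags - train_tags)


# Toy data: I-MISC appears only in the validation set.
y_train = [["B-ORG", "O", "B-MISC"], ["B-PER", "I-PER"]]
y_valid = [["B-MISC", "I-MISC", "O"]]
print(tags_missing_from_training(y_train, y_valid))
```

Any tag this returns would trigger a KeyError when the validation labels are looked up in a tag vocabulary built from the training set alone.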

As this can also happen in other kinds of machine learning problems, I built a workaround for it:

I defined a new Preprocessor class that adds the tags from the validation set to the self.vocab_tag dictionary.

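# Assumption: WordPreprocessor lives in anago.preprocess in the anago version used in this thread.
from anago.preprocess import WordPreprocessor


class Preprocessor(WordPreprocessor):

    def fit(self, x_train, y_train, y_valid):
        # Build the word, character and tag vocabularies from the training set.
        super().fit(x_train, y_train)

        # Collect every tag that occurs in the validation set.
        entities = set()
        for sent in y_valid:
            entities.update(sent)

        # Give each tag the training set did not contain the next free index.
        # Sorting keeps the assigned indices the same from run to run.
        for t in sorted(entities):
            if t not in self.vocab_tag:
                self.vocab_tag[t] = len(self.vocab_tag)

        return self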

You also need a new wrapper class that is almost equivalent to Sequence, but uses your new preprocessor:

class AnagoWrapper(Sequence):

    def train(self, x_train, y_train, x_valid=None, y_valid=None, vocab_init=None):
        self.p = Preprocessor(vocab_init=vocab_init).fit(x_train, y_train, y_valid)
        embeddings = filter_embeddings(self.embeddings, self.p.vocab_word, self.model_config.word_embedding_size)
        self.model_config.vocab_size = len(self.p.vocab_word)
        self.model_config.char_vocab_size = len(self.p.vocab_char)

        self.model = SeqLabeling(self.model_config, embeddings, len(self.p.vocab_tag))
    
        trainer = Trainer(self.model,
                          self.training_config,
                          checkpoint_path=self.log_dir,
                          preprocessor=self.p)
        trainer.train(x_train, y_train, x_valid, y_valid)

Anyway, I am thinking about changing my preprocessor so that it takes a predefined list of tags for self.vocab_tag, because the current version may still fail once you test your model and your test set contains tags that do not exist in the training or validation set.
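
The predefined-list idea could look roughly like this. It is only a sketch under assumptions: bio_tags and extend_tag_vocab are hypothetical helpers, not anago functions, and the "<PAD>" entry in the example vocabulary is just a stand-in for whatever the real vocab_tag already contains.

```python
def bio_tags(entity_types):
    """Build the full BIO tag list for the given entity types."""
    tags = ["O"]
    for entity in entity_types:
        tags.extend([f"B-{entity}", f"I-{entity}"])
    return tags


def extend_tag_vocab(vocab_tag, all_tags):
    """Add every tag from all_tags that is not yet in vocab_tag (in place)."""
    for tag in all_tags:
        if tag not in vocab_tag:
            vocab_tag[tag] = len(vocab_tag)
    return vocab_tag


# Example: a vocabulary fitted on data that never contained I-MISC.
vocab_tag = {"<PAD>": 0, "O": 1, "B-MISC": 2}
extend_tag_vocab(vocab_tag, bio_tags(["MISC"]))
print(vocab_tag)
```

Because the tag list is fixed up front, the vocabulary no longer depends on which tags happen to occur in a particular split.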

Rowing0914 commented on May 18, 2024

When I tried a small dataset, I got the error above, but when I fed in a large dataset, like the one you pushed to this repo, it worked.
So, did you set any limitation on the data size?

Rowing0914 commented on May 18, 2024

Does anyone know about this??

Hironsan commented on May 18, 2024

Probably, sentences is an empty list:

>>> max([], key=len)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
ValueError: max() arg is an empty sequence
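
If you want to guard against this in your own code, max() accepts a default value that is returned for an empty input. A small sketch; the function name longest_sentence_length is hypothetical and this is not how anago itself handles the case:

```python
def longest_sentence_length(sentences):
    """Return the length of the longest sentence, or 0 if there are none."""
    return max((len(sent) for sent in sentences), default=0)


print(longest_sentence_length([["EU", "rejects"], ["Peter"]]))
print(longest_sentence_length([]))
```

A result of 0 right after loading is a strong hint that the file was parsed into no sentences at all.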

Rowing0914 commented on May 18, 2024

Hi Hironsan and bode94,
Thank you for your comments. I knew that already; what I did not understand was why it caused this error. Anyway, bode94 is right:
I had not put an empty line at the end of the training data.
That is why it worked with the distributed dataset but not with mine.
Thank you, both!

Hope you are doing well!

Best,
Rowing0914

Rowing0914 commented on May 18, 2024

Hmm, even though I put an empty line at the end of the training data, the issue was not solved.
I think I replicated the format of the original training data in the directory exactly.

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    model.train(x_train, y_train, x_valid, y_valid)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/wrapper.py", line 50, in train
    trainer.train(x_train, y_train, x_valid, y_valid)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/trainer.py", line 51, in train
    callbacks=callbacks)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/engine/training.py", line 2213, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/keras/callbacks.py", line 76, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/metrics.py", line 124, in on_epoch_end
    for i, (data, label) in enumerate(self.valid_batches):
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/reader.py", line 150, in data_generator
    yield preprocessor.transform(X, y)
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 115, in transform
    y = [[self.vocab_tag[t] for t in sent] for sent in y]
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 115, in <listcomp>
    y = [[self.vocab_tag[t] for t in sent] for sent in y]
  File "/Users/norio.kosaka/anaconda3/envs/py36/lib/python3.6/site-packages/anago/preprocess.py", line 115, in <listcomp>
    y = [[self.vocab_tag[t] for t in sent] for sent in y]
KeyError: 'I-MISC'

test.py
import anago
from anago.reader import load_data_and_labels

x_train, y_train = load_data_and_labels('../data/conll2003/en/ner/train_1.txt')
x_valid, y_valid = load_data_and_labels('../data/conll2003/en/ner/valid_1.txt')
x_test, y_test = load_data_and_labels('../data/conll2003/en/ner/test_1.txt')

model = anago.Sequence()
model.train(x_train, y_train, x_valid, y_valid)
model.eval(x_test, y_test)
words = 'President Obama is speaking at the White House.'.split()
model.analyze(words)

train.txt
EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O

Peter B-PER
Blackburn I-PER

BRUSSELS B-LOC
1996-08-22 O

The O
European B-ORG
Commission I-ORG
said O
on O
Thursday O
it O
disagreed O
with O
German B-MISC
advice O
to O
consumers O
to O
shun O
British B-MISC
lamb O
until O
scientists O
determine O
whether O
mad O
cow O
disease O
can O
be O
transmitted O
to O
sheep O
. O

Germany B-LOC
's O
representative O
to O
the O
European B-ORG
Union I-ORG
's O
veterinary O
committee O
Werner B-PER
Zwingmann I-PER
said O
on O
Wednesday O
consumers O
should O
buy O
sheepmeat O
from O
countries O
other O
than O
Britain B-LOC
until O
the O
scientific O
advice O
was O
clearer O
. O

" O
We O
do O
n't O
support O
any O
such O
recommendation O
because O
we O
do O
n't O
see O
any O
grounds O
for O
it O
, O
" O
the O
Commission B-ORG
. O

So please tell me the proper format for the dataset.
I could not find a description of it.

Rowing0914 commented on May 18, 2024

Hi bode94,

Thank you for your prompt reply!
Oh, yes, it is probably because I created the datasets using head -n 100 train/test/valid.txt > train/test/valid_1.txt

So the first 100 lines of each file may well contain different tags...
Now I get it!
Thank you so much for your contribution as well!
Let me check!
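
For anyone else building a small subset: cutting by sentence instead of by line avoids a truncated last sentence. A minimal sketch, assuming sentences are separated by empty lines; the function name first_n_sentences and the file paths are only examples:

```python
def first_n_sentences(text, n):
    """Return the first n complete sentences of CoNLL-style text.

    Each sentence in the result is followed by an empty line, so the output
    also ends with the empty line that marks the end of the last sentence.
    """
    blocks = [block for block in text.split("\n\n") if block.strip()]
    return "".join(block.strip("\n") + "\n\n" for block in blocks[:n])


if __name__ == "__main__":
    # Example paths only; adjust them to your own data directory.
    with open("train.txt", encoding="utf-8") as src:
        subset = first_n_sentences(src.read(), 100)
    with open("train_1.txt", "w", encoding="utf-8") as dst:
        dst.write(subset)
```

A small subset can still lack tags that appear in the validation or test subset, so the tag sets of the files are worth comparing as well.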
