delft's People

Contributors

brijml, de-code, dependabot[bot], kermitt2, lfoppiano

delft's Issues

How should I cite your blog?

Nice work! Your work and blog

A reproducibility study on neural NER

really helped my research, and I want to cite your blog in my research paper. Could you give me an example BibTeX entry? Just like Keras:

@misc{chollet2015keras,
title={Keras},
author={Chollet, Fran\c{c}ois and others},
year={2015},
howpublished={\url{https://keras.io}},
}

Incompatible arrays dimension when using ELMo and input is of length 1 (only 1 word)

In a sequence labelling scenario, when the input of Tagger.tag() is of length 1, for example with the input string "test", we end up with an error:

  File "/home/olivier/.pyenv/versions/spacy/lib/python3.6/site-packages/delft/sequenceLabelling/wrapper.py", line 245, in tag
    annotations = tagger.tag(texts, output_format)
  File "/home/olivier/.pyenv/versions/spacy/lib/python3.6/site-packages/delft/sequenceLabelling/tagger.py", line 60, in tag
    preds = self.model.predict_on_batch(generator_output[0])
  File "/home/olivier/.pyenv/versions/spacy/lib/python3.6/site-packages/keras/engine/training.py", line 1274, in predict_on_batch
    outputs = self.predict_function(ins)
  File "/home/olivier/.pyenv/versions/spacy/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/home/olivier/.pyenv/versions/spacy/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/home/olivier/.pyenv/versions/spacy/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/home/olivier/.pyenv/versions/spacy/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: ConcatOp : Dimensions of inputs should match: shape[0] = [1,1,1324] vs. shape[1] = [1,2,50]
	 [[{{node concatenate_1/concat}}]]

The word input shape [1,1,1324] looks fine:
1 because there is 1 word,
1324 because 300 for the GloVe embeddings + 1024 for ELMo.

But the character input shape [1,2,50] is not aligned.

The reason is in the DataGenerator.__data_generation() method, near line 93:

        # prevent sequence of length 1 alone in a batch (this causes an error in tf)
        extend = False
        if max_length_x == 1:
            max_length_x += 1
            extend = True

An input of length 1 is artificially extended to 2.

But at line 108:

        if self.embeddings.use_ELMo:     
            #batch_x = to_vector_elmo(x_tokenized, self.embeddings, max_length_x)
            batch_x = to_vector_simple_with_elmo(x_tokenized, self.embeddings, max_length_x)

batch_x is initialized with the correct shape [1,2,1324], but after the call to to_vector_simple_with_elmo() it is back to [1,1,1324].

So maybe the to_vector_simple_with_elmo() method should also extend the vector from 1 to 2?

I don't know what the best fix is:

  • pass an additional extend=True parameter to the to_vector_simple_with_elmo() method?
  • take the maxlength into account in to_vector_simple_with_elmo()?

I would be happy to contribute the fix in a PR if you tell me what the best way to fix it would be.
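
As a rough illustration of the second option (padding or truncating to the requested max length inside the vectorization step), here is a minimal sketch; the function name and the embed_tokens helper are hypothetical, not the actual DeLFT code:

    import numpy as np

    def to_vector_with_maxlen(tokens_batch, embed_tokens, maxlen, dim=1324):
        # embed_tokens(tokens) is assumed to return an array of shape (len(tokens), dim)
        # holding the concatenated GloVe (300) + ELMo (1024) vectors
        batch = np.zeros((len(tokens_batch), maxlen, dim), dtype='float32')
        for i, tokens in enumerate(tokens_batch):
            vectors = embed_tokens(tokens)
            length = min(len(tokens), maxlen)
            batch[i, :length, :] = vectors[:length]   # zero-pad (or truncate) to maxlen
        return batch

    # with maxlen=2, a single-word input now yields shape (1, 2, 1324),
    # matching the artificially extended character input [1, 2, 50]
    dummy_embed = lambda tokens: np.ones((len(tokens), 1324), dtype='float32')
    print(to_vector_with_maxlen([["test"]], dummy_embed, maxlen=2).shape)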

Train ELMo embeddings for French

For instance using French Wikipedia (1B words) + the FrWac corpus (1.6B words).
As a reference, it will require 2-3 GeForce GTX 1080Ti GPUs.

Cannot allocate memory

The first time I run the program, it builds the .mdb files. However, after these files are created, it fails to continue as it cannot allocate enough memory for the rest. The memory issue does not happen if the .mdb files already exist.
I was wondering how much memory the .mdb creation requires?
I am running your code on some Linux servers with 180 GB of shared RAM.

Sorting of fields in the sequence labelling evaluation report

The fields with scores do not follow the same order in the 10-fold average and in the best/worst results:

From @lfoppiano:
I would output the labels in the same order as the rest. See the example below: the average outputs month, day, year, while everything else outputs year, month, day:

average over 3 folds
            precision    recall  f1-score   support

   <month>     0.9716    0.9661    0.9688        59
     <day>     0.8980    0.9683    0.9314        42
    <year>     0.9948    0.9792    0.9869        64

	macro f1 = 0.9659
	macro precision = 0.9601
	macro recall = 0.9717 


** Worst ** model scores -
                  precision    recall  f1-score   support

          <year>     1.0000    0.9688    0.9841        64
         <month>     0.9655    0.9492    0.9573        59
           <day>     0.8723    0.9762    0.9213        42

all (micro avg.)     0.9521    0.9636    0.9578       165


** Best ** model scores -
                  precision    recall  f1-score   support

          <year>     0.9844    0.9844    0.9844        64
         <month>     0.9667    0.9831    0.9748        59
           <day>     0.9302    0.9524    0.9412        42

all (micro avg.)     0.9641    0.9758    0.9699       165
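
One simple way to make the order consistent would be to sort the labels once and reuse that order in every report; the sketch below uses a hypothetical scores dict and made-up formatting, not the actual DeLFT reporting code.

    def print_report(scores):
        # scores: label -> (precision, recall, f1, support); sorting the labels once
        # makes the n-fold average and the best/worst reports use the same order
        for label in sorted(scores):
            p, r, f1, support = scores[label]
            print("%17s     %.4f    %.4f    %.4f    %9d" % (label, p, r, f1, support))

    print_report({
        "<year>":  (0.9948, 0.9792, 0.9869, 64),
        "<month>": (0.9716, 0.9661, 0.9688, 59),
        "<day>":   (0.8980, 0.9683, 0.9314, 42),
    })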

Find a way to disable the ELMo/BERT caching mechanism in "production" mode

The contextual embeddings (ELMo or BERT) caching mechanism using an lmdb database is really nice, especially in training mode, because it saves a lot of time after the 1st epoch.
However, when you want to use your trained model intensively in "production", making a lot of predictions over a long period of time, the lmdb database can potentially grow indefinitely when massively providing unseen texts to the Tagger.tag() method.
It could be useful to have a way of disabling the cache, for example with a use_cache boolean flag that could override the default behaviour (not during training, but in production).

Not sure this is crystal clear ... Let me know
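
To make the idea concrete, here is a hypothetical sketch of such a flag (names and structure are illustrative, not the actual DeLFT Embeddings class):

    class ContextualEmbeddings:
        def __init__(self, use_cache=True):
            self.use_cache = use_cache
            self.cache = {}                 # stands in for the lmdb database

        def embed(self, text):
            if self.use_cache and text in self.cache:
                return self.cache[text]
            vector = self._compute_embedding(text)   # the expensive ELMo/BERT call
            if self.use_cache:
                self.cache[text] = vector            # the cache only grows when enabled
            return vector

        def _compute_embedding(self, text):
            return [float(len(text))]                # placeholder for the real model

    embeddings = ContextualEmbeddings(use_cache=False)   # "production" mode: no unbounded growth
    print(embeddings.embed("some unseen text"), len(embeddings.cache))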

Missing wikipedia-and-pmc embeddings location in starting guide

Hey, I've tried to run delft. I had configured the embeddings (it worked previously), but the latest changes introduced new embeddings for which I could not find the source:

Compiling embeddings... (this is done only one time per embeddings at first launch)
[...]
FileNotFoundError: [Errno 2] No such file or directory: '/media/lopez/T5/embeddings/wikipedia-pubmed-and-PMC-w2v.vec'

I guess they are from here? http://evexdb.org/pmresources/vec-space-models/

Feature-based approach with BERT for seq. labelling is super slow

We are currently using keras-bert for the feature-based approach with BERT for sequence labelling, and it is super slow: 56 tokens per second (using the concatenation of the top four hidden layers of the pre-trained transformer, as in the original paper). Compare this to ~300 tokens/s with ELMo and, more relevantly, around 1000 tokens per second when using the fine-tuned BERT model.

I think there is no reason to have something so slow when using the pre-trained transformer as compared to the fine-tuned model, so we should use our own BERT integration rather than keras-bert for the feature-based approach (as a bonus, it will remove this dependency).

<PAD> tags should be filtered out from the output of the Tagger

In a sequence labelling scenario, the internal <PAD> tag can be present in the output of the Tagger.tag() method.
As it is internal, it should probably be filtered out.

I would be more than happy to provide a fix in a PR if you tell me where it is best to fix it:

In WordPreprocessor.inverse_transform()?
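
As a rough sketch (the tag name and helper below are assumptions based on this issue, not the actual DeLFT code), the filtering could look like this, e.g. at the end of inverse_transform():

    PAD_TAG = '<PAD>'

    def filter_padding(tokens, tags):
        # drop (token, tag) pairs whose tag is the internal padding label
        kept = [(tok, tag) for tok, tag in zip(tokens, tags) if tag != PAD_TAG]
        return list(zip(*kept)) if kept else ([], [])

    tokens, tags = filter_padding(["New", "York", ""], ["B-LOC", "I-LOC", "<PAD>"])
    print(tokens, tags)   # ('New', 'York') ('B-LOC', 'I-LOC')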

LMDB Embeddings

Hi, thanks for the good work!

Loving the idea of using LMDB to store and query the embeddings! I wrote a standalone package for this here: https://github.com/ThoughtRiver/lmdb-embeddings/tree/master. I thought it could be useful to embed it here, thereby separating the logic out, but also making it usable in other settings. Is it worth me doing some work on this?

Let me know what you think!
Thanks

NER with bert

Do you have a plan to reproduce the BERT NER model? I tried, but with BERT-Base the best micro-avg test F1 I get on CoNLL-2003 is 91.37, while the score reported in the paper is 92.4.

Error when checkpointing model if f1 is not available (yet)

The error is:

File "/opt/anaconda3/envs/delft-sug/lib/python3.6/site-packages/delft/sequenceLabelling/wrapper.py", line 124, in train
    trainer.train(x_train, y_train, x_valid, y_valid)
File "/opt/anaconda3/envs/delft-sug/lib/python3.6/site-packages/delft/sequenceLabelling/trainer.py", line 61, in train
    self.training_config.max_epoch)
File "/opt/anaconda3/envs/delft-sug/lib/python3.6/site-packages/delft/sequenceLabelling/trainer.py", line 109, in train_model
    callbacks=callbacks)
File "/opt/anaconda3/envs/delft-sug/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
File "/opt/anaconda3/envs/delft-sug/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
File "/opt/anaconda3/envs/delft-sug/lib/python3.6/site-packages/keras/engine/training_generator.py", line 251, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
File "/opt/anaconda3/envs/delft-sug/lib/python3.6/site-packages/keras/callbacks.py", line 79, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
File "/opt/anaconda3/envs/delft-sug/lib/python3.6/site-packages/keras/callbacks.py", line 429, in on_epoch_end
    filepath = self.filepath.format(epoch=epoch + 1, **logs)
KeyError: 'f1'

Unfortunately this f1 variable is created by Keras, so it does not seem to be possible to generate a default value in DeLFT.
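
An untested possible workaround (just a sketch, not part of DeLFT) could be a tiny callback registered before the ModelCheckpoint that injects a default value into the epoch logs, since Keras passes the same logs dict to every callback:

    from keras.callbacks import Callback

    class DefaultF1(Callback):
        # registered *before* ModelCheckpoint so that formatting a filepath
        # containing {f1} never raises KeyError when f1 was not computed yet
        def on_epoch_end(self, epoch, logs=None):
            if logs is not None:
                logs.setdefault('f1', 0.0)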

Error when tagging with BidLSTM_CNN architecture

My warm thanks for your wonderful work! I have two OSes (Windows and Ubuntu), both using Python 3.6 (with Anaconda 3). I installed tensorflow-gpu version 1.10.

I've successfully trained a BidLSTM_CNN model and obtained the model_weights.hdf5 file.

When I run: python nerTagger.py --dataset-type conll2003 --architecture BidLSTM_CNN eval

I get:

Evaluation on test set:
        f1 (micro): 89.38
                  precision    recall  f1-score   support

            MISC     0.7974    0.7735    0.7852       702
             ORG     0.8341    0.8958    0.8639      1661
             LOC     0.9052    0.9269    0.9159      1668
             PER     0.9465    0.9511    0.9488      1617

all (micro avg.)     0.8822    0.9056    0.8938      5648

runtime: 5.412 seconds

However, when I try to perform the tag action (instead of eval), I get the same error on both OSes (note: the output below is copied from the Windows terminal; the Ubuntu terminal says the same thing):

Traceback (most recent call last):
  File "nerTagger.py", line 464, in <module>
    file_out=file_out)
  File "nerTagger.py", line 381, in annotate
    model.tag_file(file_in=file_in, output_format=output_format, file_out=file_out)
  File "D:\Anaconda3\envs\ULR\delft-master\delft\sequenceLabelling\wrapper.py", line 275, in tag_file
    annotations = tagger.tag(texts, output_format)
  File "D:\Anaconda3\envs\ULR\delft-master\delft\sequenceLabelling\tagger.py", line 86, in tag
    piece["entities"] = self._build_json_response(tokens, tags, prob, offsets)["entities"]
  File "D:\Anaconda3\envs\ULR\delft-master\delft\sequenceLabelling\tagger.py", line 113, in _build_json_response
    chunks = get_entities_with_offsets(tags, offsets)
  File "D:\Anaconda3\envs\ULR\delft-master\delft\sequenceLabelling\tagger.py", line 156, in get_entities_with_offsets
    end_pos = offsets[j-1][1]-1
IndexError: list index out of range

LMDB embeddings creation is very slow on spinning drive

It is very slow on a spinning drive (40 it/s) compared to an SSD (4000 it/s). It is caused by frequent commits (and LMDB not being write-optimized).
For the moment I can't find any evidence on whether LMDB will exclusively use RAM when writing all the data in a single transaction. It does not seem likely though. There are kernel parameters controlling how dirty pages of an mmap are flushed to disk (http://jmoiron.net/blog/mmap2/). As far as I understand, those parameters will have an impact on actual RAM consumption.
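
For illustration, a minimal sketch of batching the writes into larger transactions (the path, map size and batch size are placeholders, not the DeLFT code):

    import lmdb

    def write_embeddings(path, items, batch_size=100000):
        # items: iterable of (word, vector_bytes); commit only every batch_size puts
        # instead of once per entry, which is what hurts most on a spinning drive
        env = lmdb.open(path, map_size=100 * 1024 ** 3)   # 100 GB memory map
        txn = env.begin(write=True)
        for i, (word, vector_bytes) in enumerate(items, start=1):
            txn.put(word.encode('utf-8'), vector_bytes)
            if i % batch_size == 0:
                txn.commit()
                txn = env.begin(write=True)
        txn.commit()
        env.close()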

Training grobidTagger without embeddings

Hi @kermitt2, should training without embedding work?

From the readme:

Reduce model size, in particular by removing word embeddings from them. For instance, the model for the toxic comment classifier went down from a size of 230 MB with embeddings to 1.8 MB. In practice the size of all the models of DeLFT is less than 2 MB, except for Ontonotes 5.0 NER model which is 4.7 MB.

When I try to set embeddings_name to None, it falls over soon after. I tried to fix the next two issues, but there are more, which makes me think maybe it's not meant to work?

Adding std to n-fold evaluation

Hi,
I suggest adding the standard deviation to the report produced in "sequenceLabelling/wrapper.py", eval_nfold() function, around line 230.
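
For example (with made-up per-fold scores):

    import numpy as np

    fold_f1_scores = [0.9021, 0.9088, 0.9107]   # hypothetical per-fold f1 values
    print("average f1 = %.4f (std = %.4f)" % (np.mean(fold_f1_scores), np.std(fold_f1_scores)))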

Using delft as a library

It would be good to be able to use and install delft as a library. You could then also push it to PyPI.
Some changes would be required, of course.

It could maybe work a bit like this:

pip install delft==1.0.0
# or pip install delft[gpu]==1.0.0

python -m delft.download_embedding glove-840B

python -m delft.grobid_tagger --embedding glove-840B --input ... --output ...

What are your thoughts on that, @kermitt2?

Working with all lowercase dataset

Thanks for the wonderful work here!

I have some text files and want to extract NEs from them by running nerTagger.py. However, my files contain only lowercase characters and, of course, I can't get any NE results.

For instance:

  • [Normal sentence]: I live in New York.
    Output:
...
"text": "I live in New York.",
"entities": [
                {
                    "text": "New York",
                    "class": "LOC",
                    "score": 1.0,
                    "beginOffset": 10,
                    "endOffset": 17
                },
            ]
...
  • [Lowercase sentence]: i live in new york.
    Output:
...
"text": "i live in new york.",
"entities": []
...

Expected:

...
"text": "i live in new york.",
"entities": [
                {
                    "text": "new york",
                    "class": "LOC",
                    "score": 1.0,
                    "beginOffset": 10,
                    "endOffset": 17
                },
            ]
...

Therefore, should we develop a caseless NER model?

directory not empty (utilities.Embeddings.py Line 594: "os.rmdir(self.embedding_ELMo_cache)")

Hi, Thanks for the great code.
I am trying to run your code on some Linux servers. I tried several times, and every time I received the error "directory not empty" while the program tries to save the trained model.
I replaced "os.rmdir" with "shutil.rmtree"; however, I then received the error "Device or resource busy".

It seems that the program keeps the directory "data/model/ELMo/en/cache" busy.

Provide Docker container

It's a bit similar to me having requested a Docker container for GROBID training.

Basically I want to make it easy to train in the cloud, i.e. I would just issue a command to run a Docker container in Kubernetes (but it could also be run locally). One benefit of running it locally is also that there won't be much, or any, setup required, assuming Docker or Kubernetes is already configured.

I will progress that in a separate repo but thought it's worth sharing my motivation.

Average precision/recall/f1 per label

I've noticed that in the evaluation using n-fold cross-validation, the report provides average precision/recall/f1 globally, but not average scores by label.

Would it be useful if I implemented it? Otherwise I will just compute them manually.

Training of elmo embeddings

This might not be the right place to ask, sorry for that.

I see that you trained a 5.5B token elmo model. I am facing some problems training my own elmo model. Could you be kind enough to answer a few questions?

  1. How long did it take you to train your own elmo model for 5.5B tokens? (How many GPUs?)

  2. What was the size of each training file that you split the data into? My kernel kills the process with around 1 million lines per file.

Thank You,

Reproducibility

Retraining a model leads to different evaluation scores due to different random seeds. For instance, the NER f1 score with BidLSTM-CRF can go from 90.22 to 91.07 (on eng.testb).

  • adding, for Keras reproducibility, the classical:

    import numpy as np
    np.random.seed(1)

  • adding, for tensorflow reproducibility:

    from tensorflow import set_random_seed
    set_random_seed(1)

    (does it add a tensorflow backend dependency in Keras?)

  • test ...

  • make the random seed optional

sequenceLabelling ValueError

I'm testing the tag function of the sequenceLabelling models on Windows 10, and I get ValueErrors for all models. I didn't change any code and only ran the following:

def tag_line(text):
    # load model
    model_name = 'ner-en-conll2003-BidGRU_CRF'
    model = sequenceLabelling.Sequence(model_name)
    model.load()
    results = model.tag(text, "json")
    return results

if __name__ == "__main__":
    text = ["Trump lives in New York"]
    tag_line(text)

The detailed error messages are as below.

loading model weights data/models/sequenceLabelling/ner-en-conll2003-BidGRU_CRF\model_weights.hdf5
Traceback (most recent call last):
  File "D:\Anaconda\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 686, in _call_cpp_shape_fn_impl
    input_tensors_as_shapes, status)
  File "D:\Anaconda\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 50 and 350. Shapes are [50,300] and [350,300]. for 'Assign_7' (op: 'Assign') with input shapes: [50,300], [350,300].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/YuhwaChoong/Desktop/Delft/delft/nerTagger.py", line 476, in <module>
    tag_line(text)
  File "C:/Users/YuhwaChoong/Desktop/Delft/delft/nerTagger.py", line 470, in tag_line
    model.load()
  File "C:\Users\YuhwaChoong\Desktop\Delft\delft\sequenceLabelling\wrapper.py", line 366, in load
    self.model.load(filepath=os.path.join(dir_path, self.model_config.model_name, self.weight_file))
  File "C:\Users\YuhwaChoong\Desktop\Delft\delft\sequenceLabelling\models.py", line 60, in load
    self.model.load_weights(filepath=filepath)
  File "D:\Anaconda\lib\site-packages\keras\engine\topology.py", line 2656, in load_weights
    f, self.layers, reshape=reshape)
  File "D:\Anaconda\lib\site-packages\keras\engine\topology.py", line 3382, in load_weights_from_hdf5_group
    K.batch_set_value(weight_value_tuples)
  File "D:\Anaconda\lib\site-packages\keras\backend\tensorflow_backend.py", line 2368, in batch_set_value
    assign_op = x.assign(assign_placeholder)
  File "D:\Anaconda\lib\site-packages\tensorflow\python\ops\variables.py", line 609, in assign
    return state_ops.assign(self._variable, value, use_locking=use_locking)
  File "D:\Anaconda\lib\site-packages\tensorflow\python\ops\state_ops.py", line 281, in assign
    validate_shape=validate_shape)
  File "D:\Anaconda\lib\site-packages\tensorflow\python\ops\gen_state_ops.py", line 64, in assign
    use_locking=use_locking, name=name)
  File "D:\Anaconda\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "D:\Anaconda\lib\site-packages\tensorflow\python\framework\ops.py", line 3292, in create_op
    compute_device=compute_device)
  File "D:\Anaconda\lib\site-packages\tensorflow\python\framework\ops.py", line 3332, in _create_op_helper
    set_shapes_for_outputs(op)
  File "D:\Anaconda\lib\site-packages\tensorflow\python\framework\ops.py", line 2496, in set_shapes_for_outputs
    return _set_shapes_for_outputs(op)
  File "D:\Anaconda\lib\site-packages\tensorflow\python\framework\ops.py", line 2469, in _set_shapes_for_outputs
    shapes = shape_func(op)
  File "D:\Anaconda\lib\site-packages\tensorflow\python\framework\ops.py", line 2399, in call_with_requiring
    return call_cpp_shape_fn(op, require_shape_fn=True)
  File "D:\Anaconda\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 627, in call_cpp_shape_fn
    require_shape_fn)
  File "D:\Anaconda\lib\site-packages\tensorflow\python\framework\common_shapes.py", line 691, in _call_cpp_shape_fn_impl
    raise ValueError(err.message)
ValueError: Dimension 0 in both shapes must be equal, but are 50 and 350. Shapes are [50,300] and [350,300]. for 'Assign_7' (op: 'Assign') with input shapes: [50,300], [350,300].

CoNLL 2012

I have created the IOB2 format of the CoNLL 2012 dataset based on the instructions you provided here.
However, I get the following error:

TensorArray has size zero, but element shape [?,38] is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.

which is due to having sentences of length one.
Any suggestions?

Thanks for your wonderful work!

new multiprocessing parameter for Sequence and TrainingConfig

There are some reports of multiprocessing causing locking problems (with Keras hanging forever):
keras-team/keras#3181
keras-team/keras#10340
keras-team/keras#9964
We encountered this problem when training sequence labelling without ELMo (where multiprocessing is disabled by Delft). To be honest, it might be caused by some obscure Python process forking issue, but I'm not able to find out why we're having this problem.
Having the possibility to disable multiprocessing would be great.

I will submit a PR today.
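
For illustration, a rough sketch of what the new parameter could look like (the names are assumptions, not the actual PR):

    class TrainingConfig:
        def __init__(self, batch_size=20, max_epoch=50, multiprocessing=True):
            self.batch_size = batch_size
            self.max_epoch = max_epoch
            self.multiprocessing = multiprocessing   # new flag, forwarded to Keras

    config = TrainingConfig(multiprocessing=False)

    # the trainer would then forward it to fit_generator, e.g.:
    # model.fit_generator(generator, epochs=config.max_epoch,
    #                     use_multiprocessing=config.multiprocessing,
    #                     workers=6 if config.multiprocessing else 1)
    print(config.multiprocessing)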

reader.py misses the the <EX_ENAMEX> annotated entities on LeMonde Corpus

Hello,

I recently found that the reader.py script does not parse the entities annotated as <EX_ENAMEX> in the French LeMonde corpus, meaning that it misses some entities and also cuts some sentences, as it ignores the text inside these tags.

This should not have a big impact, as the number of entities annotated like this is rather small, but it would still be nice to patch this little bug. 😄

Thanks!

Custom model

I just found this library and it seems really nice! Thanks for making it available. One quick question: is there a simple way to use delft with a custom model? I want to try a Keras model and use delft because of the already implemented, disk-space-efficient pipeline and the use of ELMo.

Be able to pass an additional callbacks argument to the train method of Sequence/Classifier object

It would be super useful in particular when you use delft as a library and want to implement a progress indicator during the training phase.

In practice it could be:

def train(self, x_train, y_train, x_valid=None, y_valid=None, callbacks=None)

callbacks should be a list of valid Keras callbacks; it would be propagated to the underlying Trainer object and then concatenated with the already defined delft callbacks before calling the fit() method of the Keras model.
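
A minimal sketch of the merging step (illustrative names only, not the actual DeLFT trainer):

    def merge_callbacks(delft_callbacks, user_callbacks=None):
        # concatenate DeLFT's own callbacks with an optional user-provided list,
        # keeping the internal ones first
        return list(delft_callbacks) + list(user_callbacks or [])

    # the trainer would then call, e.g.:
    # model.fit_generator(generator, epochs=max_epoch,
    #                     callbacks=merge_callbacks(internal_callbacks, callbacks))
    print(merge_callbacks(['early_stopping', 'model_checkpoint'], ['progress_indicator']))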

header training data - mismatching features columns?

I've been testing the header model training data using features #76 and I've run into one small problem: it seems that the number of columns is not consistent.

For example:
'anaesthesia' on line 938 and 'elsevier' on line 813 have different numbers of columns (30 vs 31):

ELSEVIER elsevier E EL ELS ELSE R ER IER VIER BLOCKSTART LINESTART NEWFONT HIGHERFONT 0 0 0 ALLCAP NODIGIT 0 0 0 0 0 0 0 0 0 0 NOPUNCT 0 0 I-<note>

Anaesthesia anaesthesia A An Ana Anae a ia sia esia BLOCKSTART LINESTART LINEINDENT NEWFONT HIGHERFONT 0 0 0 INITCAP NODIGIT 0 0 1 0 0 0 0 0 NOPUNCT 0 0 <reference>

I'm not sure it is a bug (at least not in the current version, which ignores this information), and I'm also not sure this is the right place, but training the header model will fail with automatic feature discovery enabled.
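
As a quick way to spot the inconsistent rows, a small check like the following could be used (the file path in the comment is a placeholder):

    from collections import Counter

    def column_counts(path):
        # count whitespace-separated columns per non-empty line of the training file
        counts = Counter()
        with open(path, encoding='utf-8') as f:
            for line in f:
                if line.strip():
                    counts[len(line.split())] += 1
        return counts

    # e.g. column_counts('header.train') might return Counter({31: 12000, 30: 3}),
    # pointing at a handful of lines with a missing feature column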

max_sequence_length not used(?)

Hi @kermitt2

I was just considering whether we need sliding windows to not have to use a really large max_sequence_length. But then I realised that max_sequence_length doesn't actually seem to be used. It's passed to the DataGenerator which doesn't seem to use it. Instead it simply pads the batch to whatever the maximum length is within the batch.
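
For reference, a tiny sketch of how max_sequence_length could be used to cap the per-batch padding (hypothetical behaviour, not what the DataGenerator currently does):

    def batch_max_length(x_tokenized, max_sequence_length=None):
        # pad to the longest sequence in the batch, but never beyond max_sequence_length
        max_length_x = max(len(tokens) for tokens in x_tokenized)
        if max_sequence_length is not None:
            max_length_x = min(max_length_x, max_sequence_length)
        return max_length_x

    print(batch_max_length([["a"], ["b"] * 600], max_sequence_length=300))   # 300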

avoid setting random seed at module level

It is generally preferable if module-level code doesn't have side effects, i.e. just importing a module shouldn't change anything (there may be a few exceptions). It would be better if the seed was set by the main method, for example.

e.g.

>>> np.random.seed(123)
>>> np.random.get_state()[1][0]
123
>>> import delft.sequenceLabelling.data_generator
Using TensorFlow backend.
>>> np.random.get_state()[1][0]
7

At the end, I would expect the seed to be the same.

(It's not a big issue as there is a simple workaround)
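
A minimal sketch of the suggested change (illustrative only):

    import numpy as np

    def set_random_seeds(seed=7):
        np.random.seed(seed)          # moved out of module level ...

    if __name__ == '__main__':
        set_random_seeds()            # ... and called explicitly from the entry point
        # training / tagging code goes here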

DataGenerator on_epoch_end / shuffle_pair not functional

Disclaimer: I haven't actually tested it, but I can't see it working.

The shuffle_pair function doesn't shuffle the arrays in place as far as I can see, but returns shuffled views, which in turn are not used by on_epoch_end.

    def shuffle_pair(self, a, b):
        # generate permutation index array
        permutation = np.random.permutation(a.shape[0])
        # shuffle the two arrays
        return a[permutation], b[permutation]

    def on_epoch_end(self):
        # shuffle dataset at each epoch
        if self.shuffle == True:
            if self.y is None:
                np.random.shuffle(self.x)
            else:      
                self.shuffle_pair(self.x,self.y)

For an in-place shuffle I found the following StackOverflow answer useful (essentially shuffling each array in-place with the same seed):

def shuffle_arrays(arrays, set_seed=-1):
    """Shuffles arrays in-place, in the same order, along axis=0

    Parameters:
    -----------
    arrays : List of NumPy arrays.
    set_seed : Seed value if int >= 0, else seed is random.
    """
    assert all(len(arr) == len(arrays[0]) for arr in arrays)
    seed = np.random.randint(0, 2**(32 - 1) - 1) if set_seed < 0 else set_seed

    for arr in arrays:
        rstate = np.random.RandomState(seed)  # pylint: disable=no-member
        rstate.shuffle(arr)

Otherwise you could of course use the shuffled numpy array views.
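
For completeness, a minimal sketch of an on_epoch_end that actually uses the shuffled views (illustrative, not the actual DataGenerator code):

    import numpy as np

    class GeneratorSketch:
        def __init__(self, x, y=None, shuffle=True):
            self.x, self.y, self.shuffle = x, y, shuffle

        def on_epoch_end(self):
            if self.shuffle:
                if self.y is None:
                    np.random.shuffle(self.x)
                else:
                    permutation = np.random.permutation(self.x.shape[0])
                    # reassign the shuffled views so they are actually used
                    self.x, self.y = self.x[permutation], self.y[permutation]

    gen = GeneratorSketch(np.arange(10).reshape(5, 2), np.arange(5))
    gen.on_epoch_end()
    print(gen.x[:, 0], gen.y)   # rows of x and entries of y are still aligned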

AttributeError: 'Tensor' object has no attribute 'assign'

I installed all the packages mentioned in your requirements.txt. I am getting an error in model.fit_generator:
line 687, in train_model
    workers=6,epochs=1)
  File "/home/rohit/home/Development/Test/03042019/project1_env/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/rohit/home/Development/Test/03042019/project1_env/lib/python3.6/site-packages/keras/engine/training.py", line 2080, in fit_generator
    self._make_train_function()
  File "/home/rohit/home/Development/Test/03042019/project1_env/lib/python3.6/site-packages/keras/engine/training.py", line 990, in _make_train_function
    loss=self.total_loss)
  File "/home/rohit/home/Development/Test/03042019/project1_env/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/home/rohit/home/Development/Test/03042019/project1_env/lib/python3.6/site-packages/keras/optimizers.py", line 257, in get_updates
    self.updates.append(K.update(a, new_a))
  File "/home/rohit/home/Development/Test/03042019/project1_env/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 963, in update
    return tf.assign(x, new_x)
  File "/home/rohit/home/Development/Test/03042019/project1_env/lib/python3.6/site-packages/tensorflow/python/ops/state_ops.py", line 284, in assign
    return ref.assign(value, name=name)

Tag for v0.2.3 missing

To be able to refer to / download a particular release, it would be helpful to have a GitHub tag for each release version (as you have for GROBID) (or now at least the last release).

Undefined names

flake8 testing of https://github.com/kermitt2/delft on Python 3.7.0

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./nerTagger.py:62:53: F821 undefined name 't_train3'
        y_all = np.concatenate((y_train1, y_train2, t_train3), axis=0)
                                                    ^
./nerTagger.py:78:53: F821 undefined name 'fold_count'
                                        fold_number=fold_count,
                                                    ^
./textClassification/reader.py:31:17: F821 undefined name 'printf'
                printf("Warning: number of fields in the data file too low for line:", line)
                ^
./textClassification/reader.py:120:17: F821 undefined name 'printf'
                printf("Warning: incorrect number of fields in the data file for line:", line)
                ^
./textClassification/models.py:238:26: F821 undefined name 'inp'
    model = Model(inputs=inp, outputs=x)
                         ^
./textClassification/models.py:262:26: F821 undefined name 'inp'
    model = Model(inputs=inp, outputs=x)
                         ^
./textClassification/models.py:284:26: F821 undefined name 'inp'
    model = Model(inputs=inp, outputs=x)
                         ^
./textClassification/models.py:306:26: F821 undefined name 'inp'
    model = Model(inputs=inp, outputs=x)
                         ^
./textClassification/models.py:327:26: F821 undefined name 'inp'
    model = Model(inputs=inp, outputs=x)
                         ^
./textClassification/models.py:348:26: F821 undefined name 'inp'
    model = Model(inputs=inp, outputs=x)
                         ^
./textClassification/models.py:376:26: F821 undefined name 'inp'
    model = Model(inputs=inp, outputs=x)
                         ^
./textClassification/models.py:399:26: F821 undefined name 'inp'
    model = Model(inputs=inp, outputs=x)
                         ^
./textClassification/models.py:442:26: F821 undefined name 'inp'
    model = Model(inputs=inp, outputs=x)
                         ^
./textClassification/models.py:573:67: F821 undefined name 'X'
    X = Conv1D(filters=recurrent_units, kernel_size=2, strides=3)(X)
                                                                  ^
./utilities/Embeddings.py:565:39: F821 undefined name 'embedding_ELMo_cache'
            self.env_ELMo = lmdb.open(embedding_ELMo_cache, readonly=True, max_readers=2048, max_spare_txns=2, lock=False)
                                      ^
./utilities/Tokenizer.py:60:11: F821 undefined name 'tokenize'
    print(tokenize(test))
          ^
./utilities/Tokenizer.py:65:11: F821 undefined name 'tokenize'
    print(tokenize(test))
          ^
17    F821 undefined name 't_train3'
17

TypeError: can't pickle Environment objects on Windows/MacOs

I'm running under Windows 10, following the instructions given in the readme document. When trying to retrain the model using this command:

python nerTagger.py --dataset-type conll2003 train_eval

I ran into the following exception (right after compiling embeddings) - any tips?

Thank you for the wonderful work!

Compiling embeddings... (this is done only one time per embeddings at first launch)
path: d:\Projects\embeddings\glove.840B.300d.txt
100%|██████████| 2196017/2196017 [08:06<00:00, 4517.80it/s]
embeddings loaded for 2196006 words and 300 dimensions
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
char_input (InputLayer)         (None, None, 30)     0
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, None, 30, 25) 2150        char_input[0][0]
__________________________________________________________________________________________________
word_input (InputLayer)         (None, None, 300)    0
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, None, 50)     10200       time_distributed_1[0][0]
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, None, 350)    0           word_input[0][0]
                                                                 time_distributed_2[0][0]
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, None, 350)    0           concatenate_1[0][0]
__________________________________________________________________________________________________
bidirectional_2 (Bidirectional) (None, None, 200)    360800      dropout_1[0][0]
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, None, 200)    0           bidirectional_2[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, None, 100)    20100       dropout_2[0][0]
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, None, 10)     1010        dense_1[0][0]
__________________________________________________________________________________________________
chain_crf_1 (ChainCRF)          (None, None, 10)     120         dense_2[0][0]
==================================================================================================
Total params: 394,380
Trainable params: 394,380
Non-trainable params: 0
__________________________________________________________________________________________________
Epoch 1/60
Exception in thread Thread-2:
Traceback (most recent call last):
  File "d:\Anaconda3\Lib\threading.py", line 916, in _bootstrap_inner
    self.run()
  File "d:\Anaconda3\Lib\threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "d:\Projects\delft\env\lib\site-packages\keras\utils\data_utils.py", line 548, in _run
    with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
  File "d:\Projects\delft\env\lib\site-packages\keras\utils\data_utils.py", line 522, in <lambda>
    initargs=(seqs,))
  File "d:\Anaconda3\Lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "d:\Anaconda3\Lib\multiprocessing\pool.py", line 174, in __init__
    self._repopulate_pool()
  File "d:\Anaconda3\Lib\multiprocessing\pool.py", line 239, in _repopulate_pool
    w.start()
  File "d:\Anaconda3\Lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "d:\Anaconda3\Lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "d:\Anaconda3\Lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "d:\Anaconda3\Lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle Environment objects

Big dataset

How can I train a Classifier from textClassification without needing to load the full training dataset? I have quite a large text dataset, and if I use, e.g., load_texts_and_classes from textClassification.reader, I get a MemoryError.

Rare class and batch size

For ELMo, which uses a reduced batch size because of memory constraints, it might be necessary to review how the batches are created to ensure that rare classes are well represented in each batch, with automatic over-sampling techniques for instance.

See issue #7 (although Ontonotes finally works fine without sampling, we could have more extremely unbalanced training data).
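
As an illustration of the over-sampling idea (a naive sketch, not DeLFT code; the minimum-per-class threshold is arbitrary):

    import random
    from collections import defaultdict

    def oversample(examples, min_per_class=2):
        # duplicate examples of rare classes until every class reaches min_per_class
        by_class = defaultdict(list)
        for x, y in examples:
            by_class[y].append((x, y))
        batch = []
        for label, items in by_class.items():
            while len(items) < min_per_class:
                items.append(random.choice(items))
            batch.extend(items)
        random.shuffle(batch)
        return batch

    print(oversample([("a", "O"), ("b", "O"), ("c", "O"), ("d", "RARE")]))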

Batch-size=1

Hi,
When I set batch_size=1, I got the following error:

Traceback (most recent call last):
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.UnimplementedError: TensorArray has size zero, but element shape [?,10] is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
[[Node: chain_crf_1/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@chain_crf_1/TensorArray"], dtype=DT_FLOAT, element_shape=[?,10], _device="/job:localhost/replica:0/task:0/device:GPU:0"](chain_crf_1/TensorArray, chain_crf_1/TensorArrayStack/range, chain_crf_1/while/Exit_1)]]
[[Node: chain_crf_1/while_1/Reshape_1/_321 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2949_chain_crf_1/while_1/Reshape_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "nerTagger.py", line 440, in
data_path=data_path)
File "nerTagger.py", line 271, in train_eval
model.train(x_train, y_train, x_valid, y_valid)
File "/data/hpoostch/PycharmProjects/ELMo_Lample/CrossEntropy/delft-master/sequenceLabelling/wrapper.py", line 130, in train
trainer.train(x_train, y_train, x_valid, y_valid)
File "/data/hpoostch/PycharmProjects/ELMo_Lample/CrossEntropy/delft-master/sequenceLabelling/trainer.py", line 61, in train
self.training_config.max_epoch)
File "/data/hpoostch/PycharmProjects/ELMo_Lample/CrossEntropy/delft-master/sequenceLabelling/trainer.py", line 101, in train_model
callbacks=callbacks)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/keras/engine/training.py", line 2224, in fit_generator
class_weight=class_weight)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/keras/engine/training.py", line 1883, in train_on_batch
outputs = self.train_function(ins)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2478, in call
**self.session_kwargs)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1140, in _run
feed_dict_tensor, options, run_metadata)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
run_metadata)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnimplementedError: TensorArray has size zero, but element shape [?,10] is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
[[Node: chain_crf_1/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@chain_crf_1/TensorArray"], dtype=DT_FLOAT, element_shape=[?,10], _device="/job:localhost/replica:0/task:0/device:GPU:0"](chain_crf_1/TensorArray, chain_crf_1/TensorArrayStack/range, chain_crf_1/while/Exit_1)]]
[[Node: chain_crf_1/while_1/Reshape_1/_321 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2949_chain_crf_1/while_1/Reshape_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'chain_crf_1/TensorArrayStack/TensorArrayGatherV3', defined at:
File "nerTagger.py", line 440, in
data_path=data_path)
File "nerTagger.py", line 271, in train_eval
model.train(x_train, y_train, x_valid, y_valid)
File "/data/hpoostch/PycharmProjects/ELMo_Lample/CrossEntropy/delft-master/sequenceLabelling/wrapper.py", line 121, in train
self.model = get_model(self.model_config, self.p, len(self.p.vocab_tag))
File "/data/hpoostch/PycharmProjects/ELMo_Lample/CrossEntropy/delft-master/sequenceLabelling/models.py", line 19, in get_model
return BidLSTM_CRF(config, ntags)
File "/data/hpoostch/PycharmProjects/ELMo_Lample/CrossEntropy/delft-master/sequenceLabelling/models.py", line 112, in init
pred = self.crf(x)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/keras/engine/topology.py", line 619, in call
output = self.call(inputs, **kwargs)
File "/data/hpoostch/PycharmProjects/ELMo_Lample/CrossEntropy/delft-master/utilities/layers.py", line 313, in call
y_pred = viterbi_decode(x, self.U, self.b_start, self.b_end, mask)
File "/data/hpoostch/PycharmProjects/ELMo_Lample/CrossEntropy/delft-master/utilities/layers.py", line 105, in viterbi_decode
mask)
File "/data/hpoostch/PycharmProjects/ELMo_Lample/CrossEntropy/delft-master/utilities/layers.py", line 146, in _forward
last, values, _ = K.rnn(_forward_step, inputs, initial_states)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2772, in rnn
outputs = output_ta.stack()
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 893, in stack
return self._implementation.stack(name=name)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 291, in stack
return self.gather(math_ops.range(0, self.size()), name=name)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/ops/tensor_array_ops.py", line 305, in gather
element_shape=element_shape)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 6011, in tensor_array_gather_v3
flow_in=flow_in, dtype=dtype, element_shape=element_shape, name=name)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
op_def=op_def)
File "/data/hpoostch/py3.6/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

UnimplementedError (see above for traceback): TensorArray has size zero, but element shape [?,10] is not fully defined. Currently only static shapes are supported when packing zero-size TensorArrays.
[[Node: chain_crf_1/TensorArrayStack/TensorArrayGatherV3 = TensorArrayGatherV3[_class=["loc:@chain_crf_1/TensorArray"], dtype=DT_FLOAT, element_shape=[?,10], _device="/job:localhost/replica:0/task:0/device:GPU:0"](chain_crf_1/TensorArray, chain_crf_1/TensorArrayStack/range, chain_crf_1/while/Exit_1)]]
[[Node: chain_crf_1/while_1/Reshape_1/_321 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2949_chain_crf_1/while_1/Reshape_1", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
