NeuroNER

NeuroNER is a program that performs named-entity recognition (NER). Website: neuroner.com.

This page gives step-by-step instructions to install and use NeuroNER.

Requirements

NeuroNER relies on Python 3, TensorFlow 1.0+, and optionally on BRAT:

  • Python 3: NeuroNER does not work with Python 2.x. On Windows, it has to be Python 3.6 64-bit or later.
  • TensorFlow is a library for machine learning. NeuroNER uses it for its NER engine, which is based on neural networks. Official website: https://www.tensorflow.org
  • BRAT (optional) is a web-based annotation tool. It only needs to be installed if you wish to conveniently create annotations or view the predictions made by NeuroNER. Official website: http://brat.nlplab.org

Installation

For GPU support, the GPU requirements for TensorFlow must be satisfied. If your system does not meet these requirements, you should use the CPU version. To install NeuroNER:

# For CPU support (no GPU support):
pip3 install pyneuroner[cpu]

# For GPU support:
pip3 install pyneuroner[gpu]

You will also need to download some support packages.

  1. The English language module for Spacy:
# Download the SpaCy English module
python -m spacy download en
  2. Download word embeddings from http://neuroner.com/data/word_vectors/glove.6B.100d.zip and unzip them to the folder ./data/word_vectors:
# Get word embeddings
wget -P data/word_vectors http://neuroner.com/data/word_vectors/glove.6B.100d.zip
unzip data/word_vectors/glove.6B.100d.zip -d data/word_vectors/
  3. Load sample datasets. These can be loaded by calling the neuromodel.fetch_data() function from a Python interpreter or with the --fetch_data argument at the command line.
# Load a dataset from the command line
neuroner --fetch_data=conll2003
neuroner --fetch_data=example_unannotated_texts
neuroner --fetch_data=i2b2_2014_deid
# Load a dataset from a Python interpreter
from neuroner import neuromodel
neuromodel.fetch_data('conll2003')
neuromodel.fetch_data('example_unannotated_texts')
neuromodel.fetch_data('i2b2_2014_deid')
  4. Load a pretrained model. The models can be loaded by calling the neuromodel.fetch_model() function from a Python interpreter or with the --fetch_trained_model argument at the command line.
# Load a pre-trained model from the command line
neuroner --fetch_trained_model=conll_2003_en
neuroner --fetch_trained_model=i2b2_2014_glove_spacy_bioes
neuroner --fetch_trained_model=i2b2_2014_glove_stanford_bioes
neuroner --fetch_trained_model=mimic_glove_spacy_bioes
neuroner --fetch_trained_model=mimic_glove_stanford_bioes
# Load a pre-trained model from a Python interpreter
from neuroner import neuromodel
neuromodel.fetch_model('conll_2003_en')
neuromodel.fetch_model('i2b2_2014_glove_spacy_bioes')
neuromodel.fetch_model('i2b2_2014_glove_stanford_bioes')
neuromodel.fetch_model('mimic_glove_spacy_bioes')
neuromodel.fetch_model('mimic_glove_stanford_bioes')

Installing BRAT (optional)

BRAT is a tool that can be used to create, change, or view BRAT-style annotations. For installation and usage instructions, see the BRAT website.

Installing Perl (platform dependent)

Perl is required because the official CoNLL-2003 evaluation script is written in this language. On Unix and macOS systems, Perl should already be installed. On Windows, you may need to install it, for example from http://strawberryperl.com.

Using NeuroNER

NeuroNER can either be run from the command line or from a Python interpreter.

Using NeuroNER from a Python interpreter

To use NeuroNER from a Python interpreter, create an instance of neuromodel.NeuroNER with your desired arguments, then call the relevant methods. Additional parameters can be set from a parameters.ini file in the working directory. For example:

from neuroner import neuromodel
nn = neuromodel.NeuroNER(train_model=False, use_pretrained_model=True)

More detail to follow.
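In the meantime, here is a rough sketch of a typical workflow. The fit() call matches the usage shown in the issues below; the predict() method name is an assumption, so check the methods exposed by your installed version:

from neuroner import neuromodel

# Train a model using the settings in ./parameters.ini (plus any keyword overrides):
nn = neuromodel.NeuroNER(train_model=True, use_pretrained_model=False)
nn.fit()

# Or load a pre-trained model and annotate raw text. The predict() method below is
# an assumption; adapt it to the API of your installed version.
nn = neuromodel.NeuroNER(train_model=False, use_pretrained_model=True)
entities = nn.predict("President Obama visited Paris last week.")
print(entities)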

Using NeuroNER from the command line

By default NeuroNER is configured to train and test on the CoNLL-2003 dataset. Running neuroner with the default settings starts training on the CoNLL-2003 dataset (the F1-score on the test set should be around 0.90, i.e. on par with state-of-the-art systems). To start the training:

# To use the CPU if you have installed tensorflow, or use the GPU if you have installed tensorflow-gpu:
neuroner

# To use the CPU only if you have installed tensorflow-gpu:
CUDA_VISIBLE_DEVICES="" neuroner

# To use the GPU 1 only if you have installed tensorflow-gpu:
CUDA_VISIBLE_DEVICES=1 neuroner

If you wish to change any of NeuroNER's parameters, you can modify the parameters.ini configuration file in your working directory or pass parameters as command-line arguments.

For example, to reduce the number of training epochs and not use any pre-trained token embeddings:

neuroner --maximum_number_of_epochs=2 --token_pretrained_embedding_filepath=""

To perform NER on some plain texts using a pre-trained model:

neuroner --train_model=False --use_pretrained_model=True --dataset_text_folder=./data/example_unannotated_texts --pretrained_model_folder=./trained_models/conll_2003_en

If a parameter is specified in both the parameters.ini configuration file and as an argument, then the argument takes precedence (i.e., the parameter in parameters.ini is ignored). You may specify a different configuration file with the --parameters_filepath command line argument. The command line arguments have no default value except for --parameters_filepath, which points to parameters.ini.

NeuroNER has 3 modes of operation:

  • training mode (from scratch): the dataset folder must have train and valid sets. Test and deployment sets are optional.
  • training mode (from pretrained model): the dataset folder must have train and valid sets. Test and deployment sets are optional.
  • prediction mode (using pretrained model): the dataset folder must have either a test set or a deployment set.

Adding a new dataset

A dataset may be provided in either CoNLL-2003 or BRAT format. The dataset files and folders should be organized and named as follows:

  • Training set: train.txt file (CoNLL-2003 format) or train folder (BRAT format). It must contain labels.
  • Validation set: valid.txt file (CoNLL-2003 format) or valid folder (BRAT format). It must contain labels.
  • Test set: test.txt file (CoNLL-2003 format) or test folder (BRAT format). It must contain labels.
  • Deployment set: deploy.txt file (CoNLL-2003 format) or deploy folder (BRAT format). It shouldn't contain any label (if it does, labels are ignored).
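For example, a BRAT-format dataset stored in a hypothetical folder data/my_dataset (the name is illustrative) would be organized as follows, with dataset_text_folder pointing to data/my_dataset:

data/my_dataset/
    train/    # .txt files with their matching .ann annotation files
    valid/    # .txt files with their matching .ann annotation files
    test/     # optional: .txt and .ann files
    deploy/   # optional: plain .txt files only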

We provide several examples of datasets:

  • data/conll2003/en: annotated dataset with the CoNLL-2003 format, containing 3 files (train.txt, valid.txt and test.txt).
  • data/example_unannotated_texts: unannotated dataset with the BRAT format, containing 1 folder (deploy/). Note that the BRAT format with no annotations is the same as plain text.

Using a pretrained model

In order to use a pretrained model, the pretrained_model_folder parameter in the parameters.ini configuration file must be set to the folder containing the pretrained model. The following parameters in the parameters.ini configuration file must also be set to the same values as in the configuration file located in the specified pretrained_model_folder:

use_character_lstm
character_embedding_dimension
character_lstm_hidden_state_dimension
token_pretrained_embedding_filepath
token_embedding_dimension
token_lstm_hidden_state_dimension
use_crf
tagging_format
tokenizer
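For example, to use the conll_2003_en model fetched above, these parameters would be set as follows (values matching the configuration shipped with that model; paths are relative to your working directory):

pretrained_model_folder = ./trained_models/conll_2003_en
use_character_lstm = True
character_embedding_dimension = 25
character_lstm_hidden_state_dimension = 25
token_pretrained_embedding_filepath = ./data/word_vectors/glove.6B.100d.txt
token_embedding_dimension = 100
token_lstm_hidden_state_dimension = 100
use_crf = True
tagging_format = bioes
tokenizer = spacy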

Sharing a pretrained model

You are highly encouraged to share models trained on your own datasets, so that other users can use them on their data. We provide the neuroner/prepare_pretrained_model.py script to make it easy to prepare a pretrained model for sharing. In order to use the script, you only need to specify the output_folder_name, epoch_number, and model_name parameters in the script.

By default, the only information about the dataset contained in the pretrained model is the list of tokens that appear in the dataset used for training and the corresponding embeddings learned from the dataset.

If you wish to share a pretrained model without providing any information about the dataset (including the list of tokens appearing in the dataset), you can do so by setting

delete_token_mappings = True

when running the script. In this case, it is highly recommended to use some external pre-trained token embeddings and freeze them while training the model to obtain high performance. This can be done by specifying the token_pretrained_embedding_filepath and setting

freeze_token_embeddings = True

in the parameters.ini configuration file during training.
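For example (the embeddings path is illustrative and should point to your own word vectors):

token_pretrained_embedding_filepath = ./data/word_vectors/glove.6B.100d.txt
freeze_token_embeddings = True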

In order to share a pretrained model, please submit a new issue on the GitHub repository.

Using TensorBoard

You may launch TensorBoard during or after the training phase. To do so, run the following in a terminal from the NeuroNER folder:

tensorboard --logdir=output

This starts a web server that is accessible at http://127.0.0.1:6006 from your web browser.

Citation

If you use NeuroNER in your publications, please cite this paper:

@article{2017neuroner,
  title={{NeuroNER}: an easy-to-use program for named-entity recognition based on neural networks},
  author={Dernoncourt, Franck and Lee, Ji Young and Szolovits, Peter},
  journal={Conference on Empirical Methods on Natural Language Processing (EMNLP)},
  year={2017}
}

The neural network architecture used in NeuroNER is described in this article:

@article{2016deidentification,
  title={De-identification of Patient Notes with Recurrent Neural Networks},
  author={Dernoncourt, Franck and Lee, Ji Young and Uzuner, Ozlem and Szolovits, Peter},
  journal={Journal of the American Medical Informatics Association (JAMIA)},
  year={2016}
}

neuroner's People

Contributors

franck-dernoncourt, gregory-howard, jennyjylee, johngiorgi, toltoxgh, tompollard


neuroner's Issues

Using for production purpose

I'm trying to use my trained model as an API whose input is text and whose output is the labeled text. However, it seems like every time I want to deploy, the model needs to be initialized again. Can I separate the initialization and the prediction parts, so that I only have to load the model once and can then use it for predicting several inputs at different times?
Thank you for your amazing model :)

I have read thread #5 about this issue, but I didn't quite get the solution to this problem.
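For example, I'd like to be able to do something along these lines (the predict() call is an assumption on my part; the point is that the model is loaded once and then reused):

from neuroner import neuromodel

# Load the model and the TensorFlow session once...
nn = neuromodel.NeuroNER(train_model=False, use_pretrained_model=True,
                         pretrained_model_folder='./trained_models/conll_2003_en')

# ...then call it for many inputs at different times without re-initializing.
for text in ["First input text ...", "Second input text ..."]:
    print(nn.predict(text))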

How to approach importing a new dataset to train a NN with custom entities in NeuroNER

Hi Franck,

Thanks for building this abstraction on top of TensorFlow to make it easier to apply NER. Do you have some pointers on how to convert "un-annotated" text with custom annotations existing in a separate file (labels with offsets into the main file) to CoNLL or BRAT format, such that it can be used to train a NN? The entities that I am interested in are not the standard ones but custom to a domain (names of specific models of cars). Also, the custom annotations that exist in another file do not include any POS or coreference tags.

I have several thousand of these "un-annotated" text files, so a manual annotation process (such as using BRAT) is not feasible. This is not directly an issue with NeuroNER, but your suggestions on the best approach to convert these to a format that could be used to prepare a training dataset for NeuroNER would be very helpful.

Thanks,
Ar

NeuroNER installation on Windows

I am new to Python and have been trying to install and run NeuroNER on Windows for 2 days, but it's not running and I think I am not able to install it properly on Windows 10 64-bit. The installation tutorial for Ubuntu is available, but for Windows I am unable to find any video tutorial. Can anyone please create a step-by-step video tutorial or an installation manual with step-by-step screenshots? I really need it for my MS research ASAP.

No CRF and gradient clipping produces an error

When setting use_crf = False and turning on gradient clipping, the following error is thrown:

  File "/Users/Felix/Developer/NeuroNER/src/main.py", line 250, in <module>
    main()
  File "/Users/Felix/Developer/NeuroNER/src/main.py", line 245, in main
    nn = NeuroNER(**arguments)
  File "/Users/Felix/Developer/NeuroNER/src/neuroner.py", line 280, in __init__
    model = EntityLSTM(dataset, parameters)
  File "/Users/Felix/Developer/NeuroNER/src/entity_lstm.py", line 214, in __init__
    self.define_training_procedure(parameters)
  File "/Users/Felix/Developer/NeuroNER/src/entity_lstm.py", line 233, in define_training_procedure
    for grad, var in grads_and_vars]
  File "/Users/Felix/Developer/NeuroNER/src/entity_lstm.py", line 233, in <listcomp>
    for grad, var in grads_and_vars]
  File "/usr/local/lib/python3.5/site-packages/tensorflow/python/ops/clip_ops.py", line 55, in clip_by_value
    t = ops.convert_to_tensor(t, name="t")
  File "/usr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 639, in convert_to_tensor
    as_ref=False)
  File "/usr/local/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 704, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 113, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/lib/python3.5/site-packages/tensorflow/python/framework/constant_op.py", line 102, in constant
    tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/usr/local/lib/python3.5/site-packages/tensorflow/python/framework/tensor_util.py", line 360, in make_tensor_proto
    raise ValueError("None values not supported.")
ValueError: None values not supported.

This happens as the gradient in the grads_and_vars variable for the CRF layer is None.
A possible workaround is changing the line that sets the gradient clipping to:

grads_and_vars = [(tf.clip_by_value(grad, -parameters['gradient_clipping_value'], parameters['gradient_clipping_value']), var)
                  for grad, var in grads_and_vars if grad is not None]

Nevertheless I'm not sure if that's a valid workaround or if it will break the model somehow...

FileNotFoundError with conll_output_filepath

I'm trying to follow the steps from the README to run NeuroNER using the default parameters.ini settings. I'm running into a FileNotFoundError at train.py, line 95.

I'm new to python, but will try to track down the source of the issue. It looks like maybe the file should've been created by this line, but it's not clear to me why.

This is on ubuntu 16.04, python 3.5.2. Any advice on how to debug this issue?

Here's partial output from running python main.py:

Starting epoch 0
Training completed in 0.00 seconds
Evaluate model on the train set
Traceback (most recent call last):
  File "main.py", line 445, in <module>
    main()
  File "main.py", line 392, in main
    y_pred, y_true, output_filepaths = train.predict_labels(sess, model, transition_params_trained, parameters, dataset, epoch_number, stats_graph_folder, dataset_filepaths)
  File "/home/user/neuroner/neuroner/src/train.py", line 113, in predict_labels
    prediction_output = prediction_step(sess, dataset, dataset_type, model, transition_params_trained, stats_graph_folder, epoch_number, parameters, dataset_filepaths)
  File "/home/user/neuroner/neuroner/src/train.py", line 95, in prediction_step
    with open(conll_output_filepath, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../output/en_2017-07-05_22-35-05-549137/000_train.txt_conll_evaluation.txt'

Newly added Data set format

Is it necessary for the newly added dataset to be in either CoNLL-2003 or BRAT format? Will a simple Amazon review dataset file work fine? If not, kindly share a method for conversion into the required format. And what if I have a single dataset file that is not divided into 3 files (training, validation, and test set)? Is that okay?

Theory behind network architecture

I was wondering if you had an accompanying paper for this, or if it was inspired by a paper. I'm trying to figure out the rationale for the architecture choice. For example, why the CRF layer?

Preload nlp_core for better performance when deploying

When using NeuroNER, I noticed that loading the nlp_core (spaCy in my case) every time new text is predicted makes the performance go down significantly compared to loading the core at initialization.

The brat_to_conll function could be rewritten to get much better performance if it didn't have to reload the core_nlp.
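For illustration, this is the general pattern I mean (not the actual NeuroNER code, just a sketch): load the spaCy pipeline once and reuse it for every text.

import spacy

# Load the pipeline once, at initialization time.
nlp = spacy.load('en')

def text_to_sentences(text, nlp=nlp):
    # Reuse the already-loaded pipeline for each new prediction.
    return [[token.text for token in sentence] for sentence in nlp(text).sents]

print(text_to_sentences("NeuroNER performs named-entity recognition. It uses TensorFlow."))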

'Proper' way to deploy a pre-trained model on unannotated examples

I can see from #31 and #5 that model/loading and prediction were decoupled, but I am still having trouble deploying a pre-trained model quickly.

In my parameters.ini file, I set:

[mode]
# At least one of use_pretrained_model and train_model must be set to True.
train_model = False
use_pretrained_model = True
pretrained_model_folder = ../trained_models/speciesClassifier

And then call python main.py but still, deployment of this pre-trained model takes a very long time (>15 min) on only a few abstracts (~10).

Am I making an obvious mistake? Thanks very much in advance.

Bad conversion to brat (still don't know the reason)

After running a deploy I got this result in a CoNLL file; my deploy files were in BRAT format.
The expression:

M. Barsacq (Jean-Claude), secrétaire général du syndicat général des fabricants d'huile et de tourteaux de France, 118, avenue Achille-Peretti, 92200 Neuilly-sur-Seine.

Then the resulting CoNLL:

118 JORFARTI000000970382 1540 1543 B-__Adresse_Complete__ B-__Adresse_Complete__

, JORFARTI000000970382 1543 1544 B-__Adresse_Complete__ B-__Adresse_Complete__

But this bug produces a bad entity in the .ann BRAT file:

T13	--Adresse-Complete-- 1540 1543	118
T14	--Adresse-Complete-- 1543 1592	, avenue Achille-Peretti, 92200 Neuilly-sur-Seine

But spaCy:

>>> import spacy
>>> nlp = spacy.load("fr")
>>> doc = nlp("M.Barsacq (Jean-Claude), secrétaire général du syndicat général des fabricants d'huile et de tourteaux de France, 118, avenue Achille-Peretti, 92200 Neuilly-sur-Seine.")
>>> for e in doc.sents:
...     print(str(e)+"\n")
...
M.Barsacq (Jean-Claude), secrétaire général du syndicat général des fabricants d'huile et de tourteaux de France, 118, avenue Achille-Peretti, 92200 Neuilly-sur-Seine.

So only one sentence.

I will try to find it.

Load word2vec pretrained model from gensim

Hi,

I would like to use NeuroNER with word embeddings (word2vec) which were trained with gensim.
I was trying to change the code, but I think you are more aware of what has to be modified.

Could you help me?

Great framework, by the way!

Does the GPU really improve the speed?

I installed tensorflow-gpu on Ubuntu, and it can run both on GPU and CPU at the same time.
But the time per epoch doesn't decrease. What can I do to improve the efficiency?

Thx!

Why do you use final_states?

In entity_lstm.py, you get two return values from bidirectional_dynamic_rnn:

  1. outputs

  2. final_states = (c_states, h_states)

Why do you use h_states? I don't understand the difference between the h_states in final_states and the last output in outputs.

An error occurs when I use the i2b2 dataset

Hi,
I have tried the CoNLL-2003 data and everything works well.
But when I try to use i2b2 dataset, an error occurs. The log is:
Traceback (most recent call last):

  File "main.py", line 250, in <module>
    main()
  File "main.py", line 246, in main
    nn.fit()
  File "/home/ubuntu/hang/NeuroNER/src/neuroner.py", line 395, in fit
    evaluate.evaluate_model(results, dataset, y_pred, y_true, stats_graph_folder, epoch_number, epoch_start_time, output_filepaths, parameters)
  File "/home/ubuntu/hang/NeuroNER/src/evaluate.py", line 239, in evaluate_model
    verbose=verbose)
  File "/home/ubuntu/hang/NeuroNER/src/evaluate.py", line 17, in assess_model
    classification_report = sklearn.metrics.classification_report(y_true, y_pred, labels=labels, target_names=target_names, sample_weight=None, digits=4)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/metrics/classification.py", line 1428, in classification_report
    for v in (np.average(p, weights=s),
  File "/usr/local/lib/python3.5/dist-packages/numpy/lib/function_base.py", line 1140, in average
    "Weights sum to zero, can't be normalized")
ZeroDivisionError: Weights sum to zero, can't be normalized

It seems like the weights are all zero when the function np.average() is called.
Thanks!

Any way to provide Lexicons

Hey guys,

First, thanks for this incredible software it is really easy to use and tune for any particular task.
I was wondering if there is any way to provide lexicons on top of the word-level embeddings, such as in https://arxiv.org/pdf/1511.08308.pdf

Do you think it would improve the results in the case where we have a restricted annotated dataset?

Many thanks,
Nicolas

Load dataset on the new version - memory and time

Hi,

I am using my own dataset with NeuroNER and it is larger than CoNLL2003.

Since commit df59495019f758d647fbde9d0084a286b3a4e99a, the load_dataset() method is taking much more time (1 hour) than in the previous version (only 50 seconds max).

This line seems to be the issue:

character_indices[dataset_type].append([[character_to_index.get(character, random.randint(1, max(self.index_to_character.keys()))) for character in token] for token in token_sequence])

Example on conll2003 :

old_version => Load dataset... done (23.12 seconds)
new_version => Load dataset... done (53.74 seconds)

It goes bigger on larger dataset.

Am I the only one concerned by this?
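For what it's worth, here is a toy sketch of what I suspect (just an assumption, not an actual patch): because .get() evaluates its default argument on every call, max() is recomputed for every single character, and hoisting it out of the loop avoids that.

import random

# Toy stand-ins for the dataset's mappings (illustrative only).
index_to_character = {1: 'a', 2: 'b', 3: 'c'}
character_to_index = {c: i for i, c in index_to_character.items()}
token_sequence = ['abc', 'xyz']

# Compute the fallback upper bound once, instead of once per character.
max_character_index = max(index_to_character.keys())
character_indices = [
    [character_to_index.get(ch, random.randint(1, max_character_index)) for ch in token]
    for token in token_sequence]
print(character_indices)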

Separate the model and tensorflow initialization from its application for predictions

Thanks for providing NeuroNER, it is very nice. I trained a custom model based on some BRAT annotations, and was able to use this model to predict annotations for a new document.

From a user perspective, it would be very useful to separate the model and tensorflow initialization from its application for annotation predictions.

Currently, everything happens in the main() method in the main.py file. I am thinking of a use case where NeuroNER could provide a method like predict_annotations(doc), where given a string, it would apply a loaded model and return the BRAT annotation format. Then NeuroNER could be easily integrated into other scientific workflows or pipelines. Are there any plans for such a direction?

EDIT:
On line 270 of main.py, I found a method similar to what I had in mind:

y_pred, y_true, output_filepaths = train.predict_labels(sess, model, transition_params_trained, parameters, dataset, epoch_number, stats_graph_folder, dataset_filepaths)

Of course, sess and the other objects are all created in main(). It seems to me that the main() method could be refactored into an initialization component that sets up and returns sess, model and the other objects, plus subsequent evaluation and output components, to facilitate pipeline use, but this is just a guess from looking at the code.

Does NeuroNER support nested NER models?

I am annotating a corpus to build models with NeuroNER. Some entities appear to belong to two categories, like the following:

[Sri Lanka]'s West coast: belongs to NORP in spaCy terminology
[Sri Lanka's West coast]: may belong to LOCATION, again in spaCy terminology

Can I use 'nested tagging' and set it as belonging to both categories? I mean can NeuroNER understand such tagging?

loading saved models

Hello Franck,
First of all thanks for providing this great tool!

I found that saving the model to disk gave me 3 files (model_000.ckpt.index, model_000.ckpt.meta and model_000.ckpt.data-00000-of-00001) instead of a single model_000.ckpt. I had to add the line model_saver = tf.train.import_meta_graph(pretrained_model_checkpoint_filepath + '.meta')
just before model_saver.restore(sess, pretrained_model_checkpoint_filepath)
on lines 130 and 140 of train.py.
This is just a trick explained here.
I also saw that you changed the trained_model_checkpoint_filepath parameter in an update, but left it unchanged in the pretrained model. But I think it's not a problem.

Thanks again for this Repo!
It makes it really easy to train and tune a good model!

How to train models?

#13 was helpful for using the pre-trained models. However, I would like to train my own models. As a first step, I removed all the existing models:

# cd trained_models/
# ls
conll_2003_en		     i2b2_2014_glove_stanford_bioes  mimic_glove_stanford_bioes
i2b2_2014_glove_spacy_bioes  mimic_glove_spacy_bioes	     performances.md
# rm -rf *

Then I trained a model using data/conll2003/en/ (identical data, default parameters.ini):

# python3 main.py --maximum_number_of_epochs=1 --token_pretrained_embedding_filepath=""

I then see a model in ../output/en_2017-07-26_22-06-40-510129/model. However, I get this error when trying to use the model:

# ls ../output/en_2017-07-26_22-06-40-510129/model/
checkpoint				     model_00001.ckpt.meta
dataset.pickle				     parameters.ini
events.out.tfevents.1501106802.114dd5c0c94c  projector_config.pbtxt
model_00001.ckpt.data-00000-of-00001	     tensorboard_metadata_characters.tsv
model_00001.ckpt.index			     tensorboard_metadata_tokens.tsv
# python3 main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../output/en_2017-07-26_22-06-40-510129/model/
NeuroNER version: 1.0-dev
TensorFlow version: 1.2.1
{'character_embedding_dimension': 25,
 'character_lstm_hidden_state_dimension': 25,
 'check_for_digits_replaced_with_zeros': 1,
 'check_for_lowercase': 1,
 'dataset_text_folder': '../data/example_unannotated_texts',
 'debug': 0,
 'dropout_rate': 0.5,
 'experiment_name': 'test',
 'freeze_token_embeddings': 0,
 'gradient_clipping_value': 5.0,
 'learning_rate': 0.005,
 'load_all_pretrained_token_embeddings': 0,
 'load_only_pretrained_token_embeddings': 0,
 'main_evaluation_mode': 'conll',
 'maximum_number_of_epochs': 100,
 'number_of_cpu_threads': 8,
 'number_of_gpus': 0,
 'optimizer': 'sgd',
 'output_folder': '../output',
 'parameters_filepath': './parameters.ini',
 'patience': 10,
 'plot_format': 'pdf',
 'pretrained_model_folder': '../output/en_2017-07-26_22-06-40-510129/model/',
 'reload_character_embeddings': 1,
 'reload_character_lstm': 1,
 'reload_crf': 1,
 'reload_feedforward': 1,
 'reload_token_embeddings': 1,
 'reload_token_lstm': 1,
 'remap_unknown_tokens_to_unk': 1,
 'spacylanguage': 'en',
 'tagging_format': 'bioes',
 'token_embedding_dimension': 100,
 'token_lstm_hidden_state_dimension': 100,
 'token_pretrained_embedding_filepath': '../data/word_vectors/glove.6B.100d.txt',
 'tokenizer': 'spacy',
 'train_model': 0,
 'use_character_lstm': 1,
 'use_crf': 1,
 'use_pretrained_model': 1,
 'verbose': 0}
Formatting deploy set from BRAT to CONLL... Done.
Converting CONLL from BIO to BIOES format... Done.
Load dataset... done (19.13 seconds)
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1139, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ../output/en_2017-07-26_22-06-40-510129/model/model.ckpt
	 [[Node: save/RestoreV2_23 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_23/tensor_names, save/RestoreV2_23/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 250, in <module>
    main()
  File "main.py", line 245, in main
    nn = NeuroNER(**arguments)
  File "/NeuroNER/src/neuroner.py", line 285, in __init__
    self.transition_params_trained = model.restore_from_pretrained_model(parameters, dataset, sess, token_to_vector=token_to_vector)
  File "/NeuroNER/src/entity_lstm.py", line 337, in restore_from_pretrained_model
    self.saver.restore(sess, pretrained_model_checkpoint_filepath) # Works only when the dimensions of tensor variables are matched.
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/saver.py", line 1548, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ../output/en_2017-07-26_22-06-40-510129/model/model.ckpt
	 [[Node: save/RestoreV2_23 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_23/tensor_names, save/RestoreV2_23/shape_and_slices)]]

Caused by op 'save/RestoreV2_23', defined at:
  File "main.py", line 250, in <module>
    main()
  File "main.py", line 245, in main
    nn = NeuroNER(**arguments)
  File "/NeuroNER/src/neuroner.py", line 278, in __init__
    model = EntityLSTM(dataset, parameters)
  File "/NeuroNER/src/entity_lstm.py", line 216, in __init__
    self.saver = tf.train.Saver(max_to_keep=parameters['maximum_number_of_epochs'])  # defaults to saving all variables
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/saver.py", line 691, in build
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 640, in restore_v2
    dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

NotFoundError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ../output/en_2017-07-26_22-06-40-510129/model/model.ckpt
	 [[Node: save/RestoreV2_23 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_23/tensor_names, save/RestoreV2_23/shape_and_slices)]]

Exception ignored in: <bound method NeuroNER.__del__ of <neuroner.NeuroNER object at 0x7f9eb30e7f60>>
Traceback (most recent call last):
  File "/NeuroNER/src/neuroner.py", line 489, in __del__
    self.sess.close()
AttributeError: 'NeuroNER' object has no attribute 'sess'

I'm not sure what to make of this, since I don't see the need for a checkpoint file when using the pretrained models, and there is no model.ckpt in the pre-trained models (although I did see model.ckpt.index, model.ckpt.data-00000-of-00001, and model.ckpt.meta).

Note: I am using python 3.6. I assumed that this would not break compatibility with 3.5 and have yet to test 3.5.

How to export as `SavedModel`

Thanks for the project!

I want to save a session checkpoint model as a SavedModel, but I get an error. How can I add the missing values?

import tensorflow as tf
from pathlib import Path

if __name__ == '__main__':
    sess = tf.Session()
    sess_dir_path = Path("output/en_2017-07-17_13-50-14-257934/model").resolve()
    export_dir_path = Path("output/en_2017-07-17_13-50-14-257934/model2")
    model_file_path = sess_dir_path.joinpath('model_00042.ckpt.meta')
    saver = tf.train.import_meta_graph(str(model_file_path))
    saver.restore(sess, tf.train.latest_checkpoint(sess_dir_path))
    graph = tf.get_default_graph()
    builder = tf.saved_model.builder.SavedModelBuilder(str(export_dir_path))
    init_g = tf.global_variables_initializer()
    init_l = tf.local_variables_initializer()
    with tf.Session(graph=graph) as sess:
        builder.add_meta_graph_and_variables(sess, [])
        sess.run(init_g)
        sess.run(init_l)
    builder.save()
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1139, in _do_call
    return fn(*args)
  File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.6/contextlib.py", line 89, in __exit__
    next(self.gen)
  File "/usr/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value character_embedding/character_embedding_weights
         [[Node: character_embedding/character_embedding_weights/_0 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10_character_embedding/character_embedding_weights", _device="/job:localhost/replica:0/task:0/gpu:0"](character_embedding/character_embedding_weights)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/amy/saved_model/__main__.py", line 16, in <module>
    builder.add_meta_graph_and_variables(sess, [])
  File "/usr/lib/python3.6/site-packages/tensorflow/python/saved_model/builder_impl.py", line 362, in add_meta_graph_and_variables
    saver.save(sess, variables_path, write_meta_graph=False, write_state=False)
  File "/usr/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1488, in save
    raise exc
  File "/usr/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1472, in save
    {self.saver_def.filename_tensor_name: checkpoint_file})
  File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/usr/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value character_embedding/character_embedding_weights
         [[Node: character_embedding/character_embedding_weights/_0 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_10_character_embedding/character_embedding_weights", _device="/job:localhost/replica:0/task:0/gpu:0"](character_embedding/character_embedding_weights)]]

Using pre-trained model example not working

Hi,

Thanks a lot for the model and the code! They are very useful.

I'm trying to re-use the conll-2003 pre-trained model like in the example, using the example files in the same folder path (..\data\example_unannotated_texts\deploy).

with: dataset_text_folder = ../data/example_unannotated_texts

while the text files to be annotated are in ..\data\example_unannotated_texts\deploy

The output text file is empty, and the loading shows that it found no tokens.

I tried running it from dataset_text_folder = ../data/example_unannotated_texts/deploy instead, but then I get an error message (assertion error) saying that the tag is not 'O' (from the remove BIO function).

I also get an error message from spaCy (asking to download the 'en' data first, which I did a few times already) the first time I run the code on the data to annotate. If I run it a second time, it then runs, but the spaCy file created during the first run is empty (which I believe is the problem).

Thanks for your help!
Yoann

Discontinuous annotations (Brat 1.3)

I recently discovered Brat's (v1.3) support for discontinuous annotations. These can be created intentionally by editing an existing annotation and clicking the 'Add Frag.' button. They also seem to be created, at least sometimes, when an annotation is interrupted by a newline. brat_to_conll.py doesn't expect this.

For a .txt file that begins

Lorem ipsum dolor

A discontinuous annotation spanning "Lorem" and "dolor" results in an .ann file formatted

T1	Org 0 5;12 17	Lorem dolor

This .ann file will lead to an error in brat_to_conll.py after the line is .split() and the third element, 5;12, is passed to int() as the annotation's end position.

One way to handle this would be to check lines for more than one start-end position pair, and break apart multiple pairs - moving them to their own lines and duplicating the entity label. This would work for my case. (I'd be happy to submit a PR.) Is it a general-purpose solution?
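For example, here is a quick sketch of the splitting I have in mind (illustration only, not the current brat_to_conll.py code; the derived fragment IDs would need to be renumbered into fresh T identifiers):

# A discontinuous BRAT annotation with two fragments, and the document text.
ann_line = "T1\tOrg 0 5;12 17\tLorem dolor"
text = "Lorem ipsum dolor"

ann_id, label_and_spans, surface = ann_line.split('\t')
label, spans = label_and_spans.split(' ', 1)

# Emit one line per fragment, duplicating the entity label.
for i, span in enumerate(spans.split(';'), start=1):
    start, end = (int(x) for x in span.split())
    print(f"{ann_id}-{i}\t{label} {start} {end}\t{text[start:end]}")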

algorithm to plot token confusion matrix and classification report

Hi,

I was using another ML platform on a NER project. I found that the utilities you used being quite helpful, such as BRAT and confusion matrix.

Plotting the BIO confusion matrix is straightforward, but how do you process the result to plot a token confusion matrix? Because the number of items in the two lists would almost always be different and often would not describe the same word.

Using token embeddings in word2vec binary format

@Franck-Dernoncourt
Quick question,

I am just wondering if it is possible to use token embeddings provided in the word2vec binary format. In the readme, a link to token embeddings saved as plain text is provided, but there is no mention of whether NeuroNER supports token embeddings in other formats.

Thanks very much in advance!

TensorBoard Graphs were not visible

When I tried to run TensorBoard, graphs were not visible in the Graphs tab (maybe I am using it wrong). I have attached the screenshots below.

Plan to add a licence?

First of all, I really appreciate all the work that went into NeuroNER, and how user-friendly you guys have been able to make it.

Are there any plans to add a licence to the repo? I would love to fully commit to using NeuroNER for a non-commercial, academic project, but I do not want to pour more time and effort into the tool unless it will be (officially) open-source and free to use.

I know this question has already been raised (#15), but as it was not answered I thought I would submit a new issue.

Thanks again for creating this awesome tool!

Only predicting "O", also on provided examples

python3.5 main.py --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en - it just yields "O's"

Output:

NeuroNER version: 1.0-dev
TensorFlow version: 1.1.0
NeuroNER version: 1.0-dev
TensorFlow version: 1.1.0
{'character_embedding_dimension': 25,
 'character_lstm_hidden_state_dimension': 25,
 'check_for_digits_replaced_with_zeros': 1,
 'check_for_lowercase': 1,
 'dataset_text_folder': '../data/example_unannotated_texts',
 'debug': 0,
 'dropout_rate': 0.5,
 'experiment_name': 'test',
 'freeze_token_embeddings': 0,
 'gradient_clipping_value': 5.0,
 'learning_rate': 0.005,
 'load_only_pretrained_token_embeddings': 0,
 'main_evaluation_mode': 'conll',
 'maximum_number_of_epochs': 100,
 'number_of_cpu_threads': 8,
 'number_of_gpus': 0,
 'optimizer': 'sgd',
 'output_folder': '../output',
 'parameters_filepath': './parameters.ini',
 'patience': 10,
 'plot_format': 'pdf',
 'pretrained_model_folder': '../trained_models/conll_2003_en',
 'reload_character_embeddings': 1,
 'reload_character_lstm': 1,
 'reload_crf': 1,
 'reload_feedforward': 1,
 'reload_token_embeddings': 1,
 'reload_token_lstm': 1,
 'remap_unknown_tokens_to_unk': 1,
 'spacylanguage': 'en',
 'tagging_format': 'bioes',
 'token_embedding_dimension': 100,
 'token_lstm_hidden_state_dimension': 100,
 'token_pretrained_embedding_filepath': '../data/word_vectors/glove.6B.100d.txt',
 'tokenizer': 'spacy',
 'train_model': 0,
 'use_character_lstm': 1,
 'use_crf': 1,
 'use_pretrained_model': 1,
 'verbose': 0}
Formatting deploy set from BRAT to CONLL... Done.
Converting CONLL from BIO to BIOES format... Done.
Load dataset... done (40.78 seconds)

Starting epoch 0
Load token embeddings... done (89.64 seconds)
number_of_token_original_case_found: 94
number_of_token_lowercase_found: 25
number_of_token_digits_replaced_with_zeros_found: 0
number_of_token_lowercase_and_digits_replaced_with_zeros_found: 0
number_of_loaded_word_vectors: 119
dataset.vocabulary_size: 119
Load token embeddings from pretrained model... done (0.22 seconds)
number_of_loaded_vectors: 104
dataset.vocabulary_size: 119
Load character embeddings from pretrained model... done (0.23 seconds)
number_of_loaded_vectors: 58
dataset.alphabet_size: 58
Training completed in 92.45 seconds
Predict labels for the deploy set
Formatting 000_deploy set from CONLL to BRAT... Done.
Finishing the experiment

License?

As you provide no explicit license, that makes this project unusable to anybody. No license is equivalent to "only look, don't touch". Even in a purely academic or research context, it is technically illegal to use code from a repo with no license.

What should be the max epoch_number for training?

What should the max epoch_number value be to train the model on the first run of main.py? I tried to change it with

python main.py --maximum_number_of_epochs=2 --token_pretrained_embedding_filepath=""

And after 3 or 4 runs the model starts training again. Should it stop at some point?

error running standalone.py

When I run python standalone.py, it gives me this error. Why?
My Python version is 3.6.1.

  File "standalone.py", line 257
    except SystemExit, sts:
                     ^
SyntaxError: invalid syntax

Steps to utilize NeuroNER for other languages

It appears that BRAT at least is pretty language-agnostic. The English-specific parts of NeuroNER (afaict) are the recommended glove.6B.100d word vectors, and all of the spaCy-related tokenizing code, which is used to translate BRAT format into CoNLL format (correct?)

Am I correct that if I:

  1. Supply Korean word vectors in /data/word_vectors
  2. Supply CoNLL-formatted train, valid, and test data using BRAT-labeled Korean text which I run through my own tokenizer

I will be able to train and use NeuroNER for Korean text?

LSTM model is bit unclear to be used for custom data such as spacy training data

It's a good idea to use an LSTM for NER. I have run your engine and I'm very satisfied with the results. I'm working on an NLP project where the data is not in CoNLL-U format but rather in spaCy training data format.

It would be better to have the LSTM architecture independent of the training data format. Also, the model is a bit unclear to me after going through the code. You are using dictionaries for the tokens and characters of the data, but then I got lost in how you configure the model. I want to extend the LSTM model based on spaCy training data so that I can change LSTM parameters such as the number of layers, number of nodes, etc.

It would be better to have a separate LSTM module and a wiki page on how to configure the LSTM.

Tagging of unseen files not working

Dear NeuroNER authors,

I have downloaded and installed the NeuroNER project as well as the necessary dependencies (tensorflow, python3, etc.). Note that I am using python 3.6.

When I try to apply an existing model on the sample of documents provided I have an error:
Command launched:
python3 main.py --parameters_filepath=./parameters.ini --train_model=False --use_pretrained_model=True --dataset_text_folder=../data/example_unannotated_texts --pretrained_model_folder=../trained_models/conll_2003_en

Error:
Traceback (most recent call last):
  File "main.py", line 446, in <module>
    main()
  File "main.py", line 268, in main
    parameters, conf_parameters = load_parameters(arguments['parameters_filepath'], arguments=arguments)
  File "main.py", line 119, in load_parameters
    pretraining_parameters = load_parameters(parameters_filepath=os.path.join(parameters['pretrained_model_folder'], 'parameters.ini'), verbose=False)[0]
  File "main.py", line 119, in load_parameters
    pretraining_parameters = load_parameters(parameters_filepath=os.path.join(parameters['pretrained_model_folder'], 'parameters.ini'), verbose=False)[0]
  File "main.py", line 119, in load_parameters
    pretraining_parameters = load_parameters(parameters_filepath=os.path.join(parameters['pretrained_model_folder'], 'parameters.ini'), verbose=False)[0]
  [Previous line repeated 980 more times]
  File "main.py", line 93, in load_parameters
    nested_parameters = utils.convert_configparser_to_dictionary(conf_parameters)
  File "/home/netmail/neuroNER/NeuroNER-master/src/utils.py", line 105, in convert_configparser_to_dictionary
    my_config_parser_dict = {s:dict(config.items(s)) for s in config.sections()}
  File "/home/netmail/neuroNER/NeuroNER-master/src/utils.py", line 105, in <dictcomp>
    my_config_parser_dict = {s:dict(config.items(s)) for s in config.sections()}
  File "/home/netmail/.local/lib/python3.6/configparser.py", line 858, in items
    return [(option, value_getter(option)) for option in d.keys()]
  File "/home/netmail/.local/lib/python3.6/configparser.py", line 858, in <listcomp>
    return [(option, value_getter(option)) for option in d.keys()]
  File "/home/netmail/.local/lib/python3.6/configparser.py", line 855, in <lambda>
    section, option, d[option], d)
  File "/home/netmail/.local/lib/python3.6/configparser.py", line 394, in before_get
    self._interpolate_some(parser, option, L, value, section, defaults, 1)
  File "/home/netmail/.local/lib/python3.6/configparser.py", line 407, in _interpolate_some
    rawval = parser.get(section, option, raw=True, fallback=rest)
  File "/home/netmail/.local/lib/python3.6/configparser.py", line 781, in get
    d = self._unify_values(section, vars)
  File "/home/netmail/.local/lib/python3.6/configparser.py", line 1149, in _unify_values
    return _ChainMap(vardict, sectiondict, self._defaults)
  File "/home/netmail/vens/tensorflow/lib/python3.6/collections/__init__.py", line 874, in __init__
    self.maps = list(maps) or [{}]  # always at least one map
RecursionError: maximum recursion depth exceeded while calling a Python object

It seems that it fails to load the parameters from the config file.

How to use pretrained models for prediction? Is there a tutorial?

I have been trying for 2 days with little progress. It seems that the tool works only with this specific format:

Japan NNP B-NP B-LOC
began VBD B-VP O
the DT B-NP O
defence NN I-NP O
of IN B-PP O
their PRP$ B-NP O
Asian JJ I-NP B-MISC
Cup NNP I-NP I-MISC
...

Ideally, the input for the prediction mode should be plain text without any format requirement. I've tried to use spaCy to convert plain text to the format. This is what I got so far (no idea how to generate the 3rd column):

Google NNP B-PROPN O
's NNP B-PROPN O
second JJ B-ADJ O
generation NN B-NOUN O
TPU NNP B-PROPN O
chips NNS B-NOUN O
...

And it seems that the result actually depends on the supplied label (the 4th column, contrary to what's said in the documentation), but I'm probably missing something here...
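Here is roughly what I have been trying with spaCy (the POS and chunk columns are placeholders; I'm assuming only the token and label columns actually matter at prediction time, which may be wrong):

import spacy

nlp = spacy.load('en')
doc = nlp("Google's second generation TPU chips were announced today.")

for sentence in doc.sents:
    for token in sentence:
        # token, fine-grained POS tag, placeholder chunk tag, placeholder label
        print(token.text, token.tag_, 'O', 'O')
    print()  # blank line between sentences, as in CoNLL-2003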

Training a model uses a large amount of memory

I just started to train on a 40 MB text file and the amount of RAM increased too much... I had to kill the application; it was consuming 15 GB of RAM and 15 GB of swap. Is that normal?

[Feature Request] CNN for char-level representation and mini-batch training

Hi Franck,

Thanks so much for this brilliant NER framework!

I found that many new papers using neural architectures for NER consider using a CNN instead of a BiLSTM to encode the character-level word representation, and they achieve better results (Ma and Hovy 2016).

Also, I found that there seems to be no mini-batch training here for now, which I believe would speed up the training process very much.

Unable to provide parameters to prepare_pretrained_model.py

Currently, it is not possible to provide the parameters to prepare_pretrained_model.py via a config file or command line arguments (unless I am mistaken!). Thus, the user has to modify the code in order to use this script to save their own pre-trained models.

Should I add this feature and make a pull request?

Running pre-trained model on full vs chunked dataset

I'm trying to run NeuroNER with the pretrained CoNLL 2003 model to tag a corpus of 2 million documents, but I get memory errors. I see there's a PR open to parallelize this, but I have a question:

Assuming it's possible to load/tag all 2 million documents in one go, is there a difference between running that vs breaking the corpus up into chunks of 10k documents and running on each chunk separately (my current workaround for the memory issues)?

I ask this because this is the output I see:

Load dataset... done (152.69 seconds)
Load token embeddings... done (0.22 seconds)
number_of_token_original_case_found: 29059
number_of_token_lowercase_found: 19367
number_of_token_digits_replaced_with_zeros_found: 115
number_of_token_lowercase_and_digits_replaced_with_zeros_found: 0
number_of_loaded_word_vectors: 48541
dataset.vocabulary_size: 48558
Load token embeddings from pretrained model... done (0.11 seconds)
number_of_loaded_vectors: 14591
dataset.vocabulary_size: 48558
Load character embeddings from pretrained model... done (0.07 seconds)
number_of_loaded_vectors: 86
dataset.alphabet_size: 131

and I'm wondering about those statistics (which are based on a chunk of 10k documents, not all 2 million); obviously these values will change with each new chunk. I would think that this would only matter if I were training a new model, but since I'm not, I'm expecting the output to be the same whether I tag a corpus of 1 document vs 10k documents vs 2 million documents at once.

Using NeuroNer for transfer Learning

Hello Frank,
I saw you contributed to another article based on this code, about transfer learning in the NER context. Is the mentioned extension already included in the repo?

I want to train a new entity type (Job Title), and I have a large training set for this entity (30K sentences), but it has been automatically generated and is quite noisy. I also have a smaller test set (5k sentences) where I have manually annotated the job titles. I want to learn on the noisy data first and then transfer to the manually annotated data. Is transfer learning relevant in this context?

I would like to know how to configure NeuroNER to perform transfer on this new dataset.
Is it simply training a pre-trained model further, with a higher learning rate and fewer epochs?
How do you restrict the parameter transfer to specific layers only?

Thanks for your help.
BR
Armand

IndexError: list index out of range

I am getting these warnings while running NeuroNER:

epoch_elapsed_training_time: 331.644915 seconds
assess_model on dataset_type: train
C:\Users\erame\AppData\Local\Programs\Python\Python35\lib\site-packages\matplotlib\artist.py:233: MatplotlibDeprecationWarning: get_axes has been deprecated in mpl 1.5, please use the axes property. A removal date has not been set.
  stacklevel=1)
C:\Users\erame\AppData\Local\Programs\Python\Python35\lib\site-packages\sklearn\metrics\classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
C:\Users\erame\AppData\Local\Programs\Python\Python35\lib\site-packages\sklearn\metrics\classification.py:1115: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)
assess_model on dataset_type: valid
assess_model on dataset_type: test
shell_command: perl .\conlleval < ..\output\en_2017-04-21_07-36-25-322798\000_014986_train.txt > ..\output\en_2017-04-21_07-36-25-322798\000_014986_train.txt_conll_evaluation.txt
'perl' is not recognized as an internal or external command, operable program or batch file.
Traceback (most recent call last):
  File "main.py", line 442, in <module>
    main()
  File "main.py", line 383, in main
    conll_parsed_output = utils_nlp.get_parsed_conll_output(conll_output_filepath)
  File "C:\Users\erame\AppData\Local\Programs\Python\Python35\NeuroNER-master\src\utils_nlp.py", line 42, in get_parsed_conll_output
    line = conll_output[1].split()
IndexError: list index out of range

How to deploy NeuroNER as a service?

Hi!
Since every run of NeuroNER on raw text takes a long time to finish, I think it would be good to run it as a RESTful service. What should I do to make that happen?
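For instance, a minimal sketch of what I have in mind, assuming the NeuroNER class exposes a predict(text) method (check the installed version): the model is loaded once at startup and reused for every request.

from flask import Flask, request, jsonify
from neuroner import neuromodel

app = Flask(__name__)

# Load the model once, when the service starts.
nn = neuromodel.NeuroNER(train_model=False, use_pretrained_model=True)

@app.route('/ner', methods=['POST'])
def ner():
    text = request.get_json(force=True)['text']
    return jsonify(nn.predict(text))  # reuse the already-initialized model

if __name__ == '__main__':
    app.run(port=5000)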

i2b2 data

Not sure if this is the right forum. The i2b2 dataset most closely fits my intended usage of NeuroNER (identifying documents with private information, for a privacy application).

I emailed [email protected] with a signed data use agreement, then followed up with another email in June this year. She has not replied.

Do you have any new advice on how to obtain the i2b2 data?

Using NeuroNER with Brat with custom annotations

Hi Franck and all the rest of you!
First, thanks for all the great work with this.

I am trying to get started with using NeuroNER on a dataset I have annotated with BRAT, within a specific domain (music), so the entities are all custom. And, I want to add this dataset, to NeuroNER, by adding the BRAT-files (the txt-file and the ann-file), and then train on that. But I am not getting anywhere.

As I have understood it from your docs, I should put both the train.txt and the train.ann in a folder and then point to that in parameters.ini, but I guess I misunderstood, because I'm not making progress...

Could you offer some guidance on how to get started with this?

Best regards,
Robert

Questions about ANN

Hi !
Is it possible for NeuroNER to learn from vocabulary and not just expressions? Let me explain my idea.
On your video (https://www.youtube.com/watch?v=BmRYkxumDvU) we can see token/word embeddings.
Can NeuroNER learn more from them?
Example: I have 30,000 different entities to learn (for example cities or universities). I can't give it 30,000 different expressions; even 10% of the entities, 3,000, is a lot (if done by a human).
Let's say I can generate 1,000-2,000 different expressions that will represent my valid set. I will miss a lot of the information given by the word vectors.
But if I complete the training with a bit of vocabulary (like 40% of the 30,000), I will have enough sampling to find the other entities.
Is this possible?

Labelling the same entity more than once?

Is there any way to train the model such that it will label the same entity with multiple classes? For example here, you can see some entities are labeled as both genes and proteins.

If this is possible, how would I go about doing it? And if not, consider this a feature request! ;)

Which evaluation mode should I use?

I want to train on a toy dataset. This dataset is not a named-entity recognition dataset, but it is labeled in BIO format.
For example: B-TIME, B-DATE, B-EVENT ... like this.

Which evaluation mode should I use? BIO or token?

Loading token embedding from pretrained model for prediction

I'm not sure whether this is true or not, but after inspecting the code and trying some different initial deploy sets,
it seems like in the deploy part the token embedding loading step only loads tokens that are in the (deploy) dataset at initialization, and ignores the other new tokens in the separate prediction step. Therefore, if the initial deploy set is not large enough, it will not load the full token embedding set from the pretrained model, and as a result the prediction results will be pretty bad.
I think it would be better if the token embeddings could be loaded completely without depending on the initial dataset. This could be solved by using the whole training data to initialize when deploying.
Sorry if I misunderstood the code.

The output file is not created.

Hi. First of all, I will explain the steps I have taken for unannotated texts.

The dataset folder now contains a deploy folder with phrase.txt. No other files are included. In that case, when I run main.py, I get an error that train.txt is not found (FileNotFoundError).

If I include train, test and valid files in the same folder, it runs, but I am not getting the expected output.

As per the instructions given, I have changed these parameters in the parameters.ini file in src:
train_model=False
use_pretrained_model=True
pretrained_model_folder=../trained_models/conll_2003_en
dataset_text_folder=../data/dataset

use_character_lstm=True
character_embedding_dimension=25
character_lstm_hidden_state_dimension=25
token_pretrained_embedding_filepath=../data/word_vectors/glove.6B.100d.txt
token_embedding_dimension=100
token_lstm_hidden_state_dimension=100

use_crf=True
tagging_format=bioes
tokenizer=spacy

Please help me find out what mistake I am making.
