huggingface / neuralcoref Goto Github PK

✨Fast Coreference Resolution in spaCy with Neural Networks

Home Page: https://huggingface.co/coref/

License: MIT License

Python 33.91% Shell 0.26% Perl 25.45% C 40.38%

python machine-learning coreference spacy coreference-resolution spacy-extension spacy-pipeline nlp neural-networks pytorch

neuralcoref's Introduction

✨NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks.

NeuralCoref is a pipeline extension for spaCy 2.1+ which annotates and resolves coreference clusters using a neural network. NeuralCoref is production-ready, integrated in spaCy's NLP pipeline and extensible to new training datasets.

For a brief introduction to coreference resolution and NeuralCoref, please refer to our blog post. NeuralCoref is written in Python/Cython and comes with a pre-trained statistical model for English only.

NeuralCoref is accompanied by a visualization client NeuralCoref-Viz, a web interface powered by a REST server that can be tried online. NeuralCoref is released under the MIT license.

✨ Version 4.0 out now! Available on pip and compatible with SpaCy 2.1+.

Operating system: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
Python version: Python 3.6+ (only 64 bit)
Package managers: [pip]

Install NeuralCoref

Install NeuralCoref with pip

This is the easiest way to install NeuralCoref.

pip install neuralcoref

`spacy.strings.StringStore size changed` error

If you have an error mentioning spacy.strings.StringStore size changed, may indicate binary incompatibility when loading NeuralCoref with import neuralcoref, it means you'll have to install NeuralCoref from the distribution's sources instead of the wheels to get NeuralCoref to build against the most recent version of SpaCy for your system.

In this case, simply re-install neuralcoref as follows:

pip uninstall neuralcoref
pip install neuralcoref --no-binary neuralcoref

Installing SpaCy's model

To be able to use NeuralCoref you will also need to have an English model for SpaCy.

You can use whatever english model works fine for your application but note that the performances of NeuralCoref are strongly dependent on the performances of the SpaCy model and in particular on the performances of SpaCy model's tagger, parser and NER components. A larger SpaCy English model will thus improve the quality of the coreference resolution as well (see some details in the Internals and Model section below).

Here is an example of how you can install SpaCy and a (small) English model for SpaCy, more information can be found on spacy's website:

pip install -U spacy
python -m spacy download en

Install NeuralCoref from source

You can also install NeuralCoref from sources. You will need to install the dependencies first which includes Cython and SpaCy.

Here is the process:

venv .env
source .env/bin/activate
git clone https://github.com/huggingface/neuralcoref.git
cd neuralcoref
pip install -r requirements.txt
pip install -e .

Internals and Model

NeuralCoref is made of two sub-modules:

a rule-based mentions-detection module which uses SpaCy's tagger, parser and NER annotations to identify a set of potential coreference mentions, and
a feed-forward neural-network which compute a coreference score for each pair of potential mentions.

The first time you import NeuralCoref in python, it will download the weights of the neural network model in a cache folder.

The cache folder is set by defaults to ~/.neuralcoref_cache (see file_utils.py) but this behavior can be overided by setting the environment variable NEURALCOREF_CACHE to point to another location.

The cache folder can be safely deleted at any time and the module will download again the model the next time it is loaded.

You can have more information on the location, downloading and caching process of the internal model by activating python's logging module before loading NeuralCoref as follows:

import logging;
logging.basicConfig(level=logging.INFO)
import neuralcoref
>>> INFO:neuralcoref:Getting model from https://s3.amazonaws.com/models.huggingface.co/neuralcoref/neuralcoref.tar.gz or cache
>>> INFO:neuralcoref.file_utils:https://s3.amazonaws.com/models.huggingface.co/neuralcoref/neuralcoref.tar.gz not found in cache, downloading to /var/folders/yx/cw8n_njx3js5jksyw_qlp8p00000gn/T/tmp_8y5_52m
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40155833/40155833 [00:06<00:00, 6679263.76B/s]
>>> INFO:neuralcoref.file_utils:copying /var/folders/yx/cw8n_njx3js5jksyw_qlp8p00000gn/T/tmp_8y5_52m to cache at /Users/thomaswolf/.neuralcoref_cache/f46bc05a4bfba2ae0d11ffd41c4777683fa78ed357dc04a23c67137abf675e14.7d6f9a6fecf5cf09e74b65f85c7d6896b21decadb2554d486474f63b95ec4633
>>> INFO:neuralcoref.file_utils:creating metadata file for /Users/thomaswolf/.neuralcoref_cache/f46bc05a4bfba2ae0d11ffd41c4777683fa78ed357dc04a23c67137abf675e14.7d6f9a6fecf5cf09e74b65f85c7d6896b21decadb2554d486474f63b95ec4633
>>> INFO:neuralcoref.file_utils:removing temp file /var/folders/yx/cw8n_njx3js5jksyw_qlp8p00000gn/T/tmp_8y5_52m
>>> INFO:neuralcoref:extracting archive file /Users/thomaswolf/.neuralcoref_cache/f46bc05a4bfba2ae0d11ffd41c4777683fa78ed357dc04a23c67137abf675e14.7d6f9a6fecf5cf09e74b65f85c7d6896b21decadb2554d486474f63b95ec4633 to dir /Users/thomaswolf/.neuralcoref_cache/neuralcoref

Loading NeuralCoref

Adding NeuralCoref to the pipe of an English SpaCy Language

Here is the recommended way to instantiate NeuralCoref and add it to SpaCY's pipeline of annotations:

# Load your usual SpaCy model (one of SpaCy English models)
import spacy
nlp = spacy.load('en')

# Add neural coref to SpaCy's pipe
import neuralcoref
neuralcoref.add_to_pipe(nlp)

# You're done. You can now use NeuralCoref as you usually manipulate a SpaCy document annotations.
doc = nlp(u'My sister has a dog. She loves him.')

doc._.has_coref
doc._.coref_clusters

Loading NeuralCoref and adding it manually to the pipe of an English SpaCy Language

An equivalent way of adding NeuralCoref to a SpaCy model pipe is to instantiate the NeuralCoref class first and then add it manually to the pipe of the SpaCy Language model.

# Load your usual SpaCy model (one of SpaCy English models)
import spacy
nlp = spacy.load('en')

# load NeuralCoref and add it to the pipe of SpaCy's model
import neuralcoref
coref = neuralcoref.NeuralCoref(nlp.vocab)
nlp.add_pipe(coref, name='neuralcoref')

# You're done. You can now use NeuralCoref the same way you usually manipulate a SpaCy document and it's annotations.
doc = nlp(u'My sister has a dog. She loves him.')

doc._.has_coref
doc._.coref_clusters

Using NeuralCoref

NeuralCoref will resolve the coreferences and annotate them as extension attributes in the spaCy Doc, Span and Token objects under the ._. dictionary.

Here is the list of the annotations:

Attribute	Type	Description
`doc._.has_coref`	boolean	Has any coreference has been resolved in the Doc
`doc._.coref_clusters`	list of `Cluster`	All the clusters of corefering mentions in the doc
`doc._.coref_resolved`	unicode	Unicode representation of the doc where each corefering mention is replaced by the main mention in the associated cluster.
`doc._.coref_scores`	Dict of Dict	Scores of the coreference resolution between mentions.
`span._.is_coref`	boolean	Whether the span has at least one corefering mention
`span._.coref_cluster`	`Cluster`	Cluster of mentions that corefer with the span
`span._.coref_scores`	Dict	Scores of the coreference resolution of & span with other mentions (if applicable).
`token._.in_coref`	boolean	Whether the token is inside at least one corefering mention
`token._.coref_clusters`	list of `Cluster`	All the clusters of corefering mentions that contains the token

A Cluster is a cluster of coreferring mentions which has 3 attributes and a few methods to simplify the navigation inside a cluster:

Attribute or method	Type / Return type	Description
`i`	int	Index of the cluster in the Doc
`main`	`Span`	Span of the most representative mention in the cluster
`mentions`	list of `Span`	List of all the mentions in the cluster
`__getitem__`	return `Span`	Access a mention in the cluster
`__iter__`	yields `Span`	Iterate over mentions in the cluster
`__len__`	return int	Number of mentions in the cluster

Navigating the coreference cluster chains

You can also easily navigate the coreference cluster chains and display clusters and mentions.

Here are some examples, try them out to test it for yourself.

import spacy
import neuralcoref
nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

doc = nlp(u'My sister has a dog. She loves him')

doc._.coref_clusters
doc._.coref_clusters[1].mentions
doc._.coref_clusters[1].mentions[-1]
doc._.coref_clusters[1].mentions[-1]._.coref_cluster.main

token = doc[-1]
token._.in_coref
token._.coref_clusters

span = doc[-1:]
span._.is_coref
span._.coref_cluster.main
span._.coref_cluster.main._.coref_cluster

Important: NeuralCoref mentions are spaCy Span objects which means you can access all the usual Span attributes like span.start (index of the first token of the span in the document), span.end (index of the first token after the span in the document), etc...

Ex: doc._.coref_clusters[1].mentions[-1].start will give you the index of the first token of the last mention of the second coreference cluster in the document.

Parameters

You can pass several additional parameters to neuralcoref.add_to_pipe or NeuralCoref() to control the behavior of NeuralCoref.

Here is the full list of these parameters and their descriptions:

Parameter	Type	Description
`greedyness`	float	A number between 0 and 1 determining how greedy the model is about making coreference decisions (more greedy means more coreference links). The default value is 0.5.
`max_dist`	int	How many mentions back to look when considering possible antecedents of the current mention. Decreasing the value will cause the system to run faster but less accurately. The default value is 50.
`max_dist_match`	int	The system will consider linking the current mention to a preceding one further than `max_dist` away if they share a noun or proper noun. In this case, it looks `max_dist_match` away instead. The default value is 500.
`blacklist`	boolean	Should the system resolve coreferences for pronouns in the following list: `["i", "me", "my", "you", "your"]`. The default value is True (coreference resolved).
`store_scores`	boolean	Should the system store the scores for the coreferences in annotations. The default value is True.
`conv_dict`	dict(str, list(str))	A conversion dictionary that you can use to replace the embeddings of rare words (keys) by an average of the embeddings of a list of common words (values). Ex: `conv_dict={"Angela": ["woman", "girl"]}` will help resolving coreferences for `Angela` by using the embeddings for the more common `woman` and `girl` instead of the embedding of `Angela`. This currently only works for single words (not for words groups).

How to change a parameter

import spacy
import neuralcoref

# Let's load a SpaCy model
nlp = spacy.load('en')

# First way we can control a parameter
neuralcoref.add_to_pipe(nlp, greedyness=0.75)

# Another way we can control a parameter
nlp.remove_pipe("neuralcoref")  # This remove the current neuralcoref instance from SpaCy pipe
coref = neuralcoref.NeuralCoref(nlp.vocab, greedyness=0.75)
nlp.add_pipe(coref, name='neuralcoref')

Using the conversion dictionary parameter to help resolve rare words

Here is an example on how we can use the parameter conv_dict to help resolving coreferences of a rare word like a name:

import spacy
import neuralcoref

nlp = spacy.load('en')

# Let's try before using the conversion dictionary:
neuralcoref.add_to_pipe(nlp)
doc = nlp(u'Deepika has a dog. She loves him. The movie star has always been fond of animals')
doc._.coref_clusters
doc._.coref_resolved
# >>> [Deepika: [Deepika, She, him, The movie star]]
# >>> 'Deepika has a dog. Deepika loves Deepika. Deepika has always been fond of animals'
# >>> Not very good...

# Here are three ways we can add the conversion dictionary
nlp.remove_pipe("neuralcoref")
neuralcoref.add_to_pipe(nlp, conv_dict={'Deepika': ['woman', 'actress']})
# or
nlp.remove_pipe("neuralcoref")
coref = neuralcoref.NeuralCoref(nlp.vocab, conv_dict={'Deepika': ['woman', 'actress']})
nlp.add_pipe(coref, name='neuralcoref')
# or after NeuralCoref is already in SpaCy's pipe, by modifying NeuralCoref in the pipeline
nlp.get_pipe('neuralcoref').set_conv_dict({'Deepika': ['woman', 'actress']})

# Let's try agin with the conversion dictionary:
doc = nlp(u'Deepika has a dog. She loves him. The movie star has always been fond of animals')
doc._.coref_clusters
# >>> [Deepika: [Deepika, She, The movie star], a dog: [a dog, him]]
# >>> 'Deepika has a dog. Deepika loves a dog. Deepika has always been fond of animals'
# >>> A lot better!

Using NeuralCoref as a server

A simple example of server script for integrating NeuralCoref in a REST API is provided as an example in examples/server.py.

To use it you need to install falcon first:

pip install falcon

You can then start the server as follows:

cd examples
python ./server.py

And query the server like this:

curl --data-urlencode "text=My sister has a dog. She loves him." -G localhost:8000

There are many other ways you can manage and deploy NeuralCoref. Some examples can be found in spaCy Universe.

Re-train the model / Extend to another language

If you want to retrain the model or train it on another language, see our training instructions as well as our blog post

neuralcoref's People

Contributors

Stargazers

Watchers

Forkers

jamesvillarrubia amin24e petermartigny user01 mydp2017 liwzhi maxwellrebo fancycheung zhangxt yanghaha11514 pustar vseledkin ml-lab zhaoerchao akshayjh parthshah86 hailiang-wang mindis zhangjiulong leezqcst goodrahstar joshmeek-old research-machine-learning cuihengbin mikalv rahular r-wheeler bharath45 timtutt lihongweimail rekonder rap9430 changfengfeng tornadozou ishalyminov rajendramishra18 sreenivas134 ravenscroftj cosecant-csc zxsted ghosthamlet abhiskaushik cooravi arborlaureus markpopovich albertusk95 hitxujian kamalkashyap13 yugrocks lucaslingle prathapreddymv tongshuangwu zdong1 rachesh abinj vmath89 kevd1337 rukor cmerwich nianxiaohu wayneouyang unendin karlqu1990 elizabeth2507 ivmreg mateuszkuba carlgieringer thomwolf jlee24282 pwichmann kaliek noahkim11 bagonzalez thesage21 lxwithgod sourish-rygbee qq547276542 ryfan-rs matt-stevenson ruze00 saulhoward geraldinelemeur howl-anderson mmatlacz snazz2001 kevark adityagurjar08 marouangit dnzengou vangoran themandunord phlloh sbutner sahilbadyal keruhua dterg snehha henghuiz-zz abdallahsobhy zihaow21

neuralcoref's Issues

Cuda error

Getting this error when running learn.py to train on GPU server-
TypeError: torch.index_select received an invalid combination of arguments - got (torch.cuda.FloatTensor, int, torch.cuda.IntTensor), but expected (torch.cuda.FloatTensor source, int dim, torch.cuda.LongTensor index)
On line "embed_words = self.drop(self.word_embeds(words).view(words.size()[0], -1))"
line 72 of model.py

Failed building wheel for en-coref-sm

creating build\temp.win-amd64-3.6\Release\en_coref_sm
C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\Piyush\Anaconda3\include -IC:\Users\Piyush\AppData\Local\Temp\pip-req-build-hvs3wmbv\include -IC:\Users\Piyush\Anaconda3\lib\site-packages\numpy\core\include -IC:\Users\Piyush\Anaconda3\include -IC:\Users\Piyush\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17134.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17134.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17134.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17134.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17134.0\cppwinrt" /EHsc /Tpen_coref_sm/neuralcoref.cpp /Fobuild\temp.win-amd64-3.6\Release\en_coref_sm/neuralcoref.obj
neuralcoref.cpp
c:\users\piyush\anaconda3\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(12) : Warning Msg: Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
en_coref_sm/neuralcoref.cpp(5842): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(6639): warning C4244: '=': conversion from 'double' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(9894): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(9969): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(16201): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(17362): warning C4244: 'argument': conversion from 'uint64_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(17617): warning C4244: '=': conversion from 'int' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(17629): warning C4244: '=': conversion from 'int' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(17641): warning C4244: '=': conversion from 'int' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(17677): warning C4244: '=': conversion from 'double' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(17686): warning C4244: 'argument': conversion from 'uint64_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(17713): warning C4244: '=': conversion from 'double' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(17725): warning C4244: '=': conversion from 'int' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(18068): warning C4244: '=': conversion from 'double' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(30044): warning C4244: 'argument': conversion from '__pyx_t_5spacy_8typedefs_attr_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(31442): warning C4244: '=': conversion from '__pyx_t_5spacy_8typedefs_attr_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(31469): warning C4244: '=': conversion from '__pyx_t_5spacy_8typedefs_attr_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(31496): warning C4244: '=': conversion from '__pyx_t_5spacy_8typedefs_attr_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(31547): warning C4244: '=': conversion from '__pyx_t_5spacy_8typedefs_attr_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(49943): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\bin\HostX86\x64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\Users\Piyush\Anaconda3\libs /LIBPATH:C:\Users\Piyush\Anaconda3\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\ATLMFC\lib\x64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.17134.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.17134.0\um\x64" /EXPORT:PyInit_en_coref_sm/neuralcoref build\temp.win-amd64-3.6\Release\en_coref_sm/neuralcoref.obj /OUT:build\lib.win-amd64-3.6\en_coref_sm/neuralcoref.cp36-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.6\Release\en_coref_sm\neuralcoref.cp36-win_amd64.lib
LINK : error LNK2001: unresolved external symbol PyInit_en_coref_sm/neuralcoref
build\temp.win-amd64-3.6\Release\en_coref_sm\neuralcoref.cp36-win_amd64.lib : fatal error LNK1120: 1 unresolved externals
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\bin\HostX86\x64\link.exe' failed with exit status 1120

### Failed building wheel for en-coref-sm
Running setup.py clean for en-coref-sm
Failed to build en-coref-sm
Installing collected packages: en-coref-sm
Running setup.py install for en-coref-sm ... error
Complete output from command C:\Users\Piyush\Anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\Piyush\AppData\Local\Temp\pip-req-build-hvs3wmbv\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\Piyush\AppData\Local\Temp\pip-record-1qodkes3\install-record.txt --single-version-externally-managed --compile:
model_name en_coref_sm
model_dir en_coref_sm\en_coref_sm-3.0.0
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.6
creating build\lib.win-amd64-3.6\en_coref_sm
copying en_coref_sm_init_.py -> build\lib.win-amd64-3.6\en_coref_sm
creating build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0
copying en_coref_sm\en_coref_sm-3.0.0\meta.json -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0
copying en_coref_sm\en_coref_sm-3.0.0\tokenizer -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0
creating build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\ner
copying en_coref_sm\en_coref_sm-3.0.0\ner\cfg -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\ner
copying en_coref_sm\en_coref_sm-3.0.0\ner\lower_model -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\ner
copying en_coref_sm\en_coref_sm-3.0.0\ner\moves -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\ner
copying en_coref_sm\en_coref_sm-3.0.0\ner\tok2vec_model -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\ner
copying en_coref_sm\en_coref_sm-3.0.0\ner\upper_model -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\ner
creating build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\neuralcoref
copying en_coref_sm\en_coref_sm-3.0.0\neuralcoref\cfg -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\neuralcoref
copying en_coref_sm\en_coref_sm-3.0.0\neuralcoref\pairs_model -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\neuralcoref
copying en_coref_sm\en_coref_sm-3.0.0\neuralcoref\single_model -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\neuralcoref
creating build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\neuralcoref\static_vectors
copying en_coref_sm\en_coref_sm-3.0.0\neuralcoref\static_vectors\key2row -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\neuralcoref\static_vectors
copying en_coref_sm\en_coref_sm-3.0.0\neuralcoref\static_vectors\vectors -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\neuralcoref\static_vectors
creating build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\neuralcoref\tuned_vectors
copying en_coref_sm\en_coref_sm-3.0.0\neuralcoref\tuned_vectors\key2row -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\neuralcoref\tuned_vectors
copying en_coref_sm\en_coref_sm-3.0.0\neuralcoref\tuned_vectors\vectors -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\neuralcoref\tuned_vectors
creating build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\parser
copying en_coref_sm\en_coref_sm-3.0.0\parser\cfg -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\parser
copying en_coref_sm\en_coref_sm-3.0.0\parser\lower_model -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\parser
copying en_coref_sm\en_coref_sm-3.0.0\parser\moves -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\parser
copying en_coref_sm\en_coref_sm-3.0.0\parser\tok2vec_model -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\parser
copying en_coref_sm\en_coref_sm-3.0.0\parser\upper_model -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\parser
creating build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\tagger
copying en_coref_sm\en_coref_sm-3.0.0\tagger\cfg -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\tagger
copying en_coref_sm\en_coref_sm-3.0.0\tagger\model -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\tagger
copying en_coref_sm\en_coref_sm-3.0.0\tagger\tag_map -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\tagger
creating build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\vocab
copying en_coref_sm\en_coref_sm-3.0.0\vocab\key2row -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\vocab
copying en_coref_sm\en_coref_sm-3.0.0\vocab\lexemes.bin -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\vocab
copying en_coref_sm\en_coref_sm-3.0.0\vocab\strings.json -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\vocab
copying en_coref_sm\en_coref_sm-3.0.0\vocab\vectors -> build\lib.win-amd64-3.6\en_coref_sm\en_coref_sm-3.0.0\vocab
copying en_coref_sm\meta.json -> build\lib.win-amd64-3.6\en_coref_sm
running build_ext
building 'en_coref_sm/neuralcoref' extension
creating build\temp.win-amd64-3.6
creating build\temp.win-amd64-3.6\Release
creating build\temp.win-amd64-3.6\Release\en_coref_sm
C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -IC:\Users\Piyush\Anaconda3\include -IC:\Users\Piyush\AppData\Local\Temp\pip-req-build-hvs3wmbv\include -IC:\Users\Piyush\Anaconda3\lib\site-packages\numpy\core\include -IC:\Users\Piyush\Anaconda3\include -IC:\Users\Piyush\Anaconda3\include "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17134.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17134.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17134.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17134.0\winrt" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.17134.0\cppwinrt" /EHsc /Tpen_coref_sm/neuralcoref.cpp /Fobuild\temp.win-amd64-3.6\Release\en_coref_sm/neuralcoref.obj
neuralcoref.cpp
c:\users\piyush\anaconda3\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(12) : Warning Msg: Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
en_coref_sm/neuralcoref.cpp(5842): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(6639): warning C4244: '=': conversion from 'double' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(9894): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(9969): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(16201): warning C4244: '=': conversion from 'Py_ssize_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(17362): warning C4244: 'argument': conversion from 'uint64_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(17617): warning C4244: '=': conversion from 'int' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(17629): warning C4244: '=': conversion from 'int' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(17641): warning C4244: '=': conversion from 'int' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(17677): warning C4244: '=': conversion from 'double' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(17686): warning C4244: 'argument': conversion from 'uint64_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(17713): warning C4244: '=': conversion from 'double' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(17725): warning C4244: '=': conversion from 'int' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(18068): warning C4244: '=': conversion from 'double' to 'float', possible loss of data
en_coref_sm/neuralcoref.cpp(30044): warning C4244: 'argument': conversion from '__pyx_t_5spacy_8typedefs_attr_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(31442): warning C4244: '=': conversion from '__pyx_t_5spacy_8typedefs_attr_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(31469): warning C4244: '=': conversion from '__pyx_t_5spacy_8typedefs_attr_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(31496): warning C4244: '=': conversion from '__pyx_t_5spacy_8typedefs_attr_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(31547): warning C4244: '=': conversion from '__pyx_t_5spacy_8typedefs_attr_t' to 'int', possible loss of data
en_coref_sm/neuralcoref.cpp(49943): warning C4267: '=': conversion from 'size_t' to 'int', possible loss of data
C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\bin\HostX86\x64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\Users\Piyush\Anaconda3\libs /LIBPATH:C:\Users\Piyush\Anaconda3\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\ATLMFC\lib\x64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.17134.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.17134.0\um\x64" /EXPORT:PyInit_en_coref_sm/neuralcoref build\temp.win-amd64-3.6\Release\en_coref_sm/neuralcoref.obj /OUT:build\lib.win-amd64-3.6\en_coref_sm/neuralcoref.cp36-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.6\Release\en_coref_sm\neuralcoref.cp36-win_amd64.lib
LINK : error LNK2001: unresolved external symbol PyInit_en_coref_sm/neuralcoref
build\temp.win-amd64-3.6\Release\en_coref_sm\neuralcoref.cp36-win_amd64.lib : fatal error LNK1120: 1 unresolved externals
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.14.26428\bin\HostX86\x64\link.exe' failed with exit status 1120

----------------------------------------

Command "C:\Users\Piyush\Anaconda3\python.exe -u -c "import setuptools, tokenize;file='C:\Users\Piyush\AppData\Local\Temp\pip-req-build-hvs3wmbv\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\Piyush\AppData\Local\Temp\pip-record-1qodkes3\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\Piyush\AppData\Local\Temp\pip-req-build-hvs3wmbv\

Attribute Error

Code:
import spacy
import en_coref_md

nlp = en_coref_md.load()
doc = nlp(u'My sister has a dog. She loves him.')

doc..has_coref
doc..coref_clusters

Error:

AttributeError Traceback (most recent call last)
in ()
2 import en_coref_md
3
----> 4 nlp = en_coref_md.load()
5 doc = nlp(u'My sister has a dog. She loves him.')
6

~\Anaconda3\Scripts\en_coref_md_init_.py in load(**overrides)
13 overrides['disable'] = disable + ['neuralcoref']
14 nlp = load_model_from_init_py(file, **overrides)
---> 15 coref = neuralcoref.NeuralCoref(nlp.vocab)
16 coref.from_disk(nlp.path / 'neuralcoref')
17 nlp.add_pipe(coref, name='neuralcoref')

AttributeError: module 'neuralcoref' has no attribute 'NeuralCoref'

Help: Can't reproduce demo

Hey Thomas! First and foremost I wanted to say that you are amazing and this is great work!

My question pertains to how your demo website model is running, in comparison to my local copy.
(https://huggingface.co/coref/)

I have a line of text that has nouns and pronouns in the second sentence, and your demo is correctly making an association! However when I use the en_coref_md pre-trained spacy model locally on the same text I get:
'doc._.has_coref' = FALSE.

And when I look at the other parts of the Doc, Span, and Token objects I get NoneType objects as well.

I'm not sure why I'm not seeing associations similar to your demo, nor do I know how to reproduce
what's being outputted on your site.

I don't have the time to annotate and train on my current data, so this is me hoping your pre-trained english model will work well on my data. (Which it seems to?)

To re-iterate: What are you using as a model for the site? And any reason why I can't see nominal and pronominal associations when they are available in the demo?

Any and all help is appreciated! Thank you again.

spaCy version

As written in the readme at "If you are an early user of spacy 2 alpha --- (and there is now spacy 2.0.6 and the version won't be lower in the future, and you could adress the project by typing its proper name with big C) --- , you can use neuralcoref with spacy 2 without any specific modification", there should be no problem with spaCy 2 using appropriate models. Instead of this iInstalling the requirements of neuralcoref overwrites and corrupts an exisiting installation of spacy. So please update your readme directing to a working version of spaCy or upgrade your own version.

Extension 'has_coref' already exists on Doc.

My code:

import spacy
import en_coref_sm

nlp = en_coref_sm.load()
doc = nlp(u'The lungs are located in the chest.They are conical in shape.')

print (doc..has_coref)
print (doc..coref_clusters)

Hey I ran into the following error when I inputted my own sentence::::

ValueError Traceback (most recent call last)
in ()
2 import en_coref_sm
3
----> 4 nlp = en_coref_sm.load()
5 doc = nlp(u'The lungs are located in the chest.They are conical in shape.')
6

~\Anaconda3\lib\site-packages\en_coref_sm_init_.py in load(**overrides)
13 overrides['disable'] = disable + ['neuralcoref']
14 nlp = load_model_from_init_py(file, **overrides)
---> 15 coref = NeuralCoref(nlp.vocab)
16 coref.from_disk(nlp.path / 'neuralcoref')
17 nlp.add_pipe(coref, name='neuralcoref')

neuralcoref.pyx in en_coref_sm.neuralcoref.neuralcoref.NeuralCoref.init()

doc.pyx in spacy.tokens.doc.Doc.set_extension()

ValueError: [E090] Extension 'has_coref' already exists on Doc. To overwrite the existing extension, set force=True on Doc.set_extension.

Getting cluster ids per word in original text

First, Thank you for the work 👍 It has made life a lot easier.

In the usage of coref.get_resolved_utterances(), I was wondering if there is some way of generating a mapping saying that each word in the original text belongs to which cluster(s). Maybe like:

John	used	to	work	for	the	army	.	He	was	an	epitome	of	discipline
0					1	1		0

I've tried to do this with get_clusters and get_mentions but the problem is that mentions are not unique in the original text.

Is there something I'm missing or should I fork and dive into neuralcoref code?

Out of Vocabulary Words

Hi
Firstly, what you've done looks great - thank you. Secondly I'm interested in a feature that you hint at in your 1st blog post:
2.We make use of recent work on word embeddings to compute embeddings for unknown words on the fly from definitions or information that you can provide (it’s very simple in fact: you can compute a word embedding for “Kendall Jenner” simply by averaging the vectors for “woman” and “model” for example).
I'd like to know how to supplement the definitions/information you refer to above. I'm working within a domain that has some very specific/unique terms that will not be in the training corpus and I'd like to know how to use/supplement the code to embed vectors in the way you describe above,
Thanks

result not matched as demo

considering this sentence:

According to this legend, Captain James Cook and naturalist Sir Joseph Banks were exploring Australia when they happened upon the animal. They asked a nearby local what the creatures were called.

In which 'they' in the second sentence should be resolved as 'Captain James Cook and naturalist Sir Joseph Banks'. On demo site this is roughly right, but from the output of the latest code this is not the case:

coref.one_shot_coref(utterances=u"According to this legend, Captain James Cook and naturalist Sir Joseph Banks were exploring Australia when they happened upon the animal. They asked a nearby local what the creatures were called.", context=u"")
resolved_utterance_text = coref.get_resolved_utterances()
print(resolved_utterance_text)

['According to this legend, Captain James Cook and naturalist Sir Joseph Banks were exploring Australia when they happened upon the animal. They asked a nearby local what the creatures were called.']

why is that?

How can I get the offset position of each mention returned by coref.get_mentions()

Currently 'coref.get_mentions()' only returns a list mentions. Ideally, it would be helpful to get a list of dictionaries returned with the mention and its offset position within the utterance (basically the string being passed in for processing).

Is there a way do this?

Token level coref_clusters access raise exceptions sometimes

import spacy

nlp = spacy.load('en_coref_sm')

x = '''The English name "Normans" comes from the French words Normans/Normanz, plural of Normant, modern French normand, which is itself borrowed from Old Low Franconian Nortmann "Northman" or directly from Old Norse Norðmaðr, Latinized variously as Nortmannus, Normannus, or Nordmannus (recorded in Medieval Latin, 9th century) to mean "Norseman, Viking". What is the original meaning of the word Norman?'''
doc = nlp(x)
doc[0]._.coref_clusters

Raises an error.

Traceback (most recent call last):
  File "bug.py", line 7, in <module>
    doc[0]._.coref_clusters
  File "/home/arjoonn/.local/share/virtualenvs/squad2.0-PnAXOzWu/lib/python3.6/site-packages/spacy/tokens/underscore.py", line 31, in __getattr__
    return getter(self._obj)
  File "neuralcoref.pyx", line 800, in en_coref_sm.neuralcoref.neuralcoref.NeuralCoref.token_clusters
TypeError: 'NoneType' object is not iterable

This only happens sometimes (for example with the text given above).

NeuralCoref-3.0 can't load the new spacy model

I couldn't load the spacy model en-coref-sm. I have installed both neuralcoref-3.0 and en-coref-sm by downloading and running the setup.py even I tried the pip install for both. Once the installation completed when tried to load the spacy model it throws the below exception.

Traceback (most recent call last):
File "/home/extraction/CoreferenceResolver.py", line 5, in
from neuralcoref import Coref
File "/usr/local/lib/python2.7/dist-packages/neuralcoref-3.0-py2.7-linux-x86_64.egg/neuralcoref/init.py", line 3, in
from .neuralcoref import NeuralCoref
File "neuralcoref.pyx", line 101, in init neuralcoref.neuralcoref
TypeError: must be char, not unicode

Please provide me the clear steps to begin with the new neuralcoref

Not able to generate coreferencing for a series of statements more than 500 words

@julien-c
I have tried using your system for co-referencing. It was giving promising results till 500 words, but it is not able to do co-referencing above 500 words. I have tried both in online as well as hosted version. Can you please update the same for above 500 words too?

Command "python setup.py egg_info" failed with error code 1

I'm not able to install it. Any suggestion is much appreciated.

$ pip install https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz
Collecting https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz
Downloading https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz (892.8MB)
100% |████████████████████████████████| 892.8MB 27kB/s
Complete output from command python setup.py egg_info:
(u'model_name', 'en_coref_lg')
(u'model_dir', u'en_coref_lg/en_coref_lg-3.0.0')
Traceback (most recent call last):
File "", line 1, in
File "/private/var/folders/vk/f4v089pd43jf98dcbzfwjnnr0000gn/T/pip-req-build-3XD223/setup.py", line 98, in
setup_package()
File "/private/var/folders/vk/f4v089pd43jf98dcbzfwjnnr0000gn/T/pip-req-build-3XD223/setup.py", line 75, in setup_package
extra_link_args=extra_link_args))
File "/Users/stian/projects/figaro-tagger/venv/lib/python2.7/site-packages/setuptools/extension.py", line 39, in init
_Extension.init(self, name, sources, *args, **kw)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/distutils/extension.py", line 106, in init
assert type(name) is StringType, "'name' must be a string"
AssertionError: 'name' must be a string

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/vk/f4v089pd43jf98dcbzfwjnnr0000gn/T/pip-req-build-3XD223/

Got a problem while installing

Got this error while installing it on my virtual environment: "Permission denied: '/Library/Python/2.7/site-packages/certifi' " It also deleted my nltk files from the virtual environment. Please guide me where's the problem.

model dependency on 'neuralcoref'?

The models appear to have a dependency on a "neuralcoref" module. Should this be installed in addition to a model such as "en-coref-lg"?

I ask due to the following error I'm seeing when running nueralcoref inside a Docker container.

In the Dockerfile, I run:

RUN pip install https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz

Then, in the Python service, I initialize coref like this:

     import spacy
     coref = spacy.load('en_coref_lg')

Which leads to the following error on the 2nd line above:

nlu-service         | INFO:root:Loading spacy model, wait for confirmation before using
nlu-service         | INFO:root:Loaded spacy model
nlu-service         | Traceback (most recent call last):
nlu-service         |   File "/app/main.py", line 58, in <module>
nlu-service         |     coref = spacy.load('en_coref_lg')
nlu-service         |   File "/usr/local/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
nlu-service         |     return util.load_model(name, **overrides)
nlu-service         |   File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 114, in load_model
nlu-service         |     return load_model_from_package(name, **overrides)
nlu-service         |   File "/usr/local/lib/python3.6/site-packages/spacy/util.py", line 134, in load_model_from_package
nlu-service         |     cls = importlib.import_module(name)
nlu-service         |   File "/usr/local/lib/python3.6/importlib/__init__.py", line 126, in import_module
nlu-service         |     return _bootstrap._gcd_import(name[level:], package, level)
nlu-service         |   File "<frozen importlib._bootstrap>", line 994, in _gcd_import
nlu-service         |   File "<frozen importlib._bootstrap>", line 971, in _find_and_load
nlu-service         |   File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
nlu-service         |   File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
nlu-service         |   File "<frozen importlib._bootstrap_external>", line 678, in exec_module
nlu-service         |   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
nlu-service         |   File "/usr/local/lib/python3.6/site-packages/en_coref_lg/__init__.py", line 6, in <module>
nlu-service         |     import neuralcoref 
nlu-service         | ModuleNotFoundError: No module named 'neuralcoref'

In other words, Python finds the en_coref_lg model, but the model is looking for something that isn't there. Have I missed a step?

Thanks in advance for any help you can provide...

Online version finds the right antecedents, but actual version does not

The online version is working fine on the following text:
"I know that Barbara and Sandy are here. I see Barbara watching TV. I hear Sandy breathing."

https://huggingface.co/coref/?text=I%20know%20that%20Barbara%20and%20Sandy%20are%20here.%20I%20see%20Barbara%20watching%20TV.%20I%20hear%20Sandy%20breathing.

But the actual version doesn't find ennough, just Barbara. It outputs by running:

clusters = coref.one_shot_coref(utterances=u"I know that Barbara and Sandy are here. I see Barbara watching TV. I hear Sandy breathing.")
print(clusters)
print (coref.get_most_representative())
mentions = coref.get_mentions()
print(mentions)

Loading spacy model

Info about model en_core_web_sm

lang               en             
pipeline           ['tagger', 'parser', 'ner']
accuracy           {'token_acc': 99.8698372794, 'ents_p': 84.9664503965, 'ents_r': 85.6312524451, 'uas': 91.7237657538, 'tags_acc': 97.0403350292, 'ents_f': 85.2975560875, 'las': 89.800872413}
name               core_web_sm    
license            CC BY-SA 3.0   
author             Explosion AI   
url                https://explosion.ai
vectors            {'keys': 0, 'width': 0, 'vectors': 0}
sources            ['OntoNotes 5', 'Common Crawl']
version            2.0.0          
spacy_version      >=2.0.0a18     
parent_package     spacy          
speed              {'gpu': None, 'nwords': 291344, 'cpu': 5122.3040471407}
email              [email protected]
description        English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.
source             /usr/local/lib/python3.6/dist-packages/en_core_web_sm

loading model from /usr/local/lib/python3.6/dist-packages/neuralcoref/weights/
{3: [3, 0]}
{}
[Barbara, Barbara and Sandy, Sandy, Barbara, TV, Sandy, Sandy breathing]

Improvising time for coreferencing

Hi julien, thanks for opensoursing such an amazing library for coreferencing

I have two questions
-Have you released pyTorch powered training workflow yet. If yes can you please provide a link for the same.

Now with this current flow, coreferencing a 10kb file takes 30 seconds on my machine. Will the time for coreferncing come down drastically using a GPU.?

`NotADirectoryError` while trainling neuralcoref on a new dataset

I am following the documentation to train the neuralcoref system on my own dataset. I have my dataset in CoNLL format and when I run the script python -m neuralcoref.conllparser --path path-to-my-dataset, I have the following error. Any suggestion please?

Traceback (most recent call last):
File "/opt/python/3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/opt/python/3.6.3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/neha5486/git/neuralcoref/neuralcoref/conllparser.py", line 702, in
os.makedirs(SAVE_DIR)
File "/opt/python/3.6.3/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
NotADirectoryError: [Errno 20] Not a directory: '/scratch/Shares/hunter/ontonotes/train.english.v4_gold_conll/numpy/'

A next step - recognizing more than just nouns

Hi,

I tried the following sentence on the demo page "when a notification is received that a driver is available, update that fact in the database".

The demo linked "that fact" to "a driver", while I was hoping it would link "that fact" to "a driver is available".

What would it take to enable this? I hope you see this as a possibility in the not too distant future.

Colin Goldberg

Not able to import neuralcoref.coref

Relative pronouns

Consider the below sentence
50 year old Male with history of alcohol abuse and hypertension who presents with emesis with blood clots.

Is relative pronoun outside the scope of coreference? How do I relate the relative pronoun 'who' to 'Male'?

Best Regards,
Vishal

Detection Failing on Longer Sentences

Is there any window length over which the model was trained ? It is failing with below given Input. If i reduce some words from between then it works. Looks like it is not handeling longer sentences.

Input: "The highest run scorer of all time in International cricket, Tendulkar took up cricket at the age of eleven, made his Test debut on 15 November 1989 against Pakistan. He is the only player to have scored one hundred international centuries.
Output:
{ "coreferences": [], "mentions": [ { "index": 0, "end": 9, "type": "PRONOMINAL", "start": 6, "text": "his", "utterance": 0 }, { "index": 1, "end": 57, "type": "NOMINAL", "start": 6, "text": "his Test debut on 15 November 1989 against Pakistan", "utterance": 0 }, { "index": 2, "end": 57, "type": "PROPER", "start": 24, "text": "15 November 1989 against Pakistan", "utterance": 0 }, { "index": 3, "end": 57, "type": "PROPER", "start": 49, "text": "Pakistan", "utterance": 0 }, { "index": 4, "end": 61, "type": "PRONOMINAL", "start": 59, "text": "He", "utterance": 0 }, { "index": 5, "end": 130, "type": "NOMINAL", "start": 65, "text": "the only player to have scored one hundred international centurie", "utterance": 0 }, { "index": 6, "end": 130, "type": "NOMINAL", "start": 96, "text": "one hundred international centurie", "utterance": 0 } ], "pairScores": { "0": {}, "1": { "0": -1.6678770421517723 }, "2": { "0": -1.8341348680398002, "1": -1.5090962735402649 }, "3": { "0": -2.6148013630764773, "1": -1.5516474830571154, "2": -1.5169575544555827 }, "4": { "0": 7.840848417264016, "1": -1.2979947779072352, "2": -1.4595468690699465, "3": -2.0813619591444 }, "5": { "0": -2.416013286221991, "1": -1.5004609635918187, "2": -1.4986967835407372, "3": -1.8156596791754631, "4": -1.8895537311717003 }, "6": { "0": -2.5514494172155424, "1": -1.5050492762916694, "2": -1.4961146384932793, "3": -1.5895868435565839, "4": -2.1308741164775555, "5": -1.5027189364339228 } }, "singleScores": { "0": null, "1": 1.715603142435672, "2": 1.433124625456641, "3": 1.4715869553088599, "4": -0.6005009606551885, "5": 1.9477583533315648, "6": 1.707929368512381 } }

Expected: {"He": "tendulkar"}

Dataset preparation

Which of the columns in the CoNLL format are required to train the system on a new dataset? Here is an example of a sentence that I have from my dataset, can someone comment please if this is an acceptable format:

However, with this dataset, I am having the following error message, any help please?

Traceback (most recent call last):
File "/opt/python/3.6.3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/opt/python/3.6.3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/neha5486/git/neuralcoref/neuralcoref/conllparser.py", line 714, in
(my-vitual-env) -bash-4.2$
File "/Users/neha5486/git/neuralcoref/neuralcoref/conllparser.py", line 649, in build_and_gather_multiple_arrays

ict[feature]))
TypeError: object of type 'NoneType' has no len()

ValueError: Extension 'has_coref' already exists on Doc

I am getting this spaCy ValueError: [E090] when loading the neuralcoref model. I installed en_coref_sm and en_coref_md just now via pip, and am using the latest spaCy version. I am not using virtualenv and I am on Mac. Could you help me figure out what to do with this error?

`---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
in ()
1 import spacy
----> 2 nlp = spacy.load('en_coref_md')

/Users/evgenykim/anaconda/lib/python3.6/site-packages/spacy/init.py in load(name, **overrides)
13 if depr_path not in (True, False, None):
14 deprecation_warning(Warnings.W001.format(path=depr_path))
---> 15 return util.load_model(name, **overrides)
16
17

/Users/evgenykim/anaconda/lib/python3.6/site-packages/spacy/util.py in load_model(name, **overrides)
112 return load_model_from_link(name, **overrides)
113 if is_package(name): # installed as package
--> 114 return load_model_from_package(name, **overrides)
115 if Path(name).exists(): # path to model data directory
116 return load_model_from_path(Path(name), **overrides)

/Users/evgenykim/anaconda/lib/python3.6/site-packages/spacy/util.py in load_model_from_package(name, **overrides)
133 """Load a model from an installed package."""
134 cls = importlib.import_module(name)
--> 135 return cls.load(**overrides)
136
137

/Users/evgenykim/anaconda/lib/python3.6/site-packages/en_coref_md/init.py in load(**overrides)
13 overrides['disable'] = disable + ['neuralcoref']
14 nlp = load_model_from_init_py(file, **overrides)
---> 15 coref = neuralcoref.NeuralCoref(nlp.vocab)
16 coref.from_disk(nlp.path / 'neuralcoref')
17 nlp.add_pipe(coref, name='neuralcoref')

neuralcoref.pyx in neuralcoref.neuralcoref.NeuralCoref.init()

doc.pyx in spacy.tokens.doc.Doc.set_extension()

ValueError: [E090] Extension 'has_coref' already exists on Doc. To overwrite the existing extension, set force=True on Doc.set_extension.`

Unable to run Demo to due to Unicode Decode Error

When tryning to run the demo source code, this is the error I get:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 5547: character maps to < undefined >

I am currently running Python 3. 5, spaCy 1.9.0 and I have all the dependencies installed.

ValueError: [E090] Extension 'has_coref' already exists on Doc. To overwrite the existing extension, set `force=True` on `Doc.set_extension`.

Hi, I see the following error when I do nlp = spacy.load('en_coref_sm') in dask partition. Any compatibility issue with dask?

Traceback (most recent call last):
  File "180606_coreference_dev.py", line 52, in <module>
    d.compute()
  File "/home/kyoungrok/anaconda3/envs/coref/lib/python3.6/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/home/kyoungrok/anaconda3/envs/coref/lib/python3.6/site-packages/dask/base.py", line 400, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/home/kyoungrok/anaconda3/envs/coref/lib/python3.6/site-packages/dask/threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)
  File "/home/kyoungrok/anaconda3/envs/coref/lib/python3.6/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/home/kyoungrok/anaconda3/envs/coref/lib/python3.6/site-packages/dask/compatibility.py", line 68, in reraise
    raise exc
  File "/home/kyoungrok/anaconda3/envs/coref/lib/python3.6/site-packages/dask/local.py", line 290, in execute_task
    result = _execute_task(task, data)
  File "/home/kyoungrok/anaconda3/envs/coref/lib/python3.6/site-packages/dask/local.py", line 271, in _execute_task
    return func(*args2)
  File "/home/kyoungrok/anaconda3/envs/coref/lib/python3.6/site-packages/dask/dataframe/core.py", line 3549, in apply_and_enforce
    df = func(*args, **kwargs)
  File "180606_coreference_dev.py", line 30, in process_partition
    nlp = spacy.load('en_coref_sm')
  File "/home/kyoungrok/anaconda3/envs/coref/lib/python3.6/site-packages/spacy/__init__.py", line 15, in load
    return util.load_model(name, **overrides)
  File "/home/kyoungrok/anaconda3/envs/coref/lib/python3.6/site-packages/spacy/util.py", line 114, in load_model
    return load_model_from_package(name, **overrides)
  File "/home/kyoungrok/anaconda3/envs/coref/lib/python3.6/site-packages/spacy/util.py", line 135, in load_model_from_package
    return cls.load(**overrides)
  File "/home/kyoungrok/anaconda3/envs/coref/lib/python3.6/site-packages/en_coref_sm/__init__.py", line 15, in load
    coref = NeuralCoref(nlp.vocab)
  File "neuralcoref.pyx", line 495, in neuralcoref.neuralcoref.NeuralCoref.__init__
  File "doc.pyx", line 100, in spacy.tokens.doc.Doc.set_extension
ValueError: [E090] Extension 'has_coref' already exists on Doc. To overwrite the existing extension, set `force=True` on `Doc.set_extension`.

Some Questions

I just had a couple of questions regarding neuralcoref: Do you expect to update neuralcoref in the near future, and is there a formal API available for it?

404 on spacy models with v0.3

Issuing pip install URL where url is https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz leads to an error.

Collecting https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz
  HTTP error 404 while getting https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz
  Could not install requirement https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz because of error 404 Client Error: Not Found for url: https://github-production-release-asset-2e65be.s3.amazonaws.com/134695006/37ae4b72-6b48-11e8-9c4a-c528bfca0d1a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20180608%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20180608T162021Z&X-Amz-Expires=300&X-Amz-Signature=3b94958ae6c908f54a4942f1dd773202bd1137f057ae89915c8a7b94297d179e&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Den_coref_lg-3.0.0.tar.gz&response-content-type=application%2Foctet-stream
Could not install requirement https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz because of HTTP error 404 Client Error: Not Found for url: https://github-production-release-asset-2e65be.s3.amazonaws.com/134695006/37ae4b72-6b48-11e8-9c4a-c528bfca0d1a?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20180608%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20180608T162021Z&X-Amz-Expires=300&X-Amz-Signature=3b94958ae6c908f54a4942f1dd773202bd1137f057ae89915c8a7b94297d179e&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Den_coref_lg-3.0.0.tar.gz&response-content-type=application%2Foctet-stream for URL https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_lg-3.0.0/en_coref_lg-3.0.0.tar.gz

I know it's very early and so I might have run this when the models were not yet on the servers. If that's the case I can wait. If not, is there something I can do to fix this?

Error while running server.py in examples

I am trying to run the server.py file in the examples and I am getting the following error.

Traceback (most recent call last):
  File "/usr/lib/python3.5/wsgiref/handlers.py", line 137, in run
    self.result = application(self.environ, self.start_response)
  File "/home/formcept/neuralcoref/neuralconf/lib/python3.5/site-packages/falcon/api.py", line 244, in __call__
    responder(req, resp, **params)
  File "server.py", line 39, in on_get
    } for span in doc._.coref_mentions]
  File "/home/formcept/neuralcoref/neuralconf/lib/python3.5/site-packages/spacy/tokens/underscore.py", line 28, in __getattr__
    raise AttributeError(Errors.E046.format(name=name))
AttributeError: [E046] Can't retrieve unregistered extension attribute 'coref_mentions'. Did you forget to call the `set_extension` method?

I am using Ubuntu 16.04 and python3

Mentions found but resolution fails

I'm having an issue where I run coref on a piece of text and while sets of mentions are found, they do not get resolved:

>>> text = "Securities America, a wholly owned subsidiary of Ladenburg Thalmann Financial Services Inc., has hired Bonnie Reed to join its branch office development recruiting team. Reed's addition will help the independent broker-dealer manage the increasing pipeline of prospective advisors considering joining the company."
>>> clusters = coref.one_shot_coref(text)
>>> clusters
{5: [5, 2], 10: [10, 3, 0]}
>>> coref.get_resolved_utterances()
["Securities America, a wholly owned subsidiary of Ladenburg Thalmann Financial Services Inc., has hired Bonnie Reed to join its branch office development recruiting team. Reed's addition will help the independent broker-dealer manage the increasing pipeline of prospective advisors considering joining the company."]

Here, two mention sets are identified, but running get_resolved_utterances() fails to actually resolve them.

Can't detect 'former' and 'latter' as potential references.

In sentences like:

I like chicken and cheese. Although, I like the former better.

The word former is referencing to chicken but the system is unable to detect any coreference.

Coreferencing is not working for similar input

This seems to be a bug, I have tried it with models en, en_core_web_sm and en_depent_web_md

`from neuralcoref import Coref
import spacy

#nlp = spacy.load("en_depent_web_md")
coref = Coref()

Example 1 : This is working !

clusters = coref.continuous_coref(utterances=u"Phone area code will be valid only when all the below conditions are met. It cannot be left blank. It should be numeric. It cannot be less than 200. Minimum number of digits should be 3. ")

Example 2: This is not !

clusters = coref.continuous_coref(utterances=u"Phone number will be valid only when all the below conditions are met. It cannot be left blank. It should be numeric. It cannot be less than 200. Minimum number of digits should be 3. ")

print(clusters)

mentions = coref.get_mentions()
print(mentions)

utterances = coref.get_utterances()
print(utterances)

resolved_utterance_text = coref.get_resolved_utterances()
print(resolved_utterance_text)
`

first example is working fine, the second one which is very similar is not working. Please look into this.

MyEnvironment

Info about spaCy

Platform Windows-8.1-6.3.9600-SP0
Python version 3.5.1
spaCy version 1.9.0
Installed models en, en_core_web_sm, en_default, en_depent_web_md
Location D:\Apps\Python35\lib\site-packages\spacy

Finding Coreferences Only Within a Sentence

MacOS High Sierra (10.13.3)
spacy == 2.0.11
neuralcoref == 3.0
en-coref-md == 3.0.0

Hi, I'm working to resolve coreferences within a single sentence that is part of a larger paragraph. For example, in the short paragraph:

nlp = spacy.load('en_coref_md')

tokens = nlp("Carol bought a new book. As Carol read, she learned.")

print(tokens._.coref_clusters)

for n, s in enumerate(tokens.sents):
    print(s._.coref_cluster)

The first print yields: [Carol: [Carol, Carol, she]]
The second print yields (for both s in tokens): None

Why does neuralcoref not work within a Span here? I don't want coreferences between sentences, only within a single sentence. For example, I do not want 'Carol' in the first sentence to corefer to either 'Carol' or 'she' in the next sentence. However, I need to run the Doc object at the paragraph level to be able to tell what the sentences are.

I suppose a better question may be: Is it possible for neuralcoref to only find coreferences inside a single sentence?

Question

Should neuralconf work in some language other than English?

size mismatch

Thanks very much for sharing the library. It's amazing

I think there is a size mismatch between the input features and the pre-trained weights posted on this site for the input layer

The pre-trained weights, stored in neuralcoref/weights folder of this repository, have a size of ( 1000 x 668 ) for single mentions, and ( 1000 x 1364 ) for pair mentions in layer 0

However, the dimension of input features generated for each mention is 674 per single mention, and 1370 per pair mentions, if compressed = False.

This mismatch generates an error during the evaluation process if the pre-trained weights are loaded, as shown below

Do you have the same problem?

Thanks,

`---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
in ()
61 eval_evaluator.test_model()
62 start_time = time.time()
---> 63 eval_evaluator.build_test_file()
64 score, f1_conll, ident = eval_evaluator.get_score()
65 elapsed = time.time() - start_time

~/coding/coreference/notebook/learning/evaluator.py in build_test_file(self, out_path, remove_singleton, print_all_mentions, debug)
162 cur_m = 0
163 for sample_batched, mentions_idx, n_pairs_l in zip(self.dataloader, self.mentions_idx, self.n_pairs):
--> 164 scores, max_i = self.get_max_score(sample_batched)
165 for m_idx, ind, n_pairs in zip(mentions_idx, max_i, n_pairs_l):
166 if ind < n_pairs : # the single score is not the highest, we have a match !

~/coding/coreference/notebook/learning/evaluator.py in get_max_score(self, batch, debug)
140 mask = mask.cuda()
141 self.model.eval()
--> 142 scores = self.model.forward(inputs, concat_axis=1).data
143 scores.masked_fill_(mask, -float('Inf'))
144 _, max_idx = scores.max(dim=1) # We may want to weight the single score with coref.greedyness

~/coding/coreference/notebook/learning/model.py in forward(self, inputs, concat_axis)
72 embed_words = self.drop(self.word_embeds(words).view(words.size()[0], -1))
73 single_input = torch.cat([spans, embed_words, single_features], 1)
---> 74 single_scores = self.single_top(single_input)
75 if pairs:
76 batchsize, pairs_num, _ = ana_spans.size()

~/software/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py in call(self, *input, **kwargs)
355 result = self._slow_forward(*input, **kwargs)
356 else:
--> 357 result = self.forward(*input, **kwargs)
358 for hook in self._forward_hooks.values():
359 hook_result = hook(self, input, result)

~/software/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/container.py in forward(self, input)
65 def forward(self, input):
66 for module in self._modules.values():
---> 67 input = module(input)
68 return input
69

~/software/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/linear.py in forward(self, input)
53
54 def forward(self, input):
---> 55 return F.linear(input, self.weight, self.bias)
56
57 def repr(self):

~/software/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/functional.py in linear(input, weight, bias)
833 if input.dim() == 2 and bias is not None:
834 # fused op is marginally faster
--> 835 return torch.addmm(bias, input, weight.t())
836
837 output = input.matmul(weight.t())

RuntimeError: size mismatch, m1: [1 x 674], m2: [668 x 1000] at /opt/conda/conda-bld/pytorch_1523244252089/work/torch/lib/TH/generic/THTensorMath.c:1434 `

installation

`s@hello:~/code/neuralcoref$ python -m spacy download 'en'

Downloading en_core_web_sm-1.2.0/en_core_web_sm-1.2.0.tar.gz

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-1.2.0/en_core_web_sm-1.2.0.tar.gz
Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-1.2.0/en_core_web_sm-1.2.0.tar.gz (52.2MB)
100% |████████████████████████████████| 52.2MB 2.1MB/s
Collecting spacy<2.0.0,>=1.7.0 (from en-core-web-sm==1.2.0)
Downloading spacy-1.10.0.tar.gz (3.4MB)
100% |████████████████████████████████| 3.4MB 3.8MB/s
Collecting numpy>=1.7 (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading numpy-1.13.3-cp27-cp27mu-manylinux1_x86_64.whl (16.6MB)
100% |████████████████████████████████| 16.7MB 2.6MB/s
Collecting murmurhash<0.27,>=0.26 (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading murmurhash-0.26.4-cp27-cp27mu-manylinux1_x86_64.whl
Collecting cymem<1.32,>=1.30 (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading cymem-1.31.2-cp27-cp27mu-manylinux1_x86_64.whl (66kB)
100% |████████████████████████████████| 71kB 5.4MB/s
Collecting preshed<2.0.0,>=1.0.0 (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading preshed-1.0.0.tar.gz (89kB)
100% |████████████████████████████████| 92kB 3.8MB/s
Collecting thinc<6.6.0,>=6.5.0 (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading thinc-6.5.2.tar.gz (926kB)
100% |████████████████████████████████| 931kB 2.6MB/s
Collecting plac<1.0.0,>=0.9.6 (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading plac-0.9.6-py2.py3-none-any.whl
Collecting pip<10.0.0,>=9.0.0 (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading pip-9.0.1-py2.py3-none-any.whl (1.3MB)
100% |████████████████████████████████| 1.3MB 2.7MB/s
Collecting six (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading six-1.11.0-py2.py3-none-any.whl
Collecting pathlib (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading pathlib-1.0.1.tar.gz (49kB)
100% |████████████████████████████████| 51kB 6.4MB/s
Collecting ujson>=1.35 (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading ujson-1.35.tar.gz (192kB)
100% |████████████████████████████████| 194kB 2.8MB/s
Collecting dill<0.3,>=0.2 (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading dill-0.2.7.1.tar.gz (64kB)
100% |████████████████████████████████| 71kB 4.5MB/s
Collecting requests<3.0.0,>=2.13.0 (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading requests-2.18.4-py2.py3-none-any.whl (88kB)
100% |████████████████████████████████| 92kB 2.9MB/s
Collecting regex<2017.12.1,>=2017.4.1 (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading regex-2017.11.09.tar.gz (608kB)
100% |████████████████████████████████| 614kB 3.2MB/s
Collecting ftfy<5.0.0,>=4.4.2 (from spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading ftfy-4.4.3.tar.gz (50kB)
100% |████████████████████████████████| 51kB 6.6MB/s
Collecting wrapt (from thinc<6.6.0,>=6.5.0->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading wrapt-1.10.11.tar.gz
Collecting tqdm<5.0.0,>=4.10.0 (from thinc<6.6.0,>=6.5.0->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading tqdm-4.19.4-py2.py3-none-any.whl (50kB)
100% |████████████████████████████████| 51kB 8.0MB/s
Collecting cytoolz<0.9,>=0.8 (from thinc<6.6.0,>=6.5.0->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading cytoolz-0.8.2.tar.gz (386kB)
100% |████████████████████████████████| 389kB 2.7MB/s
Collecting termcolor (from thinc<6.6.0,>=6.5.0->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading termcolor-1.1.0.tar.gz
Collecting idna<2.7,>=2.5 (from requests<3.0.0,>=2.13.0->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading idna-2.6-py2.py3-none-any.whl (56kB)
100% |████████████████████████████████| 61kB 3.1MB/s
Collecting urllib3<1.23,>=1.21.1 (from requests<3.0.0,>=2.13.0->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading urllib3-1.22-py2.py3-none-any.whl (132kB)
100% |████████████████████████████████| 133kB 3.5MB/s
Collecting certifi>=2017.4.17 (from requests<3.0.0,>=2.13.0->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading certifi-2017.11.5-py2.py3-none-any.whl (330kB)
100% |████████████████████████████████| 337kB 3.4MB/s
Collecting chardet<3.1.0,>=3.0.2 (from requests<3.0.0,>=2.13.0->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB)
100% |████████████████████████████████| 143kB 3.5MB/s
Collecting html5lib (from ftfy<5.0.0,>=4.4.2->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading html5lib-0.999999999-py2.py3-none-any.whl (112kB)
100% |████████████████████████████████| 122kB 3.7MB/s
Collecting wcwidth (from ftfy<5.0.0,>=4.4.2->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading wcwidth-0.1.7-py2.py3-none-any.whl
Collecting toolz>=0.8.0 (from cytoolz<0.9,>=0.8->thinc<6.6.0,>=6.5.0->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading toolz-0.8.2.tar.gz (45kB)
100% |████████████████████████████████| 51kB 7.1MB/s
Collecting webencodings (from html5lib->ftfy<5.0.0,>=4.4.2->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading webencodings-0.5.1-py2.py3-none-any.whl
Collecting setuptools>=18.5 (from html5lib->ftfy<5.0.0,>=4.4.2->spacy<2.0.0,>=1.7.0->en-core-web-sm==1.2.0)
Downloading setuptools-36.7.2-py2.py3-none-any.whl (482kB)
100% |████████████████████████████████| 491kB 3.0MB/s
Installing collected packages: numpy, murmurhash, cymem, preshed, wrapt, tqdm, toolz, cytoolz, plac, six, dill, termcolor, pathlib, thinc, pip, ujson, idna, urllib3, certifi, chardet, requests, regex, webencodings, setuptools, html5lib, wcwidth, ftfy, spacy, en-core-web-sm
Running setup.py install for preshed ... done
Running setup.py install for wrapt ... done
Running setup.py install for toolz ... done
Running setup.py install for cytoolz ... done
Running setup.py install for dill ... done
Running setup.py install for termcolor ... done
Running setup.py install for pathlib ... done
Running setup.py install for thinc ... done
Running setup.py install for ujson ... done
Running setup.py install for regex ... done
Running setup.py install for ftfy ... done
Running setup.py install for spacy ... done
Running setup.py install for en-core-web-sm ... done
Successfully installed certifi-2017.11.5 chardet-3.0.4 cymem-1.31.2 cytoolz-0.8.2 dill-0.2.7.1 en-core-web-sm-2.0.0 ftfy-4.4.3 html5lib-0.999999999 idna-2.6 murmurhash-0.28.0 numpy-1.13.3 pathlib-1.0.1 pip-9.0.1 plac-0.9.6 preshed-1.0.0 regex-2017.11.9 requests-2.18.4 setuptools-36.7.2 six-1.11.0 spacy-2.0.2 termcolor-1.1.0 thinc-6.10.0 toolz-0.8.2 tqdm-4.19.4 ujson-1.35 urllib3-1.22 wcwidth-0.1.7 webencodings-0.5.1 wrapt-1.10.11
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"main", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/home/s/.local/lib/python2.7/site-packages/spacy/main.py", line 133, in
plac.Interpreter.call(CLI)
File "/home/s/.local/lib/python2.7/site-packages/plac_ext.py", line 1142, in call
print(out)
File "/home/s/.local/lib/python2.7/site-packages/plac_ext.py", line 914, in exit
self.close(exctype, exc, tb)
File "/home/s/.local/lib/python2.7/site-packages/plac_ext.py", line 952, in close
self._interpreter.throw(exctype, exc, tb)
File "/home/s/.local/lib/python2.7/site-packages/plac_ext.py", line 964, in make_interpreter
arglist = yield task
File "/home/s/.local/lib/python2.7/site-packages/plac_ext.py", line 1139, in call
raise(task.etype, task.exc, task.tb)
File "/home/s/.local/lib/python2.7/site-packages/plac_ext.py", line 380, in wrap
for value in genobj:
File "/home/s/.local/lib/python2.7/site-packages/plac_ext.py", line 95, in gen_exc
raise(etype, exc, tb)
File "/home/s/.local/lib/python2.7/site-packages/plac_ext.py", line 966, in _make_interpreter
cmd, result = self.parser.consume(arglist)
File "/home/s/.local/lib/python2.7/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/s/.local/lib/python2.7/site-packages/spacy/main.py", line 33, in download
cli_download(model, direct)
File "/home/s/.local/lib/python2.7/site-packages/spacy/cli/download.py", line 24, in download
link_package(model_name, model, force=True)
File "/home/s/.local/lib/python2.7/site-packages/spacy/cli/link.py", line 22, in link_package
pkg = importlib.import_module(package_name)
File "/usr/lib/python2.7/importlib/init.py", line 37, in import_module
import(name)
File "/home/s/.local/lib/python2.7/site-packages/en_core_web_sm/init.py", line 5, in
from spacy.util import load_model_from_init_py, get_model_meta
ImportError: cannot import name load_model_from_init_py
Segmentation fault
`

mentions to utterances

How to match the mention word to utterance sentence ?

Training/testing workflow

Thanks for opensourcing this ready to use library!

In your blogpost you've mentioned initial training on OntoNotes-5. Could you reveal details about performance on that dataset and how current one relates to performance of the original Clark & Manning model?
Do you plan to release code/workflow for retraining/tuning of the model?

Coref.get_most_representative fails on type error

Running the server per readme, get an error when trying the sample curl.
Running python algorithm.py to trigger __main__ produces the following output.

[I, many friends, around me, me, many friends, around me received it, it, It, almost everyone, almost everyone received this SMS, this SMS]
[Yes, I noticed that many friends, around me received it. It seems that almost everyone received this SMS.]
['Yes, I noticed that many friends, around me received it. It seems that almost everyone received this SMS.']

The error when called from server is

algorithm.py line 347 
coreferences[self.data.mentions[key]] = mention
raises TypeError: unhashable type: 'Mention'

Mention class is defined in document.py. It inherits from spacy.tokens.Span but does not implement hash().

AssertionError when loading model in Ubuntu

Hi,
Firstly, thanks for the work on neuralcoref, it seems like a very useful extension to spaCy.

I successfully installed the neuralcoref model using:
sudo pip install https://github.com/huggingface/neuralcoref-models/releases/download/en_coref_sm-3.0.0/en_coref_sm-3.0.0.tar.gz

But when I try to use it / load the model, like this:

#!/usr/bin/env python

import spacy
nlp = spacy.load("en_coref_sm")
doc = nlp(u'My sister has a dog. She loves him.')

I'm getting the following error:

Traceback (most recent call last):
  File "./f.py", line 4, in <module>
    nlp = spacy.load("en_coref_sm")
  File "/home/d99kris/.local/lib/python2.7/site-packages/spacy/__init__.py", line 19, in load
    return util.load_model(name, **overrides)
  File "/home/d99kris/.local/lib/python2.7/site-packages/spacy/util.py", line 114, in load_model
    return load_model_from_package(name, **overrides)
  File "/home/d99kris/.local/lib/python2.7/site-packages/spacy/util.py", line 137, in load_model_from_package
    return cls.load(**overrides)
  File "/usr/local/lib/python2.7/dist-packages/en_coref_sm/__init__.py", line 15, in load
    coref = NeuralCoref(nlp.vocab)
  File "neuralcoref.pyx", line 527, in en_coref_sm.neuralcoref.neuralcoref.NeuralCoref.__init__
  File "doc.pyx", line 99, in spacy.tokens.doc.Doc.set_extension
AssertionError

(Also getting the same error if downloading and trying to load en_coref_md)

Any ideas what could be wrong?

Using OntoNotes 5.0 to generate coNLL files

Description
I am currently stucked at the "Get the data" section for training the neural coreference model. As a newbie, I have little understanding of converting the skeleton files to conll files. Here are the commands specified in the guide:

skeleton2conll.sh -D [path_to_ontonotes_train_folder] [path_to_skeleton_train_folder] skeleton2conll.sh -D [path_to_ontonotes_test_folder] [path_to_skeleton_test_folder] skeleton2conll.sh -D [path_to_ontonotes_dev_folder] [path_to_skeleton_dev_folder]
h

Result

Here is my command.

Here is the output in case image wont load:

$ ".\conll-2012-scripts\conll-2012\v3\scripts\skeleton2conll.sh" -D ".\ontonotes-release-5.0\data\files\data\" ".\conll-2012-train\conll-2012\"
please make sure that you are pointing to the directory 'conll-2012'

Data
OntoNotes 5.0 from LDC (thru email)
Training, and Development data (both are v4)
Test Data (Official, v9)
CoNLL 2012 scripts (v3)
last four from this link

Steps to reproduce

Download the data
Extract the data
Run the command skeleton2conll.sh -D [path/to/conll-2012-train-v0/data/files/data] [path/to/conll-2012]

Build/Platform
Windows 10
Git Bash (mingw64)
python 3.6
cpu (no CUDA)

Alternatively, if someone knows how to use conll-formatted Onotnotes 5.0, I can also put an issue about it.

Segmentation fault (core dumped)

Python 3.6.2 and neuralcoref-3.0, en_coref_sm the following code produces a segfault.

import spacy
nlp = spacy.load('en_coref_sm')
nlp('''Although the Drive moved to Massachusetts for the 1994 season, the AFL had a number of other teams which it considered "dynasties", including the Tampa Bay Storm (the only team that has existed in some form for all twenty-eight contested seasons), their arch-rival the Orlando Predators, the now-defunct San Jose SaberCats of the present decade, and their rivals the Arizona Rattlers. Where did the Drive franchise relocate to?''')

What can I do to help?

Unable to import modules

Hi,

I get the following error when I try to run either of the simple examples in your README file:

Traceback (most recent call last):
File "/Users/maximild/src/MaxQA/src/test.py", line 1, in
import en_coref_md
File "/Users/maximild/anaconda3/lib/python3.6/site-packages/en_coref_md/init.py", line 6, in
from en_coref_md.neuralcoref import NeuralCoref
File "/Users/maximild/anaconda3/lib/python3.6/site-packages/en_coref_md/neuralcoref/init.py", line 1, in
from .neuralcoref import NeuralCoref
File "strings.pxd", line 23, in init en_coref_md.neuralcoref.neuralcoref
ValueError: spacy.strings.StringStore has the wrong size, try recompiling. Expected 88, got 112

I appear to have successfully downloaded the en_coref_md model, but I am unable to import it. I'm using spaCy 2.0.11 and Python 3.6 if that helps.

Any suggestions on what might be wrong?

Thanks!

Spacy pipeline component

Is there a reason why the suggested way to load neuralcoref is through the spacy load method ("nlp = spacy.load('en_coref_md')") instead of through, what appears to be the recommended api for adding extensions as pipeline methods? https://spacy.io/usage/processing-pipelines. I see that neuralcoref is loaded this way in the cli. Doesn't the spacy load method place restrictions on how the neuralcoref library can be used with other vocab/vectors?

def load(**overrides):
    disable = overrides.get('disable', [])
    overrides['disable'] = disable + ['neuralcoref']
    nlp = load_model_from_init_py(__file__, **overrides)
    coref = NeuralCoref(nlp.vocab)
    coref.from_disk(nlp.path / 'neuralcoref')
    nlp.add_pipe(coref, name='neuralcoref')
    return nlp
""".strip()

about stage swich condition

hi：
after seeing the code, the stage switch is controlled by learning rate decrease,which is not controlled by metrics on the dev set. is this true?

lr = decrease_lr(optim_func)
if args.on_eval_decrease == 'next_stage' or lr <= args.min_lr:
print("Switch to next stage")

AttributeError: coref_clusters

Win 10
spacy==2.0.8
neuralcoref==3.0
en-coref-lg==3.0.0

I get the following error when I run the demo code.

doc = nlp(u'My sister has a dog. She loves him.')
Traceback (most recent call last):

File "", line 1, in
doc = nlp(u'My sister has a dog. She loves him.')

File "D:\anaconda3\lib\site-packages\spacy\language.py", line 341, in call
doc = proc(doc)

File "neuralcoref.pyx", line 566, in en_coref_lg.neuralcoref.neuralcoref.NeuralCoref.call

File "neuralcoref.pyx", line 785, in en_coref_lg.neuralcoref.neuralcoref.NeuralCoref.set_annotations

File "D:\anaconda3\lib\site-packages\spacy\tokens\underscore.py", line 45, in set
return self.setattr(name, value)

File "D:\anaconda3\lib\site-packages\spacy\tokens\underscore.py", line 37, in setattr
raise AttributeError(name)