
deeptype's Introduction

Status: Archive (code is provided as-is, no updates expected)

DeepType: Multilingual Entity Linking through Neural Type System Evolution

This repository contains the code necessary for designing and evolving type systems and for training neural type systems. To read more about this technique and our results, see this blog post or read the paper.

Authors: Jonathan Raiman & Olivier Raiman

Our latest approach to learning symbolic structures from data allows us to discover a set of task-specific constraints on a neural network in the form of a type system, to guide its understanding of documents, and to obtain state-of-the-art accuracy at recognizing entities in natural language. Recognizing entities in documents can be quite challenging, since there are often millions of possible answers. However, when using a type system to constrain the options to only those that semantically "type check," we shrink the answer set and make the problem dramatically easier to solve. Our new results suggest that learning types is a very strong signal for understanding natural language: if types were given to us by an oracle, we find that it is possible to obtain accuracies of 98.6-99% on two benchmark tasks, CoNLL (YAGO) and the TAC KBP 2010 challenge.
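As a toy illustration of how type constraints shrink the candidate set (the candidates and type labels below are made up for illustration and are not the repository's API):

# Toy sketch: filter entity candidates by a predicted type.
# Candidates and type labels here are invented for illustration only.
candidates = {
    "Jaguar (animal)": "animal",
    "Jaguar Cars": "organization",
    "SEPECAT Jaguar": "aircraft",
}
predicted_type = "animal"  # what a type classifier might believe for the mention

# Keep only candidates that semantically "type check" against the prediction.
filtered = [name for name, entity_type in candidates.items()
            if entity_type == predicted_type]
print(filtered)  # ['Jaguar (animal)']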

Data collection

Get the wikiarticle -> wikidata mapping (all languages), plus anchor tags, redirections, category links, and statistics (per language). To store all wikidata ids, their key properties (instance of, part of, etc.), a mapping from all wikipedia article names to a wikidata id, and the wikipedia anchor tags and links, in three languages: English (en), French (fr), and Spanish (es), run:

export DATA_DIR=data/
./extraction/full_preprocess.sh ${DATA_DIR} en fr es
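Once preprocessing finishes, each language directory contains (among other outputs) a marisa trie of anchor strings. A quick sanity check of the output, assuming the default layout produced by full_preprocess.sh (e.g. data/en_trie/):

import marisa_trie
from os.path import join

# Assumes the default layout written by full_preprocess.sh, e.g. data/en_trie/.
language_path = "data/en_trie/"
trie = marisa_trie.Trie().load(join(language_path, "trie.marisa"))

# Look up an anchor string; the result is an internal index into the
# preprocessed arrays, not a Wikidata Q-id.
print(trie.get("Paris"))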

Create a type system manually and check oracle accuracy:

To build a graph projection using a set of rules inside type_classifier.py (or any Python file containing a classify method), and a set of nodes that should not be traversed in blacklist.json:

export LANGUAGE=fr
export DATA_DIR=data/
python3 extraction/project_graph.py ${DATA_DIR}wikidata/ extraction/classifiers/type_classifier.py
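For orientation, a heavily simplified sketch of the shape such a classifier file takes. The wkp helper mirrors the one visible in the tracebacks further down this page, but the satisfy call, its signature, and the exact structure returned by classify are assumptions here; treat extraction/classifiers/type_classifier.py as the real contract.

# Hypothetical, stripped-down classifier sketch (not the shipped type_classifier.py).

def wkp(c, name):
    # Map an English wikipedia article title to its wikidata index.
    return c.article2id["enwiki/" + name][0][0]

def classify(c):
    # `c` is the wikidata type collection loaded by project_graph.py.
    HUMAN = wkp(c, "Human")
    # Assumption: membership in a type is computed by following
    # "instance of" (P31) / "subclass of" (P279) edges from a root node,
    # and classify returns a mapping from type name to that membership.
    is_human = c.satisfy(["P31", "P279"], [HUMAN])
    return {"human": is_human}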

To save a graph projection as a numpy array along with a list of classes to a directory stored in CLASSIFICATION_DIR:

export LANGUAGE=fr
export DATA_DIR=data/
export CLASSIFICATION_DIR=data/type_classification
python3 extraction/project_graph.py ${DATA_DIR}wikidata/ extraction/classifiers/type_classifier.py  --export_classification ${CLASSIFICATION_DIR}
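The export writes, for each classification produced by the classifier, a list of class names (a classes.txt file, later referenced by training configs) next to the projected numpy array. A minimal sanity check, assuming classes.txt ends up at the path below (the exact nesting under CLASSIFICATION_DIR depends on the classifier, so adjust to your export layout):

import os

# Assumed location of the exported class list; adjust to match your export layout.
classes_path = os.path.join("data/type_classification", "classes.txt")
with open(classes_path, encoding="utf-8") as fin:
    classes = [line.rstrip("\n") for line in fin]
print(len(classes), "classes, e.g.", classes[:5])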

To use the saved graph projection on wikipedia data and test how discriminative this classification is (oracle performance), edit the config file to change the classification used, then run:

export DATA_DIR=data/
python3 extraction/evaluate_type_system.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR}

Obtain learnability scores for types

export DATA_DIR=data/
python3 extraction/produce_wikidata_tsv.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR} sample_data.tsv
python3 learning/evaluate_learnability.py sample_data.tsv --out report.json --wikidata ${DATA_DIR}wikidata/

See learning/LearnabilityStudy.ipynb for a visual analysis of the AUC scores.
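The learnability command writes its results to report.json; its exact schema is not documented here, so before opening the notebook a trivial check that the file was produced and parses:

import json

with open("report.json", encoding="utf-8") as fin:
    report = json.load(fin)
print(type(report).__name__, len(report))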

Evolve a type system

python3 extraction/evolve_type_system.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR}  --method cem  --penalty 0.00007

The method can be cem, greedy, beam, or ga, and the penalty is the soft constraint on the size of the type system (lambda in the paper).
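Schematically, --penalty trades type-system size against quality: a candidate type system's score is reduced in proportion to how many types it uses. A sketch of that trade-off only (the exact objective is defined in the paper):

def penalized_score(raw_score, num_types, penalty=0.00007):
    # Soft constraint on type-system size: each additional type must pay for itself.
    return raw_score - penalty * num_types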

Convert a type system solution into a trainable type classifier

The output of evolve_type_system.py is a set of types (root + relation) that can be used to build a type system. To create a config file that can be used to train an LSTM, use the Jupyter notebook extraction/TypeSystemToNeuralTypeSystem.ipynb.

Train a type classifier using a type system

For each language create a training file:

export LANGUAGE=en
python3 extraction/produce_wikidata_tsv.py extraction/configs/${LANGUAGE}_disambiguator_config_export.json /Volumes/Samsung_T3/tahiti/2017-12/${LANGUAGE}_train.tsv  --relative_to /Volumes/Samsung_T3/tahiti/2017-12/

Then create an H5 file from each language containing the mapping from tokens to their entity ids in Wikidata:

export LANGUAGE=en
python3 extraction/produce_windowed_h5_tsv.py  /Volumes/Samsung_T3/tahiti/2017-12/${LANGUAGE}_train.tsv /Volumes/Samsung_T3/tahiti/2017-12/${LANGUAGE}_train.h5 /Volumes/Samsung_T3/tahiti/2017-12/${LANGUAGE}_dev.h5 --window_size 10  --validation_start 1000000 --total_size 200500000

Create a training config covering all languages, my_config.json. Paths to the datasets are relative to the config file (e.g. you can place it in the same directory as the dataset h5 files). Note: set wikidata_path to where you extracted the wikidata information, and classification_path to where you exported the classifications with project_graph.py. See learning/configs for a pre-written config covering English, French, Spanish, German, and Portuguese.

{
    "datasets": [
        {
            "type": "train",
            "path": "en_train.h5",
            "x": 0,
            "ignore": "other",
            "y": [
                {
                    "column": 1,
                    "objective": "type",
                    "classification": "type_classification"
                },...
            ],
            "ignore": "other",
            "comment": "#//#"
        },
        {
            "type": "dev",
            "path": "en_dev.h5",
            "x": 0,
            "ignore": "other",
            "y": [
                {
                    "column": 1,
                    "objective": "type",
                    "classification": "type_classification"
                },...
            ],
            "ignore": "other",
            "comment": "#//#"
        }, ...
    ],
    "features": [
        {
            "type": "word",
            "dimension": 200,
            "max_vocab": 1000000
        },...
    ],
    "objectives": [
        {
            "name": "type",
            "type": "softmax",
            "vocab": "type_classes.txt"
        }, ...
    ],
    "wikidata_path": "wikidata",
    "classification_path": "classifications"
}

Launch training on a single gpu:

CUDA_VISIBLE_DEVICES=0 python3 learning/train_type.py my_config.json --cudnn --fused --hidden_sizes 200 200 --batch_size 256 --max_epochs 10000  --name TypeClassifier --weight_noise 1e-6  --save_dir my_great_model  --anneal_rate 0.9999

Several key parameters:

  • name: main scope for model variables, avoids name clashing when multiple classifiers are loaded
  • batch_size: how many examples are used for training simultaneously, can cause out of memory issues
  • max_epochs: length of training before auto-stopping. In practice this number should be larger than needed.
  • fused: glue all output layers into one, and do a single matrix multiply (recommended).
  • hidden_sizes: how many stacks of LSTMs are used, and their sizes (here 2, each with 200 dimensions).
  • cudnn: use faster CuDNN kernels for training
  • anneal_rate: shrink the learning rate by this factor every 33000 training steps (see the sketch after this list)
  • weight_noise: sprinkle Gaussian noise with this standard deviation on the weights of the LSTM (regularizer, recommended).
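For clarity, the learning-rate schedule that anneal_rate describes, written out as a standalone sketch (not the actual training code):

def annealed_learning_rate(base_lr, step, anneal_rate=0.9999, anneal_every=33000):
    # The learning rate is multiplied by anneal_rate once per 33000-step interval.
    return base_lr * (anneal_rate ** (step // anneal_every))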

To test that training works:

You can test that training works as expected using the dummy training set under learning/test, which contains a Part of Speech CRF objective and a cats vs. dogs log likelihood objective:

python3 learning/train_type.py learning/test/config.json

Installation

Mac OSX

pip3 install -r requirements.txt
pip3 install wikidata_linker_utils_src/

Fedora 25

sudo dnf install redhat-rpm-config
sudo dnf install gcc-c++
sudo pip3 install marisa-trie==0.7.2
sudo pip3 install -r requirements.txt
pip3 install wikidata_linker_utils_src/

deeptype's People

Contributors

cberner, christopherhesse, galtay, jonathanraiman, murtyshikhar


deeptype's Issues

where do you get candidate entities for a mention from?

I'm curious how you compute the candidate entities for some mention. In the paper it says you use a lookup table, but does not say how that lookup table is computed (or maybe I am missing it...)

Does the lookup table do any normalization of the mention text before lookup?

I ask because I'm wondering how the oracle accuracies are so high -- they are higher than the maximum possible recall of using crosswikis plus a wikipedia dump for CoNLL (98% after some text normalization, see Ganea and Hofmann '17).

License

What is the license for this work?

ModuleNotFoundError: No module named 'wikidata_linker_utils.conlleval'

The file conlleval.py is missing while running train_type:

(karim_py3) ubuntu@ip-10-0-5-31:/mnt/big_drive/deeptype$ CUDA_VISIBLE_DEVICES=0 python3 learning/train_type.py my_config.json --cudnn --fused --hidden_sizes 200 200 --batch_size 256 --max_epochs 10000  --name TypeClassifier --weight_noise 1e-6  --save_dir my_great_model  --anneal_rate 0.9999
/home/ubuntu/.virtualenvs/karim_py3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "learning/train_type.py", line 20, in <module>
    from wikidata_linker_utils.conlleval import (
ModuleNotFoundError: No module named 'wikidata_linker_utils.conlleval'

KeyError 'enwiki/Human' extraction/classifiers/type_classifier.py

issue running 'extraction/classifiers/type_classifier.py', please fix.
'enwiki/Human'
Traceback (most recent call last):
  File "extraction/project_graph.py", line 123, in main
    classification = classifier.classify(collection)
  File "extraction/classifiers/type_classifier.py", line 26, in classify
    HUMAN = wkp(c, "Human")
  File "extraction/classifiers/type_classifier.py", line 14, in wkp
    return c.article2id['enwiki/' + name][0][0]
  File "src/marisa_trie.pyx", line 578, in marisa_trie.BytesTrie.__getitem__ (src/marisa_trie.cpp:10859)
KeyError: 'enwiki/Human'

Should type_classifier.py be updated somehow like fast_link_fixer.py ?

What scores does Table 1 in the paper use?

In the paper, Table 1 (c) shows the entity linking scores, but how are they computed, especially the CoNLL scores?

(c) Entity Linking model Comparison. 
CoNLL
Link Count only: 68.614
manual (oracle): 98.217

For example, suppose a mention and its candidate entities look like this:

doc_id, mention, candidate entity, label
-------------------------------------
1, apple, Apple Pie, True
1, apple, Apple (company), False
1, apple, Apple (fruits), False
...

If the system predicts the single entity with the highest score for each mention, I don't need the false candidates to compute accuracy, but I don't know whether Table 1 used the false candidates or not.

How did you compute the Table 1 (c) scores?

Paper: https://arxiv.org/pdf/1802.01021.pdf

Issue replicating accuracy of 0.98

Hi,
Per https://arxiv.org/pdf/1802.01021.pdf Table 1, the tested accuracy is 0.98. The model generated using the provided type classifier has an F1 score of 0.88.

cmd: python3 learning/train_type.py my_config_v2.json --cudnn --fused --hidden_sizes 200 200 --batch_size 256 --max_epochs 10000 --name TypeClassifier --weight_noise 1e-6 --load_dir en_model --save_dir en_model --anneal_rate 0.9999
echo 'done generating model'

Metrics:
precision 88.472
recall 88.369
sentence_correct 86.35% (43497 correct / 50372)
time_sentence_correct 89.41% (11260 correct / 12593)
time_token_correct 91.14% (36461 correct / 40006)
token_correct 88.37% (141411 correct / 160024)
type_sentence_correct 80.66% (10158 correct / 12593)
type_token_correct 83.88% (33557 correct / 40006)

Config file used:
{
    "datasets": [
        {
            "type": "train",
            "path": "en_train.h5",
            "x": 0,
            "ignore": "other",
            "y": [
                {
                    "column": 1,
                    "objective": "type",
                    "classification": "type_classification"
                },
                {
                    "column": 1,
                    "objective": "location",
                    "classification": "location_classification"
                },
                {
                    "column": 1,
                    "objective": "country",
                    "classification": "country_classification"
                },
                {
                    "column": 1,
                    "objective": "time",
                    "classification": "time_classification"
                }
            ]
        },
        {
            "type": "dev",
            "path": "en_dev.h5",
            "x": 0,
            "ignore": "other",
            "y": [
                {
                    "column": 1,
                    "objective": "type",
                    "classification": "type_classification"
                },
                {
                    "column": 1,
                    "objective": "location",
                    "classification": "location_classification"
                },
                {
                    "column": 1,
                    "objective": "country",
                    "classification": "country_classification"
                },
                {
                    "column": 1,
                    "objective": "time",
                    "classification": "time_classification"
                }
            ],
            "comment": "#//#"
        }
    ],
    "features": [
        {
            "type": "word",
            "dimension": 200,
            "max_vocab": 2000000
        },
        {
            "type": "suffix",
            "length": 2,
            "dimension": 6,
            "max_vocab": 1000000
        },
        {
            "type": "suffix",
            "length": 3,
            "dimension": 6,
            "max_vocab": 1000000
        },
        {
            "type": "prefix",
            "length": 2,
            "dimension": 6,
            "max_vocab": 1000000
        },
        {
            "type": "prefix",
            "length": 3,
            "dimension": 6
        },
        {
            "type": "digit"
        },
        {
            "type": "uppercase"
        },
        {
            "type": "punctuation_count"
        }
    ],
    "objectives": [
        {
            "name": "type",
            "type": "softmax",
            "vocab": "type_classification/classes.txt"
        },
        {
            "name": "location",
            "type": "softmax",
            "vocab": "location_classification/classes.txt"
        },
        {
            "name": "country",
            "type": "softmax",
            "vocab": "country_classification/classes.txt"
        },
        {
            "name": "time",
            "type": "softmax",
            "vocab": "time_classification/classes.txt"
        }
    ],
    "wikidata_path": "wikidata",
    "classification_path": "classifications"
}

train_type error

I am having an issue with the training. Even running the example python3 learning/train_type.py learning/test/config.json gives me the following error:

Traceback (most recent call last):
  File "learning/train_type.py", line 2690, in <module>
    main()
  File "learning/train_type.py", line 2623, in main
    create_variables=True)
  File "learning/train_type.py", line 1775, in __init__
    clip_norm=self.clip_norm)
  File "learning/train_type.py", line 1467, in build_model
    is_training=is_training)
  File "learning/train_type.py", line 1020, in build_recurrent
    direction="bidirectional")
TypeError: __init__() got multiple values for argument 'input_mode'
Any idea what might be causing this?
Thanks!

Unable to install the wikidata_linker_utils_src

I tried installing wikidata_linker_utils_src using the command
pip install wikidata_linker_utils_src/
I got the following error. Can someone please help me out? @JonathanRaiman
P.S. I am using Cython version 0.26 as pointed out in the other issues. I am doing the installation on Ubuntu; I also tried it on Windows.

Error compiling Cython file:
------------------------------------------------------------
...
        filename_byte_string = path.encode("utf-8")
        cdef char* fname = filename_byte_string
        cdef FILE* cfile
        cfile = fopen(fname, "rb")
        if cfile == NULL:
            raise FileNotFoundError(2, "No such file: '%s'" % (path,))
                                  ^
------------------------------------------------------------

src/cython/wikidata_linker_utils/successor_mask.pyx:41:35: undeclared name not builtin: FileNotFoundError
building 'wikidata_linker_utils.successor_mask' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/local/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c /mnt/mydir/deeptype/wikidata_linker_utils_src/src/cython/wikidata_linker_utils/successor_mask.cpp -o build/temp.linux-x86_64-2.7/mnt/mydir/deeptype/wikidata_linker_utils_src/src/cython/wikidata_linker_utils/successor_mask.o -std=c++11 -Wno-unused-function -Wno-sign-compare -Wno-unused-local-typedef -Wno-undefined-bool-conversion -O3 -Wno-reorder
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
/mnt/mydir/deeptype/wikidata_linker_utils_src/src/cython/wikidata_linker_utils/successor_mask.cpp:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
 #error Do not use this file, it is the result of a failed Cython compilation.
  ^
cc1plus: warning: unrecognized command line option '-Wno-undefined-bool-conversion'
cc1plus: warning: unrecognized command line option '-Wno-unused-local-typedef'
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

No space left on device error when running full_preprocess.sh

root@32b7ac932d95:/notebooks/deeptype# ./extraction/full_preprocess.sh ${DATA_DIR} en fr es
Downloading wikidata into data/.
Will prepare language: en
Will prepare language: fr
Will prepare language: es
Creating data directory
Done.
Downloading and preparing Wikidata:
Already downloaded data/latest-all.json.bz2
data/latest-all.json.bz2:

bzip2: I/O or other error, bailing out. Possible reason follows.
bzip2: No space left on device
Input file = data/latest-all.json.bz2, output file = data/latest-all.json
bzip2: Deleting output file data/latest-all.json, if it exists.

I've checked the disk space and this does not seem to be the problem.
Does anyone have any idea how to solve this?
Thanks!

Replicating model training and evaluation

Hi,

I'm having some problems replicating the metrics, to summarize the whole issue

Training
When I trained the system using the 108 type axes in type_classifier.py, the training saturates at around 83% F1 score, saying

No improvements for 40 epochs. Stopping ...

and stops.

Could you please let us know the parameters you used for training, the final F1 score of the model that gave you a disambiguation accuracy close to 99% on CoNLL, and how many epochs you trained for? I used the parameters listed in the tutorials:

--cudnn --fused --hidden_sizes 200 200 --batch_size 256 --max_epochs 10000  --name TypeClassifier --weight_noise 1e-6  --save_dir my_great_model  --anneal_rate 0.9999

Evaluation

I used the following evaluation method:

for each mention (in each sentence) in the CoNLL dataset,
 I get the predicted wiki entity first from the LSTM model (trained to 83% F1), and
if it matches exactly the wiki entity in CoNLL for that mention (ground truth),
 I consider that a correct prediction, basically matching wiki QIDs,

and got accuracy close to 75%. Am I doing something wrong here? Please let me know; a few training tips would be great too :) Sorry for bombarding you with questions, hope this will be helpful to other folks who are eagerly waiting to contribute to this work :) Thanks!

KeyError: 'category_link'

I ran this:

python3 learning/evaluate_learnability.py --dataset sample_data.tsv --out report.json --wikidata ${DATA_DIR}wikidata/

and got the following KeyError message:

loading wikidata id -> index
done
/usr/local/lib/python3.5/dist-packages/wikidata_linker_utils/type_collection.py:367: UserWarning: Node 'Q21886162' under `bad_node_pair` is not a known wikidata id.
  oel
Traceback (most recent call last):
  File "learning/evaluate_learnability.py", line 297, in <module>
    main()
  File "learning/evaluate_learnability.py", line 251, in main
    proposal_sets = get_proposal_sets(collection, article_ids, args.seed)
  File "learning/evaluate_learnability.py", line 171, in get_proposal_sets
    relation = collection.relation("category_link")
  File "/usr/local/lib/python3.5/dist-packages/wikidata_linker_utils/type_collection.py", line 112, in relation
    print('load %r (%r)' % (name, self.wikidata_names2prop_names[name],))
KeyError: 'category_link'

Before I ran it, I executed these commands:

export LANGUAGE=en
export DATA_DIR=data/
export CLASSIFICATION_DIR=data/type_classification
./extraction/full_preprocess.sh ${DATA_DIR} en
python3 extraction/project_graph.py ${DATA_DIR}wikidata/ extraction/classifiers/type_classifier.py  --export_classification ${CLASSIFICATION_DIR}
python3 extraction/evaluate_type_system.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR}
python3 extraction/produce_wikidata_tsv.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR} sample_data.tsv

How to run learning/evaluate_learnability.py correctly?

UnicodeEncodeError

The build fails when running ./extraction/full_preprocess.sh ${DATA_DIR} en:

Traceback (most recent call last):
  File "extraction/get_wikiname_to_wikidata.py", line 347, in <module>
    main()
  File "extraction/get_wikiname_to_wikidata.py", line 287, in main
    missing_wikidata_important_properties_fnames
  File "extraction/get_wikiname_to_wikidata.py", line 119, in get_wikidata_mapping
    fout_name2id.write(key + "/" + value["title"] + "\t" + str(index) + "\n")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-14: ordinal not in range(128)

How can I use an evolved type system only?

Hello,
I have a question about using a discovered type system (e.g. cem, greedy, ga, ...).

I train a type classifier with the discovered type system (from evolve_type_system) using the train config from TypeSystemToNeuralTypeSystem.ipynb,
but at prediction time there is no 'type' in get_prob() in learning/SentencePredictions.ipynb.

I guess this is because the discovered type system does not include the type classification.

What should get_prob() be when I use only the discovered type system?

Is there a pre-trained, plug-n-play model that we can play with?

I'd like to experiment with this, but I don't want to train a model; I want to use an existing model (the best one available, if possible).

Is there some way to do something like:

deeptype.run('Jaguars are running in the wild')
And get an output like:
['Jaguar', 'wiki URL']
Or
{'Jaguar': 'wiki URL'}
?

Unable to replicate oracle accuracies for evolved type systems

Hi,
I'm not able to replicate the results of Table 1(a). The oracle accuracy of the type system evolved via CEM that I obtain is ~92.5% whereas the human designed type system obtains an oracle accuracy of ~98%.

Could you please share the JSONs for the evolved type systems if possible?
Congrats, btw on the amazing work and the really well written code-base.

-Shikhar

pip install wikidata_linker_utils_src error on MacOS High Sierra

I'm having trouble installing wikidata_linker_utils_src/ on mac High Sierra.
The error message is

/private/var/folders/pg/2c5pcvy10pgbqdj931yyr82r0000gn/T/pip-auqg6k2b-build/src/cython/wikidata_linker_utils/fast_disambiguate.cpp:585:10: fatal error: 'unordered_set' file not found
#include <unordered_set>
^~~~~~~~~~~~~~~
1 error generated.
error: command 'gcc' failed with exit status 1

Python version is 3.6.5 and Anaconda 3-5.0.0.

I saw a seemingly relevant problem at https://stackoverflow.com/questions/42030598/mac-c-compiler-not-finding-tr1-unordered-map and tried to modify "extra_compile_args" in setup.py for wikidata_linker_utils_src by adding '-stdlib=libstdc++'. It did not work.
Also, I tried to change the compiler to clang++ by adding os.environ["CC"] = "/usr/bin/clang++" at the beginning of setup.py, but got the same error message.

I added '-mmacosx-version-min=10.13' to the extra_compile_args in the original setup.py, and now I see the following error for the string(...) functions.

src/cython/wikidata_linker_utils/successor_mask.pyx:1052:34: ambiguous overloaded method

Error compiling Cython file:
------------------------------------------------------------
...
                        if len(source) > 0:
                            yield source
                    else:
                        num_missing += 1
                        with nogil:
                            missing.push_back(pair[string, string](anchor_string, string(target)))

Hope to get some help in this matter.
Thank you!

successor_mask.pyx:1052:34: ambiguous overloaded method

Error at building the project:

./deeptype/wikidata_linker_utils_src$ python setup.py install

Error compiling Cython file:
------------------------------------------------------------
...
            return_code = sscanf(line, "%256[^\n\t]\t%256[^\n\t]\t%256[^\n\t]", &context, &anchor, &target)
            if return_code != 3:
                num_broken += 1
                continue

            anchor_string = string(anchor)
                                 ^
------------------------------------------------------------

src/cython/wikidata_linker_utils/successor_mask.pyx:1052:34: ambiguous overloaded method

Error compiling Cython file:
------------------------------------------------------------
...
                        if len(source) > 0:
                            yield source
                    else:
                        num_missing += 1
                        with nogil:
                            missing.push_back(pair[string, string](anchor_string, string(target)))
                                                                                       ^
------------------------------------------------------------

src/cython/wikidata_linker_utils/successor_mask.pyx:1075:88: ambiguous overloaded method
building 'wikidata_linker_utils.successor_mask' extension
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/morteza/zProject/deeptype/.env/lib/python3.6/site-packages/numpy/core/include -I/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/include/python3.6m -c /Users/morteza/zProject/deeptype/wikidata_linker_utils_src/src/cython/wikidata_linker_utils/successor_mask.cpp -o build/temp.macosx-10.13-x86_64-3.6/Users/morteza/zProject/deeptype/wikidata_linker_utils_src/src/cython/wikidata_linker_utils/successor_mask.o -std=c++11 -Wno-unused-function -Wno-sign-compare -Wno-unused-local-typedef -Wno-undefined-bool-conversion -O3 -Wno-reorder
/Users/morteza/zProject/deeptype/wikidata_linker_utils_src/src/cython/wikidata_linker_utils/successor_mask.cpp:1:2: error: Do not use this file, it is the result of a failed Cython compilation.
#error Do not use this file, it is the result of a failed Cython compilation.
 ^
1 error generated.
error: command 'clang' failed with exit status 1

Compatibility with Tensorflow 1.14

When I tried to train the deeptype model using Tensorflow 1.14 it failed.

On this line it failed, saying that it got the input_mode argument twice. After some digging through Tensorflow versions, I think the problem lies in these imports:

try:
    RNNCell = tf.nn.rnn_cell.RNNCell
    TFLSTMCell = tf.nn.rnn_cell.LSTMCell
    MultiRNNCell = tf.nn.rnn_cell.MultiRNNCell
    LSTMStateTuple = tf.nn.rnn_cell.LSTMStateTuple
    from tensorflow.contrib.cudnn_rnn import CudnnLSTM
except AttributeError:
    RNNCell = tf.contrib.rnn.RNNCell
    TFLSTMCell = tf.contrib.rnn.LSTMCell
    MultiRNNCell = tf.contrib.rnn.MultiRNNCell
    LSTMStateTuple = tf.contrib.rnn.LSTMStateTuple
    from tensorflow.contrib.cudnn_rnn.python.ops.cudnn_rnn_ops import CudnnLSTM

because in the latest Tensorflow, the imports in the first part are all present, but they point to another (newer) implementation of the classes. What worked for me was to replace the lines with:

RNNCell = tf.contrib.rnn.RNNCell
TFLSTMCell = tf.contrib.rnn.LSTMCell
MultiRNNCell = tf.contrib.rnn.MultiRNNCell
LSTMStateTuple = tf.contrib.rnn.LSTMStateTuple

I also had to replace these lines:

beta1_power = tf.cast(self._beta1_power, var.dtype.base_dtype)
beta2_power = tf.cast(self._beta2_power, var.dtype.base_dtype)

with this code:

        _beta1_power, _beta2_power = self._get_beta_accumulators()
        beta1_power = tf.cast(_beta1_power, var.dtype.base_dtype)
        beta2_power = tf.cast(_beta2_power, var.dtype.base_dtype)

Evaluate Learnability does not give output graph as per the jupyter notebook LearnabilityStudy.ipynb

I ran the evaluate_learnability.py code by creating sample_data.tsv (containing 1000 samples as per the config file). I got the following warning:
/usr/local/lib/python3.5/dist-packages/sklearn/metrics/ranking.py:571: UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless UndefinedMetricWarning

  1. What can be done about the above? Are the sampled articles or their number (1000 seems low) the problem here?
  2. I ran the LearnabilityStudy.ipynb notebook and the figures I obtain do not match what is shown in the paper. I have attached the figures; please help me with the same.
    auc-hist.pdf
    auc-dev-frequency.pdf
    auc-std.pdf

trie.get('CIA') doesn't work on ja_trie

I created ja_trie by running this:

./extraction/full_preprocess.sh ${DATA_DIR} ja

After that, I checked this:

language_path = "../data/ja_trie/"

trie = marisa_trie.Trie().load(
    join(language_path, "trie.marisa")
)

assert trie.get('アメリカ') is not None

and it works. But if the key contains any Latin alphabet characters, I can't get anything:

assert trie.get('CIA') is not None
AssertionErrorTraceback (most recent call last)
<ipython-input-11-41516e200beb> in <module>()
----> 1 assert trie.get('CIA') is not None

AssertionError: 

jawiki definitely contains 'CIA' as anchor text, so why does this happen?

pip install error in requirements file

cssselect>=0.9.1
epub-conversion>=1.0.7
lxml>=3.4.3
msgpack-python>=0.4.8
numpy>=1.11.1
pandas>=0.15.2
progressbar2>=3.6.0
requests>=2.6.0
tensorflow>=1.4.0
wikipedia-ner>=0.0.23
ciseau>=1.0.1
Cython>=0.23.2
marisa-trie>=0.7.2
ciseau

looks like the double entries for ciseau cause a pip install issue ... probably want to remove the bottom entry.

KeyError during ./extraction/full_preprocess.sh ${DATA_DIR} en

This was run yesterday if that helps narrow down the problem (i.e. wikidata dump used). I'm happy to run with different params if that would help.

./extraction/full_preprocess.sh ${DATA_DIR} en
Downloading wikidata into data/.
Will prepare language: en
Creating data directory
Done.
Downloading and preparing Wikidata:
Already compressed Wikidata
Traceback (most recent call last):
  File "extraction/get_wikiname_to_wikidata.py", line 353, in <module>
    main()
  File "extraction/get_wikiname_to_wikidata.py", line 266, in main
    prop_names2wikidata_names[prop] for prop in important_properties
  File "extraction/get_wikiname_to_wikidata.py", line 266, in <listcomp>
    prop_names2wikidata_names[prop] for prop in important_properties
KeyError: 'P31'

the paper is hard to understand

At the end of section 3, I find the scoring function S(e, m, D, A, θ) and take the argmax of it to get the optimal entity.

However, the formulation seems to have no relation to the candidate entity; it's just P_link(e, m) multiplied by a constant, given a single mention.
I don't know whether I have a bad understanding of the paper, or whether P_i(m) needs to be multiplied by a deterministic P_i(e) obtained from wikidata?
I sincerely hope for your answer.

KeyError in fast_link_fixer.py

When running 'full_preprocess.sh' I seem to get a key error; the exact error message is:

Traceback (most recent call last):
  File "extraction/fast_link_fixer.py", line 594, in <module>
    main()
  File "extraction/fast_link_fixer.py", line 456, in main
    initialize_globals(c)
  File "extraction/fast_link_fixer.py", line 101, in initialize_globals
    ASPECT_OF_HIST = wkd(c, "Q17524420")
  File "extraction/fast_link_fixer.py", line 72, in wkd
    return c.name2index[name]
  File "/usr/local/lib/python3.5/dist-packages/wikidata_linker_utils/wikidata_ids.py", line 20, in __getitem__
    value = self.marisa[key]
  File "src/marisa_trie.pyx", line 577, in marisa_trie.BytesTrie.__getitem__
KeyError: 'Q17524420'

More specifically, it occurs when running this line of the shell script:
python3 extraction/fast_link_fixer.py ${DATA_DIR}wikidata ${DATA_DIR}${LANGUAGE}_trie ${DATA_DIR}${LANGUAGE}_trie_fixed

Would anybody be able to help me with this problem?

How to retrieve the wikidata Q-ID of an item using the marisa trie and the offset/value numpy arrays?

The pre-processed data output consists of a trie and a bunch of numpy arrays containing values and offsets.

  • Is it possible to get the QID of an item from this data? For example if I do
    trie = marisa_trie.Trie().load( join(language_path, "trie.marisa") )
    anchor = trie.get("human")

I get a number 592252 which is not a Q-ID in wikidata for 'human'.
I was trying to play around with the offset and value numpy arrays to retrieve the Q-ID but wasn't able to.
Please let me know how to do the above. @JonathanRaiman (Sorry for bugging you again :) )

Train in a custom dataset

Hi, first thanks for your repo and the docs. I want to use the library for the following case and want to know if you think it could work:
I have legal text with labeled references to laws and codes for these laws. I want to train the model to link the reference of the law with its code.

I understand that I'll have to create a dataset like:

{"id": "doc1",
 "text": ".. as the Mortgage Law says ..."
"links" : [{"start":6 ,
             "stop":17,
             "target" "LW101"}]}

Would it be possible to train with only these features?

Thanks again!

fix imports

some imports are unused or deprecated after tensorflow v1, working on it now..

how to generate the learnability score output

In LearnabilityStudy.ipynb, four JSON files are needed, but I only get one file after running 'python3 learning/evaluate_learnability.py sample_data.tsv --out report.json --wikidata ${DATA_DIR}wikidata/'.

alpha in equation and how to select parameters in training type classifier

Hi Jonathan, I have trained a model with samples generated by:
python3 extraction/produce_windowed_h5_tsv.py /data/datasets/wikipedia/en_train.tsv /data/datasets/wikipedia/en_train.h5 /data/datasets/wikipedia/en_dev.h5 --window_size 10 --validation_start 1000000 --total_size 200500000
and
python3 learning/train_type.py my_config.json --cudnn --fused --hidden_sizes 200 200 --batch_size 256 --max_epochs 10000 --name TypeClassifier --weight_noise 1e-6 --save_dir my_great_model --anneal_rate 0.9999 --device cpu --faux_cudnn.

I tested the disambiguation on the blog example (tokenized with only a split):
The man saw a Jaguar speed on the high way.
The prey saw the jaguar cross the jungle.

The ranking score is based on #15 only considering the type classifier.
The result I get is :
The man saw a Jaguar speed on the high way.
Without type: Jaguar Cars: 0.61 Jaguar 0.29 SEPECAT Jaguar 0.019
With type: Jaguar Cars: 0.67 Jaguar 0.31 SEPECAT Jaguar 0.020

The prey saw the jaguar cross the jungle.
Without type: Jaguar Cars: 0.61 Jaguar 0.29 SEPECAT Jaguar 0.019
With type: Jaguar Cars: 0.67 Jaguar 0.31 SEPECAT Jaguar 0.021

Compared to the post, the probabilities without type are very close to those reported. The probabilities with type are a little off. I don't know whether this comes from underfitting of the classifier model or from picking the wrong hyperparameters.

testing model on a single sentence

Hello, it would be great if you could post example code to run the model on the example mentioned in the blog post (see below), with and without types; that would be really helpful, thanks! I have the model trained already :)

The man saw a Jaguar
with types
Jaguar Cars 0.70
jaguar 0.12

without types
Jaguar Cars 0.60
jaguar 0.29

blacklist in argument triggers error

from README file,

python extraction/project_graph.py ${DATA_DIR}wikidata/ extraction/blacklist.json extraction/classifiers/type_classifier.py  --export_classification ${CLASSIFICATION_DIR}

having "extraction/blacklist.json" in the command makes the main function error out here (https://github.com/openai/deeptype/blob/master/extraction/project_graph.py#L87). It looks like the blacklist is read no matter what here (https://github.com/openai/deeptype/blob/master/extraction/project_graph.py#L101)

What do P_type and P_entity mean?

I want to extend an existing Entity Linking system using DeepType. Maybe it relates to section 2 (Task) of the paper, but I don't understand the details of the following formula:

P(e|x) ∝ P_type(types(e)|x) · P_entity(e|x,types(e))

Is this formula really related to my problem? And what do P_type and P_entity mean?
I need an easy-to-understand explanation of how DeepType is used for Entity Linking.

How can I find the associated types of an entity?

I want to extract the optimal entity from a list of candidates using Equation 6 from the paper:

[image: Equation 6 from the paper]

But I don't know how to find the associated types of each candidate entity. Is there an easy way to get the associated types from a Wikipedia anchor text or Wikipedia ID?

FileNotFoundError: [Errno 2] No such file or directory: 'data/location_classification/classes.txt'

Hi, I have an issue with the step "To use the saved graph projection on wikipedia data to test out how discriminative this classification is (Oracle performance)":

export DATA_DIR=data/
python extraction/evaluate_type_system.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR}

FileNotFoundError: [Errno 2] No such file or directory: 'data/location_classification/classes.txt'

Getting NaN loss after a few epochs (~30)

Did anyone face the same problem?
I get a message "Loss is NaN". Any leads on how to resolve this?

loss is NaN.
Exception ignored in: <generator object prefetch_generator at 0x7f8d1b884780>
Traceback (most recent call last):
  File "/mnt/research-6f/aranjan/dtype/learning/generator.py", line 29, in prefetch_generator
    t.join()
  File "/usr/lib/python3.5/threading.py", line 1051, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

FileNotFoundError: [Errno 2] No such file or directory: 'data/wikidata/wikidata_wikititle2wikidata.tsv'

Hi,

I get this issue when running:
python3 extraction/project_graph.py ${DATA_DIR}wikidata/ extraction/classifiers/type_classifier.py

Full error:
Traceback (most recent call last):
  File "extraction/project_graph.py", line 183, in <module>
    main()
  File "extraction/project_graph.py", line 93, in main
    cache=args.use_cache
  File "/Users/parisakhan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/wikidata_linker_utils/type_collection.py", line 53, in __init__
    prefix=prefix
  File "/Users/parisakhan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/wikidata_linker_utils/wikidata_ids.py", line 54, in load_names
    with open(path, "rt", encoding="UTF-8") as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'data/wikidata/wikidata_wikititle2wikidata.tsv'

Thanks

Undefined name 'config' in wikipedia.py

flake8 testing of https://github.com/openai/deeptype on Python 3.6.4

$ time flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./wikidata_linker_utils_src/src/python/wikidata_linker_utils/wikipedia.py:47:12: F821 undefined name 'config'
    name = config.wiki.split("/")[-1]
           ^
1     F821 undefined name 'config'

Use Python 3.6 and Cython 0.26 to install wikidata_linker_utils

There are many problems when installing wikidata_linker_utils with Python 3.7 or Cython > 0.26.

Issue about Cython >0.26:

Problems using Python3.7:

  • Python 3.7 introduced a change which made async a reserved keyword, but Cython < 0.27.2 has a statement await = None, which will cause an error: SyntaxError: invalid syntax
  • If you use Python 3.7 together with Cython 0.27.2, you will get many problems like this:

‘PyThreadState {aka struct _ts}’ has no member named ‘exc_type’; did you mean ‘curexc_type’

Property 'FACET_OF' missing in wikidata properties

First of all, awesome work 👏 and thank you so much for sharing your knowledge, really appreciate it 😃

It seems that the property FACET_OF is missing from the wikidata properties here; this caused the error below while running
python3 extraction/fast_link_fixer.py ${DATA_DIR}wikidata ${DATA_DIR}${LANGUAGE}_trie ${DATA_DIR}${LANGUAGE}_trie_fixed

inside extraction/full_preprocess.sh.
As a quick fix I just commented it out (this line) and it works fine after that.

load 'P31' ('instance of')
inverting relation 'P31' ('instance of')
load inverted 'P31' ('instance of')
load 'P279' ('subclass of')
inverting relation 'P279' ('subclass of')
load inverted 'P279' ('subclass of')
load 'P360' ('is a list of')
inverting relation 'P360' ('is a list of')
load inverted 'P360' ('is a list of')
load 'P361' ('part of')
inverting relation 'P361' ('part of')
load inverted 'P361' ('part of')
Traceback (most recent call last):
  File "extraction/fast_link_fixer.py", line 594, in <module>
    main()
  File "extraction/fast_link_fixer.py", line 469, in main
    num_category_link=8
  File "extraction/fast_link_fixer.py", line 273, in fix
    {"steps": [wprop.INSTANCE_OF, wprop.FACET_OF]},
AttributeError: module 'wikidata_linker_utils.wikidata_properties' has no attribute 'FACET_OF'

evaluation method

Hi! It would be great if you could tell us how you evaluated the system. Since the model outputs types, did you use types to evaluate the system, or did you use the wiki entities extracted from the types? As wiki entities would require entity normalization (British = United Kingdom = Great Britain = European nation, etc.), this would be helpful in evaluating the models we trained. Thanks!
