
deeptype's Issues

How to retrieve the wikidata Q-ID of an item using the marisa trie and the offset/value numpy arrays?

The pre-processed data output consists of a trie and a bunch of numpy arrays containing values and offsets.

  • Is it possible to get the QID of an item from this data? For example if I do
    trie = marisa_trie.Trie().load( join(language_path, "trie.marisa") )
    anchor = trie.get("human")

I get the number 592252, which is not the wikidata Q-ID for 'human'.
I was trying to play around with the offset and value numpy arrays to retrieve the Q-ID, but wasn't able to.
Please let me know how to do the above. @JonathanRaiman (Sorry for bugging you again :) )
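
For anyone landing here, a minimal sketch of the lookup I would try. The array file names (trie_index2indices_offsets.npy, trie_index2indices_values.npy), the CSR-style offsets convention, and the assumption that wikidata/wikidata_ids.txt lists one QID per line in index order are my guesses about the preprocessing output, not confirmed behavior:

import numpy as np
import marisa_trie
from os.path import join

# Hypothetical mapping from an anchor string to candidate QIDs.
# language_path is the per-language dir from the question;
# wikidata_path is assumed to be the wikidata/ output dir.
trie = marisa_trie.Trie().load(join(language_path, "trie.marisa"))
offsets = np.load(join(language_path, "trie_index2indices_offsets.npy"))
values = np.load(join(language_path, "trie_index2indices_values.npy"))
with open(join(wikidata_path, "wikidata_ids.txt"), "rt") as fin:
    qids = [line.strip() for line in fin]

anchor_index = trie["human"]  # e.g. 592252: a trie row index, not a QID
start = offsets[anchor_index - 1] if anchor_index > 0 else 0
end = offsets[anchor_index]
for entity_index in values[start:end]:
    print(qids[entity_index])  # candidate QIDs for the anchor "human"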

Use Python 3.6 and Cython 0.26 to install wikidata_linker_utils

There are many problems when installing wikidata_linker_utils with Python 3.7 or Cython > 0.26.

Issues with Cython > 0.26:

Problems using Python 3.7:

  • Python 3.7 made async and await reserved keywords, but Cython < 0.27.2 contains a statement await = None, which causes SyntaxError: invalid syntax
  • If you use Python 3.7 together with Cython 0.27.2, you will get many errors like this:

‘PyThreadState {aka struct _ts}’ has no member named ‘exc_type’; did you mean ‘curexc_type’

KeyError during ./extraction/full_preprocess.sh ${DATA_DIR} en

This was run yesterday, if that helps narrow down the problem (i.e. which wikidata dump was used). I'm happy to run with different params if that would help.

./extraction/full_preprocess.sh ${DATA_DIR} en
Downloading wikidata into data/.
Will prepare language: en
Creating data directory
Done.
Downloading and preparing Wikidata:
Already compressed Wikidata
Traceback (most recent call last):
  File "extraction/get_wikiname_to_wikidata.py", line 353, in <module>
    main()
  File "extraction/get_wikiname_to_wikidata.py", line 266, in main
    prop_names2wikidata_names[prop] for prop in important_properties
  File "extraction/get_wikiname_to_wikidata.py", line 266, in <listcomp>
    prop_names2wikidata_names[prop] for prop in important_properties
KeyError: 'P31'

pip install error in requirements file

cssselect>=0.9.1
epub-conversion>=1.0.7
lxml>=3.4.3
msgpack-python>=0.4.8
numpy>=1.11.1
pandas>=0.15.2
progressbar2>=3.6.0
requests>=2.6.0
tensorflow>=1.4.0
wikipedia-ner>=0.0.23
ciseau>=1.0.1
Cython>=0.23.2
marisa-trie>=0.7.2
ciseau

Looks like the duplicate entries for ciseau cause a pip install issue ... probably want to remove the bottom entry.

ModuleNotFoundError: No module named 'wikidata_linker_utils.conlleval'

The file conlleval.py is missing when running train_type.py.

(karim_py3) ubuntu@ip-10-0-5-31:/mnt/big_drive/deeptype$ CUDA_VISIBLE_DEVICES=0 python3 learning/train_type.py my_config.json --cudnn --fused --hidden_sizes 200 200 --batch_size 256 --max_epochs 10000  --name TypeClassifier --weight_noise 1e-6  --save_dir my_great_model  --anneal_rate 0.9999
/home/ubuntu/.virtualenvs/karim_py3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "learning/train_type.py", line 20, in <module>
    from wikidata_linker_utils.conlleval import (
ModuleNotFoundError: No module named 'wikidata_linker_utils.conlleval'

KeyError: 'category_link'

I ran this:

python3 learning/evaluate_learnability.py --dataset sample_data.tsv --out report.json --wikidata ${DATA_DIR}wikidata/

and got KeyError message.

loading wikidata id -> index
done
/usr/local/lib/python3.5/dist-packages/wikidata_linker_utils/type_collection.py:367: UserWarning: Node 'Q21886162' under `bad_node_pair` is not a known wikidata id.
Traceback (most recent call last):
  File "learning/evaluate_learnability.py", line 297, in <module>
    main()
  File "learning/evaluate_learnability.py", line 251, in main
    proposal_sets = get_proposal_sets(collection, article_ids, args.seed)
  File "learning/evaluate_learnability.py", line 171, in get_proposal_sets
    relation = collection.relation("category_link")
  File "/usr/local/lib/python3.5/dist-packages/wikidata_linker_utils/type_collection.py", line 112, in relation
    print('load %r (%r)' % (name, self.wikidata_names2prop_names[name],))
KeyError: 'category_link'

Before I ran it, I executed these commands:

export LANGUAGE=en
export DATA_DIR=data/
export CLASSIFICATION_DIR=data/type_classification
./extraction/full_preprocess.sh ${DATA_DIR} en
python3 extraction/project_graph.py ${DATA_DIR}wikidata/ extraction/classifiers/type_classifier.py  --export_classification ${CLASSIFICATION_DIR}
python3 extraction/evaluate_type_system.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR}
python3 extraction/produce_wikidata_tsv.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR} sample_data.tsv

How to run learning/evaluate_learnability.py correctly?

pip install wikidata_linker_utils_src error on MacOS High Sierra

I'm having trouble installing wikidata_linker_utils_src/ on macOS High Sierra.
The error message is

/private/var/folders/pg/2c5pcvy10pgbqdj931yyr82r0000gn/T/pip-auqg6k2b-build/src/cython/wikidata_linker_utils/fast_disambiguate.cpp:585:10: fatal error: 'unordered_set' file not found
#include <unordered_set>
^~~~~~~~~~~~~~~
1 error generated.
error: command 'gcc' failed with exit status 1

The Python version is 3.6.5, with Anaconda 3-5.0.0.

I saw a seemingly relevant problem at https://stackoverflow.com/questions/42030598/mac-c-compiler-not-finding-tr1-unordered-map and tried to modify "extra_compile_args" in setup.py for wikidata_linker_utils_src by adding '-stdlib=libstdc++'. It did not work.
I also tried to change the compiler to clang++ by adding os.environ["CC"] = "/usr/bin/clang++" at the beginning of setup.py, but got the same error message.

I added '-mmacosx-version-min=10.13' to the extra_compile_args in the original setup.py, and now I see the following error for the string(...) functions.

src/cython/wikidata_linker_utils/successor_mask.pyx:1052:34: ambiguous overloaded method

Error compiling Cython file:
------------------------------------------------------------
...
                        if len(source) > 0:
                            yield source
                    else:
                        num_missing += 1
                        with nogil:
                            missing.push_back(pair[string, string](anchor_string, string(target)))

Hope to get some help in this matter.
Thank you!

What scores does Table 1 use on the paper?

In the paper, Table 1 (c) shows the entity linking scores, but how are they computed, especially the CoNLL scores?

(c) Entity Linking model Comparison. 
CoNLL
Link Count only: 68.614
manual (oracle): 98.217

For example, suppose we have some mentions and their candidate entities:

doc_id, mention, candidate entity, label
-------------------------------------
1, apple, Apple Pie, True
1, apple, Apple (company), False
1, apple, Apple (fruits), False
...

If the model predicts the single highest-scoring entity for each mention, the false candidates aren't needed to compute accuracy, but I don't know whether Table 1 used the false candidates or not.

How did you compute the Table 1 (c) scores?

Paper: https://arxiv.org/pdf/1802.01021.pdf
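
For what it's worth, here is how accuracy could be computed under the "highest-scoring candidate per mention" reading. This is only a sketch over the toy rows above; the scores are invented, and this is not necessarily the paper's actual procedure:

from collections import defaultdict

# Toy rows mirroring the table above:
# (doc_id, mention, candidate, model_score, is_gold)
rows = [
    (1, "apple", "Apple Pie", 0.2, True),
    (1, "apple", "Apple (company)", 0.7, False),
    (1, "apple", "Apple (fruits)", 0.1, False),
]

by_mention = defaultdict(list)
for doc_id, mention, candidate, score, is_gold in rows:
    by_mention[(doc_id, mention)].append((score, candidate, is_gold))

# A mention counts as correct when its top-scoring candidate is the
# gold one; false candidates matter only through the argmax.
correct = sum(max(cands)[2] for cands in by_mention.values())
print(correct / len(by_mention))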

License

What is the license for this work?

Compatibility with Tensorflow 1.14

When I tried to train the deeptype model using Tensorflow 1.14, it failed.

It failed on this line, saying that it got the input_mode argument twice. After some digging through Tensorflow versions, I think the problem lies in these imports:

try:
    RNNCell = tf.nn.rnn_cell.RNNCell
    TFLSTMCell = tf.nn.rnn_cell.LSTMCell
    MultiRNNCell = tf.nn.rnn_cell.MultiRNNCell
    LSTMStateTuple = tf.nn.rnn_cell.LSTMStateTuple
    from tensorflow.contrib.cudnn_rnn import CudnnLSTM
except AttributeError:
    RNNCell = tf.contrib.rnn.RNNCell
    TFLSTMCell = tf.contrib.rnn.LSTMCell
    MultiRNNCell = tf.contrib.rnn.MultiRNNCell
    LSTMStateTuple = tf.contrib.rnn.LSTMStateTuple
    from tensorflow.contrib.cudnn_rnn.python.ops.cudnn_rnn_ops import CudnnLSTM

because in the latest Tensorflow the imports in the first branch are all present, but they point to a newer implementation of the classes. What worked for me was to replace those lines with:

RNNCell = tf.contrib.rnn.RNNCell
TFLSTMCell = tf.contrib.rnn.LSTMCell
MultiRNNCell = tf.contrib.rnn.MultiRNNCell
LSTMStateTuple = tf.contrib.rnn.LSTMStateTuple

I also had to replace these lines:

beta1_power = tf.cast(self._beta1_power, var.dtype.base_dtype)
beta2_power = tf.cast(self._beta2_power, var.dtype.base_dtype)

with this code:

        _beta1_power, _beta2_power = self._get_beta_accumulators()
        beta1_power = tf.cast(_beta1_power, var.dtype.base_dtype)
        beta2_power = tf.cast(_beta2_power, var.dtype.base_dtype)

KeyError in fast_link_fixer.py

When running 'full_preprocess.sh' I seem to get a KeyError; the exact error message is:

Traceback (most recent call last):
  File "extraction/fast_link_fixer.py", line 594, in <module>
    main()
  File "extraction/fast_link_fixer.py", line 456, in main
    initialize_globals(c)
  File "extraction/fast_link_fixer.py", line 101, in initialize_globals
    ASPECT_OF_HIST = wkd(c, "Q17524420")
  File "extraction/fast_link_fixer.py", line 72, in wkd
    return c.name2index[name]
  File "/usr/local/lib/python3.5/dist-packages/wikidata_linker_utils/wikidata_ids.py", line 20, in __getitem__
    value = self.marisa[key]
  File "src/marisa_trie.pyx", line 577, in marisa_trie.BytesTrie.__getitem__
KeyError: 'Q17524420'

More specifically, it occurs when running this line of the shell script:
python3 extraction/fast_link_fixer.py ${DATA_DIR}wikidata ${DATA_DIR}${LANGUAGE}_trie ${DATA_DIR}${LANGUAGE}_trie_fixed

Would anybody be able to help me with this problem?
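
A quick way to confirm whether the id is simply absent from your dump, relying only on the dict-style lookup shown in the traceback (the TypeCollection constructor arguments are my guess; pass whatever you used as ${DATA_DIR}wikidata):

from wikidata_linker_utils.type_collection import TypeCollection

# c mirrors the collection built inside fast_link_fixer.py.
c = TypeCollection("data/wikidata/")
try:
    idx = c.name2index["Q17524420"]
except KeyError:
    idx = None  # the entity is absent from this dump (e.g. deleted or merged upstream)
print(idx)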

Getting NaN loss after few epochs ~30

Did anyone face the same problem?
I get a message "Loss is NaN". Any leads on how to resolve this?

loss is NaN.
Exception ignored in: <generator object prefetch_generator at 0x7f8d1b884780>
Traceback (most recent call last):
  File "/mnt/research-6f/aranjan/dtype/learning/generator.py", line 29, in prefetch_generator
    t.join()
  File "/usr/lib/python3.5/threading.py", line 1051, in join
    raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

trie.get('CIA') doesn't work on ja_trie

I created ja_trie by running this:

./extraction/full_preprocess.sh ${DATA_DIR} ja

After that, checked this:

language_path = "../data/ja_trie/"

trie = marisa_trie.Trie().load(
    join(language_path, "trie.marisa")
)

assert trie.get('アメリカ') is not None

and it works. But if the key contains any Latin characters, I can't get anything:

assert trie.get('CIA') is not None
AssertionError                Traceback (most recent call last)
<ipython-input-11-41516e200beb> in <module>()
----> 1 assert trie.get('CIA') is not None

AssertionError: 

jawiki definitely contains 'CIA' as anchor text, so why does this happen?
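
One thing worth ruling out (an assumption on my part, not confirmed behavior of the preprocessing pipeline): the anchors may be stored in a normalized form, e.g. lowercased, so the exact-case lookup misses.

# Hypothetical diagnostic: try a few case variants of the anchor.
for variant in ('CIA', 'cia', 'Cia'):
    print(variant, trie.get(variant))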

evaluation method

Hi! It would be great if you could tell us how you evaluated the system. Since the model outputs types, did you use the types themselves to evaluate it, or the wiki entities extracted from the types? As wiki entities would require entity normalization (British = United Kingdom = Great Britain = European Nation = ... etc.), this would be helpful in evaluating the models we trained. Thanks!

successor_mask.pyx:1052:34: ambiguous overloaded method

Error when building the project:

./deeptype/wikidata_linker_utils_src$ python setup.py install

Error compiling Cython file:
------------------------------------------------------------
...
            return_code = sscanf(line, "%256[^\n\t]\t%256[^\n\t]\t%256[^\n\t]", &context, &anchor, &target)
            if return_code != 3:
                num_broken += 1
                continue

            anchor_string = string(anchor)
                                 ^
------------------------------------------------------------

src/cython/wikidata_linker_utils/successor_mask.pyx:1052:34: ambiguous overloaded method

Error compiling Cython file:
------------------------------------------------------------
...
                        if len(source) > 0:
                            yield source
                    else:
                        num_missing += 1
                        with nogil:
                            missing.push_back(pair[string, string](anchor_string, string(target)))
                                                                                       ^
------------------------------------------------------------

src/cython/wikidata_linker_utils/successor_mask.pyx:1075:88: ambiguous overloaded method
building 'wikidata_linker_utils.successor_mask' extension
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/morteza/zProject/deeptype/.env/lib/python3.6/site-packages/numpy/core/include -I/usr/local/Cellar/python/3.6.4_3/Frameworks/Python.framework/Versions/3.6/include/python3.6m -c /Users/morteza/zProject/deeptype/wikidata_linker_utils_src/src/cython/wikidata_linker_utils/successor_mask.cpp -o build/temp.macosx-10.13-x86_64-3.6/Users/morteza/zProject/deeptype/wikidata_linker_utils_src/src/cython/wikidata_linker_utils/successor_mask.o -std=c++11 -Wno-unused-function -Wno-sign-compare -Wno-unused-local-typedef -Wno-undefined-bool-conversion -O3 -Wno-reorder
/Users/morteza/zProject/deeptype/wikidata_linker_utils_src/src/cython/wikidata_linker_utils/successor_mask.cpp:1:2: error: Do not use this file, it is the result of a failed Cython compilation.
#error Do not use this file, it is the result of a failed Cython compilation.
 ^
1 error generated.
error: command 'clang' failed with exit status 1

UnicodeEncodeError

Build fails ./extraction/full_preprocess.sh ${DATA_DIR} en:

Traceback (most recent call last):
  File "extraction/get_wikiname_to_wikidata.py", line 347, in <module>
    main()
  File "extraction/get_wikiname_to_wikidata.py", line 287, in main
    missing_wikidata_important_properties_fnames
  File "extraction/get_wikiname_to_wikidata.py", line 119, in get_wikidata_mapping
    fout_name2id.write(key + "/" + value["title"] + "\t" + str(index) + "\n")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-14: ordinal not in range(128)
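
A likely fix, assuming the script opens its output file without an explicit encoding (so Python falls back to the ASCII locale default): pass the encoding when opening. The variable name below is illustrative, not the script's actual name.

# Hypothetical patch around get_wikiname_to_wikidata.py line 119:
# open the name2id output with an explicit UTF-8 encoding so titles
# containing non-ASCII characters can be written.
fout_name2id = open(name2id_path, "wt", encoding="UTF-8")
fout_name2id.write(key + "/" + value["title"] + "\t" + str(index) + "\n")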

Undefined name 'config' in wikipedia.py

flake8 testing of https://github.com/openai/deeptype on Python 3.6.4

$ time flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./wikidata_linker_utils_src/src/python/wikidata_linker_utils/wikipedia.py:47:12: F821 undefined name 'config'
    name = config.wiki.split("/")[-1]
           ^
1     F821 undefined name 'config'

Evaluate Learnability does not give output graph as per the jupyter notebook LearnabilityStudy.ipynb

I ran the evaluate_learnability.py code after creating sample_data.tsv (containing 1000 samples, as per the config file). I got the following warning:
/usr/local/lib/python3.5/dist-packages/sklearn/metrics/ranking.py:571: UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless

  1. What can be done about the above? Is the set of sampled articles or their number (1000 seems low) the problem here?
  2. I ran the LearnabilityStudy.ipynb notebook and the figures I obtain do not match what is shown in the paper. I have attached the figures. Please help me with this.
    auc-hist.pdf
    auc-dev-frequency.pdf
    auc-std.pdf

alpha in equation and how to select parameters in training type classifier

Hi Jonathan, I have trained a model on samples generated by:
python3 extraction/produce_windowed_h5_tsv.py /data/datasets/wikipedia/en_train.tsv /data/datasets/wikipedia/en_train.h5 /data/datasets/wikipedia/en_dev.h5 --window_size 10 --validation_start 1000000 --total_size 200500000
and
python3 learning/train_type.py my_config.json --cudnn --fused --hidden_sizes 200 200 --batch_size 256 --max_epochs 10000 --name TypeClassifier --weight_noise 1e-6 --save_dir my_great_model --anneal_rate 0.9999 --device cpu --faux_cudnn

I tested the disambiguation on the blog example (with splitting only):
The man saw a Jaguar speed on the high way.
The prey saw the jaguar cross the jungle.

The ranking score follows #15, considering only the type classifier.
The result I get is:
The man saw a Jaguar speed on the high way.
Without type: Jaguar Cars: 0.61 Jaguar 0.29 SEPECAT Jaguar 0.019
With type: Jaguar Cars: 0.67 Jaguar 0.31 SEPECAT Jaguar 0.020

The prey saw the jaguar cross the jungle.
Without type: Jaguar Cars: 0.61 Jaguar 0.29 SEPECAT Jaguar 0.019
With type: Jaguar Cars: 0.67 Jaguar 0.31 SEPECAT Jaguar 0.021

Compared to the post, the probabilities without types are very close to those reported. The probabilities with types are a little off. I don't know whether this comes from underfitting of the classifier model or from picking the wrong hyperparameters.

Unable to replicate oracle accuracies for evolved type systems

Hi,
I'm not able to replicate the results of Table 1(a). The oracle accuracy of the type system evolved via CEM that I obtain is ~92.5%, whereas the human-designed type system obtains an oracle accuracy of ~98%.

Could you please share the JSONs for the evolved type systems if possible?
Congrats, btw on the amazing work and the really well written code-base.

-Shikhar

KeyError 'enwiki/Human' extraction/classifiers/type_classifier.py

Issue running extraction/classifiers/type_classifier.py; please fix.
'enwiki/Human'
Traceback (most recent call last):
  File "extraction/project_graph.py", line 123, in main
    classification = classifier.classify(collection)
  File "extraction/classifiers/type_classifier.py", line 26, in classify
    HUMAN = wkp(c, "Human")
  File "extraction/classifiers/type_classifier.py", line 14, in wkp
    return c.article2id['enwiki/' + name][0][0]
  File "src/marisa_trie.pyx", line 578, in marisa_trie.BytesTrie.__getitem__ (src/marisa_trie.cpp:10859)
KeyError: 'enwiki/Human'

Should type_classifier.py be updated somehow, like fast_link_fixer.py?

blacklist in argument triggers error

from README file,

python extraction/project_graph.py ${DATA_DIR}wikidata/ extraction/blacklist.json extraction/classifiers/type_classifier.py  --export_classification ${CLASSIFICATION_DIR}

having "extraction/blacklist.json" in the command makes the main function error out here (https://github.com/openai/deeptype/blob/master/extraction/project_graph.py#L87). It looks like the blacklist is read no matter what here (https://github.com/openai/deeptype/blob/master/extraction/project_graph.py#L101)

Train in a custom dataset

Hi, first thanks for your repo and the docs. I want to use the library for the following case and want to know if you think it could work:
I have legal text with labeled references to laws, and codes for those laws. I want to train the model to link each reference of a law to its code.

I understand that I'll have to create a dataset like:

{"id": "doc1",
 "text": ".. as the Mortgage Law says ..."
"links" : [{"start":6 ,
             "stop":17,
             "target" "LW101"}]}

Would it be possible to train with only these features?

Thanks again!

the paper is hard to understand

At the end of Section 3, I find the scoring function S(e, m, D, A, θ); taking its argmax gives the optimal entity.

However, the formulation seems to have no dependence on the candidate entity; it is just P_link(e, m) multiplied by a constant for a given mention.
I don't know whether I have misunderstood the paper, or whether P_i(m) needs to be multiplied by a deterministic P_i(e) obtained from wikidata?
I sincerely hope for your answer.

where do you get candidate entities for a mention from?

I'm curious how you compute the candidate entities for a given mention. The paper says you use a lookup table, but not how that lookup table is built (or maybe I am missing it...)

Does the lookup table do any normalization of the mention text before lookup?

I ask because I'm wondering how the oracle accuracies are so high -- they are higher than the maximum possible recall of using CrossWikis plus a Wikipedia dump for CoNLL (98% after some text normalization; see Ganea and Hofmann '17)

Property 'FACET_OF' missing in wikidata properties

First of all, awesome work 👏 and thank you so much for sharing your knowledge, really appreciate it 😃

It seems that the property FACET_OF is missing from the wikidata properties here; this caused the error below while running

python3 extraction/fast_link_fixer.py ${DATA_DIR}wikidata ${DATA_DIR}${LANGUAGE}_trie ${DATA_DIR}${LANGUAGE}_trie_fixed

inside extraction/full_preprocess.sh.
As a quick fix I just commented it out (this line) and it works fine after that.

load 'P31' ('instance of')
inverting relation 'P31' ('instance of')
load inverted 'P31' ('instance of')
load 'P279' ('subclass of')
inverting relation 'P279' ('subclass of')
load inverted 'P279' ('subclass of')
load 'P360' ('is a list of')
inverting relation 'P360' ('is a list of')
load inverted 'P360' ('is a list of')
load 'P361' ('part of')
inverting relation 'P361' ('part of')
load inverted 'P361' ('part of')
Traceback (most recent call last):
  File "extraction/fast_link_fixer.py", line 594, in <module>
    main()
  File "extraction/fast_link_fixer.py", line 469, in main
    num_category_link=8
  File "extraction/fast_link_fixer.py", line 273, in fix
    {"steps": [wprop.INSTANCE_OF, wprop.FACET_OF]},
AttributeError: module 'wikidata_linker_utils.wikidata_properties' has no attribute 'FACET_OF'
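
An alternative to commenting the line out, assuming wikidata_linker_utils.wikidata_properties is just a module of property-id string constants (as the P31/P279 log lines above suggest): define the missing constant yourself. On Wikidata, "facet of" is property P1269.

# Hypothetical one-line addition to wikidata_linker_utils/wikidata_properties.py:
FACET_OF = "P1269"  # Wikidata property "facet of"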

No space left on device error when running full_preprocess.sh

root@32b7ac932d95:/notebooks/deeptype# ./extraction/full_preprocess.sh ${DATA_DIR} en fr es
Downloading wikidata into data/.
Will prepare language: en
Will prepare language: fr
Will prepare language: es
Creating data directory
Done.
Downloading and preparing Wikidata:
Already downloaded data/latest-all.json.bz2
data/latest-all.json.bz2:

bzip2: I/O or other error, bailing out. Possible reason follows.
bzip2: No space left on device
Input file = data/latest-all.json.bz2, output file = data/latest-all.json
bzip2: Deleting output file data/latest-all.json, if it exists.

I've checked the disk space and this does not seem to be the problem.
Does anyone have any idea how to solve this?
Thanks!
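
One hedged suggestion: check the free space on the filesystem that actually holds data/ (inside a container, the overlay filesystem can be much smaller than the host disk), since the decompressed latest-all.json is many times larger than the .bz2:

import shutil

# Free space on the filesystem backing data/ specifically.
print(shutil.disk_usage("data/"))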

train_type error

I am having an issue with the training. Even running the example python3 learning/train_type.py learning/test/config.json gives me the following error:

Traceback (most recent call last):
  File "learning/train_type.py", line 2690, in <module>
    main()
  File "learning/train_type.py", line 2623, in main
    create_variables=True)
  File "learning/train_type.py", line 1775, in __init__
    clip_norm=self.clip_norm)
  File "learning/train_type.py", line 1467, in build_model
    is_training=is_training)
  File "learning/train_type.py", line 1020, in build_recurrent
    direction="bidirectional")
TypeError: __init__() got multiple values for argument 'input_mode'
Any idea what might be causing this?
Thanks!

FileNotFoundError: [Errno 2] No such file or directory: 'data/location_classification/classes.txt'

Hi, I have an issue with the README step "To use the saved graph projection on wikipedia data to test out how discriminative this classification is (Oracle performance) (edit the config file to make changes to the classification used)":

export DATA_DIR=data/
python extraction/evaluate_type_system.py extraction/configs/en_disambiguator_config_export_small.json --relative_to ${DATA_DIR}

FileNotFoundError: [Errno 2] No such file or directory: 'data/location_classification/classes.txt'
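
A guess based on the commands quoted elsewhere in this tracker: the classification directories (and their classes.txt files) are written by the graph-projection export step, so running that export first with the matching output directory may resolve the missing file:

python3 extraction/project_graph.py ${DATA_DIR}wikidata/ extraction/classifiers/type_classifier.py --export_classification ${CLASSIFICATION_DIR}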

Is there a pre-trained, plug-n-play model that we can play with?

I'd like to experiment with this, but I don't want to train a model; I want to use an existing model (the best one available, if possible).

Is there some way to do something like:

deeptype.run('Jaguars are running in the wild')

and get an output like ['Jaguar', 'wiki URL'] or {'Jaguar': 'wiki URL'}?

Issue replicating accuracy of 0.98

Hi,
Per https://arxiv.org/pdf/1802.01021.pdf Table 1, the reported accuracy is 0.98. The model generated using the provided type classifier has an F1 score of 0.88.

cmd: python3 learning/train_type.py my_config_v2.json --cudnn --fused --hidden_sizes 200 200 --batch_size 256 --max_epochs 10000 --name TypeClassifier --weight_noise 1e-6 --load_dir en_model --save_dir en_model --anneal_rate 0.9999
echo 'done generating model'

Metrics:
precision 88.472
recall 88.369
sentence_correct 86.35% (43497 correct / 50372)
time_sentence_correct 89.41% (11260 correct / 12593)
time_token_correct 91.14% (36461 correct / 40006)
token_correct 88.37% (141411 correct / 160024)
type_sentence_correct 80.66% (10158 correct / 12593)
type_token_correct 83.88% (33557 correct / 40006)

Config file used:
{
    "datasets": [
        {
            "type": "train",
            "path": "en_train.h5",
            "x": 0,
            "ignore": "other",
            "y": [
                {"column": 1, "objective": "type", "classification": "type_classification"},
                {"column": 1, "objective": "location", "classification": "location_classification"},
                {"column": 1, "objective": "country", "classification": "country_classification"},
                {"column": 1, "objective": "time", "classification": "time_classification"}
            ]
        },
        {
            "type": "dev",
            "path": "en_dev.h5",
            "x": 0,
            "ignore": "other",
            "y": [
                {"column": 1, "objective": "type", "classification": "type_classification"},
                {"column": 1, "objective": "location", "classification": "location_classification"},
                {"column": 1, "objective": "country", "classification": "country_classification"},
                {"column": 1, "objective": "time", "classification": "time_classification"}
            ],
            "comment": "#//#"
        }
    ],
    "features": [
        {"type": "word", "dimension": 200, "max_vocab": 2000000},
        {"type": "suffix", "length": 2, "dimension": 6, "max_vocab": 1000000},
        {"type": "suffix", "length": 3, "dimension": 6, "max_vocab": 1000000},
        {"type": "prefix", "length": 2, "dimension": 6, "max_vocab": 1000000},
        {"type": "prefix", "length": 3, "dimension": 6},
        {"type": "digit"},
        {"type": "uppercase"},
        {"type": "punctuation_count"}
    ],
    "objectives": [
        {"name": "type", "type": "softmax", "vocab": "type_classification/classes.txt"},
        {"name": "location", "type": "softmax", "vocab": "location_classification/classes.txt"},
        {"name": "country", "type": "softmax", "vocab": "country_classification/classes.txt"},
        {"name": "time", "type": "softmax", "vocab": "time_classification/classes.txt"}
    ],
    "wikidata_path": "wikidata",
    "classification_path": "classifications"
}

how to generate the learnability score output

In LearnabilityStudy.ipynb, four json files are needed, but I only get one file after running 'python3 learning/evaluate_learnability.py sample_data.tsv --out report.json --wikidata ${DATA_DIR}wikidata/'

Replicating model training and evaluation

Hi,

I'm having some problems replicating the metrics. To summarize the whole issue:

Training
When I trained the system using the 108 type axes in type_classifier.py, the training saturates at around 83% F1, saying

No improvements for 40 epochs. Stopping ...

and stops.

Could you please let us know the parameters you used for training, the final F1 score of the model that gave you a disambiguation accuracy close to 99% on CoNLL, and how many epochs you trained for? I used the parameters listed in the tutorials:

--cudnn --fused --hidden_sizes 200 200 --batch_size 256 --max_epochs 10000  --name TypeClassifier --weight_noise 1e-6  --save_dir my_great_model  --anneal_rate 0.9999

Evaluation

I used the following evaluation method:

for each mention (in each sentence) in the CoNLL dataset,
  get the predicted wiki entity from the LSTM model (trained to 83% F1), and
  if it matches exactly the ground-truth wiki entity in CoNLL for that mention,
  count it as a correct prediction (basically matching wiki QIDs)

and got an accuracy close to 75%. Am I doing something wrong here? Please let me know; a few training tips would be great too :) Sorry for bombarding you with questions; hope this will be helpful to other folks who are eagerly waiting to contribute to this work :) Thanks!

How can I find the associated types of an entity?

I want to extract the optimal entity from a list of candidates using Equation 6 from the paper:

[image: Equation 6 from the paper]

But I don't know how to find the associated types of each candidate entity. Is there an easy way to get associated types from a Wikipedia anchor text or Wikipedia ID?
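
A hedged sketch of one way to look types up, pieced together from the APIs visible in the tracebacks elsewhere in this tracker (TypeCollection, name2index, relation); the constructor arguments, the indexing behavior of the relation object, and the example QID are all my assumptions:

from wikidata_linker_utils.type_collection import TypeCollection

# Map a QID to its internal index, then follow the "instance of" (P31)
# relation to the indices of its types.
collection = TypeCollection("data/wikidata/")
idx = collection.name2index["Q35120"]     # hypothetical candidate entity
instance_of = collection.relation("P31")  # the relation name shown in the logs
type_indices = instance_of[idx]           # assumed: per-row successor lookup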

How can I use an evolved type system only?

Hello,
I have a question about using a discovered type system (e.g. cem, greedy, ga, ...).

I trained the type classifier with a discovered type system (from evolve_type_system) using the train config from TypeSystemToNeuralTypeSystem.ipynb,
but at prediction time there is no 'type' in get_prob() in learning/SentencePredictions.ipynb.

I guess this is because the discovered type system does not include the type classification.

What should get_prob() look like when I use only a discovered type system?

testing model on a single sentence

Hello, it would be great if you could post example code that runs the model on the example mentioned in the blog post (see below), with and without types; that would be really helpful, thanks! I have the model trained already :)

The man saw a Jaguar
with types
Jaguar Cars 0.70
jaguar 0.12

without types
Jaguar Cars 0.60
jaguar 0.29

What do P_type and P_entity mean?

I want to extend an existing Entity Linking system using DeepType. It's maybe related to Section 2 (Task) of the paper, but I don't understand the details of the following formula:

P(e|x) ∝ P_type(types(e)|x) · P_entity(e|x,types(e))

Is this formula really related to my problem? And what do P_type and P_entity mean?
I need an easy-to-understand explanation of how to use DeepType for Entity Linking.

fix imports

Some imports are unused or deprecated after tensorflow v1; working on it now...

Unable to install the wikidata_linker_utils_src

I tried installing wikidata_linker_utils_src using the command
pip install wikidata_linker_utils_src/
I got the following error. Can someone please help me out? @JonathanRaiman
P.S. I am using Cython version 0.26, as pointed out in the other issues. I am doing my installation on Ubuntu. I also tried it on Windows.

Error compiling Cython file:
------------------------------------------------------------
...
        filename_byte_string = path.encode("utf-8")
        cdef char* fname = filename_byte_string
        cdef FILE* cfile
        cfile = fopen(fname, "rb")
        if cfile == NULL:
            raise FileNotFoundError(2, "No such file: '%s'" % (path,))
                                  ^
------------------------------------------------------------

src/cython/wikidata_linker_utils/successor_mask.pyx:41:35: undeclared name not builtin: FileNotFoundError
building 'wikidata_linker_utils.successor_mask' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/local/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/python2.7 -c /mnt/mydir/deeptype/wikidata_linker_utils_src/src/cython/wikidata_linker_utils/successor_mask.cpp -o build/temp.linux-x86_64-2.7/mnt/mydir/deeptype/wikidata_linker_utils_src/src/cython/wikidata_linker_utils/successor_mask.o -std=c++11 -Wno-unused-function -Wno-sign-compare -Wno-unused-local-typedef -Wno-undefined-bool-conversion -O3 -Wno-reorder
cc1plus: warning: command line option '-Wstrict-prototypes' is valid for C/ObjC but not for C++
/mnt/mydir/deeptype/wikidata_linker_utils_src/src/cython/wikidata_linker_utils/successor_mask.cpp:1:2: error: #error Do not use this file, it is the result of a failed Cython compilation.
 #error Do not use this file, it is the result of a failed Cython compilation.
  ^
cc1plus: warning: unrecognized command line option '-Wno-undefined-bool-conversion'
cc1plus: warning: unrecognized command line option '-Wno-unused-local-typedef'
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
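
One detail worth checking (my reading of the log, not a confirmed diagnosis): the compile line includes -I/usr/include/python2.7 and the build directory says linux-x86_64-2.7, yet FileNotFoundError only exists as a builtin in Python 3, so pip here is probably the Python 2 pip. Installing with the Python 3 interpreter explicitly might help:

python3 -m pip install wikidata_linker_utils_src/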

FileNotFoundError: [Errno 2] No such file or directory: 'data/wikidata/wikidata_wikititle2wikidata.tsv'

Hi,

I get this issue when running:
python3 extraction/project_graph.py ${DATA_DIR}wikidata/ extraction/classifiers/type_classifier.py

Full error:
Traceback (most recent call last):
  File "extraction/project_graph.py", line 183, in <module>
    main()
  File "extraction/project_graph.py", line 93, in main
    cache=args.use_cache
  File "/Users/parisakhan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/wikidata_linker_utils/type_collection.py", line 53, in __init__
    prefix=prefix
  File "/Users/parisakhan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/wikidata_linker_utils/wikidata_ids.py", line 54, in load_names
    with open(path, "rt", encoding="UTF-8") as fin:
FileNotFoundError: [Errno 2] No such file or directory: 'data/wikidata/wikidata_wikititle2wikidata.tsv'

Thanks
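
A hedged guess: wikidata_wikititle2wikidata.tsv is one of the files written by the preprocessing script quoted elsewhere in this tracker, so the download/preprocess step likely has to complete successfully first:

./extraction/full_preprocess.sh ${DATA_DIR} en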
