ink-usc / usc-ds-relationextraction
Distantly Supervised Relation Extraction
License: MIT License
Hi,
I am very interested in how CoType works. However, I can't find any description of the output files generated by retype. Specifically, when I did a test run with KBP, it produced the following output:
data/results/KBP/em:
total 309M
-rw-rw-r-- 1 msk msk 240M Nov 30 12:34 emb_retype_feature.txt
-rw-rw-r-- 1 msk msk 70M Nov 30 12:34 emb_retype_mention.txt
-rw-rw-r-- 1 msk msk 62K Nov 30 12:34 emb_retype_type.txt
data/results/KBP/rm:
total 440M
-rw-rw-r-- 1 msk msk 368M Nov 30 12:35 emb_retype_feature.txt
-rw-rw-r-- 1 msk msk 73M Nov 30 12:34 emb_retype_mention.txt
-rw-rw-r-- 1 msk msk 3.4K Nov 30 12:35 emb_retype_type.txt
-rw-rw-r-- 1 msk msk 48K Nov 30 12:35 prediction_emb_retype_cosine.txt
-rw-rw-r-- 1 msk msk 2.8K Nov 30 12:35 tune_thresholds_emb_retype_cosine.txt
However, I don't understand what each file contains or what its rows and columns represent. Could you please provide this information?
Thank you,
Minseung Kim
I'm wondering why you wrote that Hoffmann's data has only 395 sentences (both on this GitHub page and in the paper). Did you preprocess the data?
Can you give me the output files for the NYT dataset? My run just failed.
My personal email: [email protected]
Thanks a lot.
When I run ./run.sh, I get this error:
Learn CoType embeddings...
code/Model/retype/retype: error while loading shared libraries: libgsl.so.19: cannot open shared object file: No such file or directory
Although I have installed GSL, it did not work.
Are there any suggestions? Thanks a lot!
Stanford CoreNLP requires Python >= 3.6, while this code is based on Python 2, so import stanza.nlp fails with a module import error. Is there legacy support for Python 2, or was I running the code the wrong way?
PubMed-BioInfer: 100k PubMed paper abstracts as training data and 1,530 manually labeled biomedical paper abstracts from BioInfer (Pyysalo et al., 2007) as test data. It consists of 94 relation types (protein-protein interactions) and over 2,000 entity types (from the MeSH ontology).
But at the Google Drive link there are only 1,580 sentences in train.json and 708 sentences in test.json.
And for the Wiki-KBP dataset, the sizes are 23,784 for train and 289 for test.
So I wonder if something is wrong?
Learn CoType embeddings...
./run.sh: line 20: code/Model/retype/retype: Permission denied
gpuws@gpuws32g:/media/gpuws/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/DS-RelationExtraction$ bash run.sh
NYT
Generate Features...
Start nlp parsing
CRITICAL:root:ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with url: /?properties=%7B%27outputFormat%27%3A+%27serialized%27%2C+%27annotators%27%3A+%27ssplit%2Ctokenize%2Cpos%27%2C+%27serializer%27%3A+%27edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer%27%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fde9a953e10>: Failed to establish a new connection: [Errno 99] Cannot assign requested address',))",),)
CRITICAL:root:It seems like we've temporarily ran out of ports. Taking a 30s break...
CRITICAL:root:ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with url: /?properties=%7B%27outputFormat%27%3A+%27serialized%27%2C+%27annotators%27%3A+%27ssplit%2Ctokenize%2Cpos%27%2C+%27serializer%27%3A+%27edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer%27%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fde9a6e0a50>: Failed to establish a new connection: [Errno 99] Cannot assign requested address',))",),)
CRITICAL:root:It seems like we've temporarily ran out of ports. Taking a 30s break...
CRITICAL:root:ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with url: /?properties=%7B%27outputFormat%27%3A+%27serialized%27%2C+%27annotators%27%3A+%27ssplit%2Ctokenize%2Cpos%27%2C+%27serializer%27%3A+%27edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer%27%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fde9a930190>: Failed to establish a new connection: [Errno 99] Cannot assign requested address',))",),)
CRITICAL:root:It seems like we've temporarily ran out of ports. Taking a 30s break...
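The Errno 99 failures above usually mean either that the Stanford CoreNLP server is not reachable on localhost:9000, or that the client has churned through local ephemeral ports faster than the OS recycles them. A minimal pre-flight check, using only the standard library (the helper name wait_for_server is my own, not part of the repo):

```python
import socket
import time

def wait_for_server(host="localhost", port=9000, retries=3, delay=1.0):
    """Return True once a TCP connection to host:port succeeds,
    or False after all retries fail. Pausing between attempts also
    gives the OS time to recycle ephemeral ports."""
    for attempt in range(retries):
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            if attempt < retries - 1:
                time.sleep(delay)
    return False
```

Calling something like this before "Start nlp parsing" would turn the repeated connection errors into a single clear failure when the CoreNLP server was never started.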
I'm currently applying CoType to Freebase relation extraction. Between two entity mentions in a sentence there may exist multiple relations; for example, both president_of and nationality hold between Obama and the USA.
In distantSupervision.py of StructMineDataPipeline, multiple relations are joined with commas (line 96). In my case I selected 100 relations as my target relations, but after feature generation, intermediate/my_extraction/rm/type.txt contains 1,200 types, where each comma-joined combination of relations is treated as its own type (president_of,nationality is one of the types). Does CoType handle the case of multiple relations between an entity pair?
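To illustrate the issue described above: because distantSupervision.py joins co-occurring relations with commas, an entity pair carrying two relations yields one composite type string. A sketch of how such labels could be split back into individual target relations (split_joined_labels is a hypothetical helper, not existing repo code):

```python
def split_joined_labels(raw_label):
    """Split a comma-joined relation label into individual relations.
    "president_of,nationality" names two relations, but if kept as one
    string it becomes a single spurious type in type.txt."""
    return [label for label in raw_label.split(",") if label]

print(split_joined_labels("president_of,nationality"))
# ['president_of', 'nationality']
```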
I found your work very interesting and I'd like to apply it to extract Freebase relations.
Could you give some hints on which code I should modify so that CoType can extract Freebase relations?
Thanks!
I scanned through the test sets of both NYT and KBP as referred to in the link. The test.json in KBP has a field 'articleID' beginning with 'NYT', which suggests it might actually be the NYT dataset.
I hope the dataset links on Google Drive can be updated :)
During "Evaluate on Relation Extraction" I hit raise Exception('LIBLINEAR library not found.'), and when I bypass the try block
it reports "OSError: ~/anaconda2/bin/../lib/libgomp.so.1: version `GOMP_4.0' not found (required by /data/guozhao/sourceCodes/CoType-master/code/Classifier/liblinear.so.3)".
I use the default settings for the KBP dataset.
code/Model/retype/retype -data $Data -mode j -size 50 -negative 3 -threads 3 -alpha 0.0001 -samples 1 -iters 400 -lr 0.02 -transWeight 1.0
But my best F1 score only reaches 0.317, which is far from the benchmark performance (0.369).
My Result:
Best threshold: 0.54 . Precision: 0.303225806442 . Recall: 0.33215547702 . F1: 0.317032035472
Benchmark performance:
CoType (Ren et al., 2017): Precision 0.348, Recall 0.406, F1 0.369
Would anyone help me to reproduce the benchmark result? Thank you very much.
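For what it's worth, the F1 in the result line above is just the harmonic mean of the printed precision and recall, which is easy to check:

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Numbers from the result line above:
print(round(f1_score(0.303225806442, 0.33215547702), 6))  # 0.317032
```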
When I run the script run.sh, I get the error below:
Learn CoType embeddings...
code/Model/retype/retype: error while loading shared libraries: libgsl.so.19: cannot open shared object file: No such file or directory
...
IOError: [Errno 2] No such file or directory: 'data/results/KBP/rm/prediction_emb_retype_cosine.txt'
Please tell me how to build libgsl. I already ran the make command in the retype folder.
The content of intermediate/KBP/em/type.txt is strange for the KBP corpus. I haven't tested it on other corpora.
type.txt:
/ 0 466411
p 1 177541
e 2 270192
r 3 288178
s 4 182380
o 5 480385
n 6 380277
l 7 159181
d 8 28569
i 9 303219
, 10 170403
m 11 42962
a 12 236368
c 13 196298
h 14 38605
t 15 312620
u 16 59252
g 17 59662
z 18 16877
_ 19 16360
w 20 5846
k 21 2778
f 22 3852
y 23 52196
b 24 4285
v 25 21442
x 26 1
For relation mentions, type.txt looks fine:
None 0 111343
per:country_of_death 1 10265
per:country_of_birth 2 12040
per:parents 3 6984
per:children 4 6984
per:religion 5 2584
per:countries_of_residence 6 1
I believe the problem is caused by ner_feature.py, lines 84 to 91. For relation mentions, mention.labels is a list, so the code works fine. However, for entity mentions, mention.labels is a single string (e.g. /person/soldier,/person/actor,/person/politician,/person/author,/person), so the for loop iterates over the characters of the string, producing the weird type.txt.
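A sketch of the fix described above: normalize mention.labels to a list before iterating, so a comma-joined string is split into whole type paths instead of characters (normalize_labels is a hypothetical helper, not the actual repo code):

```python
def normalize_labels(labels):
    """Return labels as a list of type strings.
    Relation mentions already carry a list; entity mentions carry one
    comma-joined string, and iterating over a string yields characters,
    which is what filled em/type.txt with single-letter "types"."""
    if isinstance(labels, str):
        return labels.split(",")
    return list(labels)

print(normalize_labels("/person/soldier,/person/actor"))
# ['/person/soldier', '/person/actor']
```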
Are the parameters sentId and start (inside entityMentions) used during training/testing?
git clone [email protected]:stanfordnlp/stanza.git
Cloning into 'stanza'...
Warning: Permanently added the RSA host key for IP address '52.74.223.119' to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
(py27) shuzhilian@gpuserver0301:/home/chency/wll/pythonworkspace/DS-RelationExtraction/code/DataProcessor$
Hello,
In your paper you state that you have 47 entity types for the NYT corpus.
However, in the uploaded data, as well as in the data-generation code, I find only 3 (person, organization, location).
Do you treat these types as supertypes that subsume the others? If so, can you provide the correct mapping from Freebase?
Thanks in advance!
I tried to run ./run.sh, but it didn't create mention_type_test.txt. I get this:
Traceback (most recent call last):
File "code/Evaluation/convertPredictionToJson.py", line 13, in
with open(typeMapFile) as typeF:
IOError: [Errno 2] No such file or directory: 'data/intermediate/KBP/rm/type.txt'
Traceback (most recent call last):
File "code/Evaluation/tune_threshold.py", line 66, in
ground_truth = load_labels(indir + '/mention_type_test.txt')
File "/workspace/png/CoType-master/code/Evaluation/evaluation.py", line 17, in load_labels
with open(file_name) as f:
IOError: [Errno 2] No such file or directory: 'data/intermediate/KBP/rm/mention_type_test.txt'