ink-usc / usc-ds-relationextraction
Distantly Supervised Relation Extraction
License: MIT License
Hi,
I am very interested in how CoType works. However, I can't find any description of the output files generated by retype. Specifically, when I did a test run with KBP, it produced the following output:
data/results/KBP/em:
total 309M
-rw-rw-r-- 1 msk msk 240M Nov 30 12:34 emb_retype_feature.txt
-rw-rw-r-- 1 msk msk 70M Nov 30 12:34 emb_retype_mention.txt
-rw-rw-r-- 1 msk msk 62K Nov 30 12:34 emb_retype_type.txt
data/results/KBP/rm:
total 440M
-rw-rw-r-- 1 msk msk 368M Nov 30 12:35 emb_retype_feature.txt
-rw-rw-r-- 1 msk msk 73M Nov 30 12:34 emb_retype_mention.txt
-rw-rw-r-- 1 msk msk 3.4K Nov 30 12:35 emb_retype_type.txt
-rw-rw-r-- 1 msk msk 48K Nov 30 12:35 prediction_emb_retype_cosine.txt
-rw-rw-r-- 1 msk msk 2.8K Nov 30 12:35 tune_thresholds_emb_retype_cosine.txt
However, I don't understand what each file contains or what its rows and columns represent. Could you please provide this information?
Thank you,
Minseung Kim
I'm wondering why you wrote that Hoffmann's data has only 395 sentences (both on this GitHub page and in the paper). Did you preprocess the data?
Can you give me the output files for the NYT dataset? My run just failed.
My personal email: [email protected]
Thanks a lot.
When I run ./run.sh, I get this error:
Learn CoType embeddings...
code/Model/retype/retype: error while loading shared libraries: libgsl.so.19: cannot open shared object file: No such file or directory
Although I have installed GSL, it did not work.
Are there any suggestions? Thanks a lot!
Stanford CoreNLP requires Python >= 3.6, while this code is based on Python 2, so import stanza.nlp fails with a module import error. Is there legacy support for Python 2, or was I running the code the wrong way?
PubMed-BioInfer: 100k PubMed paper abstracts as training data and 1,530 manually labeled biomedical paper abstracts from BioInfer (Pyysalo et al., 2007) as test data. It consists of 94 relation types (protein-protein interactions) and over 2,000 entity types (from the MeSH ontology).
But at the Google Drive link there are only 1,580 sentences in train.json and 708 sentences in test.json.
And for the Wiki-KBP dataset, the sizes are 23,784 for train and 289 for test.
So I wonder if something is wrong?
Learn CoType embeddings...
./run.sh: line 20: code/Model/retype/retype: Permission denied
gpuws@gpuws32g:/media/gpuws/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/DS-RelationExtraction$ bash run.sh
NYT
Generate Features...
Start nlp parsing
CRITICAL:root:ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with url: /?properties=%7B%27outputFormat%27%3A+%27serialized%27%2C+%27annotators%27%3A+%27ssplit%2Ctokenize%2Cpos%27%2C+%27serializer%27%3A+%27edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer%27%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fde9a953e10>: Failed to establish a new connection: [Errno 99] Cannot assign requested address',))",),)
CRITICAL:root:It seems like we've temporarily ran out of ports. Taking a 30s break...
CRITICAL:root:ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with url: /?properties=%7B%27outputFormat%27%3A+%27serialized%27%2C+%27annotators%27%3A+%27ssplit%2Ctokenize%2Cpos%27%2C+%27serializer%27%3A+%27edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer%27%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fde9a6e0a50>: Failed to establish a new connection: [Errno 99] Cannot assign requested address',))",),)
CRITICAL:root:It seems like we've temporarily ran out of ports. Taking a 30s break...
CRITICAL:root:ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=9000): Max retries exceeded with url: /?properties=%7B%27outputFormat%27%3A+%27serialized%27%2C+%27annotators%27%3A+%27ssplit%2Ctokenize%2Cpos%27%2C+%27serializer%27%3A+%27edu.stanford.nlp.pipeline.ProtobufAnnotationSerializer%27%7D (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fde9a930190>: Failed to establish a new connection: [Errno 99] Cannot assign requested address',))",),)
CRITICAL:root:It seems like we've temporarily ran out of ports. Taking a 30s break...
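The Errno 99 failures above usually mean either that the Stanford CoreNLP server is not reachable on localhost:9000, or that the client has churned through local ephemeral ports faster than the OS recycles them. A minimal pre-flight check, using only the standard library (the helper name wait_for_server is my own, not part of the repo):

```python
import socket
import time

def wait_for_server(host="localhost", port=9000, retries=3, delay=1.0):
    """Return True once a TCP connection to host:port succeeds,
    or False after all retries fail. Pausing between attempts also
    gives the OS time to recycle ephemeral ports."""
    for attempt in range(retries):
        try:
            with socket.create_connection((host, port), timeout=2):
                return True
        except OSError:
            if attempt < retries - 1:
                time.sleep(delay)
    return False
```

Calling something like this before "Start nlp parsing" would turn the repeated connection errors into a single clear failure when the CoreNLP server was never started.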
I'm currently applying CoType to Freebase relation extraction. Between two entity mentions in a sentence there may exist multiple relations; for example, both president_of and nationality hold between Obama and the USA.
In distantSupervision.py of StructMineDataPipeline, multiple relations are joined with commas (line 96). In my case I selected 100 relations as my target relations, but after feature generation, intermediate/my_extraction/rm/type.txt contains 1,200 types, where each comma-joined combination of relations is treated as its own type (president_of,nationality is one of the types). Does CoType handle the case of multiple relations between an entity pair?
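To illustrate the issue described above: because distantSupervision.py joins co-occurring relations with commas, an entity pair carrying two relations yields one composite type string. A sketch of how such labels could be split back into individual target relations (split_joined_labels is a hypothetical helper, not existing repo code):

```python
def split_joined_labels(raw_label):
    """Split a comma-joined relation label into individual relations.
    "president_of,nationality" names two relations, but if kept as one
    string it becomes a single spurious type in type.txt."""
    return [label for label in raw_label.split(",") if label]

print(split_joined_labels("president_of,nationality"))
# ['president_of', 'nationality']
```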
I found your work very interesting and I'd like to apply it to extract Freebase relations.
Could you give some hints on which code I should modify so that CoType can extract Freebase relations?
Thanks!
I scanned through the test sets of both NYT and KBP as referred to in the link. The test.json in KBP has a field 'articleID' beginning with 'NYT', which suggests it might actually be the NYT dataset.
I hope the dataset links on Google Drive can be updated :)
During "Evaluate on Relation Extraction" I hit raise Exception('LIBLINEAR library not found.'), and when I bypass the try block
it reports "OSError: ~/anaconda2/bin/../lib/libgomp.so.1: version `GOMP_4.0' not found (required by /data/guozhao/sourceCodes/CoType-master/code/Classifier/liblinear.so.3)".
I use the default settings for the KBP dataset.
code/Model/retype/retype -data $Data -mode j -size 50 -negative 3 -threads 3 -alpha 0.0001 -samples 1 -iters 400 -lr 0.02 -transWeight 1.0
But my best F1 score only reaches 0.317, which is far from the benchmark performance (0.369).
My Result:
Best threshold: 0.54 . Precision: 0.303225806442 . Recall: 0.33215547702 . F1: 0.317032035472
Benchmark performance:
CoType (Ren et al., 2017): Precision 0.348, Recall 0.406, F1 0.369
Would anyone help me to reproduce the benchmark result? Thank you very much.
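For what it's worth, the F1 in the result line above is just the harmonic mean of the printed precision and recall, which is easy to check:

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Numbers from the result line above:
print(round(f1_score(0.303225806442, 0.33215547702), 6))  # 0.317032
```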
When I run the script run.sh, I get the error below:
Learn CoType embeddings...
code/Model/retype/retype: error while loading shared libraries: libgsl.so.19: cannot open shared object file: No such file or directory
...
IOError: [Errno 2] No such file or directory: 'data/results/KBP/rm/prediction_emb_retype_cosine.txt'
Please tell me how to build libgsl. I already ran the make command in the retype folder.
The content of intermediate/KBP/em/type.txt is strange for the KBP corpus. I haven't tested it on other corpora.
type.txt:
/ 0 466411
p 1 177541
e 2 270192
r 3 288178
s 4 182380
o 5 480385
n 6 380277
l 7 159181
d 8 28569
i 9 303219
, 10 170403
m 11 42962
a 12 236368
c 13 196298
h 14 38605
t 15 312620
u 16 59252
g 17 59662
z 18 16877
_ 19 16360
w 20 5846
k 21 2778
f 22 3852
y 23 52196
b 24 4285
v 25 21442
x 26 1
For relation mentions, type.txt looks fine:
None 0 111343
per:country_of_death 1 10265
per:country_of_birth 2 12040
per:parents 3 6984
per:children 4 6984
per:religion 5 2584
per:countries_of_residence 6 1
I believe the problem is caused by ner_feature.py, lines 84 to 91. For relation mentions, mention.labels is a list, so the code works fine. However, for entity mentions, mention.labels is a single string (e.g. /person/soldier,/person/actor,/person/politician,/person/author,/person), so the for loop iterates over the characters of the string, producing the weird type.txt.
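A sketch of the fix described above: normalize mention.labels to a list before iterating, so a comma-joined string is split into whole type paths instead of characters (normalize_labels is a hypothetical helper, not the actual repo code):

```python
def normalize_labels(labels):
    """Return labels as a list of type strings.
    Relation mentions already carry a list; entity mentions carry one
    comma-joined string, and iterating over a string yields characters,
    which is what filled em/type.txt with single-letter "types"."""
    if isinstance(labels, str):
        return labels.split(",")
    return list(labels)

print(normalize_labels("/person/soldier,/person/actor"))
# ['/person/soldier', '/person/actor']
```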
Are the parameters sentId and start (inside entityMentions) used during training/testing?
git clone [email protected]:stanfordnlp/stanza.git
Cloning into 'stanza'...
Warning: Permanently added the RSA host key for IP address '52.74.223.119' to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
(py27) shuzhilian@gpuserver0301:/home/chency/wll/pythonworkspace/DS-RelationExtraction/code/DataProcessor$
Hello,
In your paper you state that you have 47 entity types for the NYT corpus.
However, in the uploaded data, as well as in the data-generation code, I find only 3 (person, organization, location).
Do you treat these types as supertypes that subsume the others? If so, can you provide the correct mapping from Freebase?
Thanks in advance!
I tried to run ./run.sh, but it didn't create mention_type_test.txt. I get this:
Traceback (most recent call last):
File "code/Evaluation/convertPredictionToJson.py", line 13, in
with open(typeMapFile) as typeF:
IOError: [Errno 2] No such file or directory: 'data/intermediate/KBP/rm/type.txt'
Traceback (most recent call last):
File "code/Evaluation/tune_threshold.py", line 66, in
ground_truth = load_labels(indir + '/mention_type_test.txt')
File "/workspace/png/CoType-master/code/Evaluation/evaluation.py", line 17, in load_labels
with open(file_name) as f:
IOError: [Errno 2] No such file or directory: 'data/intermediate/KBP/rm/mention_type_test.txt'