baidu / information-extraction

information-extraction's Introduction

Information Extraction Baseline System—InfoExtractor

Abstract

InfoExtractor is an information extraction baseline system built on the Schema constrained Knowledge Extraction dataset (SKED). It adopts a pipeline architecture with a p-classification model and a so-labeling model, both implemented with PaddlePaddle. The p-classification model is a multi-label classifier that uses a stacked Bi-LSTM network with max-pooling to identify the predicates expressed in a given sentence. The so-labeling model then uses a deep Bi-LSTM-CRF network with the BIEO tagging scheme to label the subject and object mentions for each predicate identified by the p-classification model. InfoExtractor achieves an F1 score of 0.668 on the development set.
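
To make the BIEO scheme concrete, here is a minimal sketch that tags subject and object mentions in a tokenized sentence. The label names (`B-SUB`, `E-OBJ`, and so on) and the helper `bieo_tags` are illustrative assumptions and are not taken from the InfoExtractor code. Single-token mentions are tagged `B-` here, which is also an assumption.

```python
def bieo_tags(tokens, mentions):
    """Return one BIEO tag per token.

    mentions: list of (start, end, role) with end exclusive; role is
    "SUB" or "OBJ" (hypothetical role names, not from the repository).
    """
    tags = ["O"] * len(tokens)
    for start, end, role in mentions:
        tags[start] = "B-" + role              # first token of the mention
        for i in range(start + 1, end - 1):
            tags[i] = "I-" + role              # tokens strictly inside
        if end - start > 1:
            tags[end - 1] = "E-" + role        # last token of the mention
    return tags


tokens = ["Jay", "Chou", "sang", "Blue", "and", "White", "Porcelain"]
mentions = [(0, 2, "SUB"), (3, 7, "OBJ")]
tags = bieo_tags(tokens, mentions)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Every token outside a mention keeps the `O` tag, so a sequence labeler only has to predict one label per token.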

Getting Started

Environment Requirements

PaddlePaddle v1.2.0
Numpy
Memory: 10 GB for training and 6 GB for inference

Step 1: Install PaddlePaddle

So far we have only tested with PaddlePaddle Fluid v1.2.0. Please install PaddlePaddle first; see the PaddlePaddle homepage for more details.

Step 2: Download the training data, dev data and schema files

Please download the training data, development data and schema files from the competition website, then put them in the ./data/ folder and unzip them there.

cd data
unzip train_data.json.zip 
unzip dev_data.json.zip
cd -

Step 3: Get the vocabulary file

Collect the high-frequency words from the ‘postag’ field of the training and dev data, then compile them into a vocabulary list.

python lib/get_vocab.py ./data/train_data.json ./data/dev_data.json > ./dict/word_idx
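
The idea behind this step can be sketched as a frequency count over the ‘postag’ field. This is not the code of lib/get_vocab.py: the function name `build_vocab`, the `min_count` cutoff, and the record layout (one JSON object per line, with ‘postag’ holding a list of objects with `word` and `pos` keys) are assumptions made for illustration.

```python
import json
from collections import Counter


def build_vocab(lines, min_count=2):
    """Count words in the 'postag' field and keep the frequent ones."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed layout: "postag" is a list of {"word": ..., "pos": ...}.
        for item in record.get("postag", []):
            counts[item["word"]] += 1
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(
        (word for word, count in counts.items() if count >= min_count),
        key=lambda word: (-counts[word], word),
    )


sample = [
    '{"postag": [{"word": "Beijing", "pos": "ns"}, {"word": "is", "pos": "v"}]}',
    '{"postag": [{"word": "Beijing", "pos": "ns"}, {"word": "big", "pos": "a"}]}',
]
vocab = build_vocab(sample)
print(vocab)
```

If the input files are empty or not unzipped, a count like this yields an empty vocabulary, so it is worth checking the data files before running the real script.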

Step 4: Train p-classification model

First, the classification model is trained to identify predicates in sentences. If you need to change the default hyper-parameters, e.g. the hidden layer size or whether to use GPU for training (CPU training is used by default), modify the corresponding argument in ./conf/IE_extraction.conf, then run the following command:

python bin/p_classification/p_train.py --conf_path=./conf/IE_extraction.conf

The trained p-classification model will be saved in the folder ./model/p_model.
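
As a rough illustration of what a multi-label classifier with max-pooling decides, the sketch below pools per-token hidden states over time and scores each predicate independently. It is plain Python rather than PaddlePaddle, and the function name, the toy weights, the label names and the 0.5 threshold are all assumptions, not values from the repository.

```python
import math


def predict_predicates(hidden_states, weights, bias, labels, threshold=0.5):
    """Return every label whose sigmoid score reaches the threshold."""
    # Max-pooling over time: keep the largest value of each hidden dimension.
    pooled = [max(column) for column in zip(*hidden_states)]
    predicted = []
    for label, w_row, b in zip(labels, weights, bias):
        logit = sum(w * h for w, h in zip(w_row, pooled)) + b
        prob = 1.0 / (1.0 + math.exp(-logit))  # independent sigmoid per label
        if prob >= threshold:
            predicted.append(label)
    return predicted


# Three time steps with hidden size 2 (toy numbers).
hidden_states = [[0.1, 2.0], [1.5, -0.3], [0.2, 0.4]]
# One weight row per label (toy numbers, hypothetical label names).
weights = [[1.0, 0.5], [-1.0, -1.0], [0.0, 1.0]]
bias = [0.0, 0.0, 0.0]
labels = ["author", "birthplace", "singer"]
predicted = predict_predicates(hidden_states, weights, bias, labels)
print(predicted)
```

Because each label gets its own sigmoid rather than sharing a softmax, a sentence can be assigned zero, one or several predicates.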

Step 5: Train so-labeling model

Once the predicates in a sentence are known, a sequence labeling model is trained to identify the s-o pairs that correspond to each relation appearing in the sentence.
Before training the so-labeling model, you need to convert the data into the format that the training code expects:

python lib/get_spo_train.py  ./data/train_data.json > ./data/train_data.p
python lib/get_spo_train.py  ./data/dev_data.json > ./data/dev_data.p

To train the so-labeling model, run:

python bin/so_labeling/spo_train.py --conf_path=./conf/IE_extraction.conf

The trained so-labeling model will be saved in the folder ./model/spo_model.

Step 6: Infer with two trained models

After training is complete, you can choose a trained model for prediction. The following commands predict with the last saved model; you can also use the development set to select the best model. To run inference with the two trained models on the demo test data (./data/test_demo.json), execute the commands in two steps:

python bin/p_classification/p_infer.py --conf_path=./conf/IE_extraction.conf --model_path=./model/p_model/final/ --predict_file=./data/test_demo.json > ./data/test_demo.p
python bin/so_labeling/spo_infer.py --conf_path=./conf/IE_extraction.conf --model_path=./model/spo_model/final/ --predict_file=./data/test_demo.p > ./data/test_demo.res

The predicted SPO triples will be saved in the file ./data/test_demo.res.
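
To show how the two stages could fit together, the sketch below turns a BIEO tag sequence predicted for one predicate into SPO triples. The tag names, the helper names and the choice to pair every subject with every object are assumptions for illustration; the actual output format of spo_infer.py may differ.

```python
def decode_mentions(tokens, tags, role):
    """Collect the text of every mention tagged with the given role."""
    mentions, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-" + role:
            if current:  # previous mention had no E- tag (e.g. single token)
                mentions.append(" ".join(current))
            current = [token]
        elif tag in ("I-" + role, "E-" + role) and current:
            current.append(token)
            if tag.startswith("E-"):
                mentions.append(" ".join(current))
                current = []
        else:
            # Any other tag closes an open mention.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


def build_triples(tokens, tags, predicate):
    """Pair every decoded subject with every decoded object (an assumption)."""
    subjects = decode_mentions(tokens, tags, "SUB")
    objects = decode_mentions(tokens, tags, "OBJ")
    return [(s, predicate, o) for s in subjects for o in objects]


tokens = ["Jay", "Chou", "sang", "Blue", "and", "White", "Porcelain"]
tags = ["B-OBJ", "E-OBJ", "O", "B-SUB", "I-SUB", "I-SUB", "E-SUB"]
triples = build_triples(tokens, tags, "singer")
print(triples)
```

Tokens are joined with spaces here for readability; for Chinese text they would be concatenated without a separator.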

Evaluation

Precision, recall and F1 score are the basic evaluation metrics used to measure the performance of participating systems. After obtaining the model's predicted triples, you can run the following commands. For data security reasons, the alias dictionary is not provided.

zip -r ./data/test_demo.res.zip ./data/test_demo.res
python bin/evaluation/calc_pr.py --golden_file=./data/test_demo_spo.json --predict_file=./data/test_demo.res.zip
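
For reference, the three metrics can be computed over sets of triples as in the sketch below. It uses exact matching on (subject, predicate, object) and ignores aliases, so it is an approximation of the idea and not a reimplementation of calc_pr.py; the function name and the sample triples are made up.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over two collections of triples."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = float(correct) / len(predicted) if predicted else 0.0
    recall = float(correct) / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [("A", "author", "B"), ("C", "singer", "D")]
predicted = [("A", "author", "B"), ("E", "author", "F"), ("C", "composer", "D")]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(precision, recall, f1)
```

One of the three predictions is correct and one of the two gold triples is found, so precision is 1/3 and recall is 1/2.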

Discussion

If you have any questions, you can submit an issue on GitHub and we will respond periodically.

Copyright and License

Copyright 2019 Baidu.com, Inc. All Rights Reserved
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

APPENDIX

In the released dataset, the ‘postag’ field of each sentence holds its word segmentation and part-of-speech tagging. The part-of-speech tag (PosTag) abbreviations and their meanings are shown in the following table.
The segmentation and part-of-speech tags provided with the dataset are for reference only and can be replaced with other segmentation results.

POS	Meaning
n	common noun
f	localizer (noun of locality)
s	place noun
t	time noun
nr	person name
ns	place name
nt	organization name
nw	title of a work
nz	other proper noun
v	verb
vd	verb used as an adverb
vn	verb used as a noun
a	adjective
ad	adjective used as an adverb
an	adjective used as a noun
d	adverb
m	numeral
q	measure word (quantifier)
r	pronoun
p	preposition
c	conjunction
u	auxiliary
xc	other function word
w	punctuation
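
As a small usage illustration of these tags, the sketch below picks the proper-noun tokens out of one record. The record layout (a ‘postag’ list of objects with `word` and `pos` keys), the helper name and the sample sentence are assumptions for illustration.

```python
import json

# Proper-noun tags from the table above.
PROPER_NOUN_TAGS = {"nr", "ns", "nt", "nw", "nz"}


def proper_nouns(record):
    """Return the words whose POS tag marks a proper noun."""
    return [
        item["word"]
        for item in record.get("postag", [])
        if item["pos"] in PROPER_NOUN_TAGS
    ]


line = (
    '{"postag": [{"word": "周杰伦", "pos": "nr"}, {"word": "出生", "pos": "v"}, '
    '{"word": "于", "pos": "p"}, {"word": "台湾", "pos": "ns"}]}'
)
names = proper_nouns(json.loads(line))
print(names)
```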

information-extraction's Issues

Error when running p_classification

Running the command python bin/p_classification/p_train.py --conf_path=./conf/IE_extraction.conf directly
produces the error below. What is the cause? (I am not yet familiar with PaddlePaddle and don't know how to handle the error; I just want to get the baseline running first. Thanks.)
W0309 22:57:00.700938 37327 device_context.cc:263] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 9.0, Runtime API Version: 9.0
W0309 22:57:00.701022 37327 device_context.cc:271] device: 0, cuDNN Version: 5.0.
W0309 22:57:00.701028 37327 device_context.cc:295] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in your machine is 5.1, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
Traceback (most recent call last):
  File "bin/p_classification/p_train.py", line 149, in <module>
    main(conf_dict, use_cuda=use_gpu)
  File "bin/p_classification/p_train.py", line 137, in main
    train(conf_dict, data_generator, use_cuda=use_cuda)
  File "bin/p_classification/p_train.py", line 122, in train
    train_loop(fluid.default_main_program())
  File "bin/p_classification/p_train.py", line 79, in train_loop
    exe.run(fluid.default_startup_program())
  File "/home/xfbai/anaconda3/envs/py2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 525, in run
    use_program_cache=use_program_cache)
  File "/home/xfbai/anaconda3/envs/py2/lib/python2.7/site-packages/paddle/fluid/executor.py", line 591, in _run
    exe.run(program.desc, scope, 0, True, True)
paddle.fluid.core.EnforceNotMet: Invoke operator fill_constant error.
Python Callstacks:
  File "/home/xfbai/anaconda3/envs/py2/lib/python2.7/site-packages/paddle/fluid/framework.py", line 1382, in _prepend_op
    attrs=kwargs.get("attrs", None))
  File "/home/xfbai/anaconda3/envs/py2/lib/python2.7/site-packages/paddle/fluid/initializer.py", line 167, in __call__
    stop_gradient=True)
  File "/home/xfbai/anaconda3/envs/py2/lib/python2.7/site-packages/paddle/fluid/framework.py", line 1198, in create_var
    kwargs['initializer'](var, self)
  File "/home/xfbai/anaconda3/envs/py2/lib/python2.7/site-packages/paddle/fluid/layer_helper.py", line 402, in set_variable_initializer
    initializer=initializer)
  File "/home/xfbai/anaconda3/envs/py2/lib/python2.7/site-packages/paddle/fluid/layers/tensor.py", line 137, in create_global_var
    value=float(value), force_cpu=force_cpu))
  File "/home/xfbai/anaconda3/envs/py2/lib/python2.7/site-packages/paddle/fluid/optimizer.py", line 92, in _create_global_learning_rate
    persistable=True)
  File "/home/xfbai/anaconda3/envs/py2/lib/python2.7/site-packages/paddle/fluid/optimizer.py", line 224, in _create_optimization_pass
    self._create_global_learning_rate()
  File "/home/xfbai/anaconda3/envs/py2/lib/python2.7/site-packages/paddle/fluid/optimizer.py", line 350, in apply_gradients
    optimize_ops = self._create_optimization_pass(params_grads)
  File "/home/xfbai/anaconda3/envs/py2/lib/python2.7/site-packages/paddle/fluid/optimizer.py", line 405, in minimize
    optimize_ops = self.apply_gradients(params_grads)
  File "bin/p_classification/p_train.py", line 65, in train
    sgd_optimizer.minimize(avg_cost)
  File "bin/p_classification/p_train.py", line 137, in main
    train(conf_dict, data_generator, use_cuda=use_cuda)
  File "bin/p_classification/p_train.py", line 149, in <module>
    main(conf_dict, use_cuda=use_gpu)

After following the tutorial to the end, why is the test_demo.res file generated in the last step 0 bytes in size?

Could you share the reference papers for this system? Many thanks!

Error when running get_vocab.py

Hello, I get the following error when running get_vocab.py:
Traceback (most recent call last):
  File "lib/get_vocab.py", line 89, in <module>
    get_vocab(train_file, dev_file)
  File "lib/get_vocab.py", line 61, in get_vocab
    raise ValueError('The length of dev word is 0')
ValueError: The length of dev word is 0
How can I resolve this?

(env) [root@localhost information-extraction]# python bin/p_classification/p_train.py --conf_path=./conf/IE_extraction.conf
Traceback (most recent call last):
  File "bin/p_classification/p_train.py", line 28, in <module>
    import paddle
  File "/mnt/information-extraction/env/lib64/python2.7/site-packages/paddle/__init__.py", line 25, in <module>
    import paddle.dataset
  File "/mnt/information-extraction/env/lib64/python2.7/site-packages/paddle/dataset/__init__.py", line 25, in <module>
    import paddle.dataset.sentiment
  File "/mnt/information-extraction/env/lib64/python2.7/site-packages/paddle/dataset/sentiment.py", line 29, in <module>
    import nltk
  File "/mnt/information-extraction/env/lib/python2.7/site-packages/nltk/__init__.py", line 128, in <module>
    from nltk.collocations import *
  File "/mnt/information-extraction/env/lib/python2.7/site-packages/nltk/collocations.py", line 35, in <module>
    from nltk.probability import FreqDist
  File "/mnt/information-extraction/env/lib/python2.7/site-packages/nltk/probability.py", line 333
    print("%*s" % (width, samples[i]), end=" ")

Could the dataset still be shared? I am a student and would use it for research.

Thank you!

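Can anyone provide the data for this competition?

Hello, everyone.
Does anyone have the data for this competition and could share it? Baidu still has not open-sourced the data. Thank you very much.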