data2text

Code for Challenges in Data-to-Document Generation (Wiseman, Shieber, Rush; EMNLP 2017); much of this code is adapted from an earlier fork of OpenNMT.

The boxscore-data associated with the above paper can be downloaded from the boxscore-data repo, and this README will go over running experiments on the RotoWire portion of the data; running on the SBNation data (or other data) is quite similar.

Update 2: For an improved implementation of the extractive evaluation metrics (and improved models), please see the data2text-plan-py repo associated with the Puduppully et al. (AAAI 2019) paper.

Update: models and results reflecting the newly cleaned up data in the boxscore-data repo are now given below.

Preprocessing

Before training models, you must preprocess the data. Assuming the RotoWire json files reside at ~/Documents/code/boxscore-data/rotowire, the following command will preprocess the data

th box_preprocess.lua -json_data_dir ~/Documents/code/boxscore-data/rotowire -save_data roto

and write files called roto-train.t7, roto.src.dict, and roto.tgt.dict to your local directory.

Incorporating Pointer Information

For the "conditional copy" model, it is necessary to know where in the source table each target word may have been copied from.

This pointer information can be incorporated into the preprocessing by running:

th box_preprocess.lua -json_data_dir ~/Documents/code/boxscore-data/rotowire -save_data roto -ptr_fi "roto-ptrs.txt"

The file roto-ptrs.txt has been included in the repo.

Training (and Downloading Trained Models)

The command for training the Joint Copy + Rec + TVD model is as follows:

th box_train.lua -data roto-train.t7 -save_model roto_jc_rec_tvd -rnn_size 600 -word_vec_size 600 -enc_emb_size 600 -max_batch_size 16 -dropout 0.5 -feat_merge concat -pool mean -enc_layers 1 -enc_relu -report_every 50 -gpuid 1 -epochs 50 -learning_rate 1 -enc_dropout 0 -decay_update2 -layers 2 -copy_generate -tanh_query -max_bptt 100 -discrec -rho 1 -partition_feats -recembsize 600 -discdist 1 -seed 0

A model trained in this way can be downloaded from https://drive.google.com/file/d/0B1ytQXPDuw7ONlZOQ2R3UWxmZ2s/view?usp=sharing

An updated model can be downloaded from https://drive.google.com/drive/folders/1QKudbCwFuj1BAhpY58JstyGLZXvZ-2w-?usp=sharing

The command for training the Conditional Copy model is as follows:

th box_train.lua -data roto-train.t7 -save_model roto_cc -rnn_size 600 -word_vec_size 600 -enc_emb_size 600 -max_batch_size 16 -dropout 0.5 -feat_merge concat -pool mean -enc_layers 1 -enc_relu -report_every 50 -gpuid 1 -epochs 100 -learning_rate 1 -enc_dropout 0 -decay_update2 -layers 2 -copy_generate -tanh_query -max_bptt 100 -switch -multilabel -seed 0

A model trained in this way can be downloaded from https://drive.google.com/file/d/0B1ytQXPDuw7OaHZJZjVWd2N6R2M/view?usp=sharing

An updated model can be downloaded from https://drive.google.com/drive/folders/1QKudbCwFuj1BAhpY58JstyGLZXvZ-2w-?usp=sharing

Generation

Use the following commands to generate from the above models:

th box_train.lua -data roto-train.t7 -save_model roto_jc_rec_tvd -rnn_size 600 -word_vec_size 600 -enc_emb_size 600 -max_batch_size 16 -dropout 0.5 -feat_merge concat -pool mean -enc_layers 1 -enc_relu -report_every 50 -gpuid 1 -epochs 50 -learning_rate 1 -enc_dropout 0 -decay_update2 -layers 2 -copy_generate -tanh_query -max_bptt 100 -discrec -rho 1 -partition_feats -recembsize 600 -discdist 1 -train_from roto_jc_rec_tvd_epoch45_7.22.t7 -just_gen -beam_size 5 -gen_file roto_jc_rec_tvd-beam5_gens.txt
th box_train.lua -data roto-train.t7 -save_model roto_cc -rnn_size 600 -word_vec_size 600 -enc_emb_size 600 -max_batch_size 16 -dropout 0.5 -feat_merge concat -pool mean -enc_layers 1 -enc_relu -report_every 50 -gpuid 1 -epochs 100 -learning_rate 1 -enc_dropout 0 -decay_update2 -layers 2 -copy_generate -tanh_query -max_bptt 100 -switch -multilabel -train_from roto_cc_epoch34_7.44.t7 -just_gen -beam_size 5 -gen_file roto_cc-beam5_gens.txt

The beam size used in generation can be adjusted with the -beam_size argument. You can generate on the test data by supplying the -test flag.

Misc/Utils

You can regenerate a pointer file with

python data_utils.py -mode ptrs -input_path ~/Documents/code/boxscore-data/rotowire/train.json -output_fi "my-roto-ptrs.txt"

Information/Relation Extraction

Creating Training/Validation Data

You can create a dataset for training or evaluating the relation extraction system as follows:

python data_utils.py -mode make_ie_data -input_path "../boxscore-data/rotowire" -output_fi "roto-ie.h5"

This will create files roto-ie.h5, roto-ie.dict, and roto-ie.labels.

Evaluating Generated Summaries

  1. You can download the extraction models we ensemble to do the evaluation from this link. There are six models in total, with the name pattern *ie-ep*.t7. Put these extraction models in the same directory as extractor.lua. (Note that extractor.lua hard-codes the paths to these saved models, so you'll need to change this if you want to substitute in new models.)

Updated extraction models can be downloaded from https://drive.google.com/drive/folders/1QKudbCwFuj1BAhpY58JstyGLZXvZ-2w-?usp=sharing

  2. Once you've generated summaries, you can put them into a format the extraction system can consume as follows:
python data_utils.py -mode prep_gen_data -gen_fi roto_cc-beam5_gens.txt -dict_pfx "roto-ie" -output_fi roto_cc-beam5_gens.h5 -input_path "../boxscore-data/rotowire"

where the file you've generated is called roto_cc-beam5_gens.txt and the dictionary and labels files are in roto-ie.dict and roto-ie.labels respectively (as above). This will create a file called roto_cc-beam5_gens.h5, which can be consumed by the extraction system.

  3. The extraction system can then be run as follows:
th extractor.lua -gpuid 1 -datafile roto-ie.h5 -preddata roto_cc-beam5_gens.h5 -dict_pfx "roto-ie" -just_eval

This will print the RG metric numbers. (For the recall number, divide the 'nodup correct' number by the total number of generated summaries, e.g., 727.) It will also generate a file called roto_cc-beam5_gens.h5-tuples.txt containing the extracted relations, which can be compared to the gold extracted relations.
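As a sketch of that recall arithmetic (the counts below are placeholders, not real extractor output; 727 is the summary count mentioned above):

```python
# Sketch of the RG recall computation described above; the 'nodup correct'
# count is a placeholder, not actual extractor output.
def rg_recall(nodup_correct, total_summaries):
    """Average number of correctly extracted, deduplicated relations
    per generated summary."""
    return nodup_correct / total_summaries

# e.g., if the extractor reported 10000 'nodup correct' tuples over
# 727 generated summaries:
print(rg_recall(10000, 727))
```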

  4. We now need the tuples from the gold summaries. roto-gold-val.h5-tuples.txt and roto-gold-test.h5-tuples.txt are included in the repo, but they can be recreated by repeating steps 2 and 3 with the gold summaries (one gold summary per line, as usual).

  5. The remaining metrics can now be obtained by running:

python non_rg_metrics.py roto-gold-val.h5-tuples.txt roto_cc-beam5_gens.h5-tuples.txt
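The content selection (CS) numbers produced here are precision/recall of the generated relations against the gold relations. A minimal corpus-level sketch of that comparison (the real non_rg_metrics.py works per summary, and the example tuples below are invented for illustration):

```python
# Illustrative content-selection-style precision/recall over two relation
# sets; the actual script compares relations summary-by-summary, so treat
# this whole-corpus version as a sketch only.
def cs_prec_rec(gold, pred):
    gold, pred = set(gold), set(pred)
    overlap = len(gold & pred)
    prec = overlap / len(pred) if pred else 0.0
    rec = overlap / len(gold) if gold else 0.0
    return prec, rec

# Hypothetical (entity, value, type) tuples:
gold = {("Heat", "92", "PTS"), ("Wade", "8", "AST"), ("Bosh", "12", "REB")}
pred = {("Heat", "92", "PTS"), ("Wade", "8", "AST"), ("Wade", "30", "PTS")}
print(cs_prec_rec(gold, pred))
```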

Retraining the Extraction Model

I trained the convolutional IE model as follows:

th extractor.lua -gpuid 1 -datafile roto-ie.h5 -lr 0.7 -embed_size 200 -conv_fc_layer_size 500 -dropout 0.5 -savefile roto-convie

I trained the BLSTM IE model as follows:

th extractor.lua -gpuid 1 -datafile roto-ie.h5 -lstm -lr 1 -embed_size 200 -blstm_fc_layer_size 700 -dropout 0.5 -savefile roto-blstmie -seed 1111

The saved models linked to above were obtained by varying the seed or the epoch.

Updated Results

On the development set:

| Model | RG (P% / #) | CS (P% / R%) | CO | PPL | BLEU |
|---|---|---|---|---|---|
| Gold | 95.98 / 16.93 | 100 / 100 | 100 | 1 | 100 |
| Template | 99.93 / 54.21 | 23.42 / 72.62 | 11.30 | N/A | 8.97 |
| Joint+Rec+TVD (B=1) | 61.23 / 15.27 | 28.79 / 39.80 | 15.27 | 7.26 | 12.69 |
| Conditional (B=1) | 76.66 / 12.88 | 37.98 / 35.46 | 16.70 | 7.29 | 13.60 |
| Joint+Rec+TVD (B=5) | 62.84 / 16.77 | 27.23 / 40.60 | 14.47 | 7.26 | 13.44 |
| Conditional (B=5) | 75.74 / 16.93 | 31.20 / 38.94 | 14.98 | 7.29 | 14.57 |

On the test set:

| Model | RG (P% / #) | CS (P% / R%) | CO | PPL | BLEU |
|---|---|---|---|---|---|
| Gold | 96.11 / 17.31 | 100 / 100 | 100 | 1 | 100 |
| Template | 99.95 / 54.15 | 23.74 / 72.36 | 11.68 | N/A | 8.93 |
| Joint+Rec+TVD (B=5) | 62.66 / 16.82 | 27.60 / 40.59 | 14.57 | 7.49 | 13.61 |
| Conditional (B=5) | 75.62 / 16.83 | 32.80 / 39.93 | 15.62 | 7.53 | 14.19 |

data2text's People

Contributors: frankxu2004, swiseman, zcyang

data2text's Issues

nonedenom is not working as claimed

Here you're computing num_to_keep from the number of examples not labeled as NONE (len(trlabels) - len(none_idxs)); however, according to the comment above, it should be based on the number of examples labeled as NONE (len(none_idxs)).
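A simplified, hypothetical sketch of the fix the issue describes (this is not the repo's actual code; the function name, signature, and nonedenom semantics are assumptions made for illustration):

```python
import random

def downsample_none(trlabels, none_idxs, nonedenom=2, seed=0):
    """Keep all non-NONE examples plus a subsample of NONE examples.

    Hypothetical sketch: per the issue, num_to_keep should be derived
    from the count of NONE-labeled examples (len(none_idxs)), not from
    len(trlabels) - len(none_idxs).
    """
    random.seed(seed)
    none_set = set(none_idxs)
    num_to_keep = len(none_idxs) // nonedenom  # the fix described above
    kept_none = random.sample(none_idxs, num_to_keep)
    non_none = [i for i in range(len(trlabels)) if i not in none_set]
    return sorted(non_none + kept_none)

labels = ["NONE", "PTS", "NONE", "AST", "NONE", "NONE"]
print(downsample_none(labels, [0, 2, 4, 5]))
```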

Library not found when running extractor.lua

When running extractor.lua, the require 'MarginalNLLCriterion' fails. It seems like some packages are missing?
/share/apps/torch/bin/luajit: /share/apps/torch/share/lua/5.1/trepl/init.lua:389: module 'MarginalNLLCriterion' not found:No LuaRocks module found for MarginalNLLCriterion

Question about evaluation with Torch

Hi,

When I run the evaluation with
CUDA_VISIBLE_DEVICES=6 ~/torch/install/bin/th extractor.lua -gpuid 1 -datafile roto-ie.h5 -preddata drive_data/transform_gen/roto_cc-beam5_gens.h5 -dict_pfx "roto-ie" -just_eval, I get the following error:
/root/torch/install/bin/luajit: extractor.lua:574: bad argument #1 to 'copy' (sizes do not match at /root/torch/extra/cutorch/lib/THC/THCTensorCopy.cu:31)
I printed the sizes and found that the size of p (parameters of the generated model) is 2234733, while the size of saved_p (parameters from conv1ie-ep6-94-74.t7) is 2141733. How can I solve this problem?

My Torch is installed with LuaJIT; hdf5 was installed with 'luarocks install hdf5'.

Thank you

Yixian

Which metric is used in development?

Thanks for sharing your work.
During model training, 'ppl' (perplexity, I assume) is reported, while other metrics, such as RG, CO, and CS, are used in evaluation.
Which one is used to guide parameter selection during development?
Thanks so much.

nan perplexity during training process

Following the instructions in the README, I started training the model with the given command. However, it is now producing a perplexity of NaN. Is this normal?

Epoch 10 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 2135 ; PPL nan ;
Epoch 10 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 2122 ; PPL nan ;
Epoch 10 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 2127 ; PPL nan ;
Epoch 10 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 2130 ; PPL nan ;
Validation perplexity: nan

Epoch 11 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1316 ; PPL nan ;
Epoch 11 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1189 ; PPL nan ;
Epoch 11 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1148 ; PPL nan ;
Epoch 11 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1132 ; PPL nan ;
Validation perplexity: nan

Epoch 12 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1088 ; PPL nan ;
Epoch 12 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1089 ; PPL nan ;
Validation perplexity: nan

(epochs 13 through 16 show the same NaN pattern)

Epoch 17 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1086 ; PPL nan ;
Epoch 17 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1083 ; PPL nan ;
Epoch 17 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1078 ; PPL nan ;

environment config for running extractor.lua

Hi, thanks for the awesome code, paper, and datasets!

I get this error when I run extractor.lua: /usr/bin/luajit: /usr/share/lua/5.1/torch/File.lua:351: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-4656/cutorch/lib/THC/generic/THCStorage.cu:66

The issue seems to be related to the versions of Torch or Lua.

So could you please share the running environment config for extractor.lua?

Thanks a lot !

Help

Where can I find your code? I am very interested in the topic.

Several issues I find in the entity/number extraction functions

Here are several issues I found in the rule-based entity/number extraction functions in data_utils.py. I'm still working on the dataset and will probably add more as I observe them. These issues are roughly sorted by their impact, in my estimation.

  1. Implicit number words, such as (a|an|a pair of|a trio of) followed by nouns like (rebound|turnover|assist|block|steal|board|three-pointer|three pointer|free-throw|free throw|dime), are ignored. This probably results in the largest number of omissions.
  2. Besides aliases like {'Los Angeles', 'LA'}, there are other aliases, like {'76ers', 'Sixers'} and {'Mavericks', 'Mavs'}.
  3. The NLTK tokenizer does not separate suffixes containing ’ (which looks like ' but is the Unicode character U+2019), so some entities cannot be identified.
  4. A player name with initials, such as J.J., may be present in another form, like JJ.
  5. A player name ending with Jr. may also appear ending with "Jr", ", Jr.", or ", Jr".
  6. A player name with a hyphen, in the form A B-C, may be present in another form A-B C.
  7. Other minor issues, such as Oklahoma city.

You can search for these cases in the dataset.
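To make the first two cases concrete, here is a small, hypothetical normalization sketch; the word lists and alias table are illustrative examples drawn from the list above, not the repo's actual data or code:

```python
# Illustrative handling of implicit number words and team aliases,
# following the cases listed above; not the repo's actual extraction code.
IMPLICIT_NUMS = {"a": 1, "an": 1, "a pair of": 2, "a trio of": 3}
STAT_NOUNS = {"rebound", "turnover", "assist", "block", "steal", "board",
              "three-pointer", "free-throw", "dime"}
ALIASES = {"LA": "Los Angeles", "Sixers": "76ers", "Mavs": "Mavericks"}

def implicit_number(phrase, noun):
    """Map e.g. ('a pair of', 'rebounds') to the number 2."""
    if noun.rstrip("s") in STAT_NOUNS:
        return IMPLICIT_NUMS.get(phrase)
    return None  # not a stat noun, so no implicit count

def canonical_team(name):
    """Resolve a team alias to its canonical name."""
    return ALIASES.get(name, name)

print(implicit_number("a pair of", "rebounds"))  # 2
print(canonical_team("Mavs"))  # Mavericks
```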

Hopefully, my findings will help people in the future. I actually found these issues some years ago while working on Lin et al., 2020. They make the dataset quite noisy.

I do have my own version of data_utils.py which fixes these issues to some extent.

Thanks for your pioneering work!

Inconsistency in data_utils.py

In data_utils.py, according to line 134, tokens are expected to satisfy "not annoying_number_word(...)", so annoying_number_word() should return True when the token should be ignored. Yet at line 116, annoying_number_word() returns True when the token should not be ignored. I think this bug may affect the generation model as well as the extraction model.

Possible PyTorch version of the code?

Hi, thanks for the awesome code, paper, and datasets!
Since there has been a trend of increasing PyTorch usage in NLP, and OpenNMT also has a PyTorch version, is there any possibility of migrating this project to PyTorch? I have seen an implementation of the model, but it seems to implement only the basic attentive decoder. (https://github.com/gau820827/AI-writer_Data2Doc)

A typo in extractor.lua at line 542

Hi Sam, thank you for your update! I believe the model name at line 542 should be "blstm1ie-ep4-93-75.t7" rather than "blst1mie-ep4-93-75.t7". Seems like a simple typo.

Adding instructions on how to train the information extraction model

Currently, the README shows users how to download the latest pretrained models and evaluate generated summaries directly. For completeness, it would be great to also share the command lines with which these gold models were trained. I checked extractor.lua and it appears to support training models; could you please share the parameters?

A bug regarding text numbers in data_utils.py

In lines 134 and 136, it should be "and annoying_number_word(sent, i)" rather than "and not annoying_number_word(sent, i)", since the function returns True if the token is not three-point related.
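A toy stand-in illustrating the corrected condition (the real annoying_number_word in data_utils.py is more involved; this stand-in only mimics its return convention as described in the issue):

```python
# Toy stand-in for annoying_number_word, only to illustrate the corrected
# condition; not the actual implementation from data_utils.py.
# Convention per the issue: return True when the number word at position i
# is NOT three-point related (and so should be kept as a count).
def annoying_number_word(sent, i):
    nxt = sent[i + 1] if i + 1 < len(sent) else ""
    return not (sent[i] == "three" and nxt in ("pointer", "pointers", "point"))

# "three pointers" is three-point phrasing, so the token is ignored:
print(annoying_number_word(["hit", "three", "pointers"], 1))  # False
# "three rebounds" is a real count, so the condition should keep it,
# i.e., "and annoying_number_word(sent, i)" without the "not":
print(annoying_number_word(["grabbed", "three", "rebounds"], 1))  # True
```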

Possible bug in extractor.lua

1. When I ran extractor.lua, it said it can't find "MarginalNLLCriterion". I assume it should be "opnmt.modules.MarginalNLLCriterion"?
2. There is a problem loading the pre-trained model. It says: "extractor.lua:549: bad argument #1 to 'copy' (sizes do not match)". I think the downloaded model's parameter sizes are inconsistent with the parameters specified in extractor.lua?
