xmunlp / xmunmt
An implementation of RNNsearch using TensorFlow
License: BSD 3-Clause "New" or "Revised" License
Hi, I tried to run the code with the default settings on a GTX 1080 Ti, which has 11 GB of memory, but still got a ResourceExhaustedError. This is surprising, because RNNSearch is a relatively small model. How could it occupy that much memory?
How much GPU memory (or which GPU) do you use in your experiment?
I'm using tf1.4.0-rc0 with CUDA 8, and the error message looks like:
INFO:tensorflow:rnnsearch/decoder/attention/k_transform/matrix_0 shape (2000, 1000)
INFO:tensorflow:rnnsearch/decoder/attention/logits/matrix_0 shape (1000, 1)
INFO:tensorflow:rnnsearch/decoder/attention/q_transform/matrix_0 shape (1000, 1000)
INFO:tensorflow:rnnsearch/decoder/gru_cell/candidate/bias shape (1000,)
INFO:tensorflow:rnnsearch/decoder/gru_cell/candidate/matrix_0 shape (620, 1000)
...... NMT parameters info ......
INFO:tensorflow:rnnsearch/softmax/bias shape (36166,)
INFO:tensorflow:rnnsearch/softmax/matrix_0 shape (620, 36166)
INFO:tensorflow:rnnsearch/source_embedding/bias shape (620,)
INFO:tensorflow:rnnsearch/source_embedding/embedding shape (36166, 620)
INFO:tensorflow:rnnsearch/target_embedding/bias shape (620,)
INFO:tensorflow:rnnsearch/target_embedding/embedding shape (36166, 620)
INFO:tensorflow:Total trainable variables size: 95822166
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Create EvaluationHook.
INFO:tensorflow:Making dir: train/eval
2017-11-21 22:40:19.596248: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:02:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2017-11-21 22:40:19.596352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:02:00.0, compute capability: 6.1)
INFO:tensorflow:loss = 9.27618, step = 1, target = [128 48], source = [128 48]
INFO:tensorflow:Saving checkpoints for 1 into train/model.ckpt.
INFO:tensorflow:loss = 8.80296, step = 2, target = [128 24], source = [128 24] (3.997 sec)
INFO:tensorflow:loss = 9.28678, step = 3, target = [128 32], source = [128 32] (0.429 sec)
INFO:tensorflow:loss = 8.62557, step = 4, target = [128 48], source = [128 47] (0.637 sec)
INFO:tensorflow:loss = 8.68332, step = 5, target = [128 24], source = [128 24] (0.306 sec)
INFO:tensorflow:loss = 8.25838, step = 6, target = [128 64], source = [128 64] (0.937 sec)
INFO:tensorflow:loss = 8.36041, step = 7, target = [128 32], source = [128 32] (0.400 sec)
INFO:tensorflow:loss = 7.80727, step = 8, target = [128 16], source = [128 16] (0.218 sec)
INFO:tensorflow:loss = 8.16378, step = 9, target = [128 48], source = [128 48] (0.661 sec)
INFO:tensorflow:loss = 7.60664, step = 10, target = [128 24], source = [128 24] (0.299 sec)
INFO:tensorflow:loss = 7.55223, step = 11, target = [128 32], source = [128 32] (0.417 sec)
INFO:tensorflow:loss = 7.39985, step = 12, target = [128 48], source = [128 48] (0.665 sec)
INFO:tensorflow:loss = 7.14252, step = 13, target = [128 12], source = [128 12] (0.164 sec)
INFO:tensorflow:loss = 7.12088, step = 14, target = [128 24], source = [128 24] (0.306 sec)
INFO:tensorflow:loss = 7.08299, step = 15, target = [128 32], source = [128 32] (0.403 sec)
INFO:tensorflow:loss = 7.25178, step = 16, target = [128 64], source = [128 64] (0.929 sec)
INFO:tensorflow:loss = 6.85371, step = 17, target = [128 16], source = [128 16] (0.200 sec)
INFO:tensorflow:loss = 7.00198, step = 18, target = [128 48], source = [128 48] (0.658 sec)
INFO:tensorflow:loss = 6.67163, step = 19, target = [128 8], source = [128 8] (0.116 sec)
INFO:tensorflow:loss = 6.76071, step = 20, target = [128 24], source = [128 24] (0.323 sec)
INFO:tensorflow:loss = 6.8594, step = 21, target = [128 32], source = [128 32] (0.414 sec)
INFO:tensorflow:loss = 7.01129, step = 22, target = [128 48], source = [128 48] (0.675 sec)
2017-11-21 22:40:51.626475: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.38GiB. Current allocation summary follows.
..... Many memory footprint info.......
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[10240,36166]
[[Node: rnnsearch/smoothed_softmax_cross_entropy_with_logits/one_hot = OneHot[T=DT_FLOAT, TI=DT_INT32, axis=-1, _device="/job:localhost/replica:0/task:0/device:GPU:0"](rnnsearch/smoothed_softmax_cross_entropy_with_logits/Reshape/_935, rnnsearch/smoothed_softmax_cross_entropy_with_logits/strided_slice, training/train/beta1, rnnsearch/smoothed_softmax_cross_entropy_with_logits/truediv)]]
[[Node: truediv/_1015 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4870_truediv", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
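For what it's worth, the failed 1.38 GiB allocation is exactly the size of that one-hot tensor, so the label-smoothing one-hot over the full vocabulary appears to be the culprit (10240 is presumably batch_size times the maximum target length):

```python
# The tensor the allocator failed on has shape [10240, 36166] with
# float32 elements (4 bytes each)
rows, vocab = 10240, 36166
size_gib = rows * vocab * 4 / 2 ** 30
print(round(size_gib, 2))  # 1.38, matching "trying to allocate 1.38GiB"
```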
Did you do this experiment? Could you provide some benchmark results? Thanks very much.
I've read through a large part of the source code and found it to be well organized and easy to read.
But I still have a few questions. Could you please verify them?
1. Is the following comment correct?
XMUNMT/xmunmt/layers/attention.py, Line 76 in ed7ae9a
hidden in L74 is [batch_size * mem_size, hidden_size], logits in L77 is of shape [batch_size * mem_size, 1], logits in L78 is of shape [batch_size, mem_size], and no variable has a shape of [batch_size, mem_size, 1]. Or am I getting it wrong?
2. In XMUNMT/xmunmt/models/rnnsearch.py, Line 280 in ed7ae9a
Hi,
I'm wondering whether this reproduction can achieve performance similar to the original RNNSearch implementation (GroundHog). Are there any benchmark results?
All kinds of problems show up at runtime. What is going on? It simply won't run.
File "trainer.py", line 405, in
main(parse_args())
File "trainer.py", line 286, in main
collect_params(params, model_cls.get_parameters())
File "trainer.py", line 138, in collect_params
collected.add_hparam(k, getattr(all_params, str(k)))
AttributeError: 'HParams' object has no attribute '('rnn_cell', 'LegacyGRUCell')'
Hi @XMU-NLPLAB,
I am trying to run your code but I cannot find the dataset or any download link. I've tried the openmt15 dataset, but it seems the official registration and data-download link is no longer available.
BTW, in the command
python preprocess.py -d vocab.zh.pkl -v 30000 -b bintext.zh.pkl -p zh.txt
What do vocab.zh.pkl, bintext.zh.pkl, and zh.txt represent, respectively?
I am a beginner in NMT. Can you offer any information or resources? That would be really helpful!
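For orientation, a vocabulary file like vocab.zh.pkl is typically built by counting tokens in the raw text and keeping the -v most frequent ones. A minimal sketch of that counting step (the `<unk>` placeholder and the exact pickle format are my assumptions, not the repo's actual code):

```python
from collections import Counter

# Hypothetical sketch of how a -v 30000 vocabulary could be built from a
# tokenized corpus like zh.txt; the repo's real preprocess.py may differ.
corpus = ["the cat sat", "the dog sat"]
counts = Counter(tok for line in corpus for tok in line.split())
vocab = ["<unk>"] + [w for w, _ in counts.most_common(30000)]
print(vocab[:3])  # most frequent tokens after the <unk> placeholder
```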
The instructions in readme.md describe how to train an English-to-German translation model and apply it to test data. But how did you evaluate the results?
This is what I did:
src=en
tgt=de
# Merge subwords
sed -r 's/(@@ )|(@@ ?$)//g' $nmt_output_dir/test.txt > $nmt_output_dir/test.merged-subwords.txt
# Detruecase NMT outputs
$moses_scripts/recaser/detruecase.perl < $nmt_output_dir/test.merged-subwords.txt > $nmt_output_dir/test.merged-bpe32k.detc
# Detokenize
$moses_scripts/tokenizer/detokenizer.perl -l $tgt < $nmt_output_dir/test.merged-bpe32k.detc > $nmt_output_dir/test.merged-bpe32k.txt
# Evaluation
# Method II: using mteval
# wrap up outputs with SGML format
$moses_scripts/ems/support/wrap-xml.perl $tgt $test_sgm_dir/newstest2017-$src$tgt-src.$src.sgm < $nmt_output_dir/test.merged-bpe32k.txt > $nmt_output_dir/test.merged-bpe32k.sgm
$moses_scripts/generic/mteval-v14.pl -r $test_sgm_dir/newstest2017-$src$tgt-ref.$tgt.sgm -s $test_sgm_dir/newstest2017-$src$tgt-src.$src.sgm -t $nmt_output_dir/test.merged-bpe32k.sgm > mteval-result.txt
I used the default hyper-parameters to train the model (except for batch_size=80), and got a BLEU of only 22.47:
Evaluation of any-to-de translation using:
src set "newstest2017" (130 docs, 3004 segs)
ref set "newstest2017" (1 refs)
tst set "newstest2017" (1 systems)
length ratio: 1.01165010524255 (62001/61287), penalty (log): 0
NIST score = 6.6085 BLEU score = 0.2247 for system "Edinburgh"
# ------------------------------------------------------------------------
Individual N-gram scoring
1-gram 2-gram 3-gram 4-gram 5-gram 6-gram 7-gram 8-gram 9-gram
------ ------ ------ ------ ------ ------ ------ ------ ------
NIST: 5.0732 1.2981 0.2044 0.0291 0.0037 0.0007 0.0001 0.0000 0.0000 "Edinburgh"
BLEU: 0.5593 0.2821 0.1633 0.0989 0.0616 0.0388 0.0250 0.0164 0.0108 "Edinburgh"
# ------------------------------------------------------------------------
Cumulative N-gram scoring
1-gram 2-gram 3-gram 4-gram 5-gram 6-gram 7-gram 8-gram 9-gram
------ ------ ------ ------ ------ ------ ------ ------ ------
NIST: 5.0732 6.3713 6.5757 6.6048 6.6085 6.6092 6.6094 6.6094 6.6094 "Edinburgh"
BLEU: 0.5593 0.3972 0.2954 0.2247 0.1735 0.1352 0.1062 0.0841 0.0669 "Edinburgh"
What could be wrong?
Also, in my experience with NMT, a BLEU score of 30 is quite high for an English-to-German system on the newstest2017 data. For example, in this work the English-to-German system got a BLEU below 26, and the winner of WMT'17 only reached 28.3; see http://matrix.statmt.org/.
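As a sanity check on the numbers above: BLEU-4 is the brevity penalty times the geometric mean of the 1- to 4-gram precisions, and recomputing it from the individual precisions in the mteval table reproduces the reported score.

```python
import math

# Individual n-gram precisions and lengths taken from the mteval output above
precisions = [0.5593, 0.2821, 0.1633, 0.0989]
hyp_len, ref_len = 62001, 61287

# Brevity penalty is 1 when the hypothesis is longer than the reference
bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
bleu = bp * math.exp(sum(math.log(p) for p in precisions) / 4)
print(round(bleu, 4))  # 0.2247, matching "BLEU score = 0.2247"
```

So the scoring itself is internally consistent; any gap to higher reported numbers would have to come from data, preprocessing, or hyperparameters rather than the evaluation step.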
In utils/parallel.py, the function parallel_model is a wrapper which handles multiple computing devices.
But, if I understand it correctly, it achieves the following effect:
Suppose the function model_fn returns k scalars (maybe multiple losses or metrics, e.g. accuracy), i.e. a tuple (o1, o2, ..., ok), and m devices are available: d1, d2, ..., dm.
Then the return value of the function is of shape:
You see, in the second case the return value is inconsistent. Say my model_fn has 2 return values: in the multiple-device case, I can use sharded_loss1, sharded_loss2 = parallel.parallel_model(fn, features, device_list) to capture the two losses; but if I specify only a single device on the command line, the code breaks.
Certainly, I could check isinstance(return_value_from_parallel_model, tuple) and decide how to handle the return value, but that is clumsy. It would be better to return ([o1_d1], [o2_d1], ..., [ok_d1]), i.e. a tuple of lists, in the "multiple return values + single device" case, which leads to a more consistent design.
Hope I've made myself clear.
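To make the proposal concrete, here is a hypothetical sketch (not the repo's actual code) of a wrapper that always returns a tuple of per-device lists, so callers unpack the result the same way for one device as for many:

```python
# Hypothetical sketch: when model_fn has k outputs, always return a tuple
# of k per-device lists, even when only a single device (shard) is used.
def parallel_model(model_fn, shards):
    per_device = [model_fn(shard) for shard in shards]  # one result per device
    if isinstance(per_device[0], tuple):
        # transpose m device tuples of k outputs into k lists of m values
        return tuple(list(outs) for outs in zip(*per_device))
    return per_device  # single output: plain list of m values

# With k = 2 outputs and m = 1 device, callers can still unpack two lists
losses, accs = parallel_model(lambda s: (s * 2, s + 1), [3])
print(losses, accs)  # [6] [4]
```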
Hi, I saw that you add a dropout layer after the word embedding, which is not mentioned in the RNNSearch paper "Neural Machine Translation by Jointly Learning to Align and Translate". Does this trick improve performance? Is it implemented in the vanilla Theano version, GroundHog?
Thanks!
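For context, dropout applied to an embedding vector in the usual inverted-dropout form looks roughly like this (a generic sketch, not this repo's implementation):

```python
import random

# Generic inverted dropout: zero each unit with probability `rate` and
# scale survivors by 1 / (1 - rate) so the expected value is unchanged
def dropout(vec, rate, rng):
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]

rng = random.Random(0)
embedded = [0.5] * 10
out = dropout(embedded, rate=0.2, rng=rng)  # survivors become 0.5 / 0.8
```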
In XMUNMT/xmunmt/layers/attention.py, Line 22 in 0e1539b
Hi, is it possible to share some pre-trained models (checkpoints)? Thanks.
In
Line 309 in 84d2d70
Ideally, it should be a weighted average of all losses, with the weights being the number of valid tokens on each device, shouldn't it?
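In other words, the token-weighted average differs from the plain mean whenever the shards are unbalanced; a small numeric sketch (the numbers are made up):

```python
# Made-up example: two devices whose shards contain different token counts
device_losses = [2.0, 3.0]   # mean loss reported by each device
valid_tokens = [100, 300]    # non-padding tokens processed on each device

weighted = sum(l * n for l, n in zip(device_losses, valid_tokens)) / sum(valid_tokens)
print(weighted)  # 2.75, vs. the unweighted mean of 2.5
```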
This code runs very well.
But when I tried to adapt it to implement VNMT, it converged very slowly. All I did was feed the average of the encoder hidden states to the decoder GRU cells (through several transforming matrices).
I think I should adjust the hyperparameters for my new model, and I found that the default initializer in your code is random_uniform with a range of [-0.08, 0.08]. At the very least, it's not Xavier initialization, and I don't really get it.
Does it have a theoretical basis, or did you get the hyperparameter by trial and error?
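For comparison, the Xavier/Glorot uniform bound for a square 1000x1000 matrix would be noticeably narrower than the fixed 0.08 range (this just illustrates the difference, without claiming which works better here):

```python
import math

# Xavier/Glorot uniform limit: sqrt(6 / (fan_in + fan_out))
fan_in, fan_out = 1000, 1000   # e.g. a square hidden-to-hidden matrix
xavier_limit = math.sqrt(6 / (fan_in + fan_out))
print(round(xavier_limit, 4))  # 0.0548, vs. the fixed 0.08 used here
```

A fixed uniform range like [-0.08, 0.08] also appears in early seq2seq work, so a hand-tuned choice would not be unusual.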
Hi @Playinf,
I want to do experiments on GAN-based neural machine translation, but I don't know whether this model can be used for that or whether the code needs to be modified.
I am a beginner in NMT. I hope you can offer answers or suggestions. That would be really helpful!