
Comments (9)

leezu commented on September 28, 2024

Both jobs were started at the same time on separate p3.16xlarge instances. According to the above log, using the HybridBlock was about 1.5 hours faster than the fixed embedding matrix approach. I'm not sure why the difference is so large. I'll check the full logs later to see whether this is due to sustained higher throughput or to some flakiness of the p3 instance running the unmodified code.


leezu commented on September 28, 2024

I obtained

2019-07-25 18:20:45,923 - root - [Epoch 29] valid Loss=1.5228, valid ppl=4.5849, valid bleu=25.98
2019-07-25 18:26:46,444 - root - [Epoch 29] test Loss=1.3216, test ppl=3.7493, test bleu=26.10
2019-07-25 18:26:46,452 - root - Save best parameters to transformer_en_de_u512/valid_best.params
2019-07-25 18:33:03,171 - root - Best model valid Loss=1.4929, valid ppl=4.4499, valid bleu=26.25
2019-07-25 18:38:53,523 - root - Best model test Loss=1.2879, test ppl=3.6253, test bleu=26.85

when using the above Block, compared to

2019-07-25 19:43:16,816 - root - [Epoch 29] valid Loss=1.5230, valid ppl=4.5857, valid bleu=25.78
2019-07-25 19:49:17,773 - root - [Epoch 29] test Loss=1.3219, test ppl=3.7506, test bleu=26.03
2019-07-25 19:55:01,214 - root - Best model valid Loss=1.4923, valid ppl=4.4471, valid bleu=26.34
2019-07-25 20:00:55,458 - root - Best model test Loss=1.2867, test ppl=3.6208, test bleu=26.79

with the current script in the master branch.

In both cases running: MXNET_GPU_MEM_POOL_TYPE=Round python train_transformer.py --dataset WMT2014BPE --src_lang en --tgt_lang de --batch_size 2700 --optimizer adam --num_accumulated 16 --lr 2.0 --warmup_steps 4000 --save_dir transformer_en_de_u512 --epochs 30 --gpus 0,1,2,3,4,5,6,7 --scaled --average_start 5 --num_buckets 20 --bucket_scheme exp --bleu 13a --log_interval 10


szhengac commented on September 28, 2024

Hi @JulianSlzr. Thanks for catching this. I think it would not make much difference. t2t and sockeye also use an ordering that differs slightly from the paper. Do you mind trying to run the script with this modification to see how the performance changes?


sxjscience commented on September 28, 2024

I think the relative order should not affect the performance.


szha commented on September 28, 2024

Any update?


leezu commented on September 28, 2024

As the relative order doesn't seem to matter, how about replacing the current logic, which generates a fixed-length (max_length) embedding matrix, with:

import mxnet as mx


class PositionalEmbedding(mx.gluon.HybridBlock):
    """Positional embedding.

    Parameters
    ----------
    embed_size : int
        Dimensionality of positional embeddings.
    """

    def __init__(self, embed_size, **kwargs):
        super().__init__(**kwargs)

        inv_freq = 1 / mx.nd.power(10000, mx.nd.arange(0.0, embed_size, 2.0) / embed_size)
        with self.name_scope():
            self.inv_freq = self.params.get_constant('inv_freq', inv_freq.reshape((1, -1)))

    def hybrid_forward(self, F, pos_seq, inv_freq):  # pylint: disable=arguments-differ
        """Compute positional embeddings.

        Parameters
        ----------
        pos_seq : Symbol or NDArray
            Positions to compute embedding for. Shape (length, )

        Returns
        -------
        pos_emb: Symbol or NDArray
            Positional embeddings for positions specified in pos_seq. Shape
            (length, embed_size).
        """
        inp = F.dot(pos_seq.reshape((-1, 1)), inv_freq)
        pos_emb = F.concat(F.sin(inp), F.cos(inp), dim=-1)
        return pos_emb

In that case we don't require users to specify max_length a priori when using sinusoidal embeddings.
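
For illustration, a minimal usage sketch (the embed_size value here is arbitrary):

pos_embed = PositionalEmbedding(embed_size=512)
pos_embed.initialize()
pos_embed.hybridize()

# Positions are computed on demand; no maximum length is baked in.
pos_seq = mx.nd.arange(100)   # positions 0..99
pos_emb = pos_embed(pos_seq)  # shape (100, 512)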

The above Block is already used as part of #846.


szhengac commented on September 28, 2024

In the first version, I also computed the embeddings on the fly, but I found that it slowed down the training, so I changed it to a predefined embedding matrix. Have you checked how long the training takes compared to using a fixed embedding matrix?
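
For reference, a minimal sketch of the predefined-matrix alternative being compared against (the helper name is illustrative; assumes import mxnet as mx as above). The whole (max_length, embed_size) table is built once up front:

def fixed_position_matrix(max_length, embed_size):
    # Build the full sinusoidal table once, instead of recomputing
    # embeddings on the fly in every forward pass.
    position = mx.nd.arange(max_length).reshape((-1, 1))
    inv_freq = 1 / mx.nd.power(10000, mx.nd.arange(0.0, embed_size, 2.0) / embed_size)
    inp = mx.nd.dot(position, inv_freq.reshape((1, -1)))
    return mx.nd.concat(mx.nd.sin(inp), mx.nd.cos(inp), dim=-1)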


leezu commented on September 28, 2024

See the two log files attached. Somehow the unmodified run got delayed by 1 hour during the first evaluation:

2019-07-24 21:38:45,328 - root - [Epoch 0 Batch 7520/7679] loss=7.0374, ppl=1138.4511, throughput=163.62K wps, wc=5783.88K
2019-07-24 22:14:10,205 - root - [Epoch 0] valid Loss=6.2416, valid ppl=513.6734, valid bleu=0.21
2019-07-24 22:51:49,854 - root - [Epoch 0] test Loss=6.4132, test ppl=609.8661, test bleu=0.18
2019-07-24 22:51:49,862 - root - Save best parameters to transformer_en_de_u512/valid_best.params
2019-07-24 22:52:29,000 - root - [Epoch 1 Batch 160/7679] loss=6.9615, ppl=1055.1741, throughput=152.81K wps, wc=5780.33K

Compared to the modified run:

2019-07-24 21:39:13,101 - root - [Epoch 0 Batch 7520/7679] loss=7.0567, ppl=1160.6043, throughput=162.76K wps, wc=5783.88K
2019-07-24 21:55:46,334 - root - [Epoch 0] valid Loss=6.2516, valid ppl=518.8240, valid bleu=0.27
2019-07-24 22:09:51,370 - root - [Epoch 0] test Loss=6.4214, test ppl=614.8332, test bleu=0.24
2019-07-24 22:09:51,378 - root - Save best parameters to transformer_en_de_u512/valid_best.params
2019-07-24 22:10:35,695 - root - [Epoch 1 Batch 160/7679] loss=6.9881, ppl=1083.6077, throughput=152.01K wps, wc=5780.33K

train_transformer.log
train_transformer_with_pos_emb_block.log

However, based on the throughput numbers (in the attached files), I think we can conclude that replacing the precomputed embedding matrix with the above Block does not, or at least not significantly, impact throughput.


szhengac commented on September 28, 2024

It seems that the main difference comes from the validation/testing in the first few epochs, where the modified embedding leads to shorter output sequences, so the beam search finishes more quickly. This suggests that the positions of sin and cos have something to do with the generation process.
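
To make the layout question concrete: the paper interleaves sin and cos along the embedding dimension (sin at even indices, cos at odd), while the Block above concatenates all sines followed by all cosines. A hypothetical sketch of the interleaved variant (helper name is illustrative):

def interleaved_pos_emb(pos_seq, inv_freq):
    # Paper layout: columns alternate sin_0, cos_0, sin_1, cos_1, ...
    # (the Block above instead produces [sin_0..sin_k, cos_0..cos_k]).
    inp = mx.nd.dot(pos_seq.reshape((-1, 1)), inv_freq.reshape((1, -1)))
    sin, cos = mx.nd.sin(inp), mx.nd.cos(inp)
    # Stack on a new trailing axis, then flatten so the pairs interleave.
    return mx.nd.stack(sin, cos, axis=2).reshape((pos_seq.shape[0], -1))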

