
Comments (9)

leezu commented on September 28, 2024

Both jobs were started at the same time on separate p3.16xlarge instances. According to the above log, using the HybridBlock was about 1.5 hours faster than the fixed embedding matrix approach. I'm not sure why the difference is so large. I'll check the full logs later to see whether this is due to sustained higher throughput or to some flakiness of the p3 instance running the unmodified code.


leezu commented on September 28, 2024

I obtained

2019-07-25 18:20:45,923 - root - [Epoch 29] valid Loss=1.5228, valid ppl=4.5849, valid bleu=25.98
2019-07-25 18:26:46,444 - root - [Epoch 29] test Loss=1.3216, test ppl=3.7493, test bleu=26.10
2019-07-25 18:26:46,452 - root - Save best parameters to transformer_en_de_u512/valid_best.params
2019-07-25 18:33:03,171 - root - Best model valid Loss=1.4929, valid ppl=4.4499, valid bleu=26.25
2019-07-25 18:38:53,523 - root - Best model test Loss=1.2879, test ppl=3.6253, test bleu=26.85

when using the above Block, compared to

2019-07-25 19:43:16,816 - root - [Epoch 29] valid Loss=1.5230, valid ppl=4.5857, valid bleu=25.78
2019-07-25 19:49:17,773 - root - [Epoch 29] test Loss=1.3219, test ppl=3.7506, test bleu=26.03
2019-07-25 19:55:01,214 - root - Best model valid Loss=1.4923, valid ppl=4.4471, valid bleu=26.34
2019-07-25 20:00:55,458 - root - Best model test Loss=1.2867, test ppl=3.6208, test bleu=26.79

with the current script in the master branch.

In both cases running: MXNET_GPU_MEM_POOL_TYPE=Round python train_transformer.py --dataset WMT2014BPE --src_lang en --tgt_lang de --batch_size 2700 --optimizer adam --num_accumulated 16 --lr 2.0 --warmup_steps 4000 --save_dir transformer_en_de_u512 --epochs 30 --gpus 0,1,2,3,4,5,6,7 --scaled --average_start 5 --num_buckets 20 --bucket_scheme exp --bleu 13a --log_interval 10


szhengac commented on September 28, 2024

Hi @JulianSlzr. Thanks for catching this. I think it would not make much difference. t2t and sockeye also use an ordering that differs slightly from the paper. Do you mind trying to run the script with this modification to see how the performance changes?


sxjscience commented on September 28, 2024

I think the relative order should not affect the performance.


szha commented on September 28, 2024

Any update?


leezu commented on September 28, 2024

As the relative order doesn't seem to matter, how about replacing the current logic, which generates a fixed-length (max_length) embedding matrix, with:

import mxnet as mx


class PositionalEmbedding(mx.gluon.HybridBlock):
    """Positional embedding.

    Parameters
    ----------
    embed_size : int
        Dimensionality of positional embeddings.
    """

    def __init__(self, embed_size, **kwargs):
        super().__init__(**kwargs)

        inv_freq = 1 / mx.nd.power(10000, mx.nd.arange(0.0, embed_size, 2.0) / embed_size)
        with self.name_scope():
            self.inv_freq = self.params.get_constant('inv_freq', inv_freq.reshape((1, -1)))

    def hybrid_forward(self, F, pos_seq, inv_freq):  # pylint: disable=arguments-differ
        """Compute positional embeddings.

        Parameters
        ----------
        pos_seq : Symbol or NDArray
            Positions to compute embedding for. Shape (length, )

        Returns
        -------
        pos_emb: Symbol or NDArray
            Positional embeddings for positions specified in pos_seq. Shape
            (length, embed_size).
        """
        inp = F.dot(pos_seq.reshape((-1, 1)), inv_freq)
        pos_emb = F.concat(F.sin(inp), F.cos(inp), dim=-1)
        return pos_emb

In that case we don't require users to specify max_length a priori when using sinusoidal embeddings.
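
For illustration, a minimal usage sketch (the embed_size value here is arbitrary):

pos_embed = PositionalEmbedding(embed_size=512)
pos_embed.initialize()
pos_embed.hybridize()

# Positions are computed on demand; no maximum length is baked in.
pos_seq = mx.nd.arange(100)   # positions 0..99
pos_emb = pos_embed(pos_seq)  # shape (100, 512)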

The above Block is already used as part of #846.


szhengac commented on September 28, 2024

In the first version, I also computed the embeddings on the fly, but I found that it slowed down the training, so I changed it to a predefined embedding matrix. Have you checked how long the training takes compared to using a fixed embedding matrix?
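
For reference, a minimal sketch of the predefined-matrix alternative being compared against (the helper name is illustrative; assumes import mxnet as mx as above). The whole (max_length, embed_size) table is built once up front:

def fixed_position_matrix(max_length, embed_size):
    # Build the full sinusoidal table once, instead of recomputing
    # embeddings on the fly in every forward pass.
    position = mx.nd.arange(max_length).reshape((-1, 1))
    inv_freq = 1 / mx.nd.power(10000, mx.nd.arange(0.0, embed_size, 2.0) / embed_size)
    inp = mx.nd.dot(position, inv_freq.reshape((1, -1)))
    return mx.nd.concat(mx.nd.sin(inp), mx.nd.cos(inp), dim=-1)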


leezu commented on September 28, 2024

See the two log files attached. Somehow the unmodified run got delayed by 1 hour during the first evaluation:

2019-07-24 21:38:45,328 - root - [Epoch 0 Batch 7520/7679] loss=7.0374, ppl=1138.4511, throughput=163.62K wps, wc=5783.88K
2019-07-24 22:14:10,205 - root - [Epoch 0] valid Loss=6.2416, valid ppl=513.6734, valid bleu=0.21
2019-07-24 22:51:49,854 - root - [Epoch 0] test Loss=6.4132, test ppl=609.8661, test bleu=0.18
2019-07-24 22:51:49,862 - root - Save best parameters to transformer_en_de_u512/valid_best.params
2019-07-24 22:52:29,000 - root - [Epoch 1 Batch 160/7679] loss=6.9615, ppl=1055.1741, throughput=152.81K wps, wc=5780.33K

Compared to the modified run:

2019-07-24 21:39:13,101 - root - [Epoch 0 Batch 7520/7679] loss=7.0567, ppl=1160.6043, throughput=162.76K wps, wc=5783.88K
2019-07-24 21:55:46,334 - root - [Epoch 0] valid Loss=6.2516, valid ppl=518.8240, valid bleu=0.27
2019-07-24 22:09:51,370 - root - [Epoch 0] test Loss=6.4214, test ppl=614.8332, test bleu=0.24
2019-07-24 22:09:51,378 - root - Save best parameters to transformer_en_de_u512/valid_best.params
2019-07-24 22:10:35,695 - root - [Epoch 1 Batch 160/7679] loss=6.9881, ppl=1083.6077, throughput=152.01K wps, wc=5780.33K

train_transformer.log
train_transformer_with_pos_emb_block.log

However, based on the throughput numbers (in the attached files), I think we can conclude that replacing the precomputed embedding matrix with the above Block does not, or at least not significantly, impact throughput.


szhengac commented on September 28, 2024

It seems that the main difference comes from the validation/testing in the first few epochs, where the modified embedding leads to shorter output sequences, so the beam search finishes more quickly. This suggests that the positions of sin and cos have something to do with the generation process.
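
To make the layout question concrete: the paper interleaves sin and cos along the embedding dimension (sin at even indices, cos at odd), while the Block above concatenates all sines followed by all cosines. A hypothetical sketch of the interleaved variant (helper name is illustrative):

def interleaved_pos_emb(pos_seq, inv_freq):
    # Paper layout: columns alternate sin_0, cos_0, sin_1, cos_1, ...
    # (the Block above instead produces [sin_0..sin_k, cos_0..cos_k]).
    inp = mx.nd.dot(pos_seq.reshape((-1, 1)), inv_freq.reshape((1, -1)))
    sin, cos = mx.nd.sin(inp), mx.nd.cos(inp)
    # Stack on a new trailing axis, then flatten so the pairs interleave.
    return mx.nd.stack(sin, cos, axis=2).reshape((pos_seq.shape[0], -1))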

