A Transformer implemented from scratch in PyTorch, used for English-Vietnamese (Eng-Vi) translation.
Done:
- Changed the order of the Norm layer, following "On Layer Normalization in the Transformer Architecture"
- Training on multiple GPUs
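
As an illustration of the two items above, here is a minimal sketch of an encoder layer with the Norm layer moved in front of each sub-layer (the "Pre-LN" ordering discussed in the cited paper), optionally replicated across GPUs. It is not the code from this repository: the class name `PreLNEncoderLayer` and the sizes are hypothetical, the built-in `nn.MultiheadAttention` stands in for a from-scratch attention module, and `nn.DataParallel` is only one possible way to use several GPUs.

```python
import torch
import torch.nn as nn


class PreLNEncoderLayer(nn.Module):
    """Hypothetical encoder layer: LayerNorm is applied before each sub-layer."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        # Built-in attention used here for brevity (assumption, not from scratch).
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, dropout=dropout, batch_first=True
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, key_padding_mask=None):
        # Pre-LN:  x + Sublayer(LayerNorm(x))
        # Post-LN (original ordering) would be LayerNorm(x + Sublayer(x)).
        h = self.norm1(x)
        attn_out, _ = self.attn(
            h, h, h, key_padding_mask=key_padding_mask, need_weights=False
        )
        x = x + self.dropout(attn_out)
        x = x + self.dropout(self.ff(self.norm2(x)))
        return x


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PreLNEncoderLayer().to(device)
if torch.cuda.device_count() > 1:
    # Assumption: simple data parallelism; each batch is split along dim 0.
    model = nn.DataParallel(model)

x = torch.randn(8, 10, 64, device=device)  # (batch, sequence, d_model)
out = model(x)
```

With this ordering the residual path itself is never normalized, so a final LayerNorm is usually applied once after the last layer of the stack.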
Future work:
- Try the Sophia optimizer ("Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training")
- Train on TPU