Re-implementation of "Attention Is All You Need"
- Data Here
  - English to French
- Technique
  - Sort sentences by length in ascending order (better performance)
    - With random batches, each batch contains too much padding.
    - So batches are not sampled randomly.
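The length-sorted, non-random batching described above can be sketched as follows (the `pairs` data and `batch_size` values are illustrative assumptions, not the repo's actual loader):

```python
# Hypothetical (source, target) token-ID pairs; the real data is English-French.
pairs = [([1, 2, 3], [4, 5]), ([1], [2]), ([1, 2], [3, 4, 5]), ([1, 2, 3, 4], [5])]

# Ascending sort by source-sentence length, so neighboring sentences have
# similar lengths and each batch needs very little padding.
pairs.sort(key=lambda p: len(p[0]))

def make_batches(pairs, batch_size):
    """Yield consecutive batches in sorted order (no random shuffling)."""
    for i in range(0, len(pairs), batch_size):
        yield pairs[i:i + batch_size]

batches = list(make_batches(pairs, batch_size=2))
```

With a random sampler, a batch could mix a 1-token and a 50-token sentence, forcing heavy padding; sorting first avoids that.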
- Transformer
  - model.py
    - Encoder/Decoder layers
  - layer.py
    - MultiHeadAttention, FeedForward, etc.
  - sublayer.py
    - Add & Norm
      - just uses `torch.nn.LayerNorm`
  - train.py
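A minimal Add & Norm sublayer built on `torch.nn.LayerNorm`, as the layout above suggests (class name, dropout placement, and `d_model` value are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """Residual connection followed by layer normalization (Add & Norm)."""
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)  # the built-in LayerNorm noted above
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer_out):
        # x: sublayer input; sublayer_out: its output (attention or feed-forward)
        return self.norm(x + self.dropout(sublayer_out))

# Usage sketch: shapes are (batch, sequence length, d_model).
block = AddNorm(d_model=512)
x = torch.randn(2, 10, 512)
y = block(x, torch.randn(2, 10, 512))
```

Using the built-in `nn.LayerNorm` avoids re-deriving the mean/variance normalization by hand.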
- Techniques used in this paper
  - Label Smoothing
  - Learning Rate Scheduling
  - etc.
- Make test code (why doesn't it work with shifted right?)
- Use a bigger dataset
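For the shifted-right question above: during training the decoder input is the target shifted right by one position (with a BOS token prepended), combined with a subsequent mask so position i only attends to positions up to i. A pure-Python sketch (token IDs and helper names are illustrative):

```python
def shift_right(target_ids, bos_id=1):
    """Decoder input = target shifted right by one, with BOS prepended."""
    return [bos_id] + target_ids[:-1]

def subsequent_mask(size):
    """mask[i][j] is True iff position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]

tgt = [5, 6, 7, 8]          # target token IDs
dec_in = shift_right(tgt)    # what the decoder actually sees
mask = subsequent_mask(len(tgt))
```

If either the shift or the mask is missing, the decoder can see the token it is supposed to predict, which is a common cause of a model that "works" in training but fails at inference.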