Comments (5)
![125M_exps8-val-loss](https://private-user-images.githubusercontent.com/22651617/343895907-c57a4590-3c34-4a20-be6e-a63103b867a1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk4NTE4MzAsIm5iZiI6MTcxOTg1MTUzMCwicGF0aCI6Ii8yMjY1MTYxNy8zNDM4OTU5MDctYzU3YTQ1OTAtM2MzNC00YTIwLWJlNmUtYTYzMTAzYjg2N2ExLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzAxVDE2MzIxMFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTdiMTVhYjdmOWNiY2MyM2IwZmFjODA1NTFlYWU4ZDIzMzhhMGNmZmU2MGNlNGFjMTA0YTg4MjlkNzRkZTA2NWUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.DnyckRWApGfWzzIpWwJOFgFcx203bANUN3gDpZsHprQ)
Here is some additional information:
The figure above shows validation loss curves for 125M MoE and dense models trained with megatron_125M_k1_e16_moe_3e-4_config.txt (maxLR \in {1e-4, 3e-4, 6e-4, 9e-4}) and megatron_dense_125M_config.txt, respectively. We observe that the MoE models underperform the dense models, contrary to results from the literature. As mentioned above, this suggests that there is a bug in Megatron-LM.
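For context on why the MoE is expected to win here: with E=16 experts and K=1 routing, the MoE carries far more parameters than the dense baseline at the same per-token compute. A rough back-of-the-envelope count (a sketch only; the hidden size, depth, and FFN multiplier are assumed standard 125M dimensions, not values read from the configs):

```python
# Rough parameter count: 125M dense GPT vs. a 16-expert top-1 MoE variant.
# Assumed dimensions (typical for the 125M scale; not taken from the configs):
hidden, layers, ffn_mult, num_experts = 768, 12, 4, 16

ffn_params_per_layer = 2 * hidden * (ffn_mult * hidden)  # up-proj + down-proj
dense_ffn = layers * ffn_params_per_layer                # ~56.6M
moe_ffn = num_experts * dense_ffn                        # ~906M (16 expert copies)

dense_total = 125e6
moe_total = dense_total - dense_ffn + moe_ffn            # ~975M

print(f"dense FFN params: {dense_ffn / 1e6:.1f}M")
print(f"MoE   FFN params: {moe_ffn / 1e6:.1f}M")
print(f"MoE total params: ~{moe_total / 1e6:.0f}M at the same per-token FLOPs (K=1)")
```

So a K=1, E=16 MoE that fails to beat its dense counterpart is leaving a large amount of extra FFN capacity unused, which is consistent with the suspicion of an implementation bug rather than a tuning issue.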
Personally, I think we should change the horizontal axis to FLOPs and then compare the loss.
These MoEs are all K=1, so they are already FLOPs-matched (in other words, the plots would be the same if we changed the horizontal axis to FLOPs).
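To make the FLOPs-matching point concrete, here is a quick sanity check using the same assumed 125M dimensions as above: with top-1 routing, each token passes through exactly one expert FFN, so per-token FFN FLOPs match the dense model regardless of the expert count, up to a negligible router term.

```python
# Per-token forward FLOPs for the FFN block: dense vs. top-k MoE.
# A (1, d) x (d, f) matmul costs ~2*d*f FLOPs per token.
hidden, ffn_mult = 768, 4
dense_ffn_flops = 2 * (2 * hidden * (ffn_mult * hidden))  # up-proj + down-proj

def moe_ffn_flops(k: int, num_experts: int) -> int:
    # Each token runs through its top-k experts; the router adds only ~2*d*E.
    router = 2 * hidden * num_experts
    return k * dense_ffn_flops + router

print(dense_ffn_flops)        # dense: ~9.4 MFLOPs per token
print(moe_ffn_flops(1, 16))   # K=1 MoE: the same, plus a tiny router overhead
```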
Running the same config with Megatron-DeepSpeed does result in the MoE outperforming the dense model. This was run with 8 experts, topk=1 and a 125M base model.
![Deepspeed_moe_dense_e-8_topk1](https://private-user-images.githubusercontent.com/66789976/344224972-a4ee6c4d-7422-4ff7-bcd8-cd4bc1370567.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk4NTE4MzAsIm5iZiI6MTcxOTg1MTUzMCwicGF0aCI6Ii82Njc4OTk3Ni8zNDQyMjQ5NzItYTRlZTZjNGQtNzQyMi00ZmY3LWJjZDgtY2Q0YmMxMzcwNTY3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzAxVDE2MzIxMFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTZmMTE1ZDBkOGZhY2JhODQ3ZmZhOTdlOTgyMTVlNjkyZTU4YTk4YTI1MjYwNDAyNjY2NjlkNmI0YTg3NGM5ZTcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.DWWujgZAw0iiKXcYZy2AlC5xKQF2izZ40c-j_6DN3so)
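For readers unfamiliar with what "8 experts, topk=1" means mechanically, below is a minimal switch-style top-1 MoE FFN in plain PyTorch. This is an illustrative sketch only, not the Megatron-DeepSpeed implementation (which adds expert parallelism, capacity limits, token dropping, and an auxiliary load-balancing loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoEFFN(nn.Module):
    """Minimal switch-style MoE FFN: route each token to exactly one expert."""
    def __init__(self, hidden: int, ffn: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden)
        probs = F.softmax(self.router(x), dim=-1)  # (tokens, num_experts)
        gate, idx = probs.max(dim=-1)              # top-1 gate value and expert id
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):  # loop over experts for clarity, not speed
            mask = idx == e
            if mask.any():
                # Scale each expert's output by its gate probability.
                out[mask] = gate[mask, None] * expert(x[mask])
        return out

moe = Top1MoEFFN(hidden=768, ffn=3072, num_experts=8)
y = moe(torch.randn(16, 768))  # 16 tokens
print(y.shape)                  # torch.Size([16, 768])
```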
Moreover, below is a plot directly comparing the training loss of dense and MoE models in Megatron and GPT-NeoX, trained using GBS=768, SL=2048, E=16 (total experts), K=1 (active experts). All models are trained on the same dataset with the same linear warmup + cosine annealing LR schedule (maxLR 3e-4 to minLR 3e-5). We observe that the GPT-NeoX implementation produces results in line with the literature (e.g., the Switch Transformer paper, Figure 1, right), while the Megatron implementation does not.
This suggests there is a bug in Megatron-LM. @jaredcasper @duncanriach @jon-barker
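For reproducibility, the LR schedule used in these runs (linear warmup, then cosine annealing from maxLR 3e-4 to minLR 3e-5) can be written out as follows. The warmup length and total step count here are placeholders, since the comment does not state them:

```python
import math

def lr_at(step: int, max_steps: int, max_lr: float = 3e-4,
          min_lr: float = 3e-5, warmup_steps: int = 1000) -> float:
    """Linear warmup, then cosine decay from max_lr to min_lr.
    warmup_steps and max_steps are hypothetical; only the LR endpoints
    come from the comment above."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(500, 10_000))     # mid-warmup: 1.5e-4
print(lr_at(10_000, 10_000))  # end of schedule: 3e-5
```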
Related Issues (20)
- [QUESTION] Question about Mixtral compatibility with Megatron-LM core0.7.0
- [BUG] megatron.training not found HOT 3
- [QUESTION] How to time the code
- [BUG] pipeline_paralle is not available when pp_size > 2
- [BUG] RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead. HOT 1
- [QUESTION]when pretraining bert,meet bug:cuBLAS Error: the requested functionality is not supported HOT 3
- [QUESTION] Gloo connectFullMesh failed when the number of nodes setting "export GLOO_SOCKET_IFNAME=bond4" exceeds 60
- [QUESTION] OSError: [Errno 28] No space left on device HOT 4
- [QUESTION] --overlap-grad-allreduce failing as gradients coming through as None in param hook HOT 2
- [BUG] @jit_fuser fails with Unknown type constructor Sequence HOT 6
- [BUGS] Pipeline Parallelism fails/hangs with Megatron Core example HOT 1
- [QUESTION] What's the internal difference for training when setting only "fp8-format" or setting "fp8-format"+"bf16" HOT 1
- [QUESTION] Why is TELayerNormColumnParallelLinear used instead of TEColumnParallelLinear in gpt_layer_specs HOT 2
- [QUESTION] Why does the tokenizer of mamba-2-hybrid have two ids for the token 'Yes'? id 24639 and id 7298 HOT 1
- [QUESTION] Has standalone_embedding_stage been supported yet in core? HOT 1
- [QUESTION] Sample idx, bin files in public domain for trying out pretrain_gpt.py?
- [QUESTION] Getting tools/preprocess_data.py to work is painful
- [BUG]Question about helpers.cpp in version core_v0.7.0
- Batch_input and elapsed time per iteration slow down during model training HOT 1