Comments (5)
![125M_exps8-val-loss](https://private-user-images.githubusercontent.com/22651617/343895907-c57a4590-3c34-4a20-be6e-a63103b867a1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk4NTE4MzAsIm5iZiI6MTcxOTg1MTUzMCwicGF0aCI6Ii8yMjY1MTYxNy8zNDM4OTU5MDctYzU3YTQ1OTAtM2MzNC00YTIwLWJlNmUtYTYzMTAzYjg2N2ExLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzAxVDE2MzIxMFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTdiMTVhYjdmOWNiY2MyM2IwZmFjODA1NTFlYWU4ZDIzMzhhMGNmZmU2MGNlNGFjMTA0YTg4MjlkNzRkZTA2NWUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.DnyckRWApGfWzzIpWwJOFgFcx203bANUN3gDpZsHprQ)
Here is some additional information:
The figure above shows validation loss curves for 125M MoE and dense models trained with megatron_125M_k1_e16_moe_3e-4_config.txt (maxLR \in {1e-4, 3e-4, 6e-4, 9e-4}) and megatron_dense_125M_config.txt, respectively. We observe that the MoE models underperform the dense models, contrary to results from the literature. As mentioned above, this suggests that there is a bug in Megatron-LM.
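For context on why the MoE is expected to win here: with E=16 experts and K=1 routing, the MoE carries far more parameters than the dense baseline at the same per-token compute. A rough back-of-the-envelope count (a sketch only; the hidden size, depth, and FFN multiplier are assumed standard 125M dimensions, not values read from the configs):

```python
# Rough parameter count: 125M dense GPT vs. a 16-expert top-1 MoE variant.
# Assumed dimensions (typical for the 125M scale; not taken from the configs):
hidden, layers, ffn_mult, num_experts = 768, 12, 4, 16

ffn_params_per_layer = 2 * hidden * (ffn_mult * hidden)  # up-proj + down-proj
dense_ffn = layers * ffn_params_per_layer                # ~56.6M
moe_ffn = num_experts * dense_ffn                        # ~906M (16 expert copies)

dense_total = 125e6
moe_total = dense_total - dense_ffn + moe_ffn            # ~975M

print(f"dense FFN params: {dense_ffn / 1e6:.1f}M")
print(f"MoE   FFN params: {moe_ffn / 1e6:.1f}M")
print(f"MoE total params: ~{moe_total / 1e6:.0f}M at the same per-token FLOPs (K=1)")
```

So a K=1, E=16 MoE that fails to beat its dense counterpart is leaving a large amount of extra FFN capacity unused, which is consistent with the suspicion of an implementation bug rather than a tuning issue.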
Personally, I think we should change the horizontal axis to FLOPs and then compare the loss.
These MoEs are all K=1, so they are already FLOPs-matched (in other words, the plots would be the same if we changed the horizontal axis to FLOPs).
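To make the FLOPs-matching point concrete, here is a quick sanity check using the same assumed 125M dimensions as above: with top-1 routing, each token passes through exactly one expert FFN, so per-token FFN FLOPs match the dense model regardless of the expert count, up to a negligible router term.

```python
# Per-token forward FLOPs for the FFN block: dense vs. top-k MoE.
# A (1, d) x (d, f) matmul costs ~2*d*f FLOPs per token.
hidden, ffn_mult = 768, 4
dense_ffn_flops = 2 * (2 * hidden * (ffn_mult * hidden))  # up-proj + down-proj

def moe_ffn_flops(k: int, num_experts: int) -> int:
    # Each token runs through its top-k experts; the router adds only ~2*d*E.
    router = 2 * hidden * num_experts
    return k * dense_ffn_flops + router

print(dense_ffn_flops)        # dense: ~9.4 MFLOPs per token
print(moe_ffn_flops(1, 16))   # K=1 MoE: the same, plus a tiny router overhead
```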
Running the same config with Megatron-DeepSpeed does result in the MoE outperforming the dense model. This was run with 8 experts, topk=1 and a 125M base model.
![Deepspeed_moe_dense_e-8_topk1](https://private-user-images.githubusercontent.com/66789976/344224972-a4ee6c4d-7422-4ff7-bcd8-cd4bc1370567.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTk4NTE4MzAsIm5iZiI6MTcxOTg1MTUzMCwicGF0aCI6Ii82Njc4OTk3Ni8zNDQyMjQ5NzItYTRlZTZjNGQtNzQyMi00ZmY3LWJjZDgtY2Q0YmMxMzcwNTY3LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA3MDElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNzAxVDE2MzIxMFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTZmMTE1ZDBkOGZhY2JhODQ3ZmZhOTdlOTgyMTVlNjkyZTU4YTk4YTI1MjYwNDAyNjY2NjlkNmI0YTg3NGM5ZTcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.DWWujgZAw0iiKXcYZy2AlC5xKQF2izZ40c-j_6DN3so)
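For readers unfamiliar with what "8 experts, topk=1" means mechanically, below is a minimal switch-style top-1 MoE FFN in plain PyTorch. This is an illustrative sketch only, not the Megatron-DeepSpeed implementation (which adds expert parallelism, capacity limits, token dropping, and an auxiliary load-balancing loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoEFFN(nn.Module):
    """Minimal switch-style MoE FFN: route each token to exactly one expert."""
    def __init__(self, hidden: int, ffn: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, hidden)
        probs = F.softmax(self.router(x), dim=-1)  # (tokens, num_experts)
        gate, idx = probs.max(dim=-1)              # top-1 gate value and expert id
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):  # loop over experts for clarity, not speed
            mask = idx == e
            if mask.any():
                # Scale each expert's output by its gate probability.
                out[mask] = gate[mask, None] * expert(x[mask])
        return out

moe = Top1MoEFFN(hidden=768, ffn=3072, num_experts=8)
y = moe(torch.randn(16, 768))  # 16 tokens
print(y.shape)                  # torch.Size([16, 768])
```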
Moreover, below is a plot directly comparing the training loss of dense and MoE models in Megatron and GPT-NeoX, trained using GBS=768, SL=2048, E=16 (total experts), K=1 (active experts). All models are trained on the same dataset with the same linear warmup + cosine annealing LR schedule (maxLR 3e-4 to minLR 3e-5). We observe that the GPT-NeoX implementation produces results in line with the literature (e.g., the Switch Transformer paper, Figure 1, right), while the Megatron implementation does not.
This suggests there is a bug in Megatron-LM. @jaredcasper @duncanriach @jon-barker
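For reproducibility, the LR schedule used in these runs (linear warmup, then cosine annealing from maxLR 3e-4 to minLR 3e-5) can be written out as follows. The warmup length and total step count here are placeholders, since the comment does not state them:

```python
import math

def lr_at(step: int, max_steps: int, max_lr: float = 3e-4,
          min_lr: float = 3e-5, warmup_steps: int = 1000) -> float:
    """Linear warmup, then cosine decay from max_lr to min_lr.
    warmup_steps and max_steps are hypothetical; only the LR endpoints
    come from the comment above."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(500, 10_000))     # mid-warmup: 1.5e-4
print(lr_at(10_000, 10_000))  # end of schedule: 3e-5
```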
Related Issues (20)
- [QUESTION] Question about Mixtral compatibility with Megatron-LM core0.7.0
- [BUG] megatron.training not found HOT 3
- [QUESTION] How to time the code
- [BUG] pipeline_paralle is not available when pp_size > 2
- [BUG] RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead. HOT 1
- [QUESTION]when pretraining bert,meet bug:cuBLAS Error: the requested functionality is not supported HOT 3
- [QUESTION] Gloo connectFullMesh failed when the number of nodes setting "export GLOO_SOCKET_IFNAME=bond4" exceeds 60
- [QUESTION] OSError: [Errno 28] No space left on device HOT 4
- [QUESTION] --overlap-grad-allreduce failing as gradients coming through as None in param hook HOT 2
- [BUG] @jit_fuser fails with Unknown type constructor Sequence HOT 6
- [BUGS] Pipeline Parallelism fails/hangs with Megatron Core example HOT 1
- [QUESTION] What's the internal difference for training when setting only "fp8-format" or setting "fp8-format"+"bf16" HOT 1
- [QUESTION] Why is TELayerNormColumnParallelLinear used instead of TEColumnParallelLinear in gpt_layer_specs HOT 2
- [QUESTION] Why does the tokenizer of mamba-2-hybrid have two ids for the token 'Yes'? id 24639 and id 7298 HOT 1
- [QUESTION] Has standalone_embedding_stage been supported yet in core? HOT 1
- [QUESTION] Sample idx, bin files in public domain for trying out pretrain_gpt.py?
- [QUESTION] Getting tools/preprocess_data.py to work is painful
- [BUG]Question about helpers.cpp in version core_v0.7.0
- Batch_input and elapsed time per iteration slow down during model training HOT 1