[ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal, Shiwei Liu, Zhangyang Wang
Hi, I'm curious about the configuration of the BERT model used in the paper. Out of the 12 BERT layers, which ones use the MoE FFN?
Also, are you planning to share the training scripts and configs for BERT/RoBERTa?
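To make the question concrete, this is the kind of setting I'm asking about; every name and value below is a hypothetical placeholder I made up, not something taken from the repo or the paper.

```python
# Hypothetical sketch of the config I'm asking about -- not from the repository.
bert_moe_config = {
    "num_hidden_layers": 12,
    "moe_layers": [1, 3, 5, 7, 9, 11],  # which of the 12 FFN blocks are converted to MoE?
    "num_experts": 16,                  # how many experts per MoE FFN?
    "top_k": 2,                         # how many experts are activated per token?
}
```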
Hi @Kyriection, thanks for the exciting work.
I noticed that you split the model's MLP into several smaller MLPs to serve as the MoE experts, but I couldn't find any code for this stage in the repository. Did I miss something? Could you share some details about this step?
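For reference, this is roughly what I imagine the splitting step looks like: partitioning the FFN's intermediate dimension into equal chunks, one per expert. The function name, the number of experts, and the bias handling below are all my own guesses, not code from this repo.

```python
import torch
import torch.nn as nn

# My guess at how a dense FFN (fc1: d -> 4d, fc2: 4d -> d) could be split into
# smaller expert MLPs along its intermediate dimension. Placeholder code only.
def split_ffn_into_experts(fc1: nn.Linear, fc2: nn.Linear, num_experts: int = 16) -> nn.ModuleList:
    d_inter = fc1.out_features
    assert d_inter % num_experts == 0, "intermediate size must divide evenly"
    chunk = d_inter // num_experts

    experts = []
    for i in range(num_experts):
        sl = slice(i * chunk, (i + 1) * chunk)
        e_fc1 = nn.Linear(fc1.in_features, chunk)
        e_fc2 = nn.Linear(chunk, fc2.out_features)
        with torch.no_grad():
            e_fc1.weight.copy_(fc1.weight[sl, :])     # rows of the first projection
            e_fc1.bias.copy_(fc1.bias[sl])
            e_fc2.weight.copy_(fc2.weight[:, sl])     # columns of the second projection
            e_fc2.bias.copy_(fc2.bias / num_experts)  # split the output bias evenly (one possible choice)
        experts.append(nn.Sequential(e_fc1, nn.GELU(), e_fc2))
    return nn.ModuleList(experts)
```

Is this roughly the idea, or is the split done differently (e.g., applied to pretrained weights rather than at initialization)?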
Thanks a lot.
What is the random routing policy of SMoE-Dropout?
I read the paper but could not find a detailed description of it.
Are you using standard dropout as the routing strategy during training?
If not, why is the method called SMoE-Dropout?
I would appreciate your answer.
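To make the question concrete, here is what I imagined "random routing" might mean: sampling k of the N experts uniformly at random for every token at each forward pass. This is purely my guess, not code from the repo, and I'd like to know whether the actual policy is different (e.g., a fixed, randomly initialized router).

```python
import torch

# My guess at a "random routing" policy: k distinct experts sampled uniformly
# at random for each token, with equal gate weights. Hypothetical sketch only.
def random_topk_routing(num_tokens: int, num_experts: int, k: int):
    scores = torch.rand(num_tokens, num_experts)   # i.i.d. uniform random scores
    topk_idx = scores.topk(k, dim=-1).indices      # k random experts per token
    gates = torch.full((num_tokens, k), 1.0 / k)   # equal weight per selected expert
    return topk_idx, gates
```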