pjlab-sys4nlp / llama-moe
⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
License: Apache License 2.0
Here you convert the existing FFN of LLaMA into multiple small sparse experts. But what if we kept the original FFN dimension for each expert? The number of parameters would increase, but that should be fine, since we are going to continually pre-train the model anyway.
One way to achieve this is to duplicate the FFN into multiple experts and add a gate on top. I know it is not ideal for the experts to start with identical weights, so perhaps we could add a bit of random noise to the weights before pre-training?
Let me know your thoughts!
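For what it's worth, a minimal sketch of that duplicate-and-perturb idea, assuming a PyTorch nn.Module FFN (the function name and noise scale are illustrative, not from the repo):

import copy
import torch

def duplicate_ffn_into_experts(ffn: torch.nn.Module, num_experts: int, noise_std: float = 1e-3):
    # Clone the dense FFN num_experts times and add small Gaussian noise to
    # each copy so the experts can diverge during continual pre-training.
    experts = torch.nn.ModuleList()
    for _ in range(num_experts):
        expert = copy.deepcopy(ffn)
        with torch.no_grad():
            for p in expert.parameters():
                p.add_(noise_std * torch.randn_like(p))
        experts.append(expert)
    return experts

Note that even without the noise, a randomly initialized router already sends different tokens to different copies, so the experts would not stay identical for long; the noise just breaks the symmetry from step one.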
https://github.com/pjlab-sys4nlp/llama-moe/blob/main/smoe/trainer/llama_lr_scheduling.py#L125
Thanks for sharing the repo. I have a question about the learning rate: final_lr_portion is supported here, which seems to differ from Megatron's implementation. Could you explain whether this is reasonable? Thanks.
The lr decay segment may also be affected by final_lr_portion.
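For reference, a minimal sketch of a schedule with a learning-rate floor, assuming final_lr_portion means the final lr is that fraction of the peak lr (an assumption for illustration, not the repo's exact code):

import math

def lr_at_step(step, max_steps, warmup_steps, max_lr, final_lr_portion):
    # Linear warmup, then cosine decay from max_lr down to a floor of
    # final_lr_portion * max_lr instead of decaying all the way to zero.
    final_lr = max_lr * final_lr_portion
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return final_lr + (max_lr - final_lr) * cosine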
transformers version 4.38.0
import torch
from llama_moe.modeling_llama_moe_hf import LlamaMoEForCausalLM
from transformers import AutoTokenizer
model_dir = "Qwen1.5-0.5B-Chat_llamafy-16Select4-gate_proj-Scale1.0"
model = LlamaMoEForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
Some weights of LlamaMoEForCausalLM were not initialized from the model checkpoint at Qwen1.5-0.5B-Chat_llamafy-16Select4-gate_proj-Scale1.0 and are newly initialized: ['model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq']
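This warning is likely benign: rotary_emb.inv_freq is a deterministic function of the config (head dimension and RoPE base), so re-initializing it loses no information, and recent transformers versions simply stopped persisting it in checkpoints. The standard formula, for reference:

import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse-frequency buffer; fully determined by head_dim
    # and base, so "newly initialized" here does not mean lost weights.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))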
I appreciate the authors' contribution to MoE construction.
The authors mentioned in Section 1 that there is a significant performance drop between the LLaMA-MoE-v1 models and the original dense LLaMA models. I am curious about the performance of the original dense LLaMA model and whether these gaps can be bridged by the approach proposed in the paper.
sbatch: command not found
Since I have to launch a new container environment for each experiment, have never worked with Slurm before, and this GPU cluster does not come with Slurm installed, installing and configuring Slurm seems quite troublesome. Is there any other way to run multi-node, multi-GPU training? Thanks.
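One Slurm-free option is PyTorch's torchrun launcher, run once per node. A minimal sanity-check sketch (script name, node count, and addresses are placeholders):

import os
import torch
import torch.distributed as dist

# Launch on every node with torchrun instead of srun/sbatch, e.g.:
#   torchrun --nnodes=2 --node_rank=<0 or 1> --nproc_per_node=8 \
#            --master_addr=<ip-of-node-0> --master_port=29500 ddp_check.py
dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE set by torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # every rank should print the total number of processes
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")
dist.destroy_process_group()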
Llama 2 does not seem to be supported; please update modeling_llama_moe_hf.py.
I have done some similar investigation and found that SwiGLU exhibits over 50% dynamic sparsity, but it seems hard to predict. With ReLU or ReGLU activations, the dynamic sparsity is very high, so MoE-fication appears to be easier.
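For context, a minimal way to measure that dynamic sparsity (the magnitude threshold is an arbitrary choice, and gate_proj/up_proj follow LLaMA's naming; SwiGLU outputs are rarely exactly zero, hence the threshold):

import torch
import torch.nn.functional as F

def swiglu_sparsity(x, gate_proj, up_proj, threshold=1e-2):
    # Fraction of near-zero intermediate activations in a SwiGLU FFN --
    # a rough proxy for the "dynamic sparsity" discussed above.
    hidden = F.silu(gate_proj(x)) * up_proj(x)
    return (hidden.abs() < threshold).float().mean().item()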
Is the MoE model constructed from multiple LLaMA models or from a single one?
Is the purpose of this repo to split the FFN layer of one LLaMA model into multiple FFNs via different splitting methods, so that they act as multiple experts, and then to combine the remaining layers and weights of the LLaMA model with the split FFNs and a gate to form the MoE model?
Is merging the FFN layers of multiple LLaMA-architecture models, building an MoE on top of a single base LLaMA model structure, also supported?
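For the single-model case, a minimal sketch of one possible splitting step (a random neuron partition; the function name is hypothetical and this is only one of the repo's splitting methods):

import torch

def random_neuron_split(intermediate_size: int, num_experts: int, seed: int = 0):
    # Randomly partition the FFN's intermediate neurons into equal groups;
    # each group's rows/columns of the gate/up/down projections become one
    # expert, e.g. 11008 neurons -> 8 groups of 1376.
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(intermediate_size, generator=g)
    return perm.chunk(num_experts)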
Running the bash script fails:
./scripts/expert_construction/split/run_split_random.sh: line 18: srun: command not found
Accelerated deployment: for vanilla LLaMA there are many mature acceleration frameworks, such as LMDeploy, vLLM, and AWQ.
After switching to the MoE architecture, how can these acceleration frameworks be used for deployment? Has any adaptation been done?
As titled.
Hello, and thank you for releasing such great work. We have also recently been fine-tuning Mixtral-8x7B (our project: https://github.com/WangRongsheng/Aurora), and we would like to know how strong LLaMA-MoE's Chinese-language capability is.
Hi, I have a question regarding two parameters found in config.json: score_scale_factor and capacity_factor.
Based on my understanding, the llama-3B MoE splits the intermediate dimension into 8 parts of 1376 dimensions each, instead of the original 11008 dimensions; hence the 1376 values in the size_experts list. However, I don't quite understand how capacity_factor and score_scale_factor affect the architecture of the MoE. Are they needed during inference, or are the 1376 sizes derived from the capacity factor?
I have read the expert-construction README but found no connection to setting these two values.
Am I missing something here?
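The source doesn't define these, but a common reading in MoE implementations is that score_scale_factor rescales the gate-weighted mixture of expert outputs (compensating for only k of the N experts firing), while capacity_factor caps how many tokens each expert may process, usually during training; neither would change the 1376 per-expert width, which comes from 11008 / 8. A hypothetical top-k routing sketch showing where such parameters typically enter (names and semantics assumed, not taken from llama-moe):

import torch

def moe_forward_sketch(hidden, experts, router, k=4, score_scale_factor=1.0):
    # Hypothetical top-k MoE forward pass; NOT llama-moe's implementation.
    # capacity_factor (not shown) would only cap tokens per expert; it does
    # not change the per-expert width (1376 = 11008 / 8).
    scores = router(hidden).softmax(dim=-1)     # (num_tokens, num_experts)
    weights, chosen = scores.topk(k, dim=-1)    # top-k gate scores per token
    out = torch.zeros_like(hidden)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = chosen[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(hidden[mask])
    return out * score_scale_factor             # rescale the sparse mixture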
Hi, thanks for your interesting work. I also need to use SlimPajama to fine-tune my model. Could you explain how to split the whole SlimPajama corpus into separate folders like those listed below? (A splitting sketch follows the list.)
/path_to_data/en_arxiv
/path_to_data/en_book
/path_to_data/en_c4
/path_to_data/en_cc
/path_to_data/en_stack
/path_to_data/en_wikipedia
/path_to_data/github
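In case it helps, a minimal sketch of one way to do the split, assuming the HuggingFace SlimPajama jsonl layout where each record carries meta['redpajama_set_name']; the source-to-folder mapping below is an assumption matching the directory names above:

import json
import os

# Hypothetical mapping from SlimPajama's source labels to the folders above.
SET_TO_DIR = {
    "RedPajamaArXiv": "en_arxiv", "RedPajamaBook": "en_book",
    "RedPajamaC4": "en_c4", "RedPajamaCommonCrawl": "en_cc",
    "RedPajamaStackExchange": "en_stack", "RedPajamaWikipedia": "en_wikipedia",
    "RedPajamaGithub": "github",
}

def split_slimpajama(jsonl_path: str, out_root: str):
    # Route each record to a per-source folder based on its meta field.
    writers = {}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            subdir = SET_TO_DIR[rec["meta"]["redpajama_set_name"]]
            if subdir not in writers:
                os.makedirs(os.path.join(out_root, subdir), exist_ok=True)
                writers[subdir] = open(
                    os.path.join(out_root, subdir, "part.jsonl"), "w", encoding="utf-8"
                )
            writers[subdir].write(line)
    for w in writers.values():
        w.close()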
Thank you to the authors for providing a method for transforming a dense model into an MoE for more efficient inference!
MoEfication reports acceleration results for the transformed model on CPU and GPU, while the current LLaMA-MoE technical report does not contain this information. Could the authors provide the relevant reference information?