pjlab-sys4nlp / llama-moe
⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
License: Apache License 2.0
Here you convert the existing FFN of LLaMA into multiple small sparse experts. But what if we kept the original FFN dimension for each expert? The number of parameters would increase, but that should be fine, since we are going to continually pre-train the model anyway.
One way to achieve this is to duplicate the FFN into multiple experts and add a gate on top. I know it is not ideal for the experts to start with identical weights, so perhaps we could add a bit of random noise to the weights before pre-training?
Let me know your thoughts!
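For what it's worth, a minimal sketch of that duplicate-and-perturb idea, assuming a PyTorch nn.Module FFN (the function name and noise scale are illustrative, not from the repo):

import copy
import torch

def duplicate_ffn_into_experts(ffn: torch.nn.Module, num_experts: int, noise_std: float = 1e-3):
    # Clone the dense FFN num_experts times and add small Gaussian noise to
    # each copy so the experts can diverge during continual pre-training.
    experts = torch.nn.ModuleList()
    for _ in range(num_experts):
        expert = copy.deepcopy(ffn)
        with torch.no_grad():
            for p in expert.parameters():
                p.add_(noise_std * torch.randn_like(p))
        experts.append(expert)
    return experts

Note that even without the noise, a randomly initialized router already sends different tokens to different copies, so the experts would not stay identical for long; the noise just breaks the symmetry from step one.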
https://github.com/pjlab-sys4nlp/llama-moe/blob/main/smoe/trainer/llama_lr_scheduling.py#L125
Thanks for sharing the repo. I have a question about the learning rate: final_lr_portion is supported here, which seems to differ from Megatron's implementation. Could you explain whether this is reasonable? Thanks.
The lr decay segment may also be affected by final_lr_portion.
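For reference, a minimal sketch of a schedule with a learning-rate floor, assuming final_lr_portion means the final lr is that fraction of the peak lr (an assumption for illustration, not the repo's exact code):

import math

def lr_at_step(step, max_steps, warmup_steps, max_lr, final_lr_portion):
    # Linear warmup, then cosine decay from max_lr down to a floor of
    # final_lr_portion * max_lr instead of decaying all the way to zero.
    final_lr = max_lr * final_lr_portion
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return final_lr + (max_lr - final_lr) * cosine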
transformers version 4.38.0
import torch
from llama_moe.modeling_llama_moe_hf import LlamaMoEForCausalLM
from transformers import AutoTokenizer
model_dir = "Qwen1.5-0.5B-Chat_llamafy-16Select4-gate_proj-Scale1.0"
model = LlamaMoEForCausalLM.from_pretrained(model_dir, trust_remote_code=True)
Some weights of LlamaMoEForCausalLM were not initialized from the model checkpoint at Qwen1.5-0.5B-Chat_llamafy-16Select4-gate_proj-Scale1.0 and are newly initialized: ['model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq']
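This warning is likely benign: rotary_emb.inv_freq is a deterministic function of the config (head dimension and RoPE base), so re-initializing it loses no information, and recent transformers versions simply stopped persisting it in checkpoints. The standard formula, for reference:

import torch

def rope_inv_freq(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse-frequency buffer; fully determined by head_dim
    # and base, so "newly initialized" here does not mean lost weights.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))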
I appreciate the authors' contribution to MoE construction.
The authors mentioned in Section 1 that there is a significant performance drop between the LLaMA-MoE-v1 models and the original dense LLaMA models. I am curious about the performance of the original dense LLaMA model and whether these gaps can be bridged by the approach proposed in the paper.
sbatch: command not found
Since I have to launch a new container environment for each experiment, have never worked with Slurm before, and this GPU cluster does not come with Slurm installed, installing and configuring Slurm seems quite troublesome. Is there any other way to run multi-node, multi-GPU training? Thanks.
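One Slurm-free option is PyTorch's torchrun launcher, run once per node. A minimal sanity-check sketch (script name, node count, and addresses are placeholders):

import os
import torch
import torch.distributed as dist

# Launch on every node with torchrun instead of srun/sbatch, e.g.:
#   torchrun --nnodes=2 --node_rank=<0 or 1> --nproc_per_node=8 \
#            --master_addr=<ip-of-node-0> --master_port=29500 ddp_check.py
dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE set by torchrun
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # every rank should print the total number of processes
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {x.item()}")
dist.destroy_process_group()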
Llama 2 does not seem to be supported; please update modeling_llama_moe_hf.py.
I have done some similar investigation and found that SwiGLU exhibits over 50% dynamic sparsity, but it seems hard to predict. With ReLU or ReGLU activations, the dynamic sparsity is very high, so MoE-fication appears to be easier.
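For context, a minimal way to measure that dynamic sparsity (the magnitude threshold is an arbitrary choice, and gate_proj/up_proj follow LLaMA's naming; SwiGLU outputs are rarely exactly zero, hence the threshold):

import torch
import torch.nn.functional as F

def swiglu_sparsity(x, gate_proj, up_proj, threshold=1e-2):
    # Fraction of near-zero intermediate activations in a SwiGLU FFN --
    # a rough proxy for the "dynamic sparsity" discussed above.
    hidden = F.silu(gate_proj(x)) * up_proj(x)
    return (hidden.abs() < threshold).float().mean().item()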
Is the MoE model constructed from multiple LLaMA models or from a single one?
Is the purpose of this repo to split the FFN layer of one LLaMA model into multiple FFNs via different splitting methods, so that they act as multiple experts, and then to combine the remaining layers and weights of the LLaMA model with the split FFNs and a gate to form the MoE model?
Is merging the FFN layers of multiple LLaMA-architecture models, building an MoE on top of a single base LLaMA model structure, also supported?
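For the single-model case, a minimal sketch of one possible splitting step (a random neuron partition; the function name is hypothetical and this is only one of the repo's splitting methods):

import torch

def random_neuron_split(intermediate_size: int, num_experts: int, seed: int = 0):
    # Randomly partition the FFN's intermediate neurons into equal groups;
    # each group's rows/columns of the gate/up/down projections become one
    # expert, e.g. 11008 neurons -> 8 groups of 1376.
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(intermediate_size, generator=g)
    return perm.chunk(num_experts)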
Running the bash script fails:
./scripts/expert_construction/split/run_split_random.sh: line 18: srun: command not found
Accelerated deployment: for vanilla LLaMA there are many mature acceleration frameworks, such as LMDeploy, vLLM, and AWQ.
After switching to the MoE architecture, how can these acceleration frameworks be used for deployment? Has any adaptation been done?
As titled.
Hello, and thank you for releasing such great work. We have also recently been fine-tuning Mixtral-8x7B (our project: https://github.com/WangRongsheng/Aurora), and we would like to know how strong LLaMA-MoE's Chinese-language capability is.
Hi, I have a question regarding two parameters found in config.json: score_scale_factor and capacity_factor.
Based on my understanding, the llama-3B MoE splits the intermediate dimension into 8 parts of 1376 dimensions each, instead of the original 11008 dimensions; hence the 1376 values in the size_experts list. However, I don't quite understand how capacity_factor and score_scale_factor affect the architecture of the MoE. Are they needed during inference, or are the 1376 sizes derived from the capacity factor?
I have read the expert-construction README but found no connection to setting these two values.
Am I missing something here?
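The source doesn't define these, but a common reading in MoE implementations is that score_scale_factor rescales the gate-weighted mixture of expert outputs (compensating for only k of the N experts firing), while capacity_factor caps how many tokens each expert may process, usually during training; neither would change the 1376 per-expert width, which comes from 11008 / 8. A hypothetical top-k routing sketch showing where such parameters typically enter (names and semantics assumed, not taken from llama-moe):

import torch

def moe_forward_sketch(hidden, experts, router, k=4, score_scale_factor=1.0):
    # Hypothetical top-k MoE forward pass; NOT llama-moe's implementation.
    # capacity_factor (not shown) would only cap tokens per expert; it does
    # not change the per-expert width (1376 = 11008 / 8).
    scores = router(hidden).softmax(dim=-1)     # (num_tokens, num_experts)
    weights, chosen = scores.topk(k, dim=-1)    # top-k gate scores per token
    out = torch.zeros_like(hidden)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = chosen[:, slot] == e
            if mask.any():
                out[mask] += weights[mask, slot].unsqueeze(-1) * expert(hidden[mask])
    return out * score_scale_factor             # rescale the sparse mixture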
Hi, thanks for your interesting work. I also need to use SlimPajama to fine-tune my model. Could you explain how to split the whole SlimPajama corpus into separate folders like those listed below? (A splitting sketch follows the list.)
/path_to_data/en_arxiv
/path_to_data/en_book
/path_to_data/en_c4
/path_to_data/en_cc
/path_to_data/en_stack
/path_to_data/en_wikipedia
/path_to_data/github
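In case it helps, a minimal sketch of one way to do the split, assuming the HuggingFace SlimPajama jsonl layout where each record carries meta['redpajama_set_name']; the source-to-folder mapping below is an assumption matching the directory names above:

import json
import os

# Hypothetical mapping from SlimPajama's source labels to the folders above.
SET_TO_DIR = {
    "RedPajamaArXiv": "en_arxiv", "RedPajamaBook": "en_book",
    "RedPajamaC4": "en_c4", "RedPajamaCommonCrawl": "en_cc",
    "RedPajamaStackExchange": "en_stack", "RedPajamaWikipedia": "en_wikipedia",
    "RedPajamaGithub": "github",
}

def split_slimpajama(jsonl_path: str, out_root: str):
    # Route each record to a per-source folder based on its meta field.
    writers = {}
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            subdir = SET_TO_DIR[rec["meta"]["redpajama_set_name"]]
            if subdir not in writers:
                os.makedirs(os.path.join(out_root, subdir), exist_ok=True)
                writers[subdir] = open(
                    os.path.join(out_root, subdir, "part.jsonl"), "w", encoding="utf-8"
                )
            writers[subdir].write(line)
    for w in writers.values():
        w.close()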
Thank you to the authors for providing a method for transforming a dense model into an MoE for more efficient inference!
MoEfication reports acceleration results for the transformed model on CPU and GPU, while the current LLaMA-MoE technical report does not contain this information. Could the authors provide the relevant reference information?