
llama-moe's Introduction

LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!! 📃 Technical Report

🎉 Introduction

LLaMA-MoE is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE in the following two steps:

  1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate for each layer of experts (a minimal sketch follows the routing figure below).
  2. Continually pre-train the initialized MoE model with optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama.

[Figure: MoE Routing]
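To make step 1 concrete, here is a minimal, illustrative sketch (not the repository's actual implementation; all class and variable names are hypothetical) of splitting one dense SwiGLU FFN into equal-sized experts and routing each token through a top-K gate:

# Illustrative sketch only: split one dense SwiGLU FFN into equal experts and
# route each token through the top-k of them. Hypothetical names throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitFFNExpert(nn.Module):
    """One expert: a slice of the original FFN's intermediate neurons."""
    def __init__(self, hidden_size: int, expert_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, expert_dim, bias=False)
        self.up_proj = nn.Linear(hidden_size, expert_dim, bias=False)
        self.down_proj = nn.Linear(expert_dim, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class TopKMoEFFN(nn.Module):
    """Replaces one dense FFN: a linear router picks the top-k experts per token."""
    def __init__(self, hidden_size=4096, intermediate_size=11008, num_experts=8, top_k=2):
        super().__init__()
        expert_dim = intermediate_size // num_experts   # e.g. 11008 // 8 = 1376
        self.experts = nn.ModuleList(
            SplitFFNExpert(hidden_size, expert_dim) for _ in range(num_experts)
        )
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                               # x: (num_tokens, hidden_size)
        scores = F.softmax(self.router(x), dim=-1)      # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):       # dense loop, for clarity only
            slot_mask = topk_idx == e                   # (num_tokens, top_k)
            token_mask = slot_mask.any(dim=-1)
            if token_mask.any():
                weight = (topk_scores * slot_mask).sum(-1, keepdim=True)[token_mask]
                out[token_mask] += weight * expert(x[token_mask])
        return out

The dense per-expert loop is for readability only; a real implementation dispatches tokens to experts in batches and also tracks gate load and a balance loss (see the monitor items listed below).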

🔥 Features

  1. Lightweight Models: only 3.0~3.5B model parameters are activated, which makes the models friendly for deployment and research usage.
  2. Multiple Expert Construction Methods:
    1. Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient (Zhang et al., 2022, Zuo et al., 2022)
    2. Neuron-Sharing: Inner, Inter (residual)
  3. Multiple MoE Gating Strategies:
    1. TopK Noisy Gate (Shazeer et al., 2017)
    2. Switch Gating (Fedus et al., 2022)
  4. Fast Continual Pre-training:
    1. FlashAttention-v2 integrated (Dao, 2023)
    2. Fast streaming dataset loading
  5. Abundant Monitor Items:
    1. Gate load, gate importance
    2. Loss on steps, loss on tokens, balance loss
    3. TGS (tokens/GPU/second), MFU (model FLOPs utilization)
    4. Other visualization utilities
  6. Dynamic Weight Sampling (a sampling sketch follows this list):
    1. Self-defined static sampling weights
    2. Sheared LLaMA's dynamic batch loading (Xia et al., 2023)
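As a companion to item 6, here is a minimal sketch of static-weight domain sampling over the SlimPajama domains. The weights below are placeholders for illustration, not the values used for LLaMA-MoE; Sheared LLaMA's dynamic batch loading additionally re-estimates such weights during training, which is not reproduced here.

# Minimal sketch of static-weight domain sampling. The weights are placeholders
# for illustration only, NOT the sampling weights used in LLaMA-MoE.
import random

SAMPLING_WEIGHTS = {
    "en_cc": 0.55, "en_c4": 0.20, "github": 0.05, "en_wikipedia": 0.05,
    "en_book": 0.05, "en_arxiv": 0.05, "en_stack": 0.05,
}

def sample_domain(rng: random.Random) -> str:
    """Pick the domain to draw the next training example from."""
    domains, weights = zip(*SAMPLING_WEIGHTS.items())
    return rng.choices(domains, weights=weights, k=1)[0]

rng = random.Random(42)
print([sample_domain(rng) for _ in range(8)])  # mostly 'en_cc' draws, by design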

🚀 QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, temperature=0.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three

⚙️ Installation

  1. Prepare the conda environment: conda create -n smoe python=3.11 (if your environment name is not smoe, you may need to change the environment name in the launching scripts)
  2. Add the correct environment variables to ~/.bashrc (gcc is set to a newer version for installing flash-attn), e.g.:
    export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
    export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
  3. Bring the variables into effect: source ~/.bashrc
  4. Install PyTorch (CUDA-11.8): pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  5. Install dependencies: pip install -r requirements.txt
  6. Install flash-attn: pip install flash-attn==2.0.1 --no-build-isolation. You may need to follow the flash-attn installation instructions to avoid errors. (A quick import check is sketched after this list.)
  7. Install the latest Git: conda install git
  8. Clone the repo: git clone git@github.com:pjlab-sys4nlp/llama-moe.git (if you haven't set up an SSH key for GitHub, you may not be able to clone over SSH; check the GitHub docs about it)
  9. Change current directory: cd llama-moe
  10. Install smoe in editable mode: pip install -e .[dev]
  11. Setup pre-commit hooks: pre-commit install
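Once the steps above are done, an optional sanity check (not part of the repo) can confirm that the CUDA build of PyTorch and flash-attn are importable inside the smoe environment:

# Hypothetical post-install check: confirms that the CUDA build of PyTorch and
# flash-attn can be imported inside the `smoe` environment.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import flash_attn
    print("flash-attn:", getattr(flash_attn, "__version__", "unknown"))
except ImportError as err:
    print("flash-attn import failed:", err)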

📊 Model Performance

| Model | #Activated Experts | #Experts | #Activated Params | Foundation Model | SFT Model |
| --- | --- | --- | --- | --- | --- |
| LLaMA-MoE-3.0B | 2 | 16 | 3.0B | 🤗 base | 🤗 SFT |
| LLaMA-MoE-3.5B (4/16) | 4 | 16 | 3.5B | 🤗 base | 🤗 SFT |
| LLaMA-MoE-3.5B (2/8) | 2 | 8 | 3.5B | 🤗 base | 🤗 SFT |
  • Foundation models
| Model | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPT-2.7B | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
| Pythia-2.8B | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7 | 26.8 |
| INCITE-BASE-3B | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
| Open-LLaMA-3B-v2 | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
| Sheared-LLaMA-2.7B | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | 27.3 |
| LLaMA-MoE-3.0B | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | 30.6 | 71.9 | 66.6 | 17.0 | 26.8 |
| LLaMA-MoE-3.5B (4/16) | 57.7 | 87.6 | 77.9 | 65.5 | 65.6 | 44.2 | 73.3 | 29.7 | 75.0 | 69.5 | 20.3 | 26.8 |
| LLaMA-MoE-3.5B (2/8) | 57.6 | 88.4 | 77.6 | 66.7 | 65.3 | 43.1 | 73.3 | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |
  • SFT models
| Model | MMLU | ARC-c | HellaSwag | TruthfulQA | MT-Bench |
| --- | --- | --- | --- | --- | --- |
| Sheared LLaMA-2.7B ShareGPT | 28.41 | 41.04 | 71.21 | 47.65 | 3.79 |
| Sheared LLaMA-2.7B Deita6K (Our Impl.) | 25.24 | 43.69 | 71.70 | 49.00 | 4.06 |
| LLaMA-MoE-v1-3.0B (2/16) | 23.61 | 43.43 | 72.28 | 44.24 | 4.15 |
| LLaMA-MoE-v1-3.5B (4/16) | 26.49 | 48.29 | 75.10 | 45.91 | 4.60 |
| LLaMA-MoE-v1-3.5B (2/8) | 25.53 | 45.99 | 74.95 | 44.39 | 4.72 |

🚧 Expert Construction

  • Neuron-Independent
    • IndependentRandom: bash ./scripts/expert_construction/split/run_split_random.sh
    • IndependentClustering: bash ./scripts/expert_construction/split/run_split_clustering.sh
  • Neuron-Sharing
    • SharingInner: bash ./scripts/expert_construction/split/run_split_gradient.sh
    • SharingInter: bash ./scripts/expert_construction/split/run_split_gradient_residual.sh

For more information, please refer to the Expert Construction docs. A minimal sketch of the random split follows below.
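As a taste of the simplest method, the sketch below (a hypothetical helper, not the repository's script) implements an IndependentRandom-style split: the FFN's intermediate neuron indices are shuffled and chunked into equal, non-overlapping groups, one per expert. The clustering, co-activation graph, gradient, and sharing-based methods choose these groups differently.

# Hypothetical sketch of an "IndependentRandom" split: shuffle the intermediate
# neuron indices and chunk them into equal groups, one index set per expert.
import torch

def random_neuron_split(intermediate_size: int, num_experts: int, seed: int = 0):
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(intermediate_size, generator=g)
    return list(perm.chunk(num_experts))   # one 1-D index tensor per expert

splits = random_neuron_split(11008, 8)     # LLaMA FFN -> 8 experts of 1376 neurons
print([len(s) for s in splits])            # [1376, 1376, ..., 1376]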

🚅 Continual Pre-training

Tokenization

Download SlimPajama into /path_to_data and put data from different domains into separate folders:

  • /path_to_data/en_arxiv
  • /path_to_data/en_book
  • /path_to_data/en_c4
  • /path_to_data/en_cc
  • /path_to_data/en_stack
  • /path_to_data/en_wikipedia
  • /path_to_data/github

Each file should end with *.jsonl, and each line should look like:

{"id": "id-info", "content": "raw text to be tokenized"}

Run the following command to tokenize the data in each folder:

python -m smoe.utils.tokenize \
  -f jsonl \
  -t /path_to_tokenizer \
  -i /path_to_data/en_arxiv \
  -o /path_to_data_tokenized/en_arxiv

Continual Pre-training (CPT)

  • NOTICE: Please create the logs/ folder manually: mkdir -p logs
  • To run the continual pre-training, please check the CPT docs.

💎 Evaluation

💬 Supervised Fine-Tuning (SFT)

We provide simple examples of SFT for building chatbots. Please refer to the SFT docs and /mnt/petrelfs/zhutong/smoe/scripts/sft for more details.

📑 Citation

@misc{llama-moe-2023,
  title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
  author={LLaMA-MoE Team},
  year={2023},
  month={Dec},
  url={https://github.com/pjlab-sys4nlp/llama-moe}
}

LLaMA-MoE Team w/ ❤️

llama-moe's People

Contributors

daizedong, jcruan519, spico197, tongjingqi, xiaoyee


llama-moe's Issues

About dataset preparation

Hi, thanks for your interesting work. I also need to use SlimPajama to fine-tune my model. Could you explain how to split the whole SlimPajama dataset into the separate folders you listed?
/path_to_data/en_arxiv
/path_to_data/en_book
/path_to_data/en_c4
/path_to_data/en_cc
/path_to_data/en_stack
/path_to_data/en_wikipedia
/path_to_data/github

Can you report the running time on hardware?

Thank you to the authors for providing a method for transforming a dense model into MoEs for more efficient inference!

MoEfication provides acceleration results for the transformed model on CPU and GPU, while the current technical report for LLaMA-MoE does not contain this information. Could the authors provide the relevant reference information?

How to split "down" by "up" when using clustering to construct experts? 请问使用clustering进行Expert Construction时,down怎么根据up划分?

image
llama的FFN层包含up,down,gate三个部分,根据技术报告中这段话,使用MoEfication方法对up的权重进行k-means聚类后,down的权重是根据up的聚类结果进行分割吗?而gate的权重是需要单独进行k-means聚类吗?
请问对down的分割操作具体是怎么做的呢?是在哪一部分代码中实现的呢?
以及为什么down的weight要根据up进行划分而不是对down进行k-means聚类呢?
Thanks very much!

Partition FFNs without downsizing them?

Here you are converting the existing FFN of LLaMA into multiple small sparse experts. But what if we kept the original FFN dimension for each expert? The number of parameters would increase, but that should be fine since we are going to continually pre-train the model anyway.

One way to achieve this is by duplicating the FFN into multiple experts and adding a gate on top. I know it is not a great idea to have experts with identical weights, so maybe we could add a bit of random noise to the weights before pre-training?

Let me know your thoughts!

How well does LLaMA perform after direct MoE conversion?

I have done some similar investigation before and found that SwiGLU exhibits more than 50% dynamic sparsity, but it seems hard to predict. When the activation function is ReGLU or ReLU, the very high dynamic sparsity seems to make MoE conversion easier.

Questions about capacity_factor and score_scale_factor

Hi, I have a question regarding these params found in config.json: score_scale_factor and capacity_factor.

Based on my understanding, the LLaMA-3B-MoE model splits the intermediate dimension into 8 parts of 1376 dimensions each, instead of the original 11008 dimensions, hence the 1376 values in the size_experts list. However, I don't quite understand how capacity_factor and score_scale_factor affect the MoE architecture. Are they needed during inference, or is the 1376 derived from the capacity factor?

I have read the expert construction README but found no connection to how these two values are set.

Am I missing something here?

#Feature Request# Accelerated Deployment.

Regarding the current LLaMA, there are many mature acceleration frameworks, such as LMDeploy, vLLM, and AWQ.
After adopting the MoE architecture, how can these acceleration frameworks be used for deployment? Has any adaptation been done?

How many LLaMA models are used for constructing LLaMA-MoE?

  1. How many LLaMA models are used when constructing LLaMA-MoE: multiple LLaMA models or just one?
  2. Does this repo partition a single LLaMA model's FFN layers into multiple experts via the different split methods, and then combine the remaining layers and weights with the split FFNs and gates to construct an MoE model?
  3. Do you support merging the FFN layers of multiple LLaMA-architecture models into an MoE model built on one base LLaMA structure?

Performance comparison between LLaMA-MoE and the original dense model.

I appreciate the authors' contribution to MoE construction.
The authors mention in Section 1 that there is a significant performance gap between the LLaMA-MoE-v1 models and the original dense LLaMA models. I am curious about the performance of the original dense LLaMA models and whether this gap can be bridged by the approach proposed in the paper.

Some weights of LlamaMoEForCausalLM were not initialized

transformers version 4.38.0

import torch
from llama_moe.modeling_llama_moe_hf import LlamaMoEForCausalLM
from transformers import AutoTokenizer

model_dir = "Qwen1.5-0.5B-Chat_llamafy-16Select4-gate_proj-Scale1.0"

model = LlamaMoEForCausalLM.from_pretrained(model_dir, trust_remote_code=True)

Some weights of LlamaMoEForCausalLM were not initialized from the model checkpoint at Qwen1.5-0.5B-Chat_llamafy-16Select4-gate_proj-Scale1.0 and are newly initialized: ['model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq']
