
llama-moe's Introduction

LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

📢 A SMALLER AFFORDABLE MoE MODEL FOR EVERYONE!! 📃 Technical Report

🎉 Introduction

LLaMA-MoE is a series of open-source Mixture-of-Experts (MoE) models based on LLaMA and SlimPajama. We build LLaMA-MoE in the following two steps:

  1. Partition LLaMA's FFNs into sparse experts and insert a top-K gate for each layer of experts (a minimal sketch follows the routing figure below).
  2. Continually pre-train the initialized MoE model with optimized data sampling weights from Sheared LLaMA and filtered datasets from SlimPajama.

[Figure: MoE Routing]
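To make step 1 concrete, here is a minimal, illustrative sketch (not the repository's actual implementation; all class and variable names are hypothetical) of splitting one dense SwiGLU FFN into equal-sized experts and routing each token through a top-K gate:

# Illustrative sketch only: split one dense SwiGLU FFN into equal experts and
# route each token through the top-k of them. Hypothetical names throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitFFNExpert(nn.Module):
    """One expert: a slice of the original FFN's intermediate neurons."""
    def __init__(self, hidden_size: int, expert_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, expert_dim, bias=False)
        self.up_proj = nn.Linear(hidden_size, expert_dim, bias=False)
        self.down_proj = nn.Linear(expert_dim, hidden_size, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class TopKMoEFFN(nn.Module):
    """Replaces one dense FFN: a linear router picks the top-k experts per token."""
    def __init__(self, hidden_size=4096, intermediate_size=11008, num_experts=8, top_k=2):
        super().__init__()
        expert_dim = intermediate_size // num_experts   # e.g. 11008 // 8 = 1376
        self.experts = nn.ModuleList(
            SplitFFNExpert(hidden_size, expert_dim) for _ in range(num_experts)
        )
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                               # x: (num_tokens, hidden_size)
        scores = F.softmax(self.router(x), dim=-1)      # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):       # dense loop, for clarity only
            slot_mask = topk_idx == e                   # (num_tokens, top_k)
            token_mask = slot_mask.any(dim=-1)
            if token_mask.any():
                weight = (topk_scores * slot_mask).sum(-1, keepdim=True)[token_mask]
                out[token_mask] += weight * expert(x[token_mask])
        return out

The dense per-expert loop is for readability only; a real implementation dispatches tokens to experts in batches and also tracks gate load and a balance loss (see the monitor items listed below).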

🔥 Features

  1. Lightweight Models: only 3.0~3.5B model parameters are activated, which makes the models friendly for deployment and research usage.
  2. Multiple Expert Construction Methods:
    1. Neuron-Independent: Random, Clustering, Co-activation Graph, Gradient (Zhang et al., 2022, Zuo et al., 2022)
    2. Neuron-Sharing: Inner, Inter (residual)
  3. Multiple MoE Gating Strategies:
    1. TopK Noisy Gate (Shazeer et al., 2017)
    2. Switch Gating (Fedus et al., 2022)
  4. Fast Continual Pre-training:
    1. FlashAttention-v2 integrated (Dao, 2023)
    2. Fast streaming dataset loading
  5. Abundant Monitor Items:
    1. Gate load, gate importance
    2. Loss on steps, loss on tokens, balance loss
    3. TGS (tokens/GPU/second), MFU (model FLOPs utilization)
    4. Other visualization utilities
  6. Dynamic Weight Sampling (a sampling sketch follows this list):
    1. Self-defined static sampling weights
    2. Sheared LLaMA's dynamic batch loading (Xia et al., 2023)
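As a companion to item 6, here is a minimal sketch of static-weight domain sampling over the SlimPajama domains. The weights below are placeholders for illustration, not the values used for LLaMA-MoE; Sheared LLaMA's dynamic batch loading additionally re-estimates such weights during training, which is not reproduced here.

# Minimal sketch of static-weight domain sampling. The weights are placeholders
# for illustration only, NOT the sampling weights used in LLaMA-MoE.
import random

SAMPLING_WEIGHTS = {
    "en_cc": 0.55, "en_c4": 0.20, "github": 0.05, "en_wikipedia": 0.05,
    "en_book": 0.05, "en_arxiv": 0.05, "en_stack": 0.05,
}

def sample_domain(rng: random.Random) -> str:
    """Pick the domain to draw the next training example from."""
    domains, weights = zip(*SAMPLING_WEIGHTS.items())
    return rng.choices(domains, weights=weights, k=1)[0]

rng = random.Random(42)
print([sample_domain(rng) for _ in range(8)])  # mostly 'en_cc' draws, by design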

🚀 QuickStart

# python>=3.10

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_dir = "llama-moe/LLaMA-MoE-v1-3_5B-2_8"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, trust_remote_code=True)
model.eval()
model.to("cuda:0")

input_text = "Suzhou is famous of"
inputs = tokenizer(input_text, return_tensors="pt")
inputs = inputs.to("cuda:0")

pred = model.generate(**inputs, max_length=50, temperature=0.0)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
# Suzhou is famous of its beautiful gardens. The most famous one is the Humble Administrator's Garden. It is a classical Chinese garden with a history of more than 600 years. The garden is divided into three

⚙️ Installation

  1. Prepare the conda environment: conda create -n smoe python=3.11 (if your environment name is not smoe, you may need to change the environment name in the launching scripts)
  2. Add the correct environment variables to ~/.bashrc (gcc is set to a newer version for installing flash-attn), e.g.:
    export PATH=/mnt/petrelfs/share/cuda-11.8/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/cuda-11.8/lib64:$LD_LIBRARY_PATH
    export PATH=/mnt/petrelfs/share/gcc-10.1.0/bin:$PATH
    export LD_LIBRARY_PATH=/mnt/petrelfs/share/gcc-10.1.0/lib64:$LD_LIBRARY_PATH
  3. Bring the variables into effect: source ~/.bashrc
  4. Install PyTorch (CUDA-11.8): pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
  5. Install dependencies: pip install -r requirements.txt
  6. Install flash-attn: pip install flash-attn==2.0.1 --no-build-isolation. You may need to follow the flash-attn installation instructions to avoid errors. (A quick import check is sketched after this list.)
  7. Install the latest Git: conda install git
  8. Clone the repo: git clone git@github.com:pjlab-sys4nlp/llama-moe.git (if you haven't set up an SSH key for GitHub, you may not be able to clone over SSH; check the GitHub docs about it)
  9. Change current directory: cd llama-moe
  10. Install smoe in editable mode: pip install -e .[dev]
  11. Setup pre-commit hooks: pre-commit install
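Once the steps above are done, an optional sanity check (not part of the repo) can confirm that the CUDA build of PyTorch and flash-attn are importable inside the smoe environment:

# Hypothetical post-install check: confirms that the CUDA build of PyTorch and
# flash-attn can be imported inside the `smoe` environment.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import flash_attn
    print("flash-attn:", getattr(flash_attn, "__version__", "unknown"))
except ImportError as err:
    print("flash-attn import failed:", err)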

📊 Model Performance

| Model | #Activated Experts | #Experts | #Activated Params | Foundation Model | SFT Model |
| --- | --- | --- | --- | --- | --- |
| LLaMA-MoE-3.0B | 2 | 16 | 3.0B | 🤗 base | 🤗 SFT |
| LLaMA-MoE-3.5B (4/16) | 4 | 16 | 3.5B | 🤗 base | 🤗 SFT |
| LLaMA-MoE-3.5B (2/8) | 2 | 8 | 3.5B | 🤗 base | 🤗 SFT |
  • Foundation models
| Model | Average | SciQ | PIQA | WinoGrande | ARC-e | ARC-c (25) | HellaSwag (10) | LogiQA | BoolQ (32) | LAMBADA | NQ (32) | MMLU (5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OPT-2.7B | 50.3 | 78.9 | 74.8 | 60.8 | 54.4 | 34.0 | 61.4 | 25.8 | 63.3 | 63.6 | 10.7 | 25.8 |
| Pythia-2.8B | 51.5 | 83.2 | 73.6 | 59.6 | 58.8 | 36.7 | 60.7 | 28.1 | 65.9 | 64.6 | 8.7 | 26.8 |
| INCITE-BASE-3B | 53.7 | 85.6 | 73.9 | 63.5 | 61.7 | 40.3 | 64.7 | 27.5 | 65.8 | 65.4 | 15.2 | 27.2 |
| Open-LLaMA-3B-v2 | 55.6 | 88.0 | 77.9 | 63.1 | 63.3 | 40.1 | 71.4 | 28.1 | 69.2 | 67.4 | 16.0 | 26.8 |
| Sheared-LLaMA-2.7B | 56.4 | 87.5 | 76.9 | 65.0 | 63.3 | 41.6 | 71.0 | 28.3 | 73.6 | 68.3 | 17.6 | 27.3 |
| LLaMA-MoE-3.0B | 55.5 | 84.2 | 77.5 | 63.6 | 60.2 | 40.9 | 70.8 | 30.6 | 71.9 | 66.6 | 17.0 | 26.8 |
| LLaMA-MoE-3.5B (4/16) | 57.7 | 87.6 | 77.9 | 65.5 | 65.6 | 44.2 | 73.3 | 29.7 | 75.0 | 69.5 | 20.3 | 26.8 |
| LLaMA-MoE-3.5B (2/8) | 57.6 | 88.4 | 77.6 | 66.7 | 65.3 | 43.1 | 73.3 | 29.6 | 73.9 | 69.4 | 19.8 | 27.0 |
  • SFT models
| Model | MMLU | ARC-c | HellaSwag | TruthfulQA | MT-Bench |
| --- | --- | --- | --- | --- | --- |
| Sheared LLaMA-2.7B ShareGPT | 28.41 | 41.04 | 71.21 | 47.65 | 3.79 |
| Sheared LLaMA-2.7B Deita6K (Our Impl.) | 25.24 | 43.69 | 71.70 | 49.00 | 4.06 |
| LLaMA-MoE-v1-3.0B (2/16) | 23.61 | 43.43 | 72.28 | 44.24 | 4.15 |
| LLaMA-MoE-v1-3.5B (4/16) | 26.49 | 48.29 | 75.10 | 45.91 | 4.60 |
| LLaMA-MoE-v1-3.5B (2/8) | 25.53 | 45.99 | 74.95 | 44.39 | 4.72 |

🚧 Expert Construction

  • Neuron-Independent
    • IndependentRandom: bash ./scripts/expert_construction/split/run_split_random.sh
    • IndependentClustering: bash ./scripts/expert_construction/split/run_split_clustering.sh
  • Neuron-Sharing
    • SharingInner: bash ./scripts/expert_construction/split/run_split_gradient.sh
    • SharingInter: bash ./scripts/expert_construction/split/run_split_gradient_residual.sh

For more information, please refer to the Expert Construction docs. A minimal sketch of the random split follows below.
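As a taste of the simplest method, the sketch below (a hypothetical helper, not the repository's script) implements an IndependentRandom-style split: the FFN's intermediate neuron indices are shuffled and chunked into equal, non-overlapping groups, one per expert. The clustering, co-activation graph, gradient, and sharing-based methods choose these groups differently.

# Hypothetical sketch of an "IndependentRandom" split: shuffle the intermediate
# neuron indices and chunk them into equal groups, one index set per expert.
import torch

def random_neuron_split(intermediate_size: int, num_experts: int, seed: int = 0):
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(intermediate_size, generator=g)
    return list(perm.chunk(num_experts))   # one 1-D index tensor per expert

splits = random_neuron_split(11008, 8)     # LLaMA FFN -> 8 experts of 1376 neurons
print([len(s) for s in splits])            # [1376, 1376, ..., 1376]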

🚅 Continual Pre-training

Tokenization

Download SlimPajama into /path_to_data and put data from different domains into separate folders:

  • /path_to_data/en_arxiv
  • /path_to_data/en_book
  • /path_to_data/en_c4
  • /path_to_data/en_cc
  • /path_to_data/en_stack
  • /path_to_data/en_wikipedia
  • /path_to_data/github

Each file should end with *.jsonl, and each line should look like:

{"id": "id-info", "content": "raw text to be tokenized"}

Run the following command to tokenize the data in each folder:

python -m smoe.utils.tokenize \
  -f jsonl \
  -t /path_to_tokenizer \
  -i /path_to_data/en_arxiv \
  -o /path_to_data_tokenized/en_arxiv

Continual Pre-training (CPT)

  • NOTICE: Please create the logs/ folder manually: mkdir -p logs
  • To run the continual pre-training, please check the CPT docs.

💎 Evaluation

💬 Supervised Fine-Tuning (SFT)

We provide simple examples of SFT for building chatbots. Please refer to the SFT docs and /mnt/petrelfs/zhutong/smoe/scripts/sft for more details.

📑 Citation

@misc{llama-moe-2023,
  title={LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training},
  author={LLaMA-MoE Team},
  year={2023},
  month={Dec},
  url={https://github.com/pjlab-sys4nlp/llama-moe}
}

LLaMA-MoE Team w/ ❤️

llama-moe's People

Contributors

daizedong, jcruan519, spico197, tongjingqi, xiaoyee


llama-moe's Issues

About dataset preparation

Hi, thanks for your interesting work. I also need to use SlimPajama to fine-tune my model. Could you explain how to split the whole SlimPajama dataset into the separate folders you listed?
/path_to_data/en_arxiv
/path_to_data/en_book
/path_to_data/en_c4
/path_to_data/en_cc
/path_to_data/en_stack
/path_to_data/en_wikipedia
/path_to_data/github

Can you report the running time on hardware?

Thank you to the authors for providing a method for transforming a dense model into MoEs for more efficient inference!

MoEfication provides acceleration results for the transformed model on CPU and GPU, while the current technical report for LLaMA-MoE does not contain this information. Could the authors provide the relevant reference information?

How to split "down" by "up" when using clustering to construct experts? 请问使用clustering进行Expert Construction时,down怎么根据up划分?

image
llama的FFN层包含up,down,gate三个部分,根据技术报告中这段话,使用MoEfication方法对up的权重进行k-means聚类后,down的权重是根据up的聚类结果进行分割吗?而gate的权重是需要单独进行k-means聚类吗?
请问对down的分割操作具体是怎么做的呢?是在哪一部分代码中实现的呢?
以及为什么down的weight要根据up进行划分而不是对down进行k-means聚类呢?
Thanks very much!

Partition FFNs without downsizing them?

Here you are converting the existing FFN of LLaMA into multiple small sparse experts. But what if we kept the original FFN dimension for each expert? The number of parameters would increase, but that should be fine since we are going to continually pre-train the model anyway.

One way to achieve this is by duplicating the FFN into multiple experts and adding a gate on top. I know it is not a great idea to have experts with identical weights, so maybe we could add a bit of random noise to the weights before pre-training?

Let me know your thoughts!

How well does LLaMA perform after direct MoE conversion?

I have done some similar investigation before and found that SwiGLU exhibits more than 50% dynamic sparsity, but it seems hard to predict. When the activation function is ReGLU or ReLU, the very high dynamic sparsity seems to make MoE conversion easier.

Questions about capacity_factor and score_scale_factor

Hi, I have a question regarding these params found in config.json: score_scale_factor and capacity_factor.

Based on my understanding, the LLaMA-3B-MoE model splits the intermediate dimension into 8 parts of 1376 dimensions each, instead of the original 11008 dimensions, hence the 1376 values in the size_experts list. However, I don't quite understand how capacity_factor and score_scale_factor affect the MoE architecture. Are they needed during inference, or is the 1376 derived from the capacity factor?

I have read the expert construction README but found no connection to how these two values are set.

Am I missing something here?

#Feature Request# Accelerated Deployment.

Regarding the current LLaMA, there are many mature acceleration frameworks, such as LMDeploy, vLLM, and AWQ.
After adopting the MoE architecture, how can these acceleration frameworks be used for deployment? Has any adaptation been done?

How many LLaMA models are used for constructing LLaMA-MoE?

  1. How many LLaMA models are used when constructing LLaMA-MoE: multiple LLaMA models or just one?
  2. Does this repo partition a single LLaMA model's FFN layers into multiple experts via the different split methods, and then combine the remaining layers and weights with the split FFNs and gates to construct an MoE model?
  3. Do you support merging the FFN layers of multiple LLaMA-architecture models into an MoE model built on one base LLaMA structure?

Performance comparison between LLaMA-MoE and the original dense model.

I appreciate the authors' contribution to MoE construction.
The authors mention in Section 1 that there is a significant performance gap between the LLaMA-MoE-v1 models and the original dense LLaMA models. I am curious about the performance of the original dense LLaMA models and whether this gap can be bridged by the approach proposed in the paper.

Some weights of LlamaMoEForCausalLM were not initialized

transformers version 4.38.0

import torch
from llama_moe.modeling_llama_moe_hf import LlamaMoEForCausalLM
from transformers import AutoTokenizer

model_dir = "Qwen1.5-0.5B-Chat_llamafy-16Select4-gate_proj-Scale1.0"

model = LlamaMoEForCausalLM.from_pretrained(model_dir, trust_remote_code=True)

Some weights of LlamaMoEForCausalLM were not initialized from the model checkpoint at Qwen1.5-0.5B-Chat_llamafy-16Select4-gate_proj-Scale1.0 and are newly initialized: ['model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq']
