
llama-moe's Issues

Partition FFNs without downsizing them?

Here you are converting the existing FFN of LLaMA into multiple small sparse experts. But what if we kept the original FFN dimension for each expert? The number of parameters would increase, but that should be fine since we are going to continually pretrain the model anyway.

One way to achieve this is by duplicating the FFN into multiple experts and adding a gate on top. I know it is not ideal for the experts to start with identical weights, so maybe we could add a bit of random noise to the weights before pretraining?
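
For concreteness, here is a rough sketch of what I have in mind (purely illustrative; it assumes the FFN is an nn.Module such as a standard LlamaMLP, and the noise scale is an arbitrary choice):

import copy
import torch

def duplicate_ffn_into_experts(ffn, num_experts, noise_std=1e-3):
    """Clone a dense FFN into `num_experts` identical copies and perturb each
    copy with small Gaussian noise to break the symmetry before continual
    pretraining. `ffn` is assumed to be an nn.Module (e.g. a LlamaMLP)."""
    experts = torch.nn.ModuleList()
    for _ in range(num_experts):
        expert = copy.deepcopy(ffn)
        with torch.no_grad():
            for p in expert.parameters():
                p.add_(noise_std * torch.randn_like(p))  # tiny per-expert perturbation
        experts.append(expert)
    return experts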

Let me know your thoughts!

Some weights of LlamaMoEForCausalLM were not initialized

transformers version 4.38.0

import torch
from llama_moe.modeling_llama_moe_hf import LlamaMoEForCausalLM
from transformers import AutoTokenizer

model_dir = "Qwen1.5-0.5B-Chat_llamafy-16Select4-gate_proj-Scale1.0"

model = LlamaMoEForCausalLM.from_pretrained(model_dir, trust_remote_code=True)

Some weights of LlamaMoEForCausalLM were not initialized from the model checkpoint at Qwen1.5-0.5B-Chat_llamafy-16Select4-gate_proj-Scale1.0 and are newly initialized: ['model.layers.0.self_attn.rotary_emb.inv_freq', 'model.layers.1.self_attn.rotary_emb.inv_freq', 'model.layers.10.self_attn.rotary_emb.inv_freq', 'model.layers.11.self_attn.rotary_emb.inv_freq', 'model.layers.12.self_attn.rotary_emb.inv_freq', 'model.layers.13.self_attn.rotary_emb.inv_freq', 'model.layers.14.self_attn.rotary_emb.inv_freq', 'model.layers.15.self_attn.rotary_emb.inv_freq', 'model.layers.16.self_attn.rotary_emb.inv_freq', 'model.layers.17.self_attn.rotary_emb.inv_freq', 'model.layers.18.self_attn.rotary_emb.inv_freq', 'model.layers.19.self_attn.rotary_emb.inv_freq', 'model.layers.2.self_attn.rotary_emb.inv_freq', 'model.layers.20.self_attn.rotary_emb.inv_freq', 'model.layers.21.self_attn.rotary_emb.inv_freq', 'model.layers.22.self_attn.rotary_emb.inv_freq', 'model.layers.23.self_attn.rotary_emb.inv_freq', 'model.layers.3.self_attn.rotary_emb.inv_freq', 'model.layers.4.self_attn.rotary_emb.inv_freq', 'model.layers.5.self_attn.rotary_emb.inv_freq', 'model.layers.6.self_attn.rotary_emb.inv_freq', 'model.layers.7.self_attn.rotary_emb.inv_freq', 'model.layers.8.self_attn.rotary_emb.inv_freq', 'model.layers.9.self_attn.rotary_emb.inv_freq']
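
A quick way to check whether those buffers are stored in the checkpoint at all (a rough sketch assuming a single pytorch_model.bin file; sharded or safetensors checkpoints would need to be loaded differently):

import os
import torch

model_dir = "Qwen1.5-0.5B-Chat_llamafy-16Select4-gate_proj-Scale1.0"
state_dict = torch.load(os.path.join(model_dir, "pytorch_model.bin"), map_location="cpu")
# If this prints an empty list, the inv_freq buffers are simply not part of the
# checkpoint, and the warning refers to buffers recomputed from the config
# rather than to trained weights.
print([k for k in state_dict if "rotary_emb.inv_freq" in k])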

Performance comparison between LLaMA-MoE and the original dense model

I appreciate the authors' contribution to MoE construction.
The authors mention in Section 1 that there is a significant performance drop from the original dense LLaMA models to the LLaMA-MoE-v1 models. I am curious about the performance of the original dense LLaMA model, and whether these gaps can be bridged by the approach proposed in the paper.

How to split "down" by "up" when using clustering to construct experts? 请问使用clustering进行Expert Construction时,down怎么根据up划分?

(screenshot of the relevant passage from the technical report)
The LLaMA FFN layer consists of three parts: up, down, and gate. According to this passage in the technical report, after running k-means clustering on the up weights with the MoEfication method, are the down weights split according to up's clustering result? And do the gate weights need to be clustered separately with k-means?
How exactly is the split of down performed, and which part of the code implements it?
Also, why are the down weights partitioned according to up instead of being clustered with k-means directly?
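
My rough guess, as a sketch (my own reading, not necessarily the repository's actual implementation): each intermediate neuron corresponds to one row of the up/gate weights and one column of the down weight, so the cluster labels obtained from up can simply be reused to slice all three matrices:

import numpy as np
from sklearn.cluster import KMeans

def split_ffn_by_up_clustering(up_w, gate_w, down_w, num_experts):
    """up_w, gate_w: (intermediate, hidden); down_w: (hidden, intermediate).
    Cluster the rows of up_w (one row per intermediate neuron) and reuse the
    labels to slice gate_w's rows and down_w's columns, so every expert keeps
    a consistent set of intermediate neurons."""
    labels = KMeans(n_clusters=num_experts, n_init=10).fit_predict(up_w)
    experts = []
    for e in range(num_experts):
        idx = np.where(labels == e)[0]
        experts.append((up_w[idx], gate_w[idx], down_w[:, idx]))
    return experts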
Thanks very much!

How well does LLaMA perform after being directly converted to MoE?

I have done some similar investigation before and found that SwiGLU exhibits more than 50% dynamic sparsity, but it seems hard to predict. With ReGLU or ReLU activations, the very high dynamic sparsity seems to make MoE conversion easier.
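
A rough way to measure this kind of dynamic sparsity (my own sketch; it assumes a LlamaMLP-style module exposing gate_proj, up_proj and act_fn, and treats activations below a small magnitude threshold as inactive):

import torch

def dynamic_sparsity(hidden_states, mlp, threshold=0.05):
    """Fraction of SwiGLU intermediate activations, act_fn(gate_proj(x)) * up_proj(x),
    whose magnitude falls below `threshold` for the given batch of hidden states."""
    with torch.no_grad():
        inter = mlp.act_fn(mlp.gate_proj(hidden_states)) * mlp.up_proj(hidden_states)
        return (inter.abs() < threshold).float().mean().item()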

How many llama models are used for constructing llama-moe?

  1. Is the MoE model constructed from multiple llama models or from a single one?

  2. Does this repo split the FFN layer of a single llama model into multiple FFNs that act as experts (via different partitioning methods), and then combine the remaining layers and weights of the llama model with the partitioned FFNs and gates to form an MoE model?

  3. Do you support merging the FFN layers of multiple llama-architecture models to build an MoE on top of one base llama model structure?

#Feature Request# Accelerated Deployment.

For the current LLaMA, there are many mature acceleration frameworks, such as LMDeploy, vLLM, and AWQ.
After switching to the MoE architecture, how can these acceleration frameworks be used for deployment? Has any adaptation been done?

Questions about capacity_factor, score_scale_factor

Hi, I have a question regarding these params found in config.json: score_scale_factor and capacity_factor.

Based on my understanding, the llama-3B-MoE splits the intermediate dimension into 8 parts of 1376 dimensions each, instead of the original 11008 dimensions; hence the 1376 entries in the size_experts list. However, I don't quite understand how capacity_factor and score_scale_factor affect the architecture of the MoE. Are they needed during inference, or are the 1376 sizes derived from the capacity factor?

I have read the expert construction readme but found no connection to setting these two values.

Am I missing something here?
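
A tiny sanity check on where the 1376 comes from (just my own arithmetic, assuming an even split of the original 11008 intermediate dimension rather than a capacity-factor-based derivation):

intermediate_size = 11008          # original dense LLaMA intermediate dimension
num_experts = 8
size_experts = [intermediate_size // num_experts] * num_experts
print(size_experts)                # [1376, 1376, 1376, 1376, 1376, 1376, 1376, 1376]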

About dataset prepare

Hi, thanks for your interesting work. I also need to use SlimPajama to fine-tune my model. Could you explain how to split the whole SlimPajama dataset into the separate folders listed below?
/path_to_data/en_arxiv
/path_to_data/en_book
/path_to_data/en_c4
/path_to_data/en_cc
/path_to_data/en_stack
/path_to_data/en_wikipedia
/path_to_data/github
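
The rough approach I am considering (just my own guess, assuming the public SlimPajama release where each jsonl record carries a meta.redpajama_set_name field, and that the folder names mirror the RedPajama subsets):

import json
import os

# Hypothetical mapping from SlimPajama source tags to the folders listed above.
SET_TO_DIR = {
    "RedPajamaArXiv": "en_arxiv",
    "RedPajamaBook": "en_book",
    "RedPajamaC4": "en_c4",
    "RedPajamaCommonCrawl": "en_cc",
    "RedPajamaStackExchange": "en_stack",
    "RedPajamaWikipedia": "en_wikipedia",
    "RedPajamaGithub": "github",
}

def split_jsonl_by_source(in_path, out_root):
    """Append every record of one SlimPajama jsonl file to a per-source folder."""
    writers = {}
    with open(in_path, encoding="utf-8") as fin:
        for line in fin:
            subdir = SET_TO_DIR.get(json.loads(line)["meta"]["redpajama_set_name"])
            if subdir is None:
                continue
            if subdir not in writers:
                os.makedirs(os.path.join(out_root, subdir), exist_ok=True)
                out_file = os.path.join(out_root, subdir, os.path.basename(in_path))
                writers[subdir] = open(out_file, "w", encoding="utf-8")
            writers[subdir].write(line)
    for f in writers.values():
        f.close()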

Can you report the running time on hardware?

Thank you to the authors for providing a method for transforming a dense model into MoEs for more efficient inference!

MoEfication provides acceleration results for the transformed model on CPU and GPU, while the current technical report for LLaMA-MoE does not contain this information. Could the authors provide this information for reference?
