Comments (7)
Hi, I tried this code and it seems to work in terms of the model generating correctly. But the output is strangely very slow compared to fp16; the model I tried was 1B. Using a 3090, GPU utilization is very low (< 10%) when generating.
The process for supporting models is to first add quantization support, and then we move on to optimizing inference by fusing layers. Additionally, we have an upcoming PR that will make speeding up inference much easier.
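For reference, once fusion is available for an architecture, it is enabled when loading the quantized model. A minimal sketch, assuming a model that has already been quantized and saved locally; the path and prompt are placeholders, and the exact from_quantized arguments may differ between AutoAWQ versions:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

quant_path = "starcoder-1b-awq"  # placeholder path to an AWQ-quantized model

# fuse_layers=True swaps in the fused attention/MLP modules that provide the speedup;
# it only has an effect once fusion has been implemented for the model architecture.
model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

tokens = tokenizer("def fibonacci(n):", return_tensors="pt").input_ids.cuda()
output = model.generate(tokens, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))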
Hi @casper-hansen, thanks for your great work.
Based on the branch you submitted to AWQ but have not yet merged, I ran some experiments with the StarCoder model, but accuracy dropped significantly after quantization. The accuracy on HE (HumanEval) Python is about 18%, and the inference speed is 43 ms/token; both results are averaged over the HE Python dataset. I'm not sure what went wrong. Could it be that StarCoder uses multi-query attention? Apart from that, I can't think of any other reason.
Here is my code:
from .base import BaseAWQForCausalLM

class BigCodeAWQForCausalLM(BaseAWQForCausalLM):
    layer_type = "gpt_bigcode"
    max_new_tokens_key = "n_positions"

    @staticmethod
    def get_model_layers(model):
        return model.transformer.h

    @staticmethod
    def get_act_for_scaling(module):
        return dict(
            is_scalable=True,
            scale_name="mlp.act",
            scale_layer=module.mlp.act,
            scale_shape=module.mlp.c_fc.out_features
        )

    @staticmethod
    def move_embed(model, device):
        model.transformer.wte = model.transformer.wte.to(device)
        model.transformer.drop = model.transformer.drop.to(device)

    @staticmethod
    def get_layers_for_scaling(module, input_feat, module_kwargs):
        layers = []

        # attention input
        layers.append(dict(
            prev_op=module.ln_1,
            layers=[module.attn.c_attn],
            inp=input_feat['attn.c_attn'],
            module2inspect=module.attn,
            kwargs=module_kwargs
        ))

        # attention output
        layers.append(dict(
            prev_op=module.attn.c_attn,
            layers=[module.attn.c_proj],
            inp=input_feat['attn.c_proj']
        ))

        # linear 1
        layers.append(dict(
            prev_op=module.ln_2,
            layers=[module.mlp.c_fc],
            inp=input_feat['mlp.c_fc'],
            module2inspect=module.mlp
        ))

        # linear 2
        layers.append(dict(
            prev_op=module.mlp.act,
            layers=[module.mlp.c_proj],
            inp=input_feat['mlp.c_proj']
        ))

        return layers
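On the multi-query attention point I mentioned: as far as I can tell, in the Hugging Face gpt_bigcode implementation c_attn packs the query projection together with a single shared key/value head, so its output width is hidden_size + 2 * head_dim rather than 3 * hidden_size. A quick sanity check from the config (attribute names assumed from transformers' GPTBigCodeConfig, just a check, not a fix):

from transformers import AutoConfig

# Sanity check of StarCoder's multi-query attention layout (assumed attribute names).
cfg = AutoConfig.from_pretrained("bigcode/starcoder")
head_dim = cfg.hidden_size // cfg.num_attention_heads

# With multi_query=True there is one shared K/V head, so c_attn projects to
# hidden_size (queries) + 2 * head_dim (one key head + one value head).
expected_c_attn_out = cfg.hidden_size + 2 * head_dim if cfg.multi_query else 3 * cfg.hidden_size
print(cfg.multi_query, expected_c_attn_out)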
I am not sure whether I need to do something special here?
Looking forward to your reply!
accuracy on HE Python is about 18%
It seems something must have gone wrong during conversion. I will look into the layer specification.
@curname Can you paste the code you used to measure the accuracy?
43ms/token
This could be reasonable depending on the hardware and model size, but it seems there is room for improvement here.
Hi, @casper-hansen
accuracy on HE Python is about 18%
The code for measuring the HE accuracy comes from the OpenAI human-eval project; it is available here: https://github.com/openai/human-eval/tree/master/human_eval.
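Roughly, the evaluation flow with that project looks like the sketch below (a simplified illustration, not my exact script; the model path and generation settings are placeholders):

from human_eval.data import read_problems, write_jsonl
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "starcoder-awq"  # placeholder for whichever model is being evaluated
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Generate one completion per HumanEval problem and dump them to a JSONL file.
problems = read_problems()
samples = []
for task_id, problem in problems.items():
    inputs = tokenizer(problem["prompt"], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    completion = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    # Real harnesses also truncate completions at stop sequences before scoring.
    samples.append(dict(task_id=task_id, completion=completion))

write_jsonl("samples.jsonl", samples)
# Then score with: evaluate_functional_correctness samples.jsonl  (prints pass@1)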
I ran some more experiments. As with BLOOM, I did not scale the attention output of StarCoder; the latest results not only did not decline but actually improved slightly. The accuracy on HE Python reached 36%, which is really surprising.
And the original model URL: https://huggingface.co/bigcode/starcoder/tree/main
43ms/token
I ran the above implementation on an A100 80G, and the speed of AWQ and GPTQ is almost the same. The experiments in the paper show that AWQ is better than GPTQ, although the experimental models are mainly LLaMA, not StarCoder. If I want to further improve the inference speed to 30 ms/token, or even 20 ms/token, I would appreciate any suggestions you could give me.
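(For reference, by ms/token I mean total decode wall time divided by the number of generated tokens, measured roughly along these lines; the model path and prompt below are placeholders, not my exact benchmark:)

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "starcoder-awq"  # placeholder for whichever model is being benchmarked
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)

# Time a fixed-length greedy decode and report average decode latency per token.
torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=200, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

generated = out.shape[1] - inputs.input_ids.shape[1]
print(f"{elapsed / generated * 1000:.1f} ms/token")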
As with BLOOM, I did not scale the attention output of StarCoder
And the code looks like this:
from .base import BaseAWQForCausalLM

class BigCodeAWQForCausalLM(BaseAWQForCausalLM):
    layer_type = "gpt_bigcode"
    max_new_tokens_key = "n_positions"

    @staticmethod
    def get_model_layers(model):
        return model.transformer.h

    @staticmethod
    def get_act_for_scaling(module):
        # return dict(
        #     is_scalable=False
        # )
        return dict(
            is_scalable=True,
            scale_name="mlp.act",
            scale_layer=module.mlp.act,
            scale_shape=module.mlp.c_fc.out_features
        )

    @staticmethod
    def move_embed(model, device):
        model.transformer.wte = model.transformer.wte.to(device)
        model.transformer.drop = model.transformer.drop.to(device)

    @staticmethod
    def get_layers_for_scaling(module, input_feat, module_kwargs):
        layers = []

        # attention input
        layers.append(dict(
            prev_op=module.ln_1,
            layers=[module.attn.c_attn],
            inp=input_feat['attn.c_attn'],
            module2inspect=module.attn,
            kwargs=module_kwargs
        ))

        # attention output
        # layers.append(dict(
        #     prev_op=module.attn.c_attn,
        #     layers=[module.attn.c_proj],
        #     inp=input_feat['attn.c_proj']
        # ))

        # linear 1
        layers.append(dict(
            prev_op=module.ln_2,
            layers=[module.mlp.c_fc],
            inp=input_feat['mlp.c_fc'],
            module2inspect=module.mlp
        ))

        # linear 2
        layers.append(dict(
            prev_op=module.mlp.act,
            layers=[module.mlp.c_proj],
            inp=input_feat['mlp.c_proj']
        ))

        return layers
@curname Did you get it working with better accuracy yet? Also, did you test perplexity before and after on wikitext? A normal increase in wikitext perplexity is between 2 and 5% (LLaMA 7B is around 2%). I wish I could test these models for you, but unfortunately I do not have many GPU resources available to me because of the cost associated with them.
The code for testing can be found here:
Lines 134 to 138 in 783afe5
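(In case the embedded snippet does not render, the idea is along these lines; a rough sketch of a fixed-window wikitext-2 perplexity check, not the exact referenced lines, with the model path as a placeholder:)

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "starcoder-awq"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")

# Concatenate the wikitext-2 test split and evaluate fixed-length, non-overlapping windows.
data = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(data["text"]), return_tensors="pt").input_ids

seqlen = 2048
nlls = []
for i in range(0, enc.shape[1] - seqlen, seqlen):
    batch = enc[:, i:i + seqlen].to(model.device)
    with torch.no_grad():
        loss = model(batch, labels=batch).loss  # mean token cross-entropy for this window
    nlls.append(loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seqlen))
print(f"wikitext-2 perplexity: {ppl.item():.3f}")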
Hi, I tried this code and it seems to work in terms of the model generating correctly. But the output is strangely very slow compared to fp16; the model I tried was 1B. Using a 3090, GPU utilization is very low (< 10%) when generating:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'abacaj/starcoderbase-1b-sft'
quant_path = 'starcoder-1b-awq'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4 }
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)