If we can speed up the BERT model, we will significantly increase the throughput of ma


Experiment with implementing AWQ for BERT models about autoawq HOT 3 OPEN

casper-hansen commented on August 15, 2024

Experiment with implementing AWQ for BERT models


Comments (3)

casper-hansen commented on August 15, 2024

@michaelfeil AWQ/GEMM kernels can work for any linear layer. The challenge in applying AWQ to BERT models is that some of the scaling methods they would need are missing. For example, we would usually scale from a layernorm into the linear layer that follows it.

See more about the scaling of layers here:
https://github.com/casper-hansen/AutoAWQ/blob/main/awq/quantize/scale.py

I also created a PR for better scaling for Mixtral, which may be interesting to you:
#301
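To make the layernorm-to-linear scaling mentioned above concrete, here is a minimal pure-Python sketch. It is not AutoAWQ's code: the helper names (`layer_norm`, `linear`, `fold_scales`) and the scale values are hypothetical, and real AWQ derives the scales from activation statistics. The sketch only illustrates the underlying identity: dividing the layernorm's output channels by per-channel scales and multiplying the matching input columns of the following linear layer by the same scales leaves the output unchanged.

```python
import math
import random


def layer_norm(x, gamma, beta, eps=1e-5):
    """Layer normalization over a single feature vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


def linear(x, weight, bias):
    """Linear layer; weight is laid out as [out_features][in_features]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def fold_scales(gamma, beta, weight, scales):
    """Hypothetical helper: divide the layernorm affine parameters by the
    per-channel scales and multiply the linear input columns by them."""
    gamma_s = [g / s for g, s in zip(gamma, scales)]
    beta_s = [b / s for b, s in zip(beta, scales)]
    weight_s = [[w * s for w, s in zip(row, scales)] for row in weight]
    return gamma_s, beta_s, weight_s


random.seed(0)
d_in, d_out = 4, 3
x = [random.uniform(-1, 1) for _ in range(d_in)]
gamma = [random.uniform(0.5, 1.5) for _ in range(d_in)]
beta = [random.uniform(-0.1, 0.1) for _ in range(d_in)]
weight = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
bias = [random.uniform(-0.1, 0.1) for _ in range(d_out)]

# Made-up scales for illustration only; AWQ would search for these.
scales = [0.5, 2.0, 1.5, 4.0]

reference = linear(layer_norm(x, gamma, beta), weight, bias)
gamma_s, beta_s, weight_s = fold_scales(gamma, beta, weight, scales)
rescaled = linear(layer_norm(x, gamma_s, beta_s), weight_s, bias)

max_abs_diff = max(abs(a - b) for a, b in zip(reference, rescaled))
print("max abs difference:", max_abs_diff)
```

The folding only works when an operation with per-channel parameters sits directly in front of the linear layer, which is why each architecture needs its own mapping of which layer can be scaled into which.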


z3ugma commented on August 15, 2024

Would love this for image captioning with a quantization speedup.
The Kosmos-2 model from Microsoft would be another good candidate.


michaelfeil commented on August 15, 2024

Hi @casper-hansen, I would be curious to implement this for https://github.com/michaelfeil/infinity. Do you see any roadblocks with regard to encoder-only architectures? Will the GEMM kernels work for non-causal masked LMs?
