Comments (2)
As far as I understand, your doubt is why Q, K, and V are not passed through n_head separate linear transformations to extract the Q_i, K_i, and V_i corresponding to each head?
=>
The answer is that, to avoid a significant growth in computational and parameter cost, we set d_q = d_k = d_v = d_model / n_head. [1]
This is what the split() function in MultiHeadAttention does, and the concat() function is then essentially a weighted combination of these heads, just as in the paper.
You can see a similar implementation in PyTorch source code as well. [2]
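To make the shape bookkeeping concrete, here is a minimal sketch (with made-up sizes, not the repo's exact code) of what split() and concat() amount to:

```python
import torch

batch, length, d_model, n_head = 2, 5, 512, 8
d_head = d_model // n_head  # 64, i.e. d_q = d_k = d_v = d_model / n_head

x = torch.randn(batch, length, d_model)

# split(): reshape the full-width tensor into n_head slices of width d_head
split = x.view(batch, length, n_head, d_head).transpose(1, 2)  # (2, 8, 5, 64)

# concat(): the inverse reshape; a final linear layer then mixes the heads back together
concat = split.transpose(1, 2).contiguous().view(batch, length, d_model)  # (2, 5, 512)

assert torch.equal(x, concat)
```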
Anyone else reading this, please correct me if I am wrong or if there are other benefits/reasons for using this implementation.
EDIT:
It's clearly mentioned in the paper as well:
In this work we employ h = 8 parallel attention layers, or heads. For each of these we use d_k = d_v = d_model / h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
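As a rough sanity check of that claim (my own arithmetic, not from the paper), the per-head projections add up to exactly one full-width projection's worth of parameters:

```python
# Sizes from the quote: d_model = 512, h = 8 heads, d_k = d_model / h = 64.
d_model, h = 512, 8
d_k = d_model // h  # 64

# Each head's query projection is a (d_model x d_k) matrix; h of them together
# hold exactly as many parameters as one full (d_model x d_model) projection.
assert h * (d_model * d_k) == d_model * d_model
```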
For the most faithful implementations of research papers, you should also check out the labml.ai annotated PyTorch implementations repository. [3]
[1] https://d2l.ai/chapter_attention-mechanisms-and-transformers/multihead-attention.html
[2] https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#MultiheadAttention
[3] http://nlp.seas.harvard.edu/annotated-transformer/
Thank you for your comment, but it doesn't address my question. For instance, consider a sequence, and we need to produce its embedding matrix, named X. Then, it is sent to every head and multiplied by W_q, W_k, and W_v, respectively. Now, each head generates its corresponding Q, K, and V.
However, the paper's multi-head attention illustration shows Q, K, and V entering the linear layers of each head, rather than X, X, and X.
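To illustrate the computation I am describing (hypothetical names and sizes, not the repo's code): the same X enters every head, and each head's own linear layers produce its Q, K, and V:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only
d_model, d_head, seq_len = 512, 64, 5
X = torch.randn(seq_len, d_model)  # embedding matrix of one sequence

# One head's projections, as described above (names W_q, W_k, W_v are mine)
W_q = nn.Linear(d_model, d_head, bias=False)
W_k = nn.Linear(d_model, d_head, bias=False)
W_v = nn.Linear(d_model, d_head, bias=False)

# The same X is fed to all three projections; the outputs are this head's Q, K, V
Q, K, V = W_q(X), W_k(X), W_v(X)
print(Q.shape, K.shape, V.shape)  # each torch.Size([5, 64])
```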
Related Issues (17)
- Masked attention HOT 3
- Potential bug in the pad mask HOT 5
- how to get dataset HOT 4
- The experimental results have a large gap with the one in README HOT 2
- Question about implementation in the multi-head attention part HOT 3
- How to convert TorchText 0.9 to the latest version HOT 10
- how to resolve the issue ”No module named 'torch._C'“
- src_pad_idx and trg_pad_idx
- About MultiHeadAttention's split method HOT 3
- shaollow copy HOT 1
- batch.trg[j] out of index. HOT 1
- Questions regarding the implementation HOT 9
- [Bug] LayerNorm should not contain learnable parameters HOT 4
- LayerNorm implement HOT 4
- [Bug] Dropout should comes before residual connection and layer norm HOT 1
- Reporting a bug during test time HOT 2