opennlplab / cosformer
[ICLR 2022] Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention
License: Apache License 2.0
In the paper, it is mentioned that bidirectional language modeling pre-training has been done. Are you planning to release pre-trained weights for the model?
This is more general (from my perspective). Here q, k, v each have shape (b, s, d):

# requires: from einops import rearrange
# split the model dimension into heads and fold the head axis into the batch axis
q = q.contiguous()
q = rearrange(q, 'b n (h d) -> (b h) n d', h=self.num_heads)
k = k.contiguous()
k = rearrange(k, 'b n (h d) -> (b h) n d', h=self.num_heads)
v = v.contiguous()
v = rearrange(v, 'b n (h d) -> (b h) n d', h=self.num_heads)
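For reference, a minimal sketch of the same head split without einops (my own illustration; it assumes the embedding dimension is divisible by num_heads):

import torch

def split_heads(x: torch.Tensor, num_heads: int) -> torch.Tensor:
    # x: (b, n, embed_dim) -> (b * num_heads, n, head_dim)
    b, n, embed_dim = x.shape
    head_dim = embed_dim // num_heads
    x = x.contiguous().view(b, n, num_heads, head_dim)
    return x.transpose(1, 2).reshape(b * num_heads, n, head_dim)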
Thanks for the awesome project.
I wonder when the code will be released. Thank you ;)
Thanks very much for your interesting work! I have a question about the O(N) space complexity mentioned in your paper, and I am wondering whether you can help me figure it out.
In Eq. (11) of your paper, you compute QK^T in the denominator, which seems to require materializing an N x N matrix, i.e. O(N^2) space?
Best
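For context, a minimal sketch of the standard reordering that avoids materializing QK^T (my own illustration with made-up shapes, not code from the repo): both the numerator and the denominator can be computed right-to-left, so only (d, d) and (d,) intermediates are needed.

import torch

n, d = 1024, 64
q = torch.relu(torch.randn(n, d))   # feature-mapped queries
k = torch.relu(torch.randn(n, d))   # feature-mapped keys
v = torch.randn(n, d)               # values

# quadratic form: materializes an (n, n) matrix
attn = q @ k.t()                    # (n, n)
num_left = attn @ v                 # (n, d)
den_left = attn.sum(dim=1)          # (n,)

# reordered form: O(n*d + d^2) memory, no (n, n) intermediate
kv = k.t() @ v                      # (d, d)
k_sum = k.sum(dim=0)                # (d,)
num_right = q @ kv                  # (n, d)
den_right = q @ k_sum               # (n,)

print((num_left - num_right).abs().max(), (den_left - den_right).abs().max())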
Compared with the left_product function, the attention mask is not used in the forward() function. How can the attention mask be used in the forward method?
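For what it's worth, a minimal sketch of why a causal pass in linear attention often needs no explicit mask (this illustrates the general prefix-sum trick, not necessarily the exact code in this repo): restricting each query to keys 1..i is done with cumulative sums instead of multiplying an (n, n) mask.

import torch

def causal_linear_attention(q, k, v):
    # q, k: (n, d) non-negative feature maps; v: (n, d)
    # kv_prefix[i] = sum_{j<=i} k_j v_j^T ; k_prefix[i] = sum_{j<=i} k_j
    # (materializes (n, d, d) prefixes for clarity; practical code chunks this)
    kv_prefix = torch.cumsum(torch.einsum('nd,ne->nde', k, v), dim=0)   # (n, d, d)
    k_prefix = torch.cumsum(k, dim=0)                                   # (n, d)
    num = torch.einsum('nd,nde->ne', q, kv_prefix)                      # (n, d)
    den = torch.einsum('nd,nd->n', q, k_prefix).clamp(min=1e-6)         # (n,)
    return num / den.unsqueeze(-1)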
When implementing cosformer on the MultiHeadAttention in Transformer-XL and running without extra long-range memory, the ReLU performance is worse than ELU. I think this is because the attention and FF Net blocks differ: an XL-like transformer uses different layer norm and residual connections. Why is this ReLU(Q)ReLU(K)^T softmax replacement not robust across different transformer architectures?
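For reference, a quick sketch of the two feature maps being compared (my own illustration): cosFormer-style linear attention uses ReLU, while the elu(x) + 1 map comes from the earlier linear-transformer line of work; either one replaces softmax(QK^T) by row-normalizing phi(Q) phi(K)^T.

import torch.nn.functional as F

def relu_map(x):
    # zeroes out negatives entirely, so some query/key pairs contribute nothing
    return F.relu(x)

def elu_plus_one_map(x):
    # strictly positive and smooth around zero
    return F.elu(x) + 1.0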
Hello!
In Figure 1, how do you get the memory consumption of cosFormer on the LRA benchmark? Could you please open-source the script for computing it?
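In case it helps, one common way to measure peak GPU memory for a forward/backward pass in PyTorch (a sketch, not the authors' script; model and batch are placeholders):

import torch

torch.cuda.reset_peak_memory_stats()
out = model(batch)              # placeholder model and input batch
out.sum().backward()
torch.cuda.synchronize()
peak_mib = torch.cuda.max_memory_allocated() / 2**20
print(f"peak GPU memory: {peak_mib:.1f} MiB")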
diff += torch.norm(left_res - right_res)
What does diff stand for?
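Presumably diff accumulates the numerical discrepancy between the left-product (quadratic) and right-product (reordered, linear) computations, so a value near zero indicates the two forms agree. A hedged sketch of such a check, mirroring the comparison sketched above:

import torch

def left_product(q, k, v):
    attn = q @ k.t()                            # explicit (n, n) attention weights
    return (attn @ v) / attn.sum(dim=1, keepdim=True).clamp(min=1e-6)

def right_product(q, k, v):
    kv = k.t() @ v                              # reordered form, no (n, n) intermediate
    den = (q @ k.sum(dim=0)).clamp(min=1e-6)
    return (q @ kv) / den.unsqueeze(-1)

diff = 0.0
for _ in range(10):
    q = torch.relu(torch.randn(128, 64))
    k = torch.relu(torch.randn(128, 64))
    v = torch.randn(128, 64)
    diff += torch.norm(left_product(q, k, v) - right_product(q, k, v))
print(diff)  # stays close to zero when the two forms agree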
We are examining non-NLP applications of the cosformer self-attention and would need attention masking for the padded tokens in the batch. Is there a way to incorporate this? The code does not explicitly compute the attention weights on which masking is traditionally applied.
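A common workaround (a sketch under my own assumptions, not an official answer): since the linear form never materializes attention weights, a key-padding mask can instead zero out the feature-mapped keys (and values) at padded positions before they are aggregated, which removes their contribution from both the numerator and the denominator.

import torch

def apply_key_padding_mask(k, v, key_padding_mask):
    # k, v: (b, n, d); key_padding_mask: (b, n), True at padded positions
    keep = (~key_padding_mask).unsqueeze(-1).to(k.dtype)   # (b, n, 1)
    return k * keep, v * keep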
Thank you for your great work! I am currently working on a seq2seq task and found that the causal attention code only works when src_len and tgt_len are the same. Also, I suggest adopting EPFL's causal linear attention CUDA kernel to improve the speed of causal attention.
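For illustration, a sketch of the non-causal cross-attention linear form, which does handle tgt_len != src_len (my own example, not repo code); the causal prefix-sum variant instead pairs query position i with keys 1..i, which implicitly assumes the two sequences are aligned.

import torch

def cross_linear_attention(q, k, v):
    # q: (tgt_len, d); k, v: (src_len, d); q and k already feature-mapped (non-negative)
    kv = k.t() @ v                              # (d, d), independent of tgt_len
    den = (q @ k.sum(dim=0)).clamp(min=1e-6)    # (tgt_len,)
    return (q @ kv) / den.unsqueeze(-1)         # (tgt_len, d)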