
cosformer's People

Contributors

doraemonzzz, opennlplab123


cosformer's Issues

Pre-train model

The paper mentions that bidirectional language-modeling pre-training has been done. Are you planning to release pre-trained weights for the model?

Why is the input [s, b, dim] and not [b, s, dim]?

The (b, s, d) layout is more general (from my perspective). With q, k, v of shape (b, s, d), the head split can be written as:

from einops import rearrange

# (b, s, h * d) -> (b * h, s, d): fold the heads into the batch dimension
q = rearrange(q.contiguous(), 'b n (h d) -> (b h) n d', h=self.num_heads)
k = rearrange(k.contiguous(), 'b n (h d) -> (b h) n d', h=self.num_heads)
v = rearrange(v.contiguous(), 'b n (h d) -> (b h) n d', h=self.num_heads)

# each now has shape (b * h, s, d), where d is the per-head dimension
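For completeness, converting between the module's (s, b, dim) convention (the default layout of torch.nn.MultiheadAttention) and the (b, s, dim) layout above is a single transpose; a minimal sketch with assumed sizes:

import torch
from einops import rearrange

x_sbd = torch.randn(128, 4, 64)             # (s, b, d): the layout the module expects
x_bsd = rearrange(x_sbd, 'n b d -> b n d')  # (b, s, d): the layout suggested above
assert x_bsd.shape == (4, 128, 64)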

Question about space complexity

Thanks very much for your interesting work! I have a question about the O(N) space complexity mentioned in your paper, and I am wondering whether you could help me figure it out.

In Eq. (11) of your paper, the denominator contains QK^T, which seems to require materializing an N × N matrix, i.e. O(N^2) space?

Best regards
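For reference, the usual linear-attention argument is that the denominator never has to be materialized as an N × N matrix: by associativity, (QK^T)1 = Q(K^T 1), which only needs O(Nd) memory. A minimal sketch of that identity (names and shapes are assumptions, not the repository's code):

import torch

N, d = 512, 64
Q = torch.relu(torch.randn(N, d))   # non-negative feature maps, as in linear attention
K = torch.relu(torch.randn(N, d))

# naive: materializes the N x N matrix QK^T -> O(N^2) memory
denom_naive = (Q @ K.T).sum(dim=1)           # shape (N,)

# reordered: sum the keys first (K^T 1), then multiply by Q -> O(N d) memory
denom_linear = Q @ K.sum(dim=0)              # shape (N,)

# the numerator (QK^T)V is handled the same way, as Q (K^T V)
assert torch.allclose(denom_naive, denom_linear, rtol=1e-4)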

Why does cosformer not work on the Transformer-XL architecture?

When implementing cosformer in the MultiHeadAttention of Transformer-XL and running without the extra long-range memory, the ReLU variant performs worse than ELU. I think this is because the attention and feed-forward blocks differ: an XL-style transformer uses a different layer-norm placement and residual connections. Why is this ReLU(Q)ReLU(K)^T replacement for softmax not robust across transformer architectures?
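For context, the replacement being discussed swaps softmax(QK^T / sqrt(d)) V for a decomposable ReLU kernel (without the cos-based re-weighting that cosformer adds on top). A rough, non-optimized sketch with assumed shapes, not the repository's implementation:

import torch

def relu_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch * heads, n, head_dim)
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum('bnd,bne->bde', k, v)               # K^T V, shape (b*h, d, d_v)
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)      # normalized output, (b*h, n, d_v)

How this interacts with the layer-norm placement and residual stream of an XL-style block is exactly the architecture-dependent behavior the question is about.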

Attn Mask for Non-causal Models

We are examining non-NLP applications of the cosformer self-attention and need attention masking for the padded tokens in a batch. Is there a way to incorporate this? The code does not explicitly compute the attention weights to which masking is traditionally applied.
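One workaround that is sometimes used with linear attention (an assumption about how it could be wired in here, not an existing option of this code) is to zero out the padded positions of the keys before the contractions; every term involving a padded key then vanishes, so no explicit attention-weight matrix is needed:

import torch

def masked_linear_attention(q, k, v, key_padding_mask, eps=1e-6):
    # q, k, v: (batch, n, d) after the feature map; key_padding_mask: (batch, n), True = padding
    keep = (~key_padding_mask).unsqueeze(-1).to(k.dtype)
    k = k * keep                               # padded keys now contribute nothing
    kv = torch.einsum('bnd,bne->bde', k, v)    # K^T V with padded rows zeroed
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)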

Causal attention not working when q and kv do not have the same length

Thank you for your great work! I am currently working on a seq2seq task, and I found that the causal attention code only works when src_len and tgt_len are the same. Also, I suggest adopting EPFL's causal linear attention CUDA code to improve the speed of causal attention.
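For reference, causal linear attention is typically computed as a running prefix sum that pairs query position i with key/value positions up to i, which is where the equal-length assumption comes from. A slow but readable sketch (shapes assumed; not the suggested CUDA kernel):

import torch

def causal_linear_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, n, d), already passed through the feature map (e.g. ReLU)
    b, n, d = q.shape
    kv_state = q.new_zeros(b, d, v.size(-1))   # running sum of k_i v_i^T
    k_state = q.new_zeros(b, d)                # running sum of k_i
    outputs = []
    for i in range(n):
        kv_state = kv_state + torch.einsum('bd,be->bde', k[:, i], v[:, i])
        k_state = k_state + k[:, i]
        denom = torch.einsum('bd,bd->b', q[:, i], k_state) + eps
        outputs.append(torch.einsum('bd,bde->be', q[:, i], kv_state) / denom.unsqueeze(-1))
    return torch.stack(outputs, dim=1)          # (batch, n, d_v)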
