
infini-transformer's People

Contributors

amitportnoy, borda, dingo-actual, muditbhargava66, willtai


infini-transformer's Issues

Add multiple options for nonlinear activation

The MLPs in both transformer modules currently have ReLU hard-coded as the activation function. It would help to have options for the nonlinear activations commonly used in recent LLMs (GeLU, SwiGLU, GeGLU, etc.).
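
A minimal sketch of what that option could look like (class and argument names here are hypothetical, not the repo's API). Note that the gated variants (SwiGLU, GeGLU) need a second up-projection for the gate, so the choice affects layer shapes, not just the nonlinearity:

  import torch
  import torch.nn as nn

  class ConfigurableMLP(nn.Module):
      """Hypothetical MLP with a selectable activation."""
      def __init__(self, dim: int, hidden_dim: int, activation: str = "relu"):
          super().__init__()
          self.gated = activation in ("swiglu", "geglu")
          self.act = {"relu": nn.ReLU(), "gelu": nn.GELU(),
                      "swiglu": nn.SiLU(), "geglu": nn.GELU()}[activation]
          self.up = nn.Linear(dim, hidden_dim)
          self.gate = nn.Linear(dim, hidden_dim) if self.gated else None
          self.down = nn.Linear(hidden_dim, dim)

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          h = self.act(self.up(x))
          if self.gated:
              h = h * self.gate(x)  # SwiGLU/GeGLU: act(x @ W) * (x @ V)
          return self.down(h)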

q, k, v projection inside the loop

Isn't it unnecessary to perform the q, k, v projections inside the loop? The data copying during the calculation makes the whole attention operation slower than it needs to be. Ideally this whole thing would be done with a fused kernel … future personal project of mine.
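
For illustration, a hedged sketch of the proposed hoisting, with plain softmax attention standing in for the full Infini-attention computation: project once over the whole sequence, then take cheap per-segment views inside the loop:

  import torch

  def segmented_attention(x, w_q, w_k, w_v, segment_len):
      # One projection each over the full sequence (not one per segment) ...
      q, k, v = x @ w_q, x @ w_k, x @ w_v
      out = []
      for start in range(0, x.size(1), segment_len):
          end = start + segment_len
          # ... then per-segment slices are views, with no recomputation.
          q_s, k_s, v_s = q[:, start:end], k[:, start:end], v[:, start:end]
          scores = q_s @ k_s.transpose(-2, -1) / q_s.size(-1) ** 0.5
          out.append(torch.softmax(scores, dim=-1) @ v_s)
      return torch.cat(out, dim=1)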

Memory issues with core implementation

While running some tests, I noticed several memory inefficiencies that nullify the advantages of the Infini-Transformer. In particular:

  • The loop over segments needs to be moved out of the memory module and into the transformer module.
  • The final projection step should be performed on a per-segment basis, rather than for the full sequence (see the sketch after this list).
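
A rough sketch of how that restructuring could look. Names here are placeholders and a stock nn.MultiheadAttention stands in for the memory module; the point is that the transformer block owns the segment loop and projects each segment's output immediately, so no full-sequence intermediate has to stay alive:

  import torch
  import torch.nn as nn

  class SegmentLoopBlock(nn.Module):
      def __init__(self, dim: int, segment_len: int, n_heads: int = 4):
          super().__init__()
          self.segment_len = segment_len
          self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
          self.proj = nn.Linear(dim, dim)

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          out = []
          for start in range(0, x.size(1), self.segment_len):
              seg = x[:, start:start + self.segment_len]
              attn_out, _ = self.attn(seg, seg, seg)
              # Project per segment rather than concatenating the full
              # sequence first and projecting it in one shot.
              out.append(self.proj(attn_out))
          return torch.cat(out, dim=1)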

Add positional embeddings

Currently, the only positional embeddings that can be handled are those applied before the embeddings are passed to the transformer blocks. It would be nice to have positional embeddings that can be utilized within the transformer blocks.

Need to look through positional embedding papers to see which ones are compatible with Infini-Transformer.
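
RoPE is one likely candidate, since it is applied to q and k inside each attention block rather than at the input embedding layer. A self-contained sketch (GPT-NeoX-style rotate-half variant, not tied to this repo's code):

  import torch

  def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
      # x: (batch, seq_len, dim) with dim even; applied to q and k.
      b, t, d = x.shape
      half = d // 2
      freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
      angles = torch.arange(t, dtype=x.dtype, device=x.device)[:, None] * freqs
      cos, sin = angles.cos(), angles.sin()
      x1, x2 = x[..., :half], x[..., half:]
      return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

Inside a block, q and k would be rotated before computing attention scores; one thing to check against each paper is whether positions should count globally across segments or reset per segment.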

Handle batch input inference for MoD Infini-Former more gracefully

Currently, the token sampling for MoD Infini-Former at inference time can result in different length sequences for each observation in the batch. The current workaround is to force the batch size to one and loop through the observations in the batch, which is highly inefficient.

There are two main options for handling this efficiently:

  1. Pad the sampled sequences to the longest sequence length in such a way that the additional tokens contribute nothing to downstream calculations.
  2. Wait for PyTorch to implement a ragged tensor type.

I'm likely to pursue the first because there's no telling how long it'll be before the PyTorch devs add ragged tensors.
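
A minimal sketch of option 1 (the helper name is hypothetical): right-pad each sampled sequence to the batch maximum and return a mask so padded positions can be zeroed out of, or masked away from, downstream calculations:

  import torch

  def pad_sampled_sequences(seqs):
      # seqs: list of (len_i, dim) tensors with varying len_i.
      max_len = max(s.size(0) for s in seqs)
      padded = seqs[0].new_zeros(len(seqs), max_len, seqs[0].size(1))
      mask = torch.zeros(len(seqs), max_len, dtype=torch.bool)
      for i, s in enumerate(seqs):
          padded[i, : s.size(0)] = s
          mask[i, : s.size(0)] = True
      # Downstream: multiply activations by mask.unsqueeze(-1), or pass
      # ~mask as a key padding mask, so pad tokens contribute nothing.
      return padded, mask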

Memory savings?

Great implementation! I have a couple of questions:
Does your implementation have O(N) complexity like the paper's? What kind of memory savings should I expect compared to classic SDPA attention? It would be great if you could give me some numbers on the VRAM usage during your training test compared to classic SDPA.
Thanks!
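
As a hedged sketch of how one could measure this: record peak allocated VRAM for a single training step, once with Infini-attention and once with classic SDPA, and compare:

  import torch

  def peak_vram_mb(step_fn) -> float:
      # step_fn: a closure running one forward/backward pass.
      torch.cuda.reset_peak_memory_stats()
      step_fn()
      torch.cuda.synchronize()
      return torch.cuda.max_memory_allocated() / 2**20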

Update readme to include MoD Infini-Former instructions

There are a couple of changes that need to be made to the training loop when using MoD Infini-Former, as well as a minor change to the .forward() method of the Module that contains MoD Infini-Former layers.

Need to put example usage in the readme.

inference issue

Any thoughts on this exception during inference?

  File "/workspace/voltronformers/src/voltronformer/model.py", line 303, in forward                               
    output = residual + self.attn(h, position_ids=position_ids)[0]                                                                                                                                                                  
modules/module.py", line 1520, in _call_impl    tor h/nn/modules/module.py",eline 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)                                                                       
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl             
    return forward_call(*args, **kwargs)                                                                                                                                                                                            
  File "/workspace/voltronformers/src/voltronformer/infini_attention.py", line 109, in forward                                                                                                                                      
    return torch.concat(out, dim=1)                                                                                                                                                                                                 
RuntimeError: torch.cat(): expected a non-empty list of Tensors                                                   
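
A hedged guess at the cause (the repo code isn't shown here): if the loop over segments uses floor division, a prompt shorter than one segment yields zero iterations, so the list handed to torch.concat stays empty. A sketch of the suspected fix, where ceil division keeps at least one, possibly partial, segment:

  import math
  import torch

  def split_segments(x: torch.Tensor, segment_len: int):
      # x: (batch, seq_len, dim); seq_len may be < segment_len at inference.
      n_segments = math.ceil(x.size(1) / segment_len)  # not floor division
      return [x[:, i * segment_len:(i + 1) * segment_len] for i in range(n_segments)]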
