
iLLaMA


Image credit: DALL·E

This is a PyTorch implementation of iLLaMA, proposed in our paper "Adapting LLaMA Decoder to Vision Transformer".

Figure 1: Left: iLLaMA architecture. Right: our design roadmap. Colored and gray bars represent the results of the tiny and base regimes, and the red line depicts the training loss of the tiny regime. iLLaMA strives to process visual tokens using standard LLaMA components, e.g., causal self-attention. The proposed PS [cls] and soft mask strategies help overcome training challenges.


Figure 2: (a) mask in causal self-attention; (b) mask in causal self-attention with our post-sequence class token (PS [cls]) method; (c) modified causal mask.
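The idea behind PS [cls] can be illustrated with a small sketch (pure Python, our own illustration rather than the repo's code): under a causal mask, a class token placed first (ViT-style) can attend only to itself, whereas appending it after the patch sequence lets it attend to every token.

```python
def causal_mask(n):
    """1 = token i may attend to token j; causal means j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# 3 patch tokens + 1 [cls] token
n = 4
m = causal_mask(n)

# [cls] placed first sees only itself under a causal mask,
# while [cls] placed last (PS [cls]) sees every token:
cls_first_visible = sum(m[0])      # only itself
cls_last_visible = sum(m[n - 1])   # all n tokens
```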


Figure 3: (a) the soft mask gradually transitions from a bi-directional mask into a causal mask during training, following a constant or linear schedule. (b) Ablation results of training loss and test accuracy.

Requirements

PyTorch and timm 0.5.4 (pip install timm==0.5.4).

Data preparation: arrange ImageNet in the following folder structure; you can extract ImageNet using this script.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
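A quick sanity check of the layout above can be written in a few lines. This is our own pure-Python sketch (the helper name `check_imagenet_layout` is ours); it only verifies that both splits exist and contain per-class subdirectories:

```python
from pathlib import Path

def check_imagenet_layout(root):
    """Return True if `root` contains train/ and val/ directories,
    each holding at least one per-class (synset) subdirectory."""
    root = Path(root)
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            return False
        class_dirs = [d for d in split_dir.iterdir() if d.is_dir()]
        if not class_dirs:
            return False
    return True
```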

Models

iLLaMA on ImageNet-1K

| Model | Pre-trained dataset | Resolution | Params | MACs | Top-1 Acc (%) |
|---|---|---|---|---|---|
| illama_tiny | - | 224 | 5.7M | 1.3G | 75.0 |
| illama_small | - | 224 | 21.9M | 4.6G | 79.9 |
| illama_base | - | 224 | 86.3M | 17.6G | 81.6 |
| illama_base | - | 384 | 86.3M | 55.5G | 83.0 |
| illama_base | ImageNet-21K | 224 | 86.3M | 17.6G | 83.6 |
| illama_base | ImageNet-21K | 384 | 86.3M | 55.5G | 85.0 |
| illama_large | ImageNet-21K | 224 | 310.2M | 62.8G | 84.8 |
| illama_large | ImageNet-21K | 384 | 310.2M | 194.7G | 86.0 |

Evaluate

To evaluate models on 224 resolution, run:

MODEL=illama_tiny
RESUME='/your/path/to/model.pth'
root_imagenet='/your/path/to/imagenet'

python -m torch.distributed.launch --nproc_per_node=2 main.py \
    --model $MODEL --eval true \
    --data_path $root_imagenet \
    --resume $RESUME

To evaluate models on 384 resolution, run:

MODEL=illama_base
RESUME='/your/path/to/model.pth'
root_imagenet='/your/path/to/imagenet'

python -m torch.distributed.launch --nproc_per_node=2 main_soft_fthr.py \
    --model $MODEL --input_size 384 --eval true \
    --data_path $root_imagenet \
    --resume $RESUME

Train

We use a batch size of 4096 by default, distributed over 8 GPUs.

bash scripts/train_illama_tiny_in1k.sh

Training scripts of other models are shown in scripts.

Initialization Using LLaMA2-7B (Optional)

We use the weight selection method to select weights from LLaMA2-7B.

python llama2/weight_selection.py
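The weight selection idea can be sketched as: pick a uniform subset of the teacher's layers, then take the leading slice of each weight dimension to fit the student's width. This is a simplified pure-Python illustration with hypothetical names (`select_layers`, `select_weight`), not the repo's `weight_selection.py`:

```python
def select_layers(teacher_layers, student_depth):
    """Uniformly pick `student_depth` layers out of the teacher's stack."""
    n = len(teacher_layers)
    idx = [round(i * (n - 1) / (student_depth - 1))
           for i in range(student_depth)]
    return [teacher_layers[i] for i in idx]

def select_weight(matrix, rows, cols):
    """Take the leading rows/cols of a (nested-list) weight matrix so it
    matches the student's smaller embedding dimensions."""
    return [row[:cols] for row in matrix[:rows]]
```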

Then we use the selected weights to initialize our iLLaMA-T/S/B.

bash scripts/train_illama_tiny_from_llama2.sh

Training scripts of other models are shown in scripts.

Bibtex

@article{wang2024adapting,
  title={Adapting LLaMA Decoder to Vision Transformer},
  author={Wang, Jiahao and Shao, Wenqi and Chen, Mengzhao and Wu, Chengyue and Liu, Yong and Zhang, Kaipeng and Zhang, Songyang and Chen, Kai and Luo, Ping},
  journal={arXiv preprint arXiv:2404.06773},
  year={2024}
}

Acknowledgment

Our implementation is based on pytorch-image-models, llama, dropout, ConvNeXt, weight-selection, and MambaOut.
