Global Context Vision Transformer (GC ViT)

This repository presents the official PyTorch implementation of Global Context Vision Transformers (ICML 2023).

Global Context Vision Transformers
Ali Hatamizadeh, Hongxu (Danny) Yin, Greg Heinrich, Jan Kautz, and Pavlo Molchanov.

GC ViT achieves state-of-the-art results across image classification, object detection, and semantic segmentation tasks. On the ImageNet-1K dataset for classification, GC ViT variants with 51M, 90M, and 201M parameters achieve 84.3%, 85.0%, and 85.7% Top-1 accuracy, respectively, surpassing comparably sized prior art such as the CNN-based ConvNeXt and the ViT-based Swin Transformer.

The architecture of GC ViT is shown in the following figure:

[Figure: GC ViT architecture (gc_vit)]

💥 News 💥

  • [10.14.2023] 🔥 We have released the object detection code!
  • [07.27.2023] We will present GC ViT in the (1:30-3:30 HDT) ICML23 session in exhibit hall #1, poster #516.
  • [07.22.2023] 🔥🔥 We have released pretrained 21K GC ViT-L checkpoint for 512 x 512 resolution!
  • [07.22.2023] Pretrained checkpoints are now available in official NVIDIA GCViT HuggingFace page!
  • [07.21.2023] 🔥 We have released the object detection/instance segmentation code!
  • [05.21.2023] 🔥 We have released ImageNet-21K fine-tuned GC ViT model weights for 224x224 and 384x384.
  • [05.21.2023] 🔥🔥 We have released new ImageNet-1K GC ViT model weights with better performance!
  • [04.24.2023] 🔥🔥🔥 GC ViT has been accepted to ICML 2023!

Introduction

GC ViT combines global context self-attention modules with local self-attention to model both long- and short-range spatial interactions effectively and efficiently, without the need for expensive operations such as computing attention masks or shifting local windows.
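As a rough illustration of the idea, the sketch below contrasts local window attention with attention driven by a shared global query. It is a simplified NumPy toy, not the repository's code: it assumes a single head, identity projections, no position bias, and plain average pooling as a stand-in for the learned global query generator. All function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(feat, w):
    """Split an (H, W, C) map into (num_windows, w*w, C) non-overlapping windows."""
    H, W, C = feat.shape
    x = feat.reshape(H // w, w, W // w, w, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, w * w, C)

def attention(q, k, v):
    """Scaled dot-product attention applied independently to each window."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(0, 2, 1)) * scale  # (num_windows, w*w, w*w)
    return softmax(scores, axis=-1) @ v

def global_query(feat, w):
    """Assumption: average-pool the whole map to a w x w grid of query tokens."""
    H, W, C = feat.shape
    pooled = feat.reshape(w, H // w, w, W // w, C).mean(axis=(1, 3))
    return pooled.reshape(1, w * w, C)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))  # toy feature map: H=8, W=8, C=16
w = 4                                   # window size

windows = window_partition(feat, w)     # (4, 16, 16)

# Local attention: queries, keys, and values all come from the same window.
local_out = attention(windows, windows, windows)

# Global attention: one query set, computed from the whole map, is shared by
# every window and attends to that window's local keys and values.
q_g = np.broadcast_to(global_query(feat, w), windows.shape)
global_out = attention(q_g, windows, windows)

print(local_out.shape, global_out.shape)
```

Because the global query is computed once from the full feature map, every window sees information from beyond its own borders without any window shifting or attention masking.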

ImageNet Benchmarks

ImageNet-1K Pretrained Models

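| Model Variant | Acc@1 | #Params (M) | FLOPs (G) | Download |
|---------------|-------|-------------|-----------|----------|
| GC ViT-XXT    | 79.9  | 12          | 2.1       | model    |
| GC ViT-XT     | 82.0  | 20          | 2.6       | model    |
| GC ViT-T      | 83.5  | 28          | 4.7       | model    |
| GC ViT-T2     | 83.7  | 34          | 5.5       | model    |
| GC ViT-S      | 84.3  | 51          | 8.5       | model    |
| GC ViT-S2     | 84.8  | 68          | 10.7      | model    |
| GC ViT-B      | 85.0  | 90          | 14.8      | model    |
| GC ViT-L      | 85.7  | 201         | 32.6      | model    |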
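
When choosing among these variants, one common approach is to pick the most accurate model that fits a parameter or FLOPs budget. The helper below is a small sketch of that selection. The numbers are copied from the table above; the function name and the tuple layout are illustrative assumptions, not part of the repository.

```python
# (variant, Acc@1 in %, parameters in millions, FLOPs in G), copied from the table above.
IMAGENET1K_VARIANTS = [
    ("GC ViT-XXT", 79.9, 12, 2.1),
    ("GC ViT-XT", 82.0, 20, 2.6),
    ("GC ViT-T", 83.5, 28, 4.7),
    ("GC ViT-T2", 83.7, 34, 5.5),
    ("GC ViT-S", 84.3, 51, 8.5),
    ("GC ViT-S2", 84.8, 68, 10.7),
    ("GC ViT-B", 85.0, 90, 14.8),
    ("GC ViT-L", 85.7, 201, 32.6),
]

def best_under_budget(max_params_m=None, max_flops_g=None):
    """Return the most accurate variant within the given budgets, or None if none fits."""
    candidates = [
        v for v in IMAGENET1K_VARIANTS
        if (max_params_m is None or v[2] <= max_params_m)
        and (max_flops_g is None or v[3] <= max_flops_g)
    ]
    return max(candidates, key=lambda v: v[1], default=None)

print(best_under_budget(max_params_m=30))
print(best_under_budget(max_params_m=100, max_flops_g=10))
```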

ImageNet-21K Pretrained Models

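| Model Variant | Resolution | Acc@1 | #Params (M) | FLOPs (G) | Download |
|---------------|------------|-------|-------------|-----------|----------|
| GC ViT-L      | 224 x 224  | 86.6  | 201         | 32.6      | model    |
| GC ViT-L      | 384 x 384  | 87.4  | 201         | 120.4     | model    |
| GC ViT-L      | 512 x 512  | 87.6  | 201         | 245.0     | model    |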

Installation

The dependencies can be installed by running:

pip install -r requirements.txt

Data Preparation

Please download the ImageNet dataset from its official website. The training and validation images need to have sub-folders for each class with the following structure:

  imagenet
  ├── train
  │   ├── class1
  │   │   ├── img1.jpeg
  │   │   ├── img2.jpeg
  │   │   └── ...
  │   ├── class2
  │   │   ├── img3.jpeg
  │   │   └── ...
  │   └── ...
  └── val
      ├── class1
      │   ├── img4.jpeg
      │   ├── img5.jpeg
      │   └── ...
      ├── class2
      │   ├── img6.jpeg
      │   └── ...
      └── ...
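
Before launching a long training run, it can help to confirm that the dataset folder matches this layout. The following is a minimal standard-library sketch of such a check. The function names and the set of accepted image suffixes are assumptions made for illustration; the repository does not ship this script.

```python
from pathlib import Path

# Assumption: files with these suffixes count as images.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}

def summarize_split(split_dir):
    """Return {class_name: image_count} for one split directory (e.g. train)."""
    split_dir = Path(split_dir)
    counts = {}
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1 for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts

def check_imagenet_layout(root):
    """Return a list of human-readable problems; an empty list means the layout looks fine."""
    root = Path(root)
    problems = []
    summaries = {}
    for split in ("train", "val"):
        split_dir = root / split
        if not split_dir.is_dir():
            problems.append(f"missing split directory: {split_dir}")
            continue
        summaries[split] = summarize_split(split_dir)
        empty = [name for name, n in summaries[split].items() if n == 0]
        if empty:
            problems.append(f"{split}: classes with no images: {empty}")
    # Both splits should expose the same set of class sub-folders.
    if len(summaries) == 2 and set(summaries["train"]) != set(summaries["val"]):
        problems.append("train and val contain different class folders")
    return problems
```

Calling `check_imagenet_layout("<imagenet-path>")` returns an empty list when both splits exist, every class folder contains at least one image, and the two splits share the same class names.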
 

Commands

Training on ImageNet-1K From Scratch (Multi-GPU)

The GC ViT model can be trained on the ImageNet-1K dataset by running:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus> --master_port 11223 train.py \
--config <config-file> --data_dir <imagenet-path> --batch-size <batch-size-per-gpu> --amp --tag <run-tag> --model-ema

To resume training from a pre-trained checkpoint:

python -m torch.distributed.launch --nproc_per_node <num-of-gpus> --master_port 11223 train.py \
--resume <checkpoint-path> --config <config-file> --amp --data_dir <imagenet-path> --batch-size <batch-size-per-gpu> --tag <run-tag> --model-ema
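
For scripted experiments, the launch commands above can be assembled as an argument list so that every flag sits directly before its value. This is a small sketch using only the flags shown in this section. The helper name `build_train_command` and the example paths are hypothetical.

```python
import shlex

def build_train_command(config, data_dir, batch_size, tag, num_gpus,
                        resume=None, master_port=11223):
    """Assemble the train.py launch command as a list suitable for subprocess.run."""
    cmd = [
        "python", "-m", "torch.distributed.launch",
        "--nproc_per_node", str(num_gpus),
        "--master_port", str(master_port),
        "train.py",
    ]
    if resume is not None:
        # Only added when continuing from an existing checkpoint.
        cmd += ["--resume", str(resume)]
    cmd += [
        "--config", str(config),
        "--data_dir", str(data_dir),
        "--batch-size", str(batch_size),  # per-GPU batch size
        "--amp",
        "--tag", str(tag),
        "--model-ema",
    ]
    return cmd

# Hypothetical values, shown only to illustrate the resulting command line.
cmd = build_train_command("my_config.yml", "/data/imagenet", 64, "demo-run", num_gpus=8)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems and the line-continuation pitfalls of a hand-typed multi-line command.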

Evaluation

To evaluate a pre-trained checkpoint on the ImageNet-1K validation set using a single GPU:

python validate.py --model <model-name> --checkpoint <checkpoint-path> --data_dir <imagenet-path> --batch-size <batch-size-per-gpu>
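
The Acc@1 figures reported in the benchmark tables are top-1 accuracies: the percentage of validation images whose highest-scoring class matches the ground-truth label. The pure-Python sketch below illustrates that definition on toy data. It is not taken from validate.py, and the function name is an assumption.

```python
def top1_accuracy(scores, labels):
    """Percentage of samples whose highest-scoring class index equals the label.

    scores: list of per-class score lists, one per sample.
    labels: list of integer class indices, one per sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += int(predicted == label)
    return 100.0 * correct / len(labels)

# Toy example: three samples, two classes; the third prediction is wrong.
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
toy_labels = [1, 0, 0]
print(round(top1_accuracy(toy_scores, toy_labels), 2))
```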

Citation

Please consider citing the GC ViT paper if it is useful for your work:

@inproceedings{hatamizadeh2023global,
  title={Global context vision transformers},
  author={Hatamizadeh, Ali and Yin, Hongxu and Heinrich, Greg and Kautz, Jan and Molchanov, Pavlo},
  booktitle={International Conference on Machine Learning},
  pages={12633--12646},
  year={2023},
  organization={PMLR}
}

Third-party Implementations and Resources

In this section, we list third-party contributions by other users. If you would like to have your work included here, please raise an issue in this repository.

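| Name                      | Link | Contributor    | Framework              |
|---------------------------|------|----------------|------------------------|
| timm                      | Link | @rwightman     | PyTorch                |
| tfgcvit                   | Link | @shkarupa-alex | TensorFlow 2.0 (Keras) |
| gcvit-tf                  | Link | @awsaf49       | TensorFlow 2.0 (Keras) |
| GCViT-TensorFlow          | Link | @EMalagoli92   | TensorFlow 2.0 (Keras) |
| keras_cv_attention_models | Link | @leondgarse    | Keras                  |
| flaim                     | Link | @BobMcDear     | JAX/Flax               |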

Additional Resources

In this section, we list additional GC ViT resources such as notebooks, demos, and paper explanations. If you have created similar items and would like them to be included, please raise an issue in this repository.

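| Name              | Link | Contributor | Note                  |
|-------------------|------|-------------|-----------------------|
| Paper Explanation | Link | @awsaf49    | Annotated GC ViT      |
| Colab Notebook    | Link | @awsaf49    | Flower classification |
| Kaggle Notebook   | Link | @awsaf49    | Flower classification |
| Live Demo         | Link | @awsaf49    | Hugging Face demo     |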

Licenses

Copyright © 2023, NVIDIA Corporation. All rights reserved.

This work is made available under the NVIDIA Source Code License-NC. Click here to view a copy of this license.

The pre-trained models are shared under CC-BY-NC-SA-4.0. If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

For license information regarding timm, please refer to its repository.

For license information regarding the ImageNet dataset, please refer to the ImageNet official website.

Acknowledgement

  • This repository is built upon the timm library.

  • We would like to sincerely thank the community, especially GitHub users @rwightman, @shkarupa-alex, @awsaf49, and @leondgarse, whose insightful feedback has helped us further improve GC ViT and achieve even better benchmarks.
