
Next-ViT

This repo is the official implementation of "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios". This algorithm is proposed by the ByteDance Intelligent Creation AutoML team (字节跳动-智能创作 AutoML团队).

Updates

08/16/2022

  1. Pretrained models on a large-scale dataset following [SSLD] are provided.
  2. Segmentation results with the large-scale-dataset pretrained models are also presented.

Overview

Figure 1. The overall hierarchical architecture of Next-ViT.

Introduction

Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g., TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed to infer as fast as CNNs and perform as powerfully as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet their overall performance is far from satisfactory. To this end, we propose a next-generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of the latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are respectively developed to capture local and global information with deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance in various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves comparable performance to CSWin, while the inference speed is accelerated by 3.6×. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency.

Figure 2. Comparison between Next-ViT and other efficient networks in terms of the accuracy-latency trade-off.

Usage

First, clone the repository locally:

git clone https://github.com/bytedance/Next-ViT.git

Then, install torch==1.10.0, mmcv-full==1.5.0, timm==0.4.9, etc.:

pip3 install -r requirements.txt

Data preparation

Download and extract the ImageNet train and val images from http://image-net.org/. The directory structure is the standard layout expected by torchvision's datasets.ImageFolder; the training and validation data are expected to be in the train/ and val/ folders, respectively:

/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
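
For reference, this layout can be consumed with torchvision's datasets.ImageFolder, where each sub-folder name becomes a class label. A minimal loading sketch (the transforms here are illustrative; the training script defines its own augmentation pipeline):

import torch
from torchvision import datasets, transforms

# Illustrative transforms only; classification/ defines the actual training pipeline.
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Each sub-folder of train/ (class1, class2, ...) is treated as one class.
train_set = datasets.ImageFolder("/path/to/imagenet/train", transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True, num_workers=8)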

Image Classification

We provide a series of Next-ViT models pretrained on the ILSVRC2012 ImageNet-1K dataset. More details can be found in the [paper].

| Model | Dataset | Resolution | FLOPs (G) | Params (M) | TensorRT Latency (ms) | CoreML Latency (ms) | Acc@1 | ckpt | log |
|---|---|---|---|---|---|---|---|---|---|
| Next-ViT-S | ImageNet-1K | 224 | 5.8 | 31.7 | 7.7 | 3.5 | 82.5 | ckpt | log |
| Next-ViT-B | ImageNet-1K | 224 | 8.3 | 44.8 | 10.5 | 4.5 | 83.2 | ckpt | log |
| Next-ViT-L | ImageNet-1K | 224 | 10.8 | 57.8 | 13.0 | 5.5 | 83.6 | ckpt | log |
| Next-ViT-S | ImageNet-1K | 384 | 17.3 | 31.7 | 21.6 | 8.9 | 83.6 | ckpt | log |
| Next-ViT-B | ImageNet-1K | 384 | 24.6 | 44.8 | 29.6 | 12.4 | 84.3 | ckpt | log |
| Next-ViT-L | ImageNet-1K | 384 | 32.0 | 57.8 | 36.0 | 15.2 | 84.7 | ckpt | log |

We also provide a series of Next-ViT models pretrained on a large-scale dataset following [SSLD]. More details can be found in the [paper].

| Model | Dataset | Resolution | FLOPs (G) | Params (M) | TensorRT Latency (ms) | CoreML Latency (ms) | Acc@1 | ckpt |
|---|---|---|---|---|---|---|---|---|
| Next-ViT-S | ImageNet-1K-6M | 224 | 5.8 | 31.7 | 7.7 | 3.5 | 84.8 | ckpt |
| Next-ViT-B | ImageNet-1K-6M | 224 | 8.3 | 44.8 | 10.5 | 4.5 | 85.1 | ckpt |
| Next-ViT-L | ImageNet-1K-6M | 224 | 10.8 | 57.8 | 13.0 | 5.5 | 85.4 | ckpt |
| Next-ViT-S | ImageNet-1K-6M | 384 | 17.3 | 31.7 | 21.6 | 8.9 | 85.8 | ckpt |
| Next-ViT-B | ImageNet-1K-6M | 384 | 24.6 | 44.8 | 29.6 | 12.4 | 86.1 | ckpt |
| Next-ViT-L | ImageNet-1K-6M | 384 | 32.0 | 57.8 | 36.0 | 15.2 | 86.4 | ckpt |
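
As a rough sketch of how a downloaded classification checkpoint can be loaded for inference, assuming the architectures in classification/nextvit.py are registered with timm under the names passed to --model (the import path and the checkpoint's "model" key are assumptions; check the repository code):

# Sketch only, not an official snippet; names and checkpoint layout are assumptions.
import torch
import timm
import nextvit  # assumed import when run from classification/, registers the models with timm

model = timm.create_model("nextvit_small", num_classes=1000)
ckpt = torch.load("../checkpoints/nextvit_small_in1k_224.pth", map_location="cpu")
model.load_state_dict(ckpt.get("model", ckpt), strict=False)  # many DeiT-style ckpts wrap weights under "model"
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)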

Training

To train Next-ViT-S on ImageNet using 8 gpus for 300 epochs, run:

cd classification/
bash train.sh 8 --model nextvit_small --batch-size 256 --lr 5e-4 --warmup-epochs 20 --weight-decay 0.1 --data-path your_imagenet_path

To finetune Next-ViT-S with a 384x384 input size for 30 epochs, run:

cd classification/
bash train.sh 8 --model nextvit_small --batch-size 128 --lr 5e-6 --warmup-epochs 0 --weight-decay 1e-8 --epochs 30 --sched step --decay-epochs 60 --input-size 384 --resume ../checkpoints/nextvit_small_in1k_224.pth --finetune --data-path your_imagenet_path 

Evaluation

To evaluate the performance of Next-ViT-S on ImageNet using 8 gpus, run:

cd classification/
bash train.sh 8 --model nextvit_small --batch-size 256 --lr 5e-4 --warmup-epochs 20 --weight-decay 0.1 --data-path your_imagenet_path --resume ../checkpoints/nextvit_small_in1k_224.pth --eval

Detection

Our code is based on mmdetection; please install mmdetection==2.23.0. Next-ViT serves as a strong backbone for Mask R-CNN. It is easy to apply Next-ViT to other detectors provided by mmdetection based on our examples, as sketched below. More details can be found in the [paper].
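
For illustration, a hypothetical mmdetection 2.x config fragment that plugs the Next-ViT backbone into another detector; the backbone type string and its fields are assumptions, so refer to detection/configs/mask_rcnn_nextvit_small_1x.py for the exact keys used in this repo:

# Hypothetical sketch of an mmdetection 2.x config; backbone keys are assumptions.
_base_ = ['./retinanet_r50_fpn_1x_coco.py']  # any mmdetection base config

model = dict(
    backbone=dict(
        _delete_=True,             # drop the ResNet backbone defined in the base config
        type='nextvit_small',      # assumed name under which the backbone is registered
        # pretrained weights / frozen stages etc. go here, mirroring the provided configs
    ),
    # neck=dict(in_channels=[...]) must match the per-stage output channels of Next-ViT
)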

Mask R-CNN

| Backbone | Pretrained | Lr Schd | Params (M) | FLOPs (G) | TensorRT Latency (ms) | CoreML Latency (ms) | bbox mAP | mask mAP | ckpt | log |
|---|---|---|---|---|---|---|---|---|---|---|
| Next-ViT-S | ImageNet-1K | 1x | 51.8 | 290 | 38.2 | 18.1 | 45.9 | 41.8 | ckpt | log |
| Next-ViT-S | ImageNet-1K | 3x | 51.8 | 290 | 38.2 | 18.1 | 48.0 | 43.2 | ckpt | log |
| Next-ViT-B | ImageNet-1K | 1x | 64.9 | 340 | 51.6 | 24.4 | 47.2 | 42.8 | ckpt | log |
| Next-ViT-B | ImageNet-1K | 3x | 64.9 | 340 | 51.6 | 24.4 | 49.5 | 44.4 | ckpt | log |
| Next-ViT-L | ImageNet-1K | 1x | 77.9 | 391 | 65.3 | 30.1 | 48.0 | 43.2 | ckpt | log |
| Next-ViT-L | ImageNet-1K | 3x | 77.9 | 391 | 65.3 | 30.1 | 50.2 | 44.8 | ckpt | log |

Training

To train Mask R-CNN with Next-ViT-S backbone using 8 gpus, run:

cd detection/
PORT=29501 bash dist_train.sh configs/mask_rcnn_nextvit_small_1x.py 8

Evaluation

To evaluate Mask R-CNN with Next-ViT-S backbone using 8 gpus, run:

cd detection/
PORT=29501 bash dist_test.sh configs/mask_rcnn_nextvit_small_1x.py ../checkpoints/mask_rcnn_1x_nextvit_small.pth 8 --eval bbox

Semantic Segmentation

Our code is based on mmsegmentation; please install mmsegmentation==0.23.0. Next-ViT serves as a strong backbone for segmentation tasks on the ADE20K dataset. It is easy to extend it to other datasets and segmentation methods. More details can be found in the [paper].

Semantic FPN 80k

| Backbone | Pretrained | FLOPs (G) | Params (M) | TensorRT Latency (ms) | CoreML Latency (ms) | mIoU | ckpt | log |
|---|---|---|---|---|---|---|---|---|
| Next-ViT-S | ImageNet-1K | 208 | 36.3 | 38.2 | 18.1 | 46.5 | ckpt | log |
| Next-ViT-B | ImageNet-1K | 260 | 49.3 | 51.6 | 24.4 | 48.6 | ckpt | log |
| Next-ViT-L | ImageNet-1K | 331 | 62.4 | 65.3 | 30.1 | 49.1 | ckpt | log |
| Next-ViT-S | ImageNet-1K-6M | 208 | 36.3 | 38.2 | 18.1 | 48.8 | ckpt | log |
| Next-ViT-B | ImageNet-1K-6M | 260 | 49.3 | 51.6 | 24.4 | 50.2 | ckpt | log |
| Next-ViT-L | ImageNet-1K-6M | 331 | 62.4 | 65.3 | 30.1 | 50.5 | ckpt | log |

UperNet 160k

| Backbone | Pretrained | FLOPs (G) | Params (M) | TensorRT Latency (ms) | CoreML Latency (ms) | mIoU (ss/ms) | ckpt | log |
|---|---|---|---|---|---|---|---|---|
| Next-ViT-S | ImageNet-1K | 968 | 66.3 | 38.2 | 18.1 | 48.1/49.0 | ckpt | log |
| Next-ViT-B | ImageNet-1K | 1020 | 79.3 | 51.6 | 24.4 | 50.4/51.1 | ckpt | log |
| Next-ViT-L | ImageNet-1K | 1072 | 92.4 | 65.3 | 30.1 | 50.1/50.8 | ckpt | log |
| Next-ViT-S | ImageNet-1K-6M | 968 | 66.3 | 38.2 | 18.1 | 49.8/50.8 | ckpt | log |
| Next-ViT-B | ImageNet-1K-6M | 1020 | 79.3 | 51.6 | 24.4 | 51.8/52.8 | ckpt | log |
| Next-ViT-L | ImageNet-1K-6M | 1072 | 92.4 | 65.3 | 30.1 | 51.5/52.0 | ckpt | log |

Training

To train Semantic FPN 80k with Next-ViT-S backbone using 8 gpus, run:

cd segmentation/
PORT=29501 bash dist_train.sh configs/fpn_512_nextvit_small_80k.py 8

Evaluation

To evaluate Semantic FPN 80k (single scale) with Next-ViT-S backbone using 8 gpus, run:

cd segmentation/
PORT=29501 bash dist_test.sh configs/fpn_512_nextvit_small_80k.py ../checkpoints/fpn_80k_nextvit_small.pth 8 --eval mIoU

Deployment and Latency Measurement

We provide scripts to convert Next-ViT from a PyTorch model to a CoreML model and a TensorRT engine.

CoreML

To convert Next-ViT-S to a CoreML model with coremltools==5.2.0, run:

cd deployment/
python3 export_coreml_model.py --model nextvit_small --batch-size 1 --image-size 224

| Backbone | Resolution | FLOPs (G) | CoreML Latency (ms) | CoreML Model |
|---|---|---|---|---|
| Next-ViT-S | 224 | 5.8 | 3.5 | mlmodel |
| Next-ViT-B | 224 | 8.3 | 4.5 | mlmodel |
| Next-ViT-L | 224 | 10.8 | 5.5 | mlmodel |
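
Under the hood, such a conversion typically traces the PyTorch model and hands it to coremltools. A minimal sketch (not the repo's export_coreml_model.py, which may differ in preprocessing, input naming, and model registration):

# Minimal conversion sketch; coremltools' ct.convert accepts a traced TorchScript module.
import torch
import coremltools as ct
import timm
import nextvit  # assumed import when run from classification/, registers the models with timm

model = timm.create_model("nextvit_small").eval()
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(traced, inputs=[ct.TensorType(name="image", shape=example.shape)])
mlmodel.save("nextvit_small_224.mlmodel")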

We uniformly benchmark CoreML latency on an iPhone 12 Pro Max (iOS 16.0) with Xcode 14.0. The performance report of a CoreML model can be generated directly with Xcode 14.0 (a new feature of Xcode 14.0).

Figure 3. CoreML latency of Next-ViT-S/B/L.

TensorRT

To convert Next-ViT-S to a TensorRT engine with tensorrt==8.0.3.4, run:

cd deployment/
python3 export_tensorrt_engine.py --model nextvit_small --batch-size 8  --image-size 224 --datatype fp16 --profile True --trtexec-path /usr/bin/trtexec
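
If you prefer to drive TensorRT manually, the usual route is to export an ONNX graph first and then build an engine with trtexec. A minimal ONNX export sketch (the repo's export_tensorrt_engine.py may handle opset, dynamic axes, and input shapes differently):

# Minimal ONNX export sketch; deployment/export_tensorrt_engine.py may differ.
import torch
import timm
import nextvit  # assumed import when run from classification/, registers the models with timm

model = timm.create_model("nextvit_small").eval()
dummy = torch.randn(8, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "nextvit_small_224.onnx",
    input_names=["input"],
    output_names=["logits"],
    opset_version=13,
)
# The resulting .onnx can then be built into an engine, e.g.:
#   trtexec --onnx=nextvit_small_224.onnx --saveEngine=nextvit_small_224.trt --fp16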

Citation

If you find this project useful in your research, please consider citing:

@article{li2022next,
  title={Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios},
  author={Li, Jiashi and Xia, Xin and Li, Wei and Li, Huixia and Wang, Xing and Xiao, Xuefeng and Wang, Rui and Zheng, Min and Pan, Xin},
  journal={arXiv preprint arXiv:2207.05501},
  year={2022}
}

Acknowledgement

We heavily borrow the code from Twins.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.

next-vit's Issues

Single-GPU training

Current training command: CUDA_VISIBLE_DEVICES=0 bash train.sh 1 --model nextvit_small --batch-size 8 --lr 5e-4 --warmup-epochs 20 --weight-decay 0.1 --data-path ./data
At model.to(devices) I get the error: RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 556202) of binary: /opt/conda/envs/NextVit/bin/python3
What could be the cause of this?

out_channels error

When I run the code I get the following error, can you help?

File "/content/drive/MyDrive/UNINEXT/projects/UNINEXT/uninext/backbone/nextvit.py", line 144, in init
assert out_channels % head_dim == 0
TypeError: unsupported operand type(s) for %: 'list' and 'int'

Error in conversion to ONNX

I used Next-ViT as the backbone for training a face embedding model and the accuracy was pretty good. But when I tried to convert it to ONNX and deploy it on an edge device, I got the following error:

einops.py:314: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  known = {axis for axis in composite_axis if axis_name2known_length[axis] != _unknown_axis_length}

This error prevents me from running inference with the model and I don't know how to fix it. Can anyone help me with this? Thanks :D

Question about the spatial reduction ratios of E-MHSA described in the paper

Hello!

I have been reading your paper recently. In Section 3.5 you state that the spatial reduction ratios of E-MHSA in the different stages are 8, 4, 2, 1. However, as I understand the paper, stage 1 contains no E-MHSA module (only NCBs, no NTBs). What, then, does the reduction ratio of 8 refer to? Or have I misunderstood something in the paper? I would appreciate a clarification.
Many thanks.


Hosting model ckpts on Hugging Face

Hello 👋 I came across your model in Kaggle competitions and was thinking it would be great if it were hosted on the Hugging Face Hub model repositories, as it addresses many needs around model hosting, and if you feel like it you can also wrap your model with this mixin to make it easily loadable. What do you think?

Patch embedding in each NCB and NTB?

Thank you for sharing a nice architecture.

I'm curious why you adopt the patch embedding function for every NCB as well as every NTB, i.e. the code line

self.patch_embed = PatchEmbed(in_channels, out_channels, stride)

because this looks slightly different from the figure in the paper, which adds a patch embedding once per stage (not for each NCB and NTB).

no relu before the global pool?

In the classification model in nextvit.py:

Why is there no ReLU before the global pool? The batch norm will produce both positive and negative values.



    def forward(self, x):
        x = self.stem(x)
        for idx, layer in enumerate(self.features):
            if self.use_checkpoint:
                x = checkpoint.checkpoint(layer, x)
            else:
                x = layer(x)
        x = self.norm(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.proj_head(x)
        return x


Please provide CoreML Segmentation pretrained models

This is a request to provide CoreML segmentation pretrained models, as it seems you have already converted and tested them in order to get the latency statistics on this page. If you could post them it would help me and my team a huge amount. Thanks!

Question about Fig. 5

Hi, I'm back again! For your Fourier plot, did you use np.fft.fft2, i.e. np.fft.fft2(image)? And is that plot a heat map directly, or the output of the last network layer? Thanks!

Excellent work!

Hi,
What tool was used to draw Figure 1 in the paper (i.e. images/result.png)?
Was it drawn with Excel?

Dataset path issue for classification training

data/
  train/
    false/
      img.jpg
    true/
      img.jpg
  val/
    false/
      img.jpg
    true/
      img.jpg

My dataset has the structure above. Training command: bash train.sh 1 --model nextvit_small --batch-size 256 --lr 5e-4 --warmup-epochs 20 --weight-decay 0.1 --data-path ./data
Error: FileNotFoundError: Found no valid file for the classes .ipynb_checkpoints. Supported extensions are: .jpg, .jpeg, .png, .ppm, .bmp, .pgm, .tif, .tiff, .webp
Tracing into the code, at
dataset = datasets.ImageFolder(root, transform=transform)
root does not pick up the false or true data.
How should I fix this? My understanding is that the folder name of an image represents its class, so training should only need the path and the class.

Citation / comparison question

Hello, if I want to improve on your work and apply it to segmentation tasks, how should I compare against your work? Should I use the FPN neck or UperNet?

Question about Fig. 5 in the paper

Hello, how did you generate the Fourier spectra and heat maps in Fig. 5? Could you share the code or a link? Thanks!

Wrong implementation of AvgPool in E_MHSA

Hi!
First of all very interesting paper with good ideas. I've looked into your code and your implementation of AvgPool seems to be wrong. Usually when we speak about AvgPool for images we assume a box filter with kernel_size=(size, size) and stride=size.

In your implementation you

  1. take input of size: (B, C, H, W)
  2. reshape it to (B, H*W, C)
  3. in E_MHSA reshape it to (B, Heads, H * W, C / Heads)
  4. in case of sr_ratio != 1 you reshape to (B, Heads, H * W, C / Heads)
  5. Apply AvgPool 1d to the tensor above with kernel_size=size * size, stride=size * size. This is NOT IDENTICAL to AvgPool2d.

The problem is that in this implementation you are performing average pooling over rows, which leads to leakage of information from one border to another and also does not use any information in the vertical dimension. Below is an example of a tensor and its mean inside your E_MHSA.

BS, DIM, SZ = 1, 1, 4
inp = torch.arange(SZ * SZ).float().view(BS, SZ * SZ, DIM)
E_MHSA(dim=DIM, sr_ratio=2, head_dim=1)(inp)

inp = tensor([
        [ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [12., 13., 14., 15.]])

mean_inp (after self.sr(_x)): tensor([ 1.5000,  5.5000,  9.5000, 13.5000])

It obviously calculates the average only within rows. I would assume you used this implementation to avoid reshaping back to (B, C, H, W) for performing AvgPool2d, but it leads to a totally different pooling.

As a side note, all these (B, C, H, W) -> (B, N, C) -> (B, C, H, W) permutations are not really needed, and you could speed up your network by re-implementing E_MHSA to work on (B, C, H, W) tensors as inputs.
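
A minimal standalone sketch (independent of the repo's E_MHSA class) that illustrates the difference described above, pooling the same 4x4 feature map once as a flattened token sequence and once spatially:

# Illustration only: AvgPool1d over flattened tokens vs. AvgPool2d on the feature map.
import torch
import torch.nn as nn

H = W = 4
sr = 2  # spatial reduction ratio

x2d = torch.arange(H * W, dtype=torch.float32).view(1, 1, H, W)  # (B, C, H, W)
tokens = x2d.flatten(2)                                          # (B, C, N) with N = H * W

# Pooling the flattened token sequence, as in the formulation discussed above:
out1d = nn.AvgPool1d(kernel_size=sr * sr, stride=sr * sr)(tokens)
print(out1d.flatten().tolist())  # [1.5, 5.5, 9.5, 13.5] -- averages of whole rows

# Pooling the 2D feature map:
out2d = nn.AvgPool2d(kernel_size=sr, stride=sr)(x2d)
print(out2d.flatten().tolist())  # [2.5, 4.5, 10.5, 12.5] -- averages of 2x2 windows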

Question about the training time in the logs

Hello!
I noticed in the logs that training nextvit-small on ImageNet (224x224) takes about 4 min/epoch.
When I train on 8 * A100 GPUs with bsz = 8 * 256, it takes about 10 min/epoch,
while the paper states that your training setup is 8 * V100 with bsz = 8 * 256.
So I would like to ask whether there is some inconsistency here, or whether there is a problem with my training setup.
Thanks!

Deployment question

Is the segmentation model suitable for deployment on edge devices? Our hardware is an NVIDIA Orin, and the pretrained model I downloaded is more than 700 MB. Can it be deployed?

Welcome update to OpenMMLab 2.0


I am Vansin, the technical operator of OpenMMLab. In September of last year, we announced the release of OpenMMLab 2.0 at the World Artificial Intelligence Conference in Shanghai. We invite you to upgrade your algorithm library to OpenMMLab 2.0 using MMEngine, which can be used for both research and commercial purposes. If you have any questions, please feel free to join us on the OpenMMLab Discord at https://discord.gg/amFNsyUBvm or add me on WeChat (van-sin) and I will invite you to the OpenMMLab WeChat group.

Here are the OpenMMLab 2.0 repos branches:

|  | OpenMMLab 1.0 branch | OpenMMLab 2.0 branch |
|---|---|---|
| MMEngine |  | 0.x |
| MMCV | 1.x | 2.x |
| MMDetection | 0.x, 1.x, 2.x | 3.x |
| MMAction2 | 0.x | 1.x |
| MMClassification | 0.x | 1.x |
| MMSegmentation | 0.x | 1.x |
| MMDetection3D | 0.x | 1.x |
| MMEditing | 0.x | 1.x |
| MMPose | 0.x | 1.x |
| MMDeploy | 0.x | 1.x |
| MMTracking | 0.x | 1.x |
| MMOCR | 0.x | 1.x |
| MMRazor | 0.x | 1.x |
| MMSelfSup | 0.x | 1.x |
| MMRotate | 1.x | 1.x |
| MMYOLO |  | 0.x |

Attention: please create a new virtual environment for OpenMMLab 2.0.

add models to Hugging Face Hub

Hi!

Would you be interested in sharing your models in the Hugging Face Hub? The Hub offers free hosting and it would make your work more accessible and visible to the rest of the ML community. We can help you set up a bytedance organization.

Some of the benefits of sharing your models through the Hub would be:

  • versioning, commit history and diffs
  • repos provide useful metadata about their tasks, languages, metrics, etc that make them discoverable
  • multiple features from TensorBoard visualizations, PapersWithCode integration, and more
  • wider reach of your work to the ecosystem

Creating the repos and adding new models should be a relatively straightforward process if you've used Git before. This is a step-by-step guide explaining the process in case you're interested. Please let us know if you would be interested and if you have any questions.

You can also set up a Gradio demo for your model by following this guide: https://gradio.app/getting_started/

Here is an example of a Gradio demo: https://huggingface.co/spaces/adirik/OWL-ViT
and its code: https://huggingface.co/spaces/adirik/OWL-ViT/blob/main/app.py

Happy to hear your thoughts,
Ahsen and the Hugging Face team

Dockerfile request

Hi everyone,

Has someone already built a Dockerfile to run this code? If you don't mind, please share it with us. Thank you for helping us.

Best regards,
Xin
