This repo contains the official detection and segmentation implementation of the paper "DaViT: Dual Attention Vision Transformer" by Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, and Lu Yuan.
The official implementation for image classification will be released at https://github.com/microsoft/DaViT.
Python 3, PyTorch >= 1.8.0, and torchvision >= 0.7.0 are required for the current codebase.
An example on CUDA 10.2:

```shell
pip install torch==1.9.0+cu102 torchvision==0.10.0+cu102 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install thop pyyaml fvcore pillow==8.3.2
```
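To confirm the installed versions meet the minimums above, a small stdlib sketch can compare dotted version strings (this helper is illustrative and not part of the repo):

```python
def meets_minimum(installed: str, minimum: str) -> bool:
    """True if dotted version `installed` is at least `minimum`.
    Local build tags such as "+cu102" are ignored."""
    parse = lambda v: tuple(int(p) for p in v.split("+")[0].split("."))
    return parse(installed) >= parse(minimum)

# e.g. meets_minimum("1.9.0+cu102", "1.8.0") is True; compare
# torch.__version__ and torchvision.__version__ the same way.
```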
Install mmcv and mmdet:

```shell
cd mmdet
# An example on CUDA 10.2 and PyTorch 1.9
pip install mmcv-full==1.3.0 -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.9.0/index.html
pip install -r requirements/build.txt
pip install -v -e .  # or "python setup.py develop"
```

Prepare the COCO dataset under data/coco/ (expected layout: ROOT/mmdet/data/coco/annotations, train2017, val2017):

```shell
mkdir data
```
Finetune on COCO:

```shell
bash tools/dist_train.sh configs/davit_retinanet_1x_coco.py 8 \
    --cfg-options model.pretrained=PRETRAINED_MODEL_PATH
```
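The `--cfg-options key=value` flag overrides nested config fields by dotted path. Conceptually it works like the simplified stdlib sketch below (not mmcv's actual implementation; the checkpoint filename is hypothetical):

```python
def apply_override(cfg: dict, dotted_key: str, value) -> None:
    """Set a nested config field addressed by a dotted path,
    creating intermediate dicts as needed (simplified vs. mmcv)."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

cfg = {"model": {"backbone": "DaViT-T", "pretrained": None}}
apply_override(cfg, "model.pretrained", "davit_t.pth")  # hypothetical path
# cfg["model"]["pretrained"] is now "davit_t.pth"
```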
Install mmcv and mmseg:

```shell
cd mmseg
# An example on CUDA 10.2 and PyTorch 1.9
pip install mmcv-full==1.3.0 -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.9.0/index.html
pip install -e .
```

Prepare the ADE20K dataset under data/ade/ (expected layout: ROOT/mmseg/data/ADEChallengeData2016):

```shell
mkdir data
```
Finetune on ADE20K:

```shell
bash tools/dist_train.sh configs/upernet_davit_512x512_160k_ade20k.py 8 \
    --options model.pretrained=PRETRAINED_MODEL_PATH
```
Multi-scale testing:

```shell
bash tools/dist_test.sh configs/upernet_davit_512x512_160k_ade20k.py \
    TRAINED_MODEL_PATH 8 --aug-test --eval mIoU
```
Image Classification on ImageNet-1K
Model | Pretrain | Resolution | acc@1 | acc@5 | #params | FLOPs | Checkpoint | Log |
---|---|---|---|---|---|---|---|---|
DaViT-T | IN-1K | 224 | 82.8 | 96.2 | 28.3M | 4.5G | download | log |
DaViT-S | IN-1K | 224 | 84.2 | 96.9 | 49.7M | 8.8G | download | log |
DaViT-B | IN-1K | 224 | 84.6 | 96.9 | 87.9M | 15.5G | download | log |
Object Detection and Instance Segmentation on COCO
Backbone | Pretrain | Lr Schd | #params | FLOPs | box mAP | mask mAP | Checkpoint | Log |
---|---|---|---|---|---|---|---|---|
DaViT-T | ImageNet-1K | 1x | 47.8M | 263G | 45.0 | 41.1 | download | log |
DaViT-T | ImageNet-1K | 3x | 47.8M | 263G | 47.4 | 42.9 | download | log |
DaViT-S | ImageNet-1K | 1x | 69.2M | 351G | 47.7 | 42.9 | download | log |
DaViT-S | ImageNet-1K | 3x | 69.2M | 351G | 49.5 | 44.3 | download | log |
DaViT-B | ImageNet-1K | 1x | 107.3M | 491G | 48.2 | 43.3 | download | log |
DaViT-B | ImageNet-1K | 3x | 107.3M | 491G | 49.9 | 44.6 | download | log |
Backbone | Pretrain | Lr Schd | #params | FLOPs | box mAP | Checkpoint | Log |
---|---|---|---|---|---|---|---|
DaViT-T | ImageNet-1K | 1x | 38.5M | 244G | 44.0 | download | log |
DaViT-T | ImageNet-1K | 3x | 38.5M | 244G | 46.5 | download | log |
DaViT-S | ImageNet-1K | 1x | 59.9M | 332G | 46.0 | download | log |
DaViT-S | ImageNet-1K | 3x | 59.9M | 332G | 48.2 | download | log |
DaViT-B | ImageNet-1K | 1x | 98.5M | 471G | 46.7 | download | log |
DaViT-B | ImageNet-1K | 3x | 98.5M | 471G | 48.7 | download | log |
Semantic Segmentation on ADE20K
Backbone | Pretrain | Method | Resolution | Iters | #params | FLOPs | mIoU | Checkpoint | Log |
---|---|---|---|---|---|---|---|---|---|
DaViT-T | ImageNet-1K | UPerNet | 512x512 | 160k | 60M | 940G | 46.3 | download | log |
DaViT-S | ImageNet-1K | UPerNet | 512x512 | 160k | 81M | 1030G | 48.8 | download | log |
DaViT-B | ImageNet-1K | UPerNet | 512x512 | 160k | 121M | 1175G | 49.4 | download | log |
If you find this repo useful for your project, please consider citing it with the following BibTeX entry:
```bibtex
@article{ding2022davit,
  title={DaViT: Dual Attention Vision Transformer},
  author={Ding, Mingyu and Xiao, Bin and Codella, Noel and Luo, Ping and Wang, Jingdong and Yuan, Lu},
  journal={arXiv preprint arXiv:2204.03645},
  year={2022},
}
```
Our codebase is built on timm, MMDetection, and MMSegmentation. We thank the authors for their nicely organized code!