Vision Longformer for Object Detection

This project provides the source code for the object detection part of the Vision Longformer paper. It is built on detectron2.

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

The classification part of the code and checkpoints can be found here.

Updates

  • 03/29/2021: First version of the Vision Longformer paper posted on arXiv.
  • 05/17/2021: Performance improved by adding relative positional bias, inspired by Swin Transformer! First version of Object Detection code released.

Usage

Here is an example command for evaluating a pretrained Vision Longformer small model on COCO:

python -m pip install -e .

ln -s /mnt/data_storage datasets

DETECTRON2_DATASETS=datasets python train_net.py --num-gpus 1 --eval-only --config configs/msvit_maskrcnn_fpn_1x_small_sparse.yaml \
  MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0" \
  SOLVER.AMP.ENABLED True \
  MODEL.WEIGHTS /mnt/model_storage/msvit_det/visionlongformer/vilsmall/maskrcnn1x/model_final.pth
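
The MODEL.TRANSFORMER.MSVIT.ARCH value above is easier to check when split into stages. The following is a minimal sketch, not part of this repository: `parse_arch` is a hypothetical helper that only assumes what is visible in the string itself, namely that stages are joined by `_` and each field is a single letter followed by an integer. It does not interpret what the letters mean; refer to the model code for that.

```python
import re

# The ARCH string used in the ViL-Small commands in this README.
ARCH = ("l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_"
        "l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0")


def parse_arch(arch):
    """Split an ARCH string into one {letter: int} dict per stage.

    Hypothetical helper for inspection only; the repository has its own parser.
    """
    stages = []
    for stage in arch.split("_"):
        fields = {}
        for token in stage.split(","):
            match = re.fullmatch(r"([a-z])(\d+)", token)
            if match is None:
                raise ValueError(f"unexpected token {token!r} in stage {stage!r}")
            fields[match.group(1)] = int(match.group(2))
        stages.append(fields)
    return stages


if __name__ == "__main__":
    # Print one line per stage so typos in a hand-edited string stand out.
    for fields in parse_arch(ARCH):
        print(fields)
```

Running this on a hand-edited ARCH string before launching a job is a cheap way to catch a missing comma or a dropped stage.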

Here is an example command for training the Vision Longformer small model on COCO:

python -m pip install -e .

ln -s /mnt/data_storage datasets

# convert the classification checkpoint into a detection checkpoint for initialization
python3 converter.py --source_model "/mnt/model_storage/msvit/visionlongformer/small1281_relative/model_best.pth" \
  --output_model msvit_pretrain.pth --config configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml \
  MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0"

# train with the converted detection checkpoint as initialization
DETECTRON2_DATASETS=datasets python train_net.py --num-gpus 8 --config configs/msvit_maskrcnn_fpn_3xms_small_sparse.yaml \
  MODEL.WEIGHTS msvit_pretrain.pth \
  MODEL.TRANSFORMER.DROP_PATH 0.2 \
  MODEL.TRANSFORMER.MSVIT.ATTN_TYPE longformerhand \
  MODEL.TRANSFORMER.MSVIT.ARCH "l1,h3,d96,n1,s1,g1,p4,f7,a0_l2,h3,d192,n2,s1,g1,p2,f7,a0_l3,h6,d384,n8,s1,g1,p2,f7,a0_l4,h12,d768,n1,s1,g0,p2,f7,a0" \
  SOLVER.AMP.ENABLED True SOLVER.BASE_LR 1e-4 SOLVER.WEIGHT_DECAY 0.1 \
  TEST.EVAL_PERIOD 7330 SOLVER.IMS_PER_BATCH 16
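
One plausible reading of TEST.EVAL_PERIOD 7330 in the training command above is "evaluate about once per epoch". The sketch below checks that arithmetic. It rests on an assumption that this README does not state: that the COCO train2017 split, as loaded by detectron2 after dropping images without usable annotations, contains 117,266 images.

```python
import math

# Assumption (not stated in this README): number of COCO train2017 images
# that remain after detectron2 filters out images without usable annotations.
NUM_TRAIN_IMAGES = 117_266

# SOLVER.IMS_PER_BATCH from the training command above (total across all GPUs).
IMS_PER_BATCH = 16

# Iterations needed to see every training image once.
iters_per_epoch = math.ceil(NUM_TRAIN_IMAGES / IMS_PER_BATCH)
print(iters_per_epoch)
```

If SOLVER.IMS_PER_BATCH is changed, the same calculation gives a matching TEST.EVAL_PERIOD for roughly per-epoch evaluation.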

Model Zoo on COCO

Vision Longformer with relative positional bias

| Backbone | Method | pretrain | drop_path | Lr Schd | box mAP | mask mAP | #params | FLOPs | checkpoints | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViL-Tiny | Mask R-CNN | ImageNet-1K | 0.05 | 1x | 41.4 | 38.1 | 26.9M | 145.6G | ckpt, config | log |
| ViL-Tiny | Mask R-CNN | ImageNet-1K | 0.1 | 3x | 44.2 | 40.6 | 26.9M | 145.6G | ckpt, config | log |
| ViL-Small | Mask R-CNN | ImageNet-1K | 0.2 | 1x | 44.9 | 41.1 | 45.0M | 218.3G | ckpt, config | log |
| ViL-Small | Mask R-CNN | ImageNet-1K | 0.2 | 3x | 47.1 | 42.7 | 45.0M | 218.3G | ckpt, config | log |
| ViL-Medium (D) | Mask R-CNN | ImageNet-21K | 0.2 | 1x | 47.6 | 43.0 | 60.1M | 293.8G | ckpt, config | log |
| ViL-Medium (D) | Mask R-CNN | ImageNet-21K | 0.3 | 3x | 48.9 | 44.2 | 60.1M | 293.8G | ckpt, config | log |
| ViL-Base (D) | Mask R-CNN | ImageNet-21K | 0.3 | 1x | 48.6 | 43.6 | 76.1M | 384.4G | ckpt, config | log |
| ViL-Base (D) | Mask R-CNN | ImageNet-21K | 0.3 | 3x | 49.6 | 44.5 | 76.1M | 384.4G | ckpt, config | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViL-Tiny | RetinaNet | ImageNet-1K | 0.05 | 1x | 40.8 | -- | 16.64M | 182.7G | ckpt, config | log |
| ViL-Tiny | RetinaNet | ImageNet-1K | 0.1 | 3x | 43.6 | -- | 16.64M | 182.7G | ckpt, config | log |
| ViL-Small | RetinaNet | ImageNet-1K | 0.1 | 1x | 44.2 | -- | 35.68M | 254.8G | ckpt, config | log |
| ViL-Small | RetinaNet | ImageNet-1K | 0.2 | 3x | 45.9 | -- | 35.68M | 254.8G | ckpt, config | log |
| ViL-Medium (D) | RetinaNet | ImageNet-21K | 0.2 | 1x | 46.8 | -- | 50.77M | 330.4G | ckpt, config | log |
| ViL-Medium (D) | RetinaNet | ImageNet-21K | 0.3 | 3x | 47.9 | -- | 50.77M | 330.4G | ckpt, config | log |
| ViL-Base (D) | RetinaNet | ImageNet-21K | 0.3 | 1x | 47.8 | -- | 66.74M | 420.9G | ckpt, config | log |
| ViL-Base (D) | RetinaNet | ImageNet-21K | 0.3 | 3x | 48.6 | -- | 66.74M | 420.9G | ckpt, config | log |

See Table 6 and Table 7 in the Vision Longformer paper for more fine-grained results. We use weight decay 0.05 for all experiments and search for the best drop path rate in [0.05, 0.1, 0.2, 0.3].

Comparison of various efficient attention mechanisms with absolute positional embedding (Small size)

| Backbone | Method | pretrain | drop_path | Lr Schd | box mAP | mask mAP | #params | FLOPs | Memory | checkpoints | log |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| srformer/64 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 36.4 | 34.6 | 73.3M | 224.1G | 7.1G | ckpt, config | log |
| srformer/32 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 39.9 | 37.3 | 51.5M | 268.3G | 13.6G | ckpt, config | log |
| Partial srformer/32 | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.4 | 39.0 | 46.8M | 352.1G | 22.6G | ckpt, config | log |
| global | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 34.8 | 33.4 | 45.2M | 226.4G | 7.6G | ckpt, config | log |
| Partial global | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.5 | 39.2 | 45.1M | 326.5G | 20.1G | ckpt, config | log |
| performer | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 36.1 | 34.3 | 45.0M | 251.5G | 8.4G | ckpt, config | log |
| Partial performer | Mask R-CNN | ImageNet-1K | 0.05 | 1x | 42.3 | 39.1 | 45.0M | 343.7G | 20.0G | ckpt, config | log |
| ViL | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 42.9 | 39.6 | 45.0M | 218.3G | 7.4G | ckpt, config | log |
| Partial ViL | Mask R-CNN | ImageNet-1K | 0.1 | 1x | 43.3 | 39.8 | 45.0M | 326.8G | 19.5G | ckpt, config | log |

We use weight decay 0.05 for all experiments and search for the best drop path rate in [0.05, 0.1, 0.2].
