Towards Robust Video Object Segmentation with Adaptive Object Calibration (ACM Multimedia 2022)

A preprint of this work is available on arXiv.

The conference poster is available in this GitHub repo.

The long-paper presentation video is available on Google Drive and YouTube.

Qualitative results and comparisons with previous SOTAs are available at YouTube.

Stars ⭐, comments 💹, and collaboration 😀 are welcome!

- 2022.11.16: All code has been cleaned up and released.
- 2022.10.21: Added the robustness evaluation dataloader for other models, e.g., AOT.
- 2022.10.1: Added the code for the key implementations of this work.
- 2022.9.25: Added the poster of this work.
- 2022.8.27: Added the presentation video and PPT for this work.
- 2022.7.10: Add future works towards robust VOS!
- 2022.7.5: Our ArXiv-version paper is available.
- 2022.7.1: Repo init. Please stay tuned~

Motivation for Robust Video Object Segmentation

Pipeline

Adaptive Object Proxy Representation (Component1)

Object Mask Calibration (Component2)

Abstract

In the booming video era, video segmentation attracts increasing research attention in the multimedia community.

Semi-supervised video object segmentation (VOS) aims at segmenting objects in all target frames of a video, given annotated object masks of reference frames. Most existing methods build pixel-wise reference-target correlations and then perform pixel-wise tracking to obtain target masks. Due to neglecting object-level cues, pixel-level approaches make the tracking vulnerable to perturbations, and even indiscriminate among similar objects.

Towards robust VOS, the key insight is to calibrate the representation and mask of each specific object to be expressive and discriminative. Accordingly, we propose a new deep network, which can adaptively construct object representations and calibrate object masks to achieve stronger robustness.

First, we construct the object representations by applying an adaptive object proxy (AOP) aggregation method, where the proxies represent arbitrary-shaped segments via clustering at multi-levels for reference.

Then, prototype masks are initially generated from the reference-target correlations based on AOP. Afterwards, such proto-masks are further calibrated through network modulation, conditioning on the object proxy representations. We consolidate this conditional mask calibration process in a progressive manner, where the object representations and proto-masks evolve to be discriminative iteratively.

Extensive experiments are conducted on the standard VOS benchmarks, YouTube-VOS-18/19 and DAVIS-17. Our model achieves the state-of-the-art performance among existing published works, and also exhibits significantly superior robustness against perturbations.
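As a rough illustration of the proxy idea described in the abstract, the sketch below clusters the feature vectors inside one object's reference mask with plain k-means and treats the cluster centers as proxies. It is not the released implementation: the function name `aggregate_proxies`, the use of NumPy, the single-level k-means, and the parameters `num_proxies` and `num_iters` are all assumptions made for illustration, whereas the paper's AOP clusters at multiple levels.

```python
import numpy as np


def aggregate_proxies(features, mask, num_proxies=4, num_iters=10, seed=0):
    """Cluster masked feature vectors into proxies (hypothetical helper).

    features: (H, W, C) float array of per-pixel embeddings.
    mask:     (H, W) boolean array, True inside the object.
    Returns a (K, C) array of proxies with K <= num_proxies.
    """
    points = features[mask]  # (N, C): embeddings of the object's pixels
    if len(points) == 0:
        raise ValueError("mask selects no pixels")
    k = min(num_proxies, len(points))
    rng = np.random.default_rng(seed)
    # Initialise the centers from k distinct object pixels.
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(num_iters):
        # Squared distance from every pixel to every center: (N, K).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = points[assign == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    return centers


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    feats = rng.normal(size=(16, 16, 8)).astype(np.float32)
    obj_mask = np.zeros((16, 16), dtype=bool)
    obj_mask[4:12, 4:12] = True
    proxies = aggregate_proxies(feats, obj_mask, num_proxies=4)
    print(proxies.shape)  # (4, 8)
```

Because each proxy is the mean of whichever pixels were assigned to it, a proxy can summarize an arbitrarily shaped segment of the object rather than a fixed grid cell.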

Requirements

Python3
pytorch >= 1.4.0
torchvision
opencv-python
Pillow
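The requirements above can be checked from Python before running anything heavier. The following is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+); the helper names `version_tuple`, `meets_minimum`, and `check_requirements` are hypothetical, and it assumes PyTorch is installed under the distribution name `torch`.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn '1.4.0+cu101' into (1, 4, 0); local and non-numeric suffixes are ignored."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so that '1.4' compares equal to '1.4.0'
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements():
    """Report each listed package as installed (with version) or missing."""
    # Only PyTorch has a minimum version in the list above.
    minimums = {"torch": "1.4.0", "torchvision": None,
                "opencv-python": None, "Pillow": None}
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = minimum is None or meets_minimum(installed, minimum)
        report[name] = installed if ok else f"{installed} (need >= {minimum})"
    return report


if __name__ == "__main__":
    for package, status in check_requirements().items():
        print(f"{package}: {status}")
```

Note that OpenCV installed through a variant such as `opencv-python-headless` would be reported as missing by this sketch, since it looks up the exact distribution name.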

You can also use the Docker image below to set up your environment directly. However, this image may contain some redundant packages.

docker image: xxiaoh/vos:10.1-cudnn7-torch1.4_v3

A more lightweight version can be created by modifying the provided Dockerfile.

Preparation

Datasets
- YouTube-VOS
  
  A commonly-used large-scale VOS dataset.
  
  datasets/YTB/2019: version 2019, download link. train is required for training. valid (6fps) and valid_all_frames (30fps, optional) are used for evaluation.
  
  datasets/YTB/2018: version 2018, download link. Only valid (6fps) and valid_all_frames (30fps, optional) are required for this project and used for evaluation.
- DAVIS
  
  A commonly-used small-scale VOS dataset.
  
  datasets/DAVIS: TrainVal (480p) contains both the training and validation split. Test-Dev (480p) contains the Test-dev split. The full-resolution version is also supported for training and evaluation but not required.
- Pretrained weights for the backbone
  
  resnet101-deeplabv3p
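Before training or evaluation, the dataset folders listed above can be verified with a short script. This is only a sketch: the helper `check_layout` is hypothetical, the split of folders into required and optional is one reading of the list above, and the contents of `datasets/DAVIS` are not checked because the list does not spell out its subfolder names.

```python
from pathlib import Path

# Relative folder -> whether this sketch treats it as required.
# valid_all_frames is marked optional in the dataset list.
EXPECTED = {
    "YTB/2019/train": True,
    "YTB/2019/valid": True,
    "YTB/2019/valid_all_frames": False,
    "YTB/2018/valid": True,
    "YTB/2018/valid_all_frames": False,
    "DAVIS": True,
}


def check_layout(root="datasets"):
    """Return (missing_required, missing_optional) lists of relative paths."""
    root = Path(root)
    missing_required, missing_optional = [], []
    for rel, required in EXPECTED.items():
        if not (root / rel).is_dir():
            (missing_required if required else missing_optional).append(rel)
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_layout("datasets")
    for rel in required:
        print(f"missing (required): datasets/{rel}")
    for rel in optional:
        print(f"missing (optional): datasets/{rel}")
```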

Implementation

The key implementation of matching with the adaptive-proxy-based representation is provided in THIS FILE. For other implementation and training/evaluation details, refer to PRCMVOS or CFBI.

The key implementation of the preliminary robust VOS benchmark evaluation is provided in THIS FILE.
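To give a feel for what a robustness evaluation involves, the sketch below perturbs the input frames of a video while leaving the reference mask untouched, so that a model can be scored on clean and perturbed inputs with the same annotations. Additive Gaussian noise is used here purely as an example perturbation; the helper names `perturb_frame` and `perturb_video` and their parameters are hypothetical, and the perturbation types used by the released dataloader should be taken from the linked file rather than from this snippet.

```python
import numpy as np


def perturb_frame(frame, sigma=10.0, seed=None):
    """Add zero-mean Gaussian noise to a uint8 frame (hypothetical helper).

    frame: (H, W, 3) uint8 image; sigma is the noise std in 0-255 units.
    """
    rng = np.random.default_rng(seed)
    noisy = frame.astype(np.float32) + rng.normal(0.0, sigma, size=frame.shape)
    # Round and clip back to the valid 8-bit range.
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)


def perturb_video(frames, sigma=10.0, seed=0):
    """Perturb every frame; a different seed per frame gives independent noise."""
    return [perturb_frame(f, sigma=sigma, seed=seed + i)
            for i, f in enumerate(frames)]


if __name__ == "__main__":
    clean = [np.full((4, 4, 3), 128, dtype=np.uint8) for _ in range(3)]
    noisy = perturb_video(clean, sigma=5.0)
    print(len(noisy), noisy[0].shape, noisy[0].dtype)  # 3 (4, 4, 3) uint8
```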

The whole project code is provided in THIS FOLDER.

Feel free to contact me if you have any problems with the implementation~

For evaluation, please use the official YouTube-VOS servers (2018 server and 2019 server), the official DAVIS toolkit (for Val), and the official DAVIS server (for Test-dev).

Limitations & Directions for further exploration towards Robust VOS!

- Extend the proposed clustering-based adaptive proxy representation to other dense-tracking tasks in a more efficient and robust way
- Leverage the robust layered representation, i.e., intermediate masks, for robust mask calibration in other segmentation tasks
- Study more diverse perturbation/corruption types for video segmentation tasks like VOS and VIS
- Adversarial attack and defence for VOS models is still an open question for further exploration
- VOS model robustness verification and theoretical analysis
- Model enhancement from the perspective of data management

(to be continued...)

Citation

If you find this work useful for your research, please consider citing:

@inproceedings{xu2022towards,
   title={Towards Robust Video Object Segmentation with Adaptive Object Calibration},
   author={Xu, Xiaohao and Wang, Jinglu and Ming, Xiang and Lu, Yan},
   booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
   pages={2709--2718},
   year={2022}
}