STCN

Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation

Ho Kei Cheng, Yu-Wing Tai, Chi-Keung Tang

NeurIPS 2021

[arXiv] [PDF] [Project Page] [Papers with Code]

Check out our new work Cutie!

(Demo GIFs: bmx, pigs)

News: In the YouTubeVOS 2021 challenge, STCN achieved 1st place accuracy in novel (unknown) classes and 2nd place in overall accuracy. Our solution is also fast and light.

We present Space-Time Correspondence Networks (STCN) as the new, effective, and efficient framework to model space-time correspondences in the context of video object segmentation. STCN achieves SOTA results on multiple benchmarks while running fast at 20+ FPS without bells and whistles. Its speed is even higher with mixed precision. Despite its effectiveness, the network itself is very simple with lots of room for improvement. See the paper for technical details.

UPDATE (15-July-2021)

  1. CBAM block: We tried training without the CBAM block, and we don't really need it. For the s03 model, we get -1.2 in DAVIS and +0.1 in YouTubeVOS. For the s012 model, we get +0.1 in DAVIS and +0.1 in YouTubeVOS. You are welcome to drop this block (see the no_cbam branch). Overall, the much larger YouTubeVOS seems to be a more consistent evaluation benchmark.

UPDATE (22-Aug-2021)

  1. Reproducibility: We have updated the package requirements below. With that environment, we obtained DAVIS J&F in the range of [85.1, 85.5] across multiple runs on two different machines.

UPDATE (27-Apr-2022)

Multi-scale testing code (as in the paper) has been added here.

What do we have here?

  1. A gentle introduction

  2. Quantitative results and precomputed outputs

    1. DAVIS 2016
    2. DAVIS 2017 validation/test-dev
    3. YouTubeVOS 2018/2019
  3. Try our model on your own data (Interactive GUI available)

  4. Steps to reproduce

    1. Pretrained models
    2. Inference
    3. Training
  5. If you want to look closer

  6. Citation

A Gentle Introduction

(Figure: STCN framework overview)

There are two main contributions: the STCN framework (figure above) and L2 similarity. We build affinity between images instead of between (image, mask) pairs -- this leads to a significant speed-up, memory savings (because we compute one affinity matrix instead of multiple), and better robustness. We further replace the dot product with L2 similarity, which greatly improves memory bank utilization.
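To make the L2-similarity point concrete, here is a minimal sketch (with random toy tensors and simplified shapes; this is not the repository's MemoryReader implementation) contrasting a dot-product affinity with a negative-squared-L2 affinity. Because the softmax over memory locations is invariant to any term that is constant per query, the squared query norm can be dropped from the L2 expression.

```python
import torch

def dot_product_affinity(mk, qk):
    # mk: memory keys (B, C, N_mem); qk: query keys (B, C, N_qry)
    sim = torch.bmm(mk.transpose(1, 2), qk)   # (B, N_mem, N_qry)
    return torch.softmax(sim, dim=1)          # normalize over memory locations

def l2_affinity(mk, qk):
    # -||m - q||^2 = 2*m.q - ||m||^2 - ||q||^2; the ||q||^2 term is constant
    # for each query column, so it cancels in the softmax and is dropped.
    mq = torch.bmm(mk.transpose(1, 2), qk)    # (B, N_mem, N_qry)
    m_sq = mk.pow(2).sum(1).unsqueeze(2)      # (B, N_mem, 1)
    return torch.softmax(2 * mq - m_sq, dim=1)

# Toy usage: one memory frame and one query frame of 30x30 key features
mk = torch.randn(1, 64, 900)
qk = torch.randn(1, 64, 900)
affinity = l2_affinity(mk, qk)                # (1, 900, 900); each column sums to 1
```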

Perks

  • Simple, runs fast (30+ FPS with mixed precision; 20+ without)
  • High performance
  • Still lots of room to improve upon (e.g. locality, memory space compression)
  • Easy to train: just two 11GB GPUs, no V100s needed

Requirements

We used these packages/versions in the development of this project.

  • PyTorch 1.8.1
  • torchvision 0.9.1
  • OpenCV 4.2.0
  • Pillow-SIMD 7.0.0.post3
  • progressbar2
  • thinspline for training (pip install git+https://github.com/cheind/py-thin-plate-spline)
  • gitpython for training
  • gdown for downloading pretrained models
  • Other packages in my environment, for reference only.

Refer to the official PyTorch guide for installing PyTorch/torchvision, and the pillow-simd guide to install Pillow-SIMD. The rest can be installed by:

pip install progressbar2 opencv-python gitpython gdown git+https://github.com/cheind/py-thin-plate-spline

Results

Notations

  • FPS is amortized: it is computed as total number of frames / total processing time, irrespective of the number of objects (aka multi-object FPS), and measured on an RTX 2080 Ti with IO time excluded (a rough timing sketch follows this list).
  • We also provide inference speed when Automatic Mixed Precision (AMP) is used -- the performance is almost identical. Speeds in the paper were measured without AMP.
  • All evaluations are done in the 480p resolution. FPS for test-dev is measured on the validation set under the same memory setting (every third frame as memory) for consistency.
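As a rough illustration of the notation above (not the repository's timing code), amortized multi-object FPS can be measured by timing only the per-frame processing with preloaded inputs, so IO is excluded; process_frame below is a hypothetical stand-in for the segmentation step.

```python
import time
import torch

def amortized_fps(frames, process_frame):
    # frames: preloaded tensors, so disk/decode IO is outside the timed region
    torch.cuda.synchronize()                  # make sure pending GPU work is not counted
    start = time.time()
    for f in frames:
        process_frame(f)                      # segments all objects in this frame
    torch.cuda.synchronize()                  # wait for the last frame to finish
    elapsed = time.time() - start
    # Each frame counts once, no matter how many objects it contains
    return len(frames) / elapsed
```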

[Precomputed outputs - Google Drive]

[Precomputed outputs - OneDrive]

s012 denotes models with BL pretraining while s03 denotes those without (used to be called s02 in MiVOS).

Numbers (s012)

| Dataset | Split | J&F | J | F | FPS | FPS (AMP) |
| --- | --- | --- | --- | --- | --- | --- |
| DAVIS 2016 | validation | 91.7 | 90.4 | 93.0 | 26.9 | 40.8 |
| DAVIS 2017 | validation | 85.3 | 82.0 | 88.6 | 20.2 | 34.1 |
| DAVIS 2017 | test-dev | 79.9 | 76.3 | 83.5 | 14.6 | 22.7 |

| Dataset | Split | Overall Score | J-Seen | F-Seen | J-Unseen | F-Unseen |
| --- | --- | --- | --- | --- | --- | --- |
| YouTubeVOS 18 | validation | 84.3 | 83.2 | 87.9 | 79.0 | 87.2 |
| YouTubeVOS 19 | validation | 84.2 | 82.6 | 87.0 | 79.4 | 87.7 |

| Dataset | AUC-J&F | J&F @ 60s |
| --- | --- | --- |
| DAVIS Interactive | 88.4 | 88.8 |

For DAVIS interactive, we changed the propagation module of MiVOS from STM to STCN. See this link for details.

Try on your own data (Interactive GUI available)

If you (somehow) have the first-frame segmentation (or more generally, segmentation of each object when they first appear), you can use eval_generic.py. Check the top of that file for instructions.

If you just want to play with it interactively, I highly recommend our extension to MiVOS -- it comes with an interactive GUI and is highly efficient and effective.

Reproducing the results

Pretrained models

We use the same model for YouTubeVOS and DAVIS. You can download them yourself and put them in ./saves/, or use download_model.py.

s012 model (better): [Google Drive] [OneDrive]

s03 model: [Google Drive] [OneDrive]

s0 pretrained model: [GitHub]

s01 pretrained model: [GitHub]

Inference

  • eval_davis_2016.py for DAVIS 2016 validation set
  • eval_davis.py for DAVIS 2017 validation and test-dev set (controlled by --split)
  • eval_youtube.py for YouTubeVOS 2018/19 validation set (controlled by --yv_path)

The argument help strings should give you a rough idea of how to use them. For example, if you have downloaded the datasets and pretrained models using our scripts, you only need to specify the output path: python eval_davis.py --output [somewhere] for DAVIS 2017 validation set evaluation. For YouTubeVOS evaluation, point --yv_path to the version of your choosing.

Multi-scale testing code (as in the paper) has been added here.

Training

Data preparation

I recommend either softlinking (ln -s) existing data or using the provided download_datasets.py to structure the datasets in our format. download_datasets.py might download more than you need -- just comment out the parts you don't want. The script does not download BL30K because it is huge (>600GB) and we don't want to crash your hard disks. See below.

├── STCN
├── BL30K
├── DAVIS
│   ├── 2016
│   │   ├── Annotations
│   │   └── ...
│   └── 2017
│       ├── test-dev
│       │   ├── Annotations
│       │   └── ...
│       └── trainval
│           ├── Annotations
│           └── ...
├── static
│   ├── BIG_small
│   └── ...
├── YouTube
│   ├── all_frames
│   │   └── valid_all_frames
│   ├── train
│   ├── train_480p
│   └── valid
└── YouTube2018
    ├── all_frames
    │   └── valid_all_frames
    └── valid

BL30K

BL30K is a synthetic dataset proposed in MiVOS.

You can either use the automatic script download_bl30k.py or download it manually from MiVOS. Note that each segment is about 115GB in size -- 700GB in total. You are going to need ~1TB of free disk space to run the script (including extraction buffer). Google might block the Google Drive link. You can 1) make a shortcut of the folder to your own Google Drive, and 2) use rclone to copy from your own Google Drive (would not count towards your storage limit).

Training commands

CUDA_VISIBLE_DEVICES=[a,b] OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port [cccc] --nproc_per_node=2 train.py --id [defg] --stage [h]

We implemented training with Distributed Data Parallel (DDP) with two 11GB GPUs. Replace a, b with the GPU ids, cccc with an unused port number, defg with a unique experiment identifier, and h with the training stage (0/1/2/3).

The model is trained progressively with different stages (0: static images; 1: BL30K; 2: 300K main training; 3: 150K main training). After each stage finishes, we start the next stage by loading the latest trained weight.

(Models trained on stage 0 only cannot be used directly. See model/model.py: load_network for the required mapping that we do.)
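As a rough sketch of the kind of mapping this refers to (model/model.py: load_network is the authoritative version): a stage-0 model is trained single-object, so the first convolution of its value encoder expects one mask channel fewer than the multi-object model, and that weight has to be widened before the state dict fits. The layer name and channel counts below are illustrative assumptions, not copied from the code.

```python
import torch
import torch.nn as nn

def adapt_single_to_multi_object(src_dict, key='value_encoder.conv1.weight'):
    # Hypothetical: a single-object first conv sees 4 input channels (RGB + own mask);
    # the multi-object network expects 5 (RGB + own mask + other-objects mask),
    # so append a freshly initialized slice along the input-channel dimension.
    w = src_dict[key]
    if w.shape[1] == 4:
        extra = torch.zeros(w.shape[0], 1, *w.shape[2:], device=w.device)
        nn.init.orthogonal_(extra)
        src_dict[key] = torch.cat([w, extra], dim=1)
    return src_dict

# Usage sketch: load a stage-0 checkpoint, widen it, then load_state_dict into the
# multi-object network before starting stage 1/2/3 training.
# state = adapt_single_to_multi_object(torch.load('saves/retrain_s0.pth'))
```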

The .pth file with the _checkpoint suffix is used to resume interrupted training (with --load_model), which is usually not needed. Typically you only need --load_network to load the last network weights (the file without checkpoint in its name).

So, to train a s012 model, we launch three training steps sequentially as follows:

Pre-training on static images: CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s0 --stage 0

Pre-training on the BL30K dataset: CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s01 --load_network [path_to_trained_s0.pth] --stage 1

Main training: CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s012 --load_network [path_to_trained_s01.pth] --stage 2

And to train a s03 model, we launch two training steps sequentially as follows:

Pre-training on static images: CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s0 --stage 0

Main training: CUDA_VISIBLE_DEVICES=0,1 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port 9842 --nproc_per_node=2 train.py --id retrain_s03 --load_network [path_to_trained_s0.pth] --stage 3

Looking closer

  • To add your datasets, or do something with data augmentations: dataset/static_dataset.py, dataset/vos_dataset.py
  • To work on the similarity function, or memory readout process: model/network.py: MemoryReader, inference_memory_bank.py
  • To work on the network structure: model/network.py, model/modules.py, model/eval_network.py
  • To work on the propagation process: model/model.py, eval_*.py, inference_*.py

Citation

Please cite our paper (and MiVOS if you use top-k filtering) if you find this repo useful!

@inproceedings{cheng2021stcn,
  title={Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation},
  author={Cheng, Ho Kei and Tai, Yu-Wing and Tang, Chi-Keung},
  booktitle={NeurIPS},
  year={2021}
}

@inproceedings{cheng2021mivos,
  title={Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion},
  author={Cheng, Ho Kei and Tai, Yu-Wing and Tang, Chi-Keung},
  booktitle={CVPR},
  year={2021}
}

And if you want to cite the datasets:


@inproceedings{shi2015hierarchicalECSSD,
  title={Hierarchical image saliency detection on extended CSSD},
  author={Shi, Jianping and Yan, Qiong and Xu, Li and Jia, Jiaya},
  booktitle={TPAMI},
  year={2015},
}

@inproceedings{wang2017DUTS,
  title={Learning to Detect Salient Objects with Image-level Supervision},
  author={Wang, Lijun and Lu, Huchuan and Wang, Yifan and Feng, Mengyang and Wang, Dong and Yin, Baocai and Ruan, Xiang},
  booktitle={CVPR},
  year={2017}
}

@inproceedings{FSS1000,
  title = {FSS-1000: A 1000-Class Dataset for Few-Shot Segmentation},
  author = {Li, Xiang and Wei, Tianhan and Chen, Yau Pun and Tai, Yu-Wing and Tang, Chi-Keung},
  booktitle={CVPR},
  year={2020}
}

@inproceedings{zeng2019towardsHRSOD,
  title = {Towards High-Resolution Salient Object Detection},
  author = {Zeng, Yi and Zhang, Pingping and Zhang, Jianming and Lin, Zhe and Lu, Huchuan},
  booktitle = {ICCV},
  year = {2019}
}

@inproceedings{cheng2020cascadepsp,
  title={{CascadePSP}: Toward Class-Agnostic and Very High-Resolution Segmentation via Global and Local Refinement},
  author={Cheng, Ho Kei and Chung, Jihoon and Tai, Yu-Wing and Tang, Chi-Keung},
  booktitle={CVPR},
  year={2020}
}

@inproceedings{xu2018youtubeVOS,
  title={Youtube-vos: A large-scale video object segmentation benchmark},
  author={Xu, Ning and Yang, Linjie and Fan, Yuchen and Yue, Dingcheng and Liang, Yuchen and Yang, Jianchao and Huang, Thomas},
  booktitle = {ECCV},
  year={2018}
}

@inproceedings{perazzi2016benchmark,
  title={A benchmark dataset and evaluation methodology for video object segmentation},
  author={Perazzi, Federico and Pont-Tuset, Jordi and McWilliams, Brian and Van Gool, Luc and Gross, Markus and Sorkine-Hornung, Alexander},
  booktitle={CVPR},
  year={2016}
}

@inproceedings{denninger2019blenderproc,
  title={BlenderProc},
  author={Denninger, Maximilian and Sundermeyer, Martin and Winkelbauer, Dominik and Zidan, Youssef and Olefir, Dmitry and Elbadrawy, Mohamad and Lodhi, Ahsan and Katam, Harinandan},
  booktitle={arXiv:1911.01911},
  year={2019}
}

@inproceedings{shapenet2015,
  title       = {{ShapeNet: An Information-Rich 3D Model Repository}},
  author      = {Chang, Angel Xuan and Funkhouser, Thomas and Guibas, Leonidas and Hanrahan, Pat and Huang, Qixing and Li, Zimo and Savarese, Silvio and Savva, Manolis and Song, Shuran and Su, Hao and Xiao, Jianxiong and Yi, Li and Yu, Fisher},
  booktitle   = {arXiv:1512.03012},
  year        = {2015}
}

Contact: [email protected]

STCN's Issues

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched`

Traceback (most recent call last):
File "eval_youtube.py", line 133, in <module>
processor.interact(with_bg_msk, frame_idx, rgb.shape[1], obj_idx)
File "/datasets/MODELS/STCN-ALL/STCN/inference_core_yv.py", line 123, in interact
self.do_pass(key_k, key_v, frame_idx, end_idx)
File "/datasets/MODELS/STCN-ALL/STCN/inference_core_yv.py", line 84, in do_pass
for oi in self.enabled_obj], 0)
File "/datasets/MODELS/STCN-ALL/STCN/inference_core_yv.py", line 84, in <listcomp>
for oi in self.enabled_obj], 0)
File "/datasets/MODELS/STCN-ALL/STCN/model/eval_network.py", line 61, in segment_with_query
readout_mem = mem_bank.match_memory(qk16)
File "/datasets/MODELS/STCN-ALL/STCN/inference_memory_bank.py", line 60, in match_memory
readout_mem = self._readout(affinity.expand(k,-1,-1), mv)
File "/datasets/MODELS/STCN-ALL/STCN/inference_memory_bank.py", line 42, in _readout
return torch.bmm(mv, affinity)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

Where can I find the key and value of the feature code of the labeled first frame in your test code?

I thought you said this was the key and value encoding for the labeled first frame, but why is the key encoded for frame_idx and the value for self.images[:,frame_idx]?

There is another problem. The dimension of the key here is torch.Size([1, 512, 1, 30, 54]) and the dimension of the value is torch.Size([3, 512, 1, 30, 54]). The first dimension is k (num_objects); what does the third dimension represent? Are the key and value here the feature encodings of the labeled first frame? Looking forward to your reply, thank you very much.

RuntimeError: CUDA out of memory

Hi,
Thank you for sharing this repository, it has been very helpful.
Unfortunately I keep getting RuntimeErrors when running the eval scripts:
"RuntimeError: CUDA out of memory. Tried to allocate 9.83 GiB (GPU 0; 4.00 GiB total capacity; 416.90 MiB already allocated; 2.34 GiB free; 446.00 MiB reserved in total by PyTorch)"
Traceback shows this is thrown by "rgb = data['rgb'].cuda()".
I suspect the GPU I am using does not have enough memory to store all frames from a video simultaneously. Could this indeed be the problem? And is there an easy way to fix this? Can frames be stored in smaller batches?
Thank you.

The GPU I am using is an NVIDIA GeForce GTX 1650.
This specific error was thrown while trying to evaluate my own data, which contains longer videos (ca. 1000 frames). But when running eval_davis.py I get a similar error.

About the memory bank

Sorry, I don't really understand why the memory bank can reduce the running time of the Memory/Value Encoder.
I think that even though the features of frame (t-1) have been saved, you still need to encode the mask (t-1) and the frame again, so where is the time saving reflected?
Waiting for your reply, thank you very much.

The training time is too long

I trained stage 0 with 2×2080Ti and it takes 4 days (about 96h). My parameters basically keep your original settings, except that I set master_port to -29500, imitating other code. The GPUs are at (almost) full load all the time.

I also tried training with a higher OMP_NUM_THREADS (OMP_NUM_THREADS=8, num_workers=8, batch_size=8) or num_workers (OMP_NUM_THREADS=4, num_workers=32, batch_size=8).
The time in the log "retrain_s0 - It ******* [TRAIN] [time ]: ?" is 1.0+ when I use the modified settings above, and when I use your original settings (OMP_NUM_THREADS=4, num_workers=8, batch_size=8) the time is 0.8-0.9 around the start of training.
Therefore, even though I didn't run the entire training process with the other parameter settings, I judge that settings other than your original ones reduce the training speed.

The time in the log "retrain_s0 - It ******* [TRAIN] [time ]: ?" is 0.8-0.9 around the start of training, then it drops to 0.6-0.7, however it rises to 1.5-1.6 after about 70 epochs. This is different from what you described in this answer #5 (comment). Could this be the reason why my training time is too long? Could you please give me some suggestions to shorten the training time?

Question about training speed.

First of all, thank you for your great work!
Conducting the s0 & s2 training with 2×2080Ti should take 30h according to your paper, but in practice it takes me 100h just for s0 with 2×2080Ti (or 1×3090).
So I wanted to confirm the training speed. Or is there perhaps something wrong with my setup?

Performance on the DAVIS datasets

Hi, thanks for your great work!!

Could you provide the performance on the DAVIS datasets without using YouTube-VOS for network training?
Specifically, performance if using 1. DAVIS 2. Static+DAVIS 3. Static+BL30K+DAVIS.

Questions about paper and code

Hi, thank you for your great work! I have several questions about the paper and code.

  1. For the L2 distance, you discard one term here; why do you do that?
  2. I don't understand how you calculate the affinity matrix only once for a frame with multiple instances. I think the memory is instance-specific and for each instance we have one memory, so the affinity should be calculated for each instance. I think it is the same in your test code.
  3. I am new to this area; how can I get the quantitative results? It seems that you only save the output images.

Look forward to your reply.

About optimizer.step()

The code reports a warning: warnings.warn("Detected call of lr_scheduler.step() before optimizer.step()").
Since I modified part of your network, I want to make sure I don't misunderstand your code. Does self.scaler.step(self.optimizer) (model/model.py line 163) correspond to optimizer.step() in the code?

DefaultCPUAllocator: can't allocate memory: you tried to allocate 195696230400 bytes

Hey there,

first of all, thank you for the wonderful repo, it works great!

However, I've been experimenting for a few hours now and I can't process over seventy frames. I'm using 960x480 as resolution, but reducing the frame size doesn't seem to solve the problem.

Usually the script is interrupted by the usual CUDA OOM errors:

/local/data/venvs/swav/lib/python3.6/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Traceback (most recent call last):
  File "eval_generic.py", line 127, in <module>
    processor.interact(with_bg_msk, frame_idx, rgb.shape[1], obj_idx)
  File "/local/data/repos/STCN/inference_core_yv.py", line 119, in interact
    key_v = self.prop_net.encode_value(self.images[:,frame_idx].cuda(), qf16, self.prob[self.enabled_obj,frame_idx].cuda())
  File "/local/data/repos/STCN/model/eval_network.py", line 47, in encode_value
    f16 = self.value_encoder(frame, kf16.repeat(k,1,1,1), masks, others)
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/local/data/repos/STCN/model/modules.py", line 114, in forward
    x = self.bn1(x)
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 178, in forward
    self.eps,
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/nn/functional.py", line 2282, in batch_norm
    input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 1.04 GiB (GPU 0; 31.75 GiB total capacity; 29.64 GiB already allocated; 667.50 MiB free; 29.76 GiB reserved in total by PyTorch)
Processing video1 ...
N/A% (0 of 1) |                                                                                                                                        | Elapsed Time: 0:00:00 ETA:  --:--:--Traceback (most recent call last):
  File "eval_generic.py", line 107, in <module>
    mem_every=args.mem_every, include_last=args.include_last)
  File "/local/data/repos/STCN/inference_core_yv.py", line 38, in __init__
    self.prob = torch.zeros((self.k+1, t, 1, nh, nw), dtype=torch.float32, device=self.device)
RuntimeError: CUDA out of memory. Tried to allocate 47.21 GiB (GPU 0; 31.75 GiB total capacity; 416.90 MiB already allocated; 29.99 GiB free; 446.00 MiB reserved in total by PyTorch)
Processing video1 ...

But sometimes there are much more disturbing errors like this one:

Traceback (most recent call last):
  File "eval_generic.py", line 80, in <module>
    for data in progressbar(test_loader, max_value=len(test_loader), redirect_stdout=True):
  File "/local/data/venvs/swav/lib/python3.6/site-packages/progressbar/shortcuts.py", line 10, in progressbar
    for result in progressbar(iterator):
  File "/local/data/venvs/swav/lib/python3.6/site-packages/progressbar/bar.py", line 547, in __next__
    value = next(self._iterable)
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/local/data/repos/STCN/dataset/generic_test_dataset.py", line 101, in __getitem__
    masks = torch.from_numpy(all_to_onehot(masks, labels)).float()
RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 195696230400 bytes. Error code 12 (Cannot allocate memory)

100% (1 of 1) |########################################################################################################################################| Elapsed Time: 0:00:00 ETA:  00:00:00

The command line is fairly standard:

python eval_generic.py --data_path /local/data/dataset/dummy-test-set --output /local/data/repos/STCN/output-dummy-test-set

The only thing that changes is the number of images and their resolution (960x480 is the maximum).

Is there a way to do inference one batch at a time, without allocating all the memory at the beginning and thus avoiding all these OOMs?

Thank you!

RuntimeError: CUDA error: the launch timed out and was terminated

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1614378083779/work/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=702 : the launch timed out and was terminated
0%| | 0/258 [06:58<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 193, in
model.do_pass(data, total_iter)
File "/home/cwc/STCN_dd/STCN/model/model.py", line 160, in do_pass
self.scaler.step(self.optimizer)
File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 332, in step
if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
File "/home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 332, in
if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
RuntimeError: CUDA error: the launch timed out and was terminated
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1614378083779/work/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f6e8e0592f2 in /home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f6e8e05667b in /home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f6e8e2b2219 in /home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f6e8e0413a4 in /home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocatorc10d::Reducer::Bucket >::~vector() + 0x2f9 (0x7f6ee52d4169 in /home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x7f6ee52c912a in /home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f6ee52f03c2 in /home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f6ee4c2c4a6 in /home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: + 0xa1e6bf (0x7f6ee52f36bf in /home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: + 0x3665b0 (0x7f6ee4c3b5b0 in /home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x36781e (0x7f6ee4c3c81e in /home/cwc/anaconda3/envs/torch18/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: + 0x15893b (0x558167db293b in /home/cwc/anaconda3/envs/torch18/bin/python)
frame #12: + 0x193141 (0x558167ded141 in /home/cwc/anaconda3/envs/torch18/bin/python)
frame #13: + 0x15893b (0x558167db293b in /home/cwc/anaconda3/envs/torch18/bin/python)
frame #14: + 0x193141 (0x558167ded141 in /home/cwc/anaconda3/envs/torch18/bin/python)
frame #15: + 0x156d2c (0x558167db0d2c in /home/cwc/anaconda3/envs/torch18/bin/python)
frame #16: + 0x159649 (0x558167db3649 in /home/cwc/anaconda3/envs/torch18/bin/python)
frame #17: + 0x1670f2 (0x558167dc10f2 in /home/cwc/anaconda3/envs/torch18/bin/python)
frame #18: _PyGC_CollectNoFail + 0x2a (0x558167e5464a in /home/cwc/anaconda3/envs/torch18/bin/python)
frame #19: PyImport_Cleanup + 0x295 (0x558167e2fca5 in /home/cwc/anaconda3/envs/torch18/bin/python)
frame #20: Py_FinalizeEx + 0x79 (0x558167e61a49 in /home/cwc/anaconda3/envs/torch18/bin/python)
frame #21: Py_RunMain + 0x183 (0x558167e63893 in /home/cwc/anaconda3/envs/torch18/bin/python)
frame #22: Py_BytesMain + 0x39 (0x558167e63ca9 in /home/cwc/anaconda3/envs/torch18/bin/python)
frame #23: __libc_start_main + 0xf0 (0x7f6f25c29840 in /lib/x86_64-linux-gnu/libc.so.6)
frame #24: + 0x1e21c7 (0x558167e3c1c7 in /home/cwc/anaconda3/envs/torch18/bin/python)

Number of instances

Thanks for your great work and code.

I am new to this area, it seems that the maximum instance number is 2 for your code. How to deal with images with more than 2 instances?

About the results written in the paper

Why do the CFBI+ results not use the best numbers from its paper (CFBI+ 2× and CFBI+ MS, i.e. the two lines framed in red in the second figure)?
(two screenshots of the referenced result tables)

DAVIS 2016 evaluation error.

Hi,

Thank you for the nice job!

When I load the pre-trained model and test the performance with davis-2016-evaluation, I oddly obtain a bad score on DAVIS 2016.

I also tested mapping the output to 255/1, referring to hkchengrex/Mask-Propagation/issues/33. It did not help.

+-----------+--------+----------+---------+--------+----------+---------+
| Method | J_mean | J_recall | J_decay | F_mean | F_recall | F_decay |
+-----------+--------+----------+---------+--------+----------+---------+
| STCN | 0.856 | 0.931 | 0.040 | 0.880 | 0.921 | 0.043 |
+-----------+--------+----------+---------+--------+----------+---------+

davis-2016-evaluation is poorly maintained, so I would like to ask: do you know the cause of the problem?

Two questions

Thanks for your sharing, nice work.
But could you please show your ablation experiment results, especially for the CBAM block?
And could you please tell me why you didn't use `c = qk.pow(2).sum(1).unsqueeze(1)` in the code?
Waiting for your reply.

Model selection

Hi,

For each sub-dataset (DAVIS 2017 validation/test, DAVIS 2016, YouTube-VOS), do you use only one model, or several models selected for different sub-datasets? Could you tell me the details about the model selection? Thank you!

How was 1620 obtained?

The mv dimension of the memory bank in the training code is [4, 512, 2, 24, 24], and the mv dimension in the test code becomes [1, 512, 1620]. How was 1620 obtained?

about the evaluation of davis2016

I want to evaluate DAVIS 2016 with this code: https://github.com/davisvideochallenge/davis. But when I run eval_view.py to see the result in the HDF5 file generated by eval.py, the program can't find the file db_benchmark.yml. This file is neither in the DAVIS 2016 dataset nor generated by eval.py. Do you know where I can find db_benchmark.yml?

Or could you kindly tell me how you evaluate DAVIS 2016? Which code or website do you use?

Can the model with nproc_per_node=1 only run on one GPU?

The model is too big to run on one GPU, so I tried to run the model with two GPUs and nproc_per_node=1 (the training command is CUDA_VISIBLE_DEVICES=2,3 OMP_NUM_THREADS=4 python -m torch.distributed.launch --master_port -29500 --nproc_per_node=1 train.py --id retrain_s0 --stage 0). However, the code only uses one GPU and still reports CUDA out of memory. I do not want to change batch_size or num_workers because I'm afraid I'll get lower results.

Can the model with nproc_per_node=1 only run on one GPU? If so, could you kindly give me some suggestions for letting the model run on two or more GPUs?

Does TPS improve the results?

Hi, I feel that the TPS (thin plate spline) augmentation in the static phase (s0) is quite slow. But I wonder: does TPS improve the results?
Thanks.

Pre-trained model inference

It would be great if you could provide a sample of using the pre-trained network on a new set of images (an MP4 video or a folder of images), for people who would like to test it on new data.

After pre-training and main training finish, the Annotations zip file produced by testing is only 14 MB

Dear STCN authors:
Hello. First of all, thank you very much for your excellent work on STCN. When reproducing STCN on my server, I made a small modification. After pre-training and main training finished, the Annotations zip file obtained from testing was only 14 MB, and after submitting it, the official test accuracy was close to 0. Strangely, without pre-training the test results are normal, but with pre-training the results are close to 0; I don't know where the problem is. I also have another question: when running encode_value for multiple objects, it seems that the masks of object 1 and object 2 are swapped in position and then concatenated together. Why is this done? Swapping the positions of the object 1 and object 2 masks gives ref_v1 and ref_v2; what effect does the swap have? Looking forward to your reply. Thank you very much.

Best wishes,
Longma

the loss is strange

The stage 0 total_loss is very strange. The model seems to be overfitting, but it converges at about 20k steps, which is even earlier than when the model is first saved (50k steps). Since I didn't change any parameter except master_port, and I also trained the model with two 11GB 2080Ti GPUs as you describe in the article, the model seems to converge too early. Is this what the loss is supposed to look like? Or should I use the model that has just converged (when the loss is lowest) to train stage 3?
(loss curve for stage 0)

By the way, the total_loss of stage 3 rises when I use the final model of stage 0 (at 300k steps) to train it. Is the loss of stage 3 supposed to rise? Or is this just the result of the overfitted stage 0 model?
(loss curve for stage 3)

A quick question about the code and the training details

Thanks for your great work and open-source code!
I have a quick question about the code related to the training details.

At line 161 of train.py, there is code that determines the maximum number of epochs to run.

STCN/train.py, line 161 (commit 5f11ad4):

total_epoch = math.ceil(para['iterations']/len(train_loader))

However, as far as I know, len(train_loader) equals dataset_size/batch_size.
So the line becomes para['iterations']/len(train_loader) == para['iterations']/(dataset_size/batch_size) == para['iterations']*batch_size/dataset_size.
The problem is that the number of epochs to train then increases as we increase the mini-batch size, which is not desirable behavior.

For instance, when I run the code for stage 0 with 2 gpus with -b 8, the total training epochs is 129.
However, when I run the code for stage 0 with 2 gpus with -b 4, the total training epochs is 65.
Moreover, when I run the code for stage 0 with 4 gpus with -b 8, the total training epochs is 258.

It might not hurt the reproducibility of your amazing work, but I think it should be checked for scalability for the different batch sizes or different numbers of GPUs.

Or If I'm wrong, please let me know.

Thanks!

do I need to make changes to the test code as I change the training code?

Hello, I have made some changes to your training code based on your STCN baseline, but the test code is still the same. If I change the training code but do not change the test code, I can still test normally. Is that normal? Or do I need to make changes to the test code as I change the training code?

Some questions about ablation study

I am curious about the performance under the following setting. Have you tried them before?

  1. STCN without vs. with top-k filtering
  2. STCN vs. STM with the same training strategy (I find that the pretraining of STCN may use more data and some augmentations.)

about the result

Thanks for your great work!

I retrained the code with the s03 model. I get 84.5 J&F on DAVIS 2017 val and 75.3 J&F on DAVIS 2017 test-dev.
Do you think these results are within a reasonable range? I noticed that there are still some gaps compared to the results in your paper.

Great work!!

Thank you for opening the training code. STM-based methods are very difficult to train. Your code will give us some practical help.

About the parameter setting of training command

I use the command CUDA_VISIBLE_DEVICES=0,3 OMP_NUM_THREADS=4 python -m torch.distributed.launch --nproc_per_node=2 train.py --id retrain_s0 --stage 0 to train the model, but an error occurred.
The error message is as follows:

CUDA Device count: 2
CUDA Device count: 2
I am rank 0 in this world of size 2!
I will take the role of logging!
I am rank 1 in this world of size 2!
Traceback (most recent call last):
File "train.py", line 55, in
logger = TensorboardLogger(para['id'], long_id)
File "/root/projects/code/STCN/util/logger.py", line 46, in init
repo = git.Repo(".")
File "/root/miniconda3/envs/pytracking/lib/python3.8/site-packages/git/repo/base.py", line 220, in init
self.working_dir = self._working_tree_dir or self.common_dir # type: Optional[PathLike]
File "/root/miniconda3/envs/pytracking/lib/python3.8/site-packages/git/repo/base.py", line 303, in common_dir
raise InvalidGitRepositoryError()
git.exc.InvalidGitRepositoryError
Traceback (most recent call last):
File "train.py", line 64, in
model = STCNModel(para, local_rank=local_rank, world_size=world_size).train()
File "/root/projects/code/STCN/model/model.py", line 26, in init
STCN(self.single_object).cuda(),
File "/root/projects/code/STCN/model/network.py", line 80, in init
self.key_encoder = KeyEncoder()
File "/root/projects/code/STCN/model/modules.py", line 129, in init
resnet = models.resnet50(pretrained=True)
File "/root/miniconda3/envs/pytracking/lib/python3.8/site-packages/torchvision/models/resnet.py", line 300, in resnet50
return _resnet('resnet50', Bottleneck, [3, 4, 6, 3], pretrained, progress,
File "/root/miniconda3/envs/pytracking/lib/python3.8/site-packages/torchvision/models/resnet.py", line 262, in _resnet
state_dict = load_state_dict_from_url(model_urls[arch],
File "/root/miniconda3/envs/pytracking/lib/python3.8/site-packages/torch/hub.py", line 528, in load_state_dict_from_url
return torch.load(cached_file, map_location=map_location)
File "/root/miniconda3/envs/pytracking/lib/python3.8/site-packages/torch/serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/root/miniconda3/envs/pytracking/lib/python3.8/site-packages/torch/serialization.py", line 762, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: unpickling stack underflow
Killing subprocess 24445
Killing subprocess 24446
Traceback (most recent call last):
File "/root/miniconda3/envs/pytracking/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/miniconda3/envs/pytracking/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/miniconda3/envs/pytracking/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in
main()
File "/root/miniconda3/envs/pytracking/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/root/miniconda3/envs/pytracking/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/miniconda3/envs/pytracking/bin/python', '-u', 'train.py', '--local_rank=1', '--id', 'retrain_s0', '--stage', '0']' returned non-zero exit status 1.

I didn't set the parameter master_port and kept the parameter id unchanged because I failed to understand the tutorial.
Do you know the cause of the problem, and could you kindly teach me how to set the parameters master_port and id?

Pretraining for the key and value encoder

Your work is excellent and interesting, and thanks for your code.
Now I'm conducting some experiments on modifying and training your model, but I have met some problems with pretraining.
I modified the model structure and retrained the model following your settings, but the results are not as expected. How can I ensure that the pretraining process is adequate? Or how can I know whether more pretraining is needed?
Also, I don't know the difference between static-image training and BL30K training except for multi-object handling, nor the influence of training on BL30K.
Many thanks if you could notice this issue and answer my questions.
