
groupvit's Introduction

GroupViT: Semantic Segmentation Emerges from Text Supervision

GroupViT is a framework for learning semantic segmentation purely from text captions, without using any mask supervision. It learns to perform bottom-up hierarchical spatial grouping of semantically related visual regions. This repository is the official implementation of GroupViT introduced in the paper:

GroupViT: Semantic Segmentation Emerges from Text Supervision, Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang, CVPR 2022.

Visual Results

Links

Citation

If you find our work useful in your research, please cite:

@article{xu2022groupvit,
  author    = {Xu, Jiarui and De Mello, Shalini and Liu, Sifei and Byeon, Wonmin and Breuel, Thomas and Kautz, Jan and Wang, Xiaolong},
  title     = {GroupViT: Semantic Segmentation Emerges from Text Supervision},
  journal   = {arXiv preprint arXiv:2202.11094},
  year      = {2022},
}

Environmental Setup

  • Python 3.7
  • PyTorch 1.8
  • webdataset 0.1.103
  • mmsegmentation 0.18.0
  • timm 0.4.12

Instructions:

conda create -n groupvit python=3.7 -y
conda activate groupvit
conda install pytorch==1.8.0 torchvision==0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge
pip install mmcv-full==1.3.14 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.8.0/index.html
pip install mmsegmentation==0.18.0
pip install webdataset==0.1.103
pip install timm==0.4.12
git clone https://github.com/NVIDIA/apex
cd apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
pip install opencv-python==4.4.0.46 termcolor==1.1.0 diffdist einops omegaconf
pip install nltk ftfy regex tqdm

Demo

python demo/demo_seg.py --cfg configs/group_vit_gcc_yfcc_30e.yml --resume /path/to/checkpoint --vis input_pred_label final_group --input demo/examples/voc.jpg --output_dir demo/output

The output is saved in demo/output/.

Benchmark Results

                      Zero-shot Classification   Zero-shot Segmentation
config                ImageNet                   Pascal VOC   Pascal Context   COCO
GCC + YFCC (cfg)      43.7                       52.3         22.4             24.3
GCC + RedCaps (cfg)   51.6                       50.8         23.7             27.5

Pre-trained weights group_vit_gcc_yfcc_30e-879422e0.pth and group_vit_gcc_redcap_30e-3dd09a76.pth for these models are provided by Jiarui Xu here.

Data Preparation

During training, we use webdataset for scalable data loading. To convert image-text pairs into the webdataset format, we use the img2dataset tool to download and preprocess the dataset.
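
For reference, the resulting shards can be read back with the webdataset library roughly as follows. This is a minimal sketch assuming a recent WebDataset API (the pinned 0.1.103 release may differ slightly); the shard pattern is illustrative.

import webdataset as wds

# Illustrative shard pattern; adjust to the shards actually present on disk.
shards = 'local_data/gcc3m_shards/gcc-train-{000000..000436}.tar'

# Each sample stores the image under .jpg/.png and the caption under .txt.
dataset = (
    wds.WebDataset(shards)
    .decode('pil')               # decode images to PIL.Image
    .to_tuple('jpg;png', 'txt')  # yield (image, caption) pairs
)

image, caption = next(iter(dataset))
print(image.size, caption[:60])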

For inference, we use mmsegmentation for semantic segmentation testing, evaluation and visualization on Pascal VOC, Pascal Context and COCO datasets.

The overall file structure is as follows:

GroupViT
β”œβ”€β”€ local_data
β”‚   β”œβ”€β”€ gcc3m_shards
β”‚   β”‚   β”œβ”€β”€ gcc-train-000000.tar
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”‚   β”œβ”€β”€ gcc-train-000436.tar
β”‚   β”œβ”€β”€ gcc12m_shards
β”‚   β”‚   β”œβ”€β”€ gcc-conceptual-12m-000000.tar
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”‚   β”œβ”€β”€ gcc-conceptual-12m-001943.tar
β”‚   β”œβ”€β”€ yfcc14m_shards
β”‚   β”‚   β”œβ”€β”€ yfcc14m-000000.tar
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”‚   β”œβ”€β”€ yfcc14m-001888.tar
β”‚   β”œβ”€β”€ redcap12m_shards
β”‚   β”‚   β”œβ”€β”€ redcap12m-000000.tar
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”‚   β”œβ”€β”€ redcap12m-001211.tar
β”‚   β”œβ”€β”€ imagenet_shards
β”‚   β”‚   β”œβ”€β”€ imagenet-val-000000.tar
β”‚   β”‚   β”œβ”€β”€ ...
β”‚   β”‚   β”œβ”€β”€ imagenet-val-000049.tar
β”‚   β”œβ”€β”€ VOCdevkit
β”‚   β”‚   β”œβ”€β”€ VOC2012
β”‚   β”‚   β”‚   β”œβ”€β”€ JPEGImages
β”‚   β”‚   β”‚   β”œβ”€β”€ SegmentationClass
β”‚   β”‚   β”‚   β”œβ”€β”€ ImageSets
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ Segmentation
β”‚   β”‚   β”œβ”€β”€ VOC2010
β”‚   β”‚   β”‚   β”œβ”€β”€ JPEGImages
β”‚   β”‚   β”‚   β”œβ”€β”€ SegmentationClassContext
β”‚   β”‚   β”‚   β”œβ”€β”€ ImageSets
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ SegmentationContext
β”‚   β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ train.txt
β”‚   β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ val.txt
β”‚   β”‚   β”‚   β”œβ”€β”€ trainval_merged.json
β”‚   β”‚   β”œβ”€β”€ VOCaug
β”‚   β”‚   β”‚   β”œβ”€β”€ dataset
β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ cls
β”‚   β”œβ”€β”€ coco
β”‚   β”‚   β”œβ”€β”€ images
β”‚   β”‚   β”‚   β”œβ”€β”€ train2017
β”‚   β”‚   β”‚   β”œβ”€β”€ val2017
β”‚   β”‚   β”œβ”€β”€ annotations
β”‚   β”‚   β”‚   β”œβ”€β”€ train2017
β”‚   β”‚   β”‚   β”œβ”€β”€ val2017

The instructions for preparing each dataset are as follows.

GCC3M

Please download the training split annotation file from Conceptual Captions 3M (CC3M) and name it gcc3m.tsv.

Then run img2dataset to download the image text pairs and save them in the webdataset format.

sed -i '1s/^/caption\turl\n/' gcc3m.tsv
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
            --url_col "url" --caption_col "caption" --output_format webdataset \
            --output_folder local_data/gcc3m_shards \
            --processes_count 16 --thread_count 64 \
            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
            --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-train-/' local_data/gcc3m_shards/*

Please refer to img2dataset CC3M tutorial for more details.

GCC12M

Please download the annotation file from Conceptual Caption 12M and name it as gcc12m.tsv.

Then run img2dataset to download the image text pairs and save them in the webdataset format.

sed -i '1s/^/url\tcaption\n/' gcc12m.tsv
img2dataset --url_list gcc12m.tsv --input_format "tsv" \
            --url_col "url" --caption_col "caption" --output_format webdataset \
            --output_folder local_data/gcc12m_shards \
            --processes_count 16 --thread_count 64 \
            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
            --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-conceptual-12m-/' local_data/gcc12m_shards/*

Please refer to img2dataset CC12M tutorial for more details.

YFCC14M

Please follow the CLIP Data Preparation instructions to download the YFCC14M subset.

wget https://openaipublic.azureedge.net/clip/data/yfcc100m_subset_data.tsv.bz2
bunzip2 yfcc100m_subset_data.tsv.bz2

Then run the preprocessing script to create the subset sql db and annotation tsv files. This may take a while.

python convert_dataset/create_subset.py --input-dir . --output-dir . --subset yfcc100m_subset_data.tsv

This script will create two files: an SQLite db called yfcc100m_dataset.sql and an annotation tsv file called yfcc14m_dataset.tsv.

Then follow the YFCC100M Download Instruction to download the dataset and its metadata file.

pip install git+https://gitlab.com/jfolz/yfcc100m.git
mkdir -p yfcc100m_meta
python -m yfcc100m.convert_metadata . -o yfcc100m_meta --skip_verification
mkdir -p yfcc100m_zip
python -m yfcc100m.download yfcc100m_meta -o yfcc100m_zip

Finally convert the dataset into the webdataset format.

python convert_dataset/convert_yfcc14m.py --root yfcc100m_zip --info yfcc14m_dataset.tsv --shards yfcc14m_shards

RedCaps12M

Please download the annotation file from RedCaps.

wget https://www.dropbox.com/s/cqtdpsl4hewlli1/redcaps_v1.0_annotations.zip?dl=1
unzip redcaps_v1.0_annotations.zip

Then run the preprocessing script and img2dataset to download the image text pairs and save them in the webdataset format.

python convert_dataset/process_redcaps.py annotations redcaps12m_meta/redcaps12m.parquet --num-split 16
img2dataset --url_list redcaps12m_meta/ --input_format "parquet" \
            --url_col "URL" --caption_col "TEXT" --output_format webdataset \
            --output_folder local_data/redcap12m_shards \
            --processes_count 16 --thread_count 64 \
            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
            --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/redcap12m-/' local_data/redcap12m_shards/*

ImageNet

Please follow the webdataset ImageNet Example to convert ImageNet into the webdataset format.
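
For orientation, a conversion along those lines can be sketched with webdataset's ShardWriter as below. This is a hedged illustration (the directory layout, shard size, and cls convention are assumptions, not the exact recipe from the linked example).

import os
import webdataset as wds

root = 'imagenet/val'  # assumed layout: val/<wnid>/<image>.JPEG
wnids = sorted(os.listdir(root))
class_index = {wnid: i for i, wnid in enumerate(wnids)}

# 50 shards of ~1000 images each, matching imagenet-val-000000.tar ... imagenet-val-000049.tar
with wds.ShardWriter('local_data/imagenet_shards/imagenet-val-%06d.tar', maxcount=1000) as sink:
    for wnid in wnids:
        for fname in sorted(os.listdir(os.path.join(root, wnid))):
            with open(os.path.join(root, wnid, fname), 'rb') as f:
                sink.write({
                    '__key__': os.path.splitext(fname)[0],
                    'jpg': f.read(),                         # raw JPEG bytes
                    'cls': str(class_index[wnid]).encode(),  # class index as text
                })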

Pascal VOC

Please follow the MMSegmentation Pascal VOC Preparation instructions to download and setup the Pascal VOC dataset.

Pascal Context

Please refer to the MMSegmentation Pascal Context Preparation instructions to download and setup the Pascal Context dataset.

COCO

The COCO dataset is an object detection dataset with instance segmentation annotations. To evaluate GroupViT, we combine all the instance masks of each category and generate semantic segmentation maps. To generate the semantic segmentation maps, please follow MMSegmentation's documentation to download the COCO-Stuff-164k dataset first and then run the following command:

python convert_dataset/convert_coco.py local_data/coco/ -o local_data/coco/
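
For intuition, merging instance masks into per-category semantic maps works roughly as in the sketch below. It uses pycocotools on the standard instances annotations; the paths and ignore-label convention are assumptions, and the repository's convert_coco.py remains the authoritative script.

import numpy as np
from PIL import Image
from pycocotools.coco import COCO

coco = COCO('local_data/coco/annotations/instances_val2017.json')  # assumed path

for img_id in coco.getImgIds():
    info = coco.loadImgs(img_id)[0]
    # 255 = unlabeled; the label convention in convert_coco.py may differ.
    sem_map = np.full((info['height'], info['width']), 255, dtype=np.uint8)
    for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
        mask = coco.annToMask(ann).astype(bool)
        sem_map[mask] = ann['category_id']  # all instances of a category get one label
    Image.fromarray(sem_map).save(f'{img_id:012d}_semantic.png')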

Run Experiments

Pre-train

Train on a single node:

(node0)$ ./tools/dist_launch.sh main_group_vit.py /path/to/config $GPUS_PER_NODE

For example, to train on a node with 8 GPUs, run:

(node0)$ ./tools/dist_launch.sh main_group_vit.py configs/group_vit_gcc_yfcc_30e.yml 8

Train on multiple nodes:

(node0)$ ./tools/dist_mn_launch.sh main_group_vit.py /path/to/config $NODE_RANK $NUM_NODES $GPUS_PER_NODE $MASTER_ADDR
(node1)$ ./tools/dist_mn_launch.sh main_group_vit.py /path/to/config $NODE_RANK $NUM_NODES $GPUS_PER_NODE $MASTER_ADDR

For example, to train on two nodes with 8 GPUs each, run:

(node0)$ ./tools/dist_mn_launch.sh main_group_vit.py configs/group_vit_gcc_yfcc_30e.yml 0 2 8 tcp://node0
(node1)$ ./tools/dist_mn_launch.sh main_group_vit.py configs/group_vit_gcc_yfcc_30e.yml 1 2 8 tcp://node0

We used 16 NVIDIA V100 GPUs for pre-training in our paper; training takes about 2 days.

Zero-shot Transfer to Image Classification

ImageNet

./tools/dist_launch.sh main_group_vit.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --eval

Zero-shot Transfer to Semantic Segmentation

Pascal VOC

./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint

Pascal Context

./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --opts evaluate.seg.cfg segmentation/configs/_base_/datasets/pascal_context.py

COCO

./tools/dist_launch.sh main_seg.py /path/to/config $NUM_GPUS --resume /path/to/checkpoint --opts evaluate.seg.cfg segmentation/configs/_base_/datasets/coco.py

groupvit's People

Contributors

shalinidemello, xvjiarui

groupvit's Issues

Background threshold.

Many thanks for the good work. I am curious how the background thresholds for PASCAL VOC and Pascal Context were determined.

Non parametric grouping

Hi,

Can you please provide more details on how you perform the non-parametric grouping on CLIP's features (obtained from the ViT encoder)?

Mistakes in the GCC Dataset download commands

Hi! I realized that the commands for downloading GCC 3M and 12M have a couple of typos. The corrected version for the 12M is below:

sed -i '1s/^/url\tcaption\n/' gcc12m.tsv
img2dataset --url_list gcc12m.tsv --input_format "tsv" \
            --url_col "url" --caption_col "caption" --output_format webdataset\
            --output_folder local_data/gcc12m_shards \
            --processes_count 16 --thread_count 64 \
            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
            --enable_wandb True --save_metadata False --oom_shard_count 6

It would be nice if you could update the README.md accordingly.
Great work! Thanks for sharing it.

Parallel training hangs in reduce_tensor method

Hi,

I was training the network with validation on coco dataset using 8 gpus on a single node. It seems like the network hangs while using reduce_tensor method inside validate_cls(). Is there a solution known for this issue?

Inconsistent Results from the paper

Thanks for your great work!

When I evaluate your pretrained model on the VOC and Pascal Context datasets, I cannot reproduce the results reported in the paper. I am not sure what is preventing me from matching the reported numbers, so I hope you can help me figure this out. Thanks!

The results I got from your pretrained model group_vit_gcc_yfcc_30e-879422e0.pth on VOC 2012 are:
+------------+-------+-------+
| Class | IoU | Acc |
+------------+-------+-------+
| background | 80.53 | 95.76 |
| aeroplane | 40.76 | 51.58 |
| bicycle | 36.07 | 70.07 |
| bird | 54.19 | 56.7 |
| boat | 36.01 | 51.18 |
| bottle | 32.97 | 36.98 |
| bus | 31.66 | 33.96 |
| car | 51.58 | 55.66 |
| cat | 76.48 | 84.09 |
| chair | 14.19 | 17.46 |
| cow | 63.01 | 67.99 |
| table | 21.48 | 26.53 |
| dog | 70.72 | 76.04 |
| horse | 70.28 | 75.15 |
| motorbike | 40.7 | 50.26 |
| person | 18.89 | 19.19 |
| plant | 28.56 | 33.04 |
| sheep | 55.39 | 60.27 |
| sofa | 28.03 | 34.41 |
| train | 23.3 | 28.74 |
| monitor | 24.64 | 38.76 |
+------------+-------+-------+
Summary:

+-------+-------+-------+
| aAcc | mIoU | mAcc |
+-------+-------+-------+
| 82.38 | 42.83 | 50.66 |
+-------+-------+-------+

The results for Pascal Context are:

+------------+-------+-------+
| Class | IoU | Acc |
+------------+-------+-------+
| background | 13.48 | 89.03 |
| airplane | 43.22 | 51.06 |
| bag | 5.67 | 7.07 |
| bed | 11.9 | 13.89 |
| bedclothes | 0.0 | 0.0 |
| bench | 8.74 | 17.63 |
| bicycle | 48.39 | 59.64 |
| bird | 52.0 | 57.12 |
| boat | 34.42 | 42.73 |
| book | 1.3 | 1.36 |
| bottle | 37.57 | 42.64 |
| building | 1.0 | 1.0 |
| bus | 34.33 | 37.25 |
| cabinet | 6.01 | 8.21 |
| car | 42.33 | 45.41 |
| cat | 70.61 | 81.02 |
| ceiling | 3.34 | 3.4 |
| chair | 14.0 | 15.47 |
| cloth | 2.8 | 3.2 |
| computer | 5.81 | 18.75 |
| cow | 58.19 | 67.05 |
| cup | 2.35 | 2.87 |
| curtain | 1.63 | 1.7 |
| dog | 63.04 | 69.54 |
| door | 0.0 | 0.0 |
| fence | 1.35 | 1.39 |
| floor | 0.86 | 0.87 |
| flower | 3.96 | 4.92 |
| food | 13.98 | 18.65 |
| grass | 0.34 | 0.34 |
| ground | 0.21 | 0.21 |
| horse | 61.52 | 69.12 |
| keyboard | 4.97 | 5.65 |
| light | 1.24 | 1.5 |
| motorbike | 40.99 | 48.74 |
| mountain | 9.55 | 10.09 |
| mouse | 0.0 | 0.0 |
| person | 16.52 | 16.85 |
| plate | 7.14 | 12.29 |
| platform | 2.16 | 3.59 |
| plant | 16.85 | 17.64 |
| road | 1.09 | 1.12 |
| rock | 6.24 | 6.38 |
| sheep | 48.16 | 55.33 |
| shelves | 2.7 | 2.85 |
| sidewalk | 0.65 | 0.69 |
| sign | 2.84 | 2.97 |
| sky | 1.15 | 1.15 |
| snow | 6.41 | 7.25 |
| sofa | 20.15 | 23.2 |
| table | 13.88 | 18.24 |
| track | 7.75 | 8.52 |
| train | 26.09 | 30.97 |
| tree | 1.36 | 1.36 |
| truck | 1.54 | 2.65 |
| monitor | 14.22 | 15.41 |
| wall | 0.02 | 0.02 |
| water | 3.47 | 3.5 |
| window | 2.31 | 2.4 |
| wood | 0.47 | 0.52 |
+------------+-------+-------+
Summary:

+-------+-------+-------+
| aAcc | mIoU | mAcc |
+-------+-------+-------+
| 24.44 | 15.07 | 18.89 |
+-------+-------+-------+

The meaning of pre_assign_attn

Thank you for sharing such a nice work.
I have some questions.
(1) What is the meaning of self.pre_assign_attn? Has it been described in the paper?

projected_group_tokens = self.pre_assign_attn(projected_group_tokens, x)

(2) Does self.assign represent equation (3), (4), and part of (5) in the paper?

Hard assignment

Dear authors,
Thanks for sharing this nice work.
Do training and testing for the classification task use hard assignment, while inference for segmentation uses soft assignment?

Question about zero-shot transfer to semantic segmentation

Hi, thank you for your great work.

I noticed that during the generation of segmentation masks, soft assignment matrices are used instead of hard assignment matrices (from segmentation/evaluation/group_vit_seg.py, line 166). Although the product of the soft assignment matrices is converted to one-hot before classifying pixels, this is somewhat different from your paper, which suggests that we should directly multiply hard assignment matrices.

In fact, by changing the code in line 166 from attn_masks = attn_dict['soft'] to attn_masks = attn_dict['hard'], the demo yields worse segmentation results.

Am I misunderstanding the code or missing some implementation details from your paper?
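
For context on the hard/soft distinction discussed above: it comes down to the straight-through Gumbel-Softmax trick. The sketch below is illustrative only (names and shapes do not match the repository code); it shows why the hard path still trains, and why swapping hard for soft at inference changes the masks.

import torch
import torch.nn.functional as F

def group_assign(group_logits, hard=True, tau=1.0):
    # group_logits: (batch, num_groups, num_tokens) similarity of each token to each group
    soft = F.gumbel_softmax(group_logits, tau=tau, hard=False, dim=1)
    if not hard:
        return soft  # soft assignment: every token spreads over all groups
    # Hard assignment: one-hot over groups, with gradients routed through the soft
    # probabilities (straight-through estimator).
    index = soft.argmax(dim=1, keepdim=True)
    one_hot = torch.zeros_like(soft).scatter_(1, index, 1.0)
    return one_hot - soft.detach() + soft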

dist_train.sh not found

Hi,

Thank you for your great work. However, it seems the pre-training script dist_train.sh is missing. Could you please check? Thank you.

Mistakes in the data preparation command

https://github.com/NVlabs/GroupViT#gcc3m

sed -i '1s/^/caption\turl\n/' gcc3m.tsv
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
            --url_col "url" --caption_col "caption" --output_format webdataset\
            --output_folder local_data/gcc3m_shards
            --processes_count 16 --thread_count 64
            --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
            --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-train-/' local_data/gcc3m_shards/*
  1. After checking the documentation of img2dataset, I find that there is no argument --save_metadata for img2dataset and this may cause an error.

Learnable group tokens fine-tuning on out-of-domain datasets

Hi! Thank you for the great work and neat implementation.

In section C.4 of the paper you reported results from testing the pretrained GroupViT model on the COCO dataset, which were quite impressive but not as good as the ones on PASCAL VOC. This is probably due to the domain shift between PASCAL VOC and COCO images, classes, and text descriptions.

I was wondering if it's possible to fine-tune GroupViT on COCO dataset (and out-of-domain datasets in general) by freezing the model's weights and training only the learnable group tokens on the new datasets in a few-shot manner.

If that's the case, I'm ready to implement this with your guidance.

Problem solved.

Hi, Thanks for sharing the great work!

My initial guess was that stage 2 groups the results from stage 1, forming a hierarchical grouping pipeline.
However, when reading the source code, it seems that stage 2 just groups the image patches.
Is my understanding right? And why not just group the results from stage 1?

Thanks for your reading!

Confusion about dataset setting and label's definition

Dear Author,
Thanks for your work! I have some questions about the code:
(1) Dataset splitting
Assume we have a custom dataset called Seg_Caption, which has both segmentation mask annotations and caption annotations. Can we split the dataset in the following way:
Specifically, Seg_Caption_train includes only the input image and caption, while Seg_Caption_val includes the input image, caption, and segmentation mask. The segmentation mask is used to compare the model's segmentation prediction against the ground truth, and we only enable the seg task during evaluation.

train:
  - Seg_Caption_train
val:
  - Seg_Caption_val
evaluate:
  task:
    - seg

(2) label's definition in the code
labels = torch.arange(batch_size, dtype=torch.long, device=image_x.device) + batch_size * dist.get_rank()
loss_img = self.cross_entropy(logits_per_img * logit_scale, labels)
loss_text = self.cross_entropy(logits_per_text * logit_scale, labels)

Does this label make sense? It's hard to understand what this loss is trying to do (see the sketch after this issue).

(3) WebDataset format
Following this video tutorial: https://www.youtube.com/watch?v=v_PacO-3OGQ
I tried to use "tar --sort=name -cf ../dataset.tar ." to convert my dataset into the required format,
but training stops after the following step:
(main_group_vit.py 284): INFO Train: [0/30][0/750] eta 0:15:46 lr 0.000000 time 1.2622 (1.2622) total_loss 2.3884 (2.3884) loss 0.7358 (0.7358) multi_label_loss 1.6526 (1.6526) grad_norm nan (nan) mem 4184MB
Do you know why this happens?

Thanks for your help!
Mengya Xu
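
For context on question (2) above: this is the standard CLIP-style image-text contrastive objective, in which the i-th image in the global batch should match the i-th caption, so the target index for row i is simply i (offset by batch_size * rank when logits are gathered across GPUs). A minimal single-GPU sketch with illustrative names:

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, logit_scale):
    # image_feats, text_feats: (batch, dim), assumed L2-normalized
    logits_per_img = logit_scale * image_feats @ text_feats.t()  # (batch, batch)
    logits_per_text = logits_per_img.t()
    # Row i should score highest against column i, so the "class" of row i is i.
    labels = torch.arange(image_feats.size(0), device=image_feats.device)
    return 0.5 * (F.cross_entropy(logits_per_img, labels) +
                  F.cross_entropy(logits_per_text, labels))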

Question about visualization

Hi, thanks for the amazing work! I have two questions about the paper. In my understanding, the resolution of the mask is restricted by the number of patches. The paper says that during testing the shorter side is resized to 448. With a patch size of 16, the mask resolution is 28 along the shorter side.

  1. Many of the visualization results in the paper show very smooth masks. Some have very high-frequency noise (very small isolated dots in a mask). I did some rough calculations, and the masks do not seem to have a resolution of only 28. Do you use an even larger image size for the visualization results? Or am I missing something?
  2. If I understand correctly, you are using 224x224 images for training, but use 448 directly for testing without any further modification to the model. There should be a data distribution mismatch between training and testing, right? Do you have any sense of how this affects the results? For example, maybe test on 448x448 ImageNet images to see how the accuracy changes?

Thank you!
Jiahao

Confused by the visualization results

Hi, Thank you for your great work.
I am confused by the visualization results. In my understanding, the final assignment is per image patch (pixels in the same patch get the same class/group).
The training setting is an image resolution of 224x224 and a patch size of 16x16.
I would expect a clear grid effect in the final visualization.
But in your paper, the visualization results show perfectly smooth curves.

Best.
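
A note on the arithmetic behind the two visualization questions above: with the shorter side resized to 448 and a patch size of 16, the raw assignment map is only 28 tokens along that side, so smooth boundaries in the figures would have to come from upsampling the low-resolution assignment before taking the per-pixel argmax. The snippet below illustrates that effect; it is an assumed post-processing step, not necessarily the authors' exact plotting code.

import torch
import torch.nn.functional as F

num_groups, h, w = 8, 28, 28                     # 448 / 16 = 28 tokens per side
assign_lowres = torch.rand(1, num_groups, h, w)  # stand-in for a soft assignment map

# Bilinear upsampling to the input resolution, then argmax: group boundaries
# become smooth curves instead of 16-pixel blocks.
assign_fullres = F.interpolate(assign_lowres, size=(448, 448),
                               mode='bilinear', align_corners=False)
mask = assign_fullres.argmax(dim=1)              # (1, 448, 448) per-pixel group index
print(mask.shape)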

IoU for each category in PASCAL VOC

Thanks a lot for your contribution!

Do you have per-category IoU for the PASCAL VOC dataset? This would help understand how your model performs for each category and sometimes helps to interpret results.
Also, are the background-class masks obtained with the query "background"?

Unable to reproduce the results on PASCAL VOC.

Dear Authors,

I tried training the GroupViT model on the GCC + YFCC datasets using the group_vit_gcc_yfcc_30e.yml config file with a batch size of 2048 (256x8). The results on PASCAL VOC after 30 epochs of training are roughly 5% mIoU lower (absolute) than what is reported in the paper. Additionally, I tried training with gradient accumulation over 2 steps, which did not give any improvement. Do you have any suggestions on what can cause the lower performance?

Thank you.

About initialization and visualization of group tokens

Hi, @xvjiarui @shalinidemello
Thank you for presenting this good work.

I have some questions about the group tokens.
It seems that group_token is initialized as None at line 839 of ./models/group_vit.py.
So how is group_token made learnable? Why not set it as a learnable parameter?

What's more, how do we get the segmentation mask from the model? It seems that the "class GroupViT" in ./models/group_vit.py only returns the image features?

These points confuse me.
Could you please help clear up my confusion?

How to train on custom datasets?

Hello, thank you very much for your work. I found that the data processing in the code targets public datasets. I have some image-text pair datasets of my own and would like to fine-tune the pre-trained model you published, so I would like to ask how to handle custom datasets. Looking forward to your reply.

Minor error in Apex install

The current command for installing apex
cd && apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Needs to be replaced by
cd apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

libtorch_cuda_cu.so

Could you help me solve this problem? libtorch_cuda_cu.so already exists in the directory "/home/user/anaconda3/envs/groupvit/lib/python3.7/site-packages/torch/lib".


Mask Probing

Hi guys,

Do you have any plans to release the code for evaluating the mask quality (mask probing) mentioned in Appendix C.2, Table C.2?
It would be really helpful. Thank you!

Regards

As the training progresses, the memory required increases

Hi, thanks for your great work.
During training, I found that memory usage gradually increases until it runs out of memory. My server has 252 GB of memory. The training dataset is GCC12M and the validation dataset is VOC2012.
The Traceback message:
RuntimeError: DataLoader worker (pid xxxxx) is killed by signal: Killed.
Could you please help me solve it?

Unstable results

Hi, thanks for your contribution!

I've re-run the model twice with GCC + RedCaps. The first time, I got 51.9 on ImageNet and 11.6 on COCO, while the second time, I got 54.3 on ImageNet and 25.8 on COCO. It seems that the results are not stable.

Is this a normal phenomenon for this model?

Question about the mask generation

Hi @xvjiarui, thanks for your great work! I'm confused about how you generated the smooth object masks from the grouping allocation over the 16 * 16 patches of each image. Why are the masks not aligned with the 16 * 16 grid?

how to Train on multiple nodes

The command is
(node0)$ ./tools/dist_mn_launch.sh main_group_vit.py configs/group_vit_gcc_yfcc_30e.yml 0 2 8 tcp://node0
(node1)$ ./tools/dist_mn_launch.sh main_group_vit.py configs/group_vit_gcc_yfcc_30e.yml 1 2 8 tcp://node0
I want to ask what tcp://node0 is and how I should set it.
Thank you very much.

Not Comparable Results

Hi, I have trained GroupViT under your default settings with RedCaps + GCC3M + GCC12M, and I ran the experiments twice, only achieving 26.8% mIoU and 27.9% mIoU. According to previous discussions in the closed issues, this may be caused by the training data size. So could you please tell me the exact number of samples in RedCaps, GCC3M, GCC12M, and YFCC100M?

Resource punkt not found. Please use the NLTK Downloader to obtain the resource.

Hello, when I run the code and it reaches line 220 of main_group_vit.py,
for idx, samples in enumerate(data_loader):
it shows the following warnings:

/data/miniconda3/envs/groupvit/lib/python3.7/site-packages/torchvision/transforms/functional.py:365: UserWarning: Argument interpolation should be of type InterpolationMode instead of int. Please, use InterpolationMode enum.
/group/20016/zhangyangqi/debug/GroupViT/datasets/builder.py:105: UserWarning: LookupError("Resource punkt not found. Please use the NLTK Downloader to obtain the resource: >>> import nltk >>> nltk.download('punkt'). For more information see: https://www.nltk.org/data.html. Attempted to load tokenizers/punkt/PY3/english.pickle. Searched in: /usr/local/app/nltk_data, /data/miniconda3/envs/groupvit/nltk_data, /data/miniconda3/envs/groupvit/share/nltk_data, /data/miniconda3/envs/groupvit/lib/nltk_data, /usr/share/nltk_data, /usr/local/share/nltk_data, /usr/lib/nltk_data, /usr/local/lib/nltk_data")
(Both warnings appear twice in the log.)

I do not know what this means or what I should do next.

command

sed -i '1s/^/caption\turl\n/' gcc3m.tsv
How do I write this command on Windows?
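
For anyone else hitting this on Windows: the sed line only prepends a header row to the TSV, which can be done with a few lines of Python instead (a sketch; the file name follows the README).

# Prepend the "caption<TAB>url" header line that the img2dataset call expects.
with open('gcc3m.tsv', 'r', encoding='utf-8') as f:
    body = f.read()
with open('gcc3m.tsv', 'w', encoding='utf-8') as f:
    f.write('caption\turl\n' + body)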

single GPU

Hello, can the model be trained on a single GPU?

Inquiry about the Grouping Blocks

Hi @xvjiarui, Thanks for your great work on GroupViT!

I notice that in your implementation of the grouping blocks here, L280-L281 and L285 are not mentioned in the paper, especially the pre-attention part.

May I ask whether these designs are crucial for equipping the model with grouping abilities? I have failed to reproduce your results using only the information in the paper, i.e., implementing the grouping block simply as a cross-attention block (q: group tokens, kv: visual tokens) with Gumbel-Softmax and a hard assignment strategy.

Checkpoint for the model trained just with 1 grouping stage

Hi,

Thank you for your work :)
I was curious about your work and noticed the ablation study on grouping stages (Table 3). I was hoping you could also provide the checkpoint for the model trained with only 1 grouping stage. I would like to fine-tune the model on a smaller dataset and evaluate its performance :)

Best
Ayushi

why not softmax?

Hi, Thanks for the great work.
Why do you have to use one-hot or Gumbel-Softmax assignment? Why not use plain softmax?

Clarification on the training datasets used.

Hi, for the model that achieves 52.3% mIoU on PASCAL VOC, the paper says that GCC12M + YFCC14M are used for training, whereas in the config files of this repository GCC3M is also used. Which one is correct?

If GCC3M is also used, are the pre-trainings for fully supervised transfer in Table 5 done using GCC3M as well?

Thank you in advance.

Testing on PASCAL VOC is very slow

Hi!
Currently, I'm working with your code and have been able to train on multiple GPUs without major issues. However, I had to disable the testing on PASCAL VOC because it takes forever and the code gets stuck. The last messages I can see are the following:

All checkpoints founded in output/group_vit_gcc_3m_bs712x4: []
All checkpoints founded in output/group_vit_gcc_3m_bs712x4: []
All checkpoints founded in output/group_vit_gcc_3m_bs712x4: []
All checkpoints founded in output/group_vit_gcc_3m_bs712x4: []
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0
Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 32768.0


[                           ] 20/1449, 0.0 task/s, elapsed: 9609s, ETA: 686561s(base)

And

[2022-04-10 10:10:18 group_vit_gcc_3m_bs712x4](main_group_vit.py 294): INFO EPOCH 0 training takes 0:27:05
[2022-04-10 10:10:19 group_vit_gcc_3m_bs712x4](checkpoint.py 92): INFO output/group_vit_gcc_3m_bs712x4/ckpt_epoch_0.pth saving......
[2022-04-10 10:10:20 group_vit_gcc_3m_bs712x4](checkpoint.py 95): INFO output/group_vit_gcc_3m_bs712x4/ckpt_epoch_0.pth saved !!!
[2022-04-10 10:10:20 group_vit_gcc_3m_bs712x4](main_group_vit.py 160): INFO Avg loss of the network on the 2876225 train images: 15.16
[2022-04-10 10:10:20 group_vit_gcc_3m_bs712x4](main_group_vit.py 317): INFO Building zero shot classifier
[2022-04-10 10:10:25 group_vit_gcc_3m_bs712x4](main_group_vit.py 321): INFO Zero shot classifier built
[2022-04-10 10:10:39 group_vit_gcc_3m_bs712x4](main_group_vit.py 347): INFO Test: [0/17]        Time 19.063 (19.063)    Loss 6.8310 (6.8310)    Acc@1 1.545 (1.545)     Acc@5 6.180 (6.180)     Mem 71248MB
[2022-04-10 10:12:43 group_vit_gcc_3m_bs712x4](main_group_vit.py 347): INFO Test: [10/17]       Time 13.185 (13.037)    Loss 6.8311 (6.8282)    Acc@1 1.990 (1.904)     Acc@5 6.739 (6.734)     Mem 71248MB
[2022-04-10 10:13:53 group_vit_gcc_3m_bs712x4](main_group_vit.py 353): INFO Clearing zero shot classifier
[2022-04-10 10:13:53 group_vit_gcc_3m_bs712x4](main_group_vit.py 355): INFO  * Acc@1 1.903 Acc@5 6.735
[2022-04-10 10:13:53 group_vit_gcc_3m_bs712x4](main_group_vit.py 166): INFO Accuracy of the network on the 50000 test images: 1.9%
[2022-04-10 10:13:53 group_vit_gcc_3m_bs712x4](checkpoint.py 92): INFO output/group_vit_gcc_3m_bs712x4/ckpt_epoch_0_best_acc1.pth saving......
[2022-04-10 10:13:55 group_vit_gcc_3m_bs712x4](checkpoint.py 95): INFO output/group_vit_gcc_3m_bs712x4/ckpt_epoch_0_best_acc1.pth saved !!!
[2022-04-10 10:13:55 group_vit_gcc_3m_bs712x4](main_group_vit.py 173): INFO Max accuracy: 1.90%
[2022-04-10 10:13:55 group_vit_gcc_3m_bs712x4](group_vit_seg.py 142): INFO Building GroupViTSegInference with 21 classes, test_cfg=Config (path: None): {'bg_thresh': 0.95, 'mode': 'slide', 'stride': (224, 224), 'crop_size': (448, 448)}, with_bg=True

Do you have any clue about what is going on? Thanks for your help!
