
segmenter's Introduction

Segmenter: Transformer for Semantic Segmentation

Figure 1 from paper

Segmenter: Transformer for Semantic Segmentation by Robin Strudel*, Ricardo Garcia*, Ivan Laptev and Cordelia Schmid, ICCV 2021.

*Equal Contribution

🔥 Segmenter is now available on MMSegmentation.

Installation

Define OS environment variables pointing to your checkpoint and dataset directories and add them to your .bashrc:

export DATASET=/path/to/dataset/dir

Install PyTorch 1.9, then run pip install . at the root of this repository.

To download ADE20K, use the following command:

python -m segm.scripts.prepare_ade20k $DATASET

Model Zoo

We release models with a Vision Transformer backbone initialized from the improved ViT models.

ADE20K

Segmenter models with ViT backbone:

Name          | mIoU (SS/MS) | # params | Resolution | FPS  | Download
Seg-T-Mask/16 | 38.1 / 38.8  | 7M       | 512x512    | 52.4 | model, config, log
Seg-S-Mask/16 | 45.3 / 46.9  | 27M      | 512x512    | 34.8 | model, config, log
Seg-B-Mask/16 | 48.5 / 50.0  | 106M     | 512x512    | 24.1 | model, config, log
Seg-B/8       | 49.5 / 50.5  | 89M      | 512x512    | 4.2  | model, config, log
Seg-L-Mask/16 | 51.8 / 53.6  | 334M     | 640x640    | -    | model, config, log

Segmenter models with DeiT backbone:

Name          | mIoU (SS/MS) | # params | Resolution | FPS  | Download
Seg-B/16      | 47.1 / 48.1  | 87M      | 512x512    | 27.3 | model, config, log
Seg-B-Mask/16 | 48.7 / 50.1  | 106M     | 512x512    | 24.1 | model, config, log

Pascal Context

Name          | mIoU (SS/MS) | # params | Resolution | FPS | Download
Seg-L-Mask/16 | 58.1 / 59.0  | 334M     | 480x480    | -   | model, config, log

Cityscapes

Name          | mIoU (SS/MS) | # params | Resolution | FPS | Download
Seg-L-Mask/16 | 79.1 / 81.3  | 322M     | 768x768    | -   | model, config, log

Inference

Download a checkpoint together with its configuration into a common folder, for example seg_tiny_mask.

You can generate segmentation maps from your own data with:

python -m segm.inference --model-path seg_tiny_mask/checkpoint.pth -i images/ -o segmaps/ 
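
If you prefer to call the model from Python rather than the CLI, a minimal sketch is shown below. It assumes only that segm.model.factory.load_model returns a (model, variant) pair; the preprocessing (resize plus a fixed normalization) is an illustrative assumption rather than the repository's exact pipeline, so segm.inference remains the reference implementation.

import torch
import torchvision.transforms.functional as TF
from PIL import Image

from segm.model.factory import load_model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, variant = load_model("seg_tiny_mask/checkpoint.pth")
model.to(device).eval()

im = Image.open("images/im0.jpg").convert("RGB")
x = TF.to_tensor(TF.resize(im, (512, 512)))                                   # 3 x 512 x 512
x = TF.normalize(x, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]).unsqueeze(0)   # assumed normalization stats

with torch.no_grad():
    logits = model(x.to(device))    # assumed to return per-pixel class logits (1 x n_cls x H x W)
seg_map = logits.argmax(dim=1)      # 1 x H x W map of predicted class indices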

To evaluate on ADE20K, run the command:

# single-scale evaluation:
python -m segm.eval.miou seg_tiny_mask/checkpoint.pth ade20k --singlescale
# multi-scale evaluation:
python -m segm.eval.miou seg_tiny_mask/checkpoint.pth ade20k --multiscale

Train

Train Seg-T-Mask/16 on ADE20K on a single GPU:

python -m segm.train --log-dir seg_tiny_mask --dataset ade20k \
  --backbone vit_tiny_patch16_384 --decoder mask_transformer

To train Seg-B-Mask/16, simply set vit_base_patch16_384 as the backbone and launch the above command using at least 4 V100 GPUs (~12 minutes per epoch) and up to 8 V100 GPUs (~7 minutes per epoch). The code relies on SLURM environment variables for distributed training.
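
For reference, "SLURM environment variables" means variables such as SLURM_PROCID, SLURM_NTASKS and SLURM_LOCALID that SLURM sets for every task. The sketch below shows how such variables are commonly mapped onto a torch.distributed setup; it is illustrative only, and the repository's own launcher code is the authoritative version.

import os
import torch
import torch.distributed as dist

# Typical mapping from SLURM task variables to torch.distributed concepts (illustrative).
rank = int(os.environ.get("SLURM_PROCID", 0))        # global rank of this task
world_size = int(os.environ.get("SLURM_NTASKS", 1))  # total number of tasks
local_rank = int(os.environ.get("SLURM_LOCALID", 0)) # rank within the current node

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", init_method="env://",  # expects MASTER_ADDR / MASTER_PORT
                        rank=rank, world_size=world_size)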

Logs

To plot the logs of your experiments, you can use

python -m segm.utils.logs logs.yml

with logs.yml located in utils/ listing the paths to your experiment logs:

root: /path/to/checkpoints/
logs:
  seg-t: seg_tiny_mask/log.txt
  seg-b: seg_base_mask/log.txt

Attention Maps

To visualize the attention maps for Seg-T-Mask/16 encoder layer 0 and patch (0, 21), you can use:

python -m segm.scripts.show_attn_map seg_tiny_mask/checkpoint.pth \
  images/im0.jpg output_dir/ --layer-id 0 --x-patch 0 --y-patch 21 --enc

Different options are provided to select the generated attention maps:

  • --enc or --dec: Select encoder or decoder attention maps respectively.
  • --patch or --cls: --patch generates attention maps for the patch with coordinates (x_patch, y_patch). --cls combined with --enc generates attention maps for the CLS token of the encoder. --cls combined with --dec generates maps for each class embedding of the decoder.
  • --x-patch and --y-patch: Coordinates of the patch to draw attention maps from. This flag is ignored when --cls is used.
  • --layer-id: Select the layer for which the attention maps are generated.

For example, to generate attention maps for the decoder class embeddings, you can use:

python -m segm.scripts.show_attn_map seg_tiny_mask/checkpoint.pth \
images/im0.jpg output_dir/ --layer-id 0 --dec --cls

Attention maps for patch (0, 21) in Seg-L-Mask/16 encoder layers 1, 4, 8, 12 and 16:

Attention maps of patch x=8 and y=21 and encoder layers 1, 4, 8, 12 and 16

Attention maps for the class embeddings in Seg-L-Mask/16 decoder layer 0:

Attention maps of cls tokens 7, 15, 18, 22, 36 and 57 and Mask decoder layer 0

Video Segmentation

Zero-shot video segmentation on the DAVIS video dataset with a Seg-B-Mask/16 model trained on ADE20K.

BibTeX

@article{strudel2021,
  title={Segmenter: Transformer for Semantic Segmentation},
  author={Strudel, Robin and Garcia, Ricardo and Laptev, Ivan and Schmid, Cordelia},
  journal={arXiv preprint arXiv:2105.05633},
  year={2021}
}

Acknowledgements

The Vision Transformer code is based on the timm library, and the semantic segmentation training and evaluation pipeline uses mmsegmentation.

segmenter's People

Contributors

acairncross, rjgpinel, rstrudel


segmenter's Issues

License

What license are the code/pre-trained models in this repo licensed under?

Visualization map in Figure 7

Hi, can you provide the code to reproduce the attention map visualization in Figure 7 of your paper? I hope to do more visualization analysis based on your framework.

Inference Error

Hello, and thanks for this great work.

I am trying to run inference with the pretrained model. I chose the first model:
Seg-T-Mask/16 | 38.1 / 38.8 | 7M | 512x512

Then I wanted to test on some images with 512x512 resolution, but I got this error:

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Thanks
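
This error message means the input tensor is on the GPU while the model weights are still on the CPU (or the other way around). A minimal sketch of the usual fix, assuming the checkpoint is loaded through segm.model.factory.load_model as in segm.inference:

import torch
from segm.model.factory import load_model

device = torch.device("cuda")
model, variant = load_model("seg_tiny_mask/checkpoint.pth")
model.to(device)   # put the weights on the same device as the input
model.eval()

# ...and keep inputs on that device too, e.g. logits = model(image_tensor.to(device))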

Pretrained weights for Pascal Context

Hi,
Great work and thanks for providing the code online. In the model zoo you have provided weights for various versions trained on ADE20K; however, in the paper you also mention training on Pascal Context. Would it be possible to provide weights for that as well?

Performance better than that in the paper.

Hi Robin,

Thanks for releasing the code and models. I find that your model performs better than what is reported in the paper. For example, on the ADE20K validation set, Seg-B-Mask/16 has 45.69 mIoU (SS), but according to the information in this repo it actually achieves 48.5. Am I missing something?

Performance of Seg-B/16 on CityScapes using AugReg initialization

Hi, thanks for the excellent work! I notice that in your paper, the Seg-B/16 trained on Cityscapes is initialized from the DeiT pre-trained model (rather than AugReg). In my own experiments, Seg-B/16 (and my own ViT-Base-based model) with AugReg initialization performs quite badly on Cityscapes (73.2 mIoU), while Seg-S/16 performs well (76.2 mIoU). So I wonder whether you also observed similar results, and whether you can share more about your choice of initialization for the Seg-B/16 model?
Many thanks.

Dataset link is broken

Hi, could you send me the dataset? The download link currently returns an error.
Thank you

Infer, Colab

Can you please provide a single-image inference example or a Colab notebook?

KeyError: ''

Hello, I ran the program on Windows and got the following error:

D:\Download\anaconda\anaconda\envs\learn\python.exe E:/Learning/Graduate/segmenter/segmenter-master/segm/train.py
Starting process with rank 0...
Process 0 is connected.
All processes are connected.
Traceback (most recent call last):
  File "E:\Learning\Graduate\segmenter\segmenter-master\segm\train.py", line 304, in <module>
    main()
  File "D:\Download\anaconda\anaconda\envs\learn\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "D:\Download\anaconda\anaconda\envs\learn\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "D:\Download\anaconda\anaconda\envs\learn\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "D:\Download\anaconda\anaconda\envs\learn\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "E:\Learning\Graduate\segmenter\segmenter-master\segm\train.py", line 76, in main
    model_cfg = cfg["model"][backbone]
KeyError: ''

Do you know how to solve it? Thank you!
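
The empty key in cfg["model"][backbone] suggests the --backbone option was never provided (the script was launched directly without command-line arguments), so the backbone name ends up as an empty string. Launching training with the options documented in the README, for example:

python -m segm.train --log-dir seg_tiny_mask --dataset ade20k \
  --backbone vit_tiny_patch16_384 --decoder mask_transformer

should avoid the KeyError.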

Performance on Pascal Context with Seg-L-Mask/16

Hi, thanks for the great work and the code!
I'm trying to reproduce the baseline based on mmsegmentation. While the baseline reproduces well on Cityscapes and ADE20K, I could only get 56.9 single-scale mIoU on Pascal Context (58.1 reported). Is there anything I've missed?
Below is the config I'm running based on mmsegmentation; is anything wrong in the settings?
Many thanks for your help!

_base_ = [
    # "./training_scheme.py",
    "../_base_/models/segmenter_vit-b16.py",
    "../_base_/datasets/pascal_context_meanstd0.5.py",
    "../_base_/default_runtime.py",
    "../_base_/schedules/schedule_80k.py",
]

model = dict(
    pretrained="pretrain/L_16-i21k-300ep-lr_0.001-aug_medium1-wd_0.1-do_0.1-sd_0.1--imagenet2012-steps_20k-lr_0.01-res_384.npz",
    backbone=dict(
        type="VisionTransformer",
        img_size=(480, 480),
        patch_size=16,
        in_channels=3,
        embed_dims=1024,
        num_layers=24,
        num_heads=16,
        mlp_ratio=4,
        out_indices=(5, 11, 17, 23),
        qkv_bias=True,
        drop_rate=0.0,
        attn_drop_rate=0.0,
        drop_path_rate=0.1,
        with_cls_token=True,
        final_norm=True,
        norm_cfg=dict(type="LN", eps=1e-6),
        act_cfg=dict(type="GELU"),
        norm_eval=False,
        interpolate_mode="bicubic",
    ),
    neck=dict(
        type="UseIndexSingleOutNeck",
        index=-1,
    ),
    decode_head=dict(
        n_cls=60,
        n_layers=2,
        d_encoder=1024,
        n_heads=16,
        d_model=1024,
        d_ff=4 * 1024,
    ),
    test_cfg=dict(mode="slide", crop_size=(480, 480), stride=(320, 320)),
)

optimizer = dict(
    _delete_=True,
    type="SGD",
    lr=0.001,
    weight_decay=0.0,
    momentum=0.9,
    paramwise_cfg=dict(
        custom_keys={
            "pos_embed": dict(decay_mult=0.0),
            "cls_token": dict(decay_mult=0.0),
            "norm": dict(decay_mult=0.0),
        }
    ),
)

lr_config = dict(
    _delete_=True,
    policy="poly",
    warmup_iters=0,
    power=0.9,
    min_lr=1e-5,
    by_epoch=False,
)

# By default, models are trained on 8 GPUs with 2 images per GPU
data = dict(samples_per_gpu=2)

Custom datasets

Hello, thank you for your excellent work. I would like to know how to train on custom datasets.

Multi-GPU Training Not On SLURM

Hello, thanks a lot for contributing such excellent work. I noticed that the distributed multi-GPU training is based on SLURM, which is not easy to run on other platforms. Could you, or anyone, provide some tips for converting the SLURM-based code into a non-SLURM version, so that multi-GPU distributed training can also be run on other platforms?
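
One hedged workaround (a general PyTorch pattern, not something specific to this repository): launch with torchrun, which sets the standard RANK, LOCAL_RANK and WORLD_SIZE variables, and initialize the process group from those instead of the SLURM variables. How much of the repository's launcher this replaces is left as an assumption.

import os
import torch
import torch.distributed as dist

# Launched with:  torchrun --nproc_per_node=4 train_script.py  (torchrun sets the variables below)
rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", init_method="env://",
                        rank=rank, world_size=world_size)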

Adam Optimizer not working

I have experimented with this code in many ways, and I have also introduced custom schedulers, but what I am not able to understand is why SGD works perfectly fine while the Adam optimizer doesn't. I tried changing the learning rate to different values, but none of them even start decreasing the loss.
I used both SGD and Adam from torch.optim. Any suggestions or help would be appreciated.

Thanks
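
Not an authoritative answer, but a common explanation: Adam generally needs a base learning rate one to two orders of magnitude smaller than SGD under the same schedule, so reusing the SGD learning rate often stalls training. A small sketch of the scale difference (the exact values are illustrative assumptions, and the Linear module only stands in for the Segmenter network):

import torch

model = torch.nn.Linear(8, 8)   # stand-in for the Segmenter model

# SGD-style learning rate (order of magnitude used for segmentation fine-tuning)
sgd = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# Adam/AdamW usually wants a much smaller base learning rate for the same setup
adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)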

train on custom dataset

Hello, I would like to ask whether I can modify the existing code to train on my own dataset, because in a previous issue I read that this is not possible yet. If it is possible, are there any hints about the modifications needed?

Which ViT weights did you originally use in the paper?

Hi, I see some improvements in your released models compared with the results reported in your paper (51 vs 48 for Seg-L/16, etc.), and you mention this performance gain comes from the improved ViT weights. I wonder which ViT weights you used in the paper?

cityscapes model

Dear authors,

can you please share the weights of the model trained on the Cityscapes dataset?

Thank you very much in advance.

Best,
Antonin.

Multi-GPU training

This is a good paper and a very interesting idea! The README has a training command for a single GPU. Could you provide the corresponding command for multi-GPU training?

Unexpected keyword `mlp_ratio` running `seg_base_deit_mask`

First of all, excellent repo - thanks very much for the awesome contribution to the ml community!

When running eval on seg_base_deit_mask (via python -m segm.eval.miou checkpoints/seg_base_deit_mask/checkpoint.pth ade20k --multiscale), I am getting an error:

Starting process with rank 0...
Process 0 is connected.
All processes are connected.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/.../segmenter/segm/eval/miou.py", line 279, in <module>
    main()
  File "/home/.../segmenter/pyenv/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/.../segmenter/pyenv/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/.../segmenter/pyenv/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/.../segmenter/pyenv/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/.../segmenter/segm/eval/miou.py", line 226, in main
    model, variant = load_model(model_path)
  File "/home/.../segmenter/segm/model/factory.py", line 119, in load_model
    model = create_segmenter(net_kwargs)
  File "/home/.../segmenter/segm/model/factory.py", line 106, in create_segmenter
    encoder = create_vit(model_cfg)
  File "/home/.../segmenter/segm/model/factory.py", line 67, in create_vit
    model = VisionTransformer(**model_cfg)
TypeError: __init__() got an unexpected keyword argument 'mlp_ratio'

This is happening with both single and multi scale. It seems to stem from the mlp_ratio key located in the yml config.

As I keep poking around, if I find a solution I'll submit a PR.

Thanks again for the repo 👍

Instance Segmentation

@rstrudel thanks for open-sourcing the code base. Can we also perform instance segmentation with Segmenter? If so, how, and what changes would have to be made in the code base?
Thanks in advance

Linear Decoder performs better than Mask Transformer with ViT Tiny backbone on ADE20K.

Hi, Thanks for the great codebase!

I compared the performance of the two decoders: Linear Decoder and Mask Transformer with ViT Tiny backbone on the ADE20K Dataset, following the original hyperparameter settings.

The results are as follows:

  • Linear Decoder: mIoU = 39.85
  • Mask Transformer: mIoU = 38.55

Is it that the Mask Transformer works well with heavier backbones, or am I missing something?

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/$WORK/tempbs_7o9oj'

I began training on my own data, but I get an error when it runs evaluation for the first time. The log is as follows:

Epoch: [11] [0/8] eta: 0:00:34 loss: 0.0000 (0.0000) learning_rate: 0.0008 (0.0008) time: 4.2506 data: 2.3495 max mem: 9466
Epoch: [11] [7/8] eta: 0:00:00 loss: 0.0000 (0.0000) learning_rate: 0.0008 (0.0008) time: 0.9943 data: 0.2958 max mem: 9491
Epoch: [11] Total time: 0:00:08 (1.0115 s / it)
Epoch: [12] [0/8] eta: 0:00:27 loss: 0.0000 (0.0000) learning_rate: 0.0008 (0.0008) time: 3.4646 data: 2.7603 max mem: 9492
Epoch: [12] [7/8] eta: 0:00:00 loss: 0.0000 (0.0000) learning_rate: 0.0008 (0.0008) time: 0.8330 data: 0.3464 max mem: 9492
Epoch: [12] Total time: 0:00:06 (0.8537 s / it)
Eval: [ 0/58] eta: 0:01:40 time: 1.7340 data: 1.3048 max mem: 10891
Eval: [50/58] eta: 0:00:01 time: 0.1124 data: 0.0121 max mem: 16814
Eval: [57/58] eta: 0:00:00 time: 0.1047 data: 0.0120 max mem: 16814
Eval: Total time: 0:00:08 (0.1505 s / it)
Traceback (most recent call last):
  File "/home/qiuzheng/.conda/envs/Segmenter/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/qiuzheng/.conda/envs/Segmenter/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/qiuzheng/segmenter/segm/train.py", line 304, in <module>
    main()
  File "/home/qiuzheng/.conda/envs/Segmenter/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/qiuzheng/.conda/envs/Segmenter/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/qiuzheng/.conda/envs/Segmenter/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/qiuzheng/.conda/envs/Segmenter/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/qiuzheng/segmenter/segm/train.py", line 266, in main
    eval_logger = evaluate(
  File "/home/qiuzheng/.conda/envs/Segmenter/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/qiuzheng/segmenter/segm/engine.py", line 104, in evaluate
    val_seg_pred = gather_data(val_seg_pred)
  File "/home/qiuzheng/segmenter/segm/metrics.py", line 60, in gather_data
    tmpdir = tempfile.mkdtemp(prefix=tmpprefix)
  File "/home/qiuzheng/.conda/envs/Segmenter/lib/python3.8/tempfile.py", line 359, in mkdtemp
    os.mkdir(file, 0o700)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/$WORK/tempbs_7o9oj'
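
The literal $WORK in the failing path indicates an environment variable that was never expanded before being handed to tempfile.mkdtemp. A hedged sketch of the kind of fix one could apply around gather_data in segm/metrics.py (the tmpprefix name comes from the traceback; the expansion and fallback logic are assumptions):

import os
import tempfile

prefix = os.path.expandvars("/tmp/$WORK/tempbs_")     # expand $WORK if it is defined
if "$" in prefix:                                     # variable not set: fall back to plain /tmp
    prefix = "/tmp/tempbs_"
os.makedirs(os.path.dirname(prefix), exist_ok=True)   # mkdtemp does not create parent directories
tmpdir = tempfile.mkdtemp(prefix=prefix)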

Simpler code

Hi there, just wanted to point out that this code

px_v = einops.rearrange(
    torch.tensor([1, 0, 0]).repeat([patch_size, patch_size, 1]),
    "h w c -> c h w",
).unsqueeze(0)

can be made much more readable

px_v = einops.repeat(
    torch.tensor([1, 0, 0]),
    "c -> 1 c h w", h=patch_size, w=patch_size,
)

AttributeError: module 'yaml' has no attribute 'FullLoader'

On Colab with GPU:

%cd /content
!git clone https://github.com/rstrudel/segmenter

%cd /content/segmenter
%pip install .

%mkdir -p images
%mkdir -p segmaps
%mkdir -p seg_tiny_mask

%cd /content/segmenter/seg_tiny_mask
!wget https://www.rocq.inria.fr/cluster-willow/rstrudel/segmenter/checkpoints/seg_tiny_mask/checkpoint.pth
!wget https://www.rocq.inria.fr/cluster-willow/rstrudel/segmenter/checkpoints/seg_tiny_mask/variant.yml

%cd /content/segmenter/images
!wget https://i.imgur.com/HpJVhfd.png

%cd /content/segmenter
!python -m segm.inference --model-path seg_tiny_mask/checkpoint.pth -i images/ -o segmaps/ 

returns:

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/segmenter/segm/inference.py", line 64, in <module>
    main()
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/content/segmenter/segm/inference.py", line 27, in main
    model, variant = load_model(model_path)
  File "/content/segmenter/segm/model/factory.py", line 116, in load_model
    variant = yaml.load(f, Loader=yaml.FullLoader)
AttributeError: module 'yaml' has no attribute 'FullLoader'
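
yaml.FullLoader only exists in PyYAML >= 5.1, so this usually means an older PyYAML is installed in the environment. Upgrading PyYAML should fix it; alternatively, a hedged one-line change in segm/model/factory.py would be to use yaml.safe_load, which exists on both old and new versions:

import yaml

with open("seg_tiny_mask/variant.yml") as f:
    variant = yaml.safe_load(f)   # works on PyYAML versions both older and newer than 5.1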

Questions about the function "resize_pos_embed" in load model weights when input different resolution image.

Thank you for your interesting work!

I have a question: when the input resolution is different from the pretrained ViT model, your solution, as stated in the paper, is "we bilinearly interpolate the pre-trained position embeddings according to their original position in the image to match the fine-tuning sequence length".
However, I see that the resize_pos_embed function in timm.vision_transformer._load_weights() chooses "bicubic" interpolation instead of "bilinear", so have you verified that bilinear interpolation is better than bicubic interpolation?

Single Scale Evaluation Performance on Cityscapes with `Seg-T-Mask/16`.

Hey, thanks for the great work and codebase!

I tried reproducing your results using the vit_tiny_patch16_384 variant of ViT as the backbone. That works pretty well when training on the ADE20K dataset, giving an mIoU of 38.55 as expected.

However, when I trained the same model on the Cityscapes dataset, it gave an mIoU of 72.9, which is also great. It would also be nice if you could add Cityscapes results to the README.

I used the same hyperparameters as released with the original code. I made one change to the label loading strategy of Cityscapes, mapping the GT class IDs to the range 0-18 as all other works do.

Not using this label mapping gives an index mismatch error during training, so I am putting it here in case anyone faces the same issue in the future.

Code to be added in segm/data/base.py:

import numpy as np

ignore_label = 255
# Map raw Cityscapes label IDs to the 19 train IDs (everything else -> ignore).
id_to_trainid = {-1: -1, 0: ignore_label, 1: ignore_label, 2: ignore_label,
                 3: ignore_label, 4: ignore_label, 5: ignore_label, 6: ignore_label,
                 7: 0, 8: 1, 9: ignore_label, 10: ignore_label, 11: 2, 12: 3, 13: 4,
                 14: ignore_label, 15: ignore_label, 16: ignore_label, 17: 5,
                 18: ignore_label, 19: 6, 20: 7, 21: 8, 22: 9, 23: 10, 24: 11, 25: 12, 26: 13, 27: 14,
                 28: 15, 29: ignore_label, 30: ignore_label, 31: 16, 32: 17, 33: 18}

def convert_labels(self, labels):
    labels_copy = np.copy(labels)
    for k, v in id_to_trainid.items():
        labels_copy[labels == k] = v
    return labels_copy

Calling the convert_labels function before returning the GT Labels for the cityscapes dataset does the trick.

KeyError: 'optimizer'

Thank you for your excellent work, but I have a problem with checkpoint.pth. When I try to run the segm.train module, I get the error "KeyError: 'optimizer'". I hope you can help me. Thanks again!
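
This typically happens when training is resumed from one of the released checkpoints, which may contain only model weights and no optimizer state. A minimal way to check what the file actually contains (the guard on the optimizer load is a hedged suggestion, not the repository's exact resume logic):

import torch

ckpt = torch.load("seg_tiny_mask/checkpoint.pth", map_location="cpu")
print(list(ckpt.keys()))   # see whether an "optimizer" entry is present at all

# If it is missing, any resume logic should skip restoring the optimizer state, e.g.:
# if "optimizer" in ckpt:
#     optimizer.load_state_dict(ckpt["optimizer"])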

Ask about the "Seg-B/8"

Great work on semantic segmentation!

I find that the resolution is important for the final performance, e.g., Seg-B/8.

However, I could not find ImageNet pre-trained checkpoints with patch size 8 in the timm library.

It would be great if you could help to address my concern!

mmseg not found !!

Hello, and thanks for this great work.

I am trying to test your code using inference, but I get this error:
ModuleNotFoundError: No module named 'mmseg'

Then I tried to install mmseg with pip, but it does not work:
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Best,

attention maps

Hi, I want to generate segmentation maps from my own data, but after downloading the checkpoint and config file,
I modified the following code at line 413 in vision_transformer.py:

# before
    w = np.load(checkpoint_path)
# after
    w = np.load(checkpoint_path, allow_pickle=True)

I got this error:

OSError: Failed to interpret file '/xxxx/.cache/torch/hub/checkpoints/Ti_16-i21k-300ep-lr_0.001-aug_none-wd_0.03-do_0.0-sd_0.0--imagenet2012-steps_20k-lr_0.03-res_384.npz' as a pickle

When I inspect this .npz file, most of the content appears garbled.

Code to compute images/sec

Hi,

Thank you for the cool work!

I see that you report images/sec, and mention the following in the paper:

To compute the images per second, we use a V100 GPU, fix the image resolution to 512 and for each model we maximize the batch size allowed by memory for a fair comparison.

I'm trying to do the same; however, I'm unable to reproduce the images/sec numbers reported in the paper.

I'm using the code snippet from PyTorch as follows:

from torch.utils.benchmark import Timer

batch = torch.rand(args.batch_size, *input_shape).cuda()
model(batch)
n_runs = 10

t = Timer(stmt="model.forward(batch)", globals={"model": model, "batch": batch})
m = t.timeit(n_runs)

The batch size that fits on V100 for Vit-T backbone is about 140. And the above code shows a timing of 0.62 seconds. So I'm computing the total images/sec = 140/0.62 = 225.8. This is almost half the numbers in Table 3. Can you please help me with what I need to do to get the mentioned result?

Thank you!
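
Not the authors' exact protocol, but a few standard benchmarking details often explain a large gap: eval mode, disabled gradients, a warm-up pass before timing, and (if the paper used it) mixed precision. A hedged variation of the snippet above, where `model` is the loaded Segmenter and the batch size of 140 matches the one quoted in the question:

import torch
from torch.utils.benchmark import Timer

model = model.cuda().eval()                      # `model`: the Segmenter network being benchmarked
batch = torch.rand(140, 3, 512, 512).cuda()

with torch.no_grad():
    model(batch)                                 # warm-up pass (CUDA init, cuDNN autotuning)
    t = Timer(stmt="model(batch)", globals={"model": model, "batch": batch})
    m = t.timeit(10)                             # Timer handles CUDA synchronization

print(140 / m.mean, "images/sec")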

Will the pretrained models work with a different image resolution?

Thanks for this great work.
I tried the training code with my own data (images with resolution 256 x 320) from scratch and it worked well. However, it crashed when I loaded a pretrained model file that was originally trained with 500x500 images.
Is this normal for such transformer-based networks? I ask because I know a pretrained CNN-based segmentation network (without a fully connected layer) would not be affected by the input resolution, but I am not very familiar with transformers.

How much does ImageNet pre-training affect model performance?

Hi,

I am trying to use the baseline model (Linear decoder) described in the paper as a baseline for some of my work. However, I do not have access to pre-trained ImageNet weights. My model is not able to learn, converging at around 0.25 mDICE on the training set of Cityscapes. This is after hyperparameter optimisation across SGD, Adam and different learning rate schedulers.

I was wondering if during your experiments, you saw similar levels of performance when you did not initialise your transformer backbones with pre-trained weights? Was this tested for the baseline (ViT + Linear) and your proposed method (ViT + Mask)?

Thank you

how to get the attention maps

First, the folder named images does not contain a file named im0.jpg, and the script reports an error (screenshots omitted).
If I instead point it at images/validation/ADE_val_0000000.jpg, I get:
ValueError: Provided image path images/training/ADE_train_00016528 is not a valid image file.

Also, what is output_dir supposed to be?
