
image-captioning-dlct's Introduction

Dual-Level Collaborative Transformer for Image Captioning

This repository contains the reference code for the papers Dual-Level Collaborative Transformer for Image Captioning and Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network.

Experiment setup

Please refer to M2 Transformer.

Data preparation

  • Annotation. Download the annotation file annotation.zip. Extract it and put it in the project root directory.
  • Feature. You can download our ResNeXt-101 features (hdf5 file) here. Access code: jcj6.
  • Evaluation. Download the evaluation tools here. Access code: jcj6. Extract them and put them in the project root directory.

There are five kinds of keys in our .hdf5 file (a short reading example follows this list). They are:

  • ['%d_features' % image_id]: region features (N_regions, feature_dim)
  • ['%d_boxes' % image_id]: bounding box of region features (N_regions, 4)
  • ['%d_size' % image_id]: size of original image (for normalizing bounding box), (2,)
  • ['%d_grids' % image_id]: grid features (N_grids, feature_dim)
  • ['%d_mask' % image_id]: geometric alignment graph, (N_regions, N_grids)
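As a quick illustration, the entries for one image can be read back with h5py like this (a minimal sketch, not code from this repo; the image id is a hypothetical example):

import h5py

with h5py.File('./data/coco_all_align.hdf5', 'r') as f:
    image_id = 391895  # hypothetical COCO image id
    features = f['%d_features' % image_id][()]  # (N_regions, feature_dim)
    boxes = f['%d_boxes' % image_id][()]        # (N_regions, 4)
    size = f['%d_size' % image_id][()]          # (2,)
    grids = f['%d_grids' % image_id][()]        # (N_grids, feature_dim)
    mask = f['%d_mask' % image_id][()]          # (N_regions, N_grids)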

We extract features with the code in grid-feats-vqa.

The first three keys can be obtained when extracting region features with extract_region_feature.py. The fourth key can be obtained when extracting grid features with the code in grid-feats-vqa. The last key can be obtained with align.ipynb.

Training

python train.py --exp_name dlct --batch_size 50 --head 8 --features_path ./data/coco_all_align.hdf5 --annotation annotation --workers 8 --rl_batch_size 100 --image_field ImageAllFieldWithMask --model DLCT --rl_at 17 --seed 118

Evaluation

python eval.py --annotation annotation --workers 4 --features_path ./data/coco_all_align.hdf5 --model_path path_of_model_to_eval --model DLCT --image_field ImageAllFieldWithMask --grid_embed --box_embed --dump_json gen_res.json --beam_size 5

Important args:

  • --features_path path to the hdf5 feature file
  • --model_path path to the checkpoint to evaluate
  • --dump_json path to dump the generated captions to

A pretrained model is available here. Access code: jcj6. By evaluating the pretrained model, you will get:

{'BLEU': [0.8136727001615207, 0.6606095421082421, 0.5167535314080227, 0.39790755018790197], 'METEOR': 0.29522868252436046, 'ROUGE': 0.5914367650104326, 'CIDEr': 1.3382047139781112, 'SPICE': 0.22953477359195887}

References

[1] M2

[2] grid-feats-vqa

[3] butd

Acknowledgements

Thanks to the original M2 and the amazing work of grid-feats-vqa.

image-captioning-dlct's People

Contributors

leeyn-43, lifegwt, luo3300612


image-captioning-dlct's Issues

Meaning of the can_be_stateful parameter

Thank you very much for open-sourcing the code for the paper. While reading the code, I don't quite understand the meaning of the can_be_stateful parameter in the model, or how its value affects the model. Could you explain it?
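For background, in M2-style codebases this kind of flag typically marks attention layers that cache keys and values across decoding steps during beam search. The toy sketch below illustrates that caching pattern only; it is an assumption about intent, not code from this repo.

import torch

class ToyStatefulAttention:
    # Toy illustration of stateful decoding: when the flag is set, keys
    # seen at earlier steps are kept so each step feeds only new tokens.
    def __init__(self, can_be_stateful):
        self.can_be_stateful = can_be_stateful
        self.running_keys = None

    def step(self, keys):  # keys: (batch, seq, dim)
        if self.can_be_stateful and self.running_keys is not None:
            keys = torch.cat([self.running_keys, keys], dim=1)
        if self.can_be_stateful:
            self.running_keys = keys
        return keys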

Vocabulary size change during testing

[screenshot of the error]
I trained and tested with the parameters in the README, but when loading the checkpoint during testing, I found that the vocabulary size changed. Why is this?
Looking forward to your help, thanks!

Error during the evaluation stage

Hello, sorry to bother you. When running your code on a server, I got the following error during the evaluation stage. Have you encountered it? Looking forward to your reply!
Traceback (most recent call last):
File "train.py", line 353, in
scores = evaluate_metrics(model, dict_dataloader_val, text_field)
File "train.py", line 62, in evaluate_metrics
out, _ = model.beam_search(images, 20, text_field.vocab.stoi[''], 5, out_size=1,
File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/captioning_model.py", line 70, in beam_search
return bs.apply(visual, out_size, return_probs, **kwargs)
File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/beam_search/beam_search.py", line 71, in apply
visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/beam_search/beam_search.py", line 121, in iter
self.model.apply_to_states(self._expand_state(selected_beam, cur_beam_size))
File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/containers.py", line 30, in apply_to_states
self._buffers[name] = fn(self._buffers[name])
File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/beam_search/beam_search.py", line 26, in fn
s = torch.gather(s.view(*([self.b_s, cur_beam_size] + shape[1:])), 1,
RuntimeError: gather_out_cuda(): Expected dtype int64 for index

How is the alignment graph obtained for new datasets?

Hi, in your code, the h5py.File of features has keys like ['%d_features' % image_id], ['%d_grids' % image_id], ['%d_boxes' % image_id], ['%d_size' % image_id], and ['%d_mask' % image_id]. If I have a new dataset, can I just use align.py to get the geometric alignment graph after obtaining grid features and region features with extract_region_feature.py and grid-feats-vqa?
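For what it's worth, here is a hedged sketch of one way a (N_regions, N_grids) geometric alignment mask can be built, assuming a grid cell aligns with a region whenever their areas overlap; the grid shape and overlap rule are assumptions, and align.ipynb remains the authoritative implementation:

import numpy as np

def geometric_alignment(boxes, image_size, grid_hw=(7, 7)):
    # boxes: (N_regions, 4) as (x1, y1, x2, y2); image_size: (height, width)
    img_h, img_w = image_size
    gh, gw = grid_hw
    cell_h, cell_w = img_h / gh, img_w / gw
    mask = np.zeros((len(boxes), gh * gw), dtype=np.float32)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        for r in range(gh):
            for c in range(gw):
                cx1, cy1 = c * cell_w, r * cell_h
                cx2, cy2 = cx1 + cell_w, cy1 + cell_h
                # mark the grid cell if it overlaps the region box
                if x1 < cx2 and cx1 < x2 and y1 < cy2 and cy1 < y2:
                    mask[i, r * gw + c] = 1.0
    return mask  # (N_regions, N_grids)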

Code to generate the caption for a single image

Thank you so much for sharing your brilliant code with us.
Could you share the code to generate the caption for an image, like Figure 1 in your article?
I would appreciate it if you could share it with me! ^_^

Shuffle parameter setting during the training stage

Hello, while reproducing your code on a server I found the following error:
Traceback (most recent call last):
File "train.py", line 328, in
dict_dataloader_train = DataLoader(dict_dataset_train, batch_size=args.rl_batch_size // 5, shuffle=True,
File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/data/init.py", line 8, in init
super(DataLoader, self).init(dataset, *args, collate_fn=dataset.collate_fn(), **kwargs)
File "/mnt/ssd1/anaconda3/envs/yjl/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 228, in init
raise ValueError(
ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True
I checked the source and found that at line 328 of train.py,
dict_dataloader_train = DataLoader(dict_dataset_train, batch_size=args.rl_batch_size // 5, shuffle=True,
num_workers=args.workers)
if shuffle is set to False, it runs. Have you encountered this problem? Looking forward to your reply.
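A hedged workaround, reusing the names from the traceback above (a sketch, not the repo's own fix): pass shuffle only when the dataset is map-style, since PyTorch rejects shuffle=True for an IterableDataset.

from torch.utils.data import IterableDataset

# IterableDataset controls its own iteration order, so shuffle must be
# left unspecified for it; map-style datasets can still be shuffled.
shuffle = not isinstance(dict_dataset_train, IterableDataset)
dict_dataloader_train = DataLoader(dict_dataset_train,
                                   batch_size=args.rl_batch_size // 5,
                                   shuffle=shuffle,
                                   num_workers=args.workers)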

Concept Question about MHLCCA

Hi,

great to see a new image captioning model!

I have a question about MHLCCA. In your paper, it suggests setting both the key and value of MHLCCA to MHCRA(H_region, H_region, H_region ...). However, in your code (https://github.com/luo3300612/image-captioning-DLCT/blob/main/models/DLCT/encoders.py, line 155), the key and value are formed by torch.cat([out_region, out_grid], dim=1). This seems a bit different from the paper. Would you briefly explain the idea behind this, or am I misunderstanding the concept?

Thank you!

About the size of the .pth file

Is a fully trained model supposed to be 200 MB+? I downloaded the pretrained_model.pth you provided (240 MB), but the model I trained is 700 MB+.

Request for the ResNeXt-152 features

Hello, I used your ResNeXt-101 features (about 50 GB), which are much smaller than the dataset in grid-feats-vqa. Could I get your corresponding ResNeXt-152 features?

What is the relation between X-101 and X-152 extractors in the code?

Your script extract_region_feature.py has the weights for X-101 hard-coded, but the feature file is named "region_before_X152.hdf5". Also, there is no information about which checkpoint to use for extracting grid features. In the paper you mention both X-101 and X-152 as extractors.

Which extractor checkpoint should I use for grid-feats-vqa: X-101 or X-152? Can I use X-152 for region features as well?

Expected all tensors to be on the same device

Hello,
I'm working with your code, but at the training step I got this error; I tried to add .cuda(), but it didn't work:
Epoch 0 - validation: 100%|##########| 6250/6250 [11:39<00:00, 8.92it/s, loss=3.03]
Epoch 0 - validation: 100%|##########| 6250/6250 [11:39<00:00, 8.93it/s, loss=3.03]

Epoch 0 - evaluation: 0%| | 0/250 [00:00<?, ?it/s]
Epoch 0 - evaluation: 0%| | 0/250 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 354, in
scores = evaluate_metrics(model, dict_dataloader_val, text_field)
File "train.py", line 62, in evaluate_metrics
**{'boxes': boxes, 'grids': grids, 'masks': masks})
File "D:\thez\duallevel\DLCT\models\captioning_model.py", line 70, in beam_search
return bs.apply(visual, out_size, return_probs, **kwargs)
File "D:\thez\duallevel\DLCT\models\beam_search\beam_search.py", line 71, in apply
visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
File "D:\thez\duallevel\DLCT\models\beam_search\beam_search.py", line 104, in iter
word_logprob = self.model.step(t, self.selected_words, visual.type(torch.LongTensor), None, mode='feedback', **kwargs)
File "D:\thez\duallevel\DLCT\models\DLCT\transformer.py", line 73, in step
self.enc_output, self.mask_enc = self.encoder(regions=visual, grids=grids,boxes=boxes,aligns=aligns, region_embed=self.region_embed,grid_embed=self.grid_embed)
File "C:\Users\MehrsysteM\anaconda3\envs\DLCT\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "D:\thez\duallevel\DLCT\models\DLCT\encoders.py", line 189, in forward
out_region = F.relu(self.fc_region(regions))
File "C:\Users\MehrsysteM\anaconda3\envs\DLCT\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\MehrsysteM\anaconda3\envs\DLCT\lib\site-packages\torch\nn\modules\linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "C:\Users\MehrsysteM\anaconda3\envs\DLCT\lib\site-packages\torch\nn\functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_mm)

Can you help me fix this?
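One hedged observation: the traceback shows visual.type(torch.LongTensor) inside the step call, and casting to a CPU tensor type such as torch.LongTensor always returns a CPU tensor, which then collides with CUDA weights. A minimal, self-contained illustration of the difference (an assumption about the cause, not a confirmed fix):

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.zeros(2, 3, device=device)
y = x.type(torch.LongTensor)  # dtype cast, but always lands on CPU
z = x.long()                  # dtype cast that stays on x's device
print(y.device, z.device)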

Unable to download the pretrained model

It is not possible to download the trained .pth file from pan.baidu.com.
It seems we need to install some kind of download software (the Baidu page is in Chinese, so it is hard to understand how to proceed).

Could you please host the files in a more user-friendly repository, like osf.io or Dropbox?

Thanks

IndexError: invalid index to scalar variable

....
Epoch 0 - evaluation: 100%|##########| 250/250 [07:44<00:00, 1.86s/it]
Epoch 0 - evaluation: 100%|##########| 250/250 [07:44<00:00, 1.86s/it]
Traceback (most recent call last):
File "train.py", line 356, in
val_cider = scores['CIDEr']
IndexError: invalid index to scalar variable.

Hello, when I run the code, I get this error. Can you help me?

Can you share the pretrained model?

It seems that I have managed to combine the code to generate all the features needed. I want to test it on my custom images without re-training the network. Could you please provide the pretrained checkpoint?

Feature extraction for our own dataset

Hello, can the method in your project be used to extract features from our own dataset and train on them?

About test performance

Hello, thanks for open-sourcing your code.
When I train this model from scratch, the training loss goes down, but the test performance does not change during training.

parameter error

The SelfAtt module includes a MultiHeadAttention module. In its forward function, self.mhatt receives 6 parameters (q, k, values, relative_geometry_weights, attention_mask, attention_weights). However, there are 5 parameters in MultiHeadAttention's forward function (queries, keys, values, box_relation_embed_matrix, attention_mask, attention_weights). Is this an error? Should MultiHeadAttention be replaced with MultiHeadBoxAttention?

Should ['%d_size' % image_id] be equal to the real image size?

For a 418x449 image I have [600, 644] written to the features, and for a 1042x480 image I have [1000, 461]. Those are not the dimensions of other images from my dataset, as I don't have images with such dimensions. Is this OK? Are those values expected to be somewhat different, or must they correspond to the real dimensions?

RL training stage, out of memory

After switching to RL, memory usage increases by about 2 GB after every training epoch. Eventually the system runs out of memory and kills the process.
Could you please give me some tips to solve it? Thank you very much!

A code bug

Traceback (most recent call last):
File "/mnt/Pycharm_Remote/DLCT_test/train.py", line 335, in
scores = evaluate_metrics(model, dict_dataloader_val, text_field)
File "/mnt/Pycharm_Remote/DLCT_test/train.py", line 61, in evaluate_metrics
**{'boxes': boxes, 'grids': grids, 'masks': masks})
File "/mnt/Pycharm_Remote/DLCT_test/models/captioning_model.py", line 70, in beam_search
return bs.apply(visual, out_size, return_probs, **kwargs)
File "/mnt/Pycharm_Remote/DLCT_test/models/beam_search/beam_search.py", line 71, in apply
visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
File "/mnt/Pycharm_Remote/DLCT_test/models/beam_search/beam_search.py", line 121, in iter
self.model.apply_to_states(self._expand_state(selected_beam, cur_beam_size))
File "/mnt/Pycharm_Remote/DLCT_test/models/containers.py", line 30, in apply_to_states
self._buffers[name] = fn(self._buffers[name])
File "/mnt/Pycharm_Remote/DLCT_test/models/beam_search/beam_search.py", line 27, in fn
beam.expand(*([self.b_s, self.beam_size] + shape[1:])))
RuntimeError: gather_out_cuda(): Expected dtype int64 for index

The beam index is float because it comes from "selected_beam = selected_idx / candidate_logprob.shape[-1]", but torch.gather requires an int index.
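The usual remedy, sketched below with toy values, is integer floor division so the beam index stays int64 (a sketch of the common fix, not a patch confirmed by the authors):

import torch

selected_idx = torch.tensor([7, 12, 23])    # toy flattened beam-vocab indices
vocab_size = 10                             # stands in for candidate_logprob.shape[-1]
selected_beam = selected_idx // vocab_size  # floor division keeps dtype int64
# on PyTorch >= 1.8 the explicit form also works:
# selected_beam = torch.div(selected_idx, vocab_size, rounding_mode='floor')
print(selected_beam.dtype)  # torch.int64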

Feature extraction on customized images

Hi, how can I extract region (box) features for customized images? Could you share the pretrained ResNeXt-101 and ResNeXt-152 models and the corresponding files for feature extraction? Thanks!

TypeError: __init__() got an unexpected keyword argument 'train_on_pred_boxes'

/usr/local/lib/python3.7/dist-packages/detectron2/config/config.py in wrapped(self, *args, **kwargs)
188 if _called_with_cfg(*args, **kwargs):
189 explicit_args = _get_args_from_config(from_config_func, *args, **kwargs)
--> 190 init_func(self, **explicit_args)
191 else:
192 init_func(self, *args, **kwargs)

TypeError: init() got an unexpected keyword argument 'train_on_pred_boxes'
Hello, while reproducing your paper I hit this problem when running the feature extraction code. I looked up related information but could not solve it.

About features on Baiduyun disk

Hello, could you explain the features shared on the net disk? Is coco_all_align.hdf5 inside the zip file? And what are the files ending with z01, z02, z03? I have tried to extract the zip file, but it fails.

About visualization code

Hello and thank you for this fantastic repo!
I see the visualization of attention states for captions (Figure 5 b), but I cannot find the corresponding code in this repo. I tried to reproduce the visualization results according to your paper and failed. Could you please share your visualization code? Thank you!

Lower results with the pretrained model

Hello, I cannot reproduce your results with the files you provided. Also, evaluating my own trained checkpoint with eval.py does not give the scores shown during training; for example, BLEU-1 is 81 during training, but running eval.py on the saved best checkpoint only gives 79.1. What could the reason be? Is something wrong with the eval.py settings?

Unable to load pretrained_model.pth

When I try to execute the eval.py file, an error is reported: "pretrained_model.pth is a zip archive (did you mean to use torch.jit.load()?)".

Are you saving the model with torch 1.6 or above?

However, torch 1.1 is installed per the environment.yml file. Which one is right? Should I change to torch 1.6 or above?

I turned to executing train.py instead, and I spent a long time on this step:

model = Transformer(text_field.vocab.stoi['<bos>'], encoder, decoder, args=args).to(device)

I don't know if this is correct.

I use one GeForce RTX 3090 Ti GPU with 24 GB of memory.

I'm looking forward to your reply, thanks.
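For context, the "zip archive" message is what pre-1.6 PyTorch prints when a checkpoint was saved with the newer zipfile serializer. A hedged workaround, assuming access to a torch >= 1.6 environment (a sketch, not an instruction from the authors), is to re-save the checkpoint in the legacy format that torch 1.1 can read:

import torch  # run this once under torch >= 1.6

state = torch.load('pretrained_model.pth', map_location='cpu')
torch.save(state, 'pretrained_model_legacy.pth',
           _use_new_zipfile_serialization=False)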
