
image-captioning-dlct's Introduction

Dual-Level Collaborative Transformer for Image Captioning

This repository contains the reference code for the papers Dual-Level Collaborative Transformer for Image Captioning and Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network.

Experiment setup

Please refer to M2 Transformer.

Data preparation

  • Annotation. Download the annotation file annotation.zip. Extract it and put it in the project root directory.
  • Feature. You can download our ResNeXt-101 features (hdf5 file) here. Access code: jcj6.
  • Evaluation. Download the evaluation tools here. Access code: jcj6. Extract them and put them in the project root directory.

There are five kinds of keys in our .hdf5 file (a short reading example follows this list). They are:

  • ['%d_features' % image_id]: region features (N_regions, feature_dim)
  • ['%d_boxes' % image_id]: bounding box of region features (N_regions, 4)
  • ['%d_size' % image_id]: size of original image (for normalizing bounding box), (2,)
  • ['%d_grids' % image_id]: grid features (N_grids, feature_dim)
  • ['%d_mask' % image_id]: geometric alignment graph, (N_regions, N_grids)
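As a quick illustration, the entries for one image can be read back with h5py like this (a minimal sketch, not code from this repo; the image id is a hypothetical example):

import h5py

with h5py.File('./data/coco_all_align.hdf5', 'r') as f:
    image_id = 391895  # hypothetical COCO image id
    features = f['%d_features' % image_id][()]  # (N_regions, feature_dim)
    boxes = f['%d_boxes' % image_id][()]        # (N_regions, 4)
    size = f['%d_size' % image_id][()]          # (2,)
    grids = f['%d_grids' % image_id][()]        # (N_grids, feature_dim)
    mask = f['%d_mask' % image_id][()]          # (N_regions, N_grids)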

We extract features with the code in grid-feats-vqa.

The first three keys can be obtained when extracting region features with extract_region_feature.py. The fourth key can be obtained when extracting grid features with the code in grid-feats-vqa. The last key can be obtained with align.ipynb.

Training

python train.py --exp_name dlct --batch_size 50 --head 8 --features_path ./data/coco_all_align.hdf5 --annotation annotation --workers 8 --rl_batch_size 100 --image_field ImageAllFieldWithMask --model DLCT --rl_at 17 --seed 118

Evaluation

python eval.py --annotation annotation --workers 4 --features_path ./data/coco_all_align.hdf5 --model_path path_of_model_to_eval --model DLCT --image_field ImageAllFieldWithMask --grid_embed --box_embed --dump_json gen_res.json --beam_size 5

Important args:

  • --features_path path to the hdf5 feature file
  • --model_path path to the checkpoint to evaluate
  • --dump_json path to dump the generated captions to

A pretrained model is available here. Access code: jcj6. By evaluating the pretrained model, you will get:

{'BLEU': [0.8136727001615207, 0.6606095421082421, 0.5167535314080227, 0.39790755018790197], 'METEOR': 0.29522868252436046, 'ROUGE': 0.5914367650104326, 'CIDEr': 1.3382047139781112, 'SPICE': 0.22953477359195887}

References

[1] M2

[2] grid-feats-vqa

[3] butd

Acknowledgements

Thanks to the original M2 and the amazing work of grid-feats-vqa.

image-captioning-dlct's People

Contributors

leeyn-43, lifegwt, luo3300612


image-captioning-dlct's Issues

Meaning of the can_be_stateful parameter

Thank you very much for open-sourcing the code for the paper. While reading the code, I don't quite understand the meaning of the can_be_stateful parameter in the model, or how its value affects the model. Could you explain it?
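For background, in M2-style codebases this kind of flag typically marks attention layers that cache keys and values across decoding steps during beam search. The toy sketch below illustrates that caching pattern only; it is an assumption about intent, not code from this repo.

import torch

class ToyStatefulAttention:
    # Toy illustration of stateful decoding: when the flag is set, keys
    # seen at earlier steps are kept so each step feeds only new tokens.
    def __init__(self, can_be_stateful):
        self.can_be_stateful = can_be_stateful
        self.running_keys = None

    def step(self, keys):  # keys: (batch, seq, dim)
        if self.can_be_stateful and self.running_keys is not None:
            keys = torch.cat([self.running_keys, keys], dim=1)
        if self.can_be_stateful:
            self.running_keys = keys
        return keys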

Vocabulary size change during testing

[screenshot of the error]
I trained and tested with the parameters in the README, but when loading the checkpoint during testing, I found that the vocabulary size changed. Why is this?
Looking forward to your help, thanks!

Error during the evaluation stage

Hello, sorry to bother you. When running your code on a server, I got the following error during the evaluation stage. Have you encountered it? Looking forward to your reply!
Traceback (most recent call last):
File "train.py", line 353, in
scores = evaluate_metrics(model, dict_dataloader_val, text_field)
File "train.py", line 62, in evaluate_metrics
out, _ = model.beam_search(images, 20, text_field.vocab.stoi[''], 5, out_size=1,
File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/captioning_model.py", line 70, in beam_search
return bs.apply(visual, out_size, return_probs, **kwargs)
File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/beam_search/beam_search.py", line 71, in apply
visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/beam_search/beam_search.py", line 121, in iter
self.model.apply_to_states(self._expand_state(selected_beam, cur_beam_size))
File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/containers.py", line 30, in apply_to_states
self._buffers[name] = fn(self._buffers[name])
File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/beam_search/beam_search.py", line 26, in fn
s = torch.gather(s.view(*([self.b_s, cur_beam_size] + shape[1:])), 1,
RuntimeError: gather_out_cuda(): Expected dtype int64 for index

How is the alignment graph obtained for new datasets?

Hi, in your code, the h5py.File of features has keys like ['%d_features' % image_id], ['%d_grids' % image_id], ['%d_boxes' % image_id], ['%d_size' % image_id], and ['%d_mask' % image_id]. If I have a new dataset, can I just use align.py to get the geometric alignment graph after obtaining grid features and region features with extract_region_feature.py and grid-feats-vqa?
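For what it's worth, here is a hedged sketch of one way a (N_regions, N_grids) geometric alignment mask can be built, assuming a grid cell aligns with a region whenever their areas overlap; the grid shape and overlap rule are assumptions, and align.ipynb remains the authoritative implementation:

import numpy as np

def geometric_alignment(boxes, image_size, grid_hw=(7, 7)):
    # boxes: (N_regions, 4) as (x1, y1, x2, y2); image_size: (height, width)
    img_h, img_w = image_size
    gh, gw = grid_hw
    cell_h, cell_w = img_h / gh, img_w / gw
    mask = np.zeros((len(boxes), gh * gw), dtype=np.float32)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        for r in range(gh):
            for c in range(gw):
                cx1, cy1 = c * cell_w, r * cell_h
                cx2, cy2 = cx1 + cell_w, cy1 + cell_h
                # mark the grid cell if it overlaps the region box
                if x1 < cx2 and cx1 < x2 and y1 < cy2 and cy1 < y2:
                    mask[i, r * gw + c] = 1.0
    return mask  # (N_regions, N_grids)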

Code to generate the caption for a single image

Thank you so much for sharing your brilliant code with us.
Could you share the code to generate the caption for an image, like Figure 1 in your article?
I would appreciate it if you could share it with me! ^_^

Shuffle parameter setting during the training stage

Hello, while reproducing your code on a server I found the following error:
Traceback (most recent call last):
File "train.py", line 328, in
dict_dataloader_train = DataLoader(dict_dataset_train, batch_size=args.rl_batch_size // 5, shuffle=True,
File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/data/init.py", line 8, in init
super(DataLoader, self).init(dataset, *args, collate_fn=dataset.collate_fn(), **kwargs)
File "/mnt/ssd1/anaconda3/envs/yjl/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 228, in init
raise ValueError(
ValueError: DataLoader with IterableDataset: expected unspecified shuffle option, but got shuffle=True
I checked the source and found that at line 328 of train.py,
dict_dataloader_train = DataLoader(dict_dataset_train, batch_size=args.rl_batch_size // 5, shuffle=True,
num_workers=args.workers)
if shuffle is set to False, it runs. Have you encountered this problem? Looking forward to your reply.
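A hedged workaround, reusing the names from the traceback above (a sketch, not the repo's own fix): pass shuffle only when the dataset is map-style, since PyTorch rejects shuffle=True for an IterableDataset.

from torch.utils.data import IterableDataset

# IterableDataset controls its own iteration order, so shuffle must be
# left unspecified for it; map-style datasets can still be shuffled.
shuffle = not isinstance(dict_dataset_train, IterableDataset)
dict_dataloader_train = DataLoader(dict_dataset_train,
                                   batch_size=args.rl_batch_size // 5,
                                   shuffle=shuffle,
                                   num_workers=args.workers)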

Concept Question about MHLCCA

Hi,

great to see a new image captioning model!

I have a question about MHLCCA. In your paper, it suggests setting both the key and value of MHLCCA to MHCRA(H_region, H_region, H_region ...). However, in your code (https://github.com/luo3300612/image-captioning-DLCT/blob/main/models/DLCT/encoders.py, line 155), the key and value are formed by torch.cat([out_region, out_grid], dim=1). This seems a bit different from the paper. Would you briefly explain the idea behind this, or am I misunderstanding the concept?

Thank you!

About the size of the .pth file

Is a fully trained model supposed to be 200 MB+? I downloaded the pretrained_model.pth you provided (240 MB), but the model I trained is 700 MB+.

Request for the ResNeXt-152 features

Hello, I used your ResNeXt-101 features (about 50 GB), which are much smaller than the dataset in grid-feats-vqa. Could I get your corresponding ResNeXt-152 features?

What is the relation between X-101 and X-152 extractors in the code?

Your script extract_region_feature.py has the weights for X-101 hard-coded, but the feature file is named "region_before_X152.hdf5". Also, there is no information about which checkpoint to use for extracting grid features. In the paper you mention both X-101 and X-152 as extractors.

Which extractor checkpoint should I use for grid-feats-vqa: X-101 or X-152? Can I use X-152 for region features as well?

Expected all tensors to be on the same device

Hello,
I'm working with your code, but at the training step I got this error; I tried to add .cuda(), but it didn't work:
Epoch 0 - validation: 100%|##########| 6250/6250 [11:39<00:00, 8.92it/s, loss=3.03]
Epoch 0 - validation: 100%|##########| 6250/6250 [11:39<00:00, 8.93it/s, loss=3.03]

Epoch 0 - evaluation: 0%| | 0/250 [00:00<?, ?it/s]
Epoch 0 - evaluation: 0%| | 0/250 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 354, in
scores = evaluate_metrics(model, dict_dataloader_val, text_field)
File "train.py", line 62, in evaluate_metrics
**{'boxes': boxes, 'grids': grids, 'masks': masks})
File "D:\thez\duallevel\DLCT\models\captioning_model.py", line 70, in beam_search
return bs.apply(visual, out_size, return_probs, **kwargs)
File "D:\thez\duallevel\DLCT\models\beam_search\beam_search.py", line 71, in apply
visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
File "D:\thez\duallevel\DLCT\models\beam_search\beam_search.py", line 104, in iter
word_logprob = self.model.step(t, self.selected_words, visual.type(torch.LongTensor), None, mode='feedback', **kwargs)
File "D:\thez\duallevel\DLCT\models\DLCT\transformer.py", line 73, in step
self.enc_output, self.mask_enc = self.encoder(regions=visual, grids=grids,boxes=boxes,aligns=aligns, region_embed=self.region_embed,grid_embed=self.grid_embed)
File "C:\Users\MehrsysteM\anaconda3\envs\DLCT\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "D:\thez\duallevel\DLCT\models\DLCT\encoders.py", line 189, in forward
out_region = F.relu(self.fc_region(regions))
File "C:\Users\MehrsysteM\anaconda3\envs\DLCT\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\MehrsysteM\anaconda3\envs\DLCT\lib\site-packages\torch\nn\modules\linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
File "C:\Users\MehrsysteM\anaconda3\envs\DLCT\lib\site-packages\torch\nn\functional.py", line 1848, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_mm)

Can you help me fix this?
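One hedged observation: the traceback shows visual.type(torch.LongTensor) inside the step call, and casting to a CPU tensor type such as torch.LongTensor always returns a CPU tensor, which then collides with CUDA weights. A minimal, self-contained illustration of the difference (an assumption about the cause, not a confirmed fix):

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = torch.zeros(2, 3, device=device)
y = x.type(torch.LongTensor)  # dtype cast, but always lands on CPU
z = x.long()                  # dtype cast that stays on x's device
print(y.device, z.device)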

Unable to download the pretrained model

It is not possible to download the trained .pth file from pan.baidu.com.
It seems we need to install some kind of download software (the Baidu page is in Chinese, so it is hard to understand how to proceed).

Could you please host the files in a more user-friendly repository, like osf.io or Dropbox?

Thanks

IndexError: invalid index to scalar variable

....
Epoch 0 - evaluation: 100%|##########| 250/250 [07:44<00:00, 1.86s/it]
Epoch 0 - evaluation: 100%|##########| 250/250 [07:44<00:00, 1.86s/it]
Traceback (most recent call last):
File "train.py", line 356, in
val_cider = scores['CIDEr']
IndexError: invalid index to scalar variable.

Hello, when I run the code, I get this error. Can you help me?

Can you share the pretrained model?

It seems that I have managed to combine the code to generate all the features needed. I want to test it on my custom images without re-training the network. Could you please provide the pretrained checkpoint?

Feature extraction for our own dataset

Hello, can the method in your project be used to extract features from our own dataset and train on them?

About test performance

Hello, thanks for open-sourcing your code.
When I train this model from scratch, the training loss goes down, but the test performance does not change during training.

parameter error

The SelfAtt module includes a MultiHeadAttention module. In its forward function, self.mhatt receives 6 parameters (q, k, values, relative_geometry_weights, attention_mask, attention_weights). However, there are 5 parameters in MultiHeadAttention's forward function (queries, keys, values, box_relation_embed_matrix, attention_mask, attention_weights). Is this an error? Should MultiHeadAttention be replaced with MultiHeadBoxAttention?

Should ['%d_size' % image_id] be equal to the real image size?

For a 418x449 image I have [600, 644] written to the features, and for a 1042x480 image I have [1000, 461]. Those are not the dimensions of other images from my dataset, as I don't have images with such dimensions. Is this OK? Are those values expected to be somewhat different, or must they correspond to the real dimensions?

RL training stage, out of memory

After switching to RL, memory usage increases by about 2 GB after every training epoch. Eventually the system runs out of memory and kills the process.
Could you please give me some tips to solve it? Thank you very much!

A code bug

Traceback (most recent call last):
File "/mnt/Pycharm_Remote/DLCT_test/train.py", line 335, in
scores = evaluate_metrics(model, dict_dataloader_val, text_field)
File "/mnt/Pycharm_Remote/DLCT_test/train.py", line 61, in evaluate_metrics
**{'boxes': boxes, 'grids': grids, 'masks': masks})
File "/mnt/Pycharm_Remote/DLCT_test/models/captioning_model.py", line 70, in beam_search
return bs.apply(visual, out_size, return_probs, **kwargs)
File "/mnt/Pycharm_Remote/DLCT_test/models/beam_search/beam_search.py", line 71, in apply
visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
File "/mnt/Pycharm_Remote/DLCT_test/models/beam_search/beam_search.py", line 121, in iter
self.model.apply_to_states(self._expand_state(selected_beam, cur_beam_size))
File "/mnt/Pycharm_Remote/DLCT_test/models/containers.py", line 30, in apply_to_states
self._buffers[name] = fn(self._buffers[name])
File "/mnt/Pycharm_Remote/DLCT_test/models/beam_search/beam_search.py", line 27, in fn
beam.expand(*([self.b_s, self.beam_size] + shape[1:])))
RuntimeError: gather_out_cuda(): Expected dtype int64 for index

The beam index is float because it comes from "selected_beam = selected_idx / candidate_logprob.shape[-1]", but torch.gather requires an int index.
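The usual remedy, sketched below with toy values, is integer floor division so the beam index stays int64 (a sketch of the common fix, not a patch confirmed by the authors):

import torch

selected_idx = torch.tensor([7, 12, 23])    # toy flattened beam-vocab indices
vocab_size = 10                             # stands in for candidate_logprob.shape[-1]
selected_beam = selected_idx // vocab_size  # floor division keeps dtype int64
# on PyTorch >= 1.8 the explicit form also works:
# selected_beam = torch.div(selected_idx, vocab_size, rounding_mode='floor')
print(selected_beam.dtype)  # torch.int64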

Feature extraction on customized images

Hi, how can I extract region (box) features for customized images? Could you share the pretrained ResNeXt-101 and ResNeXt-152 models and the corresponding files for feature extraction? Thanks!

TypeError: __init__() got an unexpected keyword argument 'train_on_pred_boxes'

/usr/local/lib/python3.7/dist-packages/detectron2/config/config.py in wrapped(self, *args, **kwargs)
188 if _called_with_cfg(*args, **kwargs):
189 explicit_args = _get_args_from_config(from_config_func, *args, **kwargs)
--> 190 init_func(self, **explicit_args)
191 else:
192 init_func(self, *args, **kwargs)

TypeError: init() got an unexpected keyword argument 'train_on_pred_boxes'
Hello, while reproducing your paper I hit this problem when running the feature extraction code. I looked up related information but could not solve it.

About features on Baiduyun disk

Hello, could you explain the features shared on the net disk? Is coco_all_align.hdf5 inside the zip file? And what are the files ending with z01, z02, z03? I have tried to extract the zip file, but it fails.

About visualization code

Hello and thank you for this fantastic repo!
I see the visualization of attention states for captions (Figure 5 b), but I cannot find the corresponding code in this repo. I tried to reproduce the visualization results according to your paper and failed. Could you please share your visualization code? Thank you!

Lower results with the pretrained model

Hello, I cannot reproduce your results with the files you provided. Also, evaluating my own trained checkpoint with eval.py does not give the scores shown during training; for example, BLEU-1 is 81 during training, but running eval.py on the saved best checkpoint only gives 79.1. What could the reason be? Is something wrong with the eval.py settings?

Unable to load pretrained_model.pth

When I try to execute the eval.py file, an error is reported: "pretrained_model.pth is a zip archive (did you mean to use torch.jit.load()?)".

Are you saving the model with torch 1.6 or above?

However, torch 1.1 is installed per the environment.yml file. Which one is right? Should I change to torch 1.6 or above?

I turned to executing train.py instead, and I spent a long time on this step:

model = Transformer(text_field.vocab.stoi['<bos>'], encoder, decoder, args=args).to(device)

I don't know if this is correct.

I use one GeForce RTX 3090 Ti GPU with 24 GB of memory.

I'm looking forward to your reply, thanks.
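For context, the "zip archive" message is what pre-1.6 PyTorch prints when a checkpoint was saved with the newer zipfile serializer. A hedged workaround, assuming access to a torch >= 1.6 environment (a sketch, not an instruction from the authors), is to re-save the checkpoint in the legacy format that torch 1.1 can read:

import torch  # run this once under torch >= 1.6

state = torch.load('pretrained_model.pth', map_location='cpu')
torch.save(state, 'pretrained_model_legacy.pth',
           _use_new_zipfile_serialization=False)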
