
AdaptiveAttention

Implementation of "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning"

teaser results

Requirements

Training the model requires a GPU with 12GB of memory. If you do not have a GPU, you can use the pretrained model directly for inference.

This code is written in Lua and requires Torch. The preprocessing code is in Python; install NLTK if you want to use it to tokenize the captions.

You also need to install a few additional packages in order to run the code successfully; a sketch of a typical setup is given below.
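The original list of packages did not survive here, so the commands below are only a rough sketch of a typical dependency setup for a NeuralTalk2-style Torch captioning codebase; the exact package set is an assumption, not the author's list.

# Hypothetical dependency setup -- the exact package set is an assumption.
luarocks install nn
luarocks install nngraph
luarocks install image
luarocks install lua-cjson
# GPU support (only needed for training)
luarocks install cutorch
luarocks install cunn
luarocks install cudnn
# torch-hdf5 for reading the preprocessed .h5 files
git clone https://github.com/deepmind/torch-hdf5
cd torch-hdf5 && luarocks make hdf5-0-0.rockspec
# Python-side tokenization
pip install nltk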

Pretrained Model

The pre-trained model for COCO can be downloaded here. The pre-trained model for Flickr30K can be downloaded here.
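The download links above did not survive the scrape. One of the issues further below quotes the original COCO download URLs; they are reproduced here for convenience, but they are user-reported and may no longer be live.

URL=https://filebox.ece.vt.edu/~jiasenlu/codeRelease/AdaptiveAttention
# COCO pre-trained model and vocabulary, as quoted in the issue below
wget $URL/model/COCO/coco_train/coco_train.t7
wget $URL/data/COCO/cocotalk_vocab.json -O coco_vocab.json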

Vocabulary File

Download the corresponding vocabulary files for COCO and Flickr30k.

Download Dataset

The first thing you need to do is download the data and run some preprocessing. Head over to the data/ folder and run the corresponding IPython script. It will download and preprocess the data and generate coco_raw.json.

Download the COCO and Flickr30k image datasets, extract the images, and place them in a directory of your choice.
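As a sketch of the image download step: the COCO 2014 URLs below are the standard public ones, but they are an assumption rather than part of the original instructions, and the Flickr30k images must be requested from the dataset's official page.

# Assumed COCO 2014 download locations; adjust target paths to your own layout.
wget http://images.cocodataset.org/zips/train2014.zip
wget http://images.cocodataset.org/zips/val2014.zip
unzip train2014.zip -d coco/images
unzip val2014.zip -d coco/images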

Training a new model on MS COCO

First, train the language model without finetuning the CNN.

th train.lua -batch_size 20 

To finetune the CNN, load the saved model and train for another 15-20 epochs.

th train.lua -batch_size 16 -startEpoch 21 -start_from 'model_id1_20.t7'
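Putting the two stages together, the full recipe sketched above looks roughly like this; the epoch count and the checkpoint filename follow the example commands and will differ depending on the -id you use and when you stop the first stage.

# Stage 1: train the language model with the CNN frozen (about 20 epochs)
th train.lua -batch_size 20
# Stage 2: resume from the stage-1 checkpoint and finetune the CNN for another 15-20 epochs
th train.lua -batch_size 16 -startEpoch 21 -start_from 'model_id1_20.t7'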

More results on spatial attention and the visual sentinel

teaser results

teaser results

For more visualization results, you can visit here (the page loads more than 1000 images and their results...).

Reference

If you use this code as part of any published research, please acknowledge the following paper:

@inproceedings{Lu2017Adaptive,
  author = {Lu, Jiasen and Xiong, Caiming and Parikh, Devi and Socher, Richard},
  title = {Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning},
  booktitle = {CVPR},
  year = {2017}
}

Acknowledgement

This code is developed based on NeuralTalk2.

Thanks to the Torch team and the Facebook ResNet implementation.

License

BSD 3-Clause License

AdaptiveAttention Issues

The issue of the pre-trained model

Hello, when I loaded the pre-trained model, the following problem occurred:
/opt/zbstudio/bin/linux/x64/lua: ...ngll/PycharmProjects/AdaptiveAttention-master1/train.lua:162: bad argument #1 to 'copy' (sizes do not match at /tmp/luarocks_cutorch-scm-1-9584/cutorch/lib/THC/generic/THCTensorCopy.c:48)
stack traceback:
[C]: in function 'copy'
...ngll/PycharmProjects/AdaptiveAttention-master1/train.lua:162: in main chunk
[C]: at 0x00404f08
Can you help me?

How to eval my own images?

What to do if I want to evaluate my own images like neuraltalk2: " $ th eval.lua -model /path/to/model -image_folder /path/to/image/directory -num_images 10 "

Error in demo.ipynb (Evaluation script)

I am having this error in Demo.ipynb

./misc/utils.lua:54: bad argument #1 to 'pairs' (table expected, got nil)
stack traceback:
[C]: in function 'pairs'
./misc/utils.lua:54: in function 'count_keys'
[string "opt = {}..."]:9: in main chunk
[C]: in function 'xpcall'
/home/imrankhurram/.luarocks/share/lua/5.1/itorch/main.lua:210: in function </home/imrankhurram/.luarocks/share/lua/5.1/itorch/main.lua:174>
/home/imrankhurram/.luarocks/share/lua/5.1/lzmq/poller.lua:80: in function 'poll'
.../imrankhurram/.luarocks/share/lua/5.1/lzmq/impl/loop.lua:307: in function 'poll'
.../imrankhurram/.luarocks/share/lua/5.1/lzmq/impl/loop.lua:325: in function 'sleep_ex'
.../imrankhurram/.luarocks/share/lua/5.1/lzmq/impl/loop.lua:370: in function 'start'
/home/imrankhurram/.luarocks/share/lua/5.1/itorch/main.lua:389: in main chunk
[C]: in function 'require'
(command line):1: in main chunk
[C]: at 0x00406670

What is the solution?

Performance on Flickr30k Dataset

Hi, I used your pretrained model for the Flickr30k dataset. However, the performance is much worse than the performance reported in the paper.
Bleu_1: 0.206
Bleu_2: 0.122
Bleu_3: 0.077
Bleu_4: 0.051
computing METEOR score...
METEOR: 0.108
computing Rouge score...
ROUGE_L: 0.278
computing CIDEr score...
CIDEr: 0.377
computing SPICE score...
Parsing reference captions
Parsing test captions
SPICE evaluation took: 3.287 s
SPICE: 0.161
Could you please check the pretrained model?
By the way, after creating the h5 and json files for the Flickr30k dataset, the parameter count did not match the pretrained model, so I had to set word_count_threshold to 4.

Question about adaptive attention

In your paper, Section 2.2, you write
[equation screenshot]
and
[equation screenshot].
But in your code, in attention.lua,

local probs3dim = nn.View(1,-1):setNumInputDims(1)(PI)

Does this mean
[equation screenshot]?

Unsupported hdf5 version

Line utils = require 'misc.utils' fails with the following traceback:
/home/****/torch/install/share/lua/5.1/hdf5/ffi.lua:71: Unsupported HDF5 version: 1.10.1

I installed it as described in the torch-hdf5 repo, and version 1.10 should be newer than 1.8...
I also ran dpkg -s libhdf5-dev, which gave me Version: 1.8.16+docs-4ubuntu1, again newer than 1.8.14.
Could you help?

How to test on test images

Hi Jiasen,

I want to test the model on the 2014 testing images. After using prepro_coco_test.py to process the test images, I find that I am not able to get eval_visulization.lua to run correctly.

"ix_to_word" is lack in the output json made by prepro_coco_test.py
"labels" is lack in the output h5 made by prepro_coco_test.py

Could you tell me how to run on test images that have no ground-truth captions?

Thanks!

The performance of the released model is drastically better than the performance reported in the paper

Hi Jiasen,

The performance of your released model (coco_train.t7, cocotalk_vocab.json) seems to be much better than the performance reported in the highlighted row of Table 1 in your paper (screenshot attached). I feel that I must be misunderstanding something about the code/models.

My understanding is that the following models were trained using the standard Karpathy splits of the MS COCO captions, and that model (1) was used to generate the lowermost results in Table 1.

URL=https://filebox.ece.vt.edu/~jiasenlu/codeRelease/AdaptiveAttention
wget $URL/model/COCO/coco_train/coco_train.t7 # (1)
wget $URL/data/COCO/cocotalk_vocab.json -O coco_vocab.json 
wget $URL/model/COCO/coco_challenge/model_id1_34.t7 -O coco_challenge_model_id1_34.t7 # (2)
wget $URL/data/COCO/cocotalk_challenge_vocab.json -O coco_challenge_vocab.json

However, when I test the predictions of these models on the test portion of the Karpathy splits, all the metrics are much higher than the ones reported in Table 1 of the paper. Do you have any idea why the eval metrics might be so much better than those reported in the paper? I have validated my own evaluation code by reproducing the results in the LRCN paper, so I am fairly sure it is correct.

            Bleu1  B2     B3     B4     Cider  Meteor  Rouge
In paper    0.742  0.580  0.439  0.332  1.085  0.266   0.549
MyEval (1)  0.794  0.647  0.513  0.403  1.287  0.293   0.595
MyEval (2)  0.782  0.628  0.485  0.368  1.219  0.285   0.580

[screenshot of Table 1 from the paper]

Question about Eq(8)

My question is about Eq. (8) in the paper.

v represents the spatial image features, and I think it should be time-invariant, so why does it have the subscript "t"?

Thanks.

Demo on CPU issue

The pretrained model is giving CUDA errors on CPU (urgent help needed).

Loading model from: save/coco_train.t7
/home/sarath/torch/install/share/lua/5.1/torch/File.lua:343: unknown Torch class <cudnn.SpatialConvolution>
stack traceback:
[C]: in function 'error'
/home/sarath/torch/install/share/lua/5.1/torch/File.lua:343: in function 'readObject'
/home/sarath/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
/home/sarath/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
/home/sarath/torch/install/share/lua/5.1/nn/Module.lua:192: in function 'read'
/home/sarath/torch/install/share/lua/5.1/torch/File.lua:351: in function 'readObject'
/home/sarath/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
/home/sarath/torch/install/share/lua/5.1/torch/File.lua:369: in function 'readObject'
/home/sarath/torch/install/share/lua/5.1/torch/File.lua:409: in function 'load'
[string "opt = {}..."]:14: in main chunk
[C]: in function 'xpcall'
/home/sarath/torch/install/share/lua/5.1/itorch/main.lua:210: in function </home/sarath/torch/install/share/lua/5.1/itorch/main.lua:174>
/home/sarath/torch/install/share/lua/5.1/lzmq/poller.lua:75: in function 'poll'
/home/sarath/torch/install/share/lua/5.1/lzmq/impl/loop.lua:307: in function 'poll'
/home/sarath/torch/install/share/lua/5.1/lzmq/impl/loop.lua:325: in function 'sleep_ex'
/home/sarath/torch/install/share/lua/5.1/lzmq/impl/loop.lua:370: in function 'start'
/home/sarath/torch/install/share/lua/5.1/itorch/main.lua:389: in main chunk
[C]: in function 'require'
(command line):1: in main chunk
[C]: at 0x00406670


Will the uploaded checkpoints work for a CPU-only demo?
If not, could you supply checkpoints for a CPU demo?

Some confusion about adaptive attention model

According to the paper "Knowing When to Look", the LSTM only receives the word vector x_t and the previous hidden state h_{t-1}, not the image vector, but your code includes the image vector when building the LSTM.
Would you please explain this?
Thank you very much.

Clarification on code + paper

Hi @jiasenlu,

Thanks for uploading the code! I was going through it and comparing it with the paper.

I have a few questions on both the paper and code and was hoping you could clarify :

  1. Is this an updated model, different from the paper? For instance, I couldn't find gradient clipping and dropout in the paper.
  2. In Section 3. of the paper, it says

We use a single layer neural network to transform the visual sentinel vector st and LSTM output vector ht into new vectors that have the dimension d

Can you tell me what the single-layer network in the paper is?
If it refers to W_s and W_g, aren't those converted to k-dimensional vectors?
  1. In the code, at L29-L31, if I'm not mistaken, you're transforming w_t and v^g individually (notation from the paper) and summing them. In the paper, however, they're simply concatenated and transformed to i2h.

  2. In the attention.lua module, I'm not sure what the two conv inputs are; the paper only indicates h, s, and V as inputs.

404: no model available; I want to test

Pretrained Model

For evaluation, you can directly download the pretrained model.

The pre-trained model for COCO can be downloaded here. The pre-trained model for Flickr30K can be downloaded here.

question about "visual sentinel"

Dear Jiasen Lu,
Thank you for your work on "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning".
I am writing to ask about the "visual sentinel": what is the difference between your "visual sentinel" and the hidden state h_t?
I think your visual sentinel "s_t" and the LSTM's hidden state "h_t" are the same except for the different symbols. Am I right? If I am wrong, could you kindly give me some further explanation?

Many thanks in advance for your answer.
Kind regards
Zhennan Wang

Is there a typo in your paper?

In your paper, I think the formulation of the sentinel gate should be:
[equation screenshot],
but both the CVPR and arXiv versions of your paper show:
[equation screenshot].

Is this a typo?
Thanks.

How to get results close to those in the paper on the test set?

Thanks for the code!
I have run the code and found that the validation results are very good, but the test results are embarrassing. I first used the default parameter settings, and then changed the settings according to the paper; the situation is the same either way. Like the note in the code says, I'm dying over here. Does anyone know how to obtain better results during testing? Please tell me.
Thanks again!
