
ban-vqa's Introduction

Bilinear Attention Networks

โš ๏ธ Regrettably, I cannot perform maintenance due to the loss of the materials. I'm archiving this repository for reference

This repository is the implementation of Bilinear Attention Networks for the visual question answering and Flickr30k Entities tasks.

For the visual question answering task, our single model achieved 70.35 and an ensemble of 15 models achieved 71.84 (Test-standard, VQA 2.0). For the Flickr30k Entities task, our single model achieved 69.88 / 84.39 / 86.40 for Recall@1, 5, and 10, respectively (slightly better than the original paper). For details, please refer to our technical report.

This repository is based on and inspired by @hengyuan-hu's work. We sincerely thank them for sharing their code.

Overview of bilinear attention networks
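
The overview figure referenced above illustrates how bilinear attention relates image regions to question words. As a rough sketch (not the repository's exact code; all shapes and names below are assumptions), the per-glimpse attention logits can be computed with a single torch.einsum, which is what the einsum update below refers to:

import torch

# Hedged illustration of bilinear attention logits via torch.einsum.
# Shapes are assumptions:
#   V: (batch, num_objects, d)  -- projected visual features
#   Q: (batch, num_tokens, d)   -- projected question features
#   p: (glimpses, d)            -- per-glimpse projection vectors
V = torch.randn(8, 36, 512)
Q = torch.randn(8, 14, 512)
p = torch.randn(4, 512)

# logits[b, g, i, j] = sum_k p[g, k] * V[b, i, k] * Q[b, j, k]
logits = torch.einsum('gk,bik,bjk->bgij', p, V, Q)

# normalize over all object-word pairs within each glimpse
att = torch.softmax(logits.flatten(2), dim=-1).view_as(logits)
print(att.shape)  # torch.Size([8, 4, 36, 14])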

Updates

  • Bilinear attention networks using torch.einsum, backward-compatible. (12 Mar 2019)
  • Now compatible with PyTorch v1.0.1. (12 Mar 2019)

Prerequisites

You may need a machine with 4 GPUs, 64GB memory, and PyTorch v1.0.1 for Python 3.

  1. Install PyTorch with CUDA and Python 3.6.
  2. Install h5py.

WARNING: do not use PyTorch v1.0.0 due to a bug that degrades performance.

VQA

Preprocessing

Our implementation uses the pretrained features from bottom-up-attention (the adaptive 10-100 features per image) and the GloVe vectors. For simplicity, the script below helps you avoid the hassle.

All data should be downloaded to a data/ directory in the root directory of this repository.

The easiest way to download the data is to run the provided script tools/download.sh from the repository root. If the script does not work, examine it and adapt the steps outlined in it to your needs. Then run tools/process.sh from the repository root to process the data into the correct format.
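
Once the scripts finish, you can sanity-check the processed data. This is a hedged sketch, not part of the repository; the file names follow paths quoted elsewhere on this page (data/train_imgid2idx.pkl, the 'image_features' HDF5 key), so adjust them if your layout differs.

import pickle
import h5py

# Hedged sanity check after tools/process.sh; file names are assumptions
# based on paths quoted elsewhere on this page.
with open('data/train_imgid2idx.pkl', 'rb') as f:
    train_imgid2idx = pickle.load(f)
print('images indexed for train:', len(train_imgid2idx))

feature_file = 'data/train.hdf5'  # assumption -- the exact name depends on your processing setup
with h5py.File(feature_file, 'r') as hf:
    print('image_features shape:', hf['image_features'].shape)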

For now, you need to manually download the data for the options below (used in our best single model).

We use a part of the Visual Genome dataset for data augmentation. The image metadata and question answers of version 1.2 need to be placed in data/.

We use MS COCO captions to extract semantically connected words for the extended word embeddings, along with the questions of VQA 2.0 and Visual Genome. You can download them here. Since the contribution of these captions is minor, you can skip processing the MS COCO captions by removing the cap elements in the target option in this line.

The counting module (Zhang et al., 2018) is integrated into this repository as counting.py for your convenience. The source repository can be found at @Cyanogenoid's vqa-counting.
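
Below is a hedged usage sketch of the counting module; the argument names and tensor shapes are my assumptions from reading counting.py, so double-check the file for the exact signature.

import torch
from counting import Counter  # counting.py in this repository (from @Cyanogenoid's vqa-counting)

# Hedged sketch -- verify the exact signature in counting.py.
# boxes: (batch, 4, num_objects) with (x1, y1, x2, y2) per object,
# attention: (batch, num_objects) attention weights over the objects.
counter = Counter(objects=10, already_sigmoided=False)  # argument names are assumptions

boxes = torch.rand(2, 4, 10)
attention = torch.rand(2, 10)
count_features = counter(boxes, attention)
print(count_features.shape)  # expected: (2, objects + 1)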

Training

$ python3 main.py --use_both True --use_vg True

Run the command above to start training (the options enable training on the train+val splits and with Visual Genome, respectively). The training and validation scores will be printed every epoch, and the best model will be saved under the directory saved_models. The default hyperparameters should give you the best single-model result, which is around 70.04 for the test-dev split.
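
The training log excerpts quoted in the issues below ("optim: adamax lr=0.0007, decay_step=2, decay_rate=0.25, grad_clip=0.25", "gradual warmup lr: ...") suggest Adamax with a gradual learning-rate warmup followed by step decay. Here is a minimal, hedged sketch of such a schedule; it is illustrative only, not the repository's exact implementation.

import torch

# Hedged sketch of the schedule suggested by the training logs quoted below
# (Adamax, gradual warmup to 2x the base lr, step decay later); illustrative only.
model = torch.nn.Linear(10, 10)                        # stand-in for the BAN model
base_lr = 0.0007
optimizer = torch.optim.Adamax(model.parameters(), lr=base_lr)

warmup = [0.5, 1.0, 1.5, 2.0]                          # 0.0003 -> 0.0007 -> 0.0010 -> 0.0014 as in the logs
for epoch in range(13):
    if epoch < len(warmup):
        lr = base_lr * warmup[epoch]                   # gradual warmup
    elif epoch >= 10 and (epoch - 10) % 2 == 0:        # assumed: decay_step=2 applied after epoch 10
        lr = optimizer.param_groups[0]['lr'] * 0.25    # decay_rate=0.25 from the logs
    else:
        lr = optimizer.param_groups[0]['lr']
    for group in optimizer.param_groups:
        group['lr'] = lr

    # dummy step so the sketch runs end to end
    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.25)  # grad_clip=0.25 from the logs
    optimizer.step()
    optimizer.zero_grad()
    print(f'epoch {epoch}, lr {lr:.4f}')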

Validation

If you trained a model with the training split using

$ python3 main.py

then you can run evaluate.py with appropriate options to evaluate its score for the validation split.
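
For example (the --input and --epoch flags below are inferred from the Flickr30k section later on this page, so adjust them to your setup):

$ python3 evaluate.py --input saved_models/ban --epoch 12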

Pretrained model

We provide the pretrained model reported as the best single model in the paper (70.04 for test-dev, 70.35 for test-standard).

Please download the model from the link and move it to saved_models/ban/model_epoch12.pth (you may encounter a redirection page asking you to confirm). The training log can be found here.

$ python3 test.py --label mytest

The resulting JSON file will be written to the results/ directory.

Without Visual Genome augmentation

Without the Visual Genome augmentation, we get 69.50 (average of 8 models, with a standard deviation of 0.096) for the test-dev split. We use the 8-glimpse model with a learning rate starting at 0.001 (please see this change for better results), 13 epochs, and a batch size of 256.

Flickr30k Entities

Preprocessing

You have to manually download the Annotation and Sentence files to data/flickr30k/Flickr30kEntities.tar.gz. Then run the provided scripts tools/download_flickr.sh and tools/process_flickr.sh from the root of this repository, similarly to the VQA case. Note that the image features of Flickr30k were generated using the bottom-up-attention pretrained model.

Training

$ python3 main.py --task flickr --out saved_models/flickr

Run the command above to start training. The --gamma option is not applied here. The default hyperparameters should give you approximately 69.6 for Recall@1 on the test split.

Validation

Please download the model from the link and move it to saved_models/flickr/model_epoch5.pth (you may encounter a redirection page asking you to confirm).

$ python3 evaluate.py --task flickr --input saved_models/flickr --epoch 5

Run the command above to evaluate the scores on the test split.

Troubleshooting

Please check the troubleshooting wiki and the previous issue history.

Citation

If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:

@inproceedings{Kim2018,
author = {Kim, Jin-Hwa and Jun, Jaehyun and Zhang, Byoung-Tak},
booktitle = {Advances in Neural Information Processing Systems 31},
title = {{Bilinear Attention Networks}},
pages = {1571--1581},
year = {2018}
}

License

MIT License

ban-vqa's People

Contributors

jaesuny, jnhwkim


ban-vqa's Issues

How to use the pretrained model

Hello,
This is my first time using a VQA network. How can I use the pretrained model to ask a question about an image and get a response? Thank you.

Attention Visualization

Hi,
Love your work and repository

I just want to know how I can get the attention visualization (like Figures 3 and 4 in the paper).

bug in bc.py

Line 39 in bc.py is:
self.h_net = weight_norm(nn.Linear(h_dim, h_out), dim=None)
Should this be
self.h_net = weight_norm(nn.Linear(h_dim*self.k, h_out), dim=None)?

cannot reproduce the best result of single model

I followed all the instructions and used the default hyperparameters, which should give the best results. However, with the default random seed 1204, I only get 69.84 on the test-dev split, which is 0.2 lower than the reported result. I also notice that the standard deviation reported on the val split is around 0.11.
Can you give me some advice on how to close the gap?
Thanks!

test.py

Hello,

I'd like to use the model. I expected to provide a question as a string and an image path, but in test.py the input is the saved model. Where do I input the question and image to test the model?

how to get the files

I don't have 'data/question_answers.json' or 'image_data.json'. How can I get or generate them?

Out of memory while executing loss.backward()

Hello, thanks for your great code! I have some trouble while running

python3 main.py --use_both True --use_vg True

I have 4 TITAN Xp GPUs with 12.2 GB of memory each, and I set the batch size to 256. Then I get the following error:

nParams= 90618566
optim: adamax lr=0.0007, decay_step=2, decay_rate=0.25, grad_clip=0.25
gradual warmup lr: 0.0003
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
File "main.py", line 97, in
train(model, train_loader, eval_loader, args.epochs, args.output, optim, epoch)
File "/home/Project/ban-vqa/train.py", line 74, in train
loss.backward()
File "/home/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/home/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/autograd/init.py", line 99, in backward
variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58

If I set the batch size to 128, it occupies ~12 GB of GPU memory during the early stage and then drops to ~6 GB per GPU. Is there something wrong with my setup?
Thanks!

KeyError: 1 when running test.py

When I run python test.py --label mytest, I get this error:

Traceback (most recent call last):
  File "test.py", line 91, in <module>
    eval_dset = VQAFeatureDataset(args.split, dictionary, adaptive=True)
  File "/home/gwh/Downloads/ban-vqa-master/dataset.py", line 244, in __init__
    self.entries = _load_dataset(dataroot, name, self.img_id2idx, self.label2ans)
  File "/home/gwh/Downloads/ban-vqa-master/dataset.py", line 142, in _load_dataset
    entries.append(_create_entry(img_id2val[img_id], question, None))
KeyError: 1

I find that data/test2015_imgid2idx.pkl is {}; the file was generated with python3 tools/adaptive_detection_features_converter.py.

Can you help me? @jnhwkim Thanks in advance for any suggestions.

Inaccessible questions and annotations

Hello, thanks for your work! The links for the questions and annotations in download.sh are inaccessible to me, so I used the questions and annotations from VQA [https://visualqa.org], such as this one (https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Train_mscoco.zip). However, I got a huge train_loss while running python main.py --use_both True --use_vg True --batch_size 32.
I was wondering if I used the wrong data. If so, could anyone tell me or provide another valid link?

Question about Visual Genome version

Hi @jnhwkim, I have a question about the Visual Genome version.

The README.md says to use Visual Genome version 1.2 [screenshot].

But in dataset.py, the version 1.2 image_data.json does not have a key called id; this key exists in version 1.0 [screenshot]. Here is an example from version 1.2 [screenshot].

So which version should I use?

Thank you in advance!

How to get labels for objects?

Hi, I am very interested in your BAN model on Flickr30k. Do you provide labels for the detected objects together with the bounding boxes and features, as faster-rcnn or bottom-up attention would? Since I am not sure how you prepared your dataset, I'm afraid that if I use pre-trained models to predict labels myself, the dataloader pipeline would have problems. Thanks!

Reproducing error

While trying to reproduce this result, I encountered the following error.
Traceback (most recent call last):
File "main.py", line 96, in
train(model, train_loader, eval_loader, args.epochs, args.output, optim, epoch)
File "/home/tingting/Documents/tingting/ban-vqa/train.py", line 72, in train
pred, att = model(v, b, q, a)
File "/home/tingting/tingting/lib/python3.5/site-packages/torch/nn/modules/module.py", line 491, in call
result = self.forward(*input, **kwargs)
File "/home/tingting/tingting/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 113, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/home/tingting/tingting/lib/python3.5/site-packages/torch/nn/parallel/data_parallel.py", line 118, in replicate
return replicate(module, device_ids)
File "/home/tingting/tingting/lib/python3.5/site-packages/torch/nn/parallel/replicate.py", line 12, in replicate
param_copies = Broadcast.apply(devices, *params)
RuntimeError: slice() cannot be applied to a 0-dim tensor

After tracing the code, I found that it works well if I delete "nn.DataParallel(model).cuda()".

I use four GTX 1080 Ti GPUs. Have you encountered the same thing before?

flickr30k upperbound

Hello,

I used Bottom-up Attention to get boxes for the Flickr30k data. Unfortunately, I could not reach the upper bound you reported in the paper: I get 0.6507, while you reported 0.8745. Would you mind providing details on how you used the Bottom-up model to induce boxes? My settings are listed below:

model_name: resnet101_faster_rcnn_final.caffemodel
conf_thresh=0.2
min_boxes=10
max_boxes=100

UPDATE:

When I increase the number of boxes I get a better upper bound, but it is still not as good as yours; the setup below gives me an upper bound of 0.8530:

model_name: resnet101_faster_rcnn_final.caffemodel
conf_thresh=0.01
min_boxes=200
max_boxes=200

Training too slow

My machine has three 1080 Ti GPUs, 12 Intel i7 CPU cores, and 65 GB of memory in total. However, the program takes more than 5800 seconds per epoch.
My command is python3 main.py --use_both True --use_vg True --batch_size 128, because batch size 256 runs out of memory.

epoch 1, time: 5844.42
        train_loss: 3.32, norm: 4.2468, score: 51.21
gradual warmup lr: 0.0010
epoch 2, time: 5844.72
        train_loss: 3.05, norm: 2.5201, score: 55.44
gradual warmup lr: 0.0014
epoch 3, time: 5839.73
        train_loss: 2.90, norm: 1.7370, score: 58.02
lr: 0.0014
epoch 4, time: 5835.09
        train_loss: 2.75, norm: 1.3749, score: 60.45
lr: 0.0014
epoch 5, time: 5837.11
        train_loss: 2.64, norm: 1.2232, score: 62.33
lr: 0.0014
epoch 6, time: 5829.90
        train_loss: 2.54, norm: 1.1545, score: 63.88
lr: 0.0014
epoch 7, time: 5832.88
        train_loss: 2.46, norm: 1.1238, score: 65.32
lr: 0.0014
epoch 8, time: 5834.77
        train_loss: 2.39, norm: 1.1157, score: 66.59

Trouble creating ID.pkls

Hello :)

first of all thank you for sharing your repo!

I am having trouble creating these files:
indices_file = {
'train': 'data/train_imgid2idx.pkl',
'val': 'data/val_imgid2idx.pkl',
'test': 'data/test2015_imgid2idx.pkl'}
ids_file = {
'train': 'data/train_ids.pkl',
'val': 'data/val_ids.pkl',
'test': 'data/test2015_ids.pkl'}

because utils.py requires the .jpg images to build the indices, and they are not available at this point. Could you be so kind as to share the id .pkl files?

thank you and best regards
Max

Flickr30K evaluation?

It seems like the Flickr30K grounding task in the report is not included in the repo.
Am I missing something?

Compared models without using Visual Genome

Hi Kim:

Thanks for sharing your great work and elegant codes.

I have questions about your test-dev results. As your README.md indicates, the training includes the data-augmentation trick with Visual Genome. However, the compared models (Counter, Bottom-Up) in your paper did not use Visual Genome for training. That seems like an unfair comparison.

Have you trained the BAN model without Visual Genome? I think it would better verify your model's efficiency.

Ensemble details

Hi, thanks for the library.
Is it possible to share details of your ensemble method?

link no longer works

Dear authors:
The links to the image metadata and question answers no longer work. Could you provide them again?

error when using adaptive_detection_features_converter.py

While running adaptive_detection_features_converter.py on the TSV files, I am getting this error and can't resolve it. Any leads would be helpful. The error occurs when trying to decode the features/boxes from the TSV file.

File "tools/adaptive_detection_features_converter.py", line 156, in extract
bboxes = np.frombuffer(base64.decodestring(item['boxes']), dtype=np.float32).reshape((item['num_boxes'], -1))
File "/home/reddy/myvenv/lib/python3.6/base64.py", line 554, in decodestring
return decodebytes(s)
File "/home/reddy/myvenv/lib/python3.6/base64.py", line 546, in decodebytes
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding

Memory error

Hi, I am trying to run your repository, but I keep getting the following error:

Namespace(batch_size=128, epochs=13, gamma=8, input=None, model='ban', num_hid=1280, op='c', output='saved_models/ban', seed=1204, tfidf=True, use_both=False, use_vg=False)
loading dictionary from data/dictionary.pkl
loading features from h5 file
Traceback (most recent call last):
  File "main.py", line 50, in <module>
    train_dset = VQAFeatureDataset('train', dictionary, adaptive=True)
  File "/home/michas/Desktop/codes/ban-vqa/dataset.py", line 234, in __init__
    self.features = np.array(hf.get('image_features'))
MemoryError

I suppose this is happening because the whole dataset is loaded into RAM as a numpy array (I have 32 GB). Can you suggest a solution?
Thanks
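
A common workaround for this kind of error (a hedged sketch, not an official fix from the repository) is to keep the HDF5 file open and index it lazily instead of materializing it with np.array(hf.get('image_features')):

import h5py

# Hedged workaround sketch: index the HDF5 dataset lazily so only the
# requested slices are read into memory.
hf = h5py.File('data/train.hdf5', 'r')   # file name is an assumption; use your feature file
features = hf['image_features']          # an h5py dataset; data stays on disk

row = features[0]                        # only this slice is read into memory
print(features.shape, row.shape)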

Evaluating accuracy on test?

When I run python3 test.py --label mytest, I get the warning 'RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().'. The code still completes, but the result evaluated on the VQA challenge is only 1% overall. I used your pretrained model and features.

Do you use the VG dataset to get the results on validation set

Hi Kim:
Thanks for your excellent work and code

Did you use the Visual Genome dataset for training to get the validation-set results listed in Table 1 of your paper? Since you compared with the Bottom-Up and Top-Down results, which used the VG dataset, I assume you also used VG + VQA 2.0 train to get the final validation results. Am I right?

Question

Hello guys,

Very nice piece of work.
I was wondering why you didn't use an einsum implementation of the bilinear attention to speed up training. [equation screenshot]
This equation is perfect for it. You should see a significant gain, and it would be nice for once to have highly optimized code available on GitHub.

Best,
T.C

error from tools/process.sh

I have downloaded everything listed in tools/download.sh.
Could you provide the missing data as well?
Thank you.

Traceback (most recent call last):
File "tools/adaptive_detection_features_converter.py", line 199, in
extract('train', infiles, args.task)
File "tools/adaptive_detection_features_converter.py", line 94, in extract
imgids = utils.load_imageid(path_imgs[split])
File "/home/sizhangyu/Documents/pytorch_code/ban-vqa/utils.py", line 47, in load_imageid
images = load_folder(folder, 'jpg')
File "/home/sizhangyu/Documents/pytorch_code/ban-vqa/utils.py", line 40, in load_folder
for f in sorted(os.listdir(folder)):
FileNotFoundError: [Errno 2] No such file or directory: 'data/train2014'

flickr 30k features download

Are the hdf5 files in the downloaded flickr30k_features.zip the ones used to reproduce the results? I don't see TSV files in flickr30k_features.zip, but I need the features and bounding boxes for the Flickr30k validation/test sets. The files in flickr30k_features.zip are confusing: for example, the val.hdf5 file contains (30722, 2048) features, but in adaptive_detection_features_converter.py, known_num_boxes for the validation set is 29906. So what are these 30722 features?

train36_imgid2idx.pkl file

Hi, thank you for sharing your code. I was wondering, what exactly does data/train36_imgid2idx.pkl contain?
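
Judging from how the *_imgid2idx.pkl files are used in dataset.py (img_id2val[img_id] in the tracebacks above), they appear to be plain Python dicts mapping COCO image ids to row indices in the corresponding HDF5 feature file. A quick, hypothetical way to inspect one:

import pickle

# Hypothetical inspection snippet; the path assumes the standard data/ layout.
with open('data/train36_imgid2idx.pkl', 'rb') as f:
    imgid2idx = pickle.load(f)

print(type(imgid2idx), len(imgid2idx))   # expected: a dict, one entry per image
some_id = next(iter(imgid2idx))
print(some_id, imgid2idx[some_id])       # COCO image id -> row index in the feature file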

Error in Flickr30k features

Dear authors,

I saw your previous answer, but I didn't have time to answer before the issue was closed.
I have tried two different Linux systems and also Windows, with both Chrome and Firefox. I can download the package but cannot unzip it: it gives an error with the train.hdf5 file, saying the file is corrupted. I also tried two different internet connections and downloaded the file several times, but the result is always the same.

Could you please check the train.hdf5 file?
Davide

Originally posted by @drigoni in #46 (comment)

Evaluating pretrained model

Hello,

I am trying to evaluate the pretrained model on the VQA dataset. If possible, I would like to ask you the following questions:

  1. I executed the command "python3.6 evaluate.py". However, in that case, the script returns the following error:
Evaluate a given model optimized by training split using validation split.
loading dictionary from data/dictionary.pkl
loading features from h5 file
Traceback (most recent call last):
  File "evaluate.py", line 47, in <module>
    model.load_state_dict(model_data.get('model_state', model_data))
  File "/home/claudio.greco/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
    .format(name))
KeyError: 'unexpected key "module.w_emb.emb_.weight" in state_dict'

Probably, this happens because the default parameters of the script do not match the ones of the pretrained model. Am I right?

  2. To solve problem (1), I executed the command "python3.6 evaluate.py --num_hid=1280 --op='c' --gamma=8". In this case it works, but the script returns "eval score: 82.23 (92.66)", which seems a bit too high to me. Which row and table in the paper should I compare this result to?

  3. I tried to evaluate the pretrained model on the test split of the VQA dataset by changing "eval_dset = VQAFeatureDataset('dev', dictionary, adaptive=True)" to "eval_dset = VQAFeatureDataset('test2015', dictionary, adaptive=True)" in the evaluate.py script. However, in that case, the script returns the following error:

Evaluate a given model optimized by training split using validation split.
loading dictionary from data/dictionary.pkl
loading features from h5 file
/mnt/8tera/claudio.greco/ban-vqa/language_model.py:95: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
  output, hidden = self.rnn(x, hidden)
Traceback (most recent call last):
  File "evaluate_new.py", line 51, in <module>
    eval_score, bound, entropy = evaluate(model, eval_loader)
  File "/mnt/8tera/claudio.greco/ban-vqa/train.py", line 121, in evaluate
    batch_score = compute_score_with_logits(pred, a.cuda()).sum()
  File "/mnt/8tera/claudio.greco/ban-vqa/train.py", line 26, in compute_score_with_logits
    one_hots.scatter_(1, logits.view(-1, 1), 1)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)

Do you know why this is happening?

Thank you very much!

tar cache.pkl.tgz error, when downloading Pickle caches for the pretrained model

Thanks a lot for sharing code!
After downloading cache.pkl.tgz and entering the following command:

tar xvf data/cache/cache.pkl.tgz -C data/cache/

I got:

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now

Is there something wrong with the cache file on Google Drive?

I got an error with arguments

When I run main.py, I get an error:

main.py: error: unrecognized arguments: True True

Then I changed the command from

$ python3 main.py --use_both True --use_vg True

to

$ python3 main.py --use_both --use_vg
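
A likely explanation (an assumption about the argument parser, not a statement about the repository's exact code): if --use_both and --use_vg are declared with action='store_true', they take no value, so the trailing "True" tokens are rejected as unrecognized arguments. A minimal sketch of the two styles:

import argparse

def str2bool(v):
    # Accept "--flag True" / "--flag False" style values.
    return str(v).lower() in ('true', '1', 'yes')

parser = argparse.ArgumentParser()
parser.add_argument('--use_both', type=str2bool, default=False)  # value style: --use_both True
parser.add_argument('--use_vg', action='store_true')             # switch style: --use_vg

print(parser.parse_args(['--use_both', 'True', '--use_vg']))
# Namespace(use_both=True, use_vg=True)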

Which files are needed for inference only?

I only want to run inference with this model.

Is it possible to have only the pre-trained model file for inference?
If not, should I run both download.sh and download_data.sh for inference only?

Evaluate.py

When running evaluate.py with the pretrained model, is there a way to run the evaluation without needing a GPU/CUDA?
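
One common approach (an assumption here, not something the repository documents) is to load the checkpoint onto the CPU with map_location and build the model without the .cuda()/DataParallel calls; a minimal sketch:

import torch

# Hedged sketch: load a GPU-trained checkpoint on a CPU-only machine.
# The path matches the pretrained model location used earlier on this page.
checkpoint = torch.load('saved_models/ban/model_epoch12.pth',
                        map_location=torch.device('cpu'))
state_dict = checkpoint.get('model_state', checkpoint)  # mirrors the loading pattern quoted above
print(len(state_dict), 'tensors in the checkpoint')

# evaluate.py itself may still need small edits to skip .cuda() / DataParallel.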
