
mcan-vqa's Introduction

Deep Modular Co-Attention Networks (MCAN)

This repository corresponds to the PyTorch implementation of MCAN for VQA, which won first place in the VQA Challenge 2019. With an ensemble of 27 models, we achieved overall accuracies of 75.23% and 75.26% on the test-std and test-challenge splits, respectively. See our slides for details.

Using the commonly used bottom-up-attention visual features, a single MCAN model delivers 70.70% (small model) and 70.93% (large model) overall accuracy on the test-dev split of the VQA-v2 dataset, which significantly outperforms existing state-of-the-art methods. Please check our paper for details.

Overview of MCAN

Updates

July 10, 2019

  • A PyTorch implementation of MCAN, along with several other state-of-the-art models for VQA/GQA/CLEVR, is maintained in our OpenVQA project.

June 13, 2019

  • Pure PyTorch implementation of the MCAN model with the deep encoder-decoder strategy.
  • Self-contained documentation written from scratch.
  • A model zoo consisting of pre-trained MCAN-small and MCAN-large models on the VQA-v2 dataset.
  • Multi-GPU training and gradient accumulation.

Table of Contents

  1. Prerequisites
  2. Training
  3. Validation and Testing
  4. Pretrained models
  5. Citation

Prerequisites

Software and Hardware Requirements

You may need a machine with at least one GPU (>= 8 GB), 20 GB of RAM, and 50 GB of free disk space. We strongly recommend using an SSD drive to guarantee high-speed I/O.

You should first install some necessary packages.

  1. Install Python >= 3.5

  2. Install CUDA >= 9.0 and cuDNN

  3. Install PyTorch >= 0.4.1 with CUDA (PyTorch 1.x is also supported).

  4. Install spaCy and initialize the GloVe vectors as follows:

    $ pip install -r requirements.txt
    $ wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
    $ pip install en_vectors_web_lg-2.1.0.tar.gz
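To verify that the GloVe vectors were installed correctly, you can run a quick sanity check like the following (an illustrative snippet, not part of the repository):

import spacy

# Loads the 300-d GloVe vectors from the en_vectors_web_lg-2.1.0 package installed above.
nlp = spacy.load('en_vectors_web_lg')
doc = nlp('What color is the cat?')
print(doc[1].vector.shape)   # expected: (300,)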

Setup

The image features are extracted using the bottom-up-attention strategy, with each image represented as a dynamic number (from 10 to 100) of 2048-D features. We store the features for each image in a .npz file. You can prepare the visual features yourself or download the extracted features from OneDrive or BaiduYun. The download contains three files: train2014.tar.gz, val2014.tar.gz, and test2015.tar.gz, corresponding to the features of the train/val/test images of VQA-v2, respectively. You should place them as follows:

|-- datasets
	|-- coco_extract
	|  |-- train2014.tar.gz
	|  |-- val2014.tar.gz
	|  |-- test2015.tar.gz

In addition, we use the VQA samples from the Visual Genome dataset to expand the training set. Similar to existing strategies, we preprocess the samples with two rules:

  1. Select the QA pairs whose corresponding images appear in the MSCOCO train and val splits.
  2. Select the QA pairs whose answers appear in the processed answer list (i.e., answers occurring more than 8 times among all VQA-v2 answers).

For convenience, we provide our processed Visual Genome question and annotation files. You can download them from OneDrive or BaiduYun and place them as follows:

|-- datasets
	|-- vqa
	|  |-- VG_questions.json
	|  |-- VG_annotations.json

After that, you can run the following script to set up all the configurations needed for the experiments:

$ sh setup.sh

Running the script will:

  1. Download the QA files for VQA-v2.
  2. Unzip the bottom-up features.

Finally, the datasets folder will have the following structure:

|-- datasets
	|-- coco_extract
	|  |-- train2014
	|  |  |-- COCO_train2014_...jpg.npz
	|  |  |-- ...
	|  |-- val2014
	|  |  |-- COCO_val2014_...jpg.npz
	|  |  |-- ...
	|  |-- test2015
	|  |  |-- COCO_test2015_...jpg.npz
	|  |  |-- ...
	|-- vqa
	|  |-- v2_OpenEnded_mscoco_train2014_questions.json
	|  |-- v2_OpenEnded_mscoco_val2014_questions.json
	|  |-- v2_OpenEnded_mscoco_test2015_questions.json
	|  |-- v2_OpenEnded_mscoco_test-dev2015_questions.json
	|  |-- v2_mscoco_train2014_annotations.json
	|  |-- v2_mscoco_val2014_annotations.json
	|  |-- VG_questions.json
	|  |-- VG_annotations.json

Training

The following script will start training with the default hyperparameters:

$ python3 run.py --RUN='train'

All checkpoint files will be saved to:

ckpts/ckpt_<VERSION>/epoch<EPOCH_NUMBER>.pkl

and the training log file will be placed at:

results/log/log_run_<VERSION>.txt

You can add the following options (an example command combining several of them is shown after this list):

  1. --VERSION=str, e.g. --VERSION='small_model', to assign a name to your model.

  2. --GPU=str, e.g. --GPU='2', to train the model on a specified GPU device.

  3. --NW=int, e.g. --NW=8, to accelerate I/O speed.

  4. --MODEL={'small', 'large'} (Warning: the large model consumes more GPU memory; Multi-GPU Training and Gradient Accumulation may help if you want to train the model with limited GPU memory.)

  5. --SPLIT={'train', 'train+val', 'train+val+vg'} to combine the training datasets as you want. The default training split is 'train+val+vg'. Setting --SPLIT='train' will trigger the evaluation script to compute the validation score after every epoch automatically.

  6. --RESUME=True to resume training from saved checkpoint parameters. In this case, you should also assign the checkpoint version --CKPT_V=str and the resumed epoch number --CKPT_E=int.

  7. --MAX_EPOCH=int to stop training at a specified epoch number.

  8. --PRELOAD=True to pre-load all the image features into memory during the initialization stage (Warning: needs an extra 25~30 GB of memory and about 30 minutes of loading time from an HDD drive).
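For example, an illustrative combination of the options above (not a prescribed configuration):

$ python3 run.py --RUN='train' --VERSION='small_model' --MODEL='small' --GPU='0' --NW=8 --SPLIT='train'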

Multi-GPU Training and Gradient Accumulation

We recommend using a GPU with at least 8 GB of memory, but if you don't have such a device, don't worry: we provide two ways to deal with it:

  1. Multi-GPU Training:

    If you want to accelerate training or train the model on a device with limited GPU memory, you can use more than one GPU:

    Add --GPU='0, 1, 2, 3...'

    The batch size on each GPU will be adjusted to BATCH_SIZE/#GPUs automatically.

  2. Gradient Accumulation:

    If you only have one GPU with less than 8 GB of memory, an alternative strategy is to use gradient accumulation during training:

    Add --ACCU=n

    This makes the optimizer accumulate gradients for n small batches and update the model weights at once. Note that BATCH_SIZE must be divisible by n for this mode to work correctly.
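To illustrate what gradient accumulation does under the hood, here is a minimal, self-contained PyTorch sketch with dummy data (for illustration only; the repository implements this internally via the --ACCU flag):

import torch
import torch.nn as nn

net = nn.Linear(2048, 3129)                      # stand-in for the real model
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

accu_steps = 4                                   # corresponds to --ACCU=4
small_batch = 64                                 # BATCH_SIZE / accu_steps

optimizer.zero_grad()
for step in range(100):
    feats = torch.randn(small_batch, 2048)       # dummy image features
    target = torch.rand(small_batch, 3129)       # dummy soft answer scores
    loss = criterion(net(feats), target)
    (loss / accu_steps).backward()               # scale so the accumulated gradient matches one full batch
    if (step + 1) % accu_steps == 0:
        optimizer.step()
        optimizer.zero_grad()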

Validation and Testing

Warning: If you trained the model with the --MODEL argument or with multi-GPU training, the same settings must also be used during evaluation.

Offline Evaluation

Offline evaluation only supports the VQA 2.0 val split. If you want to evaluate on the VQA 2.0 test-dev or test-std split, please see Online Evaluation.

There are two ways to start:

(Recommended)

$ python3 run.py --RUN='val' --CKPT_V=str --CKPT_E=int

or use the absolute path instead:

$ python3 run.py --RUN='val' --CKPT_PATH=str

Online Evaluation

The evaluations of both the VQA 2.0 test-dev and test-std splits are run as follows:

$ python3 run.py --RUN='test' --CKPT_V=str --CKPT_E=int

Result files are stored in results/result_test/result_run_<'PATH+random number' or 'VERSION+EPOCH'>.json

You can upload the obtained result json file to Eval AI to evaluate the scores on test-dev and test-std splits.
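If you want to sanity-check the generated file before uploading, the EvalAI VQA submission format is a list of question_id/answer records. A quick peek (the file name below is illustrative; substitute the one produced by your run):

import json

with open('results/result_test/result_run_small_13.json') as f:
    results = json.load(f)

print(len(results))   # one record per test question
print(results[0])     # e.g. {'question_id': ..., 'answer': 'yes'}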

Pretrained models

We provide two pretrained models, namely the small model and the large model. The small model corresponds to the one described in our paper, with slightly higher performance (the overall accuracy on the test-dev split is 70.63% in our paper) due to different PyTorch versions. The large model uses a 2x larger HIDDEN_SIZE=1024, compared to HIDDEN_SIZE=512 for the small model.

The performance of the two models on the test-dev split is reported as follows:

Model | Overall | Yes/No | Number | Other
------|---------|--------|--------|------
Small | 70.70   | 86.91  | 53.42  | 60.75
Large | 70.93   | 87.39  | 52.78  | 60.98

These two models can be downloaded from OneDrive or BaiduYun. You should unzip them and put them into the correct folders as follows:

|-- ckpts
	|-- ckpt_small
	|  |-- epoch13.pkl
	|-- ckpt_large
	|  |-- epoch13.pkl

Set --CKPT_V={'small', 'large'} and --CKPT_E=13 to run testing or to resume training; details can be found in Training and Validation and Testing.
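For example, to evaluate the pretrained small model on the val split (an illustrative command assembled from the flags documented above; adjust to your setup):

$ python3 run.py --RUN='val' --CKPT_V='small' --CKPT_E=13 --MODEL='small'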

Citation

If this repository is helpful for your research, we'd really appreciate it if you could cite the following paper:

@inProceedings{yu2019mcan,
  author = {Yu, Zhou and Yu, Jun and Cui, Yuhao and Tao, Dacheng and Tian, Qi},
  title = {Deep Modular Co-Attention Networks for Visual Question Answering},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  pages = {6281--6290},
  year = {2019}
}

mcan-vqa's People

Contributors

cuiyuhao1996, mil-vlg, paradoxzw, yuzcccc


mcan-vqa's Issues

# of params for the small model

Your small model (6-layer) has 57,812,491 parameters. However, the paper reports 56M for it (Table 1b). What causes this discrepancy?

import functools
import operator
import torch

data = torch.load('ckpts/ckpt_small/epoch13.pkl', map_location='cpu')  # released small checkpoint
p = 0
for k, v in data['state_dict'].items():
    p += functools.reduce(operator.mul, v.size(), 1)
>>> p
57812491

net.train() not called again after evaluation finishes

Hi!

From your code, I noticed that after you run the eval function and start the next training epoch, you do not switch back to training mode with net.train(), since net.train() is called before the epoch loop. Shouldn't that have an effect, given that you are using dropout?
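For reference, a minimal sketch of the pattern the issue suggests, with a dummy model and data (illustrative only, not the repository's training loop):

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.1), nn.Linear(8, 2))
opt = torch.optim.SGD(net.parameters(), lr=0.1)

for epoch in range(3):
    net.train()                                    # re-enable dropout at the start of each epoch
    x, y = torch.randn(32, 8), torch.randint(0, 2, (32,))
    loss = nn.functional.cross_entropy(net(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

    net.eval()                                     # dropout disabled for validation
    with torch.no_grad():
        val_logits = net(torch.randn(32, 8))       # stand-in validation pass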

about val score

Thank you very much for sharing your work!
I have a question.
I trained the model with train_mode='train' and train_split='train'; the other settings were unchanged,
but the best score is no more than 65% on the val dataset.
In your work, the score is 66~67% on the val dataset.
What should I do about this?
Thanks again!

single question test

I am trying to do a single-question test, so I reduced test_question.json to one picture and one question, and I encountered this problem:
RuntimeError: Error(s) in loading state_dict for Net: size mismatch for embedding.weight: copying a param with shape torch.Size([20572, 300]) from checkpoint, the shape in current model is torch.Size([18405, 300]).

Do you know how to fix this? Or is test_question.json fixed, so that I cannot change the number of pictures and questions?

Can you give me some suggestions on how to test a single question on a single picture?

Thanks a lot.

size mismatch for embedding.weight

Thank you for the open source code!

I am trying to run the validation with the dataset you have provided, as well as the pretrained model "small".

I encounter this problem during evaluation of the validation set:

size mismatch for embedding.weight: copying a param with shape torch.Size([20572, 300]) from checkpoint, the shape in current model is torch.Size([14613, 300]).

Do you know why this happens and how to fix the difference in vocabulary size for the embedding?

Best,

Kayo

question regarding test split

Hello! Thanks for sharing your code and brilliant work.
I'd like to ask about the evaluation on test-std and test-dev. Is there any way to know the number of epochs needed for training, since evaluation is not available? I've seen that in your case you used the same number of training epochs (13). But I assume that since the training data is greatly increased (evaluation on the test set requires training on the 'train+val+vg' splits), the number of epochs required for convergence will also increase. Or do you evaluate a number of epochs on the online server to see which performs better?

Thanks for sharing your suggestion.

Regards

Code for learned image and question attentions

Hey,

in your paper you showed some examples of learned image and question attentions (Figure 8).
I want to replicate these examples if possible.
Can you post the code you used to extract and create the learned image and question attentions?

Best,
Karol

help!

What can I do about the following error?
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 90789960 (char 90789959)

Co-Attention?

The MCAN paper suggests that SGA (i.e. a guided attention module) is only used for question-guided attention over image content, but not the other way around (image-guided attention over question content). Could the authors please explain why they call this "CO-"attention even though there's no image-guided attention over question content? Or did I misunderstand the paper?

Greatly appreciate a response!

Imbalance of GPU memory consumption

Hi! I have run into a problem: when I run the model, GPU-0 always takes a large amount of memory while the other GPUs take less. Any suggestions?
Hoping for your response.

Improved results on the val set but decreased results in online evaluation

Thanks for providing the code of this interesting project. I followed your approach and trained the network with the default hyper-parameter settings (python3 run.py --RUN='train'). During validation the model performs well, where I used

python3 run.py --RUN='val' --CKPT_PATH=str

But when I did the online evaluation, the performance did not improve. I used

python3 run.py --RUN='test' --CKPT_PATH=str

to generate the json for online evaluation. Am I missing something? Am I using the correct split? Or, for online evaluation (i.e., on test-dev and test-std), do we need to train the network with a different split?

The problem with the result json file

Hi, thanks for your open-source code and excellent work.
I have a problem: when I upload the json file with the answers for test2015 to Eval AI, it fails with the following error:

'Results do not correspond to current VQA set. Either the results do have predictions for all question ids in annotation file or there is one/more questions id that does not belong to the question ids in the annotation file.'

Have you encountered this problem before?

Bert encoding

Hi,

Did you also use bert encodings in your experiments? Do you plan to release the final model config that you used for the challenge?

Can I run code with CPU?

Hello, and thanks for your code.
When I tried to run your work, I couldn't download spaCy==2.1.0, so I ran it with spaCy==3.2.3 (the latest version) and therefore had to change the pipeline from ‘‘en_vectors_web_lg-2.1.0” to “en_core_web_lg-3.2.0”; after that I changed "yaml_dict = yaml.load(f)" to ‘‘yaml_dict = yaml.safe_load(f)’’.
In the end, I get this error: “AssertionError: Torch not compiled with CUDA enabled”.
I tried ...GPU=="CPU”, but it doesn't work either.
Please guide me on how to run with the CPU (my GPU doesn't meet the CUDA requirements), if possible.

VQA CPv2

Hello, I would like to ask whether MCAN's performance has been tested on VQA-CP v2?

log file

Hello and thanks for your code.
Could you please share your log file for the validation split (results/log/log_run_small.txt)? I am making some minor changes to your code, so it would be helpful to compare each epoch. My results are strange: some minor changes I made degraded the accuracy. It started at 48% and reached 60% at the 9th epoch. I don't expect it will reach 67% in a few more epochs, as it increases very slowly.

Thanks

How long did training take with an SSD?

  • How long did training take with an SSD?
    I spent ~2 hours per epoch with the default settings on an HDD. I have set GPU=2, BS=256, CPU=10, num_worker=15 to accelerate I/O, but I don't think it's fast. Could you provide some suggestions to accelerate I/O?

  • Do you try to load all *.npz files into memory? I think it may be faster.

Thank you!

pretrained frcnn and network

Hi, Thanks for your project and great work.
I am looking to run it on new images, but using other pretrained Faster R-CNN features (e.g., from a COCO-trained detector) gives wrong answers. Could you please provide the pretrained Faster R-CNN model and network so I can replicate the features?

Thanks

question about answer processing

Hi. thanks for your code.
May I please ask, in the case where multiple questions for the same image are given, why are you assigning a label <1 for each question, as in:

def get_score(occur):
    if occur == 0:
        return .0
    elif occur == 1:
        return .3
    elif occur == 2:
        return .6
    elif occur == 3:
        return .9
    else:
        return 1.

Why not assign 1 to each and treat it individually? Is it to avoid confusion in predicting the answer at test time?

linear fusion model

Thank you for sharing. I would like to ask whether you have tried changing the linear multimodal fusion model, and whether it affects the accuracy? Looking forward to your reply. Thanks a lot!

Feature file download failed

During the feature file download, the speed is normal at the beginning, but the connection is dropped partway through and cannot be re-established. Could you please provide a download address suitable for users in China, for example a Baidu Netdisk link? In addition, since I have not used it before, installing spaCy in Anaconda has also caused some problems.

Overfitting on val dataset

Hi, i have one question.
I trained the model with the following details: model='small', train_split='train+val', 13 epochs.
The val accuracy is overall=84.18, yes/no=97.67, num=71.03, other=77.39.
Is it overfitting on the val dataset?
I tried raising the dropout rate to 0.5, which helps (val: overall=72.31),
but the test score is overall=66.95 (with the same implementation).
What should I do about this problem?
Thank you!

Features' file loading in the code

Hi,
I am confused about how you handle the "bbox" information in the code. I can only see the image features x being loaded from the ".npz" file.

Also, it is mentioned that we can work with grid features as well. A grid-feature file with the ".pth" extension only contains features/weights with tensor size [1, 2048, 19, 29] (a sample feature file) and no bounding-box or object-detection information. How can we then use those features without such information?

test-dev or test-std

May I ask the author: it seems that there is no channel to submit json files online now. How can I evaluate on test-dev?

Visualizations of the learned attention maps

Hi! MCAN team
Thanks for your sharing

I want to replicate the visualization of the learned attention maps in Figure 7.
Can you post the code you used to extract and create the learned attention maps?
Looking forward to your reply !

Best

box is xyxy or xywh?

Hello,

Thanks for your code. I was trying to plot the ROIs (boxes) on the images. Can I ask what the box format is? E.g., (x_min, y_min, x_max, y_max) or (x_min, y_min, w, h)?

Regards.
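For reference, a minimal plotting sketch under the assumption that the boxes are stored as (x_min, y_min, x_max, y_max), the convention used by most bottom-up-attention / Faster R-CNN exports; the file paths and the 'bbox' key are hypothetical, so verify them against your own feature files:

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from PIL import Image

# Hypothetical paths; assumes each .npz file stores boxes under the 'bbox' key.
img = Image.open('COCO_val2014_000000000042.jpg')
boxes = np.load('COCO_val2014_000000000042.jpg.npz')['bbox']

fig, ax = plt.subplots(1)
ax.imshow(img)
for x1, y1, x2, y2 in boxes:
    # Rectangle takes the top-left corner plus width and height.
    ax.add_patch(patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                   fill=False, edgecolor='red', linewidth=1))
plt.show()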
