
Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome

Home Page: http://panderson.me/up-down-attention/

License: MIT License

Topics: vqa, visual-question-answering, captioning-images, faster-rcnn, caffe, image-captioning, mscoco, mscoco-dataset

bottom-up-attention's Introduction

bottom-up-attention

This code implements a bottom-up attention model, based on multi-gpu training of Faster R-CNN with ResNet-101, using object and attribute annotations from Visual Genome.

The pretrained model generates output features corresponding to salient image regions. These bottom-up attention features can typically be used as a drop-in replacement for CNN features in attention-based image captioning and visual question answering (VQA) models. This approach was used to achieve state-of-the-art image captioning performance on MSCOCO (CIDEr 117.9, BLEU_4 36.9) and to win the 2017 VQA Challenge (70.3% overall accuracy), as described in the paper referenced below.

Some example object and attribute predictions for salient image regions are illustrated below.

[Example images: teaser-bike, teaser-oven]

Note: This repo only includes code for training the bottom-up attention / Faster R-CNN model (section 3.1 of the paper). The actual captioning model (section 3.2) is available in a separate repo here.

Reference

If you use our code or features, please cite our paper:

@inproceedings{Anderson2017up-down,
  author = {Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang},
  title = {Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering},
  booktitle = {CVPR},
  year = {2018}
}

Disclaimer

This code is modified from py-R-FCN-multiGPU, which is in turn modified from the py-faster-rcnn code. Please refer to those projects for further README information (for example, relating to other models and datasets included in the repo) and for the appropriate citations. This README only covers Faster R-CNN trained on Visual Genome.

License

bottom-up-attention is released under the MIT License (refer to the LICENSE file for details).

Pretrained features

For ease of use, we make pretrained features available for the entire MSCOCO dataset. It is not necessary to clone or build this repo to use features downloaded from the links below. Features are stored in tsv (tab-separated values) format and can be read with tools/read_tsv.py.
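
A minimal reading sketch in Python (field names and the base64/float32 encoding are assumed from tools/read_tsv.py; adjust if your download differs):

    import base64
    import csv
    import sys

    import numpy as np

    # Features fields are very large, so lift the csv field size limit.
    csv.field_size_limit(sys.maxsize)
    FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

    def load_tsv(path):
        """Yield one dict per image with decoded 'boxes' and 'features' arrays."""
        with open(path, 'r') as f:
            for item in csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES):
                num_boxes = int(item['num_boxes'])
                item['num_boxes'] = num_boxes
                item['image_id'] = int(item['image_id'])
                for field in ('boxes', 'features'):
                    data = base64.b64decode(item[field])
                    item[field] = np.frombuffer(data, dtype=np.float32).reshape((num_boxes, -1))
                yield item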

LINKS HAVE BEEN UPDATED TO GOOGLE CLOUD STORAGE (14 Feb 2021)

10 to 100 features per image (adaptive):

36 features per image (fixed):

Both sets of features can be recreated by running tools/generate_tsv.py with the appropriate pretrained model and with MIN_BOXES/MAX_BOXES set to either 10/100 or 36/36 respectively - refer to the Demo section below.
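
For reference, switching between the two released settings amounts to changing two box-limit constants; the snippet below assumes they are defined at module level in tools/generate_tsv.py (verify the names against your checkout):

    # Assumed constants near the top of tools/generate_tsv.py.
    MIN_BOXES = 10    # adaptive setting: 10 to 100 regions per image
    MAX_BOXES = 100
    # For the fixed setting, use 36 for both:
    # MIN_BOXES = 36
    # MAX_BOXES = 36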

Contents

  1. Requirements: software
  2. Requirements: hardware
  3. Basic installation
  4. Demo
  5. Training
  6. Testing

Requirements: software

  1. Important: Please use the version of Caffe contained within this repository.

  2. Requirements for Caffe and pycaffe (see: Caffe installation instructions)

Note: Caffe must be built with support for Python layers and NCCL!

# In your Makefile.config, make sure to have these lines uncommented
WITH_PYTHON_LAYER := 1
USE_NCCL := 1
# Unrelatedly, it's also recommended that you use CUDNN
USE_CUDNN := 1
  3. Python packages you might not have: cython, python-opencv, easydict
  4. Nvidia's NCCL library, which is used for multi-GPU training: https://github.com/NVIDIA/nccl

Requirements: hardware

Any NVIDIA GPU with at least 12GB of memory is sufficient for training Faster R-CNN with ResNet-101.

Installation

  1. Clone the repository
git clone https://github.com/peteanderson80/bottom-up-attention/
  2. Build the Cython modules

    cd $REPO_ROOT/lib
    make
  3. Build Caffe and pycaffe

    cd $REPO_ROOT/caffe
    # Now follow the Caffe installation instructions here:
    #   http://caffe.berkeleyvision.org/installation.html
    
    # If you're experienced with Caffe and have all of the requirements installed
    # and your Makefile.config in place, then simply do:
    make -j8 && make pycaffe
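
Once the build finishes, a quick optional way to confirm that the in-repo pycaffe imports cleanly is a check along these lines (paths are assumptions; run from $REPO_ROOT or adjust them):

    # Sanity check that 'make pycaffe' produced an importable module.
    import sys
    sys.path.insert(0, 'caffe/python')
    sys.path.insert(0, 'lib')

    import caffe  # fails here if the build did not succeed

    caffe.set_mode_cpu()  # or caffe.set_mode_gpu(); caffe.set_device(0)
    print('pycaffe loaded from', caffe.__file__)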

Demo

  1. Download the pretrained model and put it under data/faster_rcnn_models.

  2. Run tools/demo.ipynb to show object and attribute detections on demo images.

  3. Run tools/generate_tsv.py to extract bounding box features to a tab-separated-values (tsv) file. This will require modifying the load_image_ids function to suit your data locations. To recreate the pretrained feature files with 10 to 100 features per image, set MIN_BOXES=10 and MAX_BOXES=100. To recreate the pretrained feature files with 36 features per image, set MIN_BOXES=36 and MAX_BOXES=36, and use this alternative pretrained model instead. The alternative pretrained model was trained for fewer iterations but performance is similar.
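
The load_image_ids function is expected to return (image_path, image_id) pairs for your dataset; a hypothetical replacement might look like this (the directory layout and split name are examples only):

    import os

    def load_image_ids(split_name):
        """Hypothetical example: return a list of (image_path, image_id) tuples."""
        pairs = []
        if split_name == 'my_dataset':
            image_dir = '/data/my_dataset/images'  # assumed location
            for fname in sorted(os.listdir(image_dir)):
                if fname.endswith('.jpg'):
                    image_id = int(os.path.splitext(fname)[0])  # e.g. 000123.jpg -> 123
                    pairs.append((os.path.join(image_dir, fname), image_id))
        else:
            raise ValueError('Unknown split: %s' % split_name)
        return pairs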

Training

  1. Download the Visual Genome dataset. Extract all the json files, as well as the image directories VG_100K and VG_100K_2 into one folder $VGdata.

  2. Create symlinks for the Visual Genome dataset

    cd $REPO_ROOT/data
    ln -s $VGdata vg
  3. Generate xml files for each image in the Pascal VOC format (this will take some time). This script will extract the top 2500/1000/500 objects/attributes/relations and also does basic cleanup of the Visual Genome data. Note, however, that our training code actually only uses a subset of the annotations in the xml files, i.e., only 1600 object classes and 400 attribute classes, based on the hand-filtered vocabs found in data/genome/1600-400-20 (a short vocabulary-loading sketch appears after this list). The relevant part of the codebase is lib/datasets/vg.py. Relation labels can be included in the data layers but are currently not used.

    cd $REPO_ROOT
    ./data/genome/setup_vg.py
  4. Please download the ImageNet-pretrained ResNet-101 model manually, and put it into $REPO_ROOT/data/imagenet_models

  5. You can train your own model using ./experiments/scripts/faster_rcnn_end2end_multi_gpu_resnet_final.sh (see instructions in the file). The train (95k) / val (5k) / test (5k) splits are in data/genome/{split}.txt and were determined using data/genome/create_splits.py. To avoid val / test set contamination when pre-training for MSCOCO tasks, these splits match the 'Karpathy' COCO splits for images that appear in both datasets.

    Trained Faster-RCNN snapshots are saved under:

    output/faster_rcnn_resnet/vg/
    

    Logging outputs are saved under:

    experiments/logs/
    
  6. Run tools/review_training.ipynb to visualize the training data and predictions.
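
As a side note on the vocabularies mentioned in step 3, the hand-filtered class lists in data/genome/1600-400-20 are plain text with one label per line; a minimal loading sketch (file names assumed from that folder):

    def load_vocab(path):
        """Load a plain-text vocabulary, one label (possibly with comma-separated synonyms) per line."""
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]

    objects = load_vocab('data/genome/1600-400-20/objects_vocab.txt')        # ~1600 classes
    attributes = load_vocab('data/genome/1600-400-20/attributes_vocab.txt')  # ~400 classes
    print(len(objects), len(attributes))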

Testing

  1. The model will be tested on the validation set at the end of training, or models can be tested directly using tools/test_net.py, e.g.:

    ./tools/test_net.py --gpu 0 --imdb vg_1600-400-20_val --def models/vg/ResNet-101/faster_rcnn_end2end_final/test.prototxt --cfg experiments/cfgs/faster_rcnn_end2end_resnet.yml --net data/faster_rcnn_models/resnet101_faster_rcnn_final.caffemodel > experiments/logs/eval.log 2>&1
    

    Mean AP is reported separately for object prediction and attribute prediction (given ground-truth object detections). Test outputs are saved under:

    output/faster_rcnn_resnet/vg_1600-400-20_val/<network snapshot name>/
    

Expected detection results for the pretrained model

  Model                      objects mAP@0.5   objects weighted mAP@0.5   attributes mAP@0.5   attributes weighted mAP@0.5
  Faster R-CNN, ResNet-101   10.2%             15.1%                      7.8%                 27.8%

Note that mAP is relatively low because many classes overlap (e.g. person / man / guy), some classes can't be precisely located (e.g. street, field) and separate classes exist for singular and plural objects (e.g. person / people). We focus on performance in downstream tasks (e.g. image captioning, VQA) rather than detection performance.

bottom-up-attention's People

Contributors: alessandrosteri, bharatpublic, bharatsingh430, peteanderson80


bottom-up-attention's Issues

About test results

Hi, I just ran the test code with your trained ResNet-101 model on the test set and got the following numbers for the object detection task:

Mean AP = 0.0146
Weighted Mean AP = 0.1799
Mean Detection Threshold = 0.328

The mean AP (1.46%) is far from the number (10.2%) you reported in the table at the bottom of the README. The weighted mean AP is a bit higher than the number you reported. I am wondering whether there is a typo in your table.

thanks!

Image caption

Hi, can you release your image captioning implementation?

binascii.Error: Incorrect padding

Got binascii.Error: Incorrect padding when reading image 300104 from test2014/test2014_resnet101_faster_rcnn_genome.tsv.1 with tools/read_tsv.py. Anything wrong?

Traceback (most recent call last):
  File "read_tsv.py", line 64, in <module>
    read_and_save(os.path.join(in_dir, in_file), out_dir)
  File "read_tsv.py", line 45, in read_and_save
    item['features'] = np.frombuffer(base64.decodestring(item['features']), dtype=np.float32).reshape((item['num_boxes'], -1))
  File "/usr/lib64/python2.7/base64.py", line 321, in decodestring
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
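
For readers hitting the same error: it usually indicates a truncated field (often from an interrupted download), so re-downloading the file is the safer fix. As a last resort, padding the base64 string to a multiple of four before decoding is a possible workaround (a sketch, not an endorsed fix):

    import base64

    import numpy as np

    def decode_field(b64_string, num_boxes):
        """Base64-decode a 'boxes'/'features' field, tolerating missing '=' padding."""
        missing = len(b64_string) % 4
        if missing:
            b64_string += '=' * (4 - missing)   # pad to a multiple of 4
        buf = base64.b64decode(b64_string)
        # Will still raise if the record is genuinely truncated; re-download in that case.
        return np.frombuffer(buf, dtype=np.float32).reshape((num_boxes, -1))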

activation of the relation prediction

@peteanderson80
I saw the option "HAS_RELATION" in the cfg file. I turned it on, added a top[6] blob to the proposal_target_layer, set the param num_rel_classes to 21 (I am not sure whether this is correct for the vg_1600-400-20 dataset), and started training, but I got the following error:

File "/home/work/bottom-up-attention/tools/../lib/roi_data_layer/minibatch.py", line 55, in get_minibatch
    "Generation of gt_relations doesn't accomodate dropping objects"
AssertionError: Generation of gt_relations doesn't accomodate dropping objects

Is there something wrong with my setting?

Compilation Error

i got the following error:

token ""CUDACC_VER is no longer supported. Use CUDACC_VER_MAJOR, CUDACC_VER_MINOR, and CUDACC_VER_BUILD instead."" is not valid in preprocessor expressions

I've updated Eigen already and still get the same error...
What should I do?

example images without bounding boxes

First of all, thanks for making this project public.

In the paper and this repo's README, two example images (the bike and the oven) are used to show your model's predictions qualitatively. Since I can't figure out where those images are in the dataset (VG or COCO), could you tell us where they come from?

Thanks in advance.

GloVe word embedding for top down attention model

Thank you for the fascinating paper and code!

I notice that for the image captioning model, you decided to use the standard vocabulary and train the word embedding matrix from scratch. So I've been wondering: if I apply the approach from your previous paper on Constrained Beam Search (pretrained GloVe vectors with an expanded vocabulary), will it improve the performance of the model?

Running evaluation script on CPU

Hello,

Is it possible to run the evaluation script on the CPU? Should I still install Caffe in the way described in the README of this repository?

Thanks,
Claudio

Why use the VG data to train the Faster R-CNN model?

Thanks for sharing the models and features. I have tried the features for VQA with my own model; really surprising results indeed :)
I have two questions:

  1. As the VQA dataset is based on MSCOCO images, would it be better to train the Faster R-CNN model on the COCO dataset directly?
  2. Could a better object detection model, e.g., R-FCN or Deformable R-FCN, further improve the VQA performance?

Problem in running demo.ipynb

net = caffe.Net(prototxt, caffe.TEST, weights=weights)
Traceback (most recent call last):
File "", line 1, in
Boost.Python.ArgumentError: Python argument types in
Net.init(Net, str, int)
did not match C++ signature:
init(boost::python::api::object, std::string, std::string, int)
init(boost::python::api::object, std::string, int)

read_tsv file error

When I use the default mode, 'r+b', to open the tsv file, an error occurs:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
at the line "for item in reader:".
When I use 'r+' instead, I get:
TypeError: expected bytes-like object, not str, at np.frombuffer(base64.decodestring(item[field]), ...).
My environment is Windows. Could that matter?
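
For what it's worth, this pattern of errors is typical when the Python 2-era script runs under Python 3: the csv reader wants text mode while base64.decodestring wants bytes. A sketch that reads the file under Python 3 (placeholder filename; it also avoids the Windows overflow in csv.field_size_limit):

    import base64
    import csv

    import numpy as np

    # sys.maxsize can overflow csv.field_size_limit on Windows builds; use a 32-bit cap.
    csv.field_size_limit(2 ** 31 - 1)
    FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

    with open('features.tsv', 'r', newline='') as f:          # text mode for csv
        for item in csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES):
            num_boxes = int(item['num_boxes'])
            for field in ('boxes', 'features'):
                buf = base64.b64decode(item[field])            # b64decode accepts str
                item[field] = np.frombuffer(buf, dtype=np.float32).reshape((num_boxes, -1))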

I run python demo.py failed!

I downloaded the model file first. When I run demo.py I get a message like this:
I1017 11:02:41.250684 40048 net.cpp:131] Top shape: 1 2 126 14 (3528)
I1017 11:02:41.250689 40048 net.cpp:139] Memory required for data: 117482412
I1017 11:02:41.250692 40048 layer_factory.hpp:77] Creating layer rpn_cls_prob_reshape
I1017 11:02:41.250699 40048 net.cpp:86] Creating Layer rpn_cls_prob_reshape
I1017 11:02:41.250704 40048 net.cpp:408] rpn_cls_prob_reshape <- rpn_cls_prob
I1017 11:02:41.250713 40048 net.cpp:382] rpn_cls_prob_reshape -> rpn_cls_prob_reshape
I1017 11:02:41.250741 40048 net.cpp:124] Setting up rpn_cls_prob_reshape
I1017 11:02:41.250747 40048 net.cpp:131] Top shape: 1 18 14 14 (3528)
I1017 11:02:41.250751 40048 net.cpp:139] Memory required for data: 117496524
I1017 11:02:41.250756 40048 layer_factory.hpp:77] Creating layer proposal
F1017 11:02:41.250799 40048 layer_factory.hpp:81] Check failed: registry.count(type) == 1 (0 vs. 1) Unknown layer type: Python (known types: AbsVal, Accuracy, ArgMax, BNLL, BatchNorm, BatchReindex, Bias, BoxAnnotatorOHEM, Concat, ContrastiveLoss, Convolution, Crop, Data, Deconvolution, Dropout, DummyData, ELU, Eltwise, Embed, EuclideanLoss, Exp, Filter, Flatten, HDF5Data, HDF5Output, HingeLoss, Im2col, ImageData, InfogainLoss, InnerProduct, InnerProductBlob, Input, LRN, LSTM, LSTMUnit, Log, MVN, MemoryData, MultinomialLogisticLoss, PReLU, PSROIPooling, Parameter, Pooling, Power, RNN, ROIPooling, ReLU, Reduction, Reshape, SPP, Scale, Sigmoid, SigmoidCrossEntropyLoss, Silence, Slice, SmoothL1Loss, SmoothL1LossOHEM, Softmax, SoftmaxWithLoss, SoftmaxWithLossOHEM, Split, TanH, Threshold, Tile, WindowData)
*** Check failure stack trace: ***
Aborted (core dumped)

Generation of 1600-400-20 vocab files

I ran the setup_vg.py script with max objects/attributes/relations set to 1600/400/20 respectively. However, the generated vocabulary files are slightly different from the ones provided in the 1600-400-20 folder. Is there any manual post-processing step?

run tools/demo.ipynb error

[libprotobuf ERROR google/protobuf/text_format.cc:274] Error parsing text-format caffe.NetParameter: 6305:21: Message type "caffe.LayerParameter" has no field named "roi_pooling_param".
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0216 21:21:02.787189 22074 upgrade_proto.cpp:90] Check failed: ReadProtoFromTextFile(param_file, param) Failed to parse NetParameter file: /bottom-up-attention/models/vg/ResNet-101/faster_rcnn_end2end/test.prototxt
*** Check failure stack trace: ***
Aborted

Could you please suggest a solution for this error? Thank you!

ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/tuple'

Hi, I ran generate_tsv.py to generate the pretrained features, but I ran into a problem.

Here is the problem:
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "generate_tsv.py", line 157, in generate_tsv
net = caffe.Net(prototxt, caffe.TRAIN, weights=weights)
File "/home/lijinze/bottom-up-attention-master/tools/../lib/rpn/anchor_target_layer.py", line 27, in setup
layer_params = yaml.load(self.param_str)
File "/usr/local/lib/python2.7/dist-packages/yaml/init.py", line 72, in load
return loader.get_single_data()
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 39, in get_single_data
return self.construct_document(node)
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 48, in construct_document
for dummy in generator:
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 398, in construct_yaml_map
value = self.construct_mapping(node)
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 208, in construct_mapping
return BaseConstructor.construct_mapping(self, node, deep=deep)
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 133, in construct_mapping
value = self.construct_object(value_node, deep=deep)
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 88, in construct_object
data = constructor(self, node)
File "/usr/local/lib/python2.7/dist-packages/yaml/constructor.py", line 414, in construct_undefined
node.start_mark)
ConstructorError: could not determine a constructor for the tag 'tag:yaml.org,2002:python/tuple'
in "", line 2, column 11:
'scales': !!python/tuple [4, 8, 16, 32]

Could you do me a favor? Thank you very much!

How is L2-normalization over the features implemented?

The paper states that L2 normalization of the image features is crucial for good performance. However, you just use the pool5 data, which is average-pooled into a 2048-dimensional vector in generate_tsv.py.

I'm wondering whether you have implemented L2-normalization over the features or not. If you did, please let me know how you did it. Thanks a lot!

Could you please list the platform version?

When I make pycaffe following your configuration, it always raises some annoying conflicts. Could you please list the platform versions used for your code?

Besides, my environment is :

  • Ubuntu 16.04
  • CUDA 8
  • cuDNN 5.0
  • OpenCV 2.4.13
  • MKL 2016
  • NCCL

Should that work?
Thanks!

How to load the .tsv data in TensorFlow?

I am trying to use your .tsv data as image features for image captioning. However, I have no idea how to load the .tsv data so that I can randomly batch-sample the feature items and match them with the corresponding captions. One approach I found is to convert the .tsv into .json format, making the .json file a dict with "image_ids" as the keys. However, the .json file is too large to load; in fact, I failed even to generate it due to lack of memory. I also failed to use TensorFlow's TextLineReader.

So, how do I load the .tsv data correctly?
Looking forward to your help!
Thank you!
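
One workaround (a sketch, with field names assumed from the released tsv files) is to convert each tsv once into one small .npz per image_id and let the input pipeline load features lazily by id, rather than building a single giant json:

    import base64
    import csv
    import os
    import sys

    import numpy as np

    csv.field_size_limit(sys.maxsize)
    FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

    def tsv_to_npz(tsv_path, out_dir):
        """Split one feature tsv into per-image .npz files keyed by image_id."""
        os.makedirs(out_dir, exist_ok=True)
        with open(tsv_path, 'r') as f:
            for item in csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES):
                num_boxes = int(item['num_boxes'])
                feats = np.frombuffer(base64.b64decode(item['features']),
                                      dtype=np.float32).reshape((num_boxes, -1))
                boxes = np.frombuffer(base64.b64decode(item['boxes']),
                                      dtype=np.float32).reshape((num_boxes, -1))
                np.savez(os.path.join(out_dir, '%s.npz' % item['image_id']),
                         features=feats, boxes=boxes)

At training time the captioning loader can then np.load each file by image_id at batch time (for example wrapped with tf.py_function inside a tf.data pipeline) and pair the features with the matching captions.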

When I run make -j8 && make pycaffe, I get an error.

CXX src/caffe/internal_thread.cpp
CXX src/caffe/layer.cpp
CXX src/caffe/blob.cpp
CXX src/caffe/syncedmem.cpp
CXX src/caffe/solver.cpp
CXX src/caffe/layer_factory.cpp
CXX src/caffe/data_transformer.cpp
CXX src/caffe/layers/hdf5_data_layer.cpp
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from src/caffe/syncedmem.cpp:3:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/syncedmem.o] Error 1
make: *** Waiting for unfinished jobs....
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from src/caffe/blob.cpp:7:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/blob.o] Error 1
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from ./include/caffe/layer.hpp:12,
from src/caffe/layer_factory.cpp:8:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/layer_factory.o] Error 1
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from ./include/caffe/layer.hpp:12,
from src/caffe/layer.cpp:1:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/layer.o] Error 1
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from ./include/caffe/layer.hpp:12,
from ./include/caffe/layers/hdf5_data_layer.hpp:10,
from src/caffe/layers/hdf5_data_layer.cpp:17:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/layers/hdf5_data_layer.o] Error 1
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from src/caffe/data_transformer.cpp:10:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/data_transformer.o] Error 1
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from ./include/caffe/layer.hpp:12,
from ./include/caffe/net.hpp:12,
from ./include/caffe/solver.hpp:7,
from src/caffe/solver.cpp:6:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/solver.o] Error 1
In file included from ./include/caffe/util/math_functions.hpp:11:0,
from src/caffe/internal_thread.cpp:5:
./include/caffe/util/mkl_alternate.hpp:14:19: fatal error: cblas.h: No such file or directory
#include <cblas.h>
^
compilation terminated.
make: *** [.build_release/src/caffe/internal_thread.o] Error 1

Test 2014 adaptive features are not getting prepared

I tried creating numpy files from the test2014 variable-box features using the read_tsv script in this repo, but it reports a padding error while decoding from string to numpy array.

item[field] = np.frombuffer(base64.decodestring(item[field]),
dtype=np.float32).reshape((item['num_boxes'],-1))

It says 'Incorrect padding' inside the decodestring function. I tried adding '=' at the end, but then the dimensions of the resulting numpy array do not match num_boxes. This error occurs for every tsv file in the test2014 adaptive feature set.

Tried debugging the code:
hRPQAAAAAAAAAAAAAAALSF8D0AAAAAAAAAAEK5hkBUkI4/ubE7PwAAAABMs648eTKRO2Xq5z1mSOE+aKcrPwAAAAAAAAAAAAAAABsb3T7nhK49jVvEPirGgjqzJrM+AAAAAJLICj8B0G4+HmhvPvccLz4AAAAAq37BOoivl0AtCwg8AAAAAAAAAAAAAAAAtsQoPgAAAADbHCo7AAAAAAAAAACSgow/AAAAAFYsqT+fwoM9AAAAAIEkFkHFf6U8AAAAAAAAAACspfQ+AAAAAAAAAACbh5E9AAAAAF/CRz8AAAAAAAAAAGXGa0BfWrs/FetIPKe0RD8AAAAAzPLROwAAAAAAAAAAAAAAAHy/Rz7JO49A/5cPP8bSlz4AAAAANOKEQAAAAAAAAAAAAAAAAAAAA437192 length of string=237423

/home/juan_fernandez/scripts/read_tsv.py(74)()
-> pdb.set_trace()
(Pdb) c
Traceback (most recent call last):
File "scripts/read_tsv.py", line 74, in
pdb.set_trace()
File "/home/juan_fernandez/anaconda2/envs/py27/lib/python2.7/base64.py", line 328, in decodestring
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding

Any help would be appreciated.

Struggling with installation

Hello,

I keep failing to build the shipped Caffe with Anaconda, mainly due to linking errors related to Google protobuf. I've been struggling for about 7-8 hours and am pretty close to giving up.

So the question is: what are the modifications shipped in the caffe/ folder? Can we really not use upstream Caffe?

Training Scripts

This is nice work and the features improve results a lot! Could you please share the detailed training process for the final model?

How to set a larger batch size in training?

I modified BATCH_SIZE from 64 to 192 in faster_rcnn_end2end_resnet.yml, but I get this error:

F1206 14:38:02.929175 195836 loss_layer.cpp:19] Check failed: bottom[0]->num() == bottom[1]->num() (32 vs. 96) The data and label should have the same number.

I think I am missing something in the configuration; maybe another parameter should also be changed to match BATCH_SIZE.

Tensorflow version ?

Hi,

I have a question: will you be releasing a TensorFlow version of your code?

Can somebody share the pretrained features?

I used Chrome to download the features, but the speed is below 50 KB/s, and after downloading 50~100 MB the transfer gets interrupted; when I re-download, it starts from the beginning...
I feel helpless...

Which prototxt is for generate_tsv.py?

I want to use your model to extract features, but I don't know which prototxt I need for generate_tsv.py.
Following the 'Testing' part, I chose "models/vg/ResNet-101/faster_rcnn_end2end_final/test.prototxt", but it reports this error:
cudnn.hpp:122] Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) CUDNN_STATUS_BAD_PARAM
I will run this model on XMedia Wikipedia and Pascal, and would appreciate more explanation about this part.
By the way, do I need to resize the image shape to 224x224x3?
Thank you

Required GPU memory

Hello
I am trying to use the pretrained model to extract image features. I have a GTX 1070 (8 GB) and get an out-of-memory error when running the network on a single image. I suspect this is a Caffe issue regarding memory management. What do you suggest to solve this without decreasing performance?

Not able to define a function in lib/fast_rcnn/test.py

I want to define a function in lib/fast_rcnn/test.py. I implemented the new function in test.py and imported test in demo.ipynb. When I access the new function as test.new_function(), it throws an error: 'module' object has no attribute 'new_function'. How can we define a new function in test.py?

image captioning task

Hi peteanderson:
I have followed your work in the cross-modal field for a long time, and your bottom-up & top-down work is a great improvement over other works. Following your paper, I implemented the top-down algorithm for image captioning, but I could not reproduce your CIDEr loss, so I just used the cross-entropy loss. If possible, I hope you will put your image captioning code on GitHub.
I hope you will roll out another wonderful article at CVPR 2018.

How about the image caption model?

Hello, the attribute extraction net using bottom-up attention that you proposed is impressive! It indeed boosts image captioning performance in the paper. Besides the attention model, I am also interested in the caption model designed in the paper. You mentioned that your caption model achieves performance comparable to the state of the art on most evaluation metrics. So, to compare with my own model, can you provide your captioning model implementation? :)
