kdexd / virtex
[CVPR 2021] VirTex: Learning Visual Representations from Textual Annotations
Home Page: http://kdexd.xyz/virtex
License: MIT License
Hi, I've got the following error:
File "scripts/eval_captioning.py", line 114, in <module>
    main(_A)
File "scripts/eval_captioning.py", line 76, in main
    val_batch[key] = val_batch[key].to(device)
AttributeError: 'list' object has no attribute 'to'
I've made the following change in the code to run the script without the error:
for val_iteration, val_batch in enumerate(val_dataloader, start=1):
    # for key in val_batch:
    #     print(val_batch[key])
    val_batch["image"] = val_batch["image"].to(device)
Hi, thank you so much for sharing this code. It is very helpful.
However, I am confused about the data preprocessing configuration. The config files specify Caffe-style image mean and std, but it seems they are not used in the code. Instead, the code seems to hard-code torchvision-style mean and std (here). Can you confirm that both pretraining and fine-tuning use the latter?
Furthermore, I am not sure whether the images are in the 0-255 range or 0-1. For Caffe-style mean and std it should be 0-255, but with your hard-coded mean and std it should be 0-1. However, I noticed you use OpenCV to load images, which loads them in 0-255, and I did not find anywhere in the code where they are scaled to 0-1, except in supervised pretraining (here).
Could you please comment on these issues? It is especially important that the configuration is identical across pretraining and all downstream settings. Since you fine-tune all layers and don't freeze the stem, such inconsistencies are hard to notice, because fine-tuning would compensate for them to some extent.
Thank you so much.
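For context on the two conventions under discussion, here is a minimal illustration (the ImageNet statistics below are the standard values, an assumption about what the configs intend, not a quote from the repo):

import numpy as np

bgr_image = np.random.randint(0, 256, (224, 224, 3)).astype(np.float32)  # as cv2 loads it

# Caffe-style: BGR channel order, 0-255 range, per-channel mean subtraction.
caffe_mean = np.array([103.53, 116.28, 123.675])
caffe_norm = bgr_image - caffe_mean

# torchvision-style: RGB channel order, scaled to 0-1, then normalized per channel.
rgb_image = bgr_image[..., ::-1] / 255.0
tv_mean = np.array([0.485, 0.456, 0.406])
tv_std = np.array([0.229, 0.224, 0.225])
tv_norm = (rgb_image - tv_mean) / tv_std

Mixing the two (e.g. torchvision-style mean/std applied to 0-255 BGR inputs) produces activations on a very different scale, which is exactly the inconsistency being asked about.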
Great project in the right direction, i.e. getting about the same results with less compute.
In your paper you mention that you discard the text head and use only the visual backbone, and that future research could leverage the text head. Do you think it could develop performance comparable to BERT-like models trained on text corpora?
Also, I'm very interested in using VirTex for downstream problems on datasets such as Conceptual Captions. Did you try, or would you estimate, performance improvements from VirTex's visual backbone + BERT, or VirTex's visual and textual heads together (not sure the latter would work), over ViLBERT or VisualBERT?
Hi,
I would like to evaluate your work on a single image for image captioning. Can you tell me the steps to follow for a single input? For instance, given a folder of images, how would I use your model for inference only on that folder?
Looking at the captioning task in your description, I am not sure how to go about using my own dataset to evaluate the model.
Thanks
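(For reference: the command pattern quoted later in this thread for scripts/eval_captioning.py points --data-root at an arbitrary image directory, which should cover this use case; the paths below are placeholders. A folder containing a single image should work the same way.)

python scripts/eval_captioning.py \
    --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
    --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
    --data-root /path/to/images_dir \
    --output /path/to/save/predictions.json \
    --num-gpus-per-machine 1 \
    --cpu-workers 4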
"😵 Uh oh! This model can't be run on Replicate because it was built with a version of Cog that is no longer supported."
https://replicate.com/kdexd/virtex-image-captioning
When I was doing the pretraining, this error showed up:
lmdb.Error: datasets/coco/serialized_train.lmdb: No such file or directory
I couldn't find any instruction that generates this file. Has anyone else had this problem?
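For anyone stuck here: the file is presumably produced by a preprocessing step before pretraining. As a rough illustration only (hypothetical; not the repository's actual script or its exact key format), serializing images into an LMDB looks like this:

import glob
import lmdb

env = lmdb.open("datasets/coco/serialized_train.lmdb", map_size=1 << 40)
with env.begin(write=True) as txn:
    for idx, path in enumerate(sorted(glob.glob("datasets/coco/train2017/*.jpg"))):
        with open(path, "rb") as f:
            # Store raw JPEG bytes under a sequential key.
            txn.put(str(idx).encode("ascii"), f.read())
env.close()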
import virtex.model_zoo as mz
model = mz.get("width_ablations/bicaptioning_R_50_L1_H2048.yaml", pretrained=True)
When I run this script I get the following HTTP Error:
HTTPError                                 Traceback (most recent call last)
<ipython-input> in <module>()
      1 import virtex
      2 import virtex.model_zoo as mz
----> 3 model = mz.get("width_ablations/bicaptioning_R_50_L1_H2048.yaml", pretrained=True)

/content/virtex/virtex/model_zoo/model_zoo.py in get(config_path, pretrained)
     98     checkpoint_url,
     99     dir=os.path.expanduser("~/.torch/virtex_cache"),
--> 100     filename=os.path.basename(config_path).replace(".yaml", ".pth")
    101 )
    102 CheckpointManager(model=model).load(checkpoint_path)

/usr/local/lib/python3.6/dist-packages/fvcore/common/download.py in download(url, dir, filename, progress)
     56     unit="B", unit_scale=True, miniters=1, desc=filename, leave=True
     57 ) as t:
---> 58     tmp, _ = request.urlretrieve(url, filename=tmp, reporthook=hook(t))

[... /usr/lib/python3.6/urllib/request.py redirect-handling frames (urlopen, open, http_response, error, _call_chain, http_error_302) repeat several times ...]

/usr/lib/python3.6/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    648 class HTTPDefaultErrorHandler(BaseHandler):
    649     def http_error_default(self, req, fp, code, msg, hdrs):
--> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError: HTTP Error 404: Not Found
I think it's because the URL is not being generated correctly; there is an extra '/':
Failed to download https://umich.box.com/shared/static//fm1nq819q74vr0kqcd3gkivlzf06xvko.pth
Hi,
I saw that in SentencePieceTrainer, as below, you disable the default BOS/EOS IDs and declare [SOS], [EOS], and [MASK] as control symbols:
" --bos_id=-1 --eos_id=-1" " --control_symbols=[SOS],[EOS],[MASK]"
However, during captioning you define
sos_index: int = 1, eos_index: int = 2,
I am wondering whether these two setups are consistent, and whether the mismatch has any effect?
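My understanding (an assumption, not confirmed by the authors): with --bos_id=-1 --eos_id=-1, SentencePiece disables its built-in BOS/EOS pieces and assigns the control symbols the IDs immediately after <unk> (ID 0), so [SOS]=1, [EOS]=2, [MASK]=3, which would be consistent with sos_index=1 and eos_index=2. A quick check against a trained vocab (the path assumes the setup instructions in this thread):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("datasets/vocab/coco_10k.model")
for piece in ["<unk>", "[SOS]", "[EOS]", "[MASK]"]:
    # Expected (if my reading is right): 0, 1, 2, 3
    print(piece, sp.piece_to_id(piece))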
Sorry to bother you, but I ran into this problem and cannot find a way to fix it.
It happens when I train the base VirTex model.
I have updated cuDNN to version 8.0.3 (the previous version was 7.6.5); both versions give this error.
Dear authors,
Thank you very much for providing the code and instructions for using this library! I tried to follow the instructions from http://kdexd.xyz/virtex/virtex/usage/setup_dependencies.html:
cd virtex
conda create -n virtex python=3.6
conda activate virtex
pip install -r requirements.txt
I ran into a problem: numpy was not found when I ran pip install -r requirements.txt. I fixed it by running conda install numpy before that. Maybe you could add numpy to requirements.txt so others can use the same version you used.
Many thanks,
George Batchkala
Is it possible to use this model to caption a single image in code, for example in Google Colab, without running the script?
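A rough sketch of what that might look like (an assumption pieced together from mz.get and the model(val_batch) call quoted elsewhere in this thread; the resizing and missing normalization are placeholders, not the repo's exact preprocessing pipeline):

import cv2
import torch
import virtex.model_zoo as mz

model = mz.get("width_ablations/bicaptioning_R_50_L1_H2048.yaml", pretrained=True)
model.eval()

image = cv2.imread("example.jpg")          # BGR, uint8, HWC
image = cv2.resize(image, (224, 224))      # size is a guess, not the configured value
tensor = torch.from_numpy(image).permute(2, 0, 1).float().unsqueeze(0)

with torch.no_grad():
    output_dict = model({"image": tensor})  # the eval script calls model(val_batch)
# output_dict should contain predicted token IDs; decode them with the
# SentencePiece vocab (datasets/vocab/coco_10k.model), as the eval script does.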
Hi,
I want to reproduce your pre-training result. An accident interrupted my training, and I resumed it with the flag "--resume-from", but it behaves strangely: the training and validation loss jumped dramatically at the beginning and then decreased, which suggests a problem with the restoring. Could you help me with this?
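One common cause of this symptom (an assumption, not a confirmed diagnosis of this repo) is restoring model weights without the optimizer and LR-scheduler state. A self-contained sketch of a complete save/restore cycle in plain PyTorch, with toy stand-ins for the real objects:

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

# Save everything needed to resume smoothly:
torch.save({
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "scheduler": scheduler.state_dict(),
    "iteration": 100,
}, "checkpoint.pth")

# Restoring only "model" resets optimizer momentum and the LR schedule,
# which can make the loss spike right after resuming:
ckpt = torch.load("checkpoint.pth", map_location="cpu")
model.load_state_dict(ckpt["model"])
optimizer.load_state_dict(ckpt["optimizer"])
scheduler.load_state_dict(ckpt["scheduler"])
start_iteration = ckpt["iteration"]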
Hi there,
I am aware that VirTex uses image captioning as a pretraining task rather than as the final goal, but I was wondering whether one could fine-tune the pretrained model (e.g. bicaptioning_R_50_L1_H2048) on additional COCO-Captions-like data to get an improved captioning model.
Has anyone tried that, or does anyone have suggestions on how to do it? Can any of the scripts in the repository be used or adapted for fine-tuning existing models?
Thanks a lot! :)
Hi, this is really great work and a clear codebase.
However, I hit a problem when evaluating image captioning on COCO Captions val2017. It said "Error: Could not find or load main class edu.stanford.nlp.process.PTBTokenizer".
I have downloaded CoreNLP and SPICE with your script download_spice.sh. Besides, I also set the path in my bashrc file, that is:
export CLASSPATH="$CLASSPATH:/path/to/virtex/virtex/utils/assets/SPICE-1.0/lib/stanford-corenlp-3.6.0.jar:/path/to/virtex/virtex/utils/assets/SPICE-1.0/lib/stanford-corenlp-3.6.0-models.jar"
for file in `find /path/to/virtex/virtex/utils/assets/SPICE-1.0/lib -name "*.jar"`; do export CLASSPATH="$CLASSPATH:`realpath $file`"; done
I could run
java -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -outputFormat json -file input.txt
successfully with sentences in input.txt.
In addition, if I don't calculate metrics, i.e. run without --calc-metrics, it works fine.
Do you have any idea about this problem?
Hi, thanks for the awesome code base!
I'm looking to produce visualizations of decoder attention weights similar to those shown in the paper, but I don't think this feature is implemented in the published code (although I may have overlooked it!).
As best I can tell, this would be done with a new TransformerDecoderLayer that returns the multi-head attention's attn_output_weights from its forward method; the visualized attention weights when predicting a single token would then be the average of these weights across all heads. The problem I am finding is that the visualized weights mostly concentrate in the center of the image during captioning on the COCO dataset, whereas the results in the paper show reasonable variation in these weights as tokens are predicted.
Is this the method that you used to create the visualization? Any insight into how this was previously done would be appreciated!
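For reference, a minimal sketch of the approach described above (my reading, not the authors' confirmed method): PyTorch's nn.MultiheadAttention already returns head-averaged attention weights when need_weights=True. The shapes below are illustrative, assuming a 7x7 feature grid from the backbone:

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)

tgt = torch.randn(10, 1, 512)   # (caption_len, batch, dim): token features
mem = torch.randn(49, 1, 512)   # (H*W, batch, dim): image features, 7x7 grid

out, weights = attn(tgt, mem, mem, need_weights=True)
print(weights.shape)            # (batch, caption_len, 49): head-averaged weights
# Reshape the 49 weights of a given token to 7x7 and upsample over the input
# image to get a per-token heatmap like the ones in the paper.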
I tried running this in Colab environment.
Got the below error:
KeyError                                  Traceback (most recent call last)
<ipython-input-5-e8ec27705300> in <module>()
      1 import torch
      2 # model = torch.hub.load('pytorch/vision:v0.9.0', 'alexnet', pretrained=True)
----> 3 model = torch.hub.load("kdexd/virtex", "resnet50", pretrained=True)
      4 model.eval()

/root/.cache/torch/hub/kdexd_virtex_master/hubconf.py in resnet50(pretrained, **kwargs)
     31     "https://umich.box.com/shared/static/gsjqm4i4fm1wpzi947h27wweljd8gcpy.pth",
     32     progress=False,
---> 33     )["model"]
     34 )
     35 return model

KeyError: 'model'
Can you let me know the fix ?
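A quick diagnostic sketch (assuming the URL in the traceback is still reachable): download the checkpoint directly and inspect its top-level keys, to see whether a "model" entry is present at all or the weights have moved under a different key:

import torch

url = "https://umich.box.com/shared/static/gsjqm4i4fm1wpzi947h27wweljd8gcpy.pth"
ckpt = torch.hub.load_state_dict_from_url(url, progress=False, map_location="cpu")
print(list(ckpt.keys()))  # hubconf.py expects a "model" key here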
Hi,
Thanks for the nice paper and clear code!
I found that the models are set with .train() in clf_linear.py. Thus the running averages (i.e., the states) of the BatchNorm layers are updated while training on the ImageNet dataset (via calls to the forward function), so the backbone does not seem to be fully frozen. Is this a deliberate design choice for this fine-tuning task?
Best,
Hao
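To illustrate the point being asked about, here is a sketch of a fully frozen backbone for linear probing (plain torchvision ResNet-50 as a stand-in, not the repo's code):

import torch
import torchvision

backbone = torchvision.models.resnet50()
backbone.fc = torch.nn.Identity()
head = torch.nn.Linear(2048, 1000)

for p in backbone.parameters():
    p.requires_grad_(False)
backbone.eval()   # keeps BatchNorm running stats fixed during forward passes

# Only head is trained. Calling backbone.train() instead would still update
# the BN running averages even with gradients disabled, which is the
# partial-freezing behavior described above.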
Hello
I'm trying to perform inference on new images.
I've followed the setup instructions here.
This is the command I'm running:
python scripts/eval_captioning.py \
--config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
--checkpoint-path /tmp/.torch/virtex_cache/bicaptioning_R_50_L1_H2048.pth \
--data-root /path/to/images_dir \
--output /path/to/save/predictions.json \
--num-gpus-per-machine 1 \
--cpu-workers 4
This is my stacktrace:
** fvcore version of PathManager will be deprecated soon. **
** Please migrate to the version in iopath repo. **
https://github.com/facebookresearch/iopath
2020-12-16 16:34:21.685 | INFO | virtex.utils.checkpointing:load:156 - Rank 0: Loading checkpoint from /tmp/.torch/virtex_cache/bicaptioning_R_50_L1_H2048.pth
2020-12-16 16:34:22.134 | INFO | virtex.utils.checkpointing:load:166 - Rank 0: Loading model from /tmp/.torch/virtex_cache/bicaptioning_R_50_L1_H2048.pth
Traceback (most recent call last):
File "scripts/eval_captioning.py", line 114, in <module>
main(_A)
File "scripts/eval_captioning.py", line 78, in main
output_dict = model(val_batch)
File "/yonatan/virtex/virtex_env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/yonatan/virtex/virtex/models/captioning.py", line 176, in forward
start_predictions, beam_search_step
File "/yonatan/virtex/virtex/utils/beam_search.py", line 250, in search
predictions[timestep].gather(1, cur_backpointers).unsqueeze(2)
RuntimeError: gather_out_cuda(): Expected dtype int64 for index
This is my system (Linux):
(virtex_env) (base) [p virtex]$ python
Python 3.6.10 |Anaconda, Inc.| (default, Jan 7 2020, 21:14:29)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.7.0'
This is my datasets structure:
(virtex_env) (base) [p virtex]$ ls datasets/
coco vocab
(virtex_env) (base) [p virtex]$ ls datasets/vocab/
coco_10k.model coco_10k.vocab
(virtex_env) (base) [p virtex]$ ls datasets/coco/
annotations serialized_train.lmdb serialized_train.lmdb-lock serialized_val.lmdb serialized_val.lmdb-lock train2017 val2017
What am I missing?
Thanks
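Not an authoritative fix, but the error says the index tensor passed to gather() is not int64, so casting cur_backpointers with .long() in virtex/utils/beam_search.py might be a workaround on PyTorch 1.7.0. A minimal self-contained reproduction of the dtype requirement:

import torch

predictions = torch.randn(4, 5)
backpointers = torch.zeros(4, 1, dtype=torch.int32)

# predictions.gather(1, backpointers)             # RuntimeError: Expected dtype int64
out = predictions.gather(1, backpointers.long())  # casting the index works
print(out.shape)                                  # torch.Size([4, 1])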
Hi,
I am trying to download the pretrained model for image captioning, but the download link has been removed. Could you please update the download link?
python scripts/eval_captioning.py \
    --config /tmp/bicaptioning_R_50_L1_H2048/pretrain_config.yaml \
    --checkpoint-path /tmp/bicaptioning_R_50_L1_H2048/checkpoint_500000.pth \
    --data-root /path/to/images_dir \
    --output /path/to/save/predictions.json \
    --num-gpus-per-machine 1 \
    --cpu-workers 4
Hi, I am wondering where I can download pretrain_config.yaml and the checkpoint file. I cannot find the link in the GitHub README. Is retraining the whole model the only way to get the pre-trained model?
Cheers,
Hi, nice work - thanks for sharing the code.
I'm trying to reproduce the CIDEr & SPICE results as they appear in Figure 4 of your paper.
I simply load the pre-trained models (specifically, those corresponding to H=1024 and H=2048 in the width ablation) and run eval_captioning.py after building the vocabulary.
The values I get are much lower than those in Fig. 4, which suggests some inconsistency in the pre- or post-processing.
Should I expect the same values in this experiment? If so, is there any change I should perform?
Hello, your work is very attractive to me, but when I tried to reproduce your excellent results I found that the weight files at http://kdexd.xyz/virtex/virtex/usage/model_zoo.html have been taken down. I hope you can provide working links to the weight files so that your excellent work can be reproduced.
Hi,
Thank you for making this code public!
I want to pre-train a captioning model on another dataset (the ARCH dataset). I went through your codebase and realized that I first need to create a Dataset class for my dataset, similar to your Dataset class in virtex/data/datasets/coco_captions.py. Next, I will need to make a modified version of virtex/data/datasets/captioning.py.
Somehow the files in virtex/data/datasets/ are all ignored by git and I can't make any of them visible. Can you please help me with this? I would also appreciate any suggestions on how to modify the code at this stage with the least disruption to the functions and classes that rely on the Dataset classes.
Many thanks,
George Batchkala
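In the meantime, a minimal sketch of what such a Dataset class could look like (names and the (image_path, caption) structure are illustrative assumptions, not the ARCH format or the repo's exact interface):

import cv2
import torch
from torch.utils.data import Dataset

class MyCaptionsDataset(Dataset):
    def __init__(self, pairs):
        # pairs: list of (image_path, caption) tuples for the new corpus
        self.pairs = pairs

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        path, caption = self.pairs[idx]
        image = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
        image = torch.from_numpy(image).permute(2, 0, 1).float()
        # Tokenization/transforms would mirror coco_captions.py in the repo.
        return {"image": image, "caption": caption}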
I am trying to pretrain using the token classification method. I copied this repo and was just trying to reproduce the results from the study, but I am experiencing problems when pretraining with token classification: it seems as though valid loss values are not present in the output_dict variable.
When I use pretrain_virtex.py and log every 20 iterations, I get the following output.
2021-11-16T12:20:04.960052+0000: Iter 20 | Time: 0.764 sec | ETA: 54h 39m [Loss nan] [GPU 8774 MB]
Do you have any idea what could be wrong in the code?
I've been adapting the example scripts to my own training task, and I've noticed that the scripts do not handle different random seeds as expected. I've found this problem in two places, but there might be more:
The two places are: lines 104 to 109 (at commit 2baba8a), and virtex/scripts/pretrain_virtex.py, line 68 (at commit 2baba8a).
The problem is that DistributedSampler (from PyTorch 1.9.0) requires the kwarg seed in order to shuffle differently when shuffle=True. I believe the correct use of DistributedSampler for training with different random seeds would be to add the kwarg seed=_DOWNC.RANDOM_SEED when DistributedSampler is initialized in these two places. As for reshuffling on additional epochs, DistributedSampler adds the seed to the epoch number, so nothing needs to change during epoch-setting for the sampler.
Please let me know your thoughts, or if I may have missed something.
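A sketch of the proposed change (train_dataset stands in for the script's dataset object, and _DOWNC.RANDOM_SEED is the config value mentioned above):

from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(
    train_dataset,                # the script's existing dataset
    shuffle=True,
    seed=_DOWNC.RANDOM_SEED,      # without this, every seed shuffles identically
)
# Per-epoch reshuffling is unaffected: DistributedSampler combines this seed
# with the epoch passed via sampler.set_epoch(epoch).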
Can you share the approximate time for pretraining?