
bmt's Introduction

Dense Video Captioning with Bi-modal Transformer

Project Page • ArXiv • BMVC Page • Presentation (Can't watch YouTube? I gotchu! 🤗)

This is a PyTorch implementation for our paper: A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer (BMVC 2020).

Summary

Dense video captioning aims to localize and describe important events in untrimmed videos. Existing methods mainly tackle this task by exploiting the visual information alone, while completely neglecting the audio track.

To this end, we present the Bi-modal Transformer with Proposal Generator (BMT), which efficiently utilizes audio and visual input sequences to select events in a video and then uses these clips to generate a textual description.

Bi-Modal Transformer with Proposal Generator

Audio and visual features are encoded with VGGish and I3D, while caption tokens are encoded with GloVe. First, VGGish and I3D features are passed through the stack of N bi-modal encoder layers where audio and visual sequences are encoded into, what we call, audio-attended visual and video-attended audio features. These features are passed to the bi-modal multi-headed proposal generator, which generates a set of proposals using information from both modalities.

Then, the input features are trimmed according to the proposed segments and encoded in the bi-modal encoder again. The stack of N bi-modal decoder layers inputs both: a) GloVe embeddings of the previously generated caption sequence, b) the internal representation from the last layer of the encoder for both modalities. The decoder produces its internal representation, which is then used by the generator to model a distribution over the vocabulary for the next caption word.
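For intuition, below is a minimal, illustrative sketch of one bi-modal encoder layer in PyTorch. It is not the repository's actual module: the layer sizes, the missing layer norms and feed-forward blocks, and the simplified residual pattern are assumptions made only to show how each modality first self-attends and then cross-attends to the other modality.

import torch
import torch.nn as nn

class BiModalEncoderLayerSketch(nn.Module):
    """Illustrative only: self-attention per modality, then cross-attention between modalities."""

    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.self_audio = nn.MultiheadAttention(d_model, n_heads)
        self.self_visual = nn.MultiheadAttention(d_model, n_heads)
        # audio queries attend over visual keys/values and vice versa
        self.audio_attends_visual = nn.MultiheadAttention(d_model, n_heads)
        self.visual_attends_audio = nn.MultiheadAttention(d_model, n_heads)

    def forward(self, audio, visual):
        # inputs are (seq_len, batch, d_model), the default MultiheadAttention layout
        audio = audio + self.self_audio(audio, audio, audio)[0]
        visual = visual + self.self_visual(visual, visual, visual)[0]
        # "visual-attended audio": the audio stream enriched with visual context
        audio_av = audio + self.audio_attends_visual(audio, visual, visual)[0]
        # "audio-attended visual": the visual stream enriched with audio context
        visual_av = visual + self.visual_attends_audio(visual, audio, audio)[0]
        return audio_av, visual_av

# toy example: 20 audio steps (e.g. projected VGGish) and 30 visual steps (e.g. projected I3D)
audio, visual = torch.randn(20, 1, 128), torch.randn(30, 1, 128)
a_out, v_out = BiModalEncoderLayerSketch()(audio, visual)
print(a_out.shape, v_out.shape)  # torch.Size([20, 1, 128]) torch.Size([30, 1, 128])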

Getting Started

The code is tested on Ubuntu 16.04/18.04 with one NVIDIA GPU 1080Ti/2080Ti. If you are planning to use it with other software/hardware, you might need to adapt conda environment files or even the code.

Clone the repository. Mind the --recursive flag to make sure submodules are also cloned (evaluation scripts for Python 3 and scripts for feature extraction).

git clone --recursive https://github.com/v-iashin/BMT.git

Download features (I3D and VGGish) and word embeddings (GloVe). The script will download them (~10 GB) and unpack them into the ./data and ./.vector_cache folders. Make sure to run it from the BMT folder.

bash ./download_data.sh

Set up a conda environment

conda env create -f ./conda_env.yml
conda activate bmt
# install spacy language model. Make sure you activated the conda environment
python -m spacy download en

Train

We train our model in two stages: training of the captioning module on ground truth proposals and training of the proposal generator using the pre-trained encoder from the captioning module.

  • Train the captioning module. You may also download the pre-trained model best_cap_model.pt (md5 hash 7b4d48cd77ec49a027a4a1abc6867ee7).
python main.py \
    --procedure train_cap \
    --B 32
  • Train the proposal generation module. You may also download the pre-trained model best_prop_model.pt (md5 hash 5f8b20826b09eadd41b7a5be662c198b).
python main.py \
    --procedure train_prop \
    --pretrained_cap_model_path /your_exp_path/best_cap_model.pt \
    --B 16

Evaluate

Since a part of the videos in ActivityNet Captions became unavailable over time, we could only obtain ~91 % of the videos in the dataset (see ./data/available_mp4.txt for ids). To this end, we evaluate the performance of our model against ~91 % of the validation videos. We provide the validation sets without such videos in ./data/val_*_no_missings.json. Please see the Experiments and Supplementary Material sections for details and the performance of other models on the same validation sets.

  • Ground truth proposals. The performance of the captioning module on ground truth segments can be obtained from the file with the pre-trained captioning module. You may also want to use the official evaluation script with ./data/val_*_no_missings.json as references (-r argument).
import torch
cap_model_cpt = torch.load('./path_to_pre_trained_model/best_cap_model.pt', map_location='cpu')
print(cap_model_cpt['val_1_metrics'])
print(cap_model_cpt['val_2_metrics'])
# To obtain the final results, average values in both dicts
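# A minimal sketch of that averaging (an assumption here: both dicts map metric names to scalar values)
final_metrics = {
    metric: (cap_model_cpt['val_1_metrics'][metric] + cap_model_cpt['val_2_metrics'][metric]) / 2
    for metric in cap_model_cpt['val_1_metrics']
}
print(final_metrics)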
  • Learned proposals. Create a file with captions for every proposal provided in --prop_pred_path using the captioning model specified in --pretrained_cap_model_path. The script will automatically evaluate it against both ground truth validation sets. Alternatively, use the predictions prop_results_val_1_e17_maxprop100.json in ./results and the official evaluation script with ./data/val_*_no_missings.json as references (-r argument).
python main.py \
    --procedure evaluate \
    --pretrained_cap_model_path /path_to_best_cap_model.pt \
    --prop_pred_path /path_to_generated_json_file \
    --device_ids 0

Details on Feature Extraction

Check out our script for extracting I3D and VGGish features from a set of videos: video_features on GitHub (make sure to check out the 662ec51caf591e76724237f0454bdf7735a8dcb1 commit). Also see #7 for more details on configuration.

Reproducibility Note

We would like to note that, despite a fixed random seed, some randomness occurs in our experimentation. Therefore, during the training of the captioning module, one might achieve slightly different results. Specifically, the numbers in your case might differ (higher or lower) from ours or the model will saturate in a different number of epochs. At the same time, we observed quite consistent results when training the proposal generation module with the pre-trained captioning module.

We relate this problem to padding and how it is implemented in PyTorch (see PyTorch Reproducibility for details). Also, any suggestions on how to address this issue are greatly appreciated.
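If you want to constrain this randomness as much as possible, a minimal seed-fixing sketch is shown below. It illustrates the standard PyTorch recipe rather than the exact settings used in this repository and, as noted above, it reduces but does not necessarily remove the run-to-run variation.

import random
import numpy as np
import torch

def fix_seeds(seed: int = 1):
    # fix the Python, NumPy and PyTorch (CPU + GPU) random number generators
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # trade speed for determinism in cuDNN; some CUDA ops may still be non-deterministic
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

fix_seeds(1)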

Comparison with MDVC

Comparison between MDVC and the Bi-modal Transformer (BMT) when captioning ground truth proposals on the ActivityNet Captions validation set. BMT performs on par while having three times fewer parameters and using only two modalities.

Model   Params (Mill)   BLEU@3   BLEU@4   METEOR
MDVC    149             4.52     1.98     11.07
BMT     51              4.63     1.99     10.90

Single Video Prediction

Open In Colab

The experience with Google Colab is pretty poor. For a better experience, we recommend following the installation guide and setting up the software locally as described below.

Start by extracting audio and visual features from your video using video_features repository. This repo is also included in ./submodules/video_features (commit 662ec51caf591e76724237f0454bdf7735a8dcb1).

Extract I3D features

# run this from the video_features folder:
cd ./submodules/video_features
conda deactivate
conda activate i3d
python main.py \
    --feature_type i3d \
    --on_extraction save_numpy \
    --device_ids 0 \
    --extraction_fps 25 \
    --video_paths ../../sample/women_long_jump.mp4 \
    --output_path ../../sample/

Extract VGGish features (if you get a ValueError, download the VGGish model first; see README.md in ./submodules/video_features)

conda deactivate
conda activate vggish
python main.py \
    --feature_type vggish \
    --on_extraction save_numpy \
    --device_ids 0 \
    --video_paths ../../sample/women_long_jump.mp4 \
    --output_path ../../sample/

Run the inference

# run this from the BMT main folder:
cd ../../
conda deactivate
conda activate bmt
python ./sample/single_video_prediction.py \
    --prop_generator_model_path ./sample/best_prop_model.pt \
    --pretrained_cap_model_path ./sample/best_cap_model.pt \
    --vggish_features_path ./sample/women_long_jump_vggish.npy \
    --rgb_features_path ./sample/women_long_jump_rgb.npy \
    --flow_features_path ./sample/women_long_jump_flow.npy \
    --duration_in_secs 35.155 \
    --device_id 0 \
    --max_prop_per_vid 100 \
    --nms_tiou_thresh 0.4

Expected output

[
  {'start': 0.1, 'end': 4.9, 'sentence': 'We see a title screen'},
  {'start': 5.0, 'end': 7.9, 'sentence': 'A large group of people are seen standing around a building'},
  {'start': 0.7, 'end': 11.9, 'sentence': 'A man is seen standing in front of a large crowd'},
  {'start': 19.6, 'end': 33.3, 'sentence': 'The woman runs down a track and jumps into a sand pit'},
  {'start': 7.5, 'end': 10.0, 'sentence': 'A large group of people are seen standing around a building'},
  {'start': 0.6, 'end': 35.1, 'sentence': 'A large group of people are seen running down a track while others watch on the sides'},
  {'start': 8.2, 'end': 13.7, 'sentence': 'A man runs down a track'},
  {'start': 0.1, 'end': 2.0, 'sentence': 'We see a title screen'}
]

Note that in our research we avoided non-maximum suppression for computational efficiency and to allow the event predictions to be dense. Feel free to play with the --nms_tiou_thresh parameter: for example, set it to 0.4 as in the command above.
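To make the effect of that threshold concrete, here is a minimal sketch of greedy non-maximum suppression over temporal proposals. It is illustrative only: the (start, end, confidence) proposal format and the tIoU definition below are our assumptions for this example, not a copy of the repository's implementation.

def temporal_iou(a, b):
    """Temporal IoU of two (start, end) segments (confidence, if present, is ignored)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, tiou_thresh=0.4):
    """proposals: list of (start, end, confidence); keep the most confident, drop strong overlaps."""
    proposals = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    for prop in proposals:
        if all(temporal_iou(prop, k) < tiou_thresh for k in kept):
            kept.append(prop)
    return kept

print(temporal_nms([(0.1, 4.9, 0.9), (0.5, 4.0, 0.8), (19.6, 33.3, 0.7)]))
# -> [(0.1, 4.9, 0.9), (19.6, 33.3, 0.7)]: the second proposal overlaps the first too much and is dropped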

The sample video credits: Women's long jump historical World record in 1978

If you encounter the error

RuntimeError: Vector for token b'<something>' has <some-number> dimensions, but previously read vectors
have 300 dimensions.

try removing *.txt and *.txt.pt from the hidden folder ./.vector_cache/ and check that you are not running out of disk space (unpacking glove.840B.300d.zip requires an extra ~8.5 GB). Then run single_video_prediction.py again.
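If you prefer to do that clean-up programmatically, here is a small sketch (assuming the default ./.vector_cache location mentioned above):

from pathlib import Path

# remove the cached GloVe files (*.txt and *.txt.pt) so they are rebuilt on the next run
for cached_file in Path('./.vector_cache').glob('*.txt*'):
    cached_file.unlink()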

Citation

Our paper was accepted at BMVC 2020. Please use this BibTeX if you would like to cite our work:

@InProceedings{BMT_Iashin_2020,
  title={A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer},
  author={Iashin, Vladimir and Rahtu, Esa},
  booktitle={British Machine Vision Conference (BMVC)},
  year={2020}
}
@InProceedings{MDVC_Iashin_2020,
  title = {Multi-Modal Dense Video Captioning},
  author = {Iashin, Vladimir and Rahtu, Esa},
  booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  pages={958--959},
  year = {2020}
}

Acknowledgments

Funding for this research was provided by the Academy of Finland projects 327910 & 324346. The authors acknowledge CSC — IT Center for Science, Finland, for computational resources for our experimentation.

Media Coverage

bmt's People

Contributors

siddd25 • v-iashin • xanthan011


bmt's Issues

OSError: libtorch_cpu.so: cannot open shared object file: No such file or directory

Hi,
when I tried to do that, I got this error:

conda activate i3d
python main.py \
    --feature_type i3d \
    --on_extraction save_numpy \
    --device_ids 0 \
    --extraction_fps 25 \
    --video_paths ../../sample/women_long_jump.mp4 \
    --output_path ../../sample/

OSError: libtorch_cpu.so: cannot open shared object file: No such file or directory

How do I fix it?

Final performance

Hi, v-iashin!

Should the final results (e.g. BLEU, METEOR) be divided by 0.91?

Hoping for your reply.

meteor error

During caption evaluation, the following error occurred. Please tell me how to solve it, thank you!

Traceback (most recent call last):
File "/home/njj/niu/video-caption-project/BMT/TMT-project/evaluation/meteor_test.py", line 36, in
val_metrics = calculate_metrics(reference_paths, submission_path, tIoUs, 100)
File "/home/njj/niu/video-caption-project/BMT/TMT-project/evaluation/meteor_test.py", line 12, in calculate_metrics
max_prop_per_vid, PREDICTION_FIELDS, verbose, only_proposals)
File "/home/njj/niu/video-caption-project/BMT/TMT-project/evaluation/evaluate.py", line 64, in init
(Meteor(), "METEOR"),
File "/home/njj/niu/video-caption-project/BMT/TMT-project/submodules/pycocoevalcap/meteor/meteor.py", line 25, in init
stderr=subprocess.PIPE)
File "/home/njj/anaconda3/envs/mdvc/lib/python3.7/subprocess.py", line 800, in init
restore_signals, start_new_session)
File "/home/njj/anaconda3/envs/mdvc/lib/python3.7/subprocess.py", line 1551, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'java': 'java'
Exception ignored in: <function Meteor.del at 0x7f8d33e3e950>
Traceback (most recent call last):
File "/home/njj/niu/video-caption-project/BMT/TMT-project/submodules/pycocoevalcap/meteor/meteor.py", line 79, in del
self.lock.acquire()
AttributeError: 'Meteor' object has no attribute 'lock'

Process finished with exit code 1

i3d features are corrupted while extracting via download.sh

Hi,
I am trying to run single_video_prediction.py on a different video with the pre-trained BMT model. It requires I3D features as well. The uploaded features are giving a corruption error. Could you provide an alternate link for downloading the features?

Thanks for uploading the model and congrats on the publication.

ERROR: .zip file with i3d features is corrupted

Thank you for your excellent work.
Describe the bug
When I run bash ./download_data.sh, I get:

Checking for correctness of the downloaded files
md5sum: ./data/i3d_25fps_stack64step64_2stream_npy.zip: No such file or directory
ERROR: .zip file with i3d features is corrupted

Looking forward to your reply

BERT word embeddings

Hello,
Thank you for this great project!
Have you tried any word embedding model other than the GloVe pre-trained embeddings, for example a pre-trained BERT model?
Did you encounter any issues when trying such a model, or is there a reason you didn't use one?

Why do the times of multiple generated captions overlap?

Here are my results from using single_video_prediction.py to generate multiple captions from a raw video. The first result sets max_prop_per_vid to 20, the second sets max_prop_per_vid to 100.
result 1:
[{'start': 2.7, 'end': 37.9, 'sentence': 'A man is sitting on a chair talking to the camera'}, {'start': 1.6, 'end': 15.1, 'sentence': 'A man is sitting on a chair talking to the camera'}, {'start': 0.0, 'end': 127.7, 'sentence': 'A man is seen speaking to the camera while holding a microphone and leads into him speaking'}]
result 2:
[{'start': 2.7, 'end': 37.9, 'sentence': 'A man is sitting on a chair talking to the camera'}, {'start': 1.6, 'end': 15.1, 'sentence': 'A man is sitting on a chair talking to the camera'}, {'start': 0.0, 'end': 127.7, 'sentence': 'A man is seen speaking to the camera while holding a microphone and leads into him speaking'}, {'start': 0.3, 'end': 5.2, 'sentence': 'A man is seen sitting on a chair with a woman in a microphone'}, {'start': 109.0, 'end': 200.9, 'sentence': 'The man is talking to the camera'}, {'start': 283.4, 'end': 300.7, 'sentence': 'A man is seen speaking to the camera while holding a microphone'}]

Why do the times of multiple proposals overlap? How can I get non-overlapping event descriptions? Thanks!

Uni-modal/Bi-modal Training

Hello,

Thank you for sharing the code and responding to all issues. I have two questions:

  1. For the uni-modal training, did you use the "linear embedder"?
  2. How did you decide the dimension of internal space (d_model) for bi-modal training? Should it be the same as the dimension of the video feature?

Fine tuning and end-to-end inference on a video.

I want to fine-tune this pre-trained model for learning purposes on a small set of videos, but am not able to proceed. If anyone can help me, it would be a great help. Also, if possible, a single master virtual environment that can run the entire caption generation in one shot without changing environments would be appreciated. Thanks in advance.

video dataset

Hello! I would like to ask if you could share the video dataset of this project with me. I would be very grateful if you could. My email address is [email protected]. Thank you very much!

Caption my custom video with no audio information

Hi ~ @v-iashin,
Thanks for sharing your wonderful work and the detailed instructions on its usage!! I want to caption my own video with the provided pre-trained model, but the video doesn't have audio. So I wonder if I can directly follow the instructions in "Single Video Prediction" of the README? My concerns mainly lie in (1) the feature extraction module (VGGish): should I skip the VGGish features or extract them anyway (an error may occur?) even though the video has no audio, and (2) can I use the pre-trained model directly, or do I have to re-train the model without audio information (I just want to finish a small application, and re-training is time-consuming)?

Thanks and best regards!

503 service unavailable: cannot open a3s.fi links

Hi,

Since this Monday, the bucket storage that we use for storing features and models (a3s) has been down. Usually, it takes several hours to recover, but this time it seems more serious. The problem is claimed to be 'temporary', but we are not sure when it will recover, so for now please use this issue to let us know if you need any of the files ASAP – we will share them via other means.

Sorry for the trouble and good luck with your research!

ResolvePackageNotFound issue

I have recently started to get this issue while setting up the i3d and VGGish environments in the provided Colab. Earlier I was not encountering this issue, but now I am not able to resolve it. Please help to solve this issue as soon as possible.

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I am facing the above error while running the training script. Can someone let me know how to solve it?

My GPU specifications are shown in the attached screenshot.

And the packages in conda are-
# Name Version Build Channel
_libgcc_mutex 0.1 main conda-forge
_pytorch_select 0.2 gpu_0 anaconda
absl-py 0.8.1 py37_0 conda-forge
asn1crypto 1.3.0 py37_0 conda-forge
blas 1.0 mkl conda-forge
ca-certificates 2020.1.1 0 anaconda
certifi 2020.4.5.1 py37_0 anaconda
cffi 1.14.0 py37h2e261b9_0 anaconda
chardet 3.0.4 py37_1003 conda-forge
cryptography 2.8 py37h1ba5d50_0 anaconda
cuda-command-line-tools 11.3.0 h3b286be_0 nvidia/label/cuda-11.3.0
cuda-compiler 11.3.0 h3b286be_0 nvidia/label/cuda-11.3.0
cuda-cudart 11.3.58 hc1aae59_0 nvidia/label/cuda-11.3.0
cuda-cuobjdump 11.3.58 hc78e225_0 nvidia/label/cuda-11.3.0
cuda-cupti 11.3.58 h9a3dd33_0 nvidia/label/cuda-11.3.0
cuda-cuxxfilt 11.3.58 he670d9e_0 nvidia/label/cuda-11.3.0
cuda-gdb 11.3.58 h531059a_0 nvidia/label/cuda-11.3.0
cuda-libraries 11.3.0 h3b286be_0 nvidia/label/cuda-11.3.0
cuda-libraries-dev 11.3.0 h3b286be_0 nvidia/label/cuda-11.3.0
cuda-memcheck 11.3.58 h8711ecb_0 nvidia/label/cuda-11.3.0
cuda-nvcc 11.3.58 h2467b9f_0 nvidia/label/cuda-11.3.0
cuda-nvdisasm 11.3.58 hd2ea46e_0 nvidia/label/cuda-11.3.0
cuda-nvml-dev 12.2.128 0 nvidia
cuda-nvprof 11.3.58 h860cd9e_0 nvidia/label/cuda-11.3.0
cuda-nvprune 11.3.58 hb917323_0 nvidia/label/cuda-11.3.0
cuda-nvrtc 11.3.58 he300756_0 nvidia/label/cuda-11.3.0
cuda-nvtx 11.3.58 h3fa534a_0 nvidia/label/cuda-11.3.0
cuda-nvvp 11.3.58 hd16380c_0 nvidia/label/cuda-11.3.0
cuda-samples 11.6.101 h8efea70_0 nvidia
cuda-sanitizer-api 11.3.58 h58da6c8_0 nvidia/label/cuda-11.3.0
cuda-thrust 11.3.58 h7b74f08_0 nvidia/label/cuda-11.3.0
cuda-toolkit 11.3.0 h3b286be_0 nvidia/label/cuda-11.3.0
cuda-tools 11.3.0 h3b286be_0 nvidia/label/cuda-11.3.0
cuda-visual-tools 11.3.0 h3b286be_0 nvidia/label/cuda-11.3.0
cudatoolkit 10.0.130 0 anaconda
cudnn 7.6.5.32 ha8d7eb6_1 conda-forge
cymem 1.31.2 py37h6bb024c_0 anaconda
cytoolz 0.9.0.1 py37h14c3975_1 anaconda
dill 0.2.9 py37_0 conda-forge
en-core-web-sm 2.0.0 pypi_0 pypi
future 0.17.1 py37_0 anaconda
idna 2.9 py_1 conda-forge
intel-openmp 2020.0 166 anaconda
joblib 0.14.1 py_0 conda-forge
ld_impl_linux-64 2.33.1 h53a641e_7 conda-forge
libcublas 11.4.2.10064 h8a72295_0 nvidia/label/cuda-11.3.0
libcufft 10.4.2.58 h58ccd86_0 nvidia/label/cuda-11.3.0
libcurand 10.2.4.58 h99380db_0 nvidia/label/cuda-11.3.0
libcusolver 11.1.1.58 hec68242_0 nvidia/label/cuda-11.3.0
libcusparse 11.5.0.58 hf5aa513_0 nvidia/label/cuda-11.3.0
libedit 3.1.20181209 hc058e9b_0 anaconda
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0 anaconda
libgfortran-ng 7.3.0 hdf63c60_0 anaconda
libnpp 11.3.3.44 h8df316f_0 nvidia/label/cuda-11.3.0
libnvjpeg 11.4.1.58 h3d06750_0 nvidia/label/cuda-11.3.0
libprotobuf 3.11.4 h8b12597_0 conda-forge
libstdcxx-ng 9.1.0 hdf63c60_0 anaconda
markdown 3.2.1 py_0 conda-forge
mkl 2020.0 166 anaconda
mkl-service 2.3.0 py37he904b0f_0
mkl_fft 1.0.15 py37ha843d7b_0
mkl_random 1.1.0 py37hd6b4f25_0
msgpack-numpy 0.4.4.3 py_0 conda-forge
msgpack-python 0.5.6 py37h6bb024c_1 anaconda
murmurhash 0.28.0 py37hf484d3e_0 anaconda
ncurses 6.2 he6710b0_1 anaconda
ninja 1.9.0 py37hfd86e86_0 anaconda
numpy 1.15.4 py37h7e9f1db_0
numpy-base 1.15.4 py37hde5b4d6_0
openjdk 8.0.152 h7b6447c_3 anaconda
openssl 1.1.1g h7b6447c_0 anaconda
pandas 0.24.2 py37he6710b0_0 anaconda
pip 20.0.2 py37_1 conda-forge
plac 0.9.6 py37_0 anaconda
preshed 1.0.1 py37he6710b0_0 anaconda
protobuf 3.11.4 py37h3340039_1 conda-forge
pycparser 2.20 py_0 conda-forge
pyopenssl 19.1.0 py37_0 conda-forge
pysocks 1.7.1 py37_0 conda-forge
python 3.7.7 hcf32534_0_cpython anaconda
python-dateutil 2.8.1 py_0 conda-forge
python_abi 3.7 1_cp37m conda-forge
pytorch 1.2.0 cuda100py37h938c94c_0
pytz 2020.1 py_0 anaconda
readline 8.0 h7b6447c_0 anaconda
regex 2018.07.11 py37h14c3975_0 anaconda
requests 2.23.0 py37_0 conda-forge
scikit-learn 0.22.1 py37hd81dba3_0
scipy 1.3.1 py37h7c811a0_0
setuptools 46.1.3 py37_0 anaconda
six 1.14.0 py37_0 conda-forge
spacy 2.0.12 py37h962f231_0 anaconda
sqlite 3.31.1 h62c20be_1 anaconda
tensorboard 1.14.0 py37_0 conda-forge
termcolor 1.1.0 py37_1 anaconda
thinc 6.10.3 py37h962f231_0 anaconda
tk 8.6.8 hbc83047_0 anaconda
toolz 0.10.0 py_0 conda-forge
torchtext 0.3.1 pypi_0 pypi
tqdm 4.46.0 py_0 anaconda
ujson 2.0.3 py37he6710b0_0 anaconda
urllib3 1.25.8 py37_0 anaconda
werkzeug 1.0.1 pyh9f0ad1d_0 conda-forge
wheel 0.34.2 py37_0 conda-forge
wrapt 1.10.11 py37h14c3975_2 anaconda
xz 5.2.5 h7b6447c_0 anaconda
zlib 1.2.11 h7b6447c_3 anaconda

Unexpected "UNK" captions with single video prediction

Hello, Vladimir.

First of all, congratulations on such a fantastic project. I was introduced to this work by many other papers that cited it and used it as a base to grow upon. I enjoyed your video presentation, and I think you are doing a very good job at keeping up with all the repo issues.

I ran the sample code single_video_prediction.py on the given example (women_long_jump.mp4) without major issues (had to change CUDA and PyTorch versions from the conda environment as reported in #45).

However, when I tried the code on a custom video, let's call it my_video.mp4, I got some errors.

VGGish was unable to extract a .wav file from the audio because it had no aac codec (I checked with ffprobe my_video.mp4 and the audio used the opus codec instead of aac). So, I changed these 2 lines in BMT/submodules/video_features/models/vggish/utils/utils.py to the following, which resolved the issue:

mp4_to_acc = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {video_path} {audio_aac_path}'
aac_to_wav = f'{which_ffmpeg()} -hide_banner -loglevel panic -y -i {video_path} {audio_wav_path}'

After obtaining the i3d and vggish features I tried running BMT on the video using the following command:

python ./sample/single_video_prediction.py \
--prop_generator_model_path ./sample/best_prop_model.pt \
--pretrained_cap_model_path ./sample/best_cap_model.pt \
--vggish_features_path ./sample/my_video_vggish.npy \
--rgb_features_path ./sample/my_video_rgb.npy \
--flow_features_path ./sample/my_video_flow.npy \
--duration_in_secs 148.121 \
--device_id 0 \
--max_prop_per_vid 100 \
--nms_tiou_thresh 0.4

Obtaining:

Contructing caption_iterator for "train" phase
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained caption path:
 ./sample/best_cap_model.pt
Traceback (most recent call last):
  File "./sample/single_video_prediction.py", line 313, in <module>
    cap_model, feature_paths, train_dataset, cap_cfg, args.device_id, proposals, args.duration_in_secs
  File "./sample/single_video_prediction.py", line 219, in caption_proposals
    for start, end, conf in proposals.squeeze():
  File "/home/mrt/miniconda3/envs/bmt/lib/python3.7/site-packages/torch/tensor.py", line 456, in __iter__
    raise TypeError('iteration over a 0-d tensor')
TypeError: iteration over a 0-d tensor

Checking it was iterating over a 0-d tensor, I tried removing the NMS and ran it again with:

python ./sample/single_video_prediction.py \
--prop_generator_model_path ./sample/best_prop_model.pt \
--pretrained_cap_model_path ./sample/best_cap_model.pt \
--vggish_features_path ./sample/my_video_vggish.npy \
--rgb_features_path ./sample/my_video_rgb.npy \
--flow_features_path ./sample/my_video_flow.npy \
--duration_in_secs 148.121 \
--device_id 0 \
--max_prop_per_vid 100 \

Obtaining a list of sentences with the token "UNK":

Contructing caption_iterator for "train" phase
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained caption path:
 ./sample/best_cap_model.pt
[{'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '},
 {'start': 0.0, 'end': 148.1, 'sentence': ' unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk   unk '},
 ...]

(the rest of the output repeats the same entry roughly 100 times: every proposal spans 0.0 to 148.1 s and its sentence consists only of "unk" tokens)

I am a bit at a loss here, as I do not have much experience working with text and audio (only with image and video). Could you point me in the right direction? I am unsure of what might be the root cause. I suspect it could be one of the following:

  • PyTorch version. I installed torch 1.4.0 instead of 1.2.0, as it was the closest version that could work with my GPU. I kept torchtext at version 0.3.1 (same as yours). However, the code works for the example video you provide, so it seems unlikely that this is the root cause.
  • VGGish features. As I described above, I changed one script to be able to extract a .wav file directly from the .mp4, skipping the intermediate step of obtaining an .aac file. I do not see any problem in doing so; in fact, it seems like a more portable option. However, I remain unsure whether you did this for a specific reason I am unaware of.

Desktop (please complete the following information):

  • OS: Ubuntu 22.04
  • GPU: NVidia RTX 4090 24GB

Your conda environment

# packages in environment at /home/mrt/miniconda3/envs/bmt:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main    conda-forge
_pytorch_select           0.2                       gpu_0    anaconda
absl-py                   0.8.1                    py37_0    conda-forge
asn1crypto                1.3.0                    py37_0    conda-forge
blas                      1.0                         mkl    conda-forge
ca-certificates           2020.1.1                      0    anaconda
certifi                   2020.4.5.1               py37_0    anaconda
cffi                      1.14.0           py37h2e261b9_0    anaconda
chardet                   3.0.4                 py37_1003    conda-forge
cryptography              2.8              py37h1ba5d50_0    anaconda
cudatoolkit               10.1.243             h6bb024c_0    anaconda
cudnn                     7.6.5.32             hc0a50b0_1    conda-forge
cymem                     1.31.2           py37h6bb024c_0    anaconda
cytoolz                   0.9.0.1          py37h14c3975_1    anaconda
dill                      0.2.9                    py37_0    conda-forge
en-core-web-sm            2.0.0                    pypi_0    pypi
future                    0.17.1                   py37_0    anaconda
idna                      2.9                        py_1    conda-forge
intel-openmp              2020.0                      166    anaconda
joblib                    0.14.1                     py_0    conda-forge
ld_impl_linux-64          2.33.1               h53a641e_7    conda-forge
libedit                   3.1.20181209         hc058e9b_0    anaconda
libffi                    3.2.1                hd88cf55_4
libgcc-ng                 9.1.0                hdf63c60_0    anaconda
libgfortran-ng            7.3.0                hdf63c60_0    anaconda
libprotobuf               3.11.4               h8b12597_0    conda-forge
libstdcxx-ng              9.1.0                hdf63c60_0    anaconda
markdown                  3.2.1                      py_0    conda-forge
mkl                       2020.0                      166    anaconda
mkl-service               2.3.0            py37he904b0f_0
mkl_fft                   1.0.15           py37ha843d7b_0
mkl_random                1.1.0            py37hd6b4f25_0
msgpack-numpy             0.4.4.3                    py_0    conda-forge
msgpack-python            0.5.6            py37h6bb024c_1    anaconda
murmurhash                0.28.0           py37hf484d3e_0    anaconda
ncurses                   6.2                  he6710b0_1    anaconda
ninja                     1.9.0            py37hfd86e86_0    anaconda
numpy                     1.15.4           py37h7e9f1db_0
numpy-base                1.15.4           py37hde5b4d6_0
openjdk                   8.0.152              h7b6447c_3    anaconda
openssl                   1.1.1g               h7b6447c_0    anaconda
pandas                    0.24.2           py37he6710b0_0    anaconda
pip                       20.0.2                   py37_1    conda-forge
plac                      0.9.6                    py37_0    anaconda
preshed                   1.0.1            py37he6710b0_0    anaconda
protobuf                  3.11.4           py37h3340039_1    conda-forge
pycparser                 2.20                       py_0    conda-forge
pyopenssl                 19.1.0                   py37_0    conda-forge
pysocks                   1.7.1                    py37_0    conda-forge
python                    3.7.7           hcf32534_0_cpython    anaconda
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.7                     1_cp37m    conda-forge
pytorch                   1.4.0           cuda101py37h02f0884_0
pytz                      2020.1                     py_0    anaconda
readline                  8.0                  h7b6447c_0    anaconda
regex                     2018.07.11       py37h14c3975_0    anaconda
requests                  2.23.0                   py37_0    conda-forge
scikit-learn              0.22.1           py37hd81dba3_0
scipy                     1.3.1            py37h7c811a0_0
setuptools                46.1.3                   py37_0    anaconda
six                       1.14.0                   py37_0    conda-forge
spacy                     2.0.12           py37h962f231_0    anaconda
sqlite                    3.31.1               h62c20be_1    anaconda
tensorboard               1.14.0                   py37_0    conda-forge
termcolor                 1.1.0                    py37_1    anaconda
thinc                     6.10.3           py37h962f231_0    anaconda
tk                        8.6.8                hbc83047_0    anaconda
toolz                     0.10.0                     py_0    conda-forge
torchtext                 0.3.1                    pypi_0    pypi
tqdm                      4.46.0                     py_0    anaconda
ujson                     2.0.3            py37he6710b0_0    anaconda
urllib3                   1.25.8                   py37_0    anaconda
werkzeug                  1.0.1              pyh9f0ad1d_0    conda-forge
wheel                     0.34.2                   py37_0    conda-forge
wrapt                     1.10.11          py37h14c3975_2    anaconda
xz                        5.2.5                h7b6447c_0    anaconda
zlib                      1.2.11               h7b6447c_3    anaconda

How to visualize the attention?

Do you have a way to visualize the attention mechanism? I want to know what exactly the attention mechanism focuses on.

RuntimeError: Vector for token b'27ll' has 38 dimensions, but previously read vectors have 300 dimensions

Hello Vladimir,

I'm trying to run your repository on Google Colab and I'm facing this error, which I hope you can address and give me some insight on.

I have read your readme file, but I'm trying to replicate the instructions mentioned in this article.

So when I'm trying to run this command after activating the bmt environment:

python ./sample/single_video_prediction.py \
    --prop_generator_model_path /content/BMT/sample/best_prop_model.pt \
    --pretrained_cap_model_path /content/BMT/best_cap_model.pt.1  \
    --vggish_features_path /content/BMT/test/y2mate_vggish.npy \
    --rgb_features_path /content/BMT/test/y2mate_rgb.npy \
    --flow_features_path /content/BMT/test/y2mate_flow.npy \
    --duration_in_secs 99 \
    --device_id 0 \
    --max_prop_per_vid 100 \
    --nms_tiou_thresh 0.4

This error shows up:

Contructing caption_iterator for "train" phase
100%|█████████▉| 753786/753787 [01:21<00:00, 9194.76it/s]
Traceback (most recent call last):
  File "./sample/single_video_prediction.py", line 279, in <module>
    cap_cfg, cap_model, train_dataset = load_cap_model(args.pretrained_cap_model_path, args.device_id)
  File "./sample/single_video_prediction.py", line 136, in load_cap_model
    train_dataset = ActivityNetCaptionsDataset(cfg, 'train', get_full_feat=False)
  File "/content/BMT/sample/../datasets/captioning_dataset.py", line 310, in __init__
    self.train_vocab, self.caption_loader = caption_iterator(cfg, self.batch_size, self.phase)
  File "/content/BMT/sample/../datasets/captioning_dataset.py", line 40, in caption_iterator
    CAPTION.build_vocab(dataset.caption, min_freq=cfg.min_freq_caps, vectors=cfg.word_emb_caps)
  File "/usr/local/envs/bmt/lib/python3.7/site-packages/torchtext/data/field.py", line 273, in build_vocab
    self.vocab = self.vocab_cls(counter, specials=specials, **kwargs)
  File "/usr/local/envs/bmt/lib/python3.7/site-packages/torchtext/vocab.py", line 88, in __init__
    self.load_vectors(vectors, unk_init=unk_init, cache=vectors_cache)
  File "/usr/local/envs/bmt/lib/python3.7/site-packages/torchtext/vocab.py", line 147, in load_vectors
    vectors[idx] = pretrained_aliases[vector](**kwargs)
  File "/usr/local/envs/bmt/lib/python3.7/site-packages/torchtext/vocab.py", line 401, in __init__
    super(GloVe, self).__init__(name, url=url, **kwargs)
  File "/usr/local/envs/bmt/lib/python3.7/site-packages/torchtext/vocab.py", line 280, in __init__
    self.cache(name, cache, url=url, max_vectors=max_vectors)
  File "/usr/local/envs/bmt/lib/python3.7/site-packages/torchtext/vocab.py", line 361, in cache
    dim))
RuntimeError: Vector for token b'27ll' has 38 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions.

I'm not sure why this issue is occurring, and I have tried everything in my power to solve it.

To reproduce this issue, here is the Colab file; you can run it and reproduce the error. Just to be clear, many improvisations had to be made after following the article to run this repository, but I'm stuck at the very last step and I hope you can help me.

I have also attached the video we are running on; you can upload it directly to the test folder, and then you won't have to change any paths in the cells (if you wish not to).

y2mate.mp4
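
For what it's worth, this error usually means the cached GloVe file in ./.vector_cache is truncated or corrupted (e.g. an interrupted download). A rough integrity check is sketched below; the file name is an assumption based on the default torchtext cache layout. If it reports malformed lines, deleting the cache and letting it re-download typically resolves the issue.

import os

# Rough check for a truncated GloVe file (adjust the path to the file actually used).
glove_txt = os.path.join('.vector_cache', 'glove.840B.300d.txt')

malformed = 0
with open(glove_txt, 'rb') as f:
    for line in f:
        # a healthy line is "<token> <300 floats>"; truncated lines are shorter
        if len(line.rstrip().split(b' ')) < 301:
            malformed += 1
print(f'{malformed} malformed line(s) found')
# if > 0: remove ./.vector_cache (this file and its cached .pt) and rerun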

create environment

When I am creating the environment, I am facing this issue. Please help me.
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:

  • libgfortran-ng==7.3.0=hdf63c60_0
  • zlib==1.2.11=h7b6447c_3
  • cymem==1.31.2=py37h6bb024c_0
  • mkl_fft==1.0.15=py37ha843d7b_0
  • cryptography==2.8=py37h1ba5d50_0
  • libstdcxx-ng==9.1.0=hdf63c60_0
  • pytorch==1.2.0=py3.7_cuda10.0.130_cudnn7.6.2_0
  • tk==8.6.8=hbc83047_0
  • libgcc-ng==9.1.0=hdf63c60_0
  • ncurses==6.2=he6710b0_1
  • cytoolz==0.9.0.1=py37h14c3975_1
  • ninja==1.9.0=py37hfd86e86_0
  • regex==2018.07.11=py37h14c3975_0
  • thinc==6.10.3=py37h962f231_0
  • scikit-learn==0.22.1=py37hd81dba3_0
  • numpy-base==1.15.4=py37hde5b4d6_0
  • readline==8.0=h7b6447c_0
  • preshed==1.0.1=py37he6710b0_0
  • xz==5.2.5=h7b6447c_0
  • libffi==3.2.1=hd88cf55_4
  • mkl_random==1.1.0=py37hd6b4f25_0
  • numpy==1.15.4=py37h7e9f1db_0
  • openssl==1.1.1g=h7b6447c_0
  • murmurhash==0.28.0=py37hf484d3e_0
  • mkl-service==2.3.0=py37he904b0f_0
  • libedit==3.1.20181209=hc058e9b_0
  • python==3.7.7=hcf32534_0_cpython
  • sqlite==3.31.1=h62c20be_1
  • msgpack-python==0.5.6=py37h6bb024c_1
  • ld_impl_linux-64==2.33.1=h53a641e_7
  • protobuf==3.11.4=py37h3340039_1
  • pandas==0.24.2=py37he6710b0_0
  • scipy==1.3.1=py37h7c811a0_0
  • spacy==2.0.12=py37h962f231_0
  • ujson==2.0.3=py37he6710b0_0
  • cffi==1.14.0=py37h2e261b9_0
  • libprotobuf==3.11.4=h8b12597_0
  • openjdk==8.0.152=h7b6447c_3
  • wrapt==1.10.11=py37h14c3975_2
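
This usually happens when the exact package builds pinned in conda_env.yml are not available for your platform or channels. A commonly used workaround, sketched below as a rough Python one-off (not an officially supported path), is to strip the build strings so conda resolves by version only; the resulting builds may differ slightly from the tested environment.

import re

# Write a copy of conda_env.yml without the trailing build strings, e.g.
#   "- numpy==1.15.4=py37hde5b4d6_0"  ->  "- numpy==1.15.4"
with open('conda_env.yml') as src, open('conda_env_no_builds.yml', 'w') as dst:
    for line in src:
        dst.write(re.sub(r'^(\s*-\s*[^=\s]+={1,2}[^=\s]+)=[^=\s]+$', r'\1', line))

# afterwards: conda env create -f conda_env_no_builds.yml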

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

The training and evaluation worked without any problems, but I get the following error when running the single video prediction as follows:

(bmt) adel@adel-Pulse-GL66-11UEK:/media/adel/Data3/BMT$ python ./sample/single_video_prediction.py \
>     --prop_generator_model_path ./sample/best_prop_model.pt \
>     --pretrained_cap_model_path ./sample/best_cap_model.pt \
>     --vggish_features_path ./sample/women_long_jump_vggish.npy \
>     --rgb_features_path ./sample/women_long_jump_rgb.npy \
>     --flow_features_path ./sample/women_long_jump_flow.npy \
>     --duration_in_secs 35.155 \
>     --device_id 0 \
>     --max_prop_per_vid 100 \
>     --nms_tiou_thresh 0.4

Contructing caption_iterator for "train" phase
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 2196016/2196017 [01:54<00:00, 19151.82it/s]
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained prop path:
./best_prop_model.pt
Pretrained caption path:
./sample/best_cap_model.pt
Traceback (most recent call last):
  File "./sample/single_video_prediction.py", line 295, in <module>
    prop_model, feature_paths, train_dataset.pad_idx, prop_cfg, args.device_id, args.duration_in_secs
  File "./sample/single_video_prediction.py", line 174, in generate_proposals
    predictions, _, _, _ = prop_model(batch['feature_stacks'], None, masks)
  File "/home/adel/miniconda3/envs/bmt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/adel/Data3/BMT/sample/../model/proposal_generator.py", line 348, in forward
    Av, Va = self.encoder((A, V), masks)
  File "/home/adel/miniconda3/envs/bmt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/adel/Data3/BMT/sample/../model/encoders.py", line 126, in forward
    Av, Va = self.encoder_AV((A, V), (masks['A_mask'], masks['V_mask']))
  File "/home/adel/miniconda3/envs/bmt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/adel/Data3/BMT/sample/../model/blocks.py", line 18, in forward
    x = layer(x, masks)
  File "/home/adel/miniconda3/envs/bmt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/adel/Data3/BMT/sample/../model/encoders.py", line 72, in forward
    M1 = self.res_layers_M1[0](M1, sublayer_self_att_M1)
  File "/home/adel/miniconda3/envs/bmt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/adel/Data3/BMT/sample/../model/blocks.py", line 133, in forward
    res = sublayer(res)
  File "/media/adel/Data3/BMT/sample/../model/encoders.py", line 63, in sublayer_self_att_M1
    def sublayer_self_att_M1(M1): return self.self_att_M1(M1, M1, M1, M1_mask)
  File "/home/adel/miniconda3/envs/bmt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/adel/Data3/BMT/sample/../model/multihead_attention.py", line 66, in forward
    Q = self.linear_Q2d(Q)
  File "/home/adel/miniconda3/envs/bmt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/adel/miniconda3/envs/bmt/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/adel/miniconda3/envs/bmt/lib/python3.7/site-packages/torch/nn/functional.py", line 1371, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
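
For reference, this cuBLAS failure typically points at a mismatch between the PyTorch/cudatoolkit build pinned in the conda environment and the installed GPU/driver (newer GPU architectures are not supported by older CUDA builds), rather than at the BMT code itself. A project-independent sanity check, run inside the same environment, is sketched below; if this tiny matmul also fails, the environment needs a newer PyTorch/CUDA build.

import torch

# Print what the environment believes it is running on, then try a tiny GPU matmul.
print('torch:', torch.__version__, '| cuda:', torch.version.cuda,
      '| device:', torch.cuda.get_device_name(0))
a = torch.randn(64, 64, device='cuda')
b = torch.randn(64, 64, device='cuda')
print('matmul ok, sum =', (a @ b).sum().item())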

Final results

Hi, v-iashin!
After finishing the evaluation process, I got the following results using your pre-trained models (best_cap_model.pt & best_prop_model.pt) from this GitHub repository:

0411021941: learned_props 1by1 26 @ 0: 100%|██████████| 6998/6998 [6:58:57<00:00, 3.59s/it]
PTBTokenizer tokenized 12044821 tokens at 3094732.62 tokens per second.
PTBTokenizer tokenized 11638061 tokens at 3382104.47 tokens per second.
PTBTokenizer tokenized 12044821 tokens at 3013699.36 tokens per second.
...
PTBTokenizer tokenized 1499705 tokens at 1944330.17 tokens per second.
./captioning_results_learned_props_e26.json
{0.3: {'Bleu_1': 0.21855237245090708, 'Bleu_2': 0.11829910025778011, 'Bleu_3': 0.06527576698633347, 'Bleu_4': 0.03286126222797009, 'METEOR': 0.11318173026044272, 'ROUGE_L': 0.2214381639467888, 'CIDEr': 0.12540909027725913, 'Recall': 0.7653550052445841, 'Precision': 0.8432352890940963}, 0.5: {'Bleu_1': 0.1797576627363622, 'Bleu_2': 0.09703088725902112, 'Bleu_3': 0.05261611349091552, 'Bleu_4': 0.025653078928295117, 'METEOR': 0.10449090865385391, 'ROUGE_L': 0.17540336633318065, 'CIDEr': 0.12255632406512765, 'Recall': 0.6267787136401878, 'Precision': 0.574749139042341}, 0.7: {'Bleu_1': 0.10613642314342578, 'Bleu_2': 0.057339390086598954, 'Bleu_3': 0.030490115612125883, 'Bleu_4': 0.014518221261694548, 'METEOR': 0.07997454689235316, 'ROUGE_L': 0.09942682704857322, 'CIDEr': 0.10579470701931824, 'Recall': 0.49080783704361436, 'Precision': 0.2893747835081578}, 0.9: {'Bleu_1': 0.028548101221267407, 'Bleu_2': 0.01514513993899966, 'Bleu_3': 0.007702772567403976, 'Bleu_4': 0.003440348885064907, 'METEOR': 0.03290084691301671, 'ROUGE_L': 0.026592904083137726, 'CIDEr': 0.048605469972506386, 'Recall': 0.3467758615092933, 'Precision': 0.07477945163071767}, 'Average across tIoUs': {'Bleu_1': 0.1332486398879906, 'Bleu_2': 0.07195362938559996, 'Bleu_3': 0.03902119216419472, 'Bleu_4': 0.019118227825756166, 'METEOR': 0.08263700817991662, 'ROUGE_L': 0.1307153153529201, 'CIDEr': 0.10059139783355285, 'Recall': 0.55742935435942, 'Precision': 0.44553466581882817}}.

Should these scores (e.g. BLEU, METEOR, ...) be multiplied by 100 so that they are roughly on the same scale as the scores in your paper?
In general, are the results in dense video captioning papers rescaled by a factor of 100?
I hope for your reply!
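
For reference, the evaluation script prints the raw metric values as fractions in [0, 1]; captioning papers conventionally report them as percentages, so multiplying by 100 puts them on the same scale. A tiny sketch using an excerpt of the dictionary above:

# excerpt of the printed "Average across tIoUs" entry; values are fractions
avg = {'Bleu_4': 0.019118227825756166, 'METEOR': 0.08263700817991662,
       'CIDEr': 0.10059139783355285}
as_percent = {metric: round(100 * value, 2) for metric, value in avg.items()}
print(as_percent)  # e.g. {'Bleu_4': 1.91, 'METEOR': 8.26, 'CIDEr': 10.06}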

About evaluation

Hi Vladimir Iashin
Thanks for sharing such awesome work!
Thanks for your quick reply!
I have some questions about evaluation:
1. How can I increase the speed of the evaluation process? I have an RTX GPU and I get something like 0411021941: learned_props 1by1 26 @ 0:
0%| | 9/6998 [02:47<36:11:36, 18.64s/it].
2. According to the paper, the "dataset is split into 50/25/25 % parts for training, validation, and testing", but in the code there are only
train, val_1, and val_2. Can you explain this point, please, and what is the difference between validation and testing in your work?

Error while extracting vggish features

Hello,
Thanks for this amazing work.

Error opening './tmp/test.wav': System error.

The above error occurs when executing the command python main.py --feature_type vggish --on_extraction save_numpy --device_ids 0 --video_paths ../../sample/v__N1MWv9bW6Q.mp4 --output_path ../../sample/

Please help me with this!

RuntimeError

When I run this command:

python ./sample/single_video_prediction.py \
    --prop_generator_model_path ./sample/best_prop_model.pt \
    --pretrained_cap_model_path ./sample/best_cap_model.pt \
    --vggish_features_path ./sample/women_long_jump_vggish.npy \
    --rgb_features_path ./sample/women_long_jump_rgb.npy \
    --flow_features_path ./sample/women_long_jump_flow.npy \
    --duration_in_secs 35.155 \
    --device_id 0 \
    --max_prop_per_vid 100 \
    --nms_tiou_thresh 0

I am facing this issue:

  File "./sample/single_video_prediction.py", line 279, in <module>
    cap_cfg, cap_model, train_dataset = load_cap_model(args.pretrained_cap_model_path, args.device_id)
  File "./sample/single_video_prediction.py", line 136, in load_cap_model
    train_dataset = ActivityNetCaptionsDataset(cfg, 'train', get_full_feat=False)
  File "/home/shivani/BMT/sample/../datasets/captioning_dataset.py", line 310, in __init__
    self.train_vocab, self.caption_loader = caption_iterator(cfg, self.batch_size, self.phase)
  File "/home/shivani/BMT/sample/../datasets/captioning_dataset.py", line 40, in caption_iterator
    CAPTION.build_vocab(dataset.caption, min_freq=cfg.min_freq_caps, vectors=cfg.word_emb_caps)
  File "/home/shivani/anaconda3/envs/bmt/lib/python3.7/site-packages/torchtext/data/field.py", line 273, in build_vocab
    self.vocab = self.vocab_cls(counter, specials=specials, **kwargs)
  File "/home/shivani/anaconda3/envs/bmt/lib/python3.7/site-packages/torchtext/vocab.py", line 88, in __init__
    self.load_vectors(vectors, unk_init=unk_init, cache=vectors_cache)
  File "/home/shivani/anaconda3/envs/bmt/lib/python3.7/site-packages/torchtext/vocab.py", line 147, in load_vectors
    vectors[idx] = pretrained_aliases[vector](**kwargs)
  File "/home/shivani/anaconda3/envs/bmt/lib/python3.7/site-packages/torchtext/vocab.py", line 401, in __init__
    super(GloVe, self).__init__(name, url=url, **kwargs)
  File "/home/shivani/anaconda3/envs/bmt/lib/python3.7/site-packages/torchtext/vocab.py", line 280, in __init__
    self.cache(name, cache, url=url, max_vectors=max_vectors)
  File "/home/shivani/anaconda3/envs/bmt/lib/python3.7/site-packages/torchtext/vocab.py", line 361, in cache
    dim))
RuntimeError: Vector for token b'Getatchew' has 10 dimensions, but previously read vectors have 300 dimensions. All vectors must have the same number of dimensions.

And in the i3d/vggish feature extraction environment, when I run

python main.py \
    --feature_type vggish \
    --on_extraction save_numpy \
    --device_ids 0 \
    --video_paths ../../sample/women_long_jump.mp4 \
    --output_path ../../sample/

this error comes up:

/home/shivani/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/shivani/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/shivani/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/shivani/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/shivani/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/shivani/anaconda3/envs/vggish/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
usage: main.py [-h] [--train_meta_path TRAIN_META_PATH]
[--val_1_meta_path VAL_1_META_PATH]
[--val_2_meta_path VAL_2_META_PATH]
[--modality {audio,video,audio_video}]
[--video_feature_name VIDEO_FEATURE_NAME]
[--audio_feature_name AUDIO_FEATURE_NAME]
[--video_features_path VIDEO_FEATURES_PATH]
[--audio_features_path AUDIO_FEATURES_PATH] [--d_vid D_VID]
[--d_aud D_AUD] [--word_emb_caps WORD_EMB_CAPS]
[--unfreeze_word_emb]
[--feature_timespan_in_fps FEATURE_TIMESPAN_IN_FPS]
[--fps_at_extraction FPS_AT_EXTRACTION]
[--audio_feature_timespan AUDIO_FEATURE_TIMESPAN]
[--train_json_path TRAIN_JSON_PATH] --procedure
{train_cap,train_prop,evaluate}
[--device_ids DEVICE_IDS [DEVICE_IDS ...]]
[--start_token START_TOKEN] [--end_token END_TOKEN]
[--pad_token PAD_TOKEN] [--max_len MAX_LEN]
[--min_freq_caps MIN_FREQ_CAPS] [--optimizer {adam,sgd}]
[--betas BETAS BETAS] [--eps EPS] [--momentum MOMENTUM]
[--scheduler {constant,reduce_on_plateau}] [--lr LR]
[--weight_decay WEIGHT_DECAY] [--lr_patience LR_PATIENCE]
[--lr_reduce_factor LR_REDUCE_FACTOR] [--B B]
[--inf_B_coeff INF_B_COEFF] [--epoch_num EPOCH_NUM]
[--one_by_one_starts_at ONE_BY_ONE_STARTS_AT]
[--early_stop_after EARLY_STOP_AFTER] [--smoothing SMOOTHING]
[--grad_clip GRAD_CLIP]
[--pretrained_prop_model_path PRETRAINED_PROP_MODEL_PATH]
[--finetune_prop_encoder]
[--pretrained_cap_model_path PRETRAINED_CAP_MODEL_PATH]
[--finetune_cap_encoder] [--obj_coeff OBJ_COEFF]
[--noobj_coeff NOOBJ_COEFF]
[--pad_audio_feats_up_to PAD_AUDIO_FEATS_UP_TO]
[--pad_video_feats_up_to PAD_VIDEO_FEATS_UP_TO]
[--nms_tiou_thresh NMS_TIOU_THRESH] [--log_dir LOG_DIR]
[--prop_pred_path PROP_PRED_PATH]
[--avail_mp4_path AVAIL_MP4_PATH]
[--reference_paths REFERENCE_PATHS [REFERENCE_PATHS ...]]
[--tIoUs TIOUS [TIOUS ...]]
[--max_prop_per_vid MAX_PROP_PER_VID]
[--val_prop_meta_path VAL_PROP_META_PATH]
[--model {transformer,av_transformer}] [--dout_p DOUT_P]
[--N N] [--d_model D_MODEL] [--d_model_video D_MODEL_VIDEO]
[--d_model_audio D_MODEL_AUDIO] [--d_model_caps D_MODEL_CAPS]
[--use_linear_embedder] [--H H] [--d_ff_video D_FF_VIDEO]
[--d_ff_audio D_FF_AUDIO] [--d_ff_caps D_FF_CAPS]
[--anchors_num_video ANCHORS_NUM_VIDEO]
[--anchors_num_audio ANCHORS_NUM_AUDIO]
[--kernel_sizes_audio KERNEL_SIZES_AUDIO [KERNEL_SIZES_AUDIO ...]]
[--kernel_sizes_video KERNEL_SIZES_VIDEO [KERNEL_SIZES_VIDEO ...]]
[--conv_layers_audio [CONV_LAYERS_AUDIO [CONV_LAYERS_AUDIO ...]]]
[--conv_layers_video [CONV_LAYERS_VIDEO [CONV_LAYERS_VIDEO ...]]]
[--layer_norm] [--debug] [--dont_log]
main.py: error: the following arguments are required: --procedure

Please help me. I have downloaded all the packages, but I do not understand how to resolve this.

video files

Hi v-iashin, I would like to ask if it is possible for you to share the video dataset of this project with me. My email address is ([email protected]). Thank you so much.

1 virtualenv instead of 3 conda envs?

Hi,
Thanks for this amazing work. I'm wondering whether it would be better to create one general virtualenv (and one requirements.txt file) instead of 3 conda environments and their yml files?
BTW, I'm facing a problem when installing packages with your method on a JupyterLab interface:

conda env create -f ./conda_env.yml
Collecting package metadata: done
Solving environment: \

The installation progress bar gets stuck at the very first step (and is taking a lot of time).

RuntimeError: storage has wrong size

I got this error in the prediction section:

(bmt) root@ABC001:/data/zuyuehan/project/BMT# python ./sample/single_video_prediction.py --prop_generator_model_path ./sample/best_prop_model.pt --pretrained_cap_model_path ./sample/best_cap_model.pt --vggish_features_path ./test/pandemic_vggish.npy --rgb_features_path ./test/pandemic_rgb.npy --flow_features_path ./test/pandemic_flow.npy --duration_in_secs 99 --device_id 0 --max_prop_per_vid 100 --nms_tiou_thresh 0.4
Contructing caption_iterator for "train" phase
Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Traceback (most recent call last):
  File "./sample/single_video_prediction.py", line 281, in <module>
    args.device_id, args.prop_generator_model_path, args.pretrained_cap_model_path, args.max_prop_per_vid
  File "./sample/single_video_prediction.py", line 94, in load_prop_model
    checkpoint = torch.load(prop_generator_model_path, map_location='cpu')
  File "/root/anaconda3/envs/bmt/lib/python3.7/site-packages/torch/serialization.py", line 386, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/root/anaconda3/envs/bmt/lib/python3.7/site-packages/torch/serialization.py", line 580, in _load
    deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: storage has wrong size: expected 0 got 512

I'd appreciate any help with this.
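
"storage has wrong size" during torch.load almost always indicates a corrupted or partially downloaded checkpoint rather than a code problem. A quick integrity check is to hash the downloaded files and compare the digests with the md5 sums published for the pre-trained models; a sketch (the paths assume the sample folder used above):

import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            digest.update(block)
    return digest.hexdigest()

for ckpt in ('./sample/best_prop_model.pt', './sample/best_cap_model.pt'):
    print(ckpt, md5sum(ckpt))
# if a digest does not match the published one, re-download that checkpoint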

Testing with i3d features of a sample video

Hi Vladimir,

I've extracted I3D features for my video set using your i3d_features project code. Now I want to generate captions for them using BMT; how can I do that here? Please guide me.

Your projects are my inspiration, thanks a lot for being my guide.

Error while extracting i3d features

I ran the following commands in my terminal:

conda env create -f conda_env_i3d.yml
conda activate i3d
python main.py \
    --feature_type i3d \
    --on_extraction save_numpy \
    --device_ids 0 \
    --extraction_fps 25 \
    --video_paths ../../test/pandemic.mp4 \
    --output_path ../../test/
(attached screenshot: nocuda)

I believe this error is caused by the fact that I don't have an NVIDIA GPU. If my assumption about the error is right, is there a way to use the CPU for feature extraction? Please help me out.

Multilingual Audio

Hey @v-iashin
Thanks for open-sourcing such awesome work!!!
Kudos to you on this and MDVC.
I was wondering: since my videos are not in English but I do require captions in English, how do I go about utilizing your work here? Is there a possibility to completely ignore the audio features and use just the visual features?

Missing ground truth files (./tmp/extracted_targets_for_{phase}.pkl)

Hi v-iashin,

Nice work!

I got an error message when trying to train the proposal model, and the error seems to come from a missing ground truth file:

FileNotFoundError: [Errno 2] No such file or directory: './tmp/extracted_targets_for_val_1.pkl'

Could I know where I can download the related files (e.g., ./tmp/extracted_targets_for_{phase}.pkl)?

Thanks!

License

Hi, may I know about the license?
We intend to use your code for one of our projects. Kindly mention the license so that we can understand whether this usage is allowed.

Where is the audio dataset in this repo?

Hi, thanks for sharing your great code!

I ran inference on my own videos successfully (using the pre-trained models),
but I don't know what the underlying datasets are.

1. I found the following datasets; can you check whether this is correct?
video: Kinetics (DeepMind)
audio: AudioSet (Google Research)

2. In this repo/data, I think there is no audio dataset.
The train/test csv and json files are for the video dataset (YouTube video crawling),
and they contain durations, timestamps, and sentences,
but I can't find an audio dataset.
What are they?

Thank you

The score of the results

When I use "best_cap_model.pt" and "prop_results_val_1_e17_maxprop100.json" which are provide in your code to generate captions, I find that the meteor score which is measured by the offical script is 7.6. And I want to know how to get the result reported in the paper which can be more than 8.

Appending captions to the dataset and training

Hello,

Thank you for responding to all issues. I have one question.

I ran inference on my own video, but the sentence accuracy is not so good.

So, I want to append more captions to the dataset and then train on it.

You told me in my previous issue that BMT's dataset is ActivityNet Captions;
BMT/data/train.json is that dataset, and train.csv etc. also exist.

So I tried to append captions to train.json and train.csv, like this
(screenshot: add_caption),

but it doesn't work:
"""
Contructing caption_iterator for "train" phase
Traceback (most recent call last):
File "main.py", line 186, in
main(cfg)
File "main.py", line 11, in main
train_cap(cfg)
File "/home/piai/share/BMT/scripts/train_captioning_module.py", line 30, in train_cap
train_dataset = ActivityNetCaptionsDataset(cfg, 'train', get_full_feat=False)
File "/home/piai/share/BMT/datasets/captioning_dataset.py", line 310, in init
self.train_vocab, self.caption_loader = caption_iterator(cfg, self.batch_size, self.phase)
File "/home/piai/share/BMT/datasets/captioning_dataset.py", line 40, in caption_iterator
CAPTION.build_vocab(dataset.caption, min_freq=cfg.min_freq_caps, vectors=cfg.word_emb_caps)
File "/home/piai/anaconda3/envs/bmt/lib/python3.7/site-packages/torchtext/data/field.py", line 262, in build_vocab
for x in data:
File "/home/piai/anaconda3/envs/bmt/lib/python3.7/site-packages/torchtext/data/dataset.py", line 154, in getattr
yield getattr(x, attr)
AttributeError: 'Example' object has no attribute 'caption'
"""


And I found something that looks wrong in train.csv and available_mp4.txt.

(screenshot: train.csv)
In this picture, there are captions in the start-time column.

(screenshot: available_mp4.txt)
In available_mp4, the mp4's address is v_dZ~~, but the correct address I found is v=dZ~~.

So, could you tell me how to append my own data?

Thank you

speech features

Hello, v-iashin. Thanks for your code! Can the speech features be added to this BMT project?

How to evaluate proposal generation module for specific epoch

Hello Iashin,
I have trained the proposal generation module for 5 epochs and got a new 'best_prop_model.pt' and 5 files 'prop_results_val_1_e*_maxprop100.json' for epochs 0 to 4. However, after replacing 'best_prop_model.pt' and 'prop_results_val_1_e0_maxprop100.json' with the new files and then running the evaluation, I got the same results as in the paper.
What should I do to get the results for the module that I have trained?

video dataset

Hello! I would like to ask if you could share the video dataset of this project with me; I would be very grateful if you could. My email address is [email protected]. Thank you very much!

RuntimeError: CUDA error: unspecified launch failure

When I run video captioning from a raw video, I encounter this error. I use Flask to put the feature extraction and caption generation processes together as a service. In order to release memory in time, I use multiprocessing when extracting the VGGish features, but after extracting the features, I encounter the following error before generating the caption.
The error is here.

Using vanilla Generator
initialization: xavier
Glove emb of the same size as d_model_caps
Pretrained caption path:
pretrained_models/best_cap_model.pt

  • Serving Flask app "demo" (lazy loading)
  • Environment: production
    WARNING: This is a development server. Do not use it in a production deployment.
    Use a production WSGI server instead.
  • Debug mode: off
    i3d extraction time : 122.41406
    i3d feature extraction is complete!
    2021-01-16 13:44:56.115960: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    2021-01-16 13:44:56.466762: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error
    2021-01-16 13:44:56.466895: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: localhost.localdomain
    2021-01-16 13:44:56.466918: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: localhost.localdomain
    2021-01-16 13:44:56.467077: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 390.25.0
    2021-01-16 13:44:56.467163: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 390.25.0
    2021-01-16 13:44:56.467185: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 390.25.0
    Vggish feature extraction completed!
    lens of the train_dataset: 1068
    train_dataset.pad_idx 1
    [2021-01-16 13:45:15,938] ERROR in app: Exception on /dense_video_caption [POST]
    Traceback (most recent call last):
    File "/u01/isi/wangjunyan/envs/py36/lib/python3.6/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
    File "/u01/isi/wangjunyan/envs/py36/lib/python3.6/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
    File "/u01/isi/wangjunyan/envs/py36/lib/python3.6/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
    File "/u01/isi/wangjunyan/envs/py36/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
    raise value
    File "/u01/isi/wangjunyan/envs/py36/lib/python3.6/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
    File "/u01/isi/wangjunyan/envs/py36/lib/python3.6/site-packages/flask/app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
    File "demo.py", line 418, in dense_video_caption
    prop_model, feature_paths, train_dataset.pad_idx, prop_cfg, args.device_id, args.duration_in_secs
    File "demo.py", line 230, in generate_proposals
    pad_feats_up_to=cfg.pad_feats_up_to
    File "demo.py", line 203, in load_features_from_npy
    stack_vggish = stack_vggish.to(torch.device(device)).unsqueeze(0)
    RuntimeError: CUDA error: unspecified launch failure

The code to generate captions is as follows:

@app.route('/dense_video_caption', methods=['POST'])
def dense_video_caption():
    starttime = time.time()
    if request.method != "POST":
        return error("Only POST requests are expected!")
    if "file" not in request.files:
        return error("No files found!")
    file = request.files['file']
    if not file:
        return error("No file found!")
    if file.filename == '':
        return error("No filename found!")
    if os.path.exists('upload'):
        subprocess.call('rm -rf upload', shell=True)
    subprocess.call('mkdir upload', shell=True)
    if os.path.exists(args.tmp_path):
        shutil.rmtree(args.tmp_path)
    filename = str(random.randint(0, 1000000)) + '.' + file.filename.rsplit('.', 1)[1]
    filename = os.path.join('upload', filename)
    file.save(filename)
    if not allowed_file(file.filename):  # not an mp4 file; convert it to mp4 with ffmpeg
        basepath = 'upload'
        try:
            filename = fileProcessing(filename, basepath)
        except Exception as e:
            return error("Cannot read video file!")

    args.video_path = filename
    extract_feat(extractor_i3d, filename)
    time1 = time.time()
    print("i3d extraction time :", time1 - starttime)
    print("i3d feature extraction is complete!")
    torch.cuda.empty_cache()

    p = Process(target=extract_feat, args=(extractor_vgg, filename))
    p.start()
    if p.is_alive():
        p.join()
    print("Vggish feature extraction completed!")

    filename_rgb = os.path.split(args.video_path)[-1].replace('.mp4', '_rgb.npy')
    filename_flow = os.path.split(args.video_path)[-1].replace('.mp4', '_flow.npy')
    filename_vggish = os.path.split(args.video_path)[-1].replace('.mp4', '_vggish.npy')
    # construct the paths to save the features
    args.rgb_features_path = os.path.join(args.output_path, filename_rgb)
    args.flow_features_path = os.path.join(args.output_path, filename_flow)
    args.vggish_features_path = os.path.join(args.output_path, filename_vggish)
    feature_paths = {
        'audio': args.vggish_features_path,
        'rgb': args.rgb_features_path,
        'flow': args.flow_features_path,
    }
    cap = cv2.VideoCapture(args.video_path)
    if cap.isOpened():
        rate = cap.get(5)
        FrameNumber = cap.get(7)
        args.duration_in_secs = FrameNumber / rate
    # Proposal
    print("lens of train_dataset: ", len(train_dataset))
    print("train_dataset.pad_idx", train_dataset.pad_idx)
    proposals = generate_proposals(
        prop_model, feature_paths, train_dataset.pad_idx, prop_cfg, args.device_id, args.duration_in_secs
    )
    # NMS if specified
    if args.nms_tiou_thresh is not None:
        proposals = non_max_suppresion(proposals.squeeze(), args.nms_tiou_thresh)
        proposals = proposals.unsqueeze(0)
    # Captions for each proposal
    captions = caption_proposals(
        cap_model, feature_paths, train_dataset, cap_cfg, args.device_id, proposals, args.duration_in_secs
    )
    result = {"result": captions, "time": time.time() - starttime}
    torch.cuda.empty_cache()
    return result
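
One likely culprit here is that CUDA is initialised in the parent process (by the I3D extraction) and the VGGish Process is then forked, which CUDA does not tolerate. A commonly used fix, sketched below under the assumption that extract_feat and its arguments are picklable, is to switch multiprocessing to the 'spawn' start method; alternatively, run the VGGish extraction sequentially in the main process.

import multiprocessing as mp

# Use 'spawn' so the child process gets a clean CUDA context instead of the
# forked (already initialised) one; must be set before any Process is started.
if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    # hypothetical, mirroring the service code above:
    # p = mp.Process(target=extract_feat, args=(extractor_vgg, filename))
    # p.start()
    # p.join()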

OSError: [Errno 30] Read-only file system: './captioning_results_learned_props_e26.json'

I got the following error after finishing the evaluation process:

0411021941: learned_props 1by1 26 @ 0: 100%|██████████| 6998/6998 [7:15:30<00:00, 3.73s/it]

Traceback (most recent call last):

File "/home/env/miniconda3/envs/tr_17/lib/python3.8/site-packages/spyder_kernels/py3compat.py", line 356, in compat_exec
exec(code, globals, locals)

File "/media/env/Data3/BMT_ok/main.py", line 201, in
main(cfg)

File "/media/env/Data3/BMT_ok/main.py", line 15, in main
eval_on_learned_props(cfg)

File "/media/env/Data3/BMT_ok/scripts/eval_on_learned_props.py", line 128, in eval_on_learned_props
val_metrics_pred_prop = validation_1by1_loop(

File "/media/env/Data3/BMT_ok/epoch_loops/captioning_epoch_loops.py", line 271, in validation_1by1_loop
with open(submission_path, 'w') as outf:

OSError: [Errno 30] Read-only file system: './captioning_results_learned_props_e26.json'
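
The traceback itself points at the working directory: the script writes ./captioning_results_learned_props_e26.json into the current directory, and that location is mounted read-only in this setup. A tiny check before launching the multi-hour evaluation (a sketch; the output directory is an assumption based on the path in the error):

import os

out_dir = '.'  # the evaluation writes captioning_results_*.json into the CWD
print('cwd:', os.getcwd(), '| writable:', os.access(out_dir, os.W_OK))
# if not writable, launch the evaluation from a writable directory
# (or remount the volume read-write) so the results file can be saved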

Failed to reproduce similar model metrics

I tried to reproduce a similar model by training on the same dataset with the same parameters you provided in the repository, but no matter what I do or how long I train the model, I cannot reach similar results or metrics... Can you please tell me why you think this happens, and for how many epochs you trained your model?

Thanks
