

MDVC

Multi-modal Dense Video Captioning

Project Page | Proceedings | ArXiv | Presentation (Mirror)

This is a PyTorch implementation of our paper Multi-modal Dense Video Captioning (CVPR Workshops 2020).

The publication appears in the CVPR Workshops conference proceedings. Please use this BibTeX citation:

@InProceedings{MDVC_Iashin_2020,
  author = {Iashin, Vladimir and Rahtu, Esa},
  title = {Multi-Modal Dense Video Captioning},
  booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  pages = {958--959},
  year = {2020}
}

If you found this work interesting, check out our latest paper, where we propose a novel architecture for the dense video captioning task called Bi-modal Transformer with Proposal Generator.

Usage

The code was tested on Ubuntu 16.04/18.04 with a single NVIDIA 1080 Ti or 2080 Ti GPU. If you plan to use it with other software or hardware, you might need to adapt the conda environment files or even the code.

Clone this repository. Mind the --recursive flag to make sure the submodules (evaluation scripts for Python 3) are also cloned.

git clone --recursive https://github.com/v-iashin/MDVC.git

Download the I3D (17 GB) and VGGish (1 GB) features and put them in the ./data/ folder (the speech segments are already there). You may use curl -O <link> to download the features.

# MD5 Hash
a661cfe3535c0d832ec35dd35a4fdc42  sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5
54398be59d45b27397a60f186ec25624  sub_activitynet_v1-3.vggish.hdf5
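
If you want to verify the downloads, here is a minimal checksum sketch using Python's hashlib (paths assume the files sit in ./data/; adjust if you stored them elsewhere):

import hashlib

def md5(path, chunk_size=1 << 20):
    # stream the file in chunks so the 17 GB file doesn't have to fit in memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            h.update(block)
    return h.hexdigest()

print(md5('./data/sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5'))
print(md5('./data/sub_activitynet_v1-3.vggish.hdf5'))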

Set up the conda environment. The requirements are listed in conda_env.yml.

# it will create new conda environment called 'mdvc' on your machine
conda env create -f conda_env.yml
conda activate mdvc
# install spacy language model. Make sure you activated the conda environment
python -m spacy download en
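
A quick sanity check of the environment (a minimal sketch; run it inside the activated mdvc environment):

import spacy
import torch

print(torch.__version__, torch.cuda.is_available())  # a CUDA-capable GPU is expected
nlp = spacy.load('en')  # fails if the spaCy model above was not downloaded
print([token.text for token in nlp('sanity check')])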

Train and Predict

Run the training and prediction script. It will first train the captioning model and then evaluate the predictions of the best model in the learned-proposal setting. A full run (~50 epochs) takes about 24 hours on a 2080 Ti GPU. Note that performance is expected to peak after ~30 epochs.

# make sure to activate environment: conda activate mdvc
# the cuda:1 device will be used for the run
python main.py --device_ids 1

The script keeps the log files, including the TensorBoard log, under the ./log directory by default. You may specify another path with the --log_dir argument. Also, if you stored the downloaded .hdf5 files in a directory other than ./data, make sure to point to them with the --video_features_path and --audio_features_path arguments.

You may also download the pre-trained model here (~2 GB).

# MD5 Hash
55cda5bac1cf2b7a803da24fca60898b  best_model.pt
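
To inspect the checkpoint without retraining, a hedged sketch (the exact contents of best_model.pt are not documented here, so treat the stored key names as something to check interactively):

import torch

ckpt = torch.load('best_model.pt', map_location='cpu')  # ~2 GB, may take a while
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # see what is stored (weights, config, epoch, etc.)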

Evaluation Scripts and Results

If you want to skip the training procedure, you may replicate the main results of the paper using the prediction files in ./results and the official evaluation script.

  1. To evaluate the performance in the learned-proposal setup, run the official evaluation script on ./results/results_val_learned_proposals_e30.json. Our final result is 6.8009.
  2. To evaluate the performance on ground-truth segments, run the script on each validation part (./results/results_val_*_e30.json) against the corresponding ground-truth file (use the -r argument of the script to specify each of them). Once both values are obtained, average them to verify the final result: we got 9.9407 and 10.2478 on the val_1 and val_2 parts, respectively, so the average is 10.094 (see the quick check after this list).
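
A quick check of the averaging in step 2 (plain arithmetic, not the evaluation script itself):

val_1, val_2 = 9.9407, 10.2478
print(round((val_1 + val_2) / 2, 3))  # 10.094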

As we mentioned in the paper, we didn't have access to the full dataset, as ActivityNet Captions is distributed as a list of links to YouTube videos. Consequently, many videos (~8.8%) were no longer available at the time we downloaded the dataset. In addition, some videos didn't have any speech. We filtered out such videos from the validation files and reported the results as 'no missings' in the paper. We provide these filtered ground-truth files in ./data.

Raw Data & Details on Feature Extraction

If you are feeling brave, you may want to extract the features on your own. Check out our script for extracting the I3D and VGGish features from a set of videos: video_features on GitHub (make sure to check out commit 6190f3d7db6612771b910cf64e274aedba8f1e1b). Also see #7 for more details on the configuration. We also provide the script used to process the timestamps: ./utils/parse_subs.py.

Misc.

We additionally provide the following files (a minimal loading sketch is given after the list):

  • the file with subtitles with original timestamps in ./data/asr_en.csv
  • the file with video categories in ./data/vid2cat.json
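
A minimal loading sketch for these files (the exact column layout of asr_en.csv is not documented here, so the code only inspects it):

import json

import pandas as pd

subs = pd.read_csv('./data/asr_en.csv')  # pass sep='\t' if the file turns out to be tab-separated
print(subs.columns.tolist(), len(subs))

with open('./data/vid2cat.json') as f:
    vid2cat = json.load(f)  # assumed to map video ids to YouTube categories
print(len(vid2cat))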

Acknowledgments

Funding for this research was provided by the Academy of Finland projects 327910 & 324346. The authors acknowledge CSC — IT Center for Science, Finland, for providing the computational resources used in our experiments.

Media Coverage


mdvc's Issues

The utilization rate of GPU is low

I found that your dataloader is a bit unusual: its batch_size must be 1, and the video and audio feature vectors are fetched according to the idx (a batch) of the caption_loader_iter. So I cannot set num_workers > 0, and I suspect this is the cause of my problem. How do you deal with this during training?
(attached screenshot: gpu_ulti, showing low GPU utilization)

Error: video_stack_rgb and video_stack_flow are both None

video_stack_rgb = feat_h5_video.get(f'{video_id}/i3d_features/rgb')
video_stack_flow = feat_h5_video.get(f'{video_id}/i3d_features/flow')

Both video_stack_rgb and video_stack_flow come back as None for:
video_id = 'v_QOlSCBRmfWY'
h5 = 'sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5'

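A minimal way to check whether the video id is present in the feature file at all (a sketch using h5py, not the repository code):

import h5py

with h5py.File('sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5', 'r') as f:
    print('v_QOlSCBRmfWY' in f)  # False if the video has no I3D features at all
    print(list(f.keys())[:5])    # peek at a few stored video ids
    if 'v_QOlSCBRmfWY' in f:
        print(list(f['v_QOlSCBRmfWY/i3d_features'].keys()))  # expect 'rgb' and 'flow'
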
Bi-SST implementation

Hello again!

From the paper:

Firstly, we obtain the temporal event locations. For this task, we employ the Bidirectional Single-Stream Temporal action proposals network (Bi-SST) proposed in [48]

Per [48], a Bi-SST goes through a forward and backward pass and then fuses the results with a fusion operation (typically multiplication). In "models/transformer.py", can you please point me to where this is implemented? Thanks!

ResolvePackageNotFound Issue

When I try to use your yml file to create the virtual environment, I get the following error message.

ResolvePackageNotFound:

  • cffi==1.14.0=py37h2e261b9_0
  • msgpack-python==0.5.6=py37h6bb024c_1
  • numpy-base==1.15.4=py37hde5b4d6_0
  • libgfortran-ng==7.3.0=hdf63c60_0
  • cymem==1.31.2=py37h6bb024c_0
  • tk==8.6.8=hbc83047_0
  • libedit==3.1.20181209=hc058e9b_0
  • mkl_random==1.1.0=py37hd6b4f25_0
  • ninja==1.9.0=py37hfd86e86_0
  • python==3.7.7=hcf32534_0_cpython
  • openssl==1.1.1g=h516909a_0
  • zlib==1.2.11=h7b6447c_3
  • libtiff==4.1.0=h2733197_0
  • mkl-service==2.3.0=py37he904b0f_0
  • regex==2018.07.11=py37h14c3975_0
  • readline==8.0=h7b6447c_0
  • libpng==1.6.37=hbc83047_0
  • cytoolz==0.9.0.1=py37h14c3975_1
  • hdf5==1.10.2=hba1933b_1
  • preshed==1.0.1=py37he6710b0_0
  • sqlite==3.31.1=h62c20be_1
  • libgcc-ng==9.1.0=hdf63c60_0
  • ujson==2.0.3=py37he6710b0_0
  • xz==5.2.5=h7b6447c_0
  • openjdk==11.0.1=h516909a_1016
  • freetype==2.9.1=h8a8886c_1
  • ld_impl_linux-64==2.33.1=h53a641e_7
  • h5py==2.8.0=py37h989c5e5_3
  • mkl_fft==1.0.15=py37ha843d7b_0
  • pytorch==1.2.0=py3.7_cuda10.0.130_cudnn7.6.2_0
  • numpy==1.15.4=py37h7e9f1db_0
  • libstdcxx-ng==9.1.0=hdf63c60_0
  • protobuf==3.11.4=py37h3340039_1
  • spacy==2.0.12=py37h962f231_0
  • libffi==3.2.1=hd88cf55_4
  • libprotobuf==3.11.4=h8b12597_0
  • thinc==6.10.3=py37h962f231_0
  • ncurses==6.2=he6710b0_1
  • cryptography==2.8=py37h1ba5d50_0
  • murmurhash==0.28.0=py37hf484d3e_0
  • pillow==7.1.2=py37hb39fc2d_0
  • pandas==0.24.2=py37he6710b0_0
  • jpeg==9b=h024ee3a_2
  • wrapt==1.10.11=py37h14c3975_2
  • zstd==1.3.7=h0b5b093_0
  • torchvision==0.3.0=py37_cu10.0.130_1

Can you please tell me how to handle this issue? Thank you!

Order of training the captioning module v/s the proposal module (and whether training is E2E?)

Hi Vladimir,

First, thanks for the great codebase - everything is neatly organized in the source files - a nice deviation compared to what AI codebases from papers usually look like :)

Some questions:

Train and Predict
Run the training and prediction script. It will, first, train the captioning model and, then, evaluate the predictions of the best model in the learned proposal setting.

From your comments on training, it is clear that the captioning module is trained first (on GT proposals?). However, it is not very clear when the proposal module is trained. Is the training end-to-end, as in Zhou et al. [59], where both modules are trained in unison (so that the captioning module can influence the event proposal mechanism)? Can you explain this sequence clearly (maybe, for everyone's sake, by updating the readme)? Thanks!

RuntimeError: CUDA out of memory.

(mdvc) root@ever:~/MDVC# python main.py --device_ids 0
log_path: ./log/0606173845
model_checkpoint_path: ./log/0606173845
Preparing dataset for train
Preparing dataset for val_1
Preparing dataset for val_2
Preparing dataset for val_1
using SubsAudioVideoGeneratorConcatLinearDoutLinear
initialization: xavier
Param Num: 178749320
17:42:32 train (0): 0%| | 1/1221 [00:01<39:18, 1.93s/it]
Traceback (most recent call last):
  File "main.py", line 573, in <module>
    main(cfg)
  File "main.py", line 276, in main
    cfg.modality, cfg.use_categories
  File "/root/MDVC/epoch_loop/run_epoch.py", line 308, in training_loop
    pred = model(feature_stacks, caption_idx, masks)
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/MDVC/model/transformer.py", line 328, in forward
    memory_video = self.encoder_video(src_video, src_mask)
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/MDVC/model/transformer.py", line 202, in forward
    x = layer(x, src_mask)
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/MDVC/model/transformer.py", line 189, in forward
    x = self.res_layers[0](x, sublayer0)
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/MDVC/model/transformer.py", line 152, in forward
    res = sublayer(res)
  File "/root/MDVC/model/transformer.py", line 186, in <lambda>
    sublayer0 = lambda x: self.self_att(x, x, x, src_mask)
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/MDVC/model/transformer.py", line 137, in forward
    att = attention(Q, K, V, mask)  # (B, H, seq_len, d_k)
  File "/root/MDVC/model/transformer.py", line 101, in attention
    sm_input = sm_input.masked_fill(mask == 0, -float('inf'))
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 3.94 GiB total capacity; 3.43 GiB already allocated; 12.94 MiB free; 71.07 MiB cached)

I3D Convolutions Script + Input Data

Hi Vladimir,

Noticed in the MDVC codebase that you load the I3D CONV features from "./data/sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5"

Some questions:
(i) Do you have a script that generates these features from raw data?
(ii) What input data did you run the I3D model over? I ask because your I3D features filename suggests the features were extracted at 25 FPS, which implies that you manually sampled the ActivityNet Captions videos at 25 FPS, since, unfortunately, the official ActivityNet website only offers frames sampled at 5 FPS (http://activity-net.org/challenges/2020/tasks/anet_captioning.html).
(iii) Do you have a link for the sampled frames?

Thanks!
Aman

About text

caption_idx = caption_data.caption
caption_idx, caption_idx_y = caption_idx[:, :-1], caption_idx[:, 1:]
Excuse me, why do you want to remove the first token and the last token in the second line?
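
This looks like the standard teacher-forcing shift used when training sequence models; a toy sketch (the token ids below are made up, not the repository vocabulary):

import torch

caption_idx = torch.tensor([[1, 5, 7, 9, 2]])  # e.g. <s> w1 w2 w3 </s>
decoder_input = caption_idx[:, :-1]  # <s> w1 w2 w3  (drop the last token)
target = caption_idx[:, 1:]          # w1 w2 w3 </s> (drop the first token)
# at each position the model is trained to predict the next token in the target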

Requesting tensorboard log file for best model

Hi,

First of all, great work on the codebase and paper; the model architecture is explained very well.
I am working on improvements with your work as a baseline and am training on the dataset myself, with a little refactoring of the code to fit my needs. Could you please share the TensorBoard log file for the best_model.pt you have already provided in the repo? I am particularly interested in the validation-set results epoch by epoch. Currently, after a certain number of epochs, the prediction for 'videos_to_monitor' comes out as '', which is interesting, and I would like to see how training progressed for your best model.

ASR

Hello! I would like to ask how you obtained your .srt subtitle files. Were they downloaded directly from YouTube, or obtained by calling Google's API?

videoCategoriesMetaUS.json

parser.add_argument(
    '--video_categories_meta_path', type=str, default='./data/videoCategoriesMetaUS.json',
    help='Path to the categories meta from Youtube API: \
    https://developers.google.com/youtube/v3/docs/videoCategories/list'
)

Hello! May I ask where I can find the above file?

How to extract vggish features having overlapping?

Hello, if I want to extract audio features where each feature is 0.96 s long with a 0.32 s overlap, can I just set the parameter EXAMPLE_HOP_SECONDS = 0.32 in ./models/vggish/vggish_src/vggish_params.py without any other changes?
Are there any other changes required?
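
For intuition, the hop length mainly controls how many 0.96 s examples are produced per clip; a rough back-of-the-envelope sketch assuming the standard VGGish framing parameters (window 0.96 s, hop as set; the clip duration is hypothetical):

window_s = 0.96     # VGGish EXAMPLE_WINDOW_SECONDS
hop_s = 0.32        # proposed EXAMPLE_HOP_SECONDS
duration_s = 10.0   # hypothetical clip length
num_examples = int((duration_s - window_s) / hop_s) + 1
print(num_examples)  # ~29 overlapping examples instead of ~10 non-overlapping ones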

503 service unavailable: cannot open a3s.fi links

Hi,

Since this Monday, the bucket storage that we use for features and models (a3s) has been down. Usually it takes several hours to recover, but this time it seems more serious. The problem is claimed to be 'temporary', but we are not sure when it will recover, so for now please use this issue to let us know if you need any of the files ASAP; we will share them via other means.

Sorry for the trouble and good luck with your research!

Filtered validation files

"As we mentioned in the paper, we didn't have access to the full dataset as ActivityNet Captions is distributed as the list of links to YouTube video. Consequently, many videos (~8.8 %) were no longer available at the time when we were downloading the dataset. In addition, some videos didn't have any speech. We filtered out such videos from the validation files and reported the results as no missings in the paper. Please create an issue if you would like us to share these filtered validation files."

Can you please share the filtered validation files? Thanks!

ASR

Hello! I want to ask which API was used to obtain the .srt subtitles?

Alignment key for the A/V features in the .npy/.hdf5 files

Hi Vladimir,

Long time no talk :) I was wondering if you could share the code that converted the .npy features (from your VGGish and I3D feature extractors), which you made available to me mid last year, into the .hdf5 files referenced in MDVC Readme: Usage. In particular, I am interested in understanding how you "align" the audio and video features (based on the code below).

Questions:

  1. Are the audio and video features aligned by time in the hdf5 file? Is that what T_audio/T_video stands for?
  2. Is the D_audio/D_video simply the feature dimension?

def load_multimodal_features_from_h5(feat_h5_video, feat_h5_audio, feature_names_list, 
                                     video_id, start, end, duration, get_full_feat=False, cs=True):
    supported_feature_names = {'i3d_features', 'c3d_features', 'vggish_features'}
    assert isinstance(feature_names_list, list)
    assert len(feature_names_list) > 0
    assert set(feature_names_list).issubset(supported_feature_names)

    if 'vggish_features' in feature_names_list:
        audio_stack = feat_h5_audio.get(f'{video_id}/vggish_features')

        # some videos doesn't have audio
        if audio_stack is None:
            print(f'audio_stack is None @ {video_id}')
            audio_stack = torch.empty((0, 128)).float()

        T_audio, D_audio = audio_stack.shape

    if 'i3d_features' in feature_names_list:
        video_stack_rgb = feat_h5_video.get(f'{video_id}/i3d_features/rgb')
        video_stack_flow = feat_h5_video.get(f'{video_id}/i3d_features/flow')
        
        assert video_stack_rgb.shape == video_stack_flow.shape
        T_video, D_video = video_stack_rgb.shape

        if T_video > T_audio:
            video_stack_rgb = video_stack_rgb[:T_audio, :]
            video_stack_flow = video_stack_flow[:T_audio, :]
            T = T_audio
        elif T_video < T_audio:
            audio_stack = audio_stack[:T_video, :]
            T = T_video
        else:
            # or T = T_audio
            T = T_video
        
        # at this point they should be the same
        assert audio_stack.shape[0] == video_stack_rgb.shape[0]

Thanks again for your help!

KeyError in validation_next_word_loop when running main.py

Hi Vladimir! Hope you are doing well.

I was running your main.py script. There is the following error saying KeyError. Am I missing something? Thanks a lot!

Traceback (most recent call last):
  File "main.py", line 572, in <module>
    main(cfg)
  File "main.py", line 281, in main
    cfg.use_categories
  File "/home/tuf72841/MDVC/epoch_loop/run_epoch.py", line 336, in validation_next_word_loop
    for i, batch in enumerate(tqdm(loader, desc=f'{time} {phase} ({epoch})')):
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/tqdm/std.py", line 1127, in __iter__
    for obj in iterable:
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 346, in __next__
    data = self.dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/tuf72841/MDVC/dataset/dataset.py", line 443, in __getitem__
    caption_data = next(self.caption_loader_iter)
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in numericalize
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
KeyError: 'stairclimber'

ResolvePackageNotFound:

PS D:\Python Pro\Other Project\MDVC> conda env create -f conda_env.yml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:

  • h5py==2.8.0=py37h989c5e5_3
  • openjdk==11.0.1=h516909a_1016
  • protobuf==3.11.4=py37h3340039_1
  • spacy==2.0.12=py37h962f231_0
  • cffi==1.14.0=py37h2e261b9_0
  • thinc==6.10.3=py37h962f231_0
  • libedit==3.1.20181209=hc058e9b_0
  • readline==8.0=h7b6447c_0
  • regex==2018.07.11=py37h14c3975_0
  • pytorch==1.2.0=py3.7_cuda10.0.130_cudnn7.6.2_0
  • libprotobuf==3.11.4=h8b12597_0
  • mkl_fft==1.0.15=py37ha843d7b_0
  • libstdcxx-ng==9.1.0=hdf63c60_0
  • pandas==0.24.2=py37he6710b0_0
  • torchvision==0.3.0=py37_cu10.0.130_1
  • pillow==7.1.2=py37hb39fc2d_0
  • ujson==2.0.3=py37he6710b0_0
  • zstd==1.3.7=h0b5b093_0
  • libpng==1.6.37=hbc83047_0
  • cryptography==2.8=py37h1ba5d50_0
  • libgfortran-ng==7.3.0=hdf63c60_0
  • cymem==1.31.2=py37h6bb024c_0
  • freetype==2.9.1=h8a8886c_1
  • zlib==1.2.11=h7b6447c_3
  • ninja==1.9.0=py37hfd86e86_0
  • cytoolz==0.9.0.1=py37h14c3975_1
  • sqlite==3.31.1=h62c20be_1
  • ld_impl_linux-64==2.33.1=h53a641e_7
  • hdf5==1.10.2=hba1933b_1
  • libtiff==4.1.0=h2733197_0
  • numpy==1.15.4=py37h7e9f1db_0
  • mkl-service==2.3.0=py37he904b0f_0
  • msgpack-python==0.5.6=py37h6bb024c_1
  • ncurses==6.2=he6710b0_1
  • tk==8.6.8=hbc83047_0
  • wrapt==1.10.11=py37h14c3975_2
  • libffi==3.2.1=hd88cf55_4
  • xz==5.2.5=h7b6447c_0
  • jpeg==9b=h024ee3a_2
  • openssl==1.1.1g=h516909a_0
  • mkl_random==1.1.0=py37hd6b4f25_0
  • murmurhash==0.28.0=py37hf484d3e_0
  • python==3.7.7=hcf32534_0_cpython
  • libgcc-ng==9.1.0=hdf63c60_0
  • numpy-base==1.15.4=py37hde5b4d6_0
  • preshed==1.0.1=py37he6710b0_0

I have two issues. 1. I want to run this without a GPU (I don't have a GPU in Ubuntu)

(mdvc) tarun@tarun-VirtualBox:~/Downloads/MDVC$ python main.py --device_ids 1
log_path: ./log/0601100925
model_checkpoint_path: ./log/0601100925
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp line=50 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
  File "main.py", line 572, in <module>
    main(cfg)
  File "main.py", line 177, in main
    torch.cuda.set_device(cfg.device_ids[0])
  File "/home/tarun/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/cuda/__init__.py", line 281, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/tarun/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/cuda/__init__.py", line 179, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp:50

Sharing I3D and VGGish features

Hi Vladimir,

I am working on a project that cites MDVC as a baseline and adds some elements to it.

Would you mind if I use your download link for the I3D and VGGish features in my project's readme (which would be on Github)?

Let me know!

Thanks,
Aman

Dense Video Captioning on raw input videos

This seems like nice work. I wanted to test it on custom input videos. It would be very helpful if you could provide a script that generates video captions for a raw input video.
