

MDVC

Multi-modal Dense Video Captioning

Project Page | Proceedings | ArXiv | Presentation (Mirror)

This is a PyTorch implementation of our paper Multi-modal Dense Video Captioning (CVPR Workshops 2020).

The publication appears in the CVPR Workshops conference proceedings. Please use this BibTeX citation:

@InProceedings{MDVC_Iashin_2020,
  author = {Iashin, Vladimir and Rahtu, Esa},
  title = {Multi-Modal Dense Video Captioning},
  booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  pages = {958--959},
  year = {2020}
}

If you found this work interesting, check out our latest paper, where we propose a novel architecture for the dense video captioning task called Bi-modal Transformer with Proposal Generator.

Usage

The code was tested on Ubuntu 16.04/18.04 with a single NVIDIA 1080 Ti or 2080 Ti GPU. If you plan to use it with other software or hardware, you might need to adapt the conda environment files or even the code.

Clone this repository. Mind the --recursive flag to make sure the submodules (evaluation scripts for Python 3) are also cloned.

git clone --recursive https://github.com/v-iashin/MDVC.git

Download the I3D (17 GB) and VGGish (1 GB) features and put them in the ./data/ folder (the speech segments are already there). You may use curl -O <link> to download the features.

# MD5 Hash
a661cfe3535c0d832ec35dd35a4fdc42  sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5
54398be59d45b27397a60f186ec25624  sub_activitynet_v1-3.vggish.hdf5
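
If you want to verify the downloads, here is a minimal checksum sketch using Python's hashlib (paths assume the files sit in ./data/; adjust if you stored them elsewhere):

import hashlib

def md5(path, chunk_size=1 << 20):
    # stream the file in chunks so the 17 GB file doesn't have to fit in memory
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            h.update(block)
    return h.hexdigest()

print(md5('./data/sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5'))
print(md5('./data/sub_activitynet_v1-3.vggish.hdf5'))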

Set up the conda environment. The requirements are listed in conda_env.yml.

# it will create new conda environment called 'mdvc' on your machine
conda env create -f conda_env.yml
conda activate mdvc
# install spacy language model. Make sure you activated the conda environment
python -m spacy download en
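
A quick sanity check of the environment (a minimal sketch; run it inside the activated mdvc environment):

import spacy
import torch

print(torch.__version__, torch.cuda.is_available())  # a CUDA-capable GPU is expected
nlp = spacy.load('en')  # fails if the spaCy model above was not downloaded
print([token.text for token in nlp('sanity check')])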

Train and Predict

Run the training and prediction script. It will first train the captioning model and then evaluate the predictions of the best model in the learned-proposal setting. A full run (~50 epochs) takes about 24 hours on a 2080 Ti GPU. Note that performance is expected to peak after ~30 epochs.

# make sure to activate environment: conda activate mdvc
# the cuda:1 device will be used for the run
python main.py --device_ids 1

The script keeps the log files, including the TensorBoard log, under the ./log directory by default. You may specify another path with the --log_dir argument. Also, if you stored the downloaded .hdf5 files in a directory other than ./data, make sure to point to them with the --video_features_path and --audio_features_path arguments.

You may also download the pre-trained model here (~2 GB).

# MD5 Hash
55cda5bac1cf2b7a803da24fca60898b  best_model.pt
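
To inspect the checkpoint without retraining, a hedged sketch (the exact contents of best_model.pt are not documented here, so treat the stored key names as something to check interactively):

import torch

ckpt = torch.load('best_model.pt', map_location='cpu')  # ~2 GB, may take a while
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # see what is stored (weights, config, epoch, etc.)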

Evaluation Scripts and Results

If you want to skip the training procedure, you may replicate the main results of the paper using the prediction files in ./results and the official evaluation script.

  1. To evaluate the performance in the learned-proposal setup, run the official evaluation script on ./results/results_val_learned_proposals_e30.json. Our final result is 6.8009.
  2. To evaluate the performance on ground-truth segments, run the script on each validation part (./results/results_val_*_e30.json) against the corresponding ground-truth file (use the -r argument of the script to specify each of them). Once both values are obtained, average them to verify the final result: we got 9.9407 and 10.2478 on the val_1 and val_2 parts, respectively, so the average is 10.094 (see the quick check after this list).
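
A quick check of the averaging in step 2 (plain arithmetic, not the evaluation script itself):

val_1, val_2 = 9.9407, 10.2478
print(round((val_1 + val_2) / 2, 3))  # 10.094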

As we mentioned in the paper, we didn't have access to the full dataset, as ActivityNet Captions is distributed as a list of links to YouTube videos. Consequently, many videos (~8.8%) were no longer available at the time we downloaded the dataset. In addition, some videos didn't have any speech. We filtered out such videos from the validation files and reported the results as 'no missings' in the paper. We provide these filtered ground-truth files in ./data.

Raw Data & Details on Feature Extraction

If you are feeling brave, you may want to extract the features on your own. Check out our script for extracting the I3D and VGGish features from a set of videos: video_features on GitHub (make sure to check out commit 6190f3d7db6612771b910cf64e274aedba8f1e1b). Also see #7 for more details on the configuration. We also provide the script used to process the timestamps: ./utils/parse_subs.py.

Misc.

We additionally provide the following files (a minimal loading sketch is given after the list):

  • the file with subtitles with original timestamps in ./data/asr_en.csv
  • the file with video categories in ./data/vid2cat.json
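
A minimal loading sketch for these files (the exact column layout of asr_en.csv is not documented here, so the code only inspects it):

import json

import pandas as pd

subs = pd.read_csv('./data/asr_en.csv')  # pass sep='\t' if the file turns out to be tab-separated
print(subs.columns.tolist(), len(subs))

with open('./data/vid2cat.json') as f:
    vid2cat = json.load(f)  # assumed to map video ids to YouTube categories
print(len(vid2cat))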

Acknowledgments

Funding for this research was provided by the Academy of Finland projects 327910 & 324346. The authors acknowledge CSC — IT Center for Science, Finland, for providing the computational resources used in our experiments.

Media Coverage


mdvc's Issues

The utilization rate of GPU is low

I found that your dataloader is a bit unusual: its batch_size must be 1, and the video and audio feature vectors are fetched according to the idx (a batch) of the caption_loader_iter. So I cannot set num_workers > 0, and I suspect this is the cause of my problem. How do you deal with this during training?
(attached screenshot: gpu_ulti, showing low GPU utilization)

Error: video_stack_rgb and video_stack_flow are both None

video_stack_rgb = feat_h5_video.get(f'{video_id}/i3d_features/rgb')
video_stack_flow = feat_h5_video.get(f'{video_id}/i3d_features/flow')

Both video_stack_rgb and video_stack_flow come back as None for:
video_id = 'v_QOlSCBRmfWY'
h5 = 'sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5'

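A minimal way to check whether the video id is present in the feature file at all (a sketch using h5py, not the repository code):

import h5py

with h5py.File('sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5', 'r') as f:
    print('v_QOlSCBRmfWY' in f)  # False if the video has no I3D features at all
    print(list(f.keys())[:5])    # peek at a few stored video ids
    if 'v_QOlSCBRmfWY' in f:
        print(list(f['v_QOlSCBRmfWY/i3d_features'].keys()))  # expect 'rgb' and 'flow'
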
Bi-SST implementation

Hello again!

From the paper:

Firstly, we obtain the temporal event locations. For this task, we employ the Bidirectional Single-Stream Temporal action proposals network (Bi-SST) proposed in [48]

Per [48], a Bi-SST goes through a forward and backward pass and then fuses the results with a fusion operation (typically multiplication). In "models/transformer.py", can you please point me to where this is implemented? Thanks!

ResolvePackageNotFound Issue

When I try to use your yml file to create the virtual environment, I get the following error message.

ResolvePackageNotFound:

  • cffi==1.14.0=py37h2e261b9_0
  • msgpack-python==0.5.6=py37h6bb024c_1
  • numpy-base==1.15.4=py37hde5b4d6_0
  • libgfortran-ng==7.3.0=hdf63c60_0
  • cymem==1.31.2=py37h6bb024c_0
  • tk==8.6.8=hbc83047_0
  • libedit==3.1.20181209=hc058e9b_0
  • mkl_random==1.1.0=py37hd6b4f25_0
  • ninja==1.9.0=py37hfd86e86_0
  • python==3.7.7=hcf32534_0_cpython
  • openssl==1.1.1g=h516909a_0
  • zlib==1.2.11=h7b6447c_3
  • libtiff==4.1.0=h2733197_0
  • mkl-service==2.3.0=py37he904b0f_0
  • regex==2018.07.11=py37h14c3975_0
  • readline==8.0=h7b6447c_0
  • libpng==1.6.37=hbc83047_0
  • cytoolz==0.9.0.1=py37h14c3975_1
  • hdf5==1.10.2=hba1933b_1
  • preshed==1.0.1=py37he6710b0_0
  • sqlite==3.31.1=h62c20be_1
  • libgcc-ng==9.1.0=hdf63c60_0
  • ujson==2.0.3=py37he6710b0_0
  • xz==5.2.5=h7b6447c_0
  • openjdk==11.0.1=h516909a_1016
  • freetype==2.9.1=h8a8886c_1
  • ld_impl_linux-64==2.33.1=h53a641e_7
  • h5py==2.8.0=py37h989c5e5_3
  • mkl_fft==1.0.15=py37ha843d7b_0
  • pytorch==1.2.0=py3.7_cuda10.0.130_cudnn7.6.2_0
  • numpy==1.15.4=py37h7e9f1db_0
  • libstdcxx-ng==9.1.0=hdf63c60_0
  • protobuf==3.11.4=py37h3340039_1
  • spacy==2.0.12=py37h962f231_0
  • libffi==3.2.1=hd88cf55_4
  • libprotobuf==3.11.4=h8b12597_0
  • thinc==6.10.3=py37h962f231_0
  • ncurses==6.2=he6710b0_1
  • cryptography==2.8=py37h1ba5d50_0
  • murmurhash==0.28.0=py37hf484d3e_0
  • pillow==7.1.2=py37hb39fc2d_0
  • pandas==0.24.2=py37he6710b0_0
  • jpeg==9b=h024ee3a_2
  • wrapt==1.10.11=py37h14c3975_2
  • zstd==1.3.7=h0b5b093_0
  • torchvision==0.3.0=py37_cu10.0.130_1

Can you please tell me how to handle this issue? Thank you!

Order of training the captioning module v/s the proposal module (and whether training is E2E?)

Hi Vladimir,

First, thanks for the great codebase - everything is neatly organized in the source files - a nice deviation compared to what AI codebases from papers usually look like :)

Some questions:

Train and Predict
Run the training and prediction script. It will, first, train the captioning model and, then, evaluate the predictions of the best model in the learned proposal setting.

From your comments on training, it is clear that the captioning module is trained first (on GT proposals?). However, it is not very clear when the proposal module is trained. Is the training end-to-end, as in Zhou et al. [59], where both modules are trained in unison (so that the captioning module can influence the event proposal mechanism)? Can you explain this sequence clearly (maybe, for everyone's sake, by updating the readme)? Thanks!

RuntimeError: CUDA out of memory.

(mdvc) root@ever:~/MDVC# python main.py --device_ids 0
log_path: ./log/0606173845
model_checkpoint_path: ./log/0606173845
Preparing dataset for train
Preparing dataset for val_1
Preparing dataset for val_2
Preparing dataset for val_1
using SubsAudioVideoGeneratorConcatLinearDoutLinear
initialization: xavier
Param Num: 178749320
17:42:32 train (0): 0%| | 1/1221 [00:01<39:18, 1.93s/it]
Traceback (most recent call last):
  File "main.py", line 573, in <module>
    main(cfg)
  File "main.py", line 276, in main
    cfg.modality, cfg.use_categories
  File "/root/MDVC/epoch_loop/run_epoch.py", line 308, in training_loop
    pred = model(feature_stacks, caption_idx, masks)
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/MDVC/model/transformer.py", line 328, in forward
    memory_video = self.encoder_video(src_video, src_mask)
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/MDVC/model/transformer.py", line 202, in forward
    x = layer(x, src_mask)
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/MDVC/model/transformer.py", line 189, in forward
    x = self.res_layers[0](x, sublayer0)
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/MDVC/model/transformer.py", line 152, in forward
    res = sublayer(res)
  File "/root/MDVC/model/transformer.py", line 186, in <lambda>
    sublayer0 = lambda x: self.self_att(x, x, x, src_mask)
  File "/root/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/MDVC/model/transformer.py", line 137, in forward
    att = attention(Q, K, V, mask)  # (B, H, seq_len, d_k)
  File "/root/MDVC/model/transformer.py", line 101, in attention
    sm_input = sm_input.masked_fill(mask == 0, -float('inf'))
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 3.94 GiB total capacity; 3.43 GiB already allocated; 12.94 MiB free; 71.07 MiB cached)

I3D Convolutions Script + Input Data

Hi Vladimir,

Noticed in the MDVC codebase that you load the I3D CONV features from "./data/sub_activitynet_v1-3.i3d_25fps_stack24step24_2stream.hdf5"

Some questions:
(i) Do you have a script that generates these features from raw data?
(ii) What input data did you run the I3D model over? I ask because your I3D features filename suggests the features were extracted at 25 FPS, which implies that you manually sampled the ActivityNet Captions videos at 25 FPS, since, unfortunately, the official ActivityNet website only offers frames sampled at 5 FPS (http://activity-net.org/challenges/2020/tasks/anet_captioning.html).
(iii) Do you have a link for the sampled frames?

Thanks!
Aman

About text

caption_idx = caption_data.caption
caption_idx, caption_idx_y = caption_idx[:, :-1], caption_idx[:, 1:]
Excuse me, why do you want to remove the first token and the last token in the second line?
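
This looks like the standard teacher-forcing shift used when training sequence models; a toy sketch (the token ids below are made up, not the repository vocabulary):

import torch

caption_idx = torch.tensor([[1, 5, 7, 9, 2]])  # e.g. <s> w1 w2 w3 </s>
decoder_input = caption_idx[:, :-1]  # <s> w1 w2 w3  (drop the last token)
target = caption_idx[:, 1:]          # w1 w2 w3 </s> (drop the first token)
# at each position the model is trained to predict the next token in the target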

Requesting tensorboard log file for best model

Hi,

First of all, great work on the codebase and paper; the model architecture is explained very well.
I am working on improvements with your work as a baseline and am training on the dataset myself, with a little refactoring of the code to fit my needs. Could you please share the TensorBoard log file for the best_model.pt you have already provided in the repo? I am particularly interested in the validation-set results epoch by epoch. Currently, after a certain number of epochs, the prediction for 'videos_to_monitor' comes out as '', which is interesting, and I would like to see how training progressed for your best model.

ASR

Hello! I would like to ask how you obtained your .srt subtitle files. Were they downloaded directly from YouTube, or obtained by calling Google's API?

videoCategoriesMetaUS.json

parser.add_argument(
    '--video_categories_meta_path', type=str, default='./data/videoCategoriesMetaUS.json',
    help='Path to the categories meta from Youtube API: \
    https://developers.google.com/youtube/v3/docs/videoCategories/list'
)

Hello! May I ask where I can find the above file?

How to extract vggish features having overlapping?

Hello, if I want to extract audio features where each feature is 0.96 s long with a 0.32 s overlap, can I just set the parameter EXAMPLE_HOP_SECONDS = 0.32 in ./models/vggish/vggish_src/vggish_params.py without any other changes?
Are there any other changes required?
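
For intuition, the hop length mainly controls how many 0.96 s examples are produced per clip; a rough back-of-the-envelope sketch assuming the standard VGGish framing parameters (window 0.96 s, hop as set; the clip duration is hypothetical):

window_s = 0.96     # VGGish EXAMPLE_WINDOW_SECONDS
hop_s = 0.32        # proposed EXAMPLE_HOP_SECONDS
duration_s = 10.0   # hypothetical clip length
num_examples = int((duration_s - window_s) / hop_s) + 1
print(num_examples)  # ~29 overlapping examples instead of ~10 non-overlapping ones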

503 service unavailable: cannot open a3s.fi links

Hi,

Since this Monday, the bucket storage that we use for features and models (a3s) has been down. Usually it takes several hours to recover, but this time it seems more serious. The problem is claimed to be 'temporary', but we are not sure when it will recover, so for now please use this issue to let us know if you need any of the files ASAP; we will share them via other means.

Sorry for the trouble and good luck with your research!

Filtered validation files

"As we mentioned in the paper, we didn't have access to the full dataset as ActivityNet Captions is distributed as the list of links to YouTube video. Consequently, many videos (~8.8 %) were no longer available at the time when we were downloading the dataset. In addition, some videos didn't have any speech. We filtered out such videos from the validation files and reported the results as no missings in the paper. Please create an issue if you would like us to share these filtered validation files."

Can you please share the filtered validation files? Thanks!

ASR

Hello! I want to ask which API was used to obtain the .srt subtitles?

Alignment key for the A/V features in the .npy/.hdf5 files

Hi Vladimir,

Long time no talk :) I was wondering if you could share the code that converted the .npy features (from your VGGish and I3D feature extractors), which you made available to me mid last year, into the .hdf5 files referenced in MDVC Readme: Usage. In particular, I am interested in understanding how you "align" the audio and video features (based on the code below).

Questions:

  1. Are the audio and video features aligned by time in the hdf5 file? Is that what T_audio/T_video stands for?
  2. Is the D_audio/D_video simply the feature dimension?

def load_multimodal_features_from_h5(feat_h5_video, feat_h5_audio, feature_names_list, 
                                     video_id, start, end, duration, get_full_feat=False, cs=True):
    supported_feature_names = {'i3d_features', 'c3d_features', 'vggish_features'}
    assert isinstance(feature_names_list, list)
    assert len(feature_names_list) > 0
    assert set(feature_names_list).issubset(supported_feature_names)

    if 'vggish_features' in feature_names_list:
        audio_stack = feat_h5_audio.get(f'{video_id}/vggish_features')

        # some videos doesn't have audio
        if audio_stack is None:
            print(f'audio_stack is None @ {video_id}')
            audio_stack = torch.empty((0, 128)).float()

        T_audio, D_audio = audio_stack.shape

    if 'i3d_features' in feature_names_list:
        video_stack_rgb = feat_h5_video.get(f'{video_id}/i3d_features/rgb')
        video_stack_flow = feat_h5_video.get(f'{video_id}/i3d_features/flow')
        
        assert video_stack_rgb.shape == video_stack_flow.shape
        T_video, D_video = video_stack_rgb.shape

        if T_video > T_audio:
            video_stack_rgb = video_stack_rgb[:T_audio, :]
            video_stack_flow = video_stack_flow[:T_audio, :]
            T = T_audio
        elif T_video < T_audio:
            audio_stack = audio_stack[:T_video, :]
            T = T_video
        else:
            # or T = T_audio
            T = T_video
        
        # at this point they should be the same
        assert audio_stack.shape[0] == video_stack_rgb.shape[0]

Thanks again for your help!

KeyError in validation_next_word_loop when running main.py

Hi Vladimir! Hope you are doing well.

I was running your main.py script. There is the following error saying KeyError. Am I missing something? Thanks a lot!

Traceback (most recent call last):
  File "main.py", line 572, in <module>
    main(cfg)
  File "main.py", line 281, in main
    cfg.use_categories
  File "/home/tuf72841/MDVC/epoch_loop/run_epoch.py", line 336, in validation_next_word_loop
    for i, batch in enumerate(tqdm(loader, desc=f'{time} {phase} ({epoch})')):
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/tqdm/std.py", line 1127, in __iter__
    for obj in iterable:
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 346, in __next__
    data = self.dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/tuf72841/MDVC/dataset/dataset.py", line 443, in __getitem__
    caption_data = next(self.caption_loader_iter)
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torchtext/data/iterator.py", line 156, in __iter__
    yield Batch(minibatch, self.dataset, self.device)
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in numericalize
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
  File "/home/tuf72841/.conda/envs/mdvc/lib/python3.7/site-packages/torchtext/data/field.py", line 336, in <listcomp>
    arr = [[self.vocab.stoi[x] for x in ex] for ex in arr]
KeyError: 'stairclimber'

ResolvePackageNotFound:

PS D:\Python Pro\Other Project\MDVC> conda env create -f conda_env.yml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:

  • h5py==2.8.0=py37h989c5e5_3
  • openjdk==11.0.1=h516909a_1016
  • protobuf==3.11.4=py37h3340039_1
  • spacy==2.0.12=py37h962f231_0
  • cffi==1.14.0=py37h2e261b9_0
  • thinc==6.10.3=py37h962f231_0
  • libedit==3.1.20181209=hc058e9b_0
  • readline==8.0=h7b6447c_0
  • regex==2018.07.11=py37h14c3975_0
  • pytorch==1.2.0=py3.7_cuda10.0.130_cudnn7.6.2_0
  • libprotobuf==3.11.4=h8b12597_0
  • mkl_fft==1.0.15=py37ha843d7b_0
  • libstdcxx-ng==9.1.0=hdf63c60_0
  • pandas==0.24.2=py37he6710b0_0
  • torchvision==0.3.0=py37_cu10.0.130_1
  • pillow==7.1.2=py37hb39fc2d_0
  • ujson==2.0.3=py37he6710b0_0
  • zstd==1.3.7=h0b5b093_0
  • libpng==1.6.37=hbc83047_0
  • cryptography==2.8=py37h1ba5d50_0
  • libgfortran-ng==7.3.0=hdf63c60_0
  • cymem==1.31.2=py37h6bb024c_0
  • freetype==2.9.1=h8a8886c_1
  • zlib==1.2.11=h7b6447c_3
  • ninja==1.9.0=py37hfd86e86_0
  • cytoolz==0.9.0.1=py37h14c3975_1
  • sqlite==3.31.1=h62c20be_1
  • ld_impl_linux-64==2.33.1=h53a641e_7
  • hdf5==1.10.2=hba1933b_1
  • libtiff==4.1.0=h2733197_0
  • numpy==1.15.4=py37h7e9f1db_0
  • mkl-service==2.3.0=py37he904b0f_0
  • msgpack-python==0.5.6=py37h6bb024c_1
  • ncurses==6.2=he6710b0_1
  • tk==8.6.8=hbc83047_0
  • wrapt==1.10.11=py37h14c3975_2
  • libffi==3.2.1=hd88cf55_4
  • xz==5.2.5=h7b6447c_0
  • jpeg==9b=h024ee3a_2
  • openssl==1.1.1g=h516909a_0
  • mkl_random==1.1.0=py37hd6b4f25_0
  • murmurhash==0.28.0=py37hf484d3e_0
  • python==3.7.7=hcf32534_0_cpython
  • libgcc-ng==9.1.0=hdf63c60_0
  • numpy-base==1.15.4=py37hde5b4d6_0
  • preshed==1.0.1=py37he6710b0_0

I have two issues. 1. I want to run this without a GPU (I don't have a GPU in Ubuntu)

(mdvc) tarun@tarun-VirtualBox:~/Downloads/MDVC$ python main.py --device_ids 1
log_path: ./log/0601100925
model_checkpoint_path: ./log/0601100925
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp line=50 error=38 : no CUDA-capable device is detected
Traceback (most recent call last):
  File "main.py", line 572, in <module>
    main(cfg)
  File "main.py", line 177, in main
    torch.cuda.set_device(cfg.device_ids[0])
  File "/home/tarun/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/cuda/__init__.py", line 281, in set_device
    torch._C._cuda_setDevice(device)
  File "/home/tarun/miniconda3/envs/mdvc/lib/python3.7/site-packages/torch/cuda/__init__.py", line 179, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (38) : no CUDA-capable device is detected at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp:50

Sharing I3D and VGGish features

Hi Vladimir,

I am working on a project that cites MDVC as a baseline and adds some elements to it.

Would you mind if I use your download link for the I3D and VGGish features in my project's readme (which would be on Github)?

Let me know!

Thanks,
Aman

Dense Video Captioning on raw input videos

This seems like nice work. I wanted to test it on custom input videos. It would be very helpful if you could provide a script that generates video captions for a raw input video.
