

Everything at Once – Multi-modal Fusion Transformer for Video Retrieval

Shvetsova, N., Chen, B., Rouditchenko, A., Thomas, S., Kingsbury, B., Feris, R., Harwath, D., Glass, J. and Kuehne, H. Everything at Once – Multi-modal Fusion Transformer for Video Retrieval. In CVPR, 2022.

arXiv preprint arXiv:2112.04446


Accepted at CVPR 2022!

This repository contains:

  • the code to conduct all experiments reported in the paper
  • model weights to reproduce the main results
  • data for fine-tuning and evaluation on the YouCook2 and MSR-VTT datasets

Get started

  1. Create an environment (an optional sanity check for this step is shown at the end of this list):

    conda create python=3.6 -y -n everything_at_once
    conda activate everything_at_once 
    conda install -y pytorch==1.7.0 torchvision==0.8.0 torchaudio==0.7.0 cudatoolkit=10.2 -c pytorch
    pip install gensim==3.8.0 sacred==0.8.2 humanize==3.14.0 transformers==4.10.2 librosa==0.8.1 timm==0.4.12
    pip install neptune-contrib==0.28.1 --ignore-installed certifi
    
  2. If needed, download data.tar with features and spectrograms for fine-tuning and evaluation on YouCook2 and MSR-VTT here. Extract the tar: tar -xvf data.tar

  3. If needed, create a pretrained_models folder and download the model weights here:

    Extract the tar:

    cd pretrained_models
    tar -xvf everything_at_once_tva.tar
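
Once the environment from step 1 is set up, you can optionally run a quick sanity check (illustrative only; the exact version strings depend on your install):

    # Optional sanity check: confirm the pinned packages imported and CUDA is visible.
    # With the pins above, expect torch 1.7.0 / torchvision 0.8.0 / torchaudio 0.7.0.
    import torch, torchvision, torchaudio

    print('torch:', torch.__version__)
    print('torchvision:', torchvision.__version__)
    print('torchaudio:', torchaudio.__version__)
    print('CUDA available:', torch.cuda.is_available())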
    

Evaluation

To evaluate a pretrained everything-at-once model on the MSR-VTT dataset, run:

python test.py --n_gpu 1  \
  --config configs/evaluation/msrvtt_at_once.yaml \
  --resume pretrained_models/everything_at_once_tva/latest_model.pth

On the YouCook2 dataset:

python test.py --n_gpu 1  \
  --config configs/evaluation/youcook_at_once.yaml \
  --resume pretrained_models/everything_at_once_tva/latest_model.pth

Check out the configs/evaluation folder for more configs for evaluating models trained with S3D or CLIP features, or models that use other strategies to process long videos.
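
The paper reports standard retrieval metrics (R@1, R@5, R@10, median rank). For reference, here is a minimal sketch of how such metrics can be computed from a text-video similarity matrix (illustrative only, not the repository's evaluation code):

    # Illustrative only -- not the repository's evaluation code.
    # sims[i, j] is the score between text query i and video j; the ground-truth
    # video for query i is assumed to sit at index i.
    import numpy as np

    def retrieval_metrics(sims):
        order = np.argsort(-sims, axis=1)                     # best-scoring videos first
        ranks = np.array([np.where(order[i] == i)[0][0] + 1   # 1-based rank of the match
                          for i in range(sims.shape[0])])
        return {'R@1': np.mean(ranks <= 1),
                'R@5': np.mean(ranks <= 5),
                'R@10': np.mean(ranks <= 10),
                'MedR': np.median(ranks)}

    print(retrieval_metrics(np.random.rand(5, 5)))            # toy example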

Fine-tuning

To fine-tune the HowTo100M-pretrained model on the MSR-VTT dataset, run:

python train.py \
  --config configs/finetuning/finetune_msrvtt.yaml \
  --resume pretrained_models/everything_at_once_tva/latest_model.pth

Add the --neptune flag if you want to log experiments with neptune.ai (see Experiment Logging).

On the YouCook2 dataset:

python train.py \
  --config configs/finetuning/finetune_youcook.yaml \
  --resume pretrained_models/everything_at_once_tva/latest_model.pth

Add the --neptune flag if you want to log experiments with neptune.ai (see Experiment Logging).

Check out the configs/finetuning/clip folder for configs for fine-tuning with CLIP features.
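
If you want to inspect the pretrained checkpoint before fine-tuning, something like the following works (a sketch only; the exact layout of the checkpoint dict is an assumption and may differ):

    # Sketch: inspect the pretrained checkpoint. The dict layout is assumed, not
    # guaranteed -- it may be a raw state_dict or wrap one under a 'state_dict' key.
    import torch

    ckpt = torch.load('pretrained_models/everything_at_once_tva/latest_model.pth',
                      map_location='cpu')
    state_dict = ckpt.get('state_dict', ckpt) if isinstance(ckpt, dict) else ckpt

    print(len(state_dict), 'entries')
    for name, value in list(state_dict.items())[:10]:         # first few parameters
        shape = tuple(value.shape) if hasattr(value, 'shape') else type(value).__name__
        print(name, shape)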

Pretraining

  1. Download HowTo100M and extract features. Please note that the HowTo100M videos require huge storage, and the features alone take up terabytes of space. Feature extraction (ResNet-152, ResNeXt-101) and audio spectrogram extraction are described in detail in https://github.com/roudimit/AVLnet/blob/main/training.md. We will release the code for S3D and CLIP feature extraction.

  2. Review configs/pretraining/everything_at_once_tva.yaml and make sure csv, features_path, features_path_audio, and caption_path point to the correct paths. The CSV file should contain a single column named 'path' with the list of videos. An example of the CSV file we used in training can be found here (HowTo100M_1166_videopaths.txt); a minimal sketch of building such a file is given at the end of this section.

  3. Train: python train.py --config configs/pretraining/everything_at_once_tva.yaml

Add the --neptune flag if you want to log experiments with neptune.ai (see Experiment Logging).

Check out the configs/pretraining folder for more configs for the different ablation experiments.
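
As noted in step 2, the CSV only needs a single column named 'path'. A minimal sketch of building such a file (the video paths below are hypothetical placeholders):

    # Sketch: build the CSV expected by the pretraining config -- a single
    # column named 'path' listing the training videos. Paths are placeholders.
    import csv

    video_paths = ['videos/video_0001.mp4', 'videos/video_0002.mp4']

    with open('howto100m_videopaths.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['path'])                   # required column name
        writer.writerows([p] for p in video_paths)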

Experiment Logging

This repository uses Sacred with neptune.ai to log and track experiments. If you want to activate this (a rough wiring sketch follows the steps below):

  1. Create a neptune.ai account.
  2. Create a project and copy your credentials (api_token, project_name) into train.py.
  3. Add the --neptune flag to the training command (e.g. python train.py --neptune ...).
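
For reference, a minimal sketch of how Sacred and a Neptune observer are typically wired together (this assumes neptune-contrib's NeptuneObserver; the actual code in train.py may differ):

    # Sketch only -- not the repository's train.py. Assumes neptune-contrib's
    # NeptuneObserver; replace the placeholder credentials with your own values.
    from sacred import Experiment
    from neptunecontrib.monitoring.sacred import NeptuneObserver

    # Name is illustrative; save_git_info=False avoids the GitPython requirement.
    ex = Experiment('everything_at_once', save_git_info=False)
    ex.observers.append(NeptuneObserver(
        api_token='YOUR_NEPTUNE_API_TOKEN',
        project_name='your-workspace/your-project'))

    @ex.main
    def main():
        pass                                        # training would go here

    if __name__ == '__main__':
        ex.run()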

Using the model on your own data

If you want to use the model on your own data, please follow the steps described in https://github.com/roudimit/AVLnet for feature extraction and audio spectrogram extraction.

You may also take a look at everything_at_once_tva.yaml, which contains comments on how to define n_video_tokens and num_audio_STFT_frames.
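
As a rough illustration of how these values scale with clip length (the per-second rates below, one visual feature per second and roughly 100 spectrogram frames per second, are assumptions, not the repository's settings; the comments in everything_at_once_tva.yaml are authoritative):

    # Rough illustration only -- the per-second rates are assumptions; consult
    # everything_at_once_tva.yaml for the actual values to use.
    clip_duration_sec = 16                  # hypothetical clip length
    video_features_per_sec = 1              # assumed rate of the extracted video features
    stft_frames_per_sec = 100               # assumed spectrogram frame rate (~10 ms hop)

    n_video_tokens = clip_duration_sec * video_features_per_sec
    num_audio_STFT_frames = clip_duration_sec * stft_frames_per_sec
    print(n_video_tokens, num_audio_STFT_frames)    # 16, 1600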

Cite

If you use this code in your research, please cite:

@inproceedings{shvetsova2022everything,
  title={Everything at Once-Multi-Modal Fusion Transformer for Video Retrieval},
  author={Shvetsova, Nina and Chen, Brian and Rouditchenko, Andrew and Thomas, Samuel and Kingsbury, Brian and Feris, Rogerio S and Harwath, David and Glass, James and Kuehne, Hilde},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={20020--20029},
  year={2022}
}

Contact

If you have any problems with the code or have a question, please open an issue or send an email to shvetsova at em.uni-frankfurt.de. I'll try to answer as soon as possible.

Acknowledgments and Licenses

The main structure of the code is based on the frozen-in-time code: https://github.com/m-bain/frozen-in-time, which itself is based on the pytorch-template https://github.com/victoresque/pytorch-template. Thanks for sharing good practices!

The code in davenet.py, layers.py, avlnet.py is partly derived from https://github.com/dharwath/DAVEnet-pytorch/, https://github.com/wnhsu/ResDAVEnet-VQ, https://github.com/antoine77340/howto100m, and https://github.com/roudimit/AVLnet, and is licensed under BSD-3 (David Harwath, Wei-Ning Hsu, Andrew Rouditchenko) and Apache License 2.0 (Antoine Miech).


everything_at_once's Issues

Problem

A bug appears when I run test.py:
usage: test.py [-h] -r RESUME [-c CONFIG] [-d DEVICE] [--n_gpu N_GPU]
test.py: error: the following arguments are required: -r/--resume
Can you give me a solution, please?

Code problem

Hello. I saw in the README that "We will release the code for S3D and CLIP feature extraction". Could you share the link to the S3D code? Thanks.

Loss problem

Hello! May I ask what exactly the six positive samples and the negative samples in your combinatorial loss are?

code problem

Hello, when I run test.py I get this error:
ValueError: Cannot import git (pip install GitPython).
Either GitPython or the git executable is missing.
You can disable git with:
sacred.Experiment(..., save_git_info=False)

What do the inputs look like?

HowTo100M requires huge storage, so I cannot run training. May I ask what the videos and texts look like before they are put into the transformer module (e.g. token IDs like [31373, 11, 1312, 716, 36945, ...] or float features like [0.0524, -0.0960, -0.1728, ...])? Do you directly concatenate the tensors after feature extraction and feed them into the transformer module without any adjustment?

Questions about the transformer

Hello, thanks for your work! I have a question about the transformer: I see that all the modalities are forwarded through a single transformer. Have you tested using a different transformer for each modality? Put another way, is there any benefit to using only one transformer?

Thanks!

data size

May I ask how much data storage you prepared for this work?

Some differences between the code and the paper

1. In Figure 2 of the paper, there is a single "Transformer Blocks" module for all the embeddings (from all modalities), but in your code there is one "Transformer Blocks" for each modality's embeddings and another one later for the cross-modal embeddings.
2. Question: if I have n modalities, will the number of cross-modal embeddings be 2 ** n - n - 2? For example, with modalities A/B/C/D, the cross-modal combinations would be 'AB', 'AC', 'AD', 'BC', 'BD', 'CD', 'ABC', 'ABD', 'ACD', 'BCD'?
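
For reference, the count in question 2 can be checked with a quick enumeration (this only verifies the arithmetic in the question, not the authors' implementation):

    # Enumerate multi-modal subsets of n modalities, excluding single modalities
    # and (as in the A/B/C/D example above) the full set: 2**n - n - 2 of them.
    from itertools import combinations

    modalities = ['A', 'B', 'C', 'D']
    subsets = [c for r in range(2, len(modalities))      # sizes 2 .. n-1
               for c in combinations(modalities, r)]
    print(len(subsets))                                  # 10 == 2**4 - 4 - 2
    print(subsets)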

What hardware resources do I need to run your model and data?

Hi Shvetsova,
Congratulations on your team's paper being accepted to CVPR. I have a small question: what hardware resources are needed to run your model and data? I have three 2080Ti GPUs and it still doesn't work. The message is below:

RuntimeError: CUDA out of memory. Tried to allocate 96.00 MiB (GPU 0; 10.76 GiB total capacity; 9.81 GiB already allocated; 88.31 MiB free; 9.82 GiB reserved in total by PyTorch)
The command I used is: python test.py --n_gpu 3 --config configs/evaluation/msrvtt_at_once.yaml --resume pretrained_models/everything_at_once_tva/latest_model.pth
Looking forward to your reply.
Best wishes,
Feng.

Regarding feature computation for S3D and CLIP

Hi Nina! Great work!

Thanks for the code release. I was wondering if you have the scripts for S3D and CLIP feature computation handy? If so, could you please share them? Thanks.
