
clip4clip's Introduction

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

(July 28, 2021) Added ViT-B/16 support via an extra --pretrained_clip_name option

(Apr. 22, 2021) First version

The implementation of the paper CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval.

CLIP4Clip is a video-text retrieval model based on CLIP (ViT-B). In this work we investigate three similarity calculation approaches: parameter-free type, sequential type, and tight type. The model achieves SOTA results on MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.

(Figure: the CLIP4Clip framework.)

Requirement

# From CLIP
conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
pip install ftfy regex tqdm
pip install opencv-python boto3 requests pandas
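
A quick sanity check before launching any training (a minimal sketch; it only confirms that the packages installed above import cleanly and that a GPU is visible):

# Sanity check: PyTorch sees a GPU and the extra dependencies import cleanly.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import ftfy, regex, tqdm, cv2, boto3, requests, pandas; print('dependencies OK')"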

Data Preparing

For MSRVTT

The official data and video links can be found at link.

For convenience, you can also download the splits and captions with:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip

Besides, the raw videos are shared by Frozen in Time, i.e.,

wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
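
After downloading, unpack both archives; the target folder names below are assumptions, but whatever you extract is what --train_csv/--val_csv, --data_path, and --features_path should point to in the training commands further down.

# Unpack the annotations and the raw videos (target directory names are assumptions).
unzip msrvtt_data.zip -d msrvtt_data
unzip MSRVTT.zip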

For MSVD

Raw videos can be downloaded from link.

The splits and raw_captions can be found in the excellent collaborative-experts repository. For convenience, you can also download them with:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msvd_data.zip

For LSMDC

You must obtain permission from MPII to download and use the data. The download link is here, and the data for the 1000 test clips is at link. Read our paper and the dataloader for more information.

For ActivityNet

The official website has made the full dataset available on Google and Baidu drives; see more information here. The splits can be found in the collaborative-experts repository.

For DiDeMo

Raw videos can be downloaded from LisaAnne/LocalizingMoments. The splits can be found in the collaborative-experts repository.

Compress Video for Speed-up (optional)

python preprocess/compress_video.py --input_root [raw_video_path] --output_root [compressed_video_path]

This script compresses each video to 3 fps and resizes it so that its width (or height, whichever is shorter) is 224. Modify the variables in the script for your own setup.
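
If you want a rough standalone equivalent, the sketch below drives ffmpeg directly; it is an assumption about the script's behavior (3 fps, shorter side scaled to 224), not the exact filters used by preprocess/compress_video.py.

# Sketch: compress every mp4 to 3 fps with the shorter side scaled to 224
# (assumed to approximate preprocess/compress_video.py, not identical to it).
mkdir -p [compressed_video_path]
for f in [raw_video_path]/*.mp4; do
  ffmpeg -y -i "$f" -r 3 \
    -vf "scale='if(gt(iw,ih),-2,224)':'if(gt(iw,ih),224,-2)'" \
    "[compressed_video_path]/$(basename "$f")"
done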

How to Run

--features_path is the video root path

--linear_patch can be set to 2d or 3d

--sim_header can be set to meanP, seqLSTM, seqTransf, or tightTransf

--pretrained_clip_name can be set to ViT-B/32 or ViT-B/16

--resume_model can be used to reload the saved optimizer state and continue training. Note: you also need to set the corresponding checkpoint via --init_model (see the resume sketch below).

Read our paper for more details on --linear_patch and --sim_header. Try more hyperparameter settings for better performance.
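
For example, resuming an interrupted run might look like the sketch below. The saved file names (pytorch_model.bin.2 for the weights, pytorch_opt.bin.2 for the optimizer state) are assumptions, so point the two flags at whatever your --output_dir actually contains.

# Sketch: resume training from epoch 2's checkpoint (file names are assumptions;
# keep every other flag identical to the original training command).
python -m torch.distributed.launch --nproc_per_node=4 \
main_task_retrieval.py --do_train [same flags as the training command below] \
--output_dir ckpts/ckpt_msrvtt_retrieval_looseType \
--init_model ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.2 \
--resume_model ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_opt.bin.2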

Download CLIP (ViT-B/32) weight,

wget -P ./modules https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt

or, download CLIP (ViT-B/16) weight,

wget -P ./modules https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
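
As a quick integrity check you can hash the downloaded file; the long hex segment in each URL appears to be the file's SHA-256 digest, so treat the expected value below as an assumption if the check fails.

# Sketch: verify a downloaded weight against the hash embedded in its URL
# (assumes the URL's hex path segment is the file's SHA-256 digest).
sha256sum ./modules/ViT-B-32.pt
# expected: 40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af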

Then, run

CLIP (ViT-B/32) is the default setting in the paper; replace it with ViT-B/16 for better performance.

MSRVTT

DATA_PATH=[Your MSRVTT data and videos path]
python -m torch.distributed.launch --nproc_per_node=4 \
main_task_retrieval.py --do_train --num_thread_reader=0 \
--epochs=5 --batch_size=128 --n_display=50 \
--train_csv ${DATA_PATH}/MSRVTT_train.9k.csv \
--val_csv ${DATA_PATH}/MSRVTT_JSFUSION_test.csv \
--data_path ${DATA_PATH}/MSRVTT_data.json \
--features_path ${DATA_PATH}/MSRVTT_Videos \
--output_dir ckpts/ckpt_msrvtt_retrieval_looseType \
--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
--datatype msrvtt --expand_msrvtt_sentences  \
--feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0  --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32
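
To evaluate a saved checkpoint without further training, a sketch like the following should work; the --do_eval flag and the checkpoint name pytorch_model.bin.4 are assumptions about the training script, so check the argument parser in main_task_retrieval.py and your --output_dir before relying on them.

# Sketch: evaluation only on the MSRVTT test split (--do_eval and the checkpoint
# name are assumptions; adjust to what your run actually produced).
python -m torch.distributed.launch --nproc_per_node=1 \
main_task_retrieval.py --do_eval --num_thread_reader=0 \
--val_csv ${DATA_PATH}/MSRVTT_JSFUSION_test.csv \
--data_path ${DATA_PATH}/MSRVTT_data.json \
--features_path ${DATA_PATH}/MSRVTT_Videos \
--output_dir ckpts/ckpt_msrvtt_retrieval_looseType \
--max_words 32 --max_frames 12 --batch_size_val 16 \
--datatype msrvtt --feature_framerate 1 --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32 \
--init_model ckpts/ckpt_msrvtt_retrieval_looseType/pytorch_model.bin.4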

MSVD

DATA_PATH=[Your MSVD data and videos path]
python -m torch.distributed.launch --nproc_per_node=4 \
main_task_retrieval.py --do_train --num_thread_reader=2 \
--epochs=5 --batch_size=128 --n_display=50 \
--data_path ${DATA_PATH} \
--features_path ${DATA_PATH}/MSVD_Videos \
--output_dir ckpts/ckpt_msvd_retrieval_looseType \
--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
--datatype msvd \
--feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0 --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32

LSMDC

DATA_PATH=[Your LSMDC data and videos path]
python -m torch.distributed.launch --nproc_per_node=4 \
main_task_retrieval.py --do_train --num_thread_reader=2 \
--epochs=5 --batch_size=128 --n_display=50 \
--data_path ${DATA_PATH} \
--features_path ${DATA_PATH}/LSMDC_Videos \
--output_dir ckpts/ckpt_lsmdc_retrieval_looseType \
--lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 16 \
--datatype lsmdc --feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0  --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32

ActivityNet

ActivityNet is treated as video-paragraph retrieval in our setting and therefore needs more GPUs (or a multi-node run).

DATA_PATH=[Your ActivityNet data and videos path]
python -m torch.distributed.launch --nproc_per_node=8 \
main_task_retrieval.py --do_train --num_thread_reader=2 \
--epochs=5 --batch_size=128 --n_display=50 \
--data_path ${DATA_PATH} \
--features_path ${DATA_PATH}/Activity_Videos \
--output_dir ckpts/ckpt_activity_retrieval_looseType \
--lr 1e-4 --max_words 64 --max_frames 64 --batch_size_val 16 \
--datatype activity --feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0  --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32

DiDeMo

DiDeMo is treated as video-paragraph retrieval in our setting and therefore needs more GPUs (or a multi-node run).

DATA_PATH=[Your DiDeMo data and videos path]
python -m torch.distributed.launch --nproc_per_node=8 \
main_task_retrieval.py --do_train --num_thread_reader=2 \
--epochs=5 --batch_size=128 --n_display=50 \
--data_path ${DATA_PATH} \
--features_path ${DATA_PATH}/DiDeMo_Videos \
--output_dir ckpts/ckpt_didemo_retrieval_looseType \
--lr 1e-4 --max_words 64 --max_frames 64 --batch_size_val 16 \
--datatype didemo --feature_framerate 1 --coef_lr 1e-3 \
--freeze_layer_num 0  --slice_framepos 2 \
--loose_type --linear_patch 2d --sim_header meanP \
--pretrained_clip_name ViT-B/32

Citation

If you find CLIP4Clip useful in your work, please cite the following paper:

@Article{Luo2021CLIP4Clip,
  author  = {Huaishao Luo and Lei Ji and Ming Zhong and Yang Chen and Wen Lei and Nan Duan and Tianrui Li},
  title   = {{CLIP4Clip}: An Empirical Study of CLIP for End to End Video Clip Retrieval},
  journal = {arXiv preprint arXiv:2104.08860},
  year    = {2021},
}

Acknowledgments

Our code is based on CLIP and UniVL.

clip4clip's People

Contributors: arrowluo, bryant1410, fangmingzhou, junchen14, zara0m


clip4clip's Issues

Pre-trained models

Hey,

Hope you are all well and thank you for open-sourcing the code! 🤗

Was wondering if you are also planning to release any of the pre-trained models?

Thanks.

the visual hidden output doesn't match between ViT-B/32 and RN50

Hi ArrowLuo,
When I change the pretrained ViT-B/32 to RN50, I get an error in the function encode_image (module_clip.py).
The visual hidden output (x) dim of ViT-B/32 is [768, 512],
while the visual hidden output (x) dim of RN50 is [768, 1024].
Do you know how to solve this problem? Thanks.

comparison with COOT model

hi,
I am wondering why you don't compare with the COOT model on the ActivityNet benchmark.
COOT reports much higher performance, with R@1 up to 60.8.

My main concern is whether there is any difference in how the performance is calculated.

Thanks for your time

pretraining hyperparameters

When you pretrain on the HowTo100M dataset, do you use the same hyperparameters as for training on the small MSRVTT and MSVD datasets (especially the learning rate)?
Did you also test other hyperparameters? If so, it would be good if you could share some experience.
Thanks

How do you sample frames from howto100m videos?

hi,

Could you please provide more details on how you pretrain on HowTo100M videos? How do you extract frame features? How many frames do you extract for each video? How do you organize the multiple captions of each video (simple concatenation?)

I appreciate it very much

MSRVTT downsampling

Hi, great work and thanks for sharing the code.

I'm trying to reproduce the results on MSRVTT for comparison, but training is taking longer than expected (~6 hours/epoch).
The bottleneck is presumably in the data loading. In #8 I read that you downsampled the videos in advance. Can you explain how you downsampled the videos and share the script if possible?

evaluation performance is influenced by the batch size

Hi, in your evaluation code, before computing the logits you also apply batch-level normalization. The normalization statistics are in fact influenced by the batch composition, which means that a different evaluation batch size can yield different performance.

So, in this case, did you do any ablation study on the evaluation batch size?

Open source license

Thanks for your excellent work and sharing the code of CLIP4Clip. Do you plan to release the source code under a specific open source license (e.g. MIT, ...)?

About batch_size in training configuration?

Hi Luo, thanks for your remarkable work!
I am wondering why you use a batch size of 128 for 4-GPU DDP training. That is 32 per GPU, which uses less than half of the GPU memory. Is there some special purpose?

The dataset of MSR VTT

Hi,
This project is excellent work in the field of video retrieval, but I can't get the original MSRVTT videos through your link. Can you tell me how to get them?

cap.set(cv2.CAP_PROP_POS_FRAMES, sec_base + ind) gets some h264 warning message

Hi,
When I run the CLIP4Clip code to extract video frames, this line in rawvideo_utils.py: cap.set(cv2.CAP_PROP_POS_FRAMES, sec_base + ind) prints some warning messages like:
[h264 @ 0x56543b236ac0] illegal short term buffer state detected
[h264 @ 0x5654397996c0] Missing reference picture, default is 0
[h264 @ 0x5654397996c0] decode_slice_header error

[h264 @ 0x56543aefc800] Invalid NAL unit size (30588 > 15033).
[h264 @ 0x56543aefc800] Error splitting the input into NAL units.

Have you met this before? I'm not sure whether it affects the frames we extract...

MSRVTT dataset

Hi, thanks for your work. I want to download the MSRVTT dataset, but the link can't be opened; it says the page can't be reached. Could you share some working download links? Thanks!

The output of `set(optimizer.get_lr())` is confusing

  1. The logging code at each log_step in the latest commit is as follows:
                logger.info("Epoch: %d/%s, Step: %d/%d, Lr: %s, Loss: %f, Time/step: %f", epoch + 1,
                            args.epochs, step + 1,
                            len(train_dataloader), "-".join([str('%.9f'%itm) for itm in sorted(list(set(optimizer.get_lr())))]),
                            float(loss),
                            (time.time() - start_time) / (log_step * args.gradient_accumulation_steps))
  2. The output of "-".join([str('%.9f'%itm) for itm in sorted(list(set(optimizer.get_lr())))]) in my log is formatted as:
2021-12-20 15:12:50,703:INFO: Epoch: 1/5, Step: 10/1406, Lr: 0.000000001, Loss: 1.7373971939086914, Time/step: 15.502562475204467
  3. While the output in other people's logs is formatted as:
2021-12-16 13:41:08,839:INFO: Epoch: 1/5, Step: 50/703, Lr: 0.000000014-0.000014225, Loss: 2.009616, Time/step: 15.425273
  4. Why is the Lr column different? Even when I test with the latest codebase, it still prints only one floating-point number.

longer video evaluation

When I test your model on YouCook2 data, the performance is not very good.
I am not sure what the main reason is. Have you conducted similar experiments before?

YouCook2 videos are longer than MSRVTT videos, so maybe I should sample many more frames per YouCook2 video?

Similarly, HowTo100M is also temporally long compared to MSRVTT. When you pretrained on HowTo100M, did you increase the number of sampled frames?

Annotation of ActivityNet

Hello, I cannot find "train.json" and "val_1.json" for the ActivityNet dataset. Can you share these files?

Training time on the MSR-VTT dataset.

When I train on the MSR-VTT dataset, it is very slow. My videos are 720p, which may be too large. However, even when I resize all videos so that the short side is 256, it is still not very fast.
So, what is the resolution of your MSR-VTT videos, and how long does it take to train on MSR-VTT? Thanks!

Support RN50?

Hi,

I see there are several links for different models in module_clip.py.
Do you support a ResNet backbone?

Thank you

Some questions about the results on MSRVTT with `sim_header seqTransf`.

When I use the following configuration to train the model on MSRVTT Training-9K, the best result I got is
07/27/2021 13:11:01 - INFO - sim matrix size: 1000, 1000
07/27/2021 13:11:01 - INFO - Length-T: 1000, Length-V: 1000
07/27/2021 13:11:01 - INFO - Text-to-Video:
07/27/2021 13:11:01 - INFO - >>> R@1: 43.2 - R@5: 71.0 - R@10: 79.4 - Median R: 2.0 - Mean R: 15.4
07/27/2021 13:11:01 - INFO - Video-to-Text:
07/27/2021 13:11:01 - INFO - >>> V2T$R@1: 43.1 - V2T$R@5: 71.2 - V2T$R@10: 80.7 - V2T$Median R: 2.0 - V2T$Mean R: 11.9
It's worse than the R@1 of 44.5 reported in the paper. Did I miss some details?
Here is the configuration:
CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 --master_addr=127.0.0.2 --master_port 29552 main_task_retrieval.py --num_thread_reader=4 --epochs=5 --batch_size=128 --n_display=20 --train_csv /home/hadoop-vacv/cephfs/data/caoshuqiang/data/jobs/MSRVTT/csv/msrvtt_data/MSRVTT_train.9k.csv --val_csv /home/hadoop-vacv/cephfs/data/caoshuqiang/data/jobs/MSRVTT/csv/msrvtt_data/MSRVTT_JSFUSION_test.csv --data_path /home/hadoop-vacv/cephfs/data/caoshuqiang/data/jobs/MSRVTT/csv/msrvtt_data/MSRVTT_data.json --features_path /home/hadoop-vacv/cephfs/data/caoshuqiang/data/jobs/MSRVTT/MSRVTT_Videos --output_dir /home/hadoop-vacv/cephfs/data/caoshuqiang/code/vicab/newexp/hope/clip_raw --lr 1e-4 --max_words 32 --max_frames 12 --batch_size_val 12 --datatype msrvtt --expand_msrvtt_sentences --feature_framerate 1 --coef_lr 1e-3 --freeze_layer_num 0 --slice_framepos 2 --loose_type --linear_patch 2d --sim_header seqTransf --do_train

Query search

If I understood correctly, after training and evaluation, videos are ranked on the basis of their similarity scores. Is there any script available to submit a text search query and retrieve the closest-ranked videos, with some control based on confidence or similarity scores?

Why meanpooling is better?

Thanks for open-sourcing such a good work. Why is mean pooling better than the other methods? In theory, LSTM and Transformer exploit the temporal relationships, so their results could be better. Could you give some theoretical analysis or tips? Thanks a lot. By the way, can the code save the weights after each epoch?

dataset splits and training code

Hi,

First, I would like to say thanks for the nice work and the code. I have a few questions about the dataset splits. For example, MSVD and DiDeMo have both a validation set and a test set, but in the dataloader code it seems that you are using the test set for validation as well.

Moreover, in the training code (line 548) you use the test set to select the best checkpoint, which may not be best practice, especially when a validation set is available for the dataset. What do you think? Thanks!

DATALOADER_DICT["msvd"] = {"train":dataloader_msvd_train, "val":dataloader_msvd_test, "test":dataloader_msvd_test}
DATALOADER_DICT["lsmdc"] = {"train":dataloader_lsmdc_train, "val":dataloader_lsmdc_test, "test":dataloader_lsmdc_test}
DATALOADER_DICT["didemo"] = {"train":dataloader_didemo_train, "val":dataloader_didemo_test, "test":dataloader_didemo_test}

MSR-VTT dataset error

When I use the MSRVTT dataset, some errors occur. The log says "video path: /home/admin/workspace/workgroup/mayiwei.myw/data/MSRVTT/data/MSRVTT/videos/all/video5397.mp4 error. video id:video5397".
Do you know what is wrong?

Unsupported video formats after compressing

Hi,
Great work. Thank you very much for sharing the code.
A small issue I encountered when trying to train C4C on the DiDeMo dataset.
As recommended, I ran your ffmpeg script and tried to train the model on the output. However, the newly generated compressed videos in the 3gp, 3g2, and mpg formats are missing the moov atom, so their frames cannot be extracted (and the videos cannot be played). As a result, the fps is 0 and we get a division-by-zero exception at dataloader_activitynet_retrieval.py line 207.

Does the ffmpeg command need some modification?

MSRVTT dataset

Dear author, thanks for your work. When I try to download the MSRVTT dataset, some URLs are broken. How did you solve this? Thanks.

evaluation

Hi, dear Arrow, I have some questions as follows:

  1. How do I run evaluation only?
  2. How do you select a model when training on MSRVTT-9K, or do you just report the results of the final (5th) epoch?

max_position_embeddings is wrong for ActivityNet

max_position_embeddings is set to 77, which is smaller than the sum of max_words and max_frames.

I changed the value of max_position_embeddings to 128 and it worked, but the v2t results are slightly lower.

I'm not sure if that is the reason.

ActivityNet and DiDeMo dataset!

Could you please provide me with the download links of the raw videos of datasets ActivityNet and DiDeMo? (Google cloud link cannot be downloaded in China. Can you open your dataset to baidu Cloud link?)

Error AttributeError: 'list' object has no attribute 'shape' when training on one GPU

Hello,

I am getting this error when trying to train the model on Google Colab with one GPU. It occurs when the first epoch finishes.
Please let me know how to get around this error.

Here is the stack trace:
11/05/2021 10:51:52 - INFO - Epoch: 1/5, Step: 5600/5625, Lr: 0.000000091, Loss: 0.097879, Time/step: 13.179280
11/05/2021 10:57:31 - INFO - Epoch 1/5 Finished, Train Loss: 0.497687
Traceback (most recent call last):
File "main_task_retrieval.py", line 564, in
main()
File "main_task_retrieval.py", line 548, in main
R1 = eval_epoch(args, model, test_dataloader, device, n_gpu)
File "main_task_retrieval.py", line 437, in eval_epoch
logger.info("sim matrix size: {}, {}".format(sim_matrix.shape[0], sim_matrix.shape[1]))
AttributeError: 'list' object has no attribute 'shape'
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in
main()
File "/usr/local/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', '-u', 'main_task_retrieval.py', '--local_rank=0', '--do_train', '--num_thread_reader=0', '--epochs=5', '--batch_size=32', '--n_display=50', '--train_csv', 'msrvtt_datasplit/MSRVTT_train.9k.csv', '--val_csv', 'msrvtt_datasplit/MSRVTT_JSFUSION_test.csv', '--data_path', 'msrvtt_datasplit/MSRVTT_data.json', '--features_path', '/content/drive/MyDrive/Compressed', '--output_dir', 'ckpts/ckpt_msrvtt_retrieval_looseType', '--lr', '1e-4', '--max_words', '32', '--max_frames', '12', '--batch_size_val', '4', '--datatype', 'msrvtt', '--expand_msrvtt_sentences', '--feature_framerate', '1', '--coef_lr', '1e-3', '--freeze_layer_num', '0', '--slice_framepos', '2', '--loose_type', '--linear_patch', '2d', '--sim_header', 'meanP', '--pretrained_clip_name', 'ViT-B/32']' returned non-zero exit status 1.

About splits of LSMDC

I found that the statistics in LSMDC's readme.txt are the following:
= Statistics

  • Training: 101,079
  • Validation: 7,408
  • Public Test: 10,053
  • Blind Test: 9,578

But the paper shows 118,081 videos and 7,408 for validation; why is this different from the numbers above?

Dataloader for ActivityNet

Hello,

First of all, thanks for the great work and sharing the code.
Would it be possible to share the dataloader code for the ActivityNet dataset as well?

Some questions about using the WebVid dataset.

Thanks for sharing your code. I wonder if you can share the WIT dataset. When you pre-train on the WIT dataset, are the hyper-parameter settings similar to those used for the training datasets?
