Comments (8)

ArrowLuo commented on August 25, 2024

Hi @dawnlh, would you provide your log.txt here? I cannot locate the problem from the command alone.

from univl.

dawnlh commented on August 25, 2024

> Hi @dawnlh, would you provide your log.txt here? I cannot locate the problem from the command alone.

Thanks a lot! Here is the log file:

2021-05-25 11:15:57,643:INFO: Effective parameters:
2021-05-25 11:15:57,644:INFO:   <<< batch_size: 256
2021-05-25 11:15:57,644:INFO:   <<< batch_size_val: 32
2021-05-25 11:15:57,644:INFO:   <<< bert_model: bert-base-uncased
2021-05-25 11:15:57,644:INFO:   <<< cache_dir: 
2021-05-25 11:15:57,644:INFO:   <<< coef_lr: 0.1
2021-05-25 11:15:57,644:INFO:   <<< cross_model: cross-base
2021-05-25 11:15:57,644:INFO:   <<< cross_num_hidden_layers: 2
2021-05-25 11:15:57,644:INFO:   <<< data_path: data/msrvtt/MSRVTT_data.json
2021-05-25 11:15:57,644:INFO:   <<< datatype: msrvtt
2021-05-25 11:15:57,644:INFO:   <<< decoder_model: decoder-base
2021-05-25 11:15:57,644:INFO:   <<< decoder_num_hidden_layers: 3
2021-05-25 11:15:57,644:INFO:   <<< do_eval: True
2021-05-25 11:15:57,644:INFO:   <<< do_lower_case: True
2021-05-25 11:15:57,644:INFO:   <<< do_pretrain: False
2021-05-25 11:15:57,644:INFO:   <<< do_train: False
2021-05-25 11:15:57,644:INFO:   <<< epochs: 20
2021-05-25 11:15:57,644:INFO:   <<< feature_framerate: 1
2021-05-25 11:15:57,644:INFO:   <<< features_path: data/msrvtt/msrvtt_videos_features.pickle
2021-05-25 11:15:57,644:INFO:   <<< fp16: False
2021-05-25 11:15:57,644:INFO:   <<< fp16_opt_level: O1
2021-05-25 11:15:57,644:INFO:   <<< gradient_accumulation_steps: 1
2021-05-25 11:15:57,644:INFO:   <<< hard_negative_rate: 0.5
2021-05-25 11:15:57,644:INFO:   <<< init_model: weight/univl.pretrained.bin
2021-05-25 11:15:57,644:INFO:   <<< local_rank: 0
2021-05-25 11:15:57,644:INFO:   <<< lr: 0.0001
2021-05-25 11:15:57,644:INFO:   <<< lr_decay: 0.9
2021-05-25 11:15:57,644:INFO:   <<< margin: 0.1
2021-05-25 11:15:57,644:INFO:   <<< max_frames: 100
2021-05-25 11:15:57,644:INFO:   <<< max_words: 20
2021-05-25 11:15:57,644:INFO:   <<< min_time: 5.0
2021-05-25 11:15:57,645:INFO:   <<< n_display: 100
2021-05-25 11:15:57,645:INFO:   <<< n_gpu: 1
2021-05-25 11:15:57,645:INFO:   <<< n_pair: 1
2021-05-25 11:15:57,645:INFO:   <<< negative_weighting: 1
2021-05-25 11:15:57,645:INFO:   <<< num_thread_reader: 4
2021-05-25 11:15:57,645:INFO:   <<< output_dir: ckpts/ckpt_msrvtt_caption
2021-05-25 11:15:57,645:INFO:   <<< sampled_use_mil: False
2021-05-25 11:15:57,645:INFO:   <<< seed: 42
2021-05-25 11:15:57,645:INFO:   <<< stage_two: True
2021-05-25 11:15:57,645:INFO:   <<< task_type: caption
2021-05-25 11:15:57,645:INFO:   <<< text_num_hidden_layers: 12
2021-05-25 11:15:57,645:INFO:   <<< train_csv: data/youcookii_singlef_train.csv
2021-05-25 11:15:57,645:INFO:   <<< use_mil: False
2021-05-25 11:15:57,645:INFO:   <<< val_csv: data/msrvtt/MSRVTT_JSFUSION_test.csv
2021-05-25 11:15:57,645:INFO:   <<< video_dim: 1024
2021-05-25 11:15:57,645:INFO:   <<< visual_model: visual-base
2021-05-25 11:15:57,645:INFO:   <<< visual_num_hidden_layers: 6
2021-05-25 11:15:57,645:INFO:   <<< warmup_proportion: 0.1
2021-05-25 11:15:57,645:INFO:   <<< world_size: 1
2021-05-25 11:15:57,646:INFO: device: cuda:0 n_gpu: 1
2021-05-25 11:15:57,646:INFO: loading vocabulary file /data2/zzh/project/SCI_caption/UniVL/modules/bert-base-uncased/vocab.txt
2021-05-25 11:15:58,017:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/bert-base-uncased
2021-05-25 11:15:58,018:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

2021-05-25 11:15:58,018:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/visual-base
2021-05-25 11:15:58,018:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 1,
  "type_vocab_size": 2,
  "vocab_size": 1024
}

2021-05-25 11:15:58,018:INFO: Weight doesn't exsits. /data2/zzh/project/SCI_caption/UniVL/modules/visual-base/visual_pytorch_model.bin
2021-05-25 11:15:58,018:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/cross-base
2021-05-25 11:15:58,018:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 1024,
  "num_attention_heads": 12,
  "num_hidden_layers": 2,
  "type_vocab_size": 2,
  "vocab_size": 768
}

2021-05-25 11:15:58,018:INFO: Weight doesn't exsits. /data2/zzh/project/SCI_caption/UniVL/modules/cross-base/cross_pytorch_model.bin
2021-05-25 11:15:58,018:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/decoder-base
2021-05-25 11:15:58,019:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_target_embeddings": 512,
  "num_attention_heads": 12,
  "num_decoder_layers": 1,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

2021-05-25 11:15:58,019:INFO: Weight doesn't exsits. /data2/zzh/project/SCI_caption/UniVL/modules/decoder-base/decoder_pytorch_model.bin
2021-05-25 11:15:58,019:WARNING: Stage-One:False, Stage-Two:True
2021-05-25 11:15:58,019:WARNING: Set bert_config.num_hidden_layers: 12.
2021-05-25 11:15:59,122:WARNING: Set visual_config.num_hidden_layers: 6.
2021-05-25 11:15:59,591:WARNING: Set cross_config.num_hidden_layers: 2.
2021-05-25 11:15:59,763:WARNING: Set decoder_config.num_decoder_layers: 3.
2021-05-25 11:16:02,843:INFO: --------------------
2021-05-25 11:16:02,843:INFO: Weights from pretrained model not used in UniVL: 
   cls.predictions.bias
   cls.predictions.transform.dense.weight
   cls.predictions.transform.dense.bias
   cls.predictions.transform.LayerNorm.weight
   cls.predictions.transform.LayerNorm.bias
   cls.predictions.decoder.weight
   cls_visual.predictions.weight
   cls_visual.predictions.bias
   cls_visual.predictions.transform.dense.weight
   cls_visual.predictions.transform.dense.bias
   cls_visual.predictions.transform.LayerNorm.weight
   cls_visual.predictions.transform.LayerNorm.bias
   similarity_pooler.dense.weight
   similarity_pooler.dense.bias
2021-05-25 11:16:10,136:INFO: ***** Running test *****
2021-05-25 11:16:10,136:INFO:   Num examples = 2990
2021-05-25 11:16:10,136:INFO:   Batch size = 32
2021-05-25 11:16:10,136:INFO:   Num steps = 94
2021-05-25 11:23:31,867:INFO: >>>  BLEU_1: 0.1410, BLEU_2: 0.0450, BLEU_3: 0.0142, BLEU_4: 0.0052
2021-05-25 11:23:31,877:INFO: >>>  METEOR: 0.0684, ROUGE_L: 0.1229, CIDEr: 0.0045

ArrowLuo commented on August 25, 2024

Hi @dawnlh, I suppose you evaluated the pretrained weights (zero-shot) directly instead of fine-tuning them. You should fine-tune with --do_train first.
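For reference, a fine-tuning invocation might look like the sketch below, assembled from the effective parameters in the log above. The script name `main_task_caption.py` is an assumption, and flag spellings should be checked against the UniVL README's canonical command:

```shell
# Hypothetical fine-tuning command for MSRVTT captioning.
# Paths and values are taken from the effective-parameter dump in the log;
# verify the script name and flags against the UniVL README before running.
python main_task_caption.py \
  --do_train --do_lower_case \
  --bert_model bert-base-uncased \
  --init_model weight/univl.pretrained.bin \
  --data_path data/msrvtt/MSRVTT_data.json \
  --features_path data/msrvtt/msrvtt_videos_features.pickle \
  --val_csv data/msrvtt/MSRVTT_JSFUSION_test.csv \
  --datatype msrvtt --task_type caption --stage_two \
  --visual_num_hidden_layers 6 --decoder_num_hidden_layers 3 \
  --batch_size 256 --lr 1e-4 --epochs 20 \
  --output_dir ckpts/ckpt_msrvtt_caption
```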

dawnlh commented on August 25, 2024

> Hi @dawnlh, I suppose you evaluated the pretrained weights (zero-shot) directly instead of fine-tuning them. You should fine-tune with --do_train first.

Yes, I evaluated the pretrained weights (zero-shot) directly. I tried to fine-tune the model but failed due to limited GPU memory (even with batch_size set to 1). Can you estimate how much GPU memory is needed to fine-tune the model? Or would it be convenient for you to share the weights for the captioning task (no transcript)?
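Two parameters already visible in the effective-parameter dump above (`fp16` and `gradient_accumulation_steps`) are the usual levers for fitting BERT-style training into less memory; whether they suffice for this model is untested here, and the script name is again an assumption:

```shell
# Hypothetical memory-saving flags. --fp16 requires NVIDIA Apex at the
# listed opt level; gradient accumulation splits each optimizer step
# across several smaller forward/backward passes, reducing peak memory
# (in many BERT-style codebases batch_size is divided by
# gradient_accumulation_steps internally).
python main_task_caption.py --do_train \
  --batch_size 256 --gradient_accumulation_steps 32 \
  --fp16 --fp16_opt_level O1 \
  # ...remaining data/model flags as in the log above
```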

ArrowLuo commented on August 25, 2024

Hi @dawnlh. We fine-tuned the model on 4 Tesla V100 GPUs. I am sorry that we cannot provide the fine-tuned weights.

dawnlh commented on August 25, 2024

Okay, thanks anyway~ I'll try to work around the GPU memory limitation. Another question: could you provide some instructions or code on using the fine-tuned model for video captioning on self-captured videos? I mean the input video processing (how to extract the same features as for the training set to serve as the model input) and the output visualization.

ArrowLuo commented on August 25, 2024

More information about the feature extractor can be found in the README. The caption results are saved in --output_dir.
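For context, the log above suggests the features file is a pickle mapping video IDs to per-second feature arrays of dimension 1024 (`--features_path ...pickle`, `--video_dim 1024`, `--feature_framerate 1`). A minimal sketch of writing a compatible file for self-captured videos follows; the dict-of-arrays layout and the key naming are inferences from the log, not confirmed by the repository, and the real features must come from the extractor described in the README:

```python
import pickle
import numpy as np

def save_features(feature_dict, path):
    """Save a {video_id: [n_seconds, 1024] float32 array} mapping,
    matching the format implied by --features_path and --video_dim 1024."""
    with open(path, "wb") as f:
        pickle.dump(feature_dict, f)

# Dummy stand-in for real extractor output: one 1024-dim feature per
# second of video (--feature_framerate 1). A 30-second clip gives a
# (30, 1024) array; real values would come from the video feature model.
features = {"my_video_0001": np.random.rand(30, 1024).astype(np.float32)}
save_features(features, "my_videos_features.pickle")

with open("my_videos_features.pickle", "rb") as f:
    loaded = pickle.load(f)
print(loaded["my_video_0001"].shape)  # (30, 1024)
```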

dawnlh commented on August 25, 2024

> More information about the feature extractor can be found in the README. The caption results are saved in --output_dir.

Got it! Thank you very much for your patient replies.
