Video Event Extraction via Tracking Visual States of Arguments

Video Event Extraction via Tracking Visual States of Arguments
Guang Yang, Manling Li, Jiajie Zhang, Xudong Lin, Shih-Fu Chang, Heng Ji
AAAI 2023

Video event extraction aims to detect salient events from a video and identify the arguments for each event as well as their semantic roles. Inspired by the definition of events as changes of states, we propose a novel framework to detect video events by tracking the changes in the visual states of all involved arguments, which are expected to provide the most informative evidence for the extraction of video events. In order to capture the visual state changes of arguments, we decompose them into changes in pixels within objects, displacements of objects, and interactions among multiple arguments. We further propose Object State Embedding, Object Motion-aware Embedding and Argument Interaction Embedding to encode and track these changes respectively.

This repository includes:

  1. Instructions to install dependencies and to download and process the VidSitu dataset.
  2. Code to run all experiments provided in the paper along with log files.

Download

Please see DATA_PREP.md for detailed instructions on downloading and setting up the VidSitu dataset.

Installation

Please see INSTALL.md for detailed installation instructions.

Training

  • Basic usage is CUDA_VISIBLE_DEVICES=$GPUS python main_dist.py "experiment_name" --arg1=val1 --arg2=val2, where the available arguments (arg1, arg2, ...) are defined in configs/vsitu_cfg.yml.

  • Set $GPUS=0 for single-GPU training. For multi-GPU training via PyTorch Distributed Data Parallel, use $GPUS=0,1,2,3.

  • The YML config has a hierarchical structure, and nested keys are addressed with dot notation. For instance, if you want to change beam_size under gen, which in the YML file looks like

    gen:
        beam_size: 1
    

    you can pass --gen.beam_size=5 (a combined example is given right after this list).

  • Sometimes it might be easier to directly change the default setting in configs/vsitu_cfg.yml itself.

  • To keep the code modular, some configurations are set in code/extended_config.py as well.

  • All model choices are available under code/mdl_selector.py
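
For instance, combining the points above, a single-GPU run that overrides the nested gen.beam_size key could look like the following sketch (the experiment name "beam5_example" is made up for illustration):

CUDA_VISIBLE_DEVICES=0 python main_dist.py "beam5_example" --gen.beam_size=5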

Verb Classification

Here are the bash commands to train our three models:

  • OSE-pixel + OME:

    CUDA_VISIBLE_DEVICES=0,1,2,3 python main_dist.py "OSE-pixel_OME" --mdl.mdl_name="sf_ec_cat" \
        --train.bs=8 --train.gradient_accumulation=1 --train.nw=8 --train.bsv=16 --train.lr=3e-5 --mdl.C=0\
        --train.resume=False --mdl.load_sf_pretrained=True  \
        --do_dist=True
    
  • OSE-pixel/disp + OME:

    CUDA_VISIBLE_DEVICES=0,1,2,3 python main_dist.py "OSE-pixel-disp_OME" --mdl.mdl_name="sf_ec_cat" \
        --train.bs=8 --train.gradient_accumulation=1 --train.nw=8 --train.bsv=16 --train.lr=3e-5 --mdl.C=128\
        --train.resume=False --mdl.load_sf_pretrained=True  \
        --do_dist=True
    
  • OSE-pixel/disp + OME + OIE:

    CUDA_VISIBLE_DEVICES=0,1,2,3 python main_dist.py "OSE-pixel-disp_OME_OIE" --mdl.mdl_name="sf_ec_rel" \
        --train.bs=4 --train.gradient_accumulation=1 --train.nw=8 --train.bsv=16 --train.lr=3e-5 --mdl.C=128\
        --train.resume=False --mdl.load_sf_pretrained=True  \
        --do_dist=True
    

After training a verb classification model, run the following command to extract features for all videos:

CUDA_VISIBLE_DEVICES=0,1,2,3 python vidsitu_code/feat_extractor.py --mdl_resume_path='tmp/models/OSE-pixel_OME/best_mdl.pth' \
	--mdl_name_used='OSE-pixel_OME' --mdl.mdl_name='sf_ec_cat' --is_cu=False --mdl.C=0 \
	--train.bsv=16 --train.nwv=16
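
The command above extracts features with the "OSE-pixel + OME" model. As a sketch (not a script shipped with the repo), the same extraction can be looped over all three trained models, assuming their experiment names, --mdl.mdl_name choices, and --mdl.C values carry over unchanged from the training commands above:

for exp in OSE-pixel_OME OSE-pixel-disp_OME OSE-pixel-disp_OME_OIE
do
    # per-model settings, copied from the corresponding training commands
    case $exp in
        OSE-pixel_OME)          mdl_name="sf_ec_cat"; C=0 ;;
        OSE-pixel-disp_OME)     mdl_name="sf_ec_cat"; C=128 ;;
        OSE-pixel-disp_OME_OIE) mdl_name="sf_ec_rel"; C=128 ;;
    esac
    CUDA_VISIBLE_DEVICES=0,1,2,3 python vidsitu_code/feat_extractor.py \
        --mdl_resume_path="tmp/models/${exp}/best_mdl.pth" \
        --mdl_name_used="${exp}" --mdl.mdl_name="${mdl_name}" --is_cu=False --mdl.C="${C}" \
        --train.bsv=16 --train.nwv=16
done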

Semantic Role Prediction

For semantic role prediction, the random seed can be set explicitly. To reproduce our results, run with each of the seeds {17, 33, 66, 74, 98, 137, 265, 314, 590, 788}.

For example, to reproduce the experiments with "OSE-pixel + OME" features, the commands are:

for seed in 17 33 66 74 98 137 265 314 590 788
do
    CUDA_VISIBLE_DEVICES=1 python main_dist.py "OSE-pixel_OME_arg_${seed}" --task_type="vb_arg" \
        --train.bs=16 --train.bsv=16 --mdl.mdl_name="sfpret_txe_txd_vbarg" \
        --tx_dec.decoder_layers=3 --tx_dec.encoder_layers=3 --mdl.C=0 \
        --ds.vsitu.vsit_frm_feats_dir="./data/vsitu_vid_feats/OSE-pixel_OME" \
        --debug_mode=False --seed=$seed
done
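
The loop above uses the "OSE-pixel + OME" features (--mdl.C=0). Below is an analogous sketch for the other two feature sets; it assumes the feature directories are named after the --mdl_name_used value from the extraction step and that --mdl.C should match the corresponding verb model (128 for both), which may need adjusting:

for feats in OSE-pixel-disp_OME OSE-pixel-disp_OME_OIE
do
    for seed in 17 33 66 74 98 137 265 314 590 788
    do
        CUDA_VISIBLE_DEVICES=1 python main_dist.py "${feats}_arg_${seed}" --task_type="vb_arg" \
            --train.bs=16 --train.bsv=16 --mdl.mdl_name="sfpret_txe_txd_vbarg" \
            --tx_dec.decoder_layers=3 --tx_dec.encoder_layers=3 --mdl.C=128 \
            --ds.vsitu.vsit_frm_feats_dir="./data/vsitu_vid_feats/${feats}" \
            --debug_mode=False --seed=$seed
    done
done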

Logging

Logs are stored inside the tmp/ directory. When you run the code with experiment name $exp_name, the following are stored:

  • txt_logs/$exp_name.txt: the config used and the training and validation losses after every epoch.
  • models/$exp_name.pth: the model, optimizer, and scheduler states, the accuracy, and the number of epochs and iterations completed. Only the best model up to the current epoch is stored.
  • ext_logs/$exp_name.txt: the logger.debug outputs recorded via Python's logging module, mainly used for debugging.
  • predictions: the validation outputs of the current best model.

Logs are also stored using MLflow. These can be uploaded to other experiment trackers such as neptune.ai or wandb for better visualization of results.
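
For example, for an experiment named OSE-pixel_OME, the text logs can be followed during training and the MLflow runs browsed locally; the mlflow ui command below assumes MLflow's default local tracking store:

tail -f tmp/txt_logs/OSE-pixel_OME.txt   # training/validation losses per epoch
mlflow ui                                # inspect the MLflow-tracked runs in a browser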
