Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis

Abstract: Video-to-Video synthesis (Vid2Vid) has achieved remarkable results in generating a photo-realistic video from a sequence of semantic maps. However, this pipeline suffers from high computational cost and long inference latency, which largely depend on two essential factors: 1) network architecture parameters, 2) the sequential data stream. Recently, the parameters of image-based generative models have been significantly compressed via more efficient network architectures. Nevertheless, existing methods mainly focus on slimming network architectures and ignore the size of the sequential data stream. Moreover, due to the lack of temporal coherence, image-based compression is not sufficient for the compression of the video task. In this paper, we present a spatial-temporal compression framework, Fast-Vid2Vid, which focuses on the data aspects of generative models. It makes the first attempt along the time dimension to reduce computational resources and accelerate inference. Specifically, we compress the input data stream spatially and reduce the temporal redundancy. After the proposed spatial-temporal knowledge distillation, our model can synthesize key-frames using the low-resolution data stream. Finally, Fast-Vid2Vid interpolates intermediate frames by motion compensation with slight latency. On standard benchmarks, Fast-Vid2Vid achieves near-real-time performance of around 20 FPS and saves around 8× computational cost on a single V100 GPU.
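At a high level, inference proceeds in three steps: spatially compress the input stream, synthesize key-frames only, and fill in the remaining frames by motion compensation. The pseudocode below is a minimal sketch of this idea; the names generator and motion_compensate are hypothetical callables, not the repository's actual API.

    # Minimal sketch of the Fast-Vid2Vid inference idea (hypothetical names,
    # not the repository's API). generator is the compressed student model;
    # motion_compensate fills in a frame between two synthesized key-frames.
    import torch.nn.functional as F

    def fast_vid2vid_infer(semantic_maps, generator, motion_compensate, stride=4):
        # 1) Spatial compression: run on a low-resolution data stream.
        low_res = [F.interpolate(s, scale_factor=0.5, mode="bilinear",
                                 align_corners=False) for s in semantic_maps]
        # 2) Temporal compression: synthesize only every stride-th frame.
        key_idx = list(range(0, len(low_res), stride))
        if key_idx[-1] != len(low_res) - 1:
            key_idx.append(len(low_res) - 1)       # always end on a key-frame
        keys = {t: generator(low_res[t]) for t in key_idx}
        # 3) Motion compensation: interpolate the skipped in-between frames.
        video = []
        for t in range(len(low_res)):
            if t in keys:
                video.append(keys[t])
            else:
                lo = max(i for i in key_idx if i < t)
                hi = min(i for i in key_idx if i > t)
                video.append(motion_compensate(keys[lo], keys[hi],
                                               (t - lo) / (hi - lo)))
        return video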

Long Zhuo1, Guangcong Wang2, Shikai Li3, Wayne Wu1,3, Ziwei Liu2
1Shanghai AI Laboratory, 2S-Lab, Nanyang Technological University, 3SenseTime Research

In European Conference on Computer Vision (ECCV), 2022

Fast Video-to-Video Translation

Performance

Prerequisites

  • Linux or macOS
  • Python 3
  • NVIDIA GPU + CUDA cuDNN
  • PyTorch >= 1.0
  • ffmpeg toolbox >= 4.0
  • OpenCV

Getting Started

Inference Environment

We recommend using a virtual environment (conda) to run the code easily.

conda create -n fvid python=3.7
conda activate fvid
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1 -c pytorch
conda install ffmpeg==4.0.2
pip install opencv-python dominate scipy tqdm matplotlib scikit-image
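
After installation, a quick sanity check (our own snippet, not part of the repository) confirms that the key dependencies import correctly and that CUDA is visible:

    # Sanity check for the conda environment (not part of the repository).
    import torch
    import torchvision
    import cv2

    print("PyTorch:", torch.__version__)            # expected: 1.7.1
    print("torchvision:", torchvision.__version__)  # expected: 0.8.2
    print("OpenCV:", cv2.__version__)
    print("CUDA available:", torch.cuda.is_available())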

Download examples

  • Please first download the example datasets from this link.

Download pretrained models

  • Face

    • Download the pretrained model from this link and unzip it in the checkpoints folder.
    • To test the model, run:
      bash scripts/face/test.sh
      
  • Cityscapes

    • Download the pretrained model from this link and unzip it in the checkpoints folder.
    • To test the model, run:
      bash scripts/street/test.sh
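
If a test script writes individual frames rather than a finished video, they can be stitched into a 20 FPS clip with OpenCV (installed above). This is a small convenience sketch; the results path and frame naming below are assumptions about the output layout, not something the repository guarantees.

    # Assemble synthesized frames into a 20 FPS video (convenience sketch;
    # the glob pattern below is an assumed output layout).
    import glob
    import cv2

    frames = sorted(glob.glob("results/face/test_latest/*.png"))  # assumed path
    assert frames, "no frames found - adjust the glob pattern to your output"
    h, w = cv2.imread(frames[0]).shape[:2]
    writer = cv2.VideoWriter("output.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                             20, (w, h))
    for path in frames:
        writer.write(cv2.imread(path))
    writer.release()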
      

Training

Installation

Please first install FlowNet2 into models/flownet2_pytorch/.
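
FlowNet2 provides the optical flow used by the training objectives. As a self-contained illustration of flow-based warping, the mechanism behind motion compensation, the sketch below substitutes OpenCV's Farneback flow for FlowNet2; it is a stand-in for exposition, not the repository's code.

    # Illustration of flow-based warping (the core of motion compensation),
    # using OpenCV's Farneback flow as a lightweight stand-in for FlowNet2.
    import cv2
    import numpy as np

    def warp_to_next(prev_bgr, prev_gray, next_gray):
        # Backward warping: compute flow from the target (next) frame back to
        # the source (prev) frame, so each target pixel samples the source.
        flow = cv2.calcOpticalFlowFarneback(next_gray, prev_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = next_gray.shape
        xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                             np.arange(h, dtype=np.float32))
        return cv2.remap(prev_bgr, xs + flow[..., 0], ys + flow[..., 1],
                         interpolation=cv2.INTER_LINEAR)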

Dataset

  • Cityscapes
    • Please download the full-size dataset (~300 GB) from the official website.
    • We pre-process the corresponding semantic maps (A) and instance maps (Inst) with well-trained models.
  • Face
    • We adopt the FaceForensics dataset. The keypoints of each frame are generated by landmark detection; we then interpolate the keypoints to obtain face edges (a sketch of this step follows this list).
  • Pose
    • We use dancing videos from YouTube, following vid2vid. We then apply DensePose and OpenPose to estimate the pose of each frame. Due to privacy issues, we do not release the pretrained model for Pose2Body, which is kept the same as in vid2vid.
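
As referenced in the Face item above, sparse landmarks are interpolated into continuous face edges. The sketch below shows one way to do this with scipy splines and OpenCV (both installed above); the part grouping follows the standard 68-point layout, and the whole function is our illustration rather than the repository's preprocessing code.

    # Illustration: turn 68-point face landmarks into an edge map by spline
    # interpolation (our sketch; the repository's preprocessing may differ).
    import numpy as np
    import cv2
    from scipy.interpolate import splprep, splev

    def draw_edge(canvas, pts, closed=False):
        pts = np.asarray(pts, dtype=np.float64)
        if closed:                                  # periodic splines need a
            pts = np.vstack([pts, pts[:1]])         # coinciding end point
        tck, _ = splprep([pts[:, 0], pts[:, 1]], s=0,
                         k=min(3, len(pts) - 1), per=int(closed))
        dense = np.stack(splev(np.linspace(0, 1, 100), tck), axis=1)
        cv2.polylines(canvas, [dense.astype(np.int32)], closed, 255, 1)

    def edges_from_landmarks(landmarks, height, width):
        # landmarks: (68, 2) array from any 68-point landmark detector.
        canvas = np.zeros((height, width), np.uint8)
        parts = {"jaw": (slice(0, 17), False),
                 "right_brow": (slice(17, 22), False),
                 "left_brow": (slice(22, 27), False),
                 "nose_bridge": (slice(27, 31), False),
                 "nose_base": (slice(31, 36), False),
                 "right_eye": (slice(36, 42), True),
                 "left_eye": (slice(42, 48), True),
                 "outer_lip": (slice(48, 60), True)}
        for sl, closed in parts.values():
            draw_edge(canvas, landmarks[sl], closed=closed)
        return canvas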

Training with Face Dataset

  • Pre-train the teacher model.

    bash scripts/face/train_teacher.sh
    
  • Train the spatially low-demand generator with spatial knowledge distillation.

    bash scripts/face/train_skd.sh
    
  • Train the part-time student generator with temporal knowledge distillation (a minimal distillation sketch follows this list).

    bash scripts/face/train_tkd.sh
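
Both distillation stages train a smaller student against the frozen teacher. The following is a minimal sketch of an output-matching distillation step, assuming hypothetical student and teacher callables; Fast-Vid2Vid's actual objectives include further loss terms (e.g., temporal ones for the part-time student).

    # Minimal sketch of one knowledge-distillation step (hypothetical models;
    # the paper's full objectives contain additional spatial/temporal terms).
    import torch
    import torch.nn.functional as F

    def distill_step(student, teacher, optimizer, semantic_map):
        with torch.no_grad():
            target = teacher(semantic_map)          # frozen full-size teacher
        low_res = F.interpolate(semantic_map, scale_factor=0.5,
                                mode="bilinear", align_corners=False)
        pred = student(low_res)                     # spatially low-demand student
        if pred.shape[-2:] != target.shape[-2:]:
            pred = F.interpolate(pred, size=target.shape[-2:],
                                 mode="bilinear", align_corners=False)
        loss = F.l1_loss(pred, target)              # match the teacher's output
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()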
    

Training with Cityscapes Dataset

  • Pre-train the teacher model.

    bash scripts/street/train_teacher.sh
    
  • Train the spatially low-demand generator with spatial knowledge distillation.

    bash scripts/street/train_skd.sh
    
  • Train the part-time student generator with temporal knowledge distillation.

    bash scripts/street/train_tkd.sh
    
  • Note that the resolution of our training data is 256 × 512, as we only use the first-scale generator. If needed, one can follow the original vid2vid's coarse-to-fine manner for higher resolutions. For example, first train a 1024× (or higher-resolution) teacher model. Then, the structure of the refinement network needs to be converted to a spatially low-demand network (refer to networks.py). Next, train the network with spatial-temporal knowledge distillation.

    • (Optional) Knowledge distillation may be applied starting from the first-scale network.

Citation

If you find this useful for your research, please cite our paper.

@inproceedings{zhuo2022fast,
   author    = {Zhuo, Long and Wang, Guangcong and Li, Shikai and Wu, Wayne and Liu, Ziwei},
   title     = {Fast-Vid2Vid: Spatial-Temporal Compression for Video-to-Video Synthesis},
   booktitle = {European Conference on Computer Vision (ECCV)},   
   year      = {2022},
}

Acknowledgments

This code is based on the Vid2Vid codebase.
