
StreamingT2V

This repository is the official implementation of StreamingT2V.

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

arXiv preprint | Video | Project page



StreamingT2V is an advanced autoregressive technique that enables the creation of long videos featuring rich motion dynamics without any stagnation. It ensures temporal consistency throughout the video, aligns closely with the descriptive text, and maintains high frame-level image quality. Our demonstrations include successful examples of videos up to 1200 frames, spanning 2 minutes, and can be extended for even longer durations. Importantly, the effectiveness of StreamingT2V is not limited by the specific Text2Video model used, indicating that improvements in base models could yield even higher-quality videos.

News

Setup

  1. Clone this repository and enter it:
git clone https://github.com/Picsart-AI-Research/StreamingT2V.git
cd StreamingT2V/
  2. Install the requirements using Python 3.10 and CUDA >= 11.6:
conda create -n st2v python=3.10
conda activate st2v
pip install -r requirements.txt
  3. (Optional) Install FFmpeg if it is missing on your system:
conda install conda-forge::ffmpeg
  4. Download the weights from HF and put them into the t2v_enhanced/checkpoints directory (one possible download command is sketched below).
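Step 4 gives no command for fetching the weights. The snippet below is a minimal sketch of one way to do it with the huggingface_hub Python package; the repository id is a placeholder (the original link is not preserved here), so substitute the id shown on the HF weights page.

from huggingface_hub import snapshot_download

# Placeholder repository id -- replace it with the weights repository linked above.
WEIGHTS_REPO_ID = "<org>/<streamingt2v-weights>"

# Download the full weight snapshot into the directory expected by inference.py.
snapshot_download(repo_id=WEIGHTS_REPO_ID, local_dir="t2v_enhanced/checkpoints")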

Inference

For Text-to-Video

cd t2v_enhanced
python inference.py --prompt="A cat running on the street"

To use another base model, add the --base_model argument (e.g. --base_model=AnimateDiff). Run python inference.py --help to see all available options.

For Image-to-Video

cd t2v_enhanced
python inference.py --image=../__assets__/demo/fish.jpg --base_model=SVD

Inference Time

ModelscopeT2V as a Base Model

| Number of Frames | Inference Time for Faster Preview (256x256) | Inference Time for Final Result (720x720) |
|---|---|---|
| 24 frames | 40 seconds | 165 seconds |
| 56 frames | 75 seconds | 360 seconds |
| 80 frames | 110 seconds | 525 seconds |
| 240 frames | 340 seconds | 1610 seconds (~27 min) |
| 600 frames | 860 seconds | 5128 seconds (~85 min) |
| 1200 frames | 1710 seconds (~28 min) | 10225 seconds (~170 min) |

AnimateDiff as a Base Model

| Number of Frames | Inference Time for Faster Preview (256x256) | Inference Time for Final Result (720x720) |
|---|---|---|
| 24 frames | 50 seconds | 180 seconds |
| 56 frames | 85 seconds | 370 seconds |
| 80 frames | 120 seconds | 535 seconds |
| 240 frames | 350 seconds | 1620 seconds (~27 min) |
| 600 frames | 870 seconds | 5138 seconds (~85 min) |
| 1200 frames | 1720 seconds (~28 min) | 10235 seconds (~170 min) |

SVD as a Base Model

| Number of Frames | Inference Time for Faster Preview (256x256) | Inference Time for Final Result (720x720) |
|---|---|---|
| 24 frames | 80 seconds | 210 seconds |
| 56 frames | 115 seconds | 400 seconds |
| 80 frames | 150 seconds | 565 seconds |
| 240 frames | 380 seconds | 1650 seconds (~27 min) |
| 600 frames | 900 seconds | 5168 seconds (~86 min) |
| 1200 frames | 1750 seconds (~29 min) | 10265 seconds (~171 min) |

All measurements were conducted on an NVIDIA A100 (80 GB) GPU. Randomized blending is employed when the frame count exceeds 80; for randomized blending, chunk_size and overlap_size are set to 112 and 32, respectively.
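As a rough illustration of that chunking arithmetic (an assumption-laden sketch, not the repository's implementation), the snippet below tiles a long video into 112-frame chunks with a 32-frame overlap and picks a random seam inside each overlap at which to switch from one chunk to the next.

import random

def chunk_starts(num_frames, chunk_size=112, overlap_size=32):
    # Each new chunk advances by chunk_size - overlap_size frames.
    stride = chunk_size - overlap_size
    starts = list(range(0, max(num_frames - chunk_size, 0) + 1, stride))
    # Append a final chunk if the last one does not reach the end of the video.
    if starts[-1] + chunk_size < num_frames:
        starts.append(num_frames - chunk_size)
    return starts

def random_seams(starts, chunk_size=112, seed=0):
    # For every pair of neighbouring chunks, choose a random frame index inside
    # their shared overlap; frames before the seam would come from the earlier
    # chunk, frames after it from the later chunk.
    rng = random.Random(seed)
    return [rng.randrange(nxt, prev + chunk_size) for prev, nxt in zip(starts, starts[1:])]

starts = chunk_starts(240)   # the 240-frame row above -> [0, 80, 128], i.e. 3 chunks
print(starts, random_seams(starts))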

Gradio

The same functionality is also available as a Gradio demo:

cd t2v_enhanced
python gradio_demo.py

Results

Detailed results can be found in the Project page.

License

Our code is published under the CreativeML Open RAIL-M license.

We include ModelscopeT2V, AnimateDiff, and SVD in the demo for research purposes and to demonstrate the flexibility of the StreamingT2V framework to incorporate different T2V/I2V models. For commercial usage of these components, please refer to their original licenses.

BibTeX

If you use our work in your research, please cite our publication:

@article{henschel2024streamingt2v,
  title={StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text},
  author={Henschel, Roberto and Khachatryan, Levon and Hayrapetyan, Daniil and Poghosyan, Hayk and Tadevosyan, Vahram and Wang, Zhangyang and Navasardyan, Shant and Shi, Humphrey},
  journal={arXiv preprint arXiv:2403.14773},
  year={2024}
}
