
open-sora-plan's Introduction

This project aims to create a simple and scalable repo to reproduce Sora (OpenAI, but we prefer to call it "ClosedAI"). We hope the open-source community will contribute to this project. Pull requests are welcome! The current code supports complete training and inference on the Huawei Ascend AI computing system, and models trained on Huawei Ascend can produce video quality comparable to industry standards.

This project, jointly initiated by the PKU-Rabbitpre AIGC Joint Lab, hopes to reproduce Sora through the power of the open-source community. The current version is still far from the goal and needs continuous improvement and rapid iteration; pull requests are welcome! The code also supports complete training and inference on a domestic AI computing system (Huawei Ascend), and models trained on Ascend can produce video quality on par with the industry.

Community: Slack · WeChat · Twitter

If you like our project, please give us a star ⭐ on GitHub for the latest updates.

📣 News

  • [2024.07.24] 🔥🔥🔥 v1.2.0 is here! It uses a 3D full-attention architecture instead of 2+1D. We released a true 3D video diffusion model trained on 4s 720p data. Check out our latest report.
  • [2024.05.27] 🎉 We are launching Open-Sora Plan v1.1.0, which significantly improves video quality and length, and is fully open source! Please check out our latest report. Thanks to ShareGPT4Video's capability to annotate long videos.
  • [2024.04.09] 🤝 Excited to share our latest exploration on metamorphic time-lapse video generation: MagicTime, which learns real-world physics knowledge from time-lapse videos.
  • [2024.04.07] 🎉🎉🎉 Today, we are thrilled to present Open-Sora-Plan v1.0.0, which significantly enhances video generation quality and text control capabilities. See our report. Thanks to HUAWEI NPU for supporting us.
  • [2024.03.27] 🚀🚀🚀 We release the report of VideoCausalVAE, which supports both images and videos. We present our reconstructed video in this demonstration as follows. The text-to-video model is on the way.
  • [2024.03.01] 🤗 We launched a plan to reproduce Sora, called Open-Sora Plan! Welcome to watch 👀 this repository for the latest updates.

😍 Gallery

93×1280×720 Text-to-Video Generation. The video has been compressed for playback on GitHub.

video_24fps_compress.mp4

😮 Highlights

Open-Sora Plan shows excellent performance in video generation.

🔥 High-performance CausalVideoVAE with lower training cost

  • High compression ratio with excellent performance: videos are compressed 256× (4× in time and 8×8 in space). Causal convolution supports joint inference of images and videos, yet training requires only one node.

🚀 Video Diffusion Model based on 3D attention, joint learning of spatiotemporal features.

  • With a 3D full attention architecture instead of a 2+1D model, 3D attention can better capture joint spatial and temporal features.

🤗 Demo

Gradio Web UI

We highly recommend trying out our web demo with the following command.

python -m opensora.serve.gradio_web_server --model_path "path/to/model" --ae_path "path/to/causalvideovae"

ComfyUI

Coming soon...

🐳 Resource

| Version | Architecture | Diffusion Model | CausalVideoVAE | Data |
| --- | --- | --- | --- | --- |
| v1.2.0 | 3D | 93x720p, 29x720p[1], 93x480p[1,2] | Anysize | Annotations |
| v1.1.0 | 2+1D | 221x512x512, 65x512x512 | Anysize | Data and Annotations |
| v1.0.0 | 2+1D | 65x512x512, 65x256x256, 17x256x256 | Anysize | Data and Annotations |

[1] Please note that the weights for v1.2.0 29×720p and 93×480p were trained on Panda70M and have not undergone final high-quality data fine-tuning, so they may produce watermarks.

[2] We fine-tuned the 93×720p weights for 3.5k steps to obtain 93×480p, for community research use.

Warning

🚨 For version 1.2.0, we no longer support 2+1D models.

⚙️ Requirements and Installation

  1. Clone this repository and navigate to the Open-Sora-Plan folder:
git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
  2. Install the required packages. We recommend the following environment:
  • Python >= 3.8
  • PyTorch >= 2.1.0
  • CUDA Version >= 11.7
conda create -n opensora python=3.8 -y
conda activate opensora
pip install -e .
  3. Install additional packages for training:
pip install -e ".[train]"
  4. Install optional requirements such as static type checking:
pip install -e '.[dev]'
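
Optionally, you can sanity-check the new environment before training (this assumes a CUDA-capable machine):

# Should print the installed PyTorch version and True if CUDA is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"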

🗝️ Training & Validating

🗜️ CausalVideoVAE

Data prepare

Organizing the training data is easy: simply place all videos, recursively, under one directory. This makes training with multiple datasets more convenient. A quick way to list the files that will be picked up is shown after the layout below.

Training Dataset
|——sub_dataset1
    |——sub_sub_dataset1
        |——video1.mp4
        |——video2.mp4
        ......
    |——sub_sub_dataset2
        |——video3.mp4
        |——video4.mp4
        ......
|——sub_dataset2
    |——video5.mp4
    |——video6.mp4
    ......
|——video7.mp4
|——video8.mp4
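
As a quick check (not part of the repo's tooling), you can list the video files such a layout provides, assuming .mp4 inputs and the root directory name from the layout above:

# List every .mp4 found recursively under the dataset root.
find "Training Dataset" -type f -name "*.mp4"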

Training

bash scripts/causalvae/train.sh

We introduce the important args for training. An illustrative command combining these flags is sketched after the table.

| Argparse | Usage |
| --- | --- |
| Training size | |
| --num_frames | The number of frames used from each training video |
| --resolution | The resolution of the input to the VAE |
| --batch_size | The local batch size on each GPU |
| --sample_rate | The frame interval used when loading training videos |
| Data processing | |
| --video_path | /path/to/dataset |
| Load weights | |
| --model_config | /path/to/config.json. The model config of the VAE; use this parameter if you want to train from scratch. |
| --pretrained_model_name_or_path | A directory containing a model checkpoint and its config. This loads only the weights, not the optimizer state. |
| --resume_from_checkpoint | /path/to/checkpoint. Resumes training from the checkpoint, including both the weights and the optimizer state. |
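
For illustration, the flags above might be combined as follows. This is a minimal sketch, not the script's verbatim contents: train_causalvae_entrypoint.py is a placeholder name for the trainer that scripts/causalvae/train.sh wraps, and the values are examples only.

# Placeholder entry point and example values; adapt the paths to your setup.
# You could instead pass --pretrained_model_name_or_path or --resume_from_checkpoint to continue from a checkpoint.
python train_causalvae_entrypoint.py \
    --video_path /path/to/dataset \
    --num_frames 25 \
    --resolution 256 \
    --batch_size 1 \
    --sample_rate 2 \
    --model_config /path/to/config.json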

Inference

bash scripts/causalvae/rec_video.sh

We introduce the important args for inference. An illustrative command is sketched after the table.

| Argparse | Usage |
| --- | --- |
| Output video size | |
| --num_frames | The number of frames of the generated video |
| --height | The height of the generated video |
| --width | The width of the generated video |
| Data processing | |
| --video_path | The path to the original video |
| --rec_path | The path to the reconstructed video |
| Load weights | |
| --ae_path | /path/to/model_dir. A directory containing the VAE checkpoint used for inference and its model config.json |
| Other | |
| --enable_tiling | Use tiling to handle high-resolution, long-duration videos |
| --save_memory | Reduce memory usage during inference, at a slight cost to quality |
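
As an illustration, the reconstruction flags might be combined like this. It is a sketch only: rec_video_entrypoint.py is a placeholder for the tool that scripts/causalvae/rec_video.sh wraps, and the sizes are examples.

# Placeholder entry point; reconstruct one video with tiling and memory saving enabled.
python rec_video_entrypoint.py \
    --ae_path /path/to/causalvideovae \
    --video_path /path/to/original.mp4 \
    --rec_path /path/to/reconstructed.mp4 \
    --num_frames 65 \
    --height 480 \
    --width 640 \
    --enable_tiling \
    --save_memory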

Evaluation

For evaluation, you should save the original video clips by using --output_origin.

bash scripts/causalvae/prepare_eval.sh

We introduce the important args for preparing the evaluation data.

| Argparse | Usage |
| --- | --- |
| Output video size | |
| --num_frames | The number of frames of the generated videos |
| --resolution | The resolution of the generated videos |
| Data processing | |
| --real_video_dir | The directory of the original videos |
| --generated_video_dir | The directory of the generated videos |
| Load weights | |
| --ckpt | /path/to/model_dir. A directory containing the VAE checkpoint used for inference and its model config |
| Other | |
| --enable_tiling | Use tiling to handle high-resolution, long-duration videos |
| --output_origin | Also output the original video clips fed into the VAE |

Then we run the evaluation. We introduce the important args in the evaluation script; an example invocation is sketched after the table.

bash scripts/causalvae/eval.sh

| Argparse | Usage |
| --- | --- |
| --metric | The metric, such as psnr, ssim, lpips |
| --real_video_dir | The directory of the original videos |
| --generated_video_dir | The directory of the generated videos |
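
For example, PSNR could be computed like this (a sketch only; eval_entrypoint.py stands in for whatever scripts/causalvae/eval.sh wraps):

# Placeholder entry point; compare original clips (saved via --output_origin) against reconstructions.
python eval_entrypoint.py \
    --metric psnr \
    --real_video_dir /path/to/origin_clips \
    --generated_video_dir /path/to/rec_clips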

📜 Text-to-Video

Data prepare

We use a data.txt file to specify all the training data. Each line in the file consists of DATA_ROOT and DATA_JSON. An example data.txt is as follows.

/path/to/data_root_1,/path/to/data_json_1.json
/path/to/data_root_2,/path/to/data_json_2.json
...

Then, we introduce the format of the annotation json file. The absolute data path is the concatenation of DATA_ROOT and the "path" field in the annotation json file.

For image

The format of the image annotation file is as follows.

[
  {
    "path": "00168/001680102.jpg",
    "cap": [
      "xxxxx."
    ],
    "resolution": {
      "height": 512,
      "width": 683
    }
  },
  ...
]

For video

The format of the video annotation file is as follows. For more details, refer to the HF dataset. A helper for extracting these fields from your own videos is shown after the example.

[
  {
    "path": "panda70m_part_5565/qLqjjDhhD5Q/qLqjjDhhD5Q_segment_0.mp4",
    "cap": [
      "A man and a woman are sitting down on a news anchor talking to each other."
    ],
    "resolution": {
      "height": 720,
      "width": 1280
    },
    "fps": 29.97002997002997,
    "duration": 11.444767
  },
  ...
]
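
If you are building such annotations for your own videos, the resolution, fps, and duration fields can be read with ffprobe (a general-purpose helper, not part of this repo):

# Print width/height, frame rate (as a fraction, e.g. 30000/1001), and duration for one video.
ffprobe -v error -select_streams v:0 \
    -show_entries stream=width,height,r_frame_rate:format=duration \
    -of json /path/to/video.mp4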

Training

bash scripts/text_condition/gpu/train_t2v.sh

We introduce some key parameters so you can customize your training process; an illustrative launch command is sketched after the table.

| Argparse | Usage |
| --- | --- |
| Training size | |
| --num_frames 61 | To train videos of different durations, e.g. 29, 61, 93, 125... |
| --max_height 640 | To train videos of different resolutions |
| --max_width 480 | To train videos of different resolutions |
| Data processing | |
| --data /path/to/data.txt | Specify your training data. |
| --speed_factor 1.25 | To accelerate videos by 1.25×. |
| --drop_short_ratio 1.0 | If you do not want to train on videos of dynamic durations, discard all video data whose frame count is not equal to --num_frames. |
| --group_frame | If you want to train with videos of dynamic durations, we highly recommend specifying --group_frame as well. It improves computational efficiency during training. |
| Multi-stage transfer learning | |
| --interpolation_scale_h 1.0 | When training a base model such as 240p (--max_height 240, --interpolation_scale_h 1.0) and you want to initialize a higher-resolution model such as 480p (height 480) from the 240p weights, adjust --max_height 480 and --interpolation_scale_h 2.0, and set --pretrained to your 240p weights path (path/to/240p/xxx.safetensors). |
| --interpolation_scale_w 1.0 | Same as --interpolation_scale_h, but for the width. |
| Load weights | |
| --pretrained | Typically used for loading pretrained weights across stages, such as using 240p weights to initialize 480p training, or when switching datasets and you do not want the previous optimizer state. |
| --resume_from_checkpoint | Resumes training from the latest checkpoint in --output_dir. Typically we set --resume_from_checkpoint="latest", which is useful in cases of unexpected interruption during training. |
| Sequence parallelism | |
| --sp_size 8 --train_sp_batch_size 2 | Runs a batch size of 2 across 8 GPUs (8 GPUs on the same node). |
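
For illustration, here is how the flags above might be combined for a 480p stage initialized from 240p weights, with sequence parallelism. This is a minimal sketch assuming a torchrun-style launcher; train_t2v_entrypoint.py is a placeholder for the trainer that scripts/text_condition/gpu/train_t2v.sh wraps, and the values are examples only.

# Placeholder entry point and example values; all flags are documented in the table above.
torchrun --nproc_per_node=8 train_t2v_entrypoint.py \
    --data /path/to/data.txt \
    --num_frames 93 \
    --max_height 480 \
    --max_width 640 \
    --interpolation_scale_h 2.0 \
    --interpolation_scale_w 2.0 \
    --pretrained path/to/240p/xxx.safetensors \
    --drop_short_ratio 1.0 \
    --speed_factor 1.25 \
    --sp_size 8 \
    --train_sp_batch_size 2
# To resume an interrupted run, add --resume_from_checkpoint="latest" (note that it overrides --pretrained).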

Warning

🚨 We have two ways to load weights: `--pretrained` and `--resume_from_checkpoint`. The latter will override the former.

Inference

We provide multiple inference scripts to support various requirements. We recommend the configuration `--guidance_scale 7.5 --num_sampling_steps 100 --sample_method EulerAncestralDiscrete` for sampling, as sketched below.
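
A sketch of the recommended settings (the actual entry point lives inside the scripts below; sample_t2v_entrypoint.py and the model/VAE paths are placeholders):

# Placeholder entry point; recommended sampling configuration from above.
python sample_t2v_entrypoint.py \
    --model_path /path/to/model \
    --ae_path /path/to/causalvideovae \
    --guidance_scale 7.5 \
    --num_sampling_steps 100 \
    --sample_method EulerAncestralDiscrete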

We report inference speed on H100 GPUs.

| Size | 1 GPU | 8 GPUs (sp) |
| --- | --- | --- |
| 29×720p | 420s/100 steps | 80s/100 steps |
| 93×720p | 3400s/100 steps | 450s/100 steps |

🖥️ 1 GPU

If you only have one GPU, it will perform inference on each sample sequentially, one at a time.

bash scripts/text_condition/gpu/sample_t2v.sh

🖥️🖥️ Multi-GPUs

If you want to batch-infer a large number of samples, each GPU will infer one sample at a time.

bash scripts/text_condition/gpu/sample_t2v_ddp.sh

🖥️🖥️ Multi-GPUs & Sequence Parallelism

If you want to quickly infer one sample, it will utilize all GPUs simultaneously to infer that sample.

bash scripts/text_condition/gpu/sample_t2v_sp.sh

🖼️ Image-to-Video

Data prepare

Coming soon...

Training

Coming soon...

Inference

Coming soon...

💡 How to Contribute

We greatly appreciate your contributions to the Open-Sora Plan open-source community and helping us make it even better than it is now!

For more details, please refer to the Contribution Guidelines.

👍 Acknowledgement

  • Latte: A wonderful 2+1D video generation model.
  • PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis.
  • ShareGPT4Video: Improving Video Understanding and Generation with Better Captions.
  • VideoGPT: Video Generation using VQ-VAE and Transformers.
  • DiT: Scalable Diffusion Models with Transformers.
  • FiT: Flexible Vision Transformer for Diffusion Model.
  • Positional Interpolation: Extending Context Window of Large Language Models via Positional Interpolation.

🔒 License

✏️ Citing

BibTeX

@software{pku_yuan_lab_and_tuzhan_ai_etc_2024_10948109,
  author       = {PKU-Yuan Lab and Tuzhan AI etc.},
  title        = {Open-Sora-Plan},
  month        = apr,
  year         = 2024,
  publisher    = {GitHub},
  doi          = {10.5281/zenodo.10948109},
  url          = {https://doi.org/10.5281/zenodo.10948109}
}

Latest DOI


🤝 Community contributors

open-sora-plan's People

Contributors

alonzoleeeooo, anapple-hub, chaojie, clownrat6, cxh0519, digger-yu, glgh, helios-fr, howardli1984, jialin-zhao, jpthu17, junwuzhang19, kabachuha, khan-yin, linb203, linzy19, liuhanchen-github, luo3300612, nameless1117, qqingzheng, rain305f, samithuang, simonleegit, sysuyy, touale, tzy010822, yanyang1024, ytimed2020, yuanli2333, yunyangge


open-sora-plan's Issues

Videogpt 1.0 requires torch~=1.7.1

Hi, great work!

I have a version conflict between videogpt and torch.
I downloaded the code from VideoGPT and ran pip install -e ., but videogpt 1.0 requires torch ~=1.7, which differs from the previously installed torch==1.13.1+cu117.
Looking forward to your reply! Thx

Please stop this waste of public resources

After reading through all of today's PR pieces, I have to say it: please stop this waste of resources!

You are doing this purely for your own academic reputation, without caring whether the work has any social or commercial value. In behavior and intent it is basically the same as EMO's empty open-source project.

Please stop!

Thank you!

Integrating with nodeJS

I would like to know whether it's possible to integrate the model with Node.js at all. Forgive my ignorance if this doesn't sound relevant.

I'm trying to see if I can build an npm package for it so that it can be installed in applications.

Could Claude 3 be used to speed things up?

OpenAI has said that Sora was developed on top of GPT and DALL·E 3. In theory, to fully replicate Sora one would also need to master GPT and DALL·E 3 before working out Sora's technical route. If Claude 3 has made a breakthrough in imitating human thinking patterns, it might be able to suggest the right technical approach for reproducing Sora.

Posted in the wrong place, sorry

          > There is someone called ganchengguang, and who knows what his intentions are. Why did you give a thumbs-down reaction to everyone who supports this open-source project? Who knows what your background is; did one trip to Japan make you think you are Japanese, or are you actually a Japanese spy?

Sorry, my finger slipped and I clicked the wrong reaction. I have changed it to a thumbs-up.

Okay, no problem. I hope your finger really did slip.

Community Integration: Making Sora cheaper, faster, and more efficient

Thank you for your outstanding contribution to Open-Sora-Plan!

AIGC, e.g., Sora, has recently risen to become one of the hottest topics in AI. We are happy to share a fantastic solution that makes Sora-style training much cheaper!

The Colossal-AI team provides an optimized open-source Sora replication solution with a 46% cost reduction and sequence length expansion to nearly one million. More details can be found on the blog (also available in Chinese).

Open-source code: https://github.com/hpcaitech/Open-Sora

We would appreciate it if we could build the integration with you and other users to benefit the community!

Thank you very much.

Failure in 'pip install -r requirements.txt'

INFO: pip is looking at multiple versions of transformers to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r requirements.txt (line 36) and tokenizers==0.10.3 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested tokenizers==0.10.3
transformers 4.32.0 depends on tokenizers!=0.11.3, <0.14 and >=0.11.1

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

It seems the install command fails due to a version conflict between transformers and tokenizers.

Join You!

I want to know how to join the program. Looking forward to your reply.

What's HW requirement to run this model?

I tried an A100 (40GB SXM4) with 30 vCPUs, 200 GiB RAM, and a 512 GiB SSD, but immediately hit CUDA out of memory.

Which card/config should I use? 8× A100 80GB? 1× H100 80GB? 8× H100 80GB?

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 538.00 MiB (GPU 0; 39.39 GiB total capacity; 37.39 GiB already allocated; 233.94 MiB free; 38.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation

(opensora) ubuntu@129-146-126-183:~/opensora-arizona/Open-Sora-Plan$ python ./src/sora/modules/ae/vqvae/videogpt/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
warnings.warn(
/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
Downloading...
From (original): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5
From (redirected): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5&confirm=t&uuid=edea95d1-1e18-41c1-8b57-966749fb41ad
To: /home/ubuntu/opensora-arizona/Open-Sora-Plan/ucf101_stride4x4x4
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 258M/258M [00:05<00:00, 45.4MB/s]
sample_frames_len 500, only can sample 300 assets/origin_video_0.mp4 300
Traceback (most recent call last):
File "./src/sora/modules/ae/vqvae/videogpt/rec_video.py", line 110, in
main(args)
File "./src/sora/modules/ae/vqvae/videogpt/rec_video.py", line 92, in main
encodings, embeddings = vqvae.encode(x_vae, include_embeddings=True)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 38, in encode
h = self.pre_vq_conv(self.encoder(x))
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 241, in forward
h = self.res_stack(h)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 125, in forward
return x + self.block(x)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/vqvae.py", line 104, in forward
x = self.attn_w(x, x, x) + self.attn_h(x, x, x) + self.attn_t(x, x, x)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/attention.py", line 193, in forward
a = self.attn(q, k, v, decode_step, decode_idx)
File "/home/ubuntu/opensora-arizona/miniconda3/envs/opensora/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/attention.py", line 244, in forward
out = scaled_dot_product_attention(q, k, v, training=self.training)
File "/home/ubuntu/opensora-arizona/Open-Sora-Plan/src/sora/modules/ae/vqvae/videogpt/videogpt/attention.py", line 500, in scaled_dot_product_attention
attn = torch.matmul(q, k.transpose(-1, -2))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 538.00 MiB (GPU 0; 39.39 GiB total capacity; 37.39 GiB already allocated; 233.94 MiB free; 38.27 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Sample Error

Two days ago, I trained a DiT-XL with the following command:

torchrun --nproc_per_node=8 src/train.py \
  --model DiT-XL/122 \
  --vae ucf101_stride4x4x4 \
  --data-path ./UCF-101 --num-classes 101 \
  --sample-rate 2 --num-frames 8 --max-image-size 128 --clip-grad-norm 1 \
  --epochs 14000 --global-batch-size 64 --lr 1e-4 \
  --ckpt-every 4000 --log-every 1000 \
  --results-dir ./exp1

Today, I tried to sample a video with:

python opensora/sample/sample.py \
  --model DiT-XL/122 --ae ucf101_stride4x4x4 \
  --ckpt ./exp1/000-DiT-XL-122/checkpoints/0012000.pt --extras 1 \
  --fps 10 --num-frames 16 --image-size 256

However, I encountered:

    model.load_state_dict(state_dict)
  File "/root/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DiT:
        Unexpected key(s) in state_dict: "y_embedder.embedding_table.weight".

Thank you for taking the time to look into this issue. I look forward to your response.

[Question] Open Source Sora Model: Clarification and Contribution

Hi there,

I came across the Sora model replication project and am interested in contributing ideas for improvement. As I lack access to powerful hardware for testing, my focus would be on suggesting enhancements to address any existing shortcomings in the Sora model.

Could you clarify if the project aims to replicate Sora exactly or if there's a focus on improving its current performance?

Ask for training resource requirement

I want to know how many GPUs and how much GPU memory are needed to run the demo training, as well as the training times for various training configurations.

Is there a WeChat group to join for fast iteration and optimization of the development of this project?

One more question: have you tested your code with outputs that include moving humans or animals? I understand the performance may not be very good given the limited computing resources, but I'd love to see a wider variety of output examples.

Question about latent size

Hi, this project uses VQ-VAE to compress video into a small latent space, with a latent embedding dim of 512 or 256. But LDMs usually use a very small embedding dim such as 3 or 4 (SD uses 4). Will this large latent dim make the diffusion training process too hard to learn, since it predicts high-dimensional noise?

RuntimeError in DiT `Attention` class `forward` function due to dimension mismatch

Description:

While attempting to run the code from the repository, I encountered a runtime error in the forward function of the Attention class located in src/sora/modules/diffusion/dit/models.py. I suspect that the issue might be caused by a mismatch in the dimensions of the attention_mask.

Steps to Reproduce:

  1. Clone the repository and pull the latest code.
  2. Download dependency model ucf101_stride4x4x4 and dataset UCF-101 from https://www.crcv.ucf.edu/datasets/human-actions/ucf101/UCF101.rar.
  3. Run the model using the provided train.sh script.
  4. Observe the runtime error occurring in the forward function of the Attention class.
torchrun  --nproc_per_node=8 src/train.py \
  --model DiT-XL/122 \
  --vae ucf101_stride4x4x4 \
  --data-path UCF-101 --num-classes 101 \
  --sample-rate 2 --num-frames 8 --max-image-size 128 --clip-grad-norm 1 \
  --epochs 14000 --global-batch-size 256 --lr 1e-4 \
  --ckpt-every 1000 --log-every 1000 

Error Message:

RuntimeError: The expanded size of the tensor (384) must match the existing size (16) at non-singleton dimension 3. Target sizes: [32, 16, 384, 384]. Tensor sizes: [32, 2, 12, 16]

Expected Behavior:

The forward function should execute normally without any dimension mismatch errors.

Actual Behavior:

The execution of scaled_dot_product_attention results in a runtime error due to a dimension mismatch with the attention_mask.

Additional Information:

  • I am certain that I have not modified any other code or training scripts.
  • I attempted to print the dimensions of q, k, v, and attention_mask, which are as follows:
if self.fused_attn:
            print(q.shape, k.shape, v.shape, attention_mask.shape)
            x = F.scaled_dot_product_attention(
                q, k, v,
                attn_mask=attention_mask,
                dropout_p=self.attn_drop.p if self.training else 0.,
            )
# Output: torch.Size([32, 16, 384, 72]) torch.Size([32, 16, 384, 72]) torch.Size([32, 16, 384, 72]) torch.Size([32, 2, 12, 16])

This indicates that the dimension of attention_mask does not match.

Environment Information:

  • Operating System: Linux
  • Python Version: 3.10
  • PyTorch Version: 2.1.1
  • CUDA Version: 12.3

Tokenizers library version issue

INFO: pip is looking at multiple versions of transformers to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r requirements.txt (line 36) and tokenizers==0.10.3 because these package versions have conflicting dependencies.

The conflict is caused by:
The user requested tokenizers==0.10.3
transformers 4.32.0 depends on tokenizers!=0.11.3, <0.14 and >=0.11.1

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

Using the default op in xformers with an attn_mask produces NaN

After waiting two days, I finally saw the xformers-related content. In my own tests I also found that using the constructed attn mask with xformers produces NaN. I tried other attention operators and ran into the following:

[email protected]” is not supported because: attn_bias type is <class 'torch.Tensor'>

"tritonflashattF" is not supported because: attn_bias type is <class 'torch.Tensor'> operator wasn't built - see `python -m xformers.info` for more info triton is not available

"smallkF" is not supported because: max(query.shape[-1] != value.shape[-1]) > 32 dtype=torch.float16 (supported: {torch.float32}) bias with non-zero stride not supported unsupported embed per head: 72

So, given the way the attn mask is currently constructed, is there no way to use xformers + attn_mask to optimize memory usage?

curating high-quality video data

Hi team members,
I would attribute the success of Sora to the training data, much as OpenAI has done for GPT. Any ideas on curating high-quality video data?

Taking inspiration from Stable Diffusion 3

As you probably know, Stability AI today published the architecture details of SD3.

https://stability.ai/news/stable-diffusion-3-research-paper / https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf

The key takeaways are:

  1. Rectified Flow (much faster than diffusion)
  2. Joint Transformer for both Text and Image embedding processing
  3. Improved text encoding/prompt-alignment by using mixture of CLIPs and T5
  4. Deduplication efforts
  5. Outperforms SOTA
  6. Scales to Text2Video too


I think these ideas can be of much help to OpenSora project

inference results

Thanks for your work. After training the model, can it infer from normal videos? Could you provide some video samples?

My Architecture Overhaul Practical Roadmap for Faster and Less Resources T2V Generation

Hi there!

I have been watching and contributing to the text2video ecosystem for a long time now. Now that Sora is out, there's more attention on the subject, and I have been thinking about multimodal models too. However, while I have ideas in mind, some of them are so fundamental that they would require training the model from scratch.

And here is what needs to be done, in my mind.

  1. Switch the base to Latte/PixArt-alpha. It has a good, fast architecture and supports ControlNet-Transformer out of the box.

  2. Important! As you probably know, there's a push away from the vanilla Transformer architecture in the NLP community due to its quadratic cost. While a transition to Mamba would be too complicated, I propose switching to the newer compromise of "Linear Transformers with Learnable Kernel Functions are Better In-Context Models" https://github.com/corl-team/rebased (released this February).

  3. Training A. Diffusion from noise is becoming outdated technology. Instead, we can use flow matching, where the process is accelerated by approximating the direction the denoising process should follow rather than simulating the whole diffusion process.

  4. Training B. Learning a representation of the real world is a daunting task for AI. We can benefit greatly by, again, not generating videos from scratch but instead making the model fill in (inpaint) the 3D-masked parts of existing videos. See Meta's Voicebox and V-JEPA models (available on GitHub) for more details.

  5. Use Temporal VAE from StableVideoDiffusion instead of VideoGPT. (simply better quality, and Latte uses it too)

VQVAE or VAE?

Dear authors, thanks for your interesting work and plans. However, there is one question in my mind: why did you choose VQ-VAE instead of a VAE?
As stated by both DiT and Sora's official website, both of them use a VAE without quantization. So what drove you to choose VQ-VAE as your tokenizer?
Looking forward to your reply and hoping to contribute to your project.

License

Hi,
Thank you for releasing this. I noticed you mention that this is an "open source" project, but the license is NC, which doesn't qualify as an open-source license. Is there any chance it could be changed, or does the technology this repo depends on also carry that license?
Thanks!


Integration with `huggingface_hub`

Hi there 👋

My name is Sayak, one of the maintainers of the diffusers library at Hugging Face. Thanks for kicking this off!

I was wondering if you'd be interested in integrating with the huggingface_hub library to make model loading and saving easier with the Hugging Face Hub platform. I am happy to draft a PR to showcase the possibility.

NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation

Hi,

I'm using an H100 (80GB), but the specified PyTorch version (torch==1.13.1+cu117) does not support the H100's CUDA capability sm_90.

Has anyone run into this H100 issue? How can it be fixed? Many thanks!

NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

(opensora) ubuntu@209-20-158-49:~/opensora-utah/Open-Sora-Plan$ python ./src/sora/modules/ae/vqvae/videogpt/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1
/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
warnings.warn(
/home/ubuntu/.local/lib/python3.10/site-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
warnings.warn(
Downloading...
From (original): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5
From (redirected): https://drive.google.com/uc?id=1uuB_8WzHP_bbBmfuaIV7PK_Itl3DyHY5&confirm=t&uuid=9a37ecfb-0c55-4e77-a418-9129ea8e4ba4
To: /home/ubuntu/opensora-utah/Open-Sora-Plan/ucf101_stride4x4x4
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 258M/258M [00:03<00:00, 83.7MB/s]
/home/ubuntu/.local/lib/python3.10/site-packages/torch/cuda/init.py:155: UserWarning:
NVIDIA H100 PCIe with CUDA capability sm_90 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75 sm_80 sm_86.
If you want to use the NVIDIA H100 PCIe GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/

warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))

Todo list or discussion channel

Is there a public to-do list of promising next steps, and a place for open group members to join for brainstorming ideas?

How the spatial-temporal embedding is defined

While reviewing the source code, there is something I could not figure out and would like to ask about.
The spatial-embedding-related code in the source is:

pos_embed = get_2d_sincos_pos_embed(self.hidden_size, [num_patches_height, num_patches_width])
...
emb = np.concatenate([emb_h, emb_w], axis=1)

In other words, the pos_embed vector is split into two halves: the first half encodes the position along y and the second half the position along x.

Then a temporal embedding is introduced; the variable in the source is named pos_embed_1d:

pos_embed_1d = get_2d_sincos_pos_embed(self.hidden_size, [num_tubes_length, 1])

The final overall embedding is then roughly the sum of three parts:

emb = token_embed + pos_embed + temporal_embed

But when a video is represented as patches, the spatial and temporal dimensions seem to be equivalent, so I feel the spatial x, spatial y, and temporal t directions should be handled in the same way. Put simply, I think the spatio-temporal embedding should be defined like this:

spatial_temporal_embed = np.concatenate([emb_h, emb_w, emb_t], axis=1)

emb_total = token_embed + spatial_temporal_embed

Here emb_t is the embedding component that encodes the temporal ordering of the video frames. In this definition, the x, y, and t directions are treated identically.

Clearly this differs slightly from how the embedding is handled in the source code. I am not sure why the source takes its approach, or whether my idea is flawed. Any guidance would be appreciated, thanks!

cd VideoGPT: No such file or directory

Thanks for the great work!!!

I just found out that the repo directory structure has changed and VideoGPT/ has moved to src/sora/modules/ae/vqvae/videogpt/, but the README still says cd VideoGPT.

Could you please update that as well? Many thanks!

bash: cd: VideoGPT: No such file or directory

git clone https://github.com/PKU-YuanGroup/Open-Sora-Plan
cd Open-Sora-Plan
conda create -n opensora python=3.8 -y
conda activate opensora
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -r requirements.txt
cd VideoGPT --> cd src/sora/modules/ae/vqvae/videogpt/
pip install -e .
cd ..

python ./src/sora/modules/ae/vqvae/videogpt/rec_video.py --video-path "assets/origin_video_0.mp4" --rec-path "rec_video_0.mp4" --num-frames 500 --sample-rate 1

Try using linear passthrough to train a model in dit?

Try using linear passthrough to train a model in dit?

One of the key ideas is that it works like "an online passthrough": a loop is applied over a module superclass that groups layers, in such a way that their forward methods are repeated in a loop. So, in theory, you can observe more intelligence in the same way as MegaDolphin 120b, Professor 155b, Venus 120b and other huge models, but use far less VRAM, because instead of cloning the weights we share them in VRAM.

https://huggingface.co/cognitivecomputations/dolphin-phi-2-kensho

Thumbs up, keep it up. A few constructive suggestions.

A few suggestions for the Open-Sora-Plan team:

1) Rather than merely following, try to surpass along the way. Technical points for surpassing: for example, on top of the base version, add detection and constraints (regularization) for large objects to avoid things appearing out of nowhere; build a table of keywords and attributes; add object type labeling (rigid body, quasi-rigid body, fluid); add motion keypoint detection for rigid and quasi-rigid bodies to prevent, say, a hand passing through a body; use regularization, RLHF and the like to increase physical plausibility as much as possible. (In the end, all generative models compete on how strongly they constrain consistency with the physical world.) ...

2) For the conditioning part, your diagram suggests you focus on a few image attributes. My personal suggestion on priority: text (the richer the captions the better, covering object/scene groups, motion, and adjectives) > images (original photos > processed images) > UI interaction data.

3) The data and compute requirements are huge, so you could set up a sponsorship page; either money or GPU cloud resources would help.
Data can be prepared in parallel, especially captions for massive amounts of video, which involves a lot of manual work. On top of algorithmic captions, human review is still needed. You should plan ahead for the keywords users will really care about at generation time: camera position (essential, e.g. the shot the director has in mind, the camera trajectory, such as following a character from behind and then circling around for a close-up of the face; ordinary video captions contain none of this, and in my 3D work it is determined online after modeling), visual style, film terminology, lighting and materials, action and interaction, ...

4) When Sora, Genie and the like came out, I did my own study and analysis the same day, for example:
https://github.com/yuedajiong/super-ai/blob/main/superai-20240216-sora.png
https://github.com/yuedajiong/super-ai
My main focus is on generating "3D, dynamic/interactive, photorealistic, cinematic, complex worlds".
Essentially, I care more about using an explicit 5D (dynamic, interactive) representation with strong constraints to do vision generation.
If anything here is technically useful, I would be happy to join and write code together.
For example, realism, as I understand it, includes category-level realism (a person looks like a person) and, more importantly, individual-level realism (Liu Yifei looks like Liu Yifei). When the text prompt is "Liu Yifei is dancing" and a photo of Liu Yifei is provided, the generated 5D world, or Sora's 2D world, should keep Liu Yifei's face throughout. For film and TV production (celebrity IP), and even video-fake scenarios (no examples needed, celebrities), this consistency constraint is extremely important.

5) One technical idea for going beyond Sora:
The user enters the text in one shot, but what is generated is a video. When the description involves "motion", how should the "motion" text condition be decomposed? Nobody knows how Sora does it; based on my own experience and understanding, I would do some processing in text space and then feed it in as the condition. I might do it like this:
input: A red sun is slowly rising; midway, a plane happens to fly across it.
condition-processor-network: ....
output: {beginning: text: a red sun is slowly rising; supplementary image: sunraise.jpg
middle: text: a red sun is slowly rising, and a plane happens to fly across it; supplementary image: plane.jpg
end: text: a red sun is slowly rising;}
In one sentence: much of human intelligence comes from abstracting the physical world and then doing a great deal of processing in symbol space, based on world simulation.
Whether or not Sora has this, building a good scene-symbolization-and-refinement sub-network, roughly 90% text + 10% images, is surely a point where Sora can be surpassed. The user types 100 words; our scene-refinement module produces 10,000 words (generated so quickly the time is negligible), and the constraints represented by this "nine parts text, one part image" can reinforce consistency to a very high degree, even across a two-hour video (think of a director's storyboards). (In my 3D world construction, my text expands into details such as: far shot: 2D image; middle shot: Gaussian splatting; near shot: interactive 3D model; lighting; and so on. Nothing ever appears out of thin air. Sora-style weakly constrained 2D generation can also have similarly detailed constraints in symbol space.)

6) Another technical idea for going beyond Sora:
I do not know whether Sora has this; I am just offering my thoughts. Suppose we want to generate a video of "Liu Yifei (with a photo provided) dancing the 'Subject Three' dance". Given only a single photo containing Liu Yifei's face as input, how do we guarantee that Liu Yifei in every frame is still Liu Yifei?
I have no access to Sora to experiment, and this requires image conditioning, so I do not know how well Sora supports it; my guess is not very well at present. Achieving "individual realism, 3D consistency, motion consistency, consistency under lighting changes, consistency across expressions" is relatively easier to control with a 3D model, but none of it is easy.
How could it be done? In a Sora-like algorithm, in the condition input, besides args such as resolution, one could add a face-high-fidelity option. This would be extremely useful for film/TV production around a specific person's IP. Run face detection/segmentation on the input image, then use the facial features as a condition for the diffusion part, concatenating the original features in at multiple steps. With a "face-high-fidelity" instruction, the strategy for constructing the diffusion condition could differ from the normal strategy.
If Sora does this poorly or not at all, such a face-high-fidelity feature is the most important thing for live streaming, deep-face applications, and film/TV production.

7) Add controllable design.
I do not believe any Gen-AI-based symbolic or visual system can achieve truly rigorous physical consistency, because unlike algorithms with explicit representations, they cannot be precisely controlled point by point. Data and algorithm improvements will keep making things better, but never ultimately controllable.
What do we do when commercial requirements really do need strong constraints? I think the answer is making the generation process controllable.
First, engineering reproducibility: although the various random numbers are randomly generated, record every one of them so runs can be replayed. (About 50 lines of code can implement a generic, non-invasive mechanism: various-random-make&save [or load] -> random-set&use.)
Second, building on reproducibility, add controllability to the diffusion stage, e.g. the ability to "re-process" or "post-process" away objects that appear out of nowhere and other implausible content via text conditions or ControlNet-like techniques.

0) Some other resources:
Technical analysis: https://arxiv.org/abs/2402.17177
Could we enable type hints for the project

Given that this is designed as an open-source project that is supposed to receive lots of contributions from different teams, maybe it's a good idea to enable type hints for functions so the code is more readable.
