Comments (2)
Hello! Thank you for your great paper and for publishing the code and checkpoints for the t2v models. While reading about how it all works, I had a number of questions. I hope you'll find some time to answer at least some of them. Feel free to direct me to your paper if it is already explained there :)
- What is the reasoning behind using such a small patch size = 2? Usually, I see patch sizes of 16 or 8 used, especially when generating 512x512 images.
- I see that you used LoRACompatible modules for linear projection. Have you thought about how this architecture could be expanded with LoRAs?
- Have you thought about adding some image-specific positional encoding to appended images?
- What is the purpose of args.fixed_spatial? In what cases would one want to train only spatial layers?
- In the provided training script, the decay for EMA is set to 0. Does that mean that the provided checkpoint was trained without EMA? Link: here
- Given that you are already passing "scaling_factor": 0.18215 to the VAE model, why do you scale it again in the training loop? Link: here
- Given that you are already doing attention masking in the encode_prompt function, why are you passing attention_mask and encoder_attention_mask arguments to the model's forward method? I'm not sure if I'm right, but it seems that both of these arguments are never used.
- How do you switch between using fp16 and fp32 in the training script?
- Training the model for more than 16 frames often results in checkerboard artifacts and significantly reduced quality. Do you think this is a limitation of the Latte model's architecture? I've seen that you recommend looking into autoregressive video modeling, but still, how can we effectively scale the number of frames generated from 16 to 32 without changing the architecture or sampling method?
- In the implementation of the BasicTransformerBlock, there is a lot of commented-out code with the cross-attention implementation. Does this mean that the pretrained checkpoint was trained without it?
Thank you again for your work, and I look forward to your answers!
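On the patch-size question above, it may help to note that the patching happens on VAE latents, not pixels. Assuming a Stable-Diffusion-style VAE with a spatial downsampling factor of 8 (an assumption based on common latent-diffusion setups, not stated in this thread), a 512x512 image becomes a 64x64 latent grid, so a patch size of 2 yields a modest 32x32 = 1024 spatial tokens. A minimal sketch of that arithmetic:

```python
def num_spatial_tokens(image_size: int, vae_stride: int = 8, patch_size: int = 2) -> int:
    """Count the spatial tokens a DiT-style transformer sees per frame.

    vae_stride: spatial downsampling of the VAE (8 is assumed here,
                matching typical SD-style latent-diffusion VAEs).
    patch_size: side length of each square latent patch.
    """
    latent_size = image_size // vae_stride       # 512 -> 64 latent grid
    tokens_per_side = latent_size // patch_size  # 64 / 2 -> 32 patches per side
    return tokens_per_side ** 2

print(num_spatial_tokens(512))                 # patch_size=2 -> 1024 tokens
print(num_spatial_tokens(512, patch_size=16))  # patch_size=16 -> only 16 tokens
```

This is why patch sizes of 8 or 16, which are natural in pixel space, would leave very few tokens in latent space.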
Hi, thanks for your interest.
- Directly inherited from DiT and Pixart-alpha.
- Not yet, but it should be easy, depending on what you want to do with this.
- Which positional embedding do you mean, for example? The spatial part is already encoded with an absolute positional embedding.
- This is for the case where someone wants to train only the temporal module.
- No, the setting of 0 here is just to synchronize the parameter values with the model at the beginning of training.
- VAE itself does not multiply by this scaling factor.
- attention_mask is used for training; encoder_attention_mask is used for both training and testing.
- The corresponding parameters are controlled in config.
- When training on more frames, such as 32, I did not experience a serious drop in quality.
- The autoregressive method is just a training-free method, and there are some training-free methods that can generate longer videos than the base model.
- Yes, it is from diffusers and not used in Latte.
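To expand on the scaling-factor point: in diffusers, `AutoencoderKL` only stores `scaling_factor` in its config; `encode` returns unscaled latents, so the training loop multiplies by the factor itself (and divides before decoding). A minimal sketch of that convention, using a dummy stand-in for the VAE encoder (the real call would be `vae.encode(x).latent_dist.sample()`, assumed from diffusers' API):

```python
import numpy as np

SCALING_FACTOR = 0.18215  # stored in the VAE config, but NOT applied by the VAE itself

def encode_like_vae(pixels: np.ndarray) -> np.ndarray:
    # Stand-in for vae.encode(x).latent_dist.sample(): returns *unscaled* latents.
    return pixels * 0.5  # dummy transform, for illustration only

def training_latents(pixels: np.ndarray) -> np.ndarray:
    # The training loop scales manually so latents have roughly unit variance,
    # which is what the diffusion model is trained on.
    return encode_like_vae(pixels) * SCALING_FACTOR

def latents_for_decoding(latents: np.ndarray) -> np.ndarray:
    # Before vae.decode(), the scaling must be undone.
    return latents / SCALING_FACTOR

x = np.ones((1, 3, 8, 8), dtype=np.float32)
z = training_latents(x)
assert np.allclose(latents_for_decoding(z), encode_like_vae(x))  # round-trip is exact
```

So passing `"scaling_factor": 0.18215` to the model config and multiplying in the training loop are not redundant: the first only records the value, the second actually applies it.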
Hi there!
This issue has been marked as stale due to inactivity for 60 days.
We would like to inquire if you still have the same problem or if it has been resolved.
If you need further assistance, please feel free to respond to this comment within the next 7 days. Otherwise, the issue will be automatically closed.
We appreciate your understanding and would like to express our gratitude for your contribution to Latte. Thank you for your support.