This is a wonderful project! If I have a zero-shot data set, it contains a one-d

How to just train condition audio-diffusion without text-condition? about audio-diffusion-pytorch HOT 3 CLOSED

archinetai commented on August 25, 2024

How can I train a conditional audio-diffusion model without text conditioning?

from audio-diffusion-pytorch.

Comments (3)

flavioschneider commented on August 25, 2024

Yes, this is possible. You can provide a custom embedding by setting the embedding_features dimension in the constructor and choosing where you want cross-attention blocks (as with text conditioning):

import torch
from audio_diffusion_pytorch import DiffusionModel, UNetV0, VDiffusion, VSampler

model = DiffusionModel(
    net_t=UNetV0, # The model type used for diffusion 
    in_channels=2, # U-Net: number of input/output (audio) channels 
    channels=[8, 32, 64, 128, 256, 512, 512, 1024, 1024], # U-Net: channels at each layer
    factors=[1, 4, 4, 4, 2, 2, 2, 2, 2], # U-Net: downsampling and upsampling factors at each layer 
    items=[1, 2, 2, 2, 2, 2, 2, 4, 4], # U-Net: number of repeating items at each layer
    attentions=[0, 0, 0, 0, 0, 1, 1, 1, 1], # U-Net: attention enabled/disabled at each layer 
    attention_heads=8, # U-Net: number of attention heads per attention block
    attention_features=64, # U-Net: number of attention features per attention block,
    diffusion_t=VDiffusion, # The diffusion method used 
    sampler_t=VSampler, # The diffusion sampler used 
    embedding_features=768, # U-Net: embedding features
    cross_attentions=[0, 0, 0, 1, 1, 1, 1, 1, 1], # U-Net: cross-attention enabled/disabled at each layer 
)

# Train model with audio
audio_wave = torch.randn(1, 2, 2**18) # [batch, in_channels, length]
embedding = torch.randn(1, 1, 768) # [batch, num_embeddings, embedding_features] 
loss = model(audio_wave, embedding=embedding) 
loss.backward()

noise = torch.randn(1, 2, 2**18)
sample = model.sample(noise, embedding=embedding, num_steps=2)

This example uses one embedding vector per batch item, but you can use multiple by increasing the num_embeddings dimension of the embedding tensor.
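As a rough illustration of where such an embedding could come from when there is no text, here is a minimal sketch that maps integer class labels to a tensor of shape [batch, num_embeddings, embedding_features]. The LabelEmbedder class, the number of classes, and the choice of four embeddings per item are all assumptions made for this example; they are not part of audio-diffusion-pytorch.

```python
import torch
import torch.nn as nn


class LabelEmbedder(nn.Module):
    """Hypothetical helper: turns class labels into conditioning embeddings."""

    def __init__(self, num_classes, num_embeddings=4, embedding_features=768):
        super().__init__()
        self.num_embeddings = num_embeddings
        self.embedding_features = embedding_features
        # One learnable row per class, later split into several vectors.
        self.table = nn.Embedding(num_classes, num_embeddings * embedding_features)

    def forward(self, labels):
        flat = self.table(labels)  # [batch, num_embeddings * embedding_features]
        return flat.view(labels.shape[0], self.num_embeddings, self.embedding_features)


embedder = LabelEmbedder(num_classes=10)  # 10 classes is an arbitrary choice
labels = torch.tensor([3, 7])             # one label per batch item
embedding = embedder(labels)              # [2, 4, 768]
```

The resulting tensor has the same layout as the random embedding in the example above, so it could be passed as the embedding argument in its place, with the embedder trained jointly with the diffusion model.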

LeonJoe13 commented on August 25, 2024

Thank you very much for your answer!
I also want to ask how the conditional information is introduced within the network. Is it similar to latent diffusion?
My second question is how to improve the network's ability to learn the conditional information, or its ability to generalize. By increasing the number of cross-attention modules?
Thank you very much!

flavioschneider commented on August 25, 2024

Yes, increasing the number of cross-attention layers might help strengthen the conditional information (note that I'd do this by increasing the number of items per layer, not the number of cross-attentions per item). Also, during inference you can increase the classifier-free guidance scale.
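For readers unfamiliar with the guidance scale, the sketch below shows the standard classifier-free guidance blend on plain Python lists: the unconditional prediction is pushed toward the conditional one by a factor of scale. The function name guided_prediction and the toy numbers are assumptions for illustration only; how the library exposes the scale is not shown in this thread.

```python
def guided_prediction(cond, uncond, scale):
    """Blend conditional and unconditional predictions element-wise.

    scale = 0 gives the unconditional prediction, scale = 1 the conditional
    one, and scale > 1 extrapolates past it, strengthening the conditioning.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0, 3.0]    # toy prediction with the embedding
uncond = [0.5, 1.0, 1.5]  # toy prediction with the embedding masked out

print(guided_prediction(cond, uncond, scale=0.0))  # [0.5, 1.0, 1.5]
print(guided_prediction(cond, uncond, scale=1.0))  # [1.0, 2.0, 3.0]
print(guided_prediction(cond, uncond, scale=2.0))  # [1.5, 3.0, 4.5]
```

Larger scales usually make samples follow the condition more closely at some cost in diversity, so the value is typically tuned by ear.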
