lucidrains / bs-roformer Goto Github PK

View Code? Open in Web Editor NEW

402.0 402.0 13.0 230 KB

Implementation of Band Split Roformer, SOTA Attention network for music source separation out of ByteDance AI Labs

License: MIT License

Python 100.00%

artificial-intelligence attention-mechanisms deep-learning music-source-separation transformers

bs-roformer's People

Contributors

Stargazers

Watchers

Forkers

newoneincntk yoshikimas adrianwangzhao shenberg gedebabin iver56 dmcneil crlandsc lucataco tps-f funky-synatra yolkispalkis ealpha

bs-roformer's Issues

Removed

Feature request: decouple the loss function of the forward function

In the current implementation, the forward() method is generic for train or eval mode. In some case, we need to have not only the loss but the prediction on output that allow to compute extra features like the SDR metric during the validation step.

Because the loss function code is common for BSRoformer and MelBandRoformer classes, maybe that can be better create a new class like MultiResLoss for a maximum of flexibility:

import torch
import torch.nn.functional as F
from einops import rearrange
from beartype import beartype
from beartype.typing import Tuple

class MultiResLoss():
    @beartype
    def __init__(
        self,
        num_stems,
        stft_n_fft,
        multi_stft_resolution_loss_weight = 1.,
        multi_stft_resolutions_window_sizes: Tuple[int, ...] = (4096, 2048, 1024, 512, 256),
        multi_stft_hop_size = 147,
        multi_stft_normalized = False
    ):
        self.num_stems = num_stems

        self.multi_stft_resolution_loss_weight = multi_stft_resolution_loss_weight
        self.multi_stft_resolutions_window_sizes = multi_stft_resolutions_window_sizes
        self.multi_stft_n_fft = stft_n_fft

        self.multi_stft_kwargs = dict(
            hop_length = multi_stft_hop_size,
            normalized = multi_stft_normalized
        )
    
    def __call__(
        self, 
        predict, 
        targets, 
        return_loss_breakdown = False
    ):
        if self.num_stems > 1:
            assert targets.ndim == 4 and targets.shape[1] == self.num_stems
        
        if targets.ndim == 2:
            targets = rearrange(targets, '... t -> ... 1 t')

        targets = targets[..., :predict.shape[-1]] # protect against lost length on istft

        loss = F.l1_loss(predict, targets)

        multi_stft_resolution_loss = 0.

        for window_size in self.multi_stft_resolutions_window_sizes:

            res_stft_kwargs = dict(
                n_fft = max(window_size, self.multi_stft_n_fft),  # not sure what n_fft is across multi resolution stft
                win_length = window_size,
                return_complex = True,
                **self.multi_stft_kwargs,
            )

            predict_Y = torch.stft(rearrange(predict, '... s t -> (... s) t'), **res_stft_kwargs)
            targets_Y = torch.stft(rearrange(targets, '... s t -> (... s) t'), **res_stft_kwargs)

            multi_stft_resolution_loss = multi_stft_resolution_loss + F.l1_loss(predict_Y, targets_Y)

        weighted_multi_resolution_loss = multi_stft_resolution_loss * self.multi_stft_resolution_loss_weight

        total_loss =  loss + weighted_multi_resolution_loss

        if not return_loss_breakdown:
            return total_loss

        return total_loss, (loss, multi_stft_resolution_loss)

In the same spirit, a little refactoring could be to create a new file for the common classes :

- RMSNorm
- FeedForward
- Attention
- Transformer
- BandSplit
- MLP
- MaskEstimator

That can be easier for future change in the code?

Reproduce the results from the paper

Hi,

Has anyone successfully reproduced the results from the paper? I am interested in doing more experiments with this architecture.

Thanks!

Open sourcing my Mel-Band Roformer vocal model

Hello @lucidrains

I have just open sourced my Mel-band Roformer vocal model, it performs slightly better than the authors model because I trained it with more data. I also included a google colab for people if they want to use it.

https://github.com/KimberleyJensen/Mel-Band-Roformer-Vocal-Model

Thank you again for implementing this paper !!

Gates in Attention module of bs_roformer.py

I am a bit confused of the gates in the Attention module of the bs_roformer.py. The code in lines 103-105 is

out = self.attend(q, k, v)
gates = self.to_gates(x)
out = out * rearrange(gates, 'b n h -> b h n 1').sigmoid()

out = rearrange(out, 'b h n d -> b n (h d)')
return self.to_out(out)

From my understanding this is not the standard multi head attention approach and the paper has not mentioned anything about using something else. Therefore, I would remove the parts using gates resulting in the following code:

out = self.attend(q, k, v)

out = rearrange(out, 'b h n d -> b n (h d)')
out = self.to_out(out)

What got I wrong? What is the gates for and why is it used can you clear this up?

Clarifications Needed for Reproducing BS-RoFormer SDR Performance

Hello,

Firstly, thank you for making the code publicly available. I am attempting to reproduce the results of the BS-RoFormer paper to achieve similar SDR performance and am referencing your code to do so. However, I am encountering significantly lower performance during reproduction and have a few questions. I would appreciate it if you could answer the following queries:

The hyperparameters I used are as follows:

dim: 384
depth: 6
stereo: True
num_stems: 1
time_transformer_depth: 2
freq_transformer_depth: 2
dim_head: 64
heads: 8
ff_dropout: 0.1
attn_dropout: 0.1
flash_attn: True
mask_estimator_depth: 2
Here are my questions:

Could you please share the model and training hyperparameters you used during training?
The paper mentions using a complex spectrogram as input, but I noticed the code uses torch.view_as_real to handle the input in a CaC manner. I believe this is different from the paper. Could you explain the reason for this difference?
I am running the training on an H100 80GB GPU with the above hyperparameters. Despite slight differences from the paper's hyperparameters, the batch_size of 4 fills up the 80GB memory. Could you let me know what batch_size you used and if there were any additional steps taken to manage memory efficiency? For reference, I used 44.1kHz 8-second audio for both target and mixture inputs, as per the paper's setup.
Your answers would be greatly helpful. Thank you very much!

MelBand doesn't work with stereo

Hello. I tried to run a training.

model = MelBandRoformer(
      stereo=True,
      dim=32,
      depth=1,
      attn_dropout=0.1,
      time_transformer_depth=1,
      freq_transformer_depth=1,
  )

But I got an error:

RuntimeError: split_with_sizes expects split_sizes to sum exactly to 3964 (input tensor's size at dimension -1), but got split_sizes=[24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 28, 32, 32, 32, 36, 40, 44, 44, 44, 52, 56, 60, 64, 64, 68, 76, 80, 84, 92, 100, 104, 112, 120, 124, 136, 148, 156, 164, 176, 188, 200, 216, 228, 244, 264, 280, 296, 316, 340, 364, 388, 412, 440, 472, 504]

Without stereo=True it works normally.

Hardly to train with 8s length audio for batch size of 2

Hi, according to the paper,
We do not use the In-House dataset for ablation study. The effective batch size is 64 (i.e., 4 for each GPU) using accumulate grad batches=2.
and
Our model takes a segment of 8-seconds waveform for input and output.
for the L=6 model.
Based on my understanding, they train every 2 audio segments with 8s on each GPU without accumulate batch.
However, I hardly to fit in while I am only training with one V100.

Another thing is the size of the model
In the paper, they mentioned
The numbers of parameters for BS-RoFormer and BS-Transformer with L=6 are 72.2M and 72.5M.
While I following the default setting from the paper, my model is 108M

model = Model(
dim=384,
depth=6,
time_transformer_depth=1,
freq_transformer_depth=1,
heads=8,
attn_dropout=0.1,
ff_dropout=0.1,
dim_head=48,
stereo=True
)

Anything wrong?

Flash attention error in Linear Attention layer

I noticed error in Linear Attention layer when flash_attn is set in True:

File "attend.py", line 84, in flash_attn
    out = F.scaled_dot_product_attention(
RuntimeError: No available kernel. Aborting execution.

In standard Attention (time_transformer and freq_transformer) it works ok.

So currently as workaround I did followig change in LinearAttention (set False to flash):

self.attend = Attend(
    scale=scale,
    dropout=dropout,
    flash=False
)

Default time_transformer_depth

In readme you use the following parameters as example:

time_transformer_depth = 1
freq_transformer_depth = 1

However, these are the defaults in the BSRoformer class:

time_transformer_depth = 2
freq_transformer_depth = 2

What would you recommend?

Linear Attention

hello,
I have two question about the Linear Attention that was added later. Can you clear me up why it is called Linear Attention when the referenced paper introduces Cross-Covariance Attention and why exactly is it better than Self-Attention?

Input data shape

Hello, there is a problem with input data. You write

x = torch.randn(2, 131680)
target = torch.randn(2, 131680)

loss = model(x, target = target)
loss.backward()

# after much training

out = model(x)

But in reality there must be also some batch size:

x = torch.randn(10, 2, 131680)
target = torch.randn(10, 2, 131680)

loss = model(x, target = target)
loss.backward()

# after much training

out = model(x)

If I use like in bottom example I have a problem with stft.

RuntimeError: stft(torch.cuda.FloatTensor[6, 2, 133728], n_fft=2048, hop_length=512, win_length=2048, window=None, normalized=0, onesided=None, return_complex=1) : expected a 1D or 2D tensor

Problem with DataParallel

There is a problem with model if you try to use multiGPU:

Traceback (most recent call last):
  File "train_local_run.py", line 44, in <module>
    train_model(args)
  File "train.py", line 239, in train_model
    loss = model(x, y)
  File "\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "\site-packages\torch\nn\parallel\data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "\site-packages\torch\nn\parallel\data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "\site-packages\torch\nn\parallel\parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "\site-packages\torch\_utils.py", line 644, in reraise
    raise exception
StopIteration: Caught StopIteration in replica 0 on device 0.
Original Traceback (most recent call last):
  File "\site-packages\torch\nn\parallel\parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "\models\bs_roformer\mel_band_roformer.py", line 437, in forward
    stft_window = self.stft_window_fn(device=self.device)
  File "\models\bs_roformer\mel_band_roformer.py", line 403, in device
    return next(self.parameters()).device
StopIteration

What is the correct way to fix it?

Training of BS-RoFormer

I tried to train this neural net without any success. SDR stuck in around 2.1 for vocals and never grows more. If somebody have better results please let me know.

Cut frequencies at half of the Nyquist in the MelBand model

Hi, before starting training on a large corpus, I've just test the latest release 3.0 and the new MelBand model. Both models are initialized with the same parameters, no change on others parameters.

model = MelBandRoformer(dim = 384,
                        depth = 9,
                        time_transformer_depth = 1,
                        freq_transformer_depth = 1, 
                        stereo=stereo, 
                        sample_rate=sr).to(device)

For in-situation condition, I load a audio stereo mixture of 8 seconds @ 44100 Hz with a simil target (a drums stem). The batch size is 1, so the tensors are of size : [1, 2, 352800]

After calling a single training forward step, the backward loss seem coherent:

BandSplit : tensor(2.3477, device='cuda:0', grad_fn=<AddBackward0>)
  MelBand : tensor(2.2762, device='cuda:0', grad_fn=<AddBackward0>)

Unfortunatly, when I save back the audio of both models outputs, I got a strange behavior in the MelBand model. Better than words, the spectrograms:

Mixture [mel scale view]

Target [mel scale view]

BandSplit output [mel scale view]

MelBand output [mel scale view]

MelBand output [linear scale view]

The spectrogram show that the MelBand model output cut the frequencies above 11025 Hz, so half of the Nyquist frequency of 22050 Hz for a 44100 Hz audio.

I don't know if it's normal or a bug, but I prefer to share the information here.

Thank's so much for BS-RoFormer !!!

MLP design in MaskEstimator

Thanks for your great work.
In the BS-RoFormer paper, the authors mention:

Each MLP layer consists of a RMS Norm layer, a fully connected layer followed by a Tanh activation, and a fully connected layer followed by a gated linear unit (GLU) layer[29].

However, your MaskEstimator implementation is somewhat different from the paper description.

Here's my implementation (I also checked the number of parameters):

class TanH(Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, x):
        x = self.proj(x)
        return x.tanh()

class GLU(Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out * 2)

    def forward(self, x):
        x, gate = self.proj(x).chunk(2, dim=-1)
        return x * gate.sigmoid()

class MaskEstimator(Module):
    @beartype
    def __init__(
            self,
            dim,
            dim_inputs: Tuple[int, ...],
            mlp_expansion_factor = 4,
    ):
        super().__init__()
        self.dim_inputs = dim_inputs
        self.to_freqs = ModuleList([])

        for dim_in in dim_inputs:
            net = []
            net.append(TanH(dim, dim * ff_mult))
            net.append(GLU(dim * ff_mult, dim_in))
            self.to_freqs.append(nn.Sequential(*net))

    def forward(self, x):
        x = x.unbind(dim=-2)

        outs = []

        for band_features, to_freq in zip(x, self.to_freqs):
            freq_out = to_freq(band_features)
            outs.append(freq_out)

        return torch.cat(outs, dim=-1)

Apple Silicon / MPS support?

Hi! Amazing work!
Any chance it can run on MPS (Metal Performance Shaders) - Apple GPU?

Hidden dim in mask estimation module

In the paper second sentence in 4.4 configuration they mention:

The multi-band mask estimation module utilizes MLPs with a hidden layer dimension of 4D.

But in the actual code from my understanding there is no hidden layer only 2 linear layers smart combined to a single one.

Hop length

I am still a bit confused because in the code is the following comment
in line 241 stft_hop_length = 512, # 10ms at 44100Hz, from sections 4.1, 4.4 in the paper - @faroit recommends // 2 or // 4 for better

It's true that they are mentioned to use a hop_size of 10ms for a 44.1kHz sample rate, but in my calculation 512 is not 10ms but ~11.6ms

Did I missunderstood something or what is the intention here ?

Potential bug with `num_stems > 1`

Hi @lucidrains

I'm working on a training code to share here and attempt to reproduce the same workflow than the SAMI-ByteDance paper. I will share the code here. In order to make the training code generic, and because I'm pretty newbie in this kind of task, I want to be sure to clearly understand your code.

Num_Stems parameter

EDIT: I remove this very stupid assumption.

Divergence of default parameters

This is a less important, but for better following of the code:

In the "Music Source Separation with Band-Split Rope Transformer" paper:

We use D = 384 for feature dimension, L = 12 for the number of Transformer blocks, 8 heads for each Transformer, and a dropout rate of 0.1.

I don't know if the choice of attn_dropout = 0. and ff_dropout = 0. as default parameters for the Transformers is motivated by some technical considerations. If not, maybe it's better to change the values to follow the original paper?

In the "Mel-Band Roformer for Music Source Separation" paper:

We use 60 Mel-bands, as it is similar to the number of subbands, i.e., 62, adopted by BS-RoFormer.

Same here with the default parameter num_bands = 62 of the MelBandRoformer?

Thank you very much to the teacher for providing such good sound source separation.

Thank you very much to the teacher for providing such good sound source separation. For a junior student like me, can you provide a specific tutorial on how to separate the sound source? I am using win10 system 64 bit

How to use it for audio source separation?

I am a novice in this field and would like to complete the task of audio source separation. This project seems to be for that purpose, but I couldn't understand how to use it for audio source separation based on the examples in the documentation.

Input tensor size

I am confused about why the tensor length in the readme is 131680, from the paper I got that they use 8s of audio with 44.1kHz which makes from my understand 352800 😅

Flash Attention support

Thank you very much for your code. You rock!

Is Flash Attention only supported by A100 GPU ?

Linear Attention temperature initialization

First of all, thanks for all the good work (including in this repo)!

There's a potential bug in LinearAttention that I thought I should bring to your attention: In contrast to the original paper, you learn the logarithmic temperature, but still initialize the parameter with ones instead of zeros. This means the temperature will initially be Euler's number. Maybe it doesn't make much of a difference in practice or works even better, but it looks like it may have been unintentional.

[bad assertion] strange bottleneck performance

Hi, let me introduce this little benchmark of the model running on GPU (device='cuda'):

---------- Training ----------
Time: 0.019369 seconds in "STFT" stage.
Time: 0.704290 seconds in "Band Split" stage.
Time: 0.124404 seconds in "Transformer" stage.
Time: 0.017830 seconds in "Mask Estimator" stage.
Time: 6.671448 seconds in "ISTFT" stage.
--------------------------------
Time: 0.147087 seconds in "Multiresolution Loss" stage.
Time: 0.493661 seconds in "Loss Backward" stage.
--------------------------------
Time: 9.102034 seconds for the whole process.
 
---------- Evaluation ----------
Time: 0.000107 seconds in "STFT" stage.
Time: 0.003609 seconds in "Band Split" stage.
Time: 0.015426 seconds in "Transformer" stage.
Time: 0.009862 seconds in "Mask Estimator" stage.
Time: 3.528062 seconds in "ISTFT" stage.
--------------------------------
Time: 3.565335 seconds for the whole process.

The model specifications follow the original paper:

    model = BSRoformer(dim = 384,
                       depth = 12,
                       time_transformer_depth = 1,
                       freq_transformer_depth = 1,
                       mask_estimator_depth = 4).to(device)

The tensors for testing are initialized using:

    audio_size = 4 * 44100 # 4 seconds @ 44.1 KHz

    sample = torch.randn(2, audio_size, dtype=torch.float32).to(device)
    target = torch.randn(2, audio_size, dtype=torch.float32).to(device)

The benchmark do not include the einops operations, nor the other tensor manipulation but bounds the stages of the model like that:

        self.bench.start('Band Split')
        x = self.band_split(x)
        self.bench.stop()
        
        # axial / hierarchical attention

        self.bench.start('Transformer')
        for time_transformer, freq_transformer in self.layers:

            x = rearrange(x, 'b t f d -> b f t d')
            x, ps = pack([x], '* t d')

            x = time_transformer(x)

            x, = unpack(x, ps, '* t d')
            x = rearrange(x, 'b f t d -> b t f d')
            x, ps = pack([x], '* f d')

            x = freq_transformer(x)

            x, = unpack(x, ps, '* f d')

        x = self.final_norm(x)
        self.bench.stop()

        num_stems = len(self.mask_estimators)

        self.bench.start('Mask Estimator')
        mask = torch.stack([fn(x) for fn in self.mask_estimators], dim = 1)
        self.bench.stop()

The Benchmark class is a pretty trivial one:

from time import perf_counter

class Benchmark():
    def __init__(self):
        pass

    def start(self, stage):
        self.stage = stage
        self.time_start = perf_counter()

    def stop(self):
        self.time_duration = perf_counter() - self.time_start
        print(f'Time: {self.time_duration:.6f} seconds in "{self.stage}" stage.')

Conclusion:

66% of the time in the model is lost in the torch.istft process while the torch.stft is not slow at all.

Am I the only one to notice this?

Edit:

Wrong conclusion, see next message.

Train Melband on 3090

Is it possible to reduce some aspects of Melband to give good results still but be able to train on smaller GPUs?

I have tried finetuning Kimberley's model with my 3090 but with each epoch the results get slightly worse. This is with only audio chunk size reduced to about half and batch size 1-2 (regardless of accum grad)

Issues Related to TensorRT Accelerated Inference

After separating STFT and ISTFT from the BSRoformer class, I was able to successfully export the model to ONNX, and trtexec could convert the ONNX model to a TensorRT engine. However, TensorRT did not accelerate the inference; instead, it was twice as slow compared to the Torch implementation.

Torch takes approximately 0.13 seconds to infer a slice, while TensorRT takes 0.27 seconds to infer the same slice (tested on an RTX 4090). Using NVIDIA Nsight for monitoring, the preliminary analysis suggests that the slowdown is caused by the Tile operation. Is there any way to alleviate this issue in TensorRT without retraining the model?

Nsight Result:

Modified source code:
https://github.com/bfloat16/Music-Source-Separation-Training

MelRoformer parameters from paper

I'm trying to reproduce the paper model. But I have no luck.

My current settings which gave me batch size only 2 for 48 GB memory:

  dim: 192
  depth: 8
  stereo: true
  num_stems: 1
  time_transformer_depth: 1
  freq_transformer_depth: 1
  num_bands: 60
  dim_head: 64
  heads: 8
  attn_dropout: 0.1
  ff_dropout: 0.1
  flash_attn: True
  dim_freqs_in: 1025
  sample_rate: 44100  # needed for mel filter bank from librosa
  stft_n_fft: 2048
  stft_hop_length: 512
  stft_win_length: 2048
  stft_normalized: False
  mask_estimator_depth: 2
  multi_stft_resolution_loss_weight: 1.0
  multi_stft_resolutions_window_sizes: !!python/tuple
  - 4096
  - 2048
  - 1024
  - 512
  - 256
  multi_stft_hop_size: 147
  multi_stft_normalized: False

On input I give 8 seconds of 44100Hz so length is 352800.

I run my code model through torchinfo:

from torchinfo import summary
summary(model, input_size=(1, 2, 352768))

Report is:

==============================================================================================================
Layer (type:depth-idx)                                       Output Shape              Param #
==============================================================================================================
MelBandRoformer                                              [1, 2, 352768]            56,503,768
├─ModuleList: 1-1                                            --                        --
│    └─ModuleList: 2-1                                       --                        384
│    │    └─Transformer: 3-77                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-78                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-2                                       --                        384
│    │    └─Transformer: 3-79                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-80                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-3                                       --                        384
│    │    └─Transformer: 3-81                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-82                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-4                                       --                        384
│    │    └─Transformer: 3-83                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-84                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-5                                       --                        384
│    │    └─Transformer: 3-85                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-86                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-6                                       --                        384
│    │    └─Transformer: 3-87                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-88                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-7                                       --                        384
│    │    └─Transformer: 3-89                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-90                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-8                                       --                        384
│    │    └─Transformer: 3-91                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-92                                [690, 60, 192]            (recursive)
├─BandSplit: 1-2                                             [1, 690, 60, 192]         --
│    └─ModuleList: 2                                         --                        --
....
├─ModuleList: 1-1                                            --                        --
│    └─ModuleList: 2-1                                       --                        384
│    │    └─Transformer: 3-77                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-78                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-2                                       --                        384
│    │    └─Transformer: 3-79                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-80                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-3                                       --                        384
│    │    └─Transformer: 3-81                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-82                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-4                                       --                        384
│    │    └─Transformer: 3-83                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-84                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-5                                       --                        384
│    │    └─Transformer: 3-85                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-86                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-6                                       --                        384
│    │    └─Transformer: 3-87                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-88                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-7                                       --                        384
│    │    └─Transformer: 3-89                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-90                                [690, 60, 192]            (recursive)
│    └─ModuleList: 2-8                                       --                        384
│    │    └─Transformer: 3-91                                [60, 690, 192]            (recursive)
│    │    └─Transformer: 3-92                                [690, 60, 192]            (recursive)
├─ModuleList: 1                                              --                        --
│    └─MaskEstimator: 2-9                                    [1, 690, 7916]            --
==============================================================================================================
Total params: 69,102,468
Trainable params: 69,102,404
Non-trainable params: 64
Total mult-adds (G): 8.35
==============================================================================================================
Input size (MB): 2.82
Forward/backward pass size (MB): 703.40
Params size (MB): 232.17
Estimated Total Size (MB): 938.40
==============================================================================================================

From report I expect to have batch more than 48. But in the end I can use batch only 2.

GPU memory usage for batch = 2:

To follow the paper I must increase dim to 384, depth to 12 and decrease stft_hop_length to 441 - to be 10 ms. In this case batch size will be only 1 or not fit in memory )

Any ideas how to deal with such big memory usage?

Hi @lucidrains + @ZFTurbo

          Hi @lucidrains + @ZFTurbo

I work with birds and have a audio dataset the sounds they make that I would like to train with this code, I have had success with other models that were authored for music. Please can I request for a way to train this model to be added to the code? Sorry i can not accomplish this myself.

Originally posted by @lyndonlauder in #4 (comment)