
descriptinc / descript-audio-codec


State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio.

Home Page: https://descript.notion.site/Descript-Audio-Codec-11389fce0ce2419891d6591a68f814d5

License: MIT License

Languages: Dockerfile 0.18%, Python 99.82%
audio audio-compression codec compression-algorithm deep-learning gans generative-adversarial-network pytorch residual-vector-quantization

descript-audio-codec's People

Contributors

eeishaan, pseeth, ritheshkumar95


descript-audio-codec's Issues

Very low bitrate models

Hello,

first of all, thank you for the great work! The results look really awesome!

In the paper you also reported results for very low bitrates (~1.6 kbps at 22050 Hz). Do you have any plans to release these models?

Encoding new file - use of `zero_pad`

Hi, I am trying to understand the encoding procedure and am confused about this line in chunked inference:

audio_signal.zero_pad(self.delay, self.delay)

It is supposed to pad the signal with zeros at both the start and the end. But since the result is not stored anywhere, I wonder what the purpose of this line is?
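For reference, here is a minimal illustration in plain PyTorch (not the audiotools implementation) of what padding a [B, 1, T] signal with delay zeros on both ends looks like; whether the actual zero_pad call mutates the signal in place is exactly what the question above is asking.

import torch
import torch.nn.functional as F

delay = 512                       # hypothetical one-sided padding amount
audio = torch.randn(1, 1, 44100)  # [batch, channels, samples]

# Pad the last (time) dimension with `delay` zeros at the start and at the end.
padded = F.pad(audio, (delay, delay))
print(audio.shape, padded.shape)  # torch.Size([1, 1, 44100]) torch.Size([1, 1, 45124])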

Padding Mismatches Output Dimension in Conv1d

Thank you for sharing your nice neural feature extractor for arbitrary audio!

In the source code (the main branch), there are several places that mention padding=1, etc. However, when I print out the dimensions of the outputs and the model itself, no Conv1d layer's padding parameter shows up in the logs.

Residual unit: 44532 -> 44514 by self.block=Sequential(
  (0): Snake1d()
  (1): Conv1d(64, 64, kernel_size=(7,), stride=(1,), dilation=(3,))
  (2): Snake1d()
  (3): Conv1d(64, 64, kernel_size=(1,), stride=(1,))
)

It seems to me that the padding parameters are being ignored?

I found this out when I tried to write an efficient function to map an original audio offset to the corresponding encoded code index.

import math

from audiotools import AudioSignal

import dac


def get_conv1d_output_length(
    kernel_size,
    input_length,
    dilation=1,
    stride=1,
    padding=0,
):
    # Standard Conv1d output-length formula:
    # L_out = floor((L_in + 2*p - d*(k - 1) - 1) / s) + 1
    L = input_length
    d = dilation
    k = kernel_size
    s = stride
    p = padding
    L = math.floor(((L + 2 * p - d * (k - 1) - 1) / s) + 1)
    return L


def get_conv1d_transpose_output_length(dilation, kernel_size, stride, input_length):
    L = input_length
    d = dilation
    k = kernel_size
    s = stride
    L = (L - 1) * s + d * (k - 1) + 1
    return L


def get_residual_output_length(dilation, input_length):
    pad = 0  # ((7 - 1) * dilation) // 2
    # print("Main pad = ", pad)
    l_y = get_conv1d_output_length(
        kernel_size=7,
        dilation=dilation,
        padding=pad,
        input_length=input_length,
    )
    l_y = get_conv1d_output_length(
        kernel_size=1,
        input_length=l_y,
    )
    pad = (input_length - l_y) // 2
    if pad > 0:
        print(f"pad pad {pad=}")
        return input_length - 2 * pad
    print(f"==Residual: {input_length} -> {l_y}")
    return l_y


def get_encoder_block_output_length(stride, input_length):
    l = get_residual_output_length(dilation=1, input_length=input_length)
    l = get_residual_output_length(dilation=3, input_length=l)
    l = get_residual_output_length(dilation=9, input_length=l)
    l = get_conv1d_output_length(
        kernel_size=2 * stride,
        stride=stride,
        padding=0,  # math.ceil(stride / 2),
        input_length=l,
    )
    print(f"=Encoder block:  {input_length} -> {l}")
    return l


def get_encoded_dim(input_length):
    # wnconv1d k=7 p=3 s=1 d=1
    l = get_conv1d_output_length(
        kernel_size=7, padding=0, input_length=input_length
    )  # padding=3
    print(f"After encoder first conf1d: {input_length} -> {l=}")
    # EncoderBlock(stride=2, 4, 8, 8)
    # EncoderBlock = Residual(dilation=1, 3, 9), WNConv1d(k=2*stride, stride=stride, padding=ceil(stride/2))
    l = get_encoder_block_output_length(2, l)
    l = get_encoder_block_output_length(4, l)
    l = get_encoder_block_output_length(8, l)
    l = get_encoder_block_output_length(8, l)
    l = get_conv1d_output_length(
        kernel_size=3, padding=0, input_length=l
    )  # padding = 1
    return l


def main():
    model_path = dac.utils.download(model_type="44khz")
    model = dac.DAC.load(model_path)
    model.to("mps:0")
    signal = AudioSignal("kyunggi.wav")
    signal.to_mono()
    x = model.compress(signal)  # <class 'dac.model.base.DACFile'>
    # print(f"{type(x)=}")
    print(f"{x.codes.shape=}")
    print(f"{x.codes=}")
    print(f"{x.chunk_length=}")
    print(f"{x.original_length=}")
    print(f"{x.sample_rate=}")
    print(f"{signal.signal_length=} {signal.signal_duration=}")
    print(f"{model.get_output_length(403956)=}")
    # x.codes.shape[2] = 1080
    # x.chunk_length = 72
    # x.original_length = 403956
    # x.sample_rate = 44100
    # signal.signal_length = 403956
    # signal.signal_duration = 9.15
    # 403956/44100 = 9.15s
    # signal_duration = signal_length / sample_rate
    # Save and load to and from disk
    # x.save("compressed.dac")
    # x.original_length = 623616 # 668,160 403956
    # y = model.decompress(x)


if __name__ == "__main__":
    print(f"{get_encoded_dim(44544)=}")  # 44538 44544
    main()

The code above calculates 72 codes given 44544 samples, which matches the output of the encoder. However, I had to set all paddings to 0 in the calculation to get that match; I could not find any way to reproduce the output dimensions using the padding values specified in the source code.

audiotools error

OSError: libtorch_cuda_cpp.so: cannot open shared object file: No such file or directory

Do commit_loss and codebook_loss always end up equal?

When I try to retrain the DAC model, I find that commit_loss and codebook_loss are always equal at each iteration.

Is this correct?
code location:
quantize.py, class VectorQuantize
commitment_loss = F.mse_loss(z_e, z_q.detach(), reduction="none").mean([1, 2])
codebook_loss = F.mse_loss(z_q, z_e.detach(), reduction="none").mean([1, 2])
print('commitment_loss/codebook_loss:', commitment_loss, codebook_loss)
e.g.:
commitment_loss/codebook_loss: tensor([1.6723, 2.0803, 1.9611, 1.8907], device='cuda:0',
grad_fn=) tensor([1.6723, 2.0803, 1.9611, 1.8907], device='cuda:0',
grad_fn=)
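For what it's worth, the forward values are expected to coincide, because the squared error is symmetric in its two arguments; only the gradient paths differ. A minimal self-contained sketch (plain PyTorch, not the repo's code):

import torch
import torch.nn.functional as F

z_e = torch.randn(4, 8, 100, requires_grad=True)  # stand-in for the encoder output
z_q = torch.randn(4, 8, 100, requires_grad=True)  # stand-in for the quantized output

commitment_loss = F.mse_loss(z_e, z_q.detach(), reduction="none").mean([1, 2])
codebook_loss = F.mse_loss(z_q, z_e.detach(), reduction="none").mean([1, 2])

# Identical values, since (a - b)^2 == (b - a)^2 elementwise...
print(torch.allclose(commitment_loss, codebook_loss))  # True
# ...but detach() routes the gradients to different tensors (z_e vs. z_q).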

Does DAC support batch processing

Hi DAC team,

I wonder whether it is possible to include an example in this repo that uses DAC to process audio in batches?

Thanks
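Not an official example, but a minimal sketch of one way batching could work, assuming the model accepts a [B, 1, T] tensor of equal-length clips already at the model's sample rate:

import torch
import dac

model = dac.DAC.load(dac.utils.download(model_type="44khz"))
model.eval()

# Hypothetical batch of four 1-second mono clips at 44.1 kHz.
batch = torch.randn(4, 1, 44100)

with torch.no_grad():
    x = model.preprocess(batch, 44100)         # pads each clip to a multiple of the hop length
    z, codes, latents, _, _ = model.encode(x)  # codes: [4, n_codebooks, T_codes]
print(codes.shape)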

entropy computation script not working (it used to work before)

I downloaded the model weights from here:
https://github.com/descriptinc/descript-audio-codec/releases/download/0.0.1/weights.pth

But the script:
https://github.com/descriptinc/descript-audio-codec/blob/main/scripts/compute_entropy.py
does not work.

It used to work before commit #22 last week.

Here is the error:

descript-audio-codec$ CUDA_VISIBLE_DEVICES=3 python3 scripts/compute_entropy.py input/ weights.pth
  0%|                                                                                      | 0/3 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/project/user/descript-audio-codec/scripts/compute_entropy.py", line 50, in <module>
    main()
  File "/home/user/miniconda3/envs/e2e/lib/python3.9/site-packages/argbind/argbind.py", line 159, in cmd_func
    return func(*cmd_args, **kwargs)
  File "/mnt/project/user/descript-audio-codec/scripts/compute_entropy.py", line 31, in main
    codes.append(o["codes"].cpu())
TypeError: tuple indices must be integers or slices, not str

The folder "input" has three 44.1 kHz mono audio files.

Different code sizes when encoding versus when compressing

Hello,

Thanks again for the great work.

I am raising this issue because I noticed that I get a different dimensionality for codes when using model.encode compared to when using model.compress. As an example, I used the script you provide under 'Programmatic Usage' in the README file.

For 10 seconds of audio @ 44100 Hz, z has a dimensionality of [1, 1024, 862] and codes has a dimensionality of [1, 9, 862]. Those are the quantised continuous representation and the codebook indices, respectively, returned by the quantizer (ResidualVectorQuantize) when calling model.encode.

For the same 10 seconds of audio, z has a dimensionality of [1, 1024, 1152] and codes has a dimensionality of [1, 9, 1152]. Those are the quantised continuous representation and the codebook indices, respectively, returned when calling model.compress, before creating the DAC file. It seems like the number of 72-sized chunks differs between those two cases? Am I misunderstanding something here?
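For reference, rough bookkeeping for those two shapes (assuming the 44.1 kHz model's hop length of 512, i.e. the product of the encoder strides 2·4·8·8): a single model.encode pass over 10 s of audio gives about ceil(441000 / 512) = 862 frames, whereas model.compress encodes padded chunks of 72 codes each, so 1152 = 16 × 72 includes the per-chunk padding overhead discussed in the chunked-inference issue further down this page.

samples = 10 * 44100       # 441000 samples of audio
hop = 2 * 4 * 8 * 8        # 512, product of the encoder strides
print(-(-samples // hop))  # ceil(441000 / 512) = 862 frames from model.encode
print(1152 // 72)          # 16 chunks of 72 codes each from model.compress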

Thank you!

Mary

Were stft loss and waveform loss used in your training?

Hi,

Thank you for sharing your contribution.

I read the code and found that you use an STFT loss and a waveform loss (L1 reconstruction loss), but based on the config params in conf/base.yml it seems that these two losses have no weight assigned, so I think they are not used in back-propagation.

I wonder whether you used these two losses for your released models?
Do you think these losses are not useful for training?

Thank you.

Decoder using less VRAM

I notice the decoder uses a ton of VRAM (in some cases causing an OOM error).

Here's me decoding a ~5 minute audio file
[screenshot]

One solution to reduce VRAM is to split the codes into chunks, decode each chunk, and then concatenate them:

# split into 100 chunks
chunks = np.array_split(data["codes"], 100, axis=-1)
final = []
for chunk in chunks:
    data["codes"] = chunk
    # decode each chunk
    recon = decode(data, 'cuda', model, preserve_sample_rate=True)
    final.append(recon.audio_data)
# concatenate audio data
concatenated = torch.cat(final, dim=-1)
# shorten the end to the original length
concatenated = concatenated[:,:,:data["metadata"]["original_length"]]
# save
torchaudio.save('recon.wav', concatenated[0], data["metadata"]["original_sr"])

However, this causes audio normalization weirdness. You can see it here, from bottom to top: the original wav, default decoding (no chunking), 10 chunks, 160 chunks:
[screenshot]

and clipping
[screenshot]

How do I solve that?

Stereo parameter?

Hi, great project. I encoded a 2-channel input audio file with the 44khz model and everything works fine, but it seems to use only 1 channel.

Decoded wave shows:
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, 1 channels, s16, 705 kb/s

How can I make encoding and decoding use 2 channels for stereo?

thx

Some questions about the official log

Thanks for your amazing work. I'm training DAC based on your code, and I was wondering whether you could share your official log file? I want to know the final losses, to judge whether my training has converged.

Details on training and inferencing the 16kHz variant

Hi there! Thanks for the great work. I noticed in your paper that all audio samples are resampled to 44.1kHz, which I understand is the setting for the 44kHz variant.

I was wondering if you could kindly provide some clarification regarding the 16kHz version. Are the audio samples resampled to 16kHz for training? Or is the input 16kHz and then resampled to 44kHz at inference time?

Some examples of how to train and use these variants would be a great help.

16kHz model is needed!

As most speech datasets are 16 kHz, can you provide a well-trained 16kHz (160-stride) model?

Why does the 24kHz model use 32 codebooks?

The 9 codebooks of the 44kHz model seemed to work really well. Even when I used only the first 4 codebooks, I found the quality totally sufficient for speech signals. That's why I was surprised to see that the 24kHz model uses 32 codebooks. What is the reason for that? I thought that a signal with a lower sampling rate could be compressed even better, i.e. fewer codebooks would be sufficient.

model = DAC()
model = load_model(model_type="24khz", tag="0.0.4")
print(model.n_codebooks)
>> 32

Chunked inference result depends on chunk length

First of all, thanks for the great work and clean code!

For the purpose of training a model on the discrete codes (as opposed to just encoding and decoding a signal), the current chunked inference is not ideal. As nicely summarized by @cinjon in #35, the current implementation slices up the input into chunks of about the requested chunk length, encodes them separately, and saves the blocks of latent codes along with the chunk length. However, concatenating the separately encoded chunks gives a different sequence of discrete codes than encoding the whole sequence at once (or, more generally, using a different chunk size). Specifically, decoding with a larger chunk size will lead to repeated audio segments at the original chunk boundaries (about 5ms per boundary in the default settings). This means a model cannot be fed with arbitrary excerpts from the discrete code sequence; the excerpts have to be aligned on chunk boundaries to be meaningful, and the model will have to learn to model the boundaries at the expected positions. It also means I cannot jump to a specific position in the audio by just multiplying the timestamp by 86Hz.

Since the model is convolutional, it is possible to implement a chunked inference that gives the same result as passing the full signal (except at the beginning and end, since we cannot simulate the zero-padding of hidden layers by padding the input). This entails setting padding to 0 in all Conv1d layers, zero-padding the input signal / code sequence (before chunking), and overlapping the chunks by the amount of padding. The current implementation already sets padding to 0, and pads the input, but chose a different strategy: to obtain the same codes, the input signal chunks would overlap by the amount they're padded with, and the code chunks would be padded and overlapped as well, but the decompression routine neither pads nor overlaps the codes. Instead, it relies on the input signal being padded and overlapped to cater for both the encoder and the decoder (i.e., it produces overlapped and padded code chunks for the chunk length that was used for encoding).
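As a rough illustration of the overlap-based scheme described above (a conceptual sketch, not the repo's implementation): assume every Conv1d uses padding 0, pad is the total one-sided context the encoder consumes, hop is the product of the encoder strides, win is a multiple of hop, and encode_chunk is a hypothetical function that turns one chunk of samples into codes.

import torch
import torch.nn.functional as F

def chunked_encode(audio, encode_chunk, hop=512, pad=1024, win=32768):
    # audio: [1, 1, T]; the default numbers are purely illustrative.
    audio = F.pad(audio, (pad, pad))  # simulate the full-signal zero-padding once, up front
    codes = []
    start = 0
    while start + win + 2 * pad <= audio.shape[-1]:
        # Neighbouring chunks overlap by 2 * pad samples, so each chunk sees the
        # same left/right context it would have seen inside the full signal.
        chunk = audio[..., start : start + win + 2 * pad]
        codes.append(encode_chunk(chunk))  # each chunk yields win // hop codes
        start += win                       # advance by the un-padded window length
    # (handling of the remainder at the end of the signal is omitted for brevity)
    return torch.cat(codes, dim=-1)

Under those assumptions, concatenating the per-chunk codes matches the codes of a single full-signal pass, except at the very beginning and end.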

byte count on 16kHz decoding

Hi,
I am getting an error on decoding when I use "16khz". For my two-second files, the original length is 32000 samples, but the reconstruction comes up 8 samples short:

File "/home/lonce/working/descript-audio-codec/dac/model/base.py", line 289, in decompress
recons.audio_data = recons.audio_data.reshape(
RuntimeError: shape '[-1, 1, 32000]' is invalid for input of size 31992

I can "fix" the error by just hard-coding the the length argument to the reshape operation (on line 289 in body.py) to 3192. For the general fix, I suppose the reshape should be given the length of the recon signal, not the original signal. Or else the reconstruction should produce the exact same number of bytes as the original files.

This is using code pulled from github on 2023.08.17.

P.S. Nice work on the codec - it sounds great, the compression is amazing, and the docs are easy to understand!
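A minimal sketch of the workaround described above, assuming the reconstruction is a flat [1, 1, N] tensor that may come back a few samples short of (or longer than) the original length; the helper name is hypothetical.

import torch
import torch.nn.functional as F

def fix_length(audio: torch.Tensor, original_length: int) -> torch.Tensor:
    # audio: reconstruction of shape [1, 1, N]; pad or trim to original_length.
    n = audio.shape[-1]
    if n < original_length:
        return F.pad(audio, (0, original_length - n))
    return audio[..., :original_length]

# e.g. a reconstruction that came back 8 samples short of 32000:
recon = torch.zeros(1, 1, 31992)
print(fix_length(recon, 32000).shape)  # torch.Size([1, 1, 32000])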

Release 24kbps model

According to your ablation study, the 24kbps model scored better. Can you release it?
[screenshot]

Error with 16khz

Decoding files:   0%|                                                                            | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\cross\miniconda3\envs\audio\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\cross\miniconda3\envs\audio\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\cross\miniconda3\envs\audio\lib\site-packages\dac\__main__.py", line 36, in <module>
    run(group)
  File "C:\Users\cross\miniconda3\envs\audio\lib\site-packages\dac\__main__.py", line 28, in run
    stage_fn()
  File "C:\Users\cross\miniconda3\envs\audio\lib\site-packages\argbind\argbind.py", line 159, in cmd_func
    return func(*cmd_args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\audio\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\audio\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\audio\lib\site-packages\dac\utils\decode.py", line 76, in decode
    recons = generator.decompress(artifact, verbose=verbose)
  File "C:\Users\cross\miniconda3\envs\audio\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\audio\lib\site-packages\dac\model\base.py", line 270, in decompress
    z = self.quantizer.from_codes(c)[0]
  File "C:\Users\cross\miniconda3\envs\audio\lib\site-packages\dac\nn\quantize.py", line 215, in from_codes
    z_p_i = self.quantizers[i].decode_code(codes[:, i, :])
  File "C:\Users\cross\miniconda3\envs\audio\lib\site-packages\torch\nn\modules\container.py", line 295, in __getitem__
    return self._modules[self._get_abs_string_index(idx)]
  File "C:\Users\cross\miniconda3\envs\audio\lib\site-packages\torch\nn\modules\container.py", line 285, in _get_abs_string_index
    raise IndexError(f'index {idx} is out of range')
IndexError: index 9 is out of range

How is stereo compression done?

Hello!

This is connected to the question I asked after the issue (#25) was closed.

The "About" section of this repo says "Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio".
However, the paper doesn't say anything special about stereo compression and/or stereo data.

So, how is the stereo compression done currently?
I understood that all the shared models (24 kHz and 44.1 kHz) are designed for mono compression only.
If my understanding is correct, do you downmix the two channels and feed the result to the encoder?
If yes, how do you recreate the two channels in the decoder?

Is there any documentation about your stereo compression scheme?

Memory leak?

Hi, I am trying to process my audio clip in 1-second segments, given the limits of my GPU memory. However, the allocated memory seems to steadily increase throughout my loop over the segments:

  signal.audio_data = signal.audio_data
  for i in range(num_segments):
      start = i * samples_per_segment
      end = min(start + samples_per_segment, total_samples)
      segment = signal.audio_data[:, :, start:end].to(dac_model.device)

      # Process each segment
      x = dac_model.preprocess(segment, signal.sample_rate)

      print("to begin", torch.cuda.memory_allocated(5))
      z, codes, latents, commitment_loss, codebook_loss = dac_model.encode(x)
      embeddings.append(rearrange(z, '1 e t -> t e').cpu()) 

      del x, z, codes, latents, commitment_loss, codebook_loss
      torch.cuda.empty_cache()
      print("to end", torch.cuda.memory_allocated(5))

Each time z is computed, I move it to the CPU, manually delete all variables that are on the GPU, and clear the torch cache. The only thing kept on the GPU is dac_model, which I reuse for the next iteration. Is the model itself accumulating memory? The memory print output shows a steady increase like this until it reaches OOM:

to begin 312585728
to end 10856556032
to begin 10858396160
to end 21401096704

I wonder whether this is the expected behavior or there is a memory leak somewhere? If it's expected, what is the correct way to use DAC to compute encodings for segments of a long audio file? I am on CentOS Linux 7 (Core), Python 3.7.5 and torch 1.13.1. The model is the 44khz one.
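One possible culprit (a guess, not a confirmed diagnosis): the loop above never disables autograd, so every encode call builds and keeps a computation graph. A minimal sketch of the same loop with gradients turned off, using the variables from the snippet above:

import torch
from einops import rearrange

embeddings = []
with torch.inference_mode():  # no autograd graph is built, activations are freed per iteration
    for i in range(num_segments):
        start = i * samples_per_segment
        end = min(start + samples_per_segment, total_samples)
        segment = signal.audio_data[:, :, start:end].to(dac_model.device)
        x = dac_model.preprocess(segment, signal.sample_rate)
        z, codes, latents, commitment_loss, codebook_loss = dac_model.encode(x)
        embeddings.append(rearrange(z, '1 e t -> t e').cpu())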

Can't set n_quantizers for encode()

It appears that DAC.encode() always uses 9 codebooks, ignoring the n_quantizer argument. The z and code index vectors returned are always the same (see screen capture below).

[screenshot: NumQuantizers]

Thank you,

  • lonce

(Paper Error?) MSD Not Used?

Your paper says "Like prior work, we use multi-scale (MSD) and multi-period waveform discriminators (MPD) which lead to improved audio fidelity."

However, by default it appears the trainer has an empty rates array which means no MSDs are initialized. Was this intentional?

tensor shape mismatch when training on 24khz LibriTTS dataset

Traceback (most recent call last):
File "/home/v-zhikangniu/descript-audio-codec/scripts/train.py", line 441, in
train(args, accel)
File "/home/v-zhikangniu/miniconda3/envs/dac/lib/python3.9/site-packages/argbind/argbind.py", line 159, in cmd_func
return func(*cmd_args, **kwargs)
File "/home/v-zhikangniu/descript-audio-codec/scripts/train.py", line 425, in train
validate(state, val_dataloader, accel)
File "/home/v-zhikangniu/miniconda3/envs/dac/lib/python3.9/site-packages/audiotools/ml/decorators.py", line 375, in decorated
output = fn(*args, **kwargs)
File "/home/v-zhikangniu/descript-audio-codec/scripts/train.py", line 344, in validate
output = val_loop(batch, state, accel)
File "/home/v-zhikangniu/miniconda3/envs/dac/lib/python3.9/site-packages/audiotools/ml/decorators.py", line 321, in decorated
output = fn(*args, **kwargs)
File "/home/v-zhikangniu/miniconda3/envs/dac/lib/python3.9/site-packages/audiotools/ml/decorators.py", line 107, in decorated
output = fn(*args, **kwargs)
File "/home/v-zhikangniu/miniconda3/envs/dac/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/v-zhikangniu/descript-audio-codec/scripts/train.py", line 220, in val_loop
"loss": state.mel_loss(recons, signal),
File "/home/v-zhikangniu/miniconda3/envs/dac/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/v-zhikangniu/miniconda3/envs/dac/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/v-zhikangniu/descript-audio-codec/dac/nn/loss.py", line 322, in forward
loss += self.log_weight * self.loss_fn(
File "/home/v-zhikangniu/miniconda3/envs/dac/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/v-zhikangniu/miniconda3/envs/dac/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/v-zhikangniu/miniconda3/envs/dac/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 101, in forward
return F.l1_loss(input, target, reduction=self.reduction)
File "/home/v-zhikangniu/miniconda3/envs/dac/lib/python3.9/site-packages/torch/nn/functional.py", line 3308, in l1_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/home/v-zhikangniu/miniconda3/envs/dac/lib/python3.9/site-packages/torch/functional.py", line 76, in broadcast_tensors
return _VF.broadcast_tensors(tensors) # type: ignore[attr-defined]
RuntimeError: The size of tensor a (15000) must match the size of tensor b (15001) at non-singleton dimension 3

How to encode audio with different settings?

Thank you for an amazing paper and open-source release. I'm wondering how we can try different settings of the encoder/decoder, for example to test the quality with different numbers of codebooks. Do we need separately trained models for the different compression ratios, or could you provide some examples of how to do it?

Error when set win_duration small

I get an error when win_duration is set small. For example, when win_duration = 100ms at inference time, the function get_output_length returns a negative value.
Are small values of win_duration not supported?

broken training: please specify versions of libraries used

Hi, I am trying to run training, but there are multiple problems arising from what seems to be library version mismatch.

For example, one I ran into this morning was solved by downgrading soundfile from 0.12 to 0.10.3.post1, but there seem to be other problems which I apparently was not the only one to run into. #18 mentioned that the problem solved itself after "getting access to a colleague's venv". After reading the contents there, I have started to suspect that this is actually due to a library version mismatch. (My current problem is identical in nature to the first post in #18, with a Tcl_AsyncDelete: async handler deleted by the wrong thread.)

If anyone has a working setup for training, would you be willing to post your pip freeze and/or conda list --export output below, depending on which package manager you used? It would also be immensely helpful if you could specify your Python version as well.

Thanks in advance!

Inference speed

Hi, thanks for the excellent work!
I'm trying to use DAC (24kHz) as a replacement for EnCodec in some audio tasks; however, I'm noticing a significant slowdown of the training process.
I did some quick measurements of the inference speed and it's around 4x slower than EnCodec. Is this expected behaviour?

config file for 24kHz version

Hi, thanks for sharing 24kHz pre-trained model.

I could find 'metadata' in the .pth file, but it only contains hyperparameters for the generator.
I am wondering about the discriminator / loss configurations for 24kHz.
(I guess the STFT params can differ between the 24kHz and 44.1kHz setups.)
Could you share the .yaml config file for the 24kHz models?

training does not work

Hi all:

Did anyone manage to start the training?
If yes, could you please share your environment?

I created a separate virtual environment (Python 3.10.11). I'm using CUDA Version: 11.4; Ubuntu 20.04.2 LTS.
I followed all the instructions.
pip install git+https://github.com/descriptinc/descript-audio-codec

Encoding + decoding works!

Then did the training pre-requisites step:
pip install -e ".[dev]"

When I start training:

export CUDA_VISIBLE_DEVICES=0
python scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/

It gets stuck for a long time displaying the output below and then exits!

[11:50:04] Saving audio samples to TensorBoard                                     decorators.py:220

─────────────────────────────────────────── train_loop() ───────────────────────────────────────────

╭─────────────────────────────────────────── Progress ────────────────────────────────────────────╮
│                                              train                                              │
│                                             ╷                      ╷                            │
│       key                                   │ value                │ mean                       │
│     ╶───────────────────────────────────────┼──────────────────────┼──────────────────────╴     │
│       adv/disc_loss                         │   7.859243           │   7.859243                 │
│       adv/feat_loss                         │  22.607071           │  22.607071                 │
│       adv/gen_loss                          │   7.853064           │   7.853064                 │
│       loss                                  │ 157.455719           │ 157.455719                 │
│       mel/loss                              │   6.766288           │   6.766288                 │
│       other/batch_size                      │  12.000000           │  12.000000                 │
│       other/grad_norm                       │        nan           │   0.000000                 │
│       other/grad_norm_d                     │        inf           │   0.000000                 │
│       other/learning_rate                   │   0.000100           │   0.000100                 │
│       stft/loss                             │   6.870012           │   6.870012                 │
│       vq/codebook_loss                      │   2.315346           │   2.315346                 │
│       vq/commitment_loss                    │   2.315346           │   2.315346                 │
│       waveform/loss                         │   0.116164           │   0.116164                 │
│       time/train_loop                       │  19.523806           │  19.523806                 │
│                                             ╵                      ╵                            │
│                                                                                                 │
│     ⠏ Iteration (train) 1/250000 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:03:17 / -:--:--     │
│     ⠏ Iteration (val)   0/63     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:03:17 / -:--:--     │
╰─────────────────────────────────────────────────────────────────────────────────────────────────╯Tcl_AsyncDelete: async handler deleted by the wrong thread
Aborted (core dumped)

Any idea why training is not working?

Benchmarks / Metrics?

Hello,

Will benchmarks / metrics be included for this audio codec vs. other popular competitors such as EnCodec/SoundStream?

Thanks for open-sourcing!

finding out the stride of the model

Hello, I would like to use a model's codebook to label a dataset of audio files by chunks. The first problem I am encountering is finding out the actual stride of the model, that is, how much audio it labels per code entry.

So I tried the following. First, on the 24khz model, I divided model.hop_length / 24000 and got 0.013333333333333334 (which should be a value in seconds), but that doesn't make much sense to me, as it's not close to a round number. So then I tried labeling an audio file 79.55990929705216 seconds long. In the resulting class, chunk_length is 72, so I divided the number of resulting representations (result.codes.shape[-1]), which is 8928, by 72, and got 124.0. Then, dividing the total length in seconds by this value, I obtain 0.6416121717504206 (seconds), which is not even close to my first value, even when trying to adjust the numbers to account for some sort of padding.

This might be a stupid question, but I think it's worth asking to save me and everyone else some time, as I'm quite lost. I also tried looking through the very clear source code and paper, but I think I'm missing something obvious. Thanks! P.S. If possible, please include verbal descriptions of any images you might want to add to your comments, as I am totally blind.
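For reference, the general relation is: model.hop_length (the product of the encoder strides) is the number of input samples consumed per code frame, so seconds-per-frame = hop_length / sample_rate and frames-per-second = sample_rate / hop_length. The 0.64 s figure above is seconds per chunk of 72 codes (124 chunks over 79.56 s), not seconds per code, so the two numbers measure different things. A short sketch using the numbers quoted above; note also that model.compress chunks and pads the signal, so result.codes.shape[-1] from compress can be larger than duration × frames-per-second.

# Bookkeeping sketch using the values quoted in the question above.
sample_rate = 24000
seconds_per_frame = 0.013333333333333334             # model.hop_length / sample_rate
hop_length = round(seconds_per_frame * sample_rate)  # 320 samples per code frame
frames_per_second = sample_rate / hop_length         # 75.0 code frames per second
print(hop_length, frames_per_second)

# A single full-signal encode of a 79.56 s clip would give about:
print(79.55990929705216 * frames_per_second)  # ~5967 frames; compress yields more due to per-chunk padding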

error while training

I did the Installation like you described here: https://github.com/descriptinc/descript-audio-codec

  • Encoding and decoding works!

  • For training I did the pre-requisite step.

  • Created a “data” folder under the “descript-audio-codec” folder, placed in it a folder containing a few 44.1 kHz audio files (.wav) for training, validation and testing, and updated the base.yml file accordingly.

But, when I do:

export CUDA_VISIBLE_DEVICES=0
python scripts/train.py --args.load conf/ablations/baseline.yml --save_path runs/baseline/

I get the error below. Could you please let me know what I am doing wrong?

Accelerator(
  amp : bool = False
)
train(
  seed : int = 0
  save_path : str = runs/baseline/
  num_iters : int = 250000
  save_iters : list = [10000, 50000, 100000, 200000]
  sample_freq : int = 10000
  valid_freq : int = 1000
  batch_size : int = 12
  val_batch_size : int = 12
  num_workers : int = 4
  val_idx : list = [0, 1, 2, 3, 4, 5, 6, 7]
  lambdas : dict = {'mel/loss': 15.0, 'adv/feat_loss': 2.0, 'adv/gen_loss': 1.0, 'vq/commitment_loss': 0.25, 'vq/codebook_loss': 1.0}
)
load(
  resume : bool = False
  tag : str = latest
  load_weights : bool = False
)
DAC(
  encoder_dim : int = 64
  encoder_rates : list = [2, 4, 8, 8]
  decoder_dim : int = 1536
  decoder_rates : list = [8, 8, 4, 2]
  n_codebooks : int = 9
  codebook_size : int = 1024
  codebook_dim : int = 8
  quantizer_dropout : float = 1.0
  sample_rate : int = 44100
)
Discriminator(
  rates : list = []
  periods : list = [2, 3, 5, 7, 11]
  fft_sizes : list = [2048, 1024, 512]
  sample_rate : int = 44100
  bands : list = [[0.0, 0.1], [0.1, 0.25], [0.25, 0.5], [0.5, 0.75], [0.75, 1.0]]
)
AdamW(
  # scope = generator
  lr : float = 0.0001
  betas : list = [0.8, 0.99]
  eps : float = 1e-08
  weight_decay : float = 0.01
  amsgrad : bool = False
  maximize : bool = False
  capturable : bool = False
  differentiable : bool = False
)
ExponentialLR(
  # scope = generator
  gamma : float = 0.999996
)
AdamW(
  # scope = discriminator
  lr : float = 0.0001
  betas : list = [0.8, 0.99]
  eps : float = 1e-08
  weight_decay : float = 0.01
  amsgrad : bool = False
  maximize : bool = False
  capturable : bool = False
  differentiable : bool = False
)
ExponentialLR(
  # scope = discriminator
  gamma : float = 0.999996
)
build_dataset(
  # scope = train
  folders : dict = {'speech_fb': ['/data/testset44p1/'], 'speech_hq': ['/data/testset44p1/'], 'speech_uq': ['/data/testset44p1/'], 'music_hq': ['/data/testset44p1/'], 'music_uq': ['/data/testset44p1/'], 'general': ['/data/testset44p1/']}
)
AudioLoader(
  # scope = train
  sources : list = ['/data/testset44p1/']
  weights : NoneType = None
  relative_path : str =
  ext : list = ['.wav', '.flac', '.mp3', '.mp4']
  shuffle : bool = True
  shuffle_state : int = 0
)
build_transform(
  # scope = train
  augment_prob : float = 0.0
  preprocess : list = ['Identity']
  augment : list = ['Identity']
  postprocess : list = ['VolumeNorm', 'RescaleAudio', 'ShiftPhase']
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
VolumeNorm(
  # scope = train
  db : list = ['const', -16]
  name : NoneType = None
  prob : float = 1.0
)
RescaleAudio(
  # scope = train
  val : float = 1.0
  name : NoneType = None
  prob : int = 1
)
ShiftPhase(
  # scope = train
  shift : tuple = ('uniform', -3.141592653589793, 3.141592653589793)
  name : NoneType = None
  prob : int = 1
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : int = 1
)
AudioDataset(
  # scope = train
  n_examples : int = 10000000
  duration : float = 0.38
  offset : NoneType = None
  loudness_cutoff : int = -40
  num_channels : int = 1
  transform : Compose = <audiotools.data.transforms.Compose object at 0x7f20bc0cc520>
  aligned : bool = False
  shuffle_loaders : bool = False
  without_replacement : bool = True
)
AudioLoader(
  # scope = train
  sources : list = ['/data/testset44p1/']
  weights : NoneType = None
  relative_path : str =
  ext : list = ['.wav', '.flac', '.mp3', '.mp4']
  shuffle : bool = True
  shuffle_state : int = 0
)
build_transform(
  # scope = train
  augment_prob : float = 0.0
  preprocess : list = ['Identity']
  augment : list = ['Identity']
  postprocess : list = ['VolumeNorm', 'RescaleAudio', 'ShiftPhase']
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
VolumeNorm(
  # scope = train
  db : list = ['const', -16]
  name : NoneType = None
  prob : float = 1.0
)
RescaleAudio(
  # scope = train
  val : float = 1.0
  name : NoneType = None
  prob : int = 1
)
ShiftPhase(
  # scope = train
  shift : tuple = ('uniform', -3.141592653589793, 3.141592653589793)
  name : NoneType = None
  prob : int = 1
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : int = 1
)
AudioDataset(
  # scope = train
  n_examples : int = 10000000
  duration : float = 0.38
  offset : NoneType = None
  loudness_cutoff : int = -40
  num_channels : int = 1
  transform : Compose = <audiotools.data.transforms.Compose object at 0x7f20bc0cc970>
  aligned : bool = False
  shuffle_loaders : bool = False
  without_replacement : bool = True
)
AudioLoader(
  # scope = train
  sources : list = ['/data/testset44p1/']
  weights : NoneType = None
  relative_path : str =
  ext : list = ['.wav', '.flac', '.mp3', '.mp4']
  shuffle : bool = True
  shuffle_state : int = 0
)
build_transform(
  # scope = train
  augment_prob : float = 0.0
  preprocess : list = ['Identity']
  augment : list = ['Identity']
  postprocess : list = ['VolumeNorm', 'RescaleAudio', 'ShiftPhase']
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
VolumeNorm(
  # scope = train
  db : list = ['const', -16]
  name : NoneType = None
  prob : float = 1.0
)
RescaleAudio(
  # scope = train
  val : float = 1.0
  name : NoneType = None
  prob : int = 1
)
ShiftPhase(
  # scope = train
  shift : tuple = ('uniform', -3.141592653589793, 3.141592653589793)
  name : NoneType = None
  prob : int = 1
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : int = 1
)
AudioDataset(
  # scope = train
  n_examples : int = 10000000
  duration : float = 0.38
  offset : NoneType = None
  loudness_cutoff : int = -40
  num_channels : int = 1
  transform : Compose = <audiotools.data.transforms.Compose object at 0x7f20bc0ccdc0>
  aligned : bool = False
  shuffle_loaders : bool = False
  without_replacement : bool = True
)
AudioLoader(
  # scope = train
  sources : list = ['/data/testset44p1/']
  weights : NoneType = None
  relative_path : str =
  ext : list = ['.wav', '.flac', '.mp3', '.mp4']
  shuffle : bool = True
  shuffle_state : int = 0
)
build_transform(
  # scope = train
  augment_prob : float = 0.0
  preprocess : list = ['Identity']
  augment : list = ['Identity']
  postprocess : list = ['VolumeNorm', 'RescaleAudio', 'ShiftPhase']
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
VolumeNorm(
  # scope = train
  db : list = ['const', -16]
  name : NoneType = None
  prob : float = 1.0
)
RescaleAudio(
  # scope = train
  val : float = 1.0
  name : NoneType = None
  prob : int = 1
)
ShiftPhase(
  # scope = train
  shift : tuple = ('uniform', -3.141592653589793, 3.141592653589793)
  name : NoneType = None
  prob : int = 1
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : int = 1
)
AudioDataset(
  # scope = train
  n_examples : int = 10000000
  duration : float = 0.38
  offset : NoneType = None
  loudness_cutoff : int = -40
  num_channels : int = 1
  transform : Compose = <audiotools.data.transforms.Compose object at 0x7f20bc0cd210>
  aligned : bool = False
  shuffle_loaders : bool = False
  without_replacement : bool = True
)
AudioLoader(
  # scope = train
  sources : list = ['/data/testset44p1/']
  weights : NoneType = None
  relative_path : str =
  ext : list = ['.wav', '.flac', '.mp3', '.mp4']
  shuffle : bool = True
  shuffle_state : int = 0
)
build_transform(
  # scope = train
  augment_prob : float = 0.0
  preprocess : list = ['Identity']
  augment : list = ['Identity']
  postprocess : list = ['VolumeNorm', 'RescaleAudio', 'ShiftPhase']
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
VolumeNorm(
  # scope = train
  db : list = ['const', -16]
  name : NoneType = None
  prob : float = 1.0
)
RescaleAudio(
  # scope = train
  val : float = 1.0
  name : NoneType = None
  prob : int = 1
)
ShiftPhase(
  # scope = train
  shift : tuple = ('uniform', -3.141592653589793, 3.141592653589793)
  name : NoneType = None
  prob : int = 1
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : int = 1
)
AudioDataset(
  # scope = train
  n_examples : int = 10000000
  duration : float = 0.38
  offset : NoneType = None
  loudness_cutoff : int = -40
  num_channels : int = 1
  transform : Compose = <audiotools.data.transforms.Compose object at 0x7f20bc0cd660>
  aligned : bool = False
  shuffle_loaders : bool = False
  without_replacement : bool = True
)
AudioLoader(
  # scope = train
  sources : list = ['/data/testset44p1/']
  weights : NoneType = None
  relative_path : str =
  ext : list = ['.wav', '.flac', '.mp3', '.mp4']
  shuffle : bool = True
  shuffle_state : int = 0
)
build_transform(
  # scope = train
  augment_prob : float = 0.0
  preprocess : list = ['Identity']
  augment : list = ['Identity']
  postprocess : list = ['VolumeNorm', 'RescaleAudio', 'ShiftPhase']
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
VolumeNorm(
  # scope = train
  db : list = ['const', -16]
  name : NoneType = None
  prob : float = 1.0
)
RescaleAudio(
  # scope = train
  val : float = 1.0
  name : NoneType = None
  prob : int = 1
)
ShiftPhase(
  # scope = train
  shift : tuple = ('uniform', -3.141592653589793, 3.141592653589793)
  name : NoneType = None
  prob : int = 1
)
BaseTransform(
  # scope = train
  keys : list = []
  name : NoneType = None
  prob : int = 1
)
AudioDataset(
  # scope = train
  n_examples : int = 10000000
  duration : float = 0.38
  offset : NoneType = None
  loudness_cutoff : int = -40
  num_channels : int = 1
  transform : Compose = <audiotools.data.transforms.Compose object at 0x7f20bc0cdab0>
  aligned : bool = False
  shuffle_loaders : bool = False
  without_replacement : bool = True
)
build_dataset(
  # scope = val
  folders : dict = {'speech_hq': ['/data/testset44p1/'], 'music_hq': ['/data/testset44p1/'], 'general': ['/data/testset44p1/']}
)
AudioLoader(
  # scope = val
  sources : list = ['/data/testset44p1/']
  weights : NoneType = None
  relative_path : str =
  ext : list = ['.wav', '.flac', '.mp3', '.mp4']
  shuffle : bool = True
  shuffle_state : int = 0
)
build_transform(
  # scope = val
  augment_prob : float = 1.0
  preprocess : list = ['Identity']
  augment : list = ['Identity']
  postprocess : list = ['VolumeNorm', 'RescaleAudio', 'ShiftPhase']
)
BaseTransform(
  # scope = val
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
BaseTransform(
  # scope = val
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
VolumeNorm(
  # scope = val
  db : list = ['const', -16]
  name : NoneType = None
  prob : float = 1.0
)
RescaleAudio(
  # scope = val
  val : float = 1.0
  name : NoneType = None
  prob : int = 1
)
ShiftPhase(
  # scope = val
  shift : tuple = ('uniform', -3.141592653589793, 3.141592653589793)
  name : NoneType = None
  prob : int = 1
)
BaseTransform(
  # scope = val
  keys : list = []
  name : NoneType = None
  prob : int = 1
)
AudioDataset(
  # scope = val
  n_examples : int = 250
  duration : float = 5.0
  offset : NoneType = None
  loudness_cutoff : int = -40
  num_channels : int = 1
  transform : Compose = <audiotools.data.transforms.Compose object at 0x7f20bc0cdf60>
  aligned : bool = False
  shuffle_loaders : bool = False
  without_replacement : bool = True
)
AudioLoader(
  # scope = val
  sources : list = ['/data/testset44p1/']
  weights : NoneType = None
  relative_path : str =
  ext : list = ['.wav', '.flac', '.mp3', '.mp4']
  shuffle : bool = True
  shuffle_state : int = 0
)
build_transform(
  # scope = val
  augment_prob : float = 1.0
  preprocess : list = ['Identity']
  augment : list = ['Identity']
  postprocess : list = ['VolumeNorm', 'RescaleAudio', 'ShiftPhase']
)
BaseTransform(
  # scope = val
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
BaseTransform(
  # scope = val
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
VolumeNorm(
  # scope = val
  db : list = ['const', -16]
  name : NoneType = None
  prob : float = 1.0
)
RescaleAudio(
  # scope = val
  val : float = 1.0
  name : NoneType = None
  prob : int = 1
)
ShiftPhase(
  # scope = val
  shift : tuple = ('uniform', -3.141592653589793, 3.141592653589793)
  name : NoneType = None
  prob : int = 1
)
BaseTransform(
  # scope = val
  keys : list = []
  name : NoneType = None
  prob : int = 1
)
AudioDataset(
  # scope = val
  n_examples : int = 250
  duration : float = 5.0
  offset : NoneType = None
  loudness_cutoff : int = -40
  num_channels : int = 1
  transform : Compose = <audiotools.data.transforms.Compose object at 0x7f20bc0ce3b0>
  aligned : bool = False
  shuffle_loaders : bool = False
  without_replacement : bool = True
)
AudioLoader(
  # scope = val
  sources : list = ['/data/testset44p1/']
  weights : NoneType = None
  relative_path : str =
  ext : list = ['.wav', '.flac', '.mp3', '.mp4']
  shuffle : bool = True
  shuffle_state : int = 0
)
build_transform(
  # scope = val
  augment_prob : float = 1.0
  preprocess : list = ['Identity']
  augment : list = ['Identity']
  postprocess : list = ['VolumeNorm', 'RescaleAudio', 'ShiftPhase']
)
BaseTransform(
  # scope = val
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
BaseTransform(
  # scope = val
  keys : list = []
  name : NoneType = None
  prob : float = 1.0
)
VolumeNorm(
  # scope = val
  db : list = ['const', -16]
  name : NoneType = None
  prob : float = 1.0
)
RescaleAudio(
  # scope = val
  val : float = 1.0
  name : NoneType = None
  prob : int = 1
)
ShiftPhase(
  # scope = val
  shift : tuple = ('uniform', -3.141592653589793, 3.141592653589793)
  name : NoneType = None
  prob : int = 1
)
BaseTransform(
  # scope = val
  keys : list = []
  name : NoneType = None
  prob : int = 1
)
AudioDataset(
  # scope = val
  n_examples : int = 250
  duration : float = 5.0
  offset : NoneType = None
  loudness_cutoff : int = -40
  num_channels : int = 1
  transform : Compose = <audiotools.data.transforms.Compose object at 0x7f20bc0ce800>
  aligned : bool = False
  shuffle_loaders : bool = False
  without_replacement : bool = True
)
L1Loss(
  attribute : str = audio_data
  weight : float = 1.0
)
MultiScaleSTFTLoss(
  window_lengths : list = [2048, 512]
  clamp_eps : float = 1e-05
  mag_weight : float = 1.0
  log_weight : float = 1.0
  pow : float = 2.0
  weight : float = 1.0
  match_stride : bool = False
  window_type : NoneType = None
)
MelSpectrogramLoss(
  n_mels : list = [5, 10, 20, 40, 80, 160, 320]
  window_lengths : list = [32, 64, 128, 256, 512, 1024, 2048]
  clamp_eps : float = 1e-05
  mag_weight : float = 0.0
  log_weight : float = 1.0
  pow : float = 1.0
  weight : float = 1.0
  match_stride : bool = False
  mel_fmin : list = [0, 0, 0, 0, 0, 0, 0]
  mel_fmax : list = [None, None, None, None, None, None, None]
  window_type : NoneType = None
)
GANLoss(
)

Traceback (most recent call last):
  File "/home/user/descript-audio-codec/scripts/train.py", line 436, in <module>
    train(args, accel)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/argbind/argbind.py", line 159, in cmd_func
    return func(*cmd_args, **kwargs)
  File "/home/user/descript-audio-codec/scripts/train.py", line 410, in train
    for tracker.step, batch in enumerate(train_dataloader, start=tracker.step):
  File "/home/user/descript-audio-codec/scripts/train.py", line 63, in get_infinite_loader
    for batch in dataloader:
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
ZeroDivisionError: Caught ZeroDivisionError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/audiotools/data/datasets.py", line 487, in __getitem__
    return dataset[idx // len(self.datasets)]
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/audiotools/data/datasets.py", line 419, in __getitem__
    item[keys[0]] = loader(**loader_kwargs)
  File "/home/user/anaconda3/envs/dac/lib/python3.10/site-packages/audiotools/data/datasets.py", line 90, in __call__
    global_idx % len(self.audio_indices)
ZeroDivisionError: integer division or modulo by zero

Question regarding kernel size and stride direction in the MRD module

First and foremost, I'd like to thank you for your great work. It's truly impressive!

By the way, I have a question about the MRD module. When I looked into the implementation, it appears that the larger kernel size and stride are applied along the frequency axis of the spectrogram. However, I believe it might be more appropriate to apply them along the time axis. If that is the case, was it done intentionally?

# https://github.com/descriptinc/descript-audio-codec/blob/c7cfc5d2647e26471dc394f95846a0830e7bec34/dac/model/discriminator.py#L101
class MRD(nn.Module):
    def __init__(
        ...
        convs = lambda: nn.ModuleList(
            [
                ...
                WNConv2d(ch, ch, (3, 9), (1, 2), padding=(1, 4)),
                ...
            ]
        )
        self.band_convs = nn.ModuleList([convs() for _ in range(len(self.bands))])
        ...

    def spectrogram(self, x):
        ...
        x = rearrange(x, "b 1 f t c -> (b 1) c t f")
        # Split into bands
        x_bands = [x[..., b[0] : b[1]] for b in self.bands]
        return x_bands

    def forward(self, x):
        x_bands = self.spectrogram(x)
        ...
        for band, stack in zip(x_bands, self.band_convs):
            for layer in stack:
                band = layer(band)
                ...
        ...

Same error in #18

When I try to re-train the 24khz model, it gets stuck for a long time displaying the output below and then exits!
[screenshot]
Same error in #18

16kHz configs result in shape mismatch

Using the 16khz.yml configs on a different dataset, I get:

Traceback (most recent call last):
File "/mnt/archive2/dac/descript-audio-codec/scripts/train.py", line 441, in
train(args, accel)
File "/home/g/miniconda3/envs/dac/lib/python3.9/site-packages/argbind/argbind.py", line 159, in cmd_func
return func(*cmd_args, **kwargs)
File "/mnt/archive2/dac/descript-audio-codec/scripts/train.py", line 416, in train
train_loop(state, batch, accel, lambdas)
File "/home/g/miniconda3/envs/dac/lib/python3.9/site-packages/audiotools/ml/decorators.py", line 375, in decorated
output = fn(*args, **kwargs)
File "/home/g/miniconda3/envs/dac/lib/python3.9/site-packages/audiotools/ml/decorators.py", line 321, in decorated
output = fn(*args, **kwargs)
File "/home/g/miniconda3/envs/dac/lib/python3.9/site-packages/audiotools/ml/decorators.py", line 107, in decorated
output = fn(*args, **kwargs)
File "/mnt/archive2/dac/descript-audio-codec/scripts/train.py", line 259, in train_loop
output["mel/loss"] = state.mel_loss(recons, signal)
File "/home/g/miniconda3/envs/dac/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/archive2/dac/descript-audio-codec/dac/nn/loss.py", line 322, in forward
loss += self.log_weight * self.loss_fn(
File "/home/g/miniconda3/envs/dac/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/g/miniconda3/envs/dac/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 101, in forward
return F.l1_loss(input, target, reduction=self.reduction)
File "/home/g/miniconda3/envs/dac/lib/python3.9/site-packages/torch/nn/functional.py", line 3263, in l1_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/home/g/miniconda3/envs/dac/lib/python3.9/site-packages/torch/functional.py", line 74, in broadcast_tensors
return _VF.broadcast_tensors(tensors) # type: ignore[attr-defined]

RuntimeError: The size of tensor a (760) must match the size of tensor b (761) at non-singleton dimension 3

The only changes to the configs that I've made are num_workers, seed, iters, and valid_freq. batch["signal"] shows a signal of shape [24, 1, 6080] before the transforms. I didn't have any issues with the baseline 44kHz model. The only config differences between 16kHz and base are DAC.sample_rate, DAC.encoder_rates, DAC.decoder_rates, n_codebooks, DAC.quantizer_dropout, Discriminator_sample_rate, and num_iters.

Loading DAC files is insecure due to pickle

The current code saves DAC files in numpy format containing pickled Python dictionaries. Loading them then requires allowing pickle to recover them:

artifacts = np.load(path, allow_pickle=True)[()]

This can be an important security issue, because an attacker can craft malformed .dac files that run arbitrary code on load by taking advantage of pickle. It prevents DAC from being used as an audio encoding format that can be shared or used in public datasets. Please consider releasing an updated file format version that avoids pickle.
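As a hypothetical illustration of a pickle-free container (not a proposal for the official format): store the code array and the scalar metadata as plain arrays in an .npz archive, which can then be read back with allow_pickle left at its default of False.

import numpy as np

def save_artifact(path, codes, chunk_length, original_length, sample_rate):
    # codes: integer array of shape [1, n_codebooks, n_frames]; every field is a plain array.
    np.savez(
        path,
        codes=codes,
        chunk_length=np.int64(chunk_length),
        original_length=np.int64(original_length),
        sample_rate=np.int64(sample_rate),
    )

def load_artifact(path):
    with np.load(path, allow_pickle=False) as f:  # no pickle is needed to read it back
        return {key: f[key] for key in f.files}

save_artifact("example.npz", np.zeros((1, 9, 72), dtype=np.int64), 72, 403956, 44100)
print(load_artifact("example.npz")["codes"].shape)  # (1, 9, 72)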

Training error: "RuntimeError: grad can be implicitly created only for scalar outputs"

Hey, Great work!

Trying to train yields a runtime error:

File "########################/dac/scripts/train.py", line 439, in
train(args, accel)
File "/opt/conda/lib/python3.8/site-packages/argbind/argbind.py", line 159, in cmd_func
return func(*cmd_args, **kwargs)
File "########################/dac/scripts/train.py", line 414, in train
train_loop(state, batch, accel, lambdas)
File "/opt/conda/lib/python3.8/site-packages/audiotools/ml/decorators.py", line 375, in decorated
output = fn(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/audiotools/ml/decorators.py", line 321, in decorated
output = fn(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/audiotools/ml/decorators.py", line 107, in decorated
output = fn(*args, **kwargs)
File "/########################/dac/scripts/train.py", line 268, in train_loop
accel.backward(output["loss"])
File "/opt/conda/lib/python3.8/site-packages/audiotools/ml/accelerator.py", line 123, in backward
self.scaler.scale(loss).backward()
File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 487, in backward
torch.autograd.backward(
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/init.py", line 193, in backward
grad_tensors
= make_grads(tensors, grad_tensors, is_grads_batched=False)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/init.py", line 88, in _make_grads
raise RuntimeError("grad can be implicitly created only for scalar outputs")

After some analysis, I found this is due to the commitment and codebook losses (here) being tensors of 2 elements instead of scalars.

Is this a bug or intended?
Should these losses be scalars? Why are they set to be tensors with multiple elements?

Thanks!
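For context, a minimal self-contained sketch (plain PyTorch, not the repo's training loop) of why a per-item loss triggers this error, and the batch-mean reduction that backward() expects:

import torch
import torch.nn.functional as F

z_e = torch.randn(2, 8, 100, requires_grad=True)  # batch of 2 -> per-item loss has 2 elements
z_q = torch.randn(2, 8, 100)

commitment_loss = F.mse_loss(z_e, z_q.detach(), reduction="none").mean([1, 2])
print(commitment_loss.shape)  # torch.Size([2]) -- not a scalar, so .backward() on it would fail

# backward() needs a scalar, so the per-item values must be reduced (e.g. averaged)
# before entering the weighted total that is backpropagated.
total = 0.25 * commitment_loss.mean()
total.backward()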

Fine-tuning from 44.1Khz

Hello,

Thanks for this work! I noticed that the pre-trained 44.1kHz weights aren't doing the best job at reconstructing some music outside the scope of the training dataset.

I'm wondering whether you're planning to release a fine-tuning script, so that one can resume from the pre-trained weights you provided and reduce compute time.

Thanks and looking forward!

Best,
M

Decode using codes instead of encoder output?

x = model.preprocess(signal.audio_data, signal.sample_rate)
z, codes, latents, _, _ = model.encode(x)

# Decode audio signal
y = model.decode(z)

Here z is the direct output of the encoder, without quantizing; this is incorrect, as it does not "quantize" the input. However, I was unable to find how to decode the "codes" in your script. I checked with the 16khz model, and the decode function / model.decoder only accepts the unquantized direct output of the encoder, which is ~100 times larger than the quantized codes.

I hope I have missed something, because if this is true it means the model is an autoencoder instead of an RVQ-VAE.
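For what it's worth, a sketch of decoding from the discrete codes by first mapping them back to continuous latents, mirroring what decompress does internally (one of the tracebacks earlier on this page shows dac/model/base.py doing z = self.quantizer.from_codes(c)[0]). The input file name and model choice are placeholders.

import dac
from audiotools import AudioSignal

model = dac.DAC.load(dac.utils.download(model_type="44khz"))
model.eval()

signal = AudioSignal("input.wav")  # hypothetical input file at the model's sample rate
x = model.preprocess(signal.audio_data, signal.sample_rate)
_, codes, *_ = model.encode(x)     # codes: [B, n_codebooks, T]

z = model.quantizer.from_codes(codes)[0]  # reconstruct the quantized latents from the codes
y = model.decode(z)                       # decode back to audio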
