First of all, thanks for the great work and clean code! For the purp

Chunked inference result depends on chunk length about descript-audio-codec HOT 9 OPEN

descriptinc commented on May 29, 2024 5

Chunked inference result depends on chunk length

from descript-audio-codec.

Comments (9)

f0k commented on May 29, 2024

A proof-of-concept implementation of the encoding part is here: https://gist.github.com/f0k/266dd89e52417ba6138d33afa9ff8e80. The main algorithm is this part: https://gist.github.com/f0k/266dd89e52417ba6138d33afa9ff8e80#file-chunked_dac-py-L140-L181.
It produces the same codes as python3 -m dac encode --win_duration=10000, except for the first and last 6 frames.
I did not attempt to implement it as a pull request in your code base because it would need some decisions on how and whether to handle backwards compatibility, and also for my purposes I need the output to be stored in a more efficiently readable format. If you are interested, I'm happy to help integrate it, though.
Decoding is not implemented yet. Due to the stacked strided transposed convolutions, it needs an overlap-add algorithm to be useful (otherwise it would be limited to work with impractically large window sizes). I don't urgently need it, let's see.
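
To make the idea concrete, here is a minimal sketch of chunk-size-invariant encoding on a toy encoder. It is not the gist's code: the `LAYERS` configuration, the fixed weights, and all function names are assumptions made up for illustration, and the toy uses unpadded convolutions throughout. Each chunk is cut so that it contains exactly the samples its output frames depend on, which makes neighbouring chunks overlap by the receptive field minus the hop size.

```python
import math

# Hypothetical (kernel_size, stride) pairs for a toy encoder; not DAC's real layers.
LAYERS = [(7, 1), (4, 2), (8, 4)]


def conv(x, kernel, stride):
    """Unpadded strided convolution with fixed toy weights and a tanh nonlinearity."""
    w = [math.sin(j + 1.0) for j in range(kernel)]
    n = (len(x) - kernel) // stride + 1
    return [math.tanh(sum(w[j] * x[i * stride + j] for j in range(kernel)))
            for i in range(n)]


def encode(x):
    """Run the whole toy encoder on a list of samples."""
    for kernel, stride in LAYERS:
        x = conv(x, kernel, stride)
    return x


def receptive_field(layers):
    """Return (samples seen by one output frame, hop between frames in samples)."""
    field, hop = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * hop
        hop *= stride
    return field, hop


def encode_chunked(x, frames_per_chunk):
    """Encode in chunks of at most frames_per_chunk output frames."""
    field, hop = receptive_field(LAYERS)
    total = (len(x) - field) // hop + 1  # frames the unchunked encoder yields
    frames = []
    for a in range(0, total, frames_per_chunk):
        b = min(a + frames_per_chunk, total)
        # Frames a..b-1 need samples a*hop .. (b-1)*hop + field - 1,
        # so neighbouring chunks overlap by field - hop samples.
        frames.extend(encode(x[a * hop:(b - 1) * hop + field]))
    return frames


signal = [math.sin(0.05 * t) for t in range(1000)]
reference = encode(signal)
for size in (1, 5, 50, 1000):
    assert encode_chunked(signal, size) == reference
```

Because every frame sees the same samples as in the unchunked pass, the toy result is identical for any chunk size; the loop at the end checks this for a few sizes.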

pseeth commented on May 29, 2024

Hey thanks for the implementation! I've been on parental leave the last few months, so I haven't been plugged in since writing the existing chunking code. Happy to take a look though, as I noticed the same issue with needing to use the same chunk size at encoding and decoding time. My fix then was just to save the chunk length in the metadata so things can be decoded properly, but this comes with some downsides as you mentioned. I'll take a look at your code and see what I can do!

Thanks!

f0k commented on May 29, 2024

> I've been on parental leave the last few months

Nice, congratulations!

> I'll take a look at your code and see what I can do!

Take your time! From what I see, the decoding algorithm will need more attention than what you have during a parental leave, so don't bother for now. Integrating the encoding algorithm alone will not be of much use.

pseeth commented on May 29, 2024

Thank you for understanding! And thanks for the congrats!

If you have time, a slightly more detailed sketch of the decoding algorithm would be super helpful. The encoding algorithm looks quite nice, and it has nice properties with regard to invariance to chunk size. I also feel that these sorts of tricks with convolutional nets for audio are not widely available, so getting it right in this repo would be a nice contribution to open source!

f0k commented on May 29, 2024

> If you have time, a slightly more detailed sketch of the decoding algorithm would be super helpful.

Well, we'd start by figuring out how many code frames we can decode at once to stay within the limits of the window size given by the user. Then we'd decode those, getting, say, 5 seconds of audio. We take the next chunk of code frames (that will have to overlap with the previous one, at least due to the initial size-7 convolution) and again get 5 seconds of audio. Now we don't concatenate the two 5-second chunks, but we overlap the second chunk a bit with the first chunk and add up the samples in the overlapped part. The tough part is figuring out from the network architecture by how much to overlap the code chunks and by how much to overlap the outputs. If the architecture were perfectly symmetric, the output would need to overlap exactly as much as we overlapped the input during encoding, but the architecture is not symmetric (the decoder has additional convolutions interspersed with the transposed convolutions).

My implementation includes some receptive field computation for the decoder, but maybe it is more helpful to compute the receptive field of the inverted decoder, or to separate the effects of forward and transposed convolutions.
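
One way to separate the two effects, as a rough sketch: track which output samples a single code frame can influence, layer by layer. An unpadded stride-1 convolution with kernel size k widens that footprint by k - 1 samples, while an unpadded transposed convolution with stride s and kernel size k stretches a footprint of n samples to (n - 1) * s + k. The layer list below is a hypothetical toy stack, not DAC's actual decoder, and pointwise nonlinearities are left out because they do not change which samples are affected. The toy `decode` uses all-ones weights so that a perturbation test can confirm the computed footprint.

```python
# Hypothetical toy decoder, not DAC's real configuration:
# ("conv", kernel) is an unpadded stride-1 convolution,
# ("tconv", kernel, stride) is an unpadded transposed convolution.
LAYERS = [("conv", 7), ("tconv", 4, 2), ("conv", 3), ("tconv", 8, 4), ("conv", 3)]


def footprint(layers):
    """Return (first, span, hop): code frame n influences the output samples
    hop*n + first .. hop*n + first + span - 1 (clipped to the output length)."""
    first, span, hop = 0, 1, 1
    for layer in layers:
        if layer[0] == "conv":
            k = layer[1]
            first -= k - 1   # output i reads inputs i .. i+k-1
            span += k - 1
        else:
            _, k, s = layer
            first *= s       # input i writes outputs i*s .. i*s+k-1
            span = (span - 1) * s + k
            hop *= s
    return first, span, hop


def decode(codes):
    """Linear toy decoder with all-ones weights, used only to test footprint()."""
    x = list(codes)
    for layer in LAYERS:
        if layer[0] == "conv":
            k = layer[1]
            x = [sum(x[i + j] for j in range(k)) for i in range(len(x) - k + 1)]
        else:
            _, k, s = layer
            y = [0.0] * ((len(x) - 1) * s + k)
            for i, v in enumerate(x):
                for j in range(k):
                    y[i * s + j] += v
            x = y
    return x


first, span, hop = footprint(LAYERS)  # (-58, 78, 8) for this toy stack

# Perturb a single code frame and see which output samples change.
n, total = 30, 60
base = decode([0.0] * total)
bumped = decode([1.0 if i == n else 0.0 for i in range(total)])
changed = [i for i, (p, q) in enumerate(zip(base, bumped)) if p != q]
assert changed == list(range(hop * n + first, hop * n + first + span))
```

In this toy, the footprint of 78 samples is much wider than the hop of 8 samples, which is why output chunks cannot simply be concatenated.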

/edit: The overlap-add idea does not apply due to the nonlinearities. Instead, we will need to overlap the code chunks and crop the decoder output to remove the wrongly computed borders (that should have taken the neighboring codes into account, but could not). Also the code for disabling padding needs to be fixed: To disable zero-padding in a transposed convolution, its padding ought to be set to its kernel_size, not to zero. Leaving it at zero will just increase the size of the wrongly computed borders that we need to discard, if I see correctly.
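
The cropping approach can be sketched on a toy decoder as follows. Everything here is an assumption for illustration (the `LAYERS` stack, the fixed weights, the `context` parameter); it is not DAC's architecture and not a finished implementation. All layers are unpadded, so the only wrong samples are those near chunk borders that would have needed neighbouring codes, and `valid_range` works out from the layer list which output samples of a chunk can be trusted.

```python
import math

# Hypothetical toy decoder, not DAC's real configuration:
# ("conv", kernel) is an unpadded stride-1 convolution,
# ("tconv", kernel, stride) is an unpadded transposed convolution.
LAYERS = [("conv", 7), ("tconv", 4, 2), ("conv", 3), ("tconv", 8, 4), ("conv", 3)]


def weights(kernel):
    return [math.sin(j + 1.0) for j in range(kernel)]  # fixed toy weights


def conv(x, kernel):
    w = weights(kernel)
    return [math.tanh(sum(w[j] * x[i + j] for j in range(kernel)))
            for i in range(len(x) - kernel + 1)]


def tconv(x, kernel, stride):
    if not x:
        return []
    w = weights(kernel)
    y = [0.0] * ((len(x) - 1) * stride + kernel)
    for i, v in enumerate(x):
        for j in range(kernel):
            y[i * stride + j] += w[j] * v
    return [math.tanh(v) for v in y]


def decode(codes):
    x = list(codes)
    for layer in LAYERS:
        x = conv(x, layer[1]) if layer[0] == "conv" else tconv(x, layer[1], layer[2])
    return x


def valid_range(start, stop, total):
    """Output samples (in full-signal coordinates) that decoding
    codes[start:stop] computes exactly as a full decode of `total` codes would."""
    lo, hi, length = start, stop, total
    for layer in LAYERS:
        if layer[0] == "conv":
            k = layer[1]
            hi, length = hi - (k - 1), length - (k - 1)
        else:
            _, k, s = layer
            new_length = (length - 1) * s + k
            lo = 0 if lo == 0 else (lo - 1) * s + k   # skip samples fed by missing left codes
            hi = new_length if hi == length else hi * s  # same on the right
            length = new_length
    return lo, hi


def decode_chunked(codes, frames_per_chunk, context):
    """Decode overlapping code chunks and keep only the valid part of each."""
    total = len(codes)
    hop = math.prod(layer[2] for layer in LAYERS if layer[0] == "tconv")
    out = []
    for a in range(0, total, frames_per_chunk):
        start = max(0, a - context)
        stop = min(total, a + frames_per_chunk + context)
        lo, hi = valid_range(start, stop, total)
        if hi <= len(out):
            continue  # chunk adds nothing new (too short, or already covered)
        assert lo <= len(out), "context too small: gap between valid regions"
        y = decode(codes[start:stop])
        offset = start * hop  # position of y[0] in the full output
        out.extend(y[len(out) - offset:hi - offset])
    return out


codes = [math.sin(0.37 * i) for i in range(60)]
full = decode(codes)
for size in (1, 4, 10, 33):
    assert decode_chunked(codes, size, context=5) == full
```

For this toy stack a context of 5 code frames on each side is enough for any chunk size; a context that is too small trips the assertion instead of silently producing a seam.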

jbmaxwell commented on May 29, 2024

I was having problems getting the example Python code running and wound up here. This is working, but I've noticed that, using your script, the decoded file is half the size of the original input file (7.4 MB vs 15 MB). Is there a setting I'm missing somewhere? I've tried setting it to 8 kbps and 16 kbps.

UPDATE - derp... umm... just noticed my input file is 32-bit and the output is 16-bit. 🙈

f0k commented on May 29, 2024

Glad you found it. I've just updated the gist to the version I ended up with for my use case; it adds support for input and output directories and can be launched multiple times with the same input and output directories but different CUDA devices, taking care not to process the same files. Chunked decoding is still left as an exercise for the reader ;)

pseeth commented on May 29, 2024

Decoding in chunks that are overlapped and then chopping off the overlapped samples sounds very plausible as a good method. I'll give it a go! Thanks for the additional detail!

BridgetteSong commented on May 29, 2024

I ran into the same issue, but I have now solved it.

First, please confirm that you are in inference mode, meaning that dropout layers and the like are turned off.
Then confirm that the same input always gives the same outputs from the encoder and decoder. To check this, feed in the same audio twice and compare the outputs.
I found that when "@torch.jit.script" is turned on in the SnakeModule, the outputs differ slightly even though the input is the same.
The outputs for the same input also differ between CPU and GPU.
