
Comments (9)

f0k commented on May 29, 2024

A proof-of-concept implementation of the encoding part is here: https://gist.github.com/f0k/266dd89e52417ba6138d33afa9ff8e80. The main algorithm is this part: https://gist.github.com/f0k/266dd89e52417ba6138d33afa9ff8e80#file-chunked_dac-py-L140-L181.
It produces the same codes as python3 -m dac encode --win_duration=10000, except for the first and last 6 frames.
I did not attempt to implement it as a pull request in your code base because it would need some decisions on whether and how to handle backwards compatibility, and for my purposes I also need the output stored in a format that is more efficient to read. If you are interested, I'm happy to help integrate it, though.
Decoding is not implemented yet. Due to the stacked strided transposed convolutions, it needs an overlap-add algorithm to be useful (otherwise it would be limited to work with impractically large window sizes). I don't urgently need it, let's see.
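To illustrate why the encoding side can be made independent of the chunk size, here is a minimal sketch, not the code from the gist. It assumes a single unpadded strided convolution with made-up `KERNEL`, `STRIDE` and `WEIGHTS` values (all hypothetical). Each chunk is cut so that it contains exactly the input samples its output frames need, which makes neighbouring chunks overlap by `KERNEL - STRIDE` samples.

```python
# Toy stand-in for an encoder: one unpadded ("valid") strided convolution.
# KERNEL, STRIDE and WEIGHTS are illustrative, not taken from the real model.
KERNEL = 7
STRIDE = 2
WEIGHTS = [0.1, -0.2, 0.3, 0.4, 0.3, -0.2, 0.1]


def valid_conv(x):
    """Strided convolution without any padding."""
    n_out = max((len(x) - KERNEL) // STRIDE + 1, 0)
    return [
        sum(w * x[i * STRIDE + k] for k, w in enumerate(WEIGHTS))
        for i in range(n_out)
    ]


def chunked_valid_conv(x, frames_per_chunk):
    """Compute the same frames chunk by chunk, overlapping the input chunks."""
    total = max((len(x) - KERNEL) // STRIDE + 1, 0)
    out = []
    for first in range(0, total, frames_per_chunk):
        last = min(first + frames_per_chunk, total)
        lo = first * STRIDE                 # first input sample needed
        hi = (last - 1) * STRIDE + KERNEL   # one past the last sample needed
        out.extend(valid_conv(x[lo:hi]))
    return out


signal = [((i * 37) % 11) / 11.0 for i in range(200)]
whole = valid_conv(signal)
chunked = chunked_valid_conv(signal, frames_per_chunk=16)
max_err = max(abs(a - b) for a, b in zip(whole, chunked))
```

Because no padding is inserted at chunk borders, every frame sees the same input samples in both variants. A padded whole-file encoding would only differ in the frames near the start and end of the signal.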

from descript-audio-codec.

pseeth commented on May 29, 2024

Hey thanks for the implementation! I've been on parental leave the last few months, so I haven't been plugged in since writing the existing chunking code. Happy to take a look though, as I noticed the same issue with needing to use the same chunk size at encoding and decoding time. My fix then was just to save the chunk length in the metadata so things can be decoded properly, but this comes with some downsides as you mentioned. I'll take a look at your code and see what I can do!

Thanks!


f0k commented on May 29, 2024

> I've been on parental leave the last few months

Nice, congratulations!

> I'll take a look at your code and see what I can do!

Take your time! From what I see, the decoding algorithm will need more attention than what you have during a parental leave, so don't bother for now. Integrating the encoding algorithm alone will not be of much use.


pseeth commented on May 29, 2024

Thank you for understanding! And thanks for the congrats!

If you have time, a slightly more detailed sketch of the decoding algorithm would be super helpful. The encoding algorithm looks quite nice, and its invariance to chunk size is a nice property. I also feel that these sorts of tricks with convolutional nets for audio are not widely available, so getting it right in this repo would be a nice contribution to open source!


f0k commented on May 29, 2024

> If you have time, a slightly more detailed sketch of the decoding algorithm would be super helpful.

Well, we'd start by figuring out how many code frames we can decode at once to stay within the limits of the window size given by the user. Then we'd decode those, getting, say, 5 seconds of audio. We take the next chunk of code frames (which will have to overlap with the previous one, at least because of the initial size-7 convolution) and again get 5 seconds of audio. Now we don't concatenate the two 5-second chunks, but we overlap the second chunk a bit with the first chunk and add up the samples in the overlapped part. The tough part is figuring out from the network architecture by how much to overlap the code chunks and by how much to overlap the outputs. If the architecture were perfectly symmetric, the output would need to overlap exactly as much as we overlapped the input during encoding, but the architecture is not symmetric (the decoder has additional convolutions interspersed with the transposed convolutions).
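The first step described above, working out how many code frames fit into the user's window and how the chunks of frames overlap, could be sketched roughly like this. The function `plan_chunks` and the values for `sample_rate`, `hop_length` and `margin` are assumptions for illustration only; the real margin would have to be derived from the decoder architecture.

```python
def plan_chunks(num_frames, win_duration, sample_rate, hop_length, margin):
    """Return (lo, start, stop, hi) frame indices for each chunk.

    [start, stop) are the frames whose audio is kept; [lo, hi) are the
    frames actually decoded, including up to `margin` context frames
    on each side (fewer at the ends of the sequence).
    """
    max_frames = int(win_duration * sample_rate) // hop_length
    keep = max_frames - 2 * margin
    if keep <= 0:
        raise ValueError("window too short for the requested margin")
    chunks = []
    for start in range(0, num_frames, keep):
        stop = min(start + keep, num_frames)
        lo = max(start - margin, 0)
        hi = min(stop + margin, num_frames)
        chunks.append((lo, start, stop, hi))
    return chunks


# Hypothetical numbers: 5 s window, 44.1 kHz audio, 512 samples per frame,
# and 3 frames of context on each side.
plan = plan_chunks(num_frames=1000, win_duration=5.0,
                   sample_rate=44100, hop_length=512, margin=3)
```

With these numbers, at most 430 frames fit into one window, so each chunk keeps 424 frames and decodes up to 3 extra frames on either side.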

My implementation includes some receptive field computation for the decoder, but maybe it is more helpful to compute the receptive field of the inverted decoder, or to separate the effects of forward and transposed convolutions.
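For reference, the usual receptive field recurrence for a stack of forward (non-transposed) convolutions can be written as below. This is a generic sketch, not the computation from the gist, and `toy_layers` is a made-up layer list rather than the real architecture. Transposed convolutions are not covered and would need the separate treatment mentioned above.

```python
def receptive_field(layers):
    """Receptive field and total stride of stacked forward convolutions.

    `layers` lists (kernel_size, stride, dilation) from input to output.
    Returns (receptive field in input samples, input samples per output step).
    """
    rf, jump = 1, 1
    for kernel_size, stride, dilation in layers:
        # Each layer widens the field by its dilated kernel extent,
        # measured in input samples via the accumulated stride.
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf, jump


# Hypothetical stack: size-7 conv, dilated size-3 conv,
# stride-2 downsampling conv, size-3 conv.
toy_layers = [(7, 1, 1), (3, 1, 3), (4, 2, 1), (3, 1, 1)]
rf, hop = receptive_field(toy_layers)
```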

/edit: The overlap-add idea does not apply due to the nonlinearities. Instead, we will need to overlap the code chunks and crop the decoder output to remove the wrongly computed borders (that should have taken the neighboring codes into account, but could not). Also the code for disabling padding needs to be fixed: To disable zero-padding in a transposed convolution, its padding ought to be set to its kernel_size, not to zero. Leaving it at zero will just increase the size of the wrongly computed borders that we need to discard, if I see correctly.
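The overlap-and-crop idea from the edit can be tried out on a toy model. The sketch below assumes a made-up nonlinear decoder (`toy_decode`: zero-padded convolution, tanh, 2x upsampling by repetition, another zero-padded convolution), which is not the real decoder and contains no transposed convolutions. `MARGIN` was worked out by hand for this toy model only.

```python
import math

HOP = 2      # output samples per code frame in this toy decoder
MARGIN = 2   # context frames needed on each side (specific to this toy model)


def conv3_same(x, w):
    """Zero-padded size-3 convolution that keeps the input length."""
    padded = [0.0] + list(x) + [0.0]
    return [sum(w[k] * padded[i + k] for k in range(3)) for i in range(len(x))]


def toy_decode(codes):
    """Toy nonlinear decoder: conv -> tanh -> 2x upsampling -> conv."""
    hidden = [math.tanh(v) for v in conv3_same(codes, (0.5, 1.0, -0.5))]
    upsampled = [v for v in hidden for _ in range(HOP)]
    return conv3_same(upsampled, (0.25, 0.5, 0.25))


def chunked_decode(codes, chunk_frames):
    """Decode overlapping code chunks and crop the wrongly computed borders."""
    out = []
    for start in range(0, len(codes), chunk_frames):
        stop = min(start + chunk_frames, len(codes))
        lo = max(start - MARGIN, 0)          # extend chunk with context frames
        hi = min(stop + MARGIN, len(codes))
        audio = toy_decode(codes[lo:hi])
        left = (start - lo) * HOP            # samples to discard on the left
        out.extend(audio[left:left + (stop - start) * HOP])
    return out


codes = [math.sin(0.7 * i) for i in range(37)]
full = toy_decode(codes)
chunked = chunked_decode(codes, chunk_frames=8)
max_err = max(abs(a - b) for a, b in zip(full, chunked))
```

Here the cropped chunks reproduce the full decode. With a smaller margin the samples near chunk borders would differ, because the zero padding inside each chunk replaces the neighbouring codes.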


jbmaxwell commented on May 29, 2024

I was having problems getting the example Python code running and wound up here. This is working, but I've noticed that, using your script, the decoded file is half the size of the original input file (7.4 MB vs 15 MB). Is there a setting I'm missing somewhere? I've tried both 8 kbps and 16 kbps.

UPDATE - derp... umm... just noticed my input file is 32-bit and the output is 16-bit. 🙈
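The factor of two follows directly from the bit depth of uncompressed PCM audio. A quick back-of-the-envelope check, with a hypothetical duration, sample rate and channel count chosen only to land near the sizes mentioned:

```python
def pcm_bytes(duration_s, sample_rate, channels, bits_per_sample):
    """Size of the raw PCM payload in bytes (file headers ignored)."""
    return int(duration_s * sample_rate) * channels * bits_per_sample // 8


# Assumed values: 85 s of mono audio at 44.1 kHz.
size_32 = pcm_bytes(85, 44100, 1, 32)   # 32-bit samples
size_16 = pcm_bytes(85, 44100, 1, 16)   # 16-bit samples
ratio = size_32 / size_16
```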


f0k commented on May 29, 2024

Glad you found it. I've just updated the gist to the version I ended up with for my use case; it adds support for input and output directories and can be launched multiple times with the same input and output directories but different CUDA devices, taking care not to process the same files. Chunked decoding is still left as an exercise for the reader ;)


pseeth commented on May 29, 2024

Decoding in chunks that are overlapped and then chopping off the overlapped samples sounds very plausible as a good method. I'll give it a go! Thanks for the additional detail!


BridgetteSong commented on May 29, 2024

I ran into the same issue, but I have now solved it.

  • First, confirm that you are in inference mode, i.e. that dropout layers and the like are turned off.
  • Confirm that the same input always gives the same outputs from the encoder and decoder. To check, feed in the same audio twice and compare the outputs.
  • I found that when "@torch.jit.script" is turned on in the SnakeModule, the outputs differ slightly even though the input is the same.
  • The outputs for the same input also differ between CPU and GPU.
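The first two checks in the list can be sketched without any framework. `ToyCodec` below is a hypothetical stand-in for a model with a dropout-like layer, not the real codec; with an actual model one would switch it to inference mode in the same way and compare the real encoder and decoder outputs.

```python
import random


class ToyCodec:
    """Hypothetical stand-in for a model containing a dropout-like layer."""

    def __init__(self, drop_prob=0.5, seed=0):
        self.training = True
        self.drop_prob = drop_prob
        self._rng = random.Random(seed)

    def eval(self):
        """Switch to inference mode, which disables the random dropping."""
        self.training = False
        return self

    def __call__(self, x):
        if not self.training:
            return [2.0 * v for v in x]
        # Training mode: randomly zero elements, as dropout would.
        return [0.0 if self._rng.random() < self.drop_prob else 2.0 * v
                for v in x]


def max_abs_diff(a, b):
    return max(abs(p - q) for p, q in zip(a, b))


def is_repeatable(model, x, tol=0.0):
    """Run the same input twice and compare the two outputs."""
    return max_abs_diff(model(x), model(x)) <= tol


audio = [0.01 * (i + 1) for i in range(64)]
model = ToyCodec()
repeatable_in_training = is_repeatable(model, audio)
repeatable_in_eval = is_repeatable(model.eval(), audio)
```

When comparing runs across devices or with and without scripting, a small nonzero `tol` is the more realistic choice, since the list above reports slight differences in those cases.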

