Comments (9)
A proof-of-concept implementation of the encoding part is here: https://gist.github.com/f0k/266dd89e52417ba6138d33afa9ff8e80. The main algorithm is this part: https://gist.github.com/f0k/266dd89e52417ba6138d33afa9ff8e80#file-chunked_dac-py-L140-L181.
It produces the same codes as `python3 -m dac encode --win_duration=10000`, except for the first and last 6 frames.
I did not attempt to implement it as a pull request against your code base because that would require some decisions on whether and how to handle backwards compatibility, and for my purposes I also need the output stored in a more efficiently readable format. If you're interested, I'm happy to help integrate it, though.
Decoding is not implemented yet. Due to the stacked strided transposed convolutions, it needs an overlap-add-style algorithm to be useful (otherwise it would only work with impractically large window sizes). I don't urgently need it, so let's see.
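For intuition, the chunk-size invariance on the encoding side can be demonstrated with a toy one-dimensional unpadded strided convolution standing in for the encoder: as long as each chunk of input samples covers exactly the receptive field of the code frames it is responsible for, chunked encoding reproduces the full-signal encoding bit for bit. This is a minimal numpy sketch under that assumption, not the gist's actual code; the kernel size and stride are arbitrary toy values.

```python
import numpy as np

def encode(x, w, stride):
    """Valid (unpadded) strided 1-D convolution: one code frame per stride samples."""
    n_frames = (len(x) - len(w)) // stride + 1
    return np.array([x[i*stride : i*stride + len(w)] @ w for i in range(n_frames)])

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)   # toy "audio"
w = rng.standard_normal(7)      # toy kernel, size 7
stride = 4

full = encode(x, w, stride)     # encode the whole signal at once

# Chunked: compute frames [a, b) from exactly the samples they depend on.
chunk_frames = 50
parts = []
a = 0
while a < len(full):
    b = min(a + chunk_frames, len(full))
    # frame i needs samples [i*stride, i*stride + kernel_size)
    samples = x[a*stride : (b - 1)*stride + len(w)]
    parts.append(encode(samples, w, stride))
    a = b
chunked = np.concatenate(parts)

assert np.allclose(full, chunked)   # identical to the one-shot encoding
```

With real stacks of strided convolutions the same idea applies, but the per-frame receptive field and hop have to be composed across all layers, which is what the receptive-field bookkeeping in the gist does.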
from descript-audio-codec.
Hey thanks for the implementation! I've been on parental leave the last few months, so I haven't been plugged in since writing the existing chunking code. Happy to take a look though, as I noticed the same issue with needing to use the same chunk size at encoding and decoding time. My fix then was just to save the chunk length in the metadata so things can be decoded properly, but this comes with some downsides as you mentioned. I'll take a look at your code and see what I can do!
Thanks!
> I've been on parental leave the last few months
Nice, congratulations!
> I'll take a look at your code and see what I can do!
Take your time! From what I can tell, the decoding algorithm will need more attention than you can spare during parental leave, so don't bother for now. Integrating the encoding algorithm alone won't be of much use.
Thank you for understanding! And thanks for the congrats!
If you have time, a slightly more detailed sketch of the decoding algorithm would be super helpful. The encoding algorithm looks quite elegant, and it has nice properties with regard to invariance to chunk size. I also feel that these sorts of tricks with convolutional nets for audio are not widely available, so getting this right in this repo would be a nice contribution to open source!
> If you have time, a slightly more detailed sketch of the decoding algorithm would be super helpful.
Well, we'd start by figuring out how many code frames we can decode at once while staying within the window size given by the user. Then we'd decode those, getting, say, 5 seconds of audio. We take the next chunk of code frames (which will have to overlap with the previous one, at least due to the initial size-7 convolution) and again get 5 seconds of audio. Now we don't concatenate the two 5-second chunks; instead, we overlap the second chunk a bit with the first and add up the samples in the overlapped part. The tough part is figuring out from the network architecture by how much to overlap the code chunks and by how much to overlap the outputs. If the architecture were perfectly symmetric, the output would need to overlap exactly as much as we overlapped the input during encoding, but the architecture is not symmetric (the decoder has additional convolutions interspersed with the transposed convolutions).
My implementation includes a receptive-field computation for the decoder, but maybe it would be more helpful to compute the receptive field of the inverted decoder, or to separate the effects of the forward and transposed convolutions.
/edit: The overlap-add idea does not apply due to the nonlinearities. Instead, we will need to overlap the code chunks and crop the decoder output to remove the wrongly computed borders (which should have taken the neighboring codes into account, but could not). Also, the code for disabling padding needs to be fixed: to disable zero-padding in a transposed convolution, its `padding` ought to be set to its `kernel_size`, not to zero. Leaving it at zero just increases the size of the wrongly computed borders that we need to discard, if I see correctly.
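The overlap-and-crop scheme can be sketched with a toy decoder: a transposed convolution followed by a pointwise nonlinearity and a final unpadded convolution (the nonlinearity is exactly what breaks plain overlap-add). Each chunk of codes is decoded together with a little context, and the border samples whose computation would have needed codes outside the chunk are discarded. This is a hypothetical numpy sketch, not DAC's decoder; the kernel sizes, strides, and the one-context-frame choice are toy values picked so that one context code frame covers the contaminated border.

```python
import numpy as np

def conv_valid(x, w):
    """Unpadded 1-D convolution."""
    return np.array([x[j:j+len(w)] @ w for j in range(len(x) - len(w) + 1)])

def tconv(z, w, stride):
    """Transposed 1-D convolution without padding: each code paints len(w) samples."""
    y = np.zeros((len(z) - 1) * stride + len(w))
    for c, val in enumerate(z):
        y[c*stride : c*stride + len(w)] += val * w
    return y

def decode(z, w_up, w_out, stride):
    """Toy decoder: upsample, nonlinearity, smoothing conv."""
    return conv_valid(np.tanh(tconv(z, w_up, stride)), w_out)

rng = np.random.default_rng(1)
stride, w_up, w_out = 4, rng.standard_normal(8), rng.standard_normal(5)
z = rng.standard_normal(200)            # code frames for the whole signal

full = decode(z, w_up, w_out, stride)   # one-shot reference

border = len(w_up) - stride             # samples contaminated at each chunk edge
ctx = 1                                 # context codes; ctx*stride must cover border
out = np.full_like(full, np.nan)
chunk = 40
for a in range(0, len(z), chunk):
    b = min(a + chunk, len(z))
    lo, hi = max(0, a - ctx), min(len(z), b + ctx)
    y = decode(z[lo:hi], w_up, w_out, stride)
    crop_l = border if lo > 0 else 0    # real signal edges are already exact
    crop_r = border if hi < len(z) else 0
    y = y[crop_l : len(y) - crop_r]
    start = lo * stride + crop_l        # position of the kept samples globally
    out[start : start + len(y)] = y

assert not np.isnan(out).any() and np.allclose(full, out)
```

For the real decoder the `border` and `ctx` values would have to be derived from the whole stack of transposed and interspersed forward convolutions, which is the receptive-field bookkeeping discussed above.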
I was having problems getting the example Python code running and wound up here. This is working, but I've noticed that, using your script, the decoded file is half the size of the original input file (7.4 MB vs. 15 MB). Is there a setting I'm missing somewhere? I've tried setting it to 8 kbps and 16 kbps.
UPDATE - derp... umm... just noticed my input file is 32-bit and the output is 16-bit. 🙈
Glad you found it. I've just updated the gist to the version I ended up with for my use case; it adds support for input and output directories and can be launched multiple times with the same input and output directories but different CUDA devices, taking care not to process the same files. Chunked decoding is still left as an exercise for the reader ;)
Decoding in chunks that are overlapped and then chopping off the overlapping samples sounds like a very plausible approach. I'll give it a go! Thanks for the additional detail!
I ran into the same issue, but I've since solved it. A few things to check:
- First confirm you are in inference mode, i.e. dropout and similar layers are disabled (e.g. via `model.eval()`).
- Confirm that the same input always yields the same outputs from the encoder and decoder; feed the same audio twice and compare the results.
- I found that with `@torch.jit.script` enabled on the SnakeModule, the outputs differ slightly even for identical inputs.
- Outputs for the same input also differ between CPU and GPU.
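A quick way to run the second check is to feed the same tensor twice in eval mode and compare bit for bit. The model below is a stand-in module for illustration, not DAC itself; substitute your loaded DAC model.

```python
import torch

# stand-in for the real model; replace with your loaded DAC instance
model = torch.nn.Sequential(torch.nn.Conv1d(1, 4, 7), torch.nn.Tanh())
model.eval()                       # disables dropout and similar layers

x = torch.randn(1, 1, 1024)        # dummy mono "audio" batch
with torch.no_grad():
    out1 = model(x)
    out2 = model(x)

# in eval mode, on the same device, two runs should match exactly
print(torch.equal(out1, out2))
```

If this prints `False`, nondeterminism is coming from the model or backend (e.g. dropout still active, or nondeterministic GPU kernels) rather than from the chunking logic.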