
Comments (3)

yxlu-0102 commented on August 26, 2024

In our other work on speech bandwidth extension, we used narrowband log-magnitude spectra as input, predicted the high-frequency log-magnitude spectra, and added them together to obtain the wideband log-magnitude spectra.
Since adding log-magnitude spectra is equivalent to multiplying magnitude spectra, we found that bandwidth extension can be achieved by applying an unbounded mask to the magnitude spectra.
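The equivalence above can be checked with a minimal numeric sketch (the spectra and residual values below are illustrative, not from the paper): adding a predicted high-frequency residual in the log-magnitude domain is exactly the same as multiplying the magnitude spectrum by an unbounded mask `exp(residual)`.

```python
import numpy as np

# Toy narrowband magnitudes and a predicted high-frequency log residual
# (illustrative values only).
mag_nb = np.array([1.0, 0.5, 1e-3, 1e-4])
log_residual = np.array([0.0, 0.0, 5.0, 6.0])

# Adding in the log-magnitude domain ...
wideband_from_log = np.exp(np.log(mag_nb) + log_residual)

# ... equals applying an unbounded multiplicative mask in the linear domain.
mask = np.exp(log_residual)  # >= 0, can greatly exceed 1
wideband_from_mask = mag_nb * mask

assert np.allclose(wideband_from_log, wideband_from_mask)
```

Note the mask values for the high-frequency bins (`exp(5)`, `exp(6)`) are far above 1, which is why a bounded sigmoid-style mask cannot express them.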

Regarding your question, after upsampling, the high-frequency part of the magnitude spectrum of a speech waveform consists of very small values close to zero.
Therefore, a large-value mask can be used to predict the high-frequency magnitude spectrum.
Here, we also applied power-law compression to narrow the range of this mask, making it easier to predict.
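A small sketch of how power-law compression narrows the mask range (the exponent 0.3 is an assumption for illustration; the paper's exact value may differ): a mask `m` on the raw spectrum becomes `m**c` on the compressed spectrum, collapsing its dynamic range.

```python
import numpy as np

c = 0.3  # compression exponent (assumed for illustration)

# Raw magnitudes spanning five orders of magnitude
mag = np.array([1e-4, 1e-2, 1.0, 10.0])
compressed = mag ** c  # dynamic range shrinks from 1e5 to about 31.6x

# A mask of 1000 on the raw spectrum becomes a much smaller target
# on the compressed spectrum, which is easier for a network to predict.
mask_raw = 1000.0
mask_compressed = mask_raw ** c  # ~7.94 instead of 1000
```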

Additionally, in the paper, the models for these three tasks were trained separately.
We also tried training a general model using all the data to handle these three tasks simultaneously.
We found that the performance of this model slightly decreased in the tasks of speech denoising and bandwidth extension, but it improved in the dereverberation task.
This improvement might be due to the inclusion of noisy data, which acts as data augmentation.

from mp-senet.

JangyeonKim commented on August 26, 2024

I have some questions about the BWE task.

Currently, I am trying to apply the MP-SENet model to the BWE task. As written in the long version of the paper, I am conducting experiments with the VCTK dataset.

  1. I changed the lsigmoid() of the mask decoder to PReLU(), but the loss becomes NaN as soon as training starts. LeakyReLU() showed the same behavior, so I am currently training with ReLU. Can you offer any advice on this issue?

```python
class MaskDecoder(nn.Module):
    def __init__(self, h, out_channel=1):
        super(MaskDecoder, self).__init__()
        self.dense_block = DenseBlock(h, depth=4)
        self.mask_conv = nn.Sequential(
            nn.ConvTranspose2d(h.dense_channel, h.dense_channel, (1, 3), (1, 2)),
            nn.Conv2d(h.dense_channel, out_channel, (1, 1)),
            nn.InstanceNorm2d(out_channel, affine=True),
            nn.PReLU(out_channel),
            nn.Conv2d(out_channel, out_channel, (1, 1))
        )
        self.lsigmoid = LearnableSigmoid_2d(h.n_fft // 2 + 1, beta=h.beta)
        self.prelu = nn.PReLU()

    def forward(self, x):
        x = self.dense_block(x)
        x = self.mask_conv(x)
        x = x.permute(0, 3, 2, 1).squeeze(-1)

        # # LearnableSigmoid for denoising / dereverberation
        # x = self.lsigmoid(x).permute(0, 2, 1).unsqueeze(1)

        # PReLU for bandwidth extension
        x = self.prelu(x).permute(0, 2, 1).unsqueeze(1)

        return x
```
  2. When conducting experiments, I found that, aside from the metric scores, the output samples contain audible artifacts (a buzzing-like sound). Have you encountered the same issue?
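One possible cause of the NaN I considered (my assumption, not confirmed by the authors): unlike ReLU, PReLU and LeakyReLU can output negative mask values, and raising a negative masked magnitude to a fractional power under power-law compression yields NaN. A minimal sketch (the exponent 0.3 is assumed):

```python
import numpy as np

c = 0.3  # assumed power-law compression exponent

# PReLU / LeakyReLU can emit negative mask values; ReLU cannot.
masked_mag = np.array([0.5, -0.2])

# Fractional power of a negative number is NaN in floating point,
# which would then propagate through the loss.
compressed = np.power(masked_mag, c)  # -> [0.812..., nan]
```

If this is the cause, clamping the masked magnitude to be non-negative before compression would be a natural thing to try.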


jeffery-work commented on August 26, 2024

> In our other work on speech bandwidth extension, we used narrowband log-magnitude spectra as input, predicted the high-frequency log-magnitude spectra, and added them together to obtain the wideband log-magnitude spectra. Since adding log-magnitude spectra is equivalent to multiplying magnitude spectra, we found that bandwidth extension can be achieved by applying an unbounded mask to the magnitude spectra.
>
> Regarding your question, after upsampling, the high-frequency part of the magnitude spectrum of a speech waveform consists of very small values close to zero. Therefore, a large-value mask can be used to predict the high-frequency magnitude spectrum. Here, we also applied power-law compression to narrow the range of this mask, making it easier to predict.
>
> Additionally, in the paper, the models for these three tasks were trained separately. We also tried training a general model using all the data to handle these three tasks simultaneously. We found that the performance of this model slightly decreased in the tasks of speech denoising and bandwidth extension, but it improved in the dereverberation task. This improvement might be due to the inclusion of noisy data, which acts as data augmentation.

Great job!
As mentioned above, was the "g_best" file in "/best_ckpt" trained only for denoising? I found that it cannot perform bandwidth extension.
Will you release your general model for the three tasks? I am interested in its PESQ improvement.

