Comments (15)

Rikorose avatar Rikorose commented on August 29, 2024

Hi there, could you send me the training log as well as the config file used for training? Are you sure the model fully converged? Did you make sure to choose the model based on the best validation loss? What is the batch size? Were there any NaNs in the training run? Which exact commit was used for training?

AdamW and Adam should not make a huge difference. I will double-check which one was used.

3 seconds should be fine; in each epoch, a different 3-second window is used. I may have made some changes since then, e.g. handling training samples shorter than 3 s.
100 RIRs should be fine; they are sampled randomly.
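The per-epoch window sampling described above could be sketched roughly like this (an illustration only, not the actual pyDF-data loader; `random_crop` is a hypothetical helper):

```python
import numpy as np

def random_crop(audio, sr=48_000, seconds=3.0, rng=None):
    """Cut a random `seconds`-long window out of `audio`; a different
    window is drawn on every call (i.e. every epoch)."""
    rng = rng or np.random.default_rng()
    target = int(sr * seconds)
    if len(audio) <= target:
        # Shorter samples are zero-padded here; the real loader may instead
        # draw a different sample, depending on configuration.
        return np.pad(audio, (0, target - len(audio)))
    start = rng.integers(0, len(audio) - target)
    return audio[start:start + target]
```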

p_reverb is the reverb probability during training. Since VCTK is non-reverberant, this should not have a large impact. conv_lookahead is a separate lookahead (i.e. in the convolutions) from the DF lookahead. The overall latency should be no more than 30 ms, which is less than that of PercepNet or DCRNN.
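As a back-of-the-envelope check of that 30 ms figure (the window, hop, and lookahead values below are assumptions based on a 48 kHz STFT setup with a 20 ms window and 10 ms hop, not read from the actual config):

```python
SR = 48_000            # sample rate in Hz (assumed)
FFT_SIZE = 960         # 20 ms analysis window (assumed)
HOP_SIZE = 480         # 10 ms hop (assumed)
LOOKAHEAD_FRAMES = 1   # combined lookahead in frames (assumed)

# Algorithmic latency = one analysis window plus the lookahead frames.
window_ms = 1000 * FFT_SIZE / SR                         # 20.0 ms
lookahead_ms = 1000 * LOOKAHEAD_FRAMES * HOP_SIZE / SR   # 10.0 ms
latency_ms = window_ms + lookahead_ms
print(f"algorithmic latency: {latency_ms:.0f} ms")  # prints "algorithmic latency: 30 ms"
```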

from deepfilternet.

rohithmars avatar rohithmars commented on August 29, 2024

Hi,

Thank you for the reply.

Please find attached the log and config.ini
train.log
config.zip

So do you mean that it is okay even if the number of speech samples and the number of noise samples differ? Or should I oversample to match them?

As we had discussed before, NaNs occur, but then I restart the training from the previous epoch.

Rikorose avatar Rikorose commented on August 29, 2024

The numbers of clean and noise samples may differ since they are randomly mixed at training time. One thing that I changed for better robustness in multi-speaker scenarios is sampling a different speech sample if the first one is shorter than 3 s. This may cause a performance drop.
It can be disabled by setting the probability to zero:
https://github.com/Rikorose/DeepFilterNet/blob/main/pyDF-data/src/lib.rs#L232

Also, we did not include the singing voice dataset, since it is correlated with the music that is also included in the noise sets, as well as the emotional speech due to worse recording quality.
How did you do your train/validation/test split?

The original model was trained in a different repository which I cleaned up for publishing. Maybe I cleaned up too much or introduced a bug somewhere during the refactoring process which causes a performance drop. I will need to double-check.

rohithmars avatar rohithmars commented on August 29, 2024

Hi,

I had included the singing and emotional speech in my target, but the contribution from these data is small compared to the read speech from DNS. The split is roughly 75%/15%/15%.

Also, did you include VCTK trainset as part of your training data? or is it only DNS dataset?

By the way, I also notice that when I use the pre-trained model to evaluate the validation set, it gives lower SDR/STOI scores than a model trained for one epoch and validated on the same set. It is very strange, and I am not sure where the bug is. The pre-trained model should definitely give a much higher score than a first-epoch model.

Rikorose avatar Rikorose commented on August 29, 2024

I did not include the VCTK as an extra training set. What SDR/STOI scores do you observe? On the validation set or the VCTK test set?

rohithmars avatar rohithmars commented on August 29, 2024

Hi,

Sorry for the long post and all the questions. The validation set below is from my own split. [29] shows the scores of the provided pre-trained model on this validation set; [0] shows the validation scores of the first-epoch model. I think something is wrong?

Pre-trained model score on validation set

[29] [valid] | DfAlphaLoss: 0.016212 | SpectralLoss: 0.30626 | loss: 0.32247 | sdr_snr_-5: 3.6033 | sdr_snr_0: 7.6844 | sdr_snr_10: 13.983 | sdr_snr_20: 19.054 | sdr_snr_40: 24.983 | sdr_snr_5: 10.971 | stoi_snr_-5: 0.68123 | stoi_snr_0: 0.77549 | stoi_snr_10: 0.89468 | stoi_snr_20: 0.94909 | stoi_snr_40: 0.983 | stoi_snr_5: 0.844

Epoch-1 model score on validation set

| DF | [0] [valid] | DfAlphaLoss: 0.011831 | SpectralLoss: 0.2401 | loss: 0.25193 | sdr_snr_-5: 4.9098 | sdr_snr_0: 8.912 | sdr_snr_10: 15.457 | sdr_snr_20: 21.364 | sdr_snr_40: 28.92 | sdr_snr_5: 12.27 | stoi_snr_-5: 0.7153 | stoi_snr_0: 0.80117 | stoi_snr_10: 0.90952 | stoi_snr_20: 0.95896 | stoi_snr_40: 0.99052 | stoi_snr_5: 0.86384

Rikorose avatar Rikorose commented on August 29, 2024

Hm, there is certainly room to improve the DeepFilterNet model. Still, it is strange that you are observing worse results on the VCTK set. I am currently working on a different topic but will try to come back to this to provide better reproducibility.

I also noticed a bug in the dataset functionality that results in a different order of samples during training, which also limits reproducibility.

rohithmars avatar rohithmars commented on August 29, 2024

Hi,

Can you tell me how to use the pre-trained model as the starting model when training?

Do I need to do any other configuration?

Rikorose avatar Rikorose commented on August 29, 2024

Hi,

You should be able to load the model via the train script and simply continue training. You will need to increase max_epochs in the config file.
What you could also try is not loading the last layer(s) of the pretrained model. This is a common approach when fine-tuning a pretrained model on a new dataset or task. You can use the cp_blacklist config parameter for this; just add the option as a comma-separated list to the train section, e.g.:

[train]
cp_blacklist = df_fc_out,df_fc_a,conv0_out
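Conceptually, such a blacklist drops the matching entries from the checkpoint's state dict before loading, so those layers keep their fresh initialization. A minimal plain-dict sketch (an illustration, not the actual DeepFilterNet loading code):

```python
def filter_state_dict(state, blacklist):
    """Drop parameters whose name starts with a blacklisted layer name."""
    return {k: v for k, v in state.items()
            if not any(k.startswith(b) for b in blacklist)}

# Hypothetical checkpoint entries for illustration:
state = {"enc.conv0.weight": 1, "df_fc_out.weight": 2, "df_fc_a.bias": 3}
kept = filter_state_dict(state, ["df_fc_out", "df_fc_a", "conv0_out"])
# kept now holds only "enc.conv0.weight"; loading with strict=False would
# leave the blacklisted layers at their fresh initialization.
```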

FYI, I am currently trying to reproduce your findings using the current repo state (since I trained the published model in a different repo). I ran into some issues with our cluster though, so no ETA.

rohithmars avatar rohithmars commented on August 29, 2024

Hi,
I did try the first method you suggested, i.e. loading the pre-trained model via the train script and continuing training. However, as I mentioned before, the scores on the validation set are lower than those after the first epoch of a model trained from scratch. It is very strange. The config.ini is the same as the one provided with the pre-trained model, except that I used cuda as the device.

rohithmars avatar rohithmars commented on August 29, 2024

Hi @Rikorose

Did you get a chance to look into the issue of reproducing the results? The PESQ score I obtain is 2.60, which is quite low compared to 2.80 from the pre-trained model.

Loading from the pre-trained model also does not help, as I explained above.

Rikorose avatar Rikorose commented on August 29, 2024

Hi @rohithmars

Sorry for the late reply. I was finally able to look into it. I found two things:

  1. The loss value needs to be higher by a factor of 5. Originally we had a hard-coded factor in the spectral loss, which I have now cleaned up:
[spectralloss]
factor_magnitude = 100000
factor_complex = 100000
gamma = 0.6

Another note on this: you could also try decreasing gamma to 0.3. We found that a stronger compression (i.e. a lower gamma) may benefit PESQ; I think this was also reported by others in the DNS3 challenge. However, other metrics might get slightly worse. If you try this, you would possibly also need to adjust the loss factors.

  2. I looked into our HDF5 files and found that we originally over-sampled high-quality datasets within the DNS corpus by a factor of 10, namely the PTDB and VCTK datasets. These datasets are dry close-talk recordings, while librivox contains all sorts of microphones and recording conditions. I am not quite sure whether these two datasets are actually included in the DNS training set, since the English read-speech folder only contains Common Voice samples, even though the readme states otherwise. So I think we just included these datasets manually. I am really sorry that I forgot about this and did not specify it earlier.
    We split the VCTK and PTDB sets on speaker level to ensure no speaker overlap with the VCTK test set; the rest of the DNS set is split on signal level. Here are our train/val/test splits for VCTK and PTDB: traintestsplit.zip
    I will provide an updated paper that includes this note. Thank you for investigating this issue. I retrained our model and got slightly (but probably not significantly) higher scores (PESQ 2.84), so I guess there is still some variance left (e.g. due to #44, #42)
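For reference, a compressed spectral loss of the kind discussed in point 1 can be sketched as follows (a simplified illustration with an assumed mean reduction; the factors and gamma mirror the config snippet above):

```python
import numpy as np

def spectral_loss(X, Y, gamma=0.6, f_mag=100_000, f_cplx=100_000):
    """Compressed spectral loss: magnitudes are compressed via |X|**gamma;
    the complex term reuses the compressed magnitude with the original phase."""
    mag_x, mag_y = np.abs(X) ** gamma, np.abs(Y) ** gamma
    loss_mag = f_mag * np.mean((mag_x - mag_y) ** 2)
    cx = mag_x * np.exp(1j * np.angle(X))
    cy = mag_y * np.exp(1j * np.angle(Y))
    loss_cplx = f_cplx * np.mean(np.abs(cx - cy) ** 2)
    return loss_mag + loss_cplx
```

A lower gamma compresses the magnitudes more strongly, which is why the loss factors may need re-tuning when gamma changes.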

Wrt. the pretraining: Have you tried reinitializing the last layers as I suggested earlier? This usually works pretty well and is the usual approach when fine-tuning on a new/different dataset.

rohithmars avatar rohithmars commented on August 29, 2024

Hi @Rikorose

First of all, thanks for getting back to me and your effort to address the issue.

  1. If this [spectralloss] factor is 20000 * 5 = 100000 instead of 20000, does it mean the factor for [dfalphaloss] should also be scaled to 1000 * 5 = 5000? I ask because the paper states lambda_spec = 1 and lambda_alpha = 0.05.

  2. Okay, so my results may be affected by the lack of VCTK training data. I did not use the VCTK train set; my training data was only from the DNS-3 dataset, tested on the VCTK test set.
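If the intent is to keep the paper's ratio lambda_alpha / lambda_spec = 0.05, scaling both factors by 5 preserves it. A quick sanity check (assuming the original factors were indeed 20000 and 1000):

```python
# Original factors (assumed): spectral 20000, alpha 1000 -> ratio 0.05
factor_spectral = 20_000 * 5   # 100_000, as in the updated config
factor_alpha = 1_000 * 5       # 5_000, scaled by the same factor of 5
assert factor_alpha / factor_spectral == 0.05  # ratio unchanged
```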

Could you please show me a sample dataset.cfg that lists multiple .hdf5 files for speech? I ask because I don't want to recreate the HDF5 files; I want to know if it is possible to add VCTK_speech.hdf5 alongside my current speech_train.hdf5. I hope you understand this question.

Thanks a lot again.

Rikorose avatar Rikorose commented on August 29, 2024

Hi, here is an example dataset config (without noise datasets):

{
    "test": [
        [
            "DNS_TEST.hdf5",
            1
        ],
        [
            "VCTK_TEST.hdf5",
            10
        ],
        [
            "SLR26_SIM_DNS_TEST.hdf5",
            1
        ]
    ],
    "train": [
        [
            "DNS_TRAIN.hdf5",
            1
        ],
        [
            "VCTK_TRAIN.hdf5",
            10
        ],
        [
            "SLR26_SIM_DNS_TRAIN.hdf5",
            1
        ]
    ],
    "valid": [
        [
            "DNS_VALID.hdf5",
            1
        ],
        [
            "VCTK_VALID.hdf5",
            10
        ],
        [
            "SLR26_SIM_DNS_VALID.hdf5",
            1
        ]
    ]
}
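The second element of each pair is a sampling factor: the VCTK entries are drawn ten times as often as the others, matching the 10x oversampling mentioned above. A hypothetical illustration of how such weights could drive sampling:

```python
import random

datasets = [("DNS_TRAIN.hdf5", 1), ("VCTK_TRAIN.hdf5", 10),
            ("SLR26_SIM_DNS_TRAIN.hdf5", 1)]
names = [name for name, _ in datasets]
weights = [weight for _, weight in datasets]
# VCTK is picked with probability 10/12, the other two with 1/12 each.
choice = random.choices(names, weights=weights, k=1)[0]
```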

github-actions avatar github-actions commented on August 29, 2024

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.
