Hi, thank you for the help in maintaining the DiffSinger repo.

ONNX inference 'depth' parameter (diffsinger issue, closed)

loct824 commented on September 6, 2024

ONNX inference 'depth' parameter

Comments (6)

yqzhishen commented on September 6, 2024

The depth input is introduced by the shallow diffusion mechanism, and you can read the documentation for this. Briefly speaking, it equals K_step in the configuration file used for training.
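As a minimal sketch of the point above, the snippet below pulls a `K_step` value out of the text of a training config so it can be reused as the `depth` input. The helper name `read_k_step`, the config excerpt, and the value 400 are all hypothetical and only serve as illustration; a real config would normally be loaded with a YAML parser instead of a regular expression.

```python
import re


def read_k_step(config_text):
    """Return the integer value of a top-level `K_step:` entry, or None if absent.

    Hypothetical helper for illustration; not part of DiffSinger or MiniEngine.
    """
    match = re.search(r"^K_step:\s*(\d+)\s*(?:#.*)?$", config_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


# Hypothetical excerpt of a training config; the values are made up.
example_config = """\
timesteps: 1000
K_step: 400
"""

# The value to pass as 'depth' at inference time, under the assumption above.
depth = read_k_step(example_config)
```

The resulting integer would then be wrapped in an array and passed as the `depth` input of the ONNX session. The exact dtype the exported model expects is not stated in this thread, so check the model's input signature before relying on it.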

loct824 commented on September 6, 2024

I got another error:

[ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Non-zero status code returned while running Gather node. Name:'/fs2/txt_embed/Gather' Status Message: indices element out of data bounds, idx=50 must be within the inclusive range [-50,49]

Should I use the ONNX model for code deployment (e.g. building an API)? Does it require significant effort to refactor the code given the latest changes?

yqzhishen commented on September 6, 2024

It seems your model has a different phoneme set from the default one in MiniEngine. You should use the correct dictionary when running inference with the model.

However, MiniEngine is no longer maintained. If you do not have a strong need to run models from the CLI or on a remote host, please consider using OpenUTAU for a modern user experience. I also recommend referring to it for the whole inference procedure.

loct824 commented on September 6, 2024

I managed to run inference after making the changes below:

Changing the reserved tokens to 2 in the config file:

  filename: assets/dictionaries/dictionary.txt
  reserved_tokens: 2

My understanding is that reserved tokens are tokens like AP and SP that are not in the phoneme dictionary. Is that correct?

Adding the depth parameter in the acoustic_infer method:

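import numpy as np

def acoustic_infer(model: str, providers: list, tokens, durations, f0, speedup):
    session = utils.create_session(model, providers)
    # Debug output: print the type of each input before running the session.
    print(type(tokens))
    print(type(durations))
    print(type(f0))
    print(type(speedup))
    # 'depth' is the additional input required by the shallow diffusion model.
    inputs = {'tokens': tokens, 'durations': durations, 'f0': f0, 'speedup': speedup, 'depth': np.array(1000)}
    mel = session.run(['mel'], inputs)[0]
    return mel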

However, I noticed that the quality of the output waveform differs significantly from the results I obtained using infer.py in the DiffSinger repo. Could you give some advice on how to refactor the code in DiffSingerMiniEngine to match the quality of the DiffSinger repo?

yqzhishen commented on September 6, 2024

No, reserved tokens were padding tokens for some historical reasons, and most models nowadays have only 1 reserved token. AP and SP are real tokens. You should make sure the phoneme IDs are correct to get reasonable results.

Are you sure you are using the correct dictionary of the model?
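To illustrate why the reserved token count matters, here is a minimal sketch of how phoneme IDs shift with it. It assumes that IDs are assigned by sorting the phoneme set and offsetting every index by the number of reserved (padding) tokens. The function names and the toy phoneme list are hypothetical, and the actual encoder in DiffSinger or MiniEngine may differ in its details.

```python
def build_phoneme_ids(phonemes, reserved_tokens=1):
    """Map each phoneme to an ID, leaving the first `reserved_tokens` IDs for padding.

    Assumption for illustration: IDs follow the sorted order of the phoneme set.
    """
    ordered = sorted(set(phonemes))
    return {ph: index + reserved_tokens for index, ph in enumerate(ordered)}


def out_of_range_ids(ids, vocab_size):
    """Return the phonemes whose ID falls outside [0, vocab_size - 1]."""
    return {ph: i for ph, i in ids.items() if not 0 <= i < vocab_size}


# Toy phoneme set; AP and SP are ordinary tokens, not reserved ones.
phonemes = ["AP", "SP", "a", "i", "k"]
# Embedding rows in a model trained with 1 reserved token: 5 phonemes + 1 padding.
vocab_size = len(phonemes) + 1

ids_with_1 = build_phoneme_ids(phonemes, reserved_tokens=1)
ids_with_2 = build_phoneme_ids(phonemes, reserved_tokens=2)

print(out_of_range_ids(ids_with_1, vocab_size))
print(out_of_range_ids(ids_with_2, vocab_size))
```

Under these assumptions, a reserved token count larger than the one used in training shifts every phoneme ID upward. The highest IDs then fall outside the embedding table, which is consistent with the Gather error reported earlier in the thread. Phonemes that remain in range are still mapped to the wrong embedding row, which would degrade output quality without raising any error.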

loct824 commented on September 6, 2024

Yes, I am using exactly the same dictionary as the one used for training.

I just changed reserved_tokens to 1, and it now gives the same results and quality as infer.py. I guess the reserved_tokens value was what shifted the indices used for the phonemes.

Thank you so much for your help!
