Include g2pW into PaddleSpeech TTS about g2pw HOT 17 CLOSED

gitycc commented on July 28, 2024 1

Include g2pW into PaddleSpeech TTS

from g2pw.

Comments (17)

BarryKCL commented on July 28, 2024 1

Of course, I will submit the onnxruntime code as soon as possible.


yt605155624 commented on July 28, 2024 1

good idea


yt605155624 commented on July 28, 2024 1

please check https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/g2p @beyondguo


GitYCC commented on July 28, 2024

@yt605155624 @BarryKCL
Thanks for your effort. Great job!
Do you need any help?

@BarryKCL
Great work converting the torch model to ONNX.
Could I invite you to submit a PR that replaces the models with their ONNX versions to speed up inference?


yt605155624 commented on July 28, 2024

@GitYCC Not yet. Once this is merged and PaddleSpeech community users start using it, we may find problems that need your help. There is one small bug in g2pw: some words are missing from your Simplified -> Traditional Chinese dictionary. (Sorry, I don't really know Taiwanese or Minnan; a Simplified Chinese dictionary might be more convenient for our mainland users.) You can reproduce it with the input "概念". In @BarryKCL's script, he added a try/except to work around this bug and falls back to g2pM. Please check https://github.com/PaddlePaddle/PaddleSpeech/blob/aecf8fd3844371abcce5d337fab83aae6807285b/paddlespeech/t2s/frontend/zh_frontend.py#L186
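
For readers who don't want to open the link, the try/except fallback described above has roughly the following shape. This is only a sketch and not the actual PaddleSpeech code: `g2p_with_fallback`, `fake_g2pw`, and `fake_g2pm` are hypothetical names, and the two fake converters just stand in for the real g2pW and g2pM calls.

```python
from typing import Callable, List


def g2p_with_fallback(
    sentence: str,
    primary: Callable[[str], List[str]],
    backup: Callable[[str], List[str]],
) -> List[str]:
    """Try the primary converter first; use the backup if it raises."""
    try:
        return primary(sentence)
    except Exception:
        # Any failure in the primary converter routes the sentence to the backup.
        return backup(sentence)


# Hypothetical stand-ins for the real converters (assumptions for illustration).
def fake_g2pw(sentence: str) -> List[str]:
    if sentence == "概念":
        raise ValueError("primary converter failed on this input")
    return ["ni3", "hao3"]


def fake_g2pm(sentence: str) -> List[str]:
    return ["gai4", "nian4"]


print(g2p_with_fallback("概念", fake_g2pw, fake_g2pm))
print(g2p_with_fallback("你好", fake_g2pw, fake_g2pm))
```

Catching a broad `Exception` keeps the pipeline running, but it also hides the real cause of the failure, so logging the exception before falling back is probably worth it.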


GitYCC commented on July 28, 2024

@yt605155624
The root cause is that our model is trained on a Traditional Chinese dataset. To handle Simplified Chinese input, I need to use the OpenCC package to convert it first, but OpenCC still fails to convert some cases well. Maybe we could look for a better way to do this conversion.


yt605155624 commented on July 28, 2024

We also tried OpenCC for Traditional -> Simplified -> Traditional Chinese conversion in PaddleSpeech TTS, but because OpenCC had an installation bug on Windows (I'm not sure whether it still does), we removed OpenCC and now use a look-up table instead (the table may have been copied from somewhere on GitHub; I don't remember). You can check https://github.com/PaddlePaddle/PaddleSpeech/blob/0eb598b876f99bd26fa735577da92d46c45dc3fd/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py#L81 , but I'm not sure whether it works well for Taiwanese/Minnan Traditional Chinese.

But let me put it another way. If possible, you could:

1. Convert your Traditional Chinese dataset to a Simplified Chinese dataset
2. Train a Simplified Chinese g2pW model

😍

I don't know the complexity of this task, because I don't understand the Minnan language at all.. 🥺


GitYCC commented on July 28, 2024

The problem still exists. To convert the dataset into Simplified Chinese, we need a good Traditional -> Simplified Chinese converter. And if we had such a converter, our first problem would already be solved. XD

Maybe just use the look-up table (https://github.com/PaddlePaddle/PaddleSpeech/blob/0eb598b876f99bd26fa735577da92d46c45dc3fd/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py#L81 ) to solve this problem temporarily.
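
A character-level look-up table of this kind can be sketched as below. The five-entry table and the function name `traditional_to_simplified` are assumptions for illustration only; the real PaddleSpeech table at the link above is much larger and is not reproduced here.

```python
# Tiny illustrative Traditional -> Simplified table (an assumption, not the real table).
T2S_TABLE = {
    "體": "体",
    "簡": "简",
    "語": "语",
    "漢": "汉",
    "學": "学",
}

# str.maketrans builds a code-point mapping that str.translate can apply in one pass.
_T2S = str.maketrans(T2S_TABLE)


def traditional_to_simplified(text: str) -> str:
    """Convert character by character; characters not in the table pass through unchanged."""
    return text.translate(_T2S)


print(traditional_to_simplified("漢語學習"))
print(traditional_to_simplified("概念"))
```

Because unmapped characters pass through silently, a gap in the table shows up as a Traditional character left in the output (here "習"), which is the same kind of coverage problem discussed in this thread.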


GitYCC commented on July 28, 2024

Of course, I will submit the onnxruntime code as soon as possible.

Thank you!


yt605155624 commented on July 28, 2024

@GitYCC But even with a mediocre Traditional -> Simplified Chinese converter, you can still get a dataset (though the amount of usable data may shrink). I don't know whether a smaller dataset would hurt the quality of a Simplified Chinese g2pw model.


GitYCC commented on July 28, 2024

Actually, in my opinion, regardless of how good the Traditional -> Simplified Chinese converter is, if we just use such a converter, applying it to the dataset beforehand ("pre-use on dataset") has the same effect as applying it to the input afterwards ("post-use on changing input").
The error cases filtered out in the "pre-use on dataset" approach would never be seen during training, so the resulting model still could not handle the characters that the conversion misses.


yt605155624 commented on July 28, 2024

I'm not very familiar with NLP, so my naive thinking was:

The G2P BERT only has to be trained on polyphonic characters, not on words such as "概念". Even if "概念" is absent from your dataset after Traditional -> Simplified Chinese conversion, the pretrained BERT must have seen "概念" before.
Because people in mainland China use Simplified Chinese, I naively assumed that the pretrained BERT vocab has fewer "missing chars" for Simplified Chinese than for Traditional Chinese. For example, simplified "概念" may be in BERT's vocab, but OpenCC cannot convert simplified "概念" to traditional "概念"; so even if traditional "概念" is in BERT's vocab, the traditional g2pw still cannot deal with simplified "概念" input.

g2pw is excellent work, and I think it will have a great influence in the Chinese community (more of whom use Simplified Chinese). If it is blocked by a bad converter, I will be very sad.


GitYCC commented on July 28, 2024

I have an idea for building a good look-up table: we can use Google Translate to help us.

Like this:

I will change the converter to work this way in the future.


yt605155624 commented on July 28, 2024

Oh, I just found that when the input is "概念", the error is not caused by the Simplified -> Traditional Chinese converter. It happens because "概念" contains no polyphonic characters, so the texts output of prepare_data is [], and therefore the input to BERT is also empty. An emptiness check would probably fix this.

sent before convert: 概念，
sent after convert: 概念，
sentences: ['概念，']
[] [] [] [['gai4', 'nian4', None]]
texts: []
onnx_input: {'input_ids': array([], dtype=float64), 'token_type_ids': array([], dtype=float64), 'attention_masks': array([], dtype=float64), 'phoneme_masks': array([], dtype=float32), 'char_ids': array([], dtype=float64), 'position_ids': array([], dtype=float64)}
[概念，] not in g2pW dict,use g2pM
sent before convert: 你我，
sent after convert: 你我，
sentences: ['你我，']
[] [] [] [['ni3', 'wo3', None]]
texts: []
onnx_input: {'input_ids': array([], dtype=float64), 'token_type_ids': array([], dtype=float64), 'attention_masks': array([], dtype=float64), 'phoneme_masks': array([], dtype=float32), 'char_ids': array([], dtype=float64), 'position_ids': array([], dtype=float64)}
[你我，] not in g2pW dict,use g2pM
sent before convert: 你好
sent after convert: 你好
sentences: ['你好']
char in polyphonic_chars: 好
['你好'] [1] [0] [['ni3', None]]
texts: ['你好']
onnx_input: {'input_ids': array([[ 101,  872, 1962,  102]]), 'token_type_ids': array([[0, 0, 0, 0]]), 'attention_masks': array([[1, 1, 1, 1]]), 'phoneme_masks': array([[0., 0., 0., ..., 0., 0., 0.]], dtype=float32), 'char_ids': array([580]), 'position_ids': array([2])}

Maybe you should check this as well:

g2pW/scripts/predict_g2p_bert.py

Line 30 in ece11b8

texts, query_ids = prepare_data(sent_path)
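
The emptiness check suggested above could look roughly like this. It is only a sketch: `predict_polyphones`, `run_model`, and `fake_model` are hypothetical names for illustration, not functions from g2pW or PaddleSpeech, and `fake_model` merely imitates a model that cannot accept an empty batch.

```python
from typing import Callable, List


def predict_polyphones(
    texts: List[str],
    query_ids: List[int],
    run_model: Callable[[List[str], List[int]], List[str]],
) -> List[str]:
    """Skip the model call when the sentence has no polyphonic characters."""
    if not texts:
        # Nothing to disambiguate, so there is nothing to feed to the model.
        return []
    return run_model(texts, query_ids)


# Hypothetical stand-in for the BERT/ONNX inference step (an assumption).
def fake_model(texts: List[str], query_ids: List[int]) -> List[str]:
    if len(texts) == 0:
        raise ValueError("empty batch")
    return ["hao3" for _ in texts]


print(predict_polyphones([], [], fake_model))        # the "概念" case: no polyphones
print(predict_polyphones(["你好"], [1], fake_model))  # the "你好" case from the log
```

With a guard like this, sentences without polyphonic characters would keep the pronunciations already resolved from the dictionary, instead of raising and being routed to the backup converter.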


GitYCC commented on July 28, 2024

Thanks for catching bugs. #10


GitYCC commented on July 28, 2024

#11


beyondguo commented on July 28, 2024

Add g2pW to Chinese frontend PaddlePaddle/PaddleSpeech#2230

Hi, I'm not familiar with PaddleSpeech. For now I only want to use the g2p in PaddleSpeech to get the pinyin of sentences (to speed up pinyin generation). Could you give a tiny code example? Thanks a lot!

