
Comments (17)

BarryKCL avatar BarryKCL commented on July 28, 2024 1

Of course, I will submit the onnxruntime code as soon as possible.

from g2pw.

yt605155624 avatar yt605155624 commented on July 28, 2024 1

good idea


yt605155624 avatar yt605155624 commented on July 28, 2024 1

please check https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/g2p @beyondguo


GitYCC avatar GitYCC commented on July 28, 2024

@yt605155624 @BarryKCL
Thanks for your effort. Great job!
Do you need any help?

@BarryKCL
Great work converting the torch model to ONNX.
Could I invite you to open a PR that replaces the models with ONNX versions to speed up inference?


yt605155624 avatar yt605155624 commented on July 28, 2024

@GitYCC Not yet, but once this is merged and used by PaddleSpeech community users, we may find problems that need your help.

There is a small bug in g2pw: some words are not in your 简体->繁体 dict (sorry, I don't really understand Taiwanese or Minnan; maybe a 简体 dict would be more convenient for our mainland users). You can try the input "概念". In @BarryKCL's script, he added a "try/except" to avoid this bug and uses g2pM as a backup; please check https://github.com/PaddlePaddle/PaddleSpeech/blob/aecf8fd3844371abcce5d337fab83aae6807285b/paddlespeech/t2s/frontend/zh_frontend.py#L186
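
The fallback described in this comment can be sketched roughly as follows. This is only an illustration, not the PaddleSpeech code: `convert_with_fallback`, `primary`, and `backup` are hypothetical names, and both converters are passed in as plain callables so that nothing is assumed about the real g2pW or g2pM APIs.

```python
from typing import Callable, List

# Hypothetical type: any function mapping a sentence to a list of pinyin strings.
Converter = Callable[[str], List[str]]


def convert_with_fallback(sentence: str, primary: Converter, backup: Converter) -> List[str]:
    """Try the primary G2P converter; use the backup if the primary raises."""
    try:
        return primary(sentence)
    except Exception:
        # Assumed policy: any failure in the primary converter
        # (for example an empty model input) sends the sentence to the backup.
        return backup(sentence)
```

Catching every exception hides the root cause, so a fallback like this is best treated as a stopgap until the underlying failure is understood.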


GitYCC avatar GitYCC commented on July 28, 2024

@yt605155624
The root cause of this problem is that our model was trained on a Traditional Chinese (繁体) dataset. So, to apply it to 简体 input, I need to use the OpenCC package to convert the text. But OpenCC still has some cases that it cannot convert well. Maybe we could look for a better way to do this conversion.


yt605155624 avatar yt605155624 commented on July 28, 2024

We also tried opencc for 繁体 -> 简体 -> 繁体 in PaddleSpeech TTS, but because opencc had some bugs when installed on Windows (not sure whether this is still the case), we removed opencc and now use a look-up table (maybe this table was copied from somewhere on GitHub, I don't remember). You can check https://github.com/PaddlePaddle/PaddleSpeech/blob/0eb598b876f99bd26fa735577da92d46c45dc3fd/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py#L81 , but I'm not sure whether it works well for Taiwan/Minnan 繁体.

But let me put it another way. Would it be possible for you to:

  1. convert your Traditional Chinese (繁体) dataset to a 简体 dataset
  2. train a 简体 G2PW model
    😍

I don't know the complexity of this task, because I don't understand the Minnan language at all.. 🥺


GitYCC avatar GitYCC commented on July 28, 2024

The problem still exists. To convert the dataset into 简体, we need a good 繁体 -> 简体 converter. And if we had such a converter, our first problem would already be solved. XD

Maybe just use the look-up table (https://github.com/PaddlePaddle/PaddleSpeech/blob/0eb598b876f99bd26fa735577da92d46c45dc3fd/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py#L81 ) to solve this problem temporarily.
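
A character-level look-up table of this kind can be sketched as below. The four-character table is purely illustrative (a real table would cover thousands of characters), and the names `T2S_TABLE`, `S2T_TABLE`, `traditional_to_simplified`, and `simplified_to_traditional` are hypothetical; this does not reproduce the table in the linked PaddleSpeech file.

```python
# Tiny illustrative table: each traditional character lines up with
# the simplified character at the same index.
TRADITIONAL = "體語開簡"
SIMPLIFIED = "体语开简"

T2S_TABLE = str.maketrans(TRADITIONAL, SIMPLIFIED)
S2T_TABLE = str.maketrans(SIMPLIFIED, TRADITIONAL)


def traditional_to_simplified(text: str) -> str:
    """Replace every character found in the table; leave the rest unchanged."""
    return text.translate(T2S_TABLE)


def simplified_to_traditional(text: str) -> str:
    """Reverse mapping; only valid where the mapping is one-to-one."""
    return text.translate(S2T_TABLE)
```

Characters outside the table pass through unchanged, so "概念" is returned as-is. A one-to-one table like this cannot handle simplified characters that correspond to several traditional ones (for example 发, which can be 發 or 髮), which is one reason the 简体 -> 繁体 direction is harder.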


GitYCC avatar GitYCC commented on July 28, 2024

"Of course, I will submit the onnxruntime code as soon as possible." (quoting @BarryKCL)

Thank you!


yt605155624 avatar yt605155624 commented on July 28, 2024

@GitYCC But even with a not-so-good 繁体 -> 简体 converter, you can still get a dataset (though the amount of usable data may be reduced). I don't know whether a smaller dataset would hurt the performance of a 简体 g2pw model.


GitYCC avatar GitYCC commented on July 28, 2024

Actually, in my opinion,
no matter whether the 繁体 -> 简体 converter is good or not, if we just use such a converter, the effect of "pre-use on the dataset" is the same as that of "post-use on the input",
because the error cases filtered out by "pre-use on the dataset" would never be seen in training, so such models still could not deal with characters missing from the conversion.


yt605155624 avatar yt605155624 commented on July 28, 2024

I'm not very familiar with NLP, I naively thought:

  1. The G2P BERT only has to be trained on polyphonic characters, not on words such as "概念". Even if "概念" is not in your dataset after the 繁体 -> 简体 conversion, the pretrained BERT must have seen "概念" before.

  2. Because people in mainland China use Simplified Chinese, I naively thought that the pretrained BERT vocab might have fewer "missing chars" for Simplified Chinese than for Traditional Chinese. For example, simplified "概念" may be in BERT's vocab, but if opencc cannot convert simplified "概念" to traditional "概念", then even with traditional "概念" in BERT's vocab, the traditional g2pw still cannot deal with simplified "概念" input.

g2pw is excellent work, and I think it will have a great influence in the Chinese community (most of whom use Simplified Chinese). If it is blocked by a bad converter, I will be very sad.


GitYCC avatar GitYCC commented on July 28, 2024

I have an idea for building a good look-up table. We can use Google Translate to help us.

Like this:
[image]

I will switch the converter to this approach in the future.


yt605155624 avatar yt605155624 commented on July 28, 2024

Oh, I just found that, for the input "概念", the error is not caused by the 简体 -> 繁体 converter. It happens because there is no polyphone in "概念", so the texts output of prepare_data is [], and so the input to BERT is [] as well. Maybe an empty check will fix this.

sent before convert: 概念,
sent after convert: 概念,
sentences: ['概念,']
[] [] [] [['gai4', 'nian4', None]]
texts: []
onnx_input: {'input_ids': array([], dtype=float64), 'token_type_ids': array([], dtype=float64), 'attention_masks': array([], dtype=float64), 'phoneme_masks': array([], dtype=float32), 'char_ids': array([], dtype=float64), 'position_ids': array([], dtype=float64)}
[概念,] not in g2pW dict,use g2pM
sent before convert: 你我,
sent after convert: 你我,
sentences: ['你我,']
[] [] [] [['ni3', 'wo3', None]]
texts: []
onnx_input: {'input_ids': array([], dtype=float64), 'token_type_ids': array([], dtype=float64), 'attention_masks': array([], dtype=float64), 'phoneme_masks': array([], dtype=float32), 'char_ids': array([], dtype=float64), 'position_ids': array([], dtype=float64)}
[你我,] not in g2pW dict,use g2pM
sent before convert: 你好
sent after convert: 你好
sentences: ['你好']
char in polyphonic_chars: 好
['你好'] [1] [0] [['ni3', None]]
texts: ['你好']
onnx_input: {'input_ids': array([[ 101,  872, 1962,  102]]), 'token_type_ids': array([[0, 0, 0, 0]]), 'attention_masks': array([[1, 1, 1, 1]]), 'phoneme_masks': array([[0., 0., 0., ..., 0., 0., 0.]], dtype=float32), 'char_ids': array([580]), 'position_ids': array([2])}

Maybe you should also check this:

texts, query_ids = prepare_data(sent_path)
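
The empty check suggested above could look roughly like this. It is a sketch under assumptions, not the fix that was merged into g2pw: `predict_polyphones` and `run_model` are hypothetical names, and the model is passed in as a callable so that no real g2pw or onnxruntime signature is assumed.

```python
from typing import Callable, List, Sequence


def predict_polyphones(
    texts: Sequence[str],
    query_ids: Sequence[int],
    run_model: Callable[[Sequence[str], Sequence[int]], List[str]],
) -> List[str]:
    """Run the model only when there is at least one polyphonic query."""
    if len(texts) == 0:
        # No polyphonic characters in the input (e.g. "概念"), so there is
        # nothing to predict: skip the model instead of feeding it empty arrays.
        return []
    return run_model(texts, query_ids)
```

With a guard like this, sentences such as "概念" or "你我" would skip the model entirely, while "你好" (where 好 is polyphonic) would still go through it.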


GitYCC avatar GitYCC commented on July 28, 2024

Thanks for catching bugs. #10


GitYCC avatar GitYCC commented on July 28, 2024

#11


beyondguo avatar beyondguo commented on July 28, 2024

Hi, I'm not familiar with PaddleSpeech. For now I only want to use the g2p in PaddleSpeech to get the pinyin of sentences (in order to speed up pinyin generation). Could you give a tiny code example? Thanks a lot!
