Comments (17)
Of course, I will submit the onnxruntime code as soon as possible.
from g2pw.
good idea
please check https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/g2p @beyondguo
@yt605155624 @BarryKCL
Thanks for your effort. Great job!
Do you need any help?
@BarryKCL
Great work converting the torch model to ONNX.
Could I invite you to open a PR that replaces the models with ONNX ones to speed things up?
@GitYCC Not yet, but after it is merged and used by PaddleSpeech community users, we may find some problems that need your help. There is a small bug in g2pw: some words are missing from your 简体 -> 繁体 dict (sorry, I don't really understand the Taiwanese/Minnan language; maybe using a 简体 dict would be more convenient for mainland users). You can try inputting the word "概念". In @BarryKCL's script, he added a try/except to avoid this bug and uses g2pM as a backup; please check https://github.com/PaddlePaddle/PaddleSpeech/blob/aecf8fd3844371abcce5d337fab83aae6807285b/paddlespeech/t2s/frontend/zh_frontend.py#L186
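The try/except fallback described above can be sketched like this. Both converter functions below are self-contained stand-ins, not the real g2pW/g2pM APIs, and the polyphonic set and dummy pinyin are illustrative assumptions only:

```python
# Sketch of the g2pW -> g2pM fallback pattern described above.
# g2pw_convert and g2pm_convert are hypothetical stand-ins, NOT the
# real g2pW/g2pM interfaces used in PaddleSpeech.

def g2pw_convert(sentence):
    # Stand-in for the g2pW ONNX converter. It mimics the reported bug:
    # a sentence without any polyphonic character produces an empty
    # BERT input and fails.
    polyphonic = {"好", "和", "乐"}  # tiny illustrative set
    if not any(ch in polyphonic for ch in sentence):
        raise ValueError("empty BERT input: no polyphonic character")
    return ["ni3", "hao3"]  # dummy pinyin for the demo

def g2pm_convert(sentence):
    # Stand-in for the g2pM backup converter.
    table = {"概念,": ["gai4", "nian4"], "你好": ["ni3", "hao3"]}
    return table.get(sentence, [])

def get_pinyin(sentence):
    # The guard added in @BarryKCL's script: if g2pW fails on the
    # input, fall back to g2pM instead of crashing.
    try:
        return g2pw_convert(sentence)
    except Exception:
        print(f"[{sentence}] not in g2pW dict, use g2pM")
        return g2pm_convert(sentence)
```

The point of the pattern is that inputs g2pW cannot handle still get a (possibly lower-quality) pronunciation from the backup instead of crashing the frontend.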
@yt605155624
The root cause of this problem is that our model is trained on a Traditional Chinese (繁体) dataset. So if we want to apply it to 简体 cases, we need the OpenCC package to convert them. But OpenCC still has some cases it cannot convert very well. Maybe we could find a better method for this conversion.
We also tried opencc for 繁体 -> 简体 -> 繁体 in PaddleSpeech TTS, but because opencc had a bug when installing on Windows (not sure whether that is still a bug now), we removed opencc and now use a look-up table (maybe the table was copied from somewhere on GitHub, I don't remember); you can check https://github.com/PaddlePaddle/PaddleSpeech/blob/0eb598b876f99bd26fa735577da92d46c45dc3fd/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py#L81 , but I'm not sure whether it works well for Taiwan/Minnan 繁体.
But let me put it another way: would it be possible for you to
- convert your Traditional Chinese (繁体) dataset to a 简体 dataset
- train a 简体 G2PW model
😍
I don't know the complexity of this task, because I don't understand the Minnan language at all.. 🥺
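A look-up-table converter of the kind linked above is essentially a per-character dict. A minimal self-contained sketch, using a tiny illustrative table rather than the real one from text_normlization.py:

```python
# Minimal sketch of a 繁体 -> 简体 look-up-table converter, in the
# spirit of the table in zh_normalization/text_normlization.py.
# This three-entry table is illustrative only; the real one is much larger.
T2S_TABLE = {
    "愛": "爱",
    "說": "说",
    "話": "话",
}

def traditional_to_simplified(text):
    # Characters missing from the table (e.g. Taiwan/Minnan-specific
    # ones) pass through unchanged -- the coverage problem discussed above.
    return "".join(T2S_TABLE.get(ch, ch) for ch in text)
```

The pass-through default is the weak spot: any character outside the table silently survives unconverted, which is exactly why a partial table can only be a temporary fix.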
The problem still exists. To convert the dataset into 简体, we need a good 繁体 -> 简体 converter, and if we had such a converter, our first problem would already be solved. XD
Maybe just use the look-up table (https://github.com/PaddlePaddle/PaddleSpeech/blob/0eb598b876f99bd26fa735577da92d46c45dc3fd/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py#L81) to solve this problem temporarily.
Of course, I will submit the onnxruntime code as soon as possible.
Thank you!
@GitYCC But even with a not-so-good 繁体 -> 简体 converter, you can still get a dataset (though the amount of usable data may be reduced). I don't know whether that reduction would hurt the performance of a 简体 g2pw model.
Actually, in my opinion, no matter whether the 繁体 -> 简体 converter is good or not, if we just use such a converter, the effect of applying it to the dataset beforehand ("pre-use") is the same as applying it to the input at inference time ("post-use"): the error cases filtered out of the dataset would never be trained on, so such a model still could not deal with the characters the conversion misses.
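The equivalence argued above can be seen with a toy converter; the table, corpus strings, and the framing of "pre-use" vs "post-use" are illustrative assumptions, not anything from the real pipeline:

```python
# Toy illustration: with a lossy T->S converter, applying it to the
# dataset beforehand ("pre-use") or to the input at inference time
# ("post-use") fails on exactly the same characters -- those outside
# the converter's coverage. The two-entry table is illustrative only.
converter = {"愛": "爱", "說": "说"}
corpus = ["愛", "說", "概"]  # pretend 概 is a char the converter misses

# pre-use: unconvertible items are filtered out of the training data,
# so the model never sees them
pre_use = [converter[c] for c in corpus if c in converter]

# post-use: unconvertible items fail at inference time (None here)
post_use = [converter.get(c) for c in corpus]
```

Either way the character 概 is lost: the model never learns it (pre-use), or the converter fails on it when a user types it (post-use).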
I'm not very familiar with NLP; I naively thought:
- The G2P BERT only has to be trained on polyphonic characters, not on words such as "概念". Even if "概念" is not in your dataset after the 繁体 -> 简体 conversion, the pretrained BERT must have seen "概念" before.
- Because people in mainland China use Simplified Chinese, I naively thought there might not be as many "missing chars" in Simplified Chinese as in Traditional Chinese for the pretrained BERT vocab. For example, simplified "概念" may be in BERT's vocab while opencc cannot convert simplified "概念" to traditional "概念"; and even if traditional "概念" is in BERT's vocab, the traditional g2pw still cannot deal with simplified "概念" input.

g2pw is excellent work, and I think it will have a great influence in the Chinese community (most of whom use Simplified Chinese). If it is blocked by a bad converter, I will be very sad.
I have an idea for getting a good look-up table: we can use Google Translate to help us.
I will change the converter in this way in the future.
Oh, I just found that when the input is "概念", the error is not because of the 简体 -> 繁体 converter, but because there is no polyphone in "概念", so the `texts` output of `prepare_data` is `[]`, and the input of BERT is therefore `[]` as well. Maybe an empty check will fix this. Debug output:
```
sent before convert: 概念,
sent after convert: 概念,
sentences: ['概念,']
[] [] [] [['gai4', 'nian4', None]]
texts: []
onnx_input: {'input_ids': array([], dtype=float64), 'token_type_ids': array([], dtype=float64), 'attention_masks': array([], dtype=float64), 'phoneme_masks': array([], dtype=float32), 'char_ids': array([], dtype=float64), 'position_ids': array([], dtype=float64)}
[概念,] not in g2pW dict,use g2pM
sent before convert: 你我,
sent after convert: 你我,
sentences: ['你我,']
[] [] [] [['ni3', 'wo3', None]]
texts: []
onnx_input: {'input_ids': array([], dtype=float64), 'token_type_ids': array([], dtype=float64), 'attention_masks': array([], dtype=float64), 'phoneme_masks': array([], dtype=float32), 'char_ids': array([], dtype=float64), 'position_ids': array([], dtype=float64)}
[你我,] not in g2pW dict,use g2pM
sent before convert: 你好
sent after convert: 你好
sentences: ['你好']
char in polyphonic_chars: 好
['你好'] [1] [0] [['ni3', None]]
texts: ['你好']
onnx_input: {'input_ids': array([[ 101, 872, 1962, 102]]), 'token_type_ids': array([[0, 0, 0, 0]]), 'attention_masks': array([[1, 1, 1, 1]]), 'phoneme_masks': array([[0., 0., 0., ..., 0., 0., 0.]], dtype=float32), 'char_ids': array([580]), 'position_ids': array([2])}
```
Maybe you should also check g2pW/scripts/predict_g2p_bert.py, line 30 (at commit ece11b8).
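The empty check suggested above might look like this; `run_onnx`, `predict_with_guard`, and the parameter names are placeholders for illustration, not the actual interface of predict_g2p_bert.py:

```python
# Sketch of the suggested empty-input guard. When prepare_data finds no
# polyphonic character, `texts` is [], and the ONNX session must not be
# called with empty input arrays. `run_onnx` and these parameter names
# are placeholders, not the real predict_g2p_bert.py interface.

def predict_with_guard(texts, query_ids, run_onnx, fallback_pinyins):
    if len(texts) == 0:
        # Nothing polyphonic to disambiguate; skip BERT entirely and
        # return the monophone pronunciations already resolved upstream.
        return fallback_pinyins
    return run_onnx(texts, query_ids)
```

This matches the log above: "概念," and "你我," produce `texts: []` and should bypass the model, while "你好" (containing the polyphone 好) takes the normal ONNX path.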
Thanks for catching bugs. #10
- Add g2pW to Chinese frontend PaddlePaddle/PaddleSpeech#2230
Hi, I'm not familiar with PaddleSpeech. Right now I only want to use the g2p in PaddleSpeech to get the pinyin of sentences (in order to speed up pinyin generation). Could you give a tiny code example? Thanks a lot!