g2pW: Mandarin Grapheme-to-Phoneme Converter

Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Yeh

This is the official repository for our paper "g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin" (INTERSPEECH 2022).

News

Getting Started

Dependency / Install

(This work was tested with PyTorch 1.7.0, CUDA 10.1, Python 3.6, and Ubuntu 16.04.)

  • Install PyTorch

  • $ pip install g2pw

Quick Demo

>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter()
>>> sentence = '上校請技術人員校正FN儀器'
>>> conv(sentence)
[['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2', 'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']]
>>> sentences = ['銀行', '行動']
>>> conv(sentences)
[['ㄧㄣ2', 'ㄏㄤ2'], ['ㄒㄧㄥ2', 'ㄉㄨㄥ4']]
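
In the demo output above, each sentence maps to a list with one entry per input character, and characters that are not converted (the Latin letters `FN`) come back as `None`. The following is a minimal sketch of aligning characters with their readings under that assumption. It uses the demo output as literal data so it runs without the model, and `align_readings` is a hypothetical helper, not part of g2pw.

```python
from typing import List, Optional, Tuple


def align_readings(sentence: str,
                   readings: List[Optional[str]]) -> List[Tuple[str, Optional[str]]]:
    """Pair each character with its reading (None if it was not converted)."""
    # Assumption taken from the demo: one reading per input character.
    if len(sentence) != len(readings):
        raise ValueError("expected one reading per character")
    return list(zip(sentence, readings))


# Output copied from the Quick Demo (the single input sentence).
sentence = '上校請技術人員校正FN儀器'
readings = ['ㄕㄤ4', 'ㄒㄧㄠ4', 'ㄑㄧㄥ3', 'ㄐㄧ4', 'ㄕㄨ4', 'ㄖㄣ2', 'ㄩㄢ2',
            'ㄐㄧㄠ4', 'ㄓㄥ4', None, None, 'ㄧ2', 'ㄑㄧ4']

pairs = align_readings(sentence, readings)
unconverted = [ch for ch, reading in pairs if reading is None]

print(pairs[1], pairs[7])  # the two occurrences of 校 and their readings
print(unconverted)         # characters that came back as None
```

The two occurrences of 校 receive different readings (ㄒㄧㄠ4 and ㄐㄧㄠ4), which is the polyphone disambiguation the model performs.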

Load Offline Model

conv = G2PWConverter(model_dir='./G2PWModel-v2-onnx/', model_source='./path-to/bert-base-chinese/')
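
Before constructing the converter offline, it can help to confirm that both directories exist. The sketch below only checks the filesystem and makes no assumption about which files g2pw expects inside each directory. `missing_dirs` is a hypothetical helper, and the paths are the placeholders from the example above.

```python
from pathlib import Path
from typing import List


def missing_dirs(*paths: str) -> List[str]:
    """Return the given paths that are not existing directories."""
    return [p for p in paths if not Path(p).is_dir()]


# Placeholder paths copied from the example above; adjust to your layout.
model_dir = './G2PWModel-v2-onnx/'
model_source = './path-to/bert-base-chinese/'

missing = missing_dirs(model_dir, model_source)
if missing:
    print('Missing directories:', missing)
else:
    print('Both directories exist.')
```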

Support Simplified Chinese and Pinyin

>>> from g2pw import G2PWConverter
>>> conv = G2PWConverter(style='pinyin', enable_non_tradional_chinese=True)
>>> conv('然而,他红了20年以后,他竟退出了大家的视线。')
[['ran2', 'er2', None, 'ta1', 'hong2', 'le5', None, None, 'nian2', 'yi3', 'hou4', None, 'ta1', 'jing4', 'tui4', 'chu1', 'le5', 'da4', 'jia1', 'de5', 'shi4', 'xian4', None]]
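
Each reading in the outputs above ends with a tone digit (`le5` and `de5` carry the neutral tone, written as 5). As a rough sketch of post-processing, a reading can be split into syllable and tone. `split_tone` is a hypothetical helper, not part of g2pw, and the data is copied from the start of the output above so the block runs without the model.

```python
from typing import List, Optional, Tuple


def split_tone(reading: Optional[str]) -> Optional[Tuple[str, int]]:
    """Split 'hong2' into ('hong', 2); pass None through unchanged."""
    if reading is None:
        return None
    # Assumption: every reading ends in a single tone digit from 1 to 5.
    if len(reading) < 2 or reading[-1] not in '12345':
        raise ValueError(f"no trailing tone digit in {reading!r}")
    return reading[:-1], int(reading[-1])


# First six entries of the pinyin output shown above.
readings: List[Optional[str]] = ['ran2', 'er2', None, 'ta1', 'hong2', 'le5']

parsed = [split_tone(r) for r in readings]
neutral = [p[0] for p in parsed if p is not None and p[1] == 5]

print(parsed)
print(neutral)  # syllables carrying the neutral tone
```

The same helper applies to the default Bopomofo style, since those readings also end in a tone digit (for example `'ㄕㄤ4'`).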

Scripts

$ git clone https://github.com/GitYCC/g2pW.git

Train Model

For example, to train a model on the CPP dataset:

$ bash cpp_dataset/download.sh
$ python scripts/train_g2p_bert.py --config configs/config_cpp.py

Testing

$ python scripts/test_g2p_bert.py \
    --config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
    --checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
    --sent_path cpp_dataset/test.sent \
    --output_path output_pred.txt

Prediction

$ python scripts/predict_g2p_bert.py \
    --config saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/config.py \
    --checkpoint saved_models/CPP_BERT_M_DescWS-Sec-cLin-B_POSw01/best_accuracy.pth \
    --sent_path cpp_dataset/test.sent \
    --lb_path cpp_dataset/test.lb

Checkpoints

Citation

To cite the code, data, or paper, please use this BibTeX entry:

@article{chen2022g2pw,
    author={Yi-Chang Chen and Yu-Chuan Chang and Yen-Cheng Chang and Yi-Ren Yeh},
    title = {g2pW: A Conditional Weighted Softmax BERT for Polyphone Disambiguation in Mandarin},
    journal={Proc. Interspeech 2022},
    url = {https://arxiv.org/abs/2203.10430},
    year = {2022}
}
