Giter Club home page Giter Club logo

phonecodes's Introduction

phonecodes

This library provides tools for converting between the International Phonetic Alphabet (IPA) and other phonetic alphabets used to transcribe speech, including Callhome, X-SAMPA, ARPABET, DISC/CELEX and Buckeye Corpus Phonetic Alphabet. Additionally, tools for searching mappings between phonetic symbols and reading/writing pronounciation lexicon files in several standard formats are also provided.

These functionalities are useful for processing data for automatic speech recognition, text to speech, and linguistic analyses of speech.

Setup and Installation

Install the library by running pip install phonecodes with python 3.10 or greater. It probably works with earlier versions of python, but this was not tested.

Developers may refer to the CONTRIBUTIONS.md for information on the development environment for testing, linting and contributing to the code.

Basic Usage

Converting between Phonetic Alphabets

If you want to convert to or from IPA to some other phonetic code, use phonecodes.phonecodes as follows:

>>> from phonecodes import phonecodes
>>> print(phonecodes.CODES) # available phonetic alphabets
{'buckeye', 'disc', 'callhome', 'xsampa', 'arpabet', 'ipa'}
>>> phonecodes.convert("DH IH S IH Z AH0 T EH1 S T", "arpabet", "ipa", "eng") # convert from IPA to ARPABET with language explicitly specified
'ð ɪ s ɪ z ə t ˈɛ s t'
>>> phonecodes.convert("ð ɪ s ɪ z ə t ˈɛ s t", "ipa", "arpabet") # convert from IPA to ARPABET with optional language left out
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.ipa2arpabet("ð ɪ s ɪ z ə t ˈɛ s t", "eng") # equivalent to previous with explicit language
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.ipa2arpabet("ð ɪ s ɪ z ə t ˈɛ s t") # equivalent to previous with optional language left out
'DH IH S IH Z AH0 T EH1 S T'
>>> phonecodes.convert("DH IH S IH Z AH0 T EH1 S T", "arpabet", "ipa") # convert from ARPABET to IPA, optional language left out
'ð ɪ s ɪ z ə t ˈɛ s t'
>>> phonecodes.arpabet2ipa("DH IH S IH Z AH0 T EH1 S T", "eng") # equivalent to previous with optional language explicit
'ð ɪ s ɪ z ə t ˈɛ s t'

For 'arpabet', 'buckeye' and 'xsampa', specifying a language is optional and ignored by the code, since X-SAMPA is language agnostic and ARAPABET and Buckeye were designed to work only for English. Note that for 'callhome' and 'disc' you should also specify a language code from the following lists:

  • DISC/CELEX: Dutch 'nld', English 'eng', German 'deu'. Uses German if unspecified.
  • Callhome: Spanish 'spa', Egyptian Arabic 'arz', Mandarin Chinese 'cmn'. You MUST specify an appropriate language code or you'll get a KeyError.

Reading Corpus Files

If you are working with specific corpora, you can also convert between certain corpus formats as follows:

>>> from phonecodes import pronlex
>>> my_lex = pronlex.read("test/fixtures/isle_eng_sample.txt", "isle", "eng") # Read in an English ISLE corpus file
>>> my_lex.w2p # see orthographic to phonetic word mapping
{'a': ['#', 'ə', '#'], 'is': ['#', 'ɪ', 'z', '#'], 'test': ['#', 't', 'ˈɛ', 's', 't', '#'], 'this': ['#', 'ð', 'ɪ', 's', '#']}
new_lex = my_lex.recode('arpabet') # Convert mapping to ARPABET
>>> new_lex.w2p
{'a': ['#', 'AH0', '#'], 'is': ['#', 'IH', 'Z', '#'], 'test': ['#', 'T', 'EH1', 'S', 'T', '#'], 'this': ['#', 'DH', 'IH', 'S', '#']}

The supported corpus formats and their corresponding phonetic alphabets are as follows:

Corpus Format Phonetic Alphabet Language Options
'babel' 'xsampa' 'amh', 'asm', 'ben', 'yue', 'ceb', 'luo', 'kat', 'gug', 'hat', 'ibo', 'jav', 'kur', 'lao', 'lit', 'mon', 'pus', 'swa', 'tgl', 'tam', 'tpi', 'tur', 'vie', 'zul'
'callhome' 'callhome' 'arz', 'cmn', 'spa'
'celex' 'disc' 'eng', 'ndl', 'deu'
'isle' 'ipa' Not required

phonecodes's People

Contributors

jhasegaw avatar ginic avatar

Stargazers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.