paddlespeech.t2s.frontend.generate_lexicon module

Generate the lexicon and symbols for Mandarin Chinese phonology. The lexicon is used by the Montreal Forced Aligner. Note that syllables, not words, serve as the entries in this lexicon, because the transcriptions produced by reorganize_baker.py are written in syllables rather than words. This choice makes it easier to use existing Chinese text-to-pinyin tools such as pypinyin, and it follows the usual convention for G2P in Chinese.
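To illustrate what a syllable-level lexicon looks like, the sketch below renders entries in the plain-text layout commonly used for Montreal Forced Aligner pronunciation dictionaries: one entry per line, the word first, then its phonemes separated by spaces. The helper name `format_lexicon` and the sample entries are hypothetical and are not taken from this module.

```python
def format_lexicon(entries):
    """Render {syllable: [phonemes]} as one 'syllable phoneme ...' line per entry."""
    lines = []
    for syllable, phonemes in entries.items():
        pronunciation = " ".join(phonemes)
        lines.append(f"{syllable} {pronunciation}")
    return "\n".join(lines)


# Hypothetical sample: each "word" is a pinyin syllable, not a Chinese word.
sample = {
    "ni3": ["n", "i3"],
    "hao3": ["h", "ao3"],
}
lexicon_text = format_lexicon(sample)
print(lexicon_text)
```

Because every entry is a single syllable, the output of a text-to-pinyin tool can be looked up in such a lexicon directly, without a word segmentation step.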

paddlespeech.t2s.frontend.generate_lexicon.generate_lexicon(with_tone=False, with_erhua=False)

Generate lexicon for Mandarin Chinese.
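The two flags can be pictured as widening the space of candidate syllables. The following is a minimal sketch under stated assumptions, not the module's implementation: the inventories are tiny and hypothetical, the name `toy_candidates` is invented for illustration, and tones are assumed to be written as the digits 1 to 5 with 5 as the neutral tone.

```python
from itertools import product

# Tiny, hypothetical inventories for illustration only.
TOY_INITIALS = ["", "b", "m"]  # "" stands for the zero initial
TOY_FINALS = ["a", "ai"]


def toy_candidates(with_tone=False, with_erhua=False):
    """Enumerate (initial, final, erhua, tone) candidates before any filtering."""
    erhua_marks = ["", "r"] if with_erhua else [""]
    # Assumption: tones are the digits 1-5, with 5 as the neutral tone.
    tones = ["1", "2", "3", "4", "5"] if with_tone else [""]
    return list(product(TOY_INITIALS, TOY_FINALS, erhua_marks, tones))


plain = toy_candidates()
full = toy_candidates(with_tone=True, with_erhua=True)
print(len(plain), len(full))
```

With these toy inventories, enabling both flags multiplies the number of candidates by ten (two erhua states times five tones).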

paddlespeech.t2s.frontend.generate_lexicon.rule(C, V, R, T)

Generate a syllable given the initial, the final, the erhua indicator, and the tone. The orthographic rules of pinyin are applied, including the special cases for y, w, ui, un, and iu.
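The special cases for y, w, ui, un, and iu follow standard pinyin spelling. The sketch below shows them for toneless syllables. It assumes finals are given in their full forms ('uei', 'uen', 'iou'); the function name `toy_spell` is hypothetical, and the sketch covers only plain i- and u-finals, so it is not a substitute for the module's own rule().

```python
def toy_spell(initial, final):
    """Spell a toneless syllable from an initial and a full-form final."""
    if initial == "":
        # Zero initial: i-finals are written with 'y', u-finals with 'w'.
        if final in ("i", "in", "ing"):
            return "y" + final          # i -> yi, in -> yin, ing -> ying
        if final.startswith("i"):
            return "y" + final[1:]      # ia -> ya, iou -> you
        if final == "u":
            return "wu"
        if final.startswith("u"):
            return "w" + final[1:]      # uei -> wei, uen -> wen
        return final
    # After an initial, three finals are contracted in writing.
    contracted = {"uei": "ui", "uen": "un", "iou": "iu"}
    return initial + contracted.get(final, final)


for c, v in [("", "iou"), ("l", "iou"), ("", "uei"), ("d", "uei")]:
    print(repr(c), v, "->", toy_spell(c, v))
```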

Note that in this system, 'ü' is always written as 'v' in phonemes, but it is converted to 'u' in syllables when certain conditions are satisfied.
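The conditions are not spelled out here. In standard pinyin, 'ü' is written as 'u' after j, q, and x, and as 'yu' when there is no initial, while it stays distinct after n and l. The sketch below assumes that convention; `toy_spell_v` is a hypothetical name, and the module's rule() may check more or different cases.

```python
def toy_spell_v(initial, final):
    """Spell a syllable whose final starts with 'v' (standing in for 'ü')."""
    if not final.startswith("v"):
        raise ValueError("this sketch only covers finals beginning with 'v'")
    if initial in ("j", "q", "x"):
        return initial + "u" + final[1:]  # j + van -> juan
    if initial == "":
        return "yu" + final[1:]           # ve -> yue, vn -> yun
    return initial + final                # n + v -> nv, l + ve -> lve


for c, v in [("j", "v"), ("x", "van"), ("", "ve"), ("n", "v")]:
    print(repr(c), v, "->", toy_spell_v(c, v))
```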

'i' is distinguished in phonemes and split into three categories: 'i', 'ii', and 'iii'. Erhua can be applied to every final except those that already end with 'r'. When a syllable is impossible, or no character has that pronunciation, the function returns None so that the syllable is filtered out.
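A rough sketch of this filtering is given below. It rests on assumptions that the text above does not state: that 'ii' is the vowel after z, c, s, that 'iii' is the vowel after zh, ch, sh, r, and that plain 'i' is used everywhere else. The name `toy_filter` and its return shape are hypothetical, and the real function also rejects syllables that no character is pronounced as, which this sketch does not attempt.

```python
DENTAL = ("z", "c", "s")
RETROFLEX = ("zh", "ch", "sh", "r")


def toy_filter(initial, final, erhua):
    """Return (initial, final) for a plausible syllable, or None to drop it."""
    # Assumed pairing of the three 'i' categories with initials.
    if final == "ii" and initial not in DENTAL:
        return None
    if final == "iii" and initial not in RETROFLEX:
        return None
    if final == "i" and initial in DENTAL + RETROFLEX:
        return None
    # A final that already ends with 'r' cannot take erhua again.
    if erhua and final.endswith("r"):
        return None
    return (initial, final + ("r" if erhua else ""))


candidates = [("z", "ii", False), ("b", "ii", False), ("", "er", True)]
kept = [s for s in (toy_filter(*c) for c in candidates) if s is not None]
print(kept)
```

Returning None for rejected combinations lets the caller enumerate every initial and final pair and keep only the results that are not None.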