Convert the original text into characters/phonemes through the text frontend module.
Convert characters/phonemes into acoustic features, such as linear spectrogram, mel spectrogram, LPC features, etc., through acoustic models.
Convert acoustic features into waveforms through vocoders.
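The three stages above can be sketched as a chain of functions. This is a toy illustration only: the function names, shapes, and placeholder outputs below are assumptions for demonstration, not PaddleSpeech's actual API.

```python
import numpy as np

# Hypothetical three-stage TTS pipeline; names and shapes are illustrative.
def text_frontend(text):
    # Stage 1: text -> characters/phonemes (toy version: split into characters).
    return list(text)

def acoustic_model(phonemes, n_mels=80, frames_per_phoneme=5):
    # Stage 2: phonemes -> acoustic features, e.g. a mel spectrogram
    # (zeros as a placeholder for real model output).
    n_frames = len(phonemes) * frames_per_phoneme
    return np.zeros((n_frames, n_mels))

def vocoder(mel, hop_length=256):
    # Stage 3: acoustic features -> waveform (silence as a placeholder).
    return np.zeros(mel.shape[0] * hop_length)

phonemes = text_frontend("hello")
mel = acoustic_model(phonemes)
wav = vocoder(mel)
print(wav.shape)
```

Each stage only consumes the previous stage's output, which is why the acoustic model and vocoder can be trained and swapped independently.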
When training Tacotron2, TransformerTTS, and WaveFlow, we use the English single-speaker TTS dataset LJSpeech by default. However, when training SpeedySpeech, FastSpeech2, and ParallelWaveGAN, we use the Chinese single-speaker dataset CSMSC by default.
In the future, PaddleSpeech TTS will mainly use Chinese TTS datasets for its default examples.
Here, we will display three types of audio samples:
With our FastSpeech2, we can control duration, pitch, and energy.
We provide audio demos of duration control here. Duration means the duration of each phoneme: when we reduce the durations, the audio speeds up, and when we increase them, the audio slows down.
The durations of different phonemes in a sentence can be scaled by different ratios (for example, when you want to slow down one word while keeping the speed of the other words in the sentence). Here we use a fixed scale ratio for all phonemes to control the speed of the audio.
Duration control in FastSpeech2 changes the speed of the audio while keeping the pitch unchanged. (In some speech tools, increasing the speed also raises the pitch, and vice versa.)
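The idea behind duration control can be sketched as scaling the predicted per-phoneme durations before they are expanded into frames. The duration values and the `scale_durations` helper below are made up for illustration; they are not FastSpeech2's actual code.

```python
import numpy as np

# Hypothetical FastSpeech2-style duration control sketch.
# Durations are predicted in frames per phoneme; scaling them by a fixed
# ratio changes the speaking rate without touching the pitch contour.
durations = np.array([4, 6, 2, 8])   # frames per phoneme (made-up values)

def scale_durations(durations, ratio):
    # ratio < 1.0 -> fewer frames -> faster speech
    # ratio > 1.0 -> more frames  -> slower speech
    scaled = np.round(durations * ratio).astype(int)
    return np.maximum(scaled, 1)     # every phoneme keeps at least one frame

fast = scale_durations(durations, 0.5)   # roughly 2x speed
slow = scale_durations(durations, 2.0)   # roughly half speed
print(fast, slow)
```

Because only the frame counts change, the per-frame pitch values are untouched, which is why this kind of speed control preserves the pitch.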
We provide the audio demos of pitch control here.
When we set the pitch of a sentence to its mean value and set the tones of all phones to 1, we get a robot-style timbre.
When we raise the pitch of an adult female voice (by a fixed scale ratio), we get a child-style timbre.
The pitch of different phonemes in a sentence can also have different scale ratios.
The normal audios are in the second column of the previous table.
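The pitch manipulations described above can be sketched as simple operations on the predicted pitch contour. The contour values below are made up for illustration and do not come from a real model.

```python
import numpy as np

# Hypothetical FastSpeech2-style pitch control sketch.
# pitch is a per-frame F0 contour in Hz (made-up adult-female values).
pitch = np.array([210.0, 230.0, 250.0, 190.0])

# Flattening the contour to its mean removes intonation -> robot-style timbre.
robot = np.full_like(pitch, pitch.mean())

# Raising the whole contour by a fixed scale ratio -> child-style timbre.
child = pitch * 1.5

print(robot, child)
```

Since durations are untouched, both variants keep the original speaking rate; only the pitch contour changes.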
We provide a complete Chinese text frontend module in PaddleSpeech TTS. Text normalization and G2P (grapheme-to-phoneme conversion) are the most important parts of a text frontend. We assume the input texts are already normalized, and mainly compare the G2P module here.
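To make the G2P step concrete, here is a minimal toy lookup. A real Chinese G2P module (like the one in PaddleSpeech's frontend) must additionally handle polyphonic characters, tone sandhi, and word segmentation; the dictionary below is a made-up two-entry example, not the real system.

```python
# Toy Chinese G2P: map each character to pinyin with a tone number.
# This tiny dictionary is illustrative only.
g2p_dict = {
    "你": "ni3",
    "好": "hao3",
}

def g2p(text):
    # Look up each character; pass unknown characters through unchanged.
    return [g2p_dict.get(ch, ch) for ch in text]

print(g2p("你好"))
```

Comparing G2P modules then amounts to comparing how accurately each one resolves exactly the cases this toy version ignores: polyphones and tone changes in context.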