TTS Datasets

Mandarin

  • CSMSC: Chinese Standard Mandarin Speech Corpus

    • Duration/h: 12

    • Number of Sentences: 10,000

    • Size: 2.14GB

    • Speaker: 1 female, aged 20-30

    • Sample Rate: 48 kHz, 16-bit

    • Mean Words per Clip: 16

  • AISHELL-3

    • Duration/h: 85

    • Number of Sentences: 88,035

    • Size: 17.75GB

    • Speaker: 218

    • Sample Rate: 44.1 kHz, 16-bit
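The sample rate and bit depth listed for each corpus can be checked against a downloaded copy before training. The following is a minimal sketch using Python's standard `wave` module, which reads PCM WAV files only. The helper names `describe_wav` and `matches_spec` and the file path in the usage line are hypothetical and are not part of any dataset's tooling.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bit_depth, channels, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        bit_depth = wav.getsampwidth() * 8  # sample width is reported in bytes
        channels = wav.getnchannels()
        duration_s = wav.getnframes() / sample_rate
    return sample_rate, bit_depth, channels, duration_s


def matches_spec(path, expected_rate_hz, expected_bits=16):
    """Check one clip against the sample rate and bit depth listed for its corpus."""
    sample_rate, bit_depth, _, _ = describe_wav(path)
    return sample_rate == expected_rate_hz and bit_depth == expected_bits
```

For example, `matches_spec("path/to/clip.wav", 48000)` would check a clip against the 48 kHz, 16-bit entry listed for CSMSC.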

English

  • LJSpeech

    • Duration/h: 24

    • Number of Sentences: 13,100

    • Size: 2.56GB

    • Speaker: 1, aged 20-30

    • Sample Rate: 22.05 kHz, 16-bit

    • Mean Words per Clip: 17.23

  • VCTK

    • Number of Sentences: 44,583

    • Size: 10.94GB

    • Speaker: 110

    • Sample Rate: 48 kHz, 16-bit

    • Mean Words per Clip: 17.23
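For the corpora that list both a duration and a sentence count, the two figures give a rough average clip length. This is often useful when choosing batch sizes or maximum sequence lengths. The sketch below only restates the numbers listed above. The listed durations are rounded to whole hours, so the results are approximate. VCTK is left out because no duration is listed for it. The names `DATASETS` and `mean_clip_seconds` are illustrative.

```python
# (duration in hours, number of sentences), copied from the entries above.
DATASETS = {
    "CSMSC": (12, 10_000),
    "AISHELL-3": (85, 88_035),
    "LJSpeech": (24, 13_100),
}


def mean_clip_seconds(hours, sentences):
    """Average clip length in seconds, from total hours and clip count."""
    return hours * 3600 / sentences


for name, (hours, sentences) in DATASETS.items():
    print(f"{name}: about {mean_clip_seconds(hours, sentences):.2f} s per clip")
```

All three averages fall between roughly 3 and 7 seconds per clip.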

Japanese

  • tri-jek: Japanese-English-Korean tri-lingual corpus

  • JSSS-misc: misc tasks of JSSS corpus

  • JTubeSpeech: Corpus of Japanese speech collected from YouTube

  • J-MAC: Japanese multi-speaker audiobook corpus

  • J-KAC: Japanese Kamishibai and audiobook corpus

  • JMD: Japanese multi-dialect corpus

  • JSSS: Japanese multi-style (summarization and simplification) corpus

  • RWCP-SSD-Onomatopoeia: onomatopoeic word dataset for environmental sounds

  • Life-m: landmark image-themed music corpus

  • PJS: Phoneme-balanced Japanese singing voice corpus

  • JVS-MuSiC: Japanese multi-speaker singing-voice corpus

  • JVS: Japanese multi-speaker voice corpus

  • JSUT-book: audiobook corpus by a single Japanese speaker

  • JSUT-vi: vocal imitation corpus by a single Japanese speaker

  • JSUT-song: singing voice corpus by a single Japanese singer

  • JSUT: a large-scale corpus of reading-style Japanese speech by a single speaker

Emotions

English

Mandarin

English and Mandarin

Music