# Released Models

> **Note:** PaddlePaddle supports 0-D tensors starting from version 2.5.0, which breaks previously exported PaddleSpeech static models. Please re-export any static models you rely on.
## Speech-to-Text Models

### Speech Recognition Model
| Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link | Inference Type | static_model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | WenetSpeech Dataset | Char-based | 1.2 GB | 2 Conv + 5 LSTM layers | 0.152 (test_net, w/o LM) | - | 10000 h | - | onnx/inference/python | - |
| | Aishell Dataset | Char-based | 491 MB | 2 Conv + 5 LSTM layers | 0.0666 | - | 151 h | | onnx/inference/python | - |
| | Aishell Dataset | Char-based | 1.4 GB | 2 Conv + 5 bidirectional LSTM layers | 0.0554 | - | 151 h | | inference/python | - |
| | WenetSpeech Dataset | Char-based | 457 MB | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention rescoring | 0.11 (test_net), 0.1879 (test_meeting) | - | 10000 h | - | python | - |
| | WenetSpeech Dataset | Char-based | 540 MB | Encoder: Conformer, Decoder: BiTransformer, Decoding method: Attention rescoring | 0.047198 (aishell test_-1), 0.059212 (aishell test_16) | - | 10000 h | - | python | |
| | Aishell Dataset | Char-based | 189 MB | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention rescoring | 0.051968 | - | 151 h | | python | - |
| | Aishell Dataset | Char-based | 189 MB | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention rescoring | 0.0460 | - | 151 h | | python | - |
| | Aishell Dataset | Char-based | 128 MB | Encoder: Transformer, Decoder: Transformer, Decoding method: Attention rescoring | 0.0523 | - | 151 h | | python | - |
| | Librispeech Dataset | Char-based | 1.3 GB | 2 Conv + 5 bidirectional LSTM layers | - | 0.0467 | 960 h | | inference/python | - |
| | Librispeech Dataset | Subword-based | 191 MB | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention rescoring | - | 0.0338 | 960 h | | python | - |
| | Librispeech Dataset | Subword-based | 131 MB | Encoder: Transformer, Decoder: Transformer, Decoding method: Attention rescoring | - | 0.0381 | 960 h | | python | - |
| | Librispeech Dataset | Subword-based | 131 MB | Encoder: Transformer, Decoder: Transformer, Decoding method: JoinCTC w/ LM | - | 0.0240 | 960 h | | python | - |
| | TALCS Dataset | Subword-based | 470 MB | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention rescoring | - | 0.0844 | 587 h | | python | - |
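Each row with a `python` inference type can be exercised through the PaddleSpeech Python API. A minimal sketch, assuming a 16 kHz mono WAV named `zh.wav` (a hypothetical file) and the Mandarin Conformer/WenetSpeech model that the CLI uses by default:

```python
from paddlespeech.cli.asr.infer import ASRExecutor

asr = ASRExecutor()

# The first call downloads the pretrained model to the local cache.
# `model`, `lang`, and `sample_rate` select a row from the table above.
result = asr(
    audio_file="zh.wav",            # hypothetical 16 kHz mono recording
    model="conformer_wenetspeech",
    lang="zh",
    sample_rate=16000)
print(result)
```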
### Self-Supervised Pre-trained Model

| Model | Pre-Train Method | Pre-Train Data | Finetune Data | Size | Descriptions | CER | WER | Example Link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | wav2vec2 | Librispeech and LV-60k Dataset (53,000 h) | - | 1.18 GB | Pre-trained Wav2vec2.0 Model | - | - | - |
| | wav2vec2 | Librispeech and LV-60k Dataset (53,000 h) | Librispeech (960 h) | 718 MB | Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | |
| | wav2vec2 | WenetSpeech Dataset (10,000 h) | - | 714 MB | Pre-trained Wav2vec2.0 Model | - | - | - |
| | wav2vec2 | WenetSpeech Dataset (10,000 h) | aishell1 (train set) | 1.18 GB | Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | 0.0510 | - | - |
| | hubert | LV-60k Dataset | - | 1.18 GB | Pre-trained HuBERT Model | - | - | - |
| | hubert | LV-60k Dataset | Librispeech train-clean-100 | 1.27 GB | Encoder: HuBERT, Decoder: Linear + CTC, Decoding method: Greedy search | - | 0.0587 | |
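Recent PaddleSpeech releases also expose the fine-tuned wav2vec2 models through an SSL executor. The sketch below is an assumption based on those releases; the module path (`paddlespeech.cli.ssl`) and the model name may differ in your version:

```python
from paddlespeech.cli.ssl.infer import SSLExecutor

ssl = SSLExecutor()

# Hypothetical usage: decode English speech with the wav2vec2-based
# Librispeech ASR model listed above. The model name is an assumption.
result = ssl(
    audio_file="en.wav",            # hypothetical 16 kHz English recording
    task="asr",
    model="wav2vec2ASR_librispeech",
    lang="en",
    sample_rate=16000)
print(result)
```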
### Whisper Model

| Demo Link | Training Data | Size | Descriptions | CER | Model |
| --- | --- | --- | --- | --- | --- |
| | 680,000 h of audio from the internet | large: 5.8 GB, medium: 2.9 GB, small: 923 MB, base: 277 MB, tiny: 145 MB | Encoder: Transformer, Decoder: Transformer, Decoding method: Greedy search | 0.027 (large, Librispeech) | whisper-large, whisper-medium, whisper-medium-English-only, whisper-small, whisper-small-English-only, whisper-base, whisper-base-English-only, whisper-tiny, whisper-tiny-English-only |
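PaddleSpeech also ships a Whisper demo executor. The parameter names below follow that demo and are assumptions that may differ across versions:

```python
from paddlespeech.cli.whisper import WhisperExecutor

whisper = WhisperExecutor()

# Hypothetical usage: transcribe Mandarin speech with the base-size model.
# `size` and `language` follow the demo's flags and may vary by version.
result = whisper(
    audio_file="zh.wav",            # hypothetical 16 kHz recording
    task="transcribe",
    size="base",
    language="zh")
print(result)
```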
### Language Model based on NGram

| Language Model | Training Data | Token-based | Size | Descriptions |
| --- | --- | --- | --- | --- |
| | | Word-based | 8.3 GB | Pruned with 0 1 1 1 1 |
| | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4 |
| | Baidu Internal Corpus | Char-based | 70.4 GB | No pruning |
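The released n-gram models are KenLM binaries, so they can be queried directly with the `kenlm` Python bindings (`pip install kenlm`). A minimal sketch; the file name is illustrative and should be replaced with the LM you downloaded:

```python
import kenlm

# Illustrative path: substitute the released binary, e.g. the pruned
# char-based Mandarin LM from the table above.
lm = kenlm.Model("zh_lm.klm")

# KenLM scores whitespace-separated tokens; the char-based Mandarin LMs
# expect the sentence split into individual characters.
sentence = " ".join("今天天气真好")
print(lm.score(sentence, bos=True, eos=True))  # log10 probability
```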
### Speech Translation Models

| Model | Training Data | Token-based | Size | Descriptions | BLEU | Example Link |
| --- | --- | --- | --- | --- | --- | --- |
| (only for CLI) Transformer FAT-ST MTL En-Zh | Ted-En-Zh | SPM | | Encoder: Transformer, Decoder: Transformer | 20.80 | |
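The FAT-ST model is exposed through the CLI's speech translation executor, which translates English speech to Chinese text. A minimal sketch, assuming a hypothetical English recording `en.wav`:

```python
from paddlespeech.cli.st.infer import STExecutor

st = STExecutor()

# The default model is the released Transformer FAT-ST MTL En-Zh model;
# the first call downloads it to the local cache.
result = st(audio_file="en.wav")    # hypothetical 16 kHz English recording
print(result)
```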
## Text-to-Speech Models

### Acoustic Models

### Vocoders
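A released TTS pipeline pairs an acoustic model with a vocoder. A minimal sketch with the Python API, assuming the FastSpeech2/HiFiGAN CSMSC pairing (the model names follow the CLI's naming conventions and are assumptions here):

```python
from paddlespeech.cli.tts.infer import TTSExecutor

tts = TTSExecutor()

# `am` selects an acoustic model and `voc` a vocoder; both are downloaded
# on first use. CSMSC models synthesize 24 kHz Mandarin speech.
tts(
    text="你好,欢迎使用飞桨深度学习框架!",
    am="fastspeech2_csmsc",
    voc="hifigan_csmsc",
    lang="zh",
    output="output.wav")
```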
### Voice Cloning

| Model Type | Dataset | Example Link | Pretrained Models |
| --- | --- | --- | --- |
| GE2E | AISHELL-3, etc. | | |
| GE2E + Tacotron2 | AISHELL-3 | | |
| GE2E + FastSpeech2 | AISHELL-3 | | |
## Audio Classification Models

| Model Type | Dataset | Example Link | Pretrained Models | Static Models |
| --- | --- | --- | --- | --- |
| PANN | AudioSet | | panns_cnn6.pdparams, panns_cnn10.pdparams, panns_cnn14.pdparams | panns_cnn6_static.tar.gz (18 MB), panns_cnn10_static.tar.gz (19 MB), panns_cnn14_static.tar.gz (289 MB) |
| PANN | ESC-50 | | | |
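The AudioSet-pretrained PANN taggers are available through the audio classification executor. A minimal sketch, assuming a hypothetical clip `dog.wav`:

```python
from paddlespeech.cli.cls.infer import CLSExecutor

cls = CLSExecutor()

# panns_cnn14 is the AudioSet tagging model from the table above;
# `topk` controls how many labels are returned.
result = cls(audio_file="dog.wav", model="panns_cnn14", topk=5)
print(result)
```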
## Speaker Verification Models

| Model Type | Dataset | Example Link | Pretrained Models | Static Models |
| --- | --- | --- | --- | --- |
| ECAPA-TDNN | VoxCeleb | | | - |
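The ECAPA-TDNN model extracts speaker embeddings that can be compared by cosine similarity. A minimal sketch; parameter names follow the PaddleSpeech demos, and the return type (a NumPy vector) is an assumption:

```python
import numpy as np
from paddlespeech.cli.vector import VectorExecutor

vec = VectorExecutor()

# Extract embeddings for two (hypothetical) utterances with the
# VoxCeleb-trained ECAPA-TDNN model, then score them by cosine similarity.
emb1 = vec(audio_file="spk1_utt1.wav", model="ecapatdnn_voxceleb12")
emb2 = vec(audio_file="spk1_utt2.wav", model="ecapatdnn_voxceleb12")
score = float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))
print(score)
```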
## Punctuation Restoration Models

| Model Type | Dataset | Example Link | Pretrained Models |
| --- | --- | --- | --- |
| Ernie Linear | IWSLT2012_zh | | |
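The Ernie Linear model restores punctuation in raw ASR output through the text executor; the defaults select the released punctuation model. A minimal sketch:

```python
from paddlespeech.cli.text.infer import TextExecutor

text_punc = TextExecutor()

# Restore punctuation in an unpunctuated Mandarin sentence
# ("The weather is really nice today. Are you free this afternoon? ...").
result = text_punc(text="今天的天气真不错啊你下午有空吗我想约你一起去吃饭")
print(result)
```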