paddlespeech.s2t.utils.error_rate module

This module provides functions to calculate error rates at different levels, e.g. WER at the word level and CER at the character level.

class paddlespeech.s2t.utils.error_rate.ErrorCalculator(char_list, sym_space, sym_blank, report_cer=False, report_wer=False)[source]

Bases: object

Calculate CER and WER for E2E_ASR and CTC models during training.

Parameters:

y_hats -- numpy array with predicted text
y_pads -- numpy array with true (target) text
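char_list (List[str]) -- list of characters (tokens), indexed by ID
sym_space (str) -- the space symbol, e.g. <space>
sym_blank (str) -- the blank symbol, e.g. <blank>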


Methods

`__call__`(ys_hat, ys_pad[, is_ctc])	Calculate sentence-level WER/CER score.
`calculate_cer`(seqs_hat, seqs_true)	Calculate sentence-level CER score.
`calculate_cer_ctc`(ys_hat, ys_pad)	Calculate sentence-level CER score for CTC.
`calculate_wer`(seqs_hat, seqs_true)	Calculate sentence-level WER score.
`convert_to_char`(ys_hat, ys_pad)	Convert index to character.

calculate_cer(seqs_hat, seqs_true)[source]

Calculate sentence-level CER score.

Parameters:

seqs_hat (list) -- prediction
seqs_true (list) -- reference

Returns:

average sentence-level CER score

Return type:

float

calculate_cer_ctc(ys_hat, ys_pad)[source]

Calculate sentence-level CER score for CTC.

Parameters:

ys_hat (paddle.Tensor) -- prediction (batch, seqlen)
ys_pad (paddle.Tensor) -- reference (batch, seqlen)

Returns:

average sentence-level CER score

Return type:

float
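CTC outputs are commonly scored only after consecutive repeated frames are merged and blank symbols are dropped. The sketch below illustrates that general convention on a plain list of IDs; it is not taken from this module. The name `ctc_collapse_sketch`, the blank ID `0`, and the padding ID `-1` are all assumptions made for illustration.

```python
from itertools import groupby


def ctc_collapse_sketch(frame_ids, blank_id=0, pad_id=-1):
    """Collapse a frame-level ID sequence into a label sequence.

    Hypothetical helper: blank_id and pad_id are assumed values,
    not documented defaults of this module.
    """
    # Merge runs of identical consecutive IDs into a single ID.
    merged = [key for key, _ in groupby(frame_ids)]
    # Drop blanks and padding after merging, so that a blank between
    # two identical labels keeps them as two separate labels.
    return [i for i in merged if i != blank_id and i != pad_id]


labels = ctc_collapse_sketch([0, 2, 2, 0, 2, 3, 3, 0, -1])
```

Here the blank between the two runs of `2` is what keeps them from being merged into one label.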

calculate_wer(seqs_hat, seqs_true)[source]

Calculate sentence-level WER score.

Parameters:

seqs_hat (list) -- prediction
seqs_true (list) -- reference

Returns:

average sentence-level WER score

Return type:

float

convert_to_char(ys_hat, ys_pad)[source]

Convert index to character.

Parameters:

ys_hat (paddle.Tensor) -- prediction (batch, seqlen)
ys_pad (paddle.Tensor) -- reference (batch, seqlen)

Returns:

token list of prediction, token list of reference

Return type:

(list, list)
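The index-to-character step can be pictured as a table lookup followed by substitution of the special symbols. The following is a minimal sketch on a plain Python list rather than a `paddle.Tensor`; the helper name `ids_to_text_sketch` and the padding ID `-1` are assumptions, and the method itself may handle padding and end-of-sequence differently.

```python
def ids_to_text_sketch(ids, char_list, sym_space="<space>",
                       sym_blank="<blank>", pad_id=-1):
    """Map token IDs to a text string.

    Hypothetical helper: pad_id=-1 is an assumed padding value.
    """
    # Look up each non-padding ID in the character list.
    tokens = [char_list[i] for i in ids if i != pad_id]
    text = "".join(tokens)
    # Turn the space symbol into a real space and drop blank symbols.
    return text.replace(sym_space, " ").replace(sym_blank, "")


char_list = ["<blank>", "<space>", "h", "i"]
text = ids_to_text_sketch([2, 3, 1, 2, 3, -1, -1], char_list)
```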

paddlespeech.s2t.utils.error_rate.cer(reference, hypothesis, ignore_case=False, remove_space=False)[source]

Calculate character error rate (CER). CER compares reference text and hypothesis text at the character level. CER is defined as:

\[CER = (Sc + Dc + Ic) / Nc\]

where

Sc is the number of characters substituted,
Dc is the number of characters deleted,
Ic is the number of characters inserted,
Nc is the number of characters in the reference.

The Levenshtein distance can be used to calculate CER. Chinese input should be encoded as Unicode. Note that leading and trailing space characters are stripped, and multiple consecutive space characters within a sentence are replaced by a single space character.

Parameters:

reference (str) -- The reference sentence.
hypothesis (str) -- The hypothesis sentence.
ignore_case (bool) -- Whether to ignore case when comparing (case-insensitive if True).
remove_space (bool) -- Whether to remove internal space characters.

Returns:

Character error rate.

Return type:

float

Raises:

ValueError -- If the reference length is zero.
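The CER definition and the space handling described for `cer` can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the module's implementation: `levenshtein` and `cer_sketch` are hypothetical names, and splitting on any whitespace (rather than only on the space character) is a simplification.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_sketch(reference, hypothesis, ignore_case=False, remove_space=False):
    """CER = (Sc + Dc + Ic) / Nc, computed via the edit distance."""
    if ignore_case:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    # split() with no arguments strips leading/trailing whitespace and
    # collapses runs; rejoin with one space, or with nothing at all.
    joiner = "" if remove_space else " "
    reference = joiner.join(reference.split())
    hypothesis = joiner.join(hypothesis.split())
    if len(reference) == 0:
        raise ValueError("Reference length must be greater than zero.")
    return levenshtein(reference, hypothesis) / len(reference)


rate = cer_sketch("hello world", "hallo  world ")
```

After normalization the two strings differ by one substituted character out of the 11 in the reference, so `rate` is 1/11.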

paddlespeech.s2t.utils.error_rate.char_errors(reference, hypothesis, ignore_case=False, remove_space=False)[source]

Compute the Levenshtein distance between the reference sequence and the hypothesis sequence at the character level.

Parameters:

reference (str) -- The reference sentence.
hypothesis (str) -- The hypothesis sentence.
ignore_case (bool) -- Whether to ignore case when comparing (case-insensitive if True).
remove_space (bool) -- Whether to remove internal space characters.

Returns:

Levenshtein distance and length of reference sentence.

Return type:

list

paddlespeech.s2t.utils.error_rate.wer(reference, hypothesis, ignore_case=False, delimiter=' ')[source]

Calculate word error rate (WER). WER compares reference text and hypothesis text at the word level. WER is defined as:

\[WER = (Sw + Dw + Iw) / Nw\]

where

Sw is the number of words substituted,
Dw is the number of words deleted,
Iw is the number of words inserted,
Nw is the number of words in the reference.

The Levenshtein distance can be used to calculate WER. Note that empty items are removed when sentences are split by the delimiter.

Parameters:

reference (str) -- The reference sentence.
hypothesis (str) -- The hypothesis sentence.
ignore_case (bool) -- Whether to ignore case when comparing (case-insensitive if True).
delimiter (str) -- Delimiter used to split the input sentences into words.

Returns:

Word error rate.

Return type:

float

Raises:

ValueError -- If word number of reference is zero.
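The WER definition above can be illustrated with a short pure-Python sketch that splits on the delimiter, discards empty items, and divides the word-level edit distance by the reference word count. The names `levenshtein` and `wer_sketch` are hypothetical; this is not the module's own code.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_sketch(reference, hypothesis, ignore_case=False, delimiter=" "):
    """WER = (Sw + Dw + Iw) / Nw, computed via the edit distance."""
    if ignore_case:
        reference, hypothesis = reference.lower(), hypothesis.lower()
    # Splitting on the delimiter can yield empty strings (for example
    # from doubled spaces); those are dropped before comparison.
    ref_words = [w for w in reference.split(delimiter) if w]
    hyp_words = [w for w in hypothesis.split(delimiter) if w]
    if not ref_words:
        raise ValueError("Reference must contain at least one word.")
    return levenshtein(ref_words, hyp_words) / len(ref_words)


rate = wer_sketch("the cat sat", "the cat  sit down")
```

In this example one word is substituted and one is inserted against a three-word reference, so `rate` is 2/3. Because insertions count as errors, WER can exceed 1.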

paddlespeech.s2t.utils.error_rate.word_errors(reference, hypothesis, ignore_case=False, delimiter=' ')[source]

Compute the Levenshtein distance between the reference sequence and the hypothesis sequence at the word level.

Parameters:

reference (str) -- The reference sentence.
hypothesis (str) -- The hypothesis sentence.
ignore_case (bool) -- Whether to ignore case when comparing (case-insensitive if True).
delimiter (str) -- Delimiter used to split the input sentences into words.

Returns:

Levenshtein distance and word number of reference sentence.

Return type:

list
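Returning the raw distance together with the reference length makes it possible to accumulate both over many sentences and compute a corpus-level rate, which weights long sentences more heavily than a mean of per-sentence rates would. The sketch below shows that pattern with hypothetical stand-ins (`word_errors_sketch`, `corpus_wer_sketch`); it does not call the module and only assumes the (distance, length) return shape described above.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_errors_sketch(reference, hypothesis, delimiter=" "):
    """Hypothetical stand-in returning (edit distance, reference word count)."""
    ref_words = [w for w in reference.split(delimiter) if w]
    hyp_words = [w for w in hypothesis.split(delimiter) if w]
    return levenshtein(ref_words, hyp_words), len(ref_words)


def corpus_wer_sketch(pairs):
    """Sum errors and reference lengths over all pairs, then divide once."""
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        errors, num_words = word_errors_sketch(reference, hypothesis)
        total_errors += errors
        total_words += num_words
    if total_words == 0:
        raise ValueError("References must contain at least one word in total.")
    return total_errors / total_words


pairs = [("a b c d", "a b c d"), ("x", "y")]
corpus_rate = corpus_wer_sketch(pairs)
```

For these two pairs there is 1 error over 5 reference words, so the corpus-level rate is 0.2, whereas the mean of the two per-sentence rates (0 and 1) would be 0.5.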