paddlespeech.s2t.frontend.speech module

Contains the speech segment class.

class paddlespeech.s2t.frontend.speech.SpeechSegment(samples, sample_rate, transcript, tokens=None, token_ids=None)[source]

Bases: AudioSegment

Speech Segment with Text

Args:

samples (numpy.ndarray): Audio samples [num_samples x num_channels].
sample_rate (int): Audio sample rate.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): Text tokens. Defaults to None.
token_ids (List[int], optional): Text token ids. Defaults to None.

Attributes:
duration

Return audio duration.

has_token

Return whether the speech segment has text tokens.

num_samples

Return number of samples.

rms_db

Return root mean square energy of the audio in decibels.

sample_rate

Return audio sample rate.

samples

Return audio samples.

token_ids

Return the transcript text token ids.

tokens

Return the transcript text tokens.

transcript

Return the transcript text.

Methods

add_noise(noise, snr_dB[, ...])

Add the given noise segment at a specific signal-to-noise ratio.

change_speed(speed_rate)

Change the audio speed by linear interpolation.

concatenate(*segments)

Concatenate an arbitrary number of speech segments together; both audio and transcripts are concatenated.

convolve(impulse_segment[, allow_resample])

Convolve this audio segment with the given impulse segment.

convolve_and_normalize(impulse_segment[, ...])

Convolve and normalize the resulting audio segment so that it has the same average power as the input signal.

from_bytes(bytes, transcript[, tokens, ...])

Create speech segment from a byte string and corresponding transcript.

from_file(filepath, transcript[, tokens, ...])

Create speech segment from audio file and corresponding transcript.

from_pcm(samples, sample_rate, transcript[, ...])

Create speech segment from pcm samples in online mode.

from_sequence_file(filepath)

Create audio segment from sequence file.

gain_db(gain)

Apply gain in decibels to samples.

make_silence(duration, sample_rate)

Create a silent speech segment of the given duration and sample rate; the transcript will be an empty string.

normalize([target_db, max_gain_db])

Normalize audio to be of the desired RMS value in decibels.

normalize_online_bayesian(target_db, ...[, ...])

Normalize audio using a production-compatible online/causal algorithm.

pad_silence(duration[, sides])

Pad this audio sample with a period of silence.

random_subsegment(subsegment_length[, rng])

Randomly cut a subsegment of the specified length from the audio segment.

resample(target_sample_rate[, filter])

Resample the audio to a target sample rate.

shift(shift_ms)

Shift the audio in time.

slice_from_file(filepath, transcript[, ...])

Load a small section of a speech file without loading the entire file into memory, which can be incredibly wasteful.

subsegment([start_sec, end_sec])

Cut the AudioSegment between given boundaries.

superimpose(other)

Add samples from another segment to those of this segment (sample-wise addition, not segment concatenation).

to([dtype])

Return the audio content converted to the given dtype.

to_bytes([dtype])

Create a byte string containing the audio content.

to_wav_file(filepath[, dtype])

Save audio segment to disk as wav file.
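
A minimal construction sketch (the 16 kHz sample rate, zero-valued waveform, and transcript are illustrative assumptions, not values prescribed by the library):

    import numpy as np
    from paddlespeech.s2t.frontend.speech import SpeechSegment

    # One second of (silent) float32 audio at an assumed 16 kHz sample rate.
    samples = np.zeros(16000, dtype=np.float32)
    seg = SpeechSegment(samples, 16000, "hello world")
    print(seg.duration)      # 1.0 (seconds)
    print(seg.num_samples)   # 16000
    print(seg.transcript)    # "hello world"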

classmethod concatenate(*segments)[source]

Concatenate an arbitrary number of speech segments together; both audio and transcripts are concatenated.

Parameters:

*segments --

Input speech segments to be concatenated.

Returns:

Speech segment instance.

Return type:

SpeechSegment

Raises:
  • ValueError -- If the number of segments is zero, or if the sample_rate of any two segments does not match.

  • TypeError -- If any segment is not a SpeechSegment instance.
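
A sketch with dummy data; both segments share the same assumed 16 kHz sample rate, as required:

    import numpy as np
    from paddlespeech.s2t.frontend.speech import SpeechSegment

    a = SpeechSegment(np.zeros(8000, dtype=np.float32), 16000, "hello ")
    b = SpeechSegment(np.zeros(8000, dtype=np.float32), 16000, "world")
    joined = SpeechSegment.concatenate(a, b)
    print(joined.duration)    # 1.0 second of audio in total
    print(joined.transcript)  # "hello world" -- transcripts are concatenated too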

classmethod from_bytes(bytes, transcript, tokens=None, token_ids=None)[source]

Create speech segment from a byte string and corresponding transcript.

Args:

bytes (bytes): Byte string containing audio samples.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): Text tokens. Defaults to None.
token_ids (List[int], optional): Text token ids. Defaults to None.

Returns:

SpeechSegment: Speech segment instance.
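
A sketch; the path "audio.wav" is hypothetical, and the byte string is assumed to hold a complete audio file in a format the audio backend can decode:

    from paddlespeech.s2t.frontend.speech import SpeechSegment

    with open("audio.wav", "rb") as f:   # hypothetical file
        wav_bytes = f.read()
    seg = SpeechSegment.from_bytes(wav_bytes, "hello world")
    print(seg.sample_rate, seg.duration, seg.transcript)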

classmethod from_file(filepath, transcript, tokens=None, token_ids=None, infos=None)[source]

Create speech segment from audio file and corresponding transcript.

Args:

filepath (str|file): Filepath or file object to audio file.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): Text tokens. Defaults to None.
token_ids (List[int], optional): Text token ids. Defaults to None.
infos (TarLocalData, optional): tar2obj and tar2infos. Defaults to None.

Returns:

SpeechSegment: Speech segment instance.
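
A sketch; "audio.wav", the transcript, tokens, and token ids are hypothetical values, not library defaults:

    from paddlespeech.s2t.frontend.speech import SpeechSegment

    seg = SpeechSegment.from_file("audio.wav", "hello world",
                                  tokens=["hello", "world"],
                                  token_ids=[12, 34])  # hypothetical ids
    print(seg.duration, seg.tokens)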

classmethod from_pcm(samples, sample_rate, transcript, tokens=None, token_ids=None)[source]

Create speech segment from pcm samples in online mode.

Args:

samples (numpy.ndarray): Audio samples [num_samples x num_channels].
sample_rate (int): Audio sample rate.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): Text tokens. Defaults to None.
token_ids (List[int], optional): Text token ids. Defaults to None.

Returns:

SpeechSegment: Speech segment instance.
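
A sketch of a streaming use; the 100 ms chunk size, 16 kHz rate, and partial transcript are assumptions:

    import numpy as np
    from paddlespeech.s2t.frontend.speech import SpeechSegment

    # A 100 ms chunk of streaming PCM, assumed float32 mono at 16 kHz.
    chunk = np.zeros(1600, dtype=np.float32)
    seg = SpeechSegment.from_pcm(chunk, 16000, "partial hypothesis")
    print(seg.duration)   # 0.1 seconds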

property has_token

Return whether the speech segment has text tokens.
classmethod make_silence(duration, sample_rate)[source]

Create a silent speech segment of the given duration and sample rate; the transcript will be an empty string.

Args:

duration (float): Length of silence in seconds.
sample_rate (float): Sample rate.

Returns:

SpeechSegment: Silence of the given duration.
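
A sketch; the 0.5 s duration and 16 kHz rate are arbitrary choices:

    from paddlespeech.s2t.frontend.speech import SpeechSegment

    pad = SpeechSegment.make_silence(0.5, 16000)
    print(pad.duration)          # 0.5 seconds of silence
    print(repr(pad.transcript))  # '' -- empty transcript, as documented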

classmethod slice_from_file(filepath, transcript, tokens=None, token_ids=None, start=None, end=None)[source]

Load a small section of a speech file without loading the entire file into memory, which can be incredibly wasteful.

Parameters:
  • filepath (str|file) -- Filepath or file object to audio file.

  • start (float) -- Start time in seconds. If start is negative, it wraps around from the end. If not provided, this function reads from the very beginning.

  • end (float) -- End time in seconds. If end is negative, it wraps around from the end. If not provided, the default behavior is to read to the end of the file.

  • transcript -- Transcript text for the speech. If not provided, the default is an empty string.

Returns:

SpeechSegment instance of the specified slice of the input speech file.

Return type:

SpeechSegment
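
A sketch; "long_recording.wav" and its transcript are hypothetical, and the slice boundaries are arbitrary:

    from paddlespeech.s2t.frontend.speech import SpeechSegment

    # Read only seconds 1.0-3.0 of the file instead of loading it all.
    clip = SpeechSegment.slice_from_file("long_recording.wav", "hello world",
                                         start=1.0, end=3.0)
    print(clip.duration)   # about 2.0 seconds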

property token_ids

Return the transcript text token ids.

Returns:

List[int]: text token ids.

property tokens

Return the transcript text tokens.

Returns:

List[str]: text tokens.

property transcript

Return the transcript text.

Returns:

str: Transcript text for the speech.
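
A sketch tying the text-side properties together; the token strings and ids are illustrative:

    import numpy as np
    from paddlespeech.s2t.frontend.speech import SpeechSegment

    seg = SpeechSegment(np.zeros(16000, dtype=np.float32), 16000, "hello world",
                        tokens=["hello", "world"], token_ids=[12, 34])
    print(seg.transcript)  # "hello world"
    print(seg.tokens)      # ["hello", "world"]
    print(seg.token_ids)   # [12, 34]
    print(seg.has_token)   # True here, since tokens were supplied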