paddlespeech.s2t.frontend.speech module

Contains the speech segment class.

class paddlespeech.s2t.frontend.speech.SpeechSegment(samples, sample_rate, transcript, tokens=None, token_ids=None)[source]

Bases: AudioSegment

Speech Segment with Text

Args:

samples (numpy.ndarray): Audio samples [num_samples x num_channels].
sample_rate (int): Audio sample rate.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): Text tokens. Defaults to None.
token_ids (List[int], optional): Text token ids. Defaults to None.

Attributes:
duration

Return audio duration.

has_token

Return whether the speech segment has text tokens.

num_samples

Return number of samples.

rms_db

Return root mean square energy of the audio in decibels.

sample_rate

Return audio sample rate.

samples

Return audio samples.

token_ids

Return the transcript text token ids.

tokens

Return the transcript text tokens.

transcript

Return the transcript text.

Methods

add_noise(noise, snr_dB[, ...])

Add the given noise segment at a specific signal-to-noise ratio.

change_speed(speed_rate)

Change the audio speed by linear interpolation.

concatenate(*segments)

Concatenate an arbitrary number of speech segments together; both audio and transcripts are concatenated.

convolve(impulse_segment[, allow_resample])

Convolve this audio segment with the given impulse segment.

convolve_and_normalize(impulse_segment[, ...])

Convolve and normalize the resulting audio segment so that it has the same average power as the input signal.

from_bytes(bytes, transcript[, tokens, ...])

Create speech segment from a byte string and corresponding transcript.

from_file(filepath, transcript[, tokens, ...])

Create speech segment from audio file and corresponding transcript.

from_pcm(samples, sample_rate, transcript[, ...])

Create speech segment from pcm samples in online mode.

from_sequence_file(filepath)

Create audio segment from sequence file.

gain_db(gain)

Apply gain in decibels to samples.

make_silence(duration, sample_rate)

Create a silent speech segment of the given duration and sample rate; the transcript will be an empty string.

normalize([target_db, max_gain_db])

Normalize audio to be of the desired RMS value in decibels.

normalize_online_bayesian(target_db, ...[, ...])

Normalize audio using a production-compatible online/causal algorithm.

pad_silence(duration[, sides])

Pad this audio sample with a period of silence.

random_subsegment(subsegment_length[, rng])

Randomly cut a subsegment of the specified length from the audio segment.

resample(target_sample_rate[, filter])

Resample the audio to a target sample rate.

shift(shift_ms)

Shift the audio in time.

slice_from_file(filepath, transcript[, ...])

Load a small section of a speech file without loading the entire file into memory, which can be incredibly wasteful.

subsegment([start_sec, end_sec])

Cut the AudioSegment between given boundaries.

superimpose(other)

Add samples from another segment to those of this segment (sample-wise addition, not segment concatenation).

to([dtype])

Return the audio content converted to the given dtype.

to_bytes([dtype])

Create a byte string containing the audio content.

to_wav_file(filepath[, dtype])

Save audio segment to disk as wav file.
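
A minimal construction sketch (the 16 kHz sample rate, zero-valued waveform, and transcript are illustrative assumptions, not values prescribed by the library):

    import numpy as np
    from paddlespeech.s2t.frontend.speech import SpeechSegment

    # One second of (silent) float32 audio at an assumed 16 kHz sample rate.
    samples = np.zeros(16000, dtype=np.float32)
    seg = SpeechSegment(samples, 16000, "hello world")
    print(seg.duration)      # 1.0 (seconds)
    print(seg.num_samples)   # 16000
    print(seg.transcript)    # "hello world"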

classmethod concatenate(*segments)[source]

Concatenate an arbitrary number of speech segments together; both audio and transcripts are concatenated.

Parameters:

*segments --

Input speech segments to be concatenated.

Returns:

Speech segment instance.

Return type:

SpeechSegment

Raises:
  • ValueError -- If the number of segments is zero, or if the sample_rate of any two segments does not match.

  • TypeError -- If any segment is not a SpeechSegment instance.
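
A sketch with dummy data; both segments share the same assumed 16 kHz sample rate, as required:

    import numpy as np
    from paddlespeech.s2t.frontend.speech import SpeechSegment

    a = SpeechSegment(np.zeros(8000, dtype=np.float32), 16000, "hello ")
    b = SpeechSegment(np.zeros(8000, dtype=np.float32), 16000, "world")
    joined = SpeechSegment.concatenate(a, b)
    print(joined.duration)    # 1.0 second of audio in total
    print(joined.transcript)  # "hello world" -- transcripts are concatenated too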

classmethod from_bytes(bytes, transcript, tokens=None, token_ids=None)[source]

Create speech segment from a byte string and corresponding transcript.

Args:

bytes (bytes): Byte string containing audio samples.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): Text tokens. Defaults to None.
token_ids (List[int], optional): Text token ids. Defaults to None.

Returns:

SpeechSegment: Speech segment instance.
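
A sketch; the path "audio.wav" is hypothetical, and the byte string is assumed to hold a complete audio file in a format the audio backend can decode:

    from paddlespeech.s2t.frontend.speech import SpeechSegment

    with open("audio.wav", "rb") as f:   # hypothetical file
        wav_bytes = f.read()
    seg = SpeechSegment.from_bytes(wav_bytes, "hello world")
    print(seg.sample_rate, seg.duration, seg.transcript)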

classmethod from_file(filepath, transcript, tokens=None, token_ids=None, infos=None)[source]

Create speech segment from audio file and corresponding transcript.

Args:

filepath (str|file): Filepath or file object to audio file.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): Text tokens. Defaults to None.
token_ids (List[int], optional): Text token ids. Defaults to None.
infos (TarLocalData, optional): tar2obj and tar2infos. Defaults to None.

Returns:

SpeechSegment: Speech segment instance.
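
A sketch; "audio.wav", the transcript, tokens, and token ids are hypothetical values, not library defaults:

    from paddlespeech.s2t.frontend.speech import SpeechSegment

    seg = SpeechSegment.from_file("audio.wav", "hello world",
                                  tokens=["hello", "world"],
                                  token_ids=[12, 34])  # hypothetical ids
    print(seg.duration, seg.tokens)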

classmethod from_pcm(samples, sample_rate, transcript, tokens=None, token_ids=None)[source]

Create speech segment from pcm samples in online mode.

Args:

samples (numpy.ndarray): Audio samples [num_samples x num_channels].
sample_rate (int): Audio sample rate.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): Text tokens. Defaults to None.
token_ids (List[int], optional): Text token ids. Defaults to None.

Returns:

SpeechSegment: Speech segment instance.
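
A sketch of a streaming use; the 100 ms chunk size, 16 kHz rate, and partial transcript are assumptions:

    import numpy as np
    from paddlespeech.s2t.frontend.speech import SpeechSegment

    # A 100 ms chunk of streaming PCM, assumed float32 mono at 16 kHz.
    chunk = np.zeros(1600, dtype=np.float32)
    seg = SpeechSegment.from_pcm(chunk, 16000, "partial hypothesis")
    print(seg.duration)   # 0.1 seconds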

property has_token

Return whether the speech segment has text tokens.
classmethod make_silence(duration, sample_rate)[source]

Create a silent speech segment of the given duration and sample rate; the transcript will be an empty string.

Args:

duration (float): Length of silence in seconds.
sample_rate (float): Sample rate.

Returns:

SpeechSegment: Silence of the given duration.
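
A sketch; the 0.5 s duration and 16 kHz rate are arbitrary choices:

    from paddlespeech.s2t.frontend.speech import SpeechSegment

    pad = SpeechSegment.make_silence(0.5, 16000)
    print(pad.duration)          # 0.5 seconds of silence
    print(repr(pad.transcript))  # '' -- empty transcript, as documented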

classmethod slice_from_file(filepath, transcript, tokens=None, token_ids=None, start=None, end=None)[source]

Load a small section of a speech file without loading the entire file into memory, which can be incredibly wasteful.

Parameters:
  • filepath (str|file) -- Filepath or file object to audio file.

  • start (float) -- Start time in seconds. If start is negative, it wraps around from the end. If not provided, this function reads from the very beginning.

  • end (float) -- End time in seconds. If end is negative, it wraps around from the end. If not provided, the default behavior is to read to the end of the file.

  • transcript -- Transcript text for the speech. If not provided, the default is an empty string.

Returns:

SpeechSegment instance of the specified slice of the input speech file.

Return type:

SpeechSegment
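
A sketch; "long_recording.wav" and its transcript are hypothetical, and the slice boundaries are arbitrary:

    from paddlespeech.s2t.frontend.speech import SpeechSegment

    # Read only seconds 1.0-3.0 of the file instead of loading it all.
    clip = SpeechSegment.slice_from_file("long_recording.wav", "hello world",
                                         start=1.0, end=3.0)
    print(clip.duration)   # about 2.0 seconds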

property token_ids

Return the transcript text token ids.

Returns:

List[int]: text token ids.

property tokens

Return the transcript text tokens.

Returns:

List[str]: text tokens.

property transcript

Return the transcript text.

Returns:

str: Transcript text for the speech.
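
A sketch tying the text-side properties together; the token strings and ids are illustrative:

    import numpy as np
    from paddlespeech.s2t.frontend.speech import SpeechSegment

    seg = SpeechSegment(np.zeros(16000, dtype=np.float32), 16000, "hello world",
                        tokens=["hello", "world"], token_ids=[12, 34])
    print(seg.transcript)  # "hello world"
    print(seg.tokens)      # ["hello", "world"]
    print(seg.token_ids)   # [12, 34]
    print(seg.has_token)   # True here, since tokens were supplied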