paddlespeech.s2t.frontend.speech module
Contains the speech segment class.
- class paddlespeech.s2t.frontend.speech.SpeechSegment(samples, sample_rate, transcript, tokens=None, token_ids=None)[source]
Bases:
AudioSegment
Speech Segment with Text
- Args:
AudioSegment (AudioSegment): Audio Segment
- Attributes:
duration
Return audio duration.
- has_token
num_samples
Return number of samples.
rms_db
Return root mean square energy of the audio in decibels.
sample_rate
Return audio sample rate.
samples
Return audio samples.
token_ids
Return the transcript text token ids.
tokens
Return the transcript text tokens.
transcript
Return the transcript text.
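The duration, num_samples, sample_rate and rms_db attributes are tied together by simple formulas. The following is a minimal sketch in plain Python, not the library's implementation. It assumes rms_db is 10 * log10 of the mean squared sample value, which this page does not state, and the helper names `duration_seconds` and `rms_db` are hypothetical.

```python
import math


def duration_seconds(num_samples, sample_rate):
    """Duration in seconds of a clip holding num_samples frames."""
    return num_samples / float(sample_rate)


def rms_db(samples):
    """RMS energy in decibels, assumed here to be 10 * log10(mean(x^2))."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


# One second of a 440 Hz sine wave at amplitude 0.5, sampled at 16 kHz.
sample_rate = 16000
samples = [
    0.5 * math.sin(2.0 * math.pi * 440.0 * n / sample_rate)
    for n in range(sample_rate)
]

print(duration_seconds(len(samples), sample_rate))
print(round(rms_db(samples), 2))
```

A full-scale square wave (all samples at +1.0 or -1.0) sits at 0 dB under this definition, so typical speech has a negative rms_db.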
Methods
add_noise
(noise, snr_dB[, ...])Add the given noise segment at a specific signal-to-noise ratio.
change_speed
(speed_rate)Change the audio speed by linear interpolation.
concatenate
(*segments)Concatenate an arbitrary number of speech segments; both audio and transcript are concatenated.
convolve
(impulse_segment[, allow_resample])Convolve this audio segment with the given impulse segment.
convolve_and_normalize
(impulse_segment[, ...])Convolve and normalize the resulting audio segment so that it has the same average power as the input signal.
from_bytes
(bytes, transcript[, tokens, ...])Create speech segment from a byte string and corresponding transcript.
from_file
(filepath, transcript[, tokens, ...])Create speech segment from audio file and corresponding transcript.
from_pcm
(samples, sample_rate, transcript[, ...])Create speech segment from PCM samples in online mode.
from_sequence_file
(filepath)Create audio segment from sequence file.
gain_db
(gain)Apply gain in decibels to samples.
make_silence
(duration, sample_rate)Create a silent speech segment of the given duration and sample rate; the transcript is an empty string.
normalize
([target_db, max_gain_db])Normalize audio to be of the desired RMS value in decibels.
normalize_online_bayesian
(target_db, ...[, ...])Normalize audio using a production-compatible online/causal algorithm.
pad_silence
(duration[, sides])Pad this audio sample with a period of silence.
random_subsegment
(subsegment_length[, rng])Randomly cut a subsegment of the specified length from the audio segment.
resample
(target_sample_rate[, filter])Resample the audio to a target sample rate.
shift
(shift_ms)Shift the audio in time.
slice_from_file
(filepath, transcript[, ...])Load a small section of a speech file without loading the entire file into memory, which can be very wasteful.
subsegment
([start_sec, end_sec])Cut the AudioSegment between given boundaries.
superimpose
(other)Add samples from another segment to those of this segment (sample-wise addition, not segment concatenation).
to
([dtype])Create a dtype audio content.
to_bytes
([dtype])Create a byte string containing the audio content.
to_wav_file
(filepath[, dtype])Save audio segment to disk as wav file.
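The add_noise, gain_db and superimpose entries above combine naturally: the noise is scaled by a gain in decibels, then added sample-wise. The sketch below illustrates that idea in plain Python. It is an assumption about how mixing at a target signal-to-noise ratio can be done, not the library's code, and `rms_db` and `mix_at_snr` are hypothetical helpers.

```python
import math


def rms_db(samples):
    """RMS energy in decibels, assumed to be 10 * log10(mean(x^2))."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech is snr_db louder, then add sample-wise."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # Gain (in dB) that moves the noise level to speech level minus snr_db.
    noise_gain_db = rms_db(speech) - rms_db(noise) - snr_db
    # A gain in dB applied to amplitudes is a factor of 10^(gain / 20).
    scale = 10.0 ** (noise_gain_db / 20.0)
    scaled_noise = [scale * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled_noise)]
    return mixed, scaled_noise


speech = [0.5, -0.5] * 100
noise = [0.1, 0.1, -0.1, -0.1] * 50
mixed, scaled_noise = mix_at_snr(speech, noise, snr_db=10.0)
print(round(rms_db(speech) - rms_db(scaled_noise), 6))
```

The real add_noise accepts further optional arguments (shown as `...` in its signature) that this sketch does not model.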
- classmethod concatenate(*segments)[source]
Concatenate an arbitrary number of speech segments; both audio and transcript are concatenated.
- Parameters:
*segments --
Input speech segments to be concatenated.
- Returns:
Speech segment instance.
- Return type:
SpeechSegment
- Raises:
ValueError -- If the number of segments is zero, or if the sample_rate of any two segments does not match.
TypeError -- If any segment is not SpeechSegment instance.
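The error contract above can be illustrated with a small stand-in class. This is a sketch only: `ToySegment` is hypothetical and is not the PaddleSpeech class, and joining transcripts with no separator is an assumption.

```python
class ToySegment:
    """Hypothetical stand-in for SpeechSegment, used only for illustration."""

    def __init__(self, samples, sample_rate, transcript):
        self.samples = list(samples)
        self.sample_rate = sample_rate
        self.transcript = transcript

    @classmethod
    def concatenate(cls, *segments):
        if len(segments) == 0:
            raise ValueError("No segments were given to concatenate.")
        # Check types first so a bad argument gives TypeError, not AttributeError.
        for seg in segments:
            if not isinstance(seg, cls):
                raise TypeError("Only ToySegment instances can be concatenated.")
        sample_rate = segments[0].sample_rate
        samples = []
        transcript = ""
        for seg in segments:
            if seg.sample_rate != sample_rate:
                raise ValueError("Segments have different sample rates.")
            samples.extend(seg.samples)
            # Assumption: transcripts are joined with no separator.
            transcript += seg.transcript
        return cls(samples, sample_rate, transcript)


first = ToySegment([0.1, 0.2], 16000, "hello ")
second = ToySegment([0.3], 16000, "world")
joined = ToySegment.concatenate(first, second)
print(joined.samples, repr(joined.transcript))
```

Because the sample rates must match, segments recorded at different rates need to be resampled before they are concatenated.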
- classmethod from_bytes(bytes, transcript, tokens=None, token_ids=None)[source]
Create speech segment from a byte string and corresponding transcript.
- Args:
bytes (bytes): Byte string containing the audio content.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): Text tokens. Defaults to None.
token_ids (List[int], optional): Text token ids. Defaults to None.
- Returns:
SpeechSegment: Speech segment instance.
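To show what "a byte string containing audio" can look like, the sketch below round-trips 16-bit PCM WAV data through memory using only the standard library. It makes no claim about which decoder or formats from_bytes uses; `encode_wav_bytes` and `decode_wav_bytes` are hypothetical helpers, and scaling by 32768 to reach the range [-1, 1) is an assumption.

```python
import io
import struct
import wave


def encode_wav_bytes(int_samples, sample_rate):
    """Pack mono 16-bit integer samples into an in-memory WAV byte string."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(int_samples), *int_samples))
    return buf.getvalue()


def decode_wav_bytes(data):
    """Decode 16-bit PCM WAV bytes into (float samples, sample_rate).

    Multi-channel audio is returned interleaved in one flat list.
    """
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wav.getframerate()
        n_frames = wav.getnframes()
        n_values = n_frames * wav.getnchannels()
        raw = wav.readframes(n_frames)
    ints = struct.unpack("<%dh" % n_values, raw)
    # Assumption: map int16 values to floats in [-1.0, 1.0).
    return [v / 32768.0 for v in ints], sample_rate


data = encode_wav_bytes([0, 16384, -32768], 16000)
samples, sample_rate = decode_wav_bytes(data)
print(samples, sample_rate)
```

A byte string like `data` is the kind of input that from_bytes would receive together with the transcript.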
- classmethod from_file(filepath, transcript, tokens=None, token_ids=None, infos=None)[source]
Create speech segment from audio file and corresponding transcript.
- Args:
filepath (str|file): Filepath or file object to audio file.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): Text tokens. Defaults to None.
token_ids (List[int], optional): Text token ids. Defaults to None.
infos (TarLocalData, optional): tar2obj and tar2infos. Defaults to None.
- Returns:
SpeechSegment: Speech segment instance.
- classmethod from_pcm(samples, sample_rate, transcript, tokens=None, token_ids=None)[source]
Create speech segment from PCM samples in online mode.
- Args:
samples (numpy.ndarray): Audio samples [num_samples x num_channels].
sample_rate (int): Audio sample rate.
transcript (str): Transcript text for the speech.
tokens (List[str], optional): Text tokens. Defaults to None.
token_ids (List[int], optional): Text token ids. Defaults to None.
- Returns:
SpeechSegment: Speech segment instance.
- property has_token
- classmethod make_silence(duration, sample_rate)[source]
Create a silent speech segment of the given duration and sample rate; the transcript is an empty string.
- Args:
duration (float): Length of silence in seconds.
sample_rate (float): Sample rate.
- Returns:
SpeechSegment: Silence of the given duration.
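The relationship between duration, sample rate and the resulting buffer can be sketched as follows. This is an illustration under assumptions, not the library's code: `make_silence_samples` is a hypothetical helper, and truncating `duration * sample_rate` to an integer is assumed.

```python
def make_silence_samples(duration, sample_rate):
    """Return (samples, transcript) for a silent clip.

    The sample count is duration * sample_rate truncated to an integer
    (an assumption), and the transcript is the empty string.
    """
    num_samples = int(duration * sample_rate)
    return [0.0] * num_samples, ""


samples, transcript = make_silence_samples(0.5, 16000)
print(len(samples), repr(transcript))
```

Half a second at 16 kHz yields 8000 zero-valued samples, which is the kind of buffer used to pad or separate utterances.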
- classmethod slice_from_file(filepath, transcript, tokens=None, token_ids=None, start=None, end=None)[source]
Load a small section of a speech file without loading the entire file into memory, which can be very wasteful.
- Parameters:
filepath (str|file) -- Filepath or file object to audio file.
start (float) -- Start time in seconds. If start is negative, it wraps around from the end. If not provided, this function reads from the very beginning.
end (float) -- End time in seconds. If end is negative, it wraps around from the end. If not provided, the default behavior is to read to the end of the file.
transcript -- Transcript text for the speech. If not provided, the default is an empty string.
- Returns:
SpeechSegment instance of the specified slice of the input speech file.
- Return type:
SpeechSegment
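The wrap-around rule for negative start and end values can be sketched as a small bounds resolver. This is an illustration only: `resolve_slice_bounds` and `to_frame_range` are hypothetical helpers, and the specific error conditions are assumptions, since this page lists no exceptions for slice_from_file.

```python
def resolve_slice_bounds(duration, start=None, end=None):
    """Turn optional, possibly negative, times into absolute seconds."""
    start = 0.0 if start is None else start
    end = duration if end is None else end
    # Negative values count back from the end of the file.
    if start < 0.0:
        start += duration
    if end < 0.0:
        end += duration
    # Assumed validation; the page does not document these errors.
    if start < 0.0 or end < 0.0:
        raise ValueError("slice boundary lies before the start of the file")
    if start > end:
        raise ValueError("slice start is later than slice end")
    if end > duration:
        raise ValueError("slice end lies beyond the end of the file")
    return start, end


def to_frame_range(start, end, sample_rate):
    """Convert times in seconds to (first_frame, stop_frame) indices."""
    return int(start * sample_rate), int(end * sample_rate)


# Last three seconds of a ten-second file sampled at 16 kHz.
start, end = resolve_slice_bounds(10.0, start=-3.0)
print(start, end, to_frame_range(start, end, 16000))
```

Only the frames in the resolved range need to be read from disk, which is what lets a slice avoid loading the whole file.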
- property token_ids
Return the transcript text token ids.
- Returns:
List[int]: text token ids.
- property tokens
Return the transcript text tokens.
- Returns:
List[str]: text tokens.
- property transcript
Return the transcript text.
- Returns:
str: Transcript text for the speech.