paddlespeech.s2t.frontend.audio module

Contains the audio segment class.

class paddlespeech.s2t.frontend.audio.AudioSegment(samples, sample_rate)[source]

Bases: object

Monaural audio segment abstraction.

Parameters:

samples (ndarray.float32) -- Audio samples [num_samples x num_channels].
sample_rate (int) -- Audio sample rate.

Raises:

TypeError -- If the sample data type is not float or int.

Attributes:

duration: Return audio duration.
num_samples: Return number of samples.
rms_db: Return root mean square energy of the audio in decibels.
sample_rate: Return audio sample rate.
samples: Return audio samples.

Methods

`add_noise`(noise, snr_dB[, ...])	Add the given noise segment at a specific signal-to-noise ratio.
`change_speed`(speed_rate)	Change the audio speed by linear interpolation.
`concatenate`(*segments)	Concatenate an arbitrary number of audio segments together.
`convolve`(impulse_segment[, allow_resample])	Convolve this audio segment with the given impulse segment.
`convolve_and_normalize`(impulse_segment[, ...])	Convolve and normalize the resulting audio segment so that it has the same average power as the input signal.
`from_bytes`(bytes)	Create audio segment from a byte string containing audio samples.
`from_file`(file[, infos])	Create audio segment from audio file.
`from_pcm`(samples, sample_rate)	Create audio segment from a byte string containing audio samples.
`from_sequence_file`(filepath)	Create audio segment from sequence file.
`gain_db`(gain)	Apply gain in decibels to samples.
`make_silence`(duration, sample_rate)	Creates a silent audio segment of the given duration and sample rate.
`normalize`([target_db, max_gain_db])	Normalize audio to be of the desired RMS value in decibels.
`normalize_online_bayesian`(target_db, ...[, ...])	Normalize audio using a production-compatible online/causal algorithm.
`pad_silence`(duration[, sides])	Pad this audio sample with a period of silence.
`random_subsegment`(subsegment_length[, rng])	Cut a random subsegment of the specified length from this audio segment.
`resample`(target_sample_rate[, filter])	Resample the audio to a target sample rate.
`shift`(shift_ms)	Shift the audio in time.
`slice_from_file`(file[, start, end])	Load a small section of an audio file without loading the entire file into memory, which can be very wasteful.
`subsegment`([start_sec, end_sec])	Cut the AudioSegment between given boundaries.
`superimpose`(other)	Add samples from another segment to those of this segment (sample-wise addition, not segment concatenation).
`to`([dtype])	Return the audio content converted to the given dtype.
`to_bytes`([dtype])	Create a byte string containing the audio content.
`to_wav_file`(filepath[, dtype])	Save audio segment to disk as wav file.

add_noise(noise, snr_dB, allow_downsampling=False, max_gain_db=300.0, rng=None)[source]

Add the given noise segment at a specific signal-to-noise ratio. If the noise segment is longer than this segment, a random subsegment of matching length is sampled from it and used instead.

Note that this is an in-place transformation.

Parameters:

noise (AudioSegment) -- Noise signal to add.
snr_dB (float) -- Signal-to-Noise Ratio, in decibels.
allow_downsampling (bool) -- Whether to allow the noise signal to be downsampled to match the base signal sample rate.
max_gain_db (float) -- Maximum amount of gain to apply to noise signal before adding it in. This is to prevent attempting to apply infinite gain to a zero signal.
rng (None|random.Random) -- Random number generator state.

Raises:

ValueError -- If the sample rates of the two audio segments do not match when downsampling is not allowed, or if the noise segment is shorter than the original audio segment.
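The relationship between snr_dB, max_gain_db and the gain applied to the noise can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper names `rms_db` and `noise_gain_db` are hypothetical, and the sketch assumes the noise is scaled so that the signal RMS minus the noise RMS equals snr_dB, capped at max_gain_db.

```python
import math


def rms_db(samples):
    """Root mean square energy in decibels: 10 * log10(mean(x^2))."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def noise_gain_db(signal, noise, snr_db, max_gain_db=300.0):
    """Gain (dB) to apply to the noise so the mix reaches the target SNR."""
    return min(rms_db(signal) - rms_db(noise) - snr_db, max_gain_db)


signal = [0.5, -0.5, 0.5, -0.5]
noise = [0.05, -0.05, 0.05, -0.05]

gain = noise_gain_db(signal, noise, snr_db=10.0)
# Convert the dB gain to a linear amplitude factor and scale the noise.
scaled_noise = [x * 10.0 ** (gain / 20.0) for x in noise]
# Sample-wise addition of signal and scaled noise.
mixed = [s + n for s, n in zip(signal, scaled_noise)]
achieved_snr = rms_db(signal) - rms_db(scaled_noise)
```

The sketch does not handle an all-zero noise signal; that is the case max_gain_db is meant to guard against.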

change_speed(speed_rate)[source]

Change the audio speed by linear interpolation.

Note that this is an in-place transformation.

Parameters:

speed_rate (float) -- Speed change rate. Values greater than 1.0 speed up the audio, 1.0 leaves it unchanged, and values between 0.0 and 1.0 slow it down. Values less than or equal to 0.0 are not allowed.

Raises:

ValueError -- If speed_rate <= 0.0.
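Speed change by linear interpolation can be sketched as follows. This is a pure-Python illustration under stated assumptions: the function returns a new list instead of modifying a segment in place, and the way the endpoints are mapped onto the original sample grid is a choice made for this sketch, not necessarily the one the library makes.

```python
def change_speed(samples, speed_rate):
    """Return samples resampled by linear interpolation (hypothetical helper)."""
    if speed_rate <= 0.0:
        raise ValueError("speed_rate should be greater than zero.")
    old_length = len(samples)
    new_length = int(old_length / speed_rate)
    if new_length <= 1:
        return list(samples[:new_length])
    out = []
    for i in range(new_length):
        # Position of output sample i on the original sample grid.
        pos = i * (old_length - 1) / (new_length - 1)
        left = int(pos)
        right = min(left + 1, old_length - 1)
        frac = pos - left
        # Weighted average of the two neighbouring original samples.
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


ramp = [0.0, 1.0, 2.0, 3.0, 4.0]
faster = change_speed(ramp, 2.0)  # fewer samples, shorter duration
slower = change_speed(ramp, 0.5)  # more samples, longer duration
```

Because the sample rate is unchanged, fewer samples mean a shorter, faster segment and more samples mean a longer, slower one.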

classmethod concatenate(*segments)[source]

Concatenate an arbitrary number of audio segments together.

Parameters:

*segments --

Input audio segments to be concatenated.

Returns:

Audio segment instance containing the concatenated result.

Return type:

AudioSegment

Raises:

ValueError -- If the number of segments is zero, or if the sample rates of the segments do not match.
TypeError -- If any segment is not an AudioSegment instance.

convolve(impulse_segment, allow_resample=False)[source]

Convolve this audio segment with the given impulse segment.

Note that this is an in-place transformation.

Parameters:

impulse_segment (AudioSegment) -- Impulse response segments.
allow_resample (bool) -- Indicates whether resampling is allowed when the impulse_segment has a different sample rate from this signal.

Raises:

ValueError -- If the sample rates of the two audio segments do not match when resampling is not allowed.

convolve_and_normalize(impulse_segment, allow_resample=False)[source]

Convolve and normalize the resulting audio segment so that it has the same average power as the input signal.

Note that this is an in-place transformation.

Parameters:

impulse_segment (AudioSegment) -- Impulse response segments.
allow_resample (bool) -- Indicates whether resampling is allowed when the impulse_segment has a different sample rate from this signal.
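The idea of convolving and then restoring the original average power can be sketched in plain Python. This is an illustration only: it uses a direct "full" convolution and hypothetical helper names (`rms`, `convolve_full`, `convolve_and_normalize`), returns a new list rather than working in place, and ignores resampling.

```python
import math


def rms(samples):
    """Root mean square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def convolve_full(signal, impulse):
    """Direct 'full' convolution: output has len(signal) + len(impulse) - 1 samples."""
    out = [0.0] * (len(signal) + len(impulse) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse):
            out[i + j] += s * h
    return out


def convolve_and_normalize(signal, impulse):
    """Convolve, then rescale so the result has the same RMS as the input."""
    target = rms(signal)
    wet = convolve_full(signal, impulse)
    scale = target / rms(wet)
    return [x * scale for x in wet]


dry = [1.0, 0.0, -1.0, 0.0]
impulse = [1.0, 0.5]
wet = convolve_and_normalize(dry, impulse)
```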

property duration

Return audio duration.

Returns:

Audio duration in seconds.

Return type:

float

classmethod from_bytes(bytes)[source]

Create audio segment from a byte string containing audio samples.

Parameters:

bytes (str) -- Byte string containing audio samples.

Returns:

Audio segment instance.

Return type:

AudioSegment

classmethod from_file(file, infos=None)[source]

Create audio segment from audio file.

Parameters:

file (str|file) -- Filepath or file object to audio file.
infos (TarLocalData, optional) -- tar2obj and tar2infos. Defaults to None.

Returns:

Audio segment instance.

Return type:

AudioSegment

classmethod from_pcm(samples, sample_rate)[source]

Create audio segment from a byte string containing audio samples.

Parameters:

samples (numpy.ndarray) -- Audio samples [num_samples x num_channels].
sample_rate (int) -- Audio sample rate.

Returns:

Audio segment instance.

Return type:

AudioSegment

classmethod from_sequence_file(filepath)[source]

Create audio segment from sequence file. Sequence file is a binary file containing a collection of multiple audio files, with several header bytes in the head indicating the offsets of each audio byte data chunk.

The format is:

4 bytes (int, version), 4 bytes (int, num of utterance), 4 bytes (int, bytes per header), [bytes_per_header*(num_utterance+1)] bytes (offsets for each audio), audio_bytes_data_of_1st_utterance, audio_bytes_data_of_2nd_utterance, ......

The sequence file name must end with ".seqbin". The filename of the 5th utterance's audio file in sequence file "xxx.seqbin" must be "xxx.seqbin_5", where "5" is the utterance index within this sequence file (starting from 1).

Parameters:

filepath (str) -- Filepath of sequence file.

Returns:

Audio segment instance.

Return type:

AudioSegment
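The layout described above can be illustrated with the standard `struct` module. This sketch rests on several assumptions that the text does not confirm: integers are little-endian, each offset is 4 bytes wide, and offsets are absolute positions from the start of the file. The helpers `build_seqbin`, `read_utterance` and `split_seqbin_name` are hypothetical and are not part of the library.

```python
import io
import struct


def build_seqbin(utterances):
    """Pack byte chunks into the described layout (assumes 4-byte offsets)."""
    bytes_per_header = 4
    header_size = 12 + bytes_per_header * (len(utterances) + 1)
    offsets = [header_size]
    for chunk in utterances:
        offsets.append(offsets[-1] + len(chunk))
    # version, number of utterances, bytes per header
    head = struct.pack("<iii", 1, len(utterances), bytes_per_header)
    head += struct.pack("<%di" % len(offsets), *offsets)
    return head + b"".join(utterances)


def read_utterance(stream, index):
    """Return the raw bytes of the utterance with 1-based `index`."""
    _version, num_utterances, bytes_per_header = struct.unpack(
        "<iii", stream.read(12))
    if bytes_per_header != 4:
        raise ValueError("this sketch only handles 4-byte offsets")
    offsets = struct.unpack(
        "<%di" % (num_utterances + 1), stream.read(4 * (num_utterances + 1)))
    if not 1 <= index <= num_utterances:
        raise ValueError("utterance index out of range")
    stream.seek(offsets[index - 1])
    return stream.read(offsets[index] - offsets[index - 1])


def split_seqbin_name(name):
    """Split 'xxx.seqbin_5' into ('xxx.seqbin', 5)."""
    path, _, idx = name.rpartition("_")
    if not path.endswith(".seqbin"):
        raise ValueError("not a sequence file name: %r" % name)
    return path, int(idx)


blob = build_seqbin([b"first", b"second!", b"3rd"])
second = read_utterance(io.BytesIO(blob), 2)
```

Storing num_utterance + 1 offsets means the length of each chunk is simply the difference between two consecutive offsets.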

gain_db(gain)[source]

Apply gain in decibels to samples.

Note that this is an in-place transformation.

Parameters:

gain (float|1darray) -- Gain in decibels to apply to samples.
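A gain of g decibels corresponds to multiplying each sample by 10^(g / 20). The sketch below shows this for both a scalar gain and a per-sample gain sequence; `apply_gain_db` is a hypothetical helper that returns a new list rather than modifying a segment in place.

```python
def apply_gain_db(samples, gain):
    """Scale samples by a gain in dB (scalar, or one value per sample)."""
    if isinstance(gain, (list, tuple)):
        gains = list(gain)
    else:
        gains = [gain] * len(samples)
    if len(gains) != len(samples):
        raise ValueError("per-sample gain must match the number of samples")
    # dB to linear amplitude factor: 10 ** (dB / 20)
    return [x * 10.0 ** (g / 20.0) for x, g in zip(samples, gains)]


louder = apply_gain_db([0.1, -0.2], 20.0)           # +20 dB is a factor of 10
shaped = apply_gain_db([1.0, 1.0], [0.0, -20.0])    # per-sample gains
```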

classmethod make_silence(duration, sample_rate)[source]

Creates a silent audio segment of the given duration and sample rate.

Parameters:

duration (float) -- Length of silence in seconds.
sample_rate (float) -- Sample rate.

Returns:

Silent AudioSegment instance of the given duration.

Return type:

AudioSegment

normalize(target_db=-20, max_gain_db=300.0)[source]

Normalize audio to be of the desired RMS value in decibels.

Note that this is an in-place transformation.

Parameters:

target_db (float) -- Target RMS value in decibels. This value should be less than 0.0 as 0.0 is full-scale audio.
max_gain_db (float) -- Max amount of gain in dB that can be applied for normalization. This is to prevent nans when attempting to normalize a signal consisting of all zeros.

Raises:

ValueError -- If the required gain to normalize the segment to the target_db value exceeds max_gain_db.
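RMS normalization and the role of max_gain_db can be sketched as follows. This is a pure-Python illustration with hypothetical helpers (`rms_db`, `normalize`); it returns a new list, and it treats an all-zero signal as having an RMS of negative infinity so that the required gain exceeds any finite max_gain_db.

```python
import math


def rms_db(samples):
    """RMS energy in dB; negative infinity for an all-zero signal."""
    mean_square = sum(x * x for x in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def normalize(samples, target_db=-20.0, max_gain_db=300.0):
    """Scale samples so that their RMS energy equals target_db."""
    gain = target_db - rms_db(samples)
    if gain > max_gain_db:
        raise ValueError(
            "required gain %f dB exceeds max_gain_db %f dB" % (gain, max_gain_db))
    factor = 10.0 ** (gain / 20.0)
    return [x * factor for x in samples]


quiet = [0.01, -0.01, 0.01, -0.01]          # RMS of -40 dB
normalized = normalize(quiet, target_db=-20.0)
```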

normalize_online_bayesian(target_db, prior_db, prior_samples, startup_delay=0.0)[source]

Normalize audio using a production-compatible online/causal algorithm. This uses an exponential likelihood and gamma prior to make online estimates of the RMS even when there are very few samples.

Note that this is an in-place transformation.

Parameters:

target_db -- Target RMS value in decibels.
prior_db (float) -- Prior RMS estimate in decibels.
prior_samples (float) -- Prior strength in number of samples.
startup_delay (float) -- Default 0.0s. If provided, this function will accrue statistics for the first startup_delay seconds before applying online normalization.
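One way to read the prior parameters is as pseudo-observations blended into a running mean-square estimate. The sketch below illustrates that reading: it computes a causal per-sample gain from the running sum of squares plus prior_samples pseudo-samples at the prior_db level. It is a simplified assumption about the algorithm, it omits startup_delay, and `online_gains_db` is a hypothetical helper.

```python
import math


def online_gains_db(samples, target_db, prior_db, prior_samples):
    """Per-sample gain (dB) from a causal RMS estimate blended with a prior."""
    prior_mean_square = 10.0 ** (prior_db / 10.0)
    prior_sum_of_squares = prior_mean_square * prior_samples
    gains = []
    running_sum = 0.0
    for count, x in enumerate(samples, start=1):
        running_sum += x * x
        # The estimate only uses samples seen so far, plus the prior.
        mean_square = (running_sum + prior_sum_of_squares) / (count + prior_samples)
        gains.append(target_db - 10.0 * math.log10(mean_square))
    return gains


samples = [0.1] * 1000  # constant-level signal at -20 dB RMS
gains = online_gains_db(
    samples, target_db=-20.0, prior_db=-40.0, prior_samples=100.0)
```

With a prior that underestimates the true level, the gain starts high and decays toward zero as real samples outweigh the prior.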

property num_samples

Return number of samples.

Returns:

Number of samples.

Return type:

int

pad_silence(duration, sides='both')[source]

Pad this audio sample with a period of silence.

Note that this is an in-place transformation.

Parameters:

duration (float) -- Length of silence in seconds to pad.
sides (str) -- Position for padding: 'beginning' - adds silence in the beginning; 'end' - adds silence in the end; 'both' - adds silence in both the beginning and the end.

Raises:

ValueError -- If sides is not supported.
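The three values of sides can be illustrated with a short pure-Python sketch. The helper `pad_silence` below is hypothetical: it works on a plain list, takes the sample rate explicitly, and returns a new list instead of padding in place.

```python
def pad_silence(samples, sample_rate, duration, sides="both"):
    """Return samples padded with `duration` seconds of zeros."""
    silence = [0.0] * int(duration * sample_rate)
    if sides == "beginning":
        return silence + list(samples)
    if sides == "end":
        return list(samples) + silence
    if sides == "both":
        return silence + list(samples) + silence
    raise ValueError("Unknown value for sides: %s" % sides)


# 0.5 s of silence at 4 samples per second is 2 zero samples per side.
padded = pad_silence([1.0, 2.0], sample_rate=4, duration=0.5, sides="both")
```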

random_subsegment(subsegment_length, rng=None)[source]

Cut a random subsegment of the specified length from this audio segment.

Note that this is an in-place transformation.

Parameters:

subsegment_length (float) -- Subsegment length in seconds.
rng (random.Random) -- Random number generator state.

Raises:

ValueError -- If the length of the subsegment is greater than that of the original segment.
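Choosing the boundaries of a random subsegment can be sketched as follows. The helper `random_subsegment_bounds` is hypothetical and assumes the start time is drawn uniformly from the range that keeps the subsegment inside the original; passing a seeded random.Random makes the choice reproducible.

```python
import random


def random_subsegment_bounds(duration, subsegment_length, rng=None):
    """Return (start_sec, end_sec) of a random subsegment."""
    rng = random.Random() if rng is None else rng
    if subsegment_length > duration:
        raise ValueError(
            "Length of subsegment must not be greater than original segment.")
    start = rng.uniform(0.0, duration - subsegment_length)
    return start, start + subsegment_length


start_sec, end_sec = random_subsegment_bounds(10.0, 3.0, random.Random(0))
```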

resample(target_sample_rate, filter='kaiser_best')[source]

Resample the audio to a target sample rate.

Note that this is an in-place transformation.

Parameters:

target_sample_rate (int) -- Target sample rate.
filter (str) -- The resampling filter to use; one of {'kaiser_best', 'kaiser_fast'}.

property rms_db

Return root mean square energy of the audio in decibels.

Returns:

Root mean square energy in decibels.

Return type:

float

property sample_rate

Return audio sample rate.

Returns:

Audio sample rate.

Return type:

int

property samples

Return audio samples.

Returns:

Audio samples.

Return type:

ndarray

shift(shift_ms)[source]

Shift the audio in time. If shift_ms is positive, shift with time advance; if negative, shift with time delay. Silence is padded so that the duration stays unchanged.

Note that this is an in-place transformation.

Parameters:

shift_ms (float) -- Shift time in milliseconds. If positive, shift with time advance; if negative, shift with time delay.

Raises:

ValueError -- If shift_ms is longer than audio duration.
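The difference between time advance and time delay can be shown on a short list of samples. This is a pure-Python sketch with a hypothetical helper `shift` that takes the sample rate explicitly and returns a new list; the in-place behaviour of the real method is not reproduced.

```python
def shift(samples, sample_rate, shift_ms):
    """Return samples shifted in time, zero-padded to keep the length."""
    duration_ms = 1000.0 * len(samples) / sample_rate
    if abs(shift_ms) > duration_ms:
        raise ValueError("Absolute value of shift_ms should be smaller "
                         "than audio duration.")
    n = int(shift_ms * sample_rate / 1000)
    if n > 0:
        # Time advance: drop the first n samples, pad silence at the end.
        return list(samples[n:]) + [0.0] * n
    if n < 0:
        # Time delay: pad silence at the start, drop the last -n samples.
        return [0.0] * (-n) + list(samples[:n])
    return list(samples)


samples = [1.0, 2.0, 3.0, 4.0]  # 4 ms of audio at 1000 samples per second
advanced = shift(samples, sample_rate=1000, shift_ms=1.0)
delayed = shift(samples, sample_rate=1000, shift_ms=-2.0)
```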

classmethod slice_from_file(file, start=None, end=None)[source]

Load a small section of an audio file without loading the entire file into memory, which can be very wasteful.

Parameters:

file (str|file) -- Input audio filepath or file object.
start (float) -- Start time in seconds. If start is negative, it wraps around from the end. If not provided, this function reads from the very beginning.
end (float) -- End time in seconds. If end is negative, it wraps around from the end. If not provided, the default behavior is to read to the end of the file.

Returns:

AudioSegment instance of the specified slice of the input audio file.

Return type:

AudioSegment

Raises:

ValueError -- If start or end is incorrectly set, e.g. out of bounds in time.
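The wrap-around rule for negative start and end values can be sketched as a small bounds-resolution function. The helper `resolve_slice` is hypothetical: it takes the file duration and sample rate as arguments (in practice these would come from the file's metadata) and returns the frame range to read.

```python
def resolve_slice(duration, sample_rate, start=None, end=None):
    """Turn start/end times in seconds into (start_frame, end_frame)."""
    start = 0.0 if start is None else start
    end = duration if end is None else end
    # Negative times count back from the end of the file.
    if start < 0.0:
        start += duration
    if end < 0.0:
        end += duration
    if start < 0.0 or end < 0.0:
        raise ValueError("slice boundary is before the start of the file")
    if start > end:
        raise ValueError("slice start is later than slice end")
    if end > duration:
        raise ValueError("slice end is past the end of the file")
    return int(start * sample_rate), int(end * sample_rate)


last_two_seconds = resolve_slice(10.0, 16000, start=-2.0)
trimmed = resolve_slice(10.0, 16000, start=1.0, end=-1.0)
```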

subsegment(start_sec=None, end_sec=None)[source]

Cut the AudioSegment between given boundaries.

Note that this is an in-place transformation.

Parameters:

start_sec (float) -- Beginning of subsegment in seconds.
end_sec (float) -- End of subsegment in seconds.

Raises:

ValueError -- If start_sec or end_sec is incorrectly set, e.g. out of bounds in time.

superimpose(other)[source]

Add samples from another segment to those of this segment (sample-wise addition, not segment concatenation).

Note that this is an in-place transformation.

Parameters:

other (AudioSegment) -- Segment containing samples to be added in.

Raises:

TypeError -- If the types of the two segments do not match.
ValueError -- If the sample rates of the two segments are not equal, or if the lengths of segments don't match.

to(dtype='int16')[source]

Return the audio content converted to the given dtype.

Parameters:

dtype (str) -- Data type for export samples. Options: 'int16', 'int32', 'float32', 'float64'. Default is 'int16'.

Returns:

np.ndarray containing dtype audio content.

Return type:

np.ndarray
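Converting floating-point samples to an integer type usually means scaling by the integer full-scale value and clipping. The sketch below shows that idea for 16-bit output; it assumes float samples nominally in [-1.0, 1.0] and truncation toward zero, which may differ from what the library does. `float_to_int` is a hypothetical helper.

```python
def float_to_int(samples, bits=16):
    """Scale float samples to signed integers of the given bit width."""
    scale = 2 ** (bits - 1)
    lo, hi = -scale, scale - 1
    # Scale, truncate toward zero, then clip to the representable range.
    return [max(lo, min(hi, int(x * scale))) for x in samples]


pcm16 = float_to_int([0.0, 0.5, 1.0, -1.0])
```

Positive full scale (1.0) clips to 32767 because a signed 16-bit integer cannot represent 32768.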

to_bytes(dtype='float32')[source]

Create a byte string containing the audio content.

Parameters:

dtype (str) -- Data type for export samples. Options: 'int16', 'int32', 'float32', 'float64'. Default is 'float32'.

Returns:

Byte string containing audio content.

Return type:

str

to_wav_file(filepath, dtype='float32')[source]

Save audio segment to disk as wav file.

Parameters:

filepath (str|file) -- WAV filepath or file object to save the audio segment.
dtype (str) -- Subtype for audio file. Options: 'int16', 'int32', 'float32', 'float64'. Default is 'float32'.

Raises:

TypeError -- If dtype is not supported.