paddlespeech.s2t.frontend.audio module

Contains the audio segment class.

class paddlespeech.s2t.frontend.audio.AudioSegment(samples, sample_rate)[source]

Bases: object

Monaural audio segment abstraction.

Parameters:
  • samples (ndarray.float32) -- Audio samples [num_samples x num_channels].

  • sample_rate (int) -- Audio sample rate.

Raises:

TypeError -- If the sample data type is not float or int.

Attributes:
duration

Return audio duration.

num_samples

Return number of samples.

rms_db

Return root mean square energy of the audio in decibels.

sample_rate

Return audio sample rate.

samples

Return audio samples.

Methods

add_noise(noise, snr_dB[, ...])

Add the given noise segment at a specific signal-to-noise ratio.

change_speed(speed_rate)

Change the audio speed by linear interpolation.

concatenate(*segments)

Concatenate an arbitrary number of audio segments together.

convolve(impulse_segment[, allow_resample])

Convolve this audio segment with the given impulse segment.

convolve_and_normalize(impulse_segment[, ...])

Convolve and normalize the resulting audio segment so that it has the same average power as the input signal.

from_bytes(bytes)

Create audio segment from a byte string containing audio samples.

from_file(file[, infos])

Create audio segment from audio file.

from_pcm(samples, sample_rate)

Create audio segment from an array of PCM audio samples.

from_sequence_file(filepath)

Create audio segment from sequence file.

gain_db(gain)

Apply gain in decibels to samples.

make_silence(duration, sample_rate)

Creates a silent audio segment of the given duration and sample rate.

normalize([target_db, max_gain_db])

Normalize audio to be of the desired RMS value in decibels.

normalize_online_bayesian(target_db, ...[, ...])

Normalize audio using a production-compatible online/causal algorithm.

pad_silence(duration[, sides])

Pad this audio sample with a period of silence.

random_subsegment(subsegment_length[, rng])

Randomly cut a subsegment of the specified length from the audio segment.

resample(target_sample_rate[, filter])

Resample the audio to a target sample rate.

shift(shift_ms)

Shift the audio in time.

slice_from_file(file[, start, end])

Load a small section of an audio file without reading the entire file into memory.

subsegment([start_sec, end_sec])

Cut the AudioSegment between given boundaries.

superimpose(other)

Add samples from another segment to those of this segment (sample-wise addition, not segment concatenation).

to([dtype])

Return the audio content converted to the given dtype.

to_bytes([dtype])

Create a byte string containing the audio content.

to_wav_file(filepath[, dtype])

Save audio segment to disk as wav file.

add_noise(noise, snr_dB, allow_downsampling=False, max_gain_db=300.0, rng=None)[source]

Add the given noise segment at a specific signal-to-noise ratio. If the noise segment is longer than this segment, a random subsegment of matching length is sampled from it and used instead.

Note that this is an in-place transformation.

Parameters:
  • noise (AudioSegment) -- Noise signal to add.

  • snr_dB (float) -- Signal-to-Noise Ratio, in decibels.

  • allow_downsampling (bool) -- Whether to allow the noise signal to be downsampled to match the base signal sample rate.

  • max_gain_db (float) -- Maximum amount of gain to apply to noise signal before adding it in. This is to prevent attempting to apply infinite gain to a zero signal.

  • rng (None|random.Random) -- Random number generator state.

Raises:

ValueError -- If the sample rates of the two audio segments do not match when downsampling is not allowed, or if the noise segment is shorter than the original audio segment.
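The gain applied to the noise follows from the definition of SNR in decibels. The following is a minimal plain-Python sketch of that arithmetic, assuming the noise has already been cut to the signal's length and neither input is silent. The helpers `rms_db`, `noise_gain_db` and `mix` are hypothetical names for illustration, not part of the library, and the library's own implementation (which works on NumPy arrays) may differ in detail.

```python
import math


def rms_db(samples):
    """RMS energy in decibels: 10 * log10(mean(x^2)). Assumes non-silent input."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def noise_gain_db(signal, noise, snr_db, max_gain_db=300.0):
    """Gain in dB to apply to the noise so that the mix reaches snr_db.

    The result is capped at max_gain_db, mirroring the documented parameter.
    """
    return min(rms_db(signal) - rms_db(noise) - snr_db, max_gain_db)


def mix(signal, noise, snr_db, max_gain_db=300.0):
    """Scale the noise and add it sample by sample to the signal."""
    factor = 10.0 ** (noise_gain_db(signal, noise, snr_db, max_gain_db) / 20.0)
    return [s + factor * n for s, n in zip(signal, noise)]
```

After scaling, the difference between the signal RMS and the noise RMS (both in dB) equals the requested `snr_db`, unless the cap at `max_gain_db` was hit.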

change_speed(speed_rate)[source]

Change the audio speed by linear interpolation.

Note that this is an in-place transformation.

Parameters:

speed_rate (float) -- Rate of speed change. speed_rate > 1.0 speeds up the audio; speed_rate = 1.0 leaves it unchanged; speed_rate < 1.0 slows it down. speed_rate <= 0.0 is not allowed and raises ValueError.

Raises:

ValueError -- If speed_rate <= 0.0.
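To illustrate what changing speed by linear interpolation means, here is a plain-Python sketch that resamples a list of samples onto a shorter or longer time grid. The function name and the exact choice of grid endpoints are assumptions made for this example; the library operates in place on NumPy arrays and may place the interpolation points slightly differently.

```python
def change_speed(samples, speed_rate):
    """Return a new list whose length is len(samples) / speed_rate.

    Each output sample is linearly interpolated between the two nearest
    input samples.
    """
    if speed_rate <= 0.0:
        raise ValueError("speed_rate should be greater than zero.")
    old_length = len(samples)
    new_length = int(old_length / speed_rate)
    if new_length < 2:
        return list(samples[:new_length])
    out = []
    for i in range(new_length):
        # Position of the i-th output sample on the original sample axis.
        pos = i * (old_length - 1) / (new_length - 1)
        lo = int(pos)
        hi = min(lo + 1, old_length - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

With `speed_rate=2.0` the output has half as many samples, so it plays back in half the time at the same sample rate.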

classmethod concatenate(*segments)[source]

Concatenate an arbitrary number of audio segments together.

Parameters:

*segments --

Input audio segments to be concatenated.

Returns:

Audio segment instance containing the concatenated result.

Return type:

AudioSegment

Raises:
  • ValueError -- If the number of segments is zero, or if the sample rates of the segments do not match.

  • TypeError -- If any segment is not an AudioSegment instance.

convolve(impulse_segment, allow_resample=False)[source]

Convolve this audio segment with the given impulse segment.

Note that this is an in-place transformation.

Parameters:
  • impulse_segment (AudioSegment) -- Impulse response segment.

  • allow_resample (bool) -- Indicates whether resampling is allowed when the impulse_segment has a different sample rate from this signal.

Raises:

ValueError -- If the sample rates of the two audio segments do not match when resampling is not allowed.

convolve_and_normalize(impulse_segment, allow_resample=False)[source]

Convolve and normalize the resulting audio segment so that it has the same average power as the input signal.

Note that this is an in-place transformation.

Parameters:
  • impulse_segment (AudioSegment) -- Impulse response segment.

  • allow_resample (bool) -- Indicates whether resampling is allowed when the impulse_segment has a different sample rate from this signal.

property duration

Return audio duration.

Returns:

Audio duration in seconds.

Return type:

float

classmethod from_bytes(bytes)[source]

Create audio segment from a byte string containing audio samples.

Parameters:

bytes (str) -- Byte string containing audio samples.

Returns:

Audio segment instance.

Return type:

AudioSegment

classmethod from_file(file, infos=None)[source]

Create audio segment from audio file.

Parameters:
  • file (str|file) -- Filepath or file object to audio file.

  • infos (TarLocalData, optional) -- tar2obj and tar2infos. Defaults to None.

Returns:

Audio segment instance.

Return type:

AudioSegment

classmethod from_pcm(samples, sample_rate)[source]

Create audio segment from an array of PCM audio samples.

Parameters:
  • samples (numpy.ndarray) -- Audio samples [num_samples x num_channels].

  • sample_rate (int) -- Audio sample rate.

Returns:

Audio segment instance.

Return type:

AudioSegment

classmethod from_sequence_file(filepath)[source]

Create audio segment from sequence file. Sequence file is a binary file containing a collection of multiple audio files, with several header bytes in the head indicating the offsets of each audio byte data chunk.

The format is:

  • 4 bytes (int, version)

  • 4 bytes (int, number of utterances)

  • 4 bytes (int, bytes per header)

  • [bytes_per_header * (num_utterance + 1)] bytes (offsets for each audio)

  • audio_bytes_data_of_1st_utterance, audio_bytes_data_of_2nd_utterance, ...

The sequence file name must end with ".seqbin". The filename of the 5th utterance's audio file in sequence file "xxx.seqbin" must be "xxx.seqbin_5", with "5" indicating the utterance index within this sequence file (starting from 1).

Parameters:

filepath (str) -- Filepath of sequence file.

Returns:

Audio segment instance.

Return type:

AudioSegment
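The header layout described above can be read with the standard `struct` module. The sketch below builds a tiny sequence file in memory and extracts one utterance's raw bytes. Several details are assumptions made for this example and are not confirmed by the text: little-endian byte order, 4-byte offsets, and offsets that are absolute positions from the start of the file, with consecutive offsets delimiting each chunk. `build_example` and `read_utterance` are hypothetical helper names.

```python
import struct


def build_example():
    """Build a small in-memory sequence file holding two fake audio chunks."""
    chunks = [b"first", b"second!"]
    # 3 header ints (12 bytes) plus one 4-byte offset per chunk, plus one.
    header_size = 12 + 4 * (len(chunks) + 1)
    offsets = [header_size]
    for chunk in chunks:
        offsets.append(offsets[-1] + len(chunk))
    header = struct.pack("<iii", 1, len(chunks), 4)
    table = struct.pack("<%di" % len(offsets), *offsets)
    return header + table + b"".join(chunks)


def read_utterance(blob, index):
    """Return the raw bytes of the utterance with the given 1-based index."""
    version, num_utterances, bytes_per_header = struct.unpack_from("<iii", blob, 0)
    if bytes_per_header != 4:
        raise ValueError("this sketch only handles 4-byte offsets")
    if not 1 <= index <= num_utterances:
        raise ValueError("utterance index out of range")
    offsets = struct.unpack_from("<%di" % (num_utterances + 1), blob, 12)
    # Assumed: chunk i spans [offsets[i - 1], offsets[i]) in the file.
    return blob[offsets[index - 1]:offsets[index]]
```

The 1-based index matches the "xxx.seqbin_5" naming convention, where the suffix selects the utterance inside the sequence file.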

gain_db(gain)[source]

Apply gain in decibels to samples.

Note that this is an in-place transformation.

Parameters:

gain (float|1darray) -- Gain in decibels to apply to samples.
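A gain in decibels corresponds to multiplying each sample by 10^(gain / 20). A minimal plain-Python sketch for a scalar gain follows; `apply_gain_db` is a hypothetical name, and unlike the library method it returns a new list instead of modifying samples in place.

```python
def apply_gain_db(samples, gain):
    """Scale amplitude samples by a gain given in decibels."""
    factor = 10.0 ** (gain / 20.0)
    return [x * factor for x in samples]
```

For example, a gain of 20 dB multiplies amplitudes by 10, and a gain of about -6.02 dB halves them.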

classmethod make_silence(duration, sample_rate)[source]

Creates a silent audio segment of the given duration and sample rate.

Parameters:
  • duration (float) -- Length of silence in seconds.

  • sample_rate (float) -- Sample rate.

Returns:

Silent AudioSegment instance of the given duration.

Return type:

AudioSegment

normalize(target_db=-20, max_gain_db=300.0)[source]

Normalize audio to be of the desired RMS value in decibels.

Note that this is an in-place transformation.

Parameters:
  • target_db (float) -- Target RMS value in decibels. This value should be less than 0.0 as 0.0 is full-scale audio.

  • max_gain_db (float) -- Max amount of gain in dB that can be applied for normalization. This is to prevent nans when attempting to normalize a signal consisting of all zeros.

Raises:

ValueError -- If the required gain to normalize the segment to the target_db value exceeds max_gain_db.
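The normalization step can be sketched as: measure the current RMS in decibels, compute the gain needed to reach `target_db`, and refuse if that gain exceeds `max_gain_db`. The plain-Python version below uses hypothetical helper names and returns a new list; treating an all-zero signal as having an RMS of negative infinity is an assumption that reproduces the documented ValueError for silent input.

```python
import math


def rms_db(samples):
    """RMS energy in decibels; -inf for an all-zero signal."""
    mean_square = sum(x * x for x in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def normalize(samples, target_db=-20.0, max_gain_db=300.0):
    """Scale samples so that their RMS equals target_db."""
    gain = target_db - rms_db(samples)
    if gain > max_gain_db:
        raise ValueError(
            "Unable to normalize: required gain %f dB exceeds max_gain_db %f dB"
            % (gain, max_gain_db))
    factor = 10.0 ** (gain / 20.0)
    return [x * factor for x in samples]
```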

normalize_online_bayesian(target_db, prior_db, prior_samples, startup_delay=0.0)[source]

Normalize audio using a production-compatible online/causal algorithm. This uses an exponential likelihood and gamma prior to make online estimates of the RMS even when there are very few samples.

Note that this is an in-place transformation.

Parameters:
  • target_db (float) -- Target RMS value in decibels.

  • prior_db (float) -- Prior RMS estimate in decibels.

  • prior_samples (float) -- Prior strength in number of samples.

  • startup_delay (float) -- Default 0.0s. If provided, this function will accrue statistics for the first startup_delay seconds before applying online normalization.
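One way to picture the online estimate is a running mean square that starts from the prior and is updated sample by sample, so that each sample's gain depends only on the past. The sketch below is an illustration under that assumption; it ignores `startup_delay`, uses the hypothetical name `online_gains_db`, and is not confirmed to match the library's computation line by line.

```python
import math


def online_gains_db(samples, target_db, prior_db, prior_samples):
    """Return the per-sample gain in dB from a prior-weighted running RMS.

    The prior acts like prior_samples pseudo-observations whose mean square
    corresponds to prior_db. Assumes prior_samples > 0.
    """
    prior_sum_of_squares = 10.0 ** (prior_db / 10.0) * prior_samples
    gains = []
    running_sum = 0.0
    for count, x in enumerate(samples, start=1):
        running_sum += x * x
        mean_square = (running_sum + prior_sum_of_squares) / (count + prior_samples)
        gains.append(target_db - 10.0 * math.log10(mean_square))
    return gains
```

With few samples the estimate stays close to `prior_db`; as more samples arrive, the observed energy dominates.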

property num_samples

Return number of samples.

Returns:

Number of samples.

Return type:

int

pad_silence(duration, sides='both')[source]

Pad this audio sample with a period of silence.

Note that this is an in-place transformation.

Parameters:
  • duration (float) -- Length of silence in seconds to pad.

  • sides (str) -- Position for padding: 'beginning' - adds silence in the beginning; 'end' - adds silence in the end; 'both' - adds silence in both the beginning and the end.

Raises:

ValueError -- If sides is not supported.
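The effect of the `sides` argument can be sketched in plain Python as follows. `pad_silence` here is a standalone illustration that returns a new list; converting the duration to a sample count with `int(duration * sample_rate)` is an assumption of this example.

```python
def pad_silence(samples, sample_rate, duration, sides="both"):
    """Return samples with zero-valued silence added on the chosen side(s)."""
    silence = [0.0] * int(duration * sample_rate)
    if sides == "beginning":
        return silence + list(samples)
    if sides == "end":
        return list(samples) + silence
    if sides == "both":
        return silence + list(samples) + silence
    raise ValueError("Unknown value for sides: %s" % sides)
```

Note that 'both' adds the full duration on each side, so the segment grows by twice `duration`.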

random_subsegment(subsegment_length, rng=None)[source]

Randomly cut a subsegment of the specified length from the audio segment.

Note that this is an in-place transformation.

Parameters:
  • subsegment_length (float) -- Subsegment length in seconds.

  • rng (random.Random) -- Random number generator state.

Raises:

ValueError -- If the length of the subsegment is greater than that of the original segment.

resample(target_sample_rate, filter='kaiser_best')[source]

Resample the audio to a target sample rate.

Note that this is an in-place transformation.

Parameters:
  • target_sample_rate (int) -- Target sample rate.

  • filter (str) -- The resampling filter to use, one of {'kaiser_best', 'kaiser_fast'}.

property rms_db

Return root mean square energy of the audio in decibels.

Returns:

Root mean square energy in decibels.

Return type:

float

property sample_rate

Return audio sample rate.

Returns:

Audio sample rate.

Return type:

int

property samples

Return audio samples.

Returns:

Audio samples.

Return type:

ndarray

shift(shift_ms)[source]

Shift the audio in time. If shift_ms is positive, shift with time advance; if negative, shift with time delay. Silence is padded to keep the duration unchanged.

Note that this is an in-place transformation.

Parameters:

shift_ms (float) -- Shift time in milliseconds. If positive, shift with time advance; if negative, shift with time delay.

Raises:

ValueError -- If shift_ms is longer than audio duration.
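The following plain-Python sketch shows one reading of the shift semantics: a positive shift (time advance) moves the content earlier and pads silence at the end, while a negative shift (time delay) moves it later and pads silence at the start. This interpretation of "advance" and "delay" and the standalone function form are assumptions of the example.

```python
def shift(samples, sample_rate, shift_ms):
    """Return a time-shifted copy of samples with the same length."""
    samples = list(samples)
    duration = len(samples) / sample_rate
    if abs(shift_ms) / 1000.0 > duration:
        raise ValueError("Absolute value of shift_ms should be smaller "
                         "than audio duration.")
    n = int(shift_ms * sample_rate / 1000)
    if n > 0:
        # Time advance: drop the first n samples, pad silence at the end.
        return samples[n:] + [0.0] * n
    if n < 0:
        # Time delay: pad silence at the start, drop the last -n samples.
        return [0.0] * (-n) + samples[:n]
    return samples
```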

classmethod slice_from_file(file, start=None, end=None)[source]

Load a small section of an audio file without reading the entire file into memory.

Parameters:
  • file (str|file) -- Input audio filepath or file object.

  • start (float) -- Start time in seconds. If start is negative, it wraps around from the end. If not provided, this function reads from the very beginning.

  • end (float) -- End time in seconds. If end is negative, it wraps around from the end. If not provided, the default behavior is to read to the end of the file.

Returns:

AudioSegment instance of the specified slice of the input audio file.

Return type:

AudioSegment

Raises:

ValueError -- If start or end is incorrectly set, e.g. out of bounds in time.
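The wrap-around rule for negative `start` and `end` can be sketched as a small boundary-resolution step that turns times into frame indices. `resolve_slice` is a hypothetical helper, and taking the file's total duration as an input is an assumption of this example; the library reads it from the audio file itself.

```python
def resolve_slice(duration, sample_rate, start=None, end=None):
    """Convert start/end times in seconds to (start_frame, end_frame)."""
    start = 0.0 if start is None else start
    end = duration if end is None else end
    # Negative times count back from the end of the file.
    if start < 0.0:
        start += duration
    if end < 0.0:
        end += duration
    if start < 0.0 or end < 0.0:
        raise ValueError("The slice boundary is out of bounds.")
    if start > end:
        raise ValueError("The slice start is later than the slice end.")
    if end > duration:
        raise ValueError("The slice end is out of bounds.")
    return int(start * sample_rate), int(end * sample_rate)
```

For a 10-second file, `start=-2.0` with no `end` selects the final two seconds.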

subsegment(start_sec=None, end_sec=None)[source]

Cut the AudioSegment between given boundaries.

Note that this is an in-place transformation.

Parameters:
  • start_sec (float) -- Beginning of subsegment in seconds.

  • end_sec (float) -- End of subsegment in seconds.

Raises:

ValueError -- If start_sec or end_sec is incorrectly set, e.g. out of bounds in time.

superimpose(other)[source]

Add samples from another segment to those of this segment (sample-wise addition, not segment concatenation).

Note that this is an in-place transformation.

Parameters:

other (AudioSegment) -- Segment containing samples to be added in.

Raises:
  • TypeError -- If the types of the two segments do not match.

  • ValueError -- If the sample rates of the two segments are not equal, or if the lengths of segments don't match.

to(dtype='int16')[source]

Return the audio content converted to the given dtype.

Parameters:

dtype (str) -- Data type for export samples. Options: 'int16', 'int32', 'float32', 'float64'. Default is 'int16'.

Returns:

np.ndarray containing dtype audio content.

Return type:

ndarray
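Converting floating-point samples in [-1.0, 1.0] to an integer type usually means scaling by the integer range and clipping. The sketch below shows that common convention for signed integers of a given bit width; whether the library uses exactly this scaling and truncation is an assumption, and `float_to_int` is a hypothetical name.

```python
def float_to_int(samples, bits=16):
    """Scale float samples to signed integers of the given width, with clipping."""
    scale = 2 ** (bits - 1)
    lowest, highest = -scale, scale - 1
    # int() truncates toward zero; values outside the range are clipped.
    return [max(lowest, min(highest, int(x * scale))) for x in samples]
```

Full-scale positive input (1.0) cannot be represented exactly and is clipped to the largest integer value.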

to_bytes(dtype='float32')[source]

Create a byte string containing the audio content.

Parameters:

dtype (str) -- Data type for export samples. Options: 'int16', 'int32', 'float32', 'float64'. Default is 'float32'.

Returns:

Byte string containing audio content.

Return type:

str

to_wav_file(filepath, dtype='float32')[source]

Save audio segment to disk as wav file.

Parameters:
  • filepath (str|file) -- WAV filepath or file object to save the audio segment.

  • dtype (str) -- Subtype for audio file. Options: 'int16', 'int32', 'float32', 'float64'. Default is 'float32'.

Raises:

TypeError -- If dtype is not supported.