paddleaudio.compliance.librosa module

paddleaudio.compliance.librosa.adaptive_spect_augment(spect: ndarray, tempo_axis: int = 0, level: float = 0.1) -> ndarray

Do adaptive spectrogram augmentation. The strength of the augmentation is governed by the parameter level, which ranges from 0 to 1; 0 means no augmentation.

Args:

spect (np.ndarray): Input spectrogram.
tempo_axis (int, optional): Index of the tempo (time) axis. Defaults to 0.
level (float, optional): The level factor of masking. Defaults to 0.1.

Returns:

np.ndarray: The augmented spectrogram.

paddleaudio.compliance.librosa.compute_fbank_matrix(sr: int, n_fft: int, n_mels: int = 128, fmin: float = 0.0, fmax: Optional[float] = None, htk: bool = False, norm: str = 'slaney', dtype: type = np.float32) -> ndarray

Compute the mel filter bank (fbank) matrix.

Args:

sr (int): Sample rate.
n_fft (int): FFT size.
n_mels (int, optional): Number of mel bins. Defaults to 128.
fmin (float, optional): Minimum frequency in Hz. Defaults to 0.0.
fmax (Optional[float], optional): Maximum frequency in Hz. Defaults to None.
htk (bool, optional): Use HTK scaling. Defaults to False.
norm (str, optional): Type of normalization. Defaults to "slaney".
dtype (type, optional): Data type. Defaults to np.float32.

Returns:

np.ndarray: Mel transform matrix with shape (n_mels, n_fft//2 + 1).
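To illustrate where the shape (n_mels, n_fft//2 + 1) comes from, here is a minimal pure-Python sketch of a triangular mel filter bank. It is not the library's implementation: the helper names are hypothetical, it uses the HTK mel formula only, applies no normalization, and assumes that fmax falls back to sr / 2 when it is None.

```python
import math


def hz_to_mel_htk(f):
    # HTK mel formula (assumption: corresponds to htk=True).
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz_htk(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def triangular_fbank(sr, n_fft, n_mels, fmin=0.0, fmax=None):
    """Unnormalized triangular filters as nested lists, shape (n_mels, n_fft // 2 + 1)."""
    fmax = sr / 2.0 if fmax is None else fmax  # assumed fallback
    n_bins = n_fft // 2 + 1
    # Center frequency of every FFT bin.
    fft_freqs = [k * sr / n_fft for k in range(n_bins)]
    # n_mels + 2 band edges, evenly spaced on the mel scale.
    lo, hi = hz_to_mel_htk(fmin), hz_to_mel_htk(fmax)
    edges = [mel_to_hz_htk(lo + (hi - lo) * i / (n_mels + 1))
             for i in range(n_mels + 2)]
    fbank = []
    for m in range(n_mels):
        left, center, right = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in fft_freqs:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            row.append(max(0.0, min(rising, falling)))
        fbank.append(row)
    return fbank


fbank = triangular_fbank(sr=16000, n_fft=512, n_mels=40)
print(len(fbank), len(fbank[0]))  # 40 257
```

Each row is one mel band; multiplying this matrix by a power spectrogram of shape (n_fft//2 + 1, num_frames) gives a mel spectrogram of shape (n_mels, num_frames).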

paddleaudio.compliance.librosa.depth_augment(y: ndarray, choices: List = ['int8', 'int16'], probs: List[float] = [0.5, 0.5]) -> ndarray

Audio depth augmentation. Simulates the distortion introduced by quantization.

Args:

y (np.ndarray): Input waveform array in 1D or 2D.
choices (List, optional): Candidate data types for the depth conversion. Defaults to ['int8', 'int16'].
probs (List[float], optional): Probability of selecting each data type in choices. Defaults to [0.5, 0.5].

Returns:

np.ndarray: The augmented waveform.
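The idea of simulating quantization distortion can be sketched as a round trip through a lower bit depth. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, the waveform is assumed to lie in [-1, 1], and the scale factor 2**(bits - 1) - 1 is one common choice.

```python
import random


def quantize_roundtrip(y, bits):
    """Quantize samples in [-1, 1] to a signed integer grid and map them back."""
    scale = 2 ** (bits - 1) - 1  # assumed scaling, e.g. 127 for 8 bits
    return [round(v * scale) / scale for v in y]


def depth_augment_sketch(y, choices=(8, 16), probs=(0.5, 0.5), rng=random):
    """Pick a bit depth at random (weighted by probs) and apply the round trip."""
    bits = rng.choices(choices, weights=probs, k=1)[0]
    return quantize_roundtrip(y, bits)


wave = [0.0, 0.5, -0.25, 0.123456, 1.0]
coarse = quantize_roundtrip(wave, 8)
augmented = depth_augment_sketch(wave)
```

The lower the chosen bit depth, the larger the rounding error added to each sample.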

paddleaudio.compliance.librosa.hz_to_mel(frequencies: Union[float, List[float], ndarray], htk: bool = False) -> ndarray

Convert Hz to Mels.

Args:

frequencies (Union[float, List[float], np.ndarray]): Frequencies in Hz.
htk (bool, optional): Use HTK scaling. Defaults to False.

Returns:

np.ndarray: Frequency in mels.
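For reference, the two mel scales selected by htk can be sketched for a single scalar frequency as follows. This assumes the module follows the same conventions as librosa (HTK formula when htk is True, Slaney-style piecewise scale otherwise); the function name is hypothetical.

```python
import math


def hz_to_mel_sketch(f, htk=False):
    """Convert one frequency in Hz to mels (scalar sketch)."""
    if htk:
        # HTK formula.
        return 2595.0 * math.log10(1.0 + f / 700.0)
    # Slaney-style scale: linear below 1 kHz, logarithmic above.
    f_sp = 200.0 / 3.0
    min_log_hz = 1000.0
    min_log_mel = min_log_hz / f_sp          # 15 mels at 1 kHz
    logstep = math.log(6.4) / 27.0
    if f >= min_log_hz:
        return min_log_mel + math.log(f / min_log_hz) / logstep
    return f / f_sp


mel_slaney = hz_to_mel_sketch(1000.0)
mel_htk = hz_to_mel_sketch(1000.0, htk=True)
```

The two scales give very different numeric values for the same frequency, so the htk flag must match between hz_to_mel and mel_to_hz.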

paddleaudio.compliance.librosa.mel_frequencies(n_mels: int = 128, fmin: float = 0.0, fmax: float = 11025.0, htk: bool = False) -> ndarray

Compute mel frequencies.

Args:

n_mels (int, optional): Number of mel bins. Defaults to 128.
fmin (float, optional): Minimum frequency in Hz. Defaults to 0.0.
fmax (float, optional): Maximum frequency in Hz. Defaults to 11025.0.
htk (bool, optional): Use HTK scaling. Defaults to False.

Returns:

np.ndarray: Vector of n_mels frequencies in Hz with shape (n_mels,).
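A minimal sketch of how such a vector can be built: take n_mels points evenly spaced on the mel scale between fmin and fmax, then convert each back to Hz. The helpers below are hypothetical, use the HTK formula only, and assume both endpoints are included.

```python
import math


def hz_to_mel_htk(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz_htk(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_frequencies_sketch(n_mels=128, fmin=0.0, fmax=11025.0):
    """Return n_mels frequencies in Hz, evenly spaced in mels (needs n_mels >= 2)."""
    lo, hi = hz_to_mel_htk(fmin), hz_to_mel_htk(fmax)
    step = (hi - lo) / (n_mels - 1)
    return [mel_to_hz_htk(lo + i * step) for i in range(n_mels)]


freqs = mel_frequencies_sketch()
```

Because the spacing is uniform in mels, the gaps between consecutive frequencies in Hz grow toward fmax.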

paddleaudio.compliance.librosa.mel_to_hz(mels: Union[float, List[float], ndarray], htk: int = False) -> ndarray

Convert mel bin numbers to frequencies.

Args:

mels (Union[float, List[float], np.ndarray]): Frequency in mels.
htk (bool, optional): Use HTK scaling. Defaults to False.

Returns:

np.ndarray: Frequencies in Hz.
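The inverse mapping can be sketched for a single scalar in the same way. As before, this assumes librosa-compatible conventions (HTK formula when htk is True, Slaney-style piecewise scale otherwise), and the function name is hypothetical.

```python
import math


def mel_to_hz_sketch(m, htk=False):
    """Convert one mel value to Hz (scalar sketch)."""
    if htk:
        # Inverse of the HTK formula.
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # Slaney-style scale: linear below 15 mels (1 kHz), exponential above.
    f_sp = 200.0 / 3.0
    min_log_hz = 1000.0
    min_log_mel = min_log_hz / f_sp
    logstep = math.log(6.4) / 27.0
    if m >= min_log_mel:
        return min_log_hz * math.exp(logstep * (m - min_log_mel))
    return f_sp * m


hz_slaney = mel_to_hz_sketch(15.0)
hz_htk = mel_to_hz_sketch(2595.0 * math.log10(2.0), htk=True)
```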

paddleaudio.compliance.librosa.melspectrogram(x: ndarray, sr: int = 16000, window_size: int = 512, hop_length: int = 320, n_mels: int = 64, fmin: float = 50.0, fmax: Optional[float] = None, window: str = 'hann', center: bool = True, pad_mode: str = 'reflect', power: float = 2.0, to_db: bool = True, ref: float = 1.0, amin: float = 1e-10, top_db: Optional[float] = None) -> ndarray

Compute mel-spectrogram.

Args:

x (np.ndarray): Input waveform in one dimension.
sr (int, optional): Sample rate. Defaults to 16000.
window_size (int, optional): Size of FFT and window length. Defaults to 512.
hop_length (int, optional): Number of samples to advance between adjacent windows. Defaults to 320.
n_mels (int, optional): Number of mel bins. Defaults to 64.
fmin (float, optional): Minimum frequency in Hz. Defaults to 50.0.
fmax (Optional[float], optional): Maximum frequency in Hz. Defaults to None.
window (str, optional): A string of window specification. Defaults to "hann".
center (bool, optional): Whether to pad x so that the t-th frame is centered at \(t \times hop\_length\). Defaults to True.
pad_mode (str, optional): Padding mode used when center is True. Defaults to "reflect".
power (float, optional): Exponent for the magnitude melspectrogram. Defaults to 2.0.
to_db (bool, optional): Whether to convert the output to dB scale. Defaults to True.
ref (float, optional): The reference value. If smaller than 1.0, the dB level of the signal will be pulled up accordingly. Otherwise, the dB level is pushed down. Defaults to 1.0.
amin (float, optional): Minimum threshold. Defaults to 1e-10.
top_db (Optional[float], optional): Threshold the output at top_db below the peak. Defaults to None.

Returns:

np.ndarray: The mel-spectrogram in power scale or db scale with shape (n_mels, num_frames).

paddleaudio.compliance.librosa.mfcc(x: ndarray, sr: int = 16000, spect: Optional[ndarray] = None, n_mfcc: int = 20, dct_type: int = 2, norm: str = 'ortho', lifter: int = 0, **kwargs) -> ndarray

Compute mel-frequency cepstral coefficients (MFCCs).

Args:

x (np.ndarray): Input waveform in one dimension.
sr (int, optional): Sample rate. Defaults to 16000.
spect (Optional[np.ndarray], optional): Input log-power Mel spectrogram. Defaults to None.
n_mfcc (int, optional): Number of cepstra in MFCC. Defaults to 20.
dct_type (int, optional): Discrete cosine transform (DCT) type. Defaults to 2.
norm (str, optional): Type of normalization. Defaults to "ortho".
lifter (int, optional): Cepstral filtering. Defaults to 0.

Returns:

np.ndarray: Mel frequency cepstral coefficients array with shape (n_mfcc, num_frames).
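With the defaults dct_type=2 and norm="ortho", the core step is an orthonormal DCT-II applied along the mel axis of a log-power mel spectrogram, keeping the first n_mfcc coefficients. The following pure-Python sketch shows that step only; the function names are hypothetical, and liftering is left out (the default lifter=0 is assumed to mean no liftering).

```python
import math


def dct2_ortho(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def mfcc_from_log_mel(log_mel, n_mfcc):
    """log_mel has shape (n_mels, num_frames); result has shape (n_mfcc, num_frames)."""
    num_frames = len(log_mel[0])
    # DCT over the mel axis, one frame (column) at a time.
    columns = [dct2_ortho([row[t] for row in log_mel])[:n_mfcc]
               for t in range(num_frames)]
    # Transpose back so that coefficients index the rows.
    return [[col[k] for col in columns] for k in range(n_mfcc)]


log_mel = [[1.0, 1.0, 1.0] for _ in range(4)]  # 4 mel bins, 3 frames
coeffs = mfcc_from_log_mel(log_mel, n_mfcc=2)
```

For a spectrum that is flat across mel bins, all the energy lands in coefficient 0 and the higher coefficients vanish.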

paddleaudio.compliance.librosa.mu_decode(y: ndarray, mu: int = 255, quantized: bool = True) -> ndarray

Mu-law decoding. Compute the mu-law decoding of an input code. The input y is assumed to be in the range [0,mu-1] when quantized is True and in [-1,1] otherwise.

Args:

y (np.ndarray): The encoded waveform.
mu (int, optional): The encoding parameter. Defaults to 255.
quantized (bool, optional): If True, the input is assumed to be quantized to 1 + mu distinct integer values. Defaults to True.

Returns:

np.ndarray: The mu-law decoded waveform.

paddleaudio.compliance.librosa.mu_encode(x: ndarray, mu: int = 255, quantized: bool = True) -> ndarray

Mu-law encoding. Encode a waveform using mu-law companding. When quantized is True, the result is converted to integers in the range [0,mu-1]. Otherwise, the resulting waveform is in the range [-1,1].

Args:

x (np.ndarray): The input waveform to encode.
mu (int, optional): The encoding parameter. Defaults to 255.
quantized (bool, optional): If True, quantize the encoded values into 1 + mu distinct integer values. Defaults to True.

Returns:

np.ndarray: The mu-law encoded waveform.
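The companding behind mu_encode and mu_decode can be sketched for the continuous case (the behavior described for quantized=False, with values in [-1, 1]). The scalar helpers below are hypothetical and deliberately leave out the integer quantization step, whose exact mapping is not shown here.

```python
import math


def mu_encode_cont(x, mu=255):
    """Standard mu-law compression of one sample in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_decode_cont(y, mu=255):
    """Inverse of mu_encode_cont: mu-law expansion of one value in [-1, 1]."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)


samples = [-1.0, -0.3, 0.0, 0.01, 0.7, 1.0]
encoded = [mu_encode_cont(v) for v in samples]
decoded = [mu_decode_cont(v) for v in encoded]
```

Compression expands small amplitudes (0.01 maps to roughly 0.23), which is why quantizing the encoded signal preserves quiet passages better than quantizing the raw waveform.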

paddleaudio.compliance.librosa.power_to_db(spect: ndarray, ref: float = 1.0, amin: float = 1e-10, top_db: Optional[float] = 80.0) -> ndarray

Convert a power spectrogram (amplitude squared) to decibel (dB) units. The function computes the scaling 10 * log10(x / ref) in a numerically stable way.

Args:

spect (np.ndarray): STFT power spectrogram of an input waveform.
ref (float, optional): The reference value. If smaller than 1.0, the dB level of the signal will be pulled up accordingly. Otherwise, the dB level is pushed down. Defaults to 1.0.
amin (float, optional): Minimum threshold. Defaults to 1e-10.
top_db (Optional[float], optional): Threshold the output at top_db below the peak. Defaults to 80.0.

Returns:

np.ndarray: Power spectrogram in db scale.
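A minimal sketch of the scaling described above, on nested lists rather than arrays. It assumes that amin is applied as a floor to both the input and ref before the logarithm, and that top_db clips everything more than top_db below the maximum; the function name is hypothetical.

```python
import math


def power_to_db_sketch(spect, ref=1.0, amin=1e-10, top_db=80.0):
    """Convert a 2-D power spectrogram (list of rows) to dB."""
    ref_db = 10.0 * math.log10(max(amin, ref))
    # Flooring at amin avoids log10(0).
    db = [[10.0 * math.log10(max(amin, v)) - ref_db for v in row]
          for row in spect]
    if top_db is not None:
        floor = max(max(row) for row in db) - top_db
        db = [[max(v, floor) for v in row] for row in db]
    return db


db = power_to_db_sketch([[1.0, 0.1, 1e-12]])
```

Here 1.0 maps to 0 dB and 0.1 to about -10 dB, while 1e-12 is first floored to amin (-100 dB) and then clipped to -80 dB by top_db.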

paddleaudio.compliance.librosa.random_crop1d(y: ndarray, crop_len: int) -> ndarray

Random cropping on an input waveform.

Args:

y (np.ndarray): Input waveform array in 1D.
crop_len (int): Length of waveform to crop.

Returns:

np.ndarray: The cropped waveform.
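Random cropping amounts to picking a random start offset and taking a contiguous slice of crop_len samples. The sketch below uses a plain list and a hypothetical function name; raising an error when crop_len exceeds the input length is an assumption, not documented behavior.

```python
import random


def random_crop1d_sketch(y, crop_len, rng=random):
    """Return a random contiguous slice of length crop_len."""
    if crop_len > len(y):
        # Assumed behavior for this sketch only.
        raise ValueError("crop_len must not exceed the input length")
    start = rng.randint(0, len(y) - crop_len)  # inclusive bounds
    return y[start:start + crop_len]


wave = list(range(10))
crop = random_crop1d_sketch(wave, 4)
```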

paddleaudio.compliance.librosa.random_crop2d(s: ndarray, crop_len: int, tempo_axis: int = 0) -> ndarray

Random cropping on a spectrogram.

Args:

s (np.ndarray): Input spectrogram in 2D.
crop_len (int): Length of spectrogram to crop.
tempo_axis (int, optional): Index of the tempo (time) axis. Defaults to 0.

Returns:

np.ndarray: The cropped spectrogram.

paddleaudio.compliance.librosa.spect_augment(spect: ndarray, tempo_axis: int = 0, max_time_mask: int = 3, max_freq_mask: int = 3, max_time_mask_width: int = 30, max_freq_mask_width: int = 20) -> ndarray

Apply spectrogram augmentation along both the time and frequency axes.

Args:

spect (np.ndarray): Input spectrogram.
tempo_axis (int, optional): Index of the tempo (time) axis. Defaults to 0.
max_time_mask (int, optional): Maximum number of time masks. Defaults to 3.
max_freq_mask (int, optional): Maximum number of frequency masks. Defaults to 3.
max_time_mask_width (int, optional): Maximum width of a time mask. Defaults to 30.
max_freq_mask_width (int, optional): Maximum width of a frequency mask. Defaults to 20.

Returns:

np.ndarray: The augmented spectrogram.
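Time and frequency masking can be sketched as follows for a spectrogram stored as a list of frames (that is, tempo_axis=0). The details are assumptions made for illustration: the function name is hypothetical, the number and width of masks are drawn uniformly, and masked cells are set to 0.0, which may differ from the library's fill value.

```python
import random


def spect_augment_sketch(spect, max_time_mask=3, max_freq_mask=3,
                         max_time_mask_width=30, max_freq_mask_width=20,
                         rng=random):
    """Mask random bands of a (num_frames, n_freq) spectrogram; input is not modified."""
    out = [row[:] for row in spect]
    n_t, n_f = len(out), len(out[0])
    # Time masks: blank out runs of consecutive frames.
    for _ in range(rng.randint(0, max_time_mask)):
        width = rng.randint(0, min(max_time_mask_width, n_t))
        start = rng.randint(0, n_t - width)
        for t in range(start, start + width):
            out[t] = [0.0] * n_f
    # Frequency masks: blank out runs of consecutive bins in every frame.
    for _ in range(rng.randint(0, max_freq_mask)):
        width = rng.randint(0, min(max_freq_mask_width, n_f))
        start = rng.randint(0, n_f - width)
        for row in out:
            row[start:start + width] = [0.0] * width
    return out


spect = [[1.0] * 40 for _ in range(100)]  # 100 frames, 40 frequency bins
masked = spect_augment_sketch(spect)
```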

paddleaudio.compliance.librosa.spectrogram(x: ndarray, sr: int = 16000, window_size: int = 512, hop_length: int = 320, window: str = 'hann', center: bool = True, pad_mode: str = 'reflect', power: float = 2.0) -> ndarray

Compute spectrogram.

Args:

x (np.ndarray): Input waveform in one dimension.
sr (int, optional): Sample rate. Defaults to 16000.
window_size (int, optional): Size of FFT and window length. Defaults to 512.
hop_length (int, optional): Number of samples to advance between adjacent windows. Defaults to 320.
window (str, optional): A string of window specification. Defaults to "hann".
center (bool, optional): Whether to pad x so that the t-th frame is centered at \(t \times hop\_length\). Defaults to True.
pad_mode (str, optional): Padding mode used when center is True. Defaults to "reflect".
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.

Returns:

np.ndarray: The STFT spectrogram in power scale with shape (n_fft//2 + 1, num_frames).

paddleaudio.compliance.librosa.stft(x: ndarray, n_fft: int = 2048, hop_length: Optional[int] = None, win_length: Optional[int] = None, window: str = 'hann', center: bool = True, dtype: type = np.complex64, pad_mode: str = 'reflect') -> ndarray

Short-time Fourier transform (STFT).

Args:

x (np.ndarray): Input waveform in one dimension.
n_fft (int, optional): FFT size. Defaults to 2048.
hop_length (Optional[int], optional): Number of samples to advance between adjacent windows. Defaults to None.
win_length (Optional[int], optional): The size of window. Defaults to None.
window (str, optional): A string of window specification. Defaults to "hann".
center (bool, optional): Whether to pad x so that the t-th frame is centered at \(t \times hop\_length\). Defaults to True.
dtype (type, optional): Data type of STFT results. Defaults to np.complex64.
pad_mode (str, optional): Padding mode used when center is True. Defaults to "reflect".

Returns:

np.ndarray: The complex STFT output with shape (n_fft//2 + 1, num_frames).
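The output shape can be worked out without running the transform. The sketch below assumes librosa-style framing: with center=True the signal is padded by n_fft // 2 samples on each side, and frames of n_fft samples are taken every hop_length samples. The helper names are hypothetical, and hop_length must be given explicitly.

```python
def num_frames(n_samples, n_fft, hop_length, center=True):
    """Number of STFT frames for a signal of n_samples samples."""
    if center:
        n_samples += 2 * (n_fft // 2)  # padding on both sides
    return 1 + (n_samples - n_fft) // hop_length


def stft_shape(n_samples, n_fft, hop_length, center=True):
    """Expected output shape: (n_fft // 2 + 1, num_frames)."""
    return (n_fft // 2 + 1, num_frames(n_samples, n_fft, hop_length, center))


# One second of audio at 16 kHz, 512-point FFT, hop of 320 samples.
shape_centered = stft_shape(16000, 512, 320)                # (257, 51)
shape_plain = stft_shape(16000, 512, 320, center=False)     # (257, 49)
```

With center=True the frame count reduces to 1 + n_samples // hop_length when n_fft is even, so it does not depend on n_fft.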