paddlespeech.t2s.models.transformer_tts.transformer_tts module

TransformerTTS related modules for paddle

class paddlespeech.t2s.models.transformer_tts.transformer_tts.TransformerTTS(idim: int, odim: int, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, dprenet_layers: int = 2, dprenet_units: int = 256, elayers: int = 6, eunits: int = 1024, adim: int = 512, aheads: int = 4, dlayers: int = 6, dunits: int = 1024, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, reduction_factor: int = 1, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, transformer_enc_dec_attn_dropout_rate: float = 0.1, eprenet_dropout_rate: float = 0.5, dprenet_dropout_rate: float = 0.5, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_guided_attn_loss: bool = True, num_heads_applied_guided_attn: int = 2, num_layers_applied_guided_attn: int = 2)[source]

Bases: Layer

TTS-Transformer module.

This is a module of the text-to-speech Transformer described in Neural Speech Synthesis with Transformer Network, which converts a sequence of tokens into a sequence of Mel-filterbanks.

Args:
idim (int):

Dimension of the inputs.

odim (int):

Dimension of the outputs.

embed_dim (int, optional):

Dimension of character embedding.

eprenet_conv_layers (int, optional):

Number of encoder prenet convolution layers.

eprenet_conv_chans (int, optional):

Number of encoder prenet convolution channels.

eprenet_conv_filts (int, optional):

Filter size of encoder prenet convolution.

dprenet_layers (int, optional):

Number of decoder prenet layers.

dprenet_units (int, optional):

Number of decoder prenet hidden units.

elayers (int, optional):

Number of encoder layers.

eunits (int, optional):

Number of encoder hidden units.

adim (int, optional):

Number of attention transformation dimensions.

aheads (int, optional):

Number of heads for multi head attention.

dlayers (int, optional):

Number of decoder layers.

dunits (int, optional):

Number of decoder hidden units.

postnet_layers (int, optional):

Number of postnet layers.

postnet_chans (int, optional):

Number of postnet channels.

postnet_filts (int, optional):

Filter size of postnet.

use_scaled_pos_enc (bool, optional):

Whether to use trainable scaled positional encoding.

use_batch_norm (bool, optional):

Whether to use batch normalization in encoder prenet.

encoder_normalize_before (bool, optional):

Whether to perform layer normalization before encoder block.

decoder_normalize_before (bool, optional):

Whether to perform layer normalization before decoder block.

encoder_concat_after (bool, optional):

Whether to concatenate attention layer's input and output in encoder.

decoder_concat_after (bool, optional):

Whether to concatenate attention layer's input and output in decoder.

positionwise_layer_type (str, optional):

Position-wise operation type.

positionwise_conv_kernel_size (int, optional):

Kernel size in position-wise conv1d.

reduction_factor (int, optional):

Reduction factor.
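
A reduction factor of r means the decoder emits r frames per step, shortening the autoregressive loop. As a minimal pure-Python illustration (not the actual implementation), the number of decoder steps for a target of `num_frames` frames is:

```python
import math

def decoder_steps(num_frames: int, reduction_factor: int) -> int:
    """Number of decoder steps when each step emits `reduction_factor` frames."""
    return math.ceil(num_frames / reduction_factor)

# reduction_factor=1: one step per frame; reduction_factor=2: about half as many.
print(decoder_steps(100, 1))  # 100
print(decoder_steps(100, 2))  # 50
print(decoder_steps(101, 2))  # 51
```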

spk_embed_dim (int, optional):

Number of speaker embedding dimensions.

spk_embed_integration_type (str, optional):

How to integrate speaker embedding.
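
The two common integration modes are "add" (the speaker embedding is, after a learned projection, summed onto every hidden state) and "concat" (it is appended to every hidden state and then projected back down). The sketch below, which omits the learned projections for brevity and is not the actual implementation, shows the shape behavior of each mode:

```python
def integrate_spk_emb(hidden, spk_emb, mode="add"):
    """Toy integration of a speaker embedding into per-frame hidden states.

    "add":    element-wise sum with each frame (dims must match).
    "concat": append the embedding to each frame (dim grows).
    Learned projection layers are omitted in this sketch.
    """
    if mode == "add":
        assert len(hidden[0]) == len(spk_emb)
        return [[h + s for h, s in zip(frame, spk_emb)] for frame in hidden]
    if mode == "concat":
        return [frame + spk_emb for frame in hidden]
    raise ValueError(f"unknown mode: {mode}")

hidden = [[1.0, 2.0], [3.0, 4.0]]   # 2 frames, hidden dim 2
spk = [0.5, 0.5]                    # speaker embedding, dim 2
print(integrate_spk_emb(hidden, spk, "add"))     # [[1.5, 2.5], [3.5, 4.5]]
print(integrate_spk_emb(hidden, spk, "concat"))  # dim doubles per frame
```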

use_gst (bool, optional):

Whether to use global style token.

gst_tokens (int, optional):

The number of GST embeddings.

gst_heads (int, optional):

The number of heads in GST multihead attention.

gst_conv_layers (int, optional):

The number of conv layers in GST.

gst_conv_chans_list (Sequence[int], optional):

List of the number of channels of conv layers in GST.

gst_conv_kernel_size (int, optional):

Kernel size of conv layers in GST.

gst_conv_stride (int, optional):

Stride size of conv layers in GST.

gst_gru_layers (int, optional):

The number of GRU layers in GST.

gst_gru_units (int, optional):

The number of GRU units in GST.
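
A global style token (GST) layer summarizes a reference mel spectrogram into a style embedding: a reference encoder (the conv + GRU stack configured above) produces a query, and multi-head attention over the learned token bank yields a softmax-weighted sum of token embeddings. The core weighted-sum step can be sketched in pure Python (single head, precomputed scores; not the actual implementation):

```python
import math

def style_embedding(scores, tokens):
    """Softmax-weighted sum of style-token embeddings.

    scores: one attention score per token (from the reference-encoder query).
    tokens: list of token embedding vectors, all of equal dimension.
    """
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

tokens = [[1.0, 0.0], [0.0, 1.0]]      # 2 style tokens, dim 2
print(style_embedding([0.0, 0.0], tokens))  # equal scores -> average of tokens
```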

transformer_lr (float, optional):

Initial value of learning rate.

transformer_warmup_steps (int, optional):

Optimizer warmup steps.

transformer_enc_dropout_rate (float, optional):

Dropout rate in encoder except attention and positional encoding.

transformer_enc_positional_dropout_rate (float, optional):

Dropout rate after encoder positional encoding.

transformer_enc_attn_dropout_rate (float, optional):

Dropout rate in encoder self-attention module.

transformer_dec_dropout_rate (float, optional):

Dropout rate in decoder except attention and positional encoding.

transformer_dec_positional_dropout_rate (float, optional):

Dropout rate after decoder positional encoding.

transformer_dec_attn_dropout_rate (float, optional):

Dropout rate in decoder self-attention module.

transformer_enc_dec_attn_dropout_rate (float, optional):

Dropout rate in encoder-decoder attention module.

init_type (str, optional):

How to initialize transformer parameters.

init_enc_alpha (float, optional):

Initial value of alpha in scaled pos encoding of the encoder.

init_dec_alpha (float, optional):

Initial value of alpha in scaled pos encoding of the decoder.
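
With use_scaled_pos_enc enabled, the sinusoidal positional encoding is multiplied by a trainable scalar alpha before being added to the embeddings, and init_enc_alpha / init_dec_alpha set its starting value. A minimal pure-Python sketch of the idea (not the actual implementation):

```python
import math

def positional_encoding(length, dim):
    """Standard sinusoidal positional encoding table of shape (length, dim)."""
    pe = [[0.0] * dim for _ in range(length)]
    for pos in range(length):
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def scaled_pos_enc(x, alpha):
    """Add the alpha-scaled positional encoding to input frames x."""
    pe = positional_encoding(len(x), len(x[0]))
    return [[v + alpha * p for v, p in zip(row, pe_row)]
            for row, pe_row in zip(x, pe)]

x = [[0.0] * 4 for _ in range(3)]    # 3 zero frames, dim 4
print(scaled_pos_enc(x, 1.0)[0])     # [0.0, 1.0, 0.0, 1.0] at position 0
```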

eprenet_dropout_rate (float, optional):

Dropout rate in encoder prenet.

dprenet_dropout_rate (float, optional):

Dropout rate in decoder prenet.

postnet_dropout_rate (float, optional):

Dropout rate in postnet.

use_masking (bool, optional):

Whether to apply masking for padded part in loss calculation.

use_weighted_masking (bool, optional):

Whether to apply weighted masking in loss calculation.

bce_pos_weight (float, optional):

Positive sample weight in bce calculation (only for use_masking=true).
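
Masking restricts the loss to real frames so that zero-padding in a batch does not dilute the gradient. A toy pure-Python version of a masked L1 loss (not the actual implementation, which operates on paddle tensors):

```python
def masked_l1(pred, target, lengths):
    """Mean absolute error over non-padded frames only.

    pred, target: batch of 1-D frame sequences (padded to equal length).
    lengths: true length of each sequence; frames beyond it are ignored.
    """
    total, count = 0.0, 0
    for p_seq, t_seq, length in zip(pred, target, lengths):
        for p, t in zip(p_seq[:length], t_seq[:length]):
            total += abs(p - t)
            count += 1
    return total / count

# The padded third frame (9.9 vs 0.0) is excluded by lengths=[2].
print(masked_l1([[1.0, 2.0, 9.9]], [[1.0, 1.0, 0.0]], [2]))  # 0.5
```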

loss_type (str, optional):

How to calculate loss.

use_guided_attn_loss (bool, optional):

Whether to use guided attention loss.

num_heads_applied_guided_attn (int, optional):

Number of heads in each layer to apply guided attention loss.

num_layers_applied_guided_attn (int, optional):

Number of layers to apply guided attention loss.
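
Guided attention loss (Tachibana et al.) pushes selected attention heads toward a roughly diagonal text-to-speech alignment by penalizing attention mass far from the diagonal. The penalty matrix can be written W[t][n] = 1 - exp(-((n/N - t/T)^2) / (2 sigma^2)); here is a small pure-Python sketch (sigma is a hyperparameter, 0.4 used only as an illustrative default):

```python
import math

def guided_attention_weight(t_in, t_out, sigma=0.4):
    """Penalty matrix W of shape (t_out, t_in): ~0 on the diagonal, large off it."""
    return [[1.0 - math.exp(-((n / t_in - t / t_out) ** 2) / (2 * sigma ** 2))
             for n in range(t_in)]
            for t in range(t_out)]

w = guided_attention_weight(4, 4)
print(w[0][0])  # 0.0 -- attending on the diagonal is free
print(w[0][3])  # large -- attending far off the diagonal is penalized
```

The loss is then the mean of the element-wise product of W with the attention matrix, applied to the first num_layers_applied_guided_attn layers and num_heads_applied_guided_attn heads.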

Methods

__call__(*inputs, **kwargs)

Call self as a function.

add_parameter(name, parameter)

Adds a Parameter instance.

add_sublayer(name, sublayer)

Adds a sub Layer instance.

apply(fn)

Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.

buffers([include_sublayers])

Returns a list of all buffers from current layer and its sub-layers.

children()

Returns an iterator over immediate children layers.

clear_gradients()

Clear the gradients of all parameters for this layer.

create_parameter(shape[, attr, dtype, ...])

Create parameters for this layer.

create_tensor([name, persistable, dtype])

Create Tensor for this layer.

create_variable([name, persistable, dtype])

Create Tensor for this layer.

eval()

Sets this Layer and all its sublayers to evaluation mode.

extra_repr()

Extra representation of this layer, you can have custom implementation of your own layer.

forward(text, text_lengths, speech, ...[, ...])

Calculate forward propagation.

full_name()

Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__

inference(text[, speech, spk_emb, ...])

Generate the sequence of features given the sequences of characters.

load_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

named_buffers([prefix, include_sublayers])

Returns an iterator over all buffers in the Layer, yielding tuple of name and Tensor.

named_children()

Returns an iterator over immediate children layers, yielding both the name of the layer as well as the layer itself.

named_parameters([prefix, include_sublayers])

Returns an iterator over all parameters in the Layer, yielding tuple of name and parameter.

named_sublayers([prefix, include_self, ...])

Returns an iterator over all sublayers in the Layer, yielding tuple of name and sublayer.

parameters([include_sublayers])

Returns a list of all Parameters from current layer and its sub-layers.

register_buffer(name, tensor[, persistable])

Registers a tensor as buffer into the layer.

register_forward_post_hook(hook)

Register a forward post-hook for Layer.

register_forward_pre_hook(hook)

Register a forward pre-hook for Layer.

set_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

set_state_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

state_dict([destination, include_sublayers, ...])

Get all parameters and persistable buffers of current layer and its sub-layers.

sublayers([include_self])

Returns a list of sub layers.

to([device, dtype, blocking])

Cast the parameters and buffers of Layer by the given device, dtype and blocking.

to_static_state_dict([destination, ...])

Get all parameters and buffers of current layer and its sub-layers.

train()

Sets this Layer and all its sublayers to training mode.

backward

register_state_dict_hook

forward(text: Tensor, text_lengths: Tensor, speech: Tensor, speech_lengths: Tensor, spk_emb: Optional[Tensor] = None) Tuple[Tensor, Dict[str, Tensor], Tensor][source]

Calculate forward propagation.

Args:

text (Tensor(int64)): Batch of padded character ids (B, Tmax).

text_lengths (Tensor(int64)): Batch of lengths of each input batch (B,).

speech (Tensor): Batch of padded target features (B, Lmax, odim).

speech_lengths (Tensor(int64)): Batch of the lengths of each target (B,).

spk_emb (Tensor, optional): Batch of speaker embeddings (B, spk_embed_dim).

Returns:

Tensor: Loss scalar value.

Dict: Statistics to be monitored.

inference(text: Tensor, speech: Optional[Tensor] = None, spk_emb: Optional[Tensor] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False) Tuple[Tensor, Tensor, Tensor][source]

Generate the sequence of features given the sequences of characters.

Args:

text (Tensor(int64)): Input sequence of characters (T,).

speech (Tensor, optional): Feature sequence to extract style (N, idim).

spk_emb (Tensor, optional): Speaker embedding vector (spk_embed_dim,).

threshold (float, optional): Threshold in inference.

minlenratio (float, optional): Minimum length ratio in inference.

maxlenratio (float, optional): Maximum length ratio in inference.

use_teacher_forcing (bool, optional): Whether to use teacher forcing.

Returns:

Tensor: Output sequence of features (L, odim).

Tensor: Output sequence of stop probabilities (L,).

Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).
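
During inference the decoder runs autoregressively until a stop probability exceeds threshold, bounded by step limits derived from the input length. As a hedged sketch (the exact formula is an assumption based on the common Transformer-TTS recipe, not taken from this implementation), the bounds can be computed as:

```python
def inference_length_bounds(text_len, minlenratio, maxlenratio, reduction_factor=1):
    """Minimum and maximum decoder steps for an input of `text_len` tokens.

    The decoder may not stop before `minlen` steps and is cut off at `maxlen`,
    with each step producing `reduction_factor` frames.
    """
    maxlen = int(text_len * maxlenratio / reduction_factor)
    minlen = int(text_len * minlenratio / reduction_factor)
    return minlen, maxlen

print(inference_length_bounds(50, 0.0, 10.0))      # (0, 500)
print(inference_length_bounds(50, 1.0, 10.0, 2))   # (25, 250)
```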

class paddlespeech.t2s.models.transformer_tts.transformer_tts.TransformerTTSInference(normalizer, model)[source]

Bases: Layer

Methods

__call__(*inputs, **kwargs)

Call self as a function.

add_parameter(name, parameter)

Adds a Parameter instance.

add_sublayer(name, sublayer)

Adds a sub Layer instance.

apply(fn)

Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.

buffers([include_sublayers])

Returns a list of all buffers from current layer and its sub-layers.

children()

Returns an iterator over immediate children layers.

clear_gradients()

Clear the gradients of all parameters for this layer.

create_parameter(shape[, attr, dtype, ...])

Create parameters for this layer.

create_tensor([name, persistable, dtype])

Create Tensor for this layer.

create_variable([name, persistable, dtype])

Create Tensor for this layer.

eval()

Sets this Layer and all its sublayers to evaluation mode.

extra_repr()

Extra representation of this layer, you can have custom implementation of your own layer.

forward(text[, spk_id])

Defines the computation performed at every call.

full_name()

Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__

load_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

named_buffers([prefix, include_sublayers])

Returns an iterator over all buffers in the Layer, yielding tuple of name and Tensor.

named_children()

Returns an iterator over immediate children layers, yielding both the name of the layer as well as the layer itself.

named_parameters([prefix, include_sublayers])

Returns an iterator over all parameters in the Layer, yielding tuple of name and parameter.

named_sublayers([prefix, include_self, ...])

Returns an iterator over all sublayers in the Layer, yielding tuple of name and sublayer.

parameters([include_sublayers])

Returns a list of all Parameters from current layer and its sub-layers.

register_buffer(name, tensor[, persistable])

Registers a tensor as buffer into the layer.

register_forward_post_hook(hook)

Register a forward post-hook for Layer.

register_forward_pre_hook(hook)

Register a forward pre-hook for Layer.

set_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

set_state_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

state_dict([destination, include_sublayers, ...])

Get all parameters and persistable buffers of current layer and its sub-layers.

sublayers([include_self])

Returns a list of sub layers.

to([device, dtype, blocking])

Cast the parameters and buffers of Layer by the given device, dtype and blocking.

to_static_state_dict([destination, ...])

Get all parameters and buffers of current layer and its sub-layers.

train()

Sets this Layer and all its sublayers to training mode.

backward

register_state_dict_hook

forward(text, spk_id=None)[source]

Defines the computation performed at every call. Should be overridden by all subclasses.

Parameters:

*inputs (tuple): Unpacked tuple arguments.

**kwargs (dict): Unpacked dict arguments.