paddlespeech.t2s.models.transformer_tts.transformer_tts module
TransformerTTS related modules for paddle
- class paddlespeech.t2s.models.transformer_tts.transformer_tts.TransformerTTS(idim: int, odim: int, embed_dim: int = 512, eprenet_conv_layers: int = 3, eprenet_conv_chans: int = 256, eprenet_conv_filts: int = 5, dprenet_layers: int = 2, dprenet_units: int = 256, elayers: int = 6, eunits: int = 1024, adim: int = 512, aheads: int = 4, dlayers: int = 6, dunits: int = 1024, postnet_layers: int = 5, postnet_chans: int = 256, postnet_filts: int = 5, positionwise_layer_type: str = 'conv1d', positionwise_conv_kernel_size: int = 1, use_scaled_pos_enc: bool = True, use_batch_norm: bool = True, encoder_normalize_before: bool = True, decoder_normalize_before: bool = True, encoder_concat_after: bool = False, decoder_concat_after: bool = False, reduction_factor: int = 1, spk_embed_dim: Optional[int] = None, spk_embed_integration_type: str = 'add', use_gst: bool = False, gst_tokens: int = 10, gst_heads: int = 4, gst_conv_layers: int = 6, gst_conv_chans_list: Sequence[int] = (32, 32, 64, 64, 128, 128), gst_conv_kernel_size: int = 3, gst_conv_stride: int = 2, gst_gru_layers: int = 1, gst_gru_units: int = 128, transformer_enc_dropout_rate: float = 0.1, transformer_enc_positional_dropout_rate: float = 0.1, transformer_enc_attn_dropout_rate: float = 0.1, transformer_dec_dropout_rate: float = 0.1, transformer_dec_positional_dropout_rate: float = 0.1, transformer_dec_attn_dropout_rate: float = 0.1, transformer_enc_dec_attn_dropout_rate: float = 0.1, eprenet_dropout_rate: float = 0.5, dprenet_dropout_rate: float = 0.5, postnet_dropout_rate: float = 0.5, init_type: str = 'xavier_uniform', init_enc_alpha: float = 1.0, init_dec_alpha: float = 1.0, use_guided_attn_loss: bool = True, num_heads_applied_guided_attn: int = 2, num_layers_applied_guided_attn: int = 2)[source]
Bases:
Layer
TTS-Transformer module.
This is a module of the text-to-speech Transformer described in Neural Speech Synthesis with Transformer Network, which converts a sequence of tokens into a sequence of Mel-filterbank features.
- Args:
- idim (int):
Dimension of the inputs.
- odim (int):
Dimension of the outputs.
- embed_dim (int, optional):
Dimension of character embedding.
- eprenet_conv_layers (int, optional):
Number of encoder prenet convolution layers.
- eprenet_conv_chans (int, optional):
Number of encoder prenet convolution channels.
- eprenet_conv_filts (int, optional):
Filter size of encoder prenet convolution.
- dprenet_layers (int, optional):
Number of decoder prenet layers.
- dprenet_units (int, optional):
Number of decoder prenet hidden units.
- elayers (int, optional):
Number of encoder layers.
- eunits (int, optional):
Number of encoder hidden units.
- adim (int, optional):
Number of attention transformation dimensions.
- aheads (int, optional):
Number of heads for multi head attention.
- dlayers (int, optional):
Number of decoder layers.
- dunits (int, optional):
Number of decoder hidden units.
- postnet_layers (int, optional):
Number of postnet layers.
- postnet_chans (int, optional):
Number of postnet channels.
- postnet_filts (int, optional):
Filter size of postnet.
- use_scaled_pos_enc (bool, optional):
Whether to use trainable scaled positional encoding.
- use_batch_norm (bool, optional):
Whether to use batch normalization in encoder prenet.
- encoder_normalize_before (bool, optional):
Whether to perform layer normalization before encoder block.
- decoder_normalize_before (bool, optional):
Whether to perform layer normalization before decoder block.
- encoder_concat_after (bool, optional):
Whether to concatenate attention layer's input and output in encoder.
- decoder_concat_after (bool, optional):
Whether to concatenate attention layer's input and output in decoder.
- positionwise_layer_type (str, optional):
Position-wise operation type.
- positionwise_conv_kernel_size (int, optional):
Kernel size of the position-wise conv1d layer.
- reduction_factor (int, optional):
Reduction factor.
- spk_embed_dim (int, optional):
Number of speaker embedding dimensions.
- spk_embed_integration_type (str, optional):
How to integrate speaker embedding.
- use_gst (bool, optional):
Whether to use global style token.
- gst_tokens (int, optional):
The number of GST embeddings.
- gst_heads (int, optional):
The number of heads in GST multihead attention.
- gst_conv_layers (int, optional):
The number of conv layers in GST.
- gst_conv_chans_list (Sequence[int], optional):
List of the number of channels of conv layers in GST.
- gst_conv_kernel_size (int, optional):
Kernel size of conv layers in GST.
- gst_conv_stride (int, optional):
Stride size of conv layers in GST.
- gst_gru_layers (int, optional):
The number of GRU layers in GST.
- gst_gru_units (int, optional):
The number of GRU units in GST.
- transformer_lr (float, optional):
Initial value of learning rate.
- transformer_warmup_steps (int, optional):
Optimizer warmup steps.
- transformer_enc_dropout_rate (float, optional):
Dropout rate in encoder except attention and positional encoding.
- transformer_enc_positional_dropout_rate (float, optional):
Dropout rate after encoder positional encoding.
- transformer_enc_attn_dropout_rate (float, optional):
Dropout rate in encoder self-attention module.
- transformer_dec_dropout_rate (float, optional):
Dropout rate in decoder except attention & positional encoding.
- transformer_dec_positional_dropout_rate (float, optional):
Dropout rate after decoder positional encoding.
- transformer_dec_attn_dropout_rate (float, optional):
Dropout rate in decoder self-attention module.
- transformer_enc_dec_attn_dropout_rate (float, optional):
Dropout rate in encoder-decoder attention module.
- init_type (str, optional):
How to initialize transformer parameters.
- init_enc_alpha (float, optional):
Initial value of alpha in scaled pos encoding of the encoder.
- init_dec_alpha (float, optional):
Initial value of alpha in scaled pos encoding of the decoder.
- eprenet_dropout_rate (float, optional):
Dropout rate in encoder prenet.
- dprenet_dropout_rate (float, optional):
Dropout rate in decoder prenet.
- postnet_dropout_rate (float, optional):
Dropout rate in postnet.
- use_masking (bool, optional):
Whether to apply masking for padded part in loss calculation.
- use_weighted_masking (bool, optional):
Whether to apply weighted masking in loss calculation.
- bce_pos_weight (float, optional):
Positive sample weight in bce calculation (only for use_masking=true).
- loss_type (str, optional):
How to calculate loss.
- use_guided_attn_loss (bool, optional):
Whether to use guided attention loss.
- num_heads_applied_guided_attn (int, optional):
Number of heads in each layer to apply guided attention loss.
- num_layers_applied_guided_attn (int, optional):
Number of layers to apply guided attention loss.
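The guided attention loss options above penalize encoder-decoder attention that strays from the diagonal text-to-speech alignment. As a rough sketch in pure Python (not the library's implementation; the helper name and the sigma value are assumptions), the penalty weight matrix can be built as:

```python
import math

def guided_attention_weights(text_len: int, feat_len: int, sigma: float = 0.4):
    """Hypothetical sketch of the soft-diagonal penalty matrix.

    W[t, n] is 0 on the diagonal (t/text_len == n/feat_len) and approaches
    1 far from it; the guided attention loss averages W * attention over the
    heads/layers selected by num_heads_applied_guided_attn and
    num_layers_applied_guided_attn.
    """
    return [
        [1.0 - math.exp(-((n / feat_len - t / text_len) ** 2) / (2 * sigma ** 2))
         for n in range(feat_len)]
        for t in range(text_len)
    ]

# Diagonal entries carry no penalty; off-diagonal entries are penalized more
# the further they sit from the alignment diagonal.
w = guided_attention_weights(4, 4)
```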
Methods
- __call__(*inputs, **kwargs): Call self as a function.
- add_parameter(name, parameter): Adds a Parameter instance.
- add_sublayer(name, sublayer): Adds a sub Layer instance.
- apply(fn): Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.
- buffers([include_sublayers]): Returns a list of all buffers from current layer and its sub-layers.
- children(): Returns an iterator over immediate children layers.
- clear_gradients(): Clear the gradients of all parameters for this layer.
- create_parameter(shape[, attr, dtype, ...]): Create parameters for this layer.
- create_tensor([name, persistable, dtype]): Create a Tensor for this layer.
- create_variable([name, persistable, dtype]): Create a Tensor for this layer.
- eval(): Sets this Layer and all its sublayers to evaluation mode.
- extra_repr(): Extra representation of this layer; you can provide a custom implementation for your own layer.
- forward(text, text_lengths, speech, ...[, ...]): Calculate forward propagation.
- full_name(): Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__.
- inference(text[, speech, spk_emb, ...]): Generate the sequence of features given the sequences of characters.
- load_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- named_buffers([prefix, include_sublayers]): Returns an iterator over all buffers in the Layer, yielding tuples of name and Tensor.
- named_children(): Returns an iterator over immediate children layers, yielding both the name of the layer and the layer itself.
- named_parameters([prefix, include_sublayers]): Returns an iterator over all parameters in the Layer, yielding tuples of name and parameter.
- named_sublayers([prefix, include_self, ...]): Returns an iterator over all sublayers in the Layer, yielding tuples of name and sublayer.
- parameters([include_sublayers]): Returns a list of all Parameters from current layer and its sub-layers.
- register_buffer(name, tensor[, persistable]): Registers a tensor as a buffer of the layer.
- register_forward_post_hook(hook): Register a forward post-hook for the Layer.
- register_forward_pre_hook(hook): Register a forward pre-hook for the Layer.
- set_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- set_state_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- state_dict([destination, include_sublayers, ...]): Get all parameters and persistable buffers of current layer and its sub-layers.
- sublayers([include_self]): Returns a list of sub-layers.
- to([device, dtype, blocking]): Cast the parameters and buffers of the Layer by the given device, dtype and blocking.
- to_static_state_dict([destination, ...]): Get all parameters and buffers of current layer and its sub-layers.
- train(): Sets this Layer and all its sublayers to training mode.
- backward
- register_state_dict_hook
- forward(text: Tensor, text_lengths: Tensor, speech: Tensor, speech_lengths: Tensor, spk_emb: Optional[Tensor] = None) Tuple[Tensor, Dict[str, Tensor], Tensor] [source]
Calculate forward propagation.
- Args:
- text (Tensor(int64)):
Batch of padded character ids (B, Tmax).
- text_lengths (Tensor(int64)):
Batch of lengths of each input batch (B,).
- speech (Tensor):
Batch of padded target features (B, Lmax, odim).
- speech_lengths (Tensor(int64)):
Batch of the lengths of each target (B,).
- spk_emb (Tensor, optional):
Batch of speaker embeddings (B, spk_embed_dim).
- Returns:
Tensor: Loss scalar value.
Dict: Statistics to be monitored.
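When reduction_factor is greater than 1, each decoder step covers several output frames, so the target lengths seen by the decoder shrink accordingly before the loss is computed. A minimal sketch of that bookkeeping (a hypothetical helper, not the module's own code):

```python
def decoder_lengths(speech_lengths, reduction_factor):
    """Hypothetical sketch: effective per-utterance target lengths
    when the decoder emits `reduction_factor` frames per step."""
    # each decoder step covers r output frames, so the effective
    # length is floor(length / r)
    return [length // reduction_factor for length in speech_lengths]

# with r = 2, a 100-frame target becomes 50 decoder steps
lens = decoder_lengths([100, 73], 2)
```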
- inference(text: Tensor, speech: Optional[Tensor] = None, spk_emb: Optional[Tensor] = None, threshold: float = 0.5, minlenratio: float = 0.0, maxlenratio: float = 10.0, use_teacher_forcing: bool = False) Tuple[Tensor, Tensor, Tensor] [source]
Generate the sequence of features given the sequences of characters.
- Args:
- text (Tensor(int64)):
Input sequence of characters (T,).
- speech (Tensor, optional):
Feature sequence to extract style (N, idim).
- spk_emb (Tensor, optional):
Speaker embedding vector (spk_embed_dim,).
- threshold (float, optional):
Threshold in inference.
- minlenratio (float, optional):
Minimum length ratio in inference.
- maxlenratio (float, optional):
Maximum length ratio in inference.
- use_teacher_forcing (bool, optional):
Whether to use teacher forcing.
- Returns:
Tensor: Output sequence of features (L, odim).
Tensor: Output sequence of stop probabilities (L,).
Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).
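The minlenratio/maxlenratio arguments bound the autoregressive loop relative to the input length, and threshold is applied to the per-step stop probability. A hedged sketch of that control flow, under the assumption that both ratios scale the input length (the helper name is made up for illustration):

```python
def should_stop(step: int, stop_prob: float, text_len: int,
                threshold: float = 0.5,
                minlenratio: float = 0.0,
                maxlenratio: float = 10.0) -> bool:
    """Hypothetical sketch of the inference stopping rule."""
    minlen = int(text_len * minlenratio)
    maxlen = int(text_len * maxlenratio)
    if step < minlen:
        return False       # never stop before the minimum length
    if step >= maxlen:
        return True        # hard cap on the number of decoder steps
    return stop_prob >= threshold  # otherwise, trust the stop token
```

Raising threshold makes early stops rarer; minlenratio guards against degenerate one-frame outputs, while maxlenratio keeps a runaway decoder from looping indefinitely.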
- class paddlespeech.t2s.models.transformer_tts.transformer_tts.TransformerTTSInference(normalizer, model)[source]
Bases:
Layer
Methods
- __call__(*inputs, **kwargs): Call self as a function.
- add_parameter(name, parameter): Adds a Parameter instance.
- add_sublayer(name, sublayer): Adds a sub Layer instance.
- apply(fn): Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.
- buffers([include_sublayers]): Returns a list of all buffers from current layer and its sub-layers.
- children(): Returns an iterator over immediate children layers.
- clear_gradients(): Clear the gradients of all parameters for this layer.
- create_parameter(shape[, attr, dtype, ...]): Create parameters for this layer.
- create_tensor([name, persistable, dtype]): Create a Tensor for this layer.
- create_variable([name, persistable, dtype]): Create a Tensor for this layer.
- eval(): Sets this Layer and all its sublayers to evaluation mode.
- extra_repr(): Extra representation of this layer; you can provide a custom implementation for your own layer.
- forward(text[, spk_id]): Defines the computation performed at every call.
- full_name(): Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__.
- load_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- named_buffers([prefix, include_sublayers]): Returns an iterator over all buffers in the Layer, yielding tuples of name and Tensor.
- named_children(): Returns an iterator over immediate children layers, yielding both the name of the layer and the layer itself.
- named_parameters([prefix, include_sublayers]): Returns an iterator over all parameters in the Layer, yielding tuples of name and parameter.
- named_sublayers([prefix, include_self, ...]): Returns an iterator over all sublayers in the Layer, yielding tuples of name and sublayer.
- parameters([include_sublayers]): Returns a list of all Parameters from current layer and its sub-layers.
- register_buffer(name, tensor[, persistable]): Registers a tensor as a buffer of the layer.
- register_forward_post_hook(hook): Register a forward post-hook for the Layer.
- register_forward_pre_hook(hook): Register a forward pre-hook for the Layer.
- set_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- set_state_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- state_dict([destination, include_sublayers, ...]): Get all parameters and persistable buffers of current layer and its sub-layers.
- sublayers([include_self]): Returns a list of sub-layers.
- to([device, dtype, blocking]): Cast the parameters and buffers of the Layer by the given device, dtype and blocking.
- to_static_state_dict([destination, ...]): Get all parameters and buffers of current layer and its sub-layers.
- train(): Sets this Layer and all its sublayers to training mode.
- backward
- register_state_dict_hook