paddlespeech.s2t.modules.attention module
Multi-Head Attention layer definition.
- class paddlespeech.s2t.modules.attention.MultiHeadedAttention(n_head: int, n_feat: int, dropout_rate: float)[source]
Bases:
Layer
Multi-Head Attention layer.
Methods
- __call__(*inputs, **kwargs): Call self as a function.
- add_parameter(name, parameter): Adds a Parameter instance.
- add_sublayer(name, sublayer): Adds a sub Layer instance.
- apply(fn): Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.
- buffers([include_sublayers]): Returns a list of all buffers from the current layer and its sub-layers.
- children(): Returns an iterator over immediate children layers.
- clear_gradients(): Clear the gradients of all parameters for this layer.
- create_parameter(shape[, attr, dtype, ...]): Create parameters for this layer.
- create_tensor([name, persistable, dtype]): Create a Tensor for this layer.
- create_variable([name, persistable, dtype]): Create a Tensor for this layer.
- eval(): Sets this Layer and all its sublayers to evaluation mode.
- extra_repr(): Extra representation of this layer; you can provide a custom implementation for your own layer.
- forward(query, key, value[, mask, ...]): Compute scaled dot product attention (see below).
- forward_attention(value, scores[, mask, ...]): Compute the attention context vector (see below).
- forward_qkv(query, key, value): Transform query, key and value (see below).
- full_name(): Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__.
- load_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- named_buffers([prefix, include_sublayers]): Returns an iterator over all buffers in the Layer, yielding tuples of name and Tensor.
- named_children(): Returns an iterator over immediate children layers, yielding both the name of the layer and the layer itself.
- named_parameters([prefix, include_sublayers]): Returns an iterator over all parameters in the Layer, yielding tuples of name and parameter.
- named_sublayers([prefix, include_self, ...]): Returns an iterator over all sublayers in the Layer, yielding tuples of name and sublayer.
- parameters([include_sublayers]): Returns a list of all Parameters from the current layer and its sub-layers.
- register_buffer(name, tensor[, persistable]): Registers a tensor as a buffer of the layer.
- register_forward_post_hook(hook): Register a forward post-hook for the Layer.
- register_forward_pre_hook(hook): Register a forward pre-hook for the Layer.
- set_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- set_state_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- state_dict([destination, include_sublayers, ...]): Get all parameters and persistable buffers of the current layer and its sub-layers.
- sublayers([include_self]): Returns a list of sub layers.
- to([device, dtype, blocking]): Cast the parameters and buffers of the Layer to the given device, dtype and blocking mode.
- to_static_state_dict([destination, ...]): Get all parameters and buffers of the current layer and its sub-layers.
- train(): Sets this Layer and all its sublayers to training mode.
- backward
- register_state_dict_hook
- forward(query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, mask: paddle.Tensor = <empty Tensor, shape=[0, 0, 0], dtype=bool>, pos_emb: paddle.Tensor = <empty Tensor, shape=[0], dtype=float32>, cache: paddle.Tensor = <empty Tensor, shape=[0, 0, 0, 0], dtype=float32>) -> Tuple[Tensor, Tensor] [source]
Compute scaled dot product attention.
Args:
- query (paddle.Tensor): Query tensor (#batch, time1, size).
- key (paddle.Tensor): Key tensor (#batch, time2, size).
- value (paddle.Tensor): Value tensor (#batch, time2, size).
- mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2).
  1. When applying cross attention between decoder and encoder, the batch padding mask for the input has shape (#batch, 1, T).
  2. When applying self attention in the encoder, the mask has shape (#batch, T, T).
  3. When applying self attention in the decoder, the mask has shape (#batch, L, L).
  4. If different positions in the decoder attend to different blocks of the encoder, as in Mocha, the mask passed in can have shape (#batch, L, T). There is no such case in current Wenet.
- cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
Returns:
- paddle.Tensor: Output tensor (#batch, time1, d_model).
- paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
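As a usage illustration, here is a minimal sketch of calling the layer directly. The head count, feature size, and sequence lengths are hypothetical, and the mask is a simple all-valid padding mask of the documented (#batch, 1, time2) shape.

```python
import paddle
from paddlespeech.s2t.modules.attention import MultiHeadedAttention

# Hypothetical sizes: 4 heads over a 256-dim feature space.
attn = MultiHeadedAttention(n_head=4, n_feat=256, dropout_rate=0.1)

batch, time1, time2 = 2, 10, 12
query = paddle.randn([batch, time1, 256])
key = paddle.randn([batch, time2, 256])
value = paddle.randn([batch, time2, 256])
# All-valid padding mask, shape (#batch, 1, time2); True marks usable positions.
mask = paddle.ones([batch, 1, time2]).astype('bool')

out, new_cache = attn(query, key, value, mask)
print(out.shape)        # [2, 10, 256]  == (#batch, time1, d_model)
print(new_cache.shape)  # cached keys/values, concatenated along the last axis (see Returns above)
```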
- forward_attention(value: paddle.Tensor, scores: paddle.Tensor, mask: paddle.Tensor = <empty Tensor, shape=[0, 0, 0], dtype=bool>) -> Tensor [source]
Compute attention context vector.
Args:
- value (paddle.Tensor): Transformed value, size (#batch, n_head, time2, d_k).
- scores (paddle.Tensor): Attention score, size (#batch, n_head, time1, time2).
- mask (paddle.Tensor): Mask, size (#batch, 1, time2) or (#batch, time1, time2); shape (0, 0, 0) means a fake mask.
Returns:
- paddle.Tensor: Transformed value (#batch, time1, d_model), weighted by the attention score (#batch, time1, time2).
- forward_qkv(query: Tensor, key: Tensor, value: Tensor) -> Tuple[Tensor, Tensor, Tensor] [source]
Transform query, key and value.
Args:
- query (paddle.Tensor): Query tensor (#batch, time1, size).
- key (paddle.Tensor): Key tensor (#batch, time2, size).
- value (paddle.Tensor): Value tensor (#batch, time2, size).
Returns:
- paddle.Tensor: Transformed query tensor, size (#batch, n_head, time1, d_k).
- paddle.Tensor: Transformed key tensor, size (#batch, n_head, time2, d_k).
- paddle.Tensor: Transformed value tensor, size (#batch, n_head, time2, d_k).
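To make the shape contract of forward_qkv and forward_attention concrete, the following sketch composes the two helpers by hand into plain scaled dot-product attention. The sizes are hypothetical, and it assumes the layer exposes d_k = n_feat // n_head as used in the docstrings above.

```python
import math
import paddle
from paddlespeech.s2t.modules.attention import MultiHeadedAttention

attn = MultiHeadedAttention(n_head=4, n_feat=256, dropout_rate=0.0)
batch, time1, time2 = 2, 10, 12
query = paddle.randn([batch, time1, 256])
key = paddle.randn([batch, time2, 256])
value = paddle.randn([batch, time2, 256])

# Project and split into heads: (#batch, n_head, time, d_k) with d_k = 256 // 4 = 64.
q, k, v = attn.forward_qkv(query, key, value)
print(q.shape, k.shape, v.shape)  # [2, 4, 10, 64] [2, 4, 12, 64] [2, 4, 12, 64]

# Scaled dot-product scores, then the attention-weighted value.
scores = paddle.matmul(q, k.transpose([0, 1, 3, 2])) / math.sqrt(attn.d_k)
out = attn.forward_attention(v, scores)  # default (0, 0, 0) mask == no masking
print(out.shape)                         # [2, 10, 256]  == (#batch, time1, d_model)
```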
- class paddlespeech.s2t.modules.attention.RelPositionMultiHeadedAttention(n_head, n_feat, dropout_rate, adaptive_scale=False, init_weights=False)[source]
Bases:
MultiHeadedAttention
Multi-Head Attention layer with relative position encoding.
Methods
- __call__(*inputs, **kwargs): Call self as a function.
- add_parameter(name, parameter): Adds a Parameter instance.
- add_sublayer(name, sublayer): Adds a sub Layer instance.
- apply(fn): Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.
- buffers([include_sublayers]): Returns a list of all buffers from the current layer and its sub-layers.
- children(): Returns an iterator over immediate children layers.
- clear_gradients(): Clear the gradients of all parameters for this layer.
- create_parameter(shape[, attr, dtype, ...]): Create parameters for this layer.
- create_tensor([name, persistable, dtype]): Create a Tensor for this layer.
- create_variable([name, persistable, dtype]): Create a Tensor for this layer.
- eval(): Sets this Layer and all its sublayers to evaluation mode.
- extra_repr(): Extra representation of this layer; you can provide a custom implementation for your own layer.
- forward(query, key, value[, mask, ...]): Compute 'Scaled Dot Product Attention' with relative positional encoding (see below).
- forward_attention(value, scores[, mask, ...]): Compute the attention context vector (see MultiHeadedAttention above).
- forward_qkv(query, key, value): Transform query, key and value (see MultiHeadedAttention above).
- full_name(): Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__.
- load_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- named_buffers([prefix, include_sublayers]): Returns an iterator over all buffers in the Layer, yielding tuples of name and Tensor.
- named_children(): Returns an iterator over immediate children layers, yielding both the name of the layer and the layer itself.
- named_parameters([prefix, include_sublayers]): Returns an iterator over all parameters in the Layer, yielding tuples of name and parameter.
- named_sublayers([prefix, include_self, ...]): Returns an iterator over all sublayers in the Layer, yielding tuples of name and sublayer.
- parameters([include_sublayers]): Returns a list of all Parameters from the current layer and its sub-layers.
- register_buffer(name, tensor[, persistable]): Registers a tensor as a buffer of the layer.
- register_forward_post_hook(hook): Register a forward post-hook for the Layer.
- register_forward_pre_hook(hook): Register a forward pre-hook for the Layer.
- rel_shift(x[, zero_triu]): Compute relative positional encoding. x (paddle.Tensor) is an input tensor of shape (batch, head, time1, time1); if zero_triu (bool) is true, the lower triangular part of the matrix is returned. Returns a paddle.Tensor of shape (batch, head, time1, time1).
- set_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- set_state_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- state_dict([destination, include_sublayers, ...]): Get all parameters and persistable buffers of the current layer and its sub-layers.
- sublayers([include_self]): Returns a list of sub layers.
- to([device, dtype, blocking]): Cast the parameters and buffers of the Layer to the given device, dtype and blocking mode.
- to_static_state_dict([destination, ...]): Get all parameters and buffers of the current layer and its sub-layers.
- train(): Sets this Layer and all its sublayers to training mode.
- backward
- init_weights
- register_state_dict_hook
- forward(query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, mask: paddle.Tensor = <empty Tensor, shape=[0, 0, 0], dtype=bool>, pos_emb: paddle.Tensor = <empty Tensor, shape=[0], dtype=float32>, cache: paddle.Tensor = <empty Tensor, shape=[0, 0, 0, 0], dtype=float32>) -> Tuple[Tensor, Tensor] [source]
Compute 'Scaled Dot Product Attention' with relative positional encoding.
Args:
- query (paddle.Tensor): Query tensor (#batch, time1, size).
- key (paddle.Tensor): Key tensor (#batch, time2, size).
- value (paddle.Tensor): Value tensor (#batch, time2, size).
- mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2); shape (0, 0, 0) means a fake mask.
- pos_emb (paddle.Tensor): Positional embedding tensor (#batch, time2, size).
- cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
Returns:
- paddle.Tensor: Output tensor (#batch, time1, d_model).
- paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
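A minimal sketch of encoder-style self-attention with this layer, using hypothetical sizes. Here pos_emb is just a random tensor with the documented (#batch, time2, size) shape; in a real model it is produced by the network's relative positional encoding module.

```python
import paddle
from paddlespeech.s2t.modules.attention import RelPositionMultiHeadedAttention

attn = RelPositionMultiHeadedAttention(n_head=4, n_feat=256, dropout_rate=0.1)

batch, time = 2, 10
x = paddle.randn([batch, time, 256])                 # encoder self-attention: q = k = v
mask = paddle.ones([batch, 1, time]).astype('bool')  # all positions valid
pos_emb = paddle.randn([1, time, 256])               # placeholder relative position embedding

out, new_cache = attn(x, x, x, mask, pos_emb)
print(out.shape)  # [2, 10, 256]  == (#batch, time1, d_model)
```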
- class paddlespeech.s2t.modules.attention.RoPERelPositionMultiHeadedAttention(n_head, n_feat, dropout_rate, adaptive_scale=False, init_weights=False)[source]
Bases:
MultiHeadedAttention
Multi-Head Attention layer with RoPE relative position encoding.
Methods
- __call__(*inputs, **kwargs): Call self as a function.
- add_parameter(name, parameter): Adds a Parameter instance.
- add_sublayer(name, sublayer): Adds a sub Layer instance.
- align(tensor, axes[, ndim]): Realign a tensor (a batched version of expand_dims); axes[i] gives the position of the original i-th dimension in the new tensor, and ndim is the number of dimensions of the new tensor.
- apply(fn): Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.
- apply_rotary_position_embeddings(sinusoidal, ...): Apply RoPE to the given tensors, where sinusoidal.shape = [B, T, D] and each tensor has shape [B, T, ..., D] or (B, H, T, D/H).
- buffers([include_sublayers]): Returns a list of all buffers from the current layer and its sub-layers.
- children(): Returns an iterator over immediate children layers.
- clear_gradients(): Clear the gradients of all parameters for this layer.
- create_parameter(shape[, attr, dtype, ...]): Create parameters for this layer.
- create_tensor([name, persistable, dtype]): Create a Tensor for this layer.
- create_variable([name, persistable, dtype]): Create a Tensor for this layer.
- eval(): Sets this Layer and all its sublayers to evaluation mode.
- extra_repr(): Extra representation of this layer; you can provide a custom implementation for your own layer.
- forward(query, key, value[, mask, ...]): Compute 'Scaled Dot Product Attention' with RoPE relative positional encoding (see below).
- forward_attention(value, scores[, mask, ...]): Compute the attention context vector (see MultiHeadedAttention above).
- forward_qkv(query, key, value): Transform query, key and value (see MultiHeadedAttention above).
- full_name(): Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__.
- load_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- named_buffers([prefix, include_sublayers]): Returns an iterator over all buffers in the Layer, yielding tuples of name and Tensor.
- named_children(): Returns an iterator over immediate children layers, yielding both the name of the layer and the layer itself.
- named_parameters([prefix, include_sublayers]): Returns an iterator over all parameters in the Layer, yielding tuples of name and parameter.
- named_sublayers([prefix, include_self, ...]): Returns an iterator over all sublayers in the Layer, yielding tuples of name and sublayer.
- parameters([include_sublayers]): Returns a list of all Parameters from the current layer and its sub-layers.
- register_buffer(name, tensor[, persistable]): Registers a tensor as a buffer of the layer.
- register_forward_post_hook(hook): Register a forward post-hook for the Layer.
- register_forward_pre_hook(hook): Register a forward pre-hook for the Layer.
- set_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- set_state_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- state_dict([destination, include_sublayers, ...]): Get all parameters and persistable buffers of the current layer and its sub-layers.
- sublayers([include_self]): Returns a list of sub layers.
- to([device, dtype, blocking]): Cast the parameters and buffers of the Layer to the given device, dtype and blocking mode.
- to_static_state_dict([destination, ...]): Get all parameters and buffers of the current layer and its sub-layers.
- train(): Sets this Layer and all its sublayers to training mode.
- backward
- register_state_dict_hook
- align(tensor: Tensor, axes: List[int], ndim=None)[source]
Realign a tensor (a batched version of expand_dims). axes: the original i-th dimension is placed at dimension axes[i] of the new tensor; ndim: the number of dimensions of the new tensor.
- apply_rotary_position_embeddings(sinusoidal, *tensors)[source]
Apply RoPE to the given tensors, where sinusoidal.shape = [B, T, D] and tensors is a list of tensors, each of shape [B, T, ..., D] or (B, H, T, D/H).
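For reference, the sketch below shows the standard RoPE rotation in the RoFormer formulation that this kind of helper implements: sinusoidal carries interleaved sin/cos pairs, and each (even, odd) pair of channels is rotated by the corresponding angle. This is an illustrative re-implementation under that layout assumption, not the module's own code.

```python
import math
import paddle

def rope_rotate(sinusoidal, x):
    # sinusoidal: (B, T, D) with interleaved [sin, cos] pairs along the last axis (assumed layout).
    # x: (B, T, D) query or key tensor to rotate.
    sin_pos = paddle.repeat_interleave(sinusoidal[..., 0::2], 2, axis=-1)
    cos_pos = paddle.repeat_interleave(sinusoidal[..., 1::2], 2, axis=-1)
    # Pairwise rotation: (x0, x1) -> (x0*cos - x1*sin, x1*cos + x0*sin).
    x2 = paddle.stack([-x[..., 1::2], x[..., 0::2]], axis=-1).reshape(x.shape)
    return x * cos_pos + x2 * sin_pos

# Build an interleaved sinusoidal table and rotate a random "query".
B, T, D = 1, 10, 8
pos = paddle.arange(T, dtype='float32').unsqueeze(-1)                                   # (T, 1)
inv_freq = paddle.exp(-paddle.arange(0, D, 2, dtype='float32') / D * math.log(10000.0)) # (D/2,)
angles = pos * inv_freq                                                                  # (T, D/2)
sinusoidal = paddle.stack([paddle.sin(angles), paddle.cos(angles)], axis=-1).reshape([1, T, D])

q = paddle.randn([B, T, D])
print(rope_rotate(sinusoidal, q).shape)  # [1, 10, 8]
```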
- forward(query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, mask: paddle.Tensor = <empty Tensor, shape=[0, 0, 0], dtype=bool>, pos_emb: paddle.Tensor = <empty Tensor, shape=[0], dtype=float32>, cache: paddle.Tensor = <empty Tensor, shape=[0, 0, 0, 0], dtype=float32>) -> Tuple[Tensor, Tensor] [source]
Compute 'Scaled Dot Product Attention' with RoPE relative positional encoding.
Ref: https://github.com/facebookresearch/llama/blob/main/llama/model.py
Args:
- query (paddle.Tensor): Query tensor (#batch, time1, size).
- key (paddle.Tensor): Key tensor (#batch, time2, size).
- value (paddle.Tensor): Value tensor (#batch, time2, size).
- mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2); shape (0, 0, 0) means a fake mask.
- pos_emb (paddle.Tensor): Positional embedding tensor (#batch, time2, size).
- cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
Returns:
- paddle.Tensor: Output tensor (#batch, time1, d_model).
- paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
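The cache contract in the Returns above is shared by all three attention classes. As a rough sketch of streaming-style use, assuming batch size 1 and hypothetical chunk sizes, the cache returned by one call can be fed back into the next so that later chunks also attend to the cached keys and values (shown here with the base MultiHeadedAttention):

```python
import paddle
from paddlespeech.s2t.modules.attention import MultiHeadedAttention

attn = MultiHeadedAttention(n_head=4, n_feat=256, dropout_rate=0.0)

chunk1 = paddle.randn([1, 16, 256])         # first decoding chunk (batch=1)
out1, cache = attn(chunk1, chunk1, chunk1)
print(cache.shape)   # [1, 4, 16, 128]  == (1, head, cache_t, d_k * 2)

chunk2 = paddle.randn([1, 16, 256])         # next chunk reuses the cached keys/values
out2, cache = attn(chunk2, chunk2, chunk2, cache=cache)
print(cache.shape)   # [1, 4, 32, 128]  == (1, head, cache_t + time1, d_k * 2)
```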