paddlespeech.s2t.modules.attention module
Multi-Head Attention layer definition.
- class paddlespeech.s2t.modules.attention.MultiHeadedAttention(n_head: int, n_feat: int, dropout_rate: float)[source]
Bases:
Layer
Multi-Head Attention layer.
Methods
- __call__(*inputs, **kwargs): Call self as a function.
- add_parameter(name, parameter): Adds a Parameter instance.
- add_sublayer(name, sublayer): Adds a sub Layer instance.
- apply(fn): Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.
- buffers([include_sublayers]): Returns a list of all buffers from the current layer and its sub-layers.
- children(): Returns an iterator over immediate children layers.
- clear_gradients(): Clear the gradients of all parameters for this layer.
- create_parameter(shape[, attr, dtype, ...]): Create parameters for this layer.
- create_tensor([name, persistable, dtype]): Create a Tensor for this layer.
- create_variable([name, persistable, dtype]): Create a Tensor for this layer.
- eval(): Sets this Layer and all its sublayers to evaluation mode.
- extra_repr(): Extra representation of this layer; you can provide a custom implementation for your own layer.
- forward(query, key, value[, mask, ...]): Compute scaled dot product attention (see below).
- forward_attention(value, scores[, mask, ...]): Compute the attention context vector (see below).
- forward_qkv(query, key, value): Transform query, key and value (see below).
- full_name(): Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__.
- load_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- named_buffers([prefix, include_sublayers]): Returns an iterator over all buffers in the Layer, yielding tuples of name and Tensor.
- named_children(): Returns an iterator over immediate children layers, yielding both the name of the layer and the layer itself.
- named_parameters([prefix, include_sublayers]): Returns an iterator over all parameters in the Layer, yielding tuples of name and parameter.
- named_sublayers([prefix, include_self, ...]): Returns an iterator over all sublayers in the Layer, yielding tuples of name and sublayer.
- parameters([include_sublayers]): Returns a list of all Parameters from the current layer and its sub-layers.
- register_buffer(name, tensor[, persistable]): Registers a tensor as a buffer of the layer.
- register_forward_post_hook(hook): Register a forward post-hook for the Layer.
- register_forward_pre_hook(hook): Register a forward pre-hook for the Layer.
- set_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- set_state_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- state_dict([destination, include_sublayers, ...]): Get all parameters and persistable buffers of the current layer and its sub-layers.
- sublayers([include_self]): Returns a list of sub layers.
- to([device, dtype, blocking]): Cast the parameters and buffers of the Layer to the given device, dtype and blocking mode.
- to_static_state_dict([destination, ...]): Get all parameters and buffers of the current layer and its sub-layers.
- train(): Sets this Layer and all its sublayers to training mode.
- backward
- register_state_dict_hook
- forward(query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, mask: paddle.Tensor = <empty Tensor, shape=[0, 0, 0], dtype=bool>, pos_emb: paddle.Tensor = <empty Tensor, shape=[0], dtype=float32>, cache: paddle.Tensor = <empty Tensor, shape=[0, 0, 0, 0], dtype=float32>) -> Tuple[Tensor, Tensor] [source]
Compute scaled dot product attention.
Args:
- query (paddle.Tensor): Query tensor (#batch, time1, size).
- key (paddle.Tensor): Key tensor (#batch, time2, size).
- value (paddle.Tensor): Value tensor (#batch, time2, size).
- mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2).
  1. When applying cross attention between decoder and encoder, the batch padding mask for the input has shape (#batch, 1, T).
  2. When applying self attention in the encoder, the mask has shape (#batch, T, T).
  3. When applying self attention in the decoder, the mask has shape (#batch, L, L).
  4. If different positions in the decoder attend to different blocks of the encoder, as in Mocha, the mask passed in can have shape (#batch, L, T). There is no such case in current Wenet.
- cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
Returns:
- paddle.Tensor: Output tensor (#batch, time1, d_model).
- paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
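As a usage illustration, here is a minimal sketch of calling the layer directly. The head count, feature size, and sequence lengths are hypothetical, and the mask is a simple all-valid padding mask of the documented (#batch, 1, time2) shape.

```python
import paddle
from paddlespeech.s2t.modules.attention import MultiHeadedAttention

# Hypothetical sizes: 4 heads over a 256-dim feature space.
attn = MultiHeadedAttention(n_head=4, n_feat=256, dropout_rate=0.1)

batch, time1, time2 = 2, 10, 12
query = paddle.randn([batch, time1, 256])
key = paddle.randn([batch, time2, 256])
value = paddle.randn([batch, time2, 256])
# All-valid padding mask, shape (#batch, 1, time2); True marks usable positions.
mask = paddle.ones([batch, 1, time2]).astype('bool')

out, new_cache = attn(query, key, value, mask)
print(out.shape)        # [2, 10, 256]  == (#batch, time1, d_model)
print(new_cache.shape)  # cached keys/values, concatenated along the last axis (see Returns above)
```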
- forward_attention(value: paddle.Tensor, scores: paddle.Tensor, mask: paddle.Tensor = <empty Tensor, shape=[0, 0, 0], dtype=bool>) -> Tensor [source]
Compute attention context vector.
Args:
- value (paddle.Tensor): Transformed value, size (#batch, n_head, time2, d_k).
- scores (paddle.Tensor): Attention score, size (#batch, n_head, time1, time2).
- mask (paddle.Tensor): Mask, size (#batch, 1, time2) or (#batch, time1, time2); shape (0, 0, 0) means a fake mask.
Returns:
- paddle.Tensor: Transformed value (#batch, time1, d_model), weighted by the attention score (#batch, time1, time2).
- forward_qkv(query: Tensor, key: Tensor, value: Tensor) -> Tuple[Tensor, Tensor, Tensor] [source]
Transform query, key and value.
Args:
- query (paddle.Tensor): Query tensor (#batch, time1, size).
- key (paddle.Tensor): Key tensor (#batch, time2, size).
- value (paddle.Tensor): Value tensor (#batch, time2, size).
Returns:
- paddle.Tensor: Transformed query tensor, size (#batch, n_head, time1, d_k).
- paddle.Tensor: Transformed key tensor, size (#batch, n_head, time2, d_k).
- paddle.Tensor: Transformed value tensor, size (#batch, n_head, time2, d_k).
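To make the shape contract of forward_qkv and forward_attention concrete, the following sketch composes the two helpers by hand into plain scaled dot-product attention. The sizes are hypothetical, and it assumes the layer exposes d_k = n_feat // n_head as used in the docstrings above.

```python
import math
import paddle
from paddlespeech.s2t.modules.attention import MultiHeadedAttention

attn = MultiHeadedAttention(n_head=4, n_feat=256, dropout_rate=0.0)
batch, time1, time2 = 2, 10, 12
query = paddle.randn([batch, time1, 256])
key = paddle.randn([batch, time2, 256])
value = paddle.randn([batch, time2, 256])

# Project and split into heads: (#batch, n_head, time, d_k) with d_k = 256 // 4 = 64.
q, k, v = attn.forward_qkv(query, key, value)
print(q.shape, k.shape, v.shape)  # [2, 4, 10, 64] [2, 4, 12, 64] [2, 4, 12, 64]

# Scaled dot-product scores, then the attention-weighted value.
scores = paddle.matmul(q, k.transpose([0, 1, 3, 2])) / math.sqrt(attn.d_k)
out = attn.forward_attention(v, scores)  # default (0, 0, 0) mask == no masking
print(out.shape)                         # [2, 10, 256]  == (#batch, time1, d_model)
```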
- class paddlespeech.s2t.modules.attention.RelPositionMultiHeadedAttention(n_head, n_feat, dropout_rate, adaptive_scale=False, init_weights=False)[source]
Bases:
MultiHeadedAttention
Multi-Head Attention layer with relative position encoding.
Methods
- __call__(*inputs, **kwargs): Call self as a function.
- add_parameter(name, parameter): Adds a Parameter instance.
- add_sublayer(name, sublayer): Adds a sub Layer instance.
- apply(fn): Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.
- buffers([include_sublayers]): Returns a list of all buffers from the current layer and its sub-layers.
- children(): Returns an iterator over immediate children layers.
- clear_gradients(): Clear the gradients of all parameters for this layer.
- create_parameter(shape[, attr, dtype, ...]): Create parameters for this layer.
- create_tensor([name, persistable, dtype]): Create a Tensor for this layer.
- create_variable([name, persistable, dtype]): Create a Tensor for this layer.
- eval(): Sets this Layer and all its sublayers to evaluation mode.
- extra_repr(): Extra representation of this layer; you can provide a custom implementation for your own layer.
- forward(query, key, value[, mask, ...]): Compute 'Scaled Dot Product Attention' with relative positional encoding (see below).
- forward_attention(value, scores[, mask, ...]): Compute the attention context vector (see MultiHeadedAttention above).
- forward_qkv(query, key, value): Transform query, key and value (see MultiHeadedAttention above).
- full_name(): Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__.
- load_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- named_buffers([prefix, include_sublayers]): Returns an iterator over all buffers in the Layer, yielding tuples of name and Tensor.
- named_children(): Returns an iterator over immediate children layers, yielding both the name of the layer and the layer itself.
- named_parameters([prefix, include_sublayers]): Returns an iterator over all parameters in the Layer, yielding tuples of name and parameter.
- named_sublayers([prefix, include_self, ...]): Returns an iterator over all sublayers in the Layer, yielding tuples of name and sublayer.
- parameters([include_sublayers]): Returns a list of all Parameters from the current layer and its sub-layers.
- register_buffer(name, tensor[, persistable]): Registers a tensor as a buffer of the layer.
- register_forward_post_hook(hook): Register a forward post-hook for the Layer.
- register_forward_pre_hook(hook): Register a forward pre-hook for the Layer.
- rel_shift(x[, zero_triu]): Compute relative positional encoding. x (paddle.Tensor) is an input tensor of shape (batch, head, time1, time1); if zero_triu (bool) is true, the lower triangular part of the matrix is returned. Returns a paddle.Tensor of shape (batch, head, time1, time1).
- set_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- set_state_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- state_dict([destination, include_sublayers, ...]): Get all parameters and persistable buffers of the current layer and its sub-layers.
- sublayers([include_self]): Returns a list of sub layers.
- to([device, dtype, blocking]): Cast the parameters and buffers of the Layer to the given device, dtype and blocking mode.
- to_static_state_dict([destination, ...]): Get all parameters and buffers of the current layer and its sub-layers.
- train(): Sets this Layer and all its sublayers to training mode.
- backward
- init_weights
- register_state_dict_hook
- forward(query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, mask: paddle.Tensor = <empty Tensor, shape=[0, 0, 0], dtype=bool>, pos_emb: paddle.Tensor = <empty Tensor, shape=[0], dtype=float32>, cache: paddle.Tensor = <empty Tensor, shape=[0, 0, 0, 0], dtype=float32>) -> Tuple[Tensor, Tensor] [source]
Compute 'Scaled Dot Product Attention' with relative positional encoding.
Args:
- query (paddle.Tensor): Query tensor (#batch, time1, size).
- key (paddle.Tensor): Key tensor (#batch, time2, size).
- value (paddle.Tensor): Value tensor (#batch, time2, size).
- mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2); shape (0, 0, 0) means a fake mask.
- pos_emb (paddle.Tensor): Positional embedding tensor (#batch, time2, size).
- cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
Returns:
- paddle.Tensor: Output tensor (#batch, time1, d_model).
- paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
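A minimal sketch of encoder-style self-attention with this layer, using hypothetical sizes. Here pos_emb is just a random tensor with the documented (#batch, time2, size) shape; in a real model it is produced by the network's relative positional encoding module.

```python
import paddle
from paddlespeech.s2t.modules.attention import RelPositionMultiHeadedAttention

attn = RelPositionMultiHeadedAttention(n_head=4, n_feat=256, dropout_rate=0.1)

batch, time = 2, 10
x = paddle.randn([batch, time, 256])                 # encoder self-attention: q = k = v
mask = paddle.ones([batch, 1, time]).astype('bool')  # all positions valid
pos_emb = paddle.randn([1, time, 256])               # placeholder relative position embedding

out, new_cache = attn(x, x, x, mask, pos_emb)
print(out.shape)  # [2, 10, 256]  == (#batch, time1, d_model)
```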
- class paddlespeech.s2t.modules.attention.RoPERelPositionMultiHeadedAttention(n_head, n_feat, dropout_rate, adaptive_scale=False, init_weights=False)[source]
Bases:
MultiHeadedAttention
Multi-Head Attention layer with RoPE relative position encoding.
Methods
- __call__(*inputs, **kwargs): Call self as a function.
- add_parameter(name, parameter): Adds a Parameter instance.
- add_sublayer(name, sublayer): Adds a sub Layer instance.
- align(tensor, axes[, ndim]): Realign a tensor (a batched version of expand_dims); axes[i] gives the position of the original i-th dimension in the new tensor, and ndim is the number of dimensions of the new tensor.
- apply(fn): Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.
- apply_rotary_position_embeddings(sinusoidal, ...): Apply RoPE to the given tensors, where sinusoidal.shape = [B, T, D] and each tensor has shape [B, T, ..., D] or (B, H, T, D/H).
- buffers([include_sublayers]): Returns a list of all buffers from the current layer and its sub-layers.
- children(): Returns an iterator over immediate children layers.
- clear_gradients(): Clear the gradients of all parameters for this layer.
- create_parameter(shape[, attr, dtype, ...]): Create parameters for this layer.
- create_tensor([name, persistable, dtype]): Create a Tensor for this layer.
- create_variable([name, persistable, dtype]): Create a Tensor for this layer.
- eval(): Sets this Layer and all its sublayers to evaluation mode.
- extra_repr(): Extra representation of this layer; you can provide a custom implementation for your own layer.
- forward(query, key, value[, mask, ...]): Compute 'Scaled Dot Product Attention' with RoPE relative positional encoding (see below).
- forward_attention(value, scores[, mask, ...]): Compute the attention context vector (see MultiHeadedAttention above).
- forward_qkv(query, key, value): Transform query, key and value (see MultiHeadedAttention above).
- full_name(): Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__.
- load_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- named_buffers([prefix, include_sublayers]): Returns an iterator over all buffers in the Layer, yielding tuples of name and Tensor.
- named_children(): Returns an iterator over immediate children layers, yielding both the name of the layer and the layer itself.
- named_parameters([prefix, include_sublayers]): Returns an iterator over all parameters in the Layer, yielding tuples of name and parameter.
- named_sublayers([prefix, include_self, ...]): Returns an iterator over all sublayers in the Layer, yielding tuples of name and sublayer.
- parameters([include_sublayers]): Returns a list of all Parameters from the current layer and its sub-layers.
- register_buffer(name, tensor[, persistable]): Registers a tensor as a buffer of the layer.
- register_forward_post_hook(hook): Register a forward post-hook for the Layer.
- register_forward_pre_hook(hook): Register a forward pre-hook for the Layer.
- set_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- set_state_dict(state_dict[, use_structured_name]): Set parameters and persistable buffers from state_dict.
- state_dict([destination, include_sublayers, ...]): Get all parameters and persistable buffers of the current layer and its sub-layers.
- sublayers([include_self]): Returns a list of sub layers.
- to([device, dtype, blocking]): Cast the parameters and buffers of the Layer to the given device, dtype and blocking mode.
- to_static_state_dict([destination, ...]): Get all parameters and buffers of the current layer and its sub-layers.
- train(): Sets this Layer and all its sublayers to training mode.
- backward
- register_state_dict_hook
- align(tensor: Tensor, axes: List[int], ndim=None)[source]
Realign a tensor (a batched version of expand_dims). axes: the original i-th dimension is placed at dimension axes[i] of the new tensor; ndim: the number of dimensions of the new tensor.
- apply_rotary_position_embeddings(sinusoidal, *tensors)[source]
Apply RoPE to the given tensors, where sinusoidal.shape = [B, T, D] and tensors is a list of tensors, each of shape [B, T, ..., D] or (B, H, T, D/H).
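For reference, the sketch below shows the standard RoPE rotation in the RoFormer formulation that this kind of helper implements: sinusoidal carries interleaved sin/cos pairs, and each (even, odd) pair of channels is rotated by the corresponding angle. This is an illustrative re-implementation under that layout assumption, not the module's own code.

```python
import math
import paddle

def rope_rotate(sinusoidal, x):
    # sinusoidal: (B, T, D) with interleaved [sin, cos] pairs along the last axis (assumed layout).
    # x: (B, T, D) query or key tensor to rotate.
    sin_pos = paddle.repeat_interleave(sinusoidal[..., 0::2], 2, axis=-1)
    cos_pos = paddle.repeat_interleave(sinusoidal[..., 1::2], 2, axis=-1)
    # Pairwise rotation: (x0, x1) -> (x0*cos - x1*sin, x1*cos + x0*sin).
    x2 = paddle.stack([-x[..., 1::2], x[..., 0::2]], axis=-1).reshape(x.shape)
    return x * cos_pos + x2 * sin_pos

# Build an interleaved sinusoidal table and rotate a random "query".
B, T, D = 1, 10, 8
pos = paddle.arange(T, dtype='float32').unsqueeze(-1)                                   # (T, 1)
inv_freq = paddle.exp(-paddle.arange(0, D, 2, dtype='float32') / D * math.log(10000.0)) # (D/2,)
angles = pos * inv_freq                                                                  # (T, D/2)
sinusoidal = paddle.stack([paddle.sin(angles), paddle.cos(angles)], axis=-1).reshape([1, T, D])

q = paddle.randn([B, T, D])
print(rope_rotate(sinusoidal, q).shape)  # [1, 10, 8]
```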
- forward(query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, mask: paddle.Tensor = <empty Tensor, shape=[0, 0, 0], dtype=bool>, pos_emb: paddle.Tensor = <empty Tensor, shape=[0], dtype=float32>, cache: paddle.Tensor = <empty Tensor, shape=[0, 0, 0, 0], dtype=float32>) -> Tuple[Tensor, Tensor] [source]
Compute 'Scaled Dot Product Attention' with RoPE relative positional encoding.
Ref: https://github.com/facebookresearch/llama/blob/main/llama/model.py
Args:
- query (paddle.Tensor): Query tensor (#batch, time1, size).
- key (paddle.Tensor): Key tensor (#batch, time2, size).
- value (paddle.Tensor): Value tensor (#batch, time2, size).
- mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2); shape (0, 0, 0) means a fake mask.
- pos_emb (paddle.Tensor): Positional embedding tensor (#batch, time2, size).
- cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
Returns:
- paddle.Tensor: Output tensor (#batch, time1, d_model).
- paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
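The cache contract in the Returns above is shared by all three attention classes. As a rough sketch of streaming-style use, assuming batch size 1 and hypothetical chunk sizes, the cache returned by one call can be fed back into the next so that later chunks also attend to the cached keys and values (shown here with the base MultiHeadedAttention):

```python
import paddle
from paddlespeech.s2t.modules.attention import MultiHeadedAttention

attn = MultiHeadedAttention(n_head=4, n_feat=256, dropout_rate=0.0)

chunk1 = paddle.randn([1, 16, 256])         # first decoding chunk (batch=1)
out1, cache = attn(chunk1, chunk1, chunk1)
print(cache.shape)   # [1, 4, 16, 128]  == (1, head, cache_t, d_k * 2)

chunk2 = paddle.randn([1, 16, 256])         # next chunk reuses the cached keys/values
out2, cache = attn(chunk2, chunk2, chunk2, cache=cache)
print(cache.shape)   # [1, 4, 32, 128]  == (1, head, cache_t + time1, d_k * 2)
```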