paddlespeech.s2t.modules.attention module

Multi-Head Attention layer definition.

class paddlespeech.s2t.modules.attention.MultiHeadedAttention(n_head: int, n_feat: int, dropout_rate: float)

Bases: Layer

Multi-Head Attention layer.


__call__(*inputs, **kwargs)

Call self as a function.

add_parameter(name, parameter)

Adds a Parameter instance.

add_sublayer(name, sublayer)

Adds a sub Layer instance.


apply(fn)

Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.

buffers([include_sublayers])

Returns a list of all buffers from current layer and its sub-layers.

children()

Returns an iterator over immediate children layers.

clear_gradients()

Clear the gradients of all parameters for this layer.

create_parameter(shape[, attr, dtype, ...])

Create parameters for this layer.

create_tensor([name, persistable, dtype])

Create Tensor for this layer.

create_variable([name, persistable, dtype])

Create Tensor for this layer.


eval()

Sets this Layer and all its sublayers to evaluation mode.

extra_repr()

Extra representation of this layer; override it to customize the representation of your own layer.

forward(query, key, value[, mask, ...])

Compute scaled dot product attention.

forward_attention(value, scores[, mask, ...])

Compute attention context vector.

forward_qkv(query, key, value)

Transform query, key and value.

full_name()

Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__

load_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

named_buffers([prefix, include_sublayers])

Returns an iterator over all buffers in the Layer, yielding tuple of name and Tensor.


named_children()

Returns an iterator over immediate children layers, yielding both the name of the layer and the layer itself.

named_parameters([prefix, include_sublayers])

Returns an iterator over all parameters in the Layer, yielding tuple of name and parameter.

named_sublayers([prefix, include_self, ...])

Returns an iterator over all sublayers in the Layer, yielding tuple of name and sublayer.


parameters([include_sublayers])

Returns a list of all Parameters from current layer and its sub-layers.

register_buffer(name, tensor[, persistable])

Registers a tensor as buffer into the layer.


register_forward_post_hook(hook)

Register a forward post-hook for Layer.

register_forward_pre_hook(hook)

Register a forward pre-hook for Layer.

set_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

set_state_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

state_dict([destination, include_sublayers, ...])

Get all parameters and persistable buffers of current layer and its sub-layers.


sublayers([include_self])

Returns a list of sub layers.

to([device, dtype, blocking])

Cast the parameters and buffers of Layer by the given device, dtype and blocking.

to_static_state_dict([destination, ...])

Get all parameters and buffers of current layer and its sub-layers.


train()

Sets this Layer and all its sublayers to training mode.



forward(query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, mask: paddle.Tensor = Tensor(shape=[0, 0, 0], dtype=bool), pos_emb: paddle.Tensor = Tensor(shape=[0], dtype=float32), cache: paddle.Tensor = Tensor(shape=[0, 0, 0, 0], dtype=float32)) -> Tuple[Tensor, Tensor]

Compute scaled dot product attention.

Args:

query (paddle.Tensor): Query tensor (#batch, time1, size).

key (paddle.Tensor): Key tensor (#batch, time2, size).

value (paddle.Tensor): Value tensor (#batch, time2, size).

mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2).

1. When applying cross attention between decoder and encoder, the batch padding mask for the input has shape (#batch, 1, T).

2. When applying encoder self attention, the mask has shape (#batch, T, T).

3. When applying decoder self attention, the mask has shape (#batch, L, L).

4. If different decoder positions see different blocks of the encoder, as in MoChA, the mask passed in could have shape (#batch, L, T). There is no such case in current WeNet.

cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.

Returns:

paddle.Tensor: Output tensor (#batch, time1, d_model).

paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
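
The shape relations stated for the cache can be checked with a small shape-only sketch. The helper `attention_shapes` and its argument names are hypothetical and not part of paddlespeech; it only restates the arithmetic from the docstring and assumes that d_model equals size.

```python
from typing import Dict, Tuple


def attention_shapes(batch: int, time1: int, size: int, n_head: int,
                     chunk_size: int,
                     num_decoding_left_chunks: int) -> Dict[str, Tuple[int, ...]]:
    """Return the tensor shapes named in the docstring (shape arithmetic only)."""
    if size % n_head != 0:
        raise ValueError("size must be divisible by n_head")
    d_k = size // n_head  # head * d_k == size
    cache_t = chunk_size * num_decoding_left_chunks
    return {
        "cache_in": (1, n_head, cache_t, d_k * 2),
        "output": (batch, time1, size),  # assumption: d_model == size
        "cache_out": (1, n_head, cache_t + time1, d_k * 2),
    }


shapes = attention_shapes(batch=1, time1=16, size=256, n_head=4,
                          chunk_size=16, num_decoding_left_chunks=2)
print(shapes["cache_in"])   # (1, 4, 32, 128)
print(shapes["cache_out"])  # (1, 4, 48, 128)
```

Each call extends the cache along the time axis by time1, which is why the returned cache has cache_t + time1 positions.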

forward_attention(value: paddle.Tensor, scores: paddle.Tensor, mask: paddle.Tensor = Tensor(shape=[0, 0, 0], dtype=bool)) -> Tensor

Compute attention context vector.

Args:

value (paddle.Tensor): Transformed value, size (#batch, n_head, time2, d_k).

scores (paddle.Tensor): Attention score, size (#batch, n_head, time1, time2).

mask (paddle.Tensor): Mask, size (#batch, 1, time2) or (#batch, time1, time2). A mask of shape (0, 0, 0) is a fake mask.

Returns:

paddle.Tensor: Transformed value (#batch, time1, d_model) weighted by the attention score (#batch, time1, time2).
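
As a rough illustration of what "weighted by the attention score" means, the sketch below handles one head of one batch item with plain Python lists. It is not the library's implementation: `attention_context` is a hypothetical name, dropout is omitted, the mask convention (True means the position is kept) is an assumption, and at least one position per row must be kept.

```python
import math
from typing import List, Optional

Matrix = List[List[float]]


def attention_context(value: Matrix, scores: Matrix,
                      mask: Optional[List[bool]] = None) -> Matrix:
    """Softmax each score row over time2, then take the weighted sum of value rows.

    value: (time2, d_k), scores: (time1, time2), mask: (time2,), True = keep.
    mask=None stands in for the fake (0, 0, 0) mask: nothing is masked.
    """
    context = []
    for row in scores:
        if mask is not None:
            # Masked positions get -inf so that their softmax weight is 0.
            row = [s if keep else -math.inf for s, keep in zip(row, mask)]
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        weights = [e / total for e in exps]
        context.append([
            sum(w * v[j] for w, v in zip(weights, value))
            for j in range(len(value[0]))
        ])
    return context


value = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]   # time2 = 3, d_k = 2
scores = [[0.0, 0.0, 0.0]]                     # time1 = 1
masked = attention_context(value, scores, mask=[True, True, False])
print(masked)  # [[0.5, 0.5]]
```

With the third position masked out, the two remaining positions share the weight equally, so the padded value row does not contribute to the context vector.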

forward_qkv(query: Tensor, key: Tensor, value: Tensor) -> Tuple[Tensor, Tensor, Tensor]

Transform query, key and value.

Args:

query (paddle.Tensor): Query tensor (#batch, time1, size).

key (paddle.Tensor): Key tensor (#batch, time2, size).

value (paddle.Tensor): Value tensor (#batch, time2, size).

Returns:

paddle.Tensor: Transformed query tensor, size (#batch, n_head, time1, d_k).

paddle.Tensor: Transformed key tensor, size (#batch, n_head, time2, d_k).

paddle.Tensor: Transformed value tensor, size (#batch, n_head, time2, d_k).
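
The change from (#batch, time, size) to (#batch, n_head, time, d_k) amounts to cutting the feature axis into n_head slices of width d_k. A minimal sketch for a single batch item follows; `split_heads` is a hypothetical helper, and the learned linear projections that a real layer applies before the split are left out.

```python
from typing import List


def split_heads(x: List[List[float]], n_head: int) -> List[List[List[float]]]:
    """Reshape (time, size) into (n_head, time, d_k), with d_k = size // n_head."""
    size = len(x[0])
    if size % n_head != 0:
        raise ValueError("size must be divisible by n_head")
    d_k = size // n_head
    # Head h receives features h * d_k .. (h + 1) * d_k - 1 of every time step.
    return [[row[h * d_k:(h + 1) * d_k] for row in x] for h in range(n_head)]


x = [[0.0, 1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0, 7.0],
     [8.0, 9.0, 10.0, 11.0]]  # time = 3, size = 4
heads = split_heads(x, n_head=2)
print(heads[0])  # [[0.0, 1.0], [4.0, 5.0], [8.0, 9.0]]
print(heads[1])  # [[2.0, 3.0], [6.0, 7.0], [10.0, 11.0]]
```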

class paddlespeech.s2t.modules.attention.RelPositionMultiHeadedAttention(n_head, n_feat, dropout_rate, adaptive_scale=False, init_weights=False)

Bases: MultiHeadedAttention

Multi-Head Attention layer with relative position encoding.


__call__(*inputs, **kwargs)

Call self as a function.

add_parameter(name, parameter)

Adds a Parameter instance.

add_sublayer(name, sublayer)

Adds a sub Layer instance.


apply(fn)

Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.

buffers([include_sublayers])

Returns a list of all buffers from current layer and its sub-layers.

children()

Returns an iterator over immediate children layers.

clear_gradients()

Clear the gradients of all parameters for this layer.

create_parameter(shape[, attr, dtype, ...])

Create parameters for this layer.

create_tensor([name, persistable, dtype])

Create Tensor for this layer.

create_variable([name, persistable, dtype])

Create Tensor for this layer.


eval()

Sets this Layer and all its sublayers to evaluation mode.

extra_repr()

Extra representation of this layer; override it to customize the representation of your own layer.

forward(query, key, value[, mask, ...])

Compute 'Scaled Dot Product Attention' with relative positional encoding.

forward_attention(value, scores[, mask, ...])

Compute attention context vector.

forward_qkv(query, key, value)

Transform query, key and value.

full_name()

Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__

load_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

named_buffers([prefix, include_sublayers])

Returns an iterator over all buffers in the Layer, yielding tuple of name and Tensor.


named_children()

Returns an iterator over immediate children layers, yielding both the name of the layer and the layer itself.

named_parameters([prefix, include_sublayers])

Returns an iterator over all parameters in the Layer, yielding tuple of name and parameter.

named_sublayers([prefix, include_self, ...])

Returns an iterator over all sublayers in the Layer, yielding tuple of name and sublayer.


parameters([include_sublayers])

Returns a list of all Parameters from current layer and its sub-layers.

register_buffer(name, tensor[, persistable])

Registers a tensor as buffer into the layer.


register_forward_post_hook(hook)

Register a forward post-hook for Layer.

register_forward_pre_hook(hook)

Register a forward pre-hook for Layer.

rel_shift(x[, zero_triu])

Compute relative positional encoding.

set_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

set_state_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

state_dict([destination, include_sublayers, ...])

Get all parameters and persistable buffers of current layer and its sub-layers.


sublayers([include_self])

Returns a list of sub layers.

to([device, dtype, blocking])

Cast the parameters and buffers of Layer by the given device, dtype and blocking.

to_static_state_dict([destination, ...])

Get all parameters and buffers of current layer and its sub-layers.


train()

Sets this Layer and all its sublayers to training mode.




forward(query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, mask: paddle.Tensor = Tensor(shape=[0, 0, 0], dtype=bool), pos_emb: paddle.Tensor = Tensor(shape=[0], dtype=float32), cache: paddle.Tensor = Tensor(shape=[0, 0, 0, 0], dtype=float32)) -> Tuple[Tensor, Tensor]

Compute 'Scaled Dot Product Attention' with relative positional encoding.

Args:

query (paddle.Tensor): Query tensor (#batch, time1, size).

key (paddle.Tensor): Key tensor (#batch, time2, size).

value (paddle.Tensor): Value tensor (#batch, time2, size).

mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2). A mask of shape (0, 0, 0) is a fake mask.

pos_emb (paddle.Tensor): Positional embedding tensor (#batch, time2, size).

cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.

Returns:

paddle.Tensor: Output tensor (#batch, time1, d_model).

paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.

rel_shift(x, zero_triu: bool = False)

Compute relative positional encoding.

Args:

x (paddle.Tensor): Input tensor (batch, head, time1, time1).

zero_triu (bool): If true, return the lower triangular part of the matrix.

Returns:

paddle.Tensor: Output tensor (batch, head, time1, time1).
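
For intuition, here is a sketch of the pad-and-reshape shift commonly used for relative position scores, applied to a single (time1, time1) matrix. That rel_shift follows this formulation is an assumption, since the docstring does not say so; `rel_shift_2d` is a hypothetical name and the batch and head axes are dropped.

```python
from typing import List

Matrix = List[List[float]]


def rel_shift_2d(x: Matrix, zero_triu: bool = False) -> Matrix:
    """Shift a square (time1, time1) matrix with the pad-and-reshape trick."""
    t = len(x)
    # 1. Prepend a zero column: (t, t) -> (t, t + 1), flattened row-major.
    flat = [v for row in x for v in [0.0] + list(row)]
    # 2. Reinterpret as (t + 1, t) and drop the first row (the first t values).
    flat = flat[t:]
    # 3. Reinterpret the remaining t * t values as (t, t).
    out = [flat[i * t:(i + 1) * t] for i in range(t)]
    if zero_triu:
        # Keep only the lower triangle (column index <= row index).
        out = [[v if j <= i else 0.0 for j, v in enumerate(row)]
               for i, row in enumerate(out)]
    return out


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
print(rel_shift_2d(x))
# [[3.0, 0.0, 4.0], [5.0, 6.0, 0.0], [7.0, 8.0, 9.0]]
print(rel_shift_2d(x, zero_triu=True))
# [[3.0, 0.0, 0.0], [5.0, 6.0, 0.0], [7.0, 8.0, 9.0]]
```

Row i ends up shifted left by time1 - 1 - i positions, so the last row is unchanged. The entries that wrap around from the following row land above the diagonal, which is what zero_triu clears.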

class paddlespeech.s2t.modules.attention.RoPERelPositionMultiHeadedAttention(n_head, n_feat, dropout_rate, adaptive_scale=False, init_weights=False)

Bases: MultiHeadedAttention

Multi-Head Attention layer with RoPE relative position encoding.


__call__(*inputs, **kwargs)

Call self as a function.

add_parameter(name, parameter)

Adds a Parameter instance.

add_sublayer(name, sublayer)

Adds a sub Layer instance.

align(tensor, axes[, ndim])

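Realign the tensor (a batched version of expand_dims). axes: dimension i of the original tensor is aligned with dimension axes[i] of the new tensor; ndim: the number of dimensions of the new tensor.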


apply(fn)

Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.

apply_rotary_position_embeddings(sinusoidal, ...)

Apply RoPE to tensors, where sinusoidal.shape=[B, T, D], tensors is a list of tensors, and each tensor.shape=[B, T, ..., D] or (B, H, T, D/H).


buffers([include_sublayers])

Returns a list of all buffers from current layer and its sub-layers.

children()

Returns an iterator over immediate children layers.

clear_gradients()

Clear the gradients of all parameters for this layer.

create_parameter(shape[, attr, dtype, ...])

Create parameters for this layer.

create_tensor([name, persistable, dtype])

Create Tensor for this layer.

create_variable([name, persistable, dtype])

Create Tensor for this layer.


eval()

Sets this Layer and all its sublayers to evaluation mode.

extra_repr()

Extra representation of this layer; override it to customize the representation of your own layer.

forward(query, key, value[, mask, ...])

Compute 'Scaled Dot Product Attention' with relative positional encoding.

forward_attention(value, scores[, mask, ...])

Compute attention context vector.

forward_qkv(query, key, value)

Transform query, key and value.

full_name()

Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__

load_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

named_buffers([prefix, include_sublayers])

Returns an iterator over all buffers in the Layer, yielding tuple of name and Tensor.


named_children()

Returns an iterator over immediate children layers, yielding both the name of the layer and the layer itself.

named_parameters([prefix, include_sublayers])

Returns an iterator over all parameters in the Layer, yielding tuple of name and parameter.

named_sublayers([prefix, include_self, ...])

Returns an iterator over all sublayers in the Layer, yielding tuple of name and sublayer.


parameters([include_sublayers])

Returns a list of all Parameters from current layer and its sub-layers.

register_buffer(name, tensor[, persistable])

Registers a tensor as buffer into the layer.


register_forward_post_hook(hook)

Register a forward post-hook for Layer.

register_forward_pre_hook(hook)

Register a forward pre-hook for Layer.

set_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

set_state_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

state_dict([destination, include_sublayers, ...])

Get all parameters and persistable buffers of current layer and its sub-layers.


sublayers([include_self])

Returns a list of sub layers.

to([device, dtype, blocking])

Cast the parameters and buffers of Layer by the given device, dtype and blocking.

to_static_state_dict([destination, ...])

Get all parameters and buffers of current layer and its sub-layers.


train()

Sets this Layer and all its sublayers to training mode.



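align(tensor: Tensor, axes: List[int], ndim=None)

Realign the tensor (a batched version of expand_dims). axes: dimension i of the original tensor is aligned with dimension axes[i] of the new tensor; ndim: the number of dimensions of the new tensor.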
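
The effect of align on shapes can be sketched without any tensor library. The helper `aligned_shape` below is hypothetical and works on shape tuples only; it assumes axes is given in increasing order and that every position not named in axes becomes a dimension of size 1, as with expand_dims.

```python
from typing import List, Optional, Tuple


def aligned_shape(shape: Tuple[int, ...], axes: List[int],
                  ndim: Optional[int] = None) -> Tuple[int, ...]:
    """Shape after placing original dimension i at position axes[i]; the rest are 1."""
    if len(axes) != len(shape):
        raise ValueError("axes needs one entry per original dimension")
    if ndim is None:
        if min(axes) < 0:
            raise ValueError("negative axes require an explicit ndim")
        ndim = max(axes) + 1
    new_shape = [1] * ndim
    for size, axis in zip(shape, axes):
        new_shape[axis] = size
    return tuple(new_shape)


print(aligned_shape((2, 5, 8), axes=[0, 1, -1], ndim=4))  # (2, 5, 1, 8)
print(aligned_shape((5, 8), axes=[1, 3]))                 # (1, 5, 1, 8)
```

In the first call a [B, T, D] shape is aligned against a four-dimensional layout by inserting a size-1 axis before the feature axis, which is the kind of broadcasting a [B, T, D] sinusoidal table needs when it meets a [B, T, ..., D] tensor.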

apply_rotary_position_embeddings(sinusoidal, *tensors)

Apply RoPE to tensors, where sinusoidal.shape=[B, T, D], tensors is a list of tensors, and each tensor.shape=[B, T, ..., D] or (B, H, T, D/H).
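
To make the rotation concrete, the sketch below applies RoPE to one feature vector at one position using plain Python. It follows the common RoPE formulation rather than this method's code: the name `rope_rotate`, the base of 10000, and the pairing of adjacent features (2i, 2i+1) are assumptions, and the angles are computed on the fly instead of being read from a sinusoidal table.

```python
import math
from typing import List


def rope_rotate(vec: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle position * base**(-2i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("feature dimension must be even")
    out = []
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out.extend([x0 * cos_t - x1 * sin_t, x0 * sin_t + x1 * cos_t])
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 2.0]
k = [1.5, 0.4, -0.6, 0.9]
# Both pairs of positions are 2 steps apart, so the scores should agree.
s1 = dot(rope_rotate(q, 5), rope_rotate(k, 3))
s2 = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(abs(s1 - s2) < 1e-9)  # True
```

Because each pair is rotated by an angle proportional to its position, the dot product between a rotated query and a rotated key depends only on the distance between the two positions. This is why RoPE acts as a relative position encoding.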

forward(query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, mask: paddle.Tensor = Tensor(shape=[0, 0, 0], dtype=bool), pos_emb: paddle.Tensor = Tensor(shape=[0], dtype=float32), cache: paddle.Tensor = Tensor(shape=[0, 0, 0, 0], dtype=float32)) -> Tuple[Tensor, Tensor]

Compute 'Scaled Dot Product Attention' with relative positional encoding.

Ref:

Args:

query (paddle.Tensor): Query tensor (#batch, time1, size).

key (paddle.Tensor): Key tensor (#batch, time2, size).

value (paddle.Tensor): Value tensor (#batch, time2, size).

mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2). A mask of shape (0, 0, 0) is a fake mask.

pos_emb (paddle.Tensor): Positional embedding tensor (#batch, time2, size).

cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.

Returns:

paddle.Tensor: Output tensor (#batch, time1, d_model).

paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.