paddlespeech.s2t.modules.attention module

Multi-Head Attention layer definition.

class paddlespeech.s2t.modules.attention.MultiHeadedAttention(n_head: int, n_feat: int, dropout_rate: float)[source]

Bases: Layer

Multi-Head Attention layer.

Methods

__call__(*inputs, **kwargs)

Call self as a function.

add_parameter(name, parameter)

Adds a Parameter instance.

add_sublayer(name, sublayer)

Adds a sub Layer instance.

apply(fn)

Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.

buffers([include_sublayers])

Returns a list of all buffers from current layer and its sub-layers.

children()

Returns an iterator over immediate children layers.

clear_gradients()

Clear the gradients of all parameters for this layer.

create_parameter(shape[, attr, dtype, ...])

Create parameters for this layer.

create_tensor([name, persistable, dtype])

Create Tensor for this layer.

create_variable([name, persistable, dtype])

Create Tensor for this layer.

eval()

Sets this Layer and all its sublayers to evaluation mode.

extra_repr()

Extra representation of this layer; you can provide a custom implementation for your own layer.

forward(query, key, value[, mask, ...])

Compute scaled dot product attention.

forward_attention(value, scores[, mask, ...])

Compute attention context vector.

forward_qkv(query, key, value)

Transform query, key and value.

full_name()

Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__.

load_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

named_buffers([prefix, include_sublayers])

Returns an iterator over all buffers in the Layer, yielding tuple of name and Tensor.

named_children()

Returns an iterator over immediate children layers, yielding both the name of the layer as well as the layer itself.

named_parameters([prefix, include_sublayers])

Returns an iterator over all parameters in the Layer, yielding tuple of name and parameter.

named_sublayers([prefix, include_self, ...])

Returns an iterator over all sublayers in the Layer, yielding tuple of name and sublayer.

parameters([include_sublayers])

Returns a list of all Parameters from current layer and its sub-layers.

register_buffer(name, tensor[, persistable])

Registers a tensor as buffer into the layer.

register_forward_post_hook(hook)

Register a forward post-hook for Layer.

register_forward_pre_hook(hook)

Register a forward pre-hook for Layer.

set_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

set_state_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

state_dict([destination, include_sublayers, ...])

Get all parameters and persistable buffers of current layer and its sub-layers.

sublayers([include_self])

Returns a list of sub layers.

to([device, dtype, blocking])

Cast the parameters and buffers of the Layer according to the given device, dtype and blocking setting.

to_static_state_dict([destination, ...])

Get all parameters and buffers of current layer and its sub-layers.

train()

Sets this Layer and all its sublayers to training mode.

backward

register_state_dict_hook

forward(query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, mask: paddle.Tensor = <empty bool Tensor, shape [0, 0, 0]>, pos_emb: paddle.Tensor = <empty float32 Tensor, shape [0]>, cache: paddle.Tensor = <empty float32 Tensor, shape [0, 0, 0, 0]>) → Tuple[Tensor, Tensor][source]

Compute scaled dot product attention.

Args:
    query (paddle.Tensor): Query tensor (#batch, time1, size).
    key (paddle.Tensor): Key tensor (#batch, time2, size).
    value (paddle.Tensor): Value tensor (#batch, time2, size).
    mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2).
        1. For cross attention between decoder and encoder, the batch padding mask of the input has shape (#batch, 1, T).
        2. For encoder self-attention, the mask has shape (#batch, T, T).
        3. For decoder self-attention, the mask has shape (#batch, L, L).
        4. If different decoder positions see different blocks of the encoder, such as in Mocha, the mask passed in may have shape (#batch, L, T); there is no such case in current Wenet.
    cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.

Returns:
    paddle.Tensor: Output tensor (#batch, time1, d_model).
    paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
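
A minimal usage sketch, assuming the mask flags valid key frames with True and calling the layer through __call__ as documented above; dropout is disabled via eval() so the call is deterministic:

    import paddle
    from paddlespeech.s2t.modules.attention import MultiHeadedAttention

    batch, time1, time2, n_feat, n_head = 2, 5, 7, 256, 4

    attn = MultiHeadedAttention(n_head=n_head, n_feat=n_feat, dropout_rate=0.1)
    attn.eval()  # disable dropout for a deterministic call

    q = paddle.randn([batch, time1, n_feat])   # (#batch, time1, size)
    k = paddle.randn([batch, time2, n_feat])   # (#batch, time2, size)
    v = paddle.randn([batch, time2, n_feat])   # (#batch, time2, size)
    # cross-attention padding mask (#batch, 1, time2); True marks a valid frame (assumed)
    mask = paddle.ones([batch, 1, time2], dtype='bool')

    out, new_cache = attn(q, k, v, mask)
    print(out.shape)  # [2, 5, 256] -> (#batch, time1, d_model)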

forward_attention(value: paddle.Tensor, scores: paddle.Tensor, mask: paddle.Tensor = <empty bool Tensor, shape [0, 0, 0]>) → Tensor[source]

Compute attention context vector.

Args:
    value (paddle.Tensor): Transformed value, size (#batch, n_head, time2, d_k).
    scores (paddle.Tensor): Attention score, size (#batch, n_head, time1, time2).
    mask (paddle.Tensor): Mask, size (#batch, 1, time2) or (#batch, time1, time2); shape (0, 0, 0) denotes a fake (empty) mask.

Returns:
    paddle.Tensor: Transformed value (#batch, time1, d_model), weighted by the attention score (#batch, time1, time2).
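
For intuition, here is a plain-Paddle sketch of the weighting described above; it is an independent illustration, not the module's own implementation, and it assumes the mask flags valid key frames with True:

    import paddle
    import paddle.nn.functional as F

    batch, n_head, time1, time2, d_k = 2, 4, 5, 7, 64
    value = paddle.randn([batch, n_head, time2, d_k])
    scores = paddle.randn([batch, n_head, time1, time2])
    mask = paddle.ones([batch, 1, time2], dtype='bool')   # True = valid frame (assumed)

    # broadcast the mask over heads and query steps, then mask out padded keys
    mask4 = mask.unsqueeze(1).expand([batch, n_head, time1, time2])
    scores = paddle.where(mask4, scores, paddle.full_like(scores, float('-inf')))

    attn = F.softmax(scores, axis=-1)                      # (#batch, n_head, time1, time2)
    context = paddle.matmul(attn, value)                   # (#batch, n_head, time1, d_k)
    context = context.transpose([0, 2, 1, 3]).reshape([batch, time1, n_head * d_k])
    print(context.shape)  # [2, 5, 256] -> (#batch, time1, d_model)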

forward_qkv(query: Tensor, key: Tensor, value: Tensor) → Tuple[Tensor, Tensor, Tensor][source]

Transform query, key and value.

Args:
    query (paddle.Tensor): Query tensor (#batch, time1, size).
    key (paddle.Tensor): Key tensor (#batch, time2, size).
    value (paddle.Tensor): Value tensor (#batch, time2, size).

Returns:
    paddle.Tensor: Transformed query tensor, size (#batch, n_head, time1, d_k).
    paddle.Tensor: Transformed key tensor, size (#batch, n_head, time2, d_k).
    paddle.Tensor: Transformed value tensor, size (#batch, n_head, time2, d_k).
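
The projection-and-head-split this describes can be sketched in plain Paddle as follows; the Linear layer here is a stand-in for illustration, not the module's own parameter:

    import paddle
    import paddle.nn as nn

    batch, time1, n_feat, n_head = 2, 5, 256, 4
    d_k = n_feat // n_head

    linear_q = nn.Linear(n_feat, n_feat)           # stand-in projection
    query = paddle.randn([batch, time1, n_feat])

    q = linear_q(query)                            # (#batch, time1, n_feat)
    q = q.reshape([batch, -1, n_head, d_k])        # (#batch, time1, n_head, d_k)
    q = q.transpose([0, 2, 1, 3])                  # (#batch, n_head, time1, d_k)
    print(q.shape)  # [2, 4, 5, 64]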

class paddlespeech.s2t.modules.attention.RelPositionMultiHeadedAttention(n_head, n_feat, dropout_rate, adaptive_scale=False, init_weights=False)[source]

Bases: MultiHeadedAttention

Multi-Head Attention layer with relative position encoding.

Methods

__call__(*inputs, **kwargs)

Call self as a function.

add_parameter(name, parameter)

Adds a Parameter instance.

add_sublayer(name, sublayer)

Adds a sub Layer instance.

apply(fn)

Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.

buffers([include_sublayers])

Returns a list of all buffers from current layer and its sub-layers.

children()

Returns an iterator over immediate children layers.

clear_gradients()

Clear the gradients of all parameters for this layer.

create_parameter(shape[, attr, dtype, ...])

Create parameters for this layer.

create_tensor([name, persistable, dtype])

Create Tensor for this layer.

create_variable([name, persistable, dtype])

Create Tensor for this layer.

eval()

Sets this Layer and all its sublayers to evaluation mode.

extra_repr()

Extra representation of this layer; you can provide a custom implementation for your own layer.

forward(query, key, value[, mask, ...])

Compute 'Scaled Dot Product Attention' with relative positional encoding.

forward_attention(value, scores[, mask, ...])

Compute attention context vector.

forward_qkv(query, key, value)

Transform query, key and value.

full_name()

Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__.

load_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

named_buffers([prefix, include_sublayers])

Returns an iterator over all buffers in the Layer, yielding tuple of name and Tensor.

named_children()

Returns an iterator over immediate children layers, yielding both the name of the layer as well as the layer itself.

named_parameters([prefix, include_sublayers])

Returns an iterator over all parameters in the Layer, yielding tuple of name and parameter.

named_sublayers([prefix, include_self, ...])

Returns an iterator over all sublayers in the Layer, yielding tuple of name and sublayer.

parameters([include_sublayers])

Returns a list of all Parameters from current layer and its sub-layers.

register_buffer(name, tensor[, persistable])

Registers a tensor as buffer into the layer.

register_forward_post_hook(hook)

Register a forward post-hook for Layer.

register_forward_pre_hook(hook)

Register a forward pre-hook for Layer.

rel_shift(x[, zero_triu])

Compute relative positional encoding.

set_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

set_state_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

state_dict([destination, include_sublayers, ...])

Get all parameters and persistable buffers of current layer and its sub-layers.

sublayers([include_self])

Returns a list of sub layers.

to([device, dtype, blocking])

Cast the parameters and buffers of the Layer according to the given device, dtype and blocking setting.

to_static_state_dict([destination, ...])

Get all parameters and buffers of current layer and its sub-layers.

train()

Sets this Layer and all its sublayers to training mode.

backward

init_weights

register_state_dict_hook

forward(query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, mask: paddle.Tensor = <empty bool Tensor, shape [0, 0, 0]>, pos_emb: paddle.Tensor = <empty float32 Tensor, shape [0]>, cache: paddle.Tensor = <empty float32 Tensor, shape [0, 0, 0, 0]>) → Tuple[Tensor, Tensor][source]

Compute 'Scaled Dot Product Attention' with relative positional encoding.

Args:
    query (paddle.Tensor): Query tensor (#batch, time1, size).
    key (paddle.Tensor): Key tensor (#batch, time2, size).
    value (paddle.Tensor): Value tensor (#batch, time2, size).
    mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2); shape (0, 0, 0) denotes a fake (empty) mask.
    pos_emb (paddle.Tensor): Positional embedding tensor (#batch, time2, size).
    cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.

Returns:
    paddle.Tensor: Output tensor (#batch, time1, d_model).
    paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
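
A minimal usage sketch for encoder self-attention; the random pos_emb only stands in for the output of the paired relative positional-encoding module, following the documented (#batch, time2, size) shape:

    import paddle
    from paddlespeech.s2t.modules.attention import RelPositionMultiHeadedAttention

    batch, time, n_feat, n_head = 2, 6, 256, 4

    attn = RelPositionMultiHeadedAttention(n_head, n_feat, dropout_rate=0.1)
    attn.eval()

    x = paddle.randn([batch, time, n_feat])
    mask = paddle.ones([batch, time, time], dtype='bool')   # encoder self-attention mask
    pos_emb = paddle.randn([1, time, n_feat])               # stand-in positional embedding

    out, _ = attn(x, x, x, mask, pos_emb)
    print(out.shape)  # [2, 6, 256] -> (#batch, time1, d_model)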

init_weights()[source]
rel_shift(x, zero_triu: bool = False)[source]

Compute relative positional encoding.

Args:
    x (paddle.Tensor): Input tensor (batch, head, time1, time1).
    zero_triu (bool): If true, return the lower triangular part of the matrix.

Returns:
    paddle.Tensor: Output tensor (batch, head, time1, time1).
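
The operation described is the Transformer-XL style relative-shift trick; a plain-Paddle sketch of that trick, which may differ in detail from the module's own implementation, looks like this:

    import paddle

    def rel_shift_sketch(x):
        # x: (batch, head, time1, time2) matrix of relative-position scores
        b, h, t1, t2 = x.shape
        zero_pad = paddle.zeros([b, h, t1, 1], dtype=x.dtype)
        x_padded = paddle.concat([zero_pad, x], axis=-1)     # (b, h, t1, t2 + 1)
        x_padded = x_padded.reshape([b, h, t2 + 1, t1])
        return x_padded[:, :, 1:].reshape([b, h, t1, t2])

    x = paddle.randn([2, 4, 5, 5])
    print(rel_shift_sketch(x).shape)  # [2, 4, 5, 5]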

class paddlespeech.s2t.modules.attention.RoPERelPositionMultiHeadedAttention(n_head, n_feat, dropout_rate, adaptive_scale=False, init_weights=False)[source]

Bases: MultiHeadedAttention

Multi-Head Attention layer with RoPE relative position encoding.

Methods

__call__(*inputs, **kwargs)

Call self as a function.

add_parameter(name, parameter)

Adds a Parameter instance.

add_sublayer(name, sublayer)

Adds a sub Layer instance.

align(tensor, axes[, ndim])

Re-align a tensor (a batched expand_dims): the original i-th dimension is mapped to dimension axes[i] of the new tensor; ndim is the number of dimensions of the new tensor.

apply(fn)

Applies fn recursively to every sublayer (as returned by .sublayers()) as well as self.

apply_rotary_position_embeddings(sinusoidal, ...)

Apply RoPE to the given tensors, where sinusoidal.shape = [B, T, D] and each tensor has shape [B, T, ..., D] or (B, H, T, D/H).

buffers([include_sublayers])

Returns a list of all buffers from current layer and its sub-layers.

children()

Returns an iterator over immediate children layers.

clear_gradients()

Clear the gradients of all parameters for this layer.

create_parameter(shape[, attr, dtype, ...])

Create parameters for this layer.

create_tensor([name, persistable, dtype])

Create Tensor for this layer.

create_variable([name, persistable, dtype])

Create Tensor for this layer.

eval()

Sets this Layer and all its sublayers to evaluation mode.

extra_repr()

Extra representation of this layer; you can provide a custom implementation for your own layer.

forward(query, key, value[, mask, ...])

Compute 'Scaled Dot Product Attention' with RoPE relative positional encoding.

forward_attention(value, scores[, mask, ...])

Compute attention context vector.

forward_qkv(query, key, value)

Transform query, key and value.

full_name()

Full name for this layer, composed of name_scope + "/" + MyLayer.__class__.__name__.

load_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

named_buffers([prefix, include_sublayers])

Returns an iterator over all buffers in the Layer, yielding tuple of name and Tensor.

named_children()

Returns an iterator over immediate children layers, yielding both the name of the layer as well as the layer itself.

named_parameters([prefix, include_sublayers])

Returns an iterator over all parameters in the Layer, yielding tuple of name and parameter.

named_sublayers([prefix, include_self, ...])

Returns an iterator over all sublayers in the Layer, yielding tuple of name and sublayer.

parameters([include_sublayers])

Returns a list of all Parameters from current layer and its sub-layers.

register_buffer(name, tensor[, persistable])

Registers a tensor as buffer into the layer.

register_forward_post_hook(hook)

Register a forward post-hook for Layer.

register_forward_pre_hook(hook)

Register a forward pre-hook for Layer.

set_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

set_state_dict(state_dict[, use_structured_name])

Set parameters and persistable buffers from state_dict.

state_dict([destination, include_sublayers, ...])

Get all parameters and persistable buffers of current layer and its sub-layers.

sublayers([include_self])

Returns a list of sub layers.

to([device, dtype, blocking])

Cast the parameters and buffers of the Layer according to the given device, dtype and blocking setting.

to_static_state_dict([destination, ...])

Get all parameters and buffers of current layer and its sub-layers.

train()

Sets this Layer and all its sublayers to training mode.

backward

register_state_dict_hook

align(tensor: Tensor, axes: List[int], ndim=None)[source]

Re-align a tensor (a batched expand_dims): the original i-th dimension is mapped to dimension axes[i] of the new tensor; ndim is the number of dimensions of the new tensor.
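
A plain-Paddle illustration of the behaviour described above (an independent sketch based only on the docstring, not the module's code; it assumes axes is given in increasing order):

    import paddle

    def align_sketch(tensor, axes, ndim=None):
        # original dim i goes to new dim axes[i]; all other new dims get size 1
        ndim = ndim if ndim is not None else max(axes) + 1
        shape = [1] * ndim
        for i, axis in enumerate(axes):
            shape[axis] = tensor.shape[i]
        return tensor.reshape(shape)

    x = paddle.randn([2, 6])
    print(align_sketch(x, axes=[0, 3], ndim=4).shape)  # [2, 1, 1, 6]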

apply_rotary_position_embeddings(sinusoidal, *tensors)[source]

Apply RoPE to the given tensors, where sinusoidal.shape = [B, T, D] and each tensor has shape [B, T, ..., D] or (B, H, T, D/H).
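
The rotation this applies can be sketched in plain Paddle as below. The interleaved sin/cos layout of sinusoidal follows the common RoFormer formulation and is an assumption of this sketch, not a statement about the module's exact layout:

    import math
    import paddle

    def rope_sketch(sinusoidal, x):
        # sinusoidal: (B, T, D) with sin at even and cos at odd channels (assumed layout)
        sin = sinusoidal[:, :, 0::2]                                # (B, T, D/2)
        cos = sinusoidal[:, :, 1::2]                                # (B, T, D/2)
        sin = paddle.stack([sin, sin], axis=-1).reshape(x.shape)    # interleave back to (B, T, D)
        cos = paddle.stack([cos, cos], axis=-1).reshape(x.shape)
        # pair-wise rotation: (x0, x1) -> (x0*cos - x1*sin, x1*cos + x0*sin)
        x_rot = paddle.stack([-x[:, :, 1::2], x[:, :, 0::2]], axis=-1).reshape(x.shape)
        return x * cos + x_rot * sin

    B, T, D = 2, 6, 8
    pos = paddle.arange(T, dtype='float32').unsqueeze(-1)           # (T, 1)
    inv_freq = paddle.exp(-paddle.arange(0, D, 2, dtype='float32') * (math.log(10000.0) / D))
    angles = pos * inv_freq                                         # (T, D/2)
    sinusoidal = paddle.stack([paddle.sin(angles), paddle.cos(angles)], axis=-1)
    sinusoidal = sinusoidal.reshape([1, T, D]).expand([B, T, D])

    q = paddle.randn([B, T, D])
    print(rope_sketch(sinusoidal, q).shape)  # [2, 6, 8]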

forward(query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, mask: paddle.Tensor = <empty bool Tensor, shape [0, 0, 0]>, pos_emb: paddle.Tensor = <empty float32 Tensor, shape [0]>, cache: paddle.Tensor = <empty float32 Tensor, shape [0, 0, 0, 0]>) → Tuple[Tensor, Tensor][source]

Compute 'Scaled Dot Product Attention' with RoPE relative positional encoding.
Ref: https://github.com/facebookresearch/llama/blob/main/llama/model.py

Args:
    query (paddle.Tensor): Query tensor (#batch, time1, size).
    key (paddle.Tensor): Key tensor (#batch, time2, size).
    value (paddle.Tensor): Value tensor (#batch, time2, size).
    mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2); shape (0, 0, 0) denotes a fake (empty) mask.
    pos_emb (paddle.Tensor): Positional embedding tensor (#batch, time2, size).
    cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.

Returns:
    paddle.Tensor: Output tensor (#batch, time1, d_model).
    paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2), where cache_t == chunk_size * num_decoding_left_chunks and head * d_k == size.
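
A minimal usage sketch under the same assumptions as the RelPositionMultiHeadedAttention example above; the random pos_emb only stands in for the paired positional-encoding module's output, following the documented (#batch, time2, size) shape:

    import paddle
    from paddlespeech.s2t.modules.attention import RoPERelPositionMultiHeadedAttention

    batch, time, n_feat, n_head = 2, 6, 256, 4

    attn = RoPERelPositionMultiHeadedAttention(n_head, n_feat, dropout_rate=0.1)
    attn.eval()

    x = paddle.randn([batch, time, n_feat])
    mask = paddle.ones([batch, time, time], dtype='bool')
    pos_emb = paddle.randn([1, time, n_feat])   # stand-in for the RoPE sinusoidal table

    out, _ = attn(x, x, x, mask, pos_emb)
    print(out.shape)  # [2, 6, 256]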