mindformers.modules.transformer.TransformerDecoderLayer

class mindformers.modules.transformer.TransformerDecoderLayer(**kwargs)

Transformer Decoder Layer. This implements a single layer of the transformer decoder, comprising self-attention, cross-attention and a feedforward layer. When encoder_output is None, the cross-attention has no effect.
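The order of the three sublayers can be sketched in plain Python. This is only an illustration of the data flow described above, not the library's implementation: `decoder_layer_sketch` and its callable arguments are hypothetical names, vectors are plain lists, and layer normalization and dropout are left out.

```python
def decoder_layer_sketch(hidden, self_attn, cross_attn, feed_forward,
                         encoder_output=None):
    """Compose self-attention, optional cross-attention and feedforward.

    All sublayers are passed in as callables (hypothetical stand-ins);
    each sublayer output is added back to its input (residual connection).
    """
    # Self-attention sublayer with a residual connection.
    x = [h + a for h, a in zip(hidden, self_attn(hidden))]
    # Cross-attention runs only when an encoder output is supplied.
    if encoder_output is not None:
        x = [v + c for v, c in zip(x, cross_attn(x, encoder_output))]
    # Feedforward sublayer with a residual connection.
    return [v + f for v, f in zip(x, feed_forward(x))]
```

Calling the sketch without `encoder_output` never invokes `cross_attn`, which mirrors the statement that the cross-attention has no effect in that case.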

Args:

hidden_size (int): The hidden size of the input.

ffn_hidden_size (int): The hidden size of the bottleneck in the feedforward layer.

num_heads (int): The number of attention heads.

batch_size (int): The batch size of the input tensor during incremental prediction. Should be a positive value. During training or ordinary prediction this argument has no effect, and the user can pass None.

src_seq_length (int): The input source sequence length.

tgt_seq_length (int): The input target sequence length.

attention_dropout_rate (float): The dropout rate of the attention scores. Default: 0.1.

hidden_dropout_rate (float): The dropout rate of the final output of the layer. Default: 0.1.

post_layernorm_residual (bool): Whether the residual addition is done before the layernorm. Default: False.

use_past (bool): Whether to use the past state for computation; used for incremental prediction. Default: False.

layernorm_compute_type (dtype.Number): The computation type of the layernorm. Should be dtype.float32 or dtype.float16. Default: dtype.float32.

softmax_compute_type (dtype.Number): The computation type of the softmax in the attention. Should be dtype.float32 or dtype.float16. Default: dtype.float32.

param_init_type (dtype.Number): The parameter initialization type of the module. Should be dtype.float32 or dtype.float16. Default: dtype.float32.

hidden_act (str, nn.Cell): The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. The user can also pass a custom activation. To run the net in parallel mode, a custom activation must also provide the activation_shard function. Please see the examples of mindformers.modules.transformer.FeedForward. Default: ‘gelu’.

moe_config (MoEConfig): The configuration of MoE (Mixture of Experts). Default: an instance of MoEConfig with default values. Please see MoEConfig.

parallel_config (OpParallelConfig, MoEParallelConfig): The parallel configuration. When MoE is applied, MoEParallelConfig takes effect; otherwise OpParallelConfig takes effect. Default: default_dpmp_config, an instance of OpParallelConfig with default args.
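The effect of post_layernorm_residual on one sublayer can be sketched as follows. This reflects one reading of the flag (the residual path starts either from the layer input or from the layernorm output) and is an assumption rather than a copy of the library code; `layer_norm` and `sublayer_step` are hypothetical helpers working on plain lists, and the learned scale and bias of the layernorm are omitted.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def sublayer_step(x, sublayer, post_layernorm_residual=False):
    """Apply layernorm, then the sublayer, then the residual addition."""
    normed = layer_norm(x)
    out = sublayer(normed)
    # Assumed semantics: True adds the sublayer output to the layernorm
    # output, False adds it to the original input of the sublayer.
    skip = normed if post_layernorm_residual else x
    return [s + o for s, o in zip(skip, out)]
```

With a sublayer that returns zeros, the default setting returns the input unchanged, while post_layernorm_residual=True returns the normalized input.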

Inputs:
  • hidden_stats (Tensor) - The input tensor with shape [batch_size, tgt_seq_length, hidden_size] or [batch_size * tgt_seq_length, hidden_size].

  • decoder_mask (Tensor) - The attention mask for the decoder self-attention with shape [batch_size, tgt_seq_length, tgt_seq_length] or None. None means no mask is applied in the softmax computation of the self-attention.

  • encoder_output (Tensor) - The output of the encoder with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]. Note that this argument cannot be None when this layer is the outermost layer of the net. Default None.

  • memory_mask (Tensor) - The memory mask of the cross attention with shape [batch, tgt_seq_length, src_seq_length] where tgt_seq_length is the length of the decoder. The user can also pass None. None means there will be no mask in softmax computation in cross attention. Default None.

  • init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True.

  • batch_valid_length (Tensor) - Int32 tensor with shape [batch_size] giving the index computed so far for each sequence. Used for incremental prediction when use_past is True. Default None.
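The two mask inputs can be built as nested lists before being wrapped in a Tensor. The sketch below assumes the common convention that 1 marks a position that may be attended to and 0 a masked position, and it assumes a causal (lower-triangular) pattern for the decoder mask; neither is stated in this page, and both helper names are hypothetical.

```python
def causal_decoder_mask(batch_size, tgt_seq_length):
    """Mask of shape [batch_size, tgt_seq_length, tgt_seq_length].

    Position i may attend to positions 0..i (assumed convention: 1 = keep).
    """
    one = [[1.0 if j <= i else 0.0 for j in range(tgt_seq_length)]
           for i in range(tgt_seq_length)]
    # Copy the rows so that every batch entry owns its own lists.
    return [[row[:] for row in one] for _ in range(batch_size)]


def full_memory_mask(batch_size, tgt_seq_length, src_seq_length):
    """Mask of shape [batch_size, tgt_seq_length, src_seq_length], all ones."""
    return [[[1.0] * src_seq_length for _ in range(tgt_seq_length)]
            for _ in range(batch_size)]
```

The result can be converted in the same way as in the Examples section, for instance Tensor(np.array(mask), mstype.float16).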

Outputs:

Tuple, a tuple containing (output, layer_present).

  • output (Tensor) - The output logit of this layer. The shape is [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size].

  • layer_present (Tuple) - A tuple of four tensors: the projected key and value vectors of the self-attention, with shapes (batch_size, num_heads, size_per_head, tgt_seq_length) and (batch_size, num_heads, tgt_seq_length, size_per_head), followed by the projected key and value vectors of the cross-attention, with shapes (batch_size, num_heads, size_per_head, src_seq_length) and (batch_size, num_heads, src_seq_length, size_per_head).
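The output shapes listed above follow directly from the constructor arguments. The helper below is a hypothetical convenience, not part of the library, and assumes size_per_head equals hidden_size divided by num_heads, which is consistent with the shapes printed in the Examples section.

```python
def expected_output_shapes(batch_size, hidden_size, num_heads,
                           src_seq_length, tgt_seq_length):
    """Return the shapes of output and of the four layer_present tensors."""
    # Assumption: the hidden size is split evenly across the heads.
    size_per_head = hidden_size // num_heads
    return {
        "output": (batch_size, tgt_seq_length, hidden_size),
        "layer_present": (
            # Self-attention key and value.
            (batch_size, num_heads, size_per_head, tgt_seq_length),
            (batch_size, num_heads, tgt_seq_length, size_per_head),
            # Cross-attention key and value.
            (batch_size, num_heads, size_per_head, src_seq_length),
            (batch_size, num_heads, src_seq_length, size_per_head),
        ),
    }
```

For batch_size=2, hidden_size=64, num_heads=2, src_seq_length=20 and tgt_seq_length=10 this gives the same shapes as the printed results in the Examples section.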

Supported Platforms:

Ascend GPU

Examples:
>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindformers.modules.transformer import TransformerDecoderLayer
>>> from mindspore import Tensor
>>> model = TransformerDecoderLayer(batch_size=2, hidden_size=64, ffn_hidden_size=64, num_heads=2,
...                                 src_seq_length=20, tgt_seq_length=10)
>>> encoder_input_value = Tensor(np.ones((2, 20, 64)), mstype.float32)
>>> decoder_input_value = Tensor(np.ones((2, 10, 64)), mstype.float32)
>>> decoder_input_mask = Tensor(np.ones((2, 10, 10)), mstype.float16)
>>> memory_mask = Tensor(np.ones((2, 10, 20)), mstype.float16)
>>> output, past = model(decoder_input_value, decoder_input_mask, encoder_input_value, memory_mask)
>>> print(output.shape)
(2, 10, 64)
>>> print(past[0].shape)
(2, 2, 32, 10)
>>> print(past[1].shape)
(2, 2, 10, 32)
>>> print(past[2].shape)
(2, 2, 32, 20)
>>> print(past[3].shape)
(2, 2, 20, 32)