mindformers.modules.transformer.TransformerDecoder

class mindformers.modules.transformer.TransformerDecoder(**kwargs)

Transformer decoder module made of multiple stacked TransformerDecoderLayer layers, each including multi-head self-attention, cross-attention and a feedforward layer.

Args:

num_layers(int): The number of TransformerDecoderLayer layers.

batch_size(int): The batch size of the input tensor for incremental prediction. Should be a positive value. During training or normal prediction this argument has no effect, and the user can pass None.

hidden_size(int): The hidden size of the input.

ffn_hidden_size(int): The hidden size of the bottleneck in the feedforward layer.

src_seq_length(int): The input source sequence length.

tgt_seq_length(int): The input target sequence length.

num_heads(int): The number of attention heads.

attention_dropout_rate(float): The dropout rate of the attention scores. Default: 0.1.

hidden_dropout_rate(float): The dropout rate of the final output of the layer. Default: 0.1.

post_layernorm_residual(bool): Whether to apply the residual addition before the layernorm. Default False.

layernorm_compute_type(dtype.Number): The computation type of the layernorm. Should be mstype.float32 or mstype.float16. Default mstype.float32.

softmax_compute_type(dtype.Number): The computation type of the softmax in the attention. Should be mstype.float32 or mstype.float16. Default mstype.float32.

param_init_type(dtype.Number): The parameter initialization type of the module. Should be mstype.float32 or mstype.float16. Default mstype.float32.

hidden_act(str, nn.Cell): The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. The user can also pass a custom activation. To run the net in parallel mode, the custom activation must also provide the activation_shard function. Please see the examples of the class mindformers.modules.transformer.FeedForward. Default: gelu.

lambda_func(function): A function that determines the fusion index, pipeline stage and recompute attribute of each layer. To control the pipeline stage and gradient aggregation fusion, the user can pass a function that accepts network, layer_id, offset, parallel_config and layers. network(Cell) is the transformer block, layer_id(int) is the index of the current module, counting from zero, and offset(int) is the offset added to layer_id when the net contains other modules. The default pipeline stage is (layer_id + offset) // (layers / pipeline_stage). Default: None.

use_past(bool): Whether to use the past state for computation; used for incremental prediction. Default False.

offset(int): The initial layer index for the decoder. Used to set the fusion id and stage id so that they do not overlap with those of the encoder layers. Default 0.

moe_config(MoEConfig): The configuration of MoE (Mixture of Experts). Default is an instance of MoEConfig with default values. Please see MoEConfig.

parallel_config(TransformerOpParallelConfig): The parallel configuration. Default default_transformer_config, an instance of TransformerOpParallelConfig with default args.
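The default pipeline rule quoted under lambda_func can be tried out in plain Python. The sketch below only evaluates the documented formula; the helper name default_pipeline_stage and the 4 + 4 layer setup are hypothetical, and it assumes layers is an exact multiple of pipeline_stage. It is not the library's own lambda_func.

```python
def default_pipeline_stage(layer_id, offset, layers, pipeline_stage):
    """Evaluate the documented default: (layer_id + offset) // (layers / pipeline_stage)."""
    # Assumption: layers is an exact multiple of pipeline_stage, so the
    # number of layers per stage is a whole number.
    layers_per_stage = layers // pipeline_stage
    return (layer_id + offset) // layers_per_stage


# Hypothetical net: 4 encoder layers followed by 4 decoder layers (layers=8),
# split over 2 pipeline stages. The decoder is built with offset=4 so that its
# layer indices continue after the encoder's.
encoder_stages = [default_pipeline_stage(i, 0, 8, 2) for i in range(4)]
decoder_stages = [default_pipeline_stage(i, 4, 8, 2) for i in range(4)]
print(encoder_stages, decoder_stages)
```

With these numbers the encoder layers fall in the first stage and the decoder layers in the second, which is the non-overlap that the offset argument is meant to provide.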

Inputs:
  • hidden_stats (Tensor) - The input tensor with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]

  • attention_mask (Tensor) - The attention mask for decoder with shape [batch_size, seq_length, seq_length] or None. None means there will be no mask in softmax computation in self attention.

  • encoder_output (Tensor) - The output of the encoder with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]. Note that this argument cannot be None when the net is the outermost layer. Default None.

  • memory_mask (Tensor) - The memory mask of the cross attention with shape [batch, tgt_seq_length, src_seq_length] where tgt_seq_length is the length of the decoder. The user can also pass None. None means there will be no mask in softmax computation in cross attention. Default None.

  • init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True.

  • batch_valid_length (Tensor) - Int32 tensor with shape [batch_size], holding the index that has already been computed for each sample. Used for incremental prediction when use_past is True. Default None.
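The Examples section passes an all-ones attention_mask. For autoregressive decoding a lower-triangular mask is a common alternative; a minimal NumPy sketch of building one in the [batch_size, seq_length, seq_length] layout follows. The helper name causal_mask is hypothetical, and the convention that 1 marks a visible position and 0 a masked one is an assumption that should be checked against the attention implementation in use.

```python
import numpy as np


def causal_mask(batch_size, seq_length, dtype=np.float16):
    # Lower-triangular matrix: position i may look at positions 0..i only.
    # Assumption: 1 means "visible" and 0 means "masked".
    tri = np.tril(np.ones((seq_length, seq_length), dtype=dtype))
    # Repeat the same mask for every sample in the batch.
    return np.broadcast_to(tri, (batch_size, seq_length, seq_length)).copy()


mask = causal_mask(batch_size=2, seq_length=10)
print(mask.shape)
```

The resulting array can be wrapped in a Tensor with mstype.float16 in the same way as decoder_input_mask in the Examples section.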

Outputs:

Tuple, a tuple containing (output, layer_present).

  • output (Tensor) - The output logit of this layer. The shape is [batch, tgt_seq_length, hidden_size] or [batch * tgt_seq_length, hidden_size]

  • layer_present (Tuple) - A tuple of size num_layers, where each element is a tuple of four tensors: the projected key and value of the self attention, with shapes (batch_size, num_heads, size_per_head, tgt_seq_length) and (batch_size, num_heads, tgt_seq_length, size_per_head), followed by the projected key and value of the cross attention, with shapes (batch_size, num_heads, size_per_head, src_seq_length) and (batch_size, num_heads, src_seq_length, size_per_head).
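The four shapes in each layer_present entry can be worked out ahead of time from the constructor arguments. The sketch below is a plain-Python illustration; the helper name layer_present_shapes is hypothetical, and it assumes size_per_head equals hidden_size // num_heads, which matches the shapes printed in the Examples section.

```python
def layer_present_shapes(batch_size, num_heads, hidden_size,
                         src_seq_length, tgt_seq_length):
    # Assumption: the hidden size is split evenly across the heads.
    size_per_head = hidden_size // num_heads
    return (
        # Self-attention key and value (target sequence).
        (batch_size, num_heads, size_per_head, tgt_seq_length),
        (batch_size, num_heads, tgt_seq_length, size_per_head),
        # Cross-attention key and value (source sequence).
        (batch_size, num_heads, size_per_head, src_seq_length),
        (batch_size, num_heads, src_seq_length, size_per_head),
    )


# Same configuration as in the Examples section.
shapes = layer_present_shapes(batch_size=2, num_heads=2, hidden_size=64,
                              src_seq_length=20, tgt_seq_length=10)
for shape in shapes:
    print(shape)
```

Note that the key tensors keep the sequence length in the last dimension while the value tensors keep it in the second-to-last one.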

Supported Platforms:

Ascend GPU

Examples:
>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindformers.modules.transformer import TransformerDecoder
>>> from mindspore import Tensor
>>> model = TransformerDecoder(batch_size=2, num_layers=1, hidden_size=64, ffn_hidden_size=64,
...                            num_heads=2, src_seq_length=20, tgt_seq_length=10)
>>> encoder_input_value = Tensor(np.ones((2, 20, 64)), mstype.float32)
>>> decoder_input_value = Tensor(np.ones((2, 10, 64)), mstype.float32)
>>> decoder_input_mask = Tensor(np.ones((2, 10, 10)), mstype.float16)
>>> memory_mask = Tensor(np.ones((2, 10, 20)), mstype.float16)
>>> output, past = model(decoder_input_value, decoder_input_mask, encoder_input_value, memory_mask)
>>> print(output.shape)
(2, 10, 64)
>>> print(len(past))
1
>>> print(past[0][0].shape)
(2, 2, 32, 10)
>>> print(past[0][1].shape)
(2, 2, 10, 32)
>>> print(past[0][2].shape)
(2, 2, 32, 20)
>>> print(past[0][3].shape)
(2, 2, 20, 32)