mindformers.modules.transformer.TransformerDecoder¶

class mindformers.modules.transformer.TransformerDecoder(**kwargs)[源代码]¶

Transformer Decoder module with multi-layer stacked of TransformerDecoderLayer, including multihead self attention, cross attention and feedforward layer.

参数

num_layers (int) – The layers of the TransformerDecoderLayer.
batch_size (int) – The batch size of the input tensor when do increnmental prediction. Should be a positive value. When do training or prediction, the argument will not work and the user can just pass None to the argument.
hidden_size (int) – The hidden size of the input.
ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.
src_seq_length (int) – The input source sequence length.
tgt_seq_length (int) – The input target sequence length.
num_heads (int) – The number of the heads.
attention_dropout_rate (float) – The dropout rate of the attention scores. Default:0.1.
hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default:0.1.
post_layernorm_residual (bool) – Do residuals adds before the layernorm. Default False.
layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be mstype.float32 or mstype.float16. Default mstype.float32.
softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be mstype.float32 or mstype.float16. Default mstype.float32.
param_init_type (dtype.Number) – The parameter initialization type of the module. Should be mstype.float32 or mstype.float16. Default mstype.float32.
hidden_act (str, nn.Cell) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. User can provide custom activition to the argument. If user wants to run the net in the parallel mode, the custom activation must also provide the activation_shard function. Please see the examples of the class:mindformers.modules.transformer.FeedForward. Default: gelu.
lambda_func (function) – A function can determine the fusion index, pipeline stages and recompute attribute. If the user wants to determine the pipeline stage and gradient aggregation fusion, the user can pass a function that accepts network, layer_id, offset, parallel_config, layers. The network(Cell) represents the transformer block, layer_id(int) means the layer index for the current module, counts from zero, offset(int) means the layer_index needs an offset, if there are other modules in the net. The default setting for the pipeline is: (layer_id + offset) // (layers / pipeline_stage). Default: None.
use_past (bool) – Use the past state to compute, used for incremental prediction. Default False.
offset (int) – The initial layer index for the decoder. Used for setting the fusion id and stage id, to not overlap with the encoder layer. Default 0.
moe_config (MoEConfig) – The configuration of MoE (Mixture of Expert). Default is an instance of MoEConfig with default values. Please see MoEConfig.
parallel_config (TransformerOpParallelConfig) – The parallel configure. Default default_transformer_config, an instance of TransformerOpParallelConfig with default args.

Inputs:

hidden_stats (Tensor) - The input tensor with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]
attention_mask (Tensor) - The attention mask for decoder with shape [batch_size, seq_length, seq_length] or None. None means there will be no mask in softmax computation in self attention.
encoder_output (Tensor) - The output of the encoder with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]. Note this args can not be passed by None when the net is in outermost layer. Default None.
memory_mask (Tensor) - The memory mask of the cross attention with shape [batch, tgt_seq_length, src_seq_length] where tgt_seq_length is the length of the decoder. The user can also pass None. None means there will be no mask in softmax computation in cross attention. Default None.
init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True.
batch_valid_length (Tensor) - Int32 tensor with shape [batch_size] the past calculated the index. Used for incremental prediction when the use_past is True. Default None.

Outputs:

Tuple, a tuple contains(output, layer_present)

output (Tensor) - The output logit of this layer. The shape is [batch, tgt_seq_length, hidden_size] or [batch * tgt_seq_length, hidden_size]
layer_present (Tuple) - A tuple with size of num_layers, where each tuple is the tensor of the projected key and value vector in self attention with shape ((batch_size, num_heads, size_per_head, tgt_seq_length), (batch_size, num_heads, tgt_seq_length, size_per_head), and of the projected key and value vector in cross attention with shape (batch_size, num_heads, size_per_head, src_seq_length), (batch_size, num_heads, src_seq_length, size_per_head)).

Supported Platforms:

Ascend GPU

实际案例

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindformers.modules.transformer import TransformerDecoder
>>> from mindspore import Tensor
>>> model = TransformerDecoder(batch_size=2, num_layers=1, hidden_size=64, ffn_hidden_size=64,
...                            num_heads=2, src_seq_length=20, tgt_seq_length=10)
>>> encoder_input_value = Tensor(np.ones((2, 20, 64)), mstype.float32)
>>> decoder_input_value = Tensor(np.ones((2, 10, 64)), mstype.float32)
>>> decoder_input_mask = Tensor(np.ones((2, 10, 10)), mstype.float16)
>>> memory_mask = Tensor(np.ones((2, 10, 20)), mstype.float16)
>>> output, past = model(decoder_input_value, decoder_input_mask, encoder_input_value, memory_mask)
>>> print(output.shape)
(2, 10, 64)
>>> print(len(past))
1
>>> print(past[0][0].shape)
(2, 2, 32, 10)
>>> print(past[0][1].shape)
(2, 2, 10, 32)
>>> print(past[0][2].shape)
(2, 2, 32, 20)
>>> print(past[0][3].shape)
(2, 2, 20, 32)