mindformers.modules.transformer.TransformerDecoderLayer¶
-
class
mindformers.modules.transformer.TransformerDecoderLayer(**kwargs)[源代码]¶ Transformer Decoder Layer. This is an implementation of the single layer of the transformer decoder layer, including self-attention, cross attention and feedward layer. When the encoder_output is None, the cross attention will not be effective.
- 参数
hidden_size (int) – The hidden size of the input.
ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.
num_heads (int) – The number of the heads.
batch_size (int) – The batch size of the input tensor when do increnmental prediction. Should be a positive value. When do training or prediction, the argument will not work and the user can just pass None to the argument.
src_seq_length (int) – The input source sequence length.
tgt_seq_length (int) – The input target sequence length.
attention_dropout_rate (float) – The dropout rate of the attention scores. Default:0.1.
hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default:0.1.
post_layernorm_residual (bool) – Do residuals adds before the layernorm. Default False.
use_past (bool) – Use the past state to compute, used for incremental prediction. Default False.
layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be dtype.float32 or dtype.float16. Default dtype.float32.
softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be dtype.float32 or dtype.float16. Default mstype.float32.
param_init_type (dtype.Number) – The parameter initialization type of the module. Should be dtype.float32 or dtype.float16. Default dtype.float32.
hidden_act (str, nn.Cell) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. User can provide custom activition to the argument. If user wants to run the net in the parallel mode, the custom activation must also provide the activation_shard function. Please see the examples of the class:mindformers.modules.transformer.FeedForward. Default: gelu.
moe_config (MoEConfig) – The configuration of MoE (Mixture of Expert). Default is an instance of MoEConfig with default values. Please see MoEConfig.
parallel_config (OpParallelConfig, MoEParallelConfig) – The parallel configure. When MoE is applied, MoEParallelConfig is effective, otherwise OpParallelConfig is effective. Default default_dpmp_config, an instance of OpParallelConfig with default args.
- Inputs:
hidden_stats (Tensor) - The input tensor with shape [batch_size, tgt_seq_length, hidden_size] or [batch_size * tgt_seq_length, hidden_size].
decoder_mask (Tensor) - The attention mask for decoder with shape [batch_size, src_seq_length, seq_length] or None. None means there will be no mask in softmax computation in self attention.
encoder_output (Tensor) - The output of the encoder with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]. Note this args can not be passed by None when the net is in outermost layer. Default None.
memory_mask (Tensor) - The memory mask of the cross attention with shape [batch, tgt_seq_length, src_seq_length] where tgt_seq_length is the length of the decoder. The user can also pass None. None means there will be no mask in softmax computation in cross attention. Default None.
init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in the incremental prediction. Only valid when use_past is True. Default True.
batch_valid_length (Tensor) - Int32 tensor with shape [batch_size] the past calculated the index. Used for incremental prediction when the use_past is True. Default None.
- Outputs:
Tuple, a tuple contains(output, layer_present)
output (Tensor) - The output logit of this layer. The shape is [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size].
layer_present (Tuple) - A tuple, where each tuple is the tensor of the projected key and value vector in self attention with shape ((batch_size, num_heads, size_per_head, tgt_seq_length), (batch_size, num_heads, tgt_seq_length, size_per_head), and of the projected key and value vector in cross attention with shape (batch_size, num_heads, size_per_head, src_seq_length), (batch_size, num_heads, src_seq_length, size_per_head)).
- Supported Platforms:
AscendGPU
实际案例
>>> import numpy as np >>> from mindspore import dtype as mstype >>> from mindformers.modules.transformer import TransformerDecoderLayer >>> from mindspore import Tensor >>> model = TransformerDecoderLayer(batch_size=2, hidden_size=64, ffn_hidden_size=64, num_heads=2, ... src_seq_length=20, tgt_seq_length=10) >>> encoder_input_value = Tensor(np.ones((2, 20, 64)), mstype.float32) >>> decoder_input_value = Tensor(np.ones((2, 10, 64)), mstype.float32) >>> decoder_input_mask = Tensor(np.ones((2, 10, 10)), mstype.float16) >>> memory_mask = Tensor(np.ones((2, 10, 20)), mstype.float16) >>> output, past = model(decoder_input_value, decoder_input_mask, encoder_input_value, memory_mask) >>> print(output.shape) (2, 10, 64) >>> print(past[0].shape) (2, 2, 32, 10) >>> print(past[1].shape) (2, 2, 10, 32) >>> print(past[2].shape) (2, 2, 32, 20) >>> print(past[3].shape) (2, 2, 20, 32)