mindformers.modules.transformer.TransformerDecoderLayer
- class mindformers.modules.transformer.TransformerDecoderLayer(**kwargs)
Transformer Decoder Layer. This implements a single layer of the transformer decoder, comprising self-attention, cross-attention and a feedforward layer. When encoder_output is None, cross-attention is not applied.
- Args:
hidden_size (int) - The hidden size of the input.
ffn_hidden_size (int) - The hidden size of the bottleneck in the feedforward layer.
num_heads (int) - The number of attention heads.
batch_size (int) - The batch size of the input tensor during incremental prediction. Must be a positive value. During training or non-incremental prediction the argument has no effect, and the user can pass None.
src_seq_length (int) - The input source sequence length.
tgt_seq_length (int) - The input target sequence length.
attention_dropout_rate (float) - The dropout rate of the attention scores. Default: 0.1.
hidden_dropout_rate (float) - The dropout rate of the final output of the layer. Default: 0.1.
post_layernorm_residual (bool) - Whether to do the residual addition before the layernorm. Default: False.
use_past (bool) - Whether to use the past state in the computation; used for incremental prediction. Default: False.
layernorm_compute_type (dtype.Number) - The computation type of the layernorm. Should be dtype.float32 or dtype.float16. Default: dtype.float32.
softmax_compute_type (dtype.Number) - The computation type of the softmax in the attention. Should be dtype.float32 or dtype.float16. Default: dtype.float32.
param_init_type (dtype.Number) - The parameter initialization type of the module. Should be dtype.float32 or dtype.float16. Default: dtype.float32.
hidden_act (str, nn.Cell) - The activation of the internal feedforward layer. Supports 'relu', 'relu6', 'tanh', 'gelu', 'fast_gelu', 'elu', 'sigmoid', 'prelu', 'leakyrelu', 'hswish', 'hsigmoid', 'logsigmoid' and so on. A custom activation can also be passed. To run the network in parallel mode, the custom activation must also provide the activation_shard function. See the examples of the class mindformers.modules.transformer.FeedForward. Default: 'gelu'.
moe_config (MoEConfig) - The configuration of MoE (Mixture of Experts). Default: an instance of MoEConfig with default values. See MoEConfig.
parallel_config (OpParallelConfig, MoEParallelConfig) - The parallel configuration. When MoE is applied, MoEParallelConfig takes effect; otherwise OpParallelConfig takes effect. Default: default_dpmp_config, an instance of OpParallelConfig with default arguments.
- Inputs:
hidden_stats (Tensor) - The input tensor with shape [batch_size, tgt_seq_length, hidden_size] or [batch_size * tgt_seq_length, hidden_size].
decoder_mask (Tensor) - The attention mask for the decoder with shape [batch_size, src_seq_length, seq_length], or None. None means that no mask is applied in the softmax computation of self-attention.
encoder_output (Tensor) - The output of the encoder with shape [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]. Note that this argument cannot be None when the network is the outermost layer. Default: None.
memory_mask (Tensor) - The memory mask of the cross-attention with shape [batch, tgt_seq_length, src_seq_length], where tgt_seq_length is the length of the decoder, or None. None means that no mask is applied in the softmax computation of cross-attention. Default: None.
init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key and past value parameters used in incremental prediction. Only valid when use_past is True. Default: True.
batch_valid_length (Tensor) - Int32 tensor with shape [batch_size] holding the index calculated so far. Used for incremental prediction when use_past is True. Default: None.
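The two mask inputs can be built by hand. The following is a minimal sketch in plain Python, under two assumptions that the text above does not state: a mask value of 1 keeps a position and 0 masks it out (the Examples section passes all-ones masks), and the self-attention mask is causal with shape [batch_size, tgt_seq_length, tgt_seq_length], matching the shape used in the Examples. The helper names causal_mask and full_memory_mask are hypothetical and not part of mindformers.

```python
def causal_mask(batch_size, tgt_seq_length):
    """Lower-triangular mask: query position q may attend to key positions 0..q.

    Assumption: 1.0 means "attend", 0.0 means "masked out".
    Returns nested lists with shape [batch_size, tgt_seq_length, tgt_seq_length].
    """
    one_sample = [
        [1.0 if k <= q else 0.0 for k in range(tgt_seq_length)]
        for q in range(tgt_seq_length)
    ]
    # Copy the rows so that each batch entry is an independent object.
    return [[row[:] for row in one_sample] for _ in range(batch_size)]


def full_memory_mask(batch_size, tgt_seq_length, src_seq_length):
    """All-ones mask: every target position may attend to every source position.

    Returns nested lists with shape [batch_size, tgt_seq_length, src_seq_length].
    """
    return [
        [[1.0] * src_seq_length for _ in range(tgt_seq_length)]
        for _ in range(batch_size)
    ]


decoder_mask = causal_mask(batch_size=2, tgt_seq_length=10)
memory_mask = full_memory_mask(batch_size=2, tgt_seq_length=10, src_seq_length=20)
```

The nested lists can then be converted to tensors in the same way as in the Examples section, for instance with Tensor(np.array(decoder_mask), mstype.float16).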
- Outputs:
Tuple, a tuple containing (output, layer_present).
output (Tensor) - The output of this layer. The shape is [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size].
layer_present (Tuple) - A tuple of four tensors: the projected key and value vectors of self-attention, with shapes (batch_size, num_heads, size_per_head, tgt_seq_length) and (batch_size, num_heads, tgt_seq_length, size_per_head), followed by the projected key and value vectors of cross-attention, with shapes (batch_size, num_heads, size_per_head, src_seq_length) and (batch_size, num_heads, src_seq_length, size_per_head).
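The output shapes listed above can be computed ahead of time as a sanity check. The sketch below assumes size_per_head equals hidden_size // num_heads, which agrees with the Examples section (hidden_size=64, num_heads=2, size_per_head=32) but is not stated explicitly in the text. The function expected_shapes is a hypothetical helper, not part of mindformers.

```python
def expected_shapes(batch_size, hidden_size, num_heads,
                    src_seq_length, tgt_seq_length):
    """Return the expected (output, layer_present) shapes for 3-D inputs.

    Assumption: size_per_head = hidden_size // num_heads.
    """
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    size_per_head = hidden_size // num_heads

    output = (batch_size, tgt_seq_length, hidden_size)
    layer_present = (
        # Self-attention key and value.
        (batch_size, num_heads, size_per_head, tgt_seq_length),
        (batch_size, num_heads, tgt_seq_length, size_per_head),
        # Cross-attention key and value.
        (batch_size, num_heads, size_per_head, src_seq_length),
        (batch_size, num_heads, src_seq_length, size_per_head),
    )
    return output, layer_present


# Same configuration as in the Examples section.
output_shape, past_shapes = expected_shapes(
    batch_size=2, hidden_size=64, num_heads=2,
    src_seq_length=20, tgt_seq_length=10)
print(output_shape)    # (2, 10, 64)
print(past_shapes[0])  # (2, 2, 32, 10)
print(past_shapes[3])  # (2, 2, 20, 32)
```

Note that the key tensors place size_per_head before the sequence length while the value tensors place it after, so the two cannot be swapped without a transpose.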
- Supported Platforms:
Ascend GPU
- Examples:
>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindformers.modules.transformer import TransformerDecoderLayer
>>> from mindspore import Tensor
>>> model = TransformerDecoderLayer(batch_size=2, hidden_size=64, ffn_hidden_size=64, num_heads=2,
...                                 src_seq_length=20, tgt_seq_length=10)
>>> encoder_input_value = Tensor(np.ones((2, 20, 64)), mstype.float32)
>>> decoder_input_value = Tensor(np.ones((2, 10, 64)), mstype.float32)
>>> decoder_input_mask = Tensor(np.ones((2, 10, 10)), mstype.float16)
>>> memory_mask = Tensor(np.ones((2, 10, 20)), mstype.float16)
>>> output, past = model(decoder_input_value, decoder_input_mask, encoder_input_value, memory_mask)
>>> print(output.shape)
(2, 10, 64)
>>> print(past[0].shape)
(2, 2, 32, 10)
>>> print(past[1].shape)
(2, 2, 10, 32)
>>> print(past[2].shape)
(2, 2, 32, 20)
>>> print(past[3].shape)
(2, 2, 20, 32)