mindformers.modules.transformer.TransformerEncoderLayer

class mindformers.modules.transformer.TransformerEncoderLayer(**kwargs)

Transformer Encoder Layer. This is an implementation of a single transformer encoder layer, consisting of a multi-head attention layer and a feedforward layer.

Parameters

batch_size (int) – The batch size of the input tensor when doing incremental prediction. Should be a positive value. During training or ordinary (non-incremental) prediction, the argument has no effect and the user can pass None.
hidden_size (int) – The hidden size of the input.
ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.
num_heads (int) – The number of the heads.
seq_length (int) – The input sequence length.
attention_dropout_rate (float) – The dropout rate of the attention scores. Default: 0.1.
hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default: 0.1.
post_layernorm_residual (bool) – Whether to add the residual before the layernorm. Default: False.
layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be mstype.float32 or mstype.float16. Default: mstype.float32.
softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be mstype.float32 or mstype.float16. Default: mstype.float32.
param_init_type (dtype.Number) – The parameter initialization type of the module. Should be mstype.float32 or mstype.float16. Default: mstype.float32.
hidden_act (str, nn.Cell) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. Users can also pass a custom activation. To run the network in parallel mode, the custom activation must also provide the activation_shard function. See the examples of mindformers.modules.transformer.FeedForward. Default: gelu.
use_past (bool) – Whether to reuse the past state, as in incremental prediction. For example, given two words from which ten more words are to be generated, the state of the two words needs to be computed only once, and the following words are then generated one by one. When use_past is True, prediction runs in two steps. In the first step, set is_first_iteration to True with model.add_flags_recursive(is_first_iteration=True) and pass the full inputs. Then set is_first_iteration to False with model.add_flags_recursive(is_first_iteration=False), pass the input tensor of a single step, and repeat this in a loop. Default: False.
moe_config (MoEConfig) – The configuration of MoE (Mixture of Experts). Default is an instance of MoEConfig with default values. Please see MoEConfig.
parallel_config (OpParallelConfig, MoEParallelConfig) – The parallel configuration. When MoE is applied, MoEParallelConfig is effective; otherwise OpParallelConfig is effective. Default: default_dpmp_config, an instance of OpParallelConfig with default args.
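The saving that use_past describes can be illustrated with a toy cost model. This is only a sketch: the function count_projections is hypothetical, does not call mindformers, and simply counts how many per-token states would have to be computed under the assumption that each generation step either reprocesses the whole sequence (use_past=False) or only the newest token after the first full pass (use_past=True).

```python
def count_projections(prompt_len, num_new_tokens, use_past):
    """Count per-token state computations needed to generate num_new_tokens.

    Hypothetical cost model for illustration only; not part of mindformers.
    """
    total = 0
    seq_len = prompt_len
    for step in range(num_new_tokens):
        if use_past:
            # Step 1 processes the whole prompt once; every later step
            # processes only the single newest token.
            total += prompt_len if step == 0 else 1
        else:
            # Without cached state, every step reprocesses the full sequence.
            total += seq_len
        seq_len += 1  # the generated token is appended to the sequence
    return total


# Two prompt words, ten generated words, as in the use_past description.
with_cache = count_projections(2, 10, use_past=True)
without_cache = count_projections(2, 10, use_past=False)
print(with_cache, without_cache)
```

Under this model the cached variant grows linearly with the number of generated tokens, while the uncached variant grows quadratically.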

Inputs:

x (Tensor) - Float Tensor. If use_past is False or is_first_iteration=True, the shape should be [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size]. Otherwise, it should be [batch_size, 1, hidden_size].
input_mask (Tensor) - Float Tensor. If use_past is False or is_first_iteration=True, the attention mask matrix should be [batch_size, seq_length, seq_length], or None. None means no mask is applied in the softmax computation. Otherwise, it should be [batch_size, 1, seq_length], as in the example on this page.
init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in incremental prediction. Only valid when use_past is True. Default: True.
batch_valid_length (Tensor) - Int32 tensor with shape [batch_size], giving the index up to which the past state has been calculated. Used for incremental prediction when use_past is True. Default: None.
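The shape rules in the Inputs list can be summarized in a small helper. This is a minimal sketch: expected_input_shapes is a hypothetical function written for illustration, not part of mindformers, and it covers only the 3-D form of x. The single-step mask shape [batch_size, 1, seq_length] is taken from the example on this page, which passes a mask of shape (2, 1, 16) with seq_length=16.

```python
def expected_input_shapes(batch_size, seq_length, hidden_size,
                          use_past=False, is_first_iteration=True):
    """Return the input shapes described in the Inputs list.

    Hypothetical helper for illustration; not part of mindformers.
    """
    # A full pass happens when no past state is used, or on the first
    # iteration of incremental prediction.
    full_pass = (not use_past) or is_first_iteration
    steps = seq_length if full_pass else 1
    return {
        "x": (batch_size, steps, hidden_size),
        # Mask shape for the single-step case follows the page's example.
        "input_mask": (batch_size, steps, seq_length),
        "init_reset": (1,),
        "batch_valid_length": (batch_size,),
    }


# Same sizes as the example on this page.
print(expected_input_shapes(2, 16, 8))
print(expected_input_shapes(2, 16, 8, use_past=True, is_first_iteration=False))
```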

Outputs:

Tuple, a tuple containing (output, layer_present).

output (Tensor) - The float output tensor of the layer. If use_past is False or is_first_iteration=True, its shape is (batch_size, seq_length, hidden_size) or (batch_size * seq_length, hidden_size). Otherwise, it is (batch_size, 1, hidden_size).
layer_present (Tuple) - A tuple of the projected key and value tensors, with shapes ((batch_size, num_heads, size_per_head, seq_length), (batch_size, num_heads, seq_length, size_per_head)).
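The output shapes above can be worked out by hand before running the layer. The sketch below assumes size_per_head = hidden_size // num_heads, which is consistent with the example on this page (hidden_size=8, num_heads=2 gives 4) but is not stated explicitly in the text. expected_output_shapes is a hypothetical helper, not a mindformers function, and the divisibility check is the helper's own.

```python
def expected_output_shapes(batch_size, seq_length, hidden_size, num_heads,
                           use_past=False, is_first_iteration=True):
    """Return (output_shape, (key_shape, value_shape)) per the Outputs list.

    Hypothetical helper for illustration; not part of mindformers.
    """
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    # Assumption: each head gets an equal slice of the hidden dimension.
    size_per_head = hidden_size // num_heads
    # Only a single position is produced on later incremental steps.
    full_pass = (not use_past) or is_first_iteration
    steps = seq_length if full_pass else 1
    output = (batch_size, steps, hidden_size)
    key_present = (batch_size, num_heads, size_per_head, seq_length)
    value_present = (batch_size, num_heads, seq_length, size_per_head)
    return output, (key_present, value_present)


output_shape, (key_shape, value_shape) = expected_output_shapes(2, 16, 8, 2)
print(output_shape, key_shape, value_shape)
```

With the sizes used in the example on this page, the helper reproduces the printed shapes (2, 16, 8), (2, 2, 4, 16) and (2, 2, 16, 4).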

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindformers.modules.transformer import TransformerEncoderLayer
>>> from mindspore import Tensor
>>> model = TransformerEncoderLayer(batch_size=2, hidden_size=8, ffn_hidden_size=64, seq_length=16,
...                                 num_heads=2)
>>> encoder_input_value = Tensor(np.ones((2, 16, 8)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 16, 16)), mstype.float16)
>>> output, past = model(encoder_input_value, encoder_input_mask)
>>> print(output.shape)
(2, 16, 8)
>>> print(past[0].shape)
(2, 2, 4, 16)
>>> print(past[1].shape)
(2, 2, 16, 4)
>>> # When use_past=True, incremental prediction is implemented in two steps.
>>> # Step 1: set is_first_iteration=True, and pass the full-length input sequence.
>>> batch_valid_length = Tensor(np.ones((2,)), mstype.int32)
>>> init_reset = Tensor([True], mstype.bool_)
>>> # Set is_first_iteration=True to generate the full memory states
>>> model = TransformerEncoderLayer(batch_size=2, hidden_size=8, ffn_hidden_size=64, seq_length=16,
...                                 num_heads=2, use_past=True)
>>> model.add_flags_recursive(is_first_iteration=True)
>>> hidden, past = model(encoder_input_value, encoder_input_mask, init_reset, batch_valid_length)
>>> print(hidden.shape)
(2, 16, 8)
>>> print(past[0].shape)
(2, 2, 4, 16)
>>> print(past[1].shape)
(2, 2, 16, 4)
>>> encoder_input_value = Tensor(np.ones((2, 1, 8)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 1, 16)), mstype.float16)
>>> init_reset = Tensor([False], mstype.bool_)
>>> # Step 2: set is_first_iteration=False, and pass the single word to run the prediction rather than
>>> # the full sequence.
>>> model.add_flags_recursive(is_first_iteration=False)
>>> hidden, past = model(encoder_input_value, encoder_input_mask, init_reset, batch_valid_length)
>>> print(hidden.shape)
(2, 1, 8)
>>> print(past[0].shape)
(2, 2, 4, 16)
>>> print(past[1].shape)
(2, 2, 16, 4)