mindformers.modules.transformer.TransformerEncoder

class mindformers.modules.transformer.TransformerEncoder(**kwargs)

Transformer encoder module composed of a stack of TransformerEncoderLayer blocks, each containing a multi-head self-attention layer and a feedforward layer.

Parameters

batch_size (int) – The batch size of the input tensor when doing incremental prediction. Should be a positive value. When training or running ordinary (non-incremental) prediction, the argument has no effect and the user can pass None.
num_layers (int) – The number of TransformerEncoderLayer layers.
hidden_size (int) – The hidden size of the input.
ffn_hidden_size (int) – The hidden size of bottleneck in the feedforward layer.
seq_length (int) – The seq_length of the input tensor.
num_heads (int) – The number of the heads.
attention_dropout_rate (float) – The dropout rate of the attention scores. Default: 0.1.
hidden_dropout_rate (float) – The dropout rate of the final output of the layer. Default: 0.1.
hidden_act (str, nn.Cell) – The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. The user can also pass a custom activation. To run the network in parallel mode, the custom activation must also provide the activation_shard function. See the examples of mindformers.modules.transformer.FeedForward. Default: gelu.
post_layernorm_residual (bool) – Whether the residual addition is applied before the layernorm. Default: False.
layernorm_compute_type (dtype.Number) – The computation type of the layernorm. Should be mstype.float32 or mstype.float16. Default: mstype.float32.
softmax_compute_type (dtype.Number) – The computation type of the softmax in the attention. Should be mstype.float32 or mstype.float16. Default: mstype.float32.
param_init_type (dtype.Number) – The parameter initialization type of the module. Should be mstype.float32 or mstype.float16. Default: mstype.float32.
lambda_func (function) – A function that determines the fusion index, pipeline stage and recompute attribute of each layer. To control the pipeline stage and gradient aggregation fusion, pass a function that accepts network, layer_id, offset, parallel_config, layers. network (Cell) is the transformer block; layer_id (int) is the index of the current layer, counting from zero; offset (int) is the offset added to the layer index when the net contains other modules. The default setting for the pipeline is: (layer_id + offset) // (layers / pipeline_stage). Default: None.
offset (int) – The initial layer index for the encoder. Used to set the fusion id and stage id so that they do not overlap with the encoder layer. Default: 0.
use_past (bool) – Whether to use the past state for computation, as in incremental prediction. For example, given two words, to generate ten more words the state of the two words is computed only once and the following words are generated one at a time. When use_past is True, prediction runs in two steps. First, set is_first_iteration to True with model.add_flags_recursive(is_first_iteration=True) and pass the full inputs. Then set is_first_iteration to False with model.add_flags_recursive(is_first_iteration=False), pass the input tensor for a single step, and repeat in a loop. Default: False.
moe_config (MoEConfig) – The configuration of MoE (Mixture of Experts). Default is an instance of MoEConfig with default values. Please see MoEConfig.
parallel_config (TransformerOpParallelConfig) – The parallel configuration. Default: default_transformer_config, an instance of TransformerOpParallelConfig with default args.
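
The default pipeline rule quoted for lambda_func, (layer_id + offset) // (layers / pipeline_stage), can be illustrated in plain Python. This is a minimal sketch of the arithmetic only, not the library's implementation: the name default_stage is hypothetical, integer division is assumed for layers / pipeline_stage, and the clamp to the last stage is an added assumption for layer counts that do not divide evenly.

```python
def default_stage(layer_id, offset, layers, pipeline_stage):
    """Hypothetical helper: stage index of one layer under the documented rule."""
    # Assumption: the number of layers per stage is an integer of at least 1.
    layers_per_stage = max(layers // pipeline_stage, 1)
    # Assumption: leftover layers are clamped into the last stage.
    return min((layer_id + offset) // layers_per_stage, pipeline_stage - 1)


# Eight encoder layers spread over four pipeline stages, no offset.
stages = [default_stage(i, offset=0, layers=8, pipeline_stage=4) for i in range(8)]
print(stages)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

A non-zero offset shifts the whole encoder to later stages, which is how the layer indices of other modules in the same net can be accounted for.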

Inputs:

hidden_states (Tensor) - Tensor, shape should be [batch_size, seq_length, hidden_size] or [batch_size * seq_length, hidden_size] if use_past is False or is_first_iteration=True. Otherwise, should be [batch_size, 1, hidden_size].
attention_mask (Tensor) - Float Tensor. If use_past is False or is_first_iteration=True, the attention mask matrix should be [batch_size, seq_length, seq_length], or None. None means there will be no mask in the softmax computation. Otherwise, should be [batch_size, 1, seq_length], as in the example on this page.
init_reset (Tensor) - A bool tensor with shape [1], used to clear the past key parameter and past value parameter used in incremental prediction. Only valid when use_past is True. Default: True.
batch_valid_length (Tensor) - Int32 tensor with shape [batch_size], giving the index up to which the past state has been calculated. Used for incremental prediction when use_past is True. Default: None.
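
The shape rule for hidden_states can be written out as a small lookup. This is only a sketch of the rule stated in the list above; hidden_states_shapes is a hypothetical helper, not part of mindformers, and it does not touch any tensors.

```python
def hidden_states_shapes(batch_size, seq_length, hidden_size,
                         use_past=False, is_first_iteration=True):
    """Hypothetical helper: shapes accepted for hidden_states, per the rule above."""
    if not use_past or is_first_iteration:
        # Full-sequence pass: either the 3-D layout or the flattened 2-D layout.
        return [(batch_size, seq_length, hidden_size),
                (batch_size * seq_length, hidden_size)]
    # Incremental step: a single position per batch element.
    return [(batch_size, 1, hidden_size)]


full = hidden_states_shapes(2, 16, 8)
step = hidden_states_shapes(2, 16, 8, use_past=True, is_first_iteration=False)
print(full)  # [(2, 16, 8), (32, 8)]
print(step)  # [(2, 1, 8)]
```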

Outputs:

Tuple, a tuple containing (output, layer_present).

output (Tensor) - The float tensor of the output of the layer, with shape (batch_size, seq_length, hidden_size) or (batch_size * seq_length, hidden_size) if use_past is False or is_first_iteration=True. Otherwise, it will be (batch_size, 1, hidden_size).
layer_present (Tuple) - A tuple of size num_layers, where each element is a tuple containing the projected key and value tensors, with shapes (batch_size, num_heads, size_per_head, seq_length) and (batch_size, num_heads, seq_length, size_per_head).
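
The layer_present shapes follow from the constructor arguments. The sketch below assumes size_per_head = hidden_size // num_heads, which is consistent with the shapes printed in the example on this page (hidden_size=8, num_heads=2 gives 4) but is not stated explicitly in the text; layer_present_shapes is a hypothetical helper, not a mindformers function.

```python
def layer_present_shapes(batch_size, num_heads, hidden_size, seq_length, num_layers):
    """Hypothetical helper: (key_shape, value_shape) for each encoder layer."""
    # Assumption: every head gets an equal slice of the hidden dimension.
    size_per_head = hidden_size // num_heads
    key_shape = (batch_size, num_heads, size_per_head, seq_length)
    value_shape = (batch_size, num_heads, seq_length, size_per_head)
    return [(key_shape, value_shape) for _ in range(num_layers)]


shapes = layer_present_shapes(batch_size=2, num_heads=2, hidden_size=8,
                              seq_length=16, num_layers=2)
print(len(shapes))   # 2
print(shapes[0][0])  # (2, 2, 4, 16)
print(shapes[0][1])  # (2, 2, 16, 4)
```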

Supported Platforms:

Ascend GPU

Examples

>>> import numpy as np
>>> from mindspore import dtype as mstype
>>> from mindformers.modules.transformer import TransformerEncoder
>>> from mindspore import Tensor
>>> model = TransformerEncoder(batch_size=2, num_layers=2, hidden_size=8, ffn_hidden_size=64,
...                            seq_length=16, num_heads=2)
>>> encoder_input_value = Tensor(np.ones((2, 16, 8)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 16, 16)), mstype.float16)
>>> output, past = model(encoder_input_value, encoder_input_mask)
>>> print(output.shape)
(2, 16, 8)
>>> print(len(past))
2
>>> print(past[0][0].shape)
(2, 2, 4, 16)
>>> print(past[0][1].shape)
(2, 2, 16, 4)
>>> # When use_past=True, incremental prediction takes two steps.
>>> # Step 1: set is_first_iteration=True and pass the state of the full sequence.
>>> batch_valid_length = Tensor(np.ones((2,)), mstype.int32)
>>> init_reset = Tensor([True], mstype.bool_)
>>> # Set is_first_iteration=True to generate the full memory states
>>> model = TransformerEncoder(batch_size=2, hidden_size=8, ffn_hidden_size=64, seq_length=16,
...                            num_heads=2, num_layers=2, use_past=True)
>>> model.add_flags_recursive(is_first_iteration=True)
>>> hidden, past = model(encoder_input_value, encoder_input_mask, init_reset, batch_valid_length)
>>> print(hidden.shape)
(2, 16, 8)
>>> print(past[0][0].shape)
(2, 2, 4, 16)
>>> print(past[0][1].shape)
(2, 2, 16, 4)
>>> encoder_input_value = Tensor(np.ones((2, 1, 8)), mstype.float32)
>>> encoder_input_mask = Tensor(np.ones((2, 1, 16)), mstype.float16)
>>> init_reset = Tensor([False], mstype.bool_)
>>> # Step 2: set is_first_iteration=False, and pass the single word to run the prediction rather than
>>> # the full sequence.
>>> model.add_flags_recursive(is_first_iteration=False)
>>> hidden, past = model(encoder_input_value, encoder_input_mask, init_reset, batch_valid_length)
>>> print(hidden.shape)
(2, 1, 8)
>>> print(past[0][0].shape)
(2, 2, 4, 16)
>>> print(past[0][1].shape)
(2, 2, 16, 4)