mindformers.modules.transformer.FeedForward
- class mindformers.modules.transformer.FeedForward(**kwargs)
A multilayer perceptron with two linear layers and dropout applied to the final output. The first linear layer projects the input from hidden_size to ffn_hidden_size, and the second projects it back from ffn_hidden_size to hidden_size. The activation selected by hidden_act is applied to the output of the first linear layer; the formula below omits it. The first linear layer is sharded on the relative dimension, and the second is sharded on the output dimension. The overall process is:
\[Dropout((xW_1+b_1)W_2 + b_2)\]
where \(W_1, W_2, b_1\) and \(b_2\) are trainable parameters.
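The formula above can be traced by hand with a small standard-library sketch. This is not the MindFormers implementation: the helper names (matmul, add_bias, gelu, dropout, feed_forward) are hypothetical, the tanh approximation of GELU and the inverted-dropout rescaling are assumptions, and the activation is placed between the two linear layers as the hidden_act argument suggests.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n, k) matrix by a (k, m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_bias(rows, bias):
    """Add a bias vector to every row."""
    return [[v + b for v, b in zip(row, bias)] for row in rows]


def gelu(v):
    """tanh approximation of GELU (an assumption, not necessarily the library's kernel)."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def dropout(rows, rate, rng, training=True):
    """Inverted dropout: zero each element with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return rows
    keep = 1.0 - rate
    return [[v / keep if rng.random() < keep else 0.0 for v in row] for row in rows]


def feed_forward(x, w1, b1, w2, b2, rate=0.1, rng=None, training=True):
    """Dropout(act(x W1 + b1) W2 + b2) for x of shape [tokens, hidden_size]."""
    if rng is None:
        rng = random.Random(0)
    hidden = add_bias(matmul(x, w1), b1)            # [tokens, ffn_hidden_size]
    hidden = [[gelu(v) for v in row] for row in hidden]
    out = add_bias(matmul(hidden, w2), b2)          # [tokens, hidden_size]
    return dropout(out, rate, rng, training)


hidden_size, ffn_hidden_size, tokens = 4, 8, 3
rng = random.Random(0)
w1 = [[rng.uniform(-0.1, 0.1) for _ in range(ffn_hidden_size)] for _ in range(hidden_size)]
b1 = [0.0] * ffn_hidden_size
w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)] for _ in range(ffn_hidden_size)]
b2 = [0.0] * hidden_size
x = [[1.0] * hidden_size for _ in range(tokens)]
y = feed_forward(x, w1, b1, w2, b2, rate=0.1, rng=rng)
print(len(y), len(y[0]))  # prints: 3 4
```

The output has the same shape as the input, which is why the layer can be dropped into a residual block without any extra projection.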
- Args:
- hidden_size (int): The dimension of the inputs.
- ffn_hidden_size (int): The intermediate hidden size.
- dropout_rate (float): The dropout rate for the second linear layer's output.
- hidden_act (str, nn.Cell): The activation of the internal feedforward layer. Supports 'relu', 'relu6', 'tanh', 'gelu', 'fast_gelu', 'elu', 'sigmoid', 'prelu', 'leakyrelu', 'hswish', 'hsigmoid', 'logsigmoid' and so on. A custom activation can also be passed. To run the network in parallel mode, the custom activation must also provide an activation_shard function; see the examples. Default: 'gelu'.
- expert_num (int): The number of experts used in Linear. When expert_num > 1, BatchMatMul is used
and the first dimension in BatchMatMul indicates expert_num. Default: 1.
- expert_group_size (int): The number of tokens in each data parallel group. Default: None. This parameter
takes effect only in AUTO_PARALLEL mode, and not under SHARDING_PROPAGATION.
- param_init_type (dtype.Number): The parameter initialization type. Should be mstype.float32 or
mstype.float16. Default: mstype.float32.
- parallel_config (OpParallelConfig, MoEParallelConfig): The parallel configuration, see
OpParallelConfig or MoEParallelConfig. When MoE is applied, MoEParallelConfig is effective; otherwise OpParallelConfig is effective. Default: default_dpmp_config, an instance of OpParallelConfig with default args.
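To make the expert_num > 1 case concrete, the following is a rough sketch of a batched matrix multiply whose leading dimension is the expert index. The layout (inputs as [expert_num, tokens, in], weights as [expert_num, in, out]) and the helper names are assumptions for illustration; how tokens are routed to experts is outside the scope of this layer and is not shown.

```python
def matmul(a, b):
    """Multiply an (n, k) matrix by a (k, m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def batch_matmul(xs, ws):
    """Apply one weight matrix per expert.

    xs: [expert_num, tokens, in_features]
    ws: [expert_num, in_features, out_features]
    returns: [expert_num, tokens, out_features]
    """
    if len(xs) != len(ws):
        raise ValueError("inputs and weights must share the leading expert dimension")
    return [matmul(x, w) for x, w in zip(xs, ws)]


# Two experts, one token each, in_features=2, out_features=1.
xs = [[[1.0, 2.0]], [[3.0, 4.0]]]
ws = [[[1.0], [1.0]], [[2.0], [0.0]]]
out = batch_matmul(xs, ws)
print(out)  # prints: [[[3.0]], [[6.0]]]
```

Each expert only ever sees its own slice of the input, so the experts' weights stay independent even though a single batched operation computes all of them.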
- Inputs:
x (Tensor) - A float tensor of shape [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size].
- Outputs:
Tensor, the output of this layer after mapping. The shape is [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size].
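The two accepted layouts hold the same tokens and differ only in whether the batch and sequence axes are merged. A minimal sketch of converting between them, assuming the layer treats every token independently (the helper names flatten_tokens and unflatten_tokens are hypothetical):

```python
def flatten_tokens(x):
    """[batch, seq_length, hidden] -> ([batch * seq_length, hidden], batch, seq_length)."""
    batch, seq_length = len(x), len(x[0])
    rows = [token for sequence in x for token in sequence]
    return rows, batch, seq_length


def unflatten_tokens(rows, batch, seq_length):
    """[batch * seq_length, hidden] -> [batch, seq_length, hidden]."""
    return [rows[i * seq_length:(i + 1) * seq_length] for i in range(batch)]


# batch=2, seq_length=3, hidden=4
x3d = [[[float(b * 10 + s)] * 4 for s in range(3)] for b in range(2)]
x2d, batch, seq_length = flatten_tokens(x3d)
print(len(x2d), len(x2d[0]))  # prints: 6 4
restored = unflatten_tokens(x2d, batch, seq_length)
```

Because the round trip is lossless, either layout can be passed in and the output comes back in the matching layout.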
- Raises:
- TypeError: hidden_act is not a string or nn.Cell.
- TypeError: parallel_config is not a subclass of OpParallelConfig.
- ValueError: ffn_hidden_size is not a multiple of the model-parallel degree.
- ValueError: hidden_size is not a multiple of the model-parallel degree.
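The two ValueError conditions amount to a divisibility check: each size must split evenly across the model-parallel devices. The sketch below is a hypothetical pre-flight helper (check_shardable is not part of MindFormers), and the error message wording is an assumption; the type checks are omitted.

```python
def check_shardable(hidden_size, ffn_hidden_size, model_parallel):
    """Raise ValueError when a size cannot be split evenly across model-parallel devices."""
    for name, size in (("ffn_hidden_size", ffn_hidden_size), ("hidden_size", hidden_size)):
        if size % model_parallel != 0:
            raise ValueError(
                f"{name} ({size}) must be a multiple of model_parallel ({model_parallel})"
            )


# 16 and 32 both divide evenly by 4, so this passes silently.
check_shardable(hidden_size=16, ffn_hidden_size=32, model_parallel=4)

# 15 and 30 do not divide evenly by 4, so this raises.
try:
    check_shardable(hidden_size=15, ffn_hidden_size=30, model_parallel=4)
except ValueError as err:
    print(err)
```

Running such a check before constructing the layer can surface a bad size early, before any parallel compilation starts.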
- Supported Platforms:
Ascend GPU
- Examples:
>>> import numpy as np
>>> from mindformers.modules.transformer import FeedForward
>>> from mindspore import dtype as mstype
>>> from mindspore import Tensor, nn
>>> import mindspore.ops as ops
>>> model = FeedForward(hidden_size=15, ffn_hidden_size=30, dropout_rate=0.1)
>>> tensor = Tensor(np.ones((2, 20, 15)), mstype.float32)
>>> output = model(tensor)
>>> print(output.shape)
(2, 20, 15)
>>> # Example 2 using custom hidden activation
>>> class MyActivationNoShard(nn.Cell):
...     def __init__(self):
...         super(MyActivationNoShard, self).__init__()
...         self.add = ops.Add()
...     def construct(self, x):
...         return self.add(x, 0.1)
>>> model = FeedForward(hidden_size=15, ffn_hidden_size=30, dropout_rate=0.1,
...                     hidden_act=MyActivationNoShard)
>>> tensor = Tensor(np.ones((2, 20, 15)), mstype.float32)
>>> output = model(tensor)
>>> print(output.shape)
(2, 20, 15)
>>> # Example 3 using custom hidden activation with activation_shard
>>> # To run in SEMI/AUTO parallel mode, the custom activation must provide
>>> # a method named activation_shard. It accepts the argument parallel_config (OpParallelConfig,
>>> # MoEParallelConfig) and sets the shard strategy for the primitives used in construct.
>>> class MyActivationWithShard(nn.Cell):
...     def __init__(self):
...         super(MyActivationWithShard, self).__init__()
...         self.add = ops.Add()
...     def construct(self, x):
...         return self.add(x, 0.1)
...     def activation_shard(self, parallel_config):
...         self.add.shard(((parallel_config.data_parallel, parallel_config.model_parallel), ()))
>>>
>>> model = FeedForward(hidden_size=15, ffn_hidden_size=30, dropout_rate=0.1,
...                     hidden_act=MyActivationWithShard)
>>> tensor = Tensor(np.ones((2, 20, 15)), mstype.float32)
>>> output = model(tensor)
>>> print(output.shape)
(2, 20, 15)