mindformers.modules.transformer.FeedForward

class mindformers.modules.transformer.FeedForward(**kwargs)

A multilayer perceptron with two linear layers, with dropout applied to the final output. The first linear layer projects the input from hidden_size to ffn_hidden_size, and the second projects it back from ffn_hidden_size to hidden_size. The first linear is sharded on the relative dimension, and the second linear is sharded on the output dimension. The overall computation is:

\[Dropout((xW_1+b_1)W_2 + b_2)\]

where \(W_1, W_2, b_1\) and \(b_2\) are trainable parameters. The formula does not show the activation hidden_act of the internal feedforward layer.
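As a rough illustration of the formula above, the following NumPy sketch evaluates the same expression outside MindSpore. It is not the library's implementation: the function name `feed_forward_reference`, the tanh-approximated GELU, the inverted-dropout scaling, and the placement of the activation on the first linear's output are all assumptions made for this example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU; assumed here for illustration only
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def feed_forward_reference(x, w1, b1, w2, b2, dropout_rate=0.0, rng=None):
    """Hypothetical NumPy reference for Dropout(act(x W1 + b1) W2 + b2)."""
    hidden = gelu(x @ w1 + b1)   # [..., hidden_size] -> [..., ffn_hidden_size]
    out = hidden @ w2 + b2       # [..., ffn_hidden_size] -> [..., hidden_size]
    if dropout_rate > 0.0:
        # Inverted dropout: zero some elements and rescale the survivors.
        rng = rng if rng is not None else np.random.default_rng()
        keep = rng.random(out.shape) >= dropout_rate
        out = out * keep / (1.0 - dropout_rate)
    return out

hidden_size, ffn_hidden_size = 15, 30
rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.02, size=(hidden_size, ffn_hidden_size))
b1 = np.zeros(ffn_hidden_size)
w2 = rng.normal(scale=0.02, size=(ffn_hidden_size, hidden_size))
b2 = np.zeros(hidden_size)

x = np.ones((2, 20, hidden_size))
y = feed_forward_reference(x, w1, b1, w2, b2)
print(y.shape)
```

Because the matrix products act only on the last dimension, the same function accepts either a [batch, seq_length, hidden_size] or a [batch * seq_length, hidden_size] input.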

Args:

  • hidden_size (int) - The dimension of the inputs.
  • ffn_hidden_size (int) - The intermediate hidden size.
  • dropout_rate (float) - The dropout rate for the second linear’s output.
  • hidden_act (str, nn.Cell) - The activation of the internal feedforward layer. Supports ‘relu’, ‘relu6’, ‘tanh’, ‘gelu’, ‘fast_gelu’, ‘elu’, ‘sigmoid’, ‘prelu’, ‘leakyrelu’, ‘hswish’, ‘hsigmoid’, ‘logsigmoid’ and so on. Users can also pass a custom activation. To run the network in parallel mode, the custom activation must also provide the activation_shard function. Please see the examples. Default: gelu.
  • expert_num (int) - The number of experts used in Linear. When expert_num > 1, BatchMatMul is used and the first dimension in BatchMatMul indicates expert_num. Default: 1.
  • expert_group_size (int) - The number of tokens in each data parallel group. This parameter is effective only in AUTO_PARALLEL mode, and not in SHARDING_PROPAGATION. Default: None.
  • param_init_type (dtype.Number) - The parameter initialization type. Should be mstype.float32 or mstype.float16. Default: mstype.float32.
  • parallel_config (OpParallelConfig, MoEParallelConfig) - The parallel configuration, see OpParallelConfig or MoEParallelConfig. When MoE is applied, MoEParallelConfig is effective, otherwise OpParallelConfig is effective. Default: default_dpmp_config, an instance of OpParallelConfig with default args.
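To illustrate the expert_num > 1 case, here is a NumPy sketch of a batched matrix multiplication whose first dimension indexes the experts. The shapes and the names `tokens_per_expert`, `expert_weights` and `expert_inputs` are assumptions for this example, and how the library routes tokens to experts is not shown.

```python
import numpy as np

expert_num, tokens_per_expert = 4, 8
hidden_size, ffn_hidden_size = 15, 30

rng = np.random.default_rng(0)
# One weight matrix per expert, stacked along the first dimension.
expert_weights = rng.normal(size=(expert_num, hidden_size, ffn_hidden_size))
# Tokens already grouped per expert (the grouping itself is assumed).
expert_inputs = rng.normal(size=(expert_num, tokens_per_expert, hidden_size))

# np.matmul batches over the leading dimension, so each expert
# multiplies only its own tokens by its own weights.
expert_outputs = np.matmul(expert_inputs, expert_weights)
print(expert_outputs.shape)
```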

Inputs:
  • x (Tensor) - Float tensor of shape [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size].

Outputs:

Tensor, the output of this layer after mapping. The shape is [batch, seq_length, hidden_size] or [batch * seq_length, hidden_size].

Raises:

  • TypeError - hidden_act is not a string or nn.Cell.
  • TypeError - parallel_config is not a subclass of OpParallelConfig.
  • ValueError - ffn_hidden_size is not a multiple of the model parallel way.
  • ValueError - hidden_size is not a multiple of the model parallel way.
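The two ValueError conditions amount to a divisibility check against the model-parallel degree. The helper below is a hypothetical stand-alone sketch of that check; the name `check_model_parallel_divisible` is not part of mindformers. It can be used to validate sizes before constructing the layer.

```python
def check_model_parallel_divisible(hidden_size, ffn_hidden_size, model_parallel):
    """Hypothetical pre-check mirroring the documented ValueError conditions."""
    for name, value in (("hidden_size", hidden_size),
                        ("ffn_hidden_size", ffn_hidden_size)):
        if value % model_parallel != 0:
            raise ValueError(
                f"{name} ({value}) must be a multiple of model_parallel ({model_parallel})"
            )

# 16 and 64 are both multiples of 4, so this call returns without raising.
check_model_parallel_divisible(hidden_size=16, ffn_hidden_size=64, model_parallel=4)
```

With the sizes used in the examples on this page (hidden_size=15, ffn_hidden_size=30), a model-parallel degree of 2 would fail this check because 15 is not a multiple of 2.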

Supported Platforms:

Ascend GPU

Examples:
>>> import numpy as np
>>> from mindformers.modules.transformer import FeedForward
>>> from mindspore import dtype as mstype
>>> from mindspore import Tensor, nn
>>> import mindspore.ops as ops
>>> model = FeedForward(hidden_size=15, ffn_hidden_size=30, dropout_rate=0.1)
>>> tensor = Tensor(np.ones((2, 20, 15)), mstype.float32)
>>> output = model(tensor)
>>> print(output.shape)
(2, 20, 15)
>>> # Example 2 using custom hidden activation
>>> class MyActivationNoShard(nn.Cell):
...     def __init__(self):
...         super(MyActivationNoShard, self).__init__()
...         self.add = ops.Add()
...     def construct(self, x):
...         return self.add(x, 0.1)
>>> model = FeedForward(hidden_size=15, ffn_hidden_size=30, dropout_rate=0.1,
...                     hidden_act=MyActivationNoShard)
>>> tensor = Tensor(np.ones((2, 20, 15)), mstype.float32)
>>> output = model(tensor)
>>> print(output.shape)
(2, 20, 15)
>>> # Example 3 using custom hidden activation with activation_shard
>>> # To run in SEMI/AUTO parallel mode, the custom activation must provide
>>> # a method named activation_shard. It accepts the argument parallel_config (OpParallelConfig,
>>> # MoEParallelConfig) and sets the shard strategy for the primitives used in construct.
>>> class MyActivationWithShard(nn.Cell):
...     def __init__(self):
...         super(MyActivationWithShard, self).__init__()
...         self.add = ops.Add()
...     def construct(self, x):
...         return self.add(x, 0.1)
...     def activation_shard(self, parallel_config):
...         self.add.shard(((parallel_config.data_parallel, parallel_config.model_parallel), ()))
>>>
>>> model = FeedForward(hidden_size=15, ffn_hidden_size=30, dropout_rate=0.1,
...                     hidden_act=MyActivationWithShard)
>>> tensor = Tensor(np.ones((2, 20, 15)), mstype.float32)
>>> output = model(tensor)
>>> print(output.shape)
(2, 20, 15)