mindformers.modules.layers.Linear
- class mindformers.modules.layers.Linear(in_channels, out_channels, weight_init='normal', bias_init='zeros', has_bias=True, activation=None, transpose_b=True, expert_num=1, outer_batch=1, expert_group_size=None, param_init_type=mindspore.float32, compute_dtype=mindspore.float16)
The densely connected (fully connected) layer. When parallel mode is enabled, the input should be a 3-D tensor.
Applies a densely connected layer to the input. This layer implements the operation:
\[\text{outputs} = \text{activation}(\text{X} * \text{kernel} + \text{bias}),\]where \(X\) is the input tensor, \(\text{activation}\) is the activation function passed as the activation argument (if provided), \(\text{kernel}\) is a weight matrix created by the layer with the same data type as \(X\), and \(\text{bias}\) is a bias vector created by the layer with the same data type as \(X\) (only if has_bias is True).
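The formula above can be illustrated with a minimal pure-Python sketch. It is not the MindFormers implementation: `dense_forward` and `relu` are hypothetical helper names, and the weight layout `(out_channels, in_channels)` is assumed from the default `transpose_b=True` and the shape checks listed under Raises.

```python
def relu(value):
    """Element-wise ReLU for a single float."""
    return max(0.0, value)


def dense_forward(x, weight, bias=None, activation=None):
    """Compute activation(x @ weight^T + bias) on nested lists.

    x:      rows of length in_channels
    weight: out_channels rows of length in_channels (assumed layout)
    bias:   out_channels values, or None when has_bias is False
    """
    outputs = []
    for row in x:
        out_row = []
        for j, w_row in enumerate(weight):
            acc = sum(a * b for a, b in zip(row, w_row))
            if bias is not None:
                acc += bias[j]
            if activation is not None:
                acc = activation(acc)
            out_row.append(acc)
        outputs.append(out_row)
    return outputs


# in_channels = 3, out_channels = 2, two input rows
x = [[1.0, 2.0, 3.0], [-1.0, 0.0, 1.0]]
weight = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
bias = [0.0, 1.0]
y = dense_forward(x, weight, bias, relu)
print(y)
```

Each output row has `out_channels` entries, which matches the `(*, out_channels)` output shape documented for the layer.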
- Args:
in_channels (int): The number of channels in the input space.
out_channels (int): The number of channels in the output space.
weight_init (Union[Tensor, str, Initializer, numbers.Number]): The initializer for the trainable weight parameter. The dtype is the same as x. String values refer to the function initializer. Default: 'normal'.
bias_init (Union[Tensor, str, Initializer, numbers.Number]): The initializer for the trainable bias parameter. The dtype is the same as x. String values refer to the function initializer. Default: 'zeros'.
has_bias (bool): Specifies whether the layer uses a bias vector. Default: True.
activation (str): The activation function applied to the output of the fully connected layer, e.g. 'ReLU'. Default: None.
expert_num (int): The number of experts used in this Linear. When expert_num > 1, BatchMatMul is used and the first dimension in BatchMatMul indicates expert_num. Default: 1.
outer_batch (int): The replication number of experts. The replication takes effect only when MoE is applied. Default: 1.
expert_group_size (int): The number of tokens in each data parallel group. Default: None.
compute_dtype (dtype.Number): The computation type. Default: mstype.float16.
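As a rough illustration of the expert_num > 1 case, the sketch below runs one matrix product per expert, with the expert index as the leading (batch) dimension. The layouts are assumptions for illustration only: inputs of shape `(expert_num, tokens, in_channels)` and weights of shape `(expert_num, out_channels, in_channels)`. The helper names are hypothetical and this is not the library's code.

```python
def matmul_transpose_b(x, w):
    """x: (tokens, in_channels); w: (out_channels, in_channels) -> (tokens, out_channels)."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]


def batch_matmul_transpose_b(xs, ws):
    """Apply one matmul per expert; the first dimension of xs and ws is expert_num."""
    if len(xs) != len(ws):
        raise ValueError("xs and ws must have the same leading (expert) dimension")
    return [matmul_transpose_b(x, w) for x, w in zip(xs, ws)]


expert_num = 2
# Assumed layout: (expert_num, tokens=1, in_channels=2)
xs = [[[1.0, 2.0]], [[3.0, 4.0]]]
# Assumed layout: (expert_num, out_channels=3, in_channels=2)
ws = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0], [1.0, -1.0]],
]
ys = batch_matmul_transpose_b(xs, ws)
print(ys)
```

Each expert uses its own weight slice, so tokens routed to different experts are transformed independently.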
- Inputs:
x (Tensor) - Tensor of shape \((*, in\_channels)\). The in_channels in Args should be equal to \(in\_channels\) in Inputs.
- Outputs:
Tensor of shape \((*, out\_channels)\).
- Raises:
TypeError: If in_channels or out_channels is not an int.
TypeError: If has_bias is not a bool.
TypeError: If activation is not one of str, Cell, Primitive, None.
ValueError: If the length of the shape of weight_init is not equal to 2, or shape[0] of weight_init is not equal to out_channels, or shape[1] of weight_init is not equal to in_channels.
ValueError: If the length of the shape of bias_init is not equal to 1, or shape[0] of bias_init is not equal to out_channels.
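The conditions above can be mirrored in a small standalone check. This is a sketch only: `check_linear_args` is a hypothetical helper, not part of MindFormers. It takes plain shape tuples instead of tensors and omits the activation type check.

```python
def check_linear_args(in_channels, out_channels, has_bias=True,
                      weight_shape=None, bias_shape=None):
    """Raise the same exception types the Raises section describes (sketch)."""
    for name, value in (("in_channels", in_channels), ("out_channels", out_channels)):
        if not isinstance(value, int):
            raise TypeError(f"{name} must be an int, got {type(value).__name__}")
    if not isinstance(has_bias, bool):
        raise TypeError("has_bias must be a bool")
    if weight_shape is not None:
        # Expected weight shape: (out_channels, in_channels)
        if (len(weight_shape) != 2 or weight_shape[0] != out_channels
                or weight_shape[1] != in_channels):
            raise ValueError(f"bad weight_init shape {tuple(weight_shape)}")
    if bias_shape is not None:
        # Expected bias shape: (out_channels,)
        if len(bias_shape) != 1 or bias_shape[0] != out_channels:
            raise ValueError(f"bad bias_init shape {tuple(bias_shape)}")


cases = [
    dict(in_channels=3, out_channels=2, weight_shape=(2, 3), bias_shape=(2,)),
    dict(in_channels=3.0, out_channels=2),
    dict(in_channels=3, out_channels=2, weight_shape=(3, 2)),
    dict(in_channels=3, out_channels=2, bias_shape=(3,)),
]
results = []
for kwargs in cases:
    try:
        check_linear_args(**kwargs)
        results.append("ok")
    except (TypeError, ValueError) as exc:
        results.append(type(exc).__name__)
print(results)
```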
- Supported Platforms:
Ascend GPU
- shard(strategy_matmul, strategy_bias=None, strategy_activation=None, out_strategy_matmul=None)
Sets the shard strategies for the linear layer. The size of each strategy should be equal to that of the corresponding inputs.
- Note:
It is valid only in semi auto parallel or auto parallel mode. In other parallel modes, strategies set here will be ignored.
- Args:
strategy_matmul (tuple): The strategy for the matmul. Should be the same shape as the inputs.
strategy_bias (tuple): The strategy for the bias_add. Should be the same shape as the inputs.
strategy_activation (tuple): The strategy for the activation. Should be the same shape as the inputs.
out_strategy_matmul (tuple): The out strategy for the matmul. Should be the same shape as the inputs.
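To make "the same shape as the inputs" more concrete, the sketch below computes the per-device slice shape that a strategy implies. It assumes the usual MindSpore operator-level convention: one inner tuple per operator input, with one entry per dimension giving the number of slices along that dimension. `slice_shape`, `dp`, `mp`, and the tensor shapes are hypothetical values chosen for illustration.

```python
def slice_shape(full_shape, strategy):
    """Return the per-device shape when each dimension is cut into `strategy[i]` slices."""
    if len(full_shape) != len(strategy):
        raise ValueError("strategy must have one entry per tensor dimension")
    local = []
    for dim, parts in zip(full_shape, strategy):
        if dim % parts != 0:
            raise ValueError(f"dimension {dim} is not divisible by {parts}")
        local.append(dim // parts)
    return tuple(local)


dp, mp = 2, 4  # hypothetical data-parallel and model-parallel degrees
# One inner tuple per matmul input: (activations, weight)
strategy_matmul = ((dp, 1), (mp, 1))
x_shape = (16, 1024)    # (tokens, in_channels), illustrative
w_shape = (4096, 1024)  # (out_channels, in_channels), illustrative

local_shapes = tuple(
    slice_shape(shape, strategy)
    for shape, strategy in zip((x_shape, w_shape), strategy_matmul)
)
print(local_shapes)
```

With these values, each device holds an 8 x 1024 slice of the activations and a 1024 x 1024 slice of the weight. As the Note states, such strategies only take effect in semi auto parallel or auto parallel mode.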