mindformers.modules.layers.Linear

class mindformers.modules.layers.Linear(in_channels, out_channels, weight_init='normal', bias_init='zeros', has_bias=True, activation=None, transpose_b=True, expert_num=1, outer_batch=1, expert_group_size=None, param_init_type=mindspore.float32, compute_dtype=mindspore.float16)[source]

A densely connected layer. When parallel mode is enabled, the input should be a 3-D tensor.

Applies a densely connected layer to the input. This layer implements the operation:

\[\text{outputs} = \text{activation}(\text{X} * \text{kernel} + \text{bias}),\]

where \(X\) is the input tensor, \(\text{activation}\) is the activation function passed as the activation argument (if provided), \(\text{kernel}\) is a weight matrix created by the layer with the same data type as \(X\), and \(\text{bias}\) is a bias vector created by the layer with the same data type as \(X\) (only if has_bias is True).
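As a rough illustration of the formula above, the following plain-Python sketch computes the same operation on nested lists. It is not the mindformers implementation: the helper names `dense` and `relu` are hypothetical, and the kernel is assumed to be stored as (out_channels, in_channels), the layout implied by the weight_init checks in the Raises section.

```python
# Sketch only: a plain-Python stand-in for the documented formula,
# not the mindformers implementation. `dense` and `relu` are hypothetical names.

def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]

def dense(x, kernel, bias=None, activation=None):
    """Compute activation(x * kernel + bias) row by row.

    x:      list of rows, each of length in_channels
    kernel: out_channels rows of in_channels values
            (assumed layout: (out_channels, in_channels))
    bias:   list of out_channels values, or None
    """
    outputs = []
    for row in x:
        # One dot product per output channel.
        out = [sum(a * w for a, w in zip(row, k_row)) for k_row in kernel]
        if bias is not None:
            out = [o + b for o, b in zip(out, bias)]
        if activation is not None:
            out = activation(out)
        outputs.append(out)
    return outputs

x = [[1.0, 2.0, 3.0], [-1.0, 0.0, 1.0]]       # shape (2, 3), in_channels = 3
kernel = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]  # shape (2, 3), out_channels = 2
bias = [0.0, 1.0]

y = dense(x, kernel, bias, activation=relu)
print(y)  # [[0.0, 4.0], [0.0, 1.0]]
```

The last dimension changes from in_channels to out_channels while the leading dimension is preserved, matching the input and output shapes described below.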

Parameters
  • in_channels (int) – The number of channels in the input space.

  • out_channels (int) – The number of channels in the output space.

  • weight_init (Union[Tensor, str, Initializer, numbers.Number]) – The initializer for the trainable weight parameter. The dtype is the same as x. When a str is given, its value refers to the function initializer. Default: ‘normal’.

  • bias_init (Union[Tensor, str, Initializer, numbers.Number]) – The initializer for the trainable bias parameter. The dtype is the same as x. When a str is given, its value refers to the function initializer. Default: ‘zeros’.

  • has_bias (bool) – Specifies whether the layer uses a bias vector. Default: True.

  • activation (str) – The activation function applied to the output of the fully connected layer, e.g. ‘ReLU’. Default: None.

  • expert_num (int) – The number of experts used in this Linear layer. When expert_num > 1, BatchMatMul is used and its first dimension indicates expert_num. Default: 1.

  • outer_batch (int) – The replication number of experts. The replication is effective only when MoE is applied. Default: 1.

  • expert_group_size (int) – The number of tokens in each data parallel group. Default: None.

  • compute_dtype (dtype.Number) – The computation type. Default: mstype.float16.
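To make the expert_num > 1 case more concrete, here is a minimal plain-Python sketch of a batched matrix multiply whose first dimension indexes the expert. The helper names are hypothetical, and the per-expert weight layout (expert_num, out_channels, in_channels) is an assumption for illustration, not something this page states.

```python
# Sketch only: illustrates a batched matmul with the expert index as the
# leading dimension. Helper names and the weight layout are assumptions.

def matmul_t(x, w):
    """x: (tokens, in_channels), w: (out_channels, in_channels) -> (tokens, out_channels)."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]

def batch_matmul(xs, ws):
    """Apply one independent matmul per expert; xs[i] is paired with ws[i]."""
    if len(xs) != len(ws):
        raise ValueError("inputs and weights must have the same expert dimension")
    return [matmul_t(x, w) for x, w in zip(xs, ws)]

expert_num = 2
xs = [
    [[1.0, 2.0]],   # tokens for expert 0, shape (1, 2)
    [[3.0, 4.0]],   # tokens for expert 1, shape (1, 2)
]
ws = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],    # expert 0 weight, shape (3, 2)
    [[2.0, 0.0], [0.0, 2.0], [1.0, -1.0]],   # expert 1 weight, shape (3, 2)
]

ys = batch_matmul(xs, ws)
print(ys)  # [[[1.0, 2.0, 3.0]], [[6.0, 8.0, -1.0]]]
```

Each expert has its own weight matrix, so the result keeps the leading expert dimension of size expert_num.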

Inputs:
  • x (Tensor) - Tensor of shape \((*, in\_channels)\). The in_channels in Args should be equal to \(in\_channels\) in Inputs.

Outputs:

Tensor of shape \((*, out\_channels)\).

Raises
  • TypeError – If in_channels or out_channels is not an int.

  • TypeError – If has_bias is not a bool.

  • TypeError – If activation is not one of str, Cell, Primitive, None.

  • ValueError – If length of shape of weight_init is not equal to 2 or shape[0] of weight_init is not equal to out_channels or shape[1] of weight_init is not equal to in_channels.

  • ValueError – If length of shape of bias_init is not equal to 1 or shape[0] of bias_init is not equal to out_channels.
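The two ValueError conditions above can be read as the following shape checks. This is a hedged sketch in plain Python: `check_init_shapes` is a hypothetical helper, not a mindformers function, and it mirrors only the conditions listed on this page.

```python
# Sketch only: mirrors the documented ValueError conditions.
# `check_init_shapes` is a hypothetical helper, not part of mindformers.

def check_init_shapes(weight_shape, bias_shape, in_channels, out_channels):
    """Raise ValueError if the initial weight or bias shape is inconsistent."""
    if (len(weight_shape) != 2
            or weight_shape[0] != out_channels
            or weight_shape[1] != in_channels):
        raise ValueError(
            f"weight_init shape {weight_shape} must be ({out_channels}, {in_channels})")
    if bias_shape is not None and (
            len(bias_shape) != 1 or bias_shape[0] != out_channels):
        raise ValueError(
            f"bias_init shape {bias_shape} must be ({out_channels},)")

# A weight of shape (out_channels, in_channels) and a bias of shape (out_channels,) pass.
check_init_shapes((4, 3), (4,), in_channels=3, out_channels=4)

# A transposed weight shape is rejected.
try:
    check_init_shapes((3, 4), (4,), in_channels=3, out_channels=4)
except ValueError as err:
    print("rejected:", err)
```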

Supported Platforms:

Ascend GPU

shard(strategy_matmul, strategy_bias=None, strategy_activation=None, out_strategy_matmul=None)[source]

Set the shard strategies for the linear layer. The strategy size should match that of the inputs.

Note

It is valid only in semi auto parallel or auto parallel mode. In other parallel modes, strategies set here will be ignored.

Parameters
  • strategy_matmul (tuple) – The strategy for the matmul. Should be the same shape as the inputs.

  • strategy_bias (tuple) – The strategy for the bias_add. Should be the same shape as the inputs.

  • strategy_activation (tuple) – The strategy for the activation. Should be the same shape as the inputs.

  • out_strategy_matmul (tuple) – The out strategy for the matmul. Should be the same shape as the inputs.
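One way to read "the same shape as the inputs" is that a strategy holds one tuple per operator input, with one split factor per dimension of that input. The plain-Python sketch below works under that assumption and computes the per-device slice shape. The helper `sliced_shape` and the example shapes are hypothetical, and nothing here calls mindformers.

```python
# Sketch only: assumes a strategy has one tuple per operator input and one
# split factor per dimension of that input. `sliced_shape` is hypothetical.

def sliced_shape(shape, strategy):
    """Return the per-device shape after splitting each dim by its factor."""
    if len(shape) != len(strategy):
        raise ValueError("strategy must have one entry per input dimension")
    local = []
    for dim, parts in zip(shape, strategy):
        if dim % parts != 0:
            raise ValueError(f"dimension {dim} is not divisible by {parts}")
        local.append(dim // parts)
    return tuple(local)

x_shape = (8, 16)    # (batch, in_channels)
w_shape = (32, 16)   # (out_channels, in_channels), assumed layout

# Split the batch 2 ways and the output channels 4 ways.
strategy_matmul = ((2, 1), (4, 1))

x_local = sliced_shape(x_shape, strategy_matmul[0])
w_local = sliced_shape(w_shape, strategy_matmul[1])
print(x_local, w_local)  # (4, 16) (8, 16)
```

Under the same assumption, a call might look like linear.shard(strategy_matmul=((2, 1), (4, 1))). As the note above says, this takes effect only in semi auto parallel or auto parallel mode.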