mindformers.modules.layers.Linear

class mindformers.modules.layers.Linear(in_channels, out_channels, weight_init='normal', bias_init='zeros', has_bias=True, activation=None, transpose_b=True, expert_num=1, outer_batch=1, expert_group_size=None, param_init_type=mindspore.float32, compute_dtype=mindspore.float16)

A densely connected (fully connected) layer. When parallel mode is enabled, the input should be a 3-D tensor.

Applies a dense transformation to the input. The layer implements the operation:

\[\text{outputs} = \text{activation}(\text{X} * \text{kernel} + \text{bias}),\]

where \(X\) is the input tensor, \(\text{activation}\) is the activation function passed as the activation argument (if one is passed), \(\text{kernel}\) is a weight matrix created by the layer with the same data type as \(X\), and \(\text{bias}\) is a bias vector created by the layer with the same data type as \(X\) (only if has_bias is True).
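The formula above can be illustrated with a dependency-free Python sketch. It is not the mindformers implementation: the helper names `linear_forward` and `relu` are hypothetical, and the weight is assumed to be stored as (out_channels, in_channels), which matches the weight_init shape checks listed under Raises.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def linear_forward(x, weight, bias=None, activation=None):
    """Compute activation(x * kernel + bias) row by row.

    x:      list of rows, each with in_channels values.
    weight: out_channels rows of in_channels values (assumed layout).
    bias:   out_channels values, or None when no bias is used.
    """
    outputs = []
    for row in x:
        # One dot product per output channel.
        out = [sum(a * w for a, w in zip(row, w_row)) for w_row in weight]
        if bias is not None:
            out = [o + b for o, b in zip(out, bias)]
        if activation is not None:
            out = activation(out)
        outputs.append(out)
    return outputs


x = [[1.0, 2.0, 3.0], [-1.0, 0.0, 1.0]]        # 2 rows, in_channels = 3
weight = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]   # out_channels = 2
bias = [0.0, 1.0]
y = linear_forward(x, weight, bias, activation=relu)
print(y)  # [[0.0, 4.0], [0.0, 1.0]]
```

Each input row with in_channels values becomes an output row with out_channels values; negative pre-activations are clipped to zero by the ReLU.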

Args:

in_channels (int): The number of channels in the input space.

out_channels (int): The number of channels in the output space.

weight_init (Union[Tensor, str, Initializer, numbers.Number]): The initializer for the trainable weight parameter. The dtype is the same as x. String values refer to the function initializer. Default: 'normal'.

bias_init (Union[Tensor, str, Initializer, numbers.Number]): The initializer for the trainable bias parameter. The dtype is the same as x. String values refer to the function initializer. Default: 'zeros'.

has_bias (bool): Specifies whether the layer uses a bias vector. Default: True.

activation (str): The activation function applied to the output of the fully connected layer, e.g. 'ReLU'. Default: None.

expert_num (int): The number of experts used in this Linear. When expert_num > 1, BatchMatMul is used and the first dimension in BatchMatMul indicates expert_num. Default: 1.

outer_batch (int): The replication number of experts. The replication takes effect only when MoE is applied. Default: 1.

expert_group_size (int): The number of tokens in each data parallel group. Default: None.

compute_dtype (dtype.Number): The computation type. Default: mstype.float16.
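The expert_num > 1 case can be pictured as a batched matrix multiplication whose leading dimension indexes the expert. The sketch below is only an illustration in plain Python, not the BatchMatMul operator itself: the function name `batch_linear` is hypothetical, and the per-expert weight layout (out_channels, in_channels) is an assumption.

```python
def batch_linear(x, weights):
    """Apply one weight matrix per expert.

    x:       [expert_num][tokens][in_channels]
    weights: [expert_num][out_channels][in_channels] (assumed layout)
    Returns  [expert_num][tokens][out_channels].
    """
    if len(x) != len(weights):
        raise ValueError("x and weights must have the same leading expert dimension")
    return [
        [
            [sum(a * w for a, w in zip(token, w_row)) for w_row in expert_w]
            for token in expert_x
        ]
        for expert_x, expert_w in zip(x, weights)
    ]


# Two experts, one token each, in_channels = out_channels = 2.
x = [[[1.0, 2.0]], [[3.0, 4.0]]]
weights = [
    [[1.0, 0.0], [0.0, 1.0]],   # expert 0: identity
    [[2.0, 0.0], [0.0, 0.5]],   # expert 1: per-channel scaling
]
out = batch_linear(x, weights)
print(out)  # [[[1.0, 2.0]], [[6.0, 2.0]]]
```

Each expert sees only its own slice of tokens and its own weight matrix, which is why the expert count appears as the first dimension.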

Inputs:
  • x (Tensor) - Tensor of shape \((*, in\_channels)\). The in_channels in Args should be equal to \(in\_channels\) in Inputs.

Outputs:

Tensor of shape \((*, out\_channels)\).
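The shape rule for Inputs and Outputs can be sketched as a small helper: all leading dimensions are preserved and only the last one changes from in_channels to out_channels. The function name `linear_output_shape` is hypothetical and is not part of mindformers.

```python
def linear_output_shape(input_shape, in_channels, out_channels):
    """Map an input shape (*, in_channels) to the output shape (*, out_channels)."""
    if len(input_shape) == 0 or input_shape[-1] != in_channels:
        raise ValueError(
            f"last dimension of {tuple(input_shape)} must equal in_channels={in_channels}"
        )
    return tuple(input_shape[:-1]) + (out_channels,)


# A 3-D input (batch, sequence, in_channels), as expected in parallel mode.
print(linear_output_shape((2, 16, 512), in_channels=512, out_channels=1024))
# (2, 16, 1024)
```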

Raises:

TypeError: If in_channels or out_channels is not an int.

TypeError: If has_bias is not a bool.

TypeError: If activation is not one of str, Cell, Primitive, None.

ValueError: If the length of the shape of weight_init is not equal to 2, or shape[0] of weight_init is not equal to out_channels, or shape[1] of weight_init is not equal to in_channels.

ValueError: If the length of the shape of bias_init is not equal to 1, or shape[0] of bias_init is not equal to out_channels.
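The two ValueError conditions can be restated as explicit shape checks. This is a sketch of the documented rules only, assuming weight_init and bias_init are given as tensors whose shapes are known; `check_init_shapes` is a hypothetical helper, not a mindformers function.

```python
def check_init_shapes(weight_shape, bias_shape, in_channels, out_channels):
    """Raise ValueError if the initializer shapes violate the documented rules."""
    if (
        len(weight_shape) != 2
        or weight_shape[0] != out_channels
        or weight_shape[1] != in_channels
    ):
        raise ValueError(
            f"weight_init shape {tuple(weight_shape)} must be "
            f"({out_channels}, {in_channels})"
        )
    if len(bias_shape) != 1 or bias_shape[0] != out_channels:
        raise ValueError(
            f"bias_init shape {tuple(bias_shape)} must be ({out_channels},)"
        )


# Passes: weight is (out_channels, in_channels), bias is (out_channels,).
check_init_shapes((4, 3), (4,), in_channels=3, out_channels=4)
```

Passing a weight shaped (in_channels, out_channels), such as (3, 4) here, would raise ValueError under these rules.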

Supported Platforms:

Ascend GPU

shard(strategy_matmul, strategy_bias=None, strategy_activation=None, out_strategy_matmul=None)

Sets the shard strategies for the Linear layer. The size of each strategy should match the inputs it applies to.

Note:

This method is valid only in semi-auto-parallel or auto-parallel mode. In other parallel modes, strategies set here are ignored.

Args:

strategy_matmul (tuple): The strategy for the matmul. Should be the same shape as the inputs.

strategy_bias (tuple): The strategy for the bias_add. Should be the same shape as the inputs.

strategy_activation (tuple): The strategy for the activation. Should be the same shape as the inputs.

out_strategy_matmul (tuple): The out strategy for the matmul. Should be the same shape as the inputs.
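The requirement that a strategy has "the same shape as the inputs" can be checked before calling shard. The sketch below assumes the usual MindSpore convention that a strategy is a tuple holding one inner tuple per operator input, with one split factor per dimension; the helper `check_strategy` and the example shapes are hypothetical.

```python
def check_strategy(strategy, input_shapes):
    """Return True if the strategy matches the inputs.

    The strategy must have one entry per input, each entry must have one
    split factor per input dimension, and every dimension must be evenly
    divisible by its split factor.
    """
    if len(strategy) != len(input_shapes):
        return False
    for splits, shape in zip(strategy, input_shapes):
        if len(splits) != len(shape):
            return False
        if any(s < 1 or dim % s != 0 for s, dim in zip(splits, shape)):
            return False
    return True


x_shape = (32, 512)     # (batch, in_channels)
w_shape = (1024, 512)   # (out_channels, in_channels), assumed weight layout

# Split the batch 2 ways and the output channels 4 ways.
strategy_matmul = ((2, 1), (4, 1))
ok = check_strategy(strategy_matmul, [x_shape, w_shape])

# Only one inner tuple for two inputs: does not match.
bad = check_strategy(((2, 1),), [x_shape, w_shape])
print(ok, bad)  # True False
```

A strategy that passes this check could then be supplied as strategy_matmul; whether it is a good partitioning for a given device layout is a separate question that this sketch does not address.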