mindformers.modules.transformer.TransformerOpParallelConfig

class mindformers.modules.transformer.TransformerOpParallelConfig(data_parallel=1, model_parallel=1, expert_parallel=1, pipeline_stage=1, micro_batch_num=1, recompute=<mindformers.modules.transformer.transformer.TransformerRecomputeConfig object>, use_seq_parallel=False, optimizer_shard=False, gradient_aggregation_group=4, vocab_emb_dp=True)

Configuration class for the parallel settings of the transformer operators, such as the data parallel and model parallel degrees.

Note:

Apart from recompute, the arguments take effect only when the parallel mode in auto_parallel_context is set to SEMI_AUTO_PARALLEL or AUTO_PARALLEL. For training, micro_batch_num must be greater than or equal to pipeline_stage. The product data_parallel * model_parallel * pipeline_stage must be less than or equal to the number of devices. Values of pipeline_stage and optimizer_shard set here overwrite those in auto_parallel_context. For example, with 8 devices, data_parallel=1, and model_parallel=1, the same computation is repeated on every device.
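The constraints in the note above can be checked before a configuration is built. The following is a minimal sketch in plain Python; check_parallel_layout and its device_num and training parameters are hypothetical names introduced for illustration and are not part of mindformers.

```python
def check_parallel_layout(data_parallel, model_parallel, pipeline_stage,
                          micro_batch_num, device_num, training=True):
    """Return a list of violated constraints (empty if the layout is valid).

    Hypothetical helper, not part of mindformers: it only mirrors the
    rules stated in the note above.
    """
    problems = []

    # Every parallel degree is expected to be a positive integer.
    values = {
        "data_parallel": data_parallel,
        "model_parallel": model_parallel,
        "pipeline_stage": pipeline_stage,
        "micro_batch_num": micro_batch_num,
    }
    for name, value in values.items():
        if not isinstance(value, int) or value < 1:
            problems.append(f"{name} must be a positive int, got {value!r}")
    if problems:
        return problems

    # data_parallel * model_parallel * pipeline_stage <= number of devices.
    used = data_parallel * model_parallel * pipeline_stage
    if used > device_num:
        problems.append(
            f"layout needs {used} devices but only {device_num} are available"
        )

    # When training, micro_batch_num >= pipeline_stage.
    if training and micro_batch_num < pipeline_stage:
        problems.append(
            f"micro_batch_num ({micro_batch_num}) must be >= "
            f"pipeline_stage ({pipeline_stage}) when training"
        )
    return problems
```

Calling check_parallel_layout(2, 2, 2, 4, device_num=8) returns an empty list, while a layout such as check_parallel_layout(2, 4, 2, 1, device_num=8) reports both the device-count and the micro-batch violations.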

Args:

data_parallel (int): The degree of data parallelism. The input data of each layer is sliced into this many parts. Default: 1.
model_parallel (int): The degree of model parallelism. The parameters of the dense layers in the MultiheadAttention and FeedForward layers are sliced into this many parts. Default: 1.
expert_parallel (int): The degree of expert parallelism, that is, the number of partitions the experts are split into. Effective only when MoE (Mixture of Experts) is applied. Default: 1.
pipeline_stage (int): The number of pipeline stages. Must be a positive value. Default: 1.
micro_batch_num (int): The number of micro-batches used for pipeline training. Default: 1.
optimizer_shard (bool): Whether to enable optimizer sharding. Default: False.
gradient_aggregation_group (int): The fusion group size of the optimizer state sharding. Default: 4.
recompute (Union[TransformerRecomputeConfig, bool]): The recomputation configuration for the transformer block. Default: An instance of TransformerRecomputeConfig with default values.
vocab_emb_dp (bool): Whether to shard the embedding in data parallel (True) or in model parallel (False). Default: True.
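To make the interaction of data_parallel and micro_batch_num concrete, the arithmetic below sketches how a global batch might be divided. It assumes that the batch is first split evenly across the data parallel replicas and that each replica then splits its share into micro_batch_num micro-batches; per_device_micro_batch and global_batch_size are hypothetical names, not mindformers APIs.

```python
def per_device_micro_batch(global_batch_size, data_parallel, micro_batch_num):
    """Return the size of one micro-batch on one data parallel replica.

    Hypothetical helper for illustration only. Assumes even splits:
    first across data parallel replicas, then into micro-batches.
    """
    if global_batch_size % data_parallel != 0:
        raise ValueError("global_batch_size must be divisible by data_parallel")
    per_replica = global_batch_size // data_parallel

    if per_replica % micro_batch_num != 0:
        raise ValueError("per-replica batch must be divisible by micro_batch_num")
    return per_replica // micro_batch_num


# A global batch of 64 with data_parallel=4 gives 16 samples per replica;
# with micro_batch_num=8 each micro-batch then holds 2 samples.
size = per_device_micro_batch(64, data_parallel=4, micro_batch_num=8)
```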

Supported Platforms:

Ascend GPU

Examples:

>>> from mindformers.modules.transformer import TransformerOpParallelConfig, TransformerRecomputeConfig
>>> recompute_config = TransformerRecomputeConfig(recompute=True, parallel_optimizer_comm_recompute=True,
...                                               mp_comm_recompute=True, recompute_slice_activation=True)
>>> config = TransformerOpParallelConfig(data_parallel=1, model_parallel=1, recompute=recompute_config)