mindformers.modules.transformer.TransformerOpParallelConfig
- class mindformers.modules.transformer.TransformerOpParallelConfig(data_parallel=1, model_parallel=1, expert_parallel=1, pipeline_stage=1, micro_batch_num=1, recompute=<mindformers.modules.transformer.transformer.TransformerRecomputeConfig object>, use_seq_parallel=False, optimizer_shard=False, gradient_aggregation_group=4, vocab_emb_dp=True)
TransformerOpParallelConfig sets the parallel configuration of the transformer, such as data parallel and model parallel.
- Note:
Except for the recompute argument, the arguments take effect only when the user sets the parallel mode in auto_parallel_context to SEMI_AUTO_PARALLEL or AUTO_PARALLEL. During training, micro_batch_num must be greater than or equal to pipeline_stage. The product data_parallel * model_parallel * pipeline_stage must be less than or equal to the number of devices. When pipeline_stage and optimizer_shard are set, this config overwrites the corresponding settings in auto_parallel_context. For example, given 8 devices with data_parallel set to 1 and model_parallel set to 1, the calculation is repeated on each device.
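The constraints in the note can be checked before building the config. The following is a minimal sketch in plain Python, not part of mindformers: the helper name `check_parallel_layout` and its `device_num` and `training` parameters are hypothetical and exist only for illustration.

```python
def check_parallel_layout(device_num, data_parallel=1, model_parallel=1,
                          pipeline_stage=1, micro_batch_num=1, training=True):
    """Validate the constraints stated in the note (hypothetical helper).

    Returns data_parallel * model_parallel * pipeline_stage, the number of
    devices the layout requires. Raises ValueError if a constraint is violated.
    """
    values = {
        "device_num": device_num,
        "data_parallel": data_parallel,
        "model_parallel": model_parallel,
        "pipeline_stage": pipeline_stage,
        "micro_batch_num": micro_batch_num,
    }
    # Every degree must be a positive integer.
    for name, value in values.items():
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive int, got {value!r}")

    # micro_batch_num >= pipeline_stage is required when training.
    if training and micro_batch_num < pipeline_stage:
        raise ValueError(
            f"micro_batch_num ({micro_batch_num}) must be >= "
            f"pipeline_stage ({pipeline_stage}) when training"
        )

    # data_parallel * model_parallel * pipeline_stage <= number of devices.
    required = data_parallel * model_parallel * pipeline_stage
    if required > device_num:
        raise ValueError(
            f"layout needs {required} devices but only {device_num} are available"
        )
    return required


# 2 x 2 x 2 fills 8 devices exactly.
required = check_parallel_layout(8, data_parallel=2, model_parallel=2,
                                 pipeline_stage=2, micro_batch_num=4)
```

With 8 devices and all degrees left at 1, the helper returns 1, which corresponds to the case in the note where the calculation is repeated on each device.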
- Args:
- data_parallel (int): The data parallel way. The input data will be sliced into n parts for each layer according to the data parallel way. Default: 1.
- model_parallel (int): The model parallel way. The parameters of the dense layers in MultiheadAttention and the FeedForward layer will be sliced according to the model parallel way. Default: 1.
- expert_parallel (int): The expert parallel way. This is effective only when MoE (Mixture of Experts) is applied. This value specifies the number of partitions to split the experts into. Default: 1.
- pipeline_stage (int): The number of pipeline stages. Should be a positive value. Default: 1.
- micro_batch_num (int): The number of micro-batches used for pipeline training. Default: 1.
- optimizer_shard (bool): Whether to enable optimizer sharding. Default: False.
- gradient_aggregation_group (int): The fusion group size of the optimizer state sharding. Default: 4.
- recompute (Union[TransformerRecomputeConfig, bool]): The configuration of recomputation for the transformer block. Default: An instance of TransformerRecomputeConfig with default values.
- vocab_emb_dp (bool): Whether to shard the embedding in data parallel (True) or in model parallel (False). Default: True.
- Supported Platforms:
Ascend GPU
- Examples:
>>> from mindformers.modules.transformer import TransformerOpParallelConfig, TransformerRecomputeConfig
>>> recompute_config = TransformerRecomputeConfig(recompute=True, parallel_optimizer_comm_recompute=True,
...                                               mp_comm_recompute=True, recompute_slice_activation=True)
>>> config = TransformerOpParallelConfig(data_parallel=1, model_parallel=1, recompute=recompute_config)