mindformers.modules.transformer.TransformerOpParallelConfig

class mindformers.modules.transformer.TransformerOpParallelConfig(data_parallel=1, model_parallel=1, expert_parallel=1, pipeline_stage=1, micro_batch_num=1, recompute=<mindformers.modules.transformer.transformer.TransformerRecomputeConfig object>, optimizer_shard=False, gradient_aggregation_group=4, vocab_emb_dp=True)

Parallel configuration for the transformer, such as the data parallel and model parallel degrees.

Note

Except for the recompute argument, the arguments take effect only when the user sets the parallel mode in auto_parallel_context to SEMI_AUTO_PARALLEL or AUTO_PARALLEL. When training, micro_batch_num must be greater than or equal to pipeline_stage. The product data_parallel * model_parallel * pipeline_stage must be less than or equal to the number of devices. When pipeline_stage and optimizer_shard are set, the config overwrites the corresponding settings in auto_parallel_context. For example, given 8 devices with data_parallel set to 1 and model_parallel set to 1, the calculation is repeated on each device.
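The two numeric constraints in this note can be checked before a config is built. The following is a minimal sketch in plain Python. The helper `check_parallel_layout` and its `device_num` argument are hypothetical names introduced for illustration and are not part of mindformers.

```python
def check_parallel_layout(device_num, data_parallel=1, model_parallel=1,
                          pipeline_stage=1, micro_batch_num=1):
    """Return a list of violated constraints (empty if the layout is valid).

    Hypothetical helper: it only mirrors the rules stated in the note above
    and does not call into mindformers.
    """
    problems = []
    values = {
        "data_parallel": data_parallel,
        "model_parallel": model_parallel,
        "pipeline_stage": pipeline_stage,
        "micro_batch_num": micro_batch_num,
    }
    # Every degree must be a positive integer.
    for name, value in values.items():
        if not isinstance(value, int) or value < 1:
            problems.append(f"{name} must be a positive int, got {value!r}")
    if problems:
        return problems

    # data_parallel * model_parallel * pipeline_stage <= number of devices.
    used = data_parallel * model_parallel * pipeline_stage
    if used > device_num:
        problems.append(
            f"data_parallel * model_parallel * pipeline_stage = {used} "
            f"exceeds device_num = {device_num}"
        )

    # micro_batch_num >= pipeline_stage (required when training).
    if micro_batch_num < pipeline_stage:
        problems.append(
            f"micro_batch_num = {micro_batch_num} is smaller than "
            f"pipeline_stage = {pipeline_stage}"
        )
    return problems


# 2 * 2 * 2 = 8 devices used, 2 micro-batches for 2 stages: valid.
print(check_parallel_layout(8, data_parallel=2, model_parallel=2,
                            pipeline_stage=2, micro_batch_num=2))
```

An empty list means both constraints hold. Each string in a non-empty list names one violated rule.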

Parameters
  • data_parallel (int) – The data parallel degree. The input data is sliced into n parts for each layer, where n is this value. Default: 1.

  • model_parallel (int) – The model parallel degree. The parameters of the dense layers in MultiheadAttention and FeedForward are sliced according to this value. Default: 1.

  • expert_parallel (int) – The expert parallel degree. Effective only when MoE (Mixture of Experts) is applied. Specifies the number of partitions the experts are split into. Default: 1.

  • pipeline_stage (int) – The number of pipeline stages. Must be a positive value. Default: 1.

  • micro_batch_num (int) – The number of micro-batches for pipeline training. Default: 1.

  • optimizer_shard (bool) – Whether to enable optimizer sharding. Default: False.

  • gradient_aggregation_group (int) – The fusion group size of the optimizer state sharding. Default: 4.

  • recompute (Union[TransformerRecomputeConfig, bool]) – The configuration of recomputation for the transformer block. Default: An instance of TransformerRecomputeConfig with default values.

  • vocab_emb_dp (bool) – Whether to shard the embedding in data parallel (True) or in model parallel (False). Default: True.
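As a rough illustration of how data_parallel and micro_batch_num relate to batch sizes, the sketch below works through the arithmetic in plain Python. It assumes that the global batch is divided evenly across data parallel replicas and that each replica's batch is then divided evenly into micro-batches. The function `batch_split` and the `global_batch` argument are hypothetical and not part of mindformers.

```python
def batch_split(global_batch, data_parallel=1, micro_batch_num=1):
    """Return (per_replica_batch, per_micro_batch) under an even-split assumption.

    Hypothetical helper for illustration only.
    """
    if global_batch % data_parallel != 0:
        raise ValueError("global_batch must be divisible by data_parallel")
    # Each data parallel replica receives one slice of the global batch.
    per_replica = global_batch // data_parallel

    if per_replica % micro_batch_num != 0:
        raise ValueError("per-replica batch must be divisible by micro_batch_num")
    # Pipeline training feeds each replica's batch in micro_batch_num pieces.
    per_micro = per_replica // micro_batch_num
    return per_replica, per_micro


# A global batch of 32 with data_parallel=4 and micro_batch_num=2.
print(batch_split(32, data_parallel=4, micro_batch_num=2))  # (8, 4)
```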

Supported Platforms:

Ascend GPU

Examples

>>> from mindformers.modules.transformer import TransformerOpParallelConfig, TransformerRecomputeConfig
>>> # Enable recomputation, including the communication operators.
>>> recompute_config = TransformerRecomputeConfig(recompute=True, parallel_optimizer_comm_recompute=True,
...                                               mp_comm_recompute=True, recompute_slice_activation=True)
>>> # Pass the recompute settings into the parallel configuration.
>>> config = TransformerOpParallelConfig(data_parallel=1, model_parallel=1, recompute=recompute_config)