### GPT2-13B 分布式训练

#### GPT2 模型介绍

GPT-2由OpenAI于2019年发布。GPT-2模型是继承于GPT模型，GPT-2是一个非常庞大的语言模型，它主要是用于预测下一个单词。按照参数量的大小，GPT-2模型可分为small（124M）、medium（355M）、large（774M）、xlarge（1.5B）。

[论文](https://arxiv.org/abs/1810.04805)J Devlin，et al., Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019

GPT2套件代码更多细节请参考[文档](https://gitee.com/mindspore/mindformers/blob/dev/docs/model_cards/gpt2.md)。

#### GPT2 模型开发

GPT2模型代码路径： `mindformers/models/gpt2`

```bash
└── gpt2
    ├── __init__.py
    ├── convert_weight.py
    ├── gpt2.py
    ├── gpt2_config.py
    ├── gpt2_processor.py
    └── gpt2_tokenizer.py
```

- [convert_weight.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/gpt2/convert_weight.py)：权重转化脚本，将pytorch权重转化为mindspore权重；
- [gpt2.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/gpt2/gpt2.py)：gpt2模型架构代码，由词嵌入层、自注意力层等组成；
- [gpt2_config.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/gpt2/gpt2_config.py)：gpt2模型结构配置，如层数、自注意头数等；
- [gpt2_processor.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/gpt2/gpt2_processor.py)：pipeline时文本切词预处理脚本；
- [gpt2_tokenizer.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/models/gpt2/gpt2_tokenizer.py)：gpt2切词脚本。

GPT2配置文件路径: `configs/gpt2`

```bash
# 套件提供三种不同参数量的gpt配置
└── gpt2
    ├── run_gpt2.yaml
    ├── run_gpt2_13b.yaml
    └── run_gpt2_52b.yaml
```

- [run_gpt2_13b.yaml等](https://gitee.com/mindspore/mindformers/blob/dev/configs/gpt2/run_gpt2.yaml)：主配置文件，其中的配置如与以上相同，则以该文件中的配置为准。需要修改配置时，推荐采用在该文件中复写配置的方式。

```text
# 关键参数说明，以4机13B参数模型为例
context:
  mode: 0 # 0-静态图模式; 1-动态图模式
  device_target: "Ascend" # 运行的设备定义 支持 Ascend、CPU、GPU
  max_device_memory: "31GB" # 每张卡运行内存限制数量

parallel: # 并行环境配置
  parallel_mode: 1 # 0-数据并行, 1-半自动并行, 2-自动并行, 3-混合并行
  full_batch: True # 数据导入方式，打开即每设备获取的数据一致，关闭则每张卡采样数据不一致
  enable_parallel_optimizer: True # 开启优化器并行

recompute_config: # 重计算模块
  recompute: True # 打开重计算
  parallel_optimizer_comm_recompute: False # 是否开启优化器通信算子重计算
  mp_comm_recompute: True # 是否开启模型并行下通信算子的重计算
  recompute_slice_activation: True # 是否开启激活权重切片重计算

# 4机32卡
parallel_config: # 模型的并行配置
  data_parallel: 4 # 数据切分数量，batch维度切分
  model_parallel: 2 # 模型权重切分数量，权重维度上切分
  pipeline_stage: 4 # 流水并行切分stage数量，切分transformer layer层 <= 节点数
  optimizer_shard: True # 优化器并行打开
  micro_batch_num: 24 # micro batch number
  vocab_emb_dp: True # 词表是否使用数据并行
  gradient_aggregation_group: 4 # 梯度聚合组数，在组内融合通信算子，提升通信效率
```

#### GPT2-13B 模型大规模分布式训练

##### 多机多卡启动

- 首先在每台机器上运行`mindformers/tools/hccl_tools.py`生成RANK_TABLE_FILE的json文件；
- 将不同机器上生成的RANK_TABLE_FILE文件中的server_list合并，server_count设为机器数，rank_id顺序增加，并保证不同机器上的RANK_TABLE_FILE相同；
- 在多机上同时拉起任务，拉起方式为

```shell
cd scripts
bash run_distribute.sh RANK_TABLE_FILE CONFIG_PATH DEVICE_RANGE RUN_MODE RANK_SIZE
```

```text
# 参数说明
RANK_TABLE_FILE: 由mindformers/tools/hccl_tools.py生成的分布式json文件
CONFIG_PATH: 为configs文件夹下面的gpt2/run_gpt2*.yaml配置文件
DEVICE_RANGE: 为单机分布式卡的范围, 如[0,8]为8卡分布式，不包含8本身
RUN_MODE: 为任务运行状态，支持关键字 train 预训练、predict（文本生成预测）
RANK_SIZE: 总运行卡数
```

```json
# 4机32卡参考RANK_TABLE_FILE样例
{
  "version": "1.0",
  "server_count": "4",
  "server_list": [
    {
      "server_id": "10.155.111.140",
      "device": [
        {"device_id": "0","device_ip": "192.1.27.6","rank_id": "0"},
        {"device_id": "1","device_ip": "192.2.27.6","rank_id": "1"},
        {"device_id": "2","device_ip": "192.3.27.6","rank_id": "2"},
        {"device_id": "3","device_ip": "192.4.27.6","rank_id": "3"},
        {"device_id": "4","device_ip": "192.1.27.7","rank_id": "4"},
        {"device_id": "5","device_ip": "192.2.27.7","rank_id": "5"},
        {"device_id": "6","device_ip": "192.3.27.7","rank_id": "6"},
        {"device_id": "7","device_ip": "192.4.27.7","rank_id": "7"}],
      "host_nic_ip": "reserve"
    },
    {
      "server_id": "10.155.111.141",
      "device": [
        {"device_id": "0","device_ip": "192.1.27.8","rank_id": "8"},
        {"device_id": "1","device_ip": "192.2.27.8","rank_id": "9"},
        {"device_id": "2","device_ip": "192.3.27.8","rank_id": "10"},
        {"device_id": "3","device_ip": "192.4.27.8","rank_id": "11"},
        {"device_id": "4","device_ip": "192.1.27.9","rank_id": "12"},
        {"device_id": "5","device_ip": "192.2.27.9","rank_id": "13"},
        {"device_id": "6","device_ip": "192.3.27.9","rank_id": "14"},
        {"device_id": "7","device_ip": "192.4.27.9","rank_id": "15"}],
      "host_nic_ip": "reserve"
    },
    {
      "server_id": "10.155.111.142",
      "device": [
        {"device_id": "0","device_ip": "192.1.27.10","rank_id": "16"},
        {"device_id": "1","device_ip": "192.2.27.10","rank_id": "17"},
        {"device_id": "2","device_ip": "192.3.27.10","rank_id": "18"},
        {"device_id": "3","device_ip": "192.4.27.10","rank_id": "19"},
        {"device_id": "4","device_ip": "192.1.27.11","rank_id": "20"},
        {"device_id": "5","device_ip": "192.2.27.11","rank_id": "21"},
        {"device_id": "6","device_ip": "192.3.27.11","rank_id": "22"},
        {"device_id": "7","device_ip": "192.4.27.11","rank_id": "23"}],
      "host_nic_ip": "reserve"
    },
    {
      "server_id": "10.155.111.143",
      "device": [
        {"device_id": "0","device_ip": "192.1.27.12","rank_id": "24"},
        {"device_id": "1","device_ip": "192.2.27.12","rank_id": "25"},
        {"device_id": "2","device_ip": "192.3.27.12","rank_id": "26"},
        {"device_id": "3","device_ip": "192.4.27.12","rank_id": "27"},
        {"device_id": "4","device_ip": "192.1.27.13","rank_id": "28"},
        {"device_id": "5","device_ip": "192.2.27.13","rank_id": "29"},
        {"device_id": "6","device_ip": "192.3.27.13","rank_id": "30"},
        {"device_id": "7","device_ip": "192.4.27.13","rank_id": "31"}],
      "host_nic_ip": "reserve"
    }
  ],
  "status": "completed"
}
```

```shell
# 任务拉起命令示例
# 第一台机器
bash run_distribute.sh {RANK_TABLE_FILE path of the first device} ../configs/gpt2/run_gpt2_13b.yaml [0,8] train 32
# 第二台机器
bash run_distribute.sh {RANK_TABLE_FILE path of the second device} ../configs/gpt2/run_gpt2_13b.yaml [8,16] train 32
# 第三台机器
bash run_distribute.sh {RANK_TABLE_FILE path of the third device} ../configs/gpt2/run_gpt2_13b.yaml [16,24] train 32
# 第四台机器
bash run_distribute.sh {RANK_TABLE_FILE path of the forth device} ../configs/gpt2/run_gpt2_13b.yaml [24,32] train 32
```