Baichuan2

Model Description

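Baichuan2 is an open-source, commercially usable large-scale pretrained language model developed by Baichuan Intelligent Technology (百川智能). It is based on the Transformer architecture, supports both Chinese and English, and has a context window of 4096 tokens. Two models are currently supported, Baichuan2-7B and Baichuan2-13B, with 7 billion and 13 billion parameters respectively. This repository provides the Baichuan2-7B and Baichuan2-13B pretrained models.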

Code Structure

Baichuan2 is implemented on top of MindFormers. The main files are:

Model implementation: research/baichuan2

baichuan2
    ├── baichuan2_tokenizer.py       # tokenizer
    ├── baichuan2_7b.py              # 7B model implementation
    └── baichuan2_13b.py             # 13B model implementation

Model configuration: research/baichuan2

baichuan2
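    ├── run_baichuan2_7b.yaml             # 7B full-parameter finetuning launch config for 910A
    ├── run_baichuan2_13b.yaml            # 13B full-parameter finetuning launch config for 910A
    ├── run_baichuan2_7b_910b.yaml        # 7B full-parameter finetuning launch config for 910B
    └── run_baichuan2_13b_910b.yaml       # 13B full-parameter finetuning launch config for 910B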

Data preprocessing and task launch scripts: research/baichuan2

baichuan2
    ├── belle_preprocess.py            # preprocessing script for the belle dataset
    └── run_baichuan2.py               # Baichuan2 high-level API usage script

Environment Requirements

Hardware: Ascend 910A
MindSpore: 2.0.0 / 1.10.1
MindFormers version: dev

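Note: Baichuan2-7B inference can be deployed on a single card, while full-parameter finetuning requires at least 16 cards. On 910A, Baichuan2-13B inference requires at least 2 cards, and full-parameter finetuning requires at least 16 cards.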
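To see why the card counts grow so quickly, a back-of-the-envelope estimate from the parameter count helps. This sketch uses nominal parameter counts (7e9 and 13e9) and assumes 2 bytes per parameter for fp16/bf16 inference weights and roughly 16 bytes per parameter for mixed-precision Adam training; these are generic rules of thumb, not MindFormers measurements, and activations, KV cache and runtime overhead are ignored, so real requirements are higher.

```python
# Back-of-the-envelope memory estimate (assumptions, not MindFormers figures):
# - inference weights in fp16/bf16: 2 bytes per parameter;
# - mixed-precision Adam training: ~16 bytes per parameter
#   (fp16 weights 2 + fp16 grads 2 + fp32 master weights 4 + two fp32 moments 8).
# Activations, KV cache and runtime overhead are NOT included.


def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def per_card_gb(total_gb: float, num_cards: int) -> float:
    """Ideal even split across cards (ignores replication across data-parallel groups)."""
    return total_gb / num_cards


for name, params in [("Baichuan2-7B", 7e9), ("Baichuan2-13B", 13e9)]:
    infer = weight_gb(params, 2)
    train = weight_gb(params, 16)
    print(f"{name}: ~{infer:.0f} GB fp16 weights, ~{train:.0f} GB training state, "
          f"~{per_card_gb(train, 16):.1f} GB/card if evenly split over 16 cards")
```

Even this lower bound shows that full-parameter training state is several times larger than the inference weights, which is why finetuning has to be spread over many cards.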

Weight Preparation

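This repository provides pre-converted pretrained weights for training, finetuning and inference; download them from the links below and use them directly. Base weights are for finetuning, and Chat weights are for inference.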

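Alternatively, download the pretrained weights from Hugging Face and convert them with the steps below. The entire repository must be downloaded. The Hugging Face weight links are as follows: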

Note: install torch==2.0.0 and transformers==4.29.2.

pip install torch==2.0.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install transformers==4.29.2 -i https://pypi.tuna.tsinghua.edu.cn/simple

After downloading, run the conversion script research/baichuan/convert_weight.py to convert the Hugging Face weights into a complete ckpt file.

python ./research/baichuan/convert_weight.py --torch_ckpt_dir TORCH_CKPT_DIR --mindspore_ckpt_path MS_CKPT_NAME

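# Parameters
TORCH_CKPT_DIR: directory containing the Hugging Face weights
MS_CKPT_NAME: value of --mindspore_ckpt_path, the file name of the converted weights, saved as TORCH_CKPT_DIR/MS_CKPT_NAME; a custom save path can also be given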
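When the conversion is scripted, the documented command can be assembled as an argument list, which avoids shell-quoting problems with unusual paths. This is a minimal sketch: `build_convert_cmd` is a hypothetical helper, the example paths are placeholders, and only the two flags from the command above are used.

```python
import shlex


def build_convert_cmd(torch_ckpt_dir: str, ms_ckpt_name: str,
                      script: str = "./research/baichuan/convert_weight.py") -> list:
    """Return the documented convert_weight.py command as an argv list (hypothetical helper)."""
    return [
        "python", script,
        "--torch_ckpt_dir", torch_ckpt_dir,
        "--mindspore_ckpt_path", ms_ckpt_name,
    ]


cmd = build_convert_cmd("/data/Baichuan2-7B-Base", "/data/Baichuan2-7B-Base/baichuan2_7b.ckpt")
print(shlex.join(cmd))
# To run it from the mindformers root: subprocess.run(cmd, check=True)
```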

SFT Dataset Preparation

Preprocessing and finetuning examples are currently provided for the belle_chat_ramdon dataset, for finetuning the Baichuan2-7B-Base and Baichuan2-13B-Base models. Dataset download link:

belle_chat_ramdon_10k

Run belle_preprocess.py to preprocess the data and generate MindRecord files: the samples, wrapped in a prompt template, are converted to mindrecord format.

# Script path: research/baichuan2/belle_preprocess.py
python belle_preprocess.py \
--input_glob /{path}/belle_chat_ramdon_10k.json \
--model_file /{path}/tokenizer.model \
--output_file /{path}/belle_512.mindrecord \
--seq_length 512
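The key idea of SFT preprocessing is that the model sees both prompt and answer, but the loss is computed only on the answer. The sketch below illustrates this for one sample padded or truncated to seq_length. It is not the logic of belle_preprocess.py: the role-marker ids, the pad id and the ignore value -100 are hypothetical placeholders, and token ids are passed in directly instead of coming from tokenizer.model.

```python
# Hypothetical placeholder ids, NOT the values used by belle_preprocess.py.
USER_ID, ASSISTANT_ID, PAD_ID = 1001, 1002, 0
IGNORE_ID = -100  # labels with this value are assumed to be excluded from the loss


def build_sample(prompt_ids, answer_ids, eos_id, seq_length):
    """Return (input_ids, labels), both exactly seq_length long."""
    prompt = [USER_ID] + list(prompt_ids) + [ASSISTANT_ID]
    answer = list(answer_ids) + [eos_id]
    input_ids = (prompt + answer)[:seq_length]
    # Prompt tokens are masked so only the answer contributes to the loss.
    labels = ([IGNORE_ID] * len(prompt) + answer)[:seq_length]
    pad = seq_length - len(input_ids)
    input_ids += [PAD_ID] * pad
    labels += [IGNORE_ID] * pad  # padding never contributes to the loss
    return input_ids, labels


ids, labels = build_sample([10, 11], [20], eos_id=2, seq_length=8)
print(ids)     # [1001, 10, 11, 1002, 20, 2, 0, 0]
print(labels)  # [-100, -100, -100, -100, 20, 2, -100, -100]
```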

Baichuan2-7B

Quick Inference

Inference with the High-Level API

Configuration: add the tokenizer path vocab_file and set batch_size to 1.

When running inference through the Trainer API with Baichuan2-7B weights you downloaded yourself, set the path of tokenizer.model in the config file (option vocab_file) before launching.

# research/baichuan2/run_baichuan2_7b.yaml
# runner config
runner_config:
  epochs: 1
  batch_size: 1                 # set batch_size to 1
  sink_mode: True
  sink_size: 2
...
processor:
  return_tensors: ms
  tokenizer:
    unk_token: '<unk>'
    bos_token: '<s>'
    eos_token: '</s>'
    pad_token: '</s>'
    vocab_file: '/path/Baichuan2-7b/tokenizer.model'        # add the tokenizer path
    type: Baichuan2Tokenizer
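If configs are generated programmatically, the two edits above (batch_size set to 1 and vocab_file added) amount to setting two nested keys. A minimal sketch, assuming the yaml has already been loaded into a plain Python dict (for example with PyYAML's `yaml.safe_load`); `set_by_path` is a hypothetical helper, not a MindFormers API.

```python
# set_by_path is a hypothetical helper (not a MindFormers API). It assumes the
# yaml has already been parsed into nested dicts, e.g. with yaml.safe_load.
def set_by_path(cfg: dict, dotted_key: str, value) -> dict:
    """Set cfg["a"]["b"]["c"] = value for dotted_key "a.b.c", creating dicts as needed."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return cfg


cfg = {
    "runner_config": {"epochs": 1, "batch_size": 4, "sink_mode": True, "sink_size": 2},
    "processor": {"return_tensors": "ms", "tokenizer": {"type": "Baichuan2Tokenizer"}},
}
set_by_path(cfg, "runner_config.batch_size", 1)
set_by_path(cfg, "processor.tokenizer.vocab_file", "/path/Baichuan2-7b/tokenizer.model")
print(cfg["runner_config"]["batch_size"], cfg["processor"]["tokenizer"]["vocab_file"])
```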

Launching inference with the Trainer API

The high-level API usage for Baichuan2-7B is integrated into run_baichuan2.py. Example command:

python run_baichuan2.py \
--config "run_baichuan2_7b.yaml" \
--run_mode predict \
--use_parallel False \
--load_checkpoint ckpt_path_or_dir \
--predict_data '将以下内容翻译成英文：你今天真好看。' \
--device_id 0

# output: [{'text_generation_text': ['将以下内容翻译成英文：你今天真好看。 \nYou look really nice today.']}]
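The predict output echoes the prompt before the generated text. A small sketch for pulling out only the answer; `extract_answers` is a hypothetical helper, and the leading whitespace it strips (a " \n" in the sample above) is taken from the sample output and may differ for other prompts or settings.

```python
def extract_answers(outputs, prompt):
    """Return the generated continuations with the echoed prompt removed (hypothetical helper)."""
    answers = []
    for item in outputs:
        for text in item["text_generation_text"]:
            if text.startswith(prompt):
                text = text[len(prompt):]
            answers.append(text.strip())  # drop the whitespace between prompt and answer
    return answers


prompt = '将以下内容翻译成英文：你今天真好看。'
outputs = [{'text_generation_text': ['将以下内容翻译成英文：你今天真好看。 \nYou look really nice today.']}]
print(extract_answers(outputs, prompt))  # ['You look really nice today.']
```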

Pipeline Inference

from mindspore import context
from mindformers.pipeline import pipeline
from mindformers.models import LlamaConfig

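# run from research/baichuan2 (or add it to PYTHONPATH) so these local modules can be imported
from baichuan2_7b import Baichuan7BV2ForCausalLM
from baichuan2_tokenizer import Baichuan2Tokenizer

context.set_context(device_id=0, mode=0)  # mode=0 is GRAPH_MODE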
# init model
baichuan2_model_path = "/path/Baichuan2-7B/baichuan2_7b.ckpt" # Baichuan2-7B ckpt path
baichuan2_config = LlamaConfig(
    vocab_size=125696,
    pad_token_id=0,
    rms_norm_eps=1.0e-6,
    checkpoint_name_or_path=baichuan2_model_path,
    use_past=True
)
baichuan2_model = Baichuan7BV2ForCausalLM(
    config=baichuan2_config
)
# init tokenizer
tokenizer_path = "/path/Baichuan2-7B/tokenizer.model" # Baichuan2-7B tokenizer.model path
tokenizer = Baichuan2Tokenizer(
    vocab_file=tokenizer_path
)
pipeline_task = pipeline(task="text_generation", model=baichuan2_model, tokenizer=tokenizer)
pipeline_result = pipeline_task("诗词接龙：白日依山尽的下一句是什么？",
                                do_sample=False,
                                repetition_penalty=1.05,
                                max_length=64)

print(pipeline_result)

# output: [{'text_generation_text': ['诗词接龙：白日依山尽的下一句是什么？ \n答：黄河入海流。']}]
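The call above decodes greedily (do_sample=False) with repetition_penalty=1.05. The sketch below shows the widely used CTRL-style penalty, as implemented in Hugging Face transformers, on toy logits; it illustrates the idea only, and MindFormers' own implementation may differ in details.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Penalize tokens that already appear in generated_ids (CTRL-style)."""
    out = list(logits)
    for tok in set(generated_ids):
        # Positive logits are divided and negative ones multiplied, so a token
        # that was already generated always becomes less likely when penalty > 1.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def greedy_pick(logits):
    """do_sample=False: take the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


logits = [2.0, 2.05, -1.0]  # toy scores for a 3-token vocabulary
history = [1]               # token 1 was already generated
print(greedy_pick(logits))                                           # 1
print(greedy_pick(apply_repetition_penalty(logits, history, 1.05)))  # 0
```

Here 2.05 / 1.05 is about 1.952, which drops below token 0's score of 2.0, so the mild penalty is enough to change the greedy choice.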

Full-Parameter Finetuning

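Full-parameter finetuning must be launched on multiple cards. Using the belle_chat_ramdon_10k.json dataset as an example, the default configuration file run_baichuan2_7b.yaml is provided.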

Weight preparation

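Weights can be sharded online or offline. With online sharding, the weights are split automatically according to the distributed strategy once the finetuning task starts; with offline sharding, the weights must be split manually before the task.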

For online sharding, place the complete weight file under the path layout below and set the launch parameter auto_trans_ckpt to True.

    └── path of ckpt
        └── rank_0
            └── baichuan2_7b.ckpt

For offline sharding, set auto_trans_ckpt to False and pass the folder containing the weights to load_checkpoint.
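For online sharding, the `rank_0` layout above can be prepared with a few lines of Python. A minimal sketch; `stage_full_checkpoint` is a hypothetical helper and the paths are placeholders. load_checkpoint should then point at the root folder rather than at the .ckpt file itself.

```python
import shutil
from pathlib import Path


def stage_full_checkpoint(ckpt_file: str, ckpt_root: str) -> Path:
    """Copy a complete checkpoint to ckpt_root/rank_0/ (hypothetical helper)."""
    rank0 = Path(ckpt_root) / "rank_0"
    rank0.mkdir(parents=True, exist_ok=True)
    dest = rank0 / Path(ckpt_file).name
    shutil.copy2(ckpt_file, dest)  # a symlink would avoid duplicating a large file
    return dest


# Example (placeholder paths):
# stage_full_checkpoint("/data/baichuan2_7b.ckpt", "/data/ckpt_7b")
# then set load_checkpoint: '/data/ckpt_7b/' and auto_trans_ckpt: True
```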

Modify the relevant settings in run_baichuan2_7b.yaml:

output_dir: './output'
load_checkpoint: '{path}/'          # add the pretrained weight path
auto_trans_ckpt: True
only_save_strategy: False
resume_training: False
use_parallel: True
run_mode: 'finetune'
# dataset
train_dataset: &train_dataset
  data_loader:
    type: MindDataset
    dataset_dir: "{path}/belle_512.mindrecord"   # set the training dataset path
    shuffle: True
  input_columns: ["input_ids", "labels"]
# for instruction finetuning (e.g. the belle dataset): input_columns: ["input_ids", "labels"]
# for continued pretraining (e.g. the wikitext2 dataset): input_columns: ["input_ids"]

Launch the finetuning task. Taking the default 2-node, 16-card configuration as an example, follow these steps:

step 1. On each machine, run mindformers/tools/hccl_tools.py to generate the RANK_TABLE_FILE json file.

# Run the following command on each machine to generate its own RANK_TABLE_FILE json file.
python ./mindformers/tools/hccl_tools.py --device_num "[0,8)"

step 2. Merge the RANK_TABLE_FILEs generated on each machine.

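Copy the RANK_TABLE_FILEs generated on the different machines to one place and run merge_hccl.py to merge them: the server_list entries are combined, server_count is set to the number of machines, and rank_id values increase sequentially.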

# Run the following command to merge the RANK_TABLE_FILEs from all machines.
python ./mindformers/tools/merge_hccl.py hccl*.json
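The merge rules described in step 2 can be sketched in Python as follows. This is not merge_hccl.py itself: it assumes the rank table layout that hccl_tools.py writes (a `server_list` of servers, each with a `device` list whose fields are strings), and the order of the input files decides which node receives the lower rank ids.

```python
import copy
import glob
import json


def merge_rank_tables(tables):
    """Combine server_list entries, set server_count, renumber rank_id from 0 (sketch)."""
    merged = {"version": tables[0].get("version", "1.0") if tables else "1.0",
              "server_list": [],
              "status": "completed"}
    rank = 0
    for table in tables:
        for server in table["server_list"]:
            server = copy.deepcopy(server)  # leave the input tables untouched
            for device in server["device"]:
                device["rank_id"] = str(rank)
                rank += 1
            merged["server_list"].append(server)
    merged["server_count"] = str(len(merged["server_list"]))
    return merged


def load_tables(pattern="hccl*.json"):
    """Load per-node tables; sorted file order defines node order."""
    tables = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            tables.append(json.load(f))
    return tables
```

Usage would be `merge_rank_tables(load_tables())`, followed by `json.dump` to a file that is then copied to every node as in step 3.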

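step 3. Copy the merged RANK_TABLE_FILE to all machines so that every machine uses the same RANK_TABLE_FILE.
step 4. Adjust the configuration to match the number of server nodes and other cluster details.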

# Example: two-node training of baichuan2-7B. The default is 2 nodes with 16 cards; adjust these settings if the number of nodes changes.
# Config file path: ./research/baichuan2/run_baichuan2_7b.yaml
parallel_config:
  data_parallel: 2
  model_parallel: 4
  pipeline_stage: 2
  optimizer_shard: True
  micro_batch_num: 8
  vocab_emb_dp: True
  gradient_aggregation_group: 4
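When changing the number of nodes, the parallel settings must still multiply out to the card count (here 2 × 4 × 2 = 16). The sketch below checks that and reports the implied global batch size. It assumes the relation global batch = batch_size × data_parallel × micro_batch_num × micro_batch_interleave_num, which is the usual MindFormers convention for pipeline training; verify it for your version. `check_parallel` is a hypothetical helper.

```python
def check_parallel(cfg: dict, device_num: int, batch_size: int = 1,
                   micro_batch_interleave_num: int = 1) -> int:
    """Validate cfg against device_num and return the implied global batch size."""
    dp = cfg["data_parallel"]
    mp = cfg["model_parallel"]
    pp = cfg["pipeline_stage"]
    if dp * mp * pp != device_num:
        raise ValueError(f"data_parallel*model_parallel*pipeline_stage = {dp * mp * pp}, "
                         f"but {device_num} devices are used")
    micro = cfg["micro_batch_num"] if pp > 1 else 1  # micro batches only matter with pipelining
    if micro < pp:
        raise ValueError("micro_batch_num should be >= pipeline_stage")
    return batch_size * dp * micro * micro_batch_interleave_num


cfg_7b = {"data_parallel": 2, "model_parallel": 4, "pipeline_stage": 2, "micro_batch_num": 8}
print(check_parallel(cfg_7b, device_num=16))  # 16
```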

step 5. Run the launch script.

Launch the task on all machines at the same time; on each machine, run:

# node 1
cd mindformers/research
bash run_multinode.sh \
"python baichuan2/run_baichuan2.py \
--config baichuan2/run_baichuan2_7b.yaml \
--load_checkpoint path/to/baichuan2_7b_ckpt \
--auto_trans_ckpt True \
--use_parallel True \
--run_mode finetune \
--train_data path/to/mindrecord_dir" \
path/to/rank_table_file [0,8] 16

# node 2
cd mindformers/research
bash run_multinode.sh \
"python baichuan2/run_baichuan2.py \
--config baichuan2/run_baichuan2_7b.yaml \
--load_checkpoint path/to/baichuan2_7b_ckpt \
--auto_trans_ckpt True \
--use_parallel True \
--run_mode finetune \
--train_data path/to/mindrecord_dir" \
path/to/rank_table_file [8,16] 16

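# Parameters
config: path to the configuration file
load_checkpoint: path to the weight folder
auto_trans_ckpt: whether to shard the weights automatically
use_parallel: whether to run in distributed (parallel) mode
run_mode: run mode; set to finetune for finetuning
train_data: path to the training dataset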
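The last two arguments of each run_multinode.sh call follow a simple pattern: node 1 gets `[0,8] 16` and node 2 gets `[8,16] 16`. A sketch of that pattern for generating per-node commands, assuming 8 cards per node, that node k covers global ranks [8k, 8k+8), and that each node uses local device ids 0-7; both helpers are hypothetical.

```python
def node_launch_args(node_index: int, cards_per_node: int = 8, num_nodes: int = 2):
    """Return (device-range string, total rank count) for run_multinode.sh."""
    start = node_index * cards_per_node
    end = start + cards_per_node
    return f"[{start},{end}]", cards_per_node * num_nodes


def local_device_id(global_rank: int, cards_per_node: int = 8) -> int:
    """Map a global rank to the device id on its own node (assumed mapping)."""
    return global_rank % cards_per_node


for node in range(2):
    device_range, world_size = node_launch_args(node)
    print(f"node {node + 1}: ... path/to/rank_table_file {device_range} {world_size}")
```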

Baichuan2-13B

MindFormers must be installed with pip beforehand. For API details, refer to the API documentation.

Obtain the complete baichuan2_13B weights as described in the Weight Preparation section.

Quick Start

The high-level API usage for Baichuan2-13B is integrated into run_baichuan2.py.

910A

Note 1: Baichuan2-13B-Chat is used for inference. seq_length defaults to 512; inference requires 2 cards, and single-card inference is not supported.

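Note 2: Because Baichuan2-13B is developed on the high-level API and lives in the research folder, MindFormers must be installed as a Python package before the commands can be run directly from the research directory.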

Note 3: run_baichuan2_13b.yaml currently defaults to a training configuration; for eval and predict, the parallel strategy must be modified.

Single-node multi-card inference: 2 cards as an example

Main parameter settings for reference:

load_checkpoint: 'model_dir' # the complete model is stored as "model_dir/rank_0/xxx.ckpt"
auto_trans_ckpt: True # enable automatic weight conversion
use_past: True # enable incremental inference
vocab_file: 'path/to/tokenizer.model'

# distributed configuration
parallel_config:
  data_parallel: 1
  model_parallel: 2
  pipeline_stage: 1
  optimizer_shard: True
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1

Generate the rank_table_file for 2 cards

python mindformers/tools/hccl_tools.py --device_num "[0,2)"

Launch inference

cd research
# parameters given on the command line override the same parameters in the yaml file
bash ./run_singlenode.sh \
"python baichuan2/run_baichuan2.py \
--config baichuan2/run_baichuan2_13b.yaml \
--run_mode predict \
--use_parallel True \
--load_checkpoint model_dir \
--auto_trans_ckpt True \
--predict_data 你是谁？" rank_table_file [0,2] 2

# output: [{'text_generation_text': ['你是谁？ \n我是百川大模型，是由百川智能的工程师们创造的大语言模型，我可以和人类进行自然交流、解答问题、协助创作，帮助大众轻松、普惠的获得世界知识和专业服务。如果你有任何问题，可以随时向我提问']}]
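The rule that command-line parameters override the yaml can be pictured as follows: only flags that are actually passed replace yaml values, and everything else keeps its configured value. This is a sketch of the documented behaviour, not the code of run_baichuan2.py; `merge_cli_into_config` is a hypothetical helper and only a few of the flags used above are declared.

```python
import argparse


def str_to_bool(text: str) -> bool:
    """Interpret 'True'/'true' as True and anything else as False."""
    return text.lower() == "true"


def merge_cli_into_config(config: dict, argv: list) -> dict:
    """Return a copy of config where flags present in argv replace yaml values.

    Unknown flags (e.g. --predict_data) are ignored in this sketch.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--run_mode")
    parser.add_argument("--load_checkpoint")
    parser.add_argument("--auto_trans_ckpt", type=str_to_bool)
    parser.add_argument("--use_parallel", type=str_to_bool)
    args, _unknown = parser.parse_known_args(argv)
    merged = dict(config)
    for key, value in vars(args).items():
        if value is not None:  # flag was passed on the command line
            merged[key] = value
    return merged


yaml_cfg = {"run_mode": "train", "load_checkpoint": "", "auto_trans_ckpt": False, "use_parallel": False}
print(merge_cli_into_config(yaml_cfg, ["--run_mode", "predict", "--use_parallel", "True"]))
```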

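Note: after inference finishes, save output/transformed_checkpoint to a folder of your choice. Later distributed inference runs can load the 2-card distributed weights in transformed_checkpoint directly, with the following configuration changes: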

load_checkpoint: 'transformed_checkpoint' # the sharded model is stored as "transformed_checkpoint/rank_x/xxx.ckpt"
auto_trans_ckpt: False # disable automatic weight conversion
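Before switching to auto_trans_ckpt: False, it can be worth checking that the saved folder really contains one shard per rank. A minimal sketch, assuming the `rank_x/xxx.ckpt` layout described above; expected_ranks should equal the number of cards used for inference (2 in this example), and `check_sharded_dir` is a hypothetical helper.

```python
from pathlib import Path


def check_sharded_dir(root: str, expected_ranks: int) -> list:
    """List problems with a root/rank_x/*.ckpt layout; an empty list means it looks complete."""
    problems = []
    for rank in range(expected_ranks):
        rank_dir = Path(root) / f"rank_{rank}"
        if not rank_dir.is_dir():
            problems.append(f"missing directory {rank_dir}")
        elif not any(rank_dir.glob("*.ckpt")):
            problems.append(f"no .ckpt file in {rank_dir}")
    return problems


if __name__ == "__main__":
    for issue in check_sharded_dir("transformed_checkpoint", expected_ranks=2):
        print(issue)
```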

Multi-node multi-card finetuning: 2 nodes as an example

Baichuan2-13B-Base is used for finetuning. seq_length defaults to 512, and distributed training requires 2 nodes.

Main parameter settings for reference:

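load_checkpoint: 'model_dir' # the complete model is stored as "model_dir/rank_0/xxx.ckpt"
auto_trans_ckpt: True # enable automatic weight conversion
input_columns: ["inputs", "labels"] # for pretraining, change to ["inputs"]
learning_rate: 2.e-5 # 3.e-4 is recommended for pretraining
lr_end: 1.e-6 # 3.e-5 is recommended for pretraining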
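learning_rate and lr_end are the start and end points of the learning-rate schedule. The sketch below assumes a cosine decay with optional linear warm-up; the actual schedule type and warm-up settings come from the lr_schedule section of the yaml, so treat this as an illustration of how the two values interact rather than MindFormers' implementation.

```python
import math


def cosine_lr(step: int, total_steps: int, lr: float = 2e-5, lr_end: float = 1e-6,
              warmup_steps: int = 0) -> float:
    """Learning rate at `step`: linear warm-up, then cosine decay from lr to lr_end."""
    if warmup_steps and step < warmup_steps:
        return lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at lr_end after the last step
    return lr_end + 0.5 * (lr - lr_end) * (1.0 + math.cos(math.pi * progress))


for s in (0, 250, 500, 750, 1000):
    print(s, f"{cosine_lr(s, 1000):.2e}")
```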

# distributed configuration
parallel_config:
  data_parallel: 1
  model_parallel: 8
  pipeline_stage: 2
  optimizer_shard: True
  micro_batch_num: 16
  vocab_emb_dp: True
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1

Generate the merged rank_table_file

# step1: generate a rank_table_file on each node
python ./mindformers/tools/hccl_tools.py --device_num "[0,8)"
# step2: copy the rank_table_files of all nodes to one node and merge them
python ./mindformers/tools/merge_hccl.py hccl*.json
# step3: copy the merged rank_table_file to all nodes

Launch distributed training

# node 1
cd mindformers/research
bash run_multinode.sh \
"python baichuan2/run_baichuan2.py \
--config baichuan2/run_baichuan2_13b.yaml \
--load_checkpoint path/to/baichuan2_13b_ckpt \
--run_mode train \
--use_parallel True \
--train_data path/to/mindrecord_dir" \
path/to/rank_table_file [0,8] 16
# node 2
cd mindformers/research
bash run_multinode.sh \
"python baichuan2/run_baichuan2.py \
--config baichuan2/run_baichuan2_13b.yaml \
--load_checkpoint path/to/baichuan2_13b_ckpt \
--run_mode train \
--use_parallel True \
--train_data path/to/mindrecord_dir" \
path/to/rank_table_file [8,16] 16

910B

Note 1: Baichuan2-13B-Chat is used for inference. seq_length defaults to 512, and single-card inference is supported.

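Note 2: Because Baichuan2-13B is developed on the high-level API and lives in the research folder, MindFormers must be installed as a Python package before the commands can be run directly from the research directory.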

Note 3: run_baichuan2_13b.yaml currently defaults to a training configuration; for eval and predict, the parallel strategy must be modified.

Single-node single-card inference

Main parameter settings for reference:

auto_trans_ckpt: False # disable automatic weight conversion
use_past: True # enable incremental inference
vocab_file: 'path/to/tokenizer.model'

# distributed configuration
parallel_config:
  data_parallel: 1
  model_parallel: 1
  pipeline_stage: 1
  optimizer_shard: True
  micro_batch_num: 1
  vocab_emb_dp: True
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1

Launch inference

cd research
# parameters given on the command line override the same parameters in the yaml file
python baichuan2/run_baichuan2.py \
--config baichuan2/run_baichuan2_13b.yaml \
--run_mode predict \
--use_parallel False \
--load_checkpoint path/to/baichuan2-13b-chat.ckpt \
--predict_data 你是谁？

# output: [{'text_generation_text': ['你是谁？ \n我是百川大模型，是由百川智能的工程师们创造的大语言模型，我可以和人类进行自然交流、解答问题、协助创作，帮助大众轻松、普惠的获得世界知识和专业服务。如果你有任何问题，可以随时向我提问']}]

Single-node multi-card finetuning: 8 cards as an example

Main parameter settings for reference:

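load_checkpoint: 'model_dir' # the complete model is stored as "model_dir/rank_0/xxx.ckpt"
auto_trans_ckpt: True # enable automatic weight conversion
input_columns: ["inputs", "labels"] # for pretraining, change to ["inputs"]
learning_rate: 2.e-5 # 3.e-4 is recommended for pretraining
lr_end: 1.e-6 # 3.e-5 is recommended for pretraining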

# distributed configuration
parallel_config:
  data_parallel: 1
  model_parallel: 2
  pipeline_stage: 4
  optimizer_shard: True
  micro_batch_num: 16
  vocab_emb_dp: True
  gradient_aggregation_group: 4
# when model parallel is greater than 1, we can set micro_batch_interleave_num=2, that may accelerate the train process.
micro_batch_interleave_num: 1

Generate the rank_table_file for 8 cards

python mindformers/tools/hccl_tools.py --device_num "[0,8)"

Launch distributed training

cd mindformers/research
bash run_singlenode.sh \
"python baichuan2/run_baichuan2.py \
--config baichuan2/run_baichuan2_13b.yaml \
--load_checkpoint model_dir \
--auto_trans_ckpt True \
--run_mode train \
--use_parallel True \
--train_data path/to/mindrecord_dir" \
path/to/rank_table_file [0,8] 8