# T5

## Model Description

T5 (`Text-to-Text Transfer Transformer`) is a Transformer model that Google trained on the C4 dataset in 2019.

[Paper](https://arxiv.org/abs/1910.10683): C Raffel, N Shazeer, A Roberts, K Lee, S Narang, M Matena, Y Zhou, W Li, PJ Liu, 2020

## Dataset Preparation

Dataset used: [WMT16](https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz)

The extracted files are laid out as follows:

```bash
└── wmt_en_ro
    ├── test.source
    ├── test.target
    ├── train.source
    ├── train.target
    ├── val.source
    └── val.target
```

## Quick Start

### Launching with Scripts

> Clone the repository first.

- See [Launching with scripts](https://gitee.com/mindspore/transformer/blob/master/README.md#%E6%96%B9%E5%BC%8F%E4%B8%80clone-%E5%B7%A5%E7%A8%8B%E4%BB%A3%E7%A0%81)

The following example command trains a T5 model with only one layer:

```shell
python run_mindformer.py --config configs/t5/run_t5_tiny_on_wmt16.yaml --run_mode train \
                         --device_target Ascend \
                         --train_dataset_dir /your_path/wmt_en_ro
```

Set `device_target` to `GPU`, `Ascend`, or `CPU` depending on the device you run on. The `config` argument can also be `configs/t5/run_t5_small.yaml`; with that configuration, the `t5_small` weights are loaded and fine-tuning starts.

### Launching via the API

> Install the package with pip first. For interface details, see the [API reference](https://gitee.com/mindspore/transformer/wikis/API/).

#### Model Interface

- Computing the loss

```python
from mindformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained('t5_small')
tokenizer = T5Tokenizer.from_pretrained('t5_small')

# Tokenize the source sentence, padded to the encoder sequence length
src_output = tokenizer(["hello world"], padding='max_length', max_length=model.config.seq_length,
                       return_tensors='ms')

# Tokenize the target sentence, padded to the decoder sequence length
model_input = tokenizer(["So happy to see you!"], padding='max_length', max_length=model.config.max_decode_length,
                        return_tensors='ms')["input_ids"]
input_ids = src_output['input_ids']
attention_mask = src_output['attention_mask']
output = model(input_ids, attention_mask, model_input)
print(output)
# [5.64458]
```

- Inference

Running the following code automatically downloads the `t5_small` model from the cloud and runs inference.

```python
from mindformers import T5ForConditionalGeneration, T5Tokenizer

t5 = T5ForConditionalGeneration.from_pretrained("t5_small")
tokenizer = T5Tokenizer.from_pretrained("t5_small")
words = tokenizer("translate the English to the Romanian: UN Chief Says There Is No Military "
                  "Solution in Syria")['input_ids']
output = t5.generate(words, do_sample=False)
output = tokenizer.decode(output, skip_special_tokens=True)
print(output)
# "eful ONU declară că nu există o soluţie militară în Siri"
```

- Training and prediction with the Trainer interface:

```python
from mindformers.trainer import Trainer

# Initialize the translation task
trainer = Trainer(task='translation', model='t5_small', train_dataset="your data file path")

# Option 1: train, then run inference with the trained weights
trainer.train()
res = trainer.predict(predict_checkpoint=True, input_data="translate the English to Romanian: a good boy!")
print(res)
# [{'translation_text': ['un băiat bun!']}]

# Option 2: download the trained weights from OBS and run inference
res = trainer.predict(input_data="translate the English to Romanian: a good boy!")
print(res)
# [{'translation_text': ['un băiat bun!']}]
```

- Quick inference with the pipeline interface

```python
from mindformers.pipeline import pipeline

pipeline_task = pipeline("translation", model='t5_small')
pipeline_result = pipeline_task("translate the English to Romanian: a good boy!", top_k=3)
print(pipeline_result)
# [{'translation_text': ['un băiat bun!']}]
```

## Model Weights

The `t5_small` weights in this repository come from HuggingFace's [`t5_small`](https://huggingface.co/t5-small) and were obtained with the following steps:

1. Download the HuggingFace `t5_small` weights from the link above. The file is named `pytorch_model.bin`.

2. Run the conversion script to produce the converted output file `mindspore_t5.ckpt`.

```shell
python mindformers/models/t5/convert_weight.py --layers 6 --torch_path pytorch_model.bin --mindspore_path ./mindspore_t5.ckpt
```