# T5

## 模型描述

T5:全名`Text-to-Text Transfer Transformer`模型是谷歌在2019年基于C4数据集训练的Transformer模型。

[论文](https://arxiv.org/abs/1910.10683)C Raffel，N Shazeer，A Roberts，K Lee，S Narang，M Matena，Y Zhou，W Li，PJ Liu, 2020

## 数据集准备

使用的数据集：[WMT16](https://cdn-datasets.huggingface.co/translation/wmt_en_ro.tar.gz)

对应的文件路径如下：

```bash
└── wmt_en_ro
    ├── test.source
    ├── test.target
    ├── train.source
    ├── train.target
    ├── val.source
    └── val.target
```

## 快速使用

### 脚本启动

> 需开发者提前clone工程。

- 请参考[使用脚本启动](https://gitee.com/mindspore/transformer/blob/master/README.md#%E6%96%B9%E5%BC%8F%E4%B8%80clone-%E5%B7%A5%E7%A8%8B%E4%BB%A3%E7%A0%81)

示例命令如下，将会执行一个只有1层的T5模型训练

```shell
python run_mindformer.py --config configs/t5/run_t5_tiny_on_wmt16.yaml --run_mode train  \
                         --device_target Ascend \
                         --train_dataset_dir /your_path/wmt_en_ro
```

其中`device_target`根据用户的运行设备不同，可选`GPU/Ascend/CPU`。`config`的入参还可以为`configs/t5/run_t5_small.yaml`，在
这个配置下将会加载`t5_small`的权重并且开始执行微调。

### 调用API启动

> 需开发者提前pip安装。具体接口说明请参考[API接口](https://gitee.com/mindspore/transformer/wikis/API/)

#### Model调用接口

- 模型计算Loss

```python
from mindformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained('t5_small')
tokenizer = T5Tokenizer.from_pretrained('t5_small')

src_output = tokenizer(["hello world"], padding='max_length', max_length=model.config.seq_length,
                       return_tensors='ms')

model_input = tokenizer(["So happy to see you!"], padding='max_length', max_length=model.config.max_decode_length,
                        return_tensors='ms')["input_ids"]
input_ids = src_output['input_ids']
attention_mask = src_output['attention_mask']
output = model(input_ids, attention_mask, model_input)
print(output)
# [5.64458]
```

- 推理

执行下述的命令，可以自动云上拉取`t5_small`模型并且进行推理。

```python
from mindformers import T5ForConditionalGeneration, T5Tokenizer

t5 = T5ForConditionalGeneration.from_pretrained("t5_small")
tokenizer = T5Tokenizer.from_pretrained("t5_small")
words = tokenizer("translate the English to the Romanian: UN Chief Says There Is No Military "
                  "Solution in Syria")['input_ids']
output = t5.generate(words, do_sample=False)
output = tokenizer.decode(output, skip_special_tokens=True)
print(output)
# "eful ONU declară că nu există o soluţie militară în Siri"
```

- Trainer接口开启训练/预测：

```python
from mindformers.trainer import Trainer
# 初始化预训练任务
trainer = Trainer(task='translation', model='t5_small', train_dataset="your data file path")

# 方式1: 开启训练，并使用训练好的权重进行推理
trainer.train()
res = trainer.predict(predict_checkpoint=True, input_data="translate the English to Romanian: a good boy!")
print(res)
#[{'translation_text': ['un băiat bun!']}]

# 方式2： 从obs下载训练好的权重并进行推理
res = trainer.predict(input_data="translate the English to Romanian: a good boy!")
print(res)
#[{'translation_text': ['un băiat bun!']}]
```

- pipeline接口开启快速推理

```python
from mindformers.pipeline import pipeline
pipeline_task = pipeline("translation", model='t5_small')
pipeline_result = pipeline_task("translate the English to Romanian: a good boy!", top_k=3)
print(pipeline_result)
#[{'translation_text': ['un băiat bun!']}]
```

## 模型权重

本仓库中的`t5_small`来自于HuggingFace的[`t5_small`](https://huggingface.co/t5-small), 基于下述的步骤获取：

1. 从上述的链接中下载`t5_small`的HuggingFace权重，文件名为`pytorch_model.bin`

2. 执行转换脚本，得到转换后的输出文件`mindspore_t5.ckpt`

```shell
python mindformers/models/t5/convert_weight.py --layers 6 --torch_path pytorch_model.bin --mindspore_path ./mindspore_t5.ckpt
```