mindformers.models.t5.T5Tokenizer

class mindformers.models.t5.T5Tokenizer(vocab_file: str, eos_token: str = '</s>', unk_token: str = '<unk>', pad_token: str = '<pad>', extra_ids: int = 100, **kwargs)

Tokenize the input string and convert it into token ids. The tokenizer uses SentencePiece internally.

Parameters

vocab_file (str) – Path to the SentencePiece model file (spiece.model).
eos_token (str) – The token that represents the end of a sentence. Default: “</s>”.
unk_token (str) – The token that represents an unknown token. Default: “<unk>”.
pad_token (str) – The token used for padding. Default: “<pad>”.
**kwargs – Other keyword arguments that are passed to the tokenizer base class.

Examples

>>> from mindformers import T5Tokenizer
>>> tokenizer = T5Tokenizer.from_pretrained("t5_small")
>>> res = tokenizer("hello world")
>>> print(res)
{'input_ids': [21820, 296, 1], 'attention_mask': [1, 1, 1]}
>>> res = tokenizer("hello world", padding='max_length', max_length=10)
>>> print(res)
{'input_ids': [21820, 296, 1, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}
>>> res = tokenizer("hello world", add_special_tokens=False)
>>> print(res)
{'input_ids': [21820, 296], 'attention_mask': [1, 1]}
>>> res = tokenizer("hello world", return_tensors='ms')
>>> print(res)
{'input_ids': Tensor(shape=[3], dtype=Int32, value= [21820,   296,     1]),
'attention_mask': Tensor(shape=[3], dtype=Int32, value= [1, 1, 1])}
>>> res = tokenizer(["hello world", "today is a good day"],
...                 max_length=7, padding='max_length', return_tensors='ms')
>>> # For this two-sentence batch, 'input_ids' and 'attention_mask' each hold
>>> # one row per sentence, padded to max_length=7.
>>> print(res)

Outputs: A dict containing the processed ids and the attention_mask, as specified by the MODEL_INPUT_NAME member of the subclass.
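The relationship between input_ids and attention_mask under padding='max_length' can be illustrated with a minimal pure-Python sketch. It does not call MindFormers; pad_to_max_length is a hypothetical helper, and the pad id of 0 is an assumption read off the padded example output above.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: '<pad>' maps to id 0, as the padded example output suggests


def pad_to_max_length(input_ids: List[int], max_length: int,
                      pad_id: int = PAD_ID) -> Dict[str, List[int]]:
    """Hypothetical helper: right-pad the ids and build the matching attention mask."""
    if len(input_ids) > max_length:
        # Truncation is deliberately not sketched here.
        raise ValueError("input is longer than max_length")
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks a padding position.
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


# Ids for "hello world" plus EOS, taken from the examples above.
print(pad_to_max_length([21820, 296, 1], max_length=10))
```

The mask lets a model ignore padded positions, which is why both entries always have the same length.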

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None): Add the EOS token to token_ids_0.
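As a rough illustration of what appending the EOS token means, here is a pure-Python sketch. add_eos is a hypothetical stand-in, not the library method; the EOS id of 1 is an assumption taken from the example outputs, and the handling of token_ids_1 is an assumption, since the description here only mentions token_ids_0.

```python
from typing import List, Optional

EOS_ID = 1  # assumption: '</s>' maps to id 1, as in the example outputs


def add_eos(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None,
            eos_id: int = EOS_ID) -> List[int]:
    """Hypothetical stand-in: terminate each sequence with the EOS id."""
    first = token_ids_0 + [eos_id]
    if token_ids_1 is None:
        return first
    # Assumption: a second sequence is concatenated and also terminated with EOS.
    return first + token_ids_1 + [eos_id]


# "hello world" without special tokens is [21820, 296] in the examples.
print(add_eos([21820, 296]))
```

This is consistent with the examples, where add_special_tokens=False yields [21820, 296] and the default call yields the same ids followed by 1.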

save_vocabulary(save_directory, filename_prefix): Write the vocabulary to files.

tokenize(text): Tokenize the input text.

property vocab_size: Return the vocabulary size of the tokenizer.