mindformers.models.t5.T5Tokenizer

class mindformers.models.t5.T5Tokenizer(vocab_file: str, eos_token: str = '</s>', unk_token: str = '<unk>', pad_token: str = '<pad>', extra_ids: int = 100, **kwargs)[source]

Tokenize the input string and convert it into token ids. The tokenizer uses SentencePiece internally.

Parameters
  • vocab_file (str) – The spiece.model file path.

  • eos_token (str) – The token that represents the end-of-sentence. Default “</s>”.

  • unk_token (str) – The token that represents an unknown token. Default “<unk>”.

  • pad_token (str) – The token used for padding. Default “<pad>”.

  • extra_ids (int) – The number of extra sentinel ids added to the vocabulary. Default 100.

  • **kwargs – Additional keyword arguments passed to the tokenizer base class.
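
To illustrate what extra_ids controls, the sketch below builds the sentinel token names that T5 tokenizers conventionally append to the vocabulary. The <extra_id_N> naming and the helper make_sentinel_tokens are assumptions taken from the common T5 convention; this page does not confirm them for this class.

```python
def make_sentinel_tokens(extra_ids: int = 100) -> list:
    """Hypothetical helper: build T5-style sentinel token names."""
    # Assumption: sentinels are named <extra_id_0> ... <extra_id_{extra_ids - 1}>.
    return [f"<extra_id_{i}>" for i in range(extra_ids)]


sentinels = make_sentinel_tokens(100)
print(len(sentinels), sentinels[0], sentinels[-1])
```

With the default extra_ids=100, this would yield 100 names, from <extra_id_0> to <extra_id_99>.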

Examples

>>> from mindformers import T5Tokenizer
>>> tokenizer = T5Tokenizer.from_pretrained("t5_small")
>>> res = tokenizer("hello world")
>>> print(res)
{'input_ids': [21820, 296, 1], 'attention_mask': [1, 1, 1]}
>>> res = tokenizer("hello world", padding='max_length', max_length=10)
>>> print(res)
{'input_ids': [21820, 296, 1, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}
>>> res = tokenizer("hello world", add_special_tokens=False)
>>> print(res)
{'input_ids': [21820, 296], 'attention_mask': [1, 1]}
>>> res = tokenizer("hello world", return_tensors='ms')
>>> print(res)
{'input_ids': Tensor(shape=[3], dtype=Int32, value= [21820,   296,     1]),
'attention_mask': Tensor(shape=[3], dtype=Int32, value= [1, 1, 1])}
>>> res = tokenizer(["hello world", "today is a good day"],
...                 max_length=7, padding='max_length', return_tensors='ms')
>>> print(res)
{'input_ids': Tensor(shape=[2, 7], dtype=Int32, value=
[[21820,   296,     1,     0,     0,     0,     0],
 [  469,    19,     3,     9,   207,   239,     1]]),
'attention_mask': Tensor(shape=[2, 7], dtype=Int32, value=
[[1, 1, 1, 0, 0, 0, 0],
 [1, 1, 1, 1, 1, 1, 1]])}
Outputs:

A dict containing the processed ids and attention_mask, as specified by the MODEL_INPUT_NAME member of the subclass.
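
The padding behavior in the examples above can be reproduced with a few lines of plain Python. This is a minimal sketch, not the library's implementation: pad_to_max_length is a hypothetical helper, pad_id=0 is inferred from the zeros in the example output, and truncation of longer inputs is not handled.

```python
def pad_to_max_length(input_ids, max_length, pad_id=0):
    """Hypothetical helper: right-pad ids and build the matching attention mask."""
    # Number of padding positions needed (none if the input is already long enough).
    n_pad = max(0, max_length - len(input_ids))
    return {
        # Real tokens first, then pad ids (assumed to be 0, as in the examples).
        "input_ids": list(input_ids) + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks a padded position.
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


res = pad_to_max_length([21820, 296, 1], max_length=10)
print(res)
```

The attention mask lets downstream models ignore the padded positions, which is why it is returned alongside input_ids.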

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Append the end-of-sentence (EOS) token id to token_ids_0.
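
A minimal sketch of this step, assuming the EOS id is 1 (as in the example outputs, where "hello world" becomes [21820, 296, 1]). The function name build_inputs_with_eos is hypothetical, and the handling of token_ids_1 follows the common T5 convention rather than anything stated on this page.

```python
def build_inputs_with_eos(token_ids_0, token_ids_1=None, eos_id=1):
    """Hypothetical stand-in: append EOS to each sequence and concatenate."""
    # Append the EOS id to the first sequence.
    output = list(token_ids_0) + [eos_id]
    if token_ids_1 is not None:
        # Assumption: a second sequence also gets its own trailing EOS.
        output += list(token_ids_1) + [eos_id]
    return output


print(build_inputs_with_eos([21820, 296]))
```

This matches the difference between the add_special_tokens=False example and the default call shown earlier.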

save_vocabulary(save_directory, filename_prefix)[source]

Write the vocabulary to files in save_directory.

tokenize(text)[source]

Tokenize the input text.

property vocab_size

Return the vocabulary size of the tokenizer.