mindformers.models.t5.T5Tokenizer
class mindformers.models.t5.T5Tokenizer(vocab_file: str, eos_token: str = '</s>', unk_token: str = '<unk>', pad_token: str = '<pad>', extra_ids: int = 100, **kwargs)
Tokenize the input string and convert it into token ids. The tokenizer uses SentencePiece internally.
- Parameters
vocab_file (str) – The spiece.model file path.
eos_token (str) – The token that represents the end-of-sentence. Default “</s>”.
unk_token (str) – The token that represents the unknown. Default “<unk>”.
pad_token (str) – The token that represents the pad. Default “<pad>”.
**kwargs – Other kwargs that will be passed into the base class of the Tokenizer.
Examples
>>> from mindformers import T5Tokenizer
>>> tokenizer = T5Tokenizer.from_pretrained("t5_small")
>>> res = tokenizer("hello world")
>>> print(res)
{'input_ids': [21820, 296, 1], 'attention_mask': [1, 1, 1]}
>>> res = tokenizer("hello world", padding='max_length', max_length=10)
>>> print(res)
{'input_ids': [21820, 296, 1, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]}
>>> res = tokenizer("hello world", add_special_tokens=False)
>>> print(res)
{'input_ids': [21820, 296], 'attention_mask': [1, 1]}
>>> res = tokenizer("hello world", return_tensors='ms')
>>> print(res)
{'input_ids': Tensor(shape=[3], dtype=Int32, value= [21820, 296, 1]), 'attention_mask': Tensor(shape=[3], dtype=Int32, value= [1, 1, 1])}
>>> # With a list of strings, the result holds one row per input, each padded to max_length.
>>> res = tokenizer(["hello world", "today is a good day"],
...                 max_length=7, padding='max_length', return_tensors='ms')
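The padding behavior shown in these examples can be illustrated without the library. The following is a minimal pure-Python sketch, not the MindFormers implementation: the helper name `pad_to_max_length` is hypothetical, and the ids 1 (end-of-sentence) and 0 (pad) are assumptions inferred from the printed outputs.

```python
from typing import Dict, List

EOS_ID = 1  # assumption: inferred from the trailing 1 in the printed input_ids
PAD_ID = 0  # assumption: inferred from the padded positions in the printed input_ids


def pad_to_max_length(token_ids: List[int], max_length: int,
                      add_special_tokens: bool = True) -> Dict[str, List[int]]:
    """Hypothetical helper mimicking padding='max_length' for a single sequence."""
    ids = list(token_ids)
    if add_special_tokens:
        ids.append(EOS_ID)  # append the end-of-sentence id
    if len(ids) > max_length:
        raise ValueError("sequence is longer than max_length; this sketch does not truncate")
    num_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * num_pad,
        # 1 marks real tokens (including end-of-sentence), 0 marks padding
        "attention_mask": [1] * len(ids) + [0] * num_pad,
    }


# 21820 and 296 are the ids printed in the examples for "hello world".
encoded = pad_to_max_length([21820, 296], max_length=10)
print(encoded)
```

Setting `add_special_tokens=False` in this sketch skips the end-of-sentence id, which mirrors the third example.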
- Outputs:
A dict containing the processed ids and attention_mask, with keys as specified by the MODEL_INPUT_NAME member of the subclass.
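To make the shape of this dict concrete, the sketch below takes a result copied from the padded example and uses attention_mask to recover the unpadded ids. The helper name `strip_padding` is hypothetical, and the code assumes the key names `input_ids` and `attention_mask` that appear in the printed examples.

```python
from typing import Dict, List


def strip_padding(encoded: Dict[str, List[int]]) -> List[int]:
    """Hypothetical helper: keep only ids whose attention_mask entry is 1."""
    return [
        tok
        for tok, keep in zip(encoded["input_ids"], encoded["attention_mask"])
        if keep == 1
    ]


# Result dict copied from the padding='max_length', max_length=10 example.
res = {
    "input_ids": [21820, 296, 1, 0, 0, 0, 0, 0, 0, 0],
    "attention_mask": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
}

real_ids = strip_padding(res)             # ids of real tokens only
real_length = sum(res["attention_mask"])  # number of non-padding positions
print(real_ids, real_length)
```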
property vocab_size
Return the vocabulary size of the tokenizer.
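One common use of vocab_size is a sanity check that token ids fall inside the vocabulary before they reach an embedding layer. The following is a minimal sketch under stated assumptions: the helper name `all_ids_in_vocab` is hypothetical, and the size used here is a placeholder, not a value confirmed by this page. In practice the value would be read from `tokenizer.vocab_size`.

```python
from typing import Iterable


def all_ids_in_vocab(token_ids: Iterable[int], vocab_size: int) -> bool:
    """Hypothetical helper: True if every id lies in the range [0, vocab_size)."""
    return all(0 <= tok < vocab_size for tok in token_ids)


# Placeholder for illustration only; read the real value from tokenizer.vocab_size.
assumed_vocab_size = 32100

in_range = all_ids_in_vocab([21820, 296, 1], assumed_vocab_size)
out_of_range = all_ids_in_vocab([21820, 40000], assumed_vocab_size)
print(in_range, out_of_range)
```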