mindformers.models.llama.LlamaTokenizer

class mindformers.models.llama.LlamaTokenizer(vocab_file, extra_ids=100, add_bos=True, eos_token='</s>', bos_token='<s>', unk_token='<unk>', pad_token='<pad>', **kwargs)

Tokenize the input string and convert the tokens into ids. The tokenizer uses SentencePiece internally. By default, the Llama tokenizer adds the bos token at the beginning of the token sequence and the eos token at the end.

Parameters
  • vocab_file (str) – The spiece.model file path.

  • add_bos (bool) – Whether to add the bos token. Default True.

  • eos_token (str) – The token that represents the end-of-sentence. Default “</s>”.

  • unk_token (str) – The token that represents an unknown token. Default “<unk>”.

  • pad_token (str) – The token used for padding. Default “<pad>”.

  • **kwargs – Other kwargs that will be passed into the base class of the Tokenizer.

Examples

>>> from mindformers import LlamaTokenizer
>>> tokenizer = LlamaTokenizer.from_pretrained(name_or_path="/path/tokenizer_dir")
>>> res = tokenizer("hello world")
>>> print(res)
{'input_ids': [1, 22172, 3186, 2]}
>>> res = tokenizer("hello world", padding='max_length', max_length=10)
>>> print(res)
{'input_ids': [1, 22172, 3186, 2, 0, 0, 0, 0, 0, 0]}
>>> res = tokenizer("hello world", return_tensors='ms')
>>> print(res)
{'input_ids': Tensor(shape=[4], dtype=Int32, value= [1, 22172, 3186, 2])}
Outputs:

A dict containing the processed ids and attention_mask, with keys specified by the MODEL_INPUT_NAME member of the subclass.
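
The padded output in the examples can be reproduced with a small standalone sketch. This is not the MindFormers implementation: pad_to_max_length is a hypothetical helper, the pad id 0 and right-side padding are inferred from the example output above, and the attention_mask convention (1 for real tokens, 0 for padding) is an assumption, since the examples print only input_ids.

```python
def pad_to_max_length(input_ids, max_length, pad_id=0):
    """Right-pad input_ids with pad_id up to max_length and build a matching mask.

    Hypothetical helper for illustration only; pad_id=0 is inferred from the
    example output, and the 1/0 mask convention is an assumption.
    """
    if len(input_ids) > max_length:
        # Truncation is out of scope for this sketch.
        raise ValueError("input is longer than max_length")
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": list(input_ids) + [pad_id] * n_pad,
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }


padded = pad_to_max_length([1, 22172, 3186, 2], max_length=10)
print(padded["input_ids"])
print(padded["attention_mask"])
```

The mask lets a model ignore the padded positions, which is presumably why it is listed among the outputs.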

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)

Add the eos token to token_ids_0.
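
A minimal sketch of this kind of special-token wrapping, under stated assumptions: build_inputs_sketch is a hypothetical stand-in, the ids 1 (bos) and 2 (eos) are inferred from the example output [1, 22172, 3186, 2], and the handling of token_ids_1 is a guess, because the source does not say how a second sequence is treated.

```python
def build_inputs_sketch(token_ids_0, token_ids_1=None,
                        bos_id=1, eos_id=2, add_bos=True):
    """Wrap token ids with bos/eos ids (illustrative, not the library code).

    bos_id and eos_id are assumptions inferred from the documented example.
    """
    prefix = [bos_id] if add_bos else []
    output = prefix + list(token_ids_0) + [eos_id]
    if token_ids_1 is not None:
        # Assumption: the second sequence is wrapped the same way as the first.
        output += prefix + list(token_ids_1) + [eos_id]
    return output


single = build_inputs_sketch([22172, 3186])
print(single)
```

With add_bos=False only the eos id is appended, which matches the one-line description of this method.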

save_vocabulary(save_directory, filename_prefix)

Write the vocabulary to files.

tokenize(text)

Tokenize the input text.

property vocab_size

Return the vocabulary size of the tokenizer.