mindformers.models.llama.LlamaTokenizer

class mindformers.models.llama.LlamaTokenizer(vocab_file, unk_token='<unk>', bos_token='<s>', eos_token='</s>', pad_token='<pad>', sp_model_kwargs: Optional[Dict[str, Any]] = None, add_bos_token=True, add_eos_token=False, clean_up_tokenization_spaces=False, **kwargs)

Tokenizes the input string and converts the tokens into ids. The tokenizer uses SentencePiece internally. By default, the LLaMA tokenizer adds the bos token at the beginning of the token sequence; whether the eos token is added at the end is controlled by add_eos_token.

Args:

vocab_file (str): The path to the SentencePiece model file (spiece.model).
unk_token (str): The token that represents an unknown token. Default "<unk>".
bos_token (str): The token that represents the beginning of a sentence. Default "<s>".
eos_token (str): The token that represents the end of a sentence. Default "</s>".
pad_token (str): The token used for padding. Default "<pad>".
sp_model_kwargs (Dict[str, Any], optional): Other keyword arguments for the SentencePiece model. Default None.
add_bos_token (bool): Whether or not to add the bos_token_id to the left of the input. Default True.
add_eos_token (bool): Whether or not to add the eos_token_id to the right of the input. Default False.
clean_up_tokenization_spaces (bool): Whether or not to clean up the spaces that were added when splitting the input text during tokenization. Default False.
**kwargs: Other keyword arguments that are passed to the base class of the tokenizer.

Examples:

>>> from mindformers import LlamaTokenizer
>>> tokenizer = LlamaTokenizer.from_pretrained("llama_7b")
>>> res = tokenizer("hello world")
>>> print(res)
{'input_ids': [1, 22172, 3186, 2]}
>>> res = tokenizer("hello world", padding='max_length', max_length=10)
>>> print(res)
{'input_ids': [1, 22172, 3186, 2, 0, 0, 0, 0, 0, 0]}
>>> res = tokenizer("hello world", return_tensors='ms')
>>> print(res)
{'input_ids': Tensor(shape=[4], dtype=Int32, value= [1, 22172, 3186, 2])}

Outputs:

A dict containing the processed ids and attention_mask, as specified by the MODEL_INPUT_NAME member of the subclass.
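The shape of such a dict can be illustrated without loading a model. The following is a minimal sketch and not the library's implementation: `pad_to_max_length` is a hypothetical helper, the pad id of 0 is taken from the padded example above, and the attention_mask convention (1 for real tokens, 0 for padding) is an assumption.

```python
from typing import Dict, List


def pad_to_max_length(
    input_ids: List[int], max_length: int, pad_id: int = 0
) -> Dict[str, List[int]]:
    """Hypothetical helper: right-pad ids and build a matching attention mask."""
    if len(input_ids) > max_length:
        raise ValueError("input is longer than max_length; truncate it first")
    num_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * num_pad,
        # Assumed convention: 1 marks a real token, 0 marks padding.
        "attention_mask": [1] * len(input_ids) + [0] * num_pad,
    }


encoded = pad_to_max_length([1, 22172, 3186, 2], max_length=10)
print(encoded["input_ids"])       # [1, 22172, 3186, 2, 0, 0, 0, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
```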

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None): Insert the special tokens into the input_ids. Currently
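How special tokens might be inserted can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `build_inputs_sketch` is a hypothetical function, the ids 1 and 2 for bos and eos are read off the example output above, and wrapping each sequence of a pair in its own bos/eos is an assumed convention.

```python
from typing import List, Optional

BOS_ID = 1  # assumed id of '<s>', as seen in the example output
EOS_ID = 2  # assumed id of '</s>', as seen in the example output


def build_inputs_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    add_bos_token: bool = True,
    add_eos_token: bool = False,
) -> List[int]:
    """Hypothetical sketch: wrap each sequence with the enabled special tokens."""
    bos = [BOS_ID] if add_bos_token else []
    eos = [EOS_ID] if add_eos_token else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        # Assumption: the second sequence gets its own bos/eos.
        output = output + bos + token_ids_1 + eos
    return output


print(build_inputs_sketch([22172, 3186]))                        # [1, 22172, 3186]
print(build_inputs_sketch([22172, 3186], add_eos_token=True))    # [1, 22172, 3186, 2]
print(build_inputs_sketch([22172], [3186], add_eos_token=True))  # [1, 22172, 2, 1, 3186, 2]
```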

convert_tokens_to_string(tokens): Converts a sequence of tokens (strings) into a single string.
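For intuition, SentencePiece pieces mark word boundaries with the "▁" character (U+2581), so joining pieces into text amounts to concatenating them and turning that marker back into a space. The sketch below shows only this idea; `pieces_to_string` is a hypothetical helper, and the real method may handle special tokens and byte-fallback pieces differently.

```python
from typing import List

SPIECE_UNDERLINE = "\u2581"  # '▁', the SentencePiece word-boundary marker


def pieces_to_string(tokens: List[str]) -> str:
    """Hypothetical sketch: join pieces and restore spaces."""
    text = "".join(tokens).replace(SPIECE_UNDERLINE, " ")
    # The first piece usually starts with the marker; drop that leading space.
    return text.lstrip(" ")


print(pieces_to_string(["\u2581hello", "\u2581wor", "ld"]))  # hello world
```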

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int]

Creates a mask from the two sequences passed, to be used in a sequence-pair classification task. The sequence-pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, only the first portion of the mask (0s) is returned.

Args:

token_ids_0 (List[int]): List of IDs.
token_ids_1 (List[int], optional): Optional second list of IDs for sequence pairs.

Returns:

List[int]: List of token type IDs according to the given sequence(s).
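The mask layout described above can be reproduced with a few lines of Python. This is a hedged sketch rather than the library's implementation: `token_type_ids_sketch` is a hypothetical function, and counting the bos/eos positions toward the segment they wrap is an assumption.

```python
from typing import List, Optional


def token_type_ids_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    add_bos_token: bool = True,
    add_eos_token: bool = False,
) -> List[int]:
    """Hypothetical sketch: 0s for the first sequence, 1s for the second."""
    # Assumption: every enabled special token occupies one slot per sequence.
    num_special = int(add_bos_token) + int(add_eos_token)
    mask = [0] * (len(token_ids_0) + num_special)
    if token_ids_1 is not None:
        mask += [1] * (len(token_ids_1) + num_special)
    return mask


print(token_type_ids_sketch([22172, 3186]))          # [0, 0, 0]
print(token_type_ids_sketch([22172], [3186, 3186]))  # [0, 0, 1, 1, 1]
```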

get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) → List[int]

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

Args:

token_ids_0 (List[int]): List of IDs.
token_ids_1 (List[int], optional): Optional second list of IDs for sequence pairs.
already_has_special_tokens (bool, optional, defaults to False): Whether or not the token list is already formatted with special tokens for the model.

Returns:

List[int]: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
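A small sketch may help show what the returned mask looks like in both modes. It is not the library's code: `special_tokens_mask_sketch` is a hypothetical function, the special ids {1, 2} are read off the example output above, and the bos/eos placement follows the assumed convention of wrapping each sequence separately.

```python
from typing import FrozenSet, List, Optional


def special_tokens_mask_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    already_has_special_tokens: bool = False,
    add_bos_token: bool = True,
    add_eos_token: bool = False,
    special_ids: FrozenSet[int] = frozenset({1, 2}),  # assumed bos/eos ids
) -> List[int]:
    """Hypothetical sketch: 1 marks a special token, 0 a sequence token."""
    if already_has_special_tokens:
        # The input is already formatted, so just flag the special ids.
        return [1 if tid in special_ids else 0 for tid in token_ids_0]
    bos = [1] if add_bos_token else []
    eos = [1] if add_eos_token else []
    mask = bos + [0] * len(token_ids_0) + eos
    if token_ids_1 is not None:
        mask += bos + [0] * len(token_ids_1) + eos
    return mask


print(special_tokens_mask_sketch([22172, 3186]))  # [1, 0, 0]
print(special_tokens_mask_sketch([1, 22172, 3186, 2], already_has_special_tokens=True))  # [1, 0, 0, 1]
```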

get_vocab(): Returns the vocabulary as a dict.

save_vocabulary(save_directory, filename_prefix=None)

Save the vocabulary and special tokens file to a directory.

Args:

save_directory (str): The directory in which to save the vocabulary.

Returns:

Tuple(str): Paths to the files saved.
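As a rough illustration of the save step, the sketch below copies a model file into a target directory and returns the resulting path as a tuple. Everything here is an assumption made for illustration: `save_vocabulary_sketch` is a hypothetical function, and both the output file name "tokenizer.model" and the "prefix-" naming scheme are placeholders, not names confirmed by the library.

```python
import os
import shutil
import tempfile
from typing import Optional, Tuple


def save_vocabulary_sketch(
    vocab_file: str, save_directory: str, filename_prefix: Optional[str] = None
) -> Tuple[str]:
    """Hypothetical sketch: copy the model file and return the saved path."""
    if not os.path.isdir(save_directory):
        raise NotADirectoryError(f"{save_directory} is not a directory")
    prefix = f"{filename_prefix}-" if filename_prefix else ""
    # "tokenizer.model" is a placeholder name chosen for this sketch.
    out_path = os.path.join(save_directory, prefix + "tokenizer.model")
    if os.path.abspath(vocab_file) != os.path.abspath(out_path):
        shutil.copyfile(vocab_file, out_path)
    return (out_path,)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "source.model")
    with open(src, "wb") as f:
        f.write(b"dummy model bytes")  # stand-in for a real SentencePiece file
    out_dir = os.path.join(tmp, "saved")
    os.makedirs(out_dir)
    saved_paths = save_vocabulary_sketch(src, out_dir, filename_prefix="demo")
    saved_name = os.path.basename(saved_paths[0])
    saved_ok = os.path.isfile(saved_paths[0])

print(saved_name)  # demo-tokenizer.model
```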

property vocab_size: Returns the vocabulary size.