mindformers.models.bloom.BloomTokenizer

class mindformers.models.bloom.BloomTokenizer(vocab_file, unk_token='<unk>', bos_token='<s>', eos_token='</s>', pad_token='<pad>', add_prefix_space=False, add_bos_token=False, add_eos_token=False, **kwargs)[source]

Tokenize the input string and convert it into token ids. The tokenizer uses SentencePiece internally.

Args:

vocab_file (str): The vocabulary file path.
unk_token (str): The token that represents an unknown token. Default: '<unk>'.
bos_token (str): The token that represents the beginning of a sentence. Default: '<s>'.
eos_token (str): The token that represents the end of a sentence. Default: '</s>'.
pad_token (str): The token used for padding. Default: '<pad>'.
add_prefix_space (bool): Whether to add a whitespace in front of the text. Default: False.
add_bos_token (bool): Whether to add the bos_token_id to the left of the input. Default: False.
add_eos_token (bool): Whether to add the eos_token_id to the right of the input. Default: False.
**kwargs: Other keyword arguments that are passed to the base class of the tokenizer.

Examples:

>>> from mindformers import BloomTokenizer
>>> tokenizer = BloomTokenizer.from_pretrained("bloom_560m")
>>> res = tokenizer("Hello world", add_special_tokens=False)
>>> print(res)
{'input_ids': [59414, 8876], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]}

Outputs:

A dict containing the processed ids and the attention_mask, with keys as specified by the member MODEL_INPUT_NAME of the subclass.
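For orientation, the shape of the returned dict for a single, unpadded sequence can be mimicked in plain Python. The helper `make_encoding` below is hypothetical and not part of mindformers; it only mirrors the keys visible in the example output and assumes one sequence with no padding.

```python
from typing import Dict, List


def make_encoding(input_ids: List[int]) -> Dict[str, List[int]]:
    """Hypothetical helper: mirror the keys of the tokenizer's output dict."""
    return {
        "input_ids": list(input_ids),
        # Single sequence, so every position belongs to segment 0.
        "token_type_ids": [0] * len(input_ids),
        # No padding, so every position is a real token.
        "attention_mask": [1] * len(input_ids),
    }


# The ids are copied from the example output in this page.
encoding = make_encoding([59414, 8876])
print(encoding)
```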

bpe(token)[source]: Apply byte-pair encoding (BPE) to a token.
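The one-line description does not say how the encoding proceeds. As a rough illustration of byte-pair encoding in general, and not the mindformers implementation, the sketch below repeatedly merges the adjacent pair of symbols with the best (lowest) rank. The function `bpe_sketch` and the merge table `toy_ranks` are made up for this example.

```python
from typing import Dict, List, Tuple


def bpe_sketch(token: str, ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Illustrative BPE: repeatedly merge the best-ranked adjacent pair."""
    symbols = list(token)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the lowest rank (the earliest learned merge).
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Toy merge table (invented for this example, not Bloom's real merges).
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
pieces = bpe_sketch("lower", toy_ranks)
print(pieces)
```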

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]: Insert the special tokens into the input_ids. Currently
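A minimal sketch of what inserting special tokens could look like, based only on the descriptions of add_bos_token and add_eos_token in the class arguments. The function name, the placeholder ids 1 and 2, and the handling of the second sequence are assumptions for illustration and may differ from the actual method.

```python
from typing import List, Optional


def build_inputs_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    bos_token_id: int = 1,  # placeholder id, assumed for this example
    eos_token_id: int = 2,  # placeholder id, assumed for this example
    add_bos_token: bool = False,
    add_eos_token: bool = False,
) -> List[int]:
    """Illustrative only: add bos on the left and eos on the right if enabled."""
    bos = [bos_token_id] if add_bos_token else []
    eos = [eos_token_id] if add_eos_token else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        # Assumption: the second sequence is wrapped the same way as the first.
        output = output + bos + token_ids_1 + eos
    return output


print(build_inputs_sketch([10, 11], add_bos_token=True, add_eos_token=True))
```

With both flags left at False, the sketch returns the ids unchanged, which matches the defaults in the class signature.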

convert_tokens_to_string(tokens)[source]: Convert a sequence of tokens into a single string.

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Creates a mask from the two sequences passed, to be used in a sequence-pair classification task. The sequence pair mask has the following format:

    0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
    | first sequence    | second sequence |

If token_ids_1 is None, only the first portion of the mask (0s) is returned.

Args:

token_ids_0 (List[int]): List of ids.
token_ids_1 (List[int], optional): Optional second list of ids for sequence pairs.

Returns:

List[int]: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
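The mask format described above can be sketched in a few lines. This is a simplified illustration, not the library code: `token_type_ids_sketch` is a hypothetical name, and the sketch ignores any special tokens the real method may account for.

```python
from typing import List, Optional


def token_type_ids_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Illustrative only: 0 for the first sequence, 1 for the second."""
    first = [0] * len(token_ids_0)
    if token_ids_1 is None:
        # Only the first portion of the mask is returned.
        return first
    return first + [1] * len(token_ids_1)


print(token_type_ids_sketch([5, 6, 7], [8, 9]))
```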

prepare_for_tokenization(text, **kwargs)[source]: Prepare the text for tokenization, adding a whitespace in front of the text if requested.
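As a small sketch of the behavior described here, assuming the whitespace is controlled by the add_prefix_space argument of the class and is always prepended when that flag is set (the real method may apply further rules):

```python
def add_prefix_space_sketch(text: str, add_prefix_space: bool = False) -> str:
    """Illustrative only: optionally prepend a single space to the text."""
    if add_prefix_space:
        return " " + text
    return text


print(repr(add_prefix_space_sketch("Hello world", add_prefix_space=True)))
```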

save_vocabulary(save_directory, filename_prefix=None)[source]: Write the vocabulary to files in save_directory.

property vocab_size: Get the vocabulary size.