mindformers.models.bloom.BloomTokenizer

class mindformers.models.bloom.BloomTokenizer(vocab_file, unk_token='<|unk|>', bos_token='<|s|>', eos_token='<|/s|>', pad_token='<|pad|>', add_prefix_space=False, **kwargs)[source]

Tokenizes the input string and converts it into token ids. The tokenizer uses SentencePiece internally.

Parameters

vocab_file (str) – The vocabulary file path.
unk_token (str) – The token that represents an unknown token. Default “<|unk|>”.
bos_token (str) – The token that represents the beginning of a sentence. Default “<|s|>”.
eos_token (str) – The token that represents the end of a sentence. Default “<|/s|>”.
pad_token (str) – The token used for padding. Default “<|pad|>”.
add_prefix_space (bool) – Whether to add a whitespace in front of the text. Default False.
**kwargs – Other kwargs that will be passed into the base class of the Tokenizer.

Examples

>>> from mindformers import BloomTokenizer
>>> tokenizer = BloomTokenizer.from_pretrained("bloom_560m")
>>> res = tokenizer("Hello world", add_special_tokens=False)
>>> print(res)
{'input_ids': [59414, 8876], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]}

Outputs: A dict containing the processed ids and attention_mask, as specified by the MODEL_INPUT_NAME member of the subclass.
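The returned dict can be consumed directly when batching. The following is a minimal sketch in plain Python (it does not import mindformers), assuming the dict layout printed in the example above. The helper `pad_batch` and its placeholder `pad_id=0` are hypothetical and not part of the library; in practice the id assigned to pad_token should be used.

```python
def pad_batch(encodings, pad_id=0):
    """Right-pad a list of tokenizer outputs to a common length.

    `pad_batch` and `pad_id=0` are hypothetical; use the id that the
    tokenizer assigns to its pad_token in real code.
    """
    max_len = max(len(e["input_ids"]) for e in encodings)
    batch = {"input_ids": [], "attention_mask": []}
    for e in encodings:
        n_pad = max_len - len(e["input_ids"])
        batch["input_ids"].append(e["input_ids"] + [pad_id] * n_pad)
        # Padded positions get mask 0 so they can be ignored downstream.
        batch["attention_mask"].append(e["attention_mask"] + [0] * n_pad)
    return batch


# The first entry is the output printed in the example above;
# the second is a made-up shorter sequence for illustration only.
hello = {"input_ids": [59414, 8876], "token_type_ids": [0, 0], "attention_mask": [1, 1]}
short = {"input_ids": [59414], "token_type_ids": [0], "attention_mask": [1]}

batch = pad_batch([hello, short])
print(batch["input_ids"])       # [[59414, 8876], [59414, 0]]
print(batch["attention_mask"])  # [[1, 1], [1, 0]]
```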

bpe(token)[source]: Apply BPE encoding to a token.
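As a rough illustration of what a byte-pair-encoding step generally does, the sketch below greedily merges adjacent symbol pairs according to a rank table. It is a toy under stated assumptions, not the library's implementation: `toy_bpe`, `get_pairs`, and the `merge_ranks` table are hypothetical names and data introduced only for this example.

```python
def get_pairs(symbols):
    """Return the set of adjacent symbol pairs in a sequence."""
    return {(a, b) for a, b in zip(symbols, symbols[1:])}


def toy_bpe(token, merge_ranks):
    """Greedily apply the best-ranked merge until no known pair remains."""
    symbols = list(token)
    while len(symbols) > 1:
        pairs = get_pairs(symbols)
        # Pick the pair with the lowest rank; unknown pairs rank as infinity.
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table: a lower rank means the merge is applied earlier.
merge_ranks = {("l", "l"): 0, ("h", "e"): 1, ("he", "ll"): 2}
pieces = toy_bpe("hello", merge_ranks)
print(pieces)  # ['hell', 'o']
```

A real vocabulary would load its merge table from the vocabulary files rather than define it inline.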

convert_tokens_to_string(tokens)[source]: Convert the tokens into a single string.

prepare_for_tokenization(text, is_pretokenized=False, **kwargs)[source]: Optionally add a whitespace in front of the text before tokenization, depending on add_prefix_space.
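The effect of a prefix space can be sketched without the library. In GPT-2-style BPE vocabularies, a word at the very start of the text is usually encoded differently from the same word preceded by a space, which is the typical motivation for a flag like add_prefix_space. The function `prepare_text` below is a hypothetical stand-in written for this example; it assumes the flag simply prepends one space and does not reproduce the actual method.

```python
def prepare_text(text, add_prefix_space=False):
    """Hypothetical stand-in: prepend one space when the flag is set."""
    if add_prefix_space:
        return " " + text
    return text


plain = prepare_text("Hello world")
prefixed = prepare_text("Hello world", add_prefix_space=True)
print(repr(plain))     # 'Hello world'
print(repr(prefixed))  # ' Hello world'
```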

save_vocabulary(save_directory, filename_prefix)[source]: Write the vocabulary to files in save_directory.

property vocab_size: Get the vocabulary size.