mindformers.models.gpt2.GPT2Tokenizer

class mindformers.models.gpt2.GPT2Tokenizer(vocab_file, merge_file, unk_token='<|endoftext|>', bos_token='<|endoftext|>', eos_token='<|endoftext|>', pad_token='<|endoftext|>', add_prefix_space=False, **kwargs)[source]

Tokenizes the input string and converts it into token ids. The tokenizer uses byte-level byte-pair encoding (BPE) internally, driven by the vocabulary file and the merge file.

Parameters
  • vocab_file (str) – The vocabulary file path.

  • merge_file (str) – The merge file path.

  • unk_token (str) – The token that represents an unknown token. Default: '<|endoftext|>'.

  • bos_token (str) – The token that marks the beginning of a sentence. Default: '<|endoftext|>'.

  • eos_token (str) – The token that marks the end of a sentence. Default: '<|endoftext|>'.

  • pad_token (str) – The token used for padding. Default: '<|endoftext|>'.

  • add_prefix_space (bool) – Whether to add a whitespace in front of the text. Default: False.

  • **kwargs – Other kwargs that will be passed into the base class of the Tokenizer.

Examples

>>> from mindformers import GPT2Tokenizer
>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> res = tokenizer("Hello world")
>>> print(res)
{'input_ids': [50256, 15496, 995, 50256], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> res = tokenizer("Hello world", add_special_tokens=False)
>>> print(res)
{'input_ids': [15496, 995], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]}
Outputs:

A dict containing the processed ids and attention_mask, as specified by the MODEL_INPUT_NAME member of the subclass.

bpe(token)[source]

Apply byte-pair encoding (BPE) to a token.
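
As an illustration of what a BPE step does, the following is a minimal sketch, not the mindformers implementation. It assumes a merge table that maps symbol pairs to ranks (the names bpe_sketch, get_pairs and toy_ranks are hypothetical) and returns the merged symbols joined by spaces, a convention used by common GPT-2 reference code that may differ from what this method returns.

```python
def get_pairs(symbols):
    """Return the set of adjacent symbol pairs in a sequence."""
    return {(a, b) for a, b in zip(symbols, symbols[1:])}


def bpe_sketch(token, bpe_ranks):
    """Greedily merge the lowest-ranked adjacent pair until no known pair is left."""
    symbols = list(token)
    while len(symbols) > 1:
        pairs = get_pairs(symbols)
        # Pairs missing from the table get an infinite rank, so they are never preferred.
        best = min(pairs, key=lambda p: bpe_ranks.get(p, float("inf")))
        if best not in bpe_ranks:
            break
        first, second = best
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == first and symbols[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return " ".join(symbols)


# Hypothetical merge table: a lower rank means the merge is applied earlier.
toy_ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_sketch("lower", toy_ranks))  # low er
```

In a real tokenizer the ranks would come from the merge file, one merge rule per line in priority order.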

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None)[source]

Build model inputs from a sequence or a pair of sequences by concatenating them and adding special tokens.

A GPT2 sequence has the following format:

  • single sequence: <bos> X <eos>

  • pair of sequences: <bos> A <eos> B <eos>

Parameters
  • token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.

  • token_ids_1 (List[int], optional, defaults to None) – Optional second list of IDs for sequence pairs.
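
The format described for this method can be sketched in plain Python. This is an illustrative stand-in rather than the library code: build_inputs_sketch, BOS_ID and EOS_ID are hypothetical names, and the id 50256 is taken from the class example, where '<|endoftext|>' serves as both bos and eos.

```python
from typing import List, Optional

# Assumption: '<|endoftext|>' maps to id 50256, as in the class example output.
BOS_ID = 50256
EOS_ID = 50256


def build_inputs_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    bos_id: int = BOS_ID,
    eos_id: int = EOS_ID,
) -> List[int]:
    """Return <bos> X <eos> for one sequence, or <bos> A <eos> B <eos> for a pair."""
    output = [bos_id] + token_ids_0 + [eos_id]
    if token_ids_1 is not None:
        output = output + token_ids_1 + [eos_id]
    return output


print(build_inputs_sketch([15496, 995]))  # [50256, 15496, 995, 50256]
```

The single-sequence result matches the input_ids printed for "Hello world" in the class example.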

convert_tokens_to_string(tokens)[source]

Convert a list of tokens into a single string.
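
For a byte-level BPE vocabulary such as GPT-2's, this conversion typically means mapping each token character back to a byte and decoding the bytes as UTF-8. The sketch below assumes the byte-to-character table used by the standard GPT-2 vocabulary; the function names are hypothetical and the mindformers internals may differ.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable unicode character."""
    # Printable bytes keep their own code point.
    bs = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remaining bytes (controls, space, ...) are shifted to 256 and above.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def tokens_to_string_sketch(tokens):
    """Join tokens, map each character back to its byte, then decode as UTF-8."""
    byte_decoder = {ch: b for b, ch in bytes_to_unicode().items()}
    text = "".join(tokens)
    return bytearray(byte_decoder[ch] for ch in text).decode("utf-8", errors="replace")


# 'Ġ' (U+0120) is how this table represents a leading space.
print(tokens_to_string_sketch(["Hello", "Ġworld"]))  # Hello world
```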

prepare_for_tokenization(text, is_pretokenized=False, **kwargs)[source]

Prepare the text for tokenization, deciding whether to add a whitespace in front of the text.
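
A rough sketch of this kind of preprocessing is shown here. It is an assumption modeled on the one-line description and on comparable GPT-2 tokenizers, not the mindformers code: the function name is hypothetical, add_prefix_space is passed explicitly instead of being read from the tokenizer, and treating pre-tokenized input like add_prefix_space=True is a guess.

```python
def prepare_for_tokenization_sketch(text, is_pretokenized=False, add_prefix_space=False):
    """Optionally prepend a space so the first word is encoded like a mid-sentence word."""
    # Assumption: pre-tokenized input is handled the same way as add_prefix_space=True.
    if is_pretokenized or add_prefix_space:
        text = " " + text
    return text


print(repr(prepare_for_tokenization_sketch("Hello world")))
print(repr(prepare_for_tokenization_sketch("Hello world", add_prefix_space=True)))
```

The leading space matters because byte-level BPE folds the preceding space into a word's token, so "Hello" and " Hello" generally map to different ids.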

save_vocabulary(save_directory, filename_prefix)[source]

Write the vocabulary to files under save_directory.

property vocab_size

Get the vocab size of the tokenizer.