mindformers.models.gpt2.GPT2Tokenizer

Construct a GPT-2 tokenizer. Based on byte-level Byte-Pair-Encoding.

This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not:

```python
>>> from mindformers import GPT2Tokenizer

>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> tokenizer("Hello world", add_bos_token=False, add_eos_token=False)["input_ids"]
[15496, 995]

>>> tokenizer(" Hello world", add_bos_token=False, add_eos_token=False)["input_ids"]
[18435, 995]
```

You can get around that behavior by passing add_prefix_space=True when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
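To see why a leading space changes the result, it helps to look at the symbols byte-level BPE starts from before any merges are applied. The sketch below reproduces the byte-to-unicode table used by the original GPT-2 reference encoder; it is not taken from the mindformers source. The helper `to_symbols` and its `add_prefix_space` argument are hypothetical names, and prepending a single space is an assumption about what the option does.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable unicode character."""
    # Bytes that are already printable keep their own character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (control characters, space, ...) are shifted above 255.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


BYTE_ENCODER = bytes_to_unicode()


def to_symbols(text, add_prefix_space=False):
    """Hypothetical helper: show the symbols BPE would start from."""
    # Assumption: the option simply prepends one space when none is present.
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    return "".join(BYTE_ENCODER[b] for b in text.encode("utf-8"))


print(to_symbols("Hello"))                         # Hello
print(to_symbols(" Hello"))                        # ĠHello
print(to_symbols("Hello", add_prefix_space=True))  # ĠHello
```

Because the space is folded into the word as `Ġ`, "Hello" and " Hello" are different symbol sequences and therefore map to different ids.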

<Tip>

When used with is_split_into_words=True, this tokenizer will add a space before each word (even the first one).

</Tip>

This tokenizer inherits from [PreTrainedTokenizer] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Args:

vocab_file (str):: Path to the vocabulary file.
merges_file (str):: Path to the merges file.
errors (str, optional, defaults to “replace”):: Paradigm to follow when decoding bytes to UTF-8. See [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
unk_token (str, optional, defaults to <|endoftext|>):: The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
bos_token (str, optional, defaults to <|endoftext|>):: The beginning of sequence token.
eos_token (str, optional, defaults to <|endoftext|>):: The end of sequence token.
add_prefix_space (bool, optional, defaults to False):: Whether or not to add an initial space to the input. This allows the leading word to be treated like any other word. (The GPT-2 tokenizer detects the beginning of words by the preceding space.)
add_bos_token (bool, optional, defaults to True):: Whether or not to add the bos_token_id to the left of the input.
add_eos_token (bool, optional, defaults to True):: Whether or not to add the eos_token_id to the right of the input.

Examples:

>>> from mindformers import GPT2Tokenizer

>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> res = tokenizer("Hello world")
>>> print(res)
{'input_ids': [50256, 15496, 995, 50256], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}
>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> res = tokenizer("Hello world", add_special_tokens=False)
>>> print(res)
{'input_ids': [15496, 995], 'token_type_ids': [0, 0], 'attention_mask': [1, 1]}

Outputs:

A dict containing the processed ids and attention_mask, as specified by the MODEL_INPUT_NAME member of the subclass.

bpe(token): Apply byte-pair encoding (BPE) to a token.
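As a rough illustration of what a BPE step does, here is a minimal sketch of the standard greedy merge loop. The merge table `TOY_RANKS` is made up for the example, and returning the pieces joined by spaces follows the GPT-2 reference encoder; whether the mindformers method returns exactly this form is an assumption.

```python
def get_pairs(word):
    """Return the set of adjacent symbol pairs in a word (a tuple of symbols)."""
    return set(zip(word, word[1:]))


def bpe(token, ranks):
    """Repeatedly merge the lowest-ranked adjacent pair until none applies."""
    word = tuple(token)
    while len(word) > 1:
        # Pick the pair that was learned earliest (lowest rank).
        best = min(get_pairs(word), key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge is left
        first, second = best
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and word[i] == first and word[i + 1] == second:
                merged.append(first + second)
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
    return " ".join(word)


# Toy merge table (hypothetical): lower rank means higher priority.
TOY_RANKS = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}

print(bpe("lower", TOY_RANKS))  # low er
```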

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None): Insert the special tokens into the input_ids. Currently
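The single-sequence behavior can be sketched from the example output shown earlier, where "Hello world" becomes `[50256, 15496, 995, 50256]`. The standalone function below is illustrative only: the `add_bos_token` and `add_eos_token` keyword arguments stand in for the tokenizer's attributes, and the handling of `token_ids_1` is an assumption, not something the documentation confirms.

```python
BOS_ID = 50256  # id of <|endoftext|> in the documented example output
EOS_ID = 50256


def build_inputs_with_special_tokens(token_ids_0, token_ids_1=None,
                                     add_bos_token=True, add_eos_token=True):
    """Wrap the ids with BOS/EOS ids when the corresponding flag is set."""
    bos = [BOS_ID] if add_bos_token else []
    eos = [EOS_ID] if add_eos_token else []
    output = bos + token_ids_0 + eos
    if token_ids_1 is not None:
        # Assumption: the second sequence is wrapped the same way and appended.
        output = output + bos + token_ids_1 + eos
    return output


print(build_inputs_with_special_tokens([15496, 995]))
# [50256, 15496, 995, 50256]
```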

convert_tokens_to_string(tokens): Converts a sequence of tokens (strings) into a single string.
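For a byte-level BPE tokenizer, this conversion amounts to undoing the byte-to-unicode mapping and decoding the resulting bytes. The sketch below uses the table from the GPT-2 reference encoder and the `errors` policy described in the arguments; it is a standalone approximation, not the mindformers implementation.

```python
def bytes_to_unicode():
    """Byte-to-unicode table as used by the GPT-2 reference encoder."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: printable character -> original byte value.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def convert_tokens_to_string(tokens, errors="replace"):
    """Join the tokens, map each character back to a byte, decode as UTF-8."""
    text = "".join(tokens)
    raw = bytearray(BYTE_DECODER[ch] for ch in text)
    return raw.decode("utf-8", errors=errors)


print(convert_tokens_to_string(["Hello", "Ġworld"]))  # Hello world
```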

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int]

Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT sequence pair mask has the following format:

    0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
    | first sequence    | second sequence |

If token_ids_1 is None, only the first portion of the mask (0s) is returned.

Args:

token_ids_0 (List[int]):: List of IDs.
token_ids_1 (List[int], optional):: Optional second list of IDs for sequence pairs.

Returns:

List[int]: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
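A minimal sketch of the mask described above: zeros for the first sequence, ones for the second. Counting the BOS/EOS positions in each segment is an assumption, chosen because the documented example for "Hello world" shows four zeros for two word tokens; the `add_bos_token` and `add_eos_token` keyword arguments are stand-ins for the tokenizer's attributes.

```python
def create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None,
                                         add_bos_token=True, add_eos_token=True):
    """Return 0 for every position of the first sequence, 1 for the second."""
    # Assumption: each sequence is wrapped with BOS/EOS, so count them too.
    n_special = int(add_bos_token) + int(add_eos_token)
    mask = [0] * (len(token_ids_0) + n_special)
    if token_ids_1 is not None:
        mask += [1] * (len(token_ids_1) + n_special)
    return mask


print(create_token_type_ids_from_sequences([15496, 995]))
# [0, 0, 0, 0]
print(create_token_type_ids_from_sequences([15496], [995]))
# [0, 0, 0, 1, 1, 1]
```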

get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) → List[int]

Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's prepare_for_model or encode_plus methods.

Args:

token_ids_0 (List[int]):: List of IDs.
token_ids_1 (List[int], optional):: Optional second list of IDs for sequence pairs.
already_has_special_tokens (bool, optional, defaults to False):: Whether or not the token list is already formatted with special tokens for the model.

Returns:

List[int]: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
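The return value can be illustrated with a small standalone sketch. The `special_ids`, `add_bos_token`, and `add_eos_token` parameters are hypothetical stand-ins for state the tokenizer would hold itself, and the layout for sequence pairs is an assumption.

```python
def get_special_tokens_mask(token_ids_0, token_ids_1=None,
                            already_has_special_tokens=False,
                            special_ids=(50256,),
                            add_bos_token=True, add_eos_token=True):
    """Return 1 at special-token positions and 0 at sequence-token positions."""
    if already_has_special_tokens:
        # The ids are final: flag every id that is a known special id.
        return [1 if i in special_ids else 0 for i in token_ids_0]
    bos = [1] if add_bos_token else []
    eos = [1] if add_eos_token else []
    mask = bos + [0] * len(token_ids_0) + eos
    if token_ids_1 is not None:
        # Assumption: the second sequence is wrapped the same way.
        mask += bos + [0] * len(token_ids_1) + eos
    return mask


print(get_special_tokens_mask([15496, 995]))
# [1, 0, 0, 1]
print(get_special_tokens_mask([50256, 15496, 995, 50256],
                              already_has_special_tokens=True))
# [1, 0, 0, 1]
```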

save_vocabulary(save_directory, filename_prefix=None): Write the vocabulary to files.
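A sketch of what saving might look like, given that the tokenizer is constructed from a vocabulary file and a merges file. The file names `vocab.json` and `merges.txt`, the `prefix-` naming scheme, and the returned tuple of paths are all assumptions borrowed from common GPT-2 tokenizer conventions; the `vocab` and `merges` arguments are passed explicitly here only to keep the example standalone.

```python
import json
import os
import tempfile


def save_vocabulary(vocab, merges, save_directory, filename_prefix=None):
    """Write the token-to-id map as JSON and the merge rules one per line."""
    os.makedirs(save_directory, exist_ok=True)
    prefix = f"{filename_prefix}-" if filename_prefix else ""
    # Assumed file names; the real method may use different ones.
    vocab_path = os.path.join(save_directory, prefix + "vocab.json")
    merges_path = os.path.join(save_directory, prefix + "merges.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)
    with open(merges_path, "w", encoding="utf-8") as f:
        for first, second in merges:
            f.write(f"{first} {second}\n")
    return vocab_path, merges_path


with tempfile.TemporaryDirectory() as tmp:
    paths = save_vocabulary({"Hello": 0, "Ġworld": 1}, [("Ġ", "w")], tmp, "demo")
    print([os.path.basename(p) for p in paths])
    # ['demo-vocab.json', 'demo-merges.txt']
```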