mindformers.models.Tokenizer
- class mindformers.models.Tokenizer(**kwargs)
Base class for all slow tokenizers.
Inherits from PreTrainedTokenizerBase.
Handles all the shared methods for tokenization and special tokens, the methods for downloading, caching, and loading pretrained tokenizers, and the methods for adding tokens to the vocabulary.
This class also holds the added tokens in a unified way on top of all tokenizers, so we don’t have to handle the specific vocabulary augmentation methods of the various underlying dictionary structures (BPE, SentencePiece…).
Class attributes (overridden by derived classes)
vocab_files_names (Dict[str, str]) – A dictionary with, as keys, the __init__ keyword name of each vocabulary file required by the model, and as associated values, the filename for saving the associated file (string).
pretrained_vocab_files_map (Dict[str, Dict[str, str]]) – A dictionary of dictionaries. The high-level keys are the __init__ keyword names of each vocabulary file required by the model; the low-level keys are the shortcut names of the pretrained models, and the associated values are the URLs of the corresponding pretrained vocabulary files.
max_model_input_sizes (Dict[str, Optional[int]]) – A dictionary with, as keys, the shortcut names of the pretrained models, and as associated values, the maximum length of the sequence inputs of this model, or None if the model has no maximum input size.
pretrained_init_configuration (Dict[str, Dict[str, Any]]) – A dictionary with, as keys, the shortcut names of the pretrained models, and as associated values, a dictionary of specific arguments to pass to the __init__ method of the tokenizer class for this pretrained model when loading the tokenizer with the PreTrainedTokenizerBase.from_pretrained method.
model_input_names (List[str]) – A list of inputs expected in the forward pass of the model.
padding_side (str) – The default value for the side on which the model should have padding applied. Should be ‘right’ or ‘left’.
truncation_side (str) – The default value for the side on which the model should have truncation applied. Should be ‘right’ or ‘left’.
- Args:
- model_max_length (int, optional):
The maximum length (in number of tokens) for the inputs to the transformer model. When the tokenizer is loaded with PreTrainedTokenizerBase.from_pretrained, this is set to the value stored for the associated model in max_model_input_sizes (see above). If no value is provided, it defaults to VERY_LARGE_INTEGER (int(1e30)).
- padding_side (str, optional):
The side on which the model should have padding applied. Must be either ‘right’ or ‘left’. The default value is picked from the class attribute of the same name.
- truncation_side (str, optional):
The side on which the model should have truncation applied. Must be either ‘right’ or ‘left’. The default value is picked from the class attribute of the same name.
- model_input_names (List[string], optional):
The list of inputs accepted by the forward pass of the model (like “token_type_ids” or “attention_mask”). Default value is picked from the class attribute of the same name.
- bos_token (str or tokenizers.AddedToken, optional):
A special token representing the beginning of a sentence. Will be associated to self.bos_token and self.bos_token_id.
- eos_token (str or tokenizers.AddedToken, optional):
A special token representing the end of a sentence. Will be associated to self.eos_token and self.eos_token_id.
- unk_token (str or tokenizers.AddedToken, optional):
A special token representing an out-of-vocabulary token. Will be associated to self.unk_token and self.unk_token_id.
- sep_token (str or tokenizers.AddedToken, optional):
A special token separating two different sentences in the same input (used by BERT for instance). Will be associated to self.sep_token and self.sep_token_id.
- pad_token (str or tokenizers.AddedToken, optional):
A special token used to make arrays of tokens the same size for batching purposes. It is then ignored by attention mechanisms and loss computation. Will be associated to self.pad_token and self.pad_token_id.
- cls_token (str or tokenizers.AddedToken, optional):
A special token representing the class of the input (used by BERT for instance). Will be associated to self.cls_token and self.cls_token_id.
- mask_token (str or tokenizers.AddedToken, optional):
A special token representing a masked token (used by masked-language modeling pretraining objectives, like BERT). Will be associated to self.mask_token and self.mask_token_id.
- additional_special_tokens (tuple or list of str or tokenizers.AddedToken, optional):
A tuple or a list of additional special tokens. Add them here to ensure they won’t be split by the tokenization process. Will be associated to self.additional_special_tokens and self.additional_special_tokens_ids.
- clean_up_tokenization_spaces (bool, optional, defaults to True):
Whether or not the model should clean up the spaces that were added when splitting the input text during the tokenization process.
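The effect of padding_side and truncation_side can be illustrated with a minimal sketch. The helpers `truncate` and `pad` below are hypothetical stand-ins written for this example, not part of the Tokenizer API, and the pad id of 0 is an assumption.

```python
from typing import List


def truncate(ids: List[int], max_length: int, truncation_side: str = "right") -> List[int]:
    """Toy helper: drop ids from the chosen side until at most max_length remain."""
    if truncation_side not in ("right", "left"):
        raise ValueError("truncation_side must be 'right' or 'left'")
    if len(ids) <= max_length:
        return list(ids)
    if truncation_side == "right":
        # Truncating on the right keeps the beginning of the sequence.
        return ids[:max_length]
    # Truncating on the left keeps the end of the sequence.
    return ids[len(ids) - max_length:]


def pad(ids: List[int], max_length: int, pad_id: int = 0, padding_side: str = "right") -> List[int]:
    """Toy helper: add pad_id on the chosen side until the length is max_length."""
    if padding_side not in ("right", "left"):
        raise ValueError("padding_side must be 'right' or 'left'")
    fill = [pad_id] * max(0, max_length - len(ids))
    return ids + fill if padding_side == "right" else fill + ids


ids = [11, 12, 13, 14, 15]
print(truncate(ids, 3, "right"))   # keeps the first three ids
print(truncate(ids, 3, "left"))    # keeps the last three ids
print(pad([11, 12], 4, 0, "left")) # pad ids are placed before the sequence
```

Left padding is commonly chosen for decoder-only generation so that the last position of every row holds a real token; right padding is the more common default elsewhere.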
- convert_ids_to_tokens(ids: Union[int, List[int]], skip_special_tokens: bool = False) -> Union[str, List[str]]
Converts a single index or a sequence of indices into a token or a sequence of tokens, using the vocabulary and added tokens.
- Args:
- ids (int or List[int]):
The token id (or token ids) to convert to tokens.
- skip_special_tokens (bool, optional, defaults to False):
Whether or not to remove special tokens in the decoding.
- Returns:
str or List[str]: The decoded token(s).
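As a rough illustration of the behaviour described for convert_ids_to_tokens, the sketch below re-implements the lookup over a toy vocabulary. The vocabulary, the set of special ids, and the function body are assumptions made for this example; they are not the mindformers implementation.

```python
from typing import Dict, List, Union

# Hypothetical toy vocabulary; a real tokenizer loads this from its vocabulary files.
VOCAB: Dict[str, int] = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
ID_TO_TOKEN: Dict[int, str] = {i: t for t, i in VOCAB.items()}
SPECIAL_IDS = {0, 1, 2}


def convert_ids_to_tokens(
    ids: Union[int, List[int]], skip_special_tokens: bool = False
) -> Union[str, List[str]]:
    """Toy stand-in: map one id to one token, or a list of ids to a list of tokens."""
    if isinstance(ids, int):
        return ID_TO_TOKEN[ids]
    tokens = []
    for index in ids:
        # Special tokens are dropped only when the caller asks for it.
        if skip_special_tokens and index in SPECIAL_IDS:
            continue
        tokens.append(ID_TO_TOKEN[index])
    return tokens


print(convert_ids_to_tokens(3))
print(convert_ids_to_tokens([1, 3, 4, 2]))
print(convert_ids_to_tokens([1, 3, 4, 2], skip_special_tokens=True))
```

Note that a single int yields a single str, while a list always yields a list, even if it has one element.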
- convert_tokens_to_ids(tokens: Union[str, List[str]]) -> Union[int, List[int]]
Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.
- Args:
tokens (str or List[str]): One or several token(s) to convert to token id(s).
- Returns:
int or List[int]: The token id or list of token ids.
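The sketch below shows one plausible shape for the token-to-id lookup. It assumes that added tokens are consulted before the base vocabulary and that unknown tokens fall back to the id of the unknown token; both points are assumptions for this example (the fallback in particular usually depends on the concrete tokenizer subclass), and all names are hypothetical.

```python
from typing import Dict, List, Union

# Hypothetical toy vocabularies, for illustration only.
BASE_VOCAB: Dict[str, int] = {"[UNK]": 0, "hello": 1, "world": 2}
ADDED_VOCAB: Dict[str, int] = {"<new_token>": 3}
UNK_ID = BASE_VOCAB["[UNK]"]


def _token_to_id(token: str) -> int:
    # Assumption: added tokens take precedence over the base vocabulary.
    if token in ADDED_VOCAB:
        return ADDED_VOCAB[token]
    # Assumption: out-of-vocabulary tokens map to the unknown-token id.
    return BASE_VOCAB.get(token, UNK_ID)


def convert_tokens_to_ids(tokens: Union[str, List[str]]) -> Union[int, List[int]]:
    """Toy stand-in: a str gives one id, a list of str gives a list of ids."""
    if isinstance(tokens, str):
        return _token_to_id(tokens)
    return [_token_to_id(token) for token in tokens]


print(convert_tokens_to_ids("hello"))
print(convert_tokens_to_ids(["hello", "<new_token>", "xyz"]))
```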
- get_added_vocab() -> Dict[str, int]
Returns the added tokens in the vocabulary as a dictionary of token to index.
- Returns:
Dict[str, int]: The added tokens.
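To make the split between the base vocabulary and the added tokens concrete, here is a small self-contained sketch. `ToyVocab` and its `add_tokens` helper are hypothetical and exist only for this example; the assumption that new tokens receive the next free ids after the base vocabulary is likewise illustrative.

```python
from typing import Dict, Iterable


class ToyVocab:
    """Toy container that keeps base and added tokens separately."""

    def __init__(self, base: Dict[str, int]) -> None:
        self._base = dict(base)
        self._added: Dict[str, int] = {}

    @property
    def vocab_size(self) -> int:
        # Size of the base vocabulary only, without the added tokens.
        return len(self._base)

    def add_tokens(self, tokens: Iterable[str]) -> int:
        """Register tokens that are not known yet; return how many were added."""
        count = 0
        for token in tokens:
            if token in self._base or token in self._added:
                continue
            # Assumption: new ids continue after all existing ids.
            self._added[token] = len(self._base) + len(self._added)
            count += 1
        return count

    def get_added_vocab(self) -> Dict[str, int]:
        # Return a copy so callers cannot mutate internal state.
        return dict(self._added)

    def __len__(self) -> int:
        # Full size: base vocabulary plus added tokens.
        return len(self._base) + len(self._added)


vocab = ToyVocab({"a": 0, "b": 1})
vocab.add_tokens(["<new>", "a"])  # "a" already exists, so only "<new>" is added
print(vocab.get_added_vocab(), vocab.vocab_size, len(vocab))
```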
- get_special_tokens_mask(token_ids_0: List, token_ids_1: Optional[List] = None, already_has_special_tokens: bool = False) -> List[int]
Retrieves sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer’s prepare_for_model or encode_plus methods.
- Args:
- token_ids_0 (List[int]):
List of ids of the first sequence.
- token_ids_1 (List[int], optional):
List of ids of the second sequence.
- already_has_special_tokens (bool, optional, defaults to False):
Whether or not the token list is already formatted with special tokens for the model.
- Returns:
List[int]: A list of 0s and 1s, with 1 for a special token and 0 for a sequence token.
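The mask can be sketched for one particular layout. The example below assumes a BERT-style format, `[CLS] A [SEP]` for a single sequence and `[CLS] A [SEP] B [SEP]` for a pair, with hypothetical ids 1 and 2 for the two special tokens. Other tokenizers add different special tokens, so this is an illustration rather than the behaviour of every subclass.

```python
from typing import List, Optional

# Hypothetical special-token ids used only in this sketch.
CLS_ID, SEP_ID = 1, 2
SPECIAL_IDS = {CLS_ID, SEP_ID}


def get_special_tokens_mask(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    already_has_special_tokens: bool = False,
) -> List[int]:
    """Toy stand-in: 1 marks a special token, 0 marks a sequence token."""
    if already_has_special_tokens:
        if token_ids_1 is not None:
            raise ValueError("Pass a single, already formatted list of ids.")
        # The input is already formatted, so just flag the special ids in place.
        return [1 if token_id in SPECIAL_IDS else 0 for token_id in token_ids_0]
    # Assumed layout: [CLS] A [SEP]
    mask = [1] + [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        # Assumed layout for pairs: [CLS] A [SEP] B [SEP]
        mask += [0] * len(token_ids_1) + [1]
    return mask


print(get_special_tokens_mask([7, 8, 9]))
print(get_special_tokens_mask([7, 8], [9]))
print(get_special_tokens_mask([1, 7, 8, 2], already_has_special_tokens=True))
```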
- num_special_tokens_to_add(pair: bool = False) -> int
Returns the number of added tokens when encoding a sequence with special tokens.
Note: This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.
- Args:
- pair (bool, optional, defaults to False):
Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence.
- Returns:
int: Number of special tokens added to sequences.
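The "encode a dummy input and count" approach can be sketched as follows. The function `build_inputs_with_special_tokens` here is a toy written for this example that assumes a BERT-style layout with hypothetical ids; it only shows why the count is derived by building a sequence, and why the result is worth computing once outside a loop.

```python
from typing import List, Optional

# Hypothetical special-token ids used only in this sketch.
CLS_ID, SEP_ID = 1, 2


def build_inputs_with_special_tokens(
    ids_0: List[int], ids_1: Optional[List[int]] = None
) -> List[int]:
    """Toy formatter: [CLS] A [SEP] or [CLS] A [SEP] B [SEP]."""
    output = [CLS_ID] + ids_0 + [SEP_ID]
    if ids_1 is not None:
        output += ids_1 + [SEP_ID]
    return output


def num_special_tokens_to_add(pair: bool = False) -> int:
    """Count special tokens by formatting empty dummy sequences."""
    dummy_0: List[int] = []
    dummy_1: Optional[List[int]] = [] if pair else None
    # With empty inputs, everything in the result is a special token.
    return len(build_inputs_with_special_tokens(dummy_0, dummy_1))


# Compute once, then reuse the value when budgeting sequence lengths.
single_overhead = num_special_tokens_to_add(pair=False)
pair_overhead = num_special_tokens_to_add(pair=True)
max_length = 16
print(max_length - single_overhead, max_length - pair_overhead)
```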
- prepare_for_tokenization(text: str, **kwargs) -> Tuple[str, Dict[str, Any]]
Performs any necessary transformations before tokenization.
This method should pop the arguments from kwargs and return the remaining kwargs as well. We test the kwargs at the end of the encoding process to be sure all the arguments have been used.
- Args:
- text (str):
The text to prepare.
- is_split_into_words (bool, optional, defaults to False):
Whether or not the input is already pre-tokenized (e.g., split into words). If set to True, the tokenizer assumes the input is already split into words (for instance, by splitting it on whitespace) which it will tokenize. This is useful for NER or token classification.
- kwargs:
Keyword arguments to use for the tokenization.
- Returns:
Tuple[str, Dict[str, Any]]: The prepared text and the unused kwargs.
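The "pop what you use, return the rest" contract can be sketched with a standalone function. The options `do_lower_case` and `add_prefix_space` are hypothetical examples of arguments an override might consume; they are assumptions for this sketch, not options documented for this class.

```python
from typing import Any, Dict, Tuple


def prepare_for_tokenization(
    text: str, is_split_into_words: bool = False, **kwargs: Any
) -> Tuple[str, Dict[str, Any]]:
    """Toy override: consume the options it understands and return the others."""
    # Pop the arguments handled here so they are not reported as unused later.
    do_lower_case = kwargs.pop("do_lower_case", False)        # hypothetical option
    add_prefix_space = kwargs.pop("add_prefix_space", False)  # hypothetical option

    if do_lower_case:
        text = text.lower()
    # Pre-split words are treated like text that needs a leading space (assumption).
    if (is_split_into_words or add_prefix_space) and text and not text[0].isspace():
        text = " " + text
    # Whatever is left was not consumed by this step.
    return text, kwargs


text, unused = prepare_for_tokenization("Hello World", do_lower_case=True, foo=1)
print(repr(text), unused)
if unused:
    # A caller can check the leftovers to detect misspelled or unsupported arguments.
    print("Unused keyword arguments:", sorted(unused))
```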
- tokenize(text: str, pair: Optional[str] = None, add_special_tokens: bool = True, **kwargs) -> List[str]
Converts a string into a sequence of tokens, using the tokenizer.
Splits into words for word-based vocabularies or into sub-words for sub-word-based vocabularies (BPE/SentencePiece/WordPiece). Takes care of added tokens.
- Args:
- text (str):
The sequence to be encoded.
- **kwargs (additional keyword arguments):
Passed along to the model-specific prepare_for_tokenization preprocessing method.
- Returns:
List[str]: The list of tokens.
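One way to picture "takes care of added tokens" is that added tokens are isolated first and never split, while the remaining text goes through the model-specific algorithm. The sketch below uses plain whitespace splitting as a stand-in for that algorithm; the function and its `added_tokens` parameter are hypothetical and do not reflect the actual signature or internals of tokenize.

```python
import re
from typing import List, Sequence


def tokenize(text: str, added_tokens: Sequence[str] = ()) -> List[str]:
    """Toy tokenizer: keep added tokens whole, split everything else on whitespace."""
    if not added_tokens:
        return text.split()
    # Longest tokens first, so a longer added token wins over a shorter prefix.
    ordered = sorted(added_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(token) for token in ordered) + ")"
    tokens: List[str] = []
    # The capturing group makes re.split keep the matched added tokens.
    for chunk in re.split(pattern, text):
        if chunk in added_tokens:
            tokens.append(chunk)
        else:
            # Stand-in for the real sub-word algorithm (BPE, WordPiece, ...).
            tokens.extend(chunk.split())
    return tokens


print(tokenize("hello <sep>world foo", added_tokens=["<sep>"]))
```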
- property vocab_size: int
int: Size of the base vocabulary (without the added tokens).