mindformers.models.t5.T5Tokenizer
- class mindformers.models.t5.T5Tokenizer(vocab_file, eos_token='</s>', unk_token='<unk>', pad_token='<pad>', extra_ids=100, additional_special_tokens=None, sp_model_kwargs: Optional[Dict[str, Any]] = None, **kwargs)[source]
Construct a T5 tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).
This tokenizer inherits from [PreTrainedTokenizer] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
- Args:
- vocab_file (str):
[SentencePiece](https://github.com/google/sentencepiece) file (generally has a .spm extension) that contains the vocabulary necessary to instantiate a tokenizer.
- eos_token (str, optional, defaults to “</s>”):
The end of sequence token.
<Tip>
When building a sequence using special tokens, this token is appended to the end of each sequence, as shown in the format given for build_inputs_with_special_tokens.
</Tip>
- unk_token (str, optional, defaults to “<unk>”):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- pad_token (str, optional, defaults to “<pad>”):
The token used for padding, for example when batching sequences of different lengths.
- extra_ids (int, optional, defaults to 100):
Number of extra ids added to the vocabulary for use as sentinels. These tokens are accessible as “<extra_id_{%d}>”, where “{%d}” is a number between 0 and extra_ids-1. The tokens can be retrieved by calling the get_sentinel_tokens method, and their ids by calling the get_sentinel_token_ids method.
- additional_special_tokens (List[str], optional):
Additional special tokens used by the tokenizer.
- sp_model_kwargs (dict, optional):
Will be passed to the SentencePieceProcessor.__init__() method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, to set:
- enable_sampling: Enable subword regularization.
- nbest_size: Sampling parameter for unigram. Invalid for BPE-Dropout.
  - nbest_size = {0,1}: No sampling is performed.
  - nbest_size > 1: Samples from the nbest_size results.
  - nbest_size < 0: Assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
- alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.
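As an illustration of the sampling options listed for sp_model_kwargs, the dictionary below is one possible configuration. The specific values and the file name `spiece.model` are assumptions chosen for the example, not defaults taken from this page, and the constructor call is left commented out because it needs a real SentencePiece model file.

```python
# Keyword arguments that would be forwarded to SentencePieceProcessor.__init__().
# The values are illustrative assumptions, not recommended settings.
sp_model_kwargs = {
    "enable_sampling": True,  # turn on subword regularization
    "nbest_size": -1,         # < 0: sample from the full lattice of hypotheses
    "alpha": 0.1,             # smoothing parameter for unigram sampling
}

# Hypothetical usage; "spiece.model" is a placeholder path.
# from mindformers.models.t5 import T5Tokenizer
# tokenizer = T5Tokenizer(vocab_file="spiece.model", sp_model_kwargs=sp_model_kwargs)

print(sorted(sp_model_kwargs))
```

With sampling enabled, the same input text may be split into different subword sequences on different calls, which is the intended effect of subword regularization.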
- Attributes:
- sp_model (SentencePieceProcessor):
The SentencePiece processor that is used for every conversion (string, tokens and IDs).
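The sentinel naming scheme described for extra_ids can be sketched in plain Python. The helper `sentinel_token_names` is hypothetical and only reproduces the documented “<extra_id_{%d}>” pattern; it does not call the tokenizer or reflect how the ids are assigned in the vocabulary.

```python
from typing import List


def sentinel_token_names(extra_ids: int = 100) -> List[str]:
    """Return the names <extra_id_0> ... <extra_id_{extra_ids-1}> (hypothetical helper)."""
    return [f"<extra_id_{i}>" for i in range(extra_ids)]


names = sentinel_token_names(100)
print(names[0], names[-1], len(names))  # <extra_id_0> <extra_id_99> 100
```

On a real tokenizer instance, the get_sentinel_tokens and get_sentinel_token_ids methods mentioned above are the documented way to obtain these tokens and their ids.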
- build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]
Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A sequence has the following format:
- single sequence: X </s>
- pair of sequences: A </s> B </s>
- Args:
- token_ids_0 (List[int]):
List of IDs to which the special tokens will be added.
- token_ids_1 (List[int], optional):
Optional second list of IDs for sequence pairs.
- Returns:
List[int]: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
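The two formats above can be sketched with plain lists. This is a minimal illustration, not the library's implementation: the function name `build_inputs_sketch` and the id `EOS_ID = 1` are assumptions (the real id of `</s>` comes from the vocabulary file), and edge cases such as an input that already ends with `</s>` are ignored.

```python
from typing import List, Optional

EOS_ID = 1  # assumed id of "</s>"; the actual value depends on the vocabulary


def build_inputs_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    eos_id: int = EOS_ID,
) -> List[int]:
    """Mimic the documented layouts 'X </s>' and 'A </s> B </s>'."""
    first = token_ids_0 + [eos_id]
    if token_ids_1 is None:
        return first
    return first + token_ids_1 + [eos_id]


print(build_inputs_sketch([10, 11]))        # [10, 11, 1]
print(build_inputs_sketch([10, 11], [20]))  # [10, 11, 1, 20, 1]
```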
- create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]
Create a mask from the two sequences passed to be used in a sequence-pair classification task. T5 does not make use of token type ids, therefore a list of zeros is returned.
- Args:
- token_ids_0 (List[int]):
List of IDs.
- token_ids_1 (List[int], optional):
Optional second list of IDs for sequence pairs.
- Returns:
List[int]: List of zeros.
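A rough sketch of the all-zeros result follows. The description above does not state the length of the returned list; the sketch assumes one zero per position of the built sequence, counting one `</s>` after each input. The function name `token_type_ids_sketch` is hypothetical.

```python
from typing import List, Optional


def token_type_ids_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
) -> List[int]:
    """Return zeros only; the length (inputs plus one </s> each) is an assumption."""
    length = len(token_ids_0) + 1  # +1 for the assumed trailing </s>
    if token_ids_1 is not None:
        length += len(token_ids_1) + 1
    return [0] * length


print(token_type_ids_sketch([10, 11]))        # [0, 0, 0]
print(token_type_ids_sketch([10, 11], [20]))  # [0, 0, 0, 0, 0]
```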
- get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) → List[int][source]
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's prepare_for_model method.
- Args:
- token_ids_0 (List[int]):
List of IDs.
- token_ids_1 (List[int], optional):
Optional second list of IDs for sequence pairs.
- already_has_special_tokens (bool, optional, defaults to False):
Whether or not the token list is already formatted with special tokens for the model.
- Returns:
List[int]: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
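The mask can be sketched as below, assuming that the only special token added is one `</s>` after each sequence, as in the format documented for build_inputs_with_special_tokens. The function name `special_tokens_mask_sketch` and the set `SPECIAL_IDS` are hypothetical placeholders; the real special ids depend on the vocabulary, and the library's handling of already formatted input may differ.

```python
from typing import FrozenSet, List, Optional

SPECIAL_IDS = frozenset({0, 1, 2})  # assumed ids for <pad>, </s>, <unk>


def special_tokens_mask_sketch(
    token_ids_0: List[int],
    token_ids_1: Optional[List[int]] = None,
    already_has_special_tokens: bool = False,
    special_ids: FrozenSet[int] = SPECIAL_IDS,
) -> List[int]:
    """Return 1 for special tokens and 0 for sequence tokens."""
    if already_has_special_tokens:
        # Input is already formatted: flag every id that is a special token.
        return [1 if tok in special_ids else 0 for tok in token_ids_0]
    # Otherwise mark the position where </s> would be appended.
    mask = [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        mask += [0] * len(token_ids_1) + [1]
    return mask


print(special_tokens_mask_sketch([10, 11]))        # [0, 0, 1]
print(special_tokens_mask_sketch([10, 11], [20]))  # [0, 0, 1, 0, 1]
print(special_tokens_mask_sketch([10, 11, 1], already_has_special_tokens=True))  # [0, 0, 1]
```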