mindformers.models.t5.T5Tokenizer

class mindformers.models.t5.T5Tokenizer(vocab_file, eos_token='</s>', unk_token='<unk>', pad_token='<pad>', extra_ids=100, additional_special_tokens=None, sp_model_kwargs: Optional[Dict[str, Any]] = None, **kwargs)[source]

Construct a T5 tokenizer. Based on [SentencePiece](https://github.com/google/sentencepiece).

This tokenizer inherits from [PreTrainedTokenizer] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

Args:
vocab_file (str):

[SentencePiece](https://github.com/google/sentencepiece) file (generally has a .spm extension) that contains the vocabulary necessary to instantiate a tokenizer.

eos_token (str, optional, defaults to “</s>”):

The end of sequence token.

Tip: When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

unk_token (str, optional, defaults to “<unk>”):

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

pad_token (str, optional, defaults to “<pad>”):

The token used for padding, for example when batching sequences of different lengths.

extra_ids (int, optional, defaults to 100):

The number of extra IDs added to the vocabulary for use as sentinels. These tokens are accessible as “<extra_id_{%d}>”, where “{%d}” is a number between 0 and extra_ids-1. The tokens can be retrieved by calling the get_sentinel_tokens method, and their token IDs by calling the get_sentinel_token_ids method.

additional_special_tokens (List[str], optional):

Additional special tokens used by the tokenizer.

sp_model_kwargs (dict, optional):

Will be passed to the SentencePieceProcessor.__init__() method. The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python) can be used, among other things, to set:

  • enable_sampling: Enable subword regularization.

  • nbest_size: Sampling parameters for unigram. Invalid for BPE-Dropout.

    • nbest_size = {0,1}: No sampling is performed.

    • nbest_size > 1: samples from the nbest_size results.

    • nbest_size < 0: assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.

  • alpha: Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout.
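As an illustration of the options listed above, the following is a minimal sketch of an sp_model_kwargs dictionary that enables sampling over the full lattice. The specific values and the vocabulary path are placeholders chosen for this example, not recommended settings, and the constructor call is left commented out because it needs a real SentencePiece file.

```python
# Hypothetical sampling configuration; the values are illustrative only.
sp_model_kwargs = {
    "enable_sampling": True,  # turn on subword regularization
    "nbest_size": -1,         # < 0: sample from all hypotheses (lattice)
    "alpha": 0.1,             # smoothing parameter for unigram sampling
}

# Assumed usage (the path below is a placeholder, not a real file):
# from mindformers.models.t5 import T5Tokenizer
# tokenizer = T5Tokenizer("path/to/vocab_file", sp_model_kwargs=sp_model_kwargs)
```

With sampling enabled, the same input text may be split into different pieces on different calls, which is the point of subword regularization.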

Attributes:
sp_model (SentencePieceProcessor):

The SentencePiece processor that is used for every conversion (string, tokens and IDs).
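The sentinel naming scheme described for extra_ids can be sketched in plain Python. The helper name sentinel_tokens is hypothetical and only reproduces the documented “<extra_id_{%d}>” pattern; it is not the tokenizer's own get_sentinel_tokens method.

```python
from typing import List


def sentinel_tokens(extra_ids: int = 100) -> List[str]:
    """Build the sentinel strings <extra_id_0> ... <extra_id_{extra_ids-1}>."""
    return [f"<extra_id_{i}>" for i in range(extra_ids)]


print(sentinel_tokens(3))  # ['<extra_id_0>', '<extra_id_1>', '<extra_id_2>']
```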

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int][source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A sequence has the following format:

  • single sequence: X </s>

  • pair of sequences: A </s> B </s>

Args:
token_ids_0 (List[int]):

List of IDs to which the special tokens will be added.

token_ids_1 (List[int], optional):

Optional second list of IDs for sequence pairs.

Returns:

List[int]: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
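The two formats above can be sketched without the library. This is a simplified stand-in, not the tokenizer's implementation: the function name build_inputs_sketch is hypothetical, and EOS_ID = 1 is an assumed ID for </s> (the real value depends on the loaded vocabulary).

```python
from typing import List, Optional

EOS_ID = 1  # assumption: ID of </s>; the real value comes from the vocabulary


def build_inputs_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Append </s> to each sequence: X </s>, or A </s> B </s> for a pair."""
    first = token_ids_0 + [EOS_ID]
    if token_ids_1 is None:
        return first
    return first + token_ids_1 + [EOS_ID]


print(build_inputs_sketch([10, 11]))        # [10, 11, 1]
print(build_inputs_sketch([10, 11], [20]))  # [10, 11, 1, 20, 1]
```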

convert_tokens_to_string(tokens)[source]

Converts a sequence of tokens (strings) into a single string.
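To give a feel for what this conversion involves, here is a rough sketch for SentencePiece-style pieces, where the character "▁" (U+2581) marks a preceding space. The helper pieces_to_string is hypothetical; the actual method presumably delegates decoding to sp_model and may treat special tokens differently.

```python
from typing import List

SPIECE_UNDERLINE = "\u2581"  # "▁", the SentencePiece whitespace marker


def pieces_to_string(tokens: List[str]) -> str:
    """Join pieces and turn the whitespace marker back into spaces."""
    return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()


print(pieces_to_string(["\u2581Hello", "\u2581wor", "ld"]))  # Hello world
```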

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int][source]

Create a mask from the two sequences passed to be used in a sequence-pair classification task. T5 does not make use of token type ids, therefore a list of zeros is returned.

Args:
token_ids_0 (List[int]):

List of IDs.

token_ids_1 (List[int], optional):

Optional second list of IDs for sequence pairs.

Returns:

List[int]: List of zeros.
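A minimal sketch of this behavior follows. The length of the returned list is an assumption: it is taken here to cover each sequence plus one trailing </s>, matching the X </s> and A </s> B </s> layouts. The function name token_type_ids_sketch is hypothetical.

```python
from typing import List, Optional


def token_type_ids_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Return all zeros, since T5 does not use token type IDs."""
    # Assumption: one extra position per sequence for its trailing </s>.
    length = len(token_ids_0) + 1
    if token_ids_1 is not None:
        length += len(token_ids_1) + 1
    return [0] * length


print(token_type_ids_sketch([10, 11]))        # [0, 0, 0]
print(token_type_ids_sketch([10, 11], [20]))  # [0, 0, 0, 0, 0]
```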

get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) -> List[int][source]

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer's prepare_for_model method.

Args:
token_ids_0 (List[int]):

List of IDs.

token_ids_1 (List[int], optional):

Optional second list of IDs for sequence pairs.

already_has_special_tokens (bool, optional, defaults to False):

Whether or not the token list is already formatted with special tokens for the model.

Returns:

List[int]: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
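The mask for the default case (already_has_special_tokens=False) can be sketched as follows, assuming a single </s> is appended after each sequence as in the X </s> and A </s> B </s> layouts. The function name special_tokens_mask_sketch is hypothetical, and the case where special tokens are already present is deliberately not covered.

```python
from typing import List, Optional


def special_tokens_mask_sketch(
    token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """Mark sequence tokens with 0 and each appended </s> with 1."""
    mask = [0] * len(token_ids_0) + [1]
    if token_ids_1 is not None:
        mask += [0] * len(token_ids_1) + [1]
    return mask


print(special_tokens_mask_sketch([10, 11]))        # [0, 0, 1]
print(special_tokens_mask_sketch([10, 11], [20]))  # [0, 0, 1, 0, 1]
```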