mindformers.models.clip.CLIPTokenizer

CLIP Tokenizer

Args:

vocab_file (str): The vocabulary file path.
eos_token (str): The token that represents the end of a sentence. Default: "<|endoftext|>".
bos_token (str): The token that represents the beginning of a sentence. Default: "<|startoftext|>".
pad_token (str): The token used for padding. Default: "<|endoftext|>".
unk_token (str): The token that represents an unknown token. Default: "<|endoftext|>".
add_prefix_space (bool): Whether to add a whitespace in front of the text. Default: False.
add_bos_token (bool): Whether to add the bos_token_id to the left of the input. Default: True.
add_eos_token (bool): Whether to add the eos_token_id to the right of the input. Default: True.
**kwargs: Other keyword arguments that are passed to the base class of the tokenizer.

Examples:

>>> from mindformers import CLIPTokenizer
>>> CLIPTokenizer.show_support_list()
    INFO - support list of CLIPTokenizer is:
    INFO -    ['clip_vit_b_32']
    INFO - -------------------------------------
>>> tokenizer = CLIPTokenizer.from_pretrained('clip_vit_b_32')
>>> tokenizer("a boy")
    {'input_ids': [49406, 320, 1876, 49407], 'attention_mask': [1, 1, 1, 1]}

FILE_LIST: List[str] = ['tokenizer_config.json']: File list for the CLIP tokenizer.

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None): Inserts the special tokens into the input IDs. Currently, token_ids_0 must be a list of IDs.
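
The effect of inserting special tokens can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name `add_special_tokens` is hypothetical, and the IDs 49406 and 49407 are read off the `tokenizer("a boy")` output in the Examples section, on the assumption that they are the begin-of-sentence and end-of-sentence IDs.

```python
from typing import List

# Assumed IDs, taken from the tokenizer("a boy") example output.
BOS_ID = 49406
EOS_ID = 49407


def add_special_tokens(token_ids: List[int],
                       add_bos: bool = True,
                       add_eos: bool = True) -> List[int]:
    """Hypothetical helper: wrap a list of IDs with BOS/EOS IDs."""
    prefix = [BOS_ID] if add_bos else []
    suffix = [EOS_ID] if add_eos else []
    return prefix + list(token_ids) + suffix


wrapped = add_special_tokens([320, 1876])
print(wrapped)  # [49406, 320, 1876, 49407]
```

The two flags mirror the add_bos_token and add_eos_token arguments described above; turning one off simply drops the corresponding ID.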

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int]

Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT sequence pair mask has the following format:

    0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
    | first sequence    | second sequence |

If token_ids_1 is None, only the first portion of the mask (0s) is returned.

Args:

token_ids_0 (List[int]): List of IDs.
token_ids_1 (List[int], optional): Optional second list of IDs for sequence pairs.

Returns:

List[int]: List of token type IDs according to the given sequence(s).
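
The 0/1 layout described for this method can be sketched as follows. The function `token_type_mask` is a hypothetical stand-in, not the library method: it assumes the lists passed in already contain any special tokens, so it only shows how the mask is split between the two sequences.

```python
from typing import List, Optional


def token_type_mask(token_ids_0: List[int],
                    token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Hypothetical sketch: 0 for each ID in the first sequence, 1 for the second."""
    mask = [0] * len(token_ids_0)
    if token_ids_1 is not None:
        mask += [1] * len(token_ids_1)
    return mask


single = token_type_mask([49406, 320, 1876, 49407])
pair = token_type_mask([320, 1876], [320])
print(single)  # [0, 0, 0, 0]
print(pair)    # [0, 0, 1]
```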

get_vocab(): Returns the vocabulary as a dict.

save_vocabulary(save_directory, filename_prefix=None): Saves the vocabulary to save_directory.

tokenize(text, pair=None, add_special_tokens=True, **kwargs): Tokenizes the input text.

property vocab_size: Gets the vocabulary size.