mindformers.models.clip.CLIPTokenizer

class mindformers.models.clip.CLIPTokenizer(vocab_file, eos_token='<|endoftext|>', bos_token='<|startoftext|>', pad_token='<|endoftext|>', unk_token='<|endoftext|>', add_bos_token=True, add_eos_token=True)[source]

CLIP Tokenizer

Args:

vocab_file (str): The vocabulary file path.
eos_token (str): The token that represents the end of a sentence. Default: '<|endoftext|>'.
bos_token (str): The token that represents the beginning of a sentence. Default: '<|startoftext|>'.
pad_token (str): The token used for padding. Default: '<|endoftext|>'.
unk_token (str): The token that represents an unknown token. Default: '<|endoftext|>'.
add_prefix_space (bool): Whether to add a whitespace in front of the text. Default: False.
add_bos_token (bool): Whether to add the bos_token_id to the left of the input. Default: True.
add_eos_token (bool): Whether to add the eos_token_id to the right of the input. Default: True.
**kwargs: Other keyword arguments that are passed to the base class of the tokenizer.

Examples:
>>> from mindformers import CLIPTokenizer
>>> CLIPTokenizer.show_support_list()
    INFO - support list of CLIPTokenizer is:
    INFO -    ['clip_vit_b_32']
    INFO - -------------------------------------
>>> tokenizer = CLIPTokenizer.from_pretrained('clip_vit_b_32')
>>> tokenizer("a boy")
    {'input_ids': [49406, 320, 1876, 49407], 'attention_mask': [1, 1, 1, 1]}
FILE_LIST: List[str] = ['tokenizer_config.json']

clip tokenizer

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Insert the special tokens into the input ids. Currently, only token_ids_0 given as a list of ids is supported.
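The single-sequence case can be illustrated with a minimal sketch in plain Python, so it runs without MindFormers installed. The helper name `wrap_with_special_tokens` is hypothetical, and the ids 49406 and 49407 are read off the `tokenizer("a boy")` example output on the assumption that they are the bos and eos ids. This is an illustration of the idea, not the library's implementation.

```python
from typing import List

# Assumed ids, read off the "a boy" example output; not taken from the library source.
BOS_ID = 49406
EOS_ID = 49407


def wrap_with_special_tokens(token_ids: List[int],
                             add_bos_token: bool = True,
                             add_eos_token: bool = True) -> List[int]:
    """Hypothetical helper: return a new list with bos/eos ids around token_ids."""
    prefix = [BOS_ID] if add_bos_token else []
    suffix = [EOS_ID] if add_eos_token else []
    return prefix + list(token_ids) + suffix


print(wrap_with_special_tokens([320, 1876]))  # [49406, 320, 1876, 49407]
```

With both flags left at their defaults, the sketch reproduces the input_ids shown in the class example.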

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int][source]

Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT sequence pair mask has the following format:

    0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
    | first sequence    | second sequence |

If token_ids_1 is None, only the first portion of the mask (0s) is returned.

Args:

token_ids_0 (List[int]): List of IDs.
token_ids_1 (List[int], optional): Optional second list of IDs for sequence pairs.

Returns:

List[int]: List of token type IDs according to the given sequence(s).
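The mask layout described above can be sketched in a few lines of plain Python. The helper name `segment_mask` is hypothetical, and the sketch assumes each input list already contains whatever special tokens belong to its segment, since this page does not state how special tokens are counted.

```python
from typing import List, Optional


def segment_mask(token_ids_0: List[int],
                 token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Hypothetical helper: 0 per first-sequence position, 1 per second-sequence position."""
    mask = [0] * len(token_ids_0)
    if token_ids_1 is not None:
        mask += [1] * len(token_ids_1)
    return mask


print(segment_mask([49406, 320, 1876, 49407]))       # [0, 0, 0, 0]
print(segment_mask([320, 1876], [320, 1876, 1876]))  # [0, 0, 1, 1, 1]
```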

get_vocab()[source]

Returns the vocabulary as a dict.
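As a rough illustration of working with such a dict, the sketch below inverts a toy vocabulary to look tokens up by id. It assumes the dict maps token strings to integer ids, which this page does not confirm. The two special-token ids follow the `tokenizer("a boy")` example output, and the remaining entries are made up.

```python
from typing import Dict

# Toy stand-in for the dict returned by get_vocab(); the ordinary-token
# entries are made up, and the token -> id direction is an assumption.
toy_vocab: Dict[str, int] = {
    "<|startoftext|>": 49406,
    "<|endoftext|>": 49407,
    "hello": 0,
    "world": 1,
}

# Invert the mapping so that tokens can be looked up by id.
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

print(id_to_token[49406])  # <|startoftext|>
print(len(toy_vocab))      # 4
```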

save_vocabulary(save_directory, filename_prefix=None)[source]

Save the vocabulary to save_directory.

tokenize(text, pair=None, add_special_tokens=True, **kwargs)[source]

Tokenize the input text.

property vocab_size

Get the vocabulary size.