mindformers.models.clip.CLIPTokenizer
- class mindformers.models.clip.CLIPTokenizer(vocab_file, eos_token='<|endoftext|>', bos_token='<|startoftext|>', pad_token='<|endoftext|>', unk_token='<|endoftext|>', add_bos_token=True, add_eos_token=True)
CLIP Tokenizer
- Args:
vocab_file (str): The vocabulary file path.
eos_token (str): The token that represents the end of a sentence. Default: “<|endoftext|>”.
bos_token (str): The token that represents the beginning of a sentence. Default: “<|startoftext|>”.
pad_token (str): The token used for padding. Default: “<|endoftext|>”.
unk_token (str): The token that represents an unknown token. Default: “<|endoftext|>”.
add_prefix_space (bool): Whether to add a whitespace in front of the text. Default: False.
add_bos_token (bool): Whether to add the bos_token_id to the left of the input. Default: True.
add_eos_token (bool): Whether to add the eos_token_id to the right of the input. Default: True.
**kwargs: Other keyword arguments that are passed to the base class of the tokenizer.
- Examples:
>>> from mindformers import CLIPTokenizer
>>> CLIPTokenizer.show_support_list()
INFO - support list of CLIPTokenizer is:
INFO - ['clip_vit_b_32']
INFO - -------------------------------------
>>> tokenizer = CLIPTokenizer.from_pretrained('clip_vit_b_32')
>>> tokenizer("a boy")
{'input_ids': [49406, 320, 1876, 49407], 'attention_mask': [1, 1, 1, 1]}
- FILE_LIST: List[str] = ['tokenizer_config.json']
clip tokenizer
- build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)
Inserts the special tokens into the input IDs. Currently, token_ids_0 is supported as a list of IDs.
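As a rough illustration of what inserting special tokens means here, the following standalone sketch reproduces the `tokenizer("a boy")` output from the Examples section. It does not call MindFormers; `add_special_tokens`, `BOS_ID`, and `EOS_ID` are hypothetical names, and the two ID values are simply read off the example output for 'clip_vit_b_32'. The library's actual implementation may differ.

```python
from typing import List

# Assumed IDs, taken from the example output [49406, 320, 1876, 49407]
# shown for 'clip_vit_b_32'; other vocabularies may use different values.
BOS_ID = 49406
EOS_ID = 49407


def add_special_tokens(
    token_ids: List[int],
    add_bos: bool = True,
    add_eos: bool = True,
    bos_id: int = BOS_ID,
    eos_id: int = EOS_ID,
) -> List[int]:
    """Hypothetical helper: wrap a list of IDs with BOS and EOS IDs."""
    out = list(token_ids)  # copy so the caller's list is not modified
    if add_bos:
        out = [bos_id] + out  # BOS goes to the left of the input
    if add_eos:
        out = out + [eos_id]  # EOS goes to the right of the input
    return out


print(add_special_tokens([320, 1876]))  # [49406, 320, 1876, 49407]
```

The `add_bos` and `add_eos` flags mirror the documented add_bos_token and add_eos_token arguments; turning either off simply skips the corresponding ID.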
- create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) -> List[int]
Creates a mask from the two sequences passed to be used in a sequence-pair classification task. An ALBERT sequence pair mask has the following format:
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
If token_ids_1 is None, only the first portion of the mask (0s) is returned.
- Args:
- token_ids_0 (List[int]):
List of ids.
- token_ids_1 (List[int], optional):
Optional second list of IDs for sequence pairs.
- Returns:
List[int]: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
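The mask format described for this method can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's code: `sketch_token_type_ids` is a hypothetical name, and the sketch assumes each input list already contains whatever special tokens belong to that sequence, since the page does not say how they are counted.

```python
from typing import List, Optional


def sketch_token_type_ids(
    first: List[int], second: Optional[List[int]] = None
) -> List[int]:
    """Hypothetical helper: 0s for the first sequence, 1s for the second."""
    mask = [0] * len(first)
    if second is not None:
        # Only a sequence pair gets the trailing block of 1s.
        mask += [1] * len(second)
    return mask


# A single sequence yields only zeros.
print(sketch_token_type_ids([49406, 320, 1876, 49407]))  # [0, 0, 0, 0]

# A pair yields zeros followed by ones.
print(sketch_token_type_ids([1, 2, 3], [4, 5]))  # [0, 0, 0, 1, 1]
```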
- property vocab_size
Gets the vocabulary size.