mindformers.models.clip.CLIPTokenizer

class mindformers.models.clip.CLIPTokenizer(vocab_file: str, eos_token: str = '<|endoftext|>', bos_token: str = '<|startoftext|>', pad_token: str = '<|endoftext|>', unk_token: str = '<|endoftext|>')[source]

Tokenizer for CLIP models.

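Parameters
  • vocab_file (str) – Path to the vocabulary file.

  • eos_token (str) – End-of-sequence token. Default: '<|endoftext|>'.

  • bos_token (str) – Beginning-of-sequence token. Default: '<|startoftext|>'.

  • pad_token (str) – Padding token. Default: '<|endoftext|>'.

  • unk_token (str) – Token used for unknown (out-of-vocabulary) input. Default: '<|endoftext|>'.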

Examples

>>> from mindformers import CLIPTokenizer
>>> CLIPTokenizer.show_support_list()
    INFO - support list of CLIPTokenizer is:
    INFO -    ['clip_vit_b_32']
    INFO - -------------------------------------
>>> tokenizer = CLIPTokenizer.from_pretrained('clip_vit_b_32')
>>> tokenizer("a boy")
    {'input_ids': [49406, 320, 1876, 49407], 'attention_mask': [1, 1, 1, 1]}
FILE_LIST = ['tokenizer_config.json']

clip tokenizer

build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)[source]

Insert the special tokens into the input IDs. Currently, only token_ids_0, given as a list of IDs, is supported.
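The effect of inserting special tokens can be illustrated with a minimal sketch. It is not the library's implementation: `wrap_with_special_tokens` is a hypothetical helper, and the IDs 49406 and 49407 are assumed to be the `bos_token` and `eos_token` IDs, read off the `tokenizer("a boy")` output shown earlier rather than queried from the tokenizer.

```python
from typing import List

# Assumed special-token IDs, taken from the "a boy" example output;
# a real tokenizer would look these up from its vocabulary.
BOS_ID = 49406  # assumed ID of '<|startoftext|>'
EOS_ID = 49407  # assumed ID of '<|endoftext|>'


def wrap_with_special_tokens(token_ids_0: List[int]) -> List[int]:
    """Hypothetical helper: prepend BOS and append EOS to a single sequence."""
    return [BOS_ID] + list(token_ids_0) + [EOS_ID]


# 320 and 1876 are the IDs for "a" and "boy" in the example output.
wrapped = wrap_with_special_tokens([320, 1876])
print(wrapped)  # [49406, 320, 1876, 49407]
```

Under these assumptions, the result matches the `input_ids` in the example, which suggests the two middle IDs correspond to the text and the outer two to the special tokens.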

save_vocabulary(save_directory, filename_prefix)[source]

Save the vocabulary to save_directory.

tokenize(text)[source]

Tokenize the input text.

property vocab_size

Get the vocabulary size.