mindformers.models.clip.CLIPModel

class mindformers.models.clip.CLIPModel(config: CLIPConfig)

CLIP model. The supported model names can be listed with CLIPModel.show_support_list().

Args:

config (CLIPConfig): The configuration of the CLIP model, given as a CLIPConfig instance.

Examples:

>>> from mindformers import CLIPModel
>>> CLIPModel.show_support_list()
    INFO - support list of CLIPModel is:
    INFO -    ['clip_vit_b_32']
    INFO - -------------------------------------
>>> model = CLIPModel.from_pretrained('clip_vit_b_32')
>>> type(model)
    <class 'mindformers.models.clip.clip.CLIPModel'>

build_attention_mask(): Build the attention mask.
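The reference gives no further detail on this method. In the original CLIP text encoder, the attention mask is causal: each token may attend only to itself and to earlier positions. The sketch below illustrates that kind of mask in NumPy. The function name `build_causal_mask` and the sequence length are hypothetical, and this is an illustration of the general technique, not the MindFormers implementation.

```python
import numpy as np


def build_causal_mask(seq_len: int) -> np.ndarray:
    """Return a (seq_len, seq_len) additive causal mask.

    Entries strictly above the diagonal are -inf (attention blocked);
    all other entries are 0 (attention allowed).
    Hypothetical helper, not part of MindFormers.
    """
    # Start from a matrix filled with -inf ...
    full = np.full((seq_len, seq_len), -np.inf, dtype=np.float32)
    # ... and keep -inf only strictly above the diagonal; the rest becomes 0.
    return np.triu(full, k=1)


mask = build_causal_mask(4)
print(mask)
```

A mask of this form is added to the attention scores before the softmax, so blocked positions receive zero weight.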

get_dtype(dtype: str): Get the data type named by the string dtype.

get_image_features(image: Tensor, pixel_values: Optional[Tensor] = None)

Get image features.

Args:

image (ms.Tensor): An image tensor processed by image_processor.

pixel_values (Optional[ms.Tensor]): Equivalent to “image”; if “pixel_values” is set, “image” is ignored.

Returns:

Image features.

Examples:

>>> import numpy as np
>>> from mindformers import CLIPModel, CLIPProcessor
>>> processor = CLIPProcessor.from_pretrained('clip_vit_b_32')
>>> model = CLIPModel.from_pretrained('clip_vit_b_32')
>>> fake_image_batch = np.random.random((5, 3, 578, 213))
>>> model.get_image_features(processor.image_processor(fake_image_batch))
    Tensor(shape=[5, 512], dtype=Float32, value=
    [[-1.50102973e-001, -2.63687313e-001, -5.65953791e-001 ... -2.93511450e-001],
     [-1.50103331e-001, -2.63622820e-001, -5.65623760e-001 ... -2.93337226e-001],
     [-1.50102973e-001, -2.63687313e-001, -5.65953791e-001 ... -2.93511450e-001],
     [-1.49712294e-001, -2.64100820e-001, -5.65740824e-001 ... -2.93599486e-001],
     [-1.50102973e-001, -2.63687313e-001, -5.65953791e-001 ... -2.93511450e-001]])
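The Args description states that “pixel_values” takes precedence over “image” when both are given. A minimal sketch of that precedence rule follows; `resolve_alias` and the placeholder string values are hypothetical and only illustrate the documented behavior, not how the library implements it. The same rule is documented for “text” and “input_ids” in get_text_features.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def resolve_alias(primary: Optional[T], alias: Optional[T] = None) -> Optional[T]:
    """Return `alias` when it is provided, otherwise fall back to `primary`.

    Hypothetical helper mirroring the documented rule:
    if "pixel_values" is set, "image" is ignored.
    """
    return alias if alias is not None else primary


# Both given: the alias ("pixel_values") wins and the primary ("image") is ignored.
chosen = resolve_alias("image_tensor", alias="pixel_values_tensor")
# Only the primary given: it is used as-is.
fallback = resolve_alias("image_tensor")
print(chosen, fallback)
```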

get_text_features(text: Tensor, input_ids: Optional[Tensor] = None)

Get text features.

Args:

text (ms.Tensor): A tensor of text token ids produced by the tokenizer.

input_ids (Optional[ms.Tensor]): Equivalent to “text”; if “input_ids” is set, “text” is ignored.

Returns:

Text features.

Examples:

>>> from mindformers import CLIPModel, CLIPProcessor
>>> processor = CLIPProcessor.from_pretrained('clip_vit_b_32')
>>> model = CLIPModel.from_pretrained('clip_vit_b_32')
>>> fake_text_batch = ["a boy", "a girl", "a women", "a men"]
>>> text = processor.tokenizer(
...    fake_text_batch, max_length=77, padding="max_length", return_tensors="ms"
...    )["input_ids"]
>>> model.get_text_features(text)
    Tensor(shape=[4, 512], dtype=Float32, value=
    [[6.03631809e-002, 1.79528534e-001, ... -2.23753393e-001, 1.42413378e-002],
     [1.28974199e-001, 7.46373609e-002, ... -3.68579805e-001, 1.53980583e-001],
     [9.89909172e-002, 2.01410800e-002, ... -2.54495114e-001, 7.68117979e-002],
     [3.16975415e-002, 2.26992741e-001, ... -5.22942394e-002, 1.98922127e-001]])
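Image and text features share the same width (512 in the examples above), so one common way to use them together is to L2-normalize both and compare them by cosine similarity. The sketch below shows this with NumPy, using random arrays as stand-ins for the real outputs. The helper names, the random data, and the softmax step are assumptions for illustration; they are not part of the CLIPModel API.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


# Random stand-ins with the shapes shown in the examples above:
# 5 images and 4 texts, each embedded in 512 dimensions.
rng = np.random.default_rng(0)
image_features = rng.standard_normal((5, 512)).astype(np.float32)
text_features = rng.standard_normal((4, 512)).astype(np.float32)

# Cosine similarity between every image and every text: shape (5, 4).
similarity = l2_normalize(image_features) @ l2_normalize(text_features).T

# Per-image distribution over the texts, and the best-matching text index.
probs = softmax(similarity, axis=-1)
best_text = probs.argmax(axis=-1)
print(similarity.shape, best_text.shape)
```

With real model outputs, the MindSpore tensors would first need converting to NumPy arrays, for example with `Tensor.asnumpy()`.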