clip

mindnlp.transformers.models.clip.configuration_clip.CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP = {'openai/clip-vit-base-patch32': 'https://hf-mirror.com/openai/clip-vit-base-patch32/resolve/main/config.json'} module-attribute

mindnlp.transformers.models.clip.configuration_clip.CLIPConfig

Bases: PretrainedConfig

[CLIPConfig] is the configuration class to store the configuration of a [CLIPModel]. It is used to instantiate a CLIP model according to the specified arguments, defining the text model and vision model configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the CLIP openai/clip-vit-base-patch32 architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
text_config

Dictionary of configuration options used to initialize [CLIPTextConfig].

TYPE: `dict`, *optional* DEFAULT: None

vision_config

Dictionary of configuration options used to initialize [CLIPVisionConfig].

TYPE: `dict`, *optional* DEFAULT: None

projection_dim

Dimensionality of text and vision projection layers.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

logit_scale_init_value

The initial value of the logit_scale parameter. Default is used as per the original CLIP implementation.

TYPE: `float`, *optional*, defaults to 2.6592 DEFAULT: 2.6592

kwargs

Dictionary of keyword arguments.

TYPE: *optional* DEFAULT: {}

Example
>>> from transformers import CLIPConfig, CLIPModel
...
>>> # Initializing a CLIPConfig with openai/clip-vit-base-patch32 style configuration
>>> configuration = CLIPConfig()
...
>>> # Initializing a CLIPModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
>>> model = CLIPModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
...
>>> # We can also initialize a CLIPConfig from a CLIPTextConfig and a CLIPVisionConfig
>>> from transformers import CLIPTextConfig, CLIPVisionConfig
...
>>> # Initializing a CLIPText and CLIPVision configuration
>>> config_text = CLIPTextConfig()
>>> config_vision = CLIPVisionConfig()
...
>>> config = CLIPConfig.from_text_vision_configs(config_text, config_vision)
Source code in mindnlp/transformers/models/clip/configuration_clip.py
class CLIPConfig(PretrainedConfig):
    r"""
    [`CLIPConfig`] is the configuration class to store the configuration of a [`CLIPModel`]. It is used to instantiate
    a CLIP model according to the specified arguments, defining the text model and vision model configs. Instantiating
    a configuration with the defaults will yield a similar configuration to that of the CLIP
    [openai/clip-vit-base-patch32](https://hf-mirror.com/openai/clip-vit-base-patch32) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        text_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`CLIPTextConfig`].
        vision_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`CLIPVisionConfig`].
        projection_dim (`int`, *optional*, defaults to 512):
            Dimensionality of text and vision projection layers.
        logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
            The initial value of the *logit_scale* parameter. Default is used as per the original CLIP implementation.
        kwargs (*optional*):
            Dictionary of keyword arguments.

    Example:
        ```python
        >>> from transformers import CLIPConfig, CLIPModel
        ...
        >>> # Initializing a CLIPConfig with openai/clip-vit-base-patch32 style configuration
        >>> configuration = CLIPConfig()
        ...
        >>> # Initializing a CLIPModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
        >>> model = CLIPModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ...
        >>> # We can also initialize a CLIPConfig from a CLIPTextConfig and a CLIPVisionConfig
        >>> from transformers import CLIPTextConfig, CLIPVisionConfig
        ...
        >>> # Initializing a CLIPText and CLIPVision configuration
        >>> config_text = CLIPTextConfig()
        >>> config_vision = CLIPVisionConfig()
        ...
        >>> config = CLIPConfig.from_text_vision_configs(config_text, config_vision)
        ```
    """
    model_type = "clip"

    def __init__(
        self, text_config=None, vision_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs
    ):
        """
        Initializes a new instance of CLIPConfig.

        Args:
            self: The instance of the class.
            text_config (dict): The configuration for text inputs. If provided, overrides default values. Default is None.
            vision_config (dict): The configuration for vision inputs. If provided, overrides default values. Default is None.
            projection_dim (int): The dimension of the projection. Default is 512.
            logit_scale_init_value (float): The initial value for logit scaling. Default is 2.6592.

        Returns:
            None

        Raises:
            TypeError: If text_config or vision_config are not of type dict.
            ValueError: If projection_dim or logit_scale_init_value are not of type int or float respectively.
            KeyError: If 'transformers_version' key is present in text_config or vision_config.
            AttributeError: If 'id2label' key is not present in vision_config.
        """
        # If `_config_dict` exist, we use them for the backward compatibility.
        # We pop out these 2 attributes before calling `super().__init__` to avoid them being saved (which causes a lot
        # of confusion!).
        text_config_dict = kwargs.pop("text_config_dict", None)
        vision_config_dict = kwargs.pop("vision_config_dict", None)

        super().__init__(**kwargs)

        # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in
        # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be same in most
        # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`.
        if text_config_dict is not None:
            if text_config is None:
                text_config = {}

            # This is the complete result when using `text_config_dict`.
            _text_config_dict = CLIPTextConfig(**text_config_dict).to_dict()

            # Give a warning if the values exist in both `_text_config_dict` and `text_config` but being different.
            for key, value in _text_config_dict.items():
                if key in text_config and value != text_config[key] and key not in ["transformers_version"]:
                    # If specified in `text_config_dict`
                    if key in text_config_dict:
                        message = (
                            f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. "
                            f'The value `text_config_dict["{key}"]` will be used instead.'
                        )
                    # If inferred from default argument values (just to be super careful)
                    else:
                        message = (
                            f"`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The "
                            f'value `text_config["{key}"]` will be overridden.'
                        )
                    logger.info(message)

            # Update all values in `text_config` with the ones in `_text_config_dict`.
            text_config.update(_text_config_dict)

        if vision_config_dict is not None:
            if vision_config is None:
                vision_config = {}

            # This is the complete result when using `vision_config_dict`.
            _vision_config_dict = CLIPVisionConfig(**vision_config_dict).to_dict()
            # convert keys to string instead of integer
            if "id2label" in _vision_config_dict:
                _vision_config_dict["id2label"] = {
                    str(key): value for key, value in _vision_config_dict["id2label"].items()
                }

            # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but being different.
            for key, value in _vision_config_dict.items():
                if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]:
                    # If specified in `vision_config_dict`
                    if key in vision_config_dict:
                        message = (
                            f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different "
                            f'values. The value `vision_config_dict["{key}"]` will be used instead.'
                        )
                    # If inferred from default argument values (just to be super careful)
                    else:
                        message = (
                            f"`vision_config_dict` is provided which will be used to initialize `CLIPVisionConfig`. "
                            f'The value `vision_config["{key}"]` will be overridden.'
                        )
                    logger.info(message)

            # Update all values in `vision_config` with the ones in `_vision_config_dict`.
            vision_config.update(_vision_config_dict)

        if text_config is None:
            text_config = {}
            logger.info("`text_config` is `None`. Initializing the `CLIPTextConfig` with default values.")

        if vision_config is None:
            vision_config = {}
            logger.info("`vision_config` is `None`. initializing the `CLIPVisionConfig` with default values.")

        self.text_config = CLIPTextConfig(**text_config)
        self.vision_config = CLIPVisionConfig(**vision_config)

        self.projection_dim = projection_dim
        self.logit_scale_init_value = logit_scale_init_value
        self.initializer_factor = 1.0

    @classmethod
    def from_text_vision_configs(cls, text_config: CLIPTextConfig, vision_config: CLIPVisionConfig, **kwargs):
        r"""
        Instantiate a [`CLIPConfig`] (or a derived class) from clip text model configuration and clip vision model
        configuration.

        Returns:
            [`CLIPConfig`]: An instance of a configuration object
        """
        return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)

mindnlp.transformers.models.clip.configuration_clip.CLIPConfig.__init__(text_config=None, vision_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs)

Initializes a new instance of CLIPConfig.

PARAMETER DESCRIPTION
self

The instance of the class.

text_config

The configuration for text inputs. If provided, overrides default values. Default is None.

TYPE: dict DEFAULT: None

vision_config

The configuration for vision inputs. If provided, overrides default values. Default is None.

TYPE: dict DEFAULT: None

projection_dim

The dimension of the projection. Default is 512.

TYPE: int DEFAULT: 512

logit_scale_init_value

The initial value for logit scaling. Default is 2.6592.

TYPE: float DEFAULT: 2.6592

RETURNS DESCRIPTION

None

RAISES DESCRIPTION
TypeError

If text_config or vision_config are not of type dict.

ValueError

If projection_dim or logit_scale_init_value are not of type int or float respectively.

KeyError

If 'transformers_version' key is present in text_config or vision_config.

AttributeError

If 'id2label' key is not present in vision_config.
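
The following sketch illustrates how nested dictionaries passed as text_config and vision_config are used to build the sub-configurations. The override values are hypothetical, and the import mirrors the examples above (use mindnlp.transformers when working with MindNLP directly).

>>> from transformers import CLIPConfig
...
>>> # Hypothetical overrides; any keys accepted by CLIPTextConfig / CLIPVisionConfig may be used
>>> config = CLIPConfig(
...     text_config={"hidden_size": 256, "num_hidden_layers": 6},
...     vision_config={"image_size": 224, "patch_size": 16},
...     projection_dim=256,
... )
>>> assert config.text_config.hidden_size == 256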

Source code in mindnlp/transformers/models/clip/configuration_clip.py
def __init__(
    self, text_config=None, vision_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs
):
    """
    Initializes a new instance of CLIPConfig.

    Args:
        self: The instance of the class.
        text_config (dict): The configuration for text inputs. If provided, overrides default values. Default is None.
        vision_config (dict): The configuration for vision inputs. If provided, overrides default values. Default is None.
        projection_dim (int): The dimension of the projection. Default is 512.
        logit_scale_init_value (float): The initial value for logit scaling. Default is 2.6592.

    Returns:
        None

    Raises:
        TypeError: If text_config or vision_config are not of type dict.
        ValueError: If projection_dim or logit_scale_init_value are not of type int or float respectively.
        KeyError: If 'transformers_version' key is present in text_config or vision_config.
        AttributeError: If 'id2label' key is not present in vision_config.
    """
    # If `_config_dict` exist, we use them for the backward compatibility.
    # We pop out these 2 attributes before calling `super().__init__` to avoid them being saved (which causes a lot
    # of confusion!).
    text_config_dict = kwargs.pop("text_config_dict", None)
    vision_config_dict = kwargs.pop("vision_config_dict", None)

    super().__init__(**kwargs)

    # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in
    # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be same in most
    # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`.
    if text_config_dict is not None:
        if text_config is None:
            text_config = {}

        # This is the complete result when using `text_config_dict`.
        _text_config_dict = CLIPTextConfig(**text_config_dict).to_dict()

        # Give a warning if the values exist in both `_text_config_dict` and `text_config` but being different.
        for key, value in _text_config_dict.items():
            if key in text_config and value != text_config[key] and key not in ["transformers_version"]:
                # If specified in `text_config_dict`
                if key in text_config_dict:
                    message = (
                        f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. "
                        f'The value `text_config_dict["{key}"]` will be used instead.'
                    )
                # If inferred from default argument values (just to be super careful)
                else:
                    message = (
                        f"`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The "
                        f'value `text_config["{key}"]` will be overridden.'
                    )
                logger.info(message)

        # Update all values in `text_config` with the ones in `_text_config_dict`.
        text_config.update(_text_config_dict)

    if vision_config_dict is not None:
        if vision_config is None:
            vision_config = {}

        # This is the complete result when using `vision_config_dict`.
        _vision_config_dict = CLIPVisionConfig(**vision_config_dict).to_dict()
        # convert keys to string instead of integer
        if "id2label" in _vision_config_dict:
            _vision_config_dict["id2label"] = {
                str(key): value for key, value in _vision_config_dict["id2label"].items()
            }

        # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but being different.
        for key, value in _vision_config_dict.items():
            if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]:
                # If specified in `vision_config_dict`
                if key in vision_config_dict:
                    message = (
                        f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different "
                        f'values. The value `vision_config_dict["{key}"]` will be used instead.'
                    )
                # If inferred from default argument values (just to be super careful)
                else:
                    message = (
                        f"`vision_config_dict` is provided which will be used to initialize `CLIPVisionConfig`. "
                        f'The value `vision_config["{key}"]` will be overridden.'
                    )
                logger.info(message)

        # Update all values in `vision_config` with the ones in `_vision_config_dict`.
        vision_config.update(_vision_config_dict)

    if text_config is None:
        text_config = {}
        logger.info("`text_config` is `None`. Initializing the `CLIPTextConfig` with default values.")

    if vision_config is None:
        vision_config = {}
        logger.info("`vision_config` is `None`. initializing the `CLIPVisionConfig` with default values.")

    self.text_config = CLIPTextConfig(**text_config)
    self.vision_config = CLIPVisionConfig(**vision_config)

    self.projection_dim = projection_dim
    self.logit_scale_init_value = logit_scale_init_value
    self.initializer_factor = 1.0

mindnlp.transformers.models.clip.configuration_clip.CLIPConfig.from_text_vision_configs(text_config, vision_config, **kwargs) classmethod

Instantiate a [CLIPConfig] (or a derived class) from clip text model configuration and clip vision model configuration.

RETURNS DESCRIPTION

[CLIPConfig]: An instance of a configuration object
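
As a brief sketch (values are illustrative), extra keyword arguments passed to this classmethod are forwarded to the resulting CLIPConfig:

>>> from transformers import CLIPConfig, CLIPTextConfig, CLIPVisionConfig
...
>>> text_cfg = CLIPTextConfig(hidden_size=256)
>>> vision_cfg = CLIPVisionConfig(patch_size=16)
>>> # projection_dim is forwarded as a regular CLIPConfig keyword argument
>>> config = CLIPConfig.from_text_vision_configs(text_cfg, vision_cfg, projection_dim=256)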

Source code in mindnlp/transformers/models/clip/configuration_clip.py
@classmethod
def from_text_vision_configs(cls, text_config: CLIPTextConfig, vision_config: CLIPVisionConfig, **kwargs):
    r"""
    Instantiate a [`CLIPConfig`] (or a derived class) from clip text model configuration and clip vision model
    configuration.

    Returns:
        [`CLIPConfig`]: An instance of a configuration object
    """
    return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)

mindnlp.transformers.models.clip.configuration_clip.CLIPTextConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [CLIPTextModel]. It is used to instantiate a CLIP text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the text encoder of the CLIP openai/clip-vit-base-patch32 architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vocab_size

Vocabulary size of the CLIP text model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling [CLIPModel].

TYPE: `int`, *optional*, defaults to 49408 DEFAULT: 49408

hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

intermediate_size

Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 2048 DEFAULT: 2048

projection_dim

Dimensionality of text and vision projection layers.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 8 DEFAULT: 8

max_position_embeddings

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

TYPE: `int`, *optional*, defaults to 77 DEFAULT: 77

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu", "gelu_new" and "quick_gelu" are supported.

TYPE: `str` or `function`, *optional*, defaults to `"quick_gelu"` DEFAULT: 'quick_gelu'

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-05 DEFAULT: 1e-05

attention_dropout

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

initializer_factor

A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

pad_token_id

Padding token id.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

bos_token_id

Beginning of stream token id.

TYPE: `int`, *optional*, defaults to 49406 DEFAULT: 49406

eos_token_id

End of stream token id.

TYPE: `int`, *optional*, defaults to 49407 DEFAULT: 49407

Example
>>> from transformers import CLIPTextConfig, CLIPTextModel
...
>>> # Initializing a CLIPTextConfig with openai/clip-vit-base-patch32 style configuration
>>> configuration = CLIPTextConfig()
...
>>> # Initializing a CLIPTextModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
>>> model = CLIPTextModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in mindnlp/transformers/models/clip/configuration_clip.py
class CLIPTextConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`CLIPTextModel`]. It is used to instantiate a CLIP
    text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the text encoder of the CLIP
    [openai/clip-vit-base-patch32](https://hf-mirror.com/openai/clip-vit-base-patch32) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 49408):
            Vocabulary size of the CLIP text model. Defines the number of different tokens that can be represented by
            the `inputs_ids` passed when calling [`CLIPModel`].
        hidden_size (`int`, *optional*, defaults to 512):
            Dimensionality of the encoder layers and the pooler layer.
        intermediate_size (`int`, *optional*, defaults to 2048):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        projection_dim (`int`, *optional*, defaults to 512):
            Dimensionality of text and vision projection layers.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 8):
            Number of attention heads for each attention layer in the Transformer encoder.
        max_position_embeddings (`int`, *optional*, defaults to 77):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"` and `"gelu_new"` `"quick_gelu"` are supported.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        initializer_factor (`float`, *optional*, defaults to 1.0):
            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
            testing).
        pad_token_id (`int`, *optional*, defaults to 1):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 49406):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 49407):
            End of stream token id.

    Example:
        ```python
        >>> from transformers import CLIPTextConfig, CLIPTextModel
        ...
        >>> # Initializing a CLIPTextConfig with openai/clip-vit-base-patch32 style configuration
        >>> configuration = CLIPTextConfig()
        ...
        >>> # Initializing a CLIPTextModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
        >>> model = CLIPTextModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "clip_text_model"

    def __init__(
        self,
        vocab_size=49408,
        hidden_size=512,
        intermediate_size=2048,
        projection_dim=512,
        num_hidden_layers=12,
        num_attention_heads=8,
        max_position_embeddings=77,
        hidden_act="quick_gelu",
        layer_norm_eps=1e-5,
        attention_dropout=0.0,
        initializer_range=0.02,
        initializer_factor=1.0,
        # This differs from `CLIPTokenizer`'s default and from openai/clip
        # See https://github.com/huggingface/transformers/pull/24773#issuecomment-1632287538
        pad_token_id=1,
        bos_token_id=49406,
        eos_token_id=49407,
        **kwargs,
    ):
        """
        Initialize CLIPTextConfig.

        Args:
            vocab_size (int, optional): The size of the vocabulary. Default is 49408.
            hidden_size (int, optional): The size of the hidden layers. Default is 512.
            intermediate_size (int, optional): The size of the intermediate layers. Default is 2048.
            projection_dim (int, optional): The projection dimension. Default is 512.
            num_hidden_layers (int, optional): The number of hidden layers. Default is 12.
            num_attention_heads (int, optional): The number of attention heads. Default is 8.
            max_position_embeddings (int, optional): The maximum position embeddings. Default is 77.
            hidden_act (str, optional): The type of activation function for the hidden layers. Default is 'quick_gelu'.
            layer_norm_eps (float, optional): Epsilon value for layer normalization. Default is 1e-05.
            attention_dropout (float, optional): The dropout rate for attention layers. Default is 0.0.
            initializer_range (float, optional): The range for parameter initializers. Default is 0.02.
            initializer_factor (float, optional): The factor for parameter initializers. Default is 1.0.
            pad_token_id (int, optional): The ID of the padding token. Default is 1.
            bos_token_id (int, optional): The ID of the beginning of sequence token. Default is 49406.
            eos_token_id (int, optional): The ID of the end of sequence token. Default is 49407.
            **kwargs: Additional keyword arguments.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.projection_dim = projection_dim
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.max_position_embeddings = max_position_embeddings
        self.layer_norm_eps = layer_norm_eps
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.initializer_factor = initializer_factor
        self.attention_dropout = attention_dropout

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        """
        Creates a CLIPTextConfig instance from a pretrained model.

        Args:
            cls (type): The class object.
            pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.

        Returns:
            PretrainedConfig: A CLIPTextConfig instance initialized with the configuration specified by the pretrained model.

        Raises:
            TypeError: If the input parameters are not of the expected types.
            ValueError: If the configuration dictionary does not contain the required information.
            Warning: If the model type being used for instantiation does not match the class's model type, which may lead to errors.
        """
        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the text config dict if we are loading from CLIPConfig
        if config_dict.get("model_type") == "clip":
            config_dict = config_dict["text_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.clip.configuration_clip.CLIPTextConfig.__init__(vocab_size=49408, hidden_size=512, intermediate_size=2048, projection_dim=512, num_hidden_layers=12, num_attention_heads=8, max_position_embeddings=77, hidden_act='quick_gelu', layer_norm_eps=1e-05, attention_dropout=0.0, initializer_range=0.02, initializer_factor=1.0, pad_token_id=1, bos_token_id=49406, eos_token_id=49407, **kwargs)

Initialize CLIPTextConfig.

PARAMETER DESCRIPTION
vocab_size

The size of the vocabulary. Default is 49408.

TYPE: int DEFAULT: 49408

hidden_size

The size of the hidden layers. Default is 512.

TYPE: int DEFAULT: 512

intermediate_size

The size of the intermediate layers. Default is 2048.

TYPE: int DEFAULT: 2048

projection_dim

The projection dimension. Default is 512.

TYPE: int DEFAULT: 512

num_hidden_layers

The number of hidden layers. Default is 12.

TYPE: int DEFAULT: 12

num_attention_heads

The number of attention heads. Default is 8.

TYPE: int DEFAULT: 8

max_position_embeddings

The maximum position embeddings. Default is 77.

TYPE: int DEFAULT: 77

hidden_act

The type of activation function for the hidden layers. Default is 'quick_gelu'.

TYPE: str DEFAULT: 'quick_gelu'

layer_norm_eps

Epsilon value for layer normalization. Default is 1e-05.

TYPE: float DEFAULT: 1e-05

attention_dropout

The dropout rate for attention layers. Default is 0.0.

TYPE: float DEFAULT: 0.0

initializer_range

The range for parameter initializers. Default is 0.02.

TYPE: float DEFAULT: 0.02

initializer_factor

The factor for parameter initializers. Default is 1.0.

TYPE: float DEFAULT: 1.0

pad_token_id

The ID of the padding token. Default is 1.

TYPE: int DEFAULT: 1

bos_token_id

The ID of the beginning of sequence token. Default is 49406.

TYPE: int DEFAULT: 49406

eos_token_id

The ID of the end of sequence token. Default is 49407.

TYPE: int DEFAULT: 49407

**kwargs

Additional keyword arguments.

DEFAULT: {}

RETURNS DESCRIPTION

None.
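
As an illustrative sketch (sizes chosen arbitrarily), a smaller text encoder configuration can be created by overriding a few defaults:

>>> from transformers import CLIPTextConfig
...
>>> # Hypothetical reduced-size text encoder; remaining fields keep their defaults
>>> text_config = CLIPTextConfig(hidden_size=256, num_hidden_layers=6, num_attention_heads=4)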

Source code in mindnlp/transformers/models/clip/configuration_clip.py
def __init__(
    self,
    vocab_size=49408,
    hidden_size=512,
    intermediate_size=2048,
    projection_dim=512,
    num_hidden_layers=12,
    num_attention_heads=8,
    max_position_embeddings=77,
    hidden_act="quick_gelu",
    layer_norm_eps=1e-5,
    attention_dropout=0.0,
    initializer_range=0.02,
    initializer_factor=1.0,
    # This differs from `CLIPTokenizer`'s default and from openai/clip
    # See https://github.com/huggingface/transformers/pull/24773#issuecomment-1632287538
    pad_token_id=1,
    bos_token_id=49406,
    eos_token_id=49407,
    **kwargs,
):
    """
    Initialize CLIPTextConfig.

    Args:
        vocab_size (int, optional): The size of the vocabulary. Default is 49408.
        hidden_size (int, optional): The size of the hidden layers. Default is 512.
        intermediate_size (int, optional): The size of the intermediate layers. Default is 2048.
        projection_dim (int, optional): The projection dimension. Default is 512.
        num_hidden_layers (int, optional): The number of hidden layers. Default is 12.
        num_attention_heads (int, optional): The number of attention heads. Default is 8.
        max_position_embeddings (int, optional): The maximum position embeddings. Default is 77.
        hidden_act (str, optional): The type of activation function for the hidden layers. Default is 'quick_gelu'.
        layer_norm_eps (float, optional): Epsilon value for layer normalization. Default is 1e-05.
        attention_dropout (float, optional): The dropout rate for attention layers. Default is 0.0.
        initializer_range (float, optional): The range for parameter initializers. Default is 0.02.
        initializer_factor (float, optional): The factor for parameter initializers. Default is 1.0.
        pad_token_id (int, optional): The ID of the padding token. Default is 1.
        bos_token_id (int, optional): The ID of the beginning of sequence token. Default is 49406.
        eos_token_id (int, optional): The ID of the end of sequence token. Default is 49407.
        **kwargs: Additional keyword arguments.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)

    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.intermediate_size = intermediate_size
    self.projection_dim = projection_dim
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.max_position_embeddings = max_position_embeddings
    self.layer_norm_eps = layer_norm_eps
    self.hidden_act = hidden_act
    self.initializer_range = initializer_range
    self.initializer_factor = initializer_factor
    self.attention_dropout = attention_dropout

mindnlp.transformers.models.clip.configuration_clip.CLIPTextConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) classmethod

Creates a CLIPTextConfig instance from a pretrained model.

PARAMETER DESCRIPTION
cls

The class object.

TYPE: type

pretrained_model_name_or_path

The name or path of the pretrained model.

TYPE: Union[str, PathLike]

RETURNS DESCRIPTION
PretrainedConfig

A CLIPTextConfig instance initialized with the configuration specified by the pretrained model.

TYPE: PretrainedConfig

RAISES DESCRIPTION
TypeError

If the input parameters are not of the expected types.

ValueError

If the configuration dictionary does not contain the required information.

Warning

If the model type being used for instantiation does not match the class's model type, which may lead to errors.
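
For illustration, and assuming the openai/clip-vit-base-patch32 checkpoint referenced elsewhere on this page is reachable, loading from a full CLIP checkpoint automatically extracts its text_config sub-dictionary:

>>> from transformers import CLIPTextConfig
...
>>> text_config = CLIPTextConfig.from_pretrained("openai/clip-vit-base-patch32")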

Source code in mindnlp/transformers/models/clip/configuration_clip.py
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
    """
    Creates a CLIPTextConfig instance from a pretrained model.

    Args:
        cls (type): The class object.
        pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.

    Returns:
        PretrainedConfig: A CLIPTextConfig instance initialized with the configuration specified by the pretrained model.

    Raises:
        TypeError: If the input parameters are not of the expected types.
        ValueError: If the configuration dictionary does not contain the required information.
        Warning: If the model type being used for instantiation does not match the class's model type, which may lead to errors.
    """
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

    # get the text config dict if we are loading from CLIPConfig
    if config_dict.get("model_type") == "clip":
        config_dict = config_dict["text_config"]

    if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
        logger.warning(
            f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
            f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
        )

    return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.clip.configuration_clip.CLIPVisionConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [CLIPVisionModel]. It is used to instantiate a CLIP vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the vision encoder of the CLIP openai/clip-vit-base-patch32 architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

intermediate_size

Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 3072 DEFAULT: 3072

projection_dim

Dimensionality of text and vision projection layers.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_channels

The number of input channels.

TYPE: `int`, *optional*, defaults to 3 DEFAULT: 3

image_size

The size (resolution) of each image.

TYPE: `int`, *optional*, defaults to 224 DEFAULT: 224

patch_size

The size (resolution) of each patch.

TYPE: `int`, *optional*, defaults to 32 DEFAULT: 32

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu", "gelu_new" and "quick_gelu" are supported.

TYPE: `str` or `function`, *optional*, defaults to `"quick_gelu"` DEFAULT: 'quick_gelu'

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-05 DEFAULT: 1e-05

attention_dropout

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

initializer_factor

A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

Example
>>> from transformers import CLIPVisionConfig, CLIPVisionModel
...
>>> # Initializing a CLIPVisionConfig with openai/clip-vit-base-patch32 style configuration
>>> configuration = CLIPVisionConfig()
...
>>> # Initializing a CLIPVisionModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
>>> model = CLIPVisionModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in mindnlp/transformers/models/clip/configuration_clip.py
class CLIPVisionConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`CLIPVisionModel`]. It is used to instantiate a
    CLIP vision encoder according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the vision encoder of the CLIP
    [openai/clip-vit-base-patch32](https://hf-mirror.com/openai/clip-vit-base-patch32) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        projection_dim (`int`, *optional*, defaults to 512):
            Dimensionality of text and vision projection layers.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        image_size (`int`, *optional*, defaults to 224):
            The size (resolution) of each image.
        patch_size (`int`, *optional*, defaults to 32):
            The size (resolution) of each patch.
        hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        initializer_factor (`float`, *optional*, defaults to 1.0):
            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
            testing).

    Example:
        ```python
        >>> from transformers import CLIPVisionConfig, CLIPVisionModel
        ...
        >>> # Initializing a CLIPVisionConfig with openai/clip-vit-base-patch32 style configuration
        >>> configuration = CLIPVisionConfig()
        ...
        >>> # Initializing a CLIPVisionModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
        >>> model = CLIPVisionModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "clip_vision_model"

    def __init__(
        self,
        hidden_size=768,
        intermediate_size=3072,
        projection_dim=512,
        num_hidden_layers=12,
        num_attention_heads=12,
        num_channels=3,
        image_size=224,
        patch_size=32,
        hidden_act="quick_gelu",
        layer_norm_eps=1e-5,
        attention_dropout=0.0,
        initializer_range=0.02,
        initializer_factor=1.0,
        **kwargs,
    ):
        """
        Initialize a CLIPVisionConfig object with the provided configuration parameters.

        Args:
            hidden_size (int): The size of the hidden layers in the network.
            intermediate_size (int): The size of the intermediate hidden layers in the network.
            projection_dim (int): The dimension of the projected embeddings.
            num_hidden_layers (int): The number of hidden layers in the network.
            num_attention_heads (int): The number of attention heads in the network.
            num_channels (int): The number of channels in the input image.
            image_size (int): The size of the input image.
            patch_size (int): The size of the image patch used in the network.
            hidden_act (str): The activation function used in the hidden layers.
            layer_norm_eps (float): The epsilon value for layer normalization.
            attention_dropout (float): The dropout rate for attention layers.
            initializer_range (float): The range for parameter initialization.
            initializer_factor (float): The factor for parameter initialization.

        Returns:
            None.

        Raises:
            ValueError: If any of the input parameters are invalid or out of range.
        """
        super().__init__(**kwargs)

        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.projection_dim = projection_dim
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_channels = num_channels
        self.patch_size = patch_size
        self.image_size = image_size
        self.initializer_range = initializer_range
        self.initializer_factor = initializer_factor
        self.attention_dropout = attention_dropout
        self.layer_norm_eps = layer_norm_eps
        self.hidden_act = hidden_act

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        """
        Load a pretrained configuration from a given model name or path.

        Args:
            cls (class): The class object.
            pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.
                It can be either a string representing the name of the model or a path-like object pointing to the model location.

        Returns:
            PretrainedConfig: The loaded pretrained configuration.

        Raises:
            None.

        This method is a class method that allows loading a pretrained configuration. It takes in the class object 'cls'
        and the name or path of the pretrained model 'pretrained_model_name_or_path' as parameters. The method returns an instance
        of type 'PretrainedConfig', which represents the loaded pretrained configuration.

        The 'pretrained_model_name_or_path' parameter can be either a string representing the name of the pretrained model
        or a path-like object pointing to the location of the model. It is used to identify and locate the pretrained model
        that needs to be loaded.

        Note: If the loaded configuration belongs to the 'clip' model type, the 'config_dict' will be updated to use the
        'vision_config' sub-dictionary. Additionally, if the 'model_type' attribute is present in the 'cls' class and
        the loaded configuration's 'model_type' is different from 'cls.model_type', a warning will be logged indicating
        that instantiating a model of different types may lead to errors.

        Example:
            ```python
            >>> config = CLIPVisionConfig.from_pretrained("clip_model")
            ...
            ```
            In the above example, the 'from_pretrained' method is called on the 'CLIPVisionConfig' class to load the pretrained
            configuration of the 'clip_model'. The resulting configuration is stored in the 'config' variable.
        """
        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the vision config dict if we are loading from CLIPConfig
        if config_dict.get("model_type") == "clip":
            config_dict = config_dict["vision_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.clip.configuration_clip.CLIPVisionConfig.__init__(hidden_size=768, intermediate_size=3072, projection_dim=512, num_hidden_layers=12, num_attention_heads=12, num_channels=3, image_size=224, patch_size=32, hidden_act='quick_gelu', layer_norm_eps=1e-05, attention_dropout=0.0, initializer_range=0.02, initializer_factor=1.0, **kwargs)

Initialize a CLIPVisionConfig object with the provided configuration parameters.

PARAMETER DESCRIPTION
hidden_size

The size of the hidden layers in the network.

TYPE: int DEFAULT: 768

intermediate_size

The size of the intermediate hidden layers in the network.

TYPE: int DEFAULT: 3072

projection_dim

The dimension of the projected embeddings.

TYPE: int DEFAULT: 512

num_hidden_layers

The number of hidden layers in the network.

TYPE: int DEFAULT: 12

num_attention_heads

The number of attention heads in the network.

TYPE: int DEFAULT: 12

num_channels

The number of channels in the input image.

TYPE: int DEFAULT: 3

image_size

The size of the input image.

TYPE: int DEFAULT: 224

patch_size

The size of the image patch used in the network.

TYPE: int DEFAULT: 32

hidden_act

The activation function used in the hidden layers.

TYPE: str DEFAULT: 'quick_gelu'

layer_norm_eps

The epsilon value for layer normalization.

TYPE: float DEFAULT: 1e-05

attention_dropout

The dropout rate for attention layers.

TYPE: float DEFAULT: 0.0

initializer_range

The range for parameter initialization.

TYPE: float DEFAULT: 0.02

initializer_factor

The factor for parameter initialization.

TYPE: float DEFAULT: 1.0

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If any of the input parameters are invalid or out of range.
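
A short illustrative sketch (values chosen arbitrarily): a higher-resolution vision configuration, where image_size should be a multiple of patch_size:

>>> from transformers import CLIPVisionConfig
...
>>> # Hypothetical 336x336 input with 14x14 patches; other fields keep their defaults
>>> vision_config = CLIPVisionConfig(image_size=336, patch_size=14)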

Source code in mindnlp/transformers/models/clip/configuration_clip.py
def __init__(
    self,
    hidden_size=768,
    intermediate_size=3072,
    projection_dim=512,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_channels=3,
    image_size=224,
    patch_size=32,
    hidden_act="quick_gelu",
    layer_norm_eps=1e-5,
    attention_dropout=0.0,
    initializer_range=0.02,
    initializer_factor=1.0,
    **kwargs,
):
    """
    Initialize a CLIPVisionConfig object with the provided configuration parameters.

    Args:
        hidden_size (int): The size of the hidden layers in the network.
        intermediate_size (int): The size of the intermediate hidden layers in the network.
        projection_dim (int): The dimension of the projected embeddings.
        num_hidden_layers (int): The number of hidden layers in the network.
        num_attention_heads (int): The number of attention heads in the network.
        num_channels (int): The number of channels in the input image.
        image_size (int): The size of the input image.
        patch_size (int): The size of the image patch used in the network.
        hidden_act (str): The activation function used in the hidden layers.
        layer_norm_eps (float): The epsilon value for layer normalization.
        attention_dropout (float): The dropout rate for attention layers.
        initializer_range (float): The range for parameter initialization.
        initializer_factor (float): The factor for parameter initialization.

    Returns:
        None.

    Raises:
        ValueError: If any of the input parameters are invalid or out of range.
    """
    super().__init__(**kwargs)

    self.hidden_size = hidden_size
    self.intermediate_size = intermediate_size
    self.projection_dim = projection_dim
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.num_channels = num_channels
    self.patch_size = patch_size
    self.image_size = image_size
    self.initializer_range = initializer_range
    self.initializer_factor = initializer_factor
    self.attention_dropout = attention_dropout
    self.layer_norm_eps = layer_norm_eps
    self.hidden_act = hidden_act

mindnlp.transformers.models.clip.configuration_clip.CLIPVisionConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) classmethod

Load a pretrained configuration from a given model name or path.

PARAMETER DESCRIPTION
cls

The class object.

TYPE: class

pretrained_model_name_or_path

The name or path of the pretrained model. It can be either a string representing the name of the model or a path-like object pointing to the model location.

TYPE: Union[str, PathLike]

RETURNS DESCRIPTION
PretrainedConfig

The loaded pretrained configuration.

TYPE: PretrainedConfig

This class method loads a pretrained configuration and returns it as a 'PretrainedConfig' instance.

The 'pretrained_model_name_or_path' parameter can be either a string naming the pretrained model or a path-like object pointing to its location.

Note: If the loaded configuration belongs to the 'clip' model type, the 'config_dict' will be updated to use the 'vision_config' sub-dictionary. Additionally, if the 'model_type' attribute is present in the 'cls' class and the loaded configuration's 'model_type' is different from 'cls.model_type', a warning will be logged indicating that instantiating a model of different types may lead to errors.

Example

>>> config = CLIPVisionConfig.from_pretrained("clip_model")
...
In the above example, the 'from_pretrained' method is called on the 'CLIPVisionConfig' class to load the pretrained configuration of the 'clip_model'. The resulting configuration is stored in the 'config' variable.

Source code in mindnlp/transformers/models/clip/configuration_clip.py
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
    """
    Load a pretrained configuration from a given model name or path.

    Args:
        cls (class): The class object.
        pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.
            It can be either a string representing the name of the model or a path-like object pointing to the model location.

    Returns:
        PretrainedConfig: The loaded pretrained configuration.

    Raises:
        None.

    This method is a class method that allows loading a pretrained configuration. It takes in the class object 'cls'
    and the name or path of the pretrained model 'pretrained_model_name_or_path' as parameters. The method returns an instance
    of type 'PretrainedConfig', which represents the loaded pretrained configuration.

    The 'pretrained_model_name_or_path' parameter can be either a string representing the name of the pretrained model
    or a path-like object pointing to the location of the model. It is used to identify and locate the pretrained model
    that needs to be loaded.

    Note: If the loaded configuration belongs to the 'clip' model type, the 'config_dict' will be updated to use the
    'vision_config' sub-dictionary. Additionally, if the 'model_type' attribute is present in the 'cls' class and
    the loaded configuration's 'model_type' is different from 'cls.model_type', a warning will be logged indicating
    that instantiating a model of different types may lead to errors.

    Example:
        ```python
        >>> config = CLIPVisionConfig.from_pretrained("clip_model")
        ...
        ```
        In the above example, the 'from_pretrained' method is called on the 'CLIPVisionConfig' class to load the pretrained
        configuration of the 'clip_model'. The resulting configuration is stored in the 'config' variable.
    """
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

    # get the vision config dict if we are loading from CLIPConfig
    if config_dict.get("model_type") == "clip":
        config_dict = config_dict["vision_config"]

    if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
        logger.warning(
            f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
            f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
        )

    return cls.from_dict(config_dict, **kwargs)
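
A hedged sketch of the behaviour described in the note above: when the named checkpoint stores a full CLIPConfig (model type "clip"), only its vision_config sub-dictionary is used to build the returned configuration. The checkpoint name is the one referenced elsewhere in this documentation and must be reachable for the call to succeed.

>>> from mindnlp.transformers.models.clip.configuration_clip import CLIPVisionConfig
...
>>> vision_config = CLIPVisionConfig.from_pretrained("openai/clip-vit-base-patch32")
>>> vision_config.patch_size
32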

mindnlp.transformers.models.clip.image_processing_clip.CLIPImageProcessor

Bases: BaseImageProcessor

Constructs a CLIP image processor.

PARAMETER DESCRIPTION
do_resize

Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by do_resize in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

size

Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio. Can be overridden by size in the preprocess method.

TYPE: `Dict[str, int]`, *optional*, defaults to `{"shortest_edge": 224}` DEFAULT: None

resample

Resampling filter to use if resizing the image. Can be overridden by resample in the preprocess method.

TYPE: `PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC` DEFAULT: BICUBIC

do_center_crop

Whether to center crop the image to the specified crop_size. Can be overridden by do_center_crop in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

crop_size

Size of the output image after applying center_crop. Can be overridden by crop_size in the preprocess method.

TYPE: `Dict[str, int]` *optional*, defaults to 224 DEFAULT: None

do_rescale

Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

rescale_factor

Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.

TYPE: `int` or `float`, *optional*, defaults to `1/255` DEFAULT: 1 / 255

do_normalize

Whether to normalize the image. Can be overridden by do_normalize in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

image_mean

Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.

TYPE: `float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]` DEFAULT: None

image_std

Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.

TYPE: `float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]` DEFAULT: None

do_convert_rgb

Whether to convert the image to RGB.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

Source code in mindnlp/transformers/models/clip/image_processing_clip.py
class CLIPImageProcessor(BaseImageProcessor):
    r"""
    Constructs a CLIP image processor.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by
            `do_resize` in the `preprocess` method.
        size (`Dict[str, int]` *optional*, defaults to `{"shortest_edge": 224}`):
            Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with
            the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess`
            method.
        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
            Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
        do_center_crop (`bool`, *optional*, defaults to `True`):
            Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the
            `preprocess` method.
        crop_size (`Dict[str, int]` *optional*, defaults to 224):
            Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess`
            method.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in
            the `preprocess` method.
        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
            Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess`
            method.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method.
        image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
        image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
        do_convert_rgb (`bool`, *optional*, defaults to `True`):
            Whether to convert the image to RGB.
    """
    model_input_names = ["pixel_values"]

    def __init__(
        self,
        do_resize: bool = True,
        size: Dict[str, int] = None,
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        do_center_crop: bool = True,
        crop_size: Dict[str, int] = None,
        do_rescale: bool = True,
        rescale_factor: Union[int, float] = 1 / 255,
        do_normalize: bool = True,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_convert_rgb: bool = True,
        **kwargs,
    ) -> None:
        """
        Initializes a CLIPImageProcessor object.

        Args:
            self: The CLIPImageProcessor object itself.
            do_resize (bool): A flag indicating whether to resize the image. Defaults to True.
            size (Dict[str, int]): A dictionary containing the size of the image. Defaults to None.
            resample (PILImageResampling): The resampling method for resizing the image. Defaults to PILImageResampling.BICUBIC.
            do_center_crop (bool): A flag indicating whether to perform center cropping. Defaults to True.
            crop_size (Dict[str, int]): A dictionary containing the size for cropping. Defaults to None.
            do_rescale (bool): A flag indicating whether to rescale the image. Defaults to True.
            rescale_factor (Union[int, float]): The factor by which to rescale the image. Defaults to 1 / 255.
            do_normalize (bool): A flag indicating whether to normalize the image. Defaults to True.
            image_mean (Optional[Union[float, List[float]]]): The mean value for image normalization. Defaults to None.
            image_std (Optional[Union[float, List[float]]]): The standard deviation for image normalization. Defaults to None.
            do_convert_rgb (bool): A flag indicating whether to convert the image to RGB format. Defaults to True.

        Returns:
            None.

        Raises:
            None specified.
        """
        super().__init__(**kwargs)
        size = size if size is not None else {"shortest_edge": 224}
        size = get_size_dict(size, default_to_square=False)
        crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224}
        crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")

        self.do_resize = do_resize
        self.size = size
        self.resample = resample
        self.do_center_crop = do_center_crop
        self.crop_size = crop_size
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
        self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
        self.do_convert_rgb = do_convert_rgb
        self._valid_processor_keys = [
            "images",
            "do_resize",
            "size",
            "resample",
            "do_center_crop",
            "crop_size",
            "do_rescale",
            "rescale_factor",
            "do_normalize",
            "image_mean",
            "image_std",
            "do_convert_rgb",
            "return_tensors",
            "data_format",
            "input_data_format",
        ]

        # for backwards compatibility of KOSMOS-2
        if "use_square_size" in kwargs:
            self.size = {"height": size["shortest_edge"], "width": size["shortest_edge"]}
            delattr(self, "use_square_size")

    def resize(
        self,
        image: np.ndarray,
        size: Dict[str, int],
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> np.ndarray:
        """
        Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
        resized to keep the input aspect ratio.

        Args:
            image (`np.ndarray`):
                Image to resize.
            size (`Dict[str, int]`):
                Size of the output image.
            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
                Resampling filter to use when resizing the image.
            data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format of the image. If not provided, it will be the same as the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format of the input image. If not provided, it will be inferred.
        """
        default_to_square = True
        if "shortest_edge" in size:
            size = size["shortest_edge"]
            default_to_square = False
        elif "height" in size and "width" in size:
            size = (size["height"], size["width"])
        else:
            raise ValueError("Size must contain either 'shortest_edge' or 'height' and 'width'.")

        output_size = get_resize_output_image_size(
            image,
            size=size,
            default_to_square=default_to_square,
            input_data_format=input_data_format,
        )
        return resize(
            image,
            size=output_size,
            resample=resample,
            data_format=data_format,
            input_data_format=input_data_format,
            **kwargs,
        )

    def preprocess(
        self,
        images: ImageInput,
        do_resize: bool = None,
        size: Dict[str, int] = None,
        resample: PILImageResampling = None,
        do_center_crop: bool = None,
        crop_size: int = None,
        do_rescale: bool = None,
        rescale_factor: float = None,
        do_normalize: bool = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_convert_rgb: bool = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> PIL.Image.Image:
        """
        Preprocess an image or batch of images.

        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
                the longest edge resized to keep the input aspect ratio.
            resample (`int`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
                has an effect if `do_resize` is set to `True`.
            do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
                Whether to center crop the image.
            crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
                Size of the center crop. Only has an effect if `do_center_crop` is set to `True`.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
                `True`.
            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
                Whether to convert the image to RGB.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:

                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - Unset: Use the channel dimension format of the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:
                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """
        do_resize = do_resize if do_resize is not None else self.do_resize
        size = size if size is not None else self.size
        size = get_size_dict(size, param_name="size", default_to_square=False)
        resample = resample if resample is not None else self.resample
        do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
        crop_size = crop_size if crop_size is not None else self.crop_size
        crop_size = get_size_dict(crop_size, param_name="crop_size", default_to_square=True)
        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
        image_mean = image_mean if image_mean is not None else self.image_mean
        image_std = image_std if image_std is not None else self.image_std
        do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
        validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)
        images = make_list_of_images(images)
        if not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )
        validate_preprocess_arguments(
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            do_center_crop=do_center_crop,
            crop_size=crop_size,
            do_resize=do_resize,
            size=size,
            resample=resample,
        )

        if do_convert_rgb:
            images = [convert_to_rgb(image) for image in images]

        # All transformations expect numpy arrays.
        images = [to_numpy_array(image) for image in images]

        if is_scaled_image(images[0]) and do_rescale:
            logger.warning_once(
                "It looks like you are trying to rescale already rescaled images. If the input"
                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
            )

        if input_data_format is None:
            # We assume that all images have the same channel dimension format.
            input_data_format = infer_channel_dimension_format(images[0])

        if do_resize:
            images = [
                self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
                for image in images
            ]

        if do_center_crop:
            images = [
                self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images
            ]

        if do_rescale:
            images = [
                self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
                for image in images
            ]

        if do_normalize:
            images = [
                self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
                for image in images
            ]

        images = [
            to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
        ]

        data = {"pixel_values": images}
        return BatchFeature(data=data, tensor_type=return_tensors)
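
A minimal, hedged sketch of the full pipeline implemented above (convert to RGB, resize the shortest edge to 224, center-crop to 224x224, rescale by 1/255, normalize with the OpenAI CLIP mean/std). The NumPy array is only a stand-in for a real image; any of the documented input types should work the same way.

>>> import numpy as np
>>> from mindnlp.transformers.models.clip.image_processing_clip import CLIPImageProcessor
...
>>> processor = CLIPImageProcessor()  # documented defaults
>>> dummy_image = (np.random.rand(300, 400, 3) * 255).astype(np.uint8)  # HWC uint8 stand-in
>>> batch = processor.preprocess(dummy_image, return_tensors="np")
>>> batch["pixel_values"].shape  # one image, channels-first, 224x224 after the center crop
(1, 3, 224, 224)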

mindnlp.transformers.models.clip.image_processing_clip.CLIPImageProcessor.__init__(do_resize=True, size=None, resample=PILImageResampling.BICUBIC, do_center_crop=True, crop_size=None, do_rescale=True, rescale_factor=1 / 255, do_normalize=True, image_mean=None, image_std=None, do_convert_rgb=True, **kwargs)

Initializes a CLIPImageProcessor object.

PARAMETER DESCRIPTION
self

The CLIPImageProcessor object itself.

do_resize

A flag indicating whether to resize the image. Defaults to True.

TYPE: bool DEFAULT: True

size

A dictionary containing the size of the image. Defaults to None.

TYPE: Dict[str, int] DEFAULT: None

resample

The resampling method for resizing the image. Defaults to PILImageResampling.BICUBIC.

TYPE: PILImageResampling DEFAULT: BICUBIC

do_center_crop

A flag indicating whether to perform center cropping. Defaults to True.

TYPE: bool DEFAULT: True

crop_size

A dictionary containing the size for cropping. Defaults to None.

TYPE: Dict[str, int] DEFAULT: None

do_rescale

A flag indicating whether to rescale the image. Defaults to True.

TYPE: bool DEFAULT: True

rescale_factor

The factor by which to rescale the image. Defaults to 1 / 255.

TYPE: Union[int, float] DEFAULT: 1 / 255

do_normalize

A flag indicating whether to normalize the image. Defaults to True.

TYPE: bool DEFAULT: True

image_mean

The mean value for image normalization. Defaults to None.

TYPE: Optional[Union[float, List[float]]] DEFAULT: None

image_std

The standard deviation for image normalization. Defaults to None.

TYPE: Optional[Union[float, List[float]]] DEFAULT: None

do_convert_rgb

A flag indicating whether to convert the image to RGB format. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
None

None.

Source code in mindnlp/transformers/models/clip/image_processing_clip.py
def __init__(
    self,
    do_resize: bool = True,
    size: Dict[str, int] = None,
    resample: PILImageResampling = PILImageResampling.BICUBIC,
    do_center_crop: bool = True,
    crop_size: Dict[str, int] = None,
    do_rescale: bool = True,
    rescale_factor: Union[int, float] = 1 / 255,
    do_normalize: bool = True,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    do_convert_rgb: bool = True,
    **kwargs,
) -> None:
    """
    Initializes a CLIPImageProcessor object.

    Args:
        self: The CLIPImageProcessor object itself.
        do_resize (bool): A flag indicating whether to resize the image. Defaults to True.
        size (Dict[str, int]): A dictionary containing the size of the image. Defaults to None.
        resample (PILImageResampling): The resampling method for resizing the image. Defaults to PILImageResampling.BICUBIC.
        do_center_crop (bool): A flag indicating whether to perform center cropping. Defaults to True.
        crop_size (Dict[str, int]): A dictionary containing the size for cropping. Defaults to None.
        do_rescale (bool): A flag indicating whether to rescale the image. Defaults to True.
        rescale_factor (Union[int, float]): The factor by which to rescale the image. Defaults to 1 / 255.
        do_normalize (bool): A flag indicating whether to normalize the image. Defaults to True.
        image_mean (Optional[Union[float, List[float]]]): The mean value for image normalization. Defaults to None.
        image_std (Optional[Union[float, List[float]]]): The standard deviation for image normalization. Defaults to None.
        do_convert_rgb (bool): A flag indicating whether to convert the image to RGB format. Defaults to True.

    Returns:
        None.

    Raises:
        None specified.
    """
    super().__init__(**kwargs)
    size = size if size is not None else {"shortest_edge": 224}
    size = get_size_dict(size, default_to_square=False)
    crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224}
    crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")

    self.do_resize = do_resize
    self.size = size
    self.resample = resample
    self.do_center_crop = do_center_crop
    self.crop_size = crop_size
    self.do_rescale = do_rescale
    self.rescale_factor = rescale_factor
    self.do_normalize = do_normalize
    self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
    self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
    self.do_convert_rgb = do_convert_rgb
    self._valid_processor_keys = [
        "images",
        "do_resize",
        "size",
        "resample",
        "do_center_crop",
        "crop_size",
        "do_rescale",
        "rescale_factor",
        "do_normalize",
        "image_mean",
        "image_std",
        "do_convert_rgb",
        "return_tensors",
        "data_format",
        "input_data_format",
    ]

    # for backwards compatibility of KOSMOS-2
    if "use_square_size" in kwargs:
        self.size = {"height": size["shortest_edge"], "width": size["shortest_edge"]}
        delattr(self, "use_square_size")

mindnlp.transformers.models.clip.image_processing_clip.CLIPImageProcessor.preprocess(images, do_resize=None, size=None, resample=None, do_center_crop=None, crop_size=None, do_rescale=None, rescale_factor=None, do_normalize=None, image_mean=None, image_std=None, do_convert_rgb=None, return_tensors=None, data_format=ChannelDimension.FIRST, input_data_format=None, **kwargs)

Preprocess an image or batch of images.

PARAMETER DESCRIPTION
images

Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.

TYPE: `ImageInput`

do_resize

Whether to resize the image.

TYPE: `bool`, *optional*, defaults to `self.do_resize` DEFAULT: None

size

Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.size` DEFAULT: None

resample

Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.

TYPE: `int`, *optional*, defaults to `self.resample` DEFAULT: None

do_center_crop

Whether to center crop the image.

TYPE: `bool`, *optional*, defaults to `self.do_center_crop` DEFAULT: None

crop_size

Size of the center crop. Only has an effect if do_center_crop is set to True.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.crop_size` DEFAULT: None

do_rescale

Whether to rescale the image.

TYPE: `bool`, *optional*, defaults to `self.do_rescale` DEFAULT: None

rescale_factor

Rescale factor to rescale the image by if do_rescale is set to True.

TYPE: `float`, *optional*, defaults to `self.rescale_factor` DEFAULT: None

do_normalize

Whether to normalize the image.

TYPE: `bool`, *optional*, defaults to `self.do_normalize` DEFAULT: None

image_mean

Image mean to use for normalization. Only has an effect if do_normalize is set to True.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.image_mean` DEFAULT: None

image_std

Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.image_std` DEFAULT: None

do_convert_rgb

Whether to convert the image to RGB.

TYPE: `bool`, *optional*, defaults to `self.do_convert_rgb` DEFAULT: None

return_tensors

The type of tensors to return. Can be one of:

  • Unset: Return a list of np.ndarray.
  • TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
  • TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
  • TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
  • TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.

TYPE: `str` or `TensorType`, *optional* DEFAULT: None

data_format

The channel dimension format for the output image. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • Unset: Use the channel dimension format of the input image.

TYPE: `ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST` DEFAULT: FIRST

input_data_format

The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • "none" or ChannelDimension.NONE: image in (height, width) format.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/clip/image_processing_clip.py
def preprocess(
    self,
    images: ImageInput,
    do_resize: bool = None,
    size: Dict[str, int] = None,
    resample: PILImageResampling = None,
    do_center_crop: bool = None,
    crop_size: int = None,
    do_rescale: bool = None,
    rescale_factor: float = None,
    do_normalize: bool = None,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    do_convert_rgb: bool = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> PIL.Image.Image:
    """
    Preprocess an image or batch of images.

    Args:
        images (`ImageInput`):
            Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
            passing in images with pixel values between 0 and 1, set `do_rescale=False`.
        do_resize (`bool`, *optional*, defaults to `self.do_resize`):
            Whether to resize the image.
        size (`Dict[str, int]`, *optional*, defaults to `self.size`):
            Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
            the longest edge resized to keep the input aspect ratio.
        resample (`int`, *optional*, defaults to `self.resample`):
            Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
            has an effect if `do_resize` is set to `True`.
        do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
            Whether to center crop the image.
        crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
            Size of the center crop. Only has an effect if `do_center_crop` is set to `True`.
        do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
            Whether to rescale the image.
        rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
            Rescale factor to rescale the image by if `do_rescale` is set to `True`.
        do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
            Whether to normalize the image.
        image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
            Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
        image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
            Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
            `True`.
        do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
            Whether to convert the image to RGB.
        return_tensors (`str` or `TensorType`, *optional*):
            The type of tensors to return. Can be one of:

            - Unset: Return a list of `np.ndarray`.
            - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
            - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
            - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
            - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
        data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
            The channel dimension format for the output image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - Unset: Use the channel dimension format of the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image. If unset, the channel dimension format is inferred
            from the input image. Can be one of:
            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
    """
    do_resize = do_resize if do_resize is not None else self.do_resize
    size = size if size is not None else self.size
    size = get_size_dict(size, param_name="size", default_to_square=False)
    resample = resample if resample is not None else self.resample
    do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
    crop_size = crop_size if crop_size is not None else self.crop_size
    crop_size = get_size_dict(crop_size, param_name="crop_size", default_to_square=True)
    do_rescale = do_rescale if do_rescale is not None else self.do_rescale
    rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
    do_normalize = do_normalize if do_normalize is not None else self.do_normalize
    image_mean = image_mean if image_mean is not None else self.image_mean
    image_std = image_std if image_std is not None else self.image_std
    do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
    validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)
    images = make_list_of_images(images)
    if not valid_images(images):
        raise ValueError(
            "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
            "torch.Tensor, tf.Tensor or jax.ndarray."
        )
    validate_preprocess_arguments(
        do_rescale=do_rescale,
        rescale_factor=rescale_factor,
        do_normalize=do_normalize,
        image_mean=image_mean,
        image_std=image_std,
        do_center_crop=do_center_crop,
        crop_size=crop_size,
        do_resize=do_resize,
        size=size,
        resample=resample,
    )

    if do_convert_rgb:
        images = [convert_to_rgb(image) for image in images]

    # All transformations expect numpy arrays.
    images = [to_numpy_array(image) for image in images]

    if is_scaled_image(images[0]) and do_rescale:
        logger.warning_once(
            "It looks like you are trying to rescale already rescaled images. If the input"
            " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
        )

    if input_data_format is None:
        # We assume that all images have the same channel dimension format.
        input_data_format = infer_channel_dimension_format(images[0])

    if do_resize:
        images = [
            self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
            for image in images
        ]

    if do_center_crop:
        images = [
            self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images
        ]

    if do_rescale:
        images = [
            self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
            for image in images
        ]

    if do_normalize:
        images = [
            self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
            for image in images
        ]

    images = [
        to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
    ]

    data = {"pixel_values": images}
    return BatchFeature(data=data, tensor_type=return_tensors)
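
Because every step can be overridden per call, as the parameters above describe, one processor instance can serve several pipelines. A hedged sketch that skips resizing for a single call while keeping the default 224x224 center crop (the input array is again just a stand-in):

>>> import numpy as np
>>> from mindnlp.transformers.models.clip.image_processing_clip import CLIPImageProcessor
...
>>> processor = CLIPImageProcessor()
>>> image = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)
>>> out = processor.preprocess(image, do_resize=False, return_tensors="np")
>>> out["pixel_values"].shape  # no resize this call, but the 224x224 center crop still applies
(1, 3, 224, 224)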

mindnlp.transformers.models.clip.image_processing_clip.CLIPImageProcessor.resize(image, size, resample=PILImageResampling.BICUBIC, data_format=None, input_data_format=None, **kwargs)

Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.

PARAMETER DESCRIPTION
image

Image to resize.

TYPE: `np.ndarray`

size

Size of the output image.

TYPE: `Dict[str, int]`

resample

Resampling filter to use when resizing the image.

TYPE: `PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC` DEFAULT: BICUBIC

data_format

The channel dimension format of the image. If not provided, it will be the same as the input image.

TYPE: `str` or `ChannelDimension`, *optional* DEFAULT: None

input_data_format

The channel dimension format of the input image. If not provided, it will be inferred.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/clip/image_processing_clip.py
def resize(
    self,
    image: np.ndarray,
    size: Dict[str, int],
    resample: PILImageResampling = PILImageResampling.BICUBIC,
    data_format: Optional[Union[str, ChannelDimension]] = None,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> np.ndarray:
    """
    Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
    resized to keep the input aspect ratio.

    Args:
        image (`np.ndarray`):
            Image to resize.
        size (`Dict[str, int]`):
            Size of the output image.
        resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
            Resampling filter to use when resizing the image.
        data_format (`str` or `ChannelDimension`, *optional*):
            The channel dimension format of the image. If not provided, it will be the same as the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format of the input image. If not provided, it will be inferred.
    """
    default_to_square = True
    if "shortest_edge" in size:
        size = size["shortest_edge"]
        default_to_square = False
    elif "height" in size and "width" in size:
        size = (size["height"], size["width"])
    else:
        raise ValueError("Size must contain either 'shortest_edge' or 'height' and 'width'.")

    output_size = get_resize_output_image_size(
        image,
        size=size,
        default_to_square=default_to_square,
        input_data_format=input_data_format,
    )
    return resize(
        image,
        size=output_size,
        resample=resample,
        data_format=data_format,
        input_data_format=input_data_format,
        **kwargs,
    )
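
To make the shortest-edge rule concrete, a hedged sketch: for a 480x640 input and size={"shortest_edge": 224}, the shorter side (height) becomes 224 and the longer side is scaled by the same factor, 640 * 224 / 480 ≈ 298.7, truncated to an integer by the library's size computation.

>>> import numpy as np
>>> from mindnlp.transformers.models.clip.image_processing_clip import CLIPImageProcessor
...
>>> processor = CLIPImageProcessor()
>>> image = np.zeros((480, 640, 3), dtype=np.uint8)
>>> resized = processor.resize(image, size={"shortest_edge": 224})
>>> resized.shape[:2]  # (224, ~298): aspect ratio preserved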

mindnlp.transformers.models.clip.modeling_clip.CLIP_PRETRAINED_MODEL_ARCHIVE_LIST = ['openai/clip-vit-base-patch32'] module-attribute

mindnlp.transformers.models.clip.modeling_clip.CLIPModel

Bases: CLIPPreTrainedModel

A Python class representing a CLIP (Contrastive Language-Image Pre-training) model that combines text and vision inputs for image-text similarity scoring. This class inherits from CLIPPreTrainedModel and provides methods for extracting text and image features, as well as for forwarding the final CLIP output. The class handles the initialization of model configurations, text and vision embeddings, projection layers, and scaling of logits for calculating similarity scores. It also includes examples on how to use the model for text and image inputs.

Source code in mindnlp/transformers/models/clip/modeling_clip.py
class CLIPModel(CLIPPreTrainedModel):

    """
    A Python class representing a CLIP (Contrastive Language-Image Pre-training) model that combines text and vision
    inputs for image-text similarity scoring. This class inherits from CLIPPreTrainedModel and provides methods for
    extracting text and image features, as well as for forwarding the final CLIP output.
    The class handles the initialization of model configurations, text and vision embeddings, projection layers,
    and scaling of logits for calculating similarity scores.
    It also includes examples on how to use the model for text and image inputs.
    """
    config_class = CLIPConfig
    _no_split_modules = ["CLIPTextEmbeddings", "CLIPEncoderLayer"]

    def __init__(self, config: CLIPConfig):
        """
        Initializes an instance of the CLIPModel class.

        Args:
            self: The instance of the class.
            config (CLIPConfig):
                An instance of the CLIPConfig class which holds the configuration parameters for the CLIPModel.

                - text_config (CLIPTextConfig): An instance of the CLIPTextConfig class which holds the configuration parameters for the text model.

                    - hidden_size (int): The dimension of the hidden state in the text model.

                - vision_config (CLIPVisionConfig): An instance of the CLIPVisionConfig class which holds the configuration parameters for the vision model.

                    - hidden_size (int): The dimension of the hidden state in the vision model.

                - projection_dim (int): The dimension of the projection output.

        Returns:
            None.

        Raises:
            ValueError: If the 'config.text_config' parameter is not of type CLIPTextConfig.
            ValueError: If the 'config.vision_config' parameter is not of type CLIPVisionConfig.
        """
        super().__init__(config)

        if not isinstance(config.text_config, CLIPTextConfig):
            raise ValueError(
                "config.text_config is expected to be of type CLIPTextConfig but is of type"
                f" {type(config.text_config)}."
            )

        if not isinstance(config.vision_config, CLIPVisionConfig):
            raise ValueError(
                "config.vision_config is expected to be of type CLIPVisionConfig but is of type"
                f" {type(config.vision_config)}."
            )

        text_config = config.text_config
        vision_config = config.vision_config

        self.projection_dim = config.projection_dim
        self.text_embed_dim = text_config.hidden_size
        self.vision_embed_dim = vision_config.hidden_size

        self.text_model = CLIPTextTransformer(text_config)
        self.vision_model = CLIPVisionTransformer(vision_config)

        self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False)
        self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False)
        self.logit_scale = mindspore.tensor([self.config.logit_scale_init_value])

        # Initialize weights and apply final processing
        self.post_init()

    def get_text_features(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> mindspore.Tensor:
        r"""
        Returns:
            text_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
                applying the projection layer to the pooled output of [`CLIPTextModel`].

        Example:
            ```python
            >>> from transformers import AutoTokenizer, CLIPModel
            ...
            >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
            >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
            ...
            >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
            >>> text_features = model.get_text_features(**inputs)
            ```
        """
        # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = text_outputs[1]
        text_features = self.text_projection(pooled_output)

        return text_features

    def get_image_features(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> mindspore.Tensor:
        r"""

        Returns:
            image_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
                applying the projection layer to the pooled output of [`CLIPVisionModel`].

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, CLIPModel
            ...
            >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
            >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = processor(images=image, return_tensors="pt")
            ...
            >>> image_features = model.get_image_features(**inputs)
            ```
        """
        # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = vision_outputs[1]  # pooled_output
        image_features = self.visual_projection(pooled_output)

        return image_features

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        pixel_values: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        return_loss: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CLIPOutput]:
        r"""

        Returns:
            Union[Tuple, CLIPOutput]

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, CLIPModel
            ...
            >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
            >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = processor(
            ...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
            ... )
            ...
            >>> outputs = model(**inputs)
            >>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
            >>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
            ```
        """
        # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        image_embeds = vision_outputs[1]
        image_embeds = self.visual_projection(image_embeds)

        text_embeds = text_outputs[1]
        text_embeds = self.text_projection(text_embeds)

        # normalized features
        image_embeds = image_embeds / image_embeds.norm(ord=2, dim=-1, keepdim=True)
        text_embeds = text_embeds / text_embeds.norm(ord=2, dim=-1, keepdim=True)

        # cosine similarity as logits
        logit_scale = self.logit_scale.exp()
        logits_per_text = ops.matmul(text_embeds, image_embeds.t()) * logit_scale
        logits_per_image = logits_per_text.t()

        loss = None
        if return_loss:
            loss = clip_loss(logits_per_text)

        if not return_dict:
            output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
            return ((loss,) + output) if loss is not None else output

        return CLIPOutput(
            loss=loss,
            logits_per_image=logits_per_image,
            logits_per_text=logits_per_text,
            text_embeds=text_embeds,
            image_embeds=image_embeds,
            text_model_output=text_outputs,
            vision_model_output=vision_outputs,
        )

mindnlp.transformers.models.clip.modeling_clip.CLIPModel.__init__(config)

Initializes an instance of the CLIPModel class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An instance of the CLIPConfig class which holds the configuration parameters for the CLIPModel.

  • text_config (CLIPTextConfig): An instance of the CLIPTextConfig class which holds the configuration parameters for the text model.

    • hidden_size (int): The dimension of the hidden state in the text model.
  • vision_config (CLIPVisionConfig): An instance of the CLIPVisionConfig class which holds the configuration parameters for the vision model.

    • hidden_size (int): The dimension of the hidden state in the vision model.
  • projection_dim (int): The dimension of the projection output.

TYPE: CLIPConfig

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If the 'config.text_config' parameter is not of type CLIPTextConfig.

ValueError

If the 'config.vision_config' parameter is not of type CLIPVisionConfig.

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def __init__(self, config: CLIPConfig):
    """
    Initializes an instance of the CLIPModel class.

    Args:
        self: The instance of the class.
        config (CLIPConfig):
            An instance of the CLIPConfig class which holds the configuration parameters for the CLIPModel.

            - text_config (CLIPTextConfig): An instance of the CLIPTextConfig class which holds the configuration parameters for the text model.

                - hidden_size (int): The dimension of the hidden state in the text model.

            - vision_config (CLIPVisionConfig): An instance of the CLIPVisionConfig class which holds the configuration parameters for the vision model.

                - hidden_size (int): The dimension of the hidden state in the vision model.

            - projection_dim (int): The dimension of the projection output.

    Returns:
        None.

    Raises:
        ValueError: If the 'config.text_config' parameter is not of type CLIPTextConfig.
        ValueError: If the 'config.vision_config' parameter is not of type CLIPVisionConfig.
    """
    super().__init__(config)

    if not isinstance(config.text_config, CLIPTextConfig):
        raise ValueError(
            "config.text_config is expected to be of type CLIPTextConfig but is of type"
            f" {type(config.text_config)}."
        )

    if not isinstance(config.vision_config, CLIPVisionConfig):
        raise ValueError(
            "config.vision_config is expected to be of type CLIPVisionConfig but is of type"
            f" {type(config.vision_config)}."
        )

    text_config = config.text_config
    vision_config = config.vision_config

    self.projection_dim = config.projection_dim
    self.text_embed_dim = text_config.hidden_size
    self.vision_embed_dim = vision_config.hidden_size

    self.text_model = CLIPTextTransformer(text_config)
    self.vision_model = CLIPVisionTransformer(vision_config)

    self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False)
    self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False)
    self.logit_scale = mindspore.tensor([self.config.logit_scale_init_value])

    # Initialize weights and apply final processing
    self.post_init()
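
As a quick orientation to the attributes set up in this initializer, here is a minimal sketch (not library code) that builds a CLIPModel from a default CLIPConfig and inspects the projection layers and logit_scale. It follows the import style of the examples on this page; the commented shapes assume the usual (out_features, in_features) layout of the torch-style nn.Linear weight.

```python
>>> from transformers import CLIPConfig, CLIPModel
...
>>> config = CLIPConfig()          # default text/vision configs and projection_dim
>>> model = CLIPModel(config)
...
>>> # Both projections map into the shared embedding space of size projection_dim.
>>> model.text_projection.weight.shape    # assumed (projection_dim, text_hidden_size)
>>> model.visual_projection.weight.shape  # assumed (projection_dim, vision_hidden_size)
...
>>> # Stored as its raw initial value here; forward() exponentiates it before scaling the logits.
>>> model.logit_scale
```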

mindnlp.transformers.models.clip.modeling_clip.CLIPModel.forward(input_ids=None, pixel_values=None, attention_mask=None, position_ids=None, return_loss=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple, CLIPOutput]

Union[Tuple, CLIPOutput]

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, CLIPModel
...
>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(
...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
... )
...
>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
Source code in mindnlp/transformers/models/clip/modeling_clip.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    pixel_values: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    return_loss: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, CLIPOutput]:
    r"""

    Returns:
        Union[Tuple, CLIPOutput]

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, CLIPModel
        ...
        >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(
        ...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
        ... )
        ...
        >>> outputs = model(**inputs)
        >>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
        >>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
        ```
    """
    # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    text_outputs = self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    image_embeds = vision_outputs[1]
    image_embeds = self.visual_projection(image_embeds)

    text_embeds = text_outputs[1]
    text_embeds = self.text_projection(text_embeds)

    # normalized features
    image_embeds = image_embeds / image_embeds.norm(ord=2, dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(ord=2, dim=-1, keepdim=True)

    # cosine similarity as logits
    logit_scale = self.logit_scale.exp()
    logits_per_text = ops.matmul(text_embeds, image_embeds.t()) * logit_scale
    logits_per_image = logits_per_text.t()

    loss = None
    if return_loss:
        loss = clip_loss(logits_per_text)

    if not return_dict:
        output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
        return ((loss,) + output) if loss is not None else output

    return CLIPOutput(
        loss=loss,
        logits_per_image=logits_per_image,
        logits_per_text=logits_per_text,
        text_embeds=text_embeds,
        image_embeds=image_embeds,
        text_model_output=text_outputs,
        vision_model_output=vision_outputs,
    )
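
Because forward L2-normalizes both embedding sets before the scaled matrix product, the similarity logits can be reproduced directly from the returned embeddings. A minimal sketch, continuing the model/inputs from the example above; it assumes the Tensor matmul, t, and exp methods behave like the ops used in the source.

```python
>>> outputs = model(**inputs)
...
>>> # text_embeds and image_embeds are already L2-normalized by forward(),
>>> # so their product is a cosine similarity scaled by exp(logit_scale).
>>> logit_scale = model.logit_scale.exp()
>>> manual_logits_per_text = outputs.text_embeds.matmul(outputs.image_embeds.t()) * logit_scale
>>> manual_logits_per_image = manual_logits_per_text.t()
...
>>> # These should match outputs.logits_per_text / outputs.logits_per_image up to numerical precision.
```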

mindnlp.transformers.models.clip.modeling_clip.CLIPModel.get_image_features(pixel_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
image_features

The image embeddings obtained by applying the projection layer to the pooled output of [CLIPVisionModel].

TYPE: `mindspore.Tensor` of shape `(batch_size, output_dim)`

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, CLIPModel
...
>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(images=image, return_tensors="pt")
...
>>> image_features = model.get_image_features(**inputs)
Source code in mindnlp/transformers/models/clip/modeling_clip.py
def get_image_features(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> mindspore.Tensor:
    r"""

    Returns:
        image_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
            applying the projection layer to the pooled output of [`CLIPVisionModel`].

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, CLIPModel
        ...
        >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(images=image, return_tensors="pt")
        ...
        >>> image_features = model.get_image_features(**inputs)
        ```
    """
    # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = vision_outputs[1]  # pooled_output
    image_features = self.visual_projection(pooled_output)

    return image_features
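
Unlike forward, get_image_features returns the projected pooled output without L2 normalization, so cosine comparisons between images need an explicit normalization step. A minimal sketch comparing two images, reusing the model/processor from the example above; image_a and image_b are placeholder PIL images supplied by the caller, and the Tensor norm/matmul methods are assumed to behave like the ops used in CLIPModel.forward.

```python
>>> inputs = processor(images=[image_a, image_b], return_tensors="pt")
>>> image_features = model.get_image_features(**inputs)
...
>>> # Normalize first, mirroring what CLIPModel.forward does internally.
>>> image_features = image_features / image_features.norm(ord=2, dim=-1, keepdim=True)
>>> similarity = image_features[0].matmul(image_features[1])  # cosine similarity between the two images
```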

mindnlp.transformers.models.clip.modeling_clip.CLIPModel.get_text_features(input_ids=None, attention_mask=None, position_ids=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
text_features

The text embeddings obtained by applying the projection layer to the pooled output of [CLIPTextModel].

TYPE: `mindspore.Tensor` of shape `(batch_size, output_dim)`

Example
>>> from transformers import AutoTokenizer, CLIPModel
...
>>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
>>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
...
>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
>>> text_features = model.get_text_features(**inputs)
Source code in mindnlp/transformers/models/clip/modeling_clip.py
def get_text_features(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> mindspore.Tensor:
    r"""
    Returns:
        text_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
            applying the projection layer to the pooled output of [`CLIPTextModel`].

    Example:
        ```python
        >>> from transformers import AutoTokenizer, CLIPModel
        ...
        >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
        ...
        >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
        >>> text_features = model.get_text_features(**inputs)
        ```
    """
    # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    text_outputs = self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = text_outputs[1]
    text_features = self.text_projection(pooled_output)

    return text_features
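
A common pattern is to encode a fixed set of label prompts once with get_text_features and reuse them to score many images against get_image_features. A minimal sketch, assuming the same model, tokenizer, processor, and image objects as in the examples on this page.

```python
>>> labels = ["a photo of a cat", "a photo of a dog"]
>>> text_features = model.get_text_features(**tokenizer(labels, padding=True, return_tensors="pt"))
>>> text_features = text_features / text_features.norm(ord=2, dim=-1, keepdim=True)
...
>>> # Later, only the image side needs to be encoded for each new image.
>>> image_features = model.get_image_features(**processor(images=image, return_tensors="pt"))
>>> image_features = image_features / image_features.norm(ord=2, dim=-1, keepdim=True)
...
>>> scores = image_features.matmul(text_features.t())  # higher score = better-matching label
```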

mindnlp.transformers.models.clip.modeling_clip.CLIPPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in mindnlp/transformers/models/clip/modeling_clip.py
class CLIPPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """
    config_class = CLIPConfig
    base_model_prefix = "clip"
    supports_gradient_checkpointing = True

    def _init_weights(self, cell):
        """Initialize the weights"""
        factor = self.config.initializer_factor
        if isinstance(cell, CLIPTextEmbeddings):
            cell.token_embedding.weight.set_data(initializer(Normal(factor * 0.02),
                                                 cell.token_embedding.weight.shape, cell.token_embedding.weight.dtype))
            cell.position_embedding.weight.set_data(initializer(Normal(factor * 0.02),
                                        cell.position_embedding.weight.shape, cell.position_embedding.weight.dtype))
        elif isinstance(cell, CLIPVisionEmbeddings):
            factor = self.config.initializer_factor
            cell.class_embedding.set_data(initializer(Normal(cell.embed_dim**-0.5 * factor),
                                        cell.class_embedding.shape, cell.class_embedding.dtype))
            cell.patch_embedding.weight.set_data(initializer(Normal(cell.config.initializer_range * factor),
                                                 cell.patch_embedding.weight.shape, cell.patch_embedding.weight.dtype))
            cell.position_embedding.weight.set_data(initializer(Normal(cell.config.initializer_range * factor),
                                                 cell.position_embedding.weight.shape, cell.position_embedding.weight.dtype))

        elif isinstance(cell, CLIPAttention):
            factor = self.config.initializer_factor
            in_proj_std = (cell.embed_dim**-0.5) * ((2 * cell.config.num_hidden_layers) ** -0.5) * factor
            out_proj_std = (cell.embed_dim**-0.5) * factor

            cell.q_proj.weight.set_data(initializer(Normal(in_proj_std),
                                        cell.q_proj.weight.shape, cell.q_proj.weight.dtype))
            cell.k_proj.weight.set_data(initializer(Normal(in_proj_std),
                                        cell.k_proj.weight.shape, cell.k_proj.weight.dtype))
            cell.v_proj.weight.set_data(initializer(Normal(in_proj_std),
                                        cell.v_proj.weight.shape, cell.v_proj.weight.dtype))
            cell.out_proj.weight.set_data(initializer(Normal(out_proj_std),
                                        cell.out_proj.weight.shape, cell.out_proj.weight.dtype))

        elif isinstance(cell, CLIPMLP):
            factor = self.config.initializer_factor
            in_proj_std = (cell.config.hidden_size**-0.5) * ((2 * cell.config.num_hidden_layers) ** -0.5) * factor
            fc_std = (2 * cell.config.hidden_size) ** -0.5 * factor

            cell.fc1.weight.set_data(initializer(Normal(fc_std),
                                    cell.fc1.weight.shape, cell.fc1.weight.dtype))
            cell.fc2.weight.set_data(initializer(Normal(in_proj_std),
                                    cell.fc2.weight.shape, cell.fc2.weight.dtype))

        elif isinstance(cell, CLIPModel):
            cell.text_projection.weight.set_data(initializer(Normal(cell.text_embed_dim**-0.5 * self.config.initializer_factor),
                                    cell.text_projection.weight.shape, cell.text_projection.weight.dtype))

            cell.visual_projection.weight.set_data(initializer(Normal(cell.vision_embed_dim**-0.5 * self.config.initializer_factor),
                                    cell.visual_projection.weight.shape, cell.visual_projection.weight.dtype))
        elif isinstance(cell, CLIPVisionModelWithProjection):
            cell.visual_projection.weight.set_data(initializer(Normal(self.config.hidden_size**-0.5 * self.config.initializer_factor),
                                    cell.visual_projection.weight.shape, cell.visual_projection.weight.dtype))

        elif isinstance(cell, CLIPTextModelWithProjection):
            cell.text_projection.weight.set_data(initializer(Normal(self.config.hidden_size**-0.5 * self.config.initializer_factor),
                                    cell.text_projection.weight.shape, cell.text_projection.weight.dtype))
        if isinstance(cell, nn.LayerNorm):
            cell.weight.set_data(initializer('ones', cell.weight.shape, cell.weight.dtype))
            cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))

        if isinstance(cell, nn.Linear) and cell.bias is not None:
            cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
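
To make the scaling rules above concrete, the following plain-Python sketch evaluates the standard deviations _init_weights would use for a text encoder layer, assuming the default ViT-B/32-style sizes (hidden size 512, 12 layers, initializer_factor=1.0); substitute your own config values if they differ.

```python
>>> factor = 1.0                # config.initializer_factor (assumed default)
>>> hidden_size = 512           # text config hidden_size (assumed default)
>>> num_hidden_layers = 12      # text config num_hidden_layers (assumed default)
...
>>> # Attention projections (embed_dim == hidden_size for the text encoder):
>>> in_proj_std = (hidden_size ** -0.5) * ((2 * num_hidden_layers) ** -0.5) * factor   # ~0.0090
>>> out_proj_std = (hidden_size ** -0.5) * factor                                      # ~0.0442
...
>>> # MLP layers:
>>> fc_std = (2 * hidden_size) ** -0.5 * factor                                        # 0.03125
...
>>> # Text projection head of CLIPModel:
>>> text_projection_std = hidden_size ** -0.5 * factor                                 # ~0.0442
```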

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModel

Bases: CLIPPreTrainedModel

The CLIPTextModel class wraps the text encoder of the CLIP (Contrastive Language-Image Pretraining) framework. It inherits from CLIPPreTrainedModel and is initialized from a CLIPTextConfig.

The get_input_embeddings method returns the token embedding used as the model's input, and set_input_embeddings replaces it. The forward method runs inference for the given input ids, optional attention mask, position ids, and output/return settings, and returns the model outputs (last hidden state and pooled output).

Example
>>> from transformers import AutoTokenizer, CLIPTextModel
...
>>> model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
>>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
...
>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled (EOS token) states
Source code in mindnlp/transformers/models/clip/modeling_clip.py
class CLIPTextModel(CLIPPreTrainedModel):

    """
    The `CLIPTextModel` class represents a model for processing text inputs using the CLIP (Contrastive Language-Image Pretraining) framework.
    This class inherits from `CLIPPreTrainedModel` and provides methods for initializing the model, obtaining input embeddings,
    and forwarding the model for inference.

    The `CLIPTextModel` class includes methods for initializing the model with a configuration, obtaining input embeddings,
    and forwarding the model for inference. The `get_input_embeddings` method returns the token embeddings used as input
    to the model, while the `set_input_embeddings` method allows for updating the token embeddings.
    The `forward` method forwards the model for performing inference, with options for specifying input tensors,
    attention masks, position ids, and return settings.

    The `forward` method returns the model outputs based on the provided inputs and settings.
    Additionally, the docstring includes usage examples for initializing the `CLIPTextModel` and performing inference
    using the model.

    Example:
        ```python
        >>> from transformers import AutoTokenizer, CLIPTextModel
        ...
        >>> model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
        ...
        >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        >>> pooled_output = outputs.pooler_output  # pooled (EOS token) states
        ```
    """
    config_class = CLIPTextConfig

    _no_split_modules = ["CLIPTextEmbeddings", "CLIPEncoderLayer"]

    def __init__(self, config: CLIPTextConfig):
        """Initialize the CLIPTextModel object with the given configuration.

            Args:
                self (CLIPTextModel): The instance of the CLIPTextModel class.
                config (CLIPTextConfig): The configuration object for CLIPTextModel.

            Returns:
                None

            Raises:
                None
            """
        super().__init__(config)
        self.text_model = CLIPTextTransformer(config)
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self) -> nn.Module:
        """
        Method to retrieve the input embeddings from the CLIPTextModel.

        Args:
            self (CLIPTextModel): The instance of the CLIPTextModel class.
                This parameter refers to the current instance of the CLIPTextModel class
                from which the input embeddings are being retrieved.

        Returns:
            nn.Module: An instance of the neural network Cell class representing the input embeddings.
                The return value is the token embedding from the text model, which serves as the input embeddings
                for further processing within the CLIPTextModel.

        Raises:
            None.
        """
        return self.text_model.embeddings.token_embedding

    def set_input_embeddings(self, value):
        """
        Sets the input embeddings for the CLIPTextModel.

        Args:
            self (CLIPTextModel): The instance of the CLIPTextModel.
            value: The new input embeddings to be set. It can be of any type.

        Returns:
            None

        Raises:
            None
        """
        self.text_model.embeddings.token_embedding = value

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        r"""
        Returns:
            `Union[Tuple, BaseModelOutputWithPooling]`

        Example:
            ```python
            >>> from transformers import AutoTokenizer, CLIPTextModel
            ...
            >>> model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
            >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
            ...
            >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
            ...
            >>> outputs = model(**inputs)
            >>> last_hidden_state = outputs.last_hidden_state
            >>> pooled_output = outputs.pooler_output  # pooled (EOS token) states
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        return self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
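
The get_input_embeddings/set_input_embeddings pair above is the supported way to inspect or replace the token embedding table. A minimal sketch using only methods documented on this page; the .weight attribute is the same one touched by _init_weights, and the commented shape is an assumption to verify on your config.

```python
>>> model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
...
>>> token_embedding = model.get_input_embeddings()
>>> token_embedding.weight.shape  # expected (vocab_size, hidden_size)
...
>>> # After adjusting or swapping the module, hand it back to the model.
>>> model.set_input_embeddings(token_embedding)
```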

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModel.__init__(config)

Initialize the CLIPTextModel object with the given configuration.

PARAMETER DESCRIPTION
self

The instance of the CLIPTextModel class.

TYPE: CLIPTextModel

config

The configuration object for CLIPTextModel.

TYPE: CLIPTextConfig

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def __init__(self, config: CLIPTextConfig):
    """Initialize the CLIPTextModel object with the given configuration.

        Args:
            self (CLIPTextModel): The instance of the CLIPTextModel class.
            config (CLIPTextConfig): The configuration object for CLIPTextModel.

        Returns:
            None

        Raises:
            None
        """
    super().__init__(config)
    self.text_model = CLIPTextTransformer(config)
    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModel.forward(input_ids=None, attention_mask=None, position_ids=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple, BaseModelOutputWithPooling]

Union[Tuple, BaseModelOutputWithPooling]

Example
>>> from transformers import AutoTokenizer, CLIPTextModel
...
>>> model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
>>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
...
>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled (EOS token) states
Source code in mindnlp/transformers/models/clip/modeling_clip.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPooling]:
    r"""
    Returns:
        `Union[Tuple, BaseModelOutputWithPooling]`

    Example:
        ```python
        >>> from transformers import AutoTokenizer, CLIPTextModel
        ...
        >>> model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
        ...
        >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        >>> pooled_output = outputs.pooler_output  # pooled (EOS token) states
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    return self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModel.get_input_embeddings()

Method to retrieve the input embeddings from the CLIPTextModel.

PARAMETER DESCRIPTION
self

The instance of the CLIPTextModel class. This parameter refers to the current instance of the CLIPTextModel class from which the input embeddings are being retrieved.

TYPE: CLIPTextModel

RETURNS DESCRIPTION
Module

nn.Module: The token embedding module of the text model, which serves as the input embeddings for further processing within the CLIPTextModel.

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def get_input_embeddings(self) -> nn.Module:
    """
    Method to retrieve the input embeddings from the CLIPTextModel.

    Args:
        self (CLIPTextModel): The instance of the CLIPTextModel class.
            This parameter refers to the current instance of the CLIPTextModel class
            from which the input embeddings are being retrieved.

    Returns:
        nn.Module: An instance of the neural network Cell class representing the input embeddings.
            The return value is the token embedding from the text model, which serves as the input embeddings
            for further processing within the CLIPTextModel.

    Raises:
        None.
    """
    return self.text_model.embeddings.token_embedding

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModel.set_input_embeddings(value)

Sets the input embeddings for the CLIPTextModel.

PARAMETER DESCRIPTION
self

The instance of the CLIPTextModel.

TYPE: CLIPTextModel

value

The new input embeddings to be set. It can be of any type.

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def set_input_embeddings(self, value):
    """
    Sets the input embeddings for the CLIPTextModel.

    Args:
        self (CLIPTextModel): The instance of the CLIPTextModel.
        value: The new input embeddings to be set. It can be of any type.

    Returns:
        None

    Raises:
        None
    """
    self.text_model.embeddings.token_embedding = value

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModelWithProjection

Bases: CLIPPreTrainedModel

This class represents a CLIP text model with a projection layer for embedding text inputs. It inherits from the CLIPPreTrainedModel class.

The CLIPTextModelWithProjection class is designed to process text inputs using the CLIP (Contrastive Language-Image Pretraining) model architecture. It incorporates a CLIPTextTransformer and a text projection layer to generate text embeddings.

The class is initialized from a CLIPTextConfig and provides methods for getting and setting the input (token) embeddings.

The forward method takes optional input ids, attention mask, position ids, output-attentions and output-hidden-states flags, and a return-dict flag, and returns a CLIPTextModelOutput (or a plain tuple) containing the projected text embeddings and the other text model outputs.

Example
>>> from transformers import AutoTokenizer, CLIPTextModelWithProjection
...
>>> model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
>>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
...
>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> text_embeds = outputs.text_embeds
Source code in mindnlp/transformers/models/clip/modeling_clip.py
class CLIPTextModelWithProjection(CLIPPreTrainedModel):

    """
    This class represents a CLIP text model with a projection layer for embedding text inputs.
    It inherits from the CLIPPreTrainedModel class.

    The CLIPTextModelWithProjection class is designed to process text inputs using the CLIP (Contrastive Language-Image Pretraining) model architecture.
    It incorporates a CLIPTextTransformer and a text projection layer to generate text embeddings.

    The class provides functionality for initializing the model with a CLIPTextConfig, accessing the input embeddings,
    setting the input embeddings, and forwarding the model's outputs based on input text ids, attention masks, and position ids.

    The forward method takes optional input tensors representing text ids, attention masks, position ids,
    output attentions, output hidden states, and return dictionary flag.
    It returns a CLIPTextModelOutput object containing the text embeddings and other relevant information.

    Example:
        ```python
        >>> from transformers import AutoTokenizer, CLIPTextModelWithProjection
        ...
        >>> model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
        >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
        ...
        >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> text_embeds = outputs.text_embeds
        ```

    """
    config_class = CLIPTextConfig

    _no_split_modules = ["CLIPTextEmbeddings", "CLIPEncoderLayer"]

    def __init__(self, config: CLIPTextConfig):
        """
        Initializes an instance of the CLIPTextModelWithProjection class.

        Args:
            self: The instance of the class.
            config (CLIPTextConfig):
                An instance of CLIPTextConfig class that contains the configuration parameters for the model.

        Returns:
            None

        Raises:
            None
        """
        super().__init__(config)

        self.text_model = CLIPTextTransformer(config)

        self.text_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self) -> nn.Module:
        """
        Method to get the input embeddings from the CLIPTextModelWithProjection instance.

        Args:
            self (object): Instance of the CLIPTextModelWithProjection class.
                Represents the current instance of the class.

        Returns:
            nn.Module: Returns the input embeddings of type nn.Module.
                Represents the token embeddings used by the text model.

        Raises:
            None.
        """
        return self.text_model.embeddings.token_embedding

    def set_input_embeddings(self, value):
        """
        Sets the input embeddings for the CLIPTextModelWithProjection class.

        Args:
            self (CLIPTextModelWithProjection): The instance of the CLIPTextModelWithProjection class.
            value: The input embeddings to be set for the text model.
                This should be a tensor or object that can be assigned to the `token_embedding` attribute of the text model.

        Returns:
            None: This method modifies the state of the text model by setting the input embeddings.

        Raises:
            None.
        """
        self.text_model.embeddings.token_embedding = value

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CLIPTextModelOutput]:
        r"""

        Returns:
            Union[Tuple, CLIPTextModelOutput]

        Example:
            ```python
            >>> from transformers import AutoTokenizer, CLIPTextModelWithProjection
            ...
            >>> model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
            >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
            ...
            >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
            ...
            >>> outputs = model(**inputs)
            >>> text_embeds = outputs.text_embeds
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = text_outputs[1]

        text_embeds = self.text_projection(pooled_output)

        if not return_dict:
            outputs = (text_embeds, text_outputs[0]) + text_outputs[2:]
            return tuple(output for output in outputs if output is not None)

        return CLIPTextModelOutput(
            text_embeds=text_embeds,
            last_hidden_state=text_outputs.last_hidden_state,
            hidden_states=text_outputs.hidden_states,
            attentions=text_outputs.attentions,
        )
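
The return_dict flag only changes how forward packages its results: the source above either builds a CLIPTextModelOutput or a plain tuple whose first element is text_embeds and whose second is the last hidden state. A minimal sketch contrasting the two forms, reusing the model/tokenizer from the docstring example.

```python
>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
...
>>> # Default: a CLIPTextModelOutput with named fields.
>>> outputs = model(**inputs)
>>> text_embeds = outputs.text_embeds
...
>>> # return_dict=False: a tuple (text_embeds, last_hidden_state, *optional extras).
>>> outputs_tuple = model(**inputs, return_dict=False)
>>> text_embeds_again = outputs_tuple[0]
```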

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModelWithProjection.__init__(config)

Initializes an instance of the CLIPTextModelWithProjection class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An instance of CLIPTextConfig class that contains the configuration parameters for the model.

TYPE: CLIPTextConfig

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def __init__(self, config: CLIPTextConfig):
    """
    Initializes an instance of the CLIPTextModelWithProjection class.

    Args:
        self: The instance of the class.
        config (CLIPTextConfig):
            An instance of CLIPTextConfig class that contains the configuration parameters for the model.

    Returns:
        None

    Raises:
        None
    """
    super().__init__(config)

    self.text_model = CLIPTextTransformer(config)

    self.text_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModelWithProjection.forward(input_ids=None, attention_mask=None, position_ids=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple, CLIPTextModelOutput]

Union[Tuple, CLIPTextModelOutput]

Example
>>> from transformers import AutoTokenizer, CLIPTextModelWithProjection
...
>>> model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
>>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
...
>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> text_embeds = outputs.text_embeds
Source code in mindnlp/transformers/models/clip/modeling_clip.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, CLIPTextModelOutput]:
    r"""

    Returns:
        Union[Tuple, CLIPTextModelOutput]

    Example:
        ```python
        >>> from transformers import AutoTokenizer, CLIPTextModelWithProjection
        ...
        >>> model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
        >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
        ...
        >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> text_embeds = outputs.text_embeds
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    text_outputs = self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = text_outputs[1]

    text_embeds = self.text_projection(pooled_output)

    if not return_dict:
        outputs = (text_embeds, text_outputs[0]) + text_outputs[2:]
        return tuple(output for output in outputs if output is not None)

    return CLIPTextModelOutput(
        text_embeds=text_embeds,
        last_hidden_state=text_outputs.last_hidden_state,
        hidden_states=text_outputs.hidden_states,
        attentions=text_outputs.attentions,
    )

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModelWithProjection.get_input_embeddings()

Method to get the input embeddings from the CLIPTextModelWithProjection instance.

PARAMETER DESCRIPTION
self

Instance of the CLIPTextModelWithProjection class. Represents the current instance of the class.

TYPE: object

RETURNS DESCRIPTION
Module

nn.Module: Returns the input embeddings of type nn.Module. Represents the token embeddings used by the text model.

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def get_input_embeddings(self) -> nn.Module:
    """
    Method to get the input embeddings from the CLIPTextModelWithProjection instance.

    Args:
        self (object): Instance of the CLIPTextModelWithProjection class.
            Represents the current instance of the class.

    Returns:
        nn.Module: Returns the input embeddings of type nn.Module.
            Represents the token embeddings used by the text model.

    Raises:
        None.
    """
    return self.text_model.embeddings.token_embedding

mindnlp.transformers.models.clip.modeling_clip.CLIPTextModelWithProjection.set_input_embeddings(value)

Sets the input embeddings for the CLIPTextModelWithProjection class.

PARAMETER DESCRIPTION
self

The instance of the CLIPTextModelWithProjection class.

TYPE: CLIPTextModelWithProjection

value

The input embeddings to be set for the text model. This should be a tensor or object that can be assigned to the token_embedding attribute of the text model.

RETURNS DESCRIPTION
None

This method modifies the state of the text model by setting the input embeddings.

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def set_input_embeddings(self, value):
    """
    Sets the input embeddings for the CLIPTextModelWithProjection class.

    Args:
        self (CLIPTextModelWithProjection): The instance of the CLIPTextModelWithProjection class.
        value: The input embeddings to be set for the text model.
            This should be a tensor or object that can be assigned to the `token_embedding` attribute of the text model.

    Returns:
        None: This method modifies the state of the text model by setting the input embeddings.

    Raises:
        None.
    """
    self.text_model.embeddings.token_embedding = value

mindnlp.transformers.models.clip.modeling_clip.CLIPVisionModel

Bases: CLIPPreTrainedModel

The CLIPVisionModel class represents a model for vision tasks using the CLIP (Contrastive Language-Image Pre-training) framework. It is designed to process images and generate visual embeddings using the CLIPVisionTransformer.

PARAMETER DESCRIPTION
config

The configuration object that defines the model architecture and behavior.

TYPE: CLIPVisionConfig

ATTRIBUTE DESCRIPTION
vision_model

The CLIPVisionTransformer instance used for image processing.

TYPE: CLIPVisionTransformer

METHOD DESCRIPTION
__init__

Initializes a new instance of the CLIPVisionModel class.

get_input_embeddings

Returns the input embeddings of the vision model.

forward

Runs the forward pass of the vision model on the given pixel values.

RETURNS DESCRIPTION

The model outputs produced by the vision transformer: a BaseModelOutputWithPooling when return_dict=True, otherwise a tuple.

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, CLIPVisionModel
...
>>> model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(images=image, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled CLS states
Source code in mindnlp/transformers/models/clip/modeling_clip.py
class CLIPVisionModel(CLIPPreTrainedModel):

    """
    The `CLIPVisionModel` class represents a model for vision tasks using the CLIP (Contrastive Language-Image Pre-training)
    framework. It is designed to process images and generate visual embeddings using the CLIPVisionTransformer.

    Args:
        config (CLIPVisionConfig): The configuration object that defines the model architecture and behavior.

    Attributes:
        vision_model (CLIPVisionTransformer): The CLIPVisionTransformer instance used for image processing.

    Methods:
        __init__: Initializes a new instance of the `CLIPVisionModel` class.
        get_input_embeddings: Returns the input embeddings of the vision model.
        forward: Constructs the vision model and performs image processing.

    Returns:
        The forwarded `CLIPVisionModel` instance.

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, CLIPVisionModel
        ...
        >>> model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(images=image, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        >>> pooled_output = outputs.pooler_output  # pooled CLS states
        ```
    """
    config_class = CLIPVisionConfig
    main_input_name = "pixel_values"
    _no_split_modules = ["CLIPEncoderLayer"]

    def __init__(self, config: CLIPVisionConfig):
        """
        Initializes a new instance of the CLIPVisionModel class.

        Args:
            self: The instance of the class.
            config (CLIPVisionConfig): An instance of CLIPVisionConfig class representing the configuration settings.
                It is required to initialize the CLIPVisionModel.
                It must be of type CLIPVisionConfig.

        Returns:
            None.

        Raises:
            TypeError: If the config parameter is not of type CLIPVisionConfig.
        """
        super().__init__(config)
        self.vision_model = CLIPVisionTransformer(config)
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self) -> nn.Module:
        """
        This method returns the input embeddings from the CLIPVisionModel.

        Args:
            self (CLIPVisionModel): The instance of the CLIPVisionModel class.

        Returns:
            nn.Module: The input embeddings from the vision model. This is of type nn.Module.

        Raises:
            None
        """
        return self.vision_model.embeddings.patch_embedding

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        r"""

        Returns:
            `Union[Tuple, BaseModelOutputWithPooling]`

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, CLIPVisionModel
            ...
            >>> model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
            >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = processor(images=image, return_tensors="pt")
            ...
            >>> outputs = model(**inputs)
            >>> last_hidden_state = outputs.last_hidden_state
            >>> pooled_output = outputs.pooler_output  # pooled CLS states
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        return self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
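
Beyond the pooled output shown in the example, passing output_hidden_states=True exposes the per-layer activations. A minimal sketch reusing the model/processor/image from the example above; the commented shapes assume the usual layout with the CLS token first, and the number of hidden states is the embedding output plus one entry per encoder layer.

```python
>>> inputs = processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs, output_hidden_states=True)
...
>>> len(outputs.hidden_states)            # embeddings + one per encoder layer
>>> outputs.last_hidden_state.shape       # assumed (batch, 1 + num_patches, hidden_size)
>>> outputs.pooler_output.shape           # (batch, hidden_size)
```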

mindnlp.transformers.models.clip.modeling_clip.CLIPVisionModel.__init__(config)

Initializes a new instance of the CLIPVisionModel class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An instance of CLIPVisionConfig class representing the configuration settings. It is required to initialize the CLIPVisionModel. It must be of type CLIPVisionConfig.

TYPE: CLIPVisionConfig

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the config parameter is not of type CLIPVisionConfig.

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def __init__(self, config: CLIPVisionConfig):
    """
    Initializes a new instance of the CLIPVisionModel class.

    Args:
        self: The instance of the class.
        config (CLIPVisionConfig): An instance of CLIPVisionConfig class representing the configuration settings.
            It is required to initialize the CLIPVisionModel.
            It must be of type CLIPVisionConfig.

    Returns:
        None.

    Raises:
        TypeError: If the config parameter is not of type CLIPVisionConfig.
    """
    super().__init__(config)
    self.vision_model = CLIPVisionTransformer(config)
    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.clip.modeling_clip.CLIPVisionModel.forward(pixel_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple, BaseModelOutputWithPooling]

Union[Tuple, BaseModelOutputWithPooling]

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, CLIPVisionModel
...
>>> model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(images=image, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled CLS states
Source code in mindnlp/transformers/models/clip/modeling_clip.py
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPooling]:
    r"""

    Returns:
        `Union[Tuple, BaseModelOutputWithPooling]`

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, CLIPVisionModel
        ...
        >>> model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(images=image, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        >>> pooled_output = outputs.pooler_output  # pooled CLS states
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    return self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

mindnlp.transformers.models.clip.modeling_clip.CLIPVisionModel.get_input_embeddings()

This method returns the input embeddings from the CLIPVisionModel.

PARAMETER DESCRIPTION
self

The instance of the CLIPVisionModel class.

TYPE: CLIPVisionModel

RETURNS DESCRIPTION
Module

nn.Module: The input embeddings from the vision model. This is of type nn.Module.

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def get_input_embeddings(self) -> nn.Module:
    """
    This method returns the input embeddings from the CLIPVisionModel.

    Args:
        self (CLIPVisionModel): The instance of the CLIPVisionModel class.

    Returns:
        nn.Module: The input embeddings from the vision model. This is of type nn.Module.

    Raises:
        None
    """
    return self.vision_model.embeddings.patch_embedding

mindnlp.transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection

Bases: CLIPPreTrainedModel

Represents a vision model with projection for CLIP (Contrastive Language-Image Pre-training) framework.

This class inherits from CLIPPreTrainedModel and includes methods for initializing the model, retrieving input embeddings, and forwarding the model.

The 'CLIPVisionModelWithProjection' class initializes with a configuration object of type 'CLIPVisionConfig' and sets up the vision model and visual projection. It provides a method to retrieve input embeddings and forwards the vision model with optional parameters for pixel values, attentions, hidden states, and return dictionary. The method returns image embeddings and other model outputs based on the input parameters.

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, CLIPVisionModelWithProjection
...
>>> model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(images=image, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> image_embeds = outputs.image_embeds
Source code in mindnlp/transformers/models/clip/modeling_clip.py
class CLIPVisionModelWithProjection(CLIPPreTrainedModel):

    '''
    Represents a vision model with projection for CLIP (Contrastive Language-Image Pre-training) framework.

    This class inherits from CLIPPreTrainedModel and includes methods for initializing the model,
    retrieving input embeddings, and forwarding the model.

    The 'CLIPVisionModelWithProjection' class initializes with a configuration object of type 'CLIPVisionConfig'
    and sets up the vision model and visual projection.
    It provides a method to retrieve input embeddings and forwards the vision model with optional parameters for
    pixel values, attentions, hidden states, and return dictionary.
    The method returns image embeddings and other model outputs based on the input parameters.

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, CLIPVisionModelWithProjection
        ...
        >>> model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
        >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(images=image, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> image_embeds = outputs.image_embeds
        ```
    '''
    config_class = CLIPVisionConfig
    main_input_name = "pixel_values"

    def __init__(self, config: CLIPVisionConfig):
        """
        Initializes a CLIPVisionModelWithProjection instance.

        Args:
            self: The instance itself.
            config (CLIPVisionConfig): The configuration object for the CLIPVisionModelWithProjection.
                It contains the necessary parameters for configuring the model.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)

        self.vision_model = CLIPVisionTransformer(config)

        self.visual_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self) -> nn.Module:
        """
        Returns the input embeddings of the CLIPVisionModelWithProjection.

        Args:
            self (CLIPVisionModelWithProjection): An instance of CLIPVisionModelWithProjection class.

        Returns:
            nn.Module: A neural network cell representing the input embeddings of the vision model.

        Raises:
            None.

        """
        return self.vision_model.embeddings.patch_embedding

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CLIPVisionModelOutput]:
        r"""
        Returns:
            Union[Tuple, CLIPVisionModelOutput]

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, CLIPVisionModelWithProjection
            ...
            >>> model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
            >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = processor(images=image, return_tensors="pt")
            ...
            >>> outputs = model(**inputs)
            >>> image_embeds = outputs.image_embeds
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = vision_outputs[1]  # pooled_output

        image_embeds = self.visual_projection(pooled_output)

        if not return_dict:
            outputs = (image_embeds, vision_outputs[0]) + vision_outputs[2:]
            return tuple(output for output in outputs if output is not None)

        return CLIPVisionModelOutput(
            image_embeds=image_embeds,
            last_hidden_state=vision_outputs.last_hidden_state,
            hidden_states=vision_outputs.hidden_states,
            attentions=vision_outputs.attentions,
        )

mindnlp.transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection.__init__(config)

Initializes a CLIPVisionModelWithProjection instance.

PARAMETER DESCRIPTION
self

The instance itself.

config

The configuration object for the CLIPVisionModelWithProjection. It contains the necessary parameters for configuring the model.

TYPE: CLIPVisionConfig

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def __init__(self, config: CLIPVisionConfig):
    """
    Initializes a CLIPVisionModelWithProjection instance.

    Args:
        self: The instance itself.
        config (CLIPVisionConfig): The configuration object for the CLIPVisionModelWithProjection.
            It contains the necessary parameters for configuring the model.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)

    self.vision_model = CLIPVisionTransformer(config)

    self.visual_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection.forward(pixel_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple, CLIPVisionModelOutput]

Union[Tuple, CLIPVisionModelOutput]

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, CLIPVisionModelWithProjection
...
>>> model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(images=image, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> image_embeds = outputs.image_embeds
Source code in mindnlp/transformers/models/clip/modeling_clip.py
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, CLIPVisionModelOutput]:
    r"""
    Returns:
        Union[Tuple, CLIPVisionModelOutput]

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, CLIPVisionModelWithProjection
        ...
        >>> model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
        >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(images=image, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> image_embeds = outputs.image_embeds
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = vision_outputs[1]  # pooled_output

    image_embeds = self.visual_projection(pooled_output)

    if not return_dict:
        outputs = (image_embeds, vision_outputs[0]) + vision_outputs[2:]
        return tuple(output for output in outputs if output is not None)

    return CLIPVisionModelOutput(
        image_embeds=image_embeds,
        last_hidden_state=vision_outputs.last_hidden_state,
        hidden_states=vision_outputs.hidden_states,
        attentions=vision_outputs.attentions,
    )

mindnlp.transformers.models.clip.modeling_clip.CLIPVisionModelWithProjection.get_input_embeddings()

Returns the input embeddings of the CLIPVisionModelWithProjection.

PARAMETER DESCRIPTION
self

An instance of CLIPVisionModelWithProjection class.

TYPE: CLIPVisionModelWithProjection

RETURNS DESCRIPTION
Module

nn.Module: A neural network cell representing the input embeddings of the vision model.

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def get_input_embeddings(self) -> nn.Module:
    """
    Returns the input embeddings of the CLIPVisionModelWithProjection.

    Args:
        self (CLIPVisionModelWithProjection): An instance of CLIPVisionModelWithProjection class.

    Returns:
        nn.Module: A neural network cell representing the input embeddings of the vision model.

    Raises:
        None.

    """
    return self.vision_model.embeddings.patch_embedding

mindnlp.transformers.models.clip.modeling_clip.CLIPForImageClassification

Bases: CLIPPreTrainedModel

The CLIPForImageClassification class represents a model for image classification using the Contrastive Language-Image Pretraining (CLIP) approach. It inherits from the CLIPPreTrainedModel class and implements the necessary methods for image classification tasks.

ATTRIBUTE DESCRIPTION
config

The configuration for the CLIP model, containing parameters such as num_labels, vision_model, and classifier.

TYPE: CLIPConfig

METHOD DESCRIPTION
__init__

Initializes the CLIPForImageClassification model with the provided configuration.

forward

Constructs the image classification model using the specified pixel values and labels. It returns the logits, loss, hidden states, and attentions if specified.

PARAMETER DESCRIPTION
config

The configuration for the CLIP model.

TYPE: CLIPConfig

RETURNS DESCRIPTION

None
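
Example (a minimal usage sketch, not from the upstream docs: it assumes `CLIPForImageClassification` is exposed alongside the other CLIP classes and the checkpoint name is illustrative; loading a plain CLIP checkpoint leaves the classification head randomly initialized, so the predicted label is only meaningful after fine-tuning):
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, CLIPForImageClassification
...
>>> model = CLIPForImageClassification.from_pretrained("openai/clip-vit-base-patch32")
>>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> predicted_class = outputs.logits.argmax(-1).item()  # index into model.config.id2label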

Source code in mindnlp/transformers/models/clip/modeling_clip.py
class CLIPForImageClassification(CLIPPreTrainedModel):

    """
    The CLIPForImageClassification class represents a model for image classification using the Contrastive
    Language-Image Pretraining (CLIP) approach.
    It inherits from the CLIPPreTrainedModel class and implements the necessary methods for image classification tasks.

    Attributes:
        config (CLIPConfig):
            The configuration for the CLIP model, containing parameters such as num_labels, vision_model, and classifier.

    Methods:
        __init__:
            Initializes the CLIPForImageClassification model with the provided configuration.

        forward:
            Constructs the image classification model using the specified pixel values and labels.
                It returns the logits, loss, hidden states, and attentions if specified.

    Args:
        config (CLIPConfig): The configuration for the CLIP model.

    Returns:
        None

    Raises:
        None
    """
    main_input_name = "pixel_values"

    def __init__(self, config: CLIPConfig) -> None:
        """
        Initializes an instance of the CLIPForImageClassification class.

        Args:
            self: The instance of the class.
            config (CLIPConfig): An instance of the CLIPConfig class containing configuration parameters for CLIP.
                It specifies the configuration settings needed for initializing the CLIP model.
                It must be of type CLIPConfig.

        Returns:
            None.

        Raises:
            TypeError: If the config parameter is not of type CLIPConfig.
            ValueError: If the num_labels attribute in the config is invalid or missing.
            RuntimeError: If an error occurs during initialization of the vision model or classifier.
        """
        super().__init__(config)

        self.num_labels = config.num_labels
        self.vision_model = CLIPVisionTransformer(config.vision_config)

        # Classifier head
        self.classifier = (
            nn.Linear(config.vision_config.hidden_size, config.num_labels) if config.num_labels > 0 else nn.Identity()
        )

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, ImageClassifierOutput]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
                Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
                config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
                `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.vision_model(
            pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]

        # average pool the patch tokens
        sequence_output = ops.mean(sequence_output[:, 1:, :], axis=1)
        # apply classifier
        logits = self.classifier(sequence_output)

        loss = None
        if labels is not None:
            # move labels to correct device to enable model parallelism
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                if self.num_labels == 1:
                    loss = ops.mse_loss(logits.squeeze(), labels.squeeze())
                else:
                    loss = ops.mse_loss(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss = ops.cross_entropy(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss = ops.binary_cross_entropy_with_logits(logits, labels)

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return ImageClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

mindnlp.transformers.models.clip.modeling_clip.CLIPForImageClassification.__init__(config)

Initializes an instance of the CLIPForImageClassification class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An instance of the CLIPConfig class containing configuration parameters for CLIP. It specifies the configuration settings needed for initializing the CLIP model. It must be of type CLIPConfig.

TYPE: CLIPConfig

RETURNS DESCRIPTION
None

None.

RAISES DESCRIPTION
TypeError

If the config parameter is not of type CLIPConfig.

ValueError

If the num_labels attribute in the config is invalid or missing.

RuntimeError

If an error occurs during initialization of the vision model or classifier.

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def __init__(self, config: CLIPConfig) -> None:
    """
    Initializes an instance of the CLIPForImageClassification class.

    Args:
        self: The instance of the class.
        config (CLIPConfig): An instance of the CLIPConfig class containing configuration parameters for CLIP.
            It specifies the configuration settings needed for initializing the CLIP model.
            It must be of type CLIPConfig.

    Returns:
        None.

    Raises:
        TypeError: If the config parameter is not of type CLIPConfig.
        ValueError: If the num_labels attribute in the config is invalid or missing.
        RuntimeError: If an error occurs during initialization of the vision model or classifier.
    """
    super().__init__(config)

    self.num_labels = config.num_labels
    self.vision_model = CLIPVisionTransformer(config.vision_config)

    # Classifier head
    self.classifier = (
        nn.Linear(config.vision_config.hidden_size, config.num_labels) if config.num_labels > 0 else nn.Identity()
    )

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.clip.modeling_clip.CLIPForImageClassification.forward(pixel_values=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss); if config.num_labels > 1 a classification loss is computed (Cross-Entropy).

TYPE: `mindspore.Tensor` of shape `(batch_size,)`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/clip/modeling_clip.py
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[tuple, ImageClassifierOutput]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    outputs = self.vision_model(
        pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    sequence_output = outputs[0]

    # average pool the patch tokens
    sequence_output = ops.mean(sequence_output[:, 1:, :], axis=1)
    # apply classifier
    logits = self.classifier(sequence_output)

    loss = None
    if labels is not None:
        # move labels to correct device to enable model parallelism
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            if self.num_labels == 1:
                loss = ops.mse_loss(logits.squeeze(), labels.squeeze())
            else:
                loss = ops.mse_loss(logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss = ops.cross_entropy(logits.view(-1, self.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss = ops.binary_cross_entropy_with_logits(logits, labels)

    if not return_dict:
        output = (logits,) + outputs[2:]
        return ((loss,) + output) if loss is not None else output

    return ImageClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor

Bases: ProcessorMixin

Constructs a CLIP processor which wraps a CLIP image processor and a CLIP tokenizer into a single processor.

[CLIPProcessor] offers all the functionalities of [CLIPImageProcessor] and [CLIPTokenizerFast]. See the [~CLIPProcessor.__call__] and [~CLIPProcessor.decode] for more information.

PARAMETER DESCRIPTION
image_processor

The image processor is a required input.

TYPE: [`CLIPImageProcessor`], *optional* DEFAULT: None

tokenizer

The tokenizer is a required input.

TYPE: [`CLIPTokenizerFast`], *optional* DEFAULT: None
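
Example (a minimal sketch of joint text/image preprocessing; the checkpoint name and image URL are illustrative, and the returned keys are shown sorted for readability rather than in their native order):
>>> from PIL import Image
>>> import requests
>>> from transformers import CLIPProcessor
...
>>> processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
>>> sorted(inputs.keys())
['attention_mask', 'input_ids', 'pixel_values']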

Source code in mindnlp/transformers/models/clip/processing_clip.py
class CLIPProcessor(ProcessorMixin):
    r"""
    Constructs a CLIP processor which wraps a CLIP image processor and a CLIP tokenizer into a single processor.

    [`CLIPProcessor`] offers all the functionalities of [`CLIPImageProcessor`] and [`CLIPTokenizerFast`]. See the
    [`~CLIPProcessor.__call__`] and [`~CLIPProcessor.decode`] for more information.

    Args:
        image_processor ([`CLIPImageProcessor`], *optional*):
            The image processor is a required input.
        tokenizer ([`CLIPTokenizerFast`], *optional*):
            The tokenizer is a required input.
    """
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "CLIPImageProcessor"
    tokenizer_class = ("CLIPTokenizer", "CLIPTokenizerFast")

    def __init__(self, image_processor=None, tokenizer=None, **kwargs):
        """
        Initialize a CLIPProcessor object.

        Args:
            self (object): The instance of the class.
            image_processor (object, optional): An image processor object used for processing images. 
                If not provided, it can be passed as part of the kwargs parameter.
            tokenizer (object): A tokenizer object used for tokenizing text inputs.

        Returns:
            None.

        Raises:
            ValueError: If either `image_processor` or `tokenizer` is not specified.
            FutureWarning: If the deprecated argument `feature_extractor` is used,
                a warning is issued recommending to use `image_processor` instead.
        """
        feature_extractor = None
        if "feature_extractor" in kwargs:
            warnings.warn(
                "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`"
                " instead.",
                FutureWarning,
            )
            feature_extractor = kwargs.pop("feature_extractor")

        image_processor = image_processor if image_processor is not None else feature_extractor
        if image_processor is None:
            raise ValueError("You need to specify an `image_processor`.")
        if tokenizer is None:
            raise ValueError("You need to specify a `tokenizer`.")

        super().__init__(image_processor, tokenizer)

    def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
        """
        Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
        and `kwargs` arguments to CLIPTokenizerFast's [`~CLIPTokenizerFast.__call__`] if `text` is not `None` to encode
        the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
        CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
        of the above two methods for more information.

        Args:
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
                number of channels, H and W are image height and width.

            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.

        Returns:
            [`BatchEncoding`]: A [`BatchEncoding`] with the following fields:

                - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
                - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
                `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
                `None`).
                - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
        """
        if text is None and images is None:
            raise ValueError("You have to specify either text or images. Both cannot be none.")

        if text is not None:
            encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)

        if images is not None:
            image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)

        if text is not None and images is not None:
            encoding["pixel_values"] = image_features.pixel_values
            return encoding
        elif text is not None:
            return encoding
        else:
            return BatchEncoding(data={**image_features}, tensor_type=return_tensors)

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    def model_input_names(self):
        """
        This method, 'model_input_names', is a property of the 'CLIPProcessor' class.
        It returns a list of unique model input names derived from the tokenizer and image processor model input names.

        Args:
            self: An instance of the 'CLIPProcessor' class.

        Returns:
            The method returns a list of unique model input names derived from the tokenizer and image processor model input names.

        Raises:
            No exceptions are explicitly raised by this method.
        """
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))

    @property
    def feature_extractor_class(self):
        """
        This method returns the image processor class used for extracting features in the CLIPProcessor class.

        Args:
            self: An instance of the CLIPProcessor class.

        Returns:
            None

        Raises:
            FutureWarning: If the method is called, a FutureWarning will be raised to inform the user that
                `feature_extractor_class` is deprecated and will be removed in v5. It is recommended to use
                `image_processor_class` instead.

        Note:
            The returned image processor class is responsible for extracting features from images in the CLIPProcessor.

        Example:
            ```python
            >>> clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
            >>> clip_processor.feature_extractor_class
            'CLIPImageProcessor'
            ```
        """
        warnings.warn(
            "`feature_extractor_class` is deprecated and will be removed in v5. Use `image_processor_class` instead.",
            FutureWarning,
        )
        return self.image_processor_class

    @property
    def feature_extractor(self):
        """
        This method is deprecated and will be removed in v5. Use `image_processor` instead.

        Args:
            self: An instance of the CLIPProcessor class.

        Returns:
            None.

        Raises:
            FutureWarning: This method raises a FutureWarning to alert users that it is deprecated and will be removed in v5.
        """
        warnings.warn(
            "`feature_extractor` is deprecated and will be removed in v5. Use `image_processor` instead.",
            FutureWarning,
        )
        return self.image_processor

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.feature_extractor property

This method is deprecated and will be removed in v5. Use image_processor instead.

PARAMETER DESCRIPTION
self

An instance of the CLIPProcessor class.

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
FutureWarning

This method raises a FutureWarning to alert users that it is deprecated and will be removed in v5.

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.feature_extractor_class property

This method returns the image processor class used for extracting features in the CLIPProcessor class.

PARAMETER DESCRIPTION
self

An instance of the CLIPProcessor class.

RETURNS DESCRIPTION

None

RAISES DESCRIPTION
FutureWarning

If the method is called, a FutureWarning will be raised to inform the user that feature_extractor_class is deprecated and will be removed in v5. It is recommended to use image_processor_class instead.

Note

The returned image processor class is responsible for extracting features from images in the CLIPProcessor.

Example
>>> clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
>>> clip_processor.feature_extractor_class
'CLIPImageProcessor'

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.model_input_names property

This method, 'model_input_names', is a property of the 'CLIPProcessor' class. It returns a list of unique model input names derived from the tokenizer and image processor model input names.

PARAMETER DESCRIPTION
self

An instance of the 'CLIPProcessor' class.

RETURNS DESCRIPTION

The method returns a list of unique model input names derived from the tokenizer and image processor model input names.
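
For example (a sketch assuming the default CLIP tokenizer and image processor, whose input names are ['input_ids', 'attention_mask'] and ['pixel_values'] respectively), the de-duplicated union is:
>>> processor.model_input_names
['input_ids', 'attention_mask', 'pixel_values']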

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.__call__(text=None, images=None, return_tensors=None, **kwargs)

Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the text and kwargs arguments to CLIPTokenizerFast's [~CLIPTokenizerFast.__call__] if text is not None to encode the text. To prepare the image(s), this method forwards the images and kwargs arguments to CLIPImageProcessor's [~CLIPImageProcessor.__call__] if images is not None. Please refer to the docstring of the above two methods for more information.

PARAMETER DESCRIPTION
text

The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

TYPE: `str`, `List[str]`, `List[List[str]]` DEFAULT: None

images

The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a number of channels, H and W are image height and width.

TYPE: `PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]` DEFAULT: None

return_tensors

If set, will return tensors of a particular framework. Acceptable values are:

  • 'tf': Return TensorFlow tf.constant objects.
  • 'pt': Return PyTorch torch.Tensor objects.
  • 'np': Return NumPy np.ndarray objects.
  • 'jax': Return JAX jnp.ndarray objects.

TYPE: `str` or [`~utils.TensorType`], *optional* DEFAULT: None

RETURNS DESCRIPTION

[BatchEncoding]: A [BatchEncoding] with the following fields:

  • input_ids -- List of token ids to be fed to a model. Returned when text is not None.
  • attention_mask -- List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names and if text is not None).
  • pixel_values -- Pixel values to be fed to a model. Returned when images is not None.
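
For instance (a sketch assuming `processor` is a loaded CLIPProcessor and `image` is a PIL image, as in the class-level example above), text-only and image-only calls return only the corresponding fields:
>>> text_batch = processor(text=["a photo of a cat"], return_tensors="pt")
>>> sorted(text_batch.keys())
['attention_mask', 'input_ids']
>>> image_batch = processor(images=image, return_tensors="pt")
>>> sorted(image_batch.keys())
['pixel_values']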
Source code in mindnlp/transformers/models/clip/processing_clip.py
def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
    """
    Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
    and `kwargs` arguments to CLIPTokenizerFast's [`~CLIPTokenizerFast.__call__`] if `text` is not `None` to encode
    the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
    CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
    of the above two methods for more information.

    Args:
        text (`str`, `List[str]`, `List[List[str]]`):
            The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
            (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
            `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
        images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
            The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
            tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
            number of channels, H and W are image height and width.

        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors of a particular framework. Acceptable values are:

            - `'tf'`: Return TensorFlow `tf.constant` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return NumPy `np.ndarray` objects.
            - `'jax'`: Return JAX `jnp.ndarray` objects.

    Returns:
        [`BatchEncoding`]: A [`BatchEncoding`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
            `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
            `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
    """
    if text is None and images is None:
        raise ValueError("You have to specify either text or images. Both cannot be none.")

    if text is not None:
        encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)

    if images is not None:
        image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)

    if text is not None and images is not None:
        encoding["pixel_values"] = image_features.pixel_values
        return encoding
    elif text is not None:
        return encoding
    else:
        return BatchEncoding(data={**image_features}, tensor_type=return_tensors)

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.__init__(image_processor=None, tokenizer=None, **kwargs)

Initialize a CLIPProcessor object.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: object

image_processor

An image processor object used for processing images. If not provided, it can be passed as part of the kwargs parameter.

TYPE: object DEFAULT: None

tokenizer

A tokenizer object used for tokenizing text inputs.

TYPE: object DEFAULT: None

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If either image_processor or tokenizer is not specified.

FutureWarning

If the deprecated argument feature_extractor is used, a warning is issued recommending to use image_processor instead.

Source code in mindnlp/transformers/models/clip/processing_clip.py
def __init__(self, image_processor=None, tokenizer=None, **kwargs):
    """
    Initialize a CLIPProcessor object.

    Args:
        self (object): The instance of the class.
        image_processor (object, optional): An image processor object used for processing images. 
            If not provided, it can be passed as part of the kwargs parameter.
        tokenizer (object): A tokenizer object used for tokenizing text inputs.

    Returns:
        None.

    Raises:
        ValueError: If either `image_processor` or `tokenizer` is not specified.
        FutureWarning: If the deprecated argument `feature_extractor` is used,
            a warning is issued recommending to use `image_processor` instead.
    """
    feature_extractor = None
    if "feature_extractor" in kwargs:
        warnings.warn(
            "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`"
            " instead.",
            FutureWarning,
        )
        feature_extractor = kwargs.pop("feature_extractor")

    image_processor = image_processor if image_processor is not None else feature_extractor
    if image_processor is None:
        raise ValueError("You need to specify an `image_processor`.")
    if tokenizer is None:
        raise ValueError("You need to specify a `tokenizer`.")

    super().__init__(image_processor, tokenizer)

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.batch_decode(*args, **kwargs)

This method forwards all its arguments to CLIPTokenizerFast's [~PreTrainedTokenizer.batch_decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/clip/processing_clip.py
def batch_decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
    refer to the docstring of this method for more information.
    """
    return self.tokenizer.batch_decode(*args, **kwargs)

mindnlp.transformers.models.clip.processing_clip.CLIPProcessor.decode(*args, **kwargs)

This method forwards all its arguments to CLIPTokenizerFast's [~PreTrainedTokenizer.decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/clip/processing_clip.py
def decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
    the docstring of this method for more information.
    """
    return self.tokenizer.decode(*args, **kwargs)

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer

Bases: PreTrainedTokenizer

Construct a CLIP tokenizer. Based on byte-level Byte-Pair-Encoding.

This tokenizer inherits from [PreTrainedTokenizer] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

PARAMETER DESCRIPTION
vocab_file

Path to the vocabulary file.

TYPE: `str`

merges_file

Path to the merges file.

TYPE: `str`

errors

Paradigm to follow when decoding bytes to UTF-8. See bytes.decode for more information.

TYPE: `str`, *optional*, defaults to `"replace"` DEFAULT: 'replace'

unk_token

The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.

TYPE: `str`, *optional*, defaults to `"<|endoftext|>"` DEFAULT: '<|endoftext|>'

bos_token

The beginning of sequence token.

TYPE: `str`, *optional*, defaults to `"<|startoftext|>"` DEFAULT: '<|startoftext|>'

eos_token

The end of sequence token.

TYPE: `str`, *optional*, defaults to `"<|endoftext|>"` DEFAULT: '<|endoftext|>'

pad_token

The token used for padding, for example when batching sequences of different lengths.

TYPE: `str`, *optional*, defaults to `"<|endoftext|>"` DEFAULT: '<|endoftext|>'
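
Example (a minimal sketch; the checkpoint name is illustrative and the exact BPE tokens depend on the loaded vocabulary, but encoded sequences are always wrapped in the start/end special tokens):
>>> from transformers import CLIPTokenizer
...
>>> tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
>>> tokens = tokenizer.tokenize("a photo of a cat")  # lowercased byte-level BPE tokens ending in "</w>"
>>> enc = tokenizer("a photo of a cat")
>>> enc["input_ids"][0] == tokenizer.bos_token_id and enc["input_ids"][-1] == tokenizer.eos_token_id
True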

Source code in mindnlp/transformers/models/clip/tokenization_clip.py
class CLIPTokenizer(PreTrainedTokenizer):
    """
    Construct a CLIP tokenizer. Based on byte-level Byte-Pair-Encoding.

    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
    this superclass for more information regarding those methods.

    Args:
        vocab_file (`str`):
            Path to the vocabulary file.
        merges_file (`str`):
            Path to the merges file.
        errors (`str`, *optional*, defaults to `"replace"`):
            Paradigm to follow when decoding bytes to UTF-8. See
            [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
        unk_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
            token instead.
        bos_token (`str`, *optional*, defaults to `"<|startoftext|>"`):
            The beginning of sequence token.
        eos_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
            The end of sequence token.
        pad_token (`str`, *optional*, defaults to `"<|endoftext|>"`):
            The token used for padding, for example when batching sequences of different lengths.
    """
    vocab_files_names = VOCAB_FILES_NAMES
    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
    model_input_names = ["input_ids", "attention_mask"]

    def __init__(
        self,
        vocab_file,
        merges_file,
        errors="replace",
        unk_token="<|endoftext|>",
        bos_token="<|startoftext|>",
        eos_token="<|endoftext|>",
        pad_token="<|endoftext|>",  # hack to enable padding
        **kwargs,
    ):
        """
        Initializes a CLIPTokenizer object.

        Args:
            self (object): The instance of the CLIPTokenizer class.
            vocab_file (str): The path to the vocabulary file containing token encodings.
            merges_file (str): The path to the file containing BPE merges for tokenization.
            errors (str, optional): The error handling strategy for text decoding. Defaults to 'replace'.
            unk_token (str, optional): The token to represent unknown words. Defaults to '<|endoftext|>'.
            bos_token (str, optional): The beginning of sequence token. Defaults to '<|startoftext|>'.
            eos_token (str, optional): The end of sequence token. Defaults to '<|endoftext|>'.
            pad_token (str, optional): The padding token. Defaults to '<|endoftext|>'.

        Returns:
            None.

        Raises:
            None. A missing 'ftfy' package is handled by falling back to a custom BasicTokenizer rather than raising.
        """
        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
        unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
        try:
            import ftfy

            self.fix_text = ftfy.fix_text
        except ImportError:
            logger.info("ftfy or spacy is not installed, using custom BasicTokenizer instead of ftfy.")
            self.nlp = BasicTokenizer(strip_accents=False, do_split_on_punc=False)
            self.fix_text = None

        with open(vocab_file, encoding="utf-8") as vocab_handle:
            self.encoder = json.load(vocab_handle)
        self.decoder = {v: k for k, v in self.encoder.items()}
        self.errors = errors  # how to handle errors in decoding
        self.byte_encoder = bytes_to_unicode()
        self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
        with open(merges_file, encoding="utf-8") as merges_handle:
            bpe_merges = merges_handle.read().strip().split("\n")[1 : 49152 - 256 - 2 + 1]
        bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
        self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
        self.cache = {"<|startoftext|>": "<|startoftext|>", "<|endoftext|>": "<|endoftext|>"}

        self.pat = re.compile(
            r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
            re.IGNORECASE,
        )

        super().__init__(
            errors=errors,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
            pad_token=pad_token,
            **kwargs,
        )

    @property
    def vocab_size(self):
        """
        Method to return the vocabulary size of the CLIPTokenizer instance.

        Args:
            self (CLIPTokenizer): The instance of the CLIPTokenizer class.
                This parameter refers to the current instance of the CLIPTokenizer for which the vocabulary size is to
                be calculated.

        Returns:
            int: The number of unique tokens in the vocabulary.
                The method returns an integer value representing the size of the vocabulary as the count of unique
                tokens stored in the encoder.

        Raises:
            None.
        """
        return len(self.encoder)

    def get_vocab(self):
        """
        Method to retrieve the vocabulary of the CLIPTokenizer instance.

        Args:
            self (CLIPTokenizer): The instance of the CLIPTokenizer class.
                Represents the current instance of the CLIPTokenizer.

        Returns:
            dict: A dictionary containing the combined vocabulary of the encoder and added_tokens_encoder.
                The vocabulary includes both the original encoder tokens and any additional tokens added to the tokenizer.

        Raises:
            None
        """
        return dict(self.encoder, **self.added_tokens_encoder)

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. A CLIP sequence has the following format:

        - single sequence: `<|startoftext|> X <|endoftext|>`

        Pairs of sequences are not the expected use case, but they will be handled without a separator.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        bos_token = [self.bos_token_id]
        eos_token = [self.eos_token_id]

        if token_ids_1 is None:
            return bos_token + token_ids_0 + eos_token
        return bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        if token_ids_1 is None:
            return [1] + ([0] * len(token_ids_0)) + [1]
        return [1] + ([0] * len(token_ids_0)) + [1] + [1] + ([0] * len(token_ids_1)) + [1]

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create a mask from the two sequences passed. CLIP does not make use of token type ids, therefore a list of
        zeros is returned.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of zeros.
        """
        bos_token = [self.bos_token_id]
        eos_token = [self.eos_token_id]

        if token_ids_1 is None:
            return len(bos_token + token_ids_0 + eos_token) * [0]
        return len(bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token) * [0]

    def bpe(self, token):
        """
        This method 'bpe' is defined in the class 'CLIPTokenizer'. It processes a given token using Byte Pair Encoding (BPE).

        Args:
            self: This parameter represents the instance of the class 'CLIPTokenizer'.
                It is used to access the attributes and methods of the class.
            token (str): The input token to be processed using Byte Pair Encoding (BPE).
                It should be a string representing a single token.

        Returns:
            str: The processed token after applying Byte Pair Encoding (BPE) algorithm.
                The token is modified based on the algorithm rules.

        Raises:
            None.
        """
        if token in self.cache:
            return self.cache[token]
        word = tuple(token[:-1]) + (token[-1] + "</w>",)
        pairs = get_pairs(word)

        if not pairs:
            return token + "</w>"

        while True:
            bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
            if bigram not in self.bpe_ranks:
                break
            first, second = bigram
            new_word = []
            i = 0
            while i < len(word):
                try:
                    j = word.index(first, i)
                except ValueError:
                    new_word.extend(word[i:])
                    break
                else:
                    new_word.extend(word[i:j])
                    i = j

                if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                    new_word.append(first + second)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word = tuple(new_word)
            word = new_word
            if len(word) == 1:
                break
            pairs = get_pairs(word)
        word = " ".join(word)
        self.cache[token] = word
        return word

    def _tokenize(self, text):
        """Tokenize a string."""
        bpe_tokens = []
        if self.fix_text is None:
            text = " ".join(self.nlp.tokenize(text))
        else:
            text = whitespace_clean(self.fix_text(text)).lower()

        for token in re.findall(self.pat, text):
            token = "".join(
                self.byte_encoder[b] for b in token.encode("utf-8")
            )  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
            bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
        return bpe_tokens

    def _convert_token_to_id(self, token):
        """Converts a token (str) in an id using the vocab."""
        return self.encoder.get(token, self.encoder.get(self.unk_token))

    def _convert_id_to_token(self, index):
        """Converts an index (integer) in a token (str) using the vocab."""
        return self.decoder.get(index)

    def convert_tokens_to_string(self, tokens):
        """Converts a sequence of tokens (string) in a single string."""
        text = "".join(tokens)
        byte_array = bytearray([self.byte_decoder[c] for c in text])
        text = byte_array.decode("utf-8", errors=self.errors).replace("</w>", " ").strip()
        return text

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        """
        Save the vocabulary to the specified directory with an optional filename prefix.

        Args:
            self (CLIPTokenizer): The instance of the CLIPTokenizer class.
            save_directory (str): The directory where the vocabulary files will be saved.
            filename_prefix (Optional[str], optional): An optional prefix to be added to the filename. Defaults to None.

        Returns:
            Tuple[str]: A tuple containing the paths to the saved vocabulary file and merge file.

        Raises:
            OSError: If the specified save_directory is not a valid directory.
            IOError: If there is an issue with writing the vocabulary or merge files.
            Exception: If any other unexpected error occurs during the saving process.
        """
        if not os.path.isdir(save_directory):
            logger.error("Vocabulary path ({}) should be a directory".format(save_directory))
            return
        vocab_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
        )
        merge_file = os.path.join(
            save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["merges_file"]
        )

        with open(vocab_file, "w", encoding="utf-8") as f:
            f.write(json.dumps(self.encoder, indent=2, sort_keys=True, ensure_ascii=False) + "\n")

        index = 0
        with open(merge_file, "w", encoding="utf-8") as writer:
            writer.write("#version: 0.2\n")
            for bpe_tokens, token_index in sorted(self.bpe_ranks.items(), key=lambda kv: kv[1]):
                if index != token_index:
                    logger.warning(
                        "Saving vocabulary to {}: BPE merge indices are not consecutive."
                        " Please check that the tokenizer is not corrupted!".format(merge_file)
                    )
                    index = token_index
                writer.write(" ".join(bpe_tokens) + "\n")
                index += 1

        return vocab_file, merge_file
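
With the full class source above, a minimal usage sketch follows for orientation. It assumes `CLIPTokenizer` is importable from `mindnlp.transformers` and that the `openai/clip-vit-base-patch32` tokenizer files can be fetched from the configured mirror; the exact token ids depend on the pretrained vocabulary, so none are shown here.

>>> from mindnlp.transformers import CLIPTokenizer
...
>>> # Load the pretrained BPE vocabulary and merges for the base CLIP checkpoint
>>> tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
...
>>> # Encoding adds <|startoftext|> and <|endoftext|> around the lower-cased BPE tokens
>>> encoding = tokenizer("a photo of a cat")
>>> input_ids = encoding["input_ids"]
...
>>> # Decoding maps ids back through the byte-level decoder; special tokens can be dropped
>>> text = tokenizer.decode(input_ids, skip_special_tokens=True)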

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.vocab_size property

Method to return the vocabulary size of the CLIPTokenizer instance.

PARAMETER DESCRIPTION
self

The instance of the CLIPTokenizer class. This parameter refers to the current instance of the CLIPTokenizer for which the vocabulary size is to be calculated.

TYPE: CLIPTokenizer

RETURNS DESCRIPTION
int

The number of unique tokens in the vocabulary. The method returns an integer value representing the size of the vocabulary as the count of unique tokens stored in the encoder.

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.__init__(vocab_file, merges_file, errors='replace', unk_token='<|endoftext|>', bos_token='<|startoftext|>', eos_token='<|endoftext|>', pad_token='<|endoftext|>', **kwargs)

Initializes a CLIPTokenizer object.

PARAMETER DESCRIPTION
self

The instance of the CLIPTokenizer class.

TYPE: object

vocab_file

The path to the vocabulary file containing token encodings.

TYPE: str

merges_file

The path to the file containing BPE merges for tokenization.

TYPE: str

errors

The error handling strategy for text decoding. Defaults to 'replace'.

TYPE: str DEFAULT: 'replace'

unk_token

The token to represent unknown words. Defaults to '<|endoftext|>'.

TYPE: str DEFAULT: '<|endoftext|>'

bos_token

The beginning of sequence token. Defaults to '<|startoftext|>'.

TYPE: str DEFAULT: '<|startoftext|>'

eos_token

The end of sequence token. Defaults to '<|endoftext|>'.

TYPE: str DEFAULT: '<|endoftext|>'

pad_token

The padding token. Defaults to '<|endoftext|>' (only used to enable padding).

TYPE: str DEFAULT: '<|endoftext|>'

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
None

If the 'ftfy' package is not installed, the tokenizer does not raise; it logs a message and falls back to a BasicTokenizer.

Source code in mindnlp/transformers/models/clip/tokenization_clip.py, lines 330-396
def __init__(
    self,
    vocab_file,
    merges_file,
    errors="replace",
    unk_token="<|endoftext|>",
    bos_token="<|startoftext|>",
    eos_token="<|endoftext|>",
    pad_token="<|endoftext|>",  # hack to enable padding
    **kwargs,
):
    """
    Initializes a CLIPTokenizer object.

    Args:
        self (object): The instance of the CLIPTokenizer class.
        vocab_file (str): The path to the vocabulary file containing token encodings.
        merges_file (str): The path to the file containing BPE merges for tokenization.
        errors (str, optional): The error handling strategy for text decoding. Defaults to 'replace'.
        unk_token (str, optional): The token to represent unknown words. Defaults to '<|endoftext|>'.
        bos_token (str, optional): The beginning of sequence token. Defaults to '<|startoftext|>'.
        eos_token (str, optional): The end of sequence token. Defaults to '<|endoftext|>'.
        pad_token (str, optional): The padding token. Defaults to '<|endoftext|>' (only used to enable padding).

    Returns:
        None.

    Raises:
        None. If the 'ftfy' package is not installed, the tokenizer logs a message and
            falls back to a BasicTokenizer instead of raising an error.
    """
    bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
    eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
    unk_token = AddedToken(unk_token, lstrip=False, rstrip=False) if isinstance(unk_token, str) else unk_token
    try:
        import ftfy

        self.fix_text = ftfy.fix_text
    except ImportError:
        logger.info("ftfy or spacy is not installed using custom BasicTokenizer instead of ftfy.")
        self.nlp = BasicTokenizer(strip_accents=False, do_split_on_punc=False)
        self.fix_text = None

    with open(vocab_file, encoding="utf-8") as vocab_handle:
        self.encoder = json.load(vocab_handle)
    self.decoder = {v: k for k, v in self.encoder.items()}
    self.errors = errors  # how to handle errors in decoding
    self.byte_encoder = bytes_to_unicode()
    self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
    with open(merges_file, encoding="utf-8") as merges_handle:
        bpe_merges = merges_handle.read().strip().split("\n")[1 : 49152 - 256 - 2 + 1]
    bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
    self.bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))
    self.cache = {"<|startoftext|>": "<|startoftext|>", "<|endoftext|>": "<|endoftext|>"}

    self.pat = re.compile(
        r"""<\|startoftext\|>|<\|endoftext\|>|'s|'t|'re|'ve|'m|'ll|'d|[\p{L}]+|[\p{N}]|[^\s\p{L}\p{N}]+""",
        re.IGNORECASE,
    )

    super().__init__(
        errors=errors,
        unk_token=unk_token,
        bos_token=bos_token,
        eos_token=eos_token,
        pad_token=pad_token,
        **kwargs,
    )
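
For reference, a minimal sketch of calling this constructor directly is given below; the `vocab.json` and `merges.txt` paths are placeholders for whatever files your checkpoint ships, not fixed names.

>>> from mindnlp.transformers.models.clip.tokenization_clip import CLIPTokenizer
...
>>> # Build the tokenizer from local BPE files (paths are illustrative)
>>> tokenizer = CLIPTokenizer(
...     vocab_file="vocab.json",    # token -> id mapping loaded into self.encoder
...     merges_file="merges.txt",   # ranked BPE merge rules loaded into self.bpe_ranks
...     errors="replace",           # error strategy passed to bytes.decode when detokenizing
... )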

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.bpe(token)

This method 'bpe' is defined in the class 'CLIPTokenizer'. It processes a given token using Byte Pair Encoding (BPE).

PARAMETER DESCRIPTION
self

This parameter represents the instance of the class 'CLIPTokenizer'. It is used to access the attributes and methods of the class.

token

The input token to be processed using Byte Pair Encoding (BPE). It should be a string representing a single token.

TYPE: str

RETURNS DESCRIPTION
str

The processed token after applying Byte Pair Encoding (BPE) algorithm. The token is modified based on the algorithm rules.

Source code in mindnlp/transformers/models/clip/tokenization_clip.py, lines 512-567
def bpe(self, token):
    """
    This method 'bpe' is defined in the class 'CLIPTokenizer'. It processes a given token using Byte Pair Encoding (BPE).

    Args:
        self: This parameter represents the instance of the class 'CLIPTokenizer'.
            It is used to access the attributes and methods of the class.
        token (str): The input token to be processed using Byte Pair Encoding (BPE).
            It should be a string representing a single token.

    Returns:
        str: The processed token after applying Byte Pair Encoding (BPE) algorithm.
            The token is modified based on the algorithm rules.

    Raises:
        None.
    """
    if token in self.cache:
        return self.cache[token]
    word = tuple(token[:-1]) + (token[-1] + "</w>",)
    pairs = get_pairs(word)

    if not pairs:
        return token + "</w>"

    while True:
        bigram = min(pairs, key=lambda pair: self.bpe_ranks.get(pair, float("inf")))
        if bigram not in self.bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
            except ValueError:
                new_word.extend(word[i:])
                break
            else:
                new_word.extend(word[i:j])
                i = j

            if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_word = tuple(new_word)
        word = new_word
        if len(word) == 1:
            break
        pairs = get_pairs(word)
    word = " ".join(word)
    self.cache[token] = word
    return word
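
The merge loop can be traced on a short word. Using the `tokenizer` from the usage sketch above, the call below is valid, but the exact split depends on the pretrained merge ranks, so the outcome sketched in the comments only holds if the relevant merges exist in the vocabulary (an assumption for illustration).

>>> # "low" is first split into characters with the end-of-word marker on the last one:
>>> #   ('l', 'o', 'w</w>')
>>> # then the highest-ranked adjacent pair is merged repeatedly.
>>> pieces = tokenizer.bpe("low").split(" ")
>>> # If ('l', 'o') and ('lo', 'w</w>') are both merges, pieces == ["low</w>"];
>>> # otherwise the word stays split into whichever BPE units the ranks produce.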

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.build_inputs_with_special_tokens(token_ids_0, token_ids_1=None)

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A CLIP sequence has the following format:

  • single sequence: <|startoftext|> X <|endoftext|>

Pairs of sequences are not the expected use case, but they will be handled without a separator.

PARAMETER DESCRIPTION
token_ids_0

List of IDs to which the special tokens will be added.

TYPE: `List[int]`

token_ids_1

Optional second list of IDs for sequence pairs.

TYPE: `List[int]`, *optional* DEFAULT: None

RETURNS DESCRIPTION
List[int]

List[int]: List of input IDs with the appropriate special tokens.

Source code in mindnlp/transformers/models/clip/tokenization_clip.py, lines 435-460
def build_inputs_with_special_tokens(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """
    Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
    adding special tokens. A CLIP sequence has the following format:

    - single sequence: `<|startoftext|> X <|endoftext|>`

    Pairs of sequences are not the expected use case, but they will be handled without a separator.

    Args:
        token_ids_0 (`List[int]`):
            List of IDs to which the special tokens will be added.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.

    Returns:
        `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
    """
    bos_token = [self.bos_token_id]
    eos_token = [self.eos_token_id]

    if token_ids_1 is None:
        return bos_token + token_ids_0 + eos_token
    return bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token
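
A short sketch of the resulting layout, again using the `tokenizer` from the earlier example; the ids below are placeholders rather than real vocabulary entries.

>>> ids_a = [320, 1125, 539]          # placeholder ids for a first text segment
>>> ids_b = [2368]                    # placeholder ids for a second segment
...
>>> tokenizer.build_inputs_with_special_tokens(ids_a)
>>> # -> [bos_token_id] + ids_a + [eos_token_id]
...
>>> tokenizer.build_inputs_with_special_tokens(ids_a, ids_b)
>>> # -> [bos] + ids_a + [eos] + [eos] + ids_b + [eos]   (no dedicated separator token)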

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.convert_tokens_to_string(tokens)

Converts a sequence of tokens (string) into a single string.

Source code in mindnlp/transformers/models/clip/tokenization_clip.py, lines 592-597
def convert_tokens_to_string(self, tokens):
    """Converts a sequence of tokens (string) in a single string."""
    text = "".join(tokens)
    byte_array = bytearray([self.byte_decoder[c] for c in text])
    text = byte_array.decode("utf-8", errors=self.errors).replace("</w>", " ").strip()
    return text
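
As a small sketch (assuming the `tokenizer` from the earlier example), word-final BPE units carry the `</w>` marker, which this method turns back into spaces after undoing the byte-level encoding:

>>> tokenizer.convert_tokens_to_string(["a</w>", "photo</w>", "of</w>", "a</w>", "cat</w>"])
>>> # -> "a photo of a cat"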

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.create_token_type_ids_from_sequences(token_ids_0, token_ids_1=None)

Create a mask from the two sequences passed. CLIP does not make use of token type ids, therefore a list of zeros is returned.

PARAMETER DESCRIPTION
token_ids_0

List of IDs.

TYPE: `List[int]`

token_ids_1

Optional second list of IDs for sequence pairs.

TYPE: `List[int]`, *optional* DEFAULT: None

RETURNS DESCRIPTION
List[int]

List[int]: List of zeros.

Source code in mindnlp/transformers/models/clip/tokenization_clip.py, lines 489-510
def create_token_type_ids_from_sequences(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
) -> List[int]:
    """
    Create a mask from the two sequences passed. CLIP does not make use of token type ids, therefore a list of
    zeros is returned.

    Args:
        token_ids_0 (`List[int]`):
            List of IDs.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.

    Returns:
        `List[int]`: List of zeros.
    """
    bos_token = [self.bos_token_id]
    eos_token = [self.eos_token_id]

    if token_ids_1 is None:
        return len(bos_token + token_ids_0 + eos_token) * [0]
    return len(bos_token + token_ids_0 + eos_token + eos_token + token_ids_1 + eos_token) * [0]
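
For illustration (placeholder ids, `tokenizer` from the earlier sketch), the returned mask simply has one zero per position of the sequence built with special tokens:

>>> tokenizer.create_token_type_ids_from_sequences([320, 1125, 539])
>>> # -> [0, 0, 0, 0, 0]   (bos + 3 tokens + eos, all zeros since CLIP ignores token type ids)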

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.get_special_tokens_mask(token_ids_0, token_ids_1=None, already_has_special_tokens=False)

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer prepare_for_model method.

PARAMETER DESCRIPTION
token_ids_0

List of IDs.

TYPE: `List[int]`

token_ids_1

Optional second list of IDs for sequence pairs.

TYPE: `List[int]`, *optional* DEFAULT: None

already_has_special_tokens

Whether or not the token list is already formatted with special tokens for the model.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

RETURNS DESCRIPTION
List[int]

List[int]: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.

Source code in mindnlp/transformers/models/clip/tokenization_clip.py, lines 462-487
def get_special_tokens_mask(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
    """
    Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
    special tokens using the tokenizer `prepare_for_model` method.

    Args:
        token_ids_0 (`List[int]`):
            List of IDs.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.
        already_has_special_tokens (`bool`, *optional*, defaults to `False`):
            Whether or not the token list is already formatted with special tokens for the model.

    Returns:
        `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
    """
    if already_has_special_tokens:
        return super().get_special_tokens_mask(
            token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
        )

    if token_ids_1 is None:
        return [1] + ([0] * len(token_ids_0)) + [1]
    return [1] + ([0] * len(token_ids_0)) + [1] + [1] + ([0] * len(token_ids_1)) + [1]
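
For illustration (placeholder ids, `tokenizer` from the earlier sketch), the mask marks only the added <|startoftext|> and <|endoftext|> positions:

>>> tokenizer.get_special_tokens_mask([320, 1125, 539])
>>> # -> [1, 0, 0, 0, 1]   (1 for special tokens, 0 for regular sequence tokens)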

mindnlp.transformers.models.clip.tokenization_clip.CLIPTokenizer.get_vocab()

Method to retrieve the vocabulary of the CLIPTokenizer instance.

PARAMETER DESCRIPTION