altclip

mindnlp.transformers.models.altclip.configuration_altclip.ALTCLIP_PRETRAINED_CONFIG_ARCHIVE_MAP = {'BAAI/AltCLIP': 'https://hf-mirror.com/BAAI/AltCLIP/resolve/main/config.json'} module-attribute

mindnlp.transformers.models.altclip.configuration_altclip.AltCLIPConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [AltCLIPModel]. It is used to instantiate an AltCLIP model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the AltCLIP BAAI/AltCLIP architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
text_config

Dictionary of configuration options used to initialize [AltCLIPTextConfig].

TYPE: `dict`, *optional* DEFAULT: None

vision_config

Dictionary of configuration options used to initialize [AltCLIPVisionConfig].

TYPE: `dict`, *optional* DEFAULT: None

projection_dim

Dimensionality of text and vision projection layers.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

logit_scale_init_value

The initial value of the logit_scale parameter. Default is used as per the original CLIP implementation.

TYPE: `float`, *optional*, defaults to 2.6592 DEFAULT: 2.6592

kwargs

Dictionary of keyword arguments.

TYPE: *optional* DEFAULT: {}

Example
>>> from transformers import AltCLIPConfig, AltCLIPModel
...
>>> # Initializing an AltCLIPConfig with BAAI/AltCLIP style configuration
>>> configuration = AltCLIPConfig()
...
>>> # Initializing an AltCLIPModel (with random weights) from the BAAI/AltCLIP style configuration
>>> model = AltCLIPModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
...
>>> # We can also initialize an AltCLIPConfig from an AltCLIPTextConfig and an AltCLIPVisionConfig
...
>>> # Initializing an AltCLIPText and an AltCLIPVision configuration
>>> config_text = AltCLIPTextConfig()
>>> config_vision = AltCLIPVisionConfig()
...
>>> config = AltCLIPConfig.from_text_vision_configs(config_text, config_vision)
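
The sub-configurations can also be supplied as plain dictionaries; any field that is omitted falls back to its default. A minimal sketch, assuming `AltCLIPConfig` is importable from `mindnlp.transformers` as the module path above suggests:

>>> from mindnlp.transformers import AltCLIPConfig
>>> config = AltCLIPConfig(
...     text_config={"hidden_size": 1024, "num_hidden_layers": 24},
...     vision_config={"image_size": 224, "patch_size": 32},
...     projection_dim=768,
... )
>>> config.text_config.hidden_size
1024
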
Source code in mindnlp/transformers/models/altclip/configuration_altclip.py
class AltCLIPConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`AltCLIPModel`]. It is used to instantiate an
    AltCLIP model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the AltCLIP
    [BAAI/AltCLIP](https://hf-mirror.com/BAAI/AltCLIP) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        text_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`AltCLIPTextConfig`].
        vision_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`AltCLIPVisionConfig`].
        projection_dim (`int`, *optional*, defaults to 768):
            Dimentionality of text and vision projection layers.
        logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
            The inital value of the *logit_scale* paramter. Default is used as per the original CLIP implementation.
        kwargs (*optional*):
            Dictionary of keyword arguments.

    Example:
        ```python
        >>> from transformers import AltCLIPConfig, AltCLIPModel
        ...
        >>> # Initializing a AltCLIPConfig with BAAI/AltCLIP style configuration
        >>> configuration = AltCLIPConfig()
        ...
        >>> # Initializing a AltCLIPModel (with random weights) from the BAAI/AltCLIP style configuration
        >>> model = AltCLIPModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ...
        >>> # We can also initialize a AltCLIPConfig from a AltCLIPTextConfig and a AltCLIPVisionConfig
        ...
        >>> # Initializing a AltCLIPText and AltCLIPVision configuration
        >>> config_text = AltCLIPTextConfig()
        >>> config_vision = AltCLIPVisionConfig()
        ...
        >>> config = AltCLIPConfig.from_text_vision_configs(config_text, config_vision)
        ```
    """
    model_type = "altclip"

    def __init__(
        self, text_config=None, vision_config=None, projection_dim=768, logit_scale_init_value=2.6592, **kwargs
    ):
        """
        Initializes an instance of the AltCLIPConfig class.

        Args:
            self: The instance of the class.
            text_config (Optional[Dict]): A dictionary containing configuration parameters for text processing. Defaults to None.
            vision_config (Optional[Dict]): A dictionary containing configuration parameters for vision processing. Defaults to None.
            projection_dim (int): The dimension of the projection layer. Defaults to 768.
            logit_scale_init_value (float): The initial value for the logit scale. Defaults to 2.6592.

        Returns:
            None

        Raises:
            None
        """
        # If `_config_dict` exist, we use them for the backward compatibility.
        # We pop out these 2 attributes before calling `super().__init__` to avoid them being saved (which causes a lot
        # of confusion!).
        text_config_dict = kwargs.pop("text_config_dict", None)
        vision_config_dict = kwargs.pop("vision_config_dict", None)

        super().__init__(**kwargs)

        # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in
        # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be same in most
        # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`.
        if text_config_dict is not None:
            if text_config is None:
                text_config = {}

            # This is the complete result when using `text_config_dict`.
            _text_config_dict = AltCLIPTextConfig(**text_config_dict).to_dict()

            # Give a warning if the values exist in both `_text_config_dict` and `text_config` but being different.
            for key, value in _text_config_dict.items():
                if key in text_config and value != text_config[key] and key not in ["transformers_version"]:
                    # If specified in `text_config_dict`
                    if key in text_config_dict:
                        message = (
                            f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. "
                            f'The value `text_config_dict["{key}"]` will be used instead.'
                        )
                    # If inferred from default argument values (just to be super careful)
                    else:
                        message = (
                            f"`text_config_dict` is provided which will be used to initialize `AltCLIPTextConfig`. The "
                            f'value `text_config["{key}"]` will be overriden.'
                        )
                    logger.info(message)

            # Update all values in `text_config` with the ones in `_text_config_dict`.
            text_config.update(_text_config_dict)

        if vision_config_dict is not None:
            if vision_config is None:
                vision_config = {}

            # This is the complete result when using `vision_config_dict`.
            _vision_config_dict = AltCLIPVisionConfig(**vision_config_dict).to_dict()
            # convert keys to string instead of integer
            if "id2label" in _vision_config_dict:
                _vision_config_dict["id2label"] = {
                    str(key): value for key, value in _vision_config_dict["id2label"].items()
                }

            # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but being different.
            for key, value in _vision_config_dict.items():
                if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]:
                    # If specified in `vision_config_dict`
                    if key in vision_config_dict:
                        message = (
                            f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different "
                            f'values. The value `vision_config_dict["{key}"]` will be used instead.'
                        )
                    # If inferred from default argument values (just to be super careful)
                    else:
                        message = (
                            f"`vision_config_dict` is provided which will be used to initialize `AltCLIPVisionConfig`. "
                            f'The value `vision_config["{key}"]` will be overriden.'
                        )
                    logger.info(message)

            # Update all values in `vision_config` with the ones in `_vision_config_dict`.
            vision_config.update(_vision_config_dict)

        if text_config is None:
            text_config = {}
            logger.info("`text_config` is `None`. Initializing the `AltCLIPTextConfig` with default values.")

        if vision_config is None:
            vision_config = {}
            logger.info("`vision_config` is `None`. initializing the `AltCLIPVisionConfig` with default values.")

        self.text_config = AltCLIPTextConfig(**text_config)
        self.vision_config = AltCLIPVisionConfig(**vision_config)

        self.projection_dim = projection_dim
        self.logit_scale_init_value = logit_scale_init_value
        self.initializer_factor = 1.0

    @classmethod
    def from_text_vision_configs(cls, text_config: AltCLIPTextConfig, vision_config: AltCLIPVisionConfig, **kwargs):
        r"""
        Instantiate a [`AltCLIPConfig`] (or a derived class) from altclip text model configuration and altclip vision
        model configuration.

        Returns:
            [`AltCLIPConfig`]: An instance of a configuration object
        """
        return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)

mindnlp.transformers.models.altclip.configuration_altclip.AltCLIPConfig.__init__(text_config=None, vision_config=None, projection_dim=768, logit_scale_init_value=2.6592, **kwargs)

Initializes an instance of the AltCLIPConfig class.

PARAMETER DESCRIPTION
self

The instance of the class.

text_config

A dictionary containing configuration parameters for text processing. Defaults to None.

TYPE: Optional[Dict] DEFAULT: None

vision_config

A dictionary containing configuration parameters for vision processing. Defaults to None.

TYPE: Optional[Dict] DEFAULT: None

projection_dim

The dimension of the projection layer. Defaults to 768.

TYPE: int DEFAULT: 768

logit_scale_init_value

The initial value for the logit scale. Defaults to 2.6592.

TYPE: float DEFAULT: 2.6592

RETURNS DESCRIPTION

None
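
For backward compatibility, the constructor also accepts the legacy `text_config_dict` / `vision_config_dict` keyword arguments, whose values are merged into `text_config` / `vision_config` before the sub-configurations are built (see the source below). A minimal sketch, assuming the same import path as the other examples:

>>> from mindnlp.transformers import AltCLIPConfig
>>> config = AltCLIPConfig(text_config_dict={"hidden_size": 512})
>>> config.text_config.hidden_size
512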

Source code in mindnlp/transformers/models/altclip/configuration_altclip.py
def __init__(
    self, text_config=None, vision_config=None, projection_dim=768, logit_scale_init_value=2.6592, **kwargs
):
    """
    Initializes an instance of the AltCLIPConfig class.

    Args:
        self: The instance of the class.
        text_config (Optional[Dict]): A dictionary containing configuration parameters for text processing. Defaults to None.
        vision_config (Optional[Dict]): A dictionary containing configuration parameters for vision processing. Defaults to None.
        projection_dim (int): The dimension of the projection layer. Defaults to 768.
        logit_scale_init_value (float): The initial value for the logit scale. Defaults to 2.6592.

    Returns:
        None

    Raises:
        None
    """
    # If `_config_dict` exist, we use them for the backward compatibility.
    # We pop out these 2 attributes before calling `super().__init__` to avoid them being saved (which causes a lot
    # of confusion!).
    text_config_dict = kwargs.pop("text_config_dict", None)
    vision_config_dict = kwargs.pop("vision_config_dict", None)

    super().__init__(**kwargs)

    # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in
    # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be same in most
    # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`.
    if text_config_dict is not None:
        if text_config is None:
            text_config = {}

        # This is the complete result when using `text_config_dict`.
        _text_config_dict = AltCLIPTextConfig(**text_config_dict).to_dict()

        # Give a warning if the values exist in both `_text_config_dict` and `text_config` but being different.
        for key, value in _text_config_dict.items():
            if key in text_config and value != text_config[key] and key not in ["transformers_version"]:
                # If specified in `text_config_dict`
                if key in text_config_dict:
                    message = (
                        f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. "
                        f'The value `text_config_dict["{key}"]` will be used instead.'
                    )
                # If inferred from default argument values (just to be super careful)
                else:
                    message = (
                        f"`text_config_dict` is provided which will be used to initialize `AltCLIPTextConfig`. The "
                        f'value `text_config["{key}"]` will be overriden.'
                    )
                logger.info(message)

        # Update all values in `text_config` with the ones in `_text_config_dict`.
        text_config.update(_text_config_dict)

    if vision_config_dict is not None:
        if vision_config is None:
            vision_config = {}

        # This is the complete result when using `vision_config_dict`.
        _vision_config_dict = AltCLIPVisionConfig(**vision_config_dict).to_dict()
        # convert keys to string instead of integer
        if "id2label" in _vision_config_dict:
            _vision_config_dict["id2label"] = {
                str(key): value for key, value in _vision_config_dict["id2label"].items()
            }

        # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but being different.
        for key, value in _vision_config_dict.items():
            if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]:
                # If specified in `vision_config_dict`
                if key in vision_config_dict:
                    message = (
                        f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different "
                        f'values. The value `vision_config_dict["{key}"]` will be used instead.'
                    )
                # If inferred from default argument values (just to be super careful)
                else:
                    message = (
                        f"`vision_config_dict` is provided which will be used to initialize `AltCLIPVisionConfig`. "
                        f'The value `vision_config["{key}"]` will be overriden.'
                    )
                logger.info(message)

        # Update all values in `vision_config` with the ones in `_vision_config_dict`.
        vision_config.update(_vision_config_dict)

    if text_config is None:
        text_config = {}
        logger.info("`text_config` is `None`. Initializing the `AltCLIPTextConfig` with default values.")

    if vision_config is None:
        vision_config = {}
        logger.info("`vision_config` is `None`. initializing the `AltCLIPVisionConfig` with default values.")

    self.text_config = AltCLIPTextConfig(**text_config)
    self.vision_config = AltCLIPVisionConfig(**vision_config)

    self.projection_dim = projection_dim
    self.logit_scale_init_value = logit_scale_init_value
    self.initializer_factor = 1.0

mindnlp.transformers.models.altclip.configuration_altclip.AltCLIPConfig.from_text_vision_configs(text_config, vision_config, **kwargs) classmethod

Instantiate a [AltCLIPConfig] (or a derived class) from altclip text model configuration and altclip vision model configuration.

RETURNS DESCRIPTION

[AltCLIPConfig]: An instance of a configuration object
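
A short usage sketch (assuming these classes are importable from `mindnlp.transformers`); any extra keyword arguments such as `projection_dim` are forwarded to the `AltCLIPConfig` constructor:

>>> from mindnlp.transformers import AltCLIPConfig, AltCLIPTextConfig, AltCLIPVisionConfig
>>> config = AltCLIPConfig.from_text_vision_configs(
...     AltCLIPTextConfig(), AltCLIPVisionConfig(), projection_dim=512
... )
>>> config.projection_dim
512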

Source code in mindnlp/transformers/models/altclip/configuration_altclip.py
@classmethod
def from_text_vision_configs(cls, text_config: AltCLIPTextConfig, vision_config: AltCLIPVisionConfig, **kwargs):
    r"""
    Instantiate a [`AltCLIPConfig`] (or a derived class) from altclip text model configuration and altclip vision
    model configuration.

    Returns:
        [`AltCLIPConfig`]: An instance of a configuration object
    """
    return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)

mindnlp.transformers.models.altclip.configuration_altclip.AltCLIPTextConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [AltCLIPTextModel]. It is used to instantiate an AltCLIP text model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the AltCLIP BAAI/AltCLIP architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vocab_size

Vocabulary size of the AltCLIP model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling [AltCLIPTextModel].

TYPE: `int`, *optional*, defaults to 250002 DEFAULT: 250002

hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 1024 DEFAULT: 1024

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 24 DEFAULT: 24

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 16 DEFAULT: 16

intermediate_size

Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 4096 DEFAULT: 4096

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.

TYPE: `str` or `Callable`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

TYPE: `float`, *optional*, defaults to 0.1 DEFAULT: 0.1

attention_probs_dropout_prob

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.1 DEFAULT: 0.1

max_position_embeddings

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

TYPE: `int`, *optional*, defaults to 514 DEFAULT: 514

type_vocab_size

The vocabulary size of the token_type_ids passed when calling [AltCLIPTextModel]

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

initializer_factor

A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-05 DEFAULT: 1e-05

pad_token_id

The id of the padding token.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

bos_token_id

The id of the beginning-of-sequence token.

TYPE: `int`, *optional*, defaults to 0 DEFAULT: 0

eos_token_id

The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.

TYPE: `Union[int, List[int]]`, *optional*, defaults to 2 DEFAULT: 2

position_embedding_type

Type of position embedding. Choose one of "absolute", "relative_key", "relative_key_query". For positional embeddings use "absolute". For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on "relative_key_query", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

TYPE: `str`, *optional*, defaults to `"absolute"` DEFAULT: 'absolute'

use_cache

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

project_dim

The dimensions of the teacher model before the mapping layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

Example
>>> from transformers import AltCLIPTextModel, AltCLIPTextConfig
...
>>> # Initializing an AltCLIPTextConfig with BAAI/AltCLIP style configuration
>>> configuration = AltCLIPTextConfig()
...
>>> # Initializing an AltCLIPTextModel (with random weights) from the BAAI/AltCLIP style configuration
>>> model = AltCLIPTextModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
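
A hedged sketch of a scaled-down text configuration, e.g. for quick experiments; the `mindnlp.transformers` import path is assumed and the sizes are illustrative rather than pretrained values:

>>> from mindnlp.transformers import AltCLIPTextConfig
>>> tiny_text_config = AltCLIPTextConfig(
...     hidden_size=256, num_hidden_layers=4, num_attention_heads=4, intermediate_size=1024
... )
>>> tiny_text_config.num_hidden_layers
4
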
Source code in mindnlp/transformers/models/altclip/configuration_altclip.py
class AltCLIPTextConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`AltCLIPTextModel`]. It is used to instantiate a
    AltCLIP text model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the AltCLIP
    [BAAI/AltCLIP](https://hf-mirror.com/BAAI/AltCLIP) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 250002):
            Vocabulary size of the AltCLIP model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`AltCLIPTextModel`].
        hidden_size (`int`, *optional*, defaults to 1024):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 24):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 4096):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (`int`, *optional*, defaults to 514):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (`int`, *optional*, defaults to 1):
            The vocabulary size of the `token_type_ids` passed when calling [`AltCLIPTextModel`]
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        initializer_factor (`float`, *optional*, defaults to 0.02):
            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
            testing).
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
        pad_token_id (`int`, *optional*, defaults to 1): The id of the *padding* token.
        bos_token_id (`int`, *optional*, defaults to 0): The id of the *beginning-of-sequence* token.
        eos_token_id (`Union[int, List[int]]`, *optional*, defaults to 2):
            The id of the *end-of-sequence* token. Optionally, use a list to set multiple *end-of-sequence* tokens.
        position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
            Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
            positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
            For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        project_dim (`int`, *optional*, defaults to 768):
            The dimentions of the teacher model before the mapping layer.

    Example:
        ```python
        >>> from transformers import AltCLIPTextModel, AltCLIPTextConfig
        ...
        >>> # Initializing a AltCLIPTextConfig with BAAI/AltCLIP style configuration
        >>> configuration = AltCLIPTextConfig()
        ...
        >>> # Initializing a AltCLIPTextModel (with random weights) from the BAAI/AltCLIP style configuration
        >>> model = AltCLIPTextModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "altclip_text_model"

    def __init__(
        self,
        vocab_size=250002,
        hidden_size=1024,
        num_hidden_layers=24,
        num_attention_heads=16,
        intermediate_size=4096,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=514,
        type_vocab_size=1,
        initializer_range=0.02,
        initializer_factor=0.02,
        layer_norm_eps=1e-05,
        pad_token_id=1,
        bos_token_id=0,
        eos_token_id=2,
        position_embedding_type="absolute",
        use_cache=True,
        project_dim=768,
        **kwargs,
    ):
        """
        Initializes an instance of the AltCLIPTextConfig class.

        Args:
            vocab_size (int): The size of the vocabulary. Default is 250002.
            hidden_size (int): The size of the hidden layers. Default is 1024.
            num_hidden_layers (int): The number of hidden layers. Default is 24.
            num_attention_heads (int): The number of attention heads. Default is 16.
            intermediate_size (int): The size of the intermediate layer. Default is 4096.
            hidden_act (str): The activation function for the hidden layers. Default is 'gelu'.
            hidden_dropout_prob (float): The dropout probability for the hidden layers. Default is 0.1.
            attention_probs_dropout_prob (float): The dropout probability for the attention probabilities. Default is 0.1.
            max_position_embeddings (int): The maximum position embeddings. Default is 514.
            type_vocab_size (int): The size of the type vocabulary. Default is 1.
            initializer_range (float): The range for weight initialization. Default is 0.02.
            initializer_factor (float): The factor for weight initialization. Default is 0.02.
            layer_norm_eps (float): The epsilon value for layer normalization. Default is 1e-05.
            pad_token_id (int): The token ID for padding. Default is 1.
            bos_token_id (int): The token ID for the beginning of sentence. Default is 0.
            eos_token_id (int): The token ID for the end of sentence. Default is 2.
            position_embedding_type (str): The type of position embedding. Default is 'absolute'.
            use_cache (bool): Whether to use cache. Default is True.
            project_dim (int): The dimension for project. Default is 768.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.initializer_factor = initializer_factor
        self.layer_norm_eps = layer_norm_eps
        self.position_embedding_type = position_embedding_type
        self.use_cache = use_cache
        self.project_dim = project_dim

mindnlp.transformers.models.altclip.configuration_altclip.AltCLIPTextConfig.__init__(vocab_size=250002, hidden_size=1024, num_hidden_layers=24, num_attention_heads=16, intermediate_size=4096, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=514, type_vocab_size=1, initializer_range=0.02, initializer_factor=0.02, layer_norm_eps=1e-05, pad_token_id=1, bos_token_id=0, eos_token_id=2, position_embedding_type='absolute', use_cache=True, project_dim=768, **kwargs)

Initializes an instance of the AltCLIPTextConfig class.

PARAMETER DESCRIPTION
vocab_size

The size of the vocabulary. Default is 250002.

TYPE: int DEFAULT: 250002

hidden_size

The size of the hidden layers. Default is 1024.

TYPE: int DEFAULT: 1024

num_hidden_layers

The number of hidden layers. Default is 24.

TYPE: int DEFAULT: 24

num_attention_heads

The number of attention heads. Default is 16.

TYPE: int DEFAULT: 16

intermediate_size

The size of the intermediate layer. Default is 4096.

TYPE: int DEFAULT: 4096

hidden_act

The activation function for the hidden layers. Default is 'gelu'.

TYPE: str DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for the hidden layers. Default is 0.1.

TYPE: float DEFAULT: 0.1

attention_probs_dropout_prob

The dropout probability for the attention probabilities. Default is 0.1.

TYPE: float DEFAULT: 0.1

max_position_embeddings

The maximum position embeddings. Default is 514.

TYPE: int DEFAULT: 514

type_vocab_size

The size of the type vocabulary. Default is 1.

TYPE: int DEFAULT: 1

initializer_range

The range for weight initialization. Default is 0.02.

TYPE: float DEFAULT: 0.02

initializer_factor

The factor for weight initialization. Default is 0.02.

TYPE: float DEFAULT: 0.02

layer_norm_eps

The epsilon value for layer normalization. Default is 1e-05.

TYPE: float DEFAULT: 1e-05

pad_token_id

The token ID for padding. Default is 1.

TYPE: int DEFAULT: 1

bos_token_id

The token ID for the beginning of sentence. Default is 0.

TYPE: int DEFAULT: 0

eos_token_id

The token ID for the end of sentence. Default is 2.

TYPE: int DEFAULT: 2

position_embedding_type

The type of position embedding. Default is 'absolute'.

TYPE: str DEFAULT: 'absolute'

use_cache

Whether to use cache. Default is True.

TYPE: bool DEFAULT: True

project_dim

The projection dimension. Default is 768.

TYPE: int DEFAULT: 768

RETURNS DESCRIPTION

None.
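
Because the class inherits from [PretrainedConfig], the usual serialization helpers apply; a minimal round-trip sketch (import path assumed as in the examples above):

>>> from mindnlp.transformers import AltCLIPTextConfig
>>> config = AltCLIPTextConfig(project_dim=768)
>>> restored = AltCLIPTextConfig.from_dict(config.to_dict())
>>> restored.project_dim
768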

Source code in mindnlp/transformers/models/altclip/configuration_altclip.py
def __init__(
    self,
    vocab_size=250002,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=514,
    type_vocab_size=1,
    initializer_range=0.02,
    initializer_factor=0.02,
    layer_norm_eps=1e-05,
    pad_token_id=1,
    bos_token_id=0,
    eos_token_id=2,
    position_embedding_type="absolute",
    use_cache=True,
    project_dim=768,
    **kwargs,
):
    """
    Initializes an instance of the AltCLIPTextConfig class.

    Args:
        vocab_size (int): The size of the vocabulary. Default is 250002.
        hidden_size (int): The size of the hidden layers. Default is 1024.
        num_hidden_layers (int): The number of hidden layers. Default is 24.
        num_attention_heads (int): The number of attention heads. Default is 16.
        intermediate_size (int): The size of the intermediate layer. Default is 4096.
        hidden_act (str): The activation function for the hidden layers. Default is 'gelu'.
        hidden_dropout_prob (float): The dropout probability for the hidden layers. Default is 0.1.
        attention_probs_dropout_prob (float): The dropout probability for the attention probabilities. Default is 0.1.
        max_position_embeddings (int): The maximum position embeddings. Default is 514.
        type_vocab_size (int): The size of the type vocabulary. Default is 1.
        initializer_range (float): The range for weight initialization. Default is 0.02.
        initializer_factor (float): The factor for weight initialization. Default is 0.02.
        layer_norm_eps (float): The epsilon value for layer normalization. Default is 1e-05.
        pad_token_id (int): The token ID for padding. Default is 1.
        bos_token_id (int): The token ID for the beginning of sentence. Default is 0.
        eos_token_id (int): The token ID for the end of sentence. Default is 2.
        position_embedding_type (str): The type of position embedding. Default is 'absolute'.
        use_cache (bool): Whether to use cache. Default is True.
        project_dim (int): The dimension for project. Default is 768.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)

    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range
    self.initializer_factor = initializer_factor
    self.layer_norm_eps = layer_norm_eps
    self.position_embedding_type = position_embedding_type
    self.use_cache = use_cache
    self.project_dim = project_dim

mindnlp.transformers.models.altclip.configuration_altclip.AltCLIPVisionConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [AltCLIPModel]. It is used to instantiate an AltCLIP model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the AltCLIP BAAI/AltCLIP architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

intermediate_size

Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 3072 DEFAULT: 3072

projection_dim

Dimensionality of text and vision projection layers.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_channels

The number of input channels.

TYPE: `int`, *optional*, defaults to 3 DEFAULT: 3

image_size

The size (resolution) of each image.

TYPE: `int`, *optional*, defaults to 224 DEFAULT: 224

patch_size

The size (resolution) of each patch.

TYPE: `int`, *optional*, defaults to 32 DEFAULT: 32

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu", "gelu_new" and "quick_gelu" are supported.

TYPE: `str` or `function`, *optional*, defaults to `"quick_gelu"` DEFAULT: 'quick_gelu'

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-05 DEFAULT: 1e-05

attention_dropout

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

initializer_factor

A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

Example
>>> from transformers import AltCLIPVisionConfig, AltCLIPVisionModel
...
>>> # Initializing an AltCLIPVisionConfig with BAAI/AltCLIP style configuration
>>> configuration = AltCLIPVisionConfig()
...
>>> # Initializing an AltCLIPVisionModel (with random weights) from the BAAI/AltCLIP style configuration
>>> model = AltCLIPVisionModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
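
A hedged sketch of a vision configuration with a non-default input resolution (the `mindnlp.transformers` import path is assumed and the values are illustrative):

>>> from mindnlp.transformers import AltCLIPVisionConfig
>>> vision_config = AltCLIPVisionConfig(image_size=336, patch_size=14)
>>> (vision_config.image_size, vision_config.patch_size)
(336, 14)
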
Source code in mindnlp/transformers/models/altclip/configuration_altclip.py
class AltCLIPVisionConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`AltCLIPModel`]. It is used to instantiate an
    AltCLIP model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the AltCLIP
    [BAAI/AltCLIP](https://hf-mirror.com/BAAI/AltCLIP) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        projection_dim (`int`, *optional*, defaults to 512):
            Dimentionality of text and vision projection layers.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        image_size (`int`, *optional*, defaults to 224):
            The size (resolution) of each image.
        patch_size (`int`, *optional*, defaults to 32):
            The size (resolution) of each patch.
        hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported.
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        initializer_factor (`float`, *optional*, defaults to 1.0):
            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
            testing).

    Example:
        ```python
        >>> from transformers import AltCLIPVisionConfig, AltCLIPVisionModel
        ...
        >>> # Initializing a AltCLIPVisionConfig with BAAI/AltCLIP style configuration
        >>> configuration = AltCLIPVisionConfig()
        ...
        >>> # Initializing a AltCLIPVisionModel (with random weights) from the BAAI/AltCLIP style configuration
        >>> model = AltCLIPVisionModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "altclip_vision_model"

    def __init__(
        self,
        hidden_size=768,
        intermediate_size=3072,
        projection_dim=512,
        num_hidden_layers=12,
        num_attention_heads=12,
        num_channels=3,
        image_size=224,
        patch_size=32,
        hidden_act="quick_gelu",
        layer_norm_eps=1e-5,
        attention_dropout=0.0,
        initializer_range=0.02,
        initializer_factor=1.0,
        **kwargs,
    ):
        """
        Initializes an instance of the AltCLIPVisionConfig class.

        Args:
            self (AltCLIPVisionConfig): The instance of the class itself.
            hidden_size (int, optional): The size of the hidden layer.
            intermediate_size (int, optional): The size of the intermediate layer.
            projection_dim (int, optional): The dimension of the projection.
            num_hidden_layers (int, optional): The number of hidden layers.
            num_attention_heads (int, optional): The number of attention heads.
            num_channels (int, optional): The number of channels in the image.
            image_size (int, optional): The size of the image.
            patch_size (int, optional): The size of the patch.
            hidden_act (str, optional): The activation function for the hidden layer.
            layer_norm_eps (float, optional): The epsilon value for layer normalization.
            attention_dropout (float, optional): The dropout rate for attention.
            initializer_range (float, optional): The range for weight initialization.
            initializer_factor (float, optional): The factor for weight initialization.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(**kwargs)

        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.projection_dim = projection_dim
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_channels = num_channels
        self.patch_size = patch_size
        self.image_size = image_size
        self.initializer_range = initializer_range
        self.initializer_factor = initializer_factor
        self.attention_dropout = attention_dropout
        self.layer_norm_eps = layer_norm_eps
        self.hidden_act = hidden_act

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        """
        This method creates an instance of the AltCLIPVisionConfig class from a pretrained model.

        Args:
            cls (object): The class object. It represents the AltCLIPVisionConfig class.
            pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.
                It can be a string or a valid path.

        Returns:
            PretrainedConfig: An instance of the 'PretrainedConfig' class representing the configuration of the
                pretrained model. It contains the configuration details for the pretrained model.

        Raises:
            None
        """
        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the vision config dict if we are loading from AltCLIPConfig
        if config_dict.get("model_type") == "altclip":
            config_dict = config_dict["vision_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.altclip.configuration_altclip.AltCLIPVisionConfig.__init__(hidden_size=768, intermediate_size=3072, projection_dim=512, num_hidden_layers=12, num_attention_heads=12, num_channels=3, image_size=224, patch_size=32, hidden_act='quick_gelu', layer_norm_eps=1e-05, attention_dropout=0.0, initializer_range=0.02, initializer_factor=1.0, **kwargs)

Initializes an instance of the AltCLIPVisionConfig class.

PARAMETER DESCRIPTION
self

The instance of the class itself.

TYPE: AltCLIPVisionConfig

hidden_size

The size of the hidden layer.

TYPE: int DEFAULT: 768

intermediate_size

The size of the intermediate layer.

TYPE: int DEFAULT: 3072

projection_dim

The dimension of the projection.

TYPE: int DEFAULT: 512

num_hidden_layers

The number of hidden layers.

TYPE: int DEFAULT: 12

num_attention_heads

The number of attention heads.

TYPE: int DEFAULT: 12

num_channels

The number of channels in the image.

TYPE: int DEFAULT: 3

image_size

The size of the image.

TYPE: int DEFAULT: 224

patch_size

The size of the patch.

TYPE: int DEFAULT: 32

hidden_act

The activation function for the hidden layer.

TYPE: str DEFAULT: 'quick_gelu'

layer_norm_eps

The epsilon value for layer normalization.

TYPE: float DEFAULT: 1e-05

attention_dropout

The dropout rate for attention.

TYPE: float DEFAULT: 0.0

initializer_range

The range for weight initialization.

TYPE: float DEFAULT: 0.02

initializer_factor

The factor for weight initialization.

TYPE: float DEFAULT: 1.0

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/altclip/configuration_altclip.py
def __init__(
    self,
    hidden_size=768,
    intermediate_size=3072,
    projection_dim=512,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_channels=3,
    image_size=224,
    patch_size=32,
    hidden_act="quick_gelu",
    layer_norm_eps=1e-5,
    attention_dropout=0.0,
    initializer_range=0.02,
    initializer_factor=1.0,
    **kwargs,
):
    """
    Initializes an instance of the AltCLIPVisionConfig class.

    Args:
        self (AltCLIPVisionConfig): The instance of the class itself.
        hidden_size (int, optional): The size of the hidden layer.
        intermediate_size (int, optional): The size of the intermediate layer.
        projection_dim (int, optional): The dimension of the projection.
        num_hidden_layers (int, optional): The number of hidden layers.
        num_attention_heads (int, optional): The number of attention heads.
        num_channels (int, optional): The number of channels in the image.
        image_size (int, optional): The size of the image.
        patch_size (int, optional): The size of the patch.
        hidden_act (str, optional): The activation function for the hidden layer.
        layer_norm_eps (float, optional): The epsilon value for layer normalization.
        attention_dropout (float, optional): The dropout rate for attention.
        initializer_range (float, optional): The range for weight initialization.
        initializer_factor (float, optional): The factor for weight initialization.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(**kwargs)

    self.hidden_size = hidden_size
    self.intermediate_size = intermediate_size
    self.projection_dim = projection_dim
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.num_channels = num_channels
    self.patch_size = patch_size
    self.image_size = image_size
    self.initializer_range = initializer_range
    self.initializer_factor = initializer_factor
    self.attention_dropout = attention_dropout
    self.layer_norm_eps = layer_norm_eps
    self.hidden_act = hidden_act

mindnlp.transformers.models.altclip.configuration_altclip.AltCLIPVisionConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) classmethod

This method creates an instance of the AltCLIPVisionConfig class from a pretrained model.

PARAMETER DESCRIPTION
cls

The class object. It represents the AltCLIPVisionConfig class.

TYPE: object

pretrained_model_name_or_path

The name or path of the pretrained model. It can be a string or a valid path.

TYPE: Union[str, PathLike]

RETURNS DESCRIPTION
PretrainedConfig

An instance of the 'PretrainedConfig' class representing the configuration of the pretrained model. It contains the configuration details for the pretrained model.

TYPE: PretrainedConfig
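
A usage sketch: loading the vision sub-configuration directly from the full AltCLIP checkpoint listed in the archive map above (network access and the `mindnlp.transformers` import path are assumed):

>>> from mindnlp.transformers import AltCLIPVisionConfig
>>> vision_config = AltCLIPVisionConfig.from_pretrained("BAAI/AltCLIP")
>>> vision_config.model_type
'altclip_vision_model'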

Source code in mindnlp/transformers/models/altclip/configuration_altclip.py
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
    """
    This method creates an instance of the AltCLIPVisionConfig class from a pretrained model.

    Args:
        cls (object): The class object. It represents the AltCLIPVisionConfig class.
        pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.
            It can be a string or a valid path.

    Returns:
        PretrainedConfig: An instance of the 'PretrainedConfig' class representing the configuration of the
            pretrained model. It contains the configuration details for the pretrained model.

    Raises:
        None
    """
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

    # get the vision config dict if we are loading from AltCLIPConfig
    if config_dict.get("model_type") == "altclip":
        config_dict = config_dict["vision_config"]

    if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
        logger.warning(
            f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
            f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
        )

    return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.altclip.modeling_altclip.ALTCLIP_PRETRAINED_MODEL_ARCHIVE_LIST = ['BAAI/AltCLIP'] module-attribute

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.
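
Concrete subclasses such as [AltCLIPModel] rely on this base class for weight initialization and for the loading interface inherited from [PreTrainedModel]; a minimal sketch, assuming the checkpoint name from the archive list above and the `mindnlp.transformers` import path:

>>> from mindnlp.transformers import AltCLIPModel
>>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")  # downloads weights on first use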

Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
class AltCLIPPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """
    config_class = AltCLIPConfig
    base_model_prefix = "altclip"
    supports_gradient_checkpointing = True

    def _init_weights(self, cell):
        """Initialize the weights"""
        factor = self.config.initializer_factor
        if isinstance(cell, AltCLIPVisionEmbeddings):
            factor = self.config.initializer_factor
            cell.class_embedding.set_data(initializer(Normal(cell.embed_dim**-0.5 * factor),
                                                    cell.class_embedding.shape, cell.class_embedding.dtype))
            cell.patch_embedding.weight.set_data(initializer(Normal(cell.config.initializer_range * factor),
                                                    cell.patch_embedding.weight.shape, cell.patch_embedding.weight.dtype))
            cell.position_embedding.weight.set_data(initializer(Normal(cell.config.initializer_range * factor),
                                                    cell.position_embedding.weight.shape, cell.position_embedding.weight.dtype))
        elif isinstance(cell, AltCLIPAttention):
            factor = self.config.initializer_factor
            in_proj_std = (cell.embed_dim**-0.5) * ((2 * cell.config.num_hidden_layers) ** -0.5) * factor
            out_proj_std = (cell.embed_dim**-0.5) * factor
            cell.q_proj.weight.set_data(initializer(Normal(in_proj_std),
                                                    cell.q_proj.weight.shape, cell.q_proj.weight.dtype))
            cell.k_proj.weight.set_data(initializer(Normal(in_proj_std),
                                                    cell.k_proj.weight.shape, cell.k_proj.weight.dtype))
            cell.v_proj.weight.set_data(initializer(Normal(in_proj_std),
                                                    cell.v_proj.weight.shape, cell.v_proj.weight.dtype))
            cell.out_proj.weight.set_data(initializer(Normal(out_proj_std),
                                                    cell.out_proj.weight.shape, cell.out_proj.weight.dtype))

        elif isinstance(cell, AltCLIPMLP):
            factor = self.config.initializer_factor
            in_proj_std = (cell.config.hidden_size**-0.5) * ((2 * cell.config.num_hidden_layers) ** -0.5) * factor
            fc_std = (2 * cell.config.hidden_size) ** -0.5 * factor

            cell.fc1.weight.set_data(initializer(Normal(fc_std),
                                                cell.fc1.weight.shape, cell.fc1.weight.dtype))
            cell.fc2.weight.set_data(initializer(Normal(in_proj_std),
                                                cell.fc2.weight.shape, cell.fc2.weight.dtype))

        elif isinstance(cell, AltCLIPModel):
            cell.text_projection.weight.set_data(initializer(Normal(cell.text_embed_dim**-0.5 * self.config.initializer_factor),
                                                cell.text_projection.weight.shape, cell.text_projection.weight.dtype))
            cell.text_projection._is_initialized = True
            cell.visual_projection.weight.set_data(initializer(Normal(cell.vision_embed_dim**-0.5 * self.config.initializer_factor),
                                                cell.visual_projection.weight.shape, cell.visual_projection.weight.dtype))
            cell.visual_projection._is_initialized = True

        elif isinstance(cell, nn.Linear):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            cell.weight.set_data(initializer(Normal(self.config.initializer_factor),
                                                    cell.weight.shape, cell.weight.dtype))
            if cell.bias is not None:
                cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
        elif isinstance(cell, nn.Embedding):
            weight = np.random.normal(0.0, self.config.initializer_factor, cell.weight.shape)
            if cell.padding_idx:
                weight[cell.padding_idx] = 0

            cell.weight.set_data(Tensor(weight, cell.weight.dtype))
        elif isinstance(cell, nn.LayerNorm):
            cell.weight.set_data(initializer('ones', cell.weight.shape, cell.weight.dtype))
            cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
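
The standard deviations used above follow CLIP's scaled-initialization scheme: attention in-projections use `embed_dim**-0.5 * (2 * num_hidden_layers)**-0.5 * factor`, the out-projection uses `embed_dim**-0.5 * factor`, and the MLP's first layer uses `(2 * hidden_size)**-0.5 * factor`. A small sketch of the resulting values, with purely illustrative config numbers (hidden size 1024, 24 layers, factor 1.0; not taken from any particular checkpoint):

```python
>>> # Scaled-normal standard deviations as computed in _init_weights above,
>>> # for assumed, illustrative config values.
>>> embed_dim, num_hidden_layers, factor = 1024, 24, 1.0
>>> in_proj_std = (embed_dim ** -0.5) * ((2 * num_hidden_layers) ** -0.5) * factor
>>> out_proj_std = (embed_dim ** -0.5) * factor
>>> fc_std = (2 * embed_dim) ** -0.5 * factor
>>> round(in_proj_std, 5), round(out_proj_std, 5), round(fc_std, 5)
(0.00451, 0.03125, 0.0221)
```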

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPModel

Bases: AltCLIPPreTrainedModel

AltCLIPModel Represents an alternative implementation of the Contrastive Language-Image Pretraining (CLIP) model.

This class inherits from the AltCLIPPreTrainedModel class and includes methods to obtain text and image features, as well as to forward the final output.

The AltCLIPModel class includes the following methods:

  • get_text_features: Returns the text embeddings obtained by applying the projection layer to the pooled output of AltCLIPTextModel.
  • get_image_features: Returns the image embeddings obtained by applying the projection layer to the pooled output of AltCLIPVisionModel.
  • forward: Constructs the final output, including image-text similarity scores and label probabilities.
Example
>>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
>>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(images=image, return_tensors="pt")
>>> image_features = model.get_image_features(**inputs)
...
>>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
>>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(
...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
... )
>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
class AltCLIPModel(AltCLIPPreTrainedModel):

    """
    AltCLIPModel
    Represents an alternative implementation of the Contrastive Language-Image Pretraining (CLIP) model.

    This class inherits from the `AltCLIPPreTrainedModel` class and includes methods to obtain text and image features, as well as to forward the final output.

    The `AltCLIPModel` class includes the following methods:

    - get_text_features: Returns the text embeddings obtained by applying the projection layer to the pooled output of `AltCLIPTextModel`.
    - get_image_features: Returns the image embeddings obtained by applying the projection layer to the pooled output of `AltCLIPVisionModel`.
    - forward: Constructs the final output, including image-text similarity scores and label probabilities.

    Example:
        ```python
        >>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
        >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        >>> inputs = processor(images=image, return_tensors="pt")
        >>> image_features = model.get_image_features(**inputs)
        ...
        >>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
        >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        >>> inputs = processor(
        ...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
        ... )
        >>> outputs = model(**inputs)
        >>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
        >>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
        ```
    """
    config_class = AltCLIPConfig

    def __init__(self, config: AltCLIPConfig):
        """Initialize the AltCLIPModel with the provided configuration.

        Args:
            self: The instance of the AltCLIPModel class.
            config (AltCLIPConfig): The configuration object containing the settings for the AltCLIPModel.

        Returns:
            None.

        Raises:
            ValueError: If the 'config.vision_config' is not an instance of AltCLIPVisionConfig.
            ValueError: If the 'config.text_config' is not an instance of AltCLIPTextConfig.
        """
        super().__init__(config)

        if not isinstance(config.vision_config, AltCLIPVisionConfig):
            raise ValueError(
                "config.vision_config is expected to be of type AltCLIPVisionConfig but is of type"
                f" {type(config.vision_config)}."
            )
        if not isinstance(config.text_config, AltCLIPTextConfig):
            raise ValueError(
                "config.text_config is expected to be of type AltCLIPTextConfig but is of type"
                f" {type(config.text_config)}."
            )

        text_config = config.text_config
        vision_config = config.vision_config

        self.projection_dim = config.projection_dim
        self.text_embed_dim = text_config.project_dim
        self.vision_embed_dim = vision_config.hidden_size

        self.text_model = AltCLIPTextModel(text_config)
        self.vision_model = AltCLIPVisionTransformer(vision_config)

        self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False)
        self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False)
        self.logit_scale = Parameter(mindspore.tensor(self.config.logit_scale_init_value))

        # Initialize weights and apply final processing
        self.post_init()

    def get_text_features(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        token_type_ids=None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> mindspore.Tensor:
        r"""

        Returns:
            text_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
            applying the projection layer to the pooled output of [`AltCLIPTextModel`].

        Example:
            ```python
            >>> from transformers import AutoProcessor, AltCLIPModel
            ...
            >>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
            >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
            >>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
            >>> text_features = model.get_text_features(**inputs)
            ```
        """
        # Use AltCLIP model's config for some fields (if specified) instead of those of vision & text components.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            token_type_ids=token_type_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        pooled_output = text_outputs[1]
        text_features = self.text_projection(pooled_output)

        return text_features

    def get_image_features(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> mindspore.Tensor:
        r"""

        Returns:
            image_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
            applying the projection layer to the pooled output of [`AltCLIPVisionModel`].

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, AltCLIPModel
            ...
            >>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
            >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            >>> inputs = processor(images=image, return_tensors="pt")
            >>> image_features = model.get_image_features(**inputs)
            ```
        """
        # Use AltCLIP model's config for some fields (if specified) instead of those of vision & text components.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = vision_outputs[1]  # pooled_output
        image_features = self.visual_projection(pooled_output)

        return image_features

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        pixel_values: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        return_loss: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, AltCLIPOutput]:
        r"""

        Returns:
            `Union[Tuple, AltCLIPOutput]`

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, AltCLIPModel
            ...
            >>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
            >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            >>> inputs = processor(
            ...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
            ... )
            >>> outputs = model(**inputs)
            >>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
            >>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
            ```
        """
        # Use AltCLIP model's config for some fields (if specified) instead of those of vision & text components.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        image_embeds = vision_outputs[1]
        image_embeds = self.visual_projection(image_embeds)

        text_embeds = text_outputs[1]
        text_embeds = self.text_projection(text_embeds)

        # normalized features
        image_embeds = image_embeds / image_embeds.norm(ord=2, dim=-1, keepdim=True)
        text_embeds = text_embeds / text_embeds.norm(ord=2, dim=-1, keepdim=True)

        # cosine similarity as logits
        logit_scale = self.logit_scale.exp()
        logits_per_text = ops.matmul(text_embeds, image_embeds.t()) * logit_scale
        logits_per_image = logits_per_text.T

        loss = None
        if return_loss:
            loss = clip_loss(logits_per_text)

        if not return_dict:
            output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
            return ((loss,) + output) if loss is not None else output

        return AltCLIPOutput(
            loss=loss,
            logits_per_image=logits_per_image,
            logits_per_text=logits_per_text,
            text_embeds=text_embeds,
            image_embeds=image_embeds,
            text_model_output=text_outputs,
            vision_model_output=vision_outputs,
        )
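
The similarity computation in `forward` can be reproduced from the two feature helpers. The sketch below reuses `model`, `processor`, and `image` from the example in the class docstring, keeps `return_tensors="pt"` to match those examples, and is illustrative rather than part of the API:

```python
>>> from mindspore import ops
...
>>> # Recompute logits_per_image from the projected, L2-normalized features,
>>> # assuming `model`, `processor` and `image` from the example above.
>>> text_inputs = processor(text=["a photo of a cat", "a photo of a dog"],
...                         padding=True, return_tensors="pt")
>>> image_inputs = processor(images=image, return_tensors="pt")
>>> text_embeds = model.get_text_features(**text_inputs)
>>> image_embeds = model.get_image_features(**image_inputs)
>>> text_embeds = text_embeds / text_embeds.norm(ord=2, dim=-1, keepdim=True)
>>> image_embeds = image_embeds / image_embeds.norm(ord=2, dim=-1, keepdim=True)
>>> logits_per_image = model.logit_scale.exp() * ops.matmul(image_embeds, text_embeds.t())
>>> probs = logits_per_image.softmax(dim=1)  # label probabilities, as in forward()
```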

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPModel.__init__(config)

Initialize the AltCLIPModel with the provided configuration.

PARAMETER DESCRIPTION
self

The instance of the AltCLIPModel class.

config

The configuration object containing the settings for the AltCLIPModel.

TYPE: AltCLIPConfig

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If the 'config.vision_config' is not an instance of AltCLIPVisionConfig.

ValueError

If the 'config.text_config' is not an instance of AltCLIPTextConfig.

Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
def __init__(self, config: AltCLIPConfig):
    """Initialize the AltCLIPModel with the provided configuration.

    Args:
        self: The instance of the AltCLIPModel class.
        config (AltCLIPConfig): The configuration object containing the settings for the AltCLIPModel.

    Returns:
        None.

    Raises:
        ValueError: If the 'config.vision_config' is not an instance of AltCLIPVisionConfig.
        ValueError: If the 'config.text_config' is not an instance of AltCLIPTextConfig.
    """
    super().__init__(config)

    if not isinstance(config.vision_config, AltCLIPVisionConfig):
        raise ValueError(
            "config.vision_config is expected to be of type AltCLIPVisionConfig but is of type"
            f" {type(config.vision_config)}."
        )
    if not isinstance(config.text_config, AltCLIPTextConfig):
        raise ValueError(
            "config.text_config is expected to be of type AltCLIPTextConfig but is of type"
            f" {type(config.text_config)}."
        )

    text_config = config.text_config
    vision_config = config.vision_config

    self.projection_dim = config.projection_dim
    self.text_embed_dim = text_config.project_dim
    self.vision_embed_dim = vision_config.hidden_size

    self.text_model = AltCLIPTextModel(text_config)
    self.vision_model = AltCLIPVisionTransformer(vision_config)

    self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=False)
    self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=False)
    self.logit_scale = Parameter(mindspore.tensor(self.config.logit_scale_init_value))

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPModel.forward(input_ids=None, pixel_values=None, attention_mask=None, position_ids=None, token_type_ids=None, return_loss=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple, AltCLIPOutput]

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AltCLIPModel
...
>>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
>>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(
...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
... )
>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    pixel_values: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    token_type_ids: Optional[mindspore.Tensor] = None,
    return_loss: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, AltCLIPOutput]:
    r"""

    Returns:
        `Union[Tuple, AltCLIPOutput]`

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, AltCLIPModel
        ...
        >>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
        >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        >>> inputs = processor(
        ...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
        ... )
        >>> outputs = model(**inputs)
        >>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
        >>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
        ```
    """
    # Use AltCLIP model's config for some fields (if specified) instead of those of vision & text components.
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    text_outputs = self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    image_embeds = vision_outputs[1]
    image_embeds = self.visual_projection(image_embeds)

    text_embeds = text_outputs[1]
    text_embeds = self.text_projection(text_embeds)

    # normalized features
    image_embeds = image_embeds / image_embeds.norm(ord=2, dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(ord=2, dim=-1, keepdim=True)

    # cosine similarity as logits
    logit_scale = self.logit_scale.exp()
    logits_per_text = ops.matmul(text_embeds, image_embeds.t()) * logit_scale
    logits_per_image = logits_per_text.T

    loss = None
    if return_loss:
        loss = clip_loss(logits_per_text)

    if not return_dict:
        output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
        return ((loss,) + output) if loss is not None else output

    return AltCLIPOutput(
        loss=loss,
        logits_per_image=logits_per_image,
        logits_per_text=logits_per_text,
        text_embeds=text_embeds,
        image_embeds=image_embeds,
        text_model_output=text_outputs,
        vision_model_output=vision_outputs,
    )

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPModel.get_image_features(pixel_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
image_features

The image embeddings obtained by applying the projection layer to the pooled output of [AltCLIPVisionModel].

TYPE: `mindspore.Tensor` of shape `(batch_size, output_dim)`

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AltCLIPModel
...
>>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
>>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(images=image, return_tensors="pt")
>>> image_features = model.get_image_features(**inputs)
Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
def get_image_features(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> mindspore.Tensor:
    r"""

    Returns:
        image_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
        applying the projection layer to the pooled output of [`AltCLIPVisionModel`].

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, AltCLIPModel
        ...
        >>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
        >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        >>> inputs = processor(images=image, return_tensors="pt")
        >>> image_features = model.get_image_features(**inputs)
        ```
    """
    # Use AltCLIP model's config for some fields (if specified) instead of those of vision & text components.
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = vision_outputs[1]  # pooled_output
    image_features = self.visual_projection(pooled_output)

    return image_features

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPModel.get_text_features(input_ids=None, attention_mask=None, position_ids=None, token_type_ids=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
text_features

The text embeddings obtained by applying the projection layer to the pooled output of [AltCLIPTextModel].

TYPE: `mindspore.Tensor` of shape `(batch_size, output_dim)`

Example
>>> from transformers import AutoProcessor, AltCLIPModel
...
>>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
>>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
>>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
>>> text_features = model.get_text_features(**inputs)
Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
def get_text_features(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    token_type_ids=None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> mindspore.Tensor:
    r"""

    Returns:
        text_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
        applying the projection layer to the pooled output of [`AltCLIPTextModel`].

    Example:
        ```python
        >>> from transformers import AutoProcessor, AltCLIPModel
        ...
        >>> model = AltCLIPModel.from_pretrained("BAAI/AltCLIP")
        >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
        >>> inputs = processor(text=["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
        >>> text_features = model.get_text_features(**inputs)
        ```
    """
    # Use AltCLIP model's config for some fields (if specified) instead of those of vision & text components.
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    text_outputs = self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        token_type_ids=token_type_ids,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    pooled_output = text_outputs[1]
    text_features = self.text_projection(pooled_output)

    return text_features

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPTextModel

Bases: AltCLIPPreTrainedModel

Represents an alternative implementation of the CLIP (Contrastive Language-Image Pretraining) model specifically tailored for text. This class extends the AltCLIPPreTrainedModel class and includes methods for initializing the model, getting and setting input embeddings, resizing token embeddings, and forwarding the model for inference. The 'forward' method takes various input tensors and optional parameters and returns the model's output, including the last hidden state and the pooled CLS states. Additionally, usage examples are provided for reference.

Example
>>> from transformers import AutoProcessor, AltCLIPTextModel
>>> model = AltCLIPTextModel.from_pretrained("BAAI/AltCLIP")
>>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
>>> texts = ["it's a cat", "it's a dog"]
>>> inputs = processor(text=texts, padding=True, return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled CLS states
Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
class AltCLIPTextModel(AltCLIPPreTrainedModel):

    """
    Represents an alternative implementation of the CLIP (Contrastive Language-Image Pretraining) model specifically
    tailored for text. This class extends the AltCLIPPreTrainedModel class and includes methods for initializing the
    model, getting and setting input embeddings, resizing token embeddings, and forwarding the model for inference.
    The 'forward' method takes various input tensors and optional parameters and returns the model's output,
    including the last hidden state and the pooled CLS states. Additionally, usage examples are provided for reference.

    Example:
        ```python
        >>> from transformers import AutoProcessor, AltCLIPTextModel
        >>> model = AltCLIPTextModel.from_pretrained("BAAI/AltCLIP")
        >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
        >>> texts = ["it's a cat", "it's a dog"]
        >>> inputs = processor(text=texts, padding=True, return_tensors="pt")
        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        >>> pooled_output = outputs.pooler_output  # pooled CLS states
        ```
    """
    config_class = AltCLIPTextConfig

    def __init__(self, config):
        """
        Initializes an instance of the AltCLIPTextModel class.

        Args:
            self: The instance of the class.
            config:
                A configuration object containing parameters for the model initialization.

                - Type: dict
                - Purpose: Specifies the configuration settings for the model.
                - Restrictions: Must be a valid configuration dictionary.

        Returns:
            None.

        Raises:
            TypeError: If the provided config parameter is not of type dict.
            ValueError: If the config dictionary is missing required keys or contains invalid values.
            RuntimeError: If there is an issue during the initialization process.
        """
        super().__init__(config)
        self.roberta = AltRobertaModel(config, add_pooling_layer=False)
        self.transformation = nn.Linear(config.hidden_size, config.project_dim)
        self.pre_LN = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.post_init()

    def get_input_embeddings(self) -> nn.Module:
        """
        This method returns the input embeddings for the AltCLIPTextModel.

        Args:
            self: AltCLIPTextModel
                The instance of the AltCLIPTextModel class.

        Returns:
            nn.Module
                The input embeddings for the AltCLIPTextModel, represented as an instance of nn.Module.

        Raises:
            None
                This method does not raise any exceptions.
        """
        return self.roberta.embeddings.word_embeddings

    def set_input_embeddings(self, value: nn.Embedding) -> None:
        """
        Method to set the input embeddings for the AltCLIPTextModel.

        Args:
            self (AltCLIPTextModel): An instance of the AltCLIPTextModel class.
                This parameter refers to the object itself.
            value (nn.Embedding): The new embedding to be set as the input embedding.
                It should be an instance of nn.Embedding representing the input embeddings.
                The value parameter will replace the existing word embeddings in the model.

        Returns:
            None: This method does not return any value. It updates the input embeddings of the model in place.

        Raises:
            None
        """
        self.roberta.embeddings.word_embeddings = value

    def resize_token_embeddings(self, new_num_tokens: Optional[int] = None) -> nn.Embedding:
        """
        This method resizes the token embeddings of the AltCLIPTextModel.

        Args:
            self (AltCLIPTextModel): The instance of the AltCLIPTextModel class.

            new_num_tokens (Optional[int]): The new number of tokens for the resized embeddings.
            If None, the original number of tokens will be used. Default is None.

        Returns:
            nn.Embedding: The resized token embeddings as an instance of nn.Embedding.

        Raises:
            None: This method does not explicitly raise any exceptions.
        """
        return super().resize_token_embeddings(new_num_tokens)

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        encoder_hidden_states: Optional[mindspore.Tensor] = None,
        encoder_attention_mask: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPoolingAndProjection]:
        r"""
        Returns:
            Union[Tuple, BaseModelOutputWithPoolingAndProjection]

        Example:
            ```python
            >>> from transformers import AutoProcessor, AltCLIPTextModel
            ...
            >>> model = AltCLIPTextModel.from_pretrained("BAAI/AltCLIP")
            >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
            ...
            >>> texts = ["it's a cat", "it's a dog"]
            ...
            >>> inputs = processor(text=texts, padding=True, return_tensors="pt")
            ...
            >>> outputs = model(**inputs)
            >>> last_hidden_state = outputs.last_hidden_state
            >>> pooled_output = outputs.pooler_output  # pooled CLS states
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.roberta(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        # last module outputs
        sequence_output = outputs[0]

        # project every module
        sequence_output = self.pre_LN(sequence_output)

        # pooler
        projection_state = self.transformation(sequence_output)
        pooler_output = projection_state[:, 0]

        if not return_dict:
            return (projection_state, pooler_output) + outputs[2:4]

        return BaseModelOutputWithPoolingAndProjection(
            last_hidden_state=projection_state,
            pooler_output=pooler_output,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
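
As the pooling step above shows, `pooler_output` is simply the first (CLS) position of the projected sequence. A quick check, reusing `model` and `inputs` from the example in the class docstring (a sketch, not part of the API):

```python
>>> # pooler_output is the projected first-token (CLS) state.
>>> outputs = model(**inputs)
>>> bool((outputs.pooler_output == outputs.last_hidden_state[:, 0]).all())
True
```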

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPTextModel.__init__(config)

Initializes an instance of the AltCLIPTextModel class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

A configuration object containing parameters for the model initialization.

  • Type: dict
  • Purpose: Specifies the configuration settings for the model.
  • Restrictions: Must be a valid configuration dictionary.

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the provided config parameter is not of type dict.

ValueError

If the config dictionary is missing required keys or contains invalid values.

RuntimeError

If there is an issue during the initialization process.

Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
def __init__(self, config):
    """
    Initializes an instance of the AltCLIPTextModel class.

    Args:
        self: The instance of the class.
        config:
            A configuration object containing parameters for the model initialization.

            - Type: dict
            - Purpose: Specifies the configuration settings for the model.
            - Restrictions: Must be a valid configuration dictionary.

    Returns:
        None.

    Raises:
        TypeError: If the provided config parameter is not of type dict.
        ValueError: If the config dictionary is missing required keys or contains invalid values.
        RuntimeError: If there is an issue during the initialization process.
    """
    super().__init__(config)
    self.roberta = AltRobertaModel(config, add_pooling_layer=False)
    self.transformation = nn.Linear(config.hidden_size, config.project_dim)
    self.pre_LN = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
    self.post_init()

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPTextModel.forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, output_attentions=None, return_dict=None, output_hidden_states=None)

RETURNS DESCRIPTION
Union[Tuple, BaseModelOutputWithPoolingAndProjection]

Example
>>> from transformers import AutoProcessor, AltCLIPTextModel
...
>>> model = AltCLIPTextModel.from_pretrained("BAAI/AltCLIP")
>>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
...
>>> texts = ["it's a cat", "it's a dog"]
...
>>> inputs = processor(text=texts, padding=True, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled CLS states
Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    token_type_ids: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    encoder_hidden_states: Optional[mindspore.Tensor] = None,
    encoder_attention_mask: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPoolingAndProjection]:
    r"""
    Returns:
        Union[Tuple, BaseModelOutputWithPoolingAndProjection]

    Example:
        ```python
        >>> from transformers import AutoProcessor, AltCLIPTextModel
        ...
        >>> model = AltCLIPTextModel.from_pretrained("BAAI/AltCLIP")
        >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
        ...
        >>> texts = ["it's a cat", "it's a dog"]
        ...
        >>> inputs = processor(text=texts, padding=True, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        >>> pooled_output = outputs.pooler_output  # pooled CLS states
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    outputs = self.roberta(
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_attention_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    # last module outputs
    sequence_output = outputs[0]

    # project every module
    sequence_output = self.pre_LN(sequence_output)

    # pooler
    projection_state = self.transformation(sequence_output)
    pooler_output = projection_state[:, 0]

    if not return_dict:
        return (projection_state, pooler_output) + outputs[2:4]

    return BaseModelOutputWithPoolingAndProjection(
        last_hidden_state=projection_state,
        pooler_output=pooler_output,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPTextModel.get_input_embeddings()

This method returns the input embeddings for the AltCLIPTextModel.

PARAMETER DESCRIPTION
self

AltCLIPTextModel The instance of the AltCLIPTextModel class.

RETURNS DESCRIPTION
nn.Module

The input embeddings for the AltCLIPTextModel, represented as an instance of nn.Module.

Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
def get_input_embeddings(self) -> nn.Module:
    """
    This method returns the input embeddings for the AltCLIPTextModel.

    Args:
        self: AltCLIPTextModel
            The instance of the AltCLIPTextModel class.

    Returns:
        nn.Module
            The input embeddings for the AltCLIPTextModel, represented as an instance of nn.Module.

    Raises:
        None
            This method does not raise any exceptions.
    """
    return self.roberta.embeddings.word_embeddings

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPTextModel.resize_token_embeddings(new_num_tokens=None)

This method resizes the token embeddings of the AltCLIPTextModel.

PARAMETER DESCRIPTION
self

The instance of the AltCLIPTextModel class.

TYPE: AltCLIPTextModel

new_num_tokens

The new number of tokens for the resized embeddings.

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION
nn.Embedding

The resized token embeddings as an instance of nn.Embedding.

RAISES DESCRIPTION
None

This method does not explicitly raise any exceptions.

Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
def resize_token_embeddings(self, new_num_tokens: Optional[int] = None) -> nn.Embedding:
    """
    This method resizes the token embeddings of the AltCLIPTextModel.

    Args:
        self (AltCLIPTextModel): The instance of the AltCLIPTextModel class.

        new_num_tokens (Optional[int]): The new number of tokens for the resized embeddings.
        If None, the original number of tokens will be used. Default is None.

    Returns:
        nn.Embedding: The resized token embeddings as an instance of nn.Embedding.

    Raises:
        None: This method does not explicitly raise any exceptions.
    """
    return super().resize_token_embeddings(new_num_tokens)
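
A short usage sketch: growing the vocabulary, for instance after adding special tokens to the tokenizer, returns the resized embedding table. Here `model` is an `AltCLIPTextModel` as in the examples above and the token count is purely illustrative.

```python
>>> # Resize the vocabulary by 8 extra rows (the number is illustrative).
>>> old_size = model.get_input_embeddings().weight.shape[0]
>>> new_embeddings = model.resize_token_embeddings(old_size + 8)
>>> new_embeddings.weight.shape[0] == old_size + 8
True
```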

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPTextModel.set_input_embeddings(value)

Method to set the input embeddings for the AltCLIPTextModel.

PARAMETER DESCRIPTION
self

An instance of the AltCLIPTextModel class. This parameter refers to the object itself.

TYPE: AltCLIPTextModel

value

The new embedding to be set as the input embedding. It should be an instance of nn.Embedding representing the input embeddings. The value parameter will replace the existing word embeddings in the model.

TYPE: Embedding

RETURNS DESCRIPTION
None

This method does not return any value. It updates the input embeddings of the model in place.

TYPE: None

Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
def set_input_embeddings(self, value: nn.Embedding) -> None:
    """
    Method to set the input embeddings for the AltCLIPTextModel.

    Args:
        self (AltCLIPTextModel): An instance of the AltCLIPTextModel class.
            This parameter refers to the object itself.
        value (nn.Embedding): The new embedding to be set as the input embedding.
            It should be an instance of nn.Embedding representing the input embeddings.
            The value parameter will replace the existing word embeddings in the model.

    Returns:
        None: This method does not return any value. It updates the input embeddings of the model in place.

    Raises:
        None
    """
    self.roberta.embeddings.word_embeddings = value

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPVisionModel

Bases: AltCLIPPreTrainedModel

The 'AltCLIPVisionModel' class represents a vision model for the AltCLIP framework. It inherits from the 'AltCLIPPreTrainedModel' class and contains methods for initializing the model, obtaining input embeddings, and forwarding the model output. The 'AltCLIPVisionModel' class is designed to work with image inputs and provides flexibility in handling output attentions, hidden states, and return dictionaries. It supports the use of pre-trained models and enables easy integration with image processing pipelines.

The 'AltCLIPVisionModel' class can be instantiated and used to process image data, extract features, and perform inference in the context of the AltCLIP framework. It provides a convenient interface for leveraging vision transformers and accessing model outputs, such as hidden states and pooled representations of images.

This class encapsulates the functionality required to utilize vision models within the AltCLIP framework, allowing for seamless integration with image processing workflows and enabling efficient utilization of pre-trained models for various vision-related tasks.

Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
class AltCLIPVisionModel(AltCLIPPreTrainedModel):

    """
    The 'AltCLIPVisionModel' class represents a vision model for the AltCLIP framework.
    It inherits from the 'AltCLIPPreTrainedModel' class and contains methods for initializing the model, obtaining input
    embeddings, and forwarding the model output.
    The 'AltCLIPVisionModel' class is designed to work with image inputs and provides flexibility in handling output attentions, hidden states, and return
    dictionaries.
    It supports the use of pre-trained models and enables easy integration with image processing pipelines.

    The 'AltCLIPVisionModel' class can be instantiated and used to process image data, extract features,
    and perform inference in the context of the AltCLIP framework.
    It provides a convenient interface for leveraging vision transformers and accessing model outputs,
    such as hidden states and pooled representations of images.

    This class encapsulates the functionality required to utilize vision models within the AltCLIP framework,
    allowing for seamless integration with image processing workflows and enabling efficient
    utilization of pre-trained models for various vision-related tasks.
    """
    config_class = AltCLIPVisionConfig
    main_input_name = "pixel_values"

    def __init__(self, config: AltCLIPVisionConfig):
        """
        Initializes an instance of the AltCLIPVisionModel class.

        Args:
            self: The instance of the AltCLIPVisionModel class.
            config (AltCLIPVisionConfig):
                An instance of AltCLIPVisionConfig representing the configuration parameters for the vision model.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)
        self.vision_model = AltCLIPVisionTransformer(config)
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self) -> nn.Module:
        """
        Returns the input embeddings of the AltCLIPVisionModel.

        Args:
            self (AltCLIPVisionModel): An instance of the AltCLIPVisionModel class.

        Returns:
            nn.Module: The input embeddings of the AltCLIPVisionModel.

        Raises:
            None.
        """
        return self.vision_model.embeddings.patch_embedding

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        r"""
        Returns:
            Union[Tuple, BaseModelOutputWithPooling]

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, AltCLIPVisionModel
            ...
            >>> model = AltCLIPVisionModel.from_pretrained("BAAI/AltCLIP")
            >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = processor(images=image, return_tensors="pt")
            ...
            >>> outputs = model(**inputs)
            >>> last_hidden_state = outputs.last_hidden_state
            >>> pooled_output = outputs.pooler_output  # pooled CLS states
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        return self.vision_model(
            pixel_values=pixel_values,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
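
A brief sketch of what `get_input_embeddings` hands back for the vision model. The expected weight layout is an assumption based on the convolutional patch embedding referenced in `AltCLIPVisionEmbeddings` above, not something guaranteed by this page.

```python
>>> from transformers import AltCLIPVisionModel
...
>>> model = AltCLIPVisionModel.from_pretrained("BAAI/AltCLIP")
>>> patch_embedding = model.get_input_embeddings()
>>> # Assumed layout: (embed_dim, num_channels, patch_size, patch_size).
>>> tuple(patch_embedding.weight.shape)
```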

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPVisionModel.__init__(config)

Initializes an instance of the AltCLIPVisionModel class.

PARAMETER DESCRIPTION
self

The instance of the AltCLIPVisionModel class.

config

An instance of AltCLIPVisionConfig representing the configuration parameters for the vision model.

TYPE: AltCLIPVisionConfig

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/altclip/modeling_altclip.py
def __init__(self, config: AltCLIPVisionConfig):
    """
    Initializes an instance of the AltCLIPVisionModel class.

    Args:
        self: The instance of the AltCLIPVisionModel class.
        config (AltCLIPVisionConfig):
            An instance of AltCLIPVisionConfig representing the configuration parameters for the vision model.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)
    self.vision_model = AltCLIPVisionTransformer(config)
    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPVisionModel.forward(pixel_values=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple, BaseModelOutputWithPooling]

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AltCLIPVisionModel
...
>>> model = AltCLIPVisionModel.from_pretrained("BAAI/AltCLIP")
>>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(images=image, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled CLS states
Source code in mindnlp/transformers/models/altclip/modeling_altclip.py, lines 1913-1950
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPooling]:
    r"""
    Returns:
        Union[Tuple, BaseModelOutputWithPooling]

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, AltCLIPVisionModel
        ...
        >>> model = AltCLIPVisionModel.from_pretrained("BAAI/AltCLIP")
        >>> processor = AutoProcessor.from_pretrained("BAAI/AltCLIP")
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(images=image, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        >>> pooled_output = outputs.pooler_output  # pooled CLS states
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    return self.vision_model(
        pixel_values=pixel_values,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

mindnlp.transformers.models.altclip.modeling_altclip.AltCLIPVisionModel.get_input_embeddings()

Returns the input embeddings of the AltCLIPVisionModel.

PARAMETER DESCRIPTION
self

An instance of the AltCLIPVisionModel class.

TYPE: AltCLIPVisionModel

RETURNS DESCRIPTION
Module

nn.Module: The input embeddings of the AltCLIPVisionModel.

Source code in mindnlp/transformers/models/altclip/modeling_altclip.py, lines 1898-1911
def get_input_embeddings(self) -> nn.Module:
    """
    Returns the input embeddings of the AltCLIPVisionModel.

    Args:
        self (AltCLIPVisionModel): An instance of the AltCLIPVisionModel class.

    Returns:
        nn.Module: The input embeddings of the AltCLIPVisionModel.

    Raises:
        None.
    """
    return self.vision_model.embeddings.patch_embedding
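
For orientation, a short sketch of what `get_input_embeddings` hands back. As documented above it is the `vision_model.embeddings.patch_embedding` module; the comment describing it as a patch-projection layer is an assumption about the CLIP-style vision tower rather than something stated on this page.

```python
>>> from mindnlp.transformers.models.altclip.modeling_altclip import AltCLIPVisionModel
...
>>> model = AltCLIPVisionModel.from_pretrained("BAAI/AltCLIP")
>>> patch_embedding = model.get_input_embeddings()
>>> # Same object as model.vision_model.embeddings.patch_embedding; in CLIP-style
>>> # vision towers this is the layer that projects image patches into embeddings.
>>> patch_embedding is model.vision_model.embeddings.patch_embedding
True
```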

mindnlp.transformers.models.altclip.processing_altclip.AltCLIPProcessor

Bases: ProcessorMixin

Constructs an AltCLIP processor which wraps a CLIP image processor and an XLM-Roberta tokenizer into a single processor.

[AltCLIPProcessor] offers all the functionalities of [CLIPImageProcessor] and [XLMRobertaTokenizerFast]. See the [~AltCLIPProcessor.__call__] and [~AltCLIPProcessor.decode] for more information.

PARAMETER DESCRIPTION
image_processor

The image processor is a required input.

TYPE: [`CLIPImageProcessor`], *optional* DEFAULT: None

tokenizer

The tokenizer is a required input.

TYPE: [`XLMRobertaTokenizerFast`], *optional* DEFAULT: None

Source code in mindnlp/transformers/models/altclip/processing_altclip.py, lines 25-161
class AltCLIPProcessor(ProcessorMixin):
    r"""
    Constructs an AltCLIP processor which wraps a CLIP image processor and an XLM-Roberta tokenizer into a single
    processor.

    [`AltCLIPProcessor`] offers all the functionalities of [`CLIPImageProcessor`] and [`XLMRobertaTokenizerFast`]. See
    the [`~AltCLIPProcessor.__call__`] and [`~AltCLIPProcessor.decode`] for more information.

    Args:
        image_processor ([`CLIPImageProcessor`], *optional*):
            The image processor is a required input.
        tokenizer ([`XLMRobertaTokenizerFast`], *optional*):
            The tokenizer is a required input.
    """
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "CLIPImageProcessor"
    tokenizer_class = ("XLMRobertaTokenizer", "XLMRobertaTokenizerFast")

    def __init__(self, image_processor=None, tokenizer=None, **kwargs):
        """
        Initializes an instance of AltCLIPProcessor.

        Args:
            self (object): The instance of the class AltCLIPProcessor.
            image_processor (object, optional): An object responsible for processing images. 
                If not provided explicitly, it can be extracted from the 'feature_extractor' argument.
                Default is None.
            tokenizer (object, required): An object responsible for tokenizing input data.

        Returns:
            None.

        Raises:
            ValueError: If 'image_processor' is not specified.
            ValueError: If 'tokenizer' is not specified.
            FutureWarning: If 'feature_extractor' argument is used (deprecated).
        """
        feature_extractor = None
        if "feature_extractor" in kwargs:
            warnings.warn(
                "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`"
                " instead.",
                FutureWarning,
            )
            feature_extractor = kwargs.pop("feature_extractor")

        image_processor = image_processor if image_processor is not None else feature_extractor
        if image_processor is None:
            raise ValueError("You need to specify an `image_processor`.")
        if tokenizer is None:
            raise ValueError("You need to specify a `tokenizer`.")

        super().__init__(image_processor, tokenizer)

    def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
        """
        Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
        and `kwargs` arguments to XLMRobertaTokenizerFast's [`~XLMRobertaTokenizerFast.__call__`] if `text` is not
        `None` to encode the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
        CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
        of the above two methods for more information.

        Args:
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
                number of channels, H and W are image height and width.

            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.

        Returns:
            [`BatchEncoding`]:
                A [`BatchEncoding`] with the following fields:

                - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
                - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
                  `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
                  `None`).
                - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
        """
        if text is None and images is None:
            raise ValueError("You have to specify either text or images. Both cannot be none.")

        if text is not None:
            encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)

        if images is not None:
            image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)

        if text is not None and images is not None:
            encoding["pixel_values"] = image_features.pixel_values
            return encoding
        if text is not None:
            return encoding
        return BatchEncoding(data={**image_features}, tensor_type=return_tensors)

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to XLMRobertaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`].
        Please refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to XLMRobertaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    def model_input_names(self):
        """
        Retrieve the input names required for the model from the tokenizer and image processor.

        Args:
            self: An instance of the AltCLIPProcessor class.

        Returns:
            list: A list of unique input names required for the model, obtained by combining the input names from the tokenizer and image processor.

        Raises:
            None
        """
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
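
A minimal sketch of obtaining an `AltCLIPProcessor`, together with the required-argument check performed by `__init__`. The `from_pretrained` call assumes the checkpoint ships both the CLIP image processor and the XLM-Roberta tokenizer, which the docstring examples on this page also rely on.

```python
>>> from mindnlp.transformers.models.altclip.processing_altclip import AltCLIPProcessor
...
>>> # Most common path: load image processor and tokenizer together from one checkpoint.
>>> processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
...
>>> # Both components are required; omitting either raises a ValueError.
>>> AltCLIPProcessor(image_processor=None, tokenizer=None)
Traceback (most recent call last):
    ...
ValueError: You need to specify an `image_processor`.
```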

mindnlp.transformers.models.altclip.processing_altclip.AltCLIPProcessor.model_input_names property

Retrieve the input names required for the model from the tokenizer and image processor.

PARAMETER DESCRIPTION
self

An instance of the AltCLIPProcessor class.

RETURNS DESCRIPTION
list

A list of unique input names required for the model, obtained by combining the input names from the tokenizer and image processor.
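
A one-line illustration of the property; the exact contents and ordering of the returned list depend on the tokenizer and image processor in use, so the output shown here is an assumption rather than a guaranteed value.

```python
>>> from mindnlp.transformers.models.altclip.processing_altclip import AltCLIPProcessor
...
>>> processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
>>> processor.model_input_names  # tokenizer names followed by image processor names, deduplicated
['input_ids', 'attention_mask', 'pixel_values']
```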

mindnlp.transformers.models.altclip.processing_altclip.AltCLIPProcessor.__call__(text=None, images=None, return_tensors=None, **kwargs)

Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the text and kwargs arguments to XLMRobertaTokenizerFast's [~XLMRobertaTokenizerFast.__call__] if text is not None to encode the text. To prepare the image(s), this method forwards the images and kwargs arguments to CLIPImageProcessor's [~CLIPImageProcessor.__call__] if images is not None. Please refer to the docstring of the above two methods for more information.

PARAMETER DESCRIPTION
text

The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

TYPE: `str`, `List[str]`, `List[List[str]]` DEFAULT: None

images

The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a number of channels, H and W are image height and width.

TYPE: `PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]` DEFAULT: None

return_tensors

If set, will return tensors of a particular framework. Acceptable values are:

  • 'tf': Return TensorFlow tf.constant objects.
  • 'pt': Return PyTorch torch.Tensor objects.
  • 'np': Return NumPy np.ndarray objects.
  • 'jax': Return JAX jnp.ndarray objects.

TYPE: `str` or [`~utils.TensorType`], *optional* DEFAULT: None

RETURNS DESCRIPTION

[BatchEncoding]: A [BatchEncoding] with the following fields:

  • input_ids -- List of token ids to be fed to a model. Returned when text is not None.
  • attention_mask -- List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names and if text is not None).
  • pixel_values -- Pixel values to be fed to a model. Returned when images is not None.
Source code in mindnlp/transformers/models/altclip/processing_altclip.py, lines 79-129
def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
    """
    Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
    and `kwargs` arguments to XLMRobertaTokenizerFast's [`~XLMRobertaTokenizerFast.__call__`] if `text` is not
    `None` to encode the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
    CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
    of the above two methods for more information.

    Args:
        text (`str`, `List[str]`, `List[List[str]]`):
            The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
            (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
            `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
        images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
            The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
            tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
            number of channels, H and W are image height and width.

        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors of a particular framework. Acceptable values are:

            - `'tf'`: Return TensorFlow `tf.constant` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return NumPy `np.ndarray` objects.
            - `'jax'`: Return JAX `jnp.ndarray` objects.

    Returns:
        [`BatchEncoding`]:
            A [`BatchEncoding`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
    """
    if text is None and images is None:
        raise ValueError("You have to specify either text or images. Both cannot be none.")

    if text is not None:
        encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)

    if images is not None:
        image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)

    if text is not None and images is not None:
        encoding["pixel_values"] = image_features.pixel_values
        return encoding
    if text is not None:
        return encoding
    return BatchEncoding(data={**image_features}, tensor_type=return_tensors)
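
To make the return-value description concrete, a brief sketch of calling the processor with both text and images. The key set in the expected output follows from the documented behaviour (tokenizer outputs plus `pixel_values`), while `return_tensors="ms"` is an assumption about the MindSpore backend.

```python
>>> from PIL import Image
>>> import requests
>>> from mindnlp.transformers.models.altclip.processing_altclip import AltCLIPProcessor
...
>>> processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> batch = processor(text=["a photo of two cats"], images=image, return_tensors="ms")
>>> sorted(batch.keys())  # tokenizer outputs plus the image processor's pixel_values
['attention_mask', 'input_ids', 'pixel_values']
```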

mindnlp.transformers.models.altclip.processing_altclip.AltCLIPProcessor.__init__(image_processor=None, tokenizer=None, **kwargs)

Initializes an instance of AltCLIPProcessor.

PARAMETER DESCRIPTION
self

The instance of the class AltCLIPProcessor.

TYPE: object

image_processor

An object responsible for processing images. If not provided explicitly, it can be extracted from the 'feature_extractor' argument. Default is None.

TYPE: object DEFAULT: None

tokenizer

An object responsible for tokenizing input data.

TYPE: (object, required) DEFAULT: None

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If 'image_processor' is not specified.

ValueError

If 'tokenizer' is not specified.

FutureWarning

If 'feature_extractor' argument is used (deprecated).

Source code in mindnlp/transformers/models/altclip/processing_altclip.py, lines 43-77
def __init__(self, image_processor=None, tokenizer=None, **kwargs):
    """
    Initializes an instance of AltCLIPProcessor.

    Args:
        self (object): The instance of the class AltCLIPProcessor.
        image_processor (object, optional): An object responsible for processing images. 
            If not provided explicitly, it can be extracted from the 'feature_extractor' argument.
            Default is None.
        tokenizer (object, required): An object responsible for tokenizing input data.

    Returns:
        None.

    Raises:
        ValueError: If 'image_processor' is not specified.
        ValueError: If 'tokenizer' is not specified.
        FutureWarning: If 'feature_extractor' argument is used (deprecated).
    """
    feature_extractor = None
    if "feature_extractor" in kwargs:
        warnings.warn(
            "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`"
            " instead.",
            FutureWarning,
        )
        feature_extractor = kwargs.pop("feature_extractor")

    image_processor = image_processor if image_processor is not None else feature_extractor
    if image_processor is None:
        raise ValueError("You need to specify an `image_processor`.")
    if tokenizer is None:
        raise ValueError("You need to specify a `tokenizer`.")

    super().__init__(image_processor, tokenizer)

mindnlp.transformers.models.altclip.processing_altclip.AltCLIPProcessor.batch_decode(*args, **kwargs)

This method forwards all its arguments to XLMRobertaTokenizerFast's [~PreTrainedTokenizer.batch_decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/altclip/processing_altclip.py, lines 131-136
def batch_decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to XLMRobertaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`].
    Please refer to the docstring of this method for more information.
    """
    return self.tokenizer.batch_decode(*args, **kwargs)

mindnlp.transformers.models.altclip.processing_altclip.AltCLIPProcessor.decode(*args, **kwargs)

This method forwards all its arguments to XLMRobertaTokenizerFast's [~PreTrainedTokenizer.decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/altclip/processing_altclip.py, lines 138-143
def decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to XLMRobertaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please
    refer to the docstring of this method for more information.
    """
    return self.tokenizer.decode(*args, **kwargs)
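
Finally, since `batch_decode` and `decode` simply delegate to the underlying tokenizer, a round-trip sketch can make that concrete. The exact decoded string (special tokens, whitespace) depends on the XLM-Roberta tokenizer, so the output shown is indicative only.

```python
>>> from mindnlp.transformers.models.altclip.processing_altclip import AltCLIPProcessor
...
>>> processor = AltCLIPProcessor.from_pretrained("BAAI/AltCLIP")
>>> encoding = processor(text=["a photo of a cat"])
>>> # decode()/batch_decode() forward to the tokenizer, undoing the encoding above.
>>> processor.batch_decode(encoding["input_ids"], skip_special_tokens=True)
['a photo of a cat']
```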