align

mindnlp.transformers.models.align.configuration_align.ALIGN_PRETRAINED_CONFIG_ARCHIVE_MAP = {'kakaobrain/align-base': 'https://hf-mirror.com/kakaobrain/align-base/resolve/main/config.json'} module-attribute

mindnlp.transformers.models.align.configuration_align.AlignConfig

Bases: PretrainedConfig

[AlignConfig] is the configuration class to store the configuration of an [AlignModel]. It is used to instantiate an ALIGN model according to the specified arguments, defining the text model and vision model configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the ALIGN kakaobrain/align-base architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
text_config

Dictionary of configuration options used to initialize [AlignTextConfig].

TYPE: `dict`, *optional* DEFAULT: None

vision_config

Dictionary of configuration options used to initialize [AlignVisionConfig].

TYPE: `dict`, *optional* DEFAULT: None

projection_dim

Dimensionality of the text and vision projection layers.

TYPE: `int`, *optional*, defaults to 640 DEFAULT: 640

temperature_init_value

The initial value of the temperature parameter. The default is used as per the original ALIGN implementation.

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

kwargs

Dictionary of keyword arguments.

TYPE: *optional* DEFAULT: {}

Example
>>> from transformers import AlignConfig, AlignModel
...
>>> # Initializing a AlignConfig with kakaobrain/align-base style configuration
>>> configuration = AlignConfig()
...
>>> # Initializing a AlignModel (with random weights) from the kakaobrain/align-base style configuration
>>> model = AlignModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
...
>>> # We can also initialize a AlignConfig from a AlignTextConfig and a AlignVisionConfig
>>> from transformers import AlignTextConfig, AlignVisionConfig
...
>>> # Initializing ALIGN Text and Vision configurations
>>> config_text = AlignTextConfig()
>>> config_vision = AlignVisionConfig()
...
>>> config = AlignConfig.from_text_vision_configs(config_text, config_vision)
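
Because the text_config and vision_config dictionaries are expanded into full AlignTextConfig and AlignVisionConfig objects during initialization, nested options can be supplied as plain dicts and read back as attributes. A minimal sketch using the documented defaults (the overridden vocab_size value is illustrative only):

>>> # Override a nested text option while keeping the vision defaults
>>> config = AlignConfig(text_config={"vocab_size": 40000})
>>> config.text_config.vocab_size
40000
>>> config.vision_config.image_size
600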
Source code in mindnlp/transformers/models/align/configuration_align.py
class AlignConfig(PretrainedConfig):
    r"""
    [`AlignConfig`] is the configuration class to store the configuration of an [`AlignModel`]. It is used to
    instantiate an ALIGN model according to the specified arguments, defining the text model and vision model configs.
    Instantiating a configuration with the defaults will yield a similar configuration to that of the ALIGN
    [kakaobrain/align-base](https://hf-mirror.com/kakaobrain/align-base) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        text_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`AlignTextConfig`].
        vision_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`AlignVisionConfig`].
        projection_dim (`int`, *optional*, defaults to 640):
            Dimensionality of the text and vision projection layers.
        temperature_init_value (`float`, *optional*, defaults to 1.0):
            The initial value of the *temperature* parameter. The default is used as per the original ALIGN implementation.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        kwargs (*optional*):
            Dictionary of keyword arguments.

    Example:
        ```python
        >>> from transformers import AlignConfig, AlignModel
        ...
        >>> # Initializing a AlignConfig with kakaobrain/align-base style configuration
        >>> configuration = AlignConfig()
        ...
        >>> # Initializing a AlignModel (with random weights) from the kakaobrain/align-base style configuration
        >>> model = AlignModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ...
        >>> # We can also initialize a AlignConfig from a AlignTextConfig and a AlignVisionConfig
        >>> from transformers import AlignTextConfig, AlignVisionConfig
        ...
        >>> # Initializing ALIGN Text and Vision configurations
        >>> config_text = AlignTextConfig()
        >>> config_vision = AlignVisionConfig()
        ...
        >>> config = AlignConfig.from_text_vision_configs(config_text, config_vision)
        ```
    """
    model_type = "align"

    def __init__(
        self,
        text_config=None,
        vision_config=None,
        projection_dim=640,
        temperature_init_value=1.0,
        initializer_range=0.02,
        **kwargs,
    ):
        """
        Initializes an instance of the AlignConfig class.

        Args:
            self: The instance of the AlignConfig class.
            text_config (dict, optional): A dictionary containing configurations for text alignment. Defaults to None.
            vision_config (dict, optional): A dictionary containing configurations for vision alignment. Defaults to None.
            projection_dim (int, optional): The dimension of the projection. Defaults to 640.
            temperature_init_value (float, optional): The initial value for temperature. Defaults to 1.0.
            initializer_range (float, optional): The range for initializing variables. Defaults to 0.02.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(**kwargs)

        if text_config is None:
            text_config = {}
            logger.info("text_config is None. Initializing the AlignTextConfig with default values.")

        if vision_config is None:
            vision_config = {}
            logger.info("vision_config is None. Initializing the AlignVisionConfig with default values.")

        self.text_config = AlignTextConfig(**text_config)
        self.vision_config = AlignVisionConfig(**vision_config)

        self.projection_dim = projection_dim
        self.temperature_init_value = temperature_init_value
        self.initializer_range = initializer_range

    @classmethod
    def from_text_vision_configs(cls, text_config: AlignTextConfig, vision_config: AlignVisionConfig, **kwargs):
        r"""
        Instantiate a [`AlignConfig`] (or a derived class) from align text model configuration and align vision model
        configuration.

        Returns:
            [`AlignConfig`]: An instance of a configuration object
        """
        return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)

mindnlp.transformers.models.align.configuration_align.AlignConfig.__init__(text_config=None, vision_config=None, projection_dim=640, temperature_init_value=1.0, initializer_range=0.02, **kwargs)

Initializes an instance of the AlignConfig class.

PARAMETER DESCRIPTION
self

The instance of the AlignConfig class.

text_config

A dictionary containing configurations for text alignment. Defaults to None.

TYPE: dict DEFAULT: None

vision_config

A dictionary containing configurations for vision alignment. Defaults to None.

TYPE: dict DEFAULT: None

projection_dim

The dimension of the projection. Defaults to 640.

TYPE: int DEFAULT: 640

temperature_init_value

The initial value for temperature. Defaults to 1.0.

TYPE: float DEFAULT: 1.0

initializer_range

The range for initializing variables. Defaults to 0.02.

TYPE: float DEFAULT: 0.02

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/align/configuration_align.py
def __init__(
    self,
    text_config=None,
    vision_config=None,
    projection_dim=640,
    temperature_init_value=1.0,
    initializer_range=0.02,
    **kwargs,
):
    """
    Initializes an instance of the AlignConfig class.

    Args:
        self: The instance of the AlignConfig class.
        text_config (dict, optional): A dictionary containing configurations for text alignment. Defaults to None.
        vision_config (dict, optional): A dictionary containing configurations for vision alignment. Defaults to None.
        projection_dim (int, optional): The dimension of the projection. Defaults to 640.
        temperature_init_value (float, optional): The initial value for temperature. Defaults to 1.0.
        initializer_range (float, optional): The range for initializing variables. Defaults to 0.02.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(**kwargs)

    if text_config is None:
        text_config = {}
        logger.info("text_config is None. Initializing the AlignTextConfig with default values.")

    if vision_config is None:
        vision_config = {}
        logger.info("vision_config is None. Initializing the AlignVisionConfig with default values.")

    self.text_config = AlignTextConfig(**text_config)
    self.vision_config = AlignVisionConfig(**vision_config)

    self.projection_dim = projection_dim
    self.temperature_init_value = temperature_init_value
    self.initializer_range = initializer_range

mindnlp.transformers.models.align.configuration_align.AlignConfig.from_text_vision_configs(text_config, vision_config, **kwargs) classmethod

Instantiate a [AlignConfig] (or a derived class) from align text model configuration and align vision model configuration.

RETURNS DESCRIPTION

[AlignConfig]: An instance of a configuration object

Source code in mindnlp/transformers/models/align/configuration_align.py
@classmethod
def from_text_vision_configs(cls, text_config: AlignTextConfig, vision_config: AlignVisionConfig, **kwargs):
    r"""
    Instantiate a [`AlignConfig`] (or a derived class) from align text model configuration and align vision model
    configuration.

    Returns:
        [`AlignConfig`]: An instance of a configuration object
    """
    return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)
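
Since any extra keyword arguments are forwarded to the `AlignConfig` constructor, projection settings can be overridden in the same call. A small sketch (the chosen dimensions are illustrative, not recommended values):

>>> from transformers import AlignConfig, AlignTextConfig, AlignVisionConfig
...
>>> config_text = AlignTextConfig(hidden_size=512)
>>> config_vision = AlignVisionConfig()
...
>>> # projection_dim is forwarded through **kwargs to AlignConfig.__init__
>>> config = AlignConfig.from_text_vision_configs(config_text, config_vision, projection_dim=512)
>>> config.projection_dim
512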

mindnlp.transformers.models.align.configuration_align.AlignTextConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of an [AlignTextModel]. It is used to instantiate an ALIGN text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the text encoder of the ALIGN kakaobrain/align-base architecture. The default values here are copied from BERT.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vocab_size

Vocabulary size of the Align Text model. Defines the number of different tokens that can be represented by the input_ids passed when calling [AlignTextModel].

TYPE: `int`, *optional*, defaults to 30522 DEFAULT: 30522

hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

intermediate_size

Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 3072 DEFAULT: 3072

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.

TYPE: `str` or `Callable`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

TYPE: `float`, *optional*, defaults to 0.1 DEFAULT: 0.1

attention_probs_dropout_prob

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.1 DEFAULT: 0.1

max_position_embeddings

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

type_vocab_size

The vocabulary size of the token_type_ids passed when calling [AlignTextModel].

TYPE: `int`, *optional*, defaults to 2 DEFAULT: 2

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-12 DEFAULT: 1e-12

pad_token_id

Padding token id.

TYPE: `int`, *optional*, defaults to 0 DEFAULT: 0

position_embedding_type

Type of position embedding. Choose one of "absolute", "relative_key", "relative_key_query". For positional embeddings use "absolute". For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on "relative_key_query", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

TYPE: `str`, *optional*, defaults to `"absolute"` DEFAULT: 'absolute'

use_cache

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

Example
>>> from transformers import AlignTextConfig, AlignTextModel
...
>>> # Initializing a AlignTextConfig with kakaobrain/align-base style configuration
>>> configuration = AlignTextConfig()
...
>>> # Initializing a AlignTextModel (with random weights) from the kakaobrain/align-base style configuration
>>> model = AlignTextModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in mindnlp/transformers/models/align/configuration_align.py
class AlignTextConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of an [`AlignTextModel`]. It is used to instantiate an
    ALIGN text encoder according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the text encoder of the ALIGN
    [kakaobrain/align-base](https://hf-mirror.com/kakaobrain/align-base) architecture. The default values here are
    copied from BERT.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 30522):
            Vocabulary size of the Align Text model. Defines the number of different tokens that can be represented by
            the `input_ids` passed when calling [`AlignTextModel`].
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (`int`, *optional*, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (`int`, *optional*, defaults to 2):
            The vocabulary size of the `token_type_ids` passed when calling [`AlignTextModel`].
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        pad_token_id (`int`, *optional*, defaults to 0):
            Padding token id.
        position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
            Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
            positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
            For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.

    Example:
        ```python
        >>> from transformers import AlignTextConfig, AlignTextModel
        ...
        >>> # Initializing a AlignTextConfig with kakaobrain/align-base style configuration
        >>> configuration = AlignTextConfig()
        ...
        >>> # Initializing a AlignTextModel (with random weights) from the kakaobrain/align-base style configuration
        >>> model = AlignTextModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "align_text_model"

    def __init__(
        self,
        vocab_size=30522,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        pad_token_id=0,
        position_embedding_type="absolute",
        use_cache=True,
        **kwargs,
    ):
        """
        Initializes a new instance of the AlignTextConfig class.

        Args:
            self: The instance of the class.
            vocab_size (int, optional): The size of the vocabulary. Defaults to 30522.
            hidden_size (int, optional): The size of the hidden layers. Defaults to 768.
            num_hidden_layers (int, optional): The number of hidden layers. Defaults to 12.
            num_attention_heads (int, optional): The number of attention heads. Defaults to 12.
            intermediate_size (int, optional): The size of the intermediate layer in the transformer encoder. Defaults to 3072.
            hidden_act (str, optional): The activation function for the hidden layers. Defaults to 'gelu'.
            hidden_dropout_prob (float, optional): The dropout probability for the hidden layers. Defaults to 0.1.
            attention_probs_dropout_prob (float, optional): The dropout probability for the attention probabilities. Defaults to 0.1.
            max_position_embeddings (int, optional): The maximum number of positions for the positional embeddings. Defaults to 512.
            type_vocab_size (int, optional): The size of the type vocabulary. Defaults to 2.
            initializer_range (float, optional): The range for the weight initializers. Defaults to 0.02.
            layer_norm_eps (float, optional): The epsilon value for layer normalization. Defaults to 1e-12.
            pad_token_id (int, optional): The token id for padding. Defaults to 0.
            position_embedding_type (str, optional): The type of position embedding to use (e.g., 'absolute'). Defaults to 'absolute'.
            use_cache (bool, optional): Whether to use cache for the transformer encoder. Defaults to True.

        Returns:
            None.

        Raises:
            ValueError: If any of the provided parameters are not of the expected type or value range.
        """
        super().__init__(**kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.position_embedding_type = position_embedding_type
        self.use_cache = use_cache
        self.pad_token_id = pad_token_id

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        """
        Loads a pretrained model configuration from a given model name or file path.

        Args:
            cls (class): The class object itself.
            pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.
                It can be either a string representing the model name or an os.PathLike object representing the file path.
                Note that the model should be of type 'align' according to the configuration.
                Using a model of different type may cause errors in some configurations.

        Returns:
            PretrainedConfig: The loaded pretrained model configuration.

        Raises:
            None.

        """
        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the text config dict if we are loading from AlignConfig
        if config_dict.get("model_type") == "align":
            config_dict = config_dict["text_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.align.configuration_align.AlignTextConfig.__init__(vocab_size=30522, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, type_vocab_size=2, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, position_embedding_type='absolute', use_cache=True, **kwargs)

Initializes a new instance of the AlignTextConfig class.

PARAMETER DESCRIPTION
self

The instance of the class.

vocab_size

The size of the vocabulary. Defaults to 30522.

TYPE: int DEFAULT: 30522

hidden_size

The size of the hidden layers. Defaults to 768.

TYPE: int DEFAULT: 768

num_hidden_layers

The number of hidden layers. Defaults to 12.

TYPE: int DEFAULT: 12

num_attention_heads

The number of attention heads. Defaults to 12.

TYPE: int DEFAULT: 12

intermediate_size

The size of the intermediate layer in the transformer encoder. Defaults to 3072.

TYPE: int DEFAULT: 3072

hidden_act

The activation function for the hidden layers. Defaults to 'gelu'.

TYPE: str DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for the hidden layers. Defaults to 0.1.

TYPE: float DEFAULT: 0.1

attention_probs_dropout_prob

The dropout probability for the attention probabilities. Defaults to 0.1.

TYPE: float DEFAULT: 0.1

max_position_embeddings

The maximum number of positions for the positional embeddings. Defaults to 512.

TYPE: int DEFAULT: 512

type_vocab_size

The size of the type vocabulary. Defaults to 2.

TYPE: int DEFAULT: 2

initializer_range

The range for the weight initializers. Defaults to 0.02.

TYPE: float DEFAULT: 0.02

layer_norm_eps

The epsilon value for layer normalization. Defaults to 1e-12.

TYPE: float DEFAULT: 1e-12

pad_token_id

The token id for padding. Defaults to 0.

TYPE: int DEFAULT: 0

position_embedding_type

The type of position embedding to use (e.g., 'absolute'). Defaults to 'absolute'.

TYPE: str DEFAULT: 'absolute'

use_cache

Whether to use cache for the transformer encoder. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If any of the provided parameters are not of the expected type or value range.

Source code in mindnlp/transformers/models/align/configuration_align.py
def __init__(
    self,
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    type_vocab_size=2,
    initializer_range=0.02,
    layer_norm_eps=1e-12,
    pad_token_id=0,
    position_embedding_type="absolute",
    use_cache=True,
    **kwargs,
):
    """
    Initializes a new instance of the AlignTextConfig class.

    Args:
        self: The instance of the class.
        vocab_size (int, optional): The size of the vocabulary. Defaults to 30522.
        hidden_size (int, optional): The size of the hidden layers. Defaults to 768.
        num_hidden_layers (int, optional): The number of hidden layers. Defaults to 12.
        num_attention_heads (int, optional): The number of attention heads. Defaults to 12.
        intermediate_size (int, optional): The size of the intermediate layer in the transformer encoder. Defaults to 3072.
        hidden_act (str, optional): The activation function for the hidden layers. Defaults to 'gelu'.
        hidden_dropout_prob (float, optional): The dropout probability for the hidden layers. Defaults to 0.1.
        attention_probs_dropout_prob (float, optional): The dropout probability for the attention probabilities. Defaults to 0.1.
        max_position_embeddings (int, optional): The maximum number of positions for the positional embeddings. Defaults to 512.
        type_vocab_size (int, optional): The size of the type vocabulary. Defaults to 2.
        initializer_range (float, optional): The range for the weight initializers. Defaults to 0.02.
        layer_norm_eps (float, optional): The epsilon value for layer normalization. Defaults to 1e-12.
        pad_token_id (int, optional): The token id for padding. Defaults to 0.
        position_embedding_type (str, optional): The type of position embedding to use (e.g., 'absolute'). Defaults to 'absolute'.
        use_cache (bool, optional): Whether to use cache for the transformer encoder. Defaults to True.

    Returns:
        None.

    Raises:
        ValueError: If any of the provided parameters are not of the expected type or value range.
    """
    super().__init__(**kwargs)

    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range
    self.layer_norm_eps = layer_norm_eps
    self.position_embedding_type = position_embedding_type
    self.use_cache = use_cache
    self.pad_token_id = pad_token_id

mindnlp.transformers.models.align.configuration_align.AlignTextConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) classmethod

Loads a pretrained model configuration from a given model name or file path.

PARAMETER DESCRIPTION
cls

The class object itself.

TYPE: class

pretrained_model_name_or_path

The name or path of the pretrained model. It can be either a string representing the model name or an os.PathLike object representing the file path. Note that the model should be of type 'align' according to the configuration. Using a model of different type may cause errors in some configurations.

TYPE: Union[str, PathLike]

RETURNS DESCRIPTION
PretrainedConfig

The loaded pretrained model configuration.

TYPE: PretrainedConfig

Source code in mindnlp/transformers/models/align/configuration_align.py
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
    """
    Loads a pretrained model configuration from a given model name or file path.

    Args:
        cls (class): The class object itself.
        pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.
            It can be either a string representing the model name or an os.PathLike object representing the file path.
            Note that the model should be of type 'align' according to the configuration.
            Using a model of different type may cause errors in some configurations.

    Returns:
        PretrainedConfig: The loaded pretrained model configuration.

    Raises:
        None.

    """
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

    # get the text config dict if we are loading from AlignConfig
    if config_dict.get("model_type") == "align":
        config_dict = config_dict["text_config"]

    if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
        logger.warning(
            f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
            f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
        )

    return cls.from_dict(config_dict, **kwargs)
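
As the override above shows, pointing this method at a full ALIGN checkpoint (whose configuration has model_type "align") keeps only the nested text_config. A sketch assuming the kakaobrain/align-base configuration is reachable (for example through the mirror URL listed at the top of this page):

>>> from transformers import AlignTextConfig
...
>>> # Loading from an "align" checkpoint extracts its text_config portion
>>> text_config = AlignTextConfig.from_pretrained("kakaobrain/align-base")
>>> text_config.model_type
'align_text_model'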

mindnlp.transformers.models.align.configuration_align.AlignVisionConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of an [AlignVisionModel]. It is used to instantiate an ALIGN vision encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the vision encoder of the ALIGN kakaobrain/align-base architecture. The default values are copied from EfficientNet (efficientnet-b7).

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
num_channels

The number of input channels.

TYPE: `int`, *optional*, defaults to 3 DEFAULT: 3

image_size

The input image size.

TYPE: `int`, *optional*, defaults to 600 DEFAULT: 600

width_coefficient

Scaling coefficient for network width at each stage.

TYPE: `float`, *optional*, defaults to 2.0 DEFAULT: 2.0

depth_coefficient

Scaling coefficient for network depth at each stage.

TYPE: `float`, *optional*, defaults to 3.1 DEFAULT: 3.1

depth_divisor

A unit of network width.

TYPE: `int`, *optional*, defaults to 8 DEFAULT: 8

kernel_sizes

List of kernel sizes to be used in each block.

TYPE: `List[int]`, *optional*, defaults to `[3, 3, 5, 3, 5, 5, 3]` DEFAULT: [3, 3, 5, 3, 5, 5, 3]

in_channels

List of input channel sizes to be used in each block for convolutional layers.

TYPE: `List[int]`, *optional*, defaults to `[32, 16, 24, 40, 80, 112, 192]` DEFAULT: [32, 16, 24, 40, 80, 112, 192]

out_channels

List of output channel sizes to be used in each block for convolutional layers.

TYPE: `List[int]`, *optional*, defaults to `[16, 24, 40, 80, 112, 192, 320]` DEFAULT: [16, 24, 40, 80, 112, 192, 320]

depthwise_padding

List of block indices with square padding.

TYPE: `List[int]`, *optional*, defaults to `[]` DEFAULT: []

strides

List of stride sizes to be used in each block for convolutional layers.

TYPE: `List[int]`, *optional*, defaults to `[1, 2, 2, 2, 1, 2, 1]` DEFAULT: [1, 2, 2, 2, 1, 2, 1]

num_block_repeats

List of the number of times each block is to be repeated.

TYPE: `List[int]`, *optional*, defaults to `[1, 2, 2, 3, 3, 4, 1]` DEFAULT: [1, 2, 2, 3, 3, 4, 1]

expand_ratios

List of scaling coefficients for each block.

TYPE: `List[int]`, *optional*, defaults to `[1, 6, 6, 6, 6, 6, 6]` DEFAULT: [1, 6, 6, 6, 6, 6, 6]

squeeze_expansion_ratio

Squeeze expansion ratio.

TYPE: `float`, *optional*, defaults to 0.25 DEFAULT: 0.25

hidden_act

The non-linear activation function (function or string) in each block. If string, "gelu", "relu", "selu", "gelu_new", "silu" and "mish" are supported.

TYPE: `str` or `function`, *optional*, defaults to `"silu"` DEFAULT: 'swish'

hidden_dim

The hidden dimension of the layer before the classification head.

TYPE: `int`, *optional*, defaults to 2560 DEFAULT: 2560

pooling_type

Type of final pooling to be applied before the dense classification head. Available options are ["mean", "max"]

TYPE: `str` or `function`, *optional*, defaults to `"mean"` DEFAULT: 'mean'

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

batch_norm_eps

The epsilon used by the batch normalization layers.

TYPE: `float`, *optional*, defaults to 1e-3 DEFAULT: 0.001

batch_norm_momentum

The momentum used by the batch normalization layers.

TYPE: `float`, *optional*, defaults to 0.99 DEFAULT: 0.99

drop_connect_rate

The drop rate for skip connections.

TYPE: `float`, *optional*, defaults to 0.2 DEFAULT: 0.2

Example
>>> from transformers import AlignVisionConfig, AlignVisionModel
...
>>> # Initializing a AlignVisionConfig with kakaobrain/align-base style configuration
>>> configuration = AlignVisionConfig()
...
>>> # Initializing a AlignVisionModel (with random weights) from the kakaobrain/align-base style configuration
>>> model = AlignVisionModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in mindnlp/transformers/models/align/configuration_align.py
class AlignVisionConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of an [`AlignVisionModel`]. It is used to instantiate an
    ALIGN vision encoder according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the vision encoder of the ALIGN
    [kakaobrain/align-base](https://hf-mirror.com/kakaobrain/align-base) architecture. The default values are copied
    from EfficientNet (efficientnet-b7).

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        image_size (`int`, *optional*, defaults to 600):
            The input image size.
        width_coefficient (`float`, *optional*, defaults to 2.0):
            Scaling coefficient for network width at each stage.
        depth_coefficient (`float`, *optional*, defaults to 3.1):
            Scaling coefficient for network depth at each stage.
        depth_divisor (`int`, *optional*, defaults to 8):
            A unit of network width.
        kernel_sizes (`List[int]`, *optional*, defaults to `[3, 3, 5, 3, 5, 5, 3]`):
            List of kernel sizes to be used in each block.
        in_channels (`List[int]`, *optional*, defaults to `[32, 16, 24, 40, 80, 112, 192]`):
            List of input channel sizes to be used in each block for convolutional layers.
        out_channels (`List[int]`, *optional*, defaults to `[16, 24, 40, 80, 112, 192, 320]`):
            List of output channel sizes to be used in each block for convolutional layers.
        depthwise_padding (`List[int]`, *optional*, defaults to `[]`):
            List of block indices with square padding.
        strides (`List[int]`, *optional*, defaults to `[1, 2, 2, 2, 1, 2, 1]`):
            List of stride sizes to be used in each block for convolutional layers.
        num_block_repeats (`List[int]`, *optional*, defaults to `[1, 2, 2, 3, 3, 4, 1]`):
            List of the number of times each block is to be repeated.
        expand_ratios (`List[int]`, *optional*, defaults to `[1, 6, 6, 6, 6, 6, 6]`):
            List of scaling coefficients for each block.
        squeeze_expansion_ratio (`float`, *optional*, defaults to 0.25):
            Squeeze expansion ratio.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in each block. If string, `"gelu"`, `"relu"`,
            `"selu"`, `"gelu_new"`, `"silu"` and `"mish"` are supported.
        hidden_dim (`int`, *optional*, defaults to 2560):
            The hidden dimension of the layer before the classification head.
        pooling_type (`str` or `function`, *optional*, defaults to `"mean"`):
            Type of final pooling to be applied before the dense classification head. Available options are [`"mean"`,
            `"max"`]
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        batch_norm_eps (`float`, *optional*, defaults to 1e-3):
            The epsilon used by the batch normalization layers.
        batch_norm_momentum (`float`, *optional*, defaults to 0.99):
            The momentum used by the batch normalization layers.
        drop_connect_rate (`float`, *optional*, defaults to 0.2):
            The drop rate for skip connections.

    Example:
        ```python
        >>> from transformers import AlignVisionConfig, AlignVisionModel
        ...
        >>> # Initializing a AlignVisionConfig with kakaobrain/align-base style configuration
        >>> configuration = AlignVisionConfig()
        ...
        >>> # Initializing a AlignVisionModel (with random weights) from the kakaobrain/align-base style configuration
        >>> model = AlignVisionModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "align_vision_model"

    def __init__(
        self,
        num_channels: int = 3,
        image_size: int = 600,
        width_coefficient: float = 2.0,
        depth_coefficient: float = 3.1,
        depth_divisor: int = 8,
        kernel_sizes: List[int] = [3, 3, 5, 3, 5, 5, 3],
        in_channels: List[int] = [32, 16, 24, 40, 80, 112, 192],
        out_channels: List[int] = [16, 24, 40, 80, 112, 192, 320],
        depthwise_padding: List[int] = [],
        strides: List[int] = [1, 2, 2, 2, 1, 2, 1],
        num_block_repeats: List[int] = [1, 2, 2, 3, 3, 4, 1],
        expand_ratios: List[int] = [1, 6, 6, 6, 6, 6, 6],
        squeeze_expansion_ratio: float = 0.25,
        hidden_act: str = "swish",
        hidden_dim: int = 2560,
        pooling_type: str = "mean",
        initializer_range: float = 0.02,
        batch_norm_eps: float = 0.001,
        batch_norm_momentum: float = 0.99,
        drop_connect_rate: float = 0.2,
        **kwargs,
    ):
        """
        Initializes an instance of the `AlignVisionConfig` class.

        Args:
            self: The instance of the class itself.
            num_channels (int): The number of channels in the input image. Default is 3.
            image_size (int): The size of the input image. Default is 600.
            width_coefficient (float): The width coefficient for scaling the number of channels in each layer. Default is 2.0.
            depth_coefficient (float): The depth coefficient for scaling the number of layers. Default is 3.1.
            depth_divisor (int): The divisor for computing the number of output channels in each layer. Default is 8.
            kernel_sizes (List[int]): The list of kernel sizes for each layer. Default is [3, 3, 5, 3, 5, 5, 3].
            in_channels (List[int]): The list of input channels for each layer. Default is [32, 16, 24, 40, 80, 112, 192].
            out_channels (List[int]): The list of output channels for each layer. Default is [16, 24, 40, 80, 112, 192, 320].
            depthwise_padding (List[int]): The list of padding values for depthwise convolution layers. Default is [].
            strides (List[int]): The list of stride values for each layer. Default is [1, 2, 2, 2, 1, 2, 1].
            num_block_repeats (List[int]): The list of repeat counts for each block. Default is [1, 2, 2, 3, 3, 4, 1].
            expand_ratios (List[int]): The list of expansion ratios for each block. Default is [1, 6, 6, 6, 6, 6, 6].
            squeeze_expansion_ratio (float): The expansion ratio for the squeeze layer. Default is 0.25.
            hidden_act (str): The activation function for the hidden layers. Default is 'swish'.
            hidden_dim (int): The dimension of the hidden layers. Default is 2560.
            pooling_type (str): The type of pooling to use. Default is 'mean'.
            initializer_range (float): The range of the initializer. Default is 0.02.
            batch_norm_eps (float): The epsilon value for batch normalization. Default is 0.001.
            batch_norm_momentum (float): The momentum value for batch normalization. Default is 0.99.
            drop_connect_rate (float): The rate at which to drop connections. Default is 0.2.
            **kwargs: Additional keyword arguments.

        Returns:
            None.

        Raises:
            None.

        """
        super().__init__(**kwargs)

        self.num_channels = num_channels
        self.image_size = image_size
        self.width_coefficient = width_coefficient
        self.depth_coefficient = depth_coefficient
        self.depth_divisor = depth_divisor
        self.kernel_sizes = kernel_sizes
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.depthwise_padding = depthwise_padding
        self.strides = strides
        self.num_block_repeats = num_block_repeats
        self.expand_ratios = expand_ratios
        self.squeeze_expansion_ratio = squeeze_expansion_ratio
        self.hidden_act = hidden_act
        self.hidden_dim = hidden_dim
        self.pooling_type = pooling_type
        self.initializer_range = initializer_range
        self.batch_norm_eps = batch_norm_eps
        self.batch_norm_momentum = batch_norm_momentum
        self.drop_connect_rate = drop_connect_rate
        self.num_hidden_layers = sum(num_block_repeats) * 4

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        """
        This method creates an instance of AlignVisionConfig from a pretrained model configuration.

        Args:
            cls (class): The class object itself.
            pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model configuration.
                It can be either a string or a path-like object.

        Returns:
            PretrainedConfig: An instance of PretrainedConfig class representing the configuration of the pretrained model.

        Raises:
            Warning: If the model type specified in the configuration is different from the model type of the class, a warning is issued
                because using a different model type may lead to errors in some configurations of models.
        """
        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the vision config dict if we are loading from AlignConfig
        if config_dict.get("model_type") == "align":
            config_dict = config_dict["vision_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.align.configuration_align.AlignVisionConfig.__init__(num_channels=3, image_size=600, width_coefficient=2.0, depth_coefficient=3.1, depth_divisor=8, kernel_sizes=[3, 3, 5, 3, 5, 5, 3], in_channels=[32, 16, 24, 40, 80, 112, 192], out_channels=[16, 24, 40, 80, 112, 192, 320], depthwise_padding=[], strides=[1, 2, 2, 2, 1, 2, 1], num_block_repeats=[1, 2, 2, 3, 3, 4, 1], expand_ratios=[1, 6, 6, 6, 6, 6, 6], squeeze_expansion_ratio=0.25, hidden_act='swish', hidden_dim=2560, pooling_type='mean', initializer_range=0.02, batch_norm_eps=0.001, batch_norm_momentum=0.99, drop_connect_rate=0.2, **kwargs)

Initializes an instance of the AlignVisionConfig class.

PARAMETER DESCRIPTION
self

The instance of the class itself.

num_channels

The number of channels in the input image. Default is 3.

TYPE: int DEFAULT: 3

image_size

The size of the input image. Default is 600.

TYPE: int DEFAULT: 600

width_coefficient

The width coefficient for scaling the number of channels in each layer. Default is 2.0.

TYPE: float DEFAULT: 2.0

depth_coefficient

The depth coefficient for scaling the number of layers. Default is 3.1.

TYPE: float DEFAULT: 3.1

depth_divisor

The divisor for computing the number of output channels in each layer. Default is 8.

TYPE: int DEFAULT: 8

kernel_sizes

The list of kernel sizes for each layer. Default is [3, 3, 5, 3, 5, 5, 3].

TYPE: List[int] DEFAULT: [3, 3, 5, 3, 5, 5, 3]

in_channels

The list of input channels for each layer. Default is [32, 16, 24, 40, 80, 112, 192].

TYPE: List[int] DEFAULT: [32, 16, 24, 40, 80, 112, 192]

out_channels

The list of output channels for each layer. Default is [16, 24, 40, 80, 112, 192, 320].

TYPE: List[int] DEFAULT: [16, 24, 40, 80, 112, 192, 320]

depthwise_padding

The list of padding values for depthwise convolution layers. Default is [].

TYPE: List[int] DEFAULT: []

strides

The list of stride values for each layer. Default is [1, 2, 2, 2, 1, 2, 1].

TYPE: List[int] DEFAULT: [1, 2, 2, 2, 1, 2, 1]

num_block_repeats

The list of repeat counts for each block. Default is [1, 2, 2, 3, 3, 4, 1].

TYPE: List[int] DEFAULT: [1, 2, 2, 3, 3, 4, 1]

expand_ratios

The list of expansion ratios for each block. Default is [1, 6, 6, 6, 6, 6, 6].

TYPE: List[int] DEFAULT: [1, 6, 6, 6, 6, 6, 6]

squeeze_expansion_ratio

The expansion ratio for the squeeze layer. Default is 0.25.

TYPE: float DEFAULT: 0.25

hidden_act

The activation function for the hidden layers. Default is 'swish'.

TYPE: str DEFAULT: 'swish'

hidden_dim

The dimension of the hidden layers. Default is 2560.

TYPE: int DEFAULT: 2560

pooling_type

The type of pooling to use. Default is 'mean'.

TYPE: str DEFAULT: 'mean'

initializer_range

The range of the initializer. Default is 0.02.

TYPE: float DEFAULT: 0.02

batch_norm_eps

The epsilon value for batch normalization. Default is 0.001.

TYPE: float DEFAULT: 0.001

batch_norm_momentum

The momentum value for batch normalization. Default is 0.99.

TYPE: float DEFAULT: 0.99

drop_connect_rate

The rate at which to drop connections. Default is 0.2.

TYPE: float DEFAULT: 0.2

**kwargs

Additional keyword arguments.

DEFAULT: {}

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/align/configuration_align.py
def __init__(
    self,
    num_channels: int = 3,
    image_size: int = 600,
    width_coefficient: float = 2.0,
    depth_coefficient: float = 3.1,
    depth_divisor: int = 8,
    kernel_sizes: List[int] = [3, 3, 5, 3, 5, 5, 3],
    in_channels: List[int] = [32, 16, 24, 40, 80, 112, 192],
    out_channels: List[int] = [16, 24, 40, 80, 112, 192, 320],
    depthwise_padding: List[int] = [],
    strides: List[int] = [1, 2, 2, 2, 1, 2, 1],
    num_block_repeats: List[int] = [1, 2, 2, 3, 3, 4, 1],
    expand_ratios: List[int] = [1, 6, 6, 6, 6, 6, 6],
    squeeze_expansion_ratio: float = 0.25,
    hidden_act: str = "swish",
    hidden_dim: int = 2560,
    pooling_type: str = "mean",
    initializer_range: float = 0.02,
    batch_norm_eps: float = 0.001,
    batch_norm_momentum: float = 0.99,
    drop_connect_rate: float = 0.2,
    **kwargs,
):
    """
    Initializes an instance of the `AlignVisionConfig` class.

    Args:
        self: The instance of the class itself.
        num_channels (int): The number of channels in the input image. Default is 3.
        image_size (int): The size of the input image. Default is 600.
        width_coefficient (float): The width coefficient for scaling the number of channels in each layer. Default is 2.0.
        depth_coefficient (float): The depth coefficient for scaling the number of layers. Default is 3.1.
        depth_divisor (int): The divisor for computing the number of output channels in each layer. Default is 8.
        kernel_sizes (List[int]): The list of kernel sizes for each layer. Default is [3, 3, 5, 3, 5, 5, 3].
        in_channels (List[int]): The list of input channels for each layer. Default is [32, 16, 24, 40, 80, 112, 192].
        out_channels (List[int]): The list of output channels for each layer. Default is [16, 24, 40, 80, 112, 192, 320].
        depthwise_padding (List[int]): The list of padding values for depthwise convolution layers. Default is [].
        strides (List[int]): The list of stride values for each layer. Default is [1, 2, 2, 2, 1, 2, 1].
        num_block_repeats (List[int]): The list of repeat counts for each block. Default is [1, 2, 2, 3, 3, 4, 1].
        expand_ratios (List[int]): The list of expansion ratios for each block. Default is [1, 6, 6, 6, 6, 6, 6].
        squeeze_expansion_ratio (float): The expansion ratio for the squeeze layer. Default is 0.25.
        hidden_act (str): The activation function for the hidden layers. Default is 'swish'.
        hidden_dim (int): The dimension of the hidden layers. Default is 2560.
        pooling_type (str): The type of pooling to use. Default is 'mean'.
        initializer_range (float): The range of the initializer. Default is 0.02.
        batch_norm_eps (float): The epsilon value for batch normalization. Default is 0.001.
        batch_norm_momentum (float): The momentum value for batch normalization. Default is 0.99.
        drop_connect_rate (float): The rate at which to drop connections. Default is 0.2.
        **kwargs: Additional keyword arguments.

    Returns:
        None.

    Raises:
        None.

    """
    super().__init__(**kwargs)

    self.num_channels = num_channels
    self.image_size = image_size
    self.width_coefficient = width_coefficient
    self.depth_coefficient = depth_coefficient
    self.depth_divisor = depth_divisor
    self.kernel_sizes = kernel_sizes
    self.in_channels = in_channels
    self.out_channels = out_channels
    self.depthwise_padding = depthwise_padding
    self.strides = strides
    self.num_block_repeats = num_block_repeats
    self.expand_ratios = expand_ratios
    self.squeeze_expansion_ratio = squeeze_expansion_ratio
    self.hidden_act = hidden_act
    self.hidden_dim = hidden_dim
    self.pooling_type = pooling_type
    self.initializer_range = initializer_range
    self.batch_norm_eps = batch_norm_eps
    self.batch_norm_momentum = batch_norm_momentum
    self.drop_connect_rate = drop_connect_rate
    self.num_hidden_layers = sum(num_block_repeats) * 4
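
Note that num_hidden_layers is not a constructor argument: it is derived as sum(num_block_repeats) * 4, so the default repeats (1 + 2 + 2 + 3 + 3 + 4 + 1 = 16) yield 64 hidden layers. A quick check with the defaults:

>>> from transformers import AlignVisionConfig
...
>>> config = AlignVisionConfig()
>>> config.num_hidden_layers
64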

mindnlp.transformers.models.align.configuration_align.AlignVisionConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) classmethod

This method creates an instance of AlignVisionConfig from a pretrained model configuration.

PARAMETER DESCRIPTION
cls

The class object itself.

TYPE: class

pretrained_model_name_or_path

The name or path of the pretrained model configuration. It can be either a string or a path-like object.

TYPE: Union[str, PathLike]

RETURNS DESCRIPTION
PretrainedConfig

An instance of PretrainedConfig class representing the configuration of the pretrained model.

TYPE: PretrainedConfig

RAISES DESCRIPTION
Warning

If the model type specified in the configuration is different from the model type of the class, a warning is issued because using a different model type may lead to errors in some configurations of models.

Source code in mindnlp/transformers/models/align/configuration_align.py
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
    """
    This method creates an instance of AlignVisionConfig from a pretrained model configuration.

    Args:
        cls (class): The class object itself.
        pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model configuration.
            It can be either a string or a path-like object.

    Returns:
        PretrainedConfig: An instance of PretrainedConfig class representing the configuration of the pretrained model.

    Raises:
        Warning: If the model type specified in the configuration is different from the model type of the class, a warning is issued
            because using a different model type may lead to errors in some configurations of models.
    """
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

    # get the vision config dict if we are loading from AlignConfig
    if config_dict.get("model_type") == "align":
        config_dict = config_dict["vision_config"]

    if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
        logger.warning(
            f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
            f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
        )

    return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.align.modeling_align.ALIGN_PRETRAINED_MODEL_ARCHIVE_LIST = ['kakaobrain/align-base'] module-attribute

mindnlp.transformers.models.align.modeling_align.AlignModel

Bases: AlignPreTrainedModel

The AlignModel class is a model for aligning text and image embeddings. It is designed to compute image-text similarity scores using pre-trained text and vision models. The class inherits from the AlignPreTrainedModel class.

ATTRIBUTE DESCRIPTION
`projection_dim`

The dimension of the projection layer.

`text_embed_dim`

The dimension of the text embeddings.

`text_model`

An instance of the AlignTextModel class for processing text inputs.

`vision_model`

An instance of the AlignVisionModel class for processing image inputs.

`text_projection`

A dense layer for projecting the text embeddings.

`temperature`

A parameter for scaling the similarity scores.

METHOD DESCRIPTION
`__init__`

Initializes the AlignModel class.

`get_text_features`

Computes the text embeddings.

`get_image_features`

Computes the image embeddings.

`forward`

Constructs the model and computes the image-text similarity scores.

Please see the code examples in the docstrings of each method for usage details.
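
The scoring logic itself is small: text and image embeddings are L2-normalized and compared with a temperature-scaled dot product (see `forward` below). The NumPy sketch that follows is only an illustration of that computation with random stand-in embeddings, not library code:

```python
>>> import numpy as np
...
>>> rng = np.random.default_rng(0)
>>> text_embeds = rng.normal(size=(2, 640))    # (num_texts, projection_dim)
>>> image_embeds = rng.normal(size=(3, 640))   # (num_images, projection_dim)
>>> temperature = 1.0                          # matches temperature_init_value
...
>>> # L2-normalize, then cosine similarity scaled by the temperature
>>> text_embeds /= np.linalg.norm(text_embeds, axis=-1, keepdims=True)
>>> image_embeds /= np.linalg.norm(image_embeds, axis=-1, keepdims=True)
>>> logits_per_text = text_embeds @ image_embeds.T / temperature
>>> logits_per_image = logits_per_text.T       # shape (num_images, num_texts)
```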

Source code in mindnlp/transformers/models/align/modeling_align.py, lines 2296-2557
class AlignModel(AlignPreTrainedModel):

    """
    The `AlignModel` class is a model for aligning text and image embeddings.
    It is designed to compute image-text similarity scores using pre-trained text and vision models.
    The class inherits from the `AlignPreTrainedModel` class.

    Attributes:
        `projection_dim`: The dimension of the projection layer.
        `text_embed_dim`: The dimension of the text embeddings.
        `text_model`: An instance of the `AlignTextModel` class for processing text inputs.
        `vision_model`: An instance of the `AlignVisionModel` class for processing image inputs.
        `text_projection`: A dense layer for projecting the text embeddings.
        `temperature`: A parameter for scaling the similarity scores.

    Methods:
        `__init__`: Initializes the `AlignModel` class.
        `get_text_features`: Computes the text embeddings.
        `get_image_features`: Computes the image embeddings.
        `forward`: Constructs the model and computes the image-text similarity scores.

    Please see the code examples in the docstrings of each method for usage details.
    """
    config_class = AlignConfig

    def __init__(self, config: AlignConfig):
        '''
        Initializes the AlignModel with the specified configuration.

        Args:
            self: The instance of the AlignModel class.
            config (AlignConfig):
                An object containing the configuration settings for the AlignModel.

                - text_config (AlignTextConfig): The configuration settings for the text model.
                - vision_config (AlignVisionConfig): The configuration settings for the vision model.
                - projection_dim (int): The dimension for the projection.

        Returns:
            None.

        Raises:
            ValueError: If the config.text_config is not of type AlignTextConfig.
            ValueError: If the config.vision_config is not of type AlignVisionConfig.
        '''
        super().__init__(config)

        if not isinstance(config.text_config, AlignTextConfig):
            raise ValueError(
                "config.text_config is expected to be of type AlignTextConfig but is of type"
                f" {type(config.text_config)}."
            )

        if not isinstance(config.vision_config, AlignVisionConfig):
            raise ValueError(
                "config.vision_config is expected to be of type AlignVisionConfig but is of type"
                f" {type(config.vision_config)}."
            )

        text_config = config.text_config
        vision_config = config.vision_config

        self.projection_dim = config.projection_dim
        self.text_embed_dim = text_config.hidden_size

        self.text_model = AlignTextModel(text_config)
        self.vision_model = AlignVisionModel(vision_config)

        self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim)
        self.register_buffer('temperature', mindspore.tensor(self.config.temperature_init_value))

        # Initialize weights and apply final processing
        self.post_init()

    def get_text_features(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> mindspore.Tensor:
        r"""
        Returns:
            text_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
                applying the projection layer to the pooled output of [`AlignTextModel`].

        Example:
            ```python
            >>> from transformers import AutoTokenizer, AlignModel
            ...
            >>> model = AlignModel.from_pretrained("kakaobrain/align-base")
            >>> tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base")
            ...
            >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
            >>> text_features = model.get_text_features(**inputs)
            ```
        """
        # Use ALIGN model's config for some fields (if specified) instead of those of vision & text components.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        last_hidden_state = text_outputs[0][:, 0, :]
        text_features = self.text_projection(last_hidden_state)

        return text_features

    def get_image_features(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> mindspore.Tensor:
        r"""
        Returns:
            image_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`):
                The image embeddings obtained by applying the projection layer to the pooled output of [`AlignVisionModel`].

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, AlignModel
            ...
            >>> model = AlignModel.from_pretrained("kakaobrain/align-base")
            >>> processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = processor(images=image, return_tensors="pt")
            ...
            >>> image_features = model.get_image_features(**inputs)
            ```
        """
        # Use ALIGN model's config for some fields (if specified) instead of those of vision & text components.
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        image_features = vision_outputs[1]  # pooled_output

        return image_features

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        pixel_values: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        return_loss: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, AlignOutput]:
        r"""
        Returns:
            Union[Tuple, AlignOutput]

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, AlignModel
            ...
            >>> model = AlignModel.from_pretrained("kakaobrain/align-base")
            >>> processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = processor(
            ...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
            ... )
            ...
            >>> outputs = model(**inputs)
            >>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
            >>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
            ```
        """
        # Use ALIGN model's config for some fields (if specified) instead of those of vision & text components.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        vision_outputs = self.vision_model(
            pixel_values=pixel_values,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        image_embeds = vision_outputs[1]
        text_embeds = text_outputs[0][:, 0, :]
        text_embeds = self.text_projection(text_embeds)

        # normalized features
        image_embeds = image_embeds / image_embeds.norm(ord=2, dim=-1, keepdim=True)
        text_embeds = text_embeds / text_embeds.norm(ord=2, dim=-1, keepdim=True)

        # cosine similarity as logits
        logits_per_text = ops.matmul(text_embeds, image_embeds.t()) / self.temperature
        logits_per_image = logits_per_text.t()

        loss = None
        if return_loss:
            loss = align_loss(logits_per_text)

        if not return_dict:
            output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
            return ((loss,) + output) if loss is not None else output

        return AlignOutput(
            loss=loss,
            logits_per_image=logits_per_image,
            logits_per_text=logits_per_text,
            text_embeds=text_embeds,
            image_embeds=image_embeds,
            text_model_output=text_outputs,
            vision_model_output=vision_outputs,
        )

mindnlp.transformers.models.align.modeling_align.AlignModel.__init__(config)

Initializes the AlignModel with the specified configuration.

PARAMETER DESCRIPTION
self

The instance of the AlignModel class.

config

An object containing the configuration settings for the AlignModel.

  • text_config (AlignTextConfig): The configuration settings for the text model.
  • vision_config (AlignVisionConfig): The configuration settings for the vision model.
  • projection_dim (int): The dimension for the projection.

TYPE: AlignConfig

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If the config.text_config is not of type AlignTextConfig.

ValueError

If the config.vision_config is not of type AlignVisionConfig.

Source code in mindnlp/transformers/models/align/modeling_align.py, lines 2321-2368
def __init__(self, config: AlignConfig):
    '''
    Initializes the AlignModel with the specified configuration.

    Args:
        self: The instance of the AlignModel class.
        config (AlignConfig):
            An object containing the configuration settings for the AlignModel.

            - text_config (AlignTextConfig): The configuration settings for the text model.
            - vision_config (AlignVisionConfig): The configuration settings for the vision model.
            - projection_dim (int): The dimension for the projection.

    Returns:
        None.

    Raises:
        ValueError: If the config.text_config is not of type AlignTextConfig.
        ValueError: If the config.vision_config is not of type AlignVisionConfig.
    '''
    super().__init__(config)

    if not isinstance(config.text_config, AlignTextConfig):
        raise ValueError(
            "config.text_config is expected to be of type AlignTextConfig but is of type"
            f" {type(config.text_config)}."
        )

    if not isinstance(config.vision_config, AlignVisionConfig):
        raise ValueError(
            "config.vision_config is expected to be of type AlignVisionConfig but is of type"
            f" {type(config.vision_config)}."
        )

    text_config = config.text_config
    vision_config = config.vision_config

    self.projection_dim = config.projection_dim
    self.text_embed_dim = text_config.hidden_size

    self.text_model = AlignTextModel(text_config)
    self.vision_model = AlignVisionModel(vision_config)

    self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim)
    self.register_buffer('temperature', mindspore.tensor(self.config.temperature_init_value))

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.align.modeling_align.AlignModel.forward(input_ids=None, pixel_values=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, return_loss=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple, AlignOutput]

Union[Tuple, AlignOutput]

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AlignModel
...
>>> model = AlignModel.from_pretrained("kakaobrain/align-base")
>>> processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(
...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
... )
...
>>> outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
>>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
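
When training, passing `return_loss=True` additionally computes the contrastive alignment loss from `logits_per_text` (a brief sketch continuing the example above):

```python
>>> outputs = model(**inputs, return_loss=True)
>>> loss = outputs.loss  # contrastive loss computed via align_loss(logits_per_text)
```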
Source code in mindnlp/transformers/models/align/modeling_align.py, lines 2466-2557
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    pixel_values: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    token_type_ids: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    return_loss: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, AlignOutput]:
    r"""
    Returns:
        Union[Tuple, AlignOutput]

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, AlignModel
        ...
        >>> model = AlignModel.from_pretrained("kakaobrain/align-base")
        >>> processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(
        ...     text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
        ... )
        ...
        >>> outputs = model(**inputs)
        >>> logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
        >>> probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
        ```
    """
    # Use ALIGN model's config for some fields (if specified) instead of those of vision & text components.
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    text_outputs = self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    image_embeds = vision_outputs[1]
    text_embeds = text_outputs[0][:, 0, :]
    text_embeds = self.text_projection(text_embeds)

    # normalized features
    image_embeds = image_embeds / image_embeds.norm(ord=2, dim=-1, keepdim=True)
    text_embeds = text_embeds / text_embeds.norm(ord=2, dim=-1, keepdim=True)

    # cosine similarity as logits
    logits_per_text = ops.matmul(text_embeds, image_embeds.t()) / self.temperature
    logits_per_image = logits_per_text.t()

    loss = None
    if return_loss:
        loss = align_loss(logits_per_text)

    if not return_dict:
        output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
        return ((loss,) + output) if loss is not None else output

    return AlignOutput(
        loss=loss,
        logits_per_image=logits_per_image,
        logits_per_text=logits_per_text,
        text_embeds=text_embeds,
        image_embeds=image_embeds,
        text_model_output=text_outputs,
        vision_model_output=vision_outputs,
    )

mindnlp.transformers.models.align.modeling_align.AlignModel.get_image_features(pixel_values=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
image_features

The image embeddings obtained by applying the projection layer to the pooled output of [AlignVisionModel].

TYPE: `mindspore.Tensor` of shape `(batch_size, output_dim)`

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AlignModel
...
>>> model = AlignModel.from_pretrained("kakaobrain/align-base")
>>> processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(images=image, return_tensors="pt")
...
>>> image_features = model.get_image_features(**inputs)
Source code in mindnlp/transformers/models/align/modeling_align.py, lines 2422-2464
def get_image_features(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> mindspore.Tensor:
    r"""
    Returns:
        image_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`):
            The image embeddings obtained by applying the projection layer to the pooled output of [`AlignVisionModel`].

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, AlignModel
        ...
        >>> model = AlignModel.from_pretrained("kakaobrain/align-base")
        >>> processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(images=image, return_tensors="pt")
        ...
        >>> image_features = model.get_image_features(**inputs)
        ```
    """
    # Use ALIGN model's config for some fields (if specified) instead of those of vision & text components.
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    vision_outputs = self.vision_model(
        pixel_values=pixel_values,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    image_features = vision_outputs[1]  # pooled_output

    return image_features

mindnlp.transformers.models.align.modeling_align.AlignModel.get_text_features(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
text_features

The text embeddings obtained by applying the projection layer to the pooled output of [AlignTextModel].

TYPE: `mindspore.Tensor` of shape `(batch_size, output_dim)`

Example
>>> from transformers import AutoTokenizer, AlignModel
...
>>> model = AlignModel.from_pretrained("kakaobrain/align-base")
>>> tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base")
...
>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
>>> text_features = model.get_text_features(**inputs)
Source code in mindnlp/transformers/models/align/modeling_align.py, lines 2370-2420
def get_text_features(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    token_type_ids: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> mindspore.Tensor:
    r"""
    Returns:
        text_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
            applying the projection layer to the pooled output of [`AlignTextModel`].

    Example:
        ```python
        >>> from transformers import AutoTokenizer, AlignModel
        ...
        >>> model = AlignModel.from_pretrained("kakaobrain/align-base")
        >>> tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base")
        ...
        >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
        >>> text_features = model.get_text_features(**inputs)
        ```
    """
    # Use ALIGN model's config for some fields (if specified) instead of those of vision & text components.
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    text_outputs = self.text_model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    last_hidden_state = text_outputs[0][:, 0, :]
    text_features = self.text_projection(last_hidden_state)

    return text_features

mindnlp.transformers.models.align.modeling_align.AlignPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in mindnlp/transformers/models/align/modeling_align.py, lines 1927-1956
class AlignPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """
    config_class = AlignConfig
    base_model_prefix = "align"
    supports_gradient_checkpointing = True

    def _init_weights(self, cell):
        """Initialize the weights"""
        if isinstance(cell, (nn.Linear, nn.Conv2d)):
            cell.weight.set_data(initializer(Normal(self.config.initializer_range),
                                                    cell.weight.shape, cell.weight.dtype))
            if cell.bias is not None:
                cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
        elif isinstance(cell, AlignModel):
            cell.text_projection.weight.set_data(initializer(XavierUniform(), cell.text_projection.weight.shape,
                                                             cell.text_projection.weight.dtype))
            cell.text_projection.bias[:] = 0
            cell.text_projection._is_initialized = True
        elif isinstance(cell, nn.Embedding):
            weight = np.random.normal(0.0, self.config.initializer_range, cell.weight.shape)
            if cell.padding_idx:
                weight[cell.padding_idx] = 0

            cell.weight.set_data(Tensor(weight, cell.weight.dtype))
        elif isinstance(cell, nn.LayerNorm):
            cell.weight.set_data(initializer('ones', cell.weight.shape, cell.weight.dtype))
            cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))

mindnlp.transformers.models.align.modeling_align.AlignTextModel

Bases: AlignPreTrainedModel

The AlignTextModel class represents a model for aligning text. It includes methods for initializing the model, getting and setting input embeddings, and running the forward pass for inference.

The __init__ method initializes the model with the provided configuration and sets up the embeddings, encoder, and pooler layers based on the configuration parameters.

The get_input_embeddings method retrieves the word embeddings used as input to the model.

The set_input_embeddings method allows for setting custom word embeddings as input to the model.

The forward method runs the model on inputs such as input token ids, attention mask, and token type ids, and returns the model outputs, including the last hidden state and pooled output.

The class also includes examples of how to use the model for text alignment tasks.

This class inherits from AlignPreTrainedModel.
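
For example (a minimal, hedged sketch of the embedding accessors described above, following the import style of the examples on this page):

```python
>>> from transformers import AlignTextModel
...
>>> model = AlignTextModel.from_pretrained("kakaobrain/align-base")
>>> # get_input_embeddings returns the word-embedding layer fed to the encoder
>>> word_embeddings = model.get_input_embeddings()
>>> # set_input_embeddings installs a (possibly resized or custom) replacement layer
>>> model.set_input_embeddings(word_embeddings)
```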

Source code in mindnlp/transformers/models/align/modeling_align.py, lines 1959-2135
class AlignTextModel(AlignPreTrainedModel):

    """
    The `AlignTextModel` class represents a model for aligning text.
    It includes methods for initializing the model, getting and setting input embeddings, and forwarding the model for inference.

    The `__init__` method initializes the model with the provided configuration and sets up
    the embeddings, encoder, and pooler layers based on the configuration parameters.

    The `get_input_embeddings` method retrieves the word embeddings used as input to the model.

    The `set_input_embeddings` method allows for setting custom word embeddings as input to the model.

    The `forward` method forwards the model for inference based on the input parameters such as
    input tokens, attention mask, token type ids, etc.
    It returns the model outputs including the last hidden state and pooled output.

    The class also includes examples of how to use the model for text alignment tasks.

    This class inherits from `AlignPreTrainedModel`.
    """
    config_class = AlignTextConfig

    def __init__(self, config: AlignTextConfig, add_pooling_layer: bool = True):
        """
        Initializes an instance of AlignTextModel.

        Args:
            self: The instance of the AlignTextModel class.
            config (AlignTextConfig): An instance of AlignTextConfig containing configuration parameters.
            add_pooling_layer (bool, optional): A flag indicating whether to add a pooling layer. Defaults to True.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)
        self.config = config

        self.embeddings = AlignTextEmbeddings(config)
        self.encoder = AlignTextEncoder(config)

        self.pooler = AlignTextPooler(config) if add_pooling_layer else None

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        This method retrieves the input embeddings from the AlignTextModel.

        Args:
            self: The instance of the AlignTextModel class.

        Returns:
            nn.Embedding: The word embedding layer used as input to the model.

        Raises:
            None.
        """
        return self.embeddings.word_embeddings

    def set_input_embeddings(self, value):
        """
        Sets the input embeddings for the AlignTextModel.

        Args:
            self (AlignTextModel): The instance of the AlignTextModel class.
            value (any): The input embeddings value to be set for the model. It can be of any type.

        Returns:
            None.

        Raises:
            None.
        """
        self.embeddings.word_embeddings = value

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPoolingAndCrossAttentions]:
        r"""
        Returns:
            Union[Tuple, BaseModelOutputWithPoolingAndCrossAttentions]

        Example:
            ```python
            >>> from transformers import AutoTokenizer, AlignTextModel
            ...
            >>> model = AlignTextModel.from_pretrained("kakaobrain/align-base")
            >>> tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base")
            ...
            >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
            ...
            >>> outputs = model(**inputs)
            >>> last_hidden_state = outputs.last_hidden_state
            >>> pooled_output = outputs.pooler_output  # pooled (EOS token) states
            ```
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        if input_ids is not None:
            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
            input_shape = input_ids.shape
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.shape[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        batch_size, seq_length = input_shape

        if attention_mask is None:
            attention_mask = ops.ones(batch_size, seq_length)

        if token_type_ids is None:
            if hasattr(self.embeddings, "token_type_ids"):
                buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]
                buffered_token_type_ids_expanded = ops.broadcast_to(buffered_token_type_ids, (batch_size, seq_length))
                token_type_ids = buffered_token_type_ids_expanded
            else:
                token_type_ids = ops.zeros(*input_shape, dtype=mindspore.int64)

        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        extended_attention_mask: mindspore.Tensor = self.get_extended_attention_mask(attention_mask, input_shape)

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

        embedding_output = self.embeddings(
            input_ids=input_ids,
            position_ids=position_ids,
            token_type_ids=token_type_ids,
            inputs_embeds=inputs_embeds,
        )
        encoder_outputs = self.encoder(
            embedding_output,
            attention_mask=extended_attention_mask,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]
        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

        if not return_dict:
            return (sequence_output, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPoolingAndCrossAttentions(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
            cross_attentions=encoder_outputs.cross_attentions,
        )

mindnlp.transformers.models.align.modeling_align.AlignTextModel.__init__(config, add_pooling_layer=True)

Initializes an instance of AlignTextModel.

PARAMETER DESCRIPTION
self

The instance of the AlignTextModel class.

config

An instance of AlignTextConfig containing configuration parameters.

TYPE: AlignTextConfig

add_pooling_layer

A flag indicating whether to add a pooling layer. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION

None.
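
A short sketch of the `add_pooling_layer` flag (hypothetical usage with a freshly constructed, randomly initialized model):

```python
>>> from transformers import AlignTextConfig, AlignTextModel
...
>>> config = AlignTextConfig()
>>> # With add_pooling_layer=False no pooler is created, so forward's pooler_output will be None
>>> model = AlignTextModel(config, add_pooling_layer=False)
>>> model.pooler is None
True
```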

Source code in mindnlp/transformers/models/align/modeling_align.py, lines 1982-2006
def __init__(self, config: AlignTextConfig, add_pooling_layer: bool = True):
    """
    Initializes an instance of AlignTextModel.

    Args:
        self: The instance of the AlignTextModel class.
        config (AlignTextConfig): An instance of AlignTextConfig containing configuration parameters.
        add_pooling_layer (bool, optional): A flag indicating whether to add a pooling layer. Defaults to True.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)
    self.config = config

    self.embeddings = AlignTextEmbeddings(config)
    self.encoder = AlignTextEncoder(config)

    self.pooler = AlignTextPooler(config) if add_pooling_layer else None

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.align.modeling_align.AlignTextModel.forward(input_ids=None, attention_mask=None, token_type_ids=None, position_ids=None, head_mask=None, inputs_embeds=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple, BaseModelOutputWithPoolingAndCrossAttentions]

Union[Tuple, BaseModelOutputWithPoolingAndCrossAttentions]

Example
>>> from transformers import AutoTokenizer, AlignTextModel
...
>>> model = AlignTextModel.from_pretrained("kakaobrain/align-base")
>>> tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base")
...
>>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled (EOS token) states
Source code in mindnlp/transformers/models/align/modeling_align.py, lines 2039-2135
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    token_type_ids: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPoolingAndCrossAttentions]:
    r"""
    Returns:
        Union[Tuple, BaseModelOutputWithPoolingAndCrossAttentions]

    Example:
        ```python
        >>> from transformers import AutoTokenizer, AlignTextModel
        ...
        >>> model = AlignTextModel.from_pretrained("kakaobrain/align-base")
        >>> tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base")
        ...
        >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        >>> pooled_output = outputs.pooler_output  # pooled (EOS token) states
        ```
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    if input_ids is not None and inputs_embeds is not None:
        raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
    if input_ids is not None:
        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
        input_shape = input_ids.shape
    elif inputs_embeds is not None:
        input_shape = inputs_embeds.shape[:-1]
    else:
        raise ValueError("You have to specify either input_ids or inputs_embeds")

    batch_size, seq_length = input_shape

    if attention_mask is None:
        attention_mask = ops.ones(batch_size, seq_length)

    if token_type_ids is None:
        if hasattr(self.embeddings, "token_type_ids"):
            buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]
            buffered_token_type_ids_expanded = ops.broadcast_to(buffered_token_type_ids, (batch_size, seq_length))
            token_type_ids = buffered_token_type_ids_expanded
        else:
            token_type_ids = ops.zeros(*input_shape, dtype=mindspore.int64)

    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
    # ourselves in which case we just need to make it broadcastable to all heads.
    extended_attention_mask: mindspore.Tensor = self.get_extended_attention_mask(attention_mask, input_shape)

    # Prepare head mask if needed
    # 1.0 in head_mask indicate we keep the head
    # attention_probs has shape bsz x n_heads x N x N
    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

    embedding_output = self.embeddings(
        input_ids=input_ids,
        position_ids=position_ids,
        token_type_ids=token_type_ids,
        inputs_embeds=inputs_embeds,
    )
    encoder_outputs = self.encoder(
        embedding_output,
        attention_mask=extended_attention_mask,
        head_mask=head_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    sequence_output = encoder_outputs[0]
    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

    if not return_dict:
        return (sequence_output, pooled_output) + encoder_outputs[1:]

    return BaseModelOutputWithPoolingAndCrossAttentions(
        last_hidden_state=sequence_output,
        pooler_output=pooled_output,
        hidden_states=encoder_outputs.hidden_states,
        attentions=encoder_outputs.attentions,
        cross_attentions=encoder_outputs.cross_attentions,
    )

mindnlp.transformers.models.align.modeling_align.AlignTextModel.get_input_embeddings()

This method retrieves the input embeddings from the AlignTextModel.

PARAMETER DESCRIPTION
self

The instance of the AlignTextModel class.

RETURNS DESCRIPTION
nn.Embedding

The word embedding layer used as input to the model.

Source code in mindnlp/transformers/models/align/modeling_align.py, lines 2008-2021
def get_input_embeddings(self):
    """
    This method retrieves the input embeddings from the AlignTextModel.

    Args:
        self: The instance of the AlignTextModel class.

    Returns:
        nn.Embedding: The word embedding layer used as input to the model.

    Raises:
        None.
    """
    return self.embeddings.word_embeddings

mindnlp.transformers.models.align.modeling_align.AlignTextModel.set_input_embeddings(value)

Sets the input embeddings for the AlignTextModel.

PARAMETER DESCRIPTION
self

The instance of the AlignTextModel class.

TYPE: AlignTextModel

value

The input embeddings value to be set for the model. It can be of any type.

TYPE: any

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/align/modeling_align.py, lines 2023-2037
def set_input_embeddings(self, value):
    """
    Sets the input embeddings for the AlignTextModel.

    Args:
        self (AlignTextModel): The instance of the AlignTextModel class.
        value (any): The input embeddings value to be set for the model. It can be of any type.

    Returns:
        None.

    Raises:
        None.
    """
    self.embeddings.word_embeddings = value

mindnlp.transformers.models.align.modeling_align.AlignVisionModel

Bases: AlignPreTrainedModel

This class represents an AlignVision model for vision tasks, which includes functionalities for processing images and generating embeddings using a vision encoder.

The model supports different pooling strategies for extracting features from the encoded image representations.

It inherits from AlignPreTrainedModel and provides methods for initializing the model, accessing input embeddings, and computing the model output.

The model's constructor takes an AlignVisionConfig object as a parameter to configure the model's behavior. It initializes the model's components, including the embeddings and encoder, based on the provided configuration, and sets up the pooling strategy based on the pooling type specified in the configuration.

The 'get_input_embeddings' method returns the input embeddings generated by the model's convolutional layers for further processing.

The 'forward' method processes input pixel values to generate embeddings using the model's embeddings and encoder components. It then applies the pooling strategy to extract features from the encoded image representations. The method returns the last hidden state, pooled output, and additional encoder outputs based on the specified return format.

The class provides examples in the docstring to demonstrate how to use the model for image processing tasks, including loading an image, processing it with the model, and accessing the output hidden states and pooled output for further analysis.

Source code in mindnlp/transformers/models/align/modeling_align.py, lines 2138-2293
class AlignVisionModel(AlignPreTrainedModel):

    """
    This class represents an AlignVision model for vision tasks, which includes functionalities for processing images
    and generating embeddings using a vision encoder.

    The model supports different pooling strategies for extracting features from the encoded image representations.

    It inherits from AlignPreTrainedModel and provides methods for initializing the model, accessing input embeddings,
    and forwarding the model output.

    The model's constructor takes an AlignVisionConfig object as a parameter to configure the model's behavior.
    It initializes the model's components including embeddings and encoder based on the provided configuration,
    and sets up the pooling strategy based on the specified pooling type in the configuration.

    The 'get_input_embeddings' method returns the input embeddings generated by the model's convolutional layers for further processing.

    The 'forward' method processes input pixel values to generate embeddings using the model's embeddings and encoder components.
    It then applies the pooling strategy to extract features from the encoded image representations.
    The method returns the last hidden state, pooled output, and additional encoder outputs based on the specified return format.

    The class provides examples in the docstring to demonstrate how to use the model for image processing tasks,
    including loading an image, processing it with the model, and accessing the output hidden states
    and pooled output for further analysis.
    """
    config_class = AlignVisionConfig
    main_input_name = "pixel_values"
    supports_gradient_checkpointing = False

    def __init__(self, config: AlignVisionConfig):
        """
        Initializes an instance of the AlignVisionModel class.

        Args:
            self: The instance of the class.
            config (AlignVisionConfig): An object containing configuration parameters for the model.

        Returns:
            None

        Raises:
            ValueError: If the 'pooling_type' in the config is not one of ['mean', 'max'].

        Description:
            This method initializes an instance of the AlignVisionModel class.
            It takes in a config object which contains the configuration parameters for the model.
            The 'config' parameter is of type AlignVisionConfig.

            Inside the method, the superclass's __init__ method is called with the 'config' parameter.
            The 'config' is then assigned to the 'self.config' attribute.

            The method also initializes the 'embeddings' attribute with an instance of AlignVisionEmbeddings,
            passing in the 'config' parameter. Similarly, the 'encoder' attribute is initialized with an instance
            of AlignVisionEncoder, passing in the 'config' parameter.

            The 'pooler' attribute is dynamically set based on the value of the 'pooling_type' in the 'config'.

            - If 'pooling_type' is set to 'mean', the 'pooler' attribute is set to a partial function 'ops.mean'
            with the specified axis and keep_dims parameters.
            - If 'pooling_type' is set to 'max', the 'pooler' attribute is set to an instance of nn.MaxPool2d
            with the specified 'hidden_dim' and 'ceil_mode' parameters.
            - If the 'pooling_type' in the 'config' is not one of ['mean', 'max'], a ValueError is raised.

            Finally, the 'post_init' method is called.

            This method does not return any value.
        """
        super().__init__(config)
        self.config = config
        self.embeddings = AlignVisionEmbeddings(config)
        self.encoder = AlignVisionEncoder(config)

        # Final pooling layer
        if config.pooling_type == "mean":
            # self.pooler = nn.AvgPool2d(config.hidden_dim, ceil_mode=True)
            self.pooler = partial(ops.mean, dim=(2,3), keepdim=True)
        elif config.pooling_type == "max":
            self.pooler = nn.MaxPool2d(config.hidden_dim, ceil_mode=True)
        else:
            raise ValueError(f"config.pooling must be one of ['mean', 'max'] got {config.pooling}")

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self) -> nn.Module:
        """
        Retrieve the input embeddings from the AlignVisionModel.

        Args:
            self (AlignVisionModel): The instance of the AlignVisionModel class.

        Returns:
            nn.Module: The input embeddings extracted from the vision model's convolution layer.

        Raises:
            None.
        """
        return self.vision_model.embeddings.convolution

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPoolingAndNoAttention]:
        r"""
        Returns:
            Union[Tuple, BaseModelOutputWithPoolingAndNoAttention]

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, AlignVisionModel
            ...
            >>> model = AlignVisionModel.from_pretrained("kakaobrain/align-base")
            >>> processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = processor(images=image, return_tensors="pt")
            ...
            >>> outputs = model(**inputs)
            >>> last_hidden_state = outputs.last_hidden_state
            >>> pooled_output = outputs.pooler_output  # pooled CLS states
            ```
        """
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if pixel_values is None:
            raise ValueError("You have to specify pixel_values")

        embedding_output = self.embeddings(pixel_values)
        encoder_outputs = self.encoder(
            embedding_output,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        # Apply pooling
        last_hidden_state = encoder_outputs[0]
        pooled_output = self.pooler(last_hidden_state)
        # Reshape (batch_size, projection_dim, 1 , 1) -> (batch_size, projection_dim)
        pooled_output = pooled_output.reshape(pooled_output.shape[:2])

        if not return_dict:
            return (last_hidden_state, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPoolingAndNoAttention(
            last_hidden_state=last_hidden_state,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
        )

mindnlp.transformers.models.align.modeling_align.AlignVisionModel.__init__(config)

Initializes an instance of the AlignVisionModel class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An object containing configuration parameters for the model.

TYPE: AlignVisionConfig

RETURNS DESCRIPTION

None

RAISES DESCRIPTION
ValueError

If the 'pooling_type' in the config is not one of ['mean', 'max'].

Description

This method initializes an instance of the AlignVisionModel class. It takes in a config object which contains the configuration parameters for the model. The 'config' parameter is of type AlignVisionConfig.

Inside the method, the superclass's __init__ method is called with the 'config' parameter. The 'config' is then assigned to the 'self.config' attribute.

The method also initializes the 'embeddings' attribute with an instance of AlignVisionEmbeddings, passing in the 'config' parameter. Similarly, the 'encoder' attribute is initialized with an instance of AlignVisionEncoder, passing in the 'config' parameter.

The 'pooler' attribute is dynamically set based on the value of the 'pooling_type' in the 'config'.

  • If 'pooling_type' is set to 'mean', the 'pooler' attribute is set to a partial function 'ops.mean' with dim=(2, 3) and keepdim=True.
  • If 'pooling_type' is set to 'max', the 'pooler' attribute is set to an instance of nn.MaxPool2d with the specified 'hidden_dim' and 'ceil_mode' parameters.
  • If the 'pooling_type' in the 'config' is not one of ['mean', 'max'], a ValueError is raised.

Finally, the 'post_init' method is called.

This method does not return any value.
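The pooling choice can be pictured with a small NumPy sketch (shapes only, independent of MindSpore; the feature-map size 1792x7x7 is illustrative, not taken from the config):

```python
import numpy as np

# Illustrative encoder output: (batch_size, channels, height, width)
features = np.random.rand(2, 1792, 7, 7).astype(np.float32)

# 'mean' pooling as configured above: average over H and W, keeping the dims
pooled = features.mean(axis=(2, 3), keepdims=True)  # -> (2, 1792, 1, 1)

# forward() then drops the trailing singleton dimensions
pooled = pooled.reshape(pooled.shape[:2])            # -> (2, 1792)
print(pooled.shape)
```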

Source code in mindnlp/transformers/models/align/modeling_align.py
def __init__(self, config: AlignVisionConfig):
    """
    Initializes an instance of the AlignVisionModel class.

    Args:
        self: The instance of the class.
        config (AlignVisionConfig): An object containing configuration parameters for the model.

    Returns:
        None

    Raises:
        ValueError: If the 'pooling_type' in the config is not one of ['mean', 'max'].

    Description:
        This method initializes an instance of the AlignVisionModel class.
        It takes in a config object which contains the configuration parameters for the model.
        The 'config' parameter is of type AlignVisionConfig.

        Inside the method, the superclass's __init__ method is called with the 'config' parameter.
        The 'config' is then assigned to the 'self.config' attribute.

        The method also initializes the 'embeddings' attribute with an instance of AlignVisionEmbeddings,
        passing in the 'config' parameter. Similarly, the 'encoder' attribute is initialized with an instance
        of AlignVisionEncoder, passing in the 'config' parameter.

        The 'pooler' attribute is dynamically set based on the value of the 'pooling_type' in the 'config'.

        - If 'pooling_type' is set to 'mean', the 'pooler' attribute is set to a partial function 'ops.mean'
        with dim=(2, 3) and keepdim=True.
        - If 'pooling_type' is set to 'max', the 'pooler' attribute is set to an instance of nn.MaxPool2d
        with the specified 'hidden_dim' and 'ceil_mode' parameters.
        - If the 'pooling_type' in the 'config' is not one of ['mean', 'max'], a ValueError is raised.

        Finally, the 'post_init' method is called.

        This method does not return any value.
    """
    super().__init__(config)
    self.config = config
    self.embeddings = AlignVisionEmbeddings(config)
    self.encoder = AlignVisionEncoder(config)

    # Final pooling layer
    if config.pooling_type == "mean":
        # self.pooler = nn.AvgPool2d(config.hidden_dim, ceil_mode=True)
        self.pooler = partial(ops.mean, dim=(2,3), keepdim=True)
    elif config.pooling_type == "max":
        self.pooler = nn.MaxPool2d(config.hidden_dim, ceil_mode=True)
    else:
        raise ValueError(f"config.pooling must be one of ['mean', 'max'] got {config.pooling}")

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.align.modeling_align.AlignVisionModel.forward(pixel_values=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple, BaseModelOutputWithPoolingAndNoAttention]

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AlignVisionModel
...
>>> model = AlignVisionModel.from_pretrained("kakaobrain/align-base")
>>> processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(images=image, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output  # pooled CLS states
Source code in mindnlp/transformers/models/align/modeling_align.py
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPoolingAndNoAttention]:
    r"""
    Returns:
        Union[Tuple, BaseModelOutputWithPoolingAndNoAttention]

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, AlignVisionModel
        ...
        >>> model = AlignVisionModel.from_pretrained("kakaobrain/align-base")
        >>> processor = AutoProcessor.from_pretrained("kakaobrain/align-base")
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(images=image, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> last_hidden_state = outputs.last_hidden_state
        >>> pooled_output = outputs.pooler_output  # pooled CLS states
        ```
    """
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    if pixel_values is None:
        raise ValueError("You have to specify pixel_values")

    embedding_output = self.embeddings(pixel_values)
    encoder_outputs = self.encoder(
        embedding_output,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    # Apply pooling
    last_hidden_state = encoder_outputs[0]
    pooled_output = self.pooler(last_hidden_state)
    # Reshape (batch_size, projection_dim, 1 , 1) -> (batch_size, projection_dim)
    pooled_output = pooled_output.reshape(pooled_output.shape[:2])

    if not return_dict:
        return (last_hidden_state, pooled_output) + encoder_outputs[1:]

    return BaseModelOutputWithPoolingAndNoAttention(
        last_hidden_state=last_hidden_state,
        pooler_output=pooled_output,
        hidden_states=encoder_outputs.hidden_states,
    )
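Building on the docstring example above, intermediate feature maps can also be requested through the `output_hidden_states` argument of this method (a sketch that reuses the `model` and `inputs` objects from that example; the printed shapes depend on the checkpoint):

```python
>>> outputs = model(**inputs, output_hidden_states=True)
>>> # hidden_states is a tuple of feature maps collected by the encoder
>>> for hidden_state in outputs.hidden_states:
...     print(hidden_state.shape)
>>> # pooler_output is last_hidden_state averaged over its spatial dimensions
>>> print(outputs.pooler_output.shape)
```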

mindnlp.transformers.models.align.modeling_align.AlignVisionModel.get_input_embeddings()

Retrieve the input embeddings from the AlignVisionModel.

PARAMETER DESCRIPTION
self

The instance of the AlignVisionModel class.

TYPE: AlignVisionModel

RETURNS DESCRIPTION
Module

nn.Module: The input embeddings extracted from the vision model's convolution layer.

Source code in mindnlp/transformers/models/align/modeling_align.py
def get_input_embeddings(self) -> nn.Module:
    """
    Retrieve the input embeddings from the AlignVisionModel.

    Args:
        self (AlignVisionModel): The instance of the AlignVisionModel class.

    Returns:
        nn.Module: The input embeddings extracted from the vision model's convolution layer.

    Raises:
        None.
    """
    return self.vision_model.embeddings.convolution

mindnlp.transformers.models.align.processing_align.AlignProcessor

Bases: ProcessorMixin

Constructs an ALIGN processor which wraps [EfficientNetImageProcessor] and [BertTokenizer]/[BertTokenizerFast] into a single processor that inherits both the image processor and tokenizer functionalities. See the [~AlignProcessor.__call__] and [~AlignProcessor.decode] for more information.

PARAMETER DESCRIPTION
image_processor

The image processor is a required input.

TYPE: [`EfficientNetImageProcessor`]

tokenizer

The tokenizer is a required input.

TYPE: [`BertTokenizer`, `BertTokenizerFast`]

Source code in mindnlp/transformers/models/align/processing_align.py
class AlignProcessor(ProcessorMixin):
    r"""
    Constructs an ALIGN processor which wraps [`EfficientNetImageProcessor`] and
    [`BertTokenizer`]/[`BertTokenizerFast`] into a single processor that inherits both the image processor and
    tokenizer functionalities. See the [`~AlignProcessor.__call__`] and [`~AlignProcessor.decode`] for more
    information.

    Args:
        image_processor ([`EfficientNetImageProcessor`]):
            The image processor is a required input.
        tokenizer ([`BertTokenizer`, `BertTokenizerFast`]):
            The tokenizer is a required input.
    """
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "EfficientNetImageProcessor"
    tokenizer_class = ("BertTokenizer", "BertTokenizerFast")

    def __init__(self, image_processor, tokenizer):
        """
        Initializes an AlignProcessor object.

        Args:
            self (object): The instance of the class.
            image_processor (object): An object of the image processor class that handles image processing.
            tokenizer (object): An object of the tokenizer class that handles text tokenization.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(image_processor, tokenizer)

    def __call__(self, text=None, images=None, padding="max_length", max_length=64, return_tensors=None, **kwargs):
        """
        Main method to prepare text(s) and image(s) to be fed as input to the model. This method forwards the `text`
        and `kwargs` arguments to BertTokenizerFast's [`~BertTokenizerFast.__call__`] if `text` is not `None` to encode
        the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
        EfficientNetImageProcessor's [`~EfficientNetImageProcessor.__call__`] if `images` is not `None`. Please refer
        to the docstring of the above two methods for more information.

        Args:
            text (`str`, `List[str]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
                number of channels, H and W are image height and width.
            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `max_length`):
                Activates and controls padding for tokenization of input text. Choose between [`True` or `'longest'`,
                `'max_length'`, `False` or `'do_not_pad'`]
            max_length (`int`, *optional*, defaults to 64):
                Maximum length to which the input text is padded during tokenization.

            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.

        Returns:
            [`BatchEncoding`]:
                A [`BatchEncoding`] with the following fields:

                - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
                - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
                  `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
                  `None`).
                - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
        """
        if text is None and images is None:
            raise ValueError("You have to specify either text or images. Both cannot be none.")

        if text is not None:
            encoding = self.tokenizer(
                text, padding=padding, max_length=max_length, return_tensors=return_tensors, **kwargs
            )

        if images is not None:
            image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)

        if text is not None and images is not None:
            encoding["pixel_values"] = image_features.pixel_values
            return encoding
        if text is not None:
            return encoding
        return BatchEncoding(data={**image_features}, tensor_type=return_tensors)

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    def model_input_names(self):
        """
        This method retrieves the input names required for the model from the tokenizer and image processor.

        Args:
            self: The instance of the AlignProcessor class.

        Returns:
            list: A list of unique input names required for the model, which are obtained by combining the input names from the tokenizer and image processor.

        Raises:
            None
        """
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
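The class docstring above ships no usage example; a minimal sketch follows, reusing the `kakaobrain/align-base` checkpoint and `return_tensors` convention from the model examples earlier on this page (treat both as assumptions for your own setup):

```python
>>> import requests
>>> from PIL import Image
>>> from transformers import AlignProcessor
...
>>> processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> # Text goes through the wrapped BertTokenizer(Fast), the image through the
>>> # wrapped EfficientNetImageProcessor; both results land in one BatchEncoding.
>>> inputs = processor(text="a photo of two cats", images=image, return_tensors="pt")
>>> print(sorted(inputs.keys()))  # e.g. attention_mask, input_ids, pixel_values, token_type_ids
```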

mindnlp.transformers.models.align.processing_align.AlignProcessor.model_input_names property

This method retrieves the input names required for the model from the tokenizer and image processor.

PARAMETER DESCRIPTION
self

The instance of the AlignProcessor class.

RETURNS DESCRIPTION
list

A list of unique input names required for the model, which are obtained by combining the input names from the tokenizer and image processor.
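The implementation combines the two name lists and drops duplicates while preserving order via `dict.fromkeys`; the concrete names below are only illustrative:

```python
>>> tokenizer_input_names = ["input_ids", "token_type_ids", "attention_mask"]
>>> image_processor_input_names = ["pixel_values"]
>>> list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
['input_ids', 'token_type_ids', 'attention_mask', 'pixel_values']
```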

mindnlp.transformers.models.align.processing_align.AlignProcessor.__call__(text=None, images=None, padding='max_length', max_length=64, return_tensors=None, **kwargs)

Main method to prepare text(s) and image(s) to be fed as input to the model. This method forwards the text and kwargs arguments to BertTokenizerFast's [~BertTokenizerFast.__call__] if text is not None to encode the text. To prepare the image(s), this method forwards the images and kwargs arguments to EfficientNetImageProcessor's [~EfficientNetImageProcessor.__call__] if images is not None. Please refer to the docstring of the above two methods for more information.

PARAMETER DESCRIPTION
text

The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

TYPE: `str`, `List[str]` DEFAULT: None

images

The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a number of channels, H and W are image height and width.

TYPE: `PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]` DEFAULT: None

padding

Activates and controls padding for tokenization of input text. Choose between [True or 'longest', 'max_length', False or 'do_not_pad']

TYPE: `bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `max_length` DEFAULT: 'max_length'

max_length

Maximum length to which the input text is padded during tokenization.

TYPE: `int`, *optional*, defaults to 64 DEFAULT: 64

return_tensors

If set, will return tensors of a particular framework. Acceptable values are:

  • 'tf': Return TensorFlow tf.constant objects.
  • 'pt': Return PyTorch torch.Tensor objects.
  • 'np': Return NumPy np.ndarray objects.
  • 'jax': Return JAX jnp.ndarray objects.

TYPE: `str` or [`~utils.TensorType`], *optional* DEFAULT: None

RETURNS DESCRIPTION

[BatchEncoding]: A [BatchEncoding] with the following fields:

  • input_ids -- List of token ids to be fed to a model. Returned when text is not None.
  • attention_mask -- List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names and if text is not None).
  • pixel_values -- Pixel values to be fed to a model. Returned when images is not None.
Source code in mindnlp/transformers/models/align/processing_align.py
def __call__(self, text=None, images=None, padding="max_length", max_length=64, return_tensors=None, **kwargs):
    """
    Main method to prepare text(s) and image(s) to be fed as input to the model. This method forwards the `text`
    and `kwargs` arguments to BertTokenizerFast's [`~BertTokenizerFast.__call__`] if `text` is not `None` to encode
    the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
    EfficientNetImageProcessor's [`~EfficientNetImageProcessor.__call__`] if `images` is not `None`. Please refer
    to the docstring of the above two methods for more information.

    Args:
        text (`str`, `List[str]`):
            The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
            (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
            `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
        images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
            The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
            tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W), where C is a
            number of channels, H and W are image height and width.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `max_length`):
            Activates and controls padding for tokenization of input text. Choose between [`True` or `'longest'`,
            `'max_length'`, `False` or `'do_not_pad'`]
        max_length (`int`, *optional*, defaults to 64):
            Maximum length to which the input text is padded during tokenization.

        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors of a particular framework. Acceptable values are:

            - `'tf'`: Return TensorFlow `tf.constant` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return NumPy `np.ndarray` objects.
            - `'jax'`: Return JAX `jnp.ndarray` objects.

    Returns:
        [`BatchEncoding`]:
            A [`BatchEncoding`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
    """
    if text is None and images is None:
        raise ValueError("You have to specify either text or images. Both cannot be none.")

    if text is not None:
        encoding = self.tokenizer(
            text, padding=padding, max_length=max_length, return_tensors=return_tensors, **kwargs
        )

    if images is not None:
        image_features = self.image_processor(images, return_tensors=return_tensors, **kwargs)

    if text is not None and images is not None:
        encoding["pixel_values"] = image_features.pixel_values
        return encoding
    if text is not None:
        return encoding
    return BatchEncoding(data={**image_features}, tensor_type=return_tensors)
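As the branches above show, the fields returned depend on which inputs are provided; a brief sketch of the three cases, assuming a `processor` and `image` created as in the earlier sketch:

```python
>>> text_only = processor(text="a photo of two cats", return_tensors="pt")
>>> image_only = processor(images=image, return_tensors="pt")
>>> both = processor(text="a photo of two cats", images=image, return_tensors="pt")
...
>>> "pixel_values" in text_only   # False: only tokenizer fields
>>> "input_ids" in image_only     # False: only pixel_values
>>> "pixel_values" in both        # True: tokenizer fields plus pixel_values
```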

mindnlp.transformers.models.align.processing_align.AlignProcessor.__init__(image_processor, tokenizer)

Initializes an AlignProcessor object.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: object

image_processor

An object of the image processor class that handles image processing.

TYPE: object

tokenizer

An object of the tokenizer class that handles text tokenization.

TYPE: object

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/align/processing_align.py
def __init__(self, image_processor, tokenizer):
    """
    Initializes an AlignProcessor object.

    Args:
        self (object): The instance of the class.
        image_processor (object): An object of the image processor class that handles image processing.
        tokenizer (object): An object of the tokenizer class that handles text tokenization.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(image_processor, tokenizer)

mindnlp.transformers.models.align.processing_align.AlignProcessor.batch_decode(*args, **kwargs)

This method forwards all its arguments to BertTokenizerFast's [~PreTrainedTokenizer.batch_decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/align/processing_align.py
def batch_decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
    refer to the docstring of this method for more information.
    """
    return self.tokenizer.batch_decode(*args, **kwargs)

mindnlp.transformers.models.align.processing_align.AlignProcessor.decode(*args, **kwargs)

This method forwards all its arguments to BertTokenizerFast's [~PreTrainedTokenizer.decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/align/processing_align.py
def decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
    the docstring of this method for more information.
    """
    return self.tokenizer.decode(*args, **kwargs)
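Finally, a short sketch of the decoding helpers, which simply forward to the wrapped tokenizer (continuing the `processor` object from the sketches above; the exact decoded string depends on the tokenizer's casing):

```python
>>> ids = processor(text="a photo of two cats", return_tensors="pt")["input_ids"]
>>> processor.batch_decode(ids, skip_special_tokens=True)   # -> ["a photo of two cats"]
>>> processor.decode(ids[0], skip_special_tokens=True)      # -> "a photo of two cats"
```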