
bridgetower

mindnlp.transformers.models.bridgetower.configuration_bridgetower.BridgeTowerConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [BridgeTowerModel]. It is used to instantiate a BridgeTower model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BridgeTower/bridgetower-base architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
share_cross_modal_transformer_layers

Whether cross modal transformer layers are shared.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

hidden_act

The non-linear activation function (function or string) in the encoder and pooler.

TYPE: `str` or `function`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

initializer_factor

A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

TYPE: `float`, *optional*, defaults to 1 DEFAULT: 1

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-05 DEFAULT: 1e-05

share_link_tower_layers

Whether the bridge/link tower layers are shared.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

link_tower_type

Type of the bridge/link layer.

TYPE: `str`, *optional*, defaults to `"add"` DEFAULT: 'add'

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 6 DEFAULT: 6

tie_word_embeddings

Whether to tie input and output embeddings.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

init_layernorm_from_vision_encoder

Whether to init LayerNorm from the vision encoder.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

text_config

Dictionary of configuration options used to initialize [BridgeTowerTextConfig].

TYPE: `dict`, *optional* DEFAULT: None

vision_config

Dictionary of configuration options used to initialize [BridgeTowerVisionConfig].

TYPE: `dict`, *optional* DEFAULT: None

Example
>>> from transformers import BridgeTowerModel, BridgeTowerConfig
...
>>> # Initializing a BridgeTower BridgeTower/bridgetower-base style configuration
>>> configuration = BridgeTowerConfig()
...
>>> # Initializing a model from the BridgeTower/bridgetower-base style configuration
>>> model = BridgeTowerModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
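A minimal customization sketch (illustrative values, not taken from the library's documentation): the nested text_config and vision_config dictionaries are forwarded to [BridgeTowerTextConfig] and [BridgeTowerVisionConfig], so any of the keyword arguments documented on this page can be overridden through them.

>>> # Overriding nested text/vision options through plain dictionaries (illustrative values)
>>> custom_configuration = BridgeTowerConfig(
...     num_hidden_layers=3,
...     text_config={"num_hidden_layers": 6},
...     vision_config={"image_size": 224},
... )
>>> custom_configuration.text_config.num_hidden_layers
6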
Source code in mindnlp/transformers/models/bridgetower/configuration_bridgetower.py
class BridgeTowerConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`BridgeTowerModel`]. It is used to instantiate a
    BridgeTower model according to the specified arguments, defining the model architecture. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the bridgetower-base
    [BridgeTower/bridgetower-base](https://huggingface.co/BridgeTower/bridgetower-base/) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        share_cross_modal_transformer_layers (`bool`, *optional*, defaults to `True`):
            Whether cross modal transformer layers are shared.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler.
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        initializer_factor (`float`, *optional*, defaults to 1):
            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
            testing).
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
        share_link_tower_layers (`bool`, *optional*, defaults to `False`):
            Whether the bridge/link tower layers are shared.
        link_tower_type (`str`, *optional*, defaults to `"add"`):
            Type of the bridge/link layer.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        num_hidden_layers (`int`, *optional*, defaults to 6):
            Number of hidden layers in the Transformer encoder.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie input and output embeddings.
        init_layernorm_from_vision_encoder (`bool`, *optional*, defaults to `False`):
            Whether to init LayerNorm from the vision encoder.
        text_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`BridgeTowerTextConfig`].
        vision_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`BridgeTowerVisionConfig`].

    Example:
        ```python
        >>> from transformers import BridgeTowerModel, BridgeTowerConfig
        ...
        >>> # Initializing a BridgeTower BridgeTower/bridgetower-base style configuration
        >>> configuration = BridgeTowerConfig()
        ...
        >>> # Initializing a model from the BridgeTower/bridgetower-base style configuration
        >>> model = BridgeTowerModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "bridgetower"

    def __init__(
        self,
        share_cross_modal_transformer_layers=True,
        hidden_act="gelu",
        hidden_size=768,
        initializer_factor=1,
        layer_norm_eps=1e-05,
        share_link_tower_layers=False,
        link_tower_type="add",
        num_attention_heads=12,
        num_hidden_layers=6,
        tie_word_embeddings=False,
        init_layernorm_from_vision_encoder=False,
        text_config=None,
        vision_config=None,
        **kwargs,
    ):
        """
        __init__

        Initializes an instance of the BridgeTowerConfig class.

        Args:
            self: The instance of the class.
            share_cross_modal_transformer_layers (bool): Indicates whether to share cross modal transformer layers.
            hidden_act (str): The activation function for the hidden layers.
            hidden_size (int): The size of the hidden layers.
            initializer_factor (int): The factor to initialize the layers.
            layer_norm_eps (float): The epsilon value for layer normalization.
            share_link_tower_layers (bool): Indicates whether to share link tower layers.
            link_tower_type (str): The type of link tower.
            num_attention_heads (int): The number of attention heads.
            num_hidden_layers (int): The number of hidden layers.
            tie_word_embeddings (bool): Indicates whether word embeddings are tied.
            init_layernorm_from_vision_encoder (bool): Indicates whether to initialize layernorm from the vision encoder.
            text_config (dict): The configuration for text.
            vision_config (dict): The configuration for vision.

        Returns:
            None.

        Raises:
            TypeError: If the provided input types are invalid.
        """
        # TODO: remove this once the Hub files are updated.
        _ = kwargs.pop("text_config_dict", None)
        _ = kwargs.pop("vision_config_dict", None)

        super().__init__(**kwargs)
        self.share_cross_modal_transformer_layers = share_cross_modal_transformer_layers
        self.hidden_act = hidden_act
        self.hidden_size = hidden_size
        self.initializer_factor = initializer_factor
        self.layer_norm_eps = layer_norm_eps
        self.share_link_tower_layers = share_link_tower_layers
        self.link_tower_type = link_tower_type
        self.num_attention_heads = num_attention_heads
        self.num_hidden_layers = num_hidden_layers
        self.tie_word_embeddings = tie_word_embeddings
        self.init_layernorm_from_vision_encoder = init_layernorm_from_vision_encoder

        if text_config is None:
            text_config = {}
            logger.info("`text_config` is `None`. Initializing the `BridgeTowerTextConfig` with default values.")

        if vision_config is None:
            vision_config = {}
            logger.info("`vision_config` is `None`. Initializing the `BridgeTowerVisionConfig` with default values.")

        self.text_config = BridgeTowerTextConfig(**text_config)
        self.vision_config = BridgeTowerVisionConfig(**vision_config)

    @classmethod
    def from_text_vision_configs(
        cls, text_config: BridgeTowerTextConfig, vision_config: BridgeTowerVisionConfig, **kwargs
    ):
        r"""
        Instantiate a [`BridgeTowerConfig`] (or a derived class) from BridgeTower text model and vision model configurations.

        Returns:
            [`BridgeTowerConfig`]: An instance of a configuration object
        """
        return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)

mindnlp.transformers.models.bridgetower.configuration_bridgetower.BridgeTowerConfig.__init__(share_cross_modal_transformer_layers=True, hidden_act='gelu', hidden_size=768, initializer_factor=1, layer_norm_eps=1e-05, share_link_tower_layers=False, link_tower_type='add', num_attention_heads=12, num_hidden_layers=6, tie_word_embeddings=False, init_layernorm_from_vision_encoder=False, text_config=None, vision_config=None, **kwargs)

__init__

Initializes an instance of the BridgeTowerConfig class.

PARAMETER DESCRIPTION
self

The instance of the class.

share_cross_modal_transformer_layers

Indicates whether to share cross modal transformer layers.

TYPE: bool DEFAULT: True

hidden_act

The activation function for the hidden layers.

TYPE: str DEFAULT: 'gelu'

hidden_size

The size of the hidden layers.

TYPE: int DEFAULT: 768

initializer_factor

The factor to initialize the layers.

TYPE: int DEFAULT: 1

layer_norm_eps

The epsilon value for layer normalization.

TYPE: float DEFAULT: 1e-05

share_link_tower_layers

Indicates whether to share link tower layers.

TYPE: bool DEFAULT: False

link_tower_type

The type of link tower.

TYPE: str DEFAULT: 'add'

num_attention_heads

The number of attention heads.

TYPE: int DEFAULT: 12

num_hidden_layers

The number of hidden layers.

TYPE: int DEFAULT: 6

tie_word_embeddings

Indicates whether word embeddings are tied.

TYPE: bool DEFAULT: False

init_layernorm_from_vision_encoder

Indicates whether to initialize layernorm from the vision encoder.

TYPE: bool DEFAULT: False

text_config

The configuration for text.

TYPE: dict DEFAULT: None

vision_config

The configuration for vision.

TYPE: dict DEFAULT: None

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the provided input types are invalid.

Source code in mindnlp/transformers/models/bridgetower/configuration_bridgetower.py
def __init__(
    self,
    share_cross_modal_transformer_layers=True,
    hidden_act="gelu",
    hidden_size=768,
    initializer_factor=1,
    layer_norm_eps=1e-05,
    share_link_tower_layers=False,
    link_tower_type="add",
    num_attention_heads=12,
    num_hidden_layers=6,
    tie_word_embeddings=False,
    init_layernorm_from_vision_encoder=False,
    text_config=None,
    vision_config=None,
    **kwargs,
):
    """
    __init__

    Initializes an instance of the BridgeTowerConfig class.

    Args:
        self: The instance of the class.
        share_cross_modal_transformer_layers (bool): Indicates whether to share cross modal transformer layers.
        hidden_act (str): The activation function for the hidden layers.
        hidden_size (int): The size of the hidden layers.
        initializer_factor (int): The factor to initialize the layers.
        layer_norm_eps (float): The epsilon value for layer normalization.
        share_link_tower_layers (bool): Indicates whether to share link tower layers.
        link_tower_type (str): The type of link tower.
        num_attention_heads (int): The number of attention heads.
        num_hidden_layers (int): The number of hidden layers.
        tie_word_embeddings (bool): Indicates whether word embeddings are tied.
        init_layernorm_from_vision_encoder (bool): Indicates whether to initialize layernorm from the vision encoder.
        text_config (dict): The configuration for text.
        vision_config (dict): The configuration for vision.

    Returns:
        None.

    Raises:
        TypeError: If the provided input types are invalid.
    """
    # TODO: remove this once the Hub files are updated.
    _ = kwargs.pop("text_config_dict", None)
    _ = kwargs.pop("vision_config_dict", None)

    super().__init__(**kwargs)
    self.share_cross_modal_transformer_layers = share_cross_modal_transformer_layers
    self.hidden_act = hidden_act
    self.hidden_size = hidden_size
    self.initializer_factor = initializer_factor
    self.layer_norm_eps = layer_norm_eps
    self.share_link_tower_layers = share_link_tower_layers
    self.link_tower_type = link_tower_type
    self.num_attention_heads = num_attention_heads
    self.num_hidden_layers = num_hidden_layers
    self.tie_word_embeddings = tie_word_embeddings
    self.init_layernorm_from_vision_encoder = init_layernorm_from_vision_encoder

    if text_config is None:
        text_config = {}
        logger.info("`text_config` is `None`. Initializing the `BridgeTowerTextConfig` with default values.")

    if vision_config is None:
        vision_config = {}
        logger.info("`vision_config` is `None`. Initializing the `BridgeTowerVisionConfig` with default values.")

    self.text_config = BridgeTowerTextConfig(**text_config)
    self.vision_config = BridgeTowerVisionConfig(**vision_config)

mindnlp.transformers.models.bridgetower.configuration_bridgetower.BridgeTowerConfig.from_text_vision_configs(text_config, vision_config, **kwargs) classmethod

Instantiate a [BridgeTowerConfig] (or a derived class) from BridgeTower text model and vision model configurations.

RETURNS DESCRIPTION

[BridgeTowerConfig]: An instance of a configuration object

Source code in mindnlp/transformers/models/bridgetower/configuration_bridgetower.py
@classmethod
def from_text_vision_configs(
    cls, text_config: BridgeTowerTextConfig, vision_config: BridgeTowerVisionConfig, **kwargs
):
    r"""
    Instantiate a [`BridgeTowerConfig`] (or a derived class) from BridgeTower text model and vision model configurations.

    Returns:
        [`BridgeTowerConfig`]: An instance of a configuration object
    """
    return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), **kwargs)
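A usage sketch for from_text_vision_configs, assuming the classes are imported from the module documented on this page; the override values are illustrative only.

>>> from mindnlp.transformers.models.bridgetower.configuration_bridgetower import (
...     BridgeTowerConfig, BridgeTowerTextConfig, BridgeTowerVisionConfig
... )
>>> text_config = BridgeTowerTextConfig(num_hidden_layers=6)
>>> vision_config = BridgeTowerVisionConfig(image_size=224)
>>> config = BridgeTowerConfig.from_text_vision_configs(text_config, vision_config)
>>> config.vision_config.image_size
224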

mindnlp.transformers.models.bridgetower.configuration_bridgetower.BridgeTowerTextConfig

Bases: PretrainedConfig

This is the configuration class to store the text configuration of a [BridgeTowerModel]. The default values here are copied from RoBERTa. Instantiating a configuration with the defaults will yield a similar configuration to that of the BridgeTower/bridgetower-base architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vocab_size

Vocabulary size of the text part of the model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling [BridgeTowerModel].

TYPE: `int`, *optional*, defaults to 50265 DEFAULT: 50265

hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

intermediate_size

Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 3072 DEFAULT: 3072

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.

TYPE: `str` or `Callable`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

TYPE: `float`, *optional*, defaults to 0.1 DEFAULT: 0.1

attention_probs_dropout_prob

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.1 DEFAULT: 0.1

max_position_embeddings

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

TYPE: `int`, *optional*, defaults to 514 DEFAULT: 514

type_vocab_size

The vocabulary size of the token_type_ids.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

initializer_factor

A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

TYPE: `float`, *optional*, defaults to 1 DEFAULT: 1

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-05 DEFAULT: 1e-05

position_embedding_type

Type of position embedding. Choose one of "absolute", "relative_key", "relative_key_query". For positional embeddings use "absolute". For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on "relative_key_query", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

TYPE: `str`, *optional*, defaults to `"absolute"` DEFAULT: 'absolute'

is_decoder

Whether the model is used as a decoder or not. If False, the model is used as an encoder.

TYPE: `bool`, *optional*, defaults to `False`

use_cache

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

Example
>>> from transformers import BridgeTowerTextConfig
...
>>> # Initializing a BridgeTower BridgeTower/bridgetower-base style configuration for the text model
>>> configuration = BridgeTowerTextConfig()
...
>>> # Accessing the configuration
>>> configuration
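The is_decoder flag above is not an explicit __init__ argument; it is forwarded to [PretrainedConfig] through **kwargs, while use_cache is a named parameter. A hypothetical decoder-style variant could be sketched as follows (illustrative, not an official recipe).

>>> # Illustrative only: a decoder-style text configuration with caching enabled
>>> decoder_text_configuration = BridgeTowerTextConfig(is_decoder=True, use_cache=True)
>>> decoder_text_configuration.is_decoder
True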
Source code in mindnlp/transformers/models/bridgetower/configuration_bridgetower.py
class BridgeTowerTextConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the text configuration of a [`BridgeTowerModel`]. The default values here
    are copied from RoBERTa. Instantiating a configuration with the defaults will yield a similar configuration to that
    of the bridgetower-base [BridgeTower/bridgetower-base](https://huggingface.co/BridgeTower/bridgetower-base/)
    architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 50265):
            Vocabulary size of the text part of the model. Defines the number of different tokens that can be
            represented by the `inputs_ids` passed when calling [`BridgeTowerModel`].
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (`int`, *optional*, defaults to 514):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (`int`, *optional*, defaults to 1):
            The vocabulary size of the `token_type_ids`.
        initializer_factor (`float`, *optional*, defaults to 1):
            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
            testing).
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
        position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
            Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
            positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
            For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
        is_decoder (`bool`, *optional*, defaults to `False`):
            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.

    Example:
        ```python
        >>> from transformers import BridgeTowerTextConfig
        ...
        >>> # Initializing a BridgeTower BridgeTower/bridgetower-base style configuration for the text model
        >>> configuration = BridgeTowerTextConfig()
        ...
        >>> # Accessing the configuration
        >>> configuration
        ```
    """
    model_type = "bridgetower_text_model"

    def __init__(
        self,
        vocab_size=50265,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        initializer_factor=1,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=514,
        type_vocab_size=1,
        layer_norm_eps=1e-05,
        pad_token_id=1,
        bos_token_id=0,
        eos_token_id=2,
        position_embedding_type="absolute",
        use_cache=True,
        **kwargs,
    ):
        """
        Args:
            self (object): The instance of the class.
            vocab_size (int, optional): The size of the vocabulary. Defaults to 50265.
            hidden_size (int, optional): The hidden size of the model. Defaults to 768.
            num_hidden_layers (int, optional): The number of hidden layers. Defaults to 12.
            num_attention_heads (int, optional): The number of attention heads. Defaults to 12.
            initializer_factor (int, optional): The factor for weight initialization. Defaults to 1.
            intermediate_size (int, optional): The size of the intermediate layer in the transformer. Defaults to 3072.
            hidden_act (str, optional): The activation function for the hidden layers. Defaults to 'gelu'.
            hidden_dropout_prob (float, optional): The dropout probability for the hidden layers. Defaults to 0.1.
            attention_probs_dropout_prob (float, optional): The dropout probability for the attention probabilities. Defaults to 0.1.
            max_position_embeddings (int, optional): The maximum position for the embeddings. Defaults to 514.
            type_vocab_size (int, optional): The size of the type vocabulary. Defaults to 1.
            layer_norm_eps (float, optional): The epsilon value for layer normalization. Defaults to 1e-05.
            pad_token_id (int, optional): The token id for padding. Defaults to 1.
            bos_token_id (int, optional): The token id for the beginning of sequence. Defaults to 0.
            eos_token_id (int, optional): The token id for the end of sequence. Defaults to 2.
            position_embedding_type (str, optional): The type of position embedding. Defaults to 'absolute'.
            use_cache (bool, optional): Whether to use caching. Defaults to True.

        Returns:
            None.

        Raises:
            None
        """
        super().__init__(**kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.initializer_factor = initializer_factor
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.layer_norm_eps = layer_norm_eps
        self.position_embedding_type = position_embedding_type
        self.use_cache = use_cache
        self.pad_token_id = pad_token_id
        self.bos_token_id = bos_token_id
        self.eos_token_id = eos_token_id

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        """
        This method instantiates a BridgeTowerTextConfig from a pretrained model or a model configuration file.

        Args:
            cls (class): The class object itself.
            pretrained_model_name_or_path (Union[str, os.PathLike]):
                The name or path of the pretrained model or model configuration file.
                It can be a string or a valid os.PathLike object.

        Returns:
            PretrainedConfig: An instance of a PretrainedConfig object representing the configuration of the pretrained model.

        Raises:
            This method does not raise any specific exceptions.
        """
        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        if config_dict.get("model_type") == "bridgetower":
            config_dict = config_dict["text_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.bridgetower.configuration_bridgetower.BridgeTowerTextConfig.__init__(vocab_size=50265, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, initializer_factor=1, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=514, type_vocab_size=1, layer_norm_eps=1e-05, pad_token_id=1, bos_token_id=0, eos_token_id=2, position_embedding_type='absolute', use_cache=True, **kwargs)

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: object

vocab_size

The size of the vocabulary. Defaults to 50265.

TYPE: int DEFAULT: 50265

hidden_size

The hidden size of the model. Defaults to 768.

TYPE: int DEFAULT: 768

num_hidden_layers

The number of hidden layers. Defaults to 12.

TYPE: int DEFAULT: 12

num_attention_heads

The number of attention heads. Defaults to 12.

TYPE: int DEFAULT: 12

initializer_factor

The factor for weight initialization. Defaults to 1.

TYPE: int DEFAULT: 1

intermediate_size

The size of the intermediate layer in the transformer. Defaults to 3072.

TYPE: int DEFAULT: 3072

hidden_act

The activation function for the hidden layers. Defaults to 'gelu'.

TYPE: str DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for the hidden layers. Defaults to 0.1.

TYPE: float DEFAULT: 0.1

attention_probs_dropout_prob

The dropout probability for the attention probabilities. Defaults to 0.1.

TYPE: float DEFAULT: 0.1

max_position_embeddings

The maximum position for the embeddings. Defaults to 514.

TYPE: int DEFAULT: 514

type_vocab_size

The size of the type vocabulary. Defaults to 1.

TYPE: int DEFAULT: 1

layer_norm_eps

The epsilon value for layer normalization. Defaults to 1e-05.

TYPE: float DEFAULT: 1e-05

pad_token_id

The token id for padding. Defaults to 1.

TYPE: int DEFAULT: 1

bos_token_id

The token id for the beginning of sequence. Defaults to 0.

TYPE: int DEFAULT: 0

eos_token_id

The token id for the end of sequence. Defaults to 2.

TYPE: int DEFAULT: 2

position_embedding_type

The type of position embedding. Defaults to 'absolute'.

TYPE: str DEFAULT: 'absolute'

use_cache

Whether to use caching. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/bridgetower/configuration_bridgetower.py
def __init__(
    self,
    vocab_size=50265,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    initializer_factor=1,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=514,
    type_vocab_size=1,
    layer_norm_eps=1e-05,
    pad_token_id=1,
    bos_token_id=0,
    eos_token_id=2,
    position_embedding_type="absolute",
    use_cache=True,
    **kwargs,
):
    """
    Args:
        self (object): The instance of the class.
        vocab_size (int, optional): The size of the vocabulary. Defaults to 50265.
        hidden_size (int, optional): The hidden size of the model. Defaults to 768.
        num_hidden_layers (int, optional): The number of hidden layers. Defaults to 12.
        num_attention_heads (int, optional): The number of attention heads. Defaults to 12.
        initializer_factor (int, optional): The factor for weight initialization. Defaults to 1.
        intermediate_size (int, optional): The size of the intermediate layer in the transformer. Defaults to 3072.
        hidden_act (str, optional): The activation function for the hidden layers. Defaults to 'gelu'.
        hidden_dropout_prob (float, optional): The dropout probability for the hidden layers. Defaults to 0.1.
        attention_probs_dropout_prob (float, optional): The dropout probability for the attention probabilities. Defaults to 0.1.
        max_position_embeddings (int, optional): The maximum position for the embeddings. Defaults to 514.
        type_vocab_size (int, optional): The size of the type vocabulary. Defaults to 1.
        layer_norm_eps (float, optional): The epsilon value for layer normalization. Defaults to 1e-05.
        pad_token_id (int, optional): The token id for padding. Defaults to 1.
        bos_token_id (int, optional): The token id for the beginning of sequence. Defaults to 0.
        eos_token_id (int, optional): The token id for the end of sequence. Defaults to 2.
        position_embedding_type (str, optional): The type of position embedding. Defaults to 'absolute'.
        use_cache (bool, optional): Whether to use caching. Defaults to True.

    Returns:
        None.

    Raises:
        None
    """
    super().__init__(**kwargs)

    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.initializer_factor = initializer_factor
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.layer_norm_eps = layer_norm_eps
    self.position_embedding_type = position_embedding_type
    self.use_cache = use_cache
    self.pad_token_id = pad_token_id
    self.bos_token_id = bos_token_id
    self.eos_token_id = eos_token_id

mindnlp.transformers.models.bridgetower.configuration_bridgetower.BridgeTowerTextConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) classmethod

This method instantiates a BridgeTowerTextConfig from a pretrained model or a model configuration file.

PARAMETER DESCRIPTION
cls

The class object itself.

TYPE: class

pretrained_model_name_or_path

The name or path of the pretrained model or model configuration file. It can be a string or a valid os.PathLike object.

TYPE: Union[str, PathLike]

RETURNS DESCRIPTION
PretrainedConfig

An instance of a PretrainedConfig object representing the configuration of the pretrained model.

TYPE: PretrainedConfig

Source code in mindnlp/transformers/models/bridgetower/configuration_bridgetower.py
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
    """
    This method instantiates a BridgeTowerTextConfig from a pretrained model or a model configuration file.

    Args:
        cls (class): The class object itself.
        pretrained_model_name_or_path (Union[str, os.PathLike]):
            The name or path of the pretrained model or model configuration file.
            It can be a string or a valid os.PathLike object.

    Returns:
        PretrainedConfig: An instance of a PretrainedConfig object representing the configuration of the pretrained model.

    Raises:
        This method does not raise any specific exceptions.
    """
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

    if config_dict.get("model_type") == "bridgetower":
        config_dict = config_dict["text_config"]

    if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
        logger.warning(
            f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
            f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
        )

    return cls.from_dict(config_dict, **kwargs)
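A usage sketch, assuming the class is imported as in the example earlier on this page and that the BridgeTower/bridgetower-base checkpoint is reachable: because the downloaded configuration has model_type "bridgetower", only its text_config sub-dictionary is used, as the source above shows.

>>> # Sketch: extract only the text sub-configuration from a full BridgeTower checkpoint
>>> text_configuration = BridgeTowerTextConfig.from_pretrained("BridgeTower/bridgetower-base")
>>> type(text_configuration).__name__
'BridgeTowerTextConfig'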

mindnlp.transformers.models.bridgetower.configuration_bridgetower.BridgeTowerVisionConfig

Bases: PretrainedConfig

This is the configuration class to store the vision configuration of a [BridgeTowerModel]. Instantiating a configuration with the defaults will yield a similar configuration to that of the BridgeTower/bridgetower-base architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

num_hidden_layers

Number of hidden layers in the visual encoder model.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

patch_size

The size (resolution) of each patch.

TYPE: `int`, *optional*, defaults to 16 DEFAULT: 16

image_size

The size (resolution) of each image.

TYPE: `int`, *optional*, defaults to 288 DEFAULT: 288

initializer_factor

A factor for initializing all weight matrices (should be kept to 1, used internally for initialization testing).

TYPE: `float`, *optional*, defaults to 1 DEFAULT: 1

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-05 DEFAULT: 1e-05

stop_gradient

Whether to stop gradient for training.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

share_layernorm

Whether LayerNorm layers are shared.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

remove_last_layer

Whether to remove the last layer from the vision encoder.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

Example
>>> from transformers import BridgeTowerVisionConfig
...
>>> # Initializing a BridgeTower BridgeTower/bridgetower-base style configuration for the vision model
>>> configuration = BridgeTowerVisionConfig()
...
>>> # Accessing the configuration
>>> configuration
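As a quick check of how image_size and patch_size interact (a generic ViT-style calculation sketched here, not quoted from the library), the visual encoder sees (image_size / patch_size)² patches per image; ViT-style encoders typically prepend one class token on top of that.

>>> from mindnlp.transformers.models.bridgetower.configuration_bridgetower import BridgeTowerVisionConfig
>>> vision_configuration = BridgeTowerVisionConfig()
>>> (vision_configuration.image_size // vision_configuration.patch_size) ** 2
324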
Source code in mindnlp/transformers/models/bridgetower/configuration_bridgetower.py
class BridgeTowerVisionConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the vision configuration of a [`BridgeTowerModel`]. Instantiating a
    configuration with the defaults will yield a similar configuration to that of the bridgetower-base
    [BridgeTower/bridgetower-base](https://huggingface.co/BridgeTower/bridgetower-base/) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in visual encoder model.
        patch_size (`int`, *optional*, defaults to 16):
            The size (resolution) of each patch.
        image_size (`int`, *optional*, defaults to 288):
            The size (resolution) of each image.
        initializer_factor (`float`, *optional*, defaults to 1):
            A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
            testing).
        layer_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon used by the layer normalization layers.
        stop_gradient (`bool`, *optional*, defaults to `False`):
            Whether to stop gradient for training.
        share_layernorm (`bool`, *optional*, defaults to `True`):
            Whether LayerNorm layers are shared.
        remove_last_layer (`bool`, *optional*, defaults to `False`):
            Whether to remove the last layer from the vision encoder.

    Example:
        ```python
        >>> from transformers import BridgeTowerVisionConfig
        ...
        >>> # Initializing a BridgeTower BridgeTower/bridgetower-base style configuration for the vision model
        >>> configuration = BridgeTowerVisionConfig()
        ...
        >>> # Accessing the configuration
        >>> configuration
        ```
    """
    model_type = "bridgetower_vision_model"

    def __init__(
        self,
        hidden_size=768,
        num_hidden_layers=12,
        num_channels=3,
        patch_size=16,
        image_size=288,
        initializer_factor=1,
        layer_norm_eps=1e-05,
        stop_gradient=False,
        share_layernorm=True,
        remove_last_layer=False,
        **kwargs,
    ):
        """
        Initializes a BridgeTowerVisionConfig object with the specified configuration parameters.

        Args:
            hidden_size (int): The size of the hidden layers in the vision model.
            num_hidden_layers (int): The number of hidden layers in the vision model.
            num_channels (int): The number of input channels in the image data.
            patch_size (int): The size of the image patches used in the model.
            image_size (int): The size of the input images processed by the model.
            initializer_factor (int): A factor used for weight initialization in the model.
            layer_norm_eps (float): The epsilon value for layer normalization.
            stop_gradient (bool): Whether to stop gradients during training.
            share_layernorm (bool): Whether to share layer normalization parameters across layers.
            remove_last_layer (bool): Whether to remove the last layer of the model.
            **kwargs: Additional keyword arguments for customization.

        Returns:
            None.

        Raises:
            None
        """
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_channels = num_channels
        self.patch_size = patch_size
        self.image_size = image_size
        self.initializer_factor = initializer_factor
        self.layer_norm_eps = layer_norm_eps
        self.stop_gradient = stop_gradient
        self.share_layernorm = share_layernorm
        self.remove_last_layer = remove_last_layer

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        """
        This method creates an instance of a 'BridgeTowerVisionConfig' class from a pretrained model or its path.

        Args:
            cls (class): The class reference.
            pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.
                It can be a string or a path-like object representing the location of the pretrained model.

        Returns:
            PretrainedConfig: An instance of the 'PretrainedConfig' class representing the configuration of the pretrained model.

        Raises:
            TypeError: If the input parameters are of incorrect types.
            ValueError: If the configuration dictionary obtained from the pretrained model is incomplete or invalid.
            Warning: If attempting to instantiate a model of a different type than the specified 'model_type', as this may lead to errors.
        """
        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        if config_dict.get("model_type") == "bridgetower":
            config_dict = config_dict["vision_config"]  # use the vision sub-config when loading from a full BridgeTowerConfig

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.bridgetower.configuration_bridgetower.BridgeTowerVisionConfig.__init__(hidden_size=768, num_hidden_layers=12, num_channels=3, patch_size=16, image_size=288, initializer_factor=1, layer_norm_eps=1e-05, stop_gradient=False, share_layernorm=True, remove_last_layer=False, **kwargs)

Initializes a BridgeTowerVisionConfig object with the specified configuration parameters.

PARAMETER DESCRIPTION
hidden_size

The size of the hidden layers in the vision model.

TYPE: int DEFAULT: 768

num_hidden_layers

The number of hidden layers in the vision model.

TYPE: int DEFAULT: 12

num_channels

The number of input channels in the image data.

TYPE: int DEFAULT: 3

patch_size

The size of the image patches used in the model.

TYPE: int DEFAULT: 16

image_size

The size of the input images processed by the model.

TYPE: int DEFAULT: 288

initializer_factor

A factor used for weight initialization in the model.

TYPE: int DEFAULT: 1

layer_norm_eps

The epsilon value for layer normalization.

TYPE: float DEFAULT: 1e-05

stop_gradient

Whether to stop gradients during training.

TYPE: bool DEFAULT: False

share_layernorm

Whether to share layer normalization parameters across layers.

TYPE: bool DEFAULT: True

remove_last_layer

Whether to remove the last layer of the model.

TYPE: bool DEFAULT: False

**kwargs

Additional keyword arguments for customization.

DEFAULT: {}

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/bridgetower/configuration_bridgetower.py
def __init__(
    self,
    hidden_size=768,
    num_hidden_layers=12,
    num_channels=3,
    patch_size=16,
    image_size=288,
    initializer_factor=1,
    layer_norm_eps=1e-05,
    stop_gradient=False,
    share_layernorm=True,
    remove_last_layer=False,
    **kwargs,
):
    """
    Initializes a BridgeTowerVisionConfig object with the specified configuration parameters.

    Args:
        hidden_size (int): The size of the hidden layers in the vision model.
        num_hidden_layers (int): The number of hidden layers in the vision model.
        num_channels (int): The number of input channels in the image data.
        patch_size (int): The size of the image patches used in the model.
        image_size (int): The size of the input images processed by the model.
        initializer_factor (int): A factor used for weight initialization in the model.
        layer_norm_eps (float): The epsilon value for layer normalization.
        stop_gradient (bool): Whether to stop gradients during training.
        share_layernorm (bool): Whether to share layer normalization parameters across layers.
        remove_last_layer (bool): Whether to remove the last layer of the model.
        **kwargs: Additional keyword arguments for customization.

    Returns:
        None.

    Raises:
        None
    """
    super().__init__(**kwargs)
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_channels = num_channels
    self.patch_size = patch_size
    self.image_size = image_size
    self.initializer_factor = initializer_factor
    self.layer_norm_eps = layer_norm_eps
    self.stop_gradient = stop_gradient
    self.share_layernorm = share_layernorm
    self.remove_last_layer = remove_last_layer

mindnlp.transformers.models.bridgetower.configuration_bridgetower.BridgeTowerVisionConfig.from_pretrained(pretrained_model_name_or_path, **kwargs) classmethod

This method creates an instance of a 'BridgeTowerVisionConfig' class from a pretrained model or its path.

PARAMETER DESCRIPTION
cls

The class reference.

TYPE: class

pretrained_model_name_or_path

The name or path of the pretrained model. It can be a string or a path-like object representing the location of the pretrained model.

TYPE: Union[str, PathLike]

RETURNS DESCRIPTION
PretrainedConfig

An instance of the 'PretrainedConfig' class representing the configuration of the pretrained model.

TYPE: PretrainedConfig

RAISES DESCRIPTION
TypeError

If the input parameters are of incorrect types.

ValueError

If the configuration dictionary obtained from the pretrained model is incomplete or invalid.

Warning

If attempting to instantiate a model of a different type than the specified 'model_type', which may lead to errors.

Source code in mindnlp/transformers/models/bridgetower/configuration_bridgetower.py
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
    """
    This method creates an instance of a 'BridgeTowerVisionConfig' class from a pretrained model or its path.

    Args:
        cls (class): The class reference.
        pretrained_model_name_or_path (Union[str, os.PathLike]): The name or path of the pretrained model.
            It can be a string or a path-like object representing the location of the pretrained model.

    Returns:
        PretrainedConfig: An instance of the 'PretrainedConfig' class representing the configuration of the pretrained model.

    Raises:
        TypeError: If the input parameters are of incorrect types.
        ValueError: If the configuration dictionary obtained from the pretrained model is incomplete or invalid.
        Warning: If attempting to instantiate a model of a different type than the specified 'model_type', as this may lead to errors.
    """
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

    if config_dict.get("model_type") == "bridgetower":
        config_dict = config_dict["vision_config"]  # use the vision sub-config when loading from a full BridgeTowerConfig

    if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
        logger.warning(
            f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
            f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
        )

    return cls.from_dict(config_dict, **kwargs)
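A parallel usage sketch for the vision configuration, under the same assumptions as the text example above (class imported from this module, checkpoint reachable); the vision sub-configuration is pulled out of the full BridgeTower configuration dictionary.

>>> # Sketch: extract only the vision sub-configuration from a full BridgeTower checkpoint
>>> vision_configuration = BridgeTowerVisionConfig.from_pretrained("BridgeTower/bridgetower-base")
>>> type(vision_configuration).__name__
'BridgeTowerVisionConfig'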

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerForContrastiveLearning

Bases: BridgeTowerPreTrainedModel

Represents a BridgeTower model for contrastive learning.

This class inherits from BridgeTowerPreTrainedModel and includes initialization and forward methods for contrastive learning. It contains methods for processing input data, calculating contrastive loss, and returning outputs for text and image embeddings.

The forward method takes input tensors for text and image data, and optional parameters for attention, token types, and masks. It returns a BridgeTowerContrastiveOutput object containing the contrastive loss, logits, text embeddings, image embeddings, cross-modal embeddings, hidden states, and attentions.

The example provided demonstrates the usage of the BridgeTowerForContrastiveLearning class for processing images and texts, calculating contrastive loss, and obtaining model outputs.
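To make the contrastive objective concrete, below is a minimal NumPy sketch of the diagonal-label cross-entropy that the forward method applies to each pair of logit matrices (text-to-image, text-to-cross-modal, image-to-cross-modal). It illustrates the general InfoNCE-style recipe under the assumption of L2-normalized embeddings; it is not the model's MindSpore implementation.

import numpy as np

def contrastive_loss(text_embeds, image_embeds, logit_scale=1.0):
    """Cross-entropy over a similarity matrix whose matched pairs lie on the diagonal."""
    logits = logit_scale * text_embeds @ image_embeds.T    # (N, N) pairwise similarities
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    labels = np.arange(len(logits))                        # the target for row i is column i
    return -log_probs[labels, labels].mean()               # average over the batch

# In the model itself, logit_scale is the exponential of a learned parameter initialized from
# config.logit_scale_init_value, and the final loss combines the three pairwise terms.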

Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
class BridgeTowerForContrastiveLearning(BridgeTowerPreTrainedModel):

    """
    Represents a BridgeTower model for contrastive learning.

    This class inherits from BridgeTowerPreTrainedModel and includes initialization and forward methods for
    contrastive learning. It contains methods for processing input data, calculating contrastive loss, and returning
    outputs for text and image embeddings.

    The `forward` method takes input tensors for text and image data, and optional parameters for attention, token
    types, and masks. It returns a BridgeTowerContrastiveOutput object containing the contrastive loss, logits,
    text embeddings, image embeddings, cross-modal embeddings, hidden states, and attentions.

    The example provided demonstrates the usage of the BridgeTowerForContrastiveLearning class for processing images
    and texts, calculating contrastive loss, and obtaining model outputs.

    """
    def __init__(self, config):
        """
        Initializes an instance of the BridgeTowerForContrastiveLearning class.

        Args:
            self: The instance of the class.
            config: The configuration object containing various settings and parameters.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)

        self.bridgetower = BridgeTowerModel(config)

        self.itc_text_head = BridgeTowerContrastiveHead(config.hidden_size, config.contrastive_hidden_size)
        self.itc_image_head = BridgeTowerContrastiveHead(config.hidden_size, config.contrastive_hidden_size)
        self.itc_cross_modal_head = BridgeTowerContrastiveHead(config.hidden_size * 2, config.contrastive_hidden_size)

        self.logit_scale = Parameter(mindspore.tensor(self.config.logit_scale_init_value))
        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        pixel_values: Optional[mindspore.Tensor] = None,
        pixel_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        image_embeds: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = True,
        return_dict: Optional[bool] = None,
        return_loss: Optional[bool] = None,
    ) -> Union[BridgeTowerContrastiveOutput, Tuple[mindspore.Tensor]]:
        r"""
        Args:
            return_loss (`bool`, *optional*):
                Whether or not to return the contrastive loss.

        Returns:
            Union[BridgeTowerContrastiveOutput, Tuple[mindspore.Tensor]]

        Example:
            ```python
            >>> from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning
            >>> import requests
            >>> from PIL import Image
            >>> import torch
            ...
            >>> image_urls = [
            ...     "https://farm4.staticflickr.com/3395/3428278415_81c3e27f15_z.jpg",
            ...     "http://images.cocodataset.org/val2017/000000039769.jpg",
            ... ]
            >>> texts = ["two dogs in a car", "two cats sleeping on a couch"]
            >>> images = [Image.open(requests.get(url, stream=True).raw) for url in image_urls]
            ...
            >>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
            >>> model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
            ...
            >>> inputs = processor(images, texts, padding=True, return_tensors="pt")
            >>> loss = model(**inputs, return_loss=True).loss
            ...
            >>> inputs = processor(images, texts[::-1], padding=True, return_tensors="pt")
            >>> loss_swapped = model(**inputs, return_loss=True).loss
            ...
            >>> print("Loss", round(loss.item(), 4))
            Loss 0.0019
            >>> print("Loss with swapped images", round(loss_swapped.item(), 4))
            Loss with swapped images 2.126
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.bridgetower(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            pixel_values=pixel_values,
            pixel_mask=pixel_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            image_embeds=image_embeds,
            output_attentions=output_attentions,
            output_hidden_states=True,
            return_dict=return_dict,
        )

        pooler_output = outputs.pooler_output if return_dict else outputs[2]
        hidden_states_txt, hidden_states_img, hidden_states_cross_modal = (
            outputs.hidden_states if return_dict else outputs[3]
        )

        text_embeds = hidden_states_txt[-1]
        image_embeds = hidden_states_img[-1]

        image_embeds_with_ln = self.bridgetower.vision_model.visual.forward_post(image_embeds)
        image_token_type_embeddings = self.bridgetower.token_type_embeddings(
            ops.full((1,), 1, dtype=mindspore.int64)
        ).expand_as(image_embeds_with_ln)

        image_embeds = self.bridgetower.cross_modal_image_transform(image_embeds_with_ln) + image_token_type_embeddings

        # normalized features
        text_embeds = F.normalize(self.itc_text_head(text_embeds[:, 0, :]), dim=-1, p=2)
        image_embeds = F.normalize(self.itc_image_head(image_embeds[:, 0, :]), dim=-1, p=2)
        cross_embeds = F.normalize(self.itc_cross_modal_head(pooler_output), dim=-1, p=2)
        logits = ops.stack([text_embeds, image_embeds, cross_embeds], dim=-2)

        logit_scale = self.logit_scale.exp()
        logits_text_to_image = ops.matmul(text_embeds, image_embeds.t()) * logit_scale
        logits_text_to_cross = ops.matmul(text_embeds, cross_embeds.t()) * logit_scale
        logits_image_to_cross = ops.matmul(image_embeds, cross_embeds.t()) * logit_scale

        itc_loss = None

        if return_loss:
            labels = ops.arange(len(logits))
            text_to_image_loss = F.cross_entropy(logits_text_to_image, labels)
            text_to_cross_loss = F.cross_entropy(logits_text_to_cross, labels)
            image_to_cross_loss = F.cross_entropy(logits_image_to_cross, labels)
            itc_loss = (text_to_image_loss + text_to_cross_loss + image_to_cross_loss) / 3.0

        if not return_dict:
            output = (logits, text_embeds, image_embeds, cross_embeds) + outputs[3:]
            return ((itc_loss,) + output) if itc_loss is not None else output

        return BridgeTowerContrastiveOutput(
            loss=itc_loss,
            logits=logits,
            text_embeds=text_embeds,
            image_embeds=image_embeds,
            cross_embeds=cross_embeds,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerForContrastiveLearning.__init__(config)

Initializes an instance of the BridgeTowerForContrastiveLearning class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object containing various settings and parameters.

RETURNS DESCRIPTION

None.
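As a quick orientation before the source below, here is a minimal, untested sketch that instantiates the head from a default configuration instead of pretrained weights; it assumes the default BridgeTowerConfig already defines the contrastive_hidden_size and logit_scale_init_value fields read by this constructor.

from mindnlp.transformers.models.bridgetower.configuration_bridgetower import BridgeTowerConfig
from mindnlp.transformers.models.bridgetower.modeling_bridgetower import BridgeTowerForContrastiveLearning

# Randomly initialized model: the constructor wraps a BridgeTowerModel and adds the
# three ITC projection heads plus the learnable logit scale on top of it.
config = BridgeTowerConfig()
model = BridgeTowerForContrastiveLearning(config)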

Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def __init__(self, config):
    """
    Initializes an instance of the BridgeTowerForContrastiveLearning class.

    Args:
        self: The instance of the class.
        config: The configuration object containing various settings and parameters.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)

    self.bridgetower = BridgeTowerModel(config)

    self.itc_text_head = BridgeTowerContrastiveHead(config.hidden_size, config.contrastive_hidden_size)
    self.itc_image_head = BridgeTowerContrastiveHead(config.hidden_size, config.contrastive_hidden_size)
    self.itc_cross_modal_head = BridgeTowerContrastiveHead(config.hidden_size * 2, config.contrastive_hidden_size)

    self.logit_scale = Parameter(mindspore.tensor(self.config.logit_scale_init_value))
    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerForContrastiveLearning.forward(input_ids=None, attention_mask=None, token_type_ids=None, pixel_values=None, pixel_mask=None, head_mask=None, inputs_embeds=None, image_embeds=None, output_attentions=None, output_hidden_states=True, return_dict=None, return_loss=None)

PARAMETER DESCRIPTION
return_loss

Whether or not to return the contrastive loss.

TYPE: `bool`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[BridgeTowerContrastiveOutput, Tuple[Tensor]]

Union[BridgeTowerContrastiveOutput, Tuple[mindspore.Tensor]]

Example
>>> from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning
>>> import requests
>>> from PIL import Image
>>> import torch
...
>>> image_urls = [
...     "https://farm4.staticflickr.com/3395/3428278415_81c3e27f15_z.jpg",
...     "http://images.cocodataset.org/val2017/000000039769.jpg",
... ]
>>> texts = ["two dogs in a car", "two cats sleeping on a couch"]
>>> images = [Image.open(requests.get(url, stream=True).raw) for url in image_urls]
...
>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
>>> model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
...
>>> inputs = processor(images, texts, padding=True, return_tensors="pt")
>>> loss = model(**inputs, return_loss=True).loss
...
>>> inputs = processor(images, texts[::-1], padding=True, return_tensors="pt")
>>> loss_swapped = model(**inputs, return_loss=True).loss
...
>>> print("Loss", round(loss.item(), 4))
Loss 0.0019
>>> print("Loss with swapped images", round(loss_swapped.item(), 4))
Loss with swapped images 2.126
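To make the returned loss concrete, the illustrative sketch below mirrors how the forward pass (source below) combines the three pairwise similarity matrices, text-to-image, text-to-cross-modal and image-to-cross-modal, into one averaged cross-entropy. It uses torch, as the example above does, and is not the library code itself.

import torch
import torch.nn.functional as F

def itc_loss(text_embeds, image_embeds, cross_embeds, logit_scale):
    # All three inputs are L2-normalized projections of shape (batch, projection_dim);
    # the diagonal entries are the matching pairs, so the targets are 0..batch-1.
    labels = torch.arange(text_embeds.shape[0])
    t2i = logit_scale * text_embeds @ image_embeds.t()
    t2c = logit_scale * text_embeds @ cross_embeds.t()
    i2c = logit_scale * image_embeds @ cross_embeds.t()
    return (F.cross_entropy(t2i, labels)
            + F.cross_entropy(t2c, labels)
            + F.cross_entropy(i2c, labels)) / 3.0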
Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    token_type_ids: Optional[mindspore.Tensor] = None,
    pixel_values: Optional[mindspore.Tensor] = None,
    pixel_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    image_embeds: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = True,
    return_dict: Optional[bool] = None,
    return_loss: Optional[bool] = None,
) -> Union[BridgeTowerContrastiveOutput, Tuple[mindspore.Tensor]]:
    r"""
    Args:
        return_loss (`bool`, *optional*):
            Whether or not to return the contrastive loss.

    Returns:
        Union[BridgeTowerContrastiveOutput, Tuple[mindspore.Tensor]]

    Example:
        ```python
        >>> from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning
        >>> import requests
        >>> from PIL import Image
        >>> import torch
        ...
        >>> image_urls = [
        ...     "https://farm4.staticflickr.com/3395/3428278415_81c3e27f15_z.jpg",
        ...     "http://images.cocodataset.org/val2017/000000039769.jpg",
        ... ]
        >>> texts = ["two dogs in a car", "two cats sleeping on a couch"]
        >>> images = [Image.open(requests.get(url, stream=True).raw) for url in image_urls]
        ...
        >>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
        >>> model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
        ...
        >>> inputs = processor(images, texts, padding=True, return_tensors="pt")
        >>> loss = model(**inputs, return_loss=True).loss
        ...
        >>> inputs = processor(images, texts[::-1], padding=True, return_tensors="pt")
        >>> loss_swapped = model(**inputs, return_loss=True).loss
        ...
        >>> print("Loss", round(loss.item(), 4))
        Loss 0.0019
        >>> print("Loss with swapped images", round(loss_swapped.item(), 4))
        Loss with swapped images 2.126
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    outputs = self.bridgetower(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        pixel_values=pixel_values,
        pixel_mask=pixel_mask,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        image_embeds=image_embeds,
        output_attentions=output_attentions,
        output_hidden_states=True,
        return_dict=return_dict,
    )

    pooler_output = outputs.pooler_output if return_dict else outputs[2]
    hidden_states_txt, hidden_states_img, hidden_states_cross_modal = (
        outputs.hidden_states if return_dict else outputs[3]
    )

    text_embeds = hidden_states_txt[-1]
    image_embeds = hidden_states_img[-1]

    image_embeds_with_ln = self.bridgetower.vision_model.visual.forward_post(image_embeds)
    image_token_type_embeddings = self.bridgetower.token_type_embeddings(
        ops.full((1,), 1, dtype=mindspore.int64)
    ).expand_as(image_embeds_with_ln)

    image_embeds = self.bridgetower.cross_modal_image_transform(image_embeds_with_ln) + image_token_type_embeddings

    # normalized features
    text_embeds = F.normalize(self.itc_text_head(text_embeds[:, 0, :]), dim=-1, p=2)
    image_embeds = F.normalize(self.itc_image_head(image_embeds[:, 0, :]), dim=-1, p=2)
    cross_embeds = F.normalize(self.itc_cross_modal_head(pooler_output), dim=-1, p=2)
    logits = ops.stack([text_embeds, image_embeds, cross_embeds], dim=-2)

    logit_scale = self.logit_scale.exp()
    logits_text_to_image = ops.matmul(text_embeds, image_embeds.t()) * logit_scale
    logits_text_to_cross = ops.matmul(text_embeds, cross_embeds.t()) * logit_scale
    logits_image_to_cross = ops.matmul(image_embeds, cross_embeds.t()) * logit_scale

    itc_loss = None

    if return_loss:
        labels = ops.arange(len(logits))
        text_to_image_loss = F.cross_entropy(logits_text_to_image, labels)
        text_to_cross_loss = F.cross_entropy(logits_text_to_cross, labels)
        image_to_cross_loss = F.cross_entropy(logits_image_to_cross, labels)
        itc_loss = (text_to_image_loss + text_to_cross_loss + image_to_cross_loss) / 3.0

    if not return_dict:
        output = (logits, text_embeds, image_embeds, cross_embeds) + outputs[3:]
        return ((itc_loss,) + output) if itc_loss is not None else output

    return BridgeTowerContrastiveOutput(
        loss=itc_loss,
        logits=logits,
        text_embeds=text_embeds,
        image_embeds=image_embeds,
        cross_embeds=cross_embeds,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerForImageAndTextRetrieval

Bases: BridgeTowerPreTrainedModel

BridgeTowerForImageAndTextRetrieval is a class for performing image and text retrieval using the BridgeTower model.

This class extends BridgeTowerPreTrainedModel and provides methods for running the forward pass and computing the image-text matching loss.

PARAMETER DESCRIPTION
config

Configuration for the model.

TYPE: BridgeTowerConfig

RETURNS DESCRIPTION

SequenceClassifierOutput or Tuple[mindspore.Tensor]: The output of the model, including the image-text matching loss and the logits.

Example
>>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
>>> import requests
>>> from PIL import Image
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
>>> model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
>>> scores = dict()
>>> for text in texts:
...     encoding = processor(image, text, return_tensors="pt")
...     outputs = model(**encoding)
...     scores[text] = outputs.logits[0, 1].item()
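As a small follow-up to the example above (reusing its processor, model, image and texts), the hedged sketch below converts the two-way ITM logits into a match probability, index 1 being the "match" class as in the example, and picks the best-scoring caption.

import torch

probs = {}
for text in texts:
    encoding = processor(image, text, return_tensors="pt")
    logits = model(**encoding).logits                        # shape (1, 2): [no-match, match]
    probs[text] = torch.softmax(logits, dim=-1)[0, 1].item()
best_caption = max(probs, key=probs.get)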
Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
class BridgeTowerForImageAndTextRetrieval(BridgeTowerPreTrainedModel):

    """
    BridgeTowerForImageAndTextRetrieval is a class for performing image and text retrieval using the BridgeTower model.

    This class extends BridgeTowerPreTrainedModel and provides methods for running the forward pass and computing the image-text matching loss.

    Args:
        config (BridgeTowerConfig): Configuration for the model.

    Returns:
        SequenceClassifierOutput or Tuple[mindspore.Tensor]: The output of the model, including the image-text matching loss and the logits.

    Example:
        ```python
        >>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
        >>> import requests
        >>> from PIL import Image
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        >>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
        >>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
        >>> model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
        >>> scores = dict()
        >>> for text in texts:
        ...     encoding = processor(image, text, return_tensors="pt")
        ...     outputs = model(**encoding)
        ...     scores[text] = outputs.logits[0, 1].item()
        ```
    """
    def __init__(self, config):
        """
        Initializes an instance of the BridgeTowerForImageAndTextRetrieval class.

        Args:
            self: The instance of the class itself.
            config: A configuration object containing the necessary parameters for initialization.

        Returns:
            None

        Raises:
            None
        """
        super().__init__(config)

        self.bridgetower = BridgeTowerModel(config)

        self.itm_score = BridgeTowerITMHead(config.hidden_size * 2)

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        pixel_values: Optional[mindspore.Tensor] = None,
        pixel_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        image_embeds: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        labels: Optional[mindspore.Tensor] = None,
    ) -> Union[SequenceClassifierOutput, Tuple[mindspore.Tensor]]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size, 1)`, *optional*):
                Labels for computing the image-text matching loss. 0 means the pairs don't match and 1 means they match.
                The pairs with 0 will be skipped for calculation.

        Returns:
            Union[SequenceClassifierOutput, Tuple[mindspore.Tensor]]

        Example:
            ```python
            >>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
            >>> import requests
            >>> from PIL import Image
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            >>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
            ...
            >>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
            >>> model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
            ...
            >>> # forward pass
            >>> scores = dict()
            >>> for text in texts:
            ...     # prepare inputs
            ...     encoding = processor(image, text, return_tensors="pt")
            ...     outputs = model(**encoding)
            ...     scores[text] = outputs.logits[0, 1].item()
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.bridgetower(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            pixel_values=pixel_values,
            pixel_mask=pixel_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            image_embeds=image_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooler_output = outputs.pooler_output if return_dict else outputs[2]

        logits = self.itm_score(pooler_output)

        itm_loss = None
        if labels is not None:
            itm_loss = F.cross_entropy(logits, labels)

        if not return_dict:
            output = tuple(logits)
            return ((itm_loss,) + output) if itm_loss is not None else output

        return SequenceClassifierOutput(
            loss=itm_loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerForImageAndTextRetrieval.__init__(config)

Initializes an instance of the BridgeTowerForImageAndTextRetrieval class.

PARAMETER DESCRIPTION
self

The instance of the class itself.

config

A configuration object containing the necessary parameters for initialization.

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def __init__(self, config):
    """
    Initializes an instance of the BridgeTowerForImageAndTextRetrieval class.

    Args:
        self: The instance of the class itself.
        config: A configuration object containing the necessary parameters for initialization.

    Returns:
        None

    Raises:
        None
    """
    super().__init__(config)

    self.bridgetower = BridgeTowerModel(config)

    self.itm_score = BridgeTowerITMHead(config.hidden_size * 2)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerForImageAndTextRetrieval.forward(input_ids=None, attention_mask=None, token_type_ids=None, pixel_values=None, pixel_mask=None, head_mask=None, inputs_embeds=None, image_embeds=None, output_attentions=None, output_hidden_states=None, return_dict=None, labels=None)

PARAMETER DESCRIPTION
labels

Labels for computing the image-text matching loss. 0 means the pairs don't match and 1 means they match. The pairs with 0 will be skipped for calculation.

TYPE: `mindspore.Tensor` of shape `(batch_size, 1)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[SequenceClassifierOutput, Tuple[Tensor]]

Union[SequenceClassifierOutput, Tuple[mindspore.Tensor]]

Example
>>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
>>> import requests
>>> from PIL import Image
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
...
>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
>>> model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
...
>>> # forward pass
>>> scores = dict()
>>> for text in texts:
...     # prepare inputs
...     encoding = processor(image, text, return_tensors="pt")
...     outputs = model(**encoding)
...     scores[text] = outputs.logits[0, 1].item()
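When labels are supplied, the same forward pass also returns an image-text matching loss, computed as a two-way cross-entropy over the ITM logits (see the source below). The sketch reuses the processor, model, image and texts from the example above; the label shape of (batch_size,) with one 0/1 entry per pair is an assumption consistent with that cross-entropy call.

import torch

encoding = processor(image, texts[0], return_tensors="pt")
outputs = model(**encoding, labels=torch.tensor([1]))        # 1 = matching pair, 0 = mismatched pair
print(outputs.loss)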
Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    token_type_ids: Optional[mindspore.Tensor] = None,
    pixel_values: Optional[mindspore.Tensor] = None,
    pixel_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    image_embeds: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    labels: Optional[mindspore.Tensor] = None,
) -> Union[SequenceClassifierOutput, Tuple[mindspore.Tensor]]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size, 1)`, *optional*):
            Labels for computing the image-text matching loss. 0 means the pairs don't match and 1 means they match.
            The pairs with 0 will be skipped for calculation.

    Returns:
        Union[SequenceClassifierOutput, Tuple[mindspore.Tensor]]

    Example:
        ```python
        >>> from transformers import BridgeTowerProcessor, BridgeTowerForImageAndTextRetrieval
        >>> import requests
        >>> from PIL import Image
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        >>> texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]
        ...
        >>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
        >>> model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
        ...
        >>> # forward pass
        >>> scores = dict()
        >>> for text in texts:
        ...     # prepare inputs
        ...     encoding = processor(image, text, return_tensors="pt")
        ...     outputs = model(**encoding)
        ...     scores[text] = outputs.logits[0, 1].item()
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    outputs = self.bridgetower(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        pixel_values=pixel_values,
        pixel_mask=pixel_mask,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        image_embeds=image_embeds,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooler_output = outputs.pooler_output if return_dict else outputs[2]

    logits = self.itm_score(pooler_output)

    itm_loss = None
    if labels is not None:
        itm_loss = F.cross_entropy(logits, labels)

    if not return_dict:
        output = tuple(logits)
        return ((itm_loss,) + output) if itm_loss is not None else output

    return SequenceClassifierOutput(
        loss=itm_loss,
        logits=logits,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerForMaskedLM

Bases: BridgeTowerPreTrainedModel

BridgeTowerForMaskedLM class represents a model for masked language modeling using the BridgeTower architecture. It inherits functionality from the BridgeTowerPreTrainedModel class.

This class includes methods for initializing the model with configuration, getting and setting output embeddings, and running the forward pass for inference. The 'forward' method takes various input tensors such as input_ids, attention_mask, token_type_ids, pixel_values, pixel_mask, etc., and returns masked language modeling outputs. It also supports optional labels for computing the masked language modeling loss.

The class provides an example of how to use the model for masked language modeling tasks using images and text inputs. It showcases the process of preparing inputs, performing a forward pass, decoding model outputs, and printing the results.

The BridgeTowerForMaskedLM class encapsulates the functionality for masked language modeling tasks using the BridgeTower architecture.

Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
class BridgeTowerForMaskedLM(BridgeTowerPreTrainedModel):

    """
    BridgeTowerForMaskedLM class represents a model for masked language modeling using the BridgeTower architecture.
    It inherits functionality from the BridgeTowerPreTrainedModel class.

    This class includes methods for initializing the model with configuration, getting and setting output embeddings,
    and running the forward pass for inference.
    The 'forward' method takes various input tensors such as input_ids, attention_mask, token_type_ids, pixel_values,
    pixel_mask, etc., and returns masked language modeling outputs.
    It also supports optional labels for computing the masked language modeling loss.

    The class provides an example of how to use the model for masked language modeling tasks using images and text inputs.
    It showcases the process of preparing inputs, performing a forward pass, decoding model outputs, and printing the results.

    The BridgeTowerForMaskedLM class encapsulates the functionality for masked language modeling tasks using the BridgeTower architecture.
    """
    _tied_weights_keys = ["mlm_score.decoder.weight"]

    def __init__(self, config):
        """
        __init__

        Initializes an instance of the BridgeTowerForMaskedLM class.

        Args:
            self (object): The instance of the class.
            config (object): The configuration object containing settings and parameters for the BridgeTowerForMaskedLM instance.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)

        self.bridgetower = BridgeTowerModel(config)
        self.mlm_score = BridgeTowerMLMHead(config)

        # Initialize weights and apply final processing
        self.post_init()

    def get_output_embeddings(self):
        """
        This method returns the output embeddings for the Masked Language Model (MLM) decoder.

        Args:
            self (BridgeTowerForMaskedLM): The instance of the BridgeTowerForMaskedLM class.

        Returns:
            The decoder of the MLM head (`self.mlm_score.decoder`), which serves as the output embedding layer.

        Raises:
            None
        """
        return self.mlm_score.decoder

    def set_output_embeddings(self, new_embeddings):
        """
        Sets the output embeddings for the BridgeTowerForMaskedLM model.

        Args:
            self (BridgeTowerForMaskedLM): The instance of the BridgeTowerForMaskedLM class.
            new_embeddings (Tensor): The new embeddings to be set as the output embeddings. It should be a tensor of shape (vocab_size, hidden_size).

        Returns:
            None.

        Raises:
            None.

        This method sets the output embeddings for the BridgeTowerForMaskedLM model by updating the decoder attribute
        of the mlm_score object. The new_embeddings parameter should be a tensor representing the new embeddings to be
        used as the output embeddings. The tensor should have a shape of (vocab_size, hidden_size) where vocab_size is
        the number of tokens in the vocabulary and hidden_size is the size of the hidden state of the model.

        Example:
            ```python
            >>> model = BridgeTowerForMaskedLM()
            >>> new_embeddings = torch.randn(model.vocab_size, model.hidden_size)
            >>> model.set_output_embeddings(new_embeddings)
            ```
        """
        self.mlm_score.decoder = new_embeddings

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        pixel_values: Optional[mindspore.Tensor] = None,
        pixel_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        image_embeds: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        labels: Optional[mindspore.Tensor] = None,
    ) -> Union[MaskedLMOutput, Tuple[mindspore.Tensor]]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
                config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
                loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`

        Returns:
            `Union[MaskedLMOutput, Tuple[mindspore.Tensor]]`

        Example:
            ```python
            >>> from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM
            >>> from PIL import Image
            >>> import requests
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000360943.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
            >>> text = "a <mask> looking out of the window"
            ...
            >>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
            >>> model = BridgeTowerForMaskedLM.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
            ...
            >>> # prepare inputs
            >>> encoding = processor(image, text, return_tensors="pt")
            ...
            >>> # forward pass
            >>> outputs = model(**encoding)
            ...
            >>> results = processor.decode(outputs.logits.argmax(dim=-1).squeeze(0).tolist())
            ...
            >>> print(results)
            .a cat looking out of the window.
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        outputs = self.bridgetower(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            pixel_values=pixel_values,
            pixel_mask=pixel_mask,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            image_embeds=image_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        mlm_logits = self.mlm_score(outputs.text_features if return_dict else outputs[0])
        masked_lm_loss = None
        if labels is not None:
            masked_lm_loss = F.cross_entropy(mlm_logits.view(-1, self.config.text_config.vocab_size), labels.view(-1))

        if not return_dict:
            output = tuple(mlm_logits)
            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output

        return MaskedLMOutput(
            loss=masked_lm_loss,
            logits=mlm_logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerForMaskedLM.__init__(config)

__init__

Initializes an instance of the BridgeTowerForMaskedLM class.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: object

config

The configuration object containing settings and parameters for the BridgeTowerForMaskedLM instance.

TYPE: object

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def __init__(self, config):
    """
    __init__

    Initializes an instance of the BridgeTowerForMaskedLM class.

    Args:
        self (object): The instance of the class.
        config (object): The configuration object containing settings and parameters for the BridgeTowerForMaskedLM instance.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)

    self.bridgetower = BridgeTowerModel(config)
    self.mlm_score = BridgeTowerMLMHead(config)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerForMaskedLM.forward(input_ids=None, attention_mask=None, token_type_ids=None, pixel_values=None, pixel_mask=None, head_mask=None, inputs_embeds=None, image_embeds=None, output_attentions=None, output_hidden_states=None, return_dict=None, labels=None)

PARAMETER DESCRIPTION
labels

Labels for computing the masked language modeling loss. Indices should be in [-100, 0, ..., config.vocab_size] (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[MaskedLMOutput, Tuple[Tensor]]

Union[MaskedLMOutput, Tuple[mindspore.Tensor]]

Example
>>> from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM
>>> from PIL import Image
>>> import requests
...
>>> url = "http://images.cocodataset.org/val2017/000000360943.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
>>> text = "a <mask> looking out of the window"
...
>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
>>> model = BridgeTowerForMaskedLM.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
...
>>> # prepare inputs
>>> encoding = processor(image, text, return_tensors="pt")
...
>>> # forward pass
>>> outputs = model(**encoding)
...
>>> results = processor.decode(outputs.logits.argmax(dim=-1).squeeze(0).tolist())
...
>>> print(results)
.a cat looking out of the window.
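For training rather than decoding, the labels convention described above can be built by ignoring every position except the masked one. The sketch below reuses the processor, model, image and text from the example; the processor.tokenizer attribute and the " cat" target token are illustrative assumptions.

import torch

encoding = processor(image, text, return_tensors="pt")
labels = torch.full_like(encoding["input_ids"], -100)                  # -100 positions are ignored by the loss
mask_positions = encoding["input_ids"] == processor.tokenizer.mask_token_id
target_id = processor.tokenizer(" cat", add_special_tokens=False)["input_ids"][0]
labels[mask_positions] = target_id                                     # supervise only the masked token
outputs = model(**encoding, labels=labels)
print(outputs.loss)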
Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    token_type_ids: Optional[mindspore.Tensor] = None,
    pixel_values: Optional[mindspore.Tensor] = None,
    pixel_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    image_embeds: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    labels: Optional[mindspore.Tensor] = None,
) -> Union[MaskedLMOutput, Tuple[mindspore.Tensor]]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
            config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are ignored (masked), the
            loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`

    Returns:
        `Union[MaskedLMOutput, Tuple[mindspore.Tensor]]`

    Example:
        ```python
        >>> from transformers import BridgeTowerProcessor, BridgeTowerForMaskedLM
        >>> from PIL import Image
        >>> import requests
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000360943.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
        >>> text = "a <mask> looking out of the window"
        ...
        >>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
        >>> model = BridgeTowerForMaskedLM.from_pretrained("BridgeTower/bridgetower-base-itm-mlm")
        ...
        >>> # prepare inputs
        >>> encoding = processor(image, text, return_tensors="pt")
        ...
        >>> # forward pass
        >>> outputs = model(**encoding)
        ...
        >>> results = processor.decode(outputs.logits.argmax(dim=-1).squeeze(0).tolist())
        ...
        >>> print(results)
        .a cat looking out of the window.
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    outputs = self.bridgetower(
        input_ids,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        pixel_values=pixel_values,
        pixel_mask=pixel_mask,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        image_embeds=image_embeds,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    mlm_logits = self.mlm_score(outputs.text_features if return_dict else outputs[0])
    masked_lm_loss = None
    if labels is not None:
        masked_lm_loss = F.cross_entropy(mlm_logits.view(-1, self.config.text_config.vocab_size), labels.view(-1))

    if not return_dict:
        output = tuple(mlm_logits)
        return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output

    return MaskedLMOutput(
        loss=masked_lm_loss,
        logits=mlm_logits,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerForMaskedLM.get_output_embeddings()

This method returns the output embeddings for the Masked Language Model (MLM) decoder.

PARAMETER DESCRIPTION
self

The instance of the BridgeTowerForMaskedLM class.

TYPE: BridgeTowerForMaskedLM

RETURNS DESCRIPTION
The decoder of the MLM head (`self.mlm_score.decoder`)

The module that serves as the output embedding layer of the masked language model.

Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def get_output_embeddings(self):
    """
    This method returns the output embeddings for the Masked Language Model (MLM) decoder.

    Args:
        self (BridgeTowerForMaskedLM): The instance of the BridgeTowerForMaskedLM class.

    Returns:
        The decoder of the MLM head (`self.mlm_score.decoder`), which serves as the output embedding layer.

    Raises:
        None
    """
    return self.mlm_score.decoder

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerForMaskedLM.set_output_embeddings(new_embeddings)

Sets the output embeddings for the BridgeTowerForMaskedLM model.

PARAMETER DESCRIPTION
self

The instance of the BridgeTowerForMaskedLM class.

TYPE: BridgeTowerForMaskedLM

new_embeddings

The new embeddings to be set as the output embeddings. It should be a tensor of shape (vocab_size, hidden_size).

TYPE: Tensor

RETURNS DESCRIPTION

None.

This method sets the output embeddings for the BridgeTowerForMaskedLM model by updating the decoder attribute of the mlm_score object. The new_embeddings parameter should be a tensor representing the new embeddings to be used as the output embeddings. The tensor should have a shape of (vocab_size, hidden_size) where vocab_size is the number of tokens in the vocabulary and hidden_size is the size of the hidden state of the model.

Example
>>> model = BridgeTowerForMaskedLM()
>>> new_embeddings = torch.randn(model.vocab_size, model.hidden_size)
>>> model.set_output_embeddings(new_embeddings)
Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def set_output_embeddings(self, new_embeddings):
    """
    Sets the output embeddings for the BridgeTowerForMaskedLM model.

    Args:
        self (BridgeTowerForMaskedLM): The instance of the BridgeTowerForMaskedLM class.
        new_embeddings (Tensor): The new embeddings to be set as the output embeddings. It should be a tensor of shape (vocab_size, hidden_size).

    Returns:
        None.

    Raises:
        None.

    This method sets the output embeddings for the BridgeTowerForMaskedLM model by updating the decoder attribute
    of the mlm_score object. The new_embeddings parameter should be a tensor representing the new embeddings to be
    used as the output embeddings. The tensor should have a shape of (vocab_size, hidden_size) where vocab_size is
    the number of tokens in the vocabulary and hidden_size is the size of the hidden state of the model.

    Example:
        ```python
        >>> model = BridgeTowerForMaskedLM()
        >>> new_embeddings = torch.randn(model.vocab_size, model.hidden_size)
        >>> model.set_output_embeddings(new_embeddings)
        ```
    """
    self.mlm_score.decoder = new_embeddings

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerModel

Bases: BridgeTowerPreTrainedModel

BridgeTowerModel represents a BridgeTower model, which processes multimodal inputs by combining text and image information with cross-modal transformers.

This class inherits from BridgeTowerPreTrainedModel and implements methods for initializing the model, running the forward pass, and extracting classification features.

The BridgeTowerModel class includes methods for getting and setting input embeddings, as well as running the forward pass over multimodal inputs. It also provides a method for obtaining classification features from the processed multimodal inputs.

ATTRIBUTE DESCRIPTION
config

The configuration for the BridgeTowerModel.

METHOD DESCRIPTION
__init__

Initializes the BridgeTowerModel with the provided configuration.

get_input_embeddings

Retrieves the input embeddings from the text model.

set_input_embeddings

Sets the input embeddings for the text model.

forward

Runs the forward pass over multimodal inputs and returns the model output.

get_cls_features

Retrieves the classification features from the processed multimodal inputs.

Example
>>> from transformers import BridgeTowerProcessor, BridgeTowerModel
>>> from PIL import Image
>>> import requests
...
>>> # prepare image and text
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> text = "hello world"
>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base")
>>> model = BridgeTowerModel.from_pretrained("BridgeTower/bridgetower-base")
...
>>> inputs = processor(image, text, return_tensors="pt")
>>> outputs = model(**inputs)
>>> outputs.keys()
odict_keys(['text_features', 'image_features', 'pooler_output'])
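Building on the example above, the hedged sketch below additionally requests per-tower hidden states; as the forward docstring below describes, they come back as a triple of (text, image, cross-modal) state lists, and pooler_output holds the pooled features of size 2 * hidden_size that the downstream ITM and contrastive heads consume.

outputs = model(**inputs, output_hidden_states=True)
text_states, image_states, cross_modal_states = outputs.hidden_states
print(len(text_states), len(image_states), len(cross_modal_states))
pooled = outputs.pooler_output          # concatenated pooled text and image features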
Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
class BridgeTowerModel(BridgeTowerPreTrainedModel):

    """
    BridgeTowerModel
    Represents a BridgeTower model, which processes multimodal inputs by combining text and
    image information with cross-modal transformers.

    This class inherits from BridgeTowerPreTrainedModel and implements methods for initializing the model,
    running the forward pass, and extracting classification features.

    The BridgeTowerModel class includes methods for getting and setting input embeddings, as well as running
    the forward pass over multimodal inputs. It also provides a method for obtaining
    classification features from the processed multimodal inputs.

    Attributes:
        config: The configuration for the BridgeTowerModel.

    Methods:
        __init__: Initializes the BridgeTowerModel with the provided configuration.
        get_input_embeddings: Retrieves the input embeddings from the text model.
        set_input_embeddings: Sets the input embeddings for the text model.
        forward: Runs the forward pass over multimodal inputs and returns the model output.
        get_cls_features: Retrieves the classification features from the processed multimodal inputs.

    Example:
        ```python
        >>> from transformers import BridgeTowerProcessor, BridgeTowerModel
        >>> from PIL import Image
        >>> import requests
        ...
        >>> # prepare image and text
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        >>> text = "hello world"
        >>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base")
        >>> model = BridgeTowerModel.from_pretrained("BridgeTower/bridgetower-base")
        ...
        >>> inputs = processor(image, text, return_tensors="pt")
        >>> outputs = model(**inputs)
        >>> outputs.keys()
        odict_keys(['text_features', 'image_features', 'pooler_output'])
        ```
    """
    def __init__(self, config):
        """
        Initializes a BridgeTowerModel instance.

        Args:
            self (object): The instance of the BridgeTowerModel class.
            config (object):
                An object containing configuration settings for the model.

                - Purpose: Specifies the configuration parameters for the BridgeTowerModel.
                - Restrictions: Must be a valid configuration object.

        Returns:
            None.

        Raises:
            None
        """
        super().__init__(config)
        self.config = config
        vision_config = config.vision_config
        text_config = config.text_config

        if config.share_cross_modal_transformer_layers:
            self.cross_modal_text_transform = nn.Linear(text_config.hidden_size, config.hidden_size)
            self.cross_modal_image_transform = nn.Linear(vision_config.hidden_size, config.hidden_size)
        else:
            self.cross_modal_text_transform = nn.ModuleList(
                [nn.Linear(text_config.hidden_size, config.hidden_size) for _ in range(config.num_hidden_layers)]
            )
            self.cross_modal_image_transform = nn.ModuleList(
                [nn.Linear(vision_config.hidden_size, config.hidden_size) for _ in range(config.num_hidden_layers)]
            )

        self.token_type_embeddings = nn.Embedding(2, config.hidden_size)

        self.vision_model = BridgeTowerVisionModel(vision_config)

        self.text_model = BridgeTowerTextModel(text_config)

        if not vision_config.share_layernorm and config.init_layernorm_from_vision_encoder:
            for ln in self.vision_model.visual.cross_modal_ln_separate:
                ln.weight.data = self.vision_model.visual.ln_post.weight.data
                ln.bias.data = self.vision_model.visual.ln_post.bias.data

        self.cross_modal_image_layers = nn.ModuleList(
            [BridgeTowerBertCrossLayer(text_config) for _ in range(config.num_hidden_layers)]
        )
        self.cross_modal_text_layers = nn.ModuleList(
            [BridgeTowerBertCrossLayer(text_config) for _ in range(config.num_hidden_layers)]
        )

        # Class token => Linear => Tanh
        self.cross_modal_image_pooler = BridgeTowerPooler(config)
        self.cross_modal_text_pooler = BridgeTowerPooler(config)

        # Initialize BridgeTower Components
        self.cross_modal_text_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.cross_modal_image_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

        if config.share_link_tower_layers:
            self.cross_modal_text_link_tower = BridgeTowerLinkTower(config)
            self.cross_modal_image_link_tower = BridgeTowerLinkTower(config)
        else:
            self.cross_modal_text_link_tower = nn.ModuleList(
                [BridgeTowerLinkTower(config) for _ in range(config.num_hidden_layers - 1)]
            )
            self.cross_modal_image_link_tower = nn.ModuleList(
                [BridgeTowerLinkTower(config) for _ in range(config.num_hidden_layers - 1)]
            )

        self.post_init()

    def get_input_embeddings(self):
        """
        Retrieves the input embeddings from the BridgeTowerModel's text model.

        Args:
            self: An instance of the BridgeTowerModel class.

        Returns:
            None.

        Raises:
            None.

        This method retrieves the input embeddings from the underlying text model of the BridgeTowerModel.
        The input embeddings are representations of the input text that are used for further processing or analysis.
        By calling this method, you can access the input embeddings that have been generated by the text model.

        Note that the text model must be initialized and trained before calling this method.
        If the text model has not been initialized or trained, this method may not return the expected embeddings or may
        raise an exception.
        """
        return self.text_model.get_input_embeddings()

    def set_input_embeddings(self, value):
        """
        Sets the input embeddings for the BridgeTowerModel.

        Args:
            self (BridgeTowerModel): The instance of the BridgeTowerModel class.
            value: The input embeddings to be set for the BridgeTowerModel. It should be of type Tensor or None.

        Returns:
            None.

        Raises:
            None.
        """
        self.text_model.set_input_embeddings(value)

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        pixel_values: Optional[mindspore.Tensor] = None,
        pixel_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        image_embeds: Optional[mindspore.Tensor] = None,
        image_token_type_idx: Optional[int] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        labels: Optional[mindspore.Tensor] = None,
    ) -> Union[Tuple[mindspore.Tensor], BridgeTowerModelOutput]:
        r"""
        Args:
            output_hidden_states (`bool`, *optional*):
                If set to `True`, hidden states are returned as a list containing the hidden states of text, image, and
                cross-modal components respectively. i.e. `(hidden_states_text, hidden_states_image,
                hidden_states_cross_modal)` where each element is a list of the hidden states of the corresponding
                modality. `hidden_states_text/image` are lists of tensors corresponding to unimodal hidden states and
                `hidden_states_cross_modal` is a list of tuples containing `cross_modal_text_hidden_states` and
                `cross_modal_image_hidden_states` of each bridge layer.
            labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
                Labels are currently not supported.

        Returns:
            Union[Tuple[mindspore.Tensor], BridgeTowerModelOutput]:

        Example:
            ```python
            >>> from transformers import BridgeTowerProcessor, BridgeTowerModel
            >>> from PIL import Image
            >>> import requests
            ...
            >>> # prepare image and text
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            >>> text = "hello world"
            >>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base")
            >>> model = BridgeTowerModel.from_pretrained("BridgeTower/bridgetower-base")
            ...
            >>> inputs = processor(image, text, return_tensors="pt")
            >>> outputs = model(**inputs)
            >>> outputs.keys()
            odict_keys(['text_features', 'image_features', 'pooler_output'])
            ```
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        all_hidden_states_text = () if output_hidden_states else None
        all_hidden_states_image = () if output_hidden_states else None
        all_hidden_states_cross = () if output_hidden_states else None
        all_hidden_states = () if output_hidden_states else None
        all_self_attentions = () if output_attentions else None

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        image_token_type_idx = image_token_type_idx if image_token_type_idx else 1
        input_shape = input_ids.shape
        text_embeds = self.text_model.embeddings(input_ids=input_ids)

        if output_hidden_states:
            all_hidden_states_text += (text_embeds,)

        if attention_mask is None:
            attention_mask = ops.ones(*input_shape, dtype=mindspore.int64)
        extend_text_masks = self.text_model.get_extended_attention_mask(attention_mask, input_shape)

        # The split_index determines how many layers of the uni-modal encoder are applied before the cross-modal encoder
        split_index = len(self.text_model.encoder.layer) - self.config.num_hidden_layers + 1

        # Run the first 'split_index' layers of the textual encoder
        for layer in self.text_model.encoder.layer[:split_index]:
            text_embeds = layer(text_embeds, extend_text_masks)[0]

            if output_hidden_states:
                all_hidden_states_text += (text_embeds,)

        if image_embeds is None:
            image_embeds = self.vision_model.visual.forward_pre(pixel_values.type(self.vision_model.dtype))
        else:
            # Permute as BridgeTowerResidualAttention has batch_first=True
            image_embeds = image_embeds.permute(1, 0, 2)

        if output_hidden_states:
            all_hidden_states_image += (image_embeds,)

        # Run the first 'split_index' layers of the visual encoder
        for block in self.vision_model.visual.transformer.resblocks[:split_index]:
            image_embeds = block(image_embeds)
            if output_hidden_states:
                all_hidden_states_image += (image_embeds,)

        image_embeds_with_ln = self.vision_model.visual.forward_post(image_embeds.type(self.vision_model.dtype))

        # first layer is a special case because we don't have the output from the cross-encoder yet
        cross_modal_text = self.cross_modal_text_transform(text_embeds)

        text_token_type_embeddings = self.token_type_embeddings(
            ops.zeros(1, dtype=mindspore.int64)
        ).expand_as(cross_modal_text)

        cross_modal_text = self.cross_modal_text_layernorm(cross_modal_text + text_token_type_embeddings)

        image_embeds_with_ln = self.cross_modal_image_transform(image_embeds_with_ln)
        image_token_type_embeddings = self.token_type_embeddings(
            ops.full((1,), image_token_type_idx, dtype=mindspore.int64)
        ).expand_as(image_embeds_with_ln)

        image_embeds_with_ln = image_embeds_with_ln + image_token_type_embeddings
        cross_modal_image = self.cross_modal_image_layernorm(image_embeds_with_ln)

        pixel_mask = ops.ones(
            cross_modal_image.shape[0], cross_modal_image.shape[1],
            dtype=mindspore.int64,
        )
        extend_image_masks = self.text_model.get_extended_attention_mask(pixel_mask, pixel_mask.shape)

        layer_outputs_text = self.cross_modal_text_layers[0](
            cross_modal_text,
            cross_modal_image,
            attention_mask=extend_text_masks,
            encoder_attention_mask=extend_image_masks,
            output_attentions=output_attentions,
        )
        cross_text_features = layer_outputs_text[0]

        layer_outputs_image = self.cross_modal_image_layers[0](
            cross_modal_image,
            cross_modal_text,
            attention_mask=extend_image_masks,
            encoder_attention_mask=extend_text_masks,
            output_attentions=output_attentions,
        )
        cross_image_features = layer_outputs_image[0]

        if output_hidden_states:
            all_hidden_states_cross += ((cross_text_features, cross_image_features),)

        if output_attentions:
            all_self_attentions += ((layer_outputs_text[1], layer_outputs_image[1]),)

        link_layer_index = 0

        #  Each of the top 6 layers of the visual and textual encoders ([split_index:]) is connected to each layer of
        #  the cross-modal encoder via bridge layers, which brings bottom-up alignment and fusion to the cross-modal encoder.
        for i in range(split_index, len(self.text_model.encoder.layer)):
            text_embeds = self.text_model.encoder.layer[i](text_embeds, extend_text_masks)[0]
            image_embeds = self.vision_model.visual.transformer.resblocks[i](image_embeds).type(
                self.vision_model.dtype
            )
            image_embeds_with_ln = (
                self.cross_modal_image_transform(self.vision_model.visual.forward_post(image_embeds))
                + image_token_type_embeddings
            )

            text_link_tower = self.cross_modal_text_link_tower[link_layer_index]
            image_link_tower = self.cross_modal_image_link_tower[link_layer_index]

            # Bridge layers for textual and visual encoders
            cross_text_features_ = text_link_tower(
                self.cross_modal_text_transform(text_embeds) + text_token_type_embeddings,
                cross_text_features,
                extend_text_masks,
            )
            cross_image_features_ = image_link_tower(image_embeds_with_ln, cross_image_features, extend_image_masks)

            # Cross-modal encoder via bridge layers of textual and visual encoders
            layer_outputs_text = self.cross_modal_text_layers[link_layer_index + 1](
                cross_text_features_,
                cross_image_features_,
                attention_mask=extend_text_masks,
                encoder_attention_mask=extend_image_masks,
                output_attentions=output_attentions,
            )
            cross_text_features = layer_outputs_text[0]

            layer_outputs_image = self.cross_modal_image_layers[link_layer_index + 1](
                cross_image_features_,
                cross_text_features_,
                attention_mask=extend_image_masks,
                encoder_attention_mask=extend_text_masks,
                output_attentions=output_attentions,
            )
            cross_image_features = layer_outputs_image[0]

            link_layer_index += 1

            if output_hidden_states:
                all_hidden_states_text += (text_embeds,)
                all_hidden_states_image += (image_embeds,)
                all_hidden_states_cross += ((cross_text_features, cross_image_features),)

            if output_attentions:
                all_self_attentions += ((layer_outputs_text[1], layer_outputs_image[1]),)

        #  Concatenate the cls token of the text and image features to get the final representation
        text_features, image_features = cross_text_features, cross_image_features
        cls_features = self.get_cls_features(text_features, image_features)

        if output_hidden_states:
            all_hidden_states = (all_hidden_states_text, all_hidden_states_image, all_hidden_states_cross)

        if not return_dict:
            return tuple(
                v
                for v in [text_features, image_features, cls_features, all_hidden_states, all_self_attentions]
                if v is not None
            )

        return BridgeTowerModelOutput(
            text_features=text_features,
            image_features=image_features,
            pooler_output=cls_features,
            hidden_states=all_hidden_states,
            attentions=all_self_attentions,
        )

    def get_cls_features(self, text_features, image_features):
        """
        This method 'get_cls_features' is defined in the class 'BridgeTowerModel'
        and is used to obtain the class features by pooling text and image features.

        Args:
            self (object): The instance of the BridgeTowerModel class.
            text_features (array): The input text features to be pooled for obtaining class features.
            image_features (array): The input image features to be pooled for obtaining class features.

        Returns:
            mindspore.Tensor:
                The pooled text and image cls features, concatenated along the last dimension.

        Raises:
            None.
        """
        cls_features_text = self.cross_modal_text_pooler(text_features)
        cls_features_image = self.cross_modal_image_pooler(image_features)
        return ops.cat([cls_features_text, cls_features_image], dim=-1)

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerModel.__init__(config)

Initializes a BridgeTowerModel instance.

PARAMETER DESCRIPTION
self

The instance of the BridgeTowerModel class.

TYPE: object

config

An object containing configuration settings for the model.

  • Purpose: Specifies the configuration parameters for the BridgeTowerModel.
  • Restrictions: Must be a valid configuration object.

TYPE: object

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def __init__(self, config):
    """
    Initializes a BridgeTowerModel instance.

    Args:
        self (object): The instance of the BridgeTowerModel class.
        config (object):
            An object containing configuration settings for the model.

            - Purpose: Specifies the configuration parameters for the BridgeTowerModel.
            - Restrictions: Must be a valid configuration object.

    Returns:
        None.

    Raises:
        None
    """
    super().__init__(config)
    self.config = config
    vision_config = config.vision_config
    text_config = config.text_config

    if config.share_cross_modal_transformer_layers:
        self.cross_modal_text_transform = nn.Linear(text_config.hidden_size, config.hidden_size)
        self.cross_modal_image_transform = nn.Linear(vision_config.hidden_size, config.hidden_size)
    else:
        self.cross_modal_text_transform = nn.ModuleList(
            [nn.Linear(text_config.hidden_size, config.hidden_size) for _ in range(config.num_hidden_layers)]
        )
        self.cross_modal_image_transform = nn.ModuleList(
            [nn.Linear(vision_config.hidden_size, config.hidden_size) for _ in range(config.num_hidden_layers)]
        )

    self.token_type_embeddings = nn.Embedding(2, config.hidden_size)

    self.vision_model = BridgeTowerVisionModel(vision_config)

    self.text_model = BridgeTowerTextModel(text_config)

    if not vision_config.share_layernorm and config.init_layernorm_from_vision_encoder:
        for ln in self.vision_model.visual.cross_modal_ln_separate:
            ln.weight.data = self.vision_model.visual.ln_post.weight.data
            ln.bias.data = self.vision_model.visual.ln_post.bias.data

    self.cross_modal_image_layers = nn.ModuleList(
        [BridgeTowerBertCrossLayer(text_config) for _ in range(config.num_hidden_layers)]
    )
    self.cross_modal_text_layers = nn.ModuleList(
        [BridgeTowerBertCrossLayer(text_config) for _ in range(config.num_hidden_layers)]
    )

    # Class token => Linear => Tanh
    self.cross_modal_image_pooler = BridgeTowerPooler(config)
    self.cross_modal_text_pooler = BridgeTowerPooler(config)

    # Initialize BridgeTower Components
    self.cross_modal_text_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
    self.cross_modal_image_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

    if config.share_link_tower_layers:
        self.cross_modal_text_link_tower = BridgeTowerLinkTower(config)
        self.cross_modal_image_link_tower = BridgeTowerLinkTower(config)
    else:
        self.cross_modal_text_link_tower = nn.ModuleList(
            [BridgeTowerLinkTower(config) for _ in range(config.num_hidden_layers - 1)]
        )
        self.cross_modal_image_link_tower = nn.ModuleList(
            [BridgeTowerLinkTower(config) for _ in range(config.num_hidden_layers - 1)]
        )

    self.post_init()

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerModel.forward(input_ids=None, attention_mask=None, token_type_ids=None, pixel_values=None, pixel_mask=None, head_mask=None, inputs_embeds=None, image_embeds=None, image_token_type_idx=None, output_attentions=None, output_hidden_states=None, return_dict=None, labels=None)

PARAMETER DESCRIPTION
output_hidden_states

If set to True, hidden states are returned as a list containing the hidden states of the text, image, and cross-modal components respectively, i.e. (hidden_states_text, hidden_states_image, hidden_states_cross_modal), where each element is a list of the hidden states of the corresponding modality. hidden_states_text/image are lists of tensors corresponding to the unimodal hidden states, and hidden_states_cross_modal is a list of tuples containing the cross_modal_text_hidden_states and cross_modal_image_hidden_states of each bridge layer (see the sketch after the example below).

TYPE: `bool`, *optional* DEFAULT: None

labels

Labels are currently not supported.

TYPE: `mindspore.Tensor` of shape `(batch_size,)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple[Tensor], BridgeTowerModelOutput]

A BridgeTowerModelOutput (or a plain tuple when return_dict=False) containing text_features, image_features, pooler_output and, when requested, hidden_states and attentions.

Example
>>> from transformers import BridgeTowerProcessor, BridgeTowerModel
>>> from PIL import Image
>>> import requests
...
>>> # prepare image and text
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> text = "hello world"
>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base")
>>> model = BridgeTowerModel.from_pretrained("BridgeTower/bridgetower-base")
...
>>> inputs = processor(image, text, return_tensors="pt")
>>> outputs = model(**inputs)
>>> outputs.keys()
odict_keys(['text_features', 'image_features', 'pooler_output'])
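To complement the example, the sketch below unpacks the nested hidden-state structure described for output_hidden_states above. It is a minimal sketch that reuses the model and inputs from the example and assumes the default bridgetower-base configuration.

```python
>>> outputs = model(**inputs, output_hidden_states=True)
>>> # hidden_states is a triple: (text tower states, image tower states, cross-modal states)
>>> hidden_text, hidden_image, hidden_cross = outputs.hidden_states
>>> # one (text, image) tuple per cross-modal layer, i.e. config.num_hidden_layers entries
>>> cross_text_0, cross_image_0 = hidden_cross[0]
>>> assert cross_text_0.shape[-1] == model.config.hidden_size
```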
Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    token_type_ids: Optional[mindspore.Tensor] = None,
    pixel_values: Optional[mindspore.Tensor] = None,
    pixel_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    image_embeds: Optional[mindspore.Tensor] = None,
    image_token_type_idx: Optional[int] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    labels: Optional[mindspore.Tensor] = None,
) -> Union[Tuple[mindspore.Tensor], BridgeTowerModelOutput]:
    r"""
    Args:
        output_hidden_states (`bool`, *optional*):
            If set to `True`, hidden states are returned as a list containing the hidden states of text, image, and
            cross-modal components respectively. i.e. `(hidden_states_text, hidden_states_image,
            hidden_states_cross_modal)` where each element is a list of the hidden states of the corresponding
            modality. `hidden_states_text/image` are lists of tensors corresponding to unimodal hidden states and
            `hidden_states_cross_modal` is a list of tuples containing `cross_modal_text_hidden_states` and
            `cross_modal_image_hidden_states` of each bridge layer.
        labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels are currently not supported.

    Returns:
        Union[Tuple[mindspore.Tensor], BridgeTowerModelOutput]:

    Example:
        ```python
        >>> from transformers import BridgeTowerProcessor, BridgeTowerModel
        >>> from PIL import Image
        >>> import requests
        ...
        >>> # prepare image and text
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        >>> text = "hello world"
        >>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base")
        >>> model = BridgeTowerModel.from_pretrained("BridgeTower/bridgetower-base")
        ...
        >>> inputs = processor(image, text, return_tensors="pt")
        >>> outputs = model(**inputs)
        >>> outputs.keys()
        odict_keys(['text_features', 'image_features', 'pooler_output'])
        ```
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    all_hidden_states_text = () if output_hidden_states else None
    all_hidden_states_image = () if output_hidden_states else None
    all_hidden_states_cross = () if output_hidden_states else None
    all_hidden_states = () if output_hidden_states else None
    all_self_attentions = () if output_attentions else None

    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    image_token_type_idx = image_token_type_idx if image_token_type_idx else 1
    input_shape = input_ids.shape
    text_embeds = self.text_model.embeddings(input_ids=input_ids)

    if output_hidden_states:
        all_hidden_states_text += (text_embeds,)

    if attention_mask is None:
        attention_mask = ops.ones(*input_shape, dtype=mindspore.int64)
    extend_text_masks = self.text_model.get_extended_attention_mask(attention_mask, input_shape)

    # The split_index determines how many layers of the uni-modal encoder are applied before the cross-modal encoder
    split_index = len(self.text_model.encoder.layer) - self.config.num_hidden_layers + 1

    # Run the first 'split_index' layers of the textual encoder
    for layer in self.text_model.encoder.layer[:split_index]:
        text_embeds = layer(text_embeds, extend_text_masks)[0]

        if output_hidden_states:
            all_hidden_states_text += (text_embeds,)

    if image_embeds is None:
        image_embeds = self.vision_model.visual.forward_pre(pixel_values.type(self.vision_model.dtype))
    else:
        # Permute as BridgeTowerResidualAttention has batch_first=True
        image_embeds = image_embeds.permute(1, 0, 2)

    if output_hidden_states:
        all_hidden_states_image += (image_embeds,)

    # Run the first 'split_index' layers of the visual encoder
    for block in self.vision_model.visual.transformer.resblocks[:split_index]:
        image_embeds = block(image_embeds)
        if output_hidden_states:
            all_hidden_states_image += (image_embeds,)

    image_embeds_with_ln = self.vision_model.visual.forward_post(image_embeds.type(self.vision_model.dtype))

    # first layer is a special case because we don't have the output from the cross-encoder yet
    cross_modal_text = self.cross_modal_text_transform(text_embeds)

    text_token_type_embeddings = self.token_type_embeddings(
        ops.zeros(1, dtype=mindspore.int64)
    ).expand_as(cross_modal_text)

    cross_modal_text = self.cross_modal_text_layernorm(cross_modal_text + text_token_type_embeddings)

    image_embeds_with_ln = self.cross_modal_image_transform(image_embeds_with_ln)
    image_token_type_embeddings = self.token_type_embeddings(
        ops.full((1,), image_token_type_idx, dtype=mindspore.int64)
    ).expand_as(image_embeds_with_ln)

    image_embeds_with_ln = image_embeds_with_ln + image_token_type_embeddings
    cross_modal_image = self.cross_modal_image_layernorm(image_embeds_with_ln)

    pixel_mask = ops.ones(
        cross_modal_image.shape[0], cross_modal_image.shape[1],
        dtype=mindspore.int64,
    )
    extend_image_masks = self.text_model.get_extended_attention_mask(pixel_mask, pixel_mask.shape)

    layer_outputs_text = self.cross_modal_text_layers[0](
        cross_modal_text,
        cross_modal_image,
        attention_mask=extend_text_masks,
        encoder_attention_mask=extend_image_masks,
        output_attentions=output_attentions,
    )
    cross_text_features = layer_outputs_text[0]

    layer_outputs_image = self.cross_modal_image_layers[0](
        cross_modal_image,
        cross_modal_text,
        attention_mask=extend_image_masks,
        encoder_attention_mask=extend_text_masks,
        output_attentions=output_attentions,
    )
    cross_image_features = layer_outputs_image[0]

    if output_hidden_states:
        all_hidden_states_cross += ((cross_text_features, cross_image_features),)

    if output_attentions:
        all_self_attentions += ((layer_outputs_text[1], layer_outputs_image[1]),)

    link_layer_index = 0

    #  Each of the top 6 layers of the visual and textual encoders ([split_index:]) is connected to each layer of
    #  the cross-modal encoder via bridge layers, which brings bottom-up alignment and fusion to the cross-modal encoder.
    for i in range(split_index, len(self.text_model.encoder.layer)):
        text_embeds = self.text_model.encoder.layer[i](text_embeds, extend_text_masks)[0]
        image_embeds = self.vision_model.visual.transformer.resblocks[i](image_embeds).type(
            self.vision_model.dtype
        )
        image_embeds_with_ln = (
            self.cross_modal_image_transform(self.vision_model.visual.forward_post(image_embeds))
            + image_token_type_embeddings
        )

        text_link_tower = self.cross_modal_text_link_tower[link_layer_index]
        image_link_tower = self.cross_modal_image_link_tower[link_layer_index]

        # Bridge layers for textual and visual encoders
        cross_text_features_ = text_link_tower(
            self.cross_modal_text_transform(text_embeds) + text_token_type_embeddings,
            cross_text_features,
            extend_text_masks,
        )
        cross_image_features_ = image_link_tower(image_embeds_with_ln, cross_image_features, extend_image_masks)

        # Cross-modal encoder via bridge layers of textual and visual encoders
        layer_outputs_text = self.cross_modal_text_layers[link_layer_index + 1](
            cross_text_features_,
            cross_image_features_,
            attention_mask=extend_text_masks,
            encoder_attention_mask=extend_image_masks,
            output_attentions=output_attentions,
        )
        cross_text_features = layer_outputs_text[0]

        layer_outputs_image = self.cross_modal_image_layers[link_layer_index + 1](
            cross_image_features_,
            cross_text_features_,
            attention_mask=extend_image_masks,
            encoder_attention_mask=extend_text_masks,
            output_attentions=output_attentions,
        )
        cross_image_features = layer_outputs_image[0]

        link_layer_index += 1

        if output_hidden_states:
            all_hidden_states_text += (text_embeds,)
            all_hidden_states_image += (image_embeds,)
            all_hidden_states_cross += ((cross_text_features, cross_image_features),)

        if output_attentions:
            all_self_attentions += ((layer_outputs_text[1], layer_outputs_image[1]),)

    #  Concatenate the cls token of the text and image features to get the final representation
    text_features, image_features = cross_text_features, cross_image_features
    cls_features = self.get_cls_features(text_features, image_features)

    if output_hidden_states:
        all_hidden_states = (all_hidden_states_text, all_hidden_states_image, all_hidden_states_cross)

    if not return_dict:
        return tuple(
            v
            for v in [text_features, image_features, cls_features, all_hidden_states, all_self_attentions]
            if v is not None
        )

    return BridgeTowerModelOutput(
        text_features=text_features,
        image_features=image_features,
        pooler_output=cls_features,
        hidden_states=all_hidden_states,
        attentions=all_self_attentions,
    )

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerModel.get_cls_features(text_features, image_features)

This method 'get_cls_features' is defined in the class 'BridgeTowerModel' and is used to obtain the class features by pooling text and image features.

PARAMETER DESCRIPTION
self

The instance of the BridgeTowerModel class.

TYPE: object

text_features

The input text features to be pooled for obtaining class features.

TYPE: array

image_features

The input image features to be pooled for obtaining class features.

TYPE: array

RETURNS DESCRIPTION
Tensor

The pooled text and image cls features concatenated along the last dimension (see the sketch after the source code below).

Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def get_cls_features(self, text_features, image_features):
    """
    This method 'get_cls_features' is defined in the class 'BridgeTowerModel'
    and is used to obtain the class features by pooling text and image features.

    Args:
        self (object): The instance of the BridgeTowerModel class.
        text_features (array): The input text features to be pooled for obtaining class features.
        image_features (array): The input image features to be pooled for obtaining class features.

    Returns:
        mindspore.Tensor:
            The pooled text and image cls features, concatenated along the last dimension.

    Raises:
        None.
    """
    cls_features_text = self.cross_modal_text_pooler(text_features)
    cls_features_image = self.cross_modal_image_pooler(image_features)
    return ops.cat([cls_features_text, cls_features_image], dim=-1)
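Because the pooled text and image cls vectors are concatenated along the last axis, the pooler_output returned by forward is twice the cross-modal hidden size. An illustrative sketch (it reuses the model and inputs from the forward example above; the shape relation follows from the concatenation in this method):

```python
>>> outputs = model(**inputs)
>>> # pooler_output == cat([pooled_text_cls, pooled_image_cls], dim=-1)
>>> assert outputs.pooler_output.shape[-1] == 2 * model.config.hidden_size
```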

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerModel.get_input_embeddings()

Retrieves the input embeddings from the BridgeTowerModel's text model.

PARAMETER DESCRIPTION
self

An instance of the BridgeTowerModel class.

RETURNS DESCRIPTION

The input embedding module of the underlying text model.

This method retrieves the input embeddings from the underlying text model of the BridgeTowerModel. The input embeddings are representations of the input text that are used for further processing or analysis. By calling this method, you can access the input embeddings that have been generated by the text model.

Note that the text model must be initialized before calling this method; the returned module reflects whatever weights the text model currently holds (randomly initialized or pretrained).

Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def get_input_embeddings(self):
    """
    Retrieves the input embeddings from the BridgeTowerModel's text model.

    Args:
        self: An instance of the BridgeTowerModel class.

    Returns:
        The input embedding module of the underlying text model.

    Raises:
        None.

    This method retrieves the input embeddings from the underlying text model of the BridgeTowerModel.
    The input embeddings are representations of the input text that are used for further processing or analysis.
    By calling this method, you can access the input embeddings that have been generated by the text model.

    Note that the text model must be initialized before calling this method; the returned module reflects
    whatever weights the text model currently holds (randomly initialized or pretrained).
    """
    return self.text_model.get_input_embeddings()

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerModel.set_input_embeddings(value)

Sets the input embeddings for the BridgeTowerModel.

PARAMETER DESCRIPTION
self

The instance of the BridgeTowerModel class.

TYPE: BridgeTowerModel

value

The input embeddings to be set for the BridgeTowerModel. It should be of type Tensor or None.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
def set_input_embeddings(self, value):
    """
    Sets the input embeddings for the BridgeTowerModel.

    Args:
        self (BridgeTowerModel): The instance of the BridgeTowerModel class.
        value: The input embeddings to be set for the BridgeTowerModel. It should be of type Tensor or None.

    Returns:
        None.

    Raises:
        None.
    """
    self.text_model.set_input_embeddings(value)

mindnlp.transformers.models.bridgetower.modeling_bridgetower.BridgeTowerPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in mindnlp/transformers/models/bridgetower/modeling_bridgetower.py
class BridgeTowerPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """
    config_class = BridgeTowerConfig
    base_model_prefix = "bridgetower"
    supports_gradient_checkpointing = False
    _no_split_modules = ["BridgeTowerSelfAttention", "BridgeTowerResidualAttention"]
    _skip_keys_device_placement = "past_key_values"

    def _init_weights(self, cell):
        """
        Initializes the weights of the model's cells.

        Args:
            self: An instance of the BridgeTowerPreTrainedModel class.
            cell: The cell whose weights need to be initialized.

        Returns:
            None

        Raises:
            None
        """
        if isinstance(cell, BridgeTowerVisionModel):
            proj_std = (cell.visual.transformer.hidden_size**-0.5) * (
                (2 * cell.visual.transformer.num_hidden_layers) ** -0.5
            )
            attn_std = cell.visual.transformer.hidden_size**-0.5
            fc_std = (2 * cell.visual.transformer.hidden_size) ** -0.5
            for block in cell.visual.transformer.resblocks:
                ops.initialize(block.attn.in_proj_weight, Normal(attn_std * self.config.initializer_factor))
                ops.initialize(block.attn.out_proj.weight, Normal(proj_std * self.config.initializer_factor))
                ops.initialize(block.mlp.c_fc.weight, Normal(fc_std * self.config.initializer_factor))
                ops.initialize(block.mlp.c_proj.weight, Normal(proj_std * self.config.initializer_factor))

            ops.initialize(cell.visual.embeddings.class_embedding, Normal(attn_std * self.config.initializer_factor))
            ops.initialize(cell.visual.embeddings.position_embedding.weight, Normal(attn_std * self.config.initializer_factor))
        elif isinstance(cell, (nn.Linear, nn.Conv2d, nn.Embedding)):
            ops.initialize(cell.weight, Normal(0.05 * self.config.initializer_factor))
        elif isinstance(cell, nn.LayerNorm):
            ops.initialize(cell.bias, 'zeros')
            ops.initialize(cell.weight, 'ones')
        if isinstance(cell, nn.Linear) and cell.bias is not None:
            ops.initialize(cell.bias, 'zeros')
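For reference, the scaled standard deviations used above for the vision tower depend only on the vision transformer's width and depth and on initializer_factor. A rough illustration of the arithmetic; the hidden size of 768, the 12 layers, and initializer_factor = 1 are assumed example values, not guaranteed checkpoint values:

```python
hidden_size, num_layers, factor = 768, 12, 1.0  # assumed example values
attn_std = hidden_size ** -0.5                                 # ~0.036, used for attn.in_proj_weight
proj_std = (hidden_size ** -0.5) * ((2 * num_layers) ** -0.5)  # ~0.0074, used for out_proj and mlp.c_proj
fc_std = (2 * hidden_size) ** -0.5                             # ~0.026, used for mlp.c_fc
print(attn_std * factor, proj_std * factor, fc_std * factor)
```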

mindnlp.transformers.models.bridgetower.image_processing_bridgetower.BridgeTowerImageProcessor

Bases: BaseImageProcessor

Constructs a BridgeTower image processor.

PARAMETER DESCRIPTION
do_resize

Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by the do_resize parameter in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

size

Resize the shorter side of the input to size["shortest_edge"]. The longer side will be limited to under int((1333 / 800) * size["shortest_edge"]) while preserving the aspect ratio. Only has an effect if do_resize is set to True. Can be overridden by the size parameter in the preprocess method.

TYPE: `Dict[str, int]`, *optional*, defaults to `{'shortest_edge': 288}` DEFAULT: None

size_divisor

The size by which to make sure both the height and width can be divided. Only has an effect if do_resize is set to True. Can be overridden by the size_divisor parameter in the preprocess method.

TYPE: `int`, *optional*, defaults to 32 DEFAULT: 32

resample

Resampling filter to use if resizing the image. Only has an effect if do_resize is set to True. Can be overridden by the resample parameter in the preprocess method.

TYPE: `PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC` DEFAULT: BICUBIC

do_rescale

Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

rescale_factor

Scale factor to use if rescaling the image. Only has an effect if do_rescale is set to True. Can be overridden by the rescale_factor parameter in the preprocess method.

TYPE: `int` or `float`, *optional*, defaults to `1/255` DEFAULT: 1 / 255

do_normalize

Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

image_mean

Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.

TYPE: `float` or `List[float]`, *optional*, defaults to `OPENAI_CLIP_MEAN` DEFAULT: None

image_std

Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.

TYPE: `float` or `List[float]`, *optional*, defaults to `OPENAI_CLIP_STD` DEFAULT: None

do_center_crop

Whether to center crop the image. Can be overridden by the do_center_crop parameter in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

crop_size

Desired output size when applying center-cropping. Only has an effect if do_center_crop is set to True. Can be overridden by the crop_size parameter in the preprocess method. If unset, defaults to size.

TYPE: `Dict[str, int]`, *optional* DEFAULT: None

do_pad

Whether to pad the image to the (max_height, max_width) of the images in the batch. Can be overridden by the do_pad parameter in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True
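Example of a typical preprocessing call. This is a hedged sketch: the image URL is only illustrative, and return_tensors="np" is chosen because NumPy output is among the tensor types documented for pad and preprocess below.

```python
>>> from PIL import Image
>>> import requests
>>> from mindnlp.transformers.models.bridgetower.image_processing_bridgetower import BridgeTowerImageProcessor
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image_processor = BridgeTowerImageProcessor(size={"shortest_edge": 288})
>>> encoding = image_processor(images=image, return_tensors="np")
>>> # with do_pad=True (the default), a pixel_mask is returned alongside pixel_values
>>> sorted(encoding.keys())
```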

Source code in mindnlp/transformers/models/bridgetower/image_processing_bridgetower.py
class BridgeTowerImageProcessor(BaseImageProcessor):
    r"""
    Constructs a BridgeTower image processor.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by the
            `do_resize` parameter in the `preprocess` method.
        size (`Dict[str, int]` *optional*, defaults to `{'shortest_edge': 288}`):
            Resize the shorter side of the input to `size["shortest_edge"]`. The longer side will be limited to under
            `int((1333 / 800) * size["shortest_edge"])` while preserving the aspect ratio. Only has an effect if
            `do_resize` is set to `True`. Can be overridden by the `size` parameter in the `preprocess` method.
        size_divisor (`int`, *optional*, defaults to 32):
            The size by which to make sure both the height and width can be divided. Only has an effect if `do_resize`
            is set to `True`. Can be overridden by the `size_divisor` parameter in the `preprocess` method.
        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
            Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`. Can be
            overridden by the `resample` parameter in the `preprocess` method.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale`
            parameter in the `preprocess` method.
        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
            Scale factor to use if rescaling the image. Only has an effect if `do_rescale` is set to `True`. Can be
            overridden by the `rescale_factor` parameter in the `preprocess` method.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
            method.
        image_mean (`float` or `List[float]`, *optional*, defaults to `OPENAI_CLIP_MEAN`):
            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
        image_std (`float` or `List[float]`, *optional*, defaults to `OPENAI_CLIP_STD`):
            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
        do_center_crop (`bool`, *optional*, defaults to `True`):
            Whether to center crop the image. Can be overridden by the `do_center_crop` parameter in the `preprocess`
            method.
        crop_size (`Dict[str, int]`, *optional*):
            Desired output size when applying center-cropping. Only has an effect if `do_center_crop` is set to `True`.
            Can be overridden by the `crop_size` parameter in the `preprocess` method. If unset, defaults to `size`.
        do_pad (`bool`, *optional*, defaults to `True`):
            Whether to pad the image to the `(max_height, max_width)` of the images in the batch. Can be overridden by
            the `do_pad` parameter in the `preprocess` method.
    """
    model_input_names = ["pixel_values"]

    def __init__(
        self,
        do_resize: bool = True,
        size: Dict[str, int] = None,
        size_divisor: int = 32,
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        do_rescale: bool = True,
        rescale_factor: Union[int, float] = 1 / 255,
        do_normalize: bool = True,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_center_crop: bool = True,
        crop_size: Dict[str, int] = None,
        do_pad: bool = True,
        **kwargs,
    ) -> None:
        """
        Initializes an instance of the BridgeTowerImageProcessor class.

        Args:
            self: The instance of the class itself.
            do_resize (bool, optional): Indicates whether to resize the image. Defaults to True.
            size (Dict[str, int], optional): The desired size of the image. Defaults to {'shortest_edge': 288}.
            size_divisor (int, optional): The divisor to be used during resizing. Defaults to 32.
            resample (PILImageResampling, optional): The resampling method to be used during resizing. Defaults to PILImageResampling.BICUBIC.
            do_rescale (bool, optional): Indicates whether to rescale the image. Defaults to True.
            rescale_factor (Union[int, float], optional): The factor to be used during rescaling. Defaults to 1 / 255.
            do_normalize (bool, optional): Indicates whether to normalize the image. Defaults to True.
            image_mean (Optional[Union[float, List[float]]], optional): The mean values for image normalization. Defaults to None.
            image_std (Optional[Union[float, List[float]]], optional): The standard deviation values for image normalization. Defaults to None.
            do_center_crop (bool, optional): Indicates whether to perform center cropping. Defaults to True.
            crop_size (Dict[str, int], optional): The desired size for center cropping. Defaults to None.
            do_pad (bool, optional): Indicates whether to pad the image. Defaults to True.
            **kwargs: Additional keyword arguments.

        Returns:
            None.

        Raises:
            None.
        """
        if "pad_and_return_pixel_mask" in kwargs:
            do_pad = kwargs.pop("pad_and_return_pixel_mask")

        super().__init__(**kwargs)
        size = size if size is not None else {"shortest_edge": 288}
        size = get_size_dict(size, default_to_square=False)

        self.do_resize = do_resize
        self.size = size
        self.size_divisor = size_divisor
        self.resample = resample
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
        self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
        self.do_pad = do_pad
        self.do_center_crop = do_center_crop
        self.crop_size = crop_size
        self._valid_processor_keys = [
            "images",
            "do_resize",
            "size",
            "size_divisor",
            "resample",
            "do_rescale",
            "rescale_factor",
            "do_normalize",
            "image_mean",
            "image_std",
            "do_pad",
            "do_center_crop",
            "crop_size",
            "return_tensors",
            "data_format",
            "input_data_format",
        ]

    # Copied from transformers.models.vilt.image_processing_vilt.ViltImageProcessor.resize
    def resize(
        self,
        image: np.ndarray,
        size: Dict[str, int],
        size_divisor: int = 32,
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> np.ndarray:
        """
        Resize an image.

        Resizes the shorter side of the image to `size["shortest_edge"]` while preserving the aspect ratio. If the
        longer side is larger than the max size `int(size["shortest_edge"] * 1333 / 800)`, the longer side is then
        resized to the max size while preserving the aspect ratio.

        Args:
            image (`np.ndarray`):
                Image to resize.
            size (`Dict[str, int]`):
                Controls the size of the output image. Should be of the form `{"shortest_edge": int}`.
            size_divisor (`int`, defaults to 32):
                The image is resized to a size that is a multiple of this value.
            resample (`PILImageResampling` filter, *optional*, defaults to `PILImageResampling.BICUBIC`):
                Resampling filter to use when resizing the image.
            data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format of the image. If not provided, it will be the same as the input image.
            input_data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format of the input image. If not provided, it will be inferred.
        """
        size = get_size_dict(size, default_to_square=False)
        if "shortest_edge" not in size:
            raise ValueError(f"The `size` dictionary must contain the key `shortest_edge`. Got {size.keys()}")
        shorter = size["shortest_edge"]
        longer = int(1333 / 800 * shorter)
        output_size = get_resize_output_image_size(
            image, shorter=shorter, longer=longer, size_divisor=size_divisor, input_data_format=input_data_format
        )
        return resize(
            image,
            size=output_size,
            resample=resample,
            data_format=data_format,
            input_data_format=input_data_format,
            **kwargs,
        )
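    # Note (illustrative): with the default size {"shortest_edge": 288}, the longer side is capped at
    # int(1333 / 800 * 288) == 479, and get_resize_output_image_size then snaps both sides to multiples
    # of `size_divisor` (32 by default), so e.g. a 480x640 input comes out at roughly 288x384.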

    def center_crop(
        self,
        image: np.ndarray,
        size: Dict[str, int],
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> np.ndarray:
        """
        Center crop an image to `(size["height"], size["width"])`. If the input size is smaller than `crop_size` along
        any edge, the image is padded with 0's and then center cropped.

        Args:
            image (`np.ndarray`):
                Image to center crop.
            size (`Dict[str, int]`):
                Size of the output image in the form `{"shortest_edge": s}`; the image is center cropped to an `(s, s)` square.
            data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format of the image. If not provided, it will be the same as the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format of the input image. If not provided, it will be inferred from the input
                image.
        """
        output_size = size["shortest_edge"]
        return center_crop(
            image,
            size=(output_size, output_size),
            data_format=data_format,
            input_data_format=input_data_format,
            **kwargs,
        )

    # Copied from transformers.models.vilt.image_processing_vilt.ViltImageProcessor._pad_image
    def _pad_image(
        self,
        image: np.ndarray,
        output_size: Tuple[int, int],
        constant_values: Union[float, Iterable[float]] = 0,
        data_format: Optional[ChannelDimension] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ) -> np.ndarray:
        """
        Pad an image with zeros to the given size.
        """
        input_height, input_width = get_image_size(image, channel_dim=input_data_format)
        output_height, output_width = output_size

        pad_bottom = output_height - input_height
        pad_right = output_width - input_width
        padding = ((0, pad_bottom), (0, pad_right))
        padded_image = pad(
            image,
            padding,
            mode=PaddingMode.CONSTANT,
            constant_values=constant_values,
            data_format=data_format,
            input_data_format=input_data_format,
        )
        return padded_image

    # Copied from transformers.models.vilt.image_processing_vilt.ViltImageProcessor.pad
    def pad(
        self,
        images: List[np.ndarray],
        constant_values: Union[float, Iterable[float]] = 0,
        return_pixel_mask: bool = True,
        return_tensors: Optional[Union[str, TensorType]] = None,
        data_format: Optional[ChannelDimension] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ) -> BatchFeature:
        """
        Pads a batch of images to the bottom and right of the image with zeros to the size of largest height and width
        in the batch and optionally returns their corresponding pixel mask.

        Args:
            images (`List[np.ndarray]`):
                Batch of images to pad.
            constant_values (`float` or `Iterable[float]`, *optional*):
                The value to use for the padding if `mode` is `"constant"`.
            return_pixel_mask (`bool`, *optional*, defaults to `True`):
                Whether to return a pixel mask.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:

                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
            data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format of the image. If not provided, it will be the same as the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format of the input image. If not provided, it will be inferred.
        """
        pad_size = get_max_height_width(images, input_data_format=input_data_format)

        padded_images = [
            self._pad_image(
                image,
                pad_size,
                constant_values=constant_values,
                data_format=data_format,
                input_data_format=input_data_format,
            )
            for image in images
        ]
        data = {"pixel_values": padded_images}

        if return_pixel_mask:
            masks = [
                make_pixel_mask(image=image, output_size=pad_size, input_data_format=input_data_format)
                for image in images
            ]
            data["pixel_mask"] = masks

        return BatchFeature(data=data, tensor_type=return_tensors)
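    # Note (illustrative): for a batch of two channels-first images with shapes (3, 256, 320) and
    # (3, 288, 288), the padded `pixel_values` have shape (2, 3, 288, 320) and, with
    # `return_pixel_mask=True`, `pixel_mask` has shape (2, 288, 320): 1 over real pixels, 0 over padding.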

    def preprocess(
        self,
        images: ImageInput,
        do_resize: Optional[bool] = None,
        size: Optional[Dict[str, int]] = None,
        size_divisor: Optional[int] = None,
        resample: PILImageResampling = None,
        do_rescale: Optional[bool] = None,
        rescale_factor: Optional[float] = None,
        do_normalize: Optional[bool] = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_pad: Optional[bool] = None,
        do_center_crop: Optional[bool] = None,
        crop_size: Dict[str, int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        data_format: ChannelDimension = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> PIL.Image.Image:
        """
        Preprocess an image or batch of images.

        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Controls the size of the image after `resize`. The shortest edge of the image is resized to
                `size["shortest_edge"]` whilst preserving the aspect ratio. If the longest edge of this resized image
                is > `int(size["shortest_edge"] * (1333 / 800))`, then the image is resized again to make the longest
                edge equal to `int(size["shortest_edge"] * (1333 / 800))`.
            size_divisor (`int`, *optional*, defaults to `self.size_divisor`):
                The image is resized to a size that is a multiple of this value.
            resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image values to the [0, 1] range.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean to normalize the image by if `do_normalize` is set to `True`.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation to normalize the image by if `do_normalize` is set to `True`.
            do_pad (`bool`, *optional*, defaults to `self.do_pad`):
                Whether to pad the image to the (max_height, max_width) in the batch. If `True`, a pixel mask is also
                created and returned.
            do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
                Whether to center crop the image. If the input size is smaller than `crop_size` along any edge, the
                image is padded with 0's and then center cropped.
            crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
                Size of the image after center crop. If one edge of the image is smaller than `crop_size`, it will
                be padded with zeros and then cropped.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:

                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - Unset: Use the channel dimension format of the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """
        do_resize = do_resize if do_resize is not None else self.do_resize
        size_divisor = size_divisor if size_divisor is not None else self.size_divisor
        resample = resample if resample is not None else self.resample
        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
        image_mean = image_mean if image_mean is not None else self.image_mean
        image_std = image_std if image_std is not None else self.image_std
        do_pad = do_pad if do_pad is not None else self.do_pad
        do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
        # For backwards compatibility. Initial version of this processor was cropping to the "size" argument, which
        # it should default to if crop_size is undefined.
        crop_size = (
            crop_size if crop_size is not None else (self.crop_size if self.crop_size is not None else self.size)
        )

        size = size if size is not None else self.size
        size = get_size_dict(size, default_to_square=False)

        validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)

        if not is_batched(images):
            images = [images]

        if not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )
        # Here, crop_size is used only if it is set, else size will be used.
        validate_preprocess_arguments(
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            do_pad=do_pad,
            size_divisibility=size_divisor,
            do_center_crop=do_center_crop,
            crop_size=crop_size,
            do_resize=do_resize,
            size=size,
            resample=resample,
        )
        # All transformations expect numpy arrays.
        images = [to_numpy_array(image) for image in images]

        if is_scaled_image(images[0]) and do_rescale:
            logger.warning_once(
                "It looks like you are trying to rescale already rescaled images. If the input"
                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
            )

        if do_resize:
            images = [
                self.resize(
                    image=image,
                    size=size,
                    size_divisor=size_divisor,
                    resample=resample,
                    input_data_format=input_data_format,
                )
                for image in images
            ]

        if do_center_crop:
            images = [
                self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images
            ]

        if do_rescale:
            images = [
                self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
                for image in images
            ]

        if do_normalize:
            images = [
                self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
                for image in images
            ]

        images = [
            to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
        ]

        if do_pad:
            encoded_outputs = self.pad(
                images, return_pixel_mask=True, return_tensors=return_tensors, input_data_format=data_format
            )
        else:
            encoded_outputs = BatchFeature(data={"pixel_values": images}, tensor_type=return_tensors)

        return encoded_outputs
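
The snippet below is a minimal usage sketch of the full pipeline shown above. It assumes BridgeTowerImageProcessor is importable from mindnlp.transformers and uses synthetic NumPy images; with the defaults, both the pixel values and the padding mask are returned.

>>> import numpy as np
>>> from mindnlp.transformers import BridgeTowerImageProcessor
...
>>> # Two images of different sizes, with pixel values in [0, 255] so the default rescaling applies.
>>> images = [np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8),
...           np.random.randint(0, 256, (320, 500, 3), dtype=np.uint8)]
...
>>> image_processor = BridgeTowerImageProcessor()
>>> outputs = image_processor.preprocess(images, return_tensors="np")
>>> sorted(outputs.keys())
['pixel_mask', 'pixel_values']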

mindnlp.transformers.models.bridgetower.image_processing_bridgetower.BridgeTowerImageProcessor.__init__(do_resize=True, size=None, size_divisor=32, resample=PILImageResampling.BICUBIC, do_rescale=True, rescale_factor=1 / 255, do_normalize=True, image_mean=None, image_std=None, do_center_crop=True, crop_size=None, do_pad=True, **kwargs)

Initializes an instance of the BridgeTowerImageProcessor class.

PARAMETER DESCRIPTION
self

The instance of the class itself.

do_resize

Indicates whether to resize the image. Defaults to True.

TYPE: bool DEFAULT: True

size

The desired size of the image. Defaults to {'shortest_edge': 288}.

TYPE: Dict[str, int] DEFAULT: None

size_divisor

The divisor to be used during resizing. Defaults to 32.

TYPE: int DEFAULT: 32

resample

The resampling method to be used during resizing. Defaults to PILImageResampling.BICUBIC.

TYPE: PILImageResampling DEFAULT: BICUBIC

do_rescale

Indicates whether to rescale the image. Defaults to True.

TYPE: bool DEFAULT: True

rescale_factor

The factor to be used during rescaling. Defaults to 1 / 255.

TYPE: Union[int, float] DEFAULT: 1 / 255

do_normalize

Indicates whether to normalize the image. Defaults to True.

TYPE: bool DEFAULT: True

image_mean

The mean values for image normalization. Defaults to None.

TYPE: Optional[Union[float, List[float]]] DEFAULT: None

image_std

The standard deviation values for image normalization. Defaults to None.

TYPE: Optional[Union[float, List[float]]] DEFAULT: None

do_center_crop

Indicates whether to perform center cropping. Defaults to True.

TYPE: bool DEFAULT: True

crop_size

The desired size for center cropping. Defaults to None.

TYPE: Dict[str, int] DEFAULT: None

do_pad

Indicates whether to pad the image. Defaults to True.

TYPE: bool DEFAULT: True

**kwargs

Additional keyword arguments.

DEFAULT: {}

RETURNS DESCRIPTION
None

None.

Source code in mindnlp/transformers/models/bridgetower/image_processing_bridgetower.py
def __init__(
    self,
    do_resize: bool = True,
    size: Dict[str, int] = None,
    size_divisor: int = 32,
    resample: PILImageResampling = PILImageResampling.BICUBIC,
    do_rescale: bool = True,
    rescale_factor: Union[int, float] = 1 / 255,
    do_normalize: bool = True,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    do_center_crop: bool = True,
    crop_size: Dict[str, int] = None,
    do_pad: bool = True,
    **kwargs,
) -> None:
    """
    Initializes an instance of the BridgeTowerImageProcessor class.

    Args:
        self: The instance of the class itself.
        do_resize (bool, optional): Indicates whether to resize the image. Defaults to True.
        size (Dict[str, int], optional): The desired size of the image. Defaults to {'shortest_edge': 288}.
        size_divisor (int, optional): The divisor to be used during resizing. Defaults to 32.
        resample (PILImageResampling, optional): The resampling method to be used during resizing. Defaults to PILImageResampling.BICUBIC.
        do_rescale (bool, optional): Indicates whether to rescale the image. Defaults to True.
        rescale_factor (Union[int, float], optional): The factor to be used during rescaling. Defaults to 1 / 255.
        do_normalize (bool, optional): Indicates whether to normalize the image. Defaults to True.
        image_mean (Optional[Union[float, List[float]]], optional): The mean values for image normalization. Defaults to None.
        image_std (Optional[Union[float, List[float]]], optional): The standard deviation values for image normalization. Defaults to None.
        do_center_crop (bool, optional): Indicates whether to perform center cropping. Defaults to True.
        crop_size (Dict[str, int], optional): The desired size for center cropping. Defaults to None.
        do_pad (bool, optional): Indicates whether to pad the image. Defaults to True.
        **kwargs: Additional keyword arguments.

    Returns:
        None.

    Raises:
        None.
    """
    if "pad_and_return_pixel_mask" in kwargs:
        do_pad = kwargs.pop("pad_and_return_pixel_mask")

    super().__init__(**kwargs)
    size = size if size is not None else {"shortest_edge": 288}
    size = get_size_dict(size, default_to_square=False)

    self.do_resize = do_resize
    self.size = size
    self.size_divisor = size_divisor
    self.resample = resample
    self.do_rescale = do_rescale
    self.rescale_factor = rescale_factor
    self.do_normalize = do_normalize
    self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
    self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
    self.do_pad = do_pad
    self.do_center_crop = do_center_crop
    self.crop_size = crop_size
    self._valid_processor_keys = [
        "images",
        "do_resize",
        "size",
        "size_divisor",
        "resample",
        "do_rescale",
        "rescale_factor",
        "do_normalize",
        "image_mean",
        "image_std",
        "do_pad",
        "do_center_crop",
        "crop_size",
        "return_tensors",
        "data_format",
        "input_data_format",
    ]
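
As an illustration (assuming the class is exported from mindnlp.transformers), the defaults above can be overridden at construction time, for example to use a larger shortest edge and to skip batch padding:

>>> from mindnlp.transformers import BridgeTowerImageProcessor
...
>>> image_processor = BridgeTowerImageProcessor(
...     size={"shortest_edge": 384},
...     crop_size={"shortest_edge": 384},
...     do_pad=False,
... )
>>> image_processor.size
{'shortest_edge': 384}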

mindnlp.transformers.models.bridgetower.image_processing_bridgetower.BridgeTowerImageProcessor.center_crop(image, size, data_format=None, input_data_format=None, **kwargs)

Center crop an image to a square of side size["shortest_edge"]. If the input size is smaller than the crop size along any edge, the image is padded with 0's and then center cropped.

PARAMETER DESCRIPTION
image

Image to center crop.

TYPE: `np.ndarray`

size

Size of the output image in the form {"shortest_edge": int}; the image is cropped to a square of that side.

TYPE: `Dict[str, int]`

data_format

The channel dimension format of the image. If not provided, it will be the same as the input image.

TYPE: `str` or `ChannelDimension`, *optional* DEFAULT: None

input_data_format

The channel dimension format of the input image. If not provided, it will be inferred from the input image.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/bridgetower/image_processing_bridgetower.py
def center_crop(
    self,
    image: np.ndarray,
    size: Dict[str, int],
    data_format: Optional[Union[str, ChannelDimension]] = None,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> np.ndarray:
    """
    Center crop an image to a square of side `size["shortest_edge"]`. If the input size is smaller than the crop
    size along any edge, the image is padded with 0's and then center cropped.

    Args:
        image (`np.ndarray`):
            Image to center crop.
        size (`Dict[str, int]`):
            Size of the output image in the form `{"shortest_edge": int}`; the image is cropped to a square of
            that side.
        data_format (`str` or `ChannelDimension`, *optional*):
            The channel dimension format of the image. If not provided, it will be the same as the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format of the input image. If not provided, it will be inferred from the input
            image.
    """
    output_size = size["shortest_edge"]
    return center_crop(
        image,
        size=(output_size, output_size),
        data_format=data_format,
        input_data_format=input_data_format,
        **kwargs,
    )
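
An illustrative call (note that, as implemented above, the size dictionary is read through its "shortest_edge" key and the crop is square); the import path is an assumption:

>>> import numpy as np
>>> from mindnlp.transformers import BridgeTowerImageProcessor
...
>>> image_processor = BridgeTowerImageProcessor()
>>> image = np.zeros((3, 400, 500), dtype=np.float32)  # channels-first input
>>> image_processor.center_crop(image, size={"shortest_edge": 288}).shape
(3, 288, 288)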

mindnlp.transformers.models.bridgetower.image_processing_bridgetower.BridgeTowerImageProcessor.pad(images, constant_values=0, return_pixel_mask=True, return_tensors=None, data_format=None, input_data_format=None)

Pads a batch of images to the bottom and right with zeros, up to the largest height and width in the batch, and optionally returns their corresponding pixel mask.

PARAMETER DESCRIPTION
images

The batch of images to pad.

TYPE: `List[np.ndarray]`

constant_values

The value to use for the padding if mode is "constant".

TYPE: `float` or `Iterable[float]`, *optional* DEFAULT: 0

return_pixel_mask

Whether to return a pixel mask.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

return_tensors

The type of tensors to return. Can be one of:

  • Unset: Return a list of np.ndarray.
  • TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
  • TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
  • TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
  • TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.

TYPE: `str` or `TensorType`, *optional* DEFAULT: None

data_format

The channel dimension format of the image. If not provided, it will be the same as the input image.

TYPE: `str` or `ChannelDimension`, *optional* DEFAULT: None

input_data_format

The channel dimension format of the input image. If not provided, it will be inferred.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/bridgetower/image_processing_bridgetower.py
def pad(
    self,
    images: List[np.ndarray],
    constant_values: Union[float, Iterable[float]] = 0,
    return_pixel_mask: bool = True,
    return_tensors: Optional[Union[str, TensorType]] = None,
    data_format: Optional[ChannelDimension] = None,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> BatchFeature:
    """
    Pads a batch of images to the bottom and right with zeros, up to the largest height and width in the batch,
    and optionally returns their corresponding pixel mask.

    Args:
        images (`List[np.ndarray]`):
            The batch of images to pad.
        constant_values (`float` or `Iterable[float]`, *optional*):
            The value to use for the padding if `mode` is `"constant"`.
        return_pixel_mask (`bool`, *optional*, defaults to `True`):
            Whether to return a pixel mask.
        return_tensors (`str` or `TensorType`, *optional*):
            The type of tensors to return. Can be one of:

            - Unset: Return a list of `np.ndarray`.
            - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
            - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
            - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
            - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
        data_format (`str` or `ChannelDimension`, *optional*):
            The channel dimension format of the image. If not provided, it will be the same as the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format of the input image. If not provided, it will be inferred.
    """
    pad_size = get_max_height_width(images, input_data_format=input_data_format)

    padded_images = [
        self._pad_image(
            image,
            pad_size,
            constant_values=constant_values,
            data_format=data_format,
            input_data_format=input_data_format,
        )
        for image in images
    ]
    data = {"pixel_values": padded_images}

    if return_pixel_mask:
        masks = [
            make_pixel_mask(image=image, output_size=pad_size, input_data_format=input_data_format)
            for image in images
        ]
        data["pixel_mask"] = masks

    return BatchFeature(data=data, tensor_type=return_tensors)
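
A small sketch of pad on two channels-first NumPy arrays of different spatial sizes (assuming the class is exported from mindnlp.transformers); the pixel mask marks real pixels with 1 and padded positions with 0:

>>> import numpy as np
>>> from mindnlp.transformers import BridgeTowerImageProcessor
...
>>> image_processor = BridgeTowerImageProcessor()
>>> images = [np.zeros((3, 288, 288), dtype=np.float32),
...           np.zeros((3, 256, 320), dtype=np.float32)]
>>> batch = image_processor.pad(images, return_pixel_mask=True, return_tensors="np")
>>> batch["pixel_values"].shape  # padded to the largest height and width in the batch
(2, 3, 288, 320)
>>> batch["pixel_mask"].shape
(2, 288, 320)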

mindnlp.transformers.models.bridgetower.image_processing_bridgetower.BridgeTowerImageProcessor.preprocess(images, do_resize=None, size=None, size_divisor=None, resample=None, do_rescale=None, rescale_factor=None, do_normalize=None, image_mean=None, image_std=None, do_pad=None, do_center_crop=None, crop_size=None, return_tensors=None, data_format=ChannelDimension.FIRST, input_data_format=None, **kwargs)

Preprocess an image or batch of images.

PARAMETER DESCRIPTION
images

Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.

TYPE: `ImageInput`

do_resize

Whether to resize the image.

TYPE: `bool`, *optional*, defaults to `self.do_resize` DEFAULT: None

size

Controls the size of the image after resize. The shortest edge of the image is resized to size["shortest_edge"] whilst preserving the aspect ratio. If the longest edge of this resized image is > int(size["shortest_edge"] * (1333 / 800)), then the image is resized again to make the longest edge equal to int(size["shortest_edge"] * (1333 / 800)).

TYPE: `Dict[str, int]`, *optional*, defaults to `self.size` DEFAULT: None

size_divisor

The image is resized to a size that is a multiple of this value.

TYPE: `int`, *optional*, defaults to `self.size_divisor` DEFAULT: None

resample

Resampling filter to use if resizing the image. Only has an effect if do_resize is set to True.

TYPE: `PILImageResampling`, *optional*, defaults to `self.resample` DEFAULT: None

do_rescale

Whether to rescale the image values to the [0, 1] range.

TYPE: `bool`, *optional*, defaults to `self.do_rescale` DEFAULT: None

rescale_factor

Rescale factor to rescale the image by if do_rescale is set to True.

TYPE: `float`, *optional*, defaults to `self.rescale_factor` DEFAULT: None

do_normalize

Whether to normalize the image.

TYPE: `bool`, *optional*, defaults to `self.do_normalize` DEFAULT: None

image_mean

Image mean to normalize the image by if do_normalize is set to True.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.image_mean` DEFAULT: None

image_std

Image standard deviation to normalize the image by if do_normalize is set to True.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.image_std` DEFAULT: None

do_pad

Whether to pad the image to the (max_height, max_width) in the batch. If True, a pixel mask is also created and returned.

TYPE: `bool`, *optional*, defaults to `self.do_pad` DEFAULT: None

do_center_crop

Whether to center crop the image. If the input size is smaller than crop_size along any edge, the image is padded with 0's and then center cropped.

TYPE: `bool`, *optional*, defaults to `self.do_center_crop` DEFAULT: None

crop_size

Size of the image after center crop. If one edge of the image is smaller than crop_size, it will be padded with zeros and then cropped.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.crop_size` DEFAULT: None

return_tensors

The type of tensors to return. Can be one of:

  • Unset: Return a list of np.ndarray.
  • TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
  • TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
  • TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
  • TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.

TYPE: `str` or `TensorType`, *optional* DEFAULT: None

data_format

The channel dimension format for the output image. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • Unset: Use the channel dimension format of the input image.

TYPE: `ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST` DEFAULT: FIRST

input_data_format

The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • "none" or ChannelDimension.NONE: image in (height, width) format.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/bridgetower/image_processing_bridgetower.py
def preprocess(
    self,
    images: ImageInput,
    do_resize: Optional[bool] = None,
    size: Optional[Dict[str, int]] = None,
    size_divisor: Optional[int] = None,
    resample: PILImageResampling = None,
    do_rescale: Optional[bool] = None,
    rescale_factor: Optional[float] = None,
    do_normalize: Optional[bool] = None,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    do_pad: Optional[bool] = None,
    do_center_crop: Optional[bool] = None,
    crop_size: Dict[str, int] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    data_format: ChannelDimension = ChannelDimension.FIRST,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> BatchFeature:
    """
    Preprocess an image or batch of images.

    Args:
        images (`ImageInput`):
            Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
            passing in images with pixel values between 0 and 1, set `do_rescale=False`.
        do_resize (`bool`, *optional*, defaults to `self.do_resize`):
            Whether to resize the image.
        size (`Dict[str, int]`, *optional*, defaults to `self.size`):
            Controls the size of the image after `resize`. The shortest edge of the image is resized to
            `size["shortest_edge"]` whilst preserving the aspect ratio. If the longest edge of this resized image
            is > `int(size["shortest_edge"] * (1333 / 800))`, then the image is resized again to make the longest
            edge equal to `int(size["shortest_edge"] * (1333 / 800))`.
        size_divisor (`int`, *optional*, defaults to `self.size_divisor`):
            The image is resized to a size that is a multiple of this value.
        resample (`PILImageResampling`, *optional*, defaults to `self.resample`):
            Resampling filter to use if resizing the image. Only has an effect if `do_resize` is set to `True`.
        do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
            Whether to rescale the image values to the [0, 1] range.
        rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
            Rescale factor to rescale the image by if `do_rescale` is set to `True`.
        do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
            Whether to normalize the image.
        image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
            Image mean to normalize the image by if `do_normalize` is set to `True`.
        image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
            Image standard deviation to normalize the image by if `do_normalize` is set to `True`.
        do_pad (`bool`, *optional*, defaults to `self.do_pad`):
            Whether to pad the image to the (max_height, max_width) in the batch. If `True`, a pixel mask is also
            created and returned.
        do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
            Whether to center crop the image. If the input size is smaller than `crop_size` along any edge, the
            image is padded with 0's and then center cropped.
        crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
            Size of the image after center crop. If one edge of the image is smaller than `crop_size`, it will
            be padded with zeros and then cropped.
        return_tensors (`str` or `TensorType`, *optional*):
            The type of tensors to return. Can be one of:

            - Unset: Return a list of `np.ndarray`.
            - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
            - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
            - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
            - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
        data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
            The channel dimension format for the output image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - Unset: Use the channel dimension format of the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image. If unset, the channel dimension format is inferred
            from the input image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
    """
    do_resize = do_resize if do_resize is not None else self.do_resize
    size_divisor = size_divisor if size_divisor is not None else self.size_divisor
    resample = resample if resample is not None else self.resample
    do_rescale = do_rescale if do_rescale is not None else self.do_rescale
    rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
    do_normalize = do_normalize if do_normalize is not None else self.do_normalize
    image_mean = image_mean if image_mean is not None else self.image_mean
    image_std = image_std if image_std is not None else self.image_std
    do_pad = do_pad if do_pad is not None else self.do_pad
    do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
    # For backwards compatibility. Initial version of this processor was cropping to the "size" argument, which
    # it should default to if crop_size is undefined.
    crop_size = (
        crop_size if crop_size is not None else (self.crop_size if self.crop_size is not None else self.size)
    )

    size = size if size is not None else self.size
    size = get_size_dict(size, default_to_square=False)

    validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)

    if not is_batched(images):
        images = [images]

    if not valid_images(images):
        raise ValueError(
            "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
            "torch.Tensor, tf.Tensor or jax.ndarray."
        )
    # Here, crop_size is used only if it is set, else size will be used.
    validate_preprocess_arguments(
        do_rescale=do_rescale,
        rescale_factor=rescale_factor,
        do_normalize=do_normalize,
        image_mean=image_mean,
        image_std=image_std,
        do_pad=do_pad,
        size_divisibility=size_divisor,
        do_center_crop=do_center_crop,
        crop_size=crop_size,
        do_resize=do_resize,
        size=size,
        resample=resample,
    )
    # All transformations expect numpy arrays.
    images = [to_numpy_array(image) for image in images]

    if is_scaled_image(images[0]) and do_rescale:
        logger.warning_once(
            "It looks like you are trying to rescale already rescaled images. If the input"
            " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
        )

    if do_resize:
        images = [
            self.resize(
                image=image,
                size=size,
                size_divisor=size_divisor,
                resample=resample,
                input_data_format=input_data_format,
            )
            for image in images
        ]

    if do_center_crop:
        images = [
            self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images
        ]

    if do_rescale:
        images = [
            self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
            for image in images
        ]

    if do_normalize:
        images = [
            self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
            for image in images
        ]

    images = [
        to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
    ]

    if do_pad:
        encoded_outputs = self.pad(
            images, return_pixel_mask=True, return_tensors=return_tensors, input_data_format=data_format
        )
    else:
        encoded_outputs = BatchFeature(data={"pixel_values": images}, tensor_type=return_tensors)

    return encoded_outputs
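
Per-call arguments override the instance defaults, so a single processor can be reused with different settings. A sketch (same import assumption as above) that skips resizing and cropping for one call while keeping rescaling and normalization:

>>> import numpy as np
>>> from mindnlp.transformers import BridgeTowerImageProcessor
...
>>> image_processor = BridgeTowerImageProcessor()
>>> image = np.random.randint(0, 256, (600, 400, 3), dtype=np.uint8)
>>> batch = image_processor.preprocess(image, do_resize=False, do_center_crop=False, return_tensors="np")
>>> batch["pixel_values"].shape
(1, 3, 600, 400)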

mindnlp.transformers.models.bridgetower.image_processing_bridgetower.BridgeTowerImageProcessor.resize(image, size, size_divisor=32, resample=PILImageResampling.BICUBIC, data_format=None, input_data_format=None, **kwargs)

Resize an image.

Resizes the shorter side of the image to size["shortest_edge"] while preserving the aspect ratio. If the longer side is larger than the max size (int(size["shortest_edge"] * 1333 / 800)), the longer side is then resized to the max size while preserving the aspect ratio.

PARAMETER DESCRIPTION
image

Image to resize.

TYPE: `np.ndarray`

size

Controls the size of the output image. Should be of the form {"shortest_edge": int}.

TYPE: `Dict[str, int]`

size_divisor

The image is resized to a size that is a multiple of this value.

TYPE: `int`, defaults to 32 DEFAULT: 32

resample

Resampling filter to use when resizing the image.

TYPE: `PILImageResampling` filter, *optional*, defaults to `PILImageResampling.BICUBIC` DEFAULT: BICUBIC

data_format

The channel dimension format of the image. If not provided, it will be the same as the input image.

TYPE: `str` or `ChannelDimension`, *optional* DEFAULT: None

input_data_format

The channel dimension format of the input image. If not provided, it will be inferred.

TYPE: `str` or `ChannelDimension`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/bridgetower/image_processing_bridgetower.py
def resize(
    self,
    image: np.ndarray,
    size: Dict[str, int],
    size_divisor: int = 32,
    resample: PILImageResampling = PILImageResampling.BICUBIC,
    data_format: Optional[Union[str, ChannelDimension]] = None,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> np.ndarray:
    """
    Resize an image.

    Resizes the shorter side of the image to `size["shortest_edge"]` while preserving the aspect ratio. If the
    longer side is larger than the max size (`int(size["shortest_edge"] * 1333 / 800)`), the longer side is then
    resized to the max size while preserving the aspect ratio.

    Args:
        image (`np.ndarray`):
            Image to resize.
        size (`Dict[str, int]`):
            Controls the size of the output image. Should be of the form `{"shortest_edge": int}`.
        size_divisor (`int`, defaults to 32):
            The image is resized to a size that is a multiple of this value.
        resample (`PILImageResampling` filter, *optional*, defaults to `PILImageResampling.BICUBIC`):
            Resampling filter to use when resizing the image.
        data_format (`str` or `ChannelDimension`, *optional*):
            The channel dimension format of the image. If not provided, it will be the same as the input image.
        input_data_format (`str` or `ChannelDimension`, *optional*):
            The channel dimension format of the input image. If not provided, it will be inferred.
    """
    size = get_size_dict(size, default_to_square=False)
    if "shortest_edge" not in size:
        raise ValueError(f"The `size` dictionary must contain the key `shortest_edge`. Got {size.keys()}")
    shorter = size["shortest_edge"]
    longer = int(1333 / 800 * shorter)
    output_size = get_resize_output_image_size(
        image, shorter=shorter, longer=longer, size_divisor=size_divisor, input_data_format=input_data_format
    )
    return resize(
        image,
        size=output_size,
        resample=resample,
        data_format=data_format,
        input_data_format=input_data_format,
        **kwargs,
    )
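
To make the sizing rule concrete, the rough sketch below reimplements the documented behaviour in plain Python; it is only an approximation, since the exact rounding is delegated to the library helper get_resize_output_image_size. With the default shortest edge of 288, the longer side is capped at int(288 * 1333 / 800) = 479 before rounding to a multiple of size_divisor.

>>> def approx_output_size(height, width, shortest_edge=288, size_divisor=32):
...     """Rough sketch of the documented rule; the actual helper may round differently."""
...     shorter, longer = min(height, width), max(height, width)
...     longest_cap = int(shortest_edge * 1333 / 800)
...     scale = shortest_edge / shorter
...     if longer * scale > longest_cap:
...         scale = longest_cap / longer
...     # Round each side to a multiple of size_divisor.
...     round_div = lambda x: max(size_divisor, int(round(x / size_divisor)) * size_divisor)
...     return round_div(height * scale), round_div(width * scale)
...
>>> approx_output_size(480, 640)
(288, 384)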

mindnlp.transformers.models.bridgetower.processing_bridgetower.BridgeTowerProcessor

Bases: ProcessorMixin

Constructs a BridgeTower processor which wraps a Roberta tokenizer and BridgeTower image processor into a single processor.

[BridgeTowerProcessor] offers all the functionalities of [BridgeTowerImageProcessor] and [RobertaTokenizerFast]. See the docstring of [~BridgeTowerProcessor.__call__] and [~BridgeTowerProcessor.decode] for more information.

PARAMETER DESCRIPTION
image_processor

An instance of [BridgeTowerImageProcessor]. The image processor is a required input.

TYPE: `BridgeTowerImageProcessor`

tokenizer

An instance of [RobertaTokenizerFast]. The tokenizer is a required input.

TYPE: `RobertaTokenizerFast`

Source code in mindnlp/transformers/models/bridgetower/processing_bridgetower.py
class BridgeTowerProcessor(ProcessorMixin):
    r"""
    Constructs a BridgeTower processor which wraps a Roberta tokenizer and BridgeTower image processor into a single
    processor.

    [`BridgeTowerProcessor`] offers all the functionalities of [`BridgeTowerImageProcessor`] and
    [`RobertaTokenizerFast`]. See the docstring of [`~BridgeTowerProcessor.__call__`] and
    [`~BridgeTowerProcessor.decode`] for more information.

    Args:
        image_processor (`BridgeTowerImageProcessor`):
            An instance of [`BridgeTowerImageProcessor`]. The image processor is a required input.
        tokenizer (`RobertaTokenizerFast`):
            An instance of [`RobertaTokenizerFast`]. The tokenizer is a required input.
    """
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "BridgeTowerImageProcessor"
    tokenizer_class = ("RobertaTokenizer", "RobertaTokenizerFast")

    def __init__(self, image_processor, tokenizer):
        """
        This method initializes an instance of the BridgeTowerProcessor class.

        Args:
            self (object): The instance of the BridgeTowerProcessor class.
            image_processor (object): An object representing the image processor to be used for processing images.
            tokenizer (object): An object representing the tokenizer to be used for tokenizing text.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(image_processor, tokenizer)

    def __call__(
        self,
        images,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length: Optional[int] = None,
        stride: int = 0,
        pad_to_multiple_of: Optional[int] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs,
    ) -> BatchEncoding:
        """
        This method uses [`BridgeTowerImageProcessor.__call__`] to prepare image(s) for the model, and
        [`RobertaTokenizerFast.__call__`] to prepare text for the model.

        Please refer to the docstring of the above two methods for more information.
        """
        encoding = self.tokenizer(
            text=text,
            add_special_tokens=add_special_tokens,
            padding=padding,
            truncation=truncation,
            max_length=max_length,
            stride=stride,
            pad_to_multiple_of=pad_to_multiple_of,
            return_token_type_ids=return_token_type_ids,
            return_attention_mask=return_attention_mask,
            return_overflowing_tokens=return_overflowing_tokens,
            return_special_tokens_mask=return_special_tokens_mask,
            return_offsets_mapping=return_offsets_mapping,
            return_length=return_length,
            verbose=verbose,
            return_tensors=return_tensors,
            **kwargs,
        )
        # add pixel_values + pixel_mask
        encoding_image_processor = self.image_processor(
            images, return_tensors=return_tensors, do_normalize=True, do_center_crop=True, **kwargs
        )
        encoding.update(encoding_image_processor)

        return encoding

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to RobertaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to RobertaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer
        to the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    def model_input_names(self):
        """
        Returns a list of model input names for the 'BridgeTowerProcessor' class.

        Args:
            self (BridgeTowerProcessor): An instance of the 'BridgeTowerProcessor' class.

        Returns:
            list: A list containing the model input names for the tokenizer and the image processor.

        Raises:
            None.
        """
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
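
An end-to-end sketch of the processor (illustrative; it assumes BridgeTowerProcessor is exported from mindnlp.transformers and that the BridgeTower/bridgetower-base checkpoint can be downloaded):

>>> import numpy as np
>>> from mindnlp.transformers import BridgeTowerProcessor
...
>>> processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-base")
>>> image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
>>> encoding = processor(images=image, text="a photo of a cat", return_tensors="np")
>>> # encoding combines the tokenizer outputs (input_ids, attention_mask) with
>>> # the image processor outputs (pixel_values, pixel_mask).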

mindnlp.transformers.models.bridgetower.processing_bridgetower.BridgeTowerProcessor.model_input_names property

Returns a list of model input names for the 'BridgeTowerProcessor' class.

PARAMETER DESCRIPTION
self

An instance of the 'BridgeTowerProcessor' class.

TYPE: BridgeTowerProcessor

RETURNS DESCRIPTION
list

A list containing the model input names for the tokenizer and the image processor.

mindnlp.transformers.models.bridgetower.processing_bridgetower.BridgeTowerProcessor.__call__(images, text=None, add_special_tokens=True, padding=False, truncation=None, max_length=None, stride=0, pad_to_multiple_of=None, return_token_type_ids=None, return_attention_mask=None, return_overflowing_tokens=False, return_special_tokens_mask=False, return_offsets_mapping=False, return_length=False, verbose=True, return_tensors=None, **kwargs)

This method uses [BridgeTowerImageProcessor.__call__] to prepare image(s) for the model, and [RobertaTokenizerFast.__call__] to prepare text for the model.

Please refer to the docstring of the above two methods for more information.

Source code in mindnlp/transformers/models/bridgetower/processing_bridgetower.py
def __call__(
    self,
    images,
    text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
    add_special_tokens: bool = True,
    padding: Union[bool, str, PaddingStrategy] = False,
    truncation: Union[bool, str, TruncationStrategy] = None,
    max_length: Optional[int] = None,
    stride: int = 0,
    pad_to_multiple_of: Optional[int] = None,
    return_token_type_ids: Optional[bool] = None,
    return_attention_mask: Optional[bool] = None,
    return_overflowing_tokens: bool = False,
    return_special_tokens_mask: bool = False,
    return_offsets_mapping: bool = False,
    return_length: bool = False,
    verbose: bool = True,
    return_tensors: Optional[Union[str, TensorType]] = None,
    **kwargs,
) -> BatchEncoding:
    """
    This method uses [`BridgeTowerImageProcessor.__call__`] to prepare image(s) for the model, and
    [`RobertaTokenizerFast.__call__`] to prepare text for the model.

    Please refer to the docstring of the above two methods for more information.
    """
    encoding = self.tokenizer(
        text=text,
        add_special_tokens=add_special_tokens,
        padding=padding,
        truncation=truncation,
        max_length=max_length,
        stride=stride,
        pad_to_multiple_of=pad_to_multiple_of,
        return_token_type_ids=return_token_type_ids,
        return_attention_mask=return_attention_mask,
        return_overflowing_tokens=return_overflowing_tokens,
        return_special_tokens_mask=return_special_tokens_mask,
        return_offsets_mapping=return_offsets_mapping,
        return_length=return_length,
        verbose=verbose,
        return_tensors=return_tensors,
        **kwargs,
    )
    # add pixel_values + pixel_mask
    encoding_image_processor = self.image_processor(
        images, return_tensors=return_tensors, do_normalize=True, do_center_crop=True, **kwargs
    )
    encoding.update(encoding_image_processor)

    return encoding

mindnlp.transformers.models.bridgetower.processing_bridgetower.BridgeTowerProcessor.__init__(image_processor, tokenizer)

This method initializes an instance of the BridgeTowerProcessor class.

PARAMETER DESCRIPTION
self

The instance of the BridgeTowerProcessor class.

TYPE: object

image_processor

An object representing the image processor to be used for processing images.

TYPE: object

tokenizer

An object representing the tokenizer to be used for tokenizing text.

TYPE: object

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/bridgetower/processing_bridgetower.py
def __init__(self, image_processor, tokenizer):
    """
    This method initializes an instance of the BridgeTowerProcessor class.

    Args:
        self (object): The instance of the BridgeTowerProcessor class.
        image_processor (object): An object representing the image processor to be used for processing images.
        tokenizer (object): An object representing the tokenizer to be used for tokenizing text.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(image_processor, tokenizer)

mindnlp.transformers.models.bridgetower.processing_bridgetower.BridgeTowerProcessor.batch_decode(*args, **kwargs)

This method forwards all its arguments to RobertaTokenizerFast's [~PreTrainedTokenizer.batch_decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/bridgetower/processing_bridgetower.py
def batch_decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to RobertaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
    refer to the docstring of this method for more information.
    """
    return self.tokenizer.batch_decode(*args, **kwargs)

mindnlp.transformers.models.bridgetower.processing_bridgetower.BridgeTowerProcessor.decode(*args, **kwargs)

This method forwards all its arguments to RobertaTokenizerFast's [~PreTrainedTokenizer.decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/bridgetower/processing_bridgetower.py
def decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to RobertaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer
    to the docstring of this method for more information.
    """
    return self.tokenizer.decode(*args, **kwargs)
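
For completeness, a small round-trip through the tokenizer-facing helpers (assuming a processor instantiated as in the earlier sketch); decoding the encoded ids with special tokens stripped should recover the original text:

>>> ids = processor.tokenizer("a photo of a cat")["input_ids"]
>>> processor.decode(ids, skip_special_tokens=True)
'a photo of a cat'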