beit

mindnlp.transformers.models.beit.configuration_beit.BeitConfig

Bases: BackboneConfigMixin, PretrainedConfig

This is the configuration class to store the configuration of a [BeitModel]. It is used to instantiate a BEiT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BEiT microsoft/beit-base-patch16-224-pt22k architecture.

PARAMETER DESCRIPTION
vocab_size

Vocabulary size of the BEiT model. Defines the number of different image tokens that can be used during pre-training.

TYPE: `int`, *optional*, defaults to 8192 DEFAULT: 8192

hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

intermediate_size

Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 3072 DEFAULT: 3072

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.

TYPE: `str` or `function`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

attention_probs_dropout_prob

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-12 DEFAULT: 1e-12

image_size

The size (resolution) of each image.

TYPE: `int`, *optional*, defaults to 224 DEFAULT: 224

patch_size

The size (resolution) of each patch.

TYPE: `int`, *optional*, defaults to 16 DEFAULT: 16

num_channels

The number of input channels.

TYPE: `int`, *optional*, defaults to 3 DEFAULT: 3

use_mask_token

Whether to use a mask token for masked image modeling.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

use_absolute_position_embeddings

Whether to use BERT-style absolute position embeddings.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

use_relative_position_bias

Whether to use T5-style relative position embeddings in the self-attention layers.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

use_shared_relative_position_bias

Whether to use the same relative position embeddings across all self-attention layers of the Transformer.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

layer_scale_init_value

Scale to use in the self-attention layers. 0.1 for base, 1e-5 for large. Set 0 to disable layer scale.

TYPE: `float`, *optional*, defaults to 0.1 DEFAULT: 0.1

drop_path_rate

Stochastic depth rate per sample (when applied in the main path of residual layers).

TYPE: `float`, *optional*, defaults to 0.1 DEFAULT: 0.1

use_mean_pooling

Whether to mean pool the final hidden states of the patches instead of using the final hidden state of the CLS token, before applying the classification head.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

pool_scales

Pooling scales used in Pooling Pyramid Module applied on the last feature map.

TYPE: `Tuple[int]`, *optional*, defaults to `[1, 2, 3, 6]` DEFAULT: [1, 2, 3, 6]

use_auxiliary_head

Whether to use an auxiliary head during training.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

auxiliary_loss_weight

Weight of the cross-entropy loss of the auxiliary head.

TYPE: `float`, *optional*, defaults to 0.4 DEFAULT: 0.4

auxiliary_channels

Number of channels to use in the auxiliary head.

TYPE: `int`, *optional*, defaults to 256 DEFAULT: 256

auxiliary_num_convs

Number of convolutional layers to use in the auxiliary head.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

auxiliary_concat_input

Whether to concatenate the output of the auxiliary head with the input before the classification layer.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

semantic_loss_ignore_index

The index that is ignored by the loss function of the semantic segmentation model.

TYPE: `int`, *optional*, defaults to 255 DEFAULT: 255

out_features

If used as backbone, list of features to output. Can be any of "stem", "stage1", "stage2", etc. (depending on how many stages the model has). If unset and out_indices is set, will default to the corresponding stages. If unset and out_indices is unset, will default to the last stage. Must be in the same order as defined in the stage_names attribute.

TYPE: `List[str]`, *optional* DEFAULT: None

out_indices

If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how many stages the model has). If unset and out_features is set, will default to the corresponding stages. If unset and out_features is unset, will default to the last stage. Must be in the same order as defined in the stage_names attribute.

TYPE: `List[int]`, *optional* DEFAULT: None

add_fpn

Whether to add a FPN as part of the backbone. Only relevant for [BeitBackbone].

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

reshape_hidden_states

Whether to reshape the feature maps to 4D tensors of shape (batch_size, hidden_size, height, width) in case the model is used as backbone. If False, the feature maps will be 3D tensors of shape (batch_size, seq_len, hidden_size). Only relevant for [BeitBackbone].

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

Example
>>> from transformers import BeitConfig, BeitModel
...
>>> # Initializing a BEiT beit-base-patch16-224-pt22k style configuration
>>> configuration = BeitConfig()
...
>>> # Initializing a model (with random weights) from the beit-base-patch16-224-pt22k style configuration
>>> model = BeitModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
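
The class also doubles as a backbone configuration via the out_features / out_indices arguments described above. A minimal sketch, assuming BeitConfig is importable from mindnlp.transformers (the docstring example above imports from transformers; the mindnlp namespace is assumed here) and that stage names follow the stem/stageN convention set in __init__:

```python
>>> from mindnlp.transformers import BeitConfig
...
>>> # Request intermediate feature maps by stage name; `out_indices` is aligned to the
>>> # matching positions in `stage_names` by `get_aligned_output_features_output_indices`
>>> backbone_config = BeitConfig(out_features=["stage3", "stage6", "stage9", "stage12"])
...
>>> backbone_config.stage_names[:4]
['stem', 'stage1', 'stage2', 'stage3']
```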
Source code in mindnlp/transformers/models/beit/configuration_beit.py
class BeitConfig(BackboneConfigMixin, PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`BeitModel`]. It is used to instantiate a BEiT
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the BEiT
    [microsoft/beit-base-patch16-224-pt22k](https://hf-mirror.com/microsoft/beit-base-patch16-224-pt22k) architecture.

    Args:
        vocab_size (`int`, *optional*, defaults to 8192):
            Vocabulary size of the BEiT model. Defines the number of different image tokens that can be used during
            pre-training.
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        image_size (`int`, *optional*, defaults to 224):
            The size (resolution) of each image.
        patch_size (`int`, *optional*, defaults to 16):
            The size (resolution) of each patch.
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        use_mask_token (`bool`, *optional*, defaults to `False`):
            Whether to use a mask token for masked image modeling.
        use_absolute_position_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to use BERT-style absolute position embeddings.
        use_relative_position_bias (`bool`, *optional*, defaults to `False`):
            Whether to use T5-style relative position embeddings in the self-attention layers.
        use_shared_relative_position_bias (`bool`, *optional*, defaults to `False`):
            Whether to use the same relative position embeddings across all self-attention layers of the Transformer.
        layer_scale_init_value (`float`, *optional*, defaults to 0.1):
            Scale to use in the self-attention layers. 0.1 for base, 1e-5 for large. Set 0 to disable layer scale.
        drop_path_rate (`float`, *optional*, defaults to 0.1):
            Stochastic depth rate per sample (when applied in the main path of residual layers).
        use_mean_pooling (`bool`, *optional*, defaults to `True`):
            Whether to mean pool the final hidden states of the patches instead of using the final hidden state of the
            CLS token, before applying the classification head.
        pool_scales (`Tuple[int]`, *optional*, defaults to `[1, 2, 3, 6]`):
            Pooling scales used in Pooling Pyramid Module applied on the last feature map.
        use_auxiliary_head (`bool`, *optional*, defaults to `True`):
            Whether to use an auxiliary head during training.
        auxiliary_loss_weight (`float`, *optional*, defaults to 0.4):
            Weight of the cross-entropy loss of the auxiliary head.
        auxiliary_channels (`int`, *optional*, defaults to 256):
            Number of channels to use in the auxiliary head.
        auxiliary_num_convs (`int`, *optional*, defaults to 1):
            Number of convolutional layers to use in the auxiliary head.
        auxiliary_concat_input (`bool`, *optional*, defaults to `False`):
            Whether to concatenate the output of the auxiliary head with the input before the classification layer.
        semantic_loss_ignore_index (`int`, *optional*, defaults to 255):
            The index that is ignored by the loss function of the semantic segmentation model.
        out_features (`List[str]`, *optional*):
            If used as backbone, list of features to output. Can be any of `"stem"`, `"stage1"`, `"stage2"`, etc.
            (depending on how many stages the model has). If unset and `out_indices` is set, will default to the
            corresponding stages. If unset and `out_indices` is unset, will default to the last stage. Must be in the
            same order as defined in the `stage_names` attribute.
        out_indices (`List[int]`, *optional*):
            If used as backbone, list of indices of features to output. Can be any of 0, 1, 2, etc. (depending on how
            many stages the model has). If unset and `out_features` is set, will default to the corresponding stages.
            If unset and `out_features` is unset, will default to the last stage. Must be in the
            same order as defined in the `stage_names` attribute.
        add_fpn (`bool`, *optional*, defaults to `False`):
            Whether to add a FPN as part of the backbone. Only relevant for [`BeitBackbone`].
        reshape_hidden_states (`bool`, *optional*, defaults to `True`):
            Whether to reshape the feature maps to 4D tensors of shape `(batch_size, hidden_size, height, width)` in
            case the model is used as backbone. If `False`, the feature maps will be 3D tensors of shape `(batch_size,
            seq_len, hidden_size)`. Only relevant for [`BeitBackbone`].

    Example:
        ```python
        >>> from transformers import BeitConfig, BeitModel
        ...
        >>> # Initializing a BEiT beit-base-patch16-224-pt22k style configuration
        >>> configuration = BeitConfig()
        ...
        >>> # Initializing a model (with random weights) from the beit-base-patch16-224-pt22k style configuration
        >>> model = BeitModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "beit"

    def __init__(
        self,
        vocab_size=8192,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.0,
        attention_probs_dropout_prob=0.0,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        image_size=224,
        patch_size=16,
        num_channels=3,
        use_mask_token=False,
        use_absolute_position_embeddings=False,
        use_relative_position_bias=False,
        use_shared_relative_position_bias=False,
        layer_scale_init_value=0.1,
        drop_path_rate=0.1,
        use_mean_pooling=True,
        pool_scales=[1, 2, 3, 6],
        use_auxiliary_head=True,
        auxiliary_loss_weight=0.4,
        auxiliary_channels=256,
        auxiliary_num_convs=1,
        auxiliary_concat_input=False,
        semantic_loss_ignore_index=255,
        out_features=None,
        out_indices=None,
        add_fpn=False,
        reshape_hidden_states=True,
        **kwargs,
    ):
        """
        __init__

        Initializes an instance of the BeitConfig class.

        Args:
            vocab_size (int, optional): The size of the vocabulary. Defaults to 8192.
            hidden_size (int, optional): The size of the hidden layers. Defaults to 768.
            num_hidden_layers (int, optional): The number of hidden layers. Defaults to 12.
            num_attention_heads (int, optional): The number of attention heads. Defaults to 12.
            intermediate_size (int, optional): The size of the intermediate layers. Defaults to 3072.
            hidden_act (str, optional): The activation function for the hidden layers. Defaults to 'gelu'.
            hidden_dropout_prob (float, optional): The dropout probability for the hidden layers. Defaults to 0.0.
            attention_probs_dropout_prob (float, optional): The dropout probability for the attention probabilities. Defaults to 0.0.
            initializer_range (float, optional): The range for weight initialization. Defaults to 0.02.
            layer_norm_eps (float, optional): The epsilon value for layer normalization. Defaults to 1e-12.
            image_size (int, optional): The size of the input image. Defaults to 224.
            patch_size (int, optional): The size of the patches. Defaults to 16.
            num_channels (int, optional): The number of input channels. Defaults to 3.
            use_mask_token (bool, optional): Whether to use a mask token. Defaults to False.
            use_absolute_position_embeddings (bool, optional): Whether to use absolute position embeddings. Defaults to False.
            use_relative_position_bias (bool, optional): Whether to use relative position bias. Defaults to False.
            use_shared_relative_position_bias (bool, optional): Whether to use shared relative position bias. Defaults to False.
            layer_scale_init_value (float, optional): The initial value for layer scale. Defaults to 0.1.
            drop_path_rate (float, optional): The rate for drop path. Defaults to 0.1.
            use_mean_pooling (bool, optional): Whether to use mean pooling. Defaults to True.
            pool_scales (list of int, optional): The scales for pooling. Defaults to [1, 2, 3, 6].
            use_auxiliary_head (bool, optional): Whether to use an auxiliary head. Defaults to True.
            auxiliary_loss_weight (float, optional): The weight for the auxiliary loss. Defaults to 0.4.
            auxiliary_channels (int, optional): The number of channels for the auxiliary head. Defaults to 256.
            auxiliary_num_convs (int, optional): The number of convolutional layers for the auxiliary head. Defaults to 1.
            auxiliary_concat_input (bool, optional): Whether to concatenate input for the auxiliary head. Defaults to False.
            semantic_loss_ignore_index (int, optional): The index to ignore for semantic loss. Defaults to 255.
            out_features (None, optional): The output features. Defaults to None.
            out_indices (None, optional): The output indices. Defaults to None.
            add_fpn (bool, optional): Whether to add feature pyramid network. Defaults to False.
            reshape_hidden_states (bool, optional): Whether to reshape the hidden states. Defaults to True.

        Returns:
            None.

        Raises:
            FutureWarning:
                If the 'segmentation_indices' argument is used,
                a warning is issued indicating its deprecation and advising to use 'out_indices' instead.
        """
        super().__init__(**kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps

        self.image_size = image_size
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.use_mask_token = use_mask_token
        self.use_absolute_position_embeddings = use_absolute_position_embeddings
        self.use_relative_position_bias = use_relative_position_bias
        self.use_shared_relative_position_bias = use_shared_relative_position_bias
        self.layer_scale_init_value = layer_scale_init_value
        self.drop_path_rate = drop_path_rate
        self.use_mean_pooling = use_mean_pooling
        # decode head attributes (semantic segmentation)
        self.pool_scales = pool_scales
        # auxiliary head attributes (semantic segmentation)
        self.use_auxiliary_head = use_auxiliary_head
        self.auxiliary_loss_weight = auxiliary_loss_weight
        self.auxiliary_channels = auxiliary_channels
        self.auxiliary_num_convs = auxiliary_num_convs
        self.auxiliary_concat_input = auxiliary_concat_input
        self.semantic_loss_ignore_index = semantic_loss_ignore_index

        # handle backwards compatibility
        if "segmentation_indices" in kwargs:
            logger.warning(
                "The `segmentation_indices` argument is deprecated and will be removed in a future version, use `out_indices` instead.",
                FutureWarning,
            )
            out_indices = kwargs.pop("segmentation_indices")

        # backbone attributes
        self.stage_names = ["stem"] + [f"stage{idx}" for idx in range(1, self.num_hidden_layers + 1)]
        self._out_features, self._out_indices = get_aligned_output_features_output_indices(
            out_features=out_features, out_indices=out_indices, stage_names=self.stage_names
        )
        self.add_fpn = add_fpn
        self.reshape_hidden_states = reshape_hidden_states

mindnlp.transformers.models.beit.configuration_beit.BeitConfig.__init__(vocab_size=8192, hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.0, attention_probs_dropout_prob=0.0, initializer_range=0.02, layer_norm_eps=1e-12, image_size=224, patch_size=16, num_channels=3, use_mask_token=False, use_absolute_position_embeddings=False, use_relative_position_bias=False, use_shared_relative_position_bias=False, layer_scale_init_value=0.1, drop_path_rate=0.1, use_mean_pooling=True, pool_scales=[1, 2, 3, 6], use_auxiliary_head=True, auxiliary_loss_weight=0.4, auxiliary_channels=256, auxiliary_num_convs=1, auxiliary_concat_input=False, semantic_loss_ignore_index=255, out_features=None, out_indices=None, add_fpn=False, reshape_hidden_states=True, **kwargs)

__init__

Initializes an instance of the BeitConfig class.

PARAMETER DESCRIPTION
vocab_size

The size of the vocabulary. Defaults to 8192.

TYPE: int DEFAULT: 8192

hidden_size

The size of the hidden layers. Defaults to 768.

TYPE: int DEFAULT: 768

num_hidden_layers

The number of hidden layers. Defaults to 12.

TYPE: int DEFAULT: 12

num_attention_heads

The number of attention heads. Defaults to 12.

TYPE: int DEFAULT: 12

intermediate_size

The size of the intermediate layers. Defaults to 3072.

TYPE: int DEFAULT: 3072

hidden_act

The activation function for the hidden layers. Defaults to 'gelu'.

TYPE: str DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for the hidden layers. Defaults to 0.0.

TYPE: float DEFAULT: 0.0

attention_probs_dropout_prob

The dropout probability for the attention probabilities. Defaults to 0.0.

TYPE: float DEFAULT: 0.0

initializer_range

The range for weight initialization. Defaults to 0.02.

TYPE: float DEFAULT: 0.02

layer_norm_eps

The epsilon value for layer normalization. Defaults to 1e-12.

TYPE: float DEFAULT: 1e-12

image_size

The size of the input image. Defaults to 224.

TYPE: int DEFAULT: 224

patch_size

The size of the patches. Defaults to 16.

TYPE: int DEFAULT: 16

num_channels

The number of input channels. Defaults to 3.

TYPE: int DEFAULT: 3

use_mask_token

Whether to use a mask token. Defaults to False.

TYPE: bool DEFAULT: False

use_absolute_position_embeddings

Whether to use absolute position embeddings. Defaults to False.

TYPE: bool DEFAULT: False

use_relative_position_bias

Whether to use relative position bias. Defaults to False.

TYPE: bool DEFAULT: False

use_shared_relative_position_bias

Whether to use shared relative position bias. Defaults to False.

TYPE: bool DEFAULT: False

layer_scale_init_value

The initial value for layer scale. Defaults to 0.1.

TYPE: float DEFAULT: 0.1

drop_path_rate

The rate for drop path. Defaults to 0.1.

TYPE: float DEFAULT: 0.1

use_mean_pooling

Whether to use mean pooling. Defaults to True.

TYPE: bool DEFAULT: True

pool_scales

The scales for pooling. Defaults to [1, 2, 3, 6].

TYPE: list of int DEFAULT: [1, 2, 3, 6]

use_auxiliary_head

Whether to use an auxiliary head. Defaults to True.

TYPE: bool DEFAULT: True

auxiliary_loss_weight

The weight for the auxiliary loss. Defaults to 0.4.

TYPE: float DEFAULT: 0.4

auxiliary_channels

The number of channels for the auxiliary head. Defaults to 256.

TYPE: int DEFAULT: 256

auxiliary_num_convs

The number of convolutional layers for the auxiliary head. Defaults to 1.

TYPE: int DEFAULT: 1

auxiliary_concat_input

Whether to concatenate input for the auxiliary head. Defaults to False.

TYPE: bool DEFAULT: False

semantic_loss_ignore_index

The index to ignore for semantic loss. Defaults to 255.

TYPE: int DEFAULT: 255

out_features

The output features. Defaults to None.

TYPE: `List[str]`, *optional* DEFAULT: None

out_indices

The output indices. Defaults to None.

TYPE: `List[int]`, *optional* DEFAULT: None

add_fpn

Whether to add feature pyramid network. Defaults to False.

TYPE: bool DEFAULT: False

reshape_hidden_states

Whether to reshape the hidden states. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
FutureWarning

If the 'segmentation_indices' argument is used, a warning is issued indicating its deprecation and advising to use 'out_indices' instead.
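
A small, hedged illustration of the backward-compatibility handling described above (assuming BeitConfig is importable from mindnlp.transformers):

```python
>>> from mindnlp.transformers import BeitConfig
...
>>> # Preferred spelling: pass `out_indices` directly
>>> config = BeitConfig(out_indices=[3, 5, 7, 11])
...
>>> # Deprecated spelling: `BeitConfig(segmentation_indices=[3, 5, 7, 11])` is still
>>> # accepted; the value is popped from kwargs, a deprecation message is logged,
>>> # and it is used as `out_indices`.
```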

Source code in mindnlp/transformers/models/beit/configuration_beit.py
def __init__(
    self,
    vocab_size=8192,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
    initializer_range=0.02,
    layer_norm_eps=1e-12,
    image_size=224,
    patch_size=16,
    num_channels=3,
    use_mask_token=False,
    use_absolute_position_embeddings=False,
    use_relative_position_bias=False,
    use_shared_relative_position_bias=False,
    layer_scale_init_value=0.1,
    drop_path_rate=0.1,
    use_mean_pooling=True,
    pool_scales=[1, 2, 3, 6],
    use_auxiliary_head=True,
    auxiliary_loss_weight=0.4,
    auxiliary_channels=256,
    auxiliary_num_convs=1,
    auxiliary_concat_input=False,
    semantic_loss_ignore_index=255,
    out_features=None,
    out_indices=None,
    add_fpn=False,
    reshape_hidden_states=True,
    **kwargs,
):
    """
    __init__

    Initializes an instance of the BeitConfig class.

    Args:
        vocab_size (int, optional): The size of the vocabulary. Defaults to 8192.
        hidden_size (int, optional): The size of the hidden layers. Defaults to 768.
        num_hidden_layers (int, optional): The number of hidden layers. Defaults to 12.
        num_attention_heads (int, optional): The number of attention heads. Defaults to 12.
        intermediate_size (int, optional): The size of the intermediate layers. Defaults to 3072.
        hidden_act (str, optional): The activation function for the hidden layers. Defaults to 'gelu'.
        hidden_dropout_prob (float, optional): The dropout probability for the hidden layers. Defaults to 0.0.
        attention_probs_dropout_prob (float, optional): The dropout probability for the attention probabilities. Defaults to 0.0.
        initializer_range (float, optional): The range for weight initialization. Defaults to 0.02.
        layer_norm_eps (float, optional): The epsilon value for layer normalization. Defaults to 1e-12.
        image_size (int, optional): The size of the input image. Defaults to 224.
        patch_size (int, optional): The size of the patches. Defaults to 16.
        num_channels (int, optional): The number of input channels. Defaults to 3.
        use_mask_token (bool, optional): Whether to use a mask token. Defaults to False.
        use_absolute_position_embeddings (bool, optional): Whether to use absolute position embeddings. Defaults to False.
        use_relative_position_bias (bool, optional): Whether to use relative position bias. Defaults to False.
        use_shared_relative_position_bias (bool, optional): Whether to use shared relative position bias. Defaults to False.
        layer_scale_init_value (float, optional): The initial value for layer scale. Defaults to 0.1.
        drop_path_rate (float, optional): The rate for drop path. Defaults to 0.1.
        use_mean_pooling (bool, optional): Whether to use mean pooling. Defaults to True.
        pool_scales (list of int, optional): The scales for pooling. Defaults to [1, 2, 3, 6].
        use_auxiliary_head (bool, optional): Whether to use an auxiliary head. Defaults to True.
        auxiliary_loss_weight (float, optional): The weight for the auxiliary loss. Defaults to 0.4.
        auxiliary_channels (int, optional): The number of channels for the auxiliary head. Defaults to 256.
        auxiliary_num_convs (int, optional): The number of convolutional layers for the auxiliary head. Defaults to 1.
        auxiliary_concat_input (bool, optional): Whether to concatenate input for the auxiliary head. Defaults to False.
        semantic_loss_ignore_index (int, optional): The index to ignore for semantic loss. Defaults to 255.
        out_features (None, optional): The output features. Defaults to None.
        out_indices (None, optional): The output indices. Defaults to None.
        add_fpn (bool, optional): Whether to add feature pyramid network. Defaults to False.
        reshape_hidden_states (bool, optional): Whether to reshape the hidden states. Defaults to True.

    Returns:
        None.

    Raises:
        FutureWarning:
            If the 'segmentation_indices' argument is used,
            a warning is issued indicating its deprecation and advising to use 'out_indices' instead.
    """
    super().__init__(**kwargs)

    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.intermediate_size = intermediate_size
    self.hidden_act = hidden_act
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.initializer_range = initializer_range
    self.layer_norm_eps = layer_norm_eps

    self.image_size = image_size
    self.patch_size = patch_size
    self.num_channels = num_channels
    self.use_mask_token = use_mask_token
    self.use_absolute_position_embeddings = use_absolute_position_embeddings
    self.use_relative_position_bias = use_relative_position_bias
    self.use_shared_relative_position_bias = use_shared_relative_position_bias
    self.layer_scale_init_value = layer_scale_init_value
    self.drop_path_rate = drop_path_rate
    self.use_mean_pooling = use_mean_pooling
    # decode head attributes (semantic segmentation)
    self.pool_scales = pool_scales
    # auxiliary head attributes (semantic segmentation)
    self.use_auxiliary_head = use_auxiliary_head
    self.auxiliary_loss_weight = auxiliary_loss_weight
    self.auxiliary_channels = auxiliary_channels
    self.auxiliary_num_convs = auxiliary_num_convs
    self.auxiliary_concat_input = auxiliary_concat_input
    self.semantic_loss_ignore_index = semantic_loss_ignore_index

    # handle backwards compatibility
    if "segmentation_indices" in kwargs:
        logger.warning(
            "The `segmentation_indices` argument is deprecated and will be removed in a future version, use `out_indices` instead.",
            FutureWarning,
        )
        out_indices = kwargs.pop("segmentation_indices")

    # backbone attributes
    self.stage_names = ["stem"] + [f"stage{idx}" for idx in range(1, self.num_hidden_layers + 1)]
    self._out_features, self._out_indices = get_aligned_output_features_output_indices(
        out_features=out_features, out_indices=out_indices, stage_names=self.stage_names
    )
    self.add_fpn = add_fpn
    self.reshape_hidden_states = reshape_hidden_states

mindnlp.transformers.models.beit.image_processing_beit.BeitImageProcessor

Bases: BaseImageProcessor

Constructs a BEiT image processor.

PARAMETER DESCRIPTION
do_resize

Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by the do_resize parameter in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

size

256, "width": 256}): Size of the output image after resizing. Can be overridden by thesizeparameter in thepreprocess` method.

TYPE: `Dict[str, int]`, *optional*, defaults to `{"height": 256, "width": 256}` DEFAULT: None

resample

Resampling filter to use if resizing the image. Can be overridden by the resample parameter in the preprocess method.

TYPE: `PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC` DEFAULT: BICUBIC

do_center_crop

Whether to center crop the image. If the input size is smaller than crop_size along any edge, the image is padded with 0's and then center cropped. Can be overridden by the do_center_crop parameter in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

crop_size

224, "width": 224}): Desired output size when applying center-cropping. Only has an effect ifdo_center_cropis set toTrue. Can be overridden by thecrop_sizeparameter in thepreprocess` method.

TYPE: `Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}` DEFAULT: None

rescale_factor

Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in the preprocess method.

TYPE: `int` or `float`, *optional*, defaults to `1/255` DEFAULT: 1 / 255

do_rescale

Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

do_normalize

Whether to normalize the image. Can be overridden by the do_normalize parameter in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

image_mean

The mean to use if normalizing the image. This is a float or list of floats of length of the number of channels of the image. Can be overridden by the image_mean parameter in the preprocess method.

TYPE: `float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN` DEFAULT: None

image_std

The standard deviation to use if normalizing the image. This is a float or list of floats of length of the number of channels of the image. Can be overridden by the image_std parameter in the preprocess method.

TYPE: `float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD` DEFAULT: None

do_reduce_labels

Whether or not to reduce all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by 255. Can be overridden by the do_reduce_labels parameter in the preprocess method.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False
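
A minimal preprocessing sketch, assuming BeitImageProcessor is importable from mindnlp.transformers and that a NumPy array in (height, width, channels) layout is an accepted image input (as the preprocess signature below suggests):

```python
>>> import numpy as np
>>> from mindnlp.transformers import BeitImageProcessor
...
>>> # Defaults: resize to 256x256, center-crop to 224x224, rescale by 1/255, normalize
>>> processor = BeitImageProcessor()
...
>>> image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
>>> inputs = processor(image, return_tensors="np")
>>> inputs["pixel_values"].shape
(1, 3, 224, 224)
```

Segmentation maps can be passed alongside the images as the second positional argument (see __call__ below), in which case do_reduce_labels controls the background-label remapping described above.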

Source code in mindnlp/transformers/models/beit/image_processing_beit.py
class BeitImageProcessor(BaseImageProcessor):
    r"""
    Constructs a BEiT image processor.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by the
            `do_resize` parameter in the `preprocess` method.
        size (`Dict[str, int]` *optional*, defaults to `{"height": 256, "width": 256}`):
            Size of the output image after resizing. Can be overridden by the `size` parameter in the `preprocess`
            method.
        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
            Resampling filter to use if resizing the image. Can be overridden by the `resample` parameter in the
            `preprocess` method.
        do_center_crop (`bool`, *optional*, defaults to `True`):
            Whether to center crop the image. If the input size is smaller than `crop_size` along any edge, the image
            is padded with 0's and then center cropped. Can be overridden by the `do_center_crop` parameter in the
            `preprocess` method.
        crop_size (`Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}`):
            Desired output size when applying center-cropping. Only has an effect if `do_center_crop` is set to `True`.
            Can be overridden by the `crop_size` parameter in the `preprocess` method.
        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
            Scale factor to use if rescaling the image. Can be overridden by the `rescale_factor` parameter in the
            `preprocess` method.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale`
            parameter in the `preprocess` method.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the image. Can be overridden by the `do_normalize` parameter in the `preprocess`
            method.
        image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN`):
            The mean to use if normalizing the image. This is a float or list of floats of length of the number of
            channels of the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
        image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`):
            The standard deviation to use if normalizing the image. This is a float or list of floats of length of the
            number of channels of the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
        do_reduce_labels (`bool`, *optional*, defaults to `False`):
            Whether or not to reduce all label values of segmentation maps by 1. Usually used for datasets where 0 is
            used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The
            background label will be replaced by 255. Can be overridden by the `do_reduce_labels` parameter in the
            `preprocess` method.
    """
    model_input_names = ["pixel_values"]

    def __init__(
        self,
        do_resize: bool = True,
        size: Dict[str, int] = None,
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        do_center_crop: bool = True,
        crop_size: Dict[str, int] = None,
        rescale_factor: Union[int, float] = 1 / 255,
        do_rescale: bool = True,
        do_normalize: bool = True,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_reduce_labels: bool = False,
        **kwargs,
    ) -> None:
        """
        Initializes an instance of the BeitImageProcessor class.

        Args:
            self: The instance of the class.
            do_resize (bool, optional): Flag indicating whether to resize the image. Defaults to True.
            size (Dict[str, int], optional): The target size of the image as a dictionary with 'height' and 'width' keys. Defaults to {'height': 256, 'width': 256}.
            resample (PILImageResampling, optional): The resampling algorithm to be used during resizing. Defaults to PILImageResampling.BICUBIC.
            do_center_crop (bool, optional): Flag indicating whether to perform center cropping. Defaults to True.
            crop_size (Dict[str, int], optional): The size of the center crop as a dictionary with 'height' and 'width' keys. Defaults to {'height': 224, 'width': 224}.
            rescale_factor (Union[int, float], optional): The factor by which to rescale the image. Defaults to 1 / 255.
            do_rescale (bool, optional): Flag indicating whether to rescale the image. Defaults to True.
            do_normalize (bool, optional): Flag indicating whether to normalize the image. Defaults to True.
            image_mean (Optional[Union[float, List[float]]], optional): The mean values for image normalization. Defaults to None.
            image_std (Optional[Union[float, List[float]]], optional): The standard deviation values for image normalization. Defaults to None.
            do_reduce_labels (bool, optional): Flag indicating whether to reduce labels. Defaults to False.
            **kwargs: Additional keyword arguments.

        Returns:
            None

        Raises:
            FutureWarning:
                If the 'reduce_labels' parameter is used.
                This parameter is deprecated and will be removed in a future version.
                Please use 'do_reduce_labels' instead.
        """
        if "reduce_labels" in kwargs:
            warnings.warn(
                "The `reduce_labels` parameter is deprecated and will be removed in a future version. Please use"
                " `do_reduce_labels` instead.",
                FutureWarning,
            )
            do_reduce_labels = kwargs.pop("reduce_labels")
        super().__init__(**kwargs)
        size = size if size is not None else {"height": 256, "width": 256}
        size = get_size_dict(size)
        crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224}
        crop_size = get_size_dict(crop_size, param_name="crop_size")
        self.do_resize = do_resize
        self.size = size
        self.resample = resample
        self.do_center_crop = do_center_crop
        self.crop_size = crop_size
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
        self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
        self.do_reduce_labels = do_reduce_labels
        self._valid_processor_keys = [
            "images",
            "segmentation_maps",
            "do_resize",
            "size",
            "resample",
            "do_center_crop",
            "crop_size",
            "do_rescale",
            "rescale_factor",
            "do_normalize",
            "image_mean",
            "image_std",
            "do_reduce_labels",
            "return_tensors",
            "data_format",
            "input_data_format",
        ]

    @classmethod
    def from_dict(cls, image_processor_dict: Dict[str, Any], **kwargs):
        """
        Overrides the `from_dict` method from the base class to make sure `reduce_labels` is updated if image processor
        is created using from_dict and kwargs e.g. `BeitImageProcessor.from_pretrained(checkpoint, reduce_labels=True)`
        """
        image_processor_dict = image_processor_dict.copy()
        if "reduce_labels" in kwargs:
            image_processor_dict["reduce_labels"] = kwargs.pop("reduce_labels")
        return super().from_dict(image_processor_dict, **kwargs)

    def resize(
        self,
        image: np.ndarray,
        size: Dict[str, int],
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> np.ndarray:
        """
        Resize an image to (size["height"], size["width"]).

        Args:
            image (`np.ndarray`):
                Image to resize.
            size (`Dict[str, int]`):
                Size of the output image.
            resample (`PILImageResampling`, *optional*, defaults to `PIL.Image.BICUBIC`):
                Resampling filter to use when resizing the image.
            data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format of the image. If not provided, it will be the same as the input image.
            input_data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format of the input image. If not provided, it will be inferred.
        """
        size = get_size_dict(size, default_to_square=True, param_name="size")
        if "height" not in size or "width" not in size:
            raise ValueError(f"The `size` argument must contain `height` and `width` keys. Got {size.keys()}")
        return resize(
            image,
            size=(size["height"], size["width"]),
            resample=resample,
            data_format=data_format,
            input_data_format=input_data_format,
            **kwargs,
        )

    def reduce_label(self, label: ImageInput) -> np.ndarray:
        """
        Reduce the label values in the input image.

        Args:
            self: BeitImageProcessor
                The instance of the BeitImageProcessor class.
            label: ImageInput
                The input label image to be processed. It should be a valid ImageInput object.

        Returns:
            np.ndarray:
                Returns a numpy array representing the reduced label image.

        Raises:
            ValueError
                If the input label is not a valid ImageInput object.
            TypeError
                If the input label cannot be converted to a numpy array.
            IndexError
                If the label array indexing operation fails due to invalid indices.
        """
        label = to_numpy_array(label)
        # Avoid using underflow conversion
        label[label == 0] = 255
        label = label - 1
        label[label == 254] = 255
        return label

    def _preprocess(
        self,
        image: ImageInput,
        do_reduce_labels: bool = None,
        do_resize: bool = None,
        size: Dict[str, int] = None,
        resample: PILImageResampling = None,
        do_center_crop: bool = None,
        crop_size: Dict[str, int] = None,
        do_rescale: bool = None,
        rescale_factor: float = None,
        do_normalize: bool = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ):
        """
        _preprocess method processes the input image based on the provided parameters.

        Args:
            self: The instance of the BeitImageProcessor class.
            image (ImageInput): The input image to be processed.
            do_reduce_labels (bool, optional): If True, reduces the labels of the input image. Defaults to None.
            do_resize (bool, optional): If True, resizes the input image. Defaults to None.
            size (Dict[str, int], optional): A dictionary specifying the target size for resizing the image. Defaults to None.
            resample (PILImageResampling, optional): The resampling filter to use when resizing the image. Defaults to None.
            do_center_crop (bool, optional): If True, performs a center crop on the input image. Defaults to None.
            crop_size (Dict[str, int], optional): A dictionary specifying the size of the center crop. Defaults to None.
            do_rescale (bool, optional): If True, rescales the input image. Defaults to None.
            rescale_factor (float, optional): The factor by which the image will be rescaled. Defaults to None.
            do_normalize (bool, optional): If True, normalizes the input image. Defaults to None.
            image_mean (Optional[Union[float, List[float]]], optional): The mean value used for normalization. Defaults to None.
            image_std (Optional[Union[float, List[float]]], optional): The standard deviation used for normalization. Defaults to None.
            input_data_format (Optional[Union[str, ChannelDimension]], optional): The format of the input data. Defaults to None.

        Returns:
            np.ndarray: The processed image.

        Raises:
            None.
        """
        if do_reduce_labels:
            image = self.reduce_label(image)

        if do_resize:
            image = self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)

        if do_center_crop:
            image = self.center_crop(image=image, size=crop_size, input_data_format=input_data_format)

        if do_rescale:
            image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)

        if do_normalize:
            image = self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)

        return image

    def _preprocess_image(
        self,
        image: ImageInput,
        do_resize: bool = None,
        size: Dict[str, int] = None,
        resample: PILImageResampling = None,
        do_center_crop: bool = None,
        crop_size: Dict[str, int] = None,
        do_rescale: bool = None,
        rescale_factor: float = None,
        do_normalize: bool = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ) -> np.ndarray:
        """Preprocesses a single image."""
        # All transformations expect numpy arrays.
        image = to_numpy_array(image)
        if is_scaled_image(image) and do_rescale:
            logger.warning_once(
                "It looks like you are trying to rescale already rescaled images. If the input"
                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
            )
        if input_data_format is None:
            input_data_format = infer_channel_dimension_format(image)
        image = self._preprocess(
            image,
            do_reduce_labels=False,
            do_resize=do_resize,
            size=size,
            resample=resample,
            do_center_crop=do_center_crop,
            crop_size=crop_size,
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            input_data_format=input_data_format,
        )
        if data_format is not None:
            image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
        return image

    def _preprocess_segmentation_map(
        self,
        segmentation_map: ImageInput,
        do_resize: bool = None,
        size: Dict[str, int] = None,
        resample: PILImageResampling = None,
        do_center_crop: bool = None,
        crop_size: Dict[str, int] = None,
        do_reduce_labels: bool = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ):
        """Preprocesses a single segmentation map."""
        # All transformations expect numpy arrays.
        segmentation_map = to_numpy_array(segmentation_map)
        # Add an axis to the segmentation maps for transformations.
        if segmentation_map.ndim == 2:
            segmentation_map = segmentation_map[None, ...]
            added_dimension = True
            input_data_format = ChannelDimension.FIRST
        else:
            added_dimension = False
            if input_data_format is None:
                input_data_format = infer_channel_dimension_format(segmentation_map, num_channels=1)
        segmentation_map = self._preprocess(
            image=segmentation_map,
            do_reduce_labels=do_reduce_labels,
            do_resize=do_resize,
            resample=resample,
            size=size,
            do_center_crop=do_center_crop,
            crop_size=crop_size,
            do_normalize=False,
            do_rescale=False,
            input_data_format=ChannelDimension.FIRST,
        )
        # Remove extra axis if added
        if added_dimension:
            segmentation_map = np.squeeze(segmentation_map, axis=0)
        segmentation_map = segmentation_map.astype(np.int64)
        return segmentation_map

    def __call__(self, images, segmentation_maps=None, **kwargs):
        """
        __call__

        This method processes images and segmentation maps using the BeitImageProcessor.

        Args:
            self (object): The instance of the BeitImageProcessor class.
            images (array-like): The input images to be processed.
            segmentation_maps (array-like, optional): The segmentation maps corresponding to the input images. 
                Defaults to None.

        Returns:
            The preprocessed images (and segmentation maps, if provided), as returned by the parent class's `__call__`.

        Raises:
            None.
        """
        # Overrides the `__call__` method of the `Preprocessor` class such that the images and segmentation maps can both
        # be passed in as positional arguments.
        return super().__call__(images, segmentation_maps=segmentation_maps, **kwargs)

    def preprocess(
        self,
        images: ImageInput,
        segmentation_maps: Optional[ImageInput] = None,
        do_resize: bool = None,
        size: Dict[str, int] = None,
        resample: PILImageResampling = None,
        do_center_crop: bool = None,
        crop_size: Dict[str, int] = None,
        do_rescale: bool = None,
        rescale_factor: float = None,
        do_normalize: bool = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_reduce_labels: Optional[bool] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        data_format: ChannelDimension = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> PIL.Image.Image:
        """
        Preprocess an image or batch of images.

        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            segmentation_maps (`ImageInput`, *optional*):
                Segmentation maps to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the image after resizing.
            resample (`int`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
                has an effect if `do_resize` is set to `True`.
            do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
                Whether to center crop the image.
            crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
                Size of the image after center crop. If one edge of the image is smaller than `crop_size`, it will be
                padded with zeros and then cropped.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image values between [0 - 1].
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation.
            do_reduce_labels (`bool`, *optional*, defaults to `self.do_reduce_labels`):
                Whether or not to reduce all label values of segmentation maps by 1. Usually used for datasets where 0
                is used for background, and background itself is not included in all classes of a dataset (e.g.
                ADE20k). The background label will be replaced by 255.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:

                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - Unset: Use the channel dimension format of the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """
        do_resize = do_resize if do_resize is not None else self.do_resize
        size = size if size is not None else self.size
        size = get_size_dict(size, default_to_square=True, param_name="size")
        resample = resample if resample is not None else self.resample
        do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
        crop_size = crop_size if crop_size is not None else self.crop_size
        crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
        image_mean = image_mean if image_mean is not None else self.image_mean
        image_std = image_std if image_std is not None else self.image_std
        do_reduce_labels = do_reduce_labels if do_reduce_labels is not None else self.do_reduce_labels

        validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)

        images = make_list_of_images(images)

        if segmentation_maps is not None:
            segmentation_maps = make_list_of_images(segmentation_maps, expected_ndims=2)

        if segmentation_maps is not None and not valid_images(segmentation_maps):
            raise ValueError(
                "Invalid segmentation_maps type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )
        if not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )

        validate_preprocess_arguments(
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            do_center_crop=do_center_crop,
            crop_size=crop_size,
            do_resize=do_resize,
            size=size,
            resample=resample,
        )

        images = [
            self._preprocess_image(
                image=img,
                do_resize=do_resize,
                do_center_crop=do_center_crop,
                do_rescale=do_rescale,
                do_normalize=do_normalize,
                resample=resample,
                size=size,
                rescale_factor=rescale_factor,
                crop_size=crop_size,
                image_mean=image_mean,
                image_std=image_std,
                data_format=data_format,
                input_data_format=input_data_format,
            )
            for img in images
        ]

        data = {"pixel_values": images}

        if segmentation_maps is not None:
            segmentation_maps = [
                self._preprocess_segmentation_map(
                    segmentation_map=segmentation_map,
                    do_reduce_labels=do_reduce_labels,
                    do_resize=do_resize,
                    resample=resample,
                    size=size,
                    do_center_crop=do_center_crop,
                    crop_size=crop_size,
                )
                for segmentation_map in segmentation_maps
            ]
            data["labels"] = segmentation_maps

        return BatchFeature(data=data, tensor_type=return_tensors)

    def post_process_semantic_segmentation(self, outputs, target_sizes: List[Tuple] = None):
        """
        Converts the output of [`BeitForSemanticSegmentation`] into semantic segmentation maps. Only supports MindSpore.

        Args:
            outputs ([`BeitForSemanticSegmentation`]):
                Raw outputs of the model.
            target_sizes (`List[Tuple]` of length `batch_size`, *optional*):
                List of tuples corresponding to the requested final size (height, width) of each prediction. If unset,
                predictions will not be resized.

        Returns:
            semantic_segmentation: `List[mindspore.Tensor]` of length `batch_size`, where each item is a semantic
                segmentation map of shape (height, width) corresponding to the target_sizes entry (if `target_sizes` is
                specified). Each entry of each `mindspore.Tensor` corresponds to a semantic class id.
        """
        # TODO: add support for other frameworks
        logits = outputs.logits

        # Resize logits and compute semantic segmentation maps
        if target_sizes is not None:
            if len(logits) != len(target_sizes):
                raise ValueError(
                    "Make sure that you pass in as many target sizes as the batch dimension of the logits"
                )

            if is_mindspore_tensor(target_sizes):
                target_sizes = target_sizes.numpy()

            semantic_segmentation = []

            for idx in range(len(logits)):
                resized_logits = ops.interpolate(
                    logits[idx].unsqueeze(0), size=target_sizes[idx], mode="bilinear", align_corners=False
                )
                semantic_map = resized_logits[0].argmax(axis=0)
                semantic_segmentation.append(semantic_map)
        else:
            semantic_segmentation = logits.argmax(axis=1)
            semantic_segmentation = [semantic_segmentation[i] for i in range(semantic_segmentation.shape[0])]

        return semantic_segmentation

mindnlp.transformers.models.beit.image_processing_beit.BeitImageProcessor.__call__(images, segmentation_maps=None, **kwargs)

call

This method processes images and segmentation maps using the BeitImageProcessor.

PARAMETER DESCRIPTION
self

The instance of the BeitImageProcessor class.

TYPE: object

images

The input images to be processed.

TYPE: array-like

segmentation_maps

The segmentation maps corresponding to the input images. Defaults to None.

TYPE: array-like DEFAULT: None

RETURNS DESCRIPTION
BatchFeature

The batch of preprocessed pixel values (and labels, if segmentation maps were provided), as returned by preprocess.

Source code in mindnlp/transformers/models/beit/image_processing_beit.py, lines 399-419
def __call__(self, images, segmentation_maps=None, **kwargs):
    """
    __call__

    This method processes images and segmentation maps using the BeitImageProcessor.

    Args:
        self (object): The instance of the BeitImageProcessor class.
        images (array-like): The input images to be processed.
        segmentation_maps (array-like, optional): The segmentation maps corresponding to the input images. 
            Defaults to None.

    Returns:
        BatchFeature: The batch of preprocessed pixel values (and labels, if segmentation maps were provided), as returned by `preprocess`.

    Raises:
        None.
    """
    # Overrides the `__call__` method of the `Preprocessor` class such that the images and segmentation maps can both
    # be passed in as positional arguments.
    return super().__call__(images, segmentation_maps=segmentation_maps, **kwargs)
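
A minimal usage sketch (the array shapes below are illustrative assumptions, not values from the library docs): calling the processor directly forwards both inputs to `preprocess`, so an image and its segmentation map can be passed positionally.

```python
import numpy as np
from mindnlp.transformers.models.beit.image_processing_beit import BeitImageProcessor

processor = BeitImageProcessor()  # defaults: resize to 256x256, center crop to 224x224

# Illustrative inputs: one RGB image (H, W, C) and one integer segmentation map (H, W).
image = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
seg_map = np.random.randint(0, 150, size=(480, 640), dtype=np.uint8)

# Equivalent to processor.preprocess(image, segmentation_maps=seg_map, return_tensors="np")
batch = processor(image, seg_map, return_tensors="np")
print(batch["pixel_values"].shape)  # (1, 3, 224, 224)
print(batch["labels"].shape)        # (1, 224, 224)
```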

mindnlp.transformers.models.beit.image_processing_beit.BeitImageProcessor.__init__(do_resize=True, size=None, resample=PILImageResampling.BICUBIC, do_center_crop=True, crop_size=None, rescale_factor=1 / 255, do_rescale=True, do_normalize=True, image_mean=None, image_std=None, do_reduce_labels=False, **kwargs)

Initializes an instance of the BeitImageProcessor class.

PARAMETER DESCRIPTION
self

The instance of the class.

do_resize

Flag indicating whether to resize the image. Defaults to True.

TYPE: bool DEFAULT: True

size

The target size of the image as a dictionary with 'height' and 'width' keys. Defaults to {'height': 256, 'width': 256}.

TYPE: Dict[str, int] DEFAULT: None

resample

The resampling algorithm to be used during resizing. Defaults to PILImageResampling.BICUBIC.

TYPE: PILImageResampling DEFAULT: BICUBIC

do_center_crop

Flag indicating whether to perform center cropping. Defaults to True.

TYPE: bool DEFAULT: True

crop_size

The size of the center crop as a dictionary with 'height' and 'width' keys. Defaults to {'height': 224, 'width': 224}.

TYPE: Dict[str, int] DEFAULT: None

rescale_factor

The factor by which to rescale the image. Defaults to 1 / 255.

TYPE: Union[int, float] DEFAULT: 1 / 255

do_rescale

Flag indicating whether to rescale the image. Defaults to True.

TYPE: bool DEFAULT: True

do_normalize

Flag indicating whether to normalize the image. Defaults to True.

TYPE: bool DEFAULT: True

image_mean

The mean values for image normalization. Defaults to None.

TYPE: Optional[Union[float, List[float]]] DEFAULT: None

image_std

The standard deviation values for image normalization. Defaults to None.

TYPE: Optional[Union[float, List[float]]] DEFAULT: None

do_reduce_labels

Flag indicating whether to reduce labels. Defaults to False.

TYPE: bool DEFAULT: False

**kwargs

Additional keyword arguments.

DEFAULT: {}

RETURNS DESCRIPTION
None

None

RAISES DESCRIPTION
FutureWarning

If the 'reduce_labels' parameter is used. This parameter is deprecated and will be removed in a future version. Please use 'do_reduce_labels' instead.

Source code in mindnlp/transformers/models/beit/image_processing_beit.py, lines 95-177
def __init__(
    self,
    do_resize: bool = True,
    size: Dict[str, int] = None,
    resample: PILImageResampling = PILImageResampling.BICUBIC,
    do_center_crop: bool = True,
    crop_size: Dict[str, int] = None,
    rescale_factor: Union[int, float] = 1 / 255,
    do_rescale: bool = True,
    do_normalize: bool = True,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    do_reduce_labels: bool = False,
    **kwargs,
) -> None:
    """
    Initializes an instance of the BeitImageProcessor class.

    Args:
        self: The instance of the class.
        do_resize (bool, optional): Flag indicating whether to resize the image. Defaults to True.
        size (Dict[str, int], optional): The target size of the image as a dictionary with 'height' and 'width' keys. Defaults to {'height': 256, 'width': 256}.
        resample (PILImageResampling, optional): The resampling algorithm to be used during resizing. Defaults to PILImageResampling.BICUBIC.
        do_center_crop (bool, optional): Flag indicating whether to perform center cropping. Defaults to True.
        crop_size (Dict[str, int], optional): The size of the center crop as a dictionary with 'height' and 'width' keys. Defaults to {'height': 224, 'width': 224}.
        rescale_factor (Union[int, float], optional): The factor by which to rescale the image. Defaults to 1 / 255.
        do_rescale (bool, optional): Flag indicating whether to rescale the image. Defaults to True.
        do_normalize (bool, optional): Flag indicating whether to normalize the image. Defaults to True.
        image_mean (Optional[Union[float, List[float]]], optional): The mean values for image normalization. Defaults to None.
        image_std (Optional[Union[float, List[float]]], optional): The standard deviation values for image normalization. Defaults to None.
        do_reduce_labels (bool, optional): Flag indicating whether to reduce labels. Defaults to False.
        **kwargs: Additional keyword arguments.

    Returns:
        None

    Raises:
        FutureWarning:
            If the 'reduce_labels' parameter is used.
            This parameter is deprecated and will be removed in a future version.
            Please use 'do_reduce_labels' instead.
    """
    if "reduce_labels" in kwargs:
        warnings.warn(
            "The `reduce_labels` parameter is deprecated and will be removed in a future version. Please use"
            " `do_reduce_labels` instead.",
            FutureWarning,
        )
        do_reduce_labels = kwargs.pop("reduce_labels")
    super().__init__(**kwargs)
    size = size if size is not None else {"height": 256, "width": 256}
    size = get_size_dict(size)
    crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224}
    crop_size = get_size_dict(crop_size, param_name="crop_size")
    self.do_resize = do_resize
    self.size = size
    self.resample = resample
    self.do_center_crop = do_center_crop
    self.crop_size = crop_size
    self.do_rescale = do_rescale
    self.rescale_factor = rescale_factor
    self.do_normalize = do_normalize
    self.image_mean = image_mean if image_mean is not None else IMAGENET_STANDARD_MEAN
    self.image_std = image_std if image_std is not None else IMAGENET_STANDARD_STD
    self.do_reduce_labels = do_reduce_labels
    self._valid_processor_keys = [
        "images",
        "segmentation_maps",
        "do_resize",
        "size",
        "resample",
        "do_center_crop",
        "crop_size",
        "do_rescale",
        "rescale_factor",
        "do_normalize",
        "image_mean",
        "image_std",
        "do_reduce_labels",
        "return_tensors",
        "data_format",
        "input_data_format",
    ]
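
A short construction sketch (the sizes are illustrative assumptions, not defaults of any released checkpoint): `size` and `crop_size` can be overridden at construction time, and `do_reduce_labels` enabled for ADE20k-style label maps where class 0 is background.

```python
from mindnlp.transformers.models.beit.image_processing_beit import BeitImageProcessor

# Resize to 512x512, then center-crop to 448x448; shift segmentation labels down by one
# so that the background class 0 becomes the ignore index 255.
processor = BeitImageProcessor(
    size={"height": 512, "width": 512},
    crop_size={"height": 448, "width": 448},
    do_reduce_labels=True,
)
print(processor.size, processor.crop_size, processor.do_reduce_labels)
```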

mindnlp.transformers.models.beit.image_processing_beit.BeitImageProcessor.from_dict(image_processor_dict, **kwargs) classmethod

Overrides the from_dict method from the base class to make sure reduce_labels is updated if image processor is created using from_dict and kwargs e.g. BeitImageProcessor.from_pretrained(checkpoint, reduce_labels=True)

Source code in mindnlp/transformers/models/beit/image_processing_beit.py, lines 179-188
@classmethod
def from_dict(cls, image_processor_dict: Dict[str, Any], **kwargs):
    """
    Overrides the `from_dict` method from the base class to make sure `reduce_labels` is updated if image processor
    is created using from_dict and kwargs e.g. `BeitImageProcessor.from_pretrained(checkpoint, reduce_labels=True)`
    """
    image_processor_dict = image_processor_dict.copy()
    if "reduce_labels" in kwargs:
        image_processor_dict["reduce_labels"] = kwargs.pop("reduce_labels")
    return super().from_dict(image_processor_dict, **kwargs)
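
A minimal sketch of the override's effect (the dict values are arbitrary assumptions): a legacy `reduce_labels` keyword passed alongside a configuration dict is folded back into the dict, so it reaches `__init__`, triggers the deprecation warning, and ends up as `do_reduce_labels`.

```python
from mindnlp.transformers.models.beit.image_processing_beit import BeitImageProcessor

config_dict = {"do_resize": True, "size": {"height": 256, "width": 256}}
# The deprecated `reduce_labels` kwarg is merged into the dict by the override above;
# __init__ then emits a FutureWarning and maps it onto `do_reduce_labels`.
processor = BeitImageProcessor.from_dict(config_dict, reduce_labels=True)
print(processor.do_reduce_labels)  # True
```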

mindnlp.transformers.models.beit.image_processing_beit.BeitImageProcessor.post_process_semantic_segmentation(outputs, target_sizes=None)

Converts the output of [BeitForSemanticSegmentation] into semantic segmentation maps. Only supports MindSpore.

PARAMETER DESCRIPTION
outputs

Raw outputs of the model.

TYPE: [`BeitForSemanticSegmentation`]

target_sizes

List of tuples corresponding to the requested final size (height, width) of each prediction. If unset, predictions will not be resized.

TYPE: `List[Tuple]` of length `batch_size`, *optional* DEFAULT: None

RETURNS DESCRIPTION
semantic_segmentation

List[mindspore.Tensor] of length batch_size, where each item is a semantic segmentation map of shape (height, width) corresponding to the target_sizes entry (if target_sizes is specified). Each entry of each mindspore.Tensor corresponds to a semantic class id.

Source code in mindnlp/transformers/models/beit/image_processing_beit.py, lines 582-623
def post_process_semantic_segmentation(self, outputs, target_sizes: List[Tuple] = None):
    """
    Converts the output of [`BeitForSemanticSegmentation`] into semantic segmentation maps. Only supports MindSpore.

    Args:
        outputs ([`BeitForSemanticSegmentation`]):
            Raw outputs of the model.
        target_sizes (`List[Tuple]` of length `batch_size`, *optional*):
            List of tuples corresponding to the requested final size (height, width) of each prediction. If unset,
            predictions will not be resized.

    Returns:
        semantic_segmentation: `List[mindspore.Tensor]` of length `batch_size`, where each item is a semantic
            segmentation map of shape (height, width) corresponding to the target_sizes entry (if `target_sizes` is
            specified). Each entry of each `mindspore.Tensor` corresponds to a semantic class id.
    """
    # TODO: add support for other frameworks
    logits = outputs.logits

    # Resize logits and compute semantic segmentation maps
    if target_sizes is not None:
        if len(logits) != len(target_sizes):
            raise ValueError(
                "Make sure that you pass in as many target sizes as the batch dimension of the logits"
            )

        if is_mindspore_tensor(target_sizes):
            target_sizes = target_sizes.numpy()

        semantic_segmentation = []

        for idx in range(len(logits)):
            resized_logits = ops.interpolate(
                logits[idx].unsqueeze(0), size=target_sizes[idx], mode="bilinear", align_corners=False
            )
            semantic_map = resized_logits[0].argmax(axis=0)
            semantic_segmentation.append(semantic_map)
    else:
        semantic_segmentation = logits.argmax(axis=1)
        semantic_segmentation = [semantic_segmentation[i] for i in range(semantic_segmentation.shape[0])]

    return semantic_segmentation
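
A small post-processing sketch using randomly generated logits in place of real model output (the shapes, class count, and the `SimpleNamespace` stand-in are assumptions; in practice `outputs` comes from `BeitForSemanticSegmentation`).

```python
import numpy as np
import mindspore
from types import SimpleNamespace
from mindnlp.transformers.models.beit.image_processing_beit import BeitImageProcessor

processor = BeitImageProcessor()

# Stand-in for model output: logits of shape (batch, num_classes, height, width).
logits = mindspore.Tensor(np.random.randn(2, 150, 160, 160).astype(np.float32))
outputs = SimpleNamespace(logits=logits)

# Without target_sizes the maps keep the logits' spatial resolution.
maps = processor.post_process_semantic_segmentation(outputs)
print(len(maps), maps[0].shape)  # 2 (160, 160)
```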

mindnlp.transformers.models.beit.image_processing_beit.BeitImageProcessor.preprocess(images, segmentation_maps=None, do_resize=None, size=None, resample=None, do_center_crop=None, crop_size=None, do_rescale=None, rescale_factor=None, do_normalize=None, image_mean=None, image_std=None, do_reduce_labels=None, return_tensors=None, data_format=ChannelDimension.FIRST, input_data_format=None, **kwargs)

Preprocess an image or batch of images.

PARAMETER DESCRIPTION
images

Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.

TYPE: `ImageInput`

segmentation_maps

Segmentation maps to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.

TYPE: `ImageInput`, *optional* DEFAULT: None

do_resize

Whether to resize the image.

TYPE: `bool`, *optional*, defaults to `self.do_resize` DEFAULT: None

size

Size of the image after resizing.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.size` DEFAULT: None

resample

Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.

TYPE: `int`, *optional*, defaults to `self.resample` DEFAULT: None

do_center_crop

Whether to center crop the image.

TYPE: `bool`, *optional*, defaults to `self.do_center_crop` DEFAULT: None

crop_size

Size of the image after center crop. If one edge of the image is smaller than crop_size, it will be padded with zeros and then cropped.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.crop_size` DEFAULT: None

do_rescale

Whether to rescale the image values between [0 - 1].

TYPE: `bool`, *optional*, defaults to `self.do_rescale` DEFAULT: None

rescale_factor

Rescale factor to rescale the image by if do_rescale is set to True.

TYPE: `float`, *optional*, defaults to `self.rescale_factor` DEFAULT: None

do_normalize

Whether to normalize the image.

TYPE: `bool`, *optional*, defaults to `self.do_normalize` DEFAULT: None

image_mean

Image mean.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.image_mean` DEFAULT: None

image_std

Image standard deviation.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.image_std` DEFAULT: None

do_reduce_labels

Whether or not to reduce all label values of segmentation maps by 1. Usually used for datasets where 0 is used for background, and background itself is not included in all classes of a dataset (e.g. ADE20k). The background label will be replaced by 255.

TYPE: `bool`, *optional*, defaults to `self.do_reduce_labels` DEFAULT: None

return_tensors

The type of tensors to return. Can be one of:

  • Unset: Return a list of np.ndarray.
  • TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
  • TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
  • TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
  • TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.

TYPE: `str` or `TensorType`, *optional* DEFAULT: None

data_format

The channel dimension format for the output image. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • Unset: Use the channel dimension format of the input image.

TYPE: `ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST` DEFAULT: FIRST

input_data_format

The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • "none" or ChannelDimension.NONE: image in (height, width) format.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/beit/image_processing_beit.py, lines 421-580
def preprocess(
    self,
    images: ImageInput,
    segmentation_maps: Optional[ImageInput] = None,
    do_resize: bool = None,
    size: Dict[str, int] = None,
    resample: PILImageResampling = None,
    do_center_crop: bool = None,
    crop_size: Dict[str, int] = None,
    do_rescale: bool = None,
    rescale_factor: float = None,
    do_normalize: bool = None,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    do_reduce_labels: Optional[bool] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    data_format: ChannelDimension = ChannelDimension.FIRST,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> PIL.Image.Image:
    """
    Preprocess an image or batch of images.

    Args:
        images (`ImageInput`):
            Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
            passing in images with pixel values between 0 and 1, set `do_rescale=False`.
        segmentation_maps (`ImageInput`, *optional*):
            Segmentation maps to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
            passing in images with pixel values between 0 and 1, set `do_rescale=False`.
        do_resize (`bool`, *optional*, defaults to `self.do_resize`):
            Whether to resize the image.
        size (`Dict[str, int]`, *optional*, defaults to `self.size`):
            Size of the image after resizing.
        resample (`int`, *optional*, defaults to `self.resample`):
            Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
            has an effect if `do_resize` is set to `True`.
        do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
            Whether to center crop the image.
        crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
            Size of the image after center crop. If one edge of the image is smaller than `crop_size`, it will be
            padded with zeros and then cropped.
        do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
            Whether to rescale the image values between [0 - 1].
        rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
            Rescale factor to rescale the image by if `do_rescale` is set to `True`.
        do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
            Whether to normalize the image.
        image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
            Image mean.
        image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
            Image standard deviation.
        do_reduce_labels (`bool`, *optional*, defaults to `self.do_reduce_labels`):
            Whether or not to reduce all label values of segmentation maps by 1. Usually used for datasets where 0
            is used for background, and background itself is not included in all classes of a dataset (e.g.
            ADE20k). The background label will be replaced by 255.
        return_tensors (`str` or `TensorType`, *optional*):
            The type of tensors to return. Can be one of:

            - Unset: Return a list of `np.ndarray`.
            - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
            - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
            - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
            - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
        data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
            The channel dimension format for the output image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - Unset: Use the channel dimension format of the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image. If unset, the channel dimension format is inferred
            from the input image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
    """
    do_resize = do_resize if do_resize is not None else self.do_resize
    size = size if size is not None else self.size
    size = get_size_dict(size, default_to_square=True, param_name="size")
    resample = resample if resample is not None else self.resample
    do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
    crop_size = crop_size if crop_size is not None else self.crop_size
    crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
    do_rescale = do_rescale if do_rescale is not None else self.do_rescale
    rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
    do_normalize = do_normalize if do_normalize is not None else self.do_normalize
    image_mean = image_mean if image_mean is not None else self.image_mean
    image_std = image_std if image_std is not None else self.image_std
    do_reduce_labels = do_reduce_labels if do_reduce_labels is not None else self.do_reduce_labels

    validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)

    images = make_list_of_images(images)

    if segmentation_maps is not None:
        segmentation_maps = make_list_of_images(segmentation_maps, expected_ndims=2)

    if segmentation_maps is not None and not valid_images(segmentation_maps):
        raise ValueError(
            "Invalid segmentation_maps type. Must be of type PIL.Image.Image, numpy.ndarray, "
            "torch.Tensor, tf.Tensor or jax.ndarray."
        )
    if not valid_images(images):
        raise ValueError(
            "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
            "torch.Tensor, tf.Tensor or jax.ndarray."
        )

    validate_preprocess_arguments(
        do_rescale=do_rescale,
        rescale_factor=rescale_factor,
        do_normalize=do_normalize,
        image_mean=image_mean,
        image_std=image_std,
        do_center_crop=do_center_crop,
        crop_size=crop_size,
        do_resize=do_resize,
        size=size,
        resample=resample,
    )

    images = [
        self._preprocess_image(
            image=img,
            do_resize=do_resize,
            do_center_crop=do_center_crop,
            do_rescale=do_rescale,
            do_normalize=do_normalize,
            resample=resample,
            size=size,
            rescale_factor=rescale_factor,
            crop_size=crop_size,
            image_mean=image_mean,
            image_std=image_std,
            data_format=data_format,
            input_data_format=input_data_format,
        )
        for img in images
    ]

    data = {"pixel_values": images}

    if segmentation_maps is not None:
        segmentation_maps = [
            self._preprocess_segmentation_map(
                segmentation_map=segmentation_map,
                do_reduce_labels=do_reduce_labels,
                do_resize=do_resize,
                resample=resample,
                size=size,
                do_center_crop=do_center_crop,
                crop_size=crop_size,
            )
            for segmentation_map in segmentation_maps
        ]
        data["labels"] = segmentation_maps

    return BatchFeature(data=data, tensor_type=return_tensors)
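
An end-to-end preprocessing sketch with assumed array shapes: two RGB images and their label maps run through the default resize/crop/rescale/normalize pipeline with `do_reduce_labels=True`, returned as stacked NumPy arrays.

```python
import numpy as np
from mindnlp.transformers.models.beit.image_processing_beit import BeitImageProcessor

processor = BeitImageProcessor(do_reduce_labels=True)

images = [np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8) for _ in range(2)]
seg_maps = [np.random.randint(0, 150, size=(480, 640), dtype=np.uint8) for _ in range(2)]

batch = processor.preprocess(images, segmentation_maps=seg_maps, return_tensors="np")
print(batch["pixel_values"].shape)  # (2, 3, 224, 224) after resize to 256 and center crop to 224
print(batch["labels"].shape)        # (2, 224, 224), int64 class ids with 0 remapped to 255
```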

mindnlp.transformers.models.beit.image_processing_beit.BeitImageProcessor.reduce_label(label)

Reduce the label values in the input image.

PARAMETER DESCRIPTION
self

BeitImageProcessor The instance of the BeitImageProcessor class.

label

ImageInput The input label image to be processed. It should be a valid ImageInput object.

TYPE: ImageInput

RETURNS DESCRIPTION
ndarray

np.ndarray: Returns a numpy array representing the reduced label image.

Source code in mindnlp/transformers/models/beit/image_processing_beit.py, lines 226-253
def reduce_label(self, label: ImageInput) -> np.ndarray:
    """
    Reduce the label values in the input image.

    Args:
        self: BeitImageProcessor
            The instance of the BeitImageProcessor class.
        label: ImageInput
            The input label image to be processed. It should be a valid ImageInput object.

    Returns:
        np.ndarray:
            Returns a numpy array representing the reduced label image.

    Raises:
        ValueError
            If the input label is not a valid ImageInput object.
        TypeError
            If the input label cannot be converted to a numpy array.
        IndexError
            If the label array indexing operation fails due to invalid indices.
    """
    label = to_numpy_array(label)
    # Avoid using underflow conversion
    label[label == 0] = 255
    label = label - 1
    label[label == 254] = 255
    return label
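
A tiny worked example of the remapping implemented above (label values chosen for illustration): class 0 becomes the ignore index 255 and every other class id is shifted down by one.

```python
import numpy as np
from mindnlp.transformers.models.beit.image_processing_beit import BeitImageProcessor

processor = BeitImageProcessor()
label = np.array([[0, 1, 2],
                  [3, 0, 150]], dtype=np.uint8)
print(processor.reduce_label(label))
# [[255   0   1]
#  [  2 255 149]]
```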

mindnlp.transformers.models.beit.image_processing_beit.BeitImageProcessor.resize(image, size, resample=PILImageResampling.BICUBIC, data_format=None, input_data_format=None, **kwargs)

Resize an image to (size["height"], size["width"]).

PARAMETER DESCRIPTION
image

Image to resize.

TYPE: `np.ndarray`

size

Size of the output image.

TYPE: `Dict[str, int]`

resample

Resampling filter to use when resizing the image.

TYPE: `PILImageResampling`, *optional*, defaults to `PIL.Image.BICUBIC` DEFAULT: BICUBIC

data_format

The channel dimension format of the image. If not provided, it will be the same as the input image.

TYPE: `str` or `ChannelDimension`, *optional* DEFAULT: None

input_data_format

The channel dimension format of the input image. If not provided, it will be inferred.

TYPE: `str` or `ChannelDimension`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/beit/image_processing_beit.py, lines 190-224
def resize(
    self,
    image: np.ndarray,
    size: Dict[str, int],
    resample: PILImageResampling = PILImageResampling.BICUBIC,
    data_format: Optional[Union[str, ChannelDimension]] = None,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> np.ndarray:
    """
    Resize an image to (size["height"], size["width"]).

    Args:
        image (`np.ndarray`):
            Image to resize.
        size (`Dict[str, int]`):
            Size of the output image.
        resample (`PILImageResampling`, *optional*, defaults to `PIL.Image.BICUBIC`):
            Resampling filter to use when resizing the image.
        data_format (`str` or `ChannelDimension`, *optional*):
            The channel dimension format of the image. If not provided, it will be the same as the input image.
        input_data_format (`str` or `ChannelDimension`, *optional*):
            The channel dimension format of the input image. If not provided, it will be inferred.
    """
    size = get_size_dict(size, default_to_square=True, param_name="size")
    if "height" not in size or "width" not in size:
        raise ValueError(f"The `size` argument must contain `height` and `width` keys. Got {size.keys()}")
    return resize(
        image,
        size=(size["height"], size["width"]),
        resample=resample,
        data_format=data_format,
        input_data_format=input_data_format,
        **kwargs,
    )
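
A quick sketch of `resize` on a NumPy image (shapes are illustrative assumptions): when `input_data_format` is not given, the channel dimension format is inferred from the input, so a channels-last image stays channels-last.

```python
import numpy as np
from mindnlp.transformers.models.beit.image_processing_beit import BeitImageProcessor

processor = BeitImageProcessor()
image = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)  # (H, W, C)

resized = processor.resize(image, size={"height": 256, "width": 256})
print(resized.shape)  # (256, 256, 3)
```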

mindnlp.transformers.models.beit.modeling_beit.BEIT_PRETRAINED_MODEL_ARCHIVE_LIST = ['microsoft/beit-base-patch16-224'] module-attribute

mindnlp.transformers.models.beit.modeling_beit.BeitForImageClassification

Bases: BeitPreTrainedModel

This class is an implementation of the Beit model for image classification tasks. It is designed to be used for both single-label and multi-label classification as well as regression tasks.

The class inherits from the BeitPreTrainedModel, which provides the basic architecture and functionality of the Beit model.

ATTRIBUTE DESCRIPTION
num_labels

The number of labels in the classification task.

TYPE: int

beit

The Beit model for image classification.

TYPE: BeitModel

classifier

The classifier layer for predicting the labels.

TYPE: Linear or Identity

METHOD DESCRIPTION
__init__

Initializes a new instance of the BeitForImageClassification class.

forward

Constructs the forward pass of the model for image classification. This method takes input pixel values, head mask, labels, and other optional arguments, and returns the output of the model.

Source code in mindnlp/transformers/models/beit/modeling_beit.py, lines 1513-1618
class BeitForImageClassification(BeitPreTrainedModel):

    """
    This class is an implementation of the Beit model for image classification tasks. 
    It is designed to be used for both single-label and multi-label classification as well as regression tasks.

    The class inherits from the BeitPreTrainedModel, which provides the basic architecture and functionality of the Beit model.

    Attributes:
        num_labels (int): The number of labels in the classification task.
        beit (BeitModel): The Beit model for image classification.
        classifier (nn.Linear or nn.Identity): The classifier layer for predicting the labels.

    Methods:
        __init__:
            Initializes a new instance of the BeitForImageClassification class.

        forward:
            Constructs the forward pass of the model for image classification.
            This method takes input pixel values, head mask, labels, and other optional arguments, 
            and returns the output of the model.
    """
    _keys_to_ignore_on_load_unexpected = [r"relative_position_index", r"num_batches_tracked"]
    def __init__(self, config: BeitConfig) -> None:
        """
        Initializes a new instance of the `BeitForImageClassification` class.

        Args:
            self: The object instance.
            config (BeitConfig): The configuration object that holds various hyperparameters and settings for the model.

        Returns:
            None

        Raises:
            None
        """
        super().__init__(config)

        self.num_labels = config.num_labels
        self.beit = BeitModel(config, add_pooling_layer=True)

        # Classifier head
        self.classifier = nn.Linear(config.hidden_size, config.num_labels) if config.num_labels > 0 else nn.Identity()

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, ImageClassifierOutput]:
        r"""
        Args:
            labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
                Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
                config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
                `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        outputs = self.beit(
            pixel_values,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = outputs.pooler_output if return_dict else outputs[1]

        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                if self.num_labels == 1:
                    loss = ops.mse_loss(logits.squeeze(), labels.squeeze())
                else:
                    loss = ops.mse_loss(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss = F.cross_entropy(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss = ops.binary_cross_entropy_with_logits(logits, labels)
        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return ImageClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

mindnlp.transformers.models.beit.modeling_beit.BeitForImageClassification.__init__(config)

Initializes a new instance of the BeitForImageClassification class.

PARAMETER DESCRIPTION
self

The object instance.

config

The configuration object that holds various hyperparameters and settings for the model.

TYPE: BeitConfig

RETURNS DESCRIPTION
None

None

Source code in mindnlp/transformers/models/beit/modeling_beit.py, lines 1536-1559
def __init__(self, config: BeitConfig) -> None:
    """
    Initializes a new instance of the `BeitForImageClassification` class.

    Args:
        self: The object instance.
        config (BeitConfig): The configuration object that holds various hyperparameters and settings for the model.

    Returns:
        None

    Raises:
        None
    """
    super().__init__(config)

    self.num_labels = config.num_labels
    self.beit = BeitModel(config, add_pooling_layer=True)

    # Classifier head
    self.classifier = nn.Linear(config.hidden_size, config.num_labels) if config.num_labels > 0 else nn.Identity()

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.beit.modeling_beit.BeitForImageClassification.forward(pixel_values=None, head_mask=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).

TYPE: `torch.LongTensor` of shape `(batch_size,)`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/beit/modeling_beit.py, lines 1561-1618
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[tuple, ImageClassifierOutput]:
    r"""
    Args:
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    outputs = self.beit(
        pixel_values,
        head_mask=head_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = outputs.pooler_output if return_dict else outputs[1]

    logits = self.classifier(pooled_output)

    loss = None
    if labels is not None:
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            if self.num_labels == 1:
                loss = ops.mse_loss(logits.squeeze(), labels.squeeze())
            else:
                loss = ops.mse_loss(logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss = F.cross_entropy(logits.view(-1, self.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss = ops.binary_cross_entropy_with_logits(logits, labels)
    if not return_dict:
        output = (logits,) + outputs[2:]
        return ((loss,) + output) if loss is not None else output

    return ImageClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
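
A forward-pass sketch with a randomly initialized model, only to show input/output shapes and the label-driven loss selection described above; real use would load pretrained weights (e.g. via `from_pretrained`), and the config values here are assumptions.

```python
import numpy as np
import mindspore
from mindnlp.transformers.models.beit.configuration_beit import BeitConfig
from mindnlp.transformers.models.beit.modeling_beit import BeitForImageClassification

# Randomly initialized model, purely for shape illustration (assumed 10-class setup).
config = BeitConfig(num_labels=10)
model = BeitForImageClassification(config)

pixel_values = mindspore.Tensor(np.random.rand(1, 3, 224, 224).astype(np.float32))
labels = mindspore.Tensor(np.array([3]), mindspore.int32)

outputs = model(pixel_values, labels=labels)
print(outputs.logits.shape)  # (1, 10)
print(outputs.loss)          # cross-entropy, since num_labels > 1 and labels are integers
```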

mindnlp.transformers.models.beit.modeling_beit.BeitForMaskedImageModeling

Bases: BeitPreTrainedModel

BeitForMaskedImageModeling is a class that represents a model for masked image modeling using the BEiT (Vision Transformer) architecture. It is designed for processing images with masked positions and generating predictions for masked language modeling tasks.

This class inherits from BeitPreTrainedModel and includes methods for initialization and forwarding the model. The init method initializes the model with the provided configuration, setting up various components such as BEiT model, layer normalization, and the LM head.

The forward method takes input pixel values, boolean masked positions, head mask, labels, and other optional parameters to perform masked language modeling on the input image. It processes the input through the BEiT model, applies layer normalization, computes prediction scores, and calculates masked language modeling loss if labels are provided.

The method returns a MaskedLMOutput object containing the masked language modeling loss, prediction scores, hidden states, and attentions. Additionally, the docstring includes examples demonstrating how to use the BeitForMaskedImageModeling class for masked image modeling tasks.

Source code in mindnlp/transformers/models/beit/modeling_beit.py, lines 1384-1510
class BeitForMaskedImageModeling(BeitPreTrainedModel):

    """
    BeitForMaskedImageModeling is a class that represents a model for masked image modeling using the BEiT 
    (Vision Transformer) architecture. 
    It is designed for processing images with masked positions and generating predictions for masked language modeling tasks.

    This class inherits from BeitPreTrainedModel and includes methods for initialization and forwarding the model. 
    The __init__ method initializes the model with the provided configuration, setting up various components such as 
    BEiT model, layer normalization, and the LM head.

    The forward method takes input pixel values, boolean masked positions, head mask, labels, 
    and other optional parameters to perform masked language modeling on the input image. 
    It processes the input through the BEiT model, applies layer normalization, computes prediction scores, 
    and calculates masked language modeling loss if labels are provided.

    The method returns a MaskedLMOutput object containing the masked language modeling loss, prediction scores, 
    hidden states, and attentions. 
    Additionally, the docstring includes examples demonstrating how to  use the BeitForMaskedImageModeling class 
    for masked image modeling tasks.
    """
    _keys_to_ignore_on_load_unexpected = [r"relative_position_index", r"num_batches_tracked"]
    def __init__(self, config: BeitConfig) -> None:
        """
        Initializes a new instance of the BeitForMaskedImageModeling class.

        Args:
            self: The instance of the class.
            config (BeitConfig): 
                The configuration object for the Beit model. It contains various hyperparameters and settings.

                - num_labels (int): The number of labels for classification.

        Returns:
            None

        Raises:
            None
        """
        super().__init__(config)

        self.num_labels = config.num_labels
        self.beit = BeitModel(config, add_pooling_layer=False)

        # Classifier head
        self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        bool_masked_pos: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, MaskedLMOutput]:
        r"""
        Args:
            bool_masked_pos (`mindspore.Tensor` of shape `(batch_size, num_patches)`):
                Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).

            labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
                Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
                config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
                `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

        Returns:
            Union[tuple, MaskedLMOutput]

        Example:
            ```python
            >>> from transformers import AutoImageProcessor, BeitForMaskedImageModeling
            >>> import torch
            >>> from PIL import Image
            >>> import requests
            ... 
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ... 
            >>> image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
            >>> model = BeitForMaskedImageModeling.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
            ... 
            >>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
            >>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
            >>> # create random boolean mask of shape (batch_size, num_patches)
            >>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()
            ... 
            >>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
            >>> loss, logits = outputs.loss, outputs.logits
            >>> list(logits.shape)
            [1, 196, 8192]
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.beit(
            pixel_values,
            bool_masked_pos=bool_masked_pos,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]
        sequence_output = self.layernorm(sequence_output)
        prediction_scores = self.lm_head(sequence_output[:, 1:])

        masked_lm_loss = None
        if labels is not None:
            masked_lm_loss = F.cross_entropy(prediction_scores[bool_masked_pos], labels)

        if not return_dict:
            output = (prediction_scores,) + outputs[1:]
            return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output

        return MaskedLMOutput(
            loss=masked_lm_loss,
            logits=prediction_scores,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

mindnlp.transformers.models.beit.modeling_beit.BeitForMaskedImageModeling.__init__(config)

Initializes a new instance of the BeitForMaskedImageModeling class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object for the Beit model. It contains various hyperparameters and settings.

  • num_labels (int): The number of labels for classification.

TYPE: BeitConfig

RETURNS DESCRIPTION
None

None

Source code in mindnlp/transformers/models/beit/modeling_beit.py
def __init__(self, config: BeitConfig) -> None:
    """
    Initializes a new instance of the BeitForMaskedImageModeling class.

    Args:
        self: The instance of the class.
        config (BeitConfig): 
            The configuration object for the Beit model. It contains various hyperparameters and settings.

            - num_labels (int): The number of labels for classification.

    Returns:
        None

    Raises:
        None
    """
    super().__init__(config)

    self.num_labels = config.num_labels
    self.beit = BeitModel(config, add_pooling_layer=False)

    # Classifier head
    self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
    self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)

    # Initialize weights and apply final processing
    self.post_init()
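
The head built above is a plain linear projection from `hidden_size` to the visual-token vocabulary. The following is a minimal sketch (randomly initialized, deliberately tiny hyperparameters, no pretrained weights) that just inspects that projection; the import paths follow the module paths documented on this page.

```python
from mindnlp.transformers.models.beit.configuration_beit import BeitConfig
from mindnlp.transformers.models.beit.modeling_beit import BeitForMaskedImageModeling

# Tiny, illustrative configuration (not the released checkpoint defaults).
config = BeitConfig(image_size=32, patch_size=16, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64, vocab_size=8192)
model = BeitForMaskedImageModeling(config)

# lm_head maps hidden_size -> vocab_size, i.e. one logit per visual token.
print(model.lm_head.weight.shape)   # (8192, 32)
```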

mindnlp.transformers.models.beit.modeling_beit.BeitForMaskedImageModeling.forward(pixel_values=None, bool_masked_pos=None, head_mask=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
bool_masked_pos

Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).

TYPE: `mindspore.Tensor` of shape `(batch_size, num_patches)` DEFAULT: None

labels

Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (Mean-Square loss); if config.num_labels > 1, a classification loss is computed (Cross-Entropy).

TYPE: `torch.LongTensor` of shape `(batch_size,)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[tuple, MaskedLMOutput]

Union[tuple, MaskedLMOutput]

Example
>>> from transformers import AutoImageProcessor, BeitForMaskedImageModeling
>>> import torch
>>> from PIL import Image
>>> import requests
... 
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
... 
>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
>>> model = BeitForMaskedImageModeling.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
... 
>>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
>>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
>>> # create random boolean mask of shape (batch_size, num_patches)
>>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()
... 
>>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
>>> loss, logits = outputs.loss, outputs.logits
>>> list(logits.shape)
[1, 196, 8192]
Source code in mindnlp/transformers/models/beit/modeling_beit.py
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    bool_masked_pos: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[tuple, MaskedLMOutput]:
    r"""
    Args:
        bool_masked_pos (`mindspore.Tensor` of shape `(batch_size, num_patches)`):
            Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).

        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

    Returns:
        Union[tuple, MaskedLMOutput]

    Example:
        ```python
        >>> from transformers import AutoImageProcessor, BeitForMaskedImageModeling
        >>> import torch
        >>> from PIL import Image
        >>> import requests
        ... 
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ... 
        >>> image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
        >>> model = BeitForMaskedImageModeling.from_pretrained("microsoft/beit-base-patch16-224-pt22k")
        ... 
        >>> num_patches = (model.config.image_size // model.config.patch_size) ** 2
        >>> pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
        >>> # create random boolean mask of shape (batch_size, num_patches)
        >>> bool_masked_pos = torch.randint(low=0, high=2, size=(1, num_patches)).bool()
        ... 
        >>> outputs = model(pixel_values, bool_masked_pos=bool_masked_pos)
        >>> loss, logits = outputs.loss, outputs.logits
        >>> list(logits.shape)
        [1, 196, 8192]
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    outputs = self.beit(
        pixel_values,
        bool_masked_pos=bool_masked_pos,
        head_mask=head_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    sequence_output = outputs[0]
    sequence_output = self.layernorm(sequence_output)
    prediction_scores = self.lm_head(sequence_output[:, 1:])

    masked_lm_loss = None
    if labels is not None:
        masked_lm_loss = F.cross_entropy(prediction_scores[bool_masked_pos], labels)

    if not return_dict:
        output = (prediction_scores,) + outputs[1:]
        return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output

    return MaskedLMOutput(
        loss=masked_lm_loss,
        logits=prediction_scores,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
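
Note that the loss above is computed only over the masked patches: boolean indexing with `bool_masked_pos` flattens the batch and patch dimensions and keeps one row of logits per masked position, which is then compared against that patch's ground-truth visual token id. A small NumPy sketch of the indexing (shapes illustrative for a base-sized model):

```python
import numpy as np

batch_size, num_patches, vocab_size = 1, 196, 8192
prediction_scores = np.random.randn(batch_size, num_patches, vocab_size)
bool_masked_pos = np.random.rand(batch_size, num_patches) > 0.6      # True = masked

# Boolean indexing keeps only the masked positions: (num_masked, vocab_size).
masked_scores = prediction_scores[bool_masked_pos]
labels = np.random.randint(0, vocab_size, size=masked_scores.shape[0])  # visual token ids
print(masked_scores.shape, labels.shape)
```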

mindnlp.transformers.models.beit.modeling_beit.BeitForSemanticSegmentation

Bases: BeitPreTrainedModel

A Python class representing the Beit model for semantic segmentation.

This class inherits from the BeitPreTrainedModel and includes methods for initializing the model, computing loss, and forwarding the model. The BeitForSemanticSegmentation class is designed for semantic segmentation tasks, where it takes input pixel values and produces semantic segmentation maps. It utilizes the BeitModel for feature extraction, applies a series of operations to the extracted features, and outputs logits for semantic segmentation.

The class's __init__ method initializes the model with the specified configuration and sets up the necessary components for semantic segmentation, such as the BeitModel, decoder head, and auxiliary head. The compute_loss method calculates the loss based on the model's predictions and ground truth labels. The forward method processes input pixel values, applies the model, and returns the semantic segmentation logits along with optional outputs like hidden states and attentions. Additionally, it provides detailed information on the expected input format for labels and examples of how to use the class for semantic segmentation tasks.

Source code in mindnlp/transformers/models/beit/modeling_beit.py
class BeitForSemanticSegmentation(BeitPreTrainedModel):

    """
    A Python class representing the Beit model for semantic segmentation.

    This class inherits from the BeitPreTrainedModel and includes methods for initializing the model, computing loss,
    and forwarding the model.
    The `BeitForSemanticSegmentation` class is designed for semantic segmentation tasks, where it takes input pixel values
    and produces semantic segmentation maps.
    It utilizes the BeitModel for feature extraction, applies a series of operations to the extracted features,
    and outputs logits for semantic segmentation.

    The class's `__init__` method initializes the model with the specified configuration and sets up the necessary
    components for semantic segmentation, such as the BeitModel, decoder head, and auxiliary head.
    The `compute_loss` method calculates the loss based on the model's predictions and ground truth labels.
    The `forward` method processes input pixel values, applies the model, and returns the semantic segmentation logits
    along with optional outputs like hidden states and attentions.
    Additionally, it provides detailed information on the expected input format for labels and examples of how to use the class
    for semantic segmentation tasks.

    """
    _keys_to_ignore_on_load_unexpected = [r"relative_position_index", r"num_batches_tracked"]
    def __init__(self, config: BeitConfig) -> None:
        """
        Initializes a new instance of BeitForSemanticSegmentation.

        Args:
            self: The instance of the class.
            config (BeitConfig): The configuration object containing parameters for the model.
                It is used to set up the model architecture and define various hyperparameters.
                Must be an instance of BeitConfig.

        Returns:
            None.

        Raises:
            ValueError: Raised if the length of config.out_indices is not 4, indicating an incorrect specification.
                BeitForSemanticSegmentation requires config.out_indices to be a list of 4 integers,
                specifying which features to use from the backbone.
                Recommended values are [3, 5, 7, 11] for a base-sized architecture.
        """
        super().__init__(config)

        self.num_labels = config.num_labels
        self.beit = BeitModel(config, add_pooling_layer=False)

        # FPNs
        if len(self.config.out_indices) != 4:
            raise ValueError(
                "BeitForSemanticSegmentation requires config.out_indices to be a list of 4 integers, "
                "specifying which features to use from the backbone. One can use [3, 5, 7, 11] in case of "
                "a base-sized architecture."
            )
        self.fpn1 = nn.Sequential(
            nn.ConvTranspose2d(config.hidden_size, config.hidden_size, kernel_size=2, stride=2),
            nn.BatchNorm2d(config.hidden_size),
            nn.GELU(),
            nn.ConvTranspose2d(config.hidden_size, config.hidden_size, kernel_size=2, stride=2),
        )
        self.fpn2 = nn.Sequential(
            nn.ConvTranspose2d(config.hidden_size, config.hidden_size, kernel_size=2, stride=2),
        )
        self.fpn3 = nn.Identity()
        self.fpn4 = nn.MaxPool2d(kernel_size=2, stride=2)

        # Semantic segmentation head(s)
        self.decode_head = BeitUperHead(config)
        self.auxiliary_head = BeitFCNHead(config) if config.use_auxiliary_head else None

        # Initialize weights and apply final processing
        self.post_init()

    def compute_loss(self, logits, auxiliary_logits, labels):
        '''
        This method computes the loss for semantic segmentation using the logits and auxiliary logits, and compares them with the provided labels.

        Args:
            self: (object) The instance of the class BeitForSemanticSegmentation.
            logits: (tensor) The main logits for semantic segmentation.
            auxiliary_logits: (tensor or None) The auxiliary logits for semantic segmentation. It can be None if not provided.
            labels: (tensor) The ground truth labels for semantic segmentation.

        Returns:
            Tensor: The weighted segmentation loss: the main cross-entropy loss, plus `auxiliary_loss_weight` times the auxiliary loss when auxiliary logits are provided.

        Raises:
            ValueError: If the size of logits or auxiliary_logits does not match the size of the labels.
            ValueError: If labels contain values outside the range [0, num_classes-1], where num_classes is the number of classes for semantic segmentation.
            RuntimeError: If the mode specified for interpolation is not supported.
        '''
        # upsample logits to the images' original size
        upsampled_logits = F.interpolate(
            logits, size=labels.shape[-2:], mode="bilinear", align_corners=False
        )
        if auxiliary_logits is not None:
            upsampled_auxiliary_logits = F.interpolate(
                auxiliary_logits, size=labels.shape[-2:], mode="bilinear", align_corners=False
            )
        # compute weighted loss
        main_loss = F.cross_entropy(upsampled_logits, labels, ignore_index=self.config.semantic_loss_ignore_index)
        loss = main_loss
        if auxiliary_logits is not None:
            auxiliary_loss = F.cross_entropy(upsampled_auxiliary_logits, labels, ignore_index=self.config.semantic_loss_ignore_index)
            loss += self.config.auxiliary_loss_weight * auxiliary_loss

        return loss

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, SemanticSegmenterOutput]:
        r"""
        Args:
            labels (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*):
                Ground truth semantic segmentation maps for computing the loss. Indices should be in `[0, ...,
                config.num_labels - 1]`. If `config.num_labels > 1`, a classification loss is computed (Cross-Entropy).

        Returns:
            Union[tuple, SemanticSegmenterOutput]

        Example:
            ```python
            >>> from transformers import AutoImageProcessor, BeitForSemanticSegmentation
            >>> from PIL import Image
            >>> import requests
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-finetuned-ade-640-640")
            >>> model = BeitForSemanticSegmentation.from_pretrained("microsoft/beit-base-finetuned-ade-640-640")
            ...
            >>> inputs = image_processor(images=image, return_tensors="pt")
            >>> outputs = model(**inputs)
            >>> # logits are of shape (batch_size, num_labels, height, width)
            >>> logits = outputs.logits
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )

        outputs = self.beit(
            pixel_values,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=True,  # we need the intermediate hidden states
            return_dict=return_dict,
        )

        encoder_hidden_states = outputs.hidden_states if return_dict else outputs[1]

        # only keep certain features, and reshape
        # note that we do +1 as the encoder_hidden_states also includes the initial embeddings
        features = [feature for idx, feature in enumerate(encoder_hidden_states) if idx + 1 in self.config.out_indices]
        batch_size = pixel_values.shape[0]
        patch_resolution = self.config.image_size // self.config.patch_size
        features = [
            x[:, 1:, :].permute(0, 2, 1).reshape(batch_size, -1, patch_resolution, patch_resolution) for x in features
        ]

        # apply FPNs
        operators = [self.fpn1, self.fpn2, self.fpn3, self.fpn4]
        for i in range(len(features)):
            features[i] = operators[i](features[i])

        logits = self.decode_head(features)

        auxiliary_logits = None
        if self.auxiliary_head is not None:
            auxiliary_logits = self.auxiliary_head(features)

        loss = None
        if labels is not None:
            if self.config.num_labels == 1:
                raise ValueError("The number of labels should be greater than one")
            else:
                loss = self.compute_loss(logits, auxiliary_logits, labels)

        if not return_dict:
            if output_hidden_states:
                output = (logits,) + outputs[1:]
            else:
                output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SemanticSegmenterOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states if output_hidden_states else None,
            attentions=outputs.attentions,
        )
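
As the constructor below enforces, the segmentation head expects `config.out_indices` to name exactly four encoder layers. A hedged configuration sketch (values illustrative: [3, 5, 7, 11] is the base-sized recommendation quoted in the error message, and 150 classes corresponds to ADE20K):

```python
from mindnlp.transformers.models.beit.configuration_beit import BeitConfig

config = BeitConfig(out_indices=[3, 5, 7, 11], num_labels=150)
print(len(config.out_indices), config.num_labels)   # 4 150
```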

mindnlp.transformers.models.beit.modeling_beit.BeitForSemanticSegmentation.__init__(config)

Initializes a new instance of BeitForSemanticSegmentation.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object containing parameters for the model. It is used to set up the model architecture and define various hyperparameters. Must be an instance of BeitConfig.

TYPE: BeitConfig

RETURNS DESCRIPTION
None

None.

RAISES DESCRIPTION
ValueError

Raised if the length of config.out_indices is not 4, indicating an incorrect specification. BeitForSemanticSegmentation requires config.out_indices to be a list of 4 integers, specifying which features to use from the backbone. Recommended values are [3, 5, 7, 11] for a base-sized architecture.

Source code in mindnlp/transformers/models/beit/modeling_beit.py
def __init__(self, config: BeitConfig) -> None:
    """
    Initializes a new instance of BeitForSemanticSegmentation.

    Args:
        self: The instance of the class.
        config (BeitConfig): The configuration object containing parameters for the model.
            It is used to set up the model architecture and define various hyperparameters.
            Must be an instance of BeitConfig.

    Returns:
        None.

    Raises:
        ValueError: Raised if the length of config.out_indices is not 4, indicating an incorrect specification.
            BeitForSemanticSegmentation requires config.out_indices to be a list of 4 integers,
            specifying which features to use from the backbone.
            Recommended values are [3, 5, 7, 11] for a base-sized architecture.
    """
    super().__init__(config)

    self.num_labels = config.num_labels
    self.beit = BeitModel(config, add_pooling_layer=False)

    # FPNs
    if len(self.config.out_indices) != 4:
        raise ValueError(
            "BeitForSemanticSegmentation requires config.out_indices to be a list of 4 integers, "
            "specifying which features to use from the backbone. One can use [3, 5, 7, 11] in case of "
            "a base-sized architecture."
        )
    self.fpn1 = nn.Sequential(
        nn.ConvTranspose2d(config.hidden_size, config.hidden_size, kernel_size=2, stride=2),
        nn.BatchNorm2d(config.hidden_size),
        nn.GELU(),
        nn.ConvTranspose2d(config.hidden_size, config.hidden_size, kernel_size=2, stride=2),
    )
    self.fpn2 = nn.Sequential(
        nn.ConvTranspose2d(config.hidden_size, config.hidden_size, kernel_size=2, stride=2),
    )
    self.fpn3 = nn.Identity()
    self.fpn4 = nn.MaxPool2d(kernel_size=2, stride=2)

    # Semantic segmentation head(s)
    self.decode_head = BeitUperHead(config)
    self.auxiliary_head = BeitFCNHead(config) if config.use_auxiliary_head else None

    # Initialize weights and apply final processing
    self.post_init()
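
The four FPN branches defined above rescale the 14x14 patch grid of a base-sized model to four spatial resolutions: fpn1 upsamples 4x (two stride-2 transposed convolutions), fpn2 upsamples 2x, fpn3 keeps the resolution, and fpn4 downsamples 2x. A numbers-only sketch of the resulting feature-map sizes (assuming image_size=224 and patch_size=16):

```python
patch_resolution = 224 // 16                      # 14x14 patch grid
scales = {"fpn1": 4, "fpn2": 2, "fpn3": 1}        # transposed-conv upsampling factors
sizes = {name: patch_resolution * s for name, s in scales.items()}
sizes["fpn4"] = patch_resolution // 2             # MaxPool2d halves the resolution
print(sizes)   # {'fpn1': 56, 'fpn2': 28, 'fpn3': 14, 'fpn4': 7}
```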

mindnlp.transformers.models.beit.modeling_beit.BeitForSemanticSegmentation.compute_loss(logits, auxiliary_logits, labels)

This method computes the loss for semantic segmentation using the logits and auxiliary logits, and compares them with the provided labels.

PARAMETER DESCRIPTION
self

(object) The instance of the class BeitForSemanticSegmentation.

logits

(tensor) The main logits for semantic segmentation.

auxiliary_logits

(tensor or None) The auxiliary logits for semantic segmentation. It can be None if not provided.

labels

(tensor) The ground truth labels for semantic segmentation.

RETURNS DESCRIPTION
Tensor

The weighted segmentation loss: the main cross-entropy loss, plus auxiliary_loss_weight times the auxiliary loss when auxiliary logits are provided.

RAISES DESCRIPTION
ValueError

If the size of logits or auxiliary_logits does not match the size of the labels.

ValueError

If labels contain values outside the range [0, num_classes-1], where num_classes is the number of classes for semantic segmentation.

RuntimeError

If the mode specified for interpolation is not supported.

Source code in mindnlp/transformers/models/beit/modeling_beit.py
def compute_loss(self, logits, auxiliary_logits, labels):
    '''
    This method computes the loss for semantic segmentation using the logits and auxiliary logits, and compares them with the provided labels.

    Args:
        self: (object) The instance of the class BeitForSemanticSegmentation.
        logits: (tensor) The main logits for semantic segmentation.
        auxiliary_logits: (tensor or None) The auxiliary logits for semantic segmentation. It can be None if not provided.
        labels: (tensor) The ground truth labels for semantic segmentation.

    Returns:
        Tensor: The weighted segmentation loss: the main cross-entropy loss, plus `auxiliary_loss_weight` times the auxiliary loss when auxiliary logits are provided.

    Raises:
        ValueError: If the size of logits or auxiliary_logits does not match the size of the labels.
        ValueError: If labels contain values outside the range [0, num_classes-1], where num_classes is the number of classes for semantic segmentation.
        RuntimeError: If the mode specified for interpolation is not supported.
    '''
    # upsample logits to the images' original size
    upsampled_logits = F.interpolate(
        logits, size=labels.shape[-2:], mode="bilinear", align_corners=False
    )
    if auxiliary_logits is not None:
        upsampled_auxiliary_logits = F.interpolate(
            auxiliary_logits, size=labels.shape[-2:], mode="bilinear", align_corners=False
        )
    # compute weighted loss
    main_loss = F.cross_entropy(upsampled_logits, labels, ignore_index=self.config.semantic_loss_ignore_index)
    loss = main_loss
    if auxiliary_logits is not None:
        auxiliary_loss = F.cross_entropy(upsampled_auxiliary_logits, labels, ignore_index=self.config.semantic_loss_ignore_index)
        loss += self.config.auxiliary_loss_weight * auxiliary_loss

    return loss
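
Both sets of logits are first bilinearly upsampled to the label resolution, scored with cross-entropy that skips pixels equal to `semantic_loss_ignore_index`, and then combined. A numbers-only sketch of the combination (loss values illustrative; 0.4 is the `auxiliary_loss_weight` default in BeitConfig):

```python
main_loss, auxiliary_loss, auxiliary_loss_weight = 1.25, 0.80, 0.4
loss = main_loss + auxiliary_loss_weight * auxiliary_loss
print(loss)   # 1.57
```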

mindnlp.transformers.models.beit.modeling_beit.BeitForSemanticSegmentation.forward(pixel_values=None, head_mask=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Ground truth semantic segmentation maps for computing the loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels > 1, a classification loss is computed (Cross-Entropy).

TYPE: `torch.LongTensor` of shape `(batch_size, height, width)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[tuple, SemanticSegmenterOutput]

Union[tuple, SemanticSegmenterOutput]

Example
>>> from transformers import AutoImageProcessor, BeitForSemanticSegmentation
>>> from PIL import Image
>>> import requests
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-finetuned-ade-640-640")
>>> model = BeitForSemanticSegmentation.from_pretrained("microsoft/beit-base-finetuned-ade-640-640")
...
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> # logits are of shape (batch_size, num_labels, height, width)
>>> logits = outputs.logits
Source code in mindnlp/transformers/models/beit/modeling_beit.py
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[tuple, SemanticSegmenterOutput]:
    r"""
    Args:
        labels (`torch.LongTensor` of shape `(batch_size, height, width)`, *optional*):
            Ground truth semantic segmentation maps for computing the loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels > 1`, a classification loss is computed (Cross-Entropy).

    Returns:
        Union[tuple, SemanticSegmenterOutput]

    Example:
        ```python
        >>> from transformers import AutoImageProcessor, BeitForSemanticSegmentation
        >>> from PIL import Image
        >>> import requests
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> image_processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-finetuned-ade-640-640")
        >>> model = BeitForSemanticSegmentation.from_pretrained("microsoft/beit-base-finetuned-ade-640-640")
        ...
        >>> inputs = image_processor(images=image, return_tensors="pt")
        >>> outputs = model(**inputs)
        >>> # logits are of shape (batch_size, num_labels, height, width)
        >>> logits = outputs.logits
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )

    outputs = self.beit(
        pixel_values,
        head_mask=head_mask,
        output_attentions=output_attentions,
        output_hidden_states=True,  # we need the intermediate hidden states
        return_dict=return_dict,
    )

    encoder_hidden_states = outputs.hidden_states if return_dict else outputs[1]

    # only keep certain features, and reshape
    # note that we do +1 as the encoder_hidden_states also includes the initial embeddings
    features = [feature for idx, feature in enumerate(encoder_hidden_states) if idx + 1 in self.config.out_indices]
    batch_size = pixel_values.shape[0]
    patch_resolution = self.config.image_size // self.config.patch_size
    features = [
        x[:, 1:, :].permute(0, 2, 1).reshape(batch_size, -1, patch_resolution, patch_resolution) for x in features
    ]

    # apply FPNs
    operators = [self.fpn1, self.fpn2, self.fpn3, self.fpn4]
    for i in range(len(features)):
        features[i] = operators[i](features[i])

    logits = self.decode_head(features)

    auxiliary_logits = None
    if self.auxiliary_head is not None:
        auxiliary_logits = self.auxiliary_head(features)

    loss = None
    if labels is not None:
        if self.config.num_labels == 1:
            raise ValueError("The number of labels should be greater than one")
        else:
            loss = self.compute_loss(logits, auxiliary_logits, labels)

    if not return_dict:
        if output_hidden_states:
            output = (logits,) + outputs[1:]
        else:
            output = (logits,) + outputs[2:]
        return ((loss,) + output) if loss is not None else output

    return SemanticSegmenterOutput(
        loss=loss,
        logits=logits,
        hidden_states=outputs.hidden_states if output_hidden_states else None,
        attentions=outputs.attentions,
    )
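
The logits returned above are class scores per spatial location, typically at a lower resolution than the input image. A hedged NumPy post-processing sketch (shapes illustrative): upsample the logits to the image size if needed, then take the per-pixel argmax to obtain a class-index map.

```python
import numpy as np

logits = np.random.randn(1, 150, 160, 160)    # (batch, num_labels, height, width), illustrative
segmentation_map = logits.argmax(axis=1)[0]   # (160, 160): one class id per pixel
print(segmentation_map.shape)
```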

mindnlp.transformers.models.beit.modeling_beit.BeitModel

Bases: BeitPreTrainedModel

BeitModel

Represents a BEiT (Bidirectional Encoder representation from Image Transformers) model: a Vision Transformer that combines a convolutional patch embedding with Transformer encoder layers for image recognition tasks.

This class inherits from BeitPreTrainedModel and includes methods for initializing the model, getting input embeddings, pruning heads, and forwarding the model with optional arguments.

ATTRIBUTE DESCRIPTION
config

The configuration for the model.

TYPE: BeitConfig

embeddings

The embeddings for the model.

TYPE: BeitEmbeddings

encoder

The encoder component of the model.

TYPE: BeitEncoder

layernorm

The layer normalization component of the model.

TYPE: Identity or LayerNorm

pooler

The pooling layer for the model, if included.

TYPE: BeitPooler

METHOD DESCRIPTION
__init__

Initializes the model with the given configuration and optional pooling layer.

get_input_embeddings

Retrieves the input embeddings for the model.

_prune_heads

Prunes heads of the model based on the provided dictionary of layers and heads to prune.

forward

Constructs the model with optional arguments for pixel values, masked positions, head masks, and return types.

Source code in mindnlp/transformers/models/beit/modeling_beit.py
class BeitModel(BeitPreTrainedModel):

    """
    BeitModel
    =========

    Represents a BEiT (Bidirectional Encoder representation from Image Transformers) model: a Vision Transformer that
    combines a convolutional patch embedding with Transformer encoder layers for image recognition tasks.

    This class inherits from BeitPreTrainedModel and includes methods for initializing the model, 
    getting input embeddings, pruning heads, and forwarding the model with optional arguments.

    Attributes:
        config (BeitConfig): The configuration for the model.
        embeddings (BeitEmbeddings): The embeddings for the model.
        encoder (BeitEncoder): The encoder component of the model.
        layernorm (nn.Identity or nn.LayerNorm): The layer normalization component of the model.
        pooler (BeitPooler): The pooling layer for the model, if included.

    Methods:
        __init__: Initializes the model with the given configuration and optional pooling layer.
        get_input_embeddings(self): Retrieves the input embeddings for the model.
        _prune_heads(self, heads_to_prune): 
            Prunes heads of the model based on the provided dictionary of layers and heads to prune.
        forward: 
            Constructs the model with optional arguments for pixel values, masked positions, head masks, and return types.

        Additional details and descriptions for each method can be found in the method docstrings.
    """
    def __init__(self, config: BeitConfig, add_pooling_layer: bool = True) -> None:
        """
        Initializes a new instance of the BeitModel class.

        Args:
            self: The object itself.
            config (BeitConfig): The configuration object for the BeitModel.
                This object contains various hyperparameters and settings for the model.
            add_pooling_layer (bool, optional): Flag indicating whether to include a pooling layer.
                Defaults to True.

        Returns:
            None.

        Raises:
            None.

        Note:
            This method sets up the BeitModel by initializing its attributes and components.
            It creates an instance of BeitEmbeddings, BeitEncoder, and BeitPooler based on the provided config.
            If add_pooling_layer is True, it also initializes a pooler for the model.
            Finally, it calls the post_init method to perform any additional initialization steps.
        """
        super().__init__(config)
        self.config = config

        self.embeddings = BeitEmbeddings(config)
        self.encoder = BeitEncoder(config, window_size=self.embeddings.patch_embeddings.patch_shape)

        self.layernorm = (
            nn.Identity() if config.use_mean_pooling else nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        )
        self.pooler = BeitPooler(config) if add_pooling_layer else None

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        This method 'get_input_embeddings' is part of the 'BeitModel' class and is used to retrieve the input embeddings.

        Args:
            self: An instance of the 'BeitModel' class.

        Returns:
            BeitPatchEmbeddings: The patch embedding module used to embed `pixel_values`.

        Raises:
            None.
        """
        return self.embeddings.patch_embeddings

    def _prune_heads(self, heads_to_prune):
        """
        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
        class PreTrainedModel
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        bool_masked_pos: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, BeitModelOutputWithPooling]:
        r"""
        Args:
            bool_masked_pos (`mindspore.Tensor` of shape `(batch_size, num_patches)`, *optional*):
                Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if pixel_values is None:
            raise ValueError("You have to specify pixel_values")

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

        embedding_output, (patch_height, patch_width) = self.embeddings(pixel_values, bool_masked_pos)

        encoder_outputs = self.encoder(
            embedding_output,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]
        sequence_output = self.layernorm(sequence_output)
        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

        if not return_dict:
            head_outputs = (sequence_output, pooled_output) if pooled_output is not None else (sequence_output,)
            return head_outputs + encoder_outputs[1:]

        return BeitModelOutputWithPooling(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )
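
A minimal end-to-end sketch of the class with a tiny, randomly initialized configuration (hyperparameters below are deliberately small and illustrative, not checkpoint defaults; import paths follow the module paths documented on this page):

```python
import numpy as np
import mindspore
from mindnlp.transformers.models.beit.configuration_beit import BeitConfig
from mindnlp.transformers.models.beit.modeling_beit import BeitModel

config = BeitConfig(image_size=32, patch_size=16, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
model = BeitModel(config, add_pooling_layer=True)

pixel_values = mindspore.Tensor(np.random.randn(1, 3, 32, 32), mindspore.float32)
outputs = model(pixel_values, return_dict=True)
# 4 patches from a 32x32 image with 16x16 patches, plus the CLS token.
print(outputs.last_hidden_state.shape)   # (1, 5, 32)
```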

mindnlp.transformers.models.beit.modeling_beit.BeitModel.__init__(config, add_pooling_layer=True)

Initializes a new instance of the BeitModel class.

PARAMETER DESCRIPTION
self

The object itself.

config

The configuration object for the BeitModel. This object contains various hyperparameters and settings for the model.

TYPE: BeitConfig

add_pooling_layer

Flag indicating whether to include a pooling layer. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
None

None.

Note

This method sets up the BeitModel by initializing its attributes and components. It creates an instance of BeitEmbeddings, BeitEncoder, and BeitPooler based on the provided config. If add_pooling_layer is True, it also initializes a pooler for the model. Finally, it calls the post_init method to perform any additional initialization steps.

Source code in mindnlp/transformers/models/beit/modeling_beit.py
def __init__(self, config: BeitConfig, add_pooling_layer: bool = True) -> None:
    """
    Initializes a new instance of the BeitModel class.

    Args:
        self: The object itself.
        config (BeitConfig): The configuration object for the BeitModel.
            This object contains various hyperparameters and settings for the model.
        add_pooling_layer (bool, optional): Flag indicating whether to include a pooling layer.
            Defaults to True.

    Returns:
        None.

    Raises:
        None.

    Note:
        This method sets up the BeitModel by initializing its attributes and components.
        It creates an instance of BeitEmbeddings, BeitEncoder, and BeitPooler based on the provided config.
        If add_pooling_layer is True, it also initializes a pooler for the model.
        Finally, it calls the post_init method to perform any additional initialization steps.
    """
    super().__init__(config)
    self.config = config

    self.embeddings = BeitEmbeddings(config)
    self.encoder = BeitEncoder(config, window_size=self.embeddings.patch_embeddings.patch_shape)

    self.layernorm = (
        nn.Identity() if config.use_mean_pooling else nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
    )
    self.pooler = BeitPooler(config) if add_pooling_layer else None

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.beit.modeling_beit.BeitModel.forward(pixel_values=None, bool_masked_pos=None, head_mask=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
bool_masked_pos

Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).

TYPE: `mindspore.Tensor` of shape `(batch_size, num_patches)`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/beit/modeling_beit.py
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    bool_masked_pos: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[tuple, BeitModelOutputWithPooling]:
    r"""
    Args:
        bool_masked_pos (`mindspore.Tensor` of shape `(batch_size, num_patches)`, *optional*):
            Boolean masked positions. Indicates which patches are masked (1) and which aren't (0).
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    if pixel_values is None:
        raise ValueError("You have to specify pixel_values")

    # Prepare head mask if needed
    # 1.0 in head_mask indicate we keep the head
    # attention_probs has shape bsz x n_heads x N x N
    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

    embedding_output, (patch_height, patch_width) = self.embeddings(pixel_values, bool_masked_pos)

    encoder_outputs = self.encoder(
        embedding_output,
        head_mask=head_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    sequence_output = encoder_outputs[0]
    sequence_output = self.layernorm(sequence_output)
    pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

    if not return_dict:
        head_outputs = (sequence_output, pooled_output) if pooled_output is not None else (sequence_output,)
        return head_outputs + encoder_outputs[1:]

    return BeitModelOutputWithPooling(
        last_hidden_state=sequence_output,
        pooler_output=pooled_output,
        hidden_states=encoder_outputs.hidden_states,
        attentions=encoder_outputs.attentions,
    )
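
The comments above describe the two accepted `head_mask` layouts. A NumPy sketch of those shapes (numbers illustrative for a base-sized model; 1.0 keeps a head, 0.0 masks it):

```python
import numpy as np

num_hidden_layers, num_attention_heads = 12, 12
flat_mask = np.ones(num_attention_heads)                             # shared across layers
per_layer_mask = np.ones((num_hidden_layers, num_attention_heads))   # one row per layer
per_layer_mask[0, 3] = 0.0                                           # mask head 3 of layer 0
print(flat_mask.shape, per_layer_mask.shape)   # (12,) (12, 12)
```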

mindnlp.transformers.models.beit.modeling_beit.BeitModel.get_input_embeddings()

This method 'get_input_embeddings' is part of the 'BeitModel' class and is used to retrieve the input embeddings.

PARAMETER DESCRIPTION
self

An instance of the 'BeitModel' class.

RETURNS DESCRIPTION

The patch embedding module (BeitPatchEmbeddings) used to embed pixel_values.

Source code in mindnlp/transformers/models/beit/modeling_beit.py
def get_input_embeddings(self):
    """
    This method 'get_input_embeddings' is part of the 'BeitModel' class and is used to retrieve the input embeddings.

    Args:
        self: An instance of the 'BeitModel' class.

    Returns:
        BeitPatchEmbeddings: The patch embedding module used to embed `pixel_values`.

    Raises:
        None.
    """
    return self.embeddings.patch_embeddings

mindnlp.transformers.models.beit.modeling_beit.BeitPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in mindnlp/transformers/models/beit/modeling_beit.py
class BeitPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """
    config_class = BeitConfig
    base_model_prefix = "beit"
    main_input_name = "pixel_values"
    supports_gradient_checkpointing = True

    def _init_weights(self, cell):
        """Initialize the weights"""
        if isinstance(cell, (nn.Linear, nn.Conv2d, nn.ConvTranspose2d)):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            cell.weight.set_data(initializer(Normal(self.config.initializer_range),
                                                    cell.weight.shape, cell.weight.dtype))
            if cell.bias is not None:
                cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
        elif isinstance(cell, nn.Embedding):
            weight = np.random.normal(0.0, self.config.initializer_range, cell.weight.shape)
            if cell.padding_idx:
                weight[cell.padding_idx] = 0

            cell.weight.set_data(Tensor(weight, cell.weight.dtype))
        elif isinstance(cell, nn.LayerNorm):
            cell.weight.set_data(initializer('ones', cell.weight.shape, cell.weight.dtype))
            cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
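
A NumPy-only sketch of the scheme implemented in `_init_weights` above (`initializer_range` is 0.02 by default): dense and convolution weights are drawn from a zero-mean normal with that standard deviation, biases are zeroed, and LayerNorm weights start at one.

```python
import numpy as np

initializer_range = 0.02
linear_weight = np.random.normal(0.0, initializer_range, size=(768, 768))
linear_bias = np.zeros(768)
layernorm_weight = np.ones(768)
print(round(linear_weight.std(), 3))   # ~0.02
```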

mindnlp.transformers.models.beit.modeling_beit.BeitBackbone

Bases: BeitPreTrainedModel, BackboneMixin

Represents the backbone of a BEiT (Bidirectional Encoder representation from Image Transformers) model for image recognition and classification tasks. This class inherits from BeitPreTrainedModel and BackboneMixin.

The backbone consists of an image embedding module, an encoder module, and optionally, a feature pyramid network (FPN) for multi-scale feature extraction.

The class provides methods for initializing the backbone, getting input embeddings, and forwarding the backbone from input pixel values. It also supports the option to return hidden states and attentions.

Example Usage
>>> from transformers import AutoImageProcessor, AutoBackbone
>>> import torch
>>> from PIL import Image
>>> import requests
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
>>> model = AutoBackbone.from_pretrained(
...     "microsoft/beit-base-patch16-224", out_features=["stage1", "stage2", "stage3", "stage4"]
... )
...
>>> inputs = processor(image, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> feature_maps = outputs.feature_maps
>>> list(feature_maps[-1].shape)
[1, 768, 14, 14]
ATTRIBUTE DESCRIPTION
num_features

List of hidden sizes for each layer in the backbone.

TYPE: list

METHOD DESCRIPTION
__init__

Initializes the backbone with the given configuration.

get_input_embeddings

Returns the patch embeddings used as input to the backbone.

forward

Constructs the backbone from input pixel values, optionally returning hidden states and attentions.

RAISES DESCRIPTION
ValueError

If the specified output indices are invalid.

RETURNS DESCRIPTION
BackboneOutput

An object containing feature maps, hidden states, and attentions.

Note

The class supports the use of the FPN for multi-scale feature extraction.

Source code in mindnlp/transformers/models/beit/modeling_beit.py
class BeitBackbone(BeitPreTrainedModel, BackboneMixin):

    """
    Represents the backbone of a BEiT (Bidirectional Encoder representation from Image Transformers) model for image recognition and classification tasks.
    This class inherits from BeitPreTrainedModel and BackboneMixin.

    The backbone consists of an image embedding module, an encoder module, and optionally,
    a feature pyramid network (FPN) for multi-scale feature extraction.

    The class provides methods for initializing the backbone, getting input embeddings, and forwarding
    the backbone from input pixel values. It also supports the option to return hidden states and attentions.

    Example Usage:
        ```python
        >>> from transformers import AutoImageProcessor, AutoBackbone
        >>> import torch
        >>> from PIL import Image
        >>> import requests
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
        >>> model = AutoBackbone.from_pretrained(
        ...     "microsoft/beit-base-patch16-224", out_features=["stage1", "stage2", "stage3", "stage4"]
        ... )
        ...
        >>> inputs = processor(image, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> feature_maps = outputs.feature_maps
        >>> list(feature_maps[-1].shape)
        [1, 768, 14, 14]
        ```

    Attributes:
        num_features (list): List of hidden sizes for each layer in the backbone.

    Methods:
        __init__: Initializes the backbone with the given configuration.
        get_input_embeddings: Returns the patch embeddings used as input to the backbone.
        forward: Constructs the backbone from input pixel values, optionally returning hidden states and attentions.

    Raises:
        ValueError: If the specified output indices are invalid.

    Returns:
        BackboneOutput: An object containing feature maps, hidden states, and attentions.

    Note:
        The class supports the use of the FPN for multi-scale feature extraction.
    """
    def __init__(self, config):
        """
        Initializes an instance of the 'BeitBackbone' class.

        Args:
            self: The instance of the 'BeitBackbone' class.
            config: An object containing the configuration settings for the 'BeitBackbone'.
                It should provide the following attributes:

                - hidden_size (int): The size of the hidden layers.
                - num_hidden_layers (int): The number of hidden layers.
                - add_fpn (bool): Indicates whether to add a Feature Pyramid Network (FPN) to the backbone.
                - out_indices (list): A list of 4 integers specifying which features to use from the backbone if FPN is added.
                For example, [3, 5, 7, 11] can be used for a base-sized architecture.
                - batch_norm_eps (float): The value for epsilon in Batch Normalization.

                Note: Make sure 'config' provides the necessary attributes; otherwise, an exception will be raised.

        Returns:
            None.

        Raises:
            ValueError: If 'config.add_fpn' is True but 'len(config.out_indices)' is not equal to 4.
                        In this case, 'config.out_indices' should be a list of 4 integers specifying the features to use from the backbone.
        """
        super().__init__(config)
        super()._init_backbone(config)

        self.num_features = [config.hidden_size for _ in range(config.num_hidden_layers + 1)]
        self.embeddings = BeitEmbeddings(config)
        self.encoder = BeitEncoder(config, window_size=self.embeddings.patch_embeddings.patch_shape)

        if config.add_fpn:
            if len(self.config.out_indices) != 4:
                raise ValueError(
                    "BeitBackbone requires config.out_indices to be a list of 4 integers, "
                    "specifying which features to use from the backbone. One can use [3, 5, 7, 11] in case of "
                    "a base-sized architecture."
                )
            hidden_size = config.hidden_size
            self.fpn1 = nn.Sequential(
                nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2),
                nn.BatchNorm2d(hidden_size, eps=config.batch_norm_eps),
                nn.GELU(),
                nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2),
            )

            self.fpn2 = nn.Sequential(nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2))
            self.fpn3 = nn.Identity()
            self.fpn4 = nn.MaxPool2d(kernel_size=2, stride=2)

        # initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        Retrieves the input embeddings from the BeitBackbone class.

        Args:
            self (BeitBackbone): An instance of the BeitBackbone class.

        Returns:
            BeitPatchEmbeddings: The patch embedding module used to embed `pixel_values`.

        Raises:
            None.

        This method retrieves the input embeddings from the BeitBackbone class.
        The input embeddings are obtained through the patch_embeddings attribute of the class
        and are used for further processing or analysis.

        Note:
            The input embeddings refer to the numerical representation of the input data,
            which can be used for tasks such as classification, regression, or other machine learning tasks.
        """
        return self.embeddings.patch_embeddings

    def forward(
        self,
        pixel_values: Tensor,
        output_hidden_states: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> BackboneOutput:
        """
        Returns:
            BackboneOutput

        Example:
            ```python
            >>> from transformers import AutoImageProcessor, AutoBackbone
            >>> import torch
            >>> from PIL import Image
            >>> import requests
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
            >>> model = AutoBackbone.from_pretrained(
            ...     "microsoft/beit-base-patch16-224", out_features=["stage1", "stage2", "stage3", "stage4"]
            ... )
            ...
            >>> inputs = processor(image, return_tensors="pt")
            ...
            >>> outputs = model(**inputs)
            >>> feature_maps = outputs.feature_maps
            >>> list(feature_maps[-1].shape)
            [1, 768, 14, 14]
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions

        batch_size = pixel_values.shape[0]
        embedding_output, (patch_height, patch_width) = self.embeddings(pixel_values)

        outputs = self.encoder(
            embedding_output, output_hidden_states=True, output_attentions=output_attentions, return_dict=return_dict
        )

        hidden_states = outputs.hidden_states if return_dict else outputs[1]

        feature_maps = ()
        for stage, hidden_state in zip(self.stage_names, hidden_states):
            if stage in self.out_features:
                if self.config.reshape_hidden_states:
                    hidden_state = hidden_state[:, 1:, :]
                    hidden_state = hidden_state.permute(0, 2, 1)
                    hidden_state = hidden_state.reshape(batch_size, -1, patch_height, patch_width)

                feature_maps += (hidden_state,)

        if self.config.add_fpn:
            feature_maps = [
                self.fpn1(feature_maps[0]),
                self.fpn2(feature_maps[1]),
                self.fpn3(feature_maps[2]),
                self.fpn4(feature_maps[3]),
            ]
            feature_maps = tuple(feature_maps)

        if not return_dict:
            if output_hidden_states:
                output = (feature_maps,) + outputs[1:]
            else:
                output = (feature_maps,) + outputs[2:]
            return output

        return BackboneOutput(
            feature_maps=feature_maps,
            hidden_states=outputs.hidden_states if output_hidden_states else None,
            attentions=outputs.attentions,
        )
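
As a reading aid for the FPN branches defined and used above: fpn1 applies two stride-2 transposed convolutions (4x spatial upsampling), fpn2 applies one (2x upsampling), fpn3 is the identity, and fpn4 halves the resolution with a stride-2 max pool. The following minimal sketch, written against the upstream torch.nn API purely for illustration (it is not part of the mindnlp source), checks those scale factors on a dummy 14x14 feature map:

```python
# Illustration only: rebuild the four FPN branches with torch.nn to check
# their effect on spatial resolution (mindnlp's layer names mirror these).
import torch
from torch import nn

hidden_size = 768
fpn1 = nn.Sequential(
    nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2),
    nn.BatchNorm2d(hidden_size),
    nn.GELU(),
    nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2),
)
fpn2 = nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2)
fpn4 = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, hidden_size, 14, 14)  # a 14x14 patch grid, as in the example below
print(fpn1(x).shape)  # (1, 768, 56, 56) -> 4x upsampled
print(fpn2(x).shape)  # (1, 768, 28, 28) -> 2x upsampled
print(x.shape)        # fpn3 is Identity, resolution unchanged
print(fpn4(x).shape)  # (1, 768, 7, 7)   -> 2x downsampled
```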

mindnlp.transformers.models.beit.modeling_beit.BeitBackbone.__init__(config)

Initializes an instance of the 'BeitBackbone' class.

PARAMETER DESCRIPTION
self

The instance of the 'BeitBackbone' class.

config

An object containing the configuration settings for the 'BeitBackbone'. It should provide the following attributes:

  • hidden_size (int): The size of the hidden layers.
  • num_hidden_layers (int): The number of hidden layers.
  • add_fpn (bool): Indicates whether to add a Feature Pyramid Network (FPN) to the backbone.
  • out_indices (list): A list of 4 integers specifying which features to use from the backbone if FPN is added. For example, [3, 5, 7, 11] can be used for a base-sized architecture.
  • batch_norm_eps (float): The value for epsilon in Batch Normalization.

Note: Make sure 'config' provides the necessary attributes; otherwise, an exception will be raised.

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If 'config.add_fpn' is True but 'len(config.out_indices)' is not equal to 4. In this case, 'config.out_indices' should be a list of 4 integers specifying the features to use from the backbone.
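
To make the requirement above concrete, here is a minimal construction sketch. It assumes the import paths shown on this page and uses illustrative values; it has not been run against mindnlp. When add_fpn is True, out_indices must contain exactly four entries or the ValueError described here is raised.

```python
# Minimal sketch (unverified): configure a BEiT backbone with an FPN attached.
# Import paths follow the module names documented on this page; values are illustrative.
from mindnlp.transformers.models.beit.configuration_beit import BeitConfig
from mindnlp.transformers.models.beit.modeling_beit import BeitBackbone

config = BeitConfig(
    add_fpn=True,               # attach the fpn1..fpn4 branches
    out_indices=[3, 5, 7, 11],  # exactly 4 indices required when add_fpn=True
)
backbone = BeitBackbone(config)  # raises ValueError if len(out_indices) != 4
```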

Source code in mindnlp/transformers/models/beit/modeling_beit.py
def __init__(self, config):
    """
    Initializes an instance of the 'BeitBackbone' class.

    Args:
        self: The instance of the 'BeitBackbone' class.
        config: An object containing the configuration settings for the 'BeitBackbone'.
            It should provide the following attributes:

            - hidden_size (int): The size of the hidden layers.
            - num_hidden_layers (int): The number of hidden layers.
            - add_fpn (bool): Indicates whether to add a Feature Pyramid Network (FPN) to the backbone.
            - out_indices (list): A list of 4 integers specifying which features to use from the backbone if FPN is added.
            For example, [3, 5, 7, 11] can be used for a base-sized architecture.
            - batch_norm_eps (float): The value for epsilon in Batch Normalization.

            Note: Make sure 'config' provides the necessary attributes; otherwise, an exception will be raised.

    Returns:
        None.

    Raises:
        ValueError: If 'config.add_fpn' is True but 'len(config.out_indices)' is not equal to 4.
                    In this case, 'config.out_indices' should be a list of 4 integers specifying the features to use from the backbone.
    """
    super().__init__(config)
    super()._init_backbone(config)

    self.num_features = [config.hidden_size for _ in range(config.num_hidden_layers + 1)]
    self.embeddings = BeitEmbeddings(config)
    self.encoder = BeitEncoder(config, window_size=self.embeddings.patch_embeddings.patch_shape)

    if config.add_fpn:
        if len(self.config.out_indices) != 4:
            raise ValueError(
                "BeitBackbone requires config.out_indices to be a list of 4 integers, "
                "specifying which features to use from the backbone. One can use [3, 5, 7, 11] in case of "
                "a base-sized architecture."
            )
        hidden_size = config.hidden_size
        self.fpn1 = nn.Sequential(
            nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2),
            nn.BatchNorm2d(hidden_size, eps=config.batch_norm_eps),
            nn.GELU(),
            nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2),
        )

        self.fpn2 = nn.Sequential(nn.ConvTranspose2d(hidden_size, hidden_size, kernel_size=2, stride=2))
        self.fpn3 = nn.Identity()
        self.fpn4 = nn.MaxPool2d(kernel_size=2, stride=2)

    # initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.beit.modeling_beit.BeitBackbone.forward(pixel_values, output_hidden_states=None, output_attentions=None, return_dict=None)

RETURNS DESCRIPTION
A BackboneOutput containing the selected feature maps and, when requested, the hidden states and attentions.

TYPE: BackboneOutput

Example
```python
>>> from transformers import AutoImageProcessor, AutoBackbone
>>> import torch
>>> from PIL import Image
>>> import requests
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
>>> model = AutoBackbone.from_pretrained(
...     "microsoft/beit-base-patch16-224", out_features=["stage1", "stage2", "stage3", "stage4"]
... )
...
>>> inputs = processor(image, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> feature_maps = outputs.feature_maps
>>> list(feature_maps[-1].shape)
[1, 768, 14, 14]
```
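
The example above comes from the upstream docstring and uses the PyTorch transformers API. For the mindnlp port documented here, a rough, unverified equivalent that feeds a random MindSpore tensor to a randomly initialized backbone might look as follows; the argument names and shapes mirror the source below, but the snippet has not been run against mindnlp.

```python
# Unverified sketch for the mindnlp/MindSpore port of BeitBackbone.
import numpy as np
from mindspore import Tensor
from mindnlp.transformers.models.beit.configuration_beit import BeitConfig
from mindnlp.transformers.models.beit.modeling_beit import BeitBackbone

config = BeitConfig(out_indices=[3, 5, 7, 11])  # four stages, FPN disabled
backbone = BeitBackbone(config)                 # randomly initialized weights

# Default config: image_size=224, patch_size=16, num_channels=3 -> 14x14 patch grid.
pixel_values = Tensor(np.random.randn(1, 3, 224, 224).astype(np.float32))
outputs = backbone(pixel_values, return_dict=True)

for fm in outputs.feature_maps:
    # With reshape_hidden_states enabled (the upstream default), each map is
    # (batch_size, hidden_size, 14, 14), e.g. (1, 768, 14, 14).
    print(fm.shape)
```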
Source code in mindnlp/transformers/models/beit/modeling_beit.py
def forward(
    self,
    pixel_values: Tensor,
    output_hidden_states: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> BackboneOutput:
    """
    Returns:
        BackboneOutput

    Example:
        ```python
        >>> from transformers import AutoImageProcessor, AutoBackbone
        >>> import torch
        >>> from PIL import Image
        >>> import requests
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
        >>> model = AutoBackbone.from_pretrained(
        ...     "microsoft/beit-base-patch16-224", out_features=["stage1", "stage2", "stage3", "stage4"]
        ... )
        ...
        >>> inputs = processor(image, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> feature_maps = outputs.feature_maps
        >>> list(feature_maps[-1].shape)
        [1, 768, 14, 14]
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions

    batch_size = pixel_values.shape[0]
    embedding_output, (patch_height, patch_width) = self.embeddings(pixel_values)

    outputs = self.encoder(
        embedding_output, output_hidden_states=True, output_attentions=output_attentions, return_dict=return_dict
    )

    hidden_states = outputs.hidden_states if return_dict else outputs[1]

    feature_maps = ()
    for stage, hidden_state in zip(self.stage_names, hidden_states):
        if stage in self.out_features:
            if self.config.reshape_hidden_states:
                hidden_state = hidden_state[:, 1:, :]
                hidden_state = hidden_state.permute(0, 2, 1)
                hidden_state = hidden_state.reshape(batch_size, -1, patch_height, patch_width)

            feature_maps += (hidden_state,)

    if self.config.add_fpn:
        feature_maps = [
            self.fpn1(feature_maps[0]),
            self.fpn2(feature_maps[1]),
            self.fpn3(feature_maps[2]),
            self.fpn4(feature_maps[3]),
        ]
        feature_maps = tuple(feature_maps)

    if not return_dict:
        if output_hidden_states:
            output = (feature_maps,) + outputs[1:]
        else:
            output = (feature_maps,) + outputs[2:]
        return output

    return BackboneOutput(
        feature_maps=feature_maps,
        hidden_states=outputs.hidden_states if output_hidden_states else None,
        attentions=outputs.attentions,
    )

mindnlp.transformers.models.beit.modeling_beit.BeitBackbone.get_input_embeddings()

Retrieves the input embeddings from the BeitBackbone class.

PARAMETER DESCRIPTION
self

An instance of the BeitBackbone class.

TYPE: BeitBackbone

RETURNS DESCRIPTION

The patch embedding module (self.embeddings.patch_embeddings) that converts the input pixel values into a sequence of patch token embeddings.

This method exposes the patch embedding module of the backbone so that callers can inspect or replace the layer that embeds the input images.

Note

The patch embeddings turn each image patch into a vector representation that is subsequently processed by the Transformer encoder.
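
As a usage note (assuming a backbone instance like the sketches above), the call simply hands back the patch embedding submodule:

```python
# Assuming `backbone` is a BeitBackbone instance (see the sketches above):
patch_embeddings = backbone.get_input_embeddings()  # same object as backbone.embeddings.patch_embeddings
print(type(patch_embeddings).__name__)
```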

Source code in mindnlp/transformers/models/beit/modeling_beit.py
def get_input_embeddings(self):
    """
    Retrieves the input embeddings from the BeitBackbone class.

    Args:
        self (BeitBackbone): An instance of the BeitBackbone class.

    Returns:
        The patch embedding module (`self.embeddings.patch_embeddings`) that converts the
        input pixel values into a sequence of patch token embeddings.

    Raises:
        None.

    This method exposes the patch embedding module of the backbone so that callers can
    inspect or replace the layer that embeds the input images.

    Note:
        The patch embeddings turn each image patch into a vector representation that is
        subsequently processed by the Transformer encoder.
    """
    return self.embeddings.patch_embeddings