
cvt

mindnlp.transformers.models.cvt.configuration_cvt

CvT model configuration

mindnlp.transformers.models.cvt.configuration_cvt.CvtConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [CvtModel]. It is used to instantiate a CvT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the CvT microsoft/cvt-13 architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
num_channels

The number of input channels.

TYPE: `int`, *optional*, defaults to 3 DEFAULT: 3

patch_sizes

The kernel size of each encoder's patch embedding.

TYPE: `List[int]`, *optional*, defaults to `[7, 3, 3]` DEFAULT: [7, 3, 3]

patch_stride

The stride size of each encoder's patch embedding.

TYPE: `List[int]`, *optional*, defaults to `[4, 2, 2]` DEFAULT: [4, 2, 2]

patch_padding

The padding size of each encoder's patch embedding.

TYPE: `List[int]`, *optional*, defaults to `[2, 1, 1]` DEFAULT: [2, 1, 1]

embed_dim

Dimension of each of the encoder blocks.

TYPE: `List[int]`, *optional*, defaults to `[64, 192, 384]` DEFAULT: [64, 192, 384]

num_heads

Number of attention heads for each attention layer in each block of the Transformer encoder.

TYPE: `List[int]`, *optional*, defaults to `[1, 3, 6]` DEFAULT: [1, 3, 6]

depth

The number of layers in each encoder block.

TYPE: `List[int]`, *optional*, defaults to `[1, 2, 10]` DEFAULT: [1, 2, 10]

mlp_ratio

Ratio of the size of the hidden layer compared to the size of the input layer of the Mix FFNs in the encoder blocks.

TYPE: `List[float]`, *optional*, defaults to `[4.0, 4.0, 4.0]` DEFAULT: [4.0, 4.0, 4.0]

attention_drop_rate

The dropout ratio for the attention probabilities.

TYPE: `List[float]`, *optional*, defaults to `[0.0, 0.0, 0.0]` DEFAULT: [0.0, 0.0, 0.0]

drop_rate

The dropout ratio for the patch embeddings probabilities.

TYPE: `List[float]`, *optional*, defaults to `[0.0, 0.0, 0.0]` DEFAULT: [0.0, 0.0, 0.0]

drop_path_rate

The dropout probability for stochastic depth, used in the blocks of the Transformer encoder.

TYPE: `List[float]`, *optional*, defaults to `[0.0, 0.0, 0.1]` DEFAULT: [0.0, 0.0, 0.1]

qkv_bias

Whether to add a bias to the query, key and value projections in the attention layers.

TYPE: `List[bool]`, *optional*, defaults to `[True, True, True]` DEFAULT: [True, True, True]

cls_token

Whether or not to add a classification token to the output of each of the last 3 stages.

TYPE: `List[bool]`, *optional*, defaults to `[False, False, True]` DEFAULT: [False, False, True]

qkv_projection_method

The projection method for query, key and value. The default is depth-wise convolutions with batch norm. For linear projection use "avg".

TYPE: `List[string]`, *optional*, defaults to `["dw_bn", "dw_bn", "dw_bn"]` DEFAULT: ['dw_bn', 'dw_bn', 'dw_bn']

kernel_qkv

The kernel size for query, key and value in attention layer

TYPE: `List[int]`, *optional*, defaults to `[3, 3, 3]` DEFAULT: [3, 3, 3]

padding_kv

The padding size for key and value in attention layer

TYPE: `List[int]`, *optional*, defaults to `[1, 1, 1]` DEFAULT: [1, 1, 1]

stride_kv

The stride size for key and value in attention layer

TYPE: `List[int]`, *optional*, defaults to `[2, 2, 2]` DEFAULT: [2, 2, 2]

padding_q

The padding size for query in attention layer

TYPE: `List[int]`, *optional*, defaults to `[1, 1, 1]` DEFAULT: [1, 1, 1]

stride_q

The stride size for query in attention layer

TYPE: `List[int]`, *optional*, defaults to `[1, 1, 1]` DEFAULT: [1, 1, 1]

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-12 DEFAULT: 1e-12

Example
>>> from mindnlp.transformers import CvtConfig, CvtModel
...
>>> # Initializing a CvT microsoft/cvt-13 style configuration
>>> configuration = CvtConfig()
...
>>> # Initializing a model (with random weights) from the microsoft/cvt-13 style configuration
>>> model = CvtModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
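
The per-stage lists can also be overridden to describe other three-stage CvT variants. A minimal sketch (the deeper depth values follow the published CvT-21 layout and are shown for illustration only):

>>> # Hypothetical CvT-21-like configuration: deeper later stages, same embedding dims
>>> custom_configuration = CvtConfig(depth=[1, 4, 16], embed_dim=[64, 192, 384])
>>> custom_configuration.depth
[1, 4, 16]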
Source code in mindnlp/transformers/models/cvt/configuration_cvt.py
class CvtConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`CvtModel`]. It is used to instantiate a CvT model
    according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the CvT
    [microsoft/cvt-13](https://huggingface.co/microsoft/cvt-13) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        patch_sizes (`List[int]`, *optional*, defaults to `[7, 3, 3]`):
            The kernel size of each encoder's patch embedding.
        patch_stride (`List[int]`, *optional*, defaults to `[4, 2, 2]`):
            The stride size of each encoder's patch embedding.
        patch_padding (`List[int]`, *optional*, defaults to `[2, 1, 1]`):
            The padding size of each encoder's patch embedding.
        embed_dim (`List[int]`, *optional*, defaults to `[64, 192, 384]`):
            Dimension of each of the encoder blocks.
        num_heads (`List[int]`, *optional*, defaults to `[1, 3, 6]`):
            Number of attention heads for each attention layer in each block of the Transformer encoder.
        depth (`List[int]`, *optional*, defaults to `[1, 2, 10]`):
            The number of layers in each encoder block.
        mlp_ratio (`List[float]`, *optional*, defaults to `[4.0, 4.0, 4.0]`):
            Ratio of the size of the hidden layer compared to the size of the input layer of the Mix FFNs in the
            encoder blocks.
        attention_drop_rate (`List[float]`, *optional*, defaults to `[0.0, 0.0, 0.0]`):
            The dropout ratio for the attention probabilities.
        drop_rate (`List[float]`, *optional*, defaults to `[0.0, 0.0, 0.0]`):
            The dropout ratio for the patch embeddings probabilities.
        drop_path_rate (`List[float]`, *optional*, defaults to `[0.0, 0.0, 0.1]`):
            The dropout probability for stochastic depth, used in the blocks of the Transformer encoder.
        qkv_bias (`List[bool]`, *optional*, defaults to `[True, True, True]`):
            Whether to add a bias to the query, key and value projections in the attention layers.
        cls_token (`List[bool]`, *optional*, defaults to `[False, False, True]`):
            Whether or not to add a classification token to the output of each of the last 3 stages.
        qkv_projection_method (`List[string]`, *optional*, defaults to `["dw_bn", "dw_bn", "dw_bn"]`):
            The projection method for query, key and value. The default is depth-wise convolutions with batch norm.
            For linear projection use "avg".
        kernel_qkv (`List[int]`, *optional*, defaults to `[3, 3, 3]`):
            The kernel size for query, key and value in attention layer
        padding_kv (`List[int]`, *optional*, defaults to `[1, 1, 1]`):
            The padding size for key and value in attention layer
        stride_kv (`List[int]`, *optional*, defaults to `[2, 2, 2]`):
            The stride size for key and value in attention layer
        padding_q (`List[int]`, *optional*, defaults to `[1, 1, 1]`):
            The padding size for query in attention layer
        stride_q (`List[int]`, *optional*, defaults to `[1, 1, 1]`):
            The stride size for query in attention layer
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.

    Example:
        ```python
        >>> from mindnlp.transformers import CvtConfig, CvtModel
        ...
        >>> # Initializing a CvT microsoft/cvt-13 style configuration
        >>> configuration = CvtConfig()
        ...
        >>> # Initializing a model (with random weights) from the microsoft/cvt-13 style configuration
        >>> model = CvtModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "cvt"

    def __init__(
        self,
        num_channels=3,
        patch_sizes=[7, 3, 3],
        patch_stride=[4, 2, 2],
        patch_padding=[2, 1, 1],
        embed_dim=[64, 192, 384],
        num_heads=[1, 3, 6],
        depth=[1, 2, 10],
        mlp_ratio=[4.0, 4.0, 4.0],
        attention_drop_rate=[0.0, 0.0, 0.0],
        drop_rate=[0.0, 0.0, 0.0],
        drop_path_rate=[0.0, 0.0, 0.1],
        qkv_bias=[True, True, True],
        cls_token=[False, False, True],
        qkv_projection_method=["dw_bn", "dw_bn", "dw_bn"],
        kernel_qkv=[3, 3, 3],
        padding_kv=[1, 1, 1],
        stride_kv=[2, 2, 2],
        padding_q=[1, 1, 1],
        stride_q=[1, 1, 1],
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        **kwargs,
    ):
        '''
        This method initializes an instance of the CvtConfig class with the provided parameters.

        Args:
            self: The instance of the class.
            num_channels (int, optional): Number of input channels. Defaults to 3.
            patch_sizes (List[int], optional): List of patch sizes for each layer. Defaults to [7, 3, 3].
            patch_stride (List[int], optional): List of patch strides for each layer. Defaults to [4, 2, 2].
            patch_padding (List[int], optional): List of patch paddings for each layer. Defaults to [2, 1, 1].
            embed_dim (List[int], optional): List of embedding dimensions for each layer. Defaults to [64, 192, 384].
            num_heads (List[int], optional): List of the number of attention heads for each layer. Defaults to [1, 3, 6].
            depth (List[int], optional): List of the depths for each layer. Defaults to [1, 2, 10].
            mlp_ratio (List[float], optional): List of the MLP ratio for each layer. Defaults to [4.0, 4.0, 4.0].
            attention_drop_rate (List[float], optional): List of attention dropout rates for each layer. Defaults to [0.0, 0.0, 0.0].
            drop_rate (List[float], optional): List of dropout rates for each layer. Defaults to [0.0, 0.0, 0.0].
            drop_path_rate (List[float], optional): List of drop path rates for each layer. Defaults to [0.0, 0.0, 0.1].
            qkv_bias (List[bool], optional): List of booleans indicating whether to include bias for query, key, and value projections for each layer. Defaults to [True, True, True].
            cls_token (List[bool], optional): List of booleans indicating whether to include a class token for each layer. Defaults to [False, False, True].
            qkv_projection_method (List[str], optional): List of methods for query, key, and value projections for each layer. Defaults to ['dw_bn', 'dw_bn', 'dw_bn'].
            kernel_qkv (List[int], optional): List of kernel sizes for query, key, and value projections for each layer. Defaults to [3, 3, 3].
            padding_kv (List[int], optional): List of paddings for key and value projections for each layer. Defaults to [1, 1, 1].
            stride_kv (List[int], optional): List of strides for key and value projections for each layer. Defaults to [2, 2, 2].
            padding_q (List[int], optional): List of paddings for query projection for each layer. Defaults to [1, 1, 1].
            stride_q (List[int], optional): List of strides for query projection for each layer. Defaults to [1, 1, 1].
            initializer_range (float, optional): The range of the initializer. Defaults to 0.02.
            layer_norm_eps (float, optional): Epsilon value for layer normalization. Defaults to 1e-12.
            **kwargs: Additional keyword arguments.

        Returns:
            None.

        Raises:
            None.
        '''
        super().__init__(**kwargs)
        self.num_channels = num_channels
        self.patch_sizes = patch_sizes
        self.patch_stride = patch_stride
        self.patch_padding = patch_padding
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.depth = depth
        self.mlp_ratio = mlp_ratio
        self.attention_drop_rate = attention_drop_rate
        self.drop_rate = drop_rate
        self.drop_path_rate = drop_path_rate
        self.qkv_bias = qkv_bias
        self.cls_token = cls_token
        self.qkv_projection_method = qkv_projection_method
        self.kernel_qkv = kernel_qkv
        self.padding_kv = padding_kv
        self.stride_kv = stride_kv
        self.padding_q = padding_q
        self.stride_q = stride_q
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps

mindnlp.transformers.models.cvt.configuration_cvt.CvtConfig.__init__(num_channels=3, patch_sizes=[7, 3, 3], patch_stride=[4, 2, 2], patch_padding=[2, 1, 1], embed_dim=[64, 192, 384], num_heads=[1, 3, 6], depth=[1, 2, 10], mlp_ratio=[4.0, 4.0, 4.0], attention_drop_rate=[0.0, 0.0, 0.0], drop_rate=[0.0, 0.0, 0.0], drop_path_rate=[0.0, 0.0, 0.1], qkv_bias=[True, True, True], cls_token=[False, False, True], qkv_projection_method=['dw_bn', 'dw_bn', 'dw_bn'], kernel_qkv=[3, 3, 3], padding_kv=[1, 1, 1], stride_kv=[2, 2, 2], padding_q=[1, 1, 1], stride_q=[1, 1, 1], initializer_range=0.02, layer_norm_eps=1e-12, **kwargs)

This method initializes an instance of the CvtConfig class with the provided parameters.

PARAMETER DESCRIPTION
self

The instance of the class.

num_channels

Number of input channels. Defaults to 3.

TYPE: int DEFAULT: 3

patch_sizes

List of patch sizes for each layer. Defaults to [7, 3, 3].

TYPE: List[int] DEFAULT: [7, 3, 3]

patch_stride

List of patch strides for each layer. Defaults to [4, 2, 2].

TYPE: List[int] DEFAULT: [4, 2, 2]

patch_padding

List of patch paddings for each layer. Defaults to [2, 1, 1].

TYPE: List[int] DEFAULT: [2, 1, 1]

embed_dim

List of embedding dimensions for each layer. Defaults to [64, 192, 384].

TYPE: List[int] DEFAULT: [64, 192, 384]

num_heads

List of the number of attention heads for each layer. Defaults to [1, 3, 6].

TYPE: List[int] DEFAULT: [1, 3, 6]

depth

List of the depths for each layer. Defaults to [1, 2, 10].

TYPE: List[int] DEFAULT: [1, 2, 10]

mlp_ratio

List of the MLP ratio for each layer. Defaults to [4.0, 4.0, 4.0].

TYPE: List[float] DEFAULT: [4.0, 4.0, 4.0]

attention_drop_rate

List of attention dropout rates for each layer. Defaults to [0.0, 0.0, 0.0].

TYPE: List[float] DEFAULT: [0.0, 0.0, 0.0]

drop_rate

List of dropout rates for each layer. Defaults to [0.0, 0.0, 0.0].

TYPE: List[float] DEFAULT: [0.0, 0.0, 0.0]

drop_path_rate

List of drop path rates for each layer. Defaults to [0.0, 0.0, 0.1].

TYPE: List[float] DEFAULT: [0.0, 0.0, 0.1]

qkv_bias

List of booleans indicating whether to include bias for query, key, and value projections for each layer. Defaults to [True, True, True].

TYPE: List[bool] DEFAULT: [True, True, True]

cls_token

List of booleans indicating whether to include a class token for each layer. Defaults to [False, False, True].

TYPE: List[bool] DEFAULT: [False, False, True]

qkv_projection_method

List of methods for query, key, and value projections for each layer. Defaults to ['dw_bn', 'dw_bn', 'dw_bn'].

TYPE: List[str] DEFAULT: ['dw_bn', 'dw_bn', 'dw_bn']

kernel_qkv

List of kernel sizes for query, key, and value projections for each layer. Defaults to [3, 3, 3].

TYPE: List[int] DEFAULT: [3, 3, 3]

padding_kv

List of paddings for key and value projections for each layer. Defaults to [1, 1, 1].

TYPE: List[int] DEFAULT: [1, 1, 1]

stride_kv

List of strides for key and value projections for each layer. Defaults to [2, 2, 2].

TYPE: List[int] DEFAULT: [2, 2, 2]

padding_q

List of paddings for query projection for each layer. Defaults to [1, 1, 1].

TYPE: List[int] DEFAULT: [1, 1, 1]

stride_q

List of strides for query projection for each layer. Defaults to [1, 1, 1].

TYPE: List[int] DEFAULT: [1, 1, 1]

initializer_range

The range of the initializer. Defaults to 0.02.

TYPE: float DEFAULT: 0.02

layer_norm_eps

Epsilon value for layer normalization. Defaults to 1e-12.

TYPE: float DEFAULT: 1e-12

**kwargs

Additional keyword arguments.

DEFAULT: {}

RETURNS DESCRIPTION

None.
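
A hedged sketch of inspecting the stored values after initialization (the attribute names match the assignments in the source below; the printed values are the documented defaults):

>>> from mindnlp.transformers import CvtConfig
>>> config = CvtConfig(layer_norm_eps=1e-6)   # override a single default
>>> config.layer_norm_eps
1e-06
>>> config.num_heads
[1, 3, 6]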

Source code in mindnlp/transformers/models/cvt/configuration_cvt.py
def __init__(
    self,
    num_channels=3,
    patch_sizes=[7, 3, 3],
    patch_stride=[4, 2, 2],
    patch_padding=[2, 1, 1],
    embed_dim=[64, 192, 384],
    num_heads=[1, 3, 6],
    depth=[1, 2, 10],
    mlp_ratio=[4.0, 4.0, 4.0],
    attention_drop_rate=[0.0, 0.0, 0.0],
    drop_rate=[0.0, 0.0, 0.0],
    drop_path_rate=[0.0, 0.0, 0.1],
    qkv_bias=[True, True, True],
    cls_token=[False, False, True],
    qkv_projection_method=["dw_bn", "dw_bn", "dw_bn"],
    kernel_qkv=[3, 3, 3],
    padding_kv=[1, 1, 1],
    stride_kv=[2, 2, 2],
    padding_q=[1, 1, 1],
    stride_q=[1, 1, 1],
    initializer_range=0.02,
    layer_norm_eps=1e-12,
    **kwargs,
):
    '''
    This method initializes an instance of the CvtConfig class with the provided parameters.

    Args:
        self: The instance of the class.
        num_channels (int, optional): Number of input channels. Defaults to 3.
        patch_sizes (List[int], optional): List of patch sizes for each layer. Defaults to [7, 3, 3].
        patch_stride (List[int], optional): List of patch strides for each layer. Defaults to [4, 2, 2].
        patch_padding (List[int], optional): List of patch paddings for each layer. Defaults to [2, 1, 1].
        embed_dim (List[int], optional): List of embedding dimensions for each layer. Defaults to [64, 192, 384].
        num_heads (List[int], optional): List of the number of attention heads for each layer. Defaults to [1, 3, 6].
        depth (List[int], optional): List of the depths for each layer. Defaults to [1, 2, 10].
        mlp_ratio (List[float], optional): List of the MLP ratio for each layer. Defaults to [4.0, 4.0, 4.0].
        attention_drop_rate (List[float], optional): List of attention dropout rates for each layer. Defaults to [0.0, 0.0, 0.0].
        drop_rate (List[float], optional): List of dropout rates for each layer. Defaults to [0.0, 0.0, 0.0].
        drop_path_rate (List[float], optional): List of drop path rates for each layer. Defaults to [0.0, 0.0, 0.1].
        qkv_bias (List[bool], optional): List of booleans indicating whether to include bias for query, key, and value projections for each layer. Defaults to [True, True, True].
        cls_token (List[bool], optional): List of booleans indicating whether to include a class token for each layer. Defaults to [False, False, True].
        qkv_projection_method (List[str], optional): List of methods for query, key, and value projections for each layer. Defaults to ['dw_bn', 'dw_bn', 'dw_bn'].
        kernel_qkv (List[int], optional): List of kernel sizes for query, key, and value projections for each layer. Defaults to [3, 3, 3].
        padding_kv (List[int], optional): List of paddings for key and value projections for each layer. Defaults to [1, 1, 1].
        stride_kv (List[int], optional): List of strides for key and value projections for each layer. Defaults to [2, 2, 2].
        padding_q (List[int], optional): List of paddings for query projection for each layer. Defaults to [1, 1, 1].
        stride_q (List[int], optional): List of strides for query projection for each layer. Defaults to [1, 1, 1].
        initializer_range (float, optional): The range of the initializer. Defaults to 0.02.
        layer_norm_eps (float, optional): Epsilon value for layer normalization. Defaults to 1e-12.
        **kwargs: Additional keyword arguments.

    Returns:
        None.

    Raises:
        None.
    '''
    super().__init__(**kwargs)
    self.num_channels = num_channels
    self.patch_sizes = patch_sizes
    self.patch_stride = patch_stride
    self.patch_padding = patch_padding
    self.embed_dim = embed_dim
    self.num_heads = num_heads
    self.depth = depth
    self.mlp_ratio = mlp_ratio
    self.attention_drop_rate = attention_drop_rate
    self.drop_rate = drop_rate
    self.drop_path_rate = drop_path_rate
    self.qkv_bias = qkv_bias
    self.cls_token = cls_token
    self.qkv_projection_method = qkv_projection_method
    self.kernel_qkv = kernel_qkv
    self.padding_kv = padding_kv
    self.stride_kv = stride_kv
    self.padding_q = padding_q
    self.stride_q = stride_q
    self.initializer_range = initializer_range
    self.layer_norm_eps = layer_norm_eps

mindnlp.transformers.models.cvt.modeling_cvt

MindSpore CvT model.

mindnlp.transformers.models.cvt.modeling_cvt.BaseModelOutputWithCLSToken dataclass

Bases: ModelOutput

Base class for model's outputs, with potential hidden states and attentions.

PARAMETER DESCRIPTION
last_hidden_state

Sequence of hidden-states at the output of the last layer of the model.

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)` DEFAULT: None

cls_token_value

Classification token at the output of the last layer of the model.

TYPE: `mindspore.Tensor` of shape `(batch_size, 1, hidden_size)` DEFAULT: None

hidden_states

Tuple of mindspore.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.

TYPE: `tuple(mindspore.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True` DEFAULT: None

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
@dataclass
class BaseModelOutputWithCLSToken(ModelOutput):
    """
    Base class for model's outputs, with potential hidden states and attentions.

    Args:
        last_hidden_state (`mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)`):
            Sequence of hidden-states at the output of the last layer of the model.
        cls_token_value (`mindspore.Tensor` of shape `(batch_size, 1, hidden_size)`):
            Classification token at the output of the last layer of the model.
        hidden_states (`tuple(mindspore.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
            Tuple of `mindspore.Tensor` (one for the output of the embeddings + one for the output of each layer) of
            shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
            plus the initial embedding outputs.
    """
    last_hidden_state: mindspore.Tensor = None
    cls_token_value: mindspore.Tensor = None
    hidden_states: Optional[Tuple[mindspore.Tensor, ...]] = None
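
A hedged usage sketch: the dataclass behaves like any ModelOutput, and its fields are read by attribute. The shapes below are illustrative placeholders following the documented (batch_size, sequence_length, hidden_size) convention, not values produced by a real forward pass.

import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import BaseModelOutputWithCLSToken

# Dummy output built only to illustrate field access.
outputs = BaseModelOutputWithCLSToken(
    last_hidden_state=mindspore.ops.zeros((1, 196, 384)),
    cls_token_value=mindspore.ops.zeros((1, 1, 384)),
    hidden_states=None,
)
print(outputs.last_hidden_state.shape)  # (1, 196, 384)
print(outputs.cls_token_value.shape)    # (1, 1, 384)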

mindnlp.transformers.models.cvt.modeling_cvt.CvtAttention

Bases: Module

This class represents an attention mechanism for the Cvt model. It includes methods for initializing the attention mechanism, pruning specific attention heads, and forwarding the attention output.

ATTRIBUTE DESCRIPTION
num_heads

Number of attention heads.

TYPE: int

embed_dim

Dimension of the input embeddings.

TYPE: int

kernel_size

Size of the convolutional kernel.

TYPE: int

padding_q

Padding size for query tensor.

TYPE: int

padding_kv

Padding size for key and value tensors.

TYPE: int

stride_q

Stride size for query tensor.

TYPE: int

stride_kv

Stride size for key and value tensors.

TYPE: int

qkv_projection_method

Method for projecting query, key, and value tensors.

TYPE: str

qkv_bias

Whether to include bias in query, key, and value projections.

TYPE: bool

attention_drop_rate

Dropout rate for attention scores.

TYPE: float

drop_rate

Dropout rate for output.

TYPE: float

with_cls_token

Whether to include a classification token in the input.

TYPE: bool

METHOD DESCRIPTION
__init__

Initializes the attention mechanism with the given parameters.

prune_heads

Prunes specified attention heads based on the provided indices.

forward

Constructs the attention output using the input hidden state and spatial dimensions.

Inherits from

nn.Module
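
A hedged construction sketch using the stage-2 defaults from CvtConfig (6 heads, embedding dimension 384, 3x3 depth-wise convolutional projections). The cls token is disabled here only so that the token count equals height * width; all shapes are illustrative.

import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import CvtAttention

attention = CvtAttention(
    num_heads=6,
    embed_dim=384,
    kernel_size=3,
    padding_q=1,
    padding_kv=1,
    stride_q=1,
    stride_kv=2,
    qkv_projection_method="dw_bn",
    qkv_bias=True,
    attention_drop_rate=0.0,
    drop_rate=0.0,
    with_cls_token=False,
)
hidden_state = mindspore.ops.zeros((1, 14 * 14, 384))  # flattened 14x14 feature map
output = attention(hidden_state, height=14, width=14)   # same token count and width as the input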

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtAttention(nn.Module):

    """
    This class represents an attention mechanism for the Cvt model.
    It includes methods for initializing the attention mechanism, pruning specific attention heads,
    and forwarding the attention output.

    Attributes:
        num_heads (int): Number of attention heads.
        embed_dim (int): Dimension of the input embeddings.
        kernel_size (int): Size of the convolutional kernel.
        padding_q (int): Padding size for query tensor.
        padding_kv (int): Padding size for key and value tensors.
        stride_q (int): Stride size for query tensor.
        stride_kv (int): Stride size for key and value tensors.
        qkv_projection_method (str): Method for projecting query, key, and value tensors.
        qkv_bias (bool): Whether to include bias in query, key, and value projections.
        attention_drop_rate (float): Dropout rate for attention scores.
        drop_rate (float): Dropout rate for output.
        with_cls_token (bool): Whether to include a classification token in the input.

    Methods:
        __init__(num_heads, embed_dim, kernel_size, padding_q, padding_kv, stride_q, stride_kv, qkv_projection_method, qkv_bias, attention_drop_rate, drop_rate, with_cls_token=True):
            Initializes the attention mechanism with the given parameters.

        prune_heads(heads):
            Prunes specified attention heads based on the provided indices.

        forward(hidden_state, height, width):
            Constructs the attention output using the input hidden state and spatial dimensions.

    Inherits from:
        nn.Module
    """
    def __init__(
        self,
        num_heads,
        embed_dim,
        kernel_size,
        padding_q,
        padding_kv,
        stride_q,
        stride_kv,
        qkv_projection_method,
        qkv_bias,
        attention_drop_rate,
        drop_rate,
        with_cls_token=True,
    ):
        """
        Initializes a CvtAttention instance with the specified parameters.

        Args:
            self (CvtAttention): The current instance of the CvtAttention class.
            num_heads (int): The number of attention heads to use.
            embed_dim (int): The dimension of the input embeddings.
            kernel_size (int): The size of the convolutional kernel.
            padding_q (int): Padding size for query tensor.
            padding_kv (int): Padding size for key and value tensors.
            stride_q (int): Stride size for query tensor.
            stride_kv (int): Stride size for key and value tensors.
            qkv_projection_method (str): The method used for query, key, value projection.
            qkv_bias (bool): Flag indicating whether to include bias in query, key, value projection.
            attention_drop_rate (float): The dropout rate applied to attention weights.
            drop_rate (float): The dropout rate applied to the output.
            with_cls_token (bool): Flag indicating whether to include a classification token.

        Returns:
            None.

        Raises:
            ValueError: If num_heads is not a positive integer.
            ValueError: If embed_dim is not a positive integer.
            ValueError: If kernel_size is not a positive integer.
            ValueError: If padding_q is not a non-negative integer.
            ValueError: If padding_kv is not a non-negative integer.
            ValueError: If stride_q is not a positive integer.
            ValueError: If stride_kv is not a positive integer.
            ValueError: If qkv_projection_method is not a string.
            ValueError: If attention_drop_rate is not in the range [0, 1].
            ValueError: If drop_rate is not in the range [0, 1].
        """
        super().__init__()
        self.attention = CvtSelfAttention(
            num_heads,
            embed_dim,
            kernel_size,
            padding_q,
            padding_kv,
            stride_q,
            stride_kv,
            qkv_projection_method,
            qkv_bias,
            attention_drop_rate,
            with_cls_token,
        )
        self.output = CvtSelfOutput(embed_dim, drop_rate)
        self.pruned_heads = set()

    def prune_heads(self, heads):
        """
        This method 'prune_heads' is defined within the class 'CvtAttention' and is used to prune the attention
        heads based on the provided 'heads' parameter.

        Args:
            self (object): The instance of the 'CvtAttention' class.
            heads (list): A list containing the indices of attention heads to be pruned.
                If the list is empty, no pruning is performed.

        Returns:
            None.

        Raises:
            ValueError: If the length of the 'heads' list is invalid or if any of the provided indices are out of range.
            TypeError: If the 'heads' parameter is not a list or if any of the internal operations encounter unexpected data types.
            RuntimeError: If an unexpected error occurs during the pruning process.
        """
        if len(heads) == 0:
            return
        heads, index = find_pruneable_heads_and_indices(
            heads, self.attention.num_attention_heads, self.attention.attention_head_size, self.pruned_heads
        )

        # Prune linear layers
        self.attention.query = prune_linear_layer(self.attention.query, index)
        self.attention.key = prune_linear_layer(self.attention.key, index)
        self.attention.value = prune_linear_layer(self.attention.value, index)
        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)

        # Update hyper params and store pruned heads
        self.attention.num_attention_heads = self.attention.num_attention_heads - len(heads)
        self.attention.all_head_size = self.attention.attention_head_size * self.attention.num_attention_heads
        self.pruned_heads = self.pruned_heads.union(heads)

    def forward(self, hidden_state, height, width):
        """
        Constructs an attention output based on the given hidden state, height, and width.

        Args:
            self (CvtAttention): An instance of the CvtAttention class.
            hidden_state: The hidden state used for attention computation.
            height (int): The height of the attention output.
            width (int): The width of the attention output.

        Returns:
            None.

        Raises:
            None.
        """
        self_output = self.attention(hidden_state, height, width)
        attention_output = self.output(self_output, hidden_state)
        return attention_output

mindnlp.transformers.models.cvt.modeling_cvt.CvtAttention.__init__(num_heads, embed_dim, kernel_size, padding_q, padding_kv, stride_q, stride_kv, qkv_projection_method, qkv_bias, attention_drop_rate, drop_rate, with_cls_token=True)

Initializes a CvtAttention instance with the specified parameters.

PARAMETER DESCRIPTION
self

The current instance of the CvtAttention class.

TYPE: CvtAttention

num_heads

The number of attention heads to use.

TYPE: int

embed_dim

The dimension of the input embeddings.

TYPE: int

kernel_size

The size of the convolutional kernel.

TYPE: int

padding_q

Padding size for query tensor.

TYPE: int

padding_kv

Padding size for key and value tensors.

TYPE: int

stride_q

Stride size for query tensor.

TYPE: int

stride_kv

Stride size for key and value tensors.

TYPE: int

qkv_projection_method

The method used for query, key, value projection.

TYPE: str

qkv_bias

Flag indicating whether to include bias in query, key, value projection.

TYPE: bool

attention_drop_rate

The dropout rate applied to attention weights.

TYPE: float

drop_rate

The dropout rate applied to the output.

TYPE: float

with_cls_token

Flag indicating whether to include a classification token.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If num_heads is not a positive integer.

ValueError

If embed_dim is not a positive integer.

ValueError

If kernel_size is not a positive integer.

ValueError

If padding_q is not a non-negative integer.

ValueError

If padding_kv is not a non-negative integer.

ValueError

If stride_q is not a positive integer.

ValueError

If stride_kv is not a positive integer.

ValueError

If qkv_projection_method is not a string.

ValueError

If attention_drop_rate is not in the range [0, 1].

ValueError

If drop_rate is not in the range [0, 1].

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(
    self,
    num_heads,
    embed_dim,
    kernel_size,
    padding_q,
    padding_kv,
    stride_q,
    stride_kv,
    qkv_projection_method,
    qkv_bias,
    attention_drop_rate,
    drop_rate,
    with_cls_token=True,
):
    """
    Initializes a CvtAttention instance with the specified parameters.

    Args:
        self (CvtAttention): The current instance of the CvtAttention class.
        num_heads (int): The number of attention heads to use.
        embed_dim (int): The dimension of the input embeddings.
        kernel_size (int): The size of the convolutional kernel.
        padding_q (int): Padding size for query tensor.
        padding_kv (int): Padding size for key and value tensors.
        stride_q (int): Stride size for query tensor.
        stride_kv (int): Stride size for key and value tensors.
        qkv_projection_method (str): The method used for query, key, value projection.
        qkv_bias (bool): Flag indicating whether to include bias in query, key, value projection.
        attention_drop_rate (float): The dropout rate applied to attention weights.
        drop_rate (float): The dropout rate applied to the output.
        with_cls_token (bool): Flag indicating whether to include a classification token.

    Returns:
        None.

    Raises:
        ValueError: If num_heads is not a positive integer.
        ValueError: If embed_dim is not a positive integer.
        ValueError: If kernel_size is not a positive integer.
        ValueError: If padding_q is not a non-negative integer.
        ValueError: If padding_kv is not a non-negative integer.
        ValueError: If stride_q is not a positive integer.
        ValueError: If stride_kv is not a positive integer.
        ValueError: If qkv_projection_method is not a string.
        ValueError: If attention_drop_rate is not in the range [0, 1].
        ValueError: If drop_rate is not in the range [0, 1].
    """
    super().__init__()
    self.attention = CvtSelfAttention(
        num_heads,
        embed_dim,
        kernel_size,
        padding_q,
        padding_kv,
        stride_q,
        stride_kv,
        qkv_projection_method,
        qkv_bias,
        attention_drop_rate,
        with_cls_token,
    )
    self.output = CvtSelfOutput(embed_dim, drop_rate)
    self.pruned_heads = set()

mindnlp.transformers.models.cvt.modeling_cvt.CvtAttention.forward(hidden_state, height, width)

Constructs an attention output based on the given hidden state, height, and width.

PARAMETER DESCRIPTION
self

An instance of the CvtAttention class.

TYPE: CvtAttention

hidden_state

The hidden state used for attention computation.

height

The height of the attention output.

TYPE: int

width

The width of the attention output.

TYPE: int

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, hidden_state, height, width):
    """
    Constructs an attention output based on the given hidden state, height, and width.

    Args:
        self (CvtAttention): An instance of the CvtAttention class.
        hidden_state: The hidden state used for attention computation.
        height (int): The height of the attention output.
        width (int): The width of the attention output.

    Returns:
        None.

    Raises:
        None.
    """
    self_output = self.attention(hidden_state, height, width)
    attention_output = self.output(self_output, hidden_state)
    return attention_output

mindnlp.transformers.models.cvt.modeling_cvt.CvtAttention.prune_heads(heads)

This method 'prune_heads' is defined within the class 'CvtAttention' and is used to prune the attention heads based on the provided 'heads' parameter.

PARAMETER DESCRIPTION
self

The instance of the 'CvtAttention' class.

TYPE: object

heads

A list containing the indices of attention heads to be pruned. If the list is empty, no pruning is performed.

TYPE: list

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If the length of the 'heads' list is invalid or if any of the provided indices are out of range.

TypeError

If the 'heads' parameter is not a list or if any of the internal operations encounter unexpected data types.

RuntimeError

If an unexpected error occurs during the pruning process.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def prune_heads(self, heads):
    """
    This method 'prune_heads' is defined within the class 'CvtAttention' and is used to prune the attention
    heads based on the provided 'heads' parameter.

    Args:
        self (object): The instance of the 'CvtAttention' class.
        heads (list): A list containing the indices of attention heads to be pruned.
            If the list is empty, no pruning is performed.

    Returns:
        None.

    Raises:
        ValueError: If the length of the 'heads' list is invalid or if any of the provided indices are out of range.
        TypeError: If the 'heads' parameter is not a list or if any of the internal operations encounter unexpected data types.
        RuntimeError: If an unexpected error occurs during the pruning process.
    """
    if len(heads) == 0:
        return
    heads, index = find_pruneable_heads_and_indices(
        heads, self.attention.num_attention_heads, self.attention.attention_head_size, self.pruned_heads
    )

    # Prune linear layers
    self.attention.query = prune_linear_layer(self.attention.query, index)
    self.attention.key = prune_linear_layer(self.attention.key, index)
    self.attention.value = prune_linear_layer(self.attention.value, index)
    self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)

    # Update hyper params and store pruned heads
    self.attention.num_attention_heads = self.attention.num_attention_heads - len(heads)
    self.attention.all_head_size = self.attention.attention_head_size * self.attention.num_attention_heads
    self.pruned_heads = self.pruned_heads.union(heads)

mindnlp.transformers.models.cvt.modeling_cvt.CvtConvEmbeddings

Bases: Module

Image to Conv Embedding.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtConvEmbeddings(nn.Module):
    """
    Image to Conv Embedding.
    """
    def __init__(self, patch_size, num_channels, embed_dim, stride, padding):
        """
        __init__

        Initializes the CvtConvEmbeddings class.

        Args:
            self: The instance of the class.
            patch_size (int or tuple): The size of the patch or kernel used for convolution.
                If an int is provided, the patch will be square.
                If a tuple is provided, it should contain two integers representing the height and width of the patch.
            num_channels (int): The number of input channels for the convolutional layer.
            embed_dim (int): The dimensionality of the output embedding.
            stride (int or tuple): The stride of the convolution operation.
                If an int is provided, the same stride is used in both dimensions.
                If a tuple is provided, it should contain two integers
                representing the stride in the height and width dimensions.
            padding (int or tuple): The amount of padding to be added to the input data for the convolution operation.
                If an int is provided, the same padding is added to both dimensions.
                If a tuple is provided, it should contain two integers representing the padding
                in the height and width dimensions.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size)
        self.patch_size = patch_size
        self.projection = nn.Conv2d(num_channels, embed_dim, kernel_size=patch_size, stride=stride, padding=padding, pad_mode='pad', bias=True)
        self.normalization = nn.LayerNorm(embed_dim)

    def forward(self, pixel_values):
        """
        Constructs the pixel embeddings for a given set of pixel values.

        Args:
            self (CvtConvEmbeddings): An instance of the CvtConvEmbeddings class.
            pixel_values (mindspore.Tensor): A tensor containing the pixel values of the image.
                It should have the shape (batch_size, num_channels, height, width).

        Returns:
            mindspore.Tensor: The patch embeddings of shape (batch_size, embed_dim, height, width),
                where height and width are the spatial sizes after the patch projection.

        Raises:
            None.
        """
        pixel_values = self.projection(pixel_values)
        batch_size, num_channels, height, width = pixel_values.shape
        hidden_size = height * width
        # rearrange "b c h w -> b (h w) c"
        pixel_values = pixel_values.view(batch_size, num_channels, hidden_size).permute(0, 2, 1)
        if self.normalization:
            pixel_values = self.normalization(pixel_values)
        # rearrange "b (h w) c" -> b c h w"
        pixel_values = pixel_values.permute(0, 2, 1).view(batch_size, num_channels, height, width)
        return pixel_values

mindnlp.transformers.models.cvt.modeling_cvt.CvtConvEmbeddings.__init__(patch_size, num_channels, embed_dim, stride, padding)

init

Initializes the CvtConvEmbeddings class.

PARAMETER DESCRIPTION
self

The instance of the class.

patch_size

The size of the patch or kernel used for convolution. If an int is provided, the patch will be square. If a tuple is provided, it should contain two integers representing the height and width of the patch.

TYPE: int or tuple

num_channels

The number of input channels for the convolutional layer.

TYPE: int

embed_dim

The dimensionality of the output embedding.

TYPE: int

stride

The stride of the convolution operation. If an int is provided, the same stride is used in both dimensions. If a tuple is provided, it should contain two integers representing the stride in the height and width dimensions.

TYPE: int or tuple

padding

The amount of padding to be added to the input data for the convolution operation. If an int is provided, the same padding is added to both dimensions. If a tuple is provided, it should contain two integers representing the padding in the height and width dimensions.

TYPE: int or tuple

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(self, patch_size, num_channels, embed_dim, stride, padding):
    """
    __init__

    Initializes the CvtConvEmbeddings class.

    Args:
        self: The instance of the class.
        patch_size (int or tuple): The size of the patch or kernel used for convolution.
            If an int is provided, the patch will be square.
            If a tuple is provided, it should contain two integers representing the height and width of the patch.
        num_channels (int): The number of input channels for the convolutional layer.
        embed_dim (int): The dimensionality of the output embedding.
        stride (int or tuple): The stride of the convolution operation.
            If an int is provided, the same stride is used in both dimensions.
            If a tuple is provided, it should contain two integers
            representing the stride in the height and width dimensions.
        padding (int or tuple): The amount of padding to be added to the input data for the convolution operation.
            If an int is provided, the same padding is added to both dimensions.
            If a tuple is provided, it should contain two integers representing the padding
            in the height and width dimensions.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size)
    self.patch_size = patch_size
    self.projection = nn.Conv2d(num_channels, embed_dim, kernel_size=patch_size, stride=stride, padding=padding, pad_mode='pad', bias=True)
    self.normalization = nn.LayerNorm(embed_dim)

mindnlp.transformers.models.cvt.modeling_cvt.CvtConvEmbeddings.forward(pixel_values)

Constructs the pixel embeddings for a given set of pixel values.

PARAMETER DESCRIPTION
self

An instance of the CvtConvEmbeddings class.

TYPE: CvtConvEmbeddings

pixel_values

A tensor containing the pixel values of the image. It should have the shape (batch_size, num_channels, height, width).

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

The patch embeddings of shape (batch_size, embed_dim, height, width), where height and width are the spatial sizes after the patch projection.
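
A hedged sketch of the stage-0 embedding from the default CvtConfig (patch size 7, stride 4, padding 2); the 224x224 input size is illustrative:

import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import CvtConvEmbeddings

embeddings = CvtConvEmbeddings(patch_size=7, num_channels=3, embed_dim=64, stride=4, padding=2)
pixel_values = mindspore.ops.zeros((1, 3, 224, 224))
patch_embeddings = embeddings(pixel_values)
print(patch_embeddings.shape)  # (1, 64, 56, 56): the stride-4 projection downsamples 224 -> 56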

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, pixel_values):
    """
    Constructs the pixel embeddings for a given set of pixel values.

    Args:
        self (CvtConvEmbeddings): An instance of the CvtConvEmbeddings class.
        pixel_values (mindspore.Tensor): A tensor containing the pixel values of the image.
            It should have the shape (batch_size, num_channels, height, width).

    Returns:
        mindspore.Tensor: The patch embeddings of shape (batch_size, embed_dim, height, width),
            where height and width are the spatial sizes after the patch projection.

    Raises:
        None.
    """
    pixel_values = self.projection(pixel_values)
    batch_size, num_channels, height, width = pixel_values.shape
    hidden_size = height * width
    # rearrange "b c h w -> b (h w) c"
    pixel_values = pixel_values.view(batch_size, num_channels, hidden_size).permute(0, 2, 1)
    if self.normalization:
        pixel_values = self.normalization(pixel_values)
    # rearrange "b (h w) c" -> b c h w"
    pixel_values = pixel_values.permute(0, 2, 1).view(batch_size, num_channels, height, width)
    return pixel_values

mindnlp.transformers.models.cvt.modeling_cvt.CvtDropPath

Bases: Module

Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtDropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks)."""
    def __init__(self, drop_prob: Optional[float] = None) -> None:
        """
        Initializes an instance of the CvtDropPath class.

        Args:
            self: The instance of the class.
            drop_prob (Optional[float]): The probability of dropping a connection during training. Defaults to None.
                Must be a float value between 0 and 1, inclusive.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor:
        """
        This method forwards a modified version of the input hidden_states tensor using the drop_path operation.

        Args:
            self (CvtDropPath): The instance of the CvtDropPath class.
            hidden_states (mindspore.Tensor): The input tensor representing hidden states.
                It should be a tensor of arbitrary shape and type.

        Returns:
            mindspore.Tensor: A tensor of the same shape and type as the input hidden_states tensor,
                but with the drop_path operation applied.

        Raises:
            ValueError: If the input hidden_states tensor is not a valid mindspore.Tensor object.
            RuntimeError: If an error occurs during the drop_path operation.
        """
        return drop_path(hidden_states, self.drop_prob, self.training)

    def extra_repr(self) -> str:
        """
        This method provides a string representation for the CvtDropPath class.

        Args:
            self: CvtDropPath instance. Represents the current instance of the CvtDropPath class.

        Returns:
            str: A string representing the drop probability of the CvtDropPath instance.

        Raises:
            None.
        """
        return "p={}".format(self.drop_prob)

mindnlp.transformers.models.cvt.modeling_cvt.CvtDropPath.__init__(drop_prob=None)

Initializes an instance of the CvtDropPath class.

PARAMETER DESCRIPTION
self

The instance of the class.

drop_prob

The probability of dropping a connection during training. Defaults to None. Must be a float value between 0 and 1, inclusive.

TYPE: Optional[float] DEFAULT: None

RETURNS DESCRIPTION
None

None.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(self, drop_prob: Optional[float] = None) -> None:
    """
    Initializes an instance of the CvtDropPath class.

    Args:
        self: The instance of the class.
        drop_prob (Optional[float]): The probability of dropping a connection during training. Defaults to None.
            Must be a float value between 0 and 1, inclusive.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    self.drop_prob = drop_prob

mindnlp.transformers.models.cvt.modeling_cvt.CvtDropPath.extra_repr()

This method provides a string representation for the CvtDropPath class.

PARAMETER DESCRIPTION
self

CvtDropPath instance. Represents the current instance of the CvtDropPath class.

RETURNS DESCRIPTION
str

A string representing the drop probability of the CvtDropPath instance.

TYPE: str

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def extra_repr(self) -> str:
    """
    This method provides a string representation for the CvtDropPath class.

    Args:
        self: CvtDropPath instance. Represents the current instance of the CvtDropPath class.

    Returns:
        str: A string representing the drop probability of the CvtDropPath instance.

    Raises:
        None.
    """
    return "p={}".format(self.drop_prob)

mindnlp.transformers.models.cvt.modeling_cvt.CvtDropPath.forward(hidden_states)

This method forwards a modified version of the input hidden_states tensor using the drop_path operation.

PARAMETER DESCRIPTION
self

The instance of the CvtDropPath class.

TYPE: CvtDropPath

hidden_states

The input tensor representing hidden states. It should be a tensor of arbitrary shape and type.

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

mindspore.Tensor: A tensor of the same shape and type as the input hidden_states tensor, but with the drop_path operation applied.

RAISES DESCRIPTION
ValueError

If the input hidden_states tensor is not a valid mindspore.Tensor object.

RuntimeError

If an error occurs during the drop_path operation.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor:
    """
    This method forwards a modified version of the input hidden_states tensor using the drop_path operation.

    Args:
        self (CvtDropPath): The instance of the CvtDropPath class.
        hidden_states (mindspore.Tensor): The input tensor representing hidden states.
            It should be a tensor of arbitrary shape and type.

    Returns:
        mindspore.Tensor: A tensor of the same shape and type as the input hidden_states tensor,
            but with the drop_path operation applied.

    Raises:
        ValueError: If the input hidden_states tensor is not a valid mindspore.Tensor object.
        RuntimeError: If an error occurs during the drop_path operation.
    """
    return drop_path(hidden_states, self.drop_prob, self.training)

mindnlp.transformers.models.cvt.modeling_cvt.CvtEmbeddings

Bases: Module

Construct the CvT embeddings.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtEmbeddings(nn.Module):
    """
    Construct the CvT embeddings.
    """
    def __init__(self, patch_size, num_channels, embed_dim, stride, padding, dropout_rate):
        """
        Initializes an instance of the CvtEmbeddings class.

        Args:
            self: The object instance.
            patch_size (int): The size of the patches to be extracted from the input image.
            num_channels (int): The number of input channels in the image.
            embed_dim (int): The dimension of the embedded representation.
            stride (int): The stride of the convolution operation.
            padding (int): The amount of padding to be added to the input image.
            dropout_rate (float): The dropout rate to be applied to the convolutional embeddings.

        Returns:
            None.

        Raises:
            None.

        """
        super().__init__()
        self.convolution_embeddings = CvtConvEmbeddings(
            patch_size=patch_size, num_channels=num_channels, embed_dim=embed_dim, stride=stride, padding=padding
        )
        self.dropout = nn.Dropout(p=dropout_rate)

    def forward(self, pixel_values):
        """
        Constructs the hidden state using convolutional embeddings.

        Args:
            self (CvtEmbeddings): The instance of the CvtEmbeddings class.
            pixel_values (array-like): An array-like object containing pixel values for image data.

        Returns:
            numpy.ndarray: The hidden state forwarded using convolutional embeddings.

        Raises:
            ValueError: If the pixel_values parameter is empty or not valid.
            TypeError: If the pixel_values parameter is not array-like.
            RuntimeError: If an unexpected error occurs during the forward pass.
        """
        hidden_state = self.convolution_embeddings(pixel_values)
        hidden_state = self.dropout(hidden_state)
        return hidden_state
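
Usage sketch (shapes are illustrative and assume the stage-0 settings of the default microsoft/cvt-13 configuration; the random input stands in for a preprocessed image batch):

import numpy as np
import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import CvtEmbeddings

embeddings = CvtEmbeddings(patch_size=7, num_channels=3, embed_dim=64, stride=4, padding=2, dropout_rate=0.0)
pixel_values = mindspore.Tensor(np.random.randn(1, 3, 224, 224).astype(np.float32))
hidden_state = embeddings(pixel_values)  # roughly (1, 64, 56, 56): a downsampled feature map with embed_dim channels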

mindnlp.transformers.models.cvt.modeling_cvt.CvtEmbeddings.__init__(patch_size, num_channels, embed_dim, stride, padding, dropout_rate)

Initializes an instance of the CvtEmbeddings class.

PARAMETER DESCRIPTION
self

The object instance.

patch_size

The size of the patches to be extracted from the input image.

TYPE: int

num_channels

The number of input channels in the image.

TYPE: int

embed_dim

The dimension of the embedded representation.

TYPE: int

stride

The stride of the convolution operation.

TYPE: int

padding

The amount of padding to be added to the input image.

TYPE: int

dropout_rate

The dropout rate to be applied to the convolutional embeddings.

TYPE: float

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(self, patch_size, num_channels, embed_dim, stride, padding, dropout_rate):
    """
    Initializes an instance of the CvtEmbeddings class.

    Args:
        self: The object instance.
        patch_size (int): The size of the patches to be extracted from the input image.
        num_channels (int): The number of input channels in the image.
        embed_dim (int): The dimension of the embedded representation.
        stride (int): The stride of the convolution operation.
        padding (int): The amount of padding to be added to the input image.
        dropout_rate (float): The dropout rate to be applied to the convolutional embeddings.

    Returns:
        None.

    Raises:
        None.

    """
    super().__init__()
    self.convolution_embeddings = CvtConvEmbeddings(
        patch_size=patch_size, num_channels=num_channels, embed_dim=embed_dim, stride=stride, padding=padding
    )
    self.dropout = nn.Dropout(p=dropout_rate)

mindnlp.transformers.models.cvt.modeling_cvt.CvtEmbeddings.forward(pixel_values)

Constructs the hidden state using convolutional embeddings.

PARAMETER DESCRIPTION
self

The instance of the CvtEmbeddings class.

TYPE: CvtEmbeddings

pixel_values

An array-like object containing pixel values for image data.

TYPE: array-like

RETURNS DESCRIPTION

numpy.ndarray: The hidden state forwarded using convolutional embeddings.

RAISES DESCRIPTION
ValueError

If the pixel_values parameter is empty or not valid.

TypeError

If the pixel_values parameter is not array-like.

RuntimeError

If an unexpected error occurs during the forward pass.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, pixel_values):
    """
    Constructs the hidden state using convolutional embeddings.

    Args:
        self (CvtEmbeddings): The instance of the CvtEmbeddings class.
        pixel_values (array-like): An array-like object containing pixel values for image data.

    Returns:
        numpy.ndarray: The hidden state forwarded using convolutional embeddings.

    Raises:
        ValueError: If the pixel_values parameter is empty or not valid.
        TypeError: If the pixel_values parameter is not array-like.
        RuntimeError: If an unexpected error occurs during the forward pass.
    """
    hidden_state = self.convolution_embeddings(pixel_values)
    hidden_state = self.dropout(hidden_state)
    return hidden_state

mindnlp.transformers.models.cvt.modeling_cvt.CvtEncoder

Bases: Module

This class represents the CvT encoder, which converts input pixel values into hidden states through a sequence of CvtStage modules. It is a subclass of nn.Module.

ATTRIBUTE DESCRIPTION
config

The configuration object for the CvtEncoder.

TYPE: Config

stages

A list of CvtStage instances representing the stages of the CvT encoder.

TYPE: ModuleList

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtEncoder(nn.Module):

    """
    This class represents the CvT encoder, which converts input pixel values into hidden states through a sequence of CvtStage modules. It is a subclass of nn.Module.

    Attributes:
        config (Config): The configuration object for the CvtEncoder.
        stages (nn.ModuleList): A list of CvtStage instances representing the stages of the CvT encoder.

    Methods:
        __init__(self, config)
            Initializes a new instance of the CvtEncoder class.

            Args:

            - config (Config): The configuration object for the CvtEncoder.

        forward(self, pixel_values, output_hidden_states=False, return_dict=True)
            Constructs the converter encoder model.

            Args:

            - pixel_values (tensor): The input pixel values.
            - output_hidden_states (bool): Whether to output all hidden states. Defaults to False.
            - return_dict (bool): Whether to return the model output as a dictionary. Defaults to True.

            Returns:

            - BaseModelOutputWithCLSToken: The model output containing the last hidden state, the cls token value, and
            all hidden states.
    """
    def __init__(self, config):
        """
        Initializes an instance of the CvtEncoder class.

        Args:
            self: The instance of the class.
            config (object): The configuration object that holds the parameters for the encoder.
                This object is used to configure the behavior of the encoder.
                It must be an instance of the Config class.

        Returns:
            None

        Raises:
            None
        """
        super().__init__()
        self.config = config
        self.stages = nn.ModuleList([])
        for stage_idx in range(len(config.depth)):
            self.stages.append(CvtStage(config, stage_idx))

    def forward(self, pixel_values, output_hidden_states=False, return_dict=True):
        """
        Runs the CvT encoder over the input pixel values.

        Args:
            self (CvtEncoder): The instance of the CvtEncoder class.
            pixel_values (Any): The input pixel values.
            output_hidden_states (bool): Whether to output hidden states or not. Defaults to False.
            return_dict (bool): Whether to return the result as a dictionary or not. Defaults to True.

        Returns:
            Union[Tuple, BaseModelOutputWithCLSToken]: The last hidden state, the cls token value and, optionally, the per-stage hidden states.

        Raises:
            None
        """
        all_hidden_states = () if output_hidden_states else None
        hidden_state = pixel_values

        cls_token = None
        for _, (stage_module) in enumerate(self.stages):
            hidden_state, cls_token = stage_module(hidden_state)
            if output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_state,)

        if not return_dict:
            return tuple(v for v in [hidden_state, cls_token, all_hidden_states] if v is not None)

        return BaseModelOutputWithCLSToken(
            last_hidden_state=hidden_state,
            cls_token_value=cls_token,
            hidden_states=all_hidden_states,
        )
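
A hedged sketch of driving the encoder directly from a configuration (import paths follow the module names used on this page and should be treated as an assumption; output shapes are indicative for 224x224 inputs with the default three-stage configuration):

import numpy as np
import mindspore
from mindnlp.transformers import CvtConfig
from mindnlp.transformers.models.cvt.modeling_cvt import CvtEncoder

config = CvtConfig()  # defaults mirror microsoft/cvt-13
encoder = CvtEncoder(config)
pixel_values = mindspore.Tensor(np.random.randn(1, 3, 224, 224).astype(np.float32))
outputs = encoder(pixel_values, output_hidden_states=True, return_dict=True)
# outputs.last_hidden_state: final stage feature map, e.g. (1, 384, 14, 14)
# outputs.cls_token_value:   classification token of the last stage, e.g. (1, 1, 384)
# outputs.hidden_states:     one feature map per stage (3 with the default config)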

mindnlp.transformers.models.cvt.modeling_cvt.CvtEncoder.__init__(config)

Initializes an instance of the CvtEncoder class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object that holds the parameters for the encoder. This object is used to configure the behavior of the encoder. It must be an instance of the Config class.

TYPE: object

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(self, config):
    """
    Initializes an instance of the CvtEncoder class.

    Args:
        self: The instance of the class.
        config (object): The configuration object that holds the parameters for the encoder.
            This object is used to configure the behavior of the encoder.
            It must be an instance of the Config class.

    Returns:
        None

    Raises:
        None
    """
    super().__init__()
    self.config = config
    self.stages = nn.ModuleList([])
    for stage_idx in range(len(config.depth)):
        self.stages.append(CvtStage(config, stage_idx))

mindnlp.transformers.models.cvt.modeling_cvt.CvtEncoder.forward(pixel_values, output_hidden_states=False, return_dict=True)

Runs the CvT encoder over the input pixel values.

PARAMETER DESCRIPTION
self

The instance of the CvtEncoder class.

TYPE: CvtEncoder

pixel_values

The input pixel values.

TYPE: Any

output_hidden_states

Whether to output hidden states or not. Defaults to False.

TYPE: bool DEFAULT: False

return_dict

Whether to return the result as a dictionary or not. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION

Union[Tuple, BaseModelOutputWithCLSToken]: The last hidden state, the cls token value and, optionally, the per-stage hidden states.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, pixel_values, output_hidden_states=False, return_dict=True):
    """
    Runs the CvT encoder over the input pixel values.

    Args:
        self (CvtEncoder): The instance of the CvtEncoder class.
        pixel_values (Any): The input pixel values.
        output_hidden_states (bool): Whether to output hidden states or not. Defaults to False.
        return_dict (bool): Whether to return the result as a dictionary or not. Defaults to True.

    Returns:
        Union[Tuple, BaseModelOutputWithCLSToken]: The last hidden state, the cls token value and, optionally, the per-stage hidden states.

    Raises:
        None
    """
    all_hidden_states = () if output_hidden_states else None
    hidden_state = pixel_values

    cls_token = None
    for _, (stage_module) in enumerate(self.stages):
        hidden_state, cls_token = stage_module(hidden_state)
        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_state,)

    if not return_dict:
        return tuple(v for v in [hidden_state, cls_token, all_hidden_states] if v is not None)

    return BaseModelOutputWithCLSToken(
        last_hidden_state=hidden_state,
        cls_token_value=cls_token,
        hidden_states=all_hidden_states,
    )

mindnlp.transformers.models.cvt.modeling_cvt.CvtForImageClassification

Bases: CvtPreTrainedModel

CvtForImageClassification is a class that represents a model for image classification utilizing the Cvt architecture. It inherits from the CvtPreTrainedModel class and provides methods for forwarding the model and computing image classification/regression loss.

ATTRIBUTE DESCRIPTION
num_labels

Number of labels for classification

TYPE: int

cvt

CvtModel instance used for image processing

TYPE: CvtModel

layernorm

Layer normalization module

TYPE: LayerNorm

classifier

Classifier module for final predictions

TYPE: Linear or Identity

METHOD DESCRIPTION
__init__

Initializes the CvtForImageClassification model with the provided configuration.

forward

Constructs the model and computes loss for image classification.

Parameters:

  • pixel_values (Optional[mindspore.Tensor]): Tensor containing pixel values of images
  • labels (Optional[mindspore.Tensor]): Tensor containing labels for computing classification/regression loss
  • output_hidden_states (Optional[bool]): Flag to indicate whether to output hidden states
  • return_dict (Optional[bool]): Flag to indicate whether to return output as a dictionary
RETURNS DESCRIPTION

Union[Tuple, ImageClassifierOutputWithNoAttention]: Tuple containing loss and output if return_dict is False. Otherwise, returns an ImageClassifierOutputWithNoAttention instance.

Notes
  • The 'forward' method handles the processing of input pixel values, computation of logits, and determination of loss based on the configuration settings.
  • The loss calculation depends on the problem type (regression, single_label_classification, or multi_label_classification) and the number of labels.
  • The final output includes logits and optionally hidden states depending on the return_dict flag.
Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtForImageClassification(CvtPreTrainedModel):

    """
    CvtForImageClassification is a class that represents a model for image classification utilizing the Cvt architecture.
    It inherits from the CvtPreTrainedModel class and provides methods for forwarding the model and computing
    image classification/regression loss.

    Attributes:
        num_labels (int): Number of labels for classification
        cvt (CvtModel): CvtModel instance used for image processing
        layernorm (nn.LayerNorm): Layer normalization module
        classifier (nn.Linear or nn.Identity): Classifier module for final predictions

    Methods:
        __init__(self, config): Initializes the CvtForImageClassification model with the provided configuration.
        forward(self, pixel_values, labels, output_hidden_states, return_dict):
            Constructs the model and computes loss for image classification.

            Parameters:

            - pixel_values (Optional[mindspore.Tensor]): Tensor containing pixel values of images
            - labels (Optional[mindspore.Tensor]): Tensor containing labels for computing classification/regression loss
            - output_hidden_states (Optional[bool]): Flag to indicate whether to output hidden states
            - return_dict (Optional[bool]): Flag to indicate whether to return output as a dictionary

    Returns:
        Union[Tuple, ImageClassifierOutputWithNoAttention]: Tuple containing loss and output if return_dict is False.
          Otherwise, returns an ImageClassifierOutputWithNoAttention instance.

    Notes:
        - The 'forward' method handles the processing of input pixel values, computation of logits,
        and determination of loss based on the configuration settings.
        - The loss calculation depends on the problem type (regression, single_label_classification,
        or multi_label_classification) and the number of labels.
        - The final output includes logits and optionally hidden states depending on the return_dict flag.
    """
    def __init__(self, config):
        """
        Initializes a new instance of the CvtForImageClassification class.

        Args:
            self: The object itself.
            config:
                An instance of the class Config containing the configuration settings.

                - Type: object
                - Purpose: Stores the configuration settings for the model.
                - Restrictions: Must be a valid instance of the Config class.

        Returns:
            None

        Raises:
            None
        """
        super().__init__(config)

        self.num_labels = config.num_labels
        self.cvt = CvtModel(config, add_pooling_layer=False)
        self.layernorm = nn.LayerNorm(config.embed_dim[-1])
        # Classifier head
        self.classifier = (
            nn.Linear(config.embed_dim[-1], config.num_labels) if config.num_labels > 0 else nn.Identity()
        )

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, ImageClassifierOutputWithNoAttention]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
                Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
                config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
                `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        outputs = self.cvt(
            pixel_values,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]
        cls_token = outputs[1]
        if self.config.cls_token[-1]:
            sequence_output = self.layernorm(cls_token)
        else:
            batch_size, num_channels, height, width = sequence_output.shape
            # rearrange "b c h w -> b (h w) c"
            sequence_output = sequence_output.view(batch_size, num_channels, height * width).permute(0, 2, 1)
            sequence_output = self.layernorm(sequence_output)

        sequence_output_mean = sequence_output.mean(axis=1)
        logits = self.classifier(sequence_output_mean)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.config.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.config.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                if self.config.num_labels == 1:
                    loss = ops.mse_loss(logits.squeeze(), labels.squeeze())
                else:
                    loss = ops.mse_loss(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss = ops.cross_entropy(logits.view(-1, self.config.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss = ops.binary_cross_entropy_with_logits(logits, labels)

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return ImageClassifierOutputWithNoAttention(loss=loss, logits=logits, hidden_states=outputs.hidden_states)
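
A minimal end-to-end sketch with a randomly initialised model (loading pretrained weights, e.g. via from_pretrained, and real image preprocessing are omitted; the CvtConfig import path is an assumption based on the module names on this page):

import numpy as np
import mindspore
from mindnlp.transformers import CvtConfig
from mindnlp.transformers.models.cvt.modeling_cvt import CvtForImageClassification

config = CvtConfig(num_labels=10)
model = CvtForImageClassification(config)

pixel_values = mindspore.Tensor(np.random.randn(2, 3, 224, 224).astype(np.float32))
labels = mindspore.Tensor(np.array([1, 7], dtype=np.int32))

outputs = model(pixel_values, labels=labels, return_dict=True)
# outputs.logits has shape (2, 10); since num_labels > 1 and labels are integers,
# problem_type resolves to "single_label_classification" and outputs.loss is a cross-entropy loss.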

mindnlp.transformers.models.cvt.modeling_cvt.CvtForImageClassification.__init__(config)

Initializes a new instance of the CvtForImageClassification class.

PARAMETER DESCRIPTION
self

The object itself.

config

An instance of the class Config containing the configuration settings.

  • Type: object
  • Purpose: Stores the configuration settings for the model.
  • Restrictions: Must be a valid instance of the Config class.

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(self, config):
    """
    Initializes a new instance of the CvtForImageClassification class.

    Args:
        self: The object itself.
        config:
            An instance of the class Config containing the configuration settings.

            - Type: object
            - Purpose: Stores the configuration settings for the model.
            - Restrictions: Must be a valid instance of the Config class.

    Returns:
        None

    Raises:
        None
    """
    super().__init__(config)

    self.num_labels = config.num_labels
    self.cvt = CvtModel(config, add_pooling_layer=False)
    self.layernorm = nn.LayerNorm(config.embed_dim[-1])
    # Classifier head
    self.classifier = (
        nn.Linear(config.embed_dim[-1], config.num_labels) if config.num_labels > 0 else nn.Identity()
    )

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.cvt.modeling_cvt.CvtForImageClassification.forward(pixel_values=None, labels=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the image classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).

TYPE: `mindspore.Tensor` of shape `(batch_size,)`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, ImageClassifierOutputWithNoAttention]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the image classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    outputs = self.cvt(
        pixel_values,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    sequence_output = outputs[0]
    cls_token = outputs[1]
    if self.config.cls_token[-1]:
        sequence_output = self.layernorm(cls_token)
    else:
        batch_size, num_channels, height, width = sequence_output.shape
        # rearrange "b c h w -> b (h w) c"
        sequence_output = sequence_output.view(batch_size, num_channels, height * width).permute(0, 2, 1)
        sequence_output = self.layernorm(sequence_output)

    sequence_output_mean = sequence_output.mean(axis=1)
    logits = self.classifier(sequence_output_mean)

    loss = None
    if labels is not None:
        if self.config.problem_type is None:
            if self.config.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.config.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            if self.config.num_labels == 1:
                loss = ops.mse_loss(logits.squeeze(), labels.squeeze())
            else:
                loss = ops.mse_loss(logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss = ops.cross_entropy(logits.view(-1, self.config.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss = ops.binary_cross_entropy_with_logits(logits, labels)

    if not return_dict:
        output = (logits,) + outputs[2:]
        return ((loss,) + output) if loss is not None else output

    return ImageClassifierOutputWithNoAttention(loss=loss, logits=logits, hidden_states=outputs.hidden_states)

mindnlp.transformers.models.cvt.modeling_cvt.CvtIntermediate

Bases: Module

Represents an intermediate layer in a Convolutional Vision Transformer (CVT) network.

This class defines an intermediate layer in a CVT network that consists of a dense layer followed by a GELU activation function. The intermediate layer is used to process the hidden states in the network.

ATTRIBUTE DESCRIPTION
embed_dim

The dimension of the input embeddings.

TYPE: int

mlp_ratio

The ratio used to determine the size of the hidden layer in the dense layer.

TYPE: float

METHOD DESCRIPTION
__init__

Initializes the CvtIntermediate object with the specified embedding dimension and MLP ratio.

forward

Constructs the intermediate layer by applying a dense layer and GELU activation function to the input hidden state.

Inherits from

nn.Module

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtIntermediate(nn.Module):

    """
    Represents an intermediate layer in a Convolutional Vision Transformer (CVT) network.

    This class defines an intermediate layer in a CVT network that consists of a dense layer followed by
    a GELU activation function. The intermediate layer is used to process the hidden states in the network.

    Attributes:
        embed_dim (int): The dimension of the input embeddings.
        mlp_ratio (float): The ratio used to determine the size of the hidden layer in the dense layer.

    Methods:
        __init__(self, embed_dim, mlp_ratio):
            Initializes the CvtIntermediate object with the specified embedding dimension and MLP ratio.
        forward(self, hidden_state):
            Constructs the intermediate layer by applying a dense layer and GELU activation function to the input hidden state.

    Inherits from:
        nn.Module
    """
    def __init__(self, embed_dim, mlp_ratio):
        """
        Initializes an instance of the CvtIntermediate class.

        Args:
            self (CvtIntermediate): The instance of the class.
            embed_dim (int): The dimension of the embedding.
            mlp_ratio (float): The ratio used to calculate the hidden dimension of the MLP.

        Returns:
            None

        Raises:
            None
        """
        super().__init__()
        self.dense = nn.Linear(embed_dim, int(embed_dim * mlp_ratio))
        self.activation = nn.GELU()

    def forward(self, hidden_state):
        """
        Constructs the hidden state of the CvtIntermediate class.

        Args:
            self (CvtIntermediate): An instance of the CvtIntermediate class.
            hidden_state: The hidden state to be processed. It should be a tensor or array-like object.

        Returns:
            The transformed hidden state, obtained by passing the input through the dense layer
            and the GELU activation.

        Raises:
            None.

        The 'hidden_state' is first passed through the dense projection ('self.dense'), which expands it
        from 'embed_dim' to 'int(embed_dim * mlp_ratio)' features, and the result is then passed through
        the GELU activation ('self.activation'). A new tensor is returned; the input is not modified in place.
        """
        hidden_state = self.dense(hidden_state)
        hidden_state = self.activation(hidden_state)
        return hidden_state
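
For example, with embed_dim=384 and mlp_ratio=4.0 (the last entries of the default configuration lists), the dense layer expands every token from 384 to 1536 features before the GELU. An illustrative shape check (the random tokens are a stand-in for a real hidden state):

import numpy as np
import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import CvtIntermediate

intermediate = CvtIntermediate(embed_dim=384, mlp_ratio=4.0)
tokens = mindspore.Tensor(np.random.randn(1, 196, 384).astype(np.float32))  # (batch, seq_len, embed_dim)
expanded = intermediate(tokens)  # (1, 196, 1536)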

mindnlp.transformers.models.cvt.modeling_cvt.CvtIntermediate.__init__(embed_dim, mlp_ratio)

Initializes an instance of the CvtIntermediate class.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: CvtIntermediate

embed_dim

The dimension of the embedding.

TYPE: int

mlp_ratio

The ratio used to calculate the hidden dimension of the MLP.

TYPE: float

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(self, embed_dim, mlp_ratio):
    """
    Initializes an instance of the CvtIntermediate class.

    Args:
        self (CvtIntermediate): The instance of the class.
        embed_dim (int): The dimension of the embedding.
        mlp_ratio (float): The ratio used to calculate the hidden dimension of the MLP.

    Returns:
        None

    Raises:
        None
    """
    super().__init__()
    self.dense = nn.Linear(embed_dim, int(embed_dim * mlp_ratio))
    self.activation = nn.GELU()

mindnlp.transformers.models.cvt.modeling_cvt.CvtIntermediate.forward(hidden_state)

Constructs the hidden state of the CvtIntermediate class.

PARAMETER DESCRIPTION
self

An instance of the CvtIntermediate class.

TYPE: CvtIntermediate

hidden_state

The hidden state to be processed. It should be a tensor or array-like object.

RETURNS DESCRIPTION

The transformed hidden state, obtained by passing the input through the dense layer and the GELU activation.

The 'hidden_state' is first passed through the dense projection ('self.dense'), which expands it from 'embed_dim' to 'int(embed_dim * mlp_ratio)' features, and the result is then passed through the GELU activation ('self.activation'). A new tensor is returned; the input is not modified in place.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, hidden_state):
    """
    Constructs the hidden state of the CvtIntermediate class.

    Args:
        self (CvtIntermediate): An instance of the CvtIntermediate class.
        hidden_state: The hidden state to be processed. It should be a tensor or array-like object.

    Returns:
        The transformed hidden state, obtained by passing the input through the dense layer
        and the GELU activation.

    Raises:
        None.

    The 'hidden_state' is first passed through the dense projection ('self.dense'), which expands it
    from 'embed_dim' to 'int(embed_dim * mlp_ratio)' features, and the result is then passed through
    the GELU activation ('self.activation'). A new tensor is returned; the input is not modified in place.
    """
    hidden_state = self.dense(hidden_state)
    hidden_state = self.activation(hidden_state)
    return hidden_state

mindnlp.transformers.models.cvt.modeling_cvt.CvtLayer

Bases: Module

CvtLayer composed by attention layers, normalization and multi-layer perceptrons (mlps).

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtLayer(nn.Module):
    """
    CvtLayer composed by attention layers, normalization and multi-layer perceptrons (mlps).
    """
    def __init__(
        self,
        num_heads,
        embed_dim,
        kernel_size,
        padding_q,
        padding_kv,
        stride_q,
        stride_kv,
        qkv_projection_method,
        qkv_bias,
        attention_drop_rate,
        drop_rate,
        mlp_ratio,
        drop_path_rate,
        with_cls_token=True,
    ):
        """
        Initializes an instance of the CvtLayer class.

        Args:
            self: The object instance.
            num_heads (int): The number of attention heads.
            embed_dim (int): The dimensionality of the embedding.
            kernel_size (int): The kernel size for the attention computation.
            padding_q (int): The padding size for queries.
            padding_kv (int): The padding size for key and value.
            stride_q (int): The stride size for queries.
            stride_kv (int): The stride size for key and value.
            qkv_projection_method (str): The method used for query, key, and value projection.
            qkv_bias (bool): Whether to include bias in query, key, and value projection.
            attention_drop_rate (float): The dropout rate for attention weights.
            drop_rate (float): The dropout rate for the output tensor.
            mlp_ratio (float): The ratio of the hidden size to the input size in the intermediate layer.
            drop_path_rate (float): The dropout rate for the residual connection.
            with_cls_token (bool): Whether to include a classification token.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        self.attention = CvtAttention(
            num_heads,
            embed_dim,
            kernel_size,
            padding_q,
            padding_kv,
            stride_q,
            stride_kv,
            qkv_projection_method,
            qkv_bias,
            attention_drop_rate,
            drop_rate,
            with_cls_token,
        )

        self.intermediate = CvtIntermediate(embed_dim, mlp_ratio)
        self.output = CvtOutput(embed_dim, mlp_ratio, drop_rate)
        self.drop_path = CvtDropPath(drop_prob=drop_path_rate) if drop_path_rate > 0.0 else nn.Identity()
        self.layernorm_before = nn.LayerNorm(embed_dim)
        self.layernorm_after = nn.LayerNorm(embed_dim)

    def forward(self, hidden_state, height, width):
        """
        This method forwards a layer in the CvtLayer class.

        Args:
            self (object): The instance of the CvtLayer class.
            hidden_state (tensor): The hidden state of the layer.
            height (int): The height of the input tensor.
            width (int): The width of the input tensor.

        Returns:
            The layer output tensor, with the same shape as 'hidden_state'.

        Raises:
            ValueError: If the hidden_state is not a valid tensor.
            TypeError: If height and width are not integer values.
            RuntimeError: If an unexpected error occurs during the execution of the method.
        """
        self_attention_output = self.attention(
            self.layernorm_before(hidden_state),  # in Cvt, layernorm is applied before self-attention
            height,
            width,
        )
        attention_output = self_attention_output
        attention_output = self.drop_path(attention_output)

        # first residual connection
        hidden_state = attention_output + hidden_state

        # in Cvt, layernorm is also applied after self-attention
        layer_output = self.layernorm_after(hidden_state)
        layer_output = self.intermediate(layer_output)

        # second residual connection is done here
        layer_output = self.output(layer_output, hidden_state)
        layer_output = self.drop_path(layer_output)
        return layer_output
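
A hedged sketch of a single layer configured like the last stage of the default configuration (the parameter values mirror the final entries of the default lists; with_cls_token=True means the sequence carries one extra classification token in front of the height x width patch tokens):

import numpy as np
import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import CvtLayer

layer = CvtLayer(
    num_heads=6, embed_dim=384, kernel_size=3,
    padding_q=1, padding_kv=1, stride_q=1, stride_kv=2,
    qkv_projection_method="dw_bn", qkv_bias=True,
    attention_drop_rate=0.0, drop_rate=0.0,
    mlp_ratio=4.0, drop_path_rate=0.1, with_cls_token=True,
)
height, width = 14, 14
tokens = mindspore.Tensor(np.random.randn(1, 1 + height * width, 384).astype(np.float32))
layer_output = layer(tokens, height, width)  # same shape as tokens: the layer is shape-preserving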

mindnlp.transformers.models.cvt.modeling_cvt.CvtLayer.__init__(num_heads, embed_dim, kernel_size, padding_q, padding_kv, stride_q, stride_kv, qkv_projection_method, qkv_bias, attention_drop_rate, drop_rate, mlp_ratio, drop_path_rate, with_cls_token=True)

Initializes an instance of the CvtLayer class.

PARAMETER DESCRIPTION
self

The object instance.

num_heads

The number of attention heads.

TYPE: int

embed_dim

The dimensionality of the embedding.

TYPE: int

kernel_size

The kernel size for the attention computation.

TYPE: int

padding_q

The padding size for queries.

TYPE: int

padding_kv

The padding size for key and value.

TYPE: int

stride_q

The stride size for queries.

TYPE: int

stride_kv

The stride size for key and value.

TYPE: int

qkv_projection_method

The method used for query, key, and value projection.

TYPE: str

qkv_bias

Whether to include bias in query, key, and value projection.

TYPE: bool

attention_drop_rate

The dropout rate for attention weights.

TYPE: float

drop_rate

The dropout rate for the output tensor.

TYPE: float

mlp_ratio

The ratio of the hidden size to the input size in the intermediate layer.

TYPE: float

drop_path_rate

The dropout rate for the residual connection.

TYPE: float

with_cls_token

Whether to include a classification token.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(
    self,
    num_heads,
    embed_dim,
    kernel_size,
    padding_q,
    padding_kv,
    stride_q,
    stride_kv,
    qkv_projection_method,
    qkv_bias,
    attention_drop_rate,
    drop_rate,
    mlp_ratio,
    drop_path_rate,
    with_cls_token=True,
):
    """
    Initializes an instance of the CvtLayer class.

    Args:
        self: The object instance.
        num_heads (int): The number of attention heads.
        embed_dim (int): The dimensionality of the embedding.
        kernel_size (int): The kernel size for the attention computation.
        padding_q (int): The padding size for queries.
        padding_kv (int): The padding size for key and value.
        stride_q (int): The stride size for queries.
        stride_kv (int): The stride size for key and value.
        qkv_projection_method (str): The method used for query, key, and value projection.
        qkv_bias (bool): Whether to include bias in query, key, and value projection.
        attention_drop_rate (float): The dropout rate for attention weights.
        drop_rate (float): The dropout rate for the output tensor.
        mlp_ratio (float): The ratio of the hidden size to the input size in the intermediate layer.
        drop_path_rate (float): The dropout rate for the residual connection.
        with_cls_token (bool): Whether to include a classification token.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    self.attention = CvtAttention(
        num_heads,
        embed_dim,
        kernel_size,
        padding_q,
        padding_kv,
        stride_q,
        stride_kv,
        qkv_projection_method,
        qkv_bias,
        attention_drop_rate,
        drop_rate,
        with_cls_token,
    )

    self.intermediate = CvtIntermediate(embed_dim, mlp_ratio)
    self.output = CvtOutput(embed_dim, mlp_ratio, drop_rate)
    self.drop_path = CvtDropPath(drop_prob=drop_path_rate) if drop_path_rate > 0.0 else nn.Identity()
    self.layernorm_before = nn.LayerNorm(embed_dim)
    self.layernorm_after = nn.LayerNorm(embed_dim)

mindnlp.transformers.models.cvt.modeling_cvt.CvtLayer.forward(hidden_state, height, width)

This method forwards a layer in the CvtLayer class.

PARAMETER DESCRIPTION
self

The instance of the CvtLayer class.

TYPE: object

hidden_state

The hidden state of the layer.

TYPE: tensor

height

The height of the input tensor.

TYPE: int

width

The width of the input tensor.

TYPE: int

RETURNS DESCRIPTION

The layer output tensor, with the same shape as hidden_state.

RAISES DESCRIPTION
ValueError

If the hidden_state is not a valid tensor.

TypeError

If height and width are not integer values.

RuntimeError

If an unexpected error occurs during the execution of the method.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, hidden_state, height, width):
    """
    This method forwards a layer in the CvtLayer class.

    Args:
        self (object): The instance of the CvtLayer class.
        hidden_state (tensor): The hidden state of the layer.
        height (int): The height of the input tensor.
        width (int): The width of the input tensor.

    Returns:
        The layer output tensor, with the same shape as 'hidden_state'.

    Raises:
        ValueError: If the hidden_state is not a valid tensor.
        TypeError: If height and width are not integer values.
        RuntimeError: If an unexpected error occurs during the execution of the method.
    """
    self_attention_output = self.attention(
        self.layernorm_before(hidden_state),  # in Cvt, layernorm is applied before self-attention
        height,
        width,
    )
    attention_output = self_attention_output
    attention_output = self.drop_path(attention_output)

    # first residual connection
    hidden_state = attention_output + hidden_state

    # in Cvt, layernorm is also applied after self-attention
    layer_output = self.layernorm_after(hidden_state)
    layer_output = self.intermediate(layer_output)

    # second residual connection is done here
    layer_output = self.output(layer_output, hidden_state)
    layer_output = self.drop_path(layer_output)
    return layer_output

mindnlp.transformers.models.cvt.modeling_cvt.CvtModel

Bases: CvtPreTrainedModel

CvtModel is a model class that represents a Convolutional Vision Transformer (Cvt) model for processing visual data. This class inherits from CvtPreTrainedModel and provides functionalities for initializing the model, pruning heads, and forwarding the model output.

ATTRIBUTE DESCRIPTION
config

The configuration object for the model.

TYPE: CvtConfig

encoder

The encoder component of the CvtModel responsible for processing input data.

TYPE: CvtEncoder

METHOD DESCRIPTION
__init__

Initializes the CvtModel instance with the provided configuration.

_prune_heads

Prunes specified heads of the model based on the provided dictionary of layer numbers and heads to prune.

forward

Constructs the model output by processing the input pixel values and returning the output hidden states. If pixel_values is not provided, a ValueError is raised. The output format is determined based on the return_dict flag and the model configuration.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtModel(CvtPreTrainedModel):

    """
    CvtModel is a model class that represents a Convolutional Vision Transformer (Cvt) model for processing visual data.
    This class inherits from CvtPreTrainedModel and provides functionalities for initializing the model, pruning heads,
    and forwarding the model output.

    Attributes:
        config (CvtConfig): The configuration object for the model.
        encoder (CvtEncoder): The encoder component of the CvtModel responsible for processing input data.

    Methods:
        __init__:
            Initializes the CvtModel instance with the provided configuration.

        _prune_heads:
            Prunes specified heads of the model based on the provided dictionary of layer numbers and heads to prune.

        forward:
            Constructs the model output by processing the input pixel values and returning the output hidden states.
            If pixel_values is not provided, a ValueError is raised.
            The output format is determined based on the return_dict flag and the model configuration.
    """
    def __init__(self, config, add_pooling_layer=True):
        """
        Initializes a new instance of the CvtModel class.

        Args:
            self (object): The instance of the CvtModel class.
            config (object): The configuration object containing model settings and parameters.
            add_pooling_layer (bool, optional): A flag indicating whether to add a pooling layer. Default is True.

        Returns:
            None.

        Raises:
            ValueError: If the provided config is invalid or missing required parameters.
            TypeError: If the provided config is not of the expected type.
            RuntimeError: If an error occurs during initialization.
        """
        super().__init__(config)
        self.config = config
        self.encoder = CvtEncoder(config)
        self.post_init()

    def _prune_heads(self, heads_to_prune):
        """
        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
        class PreTrainedModel
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithCLSToken]:
        """
        Constructs the CvtModel.

        Args:
            self (CvtModel): The instance of the CvtModel class.
            pixel_values (Optional[mindspore.Tensor]): The pixel values of the input image. Default is None.
            output_hidden_states (Optional[bool]): Whether to output hidden states. Default is None.
            return_dict (Optional[bool]): Whether to return a dictionary output. Default is None.

        Returns:
            Union[Tuple, BaseModelOutputWithCLSToken]:
                The forwarded model output.

                - If `return_dict` is False, a tuple is returned containing the sequence output and any additional
                encoder outputs.
                - If `return_dict` is True, a BaseModelOutputWithCLSToken object is returned, which includes
                the last hidden state, cls token value, and hidden states.

        Raises:
            ValueError: If `pixel_values` is not specified.

        """
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if pixel_values is None:
            raise ValueError("You have to specify pixel_values")

        encoder_outputs = self.encoder(
            pixel_values,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]

        if not return_dict:
            return (sequence_output,) + encoder_outputs[1:]

        return BaseModelOutputWithCLSToken(
            last_hidden_state=sequence_output,
            cls_token_value=encoder_outputs.cls_token_value,
            hidden_states=encoder_outputs.hidden_states,
        )
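
Usage sketch with a randomly initialised model (in practice the model is usually loaded with pretrained weights; the CvtConfig import path is an assumption based on this page's module names, and shapes are indicative for 224x224 inputs):

import numpy as np
import mindspore
from mindnlp.transformers import CvtConfig
from mindnlp.transformers.models.cvt.modeling_cvt import CvtModel

config = CvtConfig()
model = CvtModel(config, add_pooling_layer=False)
pixel_values = mindspore.Tensor(np.random.randn(1, 3, 224, 224).astype(np.float32))
outputs = model(pixel_values, return_dict=True)
# outputs.last_hidden_state: e.g. (1, 384, 14, 14) with the default config
# outputs.cls_token_value:   e.g. (1, 1, 384)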

mindnlp.transformers.models.cvt.modeling_cvt.CvtModel.__init__(config, add_pooling_layer=True)

Initializes a new instance of the CvtModel class.

PARAMETER DESCRIPTION
self

The instance of the CvtModel class.

TYPE: object

config

The configuration object containing model settings and parameters.

TYPE: object

add_pooling_layer

A flag indicating whether to add a pooling layer. Default is True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If the provided config is invalid or missing required parameters.

TypeError

If the provided config is not of the expected type.

RuntimeError

If an error occurs during initialization.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(self, config, add_pooling_layer=True):
    """
    Initializes a new instance of the CvtModel class.

    Args:
        self (object): The instance of the CvtModel class.
        config (object): The configuration object containing model settings and parameters.
        add_pooling_layer (bool, optional): A flag indicating whether to add a pooling layer. Default is True.

    Returns:
        None.

    Raises:
        ValueError: If the provided config is invalid or missing required parameters.
        TypeError: If the provided config is not of the expected type.
        RuntimeError: If an error occurs during initialization.
    """
    super().__init__(config)
    self.config = config
    self.encoder = CvtEncoder(config)
    self.post_init()

mindnlp.transformers.models.cvt.modeling_cvt.CvtModel.forward(pixel_values=None, output_hidden_states=None, return_dict=None)

Constructs the CvtModel.

PARAMETER DESCRIPTION
self

The instance of the CvtModel class.

TYPE: CvtModel

pixel_values

The pixel values of the input image. Default is None.

TYPE: Optional[Tensor] DEFAULT: None

output_hidden_states

Whether to output hidden states. Default is None.

TYPE: Optional[bool] DEFAULT: None

return_dict

Whether to return a dictionary output. Default is None.

TYPE: Optional[bool] DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, BaseModelOutputWithCLSToken]

Union[Tuple, BaseModelOutputWithCLSToken]: The forwarded model output.

  • If return_dict is False, a tuple is returned containing the sequence output and any additional encoder outputs.
  • If return_dict is True, a BaseModelOutputWithCLSToken object is returned, which includes the last hidden state, cls token value, and hidden states.
RAISES DESCRIPTION
ValueError

If pixel_values is not specified.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(
    self,
    pixel_values: Optional[mindspore.Tensor] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithCLSToken]:
    """
    Constructs the CvtModel.

    Args:
        self (CvtModel): The instance of the CvtModel class.
        pixel_values (Optional[mindspore.Tensor]): The pixel values of the input image. Default is None.
        output_hidden_states (Optional[bool]): Whether to output hidden states. Default is None.
        return_dict (Optional[bool]): Whether to return a dictionary output. Default is None.

    Returns:
        Union[Tuple, BaseModelOutputWithCLSToken]:
            The forwarded model output.

            - If `return_dict` is False, a tuple is returned containing the sequence output and any additional
            encoder outputs.
            - If `return_dict` is True, a BaseModelOutputWithCLSToken object is returned, which includes
            the last hidden state, cls token value, and hidden states.

    Raises:
        ValueError: If `pixel_values` is not specified.

    """
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    if pixel_values is None:
        raise ValueError("You have to specify pixel_values")

    encoder_outputs = self.encoder(
        pixel_values,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    sequence_output = encoder_outputs[0]

    if not return_dict:
        return (sequence_output,) + encoder_outputs[1:]

    return BaseModelOutputWithCLSToken(
        last_hidden_state=sequence_output,
        cls_token_value=encoder_outputs.cls_token_value,
        hidden_states=encoder_outputs.hidden_states,
    )

mindnlp.transformers.models.cvt.modeling_cvt.CvtOutput

Bases: Module

The 'CvtOutput' class implements the output block of a CvT layer: it projects the MLP hidden representation back to the embedding dimension, applies dropout, and adds the residual connection.

This class inherits from the 'nn.Module' class, which is a base class for all neural network cells in the MindSpore framework.

METHOD DESCRIPTION
__init__

Initializes a new instance of the 'CvtOutput' class.

Args:

  • embed_dim (int): The dimension of the embedded vectors.
  • mlp_ratio (float): The ratio used to calculate the dimension of the MLP intermediate layer.
  • drop_rate (float): The probability of an element to be zeroed in the dropout layer.
forward

Applies the output projection, dropout, and residual connection to the input tensors.

Args:

  • hidden_state (Tensor): The hidden state tensor.
  • input_tensor (Tensor): The input tensor.

Returns:

  • Tensor: The final hidden state tensor obtained after applying the conversion operations.
Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtOutput(nn.Module):

    """
    The 'CvtOutput' class implements the output block of a CvT layer: it projects the MLP hidden representation back to the embedding dimension, applies dropout, and adds the residual connection.

    This class inherits from the 'nn.Module' class, which is a base class for all neural network cells in the MindSpore framework.

    Methods:
        __init__(self, embed_dim, mlp_ratio, drop_rate):
            Initializes a new instance of the 'CvtOutput' class.

            Args:

            - embed_dim (int): The dimension of the embedded vectors.
            - mlp_ratio (float): The ratio used to calculate the dimension of the MLP intermediate layer.
            - drop_rate (float): The probability of an element to be zeroed in the dropout layer.

        forward(self, hidden_state, input_tensor):
            Applies the output projection, dropout, and residual connection to the input tensors.

            Args:

            - hidden_state (Tensor): The hidden state tensor.
            - input_tensor (Tensor): The input tensor.

            Returns:

            - Tensor: The final hidden state tensor obtained after applying the conversion operations.
    """
    def __init__(self, embed_dim, mlp_ratio, drop_rate):
        """
        Initialize the CvtOutput class.

        Args:
            self: The instance of the class.
            embed_dim (int): The dimension of the embedding.
            mlp_ratio (float): The ratio used to determine the hidden layer size in the MLP.
            drop_rate (float): The dropout rate applied to the output.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        self.dense = nn.Linear(int(embed_dim * mlp_ratio), embed_dim)
        self.dropout = nn.Dropout(p=drop_rate)

    def forward(self, hidden_state, input_tensor):
        """
        Constructs the output of the CvtOutput class.

        Args:
            self (CvtOutput): An instance of the CvtOutput class.
            hidden_state (tensor): The hidden state tensor.
                This tensor represents the current state of the model and is used as input for further processing.
                It should have a shape compatible with the dense layer.
            input_tensor (tensor): The input tensor.
                This tensor carries the residual branch and is added to the projected hidden state.
                Its last dimension must equal 'embed_dim', i.e. the output size of the dense projection.

        Returns:
            Tensor: The output hidden state after the dense projection, dropout, and residual addition.

        Raises:
            None.
        """
        hidden_state = self.dense(hidden_state)
        hidden_state = self.dropout(hidden_state)
        hidden_state = hidden_state + input_tensor
        return hidden_state

mindnlp.transformers.models.cvt.modeling_cvt.CvtOutput.__init__(embed_dim, mlp_ratio, drop_rate)

Initialize the CvtOutput class.

PARAMETER DESCRIPTION
self

The instance of the class.

embed_dim

The dimension of the embedding.

TYPE: int

mlp_ratio

The ratio used to determine the hidden layer size in the MLP.

TYPE: float

drop_rate

The dropout rate applied to the output.

TYPE: float

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(self, embed_dim, mlp_ratio, drop_rate):
    """
    Initialize the CvtOutput class.

    Args:
        self: The instance of the class.
        embed_dim (int): The dimension of the embedding.
        mlp_ratio (float): The ratio used to determine the hidden layer size in the MLP.
        drop_rate (float): The dropout rate applied to the output.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    self.dense = nn.Linear(int(embed_dim * mlp_ratio), embed_dim)
    self.dropout = nn.Dropout(p=drop_rate)

mindnlp.transformers.models.cvt.modeling_cvt.CvtOutput.forward(hidden_state, input_tensor)

Constructs the output of the CvtOutput class.

PARAMETER DESCRIPTION
self

An instance of the CvtOutput class.

TYPE: CvtOutput

hidden_state

The hidden state tensor. This tensor represents the current state of the model and is used as input for further processing. It should have a shape compatible with the dense layer.

TYPE: tensor

input_tensor

The input tensor. This tensor represents the input data and is added to the hidden state tensor. It should have the same shape as the hidden state tensor.

TYPE: tensor

RETURNS DESCRIPTION
Tensor

The output hidden state after the dense projection, dropout, and residual addition.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, hidden_state, input_tensor):
    """
    Constructs the output of the CvtOutput class.

    Args:
        self (CvtOutput): An instance of the CvtOutput class.
        hidden_state (tensor): The hidden state tensor.
            This tensor represents the current state of the model and is used as input for further processing.
            It should have a shape compatible with the dense layer.
        input_tensor (tensor): The input tensor.
            This tensor represents the input data and is added to the hidden state tensor.
            It should have the same shape as the hidden state tensor.

    Returns:
        Tensor: The output hidden state after the dense projection, dropout, and residual addition.

    Raises:
        None.
    """
    hidden_state = self.dense(hidden_state)
    hidden_state = self.dropout(hidden_state)
    hidden_state = hidden_state + input_tensor
    return hidden_state
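
A minimal usage sketch (constructor values and shapes are illustrative, loosely following the stage-0 defaults of CvtConfig): the dense layer maps the MLP intermediate width embed_dim * mlp_ratio back to embed_dim, dropout is applied, and the block input is added as a residual.

import numpy as np
import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import CvtOutput

output_block = CvtOutput(embed_dim=64, mlp_ratio=4.0, drop_rate=0.0)

# hidden_state comes from the intermediate MLP, so its last dim is embed_dim * mlp_ratio = 256
hidden_state = mindspore.Tensor(np.random.randn(2, 196, 256).astype(np.float32))
# input_tensor is the residual branch with the original embed_dim = 64
input_tensor = mindspore.Tensor(np.random.randn(2, 196, 64).astype(np.float32))

out = output_block(hidden_state, input_tensor)
print(out.shape)  # (2, 196, 64): projected back to embed_dim with the residual added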

mindnlp.transformers.models.cvt.modeling_cvt.CvtPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """
    config_class = CvtConfig
    base_model_prefix = "cvt"
    main_input_name = "pixel_values"
    _no_split_modules = ["CvtLayer"]
    _keys_to_ignore_on_load_unexpected = [r'num_batches_tracked']

    def _init_weights(self, module):
        """Initialize the weights"""
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            module.weight.initialize(TruncatedNormal(self.config.initializer_range))
            if module.bias is not None:
                module.bias.initialize('zeros')
        elif isinstance(module, nn.LayerNorm):
            module.bias.initialize('zeros')
            module.weight.initialize('ones')
        elif isinstance(module, CvtStage):
            if self.config.cls_token[module.stage]:
                module.cls_token.initialize(TruncatedNormal(self.config.initializer_range))
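
A hedged sketch of how this base class is typically used: instantiating a model from a config triggers _init_weights (truncated-normal initialization for Linear/Conv2d weights), while from_pretrained downloads and loads checkpoint weights. The top-level imports and the checkpoint name are assumptions based on the usual mindnlp/Transformers layout.

from mindnlp.transformers import CvtConfig, CvtModel  # assumed top-level exports

# Random initialization: CvtPreTrainedModel._init_weights applies
# TruncatedNormal(config.initializer_range) to Linear and Conv2d weights.
config = CvtConfig()
model = CvtModel(config)

# Loading pretrained weights (checkpoint name shown for illustration).
model = CvtModel.from_pretrained("microsoft/cvt-13")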

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfAttention

Bases: Module

This class represents a Convolutional Self-Attention layer for a neural network model. It inherits from the nn.Module class.

ATTRIBUTE DESCRIPTION
num_heads

The number of attention heads.

TYPE: int

embed_dim

The dimension of the input embeddings.

TYPE: int

kernel_size

The size of the convolutional kernel.

TYPE: int

padding_q

The amount of padding for the query projection convolution.

TYPE: int

padding_kv

The amount of padding for the key and value projection convolutions.

TYPE: int

stride_q

The stride for the query projection convolution.

TYPE: int

stride_kv

The stride for the key and value projection convolutions.

TYPE: int

qkv_projection_method

The projection method used for the query, key, and value projections.

TYPE: str

qkv_bias

Indicates whether bias is added to the query, key, and value projections.

TYPE: bool

attention_drop_rate

The dropout rate for the attention scores.

TYPE: float

with_cls_token

Indicates whether a classification token is included in the input.

TYPE: bool

METHOD DESCRIPTION
__init__

Initializes the CvtSelfAttention instance.

rearrange_for_multi_head_attention

Rearranges the input hidden state for multi-head attention computations.

forward

Constructs the CvtSelfAttention layer by performing convolutional projections, multi-head attention calculations, and output rearrangement.

Note
  • The forward method expects a flattened hidden state of shape (batch_size, height * width (+1 if a cls token is used), embed_dim), together with the spatial height and width; it is reshaped internally to (batch_size, embed_dim, height, width) for the convolutional projections.
  • The attention_score and attention_probs computations make use of the Einstein summation convention (einsum).
  • The context output is a 3D tensor with shape (batch_size, hidden_size, num_heads * head_dim).
Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtSelfAttention(nn.Module):

    """
    This class represents a Convolutional Self-Attention layer for a neural network model. It inherits from the nn.Module class.

    Attributes:
        num_heads (int): The number of attention heads.
        embed_dim (int): The dimension of the input embeddings.
        kernel_size (int): The size of the convolutional kernel.
        padding_q (int): The amount of padding for the query projection convolution.
        padding_kv (int): The amount of padding for the key and value projection convolutions.
        stride_q (int): The stride for the query projection convolution.
        stride_kv (int): The stride for the key and value projection convolutions.
        qkv_projection_method (str): The projection method used for the query, key, and value projections.
        qkv_bias (bool): Indicates whether bias is added to the query, key, and value projections.
        attention_drop_rate (float): The dropout rate for the attention scores.
        with_cls_token (bool): Indicates whether a classification token is included in the input.

    Methods:
        __init__(self, num_heads, embed_dim, kernel_size, padding_q, padding_kv, stride_q, stride_kv, qkv_projection_method, qkv_bias, attention_drop_rate, with_cls_token=True, **kwargs):
            Initializes the CvtSelfAttention instance.

        rearrange_for_multi_head_attention(self, hidden_state):
            Rearranges the input hidden state for multi-head attention computations.

        forward(self, hidden_state, height, width):
            Constructs the CvtSelfAttention layer by performing convolutional projections, multi-head attention calculations, and output rearrangement.

    Note:
        - The forward method expects a flattened hidden state of shape (batch_size, height * width (+1 if a cls token is used), embed_dim), together with the spatial height and width; it is reshaped internally to (batch_size, embed_dim, height, width) for the convolutional projections.
        - The attention_score and attention_probs computations make use of the Einstein summation convention (einsum).
        - The context output is a 3D tensor with shape (batch_size, hidden_size, num_heads * head_dim).
    """
    def __init__(
        self,
        num_heads,
        embed_dim,
        kernel_size,
        padding_q,
        padding_kv,
        stride_q,
        stride_kv,
        qkv_projection_method,
        qkv_bias,
        attention_drop_rate,
        with_cls_token=True,
        **kwargs,
    ):
        """
        __init__

        Initializes the CvtSelfAttention class.

        Args:
            self: The instance of the class.
            num_heads (int): The number of attention heads.
            embed_dim (int): The dimension of the input embeddings.
            kernel_size (int): The size of the convolutional kernel.
            padding_q (int): The padding size for the query projection.
            padding_kv (int): The padding size for the key and value projections.
            stride_q (int): The stride for the query projection.
            stride_kv (int): The stride for the key and value projections.
            qkv_projection_method (str): The method used for query, key, and value projections.
                Can be 'avg' or any other specific projection method.
            qkv_bias (bool): Indicates whether bias is applied to the query, key, and value projections.
            attention_drop_rate (float): The dropout rate for attention weights.
            with_cls_token (bool, optional): Indicates whether the class token is included. Defaults to True.

        Returns:
            None.

        Raises:
            ValueError: If embed_dim is not a positive integer.
            ValueError: If num_heads is not a positive integer.
            ValueError: If kernel_size, padding_q, padding_kv, stride_q, or stride_kv is not a positive integer.
            ValueError: If qkv_projection_method is not 'avg' or a valid specific projection method.
            ValueError: If attention_drop_rate is not in the range [0, 1].
            TypeError: If with_cls_token is not a boolean value.
        """
        super().__init__()
        self.scale = embed_dim**-0.5
        self.with_cls_token = with_cls_token
        self.embed_dim = embed_dim
        self.num_heads = num_heads

        self.convolution_projection_query = CvtSelfAttentionProjection(
            embed_dim,
            kernel_size,
            padding_q,
            stride_q,
            projection_method="linear" if qkv_projection_method == "avg" else qkv_projection_method,
        )
        self.convolution_projection_key = CvtSelfAttentionProjection(
            embed_dim, kernel_size, padding_kv, stride_kv, projection_method=qkv_projection_method
        )
        self.convolution_projection_value = CvtSelfAttentionProjection(
            embed_dim, kernel_size, padding_kv, stride_kv, projection_method=qkv_projection_method
        )

        self.projection_query = nn.Linear(embed_dim, embed_dim, bias=qkv_bias)
        self.projection_key = nn.Linear(embed_dim, embed_dim, bias=qkv_bias)
        self.projection_value = nn.Linear(embed_dim, embed_dim, bias=qkv_bias)

        self.dropout = nn.Dropout(p=attention_drop_rate)

    def rearrange_for_multi_head_attention(self, hidden_state):
        """
        Method: rearrange_for_multi_head_attention

        In the class CvtSelfAttention, this method rearranges the hidden state tensor for multi-head attention computation.

        Args:
            self (CvtSelfAttention): The instance of the CvtSelfAttention class.
                This parameter is required for accessing the attributes and methods of the class.
            hidden_state (torch.Tensor):
                The input hidden state tensor of shape (batch_size, hidden_size, _).

                - batch_size (int): The number of sequences in the batch.
                - hidden_size (int): The dimensionality of the hidden state.
                - _ (int): Placeholder dimension for compatibility with the transformer architecture.

                This tensor represents the input hidden state that needs to be rearranged for multi-head attention computation.

        Returns:
            Tensor:
                The hidden state reshaped to (batch_size, num_heads, seq_len, head_dim) for multi-head attention.

        Raises:
            None:
                This method does not explicitly raise any exceptions.
        """
        batch_size, hidden_size, _ = hidden_state.shape
        head_dim = self.embed_dim // self.num_heads
        # rearrange 'b t (h d) -> b h t d'
        return hidden_state.view(batch_size, hidden_size, self.num_heads, head_dim).permute(0, 2, 1, 3)

    def forward(self, hidden_state, height, width):
        """
        Constructs the self-attention context for the CvtSelfAttention class.

        Args:
            self: An instance of the CvtSelfAttention class.
            hidden_state (Tensor): The hidden state tensor of shape (batch_size, hidden_size, num_channels).
                It represents the input features.
            height (int): The height of the hidden state tensor.
            width (int): The width of the hidden state tensor.

        Returns:
            Tensor: The context tensor of shape (batch_size, hidden_size, num_heads * head_dim).
                It represents the output context after applying self-attention mechanism.

        Raises:
            None.
        """
        if self.with_cls_token:
            cls_token, hidden_state = ops.split(hidden_state, [1, height * width], 1)
        batch_size, hidden_size, num_channels = hidden_state.shape
        # rearrange "b (h w) c -> b c h w"
        hidden_state = hidden_state.permute(0, 2, 1).view(batch_size, num_channels, height, width)

        key = self.convolution_projection_key(hidden_state)
        query = self.convolution_projection_query(hidden_state)
        value = self.convolution_projection_value(hidden_state)

        if self.with_cls_token:
            query = ops.cat((cls_token, query), axis=1)
            key = ops.cat((cls_token, key), axis=1)
            value = ops.cat((cls_token, value), axis=1)

        head_dim = self.embed_dim // self.num_heads

        query = self.rearrange_for_multi_head_attention(self.projection_query(query))
        key = self.rearrange_for_multi_head_attention(self.projection_key(key))
        value = self.rearrange_for_multi_head_attention(self.projection_value(value))

        attention_score = ops.einsum("bhlk,bhtk->bhlt", query, key) * self.scale
        attention_probs = ops.softmax(attention_score, axis=-1)
        attention_probs = self.dropout(attention_probs)

        context = ops.einsum("bhlt,bhtv->bhlv", attention_probs, value)
        # rearrange"b h t d -> b t (h d)"
        _, _, hidden_size, _ = context.shape
        context = context.permute(0, 2, 1, 3).view(batch_size, hidden_size, self.num_heads * head_dim)
        return context
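
A minimal sketch of calling the layer directly (settings are illustrative, roughly matching the first CvT stage, which has no cls token): the input is the flattened feature map, and height/width tell the layer how to fold it back into a 2D map for the convolutional projections.

import numpy as np
import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import CvtSelfAttention

attn = CvtSelfAttention(
    num_heads=1, embed_dim=64, kernel_size=3,
    padding_q=1, padding_kv=1, stride_q=1, stride_kv=2,
    qkv_projection_method="dw_bn", qkv_bias=True,
    attention_drop_rate=0.0, with_cls_token=False,
)

height, width = 14, 14
# (batch_size, height * width, embed_dim): a flattened 14x14 feature map
hidden_state = mindspore.Tensor(np.random.randn(2, height * width, 64).astype(np.float32))

context = attn(hidden_state, height, width)
print(context.shape)  # (2, 196, 64); keys/values are subsampled internally by stride_kv=2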

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfAttention.__init__(num_heads, embed_dim, kernel_size, padding_q, padding_kv, stride_q, stride_kv, qkv_projection_method, qkv_bias, attention_drop_rate, with_cls_token=True, **kwargs)

__init__

Initializes the CvtSelfAttention class.

PARAMETER DESCRIPTION
self

The instance of the class.

num_heads

The number of attention heads.

TYPE: int

embed_dim

The dimension of the input embeddings.

TYPE: int

kernel_size

The size of the convolutional kernel.

TYPE: int

padding_q

The padding size for the query projection.

TYPE: int

padding_kv

The padding size for the key and value projections.

TYPE: int

stride_q

The stride for the query projection.

TYPE: int

stride_kv

The stride for the key and value projections.

TYPE: int

qkv_projection_method

The method used for query, key, and value projections. Can be 'avg' or any other specific projection method.

TYPE: str

qkv_bias

Indicates whether bias is applied to the query, key, and value projections.

TYPE: bool

attention_drop_rate

The dropout rate for attention weights.

TYPE: float

with_cls_token

Indicates whether the class token is included. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If embed_dim is not a positive integer.

ValueError

If num_heads is not a positive integer.

ValueError

If kernel_size, padding_q, padding_kv, stride_q, or stride_kv is not a positive integer.

ValueError

If qkv_projection_method is not 'avg' or a valid specific projection method.

ValueError

If attention_drop_rate is not in the range [0, 1].

TypeError

If with_cls_token is not a boolean value.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(
    self,
    num_heads,
    embed_dim,
    kernel_size,
    padding_q,
    padding_kv,
    stride_q,
    stride_kv,
    qkv_projection_method,
    qkv_bias,
    attention_drop_rate,
    with_cls_token=True,
    **kwargs,
):
    """
    __init__

    Initializes the CvtSelfAttention class.

    Args:
        self: The instance of the class.
        num_heads (int): The number of attention heads.
        embed_dim (int): The dimension of the input embeddings.
        kernel_size (int): The size of the convolutional kernel.
        padding_q (int): The padding size for the query projection.
        padding_kv (int): The padding size for the key and value projections.
        stride_q (int): The stride for the query projection.
        stride_kv (int): The stride for the key and value projections.
        qkv_projection_method (str): The method used for query, key, and value projections.
            Can be 'avg' or any other specific projection method.
        qkv_bias (bool): Indicates whether bias is applied to the query, key, and value projections.
        attention_drop_rate (float): The dropout rate for attention weights.
        with_cls_token (bool, optional): Indicates whether the class token is included. Defaults to True.

    Returns:
        None.

    Raises:
        ValueError: If embed_dim is not a positive integer.
        ValueError: If num_heads is not a positive integer.
        ValueError: If kernel_size, padding_q, padding_kv, stride_q, or stride_kv is not a positive integer.
        ValueError: If qkv_projection_method is not 'avg' or a valid specific projection method.
        ValueError: If attention_drop_rate is not in the range [0, 1].
        TypeError: If with_cls_token is not a boolean value.
    """
    super().__init__()
    self.scale = embed_dim**-0.5
    self.with_cls_token = with_cls_token
    self.embed_dim = embed_dim
    self.num_heads = num_heads

    self.convolution_projection_query = CvtSelfAttentionProjection(
        embed_dim,
        kernel_size,
        padding_q,
        stride_q,
        projection_method="linear" if qkv_projection_method == "avg" else qkv_projection_method,
    )
    self.convolution_projection_key = CvtSelfAttentionProjection(
        embed_dim, kernel_size, padding_kv, stride_kv, projection_method=qkv_projection_method
    )
    self.convolution_projection_value = CvtSelfAttentionProjection(
        embed_dim, kernel_size, padding_kv, stride_kv, projection_method=qkv_projection_method
    )

    self.projection_query = nn.Linear(embed_dim, embed_dim, bias=qkv_bias)
    self.projection_key = nn.Linear(embed_dim, embed_dim, bias=qkv_bias)
    self.projection_value = nn.Linear(embed_dim, embed_dim, bias=qkv_bias)

    self.dropout = nn.Dropout(p=attention_drop_rate)

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfAttention.forward(hidden_state, height, width)

Constructs the self-attention context for the CvtSelfAttention class.

PARAMETER DESCRIPTION
self

An instance of the CvtSelfAttention class.

hidden_state

The hidden state tensor of shape (batch_size, hidden_size, num_channels). It represents the input features.

TYPE: Tensor

height

The height of the hidden state tensor.

TYPE: int

width

The width of the hidden state tensor.

TYPE: int

RETURNS DESCRIPTION
Tensor

The context tensor of shape (batch_size, hidden_size, num_heads * head_dim). It represents the output context after applying the self-attention mechanism.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, hidden_state, height, width):
    """
    Constructs the self-attention context for the CvtSelfAttention class.

    Args:
        self: An instance of the CvtSelfAttention class.
        hidden_state (Tensor): The hidden state tensor of shape (batch_size, hidden_size, num_channels).
            It represents the input features.
        height (int): The height of the hidden state tensor.
        width (int): The width of the hidden state tensor.

    Returns:
        Tensor: The context tensor of shape (batch_size, hidden_size, num_heads * head_dim).
            It represents the output context after applying self-attention mechanism.

    Raises:
        None.
    """
    if self.with_cls_token:
        cls_token, hidden_state = ops.split(hidden_state, [1, height * width], 1)
    batch_size, hidden_size, num_channels = hidden_state.shape
    # rearrange "b (h w) c -> b c h w"
    hidden_state = hidden_state.permute(0, 2, 1).view(batch_size, num_channels, height, width)

    key = self.convolution_projection_key(hidden_state)
    query = self.convolution_projection_query(hidden_state)
    value = self.convolution_projection_value(hidden_state)

    if self.with_cls_token:
        query = ops.cat((cls_token, query), axis=1)
        key = ops.cat((cls_token, key), axis=1)
        value = ops.cat((cls_token, value), axis=1)

    head_dim = self.embed_dim // self.num_heads

    query = self.rearrange_for_multi_head_attention(self.projection_query(query))
    key = self.rearrange_for_multi_head_attention(self.projection_key(key))
    value = self.rearrange_for_multi_head_attention(self.projection_value(value))

    attention_score = ops.einsum("bhlk,bhtk->bhlt", query, key) * self.scale
    attention_probs = ops.softmax(attention_score, axis=-1)
    attention_probs = self.dropout(attention_probs)

    context = ops.einsum("bhlt,bhtv->bhlv", attention_probs, value)
    # rearrange"b h t d -> b t (h d)"
    _, _, hidden_size, _ = context.shape
    context = context.permute(0, 2, 1, 3).view(batch_size, hidden_size, self.num_heads * head_dim)
    return context

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfAttention.rearrange_for_multi_head_attention(hidden_state)

In the class CvtSelfAttention, this method rearranges the hidden state tensor for multi-head attention computation.

PARAMETER DESCRIPTION
self

The instance of the CvtSelfAttention class. This parameter is required for accessing the attributes and methods of the class.

TYPE: CvtSelfAttention

hidden_state

The input hidden state tensor of shape (batch_size, hidden_size, _).

  • batch_size (int): The number of sequences in the batch.
  • hidden_size (int): The dimensionality of the hidden state.
  • _ (int): Placeholder dimension for compatibility with the transformer architecture.

This tensor represents the input hidden state that needs to be rearranged for multi-head attention computation.

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

The hidden state reshaped to (batch_size, num_heads, seq_len, head_dim) for multi-head attention.

RAISES DESCRIPTION
None

This method does not explicitly raise any exceptions.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def rearrange_for_multi_head_attention(self, hidden_state):
    """
    Method: rearrange_for_multi_head_attention

    In the class CvtSelfAttention, this method rearranges the hidden state tensor for multi-head attention computation.

    Args:
        self (CvtSelfAttention): The instance of the CvtSelfAttention class.
            This parameter is required for accessing the attributes and methods of the class.
        hidden_state (torch.Tensor):
            The input hidden state tensor of shape (batch_size, hidden_size, _).

            - batch_size (int): The number of sequences in the batch.
            - hidden_size (int): The dimensionality of the hidden state.
            - _ (int): Placeholder dimension for compatibility with the transformer architecture.

            This tensor represents the input hidden state that needs to be rearranged for multi-head attention computation.

    Returns:
        Tensor:
            The hidden state reshaped to (batch_size, num_heads, seq_len, head_dim) for multi-head attention.

    Raises:
        None:
            This method does not explicitly raise any exceptions.
    """
    batch_size, hidden_size, _ = hidden_state.shape
    head_dim = self.embed_dim // self.num_heads
    # rearrange 'b t (h d) -> b h t d'
    return hidden_state.view(batch_size, hidden_size, self.num_heads, head_dim).permute(0, 2, 1, 3)
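
The reshape implemented above ('b t (h d) -> b h t d') can be reproduced with plain tensor ops; the sketch below only illustrates the shape change, with illustrative sizes.

import numpy as np
import mindspore

batch_size, seq_len, embed_dim, num_heads = 2, 196, 64, 4
head_dim = embed_dim // num_heads

x = mindspore.Tensor(np.random.randn(batch_size, seq_len, embed_dim).astype(np.float32))
# split the channel dimension into heads, then move the head axis in front of the sequence axis
x = x.view(batch_size, seq_len, num_heads, head_dim).permute(0, 2, 1, 3)
print(x.shape)  # (2, 4, 196, 16)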

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfAttentionConvProjection

Bases: Module

CvtSelfAttentionConvProjection applies a depthwise convolution followed by batch normalization to the input feature map. This class inherits from nn.Module and provides methods for initializing the convolution and normalization layers and for computing the projected output from the input hidden state.

ATTRIBUTE DESCRIPTION
embed_dim

The dimension of the input embedding.

TYPE: int

kernel_size

The size of the convolutional kernel.

TYPE: int

padding

The amount of padding to apply to the input data.

TYPE: int

stride

The stride of the convolution operation.

TYPE: int

METHOD DESCRIPTION
__init__

Initializes the CvtSelfAttentionConvProjection class with the specified parameters.

forward

Constructs the output from the input hidden state by applying convolution and normalization operations.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtSelfAttentionConvProjection(nn.Module):

    """
    CvtSelfAttentionConvProjection represents a class for performing convolution and normalization operations
    on input data. This class inherits from nn.Module and provides methods for initializing the
    convolution and normalization layers, as well as for forwarding the output from the input hidden state.

    Attributes:
        embed_dim (int): The dimension of the input embedding.
        kernel_size (int): The size of the convolutional kernel.
        padding (int): The amount of padding to apply to the input data.
        stride (int): The stride of the convolution operation.

    Methods:
        __init__: Initializes the CvtSelfAttentionConvProjection class with the specified parameters.
        forward: Constructs the output from the input hidden state by applying convolution and normalization operations.

    """
    def __init__(self, embed_dim, kernel_size, padding, stride):
        """
        Initializes a new instance of the CvtSelfAttentionConvProjection class.

        Args:
            self (CvtSelfAttentionConvProjection): The object itself.
            embed_dim (int): The number of channels in the input and output tensors.
            kernel_size (int or Tuple[int, int]): The size of the convolving kernel.
            padding (int or Tuple[int, int]): The amount of padding added to the input.
            stride (int or Tuple[int, int]): The stride of the convolution.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        self.convolution = nn.Conv2d(
            embed_dim,
            embed_dim,
            kernel_size=kernel_size,
            padding=padding,
            pad_mode='pad',
            stride=stride,
            bias=False,
            group=embed_dim,
        )
        self.normalization = nn.BatchNorm2d(embed_dim)

    def forward(self, hidden_state):
        """
        Constructs a hidden state using convolution, normalization, and projection in the CvtSelfAttentionConvProjection class.

        Args:
            self (CvtSelfAttentionConvProjection): An instance of the CvtSelfAttentionConvProjection class.
            hidden_state (any): The input hidden state.

        Returns:
            Tensor: The hidden state after the depthwise convolution and batch normalization.

        Raises:
            None.
        """
        hidden_state = self.convolution(hidden_state)
        hidden_state = self.normalization(hidden_state)
        return hidden_state
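
An illustrative call (values mirror the key/value projection of the first stage: kernel 3, padding 1, stride 2): the depthwise convolution keeps the channel count but halves the spatial resolution, and batch normalization is applied on top.

import numpy as np
import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import CvtSelfAttentionConvProjection

proj = CvtSelfAttentionConvProjection(embed_dim=64, kernel_size=3, padding=1, stride=2)
feature_map = mindspore.Tensor(np.random.randn(2, 64, 14, 14).astype(np.float32))
print(proj(feature_map).shape)  # (2, 64, 7, 7): spatial resolution halved, channels unchanged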

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfAttentionConvProjection.__init__(embed_dim, kernel_size, padding, stride)

Initializes a new instance of the CvtSelfAttentionConvProjection class.

PARAMETER DESCRIPTION
self

The object itself.

TYPE: CvtSelfAttentionConvProjection

embed_dim

The number of channels in the input and output tensors.

TYPE: int

kernel_size

The size of the convolving kernel.

TYPE: int or Tuple[int, int]

padding

The amount of padding added to the input.

TYPE: int or Tuple[int, int]

stride

The stride of the convolution.

TYPE: int or Tuple[int, int]

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(self, embed_dim, kernel_size, padding, stride):
    """
    Initializes a new instance of the CvtSelfAttentionConvProjection class.

    Args:
        self (CvtSelfAttentionConvProjection): The object itself.
        embed_dim (int): The number of channels in the input and output tensors.
        kernel_size (int or Tuple[int, int]): The size of the convolving kernel.
        padding (int or Tuple[int, int]): The amount of padding added to the input.
        stride (int or Tuple[int, int]): The stride of the convolution.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    self.convolution = nn.Conv2d(
        embed_dim,
        embed_dim,
        kernel_size=kernel_size,
        padding=padding,
        pad_mode='pad',
        stride=stride,
        bias=False,
        group=embed_dim,
    )
    self.normalization = nn.BatchNorm2d(embed_dim)

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfAttentionConvProjection.forward(hidden_state)

Constructs a hidden state using convolution, normalization, and projection in the CvtSelfAttentionConvProjection class.

PARAMETER DESCRIPTION
self

An instance of the CvtSelfAttentionConvProjection class.

TYPE: CvtSelfAttentionConvProjection

hidden_state

The input hidden state.

TYPE: any

RETURNS DESCRIPTION
Tensor

The hidden state after the depthwise convolution and batch normalization.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, hidden_state):
    """
    Constructs a hidden state using convolution, normalization, and projection in the CvtSelfAttentionConvProjection class.

    Args:
        self (CvtSelfAttentionConvProjection): An instance of the CvtSelfAttentionConvProjection class.
        hidden_state (any): The input hidden state.

    Returns:
        Tensor: The hidden state after the depthwise convolution and batch normalization.

    Raises:
        None.
    """
    hidden_state = self.convolution(hidden_state)
    hidden_state = self.normalization(hidden_state)
    return hidden_state

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfAttentionLinearProjection

Bases: Module

The 'CvtSelfAttentionLinearProjection' class is a Python class that inherits from the 'nn.Module' class. It represents a linear projection operation applied to hidden states in a self-attention mechanism.

METHOD DESCRIPTION
forward

Applies a linear projection to the input hidden state.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtSelfAttentionLinearProjection(nn.Module):

    """
    The 'CvtSelfAttentionLinearProjection' class is a Python class that inherits from the 'nn.Module' class.
    It represents a linear projection operation applied to hidden states in a self-attention mechanism.

    Attributes:
        None.

    Methods:
        forward(hidden_state): Applies a linear projection to the input hidden state.

    """
    def forward(self, hidden_state):
        """
        Constructs a linear projection of hidden state for self-attention in the CvtSelfAttentionLinearProjection class.

        Args:
            self (CvtSelfAttentionLinearProjection): The instance of the CvtSelfAttentionLinearProjection class.
            hidden_state (torch.Tensor): The hidden state tensor with shape (batch_size, num_channels, height, width),
                where batch_size is the number of samples in the batch, num_channels is the number of channels,
                height is the height of the hidden state tensor, and width is the width of the hidden state tensor.

        Returns:
            torch.Tensor: The linearly projected hidden state tensor with shape (batch_size, hidden_size, num_channels),
                where batch_size is the number of samples in the batch, hidden_size is the product of height and width
                of the hidden state tensor, and num_channels is the number of channels.
                The tensor is permuted to have the dimensions (batch_size, hidden_size, num_channels).

        Raises:
            None
        """
        batch_size, num_channels, height, width = hidden_state.shape
        hidden_size = height * width
        # rearrange " b c h w -> b (h w) c"
        hidden_state = hidden_state.view(batch_size, num_channels, hidden_size).permute(0, 2, 1)
        return hidden_state
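
A short sketch of the flattening performed by this module ('b c h w -> b (h w) c'); shapes are illustrative.

import numpy as np
import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import CvtSelfAttentionLinearProjection

flatten = CvtSelfAttentionLinearProjection()
feature_map = mindspore.Tensor(np.random.randn(2, 64, 7, 7).astype(np.float32))
print(flatten(feature_map).shape)  # (2, 49, 64): spatial positions become the sequence axis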

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfAttentionLinearProjection.forward(hidden_state)

Constructs a linear projection of hidden state for self-attention in the CvtSelfAttentionLinearProjection class.

PARAMETER DESCRIPTION
self

The instance of the CvtSelfAttentionLinearProjection class.

TYPE: CvtSelfAttentionLinearProjection

hidden_state

The hidden state tensor with shape (batch_size, num_channels, height, width), where batch_size is the number of samples in the batch, num_channels is the number of channels, height is the height of the hidden state tensor, and width is the width of the hidden state tensor.

TYPE: Tensor

RETURNS DESCRIPTION

torch.Tensor: The linearly projected hidden state tensor with shape (batch_size, hidden_size, num_channels), where batch_size is the number of samples in the batch, hidden_size is the product of height and width of the hidden state tensor, and num_channels is the number of channels. The tensor is permuted to have the dimensions (batch_size, hidden_size, num_channels).

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, hidden_state):
    """
    Constructs a linear projection of hidden state for self-attention in the CvtSelfAttentionLinearProjection class.

    Args:
        self (CvtSelfAttentionLinearProjection): The instance of the CvtSelfAttentionLinearProjection class.
        hidden_state (torch.Tensor): The hidden state tensor with shape (batch_size, num_channels, height, width),
            where batch_size is the number of samples in the batch, num_channels is the number of channels,
            height is the height of the hidden state tensor, and width is the width of the hidden state tensor.

    Returns:
        torch.Tensor: The linearly projected hidden state tensor with shape (batch_size, hidden_size, num_channels),
            where batch_size is the number of samples in the batch, hidden_size is the product of height and width
            of the hidden state tensor, and num_channels is the number of channels.
            The tensor is permuted to have the dimensions (batch_size, hidden_size, num_channels).

    Raises:
        None
    """
    batch_size, num_channels, height, width = hidden_state.shape
    hidden_size = height * width
    # rearrange " b c h w -> b (h w) c"
    hidden_state = hidden_state.view(batch_size, num_channels, hidden_size).permute(0, 2, 1)
    return hidden_state

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfAttentionProjection

Bases: Module

A class representing the projection layer for self-attention in a Convolutional Transformer network.

This class is responsible for projecting the input hidden state using convolutional and linear projections. It provides methods to initialize the projections and apply them sequentially to the input hidden state.

ATTRIBUTE DESCRIPTION
embed_dim

The dimensionality of the input embeddings.

TYPE: int

kernel_size

The size of the convolutional kernel.

TYPE: int

padding

The amount of padding to apply during convolution.

TYPE: int

stride

The stride of the convolution operation.

TYPE: int

projection_method

The method used for projection; the default 'dw_bn' applies a depthwise convolution followed by batch normalization.

TYPE: str

METHOD DESCRIPTION
__init__

Initializes the projection layer with the specified parameters.

forward

Applies the convolutional projection followed by the linear projection to the input hidden state. Returns the projected hidden state.

Note

This class inherits from nn.Module and is designed to be used within a Convolutional Transformer network.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtSelfAttentionProjection(nn.Module):

    """
    A class representing the projection layer for self-attention in a Convolutional Transformer network.

    This class is responsible for projecting the input hidden state using convolutional and linear projections.
    It provides methods to initialize the projections and apply them sequentially to the input hidden state.

    Attributes:
        embed_dim (int): The dimensionality of the input embeddings.
        kernel_size (int): The size of the convolutional kernel.
        padding (int): The amount of padding to apply during convolution.
        stride (int): The stride of the convolution operation.
        projection_method (str): The method used for projection, default is 'dw_bn' (depthwise batch normalization).

    Methods:
        __init__:
            Initializes the projection layer with the specified parameters.

        forward:
            Applies the convolutional projection followed by the linear projection to the input hidden state.
            Returns the projected hidden state.

    Note:
        This class inherits from nn.Module and is designed to be used within a Convolutional Transformer network.
    """
    def __init__(self, embed_dim, kernel_size, padding, stride, projection_method="dw_bn"):
        """
        Initializes an instance of the CvtSelfAttentionProjection class.

        Args:
            self (CvtSelfAttentionProjection): The instance of the class.
            embed_dim (int): The dimensionality of the input embeddings.
            kernel_size (int): The size of the convolutional kernel.
            padding (int): The amount of padding to be added to the input.
            stride (int): The stride value for the convolution operation.
            projection_method (string, optional): The method used for projection. Defaults to 'dw_bn'.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        if projection_method == "dw_bn":
            self.convolution_projection = CvtSelfAttentionConvProjection(embed_dim, kernel_size, padding, stride)
        self.linear_projection = CvtSelfAttentionLinearProjection()

    def forward(self, hidden_state):
        """
        Constructs the self-attention projection for the CvtSelfAttentionProjection class.

        Args:
            self (CvtSelfAttentionProjection): The instance of the CvtSelfAttentionProjection class.
            hidden_state (Tensor): The hidden state tensor to be projected.

        Returns:
            Tensor: The hidden state after the convolutional projection, flattened to (batch_size, seq_len, embed_dim).

        Raises:
            None.
        """
        hidden_state = self.convolution_projection(hidden_state)
        hidden_state = self.linear_projection(hidden_state)
        return hidden_state
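
A combined sketch (illustrative values): with the default 'dw_bn' method the module first applies the strided depthwise convolution plus batch norm, then flattens the result into a token sequence.

import numpy as np
import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import CvtSelfAttentionProjection

proj = CvtSelfAttentionProjection(embed_dim=64, kernel_size=3, padding=1, stride=2,
                                  projection_method="dw_bn")
feature_map = mindspore.Tensor(np.random.randn(2, 64, 14, 14).astype(np.float32))
print(proj(feature_map).shape)  # (2, 49, 64): 7x7 positions flattened to 49 tokens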

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfAttentionProjection.__init__(embed_dim, kernel_size, padding, stride, projection_method='dw_bn')

Initializes an instance of the CvtSelfAttentionProjection class.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: CvtSelfAttentionProjection

embed_dim

The dimensionality of the input embeddings.

TYPE: int

kernel_size

The size of the convolutional kernel.

TYPE: int

padding

The amount of padding to be added to the input.

TYPE: int

stride

The stride value for the convolution operation.

TYPE: int

projection_method

The method used for projection. Defaults to 'dw_bn'.

TYPE: string DEFAULT: 'dw_bn'

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(self, embed_dim, kernel_size, padding, stride, projection_method="dw_bn"):
    """
    Initializes an instance of the CvtSelfAttentionProjection class.

    Args:
        self (CvtSelfAttentionProjection): The instance of the class.
        embed_dim (int): The dimensionality of the input embeddings.
        kernel_size (int): The size of the convolutional kernel.
        padding (int): The amount of padding to be added to the input.
        stride (int): The stride value for the convolution operation.
        projection_method (string, optional): The method used for projection. Defaults to 'dw_bn'.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    if projection_method == "dw_bn":
        self.convolution_projection = CvtSelfAttentionConvProjection(embed_dim, kernel_size, padding, stride)
    self.linear_projection = CvtSelfAttentionLinearProjection()

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfAttentionProjection.forward(hidden_state)

Constructs the self-attention projection for the CvtSelfAttentionProjection class.

PARAMETER DESCRIPTION
self

The instance of the CvtSelfAttentionProjection class.

TYPE: CvtSelfAttentionProjection

hidden_state

The hidden state tensor to be projected.

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

The hidden state after the convolutional projection, flattened to (batch_size, seq_len, embed_dim).

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, hidden_state):
    """
    Constructs the self-attention projection for the CvtSelfAttentionProjection class.

    Args:
        self (CvtSelfAttentionProjection): The instance of the CvtSelfAttentionProjection class.
        hidden_state (Tensor): The hidden state tensor to be projected.

    Returns:
        Tensor: The hidden state after the convolutional projection, flattened to (batch_size, seq_len, embed_dim).

    Raises:
        None.
    """
    hidden_state = self.convolution_projection(hidden_state)
    hidden_state = self.linear_projection(hidden_state)
    return hidden_state

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfOutput

Bases: Module

The residual connection is defined in CvtLayer instead of here (as is the case with other models), due to the layernorm applied before each block.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtSelfOutput(nn.Module):
    """
    The residual connection is defined in CvtLayer instead of here (as is the case with other models), due to the
    layernorm applied before each block.
    """
    def __init__(self, embed_dim, drop_rate):
        """
        Initializes an instance of the CvtSelfOutput class.

        Args:
            self (CvtSelfOutput): The instance of the class.
            embed_dim (int): The dimension of the embedding.
            drop_rate (float): The dropout rate to be applied.

        Returns:
            None

        Raises:
            None
        """
        super().__init__()
        self.dense = nn.Linear(embed_dim, embed_dim)
        self.dropout = nn.Dropout(p=drop_rate)

    def forward(self, hidden_state, input_tensor):
        """
        Constructs the output of the CvtSelfOutput class.

        Args:
            self (CvtSelfOutput): An instance of the CvtSelfOutput class.
            hidden_state (Tensor): The hidden state to be processed.
                This tensor represents the current state of the model and is expected to have shape (batch_size, hidden_size).
                It serves as input to the dense layer and will be transformed.
            input_tensor (Tensor): The input tensor to the method.
                This tensor represents additional input to the forward method and can be of any shape.

        Returns:
            Tensor: The hidden state after the dense projection and dropout.

        Raises:
            None.
        """
        hidden_state = self.dense(hidden_state)
        hidden_state = self.dropout(hidden_state)
        return hidden_state
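
A minimal sketch (shapes illustrative): the module only projects and regularizes the attention output; as noted above, the residual addition happens later in CvtLayer.

import numpy as np
import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import CvtSelfOutput

self_output = CvtSelfOutput(embed_dim=64, drop_rate=0.0)
attention_output = mindspore.Tensor(np.random.randn(2, 196, 64).astype(np.float32))
residual = mindspore.Tensor(np.random.randn(2, 196, 64).astype(np.float32))

out = self_output(attention_output, residual)  # residual is accepted but added in CvtLayer
print(out.shape)  # (2, 196, 64)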

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfOutput.__init__(embed_dim, drop_rate)

Initializes an instance of the CvtSelfOutput class.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: CvtSelfOutput

embed_dim

The dimension of the embedding.

TYPE: int

drop_rate

The dropout rate to be applied.

TYPE: float

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(self, embed_dim, drop_rate):
    """
    Initializes an instance of the CvtSelfOutput class.

    Args:
        self (CvtSelfOutput): The instance of the class.
        embed_dim (int): The dimension of the embedding.
        drop_rate (float): The dropout rate to be applied.

    Returns:
        None

    Raises:
        None
    """
    super().__init__()
    self.dense = nn.Linear(embed_dim, embed_dim)
    self.dropout = nn.Dropout(p=drop_rate)

mindnlp.transformers.models.cvt.modeling_cvt.CvtSelfOutput.forward(hidden_state, input_tensor)

Constructs the output of the CvtSelfOutput class.

PARAMETER DESCRIPTION
self

An instance of the CvtSelfOutput class.

TYPE: CvtSelfOutput

hidden_state

The hidden state to be processed. This tensor represents the current state of the model and is expected to have shape (batch_size, hidden_size). It serves as input to the dense layer and will be transformed.

TYPE: Tensor

input_tensor

The input tensor to the method. It is accepted for interface consistency but is not used inside this method; the residual connection is applied in CvtLayer instead.

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

The hidden state after the dense projection and dropout.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, hidden_state, input_tensor):
    """
    Constructs the output of the CvtSelfOutput class.

    Args:
        self (CvtSelfOutput): An instance of the CvtSelfOutput class.
        hidden_state (Tensor): The hidden state to be processed.
            This tensor represents the current state of the model and is expected to have shape (batch_size, hidden_size).
            It serves as input to the dense layer and will be transformed.
        input_tensor (Tensor): The input tensor to the method.
            This tensor represents additional input to the forward method and can be of any shape.

    Returns:
        Tensor: The hidden state after the dense projection and dropout.

    Raises:
        None.
    """
    hidden_state = self.dense(hidden_state)
    hidden_state = self.dropout(hidden_state)
    return hidden_state

mindnlp.transformers.models.cvt.modeling_cvt.CvtStage

Bases: Module

The CvtStage class represents a stage in the Convolutional vision Transformer (CvT) model. It inherits from nn.Module and is designed to handle the processing and transformation of input data within a specific stage of the CvT model.

This class includes methods for initializing the stage with configuration and stage information, as well as forwarding the hidden state through a series of operations involving embeddings, layer processing, and token manipulation.

The class supports the configuration of parameters such as patch size, stride, number of channels, embedding dimensions, padding, dropout rates, depth, number of heads, kernel size, attention and multi-layer perceptron (MLP) settings, and the inclusion of a classification (cls) token.

The forward method is responsible for processing the hidden state by applying the configured embeddings, manipulating the hidden state based on the existence of a cls token, and iterating through the layers to transform the hidden state. Additionally, it handles the splitting and reshaping of the hidden state before returning the updated hidden state and cls token.

Overall, the CvtStage class provides a structured and configurable framework for managing the transformation of data within a specific stage of the Cvt model.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
class CvtStage(nn.Module):

    """
    The CvtStage class represents a stage in the Convolutional vision Transformer (CvT) model. It inherits from nn.Module
    and is designed to handle the processing and transformation of input data within a specific stage of the CvT model.

    This class includes methods for initializing the stage with configuration and stage information, as well as
    forwarding the hidden state through a series of operations involving embeddings, layer processing,
    and token manipulation.

    The class supports the configuration of parameters such as patch size, stride, number of channels,
    embedding dimensions, padding, dropout rates, depth, number of heads, kernel size, attention and
    multi-layer perceptron (MLP) settings, and the inclusion of a classification (cls) token.

    The forward method is responsible for processing the hidden state by applying the configured embeddings,
    manipulating the hidden state based on the existence of a cls token, and iterating through the
    layers to transform the hidden state. Additionally, it handles the splitting and reshaping of the hidden state
    before returning the updated hidden state and cls token.

    Overall, the CvtStage class provides a structured and configurable framework for managing the transformation of
    data within a specific stage of the Cvt model.
    """
    def __init__(self, config, stage):
        """
        This method initializes an instance of the CvtStage class.

        Args:
            self: The instance of the CvtStage class.
            config (object): The configuration object containing various parameters such as patch size, stride,
                number of channels, embedding dimensions, padding, dropout rate, depth, number of heads, kernel
                size, padding for query, key, and value, stride for key and value, stride for query, method
                for QKV projection, QKV bias, attention dropout rate, drop rate, drop path rate, MLP ratio,
                and presence of a classification token.
            stage (int): The stage of the CvtStage.

        Returns:
            None.

        Raises:
            ValueError: If the config.cls_token[self.stage] does not exist or is not a valid value.
            TypeError: If the config.drop_path_rate[self.stage] is not a valid type.
            IndexError: If the drop_path_rates[self.stage] does not exist or is not a valid index.
            TypeError: If any of the parameters in the CvtLayer instantiation are of an invalid type.
        """
        super().__init__()
        self.config = config
        self.stage = stage
        if self.config.cls_token[self.stage]:
            self.cls_token = Parameter(ops.randn(1, 1, self.config.embed_dim[-1]))

        self.embedding = CvtEmbeddings(
            patch_size=config.patch_sizes[self.stage],
            stride=config.patch_stride[self.stage],
            num_channels=config.num_channels if self.stage == 0 else config.embed_dim[self.stage - 1],
            embed_dim=config.embed_dim[self.stage],
            padding=config.patch_padding[self.stage],
            dropout_rate=config.drop_rate[self.stage],
        )

        drop_path_rates = [x.item() for x in ops.linspace(0, config.drop_path_rate[self.stage], config.depth[stage])]

        self.layers = nn.SequentialCell(
            *[
                CvtLayer(
                    num_heads=config.num_heads[self.stage],
                    embed_dim=config.embed_dim[self.stage],
                    kernel_size=config.kernel_qkv[self.stage],
                    padding_q=config.padding_q[self.stage],
                    padding_kv=config.padding_kv[self.stage],
                    stride_kv=config.stride_kv[self.stage],
                    stride_q=config.stride_q[self.stage],
                    qkv_projection_method=config.qkv_projection_method[self.stage],
                    qkv_bias=config.qkv_bias[self.stage],
                    attention_drop_rate=config.attention_drop_rate[self.stage],
                    drop_rate=config.drop_rate[self.stage],
                    drop_path_rate=drop_path_rates[self.stage],
                    mlp_ratio=config.mlp_ratio[self.stage],
                    with_cls_token=config.cls_token[self.stage],
                )
                for _ in range(config.depth[self.stage])
            ]
        )

    def forward(self, hidden_state):
        """
        Runs the forward pass of the CvtStage: patch embedding followed by the stage's transformer layers.

        Args:
            self (CvtStage): The instance of the CvtStage class.
            hidden_state: The input hidden state, a tensor of shape (batch_size, num_channels, height, width).

        Returns:
            tuple:
                A tuple of (hidden_state, cls_token).

                The hidden state is a tensor of shape (batch_size, num_channels, height, width) holding the stage
                output. The cls_token is a tensor of shape (batch_size, 1, num_channels) if the stage adds a
                classification token, otherwise None.

        Raises:
            None.
        """
        cls_token = None
        hidden_state = self.embedding(hidden_state)
        batch_size, num_channels, height, width = hidden_state.shape
        # rearrange b c h w -> b (h w) c
        hidden_state = hidden_state.view(batch_size, num_channels, height * width).permute(0, 2, 1)
        if self.config.cls_token[self.stage]:
            cls_token = self.cls_token.expand(batch_size, -1, -1)
            hidden_state = ops.cat((cls_token, hidden_state), axis=1)

        for layer in self.layers:
            layer_outputs = layer(hidden_state, height, width)
            hidden_state = layer_outputs

        if self.config.cls_token[self.stage]:
            cls_token, hidden_state = ops.split(hidden_state, [1, height * width], 1)
        hidden_state = hidden_state.permute(0, 2, 1).view(batch_size, num_channels, height, width)
        return hidden_state, cls_token
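
The view/permute pair in forward is the flatten-and-transpose form of the einops pattern noted in the comment (b c h w -> b (h w) c). Below is a minimal, standalone sketch of that equivalence; it is illustrative only and assumes a MindSpore version where Tensor.view and Tensor.permute behave as they do in the model code above:

import numpy as np
import mindspore

b, c, h, w = 2, 3, 4, 5
x = mindspore.Tensor(np.arange(b * c * h * w).reshape(b, c, h, w), mindspore.float32)

# b c h w -> b (h w) c: flatten the spatial dims, then move channels last
tokens = x.view(b, c, h * w).permute(0, 2, 1)
print(tokens.shape)  # (2, 20, 3)

# b (h w) c -> b c h w: the inverse, as done after the transformer layers
restored = tokens.permute(0, 2, 1).view(b, c, h, w)
assert (restored.asnumpy() == x.asnumpy()).all()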

mindnlp.transformers.models.cvt.modeling_cvt.CvtStage.__init__(config, stage)

This method initializes an instance of the CvtStage class.

PARAMETER DESCRIPTION
self

The instance of the CvtStage class.

config

The configuration object containing various parameters such as patch size, stride, number of channels, embedding dimensions, padding, dropout rate, depth, number of heads, kernel size, padding for query, key, and value, stride for key and value, stride for query, method for QKV projection, QKV bias, attention dropout rate, drop rate, drop path rate, MLP ratio, and presence of a classification token.

TYPE: object

stage

The stage of the CvtStage.

TYPE: int

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If the config.cls_token[self.stage] does not exist or is not a valid value.

TypeError

If the config.drop_path_rate[self.stage] is not a valid type.

IndexError

If the drop_path_rates[self.stage] does not exist or is not a valid index.

TypeError

If any of the parameters in the CvtLayer instantiation are of an invalid type.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def __init__(self, config, stage):
    """
    This method initializes an instance of the CvtStage class.

    Args:
        self: The instance of the CvtStage class.
        config (object): The configuration object containing various parameters such as patch size, stride,
            number of channels, embedding dimensions, padding, dropout rate, depth, number of heads, kernel
            size, padding for query, key, and value, stride for key and value, stride for query, method
            for QKV projection, QKV bias, attention dropout rate, drop rate, drop path rate, MLP ratio,
            and presence of a classification token.
        stage (int): The stage of the CvtStage.

    Returns:
        None.

    Raises:
        ValueError: If the config.cls_token[self.stage] does not exist or is not a valid value.
        TypeError: If the config.drop_path_rate[self.stage] is not a valid type.
        IndexError: If the drop_path_rates[self.stage] does not exist or is not a valid index.
        TypeError: If any of the parameters in the CvtLayer instantiation are of an invalid type.
    """
    super().__init__()
    self.config = config
    self.stage = stage
    if self.config.cls_token[self.stage]:
        self.cls_token = Parameter(ops.randn(1, 1, self.config.embed_dim[-1]))

    self.embedding = CvtEmbeddings(
        patch_size=config.patch_sizes[self.stage],
        stride=config.patch_stride[self.stage],
        num_channels=config.num_channels if self.stage == 0 else config.embed_dim[self.stage - 1],
        embed_dim=config.embed_dim[self.stage],
        padding=config.patch_padding[self.stage],
        dropout_rate=config.drop_rate[self.stage],
    )

    drop_path_rates = [x.item() for x in ops.linspace(0, config.drop_path_rate[self.stage], config.depth[stage])]

    self.layers = nn.SequentialCell(
        *[
            CvtLayer(
                num_heads=config.num_heads[self.stage],
                embed_dim=config.embed_dim[self.stage],
                kernel_size=config.kernel_qkv[self.stage],
                padding_q=config.padding_q[self.stage],
                padding_kv=config.padding_kv[self.stage],
                stride_kv=config.stride_kv[self.stage],
                stride_q=config.stride_q[self.stage],
                qkv_projection_method=config.qkv_projection_method[self.stage],
                qkv_bias=config.qkv_bias[self.stage],
                attention_drop_rate=config.attention_drop_rate[self.stage],
                drop_rate=config.drop_rate[self.stage],
                drop_path_rate=drop_path_rates[self.stage],
                mlp_ratio=config.mlp_ratio[self.stage],
                with_cls_token=config.cls_token[self.stage],
            )
            for _ in range(config.depth[self.stage])
        ]
    )
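
As a usage illustration (not part of the library source), a single stage can be built directly from a CvtConfig; the import paths below are the ones documented on this page, and the default CvT-13 configuration values are assumed:

from mindnlp.transformers.models.cvt.configuration_cvt import CvtConfig
from mindnlp.transformers.models.cvt.modeling_cvt import CvtStage

# Default configuration: three stages, embed_dim == [64, 192, 384], depth == [1, 2, 10].
config = CvtConfig()

# Stage 0 consumes the raw image channels (config.num_channels == 3);
# later stages consume the previous stage's embedding dimension.
first_stage = CvtStage(config, stage=0)

# The last stage (index 2) additionally creates the learnable cls_token,
# because config.cls_token == [False, False, True] by default.
last_stage = CvtStage(config, stage=2)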

mindnlp.transformers.models.cvt.modeling_cvt.CvtStage.forward(hidden_state)

Runs the forward pass of the CvtStage: patch embedding followed by the stage's transformer layers.

PARAMETER DESCRIPTION
self

The instance of the CvtStage class.

TYPE: CvtStage

hidden_state

The input hidden state, a tensor of shape (batch_size, num_channels, height, width).

RETURNS DESCRIPTION
tuple

A tuple of (hidden_state, cls_token).

The hidden state is a tensor of shape (batch_size, num_channels, height, width) holding the stage output. The cls_token is a tensor of shape (batch_size, 1, num_channels) if the stage adds a classification token, otherwise None.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def forward(self, hidden_state):
    """
    Runs the forward pass of the CvtStage: patch embedding followed by the stage's transformer layers.

    Args:
        self (CvtStage): The instance of the CvtStage class.
        hidden_state: The input hidden state, a tensor of shape (batch_size, num_channels, height, width).

    Returns:
        tuple:
            A tuple of (hidden_state, cls_token).

            The hidden state is a tensor of shape (batch_size, num_channels, height, width) holding the stage
            output. The cls_token is a tensor of shape (batch_size, 1, num_channels) if the stage adds a
            classification token, otherwise None.

    Raises:
        None.
    """
    cls_token = None
    hidden_state = self.embedding(hidden_state)
    batch_size, num_channels, height, width = hidden_state.shape
    # rearrange b c h w -> b (h w) c
    hidden_state = hidden_state.view(batch_size, num_channels, height * width).permute(0, 2, 1)
    if self.config.cls_token[self.stage]:
        cls_token = self.cls_token.expand(batch_size, -1, -1)
        hidden_state = ops.cat((cls_token, hidden_state), axis=1)

    for layer in self.layers:
        layer_outputs = layer(hidden_state, height, width)
        hidden_state = layer_outputs

    if self.config.cls_token[self.stage]:
        cls_token, hidden_state = ops.split(hidden_state, [1, height * width], 1)
    hidden_state = hidden_state.permute(0, 2, 1).view(batch_size, num_channels, height, width)
    return hidden_state, cls_token
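
Continuing the sketch above, running the first stage on a dummy batch shows the shape contract described in the docstring. The 56x56 spatial size follows from the default patch_sizes[0] == 7, patch_stride[0] == 4 and patch_padding[0] == 2 applied to 224x224 inputs; this is an illustrative example, not a documented test, and it assumes the stage is callable in eager mode:

import numpy as np
import mindspore

from mindnlp.transformers.models.cvt.configuration_cvt import CvtConfig
from mindnlp.transformers.models.cvt.modeling_cvt import CvtStage

config = CvtConfig()
stage0 = CvtStage(config, stage=0)

# Dummy batch of two 224x224 RGB images.
pixel_values = mindspore.Tensor(np.random.randn(2, 3, 224, 224).astype(np.float32))

hidden_state, cls_token = stage0(pixel_values)
print(hidden_state.shape)  # expected (2, 64, 56, 56): embed_dim[0] channels after the stride-4 patch embedding
print(cls_token)           # None: only the last stage prepends a classification token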

mindnlp.transformers.models.cvt.modeling_cvt.drop_path(input, drop_prob=0.0, training=False)

Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).

Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks, however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper... See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the argument.

Source code in mindnlp/transformers/models/cvt/modeling_cvt.py
def drop_path(input: mindspore.Tensor, drop_prob: float = 0.0, training: bool = False) -> mindspore.Tensor:
    """
    Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).

    Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks,
    however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
    See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the
    layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the
    argument.
    """
    if drop_prob == 0.0 or not training:
        return input
    keep_prob = 1 - drop_prob
    shape = (input.shape[0],) + (1,) * (input.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + ops.rand(shape, dtype=input.dtype)
    random_tensor = random_tensor.floor()  # binarize
    output = input.div(keep_prob) * random_tensor
    return output
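
To make the stochastic-depth behaviour concrete, here is a small, self-contained sketch (the values are illustrative, not from the library's tests): with drop_prob=0.5 each sample in the batch is either zeroed out entirely or rescaled by 1 / keep_prob == 2.0, and the function is an identity when training=False.

import numpy as np
import mindspore
from mindnlp.transformers.models.cvt.modeling_cvt import drop_path

x = mindspore.Tensor(np.ones((4, 3, 8, 8), dtype=np.float32))

# Evaluation mode: the input is returned unchanged.
y_eval = drop_path(x, drop_prob=0.5, training=False)
assert (y_eval.asnumpy() == x.asnumpy()).all()

# Training mode: one Bernoulli draw per sample, so each of the 4 samples is either
# all zeros or all 2.0 (== 1 / keep_prob) in this example.
y_train = drop_path(x, drop_prob=0.5, training=True)
print(y_train.asnumpy().reshape(4, -1)[:, 0])  # e.g. [2. 0. 2. 2.]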