
starcoder2

mindnlp.transformers.models.starcoder2.configuration_starcoder2

Starcoder2 model configuration

mindnlp.transformers.models.starcoder2.configuration_starcoder2.Starcoder2Config

Bases: PretrainedConfig

This class stores the configuration of a [Starcoder2Model]. It is used to instantiate a Starcoder2 model according to the specified arguments, which define the model architecture. Instantiating a configuration with the defaults yields a configuration similar to that of the [bigcode/starcoder2-7b_16k](https://hf-mirror.com/bigcode/starcoder2-7b_16k) model.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vocab_size

Vocabulary size of the Starcoder2 model. Defines the number of different tokens that can be represented by the input_ids passed when calling [Starcoder2Model].

TYPE: `int`, *optional*, defaults to 49152 DEFAULT: 49152

hidden_size

Dimension of the hidden representations.

TYPE: `int`, *optional*, defaults to 3072 DEFAULT: 3072

intermediate_size

Dimension of the MLP representations.

TYPE: `int`, *optional*, defaults to 12288 DEFAULT: 12288

num_hidden_layers

Number of hidden layers in the Transformer decoder.

TYPE: `int`, *optional*, defaults to 30 DEFAULT: 30

num_attention_heads

Number of attention heads for each attention layer in the Transformer decoder.

TYPE: `int`, *optional*, defaults to 24 DEFAULT: 24

num_key_value_heads

This is the number of key_value heads that should be used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA); if num_key_value_heads=1, the model will use Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by mean-pooling all the original heads within that group. For more details, check out [this paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to 2.

TYPE: `int`, *optional*, defaults to 2 DEFAULT: 2

hidden_act

The non-linear activation function (function or string) in the decoder.

TYPE: `str` or `function`, *optional*, defaults to `"gelu_pytorch_tanh"` DEFAULT: 'gelu_pytorch_tanh'

max_position_embeddings

The maximum sequence length that this model might ever be used with. Starcoder2's sliding window attention allows sequences of up to 4096*32 tokens.

TYPE: `int`, *optional*, defaults to 4096 DEFAULT: 4096

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.018042 DEFAULT: 0.018042

norm_epsilon

Epsilon value for the layer norm.

TYPE: `float`, *optional*, defaults to 1e-05 DEFAULT: 1e-05

use_cache

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

bos_token_id

The id of the "beginning-of-sequence" token.

TYPE: `int`, *optional*, defaults to 50256 DEFAULT: 50256

eos_token_id

The id of the "end-of-sequence" token.

TYPE: `int`, *optional*, defaults to 50256 DEFAULT: 50256

rope_theta

The base period of the RoPE embeddings.

TYPE: `float`, *optional*, defaults to 10000.0 DEFAULT: 10000.0

sliding_window

Sliding window attention window size. If not specified, will default to None (no sliding window).

TYPE: `int`, *optional* DEFAULT: None

attention_dropout

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

residual_dropout

Residual connection dropout value.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

embedding_dropout

Embedding dropout.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

use_bias

Whether to use bias term on linear layers of the model.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True
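
The `num_key_value_heads` description above distinguishes MHA, MQA, and GQA and mentions mean-pooling heads when converting a checkpoint. The following is a minimal pure-Python sketch of both ideas. The helper names `attention_variant` and `mean_pool_heads` are hypothetical and are not part of mindnlp, and the divisibility check is an assumption of this sketch rather than something the configuration class enforces.

```python
def attention_variant(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Classify the attention scheme implied by the two head counts."""
    # Assumption of this sketch: query heads split evenly across key/value heads.
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    if num_key_value_heads == num_attention_heads:
        return "MHA"
    if num_key_value_heads == 1:
        return "MQA"
    return "GQA"


def mean_pool_heads(heads, num_key_value_heads):
    """Average consecutive groups of per-head vectors into one vector per group."""
    group_size = len(heads) // num_key_value_heads
    pooled = []
    for g in range(num_key_value_heads):
        group = heads[g * group_size:(g + 1) * group_size]
        # zip(*group) iterates over the same coordinate of every head in the group.
        pooled.append([sum(column) / group_size for column in zip(*group)])
    return pooled


# Default Starcoder2Config head counts: 24 query heads, 2 key/value heads.
print(attention_variant(24, 2))

# Four toy "heads" pooled down to two key/value heads.
print(mean_pool_heads([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], 2))
```

With the default head counts the sketch reports GQA, since 2 is neither 1 nor equal to 24.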

Example
>>> from mindnlp.transformers import Starcoder2Model, Starcoder2Config
...
>>> # Initializing a Starcoder2 7B style configuration
>>> configuration = Starcoder2Config()
...
>>> # Initializing a model from the Starcoder2 7B style configuration
>>> model = Starcoder2Model(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in mindnlp/transformers/models/starcoder2/configuration_starcoder2.py
class Starcoder2Config(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`Starcoder2Model`]. It is used to instantiate a
    Starcoder2 model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the [bigcode/starcoder2-7b_16k]
    (https://hf-mirror.com/bigcode/starcoder2-7b_16k) model.


    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 49152):
            Vocabulary size of the Starcoder2 model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`Starcoder2Model`].
        hidden_size (`int`, *optional*, defaults to 3072):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 12288):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 30):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 24):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*, defaults to 2):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA); otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by mean-pooling all the original heads within that group. For more details, check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `2`.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu_pytorch_tanh"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 4096):
            The maximum sequence length that this model might ever be used with. Starcoder2's sliding window attention
            allows sequence of up to 4096*32 tokens.
        initializer_range (`float`, *optional*, defaults to 0.018042):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        norm_epsilon (`float`, *optional*, defaults to 1e-05):
            Epsilon value for the layer norm
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        bos_token_id (`int`, *optional*, defaults to 50256):
            The id of the "beginning-of-sequence" token.
        eos_token_id (`int`, *optional*, defaults to 50256):
            The id of the "end-of-sequence" token.
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        sliding_window (`int`, *optional*):
            Sliding window attention window size. If not specified, will default to `None` (no sliding window).
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        residual_dropout (`float`, *optional*, defaults to 0.0):
            Residual connection dropout value.
        embedding_dropout (`float`, *optional*, defaults to 0.0):
            Embedding dropout.
        use_bias (`bool`, *optional*, defaults to `True`):
            Whether to use bias term on linear layers of the model.

    Example:
        ```python
        >>> from mindnlp.transformers import Starcoder2Model, Starcoder2Config
        ...
        >>> # Initializing a Starcoder2 7B style configuration
        >>> configuration = Starcoder2Config()
        ...
        >>> # Initializing a model from the Starcoder2 7B style configuration
        >>> model = Starcoder2Model(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "starcoder2"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=49152,
        hidden_size=3072,
        intermediate_size=12288,
        num_hidden_layers=30,
        num_attention_heads=24,
        num_key_value_heads=2,
        hidden_act="gelu_pytorch_tanh",
        max_position_embeddings=4096,
        initializer_range=0.018042,
        norm_epsilon=1e-5,
        use_cache=True,
        bos_token_id=50256,
        eos_token_id=50256,
        rope_theta=10000.0,
        sliding_window=None,
        attention_dropout=0.0,
        residual_dropout=0.0,
        embedding_dropout=0.0,
        use_bias=True,
        **kwargs,
    ):
        """
        Initializes a new instance of the Starcoder2Config class.

        Args:
            self: The object itself.
            vocab_size (int, optional): The size of the vocabulary. Defaults to 49152.
            hidden_size (int, optional): The size of the hidden layer. Defaults to 3072.
            intermediate_size (int, optional): The size of the intermediate (MLP) layer in each transformer decoder block. Defaults to 12288.
            num_hidden_layers (int, optional): The number of hidden layers in the transformer decoder. Defaults to 30.
            num_attention_heads (int, optional): The number of attention heads in each attention layer. Defaults to 24.
            num_key_value_heads (int, optional): The number of key-value heads used for grouped-query attention. Defaults to 2.
            hidden_act (str, optional): The activation function for the hidden layer. Defaults to 'gelu_pytorch_tanh'.
            max_position_embeddings (int, optional): The maximum number of tokens in a sequence. Defaults to 4096.
            initializer_range (float, optional): The standard deviation used to initialize weight matrices. Defaults to 0.018042.
            norm_epsilon (float, optional): The epsilon value for normalization. Defaults to 1e-05.
            use_cache (bool, optional): Whether the model returns the last key/value states for reuse during decoding. Defaults to True.
            bos_token_id (int, optional): The ID of the beginning-of-sequence token. Defaults to 50256.
            eos_token_id (int, optional): The ID of the end-of-sequence token. Defaults to 50256.
            rope_theta (float, optional): The base period of the RoPE embeddings. Defaults to 10000.0.
            sliding_window (None or int, optional): The sliding window attention window size. Defaults to None.
            attention_dropout (float, optional): The dropout rate for attention layers. Defaults to 0.0.
            residual_dropout (float, optional): The dropout rate for residual connections. Defaults to 0.0.
            embedding_dropout (float, optional): The dropout rate for embeddings. Defaults to 0.0.
            use_bias (bool, optional): Whether to use bias terms in the model's linear layers. Defaults to True.

        Returns:
            None.

        Raises:
            None.
        """
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.sliding_window = sliding_window
        self.use_bias = use_bias
        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.norm_epsilon = norm_epsilon
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.attention_dropout = attention_dropout
        self.residual_dropout = residual_dropout
        self.embedding_dropout = embedding_dropout

        super().__init__(
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            **kwargs,
        )

mindnlp.transformers.models.starcoder2.configuration_starcoder2.Starcoder2Config.__init__(vocab_size=49152, hidden_size=3072, intermediate_size=12288, num_hidden_layers=30, num_attention_heads=24, num_key_value_heads=2, hidden_act='gelu_pytorch_tanh', max_position_embeddings=4096, initializer_range=0.018042, norm_epsilon=1e-05, use_cache=True, bos_token_id=50256, eos_token_id=50256, rope_theta=10000.0, sliding_window=None, attention_dropout=0.0, residual_dropout=0.0, embedding_dropout=0.0, use_bias=True, **kwargs)

Initializes a new instance of the Starcoder2Config class.

PARAMETER DESCRIPTION
self

The object itself.

vocab_size

The size of the vocabulary. Defaults to 49152.

TYPE: int DEFAULT: 49152

hidden_size

The size of the hidden layer. Defaults to 3072.

TYPE: int DEFAULT: 3072

intermediate_size

The size of the intermediate (MLP) layer in each transformer decoder block. Defaults to 12288.

TYPE: int DEFAULT: 12288

num_hidden_layers

The number of hidden layers in the transformer decoder. Defaults to 30.

TYPE: int DEFAULT: 30

num_attention_heads

The number of attention heads in each attention layer. Defaults to 24.

TYPE: int DEFAULT: 24

num_key_value_heads

The number of key-value heads used for grouped-query attention. Defaults to 2.

TYPE: int DEFAULT: 2

hidden_act

The activation function for the hidden layer. Defaults to 'gelu_pytorch_tanh'.

TYPE: str DEFAULT: 'gelu_pytorch_tanh'

max_position_embeddings

The maximum number of tokens in a sequence. Defaults to 4096.

TYPE: int DEFAULT: 4096

initializer_range

The standard deviation used to initialize weight matrices. Defaults to 0.018042.

TYPE: float DEFAULT: 0.018042

norm_epsilon

The epsilon value for normalization. Defaults to 1e-05.

TYPE: float DEFAULT: 1e-05

use_cache

Whether the model returns the last key/value states for reuse during decoding. Defaults to True.

TYPE: bool DEFAULT: True

bos_token_id

The ID of the beginning-of-sequence token. Defaults to 50256.

TYPE: int DEFAULT: 50256

eos_token_id

The ID of the end-of-sequence token. Defaults to 50256.

TYPE: int DEFAULT: 50256

rope_theta

The base period of the RoPE embeddings. Defaults to 10000.0.

TYPE: float DEFAULT: 10000.0

sliding_window

The sliding window attention window size. Defaults to None (no sliding window).

TYPE: None or int DEFAULT: None

attention_dropout

The dropout rate for attention layers. Defaults to 0.0.

TYPE: float DEFAULT: 0.0

residual_dropout

The dropout rate for residual connections. Defaults to 0.0.

TYPE: float DEFAULT: 0.0

embedding_dropout

The dropout rate for embeddings. Defaults to 0.0.

TYPE: float DEFAULT: 0.0

use_bias

Whether to use bias terms in the model's linear layers. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION

None.
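
The `rope_theta` parameter above is described as the base period of the RoPE embeddings. As a rough illustration, the sketch below computes per-channel-pair inverse frequencies using the commonly used RoPE formulation `theta ** (-2i / head_dim)`. This page does not show the internals of `Starcoder2RotaryEmbedding`, so the formula and the helper names `rope_inverse_frequencies` and `rope_angles` are assumptions made for illustration only.

```python
import math


def rope_inverse_frequencies(head_dim: int, rope_theta: float = 10000.0):
    """Return one inverse frequency per pair of channels (assumed standard RoPE form)."""
    return [1.0 / (rope_theta ** (i / head_dim)) for i in range(0, head_dim, 2)]


def rope_angles(position: int, head_dim: int, rope_theta: float = 10000.0):
    """Rotation angle applied to each channel pair at a given token position."""
    return [position * freq for freq in rope_inverse_frequencies(head_dim, rope_theta)]


# head_dim = hidden_size // num_attention_heads = 3072 // 24 with the defaults.
inv_freq = rope_inverse_frequencies(head_dim=128)
print(len(inv_freq))   # one frequency per channel pair
print(inv_freq[0])     # the fastest-rotating pair

# Wavelength (in token positions) of the slowest-rotating pair.
print(2 * math.pi / inv_freq[-1])
```

A larger `rope_theta` stretches the slow-rotating pairs over more positions, which is why this value is often raised for long-context variants.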

Source code in mindnlp/transformers/models/starcoder2/configuration_starcoder2.py
def __init__(
    self,
    vocab_size=49152,
    hidden_size=3072,
    intermediate_size=12288,
    num_hidden_layers=30,
    num_attention_heads=24,
    num_key_value_heads=2,
    hidden_act="gelu_pytorch_tanh",
    max_position_embeddings=4096,
    initializer_range=0.018042,
    norm_epsilon=1e-5,
    use_cache=True,
    bos_token_id=50256,
    eos_token_id=50256,
    rope_theta=10000.0,
    sliding_window=None,
    attention_dropout=0.0,
    residual_dropout=0.0,
    embedding_dropout=0.0,
    use_bias=True,
    **kwargs,
):
    """
    Initializes a new instance of the Starcoder2Config class.

    Args:
        self: The object itself.
        vocab_size (int, optional): The size of the vocabulary. Defaults to 49152.
        hidden_size (int, optional): The size of the hidden layer. Defaults to 3072.
        intermediate_size (int, optional): The size of the intermediate (MLP) layer in each transformer decoder block. Defaults to 12288.
        num_hidden_layers (int, optional): The number of hidden layers in the transformer decoder. Defaults to 30.
        num_attention_heads (int, optional): The number of attention heads in each attention layer. Defaults to 24.
        num_key_value_heads (int, optional): The number of key-value heads used for grouped-query attention. Defaults to 2.
        hidden_act (str, optional): The activation function for the hidden layer. Defaults to 'gelu_pytorch_tanh'.
        max_position_embeddings (int, optional): The maximum number of tokens in a sequence. Defaults to 4096.
        initializer_range (float, optional): The standard deviation used to initialize weight matrices. Defaults to 0.018042.
        norm_epsilon (float, optional): The epsilon value for normalization. Defaults to 1e-05.
        use_cache (bool, optional): Whether the model returns the last key/value states for reuse during decoding. Defaults to True.
        bos_token_id (int, optional): The ID of the beginning-of-sequence token. Defaults to 50256.
        eos_token_id (int, optional): The ID of the end-of-sequence token. Defaults to 50256.
        rope_theta (float, optional): The base period of the RoPE embeddings. Defaults to 10000.0.
        sliding_window (None or int, optional): The sliding window attention window size. Defaults to None.
        attention_dropout (float, optional): The dropout rate for attention layers. Defaults to 0.0.
        residual_dropout (float, optional): The dropout rate for residual connections. Defaults to 0.0.
        embedding_dropout (float, optional): The dropout rate for embeddings. Defaults to 0.0.
        use_bias (bool, optional): Whether to use bias terms in the model's linear layers. Defaults to True.

    Returns:
        None.

    Raises:
        None.
    """
    self.vocab_size = vocab_size
    self.max_position_embeddings = max_position_embeddings
    self.hidden_size = hidden_size
    self.intermediate_size = intermediate_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.sliding_window = sliding_window
    self.use_bias = use_bias
    self.num_key_value_heads = num_key_value_heads
    self.hidden_act = hidden_act
    self.initializer_range = initializer_range
    self.norm_epsilon = norm_epsilon
    self.use_cache = use_cache
    self.rope_theta = rope_theta
    self.attention_dropout = attention_dropout
    self.residual_dropout = residual_dropout
    self.embedding_dropout = embedding_dropout

    super().__init__(
        bos_token_id=bos_token_id,
        eos_token_id=eos_token_id,
        **kwargs,
    )

mindnlp.transformers.models.starcoder2.modeling_starcoder2

MindSpore Starcoder2 model.

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2Attention

Bases: Module

Multi-headed attention from the "Attention Is All You Need" paper, modified to use sliding window attention as described in Longformer and "Generating Long Sequences with Sparse Transformers".
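
To make the sliding window idea concrete, here is a small pure-Python sketch of a causal mask restricted to a window. The `forward` method documented on this page receives its `attention_mask` as an input, so this only illustrates what such a mask could look like. The helper name `sliding_window_causal_mask` is hypothetical, and the boundary convention (a query sees itself plus the `sliding_window - 1` previous tokens) is an assumption; implementations differ by one position here.

```python
def sliding_window_causal_mask(seq_len, sliding_window=None):
    """Return mask[q][k], True where query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q  # never attend to future positions
            # Assumed convention: keep keys at distance 0 .. sliding_window - 1.
            in_window = sliding_window is None or (q - k) < sliding_window
            row.append(causal and in_window)
        mask.append(row)
    return mask


# Visualise a 5-token sequence with a window of 3 tokens.
for row in sliding_window_causal_mask(5, sliding_window=3):
    print("".join("x" if allowed else "." for allowed in row))
```

Passing `sliding_window=None` reduces the sketch to an ordinary lower-triangular causal mask, matching the configuration default of no sliding window.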

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
class Starcoder2Attention(nn.Module):
    """
    Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
    and "Generating Long Sequences with Sparse Transformers".
    """
    def __init__(self, config: Starcoder2Config, layer_idx: Optional[int] = None):
        """
        Initializes a new instance of the Starcoder2Attention class.

        Args:
            self: The object instance.
            config (Starcoder2Config): The configuration object containing various model hyperparameters.
            layer_idx (Optional[int], default=None): The index of the layer. If None, a warning will be logged and
                it is not recommended to omit this parameter as it may cause errors during the forward call if
                caching is used.

        Returns:
            None

        Raises:
            ValueError: If the `hidden_size` is not divisible by `num_heads`.

        """
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx
        if layer_idx is None:
            logger.warning_once(
                f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
                "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
                "when creating this class."
            )

        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_key_value_heads = config.num_key_value_heads
        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
        self.max_position_embeddings = config.max_position_embeddings
        self.rope_theta = config.rope_theta
        self.use_bias = config.use_bias
        self.is_causal = True
        self.attention_dropout = config.attention_dropout
        self.residual_dropout = config.residual_dropout

        if (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=self.use_bias)
        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=self.use_bias)
        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=self.use_bias)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=self.use_bias)

        self.rotary_emb = Starcoder2RotaryEmbedding(
            self.head_dim,
            max_position_embeddings=self.max_position_embeddings,
            base=self.rope_theta,
        )

    def forward(
        self,
        hidden_states: mindspore.Tensor,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
        **kwargs,
    ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]:
        '''
        Runs the forward pass of the Starcoder2Attention layer.

        Args:
            self (Starcoder2Attention): The instance of the Starcoder2Attention layer.
            hidden_states (mindspore.Tensor): The input hidden states tensor with shape
                (batch_size, sequence_length, hidden_size).
            attention_mask (Optional[mindspore.Tensor]): An optional tensor for masking the attention scores with
                shape (batch_size, 1, sequence_length, key_value_sequence_length).
            position_ids (Optional[mindspore.Tensor]): An optional tensor to specify the position ids with shape
                (batch_size, sequence_length).
            past_key_value (Optional[Cache]): An optional cache for previous key and value states.
            output_attentions (bool): A flag to indicate whether to return the attention weights.
            use_cache (bool): A flag to indicate whether to use the cache for key and value states.

        Returns:
            Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: A tuple containing
                the attention output tensor with shape (batch_size, sequence_length, hidden_size),
                attention weights tensor (optional), and past key value tuple (optional).

        Raises:
            ValueError: If the attention weights or attention mask shape does not match the expected shape.
            ValueError: If the output size of `attn_output` does not match the expected shape.
            ValueError: If the cache structure has changed and the layer index is not initialized for
                auto-regressive decoding.
        '''
        if "padding_mask" in kwargs:
            warnings.warn(
                "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
            )
        bsz, q_len, _ = hidden_states.shape

        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)

        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).swapaxes(1, 2)
        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).swapaxes(1, 2)
        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).swapaxes(1, 2)

        kv_seq_len = key_states.shape[-2]
        if past_key_value is not None:
            if self.layer_idx is None:
                raise ValueError(
                    f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
                    "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
                    "with a layer index."
                )
            kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

        if past_key_value is not None:
            cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models
            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

        # repeat k/v heads if n_kv_heads < n_heads
        key_states = repeat_kv(key_states, self.num_key_value_groups)
        value_states = repeat_kv(value_states, self.num_key_value_groups)

        attn_weights = ops.matmul(query_states, key_states.swapaxes(2, 3)) / math.sqrt(self.head_dim)

        if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len):
            raise ValueError(
                f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
                f" {attn_weights.shape}"
            )

        if attention_mask is not None:
            if attention_mask.shape != (bsz, 1, q_len, kv_seq_len):
                raise ValueError(
                    f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}"
                )

            attn_weights = attn_weights + attention_mask

        # upcast attention to fp32
        attn_weights = ops.softmax(attn_weights, axis=-1, dtype=mindspore.float32).to(query_states.dtype)
        attn_weights = ops.dropout(attn_weights, p=self.attention_dropout, training=self.training)
        attn_output = ops.matmul(attn_weights, value_states)

        if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim):
            raise ValueError(
                f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
                f" {attn_output.shape}"
            )

        attn_output = attn_output.swapaxes(1, 2)
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

        attn_output = self.o_proj(attn_output)
        attn_output = ops.dropout(attn_output, p=self.residual_dropout, training=self.training)

        if not output_attentions:
            attn_weights = None

        return attn_output, attn_weights, past_key_value
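
The core of the forward pass above (repeat the key/value heads, scale the dot products by `sqrt(head_dim)`, add the mask, apply softmax, weight the values) can be sketched without MindSpore using nested lists. This is an illustrative sketch only: `repeat_kv` is not shown on this page, so its behaviour here (copies of the same head placed next to each other) is an assumption, and `softmax` and `attend` are hypothetical helper names. RoPE, dropout, caching, and the output projection are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def repeat_kv(heads, n_rep):
    """Repeat each key/value head n_rep times (assumed to mirror the library helper)."""
    return [head for head in heads for _ in range(n_rep)]


def attend(query_heads, key_heads, value_heads, additive_mask=None):
    """query_heads: [num_heads][q_len][head_dim]; key/value_heads: [num_kv_heads][kv_len][head_dim]."""
    n_rep = len(query_heads) // len(key_heads)
    key_heads = repeat_kv(key_heads, n_rep)
    value_heads = repeat_kv(value_heads, n_rep)
    head_dim = len(query_heads[0][0])
    outputs = []
    for q_head, k_head, v_head in zip(query_heads, key_heads, value_heads):
        head_out = []
        for qi, q in enumerate(q_head):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim) for k in k_head]
            if additive_mask is not None:
                # Masked positions carry -inf, so they receive zero weight after softmax.
                scores = [s + m for s, m in zip(scores, additive_mask[qi])]
            weights = softmax(scores)
            head_out.append([sum(w * v[d] for w, v in zip(weights, v_head)) for d in range(head_dim)])
        outputs.append(head_out)
    return outputs


neg_inf = float("-inf")
queries = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]  # 2 query heads
keys = [[[1.0, 0.0], [0.0, 1.0]]]                               # 1 shared key head
values = [[[1.0, 2.0], [3.0, 4.0]]]                             # 1 shared value head
causal = [[0.0, neg_inf], [0.0, 0.0]]
print(attend(queries, keys, values, causal))
```

Because the first query position can only see the first key, its output in every head is exactly the first value vector.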

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2Attention.__init__(config, layer_idx=None)

Initializes a new instance of the Starcoder2Attention class.

PARAMETER DESCRIPTION
self

The object instance.

config

The configuration object containing various model hyperparameters.

TYPE: Starcoder2Config

layer_idx

The index of the layer. If None, a warning is logged. Omitting this parameter is not recommended, because the forward call can fail when caching is used.

TYPE: Optional[int], default=None DEFAULT: None

RETURNS DESCRIPTION

None

RAISES DESCRIPTION
ValueError

If the hidden_size is not divisible by num_heads.
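
The divisibility requirement follows from how the constructor derives the per-head dimension with integer division. The sketch below reproduces that bookkeeping in plain Python. The function name `head_layout` and the dictionary keys are hypothetical, chosen only for this illustration.

```python
def head_layout(hidden_size, num_attention_heads, num_key_value_heads):
    """Derive per-head sizes the way the constructor does, and validate them."""
    head_dim = hidden_size // num_attention_heads
    if head_dim * num_attention_heads != hidden_size:
        raise ValueError(
            f"hidden_size must be divisible by num_heads (got hidden_size={hidden_size} "
            f"and num_heads={num_attention_heads})."
        )
    return {
        "head_dim": head_dim,
        "num_key_value_groups": num_attention_heads // num_key_value_heads,
        "q_proj_out_features": num_attention_heads * head_dim,
        "kv_proj_out_features": num_key_value_heads * head_dim,
    }


# Default Starcoder2Config values: hidden_size=3072, 24 query heads, 2 key/value heads.
print(head_layout(3072, 24, 2))
```

With the defaults, each head has 128 channels and every key/value head is shared by 12 query heads, so the key and value projections are much narrower than the query projection.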

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def __init__(self, config: Starcoder2Config, layer_idx: Optional[int] = None):
    """
    Initializes a new instance of the Starcoder2Attention class.

    Args:
        self: The object instance.
        config (Starcoder2Config): The configuration object containing various model hyperparameters.
        layer_idx (Optional[int], default=None): The index of the layer. If None, a warning will be logged and
            it is not recommended to omit this parameter as it may cause errors during the forward call if
            caching is used.

    Returns:
        None

    Raises:
        ValueError: If the `hidden_size` is not divisible by `num_heads`.

    """
    super().__init__()
    self.config = config
    self.layer_idx = layer_idx
    if layer_idx is None:
        logger.warning_once(
            f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
            "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
            "when creating this class."
        )

    self.hidden_size = config.hidden_size
    self.num_heads = config.num_attention_heads
    self.head_dim = self.hidden_size // self.num_heads
    self.num_key_value_heads = config.num_key_value_heads
    self.num_key_value_groups = self.num_heads // self.num_key_value_heads
    self.max_position_embeddings = config.max_position_embeddings
    self.rope_theta = config.rope_theta
    self.use_bias = config.use_bias
    self.is_causal = True
    self.attention_dropout = config.attention_dropout
    self.residual_dropout = config.residual_dropout

    if (self.head_dim * self.num_heads) != self.hidden_size:
        raise ValueError(
            f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
            f" and `num_heads`: {self.num_heads})."
        )
    self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=self.use_bias)
    self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=self.use_bias)
    self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=self.use_bias)
    self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=self.use_bias)

    self.rotary_emb = Starcoder2RotaryEmbedding(
        self.head_dim,
        max_position_embeddings=self.max_position_embeddings,
        base=self.rope_theta,
    )
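The shape arithmetic performed in the constructor can be checked in isolation. The following is a minimal sketch in plain Python, not part of mindnlp; the helper name `attention_shapes` and the value `num_key_value_heads=2` are assumptions chosen for illustration.

```python
def attention_shapes(hidden_size, num_heads, num_key_value_heads):
    """Mirror the integer arithmetic of the constructor (illustrative helper)."""
    head_dim = hidden_size // num_heads
    # Same divisibility check as in the constructor.
    if head_dim * num_heads != hidden_size:
        raise ValueError(
            f"hidden_size must be divisible by num_heads (got {hidden_size} and {num_heads})."
        )
    return {
        "head_dim": head_dim,
        # How many query heads share one key/value head.
        "num_key_value_groups": num_heads // num_key_value_heads,
        # Output widths of q_proj and of k_proj / v_proj.
        "q_proj_out": num_heads * head_dim,
        "kv_proj_out": num_key_value_heads * head_dim,
    }


# hidden_size and num_heads follow the documented config defaults;
# num_key_value_heads=2 is an assumed value.
print(attention_shapes(3072, 24, 2))
```

With fewer key/value heads than query heads, `k_proj` and `v_proj` are narrower than `q_proj`, which is where grouped-query attention saves parameters and cache memory.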

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2Attention.forward(hidden_states, attention_mask=None, position_ids=None, past_key_value=None, output_attentions=False, use_cache=False, **kwargs)

Runs the forward pass of the Starcoder2Attention layer.

PARAMETER DESCRIPTION
self

The instance of the Starcoder2Attention layer.

TYPE: Starcoder2Attention

hidden_states

The input hidden states tensor with shape (batch_size, sequence_length, hidden_size).

TYPE: Tensor

attention_mask

An optional additive mask with shape (batch_size, 1, sequence_length, key_value_sequence_length). It is added to the attention scores before the softmax.

TYPE: Optional[Tensor] DEFAULT: None

position_ids

An optional tensor to specify the position ids with shape (batch_size, sequence_length).

TYPE: Optional[Tensor] DEFAULT: None

past_key_value

An optional cache for previous key and value states.

TYPE: Optional[Cache] DEFAULT: None

output_attentions

A flag to indicate whether to return the attention weights.

TYPE: bool DEFAULT: False

use_cache

A flag to indicate whether to use the cache for key and value states.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
Tuple[Tensor, Optional[Tensor], Optional[Tuple[Tensor]]]

A tuple containing the attention output tensor with shape (batch_size, sequence_length, hidden_size), the attention weights (None unless output_attentions is True), and the past_key_value object that was passed in (None if no cache was given).

RAISES DESCRIPTION
ValueError

If the attention weights or attention mask shape does not match the expected shape.

ValueError

If the output size of attn_output does not match the expected shape.

ValueError

If past_key_value is provided but the layer was created without a layer_idx, which auto-regressive decoding with key/value caching requires.
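Because the mask is added to the attention scores and must have shape (batch_size, 1, sequence_length, key_value_sequence_length), a causal mask can be built as in the sketch below. This uses NumPy rather than MindSpore so it runs standalone; the helper name `additive_causal_mask` is hypothetical and is not a mindnlp function.

```python
import numpy as np


def additive_causal_mask(bsz, q_len, kv_seq_len, dtype=np.float32):
    """Return a (bsz, 1, q_len, kv_seq_len) mask: 0 where attention is allowed,
    the most negative finite value where it is not."""
    # With a cache, the queries are the last q_len positions of the full sequence.
    past_len = kv_seq_len - q_len
    q_pos = np.arange(q_len)[:, None] + past_len
    k_pos = np.arange(kv_seq_len)[None, :]
    visible = k_pos <= q_pos
    mask = np.where(visible, 0.0, np.finfo(dtype).min).astype(dtype)
    return np.broadcast_to(mask, (bsz, 1, q_len, kv_seq_len)).copy()


mask = additive_causal_mask(bsz=2, q_len=3, kv_seq_len=5)
print(mask.shape)
```

A mask of any other shape triggers the first ValueError listed above.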

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def forward(
    self,
    hidden_states: mindspore.Tensor,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_value: Optional[Cache] = None,
    output_attentions: bool = False,
    use_cache: bool = False,
    **kwargs,
) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]:
    '''
    This method forwards the Starcoder2Attention layer.

    Args:
        self (Starcoder2Attention): The instance of the Starcoder2Attention layer.
        hidden_states (mindspore.Tensor): The input hidden states tensor with shape
            (batch_size, sequence_length, hidden_size).
        attention_mask (Optional[mindspore.Tensor]): An optional tensor for masking the attention scores with
            shape (batch_size, 1, sequence_length, key_value_sequence_length).
        position_ids (Optional[mindspore.Tensor]): An optional tensor to specify the position ids with shape
            (batch_size, sequence_length).
        past_key_value (Optional[Cache]): An optional cache for previous key and value states.
        output_attentions (bool): A flag to indicate whether to return the attention weights.
        use_cache (bool): A flag to indicate whether to use the cache for key and value states.

    Returns:
        Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: A tuple containing
            the attention output tensor with shape (batch_size, sequence_length, hidden_size),
            attention weights tensor (optional), and past key value tuple (optional).

    Raises:
        ValueError: If the attention weights or attention mask shape does not match the expected shape.
        ValueError: If the output size of `attn_output` does not match the expected shape.
        ValueError: If the cache structure has changed and the layer index is not initialized for
            auto-regressive decoding.
        '''
    if "padding_mask" in kwargs:
        warnings.warn(
            "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`"
        )
    bsz, q_len, _ = hidden_states.shape

    query_states = self.q_proj(hidden_states)
    key_states = self.k_proj(hidden_states)
    value_states = self.v_proj(hidden_states)

    query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).swapaxes(1, 2)
    key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).swapaxes(1, 2)
    value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).swapaxes(1, 2)

    kv_seq_len = key_states.shape[-2]
    if past_key_value is not None:
        if self.layer_idx is None:
            raise ValueError(
                f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
                "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
                "with a layer index."
            )
        kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
    cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

    if past_key_value is not None:
        cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models
        key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

    # repeat k/v heads if n_kv_heads < n_heads
    key_states = repeat_kv(key_states, self.num_key_value_groups)
    value_states = repeat_kv(value_states, self.num_key_value_groups)

    attn_weights = ops.matmul(query_states, key_states.swapaxes(2, 3)) / math.sqrt(self.head_dim)

    if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len):
        raise ValueError(
            f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
            f" {attn_weights.shape}"
        )

    if attention_mask is not None:
        if attention_mask.shape != (bsz, 1, q_len, kv_seq_len):
            raise ValueError(
                f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}"
            )

        attn_weights = attn_weights + attention_mask

    # upcast attention to fp32
    attn_weights = ops.softmax(attn_weights, axis=-1, dtype=mindspore.float32).to(query_states.dtype)
    attn_weights = ops.dropout(attn_weights, p=self.attention_dropout, training=self.training)
    attn_output = ops.matmul(attn_weights, value_states)

    if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim):
        raise ValueError(
            f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
            f" {attn_output.shape}"
        )

    attn_output = attn_output.swapaxes(1, 2)
    attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

    attn_output = self.o_proj(attn_output)
    attn_output = ops.dropout(attn_output, p=self.residual_dropout, training=self.training)

    if not output_attentions:
        attn_weights = None

    return attn_output, attn_weights, past_key_value
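The core computation of the forward pass (repeat the key/value heads, scale the dot products, add the mask, apply softmax in float32, then merge the heads) can be sketched with NumPy. This is an illustrative approximation under stated assumptions, not the mindnlp implementation: it omits the projections, rotary embeddings, dropout and the cache, and the local `repeat_kv` is a stand-in written for this example.

```python
import math

import numpy as np


def repeat_kv(x, n_rep):
    """(b, kv_heads, s, d) -> (b, kv_heads * n_rep, s, d); stand-in helper."""
    return np.repeat(x, n_rep, axis=1)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gqa_attention(q, k, v, mask=None):
    """q: (b, heads, q_len, d); k, v: (b, kv_heads, kv_len, d);
    mask: additive, shape (b, 1, q_len, kv_len)."""
    n_rep = q.shape[1] // k.shape[1]
    k, v = repeat_kv(k, n_rep), repeat_kv(v, n_rep)
    scores = q @ np.swapaxes(k, 2, 3) / math.sqrt(q.shape[-1])
    if mask is not None:
        scores = scores + mask
    # Softmax in float32, then cast back to the query dtype.
    weights = softmax(scores.astype(np.float32), axis=-1).astype(q.dtype)
    out = weights @ v  # (b, heads, q_len, d)
    b, h, q_len, d = out.shape
    # Merge heads: (b, heads, q_len, d) -> (b, q_len, heads * d)
    return np.swapaxes(out, 1, 2).reshape(b, q_len, h * d), weights


rng = np.random.default_rng(0)
q = rng.standard_normal((1, 4, 3, 2)).astype(np.float32)
k = rng.standard_normal((1, 2, 3, 2)).astype(np.float32)
v = rng.standard_normal((1, 2, 3, 2)).astype(np.float32)
out, weights = gqa_attention(q, k, v)
print(out.shape, weights.shape)
```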

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2DecoderLayer

Bases: Module

The Starcoder2DecoderLayer class represents a single layer of the Starcoder2 decoder model. It inherits from nn.Module and implements a pre-norm block: layer normalization followed by self-attention, then layer normalization followed by an MLP, each wrapped in a residual connection.

ATTRIBUTE DESCRIPTION
hidden_size

The size of the hidden state in the layer.

TYPE: int

self_attn

The self-attention mechanism used in the layer.

TYPE: STARCODER2_ATTENTION_CLASSES

mlp

The multi-layer perceptron used in the layer.

TYPE: Starcoder2MLP

input_layernorm

The layer normalization applied to the input.

TYPE: LayerNorm

post_attention_layernorm

The layer normalization applied after the attention mechanism.

TYPE: LayerNorm

METHOD DESCRIPTION
forward

Applies the operations of the decoder layer to the input hidden states and returns the output along with optional values based on the provided arguments.

PARAMETER DESCRIPTION
config

The configuration for the Starcoder2 model.

TYPE: Starcoder2Config

layer_idx

The index of the layer within the model.

TYPE: int

RETURNS DESCRIPTION

Tuple[mindspore.Tensor, Optional[Tuple[mindspore.Tensor, mindspore.Tensor]]]: Returned by forward. The output hidden states, followed by the attention weights when output_attentions is True and by the present key/value states when use_cache is True.

RAISES DESCRIPTION
ValueError

If the input dimensions are incompatible.

Note
  • The attention_mask is passed unchanged to the self-attention layer, which expects an additive mask of shape (batch, 1, seq_len, kv_seq_len).
  • If output_attentions is True, the attention weights of this layer's self-attention are returned.
  • If use_cache is True, the present_key_value states can be used to speed up decoding.
  • The input hidden_states should be of shape (batch, seq_len, embed_dim).
Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
class Starcoder2DecoderLayer(nn.Module):

    """
    The Starcoder2DecoderLayer class represents a single layer of the Starcoder2 decoder model.
    This class inherits from nn.Module and implements the operations required for the decoder layer.

    Attributes:
        hidden_size (int): The size of the hidden state in the layer.
        self_attn (STARCODER2_ATTENTION_CLASSES): The self-attention mechanism used in the layer.
        mlp (Starcoder2MLP): The multi-layer perceptron used in the layer.
        input_layernorm (nn.LayerNorm): The layer normalization applied to the input.
        post_attention_layernorm (nn.LayerNorm): The layer normalization applied after the attention mechanism.

    Methods:
        forward:
            Applies the operations of the decoder layer to the input hidden states and returns the output along with
            optional values based on the provided arguments.

    Args:
        config (Starcoder2Config): The configuration for the Starcoder2 model.
        layer_idx (int): The index of the layer within the model.

    Returns:
        Tuple[mindspore.Tensor, Optional[Tuple[mindspore.Tensor, mindspore.Tensor]]]: The output tensor and optionally,
            attention weights and/or present key value states.

    Raises:
        ValueError: If the input dimensions are incompatible.

    Note:
        - The attention_mask is passed unchanged to the self-attention layer, which expects an additive mask of
          shape (batch, 1, seq_len, kv_seq_len).
        - If output_attentions is True, the attention weights of this layer's self-attention are returned.
        - If use_cache is True, the present_key_value states can be used to speed up decoding.
        - The input hidden_states should be of shape (batch, seq_len, embed_dim).
    """
    def __init__(self, config: Starcoder2Config, layer_idx: int):
        """
        Initializes a new instance of the Starcoder2DecoderLayer class.

        Args:
            self: The object itself.
            config (Starcoder2Config): An instance of the Starcoder2Config class containing the configuration settings
                for the decoder layer.
            layer_idx (int): The index of the decoder layer.

        Returns:
            None

        Raises:
            None
        """
        super().__init__()
        self.hidden_size = config.hidden_size

        self.self_attn = STARCODER2_ATTENTION_CLASSES["eager"](config, layer_idx)

        self.mlp = Starcoder2MLP(config)

        self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.norm_epsilon)
        self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.norm_epsilon)

    # Copied from transformers.models.mistral.modeling_mistral.MistralDecoderLayer.forward
    def forward(
        self,
        hidden_states: mindspore.Tensor,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_value: Optional[Tuple[mindspore.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        **kwargs,
    ) -> Tuple[mindspore.Tensor, Optional[Tuple[mindspore.Tensor, mindspore.Tensor]]]:
        """
        Args:
            hidden_states (`mindspore.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`mindspore.Tensor`, *optional*): additive attention mask of size
                `(batch, 1, seq_len, kv_seq_len)`, passed unchanged to the self-attention layer.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            past_key_value (`Tuple(mindspore.Tensor)`, *optional*): cached past key and value projection states
        """
        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs
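The layer's forward pass follows a pre-norm residual pattern: normalize, apply a sub-layer, add the result back to the input. The sketch below shows that data flow with NumPy. It is a simplification under stated assumptions: `layer_norm` has no learnable scale or bias, and `attn_fn` and `mlp_fn` are hypothetical callables standing in for the attention and MLP modules.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize over the last axis (no learnable parameters in this sketch)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def decoder_block(x, attn_fn, mlp_fn):
    # Self-attention sub-layer with residual connection.
    x = x + attn_fn(layer_norm(x))
    # MLP sub-layer with residual connection.
    x = x + mlp_fn(layer_norm(x))
    return x


x = np.random.default_rng(0).standard_normal((2, 3, 4))
# Stand-in sub-layers: any shape-preserving function works here.
y = decoder_block(x, attn_fn=lambda h: 0.5 * h, mlp_fn=np.tanh)
print(y.shape)
```

Because each sub-layer only adds to its input, a sub-layer that outputs zeros leaves the hidden states unchanged.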

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2DecoderLayer.__init__(config, layer_idx)

Initializes a new instance of the Starcoder2DecoderLayer class.

PARAMETER DESCRIPTION
self

The object itself.

config

An instance of the Starcoder2Config class containing the configuration settings for the decoder layer.

TYPE: Starcoder2Config

layer_idx

The index of the decoder layer.

TYPE: int

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def __init__(self, config: Starcoder2Config, layer_idx: int):
    """
    Initializes a new instance of the Starcoder2DecoderLayer class.

    Args:
        self: The object itself.
        config (Starcoder2Config): An instance of the Starcoder2Config class containing the configuration settings
            for the decoder layer.
        layer_idx (int): The index of the decoder layer.

    Returns:
        None

    Raises:
        None
    """
    super().__init__()
    self.hidden_size = config.hidden_size

    self.self_attn = STARCODER2_ATTENTION_CLASSES["eager"](config, layer_idx)

    self.mlp = Starcoder2MLP(config)

    self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.norm_epsilon)
    self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.norm_epsilon)

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2DecoderLayer.forward(hidden_states, attention_mask=None, position_ids=None, past_key_value=None, output_attentions=False, use_cache=False, **kwargs)

PARAMETER DESCRIPTION
hidden_states

Input to the layer, of shape (batch, seq_len, embed_dim).

TYPE: `mindspore.Tensor`

attention_mask

Additive attention mask of size (batch, 1, seq_len, kv_seq_len). It is passed unchanged to the self-attention layer, which checks this shape and adds the mask to the attention scores.

TYPE: `mindspore.Tensor`, *optional* DEFAULT: None

output_attentions

Whether to return the attention weights of this layer's self-attention. When True, they are appended to the returned tuple.

TYPE: `bool`, *optional* DEFAULT: False

use_cache

If set to True, the present key/value states are appended to the returned tuple and can be used to speed up decoding.

TYPE: `bool`, *optional* DEFAULT: False

past_key_value

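Cached past key and value projection states. The self-attention layer calls get_usable_length and update on this object, so a Cache instance is expected despite the tuple annotation.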

TYPE: `Tuple(mindspore.Tensor)`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def forward(
    self,
    hidden_states: mindspore.Tensor,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_value: Optional[Tuple[mindspore.Tensor]] = None,
    output_attentions: Optional[bool] = False,
    use_cache: Optional[bool] = False,
    **kwargs,
) -> Tuple[mindspore.Tensor, Optional[Tuple[mindspore.Tensor, mindspore.Tensor]]]:
    """
    Args:
        hidden_states (`mindspore.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
        attention_mask (`mindspore.Tensor`, *optional*): additive attention mask of size
            `(batch, 1, seq_len, kv_seq_len)`, passed unchanged to the self-attention layer.
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under
            returned tensors for more detail.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
            (see `past_key_values`).
        past_key_value (`Tuple(mindspore.Tensor)`, *optional*): cached past key and value projection states
    """
    residual = hidden_states

    hidden_states = self.input_layernorm(hidden_states)

    # Self Attention
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
        hidden_states=hidden_states,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_value=past_key_value,
        output_attentions=output_attentions,
        use_cache=use_cache,
    )
    hidden_states = residual + hidden_states

    # Fully Connected
    residual = hidden_states
    hidden_states = self.post_attention_layernorm(hidden_states)
    hidden_states = self.mlp(hidden_states)
    hidden_states = residual + hidden_states

    outputs = (hidden_states,)

    if output_attentions:
        outputs += (self_attn_weights,)

    if use_cache:
        outputs += (present_key_value,)

    return outputs
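Since the returned tuple grows depending on output_attentions and use_cache, the position of each optional element depends on the flags. A caller could unpack it as sketched below; the helper name `unpack_layer_outputs` is hypothetical and not part of mindnlp.

```python
def unpack_layer_outputs(outputs, output_attentions=False, use_cache=False):
    """Split the layer's output tuple into (hidden_states, attn_weights, present_key_value)."""
    hidden_states = outputs[0]
    idx = 1
    attn_weights = None
    if output_attentions:
        # Attention weights come right after the hidden states when requested.
        attn_weights = outputs[idx]
        idx += 1
    # The cache entry is always last when use_cache is set.
    present_key_value = outputs[idx] if use_cache else None
    return hidden_states, attn_weights, present_key_value


# Strings stand in for tensors and cache objects.
print(unpack_layer_outputs(("h", "kv"), use_cache=True))
```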

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForCausalLM

Bases: Starcoder2PreTrainedModel

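The Starcoder2ForCausalLM class is a Starcoder2 model for causal language modeling: a Starcoder2Model followed by a bias-free linear language modeling head that projects from hidden_size to vocab_size. It inherits from Starcoder2PreTrainedModel.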

This class provides methods to initialize the model, get and set input embeddings, get and set output embeddings, set the decoder, get the decoder, forward the model with various optional inputs, prepare inputs for generation, and reorder the cache.
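When labels are passed to forward, the loss is computed on shifted positions: the logits at position i are scored against the label at position i + 1. The following NumPy sketch reproduces that shift-and-flatten step outside MindSpore. The function name `shifted_cross_entropy` and the explicit `ignore_index=-100` handling are assumptions written for this example, based on the documented behaviour that labels equal to -100 are ignored.

```python
import numpy as np


def shifted_cross_entropy(logits, labels, ignore_index=-100):
    """logits: (batch, seq_len, vocab); labels: (batch, seq_len). Returns the mean loss."""
    vocab = logits.shape[-1]
    # Shift so that tokens < n predict token n, then flatten.
    shift_logits = logits[..., :-1, :].reshape(-1, vocab)
    shift_labels = labels[..., 1:].reshape(-1)
    # Drop ignored positions (assumed to mirror the -100 convention).
    keep = shift_labels != ignore_index
    shift_logits, shift_labels = shift_logits[keep], shift_labels[keep]
    # Numerically stable log-softmax.
    shift_logits = shift_logits - shift_logits.max(axis=-1, keepdims=True)
    log_probs = shift_logits - np.log(np.exp(shift_logits).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(shift_labels)), shift_labels]
    return float(-picked.mean())


logits = np.zeros((1, 3, 4))      # uniform predictions over a vocabulary of 4
labels = np.array([[0, 1, 2]])
print(shifted_cross_entropy(logits, labels))
```

For a sequence of length n, only n - 1 positions contribute to the loss, because the last logit has no next token to predict.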

Example
>>> from mindnlp.transformers import AutoTokenizer, Starcoder2ForCausalLM
...
>>> model = Starcoder2ForCausalLM.from_pretrained("bigcode/starcoder2-7b_16k")
>>> tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-7b_16k")
...
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="ms")
...
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
class Starcoder2ForCausalLM(Starcoder2PreTrainedModel):
    r"""
    The Starcoder2ForCausalLM class represents a Starcoder2 model for causal language modeling.
    It inherits from the Starcoder2PreTrainedModel.

    This class provides methods to initialize the model, get and set input embeddings, get and set output embeddings,
    set the decoder, get the decoder, forward the model with various optional inputs, prepare inputs for generation,
    and reorder the cache.

    Example:
        ```python
        >>> from mindnlp.transformers import AutoTokenizer, Starcoder2ForCausalLM
        ...
        >>> model = Starcoder2ForCausalLM.from_pretrained("bigcode/starcoder2-7b_16k")
        >>> tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-7b_16k")
        ...
        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="ms")
        ...
        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```
    """
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config):
        """
        This method initializes an instance of the Starcoder2ForCausalLM class.

        Args:
            self: The instance of the class.
            config (Starcoder2Config): The model configuration. It is used to build the Starcoder2Model,
                set the vocabulary size, and size the lm_head.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)
        self.model = Starcoder2Model(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        Returns the input embeddings from the model.

        Args:
            self: The instance of the Starcoder2ForCausalLM class.

        Returns:
            The token embedding layer of the underlying model (`self.model.embed_tokens`).

        Raises:
            None
        """
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        """
        Sets the input embeddings for the Starcoder2ForCausalLM model.

        Args:
            self (Starcoder2ForCausalLM): The instance of the Starcoder2ForCausalLM class.
            value: The input embeddings to be set for the model. It should be compatible with the model's
                embed_tokens attribute.

        Returns:
            None.

        Raises:
            None.
        """
        self.model.embed_tokens = value

    def get_output_embeddings(self):
        """
        Method to retrieve the output embeddings from the Starcoder2ForCausalLM class.

        Args:
            self (Starcoder2ForCausalLM): The current instance of the class.

        Returns:
            The `lm_head` layer, which serves as the output embeddings.

        Raises:
            None
        """
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        """
        Set the output embeddings for Starcoder2ForCausalLM.

        This method allows the user to set the output embeddings for the Starcoder2ForCausalLM model.
        The output embeddings are responsible for generating the predicted output sequence.

        Args:
            self (Starcoder2ForCausalLM): The current instance of the Starcoder2ForCausalLM class.
            new_embeddings (Any): The new embeddings to set as the output embeddings.
                This can be of any type, as long as it is compatible with the model architecture.

        Returns:
            None.

        Raises:
            None.
        """
        self.lm_head = new_embeddings

    def set_decoder(self, decoder):
        """
        Sets the decoder for the Starcoder2ForCausalLM class.

        Args:
            self (Starcoder2ForCausalLM): The instance of the Starcoder2ForCausalLM class.
            decoder: The decoder object to be set for the model.

        Returns:
            None.

        Raises:
            None.
        """
        self.model = decoder

    def get_decoder(self):
        """
        This method returns the decoder model associated with the Starcoder2ForCausalLM instance.

        Args:
            self: The instance of the Starcoder2ForCausalLM class.

        Returns:
            model: This method returns the decoder model associated with the instance.

        Raises:
            None.
        """
        return self.model

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are
                ignored (masked); the loss is only computed for the tokens with labels in
                `[0, ..., config.vocab_size - 1]`.

        Returns:
            Union[Tuple, CausalLMOutputWithPast]

        Example:
            ```python
            >>> from mindnlp.transformers import AutoTokenizer, Starcoder2ForCausalLM
            ...
            >>> model = Starcoder2ForCausalLM.from_pretrained("bigcode/starcoder2-7b_16k")
            >>> tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-7b_16k")
            ...
            >>> prompt = "Hey, are you conscious? Can you talk to me?"
            >>> inputs = tokenizer(prompt, return_tensors="ms")
            ...
            >>> # Generate
            >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
            >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
            "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
            ```
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        hidden_states = outputs[0]
        logits = self.lm_head(hidden_states)
        logits = logits.float()

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :]
            shift_labels = labels[..., 1:]
            # Flatten the tokens
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            # Compute the cross-entropy loss over the flattened tokens
            loss = ops.cross_entropy(shift_logits, shift_labels)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def prepare_inputs_for_generation(
        self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
    ):
        """
        Prepare inputs for generation.

        Args:
            self: An instance of the Starcoder2ForCausalLM class.
            input_ids (mindspore.Tensor): Tensor of shape (batch_size, sequence_length) containing the input IDs.
            past_key_values (Cache, tuple, or None): Cache object or tuple of tensors containing the past key values.
                If a Cache object is provided, cache_length, past_length, and max_cache_length are read from it.
                If a tuple is provided, cache_length and past_length are taken from the shape of its first element
                and max_cache_length is None. If None, input_ids are passed through unchanged.
            attention_mask (mindspore.Tensor or None): Tensor of shape (batch_size, sequence_length)
                containing the attention mask. When past_key_values is given and attention_mask.shape[1] is
                greater than input_ids.shape[1], only the last attention_mask.shape[1] - past_length input_ids
                are kept. Otherwise, if past_length is less than input_ids.shape[1], the first past_length
                input_ids are dropped. If max_cache_length is not None and cache_length + input_ids.shape[1]
                exceeds it, the attention_mask is cropped to its last max_cache_length columns.
            inputs_embeds (mindspore.Tensor or None): Tensor of shape (batch_size, sequence_length, embedding_size)
                containing the input embeddings. Used in place of 'input_ids' only when past_key_values is None,
                that is, in the first generation step.
            **kwargs: Additional keyword arguments.

        Returns:
            dict:
                A dictionary containing the model inputs. The dictionary includes the following keys:

                - 'input_ids': Tensor of shape (batch_size, sequence_length) containing the input IDs.
                  Replaced by 'inputs_embeds' when inputs_embeds is given and past_key_values is None.
                - 'position_ids': Tensor of shape (batch_size, sequence_length) containing the position IDs.
                  If attention_mask is not None and position_ids is not passed in kwargs, the position_ids are
                  computed from the attention_mask.
                - 'past_key_values': Cache object or tuple of tensors containing the past key values.
                - 'use_cache': Boolean indicating whether to use cache or not.
                - 'attention_mask': Tensor of shape (batch_size, sequence_length) containing the attention mask.

        Raises:
            None.
        """
        # Omit tokens covered by past_key_values
        if past_key_values is not None:
            if isinstance(past_key_values, Cache):
                cache_length = past_key_values.get_seq_length()
                past_length = past_key_values.seen_tokens
                max_cache_length = past_key_values.get_max_length()
            else:
                cache_length = past_length = past_key_values[0][0].shape[2]
                max_cache_length = None

            # Keep only the unprocessed tokens:
            # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
            # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
            # input)
            if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
                input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
            # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
            # input_ids based on the past_length.
            elif past_length < input_ids.shape[1]:
                input_ids = input_ids[:, past_length:]
            # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.

            # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
            if (
                max_cache_length is not None
                and attention_mask is not None
                and cache_length + input_ids.shape[1] > max_cache_length
            ):
                attention_mask = attention_mask[:, -max_cache_length:]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids = position_ids.masked_fill(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -input_ids.shape[1] :]

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "position_ids": position_ids,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
            }
        )
        return model_inputs

    @staticmethod
    def _reorder_cache(past_key_values, beam_idx):
        """
        Reorders the cache of past key values based on the given beam index.

        Args:
            past_key_values (tuple): A tuple containing the cache of past key values.
                Each element in the tuple represents the past key values for a specific layer.
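            beam_idx (mindspore.Tensor): A tensor containing the beam indices for reordering the cache.

        Returns:
            tuple: The reordered cache, with the same per-layer structure as past_key_values and every
                tensor re-indexed along axis 0 by beam_idx.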

        Raises:
            None.
        """
        reordered_past = ()
        for layer_past in past_key_values:
            reordered_past += (
                tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),
            )
        return reordered_past
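
The reordering performed by `_reorder_cache` can be illustrated without any tensor library. The sketch below is not the library's implementation: plain Python lists stand in for tensors, the first list level plays the role of the batch (beam) axis, and `reorder_cache` is a hypothetical stand-alone helper that mirrors `index_select(0, beam_idx)`.

```python
def reorder_cache(past_key_values, beam_idx):
    """Reorder every cached entry along the batch (beam) axis.

    Lists stand in for tensors here; picking rows by index mirrors
    index_select(0, beam_idx) on the real cache tensors.
    """
    return tuple(
        tuple([past_state[i] for i in beam_idx] for past_state in layer_past)
        for layer_past in past_key_values
    )


# Two layers, each with a (key, value) pair holding one row per beam.
past = (
    (["k0_beam0", "k0_beam1", "k0_beam2"], ["v0_beam0", "v0_beam1", "v0_beam2"]),
    (["k1_beam0", "k1_beam1", "k1_beam2"], ["v1_beam0", "v1_beam1", "v1_beam2"]),
)

# Beam 2 survives twice, beam 0 once, and beam 1 is dropped.
reordered = reorder_cache(past, beam_idx=[2, 2, 0])
print(reordered[0][0])  # ['k0_beam2', 'k0_beam2', 'k0_beam0']
```

The result keeps the layer and (key, value) nesting of the input, which is why the method returns a new tuple rather than None.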

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForCausalLM.__init__(config)

This method initializes an instance of the Starcoder2ForCausalLM class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

A Starcoder2Config instance holding the configuration parameters for the model. It is used to initialize the Starcoder2Model, determine the vocabulary size, and configure the lm_head.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def __init__(self, config):
    """
    This method initializes an instance of the Starcoder2ForCausalLM class.

    Args:
        self: The instance of the class.
        config: A Starcoder2Config instance holding the configuration parameters for the model.
            It is used to initialize the Starcoder2Model, determine the vocabulary size, and configure the lm_head.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)
    self.model = Starcoder2Model(config)
    self.vocab_size = config.vocab_size
    self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForCausalLM.forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the causal (next-token) language modeling loss. Indices should either be in [0, ..., config.vocab_size - 1] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for tokens with labels in [0, ..., config.vocab_size - 1].

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, CausalLMOutputWithPast]

Union[Tuple, CausalLMOutputWithPast]

Example
>>> from mindnlp.transformers import AutoTokenizer, Starcoder2ForCausalLM
...
>>> model = Starcoder2ForCausalLM.from_pretrained("bigcode/starcoder2-7b_16k")
>>> tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-7b_16k")
...
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="ms")
...
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the causal (next-token) language modeling loss. Indices should either be in
            `[0, ..., config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to
            `-100` are ignored (masked); the loss is only computed for tokens with labels in
            `[0, ..., config.vocab_size - 1]`.

    Returns:
        Union[Tuple, CausalLMOutputWithPast]

    Example:
        ```python
        >>> from mindnlp.transformers import AutoTokenizer, Starcoder2ForCausalLM
        ...
        >>> model = Starcoder2ForCausalLM.from_pretrained("bigcode/starcoder2-7b_16k")
        >>> tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-7b_16k")
        ...
        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="ms")
        ...
        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    hidden_states = outputs[0]
    logits = self.lm_head(hidden_states)
    logits = logits.float()

    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        shift_logits = logits[..., :-1, :]
        shift_labels = labels[..., 1:]
        # Flatten the tokens
        shift_logits = shift_logits.view(-1, self.config.vocab_size)
        shift_labels = shift_labels.view(-1)
        # Compute the cross-entropy loss over the flattened tokens
        loss = ops.cross_entropy(shift_logits, shift_labels)

    if not return_dict:
        output = (logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return CausalLMOutputWithPast(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
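
The label shift in `forward` can be traced by hand on a tiny input. The following sketch is a plain-Python stand-in, not the MindSpore code path: nested lists replace tensors, `shifted_cross_entropy` is a hypothetical helper, and it assumes a mean reduction with labels equal to -100 skipped, as the `labels` description states.

```python
import math

IGNORE_INDEX = -100  # label value skipped by the loss, per the `labels` description


def shifted_cross_entropy(logits, labels):
    """Mean next-token cross-entropy for one sequence.

    logits: list of per-position score lists, shape (seq_len, vocab_size)
    labels: list of token ids, shape (seq_len,)
    """
    # Position t predicts token t + 1: drop the last logit row and the first label.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]

    total, count = 0.0, 0
    for scores, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else float("nan")


# Uniform scores over a 4-token vocabulary give a loss of log(4) per counted position.
logits = [[0.0, 0.0, 0.0, 0.0]] * 3
labels = [1, 2, IGNORE_INDEX]
loss = shifted_cross_entropy(logits, labels)
```

After the shift, only two (logit, label) pairs remain, and the second is ignored, so `loss` reduces to the single-position value `log(4)`.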

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForCausalLM.get_decoder()

This method returns the decoder model associated with the Starcoder2ForCausalLM instance.

PARAMETER DESCRIPTION
self

The instance of the Starcoder2ForCausalLM class.

RETURNS DESCRIPTION
model

This method returns the decoder model associated with the instance.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def get_decoder(self):
    """
    This method returns the decoder model associated with the Starcoder2ForCausalLM instance.

    Args:
        self: The instance of the Starcoder2ForCausalLM class.

    Returns:
        model: This method returns the decoder model associated with the instance.

    Raises:
        None.
    """
    return self.model

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForCausalLM.get_input_embeddings()

Returns the input embeddings from the model.

PARAMETER DESCRIPTION
self

The instance of Starcoder2ForCausalLM class. This parameter is required to access the model's embedded tokens.

RETURNS DESCRIPTION
embed_tokens

The token embedding layer of the underlying Starcoder2Model (self.model.embed_tokens).

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def get_input_embeddings(self):
    """
    Returns the input embeddings from the model.

    Args:
        self: The instance of Starcoder2ForCausalLM class.
            This parameter is required to access the model's embedded tokens.

    Returns:
        embed_tokens: The token embedding layer of the underlying Starcoder2Model (self.model.embed_tokens).

    Raises:
        None
    """
    return self.model.embed_tokens

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForCausalLM.get_output_embeddings()

Method to retrieve the output embeddings from the Starcoder2ForCausalLM class.

PARAMETER DESCRIPTION
self

Instance of the Starcoder2ForCausalLM class.

  • Type: Starcoder2ForCausalLM
  • Purpose: Represents the current instance of the class.
  • Restrictions: None

RETURNS DESCRIPTION
lm_head

The output embeddings.

  • Type: nn.Linear
  • Purpose: The method returns the output embeddings.
Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def get_output_embeddings(self):
    """
    Method to retrieve the output embeddings from the Starcoder2ForCausalLM class.

    Args:
        self:
            Instance of the Starcoder2ForCausalLM class.

            - Type: Starcoder2ForCausalLM
            - Purpose: Represents the current instance of the class.
            - Restrictions: None

    Returns:
        lm_head:
            The output embeddings.

            - Type: nn.Linear
            - Purpose: The method returns the output embeddings.

    Raises:
        None
    """
    return self.lm_head

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForCausalLM.prepare_inputs_for_generation(input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs)

Prepare inputs for generation.

PARAMETER DESCRIPTION
self

An instance of the Starcoder2ForCausalLM class.

input_ids

Tensor of shape (batch_size, sequence_length) containing the input IDs.

TYPE: Tensor

past_key_values

Cache object or tuple of tensors containing the past key values. If a Cache object is provided, cache_length, past_length, and max_cache_length are read from it. If a tuple is provided, cache_length and past_length are taken from the shape of its first element and max_cache_length is None. If None, input_ids are passed through unchanged.

TYPE: Cache, tuple, or None DEFAULT: None

attention_mask

Tensor of shape (batch_size, sequence_length) containing the attention mask. When past_key_values is given and attention_mask.shape[1] is greater than input_ids.shape[1], only the last attention_mask.shape[1] - past_length input_ids are kept. Otherwise, if past_length is less than input_ids.shape[1], the first past_length input_ids are dropped. If max_cache_length is not None and cache_length + input_ids.shape[1] exceeds it, the attention_mask is cropped to its last max_cache_length columns.

TYPE: Tensor or None DEFAULT: None

inputs_embeds

Tensor of shape (batch_size, sequence_length, embedding_size) containing the input embeddings. Used in place of 'input_ids' only when past_key_values is None, that is, in the first generation step.

TYPE: Tensor or None DEFAULT: None

**kwargs

Additional keyword arguments.

DEFAULT: {}

RETURNS DESCRIPTION
dict

A dictionary containing the model inputs. The dictionary includes the following keys:

  • 'input_ids': Tensor of shape (batch_size, sequence_length) containing the input IDs. Replaced by 'inputs_embeds' when inputs_embeds is given and past_key_values is None.
  • 'position_ids': Tensor of shape (batch_size, sequence_length) containing the position IDs. If attention_mask is not None and position_ids is not passed in kwargs, the position_ids are computed from the attention_mask.
  • 'past_key_values': Cache object or tuple of tensors containing the past key values.
  • 'use_cache': Boolean indicating whether to use cache or not.
  • 'attention_mask': Tensor of shape (batch_size, sequence_length) containing the attention mask.
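
The three input_ids trimming cases described for attention_mask can be sketched for a single sequence as follows. This is an illustration under stated assumptions, not the library code: a Python list replaces the tensor, and `trim_input_ids` with its `attention_mask_len` argument is a hypothetical helper introduced only for this example.

```python
def trim_input_ids(input_ids, past_length, attention_mask_len=None):
    """Return only the tokens the cache has not processed yet (one sequence)."""
    if attention_mask_len is not None and attention_mask_len > len(input_ids):
        # Case 1: part of the input lives only in the cache.
        keep = attention_mask_len - past_length
        return input_ids[-keep:]
    if past_length < len(input_ids):
        # Case 2: input_ids holds the full sequence; drop the cached prefix.
        return input_ids[past_length:]
    # Case 3: input_ids already holds only unprocessed tokens.
    return input_ids


full_sequence = [11, 12, 13, 14]
case_2 = trim_input_ids(full_sequence, past_length=3, attention_mask_len=4)
case_1 = trim_input_ids([14], past_length=3, attention_mask_len=4)
case_3 = trim_input_ids([15], past_length=4)
print(case_2, case_1, case_3)  # [14] [14] [15]
```

In each case the model only receives the tokens that are not yet represented in the cache.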
Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def prepare_inputs_for_generation(
    self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
):
    """
    Prepare inputs for generation.

    Args:
        self: An instance of the Starcoder2ForCausalLM class.
        input_ids (mindspore.Tensor): Tensor of shape (batch_size, sequence_length) containing the input IDs.
        past_key_values (Cache, tuple, or None): Cache object or tuple of tensors containing the past key values.
            If a Cache object is provided, cache_length, past_length, and max_cache_length are read from it.
            If a tuple is provided, cache_length and past_length are taken from the shape of its first element
            and max_cache_length is None. If None, input_ids are passed through unchanged.
        attention_mask (mindspore.Tensor or None): Tensor of shape (batch_size, sequence_length)
            containing the attention mask. When past_key_values is given and attention_mask.shape[1] is
            greater than input_ids.shape[1], only the last attention_mask.shape[1] - past_length input_ids
            are kept. Otherwise, if past_length is less than input_ids.shape[1], the first past_length
            input_ids are dropped. If max_cache_length is not None and cache_length + input_ids.shape[1]
            exceeds it, the attention_mask is cropped to its last max_cache_length columns.
        inputs_embeds (mindspore.Tensor or None): Tensor of shape (batch_size, sequence_length, embedding_size)
            containing the input embeddings. Used in place of 'input_ids' only when past_key_values is None,
            that is, in the first generation step.
        **kwargs: Additional keyword arguments.

    Returns:
        dict:
            A dictionary containing the model inputs. The dictionary includes the following keys:

            - 'input_ids': Tensor of shape (batch_size, sequence_length) containing the input IDs.
              Replaced by 'inputs_embeds' when inputs_embeds is given and past_key_values is None.
            - 'position_ids': Tensor of shape (batch_size, sequence_length) containing the position IDs.
              If attention_mask is not None and position_ids is not passed in kwargs, the position_ids are
              computed from the attention_mask.
            - 'past_key_values': Cache object or tuple of tensors containing the past key values.
            - 'use_cache': Boolean indicating whether to use cache or not.
            - 'attention_mask': Tensor of shape (batch_size, sequence_length) containing the attention mask.

    Raises:
        None.
    """
    # Omit tokens covered by past_key_values
    if past_key_values is not None:
        if isinstance(past_key_values, Cache):
            cache_length = past_key_values.get_seq_length()
            past_length = past_key_values.seen_tokens
            max_cache_length = past_key_values.get_max_length()
        else:
            cache_length = past_length = past_key_values[0][0].shape[2]
            max_cache_length = None

        # Keep only the unprocessed tokens:
        # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
        # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
        # input)
        if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
            input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
        # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
        # input_ids based on the past_length.
        elif past_length < input_ids.shape[1]:
            input_ids = input_ids[:, past_length:]
        # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.

        # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
        if (
            max_cache_length is not None
            and attention_mask is not None
            and cache_length + input_ids.shape[1] > max_cache_length
        ):
            attention_mask = attention_mask[:, -max_cache_length:]

    position_ids = kwargs.get("position_ids", None)
    if attention_mask is not None and position_ids is None:
        # create position_ids on the fly for batch generation
        position_ids = attention_mask.long().cumsum(-1) - 1
        position_ids = position_ids.masked_fill(attention_mask == 0, 1)
        if past_key_values:
            position_ids = position_ids[:, -input_ids.shape[1] :]

    # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
    if inputs_embeds is not None and past_key_values is None:
        model_inputs = {"inputs_embeds": inputs_embeds}
    else:
        model_inputs = {"input_ids": input_ids}

    model_inputs.update(
        {
            "position_ids": position_ids,
            "past_key_values": past_key_values,
            "use_cache": kwargs.get("use_cache"),
            "attention_mask": attention_mask,
        }
    )
    return model_inputs
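
The on-the-fly position_ids computation (cumulative sum of the mask minus one, then filling padded slots with 1) is easy to check on a left-padded example. The sketch below uses a Python list for one sequence instead of a tensor; `position_ids_from_mask` and `num_new_tokens` are hypothetical names chosen for this illustration.

```python
from itertools import accumulate


def position_ids_from_mask(attention_mask, num_new_tokens=None):
    """Build position ids for one sequence from its attention mask."""
    # Running count of real tokens, minus one, gives zero-based positions.
    positions = [total - 1 for total in accumulate(attention_mask)]
    # Padding slots get a dummy position of 1, matching masked_fill(mask == 0, 1).
    positions = [1 if m == 0 else p for p, m in zip(positions, attention_mask)]
    if num_new_tokens is not None:
        # With a cache, only the positions of the unprocessed tokens are kept.
        positions = positions[-num_new_tokens:]
    return positions


mask = [0, 0, 1, 1, 1]
print(position_ids_from_mask(mask))     # [1, 1, 0, 1, 2]
print(position_ids_from_mask(mask, 1))  # [2]
```

The first real token gets position 0 regardless of how much left padding precedes it, which keeps positions consistent across sequences of different lengths in a batch.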

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForCausalLM.set_decoder(decoder)

Sets the decoder for the Starcoder2ForCausalLM class.

PARAMETER DESCRIPTION
self

The instance of the Starcoder2ForCausalLM class.

TYPE: Starcoder2ForCausalLM

decoder

The decoder object to be set for the model.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def set_decoder(self, decoder):
    """
    Sets the decoder for the Starcoder2ForCausalLM class.

    Args:
        self (Starcoder2ForCausalLM): The instance of the Starcoder2ForCausalLM class.
        decoder: The decoder object to be set for the model.

    Returns:
        None.

    Raises:
        None.
    """
    self.model = decoder

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForCausalLM.set_input_embeddings(value)

Sets the input embeddings for the Starcoder2ForCausalLM model.

PARAMETER DESCRIPTION
self

The instance of the Starcoder2ForCausalLM class.

TYPE: Starcoder2ForCausalLM

value

The input embeddings to be set for the model. It should be compatible with the model's embed_tokens attribute.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def set_input_embeddings(self, value):
    """
    Sets the input embeddings for the Starcoder2ForCausalLM model.

    Args:
        self (Starcoder2ForCausalLM): The instance of the Starcoder2ForCausalLM class.
        value: The input embeddings to be set for the model. It should be compatible with the model's
            embed_tokens attribute.

    Returns:
        None.

    Raises:
        None.
    """
    self.model.embed_tokens = value

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForCausalLM.set_output_embeddings(new_embeddings)

Set the output embeddings for Starcoder2ForCausalLM.

This method replaces the output embeddings (lm_head) of the Starcoder2ForCausalLM model. The output embeddings project the final hidden states to vocabulary logits.

PARAMETER DESCRIPTION
self

The current instance of the Starcoder2ForCausalLM class.

TYPE: Starcoder2ForCausalLM

new_embeddings

The new embeddings to set as the output embeddings. This can be of any type, as long as it is compatible with the model architecture.

TYPE: Any

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def set_output_embeddings(self, new_embeddings):
    """
    Set the output embeddings for Starcoder2ForCausalLM.

    This method replaces the output embeddings (lm_head) of the Starcoder2ForCausalLM model.
    The output embeddings project the final hidden states to vocabulary logits.

    Args:
        self (Starcoder2ForCausalLM): The current instance of the Starcoder2ForCausalLM class.
        new_embeddings (Any): The new embeddings to set as the output embeddings.
            This can be of any type, as long as it is compatible with the model architecture.

    Returns:
        None.

    Raises:
        None.
    """
    self.lm_head = new_embeddings

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForSequenceClassification

Bases: Starcoder2PreTrainedModel

This class implements a sequence classification model based on the Starcoder2 architecture. It inherits from Starcoder2PreTrainedModel and adds a linear classification head (score) on top of Starcoder2Model. The class provides methods for initializing the model, getting and setting the input embeddings, and running the forward pass. The 'forward' method accepts input_ids, attention_mask, position_ids, and related arguments and returns the sequence classifier output. When labels are given, it computes a loss for regression, single-label classification, or multi-label classification, depending on the problem type. It handles different problem types and batch sizes.
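
How one of the three problem types might be selected can be sketched with a small decision rule. The rule below is an assumption modeled on the convention common to transformers-style classification heads (a single label means regression, integer labels mean single-label classification, anything else multi-label); `infer_problem_type` and `labels_are_integer` are hypothetical names, and the selection logic actually used by this class may differ.

```python
def infer_problem_type(num_labels, labels_are_integer):
    """Pick a loss family from the label count and label dtype (hypothetical helper)."""
    if num_labels == 1:
        return "regression"  # mean-square loss on a single output
    if labels_are_integer:
        return "single_label_classification"  # cross-entropy over num_labels classes
    return "multi_label_classification"  # independent binary loss per label


print(infer_problem_type(1, labels_are_integer=False))  # regression
print(infer_problem_type(3, labels_are_integer=True))   # single_label_classification
print(infer_problem_type(3, labels_are_integer=False))  # multi_label_classification
```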

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
class Starcoder2ForSequenceClassification(Starcoder2PreTrainedModel):
    """
    This class implements a sequence classification model based on the Starcoder2 architecture.
    It inherits from Starcoder2PreTrainedModel and adds a linear classification head (score) on top of
    Starcoder2Model.
    The class provides methods for initializing the model, getting and setting the input embeddings, and
    running the forward pass.
    The 'forward' method accepts input_ids, attention_mask, position_ids, and related arguments
    and returns the sequence classifier output.
    When labels are given, it computes a loss for regression, single-label classification,
    or multi-label classification, depending on the problem type.
    It handles different problem types and batch sizes.
    """
    def __init__(self, config):
        """
        Initializes a new instance of the Starcoder2ForSequenceClassification class.

        Args:
            self (Starcoder2ForSequenceClassification): The object instance.
            config: An instance of the Starcoder2Config class containing the configuration parameters for the model.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = Starcoder2Model(config)
        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        Retrieves the input embeddings from the 'Starcoder2ForSequenceClassification' model.

        Args:
            self: An instance of the 'Starcoder2ForSequenceClassification' class.

        Returns:
            embed_tokens: The token embedding layer of the underlying Starcoder2Model (self.model.embed_tokens).

        Raises:
            None.

        """
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        """
        Sets the input embeddings of the Starcoder2ForSequenceClassification model.

        Args:
            self (Starcoder2ForSequenceClassification): The instance of the Starcoder2ForSequenceClassification class.
            value: The input embeddings to be set for the model.
                It should be compatible with the model's embedding layer.

        Returns:
            None.

        Raises:
            None.
        """
        self.model.embed_tokens = value

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
                Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
                config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (mean-squared error); if
                `config.num_labels > 1`, a classification loss is computed (cross-entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
                sequence_lengths = ops.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
                sequence_lengths = sequence_lengths % input_ids.shape[-1]
            else:
                sequence_lengths = -1

        pooled_logits = logits[ops.arange(batch_size), sequence_lengths]

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                if self.num_labels == 1:
                    loss = ops.mse_loss(pooled_logits.squeeze(), labels.squeeze())
                else:
                    loss = ops.mse_loss(pooled_logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss = ops.cross_entropy(pooled_logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss = ops.binary_cross_entropy_with_logits(pooled_logits, labels)
        if not return_dict:
            output = (pooled_logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForSequenceClassification.__init__(config)

Initializes a new instance of the Starcoder2ForSequenceClassification class.

PARAMETER DESCRIPTION
self

The object instance.

TYPE: Starcoder2ForSequenceClassification

config

An instance of the Starcoder2Config class containing the configuration parameters for the model.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def __init__(self, config):
    """
    Initializes a new instance of the Starcoder2ForSequenceClassification class.

    Args:
        self (Starcoder2ForSequenceClassification): The object instance.
        config: An instance of the Starcoder2Config class containing the configuration parameters for the model.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)
    self.num_labels = config.num_labels
    self.model = Starcoder2Model(config)
    self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForSequenceClassification.forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (mean-squared error); if config.num_labels > 1, a classification loss is computed (cross-entropy).

TYPE: `mindspore.Tensor` of shape `(batch_size,)`, *optional* DEFAULT: None
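
The description above names two loss cases, while the forward source in this section also contains a multi-label branch that is chosen when `config.problem_type` is unset and the labels are not integer-typed. The following is a minimal pure-Python sketch of that branch order. The helper name `infer_problem_type`, the `labels_are_integer` flag, and the `LOSS_BY_PROBLEM_TYPE` table are hypothetical names introduced for illustration; they are not part of mindnlp.

```python
def infer_problem_type(num_labels: int, labels_are_integer: bool) -> str:
    """Mirror the branch order used when config.problem_type is None.

    labels_are_integer stands in for the dtype check against
    mindspore.int64 / mindspore.int32 in the real forward method.
    """
    if num_labels == 1:
        return "regression"
    if num_labels > 1 and labels_are_integer:
        return "single_label_classification"
    return "multi_label_classification"


# Loss function applied for each problem type in the forward source.
LOSS_BY_PROBLEM_TYPE = {
    "regression": "ops.mse_loss",
    "single_label_classification": "ops.cross_entropy",
    "multi_label_classification": "ops.binary_cross_entropy_with_logits",
}

if __name__ == "__main__":
    for num_labels, is_int in [(1, False), (3, True), (3, False)]:
        kind = infer_problem_type(num_labels, is_int)
        print(num_labels, is_int, kind, LOSS_BY_PROBLEM_TYPE[kind])
```

Note that the inferred value is written back to `config.problem_type` in the real method, so the first batch with labels fixes the problem type for later calls.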

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, SequenceClassifierOutputWithPast]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (mean-squared error); if
            `config.num_labels > 1`, a classification loss is computed (cross-entropy).
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    transformer_outputs = self.model(
        input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    hidden_states = transformer_outputs[0]
    logits = self.score(hidden_states)

    if input_ids is not None:
        batch_size = input_ids.shape[0]
    else:
        batch_size = inputs_embeds.shape[0]

    if self.config.pad_token_id is None and batch_size != 1:
        raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
    if self.config.pad_token_id is None:
        sequence_lengths = -1
    else:
        if input_ids is not None:
            # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
            sequence_lengths = ops.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
            sequence_lengths = sequence_lengths % input_ids.shape[-1]
        else:
            sequence_lengths = -1

    pooled_logits = logits[ops.arange(batch_size), sequence_lengths]

    loss = None
    if labels is not None:
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            if self.num_labels == 1:
                loss = ops.mse_loss(pooled_logits.squeeze(), labels.squeeze())
            else:
                loss = ops.mse_loss(pooled_logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss = ops.cross_entropy(pooled_logits.view(-1, self.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss = ops.binary_cross_entropy_with_logits(pooled_logits, labels)
    if not return_dict:
        output = (pooled_logits,) + transformer_outputs[1:]
        return ((loss,) + output) if loss is not None else output

    return SequenceClassifierOutputWithPast(
        loss=loss,
        logits=pooled_logits,
        past_key_values=transformer_outputs.past_key_values,
        hidden_states=transformer_outputs.hidden_states,
        attentions=transformer_outputs.attentions,
    )
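
The pooling step in the source above selects, for each row, the logits at the last token before the first padding token. The following is a minimal pure-Python sketch of that index computation for a single row, assuming right-padded inputs; the function name `last_token_index` is hypothetical and plain lists stand in for tensors.

```python
from typing import List


def last_token_index(input_ids: List[int], pad_token_id: int) -> int:
    """Index of the token whose logits are pooled for one sequence.

    Mirrors `eq(input_ids, pad).int().argmax(-1) - 1` followed by the
    modulo: argmax over a 0/1 mask returns the first pad position, or 0
    when the row contains no pad token.
    """
    mask = [1 if tok == pad_token_id else 0 for tok in input_ids]
    first_pad = mask.index(1) if 1 in mask else 0
    # The modulo maps -1 (no pad found) to the final position.
    return (first_pad - 1) % len(input_ids)


if __name__ == "__main__":
    print(last_token_index([5, 6, 7, 0, 0], pad_token_id=0))
    print(last_token_index([5, 6, 7, 8, 9], pad_token_id=0))
```

When a row starts with the padding token, the computed index also wraps to the final position, since the first pad is found at position 0.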

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForSequenceClassification.get_input_embeddings()

Retrieves the input embeddings from the 'Starcoder2ForSequenceClassification' model.

PARAMETER DESCRIPTION
self

An instance of the 'Starcoder2ForSequenceClassification' class.

RETURNS DESCRIPTION
nn.Embedding

The token embedding layer stored in self.model.embed_tokens.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def get_input_embeddings(self):
    """
    Retrieves the input embeddings from the 'Starcoder2ForSequenceClassification' model.

    Args:
        self: An instance of the 'Starcoder2ForSequenceClassification' class.

    Returns:
        nn.Embedding: The token embedding layer stored in `self.model.embed_tokens`.

    Raises:
        None.

    """
    return self.model.embed_tokens

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2ForSequenceClassification.set_input_embeddings(value)

Sets the input embeddings of the Starcoder2ForSequenceClassification model.

PARAMETER DESCRIPTION
self

The instance of the Starcoder2ForSequenceClassification class.

TYPE: Starcoder2ForSequenceClassification

value

The input embeddings to be set for the model. It should be compatible with the model's embedding layer.

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the value provided is not compatible with the model's embedding layer.

AttributeError

If the model instance does not have the 'embed_tokens' attribute.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def set_input_embeddings(self, value):
    """
    Sets the input embeddings of the Starcoder2ForSequenceClassification model.

    Args:
        self (Starcoder2ForSequenceClassification): The instance of the Starcoder2ForSequenceClassification class.
        value: The input embeddings to be set for the model.
            It should be compatible with the model's embedding layer.

    Returns:
        None.

    Raises:
        TypeError: If the value provided is not compatible with the model's embedding layer.
        AttributeError: If the model instance does not have the 'embed_tokens' attribute.
    """
    self.model.embed_tokens = value

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2MLP

Bases: Module

A class representing a multi-layer perceptron (MLP) for Starcoder2 model.

This class inherits from nn.Module and implements the forward pass of the MLP for the Starcoder2 model. The MLP consists of two fully connected layers with an activation function between them, followed by residual dropout.

ATTRIBUTE DESCRIPTION
config

The configuration for the Starcoder2 model.

TYPE: Starcoder2Config

METHOD DESCRIPTION
__init__

Initializes the Starcoder2MLP with the given configuration.

forward

Applies the multi-layer perceptron to the provided hidden states.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
class Starcoder2MLP(nn.Module):

    '''
    A class representing a multi-layer perceptron (MLP) for Starcoder2 model.

    This class inherits from nn.Module and implements the forward pass of the MLP for the Starcoder2 model.
    The MLP consists of two fully connected layers with an activation function between them, followed by residual dropout.

    Attributes:
        config (Starcoder2Config): The configuration for the Starcoder2 model.

    Methods:
        __init__:
            Initializes the Starcoder2MLP with the given configuration.

        forward:
            Applies the multi-layer perceptron to the provided hidden states.

    '''
    def __init__(self, config: Starcoder2Config):
        """
        Initializes a Starcoder2MLP instance.

        Args:
            self (Starcoder2MLP): The current instance of the Starcoder2MLP class.
            config (Starcoder2Config): An instance of Starcoder2Config containing the configuration parameters.

        Returns:
            None.

        Raises:
            TypeError: If the provided config parameter is not of type Starcoder2Config.
            ValueError: If the provided config parameter does not contain valid configuration values.
        """
        super().__init__()
        embed_dim = config.hidden_size
        self.c_fc = nn.Linear(embed_dim, config.intermediate_size, bias=config.use_bias)
        self.c_proj = nn.Linear(config.intermediate_size, embed_dim, bias=config.use_bias)
        self.act = ACT2FN[config.hidden_act]
        self.residual_dropout = config.residual_dropout

    def forward(self, hidden_states: Optional[Tuple[mindspore.Tensor]]) -> mindspore.Tensor:
        """
        Runs the forward pass of the Starcoder2MLP module.

        Args:
            self (Starcoder2MLP): The instance of the Starcoder2MLP class.
            hidden_states (mindspore.Tensor): The input hidden states to be processed. Despite the
                annotation, a single tensor is expected, because the value is passed directly to a linear layer.

        Returns:
            mindspore.Tensor: The processed hidden states after passing through the MLP layers.

        Raises:
            None.
        """
        hidden_states = self.c_fc(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states = self.c_proj(hidden_states)
        hidden_states = ops.dropout(hidden_states, p=self.residual_dropout, training=self.training)
        return hidden_states

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2MLP.__init__(config)

Initializes a Starcoder2MLP instance.

PARAMETER DESCRIPTION
self

The current instance of the Starcoder2MLP class.

TYPE: Starcoder2MLP

config

An instance of Starcoder2Config containing the configuration parameters.

TYPE: Starcoder2Config

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the provided config parameter is not of type Starcoder2Config.

ValueError

If the provided config parameter does not contain valid configuration values.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def __init__(self, config: Starcoder2Config):
    """
    Initializes a Starcoder2MLP instance.

    Args:
        self (Starcoder2MLP): The current instance of the Starcoder2MLP class.
        config (Starcoder2Config): An instance of Starcoder2Config containing the configuration parameters.

    Returns:
        None.

    Raises:
        TypeError: If the provided config parameter is not of type Starcoder2Config.
        ValueError: If the provided config parameter does not contain valid configuration values.
    """
    super().__init__()
    embed_dim = config.hidden_size
    self.c_fc = nn.Linear(embed_dim, config.intermediate_size, bias=config.use_bias)
    self.c_proj = nn.Linear(config.intermediate_size, embed_dim, bias=config.use_bias)
    self.act = ACT2FN[config.hidden_act]
    self.residual_dropout = config.residual_dropout

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2MLP.forward(hidden_states)

Runs the forward pass of the Starcoder2MLP module.

PARAMETER DESCRIPTION
self

The instance of the Starcoder2MLP class.

TYPE: Starcoder2MLP

hidden_states

The input hidden states to be processed. Despite the annotation, a single mindspore.Tensor is expected, because the value is passed directly to a linear layer.

TYPE: Optional[Tuple[Tensor]]

RETURNS DESCRIPTION
Tensor

mindspore.Tensor: The processed hidden states after passing through the MLP layers.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def forward(self, hidden_states: Optional[Tuple[mindspore.Tensor]]) -> mindspore.Tensor:
    """
    Runs the forward pass of the Starcoder2MLP module.

    Args:
        self (Starcoder2MLP): The instance of the Starcoder2MLP class.
        hidden_states (mindspore.Tensor): The input hidden states to be processed. Despite the
            annotation, a single tensor is expected, because the value is passed directly to a linear layer.

    Returns:
        mindspore.Tensor: The processed hidden states after passing through the MLP layers.

    Raises:
        None.
    """
    hidden_states = self.c_fc(hidden_states)
    hidden_states = self.act(hidden_states)
    hidden_states = self.c_proj(hidden_states)
    hidden_states = ops.dropout(hidden_states, p=self.residual_dropout, training=self.training)
    return hidden_states
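
The forward pass above is a two-layer feed-forward block: project up with `c_fc`, apply the activation, project back with `c_proj`, then apply dropout. The following is a minimal pure-Python sketch of that data flow on a single vector. It assumes a tanh-approximated GELU as the activation (the real activation is whatever `config.hidden_act` selects) and omits dropout, which is the identity outside training. All function names here are hypothetical stand-ins, not mindnlp APIs.

```python
import math
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU (assumed activation for this sketch)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def linear(x: Vector, weight: Matrix, bias: Optional[Vector] = None) -> Vector:
    """y = W x + b, with weight of shape (out_features, in_features)."""
    out = [sum(w * v for w, v in zip(row, x)) for row in weight]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out


def mlp_forward(x: Vector, w_fc: Matrix, w_proj: Matrix) -> Vector:
    """c_fc -> activation -> c_proj; dropout is skipped (evaluation mode)."""
    hidden = linear(x, w_fc)                 # hidden_size -> intermediate_size
    hidden = [gelu_tanh(v) for v in hidden]
    return linear(hidden, w_proj)            # intermediate_size -> hidden_size


if __name__ == "__main__":
    w_fc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]    # 2 -> 4
    w_proj = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]       # 4 -> 2
    print(mlp_forward([1.0, -1.0], w_fc, w_proj))
```

The output has the same width as the input, which is what lets a decoder layer add the MLP result back onto its residual stream.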

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2Model

Bases: Starcoder2PreTrainedModel

Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a Starcoder2DecoderLayer.

PARAMETER DESCRIPTION
config

Starcoder2Config

TYPE: Starcoder2Config

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
class Starcoder2Model(Starcoder2PreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Starcoder2DecoderLayer`]

    Args:
        config: Starcoder2Config
    """
    def __init__(self, config: Starcoder2Config):
        """
        Initializes a new instance of the Starcoder2Model class.

        Args:
            self (Starcoder2Model): The current instance of the Starcoder2Model class.
            config (Starcoder2Config):
                An instance of Starcoder2Config containing the configuration parameters for the model.

                - config.pad_token_id (int): The index of the padding token in the vocabulary.
                - config.vocab_size (int): The size of the vocabulary.
                - config.hidden_size (int): The size of the hidden layers.
                - config.embedding_dropout (float): The dropout probability for the embedding layer.
                - config.num_hidden_layers (int): The number of hidden layers in the model.
                - config.norm_epsilon (float): The epsilon value for normalization.

        Returns:
            None.

        Raises:
            TypeError: If the config parameter is not an instance of Starcoder2Config.
            ValueError: If the config parameters are invalid or out of range.
            RuntimeError: If there is an issue with initializing the model attributes.
        """
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.embedding_dropout = config.embedding_dropout
        self.layers = nn.ModuleList(
            [Starcoder2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        # self._attn_implementation = config._attn_implementation
        self.norm = nn.LayerNorm(config.hidden_size, eps=config.norm_epsilon)
        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        This method retrieves the input embeddings for the Starcoder2Model.

        Args:
            self: The instance of the Starcoder2Model class.

        Returns:
            nn.Embedding: The input embedding layer stored in the 'embed_tokens' attribute of the
                Starcoder2Model instance.

        Raises:
            None
        """
        return self.embed_tokens

    def set_input_embeddings(self, value):
        """
        Set the input embeddings for the Starcoder2Model.

        Args:
            self (Starcoder2Model): The instance of the Starcoder2Model class.
            value (any): The input embeddings to be set for the model.

        Returns:
            None.

        Raises:
            None.
        """
        self.embed_tokens = value

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPast]:
        """
        Runs the forward pass of the Starcoder2Model.

        Args:
            self (Starcoder2Model): The instance of the Starcoder2Model class.
            input_ids (mindspore.Tensor, optional): The input tensor containing the indices of input tokens.
                Defaults to None.
            attention_mask (mindspore.Tensor, optional): The attention mask tensor. Defaults to None.
            position_ids (mindspore.Tensor, optional): The position ids tensor. Defaults to None.
            past_key_values (List[mindspore.Tensor], optional): The list of tensors containing past key values.
                Defaults to None.
            inputs_embeds (mindspore.Tensor, optional): The embedded input tensors. Defaults to None.
            use_cache (bool, optional): Flag to indicate whether to use cache. Defaults to None.
            output_attentions (bool, optional): Flag to indicate whether to output attentions. Defaults to None.
            output_hidden_states (bool, optional): Flag to indicate whether to output hidden states. Defaults to None.
            return_dict (bool, optional): Flag to indicate whether to return a dictionary. Defaults to None.

        Returns:
            Union[Tuple, BaseModelOutputWithPast]: The model output, either as a plain tuple or as a
                BaseModelOutputWithPast, depending on return_dict.

        Raises:
            ValueError: If both input_ids and inputs_embeds are specified, or if neither of them is specified.

        Note:
            If use_cache is True while gradient checkpointing is enabled during training, a warning is logged
            and use_cache is set to False.
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # retrieve input_ids and inputs_embeds
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
        if input_ids is not None:
            batch_size, seq_length = input_ids.shape
        elif inputs_embeds is not None:
            batch_size, seq_length, _ = inputs_embeds.shape
        else:
            raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")

        if self.gradient_checkpointing and self.training:
            if use_cache:
                logger.warning_once(
                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                )
                use_cache = False

        past_key_values_length = 0

        if use_cache:
            use_legacy_cache = not isinstance(past_key_values, Cache)
            if use_legacy_cache:
                past_key_values = DynamicCache.from_legacy_cache(past_key_values)
            past_key_values_length = past_key_values.get_usable_length(seq_length)

        if position_ids is None:
            position_ids = ops.arange(
                past_key_values_length, seq_length + past_key_values_length, dtype=mindspore.int64
            )
            position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
        else:
            position_ids = position_ids.view(-1, seq_length).long()

        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)

        # if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:
        #     is_padding_right = attention_mask[:, -1].sum().item() != batch_size
        #     if is_padding_right:
        #         raise ValueError(
        #             "You are attempting to perform batched generation with padding_side='right'"
        #             " this may lead to unexpected behaviour for Flash Attention version of Starcoder2. Make sure to "
        #             " call `tokenizer.padding_side  = 'left'` before tokenizing the input. "
        #         )

        # 4d mask is passed through the layers
        attention_mask = _prepare_4d_causal_attention_mask(
            attention_mask,
            (batch_size, seq_length),
            inputs_embeds,
            past_key_values_length,
            sliding_window=self.config.sliding_window,
        )

        hidden_states = inputs_embeds
        hidden_states = ops.dropout(hidden_states, p=self.embedding_dropout, training=self.training)

        # decoder layers
        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
        next_decoder_cache = None

        for decoder_layer in self.layers:
            if output_hidden_states:
                all_hidden_states += (hidden_states,)

            layer_outputs = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
                position_ids=position_ids,
                past_key_value=past_key_values,
                output_attentions=output_attentions,
                use_cache=use_cache,
            )

            hidden_states = layer_outputs[0]

            if use_cache:
                next_decoder_cache = layer_outputs[2 if output_attentions else 1]

            if output_attentions:
                all_self_attns += (layer_outputs[1],)

        hidden_states = self.norm(hidden_states)

        # add hidden states from the last decoder layer
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        next_cache = None
        if use_cache:
            next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache

        if not return_dict:
            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
        )
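
When `position_ids` is not supplied, the forward pass above builds them from the number of cached tokens, so that new tokens continue counting where the cache left off. The following is a minimal pure-Python sketch of that default, with a list standing in for the tensor; the function name `default_position_ids` is hypothetical.

```python
from typing import List


def default_position_ids(past_key_values_length: int, seq_length: int) -> List[List[int]]:
    """Mirror arange(past_len, past_len + seq_len) reshaped to (1, seq_len)."""
    start = past_key_values_length
    return [list(range(start, start + seq_length))]


if __name__ == "__main__":
    # Prompt of four tokens with an empty cache.
    print(default_position_ids(0, 4))
    # One new token decoded after those four are cached.
    print(default_position_ids(4, 1))
```

The leading dimension of size 1 corresponds to the `unsqueeze(0)` in the source, which yields one row of positions rather than one row per batch element.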

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2Model.__init__(config)

Initializes a new instance of the Starcoder2Model class.

PARAMETER DESCRIPTION
self

The current instance of the Starcoder2Model class.

TYPE: Starcoder2Model

config

An instance of Starcoder2Config containing the configuration parameters for the model.

  • config.pad_token_id (int): The index of the padding token in the vocabulary.
  • config.vocab_size (int): The size of the vocabulary.
  • config.hidden_size (int): The size of the hidden layers.
  • config.embedding_dropout (float): The dropout probability for the embedding layer.
  • config.num_hidden_layers (int): The number of hidden layers in the model.
  • config.norm_epsilon (float): The epsilon value for normalization.

TYPE: Starcoder2Config

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the config parameter is not an instance of Starcoder2Config.

ValueError

If the config parameters are invalid or out of range.

RuntimeError

If there is an issue with initializing the model attributes.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def __init__(self, config: Starcoder2Config):
    """
    Initializes a new instance of the Starcoder2Model class.

    Args:
        self (Starcoder2Model): The current instance of the Starcoder2Model class.
        config (Starcoder2Config):
            An instance of Starcoder2Config containing the configuration parameters for the model.

            - config.pad_token_id (int): The index of the padding token in the vocabulary.
            - config.vocab_size (int): The size of the vocabulary.
            - config.hidden_size (int): The size of the hidden layers.
            - config.embedding_dropout (float): The dropout probability for the embedding layer.
            - config.num_hidden_layers (int): The number of hidden layers in the model.
            - config.norm_epsilon (float): The epsilon value for normalization.

    Returns:
        None.

    Raises:
        TypeError: If the config parameter is not an instance of Starcoder2Config.
        ValueError: If the config parameters are invalid or out of range.
        RuntimeError: If there is an issue with initializing the model attributes.
        """
    super().__init__(config)
    self.padding_idx = config.pad_token_id
    self.vocab_size = config.vocab_size

    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
    self.embedding_dropout = config.embedding_dropout
    self.layers = nn.ModuleList(
        [Starcoder2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
    )
    # self._attn_implementation = config._attn_implementation
    self.norm = nn.LayerNorm(config.hidden_size, eps=config.norm_epsilon)
    self.gradient_checkpointing = False
    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2Model.forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

Runs the forward pass of the Starcoder2Model.

PARAMETER DESCRIPTION
self

The instance of the Starcoder2Model class.

TYPE: Starcoder2Model

input_ids

The input tensor containing the indices of input tokens. Defaults to None.

TYPE: Tensor DEFAULT: None

attention_mask

The attention mask tensor. Defaults to None.

TYPE: Tensor DEFAULT: None

position_ids

The position ids tensor. Defaults to None.

TYPE: Tensor DEFAULT: None

past_key_values

The list of tensors containing past key values. Defaults to None.

TYPE: List[Tensor] DEFAULT: None

inputs_embeds

The embedded input tensors. Defaults to None.

TYPE: Tensor DEFAULT: None

use_cache

Flag to indicate whether to use cache. Defaults to None.

TYPE: bool DEFAULT: None

output_attentions

Flag to indicate whether to output attentions. Defaults to None.

TYPE: bool DEFAULT: None

output_hidden_states

Flag to indicate whether to output hidden states. Defaults to None.

TYPE: bool DEFAULT: None

return_dict

Flag to indicate whether to return a dictionary. Defaults to None.

TYPE: bool DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, BaseModelOutputWithPast]

The model output: a BaseModelOutputWithPast when return_dict is true, otherwise a tuple containing only the values that are not None.

RAISES DESCRIPTION
ValueError

If both input_ids and inputs_embeds are specified, or if neither of them is specified.

Warning

Logged rather than raised: if use_cache is True while gradient checkpointing is enabled during training, a warning is logged once and use_cache is set to False.
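The two ValueError cases above amount to an "exactly one of" check on input_ids and inputs_embeds, from which the batch size and sequence length are then read. A minimal pure-Python sketch of that check follows; `resolve_batch_and_seq` is a hypothetical helper that is not part of mindnlp, and nested lists stand in for tensors.

```python
from typing import Optional, Sequence, Tuple


def resolve_batch_and_seq(
    input_ids: Optional[Sequence[Sequence[int]]] = None,
    inputs_embeds: Optional[Sequence[Sequence[Sequence[float]]]] = None,
) -> Tuple[int, int]:
    """Hypothetical helper mirroring the exactly-one-input check in forward."""
    if input_ids is not None and inputs_embeds is not None:
        raise ValueError("Specify only one of input_ids or inputs_embeds")
    if input_ids is not None:
        # Token ids are laid out as (batch_size, seq_length).
        return len(input_ids), len(input_ids[0])
    if inputs_embeds is not None:
        # Embeddings are laid out as (batch_size, seq_length, hidden_size).
        return len(inputs_embeds), len(inputs_embeds[0])
    raise ValueError("Specify either input_ids or inputs_embeds")


print(resolve_batch_and_seq(input_ids=[[1, 2, 3], [4, 5, 6]]))  # (2, 3)
```

Passing both arguments, or neither, raises ValueError in this sketch, matching the two cases listed above.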

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
    """
    Runs the forward pass of the Starcoder2Model.

    Args:
        self (Starcoder2Model): The instance of the Starcoder2Model class.
        input_ids (mindspore.Tensor, optional): The input tensor containing the indices of input tokens.
            Defaults to None.
        attention_mask (mindspore.Tensor, optional): The attention mask tensor. Defaults to None.
        position_ids (mindspore.Tensor, optional): The position ids tensor. Defaults to None.
        past_key_values (List[mindspore.Tensor], optional): The list of tensors containing past key values.
            Defaults to None.
        inputs_embeds (mindspore.Tensor, optional): The embedded input tensors. Defaults to None.
        use_cache (bool, optional): Flag to indicate whether to use cache. Defaults to None.
        output_attentions (bool, optional): Flag to indicate whether to output attentions. Defaults to None.
        output_hidden_states (bool, optional): Flag to indicate whether to output hidden states. Defaults to None.
        return_dict (bool, optional): Flag to indicate whether to return a dictionary. Defaults to None.

    Returns:
        Union[Tuple, BaseModelOutputWithPast]: The forwarded model output.

    Raises:
        ValueError: If both input_ids and inputs_embeds are specified, or if neither of them is specified.
        Warning: If use_cache is set to True and gradient checkpointing is enabled,
            the use_cache flag will be overridden.
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    use_cache = use_cache if use_cache is not None else self.config.use_cache

    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # retrieve input_ids and inputs_embeds
    if input_ids is not None and inputs_embeds is not None:
        raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
    if input_ids is not None:
        batch_size, seq_length = input_ids.shape
    elif inputs_embeds is not None:
        batch_size, seq_length, _ = inputs_embeds.shape
    else:
        raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")

    if self.gradient_checkpointing and self.training:
        if use_cache:
            logger.warning_once(
                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
            )
            use_cache = False

    past_key_values_length = 0

    if use_cache:
        use_legacy_cache = not isinstance(past_key_values, Cache)
        if use_legacy_cache:
            past_key_values = DynamicCache.from_legacy_cache(past_key_values)
        past_key_values_length = past_key_values.get_usable_length(seq_length)

    if position_ids is None:
        position_ids = ops.arange(
            past_key_values_length, seq_length + past_key_values_length, dtype=mindspore.int64
        )
        position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
    else:
        position_ids = position_ids.view(-1, seq_length).long()

    if inputs_embeds is None:
        inputs_embeds = self.embed_tokens(input_ids)

    # if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:
    #     is_padding_right = attention_mask[:, -1].sum().item() != batch_size
    #     if is_padding_right:
    #         raise ValueError(
    #             "You are attempting to perform batched generation with padding_side='right'"
    #             " this may lead to unexpected behaviour for Flash Attention version of Starcoder2. Make sure to "
    #             " call `tokenizer.padding_side  = 'left'` before tokenizing the input. "
    #         )

    # 4d mask is passed through the layers
    attention_mask = _prepare_4d_causal_attention_mask(
        attention_mask,
        (batch_size, seq_length),
        inputs_embeds,
        past_key_values_length,
        sliding_window=self.config.sliding_window,
    )

    hidden_states = inputs_embeds
    hidden_states = ops.dropout(hidden_states, p=self.embedding_dropout, training=self.training)

    # decoder layers
    all_hidden_states = () if output_hidden_states else None
    all_self_attns = () if output_attentions else None
    next_decoder_cache = None

    for decoder_layer in self.layers:
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        layer_outputs = decoder_layer(
            hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_values,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )

        hidden_states = layer_outputs[0]

        if use_cache:
            next_decoder_cache = layer_outputs[2 if output_attentions else 1]

        if output_attentions:
            all_self_attns += (layer_outputs[1],)

    hidden_states = self.norm(hidden_states)

    # add hidden states from the last decoder layer
    if output_hidden_states:
        all_hidden_states += (hidden_states,)

    next_cache = None
    if use_cache:
        next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache

    if not return_dict:
        return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
    return BaseModelOutputWithPast(
        last_hidden_state=hidden_states,
        past_key_values=next_cache,
        hidden_states=all_hidden_states,
        attentions=all_self_attns,
    )
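When position_ids is not supplied, the forward pass above numbers the new tokens starting from the count of tokens already held in the key/value cache. The following pure-Python sketch illustrates that offset; `default_position_ids` is a hypothetical name, and a nested list stands in for the `(1, seq_length)` tensor.

```python
from typing import List


def default_position_ids(seq_length: int, past_key_values_length: int = 0) -> List[List[int]]:
    """Positions for the new tokens, shifted by the number of cached tokens.

    Returns a single row (shape 1 x seq_length) that broadcasts over the batch.
    """
    start = past_key_values_length
    return [list(range(start, start + seq_length))]


# Prompt of 4 tokens with an empty cache.
print(default_position_ids(4))     # [[0, 1, 2, 3]]
# One new token after 4 cached tokens.
print(default_position_ids(1, 4))  # [[4]]
```

During incremental decoding, each step therefore continues the numbering from where the cached prefix ends instead of restarting at 0.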

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2Model.get_input_embeddings()

This method retrieves the input embeddings for the Starcoder2Model.

PARAMETER DESCRIPTION
self

The instance of the Starcoder2Model class.

RETURNS DESCRIPTION
nn.Embedding

The input embeddings stored in the 'embed_tokens' attribute of the Starcoder2Model instance.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def get_input_embeddings(self):
    """
    This method retrieves the input embeddings for the Starcoder2Model.

    Args:
        self: The instance of the Starcoder2Model class.

    Returns:
        nn.Embedding: The input embeddings stored in the 'embed_tokens' attribute of the
            Starcoder2Model instance.

    Raises:
        None
    """
    return self.embed_tokens

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2Model.set_input_embeddings(value)

Set the input embeddings for the Starcoder2Model.

PARAMETER DESCRIPTION
self

The instance of the Starcoder2Model class.

TYPE: Starcoder2Model

value

The input embeddings to be set for the model.

TYPE: any

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def set_input_embeddings(self, value):
    """
    Set the input embeddings for the Starcoder2Model.

    Args:
        self (Starcoder2Model): The instance of the Starcoder2Model class.
        value (any): The input embeddings to be set for the model.

    Returns:
        None.

    Raises:
        None.
    """
    self.embed_tokens = value

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2PreTrainedModel

Bases: PreTrainedModel

This class represents a Starcoder2PreTrainedModel, which is a subclass of PreTrainedModel.

The Starcoder2PreTrainedModel class provides methods for initializing the weights of different types of cells. The initialization process depends on the type of the cell. If the cell is an instance of nn.Linear, the weights are initialized using a normal distribution with a mean of 0 and a standard deviation defined by the initializer_range attribute of the configuration. If the cell has a bias, the bias is initialized with zeros.

If the cell is an instance of nn.Embedding, the weights are initialized using a normal distribution with a mean of 0 and a standard deviation defined by the initializer_range attribute of the configuration. If the cell has a padding index, the weight corresponding to the padding index is set to 0.

Note

It is assumed that the cell parameter passed to the _init_weights method is an instance of either nn.Linear or nn.Embedding.

Please refer to the source code for more details on the implementation.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
class Starcoder2PreTrainedModel(PreTrainedModel):

    """
    This class represents a Starcoder2PreTrainedModel, which is a subclass of PreTrainedModel.

    The Starcoder2PreTrainedModel class provides methods for initializing the weights of different types of cells.
    The initialization process depends on the type of the cell. If the cell is an instance of nn.Linear, the weights are
    initialized using a normal distribution with a mean of 0 and a standard deviation defined by the `initializer_range`
    attribute of the configuration. If the cell has a bias, the bias is initialized with zeros.

    If the cell is an instance of nn.Embedding, the weights are initialized using a normal distribution with a mean
    of 0 and a standard deviation defined by the `initializer_range` attribute of the configuration.
    If the cell has a padding index, the weight corresponding to the padding index is set to 0.

    Note:
        It is assumed that the `cell` parameter passed to the `_init_weights` method is an instance of either nn.Linear
        or nn.Embedding.

    Please refer to the source code for more details on the implementation.

    """
    config_class = Starcoder2Config
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["Starcoder2DecoderLayer"]
    _skip_keys_device_placement = "past_key_values"
    _supports_cache_class = True

    def _init_weights(self, cell):
        """Initialize the weights"""
        if isinstance(cell, nn.Linear):
            cell.weight.set_data(initializer(Normal(self.config.initializer_range),
                                                    cell.weight.shape, cell.weight.dtype))
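            # Compare against None: truth-testing a multi-element tensor is ambiguous.
            if cell.bias is not None:
                cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
        elif isinstance(cell, nn.Embedding):
            weight = np.random.normal(0.0, self.config.initializer_range, cell.weight.shape)
            # Compare against None so that a padding index of 0 is still zeroed.
            if cell.padding_idx is not None:
                weight[cell.padding_idx] = 0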

            cell.weight.set_data(Tensor(weight, cell.weight.dtype))
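The embedding branch of the initialization described above can be sketched with NumPy alone. This is an illustration under assumptions, not the mindnlp implementation: `init_embedding_weight` and its `seed` parameter are hypothetical, and a NumPy array stands in for the cell's weight. Because a padding index of 0 is falsy in Python, the sketch compares against None explicitly.

```python
import numpy as np


def init_embedding_weight(num_embeddings, embedding_dim, std, padding_idx=None, seed=0):
    """NumPy stand-in for the nn.Embedding branch of _init_weights."""
    rng = np.random.default_rng(seed)
    # Draw every entry from N(0, std^2); std plays the role of initializer_range.
    weight = rng.normal(0.0, std, size=(num_embeddings, embedding_dim))
    # padding_idx == 0 is a valid index, so test for None rather than truthiness.
    if padding_idx is not None:
        weight[padding_idx] = 0.0
    return weight


w = init_embedding_weight(8, 4, std=0.02, padding_idx=0)
print(bool(np.all(w[0] == 0.0)))  # True
```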

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2RotaryEmbedding

Bases: Module

The Starcoder2RotaryEmbedding class represents a rotary embedding module used for positional encoding in neural network models. This class inherits from the nn.Module class.

The class's constructor initializes the Starcoder2RotaryEmbedding instance with the specified dimensions, maximum position embeddings, and base value for the rotary embedding. It computes the inverse frequency and sets the cosine and sine cache for positional encoding.

The _set_cos_sin_cache method sets the cosine and sine cache based on the maximum sequence length and data type.

The forward method applies the positional encoding to the input tensor based on the sequence length and returns the cosine and sine embeddings.

Note

This docstring is a detailed summary of the Starcoder2RotaryEmbedding class and its methods, providing an overview of its functionality and purpose within the context of neural network modeling.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
class Starcoder2RotaryEmbedding(nn.Module):

    """
    The Starcoder2RotaryEmbedding class represents a rotary embedding module used for positional encoding in neural
    network models. This class inherits from the nn.Module class.

    The class's constructor initializes the Starcoder2RotaryEmbedding instance with the specified dimensions,
    maximum position embeddings, and base value for the rotary embedding. It computes the inverse frequency and sets
    the cosine and sine cache for positional encoding.

    The _set_cos_sin_cache method sets the cosine and sine cache based on the maximum sequence length and data type.

    The forward method applies the positional encoding to the input tensor based on the sequence length and returns
    the cosine and sine embeddings.

    Note:
        This docstring is a detailed summary of the Starcoder2RotaryEmbedding class and its methods, providing an
        overview of its functionality and purpose within the context of neural network modeling.
    """
    def __init__(self, dim, max_position_embeddings=2048, base=10000):
        """
        Initialize the Starcoder2RotaryEmbedding object.

        Args:
            self (Starcoder2RotaryEmbedding): The current instance of the Starcoder2RotaryEmbedding class.
            dim (int): The dimensionality of the embedding.
            max_position_embeddings (int, optional): The maximum number of positions to embed. Default is 2048.
            base (int, optional): The base value used in the calculation. Default is 10000.

        Returns:
            None.

        Raises:
            ValueError: If dim is not an integer.
            ValueError: If max_position_embeddings is not an integer.
            ValueError: If base is not an integer.
            ValueError: If any of the provided values are invalid or out of range.
        """
        super().__init__()

        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        inv_freq = 1.0 / (self.base ** (ops.arange(0, self.dim, 2, dtype=mindspore.int64).float() / self.dim))
        self.inv_freq = inv_freq
        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, dtype=get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, dtype):
        """
        Sets the cosine and sine cache for the Starcoder2RotaryEmbedding class.

        Args:
            self: The instance of the Starcoder2RotaryEmbedding class.
            seq_len (int): The length of the sequence.
            dtype: The data type for the cache.

        Returns:
            None: This method modifies the cos_cached and sin_cached attributes of the Starcoder2RotaryEmbedding instance.

        Raises:
            None.
        """
        self.max_seq_len_cached = seq_len
        t = ops.arange(self.max_seq_len_cached, dtype=mindspore.int64).type_as(self.inv_freq)

        freqs = ops.outer(t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = ops.cat((freqs, freqs), axis=-1)
        self.cos_cached = emb.cos().to(dtype)
        self.sin_cached = emb.sin().to(dtype)

    def forward(self, x, seq_len=None):
        """
        Construct and return the cosine and sine embeddings for the given sequence length.

        Args:
            self (Starcoder2RotaryEmbedding): An instance of the Starcoder2RotaryEmbedding class.
            x: The input tensor.
            seq_len (Optional[int]): The length of the sequence. If not provided, the default value is None.

        Returns:
            Tuple[mindspore.Tensor, mindspore.Tensor]:
                A tuple containing the cosine and sine embeddings for the given sequence length.

                - The cosine embedding is obtained by taking the first 'seq_len' elements from the cached cosine values.
                - The sine embedding is obtained by taking the first 'seq_len' elements from the cached sine values.

                Both embeddings are converted to the same data type as the input tensor 'x'.

        Raises:
            TypeError: If the input 'seq_len' is not an integer.
            ValueError: If the input 'seq_len' is negative.
        """
        # x: [bs, num_attention_heads, seq_len, head_size]
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, dtype=x.dtype)

        return (
            self.cos_cached[:seq_len].to(dtype=x.dtype),
            self.sin_cached[:seq_len].to(dtype=x.dtype),
        )
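The cache construction in the class above (inverse frequencies, an outer product with the positions, then duplicated halves) can be reproduced in a few lines of NumPy. This is a sketch under assumptions rather than the library code: `build_cos_sin_cache` is a hypothetical function, and NumPy arrays replace MindSpore tensors.

```python
import numpy as np


def build_cos_sin_cache(dim, seq_len, base=10000.0):
    """NumPy stand-in for __init__ plus _set_cos_sin_cache."""
    # One inverse frequency per pair of channels: base ** (-2i / dim).
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2, dtype=np.float64) / dim))
    t = np.arange(seq_len, dtype=np.float64)
    freqs = np.outer(t, inv_freq)                  # (seq_len, dim // 2)
    # Duplicate the angles so both halves of the head dimension share them.
    emb = np.concatenate((freqs, freqs), axis=-1)  # (seq_len, dim)
    return np.cos(emb), np.sin(emb)


cos, sin = build_cos_sin_cache(dim=8, seq_len=16)
print(cos.shape, sin.shape)  # (16, 8) (16, 8)
```

Position 0 has angle 0 in every channel, so the first row of the cosine cache is all ones and the first row of the sine cache is all zeros.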

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2RotaryEmbedding.__init__(dim, max_position_embeddings=2048, base=10000)

Initialize the Starcoder2RotaryEmbedding object.

PARAMETER DESCRIPTION
self

The current instance of the Starcoder2RotaryEmbedding class.

TYPE: Starcoder2RotaryEmbedding

dim

The dimensionality of the embedding.

TYPE: int

max_position_embeddings

The maximum number of positions to embed. Default is 2048.

TYPE: int DEFAULT: 2048

base

The base value used in the calculation. Default is 10000.

TYPE: int DEFAULT: 10000

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If dim is not an integer.

ValueError

If max_position_embeddings is not an integer.

ValueError

If base is not an integer.

ValueError

If any of the provided values are invalid or out of range.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def __init__(self, dim, max_position_embeddings=2048, base=10000):
    """
    Initialize the Starcoder2RotaryEmbedding object.

    Args:
        self (Starcoder2RotaryEmbedding): The current instance of the Starcoder2RotaryEmbedding class.
        dim (int): The dimensionality of the embedding.
        max_position_embeddings (int, optional): The maximum number of positions to embed. Default is 2048.
        base (int, optional): The base value used in the calculation. Default is 10000.

    Returns:
        None.

    Raises:
        ValueError: If dim is not an integer.
        ValueError: If max_position_embeddings is not an integer.
        ValueError: If base is not an integer.
        ValueError: If any of the provided values are invalid or out of range.
    """
    super().__init__()

    self.dim = dim
    self.max_position_embeddings = max_position_embeddings
    self.base = base
    inv_freq = 1.0 / (self.base ** (ops.arange(0, self.dim, 2, dtype=mindspore.int64).float() / self.dim))
    self.inv_freq = inv_freq
    # Build here to make `torch.jit.trace` work.
    self._set_cos_sin_cache(
        seq_len=max_position_embeddings, dtype=get_default_dtype()
    )

mindnlp.transformers.models.starcoder2.modeling_starcoder2.Starcoder2RotaryEmbedding.forward(x, seq_len=None)

Construct and return the cosine and sine embeddings for the given sequence length.

PARAMETER DESCRIPTION
self

An instance of the Starcoder2RotaryEmbedding class.

TYPE: Starcoder2RotaryEmbedding

x

The input tensor.

seq_len

The length of the sequence. Although the default is None, an integer must be supplied because the value is compared against the cached maximum sequence length.

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION

Tuple[mindspore.Tensor, mindspore.Tensor]: A tuple containing the cosine and sine embeddings for the given sequence length.

  • The cosine embedding is obtained by taking the first 'seq_len' elements from the cached cosine values.
  • The sine embedding is obtained by taking the first 'seq_len' elements from the cached sine values.

Both embeddings are converted to the same data type as the input tensor 'x'.

RAISES DESCRIPTION
TypeError

If the input 'seq_len' is not an integer.

ValueError

If the input 'seq_len' is negative.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def forward(self, x, seq_len=None):
    """
    Construct and return the cosine and sine embeddings for the given sequence length.

    Args:
        self (Starcoder2RotaryEmbedding): An instance of the Starcoder2RotaryEmbedding class.
        x: The input tensor.
        seq_len (Optional[int]): The length of the sequence. If not provided, the default value is None.

    Returns:
        Tuple[mindspore.Tensor, mindspore.Tensor]:
            A tuple containing the cosine and sine embeddings for the given sequence length.

            - The cosine embedding is obtained by taking the first 'seq_len' elements from the cached cosine values.
            - The sine embedding is obtained by taking the first 'seq_len' elements from the cached sine values.

            Both embeddings are converted to the same data type as the input tensor 'x'.

    Raises:
        TypeError: If the input 'seq_len' is not an integer.
        ValueError: If the input 'seq_len' is negative.
    """
    # x: [bs, num_attention_heads, seq_len, head_size]
    if seq_len > self.max_seq_len_cached:
        self._set_cos_sin_cache(seq_len=seq_len, dtype=x.dtype)

    return (
        self.cos_cached[:seq_len].to(dtype=x.dtype),
        self.sin_cached[:seq_len].to(dtype=x.dtype),
    )

mindnlp.transformers.models.starcoder2.modeling_starcoder2.apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1)

Applies Rotary Position Embedding to the query and key tensors.

PARAMETER DESCRIPTION
q

The query tensor.

TYPE: `mindspore.Tensor`

k

The key tensor.

TYPE: `mindspore.Tensor`

cos

The cosine part of the rotary embedding.

TYPE: `mindspore.Tensor`

sin

The sine part of the rotary embedding.

TYPE: `mindspore.Tensor`

position_ids

The position indices of the tokens corresponding to the query and key tensors. For example, this can be used to pass offset position ids when working with a KV-cache.

TYPE: `mindspore.Tensor`

unsqueeze_dim

The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

RETURNS DESCRIPTION

tuple(mindspore.Tensor) comprising the query and key tensors rotated using the Rotary Position Embedding.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
    """
    Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`mindspore.Tensor`): The query tensor.
        k (`mindspore.Tensor`): The key tensor.
        cos (`mindspore.Tensor`): The cosine part of the rotary embedding.
        sin (`mindspore.Tensor`): The sine part of the rotary embedding.
        position_ids (`mindspore.Tensor`):
            The position indices of the tokens corresponding to the query and key tensors. For example, this can be
            used to pass offset position ids when working with a KV-cache.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.

    Returns:
        `tuple(mindspore.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
    """
    cos = cos[position_ids].unsqueeze(unsqueeze_dim)
    sin = sin[position_ids].unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
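To make the broadcasting and the effect of the rotation concrete, here is a NumPy sketch of the same computation. It is an illustration under assumptions, not the mindnlp code: `apply_rotary` and the locally defined `rotate_half` are stand-ins operating on NumPy arrays, and the cos/sin tables are built inline with a base of 10000.

```python
import numpy as np


def rotate_half(x):
    # Split the last axis into two halves and map (x1, x2) -> (-x2, x1).
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate((-x2, x1), axis=-1)


def apply_rotary(q, k, cos, sin, position_ids, unsqueeze_dim=1):
    # cos[position_ids] is (batch, seq_len, head_dim); add a heads axis to broadcast.
    cos = np.expand_dims(cos[position_ids], unsqueeze_dim)
    sin = np.expand_dims(sin[position_ids], unsqueeze_dim)
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin


head_dim, seq_len = 8, 6
inv_freq = 1.0 / (10000.0 ** (np.arange(0, head_dim, 2) / head_dim))
freqs = np.outer(np.arange(32), inv_freq)
emb = np.concatenate((freqs, freqs), axis=-1)
cos, sin = np.cos(emb), np.sin(emb)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 2, seq_len, head_dim))  # (batch, heads, seq_len, head_dim)
k = rng.normal(size=(1, 2, seq_len, head_dim))
position_ids = np.arange(seq_len)[None, :]      # (1, seq_len)

q_rot, k_rot = apply_rotary(q, k, cos, sin, position_ids)
# The rotation leaves the norm of every query vector unchanged.
print(np.allclose(np.linalg.norm(q_rot, axis=-1), np.linalg.norm(q, axis=-1)))  # True
```

Shifting every position by the same offset leaves the query-key dot products unchanged, which is why offset position ids work with a KV-cache.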

mindnlp.transformers.models.starcoder2.modeling_starcoder2.repeat_kv(hidden_states, n_rep)

This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim).

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
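The expand-then-reshape pattern above can be checked against an interleaved repeat along the head axis. The following is a NumPy sketch under assumptions: `np.broadcast_to` stands in for the tensor `expand` call, and `np.repeat` plays the role of `repeat_interleave`.

```python
import numpy as np


def repeat_kv(hidden_states, n_rep):
    """NumPy stand-in: repeat each key/value head n_rep times along axis 1."""
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    # Insert a repeat axis after the head axis, broadcast it, then merge the two.
    expanded = np.broadcast_to(
        hidden_states[:, :, None, :, :], (batch, num_kv_heads, n_rep, slen, head_dim)
    )
    return expanded.reshape(batch, num_kv_heads * n_rep, slen, head_dim)


kv = np.arange(2 * 2 * 3 * 4, dtype=np.float32).reshape(2, 2, 3, 4)
out = repeat_kv(kv, n_rep=3)
print(out.shape)                                      # (2, 6, 3, 4)
print(np.array_equal(out, np.repeat(kv, 3, axis=1)))  # True
```

Each key/value head appears n_rep times in a row, so consecutive groups of attention heads share one key/value head, as in grouped-query attention.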

mindnlp.transformers.models.starcoder2.modeling_starcoder2.rotate_half(x)

Rotates half the hidden dims of the input.

Source code in mindnlp/transformers/models/starcoder2/modeling_starcoder2.py
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    # x1 = x[..., : x.shape[-1] // 2]
    # x2 = x[..., x.shape[-1] // 2 :]
    x1, x2 = x.tensor_split(2, -1)
    return ops.cat((-x2, x1), axis=-1)
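A small numeric example may help in reading rotate_half. The NumPy sketch below uses `np.split` as an assumed stand-in for `tensor_split` and shows the mapping on a four-element vector.

```python
import numpy as np


def rotate_half(x):
    # Split the last axis into halves (x1, x2) and return (-x2, x1).
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate((-x2, x1), axis=-1)


x = np.array([1.0, 2.0, 3.0, 4.0])
print(rotate_half(x).tolist())               # [-3.0, -4.0, 1.0, 2.0]
print(rotate_half(rotate_half(x)).tolist())  # [-1.0, -2.0, -3.0, -4.0]
```

Applying the function twice negates the input, which is what a 90-degree rotation of each (x1, x2) pair would do.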