
openelm

mindnlp.transformers.models.openelm.modeling_openelm

OpenELM config

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMDecoderLayer

Bases: Module

The OpenELMDecoderLayer class represents a single layer of the OpenELM decoder model. It is designed to be used in the OpenELMDecoder model for generating high-quality sequence predictions.

This class inherits from the nn.Module class, which provides a base class for all neural network cells in MindSpore.

ATTRIBUTE DESCRIPTION
attn

An instance of the OpenELMMultiHeadCausalAttention class responsible for performing multi-head causal attention operations.

TYPE: OpenELMMultiHeadCausalAttention

ffn

An instance of the OpenELMFeedForwardNetwork class responsible for applying feed-forward neural network operations.

TYPE: OpenELMFeedForwardNetwork

ffn_norm

An instance of the OpenELMRMSNorm class responsible for normalizing the output of the feed-forward network.

TYPE: OpenELMRMSNorm

attn_norm

An instance of the OpenELMRMSNorm class responsible for normalizing the output of the attention layer.

TYPE: OpenELMRMSNorm
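
The layer follows the standard pre-normalization residual pattern: each sub-block (attention, then FFN) is preceded by RMSNorm and wrapped in a residual connection. Below is a minimal sketch of that dataflow, with plain Python callables standing in for the real sub-modules (masks, caches, and flags omitted):

```python
# Sketch of the pre-norm residual dataflow in OpenELMDecoderLayer.forward.
# `attn`, `ffn`, `attn_norm` and `ffn_norm` are hypothetical stand-ins for the
# sub-modules described above; the real layer also threads the attention mask,
# KV cache and output flags through self.attn.
def decoder_layer_dataflow(hidden_states, attn, ffn, attn_norm, ffn_norm):
    residual = hidden_states
    hidden_states = attn(attn_norm(hidden_states))  # pre-norm, then causal attention
    hidden_states = residual + hidden_states        # first residual connection

    residual = hidden_states
    hidden_states = ffn(ffn_norm(hidden_states))    # pre-norm, then feed-forward network
    return residual + hidden_states                 # second residual connection
```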

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
class OpenELMDecoderLayer(nn.Module):

    """
    The `OpenELMDecoderLayer` class represents a single layer of the OpenELM decoder model.
    It is designed to be used in the OpenELMDecoder model for generating high-quality sequence predictions.

    This class inherits from the `nn.Module` class, which provides a base class for all neural network cells in MindSpore.

    Attributes:
        attn (OpenELMMultiHeadCausalAttention): An instance of the `OpenELMMultiHeadCausalAttention` class responsible
            for performing multi-head causal attention operations.
        ffn (OpenELMFeedForwardNetwork): An instance of the `OpenELMFeedForwardNetwork` class responsible
            for applying feed-forward neural network operations.
        ffn_norm (OpenELMRMSNorm): An instance of the `OpenELMRMSNorm` class responsible
            for normalizing the output of the feed-forward network.
        attn_norm (OpenELMRMSNorm): An instance of the `OpenELMRMSNorm` class responsible
            for normalizing the output of the attention layer.
    """
    def __init__(self, config: OpenELMConfig, layer_idx: int) -> None:
        """Initialize an instance of the OpenELMDecoderLayer class.

        Args:
            self: The instance of the OpenELMDecoderLayer class.
            config (OpenELMConfig): The configuration object for OpenELM.
                It specifies the model configuration settings.
            layer_idx (int): The index of the current layer in the decoder stack.
                It is used for identifying the layer position.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        self.attn = OpenELMMultiHeadCausalAttention(config=config, layer_idx=layer_idx)
        self.ffn = OpenELMFeedForwardNetwork(config=config, layer_idx=layer_idx)
        self.ffn_norm = OpenELMRMSNorm(
            num_features=config.model_dim,
        )
        self.attn_norm = OpenELMRMSNorm(
            num_features=config.model_dim,
        )

    def forward(
        self,
        hidden_states: mindspore.Tensor,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_value: Optional[Tuple[mindspore.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
        cache_position: Optional[mindspore.Tensor] = None,
        **kwargs,
    ) -> Tuple[
        mindspore.Tensor, Optional[Tuple[mindspore.Tensor, mindspore.Tensor]]
    ]:
        """
        Args:
            hidden_states (`mindspore.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`mindspore.Tensor`, *optional*):
                attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
                query_sequence_length, key_sequence_length)` if default attention is used.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            past_key_value (`Tuple(mindspore.Tensor)`, *optional*): cached past key and value projection states
        """
        residual = hidden_states
        hidden_states = self.attn_norm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
            cache_position=cache_position,
            **kwargs,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.ffn_norm(hidden_states)
        hidden_states = self.ffn(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMDecoderLayer.__init__(config, layer_idx)

Initialize an instance of the OpenELMDecoderLayer class.

PARAMETER DESCRIPTION
self

The instance of the OpenELMDecoderLayer class.

config

The configuration object for OpenELM. It specifies the model configuration settings.

TYPE: OpenELMConfig

layer_idx

The index of the current layer in the decoder stack. It is used for identifying the layer position.

TYPE: int

RETURNS DESCRIPTION
None

None.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
def __init__(self, config: OpenELMConfig, layer_idx: int) -> None:
    """Initialize an instance of the OpenELMDecoderLayer class.

    Args:
        self: The instance of the OpenELMDecoderLayer class.
        config (OpenELMConfig): The configuration object for OpenELM.
            It specifies the model configuration settings.
        layer_idx (int): The index of the current layer in the decoder stack.
            It is used for identifying the layer position.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    self.attn = OpenELMMultiHeadCausalAttention(config=config, layer_idx=layer_idx)
    self.ffn = OpenELMFeedForwardNetwork(config=config, layer_idx=layer_idx)
    self.ffn_norm = OpenELMRMSNorm(
        num_features=config.model_dim,
    )
    self.attn_norm = OpenELMRMSNorm(
        num_features=config.model_dim,
    )

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMDecoderLayer.forward(hidden_states, attention_mask=None, position_ids=None, past_key_value=None, output_attentions=False, use_cache=False, cache_position=None, **kwargs)

PARAMETER DESCRIPTION
hidden_states

input to the layer of shape (batch, seq_len, embed_dim)

TYPE: `mindspore.Tensor`

attention_mask

attention mask of size (batch_size, sequence_length) if flash attention is used or (batch_size, 1, query_sequence_length, key_sequence_length) if default attention is used.

TYPE: `mindspore.Tensor`, *optional* DEFAULT: None

output_attentions

Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

TYPE: `bool`, *optional* DEFAULT: False

use_cache

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

TYPE: `bool`, *optional* DEFAULT: False

past_key_value

cached past key and value projection states

TYPE: `Tuple(mindspore.Tensor)`, *optional* DEFAULT: None
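
A hedged sketch of how the returned tuple grows with these flags; `layer`, `hidden_states`, and `mask` are assumed to have been constructed elsewhere and are not part of this page:

```python
# Hypothetical call: `layer` is an OpenELMDecoderLayer, `hidden_states` and `mask`
# are MindSpore tensors prepared elsewhere (assumptions for illustration).
outputs = layer(
    hidden_states,
    attention_mask=mask,
    output_attentions=True,
    use_cache=True,
)
hidden_states = outputs[0]       # always returned
self_attn_weights = outputs[1]   # only present because output_attentions=True
present_key_value = outputs[2]   # only present because use_cache=True
```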

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
def forward(
    self,
    hidden_states: mindspore.Tensor,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_value: Optional[Tuple[mindspore.Tensor]] = None,
    output_attentions: Optional[bool] = False,
    use_cache: Optional[bool] = False,
    cache_position: Optional[mindspore.Tensor] = None,
    **kwargs,
) -> Tuple[
    mindspore.Tensor, Optional[Tuple[mindspore.Tensor, mindspore.Tensor]]
]:
    """
    Args:
        hidden_states (`mindspore.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
        attention_mask (`mindspore.Tensor`, *optional*):
            attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
            query_sequence_length, key_sequence_length)` if default attention is used.
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under
            returned tensors for more detail.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
            (see `past_key_values`).
        past_key_value (`Tuple(mindspore.Tensor)`, *optional*): cached past key and value projection states
    """
    residual = hidden_states
    hidden_states = self.attn_norm(hidden_states)

    # Self Attention
    hidden_states, self_attn_weights, present_key_value = self.attn(
        hidden_states=hidden_states,
        attention_mask=attention_mask,
        past_key_value=past_key_value,
        output_attentions=output_attentions,
        use_cache=use_cache,
        cache_position=cache_position,
        **kwargs,
    )
    hidden_states = residual + hidden_states

    # Fully Connected
    residual = hidden_states
    hidden_states = self.ffn_norm(hidden_states)
    hidden_states = self.ffn(hidden_states)
    hidden_states = residual + hidden_states

    outputs = (hidden_states,)

    if output_attentions:
        outputs += (self_attn_weights,)

    if use_cache:
        outputs += (present_key_value,)

    return outputs

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMFeedForwardNetwork

Bases: Module

The OpenELMFeedForwardNetwork class represents a feedforward network layer for the OpenELM model. This class inherits from nn.Module and implements the forward function of the feedforward network layer.

The __init__ method initializes the OpenELMFeedForwardNetwork instance with the provided configuration and layer index. It calculates the intermediate dimensions based on the configuration, initializes the projection layers, and sets the activation function based on the configuration.

The extra_repr method returns a string representation of the instance, including the ffn_with_glu attribute.

The forward method implements the forward function of the feedforward network layer. It takes an input tensor of shape [batch size, sequence length, model dimension], applies the projection layers and activation functions based on the configuration, and returns a tensor of the same shape as the input.
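
The difference between the two branches can be sketched with plain NumPy. SiLU is assumed as the activation here; the real layer looks the activation up in `ACT2FN` via `config.activation_fn_name`, and the weight shapes below are illustrative only:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def ffn_glu(x, w1, w2):
    # Gated variant: w1 has shape [model_dim, 2 * intermediate_dim],
    # w2 has shape [intermediate_dim, model_dim].
    y_1, y_2 = np.split(x @ w1, 2, axis=-1)  # mirrors proj_1 followed by chunk(2)
    return (silu(y_1) * y_2) @ w2            # gate, then proj_2

def ffn_standard(x, w1, w2):
    # Standard variant: w1 has shape [model_dim, intermediate_dim],
    # w2 has shape [intermediate_dim, model_dim].
    return silu(x @ w1) @ w2                 # proj_1 -> activation -> proj_2
```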

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
class OpenELMFeedForwardNetwork(nn.Module):

    """
    The OpenELMFeedForwardNetwork class represents a feedforward network layer for the OpenELM model.
    This class inherits from nn.Module and implements the forward function of the feedforward network layer.

    The __init__ method initializes the OpenELMFeedForwardNetwork instance with the provided configuration and layer index.
    It calculates the intermediate dimensions based on the configuration, initializes the projection layers, and sets
    the activation function based on the configuration.

    The extra_repr method returns a string representation of the instance, including the ffn_with_glu attribute.

    The forward method implements the forward function of the feedforward network layer.
    It takes an input tensor of shape [batch size, sequence length, model dimension], applies the projection layers and
    activation functions based on the configuration, and returns a tensor of the same shape as the input.

    """
    def __init__(self, config: OpenELMConfig, layer_idx: int) -> None:
        """
        Initializes an instance of the OpenELMFeedForwardNetwork class.

        Args:
            self: The instance of the class.
            config (OpenELMConfig): An instance of the OpenELMConfig class containing configuration settings.
            layer_idx (int): The index of the layer in the network.

        Returns:
            None.

        Raises:
            TypeError: If the input parameters are of incorrect types.
            ValueError: If the layer index is out of bounds or if there are any configuration issues.
            KeyError: If the activation function name specified in the config is not found in the ACT2FN dictionary.
        """
        super().__init__()
        ffn_multiplier = config.ffn_multipliers[layer_idx]
        intermediate_dim = int(
            make_divisible(
                ffn_multiplier * config.model_dim,
                divisor=config.ffn_dim_divisor,
            )
        )
        if config.ffn_with_glu:
            # FFN with Gated linear unit, as described in https://arxiv.org/abs/2002.05202v1.
            self.proj_1 = nn.Linear(
                in_channels=config.model_dim,
                out_channels=2 * intermediate_dim,
                bias=False,
            )
            self.proj_2 = nn.Linear(
                in_channels=intermediate_dim,
                out_channels=config.model_dim,
                bias=False,
            )
            self.ffn_with_glu = True
        else:
            # Standard FFN, as described in https://arxiv.org/abs/1706.03762
            self.proj_1 = nn.Linear(
                in_channels=config.model_dim,
                out_channels=intermediate_dim,
                bias=False,
            )
            self.proj_2 = nn.Linear(
                in_channels=intermediate_dim,
                out_channels=config.model_dim,
                bias=False,
            )
            self.ffn_with_glu = False

        self.act = ACT2FN[config.activation_fn_name]

    def extra_repr(self) -> str:
        """
        This method generates a string representation of the OpenELMFeedForwardNetwork object with additional
        information about the feedforward network configuration.

        Args:
            self (OpenELMFeedForwardNetwork): The instance of the OpenELMFeedForwardNetwork class.

        Returns:
            str: A string representation of the OpenELMFeedForwardNetwork object with the additional information
                about the feedforward network configuration including the ffn_with_glu attribute.

        Raises:
            None.
        """
        return super().extra_repr() + f"(ffn_with_glu) : {self.ffn_with_glu}"

    def forward(self, x: Tensor) -> Tensor:
        """Forward function of FFN layer.

        Args:
            x: Input tensor of the shape [batch size, sequence length, model dimension].

        Returns:
            A tensor of the same shape as the input.
        """
        if self.ffn_with_glu:
            y_12 = self.proj_1(x)
            y_1, y_2 = y_12.chunk(2, axis=-1)
            y = self.act(y_1) * y_2
            return self.proj_2(y)
        else:
            return self.proj_2(self.act(self.proj_1(x)))

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMFeedForwardNetwork.__init__(config, layer_idx)

Initializes an instance of the OpenELMFeedForwardNetwork class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An instance of the OpenELMConfig class containing configuration settings.

TYPE: OpenELMConfig

layer_idx

The index of the layer in the network.

TYPE: int

RETURNS DESCRIPTION
None

None.

RAISES DESCRIPTION
TypeError

If the input parameters are of incorrect types.

ValueError

If the layer index is out of bounds or if there are any configuration issues.

KeyError

If the activation function name specified in the config is not found in the ACT2FN dictionary.
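
The intermediate width is derived from the per-layer `ffn_multiplier` and rounded by `make_divisible`. Below is a sketch of that computation, assuming `make_divisible` is the usual round-to-nearest-multiple helper (the actual implementation lives elsewhere in mindnlp) and using hypothetical configuration values:

```python
# Assumed behaviour of make_divisible: round to the nearest multiple of `divisor`,
# never dropping below 90% of the original value (sketch only, not mindnlp's code).
def make_divisible(v, divisor, min_value=None):
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v

# Hypothetical values: model_dim=1536, ffn_multiplier=1.3, ffn_dim_divisor=256.
intermediate_dim = int(make_divisible(1.3 * 1536, divisor=256))
print(intermediate_dim)  # 2048
```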

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
def __init__(self, config: OpenELMConfig, layer_idx: int) -> None:
    """
    Initializes an instance of the OpenELMFeedForwardNetwork class.

    Args:
        self: The instance of the class.
        config (OpenELMConfig): An instance of the OpenELMConfig class containing configuration settings.
        layer_idx (int): The index of the layer in the network.

    Returns:
        None.

    Raises:
        TypeError: If the input parameters are of incorrect types.
        ValueError: If the layer index is out of bounds or if there are any configuration issues.
        KeyError: If the activation function name specified in the config is not found in the ACT2FN dictionary.
    """
    super().__init__()
    ffn_multiplier = config.ffn_multipliers[layer_idx]
    intermediate_dim = int(
        make_divisible(
            ffn_multiplier * config.model_dim,
            divisor=config.ffn_dim_divisor,
        )
    )
    if config.ffn_with_glu:
        # FFN with Gated linear unit, as described in https://arxiv.org/abs/2002.05202v1.
        self.proj_1 = nn.Linear(
            in_channels=config.model_dim,
            out_channels=2 * intermediate_dim,
            bias=False,
        )
        self.proj_2 = nn.Linear(
            in_channels=intermediate_dim,
            out_channels=config.model_dim,
            bias=False,
        )
        self.ffn_with_glu = True
    else:
        # Standard FFN, as described in https://arxiv.org/abs/1706.03762
        self.proj_1 = nn.Linear(
            in_channels=config.model_dim,
            out_channels=intermediate_dim,
            bias=False,
        )
        self.proj_2 = nn.Linear(
            in_channels=intermediate_dim,
            out_channels=config.model_dim,
            bias=False,
        )
        self.ffn_with_glu = False

    self.act = ACT2FN[config.activation_fn_name]

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMFeedForwardNetwork.extra_repr()

This method generates a string representation of the OpenELMFeedForwardNetwork object with additional information about the feedforward network configuration.

PARAMETER DESCRIPTION
self

The instance of the OpenELMFeedForwardNetwork class.

TYPE: OpenELMFeedForwardNetwork

RETURNS DESCRIPTION
str

A string representation of the OpenELMFeedForwardNetwork object with the additional information about the feedforward network configuration including the ffn_with_glu attribute.

TYPE: str

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
def extra_repr(self) -> str:
    """
    This method generates a string representation of the OpenELMFeedForwardNetwork object with additional
    information about the feedforward network configuration.

    Args:
        self (OpenELMFeedForwardNetwork): The instance of the OpenELMFeedForwardNetwork class.

    Returns:
        str: A string representation of the OpenELMFeedForwardNetwork object with the additional information
            about the feedforward network configuration including the ffn_with_glu attribute.

    Raises:
        None.
    """
    return super().extra_repr() + f"(ffn_with_glu) : {self.ffn_with_glu}"

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMFeedForwardNetwork.forward(x)

Forward function of FFN layer.

PARAMETER DESCRIPTION
x

Input tensor of the shape [batch size, sequence length, model dimension].

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

A tensor of the same shape as the input.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
def forward(self, x: Tensor) -> Tensor:
    """Forward function of FFN layer.

    Args:
        x: Input tensor of the shape [batch size, sequence length, model dimension].

    Returns:
        A tensor of the same shape as the input.
    """
    if self.ffn_with_glu:
        y_12 = self.proj_1(x)
        y_1, y_2 = y_12.chunk(2, axis=-1)
        y = self.act(y_1) * y_2
        return self.proj_2(y)
    else:
        return self.proj_2(self.act(self.proj_1(x)))

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMForCausalLM

Bases: OpenELMPreTrainedModel

This class represents an OpenELM model for Causal Language Modeling (LM). It is designed for generating text based on input sequences and predicting the next token in a sequence. The class includes methods for setting and getting input and output embeddings, setting the decoder, forwarding the model for generation, and preparing inputs for text generation. Additionally, it provides a static method for reordering cache during generation. The class inherits from OpenELMPreTrainedModel and implements functionality specific to Causal LM tasks.
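
A hedged usage sketch follows. The Auto classes, the `apple/OpenELM-270M` checkpoint name, the Llama-2 tokenizer, and `return_tensors="ms"` are assumptions about how OpenELM checkpoints are typically consumed with mindnlp, not something this page documents:

```python
from mindnlp.transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint and tokenizer identifiers are placeholders / assumptions.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("apple/OpenELM-270M")

inputs = tokenizer("OpenELM is a family of", return_tensors="ms")
generated = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```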

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
class OpenELMForCausalLM(OpenELMPreTrainedModel):

    """
    This class represents an OpenELM model for Causal Language Modeling (LM).
    It is designed for generating text based on input sequences and predicting the next token in a sequence.
    The class includes methods for setting and getting input and output embeddings, setting the decoder,
    forwarding the model for generation, and preparing inputs for text generation.
    Additionally, it provides a static method for reordering cache during generation.
    The class inherits from OpenELMPreTrainedModel and implements functionality specific to Causal LM tasks.
    """
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config: OpenELMConfig):
        """
        Initializes an instance of the OpenELMForCausalLM class.

        Args:
            self: The current object instance.
            config (OpenELMConfig): An instance of OpenELMConfig class containing the configuration settings
                for the OpenELM model.

        Returns:
            None

        Raises:
            None

        This method initializes the OpenELMForCausalLM object by setting its configuration, transformer, vocab_size,
        and lm_head attributes. The config parameter is an instance of OpenELMConfig class and is required to configure
        the OpenELM model.

        Attributes:
            self.transformer: An instance of the OpenELMModel class.
            self.vocab_size: An integer representing the size of the vocabulary used in the model.
            self.lm_head: An instance of the nn.Linear class or None depending on the value of
                config.share_input_output_layers.

        Note:
            The OpenELMModel and nn.Linear classes are imported from the appropriate libraries.

        Example:
            ```python
            >>> config = OpenELMConfig(vocab_size=10000, share_input_output_layers=False)
            >>> open_elm = OpenELMForCausalLM(config)
            ```
        """
        super().__init__(config)
        self.transformer = OpenELMModel(config)
        self.vocab_size = config.vocab_size
        if config.share_input_output_layers:
            self.lm_head = None
        else:
            self.lm_head = nn.Linear(config.model_dim, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        Retrieve the input embeddings from the OpenELMForCausalLM model.

        Args:
            self (OpenELMForCausalLM): The instance of OpenELMForCausalLM.

        Returns:
            token_embeddings: The transformer's token embedding module, which provides the input embeddings.

        Raises:
            None.
        """
        return self.transformer.token_embeddings

    def set_input_embeddings(self, value):
        """
        Sets the input embeddings for the OpenELMForCausalLM model.

        Args:
            self (OpenELMForCausalLM): The instance of the OpenELMForCausalLM class.
            value (torch.Tensor): The input embeddings to be set for the model. It should be a torch.Tensor object.

        Returns:
            None.

        Raises:
            None.
        """
        self.transformer.token_embeddings = value

    def get_output_embeddings(self):
        """
        Returns the output embeddings of the OpenELMForCausalLM model.

        Args:
            self: An instance of the OpenELMForCausalLM class.

        Returns:
            lm_head: The method returns the output embeddings of the OpenELMForCausalLM model.

        Raises:
            None.
        """
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        """
        This method sets the output embeddings for the OpenELMForCausalLM class.

        Args:
            self (OpenELMForCausalLM): The instance of the OpenELMForCausalLM class.
            new_embeddings (object): The new output embeddings to be set for the OpenELMForCausalLM instance.

        Returns:
            None.

        Raises:
            None.
        """
        self.lm_head = new_embeddings

    def set_decoder(self, decoder):
        """Set the decoder for the OpenELMForCausalLM instance.

        This method allows setting the decoder for the OpenELMForCausalLM instance.
        The decoder is used to transform the input data.

        Args:
            self (OpenELMForCausalLM): The instance of the OpenELMForCausalLM class.
            decoder: The decoder to be set. It should be compatible with the OpenELMForCausalLM instance.

        Returns:
            None.

        Raises:
            None.
        """
        self.transformer = decoder

    def get_decoder(self):
        """
        This method returns the transformer for OpenELMForCausalLM.

        Args:
            self (object): The instance of the OpenELMForCausalLM class.

        Returns:
            OpenELMModel: The transformer (decoder) instance associated with the OpenELMForCausalLM model.

        Raises:
            None
        """
        return self.transformer

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[mindspore.Tensor] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        """
        This method forwards a Causal Language Model for OpenELM.

        Args:
            self (object): The instance of the class.
            input_ids (mindspore.Tensor, optional): The input tensor containing token IDs. Default is None.
            attention_mask (mindspore.Tensor, optional): An optional tensor for masking tokens. Default is None.
            position_ids (mindspore.Tensor, optional): An optional tensor containing position IDs. Default is None.
            past_key_values (List[mindspore.Tensor], optional): A list of tensors representing past key values.
                Default is None.
            inputs_embeds (mindspore.Tensor, optional): An optional tensor of input embeddings. Default is None.
            labels (mindspore.Tensor, optional): An optional tensor containing labels. Default is None.
            use_cache (bool, optional): A flag indicating whether to use cache. Default is None.
            output_attentions (bool, optional): A flag indicating whether to output attentions. Default is None.
            output_hidden_states (bool, optional): A flag indicating whether to output hidden states. Default is None.
            return_dict (bool, optional): A flag indicating whether to return a dictionary. Default is None.
            cache_position (mindspore.Tensor, optional): An optional tensor for cache position. Default is None.

        Returns:
            Union[Tuple, CausalLMOutputWithPast]: The output of the method, which can be a tuple or an instance of
                CausalLMOutputWithPast. If return_dict is False, the return value includes loss, logits,
                past key values, hidden states, and attentions.

        Raises:
            None
        """
        output_attentions = (
            output_attentions
            if output_attentions is not None
            else self.config.output_attentions
        )
        output_hidden_states = (
            output_hidden_states
            if output_hidden_states is not None
            else self.config.output_hidden_states
        )
        return_dict = (
            return_dict if return_dict is not None else self.config.use_return_dict
        )
        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs = self.transformer(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
            cache_position=cache_position,
        )

        hidden_states = outputs[0]
        if self.lm_head is None:
            # shared
            logits = ops.dense(
                hidden_states, weight=self.transformer.token_embeddings.weight
            )
        else:
            logits = self.lm_head(hidden_states)
        logits = logits[:, : self.config.vocab_size]
        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :]
            shift_labels = labels[..., 1:]
            # Flatten the tokens
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            loss = ops.cross_entropy(shift_logits, shift_labels)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def prepare_inputs_for_generation(
        self,
        input_ids,
        past_key_values=None,
        attention_mask=None,
        inputs_embeds=None,
        **kwargs,
    ):
        """
        Prepares the inputs for generation in the OpenELMForCausalLM class.

        Args:
            self (OpenELMForCausalLM): The instance of the OpenELMForCausalLM class.
            input_ids (Tensor): The input tensor of shape [batch_size, sequence_length] containing the token indices.
            past_key_values (Optional[Union[Cache, Tuple[Tensor]]]): The past key-value states.
                If provided, should be either an instance of Cache or a tuple containing tensors. Defaults to None.
            attention_mask (Optional[Tensor]): The attention mask tensor of shape [batch_size, sequence_length].
                If provided, it masks the attention scores. Defaults to None.
            inputs_embeds (Optional[Tensor]): The embedded inputs tensor of shape
                [batch_size, sequence_length, hidden_size]. If provided, it replaces input_ids. Defaults to None.

        Returns:
            model_inputs (Dict[str, Tensor]): A dictionary containing the model inputs for generation.
                It has the following keys:

                - 'inputs_embeds' (Tensor): The embedded inputs tensor.
                It is included if inputs_embeds is provided and past_key_values is None.
                - 'input_ids' (Tensor): The input tensor with token indices.
                It is included if inputs_embeds is None or past_key_values is not None.
                - 'position_ids' (Tensor): The token position indices tensor of shape [batch_size, sequence_length].
                - 'cache_position' (Tensor): The tensor containing the positions for caching of shape [sequence_length].
                - 'past_key_values' (Union[Cache, Tuple[Tensor]]): The past key-value states.
                - 'use_cache' (Optional[bool]): Whether to use cache for generation. Defaults to None.
                - 'attention_mask' (Optional[Tensor]): The attention mask tensor of shape [batch_size, sequence_length].
                It is included if attention_mask is provided.

        Raises:
            None
        """
        past_length = 0
        if past_key_values is not None:
            if isinstance(past_key_values, Cache):
                cache_length = past_key_values.get_seq_length()
                past_length = past_key_values.seen_tokens
                max_cache_length = past_key_values.get_max_length()
            else:
                cache_length = past_length = past_key_values[0][0].shape[2]
                max_cache_length = None

            # Keep only the unprocessed tokens:
            # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
            # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
            # input)
            if (
                attention_mask is not None
                and attention_mask.shape[1] > input_ids.shape[1]
            ):
                input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
            # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
            # input_ids based on the past_length.
            elif past_length < input_ids.shape[1]:
                input_ids = input_ids[:, past_length:]
            # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.

            # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
            if (
                max_cache_length is not None
                and attention_mask is not None
                and cache_length + input_ids.shape[1] > max_cache_length
            ):
                attention_mask = attention_mask[:, -max_cache_length:]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids = position_ids.masked_fill(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -input_ids.shape[1] :]

        if self.generation_config.cache_implementation == "static":
            # generation with static cache
            cache_position = kwargs.get("cache_position", None)
            if cache_position is None:
                past_length = 0
            else:
                past_length = cache_position[-1] + 1
            input_ids = input_ids[:, past_length:]
            position_ids = position_ids[:, past_length:]

        # we should only keep a `cache_position` in generate, and do +=1.
        # same goes for position ids. Could also help with continued generation.
        cache_position = ops.arange(
            past_length,
            past_length + position_ids.shape[-1],
        )

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            # The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
            # recompiles graphs as the stride of the inputs is a guard. Ref: https://github.com/huggingface/transformers/pull/29114
            # We could use `next_tokens` directly instead.
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "position_ids": position_ids,
                "cache_position": cache_position,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
            }
        )
        return model_inputs

    @staticmethod
    def _reorder_cache(past_key_values, beam_idx):
        """
        Reorders the cache of past key values based on the provided beam index.

        Args:
            past_key_values (tuple): A tuple containing the past key values for each layer.
                Each element in the tuple represents the past key values for a layer.
            beam_idx (tensor): An index tensor specifying the order in which the cache should be reordered.

        Returns:
            tuple: The reordered 'past_key_values', with each cached state re-indexed along the batch dimension according to 'beam_idx'.

        Raises:
            ValueError: If the 'beam_idx' tensor is not valid or if the dimensions of 'past_key_values'
                are not as expected.
            IndexError: If the index specified in 'beam_idx' is out of range for the 'past_key_values'.
        """
        reordered_past = ()
        for layer_past in past_key_values:
            reordered_past += (
                tuple(
                    past_state.index_select(0, beam_idx)
                    for past_state in layer_past
                ),
            )
        return reordered_past

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMForCausalLM.__init__(config)

Initializes an instance of the OpenELMForCausalLM class.

PARAMETER DESCRIPTION
self

The current object instance.

config

An instance of OpenELMConfig class containing the configuration settings for the OpenELM model.

TYPE: OpenELMConfig

RETURNS DESCRIPTION

None

This method initializes the OpenELMForCausalLM object by setting its configuration, transformer, vocab_size, and lm_head attributes. The config parameter is an instance of OpenELMConfig class and is required to configure the OpenELM model.

ATTRIBUTE DESCRIPTION
self.transformer

An instance of the OpenELMModel class.

self.vocab_size

An integer representing the size of the vocabulary used in the model.

self.lm_head

An instance of the nn.Linear class or None depending on the value of config.share_input_output_layers.

Note

The OpenELMModel and nn.Linear classes are imported from the appropriate libraries.

Example
>>> config = OpenELMConfig(vocab_size=10000, share_input_output_layers=False)
>>> open_elm = OpenELMForCausalLM(config)
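
A hedged follow-up to the example above: with share_input_output_layers=True (assumed to be an accepted flag combination, mirroring the constructor call shown), no separate lm_head is created and the output projection reuses the token embeddings instead.

>>> tied_config = OpenELMConfig(vocab_size=10000, share_input_output_layers=True)
>>> tied_model = OpenELMForCausalLM(tied_config)
>>> tied_model.get_output_embeddings() is None
True
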
Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
def __init__(self, config: OpenELMConfig):
    """
    Initializes an instance of the OpenELMForCausalLM class.

    Args:
        self: The current object instance.
        config (OpenELMConfig): An instance of OpenELMConfig class containing the configuration settings
            for the OpenELM model.

    Returns:
        None

    Raises:
        None

    This method initializes the OpenELMForCausalLM object by setting its configuration, transformer, vocab_size,
    and lm_head attributes. The config parameter is an instance of OpenELMConfig class and is required to configure
    the OpenELM model.

    Attributes:
        self.transformer: An instance of the OpenELMModel class.
        self.vocab_size: An integer representing the size of the vocabulary used in the model.
        self.lm_head: An instance of the nn.Linear class or None depending on the value of
            config.share_input_output_layers.

    Note:
        The OpenELMModel and nn.Linear classes are imported from the appropriate libraries.

    Example:
        ```python
        >>> config = OpenELMConfig(vocab_size=10000, share_input_output_layers=False)
        >>> open_elm = OpenELMForCausalLM(config)
        ```
    """
    super().__init__(config)
    self.transformer = OpenELMModel(config)
    self.vocab_size = config.vocab_size
    if config.share_input_output_layers:
        self.lm_head = None
    else:
        self.lm_head = nn.Linear(config.model_dim, config.vocab_size, bias=False)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMForCausalLM.forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None, cache_position=None)

This method forwards a Causal Language Model for OpenELM.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: object

input_ids

The input tensor containing token IDs. Default is None.

TYPE: Tensor DEFAULT: None

attention_mask

An optional tensor for masking tokens. Default is None.

TYPE: Tensor DEFAULT: None

position_ids

An optional tensor containing position IDs. Default is None.

TYPE: Tensor DEFAULT: None

past_key_values

A list of tensors representing past key values. Default is None.

TYPE: List[Tensor] DEFAULT: None

inputs_embeds

An optional tensor of input embeddings. Default is None.

TYPE: Tensor DEFAULT: None

labels

An optional tensor containing labels. Default is None.

TYPE: Tensor DEFAULT: None

use_cache

A flag indicating whether to use cache. Default is None.

TYPE: bool DEFAULT: None

output_attentions

A flag indicating whether to output attentions. Default is None.

TYPE: bool DEFAULT: None

output_hidden_states

A flag indicating whether to output hidden states. Default is None.

TYPE: bool DEFAULT: None

return_dict

A flag indicating whether to return a dictionary. Default is None.

TYPE: bool DEFAULT: None

cache_position

An optional tensor for cache position. Default is None.

TYPE: Tensor DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, CausalLMOutputWithPast]

Union[Tuple, CausalLMOutputWithPast]: The output of the method, which can be a tuple or an instance of CausalLMOutputWithPast. If return_dict is False, the return value includes loss, logits, past key values, hidden states, and attentions.
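
A hedged sketch of both return modes, reusing the hypothetical `model` and `inputs` from the generation sketch earlier on this page:

```python
# With labels the model computes the shifted next-token cross-entropy loss.
out = model(
    input_ids=inputs["input_ids"],
    labels=inputs["input_ids"],
    return_dict=True,
)
print(out.loss, out.logits.shape)  # scalar loss and the per-token logits

# With return_dict=False the same values come back as a plain tuple.
loss, logits = model(
    input_ids=inputs["input_ids"],
    labels=inputs["input_ids"],
    return_dict=False,
)[:2]
```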

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    cache_position: Optional[mindspore.Tensor] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
    """
    This method forwards a Causal Language Model for OpenELM.

    Args:
        self (object): The instance of the class.
        input_ids (mindspore.Tensor, optional): The input tensor containing token IDs. Default is None.
        attention_mask (mindspore.Tensor, optional): An optional tensor for masking tokens. Default is None.
        position_ids (mindspore.Tensor, optional): An optional tensor containing position IDs. Default is None.
        past_key_values (List[mindspore.Tensor], optional): A list of tensors representing past key values.
            Default is None.
        inputs_embeds (mindspore.Tensor, optional): An optional tensor of input embeddings. Default is None.
        labels (mindspore.Tensor, optional): An optional tensor containing labels. Default is None.
        use_cache (bool, optional): A flag indicating whether to use cache. Default is None.
        output_attentions (bool, optional): A flag indicating whether to output attentions. Default is None.
        output_hidden_states (bool, optional): A flag indicating whether to output hidden states. Default is None.
        return_dict (bool, optional): A flag indicating whether to return a dictionary. Default is None.
        cache_position (mindspore.Tensor, optional): An optional tensor for cache position. Default is None.

    Returns:
        Union[Tuple, CausalLMOutputWithPast]: The output of the method, which can be a tuple or an instance of
            CausalLMOutputWithPast. If return_dict is False, the return value includes loss, logits,
            past key values, hidden states, and attentions.

    Raises:
        None
    """
    output_attentions = (
        output_attentions
        if output_attentions is not None
        else self.config.output_attentions
    )
    output_hidden_states = (
        output_hidden_states
        if output_hidden_states is not None
        else self.config.output_hidden_states
    )
    return_dict = (
        return_dict if return_dict is not None else self.config.use_return_dict
    )
    # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
    outputs = self.transformer(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
        cache_position=cache_position,
    )

    hidden_states = outputs[0]
    if self.lm_head is None:
        # shared
        logits = ops.dense(
            hidden_states, weight=self.transformer.token_embeddings.weight
        )
    else:
        logits = self.lm_head(hidden_states)
    logits = logits[:, : self.config.vocab_size]
    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        shift_logits = logits[..., :-1, :]
        shift_labels = labels[..., 1:]
        # Flatten the tokens
        shift_logits = shift_logits.view(-1, self.config.vocab_size)
        shift_labels = shift_labels.view(-1)
        loss = ops.cross_entropy(shift_logits, shift_labels)

    if not return_dict:
        output = (logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return CausalLMOutputWithPast(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMForCausalLM.get_decoder()

This method returns the transformer for OpenELMForCausalLM.

PARAMETER DESCRIPTION
self

The instance of the OpenELMForCausalLM class.

TYPE: object

RETURNS DESCRIPTION
OpenELMModel

The transformer (decoder) instance associated with the OpenELMForCausalLM model.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
def get_decoder(self):
    """
    This method returns the transformer for OpenELMForCausalLM.

    Args:
        self (object): The instance of the OpenELMForCausalLM class.

    Returns:
        OpenELMModel: The transformer (decoder) instance associated with the OpenELMForCausalLM model.

    Raises:
        None
    """
    return self.transformer

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMForCausalLM.get_input_embeddings()

Retrieve the input embeddings from the OpenELMForCausalLM model.

PARAMETER DESCRIPTION
self

The instance of OpenELMForCausalLM.

TYPE: OpenELMForCausalLM

RETURNS DESCRIPTION
token_embeddings

The transformer's token embedding module, which provides the input embeddings.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
def get_input_embeddings(self):
    """
    Retrieve the input embeddings from the OpenELMForCausalLM model.

    Args:
        self (OpenELMForCausalLM): The instance of OpenELMForCausalLM.

    Returns:
        token_embeddings: The transformer's token embedding module, which provides the input embeddings.

    Raises:
        None.
    """
    return self.transformer.token_embeddings

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMForCausalLM.get_output_embeddings()

Returns the output embeddings of the OpenELMForCausalLM model.

PARAMETER DESCRIPTION
self

An instance of the OpenELMForCausalLM class.

RETURNS DESCRIPTION
lm_head

The method returns the output embeddings of the OpenELMForCausalLM model.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
def get_output_embeddings(self):
    """
    Returns the output embeddings of the OpenELMForCausalLM model.

    Args:
        self: An instance of the OpenELMForCausalLM class.

    Returns:
        lm_head: The method returns the output embeddings of the OpenELMForCausalLM model.

    Raises:
        None.
    """
    return self.lm_head

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMForCausalLM.prepare_inputs_for_generation(input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs)

Prepares the inputs for generation in the OpenELMForCausalLM class.

PARAMETER DESCRIPTION
self

The instance of the OpenELMForCausalLM class.

TYPE: OpenELMForCausalLM

input_ids

The input tensor of shape [batch_size, sequence_length] containing the token indices.

TYPE: Tensor

past_key_values

The past key-value states. If provided, should be either an instance of Cache or a tuple containing tensors. Defaults to None.

TYPE: Optional[Union[Cache, Tuple[Tensor]]] DEFAULT: None

attention_mask

The attention mask tensor of shape [batch_size, sequence_length]. If provided, it masks the attention scores. Defaults to None.

TYPE: Optional[Tensor] DEFAULT: None

inputs_embeds

The embedded inputs tensor of shape [batch_size, sequence_length, hidden_size]. If provided, it replaces input_ids. Defaults to None.

TYPE: Optional[Tensor] DEFAULT: None

RETURNS DESCRIPTION
model_inputs

A dictionary containing the model inputs for generation. It has the following keys:

  • 'inputs_embeds' (Tensor): The embedded inputs tensor. It is included if inputs_embeds is provided and past_key_values is None.
  • 'input_ids' (Tensor): The input tensor with token indices. It is included if inputs_embeds is None or past_key_values is not None.
  • 'position_ids' (Tensor): The token position indices tensor of shape [batch_size, sequence_length].
  • 'cache_position' (Tensor): The tensor containing the positions for caching of shape [sequence_length].
  • 'past_key_values' (Union[Cache, Tuple[Tensor]]): The past key-value states.
  • 'use_cache' (Optional[bool]): Whether to use cache for generation. Defaults to None.
  • 'attention_mask' (Optional[Tensor]): The attention mask tensor of shape [batch_size, sequence_length]. It is included if attention_mask is provided.

TYPE: Dict[str, Tensor]
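
A toy NumPy illustration of the token-pruning rule implemented in the source below: when the cache already holds past_length tokens and input_ids still contains the full sequence, only the unprocessed tail is kept (values are hypothetical):

```python
import numpy as np

input_ids = np.array([[11, 12, 13, 14, 15]])  # 5 tokens passed in by generate (hypothetical)
past_length = 3                                # 3 of them already sit in the KV cache
print(input_ids[:, past_length:])              # [[14 15]] -> only unprocessed tokens remain
```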

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py
def prepare_inputs_for_generation(
    self,
    input_ids,
    past_key_values=None,
    attention_mask=None,
    inputs_embeds=None,
    **kwargs,
):
    """
    Prepares the inputs for generation in the OpenELMForCausalLM class.

    Args:
        self (OpenELMForCausalLM): The instance of the OpenELMForCausalLM class.
        input_ids (Tensor): The input tensor of shape [batch_size, sequence_length] containing the token indices.
        past_key_values (Optional[Union[Cache, Tuple[Tensor]]]): The past key-value states.
            If provided, should be either an instance of Cache or a tuple containing tensors. Defaults to None.
        attention_mask (Optional[Tensor]): The attention mask tensor of shape [batch_size, sequence_length].
            If provided, it masks the attention scores. Defaults to None.
        inputs_embeds (Optional[Tensor]): The embedded inputs tensor of shape
            [batch_size, sequence_length, hidden_size]. If provided, it replaces input_ids. Defaults to None.

    Returns:
        model_inputs (Dict[str, Tensor]): A dictionary containing the model inputs for generation.
            It has the following keys:

            - 'inputs_embeds' (Tensor): The embedded inputs tensor.
            It is included if inputs_embeds is provided and past_key_values is None.
            - 'input_ids' (Tensor): The input tensor with token indices.
            It is included if inputs_embeds is None or past_key_values is not None.
            - 'position_ids' (Tensor): The token position indices tensor of shape [batch_size, sequence_length].
            - 'cache_position' (Tensor): The tensor containing the positions for caching of shape [sequence_length].
            - 'past_key_values' (Union[Cache, Tuple[Tensor]]): The past key-value states.
            - 'use_cache' (Optional[bool]): Whether to use cache for generation. Defaults to None.
            - 'attention_mask' (Optional[Tensor]): The attention mask tensor of shape [batch_size, sequence_length].
            It is included if attention_mask is provided.

    Raises:
        None
    """
    past_length = 0
    if past_key_values is not None:
        if isinstance(past_key_values, Cache):
            cache_length = past_key_values.get_seq_length()
            past_length = past_key_values.seen_tokens
            max_cache_length = past_key_values.get_max_length()
        else:
            cache_length = past_length = past_key_values[0][0].shape[2]
            max_cache_length = None

        # Keep only the unprocessed tokens:
        # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
        # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
        # input)
        if (
            attention_mask is not None
            and attention_mask.shape[1] > input_ids.shape[1]
        ):
            input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
        # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
        # input_ids based on the past_length.
        elif past_length < input_ids.shape[1]:
            input_ids = input_ids[:, past_length:]
        # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.

        # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
        if (
            max_cache_length is not None
            and attention_mask is not None
            and cache_length + input_ids.shape[1] > max_cache_length
        ):
            attention_mask = attention_mask[:, -max_cache_length:]

    position_ids = kwargs.get("position_ids", None)
    if attention_mask is not None and position_ids is None:
        # create position_ids on the fly for batch generation
        position_ids = attention_mask.long().cumsum(-1) - 1
        position_ids = position_ids.masked_fill(attention_mask == 0, 1)
        if past_key_values:
            position_ids = position_ids[:, -input_ids.shape[1] :]

    if self.generation_config.cache_implementation == "static":
        # generation with static cache
        cache_position = kwargs.get("cache_position", None)
        if cache_position is None:
            past_length = 0
        else:
            past_length = cache_position[-1] + 1
        input_ids = input_ids[:, past_length:]
        position_ids = position_ids[:, past_length:]

    # we should only keep a `cache_position` in generate, and do +=1.
    # same goes for position ids. Could also help with continued generation.
    cache_position = ops.arange(
        past_length,
        past_length + position_ids.shape[-1],
    )

    # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
    if inputs_embeds is not None and past_key_values is None:
        model_inputs = {"inputs_embeds": inputs_embeds}
    else:
        # The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
        # recompiles graphs as the stride of the inputs is a guard. Ref: https://github.com/huggingface/transformers/pull/29114
        # We could use `next_tokens` directly instead.
        model_inputs = {"input_ids": input_ids}

    model_inputs.update(
        {
            "position_ids": position_ids,
            "cache_position": cache_position,
            "past_key_values": past_key_values,
            "use_cache": kwargs.get("use_cache"),
            "attention_mask": attention_mask,
        }
    )
    return model_inputs

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMForCausalLM.set_decoder(decoder)

Set the decoder for the OpenELMForCausalLM instance.

This method allows setting the decoder for the OpenELMForCausalLM instance. The decoder is used to transform the input data.

PARAMETER DESCRIPTION
self

The instance of the OpenELMForCausalLM class.

TYPE: OpenELMForCausalLM

decoder

The decoder to be set. It should be compatible with the OpenELMForCausalLM instance.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 1204-1220)
def set_decoder(self, decoder):
    """Set the decoder for the OpenELMForCausalLM instance.

    This method allows setting the decoder for the OpenELMForCausalLM instance.
    The decoder is used to transform the input data.

    Args:
        self (OpenELMForCausalLM): The instance of the OpenELMForCausalLM class.
        decoder: The decoder to be set. It should be compatible with the OpenELMForCausalLM instance.

    Returns:
        None.

    Raises:
        None.
    """
    self.transformer = decoder

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMForCausalLM.set_input_embeddings(value)

Sets the input embeddings for the OpenELMForCausalLM model.

PARAMETER DESCRIPTION
self

The instance of the OpenELMForCausalLM class.

TYPE: OpenELMForCausalLM

value

The input embeddings to be set for the model. It should be a mindspore.Tensor object.

TYPE: Tensor

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 1157-1171)
def set_input_embeddings(self, value):
    """
    Sets the input embeddings for the OpenELMForCausalLM model.

    Args:
        self (OpenELMForCausalLM): The instance of the OpenELMForCausalLM class.
        value (mindspore.Tensor): The input embeddings to be set for the model. It should be a mindspore.Tensor object.

    Returns:
        None.

    Raises:
        None.
    """
    self.transformer.token_embeddings = value

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMForCausalLM.set_output_embeddings(new_embeddings)

This method sets the output embeddings for the OpenELMForCausalLM class.

PARAMETER DESCRIPTION
self

The instance of the OpenELMForCausalLM class.

TYPE: OpenELMForCausalLM

new_embeddings

The new output embeddings to be set for the OpenELMForCausalLM instance.

TYPE: object

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 1188-1202)
def set_output_embeddings(self, new_embeddings):
    """
    This method sets the output embeddings for the OpenELMForCausalLM class.

    Args:
        self (OpenELMForCausalLM): The instance of the OpenELMForCausalLM class.
        new_embeddings (object): The new output embeddings to be set for the OpenELMForCausalLM instance.

    Returns:
        None.

    Raises:
        None.
    """
    self.lm_head = new_embeddings

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMModel

Bases: OpenELMPreTrainedModel

This class represents an OpenELM model for natural language processing tasks. It is designed to be used for tasks such as language modeling, text generation, and machine translation. The model architecture includes a transformer-based decoder with customizable layers and attention mechanisms.

The OpenELMModel class provides methods for initializing the model with configuration settings, accessing and updating input embeddings, and forwarding the model for inference or training. The forward method handles the main computation flow of the model, including processing input data, applying transformer layers, and generating model outputs. The class also includes helper methods for managing cache, attention masks, and normalization.

The OpenELMModel class is designed to be flexible and efficient, allowing for easy customization of the model architecture and behavior. It inherits from the OpenELMPreTrainedModel class, which provides additional functionality and pre-trained model weights.

For detailed information on each method and parameter, refer to the method docstrings within the class implementation.
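
A minimal usage sketch of the base model follows; it assumes a locally available checkpoint (the name below is a placeholder) and feeds dummy token ids.

import mindspore
from mindspore import ops
from mindnlp.transformers import OpenELMModel

model = OpenELMModel.from_pretrained("apple/OpenELM-270M")  # placeholder checkpoint name

input_ids = ops.ones((1, 8), dtype=mindspore.int64)  # dummy batch of 8 token ids
outputs = model(input_ids=input_ids, use_cache=False, return_dict=True)
print(outputs.last_hidden_state.shape)  # (1, 8, config.model_dim)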

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 765-1082)
class OpenELMModel(OpenELMPreTrainedModel):

    """
    This class represents an OpenELM model for natural language processing tasks.
    It is designed to be used for tasks such as language modeling, text generation, and machine translation.
    The model architecture includes a transformer-based decoder with customizable layers and attention mechanisms.

    The OpenELMModel class provides methods for initializing the model with configuration settings, accessing and
    updating input embeddings, and forwarding the model for inference or training.
    The forward method handles the main computation flow of the model, including processing input data, applying
    transformer layers, and generating model outputs. The class also includes helper methods for managing cache,
    attention masks, and normalization.

    The OpenELMModel class is designed to be flexible and efficient, allowing for easy customization of the
    model architecture and behavior. It inherits from the OpenELMPreTrainedModel class, which provides
    additional functionality and pre-trained model weights.

    For detailed information on each method and parameter, refer to the method docstrings within the
    class implementation.
    """
    config_class = OpenELMConfig

    def __init__(self, config: OpenELMConfig):
        """
        Initializes an instance of the OpenELMModel class.

        Args:
            self: The instance of the class.
            config (OpenELMConfig):
                The configuration object containing the model settings.

                - Type: OpenELMConfig
                - Purpose: Specifies the parameters for the OpenELMModel.
                - Restrictions: Must be of type OpenELMConfig.

        Returns:
            None

        Raises:
            None
        """
        super().__init__(config)
        self.config = config

        self.token_embeddings = nn.Embedding(
            embedding_size=config.model_dim,
            vocab_size=config.vocab_size,
        )

        self.layers = nn.ModuleList([
            OpenELMDecoderLayer(config=config, layer_idx=layer_idx)
            for layer_idx in range(config.num_transformer_layers)]
        )
        self.norm = OpenELMRMSNorm(num_features=config.model_dim)
        if config.share_input_output_layers:
            self.classifier = None
        else:
            self.classifier = nn.Linear(
                in_channels=config.model_dim,
                out_channels=config.vocab_size,
                bias=False,
            )
        self.num_transformer_layers = config.num_transformer_layers
        self.gradient_checkpointing = False

        # Register a causal mask to separate causal and padding mask creation. Merging happens in the attention class.
        # NOTE: This is not friendly with TorchScript, ONNX, ExportedProgram serialization for very large `max_context_length`.
        causal_mask = ops.full(
            (config.max_context_length, config.max_context_length),
            fill_value=True,
            dtype=mindspore.bool_,
        )
        self.causal_mask = ops.triu(causal_mask, diagonal=1)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        Returns the input embeddings for the OpenELMModel.

        Args:
            self (OpenELMModel): An instance of the OpenELMModel class.

        Returns:
            nn.Embedding: The model's token embedding layer.

        Raises:
            None.

        This method returns the token embedding layer stored in the `token_embeddings` attribute,
        which maps token ids to the input embeddings consumed by the rest of the model.
        """
        return self.token_embeddings

    def set_input_embeddings(self, new_embeddings: mindspore.Tensor):
        """
        Set the input embeddings for the OpenELMModel.

        Args:
            self (OpenELMModel): The instance of the OpenELMModel class.
            new_embeddings (mindspore.Tensor): A tensor containing the new embeddings to be set as input.

        Returns:
            None.

        Raises:
            None.
        """
        self.token_embeddings = new_embeddings

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        cache_position: Optional[mindspore.Tensor] = None,
    ) -> Union[Tuple, BaseModelOutputWithPast]:
        """
        Constructs the OpenELMModel.

        Args:
            self (OpenELMModel): The instance of the OpenELMModel class.
            input_ids (mindspore.Tensor, optional): The input tensor ids. Default: None.
            attention_mask (mindspore.Tensor, optional): The attention mask tensor. Default: None.
            position_ids (mindspore.Tensor, optional): The position ids tensor. Default: None.
            past_key_values (List[mindspore.Tensor], optional): The list of past key value tensors. Default: None.
            inputs_embeds (mindspore.Tensor, optional): The input embeddings tensor. Default: None.
            use_cache (bool, optional): Whether to use cache. Default: None.
            output_attentions (bool, optional): Whether to output attentions. Default: None.
            output_hidden_states (bool, optional): Whether to output hidden states. Default: None.
            return_dict (bool, optional): Whether to return a dictionary. Default: None.
            cache_position (mindspore.Tensor, optional): The cache position tensor. Default: None.

        Returns:
            Union[Tuple, BaseModelOutputWithPast]: The output tuple or BaseModelOutputWithPast object.

        Raises:
            ValueError: If both input_ids and inputs_embeds are specified or neither is specified.
            Warning: If use_cache=True is incompatible with gradient checkpointing.

        """
        output_attentions = (
            output_attentions
            if output_attentions is not None
            else self.config.output_attentions
        )
        output_hidden_states = (
            output_hidden_states
            if output_hidden_states is not None
            else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = (
            return_dict if return_dict is not None else self.config.use_return_dict
        )

        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError(
                "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
            )

        if self.gradient_checkpointing and self.training and use_cache:
            logger.warning_once(
                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
            )
            use_cache = False

        if inputs_embeds is None:
            inputs_embeds = self.token_embeddings(input_ids)

        past_seen_tokens = 0
        if use_cache:  # kept for BC (cache positions)
            if not isinstance(past_key_values, StaticCache):
                past_key_values = DynamicCache.from_legacy_cache(past_key_values)
            past_seen_tokens = past_key_values.get_seq_length()

        if cache_position is None:
            cache_position = ops.arange(
                past_seen_tokens,
                past_seen_tokens + inputs_embeds.shape[1],
            )

        if position_ids is None:
            position_ids = cache_position.unsqueeze(0)

        causal_mask = self._update_causal_mask(attention_mask, inputs_embeds)

        # embed positions
        hidden_states = inputs_embeds

        # decoder layers
        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
        next_decoder_cache = None

        for decoder_layer in self.layers:
            if output_hidden_states:
                all_hidden_states += (hidden_states,)

            if self.gradient_checkpointing and self.training:
                layer_outputs = self._gradient_checkpointing_func(
                    decoder_layer.__call__,
                    hidden_states,
                    causal_mask,
                    position_ids,
                    past_key_values,
                    output_attentions,
                    use_cache,
                    cache_position,
                )
            else:
                layer_outputs = decoder_layer(
                    hidden_states,
                    attention_mask=causal_mask,
                    position_ids=position_ids,
                    past_key_value=past_key_values,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
                    cache_position=cache_position,
                )

            hidden_states = layer_outputs[0]

            if use_cache:
                next_decoder_cache = layer_outputs[2 if output_attentions else 1]

            if output_attentions:
                all_self_attns += (layer_outputs[1],)

        hidden_states = self.norm(hidden_states)

        # add hidden states from the last decoder layer
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        next_cache = None
        if use_cache:
            next_cache = (
                next_decoder_cache.to_legacy_cache()
                if isinstance(next_decoder_cache, Cache)
                else next_decoder_cache
            )
        if not return_dict:
            return tuple(
                v
                for v in [hidden_states, next_cache, all_hidden_states, all_self_attns]
                if v is not None
            )
        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
        )

    def _update_causal_mask(self, attention_mask, input_tensor):
        """
        Updates the causal mask used in the OpenELMModel for attention computations.

        Args:
            self (OpenELMModel): The instance of the OpenELMModel class.
            attention_mask (mindspore.Tensor): A tensor containing the attention mask.
                This mask is used to mask certain positions of the input tensor during attention computations.
                If the `_attn_implementation` attribute of the `config` object is set to 'flash_attention_2' and
                the attention_mask contains a 0.0 value, the attention_mask is returned as is. Otherwise,
                if the attention_mask is not provided or does not contain a 0.0 value, it is set to None.
            input_tensor (mindspore.Tensor): The input tensor to the model.
                It has shape (batch_size, seq_length) and represents the input sequences.

        Returns:
            mindspore.Tensor or None: The causal mask merged with the padding mask for use in attention,
                the unmodified attention_mask when flash attention is configured, or None if no mask is required.

        Raises:
            None.
        """
        if self.config._attn_implementation == "flash_attention_2":
            if attention_mask is not None and 0.0 in attention_mask:
                return attention_mask
            return None

        batch_size, seq_length = input_tensor.shape[:2]
        dtype = input_tensor.dtype

        # support going beyond cached `max_position_embedding`
        if seq_length > self.causal_mask.shape[-1]:
            causal_mask = ops.full(
                (2 * self.causal_mask.shape[-1], 2 * self.causal_mask.shape[-1]),
                fill_value=1,
            )
            self.causal_mask = ops.triu(causal_mask, diagonal=1)

        # We use the current dtype to avoid any overflows
        min_dtype = finfo(dtype, 'min')
        causal_mask = (
            self.causal_mask[None, None, :, :].repeat(batch_size, 1, 1, 1).to(dtype)
            * min_dtype
        )

        causal_mask = causal_mask.to(dtype=dtype)
        if attention_mask is not None and attention_mask.ndim == 2:
            mask_length = attention_mask.shape[-1]
            padding_mask = causal_mask[..., :mask_length].eq(0.0) & attention_mask[
                :, None, None, :
            ].eq(0.0)
            causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(
                padding_mask, min_dtype
            )

        return causal_mask

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMModel.__init__(config)

Initializes an instance of the OpenELMModel class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object containing the model settings.

  • Type: OpenELMConfig
  • Purpose: Specifies the parameters for the OpenELMModel.
  • Restrictions: Must be of type OpenELMConfig.

TYPE: OpenELMConfig

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 787-840)
def __init__(self, config: OpenELMConfig):
    """
    Initializes an instance of the OpenELMModel class.

    Args:
        self: The instance of the class.
        config (OpenELMConfig):
            The configuration object containing the model settings.

            - Type: OpenELMConfig
            - Purpose: Specifies the parameters for the OpenELMModel.
            - Restrictions: Must be of type OpenELMConfig.

    Returns:
        None

    Raises:
        None
    """
    super().__init__(config)
    self.config = config

    self.token_embeddings = nn.Embedding(
        embedding_size=config.model_dim,
        vocab_size=config.vocab_size,
    )

    self.layers = nn.ModuleList([
        OpenELMDecoderLayer(config=config, layer_idx=layer_idx)
        for layer_idx in range(config.num_transformer_layers)]
    )
    self.norm = OpenELMRMSNorm(num_features=config.model_dim)
    if config.share_input_output_layers:
        self.classifier = None
    else:
        self.classifier = nn.Linear(
            in_channels=config.model_dim,
            out_channels=config.vocab_size,
            bias=False,
        )
    self.num_transformer_layers = config.num_transformer_layers
    self.gradient_checkpointing = False

    # Register a causal mask to separate causal and padding mask creation. Merging happens in the attention class.
    # NOTE: This is not friendly with TorchScript, ONNX, ExportedProgram serialization for very large `max_context_length`.
    causal_mask = ops.full(
        (config.max_context_length, config.max_context_length),
        fill_value=True,
        dtype=mindspore.bool_,
    )
    self.causal_mask = ops.triu(causal_mask, diagonal=1)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMModel.forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None, cache_position=None)

Constructs the OpenELMModel.

PARAMETER DESCRIPTION
self

The instance of the OpenELMModel class.

TYPE: OpenELMModel

input_ids

The input tensor ids. Default: None.

TYPE: Tensor DEFAULT: None

attention_mask

The attention mask tensor. Default: None.

TYPE: Tensor DEFAULT: None

position_ids

The position ids tensor. Default: None.

TYPE: Tensor DEFAULT: None

past_key_values

The list of past key value tensors. Default: None.

TYPE: List[Tensor] DEFAULT: None

inputs_embeds

The input embeddings tensor. Default: None.

TYPE: Tensor DEFAULT: None

use_cache

Whether to use cache. Default: None.

TYPE: bool DEFAULT: None

output_attentions

Whether to output attentions. Default: None.

TYPE: bool DEFAULT: None

output_hidden_states

Whether to output hidden states. Default: None.

TYPE: bool DEFAULT: None

return_dict

Whether to return a dictionary. Default: None.

TYPE: bool DEFAULT: None

cache_position

The cache position tensor. Default: None.

TYPE: Tensor DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, BaseModelOutputWithPast]

Union[Tuple, BaseModelOutputWithPast]: The output tuple or BaseModelOutputWithPast object.

RAISES DESCRIPTION
ValueError

If both input_ids and inputs_embeds are specified or neither is specified.

Warning

If use_cache=True is incompatible with gradient checkpointing.
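
A hedged sketch of incremental decoding with this signature: the first call processes the prompt with use_cache=True, and the second call feeds only one new token together with the returned past_key_values. The checkpoint name is again a placeholder.

import mindspore
from mindspore import ops
from mindnlp.transformers import OpenELMModel

model = OpenELMModel.from_pretrained("apple/OpenELM-270M")  # placeholder checkpoint name

prompt_ids = ops.ones((1, 5), dtype=mindspore.int64)
out = model(input_ids=prompt_ids, use_cache=True, return_dict=True)

next_token = ops.ones((1, 1), dtype=mindspore.int64)
out = model(
    input_ids=next_token,
    past_key_values=out.past_key_values,  # cache returned by the previous call
    use_cache=True,
    return_dict=True,
)
print(out.last_hidden_state.shape)  # (1, 1, config.model_dim)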

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 878-1027)
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    cache_position: Optional[mindspore.Tensor] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
    """
    Constructs the OpenELMModel.

    Args:
        self (OpenELMModel): The instance of the OpenELMModel class.
        input_ids (mindspore.Tensor, optional): The input tensor ids. Default: None.
        attention_mask (mindspore.Tensor, optional): The attention mask tensor. Default: None.
        position_ids (mindspore.Tensor, optional): The position ids tensor. Default: None.
        past_key_values (List[mindspore.Tensor], optional): The list of past key value tensors. Default: None.
        inputs_embeds (mindspore.Tensor, optional): The input embeddings tensor. Default: None.
        use_cache (bool, optional): Whether to use cache. Default: None.
        output_attentions (bool, optional): Whether to output attentions. Default: None.
        output_hidden_states (bool, optional): Whether to output hidden states. Default: None.
        return_dict (bool, optional): Whether to return a dictionary. Default: None.
        cache_position (mindspore.Tensor, optional): The cache position tensor. Default: None.

    Returns:
        Union[Tuple, BaseModelOutputWithPast]: The output tuple or BaseModelOutputWithPast object.

    Raises:
        ValueError: If both input_ids and inputs_embeds are specified or neither is specified.
        Warning: If use_cache=True is incompatible with gradient checkpointing.

    """
    output_attentions = (
        output_attentions
        if output_attentions is not None
        else self.config.output_attentions
    )
    output_hidden_states = (
        output_hidden_states
        if output_hidden_states is not None
        else self.config.output_hidden_states
    )
    use_cache = use_cache if use_cache is not None else self.config.use_cache
    return_dict = (
        return_dict if return_dict is not None else self.config.use_return_dict
    )

    if (input_ids is None) ^ (inputs_embeds is not None):
        raise ValueError(
            "You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one"
        )

    if self.gradient_checkpointing and self.training and use_cache:
        logger.warning_once(
            "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`."
        )
        use_cache = False

    if inputs_embeds is None:
        inputs_embeds = self.token_embeddings(input_ids)

    past_seen_tokens = 0
    if use_cache:  # kept for BC (cache positions)
        if not isinstance(past_key_values, StaticCache):
            past_key_values = DynamicCache.from_legacy_cache(past_key_values)
        past_seen_tokens = past_key_values.get_seq_length()

    if cache_position is None:
        cache_position = ops.arange(
            past_seen_tokens,
            past_seen_tokens + inputs_embeds.shape[1],
        )

    if position_ids is None:
        position_ids = cache_position.unsqueeze(0)

    causal_mask = self._update_causal_mask(attention_mask, inputs_embeds)

    # embed positions
    hidden_states = inputs_embeds

    # decoder layers
    all_hidden_states = () if output_hidden_states else None
    all_self_attns = () if output_attentions else None
    next_decoder_cache = None

    for decoder_layer in self.layers:
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        if self.gradient_checkpointing and self.training:
            layer_outputs = self._gradient_checkpointing_func(
                decoder_layer.__call__,
                hidden_states,
                causal_mask,
                position_ids,
                past_key_values,
                output_attentions,
                use_cache,
                cache_position,
            )
        else:
            layer_outputs = decoder_layer(
                hidden_states,
                attention_mask=causal_mask,
                position_ids=position_ids,
                past_key_value=past_key_values,
                output_attentions=output_attentions,
                use_cache=use_cache,
                cache_position=cache_position,
            )

        hidden_states = layer_outputs[0]

        if use_cache:
            next_decoder_cache = layer_outputs[2 if output_attentions else 1]

        if output_attentions:
            all_self_attns += (layer_outputs[1],)

    hidden_states = self.norm(hidden_states)

    # add hidden states from the last decoder layer
    if output_hidden_states:
        all_hidden_states += (hidden_states,)

    next_cache = None
    if use_cache:
        next_cache = (
            next_decoder_cache.to_legacy_cache()
            if isinstance(next_decoder_cache, Cache)
            else next_decoder_cache
        )
    if not return_dict:
        return tuple(
            v
            for v in [hidden_states, next_cache, all_hidden_states, all_self_attns]
            if v is not None
        )
    return BaseModelOutputWithPast(
        last_hidden_state=hidden_states,
        past_key_values=next_cache,
        hidden_states=all_hidden_states,
        attentions=all_self_attns,
    )

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMModel.get_input_embeddings()

Returns the input embeddings for the OpenELMModel.

PARAMETER DESCRIPTION
self

An instance of the OpenELMModel class.

TYPE: OpenELMModel

RETURNS DESCRIPTION

The model's token embedding layer (nn.Embedding).

This method returns the token embedding layer stored in the token_embeddings attribute, which maps token ids to the input embeddings consumed by the rest of the model.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 842-860)
def get_input_embeddings(self):
    """
    Returns the input embeddings for the OpenELMModel.

    Args:
        self (OpenELMModel): An instance of the OpenELMModel class.

    Returns:
        nn.Embedding: The model's token embedding layer.

    Raises:
        None.

    This method returns the token embedding layer stored in the `token_embeddings` attribute,
    which maps token ids to the input embeddings consumed by the rest of the model.
    """
    return self.token_embeddings

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMModel.set_input_embeddings(new_embeddings)

Set the input embeddings for the OpenELMModel.

PARAMETER DESCRIPTION
self

The instance of the OpenELMModel class.

TYPE: OpenELMModel

new_embeddings

A tensor containing the new embeddings to be set as input.

TYPE: Tensor

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 862-876)
def set_input_embeddings(self, new_embeddings: mindspore.Tensor):
    """
    Set the input embeddings for the OpenELMModel.

    Args:
        self (OpenELMModel): The instance of the OpenELMModel class.
        new_embeddings (mindspore.Tensor): A tensor containing the new embeddings to be set as input.

    Returns:
        None.

    Raises:
        None.
    """
    self.token_embeddings = new_embeddings

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMMultiHeadCausalAttention

Bases: Module

This class represents a multi-head causal attention mechanism for OpenELM models. It performs multi-head self-attention computation with optional key and query normalization and caching capabilities.

Inheriting from nn.Module, this class provides functionality for processing input tensors through a multi-head self-attention mechanism, with support for caching key-value pairs for efficient generation tasks.

The class initializes with configuration parameters and layer index, setting up projection layers, position embeddings, normalization options, and output projection layers. It also defines the number of query, key, and value heads, along with transformer dimensions and grouping information.

The 'forward' method performs the forward pass of multi-head self-attention, taking input hidden states, optional attention mask, cached key-value pairs, and other parameters. It computes queries, keys, and values, applies normalization if configured, updates cached key-value pairs if available, incorporates positional embeddings, and performs scaled dot-product attention calculation. Finally, it applies output projection and returns the attention output along with optional attention weights and updated cached key-value pairs.

Note

This class assumes the existence of certain related classes and functions like OpenELMConfig, OpenELMRotaryEmbedding, OpenELMRMSNorm, Cache, nn.Linear, and _scaled_dot_product_attention.
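
The grouped-query expansion used in the forward pass can be illustrated with a small, standalone sketch; the shapes below are made up, and num_groups = num_q_heads // num_k_heads as in the class.

import mindspore
from mindspore import ops

batch, q_heads, kv_heads, seq_len, head_dim = 2, 12, 3, 16, 64
num_groups = q_heads // kv_heads  # 4

keys = ops.zeros((batch, kv_heads, seq_len, head_dim), dtype=mindspore.float32)
values = ops.zeros((batch, kv_heads, seq_len, head_dim), dtype=mindspore.float32)

# Repeat each key/value head num_groups times so that every query head
# has a matching key/value head before scaled dot-product attention.
keys = keys.repeat_interleave(num_groups, dim=1)      # -> (batch, q_heads, seq_len, head_dim)
values = values.repeat_interleave(num_groups, dim=1)  # -> (batch, q_heads, seq_len, head_dim)
print(keys.shape, values.shape)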

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 360-553)
class OpenELMMultiHeadCausalAttention(nn.Module):

    """
    This class represents a multi-head causal attention mechanism for OpenELM models.
    It performs multi-head self-attention computation with optional key and query normalization and caching capabilities.

    Inheriting from nn.Module, this class provides functionality for processing input tensors through a multi-head
    self-attention mechanism, with support for caching key-value pairs for efficient generation tasks.

    The class initializes with configuration parameters and layer index, setting up projection layers,
    position embeddings, normalization options, and output projection layers.
    It also defines the number of query, key, and value heads, along with transformer dimensions and grouping information.

    The 'forward' method performs the forward pass of multi-head self-attention, taking input hidden states,
    optional attention mask, cached key-value pairs, and other parameters.
    It computes queries, keys, and values, applies normalization if configured, updates cached key-value pairs
    if available, incorporates positional embeddings, and performs scaled dot-product attention calculation.
    Finally, it applies output projection and returns the attention output along with optional attention weights and
    updated cached key-value pairs.

    Note:
        This class assumes the existence of certain related classes and functions like OpenELMConfig,
        OpenELMRotaryEmbedding, OpenELMRMSNorm, Cache, nn.Linear, and _scaled_dot_product_attention.
    """
    def __init__(self, config: OpenELMConfig, layer_idx: int) -> None:
        '''
        Initializes an instance of the OpenELMMultiHeadCausalAttention class.

        Args:
            self: The instance of the class.
            config (OpenELMConfig): An instance of the OpenELMConfig class containing configuration parameters.
            layer_idx (int): The index of the layer.

        Returns:
            None

        Raises:
            None
        '''
        super().__init__()
        self.layer_idx = layer_idx
        head_dim = config.head_dim
        q_heads = config.num_query_heads[layer_idx]
        k_heads = config.num_kv_heads[layer_idx]
        v_heads = config.num_kv_heads[layer_idx]

        self.qkv_proj = nn.Linear(
            in_channels=config.model_dim,
            out_channels=(q_heads + k_heads + v_heads) * head_dim,
            bias=False,
        )

        self.pos_embedding = OpenELMRotaryEmbedding(
            model_dim=config.head_dim,
            max_seq_length=config.rope_max_length,
            freq_constant=config.rope_freq_constant,
        )

        if config.normalize_qk_projections:
            self.q_norm = OpenELMRMSNorm(
                num_features=config.head_dim,
            )
            self.k_norm = OpenELMRMSNorm(
                num_features=config.head_dim,
            )
        else:
            self.q_norm = None
            self.k_norm = None

        self.out_proj = nn.Linear(
            in_channels=q_heads * head_dim,
            out_channels=config.model_dim,
            bias=False,
        )

        self.head_dim = config.head_dim
        self.num_q_heads = q_heads
        self.num_k_heads = k_heads
        self.num_v_heads = v_heads
        self.transformer_dim = config.model_dim
        self.num_groups = self.num_q_heads // self.num_k_heads

    def extra_repr(self) -> str:
        """
        Returns a string representation of the OpenELMMultiHeadCausalAttention object, including the number of query,
        key, and value heads.

        Args:
            self (OpenELMMultiHeadCausalAttention): The instance of the OpenELMMultiHeadCausalAttention class.

        Returns:
            str: A string representation of the OpenELMMultiHeadCausalAttention object,
                including the number of query, key, and value heads.

        Raises:
            None.

        """
        return (
            super().extra_repr()
            + f"query_heads={self.num_q_heads}, key_heads={self.num_k_heads}, value_heads={self.num_v_heads}"
        )

    def forward(
        self,
        hidden_states: mindspore.Tensor,
        attention_mask: Optional[mindspore.Tensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
        cache_position: Optional[mindspore.Tensor] = None,
    ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]:
        """
        Forward pass of multi-head self-attention.

        Args:
            hidden_states: Input tensor of the shape [batch size, sequence length, model dimension].
            past_key_value: Tensor storing the cached keys and values.
            output_attentions: output attention weights.
            use_cache: Specifies whether to use kv-cache for generation.
            cache_position: used for updating the kv-cache.

        Returns:
            The output of the same shape as the input, optionally with a tensor containing cached keys and values.
        """
        # scaled_dot_product_attention does not return attention weights, set output_attentions to False
        output_attentions = False
        batch_size, seq_length, d_model = hidden_states.shape

        # [B, S, d] --> [B, S, (q_h + k_h + v_h) * h]
        qkv = self.qkv_proj(hidden_states)
        # [B, S, (q_h + k_h + v_h) * h] --> [B, S, (q_h + k_h + v_h), h]
        qkv = qkv.reshape(
            batch_size,
            seq_length,
            self.num_q_heads + self.num_k_heads + self.num_v_heads,
            self.head_dim,
        )
        # [B, S, (q_h + k_h + v_h), h] --> [B, (q_h + k_h + v_h), S, h]
        qkv = qkv.swapaxes(1, 2)
        # [B, (q_h + k_h + v_h), S, h] --> [B, q_h, S, h], [B, k_h, S, h], [B, v_h, S, h]
        queries, keys, values = qkv.split(
            [self.num_q_heads, self.num_k_heads, self.num_v_heads], axis=1
        )

        if self.q_norm is not None:
            queries = self.q_norm(queries)

        if self.k_norm is not None:
            keys = self.k_norm(keys)

        past_key_value = getattr(self, "past_key_value", past_key_value)

        if past_key_value is not None:
            # sin and cos are specific to RoPE models; position_ids needed for the static cache
            # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
            cache_kwargs = {"cache_position": cache_position}
            keys, values = past_key_value.update(
                keys, values, self.layer_idx, cache_kwargs
            )

        # Add positional embedding
        queries, keys = self.pos_embedding(queries, keys)

        if self.num_groups != 1:
            # GQA
            # [B, k_h, S, h] --> [B, q_h, S, h]
            keys = keys.repeat_interleave(self.num_groups, dim=1)
            # [B, v_h, S, h] --> [B, q_h, S, h]
            values = values.repeat_interleave(self.num_groups, dim=1)

        causal_mask = attention_mask
        if attention_mask is not None and cache_position is not None:
            causal_mask = causal_mask[:, :, cache_position, : keys.shape[-2]]

        attn_output, _ = _scaled_dot_product_attention(
            queries,
            keys,
            values,
            attn_mask=causal_mask,
            dropout_p=0.,
            is_causal=False,
            is_training=self.training,
            dtype=queries.dtype
        )

        attn_output = attn_output.swapaxes(1, 2)
        attn_output = attn_output.reshape(
            batch_size, seq_length, self.num_q_heads * self.head_dim
        )
        attn_output = self.out_proj(attn_output)
        if not output_attentions:
            attn_weights = None
        return attn_output, attn_weights, past_key_value

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMMultiHeadCausalAttention.__init__(config, layer_idx)

Initializes an instance of the OpenELMMultiHeadCausalAttention class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An instance of the OpenELMConfig class containing configuration parameters.

TYPE: OpenELMConfig

layer_idx

The index of the layer.

TYPE: int

RETURNS DESCRIPTION
None

None

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 384-440)
def __init__(self, config: OpenELMConfig, layer_idx: int) -> None:
    '''
    Initializes an instance of the OpenELMMultiHeadCausalAttention class.

    Args:
        self: The instance of the class.
        config (OpenELMConfig): An instance of the OpenELMConfig class containing configuration parameters.
        layer_idx (int): The index of the layer.

    Returns:
        None

    Raises:
        None
    '''
    super().__init__()
    self.layer_idx = layer_idx
    head_dim = config.head_dim
    q_heads = config.num_query_heads[layer_idx]
    k_heads = config.num_kv_heads[layer_idx]
    v_heads = config.num_kv_heads[layer_idx]

    self.qkv_proj = nn.Linear(
        in_channels=config.model_dim,
        out_channels=(q_heads + k_heads + v_heads) * head_dim,
        bias=False,
    )

    self.pos_embedding = OpenELMRotaryEmbedding(
        model_dim=config.head_dim,
        max_seq_length=config.rope_max_length,
        freq_constant=config.rope_freq_constant,
    )

    if config.normalize_qk_projections:
        self.q_norm = OpenELMRMSNorm(
            num_features=config.head_dim,
        )
        self.k_norm = OpenELMRMSNorm(
            num_features=config.head_dim,
        )
    else:
        self.q_norm = None
        self.k_norm = None

    self.out_proj = nn.Linear(
        in_channels=q_heads * head_dim,
        out_channels=config.model_dim,
        bias=False,
    )

    self.head_dim = config.head_dim
    self.num_q_heads = q_heads
    self.num_k_heads = k_heads
    self.num_v_heads = v_heads
    self.transformer_dim = config.model_dim
    self.num_groups = self.num_q_heads // self.num_k_heads

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMMultiHeadCausalAttention.extra_repr()

Returns a string representation of the OpenELMMultiHeadCausalAttention object, including the number of query, key, and value heads.

PARAMETER DESCRIPTION
self

The instance of the OpenELMMultiHeadCausalAttention class.

TYPE: OpenELMMultiHeadCausalAttention

RETURNS DESCRIPTION
str

A string representation of the OpenELMMultiHeadCausalAttention object, including the number of query, key, and value heads.

TYPE: str

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 442-461)
def extra_repr(self) -> str:
    """
    Returns a string representation of the OpenELMMultiHeadCausalAttention object, including the number of query,
    key, and value heads.

    Args:
        self (OpenELMMultiHeadCausalAttention): The instance of the OpenELMMultiHeadCausalAttention class.

    Returns:
        str: A string representation of the OpenELMMultiHeadCausalAttention object,
            including the number of query, key, and value heads.

    Raises:
        None.

    """
    return (
        super().extra_repr()
        + f"query_heads={self.num_q_heads}, key_heads={self.num_k_heads}, value_heads={self.num_v_heads}"
    )

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMMultiHeadCausalAttention.forward(hidden_states, attention_mask=None, past_key_value=None, output_attentions=False, use_cache=False, cache_position=None)

Forward pass of multi-head self-attention.

PARAMETER DESCRIPTION
hidden_states

Input tensor of the shape [batch size, sequence length, model dimension].

TYPE: Tensor

past_key_value

Tensor storing the cached keys and values.

TYPE: Optional[Cache] DEFAULT: None

output_attentions

output attention weights.

TYPE: bool DEFAULT: False

use_cache

Specifies whether to use kv-cache for generation.

TYPE: bool DEFAULT: False

cache_position

used for updating the kv-cache.

TYPE: Optional[Tensor] DEFAULT: None

RETURNS DESCRIPTION
Tuple[Tensor, Optional[Tensor], Optional[Tuple[Tensor]]]

The output of the same shape as the input, optionally with a tensor containing cached keys and values.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 463-553)
def forward(
    self,
    hidden_states: mindspore.Tensor,
    attention_mask: Optional[mindspore.Tensor] = None,
    past_key_value: Optional[Cache] = None,
    output_attentions: bool = False,
    use_cache: bool = False,
    cache_position: Optional[mindspore.Tensor] = None,
) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]:
    """
    Forward pass of multi-head self-attention.

    Args:
        hidden_states: Input tensor of the shape [batch size, sequence length, model dimension].
        past_key_value: Tensor storing the cached keys and values.
        output_attentions: output attention weights.
        use_cache: Specifies whether to use kv-cache for generation.
        cache_position: used for updating the kv-cache.

    Returns:
        The output of the same shape as the input, optionally with a tensor containing cached keys and values.
    """
    # scaled_dot_product_attention does not return attention weights, set output_attentions to False
    output_attentions = False
    batch_size, seq_length, d_model = hidden_states.shape

    # [B, S, d] --> [B, S, (q_h + k_h + v_h) * h]
    qkv = self.qkv_proj(hidden_states)
    # [B, S, (q_h + k_h + v_h) * h] --> [B, S, (q_h + k_h + v_h), h]
    qkv = qkv.reshape(
        batch_size,
        seq_length,
        self.num_q_heads + self.num_k_heads + self.num_v_heads,
        self.head_dim,
    )
    # [B, S, (q_h + k_h + v_h), h] --> [B, (q_h + k_h + v_h), S, h]
    qkv = qkv.swapaxes(1, 2)
    # [B, (q_h + k_h + v_h), S, h] --> [B, q_h, S, h], [B, k_h, S, h], [B, v_h, S, h]
    queries, keys, values = qkv.split(
        [self.num_q_heads, self.num_k_heads, self.num_v_heads], axis=1
    )

    if self.q_norm is not None:
        queries = self.q_norm(queries)

    if self.k_norm is not None:
        keys = self.k_norm(keys)

    past_key_value = getattr(self, "past_key_value", past_key_value)

    if past_key_value is not None:
        # sin and cos are specific to RoPE models; position_ids needed for the static cache
        # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
        cache_kwargs = {"cache_position": cache_position}
        keys, values = past_key_value.update(
            keys, values, self.layer_idx, cache_kwargs
        )

    # Add positional embedding
    queries, keys = self.pos_embedding(queries, keys)

    if self.num_groups != 1:
        # GQA
        # [B, k_h, S, h] --> [B, q_h, S, h]
        keys = keys.repeat_interleave(self.num_groups, dim=1)
        # [B, v_h, S, h] --> [B, q_h, S, h]
        values = values.repeat_interleave(self.num_groups, dim=1)

    causal_mask = attention_mask
    if attention_mask is not None and cache_position is not None:
        causal_mask = causal_mask[:, :, cache_position, : keys.shape[-2]]

    attn_output, _ = _scaled_dot_product_attention(
        queries,
        keys,
        values,
        attn_mask=causal_mask,
        dropout_p=0.,
        is_causal=False,
        is_training=self.training,
        dtype=queries.dtype
    )

    attn_output = attn_output.swapaxes(1, 2)
    attn_output = attn_output.reshape(
        batch_size, seq_length, self.num_q_heads * self.head_dim
    )
    attn_output = self.out_proj(attn_output)
    if not output_attentions:
        attn_weights = None
    return attn_output, attn_weights, past_key_value

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMPreTrainedModel

Bases: PreTrainedModel

This class represents a pre-trained model for OpenELM. It is a subclass of PreTrainedModel and implements various methods and functionalities for training and inference.

The class contains an initialization method, '_init_weights', which is responsible for initializing the weights of the model. This method takes a 'cell' parameter, which represents the neural network cell.

The '_init_weights' method initializes the weights differently based on the type of the 'cell' parameter. If the 'cell' is an instance of 'nn.Linear', the weight is initialized using a normal distribution with a range defined by the 'initializer_range' attribute of the 'config' object. If the 'cell' has a bias, it is initialized to zeros.

If the 'cell' is an instance of 'nn.Embedding', the weight is initialized using a normal distribution with a range defined by the 'initializer_range' attribute of the 'config' object. If the 'cell' has a padding index, the weight corresponding to the padding index is set to zero.

If the 'cell' is an instance of 'OpenELMRMSNorm', the weight is initialized to ones.

Note

This class is designed specifically for OpenELM and inherits functionalities from the 'PreTrainedModel' class.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py (lines 97-143)
class OpenELMPreTrainedModel(PreTrainedModel):

    """
    This class represents a pre-trained model for OpenELM.
    It is a subclass of PreTrainedModel and implements various methods and functionalities for training and inference.

    The class contains an initialization method, '_init_weights', which is responsible for initializing the weights
    of the model. This method takes a 'cell' parameter, which represents the neural network cell.

    The '_init_weights' method initializes the weights differently based on the type of the 'cell' parameter.
    If the 'cell' is an instance of 'nn.Linear', the weight is initialized using a normal distribution
    with a range defined by the 'initializer_range' attribute of the 'config' object.
    If the 'cell' has a bias, it is initialized to zeros.

    If the 'cell' is an instance of 'nn.Embedding', the weight is initialized using a normal distribution
    with a range defined by the 'initializer_range' attribute of the 'config' object.
    If the 'cell' has a padding index, the weight corresponding to the padding index is set to zero.

    If the 'cell' is an instance of 'OpenELMRMSNorm', the weight is initialized to ones.

    Note:
        This class is designed specifically for OpenELM and inherits functionalities from the 'PreTrainedModel' class.

    """
    config_class = OpenELMConfig
    base_model_prefix = "transformer"
    supports_gradient_checkpointing = False
    _no_split_modules = ["OpenELMDecoderLayer"]
    _skip_keys_device_placement = "past_key_values"

    def _init_weights(self, cell: nn.Module) -> None:
        """Initialize the weights."""
        if isinstance(cell, nn.Linear):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            cell.weight.initialize(Normal(self.config.initializer_range))
            if cell.bias is not None:
                cell.bias.initialize('zeros')
        elif isinstance(cell, nn.Embedding):
            weight = np.random.normal(0.0, self.config.initializer_range, cell.weight.shape)
            if cell.padding_idx:
                weight[cell.padding_idx] = 0

            cell.weight.set_data(Tensor(weight, cell.weight.dtype))

        elif isinstance(cell, OpenELMRMSNorm):
            cell.weight.initialize('ones')

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMRMSNorm

Bases: Module

This class represents the OpenELMRMSNorm normalization layer, which can be used for normalizing input tensors.

ATTRIBUTE DESCRIPTION
eps

A small value added to the denominator for numerical stability.

TYPE: float

weight

Learnable scaling parameter.

TYPE: Parameter

METHOD DESCRIPTION
__init__

Initialize the OpenELMRMSNorm normalization layer.

_norm

Apply the OpenELMRMSNorm normalization to the input tensor.

forward

Forward pass through the OpenELMRMSNorm layer.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py, lines 31-94
class OpenELMRMSNorm(nn.Module):

    """
    This class represents the OpenELMRMSNorm normalization layer, which can be used for normalizing input tensors. 

    Attributes:
        eps (float): A small value added to the denominator for numerical stability.
        weight (nn.Parameter): Learnable scaling parameter.

    Methods:
        __init__:
            Initialize the OpenELMRMSNorm normalization layer.

        _norm:
            Apply the OpenELMRMSNorm normalization to the input tensor.

        forward:
            Forward pass through the OpenELMRMSNorm layer.

    """
    def __init__(self, num_features: int, eps: float = 1e-6):
        """
        Initialize the OpenELMRMSNorm normalization layer.

        Args:
            dim (int): The dimension of the input tensor.
            eps (float, optional): A small value added to the denominator for numerical stability. Default is 1e-6.

        Attributes:
            eps (float): A small value added to the denominator for numerical stability.
            weight (nn.Parameter): Learnable scaling parameter.

        """
        super().__init__()
        self.eps = eps
        self.weight = Parameter(ops.ones(num_features))
        self.num_features = num_features

    def _norm(self, x: Tensor) -> Tensor:
        """
        Apply the OpenELMRMSNorm normalization to the input tensor.

        Args:
            x (mindspore.Tensor): The input tensor.

        Returns:
            mindspore.Tensor: The normalized tensor.

        """
        return x * ops.rsqrt(x.pow(2).mean(-1, keep_dims=True) + self.eps)

    def forward(self, x: Tensor) -> Tensor:
        """
        Forward pass through the OpenELMRMSNorm layer.

        Args:
            x (mindspore.Tensor): The input tensor.

        Returns:
            mindspore.Tensor: The output tensor after applying OpenELMRMSNorm.

        """
        output = self._norm(x.float()).type_as(x)
        return output * self.weight
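
A minimal usage sketch (values are illustrative; the last line reproduces by hand the same computation that _norm and forward perform above):

from mindspore import ops
from mindnlp.transformers.models.openelm.modeling_openelm import OpenELMRMSNorm

norm = OpenELMRMSNorm(num_features=8)
x = ops.ones((2, 4, 8))  # [batch, seq, features]
y = norm(x)
print(y.shape)  # (2, 4, 8)

# Equivalent manual computation:
manual = x * ops.rsqrt(x.pow(2).mean(-1, keep_dims=True) + norm.eps) * norm.weight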

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMRMSNorm.__init__(num_features, eps=1e-06)

Initialize the OpenELMRMSNorm normalization layer.

PARAMETER DESCRIPTION
dim

The dimension of the input tensor.

TYPE: int

eps

A small value added to the denominator for numerical stability. Default is 1e-6.

TYPE: float DEFAULT: 1e-06

ATTRIBUTE DESCRIPTION
eps

A small value added to the denominator for numerical stability.

TYPE: float

weight

Learnable scaling parameter.

TYPE: Parameter

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py, lines 51-67
def __init__(self, num_features: int, eps: float = 1e-6):
    """
    Initialize the OpenELMRMSNorm normalization layer.

    Args:
        dim (int): The dimension of the input tensor.
        eps (float, optional): A small value added to the denominator for numerical stability. Default is 1e-6.

    Attributes:
        eps (float): A small value added to the denominator for numerical stability.
        weight (nn.Parameter): Learnable scaling parameter.

    """
    super().__init__()
    self.eps = eps
    self.weight = Parameter(ops.ones(num_features))
    self.num_features = num_features

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMRMSNorm.forward(x)

Forward pass through the OpenELMRMSNorm layer.

PARAMETER DESCRIPTION
x

The input tensor.

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

mindspore.Tensor: The output tensor after applying OpenELMRMSNorm.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py, lines 82-94
def forward(self, x: Tensor) -> Tensor:
    """
    Forward pass through the OpenELMRMSNorm layer.

    Args:
        x (mindspore.Tensor): The input tensor.

    Returns:
        mindspore.Tensor: The output tensor after applying OpenELMRMSNorm.

    """
    output = self._norm(x.float()).type_as(x)
    return output * self.weight

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMRotaryEmbedding

Bases: Module

The rotary position embeddings (aka RoPE) from RoFormer (https://arxiv.org/abs/2104.09864).

RoPE encodes the position information of tokens using a rotation matrix, and is able to capture explicit relative positional dependencies.

PARAMETER DESCRIPTION
model_dim

The dimensionality of the model's hidden state.

TYPE: int

max_seq_length

Maximum sequence length.

TYPE: int

freq_constant

A constant used for computing frequencies.

TYPE: int DEFAULT: 10000
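
A minimal usage sketch with made-up shapes ([batch, heads, tokens, head_dim]). Note that the forward pass asserts that the last axis of the key equals model_dim, so for attention tensors model_dim here is the per-head dimension:

from mindspore import ops
from mindnlp.transformers.models.openelm.modeling_openelm import OpenELMRotaryEmbedding

rope = OpenELMRotaryEmbedding(model_dim=64, max_seq_length=128, freq_constant=10000)
queries = ops.ones((1, 6, 8, 64))  # [B, q_heads, S, head_dim]
keys = ops.ones((1, 6, 8, 64))     # [B, k_heads, S, head_dim]
queries, keys = rope(queries, keys)  # same shapes, now carrying positional information
print(queries.shape, keys.shape)     # (1, 6, 8, 64) (1, 6, 8, 64)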

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py, lines 187-357
class OpenELMRotaryEmbedding(nn.Module):
    """
    The rotary position embeddings (aka RoPE) from `RoFormer <https://arxiv.org/abs/2104.09864>`_.

    RoPE encodes the position information of tokens using a rotation matrix, and is able to capture
    explicit relative positional dependencies.

    Args:
        model_dim: The dimensionality of the model's hidden state.
        max_seq_length: Maximum sequence length.
        freq_constant: A constant used for computing frequencies.
    """
    def __init__(
        self, model_dim: int, max_seq_length: int, freq_constant: int = 10000
    ) -> None:
        """
        Initializes the OpenELMRotaryEmbedding instance with the specified parameters.

        Args:
            self: The object itself.
            model_dim (int): The dimension of the model.
            max_seq_length (int): The maximum sequence length.
            freq_constant (int, optional): The frequency constant used in the calculation. Defaults to 10000.

        Returns:
            None.

        Raises:
            None.
        """
        inv_freq = 1.0 / (
            freq_constant
            ** (ops.arange(0, model_dim, 2, dtype=mindspore.float32) / model_dim)
        )
        super().__init__()

        self.model_dim = model_dim
        self.freq_constant = freq_constant
        self.max_seq_length = max_seq_length

        self.inv_freq = inv_freq
        self._cached_cos = None
        self._cached_sin = None
        self._cached_seq_length = max_seq_length
        self._compute_sin_cos_embeddings(max_seq_length)

    def extra_repr(self) -> str:
        """
        This method generates a string representation that includes specific attributes of the OpenELMRotaryEmbedding
        class instance.

        Args:
            self: The instance of the OpenELMRotaryEmbedding class.

        Returns:
            str: A formatted string representing the model_dim, max_seq_length,
                and freq_constant attributes of the instance.

        Raises:
            None.
        """
        return f"\tmodel_dim={self.model_dim}, max_seq_length={self.max_seq_length}, freq_constant={self.freq_constant}"

    def _compute_sin_cos_embeddings(
        self,
        key_len: int,
        key_dtype = mindspore.float32,
    ) -> None:
        """
        Compute sine and cos embeddings.

        Args:
            key_len: Number of tokens in the key embeddings in the transformer model.
            key_dtype: Data type of the key embeddings.

        Returns:
            None

        Note:
            We recalculate the sine and cosine embeddings if any of the following conditions are met:

            1. The number of tokens in key embeddings are greater than the cached sequence length.
            2. Sine and cosine caches are empty.
        """
        if (
            key_len > self._cached_seq_length
            or self._cached_cos is None
            or (self._cached_cos is not None and self._cached_cos.dtype != key_dtype)
            or self._cached_sin is None
            or (self._cached_sin is not None and self._cached_sin.dtype != key_dtype)
        ):
            self._cached_seq_length = max(key_len, self._cached_seq_length)

            # The shape of 'pos_index' is [number of key tokens]
            pos_index = ops.arange(
                self._cached_seq_length,
                dtype=mindspore.float32,
            )
            # The shape of 'pos_index_theta' is [number of key tokens, model dimension]
            pos_index_theta = ops.einsum("i,j->ij", pos_index, self.inv_freq)
            # The shape of 'emb' is [number of key tokens, model dimension]
            emb = ops.cat((pos_index_theta, pos_index_theta), axis=-1)

            # the shape of cos and sin embeddings is [number of key tokens, model_dim]
            cos_emb = emb.cos().to(dtype=key_dtype)
            sin_emb = emb.sin().to(dtype=key_dtype)

            # the shape of cached cos and sin embeddings is [1, 1, number of key tokens, model_dim]
            self._cached_cos = cos_emb[None, None, :, :]
            self._cached_sin = sin_emb[None, None, :, :]

    def forward(
        self,
        query: mindspore.Tensor,
        key: mindspore.Tensor,
    ) -> Tuple[mindspore.Tensor, mindspore.Tensor]:
        """
        The forward function of RoPE embeddings.

        Args:
            query: Query embeddings in the transformer model. The shape of query embeddings is
                [Batch, number of query heads, number of query tokens, model dimension].
            key: Key embeddings in the transformer model. The shape of key embeddings is
                [Batch, number of key heads, number of key tokens, model dimension].

        Returns:
            tuple:
                A tuple containing the query and key embeddings with positional information.
                The shape of the returned query and key embeddings is the same as the input query and key embeddings
                respectively.

        Note:
            The RoPE embedding computation is done in full-precision. After the computation, input query and key tensors
            are casted to original input datatype.
        """
        dim = key.shape[-1]
        key_len = key.shape[2]
        query_len = query.shape[2]

        assert dim == self.model_dim
        assert key.dtype == query.dtype

        # In the context of self-attention, the lengths of keys and queries are equal.
        # However, in generation tasks, such as predicting the next token in a sequence, the lengths of keys and queries
        # can differ. For instance, when employing key-value (KV) caching for sequence prediction, the keys
        # represent embeddings of previous tokens and the current token, while the query corresponds
        # to the embedding of the current token only.
        assert (
            key_len >= query_len
        ), "Number of keys has to be greater than or equal to number of queries."

        query_float = query.float()
        key_float = key.float()

        self._compute_sin_cos_embeddings(
            key_len, key_dtype=key_float.dtype
        )
        query_float = _apply_rotary_pos_emb(
            x=query_float,
            pos_sin=self._cached_sin[..., key_len - query_len : key_len, :],
            pos_cos=self._cached_cos[..., key_len - query_len : key_len, :],
        )
        key_float = _apply_rotary_pos_emb(
            x=key_float,
            pos_sin=self._cached_sin[..., :key_len, :],
            pos_cos=self._cached_cos[..., :key_len, :],
        )

        return query_float.type_as(query), key_float.type_as(key)

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMRotaryEmbedding.__init__(model_dim, max_seq_length, freq_constant=10000)

Initializes the OpenELMRotaryEmbedding instance with the specified parameters.

PARAMETER DESCRIPTION
self

The object itself.

model_dim

The dimension of the model.

TYPE: int

max_seq_length

The maximum sequence length.

TYPE: int

freq_constant

The frequency constant used in the calculation. Defaults to 10000.

TYPE: int DEFAULT: 10000

RETURNS DESCRIPTION
None

None.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py, lines 199-231
def __init__(
    self, model_dim: int, max_seq_length: int, freq_constant: int = 10000
) -> None:
    """
    Initializes the OpenELMRotaryEmbedding instance with the specified parameters.

    Args:
        self: The object itself.
        model_dim (int): The dimension of the model.
        max_seq_length (int): The maximum sequence length.
        freq_constant (int, optional): The frequency constant used in the calculation. Defaults to 10000.

    Returns:
        None.

    Raises:
        None.
    """
    inv_freq = 1.0 / (
        freq_constant
        ** (ops.arange(0, model_dim, 2, dtype=mindspore.float32) / model_dim)
    )
    super().__init__()

    self.model_dim = model_dim
    self.freq_constant = freq_constant
    self.max_seq_length = max_seq_length

    self.inv_freq = inv_freq
    self._cached_cos = None
    self._cached_sin = None
    self._cached_seq_length = max_seq_length
    self._compute_sin_cos_embeddings(max_seq_length)

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMRotaryEmbedding.extra_repr()

This method generates a string representation that includes specific attributes of the OpenELMRotaryEmbedding class instance.

PARAMETER DESCRIPTION
self

The instance of the OpenELMRotaryEmbedding class.

RETURNS DESCRIPTION
str

A formatted string representing the model_dim, max_seq_length, and freq_constant attributes of the instance.

TYPE: str

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py, lines 233-248
def extra_repr(self) -> str:
    """
    This method generates a string representation that includes specific attributes of the OpenELMRotaryEmbedding
    class instance.

    Args:
        self: The instance of the OpenELMRotaryEmbedding class.

    Returns:
        str: A formatted string representing the model_dim, max_seq_length,
            and freq_constant attributes of the instance.

    Raises:
        None.
    """
    return f"\tmodel_dim={self.model_dim}, max_seq_length={self.max_seq_length}, freq_constant={self.freq_constant}"

mindnlp.transformers.models.openelm.modeling_openelm.OpenELMRotaryEmbedding.forward(query, key)

The forward function of RoPE embeddings.

PARAMETER DESCRIPTION
query

Query embeddings in the transformer model. The shape of query embeddings is [Batch, number of query heads, number of query tokens, model dimension].

TYPE: Tensor

key

Key embeddings in the transformer model. The shape of key embeddings is [Batch, number of key heads, number of key tokens, model dimension].

TYPE: Tensor

RETURNS DESCRIPTION
tuple

A tuple containing the query and key embeddings with positional information. The shape of the returned query and key embeddings is the same as the input query and key embeddings respectively.

TYPE: Tuple[Tensor, Tensor]

Note

The RoPE embedding computation is done in full precision. After the computation, the query and key tensors are cast back to the original input datatype.

Source code in mindnlp/transformers/models/openelm/modeling_openelm.py, lines 300-357
def forward(
    self,
    query: mindspore.Tensor,
    key: mindspore.Tensor,
) -> Tuple[mindspore.Tensor, mindspore.Tensor]:
    """
    The forward function of RoPE embeddings.

    Args:
        query: Query embeddings in the transformer model. The shape of query embeddings is
            [Batch, number of query heads, number of query tokens, model dimension].
        key: Key embeddings in the transformer model. The shape of key embeddings is
            [Batch, number of key heads, number of key tokens, model dimension].

    Returns:
        tuple:
            A tuple containing the query and key embeddings with positional information.
            The shape of the returned query and key embeddings is the same as the input query and key embeddings
            respectively.

    Note:
        The RoPE embedding computation is done in full-precision. After the computation, input query and key tensors
        are casted to original input datatype.
    """
    dim = key.shape[-1]
    key_len = key.shape[2]
    query_len = query.shape[2]

    assert dim == self.model_dim
    assert key.dtype == query.dtype

    # In the context of self-attention, the lengths of keys and queries are equal.
    # However, in generation tasks, such as predicting the next token in a sequence, the lengths of keys and queries
    # can differ. For instance, when employing key-value (KV) caching for sequence prediction, the keys
    # represent embeddings of previous tokens and the current token, while the query corresponds
    # to the embedding of the current token only.
    assert (
        key_len >= query_len
    ), "Number of keys has to be greater than or equal to number of queries."

    query_float = query.float()
    key_float = key.float()

    self._compute_sin_cos_embeddings(
        key_len, key_dtype=key_float.dtype
    )
    query_float = _apply_rotary_pos_emb(
        x=query_float,
        pos_sin=self._cached_sin[..., key_len - query_len : key_len, :],
        pos_cos=self._cached_cos[..., key_len - query_len : key_len, :],
    )
    key_float = _apply_rotary_pos_emb(
        x=key_float,
        pos_sin=self._cached_sin[..., :key_len, :],
        pos_cos=self._cached_cos[..., :key_len, :],
    )

    return query_float.type_as(query), key_float.type_as(key)

mindnlp.transformers.models.openelm.configuration_openelm

Implements HF OpenELMConfig based on PretrainedConfig

mindnlp.transformers.models.openelm.configuration_openelm.OpenELMConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [OpenELMModel]. It is used to instantiate an OpenELM model according to the specified arguments, defining the model architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vocab_size

Vocabulary size of the OpenELM model.

TYPE: `int`, *optional*, defaults to 32000 DEFAULT: 32000

max_context_length

Maximum number of input tokens.

TYPE: `int`, *optional*, defaults to 2048 DEFAULT: 2048

num_transformer_layers

Number of hidden layers in the Transformer decoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

model_dim

Dimension of the hidden representations.

TYPE: `int`, *optional*, defaults to 2048 DEFAULT: 2048

head_dim

The attention head dimension.

TYPE: `int`, *optional*, defaults to 128 DEFAULT: 128

qkv_multipliers

If qkv_multipliers is a single Number, all attention layers have the same latent dimensions, resulting in a uniform allocation of parameters. If qkv_multipliers is a List of Numbers, each attention layer has different latent dimensions (assuming qkv_multipliers[0] != qkv_multipliers[1]), resulting in a variable allocation of parameters across attention layers. This scaling is known as layer-wise or block-wise scaling: https://arxiv.org/abs/2008.00623

TYPE: `Union[Number, List[Number]]`, *optional*, defaults to 1.0 DEFAULT: 1.0

num_query_heads

The number of query heads, computed from compute_heads(model_dim=model_dim, head_dim=head_dim).

TYPE: `Union[int, None]`, *optional*, defaults to None DEFAULT: None

num_gqa_groups

This variable allows switching between multi-head attention, grouped-query attention, and multi-query attention. When num_gqa_groups == 1, it is multi-head attention. When 1 < num_gqa_groups < num_heads and num_heads is divisible by num_gqa_groups, it is grouped-query attention. When num_gqa_groups == num_heads, it is multi-query attention.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

ffn_multipliers

Feed-forward network (FFN) multipliers. If ffn_multipliers is a single Number, all FFN layers have the same latent dimensions, resulting in a uniform allocation of parameters. If ffn_multipliers is a List of Numbers, each FFN layer has different latent dimensions (assuming ffn_multipliers[0] != ffn_multipliers[1]), resulting in a variable allocation of parameters across FFN layers. This scaling is known as layer-wise or block-wise scaling: https://arxiv.org/abs/2008.00623

TYPE: `Union[Number, List[Number]]`, *optional*, defaults to 4.0 DEFAULT: 4.0

ffn_with_glu

Whether to use FFN with Gated Linear Unit (GLU)

TYPE: `bool`, *optional*, defaults to True DEFAULT: True

ffn_dim_divisor

The ffn layer dimension divisor.

TYPE: `int`, *optional*, defaults to 256 DEFAULT: 256

activation_fn_name

The non-linear activation function (function or string) in the decoder.

TYPE: `str` or `function`, *optional*, defaults to `"swish"` DEFAULT: 'swish'

normalization_layer_name

Type of normalization layer.

TYPE: `str` or `function`, *optional*, defaults to `"rms_norm"` DEFAULT: 'rms_norm'

normalize_qk_projections

Whether to normalize queries and keys after projections

TYPE: `bool`, *optional*, defaults to False DEFAULT: False

share_input_output_layers

Whether to share the embedding between input and output linear layer

TYPE: `bool`, *optional*, defaults to False DEFAULT: False

rope_freq_constant

The base period of the RoPE embeddings.

TYPE: `int`, *optional*, defaults to 10000 DEFAULT: 10000

rope_max_length

rope_max_length is set to twice max_context_length, which allows flexibility in token lengths during training or fine-tuning.

TYPE: `int`, *optional*, defaults to 4096 DEFAULT: 4096

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

use_cache

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

bos_token_id

Beginning of stream token id.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

eos_token_id

End of stream token id.

TYPE: `int`, *optional*, defaults to 2 DEFAULT: 2
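
For illustration, a short sketch of how the layer-wise (block-wise) scaling plays out. The commented values were worked out from make_divisible/compute_heads with the arguments shown and the remaining defaults; treat them as a rough guide rather than a reference output:

from mindnlp.transformers.models.openelm.configuration_openelm import OpenELMConfig

config = OpenELMConfig(
    num_transformer_layers=4,
    qkv_multipliers=(0.5, 1.0),  # block-wise scaling of the attention width
    num_gqa_groups=4,            # grouped-query attention: 4 query heads per KV head
)
print(config.num_query_heads)  # per-layer list, e.g. [8, 12, 12, 16]
print(config.num_kv_heads)     # per-layer list, e.g. [2, 3, 3, 4]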

Source code in mindnlp/transformers/models/openelm/configuration_openelm.py, lines 63-323
class OpenELMConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`OpenELMModel`].
    It is used to instantiate an OpenELM model according to the specified arguments, defining the model architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the OpenELM model.
        max_context_length (`int`, *optional*, defaults to 2048):
            Maximum number of input tokens.
        num_transformer_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer decoder.
        model_dim (`int`, *optional*, defaults to 2048):
            Dimension of the hidden representations.
        head_dim (`int`, *optional*, defaults to 128):
            The attention head dimension.
        qkv_multipliers (`Union[Number, List[Number]]`, *optional*, defaults to 1.0):
            If the qkv_multipliers is a Number, then all attention layers have the same latent dimensions,
            resulting in uniform allocation of parameters.
            If the qkv_multipliers is a List of Number, then each attention layer have different latent dimensions
            assuming qkv_multipliers[0] != qkv_multipliers[1]. This results in variable allocation of parameters
            in attention layer.
            This scaling is known as layer-wise or block-wise scaling: https://arxiv.org/abs/2008.00623
        num_query_heads (`Union[int, None]`, *optional*, defaults to None):
            The number of query heads, computed from `compute_heads(model_dim=model_dim, head_dim=head_dim)`.
        num_gqa_groups (`int`, *optional*, defaults to 1):
            This variable allows to switch between multi-head attention, group query attention, and multi-query attention.
            When num_gqa_groups == 1, then it is multi-head attention.
            When 1 < num_gqa_groups < num_heads and num_heads is divisible by num_gqa_groups,
            then it is group query attention
            When num_gqa_groups == num_heads, then it is multi-query attention
        ffn_multipliers (`Union[Number, List[Number]]`, *optional*, defaults to 4.0):
            Feed-forward network (FFN) multipliers.
            If the ffn_multipliers is a Number, then all FFN layers have the same latent dimensions,
            resulting in uniform allocation of parameters.
            If the ffn_multipliers is a List of Number, then each FFN layer have different latent dimensions
            assuming ffn_multipliers[0] != ffn_multipliers[1]. This results in variable allocation of parameters
            in FFN layer.
            This scaling is known as layer-wise or block-wise scaling: https://arxiv.org/abs/2008.00623
        ffn_with_glu (`bool`, *optional*, defaults to True):
            Whether to use FFN with Gated Linear Unit (GLU)
        ffn_dim_divisor (`int`, *optional*, defaults to 256):
            The ffn layer dimension divisor.
        activation_fn_name (`str` or `function`, *optional*, defaults to `"swish"`):
            The non-linear activation function (function or string) in the decoder.
        normalization_layer_name (`str` or `function`, *optional*, defaults to `"rms_norm"`):
            Type of normalization layer.
        normalize_qk_projections (`bool`, *optional*, defaults to False):
            Whether to normalize queries and keys after projections
        share_input_output_layers (`bool`, *optional*, defaults to False):
            Whether to share the embedding between input and output linear layer
        rope_freq_constant (`int`, *optional*, defaults to 10000):
            The base period of the RoPE embeddings.
        rope_max_length (`int`, *optional*, defaults to 4096):
            That rope_max_length is set to twice of max_context_length.
            This allows flexibility in token lengths during training or fine-tuning.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        bos_token_id (`int`, *optional*, defaults to 1):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 2):
            End of stream token id.
    """
    model_type = "openelm"

    def __init__(
        self,
        vocab_size: int = 32000,
        max_context_length: int = 2048,
        num_transformer_layers: int = 12,
        model_dim: int = 2048,
        head_dim: int = 128,
        qkv_multipliers: Union[Number, List[Number]] = 1.0,
        num_query_heads: Union[int, None] = None,
        num_gqa_groups: int = 1,
        ffn_multipliers: Union[Number, List[Number]] = 4.0,
        ffn_with_glu: bool = True,
        ffn_dim_divisor: int = 256,
        activation_fn_name: str = "swish",
        normalization_layer_name: str = "rms_norm",
        normalize_qk_projections: bool = False,
        share_input_output_layers: bool = False,
        rope_freq_constant: int = 10000,
        rope_max_length: int = 4096,
        initializer_range: float = 0.02,
        use_cache: bool = True,
        bos_token_id: int = 1,
        eos_token_id: int = 2,
        **kwargs,
    ) -> None:
        """
        This method initializes an instance of the OpenELMConfig class with the provided parameters.

        Args:
            self: The instance of the class.
            vocab_size (int): The size of the vocabulary.
            max_context_length (int): The maximum length of the context.
            num_transformer_layers (int): The number of transformer layers.
            model_dim (int): The dimension of the model.
            head_dim (int): The dimension of the head.
            qkv_multipliers (Union[Number, List[Number]]): The multiplier(s) for query, key, and value vectors.
            num_query_heads (Union[int, None]): The number of query heads.
                If None, it will be computed based on model_dim and head_dim.
            num_gqa_groups (int): The number of groups for generalized query attention.
            ffn_multipliers (Union[Number, List[Number]]): The multiplier(s) for feed-forward network.
            ffn_with_glu (bool): A boolean indicating whether to use gated linear units in the feed-forward network.
            ffn_dim_divisor (int): The divisor for the feed-forward network dimension.
            activation_fn_name (str): The name of the activation function.
            normalization_layer_name (str): The name of the normalization layer.
            normalize_qk_projections (bool): A boolean indicating whether to normalize query and key projections.
            share_input_output_layers (bool): A boolean indicating whether to share input and output layers.
            rope_freq_constant (int): The frequency constant for the relative positional encoding.
            rope_max_length (int): The maximum length for the relative positional encoding.
            initializer_range (float): The range for random weight initialization.
            use_cache (bool): A boolean indicating whether to use cache.
            bos_token_id (int): The token ID for the beginning of sentence.
            eos_token_id (int): The token ID for the end of sentence.

        Returns:
            None.

        Raises:
            None.
        """
        self.vocab_size = vocab_size
        self.max_context_length = max_context_length
        self.num_transformer_layers = num_transformer_layers
        self.model_dim = model_dim
        self.head_dim = head_dim
        self.qkv_multipliers = qkv_multipliers
        self.num_query_heads = num_query_heads
        self.num_gqa_groups = num_gqa_groups
        self.ffn_multipliers = ffn_multipliers
        self.ffn_with_glu = ffn_with_glu
        self.ffn_dim_divisor = ffn_dim_divisor
        self.activation_fn_name = activation_fn_name
        self.normalization_layer_name = normalization_layer_name
        self.normalize_qk_projections = normalize_qk_projections
        self.share_input_output_layers = share_input_output_layers
        self.rope_freq_constant = rope_freq_constant
        self.rope_max_length = rope_max_length
        self.num_query_heads = (
            compute_heads(model_dim=model_dim, head_dim=head_dim)
            if num_query_heads is None
            else num_query_heads
        )
        self.initializer_range = initializer_range

        self.__post_init__()
        super().__init__(
            use_cache=use_cache,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            **kwargs,
        )

    def __post_init__(self) -> None:
        """
        This method initializes the configuration parameters for the OpenELM model.

        Args:
            self (OpenELMConfig): The instance of the OpenELMConfig class.

        Returns:
            None.

        Raises:
            NotImplementedError: If the QKV multipliers are not a single number or a list containing exactly two numbers,
                or if the FFN multipliers are not a single number or a list containing exactly two numbers.
            AssertionError: If the length of the FFN multipliers does not match the number of transformer layers, or if
                the number of query heads is not divisible by the number of key-value heads for any layer.
        """
        if self.num_gqa_groups is not None:
            head_multiple_of = self.num_gqa_groups
        else:
            head_multiple_of = 2

        if isinstance(self.qkv_multipliers, Number):
            # All attention layers have the same latent dimensions, resulting in uniform allocation of parameters.
            qkv_dim = make_divisible(
                self.model_dim * self.qkv_multipliers,
                divisor=self.head_dim * head_multiple_of,
            )
            query_dims = [int(qkv_dim)] * self.num_transformer_layers

        elif (
            isinstance(self.qkv_multipliers, (tuple, list))
            and len(self.qkv_multipliers) == 2
        ):
            # Each attention layer have different latent dimensions assuming qkv_multipliers[0] != qkv_multipliers[1].
            # This results in variable allocation of parameters in attention layer.
            # This scaling is known as layer-wise or block-wise scaling: https://arxiv.org/abs/2008.00623
            qkv_multipliers = [
                round(v, 2)
                for v in np.linspace(
                    self.qkv_multipliers[0],
                    self.qkv_multipliers[1],
                    num=self.num_transformer_layers,
                    dtype=float,
                )
            ]
            # Make sure that scaled model dimension is divisible by scaled head dimension.
            query_dims = [
                int(
                    make_divisible(
                        self.model_dim * m, divisor=self.head_dim * head_multiple_of
                    )
                )
                for m in qkv_multipliers
            ]
        else:
            raise NotImplementedError(
                f"QKV multipliers should be a single number or a list containing exactly two numbers. Got: {qkv_multipliers}."
            )

        # compute the number of query, key, and value heads
        # For multi-head and multi-query attention, the number of heads for query, key, and value are the same.
        # For group query attention, the number of key and value heads are the same.
        self.num_query_heads = [
            int(compute_heads(q_dim, self.head_dim)) for q_dim in query_dims
        ]
        self.num_kv_heads = [
            q_heads // self.num_gqa_groups for q_heads in self.num_query_heads
        ]

        # Feed-forward network (FFN) multipliers
        if isinstance(self.ffn_multipliers, Number):
            # All FFN layers have the same latent dimensions, resulting in uniform allocation of parameters.
            self.ffn_multipliers = [self.ffn_multipliers] * self.num_transformer_layers
        elif isinstance(self.ffn_multipliers, (tuple, list)):
            # Each FFN layer have different latent dimensions assuming ffn_multipliers[0] != ffn_multipliers[1].
            # This results in variable allocation of parameters in FFN layer.
            # This scaling is known as layer-wise or block-wise scaling: https://arxiv.org/abs/2008.00623
            if len(self.ffn_multipliers) == 2:
                self.ffn_multipliers = [
                    round(v, 2)
                    for v in np.linspace(
                        self.ffn_multipliers[0],
                        self.ffn_multipliers[1],
                        num=self.num_transformer_layers,
                        dtype=float,
                    )
                ]
            else:
                assert (
                    len(self.ffn_multipliers) == self.num_transformer_layers
                ), f"{len(self.ffn_multipliers)}!={self.num_transformer_layers}"
        else:
            raise NotImplementedError(
                f"FFN multipliers should be a single number or a list containing exactly two numbers. Got: {qkv_multipliers}."
            )

        # check num_query_heads divisible by num_kv_heads for every layer
        for layer_idx in range(len(query_dims)):
            assert self.num_query_heads[layer_idx] % self.num_kv_heads[layer_idx] == 0

mindnlp.transformers.models.openelm.configuration_openelm.OpenELMConfig.__init__(vocab_size=32000, max_context_length=2048, num_transformer_layers=12, model_dim=2048, head_dim=128, qkv_multipliers=1.0, num_query_heads=None, num_gqa_groups=1, ffn_multipliers=4.0, ffn_with_glu=True, ffn_dim_divisor=256, activation_fn_name='swish', normalization_layer_name='rms_norm', normalize_qk_projections=False, share_input_output_layers=False, rope_freq_constant=10000, rope_max_length=4096, initializer_range=0.02, use_cache=True, bos_token_id=1, eos_token_id=2, **kwargs)

This method initializes an instance of the OpenELMConfig class with the provided parameters.

PARAMETER DESCRIPTION
self

The instance of the class.

vocab_size

The size of the vocabulary.

TYPE: int DEFAULT: 32000

max_context_length

The maximum length of the context.

TYPE: int DEFAULT: 2048

num_transformer_layers

The number of transformer layers.

TYPE: int DEFAULT: 12

model_dim

The dimension of the model.

TYPE: int DEFAULT: 2048

head_dim

The dimension of the head.

TYPE: int DEFAULT: 128

qkv_multipliers

The multiplier(s) for query, key, and value vectors.

TYPE: Union[Number, List[Number]] DEFAULT: 1.0

num_query_heads

The number of query heads. If None, it will be computed based on model_dim and head_dim.

TYPE: Union[int, None] DEFAULT: None

num_gqa_groups

The number of groups for grouped-query attention (GQA).

TYPE: int DEFAULT: 1

ffn_multipliers

The multiplier(s) for feed-forward network.

TYPE: Union[Number, List[Number]] DEFAULT: 4.0

ffn_with_glu

A boolean indicating whether to use gated linear units in the feed-forward network.

TYPE: bool DEFAULT: True

ffn_dim_divisor

The divisor for the feed-forward network dimension.

TYPE: int DEFAULT: 256

activation_fn_name

The name of the activation function.

TYPE: str DEFAULT: 'swish'

normalization_layer_name

The name of the normalization layer.

TYPE: str DEFAULT: 'rms_norm'

normalize_qk_projections

A boolean indicating whether to normalize query and key projections.

TYPE: bool DEFAULT: False

share_input_output_layers

A boolean indicating whether to share input and output layers.

TYPE: bool DEFAULT: False

rope_freq_constant

The frequency constant for the relative positional encoding.

TYPE: int DEFAULT: 10000

rope_max_length

The maximum length for the relative positional encoding.

TYPE: int DEFAULT: 4096

initializer_range

The range for random weight initialization.

TYPE: float DEFAULT: 0.02

use_cache

A boolean indicating whether to use cache.

TYPE: bool DEFAULT: True

bos_token_id

The token ID for the beginning of sentence.

TYPE: int DEFAULT: 1

eos_token_id

The token ID for the end of sentence.

TYPE: int DEFAULT: 2

RETURNS DESCRIPTION
None

None.

Source code in mindnlp/transformers/models/openelm/configuration_openelm.py, lines 134-223
def __init__(
    self,
    vocab_size: int = 32000,
    max_context_length: int = 2048,
    num_transformer_layers: int = 12,
    model_dim: int = 2048,
    head_dim: int = 128,
    qkv_multipliers: Union[Number, List[Number]] = 1.0,
    num_query_heads: Union[int, None] = None,
    num_gqa_groups: int = 1,
    ffn_multipliers: Union[Number, List[Number]] = 4.0,
    ffn_with_glu: bool = True,
    ffn_dim_divisor: int = 256,
    activation_fn_name: str = "swish",
    normalization_layer_name: str = "rms_norm",
    normalize_qk_projections: bool = False,
    share_input_output_layers: bool = False,
    rope_freq_constant: int = 10000,
    rope_max_length: int = 4096,
    initializer_range: float = 0.02,
    use_cache: bool = True,
    bos_token_id: int = 1,
    eos_token_id: int = 2,
    **kwargs,
) -> None:
    """
    This method initializes an instance of the OpenELMConfig class with the provided parameters.

    Args:
        self: The instance of the class.
        vocab_size (int): The size of the vocabulary.
        max_context_length (int): The maximum length of the context.
        num_transformer_layers (int): The number of transformer layers.
        model_dim (int): The dimension of the model.
        head_dim (int): The dimension of the head.
        qkv_multipliers (Union[Number, List[Number]]): The multiplier(s) for query, key, and value vectors.
        num_query_heads (Union[int, None]): The number of query heads.
            If None, it will be computed based on model_dim and head_dim.
        num_gqa_groups (int): The number of groups for generalized query attention.
        ffn_multipliers (Union[Number, List[Number]]): The multiplier(s) for feed-forward network.
        ffn_with_glu (bool): A boolean indicating whether to use gated linear units in the feed-forward network.
        ffn_dim_divisor (int): The divisor for the feed-forward network dimension.
        activation_fn_name (str): The name of the activation function.
        normalization_layer_name (str): The name of the normalization layer.
        normalize_qk_projections (bool): A boolean indicating whether to normalize query and key projections.
        share_input_output_layers (bool): A boolean indicating whether to share input and output layers.
        rope_freq_constant (int): The frequency constant for the relative positional encoding.
        rope_max_length (int): The maximum length for the relative positional encoding.
        initializer_range (float): The range for random weight initialization.
        use_cache (bool): A boolean indicating whether to use cache.
        bos_token_id (int): The token ID for the beginning of sentence.
        eos_token_id (int): The token ID for the end of sentence.

    Returns:
        None.

    Raises:
        None.
    """
    self.vocab_size = vocab_size
    self.max_context_length = max_context_length
    self.num_transformer_layers = num_transformer_layers
    self.model_dim = model_dim
    self.head_dim = head_dim
    self.qkv_multipliers = qkv_multipliers
    self.num_query_heads = num_query_heads
    self.num_gqa_groups = num_gqa_groups
    self.ffn_multipliers = ffn_multipliers
    self.ffn_with_glu = ffn_with_glu
    self.ffn_dim_divisor = ffn_dim_divisor
    self.activation_fn_name = activation_fn_name
    self.normalization_layer_name = normalization_layer_name
    self.normalize_qk_projections = normalize_qk_projections
    self.share_input_output_layers = share_input_output_layers
    self.rope_freq_constant = rope_freq_constant
    self.rope_max_length = rope_max_length
    self.num_query_heads = (
        compute_heads(model_dim=model_dim, head_dim=head_dim)
        if num_query_heads is None
        else num_query_heads
    )
    self.initializer_range = initializer_range

    self.__post_init__()
    super().__init__(
        use_cache=use_cache,
        bos_token_id=bos_token_id,
        eos_token_id=eos_token_id,
        **kwargs,
    )

mindnlp.transformers.models.openelm.configuration_openelm.OpenELMConfig.__post_init__()

This method initializes the configuration parameters for the OpenELM model.

PARAMETER DESCRIPTION
self

The instance of the OpenELMConfig class.

TYPE: OpenELMConfig

RETURNS DESCRIPTION
None

None.

RAISES DESCRIPTION
NotImplementedError

If the QKV multipliers are not a single number or a list containing exactly two numbers, or if the FFN multipliers are not a single number or a list containing exactly two numbers.

AssertionError

If the length of the FFN multipliers does not match the number of transformer layers, or if the number of query heads is not divisible by the number of key-value heads for any layer.

Source code in mindnlp/transformers/models/openelm/configuration_openelm.py, lines 225-323
def __post_init__(self) -> None:
    """
    This method initializes the configuration parameters for the OpenELM model.

    Args:
        self (OpenELMConfig): The instance of the OpenELMConfig class.

    Returns:
        None.

    Raises:
        NotImplementedError: If the QKV multipliers are not a single number or a list containing exactly two numbers,
            or if the FFN multipliers are not a single number or a list containing exactly two numbers.
        AssertionError: If the length of the FFN multipliers does not match the number of transformer layers, or if
            the number of query heads is not divisible by the number of key-value heads for any layer.
    """
    if self.num_gqa_groups is not None:
        head_multiple_of = self.num_gqa_groups
    else:
        head_multiple_of = 2

    if isinstance(self.qkv_multipliers, Number):
        # All attention layers have the same latent dimensions, resulting in uniform allocation of parameters.
        qkv_dim = make_divisible(
            self.model_dim * self.qkv_multipliers,
            divisor=self.head_dim * head_multiple_of,
        )
        query_dims = [int(qkv_dim)] * self.num_transformer_layers

    elif (
        isinstance(self.qkv_multipliers, (tuple, list))
        and len(self.qkv_multipliers) == 2
    ):
        # Each attention layer have different latent dimensions assuming qkv_multipliers[0] != qkv_multipliers[1].
        # This results in variable allocation of parameters in attention layer.
        # This scaling is known as layer-wise or block-wise scaling: https://arxiv.org/abs/2008.00623
        qkv_multipliers = [
            round(v, 2)
            for v in np.linspace(
                self.qkv_multipliers[0],
                self.qkv_multipliers[1],
                num=self.num_transformer_layers,
                dtype=float,
            )
        ]
        # Make sure that scaled model dimension is divisible by scaled head dimension.
        query_dims = [
            int(
                make_divisible(
                    self.model_dim * m, divisor=self.head_dim * head_multiple_of
                )
            )
            for m in qkv_multipliers
        ]
    else:
        raise NotImplementedError(
            f"QKV multipliers should be a single number or a list containing exactly two numbers. Got: {qkv_multipliers}."
        )

    # compute the number of query, key, and value heads
    # For multi-head and multi-query attention, the number of heads for query, key, and value are the same.
    # For group query attention, the number of key and value heads are the same.
    self.num_query_heads = [
        int(compute_heads(q_dim, self.head_dim)) for q_dim in query_dims
    ]
    self.num_kv_heads = [
        q_heads // self.num_gqa_groups for q_heads in self.num_query_heads
    ]

    # Feed-forward network (FFN) multipliers
    if isinstance(self.ffn_multipliers, Number):
        # All FFN layers have the same latent dimensions, resulting in uniform allocation of parameters.
        self.ffn_multipliers = [self.ffn_multipliers] * self.num_transformer_layers
    elif isinstance(self.ffn_multipliers, (tuple, list)):
        # Each FFN layer have different latent dimensions assuming ffn_multipliers[0] != ffn_multipliers[1].
        # This results in variable allocation of parameters in FFN layer.
        # This scaling is known as layer-wise or block-wise scaling: https://arxiv.org/abs/2008.00623
        if len(self.ffn_multipliers) == 2:
            self.ffn_multipliers = [
                round(v, 2)
                for v in np.linspace(
                    self.ffn_multipliers[0],
                    self.ffn_multipliers[1],
                    num=self.num_transformer_layers,
                    dtype=float,
                )
            ]
        else:
            assert (
                len(self.ffn_multipliers) == self.num_transformer_layers
            ), f"{len(self.ffn_multipliers)}!={self.num_transformer_layers}"
    else:
        raise NotImplementedError(
            f"FFN multipliers should be a single number or a list containing exactly two numbers. Got: {qkv_multipliers}."
        )

    # check num_query_heads divisible by num_kv_heads for every layer
    for layer_idx in range(len(query_dims)):
        assert self.num_query_heads[layer_idx] % self.num_kv_heads[layer_idx] == 0

mindnlp.transformers.models.openelm.configuration_openelm.compute_heads(model_dim, head_dim)

Compute the number of heads.

PARAMETER DESCRIPTION
model_dim

Model dimension.

TYPE: int

head_dim

Head dimension.

TYPE: int

RETURNS DESCRIPTION
int

An integer denoting the number of heads in multi-head attention.

RAISES DESCRIPTION
ValueError

if model dimension is not divisible by head dimension.

Source code in mindnlp/transformers/models/openelm/configuration_openelm.py, lines 42-61
def compute_heads(model_dim: int, head_dim: int) -> int:
    """
    Compute the number of heads.

    Args:
        model_dim: Model dimension.
        head_dim: Head dimension.

    Returns:
        An integer denoting number of heads in multi-head attention is returned.

    Raises:
        ValueError: if model dimension is not divisible by head dimension.
    """
    if model_dim % head_dim == 0:
        return model_dim // head_dim
    else:
        raise ValueError(
            f"Model dimension should be divisible by head dimension. Got: {model_dim} and {head_dim}."
        )
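
For example (a quick sanity check of the behavior described above):

from mindnlp.transformers.models.openelm.configuration_openelm import compute_heads

print(compute_heads(model_dim=2048, head_dim=128))  # 16
# compute_heads(model_dim=2048, head_dim=100) would raise ValueError, since 2048 % 100 != 0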

mindnlp.transformers.models.openelm.configuration_openelm.make_divisible(v, divisor=8, min_value=None)

This function is taken from the original tf repo. It ensures that all layers have a channel number that is divisible by the divisor. It can be seen at: https://github.com/tensorflow/models/blob/2cfc99eff5e5eb729c6793d2f3d03aa1c9be2b15/research/slim/nets/mobilenet/mobilenet.py#L62

PARAMETER DESCRIPTION
v

input value

TYPE: Union[float, int]

divisor

default to 8

TYPE: Optional[int] DEFAULT: 8

min_value

minimum divisor value

TYPE: Optional[Union[float, int]] DEFAULT: None

RETURNS DESCRIPTION
new_v

new divisible value

TYPE: Union[float, int]

Source code in mindnlp/transformers/models/openelm/configuration_openelm.py, lines 14-39
def make_divisible(
    v: Union[float, int],
    divisor: Optional[int] = 8,
    min_value: Optional[Union[float, int]] = None,
) -> Union[float, int]:
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by the divisor
    It can be seen at:
    https://github.com/tensorflow/models/blob/2cfc99eff5e5eb729c6793d2f3d03aa1c9be2b15/research/slim/nets/mobilenet/mobilenet.py#L62

    Args:
        v: input value
        divisor: default to 8
        min_value: minimum divisor value

    Returns:
        new_v: new divisible value
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v
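
For example (note the guard in the last comment above: rounding is never allowed to drop the value by more than 10%):

from mindnlp.transformers.models.openelm.configuration_openelm import make_divisible

print(make_divisible(100, divisor=8))        # 104
print(make_divisible(1372.16, divisor=512))  # 1536, e.g. a block-wise scaled attention width
print(make_divisible(10, divisor=8))         # 16 (8 would be more than 10% below 10, so the divisor is added back)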