
bert_generation

mindnlp.transformers.models.bert_generation.configuration_bert_generation.BertGenerationConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [BertGenerationPreTrainedModel]. It is used to instantiate a BertGeneration model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BertGeneration google/bert_for_seq_generation_L-24_bbc_encoder architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vocab_size

Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the input_ids passed when calling [BertGeneration].

TYPE: `int`, *optional*, defaults to 50358 DEFAULT: 50358

hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 1024 DEFAULT: 1024

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 24 DEFAULT: 24

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 16 DEFAULT: 16

intermediate_size

Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 4096 DEFAULT: 4096

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "silu" and "gelu_new" are supported.

TYPE: `str` or `function`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

TYPE: `float`, *optional*, defaults to 0.1 DEFAULT: 0.1

attention_probs_dropout_prob

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.1 DEFAULT: 0.1

max_position_embeddings

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048).

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-12 DEFAULT: 1e-12

pad_token_id

Padding token id.

TYPE: `int`, *optional*, defaults to 0 DEFAULT: 0

bos_token_id

Beginning of stream token id.

TYPE: `int`, *optional*, defaults to 2 DEFAULT: 2

eos_token_id

End of stream token id.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

position_embedding_type

Type of position embedding. Choose one of "absolute", "relative_key", "relative_key_query". For positional embeddings use "absolute". For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on "relative_key_query", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

TYPE: `str`, *optional*, defaults to `"absolute"` DEFAULT: 'absolute'

use_cache

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

Example

```python
>>> from transformers import BertGenerationConfig, BertGenerationEncoder
...
>>> # Initializing a BertGeneration config
>>> configuration = BertGenerationConfig()
...
>>> # Initializing a model (with random weights) from the config
>>> model = BertGenerationEncoder(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
```
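The following is a minimal sketch (not part of the library's own example) showing how the defaults listed above can be overridden and how a configuration can be saved and reloaded; it assumes the mindnlp.transformers namespace exposes BertGenerationConfig with the standard PretrainedConfig interface.

```python
>>> from mindnlp.transformers import BertGenerationConfig
...
>>> # Override a few defaults; unspecified arguments keep the documented defaults.
>>> configuration = BertGenerationConfig(hidden_size=512, num_hidden_layers=6, num_attention_heads=8)
...
>>> # PretrainedConfig provides save_pretrained / from_pretrained round-tripping.
>>> configuration.save_pretrained("./my_bert_generation_config")
>>> configuration = BertGenerationConfig.from_pretrained("./my_bert_generation_config")
```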
Source code in mindnlp/transformers/models/bert_generation/configuration_bert_generation.py
class BertGenerationConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`BertGenerationPreTrainedModel`]. It is used to
    instantiate a BertGeneration model according to the specified arguments, defining the model architecture.
    Instantiating a configuration with the defaults will yield a similar configuration to that of the BertGeneration
    [google/bert_for_seq_generation_L-24_bbc_encoder](https://hf-mirror.com/google/bert_for_seq_generation_L-24_bbc_encoder)
    architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 50358):
            Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`BertGeneration`].
        hidden_size (`int`, *optional*, defaults to 1024):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 24):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 4096):
            Dimensionality of the "intermediate" (often called feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (`int`, *optional*, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        pad_token_id (`int`, *optional*, defaults to 0):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 2):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 1):
            End of stream token id.
        position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
            Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
            positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
            For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.

    Example:
        ```python
        >>> from transformers import BertGenerationConfig, BertGenerationEncoder
        ...
        >>> # Initializing a BertGeneration config
        >>> configuration = BertGenerationConfig()
        ...
        >>> # Initializing a model (with random weights) from the config
        >>> model = BertGenerationEncoder(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "bert-generation"

    def __init__(
        self,
        vocab_size=50358,
        hidden_size=1024,
        num_hidden_layers=24,
        num_attention_heads=16,
        intermediate_size=4096,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        pad_token_id=0,
        bos_token_id=2,
        eos_token_id=1,
        position_embedding_type="absolute",
        use_cache=True,
        **kwargs,
    ):
        """
        This method initializes an instance of the BertGenerationConfig class.

        Args:
            self: The instance of the class.
            vocab_size (int, optional): The size of the vocabulary. Default is 50358.
            hidden_size (int, optional): The size of the hidden layers. Default is 1024.
            num_hidden_layers (int, optional): The number of hidden layers. Default is 24.
            num_attention_heads (int, optional): The number of attention heads. Default is 16.
            intermediate_size (int, optional): The size of the intermediate layer in the transformer encoder. Default is 4096.
            hidden_act (str, optional): The activation function for the hidden layers. Default is 'gelu'.
            hidden_dropout_prob (float, optional): The dropout probability for the hidden layers. Default is 0.1.
            attention_probs_dropout_prob (float, optional): The dropout probability for the attention probabilities. Default is 0.1.
            max_position_embeddings (int, optional): The maximum number of positional embeddings. Default is 512.
            initializer_range (float, optional): The range of the parameter initializer. Default is 0.02.
            layer_norm_eps (float, optional): The epsilon value for layer normalization. Default is 1e-12.
            pad_token_id (int, optional): The token id for padding. Default is 0.
            bos_token_id (int, optional): The token id for the beginning of sequence. Default is 2.
            eos_token_id (int, optional): The token id for the end of sequence. Default is 1.
            position_embedding_type (str, optional): The type of position embedding. Default is 'absolute'.
            use_cache (bool, optional): Whether to use caching. Default is True.
            **kwargs: Additional keyword arguments.

        Returns:
            None.

        Raises:
            ValueError: If any of the input arguments are invalid.
        """
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.position_embedding_type = position_embedding_type
        self.use_cache = use_cache

mindnlp.transformers.models.bert_generation.configuration_bert_generation.BertGenerationConfig.__init__(vocab_size=50358, hidden_size=1024, num_hidden_layers=24, num_attention_heads=16, intermediate_size=4096, hidden_act='gelu', hidden_dropout_prob=0.1, attention_probs_dropout_prob=0.1, max_position_embeddings=512, initializer_range=0.02, layer_norm_eps=1e-12, pad_token_id=0, bos_token_id=2, eos_token_id=1, position_embedding_type='absolute', use_cache=True, **kwargs)

This method initializes an instance of the BertGenerationConfig class.

PARAMETER DESCRIPTION
self

The instance of the class.

vocab_size

The size of the vocabulary. Default is 50358.

TYPE: int DEFAULT: 50358

hidden_size

The size of the hidden layers. Default is 1024.

TYPE: int DEFAULT: 1024

num_hidden_layers

The number of hidden layers. Default is 24.

TYPE: int DEFAULT: 24

num_attention_heads

The number of attention heads. Default is 16.

TYPE: int DEFAULT: 16

intermediate_size

The size of the intermediate layer in the transformer encoder. Default is 4096.

TYPE: int DEFAULT: 4096

hidden_act

The activation function for the hidden layers. Default is 'gelu'.

TYPE: str DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for the hidden layers. Default is 0.1.

TYPE: float DEFAULT: 0.1

attention_probs_dropout_prob

The dropout probability for the attention probabilities. Default is 0.1.

TYPE: float DEFAULT: 0.1

max_position_embeddings

The maximum number of positional embeddings. Default is 512.

TYPE: int DEFAULT: 512

initializer_range

The range of the parameter initializer. Default is 0.02.

TYPE: float DEFAULT: 0.02

layer_norm_eps

The epsilon value for layer normalization. Default is 1e-12.

TYPE: float DEFAULT: 1e-12

pad_token_id

The token id for padding. Default is 0.

TYPE: int DEFAULT: 0

bos_token_id

The token id for the beginning of sequence. Default is 2.

TYPE: int DEFAULT: 2

eos_token_id

The token id for the end of sequence. Default is 1.

TYPE: int DEFAULT: 1

position_embedding_type

The type of position embedding. Default is 'absolute'.

TYPE: str DEFAULT: 'absolute'

use_cache

Whether to use caching. Default is True.

TYPE: bool DEFAULT: True

**kwargs

Additional keyword arguments.

DEFAULT: {}

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If any of the input arguments are invalid.

Source code in mindnlp/transformers/models/bert_generation/configuration_bert_generation.py
def __init__(
    self,
    vocab_size=50358,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    hidden_act="gelu",
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    max_position_embeddings=512,
    initializer_range=0.02,
    layer_norm_eps=1e-12,
    pad_token_id=0,
    bos_token_id=2,
    eos_token_id=1,
    position_embedding_type="absolute",
    use_cache=True,
    **kwargs,
):
    """
    This method initializes an instance of the BertGenerationConfig class.

    Args:
        self: The instance of the class.
        vocab_size (int, optional): The size of the vocabulary. Default is 50358.
        hidden_size (int, optional): The size of the hidden layers. Default is 1024.
        num_hidden_layers (int, optional): The number of hidden layers. Default is 24.
        num_attention_heads (int, optional): The number of attention heads. Default is 16.
        intermediate_size (int, optional): The size of the intermediate layer in the transformer encoder. Default is 4096.
        hidden_act (str, optional): The activation function for the hidden layers. Default is 'gelu'.
        hidden_dropout_prob (float, optional): The dropout probability for the hidden layers. Default is 0.1.
        attention_probs_dropout_prob (float, optional): The dropout probability for the attention probabilities. Default is 0.1.
        max_position_embeddings (int, optional): The maximum number of positional embeddings. Default is 512.
        initializer_range (float, optional): The range of the parameter initializer. Default is 0.02.
        layer_norm_eps (float, optional): The epsilon value for layer normalization. Default is 1e-12.
        pad_token_id (int, optional): The token id for padding. Default is 0.
        bos_token_id (int, optional): The token id for the beginning of sequence. Default is 2.
        eos_token_id (int, optional): The token id for the end of sequence. Default is 1.
        position_embedding_type (str, optional): The type of position embedding. Default is 'absolute'.
        use_cache (bool, optional): Whether to use caching. Default is True.
        **kwargs: Additional keyword arguments.

    Returns:
        None.

    Raises:
        ValueError: If any of the input arguments are invalid.
    """
    super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)

    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.initializer_range = initializer_range
    self.layer_norm_eps = layer_norm_eps
    self.position_embedding_type = position_embedding_type
    self.use_cache = use_cache

mindnlp.transformers.models.bert_generation.modeling_bert_generation.BertGenerationDecoder

Bases: BertGenerationPreTrainedModel

This class implements the decoder model for BERT generation. It extends BertGenerationPreTrainedModel and provides methods for retrieving and setting the output embeddings, forwarding the model outputs, preparing inputs for generation, and reordering the cache during beam search. It is designed to be used as the decoding component of the BERT generation framework, producing outputs conditioned on input sequences.
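As a quick orientation, the hedged sketch below wires the decoder up as a standalone module; it assumes mindnlp.transformers mirrors the API used in the forward() example further down and that is_decoder=True is set, as the warning in __init__ requires.

```python
>>> from mindnlp.transformers import BertGenerationConfig, BertGenerationDecoder
...
>>> # Standalone decoding requires is_decoder=True (otherwise __init__ logs a warning).
>>> config = BertGenerationConfig(is_decoder=True)
>>> model = BertGenerationDecoder(config)  # randomly initialized weights
...
>>> # The LM head's decoder projection doubles as the output embeddings.
>>> output_embeddings = model.get_output_embeddings()
```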

Source code in mindnlp/transformers/models/bert_generation/modeling_bert_generation.py
class BertGenerationDecoder(BertGenerationPreTrainedModel):

    """
    This class implements the decoder model for BERT generation.
    It extends BertGenerationPreTrainedModel and provides methods for retrieving and setting the output
    embeddings, forwarding the model outputs, preparing inputs for generation, and reordering the cache
    during beam search. It is designed to be used as the decoding component of the BERT generation
    framework, producing outputs conditioned on input sequences.
    """
    _tied_weights_keys = ["lm_head.decoder.weight", "lm_head.decoder.bias"]

    def __init__(self, config):
        """
        Initializes a new instance of the BertGenerationDecoder class.

        Args:
            self: The instance of the class.
            config (object): The configuration object containing settings for the decoder.
                This object must have the necessary attributes and properties required for configuring the decoder.
                It should also have an attribute 'is_decoder' to indicate if the decoder is being used as
                a standalone component.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)

        if not config.is_decoder:
            logger.warning("If you want to use `BertGenerationDecoder` as a standalone, add `is_decoder=True.`")

        self.bert = BertGenerationEncoder(config)
        self.lm_head = BertGenerationOnlyLMHead(config)

        # Initialize weights and apply final processing
        self.post_init()

    def get_output_embeddings(self):
        """
        This method returns the output embeddings of the BertGenerationDecoder.

        Args:
            self: The object instance of the BertGenerationDecoder class.

        Returns:
            The output embedding layer of the decoder, i.e. `self.lm_head.decoder`.

        Raises:
            None
        """
        return self.lm_head.decoder

    def set_output_embeddings(self, new_embeddings):
        """
        Method to set new output embeddings for the decoder in BertGenerationDecoder.

        Args:
            self (BertGenerationDecoder): The instance of BertGenerationDecoder to which the new embeddings will be set.
            new_embeddings: The new embeddings to set for the decoder. Should be of type compatible with the decoder.

        Returns:
            None.

        Raises:
            None.
        """
        self.lm_head.decoder = new_embeddings

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        encoder_hidden_states: Optional[mindspore.Tensor] = None,
        encoder_attention_mask: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithCrossAttentions]:
        r"""
        Args:
            encoder_hidden_states  (`mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
                the model is configured as a decoder.
            encoder_attention_mask (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Mask to avoid performing attention on the padding token indices of the encoder input.
                This mask is used in the cross-attention if the model is configured as a decoder.
                Mask values selected in `[0, 1]`:

                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.
            labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
                `[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are
                ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
            past_key_values (`tuple(tuple(mindspore.Tensor))` of length `config.n_layers` with each tuple having 4 tensors of shape
                `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
                Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
                don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
                `decoder_input_ids` of shape `(batch_size, sequence_length)`.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
                `past_key_values`).

        Returns:
            Union[Tuple, CausalLMOutputWithCrossAttentions]

        Example:
            ```python
            >>> from transformers import AutoTokenizer, BertGenerationDecoder, BertGenerationConfig
            >>> import torch
            ...
            >>> tokenizer = AutoTokenizer.from_pretrained("google/bert_for_seq_generation_L-24_bbc_encoder")
            >>> config = BertGenerationConfig.from_pretrained("google/bert_for_seq_generation_L-24_bbc_encoder")
            >>> config.is_decoder = True
            >>> model = BertGenerationDecoder.from_pretrained(
            ...     "google/bert_for_seq_generation_L-24_bbc_encoder", config=config
            ... )
            ...
            >>> inputs = tokenizer("Hello, my dog is cute", return_token_type_ids=False, return_tensors="pt")
            >>> outputs = model(**inputs)
            ...
            >>> prediction_logits = outputs.logits
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        if labels is not None:
            use_cache = False

        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            past_key_values=past_key_values,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = outputs[0]
        prediction_scores = self.lm_head(sequence_output)

        lm_loss = None
        if labels is not None:
            # we are doing next-token prediction; shift prediction scores and input ids by one
            shifted_prediction_scores = prediction_scores[:, :-1, :]
            labels = labels[:, 1:]
            lm_loss = F.cross_entropy(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))

        if not return_dict:
            output = (prediction_scores,) + outputs[1:]
            return ((lm_loss,) + output) if lm_loss is not None else output

        return CausalLMOutputWithCrossAttentions(
            loss=lm_loss,
            logits=prediction_scores,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
            cross_attentions=outputs.cross_attentions,
        )

    def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attention_mask=None, **model_kwargs):
        """
        Method: prepare_inputs_for_generation

        This method prepares inputs for generation in the BertGenerationDecoder class.

        Args:
            self (object): The instance of the BertGenerationDecoder class.
            input_ids (torch.Tensor): The input tensor containing token IDs. Shape should be (batch_size, sequence_length).
            past_key_values (tuple, optional): Tuple containing past key values from previous generations. Default is None.
            attention_mask (torch.Tensor, optional): The attention mask tensor.
                If not provided, a tensor of ones with the same shape as input_ids is created.

        Returns:
            dict: A dictionary containing the prepared inputs for generation including
                'input_ids', 'attention_mask', and 'past_key_values'.

        Raises:
            ValueError: If the provided input_ids shape is invalid.
            IndexError: If there is an issue with past_key_values.
        """
        input_shape = input_ids.shape
        # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
        if attention_mask is None:
            attention_mask = input_ids.new_ones(input_shape)

        # cut decoder_input_ids if past_key_values is used
        if past_key_values is not None:
            past_length = past_key_values[0][0].shape[2]

            # Some generation methods already pass only the last input ID
            if input_ids.shape[1] > past_length:
                remove_prefix_length = past_length
            else:
                # Default to old behavior: keep only final ID
                remove_prefix_length = input_ids.shape[1] - 1

            input_ids = input_ids[:, remove_prefix_length:]

        return {"input_ids": input_ids, "attention_mask": attention_mask, "past_key_values": past_key_values}

    def _reorder_cache(self, past_key_values, beam_idx):
        """
        Reorders the cache for the BertGenerationDecoder.

        Args:
            self: An instance of the BertGenerationDecoder class.
            past_key_values (tuple): A tuple containing the past key-value states for each layer.
                Each layer's past key-value state is a tensor of shape (batch_size, sequence_length, hidden_size).
            beam_idx (tensor): The beam indices used for reordering the past key-value states.
                A tensor of shape (batch_size, beam_size).

        Returns:
            tuple: The past key-value states reordered according to `beam_idx`.

        Raises:
            None.
        """
        reordered_past = ()
        for layer_past in past_key_values:
            reordered_past += (
                tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),
            )
        return reordered_past

mindnlp.transformers.models.bert_generation.modeling_bert_generation.BertGenerationDecoder.__init__(config)

Initializes a new instance of the BertGenerationDecoder class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object containing settings for the decoder. This object must have the necessary attributes and properties required for configuring the decoder. It should also have an attribute 'is_decoder' to indicate if the decoder is being used as a standalone component.

TYPE: object

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/bert_generation/modeling_bert_generation.py
def __init__(self, config):
    """
    Initializes a new instance of the BertGenerationDecoder class.

    Args:
        self: The instance of the class.
        config (object): The configuration object containing settings for the decoder.
            This object must have the necessary attributes and properties required for configuring the decoder.
            It should also have an attribute 'is_decoder' to indicate if the decoder is being used as
            a standalone component.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)

    if not config.is_decoder:
        logger.warning("If you want to use `BertGenerationDecoder` as a standalone, add `is_decoder=True.`")

    self.bert = BertGenerationEncoder(config)
    self.lm_head = BertGenerationOnlyLMHead(config)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.bert_generation.modeling_bert_generation.BertGenerationDecoder.forward(input_ids=None, attention_mask=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, labels=None, past_key_values=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
encoder_hidden_states

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional* DEFAULT: None

encoder_attention_mask

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]:

  • 1 for tokens that are not masked,
  • 0 for tokens that are masked.

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional* DEFAULT: None

labels

Labels for computing the left-to-right language modeling loss (next-word prediction). Indices should be in [-100, 0, ..., config.vocab_size] (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for tokens with labels in [0, ..., config.vocab_size].

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional* DEFAULT: None

use_cache

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

TYPE: `bool`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, CausalLMOutputWithCrossAttentions]

The prediction scores (logits), plus the language modeling loss when labels are provided, returned either as a plain tuple or as a CausalLMOutputWithCrossAttentions when return_dict=True.

Example

```python
>>> from transformers import AutoTokenizer, BertGenerationDecoder, BertGenerationConfig
>>> import torch
...
>>> tokenizer = AutoTokenizer.from_pretrained("google/bert_for_seq_generation_L-24_bbc_encoder")
>>> config = BertGenerationConfig.from_pretrained("google/bert_for_seq_generation_L-24_bbc_encoder")
>>> config.is_decoder = True
>>> model = BertGenerationDecoder.from_pretrained(
...     "google/bert_for_seq_generation_L-24_bbc_encoder", config=config
... )
...
>>> inputs = tokenizer("Hello, my dog is cute", return_token_type_ids=False, return_tensors="pt")
>>> outputs = model(**inputs)
...
>>> prediction_logits = outputs.logits
```
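As a hedged follow-up to the example above (this line is an illustration, not part of the original example), passing labels makes the decoder compute the shifted next-token cross-entropy loss described in the parameter table; using input_ids as labels is the usual causal-LM setup.

```python
>>> outputs = model(**inputs, labels=inputs["input_ids"])
>>> lm_loss = outputs.loss       # next-token prediction loss
>>> logits = outputs.logits      # shape (batch_size, sequence_length, vocab_size)
```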
Source code in mindnlp/transformers/models/bert_generation/modeling_bert_generation.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    encoder_hidden_states: Optional[mindspore.Tensor] = None,
    encoder_attention_mask: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithCrossAttentions]:
    r"""
    Args:
        encoder_hidden_states  (`mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
            the model is configured as a decoder.
        encoder_attention_mask (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Mask to avoid performing attention on the padding token indices of the encoder input.
            This mask is used in the cross-attention if the model is configured as a decoder.
            Mask values selected in `[0, 1]`:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.
        labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
            `[-100, 0, ..., config.vocab_size]` (see `input_ids` docstring) Tokens with indices set to `-100` are
            ignored (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`
        past_key_values (`tuple(tuple(mindspore.Tensor))` of length `config.n_layers` with each tuple having 4 tensors of shape
            `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
            Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`).

    Returns:
        Union[Tuple, CausalLMOutputWithCrossAttentions]

    Example:
        ```python
        >>> from transformers import AutoTokenizer, BertGenerationDecoder, BertGenerationConfig
        >>> import torch
        ...
        >>> tokenizer = AutoTokenizer.from_pretrained("google/bert_for_seq_generation_L-24_bbc_encoder")
        >>> config = BertGenerationConfig.from_pretrained("google/bert_for_seq_generation_L-24_bbc_encoder")
        >>> config.is_decoder = True
        >>> model = BertGenerationDecoder.from_pretrained(
        ...     "google/bert_for_seq_generation_L-24_bbc_encoder", config=config
        ... )
        ...
        >>> inputs = tokenizer("Hello, my dog is cute", return_token_type_ids=False, return_tensors="pt")
        >>> outputs = model(**inputs)
        ...
        >>> prediction_logits = outputs.logits
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    if labels is not None:
        use_cache = False

    outputs = self.bert(
        input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_attention_mask,
        past_key_values=past_key_values,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    sequence_output = outputs[0]
    prediction_scores = self.lm_head(sequence_output)

    lm_loss = None
    if labels is not None:
        # we are doing next-token prediction; shift prediction scores and input ids by one
        shifted_prediction_scores = prediction_scores[:, :-1, :]
        labels = labels[:, 1:]
        lm_loss = F.cross_entropy(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))

    if not return_dict:
        output = (prediction_scores,) + outputs[1:]
        return ((lm_loss,) + output) if lm_loss is not None else output

    return CausalLMOutputWithCrossAttentions(
        loss=lm_loss,
        logits=prediction_scores,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
        cross_attentions=outputs.cross_attentions,
    )

mindnlp.transformers.models.bert_generation.modeling_bert_generation.BertGenerationDecoder.get_output_embeddings()

This method returns the output embeddings of the BertGenerationDecoder.

PARAMETER DESCRIPTION
self

The object instance of the BertGenerationDecoder class.

RETURNS DESCRIPTION

The output embedding layer of the decoder, i.e. lm_head.decoder.

Source code in mindnlp/transformers/models/bert_generation/modeling_bert_generation.py
def get_output_embeddings(self):
    """
    This method returns the output embeddings of the BertGenerationDecoder.

    Args:
        self: The object instance of the BertGenerationDecoder class.

    Returns:
        The output embedding layer of the decoder, i.e. `self.lm_head.decoder`.

    Raises:
        None
    """
    return self.lm_head.decoder

mindnlp.transformers.models.bert_generation.modeling_bert_generation.BertGenerationDecoder.prepare_inputs_for_generation(input_ids, past_key_values=None, attention_mask=None, **model_kwargs)

This method prepares inputs for generation in the BertGenerationDecoder class.

PARAMETER DESCRIPTION
self

The instance of the BertGenerationDecoder class.

TYPE: object

input_ids

The input tensor containing token IDs. Shape should be (batch_size, sequence_length).

TYPE: Tensor

past_key_values

Tuple containing past key values from previous generations. Default is None.

TYPE: tuple DEFAULT: None

attention_mask

The attention mask tensor. If not provided, a tensor of ones with the same shape as input_ids is created.

TYPE: Tensor DEFAULT: None

RETURNS DESCRIPTION
dict

A dictionary containing the prepared inputs for generation including 'input_ids', 'attention_mask', and 'past_key_values'.

RAISES DESCRIPTION
ValueError

If the provided input_ids shape is invalid.

IndexError

If there is an issue with past_key_values.
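For illustration only, here is a hedged sketch of the mapping this method returns, assuming input_ids is a (batch_size, sequence_length) tensor and no cache has been built yet.

```python
>>> model_inputs = model.prepare_inputs_for_generation(input_ids)
>>> sorted(model_inputs.keys())
['attention_mask', 'input_ids', 'past_key_values']
>>> model_inputs["past_key_values"] is None   # nothing is trimmed until a cache exists
True
```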

Source code in mindnlp/transformers/models/bert_generation/modeling_bert_generation.py
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attention_mask=None, **model_kwargs):
    """
    Method: prepare_inputs_for_generation

    This method prepares inputs for generation in the BertGenerationDecoder class.

    Args:
        self (object): The instance of the BertGenerationDecoder class.
        input_ids (torch.Tensor): The input tensor containing token IDs. Shape should be (batch_size, sequence_length).
        past_key_values (tuple, optional): Tuple containing past key values from previous generations. Default is None.
        attention_mask (torch.Tensor, optional): The attention mask tensor.
            If not provided, a tensor of ones with the same shape as input_ids is created.

    Returns:
        dict: A dictionary containing the prepared inputs for generation including
            'input_ids', 'attention_mask', and 'past_key_values'.

    Raises:
        ValueError: If the provided input_ids shape is invalid.
        IndexError: If there is an issue with past_key_values.
    """
    input_shape = input_ids.shape
    # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
    if attention_mask is None:
        attention_mask = input_ids.new_ones(input_shape)

    # cut decoder_input_ids if past_key_values is used
    if past_key_values is not None:
        past_length = past_key_values[0][0].shape[2]

        # Some generation methods already pass only the last input ID
        if input_ids.shape[1] > past_length:
            remove_prefix_length = past_length
        else:
            # Default to old behavior: keep only final ID
            remove_prefix_length = input_ids.shape[1] - 1

        input_ids = input_ids[:, remove_prefix_length:]

    return {"input_ids": input_ids, "attention_mask": attention_mask, "past_key_values": past_key_values}

mindnlp.transformers.models.bert_generation.modeling_bert_generation.BertGenerationDecoder.set_output_embeddings(new_embeddings)

Method to set new output embeddings for the decoder in BertGenerationDecoder.

PARAMETER DESCRIPTION
self

The instance of BertGenerationDecoder to which the new embeddings will be set.

TYPE: BertGenerationDecoder

new_embeddings

The new embeddings to set for the decoder. Should be of type compatible with the decoder.

RETURNS DESCRIPTION

None.
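A hedged illustration of the setter paired with get_output_embeddings; here the current projection is reused purely for demonstration, but any compatible projection layer could be passed instead.

```python
>>> new_decoder = model.get_output_embeddings()   # or any compatible projection layer
>>> model.set_output_embeddings(new_decoder)
>>> model.get_output_embeddings() is new_decoder
True
```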

Source code in mindnlp/transformers/models/bert_generation/modeling_bert_generation.py
def set_output_embeddings(self, new_embeddings):
    """
    Method to set new output embeddings for the decoder in BertGenerationDecoder.

    Args:
        self (BertGenerationDecoder): The instance of BertGenerationDecoder to which the new embeddings will be set.
        new_embeddings: The new embeddings to set for the decoder. Should be of type compatible with the decoder.

    Returns:
        None.

    Raises:
        None.
    """
    self.lm_head.decoder = new_embeddings

mindnlp.transformers.models.bert_generation.modeling_bert_generation.BertGenerationEncoder

Bases: BertGenerationPreTrainedModel

The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of cross-attention is added between the self-attention layers, following the architecture described in Attention is all you need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

This model should be used when leveraging Bert or Roberta checkpoints for the [EncoderDecoderModel] class as described in Leveraging Pre-trained Checkpoints for Sequence Generation Tasks by Sascha Rothe, Shashi Narayan, and Aliaksei Severyn.

To behave as a decoder the model needs to be initialized with the is_decoder argument of the configuration set to True. To be used in a Seq2Seq model, the model needs to be initialized with both the is_decoder and add_cross_attention arguments set to True; encoder_hidden_states is then expected as an input to the forward pass.
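The hedged sketch below spells out the two modes described above (plain encoder vs. decoder with cross-attention); it assumes mindnlp.transformers exposes these classes with the standard configuration flags.

```python
>>> from mindnlp.transformers import BertGenerationConfig, BertGenerationEncoder
...
>>> # Encoder mode: self-attention only.
>>> encoder = BertGenerationEncoder(BertGenerationConfig())
...
>>> # Decoder mode for a Seq2Seq setup: cross-attention layers are added and
>>> # encoder_hidden_states is expected in the forward pass.
>>> decoder_config = BertGenerationConfig(is_decoder=True, add_cross_attention=True)
>>> decoder = BertGenerationEncoder(decoder_config)
```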

Source code in mindnlp/transformers/models/bert_generation/modeling_bert_generation.py
class BertGenerationEncoder(BertGenerationPreTrainedModel):
    """
    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
    cross-attention is added between the self-attention layers, following the architecture described in [Attention is
    all you need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

    This model should be used when leveraging Bert or Roberta checkpoints for the [`EncoderDecoderModel`] class as
    described in [Leveraging Pre-trained Checkpoints for Sequence Generation Tasks](https://arxiv.org/abs/1907.12461)
    by Sascha Rothe, Shashi Narayan, and Aliaksei Severyn.

    To behave as a decoder the model needs to be initialized with the `is_decoder` argument of the configuration set
    to `True`. To be used in a Seq2Seq model, the model needs to be initialized with both the `is_decoder` and
    `add_cross_attention` arguments set to `True`; `encoder_hidden_states` is then expected as an input to the forward pass.
    """
    def __init__(self, config):
        """
        Initializes a BertGenerationEncoder instance.

        Args:
            self (BertGenerationEncoder): The instance of the BertGenerationEncoder class being initialized.
            config (dict): A dictionary containing configuration parameters for the BertGenerationEncoder.
                This dictionary must include the necessary settings for the embeddings and encoder components.

        Returns:
            None.

        Raises:
            None
        """
        super().__init__(config)
        self.config = config

        self.embeddings = BertGenerationEmbeddings(config)
        self.encoder = BertEncoder(config)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        This method retrieves the input embeddings from the BertGenerationEncoder.

        Args:
            self: The instance of the BertGenerationEncoder class.

        Returns:
            word_embeddings: This method returns the word embeddings for input.

        Raises:
            None.
        """
        return self.embeddings.word_embeddings

    def set_input_embeddings(self, value):
        """
        Sets the input embeddings for the BertGenerationEncoder class.

        Args:
            self (BertGenerationEncoder): An instance of the BertGenerationEncoder class.
            value: The input embeddings to be set. This should be of type `torch.Tensor`.

        Returns:
            None.

        Raises:
            None.

        This method sets the value of the `word_embeddings` attribute of the `embeddings` object
        within the `BertGenerationEncoder` instance to the provided input embeddings.
        """
        self.embeddings.word_embeddings = value

    def _prune_heads(self, heads_to_prune):
        """
        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
        class PreTrainedModel
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        encoder_hidden_states: Optional[mindspore.Tensor] = None,
        encoder_attention_mask: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]:
        r"""
        Args:
            encoder_hidden_states  (`mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
                the model is configured as a decoder.
            encoder_attention_mask (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
                the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`: `1` for
                tokens that are NOT MASKED, `0` for MASKED tokens.
            past_key_values (`tuple(tuple(mindspore.Tensor))` of length `config.n_layers`
                with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
                Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
                don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
                `decoder_input_ids` of shape `(batch_size, sequence_length)`.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
                `past_key_values`).
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if self.config.is_decoder:
            use_cache = use_cache if use_cache is not None else self.config.use_cache
        else:
            use_cache = False

        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
            input_shape = input_ids.shape
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.shape[:-1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        batch_size, seq_length = input_shape

        # past_key_values_length
        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0

        if attention_mask is None:
            attention_mask = ops.ones(batch_size, seq_length + past_key_values_length)

        # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
        # ourselves in which case we just need to make it broadcastable to all heads.
        extended_attention_mask = None
        if not use_cache:
            extended_attention_mask: mindspore.Tensor = self.get_extended_attention_mask(attention_mask, input_shape)

        # If a 2D or 3D attention mask is provided for the cross-attention
        # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
        if self.config.is_decoder and encoder_hidden_states is not None:
            encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.shape
            encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
            if encoder_attention_mask is None:
                encoder_attention_mask = ops.ones(*encoder_hidden_shape)
            encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
        else:
            encoder_extended_attention_mask = None

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

        embedding_output = self.embeddings(
            input_ids=input_ids,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            past_key_values_length=past_key_values_length,
        )

        encoder_outputs = self.encoder(
            embedding_output,
            attention_mask=extended_attention_mask,
            head_mask=head_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_extended_attention_mask,
            past_key_values=past_key_values,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]

        if not return_dict:
            return (sequence_output,) + encoder_outputs[1:]

        return BaseModelOutputWithPastAndCrossAttentions(
            last_hidden_state=sequence_output,
            past_key_values=encoder_outputs.past_key_values,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
            cross_attentions=encoder_outputs.cross_attentions,
        )
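
A minimal, hypothetical usage sketch of the encoder follows. The class names and module paths are the ones documented on this page; the toy hyperparameters, random inputs, and variable names are illustrative only and not part of the library:

import numpy as np
import mindspore
from mindnlp.transformers.models.bert_generation.configuration_bert_generation import BertGenerationConfig
from mindnlp.transformers.models.bert_generation.modeling_bert_generation import BertGenerationEncoder

# Toy configuration so the sketch runs quickly; real checkpoints use the defaults listed above.
config = BertGenerationConfig(
    vocab_size=1000,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=128,
)
model = BertGenerationEncoder(config)

# Random token ids and an all-ones attention mask for a batch of 2 sequences of length 8.
input_ids = mindspore.Tensor(np.random.randint(0, 1000, (2, 8)), mindspore.int64)
attention_mask = mindspore.Tensor(np.ones((2, 8)), mindspore.int64)

outputs = model(input_ids=input_ids, attention_mask=attention_mask)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size) -> (2, 8, 64)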

mindnlp.transformers.models.bert_generation.modeling_bert_generation.BertGenerationEncoder.__init__(config)

Initializes a BertGenerationEncoder instance.

PARAMETER DESCRIPTION
self

The instance of the BertGenerationEncoder class being initialized.

TYPE: BertGenerationEncoder

config

The model configuration for the BertGenerationEncoder. It supplies the settings used to build the embeddings and encoder components.

TYPE: BertGenerationConfig

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/bert_generation/modeling_bert_generation.py
def __init__(self, config):
    """
    Initializes a BertGenerationEncoder instance.

    Args:
        self (BertGenerationEncoder): The instance of the BertGenerationEncoder class being initialized.
        config (BertGenerationConfig): The model configuration for the BertGenerationEncoder.
            It supplies the settings used to build the embeddings and encoder components.

    Returns:
        None.

    Raises:
        None
    """
    super().__init__(config)
    self.config = config

    self.embeddings = BertGenerationEmbeddings(config)
    self.encoder = BertEncoder(config)

    # Initialize weights and apply final processing
    self.post_init()
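
As a brief illustration of what __init__ wires together (reusing the hypothetical toy config from the sketch above), the two submodules are directly accessible after construction:

# Assumes the toy `config` defined in the earlier sketch.
encoder = BertGenerationEncoder(config)
print(type(encoder.embeddings).__name__)  # BertGenerationEmbeddings
print(type(encoder.encoder).__name__)     # BertEncoder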

mindnlp.transformers.models.bert_generation.modeling_bert_generation.BertGenerationEncoder.forward(input_ids=None, attention_mask=None, position_ids=None, head_mask=None, inputs_embeds=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_values=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
encoder_hidden_states

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if the model is configured as a decoder.

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional* DEFAULT: None

encoder_attention_mask

Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in the cross-attention if the model is configured as a decoder. Mask values selected in [0, 1]: 1 for tokens that are NOT MASKED, 0 for MASKED tokens.

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional* DEFAULT: None

use_cache

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

TYPE: `bool`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/bert_generation/modeling_bert_generation.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    encoder_hidden_states: Optional[mindspore.Tensor] = None,
    encoder_attention_mask: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]:
    r"""
    Args:
        encoder_hidden_states  (`mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
            the model is configured as a decoder.
        encoder_attention_mask (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
            the cross-attention if the model is configured as a decoder. Mask values selected in `[0, 1]`: `1` for
            tokens that are NOT MASKED, `0` for MASKED tokens.
        past_key_values (`tuple(tuple(mindspore.Tensor))` of length `config.n_layers`
            with each tuple having 4 tensors of shape `(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
            Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`).
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    if self.config.is_decoder:
        use_cache = use_cache if use_cache is not None else self.config.use_cache
    else:
        use_cache = False

    if input_ids is not None and inputs_embeds is not None:
        raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
    elif input_ids is not None:
        self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
        input_shape = input_ids.shape
    elif inputs_embeds is not None:
        input_shape = inputs_embeds.shape[:-1]
    else:
        raise ValueError("You have to specify either input_ids or inputs_embeds")

    batch_size, seq_length = input_shape

    # past_key_values_length
    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0

    if attention_mask is None:
        attention_mask = ops.ones(batch_size, seq_length + past_key_values_length)

    # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
    # ourselves in which case we just need to make it broadcastable to all heads.
    extended_attention_mask = None
    if not use_cache:
        extended_attention_mask: mindspore.Tensor = self.get_extended_attention_mask(attention_mask, input_shape)

    # If a 2D or 3D attention mask is provided for the cross-attention
    # we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
    if self.config.is_decoder and encoder_hidden_states is not None:
        encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states.shape
        encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
        if encoder_attention_mask is None:
            encoder_attention_mask = ops.ones(*encoder_hidden_shape)
        encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
    else:
        encoder_extended_attention_mask = None

    # Prepare head mask if needed
    # 1.0 in head_mask indicate we keep the head
    # attention_probs has shape bsz x n_heads x N x N
    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

    embedding_output = self.embeddings(
        input_ids=input_ids,
        position_ids=position_ids,
        inputs_embeds=inputs_embeds,
        past_key_values_length=past_key_values_length,
    )

    encoder_outputs = self.encoder(
        embedding_output,
        attention_mask=extended_attention_mask,
        head_mask=head_mask,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_extended_attention_mask,
        past_key_values=past_key_values,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    sequence_output = encoder_outputs[0]

    if not return_dict:
        return (sequence_output,) + encoder_outputs[1:]

    return BaseModelOutputWithPastAndCrossAttentions(
        last_hidden_state=sequence_output,
        past_key_values=encoder_outputs.past_key_values,
        hidden_states=encoder_outputs.hidden_states,
        attentions=encoder_outputs.attentions,
        cross_attentions=encoder_outputs.cross_attentions,
    )
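
The cache mechanics described by past_key_values and use_cache can be sketched as follows. This is a hypothetical example with toy sizes; is_decoder=True is set on the configuration because caching is only enabled for decoder models (see the use_cache handling at the top of forward):

import numpy as np
import mindspore

# Decoder-style toy configuration; only decoders honour use_cache.
decoder_config = BertGenerationConfig(
    vocab_size=1000, hidden_size=64, num_hidden_layers=2,
    num_attention_heads=4, intermediate_size=128, is_decoder=True,
)
decoder = BertGenerationEncoder(decoder_config)

# Step 1: feed the full prefix and request the cache.
prefix = mindspore.Tensor(np.random.randint(0, 1000, (1, 5)), mindspore.int64)
out = decoder(input_ids=prefix, use_cache=True)
past = out.past_key_values  # per-layer precomputed key/value states

# Step 2: with the cache passed back in, only the newest token id needs to be supplied.
next_token = mindspore.Tensor(np.random.randint(0, 1000, (1, 1)), mindspore.int64)
out = decoder(input_ids=next_token, past_key_values=past, use_cache=True)
print(out.last_hidden_state.shape)  # (1, 1, 64)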

mindnlp.transformers.models.bert_generation.modeling_bert_generation.BertGenerationEncoder.get_input_embeddings()

This method retrieves the input embeddings from the BertGenerationEncoder.

PARAMETER DESCRIPTION
self

The instance of the BertGenerationEncoder class.

RETURNS DESCRIPTION
word_embeddings

The model's word embedding layer (embeddings.word_embeddings), which maps input token ids to embedding vectors.

Source code in mindnlp/transformers/models/bert_generation/modeling_bert_generation.py
def get_input_embeddings(self):
    """
    This method retrieves the input embeddings from the BertGenerationEncoder.

    Args:
        self: The instance of the BertGenerationEncoder class.

    Returns:
        word_embeddings: The model's word embedding layer, which maps input token ids to embedding vectors.

    Raises:
        None.
    """
    return self.embeddings.word_embeddings
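
A short hedged example, continuing the toy encoder from the earlier sketch: the returned object is the embedding layer itself, so its weight exposes the vocabulary and hidden sizes.

emb = model.get_input_embeddings()
print(emb.weight.shape)  # (vocab_size, hidden_size), i.e. (1000, 64) for the toy config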

mindnlp.transformers.models.bert_generation.modeling_bert_generation.BertGenerationEncoder.set_input_embeddings(value)

Sets the input embeddings for the BertGenerationEncoder class.

PARAMETER DESCRIPTION
self

An instance of the BertGenerationEncoder class.

TYPE: BertGenerationEncoder

value

The new input embedding layer to use; it replaces embeddings.word_embeddings.

RETURNS DESCRIPTION

None.

This method sets the value of the word_embeddings attribute of the embeddings object within the BertGenerationEncoder instance to the provided input embeddings.

Source code in mindnlp/transformers/models/bert_generation/modeling_bert_generation.py
def set_input_embeddings(self, value):
    """
    Sets the input embeddings for the BertGenerationEncoder class.

    Args:
        self (BertGenerationEncoder): An instance of the BertGenerationEncoder class.
        value: The new input embedding layer to use; it replaces `embeddings.word_embeddings`.

    Returns:
        None.

    Raises:
        None.

    This method sets the value of the `word_embeddings` attribute of the `embeddings` object
    within the `BertGenerationEncoder` instance to the provided input embeddings.
    """
    self.embeddings.word_embeddings = value
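
As a hedged sketch, a common use of this pair of accessors is sharing one embedding layer between two encoders built from the same configuration (the names reuse the toy objects from the sketches above):

shared = model.get_input_embeddings()
other = BertGenerationEncoder(config)
other.set_input_embeddings(shared)
assert other.get_input_embeddings() is shared  # both models now use the same embedding layer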

mindnlp.transformers.models.bert_generation.modeling_bert_generation.BertGenerationPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in mindnlp/transformers/models/bert_generation/modeling_bert_generation.py
class BertGenerationPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """
    config_class = BertGenerationConfig
    base_model_prefix = "bert"
    supports_gradient_checkpointing = True

    def _init_weights(self, cell):
        """Initialize the weights"""
        if isinstance(cell, nn.Linear):
            # Slightly different from the TF version which uses truncated_normal for initialization
            # cf https://github.com/pytorch/pytorch/pull/5617
            cell.weight.set_data(initializer(Normal(self.config.initializer_range),
                                                    cell.weight.shape, cell.weight.dtype))
            if cell.bias is not None:
                cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
        elif isinstance(cell, nn.Embedding):
            weight = np.random.normal(0.0, self.config.initializer_range, cell.weight.shape)
            if cell.padding_idx is not None:
                weight[cell.padding_idx] = 0

            cell.weight.set_data(Tensor(weight, cell.weight.dtype))
        elif isinstance(cell, nn.LayerNorm):
            cell.weight.set_data(initializer('ones', cell.weight.shape, cell.weight.dtype))
            cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))