
mt5

mindnlp.transformers.models.mt5.modeling_mt5

MindSpore mT5 model.

mindnlp.transformers.models.mt5.modeling_mt5.MT5Attention

Bases: Module

The MT5Attention class is a module that implements the attention mechanism used in the MT5 model. It is designed to be used as a building block for Transformer-based models.

This class inherits from the nn.Module class, which is the base class for all neural network modules in MindSpore.

The main purpose of this class is to compute the attention weights and output of the attention mechanism. It takes in the hidden states, mask, key-value states, position bias, past key-value states, layer head mask, query length, use cache flag, and output attentions flag as inputs.

The class provides the following methods:

  • __init__: Initializes the MT5Attention instance with the given configuration and relative attention bias flag.
  • prune_heads: Prunes the specified attention heads from the model.
  • _relative_position_bucket: Translates the relative position to a bucket number for relative attention. This method is adapted from Mesh TensorFlow.
  • compute_bias: Computes the binned relative position bias for the attention mechanism.
  • forward: Constructs the attention mechanism by applying self-attention (if key_value_states is None) or attention over the source sentence (provided by key_value_states).

Please refer to the method docstrings for more detailed information on each method and its parameters.
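
A minimal usage sketch follows. It is illustrative only: it assumes MT5Config is importable from mindnlp.transformers, and the configuration values are made up rather than mT5 defaults.

import mindspore
from mindspore import ops
from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5Attention

# Illustrative config values; not the pretrained mT5 defaults.
config = MT5Config(d_model=512, d_kv=64, num_heads=8, dropout_rate=0.1)
attn = MT5Attention(config, has_relative_attention_bias=True)

hidden_states = ops.ones((2, 16, config.d_model), mindspore.float32)  # (batch, seq, d_model)
attn_output, present_key_value, position_bias = attn(hidden_states)
print(attn_output.shape)  # (2, 16, 512)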

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
class MT5Attention(nn.Module):

    """
    The `MT5Attention` class is a module that implements the attention mechanism used in the MT5 model.
    It is designed to be used as a building block for Transformer-based models.

    This class inherits from the `nn.Module` class, which is the base class for all neural network modules in MindSpore.

    The main purpose of this class is to compute the attention weights and output of the attention mechanism.
    It takes in the hidden states, mask, key-value states, position bias, past key-value states, layer head mask,
    query length, use cache flag, and output attentions flag as inputs.

    The class provides the following methods:

    - `__init__`: Initializes the `MT5Attention` instance with the given configuration and relative attention bias flag.
    - `prune_heads`: Prunes the specified attention heads from the model.
    - `_relative_position_bucket`: Translates the relative position to a bucket number for relative attention.
    This method is adapted from Mesh TensorFlow.
    - `compute_bias`: Computes the binned relative position bias for the attention mechanism.
    - `forward`: Constructs the attention mechanism by applying self-attention (if `key_value_states` is None) or
    attention over the source sentence (provided by `key_value_states`).

    Please refer to the method docstrings for more detailed information on each method and its parameters.
    """
    def __init__(self, config: MT5Config, has_relative_attention_bias=False):
        """
        Initializes an instance of the MT5Attention class.

        Args:
            self: The instance of the class.
            config (MT5Config): An object containing configuration parameters for the attention mechanism.
                The configuration object must have the following attributes:

                - is_decoder (bool): Indicates if the attention mechanism is used in a decoder.
                - relative_attention_num_buckets (int): Number of buckets for relative attention calculations.
                - relative_attention_max_distance (int): Maximum distance for relative attention calculations.
                - d_model (int): Dimensionality of the model.
                - d_kv (int): Dimensionality of the key and value projections.
                - num_heads (int): Number of attention heads.
                - dropout_rate (float): Dropout rate to apply.
            has_relative_attention_bias (bool, optional): Indicates whether relative attention bias is used.
            Default is False.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        self.is_decoder = config.is_decoder
        self.has_relative_attention_bias = has_relative_attention_bias
        self.relative_attention_num_buckets = config.relative_attention_num_buckets
        self.relative_attention_max_distance = config.relative_attention_max_distance
        self.d_model = config.d_model
        self.key_value_proj_dim = config.d_kv
        self.n_heads = config.num_heads
        self.dropout = config.dropout_rate
        self.inner_dim = self.n_heads * self.key_value_proj_dim

        # Mesh TensorFlow initialization to avoid scaling before softmax
        self.q = nn.Linear(self.d_model, self.inner_dim, bias=False)
        self.k = nn.Linear(self.d_model, self.inner_dim, bias=False)
        self.v = nn.Linear(self.d_model, self.inner_dim, bias=False)
        self.o = nn.Linear(self.inner_dim, self.d_model, bias=False)

        if self.has_relative_attention_bias:
            self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets, self.n_heads)
        self.pruned_heads = set()

    def prune_heads(self, heads):
        """
        This method 'prune_heads' is defined in the class 'MT5Attention' and is used to prune specific heads in the
        attention mechanism of an MT5 model.

        Args:
            self (object): The instance of the MT5Attention class.
                It is used to access the attributes and methods within the class.
            heads (list): A list of integers representing the indices of the heads to be pruned.
                The indices should be within the range of existing heads in the attention mechanism.

        Returns:
            None: This method does not return any value. It modifies the attributes of the MT5Attention instance in place.

        Raises:
            None:
                However, potential exceptions may arise if the input 'heads' list contains indices that are out of
                bounds of the existing heads or if any of the helper functions called within this method encounter
                errors.
        """
        if len(heads) == 0:
            return
        heads, index = find_pruneable_heads_and_indices(
            heads, self.n_heads, self.key_value_proj_dim, self.pruned_heads
        )
        # Prune linear layers
        self.q = prune_linear_layer(self.q, index)
        self.k = prune_linear_layer(self.k, index)
        self.v = prune_linear_layer(self.v, index)
        self.o = prune_linear_layer(self.o, index, dim=1)
        # Update hyper params
        self.n_heads = self.n_heads - len(heads)
        self.inner_dim = self.key_value_proj_dim * self.n_heads
        self.pruned_heads = self.pruned_heads.union(heads)

    @staticmethod
    def _relative_position_bucket(relative_position, bidirectional=True, num_buckets=32, max_distance=128):
        """
        Adapted from Mesh Tensorflow:
        https://github.com/tensorflow/mesh/blob/0cb87fe07da627bf0b7e60475d59f95ed6b5be3d/mesh_tensorflow/transformer/transformer_layers.py#L593

        Translate relative position to a bucket number for relative attention. The relative position is defined as
        memory_position - query_position, i.e. the distance in tokens from the attending position to the attended-to
        position. If bidirectional=False, then positive relative positions are invalid. We use smaller buckets for
        small absolute relative_position and larger buckets for larger absolute relative_positions. All relative
        positions >=max_distance map to the same bucket. All relative positions <=-max_distance map to the same bucket.
        This should allow for more graceful generalization to longer sequences than the model has been trained on

        Args:
            relative_position: an int32 Tensor
            bidirectional: a boolean - whether the attention is bidirectional
            num_buckets: an integer
            max_distance: an integer

        Returns:
            a Tensor with the same shape as relative_position, containing int32 values in the range [0, num_buckets)
        """
        relative_buckets = 0
        if bidirectional:
            num_buckets //= 2
            relative_buckets += (relative_position > 0).to(mindspore.int64) * num_buckets
            relative_position = ops.abs(relative_position)
        else:
            relative_position = -ops.minimum(relative_position, ops.zeros_like(relative_position))
        # now relative_position is in the range [0, inf)

        # half of the buckets are for exact increments in positions
        max_exact = num_buckets // 2
        is_small = relative_position < max_exact

        # The other half of the buckets are for logarithmically bigger bins in positions up to max_distance
        relative_position_if_large = max_exact + (
            ops.log(relative_position.float() / max_exact)
            / math.log(max_distance / max_exact)
            * (num_buckets - max_exact)
        ).to(mindspore.int64)
        relative_position_if_large = ops.minimum(
            relative_position_if_large, ops.full_like(relative_position_if_large, num_buckets - 1)
        )

        relative_buckets += ops.where(is_small, relative_position, relative_position_if_large)
        return relative_buckets

    def compute_bias(self, query_length, key_length):
        """Compute binned relative position bias"""
        context_position = ops.arange(query_length, dtype=mindspore.int64)[:, None]
        memory_position = ops.arange(key_length, dtype=mindspore.int64)[None, :]
        relative_position = memory_position - context_position  # shape (query_length, key_length)
        relative_position_bucket = self._relative_position_bucket(
            relative_position,  # shape (query_length, key_length)
            bidirectional=(not self.is_decoder),
            num_buckets=self.relative_attention_num_buckets,
            max_distance=self.relative_attention_max_distance,
        )
        values = self.relative_attention_bias(relative_position_bucket)  # shape (query_length, key_length, num_heads)
        values = values.permute([2, 0, 1]).unsqueeze(0)  # shape (1, num_heads, query_length, key_length)
        return values

    def forward(
        self,
        hidden_states,
        mask=None,
        key_value_states=None,
        position_bias=None,
        past_key_value=None,
        layer_head_mask=None,
        query_length=None,
        use_cache=False,
        output_attentions=False,
    ):
        """
        Self-attention (if key_value_states is None) or attention over the source sentence (provided by key_value_states).
        """
        # Input is (batch_size, seq_length, dim)
        # Mask is (batch_size, key_length) (non-causal) or (batch_size, key_length, key_length)
        # past_key_value[0] is (batch_size, n_heads, q_len - 1, dim_per_head)
        batch_size, seq_length = hidden_states.shape[:2]

        real_seq_length = seq_length

        if past_key_value is not None:
            if len(past_key_value) != 2:
                raise ValueError(
                    f"past_key_value should have 2 past states: keys and values. Got { len(past_key_value)} past states"
                )
            real_seq_length += past_key_value[0].shape[2] if query_length is None else query_length

        key_length = real_seq_length if key_value_states is None else key_value_states.shape[1]

        def shape(states):
            """projection"""
            return states.view(batch_size, -1, self.n_heads, self.key_value_proj_dim).swapaxes(1, 2)

        def unshape(states):
            """reshape"""
            return states.swapaxes(1, 2).view(batch_size, -1, self.inner_dim)

        def project(hidden_states, proj_layer, key_value_states, past_key_value):
            """projects hidden states correctly to key/query states"""
            if key_value_states is None:
                # self-attn
                # (batch_size, n_heads, seq_length, dim_per_head)
                hidden_states = shape(proj_layer(hidden_states))
            elif past_key_value is None:
                # cross-attn
                # (batch_size, n_heads, seq_length, dim_per_head)
                hidden_states = shape(proj_layer(key_value_states))

            if past_key_value is not None:
                if key_value_states is None:
                    # self-attn
                    # (batch_size, n_heads, key_length, dim_per_head)
                    hidden_states = ops.cat([past_key_value, hidden_states], axis=2)
                elif past_key_value.shape[2] != key_value_states.shape[1]:
                    # checking that the `sequence_length` of the `past_key_value` is the same as
                    # the provided `key_value_states` to support prefix tuning
                    # cross-attn
                    # (batch_size, n_heads, seq_length, dim_per_head)
                    hidden_states = shape(proj_layer(key_value_states))
                else:
                    # cross-attn
                    hidden_states = past_key_value
            return hidden_states

        # get query states
        query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length, dim_per_head)

        # get key/value states
        key_states = project(
            hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
        )
        value_states = project(
            hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
        )

        # compute scores
        scores = ops.matmul(
            query_states, key_states.swapaxes(3, 2)
        )  # equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9

        if position_bias is None:
            if not self.has_relative_attention_bias:
                position_bias = ops.zeros(
                    (1, self.n_heads, real_seq_length, key_length), dtype=scores.dtype
                )
            else:
                position_bias = self.compute_bias(real_seq_length, key_length)

            # if key and values are already calculated
            # we want only the last query position bias
            if past_key_value is not None:
                position_bias = position_bias[:, :, -hidden_states.shape[1] :, :]

            if mask is not None:
                position_bias = position_bias + mask  # (batch_size, n_heads, seq_length, key_length)

        if self.pruned_heads:
            mask = ops.ones(position_bias.shape[1])
            mask[list(self.pruned_heads)] = 0
            position_bias_masked = position_bias[:, mask.bool()]
        else:
            position_bias_masked = position_bias

        scores += position_bias_masked
        attn_weights = ops.softmax(scores.float(), axis=-1).astype(
            scores.dtype
        )  # (batch_size, n_heads, seq_length, key_length)
        attn_weights = ops.dropout(
            attn_weights, p=self.dropout, training=self.training
        )  # (batch_size, n_heads, seq_length, key_length)

        # Mask heads if we want to
        if layer_head_mask is not None:
            attn_weights = attn_weights * layer_head_mask

        attn_output = unshape(ops.matmul(attn_weights, value_states))  # (batch_size, seq_length, dim)
        attn_output = self.o(attn_output)

        present_key_value_state = (key_states, value_states) if (self.is_decoder and use_cache) else None
        outputs = (attn_output,) + (present_key_value_state,) + (position_bias,)

        if output_attentions:
            outputs = outputs + (attn_weights,)
        return outputs

mindnlp.transformers.models.mt5.modeling_mt5.MT5Attention.__init__(config, has_relative_attention_bias=False)

Initializes an instance of the MT5Attention class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An object containing configuration parameters for the attention mechanism. The configuration object must have the following attributes:

  • is_decoder (bool): Indicates if the attention mechanism is used in a decoder.
  • relative_attention_num_buckets (int): Number of buckets for relative attention calculations.
  • relative_attention_max_distance (int): Maximum distance for relative attention calculations.
  • d_model (int): Dimensionality of the model.
  • d_kv (int): Dimensionality of the key and value projections.
  • num_heads (int): Number of attention heads.
  • dropout_rate (float): Dropout rate to apply.

TYPE: MT5Config

has_relative_attention_bias

Indicates whether relative attention bias is used.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION

None.
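
As a hedged illustration of how these attributes shape the module (the values below are assumptions, not mT5 defaults): inner_dim is num_heads * d_kv, and the q/k/v/o projections are built from d_model and inner_dim.

from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5Attention

config = MT5Config(d_model=512, d_kv=64, num_heads=8, dropout_rate=0.1)
attn = MT5Attention(config, has_relative_attention_bias=False)

assert attn.inner_dim == config.num_heads * config.d_kv  # 8 * 64 = 512
# q, k, v: Linear(d_model -> inner_dim, bias=False); o: Linear(inner_dim -> d_model, bias=False)
# No relative_attention_bias embedding is created because has_relative_attention_bias is False.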

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def __init__(self, config: MT5Config, has_relative_attention_bias=False):
    """
    Initializes an instance of the MT5Attention class.

    Args:
        self: The instance of the class.
        config (MT5Config): An object containing configuration parameters for the attention mechanism.
            The configuration object must have the following attributes:

            - is_decoder (bool): Indicates if the attention mechanism is used in a decoder.
            - relative_attention_num_buckets (int): Number of buckets for relative attention calculations.
            - relative_attention_max_distance (int): Maximum distance for relative attention calculations.
            - d_model (int): Dimensionality of the model.
            - d_kv (int): Dimensionality of the key and value projections.
            - num_heads (int): Number of attention heads.
            - dropout_rate (float): Dropout rate to apply.
        has_relative_attention_bias (bool, optional): Indicates whether relative attention bias is used.
        Default is False.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    self.is_decoder = config.is_decoder
    self.has_relative_attention_bias = has_relative_attention_bias
    self.relative_attention_num_buckets = config.relative_attention_num_buckets
    self.relative_attention_max_distance = config.relative_attention_max_distance
    self.d_model = config.d_model
    self.key_value_proj_dim = config.d_kv
    self.n_heads = config.num_heads
    self.dropout = config.dropout_rate
    self.inner_dim = self.n_heads * self.key_value_proj_dim

    # Mesh TensorFlow initialization to avoid scaling before softmax
    self.q = nn.Linear(self.d_model, self.inner_dim, bias=False)
    self.k = nn.Linear(self.d_model, self.inner_dim, bias=False)
    self.v = nn.Linear(self.d_model, self.inner_dim, bias=False)
    self.o = nn.Linear(self.inner_dim, self.d_model, bias=False)

    if self.has_relative_attention_bias:
        self.relative_attention_bias = nn.Embedding(self.relative_attention_num_buckets, self.n_heads)
    self.pruned_heads = set()

mindnlp.transformers.models.mt5.modeling_mt5.MT5Attention.compute_bias(query_length, key_length)

Compute binned relative position bias
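
A hedged sketch of how this fits together (illustrative sizes; default bucket settings assumed): relative positions memory_position - query_position are mapped to bucket ids and then looked up in the relative_attention_bias embedding.

import mindspore
from mindspore import ops
from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5Attention

config = MT5Config(d_model=512, d_kv=64, num_heads=8)
attn = MT5Attention(config, has_relative_attention_bias=True)

# relative_position = memory_position - query_position for 4 queries attending to 6 keys
relative_position = ops.arange(6, dtype=mindspore.int64)[None, :] - ops.arange(4, dtype=mindspore.int64)[:, None]
buckets = attn._relative_position_bucket(
    relative_position,
    bidirectional=not attn.is_decoder,
    num_buckets=attn.relative_attention_num_buckets,
    max_distance=attn.relative_attention_max_distance,
)
print(buckets.shape)                  # (4, 6), integer bucket ids in [0, num_buckets)
print(attn.compute_bias(4, 6).shape)  # (1, num_heads, 4, 6)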

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def compute_bias(self, query_length, key_length):
    """Compute binned relative position bias"""
    context_position = ops.arange(query_length, dtype=mindspore.int64)[:, None]
    memory_position = ops.arange(key_length, dtype=mindspore.int64)[None, :]
    relative_position = memory_position - context_position  # shape (query_length, key_length)
    relative_position_bucket = self._relative_position_bucket(
        relative_position,  # shape (query_length, key_length)
        bidirectional=(not self.is_decoder),
        num_buckets=self.relative_attention_num_buckets,
        max_distance=self.relative_attention_max_distance,
    )
    values = self.relative_attention_bias(relative_position_bucket)  # shape (query_length, key_length, num_heads)
    values = values.permute([2, 0, 1]).unsqueeze(0)  # shape (1, num_heads, query_length, key_length)
    return values

mindnlp.transformers.models.mt5.modeling_mt5.MT5Attention.forward(hidden_states, mask=None, key_value_states=None, position_bias=None, past_key_value=None, layer_head_mask=None, query_length=None, use_cache=False, output_attentions=False)

Self-attention (if key_value_states is None) or attention over the source sentence (provided by key_value_states).
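
A hedged cross-attention sketch (config values and shapes are illustrative assumptions): passing key_value_states makes the module attend from the decoder states over the encoder sequence, and the returned tuple is (attn_output, present_key_value, position_bias), plus attn_weights when output_attentions is True.

import mindspore
from mindspore import ops
from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5Attention

config = MT5Config(d_model=512, d_kv=64, num_heads=8, is_decoder=True)
cross_attn = MT5Attention(config, has_relative_attention_bias=False)

decoder_states = ops.ones((2, 5, config.d_model), mindspore.float32)   # queries
encoder_states = ops.ones((2, 12, config.d_model), mindspore.float32)  # keys / values

attn_output, present_key_value, position_bias, attn_weights = cross_attn(
    decoder_states, key_value_states=encoder_states, use_cache=True, output_attentions=True
)
print(attn_output.shape)   # (2, 5, 512)
print(attn_weights.shape)  # (2, 8, 5, 12)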

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def forward(
    self,
    hidden_states,
    mask=None,
    key_value_states=None,
    position_bias=None,
    past_key_value=None,
    layer_head_mask=None,
    query_length=None,
    use_cache=False,
    output_attentions=False,
):
    """
    Self-attention (if key_value_states is None) or attention over the source sentence (provided by key_value_states).
    """
    # Input is (batch_size, seq_length, dim)
    # Mask is (batch_size, key_length) (non-causal) or (batch_size, key_length, key_length)
    # past_key_value[0] is (batch_size, n_heads, q_len - 1, dim_per_head)
    batch_size, seq_length = hidden_states.shape[:2]

    real_seq_length = seq_length

    if past_key_value is not None:
        if len(past_key_value) != 2:
            raise ValueError(
                f"past_key_value should have 2 past states: keys and values. Got { len(past_key_value)} past states"
            )
        real_seq_length += past_key_value[0].shape[2] if query_length is None else query_length

    key_length = real_seq_length if key_value_states is None else key_value_states.shape[1]

    def shape(states):
        """projection"""
        return states.view(batch_size, -1, self.n_heads, self.key_value_proj_dim).swapaxes(1, 2)

    def unshape(states):
        """reshape"""
        return states.swapaxes(1, 2).view(batch_size, -1, self.inner_dim)

    def project(hidden_states, proj_layer, key_value_states, past_key_value):
        """projects hidden states correctly to key/query states"""
        if key_value_states is None:
            # self-attn
            # (batch_size, n_heads, seq_length, dim_per_head)
            hidden_states = shape(proj_layer(hidden_states))
        elif past_key_value is None:
            # cross-attn
            # (batch_size, n_heads, seq_length, dim_per_head)
            hidden_states = shape(proj_layer(key_value_states))

        if past_key_value is not None:
            if key_value_states is None:
                # self-attn
                # (batch_size, n_heads, key_length, dim_per_head)
                hidden_states = ops.cat([past_key_value, hidden_states], axis=2)
            elif past_key_value.shape[2] != key_value_states.shape[1]:
                # checking that the `sequence_length` of the `past_key_value` is the same as
                # the provided `key_value_states` to support prefix tuning
                # cross-attn
                # (batch_size, n_heads, seq_length, dim_per_head)
                hidden_states = shape(proj_layer(key_value_states))
            else:
                # cross-attn
                hidden_states = past_key_value
        return hidden_states

    # get query states
    query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length, dim_per_head)

    # get key/value states
    key_states = project(
        hidden_states, self.k, key_value_states, past_key_value[0] if past_key_value is not None else None
    )
    value_states = project(
        hidden_states, self.v, key_value_states, past_key_value[1] if past_key_value is not None else None
    )

    # compute scores
    scores = ops.matmul(
        query_states, key_states.swapaxes(3, 2)
    )  # equivalent of torch.einsum("bnqd,bnkd->bnqk", query_states, key_states), compatible with onnx op>9

    if position_bias is None:
        if not self.has_relative_attention_bias:
            position_bias = ops.zeros(
                (1, self.n_heads, real_seq_length, key_length), dtype=scores.dtype
            )
        else:
            position_bias = self.compute_bias(real_seq_length, key_length)

        # if key and values are already calculated
        # we want only the last query position bias
        if past_key_value is not None:
            position_bias = position_bias[:, :, -hidden_states.shape[1] :, :]

        if mask is not None:
            position_bias = position_bias + mask  # (batch_size, n_heads, seq_length, key_length)

    if self.pruned_heads:
        mask = ops.ones(position_bias.shape[1])
        mask[list(self.pruned_heads)] = 0
        position_bias_masked = position_bias[:, mask.bool()]
    else:
        position_bias_masked = position_bias

    scores += position_bias_masked
    attn_weights = ops.softmax(scores.float(), axis=-1).astype(
        scores.dtype
    )  # (batch_size, n_heads, seq_length, key_length)
    attn_weights = ops.dropout(
        attn_weights, p=self.dropout, training=self.training
    )  # (batch_size, n_heads, seq_length, key_length)

    # Mask heads if we want to
    if layer_head_mask is not None:
        attn_weights = attn_weights * layer_head_mask

    attn_output = unshape(ops.matmul(attn_weights, value_states))  # (batch_size, seq_length, dim)
    attn_output = self.o(attn_output)

    present_key_value_state = (key_states, value_states) if (self.is_decoder and use_cache) else None
    outputs = (attn_output,) + (present_key_value_state,) + (position_bias,)

    if output_attentions:
        outputs = outputs + (attn_weights,)
    return outputs

mindnlp.transformers.models.mt5.modeling_mt5.MT5Attention.prune_heads(heads)

This method 'prune_heads' is defined in the class 'MT5Attention' and is used to prune specific heads in the attention mechanism of an MT5 model.

PARAMETER DESCRIPTION
self

The instance of the MT5Attention class. It is used to access the attributes and methods within the class.

TYPE: object

heads

A list of integers representing the indices of the heads to be pruned. The indices should be within the range of existing heads in the attention mechanism.

TYPE: list

RETURNS DESCRIPTION
None

This method does not return any value. It modifies the attributes of the MT5Attention instance in place.

RAISES DESCRIPTION
None

However, potential exceptions may arise if the input 'heads' list contains indices that are out of bounds of the existing heads or if any of the helper functions called within this method encounter errors.
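
A hedged sketch of the effect (illustrative config values): pruning removes the corresponding slices from the q/k/v/o projections and updates n_heads, inner_dim, and pruned_heads accordingly.

from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5Attention

config = MT5Config(d_model=512, d_kv=64, num_heads=8)
attn = MT5Attention(config)

attn.prune_heads([0, 3])
print(attn.n_heads)       # 6
print(attn.inner_dim)     # 6 * 64 = 384
print(attn.pruned_heads)  # {0, 3}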

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def prune_heads(self, heads):
    """
    This method 'prune_heads' is defined in the class 'MT5Attention' and is used to prune specific heads in the
    attention mechanism of an MT5 model.

    Args:
        self (object): The instance of the MT5Attention class.
            It is used to access the attributes and methods within the class.
        heads (list): A list of integers representing the indices of the heads to be pruned.
            The indices should be within the range of existing heads in the attention mechanism.

    Returns:
        None: This method does not return any value. It modifies the attributes of the MT5Attention instance in place.

    Raises:
        None:
            However, potential exceptions may arise if the input 'heads' list contains indices that are out of
            bounds of the existing heads or if any of the helper functions called within this method encounter
            errors.
    """
    if len(heads) == 0:
        return
    heads, index = find_pruneable_heads_and_indices(
        heads, self.n_heads, self.key_value_proj_dim, self.pruned_heads
    )
    # Prune linear layers
    self.q = prune_linear_layer(self.q, index)
    self.k = prune_linear_layer(self.k, index)
    self.v = prune_linear_layer(self.v, index)
    self.o = prune_linear_layer(self.o, index, dim=1)
    # Update hyper params
    self.n_heads = self.n_heads - len(heads)
    self.inner_dim = self.key_value_proj_dim * self.n_heads
    self.pruned_heads = self.pruned_heads.union(heads)

mindnlp.transformers.models.mt5.modeling_mt5.MT5Block

Bases: Module

This class represents a block of the MT5 model, which is a Transformer-based neural network architecture for sequence-to-sequence tasks. It consists of a self-attention layer, an optional cross-attention layer, and a feed-forward layer.

ATTRIBUTE DESCRIPTION
`is_decoder`

Indicates whether the block is used in the decoder part of the model.

TYPE: bool

`layer`

A list of layers in the block, including the self-attention, cross-attention, and feed-forward layers.

TYPE: ModuleList

METHOD DESCRIPTION
`forward`

Performs the forward pass of the block, processing the input hidden states and generating the outputs.

Details

The MT5Block class inherits from the nn.Module class and overrides the forward method. The __init__ method initializes the block's attributes, including the is_decoder flag and the list of layers.

The forward method takes various input parameters, including the hidden states, attention masks, position biases, and layer head masks. It also accepts optional parameters for encoder hidden states and attention masks, as well as past key-value states used for caching.

The method first checks if past key-value states are provided and validates their correctness. It then retrieves the self-attention and cross-attention past key-value states from the input if present.

Next, the method passes the hidden states through the self-attention layer, using the provided attention mask, position bias, and layer head mask. The output includes the updated hidden states and the present key-value state.

If the block is a decoder and encoder hidden states are provided, the method performs cross-attention. It retrieves the query length and passes the hidden states, encoder hidden states, and other parameters to the cross-attention layer. The output includes the updated hidden states and the present key-value state.

Finally, the method passes the hidden states through the feed-forward layer. It then clamps the hidden states to prevent any numerical issues and returns the final hidden states along with any additional outputs, such as present key-value states and attention outputs, depending on the value of the use_cache parameter.

Note

This class assumes the usage of the MindSpore deep learning framework.
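
A hedged sketch of a decoder block in isolation (config values are illustrative assumptions; in practice the attention masks are built by the surrounding stack): self-attention, cross-attention over the encoder states, then the feed-forward layer.

import mindspore
from mindspore import ops
from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5Block

config = MT5Config(d_model=512, d_kv=64, num_heads=8, is_decoder=True)
block = MT5Block(config, has_relative_attention_bias=True)

decoder_states = ops.ones((2, 5, config.d_model), mindspore.float32)
encoder_states = ops.ones((2, 12, config.d_model), mindspore.float32)

outputs = block(decoder_states, encoder_hidden_states=encoder_states, use_cache=True)
hidden_states, present_key_value = outputs[:2]
print(hidden_states.shape)     # (2, 5, 512)
print(len(present_key_value))  # 4: self-attention key/value plus cross-attention key/value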

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
class MT5Block(nn.Module):

    """
    This class represents a block of the MT5 model, which is a Transformer-based neural network architecture for
    sequence-to-sequence tasks. It consists of a self-attention layer, an optional cross-attention layer, and a
    feed-forward layer.

    Attributes:
        `is_decoder` (bool): Indicates whether the block is used in the decoder part of the model.
        `layer` (nn.ModuleList): A list of layers in the block, including the self-attention, cross-attention,
            and feed-forward layers.

    Methods:
        `forward`: Performs the forward pass of the block, processing the input hidden states and generating the outputs.

    Details:
        The `MT5Block` class inherits from the `nn.Module` class and overrides the `forward` method. The `__init__`
        method initializes the block's attributes, including the `is_decoder` flag and the list of layers.

        The `forward` method takes various input parameters, including the hidden states, attention masks,
        position biases, and layer head masks. It also accepts optional parameters for encoder hidden states and
        attention masks, as well as past key-value states used for caching.

        The method first checks if past key-value states are provided and validates their correctness.
        It then retrieves the self-attention and cross-attention past key-value states from the input if present.

        Next, the method passes the hidden states through the self-attention layer, using the provided attention mask,
        position bias, and layer head mask. The output includes the updated hidden states and the
        present key-value state.

        If the block is a decoder and encoder hidden states are provided, the method performs cross-attention.
        It retrieves the query length and passes the hidden states, encoder hidden states, and other
        parameters to the cross-attention layer. The output includes the updated hidden states and the present key-value state.

        Finally, the method passes the hidden states through the feed-forward layer. It then clamps the hidden states
        to prevent any numerical issues and returns the final hidden states along with any additional outputs, such as
        present key-value states and attention outputs, depending on the value of the `use_cache` parameter.

    Note:
        This class assumes the usage of the MindSpore deep learning framework.

    """
    def __init__(self, config, has_relative_attention_bias=False):
        """
        Initializes a new instance of the MT5Block class.

        Args:
            self: The object itself.
            config (object): The configuration object for MT5Block.
            has_relative_attention_bias (bool, optional): Specifies whether the attention bias is relative or not.
                Default is False.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        self.is_decoder = config.is_decoder
        self.layer = nn.ModuleList()
        self.layer.append(MT5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias))
        if self.is_decoder:
            self.layer.append(MT5LayerCrossAttention(config))

        self.layer.append(MT5LayerFF(config))

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        position_bias=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        encoder_decoder_position_bias=None,
        layer_head_mask=None,
        cross_attn_layer_head_mask=None,
        past_key_value=None,
        use_cache=False,
        output_attentions=False,
    ):
        """
        Constructs the MT5Block.

        This method is responsible for performing the main computations of the MT5Block.
        It takes in multiple parameters and returns a tuple of outputs.

        Args:
            self (MT5Block): An instance of the MT5Block class.
            hidden_states (Tensor): The hidden states of the input sequence.
                Shape: (batch_size, sequence_length, hidden_size).
            attention_mask (Tensor, optional): The attention mask tensor.
                Shape: (batch_size, sequence_length). Default: None.
            position_bias (Tensor, optional): The position bias tensor.
                Shape: (batch_size, sequence_length, sequence_length). Default: None.
            encoder_hidden_states (Tensor, optional): The hidden states of the encoder sequence.
                Shape: (batch_size, encoder_sequence_length, hidden_size). Default: None.
            encoder_attention_mask (Tensor, optional): The attention mask tensor for the encoder sequence.
                Shape: (batch_size, encoder_sequence_length). Default: None.
            encoder_decoder_position_bias (Tensor, optional): The position bias tensor for encoder-decoder attention.
                Shape: (batch_size, sequence_length, encoder_sequence_length). Default: None.
            layer_head_mask (Tensor, optional): The layer head mask tensor. Shape: (num_layers, num_heads).
                Default: None.
            cross_attn_layer_head_mask (Tensor, optional): The cross-attention layer head mask tensor.
                Shape: (num_layers, num_heads). Default: None.
            past_key_value (Tuple, optional): Tuple containing the past key-value states.
                Shape: (2 or 4, batch_size, num_heads, past_sequence_length, hidden_size). Default: None.
            use_cache (bool, optional): Whether to use caching. Default: False.
            output_attentions (bool, optional): Whether to output attention weights. Default: False.

        Returns:
            tuple: The hidden states, followed by the present key-value states (when `use_cache` is True) and
            the attention outputs (position biases and, optionally, attention weights).

        Raises:
            ValueError: If the length of past_key_value is not equal to the expected number of past states.
            Warning: If past_key_values is passed to the encoder.
            TypeError: If the data type of hidden_states is not supported.
            TypeError: If the data type of encoder_hidden_states is not supported.
            TypeError: If the data type of hidden_states after cross-attention is not supported.
            TypeError: If the data type of hidden_states after the final layer is not supported.
        """
        if past_key_value is not None:
            if not self.is_decoder:
                logger.warning("`past_key_values` is passed to the encoder. Please make sure this is intended.")
            expected_num_past_key_values = 2 if encoder_hidden_states is None else 4

            if len(past_key_value) != expected_num_past_key_values:
                raise ValueError(
                    f"There should be {expected_num_past_key_values} past states. "
                    f"{'2 (past / key) for cross attention. ' if expected_num_past_key_values == 4 else ''}"
                    f"Got {len(past_key_value)} past key / value states"
                )

            self_attn_past_key_value = past_key_value[:2]
            cross_attn_past_key_value = past_key_value[2:]
        else:
            self_attn_past_key_value, cross_attn_past_key_value = None, None

        self_attention_outputs = self.layer[0](
            hidden_states,
            attention_mask=attention_mask,
            position_bias=position_bias,
            layer_head_mask=layer_head_mask,
            past_key_value=self_attn_past_key_value,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        hidden_states, present_key_value_state = self_attention_outputs[:2]
        attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs and relative position weights

        # clamp inf values to enable fp16 training
        if hidden_states.dtype == mindspore.float16:
            clamp_value = ops.where(
                ops.isinf(hidden_states).any(),
                np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).max - 1000,
                np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).max,
            )
            hidden_states = ops.clamp(hidden_states, min=-clamp_value, max=clamp_value)

        do_cross_attention = self.is_decoder and encoder_hidden_states is not None
        if do_cross_attention:
            # the actual query length is unknown for cross attention
            # if using past key value states. Need to inject it here
            if present_key_value_state is not None:
                query_length = present_key_value_state[0].shape[2]
            else:
                query_length = None

            cross_attention_outputs = self.layer[1](
                hidden_states,
                key_value_states=encoder_hidden_states,
                attention_mask=encoder_attention_mask,
                position_bias=encoder_decoder_position_bias,
                layer_head_mask=cross_attn_layer_head_mask,
                past_key_value=cross_attn_past_key_value,
                query_length=query_length,
                use_cache=use_cache,
                output_attentions=output_attentions,
            )
            hidden_states = cross_attention_outputs[0]

            # clamp inf values to enable fp16 training
            if hidden_states.dtype == mindspore.float16:
                clamp_value = ops.where(
                    ops.isinf(hidden_states).any(),
                    np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).max - 1000,
                    np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).max,
                )
                hidden_states = ops.clamp(hidden_states, min=-clamp_value, max=clamp_value)

            # Combine self attn and cross attn key value states
            if present_key_value_state is not None:
                present_key_value_state = present_key_value_state + cross_attention_outputs[1]

            # Keep cross-attention outputs and relative position weights
            attention_outputs = attention_outputs + cross_attention_outputs[2:]

        # Apply Feed Forward layer
        hidden_states = self.layer[-1](hidden_states)

        # clamp inf values to enable fp16 training
        if hidden_states.dtype == mindspore.float16:
            clamp_value = ops.where(
                ops.isinf(hidden_states).any(),
                np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).max - 1000,
                np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).max,
            )
            hidden_states = ops.clamp(hidden_states, min=-clamp_value, max=clamp_value)

        outputs = (hidden_states,)

        if use_cache:
            outputs = outputs + (present_key_value_state,) + attention_outputs
        else:
            outputs = outputs + attention_outputs

        return outputs  # hidden-states, present_key_value_states, (self-attention position bias), (self-attention weights), (cross-attention position bias), (cross-attention weights)

mindnlp.transformers.models.mt5.modeling_mt5.MT5Block.__init__(config, has_relative_attention_bias=False)

Initializes a new instance of the MT5Block class.

PARAMETER DESCRIPTION
self

The object itself.

config

The configuration object for MT5Block.

TYPE: object

has_relative_attention_bias

Specifies whether the attention bias is relative or not. Default is False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def __init__(self, config, has_relative_attention_bias=False):
    """
    Initializes a new instance of the MT5Block class.

    Args:
        self: The object itself.
        config (object): The configuration object for MT5Block.
        has_relative_attention_bias (bool, optional): Specifies whether the attention bias is relative or not.
            Default is False.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    self.is_decoder = config.is_decoder
    self.layer = nn.ModuleList()
    self.layer.append(MT5LayerSelfAttention(config, has_relative_attention_bias=has_relative_attention_bias))
    if self.is_decoder:
        self.layer.append(MT5LayerCrossAttention(config))

    self.layer.append(MT5LayerFF(config))

mindnlp.transformers.models.mt5.modeling_mt5.MT5Block.forward(hidden_states, attention_mask=None, position_bias=None, encoder_hidden_states=None, encoder_attention_mask=None, encoder_decoder_position_bias=None, layer_head_mask=None, cross_attn_layer_head_mask=None, past_key_value=None, use_cache=False, output_attentions=False)

Constructs the MT5Block.

This method is responsible for performing the main computations of the MT5Block. It takes in multiple parameters and returns a tuple of outputs.

PARAMETER DESCRIPTION
self

An instance of the MT5Block class.

TYPE: MT5Block

hidden_states

The hidden states of the input sequence. Shape: (batch_size, sequence_length, hidden_size).

TYPE: Tensor

attention_mask

The attention mask tensor. Shape: (batch_size, sequence_length). Default: None.

TYPE: Tensor DEFAULT: None

position_bias

The position bias tensor. Shape: (batch_size, sequence_length, sequence_length). Default: None.

TYPE: Tensor DEFAULT: None

encoder_hidden_states

The hidden states of the encoder sequence. Shape: (batch_size, encoder_sequence_length, hidden_size). Default: None.

TYPE: Tensor DEFAULT: None

encoder_attention_mask

The attention mask tensor for the encoder sequence. Shape: (batch_size, encoder_sequence_length). Default: None.

TYPE: Tensor DEFAULT: None

encoder_decoder_position_bias

The position bias tensor for encoder-decoder attention. Shape: (batch_size, sequence_length, encoder_sequence_length). Default: None.

TYPE: Tensor DEFAULT: None

layer_head_mask

The layer head mask tensor. Shape: (num_layers, num_heads). Default: None.

TYPE: Tensor DEFAULT: None

cross_attn_layer_head_mask

The cross-attention layer head mask tensor. Shape: (num_layers, num_heads). Default: None.

TYPE: Tensor DEFAULT: None

past_key_value

Tuple containing the past key-value states. Shape: (2 or 4, batch_size, num_heads, past_sequence_length, hidden_size). Default: None.

TYPE: Tuple DEFAULT: None

use_cache

Whether to use caching. Default: False.

TYPE: bool DEFAULT: False

output_attentions

Whether to output attention weights. Default: False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION

A tuple containing the hidden states, followed by the present key-value states (when use_cache is True) and the attention outputs (position biases and, optionally, attention weights).

RAISES DESCRIPTION
ValueError

If the length of past_key_value is not equal to the expected number of past states.

Warning

If past_key_values is passed to the encoder.

TypeError

If the data type of hidden_states is not supported.

TypeError

If the data type of encoder_hidden_states is not supported.

TypeError

If the data type of hidden_states after cross-attention is not supported.

TypeError

If the data type of hidden_states after the final layer is not supported.
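
A hedged sketch of the output layout (illustrative decoder block; index positions follow the return comment at the end of the method): with use_cache and output_attentions both True, the tuple contains the hidden states, the present key-value states, then the self-attention position bias and weights, then the cross-attention position bias and weights.

import mindspore
from mindspore import ops
from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5Block

config = MT5Config(d_model=512, d_kv=64, num_heads=8, is_decoder=True)
block = MT5Block(config, has_relative_attention_bias=True)

decoder_states = ops.ones((2, 5, config.d_model), mindspore.float32)
encoder_states = ops.ones((2, 12, config.d_model), mindspore.float32)

outputs = block(
    decoder_states,
    encoder_hidden_states=encoder_states,
    use_cache=True,
    output_attentions=True,
)
hidden_states = outputs[0]
present_key_value = outputs[1]                      # 4 tensors for a decoder block
self_attn_bias, self_attn_weights = outputs[2:4]    # self-attention position bias and weights
cross_attn_bias, cross_attn_weights = outputs[4:6]  # cross-attention position bias and weights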

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def forward(
    self,
    hidden_states,
    attention_mask=None,
    position_bias=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    encoder_decoder_position_bias=None,
    layer_head_mask=None,
    cross_attn_layer_head_mask=None,
    past_key_value=None,
    use_cache=False,
    output_attentions=False,
):
    """
    Constructs the MT5Block.

    This method is responsible for performing the main computations of the MT5Block.
    It takes in multiple parameters and returns a tuple of outputs.

    Args:
        self (MT5Block): An instance of the MT5Block class.
        hidden_states (Tensor): The hidden states of the input sequence.
            Shape: (batch_size, sequence_length, hidden_size).
        attention_mask (Tensor, optional): The attention mask tensor.
            Shape: (batch_size, sequence_length). Default: None.
        position_bias (Tensor, optional): The position bias tensor.
            Shape: (batch_size, sequence_length, sequence_length). Default: None.
        encoder_hidden_states (Tensor, optional): The hidden states of the encoder sequence.
            Shape: (batch_size, encoder_sequence_length, hidden_size). Default: None.
        encoder_attention_mask (Tensor, optional): The attention mask tensor for the encoder sequence.
            Shape: (batch_size, encoder_sequence_length). Default: None.
        encoder_decoder_position_bias (Tensor, optional): The position bias tensor for encoder-decoder attention.
            Shape: (batch_size, sequence_length, encoder_sequence_length). Default: None.
        layer_head_mask (Tensor, optional): The layer head mask tensor. Shape: (num_layers, num_heads).
            Default: None.
        cross_attn_layer_head_mask (Tensor, optional): The cross-attention layer head mask tensor.
            Shape: (num_layers, num_heads). Default: None.
        past_key_value (Tuple, optional): Tuple containing the past key-value states.
            Shape: (2 or 4, batch_size, num_heads, past_sequence_length, hidden_size). Default: None.
        use_cache (bool, optional): Whether to use caching. Default: False.
        output_attentions (bool, optional): Whether to output attention weights. Default: False.

    Returns:
        tuple: The hidden states, followed by the present key-value states (when `use_cache` is True) and
        the attention outputs (position biases and, optionally, attention weights).

    Raises:
        ValueError: If the length of past_key_value is not equal to the expected number of past states.
        Warning: If past_key_values is passed to the encoder.
        TypeError: If the data type of hidden_states is not supported.
        TypeError: If the data type of encoder_hidden_states is not supported.
        TypeError: If the data type of hidden_states after cross-attention is not supported.
        TypeError: If the data type of hidden_states after the final layer is not supported.
    """
    if past_key_value is not None:
        if not self.is_decoder:
            logger.warning("`past_key_values` is passed to the encoder. Please make sure this is intended.")
        expected_num_past_key_values = 2 if encoder_hidden_states is None else 4

        if len(past_key_value) != expected_num_past_key_values:
            raise ValueError(
                f"There should be {expected_num_past_key_values} past states. "
                f"{'2 (past / key) for cross attention. ' if expected_num_past_key_values == 4 else ''}"
                f"Got {len(past_key_value)} past key / value states"
            )

        self_attn_past_key_value = past_key_value[:2]
        cross_attn_past_key_value = past_key_value[2:]
    else:
        self_attn_past_key_value, cross_attn_past_key_value = None, None

    self_attention_outputs = self.layer[0](
        hidden_states,
        attention_mask=attention_mask,
        position_bias=position_bias,
        layer_head_mask=layer_head_mask,
        past_key_value=self_attn_past_key_value,
        use_cache=use_cache,
        output_attentions=output_attentions,
    )
    hidden_states, present_key_value_state = self_attention_outputs[:2]
    attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs and relative position weights

    # clamp inf values to enable fp16 training
    if hidden_states.dtype == mindspore.float16:
        clamp_value = ops.where(
            ops.isinf(hidden_states).any(),
            np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).max - 1000,
            np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).max,
        )
        hidden_states = ops.clamp(hidden_states, min=-clamp_value, max=clamp_value)

    do_cross_attention = self.is_decoder and encoder_hidden_states is not None
    if do_cross_attention:
        # the actual query length is unknown for cross attention
        # if using past key value states. Need to inject it here
        if present_key_value_state is not None:
            query_length = present_key_value_state[0].shape[2]
        else:
            query_length = None

        cross_attention_outputs = self.layer[1](
            hidden_states,
            key_value_states=encoder_hidden_states,
            attention_mask=encoder_attention_mask,
            position_bias=encoder_decoder_position_bias,
            layer_head_mask=cross_attn_layer_head_mask,
            past_key_value=cross_attn_past_key_value,
            query_length=query_length,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        hidden_states = cross_attention_outputs[0]

        # clamp inf values to enable fp16 training
        if hidden_states.dtype == mindspore.float16:
            clamp_value = ops.where(
                ops.isinf(hidden_states).any(),
                np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).max - 1000,
                np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).max,
            )
            hidden_states = ops.clamp(hidden_states, min=-clamp_value, max=clamp_value)

        # Combine self attn and cross attn key value states
        if present_key_value_state is not None:
            present_key_value_state = present_key_value_state + cross_attention_outputs[1]

        # Keep cross-attention outputs and relative position weights
        attention_outputs = attention_outputs + cross_attention_outputs[2:]

    # Apply Feed Forward layer
    hidden_states = self.layer[-1](hidden_states)

    # clamp inf values to enable fp16 training
    if hidden_states.dtype == mindspore.float16:
        clamp_value = ops.where(
            ops.isinf(hidden_states).any(),
            np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).max - 1000,
            np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).max,
        )
        hidden_states = ops.clamp(hidden_states, min=-clamp_value, max=clamp_value)

    outputs = (hidden_states,)

    if use_cache:
        outputs = outputs + (present_key_value_state,) + attention_outputs
    else:
        outputs = outputs + attention_outputs

    return outputs  # hidden-states, present_key_value_states, (self-attention position bias), (self-attention weights), (cross-attention position bias), (cross-attention weights)

mindnlp.transformers.models.mt5.modeling_mt5.MT5ClassificationHead

Bases: Module

Head for sentence-level classification tasks.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
class MT5ClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""
    def __init__(self, config: MT5Config):
        """
        Initializes the MT5ClassificationHead class with the provided configuration.

        Args:
            self (MT5ClassificationHead): The instance of the MT5ClassificationHead class.
            config (MT5Config):
                An object containing configuration parameters for the MT5 model.

                - config.d_model (int): The dimension of the model.
                - config.classifier_dropout (float): The dropout rate for the classifier.
                - config.num_labels (int): The number of output labels.

        Returns:
            None.

        Raises:
            TypeError: If the config parameter is not of type MT5Config.
            ValueError: If any of the configuration parameters are missing or invalid.
        """
        super().__init__()
        self.dense = nn.Linear(config.d_model, config.d_model)
        self.dropout = nn.Dropout(p=config.classifier_dropout)
        self.out_proj = nn.Linear(config.d_model, config.num_labels)

    def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor:
        """
        Applies the classification head to the input hidden states.

        Args:
            self: Instance of the MT5ClassificationHead class.
            hidden_states (mindspore.Tensor): The input hidden states tensor to be processed by the classification head.

        Returns:
            mindspore.Tensor: The output tensor after processing through the classification head.

        Raises:
            None
        """
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.dense(hidden_states)
        hidden_states = ops.tanh(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.out_proj(hidden_states)
        return hidden_states

mindnlp.transformers.models.mt5.modeling_mt5.MT5ClassificationHead.__init__(config)

Initializes the MT5ClassificationHead class with the provided configuration.

PARAMETER DESCRIPTION
self

The instance of the MT5ClassificationHead class.

TYPE: MT5ClassificationHead

config

An object containing configuration parameters for the MT5 model.

  • config.d_model (int): The dimension of the model.
  • config.classifier_dropout (float): The dropout rate for the classifier.
  • config.num_labels (int): The number of output labels.

TYPE: MT5Config

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the config parameter is not of type MT5Config.

ValueError

If any of the configuration parameters are missing or invalid.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 1090-1113
def __init__(self, config: MT5Config):
    """
    Initializes the MT5ClassificationHead class with the provided configuration.

    Args:
        self (MT5ClassificationHead): The instance of the MT5ClassificationHead class.
        config (MT5Config):
            An object containing configuration parameters for the MT5 model.

            - config.d_model (int): The dimension of the model.
            - config.classifier_dropout (float): The dropout rate for the classifier.
            - config.num_labels (int): The number of output labels.

    Returns:
        None.

    Raises:
        TypeError: If the config parameter is not of type MT5Config.
        ValueError: If any of the configuration parameters are missing or invalid.
    """
    super().__init__()
    self.dense = nn.Linear(config.d_model, config.d_model)
    self.dropout = nn.Dropout(p=config.classifier_dropout)
    self.out_proj = nn.Linear(config.d_model, config.num_labels)

mindnlp.transformers.models.mt5.modeling_mt5.MT5ClassificationHead.forward(hidden_states)

Applies the classification head to the input hidden states.

PARAMETER DESCRIPTION
self

Instance of the MT5ClassificationHead class.

hidden_states

The input hidden states tensor to be processed by the classification head.

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

mindspore.Tensor: The output tensor after processing through the classification head.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 1115-1134
def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor:
    """
    Applies the classification head to the input hidden states.

    Args:
        self: Instance of the MT5ClassificationHead class.
        hidden_states (mindspore.Tensor): The input hidden states tensor to be processed by the classification head.

    Returns:
        mindspore.Tensor: The output tensor after processing through the classification head.

    Raises:
        None
    """
    hidden_states = self.dropout(hidden_states)
    hidden_states = self.dense(hidden_states)
    hidden_states = ops.tanh(hidden_states)
    hidden_states = self.dropout(hidden_states)
    hidden_states = self.out_proj(hidden_states)
    return hidden_states

mindnlp.transformers.models.mt5.modeling_mt5.MT5DenseActDense

Bases: Module

MT5DenseActDense is a neural network module that implements a specific architecture for processing hidden states in the MT5 model. It consists of two dense layers with an activation function and dropout in between.

Inherits from nn.Module.

The __init__ method initializes the MT5DenseActDense module from the provided MT5Config object. It sets up the internal components: two dense layers, a dropout layer, and an activation function.

The forward method passes the input hidden states through these components in sequence: the first dense layer, the activation function, dropout, a dtype conversion if necessary, and the second dense layer. The resulting hidden states are returned as the module's output.
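
A self-contained usage sketch (the MT5Config values below are illustrative, and "relu" is assumed to be a valid ACT2FN key):

>>> import mindspore
>>> from mindspore import ops
>>> config = MT5Config(d_model=512, d_ff=1024, dropout_rate=0.1, dense_act_fn="relu")
>>> ffn = MT5DenseActDense(config)
>>> x = ops.ones((2, 8, 512), mindspore.float32)  # (batch, seq_len, d_model), dummy values
>>> ffn.forward(x).shape                          # d_ff is only the intermediate width
(2, 8, 512)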

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 111-181
class MT5DenseActDense(nn.Module):

    """
    MT5DenseActDense is a neural network module that implements a specific architecture for
    processing hidden states in the MT5 model.
    It consists of two dense layers with an activation function and dropout in between.

    Inherits from nn.Module.

    The __init__ method initializes the MT5DenseActDense module with the provided MT5Config object.
    It sets up the internal components including two dense layers, a dropout layer, and an activation function.

    The forward method processes the input hidden states through the internal components in sequence.
    It applies the first dense layer, activation function, dropout, type conversion if necessary, and the
    second dense layer.
    The final processed hidden states are returned as the output of the module.
    """
    def __init__(self, config: MT5Config):
        """
        Initializes an instance of the MT5DenseActDense class.

        Args:
            self: The instance of the class.
            config (MT5Config):
                An object of type MT5Config containing configuration parameters.

                - MT5Config.d_model (int): The model dimension.
                - MT5Config.d_ff (int): The feed-forward dimension.
                - MT5Config.dropout_rate (float): The dropout rate.
                - MT5Config.dense_act_fn (str): The activation function to be used.

        Returns:
            None.

        Raises:
            KeyError: If the specified dense activation function in the config is not found in ACT2FN.
            ValueError: If any of the configuration parameters are missing or invalid.
        """
        super().__init__()
        self.wi = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
        self.dropout = nn.Dropout(p=config.dropout_rate)
        self.act = ACT2FN[config.dense_act_fn]

    def forward(self, hidden_states):
        """
        This method forwards the hidden states by applying operations and transformations.

        Args:
            self: The instance of the MT5DenseActDense class.
            hidden_states (mindspore.Tensor): The input hidden states to be processed. It should be a tensor.

        Returns:
            mindspore.Tensor: The processed hidden states after applying the operations and transformations.

        Raises:
            TypeError: If the input hidden_states is not of type mindspore.Tensor.
            ValueError: If the weight dtype of self.wo is not compatible with the dtype of hidden_states.
            RuntimeError: If an unexpected error occurs during the processing of hidden_states.
        """
        hidden_states = self.wi(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states = self.dropout(hidden_states)
        if (
            isinstance(self.wo.weight, mindspore.Tensor)
            and hidden_states.dtype != self.wo.weight.dtype
            and self.wo.weight.dtype != mindspore.int8
        ):
            hidden_states = hidden_states.to(self.wo.weight.dtype)
        hidden_states = self.wo(hidden_states)
        return hidden_states

mindnlp.transformers.models.mt5.modeling_mt5.MT5DenseActDense.__init__(config)

Initializes an instance of the MT5DenseActDense class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An object of type MT5Config containing configuration parameters.

  • MT5Config.d_model (int): The model dimension.
  • MT5Config.d_ff (int): The feed-forward dimension.
  • MT5Config.dropout_rate (float): The dropout rate.
  • MT5Config.dense_act_fn (str): The activation function to be used.

TYPE: MT5Config

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
KeyError

If the specified dense activation function in the config is not found in ACT2FN.

ValueError

If any of the configuration parameters are missing or invalid.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 128-153
def __init__(self, config: MT5Config):
    """
    Initializes an instance of the MT5DenseActDense class.

    Args:
        self: The instance of the class.
        config (MT5Config):
            An object of type MT5Config containing configuration parameters.

            - MT5Config.d_model (int): The model dimension.
            - MT5Config.d_ff (int): The feed-forward dimension.
            - MT5Config.dropout_rate (float): The dropout rate.
            - MT5Config.dense_act_fn (str): The activation function to be used.

    Returns:
        None.

    Raises:
        KeyError: If the specified dense activation function in the config is not found in ACT2FN.
        ValueError: If any of the configuration parameters are missing or invalid.
    """
    super().__init__()
    self.wi = nn.Linear(config.d_model, config.d_ff, bias=False)
    self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
    self.dropout = nn.Dropout(p=config.dropout_rate)
    self.act = ACT2FN[config.dense_act_fn]

mindnlp.transformers.models.mt5.modeling_mt5.MT5DenseActDense.forward(hidden_states)

This method forwards the hidden states by applying operations and transformations.

PARAMETER DESCRIPTION
self

The instance of the MT5DenseActDense class.

hidden_states

The input hidden states to be processed. It should be a tensor.

TYPE: Tensor

RETURNS DESCRIPTION

mindspore.Tensor: The processed hidden states after applying the operations and transformations.

RAISES DESCRIPTION
TypeError

If the input hidden_states is not of type mindspore.Tensor.

ValueError

If the weight dtype of self.wo is not compatible with the dtype of hidden_states.

RuntimeError

If an unexpected error occurs during the processing of hidden_states.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 155-181
def forward(self, hidden_states):
    """
    This method forwards the hidden states by applying operations and transformations.

    Args:
        self: The instance of the MT5DenseActDense class.
        hidden_states (mindspore.Tensor): The input hidden states to be processed. It should be a tensor.

    Returns:
        mindspore.Tensor: The processed hidden states after applying the operations and transformations.

    Raises:
        TypeError: If the input hidden_states is not of type mindspore.Tensor.
        ValueError: If the weight dtype of self.wo is not compatible with the dtype of hidden_states.
        RuntimeError: If an unexpected error occurs during the processing of hidden_states.
    """
    hidden_states = self.wi(hidden_states)
    hidden_states = self.act(hidden_states)
    hidden_states = self.dropout(hidden_states)
    if (
        isinstance(self.wo.weight, mindspore.Tensor)
        and hidden_states.dtype != self.wo.weight.dtype
        and self.wo.weight.dtype != mindspore.int8
    ):
        hidden_states = hidden_states.to(self.wo.weight.dtype)
    hidden_states = self.wo(hidden_states)
    return hidden_states

mindnlp.transformers.models.mt5.modeling_mt5.MT5DenseGatedActDense

Bases: Module

This class represents a dense gated activation module for the MT5 model. It inherits from the nn.Module class.

The MT5DenseGatedActDense class contains methods to initialize and forward the dense gated activation module.

METHOD DESCRIPTION
__init__

Initializes the MT5DenseGatedActDense module with the given configuration.

forward

Applies the gated feed-forward transformation to the provided hidden states.

ATTRIBUTE DESCRIPTION
wi_0

A dense layer that transforms the input hidden states.

wi_1

A dense layer that transforms the input hidden states.

wo

A dense layer that transforms the gated hidden states.

dropout

A dropout layer to apply dropout to the transformed hidden states.

act

The activation function to be applied to the transformed hidden states.

Example
>>> config = MT5Config(d_model=512, d_ff=2048, dropout_rate=0.1, dense_act_fn='gelu')
>>> dense_gated_act_dense = MT5DenseGatedActDense(config)
>>> hidden_states = ...
>>> output = dense_gated_act_dense.forward(hidden_states)
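
The key difference from MT5DenseActDense is the gating: one projection is passed through the activation and multiplies a second, purely linear projection. A minimal sketch of that computation using the module's own sub-layers (illustrative config values; dropout omitted for brevity):

>>> import mindspore
>>> from mindspore import ops
>>> config = MT5Config(d_model=512, d_ff=1024, dropout_rate=0.1, dense_act_fn="gelu")
>>> gated_ffn = MT5DenseGatedActDense(config)
>>> x = ops.ones((2, 8, 512), mindspore.float32)  # (batch, seq_len, d_model), dummy values
>>> gate = gated_ffn.act(gated_ffn.wi_0(x))       # activation branch, shape (2, 8, 1024)
>>> linear = gated_ffn.wi_1(x)                    # linear branch, shape (2, 8, 1024)
>>> out = gated_ffn.wo(gate * linear)             # element-wise gating, projected back to (2, 8, 512)
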
Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 185-268
class MT5DenseGatedActDense(nn.Module):

    """
    This class represents a dense gated activation module for the MT5 model. It inherits from the nn.Module class.

    The MT5DenseGatedActDense class contains methods to initialize and forward the dense gated activation module.

    Methods:
        __init__: Initializes the MT5DenseGatedActDense module with the given configuration.
        forward: Applies the gated feed-forward transformation to the provided hidden states.

    Attributes:
        wi_0: A dense layer that transforms the input hidden states.
        wi_1: A dense layer that transforms the input hidden states.
        wo: A dense layer that transforms the gated hidden states.
        dropout: A dropout layer to apply dropout to the transformed hidden states.
        act: The activation function to be applied to the transformed hidden states.

    Example:
        ```python
        >>> config = MT5Config(d_model=512, d_ff=2048, dropout_rate=0.1, dense_act_fn='gelu')
        >>> dense_gated_act_dense = MT5DenseGatedActDense(config)
        >>> hidden_states = ...
        >>> output = dense_gated_act_dense.forward(hidden_states)
        ```
    """
    def __init__(self, config: MT5Config):
        """
        Initializes an instance of the MT5DenseGatedActDense class.

        Args:
            self: The instance of the class.
            config (MT5Config):
                An object of type MT5Config containing configuration parameters for the model.

                - The 'config' parameter is required and must be of type MT5Config.
                - It is used to configure the dimensions and settings for the dense layers in the model.

        Returns:
            None

        Raises:
            ValueError: If the configuration parameters are not provided or are of incorrect type.
            KeyError: If the activation function specified in the configuration is not supported.
        """
        super().__init__()
        self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
        self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
        self.dropout = nn.Dropout(p=config.dropout_rate)
        self.act = ACT2FN[config.dense_act_fn]

    def forward(self, hidden_states):
        """
        This method forwards the hidden states by applying a series of transformations.

        Args:
            self (MT5DenseGatedActDense): The instance of the MT5DenseGatedActDense class.
            hidden_states (mindspore.Tensor): The input hidden states to be processed.

        Returns:
            mindspore.Tensor: The transformed hidden states after the gated activation and the output projection.

        Raises:
            TypeError: If the datatype of the hidden_states is not compatible with the datatype of the weight tensor 'wo'.
        """
        hidden_gelu = self.act(self.wi_0(hidden_states))
        hidden_linear = self.wi_1(hidden_states)
        hidden_states = hidden_gelu * hidden_linear
        hidden_states = self.dropout(hidden_states)

        # To make 8bit quantization work for google/flan-t5-xxl, self.wo is kept in float32.
        # See https://github.com/huggingface/transformers/issues/20287
        # we also make sure the weights are not in `int8` in case users will force `_keep_in_fp32_modules` to be `None`
        if (
            isinstance(self.wo.weight, mindspore.Tensor)
            and hidden_states.dtype != self.wo.weight.dtype
            and self.wo.weight.dtype != mindspore.int8
        ):
            hidden_states = hidden_states.to(self.wo.weight.dtype)

        hidden_states = self.wo(hidden_states)
        return hidden_states

mindnlp.transformers.models.mt5.modeling_mt5.MT5DenseGatedActDense.__init__(config)

Initializes an instance of the MT5DenseGatedActDense class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An object of type MT5Config containing configuration parameters for the model.

  • The 'config' parameter is required and must be of type MT5Config.
  • It is used to configure the dimensions and settings for the dense layers in the model.

TYPE: MT5Config

RETURNS DESCRIPTION

None

RAISES DESCRIPTION
ValueError

If the configuration parameters are not provided or are of incorrect type.

KeyError

If the activation function specified in the configuration is not supported.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 211-235
def __init__(self, config: MT5Config):
    """
    Initializes an instance of the MT5DenseGatedActDense class.

    Args:
        self: The instance of the class.
        config (MT5Config):
            An object of type MT5Config containing configuration parameters for the model.

            - The 'config' parameter is required and must be of type MT5Config.
            - It is used to configure the dimensions and settings for the dense layers in the model.

    Returns:
        None

    Raises:
        ValueError: If the configuration parameters are not provided or are of incorrect type.
        KeyError: If the activation function specified in the configuration is not supported.
    """
    super().__init__()
    self.wi_0 = nn.Linear(config.d_model, config.d_ff, bias=False)
    self.wi_1 = nn.Linear(config.d_model, config.d_ff, bias=False)
    self.wo = nn.Linear(config.d_ff, config.d_model, bias=False)
    self.dropout = nn.Dropout(p=config.dropout_rate)
    self.act = ACT2FN[config.dense_act_fn]

mindnlp.transformers.models.mt5.modeling_mt5.MT5DenseGatedActDense.forward(hidden_states)

This method forwards the hidden states by applying a series of transformations.

PARAMETER DESCRIPTION
self

The instance of the MT5DenseGatedActDense class.

TYPE: MT5DenseGatedActDense

hidden_states

The input hidden states to be processed.

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

mindspore.Tensor: The transformed hidden states after the gated activation and the output projection.

RAISES DESCRIPTION
TypeError

If the datatype of the hidden_states is not compatible with the datatype of the weight tensor 'wo'.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 237-268
def forward(self, hidden_states):
    """
    This method forwards the hidden states by applying a series of transformations.

    Args:
        self (MT5DenseGatedActDense): The instance of the MT5DenseGatedActDense class.
        hidden_states (mindspore.Tensor): The input hidden states to be processed.

    Returns:
        mindspore.Tensor: The transformed hidden states after the gated activation and the output projection.

    Raises:
        TypeError: If the datatype of the hidden_states is not compatible with the datatype of the weight tensor 'wo'.
    """
    hidden_gelu = self.act(self.wi_0(hidden_states))
    hidden_linear = self.wi_1(hidden_states)
    hidden_states = hidden_gelu * hidden_linear
    hidden_states = self.dropout(hidden_states)

    # To make 8bit quantization work for google/flan-t5-xxl, self.wo is kept in float32.
    # See https://github.com/huggingface/transformers/issues/20287
    # we also make sure the weights are not in `int8` in case users will force `_keep_in_fp32_modules` to be `None`
    if (
        isinstance(self.wo.weight, mindspore.Tensor)
        and hidden_states.dtype != self.wo.weight.dtype
        and self.wo.weight.dtype != mindspore.int8
    ):
        hidden_states = hidden_states.to(self.wo.weight.dtype)

    hidden_states = self.wo(hidden_states)
    return hidden_states

mindnlp.transformers.models.mt5.modeling_mt5.MT5EncoderModel

Bases: MT5PreTrainedModel

Example
>>> from transformers import MT5EncoderModel, AutoTokenizer
...
>>> model = MT5EncoderModel.from_pretrained("google/mt5-small")
>>> tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
>>> article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."
>>> input_ids = tokenizer(article, return_tensors="pt").input_ids
>>> outputs = model(input_ids)
>>> hidden_state = outputs.last_hidden_state
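
A MindSpore-flavored variant of the same flow, shown as a sketch (it assumes the mindnlp tokenizer accepts return_tensors="ms"; mindnlp.transformers is the package documented on this page):

>>> from mindnlp.transformers import MT5EncoderModel, AutoTokenizer
>>> model = MT5EncoderModel.from_pretrained("google/mt5-small")
>>> tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
>>> article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."
>>> input_ids = tokenizer(article, return_tensors="ms").input_ids
>>> outputs = model(input_ids)
>>> hidden_state = outputs.last_hidden_state
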
Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 2271-2419
class MT5EncoderModel(MT5PreTrainedModel):
    r"""
    Example:
        ```python
        >>> from transformers import MT5EncoderModel, AutoTokenizer
        ...
        >>> model = MT5EncoderModel.from_pretrained("google/mt5-small")
        >>> tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
        >>> article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."
        >>> input_ids = tokenizer(article, return_tensors="pt").input_ids
        >>> outputs = model(input_ids)
        >>> hidden_state = outputs.last_hidden_state
        ```
    """
    model_type = "mt5"
    config_class = MT5Config
    _tied_weights_keys = ["encoder.embed_tokens.weight"]

    # Copied from transformers.models.t5.modeling_t5.T5EncoderModel.__init__ with T5->MT5
    def __init__(self, config: MT5Config):
        """
        Initializes an instance of the MT5EncoderModel class.

        Args:
            self: The instance of the MT5EncoderModel class.
            config (MT5Config): An object of type MT5Config containing configuration parameters for the model.
                The config parameter specifies the configuration settings for the MT5 model.
                It must be an instance of the MT5Config class.

        Returns:
            None.

        Raises:
            TypeError: If the config parameter is not of type MT5Config.
            ValueError: If the config parameter is missing or if any required configuration settings are not provided.
        """
        super().__init__(config)
        self.shared = nn.Embedding(config.vocab_size, config.d_model)

        encoder_config = copy.deepcopy(config)
        encoder_config.use_cache = False
        encoder_config.is_encoder_decoder = False
        self.encoder = MT5Stack(encoder_config, self.shared)

        # Initialize weights and apply final processing
        self.post_init()

    # Copied from transformers.models.t5.modeling_t5.T5EncoderModel.get_input_embeddings
    def get_input_embeddings(self):
        """
        This method retrieves the input embeddings for the MT5EncoderModel.

        Args:
            self: An instance of the MT5EncoderModel class.

        Returns:
            The shared input embeddings for the MT5EncoderModel.

        Raises:
            None.
        """
        return self.shared

    # Copied from transformers.models.t5.modeling_t5.T5EncoderModel.set_input_embeddings
    def set_input_embeddings(self, new_embeddings):
        """
        Sets the input embeddings for the MT5EncoderModel.

        Args:
            self (MT5EncoderModel): The instance of the MT5EncoderModel class.
            new_embeddings (object): The new embeddings to be set for the input.

        Returns:
            None.

        Raises:
            TypeError: If the new_embeddings parameter is not of the correct type.
            ValueError: If there is an issue with setting the input embeddings.
        """
        self.shared = new_embeddings
        self.encoder.set_input_embeddings(new_embeddings)

    # Copied from transformers.models.t5.modeling_t5.T5EncoderModel.get_encoder
    def get_encoder(self):
        """
        Returns the encoder of the MT5EncoderModel.

        Args:
            self: An instance of the MT5EncoderModel class.

        Returns:
            encoder: The method returns the encoder of the MT5EncoderModel.

        Raises:
            None.
        """
        return self.encoder

    # Copied from transformers.models.t5.modeling_t5.T5EncoderModel._prune_heads
    def _prune_heads(self, heads_to_prune):
        """
        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
        class PreTrainedModel
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.block[layer].layer[0].SelfAttention.prune_heads(heads)

    # Copied from transformers.models.t5.modeling_t5.T5EncoderModel.forward with T5->MT5, t5->mt5
    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[mindspore.Tensor], BaseModelOutput]:
        r"""

        Returns:
            Union[Tuple[mindspore.Tensor], BaseModelOutput]

        Example:
            ```python
            >>> from transformers import AutoTokenizer, MT5EncoderModel
            ...
            >>> tokenizer = AutoTokenizer.from_pretrained("mt5-small")
            >>> model = MT5EncoderModel.from_pretrained("mt5-small")
            >>> input_ids = tokenizer(
            ...     "Studies have been shown that owning a dog is good for you", return_tensors="pt"
            ... ).input_ids  # Batch size 1
            >>> outputs = model(input_ids=input_ids)
            >>> last_hidden_states = outputs.last_hidden_state
            ```
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        encoder_outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        return encoder_outputs

mindnlp.transformers.models.mt5.modeling_mt5.MT5EncoderModel.__init__(config)

Initializes an instance of the MT5EncoderModel class.

PARAMETER DESCRIPTION
self

The instance of the MT5EncoderModel class.

config

An object of type MT5Config containing configuration parameters for the model. The config parameter specifies the configuration settings for the MT5 model. It must be an instance of the MT5Config class.

TYPE: MT5Config

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the config parameter is not of type MT5Config.

ValueError

If the config parameter is missing or if any required configuration settings are not provided.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 2290-2316
def __init__(self, config: MT5Config):
    """
    Initializes an instance of the MT5EncoderModel class.

    Args:
        self: The instance of the MT5EncoderModel class.
        config (MT5Config): An object of type MT5Config containing configuration parameters for the model.
            The config parameter specifies the configuration settings for the MT5 model.
            It must be an instance of the MT5Config class.

    Returns:
        None.

    Raises:
        TypeError: If the config parameter is not of type MT5Config.
        ValueError: If the config parameter is missing or if any required configuration settings are not provided.
    """
    super().__init__(config)
    self.shared = nn.Embedding(config.vocab_size, config.d_model)

    encoder_config = copy.deepcopy(config)
    encoder_config.use_cache = False
    encoder_config.is_encoder_decoder = False
    self.encoder = MT5Stack(encoder_config, self.shared)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.mt5.modeling_mt5.MT5EncoderModel.forward(input_ids=None, attention_mask=None, head_mask=None, inputs_embeds=None, output_attentions=None, output_hidden_states=None, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple[Tensor], BaseModelOutput]

Union[Tuple[mindspore.Tensor], BaseModelOutput]

Example
>>> from transformers import AutoTokenizer, MT5EncoderModel
...
>>> tokenizer = AutoTokenizer.from_pretrained("mt5-small")
>>> model = MT5EncoderModel.from_pretrained("mt5-small")
>>> input_ids = tokenizer(
...     "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids  # Batch size 1
>>> outputs = model(input_ids=input_ids)
>>> last_hidden_states = outputs.last_hidden_state
Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 2379-2419
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple[mindspore.Tensor], BaseModelOutput]:
    r"""

    Returns:
        Union[Tuple[mindspore.Tensor], BaseModelOutput]

    Example:
        ```python
        >>> from transformers import AutoTokenizer, MT5EncoderModel
        ...
        >>> tokenizer = AutoTokenizer.from_pretrained("mt5-small")
        >>> model = MT5EncoderModel.from_pretrained("mt5-small")
        >>> input_ids = tokenizer(
        ...     "Studies have been shown that owning a dog is good for you", return_tensors="pt"
        ... ).input_ids  # Batch size 1
        >>> outputs = model(input_ids=input_ids)
        >>> last_hidden_states = outputs.last_hidden_state
        ```
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    encoder_outputs = self.encoder(
        input_ids=input_ids,
        attention_mask=attention_mask,
        inputs_embeds=inputs_embeds,
        head_mask=head_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    return encoder_outputs

mindnlp.transformers.models.mt5.modeling_mt5.MT5EncoderModel.get_encoder()

Returns the encoder of the MT5EncoderModel.

PARAMETER DESCRIPTION
self

An instance of the MT5EncoderModel class.

RETURNS DESCRIPTION
encoder

The method returns the encoder of the MT5EncoderModel.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 2354-2367
def get_encoder(self):
    """
    Returns the encoder of the MT5EncoderModel.

    Args:
        self: An instance of the MT5EncoderModel class.

    Returns:
        encoder: The method returns the encoder of the MT5EncoderModel.

    Raises:
        None.
    """
    return self.encoder

mindnlp.transformers.models.mt5.modeling_mt5.MT5EncoderModel.get_input_embeddings()

This method retrieves the input embeddings for the MT5EncoderModel.

PARAMETER DESCRIPTION
self

An instance of the MT5EncoderModel class.

RETURNS DESCRIPTION

The shared input embeddings for the MT5EncoderModel.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 2319-2332
def get_input_embeddings(self):
    """
    This method retrieves the input embeddings for the MT5EncoderModel.

    Args:
        self: An instance of the MT5EncoderModel class.

    Returns:
        The shared input embeddings for the MT5EncoderModel.

    Raises:
        None.
    """
    return self.shared

mindnlp.transformers.models.mt5.modeling_mt5.MT5EncoderModel.set_input_embeddings(new_embeddings)

Sets the input embeddings for the MT5EncoderModel.

PARAMETER DESCRIPTION
self

The instance of the MT5EncoderModel class.

TYPE: MT5EncoderModel

new_embeddings

The new embeddings to be set for the input.

TYPE: object

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the new_embeddings parameter is not of the correct type.

ValueError

If there is an issue with setting the input embeddings.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 2335-2351
def set_input_embeddings(self, new_embeddings):
    """
    Sets the input embeddings for the MT5EncoderModel.

    Args:
        self (MT5EncoderModel): The instance of the MT5EncoderModel class.
        new_embeddings (object): The new embeddings to be set for the input.

    Returns:
        None.

    Raises:
        TypeError: If the new_embeddings parameter is not of the correct type.
        ValueError: If there is an issue with setting the input embeddings.
    """
    self.shared = new_embeddings
    self.encoder.set_input_embeddings(new_embeddings)

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForConditionalGeneration

Bases: MT5PreTrainedModel

Example
>>> from transformers import MT5ForConditionalGeneration, AutoTokenizer
...
>>> model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
>>> tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
>>> article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."
>>> summary = "Weiter Verhandlung in Syrien."
>>> inputs = tokenizer(article, text_target=summary, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> loss = outputs.loss
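
When labels are supplied and no decoder_input_ids are given, forward builds the decoder inputs by shifting the labels one position to the right (prepending the decoder start token) and computes a cross-entropy loss over the vocabulary, ignoring positions labelled -100 as noted in the forward docstring. A schematic sketch of that shift on dummy ids (the token ids are illustrative; mT5 uses the pad id 0 as decoder start token):

>>> import mindspore
>>> from mindspore import ops
>>> labels = mindspore.Tensor([[259, 1461, 1]], mindspore.int64)     # target ids ending in </s>
>>> start = mindspore.Tensor([[0]], mindspore.int64)                 # decoder_start_token_id
>>> decoder_input_ids = ops.concat((start, labels[:, :-1]), axis=1)  # -> [[0, 259, 1461]]
>>> # forward() performs this internally via self._shift_right(labels)
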
Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 1832-2268
class MT5ForConditionalGeneration(MT5PreTrainedModel):
    r"""
    Example:
        ```python
        >>> from transformers import MT5ForConditionalGeneration, AutoTokenizer
        ...
        >>> model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
        >>> tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
        >>> article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."
        >>> summary = "Weiter Verhandlung in Syrien."
        >>> inputs = tokenizer(article, text_target=summary, return_tensors="pt")
        ...
        >>> outputs = model(**inputs)
        >>> loss = outputs.loss
        ```
    """
    model_type = "mt5"
    config_class = MT5Config
    _keys_to_ignore_on_load_unexpected = ["decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight"]
    _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight", "lm_head.weight"]

    # Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration.__init__ with T5->MT5
    def __init__(self, config: MT5Config):
        """
        Initializes an instance of the MT5ForConditionalGeneration class.

        Args:
            self: The object instance.
            config (MT5Config):
                The configuration object containing various parameters for the model.

                - `d_model` (int): The dimensionality of the model.
                - `vocab_size` (int): The size of the vocabulary.
                - `num_decoder_layers` (int): The number of layers in the decoder.
                - `is_decoder` (bool): Indicates whether the instance is a decoder.
                - `use_cache` (bool): Indicates whether to use cache during encoding.
                - `is_encoder_decoder` (bool): Indicates whether the instance is an encoder-decoder.

        Returns:
            None

        Raises:
            None
        """
        super().__init__(config)
        self.model_dim = config.d_model

        self.shared = nn.Embedding(config.vocab_size, config.d_model)

        encoder_config = copy.deepcopy(config)
        encoder_config.is_decoder = False
        encoder_config.use_cache = False
        encoder_config.is_encoder_decoder = False
        self.encoder = MT5Stack(encoder_config, self.shared)

        decoder_config = copy.deepcopy(config)
        decoder_config.is_decoder = True
        decoder_config.is_encoder_decoder = False
        decoder_config.num_layers = config.num_decoder_layers
        self.decoder = MT5Stack(decoder_config, self.shared)

        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    # Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration.get_input_embeddings
    def get_input_embeddings(self):
        """
        Retrieves the input embeddings for the MT5 model.

        Args:
            self (MT5ForConditionalGeneration): An instance of the MT5ForConditionalGeneration class.

        Returns:
            None.

        Raises:
            None.
        """
        return self.shared

    # Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration.set_input_embeddings
    def set_input_embeddings(self, new_embeddings):
        """
        Set input embeddings for the MT5 model for conditional generation.

        Args:
            self (MT5ForConditionalGeneration): The instance of the MT5ForConditionalGeneration class.
            new_embeddings (Tensor): New input embeddings to be set for the model.
                Should be a tensor of shape [vocab_size, embedding_size] where:

                - vocab_size: Number of tokens in the vocabulary.
                - embedding_size: Dimension of the token embeddings.

                The new_embeddings should match the token embedding requirements of the model.

        Returns:
            None.

        Raises:
            TypeError: If the new_embeddings provided is not a tensor.
            ValueError: If the shape of the new_embeddings tensor does not match the expected shape
                [vocab_size, embedding_size].
        """
        self.shared = new_embeddings
        self.encoder.set_input_embeddings(new_embeddings)
        self.decoder.set_input_embeddings(new_embeddings)

    # Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration.set_output_embeddings
    def set_output_embeddings(self, new_embeddings):
        """
        Set the output embeddings for the MT5 model.

        Args:
            self (MT5ForConditionalGeneration): The instance of the MT5ForConditionalGeneration class.
            new_embeddings (object): The new output embeddings to be set for the model. It can be of any valid type.

        Returns:
            None.

        Raises:
            None.
        """
        self.lm_head = new_embeddings

    # Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration.get_output_embeddings
    def get_output_embeddings(self):
        """
        Returns the output embeddings of the MT5 model.

        Args:
            self: An instance of the MT5ForConditionalGeneration class.

        Returns:
            embeddings: The output embeddings of the MT5 model.

        Raises:
            None.
        """
        return self.lm_head

    # Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration.get_encoder
    def get_encoder(self):
        """
        Retrieve the encoder object used for conditional generation in the MT5ForConditionalGeneration class.

        Args:
            self (MT5ForConditionalGeneration): An instance of the MT5ForConditionalGeneration class.
                This parameter is required for accessing the encoder object associated with the instance.

        Returns:
            encoder: The encoder object that is utilized for conditional text generation.

        Raises:
            None.
        """
        return self.encoder

    # Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration.get_decoder
    def get_decoder(self):
        """
        Method to retrieve the decoder used in the MT5ForConditionalGeneration class.

        Args:
            self: An instance of the MT5ForConditionalGeneration class.

        Returns:
            decoder: This method returns the decoder associated with the MT5ForConditionalGeneration instance.

        Raises:
            None.
        """
        return self.decoder

    # Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration.forward with T5->MT5, t5->mt5
    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        decoder_input_ids: Optional[mindspore.Tensor] = None,
        decoder_attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        decoder_head_mask: Optional[mindspore.Tensor] = None,
        cross_attn_head_mask: Optional[mindspore.Tensor] = None,
        encoder_outputs: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
        past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        decoder_inputs_embeds: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[mindspore.Tensor], Seq2SeqLMOutput]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
                Labels for computing the sequence classification/regression loss. Indices should be in `[-100, 0, ...,
                config.vocab_size - 1]`. All labels set to `-100` are ignored (masked), the loss is only computed for
                labels in `[0, ..., config.vocab_size]`

        Returns:
            `Union[Tuple[mindspore.Tensor], Seq2SeqLMOutput]`

        Example:
            ```python
            >>> from transformers import AutoTokenizer, MT5ForConditionalGeneration
            ...
            >>> tokenizer = AutoTokenizer.from_pretrained("mt5-small")
            >>> model = MT5ForConditionalGeneration.from_pretrained("mt5-small")
            ...
            >>> # training
            >>> input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
            >>> labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
            >>> outputs = model(input_ids=input_ids, labels=labels)
            >>> loss = outputs.loss
            >>> logits = outputs.logits
            ...
            >>> # inference
            >>> input_ids = tokenizer(
            ...     "summarize: studies have shown that owning a dog is good for you", return_tensors="pt"
            ... ).input_ids  # Batch size 1
            >>> outputs = model.generate(input_ids)
            >>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
            >>> # studies have shown that owning a dog is good for you.
            ```
        """
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask
        if head_mask is not None and decoder_head_mask is None:
            if self.config.num_layers == self.config.num_decoder_layers:
                warnings.warn(__HEAD_MASK_WARNING_MSG, FutureWarning)
                decoder_head_mask = head_mask

        # Encode if needed (training, first prediction pass)
        if encoder_outputs is None:
            # Convert encoder inputs in embeddings if needed
            encoder_outputs = self.encoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
                inputs_embeds=inputs_embeds,
                head_mask=head_mask,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
        elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
            encoder_outputs = BaseModelOutput(
                last_hidden_state=encoder_outputs[0],
                hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
                attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
            )

        hidden_states = encoder_outputs[0]

        if labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:
            # get decoder inputs from shifting lm labels to the right
            decoder_input_ids = self._shift_right(labels)

        # Decode
        decoder_outputs = self.decoder(
            input_ids=decoder_input_ids,
            attention_mask=decoder_attention_mask,
            inputs_embeds=decoder_inputs_embeds,
            past_key_values=past_key_values,
            encoder_hidden_states=hidden_states,
            encoder_attention_mask=attention_mask,
            head_mask=decoder_head_mask,
            cross_attn_head_mask=cross_attn_head_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = decoder_outputs[0]

        if self.config.tie_word_embeddings:
            # Rescale output before projecting on vocab
            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586
            sequence_output = sequence_output * (self.model_dim**-0.5)

        lm_logits = self.lm_head(sequence_output)

        loss = None
        if labels is not None:
            loss = ops.cross_entropy(lm_logits.view(-1, lm_logits.shape[-1]), labels.view(-1))
            # TODO(thom): Add z_loss https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L666

        if not return_dict:
            output = (lm_logits,) + decoder_outputs[1:] + encoder_outputs
            return ((loss,) + output) if loss is not None else output

        return Seq2SeqLMOutput(
            loss=loss,
            logits=lm_logits,
            past_key_values=decoder_outputs.past_key_values,
            decoder_hidden_states=decoder_outputs.hidden_states,
            decoder_attentions=decoder_outputs.attentions,
            cross_attentions=decoder_outputs.cross_attentions,
            encoder_last_hidden_state=encoder_outputs.last_hidden_state,
            encoder_hidden_states=encoder_outputs.hidden_states,
            encoder_attentions=encoder_outputs.attentions,
        )

    # Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration.prepare_inputs_for_generation
    def prepare_inputs_for_generation(
        self,
        input_ids,
        past_key_values=None,
        attention_mask=None,
        head_mask=None,
        decoder_head_mask=None,
        decoder_attention_mask=None,
        cross_attn_head_mask=None,
        use_cache=None,
        encoder_outputs=None,
        **kwargs,
    ):
        """
        This method prepares inputs for generation in the MT5ForConditionalGeneration class.

        Args:
            self (object): The instance of the class.
            input_ids (Tensor): The input token IDs for the model. Shape: [batch_size, sequence_length].
            past_key_values (tuple, optional): The past key values required for fast autoregressive decoding. Default: None.
            attention_mask (Tensor, optional): The attention mask for the input. Shape: [batch_size, sequence_length].
            head_mask (Tensor, optional): The mask to nullify selected heads of the encoder self-attention modules. Shape: [num_layers, num_heads] or [num_heads].
            decoder_head_mask (Tensor, optional): The mask to nullify selected heads of the decoder self-attention modules. Shape: [num_layers, num_heads] or [num_heads].
            decoder_attention_mask (Tensor, optional): The attention mask for the decoder. Shape: [batch_size, sequence_length].
            cross_attn_head_mask (Tensor, optional): The mask to nullify selected heads of the cross-attention modules. Shape: [num_layers, num_heads] or [num_heads].
            use_cache (bool, optional): Whether to use the cache for fast decoding. Default: None.
            encoder_outputs (tuple, optional): The outputs of the encoder. Default: None.

        Returns:
            dict:
                A dictionary containing the prepared inputs for the generation including 'decoder_input_ids', 'past_key_values',
                'encoder_outputs', 'attention_mask', 'head_mask', 'decoder_head_mask', 'decoder_attention_mask', 'cross_attn_head_mask',
                and 'use_cache'.

        Raises:
            None
        """
        # cut decoder_input_ids if past_key_values is used
        if past_key_values is not None:
            past_length = past_key_values[0][0].shape[2]

            # Some generation methods already pass only the last input ID
            if input_ids.shape[1] > past_length:
                remove_prefix_length = past_length
            else:
                # Default to old behavior: keep only final ID
                remove_prefix_length = input_ids.shape[1] - 1

            input_ids = input_ids[:, remove_prefix_length:]

        return {
            "decoder_input_ids": input_ids,
            "past_key_values": past_key_values,
            "encoder_outputs": encoder_outputs,
            "attention_mask": attention_mask,
            "head_mask": head_mask,
            "decoder_head_mask": decoder_head_mask,
            "decoder_attention_mask": decoder_attention_mask,
            "cross_attn_head_mask": cross_attn_head_mask,
            "use_cache": use_cache,
        }

    # Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration.prepare_decoder_input_ids_from_labels
    def prepare_decoder_input_ids_from_labels(self, labels: mindspore.Tensor):
        """
        Prepare decoder input IDs from labels for conditional generation.

        Args:
            self (MT5ForConditionalGeneration): An instance of the MT5ForConditionalGeneration class.
            labels (mindspore.Tensor): The labels tensor containing the target sequence to be shifted right.

        Returns:
            None: This method returns None as it directly modifies the input labels tensor.

        Raises:
            None.
        """
        return self._shift_right(labels)

    # Copied from transformers.models.t5.modeling_t5.T5ForConditionalGeneration._reorder_cache
    def _reorder_cache(self, past_key_values, beam_idx):
        """
        Reorders the cache for the specified `beam_idx` in the `MT5ForConditionalGeneration` class.

        Args:
            self (MT5ForConditionalGeneration): An instance of the MT5ForConditionalGeneration class.
            past_key_values (Tuple): A tuple containing the past key values for the decoder.
                Each element in the tuple represents the past key values for a specific layer.
                Each layer's past key values is a tuple containing the past key values for each attention head in
                that layer.
            beam_idx (Tensor): The index of the beam to reorder the cache for.

        Returns:
            Tuple: The reordered cache for the specified `beam_idx`. The reordered cache has the same structure as the
                input `past_key_values`, but the values are reordered based on the specified `beam_idx`.

        Raises:
            ValueError: If the shape of the reordered_layer_past_states[0] and layer_past_states[0] mismatch.
            ValueError: If the length of reordered_layer_past_states and layer_past_states mismatch.
        """
        # if decoder past is not included in output
        # speedy decoding is disabled and no need to reorder
        if past_key_values is None:
            logger.warning("You might want to consider setting `use_cache=True` to speed up decoding")
            return past_key_values

        reordered_decoder_past = ()
        for layer_past_states in past_key_values:
            # get the correct batch idx from layer past batch dim
            # batch dim of `past` is at 2nd position
            reordered_layer_past_states = ()
            for layer_past_state in layer_past_states:
                # need to set correct `past` for each of the four key / value states
                reordered_layer_past_states = reordered_layer_past_states + (
                    layer_past_state.index_select(0, beam_idx),
                )

            if reordered_layer_past_states[0].shape != layer_past_states[0].shape:
                raise ValueError(
                    f"reordered_layer_past_states[0] shape {reordered_layer_past_states[0].shape} and layer_past_states[0] shape {layer_past_states[0].shape} mismatched"
                )
            if len(reordered_layer_past_states) != len(layer_past_states):
                raise ValueError(
                    f"length of reordered_layer_past_states {len(reordered_layer_past_states)} and length of layer_past_states {len(layer_past_states)} mismatched"
                )

            reordered_decoder_past = reordered_decoder_past + (reordered_layer_past_states,)
        return reordered_decoder_past
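
The reordering itself is just an `index_select` along the batch dimension of every cached key/value tensor. The following is a minimal, hedged sketch with invented shapes (no model involved) that shows how `beam_idx` re-aligns one layer's cache with the surviving beam hypotheses:

```python
import mindspore
from mindspore import ops

# One decoder layer's cache: (self_key, self_value, cross_key, cross_value),
# each shaped [batch * num_beams, num_heads, seq_len, head_dim]; sizes are illustrative.
batch = 4
layer_past = tuple(
    ops.arange(batch, dtype=mindspore.float32).reshape(batch, 1, 1, 1)
    * ops.ones((batch, 2, 3, 5), mindspore.float32)
    for _ in range(4)
)

# beam_idx says which existing hypothesis each beam slot should continue from.
beam_idx = mindspore.Tensor([2, 0, 3, 1], mindspore.int32)

reordered = tuple(state.index_select(0, beam_idx) for state in layer_past)
print(reordered[0][:, 0, 0, 0])                    # [2. 0. 3. 1.] -> rows now follow beam_idx
assert reordered[0].shape == layer_past[0].shape   # same invariant _reorder_cache enforces
```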

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForConditionalGeneration.__init__(config)

Initializes an instance of the MT5ForConditionalGeneration class.

PARAMETER DESCRIPTION
self

The object instance.

config

The configuration object containing various parameters for the model.

  • d_model (int): The dimensionality of the model.
  • vocab_size (int): The size of the vocabulary.
  • num_decoder_layers (int): The number of layers in the decoder.
  • is_decoder (bool): Indicates whether the instance is a decoder.
  • use_cache (bool): Indicates whether to use cache during encoding.
  • is_encoder_decoder (bool): Indicates whether the instance is an encoder-decoder.

TYPE: MT5Config

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def __init__(self, config: MT5Config):
    """
    Initializes an instance of the MT5ForConditionalGeneration class.

    Args:
        self: The object instance.
        config (MT5Config):
            The configuration object containing various parameters for the model.

            - `d_model` (int): The dimensionality of the model.
            - `vocab_size` (int): The size of the vocabulary.
            - `num_decoder_layers` (int): The number of layers in the decoder.
            - `is_decoder` (bool): Indicates whether the instance is a decoder.
            - `use_cache` (bool): Indicates whether to use cache during encoding.
            - `is_encoder_decoder` (bool): Indicates whether the instance is an encoder-decoder.

    Returns:
        None

    Raises:
        None
    """
    super().__init__(config)
    self.model_dim = config.d_model

    self.shared = nn.Embedding(config.vocab_size, config.d_model)

    encoder_config = copy.deepcopy(config)
    encoder_config.is_decoder = False
    encoder_config.use_cache = False
    encoder_config.is_encoder_decoder = False
    self.encoder = MT5Stack(encoder_config, self.shared)

    decoder_config = copy.deepcopy(config)
    decoder_config.is_decoder = True
    decoder_config.is_encoder_decoder = False
    decoder_config.num_layers = config.num_decoder_layers
    self.decoder = MT5Stack(decoder_config, self.shared)

    self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)

    # Initialize weights and apply final processing
    self.post_init()
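
A minimal instantiation sketch is shown below. It imports both names from the modeling module documented on this page (`MT5Config` is available there because the module imports it for its type annotations) and uses a deliberately tiny, made-up configuration so that it runs quickly; real mT5 checkpoints use much larger hyperparameters.

```python
from mindnlp.transformers.models.mt5.modeling_mt5 import (
    MT5Config,
    MT5ForConditionalGeneration,
)

# Tiny, hypothetical hyperparameters purely for illustration.
config = MT5Config(
    vocab_size=128,
    d_model=32,
    d_ff=64,
    num_layers=2,
    num_decoder_layers=2,
    num_heads=2,
)
model = MT5ForConditionalGeneration(config)

# Encoder and decoder share one token embedding table of shape [vocab_size, d_model];
# lm_head is a separate Linear(d_model -> vocab_size) projection.
print(model.get_input_embeddings())
print(model.get_output_embeddings())
```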

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForConditionalGeneration.forward(input_ids=None, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, head_mask=None, decoder_head_mask=None, cross_attn_head_mask=None, encoder_outputs=None, past_key_values=None, inputs_embeds=None, decoder_inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the sequence classification/regression loss. Indices should be in [-100, 0, ..., config.vocab_size - 1]. All labels set to -100 are ignored (masked), the loss is only computed for labels in [0, ..., config.vocab_size]

TYPE: `mindspore.Tensor` of shape `(batch_size,)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple[Tensor], Seq2SeqLMOutput]

Union[Tuple[mindspore.Tensor], Seq2SeqLMOutput]

Example
>>> from transformers import AutoTokenizer, MT5ForConditionalGeneration
...
>>> tokenizer = AutoTokenizer.from_pretrained("mt5-small")
>>> model = MT5ForConditionalGeneration.from_pretrained("mt5-small")
...
>>> # training
>>> input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
>>> labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
>>> outputs = model(input_ids=input_ids, labels=labels)
>>> loss = outputs.loss
>>> logits = outputs.logits
...
>>> # inference
>>> input_ids = tokenizer(
...     "summarize: studies have shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids  # Batch size 1
>>> outputs = model.generate(input_ids)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
>>> # studies have shown that owning a dog is good for you.
Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    decoder_input_ids: Optional[mindspore.Tensor] = None,
    decoder_attention_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    decoder_head_mask: Optional[mindspore.Tensor] = None,
    cross_attn_head_mask: Optional[mindspore.Tensor] = None,
    encoder_outputs: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
    past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    decoder_inputs_embeds: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple[mindspore.Tensor], Seq2SeqLMOutput]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[-100, 0, ...,
            config.vocab_size - 1]`. All labels set to `-100` are ignored (masked), the loss is only computed for
            labels in `[0, ..., config.vocab_size]`

    Returns:
        `Union[Tuple[mindspore.Tensor], Seq2SeqLMOutput]`

    Example:
        ```python
        >>> from transformers import AutoTokenizer, MT5ForConditionalGeneration
        ...
        >>> tokenizer = AutoTokenizer.from_pretrained("mt5-small")
        >>> model = MT5ForConditionalGeneration.from_pretrained("mt5-small")
        ...
        >>> # training
        >>> input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
        >>> labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
        >>> outputs = model(input_ids=input_ids, labels=labels)
        >>> loss = outputs.loss
        >>> logits = outputs.logits
        ...
        >>> # inference
        >>> input_ids = tokenizer(
        ...     "summarize: studies have shown that owning a dog is good for you", return_tensors="pt"
        ... ).input_ids  # Batch size 1
        >>> outputs = model.generate(input_ids)
        >>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
        >>> # studies have shown that owning a dog is good for you.
        ```
    """
    use_cache = use_cache if use_cache is not None else self.config.use_cache
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask
    if head_mask is not None and decoder_head_mask is None:
        if self.config.num_layers == self.config.num_decoder_layers:
            warnings.warn(__HEAD_MASK_WARNING_MSG, FutureWarning)
            decoder_head_mask = head_mask

    # Encode if needed (training, first prediction pass)
    if encoder_outputs is None:
        # Convert encoder inputs in embeddings if needed
        encoder_outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
    elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
        encoder_outputs = BaseModelOutput(
            last_hidden_state=encoder_outputs[0],
            hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
            attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
        )

    hidden_states = encoder_outputs[0]

    if labels is not None and decoder_input_ids is None and decoder_inputs_embeds is None:
        # get decoder inputs from shifting lm labels to the right
        decoder_input_ids = self._shift_right(labels)

    # Decode
    decoder_outputs = self.decoder(
        input_ids=decoder_input_ids,
        attention_mask=decoder_attention_mask,
        inputs_embeds=decoder_inputs_embeds,
        past_key_values=past_key_values,
        encoder_hidden_states=hidden_states,
        encoder_attention_mask=attention_mask,
        head_mask=decoder_head_mask,
        cross_attn_head_mask=cross_attn_head_mask,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    sequence_output = decoder_outputs[0]

    if self.config.tie_word_embeddings:
        # Rescale output before projecting on vocab
        # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/transformer/transformer.py#L586
        sequence_output = sequence_output * (self.model_dim**-0.5)

    lm_logits = self.lm_head(sequence_output)

    loss = None
    if labels is not None:
        loss = ops.cross_entropy(lm_logits.view(-1, lm_logits.shape[-1]), labels.view(-1))
        # TODO(thom): Add z_loss https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L666

    if not return_dict:
        output = (lm_logits,) + decoder_outputs[1:] + encoder_outputs
        return ((loss,) + output) if loss is not None else output

    return Seq2SeqLMOutput(
        loss=loss,
        logits=lm_logits,
        past_key_values=decoder_outputs.past_key_values,
        decoder_hidden_states=decoder_outputs.hidden_states,
        decoder_attentions=decoder_outputs.attentions,
        cross_attentions=decoder_outputs.cross_attentions,
        encoder_last_hidden_state=encoder_outputs.last_hidden_state,
        encoder_hidden_states=encoder_outputs.hidden_states,
        encoder_attentions=encoder_outputs.attentions,
    )
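
The example above is inherited from the upstream PyTorch implementation, which is why it imports from `transformers` and uses `return_tensors="pt"`. Under mindnlp the equivalent call would presumably go through the mindnlp re-exports and MindSpore tensors; the following hedged sketch assumes that `AutoTokenizer` and `MT5ForConditionalGeneration` are exposed at `mindnlp.transformers`, that the tokenizer accepts `return_tensors="ms"`, and that the `google/mt5-small` checkpoint can be downloaded:

```python
from mindnlp.transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Span-corruption style training step: labels are shifted right internally
# to build decoder_input_ids, so only input_ids and labels are needed.
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="ms").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="ms").input_ids

outputs = model(input_ids=input_ids, labels=labels)
print(outputs.loss, outputs.logits.shape)
```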

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForConditionalGeneration.get_decoder()

Method to retrieve the decoder used in the MT5ForConditionalGeneration class.

PARAMETER DESCRIPTION
self

An instance of the MT5ForConditionalGeneration class.

RETURNS DESCRIPTION
decoder

This method returns the decoder associated with the MT5ForConditionalGeneration instance.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def get_decoder(self):
    """
    Method to retrieve the decoder used in the MT5ForConditionalGeneration class.

    Args:
        self: An instance of the MT5ForConditionalGeneration class.

    Returns:
        decoder: This method returns the decoder associated with the MT5ForConditionalGeneration instance.

    Raises:
        None.
    """
    return self.decoder

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForConditionalGeneration.get_encoder()

Retrieve the encoder object used for conditional generation in the MT5ForConditionalGeneration class.

PARAMETER DESCRIPTION
self

An instance of the MT5ForConditionalGeneration class. This parameter is required for accessing the encoder object associated with the instance.

TYPE: MT5ForConditionalGeneration

RETURNS DESCRIPTION
encoder

The encoder object that is utilized for conditional text generation.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def get_encoder(self):
    """
    Retrieve the encoder object used for conditional generation in the MT5ForConditionalGeneration class.

    Args:
        self (MT5ForConditionalGeneration): An instance of the MT5ForConditionalGeneration class.
            This parameter is required for accessing the encoder object associated with the instance.

    Returns:
        encoder: The encoder object that is utilized for conditional text generation.

    Raises:
        None.
    """
    return self.encoder

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForConditionalGeneration.get_input_embeddings()

Retrieves the input embeddings for the MT5 model.

PARAMETER DESCRIPTION
self

An instance of the MT5ForConditionalGeneration class.

TYPE: MT5ForConditionalGeneration

RETURNS DESCRIPTION
embeddings

The shared token embedding module (nn.Embedding) used by both the encoder and the decoder.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def get_input_embeddings(self):
    """
    Retrieves the input embeddings for the MT5 model.

    Args:
        self (MT5ForConditionalGeneration): An instance of the MT5ForConditionalGeneration class.

    Returns:
        nn.Embedding: The shared token embedding module used by both the encoder and the decoder.

    Raises:
        None.
    """
    return self.shared

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForConditionalGeneration.get_output_embeddings()

Returns the output embeddings of the MT5 model.

PARAMETER DESCRIPTION
self

An instance of the MT5ForConditionalGeneration class.

RETURNS DESCRIPTION
embeddings

The output embeddings of the MT5 model.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def get_output_embeddings(self):
    """
    Returns the output embeddings of the MT5 model.

    Args:
        self: An instance of the MT5ForConditionalGeneration class.

    Returns:
        embeddings: The output embeddings of the MT5 model.

    Raises:
        None.
    """
    return self.lm_head

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForConditionalGeneration.prepare_decoder_input_ids_from_labels(labels)

Prepare decoder input IDs from labels for conditional generation.

PARAMETER DESCRIPTION
self

An instance of the MT5ForConditionalGeneration class.

TYPE: MT5ForConditionalGeneration

labels

The labels tensor containing the target sequence to be shifted right.

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

The labels shifted one position to the right, with the decoder start token prepended.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def prepare_decoder_input_ids_from_labels(self, labels: mindspore.Tensor):
    """
    Prepare decoder input IDs from labels for conditional generation.

    Args:
        self (MT5ForConditionalGeneration): An instance of the MT5ForConditionalGeneration class.
        labels (mindspore.Tensor): The labels tensor containing the target sequence to be shifted right.

    Returns:
        mindspore.Tensor: The labels shifted one position to the right, with the decoder start token prepended.

    Raises:
        None.
    """
    return self._shift_right(labels)
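
The actual shifting is done by the inherited `_shift_right` helper: prepend the decoder start token, drop the last position, and replace any `-100` loss-masking labels with `pad_token_id`. A framework-free sketch of that transformation (for mT5 both special ids default to 0; the helper's real implementation lives in the shared pretrained-model base class):

```python
def shift_right_sketch(labels, decoder_start_token_id=0, pad_token_id=0):
    """Illustrative pure-Python version of the T5-style _shift_right."""
    shifted = []
    for row in labels:
        row = [decoder_start_token_id] + list(row[:-1])        # shift every token one step right
        row = [pad_token_id if t == -100 else t for t in row]  # -100 is only a loss mask, not a real id
        shifted.append(row)
    return shifted

print(shift_right_sketch([[259, 1332, 1, -100]]))
# [[0, 259, 1332, 1]]
```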

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForConditionalGeneration.prepare_inputs_for_generation(input_ids, past_key_values=None, attention_mask=None, head_mask=None, decoder_head_mask=None, decoder_attention_mask=None, cross_attn_head_mask=None, use_cache=None, encoder_outputs=None, **kwargs)

This method prepares inputs for generation in the MT5ForConditionalGeneration class.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: object

input_ids

The input token IDs for the model. Shape: [batch_size, sequence_length].

TYPE: Tensor

past_key_values

The past key values required for fast autoregressive decoding. Default: None.

TYPE: tuple DEFAULT: None

attention_mask

The attention mask for the input. Shape: [batch_size, sequence_length].

TYPE: Tensor DEFAULT: None

head_mask

The mask for the multi-head attention layers. Shape: [num_heads, sequence_length].

TYPE: Tensor DEFAULT: None

decoder_head_mask

The mask for the decoder's multi-head attention layers. Shape: [num_heads, sequence_length].

TYPE: Tensor DEFAULT: None

decoder_attention_mask

The attention mask for the decoder. Shape: [batch_size, sequence_length].

TYPE: Tensor DEFAULT: None

cross_attn_head_mask

The mask for the cross-attention layers. Shape: [num_heads, sequence_length].

TYPE: Tensor DEFAULT: None

use_cache

Whether to use the cache for fast decoding. Default: None.

TYPE: bool DEFAULT: None

encoder_outputs

The outputs of the encoder. Default: None.

TYPE: tuple DEFAULT: None

RETURNS DESCRIPTION
dict

A dictionary containing the prepared inputs for the generation including 'decoder_input_ids', 'past_key_values', 'encoder_outputs', 'attention_mask', 'head_mask', 'decoder_head_mask', 'decoder_attention_mask', 'cross_attn_head_mask', and 'use_cache'.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def prepare_inputs_for_generation(
    self,
    input_ids,
    past_key_values=None,
    attention_mask=None,
    head_mask=None,
    decoder_head_mask=None,
    decoder_attention_mask=None,
    cross_attn_head_mask=None,
    use_cache=None,
    encoder_outputs=None,
    **kwargs,
):
    """
    This method prepares inputs for generation in the MT5ForConditionalGeneration class.

    Args:
        self (object): The instance of the class.
        input_ids (Tensor): The input token IDs for the model. Shape: [batch_size, sequence_length].
        past_key_values (tuple, optional): The past key values required for fast autoregressive decoding. Default: None.
        attention_mask (Tensor, optional): The attention mask for the input. Shape: [batch_size, sequence_length].
        head_mask (Tensor, optional): The mask for the multi-head attention layers. Shape: [num_heads, sequence_length].
        decoder_head_mask (Tensor, optional): The mask for the decoder's multi-head attention layers. Shape: [num_heads, sequence_length].
        decoder_attention_mask (Tensor, optional): The attention mask for the decoder. Shape: [batch_size, sequence_length].
        cross_attn_head_mask (Tensor, optional): The mask for the cross-attention layers. Shape: [num_heads, sequence_length].
        use_cache (bool, optional): Whether to use the cache for fast decoding. Default: None.
        encoder_outputs (tuple, optional): The outputs of the encoder. Default: None.

    Returns:
        dict:
            A dictionary containing the prepared inputs for the generation including 'decoder_input_ids', 'past_key_values',
            'encoder_outputs', 'attention_mask', 'head_mask', 'decoder_head_mask', 'decoder_attention_mask', 'cross_attn_head_mask',
            and 'use_cache'.

    Raises:
        None
    """
    # cut decoder_input_ids if past_key_values is used
    if past_key_values is not None:
        past_length = past_key_values[0][0].shape[2]

        # Some generation methods already pass only the last input ID
        if input_ids.shape[1] > past_length:
            remove_prefix_length = past_length
        else:
            # Default to old behavior: keep only final ID
            remove_prefix_length = input_ids.shape[1] - 1

        input_ids = input_ids[:, remove_prefix_length:]

    return {
        "decoder_input_ids": input_ids,
        "past_key_values": past_key_values,
        "encoder_outputs": encoder_outputs,
        "attention_mask": attention_mask,
        "head_mask": head_mask,
        "decoder_head_mask": decoder_head_mask,
        "decoder_attention_mask": decoder_attention_mask,
        "cross_attn_head_mask": cross_attn_head_mask,
        "use_cache": use_cache,
    }
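
The only non-trivial step above is trimming `input_ids` once a cache exists: the decoder should only be fed the positions that `past_key_values` does not already cover. A short worked example of that arithmetic, with invented lengths:

```python
# Invented lengths for illustration.
generated_len = 6   # input_ids.shape[1]: tokens produced (or provided) so far
past_length = 5     # past_key_values[0][0].shape[2]: positions already cached

if generated_len > past_length:
    remove_prefix_length = past_length        # keep only the uncached suffix
else:
    remove_prefix_length = generated_len - 1  # older callers pass full ids; keep the final token only

# input_ids[:, remove_prefix_length:] would therefore contain a single new token.
print(generated_len - remove_prefix_length)   # 1
```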

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForConditionalGeneration.set_input_embeddings(new_embeddings)

Set input embeddings for the MT5 model for conditional generation.

PARAMETER DESCRIPTION
self

The instance of the MT5ForConditionalGeneration class.

TYPE: MT5ForConditionalGeneration

new_embeddings

New input embeddings to be set for the model. Should be a tensor of shape [vocab_size, embedding_size] where:

  • vocab_size: Number of tokens in the vocabulary.
  • embedding_size: Dimension of the token embeddings.

The new_embeddings should match the token embedding requirements of the model.

TYPE: Tensor

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the new_embeddings provided is not a tensor.

ValueError

If the shape of the new_embeddings tensor does not match the expected shape [vocab_size, embedding_size].

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def set_input_embeddings(self, new_embeddings):
    """
    Set input embeddings for the MT5 model for conditional generation.

    Args:
        self (MT5ForConditionalGeneration): The instance of the MT5ForConditionalGeneration class.
        new_embeddings (Tensor): New input embeddings to be set for the model.
            Should be a tensor of shape [vocab_size, embedding_size] where:

            - vocab_size: Number of tokens in the vocabulary.
            - embedding_size: Dimension of the token embeddings.

            The new_embeddings should match the token embedding requirements of the model.

    Returns:
        None.

    Raises:
        TypeError: If the new_embeddings provided is not a tensor.
        ValueError: If the shape of the new_embeddings tensor does not match the expected shape
            [vocab_size, embedding_size].
    """
    self.shared = new_embeddings
    self.encoder.set_input_embeddings(new_embeddings)
    self.decoder.set_input_embeddings(new_embeddings)

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForConditionalGeneration.set_output_embeddings(new_embeddings)

Set the output embeddings for the MT5 model.

PARAMETER DESCRIPTION
self

The instance of the MT5ForConditionalGeneration class.

TYPE: MT5ForConditionalGeneration

new_embeddings

The new output embeddings to be set for the model. It can be of any valid type.

TYPE: object

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def set_output_embeddings(self, new_embeddings):
    """
    Set the output embeddings for the MT5 model.

    Args:
        self (MT5ForConditionalGeneration): The instance of the MT5ForConditionalGeneration class.
        new_embeddings (object): The new output embeddings to be set for the model. It can be of any valid type.

    Returns:
        None.

    Raises:
        None.
    """
    self.lm_head = new_embeddings

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForQuestionAnswering

Bases: MT5PreTrainedModel

MT5ForQuestionAnswering is a class that represents a Question Answering model based on the MT5 architecture. It is a subclass of MT5PreTrainedModel.

The class includes the following methods:

  • init: Initializes an instance of the class with the given configuration.
  • get_input_embeddings: Returns the shared input embeddings.
  • set_input_embeddings: Sets the shared input embeddings to the provided new embeddings.
  • get_encoder: Returns the encoder module of the model.
  • get_decoder: Returns the decoder module of the model.
  • forward: Constructs the model and returns the outputs.

The 'forward' method takes various input tensors and returns either a tuple of tensors or an instance of Seq2SeqQuestionAnsweringModelOutput.

Please note that this docstring does not include the method signatures or any other code.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
class MT5ForQuestionAnswering(MT5PreTrainedModel):

    """
    MT5ForQuestionAnswering is a class that represents a Question Answering model based on the MT5 architecture.
    It is a subclass of MT5PreTrainedModel.

    The class includes the following methods:

    - __init__: Initializes an instance of the class with the given configuration.
    - get_input_embeddings: Returns the shared input embeddings.
    - set_input_embeddings: Sets the shared input embeddings to the provided new embeddings.
    - get_encoder: Returns the encoder module of the model.
    - get_decoder: Returns the decoder module of the model.
    - forward: Constructs the model and returns the outputs.

    The 'forward' method takes various input tensors and returns either a tuple of tensors or an instance of
    Seq2SeqQuestionAnsweringModelOutput.

    Please note that this docstring does not include the method signatures or any other code.
    """
    _keys_to_ignore_on_load_unexpected = ["decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight"]
    _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"]

    # Copied from transformers.models.t5.modeling_t5.T5ForQuestionAnswering.__init__ with T5->MT5
    def __init__(self, config: MT5Config):
        """Initialize an instance of the MT5ForQuestionAnswering class.

        Args:
            self: The instance of the class.
            config (MT5Config): The configuration object for the MT5 model.

        Returns:
            None

        Raises:
            None
        """
        super().__init__(config)
        self.model_dim = config.d_model

        self.shared = nn.Embedding(config.vocab_size, config.d_model)

        encoder_config = copy.deepcopy(config)
        encoder_config.is_decoder = False
        encoder_config.use_cache = False
        encoder_config.is_encoder_decoder = False
        self.encoder = MT5Stack(encoder_config, self.shared)

        decoder_config = copy.deepcopy(config)
        decoder_config.is_decoder = True
        decoder_config.is_encoder_decoder = False
        decoder_config.num_layers = config.num_decoder_layers
        self.decoder = MT5Stack(decoder_config, self.shared)

        self.num_labels = config.num_labels
        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)

        # Initialize weights and apply final processing
        self.post_init()

        self.model_parallel = False

    # Copied from transformers.models.t5.modeling_t5.T5ForQuestionAnswering.get_input_embeddings
    def get_input_embeddings(self):
        """
        This method retrieves the input embeddings from the MT5 model for question answering.

        Args:
            self: An instance of the MT5ForQuestionAnswering class.

        Returns:
            nn.Embedding: The shared token embedding module used by both the encoder and the decoder.

        Raises:
            This method does not raise any exceptions.
        """
        return self.shared

    # Copied from transformers.models.t5.modeling_t5.T5ForQuestionAnswering.set_input_embeddings
    def set_input_embeddings(self, new_embeddings):
        """
        Set the input embeddings for both encoder and decoder in the MT5ForQuestionAnswering model.

        Args:
            self (MT5ForQuestionAnswering): The instance of the MT5ForQuestionAnswering class.
            new_embeddings (Tensor): New embeddings to be set as input for both encoder and decoder.
                Should be a tensor of the same shape as the current input embeddings.

        Returns:
            None.

        Raises:
            ValueError: If the shape of the new embeddings does not match the current input embeddings.
            TypeError: If the new_embeddings parameter is not a tensor.
        """
        self.shared = new_embeddings
        self.encoder.set_input_embeddings(new_embeddings)
        self.decoder.set_input_embeddings(new_embeddings)

    # Copied from transformers.models.t5.modeling_t5.T5ForQuestionAnswering.get_encoder
    def get_encoder(self):
        """
        Get the encoder object used in the MT5ForQuestionAnswering class.

        Args:
            self: An instance of MT5ForQuestionAnswering.

        Returns:
            encoder: The method returns the encoder object, which is an instance of a specific encoder used in the
                MT5ForQuestionAnswering class.

        Raises:
            None.

        """
        return self.encoder

    # Copied from transformers.models.t5.modeling_t5.T5ForQuestionAnswering.get_decoder
    def get_decoder(self):
        """
        Method to retrieve the decoder object.

        Args:
            self: An instance of the MT5ForQuestionAnswering class.

        Returns:
            decoder: The method returns the decoder object associated with the MT5ForQuestionAnswering instance.

        Raises:
            None.
        """
        return self.decoder

    # Copied from transformers.models.t5.modeling_t5.T5ForQuestionAnswering.forward
    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        decoder_input_ids: Optional[mindspore.Tensor] = None,
        decoder_attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        decoder_head_mask: Optional[mindspore.Tensor] = None,
        cross_attn_head_mask: Optional[mindspore.Tensor] = None,
        encoder_outputs: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
        start_positions: Optional[mindspore.Tensor] = None,
        end_positions: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        decoder_inputs_embeds: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[mindspore.Tensor], Seq2SeqQuestionAnsweringModelOutput]:
        r"""
        Args:
            start_positions (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
                Labels for position (index) of the start of the labelled span for computing the token classification loss.
                Positions are clamped to the length of the sequence (*sequence_length*). Position outside of the sequence
                are not taken into account for computing the loss.
            end_positions (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
                Labels for position (index) of the end of the labelled span for computing the token classification loss.
                Positions are clamped to the length of the sequence (*sequence_length*). Position outside of the sequence
                are not taken into account for computing the loss.

        Returns:
            Union[Tuple[mindspore.Tensor], Seq2SeqQuestionAnsweringModelOutput]
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        if start_positions is not None and end_positions is not None:
            use_cache = False

        # Copied from models.bart.modeling_bart.BartModel.forward
        #   different to other models, T5 automatically creates decoder_input_ids from
        #   input_ids if no decoder_input_ids are provided
        if decoder_input_ids is None and decoder_inputs_embeds is None:
            if input_ids is None:
                raise ValueError(
                    "If no `decoder_input_ids` or `decoder_inputs_embeds` are "
                    "passed, `input_ids` cannot be `None`. Please pass either "
                    "`input_ids` or `decoder_input_ids` or `decoder_inputs_embeds`."
                )
            decoder_input_ids = self._shift_right(input_ids)

        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask
        if head_mask is not None and decoder_head_mask is None:
            if self.config.num_layers == self.config.num_decoder_layers:
                warnings.warn(__HEAD_MASK_WARNING_MSG, FutureWarning)
                decoder_head_mask = head_mask

        # Encode if needed (training, first prediction pass)
        if encoder_outputs is None:
            encoder_outputs = self.encoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
                inputs_embeds=inputs_embeds,
                head_mask=head_mask,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
        elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
            encoder_outputs = BaseModelOutput(
                last_hidden_state=encoder_outputs[0],
                hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
                attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
            )

        hidden_states = encoder_outputs[0]

        # Decode
        decoder_outputs = self.decoder(
            input_ids=decoder_input_ids,
            attention_mask=decoder_attention_mask,
            inputs_embeds=decoder_inputs_embeds,
            past_key_values=None,
            encoder_hidden_states=hidden_states,
            encoder_attention_mask=attention_mask,
            head_mask=decoder_head_mask,
            cross_attn_head_mask=cross_attn_head_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        sequence_output = decoder_outputs[0]

        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, axis=-1)
        start_logits = start_logits.squeeze(-1)
        end_logits = end_logits.squeeze(-1)

        total_loss = None
        if start_positions is not None and end_positions is not None:
            # If we are on multi-GPU, split add a dimension
            if len(start_positions.shape) > 1:
                start_positions = start_positions.squeeze(-1)
            if len(end_positions.shape) > 1:
                end_positions = end_positions.squeeze(-1)
            # sometimes the start/end positions are outside our model inputs, we ignore these terms
            ignored_index = start_logits.shape[1]
            start_positions = start_positions.clamp(0, ignored_index)
            end_positions = end_positions.clamp(0, ignored_index)

            start_loss = ops.cross_entropy(start_logits, start_positions, ignore_index=ignored_index)
            end_loss = ops.cross_entropy(end_logits, end_positions, ignore_index=ignored_index)
            total_loss = (start_loss + end_loss) / 2

        if not return_dict:
            output = (start_logits, end_logits) + decoder_outputs[1:] + encoder_outputs
            return ((total_loss,) + output) if total_loss is not None else output

        return Seq2SeqQuestionAnsweringModelOutput(
            loss=total_loss,
            start_logits=start_logits,
            end_logits=end_logits,
            past_key_values=decoder_outputs.past_key_values,
            decoder_hidden_states=decoder_outputs.hidden_states,
            decoder_attentions=decoder_outputs.attentions,
            cross_attentions=decoder_outputs.cross_attentions,
            encoder_last_hidden_state=encoder_outputs.last_hidden_state,
            encoder_hidden_states=encoder_outputs.hidden_states,
            encoder_attentions=encoder_outputs.attentions,
        )

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForQuestionAnswering.__init__(config)

Initialize an instance of the MT5ForQuestionAnswering class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object for the MT5 model.

TYPE: MT5Config

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def __init__(self, config: MT5Config):
    """Initialize an instance of the MT5ForQuestionAnswering class.

    Args:
        self: The instance of the class.
        config (MT5Config): The configuration object for the MT5 model.

    Returns:
        None

    Raises:
        None
    """
    super().__init__(config)
    self.model_dim = config.d_model

    self.shared = nn.Embedding(config.vocab_size, config.d_model)

    encoder_config = copy.deepcopy(config)
    encoder_config.is_decoder = False
    encoder_config.use_cache = False
    encoder_config.is_encoder_decoder = False
    self.encoder = MT5Stack(encoder_config, self.shared)

    decoder_config = copy.deepcopy(config)
    decoder_config.is_decoder = True
    decoder_config.is_encoder_decoder = False
    decoder_config.num_layers = config.num_decoder_layers
    self.decoder = MT5Stack(decoder_config, self.shared)

    self.num_labels = config.num_labels
    self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)

    # Initialize weights and apply final processing
    self.post_init()

    self.model_parallel = False

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForQuestionAnswering.forward(input_ids=None, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, head_mask=None, decoder_head_mask=None, cross_attn_head_mask=None, encoder_outputs=None, start_positions=None, end_positions=None, inputs_embeds=None, decoder_inputs_embeds=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
start_positions

Labels for the position (index) of the start of the labelled span, used to compute the token classification loss. Positions are clamped to the length of the sequence (sequence_length); positions outside the sequence are not taken into account when computing the loss.

TYPE: `mindspore.Tensor` of shape `(batch_size,)`, *optional* DEFAULT: None

end_positions

Labels for the position (index) of the end of the labelled span, used to compute the token classification loss. Positions are clamped to the length of the sequence (sequence_length); positions outside the sequence are not taken into account when computing the loss.

TYPE: `mindspore.Tensor` of shape `(batch_size,)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple[Tensor], Seq2SeqQuestionAnsweringModelOutput]

Union[Tuple[mindspore.Tensor], Seq2SeqQuestionAnsweringModelOutput]

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    decoder_input_ids: Optional[mindspore.Tensor] = None,
    decoder_attention_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    decoder_head_mask: Optional[mindspore.Tensor] = None,
    cross_attn_head_mask: Optional[mindspore.Tensor] = None,
    encoder_outputs: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
    start_positions: Optional[mindspore.Tensor] = None,
    end_positions: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    decoder_inputs_embeds: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple[mindspore.Tensor], Seq2SeqQuestionAnsweringModelOutput]:
    r"""
    Args:
        start_positions (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the start of the labelled span for computing the token classification loss.
            Positions are clamped to the length of the sequence (*sequence_length*). Position outside of the sequence
            are not taken into account for computing the loss.
        end_positions (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for position (index) of the end of the labelled span for computing the token classification loss.
            Positions are clamped to the length of the sequence (*sequence_length*). Position outside of the sequence
            are not taken into account for computing the loss.

    Returns:
        Union[Tuple[mindspore.Tensor], Seq2SeqQuestionAnsweringModelOutput]
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    use_cache = use_cache if use_cache is not None else self.config.use_cache
    if start_positions is not None and end_positions is not None:
        use_cache = False

    # Copied from models.bart.modeling_bart.BartModel.forward
    #   different to other models, T5 automatically creates decoder_input_ids from
    #   input_ids if no decoder_input_ids are provided
    if decoder_input_ids is None and decoder_inputs_embeds is None:
        if input_ids is None:
            raise ValueError(
                "If no `decoder_input_ids` or `decoder_inputs_embeds` are "
                "passed, `input_ids` cannot be `None`. Please pass either "
                "`input_ids` or `decoder_input_ids` or `decoder_inputs_embeds`."
            )
        decoder_input_ids = self._shift_right(input_ids)

    use_cache = use_cache if use_cache is not None else self.config.use_cache
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask
    if head_mask is not None and decoder_head_mask is None:
        if self.config.num_layers == self.config.num_decoder_layers:
            warnings.warn(__HEAD_MASK_WARNING_MSG, FutureWarning)
            decoder_head_mask = head_mask

    # Encode if needed (training, first prediction pass)
    if encoder_outputs is None:
        encoder_outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
    elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
        encoder_outputs = BaseModelOutput(
            last_hidden_state=encoder_outputs[0],
            hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
            attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
        )

    hidden_states = encoder_outputs[0]

    # Decode
    decoder_outputs = self.decoder(
        input_ids=decoder_input_ids,
        attention_mask=decoder_attention_mask,
        inputs_embeds=decoder_inputs_embeds,
        past_key_values=None,
        encoder_hidden_states=hidden_states,
        encoder_attention_mask=attention_mask,
        head_mask=decoder_head_mask,
        cross_attn_head_mask=cross_attn_head_mask,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    sequence_output = decoder_outputs[0]

    logits = self.qa_outputs(sequence_output)
    start_logits, end_logits = logits.split(1, axis=-1)
    start_logits = start_logits.squeeze(-1)
    end_logits = end_logits.squeeze(-1)

    total_loss = None
    if start_positions is not None and end_positions is not None:
        # If we are on multi-GPU, split add a dimension
        if len(start_positions.shape) > 1:
            start_positions = start_positions.squeeze(-1)
        if len(end_positions.shape) > 1:
            end_positions = end_positions.squeeze(-1)
        # sometimes the start/end positions are outside our model inputs, we ignore these terms
        ignored_index = start_logits.shape[1]
        start_positions = start_positions.clamp(0, ignored_index)
        end_positions = end_positions.clamp(0, ignored_index)

        start_loss = ops.cross_entropy(start_logits, start_positions, ignore_index=ignored_index)
        end_loss = ops.cross_entropy(end_logits, end_positions, ignore_index=ignored_index)
        total_loss = (start_loss + end_loss) / 2

    if not return_dict:
        output = (start_logits, end_logits) + decoder_outputs[1:] + encoder_outputs
        return ((total_loss,) + output) if total_loss is not None else output

    return Seq2SeqQuestionAnsweringModelOutput(
        loss=total_loss,
        start_logits=start_logits,
        end_logits=end_logits,
        past_key_values=decoder_outputs.past_key_values,
        decoder_hidden_states=decoder_outputs.hidden_states,
        decoder_attentions=decoder_outputs.attentions,
        cross_attentions=decoder_outputs.cross_attentions,
        encoder_last_hidden_state=encoder_outputs.last_hidden_state,
        encoder_hidden_states=encoder_outputs.hidden_states,
        encoder_attentions=encoder_outputs.attentions,
    )
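
The start/end logits returned here are per-position scores over the decoder sequence; turning them into an answer span is left to the caller. A minimal, hedged post-processing sketch using greedy argmax (stand-in logits, no model or tokenizer involved; production pipelines usually score all valid start/end pairs instead):

```python
import mindspore
from mindspore import ops

# Stand-in logits for a batch of one sequence with 6 positions; values are invented.
start_logits = mindspore.Tensor([[0.1, 0.2, 3.5, 0.0, 0.1, 0.2]], mindspore.float32)
end_logits = mindspore.Tensor([[0.0, 0.1, 0.3, 0.2, 2.9, 0.1]], mindspore.float32)

start_index = int(ops.argmax(start_logits, dim=-1)[0])
end_index = int(ops.argmax(end_logits, dim=-1)[0])
if end_index < start_index:   # simple guard against an inverted span
    end_index = start_index

print(start_index, end_index)  # 2 4 -> the answer covers decoder positions 2..4 inclusive
```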

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForQuestionAnswering.get_decoder()

Method to retrieve the decoder object.

PARAMETER DESCRIPTION
self

An instance of the MT5ForQuestionAnswering class.

RETURNS DESCRIPTION
decoder

The method returns the decoder object associated with the MT5ForQuestionAnswering instance.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def get_decoder(self):
    """
    Method to retrieve the decoder object.

    Args:
        self: An instance of the MT5ForQuestionAnswering class.

    Returns:
        decoder: The method returns the decoder object associated with the MT5ForQuestionAnswering instance.

    Raises:
        None.
    """
    return self.decoder

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForQuestionAnswering.get_encoder()

Get the encoder object used in the MT5ForQuestionAnswering class.

PARAMETER DESCRIPTION
self

An instance of MT5ForQuestionAnswering.

RETURNS DESCRIPTION
encoder

The method returns the encoder object, which is an instance of a specific encoder used in the MT5ForQuestionAnswering class.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def get_encoder(self):
    """
    Get the encoder object used in the MT5ForQuestionAnswering class.

    Args:
        self: An instance of MT5ForQuestionAnswering.

    Returns:
        encoder: The method returns the encoder object, which is an instance of a specific encoder used in the
            MT5ForQuestionAnswering class.

    Raises:
        None.

    """
    return self.encoder

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForQuestionAnswering.get_input_embeddings()

This method retrieves the input embeddings from the MT5 model for question answering.

PARAMETER DESCRIPTION
self

An instance of the MT5ForQuestionAnswering class.

RETURNS DESCRIPTION
embeddings

The shared token embedding module (nn.Embedding) used by both the encoder and the decoder.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def get_input_embeddings(self):
    """
    This method retrieves the input embeddings from the MT5 model for question answering.

    Args:
        self: An instance of the MT5ForQuestionAnswering class.

    Returns:
        nn.Embedding: The shared token embedding module used by both the encoder and the decoder.

    Raises:
        This method does not raise any exceptions.
    """
    return self.shared

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForQuestionAnswering.set_input_embeddings(new_embeddings)

Set the input embeddings for both encoder and decoder in the MT5ForQuestionAnswering model.

PARAMETER DESCRIPTION
self

The instance of the MT5ForQuestionAnswering class.

TYPE: MT5ForQuestionAnswering

new_embeddings

New embeddings to be set as input for both encoder and decoder. Should be a tensor of the same shape as the current input embeddings.

TYPE: Tensor

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If the shape of the new embeddings does not match the current input embeddings.

TypeError

If the new_embeddings parameter is not a tensor.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
def set_input_embeddings(self, new_embeddings):
    """
    Set the input embeddings for both encoder and decoder in the MT5ForQuestionAnswering model.

    Args:
        self (MT5ForQuestionAnswering): The instance of the MT5ForQuestionAnswering class.
        new_embeddings (Tensor): New embeddings to be set as input for both encoder and decoder.
            Should be a tensor of the same shape as the current input embeddings.

    Returns:
        None.

    Raises:
        ValueError: If the shape of the new embeddings does not match the current input embeddings.
        TypeError: If the new_embeddings parameter is not a tensor.
    """
    self.shared = new_embeddings
    self.encoder.set_input_embeddings(new_embeddings)
    self.decoder.set_input_embeddings(new_embeddings)

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForSequenceClassification

Bases: MT5PreTrainedModel

This class represents a sequence classification model based on the MT5 architecture. It is designed for fine-tuning the MT5 model on sequence classification tasks.

The MT5ForSequenceClassification class inherits from the MT5PreTrainedModel class, which provides the basic implementation for loading and saving pre-trained MT5 models.

To initialize an instance of this class, an MT5Config object must be passed as a parameter to the constructor.

The MT5ForSequenceClassification class has the following attributes:

  • transformer: An instance of the MT5Model class, which represents the main transformer model.
  • classification_head: An instance of the MT5ClassificationHead class, which represents the classification head of the model.

The forward method is used to process the input and generate the outputs of the model. It takes several input tensors as parameters, such as input_ids, attention_mask, decoder_input_ids, etc. The method returns a tuple of outputs, including the predicted logits for classification, and other intermediate outputs if requested.

If labels are provided, the method also calculates the loss based on the predicted logits and the provided labels. The loss calculation depends on the problem_type specified in the configuration. The supported problem types are regression, single-label classification, and multi-label classification.

Note

The MT5ForSequenceClassification class does not currently support passing input embeddings instead of input IDs.

The MT5ForSequenceClassification class is designed to be used with the MT5 model for fine-tuning on sequence classification tasks. It provides a convenient interface for processing input sequences and generating predictions.

Please refer to the documentation of the MT5PreTrainedModel class for more details on loading and saving pre-trained MT5 models.
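
The forward code of this class is not reproduced in this excerpt, but the loss selection described above typically follows the standard problem-type dispatch sketched below. This is a hedged illustration, not the class's literal implementation; the function name and argument layout are invented for clarity.

```python
import mindspore
from mindspore import ops

def classification_loss_sketch(logits, labels, problem_type, num_labels):
    """Hedged sketch of the usual problem_type dispatch."""
    if problem_type == "regression":
        return ops.mse_loss(logits.squeeze(), labels.squeeze().float())
    if problem_type == "single_label_classification":
        return ops.cross_entropy(logits.view(-1, num_labels), labels.view(-1))
    if problem_type == "multi_label_classification":
        # Passing explicit weight / pos_weight of ones keeps this call portable across MindSpore versions.
        ones = ops.ones_like(logits)
        return ops.binary_cross_entropy_with_logits(logits, labels.float(), ones, ones)
    raise ValueError(f"unknown problem_type: {problem_type}")

logits = mindspore.Tensor([[1.2, -0.3], [0.4, 0.9]], mindspore.float32)
labels = mindspore.Tensor([0, 1], mindspore.int32)
print(classification_loss_sketch(logits, labels, "single_label_classification", num_labels=2))
```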

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py
class MT5ForSequenceClassification(MT5PreTrainedModel):

    """
    This class represents a sequence classification model based on the MT5 architecture.
    It is designed for fine-tuning the MT5 model on sequence classification tasks.

    The `MT5ForSequenceClassification` class inherits from the `MT5PreTrainedModel` class,
    which provides the basic implementation for loading and saving pre-trained MT5 models.

    To initialize an instance of this class, a `MT5Config` object must be passed as a parameter to the forwardor.

    The `MT5ForSequenceClassification` class has the following attributes:

    - `transformer`: An instance of the `MT5Model` class, which represents the main transformer model.
    - `classification_head`: An instance of the `MT5ClassificationHead` class, which represents the classification head of the model.

    The `forward` method is used to process the input and generate the outputs of the model. It takes several input
    tensors as parameters, such as `input_ids`, `attention_mask`, `decoder_input_ids`, etc. The method returns a tuple
    of outputs, including the predicted logits for classification, and other intermediate outputs if requested.

    If labels are provided, the method also calculates the loss based on the predicted logits and the provided labels.
    The loss calculation depends on the `problem_type` specified in the configuration. The supported problem types are
    regression, single-label classification, and multi-label classification.

    Note:
        The `MT5ForSequenceClassification` class does not currently support passing input embeddings instead of input IDs.

    The `MT5ForSequenceClassification` class is designed to be used with the MT5 model for fine-tuning on sequence
    classification tasks. It provides a convenient interface for processing input sequences and generating predictions.

    Please refer to the documentation of the `MT5PreTrainedModel` class for more details on loading and saving
    pre-trained MT5 models.
    """
    _keys_to_ignore_on_load_unexpected = ["decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight"]
    _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"]

    # Copied from transformers.models.t5.modeling_t5.T5ForSequenceClassification.__init__ with T5->MT5
    def __init__(self, config: MT5Config):
        """
        Initializes an instance of MT5ForSequenceClassification.

        Args:
            self: The instance of the MT5ForSequenceClassification class.
            config (MT5Config): An object of type MT5Config containing configuration parameters for the model.

        Returns:
            None.

        Raises:
            TypeError: If the config parameter is not of type MT5Config.
            ValueError: If there are any issues during initialization of the transformer, classification head,
                or post_init method.
        """
        super().__init__(config)
        self.transformer = MT5Model(config)
        self.classification_head = MT5ClassificationHead(config)

        # Initialize weights and apply final processing
        self.post_init()

    # Copied from transformers.models.t5.modeling_t5.T5ForSequenceClassification.forward
    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        decoder_input_ids: Optional[mindspore.Tensor] = None,
        decoder_attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        decoder_head_mask: Optional[mindspore.Tensor] = None,
        cross_attn_head_mask: Optional[mindspore.Tensor] = None,
        encoder_outputs: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        decoder_inputs_embeds: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, Seq2SeqSequenceClassifierOutput]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
                Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
                config.num_labels - 1]`. If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

        Returns:
            Union[Tuple, Seq2SeqSequenceClassifierOutput]
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        if labels is not None:
            use_cache = False

        if input_ids is None and inputs_embeds is not None:
            raise NotImplementedError(
                f"Passing input embeddings is currently not supported for {self.__class__.__name__}"
            )

        # Copied from models.bart.modeling_bart.BartModel.forward different to other models, T5 automatically creates
        # decoder_input_ids from input_ids if no decoder_input_ids are provided
        if decoder_input_ids is None and decoder_inputs_embeds is None:
            if input_ids is None:
                raise ValueError(
                    "If no `decoder_input_ids` or `decoder_inputs_embeds` are "
                    "passed, `input_ids` cannot be `None`. Please pass either "
                    "`input_ids` or `decoder_input_ids` or `decoder_inputs_embeds`."
                )
            decoder_input_ids = self._shift_right(input_ids)

        outputs = self.transformer(
            input_ids,
            attention_mask=attention_mask,
            decoder_input_ids=decoder_input_ids,
            decoder_attention_mask=decoder_attention_mask,
            head_mask=head_mask,
            decoder_head_mask=decoder_head_mask,
            cross_attn_head_mask=cross_attn_head_mask,
            encoder_outputs=encoder_outputs,
            inputs_embeds=inputs_embeds,
            decoder_inputs_embeds=decoder_inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = outputs[0]

        eos_mask = input_ids.eq(self.config.eos_token_id)

        if len(ops.unique_consecutive(eos_mask.sum(1))) > 1:
            raise ValueError("All examples must have the same number of <eos> tokens.")
        batch_size, _, hidden_size = sequence_output.shape
        sentence_representation = sequence_output[eos_mask, :].view(batch_size, -1, hidden_size)[:, -1, :]
        logits = self.classification_head(sentence_representation)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.config.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.config.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                if self.config.num_labels == 1:
                    loss = ops.mse_loss(logits.squeeze(), labels.squeeze())
                else:
                    loss = ops.mse_loss(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss = ops.cross_entropy(logits.view(-1, self.config.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss = ops.binary_cross_entropy_with_logits(logits, labels)
        if not return_dict:
            output = (logits,) + outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return Seq2SeqSequenceClassifierOutput(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            decoder_hidden_states=outputs.decoder_hidden_states,
            decoder_attentions=outputs.decoder_attentions,
            cross_attentions=outputs.cross_attentions,
            encoder_last_hidden_state=outputs.encoder_last_hidden_state,
            encoder_hidden_states=outputs.encoder_hidden_states,
            encoder_attentions=outputs.encoder_attentions,
        )

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForSequenceClassification.__init__(config)

Initializes an instance of MT5ForSequenceClassification.

PARAMETER DESCRIPTION
self

The instance of the MT5ForSequenceClassification class.

config

An object of type MT5Config containing configuration parameters for the model.

TYPE: MT5Config

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the config parameter is not of type MT5Config.

ValueError

If there are any issues during initialization of the transformer, classification head, or post_init method.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 2459-2480
def __init__(self, config: MT5Config):
    """
    Initializes an instance of MT5ForSequenceClassification.

    Args:
        self: The instance of the MT5ForSequenceClassification class.
        config (MT5Config): An object of type MT5Config containing configuration parameters for the model.

    Returns:
        None.

    Raises:
        TypeError: If the config parameter is not of type MT5Config.
        ValueError: If there are any issues during initialization of the transformer, classification head,
            or post_init method.
    """
    super().__init__(config)
    self.transformer = MT5Model(config)
    self.classification_head = MT5ClassificationHead(config)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.mt5.modeling_mt5.MT5ForSequenceClassification.forward(input_ids=None, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, head_mask=None, decoder_head_mask=None, cross_attn_head_mask=None, encoder_outputs=None, inputs_embeds=None, decoder_inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels > 1 a classification loss is computed (Cross-Entropy).

TYPE: `mindspore.Tensor` of shape `(batch_size,)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, Seq2SeqSequenceClassifierOutput]

Union[Tuple, Seq2SeqSequenceClassifierOutput]
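
For the multi-label path, a hedged sketch (tiny assumed config and import path): float labels of shape (batch_size, num_labels) with an unset problem_type make the loss fall through to binary cross-entropy with logits:

import numpy as np
import mindspore
from mindnlp.transformers import MT5Config, MT5ForSequenceClassification

config = MT5Config(vocab_size=1000, d_model=64, d_ff=128,
                   num_layers=2, num_heads=4, num_labels=4)
model = MT5ForSequenceClassification(config)

ids = np.random.randint(2, 1000, size=(2, 6)).astype(np.int64)
ids[:, -1] = config.eos_token_id                   # one <eos> per example
input_ids = mindspore.Tensor(ids)

# Float multi-hot targets -> problem_type is inferred as "multi_label_classification".
labels = mindspore.Tensor(np.array([[1, 0, 1, 0], [0, 1, 0, 0]], dtype=np.float32))

outputs = model(input_ids=input_ids, labels=labels, return_dict=True)
print(outputs.loss)                                # BCE-with-logits over the 4 label columns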

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 2483-2589
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    decoder_input_ids: Optional[mindspore.Tensor] = None,
    decoder_attention_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    decoder_head_mask: Optional[mindspore.Tensor] = None,
    cross_attn_head_mask: Optional[mindspore.Tensor] = None,
    encoder_outputs: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    decoder_inputs_embeds: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, Seq2SeqSequenceClassifierOutput]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels > 1` a classification loss is computed (Cross-Entropy).

    Returns:
        Union[Tuple, Seq2SeqSequenceClassifierOutput]
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    if labels is not None:
        use_cache = False

    if input_ids is None and inputs_embeds is not None:
        raise NotImplementedError(
            f"Passing input embeddings is currently not supported for {self.__class__.__name__}"
        )

    # Copied from models.bart.modeling_bart.BartModel.forward different to other models, T5 automatically creates
    # decoder_input_ids from input_ids if no decoder_input_ids are provided
    if decoder_input_ids is None and decoder_inputs_embeds is None:
        if input_ids is None:
            raise ValueError(
                "If no `decoder_input_ids` or `decoder_inputs_embeds` are "
                "passed, `input_ids` cannot be `None`. Please pass either "
                "`input_ids` or `decoder_input_ids` or `decoder_inputs_embeds`."
            )
        decoder_input_ids = self._shift_right(input_ids)

    outputs = self.transformer(
        input_ids,
        attention_mask=attention_mask,
        decoder_input_ids=decoder_input_ids,
        decoder_attention_mask=decoder_attention_mask,
        head_mask=head_mask,
        decoder_head_mask=decoder_head_mask,
        cross_attn_head_mask=cross_attn_head_mask,
        encoder_outputs=encoder_outputs,
        inputs_embeds=inputs_embeds,
        decoder_inputs_embeds=decoder_inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    sequence_output = outputs[0]

    eos_mask = input_ids.eq(self.config.eos_token_id)

    if len(ops.unique_consecutive(eos_mask.sum(1))) > 1:
        raise ValueError("All examples must have the same number of <eos> tokens.")
    batch_size, _, hidden_size = sequence_output.shape
    sentence_representation = sequence_output[eos_mask, :].view(batch_size, -1, hidden_size)[:, -1, :]
    logits = self.classification_head(sentence_representation)

    loss = None
    if labels is not None:
        if self.config.problem_type is None:
            if self.config.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.config.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            if self.config.num_labels == 1:
                loss = ops.mse_loss(logits.squeeze(), labels.squeeze())
            else:
                loss = ops.mse_loss(logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss = ops.cross_entropy(logits.view(-1, self.config.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss = ops.binary_cross_entropy_with_logits(logits, labels)
    if not return_dict:
        output = (logits,) + outputs[1:]
        return ((loss,) + output) if loss is not None else output

    return Seq2SeqSequenceClassifierOutput(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        decoder_hidden_states=outputs.decoder_hidden_states,
        decoder_attentions=outputs.decoder_attentions,
        cross_attentions=outputs.cross_attentions,
        encoder_last_hidden_state=outputs.encoder_last_hidden_state,
        encoder_hidden_states=outputs.encoder_hidden_states,
        encoder_attentions=outputs.encoder_attentions,
    )

mindnlp.transformers.models.mt5.modeling_mt5.MT5LayerCrossAttention

Bases: Module

MT5LayerCrossAttention represents the cross-attention layer used in the MT5 model.

This class inherits from nn.Module and includes methods for initializing the layer and running the cross-attention forward pass.

ATTRIBUTE DESCRIPTION
EncDecAttention

An instance of the MT5Attention class for encoder-decoder attention mechanism.

layer_norm

An instance of the MT5LayerNorm class for layer normalization.

dropout

An instance of the nn.Dropout class for applying dropout.

METHOD DESCRIPTION
__init__

Initializes the MT5LayerCrossAttention instance with the given configuration.

forward

Constructs the cross-attention mechanism using the given parameters and returns the outputs.
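
A hedged sketch (tiny assumed config and import paths): cross-attention from decoder hidden states over encoder hidden states; the first element of the returned tuple is the residual-added layer output:

import numpy as np
import mindspore
from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5LayerCrossAttention

config = MT5Config(d_model=64, d_kv=16, num_heads=4, d_ff=128, num_layers=2)
layer = MT5LayerCrossAttention(config)

decoder_states = mindspore.Tensor(np.random.randn(2, 5, 64), mindspore.float32)
encoder_states = mindspore.Tensor(np.random.randn(2, 9, 64), mindspore.float32)

outputs = layer(decoder_states, key_value_states=encoder_states)
print(outputs[0].shape)   # (2, 5, 64): same shape as the decoder input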

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 776-864
class MT5LayerCrossAttention(nn.Module):

    """
    MT5LayerCrossAttention represents a layer for cross-attention mechanism in the MT5 model.

    This class inherits from nn.Module and includes methods for initializing the layer and forwarding the
    cross-attention mechanism.

    Attributes:
        EncDecAttention: An instance of the MT5Attention class for encoder-decoder attention mechanism.
        layer_norm: An instance of the MT5LayerNorm class for layer normalization.
        dropout: An instance of the nn.Dropout class for applying dropout.

    Methods:
        __init__: Initializes the MT5LayerCrossAttention instance with the given configuration.
        forward: Constructs the cross-attention mechanism using the given parameters and returns the outputs.

    """
    def __init__(self, config):
        """
        Initializes an instance of the MT5LayerCrossAttention class.

        Args:
            self (MT5LayerCrossAttention): The instance of the class.
            config (dict): The configuration dictionary containing the settings for the cross-attention layer.

        Returns:
            None.

        Raises:
            ValueError: If the configuration dictionary 'config' is missing required keys or has invalid values.
            TypeError: If the data types of the input parameters are incorrect.
        """
        super().__init__()
        self.EncDecAttention = MT5Attention(config, has_relative_attention_bias=False)
        self.layer_norm = MT5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(p=config.dropout_rate)

    def forward(
        self,
        hidden_states,
        key_value_states,
        attention_mask=None,
        position_bias=None,
        layer_head_mask=None,
        past_key_value=None,
        use_cache=False,
        query_length=None,
        output_attentions=False,
    ):
        """
        This method forwards the cross-attention mechanism in the MT5 model.

        Args:
            self (MT5LayerCrossAttention): The instance of the MT5LayerCrossAttention class.
            hidden_states (torch.Tensor): The input hidden states to be processed.
            key_value_states (torch.Tensor): The key-value states used in attention computation.
            attention_mask (torch.Tensor, optional): Mask to avoid attending to specific positions. Default is None.
            position_bias (torch.Tensor, optional): Bias values added to the attention scores. Default is None.
            layer_head_mask (torch.Tensor, optional): Mask to control which heads are allowed to attend to which positions.
                Default is None.
            past_key_value (tuple, optional): Key and value tensors from the previous time steps. Default is None.
            use_cache (bool, optional): Whether to use cache for faster decoding. Default is False.
            query_length (int, optional): The length of the queries. Default is None.
            output_attentions (bool, optional): Whether to output attention weights. Default is False.

        Returns:
            tuple: A tuple containing the layer's output and additional attention outputs if requested.

        Raises:
            ValueError: If the shape of the input tensors is not compatible.
            TypeError: If the data types of the input parameters are incorrect.
            RuntimeError: If there is an issue during the attention computation process.
        """
        normed_hidden_states = self.layer_norm(hidden_states)
        attention_output = self.EncDecAttention(
            normed_hidden_states,
            mask=attention_mask,
            key_value_states=key_value_states,
            position_bias=position_bias,
            layer_head_mask=layer_head_mask,
            past_key_value=past_key_value,
            use_cache=use_cache,
            query_length=query_length,
            output_attentions=output_attentions,
        )
        layer_output = hidden_states + self.dropout(attention_output[0])
        outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them
        return outputs

mindnlp.transformers.models.mt5.modeling_mt5.MT5LayerCrossAttention.__init__(config)

Initializes an instance of the MT5LayerCrossAttention class.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: MT5LayerCrossAttention

config

The configuration dictionary containing the settings for the cross-attention layer.

TYPE: dict

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If the configuration dictionary 'config' is missing required keys or has invalid values.

TypeError

If the data types of the input parameters are incorrect.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 794-812
def __init__(self, config):
    """
    Initializes an instance of the MT5LayerCrossAttention class.

    Args:
        self (MT5LayerCrossAttention): The instance of the class.
        config (dict): The configuration dictionary containing the settings for the cross-attention layer.

    Returns:
        None.

    Raises:
        ValueError: If the configuration dictionary 'config' is missing required keys or has invalid values.
        TypeError: If the data types of the input parameters are incorrect.
    """
    super().__init__()
    self.EncDecAttention = MT5Attention(config, has_relative_attention_bias=False)
    self.layer_norm = MT5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
    self.dropout = nn.Dropout(p=config.dropout_rate)

mindnlp.transformers.models.mt5.modeling_mt5.MT5LayerCrossAttention.forward(hidden_states, key_value_states, attention_mask=None, position_bias=None, layer_head_mask=None, past_key_value=None, use_cache=False, query_length=None, output_attentions=False)

This method forwards the cross-attention mechanism in the MT5 model.

PARAMETER DESCRIPTION
self

The instance of the MT5LayerCrossAttention class.

TYPE: MT5LayerCrossAttention

hidden_states

The input hidden states to be processed.

TYPE: Tensor

key_value_states

The key-value states used in attention computation.

TYPE: Tensor

attention_mask

Mask to avoid attending to specific positions. Default is None.

TYPE: Tensor DEFAULT: None

position_bias

Bias values added to the attention scores. Default is None.

TYPE: Tensor DEFAULT: None

layer_head_mask

Mask to control which heads are allowed to attend to which positions. Default is None.

TYPE: Tensor DEFAULT: None

past_key_value

Key and value tensors from the previous time steps. Default is None.

TYPE: tuple DEFAULT: None

use_cache

Whether to use cache for faster decoding. Default is False.

TYPE: bool DEFAULT: False

query_length

The length of the queries. Default is None.

TYPE: int DEFAULT: None

output_attentions

Whether to output attention weights. Default is False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
tuple

A tuple containing the layer's output and additional attention outputs if requested.

RAISES DESCRIPTION
ValueError

If the shape of the input tensors is not compatible.

TypeError

If the data types of the input parameters are incorrect.

RuntimeError

If there is an issue during the attention computation process.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 814-864
def forward(
    self,
    hidden_states,
    key_value_states,
    attention_mask=None,
    position_bias=None,
    layer_head_mask=None,
    past_key_value=None,
    use_cache=False,
    query_length=None,
    output_attentions=False,
):
    """
    This method forwards the cross-attention mechanism in the MT5 model.

    Args:
        self (MT5LayerCrossAttention): The instance of the MT5LayerCrossAttention class.
        hidden_states (torch.Tensor): The input hidden states to be processed.
        key_value_states (torch.Tensor): The key-value states used in attention computation.
        attention_mask (torch.Tensor, optional): Mask to avoid attending to specific positions. Default is None.
        position_bias (torch.Tensor, optional): Bias values added to the attention scores. Default is None.
        layer_head_mask (torch.Tensor, optional): Mask to control which heads are allowed to attend to which positions.
            Default is None.
        past_key_value (tuple, optional): Key and value tensors from the previous time steps. Default is None.
        use_cache (bool, optional): Whether to use cache for faster decoding. Default is False.
        query_length (int, optional): The length of the queries. Default is None.
        output_attentions (bool, optional): Whether to output attention weights. Default is False.

    Returns:
        tuple: A tuple containing the layer's output and additional attention outputs if requested.

    Raises:
        ValueError: If the shape of the input tensors is not compatible.
        TypeError: If the data types of the input parameters are incorrect.
        RuntimeError: If there is an issue during the attention computation process.
    """
    normed_hidden_states = self.layer_norm(hidden_states)
    attention_output = self.EncDecAttention(
        normed_hidden_states,
        mask=attention_mask,
        key_value_states=key_value_states,
        position_bias=position_bias,
        layer_head_mask=layer_head_mask,
        past_key_value=past_key_value,
        use_cache=use_cache,
        query_length=query_length,
        output_attentions=output_attentions,
    )
    layer_output = hidden_states + self.dropout(attention_output[0])
    outputs = (layer_output,) + attention_output[1:]  # add attentions if we output them
    return outputs

mindnlp.transformers.models.mt5.modeling_mt5.MT5LayerFF

Bases: Module

MT5LayerFF is a Python class representing a feed-forward layer for the MT5 model. It inherits from nn.Module and contains methods for initialization and forward propagation.

The __init__ method initializes the MT5LayerFF instance with the provided configuration. It checks whether the configuration uses a gated activation and assigns the appropriate DenseReluDense module accordingly. Additionally, it sets up layer normalization and dropout.

The forward method applies layer normalization to the input hidden_states, passes the result through the DenseReluDense module, applies dropout, adds the result back to the input (residual connection), and returns the updated hidden_states.
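
Which projection block the layer instantiates follows from the feed_forward_proj setting on the config; a small sketch (assumed import paths, default config values otherwise):

from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import (
    MT5DenseActDense, MT5DenseGatedActDense, MT5LayerFF,
)

gated = MT5LayerFF(MT5Config(d_model=32, d_ff=64))                            # default: "gated-gelu"
plain = MT5LayerFF(MT5Config(d_model=32, d_ff=64, feed_forward_proj="relu"))  # non-gated variant

print(isinstance(gated.DenseReluDense, MT5DenseGatedActDense))  # True
print(isinstance(plain.DenseReluDense, MT5DenseActDense))       # True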

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 272-352
class MT5LayerFF(nn.Module):

    """
    MT5LayerFF is a Python class representing a feed-forward layer for the MT5 model.
    It inherits from nn.Module and contains methods for initialization and forward propagation.

    The __init__ method initializes the MT5LayerFF instance with the provided configuration.
    It checks if the configuration includes gated activation and assigns the appropriate DenseReluDense module
    accordingly. Additionally, it sets up layer normalization and dropout.

    The forward method applies layer normalization to the input hidden_states, passes it through the DenseReluDense
    module, applies dropout, and returns the updated hidden_states.

    """
    def __init__(self, config: MT5Config):
        """
        Initializes an instance of the MT5LayerFF class.

        Args:
            self: The instance of the MT5LayerFF class.
            config (MT5Config): An instance of the MT5Config class containing configuration settings for the MT5 model.

        Returns:
            None.

        Raises:
            TypeError: If the config parameter is not of type MT5Config.
            ValueError: If the config parameter is missing required attributes.
            RuntimeError: If there is an issue with the initialization process.
        """
        super().__init__()
        if config.is_gated_act:
            self.DenseReluDense = MT5DenseGatedActDense(config)
        else:
            self.DenseReluDense = MT5DenseActDense(config)

        self.layer_norm = MT5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(p=config.dropout_rate)

    def forward(self, hidden_states):
        """
        Constructs the forward pass of the feed-forward layer in the MT5 model.

        Args:
            self (MT5LayerFF): An instance of the MT5LayerFF class.
            hidden_states (torch.Tensor): The input hidden states tensor of shape
                (batch_size, sequence_length, hidden_size).

        Returns:
            torch.Tensor: The output hidden states tensor after applying the feed-forward layer,
                with the same shape as the input tensor.

        Raises:
            None: This method does not raise any exceptions.

        Description:
            This method forwards the forward pass for the feed-forward layer in the MT5 model.
            It takes the input hidden states tensor and applies a series of operations to transform it.
            The steps involved in the forward pass are as follows:

            1. Layer Normalization: The input hidden states tensor is first passed through a layer normalization
            operation using self.layer_norm. This operation normalizes the hidden states, making them more
            robust to variations in scale and distribution.
            2. Feed-Forward Transformation: The normalized hidden states tensor is then passed through a feed-forward
            transformation using self.DenseReluDense. This transformation consists of a linear layer followed by a ReLU
            activation function, followed by another linear layer. This operation helps the model learn complex
            non-linear relationships within the hidden states.
            3. Dropout: The output of the feed-forward transformation is then added to the original hidden states tensor
            after applying dropout. Dropout is a regularization technique that randomly sets a fraction of the hidden
            states to zero during training, which helps prevent overfitting and improves generalization.

            The final output hidden states tensor is returned by this method, which has the same shape as the input tensor.

        Note:
            hidden_states: This method does not modify the input hidden states tensor in-place,
                but instead returns a new tensor.
        """
        forwarded_states = self.layer_norm(hidden_states)
        forwarded_states = self.DenseReluDense(forwarded_states)
        hidden_states = hidden_states + self.dropout(forwarded_states)
        return hidden_states

mindnlp.transformers.models.mt5.modeling_mt5.MT5LayerFF.__init__(config)

Initializes an instance of the MT5LayerFF class.

PARAMETER DESCRIPTION
self

The instance of the MT5LayerFF class.

config

An instance of the MT5Config class containing configuration settings for the MT5 model.

TYPE: MT5Config

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the config parameter is not of type MT5Config.

ValueError

If the config parameter is missing required attributes.

RuntimeError

If there is an issue with the initialization process.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 286-309
def __init__(self, config: MT5Config):
    """
    Initializes an instance of the MT5LayerFF class.

    Args:
        self: The instance of the MT5LayerFF class.
        config (MT5Config): An instance of the MT5Config class containing configuration settings for the MT5 model.

    Returns:
        None.

    Raises:
        TypeError: If the config parameter is not of type MT5Config.
        ValueError: If the config parameter is missing required attributes.
        RuntimeError: If there is an issue with the initialization process.
    """
    super().__init__()
    if config.is_gated_act:
        self.DenseReluDense = MT5DenseGatedActDense(config)
    else:
        self.DenseReluDense = MT5DenseActDense(config)

    self.layer_norm = MT5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
    self.dropout = nn.Dropout(p=config.dropout_rate)

mindnlp.transformers.models.mt5.modeling_mt5.MT5LayerFF.forward(hidden_states)

Constructs the forward pass of the feed-forward layer in the MT5 model.

PARAMETER DESCRIPTION
self

An instance of the MT5LayerFF class.

TYPE: MT5LayerFF

hidden_states

The input hidden states tensor of shape (batch_size, sequence_length, hidden_size).

TYPE: Tensor

RETURNS DESCRIPTION

mindspore.Tensor: The output hidden states tensor after applying the feed-forward layer, with the same shape as the input tensor.

RAISES DESCRIPTION
None

This method does not raise any exceptions.

Description

This method implements the forward pass of the feed-forward layer in the MT5 model. It takes the input hidden states tensor and applies a series of operations to transform it. The steps involved in the forward pass are as follows:

  1. Layer Normalization: The input hidden states tensor is first passed through a layer normalization operation using self.layer_norm. This operation normalizes the hidden states, making them more robust to variations in scale and distribution.
  2. Feed-Forward Transformation: The normalized hidden states tensor is then passed through a feed-forward transformation using self.DenseReluDense. This transformation consists of a linear layer followed by a ReLU activation function, followed by another linear layer. This operation helps the model learn complex non-linear relationships within the hidden states.
  3. Dropout: The output of the feed-forward transformation is then added to the original hidden states tensor after applying dropout. Dropout is a regularization technique that randomly sets a fraction of the hidden states to zero during training, which helps prevent overfitting and improves generalization.

The final output hidden states tensor is returned by this method, which has the same shape as the input tensor.

Note

This method does not modify the input hidden states tensor in-place; it returns a new tensor.
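
The three steps can be checked against the layer call directly; a minimal sketch, assuming a tiny config with dropout_rate=0.0 so both paths are deterministic:

import numpy as np
import mindspore
from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5LayerFF

config = MT5Config(d_model=32, d_ff=64, num_layers=2, num_heads=4, dropout_rate=0.0)
ff = MT5LayerFF(config)

x = mindspore.Tensor(np.random.randn(2, 5, 32), mindspore.float32)

# Step-by-step: layer norm -> feed-forward block -> dropout -> residual add
manual = x + ff.dropout(ff.DenseReluDense(ff.layer_norm(x)))
direct = ff(x)

print(np.allclose(manual.asnumpy(), direct.asnumpy()))  # expected: True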
Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 311-352
def forward(self, hidden_states):
    """
    Constructs the forward pass of the feed-forward layer in the MT5 model.

    Args:
        self (MT5LayerFF): An instance of the MT5LayerFF class.
        hidden_states (torch.Tensor): The input hidden states tensor of shape
            (batch_size, sequence_length, hidden_size).

    Returns:
        torch.Tensor: The output hidden states tensor after applying the feed-forward layer,
            with the same shape as the input tensor.

    Raises:
        None: This method does not raise any exceptions.

    Description:
        This method forwards the forward pass for the feed-forward layer in the MT5 model.
        It takes the input hidden states tensor and applies a series of operations to transform it.
        The steps involved in the forward pass are as follows:

        1. Layer Normalization: The input hidden states tensor is first passed through a layer normalization
        operation using self.layer_norm. This operation normalizes the hidden states, making them more
        robust to variations in scale and distribution.
        2. Feed-Forward Transformation: The normalized hidden states tensor is then passed through a feed-forward
        transformation using self.DenseReluDense. This transformation consists of a linear layer followed by a ReLU
        activation function, followed by another linear layer. This operation helps the model learn complex
        non-linear relationships within the hidden states.
        3. Dropout: The output of the feed-forward transformation is then added to the original hidden states tensor
        after applying dropout. Dropout is a regularization technique that randomly sets a fraction of the hidden
        states to zero during training, which helps prevent overfitting and improves generalization.

        The final output hidden states tensor is returned by this method, which has the same shape as the input tensor.

    Note:
        hidden_states: This method does not modify the input hidden states tensor in-place,
            but instead returns a new tensor.
    """
    forwarded_states = self.layer_norm(hidden_states)
    forwarded_states = self.DenseReluDense(forwarded_states)
    hidden_states = hidden_states + self.dropout(forwarded_states)
    return hidden_states

mindnlp.transformers.models.mt5.modeling_mt5.MT5LayerNorm

Bases: Module

Represents a layer normalization module in the MT5 style with no bias and no subtraction of mean.

This class inherits from nn.Module and provides functionality for layer normalization in the MT5 style. The constructor initializes the layer normalization module with the specified hidden size and epsilon value. The forward method accepts hidden states as input, computes the variance, and normalizes the hidden states using the computed variance and the epsilon value. If the weight data type is float16 or bfloat16, the hidden states are cast to the weight data type before the weighted, normalized hidden states are returned.

ATTRIBUTE DESCRIPTION
hidden_size

The size of the hidden states.

TYPE: int

eps

The epsilon value for numerical stability.

TYPE: float

METHOD DESCRIPTION
__init__

Constructs a MT5LayerNorm module with the given hidden size and epsilon value.

forward

Applies layer normalization to the input hidden states and returns the normalized output.
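
A small numeric sketch (assumed import path): the normalization is a pure RMS norm, i.e. x * rsqrt(mean(x**2) + eps) scaled by the learned weight, with no mean subtraction and no bias:

import numpy as np
import mindspore
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5LayerNorm

norm = MT5LayerNorm(hidden_size=4, eps=1e-6)
x = mindspore.Tensor(np.array([[1.0, 2.0, 3.0, 4.0]]), mindspore.float32)

# Reference computation with the default unit weight.
expected = x.asnumpy() / np.sqrt((x.asnumpy() ** 2).mean(-1, keepdims=True) + 1e-6)
print(np.allclose(norm(x).asnumpy(), expected))   # expected: True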

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 51-107
class MT5LayerNorm(nn.Module):

    """
    Represents a layer normalization module in the MT5 style with no bias and no subtraction of mean.

    This class inherits from nn.Module and provides functionality for layer normalization in the MT5 style. 
    The forwardor initializes the layer normalization module with the specified hidden size and epsilon value. 
    The 'forward' method accepts hidden states as input, calculates the variance, and normalizes the hidden states
    using the calculated variance and epsilon value.
    If the weight data type is float16 or bfloat16, the hidden states are converted to the weight data type before
    returning the weighted normalized hidden states.

    Attributes:
        hidden_size (int): The size of the hidden states.
        eps (float): The epsilon value for numerical stability.

    Methods:
        __init__: Constructs a MT5LayerNorm module with the given hidden size and epsilon value.
        forward: Applies layer normalization to the input hidden states and returns the normalized output.
    """
    def __init__(self, hidden_size, eps=1e-6):
        """
        Construct a layernorm module in the MT5 style. No bias and no subtraction of mean.
        """
        super().__init__()
        self.weight = Parameter(ops.ones(hidden_size), 'weight')
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        """
        Method to perform layer normalization on hidden states.

        Args:
            self (MT5LayerNorm): The instance of the MT5LayerNorm class.
            hidden_states (Tensor): The input hidden states to be normalized.

        Returns:
            None: This method does not return any value but updates the hidden states in-place after normalization.

        Raises:
            TypeError: If the input hidden_states are not of type Tensor.
            ValueError: If the variance calculation encounters any issues.
            RuntimeError: If there are runtime issues during the normalization process.
        """
        # MT5 uses a layer_norm which only scales and doesn't shift, which is also known as Root Mean
        # Square Layer Normalization https://arxiv.org/abs/1910.07467 thus varience is calculated
        # w/o mean and there is no bias. Additionally we want to make sure that the accumulation for
        # half-precision inputs is done in fp32

        variance = hidden_states.to(mindspore.float32).pow(2).mean(-1, keep_dims=True)
        hidden_states = hidden_states * ops.rsqrt(variance + self.variance_epsilon)

        # convert into half-precision if necessary
        if self.weight.dtype in [mindspore.float16, mindspore.bfloat16]:
            hidden_states = hidden_states.to(self.weight.dtype)

        return self.weight * hidden_states

mindnlp.transformers.models.mt5.modeling_mt5.MT5LayerNorm.__init__(hidden_size, eps=1e-06)

Construct a layernorm module in the MT5 style. No bias and no subtraction of mean.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 71-77
def __init__(self, hidden_size, eps=1e-6):
    """
    Construct a layernorm module in the MT5 style. No bias and no subtraction of mean.
    """
    super().__init__()
    self.weight = Parameter(ops.ones(hidden_size), 'weight')
    self.variance_epsilon = eps

mindnlp.transformers.models.mt5.modeling_mt5.MT5LayerNorm.forward(hidden_states)

Method to perform layer normalization on hidden states.

PARAMETER DESCRIPTION
self

The instance of the MT5LayerNorm class.

TYPE: MT5LayerNorm

hidden_states

The input hidden states to be normalized.

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

The normalized hidden states scaled by the learned weight (cast to the weight dtype when the weight is float16 or bfloat16).

RAISES DESCRIPTION
TypeError

If the input hidden_states are not of type Tensor.

ValueError

If the variance calculation encounters any issues.

RuntimeError

If there are runtime issues during the normalization process.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 79-107
def forward(self, hidden_states):
    """
    Method to perform layer normalization on hidden states.

    Args:
        self (MT5LayerNorm): The instance of the MT5LayerNorm class.
        hidden_states (Tensor): The input hidden states to be normalized.

    Returns:
        None: This method does not return any value but updates the hidden states in-place after normalization.

    Raises:
        TypeError: If the input hidden_states are not of type Tensor.
        ValueError: If the variance calculation encounters any issues.
        RuntimeError: If there are runtime issues during the normalization process.
    """
    # MT5 uses a layer_norm which only scales and doesn't shift, which is also known as Root Mean
    # Square Layer Normalization https://arxiv.org/abs/1910.07467 thus varience is calculated
    # w/o mean and there is no bias. Additionally we want to make sure that the accumulation for
    # half-precision inputs is done in fp32

    variance = hidden_states.to(mindspore.float32).pow(2).mean(-1, keep_dims=True)
    hidden_states = hidden_states * ops.rsqrt(variance + self.variance_epsilon)

    # convert into half-precision if necessary
    if self.weight.dtype in [mindspore.float16, mindspore.bfloat16]:
        hidden_states = hidden_states.to(self.weight.dtype)

    return self.weight * hidden_states

mindnlp.transformers.models.mt5.modeling_mt5.MT5LayerSelfAttention

Bases: Module

This class represents the self-attention mechanism used in the MT5 (multilingual T5) model. It is designed to be used as a layer within the MT5 model.

This class inherits from the nn.Module class, which is a base class for all neural network modules in MindSpore.

ATTRIBUTE DESCRIPTION
SelfAttention

An instance of the MT5Attention class that performs the self-attention computation.

TYPE: MT5Attention

layer_norm

An instance of the MT5LayerNorm class that applies layer normalization to the hidden states.

TYPE: MT5LayerNorm

dropout

An instance of the nn.Dropout class that applies dropout regularization to the attention output.

TYPE: Dropout

METHOD DESCRIPTION
forward

This method applies the self-attention mechanism to the input hidden states, optionally using additional inputs such as attention mask, position bias, layer head mask, and past key-value states.

Args:

  • hidden_states (Tensor): The input hidden states to be processed by the self-attention mechanism.
  • attention_mask (Tensor, optional): An attention mask specifying which positions should be attended to and which should be ignored. Defaults to None.
  • position_bias (Tensor, optional): A tensor containing position bias values. Defaults to None.
  • layer_head_mask (Tensor, optional): A tensor containing layer and head mask values. Defaults to None.
  • past_key_value (Tuple[Tensor], optional): A tuple containing past key and value tensors. Defaults to None.
  • use_cache (bool, optional): Whether to use caching for the key-value states. Defaults to False.
  • output_attentions (bool, optional): Whether to output the attention values. Defaults to False.

Returns:

  • Tuple[Tensor]: A tuple containing the updated hidden states and additional outputs depending on the configuration.
Note
  • The self-attention mechanism is applied to the input hidden states after they are layer-normalized.
  • The attention output is added to the input hidden states after applying dropout regularization.
  • The method returns a tuple containing the updated hidden states and additional outputs depending on the configuration.
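
A hedged sketch (tiny assumed config and import paths) of a standalone self-attention block; only the first block of a stack normally owns the relative attention bias:

import numpy as np
import mindspore
from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5LayerSelfAttention

config = MT5Config(d_model=64, d_kv=16, num_heads=4, d_ff=128, num_layers=2)
layer = MT5LayerSelfAttention(config, has_relative_attention_bias=True)

x = mindspore.Tensor(np.random.randn(2, 6, 64), mindspore.float32)
outputs = layer(x, output_attentions=True)

print(outputs[0].shape)   # (2, 6, 64): residual-added hidden states
print(len(outputs))       # extra entries: cached key/values, position bias, attention weights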
Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 653-772
class MT5LayerSelfAttention(nn.Module):

    """
    This class represents a self-attention mechanism used in the MT5 (Multilingual Translation) model.
    It is designed to be used as a layer within the MT5 model.

    This class inherits from the nn.Module class, which is a base class for all neural network modules in PyTorch.

    Attributes:
        SelfAttention (MT5Attention): An instance of the MT5Attention class that performs the self-attention computation.
        layer_norm (MT5LayerNorm): An instance of the MT5LayerNorm class that applies layer normalization to the hidden states.
        dropout (nn.Dropout): An instance of the nn.Dropout class that applies dropout regularization to the attention output.

    Methods:
        forward:
            This method applies the self-attention mechanism to the input hidden states, optionally using additional
            inputs such as attention mask, position bias, layer head mask, and past key-value states.

            Args:

            - hidden_states (Tensor): The input hidden states to be processed by the self-attention mechanism.
            - attention_mask (Tensor, optional): An attention mask specifying which positions should be attended
            to and which should be ignored. Defaults to None.
            - position_bias (Tensor, optional): A tensor containing position bias values. Defaults to None.
            - layer_head_mask (Tensor, optional): A tensor containing layer and head mask values. Defaults to None.
            - past_key_value (Tuple[Tensor], optional): A tuple containing past key and value tensors. Defaults to None.
            - use_cache (bool, optional): Whether to use caching for the key-value states. Defaults to False.
            - output_attentions (bool, optional): Whether to output the attention values. Defaults to False.

            Returns:

            - Tuple[Tensor]: A tuple containing the updated hidden states and additional outputs depending on the configuration.

    Note:
        - The self-attention mechanism is applied to the input hidden states after they are layer-normalized.
        - The attention output is added to the input hidden states after applying dropout regularization.
        - The method returns a tuple containing the updated hidden states and additional outputs depending on 
        the configuration.
    """
    def __init__(self, config, has_relative_attention_bias=False):
        """
        Args:
            self (object): The instance of the class.
            config (object): An object containing configuration settings for the model.
            has_relative_attention_bias (bool, optional): A flag indicating whether to apply relative attention bias.
                Defaults to False.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        self.SelfAttention = MT5Attention(config, has_relative_attention_bias=has_relative_attention_bias)
        self.layer_norm = MT5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
        self.dropout = nn.Dropout(p=config.dropout_rate)

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        position_bias=None,
        layer_head_mask=None,
        past_key_value=None,
        use_cache=False,
        output_attentions=False,
    ):
        """
        Constructs the self-attention layer of the MT5 model.

        Args:
            self (MT5LayerSelfAttention): The instance of the MT5LayerSelfAttention class.
            hidden_states (Tensor): The input tensor of shape (batch_size, sequence_length, hidden_size).
                The hidden states to be passed through the self-attention layer.
            attention_mask (Tensor, optional): The attention mask tensor of shape (batch_size, sequence_length).
                A mask that indicates which tokens should be attended to and which should not.
                Defaults to None.
            position_bias (Tensor, optional): The position bias tensor of shape
                (batch_size, sequence_length, sequence_length). A bias that is added to the attention scores
                for each token. Defaults to None.
            layer_head_mask (Tensor, optional): The layer head mask tensor of shape (num_heads,) or (num_layers, num_heads).
                A mask that indicates which heads should be masked out.
                Defaults to None.
            past_key_value (Tuple[Tensor], optional): The tuple of past key and value tensors.
                It contains the cached key and value tensors from previous time steps.
                Defaults to None.
            use_cache (bool, optional): Whether to use the cache for the attention outputs of each layer.
                Defaults to False.
            output_attentions (bool, optional): Whether to return the attention scores.
                Defaults to False.

        Returns:
            Tuple[Tensor]:
                The outputs of the self-attention layer.
                The tuple contains:

                - hidden_states (Tensor): The updated hidden states after passing through the self-attention layer.
                It has the same shape as the input tensor.
                - attention_scores (Tensor, optional): The attention scores if 'output_attentions' is set to True.
                It has the shape (batch_size, num_heads, sequence_length, sequence_length).
                - position_bias (Tensor, optional): The updated position bias tensor if 'use_cache' is set to True.
                It has the same shape as the input position bias tensor.

        Raises:
            None.
        """
        normed_hidden_states = self.layer_norm(hidden_states)
        attention_output = self.SelfAttention(
            normed_hidden_states,
            mask=attention_mask,
            position_bias=position_bias,
            layer_head_mask=layer_head_mask,
            past_key_value=past_key_value,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        hidden_states = hidden_states + self.dropout(attention_output[0])
        outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output them
        return outputs

mindnlp.transformers.models.mt5.modeling_mt5.MT5LayerSelfAttention.__init__(config, has_relative_attention_bias=False)

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: object

config

An object containing configuration settings for the model.

TYPE: object

has_relative_attention_bias

A flag indicating whether to apply relative attention bias. Defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 692-709
def __init__(self, config, has_relative_attention_bias=False):
    """
    Args:
        self (object): The instance of the class.
        config (object): An object containing configuration settings for the model.
        has_relative_attention_bias (bool, optional): A flag indicating whether to apply relative attention bias.
            Defaults to False.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    self.SelfAttention = MT5Attention(config, has_relative_attention_bias=has_relative_attention_bias)
    self.layer_norm = MT5LayerNorm(config.d_model, eps=config.layer_norm_epsilon)
    self.dropout = nn.Dropout(p=config.dropout_rate)

mindnlp.transformers.models.mt5.modeling_mt5.MT5LayerSelfAttention.forward(hidden_states, attention_mask=None, position_bias=None, layer_head_mask=None, past_key_value=None, use_cache=False, output_attentions=False)

Constructs the self-attention layer of the MT5 model.

| Parameter | Type | Description |
| --- | --- | --- |
| `self` | `MT5LayerSelfAttention` | The instance of the MT5LayerSelfAttention class. |
| `hidden_states` | `Tensor` | The input tensor of shape `(batch_size, sequence_length, hidden_size)`; the hidden states to be passed through the self-attention layer. |
| `attention_mask` | `Tensor`, optional | The attention mask tensor of shape `(batch_size, sequence_length)`, indicating which tokens should be attended to and which should not. Defaults to `None`. |
| `position_bias` | `Tensor`, optional | The position bias tensor of shape `(batch_size, sequence_length, sequence_length)`, added to the attention scores for each token. Defaults to `None`. |
| `layer_head_mask` | `Tensor`, optional | The layer head mask tensor of shape `(num_heads,)` or `(num_layers, num_heads)`, indicating which heads should be masked out. Defaults to `None`. |
| `past_key_value` | `Tuple[Tensor]`, optional | The cached key and value tensors from previous time steps. Defaults to `None`. |
| `use_cache` | `bool` | Whether to use the cache for the attention outputs of each layer. Defaults to `False`. |
| `output_attentions` | `bool` | Whether to return the attention scores. Defaults to `False`. |

Returns: `Tuple[Tensor]`, the outputs of the self-attention layer:

- hidden_states (Tensor): the updated hidden states after passing through the self-attention layer; same shape as the input tensor.
- attention_scores (Tensor, optional): the attention scores of shape `(batch_size, num_heads, sequence_length, sequence_length)`, returned when `output_attentions` is set to True.
- position_bias (Tensor, optional): the updated position bias tensor, with the same shape as the input position bias tensor, returned when `use_cache` is set to True.
Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 711-772.
def forward(
    self,
    hidden_states,
    attention_mask=None,
    position_bias=None,
    layer_head_mask=None,
    past_key_value=None,
    use_cache=False,
    output_attentions=False,
):
    """
    Constructs the self-attention layer of the MT5 model.

    Args:
        self (MT5LayerSelfAttention): The instance of the MT5LayerSelfAttention class.
        hidden_states (Tensor): The input tensor of shape (batch_size, sequence_length, hidden_size).
            The hidden states to be passed through the self-attention layer.
        attention_mask (Tensor, optional): The attention mask tensor of shape (batch_size, sequence_length).
            A mask that indicates which tokens should be attended to and which should not.
            Defaults to None.
        position_bias (Tensor, optional): The position bias tensor of shape
            (batch_size, sequence_length, sequence_length). A bias that is added to the attention scores
            for each token. Defaults to None.
        layer_head_mask (Tensor, optional): The layer head mask tensor of shape (num_heads,) or (num_layers, num_heads).
            A mask that indicates which heads should be masked out.
            Defaults to None.
        past_key_value (Tuple[Tensor], optional): The tuple of past key and value tensors.
            It contains the cached key and value tensors from previous time steps.
            Defaults to None.
        use_cache (bool, optional): Whether to use the cache for the attention outputs of each layer.
            Defaults to False.
        output_attentions (bool, optional): Whether to return the attention scores.
            Defaults to False.

    Returns:
        Tuple[Tensor]:
            The outputs of the self-attention layer.
            The tuple contains:

            - hidden_states (Tensor): The updated hidden states after passing through the self-attention layer.
            It has the same shape as the input tensor.
            - attention_scores (Tensor, optional): The attention scores if 'output_attentions' is set to True.
            It has the shape (batch_size, num_heads, sequence_length, sequence_length).
            - position_bias (Tensor, optional): The updated position bias tensor if 'use_cache' is set to True.
            It has the same shape as the input position bias tensor.

    Raises:
        None.
    """
    normed_hidden_states = self.layer_norm(hidden_states)
    attention_output = self.SelfAttention(
        normed_hidden_states,
        mask=attention_mask,
        position_bias=position_bias,
        layer_head_mask=layer_head_mask,
        past_key_value=past_key_value,
        use_cache=use_cache,
        output_attentions=output_attentions,
    )
    hidden_states = hidden_states + self.dropout(attention_output[0])
    outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output them
    return outputs
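
The listing above follows the pre-norm residual pattern: the hidden states are layer-normalized before `SelfAttention`, and the raw (un-normalized) input is added back after dropout. The following is a minimal sketch of driving the layer in isolation; it assumes `MT5Config` is importable from `mindnlp.transformers` with its default `d_model=512`, and it is illustrative rather than documented library usage.

```python
# Illustrative sketch (not part of the documented API): run MT5LayerSelfAttention on dummy inputs.
# Assumption: MT5Config is exposed under mindnlp.transformers and its defaults give d_model=512.
import mindspore
from mindspore import ops
from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5LayerSelfAttention

config = MT5Config()  # d_model defaults to 512
layer = MT5LayerSelfAttention(config, has_relative_attention_bias=True)

# dummy hidden states of shape (batch_size, sequence_length, d_model)
hidden_states = ops.ones((2, 5, config.d_model), mindspore.float32)
outputs = layer(hidden_states, output_attentions=True)

new_hidden_states = outputs[0]  # same shape as the input: (2, 5, 512)
# the remaining tuple entries carry the cached key/value state, the position bias,
# and (because output_attentions=True) the attention weights
```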

mindnlp.transformers.models.mt5.modeling_mt5.MT5Model

Bases: MT5PreTrainedModel

Example
>>> from transformers import MT5Model, AutoTokenizer
...
>>> model = MT5Model.from_pretrained("google/mt5-small")
>>> tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
>>> article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."
>>> summary = "Weiter Verhandlung in Syrien."
>>> inputs = tokenizer(article, return_tensors="pt")
>>> labels = tokenizer(text_target=summary, return_tensors="pt")
...
>>> outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=labels["input_ids"])
>>> hidden_states = outputs.last_hidden_state
Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 1593-1829.
class MT5Model(MT5PreTrainedModel):
    r"""

    Example:
        ```python
        >>> from transformers import MT5Model, AutoTokenizer
        ...
        >>> model = MT5Model.from_pretrained("google/mt5-small")
        >>> tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
        >>> article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."
        >>> summary = "Weiter Verhandlung in Syrien."
        >>> inputs = tokenizer(article, return_tensors="pt")
        >>> labels = tokenizer(text_target=summary, return_tensors="pt")
        ...
        >>> outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=labels["input_ids"])
        >>> hidden_states = outputs.last_hidden_state
        ```
    """
    model_type = "mt5"
    config_class = MT5Config
    _keys_to_ignore_on_load_missing = ["decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight"]
    _keys_to_ignore_on_load_unexpected = ["decoder.block.0.layer.1.EncDecAttention.relative_attention_bias.weight"]
    _tied_weights_keys = ["encoder.embed_tokens.weight", "decoder.embed_tokens.weight"]

    # Copied from transformers.models.t5.modeling_t5.T5Model.__init__ with T5->MT5
    def __init__(self, config: MT5Config):
        """
        Initializes an instance of the MT5Model class.

        Args:
            self: The instance of the MT5Model class.
            config (MT5Config): An object of type MT5Config that holds the configuration parameters for the MT5 model.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)
        self.shared = nn.Embedding(config.vocab_size, config.d_model)

        encoder_config = copy.deepcopy(config)
        encoder_config.is_decoder = False
        encoder_config.use_cache = False
        encoder_config.is_encoder_decoder = False
        self.encoder = MT5Stack(encoder_config, self.shared)

        decoder_config = copy.deepcopy(config)
        decoder_config.is_decoder = True
        decoder_config.is_encoder_decoder = False
        decoder_config.num_layers = config.num_decoder_layers
        self.decoder = MT5Stack(decoder_config, self.shared)

        # Initialize weights and apply final processing
        self.post_init()

    # Copied from transformers.models.t5.modeling_t5.T5Model.get_input_embeddings
    def get_input_embeddings(self):
        """
        Retrieves the input embeddings for the MT5Model.

        Args:
            self (MT5Model): The instance of the MT5Model class.

        Returns:
            nn.Embedding: The shared token embedding used by both the encoder and the decoder.

        Raises:
            None.
        """
        return self.shared

    # Copied from transformers.models.t5.modeling_t5.T5Model.set_input_embeddings
    def set_input_embeddings(self, new_embeddings):
        """Set the input embeddings for the MT5Model.

        This method sets the shared input embeddings for both the encoder and decoder modules in the MT5Model.

        Args:
            self (MT5Model): An instance of the MT5Model class.
            new_embeddings (nn.Embedding): The new input embedding module to be set.

        Returns:
            None.

        Raises:
            None.
        """
        self.shared = new_embeddings
        self.encoder.set_input_embeddings(new_embeddings)
        self.decoder.set_input_embeddings(new_embeddings)

    # Copied from transformers.models.t5.modeling_t5.T5Model.get_encoder
    def get_encoder(self):
        """
        Returns the encoder of the MT5Model.

        Args:
            self: An instance of the MT5Model class.

        Returns:
            The encoder of the MT5Model.

        Raises:
            None.
        """
        return self.encoder

    # Copied from transformers.models.t5.modeling_t5.T5Model.get_decoder
    def get_decoder(self):
        """
        This method returns the decoder associated with the MT5Model instance.

        Args:
            self: The MT5Model instance itself.

        Returns:
            The decoder associated with the MT5Model instance.

        Raises:
            None.
        """
        return self.decoder

    # Copied from transformers.models.t5.modeling_t5.T5Model._prune_heads
    def _prune_heads(self, heads_to_prune):
        """
        Prunes heads of the model. heads_to_prune: a dict of {layer_num: list of heads to prune in this layer}.
        See the base class PreTrainedModel.
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    # Copied from transformers.models.t5.modeling_t5.T5Model.forward with T5->MT5, t5->mt5
    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        decoder_input_ids: Optional[mindspore.Tensor] = None,
        decoder_attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        decoder_head_mask: Optional[mindspore.Tensor] = None,
        cross_attn_head_mask: Optional[mindspore.Tensor] = None,
        encoder_outputs: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
        past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        decoder_inputs_embeds: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[mindspore.Tensor], Seq2SeqModelOutput]:
        r"""

        Returns:
            Union[Tuple[mindspore.Tensor], Seq2SeqModelOutput]

        Example:
            ```python
            >>> from transformers import AutoTokenizer, MT5Model
            ...
            >>> tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
            >>> model = MT5Model.from_pretrained("google/mt5-small")
            ...
            >>> input_ids = tokenizer(
            ...     "Studies have been shown that owning a dog is good for you", return_tensors="pt"
            ... ).input_ids  # Batch size 1
            >>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1
            ...
            >>> # preprocess: Prepend decoder_input_ids with start token which is pad token for MT5Model.
            >>> # This is not needed for torch's MT5ForConditionalGeneration as it does this internally using labels arg.
            >>> decoder_input_ids = model._shift_right(decoder_input_ids)
            ...
            >>> # forward pass
            >>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
            >>> last_hidden_states = outputs.last_hidden_state
            ```
        """
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask
        if head_mask is not None and decoder_head_mask is None:
            if self.config.num_layers == self.config.num_decoder_layers:
                warnings.warn(__HEAD_MASK_WARNING_MSG, FutureWarning)
                decoder_head_mask = head_mask

        # Encode if needed (training, first prediction pass)
        if encoder_outputs is None:
            encoder_outputs = self.encoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
                inputs_embeds=inputs_embeds,
                head_mask=head_mask,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
        elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
            encoder_outputs = BaseModelOutput(
                last_hidden_state=encoder_outputs[0],
                hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
                attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
            )

        hidden_states = encoder_outputs[0]

        # Decode
        decoder_outputs = self.decoder(
            input_ids=decoder_input_ids,
            attention_mask=decoder_attention_mask,
            inputs_embeds=decoder_inputs_embeds,
            past_key_values=past_key_values,
            encoder_hidden_states=hidden_states,
            encoder_attention_mask=attention_mask,
            head_mask=decoder_head_mask,
            cross_attn_head_mask=cross_attn_head_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        if not return_dict:
            return decoder_outputs + encoder_outputs

        return Seq2SeqModelOutput(
            last_hidden_state=decoder_outputs.last_hidden_state,
            past_key_values=decoder_outputs.past_key_values,
            decoder_hidden_states=decoder_outputs.hidden_states,
            decoder_attentions=decoder_outputs.attentions,
            cross_attentions=decoder_outputs.cross_attentions,
            encoder_last_hidden_state=encoder_outputs.last_hidden_state,
            encoder_hidden_states=encoder_outputs.hidden_states,
            encoder_attentions=encoder_outputs.attentions,
        )

mindnlp.transformers.models.mt5.modeling_mt5.MT5Model.__init__(config)

Initializes an instance of the MT5Model class.

| Parameter | Type | Description |
| --- | --- | --- |
| `self` | | The instance of the MT5Model class. |
| `config` | `MT5Config` | An object of type MT5Config that holds the configuration parameters for the MT5 model. |

Returns: `None`.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 1618-1648.
def __init__(self, config: MT5Config):
    """
    Initializes an instance of the MT5Model class.

    Args:
        self: The instance of the MT5Model class.
        config (MT5Config): An object of type MT5Config that holds the configuration parameters for the MT5 model.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)
    self.shared = nn.Embedding(config.vocab_size, config.d_model)

    encoder_config = copy.deepcopy(config)
    encoder_config.is_decoder = False
    encoder_config.use_cache = False
    encoder_config.is_encoder_decoder = False
    self.encoder = MT5Stack(encoder_config, self.shared)

    decoder_config = copy.deepcopy(config)
    decoder_config.is_decoder = True
    decoder_config.is_encoder_decoder = False
    decoder_config.num_layers = config.num_decoder_layers
    self.decoder = MT5Stack(decoder_config, self.shared)

    # Initialize weights and apply final processing
    self.post_init()
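
As the listing shows, `__init__` builds a single embedding table and hands the same object to both `MT5Stack` instances, so encoder and decoder embeddings are tied by construction. A small check of that invariant, under the assumption that `MT5Stack` stores the passed embedding as `embed_tokens` (mirroring the upstream T5Stack), could look like this; the config values are arbitrary small examples.

```python
# Sketch under the assumption that MT5Stack keeps the shared embedding as `embed_tokens`.
from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5Model

config = MT5Config(vocab_size=1000, d_model=64, num_layers=2, num_heads=4, d_ff=128)
model = MT5Model(config)

# encoder, decoder and `shared` all point at the very same embedding module
assert model.encoder.embed_tokens is model.shared
assert model.decoder.embed_tokens is model.shared
```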

mindnlp.transformers.models.mt5.modeling_mt5.MT5Model.forward(input_ids=None, attention_mask=None, decoder_input_ids=None, decoder_attention_mask=None, head_mask=None, decoder_head_mask=None, cross_attn_head_mask=None, encoder_outputs=None, past_key_values=None, inputs_embeds=None, decoder_inputs_embeds=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

Returns: `Union[Tuple[mindspore.Tensor], Seq2SeqModelOutput]`

Example
>>> from transformers import AutoTokenizer, MT5Model
...
>>> tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
>>> model = MT5Model.from_pretrained("google/mt5-small")
...
>>> input_ids = tokenizer(
...     "Studies have been shown that owning a dog is good for you", return_tensors="pt"
... ).input_ids  # Batch size 1
>>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1
...
>>> # preprocess: Prepend decoder_input_ids with start token which is pad token for MT5Model.
>>> # This is not needed for torch's MT5ForConditionalGeneration as it does this internally using labels arg.
>>> decoder_input_ids = model._shift_right(decoder_input_ids)
...
>>> # forward pass
>>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
>>> last_hidden_states = outputs.last_hidden_state
Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 1728-1829.
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    decoder_input_ids: Optional[mindspore.Tensor] = None,
    decoder_attention_mask: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    decoder_head_mask: Optional[mindspore.Tensor] = None,
    cross_attn_head_mask: Optional[mindspore.Tensor] = None,
    encoder_outputs: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
    past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    decoder_inputs_embeds: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple[mindspore.Tensor], Seq2SeqModelOutput]:
    r"""

    Returns:
        Union[Tuple[mindspore.Tensor], Seq2SeqModelOutput]

    Example:
        ```python
        >>> from transformers import AutoTokenizer, MT5Model
        ...
        >>> tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
        >>> model = MT5Model.from_pretrained("google/mt5-small")
        ...
        >>> input_ids = tokenizer(
        ...     "Studies have been shown that owning a dog is good for you", return_tensors="pt"
        ... ).input_ids  # Batch size 1
        >>> decoder_input_ids = tokenizer("Studies show that", return_tensors="pt").input_ids  # Batch size 1
        ...
        >>> # preprocess: Prepend decoder_input_ids with start token which is pad token for MT5Model.
        >>> # This is not needed for torch's MT5ForConditionalGeneration as it does this internally using labels arg.
        >>> decoder_input_ids = model._shift_right(decoder_input_ids)
        ...
        >>> # forward pass
        >>> outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
        >>> last_hidden_states = outputs.last_hidden_state
        ```
    """
    use_cache = use_cache if use_cache is not None else self.config.use_cache
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # FutureWarning: head_mask was separated into two input args - head_mask, decoder_head_mask
    if head_mask is not None and decoder_head_mask is None:
        if self.config.num_layers == self.config.num_decoder_layers:
            warnings.warn(__HEAD_MASK_WARNING_MSG, FutureWarning)
            decoder_head_mask = head_mask

    # Encode if needed (training, first prediction pass)
    if encoder_outputs is None:
        encoder_outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
    elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
        encoder_outputs = BaseModelOutput(
            last_hidden_state=encoder_outputs[0],
            hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
            attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
        )

    hidden_states = encoder_outputs[0]

    # Decode
    decoder_outputs = self.decoder(
        input_ids=decoder_input_ids,
        attention_mask=decoder_attention_mask,
        inputs_embeds=decoder_inputs_embeds,
        past_key_values=past_key_values,
        encoder_hidden_states=hidden_states,
        encoder_attention_mask=attention_mask,
        head_mask=decoder_head_mask,
        cross_attn_head_mask=cross_attn_head_mask,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    if not return_dict:
        return decoder_outputs + encoder_outputs

    return Seq2SeqModelOutput(
        last_hidden_state=decoder_outputs.last_hidden_state,
        past_key_values=decoder_outputs.past_key_values,
        decoder_hidden_states=decoder_outputs.hidden_states,
        decoder_attentions=decoder_outputs.attentions,
        cross_attentions=decoder_outputs.cross_attentions,
        encoder_last_hidden_state=encoder_outputs.last_hidden_state,
        encoder_hidden_states=encoder_outputs.hidden_states,
        encoder_attentions=encoder_outputs.attentions,
    )
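
Because `forward` skips the encoder whenever `encoder_outputs` is provided, the encoder can be run once and its output reused across decoder calls. The sketch below outlines that pattern; it assumes the `google/mt5-small` checkpoint is available and that mindnlp's tokenizers accept `return_tensors="ms"`, so treat it as an outline rather than verbatim documented usage.

```python
# Outline (assumptions: google/mt5-small weights are reachable and return_tensors="ms"
# yields MindSpore tensors): run the encoder once, then feed its output back in.
from mindnlp.transformers import AutoTokenizer, MT5Model

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5Model.from_pretrained("google/mt5-small")

enc = tokenizer("Studies have shown that owning a dog is good for you", return_tensors="ms")
encoder_outputs = model.get_encoder()(
    input_ids=enc["input_ids"], attention_mask=enc["attention_mask"]
)

decoder_input_ids = model._shift_right(
    tokenizer("Studies show that", return_tensors="ms")["input_ids"]
)

outputs = model(
    encoder_outputs=encoder_outputs,   # encoder is not run again
    attention_mask=enc["attention_mask"],
    decoder_input_ids=decoder_input_ids,
)
print(outputs.last_hidden_state.shape)  # (batch, decoder_seq_len, d_model)
```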

mindnlp.transformers.models.mt5.modeling_mt5.MT5Model.get_decoder()

This method returns the decoder associated with the MT5Model instance.

| Parameter | Description |
| --- | --- |
| `self` | The MT5Model instance itself. |

Returns: The decoder associated with the MT5Model instance.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 1703-1716.
def get_decoder(self):
    """
    This method returns the decoder associated with the MT5Model instance.

    Args:
        self: The MT5Model instance itself.

    Returns:
        The decoder associated with the MT5Model instance.

    Raises:
        None.
    """
    return self.decoder

mindnlp.transformers.models.mt5.modeling_mt5.MT5Model.get_encoder()

Returns the encoder of the MT5Model.

| Parameter | Description |
| --- | --- |
| `self` | An instance of the MT5Model class. |

Returns: The encoder of the MT5Model.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 1687-1700.
def get_encoder(self):
    """
    Returns the encoder of the MT5Model.

    Args:
        self: An instance of the MT5Model class.

    Returns:
        The encoder of the MT5Model.

    Raises:
        None.
    """
    return self.encoder

mindnlp.transformers.models.mt5.modeling_mt5.MT5Model.get_input_embeddings()

Retrieves the input embeddings for the MT5Model.

| Parameter | Type | Description |
| --- | --- | --- |
| `self` | `MT5Model` | The instance of the MT5Model class. |

Returns: The shared `nn.Embedding` used as the input embedding by both the encoder and the decoder.

Source code in mindnlp/transformers/models/mt5/modeling_mt5.py, lines 1651-1664.
def get_input_embeddings(self):
    """
    Retrieves the input embeddings for the MT5Model.

    Args:
        self (MT5Model): The instance of the MT5Model class.

    Returns:
        nn.Embedding: The shared token embedding used by both the encoder and the decoder.

    Raises:
        None.
    """
    return self.shared
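
Since `get_input_embeddings` returns the shared table and `set_input_embeddings` (shown in the class listing above) pushes a new table into both stacks, replacing the embeddings keeps the encoder and decoder tied. A hedged sketch follows; the `mindnlp.core.nn` import path and the `embed_tokens` attribute are assumptions based on the listings above, not documented guarantees.

```python
# Sketch: swap in a larger embedding table; encoder and decoder stay tied because
# set_input_embeddings pushes the same module into both stacks.
# Assumptions: `nn` below is the namespace used by the listings above (providing nn.Embedding)
# and MT5Stack stores the embedding as `embed_tokens`.
from mindnlp.core import nn
from mindnlp.transformers import MT5Config
from mindnlp.transformers.models.mt5.modeling_mt5 import MT5Model

config = MT5Config(vocab_size=1000, d_model=64, num_layers=2, num_heads=4, d_ff=128)
model = MT5Model(config)

new_embeddings = nn.Embedding(config.vocab_size + 8, config.d_model)
model.set_input_embeddings(new_embeddings)

# both stacks now see the replacement
assert model.get_input_embeddings() is new_embeddings
assert model.encoder.embed_tokens is model.decoder.embed_tokens is new_embeddings
```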