seamless_m4t_v2

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2

MindSpore SeamlessM4Tv2 model.

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.HifiGanResidualBlock

Bases: Module

This class represents a HiFiGAN residual block, which is used for generating high-fidelity audio waveforms. It inherits from the nn.Module class.

ATTRIBUTE DESCRIPTION
channels

The number of input and output channels for the convolutional layers.

TYPE: int

kernel_size

The size of the convolutional kernel.

TYPE: int

dilation

A tuple of dilation factors for the convolutional layers.

TYPE: tuple

leaky_relu_slope

The slope for the leaky ReLU activation function.

TYPE: float

METHOD DESCRIPTION
__init__

Initializes a HiFiGAN residual block object.

get_padding

Calculates the padding size for the convolutional layers based on the kernel size and dilation factor.

apply_weight_norm

Applies weight normalization to the convolutional layers in the residual block.

remove_weight_norm

Removes weight normalization from the convolutional layers in the residual block.

forward

Constructs the residual block by sequentially applying leaky ReLU activation, convolutional layers, and addition with the residual. Returns the final hidden states after passing through the residual block.
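
Below is a minimal usage sketch; the channel count, sequence length, and shapes are illustrative assumptions rather than values prescribed by the model. Because every convolution uses "same" padding and stride 1, the block preserves the (batch_size, channels, sequence_length) shape of its input.

Example
>>> from mindspore import ops
>>> from mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2 import HifiGanResidualBlock
>>> block = HifiGanResidualBlock(channels=512, kernel_size=3, dilation=(1, 3, 5))  # illustrative sizes
>>> hidden_states = ops.randn(1, 512, 100)  # (batch_size, channels, sequence_length)
>>> block(hidden_states).shape
(1, 512, 100)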

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
class HifiGanResidualBlock(nn.Module):

    """
    This class represents a HiFiGAN residual block, which is used for generating high-fidelity audio waveforms.
    It inherits from the nn.Module class.

    Attributes:
        channels (int): The number of input and output channels for the convolutional layers.
        kernel_size (int): The size of the convolutional kernel.
        dilation (tuple): A tuple of dilation factors for the convolutional layers.
        leaky_relu_slope (float): The slope for the leaky ReLU activation function.

    Methods:
        __init__:
            Initializes a HiFiGAN residual block object.

        get_padding:
            Calculates the padding size for the convolutional layers based on the kernel size and dilation factor.

        apply_weight_norm:
            Applies weight normalization to the convolutional layers in the residual block.

        remove_weight_norm:
            Removes weight normalization from the convolutional layers in the residual block.

        forward:
            Constructs the residual block by sequentially applying leaky ReLU activation, convolutional layers,
            and addition with the residual. Returns the final hidden states after passing through the residual block.
    """
    def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5), leaky_relu_slope=0.1):
        """
        Initializes a HifiGanResidualBlock object.

        Args:
            self (HifiGanResidualBlock): An instance of the HifiGanResidualBlock class.
            channels (int): The number of input and output channels for the convolutional layers.
            kernel_size (int, optional): The size of the kernel for the convolutional layers. Defaults to 3.
            dilation (tuple, optional): A tuple of dilation factors for the convolutional layers. Defaults to (1, 3, 5).
            leaky_relu_slope (float, optional): The slope of the negative part of the leaky ReLU activation function.
                Defaults to 0.1.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        self.leaky_relu_slope = leaky_relu_slope

        self.convs1 = nn.ModuleList(
            [
                nn.Conv1d(
                    channels,
                    channels,
                    kernel_size,
                    stride=1,
                    dilation=dilation[i],
                    pad_mode='pad',
                    padding=self.get_padding(kernel_size, dilation[i]),
                )
                for i in range(len(dilation))
            ]
        )
        self.convs2 = nn.ModuleList(
            [
                nn.Conv1d(
                    channels,
                    channels,
                    kernel_size,
                    stride=1,
                    dilation=1,
                    pad_mode='pad',
                    padding=self.get_padding(kernel_size, 1),
                )
                for _ in range(len(dilation))
            ]
        )

    def get_padding(self, kernel_size, dilation=1):
        """
        Returns the amount of padding required for the convolution operation in the HiFi-GAN residual block.

        Args:
            self: Instance of the HifiGanResidualBlock class.
            kernel_size (int): The size of the kernel used in the convolution operation.
            dilation (int, optional): The dilation rate of the convolution operation. Defaults to 1.

        Returns:
            int: The amount of padding required for the convolution operation.

        Raises:
            TypeError:
                If kernel_size or dilation is not an integer, or if the value of dilation is less than or equal to zero.
        """
        return (kernel_size * dilation - dilation) // 2

    def apply_weight_norm(self):
        """
        Applies weight normalization to the convolutional layers in the HifiGanResidualBlock.

        Args:
            self: An instance of the HifiGanResidualBlock class.

        Returns:
            None.

        Raises:
            None.

        Description:
            This method applies weight normalization to the convolutional layers in the HifiGanResidualBlock.
            Weight normalization is a technique that normalizes the weights of a neural network layer to stabilize
            training and improve convergence. The method iterates over the convs1 and convs2 lists, which contain
            the convolutional layers, and applies weight normalization using the nn.utils.weight_norm function.

        Note:
            - The convs1 and convs2 lists must be populated with valid convolutional layers before calling this method.
            - Weight normalization modifies the weights of the layers in-place.

        Example:
            ```python
            >>> block = HifiGanResidualBlock(channels=512)
            >>> block.apply_weight_norm()
            ```
        """
        for layer in self.convs1:
            nn.utils.weight_norm(layer)
        for layer in self.convs2:
            nn.utils.weight_norm(layer)

    def remove_weight_norm(self):
        """
        Removes weight normalization from the convolutional layers in a HifiGanResidualBlock.

        Args:
            self (HifiGanResidualBlock): The instance of the HifiGanResidualBlock class.
                It represents the block containing convolutional layers with weight normalization to remove.

        Returns:
            None: This method does not return any value. It modifies the convolutional layers in place by removing
                weight normalization.

        Raises:
            None.
        """
        for layer in self.convs1:
            nn.utils.remove_weight_norm(layer)
        for layer in self.convs2:
            nn.utils.remove_weight_norm(layer)

    def forward(self, hidden_states):
        """
        Constructs a single residual block in the HifiGanResidualBlock class.

        Args:
            self (HifiGanResidualBlock): The instance of the HifiGanResidualBlock class.
            hidden_states (mindspore.Tensor): The input hidden states of shape (batch_size, channels, sequence_length).

        Returns:
            mindspore.Tensor: The output hidden states of shape (batch_size, channels, sequence_length).

        Raises:
            None.
        """
        for conv1, conv2 in zip(self.convs1, self.convs2):
            residual = hidden_states
            hidden_states = ops.leaky_relu(hidden_states, self.leaky_relu_slope)
            hidden_states = conv1(hidden_states)
            hidden_states = ops.leaky_relu(hidden_states, self.leaky_relu_slope)
            hidden_states = conv2(hidden_states)
            hidden_states = hidden_states + residual
        return hidden_states

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.HifiGanResidualBlock.__init__(channels, kernel_size=3, dilation=(1, 3, 5), leaky_relu_slope=0.1)

Initializes a HifiGanResidualBlock object.

PARAMETER DESCRIPTION
self

An instance of the HifiGanResidualBlock class.

TYPE: HifiGanResidualBlock

channels

The number of input and output channels for the convolutional layers.

TYPE: int

kernel_size

The size of the kernel for the convolutional layers. Defaults to 3.

TYPE: int DEFAULT: 3

dilation

A tuple of dilation factors for the convolutional layers. Defaults to (1, 3, 5).

TYPE: tuple DEFAULT: (1, 3, 5)

leaky_relu_slope

The slope of the negative part of the leaky ReLU activation function. Defaults to 0.1.

TYPE: float DEFAULT: 0.1

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5), leaky_relu_slope=0.1):
    """
    Initializes a HifiGanResidualBlock object.

    Args:
        self (HifiGanResidualBlock): An instance of the HifiGanResidualBlock class.
        channels (int): The number of input and output channels for the convolutional layers.
        kernel_size (int, optional): The size of the kernel for the convolutional layers. Defaults to 3.
        dilation (tuple, optional): A tuple of dilation factors for the convolutional layers. Defaults to (1, 3, 5).
        leaky_relu_slope (float, optional): The slope of the negative part of the leaky ReLU activation function.
            Defaults to 0.1.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    self.leaky_relu_slope = leaky_relu_slope

    self.convs1 = nn.ModuleList(
        [
            nn.Conv1d(
                channels,
                channels,
                kernel_size,
                stride=1,
                dilation=dilation[i],
                pad_mode='pad',
                padding=self.get_padding(kernel_size, dilation[i]),
            )
            for i in range(len(dilation))
        ]
    )
    self.convs2 = nn.ModuleList(
        [
            nn.Conv1d(
                channels,
                channels,
                kernel_size,
                stride=1,
                dilation=1,
                pad_mode='pad',
                padding=self.get_padding(kernel_size, 1),
            )
            for _ in range(len(dilation))
        ]
    )

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.HifiGanResidualBlock.apply_weight_norm()

Applies weight normalization to the convolutional layers in the HifiGanResidualBlock.

PARAMETER DESCRIPTION
self

An instance of the HifiGanResidualBlock class.

RETURNS DESCRIPTION

None.

Description

This method applies weight normalization to the convolutional layers in the HifiGanResidualBlock. Weight normalization is a technique that normalizes the weights of a neural network layer to stabilize training and improve convergence. The method iterates over the convs1 and convs2 lists, which contain the convolutional layers, and applies weight normalization using the nn.utils.weight_norm function.

Note
  • The convs1 and convs2 lists must be populated with valid convolutional layers before calling this method.
  • Weight normalization modifies the weights of the layers in-place.
Example
>>> block = HifiGanResidualBlock(channels=512)
>>> block.apply_weight_norm()
Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def apply_weight_norm(self):
    """
    Applies weight normalization to the convolutional layers in the HifiGanResidualBlock.

    Args:
        self: An instance of the HifiGanResidualBlock class.

    Returns:
        None.

    Raises:
        None.

    Description:
        This method applies weight normalization to the convolutional layers in the HifiGanResidualBlock.
        Weight normalization is a technique that normalizes the weights of a neural network layer to stabilize
        training and improve convergence. The method iterates over the convs1 and convs2 lists, which contain
        the convolutional layers, and applies weight normalization using the nn.utils.weight_norm function.

    Note:
        - The convs1 and convs2 lists must be populated with valid convolutional layers before calling this method.
        - Weight normalization modifies the weights of the layers in-place.

    Example:
        ```python
        >>> block = HifiGanResidualBlock(channels=512)
        >>> block.apply_weight_norm()
        ```
    """
    for layer in self.convs1:
        nn.utils.weight_norm(layer)
    for layer in self.convs2:
        nn.utils.weight_norm(layer)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.HifiGanResidualBlock.forward(hidden_states)

Constructs a single residual block in the HifiGanResidualBlock class.

PARAMETER DESCRIPTION
self

The instance of the HifiGanResidualBlock class.

TYPE: HifiGanResidualBlock

hidden_states

The input hidden states of shape (batch_size, channels, sequence_length).

TYPE: Tensor

RETURNS DESCRIPTION

mindspore.Tensor: The output hidden states of shape (batch_size, channels, sequence_length).

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def forward(self, hidden_states):
    """
    Constructs a single residual block in the HifiGanResidualBlock class.

    Args:
        self (HifiGanResidualBlock): The instance of the HifiGanResidualBlock class.
        hidden_states (mindspore.Tensor): The input hidden states of shape (batch_size, channels, sequence_length).

    Returns:
        mindspore.Tensor: The output hidden states of shape (batch_size, channels, sequence_length).

    Raises:
        None.
    """
    for conv1, conv2 in zip(self.convs1, self.convs2):
        residual = hidden_states
        hidden_states = ops.leaky_relu(hidden_states, self.leaky_relu_slope)
        hidden_states = conv1(hidden_states)
        hidden_states = ops.leaky_relu(hidden_states, self.leaky_relu_slope)
        hidden_states = conv2(hidden_states)
        hidden_states = hidden_states + residual
    return hidden_states

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.HifiGanResidualBlock.get_padding(kernel_size, dilation=1)

Returns the amount of padding required for the convolution operation in the HiFi-GAN residual block.

PARAMETER DESCRIPTION
self

Instance of the HifiGanResidualBlock class.

kernel_size

The size of the kernel used in the convolution operation.

TYPE: int

dilation

The dilation rate of the convolution operation. Defaults to 1.

TYPE: int DEFAULT: 1

RETURNS DESCRIPTION
int

The amount of padding required for the convolution operation.

RAISES DESCRIPTION
TypeError

If kernel_size or dilation is not an integer, or if the value of dilation is less than or equal to zero.
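
A short worked example of the formula `(kernel_size * dilation - dilation) // 2`, which gives the "same" padding that keeps the sequence length unchanged for a stride-1 dilated convolution (the channel count below is an illustrative assumption):

Example
>>> block = HifiGanResidualBlock(channels=256)
>>> block.get_padding(kernel_size=3, dilation=1)  # (3*1 - 1) // 2
1
>>> block.get_padding(kernel_size=3, dilation=5)  # (3*5 - 5) // 2
5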

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def get_padding(self, kernel_size, dilation=1):
    """
    Returns the amount of padding required for the convolution operation in the HiFi-GAN residual block.

    Args:
        self: Instance of the HifiGanResidualBlock class.
        kernel_size (int): The size of the kernel used in the convolution operation.
        dilation (int, optional): The dilation rate of the convolution operation. Defaults to 1.

    Returns:
        int: The amount of padding required for the convolution operation.

    Raises:
        TypeError:
            If kernel_size or dilation is not an integer, or if the value of dilation is less than or equal to zero.
    """
    return (kernel_size * dilation - dilation) // 2

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.HifiGanResidualBlock.remove_weight_norm()

Removes weight normalization from the convolutional layers in a HifiGanResidualBlock.

PARAMETER DESCRIPTION
self

The instance of the HifiGanResidualBlock class. It represents the block containing convolutional layers with weight normalization to remove.

TYPE: HifiGanResidualBlock

RETURNS DESCRIPTION
None

This method does not return any value. It modifies the convolutional layers in place by removing weight normalization.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def remove_weight_norm(self):
    """
    Removes weight normalization from the convolutional layers in a HifiGanResidualBlock.

    Args:
        self (HifiGanResidualBlock): The instance of the HifiGanResidualBlock class.
            It represents the block containing convolutional layers with weight normalization to remove.

    Returns:
        None: This method does not return any value. It modifies the convolutional layers in place by removing
            weight normalization.

    Raises:
        None.
    """
    for layer in self.convs1:
        nn.utils.remove_weight_norm(layer)
    for layer in self.convs2:
        nn.utils.remove_weight_norm(layer)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2Attention

Bases: Module

Multi-headed attention from 'Attention Is All You Need' paper
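
A minimal sketch of how the module is used; the embedding size, head count, and shapes are illustrative assumptions. Inputs follow the Batch x Time x Channel convention, and the output projection returns a tensor of the same shape.

Example
>>> from mindspore import ops
>>> from mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2 import SeamlessM4Tv2Attention
>>> attn = SeamlessM4Tv2Attention(embed_dim=1024, num_heads=16)
>>> hidden_states = ops.randn(2, 10, 1024)  # (batch, time, channel)
>>> attn_output, attn_weights, past_key_value = attn(hidden_states, output_attentions=True)
>>> attn_output.shape
(2, 10, 1024)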

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
class SeamlessM4Tv2Attention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""
    # Copied from transformers.models.bart.modeling_bart.BartAttention.__init__ with Bart->SeamlessM4Tv2
    def __init__(
        self,
        embed_dim: int,
        num_heads: int,
        dropout: float = 0.0,
        is_decoder: bool = False,
        bias: bool = True,
        is_causal: bool = False,
        config: Optional[SeamlessM4Tv2Config] = None,
    ):
        """Initializes the SeamlessM4Tv2Attention object.

        Args:
            self: The object itself.
            embed_dim (int): The dimension of the input embeddings.
            num_heads (int): The number of attention heads.
            dropout (float, optional): The dropout probability. Defaults to 0.0.
            is_decoder (bool, optional): Indicates if the attention is used in a decoder. Defaults to False.
            bias (bool, optional): Indicates if bias is added to the linear transformations. Defaults to True.
            is_causal (bool, optional): Indicates if the attention is causal. Defaults to False.
            config (Optional[SeamlessM4Tv2Config], optional): The configuration for the attention. Defaults to None.

        Returns:
            None

        Raises:
            ValueError: If embed_dim is not divisible by num_heads.
        """
        super().__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.dropout = dropout
        self.head_dim = embed_dim // num_heads
        self.config = config

        if (self.head_dim * num_heads) != self.embed_dim:
            raise ValueError(
                f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim}"
                f" and `num_heads`: {num_heads})."
            )
        self.scaling = self.head_dim**-0.5
        self.is_decoder = is_decoder
        self.is_causal = is_causal

        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
        self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)

    def _shape(self, projection: mindspore.Tensor) -> mindspore.Tensor:
        """
        Method to reshape the input projection tensor to match the specified number of heads and head dimension.

        Args:
            self (SeamlessM4Tv2Attention): The instance of the SeamlessM4Tv2Attention class.
            projection (mindspore.Tensor): The input projection tensor that needs to be reshaped.

        Returns:
            mindspore.Tensor: A new tensor with the reshaped projection based on the specified number of heads
                and head dimension.

        Raises:
            None.
        """
        new_projection_shape = projection.shape[:-1] + (self.num_heads, self.head_dim)
        # move heads to 2nd position (B, T, H * D) -> (B, T, H, D) -> (B, H, T, D)
        new_projection = projection.view(new_projection_shape).permute(0, 2, 1, 3)
        return new_projection

    def forward(
        self,
        hidden_states: mindspore.Tensor,
        encoder_hidden_states: Optional[mindspore.Tensor] = None,
        past_key_value: Optional[Tuple[mindspore.Tensor]] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        output_attentions: bool = False,
    ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]:
        """Input shape: Batch x Time x Channel"""
        is_cross_attention = encoder_hidden_states is not None
        batch_size, seq_length = hidden_states.shape[:2]

        # use encoder_hidden_states if cross attention
        current_states = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        # checking that the `sequence_length` of the `past_key_value` is the same as the provided
        # `encoder_hidden_states` to support prefix tuning
        if is_cross_attention and past_key_value and past_key_value[0].shape[2] == current_states.shape[1]:
            # reuse k,v, cross_attentions
            key_states = past_key_value[0]
            value_states = past_key_value[1]
        else:
            key_states = self._shape(self.k_proj(current_states))
            value_states = self._shape(self.v_proj(current_states))
            if past_key_value is not None and not is_cross_attention:
                # reuse k, v, self_attention
                key_states = ops.cat([past_key_value[0], key_states], axis=2)
                value_states = ops.cat([past_key_value[1], value_states], axis=2)

        query_states = self._shape(self.q_proj(hidden_states) * self.scaling)
        attention_scores = ops.matmul(query_states, key_states.swapaxes(-1, -2))

        if self.is_decoder:
            # if cross_attention save Tuple(mindspore.Tensor, mindspore.Tensor) of all cross attention key/value_states.
            # Further calls to cross_attention layer can then reuse all cross-attention
            # key/value_states (first "if" case)
            # if uni-directional self-attention (decoder) save Tuple(mindspore.Tensor, mindspore.Tensor) of
            # all previous decoder key/value_states. Further calls to uni-directional self-attention
            # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
            # if encoder bi-directional self-attention `past_key_value` is always `None`
            past_key_value = (key_states, value_states)

        if attention_mask is not None:
            attention_scores = attention_scores + attention_mask

        # (batch_size, n_heads, seq_length, key_length)
        attn_weights = ops.softmax(attention_scores.float(), axis=-1).type_as(attention_scores)
        attn_weights = ops.dropout(attn_weights, p=self.dropout, training=self.training)

        #  attn_output = torch.bmm(attn_probs, value_states) ?
        context_states = ops.matmul(attn_weights, value_states)
        # attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim) ?
        context_states = context_states.permute(0, 2, 1, 3).view(batch_size, seq_length, -1)
        attn_output = self.out_proj(context_states)

        if output_attentions:
            return attn_output, attn_weights, past_key_value
        return attn_output, None, past_key_value

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2Attention.__init__(embed_dim, num_heads, dropout=0.0, is_decoder=False, bias=True, is_causal=False, config=None)

Initializes the SeamlessM4Tv2Attention object.

PARAMETER DESCRIPTION
self

The object itself.

embed_dim

The dimension of the input embeddings.

TYPE: int

num_heads

The number of attention heads.

TYPE: int

dropout

The dropout probability. Defaults to 0.0.

TYPE: float DEFAULT: 0.0

is_decoder

Indicates if the attention is used in a decoder. Defaults to False.

TYPE: bool DEFAULT: False

bias

Indicates if bias is added to the linear transformations. Defaults to True.

TYPE: bool DEFAULT: True

is_causal

Indicates if the attention is causal. Defaults to False.

TYPE: bool DEFAULT: False

config

The configuration for the attention. Defaults to None.

TYPE: Optional[SeamlessM4Tv2Config] DEFAULT: None

RETURNS DESCRIPTION

None

RAISES DESCRIPTION
ValueError

If embed_dim is not divisible by num_heads.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def __init__(
    self,
    embed_dim: int,
    num_heads: int,
    dropout: float = 0.0,
    is_decoder: bool = False,
    bias: bool = True,
    is_causal: bool = False,
    config: Optional[SeamlessM4Tv2Config] = None,
):
    """Initializes the SeamlessM4Tv2Attention object.

    Args:
        self: The object itself.
        embed_dim (int): The dimension of the input embeddings.
        num_heads (int): The number of attention heads.
        dropout (float, optional): The dropout probability. Defaults to 0.0.
        is_decoder (bool, optional): Indicates if the attention is used in a decoder. Defaults to False.
        bias (bool, optional): Indicates if bias is added to the linear transformations. Defaults to True.
        is_causal (bool, optional): Indicates if the attention is causal. Defaults to False.
        config (Optional[SeamlessM4Tv2Config], optional): The configuration for the attention. Defaults to None.

    Returns:
        None

    Raises:
        ValueError: If embed_dim is not divisible by num_heads.
    """
    super().__init__()
    self.embed_dim = embed_dim
    self.num_heads = num_heads
    self.dropout = dropout
    self.head_dim = embed_dim // num_heads
    self.config = config

    if (self.head_dim * num_heads) != self.embed_dim:
        raise ValueError(
            f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim}"
            f" and `num_heads`: {num_heads})."
        )
    self.scaling = self.head_dim**-0.5
    self.is_decoder = is_decoder
    self.is_causal = is_causal

    self.k_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
    self.v_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
    self.q_proj = nn.Linear(embed_dim, embed_dim, bias=bias)
    self.out_proj = nn.Linear(embed_dim, embed_dim, bias=bias)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2Attention.forward(hidden_states, encoder_hidden_states=None, past_key_value=None, attention_mask=None, output_attentions=False)

Input shape: Batch x Time x Channel
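
A minimal sketch of the decoder-side key/value cache; the sizes are illustrative assumptions. With `is_decoder=True`, each call returns a `past_key_value` tuple that can be fed back on the next step, so keys and values are only projected for the newest token and concatenated with the cache.

Example
>>> from mindspore import ops
>>> from mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2 import SeamlessM4Tv2Attention
>>> attn = SeamlessM4Tv2Attention(embed_dim=1024, num_heads=16, is_decoder=True)
>>> out1, _, past_kv = attn(ops.randn(1, 1, 1024))                          # first decoding step
>>> out2, _, past_kv = attn(ops.randn(1, 1, 1024), past_key_value=past_kv)  # second step reuses the cache
>>> past_kv[0].shape  # (batch, num_heads, cached_length, head_dim)
(1, 16, 2, 64)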

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def forward(
    self,
    hidden_states: mindspore.Tensor,
    encoder_hidden_states: Optional[mindspore.Tensor] = None,
    past_key_value: Optional[Tuple[mindspore.Tensor]] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    output_attentions: bool = False,
) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]:
    """Input shape: Batch x Time x Channel"""
    is_cross_attention = encoder_hidden_states is not None
    batch_size, seq_length = hidden_states.shape[:2]

    # use encoder_hidden_states if cross attention
    current_states = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
    # checking that the `sequence_length` of the `past_key_value` is the same as the provided
    # `encoder_hidden_states` to support prefix tuning
    if is_cross_attention and past_key_value and past_key_value[0].shape[2] == current_states.shape[1]:
        # reuse k,v, cross_attentions
        key_states = past_key_value[0]
        value_states = past_key_value[1]
    else:
        key_states = self._shape(self.k_proj(current_states))
        value_states = self._shape(self.v_proj(current_states))
        if past_key_value is not None and not is_cross_attention:
            # reuse k, v, self_attention
            key_states = ops.cat([past_key_value[0], key_states], axis=2)
            value_states = ops.cat([past_key_value[1], value_states], axis=2)

    query_states = self._shape(self.q_proj(hidden_states) * self.scaling)
    attention_scores = ops.matmul(query_states, key_states.swapaxes(-1, -2))

    if self.is_decoder:
        # if cross_attention save Tuple(mindspore.Tensor, mindspore.Tensor) of all cross attention key/value_states.
        # Further calls to cross_attention layer can then reuse all cross-attention
        # key/value_states (first "if" case)
        # if uni-directional self-attention (decoder) save Tuple(mindspore.Tensor, mindspore.Tensor) of
        # all previous decoder key/value_states. Further calls to uni-directional self-attention
        # can concat previous decoder key/value_states to current projected key/value_states (third "elif" case)
        # if encoder bi-directional self-attention `past_key_value` is always `None`
        past_key_value = (key_states, value_states)

    if attention_mask is not None:
        attention_scores = attention_scores + attention_mask

    # (batch_size, n_heads, seq_length, key_length)
    attn_weights = ops.softmax(attention_scores.float(), axis=-1).type_as(attention_scores)
    attn_weights = ops.dropout(attn_weights, p=self.dropout, training=self.training)

    #  attn_output = torch.bmm(attn_probs, value_states) ?
    context_states = ops.matmul(attn_weights, value_states)
    # attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim) ?
    context_states = context_states.permute(0, 2, 1, 3).view(batch_size, seq_length, -1)
    attn_output = self.out_proj(context_states)

    if output_attentions:
        return attn_output, attn_weights, past_key_value
    return attn_output, None, past_key_value

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2CodeHifiGan

Bases: PreTrainedModel

This class represents the SeamlessM4Tv2CodeHifiGan model, which is used for speech synthesis and translation. It inherits from the PreTrainedModel class.

ATTRIBUTE DESCRIPTION
pad_token_id

The ID of the padding token in the input sequence.

TYPE: int

dur_predictor

The variance predictor module for duration prediction.

TYPE: SeamlessM4Tv2VariancePredictor

unit_embedding

The embedding layer for unit tokens.

TYPE: Embedding

speaker_embedding

The embedding layer for speaker IDs.

TYPE: Embedding

language_embedding

The embedding layer for language IDs.

TYPE: Embedding

hifi_gan

The high-fidelity generative adversarial network for speech synthesis.

TYPE: SeamlessM4Tv2HifiGan

METHOD DESCRIPTION
_get_dur_output_lengths

Computes the output length after the duration layer.

_get_output_hifigan_lengths

Computes the output length of the hifigan convolutional layers.

forward

Constructs the output sequence using the input tokens, speaker ID, and language ID.

_init_weights

Initializes the weights of the model.

apply_weight_norm

Applies weight normalization to the model.

remove_weight_norm

Removes weight normalization from the model.
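
A minimal end-to-end sketch. The default configuration, the top-level import path for `SeamlessM4Tv2Config`, and all shapes and IDs below are illustrative assumptions (the model here is randomly initialized); in practice the unit tokens come from `SeamlessM4Tv2TextToUnitForConditionalGeneration` and the weights from a pretrained checkpoint.

Example
>>> import mindspore
>>> from mindspore import ops
>>> from mindnlp.transformers import SeamlessM4Tv2Config
>>> from mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2 import SeamlessM4Tv2CodeHifiGan
>>> config = SeamlessM4Tv2Config()
>>> vocoder = SeamlessM4Tv2CodeHifiGan(config)
>>> input_ids = ops.randint(0, config.unit_hifi_gan_vocab_size, (1, 50))  # unit tokens, (batch, seq_len)
>>> speaker_id = mindspore.Tensor([[0]])  # (batch, 1), must be < config.vocoder_num_spkrs
>>> lang_id = mindspore.Tensor([[0]])     # (batch, 1), must be < config.vocoder_num_langs
>>> waveform, lengths = vocoder(input_ids, speaker_id=speaker_id, lang_id=lang_id)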

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
class SeamlessM4Tv2CodeHifiGan(PreTrainedModel):

    """
    This class represents the SeamlessM4Tv2CodeHifiGan model, which is used for speech synthesis and translation.
    It inherits from the PreTrainedModel class.

    Attributes:
        pad_token_id (int): The ID of the padding token in the input sequence.
        dur_predictor (SeamlessM4Tv2VariancePredictor): The variance predictor module for duration prediction.
        unit_embedding (nn.Embedding): The embedding layer for unit tokens.
        speaker_embedding (nn.Embedding): The embedding layer for speaker IDs.
        language_embedding (nn.Embedding): The embedding layer for language IDs.
        hifi_gan (SeamlessM4Tv2HifiGan): The high-fidelity generative adversarial network for speech synthesis.

    Methods:
        _get_dur_output_lengths: Computes the output length after the duration layer.
        _get_output_hifigan_lengths: Computes the output length of the hifigan convolutional layers.
        forward: Constructs the output sequence using the input tokens, speaker ID, and language ID.
        _init_weights: Initializes the weights of the model.
        apply_weight_norm: Applies weight normalization to the model.
        remove_weight_norm: Removes weight normalization from the model.
    """
    config_class = SeamlessM4Tv2Config
    main_input_name = "input_embeds"
    _no_split_modules = []

    def __init__(self, config):
        """
        Initializes an instance of SeamlessM4Tv2CodeHifiGan.

        Args:
            self: The instance of the class.
            config: A configuration object containing various settings and parameters for the model.
                It is expected to have the following attributes:

                - t2u_pad_token_id (int): The padding token ID for the model.
                - unit_embed_dim (int): The dimension of unit embeddings.
                - variance_predictor_kernel_size (int): The kernel size for the variance predictor.
                - var_pred_dropout (float): The dropout rate for the variance predictor.
                - unit_hifi_gan_vocab_size (int): The vocabulary size for unit HiFi-GAN.
                - vocoder_num_spkrs (int): The number of speakers for the vocoder.
                - spkr_embed_dim (int): The dimension of speaker embeddings.
                - vocoder_num_langs (int): The number of languages for the vocoder.
                - lang_embed_dim (int): The dimension of language embeddings.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)

        self.pad_token_id = config.t2u_pad_token_id
        embed_dim = config.unit_embed_dim
        kernel_size = config.variance_predictor_kernel_size
        var_pred_dropout = config.var_pred_dropout
        self.dur_predictor = SeamlessM4Tv2VariancePredictor(embed_dim, embed_dim, kernel_size, var_pred_dropout)

        self.unit_embedding = nn.Embedding(config.unit_hifi_gan_vocab_size, config.unit_embed_dim)
        self.speaker_embedding = nn.Embedding(config.vocoder_num_spkrs, config.spkr_embed_dim)
        self.language_embedding = nn.Embedding(config.vocoder_num_langs, config.lang_embed_dim)

        self.hifi_gan = SeamlessM4Tv2HifiGan(config)

        # Initialize weights and apply final processing
        self.post_init()

    # Copied from transformers.models.seamless_m4t.modeling_seamless_m4t.SeamlessM4TCodeHifiGan._get_dur_output_lengths
    def _get_dur_output_lengths(self, input_ids, dur_out):
        """
        Computes the output length after the duration layer.
        """
        unit_lengths = (input_ids != self.pad_token_id).sum(1)

        # take care of edge cases where no padding or too many padding
        unit_lengths = ops.clamp(unit_lengths, 0, dur_out.shape[1] - 1)

        cumulative_dur_out = ops.cumsum(dur_out, axis=1)
        unit_lengths = cumulative_dur_out.gather_elements(dim=1, index=unit_lengths.unsqueeze(1)).squeeze()

        return unit_lengths

    # Copied from transformers.models.seamless_m4t.modeling_seamless_m4t.SeamlessM4TCodeHifiGan._get_output_hifigan_lengths
    def _get_output_hifigan_lengths(self, input_lengths: Union[mindspore.Tensor, int]):
        """
        Computes the output length of the hifigan convolutional layers
        """
        def _conv_out_length(input_length, kernel_size, stride, pad, dilation=1):
            # 1D convolutional layer output length formula taken
            # from https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html
            return (
                ops.div(input_length + 2 * pad - dilation * (kernel_size - 1) - 1, stride, rounding_mode="floor") + 1
            )

        def _swapaxes_conv_out_length(input_length, kernel_size, stride, pad, dilation=1):
            return (input_length - 1) * stride - 2 * pad + dilation * (kernel_size - 1) + 1

        # conv_pre
        input_lengths = _conv_out_length(input_lengths, 7, 1, 3)

        # upsampler
        for _, (upsample_rate, kernel_size) in enumerate(
            zip(self.config.upsample_rates, self.config.upsample_kernel_sizes)
        ):
            input_lengths = _swapaxes_conv_out_length(
                input_lengths, kernel_size, upsample_rate, (kernel_size - upsample_rate) // 2
            )

        # resblock
        for _ in range(len(self.config.upsample_rates)):
            for kernel_size, dilation in zip(self.config.resblock_kernel_sizes, self.config.resblock_dilation_sizes):
                for dil in dilation:
                    input_lengths = _conv_out_length(
                        input_lengths, kernel_size, 1, (kernel_size - 1) * dil // 2, dilation=dil
                    )

                for dil in dilation:
                    input_lengths = _conv_out_length(input_lengths, kernel_size, 1, (kernel_size - 1) // 2, dilation=1)

        # conv_post
        input_lengths = _conv_out_length(input_lengths, 7, 1, 3)

        return input_lengths

    # Copied from transformers.models.seamless_m4t.modeling_seamless_m4t.SeamlessM4TCodeHifiGan.forward with SeamlessM4T->SeamlessM4Tv2, spkr_id->speaker_id
    def forward(
        self, input_ids: mindspore.Tensor, speaker_id: mindspore.Tensor, lang_id: mindspore.Tensor
    ) -> Tuple[mindspore.Tensor]:
        """
        Args:
            input_ids (`mindspore.Tensor` of shape `(batch_size, sequence_length)`):
                Indices of input sequence tokens in the vocabulary.

                Indices can be obtained using [`SeamlessM4Tv2TextToUnitForConditionalGeneration`]. [What are input
                IDs?](../glossary#input-ids)
            speaker_id (`int`, *optional*):
                The id of the speaker used for speech synthesis. Must be lower than `config.vocoder_num_spkrs`.
            tgt_lang (`str`, *optional*):
                The language id to use as target language for translation.
        """
        hidden_states = self.unit_embedding(input_ids).swapaxes(1, 2)
        spkr = self.speaker_embedding(speaker_id).swapaxes(1, 2)
        lang = self.language_embedding(lang_id).swapaxes(1, 2)

        log_dur_pred = self.dur_predictor(hidden_states.swapaxes(1, 2))
        dur_out = ops.clamp(ops.round((ops.exp(log_dur_pred) - 1)).long(), min=1)
        # B x C x T
        if hidden_states.shape[0] == 1:
            hidden_states = ops.repeat_interleave(hidden_states, dur_out.view(-1), axis=2)
        else:
            # if batched sample, need to interleave per sample, and pad -> loss of parallelism
            if hidden_states.shape[0] > 1 and self.training:
                logger.warning(
                    """`self.training=True` and you use batching. You lose parallelism during the hifigan
                               forward pass because the samples are interleaved."""
                )
            hidden_states = [
                ops.repeat_interleave(hidden_state, duration, axis=-1).swapaxes(0, 1)
                for (hidden_state, duration) in zip(hidden_states, dur_out)
            ]

            hidden_states = pad_sequence(hidden_states, batch_first=True).swapaxes(1, 2)

        spkr = spkr.repeat(1, 1, hidden_states.shape[-1])
        lang = lang.repeat(1, 1, hidden_states.shape[-1])
        hidden_states = ops.cat([lang, hidden_states, spkr], axis=1)

        hidden_states = self.hifi_gan(hidden_states)

        unit_lengths = self._get_dur_output_lengths(input_ids, dur_out)
        lengths = self._get_output_hifigan_lengths(unit_lengths)

        return hidden_states, lengths

    # Copied from transformers.models.seamless_m4t.modeling_seamless_m4t.SeamlessM4TCodeHifiGan._init_weights
    def _init_weights(self, cell):
        """Initialize the weights."""
        if isinstance(cell, (nn.Linear, nn.Conv1d, nn.Conv1dTranspose)):
            cell.weight.set_data(initializer(Normal(self.config.initializer_range),
                                                    cell.weight.shape, cell.weight.dtype))
            if cell.bias is not None:
                cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
        elif isinstance(cell, nn.Embedding):
            weight = initializer(Normal(self.config.initializer_range),
                                                 cell.weight.shape,
                                                 cell.weight.dtype)
            if cell.padding_idx is not None:
                weight[cell.padding_idx] = 0
            cell.weight.set_data(weight)

    # Copied from transformers.models.seamless_m4t.modeling_seamless_m4t.SeamlessM4TCodeHifiGan.apply_weight_norm
    def apply_weight_norm(self):
        """
        Apply weight normalization to the HifiGan model layers.

        Args:
            self: Instance of the SeamlessM4Tv2CodeHifiGan class. Represents the current instance of the class.

        Returns:
            None.

        Raises:
            None: However, if any exceptions occur during the weight normalization process,
                they will be propagated up the call stack.
        """
        nn.utils.weight_norm(self.hifi_gan.conv_pre)
        for layer in self.hifi_gan.upsampler:
            nn.utils.weight_norm(layer)
        for layer in self.hifi_gan.resblocks:
            layer.apply_weight_norm()
        nn.utils.weight_norm(self.hifi_gan.conv_post)

    # Copied from transformers.models.seamless_m4t.modeling_seamless_m4t.SeamlessM4TCodeHifiGan.remove_weight_norm
    def remove_weight_norm(self):
        """
        Removes weight normalization from the specified layers in the HifiGan model.

        Args:
            self: An instance of the SeamlessM4Tv2CodeHifiGan class.

        Returns:
            None.

        Raises:
            None.

        Description:
            This method removes weight normalization from the following layers in the HifiGan model:

            - self.hifi_gan.conv_pre: The convolutional layer before upsampling.
            - self.hifi_gan.upsampler: A list of upsampling layers.
            - self.hifi_gan.resblocks: A list of residual blocks.
            - self.hifi_gan.conv_post: The final convolutional layer after upsampling.

        Weight normalization is a technique used to normalize the weights of neural network layers.
        By removing weight normalization, the weights of the specified layers are no longer normalized, which can have
        an impact on the performance of the model.

        Note that this method modifies the layers in-place and does not return any value.
        """
        nn.utils.remove_weight_norm(self.hifi_gan.conv_pre)
        for layer in self.hifi_gan.upsampler:
            nn.utils.remove_weight_norm(layer)
        for layer in self.hifi_gan.resblocks:
            layer.remove_weight_norm()
        nn.utils.remove_weight_norm(self.hifi_gan.conv_post)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2CodeHifiGan.__init__(config)

Initializes an instance of SeamlessM4Tv2CodeHifiGan.

PARAMETER DESCRIPTION
self

The instance of the class.

config

A configuration object containing various settings and parameters for the model. It is expected to have the following attributes:

  • t2u_pad_token_id (int): The padding token ID for the model.
  • unit_embed_dim (int): The dimension of unit embeddings.
  • variance_predictor_kernel_size (int): The kernel size for the variance predictor.
  • var_pred_dropout (float): The dropout rate for the variance predictor.
  • unit_hifi_gan_vocab_size (int): The vocabulary size for unit HiFi-GAN.
  • vocoder_num_spkrs (int): The number of speakers for the vocoder.
  • spkr_embed_dim (int): The dimension of speaker embeddings.
  • vocoder_num_langs (int): The number of languages for the vocoder.
  • lang_embed_dim (int): The dimension of language embeddings.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def __init__(self, config):
    """
    Initializes an instance of SeamlessM4Tv2CodeHifiGan.

    Args:
        self: The instance of the class.
        config: A configuration object containing various settings and parameters for the model.
            It is expected to have the following attributes:

            - t2u_pad_token_id (int): The padding token ID for the model.
            - unit_embed_dim (int): The dimension of unit embeddings.
            - variance_predictor_kernel_size (int): The kernel size for the variance predictor.
            - var_pred_dropout (float): The dropout rate for the variance predictor.
            - unit_hifi_gan_vocab_size (int): The vocabulary size for unit HiFi-GAN.
            - vocoder_num_spkrs (int): The number of speakers for the vocoder.
            - spkr_embed_dim (int): The dimension of speaker embeddings.
            - vocoder_num_langs (int): The number of languages for the vocoder.
            - lang_embed_dim (int): The dimension of language embeddings.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)

    self.pad_token_id = config.t2u_pad_token_id
    embed_dim = config.unit_embed_dim
    kernel_size = config.variance_predictor_kernel_size
    var_pred_dropout = config.var_pred_dropout
    self.dur_predictor = SeamlessM4Tv2VariancePredictor(embed_dim, embed_dim, kernel_size, var_pred_dropout)

    self.unit_embedding = nn.Embedding(config.unit_hifi_gan_vocab_size, config.unit_embed_dim)
    self.speaker_embedding = nn.Embedding(config.vocoder_num_spkrs, config.spkr_embed_dim)
    self.language_embedding = nn.Embedding(config.vocoder_num_langs, config.lang_embed_dim)

    self.hifi_gan = SeamlessM4Tv2HifiGan(config)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2CodeHifiGan.apply_weight_norm()

Apply weight normalization to the HifiGan model layers.

PARAMETER DESCRIPTION
self

Instance of the SeamlessM4Tv2CodeHifiGan class. Represents the current instance of the class.

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
None

However, if any exceptions occur during the weight normalization process, they will be propagated up the call stack.
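
A typical usage pattern, shown as a sketch (the `vocoder` instance is assumed to have been built from a config as elsewhere on this page): weight normalization is kept while training or fine-tuning and removed before inference so the normalization is folded back into the plain convolution weights.

Example
>>> vocoder.apply_weight_norm()   # before training / fine-tuning
>>> # ... train or load a checkpoint ...
>>> vocoder.remove_weight_norm()  # before inference or export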

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def apply_weight_norm(self):
    """
    Apply weight normalization to the HifiGan model layers.

    Args:
        self: Instance of the SeamlessM4Tv2CodeHifiGan class. Represents the current instance of the class.

    Returns:
        None.

    Raises:
        None: However, if any exceptions occur during the weight normalization process,
            they will be propagated up the call stack.
    """
    nn.utils.weight_norm(self.hifi_gan.conv_pre)
    for layer in self.hifi_gan.upsampler:
        nn.utils.weight_norm(layer)
    for layer in self.hifi_gan.resblocks:
        layer.apply_weight_norm()
    nn.utils.weight_norm(self.hifi_gan.conv_post)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2CodeHifiGan.forward(input_ids, speaker_id, lang_id)

PARAMETER DESCRIPTION
input_ids

Indices of input sequence tokens in the vocabulary.

Indices can be obtained using [SeamlessM4Tv2TextToUnitForConditionalGeneration]. What are input IDs?

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`

speaker_id

The id of the speaker used for speech synthesis. Must be lower than config.vocoder_num_spkrs.

TYPE: `int`, *optional*

tgt_lang

The language id to use as target language for translation.

TYPE: `str`, *optional*

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def forward(
    self, input_ids: mindspore.Tensor, speaker_id: mindspore.Tensor, lang_id: mindspore.Tensor
) -> Tuple[mindspore.Tensor]:
    """
    Args:
        input_ids (`mindspore.Tensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary.

            Indices can be obtained using [`SeamlessM4Tv2TextToUnitForConditionalGeneration`]. [What are input
            IDs?](../glossary#input-ids)
        speaker_id (`int`, *optional*):
            The id of the speaker used for speech synthesis. Must be lower than `config.vocoder_num_spkrs`.
        tgt_lang (`str`, *optional*):
            The language id to use as target language for translation.
    """
    hidden_states = self.unit_embedding(input_ids).swapaxes(1, 2)
    spkr = self.speaker_embedding(speaker_id).swapaxes(1, 2)
    lang = self.language_embedding(lang_id).swapaxes(1, 2)

    log_dur_pred = self.dur_predictor(hidden_states.swapaxes(1, 2))
    dur_out = ops.clamp(ops.round((ops.exp(log_dur_pred) - 1)).long(), min=1)
    # B x C x T
    if hidden_states.shape[0] == 1:
        hidden_states = ops.repeat_interleave(hidden_states, dur_out.view(-1), axis=2)
    else:
        # if batched sample, need to interleave per sample, and pad -> loss of parallelism
        if hidden_states.shape[0] > 1 and self.training:
            logger.warning(
                """`self.training=True` and you use batching. You lose parallelism during the hifigan
                           forward pass because the samples are interleaved."""
            )
        hidden_states = [
            ops.repeat_interleave(hidden_state, duration, axis=-1).swapaxes(0, 1)
            for (hidden_state, duration) in zip(hidden_states, dur_out)
        ]

        hidden_states = pad_sequence(hidden_states, batch_first=True).swapaxes(1, 2)

    spkr = spkr.repeat(1, 1, hidden_states.shape[-1])
    lang = lang.repeat(1, 1, hidden_states.shape[-1])
    hidden_states = ops.cat([lang, hidden_states, spkr], axis=1)

    hidden_states = self.hifi_gan(hidden_states)

    unit_lengths = self._get_dur_output_lengths(input_ids, dur_out)
    lengths = self._get_output_hifigan_lengths(unit_lengths)

    return hidden_states, lengths

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2CodeHifiGan.remove_weight_norm()

Removes weight normalization from the specified layers in the HifiGan model.

PARAMETER DESCRIPTION
self

An instance of the SeamlessM4Tv2CodeHifiGan class.

RETURNS DESCRIPTION

None.

Description

This method removes weight normalization from the following layers in the HifiGan model:

  • self.hifi_gan.conv_pre: The convolutional layer before upsampling.
  • self.hifi_gan.upsampler: A list of upsampling layers.
  • self.hifi_gan.resblocks: A list of residual blocks.
  • self.hifi_gan.conv_post: The final convolutional layer after upsampling.

Weight normalization is a technique used to normalize the weights of neural network layers. By removing weight normalization, the weights of the specified layers are no longer normalized, which can have an impact on the performance of the model.

Note that this method modifies the layers in-place and does not return any value.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def remove_weight_norm(self):
    """
    Removes weight normalization from the specified layers in the HifiGan model.

    Args:
        self: An instance of the SeamlessM4Tv2CodeHifiGan class.

    Returns:
        None.

    Raises:
        None.

    Description:
        This method removes weight normalization from the following layers in the HifiGan model:

        - self.hifi_gan.conv_pre: The convolutional layer before upsampling.
        - self.hifi_gan.upsampler: A list of upsampling layers.
        - self.hifi_gan.resblocks: A list of residual blocks.
        - self.hifi_gan.conv_post: The final convolutional layer after upsampling.

    Weight normalization is a technique used to normalize the weights of neural network layers.
    By removing weight normalization, the weights of the specified layers are no longer normalized, which can have
    an impact on the performance of the model.

    Note that this method modifies the layers in-place and does not return any value.
    """
    nn.utils.remove_weight_norm(self.hifi_gan.conv_pre)
    for layer in self.hifi_gan.upsampler:
        nn.utils.remove_weight_norm(layer)
    for layer in self.hifi_gan.resblocks:
        layer.remove_weight_norm()
    nn.utils.remove_weight_norm(self.hifi_gan.conv_post)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerAdapter

Bases: Module

A class representing a SeamlessM4Tv2ConformerAdapter.

Inherits from nn.Module.

This class initializes an instance of SeamlessM4Tv2ConformerAdapter and forwards the adapter layers. Each adapter layer is a SeamlessM4Tv2ConformerAdapterLayer, and the number of layers is determined by the 'num_adapter_layers' parameter in the configuration.

ATTRIBUTE DESCRIPTION
layers

A list of SeamlessM4Tv2ConformerAdapterLayer instances representing the adapter layers.

TYPE: ModuleList

METHOD DESCRIPTION
__init__

Initializes a new instance of SeamlessM4Tv2ConformerAdapter.

forward

Constructs the adapter layers by iterating over each layer and applying it to the input hidden states and attention mask.
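
A hedged usage sketch of the adapter; the `config` attributes and input shapes are assumptions based on the layer code below, not guaranteed API.

```python
# `config` is assumed to provide `num_adapter_layers`, `hidden_size`, `adaptor_kernel_size`,
# `adaptor_stride` and `adaptor_dropout`, as used by SeamlessM4Tv2ConformerAdapterLayer.
adapter = SeamlessM4Tv2ConformerAdapter(config)

# hidden_states: (batch, seq_len, hidden_size); attention_mask: (batch, seq_len) with 1 = valid frame.
hidden_states = adapter(hidden_states, attention_mask)
# Each adapter layer pools with stride `config.adaptor_stride`, so the sequence length
# shrinks by roughly that factor per layer.
```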

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
class SeamlessM4Tv2ConformerAdapter(nn.Module):

    """A class representing a SeamlessM4Tv2ConformerAdapter.

    Inherits from nn.Module.

    This class initializes an instance of SeamlessM4Tv2ConformerAdapter and forwards the adapter layers.
    Each adapter layer is a SeamlessM4Tv2ConformerAdapterLayer, and the number of layers is determined by
    the 'num_adapter_layers' parameter in the configuration.

    Attributes:
        layers (nn.ModuleList): A list of SeamlessM4Tv2ConformerAdapterLayer instances representing the adapter layers.

    Methods:
        __init__: Initializes a new instance of SeamlessM4Tv2ConformerAdapter.
        forward: Constructs the adapter layers by iterating over each layer and applying it to the input
            hidden states and attention mask.

    """
    def __init__(self, config):
        """
        Initializes an instance of the 'SeamlessM4Tv2ConformerAdapter' class.

        Args:
            self: The instance of the 'SeamlessM4Tv2ConformerAdapter' class.
            config: An object of type 'Config' containing configuration parameters.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()

        self.layers = nn.ModuleList(
            [SeamlessM4Tv2ConformerAdapterLayer(config) for _ in range(config.num_adapter_layers)]
        )

    def forward(self, hidden_states, attention_mask):
        """
        Constructs the hidden states of the SeamlessM4Tv2ConformerAdapter by applying the layers in sequence.

        Args:
            self (SeamlessM4Tv2ConformerAdapter): An instance of the SeamlessM4Tv2ConformerAdapter class.
            hidden_states (Tensor): The input hidden states. The shape is (batch_size, sequence_length, hidden_size).
            attention_mask (Tensor): The attention mask tensor. The shape is (batch_size, sequence_length).

        Returns:
            Tensor: The hidden states after being processed by all adapter layers.

        Raises:
            None
        """
        # down project hidden_states if necessary

        for layer in self.layers:
            hidden_states = layer(hidden_states, attention_mask)

        return hidden_states

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerAdapter.__init__(config)

Initializes an instance of the 'SeamlessM4Tv2ConformerAdapter' class.

PARAMETER DESCRIPTION
self

The instance of the 'SeamlessM4Tv2ConformerAdapter' class.

config

An object of type 'Config' containing configuration parameters.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def __init__(self, config):
    """
    Initializes an instance of the 'SeamlessM4Tv2ConformerAdapter' class.

    Args:
        self: The instance of the 'SeamlessM4Tv2ConformerAdapter' class.
        config: An object of type 'Config' containing configuration parameters.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()

    self.layers = nn.ModuleList(
        [SeamlessM4Tv2ConformerAdapterLayer(config) for _ in range(config.num_adapter_layers)]
    )

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerAdapter.forward(hidden_states, attention_mask)

Constructs the hidden states of the SeamlessM4Tv2ConformerAdapter by applying the layers in sequence.

PARAMETER DESCRIPTION
self

An instance of the SeamlessM4Tv2ConformerAdapter class.

TYPE: SeamlessM4Tv2ConformerAdapter

hidden_states

The input hidden states. The shape is (batch_size, sequence_length, hidden_size).

TYPE: Tensor

attention_mask

The attention mask tensor. The shape is (batch_size, sequence_length).

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

The hidden states after being processed by all adapter layers.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def forward(self, hidden_states, attention_mask):
    """
    Constructs the hidden states of the SeamlessM4Tv2ConformerAdapter by applying the layers in sequence.

    Args:
        self (SeamlessM4Tv2ConformerAdapter): An instance of the SeamlessM4Tv2ConformerAdapter class.
        hidden_states (Tensor): The input hidden states. The shape is (batch_size, sequence_length, hidden_size).
        attention_mask (Tensor): The attention mask tensor. The shape is (batch_size, sequence_length).

    Returns:
        Tensor: The hidden states after being processed by all adapter layers.

    Raises:
        None
    """
    # down project hidden_states if necessary

    for layer in self.layers:
        hidden_states = layer(hidden_states, attention_mask)

    return hidden_states

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerAdapterLayer

Bases: Module

This class represents a layer for the SeamlessM4Tv2 Conformer Adapter. It inherits from nn.Module and contains methods for computing sub-sample lengths from attention mask and forwarding the adapter layer using the given input and optional attention mask.

ATTRIBUTE DESCRIPTION
config

The configuration object containing hidden size and adaptor dropout information.

TYPE: object

METHOD DESCRIPTION
_compute_sub_sample_lengths_from_attention_mask

Computes sub-sample lengths from the attention mask.

forward

Constructs the adapter layer using the given input hidden_states and optional attention_mask.

Note

For detailed information on the class methods and attributes, please refer to the class code and comments.
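
The layer pools its input with a strided convolution, so its output is shorter than its input. The helper `_compute_sub_sample_lengths_from_attention_mask` (shown in the source below) applies the standard convolution length formula; the following is a framework-free sketch of that arithmetic with illustrative kernel and stride values.

```python
# Plain-Python sketch of the sub-sampled length arithmetic used by
# `_compute_sub_sample_lengths_from_attention_mask` (kernel/stride values are illustrative).
def sub_sampled_length(valid_length: int, kernel_size: int, stride: int) -> int:
    pad = kernel_size // 2
    return int((valid_length + 2 * pad - kernel_size) / stride) + 1

print(sub_sampled_length(100, kernel_size=8, stride=8))  # 13 output positions
```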

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
class SeamlessM4Tv2ConformerAdapterLayer(nn.Module):
    """
    This class represents a layer for the SeamlessM4Tv2 Conformer Adapter. It inherits from nn.Module and contains methods
    for computing sub-sample lengths from attention mask and forwarding the adapter layer using the given input and
    optional attention mask.

    Attributes:
        config (object): The configuration object containing hidden size and adaptor dropout information.

    Methods:
        _compute_sub_sample_lengths_from_attention_mask(attention_mask): Computes sub-sample lengths from the
            attention mask.
        forward(hidden_states, attention_mask, output_attentions): Constructs the adapter layer using the given
            input hidden_states and optional attention_mask.

    Note:
        For detailed information on the class methods and attributes, please refer to the class code and comments.
    """
    def __init__(self, config):
        """
        This method initializes an instance of the SeamlessM4Tv2ConformerAdapterLayer class.

        Args:
            self: The instance of the class.
            config: A configuration object containing the parameters for the adapter layer.
                It is expected to have the following attributes:

                - hidden_size: An integer representing the dimension of the hidden state.
                - adaptor_dropout: A float representing the dropout probability for the adapter layer.
                - adaptor_kernel_size: An integer representing the size of the kernel for the convolutional layers
                in the adapter.
                - adaptor_stride: An integer representing the stride for the convolutional layers in the adapter.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        embed_dim = config.hidden_size
        dropout = config.adaptor_dropout

        self.kernel_size = config.adaptor_kernel_size
        self.stride = config.adaptor_stride

        # 1. residual convolution
        self.residual_layer_norm = nn.LayerNorm([embed_dim])
        self.residual_conv = nn.Conv1d(
            embed_dim,
            2 * embed_dim,
            self.kernel_size,
            stride=self.stride,
            pad_mode='pad',
            padding=self.stride // 2,
        )
        self.activation = nn.GLU(axis=1)

        # Self-Attention
        self.self_attn_layer_norm = nn.LayerNorm([embed_dim])
        self.self_attn_conv = nn.Conv1d(
            embed_dim,
            2 * embed_dim,
            self.kernel_size,
            stride=self.stride,
            pad_mode='pad',
            padding=self.stride // 2,
        )
        self.self_attn = SeamlessM4Tv2ConformerSelfAttention(config, use_position_embeddings=False)
        self.self_attn_dropout = nn.Dropout(p=dropout)

        # Feed-forward
        self.ffn_layer_norm = nn.LayerNorm([embed_dim])
        self.ffn = SeamlessM4Tv2ConformerFeedForward(config, act_fn="relu", dropout=dropout)

    def _compute_sub_sample_lengths_from_attention_mask(self, attention_mask):
        """
        Computes the lengths of sub-samples based on the attention mask.

        Args:
            self (SeamlessM4Tv2ConformerAdapterLayer): An instance of the SeamlessM4Tv2ConformerAdapterLayer class.
            attention_mask (Tensor): A binary tensor of shape (batch_size, sequence_length) representing the attention
                mask.

        Returns:
            Tensor: A 1D tensor with the sub-sampled sequence length of each sample in the batch.

        Raises:
            None

        This method computes the lengths of the sub-sampled sequences based on the attention mask. The attention
        mask is a binary tensor where each element indicates whether the corresponding token is a valid token (1)
        or a padding token (0). The method first derives the valid sequence length of each sample in the batch by
        subtracting the number of padding tokens from the total sequence length.

        These lengths are then adjusted for the strided convolution applied by the adapter layer: with a padding
        of half the kernel size, the method adds twice the padding, subtracts the kernel size, divides by the
        stride, and adds 1.

        The resulting lengths are cast to float32 using 'astype' and rounded down with 'floor'.

        Note:
            The returned lengths are used in `forward` to build the attention mask for the sub-sampled sequence.
        """
        pad = self.kernel_size // 2
        seq_lens = attention_mask.shape[1] - (1 - attention_mask.int()).sum(1)

        seq_lens = ((seq_lens + 2 * pad - self.kernel_size) / self.stride) + 1

        return seq_lens.astype(mindspore.float32).floor()

    def forward(
        self,
        hidden_states,
        attention_mask: Optional[mindspore.Tensor] = None,
        output_attentions: bool = False,
    ):
        """
        Constructs the SeamlessM4Tv2ConformerAdapterLayer.

        Args:
            self: The instance of the class.
            hidden_states (mindspore.Tensor): The input hidden states. It represents the input data to the layer.
            attention_mask (Optional[mindspore.Tensor]): An optional tensor representing the attention mask.
                Defaults to None. If provided, it restricts the attention of the layer.
            output_attentions (bool): A flag indicating whether to output attentions. Defaults to False.

        Returns:
            mindspore.Tensor: The output hidden states after processing through the layer.

        Raises:
            ValueError: If the dimensions of input tensors are incompatible.
            RuntimeError: If an error occurs during the computation process.
            TypeError: If the input parameters are of incorrect type.
        """
        residual = self.residual_layer_norm(hidden_states)

        # Apply pooling to the residual to match the sequence length of the
        # multi-head attention output.
        # (batch, seq_len, feature_dim) -> (batch, feature_dim, seq_len)
        residual = residual.swapaxes(1, 2)
        residual = self.residual_conv(residual)
        residual = self.activation(residual)
        # (batch, feature_dim, seq_len) -> (batch, seq_len, feature_dim)
        residual = residual.swapaxes(1, 2)

        hidden_states = self.self_attn_layer_norm(hidden_states)
        # Apply pooling before feeding to the multihead-attention layer.
        # (batch, seq_len, feature_dim) -> (batch, feature_dim, seq_len)
        hidden_states = hidden_states.swapaxes(1, 2)
        hidden_states = self.self_attn_conv(hidden_states)
        hidden_states = self.activation(hidden_states)
        # (batch, feature_dim, seq_len) -> (batch, seq_len, feature_dim)
        hidden_states = hidden_states.swapaxes(1, 2)

        if attention_mask is not None:
            sub_sampled_lengths = self._compute_sub_sample_lengths_from_attention_mask(attention_mask)
            attention_mask = _compute_new_attention_mask(hidden_states=hidden_states, seq_lens=sub_sampled_lengths)
            attention_mask = _prepare_4d_attention_mask(
                attention_mask,
                hidden_states.dtype,
            )

        # The rest of the computation is identical to a vanilla Transformer
        # encoder layer.
        hidden_states, _ = self.self_attn(
            hidden_states,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
        )
        hidden_states = self.self_attn_dropout(hidden_states)
        hidden_states = hidden_states + residual

        residual = hidden_states

        hidden_states = self.ffn_layer_norm(hidden_states)
        hidden_states = self.ffn(hidden_states) + residual

        return hidden_states

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerAdapterLayer.__init__(config)

This method initializes an instance of the SeamlessM4Tv2ConformerAdapterLayer class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

A configuration object containing the parameters for the adapter layer. It is expected to have the following attributes:

  • hidden_size: An integer representing the dimension of the hidden state.
  • adaptor_dropout: A float representing the dropout probability for the adapter layer.
  • adaptor_kernel_size: An integer representing the size of the kernel for the convolutional layers in the adapter.
  • adaptor_stride: An integer representing the stride for the convolutional layers in the adapter.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def __init__(self, config):
    """
    This method initializes an instance of the SeamlessM4Tv2ConformerAdapterLayer class.

    Args:
        self: The instance of the class.
        config: A configuration object containing the parameters for the adapter layer.
            It is expected to have the following attributes:

            - hidden_size: An integer representing the dimension of the hidden state.
            - adaptor_dropout: A float representing the dropout probability for the adapter layer.
            - adaptor_kernel_size: An integer representing the size of the kernel for the convolutional layers
            in the adapter.
            - adaptor_stride: An integer representing the stride for the convolutional layers in the adapter.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    embed_dim = config.hidden_size
    dropout = config.adaptor_dropout

    self.kernel_size = config.adaptor_kernel_size
    self.stride = config.adaptor_stride

    # 1. residual convolution
    self.residual_layer_norm = nn.LayerNorm([embed_dim])
    self.residual_conv = nn.Conv1d(
        embed_dim,
        2 * embed_dim,
        self.kernel_size,
        stride=self.stride,
        pad_mode='pad',
        padding=self.stride // 2,
    )
    self.activation = nn.GLU(axis=1)

    # Self-Attention
    self.self_attn_layer_norm = nn.LayerNorm([embed_dim])
    self.self_attn_conv = nn.Conv1d(
        embed_dim,
        2 * embed_dim,
        self.kernel_size,
        stride=self.stride,
        pad_mode='pad',
        padding=self.stride // 2,
    )
    self.self_attn = SeamlessM4Tv2ConformerSelfAttention(config, use_position_embeddings=False)
    self.self_attn_dropout = nn.Dropout(p=dropout)

    # Feed-forward
    self.ffn_layer_norm = nn.LayerNorm([embed_dim])
    self.ffn = SeamlessM4Tv2ConformerFeedForward(config, act_fn="relu", dropout=dropout)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerAdapterLayer.forward(hidden_states, attention_mask=None, output_attentions=False)

Constructs the SeamlessM4Tv2ConformerAdapterLayer.

PARAMETER DESCRIPTION
self

The instance of the class.

hidden_states

The input hidden states. It represents the input data to the layer.

TYPE: Tensor

attention_mask

An optional tensor representing the attention mask. Defaults to None. If provided, it restricts the attention of the layer.

TYPE: Optional[Tensor] DEFAULT: None

output_attentions

A flag indicating whether to output attentions. Defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION

mindspore.Tensor: The output hidden states after processing through the layer.

RAISES DESCRIPTION
ValueError

If the dimensions of input tensors are incompatible.

RuntimeError

If an error occurs during the computation process.

TypeError

If the input parameters are of incorrect type.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def forward(
    self,
    hidden_states,
    attention_mask: Optional[mindspore.Tensor] = None,
    output_attentions: bool = False,
):
    """
    Constructs the SeamlessM4Tv2ConformerAdapterLayer.

    Args:
        self: The instance of the class.
        hidden_states (mindspore.Tensor): The input hidden states. It represents the input data to the layer.
        attention_mask (Optional[mindspore.Tensor]): An optional tensor representing the attention mask.
            Defaults to None. If provided, it restricts the attention of the layer.
        output_attentions (bool): A flag indicating whether to output attentions. Defaults to False.

    Returns:
        mindspore.Tensor: The output hidden states after processing through the layer.

    Raises:
        ValueError: If the dimensions of input tensors are incompatible.
        RuntimeError: If an error occurs during the computation process.
        TypeError: If the input parameters are of incorrect type.
    """
    residual = self.residual_layer_norm(hidden_states)

    # Apply pooling to the residual to match the sequence length of the
    # multi-head attention output.
    # (batch, seq_len, feature_dim) -> (batch, feature_dim, seq_len)
    residual = residual.swapaxes(1, 2)
    residual = self.residual_conv(residual)
    residual = self.activation(residual)
    # (batch, feature_dim, seq_len) -> (batch, seq_len, feature_dim)
    residual = residual.swapaxes(1, 2)

    hidden_states = self.self_attn_layer_norm(hidden_states)
    # Apply pooling before feeding to the multihead-attention layer.
    # (batch, seq_len, feature_dim) -> (batch, feature_dim, seq_len)
    hidden_states = hidden_states.swapaxes(1, 2)
    hidden_states = self.self_attn_conv(hidden_states)
    hidden_states = self.activation(hidden_states)
    # (batch, feature_dim, seq_len) -> (batch, seq_len, feature_dim)
    hidden_states = hidden_states.swapaxes(1, 2)

    if attention_mask is not None:
        sub_sampled_lengths = self._compute_sub_sample_lengths_from_attention_mask(attention_mask)
        attention_mask = _compute_new_attention_mask(hidden_states=hidden_states, seq_lens=sub_sampled_lengths)
        attention_mask = _prepare_4d_attention_mask(
            attention_mask,
            hidden_states.dtype,
        )

    # The rest of the computation is identical to a vanilla Transformer
    # encoder layer.
    hidden_states, _ = self.self_attn(
        hidden_states,
        attention_mask=attention_mask,
        output_attentions=output_attentions,
    )
    hidden_states = self.self_attn_dropout(hidden_states)
    hidden_states = hidden_states + residual

    residual = hidden_states

    hidden_states = self.ffn_layer_norm(hidden_states)
    hidden_states = self.ffn(hidden_states) + residual

    return hidden_states

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerConvolutionModule

Bases: Module

Convolution block used in the Conformer block. Uses a causal depthwise convolution similar to that described in Section 2.1 of https://doi.org/10.48550/arxiv.1609.03499.
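
The module pads the sequence only on the left by `kernel_size - 1` before a 'valid' depthwise convolution, so each output frame depends only on current and past frames while the sequence length is preserved. A framework-free sketch of that arithmetic:

```python
# Plain-Python sketch of the causal padding used above: left padding of (kernel_size - 1)
# followed by a 'valid', stride-1 depthwise convolution keeps the sequence length unchanged.
def causal_output_length(seq_len: int, kernel_size: int) -> int:
    padded = seq_len + (kernel_size - 1)   # pad on the left only
    return padded - (kernel_size - 1)      # 'valid' convolution, stride 1

assert causal_output_length(50, kernel_size=31) == 50
```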

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
class SeamlessM4Tv2ConformerConvolutionModule(nn.Module):
    """Convolution block used in the conformer block. Uses a causal depthwise convolution similar to that
    described in Section 2.1 of https://doi.org/10.48550/arxiv.1609.03499"""
    def __init__(self, config):
        """
        Initializes the SeamlessM4Tv2ConformerConvolutionModule.

        Args:
            self (object): The instance of the class.
            config (object):
                The configuration object containing various parameters for the module.

                - conv_depthwise_kernel_size (int): The kernel size for depthwise convolution.
                - hidden_size (int): The hidden size used in convolution layers.
                - speech_encoder_hidden_act (str): The activation function for hidden layers.
                - speech_encoder_dropout (float): The dropout rate.

        Returns:
            None.

        Raises:
            ValueError: Raised if the 'config.conv_depthwise_kernel_size' is not an odd number,
                as it should be for 'SAME' padding.
        """
        super().__init__()
        if (config.conv_depthwise_kernel_size - 1) % 2 == 1:
            raise ValueError("`config.conv_depthwise_kernel_size` should be a odd number for 'SAME' padding")
        self.layer_norm = nn.LayerNorm([config.hidden_size])
        self.pointwise_conv1 = nn.Conv1d(
            config.hidden_size,
            2 * config.hidden_size,
            kernel_size=1,
            stride=1,
            pad_mode='valid',
            bias=False,
        )
        self.glu = nn.GLU(axis=1)
        self.depthwise_conv = nn.Conv1d(
            config.hidden_size,
            config.hidden_size,
            config.conv_depthwise_kernel_size,
            stride=1,
            pad_mode='valid',
            group=config.hidden_size,
            bias=False,
        )
        self.depthwise_layer_norm = nn.LayerNorm([config.hidden_size])
        self.activation = ACT2FN[config.speech_encoder_hidden_act]
        self.pointwise_conv2 = nn.Conv1d(
            config.hidden_size,
            config.hidden_size,
            kernel_size=1,
            stride=1,
            pad_mode='valid',
            bias=False,
        )
        self.dropout = nn.Dropout(p=config.speech_encoder_dropout)

    def forward(self, hidden_states, attention_mask=None):
        """
        Constructs the SeamlessM4Tv2ConformerConvolutionModule.

        Args:
            self: The instance of the SeamlessM4Tv2ConformerConvolutionModule.
            hidden_states (Tensor): The input hidden states tensor of shape (batch_size, sequence_length, hidden_size).
            attention_mask (Tensor, optional): The attention mask tensor of shape (batch_size, sequence_length)
                indicating which tokens should be attended to and which should not. Defaults to None.

        Returns:
            Tensor: The output hidden states tensor after applying the convolution operations of shape
                (batch_size, sequence_length, hidden_size).

        Raises:
            None.
        """
        hidden_states = self.layer_norm(hidden_states)

        # Ensure that we do not leak padded positions in depthwise convolution.
        # Put 0 where necessary
        if attention_mask is not None:
            hidden_states = hidden_states.masked_fill(~attention_mask.bool().unsqueeze(-1), 0.0)

        # exchange the temporal dimension and the feature dimension
        hidden_states = hidden_states.swapaxes(1, 2)

        # GLU mechanism
        # => (batch, 2*channel, dim)
        hidden_states = self.pointwise_conv1(hidden_states)
        # => (batch, channel, dim)
        hidden_states = self.glu(hidden_states)

        # Pad the sequence entirely on the left because of causal convolution.
        hidden_states = ops.pad(hidden_states, (self.depthwise_conv.kernel_size[0] - 1, 0))

        # 1D Depthwise Conv
        hidden_states = self.depthwise_conv(hidden_states)
        hidden_states = self.depthwise_layer_norm(hidden_states.swapaxes(1, 2)).swapaxes(1, 2)
        hidden_states = self.activation(hidden_states)

        hidden_states = self.pointwise_conv2(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = hidden_states.swapaxes(1, 2)
        return hidden_states

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerConvolutionModule.__init__(config)

Initializes the SeamlessM4Tv2ConformerConvolutionModule.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: object

config

The configuration object containing various parameters for the module.

  • conv_depthwise_kernel_size (int): The kernel size for depthwise convolution.
  • hidden_size (int): The hidden size used in convolution layers.
  • speech_encoder_hidden_act (str): The activation function for hidden layers.
  • speech_encoder_dropout (float): The dropout rate.

TYPE: object

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

Raised if the 'config.conv_depthwise_kernel_size' is not an odd number, as it should be for 'SAME' padding.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def __init__(self, config):
    """
    Initializes the SeamlessM4Tv2ConformerConvolutionModule.

    Args:
        self (object): The instance of the class.
        config (object):
            The configuration object containing various parameters for the module.

            - conv_depthwise_kernel_size (int): The kernel size for depthwise convolution.
            - hidden_size (int): The hidden size used in convolution layers.
            - speech_encoder_hidden_act (str): The activation function for hidden layers.
            - speech_encoder_dropout (float): The dropout rate.

    Returns:
        None.

    Raises:
        ValueError: Raised if the 'config.conv_depthwise_kernel_size' is not an odd number,
            as it should be for 'SAME' padding.
    """
    super().__init__()
    if (config.conv_depthwise_kernel_size - 1) % 2 == 1:
        raise ValueError("`config.conv_depthwise_kernel_size` should be a odd number for 'SAME' padding")
    self.layer_norm = nn.LayerNorm([config.hidden_size])
    self.pointwise_conv1 = nn.Conv1d(
        config.hidden_size,
        2 * config.hidden_size,
        kernel_size=1,
        stride=1,
        pad_mode='valid',
        bias=False,
    )
    self.glu = nn.GLU(axis=1)
    self.depthwise_conv = nn.Conv1d(
        config.hidden_size,
        config.hidden_size,
        config.conv_depthwise_kernel_size,
        stride=1,
        pad_mode='valid',
        group=config.hidden_size,
        bias=False,
    )
    self.depthwise_layer_norm = nn.LayerNorm([config.hidden_size])
    self.activation = ACT2FN[config.speech_encoder_hidden_act]
    self.pointwise_conv2 = nn.Conv1d(
        config.hidden_size,
        config.hidden_size,
        kernel_size=1,
        stride=1,
        pad_mode='valid',
        bias=False,
    )
    self.dropout = nn.Dropout(p=config.speech_encoder_dropout)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerConvolutionModule.forward(hidden_states, attention_mask=None)

Constructs the SeamlessM4Tv2ConformerConvolutionModule.

PARAMETER DESCRIPTION
self

The instance of the SeamlessM4Tv2ConformerConvolutionModule.

hidden_states

The input hidden states tensor of shape (batch_size, sequence_length, hidden_size).

TYPE: Tensor

attention_mask

The attention mask tensor of shape (batch_size, sequence_length) indicating which tokens should be attended to and which should not. Defaults to None.

TYPE: Tensor DEFAULT: None

RETURNS DESCRIPTION
Tensor

The output hidden states tensor after applying the convolution operations of shape (batch_size, sequence_length, hidden_size).

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def forward(self, hidden_states, attention_mask=None):
    """
    Constructs the SeamlessM4Tv2ConformerConvolutionModule.

    Args:
        self: The instance of the SeamlessM4Tv2ConformerConvolutionModule.
        hidden_states (Tensor): The input hidden states tensor of shape (batch_size, sequence_length, hidden_size).
        attention_mask (Tensor, optional): The attention mask tensor of shape (batch_size, sequence_length)
            indicating which tokens should be attended to and which should not. Defaults to None.

    Returns:
        Tensor: The output hidden states tensor after applying the convolution operations of shape
            (batch_size, sequence_length, hidden_size).

    Raises:
        None.
    """
    hidden_states = self.layer_norm(hidden_states)

    # Ensure that we do not leak padded positions in depthwise convolution.
    # Put 0 where necessary
    if attention_mask is not None:
        hidden_states = hidden_states.masked_fill(~attention_mask.bool().unsqueeze(-1), 0.0)

    # exchange the temporal dimension and the feature dimension
    hidden_states = hidden_states.swapaxes(1, 2)

    # GLU mechanism
    # => (batch, 2*channel, dim)
    hidden_states = self.pointwise_conv1(hidden_states)
    # => (batch, channel, dim)
    hidden_states = self.glu(hidden_states)

    # Pad the sequence entirely on the left because of causal convolution.
    hidden_states = ops.pad(hidden_states, (self.depthwise_conv.kernel_size[0] - 1, 0))

    # 1D Depthwise Conv
    hidden_states = self.depthwise_conv(hidden_states)
    hidden_states = self.depthwise_layer_norm(hidden_states.swapaxes(1, 2)).swapaxes(1, 2)
    hidden_states = self.activation(hidden_states)

    hidden_states = self.pointwise_conv2(hidden_states)
    hidden_states = self.dropout(hidden_states)
    hidden_states = hidden_states.swapaxes(1, 2)
    return hidden_states

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerEncoder

Bases: Module

The class represents a SeamlessM4Tv2ConformerEncoder, which is a neural network cell for encoding speech data. It inherits from the nn.Module class.

The class includes methods for initializing the encoder, applying chunk attention, and constructing the hidden states. The __init__ method sets up the encoder with the given configuration, dropout, layers, and layer normalization. The _apply_chunk_attention method creates a chunk attention mask that prevents attention across chunks. The forward method processes the hidden states, applies chunk attention if configured, and runs the layer-wise computations.

Note

This docstring is a summary based on the provided code and may need additional details from the broader context of the codebase.
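
A framework-free sketch of the chunk mask built by `_apply_chunk_attention` (shown in the source below); `chunk_size` and `left_chunk_num` stand in for `config.speech_encoder_chunk_size` and `config.speech_encoder_left_chunk_num`.

```python
import numpy as np

# True marks key positions a query may NOT attend to, mirroring `_apply_chunk_attention`.
def chunk_attention_mask(seq_len: int, chunk_size: int, left_chunk_num: int) -> np.ndarray:
    chunk_indices = np.arange(seq_len) // chunk_size
    if left_chunk_num >= 0:
        start = np.clip(chunk_indices - left_chunk_num, 0, None) * chunk_size
    else:
        start = np.zeros(seq_len, dtype=np.int64)
    end = np.minimum((chunk_indices + 1) * chunk_size, seq_len)
    keys = np.arange(seq_len)[None, :]
    return (keys < start[:, None]) | (keys >= end[:, None])

mask = chunk_attention_mask(seq_len=6, chunk_size=2, left_chunk_num=1)
# With chunk_size=2 and one left chunk, query position 4 may attend positions 2-5 but not 0-1.
```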

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
class SeamlessM4Tv2ConformerEncoder(nn.Module):

    """
    The class represents a SeamlessM4Tv2ConformerEncoder, which is a neural network cell for encoding speech data.
    It inherits from the nn.Module class.

    The class includes methods for initializing the encoder, applying chunk attention, and forwarding the hidden states.
    The __init__ method initializes the encoder with the given configuration, dropout, layers, and layer normalization.
    The _apply_chunk_attention method creates a chunk attention mask to prevent attention across chunks.
    The forward method processes the hidden states, applies chunk attention if specified, and performs layer-wise
    computations.

    Note:
        This docstring is a summary based on the provided code and may need additional details from the broader context
        of the codebase.
    """
    def __init__(self, config):
        """
        Initializes an instance of the SeamlessM4Tv2ConformerEncoder class.

        Args:
            self: An instance of the class.
            config:
                An object of type 'config' containing the configuration settings for the encoder.

                - Type: Config object
                - Purpose: Specifies the configuration parameters for the encoder.
                - Restrictions: None

        Returns:
            None

        Raises:
            None
        """
        super().__init__()
        self.config = config

        self.dropout = nn.Dropout(p=config.speech_encoder_dropout)
        self.layers = nn.ModuleList(
            [SeamlessM4Tv2ConformerEncoderLayer(config) for _ in range(config.speech_encoder_layers)]
        )

        self.layer_norm = nn.LayerNorm([config.hidden_size], eps=config.layer_norm_eps)

    def _apply_chunk_attention(self, attention_mask, hidden_states):
        """
        Creates a chunk attention mask. The mask prevents attention across chunks, ensuring that each position
        attends only to positions within its own chunk. If a left chunk overlap is specified
        (`speech_encoder_left_chunk_num` in the configuration), the attention mask is adjusted so that each
        position can also attend to the `speech_encoder_left_chunk_num` previous chunks.
        """
        sequence_len = hidden_states.shape[1]

        chunk_indices = ops.arange(sequence_len)
        chunk_indices = ops.div(chunk_indices, self.config.speech_encoder_chunk_size).long()

        start_indices = ops.full_like(chunk_indices, 0)
        if self.config.speech_encoder_left_chunk_num >= 0:
            start_indices = (chunk_indices - self.config.speech_encoder_left_chunk_num).clamp(min=0)
            start_indices = start_indices * self.config.speech_encoder_chunk_size
        start_indices = start_indices.unsqueeze(1).expand(-1, sequence_len)

        end_indices = ((chunk_indices + 1) * self.config.speech_encoder_chunk_size).clamp(max=sequence_len)

        end_indices = end_indices.unsqueeze(1).expand(-1, sequence_len)

        indices = ops.arange(sequence_len).unsqueeze(0).expand(sequence_len, -1)

        chunk_mask = (indices < start_indices) | (indices >= end_indices)
        chunk_mask = chunk_mask.unsqueeze(0).unsqueeze(0)

        attention_mask = chunk_mask if attention_mask is None else (attention_mask.bool() | chunk_mask)
        attention_mask = attention_mask.to(dtype=hidden_states.dtype)
        return attention_mask

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        output_attentions=False,
        output_hidden_states=False,
        return_dict=True,
    ):
        """
        Constructs the SeamlessM4Tv2ConformerEncoder.

        Args:
            self: The instance of the class.
            hidden_states (Tensor): The hidden states of the encoder. Shape should be
                (batch_size, sequence_length, hidden_size).
            attention_mask (Tensor, optional): The padding mask tensor of shape
                (batch_size, sequence_length). Valid positions have a value of 1 and padded
                positions have a value of 0. Default is 'None'.
            output_attentions (bool, optional): Whether to output the self-attention tensors of each layer.
                Default is 'False'.
            output_hidden_states (bool, optional): Whether to output the hidden states of each layer. Default is 'False'.
            return_dict (bool, optional): Whether to return the output as a dictionary. Default is 'True'.

        Returns:
            Union[Tuple, BaseModelOutput]: The final hidden states, together with the optional
                per-layer hidden states and attention weights, returned as a `BaseModelOutput`
                if `return_dict` is True, otherwise as a tuple.

        Raises:
            None
        """
        all_hidden_states = () if output_hidden_states else None
        all_self_attentions = () if output_attentions else None

        conv_attention_mask = attention_mask
        if attention_mask is not None:
            # make sure padded tokens output 0
            hidden_states = hidden_states.masked_fill(~attention_mask.bool().unsqueeze(-1), 0.0)
            # extend attention_mask
            attention_mask = 1.0 - attention_mask[:, None, None, :].to(dtype=hidden_states.dtype)
            attention_mask = attention_mask.expand(
                attention_mask.shape[0], 1, attention_mask.shape[-1], attention_mask.shape[-1]
            )

        if self.config.speech_encoder_chunk_size is not None:
            attention_mask = self._apply_chunk_attention(attention_mask, hidden_states)

        if attention_mask is not None:
            attention_mask = attention_mask * float(np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).min)

        hidden_states = self.dropout(hidden_states)

        for _, layer in enumerate(self.layers):
            if output_hidden_states:
                all_hidden_states = all_hidden_states + (hidden_states,)

            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
            dropout_probability = ops.rand([])

            skip_the_layer = bool(self.training and (dropout_probability < self.config.speech_encoder_layerdrop))
            if not skip_the_layer:
                layer_outputs = layer(
                    hidden_states,
                    attention_mask=attention_mask,
                    output_attentions=output_attentions,
                    conv_attention_mask=conv_attention_mask,
                )
                hidden_states = layer_outputs[0]

            if skip_the_layer:
                layer_outputs = (None, None)

            if output_attentions:
                all_self_attentions = all_self_attentions + (layer_outputs[1],)

        hidden_states = self.layer_norm(hidden_states)
        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        if not return_dict:
            return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None)
        return BaseModelOutput(
            last_hidden_state=hidden_states,
            hidden_states=all_hidden_states,
            attentions=all_self_attentions,
        )

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerEncoder.__init__(config)

Initializes an instance of the SeamlessM4Tv2ConformerEncoder class.

PARAMETER DESCRIPTION
self

An instance of the class.

config

An object of type 'config' containing the configuration settings for the encoder.

  • Type: Config object
  • Purpose: Specifies the configuration parameters for the encoder.
  • Restrictions: None

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def __init__(self, config):
    """
    Initializes an instance of the SeamlessM4Tv2ConformerEncoder class.

    Args:
        self: An instance of the class.
        config:
            An object of type 'config' containing the configuration settings for the encoder.

            - Type: Config object
            - Purpose: Specifies the configuration parameters for the encoder.
            - Restrictions: None

    Returns:
        None

    Raises:
        None
    """
    super().__init__()
    self.config = config

    self.dropout = nn.Dropout(p=config.speech_encoder_dropout)
    self.layers = nn.ModuleList(
        [SeamlessM4Tv2ConformerEncoderLayer(config) for _ in range(config.speech_encoder_layers)]
    )

    self.layer_norm = nn.LayerNorm([config.hidden_size], eps=config.layer_norm_eps)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerEncoder.forward(hidden_states, attention_mask=None, output_attentions=False, output_hidden_states=False, return_dict=True)

Constructs the SeamlessM4Tv2ConformerEncoder.

PARAMETER DESCRIPTION
self

The instance of the class.

hidden_states

The hidden states of the encoder. Shape should be (batch_size, sequence_length, hidden_size).

TYPE: Tensor

attention_mask

The padding mask tensor of shape (batch_size, sequence_length). Valid positions have a value of 1 and padded positions have a value of 0. Default is 'None'.

TYPE: Tensor DEFAULT: None

output_attentions

Whether to output the self-attention tensors of each layer. Default is 'False'.

TYPE: bool DEFAULT: False

output_hidden_states

Whether to output the hidden states of each layer. Default is 'False'.

TYPE: bool DEFAULT: False

return_dict

Whether to return the output as a dictionary. Default is 'True'.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
Union[Tuple, BaseModelOutput]

The final hidden states, together with the optional per-layer hidden states and attention weights, returned as a BaseModelOutput if return_dict is True, otherwise as a tuple.
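
A hedged invocation sketch; the names are placeholders, and the padding-mask convention of 1 = valid frame, 0 = padded frame follows the `masked_fill` handling in the source below.

```python
import mindspore
from mindspore import ops

# `config`, `features` (batch, seq_len, hidden_size) and `feature_lengths` (batch,)
# are assumed to be prepared elsewhere.
encoder = SeamlessM4Tv2ConformerEncoder(config)

seq_len = features.shape[1]
positions = ops.arange(seq_len).unsqueeze(0)                         # (1, seq_len)
attention_mask = (positions < feature_lengths.unsqueeze(1)).astype(mindspore.int32)

outputs = encoder(features, attention_mask=attention_mask, return_dict=True)
last_hidden = outputs.last_hidden_state                              # (batch, seq_len, hidden_size)
```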

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def forward(
    self,
    hidden_states,
    attention_mask=None,
    output_attentions=False,
    output_hidden_states=False,
    return_dict=True,
):
    """
    Constructs the SeamlessM4Tv2ConformerEncoder.

    Args:
        self: The instance of the class.
        hidden_states (Tensor): The hidden states of the encoder. Shape should be
            (batch_size, sequence_length, hidden_size).
        attention_mask (Tensor, optional): The padding mask tensor of shape
            (batch_size, sequence_length). Valid positions have a value of 1 and padded
            positions have a value of 0. Default is 'None'.
        output_attentions (bool, optional): Whether to output the self-attention tensors of each layer.
            Default is 'False'.
        output_hidden_states (bool, optional): Whether to output the hidden states of each layer. Default is 'False'.
        return_dict (bool, optional): Whether to return the output as a dictionary. Default is 'True'.

    Returns:
        Union[Tuple, BaseModelOutput]: The final hidden states, together with the optional
            per-layer hidden states and attention weights, returned as a `BaseModelOutput`
            if `return_dict` is True, otherwise as a tuple.

    Raises:
        None
    """
    all_hidden_states = () if output_hidden_states else None
    all_self_attentions = () if output_attentions else None

    conv_attention_mask = attention_mask
    if attention_mask is not None:
        # make sure padded tokens output 0
        hidden_states = hidden_states.masked_fill(~attention_mask.bool().unsqueeze(-1), 0.0)
        # extend attention_mask
        attention_mask = 1.0 - attention_mask[:, None, None, :].to(dtype=hidden_states.dtype)
        attention_mask = attention_mask.expand(
            attention_mask.shape[0], 1, attention_mask.shape[-1], attention_mask.shape[-1]
        )

    if self.config.speech_encoder_chunk_size is not None:
        attention_mask = self._apply_chunk_attention(attention_mask, hidden_states)

    if attention_mask is not None:
        attention_mask = attention_mask * float(np.finfo(mindspore.dtype_to_nptype(hidden_states.dtype)).min)

    hidden_states = self.dropout(hidden_states)

    for _, layer in enumerate(self.layers):
        if output_hidden_states:
            all_hidden_states = all_hidden_states + (hidden_states,)

        # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
        dropout_probability = ops.rand([])

        skip_the_layer = bool(self.training and (dropout_probability < self.config.speech_encoder_layerdrop))
        if not skip_the_layer:
            layer_outputs = layer(
                hidden_states,
                attention_mask=attention_mask,
                output_attentions=output_attentions,
                conv_attention_mask=conv_attention_mask,
            )
            hidden_states = layer_outputs[0]

        if skip_the_layer:
            layer_outputs = (None, None)

        if output_attentions:
            all_self_attentions = all_self_attentions + (layer_outputs[1],)

    hidden_states = self.layer_norm(hidden_states)
    if output_hidden_states:
        all_hidden_states = all_hidden_states + (hidden_states,)

    if not return_dict:
        return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None)
    return BaseModelOutput(
        last_hidden_state=hidden_states,
        hidden_states=all_hidden_states,
        attentions=all_self_attentions,
    )

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerEncoderLayer

Bases: Module

Conformer block based on https://arxiv.org/abs/2005.08100.
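
A hedged usage sketch of a single layer; the `config` object and input shapes are placeholders.

```python
# hidden_states: (batch, seq_len, hidden_size). Masks may be left as None for unpadded input.
layer = SeamlessM4Tv2ConformerEncoderLayer(config)
hidden_states, attn_weights = layer(
    hidden_states,
    attention_mask=None,
    output_attentions=True,
    conv_attention_mask=None,
)
# The block follows the Conformer "macaron" layout: half-step feed-forward, self-attention,
# convolution module, second half-step feed-forward, then a final LayerNorm.
```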

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
class SeamlessM4Tv2ConformerEncoderLayer(nn.Module):
    """Conformer block based on https://arxiv.org/abs/2005.08100."""
    # Copied from transformers.models.wav2vec2_conformer.modeling_wav2vec2_conformer.Wav2Vec2ConformerEncoderLayer.__init__ with Wav2Vec2->SeamlessM4Tv2, attention_dropout->speech_encoder_dropout, torch.nn->nn
    def __init__(self, config):
        """
        Initialize a SeamlessM4Tv2ConformerEncoderLayer object.

        Args:
            self (SeamlessM4Tv2ConformerEncoderLayer): The instance of the class.
            config:
                An object containing the configuration parameters for the encoder layer.

                - hidden_size (int): The dimension of the embedding.
                - speech_encoder_dropout (float): The dropout probability for the self-attention layer.

        Returns:
            None

        Raises:
            None
        """
        super().__init__()
        embed_dim = config.hidden_size
        dropout = config.speech_encoder_dropout

        # Feed-forward 1
        self.ffn1_layer_norm = nn.LayerNorm([embed_dim])
        self.ffn1 = SeamlessM4Tv2ConformerFeedForward(config)

        # Self-Attention
        self.self_attn_layer_norm = nn.LayerNorm([embed_dim])
        self.self_attn_dropout = nn.Dropout(p=dropout)
        self.self_attn = SeamlessM4Tv2ConformerSelfAttention(config)

        # Conformer Convolution
        self.conv_module = SeamlessM4Tv2ConformerConvolutionModule(config)

        # Feed-forward 2
        self.ffn2_layer_norm = nn.LayerNorm([embed_dim])
        self.ffn2 = SeamlessM4Tv2ConformerFeedForward(config)
        self.final_layer_norm = nn.LayerNorm([embed_dim])

    def forward(
        self,
        hidden_states,
        attention_mask: Optional[mindspore.Tensor] = None,
        output_attentions: bool = False,
        conv_attention_mask: Optional[mindspore.Tensor] = None,
    ):
        """
        Constructs a SeamlessM4Tv2ConformerEncoderLayer.

        Args:
            self: The object instance.
            hidden_states (mindspore.Tensor): The input hidden states. Shape is (batch_size, sequence_length, hidden_size).
            attention_mask (Optional[mindspore.Tensor], optional): The attention mask tensor. Default is None.
                If provided, the attention mask tensor must have the same shape as `hidden_states`.
                A value of 0 in the attention mask tensor indicates masking for the corresponding position,
                while a value of 1 indicates non-masking.
            output_attentions (bool, optional): Whether to output the attention weights. Default is False.
            conv_attention_mask (Optional[mindspore.Tensor], optional):
                The convolution attention mask tensor. Default is None.
                If provided, the convolution attention mask tensor must have the same shape as `hidden_states`.
                A value of 0 in the convolution attention mask tensor indicates masking for the corresponding position,
                while a value of 1 indicates non-masking.

        Returns:
            Tuple[mindspore.Tensor, Optional[mindspore.Tensor]]:
                A tuple containing:

                - hidden_states (mindspore.Tensor): The output hidden states. Shape is
                (batch_size, sequence_length, hidden_size).
                - attn_weights (Optional[mindspore.Tensor]): The attention weights tensor if
                `output_attentions` is True, else None.

        Raises:
            None.
        """
        # 1. Feed-Forward 1 layer
        residual = hidden_states
        hidden_states = self.ffn1_layer_norm(hidden_states)
        hidden_states = self.ffn1(hidden_states)
        hidden_states = hidden_states * 0.5 + residual
        residual = hidden_states

        # 2. Self-Attention layer
        hidden_states = self.self_attn_layer_norm(hidden_states)
        hidden_states, attn_weights = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
        )
        hidden_states = self.self_attn_dropout(hidden_states)
        hidden_states = hidden_states + residual

        # 3. Convolutional Layer
        residual = hidden_states
        hidden_states = self.conv_module(hidden_states, attention_mask=conv_attention_mask)
        hidden_states = residual + hidden_states

        # 4. Feed-Forward 2 Layer
        residual = hidden_states
        hidden_states = self.ffn2_layer_norm(hidden_states)
        hidden_states = self.ffn2(hidden_states)
        hidden_states = hidden_states * 0.5 + residual
        hidden_states = self.final_layer_norm(hidden_states)

        return hidden_states, attn_weights

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerEncoderLayer.__init__(config)

Initialize a SeamlessM4Tv2ConformerEncoderLayer object.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: SeamlessM4Tv2ConformerEncoderLayer

config

An object containing the configuration parameters for the encoder layer.

  • hidden_size (int): The dimension of the embedding.
  • speech_encoder_dropout (float): The dropout probability for the self-attention layer.

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def __init__(self, config):
    """
    Initialize a SeamlessM4Tv2ConformerEncoderLayer object.

    Args:
        self (SeamlessM4Tv2ConformerEncoderLayer): The instance of the class.
        config:
            An object containing the configuration parameters for the encoder layer.

            - hidden_size (int): The dimension of the embedding.
            - speech_encoder_dropout (float): The dropout probability for the self-attention layer.

    Returns:
        None

    Raises:
        None
    """
    super().__init__()
    embed_dim = config.hidden_size
    dropout = config.speech_encoder_dropout

    # Feed-forward 1
    self.ffn1_layer_norm = nn.LayerNorm([embed_dim])
    self.ffn1 = SeamlessM4Tv2ConformerFeedForward(config)

    # Self-Attention
    self.self_attn_layer_norm = nn.LayerNorm([embed_dim])
    self.self_attn_dropout = nn.Dropout(p=dropout)
    self.self_attn = SeamlessM4Tv2ConformerSelfAttention(config)

    # Conformer Convolution
    self.conv_module = SeamlessM4Tv2ConformerConvolutionModule(config)

    # Feed-forward 2
    self.ffn2_layer_norm = nn.LayerNorm([embed_dim])
    self.ffn2 = SeamlessM4Tv2ConformerFeedForward(config)
    self.final_layer_norm = nn.LayerNorm([embed_dim])

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerEncoderLayer.forward(hidden_states, attention_mask=None, output_attentions=False, conv_attention_mask=None)

Constructs a SeamlessM4Tv2ConformerEncoderLayer.

PARAMETER DESCRIPTION
self

The object instance.

hidden_states

The input hidden states. Shape is (batch_size, sequence_length, hidden_size).

TYPE: Tensor

attention_mask

The attention mask tensor. Default is None. If provided, the attention mask tensor must have the same shape as hidden_states. A value of 0 in the attention mask tensor indicates masking for the corresponding position, while a value of 1 indicates non-masking.

TYPE: Optional[Tensor] DEFAULT: None

output_attentions

Whether to output the attention weights. Default is False.

TYPE: bool DEFAULT: False

conv_attention_mask

The convolution attention mask tensor. Default is None. If provided, the convolution attention mask tensor must have the same shape as hidden_states. A value of 0 in the convolution attention mask tensor indicates masking for the corresponding position, while a value of 1 indicates non-masking.

TYPE: Optional[Tensor] DEFAULT: None

RETURNS DESCRIPTION

Tuple[mindspore.Tensor, Optional[mindspore.Tensor]]: A tuple containing:

  • hidden_states (mindspore.Tensor): The output hidden states. Shape is (batch_size, sequence_length, hidden_size).
  • attn_weights (Optional[mindspore.Tensor]): The attention weights tensor if output_attentions is True, else None.
Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def forward(
    self,
    hidden_states,
    attention_mask: Optional[mindspore.Tensor] = None,
    output_attentions: bool = False,
    conv_attention_mask: Optional[mindspore.Tensor] = None,
):
    """
    Constructs a SeamlessM4Tv2ConformerEncoderLayer.

    Args:
        self: The object instance.
        hidden_states (mindspore.Tensor): The input hidden states. Shape is (batch_size, sequence_length, hidden_size).
        attention_mask (Optional[mindspore.Tensor], optional): The attention mask tensor. Default is None.
            If provided, the attention mask tensor must have the same shape as `hidden_states`.
            A value of 0 in the attention mask tensor indicates masking for the corresponding position,
            while a value of 1 indicates non-masking.
        output_attentions (bool, optional): Whether to output the attention weights. Default is False.
        conv_attention_mask (Optional[mindspore.Tensor], optional):
            The convolution attention mask tensor. Default is None.
            If provided, the convolution attention mask tensor must have the same shape as `hidden_states`.
            A value of 0 in the convolution attention mask tensor indicates masking for the corresponding position,
            while a value of 1 indicates non-masking.

    Returns:
        Tuple[mindspore.Tensor, Optional[mindspore.Tensor]]:
            A tuple containing:

            - hidden_states (mindspore.Tensor): The output hidden states. Shape is
            (batch_size, sequence_length, hidden_size).
            - attn_weights (Optional[mindspore.Tensor]): The attention weights tensor if
            `output_attentions` is True, else None.

    Raises:
        None.
    """
    # 1. Feed-Forward 1 layer
    residual = hidden_states
    hidden_states = self.ffn1_layer_norm(hidden_states)
    hidden_states = self.ffn1(hidden_states)
    hidden_states = hidden_states * 0.5 + residual
    residual = hidden_states

    # 2. Self-Attention layer
    hidden_states = self.self_attn_layer_norm(hidden_states)
    hidden_states, attn_weights = self.self_attn(
        hidden_states=hidden_states,
        attention_mask=attention_mask,
        output_attentions=output_attentions,
    )
    hidden_states = self.self_attn_dropout(hidden_states)
    hidden_states = hidden_states + residual

    # 3. Convolutional Layer
    residual = hidden_states
    hidden_states = self.conv_module(hidden_states, attention_mask=conv_attention_mask)
    hidden_states = residual + hidden_states

    # 4. Feed-Forward 2 Layer
    residual = hidden_states
    hidden_states = self.ffn2_layer_norm(hidden_states)
    hidden_states = self.ffn2(hidden_states)
    hidden_states = hidden_states * 0.5 + residual
    hidden_states = self.final_layer_norm(hidden_states)

    return hidden_states, attn_weights
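
Note that `self_attn` adds `attention_mask` directly to the raw attention scores (see `SeamlessM4Tv2ConformerSelfAttention.forward` further below), so in practice a padding mask is usually converted to an additive, broadcastable mask before being passed to this layer, while the plain 0/1 padding mask corresponds to the convolution mask described above. The sketch below (an assumption about typical usage, not library documentation) shows one way to derive such an additive mask; the `-1e9` fill value and the shapes are illustrative choices.

```python
# Sketch only (an assumption about typical usage, not library documentation):
# build an additive attention mask from a 0/1 padding mask. Kept positions get 0,
# padded positions get a large negative value so softmax assigns them ~0 weight.
import numpy as np
import mindspore

batch_size, seq_len = 2, 6
padding = np.ones((batch_size, seq_len), dtype=np.float32)  # 1 = real frame, 0 = padding
padding[1, 4:] = 0.0                                        # pad the tail of the second sample
padding_mask = mindspore.Tensor(padding)

additive_mask = (1.0 - padding_mask) * -1e9                        # 0 -> keep, -1e9 -> mask
additive_mask = additive_mask.reshape(batch_size, 1, 1, seq_len)   # broadcast over heads/queries
# `additive_mask` can then be passed as `attention_mask`, while the original 0/1
# `padding_mask` is the kind of mask the conv_attention_mask documentation refers to.
```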

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerFeatureProjection

Bases: Module

This class represents a feature projection module for the SeamlessM4Tv2Conformer model. It inherits from the nn.Module class.

The feature projection module maps the input hidden states from feature_projection_input_dim to the model's hidden size: the hidden states are layer-normalized, passed through a linear projection, and regularized with dropout. This aligns the incoming features with the hidden dimension expected by the rest of the speech encoder.

ATTRIBUTE DESCRIPTION
layer_norm

A layer normalization module that normalizes the hidden states.

TYPE: LayerNorm

projection

A dense linear projection layer that projects the hidden states into a higher-dimensional space.

TYPE: Linear

dropout

A dropout module that randomly sets elements of the hidden states to zero.

TYPE: Dropout

METHOD DESCRIPTION
__init__

Initializes the SeamlessM4Tv2ConformerFeatureProjection module with the given configuration.

forward

Applies the feature projection operation on the input hidden states.

RETURNS DESCRIPTION

The projected hidden states after applying layer normalization and dropout.

Note
  • The input hidden states should have a shape of [batch_size, sequence_length, input_dim].
  • The configuration should contain the following attributes:

    • feature_projection_input_dim: The input dimension of the feature projection layer.
    • hidden_size: The output dimension of the feature projection layer.
    • layer_norm_eps: The epsilon value for layer normalization.
    • speech_encoder_dropout: The dropout probability for the dropout layer.
Example
>>> import mindspore
>>> from types import SimpleNamespace
>>> config = SimpleNamespace(
...     feature_projection_input_dim=512,
...     hidden_size=256,
...     layer_norm_eps=1e-5,
...     speech_encoder_dropout=0.1,
... )
>>> feature_projection = SeamlessM4Tv2ConformerFeatureProjection(config)
>>> hidden_states = mindspore.ops.randn(3, 100, 512)
>>> projected_states = feature_projection.forward(hidden_states)
Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
class SeamlessM4Tv2ConformerFeatureProjection(nn.Module):

    """
    This class represents a feature projection module for the SeamlessM4Tv2Conformer model.
    It inherits from the nn.Module class.

    The feature projection module maps the input hidden states from `feature_projection_input_dim` to the model's
    hidden size: the hidden states are layer-normalized, passed through a linear projection, and regularized with
    dropout. This aligns the incoming features with the hidden dimension expected by the rest of the speech encoder.

    Attributes:
        layer_norm (nn.LayerNorm): A layer normalization module that normalizes the hidden states.
        projection (nn.Linear): A dense linear projection layer that projects the hidden states into a
            higher-dimensional space.
        dropout (nn.Dropout): A dropout module that randomly sets elements of the hidden states to zero.

    Methods:
        __init__:
            Initializes the SeamlessM4Tv2ConformerFeatureProjection module with the given configuration.

        forward:
            Applies the feature projection operation on the input hidden states.

    Returns:
        The projected hidden states after applying layer normalization and dropout.

    Note:
        - The input hidden states should have a shape of [batch_size, sequence_length, input_dim].
        - The configuration should contain the following attributes:

            - feature_projection_input_dim: The input dimension of the feature projection layer.
            - hidden_size: The output dimension of the feature projection layer.
            - layer_norm_eps: The epsilon value for layer normalization.
            - speech_encoder_dropout: The dropout probability for the dropout layer.

    Example:
        ```python
        >>> import mindspore
        >>> from types import SimpleNamespace
        >>> config = SimpleNamespace(
        ...     feature_projection_input_dim=512,
        ...     hidden_size=256,
        ...     layer_norm_eps=1e-5,
        ...     speech_encoder_dropout=0.1,
        ... )
        >>> feature_projection = SeamlessM4Tv2ConformerFeatureProjection(config)
        >>> hidden_states = mindspore.ops.randn(3, 100, 512)
        >>> projected_states = feature_projection.forward(hidden_states)
        ```
    """
    # Copied from transformers.models.seamless_m4t.modeling_seamless_m4t.SeamlessM4TConformerFeatureProjection.__init__
    def __init__(self, config):
        """
        Initializes an instance of the SeamlessM4Tv2ConformerFeatureProjection class.

        Args:
            self: The instance of the class.
            config (object):
                An object containing configuration parameters for the feature projection.

                - feature_projection_input_dim (int): The input dimension of the feature projection.
                - layer_norm_eps (float): The epsilon value for LayerNorm.
                - hidden_size (int): The size of the hidden layer.
                - speech_encoder_dropout (float): The dropout probability for the speech encoder.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        self.layer_norm = nn.LayerNorm([config.feature_projection_input_dim], eps=config.layer_norm_eps)
        self.projection = nn.Linear(config.feature_projection_input_dim, config.hidden_size)
        self.dropout = nn.Dropout(p=config.speech_encoder_dropout)

    def forward(self, hidden_states):
        """Constructs the feature projection for the SeamlessM4Tv2Conformer model.

        Args:
            self (SeamlessM4Tv2ConformerFeatureProjection): An instance of the SeamlessM4Tv2ConformerFeatureProjection class.
            hidden_states (mindspore.Tensor): The input hidden states tensor to be projected.

        Returns:
            mindspore.Tensor: The projected hidden states tensor.

        Raises:
            TypeError: If the input hidden_states tensor is not a mindspore.Tensor object.
            ValueError: If the input hidden_states tensor is empty or has an incompatible shape.
            RuntimeError: If the input hidden_states tensor cannot be cast to the same dtype as the layer_norm weights.
        """
        # non-projected hidden states are needed for quantization
        norm_hidden_states = self.layer_norm(hidden_states.to(self.layer_norm.weight.dtype))
        hidden_states = self.projection(norm_hidden_states)
        hidden_states = self.dropout(hidden_states)
        return hidden_states

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerFeatureProjection.__init__(config)

Initializes an instance of the SeamlessM4Tv2ConformerFeatureProjection class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An object containing configuration parameters for the feature projection.

  • feature_projection_input_dim (int): The input dimension of the feature projection.
  • layer_norm_eps (float): The epsilon value for LayerNorm.
  • hidden_size (int): The size of the hidden layer.
  • speech_encoder_dropout (float): The dropout probability for the speech encoder.

TYPE: object

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def __init__(self, config):
    """
    Initializes an instance of the SeamlessM4Tv2ConformerFeatureProjection class.

    Args:
        self: The instance of the class.
        config (object):
            An object containing configuration parameters for the feature projection.

            - feature_projection_input_dim (int): The input dimension of the feature projection.
            - layer_norm_eps (float): The epsilon value for LayerNorm.
            - hidden_size (int): The size of the hidden layer.
            - speech_encoder_dropout (float): The dropout probability for the speech encoder.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    self.layer_norm = nn.LayerNorm([config.feature_projection_input_dim], eps=config.layer_norm_eps)
    self.projection = nn.Linear(config.feature_projection_input_dim, config.hidden_size)
    self.dropout = nn.Dropout(p=config.speech_encoder_dropout)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerFeatureProjection.forward(hidden_states)

Constructs the feature projection for the SeamlessM4Tv2Conformer model.

PARAMETER DESCRIPTION
self

An instance of the SeamlessM4Tv2ConformerFeatureProjection class.

TYPE: SeamlessM4Tv2ConformerFeatureProjection

hidden_states

The input hidden states tensor to be projected.

TYPE: Tensor

RETURNS DESCRIPTION

mindspore.Tensor: The projected hidden states tensor.

RAISES DESCRIPTION
TypeError

If the input hidden_states tensor is not a mindspore.Tensor object.

ValueError

If the input hidden_states tensor is empty or has an incompatible shape.

RuntimeError

If the input hidden_states tensor cannot be cast to the same dtype as the layer_norm weights.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def forward(self, hidden_states):
    """Constructs the feature projection for the SeamlessM4Tv2Conformer model.

    Args:
        self (SeamlessM4Tv2ConformerFeatureProjection): An instance of the SeamlessM4Tv2ConformerFeatureProjection class.
        hidden_states (mindspore.Tensor): The input hidden states tensor to be projected.

    Returns:
        mindspore.Tensor: The projected hidden states tensor.

    Raises:
        TypeError: If the input hidden_states tensor is not a mindspore.Tensor object.
        ValueError: If the input hidden_states tensor is empty or has an incompatible shape.
        RuntimeError: If the input hidden_states tensor cannot be cast to the same dtype as the layer_norm weights.
    """
    # non-projected hidden states are needed for quantization
    norm_hidden_states = self.layer_norm(hidden_states.to(self.layer_norm.weight.dtype))
    hidden_states = self.projection(norm_hidden_states)
    hidden_states = self.dropout(hidden_states)
    return hidden_states

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerFeedForward

Bases: Module

This class represents a feed-forward module for the SeamlessM4Tv2Conformer model, which is used for speech encoding.

Inherits from: nn.Module

ATTRIBUTE DESCRIPTION
config

An object containing configuration parameters for the module.

act_fn

The activation function to be applied to the intermediate hidden states.

dropout

The dropout probability to be applied to the intermediate hidden states.

METHOD DESCRIPTION
__init__

Initializes the SeamlessM4Tv2ConformerFeedForward module.

Args:

  • config: An object containing configuration parameters for the module.
  • act_fn (optional): The activation function to be applied to the intermediate hidden states.
  • dropout (optional): The dropout probability to be applied to the intermediate hidden states.
forward

Applies the feed-forward operations on the input hidden states.

Args:

  • hidden_states: The input hidden states to be processed.

Returns:

  • hidden_states: The processed hidden states after applying the feed-forward operations.
Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
class SeamlessM4Tv2ConformerFeedForward(nn.Module):

    """
    This class represents a feed-forward module for the SeamlessM4Tv2Conformer model, which is used for speech encoding.

    Inherits from: nn.Module

    Attributes:
        config: An object containing configuration parameters for the module.
        act_fn: The activation function to be applied to the intermediate hidden states.
        dropout: The dropout probability to be applied to the intermediate hidden states.

    Methods:
        __init__:
            Initializes the SeamlessM4Tv2ConformerFeedForward module.

            Args:

            - config: An object containing configuration parameters for the module.
            - act_fn (optional): The activation function to be applied to the intermediate hidden states.
            - dropout (optional): The dropout probability to be applied to the intermediate hidden states.

        forward:
            Applies the feed-forward operations on the input hidden states.

            Args:

            - hidden_states: The input hidden states to be processed.

            Returns:

            - hidden_states: The processed hidden states after applying the feed-forward operations.
    """
    def __init__(self, config, act_fn=None, dropout=None):
        """
        Initializes an instance of the SeamlessM4Tv2ConformerFeedForward class.

        Args:
            self: The object instance.
            config: An object containing configuration parameters.
            act_fn (optional): The activation function to be used for the hidden layers.
                If not provided, it defaults to the value of config.speech_encoder_hidden_act.
                It can be either a string specifying a predefined activation function or a custom activation function.
            dropout (optional): The dropout probability for the intermediate layers.
                If not provided, it defaults to the value of config.speech_encoder_dropout.

        Returns:
            None.

        Raises:
            None.

        Note:
            - The intermediate_dropout attribute is assigned an instance of nn.Dropout with p=dropout.
            - The intermediate_dense attribute is assigned an instance of nn.Linear with input size config.hidden_size
            and output size config.speech_encoder_intermediate_size.
            - The intermediate_act_fn attribute is assigned the activation function specified by act_fn.
            If act_fn is a string, it is mapped to the corresponding activation function from the ACT2FN dictionary.
            If act_fn is a custom function, it is directly assigned.
            - The output_dense attribute is assigned an instance of nn.Linear with input size
            config.speech_encoder_intermediate_size and output size config.hidden_size.
            - The output_dropout attribute is assigned an instance of nn.Dropout with p=dropout.
        """
        super().__init__()
        dropout = dropout if dropout is not None else config.speech_encoder_dropout
        act_fn = act_fn if act_fn is not None else config.speech_encoder_hidden_act

        self.intermediate_dropout = nn.Dropout(p=dropout)
        self.intermediate_dense = nn.Linear(config.hidden_size, config.speech_encoder_intermediate_size)
        self.intermediate_act_fn = ACT2FN[act_fn] if isinstance(act_fn, str) else act_fn

        self.output_dense = nn.Linear(config.speech_encoder_intermediate_size, config.hidden_size)
        self.output_dropout = nn.Dropout(p=dropout)

    def forward(self, hidden_states):
        """
        Constructs the feedforward layer in the SeamlessM4Tv2Conformer model.

        Args:
            self (SeamlessM4Tv2ConformerFeedForward): An instance of the SeamlessM4Tv2ConformerFeedForward class.
            hidden_states (mindspore.Tensor): The input hidden states of shape (batch_size, sequence_length, hidden_size).

        Returns:
            mindspore.Tensor: The transformed hidden states of shape (batch_size, sequence_length, hidden_size).

        Raises:
            None

        Description:
            This method applies a series of operations to the input hidden states to forward the feedforward layer
            in the SeamlessM4Tv2Conformer model. The operations include intermediate dense layer, activation function,
            dropout layer, and output dense layer. The resulting hidden states are returned.

            - intermediate_dense: Applies a linear transformation to the hidden states using the intermediate dense layer.
            - intermediate_act_fn: Applies the activation function to the intermediate dense outputs.
            - intermediate_dropout: Applies dropout to the intermediate outputs.
            - output_dense: Applies a linear transformation to the intermediate outputs using the output dense layer.
            - output_dropout: Applies dropout to the output dense outputs.

            Note:
                The intermediate dense layer, activation function, dropout layers, and output dense layer must be defined
                before calling this method.
        """
        hidden_states = self.intermediate_dense(hidden_states)
        hidden_states = self.intermediate_act_fn(hidden_states)
        hidden_states = self.intermediate_dropout(hidden_states)

        hidden_states = self.output_dense(hidden_states)
        hidden_states = self.output_dropout(hidden_states)
        return hidden_states
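
Because the block maps `hidden_size` to `speech_encoder_intermediate_size` and back, its output can be added to its input by the residual connections of the Conformer layer. Below is a small stand-alone sketch (not library documentation); `SimpleNamespace` stands in for the real config, and the sizes and the "relu" activation key are illustrative assumptions.

```python
# Sketch: drive the feed-forward block with a minimal stand-in config.
# SimpleNamespace mimics attribute access on the real config; the sizes and the
# "relu" activation name are illustrative, not the model defaults.
from types import SimpleNamespace
from mindspore import ops
from mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2 import (
    SeamlessM4Tv2ConformerFeedForward,
)

config = SimpleNamespace(
    hidden_size=256,
    speech_encoder_intermediate_size=1024,
    speech_encoder_hidden_act="relu",
    speech_encoder_dropout=0.1,
)
ffn = SeamlessM4Tv2ConformerFeedForward(config)

x = ops.randn(2, 10, 256)
y = ffn.forward(x)
print(y.shape)  # (2, 10, 256): the shape is preserved for the residual additions
```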

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerFeedForward.__init__(config, act_fn=None, dropout=None)

Initializes an instance of the SeamlessM4Tv2ConformerFeedForward class.

PARAMETER DESCRIPTION
self

The object instance.

config

An object containing configuration parameters.

act_fn

The activation function to be used for the hidden layers. If not provided, it defaults to the value of config.speech_encoder_hidden_act. It can be either a string specifying a predefined activation function or a custom activation function.

TYPE: optional DEFAULT: None

dropout

The dropout probability for the intermediate layers. If not provided, it defaults to the value of config.speech_encoder_dropout.

TYPE: optional DEFAULT: None

RETURNS DESCRIPTION

None.

Note
  • The intermediate_dropout attribute is assigned an instance of nn.Dropout with p=dropout.
  • The intermediate_dense attribute is assigned an instance of nn.Linear with input size config.hidden_size and output size config.speech_encoder_intermediate_size.
  • The intermediate_act_fn attribute is assigned the activation function specified by act_fn. If act_fn is a string, it is mapped to the corresponding activation function from the ACT2FN dictionary. If act_fn is a custom function, it is directly assigned.
  • The output_dense attribute is assigned an instance of nn.Linear with input size config.speech_encoder_intermediate_size and output size config.hidden_size.
  • The output_dropout attribute is assigned an instance of nn.Dropout with p=dropout.
Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def __init__(self, config, act_fn=None, dropout=None):
    """
    Initializes an instance of the SeamlessM4Tv2ConformerFeedForward class.

    Args:
        self: The object instance.
        config: An object containing configuration parameters.
        act_fn (optional): The activation function to be used for the hidden layers.
            If not provided, it defaults to the value of config.speech_encoder_hidden_act.
            It can be either a string specifying a predefined activation function or a custom activation function.
        dropout (optional): The dropout probability for the intermediate layers.
            If not provided, it defaults to the value of config.speech_encoder_dropout.

    Returns:
        None.

    Raises:
        None.

    Note:
        - The intermediate_dropout attribute is assigned an instance of nn.Dropout with p=dropout.
        - The intermediate_dense attribute is assigned an instance of nn.Linear with input size config.hidden_size
        and output size config.speech_encoder_intermediate_size.
        - The intermediate_act_fn attribute is assigned the activation function specified by act_fn.
        If act_fn is a string, it is mapped to the corresponding activation function from the ACT2FN dictionary.
        If act_fn is a custom function, it is directly assigned.
        - The output_dense attribute is assigned an instance of nn.Linear with input size
        config.speech_encoder_intermediate_size and output size config.hidden_size.
        - The output_dropout attribute is assigned an instance of nn.Dropout with p=dropout.
    """
    super().__init__()
    dropout = dropout if dropout is not None else config.speech_encoder_dropout
    act_fn = act_fn if act_fn is not None else config.speech_encoder_hidden_act

    self.intermediate_dropout = nn.Dropout(p=dropout)
    self.intermediate_dense = nn.Linear(config.hidden_size, config.speech_encoder_intermediate_size)
    self.intermediate_act_fn = ACT2FN[act_fn] if isinstance(act_fn, str) else act_fn

    self.output_dense = nn.Linear(config.speech_encoder_intermediate_size, config.hidden_size)
    self.output_dropout = nn.Dropout(p=dropout)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerFeedForward.forward(hidden_states)

Constructs the feedforward layer in the SeamlessM4Tv2Conformer model.

PARAMETER DESCRIPTION
self

An instance of the SeamlessM4Tv2ConformerFeedForward class.

TYPE: SeamlessM4Tv2ConformerFeedForward

hidden_states

The input hidden states of shape (batch_size, sequence_length, hidden_size).

TYPE: Tensor

RETURNS DESCRIPTION

The transformed hidden states of shape (batch_size, sequence_length, hidden_size).

Description

This method applies a series of operations to the input hidden states to forward the feedforward layer in the SeamlessM4Tv2Conformer model. The operations include intermediate dense layer, activation function, dropout layer, and output dense layer. The resulting hidden states are returned.

  • intermediate_dense: Applies a linear transformation to the hidden states using the intermediate dense layer.
  • intermediate_act_fn: Applies the activation function to the intermediate dense outputs.
  • intermediate_dropout: Applies dropout to the intermediate outputs.
  • output_dense: Applies a linear transformation to the intermediate outputs using the output dense layer.
  • output_dropout: Applies dropout to the output dense outputs.

Note: The intermediate dense layer, activation function, dropout layers, and output dense layer must be defined before calling this method.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def forward(self, hidden_states):
    """
    Constructs the feedforward layer in the SeamlessM4Tv2Conformer model.

    Args:
        self (SeamlessM4Tv2ConformerFeedForward): An instance of the SeamlessM4Tv2ConformerFeedForward class.
        hidden_states (mindspore.Tensor): The input hidden states of shape (batch_size, sequence_length, hidden_size).

    Returns:
        mindspore.Tensor: The transformed hidden states of shape (batch_size, sequence_length, hidden_size).

    Raises:
        None

    Description:
        This method applies a series of operations to the input hidden states to forward the feedforward layer
        in the SeamlessM4Tv2Conformer model. The operations include intermediate dense layer, activation function,
        dropout layer, and output dense layer. The resulting hidden states are returned.

        - intermediate_dense: Applies a linear transformation to the hidden states using the intermediate dense layer.
        - intermediate_act_fn: Applies the activation function to the intermediate dense outputs.
        - intermediate_dropout: Applies dropout to the intermediate outputs.
        - output_dense: Applies a linear transformation to the intermediate outputs using the output dense layer.
        - output_dropout: Applies dropout to the output dense outputs.

        Note:
            The intermediate dense layer, activation function, dropout layers, and output dense layer must be defined
            before calling this method.
    """
    hidden_states = self.intermediate_dense(hidden_states)
    hidden_states = self.intermediate_act_fn(hidden_states)
    hidden_states = self.intermediate_dropout(hidden_states)

    hidden_states = self.output_dense(hidden_states)
    hidden_states = self.output_dropout(hidden_states)
    return hidden_states

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerSelfAttention

Bases: Module

Construct a SeamlessM4Tv2ConformerSelfAttention object. Can be enhanced with relative position embeddings.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
class SeamlessM4Tv2ConformerSelfAttention(nn.Module):
    """Construct a SeamlessM4Tv2ConformerSelfAttention object.
    Can be enhanced with relative position embeddings.
    """
    def __init__(self, config, use_position_embeddings=True):
        """
        Initializes a new instance of the SeamlessM4Tv2ConformerSelfAttention class.

        Args:
            self: The object itself.
            config: An instance of the configuration class that contains the model's configuration parameters.
            use_position_embeddings (bool): Whether to use position embeddings or not. Defaults to True.

        Returns:
            None

        Raises:
            None
        """
        super().__init__()

        self.head_size = config.hidden_size // config.speech_encoder_attention_heads
        self.num_heads = config.speech_encoder_attention_heads
        self.position_embeddings_type = config.position_embeddings_type if use_position_embeddings else None

        self.linear_q = nn.Linear(config.hidden_size, config.hidden_size)
        self.linear_k = nn.Linear(config.hidden_size, config.hidden_size)
        self.linear_v = nn.Linear(config.hidden_size, config.hidden_size)
        self.linear_out = nn.Linear(config.hidden_size, config.hidden_size)

        self.dropout = nn.Dropout(p=config.speech_encoder_dropout)

        if self.position_embeddings_type == "relative_key":
            self.left_max_position_embeddings = config.left_max_position_embeddings
            self.right_max_position_embeddings = config.right_max_position_embeddings
            num_positions = self.left_max_position_embeddings + self.right_max_position_embeddings + 1
            self.distance_embedding = nn.Embedding(num_positions, self.head_size)

    def forward(
        self,
        hidden_states: mindspore.Tensor,
        attention_mask: Optional[mindspore.Tensor] = None,
        output_attentions: bool = False,
    ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor]]:
        """
        Constructs the self-attention mechanism in the SeamlessM4Tv2ConformerSelfAttention class.

        Args:
            self (SeamlessM4Tv2ConformerSelfAttention): An instance of the SeamlessM4Tv2ConformerSelfAttention class.
            hidden_states (mindspore.Tensor): The input hidden states tensor of shape
                (batch_size, sequence_length, hidden_size).
            attention_mask (Optional[mindspore.Tensor]): An optional attention mask tensor of shape
                (batch_size, sequence_length, sequence_length). Defaults to None.
            output_attentions (bool): Indicates whether to output the attention weights. Defaults to False.

        Returns:
            Tuple[mindspore.Tensor, Optional[mindspore.Tensor]]:
                A tuple containing:

                - attn_output (mindspore.Tensor): The attention output tensor of shape
                (batch_size, sequence_length, hidden_size).
                - attn_weights (Optional[mindspore.Tensor]): The attention weights tensor of shape
                (batch_size, num_heads, sequence_length, sequence_length). None if output_attentions is False.

        Raises:
            None
        """
        # self-attention mechanism
        batch_size, _, _ = hidden_states.shape

        # make sure query/key states can be != value states
        query_key_states = hidden_states
        value_states = hidden_states

        # project query_key_states and value_states
        query = self.linear_q(query_key_states).view(batch_size, -1, self.num_heads, self.head_size)
        key = self.linear_k(query_key_states).view(batch_size, -1, self.num_heads, self.head_size)
        value = self.linear_v(value_states).view(batch_size, -1, self.num_heads, self.head_size)

        # => (batch, head, time1, d_k)
        query = query.swapaxes(1, 2)
        key = key.swapaxes(1, 2)
        value = value.swapaxes(1, 2)

        attn_weights = ops.matmul(query, key.swapaxes(-2, -1)) / math.sqrt(self.head_size)

        if self.position_embeddings_type == "relative_key":
            query_length, key_length = query.shape[2], key.shape[2]

            position_ids_l = ops.arange(query_length, dtype=mindspore.int64).view(-1, 1)
            position_ids_r = ops.arange(key_length, dtype=mindspore.int64).view(1, -1)
            distance = position_ids_r - position_ids_l
            distance = ops.clamp(distance, -self.left_max_position_embeddings, self.right_max_position_embeddings)

            positional_embedding = self.distance_embedding(distance + self.left_max_position_embeddings)
            positional_embedding = positional_embedding.to(dtype=query.dtype)  # fp16 compatibility

            relative_position_attn_weights = ops.einsum("bhld,lrd->bhlr", query, positional_embedding)
            attn_weights = attn_weights + (relative_position_attn_weights / math.sqrt(self.head_size))

        # apply attention_mask if necessary
        if attention_mask is not None:
            attn_weights = attn_weights + attention_mask

        # => (batch, head, time1, time2)
        attn_weights = ops.softmax(attn_weights, axis=-1)
        attn_weights = self.dropout(attn_weights)

        # => (batch, head, time1, d_k)
        attn_output = ops.matmul(attn_weights, value)

        # => (batch, time1, hidden_size)
        attn_output = attn_output.swapaxes(1, 2).reshape(batch_size, -1, self.num_heads * self.head_size)
        attn_output = self.linear_out(attn_output)

        if not output_attentions:
            attn_weights = None

        return attn_output, attn_weights
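
When `position_embeddings_type == "relative_key"`, an extra score term is looked up from `distance_embedding`, indexed by the clamped query-to-key offset. The stand-alone sketch below reproduces just that index computation with the same ops used in `forward`; the bounds 4 and 2 are illustrative stand-ins for `left_max_position_embeddings` and `right_max_position_embeddings`.

```python
# Sketch of the "relative_key" index computation used above (standalone; the
# bounds 4 and 2 are illustrative stand-ins for left/right_max_position_embeddings).
import mindspore
from mindspore import ops

query_length, key_length = 6, 6
left_max, right_max = 4, 2

position_ids_l = ops.arange(query_length, dtype=mindspore.int64).view(-1, 1)
position_ids_r = ops.arange(key_length, dtype=mindspore.int64).view(1, -1)

distance = position_ids_r - position_ids_l            # signed offset of each key from each query
distance = ops.clamp(distance, -left_max, right_max)  # clip to the supported relative window

# Shift into [0, left_max + right_max] so the result can index the
# nn.Embedding(left_max + right_max + 1, head_size) table.
indices = distance + left_max
print(indices)
```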

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerSelfAttention.__init__(config, use_position_embeddings=True)

Initializes a new instance of the SeamlessM4Tv2ConformerSelfAttention class.

PARAMETER DESCRIPTION
self

The object itself.

config

An instance of the configuration class that contains the model's configuration parameters.

use_position_embeddings

Whether to use position embeddings or not. Defaults to True.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def __init__(self, config, use_position_embeddings=True):
    """
    Initializes a new instance of the SeamlessM4Tv2ConformerSelfAttention class.

    Args:
        self: The object itself.
        config: An instance of the configuration class that contains the model's configuration parameters.
        use_position_embeddings (bool): Whether to use position embeddings or not. Defaults to True.

    Returns:
        None

    Raises:
        None
    """
    super().__init__()

    self.head_size = config.hidden_size // config.speech_encoder_attention_heads
    self.num_heads = config.speech_encoder_attention_heads
    self.position_embeddings_type = config.position_embeddings_type if use_position_embeddings else None

    self.linear_q = nn.Linear(config.hidden_size, config.hidden_size)
    self.linear_k = nn.Linear(config.hidden_size, config.hidden_size)
    self.linear_v = nn.Linear(config.hidden_size, config.hidden_size)
    self.linear_out = nn.Linear(config.hidden_size, config.hidden_size)

    self.dropout = nn.Dropout(p=config.speech_encoder_dropout)

    if self.position_embeddings_type == "relative_key":
        self.left_max_position_embeddings = config.left_max_position_embeddings
        self.right_max_position_embeddings = config.right_max_position_embeddings
        num_positions = self.left_max_position_embeddings + self.right_max_position_embeddings + 1
        self.distance_embedding = nn.Embedding(num_positions, self.head_size)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2ConformerSelfAttention.forward(hidden_states, attention_mask=None, output_attentions=False)

Constructs the self-attention mechanism in the SeamlessM4Tv2ConformerSelfAttention class.

PARAMETER DESCRIPTION
self

An instance of the SeamlessM4Tv2ConformerSelfAttention class.

TYPE: SeamlessM4Tv2ConformerSelfAttention

hidden_states

The input hidden states tensor of shape (batch_size, sequence_length, hidden_size).

TYPE: Tensor

attention_mask

An optional attention mask tensor of shape (batch_size, sequence_length, sequence_length). Defaults to None.

TYPE: Optional[Tensor] DEFAULT: None

output_attentions

Indicates whether to output the attention weights. Defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
Tuple[Tensor, Optional[Tensor]]

Tuple[mindspore.Tensor, Optional[mindspore.Tensor]]: A tuple containing:

  • attn_output (mindspore.Tensor): The attention output tensor of shape (batch_size, sequence_length, hidden_size).
  • attn_weights (Optional[mindspore.Tensor]): The attention weights tensor of shape (batch_size, num_heads, sequence_length, sequence_length). None if output_attentions is False.
Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
def forward(
    self,
    hidden_states: mindspore.Tensor,
    attention_mask: Optional[mindspore.Tensor] = None,
    output_attentions: bool = False,
) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor]]:
    """
    Constructs the self-attention mechanism in the SeamlessM4Tv2ConformerSelfAttention class.

    Args:
        self (SeamlessM4Tv2ConformerSelfAttention): An instance of the SeamlessM4Tv2ConformerSelfAttention class.
        hidden_states (mindspore.Tensor): The input hidden states tensor of shape
            (batch_size, sequence_length, hidden_size).
        attention_mask (Optional[mindspore.Tensor]): An optional attention mask tensor of shape
            (batch_size, sequence_length, sequence_length). Defaults to None.
        output_attentions (bool): Indicates whether to output the attention weights. Defaults to False.

    Returns:
        Tuple[mindspore.Tensor, Optional[mindspore.Tensor]]:
            A tuple containing:

            - attn_output (mindspore.Tensor): The attention output tensor of shape
            (batch_size, sequence_length, hidden_size).
            - attn_weights (Optional[mindspore.Tensor]): The attention weights tensor of shape
            (batch_size, num_heads, sequence_length, sequence_length). None if output_attentions is False.

    Raises:
        None
    """
    # self-attention mechanism
    batch_size, _, _ = hidden_states.shape

    # make sure query/key states can be != value states
    query_key_states = hidden_states
    value_states = hidden_states

    # project query_key_states and value_states
    query = self.linear_q(query_key_states).view(batch_size, -1, self.num_heads, self.head_size)
    key = self.linear_k(query_key_states).view(batch_size, -1, self.num_heads, self.head_size)
    value = self.linear_v(value_states).view(batch_size, -1, self.num_heads, self.head_size)

    # => (batch, head, time1, d_k)
    query = query.swapaxes(1, 2)
    key = key.swapaxes(1, 2)
    value = value.swapaxes(1, 2)

    attn_weights = ops.matmul(query, key.swapaxes(-2, -1)) / math.sqrt(self.head_size)

    if self.position_embeddings_type == "relative_key":
        query_length, key_length = query.shape[2], key.shape[2]

        position_ids_l = ops.arange(query_length, dtype=mindspore.int64).view(-1, 1)
        position_ids_r = ops.arange(key_length, dtype=mindspore.int64).view(1, -1)
        distance = position_ids_r - position_ids_l
        distance = ops.clamp(distance, -self.left_max_position_embeddings, self.right_max_position_embeddings)

        positional_embedding = self.distance_embedding(distance + self.left_max_position_embeddings)
        positional_embedding = positional_embedding.to(dtype=query.dtype)  # fp16 compatibility

        relative_position_attn_weights = ops.einsum("bhld,lrd->bhlr", query, positional_embedding)
        attn_weights = attn_weights + (relative_position_attn_weights / math.sqrt(self.head_size))

    # apply attention_mask if necessary
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask

    # => (batch, head, time1, time2)
    attn_weights = ops.softmax(attn_weights, axis=-1)
    attn_weights = self.dropout(attn_weights)

    # => (batch, head, time1, d_k)
    attn_output = ops.matmul(attn_weights, value)

    # => (batch, time1, hidden_size)
    attn_output = attn_output.swapaxes(1, 2).reshape(batch_size, -1, self.num_heads * self.head_size)
    attn_output = self.linear_out(attn_output)

    if not output_attentions:
        attn_weights = None

    return attn_output, attn_weights

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2Decoder

Bases: SeamlessM4Tv2PreTrainedModel

A Python class representing the SeamlessM4Tv2Decoder module of the SeamlessM4Tv2 model architecture.

This class inherits from the SeamlessM4Tv2PreTrainedModel class and implements the decoder component of the SeamlessM4Tv2 model. It consists of multiple decoder layers and includes functionality for embedding tokens, calculating positional embeddings, and performing self-attention and cross-attention operations.

ATTRIBUTE DESCRIPTION
config

The configuration object for the SeamlessM4Tv2Decoder module.

TYPE: SeamlessM4Tv2Config

dropout

The dropout probability for the decoder layers.

TYPE: float

layerdrop

The layer dropout probability for the decoder layers.

TYPE: float

padding_idx

The index of the padding token in the vocabulary.

TYPE: int

vocab_size

The size of the vocabulary.

TYPE: int

max_target_positions

The maximum number of target positions.

TYPE: int

embed_scale

The scale factor for the embedding layer.

TYPE: float

embed_tokens

The embedding layer for the input tokens.

TYPE: Embedding

embed_positions

The positional embedding layer.

TYPE: SeamlessM4Tv2SinusoidalPositionalEmbedding

layers

The list of decoder layers.

TYPE: ModuleList

layer_norm

The layer normalization module.

TYPE: LayerNorm

METHOD DESCRIPTION
__init__

Initializes the SeamlessM4Tv2Decoder module.

get_input_embeddings

Returns the input embeddings.

set_input_embeddings

Sets the input embeddings.

forward

Constructs the SeamlessM4Tv2Decoder module.

Please refer to the documentation of the parent class, SeamlessM4Tv2PreTrainedModel, for more details on the inherited attributes and methods.
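
Below is a minimal decoding sketch (an illustration, not part of the library documentation). It assumes `SeamlessM4Tv2Config` is importable from `mindnlp.transformers` and that its defaults build a usable standalone decoder; with `return_dict=True`, `forward` returns a `BaseModelOutputWithPastAndCrossAttentions` whose `last_hidden_state` has shape `(batch_size, target_length, hidden_size)`.

```python
# Sketch (assumptions: `SeamlessM4Tv2Config` is importable from `mindnlp.transformers`
# and its defaults build a usable standalone decoder; the token ids below are dummies).
import numpy as np
import mindspore
from mindspore import ops
from mindnlp.transformers import SeamlessM4Tv2Config
from mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2 import (
    SeamlessM4Tv2Decoder,
)

config = SeamlessM4Tv2Config()
decoder = SeamlessM4Tv2Decoder(config)

batch_size, tgt_len, src_len = 2, 7, 20
input_ids = mindspore.Tensor(np.ones((batch_size, tgt_len), dtype=np.int64))  # dummy token ids
encoder_hidden_states = ops.randn(batch_size, src_len, config.hidden_size)

outputs = decoder.forward(
    input_ids=input_ids,
    encoder_hidden_states=encoder_hidden_states,
    use_cache=False,
    return_dict=True,
)
print(outputs.last_hidden_state.shape)  # (2, 7, config.hidden_size)
```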

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py
class SeamlessM4Tv2Decoder(SeamlessM4Tv2PreTrainedModel):

    """
    A Python class representing the SeamlessM4Tv2Decoder module of the SeamlessM4Tv2 model architecture.

    This class inherits from the SeamlessM4Tv2PreTrainedModel class and implements the decoder component of the
    SeamlessM4Tv2 model. It consists of multiple decoder layers and includes functionality for embedding tokens,
    calculating positional embeddings, and performing self-attention and cross-attention operations.

    Attributes:
        config (SeamlessM4Tv2Config): The configuration object for the SeamlessM4Tv2Decoder module.
        dropout (float): The dropout probability for the decoder layers.
        layerdrop (float): The layer dropout probability for the decoder layers.
        padding_idx (int): The index of the padding token in the vocabulary.
        vocab_size (int): The size of the vocabulary.
        max_target_positions (int): The maximum number of target positions.
        embed_scale (float): The scale factor for the embedding layer.
        embed_tokens (nn.Embedding): The embedding layer for the input tokens.
        embed_positions (SeamlessM4Tv2SinusoidalPositionalEmbedding): The positional embedding layer.
        layers (nn.ModuleList): The list of decoder layers.
        layer_norm (nn.LayerNorm): The layer normalization module.

    Methods:
        __init__: Initializes the SeamlessM4Tv2Decoder module.
        get_input_embeddings: Returns the input embeddings.
        set_input_embeddings: Sets the input embeddings.
        forward: Constructs the SeamlessM4Tv2Decoder module.

    Please refer to the documentation of the parent class, SeamlessM4Tv2PreTrainedModel, for more details on the
    inherited attributes and methods.
    """
    def __init__(
        self,
        config: SeamlessM4Tv2Config,
        embed_tokens: Optional[nn.Embedding] = None,
    ):
        """Initialize the SeamlessM4Tv2Decoder.

        Args:
            self: The object itself.
            config (SeamlessM4Tv2Config): An instance of SeamlessM4Tv2Config containing configuration parameters
                for the decoder.
            embed_tokens (Optional[nn.Embedding]): An optional instance of nn.Embedding for token embedding.

        Returns:
            None.

        Raises:
            TypeError: If the config parameter is not an instance of SeamlessM4Tv2Config.
            ValueError: If the embed_tokens parameter is not None and is not an instance of nn.Embedding.
        """
        super().__init__(config)
        self.dropout = config.dropout
        self.layerdrop = config.decoder_layerdrop
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size
        self.max_target_positions = config.max_position_embeddings
        self.embed_scale = math.sqrt(config.hidden_size) if config.scale_embedding else 1.0

        if embed_tokens is not None:
            # if embed_tokens defined, use its shape instead
            self.embed_tokens = nn.Embedding(embed_tokens.vocab_size, embed_tokens.embedding_size, self.padding_idx)
            self.embed_tokens.weight = embed_tokens.weight
        else:
            self.embed_tokens = nn.Embedding(self.vocab_size, config.hidden_size, self.padding_idx)

        self.embed_positions = SeamlessM4Tv2SinusoidalPositionalEmbedding(
            self.max_target_positions,
            config.hidden_size,
            padding_idx=self.padding_idx,
        )

        layers = []
        for _ in range(config.decoder_layers):
            layers.append(
                SeamlessM4Tv2DecoderLayer(
                    config,
                    decoder_attention_heads=config.decoder_attention_heads,
                    decoder_ffn_dim=config.decoder_ffn_dim,
                )
            )
        self.layers = nn.ModuleList(layers)
        self.layer_norm = nn.LayerNorm([config.hidden_size])

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        Retrieves the input embeddings for the SeamlessM4Tv2Decoder.

        Args:
            self (SeamlessM4Tv2Decoder): An instance of the SeamlessM4Tv2Decoder class.

        Returns:
            None.

        Raises:
            None.
        """
        return self.embed_tokens

    def set_input_embeddings(self, value):
        """
        Sets the input embeddings for the SeamlessM4Tv2Decoder.

        Args:
            self (SeamlessM4Tv2Decoder): The instance of the SeamlessM4Tv2Decoder class.
            value: The input embeddings to be set. This should be a tensor or an instance of the Embedding class.

        Returns:
            None.

        Raises:
            None.
        """
        self.embed_tokens = value

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        encoder_hidden_states: Optional[mindspore.Tensor] = None,
        encoder_attention_mask: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]:
        r"""
        Args:
            input_ids (`mindspore.Tensor` of shape `(batch_size, sequence_length)`):
                Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
                provide it.

                Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
                [`PreTrainedTokenizer.__call__`] for details.

                [What are input IDs?](../glossary#input-ids)
            attention_mask (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.

                [What are attention masks?](../glossary#attention-mask)
            encoder_hidden_states (`mindspore.Tensor` of shape `(batch_size, encoder_sequence_length, hidden_size)`, *optional*):
                Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
                of the decoder.
            encoder_attention_mask (`mindspore.Tensor` of shape `(batch_size, encoder_sequence_length)`, *optional*):
                Mask to avoid performing cross-attention on padding tokens indices of encoder input_ids. Mask values
                selected in `[0, 1]`:

                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.

                [What are attention masks?](../glossary#attention-mask)
            past_key_values (`tuple(tuple(mindspore.Tensor))`, *optional*, returned when `use_cache=True` is passed
                or when `config.use_cache=True`):
                Tuple of `tuple(mindspore.Tensor)` of length `config.n_layers`, with each tuple having 2 tensors of
                shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of
                shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

                Contains pre-computed hidden-states (key and values in the self-attention blocks and in the
                cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

                If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those
                that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of
                all `decoder_input_ids` of shape `(batch_size, sequence_length)`.
            inputs_embeds (`mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
                Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
                This is useful if you want more control over how to convert `input_ids` indices into associated vectors
                than the model's internal embedding lookup matrix.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            output_hidden_states (`bool`, *optional*):
                Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
                for more detail.
            return_dict (`bool`, *optional*):
                Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # retrieve input_ids and inputs_embeds
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
        if input_ids is not None:
            input = input_ids
            input_shape = input.shape
            input_ids = input_ids.view(-1, input_shape[-1])
        elif inputs_embeds is not None:
            input_shape = inputs_embeds.shape[:-1]
            input = inputs_embeds[:, :, -1]
        else:
            raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")

        # past_key_values_length
        past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0

        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale

        attention_mask = _prepare_4d_causal_attention_mask(
            attention_mask, input_shape, inputs_embeds, past_key_values_length
        )

        # expand encoder attention mask
        if encoder_hidden_states is not None and encoder_attention_mask is not None:
            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
            encoder_attention_mask = _prepare_4d_attention_mask(
                encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]
            )

        # embed positions
        positions = self.embed_positions(input, past_key_values_length=past_key_values_length)

        hidden_states = inputs_embeds + positions

        hidden_states = ops.dropout(hidden_states, p=self.dropout, training=self.training)

        # decoder layers
        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
        all_cross_attentions = () if (output_attentions and encoder_hidden_states is not None) else None
        next_decoder_cache = () if use_cache else None

        for idx, decoder_layer in enumerate(self.layers):
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
            if output_hidden_states:
                all_hidden_states += (hidden_states,)
            if self.training:
                dropout_probability = ops.rand([])
                if dropout_probability < self.layerdrop:
                    continue

            past_key_value = past_key_values[idx] if past_key_values is not None else None

            layer_outputs = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                past_key_value=past_key_value,
                output_attentions=output_attentions,
                use_cache=use_cache,
            )
            hidden_states = layer_outputs[0]

            if use_cache:
                next_decoder_cache += (layer_outputs[1],)

            if output_attentions:
                all_self_attns += (layer_outputs[2],)

                if encoder_hidden_states is not None:
                    all_cross_attentions += (layer_outputs[3],)

        hidden_states = self.layer_norm(hidden_states)

        # add hidden states from the last decoder layer
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        next_cache = next_decoder_cache if use_cache else None
        if not return_dict:
            return tuple(
                v
                for v in [hidden_states, next_cache, all_hidden_states, all_self_attns, all_cross_attentions]
                if v is not None
            )
        return BaseModelOutputWithPastAndCrossAttentions(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
            cross_attentions=all_cross_attentions,
        )

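The `past_key_values` contract documented above is easiest to see with concrete shapes. The sketch below is plain Python (no MindSpore objects; all sizes are made-up example numbers): each layer caches two self-attention tensors whose sequence dimension grows by one per generated token plus two cross-attention tensors whose length stays fixed at the encoder length, and `past_key_values[0][0].shape[2]` is what the decoder reads back as `past_key_values_length`.

```python
# Plain-Python sketch of the cache layout described above; every "tensor" is
# represented only by its shape tuple (batch, num_heads, seq_len, head_dim).
n_layers, batch, heads, head_dim, enc_len = 2, 1, 16, 64, 10

def fake_layer_cache(self_len):
    self_k = self_v = (batch, heads, self_len, head_dim)   # grows while decoding
    cross_k = cross_v = (batch, heads, enc_len, head_dim)  # fixed encoder length
    return (self_k, self_v, cross_k, cross_v)

# First call: no cache, the full prefix (here 3 tokens) goes through the decoder.
past_key_values = None
past_len = past_key_values[0][0][2] if past_key_values is not None else 0
assert past_len == 0
past_key_values = tuple(fake_layer_cache(3) for _ in range(n_layers))

# Later calls: only the newest token is fed, the cache supplies the rest.
past_len = past_key_values[0][0][2]   # mirrors past_key_values[0][0].shape[2]
assert past_len == 3
past_key_values = tuple(fake_layer_cache(past_len + 1) for _ in range(n_layers))
assert past_key_values[0][0][2] == 4
```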
mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2Decoder.__init__(config, embed_tokens=None)

Initialize the SeamlessM4Tv2Decoder.

PARAMETER DESCRIPTION
self

The object itself.

config

An instance of SeamlessM4Tv2Config containing configuration parameters for the decoder.

TYPE: SeamlessM4Tv2Config

embed_tokens

An optional instance of nn.Embedding for token embedding.

TYPE: Optional[Embedding] DEFAULT: None

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the config parameter is not an instance of SeamlessM4Tv2Config.

ValueError

If the embed_tokens parameter is not None and is not an instance of nn.Embedding.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py, lines 2667-2721
def __init__(
    self,
    config: SeamlessM4Tv2Config,
    embed_tokens: Optional[nn.Embedding] = None,
):
    """Initialize the SeamlessM4Tv2Decoder.

    Args:
        self: The object itself.
        config (SeamlessM4Tv2Config): An instance of SeamlessM4Tv2Config containing configuration parameters
            for the decoder.
        embed_tokens (Optional[nn.Embedding]): An optional instance of nn.Embedding for token embedding.

    Returns:
        None.

    Raises:
        TypeError: If the config parameter is not an instance of SeamlessM4Tv2Config.
        ValueError: If the embed_tokens parameter is not None and is not an instance of nn.Embedding.
    """
    super().__init__(config)
    self.dropout = config.dropout
    self.layerdrop = config.decoder_layerdrop
    self.padding_idx = config.pad_token_id
    self.vocab_size = config.vocab_size
    self.max_target_positions = config.max_position_embeddings
    self.embed_scale = math.sqrt(config.hidden_size) if config.scale_embedding else 1.0

    if embed_tokens is not None:
        # if embed_tokens defined, use its shape instead
        self.embed_tokens = nn.Embedding(embed_tokens.vocab_size, embed_tokens.embedding_size, self.padding_idx)
        self.embed_tokens.weight = embed_tokens.weight
    else:
        self.embed_tokens = nn.Embedding(self.vocab_size, config.hidden_size, self.padding_idx)

    self.embed_positions = SeamlessM4Tv2SinusoidalPositionalEmbedding(
        self.max_target_positions,
        config.hidden_size,
        padding_idx=self.padding_idx,
    )

    layers = []
    for _ in range(config.decoder_layers):
        layers.append(
            SeamlessM4Tv2DecoderLayer(
                config,
                decoder_attention_heads=config.decoder_attention_heads,
                decoder_ffn_dim=config.decoder_ffn_dim,
            )
        )
    self.layers = nn.ModuleList(layers)
    self.layer_norm = nn.LayerNorm([config.hidden_size])

    # Initialize weights and apply final processing
    self.post_init()

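One detail worth calling out from `__init__`: when `config.scale_embedding` is true, the token embeddings are multiplied by `sqrt(hidden_size)` before the positional embeddings are added in `forward`. A quick numeric check (the hidden size here is just an assumed example value, not tied to any checkpoint):

```python
import math

hidden_size = 1024          # illustrative value of config.hidden_size
scale_embedding = True      # illustrative value of config.scale_embedding

embed_scale = math.sqrt(hidden_size) if scale_embedding else 1.0
print(embed_scale)          # 32.0 -> token embeddings are scaled by 32 before use
```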
mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2Decoder.forward(input_ids=None, attention_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_values=None, inputs_embeds=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
input_ids

Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

What are input IDs?

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)` DEFAULT: None

attention_mask

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

  • 1 for tokens that are not masked,
  • 0 for tokens that are masked.

What are attention masks?

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional* DEFAULT: None

encoder_hidden_states

Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention of the decoder.

TYPE: `mindspore.Tensor` of shape `(batch_size, encoder_sequence_length, hidden_size)`, *optional* DEFAULT: None

encoder_attention_mask

Mask to avoid performing cross-attention on padding tokens indices of encoder input_ids. Mask values selected in [0, 1]:

  • 1 for tokens that are not masked,
  • 0 for tokens that are masked.

What are attention masks?

TYPE: `mindspore.Tensor` of shape `(batch_size, encoder_sequence_length)`, *optional* DEFAULT: None

output_attentions

Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

TYPE: `bool`, *optional* DEFAULT: None

output_hidden_states

Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

TYPE: `bool`, *optional* DEFAULT: None

return_dict

Whether or not to return a [~utils.ModelOutput] instead of a plain tuple.

TYPE: `bool`, *optional* DEFAULT: None

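The 0/1 masks documented above are turned into additive 4-D masks before they reach the attention layers. The snippet below is a conceptual numpy sketch of that expansion, not the actual `_prepare_4d_attention_mask` implementation, and the fill value is only a stand-in for the dtype's most negative number: positions marked 0 receive a very large negative bias so softmax assigns them roughly zero weight.

```python
import numpy as np

# 0/1 padding mask of shape (batch, src_len); the last two source positions are padding
mask_2d = np.array([[1, 1, 1, 0, 0]], dtype=np.float32)

def expand_mask(mask, tgt_len, neg=-1e9):
    """Conceptual 2-D -> 4-D expansion: (batch, src_len) -> (batch, 1, tgt_len, src_len)."""
    bsz, src_len = mask.shape
    expanded = np.broadcast_to(mask[:, None, None, :], (bsz, 1, tgt_len, src_len))
    # keep (1) -> additive 0.0, masked (0) -> a very large negative bias
    return np.where(expanded > 0.5, 0.0, neg).astype(np.float32)

mask_4d = expand_mask(mask_2d, tgt_len=3)
print(mask_4d.shape)      # (1, 1, 3, 5)
print(mask_4d[0, 0, 0])   # approximately [0, 0, 0, -1e9, -1e9]
```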
Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py, lines 2754-2920
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    encoder_hidden_states: Optional[mindspore.Tensor] = None,
    encoder_attention_mask: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[Tuple[Tuple[mindspore.Tensor]]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPastAndCrossAttentions]:
    r"""
    Args:
        input_ids (`mindspore.Tensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
            provide it.

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        attention_mask (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            [What are attention masks?](../glossary#attention-mask)
        encoder_hidden_states (`mindspore.Tensor` of shape `(batch_size, encoder_sequence_length, hidden_size)`, *optional*):
            Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention
            of the decoder.
        encoder_attention_mask (`mindspore.Tensor` of shape `(batch_size, encoder_sequence_length)`, *optional*):
            Mask to avoid performing cross-attention on padding tokens indices of encoder input_ids. Mask values
            selected in `[0, 1]`:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            [What are attention masks?](../glossary#attention-mask)
        past_key_values (`tuple(tuple(mindspore.Tensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            Tuple of `tuple(mindspore.Tensor)` of length `config.n_layers`, with each tuple having 2 tensors of
            shape `(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional tensors of
            shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

            Contains pre-computed hidden-states (key and values in the self-attention blocks and in the
            cross-attention blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those
            that don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of
            all `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        inputs_embeds (`mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
            than the model's internal embedding lookup matrix.
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under
            returned tensors for more detail.
        output_hidden_states (`bool`, *optional*):
            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
            for more detail.
        return_dict (`bool`, *optional*):
            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    use_cache = use_cache if use_cache is not None else self.config.use_cache
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # retrieve input_ids and inputs_embeds
    if input_ids is not None and inputs_embeds is not None:
        raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
    if input_ids is not None:
        input = input_ids
        input_shape = input.shape
        input_ids = input_ids.view(-1, input_shape[-1])
    elif inputs_embeds is not None:
        input_shape = inputs_embeds.shape[:-1]
        input = inputs_embeds[:, :, -1]
    else:
        raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")

    # past_key_values_length
    past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0

    if inputs_embeds is None:
        inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale

    attention_mask = _prepare_4d_causal_attention_mask(
        attention_mask, input_shape, inputs_embeds, past_key_values_length
    )

    # expand encoder attention mask
    if encoder_hidden_states is not None and encoder_attention_mask is not None:
        # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
        encoder_attention_mask = _prepare_4d_attention_mask(
            encoder_attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]
        )

    # embed positions
    positions = self.embed_positions(input, past_key_values_length=past_key_values_length)

    hidden_states = inputs_embeds + positions

    hidden_states = ops.dropout(hidden_states, p=self.dropout, training=self.training)

    # decoder layers
    all_hidden_states = () if output_hidden_states else None
    all_self_attns = () if output_attentions else None
    all_cross_attentions = () if (output_attentions and encoder_hidden_states is not None) else None
    next_decoder_cache = () if use_cache else None

    for idx, decoder_layer in enumerate(self.layers):
        # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
        if output_hidden_states:
            all_hidden_states += (hidden_states,)
        if self.training:
            dropout_probability = ops.rand([])
            if dropout_probability < self.layerdrop:
                continue

        past_key_value = past_key_values[idx] if past_key_values is not None else None

        layer_outputs = decoder_layer(
            hidden_states,
            attention_mask=attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )
        hidden_states = layer_outputs[0]

        if use_cache:
            next_decoder_cache += (layer_outputs[1],)

        if output_attentions:
            all_self_attns += (layer_outputs[2],)

            if encoder_hidden_states is not None:
                all_cross_attentions += (layer_outputs[3],)

    hidden_states = self.layer_norm(hidden_states)

    # add hidden states from the last decoder layer
    if output_hidden_states:
        all_hidden_states += (hidden_states,)

    next_cache = next_decoder_cache if use_cache else None
    if not return_dict:
        return tuple(
            v
            for v in [hidden_states, next_cache, all_hidden_states, all_self_attns, all_cross_attentions]
            if v is not None
        )
    return BaseModelOutputWithPastAndCrossAttentions(
        last_hidden_state=hidden_states,
        past_key_values=next_cache,
        hidden_states=all_hidden_states,
        attentions=all_self_attns,
        cross_attentions=all_cross_attentions,
    )

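The decoder loop above also implements LayerDrop (Fan et al., 2019, linked in the source comments): during training each decoder layer is skipped with probability `config.decoder_layerdrop`, while at inference every layer always runs. A small stand-alone illustration of that control flow (the layer count and drop rate are arbitrary example values):

```python
import random

layerdrop = 0.1          # illustrative value of config.decoder_layerdrop
num_layers = 24          # illustrative value of config.decoder_layers
training = True

random.seed(0)
kept = []
for idx in range(num_layers):
    # mirrors: if self.training and ops.rand([]) < self.layerdrop: continue
    if training and random.random() < layerdrop:
        continue
    kept.append(idx)

print(f"{len(kept)}/{num_layers} layers executed this training step")
```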
mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2Decoder.get_input_embeddings()

Retrieves the input embeddings for the SeamlessM4Tv2Decoder.

PARAMETER DESCRIPTION
self

An instance of the SeamlessM4Tv2Decoder class.

TYPE: SeamlessM4Tv2Decoder

RETURNS DESCRIPTION

The `nn.Embedding` module stored in `self.embed_tokens`, which maps input token ids to embeddings.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py, lines 2723-2736
def get_input_embeddings(self):
    """
    Retrieves the input embeddings for the SeamlessM4Tv2Decoder.

    Args:
        self (SeamlessM4Tv2Decoder): An instance of the SeamlessM4Tv2Decoder class.

    Returns:
        nn.Embedding: The embedding module (`self.embed_tokens`) used for input tokens.

    Raises:
        None.
    """
    return self.embed_tokens

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2Decoder.set_input_embeddings(value)

Sets the input embeddings for the SeamlessM4Tv2Decoder.

PARAMETER DESCRIPTION
self

The instance of the SeamlessM4Tv2Decoder class.

TYPE: SeamlessM4Tv2Decoder

value

The input embeddings to be set. This should be a tensor or an instance of the Embedding class.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py, lines 2738-2752
def set_input_embeddings(self, value):
    """
    Sets the input embeddings for the SeamlessM4Tv2Decoder.

    Args:
        self (SeamlessM4Tv2Decoder): The instance of the SeamlessM4Tv2Decoder class.
        value: The input embeddings to be set. This should be a tensor or an instance of the Embedding class.

    Returns:
        None.

    Raises:
        None.
    """
    self.embed_tokens = value

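Taken together, `get_input_embeddings` and `set_input_embeddings` are the hooks used to swap or tie the decoder's token embedding table. The snippet below only demonstrates that contract with a throwaway stand-in class; the real objects would be a `SeamlessM4Tv2Decoder` and an `nn.Embedding` with a compatible vocabulary size.

```python
class _TinyDecoder:
    """Stand-in exposing the same two hooks as SeamlessM4Tv2Decoder (illustration only)."""
    def __init__(self, embed_tokens):
        self.embed_tokens = embed_tokens
    def get_input_embeddings(self):
        return self.embed_tokens
    def set_input_embeddings(self, value):
        self.embed_tokens = value

shared_table = object()            # placeholder for a shared nn.Embedding instance
decoder = _TinyDecoder(object())   # placeholder initial table
decoder.set_input_embeddings(shared_table)
assert decoder.get_input_embeddings() is shared_table
```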
mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2DecoderLayer

Bases: Module

This class represents a decoder layer of the SeamlessM4Tv2 model. It is used to process the input hidden states and generate the output hidden states for the decoder part of the model.

ATTRIBUTE DESCRIPTION
`embed_dim`

The dimension of the hidden states.

`self_attn`

The self-attention mechanism used in the decoder layer.

`dropout`

The dropout probability used in the decoder layer.

`activation_fn`

The activation function used in the decoder layer.

`attn_dropout`

The dropout probability used in the self-attention mechanism.

`self_attn_layer_norm`

The layer normalization applied to the self-attention output.

`cross_attention`

The cross-attention mechanism used in the decoder layer.

`cross_attention_layer_norm`

The layer normalization applied to the cross-attention output.

`ffn`

The feed-forward network used in the decoder layer.

`ffn_layer_norm`

The layer normalization applied to the feed-forward network output.

`ffn_dropout`

The dropout probability used in the feed-forward network.

METHOD DESCRIPTION
`forward`

Performs the forward pass of the decoder layer.

PARAMETER DESCRIPTION
`hidden_states`

The input hidden states of shape (batch, seq_len, embed_dim).

TYPE: `mindspore.Tensor`

`attention_mask`

The attention mask of size (batch, 1, tgt_len, src_len) where padding elements are indicated by very large negative values.

TYPE: `mindspore.Tensor`

`encoder_hidden_states`

The cross-attention input hidden states of shape (batch, seq_len, embed_dim).

TYPE: `mindspore.Tensor`

`encoder_attention_mask`

The encoder attention mask of size (batch, 1, tgt_len, src_len) where padding elements are indicated by very large negative values.

TYPE: `mindspore.Tensor`

`past_key_value`

The cached past key and value projection states.

TYPE: `Tuple(mindspore.Tensor)`

`output_attentions`

Whether or not to return the attentions tensors of all attention layers.

TYPE: `bool`, *optional*

`use_cache`

Whether or not to use the cached key and value projection states.

TYPE: `bool`, *optional*

RETURNS DESCRIPTION

outputs: A tuple containing the output hidden states and the present key and value projection states. If output_attentions is True, the tuple also contains the self-attention weights and the cross-attention weights.

Note

The attention weights are returned only if output_attentions is True.

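Before the source listing, here is a schematic of the layer's pre-norm residual ordering, written as plain Python with placeholder callables. It only illustrates the order of operations (self-attention, optional cross-attention, feed-forward, each wrapped in LayerNorm-then-residual); the toy lambdas and float "tensors" are stand-ins, not real sub-modules.

```python
def decoder_layer_schematic(x, enc, self_attn, cross_attn, ffn, norm1, norm2, norm3, drop):
    # 1) pre-norm self-attention with residual connection
    x = x + drop(self_attn(norm1(x)))
    # 2) pre-norm cross-attention over encoder states (skipped when enc is None)
    if enc is not None:
        x = x + drop(cross_attn(norm2(x), enc))
    # 3) pre-norm feed-forward network with residual connection
    x = x + drop(ffn(norm3(x)))
    return x

# toy "tensors" and sub-modules: everything is a float so the flow is runnable
out = decoder_layer_schematic(
    1.0, 2.0,
    self_attn=lambda h: 0.1 * h,
    cross_attn=lambda h, e: 0.1 * h * e,
    ffn=lambda h: 0.1 * h,
    norm1=lambda h: h, norm2=lambda h: h, norm3=lambda h: h,
    drop=lambda h: h,
)
print(out)
```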
Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py, lines 1711-1881
class SeamlessM4Tv2DecoderLayer(nn.Module):

    """
    This class represents a decoder layer of the SeamlessM4Tv2 model. It is used to process the input hidden states
    and generate the output hidden states for the decoder part of the model.

    Attributes:
        `embed_dim`: The dimension of the hidden states.
        `self_attn`: The self-attention mechanism used in the decoder layer.
        `dropout`: The dropout probability used in the decoder layer.
        `activation_fn`: The activation function used in the decoder layer.
        `attn_dropout`: The dropout probability used in the self-attention mechanism.
        `self_attn_layer_norm`: The layer normalization applied to the self-attention output.
        `cross_attention`: The cross-attention mechanism used in the decoder layer.
        `cross_attention_layer_norm`: The layer normalization applied to the cross-attention output.
        `ffn`: The feed-forward network used in the decoder layer.
        `ffn_layer_norm`: The layer normalization applied to the feed-forward network output.
        `ffn_dropout`: The dropout probability used in the feed-forward network.

    Methods:
        `forward`: Performs the forward pass of the decoder layer.

    Args:
        `hidden_states (mindspore.Tensor)`: The input hidden states of shape `(batch, seq_len, embed_dim)`.
        `attention_mask (mindspore.Tensor)`: The attention mask of size `(batch, 1, tgt_len, src_len)`
            where padding elements are indicated by very large negative values.
        `encoder_hidden_states (mindspore.Tensor)`:
            The cross-attention input hidden states of shape `(batch, seq_len, embed_dim)`.
        `encoder_attention_mask (mindspore.Tensor)`: The encoder attention mask of size `(batch, 1, tgt_len, src_len)`
            where padding elements are indicated by very large negative values.
        `past_key_value (Tuple(mindspore.Tensor))`: The cached past key and value projection states.
        `output_attentions (bool, optional)`: Whether or not to return the attentions tensors of all attention layers.
        `use_cache (bool, optional)`: Whether or not to use the cached key and value projection states.

    Returns:
        `outputs`: A tuple containing the output hidden states and the present key and value projection states.
            If `output_attentions` is `True`, the tuple also contains the self-attention weights and the
            cross-attention weights.

    Note:
        The attention weights are returned only if `output_attentions` is `True`.
    """
    def __init__(self, config: SeamlessM4Tv2Config, decoder_ffn_dim=None, decoder_attention_heads=None):
        """
        Initialize a decoder layer in the SeamlessM4Tv2 model.

        Args:
            self: The object instance.
            config (SeamlessM4Tv2Config): The configuration object for the SeamlessM4Tv2 model.
            decoder_ffn_dim (int, optional): The dimension of the feed-forward network in the decoder layer.
                Defaults to None.
            decoder_attention_heads (int, optional): The number of attention heads to use in the decoder layer.
                Defaults to None.

        Returns:
            None

        Raises:
            None
        """
        super().__init__()
        decoder_ffn_dim = config.decoder_ffn_dim if decoder_ffn_dim is None else decoder_ffn_dim
        decoder_attention_heads = (
            config.decoder_attention_heads if decoder_attention_heads is None else decoder_attention_heads
        )

        self.embed_dim = config.hidden_size
        self.self_attn = SeamlessM4Tv2Attention(
            embed_dim=self.embed_dim,
            num_heads=decoder_attention_heads,
            dropout=config.attention_dropout,
            is_decoder=True,
        )
        self.dropout = config.dropout
        self.activation_fn = ACT2FN[config.activation_function]
        self.attn_dropout = nn.Dropout(p=config.dropout)

        self.self_attn_layer_norm = nn.LayerNorm([self.embed_dim])
        self.cross_attention = SeamlessM4Tv2Attention(
            self.embed_dim, decoder_attention_heads, config.attention_dropout, is_decoder=True
        )
        self.cross_attention_layer_norm = nn.LayerNorm([self.embed_dim])

        self.ffn = SeamlessM4Tv2FeedForwardNetwork(config, ffn_dim=decoder_ffn_dim)

        self.ffn_layer_norm = nn.LayerNorm([config.hidden_size])
        self.ffn_dropout = nn.Dropout(p=config.activation_dropout)

    def forward(
        self,
        hidden_states: mindspore.Tensor,
        attention_mask: Optional[mindspore.Tensor] = None,
        encoder_hidden_states: Optional[mindspore.Tensor] = None,
        encoder_attention_mask: Optional[mindspore.Tensor] = None,
        past_key_value: Optional[Tuple[mindspore.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = True,
    ) -> mindspore.Tensor:
        """
        Args:
            hidden_states (`mindspore.Tensor`):
                input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`mindspore.Tensor`):
                attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very
                large negative values.
            encoder_hidden_states (`mindspore.Tensor`):
                cross attention input to the layer of shape `(batch, seq_len, embed_dim)`
            encoder_attention_mask (`mindspore.Tensor`):
                encoder attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by
                very large negative values.
            past_key_value (`Tuple(mindspore.Tensor)`):
                cached past key and value projection states
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
        """
        residual = hidden_states
        hidden_states = self.self_attn_layer_norm(hidden_states)

        # Self Attention
        # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
        self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
        # add present self-attn cache to positions 1,2 of present_key_value tuple
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            past_key_value=self_attn_past_key_value,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
        )
        hidden_states = self.attn_dropout(hidden_states)
        hidden_states = residual + hidden_states

        # Cross-Attention Block
        cross_attn_present_key_value = None
        cross_attn_weights = None
        if encoder_hidden_states is not None:
            residual = hidden_states
            hidden_states = self.cross_attention_layer_norm(hidden_states)

            # cross_attn cached key/values tuple is at positions 3,4 of present_key_value tuple
            cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None

            hidden_states, cross_attn_weights, cross_attn_present_key_value = self.cross_attention(
                hidden_states=hidden_states,
                encoder_hidden_states=encoder_hidden_states,
                past_key_value=cross_attn_past_key_value,
                attention_mask=encoder_attention_mask,
                output_attentions=output_attentions,
            )
            hidden_states = self.attn_dropout(hidden_states)
            hidden_states = residual + hidden_states

            # add cross-attn to positions 3,4 of present_key_value tuple
            present_key_value += cross_attn_present_key_value

        # Fully Connected
        residual = hidden_states

        hidden_states = self.ffn_layer_norm(hidden_states)

        hidden_states = self.ffn(hidden_states)
        hidden_states = self.ffn_dropout(hidden_states)

        hidden_states = residual + hidden_states

        outputs = (hidden_states, present_key_value)

        if output_attentions:
            outputs += (self_attn_weights, cross_attn_weights)

        return outputs

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2DecoderLayer.__init__(config, decoder_ffn_dim=None, decoder_attention_heads=None)

Initialize a decoder layer in the SeamlessM4Tv2 model.

PARAMETER DESCRIPTION
self

The object instance.

config

The configuration object for the SeamlessM4Tv2 model.

TYPE: SeamlessM4Tv2Config

decoder_ffn_dim

The dimension of the feed-forward network in the decoder layer. Defaults to None.

TYPE: int DEFAULT: None

decoder_attention_heads

The number of attention heads to use in the decoder layer. Defaults to None.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py, lines 1753-1797
def __init__(self, config: SeamlessM4Tv2Config, decoder_ffn_dim=None, decoder_attention_heads=None):
    """
    Initialize a decoder layer in the SeamlessM4Tv2 model.

    Args:
        self: The object instance.
        config (SeamlessM4Tv2Config): The configuration object for the SeamlessM4Tv2 model.
        decoder_ffn_dim (int, optional): The dimension of the feed-forward network in the decoder layer.
            Defaults to None.
        decoder_attention_heads (int, optional): The number of attention heads to use in the decoder layer.
            Defaults to None.

    Returns:
        None

    Raises:
        None
    """
    super().__init__()
    decoder_ffn_dim = config.decoder_ffn_dim if decoder_ffn_dim is None else decoder_ffn_dim
    decoder_attention_heads = (
        config.decoder_attention_heads if decoder_attention_heads is None else decoder_attention_heads
    )

    self.embed_dim = config.hidden_size
    self.self_attn = SeamlessM4Tv2Attention(
        embed_dim=self.embed_dim,
        num_heads=decoder_attention_heads,
        dropout=config.attention_dropout,
        is_decoder=True,
    )
    self.dropout = config.dropout
    self.activation_fn = ACT2FN[config.activation_function]
    self.attn_dropout = nn.Dropout(p=config.dropout)

    self.self_attn_layer_norm = nn.LayerNorm([self.embed_dim])
    self.cross_attention = SeamlessM4Tv2Attention(
        self.embed_dim, decoder_attention_heads, config.attention_dropout, is_decoder=True
    )
    self.cross_attention_layer_norm = nn.LayerNorm([self.embed_dim])

    self.ffn = SeamlessM4Tv2FeedForwardNetwork(config, ffn_dim=decoder_ffn_dim)

    self.ffn_layer_norm = nn.LayerNorm([config.hidden_size])
    self.ffn_dropout = nn.Dropout(p=config.activation_dropout)

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2DecoderLayer.forward(hidden_states, attention_mask=None, encoder_hidden_states=None, encoder_attention_mask=None, past_key_value=None, output_attentions=False, use_cache=True)

PARAMETER DESCRIPTION
hidden_states

input to the layer of shape (batch, seq_len, embed_dim)

TYPE: `mindspore.Tensor`

attention_mask

attention mask of size (batch, 1, tgt_len, src_len) where padding elements are indicated by very large negative values.

TYPE: `mindspore.Tensor` DEFAULT: None

encoder_hidden_states

cross attention input to the layer of shape (batch, seq_len, embed_dim)

TYPE: `mindspore.Tensor` DEFAULT: None

encoder_attention_mask

encoder attention mask of size (batch, 1, tgt_len, src_len) where padding elements are indicated by very large negative values.

TYPE: `mindspore.Tensor` DEFAULT: None

past_key_value

cached past key and value projection states

TYPE: `Tuple(mindspore.Tensor)` DEFAULT: None

output_attentions

Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

TYPE: `bool`, *optional* DEFAULT: False

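As the comments in the source below spell out, the layer's returned `present_key_value` packs the self-attention key/value states first and the cross-attention key/value states after them, so a cached entry can be split back apart with simple slicing. A tiny sketch of that bookkeeping (the entries are just labels, not tensors):

```python
# One layer's cache entry ends up as (self_k, self_v, cross_k, cross_v)
present_key_value = ("self_k", "self_v")            # produced by self-attention
cross_attn_present_key_value = ("cross_k", "cross_v")
present_key_value += cross_attn_present_key_value   # -> 4-tuple, as in the layer

# On the next step the layer slices the cached entry back apart:
self_attn_past_key_value = present_key_value[:2]     # ("self_k", "self_v")
cross_attn_past_key_value = present_key_value[-2:]   # ("cross_k", "cross_v")
print(self_attn_past_key_value, cross_attn_past_key_value)
```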
Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py, lines 1799-1881
def forward(
    self,
    hidden_states: mindspore.Tensor,
    attention_mask: Optional[mindspore.Tensor] = None,
    encoder_hidden_states: Optional[mindspore.Tensor] = None,
    encoder_attention_mask: Optional[mindspore.Tensor] = None,
    past_key_value: Optional[Tuple[mindspore.Tensor]] = None,
    output_attentions: Optional[bool] = False,
    use_cache: Optional[bool] = True,
) -> mindspore.Tensor:
    """
    Args:
        hidden_states (`mindspore.Tensor`):
            input to the layer of shape `(batch, seq_len, embed_dim)`
        attention_mask (`mindspore.Tensor`):
            attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very
            large negative values.
        encoder_hidden_states (`mindspore.Tensor`):
            cross attention input to the layer of shape `(batch, seq_len, embed_dim)`
        encoder_attention_mask (`mindspore.Tensor`):
            encoder attention mask of size `(batch, 1, tgt_len, src_len)` where padding elements are indicated by
            very large negative values.
        past_key_value (`Tuple(mindspore.Tensor)`):
            cached past key and value projection states
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under
            returned tensors for more detail.
    """
    residual = hidden_states
    hidden_states = self.self_attn_layer_norm(hidden_states)

    # Self Attention
    # decoder uni-directional self-attention cached key/values tuple is at positions 1,2
    self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
    # add present self-attn cache to positions 1,2 of present_key_value tuple
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
        hidden_states=hidden_states,
        past_key_value=self_attn_past_key_value,
        attention_mask=attention_mask,
        output_attentions=output_attentions,
    )
    hidden_states = self.attn_dropout(hidden_states)
    hidden_states = residual + hidden_states

    # Cross-Attention Block
    cross_attn_present_key_value = None
    cross_attn_weights = None
    if encoder_hidden_states is not None:
        residual = hidden_states
        hidden_states = self.cross_attention_layer_norm(hidden_states)

        # cross_attn cached key/values tuple is at positions 3,4 of present_key_value tuple
        cross_attn_past_key_value = past_key_value[-2:] if past_key_value is not None else None

        hidden_states, cross_attn_weights, cross_attn_present_key_value = self.cross_attention(
            hidden_states=hidden_states,
            encoder_hidden_states=encoder_hidden_states,
            past_key_value=cross_attn_past_key_value,
            attention_mask=encoder_attention_mask,
            output_attentions=output_attentions,
        )
        hidden_states = self.attn_dropout(hidden_states)
        hidden_states = residual + hidden_states

        # add cross-attn to positions 3,4 of present_key_value tuple
        present_key_value += cross_attn_present_key_value

    # Fully Connected
    residual = hidden_states

    hidden_states = self.ffn_layer_norm(hidden_states)

    hidden_states = self.ffn(hidden_states)
    hidden_states = self.ffn_dropout(hidden_states)

    hidden_states = residual + hidden_states

    outputs = (hidden_states, present_key_value)

    if output_attentions:
        outputs += (self_attn_weights, cross_attn_weights)

    return outputs

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2Encoder

Bases: SeamlessM4Tv2PreTrainedModel

The `SeamlessM4Tv2Encoder` class represents the encoder module of the SeamlessM4Tv2 model. It inherits from the `SeamlessM4Tv2PreTrainedModel` class.

Summary

The encoder embeds the input tokens, applies positional encoding, and passes the result through a stack of encoder layers to produce encoded representations of the input.

Constructor
>>> def __init__(self, config: SeamlessM4Tv2Config, embed_tokens: Optional[nn.Embedding] = None, is_t2u_encoder: bool = False):
>>>     super().__init__(config)
>>>     # Initializes parameters and attributes of the encoder
...
>>>     self.post_init()
Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py, lines 2429-2632
class SeamlessM4Tv2Encoder(SeamlessM4Tv2PreTrainedModel):

    """
    World Class Technical Documentation for SeamlessM4Tv2Encoder:

    The `SeamlessM4Tv2Encoder` class is a Python class that represents an encoder module in the SeamlessM4Tv2 model.
    This class inherits from the `SeamlessM4Tv2PreTrainedModel` class.

    Summary:
        The `SeamlessM4Tv2Encoder` class implements the encoder module of the SeamlessM4Tv2 model.
        It takes input tokens, applies embedding and positional encoding, and passes it through multiple encoder layers
        to generate encoded representations of the input.

    Constructor:
        ```python
        >>> def __init__(self, config: SeamlessM4Tv2Config, embed_tokens: Optional[nn.Embedding] = None, is_t2u_encoder: bool = False):
        >>>     super().__init__(config)
        >>>     # Initializes parameters and attributes of the encoder
        ...
        >>>     self.post_init()
        ```

    Methods:
        forward
    """
    def __init__(
        self,
        config: SeamlessM4Tv2Config,
        embed_tokens: Optional[nn.Embedding] = None,
        is_t2u_encoder: bool = False,
    ):
        """
        Initializes a new instance of the SeamlessM4Tv2Encoder class.

        Args:
            self (SeamlessM4Tv2Encoder): The instance of the class.
            config (SeamlessM4Tv2Config): The configuration object containing various settings.
            embed_tokens (Optional[nn.Embedding]): An optional pre-trained embedding layer.
            is_t2u_encoder (bool): A boolean value indicating whether the encoder is used for T2U (text-to-unit) conversion.

        Returns:
            None

        Raises:
            None
        """
        super().__init__(config)

        self.dropout = config.dropout
        self.layerdrop = config.encoder_layerdrop
        self.padding_idx = config.pad_token_id
        embed_dim = config.hidden_size

        self.is_t2u_encoder = is_t2u_encoder
        self.max_source_positions = config.max_position_embeddings

        if not self.is_t2u_encoder:
            self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0

            self.embed_tokens = nn.Embedding(config.vocab_size, embed_dim, self.padding_idx)

            if embed_tokens is not None:
                self.embed_tokens.weight = embed_tokens.weight

            self.embed_positions = SeamlessM4Tv2SinusoidalPositionalEmbedding(
                self.max_source_positions,
                embed_dim,
                self.padding_idx,
            )

        layers = []
        for _ in range(config.encoder_layers):
            layers.append(
                SeamlessM4Tv2EncoderLayer(
                    config,
                    encoder_attention_heads=config.encoder_attention_heads,
                    encoder_ffn_dim=config.encoder_ffn_dim,
                )
            )

        self.layers = nn.ModuleList(layers)

        self.layer_norm = nn.LayerNorm([config.hidden_size])

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        **kwargs,
    ) -> Union[Tuple, BaseModelOutput]:
        r"""
        Args:
            input_ids (`mindspore.Tensor` of shape `(batch_size, sequence_length)`):
                Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
                provide it.

                Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
                [`PreTrainedTokenizer.__call__`] for details.

                [What are input IDs?](../glossary#input-ids)
            attention_mask (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.

                [What are attention masks?](../glossary#attention-mask)
            inputs_embeds (`mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
                Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
                This is useful if you want more control over how to convert `input_ids` indices into associated vectors
                than the model's internal embedding lookup matrix.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            output_hidden_states (`bool`, *optional*):
                Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
                for more detail.
            return_dict (`bool`, *optional*):
                Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_ids is not None and self.is_t2u_encoder:
            raise ValueError(
                "You cannot pass input_ids to the encoder of the text_to_units model. Pass inputs_embeds instead."
            )

        # retrieve input_ids and inputs_embeds
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        if input_ids is not None:
            input = input_ids
            input_shape = input.shape
            input_ids = input_ids.view(-1, input_shape[-1])
        elif inputs_embeds is not None:
            input = inputs_embeds[:, :, -1]
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale

        if not self.is_t2u_encoder:
            embed_pos = self.embed_positions(input)

            hidden_states = inputs_embeds + embed_pos
        else:
            hidden_states = inputs_embeds

        hidden_states = ops.dropout(hidden_states, p=self.dropout, training=self.training)

        # expand attention_mask
        if attention_mask is not None:
            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
            attention_mask = _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype)

        encoder_states = () if output_hidden_states else None
        all_attentions = () if output_attentions else None

        for _, encoder_layer in enumerate(self.layers):
            if output_hidden_states:
                encoder_states = encoder_states + (hidden_states,)
            # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
            to_drop = False
            if self.training:
                dropout_probability = ops.rand([])
                if dropout_probability < self.layerdrop:  # skip the layer
                    to_drop = True

            if to_drop:
                layer_outputs = (None, None)
            else:
                layer_outputs = encoder_layer(
                    hidden_states,
                    attention_mask,
                    output_attentions=output_attentions,
                )

                hidden_states = layer_outputs[0]

            if output_attentions:
                all_attentions = all_attentions + (layer_outputs[1],)

        hidden_states = self.layer_norm(hidden_states)

        if output_hidden_states:
            encoder_states = encoder_states + (hidden_states,)

        if not return_dict:
            return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
        return BaseModelOutput(
            last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions
        )

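One behavioural note visible in `forward` above: when the encoder is built with `is_t2u_encoder=True` it has no token embedding table, so passing `input_ids` raises a `ValueError` and callers must supply `inputs_embeds` instead. The guard itself is simple; below is a stripped-down reproduction of just that branching logic, not the real class.

```python
def check_encoder_inputs(input_ids, inputs_embeds, is_t2u_encoder):
    """Mirrors the input validation at the top of SeamlessM4Tv2Encoder.forward."""
    if input_ids is not None and is_t2u_encoder:
        raise ValueError(
            "You cannot pass input_ids to the encoder of the text_to_units model. "
            "Pass inputs_embeds instead."
        )
    if input_ids is not None and inputs_embeds is not None:
        raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
    if input_ids is None and inputs_embeds is None:
        raise ValueError("You have to specify either input_ids or inputs_embeds")

check_encoder_inputs(input_ids=None, inputs_embeds="embeds", is_t2u_encoder=True)   # accepted
try:
    check_encoder_inputs(input_ids="ids", inputs_embeds=None, is_t2u_encoder=True)
except ValueError as err:
    print(err)   # the t2u encoder rejects input_ids
```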
mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2Encoder.__init__(config, embed_tokens=None, is_t2u_encoder=False)

Initializes a new instance of the SeamlessM4Tv2Encoder class.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: SeamlessM4Tv2Encoder

config

The configuration object containing various settings.

TYPE: SeamlessM4Tv2Config

embed_tokens

An optional pre-trained embedding layer.

TYPE: Optional[Embedding] DEFAULT: None

is_t2u_encoder

A boolean value indicating whether the encoder is used for T2U (text-to-unit) conversion.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py, lines 2454-2514
def __init__(
    self,
    config: SeamlessM4Tv2Config,
    embed_tokens: Optional[nn.Embedding] = None,
    is_t2u_encoder: bool = False,
):
    """
    Initializes a new instance of the SeamlessM4Tv2Encoder class.

    Args:
        self (SeamlessM4Tv2Encoder): The instance of the class.
        config (SeamlessM4Tv2Config): The configuration object containing various settings.
        embed_tokens (Optional[nn.Embedding]): An optional pre-trained embedding layer.
        is_t2u_encoder (bool): A boolean value indicating whether the encoder is used for T2U (text-to-unit) conversion.

    Returns:
        None

    Raises:
        None
    """
    super().__init__(config)

    self.dropout = config.dropout
    self.layerdrop = config.encoder_layerdrop
    self.padding_idx = config.pad_token_id
    embed_dim = config.hidden_size

    self.is_t2u_encoder = is_t2u_encoder
    self.max_source_positions = config.max_position_embeddings

    if not self.is_t2u_encoder:
        self.embed_scale = math.sqrt(embed_dim) if config.scale_embedding else 1.0

        self.embed_tokens = nn.Embedding(config.vocab_size, embed_dim, self.padding_idx)

        if embed_tokens is not None:
            self.embed_tokens.weight = embed_tokens.weight

        self.embed_positions = SeamlessM4Tv2SinusoidalPositionalEmbedding(
            self.max_source_positions,
            embed_dim,
            self.padding_idx,
        )

    layers = []
    for _ in range(config.encoder_layers):
        layers.append(
            SeamlessM4Tv2EncoderLayer(
                config,
                encoder_attention_heads=config.encoder_attention_heads,
                encoder_ffn_dim=config.encoder_ffn_dim,
            )
        )

    self.layers = nn.ModuleList(layers)

    self.layer_norm = nn.LayerNorm([config.hidden_size])

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2Encoder.forward(input_ids=None, attention_mask=None, inputs_embeds=None, output_attentions=None, output_hidden_states=None, return_dict=None, **kwargs)

PARAMETER DESCRIPTION
input_ids

Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide it.

Indices can be obtained using [AutoTokenizer]. See [PreTrainedTokenizer.encode] and [PreTrainedTokenizer.__call__] for details.

What are input IDs?

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)` DEFAULT: None

attention_mask

Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:

  • 1 for tokens that are not masked,
  • 0 for tokens that are masked.

What are attention masks?

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional* DEFAULT: None

inputs_embeds

Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional* DEFAULT: None

output_attentions

Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

TYPE: `bool`, *optional* DEFAULT: None

output_hidden_states

Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more detail.

TYPE: `bool`, *optional* DEFAULT: None

return_dict

Whether or not to return a [~utils.ModelOutput] instead of a plain tuple.

TYPE: `bool`, *optional* DEFAULT: None

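When `return_dict=False`, the encoder (like the decoder above) returns a plain tuple that simply drops whichever optional outputs were not requested, so the tuple length depends on `output_hidden_states` and `output_attentions`. A small sketch of that filtering step, using strings in place of tensors:

```python
def pack_outputs(last_hidden_state, encoder_states=None, all_attentions=None):
    # mirrors: tuple(v for v in [...] if v is not None)
    return tuple(v for v in [last_hidden_state, encoder_states, all_attentions] if v is not None)

print(len(pack_outputs("h")))                                # 1: only last_hidden_state
print(len(pack_outputs("h", encoder_states=("h0", "h1"))))   # 2: hidden states requested too
```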
Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py, lines 2516-2632
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
    **kwargs,
) -> Union[Tuple, BaseModelOutput]:
    r"""
    Args:
        input_ids (`mindspore.Tensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you
            provide it.

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        attention_mask (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            [What are attention masks?](../glossary#attention-mask)
        inputs_embeds (`mindspore.Tensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
            This is useful if you want more control over how to convert `input_ids` indices into associated vectors
            than the model's internal embedding lookup matrix.
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under
            returned tensors for more detail.
        output_hidden_states (`bool`, *optional*):
            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
            for more detail.
        return_dict (`bool`, *optional*):
            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    if input_ids is not None and self.is_t2u_encoder:
        raise ValueError(
            "You cannot pass input_ids to the encoder of the text_to_units model. Pass inputs_embeds instead."
        )

    # retrieve input_ids and inputs_embeds
    if input_ids is not None and inputs_embeds is not None:
        raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
    if input_ids is not None:
        input = input_ids
        input_shape = input.shape
        input_ids = input_ids.view(-1, input_shape[-1])
    elif inputs_embeds is not None:
        input = inputs_embeds[:, :, -1]
    else:
        raise ValueError("You have to specify either input_ids or inputs_embeds")

    if inputs_embeds is None:
        inputs_embeds = self.embed_tokens(input_ids) * self.embed_scale

    if not self.is_t2u_encoder:
        embed_pos = self.embed_positions(input)

        hidden_states = inputs_embeds + embed_pos
    else:
        hidden_states = inputs_embeds

    hidden_states = ops.dropout(hidden_states, p=self.dropout, training=self.training)

    # expand attention_mask
    if attention_mask is not None:
        # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
        attention_mask = _prepare_4d_attention_mask(attention_mask, inputs_embeds.dtype)

    encoder_states = () if output_hidden_states else None
    all_attentions = () if output_attentions else None

    for _, encoder_layer in enumerate(self.layers):
        if output_hidden_states:
            encoder_states = encoder_states + (hidden_states,)
        # add LayerDrop (see https://arxiv.org/abs/1909.11556 for description)
        to_drop = False
        if self.training:
            dropout_probability = ops.rand([])
            if dropout_probability < self.layerdrop:  # skip the layer
                to_drop = True

        if to_drop:
            layer_outputs = (None, None)
        else:
            layer_outputs = encoder_layer(
                hidden_states,
                attention_mask,
                output_attentions=output_attentions,
            )

            hidden_states = layer_outputs[0]

        if output_attentions:
            all_attentions = all_attentions + (layer_outputs[1],)

    hidden_states = self.layer_norm(hidden_states)

    if output_hidden_states:
        encoder_states = encoder_states + (hidden_states,)

    if not return_dict:
        return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
    return BaseModelOutput(
        last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions
    )
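
The comment in the code above notes the mask expansion from [bsz, seq_len] to [bsz, 1, tgt_seq_len, src_seq_len]. The sketch below illustrates that transformation in isolation; it is not the library's _prepare_4d_attention_mask helper, and it uses -1e9 where the real code uses the dtype's minimum value.

import mindspore

# Binary padding mask: 1 = attend, 0 = padding (one sequence of length 4,
# with the last position padded).
attention_mask = mindspore.Tensor([[1, 1, 1, 0]], dtype=mindspore.float32)
bsz, src_len = attention_mask.shape

# Broadcast to [bsz, 1, tgt_seq_len, src_seq_len] and invert into an additive bias:
# kept positions contribute 0.0, padded positions a large negative value that
# softmax turns into ~0 attention probability.
expanded = attention_mask.reshape(bsz, 1, 1, src_len).broadcast_to((bsz, 1, src_len, src_len))
additive_mask = (1.0 - expanded) * -1e9

print(additive_mask.shape)  # (1, 1, 4, 4)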

mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2.SeamlessM4Tv2EncoderLayer

Bases: Module

This class represents an encoder layer for the SeamlessM4Tv2 model. It inherits from the nn.Module class.

The encoder layer performs multi-head self-attention and feed-forward network operations on the input hidden states.

ATTRIBUTE DESCRIPTION
embed_dim

The dimension of the hidden states.

TYPE: int

self_attn

The self-attention module for the encoder layer.

TYPE: SeamlessM4Tv2Attention

attn_dropout

Dropout layer for attention weights.

TYPE: Dropout

self_attn_layer_norm

Layer normalization for the hidden states after self-attention.

TYPE: LayerNorm

ffn

The feed-forward network module for the encoder layer.

TYPE: SeamlessM4Tv2FeedForwardNetwork

ffn_layer_norm

Layer normalization for the hidden states after feed-forward network.

TYPE: LayerNorm

ffn_dropout

Dropout layer for the feed-forward network output.

TYPE: Dropout

METHOD DESCRIPTION
forward

Performs the forward pass of the encoder layer.

Args:

  • hidden_states (mindspore.Tensor): Input hidden states of shape (batch, seq_len, embed_dim).
  • attention_mask (mindspore.Tensor): Attention mask of size (batch, 1, tgt_len, src_len) where padding elements are indicated by very large negative values.
  • output_attentions (bool, optional): Whether to output attention weights. Defaults to False.

Returns:

  • outputs (tuple): A tuple containing the computed hidden states. If output_attentions=True, the tuple also contains attention weights.
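
For orientation, here is a hedged sketch of exercising a single encoder layer in isolation; in practice the layer is constructed and called by SeamlessM4Tv2Encoder. The SeamlessM4Tv2Config export and its default hyperparameters are assumptions made purely for illustration.

from mindspore import ops
from mindnlp.transformers import SeamlessM4Tv2Config  # assumed export
from mindnlp.transformers.models.seamless_m4t_v2.modeling_seamless_m4t_v2 import (
    SeamlessM4Tv2EncoderLayer,
)

config = SeamlessM4Tv2Config()             # default hyperparameters, illustration only
layer = SeamlessM4Tv2EncoderLayer(config)

hidden_states = ops.randn(2, 16, config.hidden_size)  # (batch, seq_len, embed_dim)
outputs = layer(hidden_states, None, output_attentions=True)

hidden_states, attn_weights = outputs      # tuple layout described under Returns above
print(hidden_states.shape, attn_weights.shape)
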
Source code in mindnlp/transformers/models/seamless_m4t_v2/modeling_seamless_m4t_v2.py, lines 1597-1707
class SeamlessM4Tv2EncoderLayer(nn.Module):

    """
    This class represents an encoder layer for the SeamlessM4Tv2 model. It inherits from the nn.Module class.

    The encoder layer performs multi-head self-attention and feed-forward network operations on the input hidden states.

    Attributes:
        embed_dim (int): The dimension of the hidden states.
        self_attn (SeamlessM4Tv2Attention): The self-attention module for the encoder layer.
        attn_dropout (nn.Dropout): Dropout layer for attention weights.
        self_attn_layer_norm (nn.LayerNorm): Layer normalization for the hidden states after self-attention.
        ffn (SeamlessM4Tv2FeedForwardNetwork): The feed-forward network module for the encoder layer.
        ffn_layer_norm (nn.LayerNorm): Layer normalization for the hidden states after feed-forward network.
        ffn_dropout (nn.Dropout): Dropout layer for the feed-forward network output.

    Methods:
        forward(hidden_states, attention_mask, output_attentions=False):
            Performs the forward pass of the encoder layer.