qwen2_moe

mindnlp.transformers.models.qwen2_moe.configuration_qwen2_moe

Qwen2MoE model configuration

mindnlp.transformers.models.qwen2_moe.configuration_qwen2_moe.Qwen2MoeConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [Qwen2MoeModel]. It is used to instantiate a Qwen2MoE model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of Qwen1.5-MoE-A2.7B (Qwen/Qwen1.5-MoE-A2.7B).

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vocab_size

Vocabulary size of the Qwen2MoE model. Defines the number of different tokens that can be represented by the input_ids passed when calling [Qwen2MoeModel].

TYPE: `int`, *optional*, defaults to 151936 DEFAULT: 151936

hidden_size

Dimension of the hidden representations.

TYPE: `int`, *optional*, defaults to 2048 DEFAULT: 2048

intermediate_size

Dimension of the MLP representations.

TYPE: `int`, *optional*, defaults to 5632 DEFAULT: 5632

num_hidden_layers

Number of hidden layers in the Transformer decoder.

TYPE: `int`, *optional*, defaults to 24 DEFAULT: 24

num_attention_heads

Number of attention heads for each attention layer in the Transformer decoder.

TYPE: `int`, *optional*, defaults to 16 DEFAULT: 16

num_key_value_heads

The number of key/value heads used to implement Grouped Query Attention (GQA). If num_key_value_heads=num_attention_heads, the model uses Multi Head Attention (MHA); if num_key_value_heads=1, it uses Multi Query Attention (MQA); otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed by mean-pooling all the original heads within that group. For more details, see the GQA paper (https://arxiv.org/pdf/2305.13245.pdf). If not specified, defaults to 16.

TYPE: `int`, *optional*, defaults to 16 DEFAULT: 16

hidden_act

The non-linear activation function (function or string) in the decoder.

TYPE: `str` or `function`, *optional*, defaults to `"silu"` DEFAULT: 'silu'

max_position_embeddings

The maximum sequence length that this model might ever be used with.

TYPE: `int`, *optional*, defaults to 32768 DEFAULT: 32768

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

rms_norm_eps

The epsilon used by the rms normalization layers.

TYPE: `float`, *optional*, defaults to 1e-06 DEFAULT: 1e-06

use_cache

Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

tie_word_embeddings

Whether the model's input and output word embeddings should be tied.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

rope_theta

The base period of the RoPE embeddings.

TYPE: `float`, *optional*, defaults to 10000.0 DEFAULT: 10000.0

use_sliding_window

Whether to use sliding window attention.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

sliding_window

Sliding window attention (SWA) window size. If not specified, will default to 4096.

TYPE: `int`, *optional*, defaults to 4096 DEFAULT: 4096

max_window_layers

The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top use full attention.

TYPE: `int`, *optional*, defaults to 28 DEFAULT: 28

attention_dropout

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

decoder_sparse_step

The frequency of the MoE layer.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

moe_intermediate_size

Intermediate size of the routed expert.

TYPE: `int`, *optional*, defaults to 1408 DEFAULT: 1408

shared_expert_intermediate_size

Intermediate size of the shared expert.

TYPE: `int`, *optional*, defaults to 5632 DEFAULT: 5632

num_experts_per_tok

Number of selected experts.

TYPE: `int`, *optional*, defaults to 4 DEFAULT: 4

num_experts

Number of routed experts.

TYPE: `int`, *optional*, defaults to 60 DEFAULT: 60

norm_topk_prob

Whether to normalize the topk probabilities.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

output_router_logits

Whether or not the router logits should be returned by the model. Enabling this will also allow the model to output the auxiliary loss, including load balancing loss and router z-loss.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

router_aux_loss_coef

The aux loss factor for the total loss.

TYPE: `float`, *optional*, defaults to 0.001 DEFAULT: 0.001
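The role of router_aux_loss_coef can be illustrated with a small sketch. The load-balancing formula used here (number of experts times the sum of token fraction times mean router probability, in the style of the Switch Transformer) is an assumption for illustration only; this page does not state which formula the model uses, and the function names are hypothetical.

```python
def load_balancing_loss(expert_fraction, mean_router_prob):
    """Assumed auxiliary loss: num_experts * sum_i f_i * P_i.

    expert_fraction[i]  - fraction of tokens dispatched to expert i
    mean_router_prob[i] - mean router probability assigned to expert i
    """
    num_experts = len(expert_fraction)
    return num_experts * sum(f * p for f, p in zip(expert_fraction, mean_router_prob))


def total_loss(lm_loss, aux_loss, router_aux_loss_coef=0.001):
    """Add the auxiliary loss to the main loss, scaled by the coefficient."""
    return lm_loss + router_aux_loss_coef * aux_loss


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

balanced_aux = load_balancing_loss(uniform, uniform)
skewed_aux = load_balancing_loss(skewed, skewed)

print(balanced_aux, skewed_aux)
print(total_loss(2.0, balanced_aux), total_loss(2.0, skewed_aux))
```

Under this assumed formula, a perfectly uniform routing gives the smallest auxiliary value, so a larger coefficient pushes the router harder towards balanced expert usage.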

Example
>>> from mindnlp.transformers import Qwen2MoeModel, Qwen2MoeConfig
...
>>> # Initializing a Qwen2MoE style configuration
>>> configuration = Qwen2MoeConfig()
...
>>> # Initializing a model from the Qwen1.5-MoE-A2.7B style configuration
>>> model = Qwen2MoeModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
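The relationship between num_attention_heads and num_key_value_heads described in the parameter list can be sketched in plain Python. The helper name attention_variant is hypothetical and not part of mindnlp, and the divisibility check is an assumption added for clarity rather than something the configuration class enforces.

```python
from typing import Tuple


def attention_variant(num_attention_heads: int, num_key_value_heads: int) -> Tuple[str, int]:
    """Return (variant name, number of query heads sharing one key/value head)."""
    # Assumption: the head counts must divide evenly for the grouping to make sense.
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    group_size = num_attention_heads // num_key_value_heads
    if num_key_value_heads == num_attention_heads:
        return "MHA", group_size
    if num_key_value_heads == 1:
        return "MQA", group_size
    return "GQA", group_size


print(attention_variant(16, 16))  # the defaults of this configuration
print(attention_variant(16, 4))
print(attention_variant(16, 1))
```

With the default values (16 attention heads, 16 key/value heads) every query head has its own key/value head, which corresponds to standard multi-head attention.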
Source code in mindnlp/transformers/models/qwen2_moe/configuration_qwen2_moe.py
class Qwen2MoeConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`Qwen2MoeModel`]. It is used to instantiate a
    Qwen2MoE model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of
    Qwen1.5-MoE-A2.7B [Qwen/Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B).

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 151936):
            Vocabulary size of the Qwen2MoE model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`Qwen2MoeModel`].
        hidden_size (`int`, *optional*, defaults to 2048):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 5632):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 24):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 16):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*, defaults to 16):
            The number of key/value heads used to implement Grouped Query Attention (GQA). If
            `num_key_value_heads=num_attention_heads`, the model uses Multi Head Attention (MHA); if
            `num_key_value_heads=1`, it uses Multi Query Attention (MQA); otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be
            constructed by mean-pooling all the original heads within that group. For more details, check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If not specified, defaults to `16`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 32768):
            The maximum sequence length that this model might ever be used with.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether the model's input and output word embeddings should be tied.
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        use_sliding_window (`bool`, *optional*, defaults to `False`):
            Whether to use sliding window attention.
        sliding_window (`int`, *optional*, defaults to 4096):
            Sliding window attention (SWA) window size. If not specified, will default to `4096`.
        max_window_layers (`int`, *optional*, defaults to 28):
            The number of layers that use SWA (Sliding Window Attention). The bottom layers use SWA while the top use full attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        decoder_sparse_step (`int`, *optional*, defaults to 1):
            The frequency of the MoE layer.
        moe_intermediate_size (`int`, *optional*, defaults to 1408):
            Intermediate size of the routed expert.
        shared_expert_intermediate_size (`int`, *optional*, defaults to 5632):
            Intermediate size of the shared expert.
        num_experts_per_tok (`int`, *optional*, defaults to 4):
            Number of selected experts.
        num_experts (`int`, *optional*, defaults to 60):
            Number of routed experts.
        norm_topk_prob (`bool`, *optional*, defaults to `False`):
            Whether to normalize the topk probabilities.
        output_router_logits (`bool`, *optional*, defaults to `False`):
            Whether or not the router logits should be returned by the model. Enabling this will also
            allow the model to output the auxiliary loss, including load balancing loss and router z-loss.
        router_aux_loss_coef (`float`, *optional*, defaults to 0.001):
            The aux loss factor for the total loss.

    Example:
        ```python
        >>> from mindnlp.transformers import Qwen2MoeModel, Qwen2MoeConfig
        ...
        >>> # Initializing a Qwen2MoE style configuration
        >>> configuration = Qwen2MoeConfig()
        ...
        >>> # Initializing a model from the Qwen1.5-MoE-A2.7B style configuration
        >>> model = Qwen2MoeModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "qwen2_moe"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=151936,
        hidden_size=2048,
        intermediate_size=5632,
        num_hidden_layers=24,
        num_attention_heads=16,
        num_key_value_heads=16,
        hidden_act="silu",
        max_position_embeddings=32768,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        tie_word_embeddings=False,
        rope_theta=10000.0,
        use_sliding_window=False,
        sliding_window=4096,
        max_window_layers=28,
        attention_dropout=0.0,
        decoder_sparse_step=1,
        moe_intermediate_size=1408,
        shared_expert_intermediate_size=5632,
        num_experts_per_tok=4,
        num_experts=60,
        norm_topk_prob=False,
        output_router_logits=False,
        router_aux_loss_coef=0.001,
        **kwargs,
    ):
        """
        Initializes a Qwen2MoeConfig object with the specified parameters.

        Args:
            vocab_size (int): The size of the vocabulary.
            hidden_size (int): The size of the hidden layers.
            intermediate_size (int): The size of the intermediate layers.
            num_hidden_layers (int): The number of hidden layers.
            num_attention_heads (int): The number of attention heads.
            num_key_value_heads (int): The number of key and value heads.
            hidden_act (str): The activation function for the hidden layers.
            max_position_embeddings (int): The maximum position embeddings.
            initializer_range (float): The range for weight initialization.
            rms_norm_eps (float): The epsilon value for RMS normalization.
            use_cache (bool): Whether to use caching.
            tie_word_embeddings (bool): Whether to tie word embeddings.
            rope_theta (float): The theta value for rope.
            use_sliding_window (bool): Whether to use sliding window.
            sliding_window (int): The size of the sliding window.
            max_window_layers (int): The maximum number of window layers.
            attention_dropout (float): The dropout rate for attention.
            decoder_sparse_step (int): The step size for decoder sparsity.
            moe_intermediate_size (int): The size of intermediate layers for Mixture of Experts.
            shared_expert_intermediate_size (int): The size of shared expert intermediate layers.
            num_experts_per_tok (int): The number of experts per token.
            num_experts (int): The total number of experts.
            norm_topk_prob (bool): Whether to normalize top-k probabilities.
            output_router_logits (bool): Whether to output router logits.
            router_aux_loss_coef (float): The coefficient for router auxiliary loss.

        Returns:
            None.

        Raises:
            None.
        """
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.use_sliding_window = use_sliding_window
        self.sliding_window = sliding_window
        self.max_window_layers = max_window_layers

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.attention_dropout = attention_dropout

        # MoE arguments
        self.decoder_sparse_step = decoder_sparse_step
        self.moe_intermediate_size = moe_intermediate_size
        self.shared_expert_intermediate_size = shared_expert_intermediate_size
        self.num_experts_per_tok = num_experts_per_tok
        self.num_experts = num_experts
        self.norm_topk_prob = norm_topk_prob
        self.output_router_logits = output_router_logits
        self.router_aux_loss_coef = router_aux_loss_coef

        super().__init__(
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
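
The MoE arguments stored above (num_experts, num_experts_per_tok, norm_topk_prob) control how each token is routed. The following is a minimal sketch for a single token, assuming the router applies a softmax over all experts and then keeps the top-k probabilities; that ordering, and the helper name route_token, are assumptions for illustration and are not taken from the mindnlp source shown here.

```python
import math


def route_token(router_logits, top_k, norm_topk_prob=False):
    """Return [(expert_index, weight), ...] for one token.

    Softmax over all experts, keep the top_k largest probabilities,
    and optionally renormalize the kept weights so they sum to 1.
    """
    max_logit = max(router_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in chosen]
    if norm_topk_prob:
        weight_sum = sum(weights)
        weights = [w / weight_sum for w in weights]
    return list(zip(chosen, weights))


logits = [2.0, 0.0, 1.0, -1.0]  # toy example with 4 experts
raw = route_token(logits, top_k=2)
normalized = route_token(logits, top_k=2, norm_topk_prob=True)
print(raw)
print(normalized)
```

With norm_topk_prob=False the kept weights sum to less than 1, because the probability mass of the discarded experts is simply dropped.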

mindnlp.transformers.models.qwen2_moe.configuration_qwen2_moe.Qwen2MoeConfig.__init__(vocab_size=151936, hidden_size=2048, intermediate_size=5632, num_hidden_layers=24, num_attention_heads=16, num_key_value_heads=16, hidden_act='silu', max_position_embeddings=32768, initializer_range=0.02, rms_norm_eps=1e-06, use_cache=True, tie_word_embeddings=False, rope_theta=10000.0, use_sliding_window=False, sliding_window=4096, max_window_layers=28, attention_dropout=0.0, decoder_sparse_step=1, moe_intermediate_size=1408, shared_expert_intermediate_size=5632, num_experts_per_tok=4, num_experts=60, norm_topk_prob=False, output_router_logits=False, router_aux_loss_coef=0.001, **kwargs)

Initializes a Qwen2MoeConfig object with the specified parameters.

PARAMETER DESCRIPTION
vocab_size

The size of the vocabulary.

TYPE: int DEFAULT: 151936

hidden_size

The size of the hidden layers.

TYPE: int DEFAULT: 2048

intermediate_size

The size of the intermediate layers.

TYPE: int DEFAULT: 5632

num_hidden_layers

The number of hidden layers.

TYPE: int DEFAULT: 24

num_attention_heads

The number of attention heads.

TYPE: int DEFAULT: 16

num_key_value_heads

The number of key and value heads.

TYPE: int DEFAULT: 16

hidden_act

The activation function for the hidden layers.

TYPE: str DEFAULT: 'silu'

max_position_embeddings

The maximum position embeddings.

TYPE: int DEFAULT: 32768

initializer_range

The range for weight initialization.

TYPE: float DEFAULT: 0.02

rms_norm_eps

The epsilon value for RMS normalization.

TYPE: float DEFAULT: 1e-06

use_cache

Whether to use caching.

TYPE: bool DEFAULT: True

tie_word_embeddings

Whether to tie word embeddings.

TYPE: bool DEFAULT: False

rope_theta

The theta value for rope.

TYPE: float DEFAULT: 10000.0

use_sliding_window

Whether to use sliding window.

TYPE: bool DEFAULT: False

sliding_window

The size of the sliding window.

TYPE: int DEFAULT: 4096

max_window_layers

The maximum number of window layers.

TYPE: int DEFAULT: 28

attention_dropout

The dropout rate for attention.

TYPE: float DEFAULT: 0.0

decoder_sparse_step

The step size for decoder sparsity.

TYPE: int DEFAULT: 1

moe_intermediate_size

The size of intermediate layers for Mixture of Experts.

TYPE: int DEFAULT: 1408

shared_expert_intermediate_size

The size of shared expert intermediate layers.

TYPE: int DEFAULT: 5632

num_experts_per_tok

The number of experts per token.

TYPE: int DEFAULT: 4

num_experts

The total number of experts.

TYPE: int DEFAULT: 60

norm_topk_prob

Whether to normalize top-k probabilities.

TYPE: bool DEFAULT: False

output_router_logits

Whether to output router logits.

TYPE: bool DEFAULT: False

router_aux_loss_coef

The coefficient for router auxiliary loss.

TYPE: float DEFAULT: 0.001

RETURNS DESCRIPTION

None.
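
To get a feel for the default expert sizes, the arithmetic below estimates the MLP weights per MoE layer. It assumes each expert is a gated MLP with three bias-free projections (gate, up, down) and that the shared expert is applied to every token; both are assumptions, and the router and any gating weights are ignored.

```python
def gated_mlp_params(hidden_size: int, intermediate_size: int) -> int:
    """Weights in a gated MLP with gate, up and down projections (no biases assumed)."""
    return 3 * hidden_size * intermediate_size


hidden_size = 2048
routed_expert = gated_mlp_params(hidden_size, 1408)  # moe_intermediate_size
shared_expert = gated_mlp_params(hidden_size, 5632)  # shared_expert_intermediate_size

# num_experts_per_tok = 4 routed experts plus the shared expert are used per token.
active_per_token = 4 * routed_expert + shared_expert
# num_experts = 60 routed experts plus the shared expert are stored per layer.
stored_per_layer = 60 * routed_expert + shared_expert

print(routed_expert, shared_expert)
print(active_per_token, stored_per_layer)
```

Because 4 * 1408 = 5632, the four selected routed experts together are as large as the shared expert under these assumptions, so only a fraction of the stored expert weights is used for any one token.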

Source code in mindnlp/transformers/models/qwen2_moe/configuration_qwen2_moe.py
def __init__(
    self,
    vocab_size=151936,
    hidden_size=2048,
    intermediate_size=5632,
    num_hidden_layers=24,
    num_attention_heads=16,
    num_key_value_heads=16,
    hidden_act="silu",
    max_position_embeddings=32768,
    initializer_range=0.02,
    rms_norm_eps=1e-6,
    use_cache=True,
    tie_word_embeddings=False,
    rope_theta=10000.0,
    use_sliding_window=False,
    sliding_window=4096,
    max_window_layers=28,
    attention_dropout=0.0,
    decoder_sparse_step=1,
    moe_intermediate_size=1408,
    shared_expert_intermediate_size=5632,
    num_experts_per_tok=4,
    num_experts=60,
    norm_topk_prob=False,
    output_router_logits=False,
    router_aux_loss_coef=0.001,
    **kwargs,
):
    """
    Initializes a Qwen2MoeConfig object with the specified parameters.

    Args:
        vocab_size (int): The size of the vocabulary.
        hidden_size (int): The size of the hidden layers.
        intermediate_size (int): The size of the intermediate layers.
        num_hidden_layers (int): The number of hidden layers.
        num_attention_heads (int): The number of attention heads.
        num_key_value_heads (int): The number of key and value heads.
        hidden_act (str): The activation function for the hidden layers.
        max_position_embeddings (int): The maximum position embeddings.
        initializer_range (float): The range for weight initialization.
        rms_norm_eps (float): The epsilon value for RMS normalization.
        use_cache (bool): Whether to use caching.
        tie_word_embeddings (bool): Whether to tie word embeddings.
        rope_theta (float): The theta value for rope.
        use_sliding_window (bool): Whether to use sliding window.
        sliding_window (int): The size of the sliding window.
        max_window_layers (int): The maximum number of window layers.
        attention_dropout (float): The dropout rate for attention.
        decoder_sparse_step (int): The step size for decoder sparsity.
        moe_intermediate_size (int): The size of intermediate layers for Mixture of Experts.
        shared_expert_intermediate_size (int): The size of shared expert intermediate layers.
        num_experts_per_tok (int): The number of experts per token.
        num_experts (int): The total number of experts.
        norm_topk_prob (bool): Whether to normalize top-k probabilities.
        output_router_logits (bool): Whether to output router logits.
        router_aux_loss_coef (float): The coefficient for router auxiliary loss.

    Returns:
        None.

    Raises:
        None.
    """
    self.vocab_size = vocab_size
    self.max_position_embeddings = max_position_embeddings
    self.hidden_size = hidden_size
    self.intermediate_size = intermediate_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.use_sliding_window = use_sliding_window
    self.sliding_window = sliding_window
    self.max_window_layers = max_window_layers

    self.num_key_value_heads = num_key_value_heads
    self.hidden_act = hidden_act
    self.initializer_range = initializer_range
    self.rms_norm_eps = rms_norm_eps
    self.use_cache = use_cache
    self.rope_theta = rope_theta
    self.attention_dropout = attention_dropout

    # MoE arguments
    self.decoder_sparse_step = decoder_sparse_step
    self.moe_intermediate_size = moe_intermediate_size
    self.shared_expert_intermediate_size = shared_expert_intermediate_size
    self.num_experts_per_tok = num_experts_per_tok
    self.num_experts = num_experts
    self.norm_topk_prob = norm_topk_prob
    self.output_router_logits = output_router_logits
    self.router_aux_loss_coef = router_aux_loss_coef

    super().__init__(
        tie_word_embeddings=tie_word_embeddings,
        **kwargs,
    )
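
The rope_theta value stored above is the base of the rotary position embedding. A minimal sketch of the standard RoPE inverse-frequency formula follows; it assumes the usual formulation 1 / theta^(2i / head_dim) and is not copied from Qwen2MoeRotaryEmbedding, whose source is not shown in this section.

```python
def rope_inverse_frequencies(head_dim: int, rope_theta: float = 10000.0) -> list:
    """Inverse frequencies 1 / theta^(2i / head_dim) for i in [0, head_dim / 2)."""
    # Stepping by 2 makes the exponent i / head_dim equal to 2k / head_dim.
    return [1.0 / (rope_theta ** (i / head_dim)) for i in range(0, head_dim, 2)]


head_dim = 2048 // 16  # hidden_size // num_attention_heads with the defaults
inv_freq = rope_inverse_frequencies(head_dim)
print(len(inv_freq), inv_freq[0], inv_freq[-1])
```

A larger rope_theta lowers all frequencies except the first, which is one common way to stretch the usable position range.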

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe

MindSpore Qwen2MoE model.

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeAttention

Bases: Module

Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer and "Generating Long Sequences with Sparse Transformers".
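
As a rough illustration of the sliding window idea mentioned in this description, the sketch below builds a causal mask in which each query position sees only the most recent keys. It is not the masking code used by this class (which is not shown in this section), and the boundary convention, where a window of size w covers the current token plus the w - 1 tokens before it, is an assumption.

```python
def sliding_window_causal_mask(seq_len: int, window: int) -> list:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [j <= i and i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = sliding_window_causal_mask(seq_len=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

When the window is at least as long as the sequence, the mask reduces to ordinary causal attention.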

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
class Qwen2MoeAttention(nn.Module):
    """
    Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
    and "Generating Long Sequences with Sparse Transformers".
    """
    def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None):
        """
        Initialize a Qwen2MoeAttention instance.

        Args:
            self: The instance of the class.
            config (Qwen2MoeConfig): The configuration object containing model hyperparameters.
            layer_idx (Optional[int]): The index of the layer within the model. Defaults to None if not provided.
                If layer_idx is None, a warning is issued indicating potential issues during forward call
                if caching is used.
                It is recommended to always provide a layer index when creating an instance of this class.

        Returns:
            None.

        Raises:
            ValueError: If the hidden_size is not divisible by num_heads.
        """
        super().__init__()
        self.config = config
        self.layer_idx = layer_idx
        if layer_idx is None:
            logger.warning_once(
                f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
                "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
                "when creating this class."
            )

        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_key_value_heads = config.num_key_value_heads
        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
        self.max_position_embeddings = config.max_position_embeddings
        self.rope_theta = config.rope_theta
        self.is_causal = True
        self.attention_dropout = config.attention_dropout

        if (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)

        self.rotary_emb = Qwen2MoeRotaryEmbedding(
            self.head_dim,
            max_position_embeddings=self.max_position_embeddings,
            base=self.rope_theta,
        )

    def forward(
        self,
        hidden_states: mindspore.Tensor,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_value: Optional[Cache] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
        **kwargs,
    ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]:
        """
        Run the forward pass of Qwen2MoeAttention: project the hidden states, apply rotary position
        embeddings, and compute scaled dot-product attention.

        Args:
            self: The instance of the Qwen2MoeAttention class.
            hidden_states (mindspore.Tensor): The input hidden states tensor of shape
                (batch_size, sequence_length, hidden_size).
            attention_mask (Optional[mindspore.Tensor], optional): An optional tensor specifying the attention mask of
                shape (batch_size, 1, sequence_length, key_value_sequence_length). Defaults to None.
            position_ids (Optional[mindspore.Tensor], optional): An optional tensor specifying the position ids of
                shape (batch_size, sequence_length). Defaults to None.
            past_key_value (Optional[Cache], optional): An optional cache object for storing key and value states
                from previous steps. Defaults to None.
            output_attentions (bool): A flag indicating whether to output attention weights. Defaults to False.
            use_cache (bool): A flag indicating whether to use cache for storing key and value states. Defaults to False.

        Returns:
            Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]:
            A tuple containing the attention output tensor of shape (batch_size, sequence_length, hidden_size),
            optional attention weights tensor, and optional updated past key value tuple.

        Raises:
            ValueError: If the size of attention weights or attention mask does not match the expected shape.
            ValueError: If the size of the final attention output tensor does not match the expected shape.
            ValueError: If `past_key_value` is provided but the attention layer was created without a
                `layer_idx`, which the cache requires for auto-regressive decoding with k/v caching.
        """
        if "padding_mask" in kwargs:
            warnings.warn(
                "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
            )
        bsz, q_len, _ = hidden_states.shape

        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)

        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).swapaxes(1, 2)
        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).swapaxes(1, 2)
        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).swapaxes(1, 2)

        kv_seq_len = key_states.shape[-2]
        if past_key_value is not None:
            if self.layer_idx is None:
                raise ValueError(
                    f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
                    "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
                    "with a layer index."
                )
            kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

        if past_key_value is not None:
            cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models
            key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

        # repeat k/v heads if n_kv_heads < n_heads
        key_states = repeat_kv(key_states, self.num_key_value_groups)
        value_states = repeat_kv(value_states, self.num_key_value_groups)

        attn_weights = ops.matmul(query_states, key_states.swapaxes(2, 3)) / math.sqrt(self.head_dim)

        if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len):
            raise ValueError(
                f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
                f" {attn_weights.shape}"
            )

        if attention_mask is not None:
            if attention_mask.shape != (bsz, 1, q_len, kv_seq_len):
                raise ValueError(
                    f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}"
                )

            attn_weights = attn_weights + attention_mask

        # upcast attention to fp32
        attn_weights = ops.softmax(attn_weights, axis=-1, dtype=mindspore.float32).to(query_states.dtype)
        attn_weights = ops.dropout(attn_weights, p=self.attention_dropout, training=self.training)
        attn_output = ops.matmul(attn_weights, value_states)

        if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim):
            raise ValueError(
                f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
                f" {attn_output.shape}"
            )

        attn_output = attn_output.swapaxes(1, 2)
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

        attn_output = self.o_proj(attn_output)

        if not output_attentions:
            attn_weights = None

        return attn_output, attn_weights, past_key_value

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeAttention.__init__(config, layer_idx=None)

Initialize a Qwen2MoeAttention instance.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object containing model hyperparameters.

TYPE: Qwen2MoeConfig

layer_idx

The index of the layer within the model. Defaults to None if not provided. If layer_idx is None, a warning is issued indicating potential issues during forward call if caching is used. It is recommended to always provide a layer index when creating an instance of this class.

TYPE: Optional[int] DEFAULT: None

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If the hidden_size is not divisible by num_heads.
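The size arithmetic performed by the constructor can be sketched in plain Python. This is a minimal illustration, not the library code: the helper name `attention_dims` is hypothetical, and `num_key_value_heads=4` is an example value chosen only to show grouped-query attention.

```python
def attention_dims(hidden_size: int, num_heads: int, num_key_value_heads: int) -> dict:
    """Derive per-head sizes with the same integer arithmetic as the constructor."""
    head_dim = hidden_size // num_heads
    # Same divisibility check that guards the constructor.
    if head_dim * num_heads != hidden_size:
        raise ValueError(
            f"hidden_size must be divisible by num_heads (got {hidden_size} and {num_heads})."
        )
    return {
        "head_dim": head_dim,
        # How many query heads share one key/value head.
        "num_key_value_groups": num_heads // num_key_value_heads,
        # Output widths of the q_proj and k_proj/v_proj linear layers.
        "q_proj_out": num_heads * head_dim,
        "kv_proj_out": num_key_value_heads * head_dim,
    }


# hidden_size and num_heads follow the documented config defaults;
# num_key_value_heads=4 is an illustrative assumption.
dims = attention_dims(hidden_size=2048, num_heads=16, num_key_value_heads=4)
```

With these values each head is 128 wide and every key/value head serves four query heads, so the key and value projections are a quarter of the width of the query projection.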

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None):
    """
    Initialize a Qwen2MoeAttention instance.

    Args:
        self: The instance of the class.
        config (Qwen2MoeConfig): The configuration object containing model hyperparameters.
        layer_idx (Optional[int]): The index of the layer within the model. Defaults to None if not provided.
            If layer_idx is None, a warning is issued indicating potential issues during forward call
            if caching is used.
            It is recommended to always provide a layer index when creating an instance of this class.

    Returns:
        None.

    Raises:
        ValueError: If the hidden_size is not divisible by num_heads.
    """
    super().__init__()
    self.config = config
    self.layer_idx = layer_idx
    if layer_idx is None:
        logger.warning_once(
            f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
            "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` "
            "when creating this class."
        )

    self.hidden_size = config.hidden_size
    self.num_heads = config.num_attention_heads
    self.head_dim = self.hidden_size // self.num_heads
    self.num_key_value_heads = config.num_key_value_heads
    self.num_key_value_groups = self.num_heads // self.num_key_value_heads
    self.max_position_embeddings = config.max_position_embeddings
    self.rope_theta = config.rope_theta
    self.is_causal = True
    self.attention_dropout = config.attention_dropout

    if (self.head_dim * self.num_heads) != self.hidden_size:
        raise ValueError(
            f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
            f" and `num_heads`: {self.num_heads})."
        )
    self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
    self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
    self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
    self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)

    self.rotary_emb = Qwen2MoeRotaryEmbedding(
        self.head_dim,
        max_position_embeddings=self.max_position_embeddings,
        base=self.rope_theta,
    )

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeAttention.forward(hidden_states, attention_mask=None, position_ids=None, past_key_value=None, output_attentions=False, use_cache=False, **kwargs)

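Runs the forward pass of Qwen2MoeAttention: projects the hidden states to queries, keys, and values, applies rotary position embeddings, and computes scaled dot-product attention.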

PARAMETER DESCRIPTION
self

The instance of the Qwen2MoeAttention class.

hidden_states

The input hidden states tensor of shape (batch_size, sequence_length, hidden_size).

TYPE: Tensor

attention_mask

An optional tensor specifying the attention mask of shape (batch_size, 1, sequence_length, key_value_sequence_length). Defaults to None.

TYPE: Optional[Tensor] DEFAULT: None

position_ids

An optional tensor specifying the position ids of shape (batch_size, sequence_length). Defaults to None.

TYPE: Optional[Tensor] DEFAULT: None

past_key_value

An optional cache object for storing key and value states from previous steps. Defaults to None.

TYPE: Optional[Cache] DEFAULT: None

output_attentions

A flag indicating whether to output attention weights. Defaults to False.

TYPE: bool DEFAULT: False

use_cache

A flag indicating whether to use cache for storing key and value states. Defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION
Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]

A tuple containing the attention output tensor of shape (batch_size, sequence_length, hidden_size), the optional attention weights tensor, and the optional updated past key value tuple.

RAISES DESCRIPTION
ValueError

If the size of attention weights or attention mask does not match the expected shape.

ValueError

If the size of the final attention output tensor does not match the expected shape.

ValueError

If past_key_value is provided but the layer was created without a layer_idx, which k/v caching for auto-regressive decoding has required since the cache format changed in v4.36.
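The core computation (repeat the key/value heads, scale the dot products, add the mask, apply softmax, then weight the values) can be sketched with nested Python lists instead of tensors. This is an illustrative sketch under stated assumptions: rotary embeddings, dropout, and the cache are omitted, `attend` is a hypothetical name, and the `repeat_kv` here is a stand-in that assumes each key/value head is repeated consecutively.

```python
import math


def repeat_kv(heads, n_rep):
    """Repeat each key/value head n_rep times so it lines up with the query heads."""
    return [head for head in heads for _ in range(n_rep)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query_heads, key_heads, value_heads, mask=None):
    """Scaled dot-product attention.

    query_heads: [num_heads][q_len][head_dim]
    key_heads / value_heads: [num_kv_heads][kv_len][head_dim]
    mask: optional additive mask of shape [q_len][kv_len]
    """
    n_rep = len(query_heads) // len(key_heads)
    key_heads = repeat_kv(key_heads, n_rep)
    value_heads = repeat_kv(value_heads, n_rep)
    head_dim = len(query_heads[0][0])
    outputs = []
    for q, k, v in zip(query_heads, key_heads, value_heads):
        head_out = []
        for i, q_vec in enumerate(q):
            # Dot product of the query with every key, scaled by sqrt(head_dim).
            scores = [
                sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(head_dim)
                for k_vec in k
            ]
            if mask is not None:
                scores = [s + m for s, m in zip(scores, mask[i])]
            weights = softmax(scores)
            # Weighted sum of the value vectors.
            head_out.append(
                [sum(w * v_vec[d] for w, v_vec in zip(weights, v)) for d in range(head_dim)]
            )
        outputs.append(head_out)
    return outputs


# Two query heads share one key/value head; the causal mask hides key 1 from query 0.
q = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
k = [[[1.0, 0.0], [0.0, 1.0]]]
v = [[[1.0, 0.0], [0.0, 1.0]]]
causal_mask = [[0.0, -1e9], [0.0, 0.0]]
out = attend(q, k, v, causal_mask)
```

Because the mask is additive, a large negative entry drives the corresponding softmax weight to zero, which is why the first query position in the example can only see the first value vector.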

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def forward(
    self,
    hidden_states: mindspore.Tensor,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_value: Optional[Cache] = None,
    output_attentions: bool = False,
    use_cache: bool = False,
    **kwargs,
) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]:
    """
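    Runs the forward pass of Qwen2MoeAttention: projects the hidden states to queries, keys, and values,
    applies rotary position embeddings, and computes scaled dot-product attention.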

    Args:
        self: The instance of the Qwen2MoeAttention class.
        hidden_states (mindspore.Tensor): The input hidden states tensor of shape
            (batch_size, sequence_length, hidden_size).
        attention_mask (Optional[mindspore.Tensor], optional): An optional tensor specifying the attention mask of
            shape (batch_size, 1, sequence_length, key_value_sequence_length). Defaults to None.
        position_ids (Optional[mindspore.Tensor], optional): An optional tensor specifying the position ids of
            shape (batch_size, sequence_length). Defaults to None.
        past_key_value (Optional[Cache], optional): An optional cache object for storing key and value states
            from previous steps. Defaults to None.
        output_attentions (bool): A flag indicating whether to output attention weights. Defaults to False.
        use_cache (bool): A flag indicating whether to use cache for storing key and value states. Defaults to False.

    Returns:
        Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]:
        A tuple containing the attention output tensor of shape (batch_size, sequence_length, hidden_size),
        optional attention weights tensor, and optional updated past key value tuple.

    Raises:
        ValueError: If the size of attention weights or attention mask does not match the expected shape.
        ValueError: If the size of the final attention output tensor does not match the expected shape.
        ValueError: If `past_key_value` is provided but the layer was created without a `layer_idx`, which
            k/v caching for auto-regressive decoding has required since the cache format changed in v4.36.
    """
    if "padding_mask" in kwargs:
        warnings.warn(
            "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
        )
    bsz, q_len, _ = hidden_states.shape

    query_states = self.q_proj(hidden_states)
    key_states = self.k_proj(hidden_states)
    value_states = self.v_proj(hidden_states)

    query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).swapaxes(1, 2)
    key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).swapaxes(1, 2)
    value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).swapaxes(1, 2)

    kv_seq_len = key_states.shape[-2]
    if past_key_value is not None:
        if self.layer_idx is None:
            raise ValueError(
                f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
                "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
                "with a layer index."
            )
        kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
    cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

    if past_key_value is not None:
        cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models
        key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)

    # repeat k/v heads if n_kv_heads < n_heads
    key_states = repeat_kv(key_states, self.num_key_value_groups)
    value_states = repeat_kv(value_states, self.num_key_value_groups)

    attn_weights = ops.matmul(query_states, key_states.swapaxes(2, 3)) / math.sqrt(self.head_dim)

    if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len):
        raise ValueError(
            f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
            f" {attn_weights.shape}"
        )

    if attention_mask is not None:
        if attention_mask.shape != (bsz, 1, q_len, kv_seq_len):
            raise ValueError(
                f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}"
            )

        attn_weights = attn_weights + attention_mask

    # upcast attention to fp32
    attn_weights = ops.softmax(attn_weights, axis=-1, dtype=mindspore.float32).to(query_states.dtype)
    attn_weights = ops.dropout(attn_weights, p=self.attention_dropout, training=self.training)
    attn_output = ops.matmul(attn_weights, value_states)

    if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim):
        raise ValueError(
            f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
            f" {attn_output.shape}"
        )

    attn_output = attn_output.swapaxes(1, 2)
    attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

    attn_output = self.o_proj(attn_output)

    if not output_attentions:
        attn_weights = None

    return attn_output, attn_weights, past_key_value

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeDecoderLayer

Bases: Module

The Qwen2MoeDecoderLayer class represents a single decoder layer of the Qwen2Moe model. Qwen2MoeModel stacks these layers to process the input hidden states and generate output representations.

This class inherits from the nn.Module class.

ATTRIBUTE DESCRIPTION
hidden_size

The size of the hidden state.

TYPE: int

self_attn

The self-attention mechanism used in the layer.

TYPE: Qwen2MoeAttention

mlp

The multi-layer perceptron used in the layer.

TYPE: Union[Qwen2MoeSparseMoeBlock, Qwen2MoeMLP]

input_layernorm

The layer normalization applied to the input hidden states.

TYPE: Qwen2MoeRMSNorm

post_attention_layernorm

The layer normalization applied after the attention mechanism.

TYPE: Qwen2MoeRMSNorm

Note
  • The hidden_states argument represents the input to the layer.
  • The attention_mask argument is an optional tensor that masks certain positions in the input sequence.
  • The position_ids argument is an optional tensor that represents the position IDs of the input hidden states.
  • The past_key_value argument is an optional tuple of tensors that caches the past key and value projection states.
  • The output_attentions argument is an optional boolean flag indicating whether to return the attention tensors.
  • The output_router_logits argument is an optional boolean flag indicating whether to return the logits of the routers.
  • The use_cache argument is an optional boolean flag indicating whether to use the cached key value states for decoding.

Please refer to the source code for more information on the specific implementation details.
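The data flow through the layer (normalize, attend, add the residual, normalize, apply the MLP, add the residual) can be sketched with plain callables. This is a simplified illustration rather than the library implementation: hidden states are flat lists of floats, and `decoder_layer_forward` and the stand-in callables are hypothetical names.

```python
def decoder_layer_forward(x, input_norm, attn, post_attn_norm, mlp,
                          output_attentions=False, use_cache=False,
                          output_router_logits=False):
    """Pre-norm residual block: x + attn(norm(x)), then x + mlp(norm(x))."""
    residual = x
    attn_out, attn_weights, present_key_value = attn(input_norm(x))
    x = [r + a for r, a in zip(residual, attn_out)]

    residual = x
    mlp_out = mlp(post_attn_norm(x))
    # A sparse MoE block returns (hidden_states, router_logits); a dense MLP returns hidden states only.
    if isinstance(mlp_out, tuple):
        mlp_out, router_logits = mlp_out
    else:
        router_logits = None
    x = [r + m for r, m in zip(residual, mlp_out)]

    # Optional entries are appended in a fixed order, so positions shift with the flags.
    outputs = (x,)
    if output_attentions:
        outputs += (attn_weights,)
    if use_cache:
        outputs += (present_key_value,)
    if output_router_logits:
        outputs += (router_logits,)
    return outputs


# Stand-in components, chosen only to make the arithmetic easy to follow.
identity = lambda h: h
toy_attn = lambda h: ([0.5 * v for v in h], "attn_weights", "cache")
toy_moe = lambda h: ([2.0 * v for v in h], "router_logits")

full = decoder_layer_forward([1.0, 2.0], identity, toy_attn, identity, toy_moe,
                             output_attentions=True, use_cache=True,
                             output_router_logits=True)
minimal = decoder_layer_forward([1.0, 2.0], identity, toy_attn, identity, toy_moe)
```

In the toy run the attention branch adds half of the input and the MoE branch adds twice the intermediate state, so `[1.0, 2.0]` becomes `[1.5, 3.0]` and then `[4.5, 9.0]`.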

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
class Qwen2MoeDecoderLayer(nn.Module):

    """
    The `Qwen2MoeDecoderLayer` class represents a single decoder layer of the Qwen2Moe model.
    `Qwen2MoeModel` stacks these layers to process the input hidden states and generate output
    representations.

    This class inherits from the `nn.Module` class.

    Attributes:
        hidden_size (int): The size of the hidden state.
        self_attn (Qwen2MoeAttention): The self-attention mechanism used in the layer.
        mlp (Union[Qwen2MoeSparseMoeBlock, Qwen2MoeMLP]): The multi-layer perceptron used in the layer.
        input_layernorm (Qwen2MoeRMSNorm): The layer normalization applied to the input hidden states.
        post_attention_layernorm (Qwen2MoeRMSNorm): The layer normalization applied after the attention mechanism.

    Note:
        - The `hidden_states` argument represents the input to the layer.
        - The `attention_mask` argument is an optional tensor that masks certain positions in the input sequence.
        - The `position_ids` argument is an optional tensor that represents the position IDs of the input hidden states.
        - The `past_key_value` argument is an optional tuple of tensors that caches the past key and value projection states.
        - The `output_attentions` argument is an optional boolean flag indicating whether to return the attention tensors.
        - The `output_router_logits` argument is an optional boolean flag indicating whether to return the logits of the routers.
        - The `use_cache` argument is an optional boolean flag indicating whether to use the cached key value states for decoding.

    Please refer to the source code for more information on the specific implementation details.
    """
    def __init__(self, config: Qwen2MoeConfig, layer_idx: int):
        """
        Initializes a Qwen2MoeDecoderLayer object.

        Args:
            self: The instance of the Qwen2MoeDecoderLayer class.
            config (Qwen2MoeConfig): An object containing configuration settings for the decoder layer.
                It specifies the hidden size, number of experts, decoder sparse step, and intermediate size.
            layer_idx (int): An integer indicating the index of the layer within the decoder.
                It is used to determine the behavior of the layer based on the configuration.

        Returns:
            None.

        Raises:
            KeyError: If the "eager" implementation is missing from QWEN2MOE_ATTENTION_CLASSES.
            ValueError: If hidden_size is not divisible by num_attention_heads (raised by Qwen2MoeAttention).
                A num_experts value of 0 or less does not raise; the layer falls back to a dense Qwen2MoeMLP.
            TypeError: If the configuration parameters are not of the expected types.
        """
        super().__init__()
        self.hidden_size = config.hidden_size

        self.self_attn = QWEN2MOE_ATTENTION_CLASSES["eager"](config, layer_idx)

        if config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0:
            self.mlp = Qwen2MoeSparseMoeBlock(config)
        else:
            self.mlp = Qwen2MoeMLP(config, intermediate_size=config.intermediate_size)

        self.input_layernorm = Qwen2MoeRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = Qwen2MoeRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(
        self,
        hidden_states: mindspore.Tensor,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_value: Optional[Tuple[mindspore.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        output_router_logits: Optional[bool] = False,
        use_cache: Optional[bool] = False,
    ) -> Tuple[mindspore.Tensor, Optional[Tuple[mindspore.Tensor, mindspore.Tensor]]]:
        """
        Args:
            hidden_states (`mindspore.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`mindspore.Tensor`, *optional*): additive attention mask forwarded unchanged to the
                self-attention module, which expects shape `(batch, 1, query_length, key_value_length)`.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            output_router_logits (`bool`, *optional*):
                Whether or not to return the logits of all the routers. They are useful for computing the router loss,
                and should not be returned during inference.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            past_key_value (`Tuple(mindspore.Tensor)`, *optional*): cached past key and value projection states
        """
        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)

        hidden_states = self.mlp(hidden_states)
        if isinstance(hidden_states, tuple):
            hidden_states, router_logits = hidden_states
        else:
            router_logits = None

        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        if output_router_logits:
            outputs += (router_logits,)

        return outputs

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeDecoderLayer.__init__(config, layer_idx)

Initializes a Qwen2MoeDecoderLayer object.

PARAMETER DESCRIPTION
self

The instance of the Qwen2MoeDecoderLayer class.

config

An object containing configuration settings for the decoder layer. It specifies the hidden size, number of experts, decoder sparse step, and intermediate size.

TYPE: Qwen2MoeConfig

layer_idx

An integer indicating the index of the layer within the decoder. It is used to determine the behavior of the layer based on the configuration.

TYPE: int

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
KeyError

If the "eager" implementation is missing from QWEN2MOE_ATTENTION_CLASSES.

ValueError

If hidden_size is not divisible by num_attention_heads (raised by Qwen2MoeAttention). A num_experts value of 0 or less does not raise; the layer falls back to a dense Qwen2MoeMLP.

TypeError

If the configuration parameters are not of the expected types.
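The rule that decides whether a layer gets a sparse MoE block or a dense MLP can be isolated as a small predicate. The function name `uses_sparse_moe` and the example values below are illustrative assumptions; the condition itself mirrors the one in the constructor.

```python
def uses_sparse_moe(layer_idx: int, num_experts: int, decoder_sparse_step: int) -> bool:
    """True when the layer at layer_idx would use the sparse MoE block."""
    return num_experts > 0 and (layer_idx + 1) % decoder_sparse_step == 0


# With a sparse step of 2, every second layer (counting from 1) is sparse.
every_other = [i for i in range(8) if uses_sparse_moe(i, num_experts=4, decoder_sparse_step=2)]

# With a sparse step of 1, every layer is sparse.
all_sparse = [i for i in range(4) if uses_sparse_moe(i, num_experts=4, decoder_sparse_step=1)]

# With no experts configured, every layer falls back to the dense MLP.
none_sparse = [i for i in range(4) if uses_sparse_moe(i, num_experts=0, decoder_sparse_step=1)]
```

Because the check uses `layer_idx + 1`, the first layer (index 0) is sparse only when `decoder_sparse_step` is 1.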

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def __init__(self, config: Qwen2MoeConfig, layer_idx: int):
    """
    Initializes a Qwen2MoeDecoderLayer object.

    Args:
        self: The instance of the Qwen2MoeDecoderLayer class.
        config (Qwen2MoeConfig): An object containing configuration settings for the decoder layer.
            It specifies the hidden size, number of experts, decoder sparse step, and intermediate size.
        layer_idx (int): An integer indicating the index of the layer within the decoder.
            It is used to determine the behavior of the layer based on the configuration.

    Returns:
        None.

    Raises:
        KeyError: If the "eager" implementation is missing from QWEN2MOE_ATTENTION_CLASSES.
        ValueError: If hidden_size is not divisible by num_attention_heads (raised by Qwen2MoeAttention).
            A num_experts value of 0 or less does not raise; the layer falls back to a dense Qwen2MoeMLP.
        TypeError: If the configuration parameters are not of the expected types.
    """
    super().__init__()
    self.hidden_size = config.hidden_size

    self.self_attn = QWEN2MOE_ATTENTION_CLASSES["eager"](config, layer_idx)

    if config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0:
        self.mlp = Qwen2MoeSparseMoeBlock(config)
    else:
        self.mlp = Qwen2MoeMLP(config, intermediate_size=config.intermediate_size)

    self.input_layernorm = Qwen2MoeRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
    self.post_attention_layernorm = Qwen2MoeRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeDecoderLayer.forward(hidden_states, attention_mask=None, position_ids=None, past_key_value=None, output_attentions=False, output_router_logits=False, use_cache=False)

PARAMETER DESCRIPTION
hidden_states

input to the layer of shape (batch, seq_len, embed_dim)

TYPE: `mindspore.Tensor`

attention_mask

additive attention mask forwarded unchanged to the self-attention module, which expects shape (batch, 1, query_length, key_value_length).

TYPE: `mindspore.Tensor`, *optional* DEFAULT: None

output_attentions

Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more detail.

TYPE: `bool`, *optional* DEFAULT: False

output_router_logits

Whether or not to return the logits of all the routers. They are useful for computing the router loss, and should not be returned during inference.

TYPE: `bool`, *optional* DEFAULT: False

use_cache

If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).

TYPE: `bool`, *optional* DEFAULT: False

past_key_value

cached past key and value projection states

TYPE: `Tuple(mindspore.Tensor)`, *optional* DEFAULT: None
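The returned tuple always starts with the hidden states, and each optional entry is present only when its flag was set, so positions depend on the flags. A small helper can turn the tuple back into named fields. This is a convenience sketch; `unpack_layer_outputs` is a hypothetical name and not part of the library.

```python
def unpack_layer_outputs(outputs, output_attentions=False, use_cache=False,
                         output_router_logits=False):
    """Map the positional layer outputs to names, given the flags used for the call."""
    items = iter(outputs)
    result = {"hidden_states": next(items)}
    # The order matches the order in which the layer appends optional entries.
    result["attn_weights"] = next(items) if output_attentions else None
    result["present_key_value"] = next(items) if use_cache else None
    result["router_logits"] = next(items) if output_router_logits else None
    return result


# Placeholder strings stand in for tensors and cache objects.
named = unpack_layer_outputs(("hidden", "cache"), use_cache=True)
```

The same flags must be passed to the helper as were passed to the layer, otherwise the entries are attributed to the wrong names.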

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def forward(
    self,
    hidden_states: mindspore.Tensor,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_value: Optional[Tuple[mindspore.Tensor]] = None,
    output_attentions: Optional[bool] = False,
    output_router_logits: Optional[bool] = False,
    use_cache: Optional[bool] = False,
) -> Tuple[mindspore.Tensor, Optional[Tuple[mindspore.Tensor, mindspore.Tensor]]]:
    """
    Args:
        hidden_states (`mindspore.Tensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
        attention_mask (`mindspore.Tensor`, *optional*): additive attention mask forwarded unchanged to the
            self-attention module, which expects shape `(batch, 1, query_length, key_value_length)`.
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under
            returned tensors for more detail.
        output_router_logits (`bool`, *optional*):
            Whether or not to return the logits of all the routers. They are useful for computing the router loss,
            and should not be returned during inference.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
            (see `past_key_values`).
        past_key_value (`Tuple(mindspore.Tensor)`, *optional*): cached past key and value projection states
    """
    residual = hidden_states

    hidden_states = self.input_layernorm(hidden_states)

    # Self Attention
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
        hidden_states=hidden_states,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_value=past_key_value,
        output_attentions=output_attentions,
        use_cache=use_cache,
    )
    hidden_states = residual + hidden_states

    # Fully Connected
    residual = hidden_states
    hidden_states = self.post_attention_layernorm(hidden_states)

    hidden_states = self.mlp(hidden_states)
    if isinstance(hidden_states, tuple):
        hidden_states, router_logits = hidden_states
    else:
        router_logits = None

    hidden_states = residual + hidden_states

    outputs = (hidden_states,)

    if output_attentions:
        outputs += (self_attn_weights,)

    if use_cache:
        outputs += (present_key_value,)

    if output_router_logits:
        outputs += (router_logits,)

    return outputs

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForCausalLM

Bases: Qwen2MoePreTrainedModel

This class represents a Qwen2Moe model for causal language modeling. It is used for generating text based on a given input. The model is initialized with a configuration and consists of a Qwen2MoeModel for encoding and a linear layer (lm_head) for decoding. It also includes methods for getting and setting the input and output embeddings, setting and getting the decoder, and generating text.

ATTRIBUTE DESCRIPTION
`model`

The Qwen2MoeModel used for encoding.

TYPE: Qwen2MoeModel

`vocab_size`

The size of the vocabulary.

TYPE: int

`lm_head`

The linear layer used for decoding.

TYPE: Linear

`router_aux_loss_coef`

The coefficient for the auxiliary loss.

TYPE: float

`num_experts`

The number of experts.

TYPE: int

`num_experts_per_tok`

The number of experts per token.

TYPE: int

METHOD DESCRIPTION
`get_input_embeddings`

Returns the input embeddings.

`set_input_embeddings`

Sets the input embeddings.

`get_output_embeddings`

Returns the output embeddings.

`set_output_embeddings`

Sets the output embeddings.

`set_decoder`

Sets the decoder.

`get_decoder`

Returns the decoder.

`forward`

Runs the model on the given inputs and returns the output logits. Optionally computes the language modeling loss and the router auxiliary loss.

`prepare_inputs_for_generation`

Prepares the inputs for text generation, taking into account past key values and attention mask.

`_reorder_cache`

Reorders the cache based on the beam index.
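How the language modeling loss and the auxiliary loss might be combined can be sketched in plain Python. The `forward` source is not shown in this section, so two things here are assumptions: that the language modeling loss is the usual next-token cross-entropy over shifted labels, and that the auxiliary loss is added after scaling by `router_aux_loss_coef`. The function names are hypothetical.

```python
import math


def next_token_loss(logits, labels):
    """Mean cross-entropy where the logits at position t predict labels[t + 1]."""
    losses = []
    for step_logits, target in zip(logits[:-1], labels[1:]):
        # Stable log-sum-exp gives the log of the softmax normalizer.
        peak = max(step_logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in step_logits))
        losses.append(log_norm - step_logits[target])
    return sum(losses) / len(losses)


def total_loss(lm_loss, aux_loss, router_aux_loss_coef):
    """Assumed combination: language modeling loss plus the scaled auxiliary loss."""
    return lm_loss + router_aux_loss_coef * aux_loss


# Uniform logits over a two-token vocabulary give a loss of log(2) per position.
uniform_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
lm_loss = next_token_loss(uniform_logits, labels=[0, 1, 0])
combined = total_loss(lm_loss, aux_loss=0.5, router_aux_loss_coef=0.1)
```

The coefficient controls how strongly the router's load-balancing term influences training relative to the prediction loss.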

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel):

    """
    This class represents a Qwen2Moe model for causal language modeling.
    It is used for generating text based on a given input. The model is initialized with a configuration and consists of
    a Qwen2MoeModel for encoding and a linear layer (lm_head) for decoding. It also includes methods for getting and
    setting the input and output embeddings, setting and getting the decoder, and generating text.

    Attributes:
        `model` (Qwen2MoeModel): The Qwen2MoeModel used for encoding.
        `vocab_size` (int): The size of the vocabulary.
        `lm_head` (nn.Linear): The linear layer used for decoding.
        `router_aux_loss_coef` (float): The coefficient for the auxiliary loss.
        `num_experts` (int): The number of experts.
        `num_experts_per_tok` (int): The number of experts per token.

    Methods:
        `get_input_embeddings`: Returns the input embeddings.
        `set_input_embeddings`: Sets the input embeddings.
        `get_output_embeddings`: Returns the output embeddings.
        `set_output_embeddings`: Sets the output embeddings.
        `set_decoder`: Sets the decoder.
        `get_decoder`: Returns the decoder.
        `forward`: Runs the model on the given inputs and returns the output logits.
            Optionally computes the language modeling loss and the router auxiliary loss.
        `prepare_inputs_for_generation`: Prepares the inputs for text generation, taking into account past key values
            and attention mask.
        `_reorder_cache`: Reorders the cache based on the beam index.
    """
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config):
        """
        Initializes a Qwen2MoeForCausalLM object.

        Args:
            self (Qwen2MoeForCausalLM): The instance of the class.
            config (Qwen2MoeConfig):
                The model configuration. The following fields are read here:

                - vocab_size (int): The size of the vocabulary.
                - hidden_size (int): The size of the hidden layer.
                - router_aux_loss_coef (float): Coefficient for router auxiliary loss.
                - num_experts (int): The total number of experts.
                - num_experts_per_tok (int): Number of experts per token.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)
        self.model = Qwen2MoeModel(config)
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        self.router_aux_loss_coef = config.router_aux_loss_coef
        self.num_experts = config.num_experts
        self.num_experts_per_tok = config.num_experts_per_tok
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        Returns the input embeddings for the Qwen2MoeForCausalLM model.

        Args:
            self (Qwen2MoeForCausalLM): The instance of the Qwen2MoeForCausalLM class.

        Returns:
            The token embedding layer stored at `self.model.embed_tokens`.

        Raises:
            None.
        """
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        """
        Sets the input embeddings for the Qwen2MoeForCausalLM model.

        Args:
            self (Qwen2MoeForCausalLM): The instance of the Qwen2MoeForCausalLM class.
            value: The embedding layer to assign to the model's `embed_tokens` attribute.
                No type check is performed.

        Returns:
            None.

        Raises:
            None.
        """
        self.model.embed_tokens = value

    def get_output_embeddings(self):
        """Return the output embeddings of the Qwen2MoeForCausalLM model.

        Args:
            self (Qwen2MoeForCausalLM): An instance of the Qwen2MoeForCausalLM class.

        Returns:
            The output projection layer stored at `self.lm_head`.

        Raises:
            None.
        """
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        """
        Sets new output embeddings for the Qwen2MoeForCausalLM model.

        Args:
            self (Qwen2MoeForCausalLM): The instance of the Qwen2MoeForCausalLM class.
            new_embeddings (nn.Linear): The layer that replaces `lm_head`. No type check is performed.

        Returns:
            None.

        Raises:
            None.
        """
        self.lm_head = new_embeddings

    def set_decoder(self, decoder):
        """
        Sets the decoder for the Qwen2MoeForCausalLM class.

        Args:
            self (Qwen2MoeForCausalLM): An instance of the Qwen2MoeForCausalLM class.
            decoder: The decoder to be set for the Qwen2MoeForCausalLM class.

        Returns:
            None.

        Raises:
            None.
        """
        self.model = decoder

    def get_decoder(self):
        """
        Returns the decoder model used in the Qwen2MoeForCausalLM class.

        Args:
            self: An instance of the Qwen2MoeForCausalLM class.

        Returns:
            The decoder stored at `self.model` (a Qwen2MoeModel unless replaced via `set_decoder`).

        Raises:
            None
        """
        return self.model

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        output_router_logits: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, MoeCausalLMOutputWithPast]:
        r"""

        Args:
            labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the causal (next-token) language modeling loss. Indices should either be in
                `[0, ..., config.vocab_size - 1]` or -100. Tokens with indices set to `-100` are ignored (masked);
                the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.

        Returns:
            `Union[Tuple, MoeCausalLMOutputWithPast]`

        Example:
            ```python
            >>> from mindnlp.transformers import AutoTokenizer, Qwen2MoeForCausalLM
            ...
            >>> model = Qwen2MoeForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
            >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
            ...
            >>> prompt = "Hey, are you conscious? Can you talk to me?"
            >>> inputs = tokenizer(prompt, return_tensors="ms")
            ...
            >>> # Generate
            >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
            >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
            "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
            ```
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_router_logits = (
            output_router_logits if output_router_logits is not None else self.config.output_router_logits
        )
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            output_router_logits=output_router_logits,
            return_dict=return_dict,
        )

        hidden_states = outputs[0]
        logits = self.lm_head(hidden_states)
        logits = logits.float()

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :]
            shift_labels = labels[..., 1:]
            # Flatten the tokens
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            # Cross-entropy over the flattened tokens
            loss = ops.cross_entropy(shift_logits, shift_labels)

        aux_loss = None
        if output_router_logits:
            aux_loss = load_balancing_loss_func(
                outputs.router_logits if return_dict else outputs[-1],
                self.num_experts,
                self.num_experts_per_tok,
                attention_mask,
            )
            if labels is not None:
                loss += self.router_aux_loss_coef * aux_loss

        if not return_dict:
            output = (logits,) + outputs[1:]
            if output_router_logits:
                output = (aux_loss,) + output
            return (loss,) + output if loss is not None else output

        return MoeCausalLMOutputWithPast(
            loss=loss,
            aux_loss=aux_loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
            router_logits=outputs.router_logits,
        )

    def prepare_inputs_for_generation(
        self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
    ):
        """
        Prepares inputs for generation in the Qwen2MoeForCausalLM class.

        Args:
            self: The instance of the Qwen2MoeForCausalLM class.
            input_ids (mindspore.Tensor): The input tensor of shape (batch_size, sequence_length) containing the input IDs.
            past_key_values (Union[Cache, Tuple[Tuple[mindspore.Tensor]]]): Optional. The cached key/value states from
                earlier generation steps. If past_key_values is an instance of Cache, three lengths are read from it:

                - cache_length (int): The number of tokens currently in the cache, from `get_seq_length()`.
                - past_length (int): The number of tokens already processed, from `seen_tokens`.
                - max_cache_length (Optional[int]): The maximum cache length, from `get_max_length()`.

                If past_key_values is a tuple, cache_length and past_length are both taken from dimension 2 of the
                first layer's first tensor, and max_cache_length is None.
            attention_mask (mindspore.Tensor): Optional. The attention mask tensor of shape (batch_size, sequence_length)
                for the input IDs.
            inputs_embeds (mindspore.Tensor): Optional. The input embeddings tensor of shape
                (batch_size, sequence_length, hidden_size).
            **kwargs: Additional keyword arguments.

        Returns:
            dict:
                A dictionary containing the model inputs for generation. It holds either 'inputs_embeds' or
                'input_ids' (never both), plus the four remaining keys:

                - 'inputs_embeds' (mindspore.Tensor): The input embeddings tensor. Used only on the first step, when no cache exists yet.
                - 'input_ids' (mindspore.Tensor): The input IDs tensor, trimmed to the unprocessed tokens.
                - 'position_ids' (mindspore.Tensor): The position IDs tensor.
                - 'past_key_values' (Union[Cache, Tuple[Tuple[mindspore.Tensor]]]): The cached key/value states.
                - 'use_cache' (Optional[bool]): Indicates whether to use cache during generation.
                - 'attention_mask' (mindspore.Tensor): The attention mask tensor.

        Raises:
            None.
        """
        # Omit tokens covered by past_key_values
        if past_key_values is not None:
            if isinstance(past_key_values, Cache):
                cache_length = past_key_values.get_seq_length()
                past_length = past_key_values.seen_tokens
                max_cache_length = past_key_values.get_max_length()
            else:
                cache_length = past_length = past_key_values[0][0].shape[2]
                max_cache_length = None

            # Keep only the unprocessed tokens:
            # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
            # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
            # input)
            if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
                input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
            # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
            # input_ids based on the past_length.
            elif past_length < input_ids.shape[1]:
                input_ids = input_ids[:, past_length:]
            # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.

            # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
            if (
                max_cache_length is not None
                and attention_mask is not None
                and cache_length + input_ids.shape[1] > max_cache_length
            ):
                attention_mask = attention_mask[:, -max_cache_length:]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids = position_ids.masked_fill(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -input_ids.shape[1] :]

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "position_ids": position_ids,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
            }
        )
        return model_inputs

    @staticmethod
    def _reorder_cache(past_key_values, beam_idx):
        """
        Reorders the cache based on the provided beam index.

        Args:
            past_key_values (tuple): A tuple containing the past key values for each layer.
                Each element in the tuple represents the past key values for a specific layer.
            beam_idx (Tensor): A tensor containing the indices to reorder the cache based on the beam search results.

        Returns:
            tuple: The reordered cache, with the same nesting as past_key_values. Every tensor is
                re-indexed along dimension 0 by beam_idx.

        Raises:
            None explicitly. Errors raised by `index_select` (for example, on incompatible inputs) propagate.
        """
        reordered_past = ()
        for layer_past in past_key_values:
            reordered_past += (
                tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),
            )
        return reordered_past
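
The reordering done by `_reorder_cache` can be sketched without any tensor library. This is only an illustration: plain Python lists stand in for tensors, the helper name `reorder_cache` and the string placeholders are hypothetical, and list indexing replaces `index_select(0, beam_idx)`.

```python
def reorder_cache(past_key_values, beam_idx):
    """Pure-Python analogue of _reorder_cache; lists stand in for tensors."""
    return tuple(
        # Pick the rows named by beam_idx along axis 0 of every cached state.
        tuple([state[i] for i in beam_idx] for state in layer_past)
        for layer_past in past_key_values
    )


# Two layers, each holding a (key, value) pair; axis 0 has three beams.
past = (
    (["k0_beam0", "k0_beam1", "k0_beam2"], ["v0_beam0", "v0_beam1", "v0_beam2"]),
    (["k1_beam0", "k1_beam1", "k1_beam2"], ["v1_beam0", "v1_beam1", "v1_beam2"]),
)

# Beam search kept beam 2 once and beam 0 twice.
reordered = reorder_cache(past, beam_idx=[2, 0, 0])
print(reordered[0][0])  # ['k0_beam2', 'k0_beam0', 'k0_beam0']
```

The nesting (layers, then key/value states) is preserved; only the leading batch/beam axis of each state is permuted.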

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForCausalLM.__init__(config)

Initializes a Qwen2MoeForCausalLM object.

PARAMETER DESCRIPTION
self

The instance of the class.

TYPE: Qwen2MoeForCausalLM

config

The model configuration. The following attributes are read directly:

  • vocab_size (int): The size of the vocabulary.
  • hidden_size (int): The size of the hidden layer.
  • router_aux_loss_coef (float): Coefficient for router auxiliary loss.
  • num_experts (int): The total number of experts.
  • num_experts_per_tok (int): Number of experts per token.

TYPE: Qwen2MoeConfig

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def __init__(self, config):
    """
    Initializes a Qwen2MoeForCausalLM object.

    Args:
        self (Qwen2MoeForCausalLM): The instance of the class.
        config (Qwen2MoeConfig):
            The model configuration. The following attributes are read directly:

            - vocab_size (int): The size of the vocabulary.
            - hidden_size (int): The size of the hidden layer.
            - router_aux_loss_coef (float): Coefficient for router auxiliary loss.
            - num_experts (int): The total number of experts.
            - num_experts_per_tok (int): Number of experts per token.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)
    self.model = Qwen2MoeModel(config)
    self.vocab_size = config.vocab_size
    self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    self.router_aux_loss_coef = config.router_aux_loss_coef
    self.num_experts = config.num_experts
    self.num_experts_per_tok = config.num_experts_per_tok
    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForCausalLM.forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, output_router_logits=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the causal (next-token) language modeling loss. Indices should either be in [0, ..., config.vocab_size - 1] or -100. Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size - 1].

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, MoeCausalLMOutputWithPast]

Union[Tuple, MoeCausalLMOutputWithPast]

Example
>>> from mindnlp.transformers import AutoTokenizer, Qwen2MoeForCausalLM
...
>>> model = Qwen2MoeForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
...
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="ms")
...
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    output_router_logits: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, MoeCausalLMOutputWithPast]:
    r"""

    Args:
        labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the causal (next-token) language modeling loss. Indices should either be in
            `[0, ..., config.vocab_size - 1]` or -100. Tokens with indices set to `-100` are ignored (masked);
            the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.

    Returns:
        `Union[Tuple, MoeCausalLMOutputWithPast]`

    Example:
        ```python
        >>> from mindnlp.transformers import AutoTokenizer, Qwen2MoeForCausalLM
        ...
        >>> model = Qwen2MoeForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
        >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
        ...
        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="ms")
        ...
        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_router_logits = (
        output_router_logits if output_router_logits is not None else self.config.output_router_logits
    )
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
    outputs = self.model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        output_router_logits=output_router_logits,
        return_dict=return_dict,
    )

    hidden_states = outputs[0]
    logits = self.lm_head(hidden_states)
    logits = logits.float()

    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        shift_logits = logits[..., :-1, :]
        shift_labels = labels[..., 1:]
        # Flatten the tokens
        shift_logits = shift_logits.view(-1, self.config.vocab_size)
        shift_labels = shift_labels.view(-1)
        # Cross-entropy over the flattened tokens
        loss = ops.cross_entropy(shift_logits, shift_labels)

    aux_loss = None
    if output_router_logits:
        aux_loss = load_balancing_loss_func(
            outputs.router_logits if return_dict else outputs[-1],
            self.num_experts,
            self.num_experts_per_tok,
            attention_mask,
        )
        if labels is not None:
            loss += self.router_aux_loss_coef * aux_loss

    if not return_dict:
        output = (logits,) + outputs[1:]
        if output_router_logits:
            output = (aux_loss,) + output
        return (loss,) + output if loss is not None else output

    return MoeCausalLMOutputWithPast(
        loss=loss,
        aux_loss=aux_loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
        router_logits=outputs.router_logits,
    )
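
The label shifting and loss combination in `forward` can be illustrated with a small pure-Python sketch. It is not the library's implementation: it handles a single sequence, uses plain lists instead of tensors, and assumes the mean is taken over non-ignored positions (the usual convention for an ignore index of -100). The helper name `shifted_cross_entropy` and the numeric values for `aux_loss` and `router_aux_loss_coef` are made up for illustration.

```python
import math

IGNORE_INDEX = -100  # label value skipped when averaging (assumed convention)


def shifted_cross_entropy(logits, labels):
    """logits: one score list per position; labels: one int per position."""
    shift_logits = logits[:-1]  # positions 0..n-2 make the predictions ...
    shift_labels = labels[1:]   # ... for the tokens at positions 1..n-1
    total, count = 0.0, 0
    for scores, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue  # masked label: contributes nothing to the loss
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[target]  # negative log-softmax of the target
        count += 1
    return total / count if count else float("nan")


logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
labels = [0, 0, IGNORE_INDEX]

loss = shifted_cross_entropy(logits, labels)

# When router logits are requested, the auxiliary loss is added with a weight.
aux_loss = 0.5                # hypothetical value
router_aux_loss_coef = 0.001  # hypothetical value
total_loss = loss + router_aux_loss_coef * aux_loss
```

Here only the first position contributes: its target is `labels[1]`, while the second position's target is masked and the last position has no target at all.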

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForCausalLM.get_decoder()

Returns the decoder model used in the Qwen2MoeForCausalLM class.

PARAMETER DESCRIPTION
self

An instance of the Qwen2MoeForCausalLM class.

RETURNS DESCRIPTION

The decoder stored at self.model (a Qwen2MoeModel unless replaced via set_decoder).

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def get_decoder(self):
    """
    Returns the decoder model used in the Qwen2MoeForCausalLM class.

    Args:
        self: An instance of the Qwen2MoeForCausalLM class.

    Returns:
        The decoder stored at `self.model` (a Qwen2MoeModel unless replaced via `set_decoder`).

    Raises:
        None
    """
    return self.model

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForCausalLM.get_input_embeddings()

Returns the input embeddings for the Qwen2MoeForCausalLM model.

PARAMETER DESCRIPTION
self

The instance of the Qwen2MoeForCausalLM class.

TYPE: Qwen2MoeForCausalLM

RETURNS DESCRIPTION

The token embedding layer stored at self.model.embed_tokens.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def get_input_embeddings(self):
    """
    Returns the input embeddings for the Qwen2MoeForCausalLM model.

    Args:
        self (Qwen2MoeForCausalLM): The instance of the Qwen2MoeForCausalLM class.

    Returns:
        The token embedding layer stored at `self.model.embed_tokens`.

    Raises:
        None.
    """
    return self.model.embed_tokens

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForCausalLM.get_output_embeddings()

Return the output embeddings of the Qwen2MoeForCausalLM model.

PARAMETER DESCRIPTION
self

An instance of the Qwen2MoeForCausalLM class.

TYPE: Qwen2MoeForCausalLM

RETURNS DESCRIPTION

The output projection layer stored at self.lm_head.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def get_output_embeddings(self):
    """Return the output embeddings of the Qwen2MoeForCausalLM model.

    Args:
        self (Qwen2MoeForCausalLM): An instance of the Qwen2MoeForCausalLM class.

    Returns:
        The output projection layer stored at `self.lm_head`.

    Raises:
        None.
    """
    return self.lm_head

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForCausalLM.prepare_inputs_for_generation(input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs)

Prepares inputs for generation in the Qwen2MoeForCausalLM class.

PARAMETER DESCRIPTION
self

The instance of the Qwen2MoeForCausalLM class.

input_ids

The input tensor of shape (batch_size, sequence_length) containing the input IDs.

TYPE: Tensor

past_key_values

Optional. The cached key/value states from earlier generation steps. If past_key_values is an instance of Cache, three lengths are read from it:

  • cache_length (int): The number of tokens currently in the cache, from get_seq_length().
  • past_length (int): The number of tokens already processed, from seen_tokens.
  • max_cache_length (Optional[int]): The maximum cache length, from get_max_length().

If past_key_values is a tuple, cache_length and past_length are both taken from dimension 2 of the first layer's first tensor, and max_cache_length is None.

TYPE: Union[Cache, Tuple[Tensor]] DEFAULT: None

attention_mask

Optional. The attention mask tensor of shape (batch_size, sequence_length) containing the attention mask for the input IDs.

TYPE: Tensor DEFAULT: None

inputs_embeds

Optional. The input embeddings tensor of shape (batch_size, sequence_length, hidden_size) containing the input embeddings.

TYPE: Tensor DEFAULT: None

**kwargs

Additional keyword arguments.

DEFAULT: {}

RETURNS DESCRIPTION
dict

A dictionary containing the model inputs for generation. It holds either 'inputs_embeds' or 'input_ids' (never both), plus the four remaining keys:

  • 'inputs_embeds' (mindspore.Tensor): The input embeddings tensor. Used only on the first step, when no cache exists yet.
  • 'input_ids' (mindspore.Tensor): The input IDs tensor, trimmed to the unprocessed tokens.
  • 'position_ids' (mindspore.Tensor): The position IDs tensor.
  • 'past_key_values' (Union[Cache, Tuple[Tuple[mindspore.Tensor]]]): The cached key/value states.
  • 'use_cache' (Optional[bool]): Indicates whether to use cache during generation.
  • 'attention_mask' (mindspore.Tensor): The attention mask tensor.
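
The rule for which tokens stay in 'input_ids' once a cache exists can be sketched for a single row of token IDs. This is a simplified illustration, not the library code: the helper `trim_input_ids` is hypothetical, plain lists replace tensors, and the attention mask is represented only by its length.

```python
def trim_input_ids(input_ids, past_length, attention_mask_len=None):
    """Return the tokens of one row that the cache has not processed yet."""
    n = len(input_ids)
    # Case 1: the mask is longer than input_ids, so part of the input
    # lives only in the cache; keep the tail the cache has not covered.
    if attention_mask_len is not None and attention_mask_len > n:
        return input_ids[-(attention_mask_len - past_length):]
    # Case 2: input_ids holds the full sequence; drop the cached prefix.
    if past_length < n:
        return input_ids[past_length:]
    # Case 3: input_ids is assumed to contain only unprocessed tokens.
    return input_ids


full_sequence = [11, 12, 13, 14]
print(trim_input_ids(full_sequence, past_length=3, attention_mask_len=4))  # [14]
print(trim_input_ids([14], past_length=3, attention_mask_len=4))           # [14]
print(trim_input_ids([14], past_length=3))                                 # [14]
```

All three calls describe the same decoding step (three cached tokens, one new token) and end with the single unprocessed token.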
Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def prepare_inputs_for_generation(
    self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
):
    """
    Prepares inputs for generation in the Qwen2MoeForCausalLM class.

    Args:
        self: The instance of the Qwen2MoeForCausalLM class.
        input_ids (mindspore.Tensor): The input tensor of shape (batch_size, sequence_length) containing the input IDs.
        past_key_values (Union[Cache, Tuple[Tuple[mindspore.Tensor]]]): Optional. The cached key/value states from
            earlier generation steps. If past_key_values is an instance of Cache, three lengths are read from it:

            - cache_length (int): The number of tokens currently in the cache, from `get_seq_length()`.
            - past_length (int): The number of tokens already processed, from `seen_tokens`.
            - max_cache_length (Optional[int]): The maximum cache length, from `get_max_length()`.

            If past_key_values is a tuple, cache_length and past_length are both taken from dimension 2 of the
            first layer's first tensor, and max_cache_length is None.
        attention_mask (mindspore.Tensor): Optional. The attention mask tensor of shape (batch_size, sequence_length)
            for the input IDs.
        inputs_embeds (mindspore.Tensor): Optional. The input embeddings tensor of shape
            (batch_size, sequence_length, hidden_size).
        **kwargs: Additional keyword arguments.

    Returns:
        dict:
            A dictionary containing the model inputs for generation. It holds either 'inputs_embeds' or
            'input_ids' (never both), plus the four remaining keys:

            - 'inputs_embeds' (mindspore.Tensor): The input embeddings tensor. Used only on the first step, when no cache exists yet.
            - 'input_ids' (mindspore.Tensor): The input IDs tensor, trimmed to the unprocessed tokens.
            - 'position_ids' (mindspore.Tensor): The position IDs tensor.
            - 'past_key_values' (Union[Cache, Tuple[Tuple[mindspore.Tensor]]]): The cached key/value states.
            - 'use_cache' (Optional[bool]): Indicates whether to use cache during generation.
            - 'attention_mask' (mindspore.Tensor): The attention mask tensor.

    Raises:
        None.
    """
    # Omit tokens covered by past_key_values
    if past_key_values is not None:
        if isinstance(past_key_values, Cache):
            cache_length = past_key_values.get_seq_length()
            past_length = past_key_values.seen_tokens
            max_cache_length = past_key_values.get_max_length()
        else:
            cache_length = past_length = past_key_values[0][0].shape[2]
            max_cache_length = None

        # Keep only the unprocessed tokens:
        # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
        # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
        # input)
        if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
            input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
        # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
        # input_ids based on the past_length.
        elif past_length < input_ids.shape[1]:
            input_ids = input_ids[:, past_length:]
        # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.

        # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
        if (
            max_cache_length is not None
            and attention_mask is not None
            and cache_length + input_ids.shape[1] > max_cache_length
        ):
            attention_mask = attention_mask[:, -max_cache_length:]

    position_ids = kwargs.get("position_ids", None)
    if attention_mask is not None and position_ids is None:
        # create position_ids on the fly for batch generation
        position_ids = attention_mask.long().cumsum(-1) - 1
        position_ids = position_ids.masked_fill(attention_mask == 0, 1)
        if past_key_values:
            position_ids = position_ids[:, -input_ids.shape[1] :]

    # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
    if inputs_embeds is not None and past_key_values is None:
        model_inputs = {"inputs_embeds": inputs_embeds}
    else:
        model_inputs = {"input_ids": input_ids}

    model_inputs.update(
        {
            "position_ids": position_ids,
            "past_key_values": past_key_values,
            "use_cache": kwargs.get("use_cache"),
            "attention_mask": attention_mask,
        }
    )
    return model_inputs
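
How `position_ids` are derived from the attention mask (cumulative sum minus one, with padded slots filled with 1) can be shown with a short pure-Python sketch. It uses nested lists instead of tensors, and the helper name `position_ids_from_mask` is hypothetical.

```python
from itertools import accumulate


def position_ids_from_mask(attention_mask):
    """attention_mask: rows of 0/1 flags; returns one position row per input row."""
    rows = []
    for row in attention_mask:
        cumsum = list(accumulate(row))  # running count of real tokens
        # Real tokens get (count - 1); padded slots get the filler value 1.
        rows.append([c - 1 if m else 1 for c, m in zip(cumsum, row)])
    return rows


# First row is left-padded by two slots; second row has no padding.
mask = [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]
position_ids = position_ids_from_mask(mask)
print(position_ids)  # [[1, 1, 0, 1, 2], [0, 1, 2, 3, 4]]

# With a cache present, only the positions of the new tokens are kept,
# e.g. the last column when one token is fed per step.
new_token_count = 1
print([row[-new_token_count:] for row in position_ids])  # [[2], [4]]
```

With left padding, the first real token of every row starts at position 0, so rows of different lengths in a batch receive consistent positions.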

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForCausalLM.set_decoder(decoder)

Sets the decoder for the Qwen2MoeForCausalLM class.

PARAMETER DESCRIPTION
self

An instance of the Qwen2MoeForCausalLM class.

TYPE: Qwen2MoeForCausalLM

decoder

The decoder to be set for the Qwen2MoeForCausalLM class.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def set_decoder(self, decoder):
    """
    Sets the decoder for the Qwen2MoeForCausalLM class.

    Args:
        self (Qwen2MoeForCausalLM): An instance of the Qwen2MoeForCausalLM class.
        decoder: The decoder to be set for the Qwen2MoeForCausalLM class.

    Returns:
        None.

    Raises:
        None.
    """
    self.model = decoder

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForCausalLM.set_input_embeddings(value)

Description

This method sets the input embeddings for the Qwen2MoeForCausalLM model.

PARAMETER DESCRIPTION
self

The instance of the Qwen2MoeForCausalLM class. This parameter refers to the current instance of the model where the input embeddings will be set.

TYPE: Qwen2MoeForCausalLM

value

The input embeddings to be set for the model.

  • Type: Any
  • Purpose: The value representing the input embeddings that will be assigned to the model's embed_tokens attribute.
  • Restrictions: None

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def set_input_embeddings(self, value):
    """
    Method: set_input_embeddings

    Description:
        This method sets the input embeddings for the Qwen2MoeForCausalLM model.

    Args:
        self (Qwen2MoeForCausalLM): The instance of the Qwen2MoeForCausalLM class.
            This parameter refers to the current instance of the model where the input embeddings will be set.

        value:
            The input embeddings to be set for the model.

            - Type: Any
            - Purpose: The value representing the input embeddings that will be assigned to the model's 
            embed_tokens attribute.
            - Restrictions: None

    Returns:
        None.

    Raises:
        None.
    """
    self.model.embed_tokens = value

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForCausalLM.set_output_embeddings(new_embeddings)

Sets new output embeddings for the Qwen2MoeForCausalLM model.

PARAMETER DESCRIPTION
self

The instance of the Qwen2MoeForCausalLM class.

TYPE: Qwen2MoeForCausalLM

new_embeddings

The new output embeddings to be set for the model. Should be of the desired embedding type.

TYPE: object

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def set_output_embeddings(self, new_embeddings):
    """
    Sets new output embeddings for the Qwen2MoeForCausalLM model.

    Args:
        self (Qwen2MoeForCausalLM): The instance of the Qwen2MoeForCausalLM class.
        new_embeddings (object): The new output embeddings to be set for the model. 
            Should be of the desired embedding type.

    Returns:
        None.

    Raises:
        None.
    """
    self.lm_head = new_embeddings
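
The setters documented above each write to a different attribute, which matters when they are combined. The sketch below uses two hypothetical toy classes (`ToyDecoder` and `ToyCausalLM`, not part of mindnlp) that mirror only the attribute assignments shown in the source listings.

```python
class ToyDecoder:
    """Hypothetical stand-in for Qwen2MoeModel; holds only the attribute the setters touch."""
    def __init__(self):
        self.embed_tokens = "old-input-embeddings"

class ToyCausalLM:
    """Hypothetical stand-in that mirrors the three setters documented above."""
    def __init__(self):
        self.model = ToyDecoder()
        self.lm_head = "old-output-head"

    def set_decoder(self, decoder):
        self.model = decoder

    def set_input_embeddings(self, value):
        # Input embeddings live on the decoder, not on the wrapper.
        self.model.embed_tokens = value

    def set_output_embeddings(self, new_embeddings):
        # Output embeddings are the language-model head on the wrapper.
        self.lm_head = new_embeddings

lm = ToyCausalLM()
lm.set_input_embeddings("new-input-embeddings")
lm.set_output_embeddings("new-output-head")
replacement = ToyDecoder()
lm.set_decoder(replacement)  # swaps the whole decoder, including its embed_tokens
```

Because the input embeddings are stored on the decoder, calling `set_decoder` after `set_input_embeddings` replaces the embeddings set earlier, while `lm_head` is unaffected.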

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForSequenceClassification

Bases: Qwen2MoePreTrainedModel

Qwen2MoeForSequenceClassification implements a sequence classification model on top of the Qwen2Moe architecture: a Qwen2MoeModel backbone followed by a bias-free linear scoring head. It inherits from Qwen2MoePreTrainedModel and provides methods for initializing the model, getting and setting the input embeddings, and running the forward pass for sequence classification.

ATTRIBUTE DESCRIPTION
num_labels

Number of labels for classification.

TYPE: int

model

The Qwen2MoeModel instance used in the classification model.

TYPE: Qwen2MoeModel

score

Dense layer for computing the classification scores.

TYPE: Linear

METHOD DESCRIPTION
__init__

Initializes the Qwen2MoeForSequenceClassification instance with the provided configuration.

get_input_embeddings

Retrieves the input embeddings from the model.

set_input_embeddings

Sets the input embeddings of the model to the given value.

forward

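Runs the backbone and the scoring head, then pools the logits at the last token before padding in each sequence. When labels are provided, computes the loss according to the problem type (regression, single-label classification, or multi-label classification). Returns a SequenceClassifierOutputWithPast, or, if return_dict is False, a tuple with the loss first when it was computed.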

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
class Qwen2MoeForSequenceClassification(Qwen2MoePreTrainedModel):

    """
    Qwen2MoeForSequenceClassification is a class that implements a sequence classification model based on the
    Qwen2Moe architecture.
    It inherits from the Qwen2MoePreTrainedModel class and provides methods for initializing the model,
    getting and setting input embeddings, and forwarding the model for sequence classification tasks.

    Attributes:
        num_labels (int): Number of labels for classification.
        model (Qwen2MoeModel): The Qwen2MoeModel instance used in the classification model.
        score (nn.Linear): Dense layer for computing the classification scores.

    Methods:
        __init__: Initializes the Qwen2MoeForSequenceClassification instance with the provided configuration.
        get_input_embeddings: Retrieves the input embeddings from the model.
        set_input_embeddings: Sets the input embeddings of the model to the given value.
        forward:
            Constructs the model for sequence classification based on the input parameters.
            Computes the classification loss based on the provided labels and problem type.
            Returns a tuple of loss and output if loss is computed, otherwise returns the model outputs.
    """
    def __init__(self, config):
        """
        Initializes a new instance of the Qwen2MoeForSequenceClassification class.

        Args:
            self: A reference to the current instance of the class.
            config: An instance of the Qwen2MoeConfig class containing the configuration parameters for the model.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = Qwen2MoeModel(config)
        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        Method: get_input_embeddings

        Description:
            This method retrieves the input embeddings from the 'Qwen2MoeForSequenceClassification' model.

        Args:
            self: An instance of the 'Qwen2MoeForSequenceClassification' class.

        Returns:
            The token embedding layer of the underlying Qwen2MoeModel (`self.model.embed_tokens`).

        Raises:
            None

        """
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        """
        Set the input embeddings for the Qwen2MoeForSequenceClassification model.

        Args:
            self (Qwen2MoeForSequenceClassification): The instance of the Qwen2MoeForSequenceClassification class.
            value: The input embeddings to be set for the model, typically an nn.Embedding layer.

        Returns:
            None.

        Raises:
            None.
        """
        self.model.embed_tokens = value

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
                Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
                config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (mean squared error);
                if `config.num_labels > 1`, a classification loss is computed (cross-entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
                sequence_lengths = ops.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
                sequence_lengths = sequence_lengths % input_ids.shape[-1]
            else:
                sequence_lengths = -1

        pooled_logits = logits[ops.arange(batch_size), sequence_lengths]

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                if self.num_labels == 1:
                    loss = ops.mse_loss(pooled_logits.squeeze(), labels.squeeze())
                else:
                    loss = ops.mse_loss(pooled_logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss = ops.cross_entropy(pooled_logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss = ops.binary_cross_entropy_with_logits(pooled_logits, labels)
        if not return_dict:
            output = (pooled_logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForSequenceClassification.__init__(config)

Initializes a new instance of the Qwen2MoeForSequenceClassification class.

PARAMETER DESCRIPTION
self

A reference to the current instance of the class.

config

An instance of the Qwen2MoeConfig class containing the configuration parameters for the model.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def __init__(self, config):
    """
    Initializes a new instance of the Qwen2MoeForSequenceClassification class.

    Args:
        self: A reference to the current instance of the class.
        config: An instance of the Qwen2MoeConfig class containing the configuration parameters for the model.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)
    self.num_labels = config.num_labels
    self.model = Qwen2MoeModel(config)
    self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForSequenceClassification.forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1, a regression loss is computed (mean squared error). If config.num_labels > 1, a classification loss is computed: cross-entropy for integer labels, or binary cross-entropy with logits for non-integer labels, unless config.problem_type is set explicitly.

TYPE: `mindspore.Tensor` of shape `(batch_size,)`, *optional* DEFAULT: None
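
When `config.problem_type` is unset, the forward pass picks a loss from `num_labels` and the label dtype. The following minimal sketch restates that selection in plain Python; the function name `infer_problem_type` and the boolean `labels_are_integer` (standing in for the int32/int64 dtype check) are hypothetical.

```python
def infer_problem_type(num_labels, labels_are_integer):
    """Mirror of the problem-type selection in forward (illustrative only)."""
    if num_labels == 1:
        # A single output is always treated as regression (MSE loss).
        return "regression"
    if num_labels > 1 and labels_are_integer:
        # Integer class indices: cross-entropy loss.
        return "single_label_classification"
    # Anything else, e.g. float multi-hot targets: binary cross-entropy with logits.
    return "multi_label_classification"

print(infer_problem_type(1, labels_are_integer=False))  # regression
print(infer_problem_type(3, labels_are_integer=True))   # single_label_classification
print(infer_problem_type(3, labels_are_integer=False))  # multi_label_classification
```

Note that the inferred value is written back to `config.problem_type` in the real method, so the first batch with labels fixes the loss for later calls.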

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, SequenceClassifierOutputWithPast]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (mean squared error);
            if `config.num_labels > 1`, a classification loss is computed (cross-entropy).
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    transformer_outputs = self.model(
        input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    hidden_states = transformer_outputs[0]
    logits = self.score(hidden_states)

    if input_ids is not None:
        batch_size = input_ids.shape[0]
    else:
        batch_size = inputs_embeds.shape[0]

    if self.config.pad_token_id is None and batch_size != 1:
        raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
    if self.config.pad_token_id is None:
        sequence_lengths = -1
    else:
        if input_ids is not None:
            # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
            sequence_lengths = ops.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
            sequence_lengths = sequence_lengths % input_ids.shape[-1]
        else:
            sequence_lengths = -1

    pooled_logits = logits[ops.arange(batch_size), sequence_lengths]

    loss = None
    if labels is not None:
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            if self.num_labels == 1:
                loss = ops.mse_loss(pooled_logits.squeeze(), labels.squeeze())
            else:
                loss = ops.mse_loss(pooled_logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss = ops.cross_entropy(pooled_logits.view(-1, self.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss = ops.binary_cross_entropy_with_logits(pooled_logits, labels)
    if not return_dict:
        output = (pooled_logits,) + transformer_outputs[1:]
        return ((loss,) + output) if loss is not None else output

    return SequenceClassifierOutputWithPast(
        loss=loss,
        logits=pooled_logits,
        past_key_values=transformer_outputs.past_key_values,
        hidden_states=transformer_outputs.hidden_states,
        attentions=transformer_outputs.attentions,
    )
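
The pooling step in the listing above selects, for each row, the logits at the token just before the first padding token. Below is a minimal sketch of the index arithmetic using NumPy in place of `mindspore.ops`; the helper `last_token_index`, the pad id `PAD = 0`, and the right-padded example rows are assumptions made for illustration.

```python
import numpy as np

def last_token_index(input_ids, pad_token_id):
    """NumPy analogue of the sequence_lengths computation (illustrative only)."""
    ids = np.asarray(input_ids)
    # argmax over the pad mask gives the first pad position, or 0 when a row has no pad.
    first_pad = (ids == pad_token_id).astype(np.int64).argmax(-1)
    # Step back one slot; a row without padding yields -1, which the modulo wraps to the last index.
    return (first_pad - 1) % ids.shape[-1]

PAD = 0  # hypothetical pad token id
input_ids = [[5, 6, 7, PAD, PAD],   # right-padded row
             [8, 9, 10, 11, 12]]    # row without padding
idx = last_token_index(input_ids, PAD)

# Fake per-token logits of shape (batch, seq_len, num_labels) to show the gather.
logits = np.arange(2 * 5 * 3).reshape(2, 5, 3)
pooled = logits[np.arange(2), idx]
print(idx.tolist())     # [2, 4]
print(pooled.tolist())  # [[6, 7, 8], [27, 28, 29]]
```

The modulo replaces negative indexing, which is what the source comment refers to when it mentions ONNX compatibility.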

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForSequenceClassification.get_input_embeddings()

Description

This method retrieves the input embeddings from the 'Qwen2MoeForSequenceClassification' model.

PARAMETER DESCRIPTION
self

An instance of the 'Qwen2MoeForSequenceClassification' class.

RETURNS DESCRIPTION

The token embedding layer of the underlying Qwen2MoeModel (self.model.embed_tokens).

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def get_input_embeddings(self):
    """
    Method: get_input_embeddings

    Description:
        This method retrieves the input embeddings from the 'Qwen2MoeForSequenceClassification' model.

    Args:
        self: An instance of the 'Qwen2MoeForSequenceClassification' class.

    Returns:
        The token embedding layer of the underlying Qwen2MoeModel (`self.model.embed_tokens`).

    Raises:
        None

    """
    return self.model.embed_tokens

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeForSequenceClassification.set_input_embeddings(value)

Set the input embeddings for the Qwen2MoeForSequenceClassification model.

PARAMETER DESCRIPTION
self

The instance of the Qwen2MoeForSequenceClassification class.

TYPE: Qwen2MoeForSequenceClassification

value

The input embeddings to be set for the model, typically an nn.Embedding layer.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def set_input_embeddings(self, value):
    """
    Set the input embeddings for the Qwen2MoeForSequenceClassification model.

    Args:
        self (Qwen2MoeForSequenceClassification): The instance of the Qwen2MoeForSequenceClassification class.
        value: The input embeddings to be set for the model, typically an nn.Embedding layer.

    Returns:
        None.

    Raises:
        None.
    """
    self.model.embed_tokens = value

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeMLP

Bases: Module

Qwen2MoeMLP is a gated multi-layer perceptron (MLP) block with separate projection layers for gating and feature transformation.

The Qwen2MoeMLP class inherits from nn.Module and is initialized with a configuration and an intermediate size. Its forward method applies the block to an input tensor.

ATTRIBUTE DESCRIPTION
config

The configuration object used for initializing the MLP.

hidden_size

The size of the hidden layers in the MLP.

intermediate_size

The optional intermediate size for the projection layers.

gate_proj

The projection layer for gating, implemented as a bias-free nn.Linear layer mapping hidden_size to intermediate_size.

up_proj

The projection layer for feature transformation, implemented as a bias-free nn.Linear layer mapping hidden_size to intermediate_size.

down_proj

The projection back to the hidden dimension, implemented as a bias-free nn.Linear layer mapping intermediate_size to hidden_size.

act_fn

The activation function used in the MLP model, derived from the configuration's hidden activation function.

METHOD DESCRIPTION
forward

Applies the gated MLP to the input x and returns down_proj(act_fn(gate_proj(x)) * up_proj(x)).

Note

The Qwen2MoeMLP class assumes the availability of the nn module for neural network operations.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
class Qwen2MoeMLP(nn.Module):

    """
    Qwen2MoeMLP represents a multi-layer perceptron (MLP) model with customized projection layers for gating and
    feature transformation.

    The Qwen2MoeMLP class inherits from nn.Module and is initialized with a configuration and an optional intermediate size.
    The class provides methods to forward and manipulate the MLP model.

    Attributes:
        config: The configuration object used for initializing the MLP.
        hidden_size: The size of the hidden layers in the MLP.
        intermediate_size: The optional intermediate size for the projection layers.
        gate_proj: The projection layer for gating, implemented as a Dense layer with the hidden size and
            intermediate size.
        up_proj: The projection layer for feature transformation, implemented as a Dense layer with the hidden size
            and intermediate size.
        down_proj: The inverse projection layer for feature transformation, implemented as a Dense layer with the
            intermediate size and hidden size.
        act_fn: The activation function used in the MLP model, derived from the configuration's
            hidden activation function.

    Methods:
        forward(x): Constructs the multi-layer perceptron model using the provided input x.
            This method applies the gating, feature transformation, and activation function to the input data.

    Note:
        The Qwen2MoeMLP class assumes the availability of the nn module for neural network operations.
    """
    def __init__(self, config, intermediate_size=None):
        """
        Initializes an instance of the Qwen2MoeMLP class.

        Args:
            self: The instance of the class.
            config (object): The configuration object containing various settings and parameters.
            intermediate_size (int, optional): The size of the intermediate layer. Defaults to None.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        """
        Applies the gated MLP to the input.

        Args:
            self (Qwen2MoeMLP): An instance of the Qwen2MoeMLP class.
            x: Input tensor whose last dimension equals hidden_size.

        Returns:
            The output tensor down_proj(act_fn(gate_proj(x)) * up_proj(x)), with the same shape as x.

        Raises:
            None.
        """
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeMLP.__init__(config, intermediate_size=None)

Initializes an instance of the Qwen2MoeMLP class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object containing various settings and parameters.

TYPE: object

intermediate_size

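The size of the intermediate layer. Defaults to None, but the projection layers are built directly from this value, so callers are expected to pass an integer.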

TYPE: int DEFAULT: None

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def __init__(self, config, intermediate_size=None):
    """
    Initializes an instance of the Qwen2MoeMLP class.

    Args:
        self: The instance of the class.
        config (object): The configuration object containing various settings and parameters.
        intermediate_size (int, optional): The size of the intermediate layer. Defaults to None.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    self.config = config
    self.hidden_size = config.hidden_size
    self.intermediate_size = intermediate_size
    self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
    self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
    self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
    self.act_fn = ACT2FN[config.hidden_act]

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeMLP.forward(x)

Applies the gated MLP to the input and returns down_proj(act_fn(gate_proj(x)) * up_proj(x)).

PARAMETER DESCRIPTION
self

An instance of the Qwen2MoeMLP class. Represents the object itself.

TYPE: Qwen2MoeMLP

x

Input tensor to be processed by the MLP. Its last dimension must equal hidden_size.

RETURNS DESCRIPTION

The output tensor down_proj(act_fn(gate_proj(x)) * up_proj(x)), with the same shape as x.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def forward(self, x):
    """
    Applies the gated MLP to the input.

    Args:
        self (Qwen2MoeMLP): An instance of the Qwen2MoeMLP class.
        x: Input tensor whose last dimension equals hidden_size.

    Returns:
        The output tensor down_proj(act_fn(gate_proj(x)) * up_proj(x)), with the same shape as x.

    Raises:
        None.
    """
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
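
The one-line forward above is a gated MLP: the gate projection is passed through the activation and multiplied elementwise with the up projection before being projected back down. The sketch below reproduces that data flow in NumPy. It assumes SiLU as the activation (the real choice comes from `config.hidden_act`) and stores weights as `(in_features, out_features)` for brevity; the names `silu` and `gated_mlp` and the tiny sizes are hypothetical.

```python
import numpy as np

def silu(x):
    # SiLU / swish: x * sigmoid(x). Assumed here; the model reads the activation from config.hidden_act.
    return x / (1.0 + np.exp(-x))

def gated_mlp(x, w_gate, w_up, w_down):
    """down_proj(act_fn(gate_proj(x)) * up_proj(x)) with bias-free projections."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
hidden_size, intermediate_size = 4, 8  # illustrative sizes only
w_gate = rng.normal(size=(hidden_size, intermediate_size))
w_up = rng.normal(size=(hidden_size, intermediate_size))
w_down = rng.normal(size=(intermediate_size, hidden_size))

x = rng.normal(size=(2, 3, hidden_size))  # (batch, seq_len, hidden_size)
y = gated_mlp(x, w_gate, w_up, w_down)
print(y.shape)  # (2, 3, 4)
```

The output keeps the shape of the input, which is why the block can be dropped into a residual connection inside a decoder layer.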

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeModel

Bases: Qwen2MoePreTrainedModel

Transformer decoder consisting of config.num_hidden_layers layers. Each layer is a [Qwen2MoeDecoderLayer]

PARAMETER DESCRIPTION
config

The configuration object that defines the model architecture.

TYPE: Qwen2MoeConfig

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
class Qwen2MoeModel(Qwen2MoePreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Qwen2MoeDecoderLayer`]

    Args:
        config: Qwen2MoeConfig
    """
    def __init__(self, config: Qwen2MoeConfig):
        """
        Initializes a new instance of the Qwen2MoeModel class.

        Args:
            self: The current object instance.
            config (Qwen2MoeConfig): The configuration object for the model.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.layers = nn.ModuleList(
            [Qwen2MoeDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self.norm = Qwen2MoeRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        """
        Get the input embeddings.

        Args:
            self (Qwen2MoeModel): The instance of the Qwen2MoeModel class.

        Returns:
            The token embedding layer (`self.embed_tokens`).

        Raises:
            None.
        """
        return self.embed_tokens

    def set_input_embeddings(self, value):
        """
        Sets the input embeddings for the Qwen2MoeModel.

        Args:
            self (Qwen2MoeModel): The instance of the Qwen2MoeModel class.
            value (Any): The input embeddings to be set.
                This should be a tensor or an object that can be assigned to the `embed_tokens` attribute.

        Returns:
            None.

        Raises:
            None.

        This method sets the input embeddings for the Qwen2MoeModel by assigning the given value to the
        `embed_tokens` attribute of the instance.
        """
        self.embed_tokens = value

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        output_router_logits: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, MoeModelOutputWithPast]:
        """
        Runs the forward pass of the Qwen2MoeModel decoder stack.

        Args:
            input_ids (mindspore.Tensor, optional): Token ids of shape (batch_size, seq_length).
                Mutually exclusive with inputs_embeds. Defaults to None.
            attention_mask (mindspore.Tensor, optional): The attention mask. It is expanded to a 4D causal
                mask before being passed through the layers. Defaults to None.
            position_ids (mindspore.Tensor, optional): The position ids. If None, they are generated from
                the cached length and seq_length. Defaults to None.
            past_key_values (List[mindspore.Tensor], optional): Cached key/value states, either a `Cache`
                instance or the legacy format. Defaults to None.
            inputs_embeds (mindspore.Tensor, optional): Precomputed embeddings of shape
                (batch_size, seq_length, hidden_size). Mutually exclusive with input_ids. Defaults to None.
            use_cache (bool, optional): Whether to return the key/value cache.
                If None, falls back to config.use_cache.
            output_attentions (bool, optional): Whether to output attentions.
                If None, falls back to config.output_attentions.
            output_hidden_states (bool, optional): Whether to output hidden states.
                If None, falls back to config.output_hidden_states.
            output_router_logits (bool, optional): Whether to output router logits.
                If None, falls back to config.output_router_logits.
            return_dict (bool, optional): Whether to return a MoeModelOutputWithPast instead of a tuple.
                If None, falls back to config.use_return_dict.

        Returns:
            Union[Tuple, MoeModelOutputWithPast]: The model output. With return_dict=False, a tuple that
            holds only the fields that are not None.

        Raises:
            ValueError: If both input_ids and inputs_embeds are specified at the same time.
            ValueError: If neither input_ids nor inputs_embeds is specified.

        Note:
            When gradient checkpointing is enabled during training, use_cache=True is not supported:
            a warning is logged once and use_cache is set to False.
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_router_logits = (
            output_router_logits if output_router_logits is not None else self.config.output_router_logits
        )
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # retrieve input_ids and inputs_embeds
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
        elif input_ids is not None:
            batch_size, seq_length = input_ids.shape
        elif inputs_embeds is not None:
            batch_size, seq_length, _ = inputs_embeds.shape
        else:
            raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")

        if self.gradient_checkpointing and self.training:
            if use_cache:
                logger.warning_once(
                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                )
                use_cache = False

        past_key_values_length = 0

        if use_cache:
            use_legacy_cache = not isinstance(past_key_values, Cache)
            if use_legacy_cache:
                past_key_values = DynamicCache.from_legacy_cache(past_key_values)
            past_key_values_length = past_key_values.get_usable_length(seq_length)

        if position_ids is None:
            position_ids = ops.arange(
                past_key_values_length, seq_length + past_key_values_length, dtype=mindspore.int64
            )
            position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
        else:
            position_ids = position_ids.view(-1, seq_length).long()

        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)

        # 4d mask is passed through the layers
        attention_mask = _prepare_4d_causal_attention_mask(
            attention_mask,
            (batch_size, seq_length),
            inputs_embeds,
            past_key_values_length,
            sliding_window=self.config.sliding_window,
        )

        hidden_states = inputs_embeds

        # decoder layers
        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
        all_router_logits = () if output_router_logits else None
        next_decoder_cache = None

        for decoder_layer in self.layers:
            if output_hidden_states:
                all_hidden_states += (hidden_states,)

            if self.gradient_checkpointing and self.training:
                layer_outputs = self._gradient_checkpointing_func(
                    decoder_layer.__call__,
                    hidden_states,
                    attention_mask,
                    position_ids,
                    past_key_values,
                    output_attentions,
                    output_router_logits,
                    use_cache,
                )
            else:
                layer_outputs = decoder_layer(
                    hidden_states,
                    attention_mask=attention_mask,
                    position_ids=position_ids,
                    past_key_value=past_key_values,
                    output_attentions=output_attentions,
                    output_router_logits=output_router_logits,
                    use_cache=use_cache,
                )

            hidden_states = layer_outputs[0]

            if use_cache:
                next_decoder_cache = layer_outputs[2 if output_attentions else 1]

            if output_attentions:
                all_self_attns += (layer_outputs[1],)

            if output_router_logits and layer_outputs[-1] is not None:
                all_router_logits += (layer_outputs[-1],)

        hidden_states = self.norm(hidden_states)

        # add hidden states from the last decoder layer
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        next_cache = None
        if use_cache:
            next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache

        if not return_dict:
            return tuple(
                v
                for v in [hidden_states, next_cache, all_hidden_states, all_self_attns, all_router_logits]
                if v is not None
            )
        return MoeModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
            router_logits=all_router_logits,
        )
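
When `position_ids` is not supplied, the forward pass above derives them from the number of cached tokens and the current sequence length. A minimal sketch of that arithmetic, using plain Python lists instead of MindSpore tensors (the helper name `default_position_ids` is hypothetical and not part of mindnlp):

```python
from typing import List


def default_position_ids(seq_length: int, past_key_values_length: int = 0) -> List[List[int]]:
    """Positions for the new tokens, offset by the number of cached tokens.

    Returns a nested list of shape (1, seq_length), mirroring
    arange(past, past + seq_length) followed by unsqueeze(0).
    """
    start = past_key_values_length
    return [list(range(start, start + seq_length))]


# First call: no cache yet, positions start at 0.
print(default_position_ids(seq_length=4))
# Later call: 4 tokens are cached and 1 new token is decoded.
print(default_position_ids(seq_length=1, past_key_values_length=4))
```

The offset is what keeps rotary position information consistent between the prefill step and later single-token decoding steps.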

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeModel.__init__(config)

Initializes a new instance of the Qwen2MoeModel class.

PARAMETER DESCRIPTION
self

The current object instance.

config

The configuration object for the model.

TYPE: Qwen2MoeConfig

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def __init__(self, config: Qwen2MoeConfig):
    """
    Initializes a new instance of the Qwen2MoeModel class.

    Args:
        self: The current object instance.
        config (Qwen2MoeConfig): The configuration object for the model.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)
    self.padding_idx = config.pad_token_id
    self.vocab_size = config.vocab_size

    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
    self.layers = nn.ModuleList(
        [Qwen2MoeDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
    )
    self.norm = Qwen2MoeRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    self.gradient_checkpointing = False
    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeModel.forward(input_ids=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, use_cache=None, output_attentions=None, output_hidden_states=None, output_router_logits=None, return_dict=None)

Runs the forward pass of the Qwen2MoeModel decoder stack.

PARAMETER DESCRIPTION
self

The object instance.

input_ids

The input tensor representing the token ids. Defaults to None.

TYPE: Tensor DEFAULT: None

attention_mask

The tensor representing the attention mask. Defaults to None.

TYPE: Tensor DEFAULT: None

position_ids

The tensor representing the position ids. Defaults to None.

TYPE: Tensor DEFAULT: None

past_key_values

The list of tensors representing past key values. Defaults to None.

TYPE: List[Tensor] DEFAULT: None

inputs_embeds

The tensor representing the embedded inputs. Defaults to None.

TYPE: Tensor DEFAULT: None

use_cache

Whether to use cache or not. Defaults to None.

TYPE: bool DEFAULT: None

output_attentions

Whether to output attentions or not. Defaults to None.

TYPE: bool DEFAULT: None

output_hidden_states

Whether to output hidden states or not. Defaults to None.

TYPE: bool DEFAULT: None

output_router_logits

Whether to output router logits or not. Defaults to None.

TYPE: bool DEFAULT: None

return_dict

Whether to return a dictionary or not. Defaults to None.

TYPE: bool DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, MoeModelOutputWithPast]

The model output: a MoeModelOutputWithPast when return_dict is true, otherwise a tuple that holds only the fields that are not None.
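
Because the tuple form drops every field that is `None`, the index of a given output depends on which flags were enabled. The sketch below imitates that filtering with placeholder strings in place of tensors; `to_output_tuple` is a hypothetical helper written for illustration, not a mindnlp function.

```python
def to_output_tuple(last_hidden_state, past_key_values=None, hidden_states=None,
                    attentions=None, router_logits=None):
    """Mimic the return_dict=False path: keep only the fields that are not None."""
    fields = [last_hidden_state, past_key_values, hidden_states, attentions, router_logits]
    return tuple(v for v in fields if v is not None)


# Only router logits were requested, so they land at index 1 rather than index 4.
outputs = to_output_tuple("hidden", router_logits=("layer0_logits",))
print(outputs)
```

For code that must not depend on such positional shifts, requesting the `MoeModelOutputWithPast` form and reading fields by name is the more robust choice.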

RAISES DESCRIPTION
ValueError

If both input_ids and inputs_embeds are specified at the same time.

ValueError

If neither input_ids nor inputs_embeds are specified.

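Note

No warning is raised as an exception. When gradient checkpointing is enabled during training and use_cache is true, a warning is logged once and use_cache is set to False.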
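
The interaction between `use_cache` and gradient checkpointing can be sketched as a small pure function. This is an illustration of the decision logic only; `effective_use_cache` and the logger name are hypothetical, and the real method uses `logger.warning_once` rather than the standard `logging` call shown here.

```python
import logging

logger = logging.getLogger("qwen2_moe_sketch")  # hypothetical logger name


def effective_use_cache(use_cache, config_use_cache, gradient_checkpointing, training):
    """Resolve use_cache, then force it off under gradient checkpointing in training."""
    # An explicit argument wins; None falls back to the configuration value.
    resolved = use_cache if use_cache is not None else config_use_cache
    if gradient_checkpointing and training and resolved:
        logger.warning("use_cache=True is incompatible with gradient checkpointing; disabling it")
        resolved = False
    return resolved


print(effective_use_cache(None, True, gradient_checkpointing=True, training=True))
print(effective_use_cache(None, True, gradient_checkpointing=True, training=False))
```

Outside of training the cache setting is left untouched, so checkpointing does not affect inference.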

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    output_router_logits: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, MoeModelOutputWithPast]:
    """
    Runs the forward pass of the Qwen2MoeModel decoder stack.

    Args:
        input_ids (mindspore.Tensor, optional): Token ids of shape (batch_size, seq_length).
            Mutually exclusive with inputs_embeds. Defaults to None.
        attention_mask (mindspore.Tensor, optional): The attention mask. It is expanded to a 4D causal
            mask before being passed through the layers. Defaults to None.
        position_ids (mindspore.Tensor, optional): The position ids. If None, they are generated from
            the cached length and seq_length. Defaults to None.
        past_key_values (List[mindspore.Tensor], optional): Cached key/value states, either a `Cache`
            instance or the legacy format. Defaults to None.
        inputs_embeds (mindspore.Tensor, optional): Precomputed embeddings of shape
            (batch_size, seq_length, hidden_size). Mutually exclusive with input_ids. Defaults to None.
        use_cache (bool, optional): Whether to return the key/value cache.
            If None, falls back to config.use_cache.
        output_attentions (bool, optional): Whether to output attentions.
            If None, falls back to config.output_attentions.
        output_hidden_states (bool, optional): Whether to output hidden states.
            If None, falls back to config.output_hidden_states.
        output_router_logits (bool, optional): Whether to output router logits.
            If None, falls back to config.output_router_logits.
        return_dict (bool, optional): Whether to return a MoeModelOutputWithPast instead of a tuple.
            If None, falls back to config.use_return_dict.

    Returns:
        Union[Tuple, MoeModelOutputWithPast]: The model output. With return_dict=False, a tuple that
        holds only the fields that are not None.

    Raises:
        ValueError: If both input_ids and inputs_embeds are specified at the same time.
        ValueError: If neither input_ids nor inputs_embeds is specified.

    Note:
        When gradient checkpointing is enabled during training, use_cache=True is not supported:
        a warning is logged once and use_cache is set to False.
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_router_logits = (
        output_router_logits if output_router_logits is not None else self.config.output_router_logits
    )
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    use_cache = use_cache if use_cache is not None else self.config.use_cache

    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    # retrieve input_ids and inputs_embeds
    if input_ids is not None and inputs_embeds is not None:
        raise ValueError("You cannot specify both decoder_input_ids and decoder_inputs_embeds at the same time")
    elif input_ids is not None:
        batch_size, seq_length = input_ids.shape
    elif inputs_embeds is not None:
        batch_size, seq_length, _ = inputs_embeds.shape
    else:
        raise ValueError("You have to specify either decoder_input_ids or decoder_inputs_embeds")

    if self.gradient_checkpointing and self.training:
        if use_cache:
            logger.warning_once(
                "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
            )
            use_cache = False

    past_key_values_length = 0

    if use_cache:
        use_legacy_cache = not isinstance(past_key_values, Cache)
        if use_legacy_cache:
            past_key_values = DynamicCache.from_legacy_cache(past_key_values)
        past_key_values_length = past_key_values.get_usable_length(seq_length)

    if position_ids is None:
        position_ids = ops.arange(
            past_key_values_length, seq_length + past_key_values_length, dtype=mindspore.int64
        )
        position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
    else:
        position_ids = position_ids.view(-1, seq_length).long()

    if inputs_embeds is None:
        inputs_embeds = self.embed_tokens(input_ids)

    # 4d mask is passed through the layers
    attention_mask = _prepare_4d_causal_attention_mask(
        attention_mask,
        (batch_size, seq_length),
        inputs_embeds,
        past_key_values_length,
        sliding_window=self.config.sliding_window,
    )

    hidden_states = inputs_embeds

    # decoder layers
    all_hidden_states = () if output_hidden_states else None
    all_self_attns = () if output_attentions else None
    all_router_logits = () if output_router_logits else None
    next_decoder_cache = None

    for decoder_layer in self.layers:
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        if self.gradient_checkpointing and self.training:
            layer_outputs = self._gradient_checkpointing_func(
                decoder_layer.__call__,
                hidden_states,
                attention_mask,
                position_ids,
                past_key_values,
                output_attentions,
                output_router_logits,
                use_cache,
            )
        else:
            layer_outputs = decoder_layer(
                hidden_states,
                attention_mask=attention_mask,
                position_ids=position_ids,
                past_key_value=past_key_values,
                output_attentions=output_attentions,
                output_router_logits=output_router_logits,
                use_cache=use_cache,
            )

        hidden_states = layer_outputs[0]

        if use_cache:
            next_decoder_cache = layer_outputs[2 if output_attentions else 1]

        if output_attentions:
            all_self_attns += (layer_outputs[1],)

        if output_router_logits and layer_outputs[-1] is not None:
            all_router_logits += (layer_outputs[-1],)

    hidden_states = self.norm(hidden_states)

    # add hidden states from the last decoder layer
    if output_hidden_states:
        all_hidden_states += (hidden_states,)

    next_cache = None
    if use_cache:
        next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache

    if not return_dict:
        return tuple(
            v
            for v in [hidden_states, next_cache, all_hidden_states, all_self_attns, all_router_logits]
            if v is not None
        )
    return MoeModelOutputWithPast(
        last_hidden_state=hidden_states,
        past_key_values=next_cache,
        hidden_states=all_hidden_states,
        attentions=all_self_attns,
        router_logits=all_router_logits,
    )
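
The first check in the method above requires exactly one of `input_ids` and `inputs_embeds` and reads the batch size and sequence length from whichever was given. A minimal sketch of that rule, operating on shape tuples rather than tensors (`infer_batch_and_seq` is a hypothetical name, and the error messages are paraphrased):

```python
from typing import Optional, Sequence, Tuple


def infer_batch_and_seq(
    input_ids_shape: Optional[Sequence[int]] = None,
    inputs_embeds_shape: Optional[Sequence[int]] = None,
) -> Tuple[int, int]:
    """Return (batch_size, seq_length) from exactly one of the two shapes."""
    if input_ids_shape is not None and inputs_embeds_shape is not None:
        raise ValueError("Specify only one of input_ids and inputs_embeds")
    if input_ids_shape is not None:
        # Token ids are 2D: (batch_size, seq_length)
        batch_size, seq_length = input_ids_shape
    elif inputs_embeds_shape is not None:
        # Embeddings are 3D: (batch_size, seq_length, hidden_size)
        batch_size, seq_length, _ = inputs_embeds_shape
    else:
        raise ValueError("Specify either input_ids or inputs_embeds")
    return batch_size, seq_length


print(infer_batch_and_seq(input_ids_shape=(2, 5)))
print(infer_batch_and_seq(inputs_embeds_shape=(2, 5, 2048)))
```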

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeModel.get_input_embeddings()

Return the input embedding layer.

RETURNS DESCRIPTION
Embedding

The embed_tokens layer that maps token ids to hidden states.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def get_input_embeddings(self):
    """
    Return the input embedding layer.

    Returns:
        nn.Embedding: The `embed_tokens` layer that maps token ids to hidden states.
    """
    return self.embed_tokens

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeModel.set_input_embeddings(value)

Sets the input embeddings for the Qwen2MoeModel.

PARAMETER DESCRIPTION
self

The instance of the Qwen2MoeModel class.

TYPE: Qwen2MoeModel

value

The input embeddings to be set. This should be a tensor or an object that can be assigned to the embed_tokens attribute.

TYPE: Any

RETURNS DESCRIPTION

None.

This method sets the input embeddings for the Qwen2MoeModel by assigning the given value to the embed_tokens attribute of the instance.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def set_input_embeddings(self, value):
    """
    Sets the input embeddings for the Qwen2MoeModel.

    Args:
        self (Qwen2MoeModel): The instance of the Qwen2MoeModel class.
        value (Any): The input embeddings to be set.
            This should be a tensor or an object that can be assigned to the `embed_tokens` attribute.

    Returns:
        None.

    Raises:
        None.

    This method sets the input embeddings for the Qwen2MoeModel by assigning the given value to the
    `embed_tokens` attribute of the instance.
    """
    self.embed_tokens = value
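
The getter and setter are plain attribute accessors, which is what makes it possible to swap the embedding table, for example after extending the vocabulary. The stand-in class below illustrates the pattern with nested lists; `EmbeddingHolder` is a hypothetical toy and does not reproduce any mindnlp behavior beyond the two accessors.

```python
class EmbeddingHolder:
    """Toy stand-in that exposes the same accessor pair as the model."""

    def __init__(self, table):
        self.embed_tokens = table

    def get_input_embeddings(self):
        return self.embed_tokens

    def set_input_embeddings(self, value):
        self.embed_tokens = value


holder = EmbeddingHolder([[0.1, 0.2], [0.3, 0.4]])
# Append one zero row for a new token and install the extended table.
extended = holder.get_input_embeddings() + [[0.0, 0.0]]
holder.set_input_embeddings(extended)
print(len(holder.get_input_embeddings()))
```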

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoePreTrainedModel

Bases: PreTrainedModel

Qwen2MoePreTrainedModel is the base class for pre-trained Qwen2MoE models. It inherits from PreTrainedModel and provides weight initialization for nn.Linear and nn.Embedding cells.

METHOD DESCRIPTION
_init_weights

Initializes the weights of the given cell. For an nn.Linear cell, the weight is drawn from a normal distribution whose standard deviation is config.initializer_range, and the bias, if present, is set to zeros. For an nn.Embedding cell, the weight is drawn from the same normal distribution and the row at padding_idx, if set, is zeroed.

PARAMETER DESCRIPTION
cell

The cell for which weights need to be initialized. It can be a nn.Linear or nn.Embedding type.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
class Qwen2MoePreTrainedModel(PreTrainedModel):

    """
    Qwen2MoePreTrainedModel is the base class for pre-trained Qwen2MoE models.
    This class inherits from PreTrainedModel and provides weight initialization for nn.Linear and
    nn.Embedding cells.

    Methods:
        _init_weights: Initializes the weights of the given cell. For an nn.Linear cell, the weight is
            drawn from a normal distribution whose standard deviation is config.initializer_range, and
            the bias, if present, is set to zeros. For an nn.Embedding cell, the weight is drawn from the
            same normal distribution and the row at padding_idx, if set, is zeroed.

    Parameters:
        cell: The cell for which weights need to be initialized. It can be a nn.Linear or nn.Embedding type.

    Returns:
        None.
    """
    config_class = Qwen2MoeConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["Qwen2MoeDecoderLayer"]
    _skip_keys_device_placement = "past_key_values"
    _supports_cache_class = True

    def _init_weights(self, cell):
        """Initialize the weights"""
        if isinstance(cell, nn.Linear):
            # Normal(initializer_range) for the weight, zeros for the bias if one exists
            cell.weight.set_data(initializer(Normal(self.config.initializer_range),
                                                    cell.weight.shape, cell.weight.dtype))
            if cell.bias is not None:
                cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
        elif isinstance(cell, nn.Embedding):
            weight = np.random.normal(0.0, self.config.initializer_range, cell.weight.shape)
            # Compare against None so that a padding index of 0 is also zeroed
            if cell.padding_idx is not None:
                weight[cell.padding_idx] = 0

            cell.weight.set_data(Tensor(weight, cell.weight.dtype))
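
The embedding branch can be illustrated without MindSpore or NumPy. The sketch below draws each entry from a normal distribution with the given standard deviation and zeroes the padding row; `init_embedding_rows` is a hypothetical helper, and it uses the standard library's `random.Random.gauss` as a stand-in for `np.random.normal`. The check is written as `padding_idx is not None` so that a padding index of 0 is zeroed as well.

```python
import random


def init_embedding_rows(num_embeddings, dim, std, padding_idx=None, seed=0):
    """Build a num_embeddings x dim table of N(0, std) values with a zeroed padding row."""
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    weight = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_embeddings)]
    if padding_idx is not None:
        weight[padding_idx] = [0.0] * dim
    return weight


weights = init_embedding_rows(4, 3, std=0.02, padding_idx=0)
print(weights[0])  # the padding row
```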

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeRMSNorm

Bases: Module

Qwen2MoeRMSNorm is a custom normalization layer that is equivalent to T5LayerNorm. It inherits from the nn.Module class.

This normalization layer performs root mean square normalization (RMSNorm) on the input hidden states. It is commonly used in neural network architectures, such as T5 models, to improve the training efficiency and convergence.

PARAMETER DESCRIPTION
hidden_size

The size of the hidden states.

TYPE: int

eps

A small value added to the variance for numerical stability. Defaults to 1e-06.

TYPE: float DEFAULT: 1e-06

METHOD DESCRIPTION
__init__

Initializes a new instance of the Qwen2MoeRMSNorm class.

forward

Applies RMSNorm normalization to the input hidden_states.

Parameters:

  • hidden_states (Tensor): The input hidden states to be normalized.

Returns:

  • Tensor: The normalized hidden states after applying RMSNorm.
Example
>>> # Create a Qwen2MoeRMSNorm instance
>>> norm_layer = Qwen2MoeRMSNorm(hidden_size=512)
...
>>> # Apply RMSNorm to an input of shape (batch_size, sequence_length, hidden_size)
>>> input_tensor = ops.randn((2, 16, 512))
>>> normalized_tensor = norm_layer.forward(input_tensor)
...
>>> # The normalized_tensor now contains the input tensor after applying RMSNorm normalization.
Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
class Qwen2MoeRMSNorm(nn.Module):

    """
    Qwen2MoeRMSNorm is a custom normalization layer that is equivalent to T5LayerNorm. It inherits from the nn.Module class.

    This normalization layer performs root mean square normalization (RMSNorm) on the input hidden states.
    It is commonly used in neural network architectures, such as T5 models, to improve the training
    efficiency and convergence.

    Parameters:
        hidden_size (int): The size of the hidden states.
        eps (float, optional): A small value added to the variance for numerical stability. Defaults to 1e-06.

    Methods:
        __init__:
            Initializes a new instance of the Qwen2MoeRMSNorm class.

        forward:
            Applies RMSNorm normalization to the input hidden_states.

            Parameters:

            - hidden_states (Tensor): The input hidden states to be normalized.

            Returns:

            - Tensor: The normalized hidden states after applying RMSNorm.

    Example:
        ```python
        >>> # Create a Qwen2MoeRMSNorm instance
        >>> norm_layer = Qwen2MoeRMSNorm(hidden_size=512)
        ...
        >>> # Apply RMSNorm to an input of shape (batch_size, sequence_length, hidden_size)
        >>> input_tensor = ops.randn((2, 16, 512))
        >>> normalized_tensor = norm_layer.forward(input_tensor)
        ...
        >>> # The normalized_tensor now contains the input tensor after applying RMSNorm normalization.
        ```
    """
    def __init__(self, hidden_size, eps=1e-6):
        """
        Qwen2MoeRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = Parameter(ops.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        """
        Applies RMS normalization to the given hidden states.

        Args:
            hidden_states (Tensor): The input hidden states to normalize.

        Returns:
            Tensor: The normalized hidden states, with the same shape as the input.

        Note:
            - The hidden_states parameter is expected to be a tensor of shape (batch_size, sequence_length, hidden_size).
            - The hidden_states tensor is converted to 'mindspore.float32'.
            - The variance is computed as the mean of the squared elements along the last dimension.
            - The hidden_states tensor is then multiplied by the reciprocal square root of the variance plus
            'self.variance_epsilon'.
            - The result is cast back to the input dtype and then multiplied element-wise by the weight tensor.

        Example:
            ```python
            >>> qwen = Qwen2MoeRMSNorm(hidden_size=4)
            >>> hidden_states = mindspore.Tensor(np.random.rand(2, 3, 4), dtype=mindspore.float16)
            >>> qwen.forward(hidden_states)
            ```
        """
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(mindspore.float32)
        variance = hidden_states.pow(2).mean(-1, keep_dims=True)
        hidden_states = hidden_states * ops.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeRMSNorm.__init__(hidden_size, eps=1e-06)

Qwen2MoeRMSNorm is equivalent to T5LayerNorm

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def __init__(self, hidden_size, eps=1e-6):
    """
    Qwen2MoeRMSNorm is equivalent to T5LayerNorm
    """
    super().__init__()
    self.weight = Parameter(ops.ones(hidden_size))
    self.variance_epsilon = eps

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeRMSNorm.forward(hidden_states)

Applies RMS normalization to the given hidden states.

PARAMETER DESCRIPTION
hidden_states

The input hidden states to normalize.

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

The normalized hidden states, with the same shape as the input.

Note
  • The hidden_states parameter is expected to be a tensor of shape (batch_size, sequence_length, hidden_size).
  • The hidden_states tensor is converted to 'mindspore.float32'.
  • The variance is computed as the mean of the squared elements along the last dimension.
  • The hidden_states tensor is then multiplied by the reciprocal square root of the variance plus 'self.variance_epsilon'.
  • The result is cast back to the input dtype and then multiplied element-wise by the weight tensor.
Example
>>> qwen = Qwen2MoeRMSNorm(hidden_size=4)
>>> hidden_states = mindspore.Tensor(np.random.rand(2, 3, 4), dtype=mindspore.float16)
>>> qwen.forward(hidden_states)
Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def forward(self, hidden_states):
    """
    Applies RMS normalization to the given hidden states.

    Args:
        hidden_states (Tensor): The input hidden states to normalize.

    Returns:
        Tensor: The normalized hidden states, with the same shape as the input.

    Note:
        - The hidden_states parameter is expected to be a tensor of shape (batch_size, sequence_length, hidden_size).
        - The hidden_states tensor is converted to 'mindspore.float32'.
        - The variance is computed as the mean of the squared elements along the last dimension.
        - The hidden_states tensor is then multiplied by the reciprocal square root of the variance plus
        'self.variance_epsilon'.
        - The result is cast back to the input dtype and then multiplied element-wise by the weight tensor.

    Example:
        ```python
        >>> qwen = Qwen2MoeRMSNorm(hidden_size=4)
        >>> hidden_states = mindspore.Tensor(np.random.rand(2, 3, 4), dtype=mindspore.float16)
        >>> qwen.forward(hidden_states)
        ```
    """
    input_dtype = hidden_states.dtype
    hidden_states = hidden_states.to(mindspore.float32)
    variance = hidden_states.pow(2).mean(-1, keep_dims=True)
    hidden_states = hidden_states * ops.rsqrt(variance + self.variance_epsilon)
    return self.weight * hidden_states.to(input_dtype)
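
The arithmetic of the method above can be checked on a single vector with the standard library alone. This is a sketch of the same formula, `weight * x * rsqrt(mean(x^2) + eps)`, without the dtype casts; `rms_norm` is a hypothetical helper name.

```python
import math


def rms_norm(vector, weight, eps=1e-6):
    """Scale a vector by the reciprocal root of its mean square, then by weight."""
    variance = sum(x * x for x in vector) / len(vector)  # mean of squares, no mean subtraction
    scale = 1.0 / math.sqrt(variance + eps)
    return [w * x * scale for w, x in zip(weight, vector)]


normalized = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
mean_square = sum(v * v for v in normalized) / len(normalized)
print(normalized, mean_square)  # the mean square is close to 1 when weight is all ones
```

Unlike standard layer normalization, no mean is subtracted and no bias is added, which is why only a single `weight` parameter is stored.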

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeRotaryEmbedding

Bases: Module

This class represents a Qwen2MoeRotaryEmbedding, which is a rotary positional embedding used in natural language processing tasks. It is a subclass of the nn.Module class.

The Qwen2MoeRotaryEmbedding class initializes with the following parameters:

  • dim (int): The dimension of the embedding.
  • max_position_embeddings (int): The maximum number of position embeddings.
  • base (int): The base used in the exponential calculation.

The class provides the following methods:

  • __init__: Initializes the Qwen2MoeRotaryEmbedding instance.

  • _set_cos_sin_cache: Sets the cosine and sine cache for the given sequence length and data type.

  • forward: Constructs the rotary embedding for the given input tensor and sequence length.

Note

The __init__ and forward methods override those of the nn.Module class; _set_cos_sin_cache is specific to this class.
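
As a rough illustration of what the cosine and sine caches contain, the sketch below follows the usual rotary embedding formulation: inverse frequencies `base ** (-i / dim)` for even `i`, multiplied by each position, with the angle vector repeated twice to span the full dimension. This is an assumption based on the common formulation and is not taken from this class's code; `rotary_cos_sin` is a hypothetical helper.

```python
import math


def rotary_cos_sin(dim, seq_len, base=10000.0):
    """Return (cos, sin) tables of shape seq_len x dim as nested lists."""
    # One inverse frequency per pair of channels (assumed standard RoPE schedule).
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    cos_rows, sin_rows = [], []
    for pos in range(seq_len):
        angles = [pos * f for f in inv_freq]
        angles = angles + angles  # repeat so the table covers all dim channels
        cos_rows.append([math.cos(a) for a in angles])
        sin_rows.append([math.sin(a) for a in angles])
    return cos_rows, sin_rows


cos_table, sin_table = rotary_cos_sin(dim=4, seq_len=3)
print(len(cos_table), len(cos_table[0]))
```

Precomputing the tables up to a maximum length and slicing them per call avoids recomputing the trigonometric functions on every forward pass.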

Example
>>> # Create a Qwen2MoeRotaryEmbedding instance
>>> embedding = Qwen2MoeRotaryEmbedding(dim=512)
...
>>> # Generate rotary embedding for input tensor x
>>> x = ...  # Input tensor
>>> seq_len = ...  # Sequence length
>>> cos_embedding, sin_embedding = embedding.forward(x, seq_len)
Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
class Qwen2MoeRotaryEmbedding(nn.Module):

    """
    This class represents a Qwen2MoeRotaryEmbedding, which is a rotary positional embedding used in natural language
    processing tasks. It is a subclass of the nn.Module class.

    The Qwen2MoeRotaryEmbedding class initializes with the following parameters:

    - dim (int): The dimension of the embedding.
    - max_position_embeddings (int): The maximum number of position embeddings.
    - base (int): The base used in the exponential calculation.

    The class provides the following methods:

    - __init__:
    Initializes the Qwen2MoeRotaryEmbedding instance.

    - _set_cos_sin_cache:
    Sets the cosine and sine cache for the given sequence length and data type.

    - forward:
    Returns the cached cosine and sine tables for the given sequence length, cast to the dtype of the input tensor.

    Note:
        These methods are defined by Qwen2MoeRotaryEmbedding itself rather than inherited from the nn.Module class.

    Example:
        ```python
        >>> # Create a Qwen2MoeRotaryEmbedding instance
        >>> embedding = Qwen2MoeRotaryEmbedding(dim=512)
        ...
        >>> # Generate rotary embedding for input tensor x
        >>> x = ...  # Input tensor
        >>> seq_len = ...  # Sequence length
        >>> cos_embedding, sin_embedding = embedding.forward(x, seq_len)
        ```
    """
    def __init__(self, dim, max_position_embeddings=2048, base=10000):
        """
        Initializes an instance of the Qwen2MoeRotaryEmbedding class.

        Args:
            self: The instance of the class.
            dim (int): The dimension of the embedding.
            max_position_embeddings (int, optional): The maximum number of position embeddings. Defaults to 2048.
            base (int, optional): The base value used in the calculation. Defaults to 10000.

        Returns:
            None.

        Raises:
            None.

        """
        super().__init__()

        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        inv_freq = 1.0 / (self.base ** (ops.arange(0, self.dim, 2, dtype=mindspore.int64).float() / self.dim))
        self.inv_freq = inv_freq

        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, dtype=get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, dtype):
        """
        This method '_set_cos_sin_cache' is defined within the class 'Qwen2MoeRotaryEmbedding' and is responsible for
        setting up the cosine and sine cache based on the input sequence length and data type.

        Args:
            self: The instance of the class.
            seq_len (int): The length of the sequence for which the cosine and sine cache is to be computed.
            dtype: The data type for the cache values.

        Returns:
            None.

        Raises:
            None explicitly. seq_len and dtype are not validated here, so invalid values surface as errors from
            the underlying ops calls.
        """
        self.max_seq_len_cached = seq_len
        t = ops.arange(self.max_seq_len_cached, dtype=mindspore.int64).type_as(self.inv_freq)

        freqs = ops.outer(t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = ops.cat((freqs, freqs), axis=-1)
        self.cos_cached = emb.cos().to(dtype)
        self.sin_cached = emb.sin().to(dtype)

    def forward(self, x, seq_len=None):
        """
        Returns the cached cosine and sine tables for the given sequence length.

        Args:
            self (Qwen2MoeRotaryEmbedding): An instance of the Qwen2MoeRotaryEmbedding class.
            x (mindspore.Tensor): The input tensor of shape (batch_size, num_attention_heads, seq_len, head_size).
                Only its dtype is used.
            seq_len (int, optional): The length of the input sequence. Although the default is None, an integer
                must be passed because the value is compared against the cached length.

        Returns:
            Tuple[mindspore.Tensor, mindspore.Tensor]: The cosine and sine tables, each of shape (seq_len, dim)
            and cast to the dtype of x.

        Raises:
            TypeError: If seq_len is left as None, because None cannot be compared with the cached length.

        If seq_len is greater than the maximum sequence length currently cached, the cosine and sine caches are
        rebuilt by calling the _set_cos_sin_cache method. The cached values are then sliced to the first seq_len
        positions and returned.
        """
        # x: [bs, num_attention_heads, seq_len, head_size]
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, dtype=x.dtype)

        return (
            self.cos_cached[:seq_len].to(dtype=x.dtype),
            self.sin_cached[:seq_len].to(dtype=x.dtype),
        )
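
The cache construction in _set_cos_sin_cache can be illustrated with a short NumPy sketch. The function name build_cos_sin_cache and the small dim and seq_len values are assumptions chosen for the example; only the arithmetic mirrors the listing above.

```python
import numpy as np

def build_cos_sin_cache(dim, seq_len, base=10000.0):
    """NumPy sketch of the cosine and sine tables (hypothetical helper)."""
    # One inverse frequency per pair of channels: base ** (-2i / dim).
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2, dtype=np.float64) / dim))
    positions = np.arange(seq_len, dtype=np.float64)
    # freqs[p, i] = p * inv_freq[i], shape (seq_len, dim // 2).
    freqs = np.outer(positions, inv_freq)
    # Duplicate the angles so the table spans the full dimension.
    emb = np.concatenate((freqs, freqs), axis=-1)
    return np.cos(emb), np.sin(emb)

cos_cached, sin_cached = build_cos_sin_cache(dim=8, seq_len=16)
```

Each table has shape (seq_len, dim), and its two halves along the last axis are identical, which is what lets rotate_half pair channel i with channel i + dim // 2.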

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeRotaryEmbedding.__init__(dim, max_position_embeddings=2048, base=10000)

Initializes an instance of the Qwen2MoeRotaryEmbedding class.

PARAMETER DESCRIPTION
self

The instance of the class.

dim

The dimension of the embedding.

TYPE: int

max_position_embeddings

The maximum number of position embeddings. Defaults to 2048.

TYPE: int DEFAULT: 2048

base

The base value used in the calculation. Defaults to 10000.

TYPE: int DEFAULT: 10000

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def __init__(self, dim, max_position_embeddings=2048, base=10000):
    """
    Initializes an instance of the Qwen2MoeRotaryEmbedding class.

    Args:
        self: The instance of the class.
        dim (int): The dimension of the embedding.
        max_position_embeddings (int, optional): The maximum number of position embeddings. Defaults to 2048.
        base (int, optional): The base value used in the calculation. Defaults to 10000.

    Returns:
        None.

    Raises:
        None.

    """
    super().__init__()

    self.dim = dim
    self.max_position_embeddings = max_position_embeddings
    self.base = base
    inv_freq = 1.0 / (self.base ** (ops.arange(0, self.dim, 2, dtype=mindspore.int64).float() / self.dim))
    self.inv_freq = inv_freq

    # Build here to make `torch.jit.trace` work.
    self._set_cos_sin_cache(
        seq_len=max_position_embeddings, dtype=get_default_dtype()
    )

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeRotaryEmbedding.forward(x, seq_len=None)

Returns the cached cosine and sine tables for the given sequence length.

PARAMETER DESCRIPTION
self

An instance of the Qwen2MoeRotaryEmbedding class.

TYPE: Qwen2MoeRotaryEmbedding

x

The input tensor of shape (batch_size, num_attention_heads, seq_len, head_size). Only its dtype is used.

TYPE: Tensor

seq_len

The length of the input sequence. Although the default is None, an integer must be passed because the value is compared against the cached length.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
Tuple[Tensor, Tensor]

The cosine and sine tables, each of shape (seq_len, dim) and cast to the dtype of x.

RAISES DESCRIPTION
TypeError

If seq_len is left as None, because None cannot be compared with the cached length.

If seq_len is greater than the maximum sequence length currently cached, the cosine and sine caches are rebuilt by calling the _set_cos_sin_cache method. The cached values are then sliced to the first seq_len positions and returned.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def forward(self, x, seq_len=None):
    """
    Returns the cached cosine and sine tables for the given sequence length.

    Args:
        self (Qwen2MoeRotaryEmbedding): An instance of the Qwen2MoeRotaryEmbedding class.
        x (mindspore.Tensor): The input tensor of shape (batch_size, num_attention_heads, seq_len, head_size).
            Only its dtype is used.
        seq_len (int, optional): The length of the input sequence. Although the default is None, an integer
            must be passed because the value is compared against the cached length.

    Returns:
        Tuple[mindspore.Tensor, mindspore.Tensor]: The cosine and sine tables, each of shape (seq_len, dim)
        and cast to the dtype of x.

    Raises:
        TypeError: If seq_len is left as None, because None cannot be compared with the cached length.

    If seq_len is greater than the maximum sequence length currently cached, the cosine and sine caches are
    rebuilt by calling the _set_cos_sin_cache method. The cached values are then sliced to the first seq_len
    positions and returned.
    """
    # x: [bs, num_attention_heads, seq_len, head_size]
    if seq_len > self.max_seq_len_cached:
        self._set_cos_sin_cache(seq_len=seq_len, dtype=x.dtype)

    return (
        self.cos_cached[:seq_len].to(dtype=x.dtype),
        self.sin_cached[:seq_len].to(dtype=x.dtype),
    )

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeSparseMoeBlock

Bases: Module

This class represents a sparse mixture-of-experts (MoE) block for the Qwen2 model. It is a subclass of nn.Module.

ATTRIBUTE DESCRIPTION
num_experts

The number of experts in the MoE block.

TYPE: int

top_k

The number of top experts to select per token.

TYPE: int

norm_topk_prob

Flag indicating whether to normalize the probabilities of the top experts.

TYPE: bool

gate

The gate layer that computes the routing probabilities for the experts.

TYPE: Linear

experts

List of expert layers in the MoE block.

TYPE: ModuleList

shared_expert

The shared expert layer in the MoE block.

TYPE: Qwen2MoeMLP

shared_expert_gate

The gate layer for the shared expert.

TYPE: Linear

METHOD DESCRIPTION
forward

Routes each token to its top-k experts, adds the sigmoid-gated shared-expert output, and returns the combined hidden states together with the router logits.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
class Qwen2MoeSparseMoeBlock(nn.Module):

    """
    This class represents a sparse mixture-of-experts (MoE) block for the Qwen2 model. It is a subclass of nn.Module.

    Attributes:
        num_experts (int): The number of experts in the MoE block.
        top_k (int): The number of top experts to select per token.
        norm_topk_prob (bool): Flag indicating whether to normalize the probabilities of the top experts.
        gate (nn.Linear): The gate layer that computes the routing probabilities for the experts.
        experts (nn.ModuleList): List of expert layers in the MoE block.
        shared_expert (Qwen2MoeMLP): The shared expert layer in the MoE block.
        shared_expert_gate (nn.Linear): The gate layer for the shared expert.

    Methods:
        forward:
            Routes each token to its top-k experts, adds the sigmoid-gated shared-expert output, and returns the
            combined hidden states together with the router logits.

    """
    def __init__(self, config):
        """
        Initializes the router gate, the routed experts, and the gated shared expert.

        Args:
            self (Qwen2MoeSparseMoeBlock): The instance of the Qwen2MoeSparseMoeBlock class.
            config (Qwen2MoeConfig): The model configuration. The fields read directly here are num_experts,
                num_experts_per_tok, norm_topk_prob, hidden_size, moe_intermediate_size, and
                shared_expert_intermediate_size.

        Returns:
            None.

        Raises:
            None explicitly. The configuration values are not validated here, so invalid values surface as
            errors from the layers being constructed.
        """
        super().__init__()
        self.num_experts = config.num_experts
        self.top_k = config.num_experts_per_tok
        self.norm_topk_prob = config.norm_topk_prob

        # gating
        self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False)
        self.experts = nn.ModuleList(
            [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)]
        )

        self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size)
        self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False)

    def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor:
        """
        Routes each token to its top-k experts and adds the sigmoid-gated shared-expert output.

        Args:
            self: Qwen2MoeSparseMoeBlock
                The instance of the Qwen2MoeSparseMoeBlock class.
            hidden_states: mindspore.Tensor
                A tensor representing the hidden states with the shape (batch_size, sequence_length, hidden_dim).

        Returns:
            Tuple[mindspore.Tensor, mindspore.Tensor]
                The final hidden states after processing, with the shape
                (batch_size, sequence_length, hidden_dim), and the router logits, with the shape
                (batch_size * sequence_length, num_experts).

        Raises:
            None
        """
        batch_size, sequence_length, hidden_dim = hidden_states.shape
        hidden_states = hidden_states.view(-1, hidden_dim)
        # router_logits: (batch * sequence_length, n_experts)
        router_logits = self.gate(hidden_states)

        routing_weights = ops.softmax(router_logits, axis=1, dtype=mindspore.float32)
        routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1)
        if self.norm_topk_prob:
            routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
        # we cast back to the input dtype
        routing_weights = routing_weights.to(hidden_states.dtype)

        final_hidden_states = ops.zeros(
            (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype)

        # One hot encode the selected experts to create an expert mask
        # this will be used to easily index which expert is going to be solicited
        expert_mask = ops.one_hot(selected_experts, self.num_experts).permute(2, 1, 0)

        # Loop over all available experts in the model and perform the computation on each expert
        for expert_idx in range(self.num_experts):
            expert_layer = self.experts[expert_idx]
            nonezero = ops.nonzero(expert_mask[expert_idx])
            idx, top_x = nonezero.tensor_split(2, -1)
            if top_x.shape[0] == 0:
                continue

            # Index the correct hidden states and compute the expert hidden state for
            # the current expert. We need to make sure to multiply the output hidden
            # states by `routing_weights` on the corresponding tokens (top-1 and top-2)
            current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
            current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx]

            # `index_add` only supports tensors for indexing, so we'll use
            # the flattened `top_x` tensor here.
            final_hidden_states = final_hidden_states.index_add(0, top_x.astype(mindspore.int32).reshape(-1), current_hidden_states.to(hidden_states.dtype))

        shared_expert_output = self.shared_expert(hidden_states)
        shared_expert_output = ops.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output

        final_hidden_states = final_hidden_states + shared_expert_output

        final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)
        return final_hidden_states, router_logits
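
The routing step at the top of forward (softmax, top-k selection, optional renormalization) can be sketched in NumPy as below. The helper name route_tokens and the toy logits are assumptions for illustration; the sketch is not the MindSpore code path.

```python
import numpy as np

def route_tokens(router_logits, top_k, norm_topk_prob):
    """NumPy sketch of top-k routing over logits of shape (tokens, experts)."""
    # Softmax over the expert axis, computed in float32.
    logits = router_logits.astype(np.float32)
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Indices of the top_k largest probabilities per token, in descending order.
    selected_experts = np.argsort(-probs, axis=-1)[:, :top_k]
    routing_weights = np.take_along_axis(probs, selected_experts, axis=-1)
    if norm_topk_prob:
        # Renormalize so the kept weights sum to 1 for every token.
        routing_weights = routing_weights / routing_weights.sum(axis=-1, keepdims=True)
    return routing_weights, selected_experts

router_logits = np.array([[2.0, 0.0, 1.0, -1.0],
                          [0.0, 3.0, 0.0, 1.0]])
weights, experts = route_tokens(router_logits, top_k=2, norm_topk_prob=True)
raw_weights, _ = route_tokens(router_logits, top_k=2, norm_topk_prob=False)
```

When norm_topk_prob is False the kept weights are the raw softmax probabilities, so they sum to less than 1 whenever an unselected expert has non-zero probability.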

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeSparseMoeBlock.__init__(config)

Initializes the router gate, the routed experts, and the gated shared expert.

PARAMETER DESCRIPTION
self

The instance of the Qwen2MoeSparseMoeBlock class.

TYPE: Qwen2MoeSparseMoeBlock

config

The model configuration. The fields read directly here are num_experts, num_experts_per_tok, norm_topk_prob, hidden_size, moe_intermediate_size, and shared_expert_intermediate_size.

TYPE: Qwen2MoeConfig

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION

None explicitly. The configuration values are not validated here, so invalid values surface as errors from the layers being constructed.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def __init__(self, config):
    """
    Initializes the router gate, the routed experts, and the gated shared expert.

    Args:
        self (Qwen2MoeSparseMoeBlock): The instance of the Qwen2MoeSparseMoeBlock class.
        config (Qwen2MoeConfig): The model configuration. The fields read directly here are num_experts,
            num_experts_per_tok, norm_topk_prob, hidden_size, moe_intermediate_size, and
            shared_expert_intermediate_size.

    Returns:
        None.

    Raises:
        None explicitly. The configuration values are not validated here, so invalid values surface as
        errors from the layers being constructed.
    """
    super().__init__()
    self.num_experts = config.num_experts
    self.top_k = config.num_experts_per_tok
    self.norm_topk_prob = config.norm_topk_prob

    # gating
    self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False)
    self.experts = nn.ModuleList(
        [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)]
    )

    self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size)
    self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False)

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.Qwen2MoeSparseMoeBlock.forward(hidden_states)

Routes each token to its top-k experts and adds the sigmoid-gated shared-expert output.

PARAMETER DESCRIPTION
self

The instance of the Qwen2MoeSparseMoeBlock class.

TYPE: Qwen2MoeSparseMoeBlock

hidden_states

A tensor representing the hidden states with the shape (batch_size, sequence_length, hidden_dim).

TYPE: Tensor

RETURNS DESCRIPTION
Tuple[Tensor, Tensor]

The final hidden states after processing, with the shape (batch_size, sequence_length, hidden_dim), and the router logits, with the shape (batch_size * sequence_length, num_experts).

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor:
    """
    Routes each token to its top-k experts and adds the sigmoid-gated shared-expert output.

    Args:
        self: Qwen2MoeSparseMoeBlock
            The instance of the Qwen2MoeSparseMoeBlock class.
        hidden_states: mindspore.Tensor
            A tensor representing the hidden states with the shape (batch_size, sequence_length, hidden_dim).

    Returns:
        Tuple[mindspore.Tensor, mindspore.Tensor]
            The final hidden states after processing, with the shape
            (batch_size, sequence_length, hidden_dim), and the router logits, with the shape
            (batch_size * sequence_length, num_experts).

    Raises:
        None
    """
    batch_size, sequence_length, hidden_dim = hidden_states.shape
    hidden_states = hidden_states.view(-1, hidden_dim)
    # router_logits: (batch * sequence_length, n_experts)
    router_logits = self.gate(hidden_states)

    routing_weights = ops.softmax(router_logits, axis=1, dtype=mindspore.float32)
    routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1)
    if self.norm_topk_prob:
        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
    # we cast back to the input dtype
    routing_weights = routing_weights.to(hidden_states.dtype)

    final_hidden_states = ops.zeros(
        (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype)

    # One hot encode the selected experts to create an expert mask
    # this will be used to easily index which expert is going to be solicited
    expert_mask = ops.one_hot(selected_experts, self.num_experts).permute(2, 1, 0)

    # Loop over all available experts in the model and perform the computation on each expert
    for expert_idx in range(self.num_experts):
        expert_layer = self.experts[expert_idx]
        nonezero = ops.nonzero(expert_mask[expert_idx])
        idx, top_x = nonezero.tensor_split(2, -1)
        if top_x.shape[0] == 0:
            continue

        # Index the correct hidden states and compute the expert hidden state for
        # the current expert. We need to make sure to multiply the output hidden
        # states by `routing_weights` on the corresponding tokens (top-1 and top-2)
        current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
        current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx]

        # `index_add` only supports tensors for indexing, so we'll use
        # the flattened `top_x` tensor here.
        final_hidden_states = final_hidden_states.index_add(0, top_x.astype(mindspore.int32).reshape(-1), current_hidden_states.to(hidden_states.dtype))

    shared_expert_output = self.shared_expert(hidden_states)
    shared_expert_output = ops.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output

    final_hidden_states = final_hidden_states + shared_expert_output

    final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim)
    return final_hidden_states, router_logits
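
The per-expert loop in forward gathers the tokens assigned to each expert, weights the expert output, and scatter-adds it back. A minimal NumPy sketch of that dispatch-and-combine pattern follows; the helper name, the toy experts, and the fixed routing tables are hypothetical, and the shared expert is left out to keep the example short.

```python
import numpy as np

def combine_expert_outputs(hidden_states, routing_weights, selected_experts, experts):
    """NumPy sketch of dispatching tokens to experts and summing weighted outputs.

    hidden_states: (tokens, hidden_dim); routing_weights and selected_experts: (tokens, top_k);
    experts: list of callables mapping (n, hidden_dim) -> (n, hidden_dim).
    """
    final = np.zeros_like(hidden_states)
    for expert_idx, expert in enumerate(experts):
        # token_idx holds token rows, slot_idx the top-k slot that picked this expert.
        token_idx, slot_idx = np.nonzero(selected_experts == expert_idx)
        if token_idx.size == 0:
            continue  # No token was routed to this expert.
        out = expert(hidden_states[token_idx]) * routing_weights[token_idx, slot_idx][:, None]
        # Scatter-add the weighted outputs back to their token positions.
        np.add.at(final, token_idx, out)
    return final

# Hypothetical toy experts: expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(3)]
hidden_states = np.ones((2, 4))
selected_experts = np.array([[0, 2], [1, 0]])
routing_weights = np.array([[0.75, 0.25], [0.5, 0.5]])
combined = combine_expert_outputs(hidden_states, routing_weights, selected_experts, experts)
```

Here np.add.at plays the role that index_add plays in the listing: it accumulates contributions for a token even when several experts write to the same row.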

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1)

Applies Rotary Position Embedding to the query and key tensors.

PARAMETER DESCRIPTION
q

The query tensor.

TYPE: `mindspore.Tensor`

k

The key tensor.

TYPE: `mindspore.Tensor`

cos

The cosine part of the rotary embedding.

TYPE: `mindspore.Tensor`

sin

The sine part of the rotary embedding.

TYPE: `mindspore.Tensor`

position_ids

The position indices of the tokens corresponding to the query and key tensors. For example, this can be used to pass offset position ids when working with a KV-cache.

TYPE: `mindspore.Tensor`

unsqueeze_dim

The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

RETURNS DESCRIPTION

tuple(mindspore.Tensor) comprising the query and key tensors rotated using the Rotary Position Embedding.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
    """Applies Rotary Position Embedding to the query and key tensors.

    Args:
        q (`mindspore.Tensor`): The query tensor.
        k (`mindspore.Tensor`): The key tensor.
        cos (`mindspore.Tensor`): The cosine part of the rotary embedding.
        sin (`mindspore.Tensor`): The sine part of the rotary embedding.
        position_ids (`mindspore.Tensor`):
            The position indices of the tokens corresponding to the query and key tensors. For example, this can be
            used to pass offset position ids when working with a KV-cache.
        unsqueeze_dim (`int`, *optional*, defaults to 1):
            The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
            sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
            that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
            k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
            cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
            the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.

    Returns:
        `tuple(mindspore.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
    """
    cos = cos[position_ids].unsqueeze(unsqueeze_dim)
    sin = sin[position_ids].unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
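
The broadcasting described for unsqueeze_dim can be checked with a small NumPy sketch. The function name apply_rotary, the tensor sizes, and the locally built cos and sin tables are assumptions for the example; the arithmetic follows the listing above.

```python
import numpy as np

def rotate_half(x):
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate((-x2, x1), axis=-1)

def apply_rotary(q, k, cos, sin, position_ids, unsqueeze_dim=1):
    """NumPy sketch: cos and sin are (max_len, head_dim), position_ids is (batch, seq_len)."""
    # Gather per-position rows, then add a heads axis so they broadcast against q and k.
    cos = np.expand_dims(cos[position_ids], unsqueeze_dim)
    sin = np.expand_dims(sin[position_ids], unsqueeze_dim)
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

batch, heads, seq_len, head_dim = 1, 2, 3, 4
inv_freq = 1.0 / (10000.0 ** (np.arange(0, head_dim, 2) / head_dim))
angles = np.outer(np.arange(8), inv_freq)
emb = np.concatenate((angles, angles), axis=-1)
cos, sin = np.cos(emb), np.sin(emb)

rng = np.random.default_rng(0)
q = rng.standard_normal((batch, heads, seq_len, head_dim))
k = rng.standard_normal((batch, heads, seq_len, head_dim))
position_ids = np.arange(seq_len)[None, :]
q_embed, k_embed = apply_rotary(q, k, cos, sin, position_ids)
```

Because each channel pair is rotated by an angle, the per-token vector norm is unchanged, and position 0 (angle zero) leaves the vectors as they were.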

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.load_balancing_loss_func(gate_logits, num_experts=None, top_k=2, attention_mask=None)

Computes auxiliary load balancing loss as in Switch Transformer - implemented in MindSpore.

See Switch Transformer (https://arxiv.org/abs/2101.03961) for more details. This function implements the loss function presented in equations (4) - (6) of the paper. It aims at penalizing cases where the routing between experts is too unbalanced.

PARAMETER DESCRIPTION
gate_logits

Logits from the gate, should be a tuple of model.config.num_hidden_layers tensors of shape [batch_size X sequence_length, num_experts].

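TYPE: Union[`mindspore.Tensor`, Tuple[mindspore.Tensor]]

attention_mask

The attention_mask used in the forward function, of shape [batch_size, sequence_length] if not None.

TYPE: `mindspore.Tensor`, None DEFAULT: None

num_experts

Number of experts.

TYPE: `int`, *optional* DEFAULT: None

top_k

Number of experts selected per token when building the expert mask.

TYPE: `int`, *optional*, defaults to 2 DEFAULT: 2

RETURNS DESCRIPTION
float

The auxiliary loss, or 0 if gate_logits is None or is not a tuple.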

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def load_balancing_loss_func(
    gate_logits: mindspore.Tensor, num_experts: Optional[int] = None, top_k=2, attention_mask: Optional[mindspore.Tensor] = None
) -> float:
    r"""
    Computes auxiliary load balancing loss as in Switch Transformer - implemented in MindSpore.

    See Switch Transformer (https://arxiv.org/abs/2101.03961) for more details. This function implements the loss
    function presented in equations (4) - (6) of the paper. It aims at penalizing cases where the routing between
    experts is too unbalanced.

    Args:
        gate_logits (Union[`mindspore.Tensor`, Tuple[mindspore.Tensor]]):
            Logits from the `gate`, should be a tuple of model.config.num_hidden_layers tensors of
            shape [batch_size X sequence_length, num_experts].
        attention_mask (`mindspore.Tensor`, None):
            The attention_mask used in the forward function, of
            shape [batch_size, sequence_length] if not None.
        num_experts (`int`, *optional*):
            Number of experts.
        top_k (`int`, *optional*, defaults to 2):
            Number of experts selected per token when building the expert mask.

    Returns:
        The auxiliary loss, or 0 if gate_logits is None or is not a tuple.
    """
    if gate_logits is None or not isinstance(gate_logits, tuple):
        return 0

    if isinstance(gate_logits, tuple):
        concatenated_gate_logits = ops.cat(list(gate_logits), axis=0)

    routing_weights = ops.softmax(concatenated_gate_logits, axis=-1)

    _, selected_experts = ops.topk(routing_weights, top_k, dim=-1)

    expert_mask = ops.one_hot(selected_experts, num_experts)

    if attention_mask is None:
        # Compute the percentage of tokens routed to each experts
        tokens_per_expert = ops.mean(expert_mask.float(), axis=0)

        # Compute the average probability of routing to these experts
        router_prob_per_expert = ops.mean(routing_weights, axis=0)
    else:
        batch_size, sequence_length = attention_mask.shape
        num_hidden_layers = concatenated_gate_logits.shape[0] // (batch_size * sequence_length)

        # Compute the mask that masks all padding tokens as 0 with the same shape of expert_mask
        expert_attention_mask = (
            attention_mask[None, :, :, None, None]
            .expand((num_hidden_layers, batch_size, sequence_length, top_k, num_experts))
            .reshape(-1, top_k, num_experts)
        )

        # Compute the percentage of tokens routed to each experts
        tokens_per_expert = ops.sum(expert_mask.float() * expert_attention_mask, dim=0) / ops.sum(
            expert_attention_mask, dim=0
        )

        # Compute the mask that masks all padding tokens as 0 with the same shape of tokens_per_expert
        router_per_expert_attention_mask = (
            attention_mask[None, :, :, None]
            .expand((num_hidden_layers, batch_size, sequence_length, num_experts))
            .reshape(-1, num_experts)
        )

        # Compute the average probability of routing to these experts
        router_prob_per_expert = ops.sum(routing_weights * router_per_expert_attention_mask, dim=0) / ops.sum(
            router_per_expert_attention_mask, dim=0
        )

    overall_loss = ops.sum(tokens_per_expert * router_prob_per_expert.unsqueeze(0))
    return overall_loss * num_experts
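
The unmasked branch of the loss (attention_mask is None) can be sketched in NumPy as follows. The function name load_balancing_loss and the toy logits are assumptions for illustration; the padding-aware branch is omitted.

```python
import numpy as np

def load_balancing_loss(gate_logits, num_experts, top_k=2):
    """NumPy sketch of the unmasked branch; gate_logits is a tuple of (tokens, num_experts) arrays."""
    logits = np.concatenate(list(gate_logits), axis=0)
    logits = logits - logits.max(axis=-1, keepdims=True)
    routing_weights = np.exp(logits)
    routing_weights /= routing_weights.sum(axis=-1, keepdims=True)
    selected = np.argsort(-routing_weights, axis=-1)[:, :top_k]
    # One-hot mask of shape (tokens, top_k, num_experts).
    expert_mask = np.eye(num_experts)[selected]
    # Fraction of tokens sent to each expert, per top-k slot.
    tokens_per_expert = expert_mask.mean(axis=0)
    # Mean router probability assigned to each expert.
    router_prob_per_expert = routing_weights.mean(axis=0)
    return np.sum(tokens_per_expert * router_prob_per_expert[None, :]) * num_experts

# Two layers of uniform logits: every expert gets the same probability.
balanced = load_balancing_loss((np.zeros((6, 4)), np.zeros((6, 4))), num_experts=4)
# Every token strongly prefers experts 0 and 1.
skewed_logits = np.tile(np.array([10.0, 9.0, -10.0, -10.0]), (6, 1))
skewed = load_balancing_loss((skewed_logits, skewed_logits), num_experts=4)
```

With uniform router probabilities the value equals top_k, and it grows as probability mass and token assignments concentrate on the same few experts, which is the imbalance the loss penalizes.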

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.repeat_kv(hidden_states, n_rep)

This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
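
The stated equivalence with a repeat-interleave along the heads axis can be checked with NumPy, where np.repeat with an axis argument has the same interleaving semantics. This is an illustrative sketch with arbitrary toy shapes, not the MindSpore code path.

```python
import numpy as np

def repeat_kv(hidden_states, n_rep):
    """NumPy sketch: repeat each key/value head n_rep times along axis 1."""
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    # Insert a repeat axis after the heads axis, broadcast it, then merge it into the heads axis.
    expanded = np.broadcast_to(
        hidden_states[:, :, None, :, :], (batch, num_kv_heads, n_rep, slen, head_dim)
    )
    return expanded.reshape(batch, num_kv_heads * n_rep, slen, head_dim)

kv = np.arange(2 * 2 * 3 * 4, dtype=np.float32).reshape(2, 2, 3, 4)
repeated = repeat_kv(kv, n_rep=3)
```

Each original head appears n_rep times consecutively, so query heads 0 to n_rep - 1 share the first key/value head, and so on.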

mindnlp.transformers.models.qwen2_moe.modeling_qwen2_moe.rotate_half(x)

Rotates half the hidden dims of the input.

Source code in mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    # x1 = x[..., : x.shape[-1] // 2]
    # x2 = x[..., x.shape[-1] // 2 :]
    x1, x2 = x.tensor_split(2, -1)
    return ops.cat((-x2, x1), axis=-1)
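
A small NumPy sketch makes the effect of rotate_half concrete: the last axis is split into two halves, the second half is negated and moved to the front. The input vector is an arbitrary example.

```python
import numpy as np

def rotate_half(x):
    """NumPy sketch: map (x1, x2) to (-x2, x1) along the last axis."""
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate((-x2, x1), axis=-1)

x = np.array([1.0, 2.0, 3.0, 4.0])
rotated = rotate_half(x)  # [-3., -4., 1., 2.]
```

Applying the function twice negates the input, which matches a rotation by 90 degrees in each (x1, x2) channel pair.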