
vipllava

mindnlp.transformers.models.vipllava.configuration_vipllava

VipLlava model configuration

mindnlp.transformers.models.vipllava.configuration_vipllava.VipLlavaConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [VipLlavaForConditionalGeneration]. It is used to instantiate a VipLlava model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the VipLlava-9B.

e.g. ybelkada/vip-llava-7b-hf

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vision_config

Custom vision config or dict

TYPE: `VipLlavaVisionConfig`, *optional* DEFAULT: None

text_config

The config object of the text backbone. Can be any of LlamaConfig or MistralConfig.

TYPE: `Union[AutoConfig, dict]`, *optional* DEFAULT: None

ignore_index

The ignore index for the loss function.

TYPE: `int`, *optional*, defaults to -100 DEFAULT: -100

image_token_index

The image token index to encode the image prompt.

TYPE: `int`, *optional*, defaults to 32000 DEFAULT: 32000

projector_hidden_act

The activation function used by the multimodal projector.

TYPE: `str`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

projector_layernorm_eps

The layer norm epsilon of the projector layer norm.

TYPE: `float`, *optional*, defaults to 1e-05 DEFAULT: 1e-05

vision_feature_layers

The list of layers to select the vision features from.

TYPE: `List[int]`, *optional*, defaults to `[-2, -5, -8, -11, 6]` DEFAULT: [-2, -5, -8, -11, 6]

Example
>>> from transformers import VipLlavaForConditionalGeneration, VipLlavaConfig, CLIPVisionConfig, LlamaConfig
...
>>> # Initializing a CLIP-vision config
>>> vision_config = CLIPVisionConfig()
...
>>> # Initializing a Llama config
>>> text_config = LlamaConfig()
...
>>> # Initializing a VipLlava vipllava-7b style configuration
>>> configuration = VipLlavaConfig(vision_config, text_config)
...
>>> # Initializing a model from the vipllava-7b style configuration
>>> model = VipLlavaForConditionalGeneration(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
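As a minimal sketch of the dict handling described above, vision_config and text_config may also be passed as plain dicts; a missing model_type then falls back to clip_vision_model and llama respectively (the mindnlp.transformers import path is assumed):
>>> from mindnlp.transformers import VipLlavaConfig
...
>>> # Passing the sub-configs as plain dicts; "model_type" is filled in automatically
>>> configuration = VipLlavaConfig(
...     vision_config={"hidden_size": 1024, "patch_size": 14, "image_size": 336},
...     text_config={"model_type": "llama"},
...     vision_feature_layers=[-2, -5, -8, -11, 6],
... )
>>> configuration.vision_config.model_type
'clip_vision_model'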
Source code in mindnlp/transformers/models/vipllava/configuration_vipllava.py
class VipLlavaConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`VipLlavaForConditionalGeneration`]. It is used to instantiate a
    VipLlava model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the VipLlava-9B.

    e.g. [ybelkada/vip-llava-7b-hf](https://huggingface.co/ybelkada/vip-llava-7b-hf)

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vision_config (`VipLlavaVisionConfig`,  *optional*):
            Custom vision config or dict
        text_config (`Union[AutoConfig, dict]`, *optional*):
            The config object of the text backbone. Can be any of `LlamaConfig` or `MistralConfig`.
        ignore_index (`int`, *optional*, defaults to -100):
            The ignore index for the loss function.
        image_token_index (`int`, *optional*, defaults to 32000):
            The image token index to encode the image prompt.
        projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
            The activation function used by the multimodal projector.
        projector_layernorm_eps (`float`, *optional*, defaults to 1e-05):
            The layer norm epsilon of the projector layernorm
        vision_feature_layers (`List[int]`, *optional*, defaults to `[-2, -5, -8, -11, 6]`):
            The list of layers to select the vision features from.

    Example:
        ```python
        >>> from transformers import VipLlavaForConditionalGeneration, VipLlavaConfig, CLIPVisionConfig, LlamaConfig
        ...
        >>> # Initializing a CLIP-vision config
        >>> vision_config = CLIPVisionConfig()
        ...
        >>> # Initializing a Llama config
        >>> text_config = LlamaConfig()
        ...
        >>> # Initializing a VipLlava vipllava-7b style configuration
        >>> configuration = VipLlavaConfig(vision_config, text_config)
        ...
        >>> # Initializing a model from the vipllava-7b style configuration
        >>> model = VipLlavaForConditionalGeneration(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
        """
    model_type = "vipllava"
    is_composition = False

    def __init__(
        self,
        vision_config=None,
        text_config=None,
        ignore_index=-100,
        image_token_index=32000,
        projector_hidden_act="gelu",
        projector_layernorm_eps=1e-5,
        vision_feature_layers=[-2, -5, -8, -11, 6],
        **kwargs,
    ):
        '''
        Initialize the VipLlavaConfig with the provided configuration parameters.

        Args:
            self: The instance of the class.
            vision_config (dict, optional): The configuration for the vision model. Defaults to None.
            text_config (dict, optional): The configuration for the text model. Defaults to None.
            ignore_index (int, optional): The index to ignore during training. Defaults to -100.
            image_token_index (int, optional): The index for image tokens. Defaults to 32000.
            projector_hidden_act (str, optional): The activation function for the projector hidden layer.
                Defaults to 'gelu'.
            projector_layernorm_eps (float, optional): The epsilon value for layer normalization in the projector.
                Defaults to 1e-05.
            vision_feature_layers (list of int, optional): The list of layers to extract features from in the vision model.
                Defaults to [-2, -5, -8, -11, 6].

        Returns:
            None.

        Raises:
            FutureWarning: If 'vocab_size' is provided in kwargs, it will issue a warning about deprecation and removal
                in future versions.
            Warning: If 'vocab_size' attribute is accessed, it will issue a warning about deprecation and recommendation
                to use 'text_config.vocab_size' instead.
        '''
        self.ignore_index = ignore_index
        self.image_token_index = image_token_index
        self.projector_hidden_act = projector_hidden_act
        self.projector_layernorm_eps = projector_layernorm_eps
        self.vision_feature_layers = vision_feature_layers

        if "vocab_size" in kwargs:
            warnings.warn(
                "The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect",
                FutureWarning,
            )

        self.vision_config = vision_config

        if isinstance(self.vision_config, dict):
            vision_config["model_type"] = (
                vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
            )
            self.vision_config = CONFIG_MAPPING[vision_config["model_type"]](
                **vision_config)
        elif vision_config is None:
            self.vision_config = CONFIG_MAPPING["clip_vision_model"](
                intermediate_size=4096,
                hidden_size=1024,
                patch_size=14,
                image_size=336,
                num_hidden_layers=24,
                num_attention_heads=16,
                vocab_size=32000,
                projection_dim=768,
            )

        if isinstance(text_config, dict):
            text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
            text_config = CONFIG_MAPPING[text_config["model_type"]](
                **text_config)
        elif text_config is None:
            text_config = CONFIG_MAPPING["llama"]()

        self.text_config = text_config
        self._vocab_size = self.text_config.vocab_size

        super().__init__(**kwargs)

    @property
    def vocab_size(self):
        warnings.warn(
            "The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.",
            FutureWarning,
        )
        return self._vocab_size

    def to_dict(self):
        output = super().to_dict()
        output.pop("_vocab_size", None)
        return output

mindnlp.transformers.models.vipllava.configuration_vipllava.VipLlavaConfig.__init__(vision_config=None, text_config=None, ignore_index=-100, image_token_index=32000, projector_hidden_act='gelu', projector_layernorm_eps=1e-05, vision_feature_layers=[-2, -5, -8, -11, 6], **kwargs)

Initialize the VipLlavaConfig with the provided configuration parameters.

PARAMETER DESCRIPTION
self

The instance of the class.

vision_config

The configuration for the vision model. Defaults to None.

TYPE: dict DEFAULT: None

text_config

The configuration for the text model. Defaults to None.

TYPE: dict DEFAULT: None

ignore_index

The index to ignore during training. Defaults to -100.

TYPE: int DEFAULT: -100

image_token_index

The index for image tokens. Defaults to 32000.

TYPE: int DEFAULT: 32000

projector_hidden_act

The activation function for the projector hidden layer. Defaults to 'gelu'.

TYPE: str DEFAULT: 'gelu'

projector_layernorm_eps

The epsilon value for layer normalization in the projector. Defaults to 1e-05.

TYPE: float DEFAULT: 1e-05

vision_feature_layers

The list of layers to extract features from in the vision model. Defaults to [-2, -5, -8, -11, 6].

TYPE: list of int DEFAULT: [-2, -5, -8, -11, 6]

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
FutureWarning

If 'vocab_size' is provided in kwargs, it will issue a warning about deprecation and removal in future versions.

Warning

If 'vocab_size' attribute is accessed, it will issue a warning about deprecation and recommendation to use 'text_config.vocab_size' instead.
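To illustrate the deprecation path noted above, the vocabulary size should be read from the text backbone; a short sketch (same assumed import path as the earlier example, default LlamaConfig vocabulary):
>>> from mindnlp.transformers import VipLlavaConfig
...
>>> configuration = VipLlavaConfig()
>>> # Preferred over the deprecated `vocab_size` attribute
>>> configuration.text_config.vocab_size
32000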

Source code in mindnlp/transformers/models/vipllava/configuration_vipllava.py
def __init__(
    self,
    vision_config=None,
    text_config=None,
    ignore_index=-100,
    image_token_index=32000,
    projector_hidden_act="gelu",
    projector_layernorm_eps=1e-5,
    vision_feature_layers=[-2, -5, -8, -11, 6],
    **kwargs,
):
    '''
    Initialize the VipLlavaConfig with the provided configuration parameters.

    Args:
        self: The instance of the class.
        vision_config (dict, optional): The configuration for the vision model. Defaults to None.
        text_config (dict, optional): The configuration for the text model. Defaults to None.
        ignore_index (int, optional): The index to ignore during training. Defaults to -100.
        image_token_index (int, optional): The index for image tokens. Defaults to 32000.
        projector_hidden_act (str, optional): The activation function for the projector hidden layer.
            Defaults to 'gelu'.
        projector_layernorm_eps (float, optional): The epsilon value for layer normalization in the projector.
            Defaults to 1e-05.
        vision_feature_layers (list of int, optional): The list of layers to extract features from in the vision model.
            Defaults to [-2, -5, -8, -11, 6].

    Returns:
        None.

    Raises:
        FutureWarning: If 'vocab_size' is provided in kwargs, it will issue a warning about deprecation and removal
            in future versions.
        Warning: If 'vocab_size' attribute is accessed, it will issue a warning about deprecation and recommendation
            to use 'text_config.vocab_size' instead.
    '''
    self.ignore_index = ignore_index
    self.image_token_index = image_token_index
    self.projector_hidden_act = projector_hidden_act
    self.projector_layernorm_eps = projector_layernorm_eps
    self.vision_feature_layers = vision_feature_layers

    if "vocab_size" in kwargs:
        warnings.warn(
            "The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect",
            FutureWarning,
        )

    self.vision_config = vision_config

    if isinstance(self.vision_config, dict):
        vision_config["model_type"] = (
            vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
        )
        self.vision_config = CONFIG_MAPPING[vision_config["model_type"]](
            **vision_config)
    elif vision_config is None:
        self.vision_config = CONFIG_MAPPING["clip_vision_model"](
            intermediate_size=4096,
            hidden_size=1024,
            patch_size=14,
            image_size=336,
            num_hidden_layers=24,
            num_attention_heads=16,
            vocab_size=32000,
            projection_dim=768,
        )

    if isinstance(text_config, dict):
        text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
        text_config = CONFIG_MAPPING[text_config["model_type"]](
            **text_config)
    elif text_config is None:
        text_config = CONFIG_MAPPING["llama"]()

    self.text_config = text_config
    self._vocab_size = self.text_config.vocab_size

    super().__init__(**kwargs)

    @property
    def vocab_size(self):
        warnings.warn(
            "The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.",
            FutureWarning,
        )
        return self._vocab_size

    def to_dict(self):
        output = super().to_dict()
        output.pop("_vocab_size", None)
        return output

mindnlp.transformers.models.vipllava.modeling_vipllava

MindSpore VipLlava model.

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaCausalLMOutputWithPast dataclass

Bases: ModelOutput

Base class for VipLlava causal language model (or autoregressive) outputs.

PARAMETER DESCRIPTION
loss

Language modeling loss (for next-token prediction).

TYPE: `torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided DEFAULT: None

logits

Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

TYPE: `torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)` DEFAULT: None

image_hidden_states

Tuple of torch.FloatTensor (one for the output of the image embeddings) of shape (batch_size, num_images, sequence_length, hidden_size).

Image hidden states of the model produced by the vision encoder, and optionally by the perceiver.

TYPE: `tuple(torch.FloatTensor)`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
@dataclass
# Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->VipLlava
class VipLlavaCausalLMOutputWithPast(ModelOutput):
    """
    Base class for VipLlava causal language model (or autoregressive) outputs.

    Args:
        loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
            Language modeling loss (for next-token prediction).
        logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed
            or when `config.use_cache=True`):
            Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
            `(batch_size, num_heads, sequence_length, embed_size_per_head)`)

            Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
            `past_key_values` input) to speed up sequential decoding.
        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed
            or when `config.output_hidden_states=True`):
            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed
            or when `config.output_attentions=True`):
            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
        image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
            Tuple of `torch.FloatTensor` (one for the output of the image embeddings) of shape `(batch_size, num_images,
            sequence_length, hidden_size)`.

            Image hidden states of the model produced by the vision encoder, and optionally by the perceiver.
    """
    loss: Optional[Tensor] = None
    logits: Tensor = None
    past_key_values: Optional[List[Tensor]] = None
    hidden_states: Optional[Tuple[Tensor]] = None
    attentions: Optional[Tuple[Tensor]] = None
    image_hidden_states: Optional[Tuple[Tensor]] = None
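Like other ModelOutput containers, the fields above can be read by attribute or by tuple-style indexing, where fields set to None are skipped. A small sketch with a toy logits tensor (the import path mirrors this page's module and is an assumption):
>>> import mindspore as ms
>>> from mindspore import ops
>>> from mindnlp.transformers.models.vipllava.modeling_vipllava import VipLlavaCausalLMOutputWithPast
...
>>> out = VipLlavaCausalLMOutputWithPast(logits=ops.zeros((1, 4, 32000), ms.float32))
>>> out.logits.shape
(1, 4, 32000)
>>> out[0].shape  # loss is None, so index 0 is the logits
(1, 4, 32000)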

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration

Bases: VipLlavaPreTrainedModel

This class represents a model for conditional generation using the VipLlava architecture. It inherits from VipLlavaPreTrainedModel and provides methods for preparing inputs for generation, running the forward pass, and reordering the cache.

METHOD DESCRIPTION
forward

Generates output based on input tokens, image features, attention mask, and other optional parameters. It returns a tuple or VipLlavaCausalLMOutputWithPast object.

prepare_inputs_for_generation

Prepares model inputs for generation, considering past key values, inputs embeds, pixel values, attention mask, and position ids.

_reorder_cache

Reorders the cache for the model.

The class also includes methods for handling input and output embeddings, decoder settings, tying weights, and resizing token embeddings.
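The distinctive step in forward is how the image features are assembled: the hidden states of the CLIP layers listed in vision_feature_layers are taken, the CLS token is dropped, and the selected layers are concatenated along the channel axis before the multi-modal projector. A standalone sketch with dummy tensors, assuming the default CLIP vision tower of this configuration (25 hidden states of shape (batch, 577, 1024)):
>>> import mindspore as ms
>>> from mindspore import ops
...
>>> hidden_states = [ops.zeros((1, 577, 1024), ms.float32) for _ in range(25)]
>>> vision_feature_layers = [-2, -5, -8, -11, 6]
>>> # Drop the CLS token (index 0) and concatenate the selected layers channel-wise
>>> image_features = ops.cat([hidden_states[i][:, 1:] for i in vision_feature_layers], axis=-1)
>>> image_features.shape
(1, 576, 5120)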

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
class VipLlavaForConditionalGeneration(VipLlavaPreTrainedModel):

    """
    This class represents a model for conditional generation using the VipLlava architecture.
    It inherits from VipLlavaPreTrainedModel and provides methods for preparing inputs for generation, running the
    forward pass, and reordering the cache.

    Methods:
        forward: Generates output based on input tokens, image features, attention mask, and other optional parameters.
            It returns a tuple or VipLlavaCausalLMOutputWithPast object.
        prepare_inputs_for_generation: Prepares model inputs for generation, considering past key values, inputs embeds,
            pixel values, attention mask, and position ids.
        _reorder_cache: Reorders the cache for the model.

    The class also includes methods for handling input and output embeddings, decoder settings, tying weights,
    and resizing token embeddings.
    """
    def __init__(self, config: VipLlavaConfig):
        """
        Initializes an instance of the VipLlavaForConditionalGeneration class.

        Args:
            self: The instance of the VipLlavaForConditionalGeneration class.
            config (VipLlavaConfig): The configuration object containing settings for the model.
                It is used to initialize the vision tower, multi-modal projector, language model, and other attributes.
                This parameter is mandatory and must be an instance of VipLlavaConfig.

        Returns:
            None.

        Raises:
            ValueError: If the provided config parameter is not an instance of VipLlavaConfig.
            RuntimeError: If an error occurs during the initialization process.
        """
        super().__init__(config)
        self.vision_tower = AutoModel.from_config(config.vision_config)

        self.multi_modal_projector = VipLlavaMultiModalProjector(config)
        self.vocab_size = config.text_config.vocab_size
        self.language_model = AutoModelForCausalLM.from_config(
            config.text_config
        )
        self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
        self.post_init()

    def get_input_embeddings(self):
        """
        Returns the input embeddings of the language model used for conditional generation.

        Args:
            self: An instance of the VipLlavaForConditionalGeneration class.

        Returns:
            embeddings: This method returns the input embeddings of the language model used for conditional generation.
                The embeddings are obtained by calling the 'get_input_embeddings()' method of the language model.

        Raises:
            None.
        """
        return self.language_model.get_input_embeddings()

    def set_input_embeddings(self, value):
        """
        Sets the input embeddings for the VipLlavaForConditionalGeneration language model.

        Args:
            self (VipLlavaForConditionalGeneration): The instance of the VipLlavaForConditionalGeneration class.
            value: The input embeddings to be set for the language model. It should be a tensor of shape
                (vocab_size, embedding_dim).

        Returns:
            None.

        Raises:
            None.
        """
        self.language_model.set_input_embeddings(value)

    def get_output_embeddings(self):
        """
        This method returns the output embeddings from the language model for the VipLlavaForConditionalGeneration class.

        Args:
            self: The instance of the VipLlavaForConditionalGeneration class.

        Returns:
            embeddings: The output embeddings of the underlying language model, retrieved via its 'get_output_embeddings()' method.

        Raises:
            None.
        """
        return self.language_model.get_output_embeddings()

    def set_output_embeddings(self, new_embeddings):
        """
        Sets the output embeddings for the VipLlavaForConditionalGeneration model.

        Args:
            self (VipLlavaForConditionalGeneration): The instance of the VipLlavaForConditionalGeneration class.
            new_embeddings (object): The new embeddings to be set for the model's output.
                It should be compatible with the model's requirements.

        Returns:
            None.

        Raises:
            TypeError: If the new_embeddings parameter is not of the correct type.
            ValueError: If the new_embeddings parameter does not meet the model's requirements.
        """
        self.language_model.set_output_embeddings(new_embeddings)

    def set_decoder(self, decoder):
        """
        Sets the decoder for the VipLlavaForConditionalGeneration instance.

        Args:
            self (VipLlavaForConditionalGeneration): The VipLlavaForConditionalGeneration instance.
            decoder: The decoder to be set for the language model. It should be an instance of the decoder class.

        Returns:
            None.

        Raises:
            This method does not raise any exceptions.
        """
        self.language_model.set_decoder(decoder)

    def get_decoder(self):
        """
        This method returns the decoder from the language model associated with the VipLlavaForConditionalGeneration class.

        Args:
            self (VipLlavaForConditionalGeneration): The instance of the VipLlavaForConditionalGeneration class.
                It is used to access the language model and retrieve the decoder.

        Returns:
            decoder: The decoder of the underlying language model.

        Raises:
            This method does not raise any exceptions explicitly. However, exceptions related to accessing the
            language model or retrieving the decoder may be raised indirectly.
        """
        return self.language_model.get_decoder()

    def tie_weights(self):
        """
        Method to tie weights for the VipLlavaForConditionalGeneration class.

        Args:
            self: VipLlavaForConditionalGeneration object.
                Represents an instance of the VipLlavaForConditionalGeneration class.

        Returns:
            None.

        Raises:
            None.
        """
        return self.language_model.tie_weights()

    def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
        """
        Resizes the token embeddings of the VipLlavaForConditionalGeneration model.

        Args:
            self (VipLlavaForConditionalGeneration): The instance of the VipLlavaForConditionalGeneration class.
            new_num_tokens (Optional[int]): The new number of tokens for resizing the embeddings. Defaults to None.
            pad_to_multiple_of: The value to pad the embeddings to a multiple of. Defaults to None.

        Returns:
            nn.Embedding: The resized token embeddings as an instance of nn.Embedding.

        Raises:
            None.

        This method resizes the token embeddings of the VipLlavaForConditionalGeneration model based on the provided
        parameters. It first resizes the token embeddings of the underlying language model using the
        'resize_token_embeddings' method. It then updates the 'vocab_size' configuration parameter and the
        'vocab_size' attribute of the model to reflect the new size of the embeddings. Finally, it returns the resized
        token embeddings as an instance of nn.Embedding.
        """
        model_embeds = self.language_model.resize_token_embeddings(
            new_num_tokens, pad_to_multiple_of)
        # update vocab size
        self.config.text_config.vocab_size = model_embeds.vocab_size
        self.vocab_size = model_embeds.vocab_size
        return model_embeds

    def _merge_input_ids_with_image_features(self, image_features, inputs_embeds, input_ids, attention_mask, labels):
        """
        This method '_merge_input_ids_with_image_features' in the class 'VipLlavaForConditionalGeneration' merges input
        ids with image features to create final embeddings for conditional generation.

        Args:
            self: The instance of the class.
            image_features: Tensor containing image features with shape (num_images, num_image_patches, embed_dim).
            inputs_embeds: Tensor containing embeddings for input tokens.
            input_ids: Tensor containing input token IDs with shape (batch_size, sequence_length).
            attention_mask: Tensor containing attention mask for input tokens.
            labels: Optional tensor containing labels for tokens.

        Returns:
            final_embedding: Tensor containing final embeddings with shape (batch_size, max_embed_dim, embed_dim).
            final_attention_mask: Tensor containing final attention mask with shape (batch_size, max_embed_dim).
            final_labels: Tensor containing final labels with shape (batch_size, max_embed_dim). Returns None if labels
                is None.
            position_ids: Tensor containing position IDs.

        Raises:
            ValueError: If the input provided to the model is incorrect, raising an exception with details on the
                mismatch of image tokens and images given.
        """
        num_images, num_image_patches, embed_dim = image_features.shape
        batch_size, sequence_length = input_ids.shape
        left_padding = not ops.sum(
            input_ids[:, -1] == Tensor(self.pad_token_id))
        # 1. Create a mask to know where special image tokens are
        special_image_token_mask = input_ids == self.config.image_token_index
        num_special_image_tokens = ops.sum(special_image_token_mask, dim=-1)
        # Compute the maximum embed dimension
        max_embed_dim = (num_special_image_tokens.max() *
                         (num_image_patches - 1)).item() + sequence_length
        nonzero = ops.nonzero(input_ids != self.config.image_token_index)
        batch_indices, non_image_indices = ops.tensor_split(nonzero, 2, -1)

        # 2. Compute the positions where text should be written
        # Calculate new positions for text tokens in merged image-text sequence.
        # `special_image_token_mask` identifies image tokens. Each image token will be replaced by `nb_text_tokens_per_images - 1` text tokens.
        # `torch.cumsum` computes how each image token shifts subsequent text token positions.
        # - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.
        new_token_positions = ops.cumsum(
            (special_image_token_mask * (num_image_patches - 1) + 1), -1) - 1
        nb_image_pad = max_embed_dim - 1 - new_token_positions[:, -1]
        if left_padding:
            # offset for left padding
            new_token_positions += nb_image_pad[:, None]
        text_to_overwrite = new_token_positions[batch_indices,
                                                non_image_indices]

        # 3. Create the full embedding, already padded to the maximum position
        final_embedding = ops.zeros(
            (batch_size, int(max_embed_dim), embed_dim), dtype=inputs_embeds.dtype
        )
        final_attention_mask = ops.zeros(
            (batch_size, int(max_embed_dim)), dtype=attention_mask.dtype
        )
        if labels is not None:
            final_labels = ops.full(
                (batch_size, max_embed_dim), self.config.ignore_index, dtype=input_ids.dtype
            )
        # In case the Vision model or the Language model has been offloaded to CPU, we need to manually
        # set the corresponding tensors into their correct target device.

        # 4. Fill the embeddings based on the mask. If we have ["hey" "<image>", "how", "are"]
        # we need to index copy on [0, 577, 578, 579] for the text and [1:576] for the image features
        final_embedding[batch_indices,
                        text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]
        final_attention_mask[batch_indices,
                             text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
        if labels is not None:
            final_labels[batch_indices,
                         text_to_overwrite] = labels[batch_indices, non_image_indices]

        # 5. Fill the embeddings corresponding to the images. Anything that is still zeros needs filling
        image_to_overwrite = ops.all(final_embedding == 0, axis=-1)
        image_to_overwrite &= image_to_overwrite.cumsum(
            -1) - 1 >= nb_image_pad[:, None]

        if image_to_overwrite.sum() != reduce(lambda x, y: x * y, image_features.shape[:-1]):
            raise ValueError(
                f"The input provided to the model are wrong. The number of image tokens is {ops.sum(special_image_token_mask)} while"
                f" the number of image given to the model is {num_images}. This prevents correct indexing and breaks batch generation."
            )

        final_embedding[image_to_overwrite] = image_features.reshape(
            -1, embed_dim)
        final_attention_mask |= image_to_overwrite
        position_ids = (final_attention_mask.cumsum(-1) -
                        1).masked_fill((final_attention_mask == 0), 1)

        # 6. Mask out the embedding at padding positions, as we later use the past_key_value value to determine the non-attended tokens.
        nonzero = ops.nonzero(input_ids == self.pad_token_id)
        batch_indices, pad_indices = ops.tensor_split(nonzero, 2, -1)
        indices_to_mask = new_token_positions[batch_indices, pad_indices]

        if batch_indices.shape[0] > 0:
            final_embedding[batch_indices, indices_to_mask] = 0

        if labels is None:
            final_labels = None

        return final_embedding, final_attention_mask, final_labels, position_ids

    # Ignore copy
    def forward(
        self,
        input_ids: Tensor = None,
        pixel_values: Tensor = None,
        attention_mask: Optional[Tensor] = None,
        position_ids: Optional[Tensor] = None,
        past_key_values: Optional[List[Tensor]] = None,
        inputs_embeds: Optional[Tensor] = None,
        vision_feature_layers: Optional[List[int]] = None,
        labels: Optional[Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, VipLlavaCausalLMOutputWithPast]:
        r"""
        Args:
            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

        Returns:
            Union[Tuple, VipLlavaCausalLMOutputWithPast]

        Example:
            ```python
            >>> import torch
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, VipLlavaForConditionalGeneration
            ...
            >>> model = VipLlavaForConditionalGeneration.from_pretrained("llava-hf/vip-llava-7b-hf", device_map="auto", torch_dtype=torch.float16)
            >>> processor = AutoProcessor.from_pretrained("llava-hf/vip-llava-7b-hf")
            ...
            >>> prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human:<image>\n{}###Assistant:"
            >>> question = "Can you please describe this image?"
            >>> prompt = prompt.format(question)
            >>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-neg.png"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = processor(text=prompt, images=image, return_tensors="pt").to(0, torch.float16)
            ...
            >>> # Generate
            >>> generate_ids = model.generate(**inputs, max_new_tokens=20)
            >>> processor.decode(generate_ids[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
            The image features a brown and white cat sitting on a green surface, with a red ball in its
            ```
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        vision_feature_layers = (
            vision_feature_layers if vision_feature_layers is not None else self.config.vision_feature_layers
        )

        if inputs_embeds is None:
            # 1. Extract the input embeddings
            inputs_embeds = self.get_input_embeddings()(input_ids)

            # 2. Merge text and images
            if pixel_values is not None and input_ids.shape[1] != 1:
                image_outputs = self.vision_tower(
                    pixel_values, output_hidden_states=True)
                # For VIP-llava, the image features are computed this way
                # We select the features from index 1: for the layers -2, -5, -8, -11 and 6
                image_features = [image_outputs.hidden_states[index][:, 1:]
                                  for index in vision_feature_layers]
                image_features = ops.cat(image_features, axis=-1)

                image_features = self.multi_modal_projector(image_features)
                inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
                    image_features, inputs_embeds, input_ids, attention_mask, labels
                )
                if labels is None:
                    labels = ops.full_like(
                        attention_mask, self.config.ignore_index).to(ms.int64)
            else:
                # In case input_ids.shape[1] == 1 & pixel_values==None & past_key_values != None, we are in the case of
                # generation with cache
                if past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
                    # Retrieve the first layer to inspect the logits and mask out the hidden states
                    # that are set to 0
                    first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]

                    # Sum all dimensions of head_dim (-1) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
                    nonzero = ops.nonzero(first_layer_past_key_value.float().sum(-2) == 0)
                    batch_index, non_attended_tokens = ops.tensor_split(nonzero, 2, -1)

                    target_length = input_ids.shape[1]
                    past_length = first_layer_past_key_value.shape[-1]

                    extended_attention_mask = ops.ones(
                        (attention_mask.shape[0], past_length),
                        dtype=attention_mask.dtype,
                    )

                    # Filter out only the tokens that can be un-attended, this can happen
                    # in the case one uses Llava + Fused modules where the cache on the
                    # first iteration is already big enough, or if one passes custom cache
                    valid_indices = non_attended_tokens < extended_attention_mask.shape[-1]
                    new_batch_index = batch_index[valid_indices]
                    new_non_attended_tokens = non_attended_tokens[valid_indices]

                    # Zero-out the places where we don't need to attend
                    extended_attention_mask[new_batch_index,
                                            new_non_attended_tokens] = 0

                    attention_mask = ops.cat(
                        (extended_attention_mask, attention_mask[:, -target_length:]), axis=1)
                    position_ids = ops.sum(
                        attention_mask, dim=1).unsqueeze(-1) - 1

        outputs = self.language_model(
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        logits = outputs[0]

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            if attention_mask is not None:
                shift_attention_mask = attention_mask[..., 1:]
                shift_logits = logits[..., :-1, :][shift_attention_mask != 0]
                shift_labels = labels[..., 1:][shift_attention_mask != 0]
            else:
                shift_logits = logits[..., :-1, :]
                shift_labels = labels[..., 1:]
            # Flatten the tokens
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(
                shift_logits.view(-1, shift_logits.shape[-1]
                                  ), shift_labels.astype(ms.int32).view(-1)
            )

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return VipLlavaCausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def prepare_inputs_for_generation(
        self, input_ids, past_key_values=None, inputs_embeds=None, pixel_values=None, attention_mask=None, **kwargs
    ):
        """
        Method to prepare inputs for generation in the VipLlavaForConditionalGeneration class.

        Args:
            self: The instance of the class.
            input_ids (Tensor): The input tensor containing token ids of the input sequence.
            past_key_values (Cache or tuple): The cache of key values from previous computations or tuple
                representing past and cache length.
            inputs_embeds (Tensor): An optional tensor containing embeddings for input tokens.
            pixel_values (Tensor): An optional tensor containing pixel values for image inputs.
            attention_mask (Tensor): An optional tensor indicating the attention mask for the input sequence.

        Returns:
            dict: A dictionary containing model inputs for generation, including inputs_embeds, input_ids, position_ids,
                past_key_values, use_cache, attention_mask, and pixel_values.

        Raises:
            TypeError: If past_key_values is not of type Cache or tuple.
            IndexError: If the attention_mask shape is incompatible with input_ids.
            ValueError: If the pixel_values tensor is missing.
            RuntimeError: If there is an issue with calculating position_ids based on attention_mask.
        """
        if past_key_values is not None:
            if isinstance(past_key_values, Cache):
                cache_length = past_key_values.get_seq_length()
                past_length = past_key_values.seen_tokens
            else:
                cache_length = past_length = past_key_values[0][0].shape[2]

            # Keep only the unprocessed tokens:
            # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
            # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
            # input)
            if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
                input_ids = input_ids[:, -
                                      (attention_mask.shape[1] - past_length):]
            # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
            # input_ids based on the past_length.
            elif past_length < input_ids.shape[1]:
                input_ids = input_ids[:, past_length:]
            # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
            elif self.config.image_token_index in input_ids:
                input_ids = input_ids[:, input_ids.shape[1] - 1:]
            # If the cache has seen more tokens than it can hold, then the cache has a size limit. Let's discard the
            # older attention values, as their corresponding values are not part of the input.
            if cache_length < past_length and attention_mask is not None:
                attention_mask = attention_mask[:, -
                                                (cache_length + input_ids.shape[1]):]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids = position_ids.masked_fill(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -input_ids.shape[1]:]

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "position_ids": position_ids,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
                "pixel_values": pixel_values,
            }
        )
        return model_inputs

    def _reorder_cache(self, *args, **kwargs):
        """
        Method _reorder_cache in class VipLlavaForConditionalGeneration.

        Args:
            self: VipLlavaForConditionalGeneration object. The instance of the class.

        Returns:
            None.

        Raises:
            None.
        """
        return self.language_model._reorder_cache(*args, **kwargs)

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration.__init__(config)

Initializes an instance of the VipLlavaForConditionalGeneration class.

PARAMETER DESCRIPTION
self

The instance of the VipLlavaForConditionalGeneration class.

config

The configuration object containing settings for the model. It is used to initialize the vision tower, multi-modal projector, language model, and other attributes. This parameter is mandatory and must be an instance of VipLlavaConfig.

TYPE: VipLlavaConfig

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If the provided config parameter is not an instance of VipLlavaConfig.

RuntimeError

If an error occurs during the initialization process.
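A minimal construction sketch, assuming the public mindnlp.transformers import path; the sub-configs are shrunk to toy sizes so that the randomly initialized model is cheap to build:
>>> from mindnlp.transformers import VipLlavaConfig, VipLlavaForConditionalGeneration
...
>>> config = VipLlavaConfig(
...     vision_config={"hidden_size": 32, "intermediate_size": 64, "num_hidden_layers": 2,
...                    "num_attention_heads": 2, "image_size": 28, "patch_size": 14},
...     text_config={"model_type": "llama", "hidden_size": 32, "intermediate_size": 64,
...                  "num_hidden_layers": 2, "num_attention_heads": 2, "vocab_size": 1000},
... )
>>> model = VipLlavaForConditionalGeneration(config)
>>> model.vocab_size == config.text_config.vocab_size
True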

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def __init__(self, config: VipLlavaConfig):
    """
    Initializes an instance of the VipLlavaForConditionalGeneration class.

    Args:
        self: The instance of the VipLlavaForConditionalGeneration class.
        config (VipLlavaConfig): The configuration object containing settings for the model.
            It is used to initialize the vision tower, multi-modal projector, language model, and other attributes.
            This parameter is mandatory and must be an instance of VipLlavaConfig.

    Returns:
        None.

    Raises:
        ValueError: If the provided config parameter is not an instance of VipLlavaConfig.
        RuntimeError: If an error occurs during the initialization process.
    """
    super().__init__(config)
    self.vision_tower = AutoModel.from_config(config.vision_config)

    self.multi_modal_projector = VipLlavaMultiModalProjector(config)
    self.vocab_size = config.text_config.vocab_size
    self.language_model = AutoModelForCausalLM.from_config(
        config.text_config
    )
    self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
    self.post_init()

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration.forward(input_ids=None, pixel_values=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, vision_feature_layers=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

TYPE: `torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, VipLlavaCausalLMOutputWithPast]

Union[Tuple, VipLlavaCausalLMOutputWithPast]

Example
>>> import torch
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, VipLlavaForConditionalGeneration
...
>>> model = VipLlavaForConditionalGeneration.from_pretrained("llava-hf/vip-llava-7b-hf", device_map="auto", torch_dtype=torch.float16)
>>> processor = AutoProcessor.from_pretrained("llava-hf/vip-llava-7b-hf")
...
>>> prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human:<image>\n{}###Assistant:"
>>> question = "Can you please describe this image?"
>>> prompt = prompt.format(question)
>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-neg.png"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(text=prompt, images=image, return_tensors="pt").to(0, torch.float16)
...
>>> # Generate
>>> generate_ids = model.generate(**inputs, max_new_tokens=20)
>>> processor.decode(generate_ids[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
The image features a brown and white cat sitting on a green surface, with a red ball in its
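When labels are supplied, the loss masks out padded positions with the attention mask and shifts logits and labels by one position so that each token predicts the next one. A self-contained sketch of that computation with toy tensors (shapes are illustrative only):
>>> import mindspore as ms
>>> from mindspore import nn, ops
...
>>> logits = ops.randn(1, 5, 10)                # (batch, seq_len, vocab)
>>> labels = ops.ones((1, 5), ms.int32)
>>> attention_mask = ms.Tensor([[1, 1, 1, 1, 0]], ms.int32)
...
>>> shift_mask = attention_mask[..., 1:]
>>> shift_logits = logits[..., :-1, :][shift_mask != 0]
>>> shift_labels = labels[..., 1:][shift_mask != 0]
>>> loss = nn.CrossEntropyLoss()(shift_logits.view(-1, shift_logits.shape[-1]), shift_labels.view(-1))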
Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def forward(
    self,
    input_ids: Tensor = None,
    pixel_values: Tensor = None,
    attention_mask: Optional[Tensor] = None,
    position_ids: Optional[Tensor] = None,
    past_key_values: Optional[List[Tensor]] = None,
    inputs_embeds: Optional[Tensor] = None,
    vision_feature_layers: Optional[List[int]] = None,
    labels: Optional[Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, VipLlavaCausalLMOutputWithPast]:
    r"""
    Args:
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

    Returns:
        Union[Tuple, VipLlavaCausalLMOutputWithPast]

    Example:
        ```python
        >>> import torch
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, VipLlavaForConditionalGeneration
        ...
        >>> model = VipLlavaForConditionalGeneration.from_pretrained("llava-hf/vip-llava-7b-hf", device_map="auto", torch_dtype=torch.float16)
        >>> processor = AutoProcessor.from_pretrained("llava-hf/vip-llava-7b-hf")
        ...
        >>> prompt = "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human:<image>\n{}###Assistant:"
        >>> question = "Can you please describe this image?"
        >>> prompt = prompt.format(question)
        >>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/compel-neg.png"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(text=prompt, images=image, return_tensors="pt").to(0, torch.float16)
        ...
        >>> # Generate
        >>> generate_ids = model.generate(**inputs, max_new_tokens=20)
        >>> processor.decode(generate_ids[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
        The image features a brown and white cat sitting on a green surface, with a red ball in its
        ```
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    vision_feature_layers = (
        vision_feature_layers if vision_feature_layers is not None else self.config.vision_feature_layers
    )

    if inputs_embeds is None:
        # 1. Extract the input embeddings
        inputs_embeds = self.get_input_embeddings()(input_ids)

        # 2. Merge text and images
        if pixel_values is not None and input_ids.shape[1] != 1:
            image_outputs = self.vision_tower(
                pixel_values, output_hidden_states=True)
            # For VIP-llava, the image features are computed this way
            # We select the features from index 1: for the layers -2, -5, -8, -11 and 6
            image_features = [image_outputs.hidden_states[index][:, 1:]
                              for index in vision_feature_layers]
            image_features = ops.cat(image_features, axis=-1)

            image_features = self.multi_modal_projector(image_features)
            inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
                image_features, inputs_embeds, input_ids, attention_mask, labels
            )
            if labels is None:
                labels = ops.full_like(
                    attention_mask, self.config.ignore_index).to(ms.int64)
        else:
            # In case input_ids.shape[1] == 1 & pixel_values != None & past_key_values != None, we are in the case of
            # generation with cache
            if past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
                # Retrieve the first layer to inspect the logits and mask out the hidden states
                # that are set to 0
                first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]

                # Sum over the sequence dimension (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
                nonzero = ops.nonzero(first_layer_past_key_value.float().sum(-2) == 0)
                batch_index, non_attended_tokens = ops.tensor_split(nonzero, 2, -1)

                target_length = input_ids.shape[1]
                past_length = first_layer_past_key_value.shape[-1]

                extended_attention_mask = ops.ones(
                    (attention_mask.shape[0], past_length),
                    dtype=attention_mask.dtype,
                )

                # Filter out only the tokens that can be un-attended, this can happen
                # in the case one uses Llava + Fused modules where the cache on the
                # first iteration is already big enough, or if one passes custom cache
                valid_indices = non_attended_tokens < extended_attention_mask.shape[-1]
                new_batch_index = batch_index[valid_indices]
                new_non_attended_tokens = non_attended_tokens[valid_indices]

                # Zero-out the places where we don't need to attend
                extended_attention_mask[new_batch_index,
                                        new_non_attended_tokens] = 0

                attention_mask = ops.cat(
                    (extended_attention_mask, attention_mask[:, -target_length:]), axis=1)
                position_ids = ops.sum(
                    attention_mask, dim=1).unsqueeze(-1) - 1

    outputs = self.language_model(
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    logits = outputs[0]

    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        if attention_mask is not None:
            shift_attention_mask = attention_mask[..., 1:]
            shift_logits = logits[..., :-1, :][shift_attention_mask != 0]
            shift_labels = labels[..., 1:][shift_attention_mask != 0]
        else:
            shift_logits = logits[..., :-1, :]
            shift_labels = labels[..., 1:]
        # Flatten the tokens
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(
            shift_logits.view(-1, shift_logits.shape[-1]), shift_labels.astype(ms.int32).view(-1)
        )

    if not return_dict:
        output = (logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return VipLlavaCausalLMOutputWithPast(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )
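
The VipLlava-specific step above is the multi-layer feature selection: hidden states from several CLIP layers are taken (dropping the CLS token at index 0) and concatenated along the feature dimension before being handed to the projector. Below is a minimal, framework-free sketch of that selection; the batch size, patch count, and hidden size are assumptions chosen to match the default configuration, and random NumPy arrays stand in for real CLIP hidden states.

```python
import numpy as np

# Assumed CLIP-like output: 25 hidden states (embeddings + 24 layers),
# each of shape (batch, 1 + num_patches, vision_hidden_size).
batch, num_patches, vision_hidden = 1, 576, 1024
hidden_states = [np.random.randn(batch, 1 + num_patches, vision_hidden)
                 for _ in range(25)]

vision_feature_layers = [-2, -5, -8, -11, 6]  # default in VipLlavaConfig

# Drop the CLS token (index 0) and concatenate the selected layers along the
# feature dimension, mirroring `ops.cat(image_features, axis=-1)` above.
image_features = np.concatenate(
    [hidden_states[idx][:, 1:] for idx in vision_feature_layers], axis=-1
)
print(image_features.shape)  # (1, 576, 5 * 1024) -- this is what multi_modal_projector receives
```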

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration.get_decoder()

This method returns the decoder from the language model associated with the VipLlavaForConditionalGeneration class.

PARAMETER DESCRIPTION
self

The instance of the VipLlavaForConditionalGeneration class. It is used to access the language model and retrieve the decoder.

TYPE: VipLlavaForConditionalGeneration

RETURNS DESCRIPTION
decoder

The decoder module of the underlying language model, as returned by its 'get_decoder()' method.

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def get_decoder(self):
    """
    This method returns the decoder from the language model associated with the VipLlavaForConditionalGeneration class.

    Args:
        self (VipLlavaForConditionalGeneration): The instance of the VipLlavaForConditionalGeneration class.
            It is used to access the language model and retrieve the decoder.

    Returns:
        decoder: The decoder module of the underlying language model.

    Raises:
        This method does not raise any exceptions explicitly. However, exceptions related to accessing the
        language model or retrieving the decoder may be raised indirectly.
    """
    return self.language_model.get_decoder()

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration.get_input_embeddings()

Returns the input embeddings of the language model used for conditional generation.

PARAMETER DESCRIPTION
self

An instance of the VipLlavaForConditionalGeneration class.

RETURNS DESCRIPTION
embeddings

The input embedding module of the underlying language model, obtained by calling its 'get_input_embeddings()' method.

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def get_input_embeddings(self):
    """
    Returns the input embeddings of the language model used for conditional generation.

    Args:
        self: An instance of the VipLlavaForConditionalGeneration class.

    Returns:
        embeddings: The input embedding module of the underlying language model, obtained by calling its
            'get_input_embeddings()' method.

    Raises:
        None.
    """
    return self.language_model.get_input_embeddings()
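
As a quick usage sketch (the checkpoint name is illustrative only), the accessor simply delegates to the language model, so the returned module can be inspected or handed back via `set_input_embeddings`:

```python
>>> from mindnlp.transformers.models.vipllava.modeling_vipllava import VipLlavaForConditionalGeneration
...
>>> model = VipLlavaForConditionalGeneration.from_pretrained("llava-hf/vip-llava-7b-hf")
>>> embeddings = model.get_input_embeddings()   # same module as model.language_model.get_input_embeddings()
>>> model.set_input_embeddings(embeddings)      # round trip; a replacement embedding module could be set instead
```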

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration.get_output_embeddings()

This method returns the output embeddings from the language model for the VipLlavaForConditionalGeneration class.

PARAMETER DESCRIPTION
self

The instance of the VipLlavaForConditionalGeneration class.

RETURNS DESCRIPTION
embeddings

The output embedding (LM head) module of the underlying language model, obtained by calling its 'get_output_embeddings()' method.

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def get_output_embeddings(self):
    """
    This method returns the output embeddings from the language model for the VipLlavaForConditionalGeneration class.

    Args:
        self: The instance of the VipLlavaForConditionalGeneration class.

    Returns:
        embeddings: The output embedding (LM head) module of the underlying language model.

    Raises:
        None.
    """
    return self.language_model.get_output_embeddings()

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration.prepare_inputs_for_generation(input_ids, past_key_values=None, inputs_embeds=None, pixel_values=None, attention_mask=None, **kwargs)

Method to prepare inputs for generation in the VipLlavaForConditionalGeneration class.

PARAMETER DESCRIPTION
self

The instance of the class.

input_ids

The input tensor containing token ids of the input sequence.

TYPE: Tensor

past_key_values

The cache of key values from previous computations or tuple representing past and cache length.

TYPE: Cache or tuple DEFAULT: None

inputs_embeds

An optional tensor containing embeddings for input tokens.

TYPE: Tensor DEFAULT: None

pixel_values

An optional tensor containing pixel values for image inputs.

TYPE: Tensor DEFAULT: None

attention_mask

An optional tensor indicating the attention mask for the input sequence.

TYPE: Tensor DEFAULT: None

RETURNS DESCRIPTION
dict

A dictionary containing model inputs for generation, including inputs_embeds, input_ids, position_ids, past_key_values, use_cache, attention_mask, and pixel_values.

RAISES DESCRIPTION
TypeError

If past_key_values is not of type Cache or tuple.

IndexError

If the attention_mask shape is incompatible with input_ids.

ValueError

If the pixel_values tensor is missing.

RuntimeError

If there is an issue with calculating position_ids based on attention_mask.

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def prepare_inputs_for_generation(
    self, input_ids, past_key_values=None, inputs_embeds=None, pixel_values=None, attention_mask=None, **kwargs
):
    """
    Method to prepare inputs for generation in the VipLlavaForConditionalGeneration class.

    Args:
        self: The instance of the class.
        input_ids (Tensor): The input tensor containing token ids of the input sequence.
        past_key_values (Cache or tuple): The cache of key values from previous computations or tuple
            representing past and cache length.
        inputs_embeds (Tensor): An optional tensor containing embeddings for input tokens.
        pixel_values (Tensor): An optional tensor containing pixel values for image inputs.
        attention_mask (Tensor): An optional tensor indicating the attention mask for the input sequence.

    Returns:
        dict: A dictionary containing model inputs for generation, including inputs_embeds, input_ids, position_ids,
            past_key_values, use_cache, attention_mask, and pixel_values.

    Raises:
        TypeError: If past_key_values is not of type Cache or tuple.
        IndexError: If the attention_mask shape is incompatible with input_ids.
        ValueError: If the pixel_values tensor is missing.
        RuntimeError: If there is an issue with calculating position_ids based on attention_mask.
    """
    if past_key_values is not None:
        if isinstance(past_key_values, Cache):
            cache_length = past_key_values.get_seq_length()
            past_length = past_key_values.seen_tokens
        else:
            cache_length = past_length = past_key_values[0][0].shape[2]

        # Keep only the unprocessed tokens:
        # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
        # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
        # input)
        if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
            input_ids = input_ids[:, -(attention_mask.shape[1] - past_length):]
        # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
        # input_ids based on the past_length.
        elif past_length < input_ids.shape[1]:
            input_ids = input_ids[:, past_length:]
        # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
        elif self.config.image_token_index in input_ids:
            input_ids = input_ids[:, input_ids.shape[1] - 1:]
        # If the cache has seen more tokens than it can hold, then the cache has a size limit. Let's discard the
        # older attention values, as their corresponding values are not part of the input.
        if cache_length < past_length and attention_mask is not None:
            attention_mask = attention_mask[:, -(cache_length + input_ids.shape[1]):]

    position_ids = kwargs.get("position_ids", None)
    if attention_mask is not None and position_ids is None:
        # create position_ids on the fly for batch generation
        position_ids = attention_mask.long().cumsum(-1) - 1
        position_ids = position_ids.masked_fill(attention_mask == 0, 1)
        if past_key_values:
            position_ids = position_ids[:, -input_ids.shape[1]:]

    # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
    if inputs_embeds is not None and past_key_values is None:
        model_inputs = {"inputs_embeds": inputs_embeds}
    else:
        model_inputs = {"input_ids": input_ids}

    model_inputs.update(
        {
            "position_ids": position_ids,
            "past_key_values": past_key_values,
            "use_cache": kwargs.get("use_cache"),
            "attention_mask": attention_mask,
            "pixel_values": pixel_values,
        }
    )
    return model_inputs
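
The `position_ids` fallback above is a cumulative sum over the attention mask minus one, with padded slots set to a dummy value (note that `masked_fill` returns a new tensor, hence the reassignment). A framework-free sketch of the same computation, using a made-up left-padded mask:

```python
import numpy as np

# 1 = real token, 0 = left padding (values are illustrative only).
attention_mask = np.array([[0, 0, 1, 1, 1],
                           [1, 1, 1, 1, 1]])

position_ids = attention_mask.cumsum(axis=-1) - 1
position_ids[attention_mask == 0] = 1   # padded slots get a harmless dummy position

print(position_ids)
# [[1 1 0 1 2]
#  [0 1 2 3 4]]
```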

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration.resize_token_embeddings(new_num_tokens=None, pad_to_multiple_of=None)

Resizes the token embeddings of the VipLlavaForConditionalGeneration model.

PARAMETER DESCRIPTION
self

The instance of the VipLlavaForConditionalGeneration class.

TYPE: VipLlavaForConditionalGeneration

new_num_tokens

The new number of tokens for resizing the embeddings. Defaults to None.

TYPE: Optional[int] DEFAULT: None

pad_to_multiple_of

The value to pad the embeddings to a multiple of. Defaults to None.

DEFAULT: None

RETURNS DESCRIPTION
Embedding

nn.Embedding: The resized token embeddings as an instance of nn.Embedding.

This method delegates to the underlying language model's 'resize_token_embeddings', updates 'config.text_config.vocab_size' and the model's 'vocab_size' attribute to reflect the new table size, and returns the resized token embeddings as an instance of nn.Embedding.

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
    """
    Resizes the token embeddings of the VipLlavaForConditionalGeneration model.

    Args:
        self (VipLlavaForConditionalGeneration): The instance of the VipLlavaForConditionalGeneration class.
        new_num_tokens (Optional[int]): The new number of tokens for resizing the embeddings. Defaults to None.
        pad_to_multiple_of: The value to pad the embeddings to a multiple of. Defaults to None.

    Returns:
        nn.Embedding: The resized token embeddings as an instance of nn.Embedding.

    Raises:
        None.

    This method delegates to the underlying language model's 'resize_token_embeddings', updates
    'config.text_config.vocab_size' and the model's 'vocab_size' attribute to reflect the new table size,
    and returns the resized token embeddings as an instance of nn.Embedding.
    """
    model_embeds = self.language_model.resize_token_embeddings(
        new_num_tokens, pad_to_multiple_of)
    # update vocab size
    self.config.text_config.vocab_size = model_embeds.vocab_size
    self.vocab_size = model_embeds.vocab_size
    return model_embeds
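
A minimal usage sketch, assuming an illustrative checkpoint name and token counts; the point is that the resized embedding module is returned and `config.text_config.vocab_size` is kept in sync with it:

```python
>>> from mindnlp.transformers.models.vipllava.modeling_vipllava import VipLlavaForConditionalGeneration
...
>>> model = VipLlavaForConditionalGeneration.from_pretrained("llava-hf/vip-llava-7b-hf")
>>> new_embeddings = model.resize_token_embeddings(new_num_tokens=32064, pad_to_multiple_of=64)
>>> model.config.text_config.vocab_size == new_embeddings.vocab_size
True
```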

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration.set_decoder(decoder)

Sets the decoder for the VipLlavaForConditionalGeneration instance.

PARAMETER DESCRIPTION
self

The VipLlavaForConditionalGeneration instance.

TYPE: VipLlavaForConditionalGeneration

decoder

The decoder to be set for the language model. It should be an instance of the decoder class.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def set_decoder(self, decoder):
    """
    Sets the decoder for the VipLlavaForConditionalGeneration instance.

    Args:
        self (VipLlavaForConditionalGeneration): The VipLlavaForConditionalGeneration instance.
        decoder: The decoder to be set for the language model. It should be an instance of the decoder class.

    Returns:
        None.

    Raises:
        This method does not raise any exceptions.
    """
    self.language_model.set_decoder(decoder)

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration.set_input_embeddings(value)

Sets the input embeddings for the VipLlavaForConditionalGeneration language model.

PARAMETER DESCRIPTION
self

The instance of the VipLlavaForConditionalGeneration class.

TYPE: VipLlavaForConditionalGeneration

value

The input embeddings to be set for the language model. It should be an embedding module whose weight has shape (vocab_size, embedding_dim).

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def set_input_embeddings(self, value):
    """
    Sets the input embeddings for the VipLlavaForConditionalGeneration language model.

    Args:
        self (VipLlavaForConditionalGeneration): The instance of the VipLlavaForConditionalGeneration class.
        value: The input embeddings to be set for the language model. It should be an embedding module
            whose weight has shape (vocab_size, embedding_dim).

    Returns:
        None.

    Raises:
        None.
    """
    self.language_model.set_input_embeddings(value)

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration.set_output_embeddings(new_embeddings)

Sets the output embeddings for the VipLlavaForConditionalGeneration model.

PARAMETER DESCRIPTION
self

The instance of the VipLlavaForConditionalGeneration class.

TYPE: VipLlavaForConditionalGeneration

new_embeddings

The new embeddings to be set for the model's output. It should be compatible with the model's requirements.

TYPE: object

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the new_embeddings parameter is not of the correct type.

ValueError

If the new_embeddings parameter does not meet the model's requirements.

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def set_output_embeddings(self, new_embeddings):
    """
    Sets the output embeddings for the VipLlavaForConditionalGeneration model.

    Args:
        self (VipLlavaForConditionalGeneration): The instance of the VipLlavaForConditionalGeneration class.
        new_embeddings (object): The new embeddings to be set for the model's output.
            It should be compatible with the model's requirements.

    Returns:
        None.

    Raises:
        TypeError: If the new_embeddings parameter is not of the correct type.
        ValueError: If the new_embeddings parameter does not meet the model's requirements.
    """
    self.language_model.set_output_embeddings(new_embeddings)

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaForConditionalGeneration.tie_weights()

Method to tie weights for the VipLlavaForConditionalGeneration class.

PARAMETER DESCRIPTION
self

VipLlavaForConditionalGeneration object. Represents an instance of the VipLlavaForConditionalGeneration class.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def tie_weights(self):
    """
    Method to tie weights for the VipLlavaForConditionalGeneration class.

    Args:
        self: VipLlavaForConditionalGeneration object.
            Represents an instance of the VipLlavaForConditionalGeneration class.

    Returns:
        None.

    Raises:
        None.
    """
    return self.language_model.tie_weights()

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaMultiModalProjector

Bases: Module

Represents the multi-modal projector of the VipLlava model, used to project the concatenated vision hidden states into the text embedding space of the language model.

This class inherits from nn.Module and provides methods to initialize the projector and to apply the projection in forward.

ATTRIBUTE DESCRIPTION
projector_layernorm

Layer normalization for the projector.

TYPE: LayerNorm

linear_1

First linear transformation for the projector.

TYPE: Linear

act

Activation function applied after the first linear transformation.

TYPE: function

linear_2

Second linear transformation for the projector.

TYPE: Linear

METHOD DESCRIPTION
__init__

Initializes the multi-modal projector with the provided configuration.

forward

Applies layer normalization, two linear transformations, and the configured activation to project the vision features into the text embedding space.

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
class VipLlavaMultiModalProjector(nn.Module):

    """
    Represents the multi-modal projector of the VipLlava model, used to project the concatenated vision hidden
    states into the text embedding space of the language model.

    This class inherits from nn.Module and provides methods to initialize the projector and to apply the
    projection in `forward`.

    Attributes:
        projector_layernorm (nn.LayerNorm): Layer normalization for the projector.
        linear_1 (nn.Linear): First linear transformation for the projector.
        act (function): Activation function applied after the first linear transformation.
        linear_2 (nn.Linear): Second linear transformation for the projector.

    Methods:
        __init__: Initializes the multi-modal projector with the provided configuration.
        forward: Applies layer normalization, two linear transformations, and the configured activation to
            project the vision features into the text embedding space.

    """
    def __init__(self, config: VipLlavaConfig):
        """
        Initializes an instance of the VipLlavaMultiModalProjector class.

        Args:
            self: The instance of the class.
            config (VipLlavaConfig): The configuration object that contains the parameters for the projector.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()
        self.projector_layernorm = nn.LayerNorm(
            len(config.vision_feature_layers) * config.vision_config.hidden_size, eps=config.projector_layernorm_eps
        )

        self.linear_1 = nn.Linear(
            len(config.vision_feature_layers) *
            config.vision_config.hidden_size,
            config.text_config.hidden_size,
            bias=True,
        )
        self.act = ACT2FN[config.projector_hidden_act]
        self.linear_2 = nn.Linear(
            config.text_config.hidden_size, config.text_config.hidden_size, bias=True)

    def forward(self, hidden_states):
        """
        Applies the multi-modal projection to the selected vision features.

        Args:
            self (VipLlavaMultiModalProjector): An instance of the VipLlavaMultiModalProjector class.
            hidden_states (Tensor): The concatenated vision hidden states to be projected, of shape
                (batch_size, num_patches, len(vision_feature_layers) * vision_hidden_size).

        Returns:
            Tensor: The projected hidden states of shape (batch_size, num_patches, text_hidden_size).

        Raises:
            None
        """
        hidden_states = self.projector_layernorm(hidden_states)
        hidden_states = self.linear_1(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states = self.linear_2(hidden_states)
        return hidden_states
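
A shape-level sketch of what the projector consumes and produces, assuming the default configuration (CLIP hidden size 1024, 576 image patches, five selected vision layers, Llama hidden size 4096); the random input only illustrates the expected layout:

```python
import numpy as np
import mindspore as ms
from mindspore import Tensor
from mindnlp.transformers.models.vipllava.configuration_vipllava import VipLlavaConfig
from mindnlp.transformers.models.vipllava.modeling_vipllava import VipLlavaMultiModalProjector

config = VipLlavaConfig()                        # default vision/text sub-configs
projector = VipLlavaMultiModalProjector(config)

# (batch, num_patches, len(vision_feature_layers) * vision_hidden_size)
features = Tensor(np.random.randn(1, 576, 5 * 1024), ms.float32)
projected = projector(features)
print(projected.shape)                           # (1, 576, text_hidden_size), e.g. (1, 576, 4096) with Llama defaults
```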

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaMultiModalProjector.__init__(config)

Initializes an instance of the VipLlavaMultiModalProjector class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object that contains the parameters for the projector.

TYPE: VipLlavaConfig

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def __init__(self, config: VipLlavaConfig):
    """
    Initializes an instance of the VipLlavaMultiModalProjector class.

    Args:
        self: The instance of the class.
        config (VipLlavaConfig): The configuration object that contains the parameters for the projector.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()
    self.projector_layernorm = nn.LayerNorm(
        len(config.vision_feature_layers) * config.vision_config.hidden_size, eps=config.projector_layernorm_eps
    )

    self.linear_1 = nn.Linear(
        len(config.vision_feature_layers) *
        config.vision_config.hidden_size,
        config.text_config.hidden_size,
        bias=True,
    )
    self.act = ACT2FN[config.projector_hidden_act]
    self.linear_2 = nn.Linear(
        config.text_config.hidden_size, config.text_config.hidden_size, bias=True)

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaMultiModalProjector.forward(hidden_states)

Applies the multi-modal projection to the selected vision features.

PARAMETER DESCRIPTION
self

An instance of the VipLlavaMultiModalProjector class.

TYPE: VipLlavaMultiModalProjector

hidden_states

The concatenated vision hidden states to be projected, of shape (batch_size, num_patches, len(vision_feature_layers) * vision_hidden_size).

TYPE: Tensor

RETURNS DESCRIPTION

Tensor

The projected hidden states of shape (batch_size, num_patches, text_hidden_size).

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
def forward(self, hidden_states):
    """
    Applies the multi-modal projection to the selected vision features.

    Args:
        self (VipLlavaMultiModalProjector): An instance of the VipLlavaMultiModalProjector class.
        hidden_states (Tensor): The concatenated vision hidden states to be projected, of shape
            (batch_size, num_patches, len(vision_feature_layers) * vision_hidden_size).

    Returns:
        Tensor: The projected hidden states of shape (batch_size, num_patches, text_hidden_size).

    Raises:
        None
    """
    hidden_states = self.projector_layernorm(hidden_states)
    hidden_states = self.linear_1(hidden_states)
    hidden_states = self.act(hidden_states)
    hidden_states = self.linear_2(hidden_states)
    return hidden_states

mindnlp.transformers.models.vipllava.modeling_vipllava.VipLlavaPreTrainedModel

Bases: PreTrainedModel

This class represents a pre-trained model for VipLlava. It is a subclass of the PreTrainedModel class.

The VipLlavaPreTrainedModel class provides methods for initializing the weights of the model and checking whether the model supports SDPA (Scaled Dot-Product Attention).

METHOD DESCRIPTION
_init_weights

Initializes the weights of the given module using random normal distribution with a standard deviation determined by the configuration.

  • If the module has a class_embedding attribute, it sets the data of the class_embedding tensor with random values.
  • If the module is an instance of nn.Linear or nn.Conv2d, it sets the data of the weight tensor with random values and initializes the bias tensor with zeros.
  • If the module is an instance of nn.Embedding, it sets the data of the weight tensor with random values and zeroes the embedding vector at padding_idx.

_supports_sdpa

Retrieves the language_model attribute of the class to check whether the model supports SDPA (Scaled Dot-Product Attention) or not.
Note

Please refer to the PreTrainedModel class for additional inherited methods and attributes.

Source code in mindnlp/transformers/models/vipllava/modeling_vipllava.py
class VipLlavaPreTrainedModel(PreTrainedModel):

    """
    This class represents a pre-trained model for VipLlava. It is a subclass of the PreTrainedModel class.

    The VipLlavaPreTrainedModel class provides methods for initializing the weights of the model and checking whether
    the model supports SDPA (Scaled Dot-Product Attention).

    Methods:
        _init_weights:
            Initializes the weights of the given module using random normal distribution with a standard deviation
            determined by the configuration.

            - If the module has a class_embedding attribute, it sets the data of the class_embedding tensor with random
            values.
            - If the module is an instance of nn.Linear or nn.Conv2d, it sets the data of the weight tensor with random
            values and initializes the bias tensor with zeros.
            - If the module is an instance of nn.Embedding, it sets the data of the weight tensor with random values and
            zeroes the embedding vector at padding_idx.
         _supports_sdpa:
            Retrieves the language_model attribute of the class to check whether the model supports SDPA or not.

    Note:
        Please refer to the PreTrainedModel class for additional inherited methods and attributes.
    """
    config_class = VipLlavaConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["VipLlavaVisionAttention"]
    _skip_keys_device_placement = "past_key_values"
    _supports_flash_attn_2 = True

    def _init_weights(self, module):
        """
        This method '_init_weights' initializes the weights of the provided 'module' based on the specified
        standard deviation.

        Args:
            self:
                An instance of the VipLlavaPreTrainedModel class.

                - Purpose: Represents the current instance of the class.
                - Restrictions: None.

            module:
                The module for which weights are to be initialized.

                - Type: Any valid module object.
                - Purpose: Represents the module whose weights are to be initialized.
                - Restrictions: The module should be a valid MindSpore/mindnlp nn.Module.

        Returns:
            None.

        Raises:
            None.
        """
        # important: this ported version of VipLlava isn't meant for training from scratch - only
        # inference and fine-tuning - so the proper init weights code has been removed - the original codebase
        # https://github.com/haotian-liu/LLaVA/tree/main/vipllava should serve for that purpose
        std = (
            self.config.initializer_range
            if hasattr(self.config, "initializer_range")
            else self.config.text_config.initializer_range
        )

        if hasattr(module, "class_embedding"):
            module.class_embedding.set_data(Tensor(np.random.normal(
                0.0, std, module.class_embedding.shape), dtype=module.class_embedding.dtype))

        if isinstance(module, (nn.Linear, nn.Conv2d)):
            module.weight.set_data(Tensor(np.random.normal(
                0.0, std, module.weight.shape), dtype=module.weight.dtype))
            if module.bias is not None:
                module.bias.set_data(ms.common.initializer.initializer(
                    "zeros", module.bias.shape, module.bias.dtype))
        elif isinstance(module, nn.Embedding):
            module.weight.set_data(Tensor(np.random.normal(
                0.0, std, module.weight.shape), dtype=module.weight.dtype))
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx] = ms.common.initializer.initializer(
                    "zeros", module.weight.data[module.padding_idx].shape, module.weight.dtype)

    @property
    def _supports_sdpa(self):
        """
        Retrieve language_model's attribute to check whether the model supports
        SDPA or not.
        """
        return self.language_model._supports_sdpa