
llava_next

mindnlp.transformers.models.llava_next.configuration_llava_next

Llava-NeXT model configuration

mindnlp.transformers.models.llava_next.configuration_llava_next.LlavaNextConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [LlavaNextForConditionalGeneration]. It is used to instantiate a Llava-NeXT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the llava-hf/llava-v1.6-mistral-7b-hf model.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vision_config

The config object or dictionary of the vision backbone.

TYPE: `Union[AutoConfig, dict]`, *optional*, defaults to `CLIPVisionConfig` DEFAULT: None

text_config

The config object or dictionary of the text backbone.

TYPE: `Union[AutoConfig, dict]`, *optional*, defaults to `LlamaConfig` DEFAULT: None

ignore_index

The ignore index for the loss function.

TYPE: `int`, *optional*, defaults to -100 DEFAULT: -100

image_token_index

The image token index to encode the image prompt.

TYPE: `int`, *optional*, defaults to 32000 DEFAULT: 32000

projector_hidden_act

The activation function used by the multimodal projector.

TYPE: `str`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

vision_feature_select_strategy

The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full". If "default", the CLS token is removed from the vision features. If "full", the full vision features are used.

TYPE: `str`, *optional*, defaults to `"default"` DEFAULT: 'default'

vision_feature_layer

The index of the layer to select the vision feature.

TYPE: `int`, *optional*, defaults to -2 DEFAULT: -2

image_grid_pinpoints

A list of possible resolutions to use for processing high resolution images. Each item in the list should be a tuple or list of the form (height, width).

TYPE: `List`, *optional*, defaults to `[[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]` DEFAULT: None

Example
>>> from transformers import LlavaNextForConditionalGeneration, LlavaNextConfig, CLIPVisionConfig, LlamaConfig
...
>>> # Initializing a CLIP-vision config
>>> vision_config = CLIPVisionConfig()
...
>>> # Initializing a Llama config
>>> text_config = LlamaConfig()
...
>>> # Initializing a Llava-Next llava-hf/llava-v1.6-mistral-7b-hf style configuration
>>> configuration = LlavaNextConfig(vision_config, text_config)
...
>>> # Initializing a model from the llava-hf/llava-v1.6-mistral-7b-hf style configuration
>>> model = LlavaNextForConditionalGeneration(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
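
The sub-configurations can also be supplied as plain dictionaries, and image_grid_pinpoints accepts any list of (height, width) resolutions. The following is a minimal sketch of that usage; it assumes the same classes are exposed under mindnlp.transformers, and the dictionary keys shown are illustrative values mirroring CLIPVisionConfig and LlamaConfig fields.
>>> from mindnlp.transformers import LlavaNextConfig
...
>>> # Sub-configs as dictionaries; "model_type" selects the backbone config class
>>> vision_config = {"model_type": "clip_vision_model", "image_size": 336, "patch_size": 14}
>>> text_config = {"model_type": "llama", "vocab_size": 32000}
...
>>> # Custom (height, width) pinpoints for high-resolution image processing
>>> configuration = LlavaNextConfig(
...     vision_config=vision_config,
...     text_config=text_config,
...     image_grid_pinpoints=[[336, 672], [672, 336], [672, 672]],
...     vision_feature_select_strategy="full",
... )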
Source code in mindnlp/transformers/models/llava_next/configuration_llava_next.py
class LlavaNextConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`LlavaNextForConditionalGeneration`].
    It is used to instantiate an Llava-NeXT model according to the specified arguments, defining the model architecture.
    Instantiating a configuration with the defaults will yield a similar configuration to that of the
    [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) model.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vision_config (`Union[AutoConfig, dict]`,  *optional*, defaults to `CLIPVisionConfig`):
            The config object or dictionary of the vision backbone.
        text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `LlamaConfig`):
            The config object or dictionary of the text backbone.
        ignore_index (`int`, *optional*, defaults to -100):
            The ignore index for the loss function.
        image_token_index (`int`, *optional*, defaults to 32000):
            The image token index to encode the image prompt.
        projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
            The activation function used by the multimodal projector.
        vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
            The feature selection strategy used to select the vision feature from the vision backbone.
            Can be one of `"default"` or `"full"`. If `"default"`, the CLS token is removed from the vision features.
            If `"full"`, the full vision features are used.
        vision_feature_layer (`int`, *optional*, defaults to -2):
            The index of the layer to select the vision feature.
        image_grid_pinpoints (`List`, *optional*, defaults to `[[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]`):
            A list of possible resolutions to use for processing high resolution images. Each item in the list should be a tuple or list
            of the form `(height, width)`.

    Example:
        ```python
        >>> from transformers import LlavaNextForConditionalGeneration, LlavaNextConfig, CLIPVisionConfig, LlamaConfig
        ...
        >>> # Initializing a CLIP-vision config
        >>> vision_config = CLIPVisionConfig()
        ...
        >>> # Initializing a Llama config
        >>> text_config = LlamaConfig()
        ...
        >>> # Initializing a Llava-Next llava-hf/llava-v1.6-mistral-7b-hf style configuration
        >>> configuration = LlavaNextConfig(vision_config, text_config)
        ...
        >>> # Initializing a model from the llava-hf/llava-v1.6-mistral-7b-hf style configuration
        >>> model = LlavaNextForConditionalGeneration(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "llava_next"
    is_composition = False

    def __init__(
        self,
        vision_config=None,
        text_config=None,
        ignore_index=-100,
        image_token_index=32000,
        projector_hidden_act="gelu",
        vision_feature_select_strategy="default",
        vision_feature_layer=-2,
        image_grid_pinpoints=None,
        **kwargs,
    ):
        """
        This method initializes an instance of the LlavaNextConfig class with the provided parameters.

        Args:
            self: The instance of the class.
            vision_config (dict, optional): Configuration settings for the vision model.
                If not provided, default settings will be used.
            text_config (dict, optional): Configuration settings for the text model.
                If not provided, default settings will be used.
            ignore_index (int, optional): Index to ignore during computation. Default is -100.
            image_token_index (int, optional): Index for image token. Default is 32000.
            projector_hidden_act (str, optional): Activation function for hidden layers in projector.
                Default is 'gelu'.
            vision_feature_select_strategy (str): Strategy for selecting vision features.
                Should be one of 'default' or 'full'.
            vision_feature_layer (int, optional): Layer to extract features from in the vision model.
            image_grid_pinpoints (list of lists, optional): Coordinates for image grid pinpoints.
                Default is [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]].

        Returns:
            None

        Raises:
            ValueError: If vision_feature_select_strategy is not 'default' or 'full'.
        """
        self.ignore_index = ignore_index
        self.image_token_index = image_token_index
        self.projector_hidden_act = projector_hidden_act

        if vision_feature_select_strategy not in ["default", "full"]:
            raise ValueError(
                "vision_feature_select_strategy should be one of 'default', 'full'."
                f"Got: {vision_feature_select_strategy}"
            )

        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.vision_feature_layer = vision_feature_layer
        image_grid_pinpoints = (
            image_grid_pinpoints
            if image_grid_pinpoints is not None
            else [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]
        )
        self.image_grid_pinpoints = image_grid_pinpoints

        if isinstance(vision_config, dict):
            vision_config["model_type"] = (
                vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
            )
            vision_config = CONFIG_MAPPING[vision_config["model_type"]](
                **vision_config)
        elif vision_config is None:
            vision_config = CONFIG_MAPPING["clip_vision_model"](
                intermediate_size=4096,
                hidden_size=1024,
                patch_size=14,
                image_size=336,
                num_hidden_layers=24,
                num_attention_heads=16,
                vocab_size=32000,
                projection_dim=768,
            )

        self.vision_config = vision_config

        if isinstance(text_config, dict):
            text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
            text_config = CONFIG_MAPPING[text_config["model_type"]](
                **text_config)
        elif text_config is None:
            text_config = CONFIG_MAPPING["llama"]()

        self.text_config = text_config

        super().__init__(**kwargs)

mindnlp.transformers.models.llava_next.configuration_llava_next.LlavaNextConfig.__init__(vision_config=None, text_config=None, ignore_index=-100, image_token_index=32000, projector_hidden_act='gelu', vision_feature_select_strategy='default', vision_feature_layer=-2, image_grid_pinpoints=None, **kwargs)

This method initializes an instance of the LlavaNextConfig class with the provided parameters.

PARAMETER DESCRIPTION
self

The instance of the class.

vision_config

Configuration settings for the vision model. If not provided, default settings will be used.

TYPE: dict DEFAULT: None

text_config

Configuration settings for the text model. If not provided, default settings will be used.

TYPE: dict DEFAULT: None

ignore_index

Index to ignore during computation. Default is -100.

TYPE: int DEFAULT: -100

image_token_index

Index for image token. Default is 32000.

TYPE: int DEFAULT: 32000

projector_hidden_act

Activation function for hidden layers in projector. Default is 'gelu'.

TYPE: str DEFAULT: 'gelu'

vision_feature_select_strategy

Strategy for selecting vision features. Should be one of 'default' or 'full'.

TYPE: str DEFAULT: 'default'

vision_feature_layer

Layer to extract features from in the vision model.

TYPE: int DEFAULT: -2

image_grid_pinpoints

Coordinates for image grid pinpoints. Default is [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]].

TYPE: list of lists DEFAULT: None

RETURNS DESCRIPTION

None

RAISES DESCRIPTION
ValueError

If vision_feature_select_strategy is not 'default' or 'full'.
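
For illustration, a minimal sketch of this validation (assuming LlavaNextConfig is importable from mindnlp.transformers): any strategy other than 'default' or 'full' raises the error before the rest of the configuration is built.
>>> from mindnlp.transformers import LlavaNextConfig
>>> try:
...     LlavaNextConfig(vision_feature_select_strategy="cls_only")  # hypothetical invalid value
... except ValueError as err:
...     print(type(err).__name__)
ValueError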

Source code in mindnlp/transformers/models/llava_next/configuration_llava_next.py
def __init__(
    self,
    vision_config=None,
    text_config=None,
    ignore_index=-100,
    image_token_index=32000,
    projector_hidden_act="gelu",
    vision_feature_select_strategy="default",
    vision_feature_layer=-2,
    image_grid_pinpoints=None,
    **kwargs,
):
    """
    This method initializes an instance of the LlavaNextConfig class with the provided parameters.

    Args:
        self: The instance of the class.
        vision_config (dict, optional): Configuration settings for the vision model.
            If not provided, default settings will be used.
        text_config (dict, optional): Configuration settings for the text model.
            If not provided, default settings will be used.
        ignore_index (int, optional): Index to ignore during computation. Default is -100.
        image_token_index (int, optional): Index for image token. Default is 32000.
        projector_hidden_act (str, optional): Activation function for hidden layers in projector.
            Default is 'gelu'.
        vision_feature_select_strategy (str): Strategy for selecting vision features.
            Should be one of 'default' or 'full'.
        vision_feature_layer (int, optional): Layer to extract features from in the vision model.
        image_grid_pinpoints (list of lists, optional): Coordinates for image grid pinpoints.
            Default is [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]].

    Returns:
        None

    Raises:
        ValueError: If vision_feature_select_strategy is not 'default' or 'full'.
    """
    self.ignore_index = ignore_index
    self.image_token_index = image_token_index
    self.projector_hidden_act = projector_hidden_act

    if vision_feature_select_strategy not in ["default", "full"]:
        raise ValueError(
            "vision_feature_select_strategy should be one of 'default', 'full'."
            f"Got: {vision_feature_select_strategy}"
        )

    self.vision_feature_select_strategy = vision_feature_select_strategy
    self.vision_feature_layer = vision_feature_layer
    image_grid_pinpoints = (
        image_grid_pinpoints
        if image_grid_pinpoints is not None
        else [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]
    )
    self.image_grid_pinpoints = image_grid_pinpoints

    if isinstance(vision_config, dict):
        vision_config["model_type"] = (
            vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
        )
        vision_config = CONFIG_MAPPING[vision_config["model_type"]](
            **vision_config)
    elif vision_config is None:
        vision_config = CONFIG_MAPPING["clip_vision_model"](
            intermediate_size=4096,
            hidden_size=1024,
            patch_size=14,
            image_size=336,
            num_hidden_layers=24,
            num_attention_heads=16,
            vocab_size=32000,
            projection_dim=768,
        )

    self.vision_config = vision_config

    if isinstance(text_config, dict):
        text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
        text_config = CONFIG_MAPPING[text_config["model_type"]](
            **text_config)
    elif text_config is None:
        text_config = CONFIG_MAPPING["llama"]()

    self.text_config = text_config

    super().__init__(**kwargs)

mindnlp.transformers.models.llava_next.modeling_llava_next

MindSpore Llava-NeXT model.

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextCausalLMOutputWithPast dataclass

Bases: ModelOutput

Base class for LlavaNext causal language model (or autoregressive) outputs.

PARAMETER DESCRIPTION
loss

Language modeling loss (for next-token prediction).

TYPE: `mindspore.Tensor` of shape `(1,)`, *optional*, returned when `labels` is provided DEFAULT: None

logits

Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length, config.vocab_size)` DEFAULT: None

image_hidden_states

Tuple of mindspore.Tensor (one for the output of the image embeddings) of shape (batch_size, num_images, sequence_length, hidden_size).

Image hidden states of the model, produced by the vision encoder and, optionally, by the perceiver.

TYPE: `tuple(mindspore.Tensor)`, *optional* DEFAULT: None
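
A rough sketch of how these fields are typically consumed after a forward pass (the model and inputs objects below are placeholders, not part of this module):
>>> outputs = model(**inputs)                     # LlavaNextCausalLMOutputWithPast
>>> next_token_logits = outputs.logits[:, -1, :]  # scores for the next token
>>> cache = outputs.past_key_values               # key/value cache reused for fast decoding
>>> if outputs.loss is not None:                  # populated only when `labels` was passed
...     print(outputs.loss)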

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
@dataclass
# Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->LlavaNext
class LlavaNextCausalLMOutputWithPast(ModelOutput):
    """
    Base class for LlavaNext causal language model (or autoregressive) outputs.

    Args:
        loss (`mindspore.Tensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
            Language modeling loss (for next-token prediction).
        logits (`mindspore.Tensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        past_key_values (`tuple(tuple(mindspore.Tensor))`, *optional*, returned when `use_cache=True`
            is passed or when `config.use_cache=True`):
            Tuple of `tuple(mindspore.Tensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
            `(batch_size, num_heads, sequence_length, embed_size_per_head)`)

            Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
            `past_key_values` input) to speed up sequential decoding.
        hidden_states (`tuple(mindspore.Tensor)`, *optional*, returned when `output_hidden_states=True`
            is passed or when `config.output_hidden_states=True`):
            Tuple of `mindspore.Tensor` (one for the output of the embeddings, if the model has an embedding layer, +
            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
        attentions (`tuple(mindspore.Tensor)`, *optional*, returned when `output_attentions=True`
            is passed or when `config.output_attentions=True`):
            Tuple of `mindspore.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
        image_hidden_states (`tuple(mindspore.Tensor)`, *optional*):
            Tuple of `mindspore.Tensor` (one for the output of the image embeddings) of shape `(batch_size, num_images,
            sequence_length, hidden_size)`.

            Image hidden states of the model produced by the vision encoder, and optionally by the perceiver.
    """
    loss: Optional[Tensor] = None
    logits: Tensor = None
    past_key_values: Optional[List[Tensor]] = None
    hidden_states: Optional[Tuple[Tensor]] = None
    attentions: Optional[Tuple[Tensor]] = None
    image_hidden_states: Optional[Tuple[Tensor]] = None

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextForConditionalGeneration

Bases: LlavaNextPreTrainedModel

This class represents a model for conditional text generation with multimodal capabilities. It is designed to generate text based on input text prompts along with associated images. The model utilizes a pre-trained language model for text generation and incorporates image features for enhanced context understanding.

The class provides methods for getting and setting the input embeddings, output embeddings, and decoder, and for tying weights. It also includes functionality for resizing token embeddings and for merging input IDs with image features. In addition, the 'forward' method generates text from input IDs, pixel values, attention masks, and other optional parameters, while the 'prepare_inputs_for_generation' method prepares inputs for generation by handling past key values, input embeddings, pixel values, and attention masks.

This class inherits from LlavaNextPreTrainedModel and is designed to be used for conditional text generation tasks in a multimodal setting.
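
The example under 'forward' below shows the full pipeline with the upstream transformers imports; an equivalent sketch with this package's imports is given here, assuming the processor and checkpoint load through mindnlp.transformers and that return_tensors="ms" is accepted for MindSpore tensors.
>>> from PIL import Image
>>> import requests
>>> from mindnlp.transformers import AutoProcessor, LlavaNextForConditionalGeneration
...
>>> model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
>>> processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
...
>>> prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(text=prompt, images=image, return_tensors="ms")  # "ms" assumed for MindSpore tensors
...
>>> generate_ids = model.generate(**inputs, max_new_tokens=30)
>>> print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])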

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
class LlavaNextForConditionalGeneration(LlavaNextPreTrainedModel):

    """
    This class represents a model for conditional text generation with multimodal capabilities.
    It is designed to generate text based on input text prompts along with associated images. The model utilizes a
    pre-trained language model for text generation and incorporates image features for enhanced context understanding.

    The class provides methods for setting and getting input embeddings, output embeddings, decoder, and for tying weights.
    It also includes functionality for resizing token embeddings and merging input IDs with image features.
    Additionally, the class offers a 'forward' method for generating text based on input IDs, pixel values,
    attention masks, and other optional parameters. The 'prepare_inputs_for_generation' method prepares input data
    for text generation by handling past key values, inputs embeddings, pixel values, and attention masks.

    This class inherits from LlavaNextPreTrainedModel and is designed to be used for conditional text generation tasks
    in a multimodal setting.
    """
    def __init__(self, config: LlavaNextConfig):
        """Initializes an instance of the LlavaNextForConditionalGeneration class.

        Args:
            self: The instance of the class.
            config (LlavaNextConfig): The configuration object that contains the necessary parameters for
                setting up the instance.

        Returns:
            None

        Raises:
            None
        """
        super().__init__(config)
        self.vision_tower = AutoModel.from_config(config.vision_config)

        self.multi_modal_projector = LlavaNextMultiModalProjector(config)

        self.image_newline = Parameter(
            ops.zeros(int(config.text_config.hidden_size)))

        self.vocab_size = config.text_config.vocab_size
        self.language_model = AutoModelForCausalLM.from_config(config.text_config)
        self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
        self.post_init()

    # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_input_embeddings
    def get_input_embeddings(self):
        """
        Returns the input embeddings of the language model used for conditional generation.

        Args:
            self (LlavaNextForConditionalGeneration): The instance of the LlavaNextForConditionalGeneration class.

        Returns:
            The input embeddings of the language model.

        Raises:
            None.
        """
        return self.language_model.get_input_embeddings()

    # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.set_input_embeddings
    def set_input_embeddings(self, value):
        """
        Method to set input embeddings for the LlavaNextForConditionalGeneration class.

        Args:
            self (LlavaNextForConditionalGeneration): The instance of the LlavaNextForConditionalGeneration class.
            value (object): The input embeddings to be set for the language model.

        Returns:
            None.

        Raises:
            None.
        """
        self.language_model.set_input_embeddings(value)

    # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_output_embeddings
    def get_output_embeddings(self):
        """
        Retrieve the output embeddings from the language model for the LlavaNextForConditionalGeneration class.

        Args:
            self: The instance of the LlavaNextForConditionalGeneration class.

        Returns:
            The output embeddings from the language model associated with the LlavaNextForConditionalGeneration instance.

        Raises:
            None.
        """
        return self.language_model.get_output_embeddings()

    # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.set_output_embeddings
    def set_output_embeddings(self, new_embeddings):
        """
        Sets the output embeddings for the LlavaNextForConditionalGeneration class.

        Args:
            self: An instance of the LlavaNextForConditionalGeneration class.
            new_embeddings: The new embeddings to be set for the language model. 
                It should be of type 'nn.Embedding' or a subclass of it.

        Returns:
            None.

        Raises:
            None.
        """
        self.language_model.set_output_embeddings(new_embeddings)

    # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.set_decoder
    def set_decoder(self, decoder):
        """
        Sets the decoder for the LlavaNextForConditionalGeneration language model.

        Args:
            self (LlavaNextForConditionalGeneration): The instance of the LlavaNextForConditionalGeneration class.
            decoder: The decoder to be set for the language model. 
                It should be compatible with the language model for proper functioning.

        Returns:
            None.

        Raises:
            None.
        """
        self.language_model.set_decoder(decoder)

    # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_decoder
    def get_decoder(self):
        """
        Retrieve the decoder from the language model for conditional generation.

        Args:
            self (LlavaNextForConditionalGeneration): The instance of the LlavaNextForConditionalGeneration class.
                This parameter is automatically passed when calling the method.

        Returns:
            The decoder obtained from the language model.

        Raises:
            None.
        """
        return self.language_model.get_decoder()

    # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.tie_weights
    def tie_weights(self):
        """
        Ties the weights of the language model for conditional generation in the LlavaNextForConditionalGeneration class.

        Args:
            self: An instance of the LlavaNextForConditionalGeneration class.

        Returns:
            None.

        Raises:
            None.

        This method is responsible for tying the weights of the language model used for conditional generation in the
        LlavaNextForConditionalGeneration class. Tying the weights refers to sharing the parameters of the language
        model with other parts of the model, such as the encoder or the decoder.
        By tying the weights, the model can learn more efficiently and effectively by reducing the number of parameters
        that need to be learned.

        Note:
            This method internally calls the 'tie_weights' method of the language model to perform the weight
            tying operation.
        """
        return self.language_model.tie_weights()

    # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.resize_token_embeddings
    def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
        """
        Resizes the token embeddings for conditional generation in the LlavaNext model.

        Args:
            self (LlavaNextForConditionalGeneration): The instance of the LlavaNextForConditionalGeneration class.
            new_num_tokens (Optional[int]): The desired number of tokens for the resized embeddings. Defaults to None.
            pad_to_multiple_of: (Optional[int]): The value to which the embedding size should be padded. Defaults to None.

        Returns:
            nn.Embedding: The resized token embeddings of type nn.Embedding.

        Raises:
            None.
        """
        model_embeds = self.language_model.resize_token_embeddings(
            new_num_tokens, pad_to_multiple_of)
        # update vocab size
        self.config.text_config.vocab_size = model_embeds.vocab_size
        self.vocab_size = model_embeds.vocab_size
        return model_embeds

    # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration._merge_input_ids_with_image_features
    def _merge_input_ids_with_image_features(self, image_features, inputs_embeds, input_ids, attention_mask, labels):
        """
        Merges image features with input embeddings, input IDs, attention masks, and labels.

        Args:
            self (LlavaNextForConditionalGeneration): The object instance.
            image_features (Tensor): A tensor containing image features.
            inputs_embeds (Tensor): A tensor containing input embeddings.
            input_ids (Tensor): A tensor containing input IDs.
            attention_mask (Tensor): A tensor containing attention masks.
            labels (Tensor): A tensor containing labels.

        Returns:
            None.

        Raises:
            ValueError: If the number of image tokens provided to the model does not match the number of images given.
        """
        num_images, num_image_patches, embed_dim = image_features.shape
        batch_size, sequence_length = input_ids.shape
        left_padding = not ops.sum(input_ids[:, -1] == mindspore.tensor(self.pad_token_id))
        # 1. Create a mask to know where special image tokens are
        special_image_token_mask = input_ids == self.config.image_token_index
        num_special_image_tokens = ops.sum(special_image_token_mask, dim=-1)
        # Compute the maximum embed dimension
        max_embed_dim = (num_special_image_tokens.max() * (num_image_patches - 1)).item() + sequence_length
        nonzero_result = ops.nonzero(
            input_ids != self.config.image_token_index)
        batch_indices, non_image_indices = ops.tensor_split(nonzero_result, 2, -1)

        # 2. Compute the positions where text should be written
        # Calculate new positions for text tokens in merged image-text sequence.
        # `special_image_token_mask` identifies image tokens. Each image token will be replaced by `nb_text_tokens_per_images - 1` text tokens.
        # `torch.cumsum` computes how each image token shifts subsequent text token positions.
        # - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.
        new_token_positions = ops.cumsum((special_image_token_mask * (num_image_patches - 1) + 1), -1) - 1
        nb_image_pad = max_embed_dim - 1 - new_token_positions[:, -1]
        if left_padding:
            new_token_positions += nb_image_pad[:, None]  # offset for left padding
        text_to_overwrite = new_token_positions[batch_indices, non_image_indices]

        # 3. Create the full embedding, already padded to the maximum position
        final_embedding = ops.zeros(
            (batch_size, max_embed_dim, embed_dim), dtype=inputs_embeds.dtype
        )
        final_attention_mask = ops.zeros(
            (batch_size, max_embed_dim), dtype=attention_mask.dtype
        )
        if labels is not None:
            final_labels = ops.full(
                (batch_size, max_embed_dim), self.config.ignore_index, dtype=input_ids.dtype
            )

        # 4. Fill the embeddings based on the mask. If we have ["hey" "<image>", "how", "are"]
        # we need to index copy on [0, 577, 578, 579] for the text and [1:576] for the image features
        final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]
        final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
        if labels is not None:
            final_labels[batch_indices, text_to_overwrite] = labels[batch_indices, non_image_indices]

        # 5. Fill the embeddings corresponding to the images. Anything that is still zeros needs filling
        image_to_overwrite = ops.all(final_embedding == 0, axis=-1)
        image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None]

        if image_to_overwrite.sum() != reduce(lambda x, y: x * y, image_features.shape[:-1]):
            raise ValueError(
                f"The input provided to the model are wrong. The number of image tokens is {ops.sum(special_image_token_mask)} while"
                f" the number of image given to the model is {num_images}. This prevents correct indexing and breaks batch generation."
            )

        final_embedding[image_to_overwrite] = image_features.reshape(-1, embed_dim)
        final_attention_mask |= image_to_overwrite
        position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill((final_attention_mask == 0), 1)

        # 6. Mask out the embedding at padding positions, as we later use the past_key_value value to determine the non-attended tokens.
        nonzero = ops.nonzero(input_ids == self.pad_token_id)
        batch_indices, pad_indices = ops.tensor_split(nonzero, 2, -1)
        indices_to_mask = new_token_positions[batch_indices, pad_indices]

        if batch_indices.asnumpy() != []:
            final_embedding[batch_indices, indices_to_mask] = 0

        if labels is None:
            final_labels = None

        return final_embedding, final_attention_mask, final_labels, position_ids

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        pixel_values: mindspore.Tensor = None,
        image_sizes: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        vision_feature_layer: Optional[int] = None,
        vision_feature_select_strategy: Optional[str] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, LlavaNextCausalLMOutputWithPast]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

        Returns:
            Union[Tuple, LlavaNextCausalLMOutputWithPast]

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, LlavaNextForConditionalGeneration
            ...
            >>> model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
            >>> processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
            ...
            >>> prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
            >>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = processor(text=prompt, images=image, return_tensors="pt")
            ...
            >>> # Generate
            >>> generate_ids = model.generate(**inputs, max_length=30)
            >>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
            "[INST]  \nWhat is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multi-dimensional plot (...)"
            ```
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        vision_feature_layer = (
            vision_feature_layer if vision_feature_layer is not None else self.config.vision_feature_layer
        )
        vision_feature_select_strategy = (
            vision_feature_select_strategy
            if vision_feature_select_strategy is not None
            else self.config.vision_feature_select_strategy
        )

        if inputs_embeds is None:
            # 1. Extract the input embeddings
            inputs_embeds = self.get_input_embeddings()(input_ids)

            # 2. Merge text and images
            if pixel_values is not None and input_ids.shape[1] != 1:
                batch_size, num_patches, num_channels, height, width = pixel_values.shape
                reshaped_pixel_values = pixel_values.view(batch_size * num_patches, num_channels, height, width)
                image_features = self.vision_tower(reshaped_pixel_values, output_hidden_states=True)

                selected_image_feature = image_features.hidden_states[vision_feature_layer]

                if vision_feature_select_strategy == "default":
                    selected_image_feature = selected_image_feature[:, 1:]

                image_features = self.multi_modal_projector(selected_image_feature)

                # split up image_features for each of the individual images
                # hence we get a list of image_features, each of shape (5, num_patches, hidden_size)
                # if we assume each image has 5 image features (base image + 4 patches)
                split_sizes = [image.shape[0] for image in pixel_values]
                image_features = ops.split(image_features, split_sizes, axis=0)

                # NOTE we only support multimodal_patch_merge_type == "spatial_unpad"
                height = width = self.config.vision_config.image_size // self.config.vision_config.patch_size

                new_image_features = []
                for image_idx, image_feature in enumerate(image_features):
                    if image_feature.shape[0] > 1:
                        base_image_feature = image_feature[0]
                        image_feature = image_feature[1:]

                        if height * width != base_image_feature.shape[0]:
                            raise ValueError("The number of patches is not consistent with the image size.")
                        num_patch_height, num_patch_width = get_anyres_image_grid_shape(
                            image_sizes[image_idx],
                            self.config.image_grid_pinpoints,
                            self.config.vision_config.image_size,
                        )
                        image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
                        image_feature = image_feature.permute(4, 0, 2, 1, 3)
                        image_feature = image_feature.flatten(start_dim=1, end_dim=2).flatten(start_dim=2, end_dim=3)
                        image_feature = unpad_image(image_feature, image_sizes[image_idx])
                        image_feature = ops.cat(
                            (
                                image_feature,
                                self.image_newline[:, None, None].expand(*image_feature.shape[:-1], 1),
                            ),
                            axis=-1,
                        )
                        image_feature = image_feature.flatten(start_dim=1, end_dim=2).swapaxes(0, 1)
                        image_feature = ops.cat((base_image_feature, image_feature), axis=0)
                    else:
                        image_feature = image_feature[0]
                        image_feature = ops.cat((image_feature, self.image_newline[None]), axis=0)
                    new_image_features.append(image_feature)
                image_features = ops.stack(new_image_features, axis=0)

                inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
                    image_features, inputs_embeds, input_ids, attention_mask, labels
                )
                if labels is None:
                    labels = ops.full_like(attention_mask, self.config.ignore_index).to(mindspore.int64)

            # In case input_ids.shape[1] == 1 & pixel_values==None & past_key_values != None, we are in the case of
            # generation with cache
            elif past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
                # Retrieve the first layer to inspect the logits and mask out the hidden states
                # that are set to 0
                first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]

                # Sum all dimensions of head_dim (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
                nonzero = ops.nonzero(first_layer_past_key_value.float().sum(-2) == 0)
                batch_index, non_attended_tokens = ops.tensor_split(nonzero, 2, -1)

                # Get the target length
                target_length = input_ids.shape[1]
                past_length = first_layer_past_key_value.shape[-1]

                extended_attention_mask = ops.ones(
                    (attention_mask.shape[0], past_length),
                    dtype=attention_mask.dtype,
                )

                # Filter out only the tokens that can be un-attended, this can happen
                # if one uses Llava + Fused modules where the cache on the
                # first iteration is already big enough, or if one passes custom cache
                valid_indices = non_attended_tokens < extended_attention_mask.size(-1)
                new_batch_index = batch_index[valid_indices]
                new_non_attended_tokens = non_attended_tokens[valid_indices]

                # Zero-out the places where we don't need to attend
                extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0

                attention_mask = ops.cat((extended_attention_mask, attention_mask[:, -target_length:]), axis=1)
                position_ids = ops.sum(attention_mask, dim=1).unsqueeze(-1) - 1

        outputs = self.language_model(
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        logits = outputs[0]

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            if attention_mask is not None:
                shift_attention_mask = attention_mask[..., 1:]
                shift_logits = logits[..., :-1, :][shift_attention_mask != 0]
                shift_labels = labels[..., 1:][shift_attention_mask != 0]
            else:
                shift_logits = logits[..., :-1, :]
                shift_labels = labels[..., 1:]
            # Flatten the tokens
            loss = ops.cross_entropy(
                shift_logits.view(-1, shift_logits.shape[-1]), shift_labels.view(-1)
            )

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return LlavaNextCausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def prepare_inputs_for_generation(
        self,
        input_ids,
        past_key_values=None,
        inputs_embeds=None,
        pixel_values=None,
        image_sizes=None,
        attention_mask=None,
        **kwargs,
    ):
        """
        Prepare the inputs for text generation.

        Args:
            self (LlavaNextForConditionalGeneration): The instance of the LlavaNextForConditionalGeneration class.
            input_ids (Tensor): The input token IDs tensor for text generation.
            past_key_values (Cache or tuple of Tensors): The cached key values from previous generation steps.
                If Cache object is passed, cache_length is obtained from it, else from the tuple of Tensors.
                Defaults to None.
            inputs_embeds (Tensor): The input embeddings tensor. Defaults to None.
            pixel_values (Tensor): The pixel values tensor for image inputs. Defaults to None.
            image_sizes (Tensor): The sizes of the input images. Defaults to None.
            attention_mask (Tensor): The attention mask tensor to mask certain tokens during generation. Defaults to None.

        Returns:
            model_inputs (dict): A dictionary containing the model inputs for text generation, including 'inputs_embeds',
                'input_ids', 'position_ids', 'past_key_values', 'use_cache', 'attention_mask', 'pixel_values',
                and 'image_sizes'.

        Raises:
            TypeError: If past_key_values is not of type Cache or tuple of Tensors.
            IndexError: If the attention_mask shape is not compatible with input_ids shape.
            ValueError: If there are inconsistencies in handling input token IDs based on cache and attention mask lengths.
            AttributeError: If the image token index is missing in the input_ids.
        """
        if past_key_values is not None:
            if isinstance(past_key_values, Cache):
                cache_length = past_key_values.get_seq_length()
                past_length = past_key_values.seen_tokens
            else:
                cache_length = past_length = past_key_values[0][0].shape[2]

            # Keep only the unprocessed tokens:
            # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
            # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
            # input)
            if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
                input_ids = input_ids[:, -(attention_mask.shape[1] - past_length):]
            # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
            # input_ids based on the past_length.
            elif past_length < input_ids.shape[1]:
                input_ids = input_ids[:, past_length:]
            # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
            elif self.config.image_token_index in input_ids:
                input_ids = input_ids[:, input_ids.shape[1] - 1:]
            # If the cache has seen more tokens than it can hold, then the cache has a size limit. Let's discard the
            # older attention values, as their corresponding values are not part of the input.
            if cache_length < past_length and attention_mask is not None:
                attention_mask = attention_mask[:, -(cache_length + input_ids.shape[1]):]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids = position_ids.masked_fill(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -input_ids.shape[1]:]

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "position_ids": position_ids,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
                "pixel_values": pixel_values,
                "image_sizes": image_sizes,
            }
        )
        return model_inputs

    # Copied from transformers.models.llava.modeling_llava.LlavaForConditionalGeneration._reorder_cache
    def _reorder_cache(self, *args, **kwargs):
        """
        Reorders the cache for the language model.

        Args:
            self:
                The instance of the LlavaNextForConditionalGeneration class.

                - Type: LlavaNextForConditionalGeneration
                - Purpose: Represents the current instance of the class.
                - Restrictions: This parameter is required and should be the first positional argument.

        Returns:
            None.

        Raises:
            None.
        """
        return self.language_model._reorder_cache(*args, **kwargs)

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextForConditionalGeneration.__init__(config)

Initializes an instance of the LlavaNextForConditionalGeneration class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object that contains the necessary parameters for setting up the instance.

TYPE: LlavaNextConfig

RETURNS DESCRIPTION

None

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def __init__(self, config: LlavaNextConfig):
    """Initializes an instance of the LlavaNextForConditionalGeneration class.

    Args:
        self: The instance of the class.
        config (LlavaNextConfig): The configuration object that contains the necessary parameters for
            setting up the instance.

    Returns:
        None

    Raises:
        None
    """
    super().__init__(config)
    self.vision_tower = AutoModel.from_config(config.vision_config)

    self.multi_modal_projector = LlavaNextMultiModalProjector(config)

    self.image_newline = Parameter(
        ops.zeros(int(config.text_config.hidden_size)))

    self.vocab_size = config.text_config.vocab_size
    self.language_model = AutoModelForCausalLM.from_config(config.text_config)
    self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
    self.post_init()

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextForConditionalGeneration.forward(input_ids=None, pixel_values=None, image_sizes=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, vision_feature_layer=None, vision_feature_select_strategy=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, LlavaNextCausalLMOutputWithPast]

Union[Tuple, LlavaNextCausalLMOutputWithPast]

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, LlavaNextForConditionalGeneration
...
>>> model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
>>> processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
...
>>> prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(text=prompt, images=image, return_tensors="pt")
...
>>> # Generate
>>> generate_ids = model.generate(**inputs, max_length=30)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"[INST]  \nWhat is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multi-dimensional plot (...)"
Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    pixel_values: mindspore.Tensor = None,
    image_sizes: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    vision_feature_layer: Optional[int] = None,
    vision_feature_select_strategy: Optional[str] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, LlavaNextCausalLMOutputWithPast]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

    Returns:
        Union[Tuple, LlavaNextCausalLMOutputWithPast]

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, LlavaNextForConditionalGeneration
        ...
        >>> model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
        >>> processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
        ...
        >>> prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
        >>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(text=prompt, images=image, return_tensors="pt")
        ...
        >>> # Generate
        >>> generate_ids = model.generate(**inputs, max_length=30)
        >>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "[INST]  \nWhat is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multi-dimensional plot (...)"
        ```
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    vision_feature_layer = (
        vision_feature_layer if vision_feature_layer is not None else self.config.vision_feature_layer
    )
    vision_feature_select_strategy = (
        vision_feature_select_strategy
        if vision_feature_select_strategy is not None
        else self.config.vision_feature_select_strategy
    )

    if inputs_embeds is None:
        # 1. Extract the input embeddings
        inputs_embeds = self.get_input_embeddings()(input_ids)

        # 2. Merge text and images
        if pixel_values is not None and input_ids.shape[1] != 1:
            batch_size, num_patches, num_channels, height, width = pixel_values.shape
            reshaped_pixel_values = pixel_values.view(batch_size * num_patches, num_channels, height, width)
            image_features = self.vision_tower(reshaped_pixel_values, output_hidden_states=True)

            selected_image_feature = image_features.hidden_states[vision_feature_layer]

            if vision_feature_select_strategy == "default":
                selected_image_feature = selected_image_feature[:, 1:]

            image_features = self.multi_modal_projector(selected_image_feature)

            # split up image_features for each of the individual images
            # hence we get a list of image_features, each of shape (5, num_patches, hidden_size)
            # if we assume each image has 5 image features (base image + 4 patches)
            split_sizes = [image.shape[0] for image in pixel_values]
            image_features = ops.split(image_features, split_sizes, axis=0)

            # NOTE we only support multimodal_patch_merge_type == "spatial_unpad"
            height = width = self.config.vision_config.image_size // self.config.vision_config.patch_size

            new_image_features = []
            for image_idx, image_feature in enumerate(image_features):
                if image_feature.shape[0] > 1:
                    base_image_feature = image_feature[0]
                    image_feature = image_feature[1:]

                    if height * width != base_image_feature.shape[0]:
                        raise ValueError("The number of patches is not consistent with the image size.")
                    num_patch_height, num_patch_width = get_anyres_image_grid_shape(
                        image_sizes[image_idx],
                        self.config.image_grid_pinpoints,
                        self.config.vision_config.image_size,
                    )
                    image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
                    image_feature = image_feature.permute(4, 0, 2, 1, 3)
                    image_feature = image_feature.flatten(start_dim=1, end_dim=2).flatten(start_dim=2, end_dim=3)
                    image_feature = unpad_image(image_feature, image_sizes[image_idx])
                    image_feature = ops.cat(
                        (
                            image_feature,
                            self.image_newline[:, None, None].expand(*image_feature.shape[:-1], 1),
                        ),
                        axis=-1,
                    )
                    image_feature = image_feature.flatten(start_dim=1, end_dim=2).swapaxes(0, 1)
                    image_feature = ops.cat((base_image_feature, image_feature), axis=0)
                else:
                    image_feature = image_feature[0]
                    image_feature = ops.cat((image_feature, self.image_newline[None]), axis=0)
                new_image_features.append(image_feature)
            image_features = ops.stack(new_image_features, axis=0)

            inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
                image_features, inputs_embeds, input_ids, attention_mask, labels
            )
            if labels is None:
                labels = ops.full_like(attention_mask, self.config.ignore_index).to(mindspore.int64)

        # In case input_ids.shape[1] == 1 & pixel_values != None & past_key_values != None, we are in the case of
        # generation with cache
        elif past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
            # Retrieve the first layer to inspect the logits and mask out the hidden states
            # that are set to 0
            first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]

            # Sum all dimensions of head_dim (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
            nonzero = ops.nonzero(first_layer_past_key_value.float().sum(-2) == 0)
            batch_index, non_attended_tokens = ops.tensor_split(nonzero, 2, -1)

            # Get the target length
            target_length = input_ids.shape[1]
            past_length = first_layer_past_key_value.shape[-1]

            extended_attention_mask = ops.ones(
                (attention_mask.shape[0], past_length),
                dtype=attention_mask.dtype,
            )

            # Filter out only the tokens that can be un-attended, this can happen
            # if one uses Llava + Fused modules where the cache on the
            # first iteration is already big enough, or if one passes custom cache
            valid_indices = non_attended_tokens < extended_attention_mask.size(-1)
            new_batch_index = batch_index[valid_indices]
            new_non_attended_tokens = non_attended_tokens[valid_indices]

            # Zero-out the places where we don't need to attend
            extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0

            attention_mask = ops.cat((extended_attention_mask, attention_mask[:, -target_length:]), axis=1)
            position_ids = ops.sum(attention_mask, dim=1).unsqueeze(-1) - 1

    outputs = self.language_model(
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    logits = outputs[0]

    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        if attention_mask is not None:
            shift_attention_mask = attention_mask[..., 1:]
            shift_logits = logits[..., :-1, :][shift_attention_mask != 0]
            shift_labels = labels[..., 1:][shift_attention_mask != 0]
        else:
            shift_logits = logits[..., :-1, :]
            shift_labels = labels[..., 1:]
        # Flatten the tokens
        loss = ops.cross_entropy(
            shift_logits.view(-1, shift_logits.shape[-1]), shift_labels.view(-1)
        )

    if not return_dict:
        output = (logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return LlavaNextCausalLMOutputWithPast(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextForConditionalGeneration.get_decoder()

Retrieve the decoder from the language model for conditional generation.

PARAMETER DESCRIPTION
self

The instance of the LlavaNextForConditionalGeneration class. This parameter is automatically passed when calling the method.

TYPE: LlavaNextForConditionalGeneration

RETURNS DESCRIPTION

The decoder obtained from the language model.

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def get_decoder(self):
    """
    Retrieve the decoder from the language model for conditional generation.

    Args:
        self (LlavaNextForConditionalGeneration): The instance of the LlavaNextForConditionalGeneration class.
            This parameter is automatically passed when calling the method.

    Returns:
        The decoder obtained from the language model.

    Raises:
        None.
    """
    return self.language_model.get_decoder()

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextForConditionalGeneration.get_input_embeddings()

Returns the input embeddings of the language model used for conditional generation.

PARAMETER DESCRIPTION
self

The instance of the LlavaNextForConditionalGeneration class.

TYPE: LlavaNextForConditionalGeneration

RETURNS DESCRIPTION

The input embeddings of the language model.

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def get_input_embeddings(self):
    """
    Returns the input embeddings of the language model used for conditional generation.

    Args:
        self (LlavaNextForConditionalGeneration): The instance of the LlavaNextForConditionalGeneration class.

    Returns:
        The input embeddings of the language model.

    Raises:
        None.
    """
    return self.language_model.get_input_embeddings()

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextForConditionalGeneration.get_output_embeddings()

Retrieve the output embeddings from the language model for the LlavaNextForConditionalGeneration class.

PARAMETER DESCRIPTION
self

The instance of the LlavaNextForConditionalGeneration class.

RETURNS DESCRIPTION

The output embeddings from the language model associated with the LlavaNextForConditionalGeneration instance.

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def get_output_embeddings(self):
    """
    Retrieve the output embeddings from the language model for the LlavaNextForConditionalGeneration class.

    Args:
        self: The instance of the LlavaNextForConditionalGeneration class.

    Returns:
        The output embeddings from the language model associated with the LlavaNextForConditionalGeneration instance.

    Raises:
        None.
    """
    return self.language_model.get_output_embeddings()

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextForConditionalGeneration.prepare_inputs_for_generation(input_ids, past_key_values=None, inputs_embeds=None, pixel_values=None, image_sizes=None, attention_mask=None, **kwargs)

Prepare the inputs for text generation.

PARAMETER DESCRIPTION
self

The instance of the LlavaNextForConditionalGeneration class.

TYPE: LlavaNextForConditionalGeneration

input_ids

The input token IDs tensor for text generation.

TYPE: Tensor

past_key_values

The cached key values from previous generation steps. If Cache object is passed, cache_length is obtained from it, else from the tuple of Tensors. Defaults to None.

TYPE: Cache or tuple of Tensors DEFAULT: None

inputs_embeds

The input embeddings tensor. Defaults to None.

TYPE: Tensor DEFAULT: None

pixel_values

The pixel values tensor for image inputs. Defaults to None.

TYPE: Tensor DEFAULT: None

image_sizes

The sizes of the input images. Defaults to None.

TYPE: Tensor DEFAULT: None

attention_mask

The attention mask tensor to mask certain tokens during generation. Defaults to None.

TYPE: Tensor DEFAULT: None

RETURNS DESCRIPTION
model_inputs

A dictionary containing the model inputs for text generation, including 'inputs_embeds', 'input_ids', 'position_ids', 'past_key_values', 'use_cache', 'attention_mask', 'pixel_values', and 'image_sizes'.

TYPE: dict

RAISES DESCRIPTION
TypeError

If past_key_values is not of type Cache or tuple of Tensors.

IndexError

If the attention_mask shape is not compatible with input_ids shape.

ValueError

If there are inconsistencies in handling input token IDs based on cache and attention mask lengths.

AttributeError

If the image token index is missing in the input_ids.
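
A minimal sketch (an assumption, not from the source) of what this method returns on the first generation step, before any cache exists, reusing the `inputs` produced by the processor in the forward() example above; the presence of `image_sizes` in the processor output is assumed.

```python
>>> model_inputs = model.prepare_inputs_for_generation(
...     inputs["input_ids"],
...     past_key_values=None,
...     attention_mask=inputs["attention_mask"],
...     pixel_values=inputs["pixel_values"],
...     image_sizes=inputs.get("image_sizes"),
... )
>>> sorted(model_inputs.keys())
['attention_mask', 'image_sizes', 'input_ids', 'past_key_values', 'pixel_values', 'position_ids', 'use_cache']
```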

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def prepare_inputs_for_generation(
    self,
    input_ids,
    past_key_values=None,
    inputs_embeds=None,
    pixel_values=None,
    image_sizes=None,
    attention_mask=None,
    **kwargs,
):
    """
    Prepare the inputs for text generation.

    Args:
        self (LlavaNextForConditionalGeneration): The instance of the LlavaNextForConditionalGeneration class.
        input_ids (Tensor): The input token IDs tensor for text generation.
        past_key_values (Cache or tuple of Tensors): The cached key values from previous generation steps.
            If Cache object is passed, cache_length is obtained from it, else from the tuple of Tensors.
            Defaults to None.
        inputs_embeds (Tensor): The input embeddings tensor. Defaults to None.
        pixel_values (Tensor): The pixel values tensor for image inputs. Defaults to None.
        image_sizes (Tensor): The sizes of the input images. Defaults to None.
        attention_mask (Tensor): The attention mask tensor to mask certain tokens during generation. Defaults to None.

    Returns:
        model_inputs (dict): A dictionary containing the model inputs for text generation, including 'inputs_embeds',
            'input_ids', 'position_ids', 'past_key_values', 'use_cache', 'attention_mask', 'pixel_values',
            and 'image_sizes'.

    Raises:
        TypeError: If past_key_values is not of type Cache or tuple of Tensors.
        IndexError: If the attention_mask shape is not compatible with input_ids shape.
        ValueError: If there are inconsistencies in handling input token IDs based on cache and attention mask lengths.
        AttributeError: If the image token index is missing in the input_ids.
    """
    if past_key_values is not None:
        if isinstance(past_key_values, Cache):
            cache_length = past_key_values.get_seq_length()
            past_length = past_key_values.seen_tokens
        else:
            cache_length = past_length = past_key_values[0][0].shape[2]

        # Keep only the unprocessed tokens:
        # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
        # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
        # input)
        if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
            input_ids = input_ids[:, -
                                  (attention_mask.shape[1] - past_length):]
        # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
        # input_ids based on the past_length.
        elif past_length < input_ids.shape[1]:
            input_ids = input_ids[:, past_length:]
        # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
        elif self.config.image_token_index in input_ids:
            input_ids = input_ids[:, input_ids.shape[1] - 1:]
        # If the cache has seen more tokens than it can hold, then the cache has a size limit. Let's discard the
        # older attention values, as their corresponding values are not part of the input.
        if cache_length < past_length and attention_mask is not None:
            attention_mask = attention_mask[:, -
                                            (cache_length + input_ids.shape[1]):]

    position_ids = kwargs.get("position_ids", None)
    if attention_mask is not None and position_ids is None:
        # create position_ids on the fly for batch generation
        position_ids = attention_mask.long().cumsum(-1) - 1
        position_ids = position_ids.masked_fill(attention_mask == 0, 1)
        if past_key_values:
            position_ids = position_ids[:, -input_ids.shape[1]:]

    # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
    if inputs_embeds is not None and past_key_values is None:
        model_inputs = {"inputs_embeds": inputs_embeds}
    else:
        model_inputs = {"input_ids": input_ids}

    model_inputs.update(
        {
            "position_ids": position_ids,
            "past_key_values": past_key_values,
            "use_cache": kwargs.get("use_cache"),
            "attention_mask": attention_mask,
            "pixel_values": pixel_values,
            "image_sizes": image_sizes,
        }
    )
    return model_inputs

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextForConditionalGeneration.resize_token_embeddings(new_num_tokens=None, pad_to_multiple_of=None)

Resizes the token embeddings for conditional generation in the LlavaNext model.

PARAMETER DESCRIPTION
self

The instance of the LlavaNextForConditionalGeneration class.

TYPE: LlavaNextForConditionalGeneration

new_num_tokens

The desired number of tokens for the resized embeddings. Defaults to None.

TYPE: Optional[int] DEFAULT: None

pad_to_multiple_of

The value to which the embedding size should be padded. Defaults to None.

TYPE: `Optional[int]` DEFAULT: None

RETURNS DESCRIPTION
Embedding

nn.Embedding: The resized token embeddings of type nn.Embedding.
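
A short usage sketch under stated assumptions: the tokenizer comes from the processor used elsewhere on this page, and `<extra_token>` is a hypothetical special token added only for illustration.

```python
>>> processor.tokenizer.add_tokens(["<extra_token>"], special_tokens=True)
1
>>> new_embeddings = model.resize_token_embeddings(len(processor.tokenizer))
>>> model.vocab_size == model.config.text_config.vocab_size == len(processor.tokenizer)
True
```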

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
    """
    Resizes the token embeddings for conditional generation in the LlavaNext model.

    Args:
        self (LlavaNextForConditionalGeneration): The instance of the LlavaNextForConditionalGeneration class.
        new_num_tokens (Optional[int]): The desired number of tokens for the resized embeddings. Defaults to None.
        pad_to_multiple_of (Optional[int]): The value to which the embedding size should be padded. Defaults to None.

    Returns:
        nn.Embedding: The resized token embeddings of type nn.Embedding.

    Raises:
        None.
    """
    model_embeds = self.language_model.resize_token_embeddings(
        new_num_tokens, pad_to_multiple_of)
    # update vocab size
    self.config.text_config.vocab_size = model_embeds.vocab_size
    self.vocab_size = model_embeds.vocab_size
    return model_embeds

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextForConditionalGeneration.set_decoder(decoder)

Sets the decoder for the LlavaNextForConditionalGeneration language model.

PARAMETER DESCRIPTION
self

The instance of the LlavaNextForConditionalGeneration class.

TYPE: LlavaNextForConditionalGeneration

decoder

The decoder to be set for the language model. It should be compatible with the language model for proper functioning.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def set_decoder(self, decoder):
    """
    Sets the decoder for the LlavaNextForConditionalGeneration language model.

    Args:
        self (LlavaNextForConditionalGeneration): The instance of the LlavaNextForConditionalGeneration class.
        decoder: The decoder to be set for the language model. 
            It should be compatible with the language model for proper functioning.

    Returns:
        None.

    Raises:
        None.
    """
    self.language_model.set_decoder(decoder)

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextForConditionalGeneration.set_input_embeddings(value)

Method to set input embeddings for the LlavaNextForConditionalGeneration class.

PARAMETER DESCRIPTION
self

The instance of the LlavaNextForConditionalGeneration class.

TYPE: LlavaNextForConditionalGeneration

value

The input embeddings to be set for the language model.

TYPE: object

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def set_input_embeddings(self, value):
    """
    Method to set input embeddings for the LlavaNextForConditionalGeneration class.

    Args:
        self (LlavaNextForConditionalGeneration): The instance of the LlavaNextForConditionalGeneration class.
        value (object): The input embeddings to be set for the language model.

    Returns:
        None.

    Raises:
        None.
    """
    self.language_model.set_input_embeddings(value)

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextForConditionalGeneration.set_output_embeddings(new_embeddings)

Sets the output embeddings for the LlavaNextForConditionalGeneration class.

PARAMETER DESCRIPTION
self

An instance of the LlavaNextForConditionalGeneration class.

new_embeddings

The new embeddings to be set for the language model. It should be an nn.Embedding or a subclass of it.

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def set_output_embeddings(self, new_embeddings):
    """
    Sets the output embeddings for the LlavaNextForConditionalGeneration class.

    Args:
        self: An instance of the LlavaNextForConditionalGeneration class.
        new_embeddings: The new embeddings to be set for the language model. 
            It should be an nn.Embedding or a subclass of it.

    Returns:
        None.

    Raises:
        None.
    """
    self.language_model.set_output_embeddings(new_embeddings)

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextForConditionalGeneration.tie_weights()

Ties the weights of the language model for conditional generation in the LlavaNextForConditionalGeneration class.

PARAMETER DESCRIPTION
self

An instance of the LlavaNextForConditionalGeneration class.

RETURNS DESCRIPTION

None.

This method is responsible for tying the weights of the language model used for conditional generation in the LlavaNextForConditionalGeneration class. Tying the weights refers to sharing the parameters of the language model with other parts of the model, such as the encoder or the decoder. By tying the weights, the model can learn more efficiently and effectively by reducing the number of parameters that need to be learned.

Note

This method internally calls the 'tie_weights' method of the language model to perform the weight tying operation.

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def tie_weights(self):
    """
    Ties the weights of the language model for conditional generation in the LlavaNextForConditionalGeneration class.

    Args:
        self: An instance of the LlavaNextForConditionalGeneration class.

    Returns:
        None.

    Raises:
        None.

    This method is responsible for tying the weights of the language model used for conditional generation in the
    LlavaNextForConditionalGeneration class. Tying the weights refers to sharing the parameters of the language
    model with other parts of the model, such as the encoder or the decoder.
    By tying the weights, the model can learn more efficiently and effectively by reducing the number of parameters
    that need to be learned.

    Note:
        This method internally calls the 'tie_weights' method of the language model to perform the weight
        tying operation.
    """
    return self.language_model.tie_weights()

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextMultiModalProjector

Bases: Module

This class represents a multi-modal projector for the LlavaNext model. It projects image features from the vision backbone into the text model's hidden space.

Inherits from

nn.Module

ATTRIBUTE DESCRIPTION
linear_1

A fully connected layer that maps image features to the hidden size specified in the configuration.

TYPE: Linear

act

An activation function chosen based on the configuration's specified projector hidden activation.

TYPE: function

linear_2

A fully connected layer that maps the hidden states from linear_1 to the hidden size specified in the configuration.

TYPE: Linear

METHOD DESCRIPTION
forward

Projects the given image features into the shared hidden space by applying the linear transformations and activation function.
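
A minimal sketch of the shape contract, assuming a default LlavaNextConfig and a hypothetical batch of 576 patch features: the projector maps features of width `vision_config.hidden_size` to `text_config.hidden_size`, leaving the other dimensions untouched.

```python
>>> import mindspore
>>> from mindspore import ops
>>> from mindnlp.transformers.models.llava_next.configuration_llava_next import LlavaNextConfig
>>> from mindnlp.transformers.models.llava_next.modeling_llava_next import LlavaNextMultiModalProjector
...
>>> config = LlavaNextConfig()  # default CLIP-vision + LLaMA text configs
>>> projector = LlavaNextMultiModalProjector(config)
>>> image_features = ops.ones((1, 576, config.vision_config.hidden_size), mindspore.float32)
>>> projected = projector(image_features)
>>> projected.shape[-1] == config.text_config.hidden_size
True
```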

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
class LlavaNextMultiModalProjector(nn.Module):

    """
    This class represents a multi-modal projector for the LlavaNext model.
    It projects image features from the vision backbone into the text model's hidden space.

    Inherits from:
        nn.Module

    Attributes:
        linear_1 (nn.Linear): A fully connected layer that maps image features to the hidden size specified
            in the configuration.
        act (function): An activation function chosen based on the configuration's specified projector hidden activation.
        linear_2 (nn.Linear): A fully connected layer that maps the hidden states from linear_1 to the hidden size
            specified in the configuration.

    Methods:
        forward(image_features):
            Projects the given image features into the shared hidden space by applying the linear transformations
            and activation function.

    """
    def __init__(self, config: LlavaNextConfig):
        """
        Initializes an instance of the LlavaNextMultiModalProjector class.

        Args:
            self: The instance of the class.
            config (LlavaNextConfig): An object of type LlavaNextConfig containing configuration settings for the projector.
                It is used to set up the linear layers and activation function for the projector.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()

        self.linear_1 = nn.Linear(
            config.vision_config.hidden_size, config.text_config.hidden_size, bias=True)
        self.act = ACT2FN[config.projector_hidden_act]
        self.linear_2 = nn.Linear(
            config.text_config.hidden_size, config.text_config.hidden_size, bias=True)

    def forward(self, image_features):
        """
        Constructs the hidden states for the LlavaNextMultiModalProjector.

        Args:
            self (LlavaNextMultiModalProjector): The instance of the LlavaNextMultiModalProjector class.
            image_features (Tensor): The input image features to be processed.

        Returns:
            Tensor: The projected image features after the two linear layers and the activation function.

        Raises:
            TypeError: If the input image_features is not a Tensor.
            RuntimeError: If an error occurs during the linear transformation or activation function application.
        """
        hidden_states = self.linear_1(image_features)
        hidden_states = self.act(hidden_states)
        hidden_states = self.linear_2(hidden_states)
        return hidden_states

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextMultiModalProjector.__init__(config)

Initializes an instance of the LlavaNextMultiModalProjector class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An object of type LlavaNextConfig containing configuration settings for the projector. It is used to set up the linear layers and activation function for the projector.

TYPE: LlavaNextConfig

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def __init__(self, config: LlavaNextConfig):
    """
    Initializes an instance of the LlavaNextMultiModalProjector class.

    Args:
        self: The instance of the class.
        config (LlavaNextConfig): An object of type LlavaNextConfig containing configuration settings for the projector.
            It is used to set up the linear layers and activation function for the projector.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()

    self.linear_1 = nn.Linear(
        config.vision_config.hidden_size, config.text_config.hidden_size, bias=True)
    self.act = ACT2FN[config.projector_hidden_act]
    self.linear_2 = nn.Linear(
        config.text_config.hidden_size, config.text_config.hidden_size, bias=True)

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextMultiModalProjector.forward(image_features)

Constructs the hidden states for the LlavaNextMultiModalProjector.

PARAMETER DESCRIPTION
self

The instance of the LlavaNextMultiModalProjector class.

TYPE: LlavaNextMultiModalProjector

image_features

The input image features to be processed.

TYPE: Tensor

RETURNS DESCRIPTION
Tensor

The projected image features after the two linear layers and the activation function.

RAISES DESCRIPTION
TypeError

If the input image_features is not a Tensor.

RuntimeError

If an error occurs during the linear transformation or activation function application.

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def forward(self, image_features):
    """
    Constructs the hidden states for the LlavaNextMultiModalProjector.

    Args:
        self (LlavaNextMultiModalProjector): The instance of the LlavaNextMultiModalProjector class.
        image_features (Tensor): The input image features to be processed.

    Returns:
        Tensor: The projected image features after the two linear layers and the activation function.

    Raises:
        TypeError: If the input image_features is not a Tensor.
        RuntimeError: If an error occurs during the linear transformation or activation function application.
    """
    hidden_states = self.linear_1(image_features)
    hidden_states = self.act(hidden_states)
    hidden_states = self.linear_2(hidden_states)
    return hidden_states

mindnlp.transformers.models.llava_next.modeling_llava_next.LlavaNextPreTrainedModel

Bases: PreTrainedModel

Represents a pre-trained model for the LlavaNext model architecture, inheriting from PreTrainedModel.

This class includes methods for initializing weights based on the configuration settings. It initializes weights for different types of cells such as Dense, Conv2d, and Embedding based on the provided standard deviation value. The initialization process handles class embeddings, biases, and padding indices as needed.

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
class LlavaNextPreTrainedModel(PreTrainedModel):

    """
    Represents a pre-trained model for the LlavaNext model architecture, inheriting from PreTrainedModel.

    This class includes methods for initializing weights based on the configuration settings.
    It initializes weights for different types of cells such as Dense, Conv2d, and Embedding based on the provided
    standard deviation value. The initialization process handles class embeddings, biases, and padding indices as needed.
    """
    config_class = LlavaNextConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["LlavaNextVisionAttention"]
    _skip_keys_device_placement = "past_key_values"
    _supports_flash_attn_2 = True

    def _init_weights(self, cell):
        """
        This method initializes the weights of the specified cell based on the configuration parameters.

        Args:
            self (LlavaNextPreTrainedModel): The instance of the LlavaNextPreTrainedModel class.
            cell: The cell for which the weights need to be initialized.
                It can be of type nn.Embedding, nn.Linear, or nn.Conv2d.

        Returns:
            None.

        Raises:
            AttributeError: If the provided cell does not have the required attributes for weight initialization.
            TypeError: If the cell type is not supported for weight initialization.
        """
        # important: this ported version of Llava isn't meant for training from scratch - only
        # inference and fine-tuning - so the proper init weights code has been removed - the original codebase
        # https://github.com/haotian-liu/LLaVA/tree/main/llava should serve for that purpose
        std = (
            self.config.initializer_range
            if hasattr(self.config, "initializer_range")
            else self.config.text_config.initializer_range
        )

        if hasattr(cell, "class_embedding"):
            cell.class_embedding.initialize(Normal(std))

        if isinstance(cell, (nn.Linear, nn.Conv2d)):
            cell.weight.data.initialize(Normal(std))
            if cell.bias is not None:
                cell.bias.initialize('zeros')
        elif isinstance(cell, nn.Embedding):
            weight = np.random.normal(0.0, std, cell.weight.shape)
            if cell.padding_idx:
                weight[cell.padding_idx] = 0

            cell.weight.set_data(Tensor(weight, cell.weight.dtype))

mindnlp.transformers.models.llava_next.modeling_llava_next.get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size)

Calculate the shape of the image patch grid after the preprocessing for images of any resolution.

PARAMETER DESCRIPTION
image_size

The size of the input image in the format (width, height).

TYPE: `tuple`

grid_pinpoints

A list containing possible resolutions. Each item in the list should be a tuple or list of the form (height, width).

TYPE: `List`

patch_size

The size of each image patch.

TYPE: `int`

RETURNS DESCRIPTION
tuple

The shape of the image patch grid in the format (width, height).
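
A worked example using the default pinpoints (import path taken from the source location below): a square 672x672 input selects the 672x672 resolution, so a 336-pixel patch size yields a 2x2 patch grid.

```python
>>> from mindnlp.transformers.models.llava_next.modeling_llava_next import get_anyres_image_grid_shape
...
>>> pinpoints = [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]
>>> get_anyres_image_grid_shape((672, 672), pinpoints, patch_size=336)
(2, 2)
```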

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
    """
    Calculate the shape of the image patch grid after the preprocessing for images of any resolution.

    Args:
        image_size (`tuple`):
            The size of the input image in the format (width, height).
        grid_pinpoints (`List`):
            A list containing possible resolutions. Each item in the list should be a tuple or list
            of the form `(height, width)`.
        patch_size (`int`):
            The size of each image patch.

    Returns:
        tuple: The shape of the image patch grid in the format (width, height).
    """
    if not isinstance(grid_pinpoints, list):
        raise ValueError("grid_pinpoints should be a list of tuples or lists")

    height, width = select_best_resolution(image_size, grid_pinpoints)
    return height // patch_size, width // patch_size

mindnlp.transformers.models.llava_next.modeling_llava_next.unpad_image(tensor, original_size)

Unpads a MindSpore tensor of a padded and resized image.

PARAMETER DESCRIPTION
tensor

The image tensor, assumed to be of shape (num_channels, height, width).

TYPE: `mindspore.Tensor`

original_size

The original size of the image (height, width).

TYPE: `tuple`

RETURNS DESCRIPTION

mindspore.Tensor: The unpadded image tensor.
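
A worked example (a sketch, importing from the source location below): a 12x12 feature map that originally came from a 2:1 landscape image keeps only the six central rows after unpadding.

```python
>>> import mindspore
>>> from mindspore import ops
>>> from mindnlp.transformers.models.llava_next.modeling_llava_next import unpad_image
...
>>> features = ops.ones((1, 12, 12), mindspore.float32)       # (num_channels, height, width)
>>> unpad_image(features, original_size=(100, 200)).shape     # original (height, width)
(1, 6, 12)
```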

Source code in mindnlp/transformers/models/llava_next/modeling_llava_next.py
def unpad_image(tensor, original_size):
    """
    Unpads a MindSpore tensor of a padded and resized image.

    Args:
        tensor (`mindspore.Tensor`):
            The image tensor, assumed to be of shape (num_channels, height, width).
        original_size (`tuple`):
            The original size of the image (height, width).

    Returns:
        `mindspore.Tensor`: The unpadded image tensor.
    """
    original_height, original_width = original_size
    current_height, current_width = tensor.shape[1:]

    original_aspect_ratio = original_width / original_height
    current_aspect_ratio = current_width / current_height

    if original_aspect_ratio > current_aspect_ratio:
        scale_factor = current_width / original_width
        new_height = int(original_height * scale_factor)
        padding = (current_height - new_height) // 2
        unpadded_tensor = tensor[:, padding: current_height - padding, :]
    else:
        scale_factor = current_height / original_height
        new_width = int(original_width * scale_factor)
        padding = (current_width - new_width) // 2
        unpadded_tensor = tensor[:, :, padding: current_width - padding]

    return unpadded_tensor

mindnlp.transformers.models.llava_next.processing_llava_next

Processor class for LLaVa-NeXT.

mindnlp.transformers.models.llava_next.processing_llava_next.LlavaNextProcessor

Bases: ProcessorMixin

Constructs a LLaVa-NeXT processor which wraps a LLaVa-NeXT image processor and a LLaMa tokenizer into a single processor.

[LlavaNextProcessor] offers all the functionalities of [LlavaNextImageProcessor] and [LlamaTokenizerFast]. See the [~LlavaNextProcessor.__call__] and [~LlavaNextProcessor.decode] for more information.

PARAMETER DESCRIPTION
image_processor

The image processor is a required input.

TYPE: [`LlavaNextImageProcessor`], *optional* DEFAULT: None

tokenizer

The tokenizer is a required input.

TYPE: [`LlamaTokenizerFast`], *optional* DEFAULT: None
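
A minimal usage sketch, reusing the checkpoint and image URL from the generation example earlier on this page; the exact keys of the returned BatchFeature depend on the inputs provided, so the commented key list is indicative.

```python
>>> from PIL import Image
>>> import requests
>>> from mindnlp.transformers.models.llava_next.processing_llava_next import LlavaNextProcessor
...
>>> processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
>>> prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
>>> image = Image.open(requests.get("https://www.ilankelman.org/stopsigns/australia.jpg", stream=True).raw)
>>> inputs = processor(text=prompt, images=image, return_tensors="pt")
>>> list(inputs.keys())  # typically input_ids, attention_mask, pixel_values, image_sizes
```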

Source code in mindnlp/transformers/models/llava_next/processing_llava_next.py
class LlavaNextProcessor(ProcessorMixin):
    r"""
    Constructs a LLaVa-NeXT processor which wraps a LLaVa-NeXT image processor and a LLaMa tokenizer into a single processor.

    [`LlavaNextProcessor`] offers all the functionalities of [`LlavaNextImageProcessor`] and [`LlamaTokenizerFast`]. See the
    [`~LlavaNextProcessor.__call__`] and [`~LlavaNextProcessor.decode`] for more information.

    Args:
        image_processor ([`LlavaNextImageProcessor`], *optional*):
            The image processor is a required input.
        tokenizer ([`LlamaTokenizerFast`], *optional*):
            The tokenizer is a required input.
    """
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "LlavaNextImageProcessor"
    tokenizer_class = ("LlamaTokenizer", "LlamaTokenizerFast")

    def __init__(self, image_processor=None, tokenizer=None):
        """
        Initializes a new instance of the LlavaNextProcessor class.

        Args:
            self (LlavaNextProcessor): The instance of the class itself.
            image_processor (optional): An image processing object. Defaults to None.
            tokenizer (optional): A tokenizer object. Defaults to None.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(image_processor, tokenizer)

    def __call__(
        self,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
        images: ImageInput = None,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length=None,
        return_tensors: Optional[Union[str, TensorType]] = None,
    ) -> BatchFeature:
        """
        Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
        and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
        the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
        LlavaNextImageProcessor's [`~LlavaNextImageProcessor.__call__`] if `images` is not `None`. Please refer to the
        docstring of the above two methods for more information.

        Args:
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. Both channels-first and channels-last formats are supported.
            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
                Select a strategy to pad the returned sequences (according to the model's padding side and padding
                index) among:

                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
                sequence if provided).
                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
                acceptable input length for the model if that argument is not provided.
                - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
                lengths).
            max_length (`int`, *optional*):
                Maximum length of the returned list and optionally padding length (see above).
            truncation (`bool`, *optional*):
                Activates truncation to cut input sequences longer than `max_length` to `max_length`.
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.

        Returns:
            [`BatchFeature`]:
                A [`BatchFeature`] with the following fields:

                - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
                - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
                `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
                `None`).
                - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
        """
        if images is not None:
            image_inputs = self.image_processor(
                images, return_tensors=return_tensors)
        else:
            image_inputs = {}
        text_inputs = self.tokenizer(
            text, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length
        )

        return BatchFeature(data={**text_inputs, **image_inputs})

    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode with CLIP->Llama
    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Llama
    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names
    def model_input_names(self):
        """
        Returns a list of model input names used by the LlavaNextProcessor.

        Args:
            self: An instance of the LlavaNextProcessor class.

        Returns:
            list: The combined tokenizer and image processor input names, with duplicates removed.

        Raises:
            None.

        This method retrieves the model input names from the tokenizer and image processor of the LlavaNextProcessor.
        It concatenates the tokenizer input names and image processor input names, and removes any
        duplicate entries using a dictionary conversion. The resulting list of model input names is returned.
        """
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))

mindnlp.transformers.models.llava_next.processing_llava_next.LlavaNextProcessor.model_input_names property

Returns a list of model input names used by the LlavaNextProcessor.

PARAMETER DESCRIPTION
self

An instance of the LlavaNextProcessor class.

RETURNS DESCRIPTION

A list of the combined tokenizer and image processor input names, with duplicates removed.

This method retrieves the model input names from the tokenizer and image processor of the LlavaNextProcessor. It concatenates the tokenizer input names and image processor input names, and removes any duplicate entries using a dictionary conversion. The resulting list of model input names is returned.

mindnlp.transformers.models.llava_next.processing_llava_next.LlavaNextProcessor.__call__(text, images=None, padding=False, truncation=None, max_length=None, return_tensors=None)

Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the text and kwargs arguments to LlamaTokenizerFast's [~LlamaTokenizerFast.__call__] if text is not None to encode the text. To prepare the image(s), this method forwards the images and kwargs arguments to LlavaNextImageProcessor's [~LlavaNextImageProcessor.__call__] if images is not None. Please refer to the docstring of the above two methods for more information.

PARAMETER DESCRIPTION
text

The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

TYPE: `str`, `List[str]`, `List[List[str]]`

images

The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. Both channels-first and channels-last formats are supported.

TYPE: `PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]` DEFAULT: None

padding

Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among:

  • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence if provided).
  • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).

TYPE: `bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False` DEFAULT: False

max_length

Maximum length of the returned list and optionally padding length (see above).

TYPE: `int`, *optional* DEFAULT: None

truncation

Activates truncation to cut input sequences longer than max_length to max_length.

TYPE: `bool`, *optional* DEFAULT: None

return_tensors

If set, will return tensors of a particular framework. Acceptable values are:

  • 'tf': Return TensorFlow tf.constant objects.
  • 'pt': Return PyTorch torch.Tensor objects.
  • 'np': Return NumPy np.ndarray objects.
  • 'jax': Return JAX jnp.ndarray objects.

TYPE: `str` or [`~utils.TensorType`], *optional* DEFAULT: None

RETURNS DESCRIPTION
BatchFeature

[BatchFeature]: A [BatchFeature] with the following fields:

  • input_ids -- List of token ids to be fed to a model. Returned when text is not None.
  • attention_mask -- List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names and if text is not None).
  • pixel_values -- Pixel values to be fed to a model. Returned when images is not None.
Source code in mindnlp/transformers/models/llava_next/processing_llava_next.py
def __call__(
    self,
    text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]],
    images: ImageInput = None,
    padding: Union[bool, str, PaddingStrategy] = False,
    truncation: Union[bool, str, TruncationStrategy] = None,
    max_length=None,
    return_tensors: Optional[Union[str, TensorType]] = None,
) -> BatchFeature:
    """
    Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
    and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
    the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
    LlavaNextImageProcessor's [`~LlavaNextImageProcessor.__call__`] if `images` is not `None`. Please refer to the
    docstring of the above two methods for more information.

    Args:
        text (`str`, `List[str]`, `List[List[str]]`):
            The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
            (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
            `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
        images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
            The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
            tensor. Both channels-first and channels-last formats are supported.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding
            index) among:

            - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
            sequence if provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
            acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
            lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        truncation (`bool`, *optional*):
            Activates truncation to cut input sequences longer than `max_length` to `max_length`.
        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors of a particular framework. Acceptable values are:

            - `'tf'`: Return TensorFlow `tf.constant` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return NumPy `np.ndarray` objects.
            - `'jax'`: Return JAX `jnp.ndarray` objects.

    Returns:
        [`BatchFeature`]:
            A [`BatchFeature`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
            `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
            `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
    """
    if images is not None:
        image_inputs = self.image_processor(
            images, return_tensors=return_tensors)
    else:
        image_inputs = {}
    text_inputs = self.tokenizer(
        text, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length
    )

    return BatchFeature(data={**text_inputs, **image_inputs})

mindnlp.transformers.models.llava_next.processing_llava_next.LlavaNextProcessor.__init__(image_processor=None, tokenizer=None)

Initializes a new instance of the LlavaNextProcessor class.

PARAMETER DESCRIPTION
self

The instance of the class itself.

TYPE: LlavaNextProcessor

image_processor

An image processing object. Defaults to None.

TYPE: optional DEFAULT: None

tokenizer

A tokenizer object. Defaults to None.

TYPE: optional DEFAULT: None

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/llava_next/processing_llava_next.py
def __init__(self, image_processor=None, tokenizer=None):
    """
    Initializes a new instance of the LlavaNextProcessor class.

    Args:
        self (LlavaNextProcessor): The instance of the class itself.
        image_processor (optional): An image processing object. Defaults to None.
        tokenizer (optional): A tokenizer object. Defaults to None.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(image_processor, tokenizer)
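
For illustration, a hedged sketch of composing a processor by hand from an image processor and a tokenizer; the checkpoint name is an assumption, and pretrained checkpoints normally bundle both components, so `LlavaNextProcessor.from_pretrained` is the usual route:

>>> from mindnlp.transformers import AutoTokenizer
>>> from mindnlp.transformers.models.llava_next.image_processing_llava_next import LlavaNextImageProcessor
>>> from mindnlp.transformers.models.llava_next.processing_llava_next import LlavaNextProcessor
...
>>> tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
>>> image_processor = LlavaNextImageProcessor()  # class defaults; real checkpoints ship their own values
>>> processor = LlavaNextProcessor(image_processor=image_processor, tokenizer=tokenizer)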

mindnlp.transformers.models.llava_next.processing_llava_next.LlavaNextProcessor.batch_decode(*args, **kwargs)

This method forwards all its arguments to LlamaTokenizerFast's [~PreTrainedTokenizer.batch_decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/llava_next/processing_llava_next.py
def batch_decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
    refer to the docstring of this method for more information.
    """
    return self.tokenizer.batch_decode(*args, **kwargs)

mindnlp.transformers.models.llava_next.processing_llava_next.LlavaNextProcessor.decode(*args, **kwargs)

This method forwards all its arguments to LlamaTokenizerFast's [~PreTrainedTokenizer.decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/llava_next/processing_llava_next.py
def decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
    the docstring of this method for more information.
    """
    return self.tokenizer.decode(*args, **kwargs)
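
A hedged round-trip sketch of both decoding helpers; the checkpoint name is an assumption, and in practice the ids would come from `LlavaNextForConditionalGeneration.generate`:

>>> from mindnlp.transformers.models.llava_next.processing_llava_next import LlavaNextProcessor
...
>>> processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
>>> token_ids = processor.tokenizer("A red stop sign on a street corner.")["input_ids"]
>>> processor.decode(token_ids, skip_special_tokens=True)          # single sequence -> str
>>> processor.batch_decode([token_ids], skip_special_tokens=True)  # batch of sequences -> List[str]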

mindnlp.transformers.models.llava_next.image_processing_llava_next

Image processor class for LLaVa-NeXT.

mindnlp.transformers.models.llava_next.image_processing_llava_next.LlavaNextImageProcessor

Bases: BaseImageProcessor

Constructs a LLaVa-NeXT image processor. Based on [CLIPImageProcessor] with incorporation of additional techniques for processing high resolution images as explained in the LLaVa paper.

PARAMETER DESCRIPTION
do_resize

Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by do_resize in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

size

Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio. Can be overridden by size in the preprocess method.

TYPE: `Dict[str, int]`, *optional*, defaults to `{"shortest_edge": 224}` DEFAULT: None

image_grid_pinpoints

A list of possible resolutions to use for processing high resolution images. The best resolution is selected based on the original size of the image. Can be overridden by image_grid_pinpoints in the preprocess method.

TYPE: `List` *optional*, defaults to `[[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]` DEFAULT: None

resample

Resampling filter to use if resizing the image. Can be overridden by resample in the preprocess method.

TYPE: `PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC` DEFAULT: BICUBIC

do_center_crop

Whether to center crop the image to the specified crop_size. Can be overridden by do_center_crop in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

crop_size

Size of the output image after applying center_crop. Can be overridden by crop_size in the preprocess method.

TYPE: `Dict[str, int]` *optional*, defaults to 224 DEFAULT: None

do_rescale

Whether to rescale the image by the specified scale rescale_factor. Can be overridden by do_rescale in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

rescale_factor

Scale factor to use if rescaling the image. Can be overridden by rescale_factor in the preprocess method.

TYPE: `int` or `float`, *optional*, defaults to `1/255` DEFAULT: 1 / 255

do_normalize

Whether to normalize the image. Can be overridden by do_normalize in the preprocess method.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

image_mean

Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.

TYPE: `float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]` DEFAULT: None

image_std

Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.

TYPE: `float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]` DEFAULT: None

do_convert_rgb

Whether to convert the image to RGB.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

Source code in mindnlp/transformers/models/llava_next/image_processing_llava_next.py
class LlavaNextImageProcessor(BaseImageProcessor):
    r"""
    Constructs a LLaVa-NeXT image processor. Based on [`CLIPImageProcessor`] with incorporation of additional techniques
    for processing high resolution images as explained in the [LLaVa paper](https://arxiv.org/abs/2310.03744).

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by
            `do_resize` in the `preprocess` method.
        size (`Dict[str, int]` *optional*, defaults to `{"shortest_edge": 224}`):
            Size of the image after resizing. The shortest edge of the image is resized to size["shortest_edge"], with
            the longest edge resized to keep the input aspect ratio. Can be overridden by `size` in the `preprocess`
            method.
        image_grid_pinpoints (`List` *optional*, defaults to `[[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]`):
            A list of possible resolutions to use for processing high resolution images. The best resolution is selected
            based on the original size of the image. Can be overridden by `image_grid_pinpoints` in the `preprocess`
            method.
        resample (`PILImageResampling`, *optional*, defaults to `Resampling.BICUBIC`):
            Resampling filter to use if resizing the image. Can be overridden by `resample` in the `preprocess` method.
        do_center_crop (`bool`, *optional*, defaults to `True`):
            Whether to center crop the image to the specified `crop_size`. Can be overridden by `do_center_crop` in the
            `preprocess` method.
        crop_size (`Dict[str, int]` *optional*, defaults to 224):
            Size of the output image after applying `center_crop`. Can be overridden by `crop_size` in the `preprocess`
            method.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by `do_rescale` in
            the `preprocess` method.
        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
            Scale factor to use if rescaling the image. Can be overridden by `rescale_factor` in the `preprocess`
            method.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the image. Can be overridden by `do_normalize` in the `preprocess` method.
        image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
        image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
        do_convert_rgb (`bool`, *optional*, defaults to `True`):
            Whether to convert the image to RGB.
    """
    model_input_names = ["pixel_values"]

    def __init__(
        self,
        do_resize: bool = True,
        size: Dict[str, int] = None,
        image_grid_pinpoints: List = None,
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        do_center_crop: bool = True,
        crop_size: Dict[str, int] = None,
        do_rescale: bool = True,
        rescale_factor: Union[int, float] = 1 / 255,
        do_normalize: bool = True,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_convert_rgb: bool = True,
        **kwargs,
    ) -> None:
        """
        __init__

        Initializes an instance of the LlavaNextImageProcessor class.

        Args:
            self: The instance of the class.
            do_resize (bool, optional): Flag to indicate whether resizing should be performed. Defaults to True.
            size (Dict[str, int], optional): Dictionary specifying the size of the image. Defaults to None.
            image_grid_pinpoints (List, optional): List of points for image grid pinpoints. Defaults to None.
            resample (PILImageResampling): Resampling method for image resizing. Defaults to PILImageResampling.BICUBIC.
            do_center_crop (bool): Flag to indicate whether center cropping should be performed. Defaults to True.
            crop_size (Dict[str, int], optional): Dictionary specifying the crop size. Defaults to None.
            do_rescale (bool): Flag to indicate whether rescaling should be performed. Defaults to True.
            rescale_factor (Union[int, float]): Factor used for rescaling the image. Defaults to 1/255.
            do_normalize (bool): Flag to indicate whether normalization should be performed. Defaults to True.
            image_mean (Optional[Union[float, List[float]]], optional): Mean value for image normalization.
                Defaults to None or OPENAI_CLIP_MEAN.
            image_std (Optional[Union[float, List[float]]], optional): Standard deviation value for image normalization.
                Defaults to None or OPENAI_CLIP_STD.
            do_convert_rgb (bool): Flag to indicate whether RGB conversion should be performed.

        Returns:
            None:

        Raises:
            ValueError: If invalid parameters are provided or if the rescale_factor is not a valid number.
            TypeError: If the types of input parameters are incorrect.
        """
        super().__init__(**kwargs)
        size = size if size is not None else {"shortest_edge": 224}
        size = get_size_dict(size, default_to_square=False)
        image_grid_pinpoints = (
            image_grid_pinpoints
            if image_grid_pinpoints is not None
            else [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]
        )
        crop_size = crop_size if crop_size is not None else {
            "height": 224, "width": 224}
        crop_size = get_size_dict(
            crop_size, default_to_square=True, param_name="crop_size")

        self.do_resize = do_resize
        self.size = size
        self.image_grid_pinpoints = image_grid_pinpoints
        self.resample = resample
        self.do_center_crop = do_center_crop
        self.crop_size = crop_size
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
        self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
        self.do_convert_rgb = do_convert_rgb

    # Copied from transformers.models.clip.image_processing_clip.CLIPImageProcessor.resize with CLIP->LLaVa
    def resize(
        self,
        image: np.ndarray,
        size: Dict[str, int],
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> np.ndarray:
        """
        Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
        resized to keep the input aspect ratio.

        Args:
            image (`np.ndarray`):
                Image to resize.
            size (`Dict[str, int]`):
                Size of the output image.
            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
                Resampling filter to use when resizing the image.
            data_format (`str` or `ChannelDimension`, *optional*):
                The channel dimension format of the image. If not provided, it will be the same as the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format of the input image. If not provided, it will be inferred.
        """
        default_to_square = True
        if "shortest_edge" in size:
            size = size["shortest_edge"]
            default_to_square = False
        elif "height" in size and "width" in size:
            size = (size["height"], size["width"])
        else:
            raise ValueError(
                "Size must contain either 'shortest_edge' or 'height' and 'width'.")

        output_size = get_resize_output_image_size(
            image,
            size=size,
            default_to_square=default_to_square,
            input_data_format=input_data_format,
        )

        return resize(
            image,
            size=output_size,
            resample=resample,
            data_format=data_format,
            input_data_format=input_data_format,
            **kwargs,
        )

    def _preprocess(
        self,
        images: ImageInput,
        do_resize: bool = None,
        size: Dict[str, int] = None,
        resample: PILImageResampling = None,
        do_center_crop: bool = None,
        crop_size: int = None,
        do_rescale: bool = None,
        rescale_factor: float = None,
        do_normalize: bool = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ) -> Image.Image:
        """
        Preprocess an image or batch of images. Copy of the `preprocess` method from `CLIPImageProcessor`.

        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
                the longest edge resized to keep the input aspect ratio.
            resample (`int`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
                has an effect if `do_resize` is set to `True`.
            do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
                Whether to center crop the image.
            crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
                Size of the center crop. Only has an effect if `do_center_crop` is set to `True`.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
                `True`.
            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - Unset: Use the channel dimension format of the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """
        images = make_list_of_images(images)

        if do_resize:
            images = [
                self.resize(image=image, size=size, resample=resample,
                            input_data_format=input_data_format)
                for image in images
            ]

        if do_center_crop:
            images = [
                self.center_crop(image=image, size=crop_size, input_data_format=input_data_format) for image in images
            ]

        if do_rescale:
            images = [
                self.rescale(image=image, scale=rescale_factor,
                             input_data_format=input_data_format)
                for image in images
            ]

        if do_normalize:
            images = [
                self.normalize(image=image, mean=image_mean,
                               std=image_std, input_data_format=input_data_format)
                for image in images
            ]

        images = [
            to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
        ]

        return images

    def _resize_for_patching(
        self, image: np.array, target_resolution: tuple, resample, input_data_format: ChannelDimension
    ) -> np.array:
        """
        Resizes an image to a target resolution while maintaining aspect ratio.

        Args:
            image (np.array):
                The input image.
            target_resolution (tuple):
                The target resolution (height, width) of the image.
            resample (`PILImageResampling`):
                Resampling filter to use if resizing the image.
            input_data_format (`ChannelDimension` or `str`):
                The channel dimension format of the input image.

        Returns:
            np.array: The resized and padded image.
        """
        new_height, new_width = _get_patch_output_size(
            image, target_resolution, input_data_format)

        # Resize the image
        resized_image = resize(image, (new_height, new_width),
                               resample=resample, input_data_format=input_data_format)

        return resized_image

    def _pad_for_patching(
        self, image: np.array, target_resolution: tuple, input_data_format: ChannelDimension
    ) -> np.array:
        """
        Pad an image to a target resolution while maintaining aspect ratio.
        """
        target_height, target_width = target_resolution
        new_height, new_width = _get_patch_output_size(
            image, target_resolution, input_data_format)

        paste_x = (target_width - new_width) // 2
        paste_y = (target_height - new_height) // 2

        padded_image = pad(image, padding=(
            (paste_y, paste_y), (paste_x, paste_x)))

        return padded_image

    def get_image_patches(
        self,
        image: np.array,
        grid_pinpoints,
        size: tuple,
        patch_size: int,
        resample: PILImageResampling,
        data_format: ChannelDimension,
        input_data_format: ChannelDimension,
    ) -> List[np.array]:
        """
        Process an image with variable resolutions by dividing it into patches.

        Args:
            image (np.array):
                The input image to be processed.
            grid_pinpoints (List):
                A list of possible resolutions.
            size (`tuple`):
                Size to resize the original image to.
            patch_size (`int`):
                Size of the patches to divide the image into.
            resample (`PILImageResampling`):
                Resampling filter to use if resizing the image.
            data_format (`ChannelDimension` or `str`):
                The channel dimension format for the output image.
            input_data_format (`ChannelDimension` or `str`):
                The channel dimension format of the input image.

        Returns:
            List[np.array]: A list of NumPy arrays containing the processed image patches.
        """
        if not isinstance(grid_pinpoints, list):
            raise ValueError(
                "grid_pinpoints must be a list of possible resolutions.")

        possible_resolutions = grid_pinpoints

        image_size = get_image_size(image, channel_dim=input_data_format)
        best_resolution = select_best_resolution(
            image_size, possible_resolutions)
        resized_image = self._resize_for_patching(
            image, best_resolution, resample=resample, input_data_format=input_data_format
        )
        padded_image = self._pad_for_patching(
            resized_image, best_resolution, input_data_format=input_data_format)

        patches = divide_to_patches(
            padded_image, patch_size=patch_size, input_data_format=input_data_format)

        # make sure that all patches are in the input data format
        patches = [
            to_channel_dimension_format(
                patch, channel_dim=data_format, input_channel_dim=input_data_format)
            for patch in patches
        ]

        resized_original_image = resize(
            image,
            size=size,
            resample=resample,
            data_format=data_format,
            input_data_format=input_data_format,
        )

        image_patches = [resized_original_image] + patches

        return image_patches

    def preprocess(
        self,
        images: ImageInput,
        do_resize: bool = None,
        size: Dict[str, int] = None,
        image_grid_pinpoints: List = None,
        resample: PILImageResampling = None,
        do_center_crop: bool = None,
        crop_size: int = None,
        do_rescale: bool = None,
        rescale_factor: float = None,
        do_normalize: bool = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_convert_rgb: bool = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ):
        """
        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
                the longest edge resized to keep the input aspect ratio.
            image_grid_pinpoints (`List` *optional*, defaults to `self.image_grid_pinpoints`):
                A list of possible resolutions to use for processing high resolution images. The best resolution is
                selected based on the original size of the image.
            resample (`int`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
                has an effect if `do_resize` is set to `True`.
            do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
                Whether to center crop the image.
            crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
                Size of the center crop. Only has an effect if `do_center_crop` is set to `True`.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image.
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
                `True`.
            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
                Whether to convert the image to RGB.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:

                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - Unset: Use the channel dimension format of the input image.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """
        do_resize = do_resize if do_resize is not None else self.do_resize
        size = size if size is not None else self.size
        size = get_size_dict(size, param_name="size", default_to_square=False)
        image_grid_pinpoints = image_grid_pinpoints if image_grid_pinpoints is not None else self.image_grid_pinpoints
        resample = resample if resample is not None else self.resample
        do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
        crop_size = crop_size if crop_size is not None else self.crop_size
        crop_size = get_size_dict(
            crop_size, param_name="crop_size", default_to_square=True)
        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
        image_mean = image_mean if image_mean is not None else self.image_mean
        image_std = image_std if image_std is not None else self.image_std
        do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb

        images = make_list_of_images(images)

        if not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )

        validate_preprocess_arguments(
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            do_center_crop=do_center_crop,
            crop_size=crop_size,
            do_resize=do_resize,
            size=size,
            resample=resample,
        )

        if do_convert_rgb:
            images = [convert_to_rgb(image) for image in images]

        # All transformations expect numpy arrays.
        images = [to_numpy_array(image) for image in images]

        if is_scaled_image(images[0]) and do_rescale:
            logger.warning_once(
                "It looks like you are trying to rescale already rescaled images. If the input"
                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
            )

        if input_data_format is None:
            # We assume that all images have the same channel dimension format.
            input_data_format = infer_channel_dimension_format(images[0])

        new_images = []
        image_sizes = [get_image_size(
            image, channel_dim=input_data_format) for image in images]
        for image in images:
            # convert image into a list of patches
            # we intentionally use the same data format as the input data format
            image_patches = self.get_image_patches(
                image,
                image_grid_pinpoints,
                size=(size["shortest_edge"], size["shortest_edge"]),
                patch_size=crop_size["height"],
                resample=resample,
                data_format=input_data_format,
                input_data_format=input_data_format,
            )

            # preprocess patches
            pixel_values = self._preprocess(
                image_patches,
                do_resize=do_resize,
                size=size,
                resample=resample,
                do_center_crop=do_center_crop,
                crop_size=crop_size,
                do_rescale=do_rescale,
                rescale_factor=rescale_factor,
                do_normalize=do_normalize,
                image_mean=image_mean,
                image_std=image_std,
                data_format=data_format,
                input_data_format=input_data_format,
            )
            pixel_values = np.array(pixel_values)
            new_images.append(pixel_values)

        data = {"pixel_values": new_images, "image_sizes": image_sizes}

        return BatchFeature(data=data, tensor_type=return_tensors)
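
Example usage (a hedged sketch: the class defaults above are used instead of a checkpoint configuration, and the random array simply stands in for a real channels-last image):

>>> import numpy as np
>>> from mindnlp.transformers.models.llava_next.image_processing_llava_next import LlavaNextImageProcessor
...
>>> image_processor = LlavaNextImageProcessor()
>>> image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # (height, width, channels)
>>> outputs = image_processor.preprocess(image, return_tensors="np")
>>> outputs["pixel_values"].shape  # (batch, num_patches, channels, crop height, crop width)
>>> outputs["image_sizes"]         # original (height, width) of each input image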

mindnlp.transformers.models.llava_next.image_processing_llava_next.LlavaNextImageProcessor.__init__(do_resize=True, size=None, image_grid_pinpoints=None, resample=PILImageResampling.BICUBIC, do_center_crop=True, crop_size=None, do_rescale=True, rescale_factor=1 / 255, do_normalize=True, image_mean=None, image_std=None, do_convert_rgb=True, **kwargs)

__init__

Initializes an instance of the LlavaNextImageProcessor class.

PARAMETER DESCRIPTION
self

The instance of the class.

do_resize

Flag to indicate whether resizing should be performed. Defaults to True.

TYPE: bool DEFAULT: True

size

Dictionary specifying the size of the image. Defaults to None.

TYPE: Dict[str, int] DEFAULT: None

image_grid_pinpoints

List of points for image grid pinpoints. Defaults to None.

TYPE: List DEFAULT: None

resample

Resampling method for image resizing. Defaults to PILImageResampling.BICUBIC.

TYPE: PILImageResampling DEFAULT: BICUBIC

do_center_crop

Flag to indicate whether center cropping should be performed. Defaults to True.

TYPE: bool DEFAULT: True

crop_size

Dictionary specifying the crop size. Defaults to None.

TYPE: Dict[str, int] DEFAULT: None

do_rescale

Flag to indicate whether rescaling should be performed. Defaults to True.

TYPE: bool DEFAULT: True

rescale_factor

Factor used for rescaling the image. Defaults to 1/255.

TYPE: Union[int, float] DEFAULT: 1 / 255

do_normalize

Flag to indicate whether normalization should be performed. Defaults to True.

TYPE: bool DEFAULT: True

image_mean

Mean value for image normalization. Defaults to None or OPENAI_CLIP_MEAN.

TYPE: Optional[Union[float, List[float]]] DEFAULT: None

image_std

Standard deviation value for image normalization. Defaults to None or OPENAI_CLIP_STD.

TYPE: Optional[Union[float, List[float]]] DEFAULT: None

do_convert_rgb

Flag to indicate whether RGB conversion should be performed.

TYPE: bool DEFAULT: True

RETURNS DESCRIPTION
None

TYPE: None

RAISES DESCRIPTION
ValueError

If invalid parameters are provided or if the rescale_factor is not a valid number.

TypeError

If the types of input parameters are incorrect.

Source code in mindnlp/transformers/models/llava_next/image_processing_llava_next.py
def __init__(
    self,
    do_resize: bool = True,
    size: Dict[str, int] = None,
    image_grid_pinpoints: List = None,
    resample: PILImageResampling = PILImageResampling.BICUBIC,
    do_center_crop: bool = True,
    crop_size: Dict[str, int] = None,
    do_rescale: bool = True,
    rescale_factor: Union[int, float] = 1 / 255,
    do_normalize: bool = True,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    do_convert_rgb: bool = True,
    **kwargs,
) -> None:
    """
    __init__

    Initializes an instance of the LlavaNextImageProcessor class.

    Args:
        self: The instance of the class.
        do_resize (bool, optional): Flag to indicate whether resizing should be performed. Defaults to True.
        size (Dict[str, int], optional): Dictionary specifying the size of the image. Defaults to None.
        image_grid_pinpoints (List, optional): List of points for image grid pinpoints. Defaults to None.
        resample (PILImageResampling): Resampling method for image resizing. Defaults to PILImageResampling.BICUBIC.
        do_center_crop (bool): Flag to indicate whether center cropping should be performed. Defaults to True.
        crop_size (Dict[str, int], optional): Dictionary specifying the crop size. Defaults to None.
        do_rescale (bool): Flag to indicate whether rescaling should be performed. Defaults to True.
        rescale_factor (Union[int, float]): Factor used for rescaling the image. Defaults to 1/255.
        do_normalize (bool): Flag to indicate whether normalization should be performed. Defaults to True.
        image_mean (Optional[Union[float, List[float]]], optional): Mean value for image normalization.
            Defaults to None or OPENAI_CLIP_MEAN.
        image_std (Optional[Union[float, List[float]]], optional): Standard deviation value for image normalization.
            Defaults to None or OPENAI_CLIP_STD.
        do_convert_rgb (bool): Flag to indicate whether RGB conversion should be performed.

    Returns:
        None:

    Raises:
        ValueError: If invalid parameters are provided or if the rescale_factor is not a valid number.
        TypeError: If the types of input parameters are incorrect.
    """
    super().__init__(**kwargs)
    size = size if size is not None else {"shortest_edge": 224}
    size = get_size_dict(size, default_to_square=False)
    image_grid_pinpoints = (
        image_grid_pinpoints
        if image_grid_pinpoints is not None
        else [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]
    )
    crop_size = crop_size if crop_size is not None else {
        "height": 224, "width": 224}
    crop_size = get_size_dict(
        crop_size, default_to_square=True, param_name="crop_size")

    self.do_resize = do_resize
    self.size = size
    self.image_grid_pinpoints = image_grid_pinpoints
    self.resample = resample
    self.do_center_crop = do_center_crop
    self.crop_size = crop_size
    self.do_rescale = do_rescale
    self.rescale_factor = rescale_factor
    self.do_normalize = do_normalize
    self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
    self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
    self.do_convert_rgb = do_convert_rgb

mindnlp.transformers.models.llava_next.image_processing_llava_next.LlavaNextImageProcessor.get_image_patches(image, grid_pinpoints, size, patch_size, resample, data_format, input_data_format)

Process an image with variable resolutions by dividing it into patches.

PARAMETER DESCRIPTION
image

The input image to be processed.

TYPE: array

grid_pinpoints

A list of possible resolutions.

TYPE: List

size

Size to resize the original image to.

TYPE: `tuple`

patch_size

Size of the patches to divide the image into.

TYPE: `int`

resample

Resampling filter to use if resizing the image.

TYPE: `PILImageResampling`

data_format

The channel dimension format for the output image.

TYPE: `ChannelDimension` or `str`

input_data_format

The channel dimension format of the input image.

TYPE: `ChannelDimension` or `str`

RETURNS DESCRIPTION
List[array]

List[np.array]: A list of NumPy arrays containing the processed image patches.

Source code in mindnlp/transformers/models/llava_next/image_processing_llava_next.py
def get_image_patches(
    self,
    image: np.array,
    grid_pinpoints,
    size: tuple,
    patch_size: int,
    resample: PILImageResampling,
    data_format: ChannelDimension,
    input_data_format: ChannelDimension,
) -> List[np.array]:
    """
    Process an image with variable resolutions by dividing it into patches.

    Args:
        image (np.array):
            The input image to be processed.
        grid_pinpoints (List):
            A list of possible resolutions.
        size (`tuple`):
            Size to resize the original image to.
        patch_size (`int`):
            Size of the patches to divide the image into.
        resample (`PILImageResampling`):
            Resampling filter to use if resizing the image.
        data_format (`ChannelDimension` or `str`):
            The channel dimension format for the output image.
        input_data_format (`ChannelDimension` or `str`):
            The channel dimension format of the input image.

    Returns:
        List[np.array]: A list of NumPy arrays containing the processed image patches.
    """
    if not isinstance(grid_pinpoints, list):
        raise ValueError(
            "grid_pinpoints must be a list of possible resolutions.")

    possible_resolutions = grid_pinpoints

    image_size = get_image_size(image, channel_dim=input_data_format)
    best_resolution = select_best_resolution(
        image_size, possible_resolutions)
    resized_image = self._resize_for_patching(
        image, best_resolution, resample=resample, input_data_format=input_data_format
    )
    padded_image = self._pad_for_patching(
        resized_image, best_resolution, input_data_format=input_data_format)

    patches = divide_to_patches(
        padded_image, patch_size=patch_size, input_data_format=input_data_format)

    # make sure that all patches are in the input data format
    patches = [
        to_channel_dimension_format(
            patch, channel_dim=data_format, input_channel_dim=input_data_format)
        for patch in patches
    ]

    resized_original_image = resize(
        image,
        size=size,
        resample=resample,
        data_format=data_format,
        input_data_format=input_data_format,
    )

    image_patches = [resized_original_image] + patches

    return image_patches
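
To make the patch arithmetic concrete, a hedged worked example; the chosen resolution and patch size are assumptions (336 is the crop size used by the released Llava-NeXT checkpoints, whereas the class defaults above use 224):

>>> best_resolution = (672, 672)  # assumed result of select_best_resolution for the input image
>>> patch_size = 336              # assumed checkpoint crop size
>>> (best_resolution[0] // patch_size) * (best_resolution[1] // patch_size)  # 2 x 2 grid of patches
4
>>> # get_image_patches prepends the resized original image, giving 4 + 1 = 5 patches in total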

mindnlp.transformers.models.llava_next.image_processing_llava_next.LlavaNextImageProcessor.preprocess(images, do_resize=None, size=None, image_grid_pinpoints=None, resample=None, do_center_crop=None, crop_size=None, do_rescale=None, rescale_factor=None, do_normalize=None, image_mean=None, image_std=None, do_convert_rgb=None, return_tensors=None, data_format=ChannelDimension.FIRST, input_data_format=None)

PARAMETER DESCRIPTION
images

Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.

TYPE: `ImageInput`

do_resize

Whether to resize the image.

TYPE: `bool`, *optional*, defaults to `self.do_resize` DEFAULT: None

size

Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.size` DEFAULT: None

image_grid_pinpoints

A list of possible resolutions to use for processing high resolution images. The best resolution is selected based on the original size of the image.

TYPE: `List` *optional*, defaults to `self.image_grid_pinpoints` DEFAULT: None

resample

Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.

TYPE: `int`, *optional*, defaults to `self.resample` DEFAULT: None

do_center_crop

Whether to center crop the image.

TYPE: `bool`, *optional*, defaults to `self.do_center_crop` DEFAULT: None

crop_size

Size of the center crop. Only has an effect if do_center_crop is set to True.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.crop_size` DEFAULT: None

do_rescale

Whether to rescale the image.

TYPE: `bool`, *optional*, defaults to `self.do_rescale` DEFAULT: None

rescale_factor

Rescale factor to rescale the image by if do_rescale is set to True.

TYPE: `float`, *optional*, defaults to `self.rescale_factor` DEFAULT: None

do_normalize

Whether to normalize the image.

TYPE: `bool`, *optional*, defaults to `self.do_normalize` DEFAULT: None

image_mean

Image mean to use for normalization. Only has an effect if do_normalize is set to True.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.image_mean` DEFAULT: None

image_std

Image standard deviation to use for normalization. Only has an effect if do_normalize is set to True.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.image_std` DEFAULT: None

do_convert_rgb

Whether to convert the image to RGB.

TYPE: `bool`, *optional*, defaults to `self.do_convert_rgb` DEFAULT: None

return_tensors

The type of tensors to return. Can be one of:

  • Unset: Return a list of np.ndarray.
  • TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
  • TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
  • TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
  • TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.

TYPE: `str` or `TensorType`, *optional* DEFAULT: None

data_format

The channel dimension format for the output image. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • Unset: Use the channel dimension format of the input image.

TYPE: `ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST` DEFAULT: FIRST

input_data_format

The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • "none" or ChannelDimension.NONE: image in (height, width) format.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/llava_next/image_processing_llava_next.py
def preprocess(
    self,
    images: ImageInput,
    do_resize: bool = None,
    size: Dict[str, int] = None,
    image_grid_pinpoints: List = None,
    resample: PILImageResampling = None,
    do_center_crop: bool = None,
    crop_size: int = None,
    do_rescale: bool = None,
    rescale_factor: float = None,
    do_normalize: bool = None,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    do_convert_rgb: bool = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
    """
    Args:
        images (`ImageInput`):
            Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
            passing in images with pixel values between 0 and 1, set `do_rescale=False`.
        do_resize (`bool`, *optional*, defaults to `self.do_resize`):
            Whether to resize the image.
        size (`Dict[str, int]`, *optional*, defaults to `self.size`):
            Size of the image after resizing. Shortest edge of the image is resized to size["shortest_edge"], with
            the longest edge resized to keep the input aspect ratio.
        image_grid_pinpoints (`List` *optional*, defaults to `self.image_grid_pinpoints`):
            A list of possible resolutions to use for processing high resolution images. The best resolution is
            selected based on the original size of the image.
        resample (`int`, *optional*, defaults to `self.resample`):
            Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
            has an effect if `do_resize` is set to `True`.
        do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
            Whether to center crop the image.
        crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
            Size of the center crop. Only has an effect if `do_center_crop` is set to `True`.
        do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
            Whether to rescale the image.
        rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
            Rescale factor to rescale the image by if `do_rescale` is set to `True`.
        do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
            Whether to normalize the image.
        image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
            Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
        image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
            Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
            `True`.
        do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
            Whether to convert the image to RGB.
        return_tensors (`str` or `TensorType`, *optional*):
            The type of tensors to return. Can be one of:

            - Unset: Return a list of `np.ndarray`.
            - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
            - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
            - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
            - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
        data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
            The channel dimension format for the output image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - Unset: Use the channel dimension format of the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image. If unset, the channel dimension format is inferred
            from the input image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
    """
    do_resize = do_resize if do_resize is not None else self.do_resize
    size = size if size is not None else self.size
    size = get_size_dict(size, param_name="size", default_to_square=False)
    image_grid_pinpoints = image_grid_pinpoints if image_grid_pinpoints is not None else self.image_grid_pinpoints
    resample = resample if resample is not None else self.resample
    do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
    crop_size = crop_size if crop_size is not None else self.crop_size
    crop_size = get_size_dict(
        crop_size, param_name="crop_size", default_to_square=True)
    do_rescale = do_rescale if do_rescale is not None else self.do_rescale
    rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
    do_normalize = do_normalize if do_normalize is not None else self.do_normalize
    image_mean = image_mean if image_mean is not None else self.image_mean
    image_std = image_std if image_std is not None else self.image_std
    do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb

    images = make_list_of_images(images)

    if not valid_images(images):
        raise ValueError(
            "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
            "torch.Tensor, tf.Tensor or jax.ndarray."
        )

    validate_preprocess_arguments(
        do_rescale=do_rescale,
        rescale_factor=rescale_factor,
        do_normalize=do_normalize,
        image_mean=image_mean,
        image_std=image_std,
        do_center_crop=do_center_crop,
        crop_size=crop_size,
        do_resize=do_resize,
        size=size,
        resample=resample,
    )

    if do_convert_rgb:
        images = [convert_to_rgb(image) for image in images]

    # All transformations expect numpy arrays.
    images = [to_numpy_array(image) for image in images]

    if is_scaled_image(images[0]) and do_rescale:
        logger.warning_once(
            "It looks like you are trying to rescale already rescaled images. If the input"
            " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
        )

    if input_data_format is None:
        # We assume that all images have the same channel dimension format.
        input_data_format = infer_channel_dimension_format(images[0])

    new_images = []
    image_sizes = [get_image_size(
        image, channel_dim=input_data_format) for image in images]
    for image in images:
        # convert image into a list of patches
        # we intentionally use the same data format as the input data format
        image_patches = self.get_image_patches(
            image,
            image_grid_pinpoints,
            size=(size["shortest_edge"], size["shortest_edge"]),
            patch_size=crop_size["height"],
            resample=resample,
            data_format=input_data_format,
            input_data_format=input_data_format,
        )

        # preprocess patches
        pixel_values = self._preprocess(
            image_patches,
            do_resize=do_resize,
            size=size,
            resample=resample,
            do_center_crop=do_center_crop,
            crop_size=crop_size,
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            data_format=data_format,
            input_data_format=input_data_format,
        )
        pixel_values = np.array(pixel_values)
        new_images.append(pixel_values)

    data = {"pixel_values": new_images, "image_sizes": image_sizes}

    return BatchFeature(data=data, tensor_type=return_tensors)
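
Putting the pieces together, preprocess converts each input image into a stack of resolution-grid patches via get_image_patches, runs the resize/crop/rescale/normalize pipeline on every patch, and returns the stacked pixel values together with the original image sizes in a BatchFeature. A minimal usage sketch, assuming LlavaNextImageProcessor can be instantiated with its documented defaults (the random input array is purely illustrative):

>>> import numpy as np
>>> from mindnlp.transformers.models.llava_next.image_processing_llava_next import LlavaNextImageProcessor
...
>>> # a hypothetical 480x640 RGB image in channels-last layout
>>> image = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
...
>>> processor = LlavaNextImageProcessor()  # assuming the documented defaults
>>> batch = processor.preprocess(image)
...
>>> # one stack of per-patch pixel values per input image, plus the original (height, width)
>>> patch_stack = batch["pixel_values"][0]
>>> original_size = batch["image_sizes"][0]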

mindnlp.transformers.models.llava_next.image_processing_llava_next.LlavaNextImageProcessor.resize(image, size, resample=PILImageResampling.BICUBIC, data_format=None, input_data_format=None, **kwargs)

Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge resized to keep the input aspect ratio.

PARAMETER DESCRIPTION
image

Image to resize.

TYPE: `np.ndarray`

size

Size of the output image.

TYPE: `Dict[str, int]`

resample

Resampling filter to use when resizing the image.

TYPE: `PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC` DEFAULT: BICUBIC

data_format

The channel dimension format of the image. If not provided, it will be the same as the input image.

TYPE: `str` or `ChannelDimension`, *optional* DEFAULT: None

input_data_format

The channel dimension format of the input image. If not provided, it will be inferred.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/llava_next/image_processing_llava_next.py
def resize(
    self,
    image: np.ndarray,
    size: Dict[str, int],
    resample: PILImageResampling = PILImageResampling.BICUBIC,
    data_format: Optional[Union[str, ChannelDimension]] = None,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> np.ndarray:
    """
    Resize an image. The shortest edge of the image is resized to size["shortest_edge"], with the longest edge
    resized to keep the input aspect ratio.

    Args:
        image (`np.ndarray`):
            Image to resize.
        size (`Dict[str, int]`):
            Size of the output image.
        resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
            Resampling filter to use when resizing the image.
        data_format (`str` or `ChannelDimension`, *optional*):
            The channel dimension format of the image. If not provided, it will be the same as the input image.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format of the input image. If not provided, it will be inferred.
    """
    default_to_square = True
    if "shortest_edge" in size:
        size = size["shortest_edge"]
        default_to_square = False
    elif "height" in size and "width" in size:
        size = (size["height"], size["width"])
    else:
        raise ValueError(
            "Size must contain either 'shortest_edge' or 'height' and 'width'.")

    output_size = get_resize_output_image_size(
        image,
        size=size,
        default_to_square=default_to_square,
        input_data_format=input_data_format,
    )

    return resize(
        image,
        size=output_size,
        resample=resample,
        data_format=data_format,
        input_data_format=input_data_format,
        **kwargs,
    )
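
As a hedged illustration of the shortest-edge rule (the input shape is arbitrary, and the processor is assumed to construct with its documented defaults):

>>> import numpy as np
>>> from mindnlp.transformers.models.llava_next.image_processing_llava_next import LlavaNextImageProcessor
...
>>> # a hypothetical 480x640 channels-last image
>>> image = np.zeros((480, 640, 3), dtype=np.uint8)
...
>>> processor = LlavaNextImageProcessor()
>>> resized = processor.resize(image, size={"shortest_edge": 336})
>>> # the shortest edge (height) becomes 336 and the width scales to 448 to keep the aspect ratio
>>> resized.shape  # (336, 448, 3), channel layout preserved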

mindnlp.transformers.models.llava_next.image_processing_llava_next.divide_to_patches(image, patch_size, input_data_format)

Divides an image into patches of a specified size.

PARAMETER DESCRIPTION
image

The input image.

TYPE: `np.array`

patch_size

The size of each patch.

TYPE: `int`

input_data_format

The channel dimension format of the input image.

TYPE: `ChannelDimension` or `str`

RETURNS DESCRIPTION
list

A list of np.array representing the patches.

TYPE: List[array]

Source code in mindnlp/transformers/models/llava_next/image_processing_llava_next.py
def divide_to_patches(image: np.array, patch_size: int, input_data_format) -> List[np.array]:
    """
    Divides an image into patches of a specified size.

    Args:
        image (`np.array`):
            The input image.
        patch_size (`int`):
            The size of each patch.
        input_data_format (`ChannelDimension` or `str`):
            The channel dimension format of the input image.

    Returns:
        list: A list of np.array representing the patches.
    """
    patches = []
    height, width = get_image_size(image, channel_dim=input_data_format)
    for i in range(0, height, patch_size):
        for j in range(0, width, patch_size):
            if input_data_format == ChannelDimension.LAST:
                patch = image[i: i + patch_size, j: j + patch_size]
            else:
                patch = image[:, i: i + patch_size, j: j + patch_size]
            patches.append(patch)

    return patches
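
A small hedged example of the row-major patching (the zero-filled array is illustrative; passing the string "channels_last" is assumed to be interchangeable with ChannelDimension.LAST, as with the string-valued enum used upstream):

>>> import numpy as np
>>> from mindnlp.transformers.models.llava_next.image_processing_llava_next import divide_to_patches
...
>>> # a hypothetical 672x672 channels-last image split into 336x336 tiles
>>> image = np.zeros((672, 672, 3), dtype=np.uint8)
>>> patches = divide_to_patches(image, patch_size=336, input_data_format="channels_last")
>>> len(patches)        # 4 patches, traversed row by row
>>> patches[0].shape    # (336, 336, 3)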

mindnlp.transformers.models.llava_next.image_processing_llava_next.expand_to_square(image, background_color, input_data_format)

Expands an image to a square by adding a background color.

Source code in mindnlp/transformers/models/llava_next/image_processing_llava_next.py
def expand_to_square(image: np.array, background_color, input_data_format) -> np.array:
    """
    Expands an image to a square by adding a background color.
    """
    height, width = get_image_size(image, channel_dim=input_data_format)
    if width == height:
        return image
    elif width > height:
        result = np.ones(
            (width, width, image.shape[2]), dtype=image.dtype) * background_color
        result[(width - height) // 2: (width - height) // 2 + height, :] = image
        return result
    else:
        result = np.ones(
            (height, height, image.shape[2]), dtype=image.dtype) * background_color
        result[:, (height - width) // 2: (height - width) // 2 + width] = image
        return result
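
A brief hedged sketch of the padding behaviour (values illustrative; as above, the "channels_last" string is assumed to stand in for ChannelDimension.LAST). Note that both branches build the square canvas in channels-last layout, so a height x width x channels array is expected:

>>> import numpy as np
>>> from mindnlp.transformers.models.llava_next.image_processing_llava_next import expand_to_square
...
>>> # a hypothetical 300x400 channels-last image, padded to a 400x400 square
>>> image = np.zeros((300, 400, 3), dtype=np.uint8)
>>> square = expand_to_square(image, background_color=(127, 127, 127), input_data_format="channels_last")
>>> square.shape  # (400, 400, 3); the original rows sit vertically centred on the background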