llava

mindnlp.transformers.models.llava.configuration_llava

Llava model configuration

mindnlp.transformers.models.llava.configuration_llava.LlavaConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [LlavaForConditionalGeneration]. It is used to instantiate a Llava model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the Llava-9B.

e.g. llava-hf/llava-9b

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vision_config

The config object or dictionary of the vision backbone.

TYPE: `Union[AutoConfig, dict]`, *optional*, defaults to `CLIPVisionConfig` DEFAULT: None

text_config

The config object or dictionary of the text backbone.

TYPE: `Union[AutoConfig, dict]`, *optional*, defaults to `LlamaConfig` DEFAULT: None

ignore_index

The ignore index for the loss function.

TYPE: `int`, *optional*, defaults to -100 DEFAULT: -100

image_token_index

The image token index to encode the image prompt.

TYPE: `int`, *optional*, defaults to 32000 DEFAULT: 32000

projector_hidden_act

The activation function used by the multimodal projector.

TYPE: `str`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

vision_feature_select_strategy

The feature selection strategy used to select the vision feature from the vision backbone. Can be one of "default" or "full".

TYPE: `str`, *optional*, defaults to `"default"` DEFAULT: 'default'

vision_feature_layer

The index of the layer to select the vision feature.

TYPE: `int`, *optional*, defaults to -2 DEFAULT: -2

Example
>>> from transformers import LlavaForConditionalGeneration, LlavaConfig, CLIPVisionConfig, LlamaConfig
...
>>> # Initializing a CLIP-vision config
>>> vision_config = CLIPVisionConfig()
...
>>> # Initializing a Llama config
>>> text_config = LlamaConfig()
...
>>> # Initializing a Llava llava-1.5-7b style configuration
>>> configuration = LlavaConfig(vision_config, text_config)
...
>>> # Initializing a model from the llava-1.5-7b style configuration
>>> model = LlavaForConditionalGeneration(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
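
The `vision_config` and `text_config` arguments also accept plain dictionaries, in which case a missing `model_type` key falls back to `"clip_vision_model"` and `"llama"` respectively. A minimal sketch, assuming the module path shown on this page:

```python
>>> from mindnlp.transformers.models.llava.configuration_llava import LlavaConfig
...
>>> # Dict-based sub-configs; "model_type" may also be omitted and is then inferred
>>> configuration = LlavaConfig(
...     vision_config={"model_type": "clip_vision_model", "image_size": 336, "patch_size": 14},
...     text_config={"model_type": "llama"},
... )
>>> configuration.vision_config.image_size
336
>>> configuration.text_config.model_type
'llama'
```
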
Source code in mindnlp/transformers/models/llava/configuration_llava.py
class LlavaConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`LlavaForConditionalGeneration`]. It is used to instantiate an
    Llava model according to the specified arguments, defining the model architecture. Instantiating a configuration
    with the defaults will yield a similar configuration to that of the Llava-9B.

    e.g. [llava-hf/llava-9b](https://huggingface.co/llava-hf/llava-9b)

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vision_config (`Union[AutoConfig, dict]`,  *optional*, defaults to `CLIPVisionConfig`):
            The config object or dictionary of the vision backbone.
        text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `LlamaConfig`):
            The config object or dictionary of the text backbone.
        ignore_index (`int`, *optional*, defaults to -100):
            The ignore index for the loss function.
        image_token_index (`int`, *optional*, defaults to 32000):
            The image token index to encode the image prompt.
        projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
            The activation function used by the multimodal projector.
        vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
            The feature selection strategy used to select the vision feature from the vision backbone.
            Can be one of `"default"` or `"full"`.
        vision_feature_layer (`int`, *optional*, defaults to -2):
            The index of the layer to select the vision feature.

    Example:
        ```python
        >>> from transformers import LlavaForConditionalGeneration, LlavaConfig, CLIPVisionConfig, LlamaConfig
        ...
        >>> # Initializing a CLIP-vision config
        >>> vision_config = CLIPVisionConfig()
        ...
        >>> # Initializing a Llama config
        >>> text_config = LlamaConfig()
        ...
        >>> # Initializing a Llava llava-1.5-7b style configuration
        >>> configuration = LlavaConfig(vision_config, text_config)
        ...
        >>> # Initializing a model from the llava-1.5-7b style configuration
        >>> model = LlavaForConditionalGeneration(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "llava"
    is_composition = False

    def __init__(
        self,
        vision_config=None,
        text_config=None,
        ignore_index=-100,
        image_token_index=32000,
        projector_hidden_act="gelu",
        vision_feature_select_strategy="default",
        vision_feature_layer=-2,
        **kwargs,
    ):
        """
        Initializes an instance of the LlavaConfig class.

        Args:
            self: The instance of the class.
            vision_config (dict or None): Configuration options for the vision model.
                If provided as a dictionary, must include the 'model_type' key. Default is None.
            text_config (dict or None): Configuration options for the text model.
                If provided as a dictionary, must include the 'model_type' key. Default is None.
            ignore_index (int): The index to ignore during computations. Default is -100.
            image_token_index (int): The index assigned to image tokens. Default is 32000.
            projector_hidden_act (str): The activation function for the projector. Default is 'gelu'.
            vision_feature_select_strategy (str): The strategy to select vision features.
                Valid values are 'default' and 'full'. Default is 'default'.
            vision_feature_layer (int): The layer to extract vision features from. Default is -2.
            **kwargs: Additional keyword arguments.

        Returns:
            None.

        Raises:
            ValueError: If the provided vision_feature_select_strategy is not 'default' or 'full'.
            FutureWarning: Issued if the deprecated 'vocab_size' argument is passed; it no longer has any effect.

        Note:
            - The 'vision_config' parameter can be provided as a dictionary or None. If a dictionary is provided, it must include the 'model_type' key.
            - If 'vision_config' is None, a default configuration is used.
            - The 'text_config' parameter can be provided as a dictionary or None. If a dictionary is provided, it must include the 'model_type' key.
            - If 'text_config' is None, a default configuration is used.
            - The '_vocab_size' attribute is set based on the 'text_config' vocabulary size.
        """
        self.ignore_index = ignore_index
        self.image_token_index = image_token_index
        self.projector_hidden_act = projector_hidden_act

        if vision_feature_select_strategy not in ["default", "full"]:
            raise ValueError(
                "vision_feature_select_strategy should be one of 'default', 'full'."
                f"Got: {vision_feature_select_strategy}"
            )

        if "vocab_size" in kwargs:
            warnings.warn(
                "The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect",
                FutureWarning,
            )

        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.vision_feature_layer = vision_feature_layer

        if isinstance(vision_config, dict):
            vision_config["model_type"] = (
                vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
            )
            vision_config = CONFIG_MAPPING[vision_config["model_type"]](
                **vision_config)
        elif vision_config is None:
            vision_config = CONFIG_MAPPING["clip_vision_model"](
                intermediate_size=4096,
                hidden_size=1024,
                patch_size=14,
                image_size=336,
                num_hidden_layers=24,
                num_attention_heads=16,
                vocab_size=32000,
                projection_dim=768,
            )

        self.vision_config = vision_config

        if isinstance(text_config, dict):
            text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
            text_config = CONFIG_MAPPING[text_config["model_type"]](
                **text_config)
        elif text_config is None:
            text_config = CONFIG_MAPPING["llama"]()

        self.text_config = text_config
        self._vocab_size = self.text_config.vocab_size

        super().__init__(**kwargs)

    @property
    def vocab_size(self):
        """
        Method to retrieve the vocabulary size.

        Args:
            self (LlavaConfig): The instance of the LlavaConfig class.
                This parameter refers to the current instance of the LlavaConfig class.
                It is used to access the internal attributes and configurations of the class.

        Returns:
            int: The vocabulary size (mirrors `text_config.vocab_size`).

        Raises:
            FutureWarning: This method raises a FutureWarning when accessed,
                indicating that the 'vocab_size' attribute is deprecated. Users are advised to use
                'text_config.vocab_size' instead.
        """
        warnings.warn(
            "The `vocab_size` attribute is deprecated and will be removed in v4.42, Please use `text_config.vocab_size` instead.",
            FutureWarning,
        )
        return self._vocab_size

    @vocab_size.setter
    def vocab_size(self, value):
        """
        Sets the vocabulary size for the LlavaConfig class.

        Args:
            self (LlavaConfig): The instance of the LlavaConfig class.
            value (int): The new vocabulary size to be set. It should be a positive integer.

        Returns:
            None.

        Raises:
            None.

        This method is used to set the vocabulary size for the LlavaConfig class. The vocabulary size determines
        the number of unique words that can be stored in the vocabulary. It is important to set an appropriate
        vocabulary size based on the application and the amount of available memory. The vocabulary size can only be
        set to a positive integer value, otherwise an error will be raised.

        Example:
            ```python
           >>> config = LlavaConfig()
           >>> config.vocab_size = 10000
            ```
        """
        self._vocab_size = value

    def to_dict(self):
        """
        Converts the LlavaConfig object into a dictionary representation.

        Args:
            self (LlavaConfig): The LlavaConfig object itself.

        Returns:
            dict: A dictionary representation of the LlavaConfig object, excluding the '_vocab_size' attribute.

        Raises:
            None

        Note:
            This method is inherited from the parent class and modified to exclude the '_vocab_size' attribute from the output dictionary.
        """
        output = super().to_dict()
        output.pop("_vocab_size", None)
        return output

mindnlp.transformers.models.llava.configuration_llava.LlavaConfig.vocab_size property writable

Method to retrieve the vocabulary size.

PARAMETER DESCRIPTION
self

The instance of the LlavaConfig class. This parameter refers to the current instance of the LlavaConfig class. It is used to access the internal attributes and configurations of the class.

TYPE: LlavaConfig

RETURNS DESCRIPTION
int

The vocabulary size (mirrors text_config.vocab_size).

RAISES DESCRIPTION
FutureWarning

Accessing this property emits a FutureWarning indicating that the 'vocab_size' attribute is deprecated. Users are advised to use 'text_config.vocab_size' instead.
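
A minimal sketch of the recommended access pattern, assuming the module path shown on this page (the deprecated `config.vocab_size` accessor still works but emits the FutureWarning described above):

```python
>>> from mindnlp.transformers.models.llava.configuration_llava import LlavaConfig
...
>>> config = LlavaConfig()
>>> # Preferred: read the vocabulary size from the text backbone configuration
>>> config.text_config.vocab_size
32000
```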

mindnlp.transformers.models.llava.configuration_llava.LlavaConfig.__init__(vision_config=None, text_config=None, ignore_index=-100, image_token_index=32000, projector_hidden_act='gelu', vision_feature_select_strategy='default', vision_feature_layer=-2, **kwargs)

Initializes an instance of the LlavaConfig class.

PARAMETER DESCRIPTION
self

The instance of the class.

vision_config

Configuration options for the vision model. If provided as a dictionary, must include the 'model_type' key. Default is None.

TYPE: dict or None DEFAULT: None

text_config

Configuration options for the text model. If provided as a dictionary, must include the 'model_type' key. Default is None.

TYPE: dict or None DEFAULT: None

ignore_index

The index to ignore during computations. Default is -100.

TYPE: int DEFAULT: -100

image_token_index

The index assigned to image tokens. Default is 32000.

TYPE: int DEFAULT: 32000

projector_hidden_act

The activation function for the projector. Default is 'gelu'.

TYPE: str DEFAULT: 'gelu'

vision_feature_select_strategy

The strategy to select vision features. Valid values are 'default' and 'full'. Default is 'default'.

TYPE: str DEFAULT: 'default'

vision_feature_layer

The layer to extract vision features from. Default is -2.

TYPE: int DEFAULT: -2

**kwargs

Additional keyword arguments.

DEFAULT: {}

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
ValueError

If the provided vision_feature_select_strategy is not 'default' or 'full'.

FutureWarning

Issued if the deprecated 'vocab_size' argument is passed; the argument no longer has any effect.

Note
  • The 'vision_config' parameter can be provided as a dictionary or None. If a dictionary is provided, it must include the 'model_type' key.
  • If 'vision_config' is None, a default CLIP-vision configuration is used (see the sketch below).
  • The 'text_config' parameter can be provided as a dictionary or None. If a dictionary is provided, it must include the 'model_type' key.
  • If 'text_config' is None, a default configuration is used.
  • The '_vocab_size' attribute is set based on the 'text_config' vocabulary size.
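
A quick check of the fallback behaviour described in the note above; the default values match those hard-coded in the source below. A sketch assuming the module path shown on this page:

```python
>>> from mindnlp.transformers.models.llava.configuration_llava import LlavaConfig
...
>>> config = LlavaConfig()          # no vision_config / text_config given
>>> config.vision_config.model_type
'clip_vision_model'
>>> (config.vision_config.hidden_size, config.vision_config.patch_size, config.vision_config.image_size)
(1024, 14, 336)
>>> config.text_config.model_type
'llama'
```
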
Source code in mindnlp/transformers/models/llava/configuration_llava.py
def __init__(
    self,
    vision_config=None,
    text_config=None,
    ignore_index=-100,
    image_token_index=32000,
    projector_hidden_act="gelu",
    vision_feature_select_strategy="default",
    vision_feature_layer=-2,
    **kwargs,
):
    """
    Initializes an instance of the LlavaConfig class.

    Args:
        self: The instance of the class.
        vision_config (dict or None): Configuration options for the vision model.
            If provided as a dictionary, must include the 'model_type' key. Default is None.
        text_config (dict or None): Configuration options for the text model.
            If provided as a dictionary, must include the 'model_type' key. Default is None.
        ignore_index (int): The index to ignore during computations. Default is -100.
        image_token_index (int): The index assigned to image tokens. Default is 32000.
        projector_hidden_act (str): The activation function for the projector. Default is 'gelu'.
        vision_feature_select_strategy (str): The strategy to select vision features.
            Valid values are 'default' and 'full'. Default is 'default'.
        vision_feature_layer (int): The layer to extract vision features from. Default is -2.
        **kwargs: Additional keyword arguments.

    Returns:
        None.

    Raises:
        ValueError: If the provided vision_feature_select_strategy is not 'default' or 'full'.
        FutureWarning: Issued if the deprecated 'vocab_size' argument is passed; it no longer has any effect.

    Note:
        - The 'vision_config' parameter can be provided as a dictionary or None. If a dictionary is provided, it must include the 'model_type' key.
        - If 'vision_config' is None, a default configuration is used.
        - The 'text_config' parameter can be provided as a dictionary or None. If a dictionary is provided, it must include the 'model_type' key.
        - If 'text_config' is None, a default configuration is used.
        - The '_vocab_size' attribute is set based on the 'text_config' vocabulary size.
    """
    self.ignore_index = ignore_index
    self.image_token_index = image_token_index
    self.projector_hidden_act = projector_hidden_act

    if vision_feature_select_strategy not in ["default", "full"]:
        raise ValueError(
            "vision_feature_select_strategy should be one of 'default', 'full'."
            f"Got: {vision_feature_select_strategy}"
        )

    if "vocab_size" in kwargs:
        warnings.warn(
            "The `vocab_size` argument is deprecated and will be removed in v4.42, since it can be inferred from the `text_config`. Passing this argument has no effect",
            FutureWarning,
        )

    self.vision_feature_select_strategy = vision_feature_select_strategy
    self.vision_feature_layer = vision_feature_layer

    if isinstance(vision_config, dict):
        vision_config["model_type"] = (
            vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
        )
        vision_config = CONFIG_MAPPING[vision_config["model_type"]](
            **vision_config)
    elif vision_config is None:
        vision_config = CONFIG_MAPPING["clip_vision_model"](
            intermediate_size=4096,
            hidden_size=1024,
            patch_size=14,
            image_size=336,
            num_hidden_layers=24,
            num_attention_heads=16,
            vocab_size=32000,
            projection_dim=768,
        )

    self.vision_config = vision_config

    if isinstance(text_config, dict):
        text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
        text_config = CONFIG_MAPPING[text_config["model_type"]](
            **text_config)
    elif text_config is None:
        text_config = CONFIG_MAPPING["llama"]()

    self.text_config = text_config
    self._vocab_size = self.text_config.vocab_size

    super().__init__(**kwargs)

mindnlp.transformers.models.llava.configuration_llava.LlavaConfig.to_dict()

Converts the LlavaConfig object into a dictionary representation.

PARAMETER DESCRIPTION
self

The LlavaConfig object itself.

TYPE: LlavaConfig

RETURNS DESCRIPTION
dict

A dictionary representation of the LlavaConfig object, excluding the '_vocab_size' attribute.

Note

This method is inherited from the parent class and modified to exclude the '_vocab_size' attribute from the output dictionary.
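
A minimal sketch of the behaviour described above (the private `_vocab_size` entry is dropped from the serialized dictionary), assuming the module path shown on this page:

```python
>>> from mindnlp.transformers.models.llava.configuration_llava import LlavaConfig
...
>>> config = LlavaConfig()
>>> serialized = config.to_dict()
>>> serialized["model_type"]
'llava'
>>> "_vocab_size" in serialized
False
```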

Source code in mindnlp/transformers/models/llava/configuration_llava.py
def to_dict(self):
    """
    Converts the LlavaConfig object into a dictionary representation.

    Args:
        self (LlavaConfig): The LlavaConfig object itself.

    Returns:
        dict: A dictionary representation of the LlavaConfig object, excluding the '_vocab_size' attribute.

    Raises:
        None

    Note:
        This method is inherited from the parent class and modified to exclude the '_vocab_size' attribute from the output dictionary.
    """
    output = super().to_dict()
    output.pop("_vocab_size", None)
    return output

mindnlp.transformers.models.llava.modeling_llava

MindSpore Llava model.

mindnlp.transformers.models.llava.modeling_llava.LlavaCausalLMOutputWithPast dataclass

Bases: ModelOutput

Base class for Llava causal language model (or autoregressive) outputs.

PARAMETER DESCRIPTION
loss

Language modeling loss (for next-token prediction).

TYPE: `mindspore.Tensor` of shape `(1,)`, *optional*, returned when `labels` is provided DEFAULT: None

logits

Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length, config.vocab_size)` DEFAULT: None

image_hidden_states

Tuple of mindspore.Tensor (one for the output of the image embeddings) of shape (batch_size, num_images, sequence_length, hidden_size).

Image hidden states of the model produced by the vision encoder and, optionally, by the perceiver.

TYPE: `tuple(mindspore.Tensor)`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/llava/modeling_llava.py
@dataclass
# Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->Llava
class LlavaCausalLMOutputWithPast(ModelOutput):
    """
    Base class for Llava causal language model (or autoregressive) outputs.

    Args:
        loss (`mindspore.Tensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
            Language modeling loss (for next-token prediction).
        logits (`mindspore.Tensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
            Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
        past_key_values (`tuple(tuple(mindspore.Tensor))`, *optional*, returned when `use_cache=True` is passed
            or when `config.use_cache=True`):
            Tuple of `tuple(mindspore.Tensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
            `(batch_size, num_heads, sequence_length, embed_size_per_head)`)

            Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
            `past_key_values` input) to speed up sequential decoding.
        hidden_states (`tuple(mindspore.Tensor)`, *optional*, returned when `output_hidden_states=True` is passed
            or when `config.output_hidden_states=True`):
            Tuple of `mindspore.Tensor` (one for the output of the embeddings, if the model has an embedding layer, +
            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
        attentions (`tuple(mindspore.Tensor)`, *optional*, returned when `output_attentions=True` is passed
            or when `config.output_attentions=True`):
            Tuple of `mindspore.Tensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
        image_hidden_states (`tuple(mindspore.Tensor)`, *optional*):
            Tuple of `mindspore.Tensor` (one for the output of the image embeddings, `(batch_size, num_images,
            sequence_length, hidden_size)`.

            image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver
    """
    loss: Optional[mindspore.Tensor] = None
    logits: mindspore.Tensor = None
    past_key_values: Optional[List[mindspore.Tensor]] = None
    hidden_states: Optional[Tuple[mindspore.Tensor]] = None
    attentions: Optional[Tuple[mindspore.Tensor]] = None
    image_hidden_states: Optional[Tuple[mindspore.Tensor]] = None

mindnlp.transformers.models.llava.modeling_llava.LlavaForConditionalGeneration

Bases: LlavaPreTrainedModel

LlavaForConditionalGeneration

This class is a language model for conditional generation based on the Llava architecture. It extends the LlavaPreTrainedModel class.

ATTRIBUTE DESCRIPTION
vision_tower

The vision tower model for extracting image features.

TYPE: AutoModel

multi_modal_projector

The multi-modal projector for combining image and text features.

TYPE: LlavaMultiModalProjector

vocab_size

The size of the vocabulary used by the language model.

TYPE: int

language_model

The language model for generating text.

TYPE: AutoModelForCausalLM

pad_token_id

The ID of the padding token in the vocabulary. Defaults to -1 if not provided.

TYPE: int

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import LlavaForConditionalGeneration
...
>>> model = LlavaForConditionalGeneration(config)
...
>>> input_ids = [1, 2, 3]
>>> pixel_values = [0.1, 0.2, 0.3]
>>> attention_mask = [1, 1, 1]
... 
>>> output = model.forward(input_ids=input_ids, pixel_values=pixel_values, attention_mask=attention_mask)
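
A small sketch of how the class is composed, using deliberately tiny, illustrative sub-configuration values (not real LLaVA sizes) so it can be instantiated quickly; import paths follow the module names shown on this page:

```python
>>> from mindnlp.transformers.models.llava.configuration_llava import LlavaConfig
>>> from mindnlp.transformers.models.llava.modeling_llava import LlavaForConditionalGeneration
...
>>> # Tiny sub-configs purely for illustration
>>> config = LlavaConfig(
...     vision_config={"hidden_size": 32, "intermediate_size": 64, "num_hidden_layers": 2,
...                    "num_attention_heads": 4, "image_size": 30, "patch_size": 6},
...     text_config={"hidden_size": 64, "intermediate_size": 128, "num_hidden_layers": 2,
...                  "num_attention_heads": 4},
... )
>>> model = LlavaForConditionalGeneration(config)
...
>>> # The embedding helpers delegate to the wrapped language model
>>> model.get_input_embeddings() is model.language_model.get_input_embeddings()
True
>>> model.vocab_size == config.text_config.vocab_size
True
```
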
Source code in mindnlp/transformers/models/llava/modeling_llava.py
class LlavaForConditionalGeneration(LlavaPreTrainedModel):

    """
    LlavaForConditionalGeneration

    This class is a language model for conditional generation based on the Llava architecture.
    It extends the LlavaPreTrainedModel class.

    Attributes:
        vision_tower (AutoModel): The vision tower model for extracting image features.
        multi_modal_projector (LlavaMultiModalProjector): The multi-modal projector for combining image and text features.
        vocab_size (int): The size of the vocabulary used by the language model.
        language_model (AutoModelForCausalLM): The language model for generating text.
        pad_token_id (int): The ID of the padding token in the vocabulary. Defaults to -1 if not provided.

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import LlavaForConditionalGeneration
        ...
        >>> model = LlavaForConditionalGeneration(config)
        ...
        >>> input_ids = [1, 2, 3]
        >>> pixel_values = [0.1, 0.2, 0.3]
        >>> attention_mask = [1, 1, 1]
        ... 
        >>> output = model.forward(input_ids=input_ids, pixel_values=pixel_values, attention_mask=attention_mask)
        ```
    """
    def __init__(self, config: LlavaConfig):
        """
        Initializes an instance of the LlavaForConditionalGeneration class.

        Args:
            self: The instance of the class.
            config (LlavaConfig): An object of type LlavaConfig containing the configuration settings for the model.
                It specifies the configuration parameters for the vision tower, multi-modal projector, vocab size, 
                language model, pad token id, and other model settings.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)
        self.vision_tower = AutoModel.from_config(config.vision_config)

        self.multi_modal_projector = LlavaMultiModalProjector(config)
        self.vocab_size = config.text_config.vocab_size
        self.language_model = AutoModelForCausalLM.from_config(config.text_config)
        self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
        self.post_init()

    def get_input_embeddings(self):
        """
        Get the input embeddings from the language model.

        Args:
            self (LlavaForConditionalGeneration): An instance of the LlavaForConditionalGeneration class.

        Returns:
            The input embedding module of the underlying language model.

        Raises:
            None.
        """
        return self.language_model.get_input_embeddings()

    def set_input_embeddings(self, value):
        """
        Set the input embeddings for the LlavaForConditionalGeneration language model.

        Args:
            self (LlavaForConditionalGeneration): The instance of the LlavaForConditionalGeneration class.
            value (Any): The input embeddings to be set for the language model.

        Returns:
            None.

        Raises:
            None.
        """
        self.language_model.set_input_embeddings(value)

    def get_output_embeddings(self):
        """
        Retrieve the output embeddings from the language model used for conditional generation.

        Args:
            self: An instance of the LlavaForConditionalGeneration class.

        Returns:
            The output embeddings of the underlying language model, as returned by its
                get_output_embeddings method.

        Raises:
            None: However, if the language_model.get_output_embeddings() method raises any exceptions,
                they will propagate up to the caller.
        """
        return self.language_model.get_output_embeddings()

    def set_output_embeddings(self, new_embeddings):
        """
        Sets the output embeddings for the LlavaForConditionalGeneration model.

        Args:
            self (LlavaForConditionalGeneration): An instance of the LlavaForConditionalGeneration class.
            new_embeddings (Tensor): The new output embeddings to be set for the model.
                It should have the same shape as the original output embeddings.

        Returns:
            None.

        Raises:
            TypeError: If the provided new_embeddings parameter is not of type Tensor.
            ValueError: If the shape of the new_embeddings parameter does not match the
                shape of the original output embeddings.
        """
        self.language_model.set_output_embeddings(new_embeddings)

    def set_decoder(self, decoder):
        """
        Sets the decoder for the language model used in LlavaForConditionalGeneration.

        Args:
            self (LlavaForConditionalGeneration): The instance of the LlavaForConditionalGeneration class.
            decoder: The decoder object to be set for the language model.
                It should be compatible with the language model's requirements.

        Returns:
            None.

        Raises:
            TypeError: If the provided decoder is not of the correct type.
            ValueError: If the decoder object is invalid or incompatible with the language model.
        """
        self.language_model.set_decoder(decoder)

    def get_decoder(self):
        """
        Returns the decoder of the LlavaForConditionalGeneration model.

        Args:
            self: An instance of the LlavaForConditionalGeneration class.

        Returns:
            The decoder from the language model used by the LlavaForConditionalGeneration model.

        Raises:
            None.
        """
        return self.language_model.get_decoder()

    def tie_weights(self):
        """
        Ties the weights of the language model used for conditional generation in the LlavaForConditionalGeneration class.

        Args:
            self (LlavaForConditionalGeneration): An instance of the LlavaForConditionalGeneration class.

        Returns:
            None.

        Raises:
            None.
        """
        return self.language_model.tie_weights()

    def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
        """
        Resize the token embeddings for conditional generation in the LlavaForConditionalGeneration class.

        Args:
            self: The instance of the LlavaForConditionalGeneration class.

            new_num_tokens (int, optional): The new number of tokens to resize the embeddings to. Defaults to None.
                If provided, the token embeddings will be resized to accommodate this number of tokens.

            pad_to_multiple_of (int): The value to pad the token embeddings to a multiple of. Defaults to None.
                If provided, the token embeddings will be padded to a multiple of this value.

        Returns:
            nn.Embedding: The resized token embeddings after the operation.
                This updated nn.Embedding object reflects the changes made to the token embeddings.

        Raises:
            None specified.
        """
        model_embeds = self.language_model.resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
        # update vocab size
        self.config.text_config.vocab_size = model_embeds.vocab_size
        self.vocab_size = model_embeds.vocab_size
        return model_embeds

    def _merge_input_ids_with_image_features(self, image_features, inputs_embeds, input_ids, attention_mask, labels):
        """
        Merges image features with input embeddings and applies necessary modifications.

        Args:
            self (LlavaForConditionalGeneration): The instance of the LlavaForConditionalGeneration class.
            image_features (Tensor): A tensor of shape (num_images, num_image_patches, embed_dim)
                representing the image features.
            inputs_embeds (Tensor): A tensor of shape (batch_size, sequence_length, embed_dim)
                representing the input embeddings.
            input_ids (Tensor): A tensor of shape (batch_size, sequence_length) representing the input token IDs.
            attention_mask (Tensor): A tensor of shape (batch_size, sequence_length) representing the attention mask.
            labels (Tensor): A tensor of shape (batch_size, sequence_length) representing the labels.

        Returns:
            tuple: (final_embedding, final_attention_mask, final_labels, position_ids).

        Raises:
            ValueError: If the number of image tokens provided in the input does not match the number of
                images given to the model.
        """
        num_images, num_image_patches, embed_dim = image_features.shape
        batch_size, sequence_length = input_ids.shape
        left_padding = not ops.sum(input_ids[:, -1] == mindspore.tensor(self.pad_token_id))
        # 1. Create a mask to know where special image tokens are
        special_image_token_mask = input_ids == self.config.image_token_index
        num_special_image_tokens = ops.sum(special_image_token_mask, dim=-1)
        # Compute the maximum embed dimension
        max_embed_dim = (num_special_image_tokens.max() * (num_image_patches - 1)).item() + sequence_length
        nonzero = ops.nonzero(input_ids != self.config.image_token_index)
        batch_indices, non_image_indices = ops.tensor_split(nonzero, 2, -1)

        # 2. Compute the positions where text should be written
        # Calculate new positions for text tokens in merged image-text sequence.
        # `special_image_token_mask` identifies image tokens. Each image token will be replaced by `nb_text_tokens_per_images - 1` text tokens.
        # `ops.cumsum` computes how each image token shifts subsequent text token positions.
        # - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.
        new_token_positions = ops.cumsum((special_image_token_mask * (num_image_patches - 1) + 1), -1) - 1
        nb_image_pad = max_embed_dim - 1 - new_token_positions[:, -1]
        if left_padding:
            new_token_positions += nb_image_pad[:, None]  # offset for left padding
        text_to_overwrite = new_token_positions[batch_indices, non_image_indices]

        # 3. Create the full embedding, already padded to the maximum position
        final_embedding = ops.zeros(
            (batch_size, max_embed_dim, embed_dim), dtype=inputs_embeds.dtype
        )
        final_attention_mask = ops.zeros(
            (batch_size, max_embed_dim), dtype=attention_mask.dtype
        )
        if labels is not None:
            final_labels = ops.full(
                (batch_size, max_embed_dim), self.config.ignore_index, dtype=input_ids.dtype
            )

        # 4. Fill the embeddings based on the mask. If we have ["hey" "<image>", "how", "are"]
        # we need to index copy on [0, 577, 578, 579] for the text and [1:576] for the image features
        final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]
        final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
        if labels is not None:
            final_labels[batch_indices, text_to_overwrite] = labels[batch_indices, non_image_indices]

        # 5. Fill the embeddings corresponding to the images. Anything that is still zeros needs filling
        image_to_overwrite = ops.all(final_embedding == 0, axis=-1)
        image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None]

        if image_to_overwrite.sum() != reduce(lambda x, y: x * y, image_features.shape[:-1]):
            raise ValueError(
                f"The input provided to the model are wrong. The number of image tokens is {ops.sum(special_image_token_mask)} while"
                f" the number of image given to the model is {num_images}. This prevents correct indexing and breaks batch generation."
            )

        final_embedding[image_to_overwrite] = image_features.reshape(-1, embed_dim)
        final_attention_mask |= image_to_overwrite
        position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill((final_attention_mask == 0), 1)

        # 6. Mask out the embedding at padding positions, as we later use the past_key_value value to determine the non-attended tokens.
        nonzero = ops.nonzero(input_ids == self.pad_token_id)
        batch_indices, pad_indices = ops.tensor_split(nonzero, 2, -1)
        indices_to_mask = new_token_positions[batch_indices, pad_indices]

        final_embedding[batch_indices, indices_to_mask] = 0

        if labels is None:
            final_labels = None

        return final_embedding, final_attention_mask, final_labels, position_ids

    def forward(
        self,
        input_ids: mindspore.Tensor = None,
        pixel_values: mindspore.Tensor = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        past_key_values: Optional[List[mindspore.Tensor]] = None,
        inputs_embeds: Optional[mindspore.Tensor] = None,
        vision_feature_layer: Optional[int] = None,
        vision_feature_select_strategy: Optional[str] = None,
        labels: Optional[mindspore.Tensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, LlavaCausalLMOutputWithPast]:
        r"""
        Args:
            labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
                (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

        Returns:
            Union[Tuple, LlavaCausalLMOutputWithPast]

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, LlavaForConditionalGeneration
            ...
            >>> model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
            >>> processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
            ...
            >>> prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
            >>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = processor(text=prompt, images=image, return_tensors="pt")
            ...
            >>> # Generate
            >>> generate_ids = model.generate(**inputs, max_new_tokens=15)
            >>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
            "USER:  \nWhat's the content of the image? ASSISTANT: The image features a busy city street with a stop sign prominently displayed"
            ```
        """
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        vision_feature_layer = (
            vision_feature_layer if vision_feature_layer is not None else self.config.vision_feature_layer
        )
        vision_feature_select_strategy = (
            vision_feature_select_strategy
            if vision_feature_select_strategy is not None
            else self.config.vision_feature_select_strategy
        )

        if inputs_embeds is None:
            # 1. Extract the input embeddings
            inputs_embeds = self.get_input_embeddings()(input_ids)

            # 2. Merge text and images
            if pixel_values is not None and input_ids.shape[1] != 1:
                image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
                # this is not memory efficient at all: `output_hidden_states=True` saves all the hidden states.
                selected_image_feature = image_outputs.hidden_states[vision_feature_layer]

                if vision_feature_select_strategy == "default":
                    selected_image_feature = selected_image_feature[:, 1:]
                else:
                    raise ValueError(
                        f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}"
                    )

                image_features = self.multi_modal_projector(selected_image_feature)
                inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
                    image_features, inputs_embeds, input_ids, attention_mask, labels
                )
                if labels is None:
                    labels = ops.full_like(attention_mask, self.config.ignore_index).to(mindspore.int64)

            # In case input_ids.shape[1] == 1 & pixel_values != None & past_key_values != None, we are in the case of
            # generation with cache
            elif past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
                # Retrieve the first layer to inspect the logits and mask out the hidden states
                # that are set to 0
                first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]

                # Sum all dimensions of head_dim (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
                nonzero = ops.nonzero(first_layer_past_key_value.float().sum(-2) == 0)
                batch_index, non_attended_tokens = ops.tensor_split(nonzero, 2, -1)

                # Get the target length
                target_length = input_ids.shape[1]
                past_length = first_layer_past_key_value.shape[-1]

                extended_attention_mask = ops.ones(
                    (attention_mask.shape[0], past_length),
                    dtype=attention_mask.dtype,
                )

                # Filter out only the tokens that can be un-attended, this can happen
                # if one uses Llava + Fused modules where the cache on the
                # first iteration is already big enough, or if one passes custom cache
                valid_indices = non_attended_tokens < extended_attention_mask.shape[-1]
                new_batch_index = batch_index[valid_indices]
                new_non_attended_tokens = non_attended_tokens[valid_indices]

                # Zero-out the places where we don't need to attend
                extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0

                attention_mask = ops.cat((extended_attention_mask, attention_mask[:, -target_length:]), axis=1)
                position_ids = ops.sum(attention_mask, dim=1).unsqueeze(-1) - 1

        outputs = self.language_model(
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        logits = outputs[0]

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            if attention_mask is not None:
                shift_attention_mask = attention_mask[..., 1:]
                shift_logits = logits[..., :-1, :][ops.ne(shift_attention_mask, 0)]
                shift_labels = labels[..., 1:][ops.ne(shift_attention_mask, 0)]
            else:
                shift_logits = logits[..., :-1, :]
                shift_labels = labels[..., 1:]
            # Flatten the tokens
            loss = ops.cross_entropy(
                shift_logits.view(-1, shift_logits.shape[-1]), shift_labels.view(-1)
            )

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return LlavaCausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def prepare_inputs_for_generation(
        self, input_ids, past_key_values=None, inputs_embeds=None, pixel_values=None, attention_mask=None, **kwargs
    ):
        '''
        Prepares inputs for text generation in the LlavaForConditionalGeneration class.

        Args:
            self (LlavaForConditionalGeneration): The instance of the LlavaForConditionalGeneration class.
            input_ids (Tensor): The input tensor containing the tokenized input sequence.
            past_key_values (Cache or tuple of Tensor, optional):
                The cache of past key values or tuple of tensors containing past key values.
            inputs_embeds (Tensor, optional): The input embeddings tensor.
            pixel_values (Tensor, optional): The tensor containing the pixel values.
            attention_mask (Tensor, optional): The attention mask tensor.
            **kwargs: Additional keyword arguments.

        Returns:
            dict: A dictionary containing the prepared model inputs for text generation.

        Raises:
            None.
        '''
        if past_key_values is not None:
            if isinstance(past_key_values, Cache):
                cache_length = past_key_values.get_seq_length()
                past_length = past_key_values.seen_tokens
            else:
                cache_length = past_length = past_key_values[0][0].shape[2]

            # Keep only the unprocessed tokens:
            # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
            # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
            # input)
            if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
                input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
            # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
            # input_ids based on the past_length.
            elif past_length < input_ids.shape[1]:
                input_ids = input_ids[:, past_length:]
            # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
            elif self.config.image_token_index in input_ids:
                input_ids = input_ids[:, input_ids.shape[1] - 1 :]
            # If the cache has seen more tokens than it can hold, then the cache has a size limit. Let's discard the
            # older attention values, as their corresponding values are not part of the input.
            if cache_length < past_length and attention_mask is not None:
                attention_mask = attention_mask[:, -(cache_length + input_ids.shape[1]) :]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids = position_ids.masked_fill(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -input_ids.shape[1] :]

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "position_ids": position_ids,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
                "pixel_values": pixel_values,
            }
        )
        return model_inputs

    def _reorder_cache(self, *args, **kwargs):
        """
        Method _reorder_cache in class LlavaForConditionalGeneration.

        Args:
            self: LlavaForConditionalGeneration instance. Represents the current instance of the class.

        Returns:
            The reordered cache, as returned by the language model's _reorder_cache.

        Raises:
            None.
        """
        return self.language_model._reorder_cache(*args, **kwargs)

mindnlp.transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.__init__(config)

Initializes an instance of the LlavaForConditionalGeneration class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

An object of type LlavaConfig containing the configuration settings for the model. It specifies the configuration parameters for the vision tower, multi-modal projector, vocab size, language model, pad token id, and other model settings.

TYPE: LlavaConfig

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/llava/modeling_llava.py
def __init__(self, config: LlavaConfig):
    """
    Initializes an instance of the LlavaForConditionalGeneration class.

    Args:
        self: The instance of the class.
        config (LlavaConfig): An object of type LlavaConfig containing the configuration settings for the model.
            It specifies the configuration parameters for the vision tower, multi-modal projector, vocab size, 
            language model, pad token id, and other model settings.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)
    self.vision_tower = AutoModel.from_config(config.vision_config)

    self.multi_modal_projector = LlavaMultiModalProjector(config)
    self.vocab_size = config.text_config.vocab_size
    self.language_model = AutoModelForCausalLM.from_config(config.text_config)
    self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
    self.post_init()

mindnlp.transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.forward(input_ids=None, pixel_values=None, attention_mask=None, position_ids=None, past_key_values=None, inputs_embeds=None, vision_feature_layer=None, vision_feature_select_strategy=None, labels=None, use_cache=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see input_ids docstring). Tokens with indices set to -100 are ignored (masked), the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].

TYPE: `mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional* DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, LlavaCausalLMOutputWithPast]

Union[Tuple, LlavaCausalLMOutputWithPast]

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, LlavaForConditionalGeneration
...
>>> model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
>>> processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
...
>>> prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
>>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(text=prompt, images=image, return_tensors="pt")
...
>>> # Generate
>>> generate_ids = model.generate(**inputs, max_new_tokens=15)
>>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"USER:  \nWhat's the content of the image? ASSISTANT: The image features a busy city street with a stop sign prominently displayed"
Source code in mindnlp/transformers/models/llava/modeling_llava.py
def forward(
    self,
    input_ids: mindspore.Tensor = None,
    pixel_values: mindspore.Tensor = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    past_key_values: Optional[List[mindspore.Tensor]] = None,
    inputs_embeds: Optional[mindspore.Tensor] = None,
    vision_feature_layer: Optional[int] = None,
    vision_feature_select_strategy: Optional[str] = None,
    labels: Optional[mindspore.Tensor] = None,
    use_cache: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, LlavaCausalLMOutputWithPast]:
    r"""
    Args:
        labels (`mindspore.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
            config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
            (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.

    Returns:
        Union[Tuple, LlavaCausalLMOutputWithPast]

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, LlavaForConditionalGeneration
        ...
        >>> model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
        >>> processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
        ...
        >>> prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
        >>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(text=prompt, images=image, return_tensors="pt")
        ...
        >>> # Generate
        >>> generate_ids = model.generate(**inputs, max_new_tokens=15)
        >>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "USER:  \nWhat's the content of the image? ASSISTANT: The image features a busy city street with a stop sign prominently displayed"
        ```
    """
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    vision_feature_layer = (
        vision_feature_layer if vision_feature_layer is not None else self.config.vision_feature_layer
    )
    vision_feature_select_strategy = (
        vision_feature_select_strategy
        if vision_feature_select_strategy is not None
        else self.config.vision_feature_select_strategy
    )

    if inputs_embeds is None:
        # 1. Extract the input embeddings
        inputs_embeds = self.get_input_embeddings()(input_ids)

        # 2. Merge text and images
        if pixel_values is not None and input_ids.shape[1] != 1:
            image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
            # this is not memory efficient at all: output_hidden_states=True saves all the hidden states.
            selected_image_feature = image_outputs.hidden_states[vision_feature_layer]

            if vision_feature_select_strategy == "default":
                selected_image_feature = selected_image_feature[:, 1:]
            else:
                raise ValueError(
                    f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}"
                )

            image_features = self.multi_modal_projector(selected_image_feature)
            inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
                image_features, inputs_embeds, input_ids, attention_mask, labels
            )
            if labels is None:
                labels = ops.full_like(attention_mask, self.config.ignore_index).to(mindspore.int64)

        # In case input_ids.shape[1] == 1, pixel_values is not None and past_key_values is not None,
        # we are in the case of generation with cache
        elif past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
            # Retrieve the first layer to inspect the logits and mask out the hidden states
            # that are set to 0
            first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]

            # Sum all dimensions of head_dim (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
            nonzero = ops.nonzero(first_layer_past_key_value.float().sum(-2) == 0)
            batch_index, non_attended_tokens = ops.tensor_split(nonzero, 2, -1)

            # Get the target length
            target_length = input_ids.shape[1]
            past_length = first_layer_past_key_value.shape[-1]

            extended_attention_mask = ops.ones(
                (attention_mask.shape[0], past_length),
                dtype=attention_mask.dtype,
            )

            # Filter out only the tokens that can be un-attended, this can happen
            # if one uses Llava + Fused modules where the cache on the
            # first iteration is already big enough, or if one passes custom cache
            valid_indices = non_attended_tokens < extended_attention_mask.shape[-1]
            new_batch_index = batch_index[valid_indices]
            new_non_attended_tokens = non_attended_tokens[valid_indices]

            # Zero-out the places where we don't need to attend
            extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0

            attention_mask = ops.cat((extended_attention_mask, attention_mask[:, -target_length:]), axis=1)
            position_ids = ops.sum(attention_mask, dim=1).unsqueeze(-1) - 1

    outputs = self.language_model(
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    logits = outputs[0]

    loss = None
    if labels is not None:
        # Shift so that tokens < n predict n
        if attention_mask is not None:
            shift_attention_mask = attention_mask[..., 1:]
            shift_logits = logits[..., :-1, :][ops.ne(shift_attention_mask, 0)]
            shift_labels = labels[..., 1:][ops.ne(shift_attention_mask, 0)]
        else:
            shift_logits = logits[..., :-1, :]
            shift_labels = labels[..., 1:]
        # Flatten the tokens
        loss = ops.cross_entropy(
            shift_logits.view(-1, shift_logits.shape[-1]), shift_labels.view(-1)
        )

    if not return_dict:
        output = (logits,) + outputs[1:]
        return (loss,) + output if loss is not None else output

    return LlavaCausalLMOutputWithPast(
        loss=loss,
        logits=logits,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

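The loss branch at the end of forward shifts logits and labels by one position (so that position t predicts token t+1) and keeps only positions whose attention mask is non-zero before applying cross-entropy. The following is a minimal NumPy sketch of that shifting and masking logic; shapes and values are made up for illustration and nothing here is part of the mindnlp API.

```python
import numpy as np

# Toy shapes: batch of 2 sequences, length 5, vocabulary of 7 (illustrative only).
batch, seq_len, vocab = 2, 5, 7
rng = np.random.default_rng(0)
logits = rng.normal(size=(batch, seq_len, vocab))
labels = rng.integers(0, vocab, size=(batch, seq_len))
attention_mask = np.array([[1, 1, 1, 1, 0],
                           [1, 1, 1, 0, 0]])  # trailing padding

# Shift so that position t predicts token t+1, then keep only attended positions.
shift_mask = attention_mask[..., 1:] != 0
shift_logits = logits[..., :-1, :][shift_mask]   # (num_kept, vocab)
shift_labels = labels[..., 1:][shift_mask]       # (num_kept,)

# Cross-entropy over the kept positions: log-softmax followed by mean negative log-likelihood.
log_probs = shift_logits - np.log(np.exp(shift_logits).sum(-1, keepdims=True))
loss = -log_probs[np.arange(len(shift_labels)), shift_labels].mean()
print(loss)
```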
mindnlp.transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_decoder()

Returns the decoder of the LlavaForConditionalGeneration model.

PARAMETER DESCRIPTION
self

An instance of the LlavaForConditionalGeneration class.

RETURNS DESCRIPTION

The decoder from the language model used by the LlavaForConditionalGeneration model.

Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 379-392
def get_decoder(self):
    """
    Returns the decoder of the LlavaForConditionalGeneration model.

    Args:
        self: An instance of the LlavaForConditionalGeneration class.

    Returns:
        The decoder from the language model used by the LlavaForConditionalGeneration model.

    Raises:
        None.
    """
    return self.language_model.get_decoder()

mindnlp.transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_input_embeddings()

Get the input embeddings from the language model.

PARAMETER DESCRIPTION
self

An instance of the LlavaForConditionalGeneration class.

TYPE: LlavaForConditionalGeneration

RETURNS DESCRIPTION

The input embeddings of the underlying language model.

Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 294-307
def get_input_embeddings(self):
    """
    Get the input embeddings from the language model.

    Args:
        self (LlavaForConditionalGeneration): An instance of the LlavaForConditionalGeneration class.

    Returns:
        The input embeddings of the underlying language model.

    Raises:
        None.
    """
    return self.language_model.get_input_embeddings()

mindnlp.transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.get_output_embeddings()

Retrieve the output embeddings from the language model used for conditional generation.

PARAMETER DESCRIPTION
self

An instance of the LlavaForConditionalGeneration class.

RETURNS DESCRIPTION
The output embeddings (language modeling head) of the underlying language model; the call is delegated to language_model.get_output_embeddings().

RAISES DESCRIPTION
None

However, if the language_model.get_output_embeddings() method raises any exceptions, they will propagate up to the caller.

Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 325-340
def get_output_embeddings(self):
    """
    Retrieve the output embeddings from the language model used for conditional generation.

    Args:
        self: An instance of the LlavaForConditionalGeneration class.

    Returns:
        The output embeddings (language modeling head) of the underlying language model; the call is
            delegated to language_model.get_output_embeddings().

    Raises:
        None: However, if the language_model.get_output_embeddings() method raises any exceptions,
            they will propagate up to the caller.
    """
    return self.language_model.get_output_embeddings()

mindnlp.transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.prepare_inputs_for_generation(input_ids, past_key_values=None, inputs_embeds=None, pixel_values=None, attention_mask=None, **kwargs)

Prepares inputs for text generation in the LlavaForConditionalGeneration class.

PARAMETER DESCRIPTION
self

The instance of the LlavaForConditionalGeneration class.

TYPE: LlavaForConditionalGeneration

input_ids

The input tensor containing the tokenized input sequence.

TYPE: Tensor

past_key_values

The cache of past key values or tuple of tensors containing past key values.

TYPE: Cache or tuple of Tensor DEFAULT: None

inputs_embeds

The input embeddings tensor.

TYPE: Tensor DEFAULT: None

pixel_values

The tensor containing the pixel values.

TYPE: Tensor DEFAULT: None

attention_mask

The attention mask tensor.

TYPE: Tensor DEFAULT: None

**kwargs

Additional keyword arguments.

DEFAULT: {}

RETURNS DESCRIPTION
dict

A dictionary containing the prepared model inputs for text generation.

Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 681-751
def prepare_inputs_for_generation(
    self, input_ids, past_key_values=None, inputs_embeds=None, pixel_values=None, attention_mask=None, **kwargs
):
    '''
    Prepares inputs for text generation in the LlavaForConditionalGeneration class.

    Args:
        self (LlavaForConditionalGeneration): The instance of the LlavaForConditionalGeneration class.
        input_ids (Tensor): The input tensor containing the tokenized input sequence.
        past_key_values (Cache or tuple of Tensor, optional):
            The cache of past key values or tuple of tensors containing past key values.
        inputs_embeds (Tensor, optional): The input embeddings tensor.
        pixel_values (Tensor, optional): The tensor containing the pixel values.
        attention_mask (Tensor, optional): The attention mask tensor.
        **kwargs: Additional keyword arguments.

    Returns:
        dict: A dictionary containing the prepared model inputs for text generation.

    Raises:
        None.
    '''
    if past_key_values is not None:
        if isinstance(past_key_values, Cache):
            cache_length = past_key_values.get_seq_length()
            past_length = past_key_values.seen_tokens
        else:
            cache_length = past_length = past_key_values[0][0].shape[2]

        # Keep only the unprocessed tokens:
        # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
        # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
        # input)
        if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
            input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
        # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
        # input_ids based on the past_length.
        elif past_length < input_ids.shape[1]:
            input_ids = input_ids[:, past_length:]
        # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
        elif self.config.image_token_index in input_ids:
            input_ids = input_ids[:, input_ids.shape[1] - 1 :]
        # If the cache has seen more tokens than it can hold, then the cache has a size limit. Let's discard the
        # older attention values, as their corresponding values are not part of the input.
        if cache_length < past_length and attention_mask is not None:
            attention_mask = attention_mask[:, -(cache_length + input_ids.shape[1]) :]

    position_ids = kwargs.get("position_ids", None)
    if attention_mask is not None and position_ids is None:
        # create position_ids on the fly for batch generation
        position_ids = attention_mask.long().cumsum(-1) - 1
        position_ids = position_ids.masked_fill(attention_mask == 0, 1)
        if past_key_values:
            position_ids = position_ids[:, -input_ids.shape[1] :]

    # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
    if inputs_embeds is not None and past_key_values is None:
        model_inputs = {"inputs_embeds": inputs_embeds}
    else:
        model_inputs = {"input_ids": input_ids}

    model_inputs.update(
        {
            "position_ids": position_ids,
            "past_key_values": past_key_values,
            "use_cache": kwargs.get("use_cache"),
            "attention_mask": attention_mask,
            "pixel_values": pixel_values,
        }
    )
    return model_inputs

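The cache bookkeeping at the top of prepare_inputs_for_generation decides how much of the prompt still has to be fed through the model on the current generation step. The helper below is a standalone sketch of the three trimming branches using plain Python lists; the function name and values are hypothetical and only mirror the logic shown above.

```python
def trim_unprocessed_tokens(input_ids, attention_mask_len, past_length, image_token_index=32000):
    """Return the slice of input_ids that still needs to be processed.

    Mirrors the three branches above:
    1) part of the input lives only in the cache -> keep the tail the mask still covers,
    2) the cache is shorter than the prompt -> drop the already-cached prefix,
    3) otherwise the prompt is fully cached; if an image token is present, keep only the
       last token, else keep everything as-is.
    """
    seq_len = len(input_ids)
    if attention_mask_len > seq_len:
        return input_ids[-(attention_mask_len - past_length):]
    if past_length < seq_len:
        return input_ids[past_length:]
    if image_token_index in input_ids:
        return input_ids[seq_len - 1:]
    return input_ids

# A 6-token prompt of which 4 tokens are already in the KV cache:
print(trim_unprocessed_tokens(list(range(100, 106)), attention_mask_len=6, past_length=4))
# -> [104, 105]
```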
mindnlp.transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.resize_token_embeddings(new_num_tokens=None, pad_to_multiple_of=None)

Resize the token embeddings for conditional generation in the LlavaForConditionalGeneration class.

PARAMETER DESCRIPTION
self

The instance of the LlavaForConditionalGeneration class.

new_num_tokens

The new number of tokens to resize the embeddings to. Defaults to None. If provided, the token embeddings will be resized to accommodate this number of tokens.

TYPE: int DEFAULT: None

pad_to_multiple_of

The value to pad the token embeddings to a multiple of. Defaults to None. If provided, the token embeddings will be padded to a multiple of this value.

TYPE: int DEFAULT: None

RETURNS DESCRIPTION
Embedding

nn.Embedding: The resized token embeddings after the operation. This updated nn.Embedding object reflects the changes made to the token embeddings.

Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 409-433
def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
    """
    Resize the token embeddings for conditional generation in the LlavaForConditionalGeneration class.

    Args:
        self: The instance of the LlavaForConditionalGeneration class.

        new_num_tokens (int, optional): The new number of tokens to resize the embeddings to. Defaults to None.
            If provided, the token embeddings will be resized to accommodate this number of tokens.

        pad_to_multiple_of (int): The value to pad the token embeddings to a multiple of. Defaults to None.
            If provided, the token embeddings will be padded to a multiple of this value.

    Returns:
        nn.Embedding: The resized token embeddings after the operation.
            This updated nn.Embedding object reflects the changes made to the token embeddings.

    Raises:
        None specified.
    """
    model_embeds = self.language_model.resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
    # update vocab size
    self.config.text_config.vocab_size = model_embeds.vocab_size
    self.vocab_size = model_embeds.vocab_size
    return model_embeds

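A possible usage sketch for resize_token_embeddings, assuming the llava-hf/llava-1.5-7b-hf checkpoint used in the examples above and that mindnlp mirrors the transformers-style AutoTokenizer / from_pretrained API; adjust the names to your environment.

```python
from mindnlp.transformers import AutoTokenizer, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Add a couple of illustrative extra tokens, then grow the embedding matrix,
# padding its size to a multiple of 64 (often friendlier for accelerator kernels).
tokenizer.add_tokens(["<extra_0>", "<extra_1>"])
embeddings = model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)

# Per the source above, the resized embedding exposes vocab_size, and the method
# keeps config.text_config.vocab_size and model.vocab_size in sync with it.
print(embeddings.vocab_size, model.config.text_config.vocab_size, model.vocab_size)
```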
mindnlp.transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.set_decoder(decoder)

Sets the decoder for the language model used in LlavaForConditionalGeneration.

PARAMETER DESCRIPTION
self

The instance of the LlavaForConditionalGeneration class.

TYPE: LlavaForConditionalGeneration

decoder

The decoder object to be set for the language model. It should be compatible with the language model's requirements.

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the provided decoder is not of the correct type.

ValueError

If the decoder object is invalid or incompatible with the language model.

Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 361-377
def set_decoder(self, decoder):
    """
    Sets the decoder for the language model used in LlavaForConditionalGeneration.

    Args:
        self (LlavaForConditionalGeneration): The instance of the LlavaForConditionalGeneration class.
        decoder: The decoder object to be set for the language model.
            It should be compatible with the language model's requirements.

    Returns:
        None.

    Raises:
        TypeError: If the provided decoder is not of the correct type.
        ValueError: If the decoder object is invalid or incompatible with the language model.
    """
    self.language_model.set_decoder(decoder)

mindnlp.transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.set_input_embeddings(value)

Set the input embeddings for the LlavaForConditionalGeneration language model.

PARAMETER DESCRIPTION
self

The instance of the LlavaForConditionalGeneration class.

TYPE: LlavaForConditionalGeneration

value

The input embeddings to be set for the language model.

TYPE: Any

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 309-323
def set_input_embeddings(self, value):
    """
    Set the input embeddings for the LlavaForConditionalGeneration language model.

    Args:
        self (LlavaForConditionalGeneration): The instance of the LlavaForConditionalGeneration class.
        value (Any): The input embeddings to be set for the language model.

    Returns:
        None.

    Raises:
        None.
    """
    self.language_model.set_input_embeddings(value)

mindnlp.transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.set_output_embeddings(new_embeddings)

Sets the output embeddings for the LlavaForConditionalGeneration model.

PARAMETER DESCRIPTION
self

An instance of the LlavaForConditionalGeneration class.

TYPE: LlavaForConditionalGeneration

new_embeddings

The new output embeddings to be set for the model. It should have the same shape as the original output embeddings.

TYPE: Tensor

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
TypeError

If the provided new_embeddings parameter is not of type Tensor.

ValueError

If the shape of the new_embeddings parameter does not match the shape of the original output embeddings.

Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 342-359
def set_output_embeddings(self, new_embeddings):
    """
    Sets the output embeddings for the LlavaForConditionalGeneration model.

    Args:
        self (LlavaForConditionalGeneration): An instance of the LlavaForConditionalGeneration class.
        new_embeddings (Tensor): The new output embeddings to be set for the model.
            It should have the same shape as the original output embeddings.

    Returns:
        None.

    Raises:
        TypeError: If the provided new_embeddings parameter is not of type Tensor.
        ValueError: If the shape of the new_embeddings parameter does not match the
            shape of the original output embeddings.
    """
    self.language_model.set_output_embeddings(new_embeddings)

mindnlp.transformers.models.llava.modeling_llava.LlavaForConditionalGeneration.tie_weights()

Ties the weights of the language model used for conditional generation in the LlavaForConditionalGeneration class.

PARAMETER DESCRIPTION
self

An instance of the LlavaForConditionalGeneration class.

TYPE: LlavaForConditionalGeneration

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 394-407
def tie_weights(self):
    """
    Ties the weights of the language model used for conditional generation in the LlavaForConditionalGeneration class.

    Args:
        self (LlavaForConditionalGeneration): An instance of the LlavaForConditionalGeneration class.

    Returns:
        None.

    Raises:
        None.
    """
    return self.language_model.tie_weights()

mindnlp.transformers.models.llava.modeling_llava.LlavaMultiModalProjector

Bases: Module

LlavaMultiModalProjector is a class representing a multi-modal projector for processing image and text data simultaneously. It facilitates the transformation of image features through linear layers with activation functions to map them to text features.

This class inherits from nn.Module and provides methods for initialization and for forwarding image features through the projection. The initialization method builds the linear layers and activation function from the provided configuration, and the forward method applies them to the input image features to produce the hidden states used as text-side representations.

Example
>>> config = LlavaConfig(vision_config=..., text_config=..., projector_hidden_act=...)
>>> projector = LlavaMultiModalProjector(config)
>>> hidden_states = projector.forward(image_features)
Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 86-149
class LlavaMultiModalProjector(nn.Module):

    """
    LlavaMultiModalProjector is a class representing a multi-modal projector for processing image and text data
    simultaneously. It facilitates the transformation of image features through linear layers with activation functions
    to map them to text features.

    This class inherits from nn.Module and contains methods for initialization and forwarding the projection of image
    features to text features.
    The initialization method initializes the linear layers and activation function based on the provided configuration. 
    The forward method applies the linear transformations and activation functions to the input image features to
    generate the final hidden states for text representation.

    Example:
        ```python
        >>> config = LlavaConfig(vision_config=..., text_config=..., projector_hidden_act=...)
        >>> projector = LlavaMultiModalProjector(config)
        >>> hidden_states = projector.forward(image_features)
        ```
    """
    def __init__(self, config: LlavaConfig):
        """
        Initializes an instance of the LlavaMultiModalProjector class.

        Args:
            self: The object instance.
            config (LlavaConfig):
                The configuration object containing the settings for the projector.

                - config.vision_config.hidden_size (int): The size of the hidden layer for the visual input.
                - config.text_config.hidden_size (int): The size of the hidden layer for the text input.
                - config.projector_hidden_act (str): The activation function for the hidden layer.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__()

        self.linear_1 = nn.Linear(config.vision_config.hidden_size, config.text_config.hidden_size, bias=True)
        self.act = ACT2FN[config.projector_hidden_act]
        self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=True)

    def forward(self, image_features):
        '''
        This method forwards a multi-modal projector within the LlavaMultiModalProjector class.

        Args:
            self (LlavaMultiModalProjector): The instance of the LlavaMultiModalProjector class.
            image_features (tensor): The input tensor containing image features.

        Returns:
            tensor:
                The hidden states tensor obtained after processing the image features through linear and activation layers.

        Raises:
            None.
        '''
        hidden_states = self.linear_1(image_features)
        hidden_states = self.act(hidden_states)
        hidden_states = self.linear_2(hidden_states)
        return hidden_states

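Functionally, the projector is a two-layer MLP that maps vision-tower features of size vision_config.hidden_size into the text model's hidden size. The NumPy sketch below reproduces only that shape transformation with illustrative sizes (1024-dimensional CLIP features, a 4096-dimensional text hidden size, 576 image patches) and a tanh-approximated GELU; it is an intuition aid, not the mindnlp implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, standing in for projector_hidden_act="gelu"
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

vision_hidden, text_hidden, num_patches = 1024, 4096, 576   # assumed sizes
w1 = np.random.normal(0.0, 0.02, (vision_hidden, text_hidden)); b1 = np.zeros(text_hidden)
w2 = np.random.normal(0.0, 0.02, (text_hidden, text_hidden)); b2 = np.zeros(text_hidden)

image_features = np.random.normal(size=(1, num_patches, vision_hidden))  # (batch, patches, vision_hidden)
hidden_states = gelu(image_features @ w1 + b1)   # linear_1 + activation
hidden_states = hidden_states @ w2 + b2          # linear_2
print(hidden_states.shape)                       # (1, 576, 4096)
```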
mindnlp.transformers.models.llava.modeling_llava.LlavaMultiModalProjector.__init__(config)

Initializes an instance of the LlavaMultiModalProjector class.

PARAMETER DESCRIPTION
self

The object instance.

config

The configuration object containing the settings for the projector.

  • config.vision_config.hidden_size (int): The size of the hidden layer for the visual input.
  • config.text_config.hidden_size (int): The size of the hidden layer for the text input.
  • config.projector_hidden_act (str): The activation function for the hidden layer.

TYPE: LlavaConfig

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 106-129
def __init__(self, config: LlavaConfig):
    """
    Initializes an instance of the LlavaMultiModalProjector class.

    Args:
        self: The object instance.
        config (LlavaConfig):
            The configuration object containing the settings for the projector.

            - config.vision_config.hidden_size (int): The size of the hidden layer for the visual input.
            - config.text_config.hidden_size (int): The size of the hidden layer for the text input.
            - config.projector_hidden_act (str): The activation function for the hidden layer.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__()

    self.linear_1 = nn.Linear(config.vision_config.hidden_size, config.text_config.hidden_size, bias=True)
    self.act = ACT2FN[config.projector_hidden_act]
    self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=True)

mindnlp.transformers.models.llava.modeling_llava.LlavaMultiModalProjector.forward(image_features)

This method forwards a multi-modal projector within the LlavaMultiModalProjector class.

PARAMETER DESCRIPTION
self

The instance of the LlavaMultiModalProjector class.

TYPE: LlavaMultiModalProjector

image_features

The input tensor containing image features.

TYPE: tensor

RETURNS DESCRIPTION
tensor

The hidden states tensor obtained after processing the image features through linear and activation layers.

Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 131-149
def forward(self, image_features):
    '''
    This method forwards a multi-modal projector within the LlavaMultiModalProjector class.

    Args:
        self (LlavaMultiModalProjector): The instance of the LlavaMultiModalProjector class.
        image_features (tensor): The input tensor containing image features.

    Returns:
        tensor:
            The hidden states tensor obtained after processing the image features through linear and activation layers.

    Raises:
        None.
    '''
    hidden_states = self.linear_1(image_features)
    hidden_states = self.act(hidden_states)
    hidden_states = self.linear_2(hidden_states)
    return hidden_states

mindnlp.transformers.models.llava.modeling_llava.LlavaPreTrainedModel

Bases: PreTrainedModel

The LlavaPreTrainedModel class is a subclass of the PreTrainedModel class in the Hugging Face library. It represents a pre-trained model for natural language processing tasks.

This class provides functionality for initializing the weights of the model's cells. The _init_weights method is used to set the initial weights of the model's cells based on the specified configuration. The method supports different types of cells, including Linear, Conv2d, and Embedding.

If the cell has a class_embedding attribute, the method initializes it using a normal distribution with a standard deviation specified by the initializer_range attribute of the configuration.

For Linear and Conv2d cells, the method initializes the weight attribute using a normal distribution with the same standard deviation as above. If the cell has a bias attribute, it is initialized with zeros.

For Embedding cells, the method initializes the weight attribute using a normal distribution with the same standard deviation as above. If the cell has a padding_idx attribute, the corresponding element in the weight matrix is set to zero.

Note

The LlavaPreTrainedModel class assumes that the PreTrainedModel class is available in the code environment.

Source code in mindnlp/transformers/models/llava/modeling_llava.py, lines 152-236
class LlavaPreTrainedModel(PreTrainedModel):

    """
    The `LlavaPreTrainedModel` class is a subclass of the `PreTrainedModel` class in the Hugging Face library.
    It represents a pre-trained model for natural language processing tasks.

    This class provides functionality for initializing the weights of the model's cells.
    The `_init_weights` method is used to set the initial weights of the model's cells based on the specified configuration.
    The method supports different types of cells, including `Linear`, `Conv2d`, and `Embedding`.

    If the cell has a `class_embedding` attribute, the method initializes it using a normal distribution with
    a standard deviation specified by the `initializer_range` attribute of the configuration.

    For `Linear` and `Conv2d` cells, the method initializes the `weight` attribute using a normal distribution
    with the same standard deviation as above. If the cell has a `bias` attribute, it is initialized with zeros.

    For `Embedding` cells, the method initializes the `weight` attribute using a normal distribution with the same
    standard deviation as above. If the cell has a `padding_idx` attribute, the corresponding element in the
    weight matrix is set to zero.

    Note:
        The `LlavaPreTrainedModel` class assumes that the `PreTrainedModel` class is available in the code environment.

    """
    config_class = LlavaConfig
    base_model_prefix = "model"
    _no_split_modules = ["LlavaVisionAttention"]
    _skip_keys_device_placement = "past_key_values"

    def _init_weights(self, cell):
        """
        Initializes the weights of a given cell in the LlavaPreTrainedModel.

        Args:
            self (LlavaPreTrainedModel): The instance of the LlavaPreTrainedModel class.
            cell: The cell whose weights need to be initialized.

        Returns:
            None: This method modifies the weights of the given cell in-place.

        Raises:
            None.

        This method initializes the weights of the provided cell based on the configuration settings of the LlavaPreTrainedModel.
        If the configuration has an 'initializer_range' attribute, the standard deviation is set to that value. Otherwise,
        it falls back to the 'initializer_range' value in the 'text_config' attribute of the configuration.

        If the cell has a 'class_embedding' attribute, it is initialized using a normal distribution with the
        calculated standard deviation.

        If the cell is an instance of nn.Linear or nn.Conv2d, both the weight and bias are initialized using a normal
        distribution with the calculated standard deviation. If the cell has no bias, it remains unchanged.

        If the cell is an instance of nn.Embedding, the weight tensor is initialized using a normal distribution with
        the calculated standard deviation. If a 'padding_idx' is specified, the corresponding weight value is set to 0.

        Note:
            - The weight initialization is done in-place and modifies the original cell.
            - The 'initializer_range' attribute must be present either in the configuration or the 'text_config'
            attribute of the configuration.
            - The 'padding_idx' attribute is optional for nn.Embedding cells.
            - The method does not return any value.
        """
        # important: this ported version of Llava isn't meant for training from scratch - only
        # inference and fine-tuning - so the proper init weights code has been removed - the original codebase
        # https://github.com/haotian-liu/LLaVA/tree/main/llava should serve for that purpose
        std = (
            self.config.initializer_range
            if hasattr(self.config, "initializer_range")
            else self.config.text_config.initializer_range
        )

        if hasattr(cell, "class_embedding"):
            cell.class_embedding.initialize(Normal(std))

        if isinstance(cell, (nn.Linear, nn.Conv2d)):
            cell.weight.data.initialize(Normal(std))
            if cell.bias is not None:
                cell.bias.initialize('zeros')
        elif isinstance(cell, nn.Embedding):
            weight = np.random.normal(0.0, std, cell.weight.shape)
            if cell.padding_idx:
                weight[cell.padding_idx] = 0

            cell.weight.set_data(Tensor(weight, cell.weight.dtype))

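The _init_weights policy above draws weights from a normal distribution whose standard deviation comes from config.initializer_range (falling back to text_config.initializer_range), starts biases at zero, and zeroes the embedding row of the padding token. The toy NumPy re-implementation below illustrates that policy; the standard deviation of 0.02 and the shapes are assumed values, not read from any config.

```python
import numpy as np

def init_like_llava(shape, std, kind="linear", padding_idx=None):
    """Toy NumPy version of the _init_weights policy (illustration only)."""
    weight = np.random.normal(0.0, std, size=shape)
    if kind == "embedding" and padding_idx is not None:
        weight[padding_idx] = 0.0                                # zero the padding-token row
    bias = np.zeros(shape[0]) if kind == "linear" else None      # biases start at zero
    return weight, bias

std = 0.02  # assumed; the real value comes from config.initializer_range
linear_w, linear_b = init_like_llava((4096, 1024), std)
embed_w, _ = init_like_llava((32064, 4096), std, kind="embedding", padding_idx=0)
print(round(float(linear_w.std()), 3), float(linear_b.sum()), float(embed_w[0].sum()))
```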
mindnlp.transformers.models.llava.processing_llava

Processor class for Llava.

mindnlp.transformers.models.llava.processing_llava.LlavaProcessor

Bases: ProcessorMixin

Constructs a Llava processor which wraps a Llava image processor and a Llava tokenizer into a single processor.

[LlavaProcessor] offers all the functionalities of [CLIPImageProcessor] and [LlamaTokenizerFast]. See the [~LlavaProcessor.__call__] and [~LlavaProcessor.decode] for more information.

PARAMETER DESCRIPTION
image_processor

The image processor is a required input.

TYPE: [`CLIPImageProcessor`], *optional* DEFAULT: None

tokenizer

The tokenizer is a required input.

TYPE: [`LlamaTokenizerFast`], *optional* DEFAULT: None

Source code in mindnlp/transformers/models/llava/processing_llava.py, lines 28-162
class LlavaProcessor(ProcessorMixin):
    r"""
    Constructs a Llava processor which wraps a Llava image processor and a Llava tokenizer into a single processor.

    [`LlavaProcessor`] offers all the functionalities of [`CLIPImageProcessor`] and [`LlamaTokenizerFast`]. See the
    [`~LlavaProcessor.__call__`] and [`~LlavaProcessor.decode`] for more information.

    Args:
        image_processor ([`CLIPImageProcessor`], *optional*):
            The image processor is a required input.
        tokenizer ([`LlamaTokenizerFast`], *optional*):
            The tokenizer is a required input.
    """
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "CLIPImageProcessor"
    tokenizer_class = ("LlamaTokenizer", "LlamaTokenizerFast")

    def __init__(self, image_processor=None, tokenizer=None):
        """
        Initializes a new instance of the LlavaProcessor class.

        Args:
            self (LlavaProcessor): The current instance of the LlavaProcessor class.
            image_processor (object, optional): An object that handles image processing operations. Defaults to None.
            tokenizer (object, optional): An object that handles tokenization operations. Defaults to None.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(image_processor, tokenizer)

    def __call__(
        self,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        images: ImageInput = None,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length=None,
        return_tensors: Optional[Union[str, TensorType]] = None,
    ) -> BatchFeature:
        """
        Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
        and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
        the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
        CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
        of the above two methods for more information.

        Args:
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. Both channels-first and channels-last formats are supported.
            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
                Select a strategy to pad the returned sequences (according to the model's padding side and padding
                index) among:

                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
                sequence if provided).
                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
                acceptable input length for the model if that argument is not provided.
                - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
                lengths).
            max_length (`int`, *optional*):
                Maximum length of the returned list and optionally padding length (see above).
            truncation (`bool`, *optional*):
                Activates truncation to cut input sequences longer than `max_length` to `max_length`.
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.

        Returns:
            [`BatchFeature`]:
                A [`BatchFeature`] with the following fields:

                - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
                - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
                `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
                `None`).
                - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
        """
        if images is not None:
            pixel_values = self.image_processor(
                images, return_tensors=return_tensors)["pixel_values"]
        else:
            pixel_values = None
        text_inputs = self.tokenizer(
            text, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length
        )

        return BatchFeature(data={**text_inputs, "pixel_values": pixel_values})

    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode with CLIP->Llama
    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Llama
    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names
    def model_input_names(self):
        """
        Retrieve the unique model input names from the tokenizer and image processor.

        Args:
            self (LlavaProcessor): An instance of the LlavaProcessor class.

        Returns:
            list: A list of unique model input names extracted from the tokenizer and image processor.

        Raises:
            None.
        """
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))

mindnlp.transformers.models.llava.processing_llava.LlavaProcessor.model_input_names property

Retrieve the unique model input names from the tokenizer and image processor.

PARAMETER DESCRIPTION
self

An instance of the LlavaProcessor class.

TYPE: LlavaProcessor

RETURNS DESCRIPTION
list

A list of unique model input names extracted from the tokenizer and image processor.

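Concretely, the property concatenates the tokenizer's and image processor's input names and drops duplicates while preserving order, which is exactly what dict.fromkeys does; the values below are typical for a LLaVA checkpoint but are shown here for illustration only.

```python
tokenizer_input_names = ["input_ids", "attention_mask"]
image_processor_input_names = ["pixel_values", "attention_mask"]

merged = list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
print(merged)  # ['input_ids', 'attention_mask', 'pixel_values']
```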
mindnlp.transformers.models.llava.processing_llava.LlavaProcessor.__call__(text=None, images=None, padding=False, truncation=None, max_length=None, return_tensors=None)

Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the text and kwargs arguments to LlamaTokenizerFast's [~LlamaTokenizerFast.__call__] if text is not None to encode the text. To prepare the image(s), this method forwards the images and kwargs arguments to CLIPImageProcessor's [~CLIPImageProcessor.__call__] if images is not None. Please refer to the docstring of the above two methods for more information.

PARAMETER DESCRIPTION
text

The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).

TYPE: `str`, `List[str]`, `List[List[str]]` DEFAULT: None

images

The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch tensor. Both channels-first and channels-last formats are supported.

TYPE: `PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]` DEFAULT: None

padding

Select a strategy to pad the returned sequences (according to the model's padding side and padding index) among:

  • True or 'longest': Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
  • 'max_length': Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
  • False or 'do_not_pad' (default): No padding (i.e., can output a batch with sequences of different lengths).

TYPE: `bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False` DEFAULT: False

max_length

Maximum length of the returned list and optionally padding length (see above).

TYPE: `int`, *optional* DEFAULT: None

truncation

Activates truncation to cut input sequences longer than max_length to max_length.

TYPE: `bool`, *optional* DEFAULT: None

return_tensors

If set, will return tensors of a particular framework. Acceptable values are:

  • 'tf': Return TensorFlow tf.constant objects.
  • 'pt': Return PyTorch torch.Tensor objects.
  • 'np': Return NumPy np.ndarray objects.
  • 'jax': Return JAX jnp.ndarray objects.

TYPE: `str` or [`~utils.TensorType`], *optional* DEFAULT: None

RETURNS DESCRIPTION
BatchFeature

[BatchFeature]: A [BatchFeature] with the following fields:

  • input_ids -- List of token ids to be fed to a model. Returned when text is not None.
  • attention_mask -- List of indices specifying which tokens should be attended to by the model (when return_attention_mask=True or if "attention_mask" is in self.model_input_names and if text is not None).
  • pixel_values -- Pixel values to be fed to a model. Returned when images is not None.
Source code in mindnlp/transformers/models/llava/processing_llava.py, lines 62-127
def __call__(
    self,
    text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
    images: ImageInput = None,
    padding: Union[bool, str, PaddingStrategy] = False,
    truncation: Union[bool, str, TruncationStrategy] = None,
    max_length=None,
    return_tensors: Optional[Union[str, TensorType]] = None,
) -> BatchFeature:
    """
    Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
    and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
    the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
    CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
    of the above two methods for more information.

    Args:
        text (`str`, `List[str]`, `List[List[str]]`):
            The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
            (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
            `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
        images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
            The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
            tensor. Both channels-first and channels-last formats are supported.
        padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
            Select a strategy to pad the returned sequences (according to the model's padding side and padding
            index) among:

            - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
            sequence if provided).
            - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
            acceptable input length for the model if that argument is not provided.
            - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
            lengths).
        max_length (`int`, *optional*):
            Maximum length of the returned list and optionally padding length (see above).
        truncation (`bool`, *optional*):
            Activates truncation to cut input sequences longer than `max_length` to `max_length`.
        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors of a particular framework. Acceptable values are:

            - `'tf'`: Return TensorFlow `tf.constant` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return NumPy `np.ndarray` objects.
            - `'jax'`: Return JAX `jnp.ndarray` objects.

    Returns:
        [`BatchFeature`]:
            A [`BatchFeature`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
            `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
            `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
    """
    if images is not None:
        pixel_values = self.image_processor(
            images, return_tensors=return_tensors)["pixel_values"]
    else:
        pixel_values = None
    text_inputs = self.tokenizer(
        text, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length
    )

    return BatchFeature(data={**text_inputs, "pixel_values": pixel_values})

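A possible batched call to the processor, assuming the llava-hf/llava-1.5-7b-hf checkpoint from the examples above and that AutoProcessor is available from mindnlp.transformers; the prompts and all-black dummy images are placeholders.

```python
import numpy as np
from PIL import Image
from mindnlp.transformers import AutoProcessor  # assumed to mirror the transformers API

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompts = [
    "USER: <image>\nDescribe the image. ASSISTANT:",
    "USER: <image>\nIs there a stop sign? ASSISTANT:",
]
# Two dummy RGB images standing in for real inputs.
images = [Image.fromarray(np.zeros((336, 336, 3), dtype=np.uint8)) for _ in prompts]

batch = processor(text=prompts, images=images, padding="longest", return_tensors="np")
print(sorted(batch.keys()))  # ['attention_mask', 'input_ids', 'pixel_values']
```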
mindnlp.transformers.models.llava.processing_llava.LlavaProcessor.__init__(image_processor=None, tokenizer=None)

Initializes a new instance of the LlavaProcessor class.

PARAMETER DESCRIPTION
self

The current instance of the LlavaProcessor class.

TYPE: LlavaProcessor

image_processor

An object that handles image processing operations. Defaults to None.

TYPE: object DEFAULT: None

tokenizer

An object that handles tokenization operations. Defaults to None.

TYPE: object DEFAULT: None

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/llava/processing_llava.py, lines 45-60
def __init__(self, image_processor=None, tokenizer=None):
    """
    Initializes a new instance of the LlavaProcessor class.

    Args:
        self (LlavaProcessor): The current instance of the LlavaProcessor class.
        image_processor (object, optional): An object that handles image processing operations. Defaults to None.
        tokenizer (object, optional): An object that handles tokenization operations. Defaults to None.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(image_processor, tokenizer)

mindnlp.transformers.models.llava.processing_llava.LlavaProcessor.batch_decode(*args, **kwargs)

This method forwards all its arguments to LlamaTokenizerFast's [~PreTrainedTokenizer.batch_decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/llava/processing_llava.py, lines 130-135
def batch_decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
    refer to the docstring of this method for more information.
    """
    return self.tokenizer.batch_decode(*args, **kwargs)

mindnlp.transformers.models.llava.processing_llava.LlavaProcessor.decode(*args, **kwargs)

This method forwards all its arguments to LlamaTokenizerFast's [~PreTrainedTokenizer.decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/llava/processing_llava.py, lines 138-143
def decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
    the docstring of this method for more information.
    """
    return self.tokenizer.decode(*args, **kwargs)