flava

mindnlp.transformers.models.flava.configuration_flava

FLAVA model configurations

mindnlp.transformers.models.flava.configuration_flava.FlavaConfig

Bases: PretrainedConfig

[FlavaConfig] is the configuration class to store the configuration of a [FlavaModel]. It is used to instantiate a FLAVA model according to the specified arguments, defining the text model, image model, image codebook and multimodal model configs. Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA facebook/flava-full architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
text_config

Dictionary of configuration options used to initialize [FlavaTextConfig].

TYPE: `dict`, *optional* DEFAULT: None

image_config

Dictionary of configuration options used to initialize [FlavaImageConfig].

TYPE: `dict`, *optional* DEFAULT: None

multimodal_config

Dictionary of configuration options used to initialize [FlavaMultimodalConfig].

TYPE: `dict`, *optional* DEFAULT: None

hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-12 DEFAULT: 1e-12

projection_dim

Dimensionality of the text and image projection layers.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

logit_scale_init_value

The initial value of the logit_scale parameter. The default is used as per the original FLAVA/CLIP implementation.

TYPE: `float`, *optional*, defaults to 2.6592 DEFAULT: 2.6592

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

ce_ignore_index

Cross entropy index to ignore.

TYPE: `int`, *optional*, defaults to -100 DEFAULT: -100

mim_weight

Weight to be assigned to MIM (Masked Image Modeling) unimodal loss

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

mlm_weight

Weight to be assigned to MLM (Masked Language Modeling) unimodal loss

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

global_contrastive_weight

Weight to be assigned to global contrastive cross-alignment loss.

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

itm_weight

Weight to be assigned to image-text matching multimodal loss.

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

mmm_image_weight

Weight to be assigned to MMM loss's image part.

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

mmm_text_weight

Weight to be assigned to MMM loss's text part.

TYPE: `float`, *optional*, defaults to 1.0 DEFAULT: 1.0

global_backprop_contrastive

Whether to use global backpropagation through all workers in the contrastive loss.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

skip_unmasked_multimodal_encoder

Whether to skip running the unmasked multimodal encoder, whose outputs are not used by the FLAVA losses.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

return_loss

Whether to return the loss or not.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

kwargs

Dictionary of keyword arguments.

TYPE: *optional* DEFAULT: {}

Example
>>> from transformers import FlavaConfig, FlavaModel, FlavaForPreTraining
...
>>> # Initializing a FlavaConfig with the facebook/flava-full style configuration
>>> configuration = FlavaConfig()
...
>>> # Initializing a FlavaModel and FlavaForPreTraining model (with random weights) from the style configuration
>>> model = FlavaModel(configuration)
>>> model_pre = FlavaForPreTraining(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
>>> configuration_pre = model_pre.config
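
For a fuller illustration, the sketch below also overrides a few nested options via plain dicts and adjusts two of the loss weights; the dicts are converted into full sub-config objects internally. It assumes the same classes are re-exported from `mindnlp.transformers`, as the module path above suggests.

```python
>>> from mindnlp.transformers import FlavaConfig, FlavaModel
...
>>> # Customize the nested text/image configs with plain dicts and tweak two loss weights
>>> configuration = FlavaConfig(
...     text_config={"num_hidden_layers": 6},
...     image_config={"image_size": 224, "patch_size": 16},
...     mlm_weight=0.5,
...     itm_weight=2.0,
... )
...
>>> # The dicts are promoted to full FlavaTextConfig / FlavaImageConfig instances
>>> configuration.text_config.num_hidden_layers
6
>>> configuration.image_config.patch_size
16
>>> model = FlavaModel(configuration)
```
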
Source code in mindnlp/transformers/models/flava/configuration_flava.py
class FlavaConfig(PretrainedConfig):
    r"""
    [`FlavaConfig`] is the configuration class to store the configuration of a [`FlavaModel`]. It is used to
    instantiate a FLAVA model according to the specified arguments, defining the text model, image model, image codebook
    and multimodal model configs. Instantiating a configuration with the defaults will yield a similar configuration to
    that of the FLAVA [facebook/flava-full](https://huggingface.co/facebook/flava-full) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        text_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`FlavaTextConfig`].
        image_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`FlavaImageConfig`].
        multimodal_config (`dict`, *optional*):
            Dictionary of configuration options used to initialize [`FlavaMultimodalConfig`].
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        projection_dim (`int`, *optional*, defaults to 768):
            Dimensionality of the text and image projection layers.
        logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
            The initial value of the *logit_scale* parameter. The default is used as per the original FLAVA/CLIP
            implementation.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        ce_ignore_index (`int`, *optional*, defaults to -100):
            Cross entropy index to ignore.
        mim_weight (`float`, *optional*, defaults to 1.0):
            Weight to be assigned to MIM (Masked Image Modeling) unimodal loss
        mlm_weight (`float`, *optional*, defaults to 1.0):
            Weight to be assigned to MLM (Masked Language Modeling) unimodal loss
        global_contrastive_weight (`float`, *optional*, defaults to 1.0):
            Weight to be assigned to global contrastive cross-alignment loss.
        itm_weight (`float`, *optional*, defaults to 1.0):
            Weight to be assigned to image-text matching multimodal loss.
        mmm_image_weight (`float`, *optional*, defaults to 1.0):
            Weight to be assigned to MMM loss's image part.
        mmm_text_weight (`float`, *optional*, defaults to 1.0):
            Weight to be assigned to MMM loss's text part.
        global_backprop_contrastive (`bool`, *optional*, defaults to `True`):
            Whether to use global backpropagation through all workers in the contrastive loss.
        skip_unmasked_multimodal_encoder (`bool`, *optional*, defaults to `True`):
            Whether to skip running the unmasked multimodal encoder, whose outputs are not used by the FLAVA losses.
        return_loss (`bool`, *optional*, defaults to `True`):
            Whether to return the loss or not.

        kwargs (*optional*):
            Dictionary of keyword arguments.

    Example:
        ```python
        >>> from transformers import FlavaConfig, FlavaModel, FlavaForPreTraining
        ...
        >>> # Initializing a FlavaConfig with the facebook/flava-full style configuration
        >>> configuration = FlavaConfig()
        ...
        >>> # Initializing a FlavaModel and FlavaForPreTraining model (with random weights) from the style configuration
        >>> model = FlavaModel(configuration)
        >>> model_pre = FlavaForPreTraining(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        >>> configuration_pre = model_pre.config
        ```
    """

    model_type = "flava"

    def __init__(
        self,
        image_config: Dict[str, Any] = None,
        text_config: Dict[str, Any] = None,
        multimodal_config: Dict[str, Any] = None,
        image_codebook_config: Dict[str, Any] = None,
        hidden_size: int = 768,
        layer_norm_eps: float = 1e-12,
        projection_dim: int = 768,
        init_codebook: bool = True,
        logit_scale_init_value: float = 2.6592,
        initializer_range: float = 0.02,
        ce_ignore_index: int = -100,
        mim_weight: float = 1.0,
        mlm_weight: float = 1.0,
        global_contrastive_weight: float = 1.0,
        itm_weight: float = 1.0,
        mmm_image_weight: float = 1.0,
        mmm_text_weight: float = 1.0,
        global_backprop_contrastive: bool = True,
        skip_unmasked_multimodal_encoder: bool = True,
        return_loss: bool = True,
        **kwargs,
    ):
        # If the `_config_dict` kwargs exist, we use them for backward compatibility.
        # We pop out these attributes before calling `super().__init__` to avoid them being saved (which causes a lot
        # of confusion!).
        text_config_dict = kwargs.pop("text_config_dict", None)
        image_config_dict = kwargs.pop("image_config_dict", None)
        multimodal_config_dict = kwargs.pop("multimodal_config_dict", None)
        image_codebook_config_dict = kwargs.pop("image_codebook_config_dict", None)

        super().__init__(**kwargs)

        # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in
        # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be the same in most
        # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`.
        if text_config_dict is not None:
            if text_config is None:
                text_config = {}

            # This is the complete result when using `text_config_dict`.
            _text_config_dict = FlavaTextConfig(**text_config_dict).to_dict()

            # Give a warning if the values exist in both `_text_config_dict` and `text_config` but are different.
            for key, value in _text_config_dict.items():
                if key in text_config and value != text_config[key] and key not in ["transformers_version"]:
                    # If specified in `text_config_dict`
                    if key in text_config_dict:
                        message = (
                            f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. "
                            f'The value `text_config_dict["{key}"]` will be used instead.'
                        )
                    # If inferred from default argument values (just to be super careful)
                    else:
                        message = (
                            f"`text_config_dict` is provided which will be used to initialize `FlavaTextConfig`. The "
                            f'value `text_config["{key}"]` will be overridden.'
                        )
                    logger.info(message)

            # Update all values in `text_config` with the ones in `_text_config_dict`.
            text_config.update(_text_config_dict)

        if image_config_dict is not None:
            if image_config is None:
                image_config = {}

            # This is the complete result when using `image_config_dict`.
            _image_config_dict = FlavaImageConfig(**image_config_dict).to_dict()
            # convert keys to string instead of integer
            if "id2label" in _image_config_dict:
                _image_config_dict["id2label"] = {
                    str(key): value for key, value in _image_config_dict["id2label"].items()
                }

            # Give a warning if the values exist in both `_image_config_dict` and `image_config` but are different.
            for key, value in _image_config_dict.items():
                if key in image_config and value != image_config[key] and key not in ["transformers_version"]:
                    # If specified in `image_config_dict`
                    if key in image_config_dict:
                        message = (
                            f"`{key}` is found in both `image_config_dict` and `image_config` but with different "
                            f'values. The value `image_config_dict["{key}"]` will be used instead.'
                        )
                    # If inferred from default argument values (just to be super careful)
                    else:
                        message = (
                            f"`image_config_dict` is provided which will be used to initialize `FlavaImageConfig`. "
                            f'The value `image_config["{key}"]` will be overridden.'
                        )
                    logger.info(message)

            # Update all values in `image_config` with the ones in `_image_config_dict`.
            image_config.update(_image_config_dict)

        if multimodal_config_dict is not None:
            if multimodal_config is None:
                multimodal_config = {}

            # This is the complete result when using `multimodal_config_dict`.
            _multimodal_config_dict = FlavaMultimodalConfig(**multimodal_config_dict).to_dict()

            # Give a warning if the values exist in both `_multimodal_config_dict` and `multimodal_config` but are
            # different.
            for key, value in _multimodal_config_dict.items():
                if (
                    key in multimodal_config
                    and value != multimodal_config[key]
                    and key not in ["transformers_version"]
                ):
                    # If specified in `multimodal_config_dict`
                    if key in multimodal_config_dict:
                        message = (
                            f"`{key}` is found in both `multimodal_config_dict` and `multimodal_config` but with "
                            f'different values. The value `multimodal_config_dict["{key}"]` will be used instead.'
                        )
                    # If inferred from default argument values (just to be super careful)
                    else:
                        message = (
                            f"`multimodal_config_dict` is provided which will be used to initialize "
                            f'`FlavaMultimodalConfig`. The value `multimodal_config["{key}"]` will be overridden.'
                        )
                    logger.info(message)

            # Update all values in `multimodal_config` with the ones in `_multimodal_config_dict`.
            multimodal_config.update(_multimodal_config_dict)

        if image_codebook_config_dict is not None:
            if image_codebook_config is None:
                image_codebook_config = {}

            # This is the complete result when using `image_codebook_config_dict`.
            _image_codebook_config_dict = FlavaImageCodebookConfig(**image_codebook_config_dict).to_dict()

            # Give a warning if the values exist in both `_image_codebook_config_dict` and `image_codebook_config` but
            # are different.
            for key, value in _image_codebook_config_dict.items():
                if (
                    key in image_codebook_config
                    and value != image_codebook_config[key]
                    and key not in ["transformers_version"]
                ):
                    # If specified in `image_codebook_config_dict`
                    if key in image_codebook_config_dict:
                        message = (
                            f"`{key}` is found in both `image_codebook_config_dict` and `image_codebook_config` but "
                            f'with different values. The value `image_codebook_config_dict["{key}"]` will be used '
                            "instead."
                        )
                    # If inferred from default argument values (just to be super careful)
                    else:
                        message = (
                            f"`image_codebook_config_dict` is provided which will be used to initialize "
                            f'`FlavaImageCodebookConfig`. The value `image_codebook_config["{key}"]` will be overridden.'
                        )
                    logger.info(message)

            # Update all values in `image_codebook_config` with the ones in `_image_codebook_config_dict`.
            image_codebook_config.update(_image_codebook_config_dict)

        if image_config is None:
            image_config = {}
            logger.info("`image_config` is `None`. initializing the `FlavaImageConfig` with default values.")

        if text_config is None:
            text_config = {}
            logger.info("`text_config` is `None`. Initializing the `FlavaTextConfig` with default values.")

        if multimodal_config is None:
            multimodal_config = {}
            logger.info("`multimodal_config` is `None`. initializing the `FlavaMultimodalConfig` with default values.")

        if image_codebook_config is None:
            image_codebook_config = {}
            logger.info(
                "`image_codebook_config` is `None`. initializing the `FlavaImageCodebookConfig` with default values."
            )

        self.image_config = FlavaImageConfig(**image_config)
        self.text_config = FlavaTextConfig(**text_config)
        self.multimodal_config = FlavaMultimodalConfig(**multimodal_config)
        self.image_codebook_config = FlavaImageCodebookConfig(**image_codebook_config)
        self.projection_dim = projection_dim
        self.init_codebook = init_codebook

        self.hidden_size = hidden_size
        self.layer_norm_eps = layer_norm_eps
        self.initializer_range = initializer_range
        self.logit_scale_init_value = logit_scale_init_value
        self.initializer_factor = 1.0
        self.ce_ignore_index = ce_ignore_index
        self.mim_weight = mim_weight
        self.mlm_weight = mlm_weight
        self.global_contrastive_weight = global_contrastive_weight
        self.itm_weight = itm_weight
        self.mmm_image_weight = mmm_image_weight
        self.mmm_text_weight = mmm_text_weight
        self.global_backprop_contrastive = global_backprop_contrastive
        self.skip_unmasked_multimodal_encoder = skip_unmasked_multimodal_encoder
        self.return_loss = return_loss

    @classmethod
    def from_configs(
        cls,
        image_config: FlavaImageConfig,
        text_config: FlavaTextConfig,
        multimodal_config: FlavaMultimodalConfig,
        image_codebook_config: FlavaImageCodebookConfig,
        **kwargs,
    ):
        r"""
        Instantiate a [`FlavaConfig`] (or a derived class) from a FLAVA text model configuration, image model
        configuration, multimodal model configuration and image codebook configuration.

        Returns:
            [`FlavaConfig`]: An instance of a configuration object
        """

        return cls(
            image_config=image_config.to_dict(),
            text_config=text_config.to_dict(),
            multimodal_config=multimodal_config.to_dict(),
            image_codebook_config=image_codebook_config.to_dict(),
            **kwargs,
        )

mindnlp.transformers.models.flava.configuration_flava.FlavaConfig.from_configs(image_config, text_config, multimodal_config, image_codebook_config, **kwargs) classmethod

Instantiate a [FlavaConfig] (or a derived class) from a FLAVA text model configuration, image model configuration, multimodal model configuration and image codebook configuration.

RETURNS DESCRIPTION

[FlavaConfig]: An instance of a configuration object
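
A minimal usage sketch, assuming the FLAVA config classes are re-exported from `mindnlp.transformers`; any extra keyword arguments (loss weights, `projection_dim`, and so on) are forwarded to the [FlavaConfig] constructor.

```python
>>> from mindnlp.transformers import (
...     FlavaConfig, FlavaImageConfig, FlavaTextConfig,
...     FlavaMultimodalConfig, FlavaImageCodebookConfig
... )
...
>>> # Build each sub-config explicitly, then combine them into one FlavaConfig
>>> config = FlavaConfig.from_configs(
...     image_config=FlavaImageConfig(patch_size=32),
...     text_config=FlavaTextConfig(num_hidden_layers=6),
...     multimodal_config=FlavaMultimodalConfig(),
...     image_codebook_config=FlavaImageCodebookConfig(),
...     mim_weight=2.0,  # extra kwargs are forwarded to FlavaConfig
... )
>>> config.image_config.patch_size
32
```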

Source code in mindnlp/transformers/models/flava/configuration_flava.py
@classmethod
def from_configs(
    cls,
    image_config: FlavaImageConfig,
    text_config: FlavaTextConfig,
    multimodal_config: FlavaMultimodalConfig,
    image_codebook_config: FlavaImageCodebookConfig,
    **kwargs,
):
    r"""
    Instantiate a [`FlavaConfig`] (or a derived class) from a FLAVA text model configuration, image model
    configuration, multimodal model configuration and image codebook configuration.

    Returns:
        [`FlavaConfig`]: An instance of a configuration object
    """

    return cls(
        image_config=image_config.to_dict(),
        text_config=text_config.to_dict(),
        multimodal_config=multimodal_config.to_dict(),
        image_codebook_config=image_codebook_config.to_dict(),
        **kwargs,
    )

mindnlp.transformers.models.flava.configuration_flava.FlavaImageCodebookConfig

Bases: PretrainedConfig

[FlavaImageCodebookConfig] is the configuration class to store the configuration of a [FlavaImageCodebook]. It is used to instantiate a FLAVA model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA facebook/flava-image-codebook architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
num_groups

Number of groups to be created. This parameter currently does not affect the model and is only used for some internal calculations and estimations.

TYPE: `int`, defaults to 4 DEFAULT: 4

input_channels

Number of channels in the image to be passed.

TYPE: `int`, defaults to 3 DEFAULT: 3

num_blocks_per_group

Number of conv-based blocks per group.

TYPE: `int`, defaults to 2 DEFAULT: 2

hidden_size

Size of the hidden dimension for the blocks.

TYPE: `int`, defaults to 256 DEFAULT: 256

vocab_size

Size of the output vocabulary for the codebook.

TYPE: `int`, defaults to 8192 DEFAULT: 8192

freeze

Whether to freeze the weights of the model.

TYPE: `bool`, defaults to `True` DEFAULT: True

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

kwargs

Dictionary of keyword arguments.

TYPE: *optional* DEFAULT: {}

Example
>>> from transformers import FlavaImageCodebookConfig, FlavaImageCodebook
...
>>> # Initializing a FlavaImageCodebook with the facebook/flava-image-codebook style configuration
>>> configuration = FlavaImageCodebookConfig()
...
>>> # Initializing a FlavaImageCodebook model (with random weights) from the style configuration
>>> model = FlavaImageCodebook(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
Source code in mindnlp/transformers/models/flava/configuration_flava.py
class FlavaImageCodebookConfig(PretrainedConfig):
    r"""
    [`FlavaImageCodebookConfig`] is the configuration class to store the configuration of a [`FlavaImageCodebook`]. It
    is used to instantiate a FLAVA model according to the specified arguments, defining the model architecture.
    Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA
    [facebook/flava-image-codebook](https://huggingface.co/facebook/flava-image-codebook) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        num_groups (`int`, defaults to 4):
            Number of groups to be created. This parameter currently does not affect the model and is only used for
            some internal calculations and estimations.
        input_channels (`int`, defaults to 3):
            Number of channels in the image to be passed.
        num_blocks_per_group (`int`, defaults to 2):
            Number of conv-based blocks per group.
        hidden_size (`int`, defaults to 256):
            Size of the hidden dimension for the blocks.
        vocab_size (`int`, defaults to 8192):
            Size of the output vocabulary for the codebook.
        freeze (`bool`, defaults to `True`):
            Whether to freeze the weights of the model.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        kwargs (*optional*):
            Dictionary of keyword arguments.

    Example:
        ```python
        >>> from transformers import FlavaImageCodebookConfig, FlavaImageCodebook
        ...
        >>> # Initializing a FlavaImageCodebook with the facebook/flava-image-codebook style configuration
        >>> configuration = FlavaImageCodebookConfig()
        ...
        >>> # Initializing a FlavaImageCodebook model (with random weights) from the style configuration
        >>> model = FlavaImageCodebook(configuration)
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type="flava_image_codebook"

    def __init__(
        self,
        num_groups: int = 4,
        input_channels: int = 3,
        num_blocks_per_group: int = 2,
        hidden_size: int = 256,
        vocab_size: int = 8192,
        freeze: bool = True,
        initializer_range: float = 0.02,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.num_groups = num_groups
        self.input_channels = input_channels
        self.num_blocks_per_group = num_blocks_per_group
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.freeze = freeze
        self.initializer_range = initializer_range

    @staticmethod
    def _set_token_in_kwargs(kwargs, token=None):
        """Temporary method to deal with `token` and `use_auth_token`.

        This method avoids applying the same changes in all model config classes that overwrite `from_pretrained`.

        `use_auth_token` needs to be cleaned up in a follow-up PR.
        """
        # Some model config classes like CLIP define their own `from_pretrained` without the new argument `token` yet.
        if token is None:
            token = kwargs.pop("token", None)
        use_auth_token = kwargs.pop("use_auth_token", None)

        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError(
                    "`token` and `use_auth_token` are both specified. Please set only the argument `token`."
                )
            token = use_auth_token

        if token is not None:
            kwargs["token"] = token

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        cls._set_token_in_kwargs(kwargs)

        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the image codebook config dict if we are loading from FlavaConfig
        if config_dict.get("model_type") == "flava":
            config_dict = config_dict["image_codebook_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)
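
As the `from_pretrained` override above shows, loading this configuration from a full FLAVA checkpoint (model_type "flava") extracts the nested `image_codebook_config` dict rather than the top-level config. A hedged sketch, reusing the facebook/flava-full checkpoint named earlier purely for illustration (downloading its config requires network access):

```python
>>> from mindnlp.transformers import FlavaImageCodebookConfig
...
>>> # The downloaded config dict has model_type == "flava", so only its
>>> # `image_codebook_config` sub-dict is used to build the codebook config
>>> codebook_config = FlavaImageCodebookConfig.from_pretrained("facebook/flava-full")
>>> codebook_config.model_type
'flava_image_codebook'
```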

mindnlp.transformers.models.flava.configuration_flava.FlavaImageConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [FlavaImageModel]. It is used to instantiate a FLAVA model according to the specified arguments, defining the model architecture.

Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA facebook/flava-full architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

intermediate_size

Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 3072 DEFAULT: 3072

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.

TYPE: `str` or `function`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

attention_probs_dropout_prob

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-12 DEFAULT: 1e-12

image_size

The size (resolution) of each image.

TYPE: `int`, *optional*, defaults to 224 DEFAULT: 224

patch_size

The size (resolution) of each patch.

TYPE: `int`, *optional*, defaults to 16 DEFAULT: 16

num_channels

The number of input channels.

TYPE: `int`, *optional*, defaults to 3 DEFAULT: 3

qkv_bias

Whether to add a bias to the queries, keys and values.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

mask_token

Whether to use a mask token or not. Used in MIM (Masked Image Modeling) loss for FLAVA.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

vocab_size

Vocabulary size of the [FlavaImageCodebook] used in conjunction with [FlavaImageModel] for MIM (Masked Image Modeling) loss for FLAVA.

TYPE: `int`, *optional*, defaults to 8192 DEFAULT: 8192

Example
>>> from transformers import FlavaImageConfig, FlavaImageModel
...
>>> # Initializing a FlavaImageModel with the facebook/flava-full style configuration
>>> configuration = FlavaImageConfig()
...
>>> # Initializing a FlavaImageModel model (with random weights) from the style configuration
>>> model = FlavaImageModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
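
Since the FLAVA image encoder uses a ViT-style patch embedding, `image_size` and `patch_size` together determine the patch sequence length. A quick back-of-the-envelope check with the defaults (the `+ 1` assumes a single prepended CLS token):

```python
>>> from mindnlp.transformers import FlavaImageConfig
...
>>> config = FlavaImageConfig()  # image_size=224, patch_size=16 by default
>>> num_patches = (config.image_size // config.patch_size) ** 2
>>> num_patches
196
>>> num_patches + 1  # plus one CLS token
197
```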
Source code in mindnlp/transformers/models/flava/configuration_flava.py
class FlavaImageConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`FlavaImageModel`]. It is used to instantiate a
    FLAVA model according to the specified arguments, defining the model architecture.

    Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA
    [facebook/flava-full](https://huggingface.co/facebook/flava-full) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        image_size (`int`, *optional*, defaults to 224):
            The size (resolution) of each image.
        patch_size (`int`, *optional*, defaults to 16):
            The size (resolution) of each patch.
        num_channels (`int`, *optional*, defaults to 3):
            The number of input channels.
        qkv_bias (`bool`, *optional*, defaults to `True`):
            Whether to add a bias to the queries, keys and values.
        mask_token (`bool`, *optional*, defaults to `True`):
            Whether to use a mask token or not. Used in MIM (Masked Image Modeling) loss for FLAVA.
        vocab_size (`int`, *optional*, defaults to 8192):
            Vocabulary size of the [`FlavaImageCodebook`] used in conjunction with [`FlavaImageModel`] for MIM (Masked
            Image Modeling) loss for FLAVA.

    Example:
        ```python
        >>> from transformers import FlavaImageConfig, FlavaImageModel
        ...
        >>> # Initializing a FlavaImageModel with the facebook/flava-full style configuration
        >>> configuration = FlavaImageConfig()
        ...
        >>> # Initializing a FlavaImageModel model (with random weights) from the style configuration
        >>> model = FlavaImageModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """

    model_type = "flava_image_model"

    def __init__(
        self,
        hidden_size: int = 768,
        num_hidden_layers: int = 12,
        num_attention_heads: int = 12,
        intermediate_size: int = 3072,
        hidden_act: int = "gelu",
        hidden_dropout_prob: float = 0.0,
        attention_probs_dropout_prob: float = 0.0,
        initializer_range: float = 0.02,
        layer_norm_eps: float = 1e-12,
        image_size: int = 224,
        patch_size: int = 16,
        num_channels: int = 3,
        qkv_bias: bool = True,
        mask_token: bool = True,
        vocab_size: int = 8192,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_channels = num_channels
        self.qkv_bias = qkv_bias
        self.mask_token = mask_token
        self.vocab_size = vocab_size

    @staticmethod
    def _set_token_in_kwargs(kwargs, token=None):
        """Temporary method to deal with `token` and `use_auth_token`.

        This method avoids applying the same changes in all model config classes that overwrite `from_pretrained`.

        `use_auth_token` needs to be cleaned up in a follow-up PR.
        """
        # Some model config classes like CLIP define their own `from_pretrained` without the new argument `token` yet.
        if token is None:
            token = kwargs.pop("token", None)
        use_auth_token = kwargs.pop("use_auth_token", None)

        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError(
                    "`token` and `use_auth_token` are both specified. Please set only the argument `token`."
                )
            token = use_auth_token

        if token is not None:
            kwargs["token"] = token

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        cls._set_token_in_kwargs(kwargs)

        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the image config dict if we are loading from FlavaConfig
        if config_dict.get("model_type") == "flava":
            config_dict = config_dict["image_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.flava.configuration_flava.FlavaMultimodalConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [FlavaMultimodalModel]. It is used to instantiate a FLAVA model according to the specified arguments, defining the model architecture.

Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA facebook/flava-full architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 6 DEFAULT: 6

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

intermediate_size

Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 3072 DEFAULT: 3072

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.

TYPE: `str` or `function`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

attention_probs_dropout_prob

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-12 DEFAULT: 1e-12

qkv_bias

Whether to add a bias to the queries, keys and values.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

use_cls_token

Whether to use an extra CLS token for multimodal settings. Usually needed by the FLAVA model.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

Example
>>> from transformers import FlavaMultimodalConfig, FlavaMultimodalModel
...
>>> # Initializing a FlavaMultimodalModel with the facebook/flava-full style configuration
>>> configuration = FlavaMultimodalConfig()
...
>>> # Initializing a FlavaMultimodalModel model (with random weights) from the style configuration
>>> model = FlavaMultimodalModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
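
A short sketch (again assuming the `mindnlp.transformers` re-exports) of overriding the multimodal encoder depth and disabling the extra CLS token by passing a plain dict to [FlavaConfig]:

```python
>>> from mindnlp.transformers import FlavaConfig
...
>>> # Pass multimodal options as a dict; FlavaConfig builds the FlavaMultimodalConfig from it
>>> flava_config = FlavaConfig(multimodal_config={"num_hidden_layers": 8, "use_cls_token": False})
>>> flava_config.multimodal_config.num_hidden_layers
8
>>> flava_config.multimodal_config.use_cls_token
False
```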
Source code in mindnlp/transformers/models/flava/configuration_flava.py
class FlavaMultimodalConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`FlavaMultimodalModel`]. It is used to instantiate
    a FLAVA model according to the specified arguments, defining the model architecture.

    Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA
    [facebook/flava-full](https://huggingface.co/facebook/flava-full) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 6):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        qkv_bias (`bool`, *optional*, defaults to `True`):
            Whether to add a bias to the queries, keys and values.
        use_cls_token (`bool`, *optional*, defaults to `True`):
            Whether to use an extra CLS token for multimodal settings. Usually needed by the FLAVA model.


    Example:
        ```python
        >>> from transformers import FlavaMultimodalConfig, FlavaMultimodalModel
        ...
        >>> # Initializing a FlavaMultimodalModel with the facebook/flava-full style configuration
        >>> configuration = FlavaMultimodalConfig()
        ...
        >>> # Initializing a FlavaMultimodalModel model (with random weights) from the style configuration
        >>> model = FlavaMultimodalModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """

    model_type = "flava_multimodal_model"

    def __init__(
        self,
        hidden_size: int = 768,
        num_hidden_layers: int = 6,
        num_attention_heads: int = 12,
        intermediate_size: int = 3072,
        hidden_act: int = "gelu",
        hidden_dropout_prob: int = 0.0,
        attention_probs_dropout_prob: int = 0.0,
        initializer_range: float = 0.02,
        layer_norm_eps: float = 1e-12,
        qkv_bias: bool = True,
        use_cls_token: bool = True,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.qkv_bias = qkv_bias
        self.use_cls_token = use_cls_token

    @staticmethod
    def _set_token_in_kwargs(kwargs, token=None):
        """Temporary method to deal with `token` and `use_auth_token`.

        This method avoids applying the same changes in all model config classes that overwrite `from_pretrained`.

        `use_auth_token` needs to be cleaned up in a follow-up PR.
        """
        # Some model config classes like CLIP define their own `from_pretrained` without the new argument `token` yet.
        if token is None:
            token = kwargs.pop("token", None)
        use_auth_token = kwargs.pop("use_auth_token", None)

        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError(
                    "`token` and `use_auth_token` are both specified. Please set only the argument `token`."
                )
            token = use_auth_token

        if token is not None:
            kwargs["token"] = token

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        cls._set_token_in_kwargs(kwargs)

        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the multimodal config dict if we are loading from FlavaConfig
        if config_dict.get("model_type") == "flava":
            config_dict = config_dict["multimodal_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)

mindnlp.transformers.models.flava.configuration_flava.FlavaTextConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [FlavaTextModel]. It is used to instantiate a FLAVA model according to the specified arguments, defining the model architecture.

Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA facebook/flava-full architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
vocab_size

Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling [FlavaTextModel].

TYPE: `int`, *optional*, defaults to 30522 DEFAULT: 30522

type_vocab_size

The vocabulary size of the token_type_ids passed when calling [FlavaTextModel]. Note that even though the text encoder supports a token_type_ids vocabulary of size 2, only a single token type is used for text-only pretraining and fine-tuning, similar to RoBERTa.

TYPE: `int`, *optional*, defaults to 2 DEFAULT: 2

max_position_embeddings

The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512, 1024 or 2048). For vision-language settings, the max_length passed to the model is 77.

TYPE: `int`, *optional*, defaults to 512 DEFAULT: 512

position_embedding_type

Type of position embedding. Choose one of "absolute", "relative_key", "relative_key_query". For positional embeddings use "absolute". For more information on "relative_key", please refer to Self-Attention with Relative Position Representations (Shaw et al.). For more information on "relative_key_query", please refer to Method 4 in Improve Transformer Models with Better Relative Position Embeddings (Huang et al.).

TYPE: `str`, *optional*, defaults to `"absolute"` DEFAULT: 'absolute'

hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

intermediate_size

Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 3072 DEFAULT: 3072

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.

TYPE: `str` or `function`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

attention_probs_dropout_prob

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-12 DEFAULT: 1e-12

image_size

The size (resolution) of each image.

TYPE: `int`, *optional*, defaults to 224

patch_size

The size (resolution) of each patch.

TYPE: `int`, *optional*, defaults to 16

num_channels

The number of input channels.

TYPE: `int`, *optional*, defaults to 3

qkv_bias

Whether to add a bias to the queries, keys and values.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

Example
>>> from transformers import FlavaTextConfig, FlavaTextModel
...
>>> # Initializing a FlavaTextModel with the facebook/flava-full style configuration
>>> configuration = FlavaTextConfig()
...
>>> # Initializing a FlavaTextModel model (with random weights) from the style configuration
>>> model = FlavaTextModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
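
A brief sketch of overriding the text encoder settings; the values below are illustrative (a RoBERTa-like vocabulary and padding id), not the FLAVA defaults, and the import again assumes the `mindnlp.transformers` re-exports.

```python
>>> from mindnlp.transformers import FlavaTextConfig
...
>>> # Defaults follow BERT-base (vocab_size=30522, pad_token_id=0, dropout 0.0)
>>> text_config = FlavaTextConfig(
...     vocab_size=50265,
...     max_position_embeddings=514,
...     pad_token_id=1,
...     hidden_dropout_prob=0.1,
... )
>>> text_config.vocab_size
50265
```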
Source code in mindnlp/transformers/models/flava/configuration_flava.py
class FlavaTextConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`FlavaTextModel`]. It is used to instantiate a
    FLAVA model according to the specified arguments, defining the model architecture.

    Instantiating a configuration with the defaults will yield a similar configuration to that of the FLAVA
    [facebook/flava-full](https://huggingface.co/facebook/flava-full) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 30522):
            Vocabulary size of the BERT model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`FlavaTextModel`].
        type_vocab_size (`int`, *optional*, defaults to 2):
            The vocabulary size of the `token_type_ids` passed when calling [`FlavaTextModel`]. Note that even though
            the text encoder supports a `token_type_ids` vocabulary of size 2, only a single token type is used for
            text-only pretraining and fine-tuning, similar to RoBERTa.
        max_position_embeddings (`int`, *optional*, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512, 1024 or 2048). For vision-language settings, the max_length passed to the model is 77.
        position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
            Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
            positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
            For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        pad_token_id (`int`, *optional*, defaults to 0):
            The id of the padding token.
        qkv_bias (`bool`, *optional*, defaults to `True`):
            Whether to add a bias to the queries, keys and values.

    Example:
        ```python
        >>> from transformers import FlavaTextConfig, FlavaTextModel
        ...
        >>> # Initializing a FlavaTextConfig with the facebook/flava-full style configuration
        >>> configuration = FlavaTextConfig()
        ...
        >>> # Initializing a FlavaTextModel model (with random weights) from the style configuration
        >>> model = FlavaTextModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """

    model_type = "flava_text_model"

    def __init__(
        self,
        vocab_size: int = 30522,
        type_vocab_size: int = 2,
        max_position_embeddings: int = 512,
        position_embedding_type: str = "absolute",
        hidden_size: int = 768,
        num_hidden_layers: int = 12,
        num_attention_heads: int = 12,
        intermediate_size: int = 3072,
        hidden_act: str = "gelu",
        hidden_dropout_prob: float = 0.0,
        attention_probs_dropout_prob: float = 0.0,
        initializer_range: float = 0.02,
        layer_norm_eps: float = 1e-12,
        pad_token_id: int = 0,
        qkv_bias: bool = True,
        **kwargs,
    ):
        super().__init__(**kwargs)

        self.vocab_size = vocab_size
        self.type_vocab_size = type_vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.position_embedding_type = position_embedding_type
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.qkv_bias = qkv_bias
        self.pad_token_id = pad_token_id

    @staticmethod
    def _set_token_in_kwargs(kwargs, token=None):
        """Temporary method to deal with `token` and `use_auth_token`.

        This method avoids applying the same changes in all model config classes that overwrite `from_pretrained`.

        Need to clean up `use_auth_token` in a follow-up PR.
        """
        # Some model config classes like CLIP define their own `from_pretrained` without the new argument `token` yet.
        if token is None:
            token = kwargs.pop("token", None)
        use_auth_token = kwargs.pop("use_auth_token", None)

        if use_auth_token is not None:
            warnings.warn(
                "The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.",
                FutureWarning,
            )
            if token is not None:
                raise ValueError(
                    "`token` and `use_auth_token` are both specified. Please set only the argument `token`."
                )
            token = use_auth_token

        if token is not None:
            kwargs["token"] = token

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
        cls._set_token_in_kwargs(kwargs)

        config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)

        # get the text config dict if we are loading from FlavaConfig
        if config_dict.get("model_type") == "flava":
            config_dict = config_dict["text_config"]

        if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
            logger.warning(
                f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
                f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
            )

        return cls.from_dict(config_dict, **kwargs)
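
The `from_pretrained` override above also accepts a full FLAVA checkpoint: when the stored configuration has `model_type == "flava"`, only its nested `text_config` is used. A minimal sketch (assuming the import path `mindnlp.transformers` mirrors `transformers` and the `facebook/flava-full` checkpoint, or a local copy of its config, is reachable):

```python
from mindnlp.transformers import FlavaTextConfig  # assumed import path

# Pulls the nested text_config out of the full FLAVA configuration.
text_config = FlavaTextConfig.from_pretrained("facebook/flava-full")

# Alternatively, start from the documented defaults and tweak individual fields.
small_text_config = FlavaTextConfig(num_hidden_layers=6, hidden_dropout_prob=0.1)
print(text_config.vocab_size, small_text_config.num_hidden_layers)
```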

mindnlp.transformers.models.flava.feature_extraction_flava

Feature extractor class for FLAVA.

mindnlp.transformers.models.flava.image_processing_flava

Image processor class for Flava.

mindnlp.transformers.models.flava.image_processing_flava.FlavaImageProcessor

Bases: BaseImageProcessor

Constructs a Flava image processor.

PARAMETER DESCRIPTION
do_resize

Whether to resize the image's (height, width) dimensions to the specified size. Can be overridden by the do_resize parameter in preprocess.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

size

Size of the image after resizing. Can be overridden by the size parameter in preprocess.

TYPE: `Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}` DEFAULT: None

resample

Resampling filter to use if resizing the image. Can be overridden by the resample parameter in preprocess.

TYPE: `PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC` DEFAULT: BICUBIC

do_center_crop

Whether to center crop the images. Can be overridden by the do_center_crop parameter in preprocess.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

crop_size

Size of the image after the center crop (crop_size["height"], crop_size["width"]). Can be overridden by the crop_size parameter in preprocess.

TYPE: `Dict[str, int]`, *optional*, defaults to `{"height": 224, "width": 224}` DEFAULT: None

do_rescale

Whether to rescale the image by the specified scale rescale_factor. Can be overridden by the do_rescale parameter in preprocess.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

rescale_factor

Scale factor to use if rescaling the image. Can be overridden by the rescale_factor parameter in preprocess.

TYPE: `int` or `float`, *optional*, defaults to `1/255` DEFAULT: 1 / 255

do_normalize

Whether to normalize the image. Can be overridden by the do_normalize parameter in preprocess.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

image_mean

Mean to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_mean parameter in the preprocess method.

TYPE: `float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN` DEFAULT: None

image_std

Standard deviation to use if normalizing the image. This is a float or list of floats the length of the number of channels in the image. Can be overridden by the image_std parameter in the preprocess method.

TYPE: `float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD` DEFAULT: None

return_image_mask

Whether to return the image mask. Can be overridden by the return_image_mask parameter in preprocess.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

input_size_patches

Number of patches in the image in height and width direction. 14x14 = 196 total patches. Can be overridden by the input_size_patches parameter in preprocess.

TYPE: `int`, *optional*, defaults to 14 DEFAULT: 14

total_mask_patches

Total number of patches that should be masked. Can be overridden by the total_mask_patches parameter in preprocess.

TYPE: `int`, *optional*, defaults to 75 DEFAULT: 75

mask_group_min_patches

Minimum number of patches that should be masked. Can be overridden by the mask_group_min_patches parameter in preprocess.

TYPE: `int`, *optional*, defaults to 16 DEFAULT: 16

mask_group_max_patches

Maximum number of patches that should be masked. Can be overridden by the mask_group_max_patches parameter in preprocess.

TYPE: `int`, *optional* DEFAULT: None

mask_group_min_aspect_ratio

Minimum aspect ratio of the mask window. Can be overridden by the mask_group_min_aspect_ratio parameter in preprocess.

TYPE: `float`, *optional*, defaults to 0.3 DEFAULT: 0.3

mask_group_max_aspect_ratio

Maximum aspect ratio of the mask window. Can be overridden by the mask_group_max_aspect_ratio parameter in preprocess.

TYPE: `float`, *optional* DEFAULT: None

codebook_do_resize

Whether to resize the input for codebook to a certain codebook_size. Can be overridden by the codebook_do_resize parameter in preprocess.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

codebook_size

Resize the input for codebook to the given size. Can be overridden by the codebook_size parameter in preprocess.

TYPE: `Dict[str, int]`, *optional*, defaults to `{"height": 112, "width": 112}` DEFAULT: None

codebook_resample

Resampling filter to use if resizing the codebook image. Can be overridden by the codebook_resample parameter in preprocess.

TYPE: `PILImageResampling`, *optional*, defaults to `PILImageResampling.LANCZOS` DEFAULT: LANCZOS

codebook_do_center_crop

Whether to crop the input for codebook at the center. If the input size is smaller than codebook_crop_size along any edge, the image is padded with 0's and then center cropped. Can be overridden by the codebook_do_center_crop parameter in preprocess.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

codebook_crop_size

Desired output size for codebook input when applying center-cropping. Can be overridden by the codebook_crop_size parameter in preprocess.

TYPE: `Dict[str, int]`, *optional*, defaults to `{"height": 112, "width": 112}` DEFAULT: None

codebook_do_rescale

Whether to rescale the input for codebook by the specified scale codebook_rescale_factor. Can be overridden by the codebook_do_rescale parameter in preprocess.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

codebook_rescale_factor

Defines the scale factor to use if rescaling the codebook image. Can be overridden by the codebook_rescale_factor parameter in preprocess.

TYPE: `int` or `float`, *optional*, defaults to `1/255` DEFAULT: 1 / 255

codebook_do_map_pixels

Whether to map the pixel values of the codebook input to (1 - 2e)x + e. Can be overridden by the codebook_do_map_pixels parameter in preprocess.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

codebook_do_normalize

Whether or not to normalize the input for codebook with codebook_image_mean and codebook_image_std. Can be overridden by the codebook_do_normalize parameter in preprocess.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

codebook_image_mean

The sequence of means for each channel, to be used when normalizing images for codebook. Can be overridden by the codebook_image_mean parameter in preprocess.

TYPE: `Optional[Union[float, Iterable[float]]]`, *optional*, defaults to `[0, 0, 0]` DEFAULT: None

codebook_image_std

The sequence of standard deviations for each channel, to be used when normalizing images for codebook. Can be overridden by the codebook_image_std parameter in preprocess.

TYPE: `Optional[Union[float, Iterable[float]]]`, *optional*, defaults to `[0.5, 0.5, 0.5]` DEFAULT: None

Source code in mindnlp/transformers/models/flava/image_processing_flava.py
class FlavaImageProcessor(BaseImageProcessor):
    r"""
    Constructs a Flava image processor.

    Args:
        do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the image's (height, width) dimensions to the specified `size`. Can be overridden by the
            `do_resize` parameter in `preprocess`.
        size (`Dict[str, int]` *optional*, defaults to `{"height": 224, "width": 224}`):
            Size of the image after resizing. Can be overridden by the `size` parameter in `preprocess`.
        resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
            Resampling filter to use if resizing the image. Can be overridden by the `resample` parameter in
            `preprocess`.
        do_center_crop (`bool`, *optional*, defaults to `True`):
            Whether to center crop the images. Can be overridden by the `do_center_crop` parameter in `preprocess`.
        crop_size (`Dict[str, int]` *optional*, defaults to `{"height": 224, "width": 224}`):
            Size of image after the center crop `(crop_size["height"], crop_size["width"])`. Can be overridden by the
            `crop_size` parameter in `preprocess`.
        do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the image by the specified scale `rescale_factor`. Can be overridden by the `do_rescale`
            parameter in `preprocess`.
        rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
            Scale factor to use if rescaling the image. Can be overridden by the `rescale_factor` parameter in
            `preprocess`.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether to normalize the image. Can be overridden by the `do_normalize` parameter in `preprocess`.
        image_mean (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_MEAN`):
            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
        image_std (`float` or `List[float]`, *optional*, defaults to `IMAGENET_STANDARD_STD`):
            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
        return_image_mask (`bool`, *optional*, defaults to `False`):
            Whether to return the image mask. Can be overridden by the `return_image_mask` parameter in `preprocess`.
        input_size_patches (`int`, *optional*, defaults to 14):
            Number of patches in the image in height and width direction. 14x14 = 196 total patches. Can be overridden
            by the `input_size_patches` parameter in `preprocess`.
        total_mask_patches (`int`, *optional*, defaults to 75):
            Total number of patches that should be masked. Can be overridden by the `total_mask_patches` parameter in
            `preprocess`.
        mask_group_min_patches (`int`, *optional*, defaults to 16):
            Minimum number of patches that should be masked. Can be overridden by the `mask_group_min_patches`
            parameter in `preprocess`.
        mask_group_max_patches (`int`, *optional*):
            Maximum number of patches that should be masked. Can be overridden by the `mask_group_max_patches`
            parameter in `preprocess`.
        mask_group_min_aspect_ratio (`float`, *optional*, defaults to 0.3):
            Minimum aspect ratio of the mask window. Can be overridden by the `mask_group_min_aspect_ratio` parameter
            in `preprocess`.
        mask_group_max_aspect_ratio (`float`, *optional*):
            Maximum aspect ratio of the mask window. Can be overridden by the `mask_group_max_aspect_ratio` parameter
            in `preprocess`.
        codebook_do_resize (`bool`, *optional*, defaults to `True`):
            Whether to resize the input for codebook to a certain `codebook_size`. Can be overridden by the
            `codebook_do_resize` parameter in `preprocess`.
        codebook_size (`Dict[str, int]`, *optional*, defaults to `{"height": 112, "width": 112}`):
            Resize the input for codebook to the given size. Can be overridden by the `codebook_size` parameter in
            `preprocess`.
        codebook_resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.LANCZOS`):
            Resampling filter to use if resizing the codebook image. Can be overridden by the `codebook_resample`
            parameter in `preprocess`.
        codebook_do_center_crop (`bool`, *optional*, defaults to `True`):
            Whether to crop the input for codebook at the center. If the input size is smaller than
            `codebook_crop_size` along any edge, the image is padded with 0's and then center cropped. Can be
            overridden by the `codebook_do_center_crop` parameter in `preprocess`.
        codebook_crop_size (`Dict[str, int]`, *optional*, defaults to `{"height": 112, "width": 112}`):
            Desired output size for codebook input when applying center-cropping. Can be overridden by the
            `codebook_crop_size` parameter in `preprocess`.
        codebook_do_rescale (`bool`, *optional*, defaults to `True`):
            Whether to rescale the input for codebook by the specified scale `codebook_rescale_factor`. Can be
            overridden by the `codebook_do_rescale` parameter in `preprocess`.
        codebook_rescale_factor (`int` or `float`, *optional*, defaults to `1/255`):
            Defines the scale factor to use if rescaling the codebook image. Can be overridden by the
            `codebook_rescale_factor` parameter in `preprocess`.
        codebook_do_map_pixels (`bool`, *optional*, defaults to `True`):
            Whether to map the pixel values of the codebook input to (1 - 2e)x + e. Can be overridden by the
            `codebook_do_map_pixels` parameter in `preprocess`.
        codebook_do_normalize (`bool`, *optional*, defaults to `True`):
            Whether or not to normalize the input for codebook with `codebook_image_mean` and `codebook_image_std`. Can
            be overridden by the `codebook_do_normalize` parameter in `preprocess`.
        codebook_image_mean (`Optional[Union[float, Iterable[float]]]`, *optional*, defaults to `[0, 0, 0]`):
            The sequence of means for each channel, to be used when normalizing images for codebook. Can be overridden
            by the `codebook_image_mean` parameter in `preprocess`.
        codebook_image_std (`Optional[Union[float, Iterable[float]]]`, *optional*, defaults to `[0.5, 0.5, 0.5]`):
            The sequence of standard deviations for each channel, to be used when normalizing images for codebook. Can
            be overridden by the `codebook_image_std` parameter in `preprocess`.
    """

    model_input_names = ["pixel_values"]

    def __init__(
        self,
        do_resize: bool = True,
        size: Dict[str, int] = None,
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        do_center_crop: bool = True,
        crop_size: Dict[str, int] = None,
        do_rescale: bool = True,
        rescale_factor: Union[int, float] = 1 / 255,
        do_normalize: bool = True,
        image_mean: Optional[Union[float, Iterable[float]]] = None,
        image_std: Optional[Union[float, Iterable[float]]] = None,
        # Mask related params
        return_image_mask: bool = False,
        input_size_patches: int = 14,
        total_mask_patches: int = 75,
        mask_group_min_patches: int = 16,
        mask_group_max_patches: Optional[int] = None,
        mask_group_min_aspect_ratio: float = 0.3,
        mask_group_max_aspect_ratio: Optional[float] = None,
        # Codebook related params
        return_codebook_pixels: bool = False,
        codebook_do_resize: bool = True,
        codebook_size: Dict[str, int] = None,
        codebook_resample: int = PILImageResampling.LANCZOS,
        codebook_do_center_crop: bool = True,
        codebook_crop_size: Dict[str, int] = None,
        codebook_do_rescale: bool = True,
        codebook_rescale_factor: Union[int, float] = 1 / 255,
        codebook_do_map_pixels: bool = True,
        codebook_do_normalize: bool = True,
        codebook_image_mean: Optional[Union[float, Iterable[float]]] = None,
        codebook_image_std: Optional[Union[float, Iterable[float]]] = None,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        size = size if size is not None else {"height": 224, "width": 224}
        size = get_size_dict(size)
        crop_size = crop_size if crop_size is not None else {"height": 224, "width": 224}
        crop_size = get_size_dict(crop_size, param_name="crop_size")

        codebook_size = codebook_size if codebook_size is not None else {"height": 112, "width": 112}
        codebook_size = get_size_dict(codebook_size, param_name="codebook_size")
        codebook_crop_size = codebook_crop_size if codebook_crop_size is not None else {"height": 112, "width": 112}
        codebook_crop_size = get_size_dict(codebook_crop_size, param_name="codebook_crop_size")

        self.do_resize = do_resize
        self.size = size
        self.resample = resample
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_center_crop = do_center_crop
        self.crop_size = crop_size
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else FLAVA_IMAGE_MEAN
        self.image_std = image_std if image_std is not None else FLAVA_IMAGE_STD

        self.return_image_mask = return_image_mask
        self.input_size_patches = input_size_patches
        self.total_mask_patches = total_mask_patches
        self.mask_group_min_patches = mask_group_min_patches
        self.mask_group_max_patches = mask_group_max_patches
        self.mask_group_min_aspect_ratio = mask_group_min_aspect_ratio
        self.mask_group_max_aspect_ratio = mask_group_max_aspect_ratio

        self.return_codebook_pixels = return_codebook_pixels
        self.codebook_do_resize = codebook_do_resize
        self.codebook_size = codebook_size
        self.codebook_resample = codebook_resample
        self.codebook_do_center_crop = codebook_do_center_crop
        self.codebook_crop_size = codebook_crop_size
        self.codebook_do_rescale = codebook_do_rescale
        self.codebook_rescale_factor = codebook_rescale_factor
        self.codebook_do_map_pixels = codebook_do_map_pixels
        self.codebook_do_normalize = codebook_do_normalize
        self.codebook_image_mean = codebook_image_mean if codebook_image_mean is not None else FLAVA_CODEBOOK_MEAN
        self.codebook_image_std = codebook_image_std if codebook_image_std is not None else FLAVA_CODEBOOK_STD
        self._valid_processor_keys = [
            "images",
            "do_resize",
            "size",
            "resample",
            "do_center_crop",
            "crop_size",
            "do_rescale",
            "rescale_factor",
            "do_normalize",
            "image_mean",
            "image_std",
            "return_image_mask",
            "input_size_patches",
            "total_mask_patches",
            "mask_group_min_patches",
            "mask_group_max_patches",
            "mask_group_min_aspect_ratio",
            "mask_group_max_aspect_ratio",
            "return_codebook_pixels",
            "codebook_do_resize",
            "codebook_size",
            "codebook_resample",
            "codebook_do_center_crop",
            "codebook_crop_size",
            "codebook_do_rescale",
            "codebook_rescale_factor",
            "codebook_do_map_pixels",
            "codebook_do_normalize",
            "codebook_image_mean",
            "codebook_image_std",
            "return_tensors",
            "data_format",
            "input_data_format",
        ]

    @classmethod
    def from_dict(cls, image_processor_dict: Dict[str, Any], **kwargs):
        """
        Overrides the `from_dict` method from the base class to make sure parameters are updated if image processor is
        created using from_dict and kwargs e.g. `FlavaImageProcessor.from_pretrained(checkpoint, codebook_size=600)`
        """
        image_processor_dict = image_processor_dict.copy()
        if "codebook_size" in kwargs:
            image_processor_dict["codebook_size"] = kwargs.pop("codebook_size")
        if "codebook_crop_size" in kwargs:
            image_processor_dict["codebook_crop_size"] = kwargs.pop("codebook_crop_size")
        return super().from_dict(image_processor_dict, **kwargs)

    @lru_cache()
    def masking_generator(
        self,
        input_size_patches,
        total_mask_patches,
        mask_group_min_patches,
        mask_group_max_patches,
        mask_group_min_aspect_ratio,
        mask_group_max_aspect_ratio,
    ) -> FlavaMaskingGenerator:
        return FlavaMaskingGenerator(
            input_size=input_size_patches,
            total_mask_patches=total_mask_patches,
            mask_group_min_patches=mask_group_min_patches,
            mask_group_max_patches=mask_group_max_patches,
            mask_group_min_aspect_ratio=mask_group_min_aspect_ratio,
            mask_group_max_aspect_ratio=mask_group_max_aspect_ratio,
        )

    # Copied from transformers.models.vit.image_processing_vit.ViTImageProcessor.resize with PILImageResampling.BILINEAR->PILImageResampling.BICUBIC
    def resize(
        self,
        image: np.ndarray,
        size: Dict[str, int],
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> np.ndarray:
        """
        Resize an image to `(size["height"], size["width"])`.

        Args:
            image (`np.ndarray`):
                Image to resize.
            size (`Dict[str, int]`):
                Dictionary in the format `{"height": int, "width": int}` specifying the size of the output image.
            resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
                `PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BICUBIC`.
            data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the output image. If unset, the channel dimension format of the input
                image is used. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.

        Returns:
            `np.ndarray`: The resized image.
        """
        size = get_size_dict(size)
        if "height" not in size or "width" not in size:
            raise ValueError(f"The `size` dictionary must contain the keys `height` and `width`. Got {size.keys()}")
        output_size = (size["height"], size["width"])
        return resize(
            image,
            size=output_size,
            resample=resample,
            data_format=data_format,
            input_data_format=input_data_format,
            **kwargs,
        )

    def map_pixels(self, image: np.ndarray) -> np.ndarray:
        return (1 - 2 * LOGIT_LAPLACE_EPS) * image + LOGIT_LAPLACE_EPS

    def _preprocess_image(
        self,
        image: ImageInput,
        do_resize: bool = None,
        size: Dict[str, int] = None,
        resample: PILImageResampling = None,
        do_center_crop: bool = None,
        crop_size: Dict[str, int] = None,
        do_rescale: bool = None,
        rescale_factor: float = None,
        do_normalize: bool = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_map_pixels: bool = None,
        data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
        input_data_format: Optional[ChannelDimension] = None,
    ) -> np.ndarray:
        """Preprocesses a single image."""

        validate_preprocess_arguments(
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            do_center_crop=do_center_crop,
            crop_size=crop_size,
            do_resize=do_resize,
            size=size,
            resample=resample,
        )

        # All transformations expect numpy arrays.
        image = to_numpy_array(image)

        if is_scaled_image(image) and do_rescale:
            logger.warning_once(
                "It looks like you are trying to rescale already rescaled images. If the input"
                " images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
            )

        if input_data_format is None:
            # We assume that all images have the same channel dimension format.
            input_data_format = infer_channel_dimension_format(image)

        if do_resize:
            image = self.resize(image=image, size=size, resample=resample, input_data_format=input_data_format)

        if do_center_crop:
            image = self.center_crop(image=image, size=crop_size, input_data_format=input_data_format)

        if do_rescale:
            image = self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)

        if do_normalize:
            image = self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)

        if do_map_pixels:
            image = self.map_pixels(image)

        if data_format is not None:
            image = to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format)
        return image

    def preprocess(
        self,
        images: ImageInput,
        do_resize: Optional[bool] = None,
        size: Dict[str, int] = None,
        resample: PILImageResampling = None,
        do_center_crop: Optional[bool] = None,
        crop_size: Optional[Dict[str, int]] = None,
        do_rescale: Optional[bool] = None,
        rescale_factor: Optional[float] = None,
        do_normalize: Optional[bool] = None,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        # Mask related params
        return_image_mask: Optional[bool] = None,
        input_size_patches: Optional[int] = None,
        total_mask_patches: Optional[int] = None,
        mask_group_min_patches: Optional[int] = None,
        mask_group_max_patches: Optional[int] = None,
        mask_group_min_aspect_ratio: Optional[float] = None,
        mask_group_max_aspect_ratio: Optional[float] = None,
        # Codebook related params
        return_codebook_pixels: Optional[bool] = None,
        codebook_do_resize: Optional[bool] = None,
        codebook_size: Optional[Dict[str, int]] = None,
        codebook_resample: Optional[int] = None,
        codebook_do_center_crop: Optional[bool] = None,
        codebook_crop_size: Optional[Dict[str, int]] = None,
        codebook_do_rescale: Optional[bool] = None,
        codebook_rescale_factor: Optional[float] = None,
        codebook_do_map_pixels: Optional[bool] = None,
        codebook_do_normalize: Optional[bool] = None,
        codebook_image_mean: Optional[Iterable[float]] = None,
        codebook_image_std: Optional[Iterable[float]] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        data_format: ChannelDimension = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
        **kwargs,
    ) -> BatchFeature:
        """
        Preprocess an image or batch of images.

        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            do_resize (`bool`, *optional*, defaults to `self.do_resize`):
                Whether to resize the image.
            size (`Dict[str, int]`, *optional*, defaults to `self.size`):
                Size of the image.
            resample (`int`, *optional*, defaults to `self.resample`):
                Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
                has an effect if `do_resize` is set to `True`.
            do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
                Whether to center crop the image.
            crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
                Size of the center crop. Only has an effect if `do_center_crop` is set to `True`.
            do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
                Whether to rescale the image values between [0 - 1].
            rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
                Rescale factor to rescale the image by if `do_rescale` is set to `True`.
            do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
                Whether to normalize the image.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation.
            return_image_mask (`bool`, *optional*, defaults to `self.return_image_mask`):
                Whether to return the image mask.
            input_size_patches (`int`, *optional*, defaults to `self.input_size_patches`):
                Size of the patches to extract from the image.
            total_mask_patches (`int`, *optional*, defaults to `self.total_mask_patches`):
                Total number of patches to extract from the image.
            mask_group_min_patches (`int`, *optional*, defaults to `self.mask_group_min_patches`):
                Minimum number of patches to extract from the image.
            mask_group_max_patches (`int`, *optional*, defaults to `self.mask_group_max_patches`):
                Maximum number of patches to extract from the image.
            mask_group_min_aspect_ratio (`float`, *optional*, defaults to `self.mask_group_min_aspect_ratio`):
                Minimum aspect ratio of the patches to extract from the image.
            mask_group_max_aspect_ratio (`float`, *optional*, defaults to `self.mask_group_max_aspect_ratio`):
                Maximum aspect ratio of the patches to extract from the image.
            return_codebook_pixels (`bool`, *optional*, defaults to `self.return_codebook_pixels`):
                Whether to return the codebook pixels.
            codebook_do_resize (`bool`, *optional*, defaults to `self.codebook_do_resize`):
                Whether to resize the codebook pixels.
            codebook_size (`Dict[str, int]`, *optional*, defaults to `self.codebook_size`):
                Size of the codebook pixels.
            codebook_resample (`int`, *optional*, defaults to `self.codebook_resample`):
                Resampling filter to use if resizing the codebook pixels. This can be one of the enum
                `PILImageResampling`. Only has an effect if `codebook_do_resize` is set to `True`.
            codebook_do_center_crop (`bool`, *optional*, defaults to `self.codebook_do_center_crop`):
                Whether to center crop the codebook pixels.
            codebook_crop_size (`Dict[str, int]`, *optional*, defaults to `self.codebook_crop_size`):
                Size of the center crop of the codebook pixels. Only has an effect if `codebook_do_center_crop` is set
                to `True`.
            codebook_do_rescale (`bool`, *optional*, defaults to `self.codebook_do_rescale`):
                Whether to rescale the codebook pixels values between [0 - 1].
            codebook_rescale_factor (`float`, *optional*, defaults to `self.codebook_rescale_factor`):
                Rescale factor to rescale the codebook pixels by if `codebook_do_rescale` is set to `True`.
            codebook_do_map_pixels (`bool`, *optional*, defaults to `self.codebook_do_map_pixels`):
                Whether to map the codebook pixels values.
            codebook_do_normalize (`bool`, *optional*, defaults to `self.codebook_do_normalize`):
                Whether to normalize the codebook pixels.
            codebook_image_mean (`float` or `List[float]`, *optional*, defaults to `self.codebook_image_mean`):
                Codebook pixels mean to normalize the codebook pixels by if `codebook_do_normalize` is set to `True`.
            codebook_image_std (`float` or `List[float]`, *optional*, defaults to `self.codebook_image_std`):
                Codebook pixels standard deviation to normalize the codebook pixels by if `codebook_do_normalize` is
                set to `True`.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:

                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
            data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
                The channel dimension format for the output image. Can be one of:

                - `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            input_data_format (`ChannelDimension` or `str`, *optional*):
                The channel dimension format for the input image. If unset, the channel dimension format is inferred
                from the input image. Can be one of:

                - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
                - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
                - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        """
        do_resize = do_resize if do_resize is not None else self.do_resize
        size = size if size is not None else self.size
        size = get_size_dict(size)
        resample = resample if resample is not None else self.resample
        do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
        crop_size = crop_size if crop_size is not None else self.crop_size
        crop_size = get_size_dict(crop_size, param_name="crop_size")
        do_rescale = do_rescale if do_rescale is not None else self.do_rescale
        rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
        do_normalize = do_normalize if do_normalize is not None else self.do_normalize
        image_mean = image_mean if image_mean is not None else self.image_mean
        image_std = image_std if image_std is not None else self.image_std

        return_image_mask = return_image_mask if return_image_mask is not None else self.return_image_mask
        input_size_patches = input_size_patches if input_size_patches is not None else self.input_size_patches
        total_mask_patches = total_mask_patches if total_mask_patches is not None else self.total_mask_patches
        mask_group_min_patches = (
            mask_group_min_patches if mask_group_min_patches is not None else self.mask_group_min_patches
        )
        mask_group_max_patches = (
            mask_group_max_patches if mask_group_max_patches is not None else self.mask_group_max_patches
        )
        mask_group_min_aspect_ratio = (
            mask_group_min_aspect_ratio
            if mask_group_min_aspect_ratio is not None
            else self.mask_group_min_aspect_ratio
        )
        mask_group_max_aspect_ratio = (
            mask_group_max_aspect_ratio
            if mask_group_max_aspect_ratio is not None
            else self.mask_group_max_aspect_ratio
        )

        return_codebook_pixels = (
            return_codebook_pixels if return_codebook_pixels is not None else self.return_codebook_pixels
        )
        codebook_do_resize = codebook_do_resize if codebook_do_resize is not None else self.codebook_do_resize
        codebook_size = codebook_size if codebook_size is not None else self.codebook_size
        codebook_size = get_size_dict(codebook_size, param_name="codebook_size")
        codebook_resample = codebook_resample if codebook_resample is not None else self.codebook_resample
        codebook_do_rescale = codebook_do_rescale if codebook_do_rescale is not None else self.codebook_do_rescale
        codebook_rescale_factor = (
            codebook_rescale_factor if codebook_rescale_factor is not None else self.codebook_rescale_factor
        )
        codebook_do_center_crop = (
            codebook_do_center_crop if codebook_do_center_crop is not None else self.codebook_do_center_crop
        )
        codebook_crop_size = codebook_crop_size if codebook_crop_size is not None else self.codebook_crop_size
        codebook_crop_size = get_size_dict(codebook_crop_size, param_name="codebook_crop_size")
        codebook_do_map_pixels = (
            codebook_do_map_pixels if codebook_do_map_pixels is not None else self.codebook_do_map_pixels
        )
        codebook_do_normalize = (
            codebook_do_normalize if codebook_do_normalize is not None else self.codebook_do_normalize
        )
        codebook_image_mean = codebook_image_mean if codebook_image_mean is not None else self.codebook_image_mean
        codebook_image_std = codebook_image_std if codebook_image_std is not None else self.codebook_image_std

        images = make_list_of_images(images)

        validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)

        if not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )

        processed_images = [
            self._preprocess_image(
                image=img,
                do_resize=do_resize,
                size=size,
                resample=resample,
                do_center_crop=do_center_crop,
                crop_size=crop_size,
                do_rescale=do_rescale,
                rescale_factor=rescale_factor,
                do_normalize=do_normalize,
                image_mean=image_mean,
                image_std=image_std,
                do_map_pixels=False,
                data_format=data_format,
                input_data_format=input_data_format,
            )
            for img in images
        ]
        data = {"pixel_values": processed_images}

        if return_codebook_pixels:
            codebook_images = [
                self._preprocess_image(
                    image=img,
                    do_resize=codebook_do_resize,
                    size=codebook_size,
                    resample=codebook_resample,
                    do_center_crop=codebook_do_center_crop,
                    crop_size=codebook_crop_size,
                    do_rescale=codebook_do_rescale,
                    rescale_factor=codebook_rescale_factor,
                    do_normalize=codebook_do_normalize,
                    image_mean=codebook_image_mean,
                    image_std=codebook_image_std,
                    do_map_pixels=codebook_do_map_pixels,
                    data_format=data_format,
                    input_data_format=input_data_format,
                )
                for img in images
            ]
            data["codebook_pixel_values"] = codebook_images

        if return_image_mask:
            mask_generator = self.masking_generator(
                input_size_patches=input_size_patches,
                total_mask_patches=total_mask_patches,
                mask_group_min_patches=mask_group_min_patches,
                mask_group_max_patches=mask_group_max_patches,
                mask_group_min_aspect_ratio=mask_group_min_aspect_ratio,
                mask_group_max_aspect_ratio=mask_group_max_aspect_ratio,
            )
            masks = [mask_generator() for _ in images]
            data["bool_masked_pos"] = masks

        return BatchFeature(data=data, tensor_type=return_tensors)
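
For reference, `map_pixels` above applies the logit-Laplace-style mapping `(1 - 2e)x + e` to the codebook input so that pixel values stay strictly inside (0, 1). A standalone sketch of the same transform (the epsilon value here is illustrative only; the module defines its own `LOGIT_LAPLACE_EPS` constant):

```python
import numpy as np

EPS = 0.1  # illustrative stand-in for the module-level LOGIT_LAPLACE_EPS

def map_pixels(image: np.ndarray, eps: float = EPS) -> np.ndarray:
    """Squeeze values from [0, 1] into [eps, 1 - eps] so downstream logits stay finite."""
    return (1 - 2 * eps) * image + eps

x = np.array([0.0, 0.5, 1.0])
print(map_pixels(x))  # [0.1 0.5 0.9]
```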

mindnlp.transformers.models.flava.image_processing_flava.FlavaImageProcessor.from_dict(image_processor_dict, **kwargs) classmethod

Overrides the from_dict method from the base class to make sure parameters are updated if image processor is created using from_dict and kwargs e.g. FlavaImageProcessor.from_pretrained(checkpoint, codebook_size=600)

Source code in mindnlp/transformers/models/flava/image_processing_flava.py
@classmethod
def from_dict(cls, image_processor_dict: Dict[str, Any], **kwargs):
    """
    Overrides the `from_dict` method from the base class to make sure parameters are updated if image processor is
    created using from_dict and kwargs e.g. `FlavaImageProcessor.from_pretrained(checkpoint, codebook_size=600)`
    """
    image_processor_dict = image_processor_dict.copy()
    if "codebook_size" in kwargs:
        image_processor_dict["codebook_size"] = kwargs.pop("codebook_size")
    if "codebook_crop_size" in kwargs:
        image_processor_dict["codebook_crop_size"] = kwargs.pop("codebook_crop_size")
    return super().from_dict(image_processor_dict, **kwargs)
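
Because of this override, codebook-related kwargs passed to `from_pretrained` are merged into the stored image-processor dict rather than being silently dropped. A minimal sketch (assuming the `facebook/flava-full` checkpoint referenced elsewhere on this page is reachable):

```python
from mindnlp.transformers import FlavaImageProcessor  # assumed import path

image_processor = FlavaImageProcessor.from_pretrained(
    "facebook/flava-full",
    codebook_size={"height": 112, "width": 112},
    codebook_crop_size={"height": 112, "width": 112},
)
print(image_processor.codebook_size)  # expected: {'height': 112, 'width': 112}
```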

mindnlp.transformers.models.flava.image_processing_flava.FlavaImageProcessor.preprocess(images, do_resize=None, size=None, resample=None, do_center_crop=None, crop_size=None, do_rescale=None, rescale_factor=None, do_normalize=None, image_mean=None, image_std=None, return_image_mask=None, input_size_patches=None, total_mask_patches=None, mask_group_min_patches=None, mask_group_max_patches=None, mask_group_min_aspect_ratio=None, mask_group_max_aspect_ratio=None, return_codebook_pixels=None, codebook_do_resize=None, codebook_size=None, codebook_resample=None, codebook_do_center_crop=None, codebook_crop_size=None, codebook_do_rescale=None, codebook_rescale_factor=None, codebook_do_map_pixels=None, codebook_do_normalize=None, codebook_image_mean=None, codebook_image_std=None, return_tensors=None, data_format=ChannelDimension.FIRST, input_data_format=None, **kwargs)

Preprocess an image or batch of images.

PARAMETER DESCRIPTION
images

Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If passing in images with pixel values between 0 and 1, set do_rescale=False.

TYPE: `ImageInput`

do_resize

Whether to resize the image.

TYPE: `bool`, *optional*, defaults to `self.do_resize` DEFAULT: None

size

Size of the image.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.size` DEFAULT: None

resample

Resampling filter to use if resizing the image. This can be one of the enum PILImageResampling. Only has an effect if do_resize is set to True.

TYPE: `int`, *optional*, defaults to `self.resample` DEFAULT: None

do_center_crop

Whether to center crop the image.

TYPE: `bool`, *optional*, defaults to `self.do_center_crop` DEFAULT: None

crop_size

Size of the center crop. Only has an effect if do_center_crop is set to True.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.crop_size` DEFAULT: None

do_rescale

Whether to rescale the image values between [0 - 1].

TYPE: `bool`, *optional*, defaults to `self.do_rescale` DEFAULT: None

rescale_factor

Rescale factor to rescale the image by if do_rescale is set to True.

TYPE: `float`, *optional*, defaults to `self.rescale_factor` DEFAULT: None

do_normalize

Whether to normalize the image.

TYPE: `bool`, *optional*, defaults to `self.do_normalize` DEFAULT: None

image_mean

Image mean.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.image_mean` DEFAULT: None

image_std

Image standard deviation.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.image_std` DEFAULT: None

return_image_mask

Whether to return the image mask.

TYPE: `bool`, *optional*, defaults to `self.return_image_mask` DEFAULT: None

input_size_patches

Size of the patches to extract from the image.

TYPE: `int`, *optional*, defaults to `self.input_size_patches` DEFAULT: None

total_mask_patches

Total number of patches to extract from the image.

TYPE: `int`, *optional*, defaults to `self.total_mask_patches` DEFAULT: None

mask_group_min_patches

Minimum number of patches to extract from the image.

TYPE: `int`, *optional*, defaults to `self.mask_group_min_patches` DEFAULT: None

mask_group_max_patches

Maximum number of patches to extract from the image.

TYPE: `int`, *optional*, defaults to `self.mask_group_max_patches` DEFAULT: None

mask_group_min_aspect_ratio

Minimum aspect ratio of the patches to extract from the image.

TYPE: `float`, *optional*, defaults to `self.mask_group_min_aspect_ratio` DEFAULT: None

mask_group_max_aspect_ratio

Maximum aspect ratio of the patches to extract from the image.

TYPE: `float`, *optional*, defaults to `self.mask_group_max_aspect_ratio` DEFAULT: None

return_codebook_pixels

Whether to return the codebook pixels.

TYPE: `bool`, *optional*, defaults to `self.return_codebook_pixels` DEFAULT: None

codebook_do_resize

Whether to resize the codebook pixels.

TYPE: `bool`, *optional*, defaults to `self.codebook_do_resize` DEFAULT: None

codebook_size

Size of the codebook pixels.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.codebook_size` DEFAULT: None

codebook_resample

Resampling filter to use if resizing the codebook pixels. This can be one of the enum PILImageResampling. Only has an effect if codebook_do_resize is set to True.

TYPE: `int`, *optional*, defaults to `self.codebook_resample` DEFAULT: None

codebook_do_center_crop

Whether to center crop the codebook pixels.

TYPE: `bool`, *optional*, defaults to `self.codebook_do_center_crop` DEFAULT: None

codebook_crop_size

Size of the center crop of the codebook pixels. Only has an effect if codebook_do_center_crop is set to True.

TYPE: `Dict[str, int]`, *optional*, defaults to `self.codebook_crop_size` DEFAULT: None

codebook_do_rescale

Whether to rescale the codebook pixels values between [0 - 1].

TYPE: `bool`, *optional*, defaults to `self.codebook_do_rescale` DEFAULT: None

codebook_rescale_factor

Rescale factor to rescale the codebook pixels by if codebook_do_rescale is set to True.

TYPE: `float`, *optional*, defaults to `self.codebook_rescale_factor` DEFAULT: None

codebook_do_map_pixels

Whether to map the codebook pixels values.

TYPE: `bool`, *optional*, defaults to `self.codebook_do_map_pixels` DEFAULT: None

codebook_do_normalize

Whether to normalize the codebook pixels.

TYPE: `bool`, *optional*, defaults to `self.codebook_do_normalize` DEFAULT: None

codebook_image_mean

Codebook pixels mean to normalize the codebook pixels by if codebook_do_normalize is set to True.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.codebook_image_mean` DEFAULT: None

codebook_image_std

Codebook pixels standard deviation to normalize the codebook pixels by if codebook_do_normalize is set to True.

TYPE: `float` or `List[float]`, *optional*, defaults to `self.codebook_image_std` DEFAULT: None

return_tensors

The type of tensors to return. Can be one of:

  • Unset: Return a list of np.ndarray.
  • TensorType.TENSORFLOW or 'tf': Return a batch of type tf.Tensor.
  • TensorType.PYTORCH or 'pt': Return a batch of type torch.Tensor.
  • TensorType.NUMPY or 'np': Return a batch of type np.ndarray.
  • TensorType.JAX or 'jax': Return a batch of type jax.numpy.ndarray.

TYPE: `str` or `TensorType`, *optional* DEFAULT: None

data_format

The channel dimension format for the output image. Can be one of:

  • ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • ChannelDimension.LAST: image in (height, width, num_channels) format.

TYPE: `ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST` DEFAULT: FIRST

input_data_format

The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • "none" or ChannelDimension.NONE: image in (height, width) format.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None
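
A minimal call sketch for `preprocess` (assuming PIL and NumPy are available and the `mindnlp.transformers` import path mirrors `transformers`); the flags can be overridden per call and determine which keys appear in the returned `BatchFeature`:

```python
import numpy as np
from PIL import Image
from mindnlp.transformers import FlavaImageProcessor  # assumed import path

image_processor = FlavaImageProcessor()

# A random RGB image stands in for real data.
image = Image.fromarray(np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8))

features = image_processor.preprocess(
    image,
    return_image_mask=True,        # per-call override of the stored default
    return_codebook_pixels=True,
    return_tensors="np",
)
print(sorted(features.keys()))
# ['bool_masked_pos', 'codebook_pixel_values', 'pixel_values']
```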

Source code in mindnlp/transformers/models/flava/image_processing_flava.py
def preprocess(
    self,
    images: ImageInput,
    do_resize: Optional[bool] = None,
    size: Dict[str, int] = None,
    resample: PILImageResampling = None,
    do_center_crop: Optional[bool] = None,
    crop_size: Optional[Dict[str, int]] = None,
    do_rescale: Optional[bool] = None,
    rescale_factor: Optional[float] = None,
    do_normalize: Optional[bool] = None,
    image_mean: Optional[Union[float, List[float]]] = None,
    image_std: Optional[Union[float, List[float]]] = None,
    # Mask related params
    return_image_mask: Optional[bool] = None,
    input_size_patches: Optional[int] = None,
    total_mask_patches: Optional[int] = None,
    mask_group_min_patches: Optional[int] = None,
    mask_group_max_patches: Optional[int] = None,
    mask_group_min_aspect_ratio: Optional[float] = None,
    mask_group_max_aspect_ratio: Optional[float] = None,
    # Codebook related params
    return_codebook_pixels: Optional[bool] = None,
    codebook_do_resize: Optional[bool] = None,
    codebook_size: Optional[Dict[str, int]] = None,
    codebook_resample: Optional[int] = None,
    codebook_do_center_crop: Optional[bool] = None,
    codebook_crop_size: Optional[Dict[str, int]] = None,
    codebook_do_rescale: Optional[bool] = None,
    codebook_rescale_factor: Optional[float] = None,
    codebook_do_map_pixels: Optional[bool] = None,
    codebook_do_normalize: Optional[bool] = None,
    codebook_image_mean: Optional[Iterable[float]] = None,
    codebook_image_std: Optional[Iterable[float]] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    data_format: ChannelDimension = ChannelDimension.FIRST,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> BatchFeature:
    """
    Preprocess an image or batch of images.

    Args:
        images (`ImageInput`):
            Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
            passing in images with pixel values between 0 and 1, set `do_rescale=False`.
        do_resize (`bool`, *optional*, defaults to `self.do_resize`):
            Whether to resize the image.
        size (`Dict[str, int]`, *optional*, defaults to `self.size`):
            Size of the image.
        resample (`int`, *optional*, defaults to `self.resample`):
            Resampling filter to use if resizing the image. This can be one of the enum `PILImageResampling`. Only
            has an effect if `do_resize` is set to `True`.
        do_center_crop (`bool`, *optional*, defaults to `self.do_center_crop`):
            Whether to center crop the image.
        crop_size (`Dict[str, int]`, *optional*, defaults to `self.crop_size`):
            Size of the center crop. Only has an effect if `do_center_crop` is set to `True`.
        do_rescale (`bool`, *optional*, defaults to `self.do_rescale`):
            Whether to rescale the image values to the range [0, 1].
        rescale_factor (`float`, *optional*, defaults to `self.rescale_factor`):
            Rescale factor to rescale the image by if `do_rescale` is set to `True`.
        do_normalize (`bool`, *optional*, defaults to `self.do_normalize`):
            Whether to normalize the image.
        image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
            Image mean.
        image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
            Image standard deviation.
        return_image_mask (`bool`, *optional*, defaults to `self.return_image_mask`):
            Whether to return the image mask.
        input_size_patches (`int`, *optional*, defaults to `self.input_size_patches`):
            Size of the patches to extract from the image.
        total_mask_patches (`int`, *optional*, defaults to `self.total_mask_patches`):
            Total number of patches to extract from the image.
        mask_group_min_patches (`int`, *optional*, defaults to `self.mask_group_min_patches`):
            Minimum number of patches to extract from the image.
        mask_group_max_patches (`int`, *optional*, defaults to `self.mask_group_max_patches`):
            Maximum number of patches to extract from the image.
        mask_group_min_aspect_ratio (`float`, *optional*, defaults to `self.mask_group_min_aspect_ratio`):
            Minimum aspect ratio of the patches to extract from the image.
        mask_group_max_aspect_ratio (`float`, *optional*, defaults to `self.mask_group_max_aspect_ratio`):
            Maximum aspect ratio of the patches to extract from the image.
        return_codebook_pixels (`bool`, *optional*, defaults to `self.return_codebook_pixels`):
            Whether to return the codebook pixels.
        codebook_do_resize (`bool`, *optional*, defaults to `self.codebook_do_resize`):
            Whether to resize the codebook pixels.
        codebook_size (`Dict[str, int]`, *optional*, defaults to `self.codebook_size`):
            Size of the codebook pixels.
        codebook_resample (`int`, *optional*, defaults to `self.codebook_resample`):
            Resampling filter to use if resizing the codebook pixels. This can be one of the enum
            `PILImageResampling`. Only has an effect if `codebook_do_resize` is set to `True`.
        codebook_do_center_crop (`bool`, *optional*, defaults to `self.codebook_do_center_crop`):
            Whether to center crop the codebook pixels.
        codebook_crop_size (`Dict[str, int]`, *optional*, defaults to `self.codebook_crop_size`):
            Size of the center crop of the codebook pixels. Only has an effect if `codebook_do_center_crop` is set
            to `True`.
        codebook_do_rescale (`bool`, *optional*, defaults to `self.codebook_do_rescale`):
            Whether to rescale the codebook pixel values to the range [0, 1].
        codebook_rescale_factor (`float`, *optional*, defaults to `self.codebook_rescale_factor`):
            Rescale factor to rescale the codebook pixels by if `codebook_do_rescale` is set to `True`.
        codebook_do_map_pixels (`bool`, *optional*, defaults to `self.codebook_do_map_pixels`):
            Whether to map the codebook pixel values.
        codebook_do_normalize (`bool`, *optional*, defaults to `self.codebook_do_normalize`):
            Whether to normalize the codebook pixels.
        codebook_image_mean (`float` or `List[float]`, *optional*, defaults to `self.codebook_image_mean`):
            Codebook pixels mean to normalize the codebook pixels by if `codebook_do_normalize` is set to `True`.
        codebook_image_std (`float` or `List[float]`, *optional*, defaults to `self.codebook_image_std`):
            Codebook pixels standard deviation to normalize the codebook pixels by if `codebook_do_normalize` is
            set to `True`.
        return_tensors (`str` or `TensorType`, *optional*):
            The type of tensors to return. Can be one of:

            - Unset: Return a list of `np.ndarray`.
            - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
            - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
            - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
            - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
        data_format (`ChannelDimension` or `str`, *optional*, defaults to `ChannelDimension.FIRST`):
            The channel dimension format for the output image. Can be one of:

            - `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `ChannelDimension.LAST`: image in (height, width, num_channels) format.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image. If unset, the channel dimension format is inferred
            from the input image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
    """
    do_resize = do_resize if do_resize is not None else self.do_resize
    size = size if size is not None else self.size
    size = get_size_dict(size)
    resample = resample if resample is not None else self.resample
    do_center_crop = do_center_crop if do_center_crop is not None else self.do_center_crop
    crop_size = crop_size if crop_size is not None else self.crop_size
    crop_size = get_size_dict(crop_size, param_name="crop_size")
    do_rescale = do_rescale if do_rescale is not None else self.do_rescale
    rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
    do_normalize = do_normalize if do_normalize is not None else self.do_normalize
    image_mean = image_mean if image_mean is not None else self.image_mean
    image_std = image_std if image_std is not None else self.image_std

    return_image_mask = return_image_mask if return_image_mask is not None else self.return_image_mask
    input_size_patches = input_size_patches if input_size_patches is not None else self.input_size_patches
    total_mask_patches = total_mask_patches if total_mask_patches is not None else self.total_mask_patches
    mask_group_min_patches = (
        mask_group_min_patches if mask_group_min_patches is not None else self.mask_group_min_patches
    )
    mask_group_max_patches = (
        mask_group_max_patches if mask_group_max_patches is not None else self.mask_group_max_patches
    )
    mask_group_min_aspect_ratio = (
        mask_group_min_aspect_ratio
        if mask_group_min_aspect_ratio is not None
        else self.mask_group_min_aspect_ratio
    )
    mask_group_max_aspect_ratio = (
        mask_group_max_aspect_ratio
        if mask_group_max_aspect_ratio is not None
        else self.mask_group_max_aspect_ratio
    )

    return_codebook_pixels = (
        return_codebook_pixels if return_codebook_pixels is not None else self.return_codebook_pixels
    )
    codebook_do_resize = codebook_do_resize if codebook_do_resize is not None else self.codebook_do_resize
    codebook_size = codebook_size if codebook_size is not None else self.codebook_size
    codebook_size = get_size_dict(codebook_size, param_name="codebook_size")
    codebook_resample = codebook_resample if codebook_resample is not None else self.codebook_resample
    codebook_do_rescale = codebook_do_rescale if codebook_do_rescale is not None else self.codebook_do_rescale
    codebook_rescale_factor = (
        codebook_rescale_factor if codebook_rescale_factor is not None else self.codebook_rescale_factor
    )
    codebook_do_center_crop = (
        codebook_do_center_crop if codebook_do_center_crop is not None else self.codebook_do_center_crop
    )
    codebook_crop_size = codebook_crop_size if codebook_crop_size is not None else self.codebook_crop_size
    codebook_crop_size = get_size_dict(codebook_crop_size, param_name="codebook_crop_size")
    codebook_do_map_pixels = (
        codebook_do_map_pixels if codebook_do_map_pixels is not None else self.codebook_do_map_pixels
    )
    codebook_do_normalize = (
        codebook_do_normalize if codebook_do_normalize is not None else self.codebook_do_normalize
    )
    codebook_image_mean = codebook_image_mean if codebook_image_mean is not None else self.codebook_image_mean
    codebook_image_std = codebook_image_std if codebook_image_std is not None else self.codebook_image_std

    images = make_list_of_images(images)

    validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_processor_keys)

    if not valid_images(images):
        raise ValueError(
            "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
            "torch.Tensor, tf.Tensor or jax.ndarray."
        )

    processed_images = [
        self._preprocess_image(
            image=img,
            do_resize=do_resize,
            size=size,
            resample=resample,
            do_center_crop=do_center_crop,
            crop_size=crop_size,
            do_rescale=do_rescale,
            rescale_factor=rescale_factor,
            do_normalize=do_normalize,
            image_mean=image_mean,
            image_std=image_std,
            do_map_pixels=False,
            data_format=data_format,
            input_data_format=input_data_format,
        )
        for img in images
    ]
    data = {"pixel_values": processed_images}

    if return_codebook_pixels:
        codebook_images = [
            self._preprocess_image(
                image=img,
                do_resize=codebook_do_resize,
                size=codebook_size,
                resample=codebook_resample,
                do_center_crop=codebook_do_center_crop,
                crop_size=codebook_crop_size,
                do_rescale=codebook_do_rescale,
                rescale_factor=codebook_rescale_factor,
                do_normalize=codebook_do_normalize,
                image_mean=codebook_image_mean,
                image_std=codebook_image_std,
                do_map_pixels=codebook_do_map_pixels,
                data_format=data_format,
                input_data_format=input_data_format,
            )
            for img in images
        ]
        data["codebook_pixel_values"] = codebook_images

    if return_image_mask:
        mask_generator = self.masking_generator(
            input_size_patches=input_size_patches,
            total_mask_patches=total_mask_patches,
            mask_group_min_patches=mask_group_min_patches,
            mask_group_max_patches=mask_group_max_patches,
            mask_group_min_aspect_ratio=mask_group_min_aspect_ratio,
            mask_group_max_aspect_ratio=mask_group_max_aspect_ratio,
        )
        masks = [mask_generator() for _ in images]
        data["bool_masked_pos"] = masks

    return BatchFeature(data=data, tensor_type=return_tensors)
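
A minimal usage sketch of `preprocess` (hedged: it follows the `from transformers import ...` convention used in the docstrings on this page, assumes the `facebook/flava-full` checkpoint is available, and feeds a random NumPy array in place of a real image):

```python
>>> import numpy as np
>>> from transformers import FlavaImageProcessor
...
>>> processor = FlavaImageProcessor.from_pretrained("facebook/flava-full")
>>> # Dummy RGB image with values in [0, 255]
>>> image = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)
...
>>> batch = processor.preprocess(
...     image,
...     return_image_mask=True,
...     return_codebook_pixels=True,
...     return_tensors="np",
... )
>>> # Expected keys: "pixel_values", "codebook_pixel_values" and "bool_masked_pos"
>>> print(list(batch.keys()))
```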

mindnlp.transformers.models.flava.image_processing_flava.FlavaImageProcessor.resize(image, size, resample=PILImageResampling.BICUBIC, data_format=None, input_data_format=None, **kwargs)

Resize an image to (size["height"], size["width"]).

PARAMETER DESCRIPTION
image

Image to resize.

TYPE: `np.ndarray`

size

Dictionary in the format {"height": int, "width": int} specifying the size of the output image.

TYPE: `Dict[str, int]`

resample

PILImageResampling filter to use when resizing the image e.g. PILImageResampling.BICUBIC.

TYPE: `PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC` DEFAULT: BICUBIC

data_format

The channel dimension format for the output image. If unset, the channel dimension format of the input image is used. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • "none" or ChannelDimension.NONE: image in (height, width) format.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None

input_data_format

The channel dimension format for the input image. If unset, the channel dimension format is inferred from the input image. Can be one of:

  • "channels_first" or ChannelDimension.FIRST: image in (num_channels, height, width) format.
  • "channels_last" or ChannelDimension.LAST: image in (height, width, num_channels) format.
  • "none" or ChannelDimension.NONE: image in (height, width) format.

TYPE: `ChannelDimension` or `str`, *optional* DEFAULT: None

RETURNS DESCRIPTION
ndarray

np.ndarray: The resized image.

Source code in mindnlp/transformers/models/flava/image_processing_flava.py, lines 375-423
def resize(
    self,
    image: np.ndarray,
    size: Dict[str, int],
    resample: PILImageResampling = PILImageResampling.BICUBIC,
    data_format: Optional[Union[str, ChannelDimension]] = None,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
    **kwargs,
) -> np.ndarray:
    """
    Resize an image to `(size["height"], size["width"])`.

    Args:
        image (`np.ndarray`):
            Image to resize.
        size (`Dict[str, int]`):
            Dictionary in the format `{"height": int, "width": int}` specifying the size of the output image.
        resample (`PILImageResampling`, *optional*, defaults to `PILImageResampling.BICUBIC`):
            `PILImageResampling` filter to use when resizing the image e.g. `PILImageResampling.BICUBIC`.
        data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the output image. If unset, the channel dimension format of the input
            image is used. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
        input_data_format (`ChannelDimension` or `str`, *optional*):
            The channel dimension format for the input image. If unset, the channel dimension format is inferred
            from the input image. Can be one of:

            - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
            - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
            - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.

    Returns:
        `np.ndarray`: The resized image.
    """
    size = get_size_dict(size)
    if "height" not in size or "width" not in size:
        raise ValueError(f"The `size` dictionary must contain the keys `height` and `width`. Got {size.keys()}")
    output_size = (size["height"], size["width"])
    return resize(
        image,
        size=output_size,
        resample=resample,
        data_format=data_format,
        input_data_format=input_data_format,
        **kwargs,
    )
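
A short, hedged example of calling `resize` directly (it assumes a channels-first NumPy array; with `data_format=None` the output keeps the input channel layout):

```python
>>> import numpy as np
>>> from transformers import FlavaImageProcessor
...
>>> processor = FlavaImageProcessor()  # default configuration
>>> image = np.zeros((3, 480, 640), dtype=np.float32)  # dummy channels-first image
>>> resized = processor.resize(image, size={"height": 224, "width": 224})
>>> print(resized.shape)  # (3, 224, 224)
```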

mindnlp.transformers.models.flava.modeling_flava

Mindspore FLAVA model.

mindnlp.transformers.models.flava.modeling_flava.FlavaForPreTraining

Bases: FlavaPreTrainedModel

Source code in mindnlp/transformers/models/flava/modeling_flava.py, lines 1535-1912
class FlavaForPreTraining(FlavaPreTrainedModel):
    # Those are linked to xxx.bias
    _tied_weights_keys = [
        "mmm_text_head.decoder.bias",
        "mmm_image_head.decoder.bias",
        "mlm_head.decoder.bias",
        "mim_head.decoder.bias",
    ]

    def __init__(self, config: FlavaConfig, image_codebook: Optional[nn.Module] = None):
        super().__init__(config)
        self.flava = FlavaModel(config)

        self.image_codebook = image_codebook
        if self.image_codebook is None and config.init_codebook:
            self.image_codebook = FlavaImageCodebook(config.image_codebook_config)

        # Leverage the text and image encoder configs to create the masked
        # prediction heads since they have the right vocab sizes
        self.mim_head = FlavaMaskedPredictionHead(config.image_config)
        self.mlm_head = FlavaMaskedPredictionHead(config.text_config)
        self.itm_head = FlavaITMHead(config)
        self.mmm_image_head = FlavaMaskedPredictionHead(config.image_config)
        self.mmm_text_head = FlavaMaskedPredictionHead(config.text_config)
        self.global_contrastive_head = FlavaGlobalContrastiveHead(config)

        self.image_vocab_size = config.image_config.vocab_size
        self.text_vocab_size = config.text_config.vocab_size
        self.mlm_weight = config.mlm_weight
        self.mim_weight = config.mim_weight
        self.global_contrastive_weight = config.global_contrastive_weight
        self.ce_ignore_index = config.ce_ignore_index
        self.itm_weight = config.itm_weight
        self.mmm_image_weight = config.mmm_image_weight
        self.mmm_text_weight = config.mmm_text_weight
        self.skip_unmasked_multimodal_encoder = config.skip_unmasked_multimodal_encoder

        self.post_init()

    def _resize_to_2d(self, x: mindspore.Tensor):
        if x.dim() > 2:
            x = x.view(x.shape[0], -1)
        return x


    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        input_ids_masked: Optional[mindspore.Tensor] = None,
        pixel_values: Optional[mindspore.Tensor] = None,
        codebook_pixel_values: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        bool_masked_pos: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        image_attention_mask: Optional[mindspore.Tensor] = None,
        skip_unmasked_multimodal_encoder: bool = None,
        mlm_labels: Optional[mindspore.Tensor] = None,
        mim_labels: Optional[mindspore.Tensor] = None,
        itm_labels: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: bool = True,
        return_dict: Optional[bool] = None,
        return_loss: Optional[bool] = None,
    ) -> Union[Tuple[mindspore.Tensor], FlavaForPreTrainingOutput]:
        """
        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import FlavaForPreTraining, AutoProcessor
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> model = FlavaForPreTraining.from_pretrained("facebook/flava-full")
            >>> processor = AutoProcessor.from_pretrained("facebook/flava-full")
            ...
            >>> text = ["a photo of a cat"]
            ...
            >>> inputs = processor(
            ...     images=[image],
            ...     text=text,
            ...     return_masks=True,
            ...     return_codebook_pixels=True,
            ...     padding=True,
            ...     max_length=77,
            ...     return_tensors="pt",
            ... )
            ...
            ...
            >>> output = model(**inputs)
            ```

        Return:
            `Union[Tuple[mindspore.Tensor], FlavaForPreTrainingOutput]`
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        return_loss = return_loss if return_loss is not None else self.config.return_loss

        skip_unmasked_multimodal_encoder = (
            skip_unmasked_multimodal_encoder
            if skip_unmasked_multimodal_encoder is not None
            else self.skip_unmasked_multimodal_encoder
        )

        if input_ids_masked is None and input_ids is not None:
            logger.warning(
                "`input_ids_masked` isn't passed which means MLM loss won't be calculated correctly. Setting it to"
                " `input_ids` so that the model can work. Please pass it if this is unintentional. This is usually"
                " OKAY if you are doing inference on unmasked text..."
            )
            input_ids_masked = input_ids

        flava_output = self.flava(
            input_ids=input_ids,
            pixel_values=pixel_values,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            image_attention_mask=image_attention_mask,
            # Don't need unmasked multimodal embedding for anything so skip it
            # NOTE: ITM uses masked version
            skip_multimodal_encoder=skip_unmasked_multimodal_encoder,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            # Pass true to have deterministic outputs
            return_dict=True,
        )

        flava_masked_output = self.flava(
            input_ids=input_ids_masked,
            pixel_values=pixel_values,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            image_attention_mask=image_attention_mask,
            bool_masked_pos=bool_masked_pos,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=True,
        )

        pos_mask = None

        image_embeddings = flava_output.image_embeddings
        text_embeddings = flava_output.text_embeddings
        image_masked_embeddings = flava_masked_output.image_embeddings
        text_masked_embeddings = flava_masked_output.text_embeddings
        multimodal_masked_embeddings = flava_masked_output.multimodal_embeddings

        total_loss = mim_loss = mlm_loss = mmm_text_loss = mmm_image_loss = gc_loss = itm_loss = None
        mim_logits = mlm_logits = mmm_text_logits = mmm_image_logits = None
        itm_logits = logits_per_image = logits_per_text = None

        # Calculate mim_labels if necessary from the image_codebook
        if image_masked_embeddings is not None or multimodal_masked_embeddings is not None:
            if mim_labels is None and return_loss:
                if self.image_codebook is None:
                    raise RuntimeError(
                        "`return_loss` is set to True but the image codebook is not initialized and no `mim_labels` "
                        " have been passed. Reinstantiate the model with `init_codebook` set to True or "
                        "pass in your custom `mim_labels`"
                    )
                if codebook_pixel_values is None:
                    raise ValueError(
                        "`codebook_pixel_values` are required to generate `mim_labels` if loss is expected. "
                        "Call `AutoProcessor` with `return_codebook_pixels` set to True"
                    )
                mim_labels = self.image_codebook.get_codebook_indices(codebook_pixel_values)
        # Unimodal MIM Loss
        # If multimodal embeddings are present, we will calculate MMM loss
        if self.mim_weight > 0 and image_masked_embeddings is not None and multimodal_masked_embeddings is None:
            sequence_for_image = image_masked_embeddings

            if mim_labels is not None:
                mim_labels = self._resize_to_2d(mim_labels)
                bool_masked_pos = self._resize_to_2d(bool_masked_pos)
                mim_labels[bool_masked_pos.ne(True)] = self.ce_ignore_index

                sequence_for_image = sequence_for_image[:, -mim_labels.shape[1] :, :]
                masked_tokens = mim_labels.ne(self.ce_ignore_index)
                mim_labels_filtered = mim_labels[masked_tokens]
                # sequence_for_image = sequence_for_image[masked_tokens, :]
                sequence_for_image = mindspore.Tensor(sequence_for_image.asnumpy()[masked_tokens.asnumpy(), :], mindspore.float32)
                mim_logits = self.mim_head(sequence_for_image)
                if return_loss:
                    mim_loss = ops.cross_entropy(
                        mim_logits.view(-1, self.image_vocab_size), mim_labels_filtered.view(-1)
                    )
                    mim_loss *= self.mim_weight
            else:
                mim_logits = self.mim_head(sequence_for_image)

        # Unimodal MLM Loss
        if self.mlm_weight > 0 and text_masked_embeddings is not None and multimodal_masked_embeddings is None:
            sequence_for_text = text_masked_embeddings
            if mlm_labels is not None:
                mlm_labels = self._resize_to_2d(mlm_labels)
                sequence_for_text = sequence_for_text[:, -mlm_labels.shape[1] :, :]
                masked_tokens = mlm_labels.ne(self.ce_ignore_index)
                mlm_labels_filtered = mlm_labels[masked_tokens]
                # sequence_for_text = sequence_for_text[masked_tokens, :]
                sequence_for_text = mindspore.Tensor(sequence_for_text.asnumpy()[masked_tokens.asnumpy(), :], mindspore.float32)
                mlm_logits = self.mlm_head(sequence_for_text)
                if return_loss:
                    mlm_loss = ops.cross_entropy(
                        mlm_logits.view(-1, self.text_vocab_size), mlm_labels_filtered.view(-1)
                    )
                    mlm_loss *= self.mlm_weight
            else:
                mlm_logits = self.mlm_head(sequence_for_text)

        # ITM Loss
        if self.itm_weight > 0 and multimodal_masked_embeddings is not None:
            itm_logits = self.itm_head(multimodal_masked_embeddings)

            if itm_labels is not None:
                pos_pairs = itm_labels.ne(0)
                # pos_mask = ops.where(pos_pairs.any(), pos_pairs, pos_pairs.new([True]))
                pos_mask = ops.where(pos_pairs.any(), pos_pairs, mindspore.Tensor([True], dtype=pos_pairs.dtype))
                if return_loss:
                    itm_loss = ops.cross_entropy(itm_logits, ops.cast(itm_labels, mindspore.int32))
                    itm_loss *= self.itm_weight

                if multimodal_masked_embeddings is not None:
                    multimodal_masked_embeddings = multimodal_masked_embeddings[pos_mask]

                if mlm_labels is not None:
                    mlm_labels = mlm_labels[pos_mask]

                if mim_labels is not None:
                    mim_labels = mim_labels[pos_mask]
                    bool_masked_pos = bool_masked_pos[pos_mask]

        # MMM Image Loss
        if multimodal_masked_embeddings is not None and self.mmm_image_weight > 0:
            sequence_for_image = multimodal_masked_embeddings
            end_index = image_masked_embeddings.shape[1] - 1
            sequence_for_image = sequence_for_image[:, 2 : 2 + end_index, :]

            if mim_labels is not None:
                mim_labels = self._resize_to_2d(mim_labels)
                bool_masked_pos = self._resize_to_2d(bool_masked_pos)
                mim_labels[bool_masked_pos.ne(True)] = self.ce_ignore_index

                masked_tokens = mim_labels.ne(self.ce_ignore_index)

                masked_tokens_np = masked_tokens.asnumpy()
                indices_to_modify = np.where(masked_tokens_np)
                mim_labels_filtered = mim_labels[masked_tokens]

                sequence_for_image = mindspore.Tensor(sequence_for_image.asnumpy()[masked_tokens.asnumpy(), :], mindspore.float32)
                # sequence_for_image = sequence_for_image[masked_tokens, :]
                mmm_image_logits = self.mmm_image_head(sequence_for_image)
                if return_loss:
                    mmm_image_loss = ops.cross_entropy(
                        mmm_image_logits.view(-1, self.image_vocab_size), mim_labels_filtered.view(-1)
                    )
                    mmm_image_loss *= self.mmm_image_weight
            else:
                mmm_image_logits = self.mmm_image_head(sequence_for_image)

        # MMM Text Loss
        if multimodal_masked_embeddings is not None and self.mmm_text_weight > 0:
            sequence_for_text = multimodal_masked_embeddings
            sequence_for_text = sequence_for_text[:, -text_masked_embeddings.shape[1] :, :]

            if mlm_labels is not None:
                mlm_labels = self._resize_to_2d(mlm_labels)
                masked_tokens = mlm_labels.ne(self.ce_ignore_index)
                mlm_labels_filtered = mlm_labels[masked_tokens]
                sequence_for_text = mindspore.Tensor(sequence_for_text.asnumpy()[masked_tokens.asnumpy(), :], mindspore.float32)
                # sequence_for_text = sequence_for_text[masked_tokens, :]
                mmm_text_logits = self.mmm_text_head(sequence_for_text)
                if return_loss:
                    mmm_text_loss = ops.cross_entropy(
                        mmm_text_logits.view(-1, self.text_vocab_size), mlm_labels_filtered.view(-1)
                    )
                    mmm_text_loss *= self.mmm_text_weight
            else:
                mmm_text_logits = self.mmm_text_head(sequence_for_text)

        # Global Contrastive Loss
        if image_embeddings is not None and text_embeddings is not None and self.global_contrastive_weight > 0:
            text_embedding = self.flava.text_projection(text_embeddings[:, 0, :])
            # text_embedding = ops.normalize(text_embedding, dim=-1)
            text_embedding = ops.L2Normalize(axis=-1)(text_embedding)

            image_embedding = self.flava.image_projection(image_embeddings[:, 0, :])
            # image_embedding = ops.normalize(image_embedding, dim=-1)
            image_embedding = ops.L2Normalize(axis=-1)(image_embedding)

            self.flava.logit_scale.data.clamp(LOGIT_SCALE_CLAMP_MIN, LOGIT_SCALE_CLAMP_MAX)

            logits_per_image, logits_per_text, gc_labels = self.global_contrastive_head(
                image_embedding, text_embedding, self.flava.logit_scale
            )

            # Apply ITM negative mask if any
            if pos_mask is not None:
                logits_per_image = logits_per_image[pos_mask]
                logits_per_text = logits_per_text[pos_mask]
                gc_labels = gc_labels[pos_mask]

            if return_loss:
                gc_loss_image = ops.cross_entropy(logits_per_image, gc_labels)
                gc_loss_text = ops.cross_entropy(logits_per_text, gc_labels)
                gc_loss = (gc_loss_image + gc_loss_text) / 2
                gc_loss *= self.global_contrastive_weight

        flava_losses = FlavaLosses(
            mim=mim_loss,
            mlm=mlm_loss,
            itm=itm_loss,
            global_contrastive=gc_loss,
            mmm_image=mmm_image_loss,
            mmm_text=mmm_text_loss,
        )

        if return_loss and not flava_losses.all_none():
            total_loss = sum(loss if loss is not None else 0 for loss in flava_losses.values())

        if not return_dict:
            output = (
                image_embeddings,
                flava_output.image_output.to_tuple() if flava_output.image_output is not None else None,
                text_embeddings,
                flava_output.text_output.to_tuple() if flava_output.text_output is not None else None,
                flava_output.multimodal_embeddings,
                flava_output.multimodal_output.to_tuple() if flava_output.multimodal_output is not None else None,
                image_masked_embeddings,
                flava_masked_output.image_output.to_tuple() if flava_masked_output.image_output is not None else None,
                text_masked_embeddings,
                flava_masked_output.text_output.to_tuple() if flava_masked_output.text_output is not None else None,
                multimodal_masked_embeddings,
                flava_masked_output.multimodal_output.to_tuple()
                if flava_masked_output.multimodal_output is not None
                else None,
                mim_logits,
                mlm_logits,
                itm_logits,
                logits_per_image,
                logits_per_text,
                mmm_image_logits,
                mmm_text_logits,
            )
            if return_loss and not flava_losses.all_none():
                output = (
                    total_loss,
                    flava_losses,
                ) + output

            # Filter None as transformer by default won't handle it
            return tuple(x for x in output if x is not None)

        return FlavaForPreTrainingOutput(
            loss=total_loss,
            loss_info=flava_losses,
            image_embeddings=image_embeddings,
            image_output=flava_output.image_output,
            text_embeddings=text_embeddings,
            text_output=flava_output.text_output,
            multimodal_embeddings=flava_output.multimodal_embeddings,
            multimodal_output=flava_output.multimodal_output,
            image_masked_embeddings=image_masked_embeddings,
            image_masked_output=flava_masked_output.image_output,
            text_masked_embeddings=text_masked_embeddings,
            text_masked_output=flava_masked_output.text_output,
            multimodal_masked_embeddings=multimodal_masked_embeddings,
            multimodal_masked_output=flava_masked_output.multimodal_output,
            mim_logits=mim_logits,
            mlm_logits=mlm_logits,
            itm_logits=itm_logits,
            contrastive_logits_per_image=logits_per_image,
            contrastive_logits_per_text=logits_per_text,
            mmm_image_logits=mmm_image_logits,
            mmm_text_logits=mmm_text_logits,
        )
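
The unimodal MIM/MLM and multimodal MMM branches above all follow the same pattern: set the labels of unmasked positions to `ce_ignore_index`, keep only the masked positions, and run the prediction head on that selection. The standalone NumPy sketch below (hypothetical shapes and values, not part of the library) illustrates that select-then-score step:

```python
import numpy as np

ce_ignore_index = -100

# Hypothetical batch: 2 sequences of 4 tokens with hidden size 8
hidden_states = np.random.randn(2, 4, 8).astype(np.float32)
labels = np.array([[5, 7, 2, 9], [1, 3, 8, 4]])
bool_masked_pos = np.array([[True, False, True, False],
                            [False, True, False, False]])

# Unmasked positions must not contribute to the loss
labels = np.where(bool_masked_pos, labels, ce_ignore_index)
masked_tokens = labels != ce_ignore_index

# Gather only the masked positions, mirroring `sequence_for_image[masked_tokens, :]`
selected_states = hidden_states[masked_tokens]  # shape (num_masked, 8)
selected_labels = labels[masked_tokens]         # shape (num_masked,)
print(selected_states.shape, selected_labels)   # (3, 8) [5 2 3]
```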

mindnlp.transformers.models.flava.modeling_flava.FlavaForPreTraining.forward(input_ids=None, input_ids_masked=None, pixel_values=None, codebook_pixel_values=None, attention_mask=None, token_type_ids=None, bool_masked_pos=None, position_ids=None, image_attention_mask=None, skip_unmasked_multimodal_encoder=None, mlm_labels=None, mim_labels=None, itm_labels=None, output_attentions=None, output_hidden_states=True, return_dict=None, return_loss=None)

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import FlavaForPreTraining, AutoProcessor
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> model = FlavaForPreTraining.from_pretrained("facebook/flava-full")
>>> processor = AutoProcessor.from_pretrained("facebook/flava-full")
...
>>> text = ["a photo of a cat"]
...
>>> inputs = processor(
...     images=[image],
...     text=text,
...     return_masks=True,
...     return_codebook_pixels=True,
...     padding=True,
...     max_length=77,
...     return_tensors="pt",
... )
...
...
>>> output = model(**inputs)

Return

Union[Tuple[mindspore.Tensor], FlavaForPreTrainingOutput]

Source code in mindnlp/transformers/models/flava/modeling_flava.py, lines 1580-1912
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    input_ids_masked: Optional[mindspore.Tensor] = None,
    pixel_values: Optional[mindspore.Tensor] = None,
    codebook_pixel_values: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    token_type_ids: Optional[mindspore.Tensor] = None,
    bool_masked_pos: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    image_attention_mask: Optional[mindspore.Tensor] = None,
    skip_unmasked_multimodal_encoder: bool = None,
    mlm_labels: Optional[mindspore.Tensor] = None,
    mim_labels: Optional[mindspore.Tensor] = None,
    itm_labels: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: bool = True,
    return_dict: Optional[bool] = None,
    return_loss: Optional[bool] = None,
) -> Union[Tuple[mindspore.Tensor], FlavaForPreTrainingOutput]:
    """
    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import FlavaForPreTraining, AutoProcessor
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> model = FlavaForPreTraining.from_pretrained("facebook/flava-full")
        >>> processor = AutoProcessor.from_pretrained("facebook/flava-full")
        ...
        >>> text = ["a photo of a cat"]
        ...
        >>> inputs = processor(
        ...     images=[image],
        ...     text=text,
        ...     return_masks=True,
        ...     return_codebook_pixels=True,
        ...     padding=True,
        ...     max_length=77,
        ...     return_tensors="pt",
        ... )
        ...
        ...
        >>> output = model(**inputs)
        ```

    Return:
        `Union[Tuple[mindspore.Tensor], FlavaForPreTrainingOutput]`
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    return_loss = return_loss if return_loss is not None else self.config.return_loss

    skip_unmasked_multimodal_encoder = (
        skip_unmasked_multimodal_encoder
        if skip_unmasked_multimodal_encoder is not None
        else self.skip_unmasked_multimodal_encoder
    )

    if input_ids_masked is None and input_ids is not None:
        logger.warning(
            "`input_ids_masked` isn't passed which means MLM loss won't be calculated correctly. Setting it to"
            " `input_ids` so that the model can work. Please pass it if this is unintentional. This is usually"
            " OKAY if you are doing inference on unmasked text..."
        )
        input_ids_masked = input_ids

    flava_output = self.flava(
        input_ids=input_ids,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        image_attention_mask=image_attention_mask,
        # Don't need unmasked multimodal embedding for anything so skip it
        # NOTE: ITM uses masked version
        skip_multimodal_encoder=skip_unmasked_multimodal_encoder,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        # Pass true to have deterministic outputs
        return_dict=True,
    )

    flava_masked_output = self.flava(
        input_ids=input_ids_masked,
        pixel_values=pixel_values,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        image_attention_mask=image_attention_mask,
        bool_masked_pos=bool_masked_pos,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=True,
    )

    pos_mask = None

    image_embeddings = flava_output.image_embeddings
    text_embeddings = flava_output.text_embeddings
    image_masked_embeddings = flava_masked_output.image_embeddings
    text_masked_embeddings = flava_masked_output.text_embeddings
    multimodal_masked_embeddings = flava_masked_output.multimodal_embeddings

    total_loss = mim_loss = mlm_loss = mmm_text_loss = mmm_image_loss = gc_loss = itm_loss = None
    mim_logits = mlm_logits = mmm_text_logits = mmm_image_logits = None
    itm_logits = logits_per_image = logits_per_text = None

    # Calculate mim_labels if necessary from the image_codebook
    if image_masked_embeddings is not None or multimodal_masked_embeddings is not None:
        if mim_labels is None and return_loss:
            if self.image_codebook is None:
                raise RuntimeError(
                    "`return_loss` is set to True but the image codebook is not initialized and no `mim_labels` "
                    " have been passed. Reinstantiate the model with `init_codebook` set to True or "
                    "pass in your custom `mim_labels`"
                )
            if codebook_pixel_values is None:
                raise ValueError(
                    "`codebook_pixel_values` are required to generate `mim_labels` if loss is expected. "
                    "Call `AutoProcessor` with `return_codebook_pixels` set to True"
                )
            mim_labels = self.image_codebook.get_codebook_indices(codebook_pixel_values)
    # Unimodal MIM Loss
    # If multimodal embeddings are present, we will calculate MMM loss
    if self.mim_weight > 0 and image_masked_embeddings is not None and multimodal_masked_embeddings is None:
        sequence_for_image = image_masked_embeddings

        if mim_labels is not None:
            mim_labels = self._resize_to_2d(mim_labels)
            bool_masked_pos = self._resize_to_2d(bool_masked_pos)
            mim_labels[bool_masked_pos.ne(True)] = self.ce_ignore_index

            sequence_for_image = sequence_for_image[:, -mim_labels.shape[1] :, :]
            masked_tokens = mim_labels.ne(self.ce_ignore_index)
            mim_labels_filtered = mim_labels[masked_tokens]
            # sequence_for_image = sequence_for_image[masked_tokens, :]
            sequence_for_image = mindspore.Tensor(sequence_for_image.asnumpy()[masked_tokens.asnumpy(), :], mindspore.float32)
            mim_logits = self.mim_head(sequence_for_image)
            if return_loss:
                mim_loss = ops.cross_entropy(
                    mim_logits.view(-1, self.image_vocab_size), mim_labels_filtered.view(-1)
                )
                mim_loss *= self.mim_weight
        else:
            mim_logits = self.mim_head(sequence_for_image)

    # Unimodal MLM Loss
    if self.mlm_weight > 0 and text_masked_embeddings is not None and multimodal_masked_embeddings is None:
        sequence_for_text = text_masked_embeddings
        if mlm_labels is not None:
            mlm_labels = self._resize_to_2d(mlm_labels)
            sequence_for_text = sequence_for_text[:, -mlm_labels.shape[1] :, :]
            masked_tokens = mlm_labels.ne(self.ce_ignore_index)
            mlm_labels_filtered = mlm_labels[masked_tokens]
            # sequence_for_text = sequence_for_text[masked_tokens, :]
            sequence_for_text = mindspore.Tensor(sequence_for_text.asnumpy()[masked_tokens.asnumpy(), :], mindspore.float32)
            mlm_logits = self.mlm_head(sequence_for_text)
            if return_loss:
                mlm_loss = ops.cross_entropy(
                    mlm_logits.view(-1, self.text_vocab_size), mlm_labels_filtered.view(-1)
                )
                mlm_loss *= self.mlm_weight
        else:
            mlm_logits = self.mlm_head(sequence_for_text)

    # ITM Loss
    if self.itm_weight > 0 and multimodal_masked_embeddings is not None:
        itm_logits = self.itm_head(multimodal_masked_embeddings)

        if itm_labels is not None:
            pos_pairs = itm_labels.ne(0)
            # pos_mask = ops.where(pos_pairs.any(), pos_pairs, pos_pairs.new([True]))
            pos_mask = ops.where(pos_pairs.any(), pos_pairs, mindspore.Tensor([True], dtype=pos_pairs.dtype))
            if return_loss:
                itm_loss = ops.cross_entropy(itm_logits, ops.cast(itm_labels, mindspore.int32))
                itm_loss *= self.itm_weight

            if multimodal_masked_embeddings is not None:
                multimodal_masked_embeddings = multimodal_masked_embeddings[pos_mask]

            if mlm_labels is not None:
                mlm_labels = mlm_labels[pos_mask]

            if mim_labels is not None:
                mim_labels = mim_labels[pos_mask]
                bool_masked_pos = bool_masked_pos[pos_mask]

    # MMM Image Loss
    if multimodal_masked_embeddings is not None and self.mmm_image_weight > 0:
        sequence_for_image = multimodal_masked_embeddings
        end_index = image_masked_embeddings.shape[1] - 1
        sequence_for_image = sequence_for_image[:, 2 : 2 + end_index, :]

        if mim_labels is not None:
            mim_labels = self._resize_to_2d(mim_labels)
            bool_masked_pos = self._resize_to_2d(bool_masked_pos)
            mim_labels[bool_masked_pos.ne(True)] = self.ce_ignore_index

            masked_tokens = mim_labels.ne(self.ce_ignore_index)

            masked_tokens_np = masked_tokens.asnumpy()
            indices_to_modify = np.where(masked_tokens_np)
            mim_labels_filtered = mim_labels[masked_tokens]

            sequence_for_image = mindspore.Tensor(sequence_for_image.asnumpy()[masked_tokens.asnumpy(), :], mindspore.float32)
            # sequence_for_image = sequence_for_image[masked_tokens, :]
            mmm_image_logits = self.mmm_image_head(sequence_for_image)
            if return_loss:
                mmm_image_loss = ops.cross_entropy(
                    mmm_image_logits.view(-1, self.image_vocab_size), mim_labels_filtered.view(-1)
                )
                mmm_image_loss *= self.mmm_image_weight
        else:
            mmm_image_logits = self.mmm_image_head(sequence_for_image)

    # MMM Text Loss
    if multimodal_masked_embeddings is not None and self.mmm_text_weight > 0:
        sequence_for_text = multimodal_masked_embeddings
        sequence_for_text = sequence_for_text[:, -text_masked_embeddings.shape[1] :, :]

        if mlm_labels is not None:
            mlm_labels = self._resize_to_2d(mlm_labels)
            masked_tokens = mlm_labels.ne(self.ce_ignore_index)
            mlm_labels_filtered = mlm_labels[masked_tokens]
            sequence_for_text = mindspore.Tensor(sequence_for_text.asnumpy()[masked_tokens.asnumpy(), :], mindspore.float32)
            # sequence_for_text = sequence_for_text[masked_tokens, :]
            mmm_text_logits = self.mmm_text_head(sequence_for_text)
            if return_loss:
                mmm_text_loss = ops.cross_entropy(
                    mmm_text_logits.view(-1, self.text_vocab_size), mlm_labels_filtered.view(-1)
                )
                mmm_text_loss *= self.mmm_text_weight
        else:
            mmm_text_logits = self.mmm_text_head(sequence_for_text)

    # Global Contrastive Loss
    if image_embeddings is not None and text_embeddings is not None and self.global_contrastive_weight > 0:
        text_embedding = self.flava.text_projection(text_embeddings[:, 0, :])
        # text_embedding = ops.normalize(text_embedding, dim=-1)
        text_embedding = ops.L2Normalize(axis=-1)(text_embedding)

        image_embedding = self.flava.image_projection(image_embeddings[:, 0, :])
        # image_embedding = ops.normalize(image_embedding, dim=-1)
        image_embedding = ops.L2Normalize(axis=-1)(image_embedding)

        self.flava.logit_scale.data.clamp(LOGIT_SCALE_CLAMP_MIN, LOGIT_SCALE_CLAMP_MAX)

        logits_per_image, logits_per_text, gc_labels = self.global_contrastive_head(
            image_embedding, text_embedding, self.flava.logit_scale
        )

        # Apply ITM negative mask if any
        if pos_mask is not None:
            logits_per_image = logits_per_image[pos_mask]
            logits_per_text = logits_per_text[pos_mask]
            gc_labels = gc_labels[pos_mask]

        if return_loss:
            gc_loss_image = ops.cross_entropy(logits_per_image, gc_labels)
            gc_loss_text = ops.cross_entropy(logits_per_text, gc_labels)
            gc_loss = (gc_loss_image + gc_loss_text) / 2
            gc_loss *= self.global_contrastive_weight

    flava_losses = FlavaLosses(
        mim=mim_loss,
        mlm=mlm_loss,
        itm=itm_loss,
        global_contrastive=gc_loss,
        mmm_image=mmm_image_loss,
        mmm_text=mmm_text_loss,
    )

    if return_loss and not flava_losses.all_none():
        total_loss = sum(loss if loss is not None else 0 for loss in flava_losses.values())

    if not return_dict:
        output = (
            image_embeddings,
            flava_output.image_output.to_tuple() if flava_output.image_output is not None else None,
            text_embeddings,
            flava_output.text_output.to_tuple() if flava_output.text_output is not None else None,
            flava_output.multimodal_embeddings,
            flava_output.multimodal_output.to_tuple() if flava_output.multimodal_output is not None else None,
            image_masked_embeddings,
            flava_masked_output.image_output.to_tuple() if flava_masked_output.image_output is not None else None,
            text_masked_embeddings,
            flava_masked_output.text_output.to_tuple() if flava_masked_output.text_output is not None else None,
            multimodal_masked_embeddings,
            flava_masked_output.multimodal_output.to_tuple()
            if flava_masked_output.multimodal_output is not None
            else None,
            mim_logits,
            mlm_logits,
            itm_logits,
            logits_per_image,
            logits_per_text,
            mmm_image_logits,
            mmm_text_logits,
        )
        if return_loss and not flava_losses.all_none():
            output = (
                total_loss,
                flava_losses,
            ) + output

        # Filter None as transformer by default won't handle it
        return tuple(x for x in output if x is not None)

    return FlavaForPreTrainingOutput(
        loss=total_loss,
        loss_info=flava_losses,
        image_embeddings=image_embeddings,
        image_output=flava_output.image_output,
        text_embeddings=text_embeddings,
        text_output=flava_output.text_output,
        multimodal_embeddings=flava_output.multimodal_embeddings,
        multimodal_output=flava_output.multimodal_output,
        image_masked_embeddings=image_masked_embeddings,
        image_masked_output=flava_masked_output.image_output,
        text_masked_embeddings=text_masked_embeddings,
        text_masked_output=flava_masked_output.text_output,
        multimodal_masked_embeddings=multimodal_masked_embeddings,
        multimodal_masked_output=flava_masked_output.multimodal_output,
        mim_logits=mim_logits,
        mlm_logits=mlm_logits,
        itm_logits=itm_logits,
        contrastive_logits_per_image=logits_per_image,
        contrastive_logits_per_text=logits_per_text,
        mmm_image_logits=mmm_image_logits,
        mmm_text_logits=mmm_text_logits,
    )

mindnlp.transformers.models.flava.modeling_flava.FlavaForPreTrainingOutput dataclass

Bases: ModelOutput

Output from FlavaForPreTraining containing embeddings, and outputs from individual encoders.

Note that the image_embeddings and text_embeddings returned are similar to a transformer's pooled output. If you want embeddings for contrastive loss or retrieval, apply a FLAVA model's image_projection and text_projection layers to image_embeddings and text_embeddings respectively.
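
A minimal sketch of that projection step (hedged: it assumes `model` is a `FlavaForPreTraining` instance and `inputs` comes from a FLAVA processor; the attribute names follow the source shown further down):

>>> outputs = model(**inputs)
>>> # project the CLS-token embeddings, mirroring the contrastive branch of FlavaForPreTraining.forward
>>> image_features = model.flava.image_projection(outputs.image_embeddings[:, 0, :])
>>> text_features = model.flava.text_projection(outputs.text_embeddings[:, 0, :])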

PARAMETER DESCRIPTION
loss

Total loss calculated for this model.

TYPE: `mindspore.Tensor`, *optional*, returned when `return_loss` is True DEFAULT: None

loss_info

Detailed info for FLAVA Pretraining losses. Check FlavaLosses class description for the information on the keys.

TYPE: `FlavaLosses` DEFAULT: None

image_output

The output of the [FlavaImageModel].

TYPE: `BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present DEFAULT: None

text_output

The output of the [FlavaTextModel].

TYPE: `BaseModelOutputWithPooling`, *optional*, returned when `input_ids` are present DEFAULT: None

image_masked_output

The output of the [FlavaImageModel]. Uses bool_masked_pos to create masked images.

TYPE: `BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present DEFAULT: None

text_masked_output

The output of the [FlavaTextModel].

TYPE: `BaseModelOutputWithPooling`, *optional*, returned when `input_ids_masked` are present DEFAULT: None

contrastive_logits_per_image

The scaled dot product scores between image_embeddings and text_embeddings but passed through FLAVA's image_projection and text_projection layers respectively. This represents the image-text similarity scores. This is calculated on unmasked images and texts.

TYPE: `mindspore.Tensor` of shape `(image_batch_size, text_batch_size)` DEFAULT: None

contrastive_logits_per_text

The scaled dot product scores between text_embeddings and image_embeddings but passed through FLAVA's text_projection and image_projection layers respectively. This is calculated on unmasked images and texts.

TYPE: `mindspore.Tensor` of shape `(text_batch_size, image_batch_size)` DEFAULT: None

Source code in mindnlp/transformers/models/flava/modeling_flava.py
@dataclass
class FlavaForPreTrainingOutput(ModelOutput):
    """
    Output from FlavaForPreTraining containing embeddings, and outputs from individual encoders.

    Note that `image_embeddings` and `text_embeddings` returned are similar to pooled output returned from a
    transformer. If you want embeddings for contrastive loss or retrieval use a FLAVA model's `image_projection` and
    `text_projection` layers on `image_embeddings` and `text_embeddings` respectively.

    Args:
        loss (`mindspore.Tensor`, *optional*, returned when `return_loss` is True):
            Total loss calculated for this model.
        loss_info (`FlavaLosses`):
            Detailed info for FLAVA Pretraining losses. Check `FlavaLosses` class description for the information on
            the keys.
        image_embeddings (`mindspore.Tensor` of shape `(batch_size, output_dim)`, *optional*, returned when
            `pixel_values` are present):
            The image embeddings which are basically the pooled output of [`FlavaImageModel`].
        image_output (`BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present):
            The output of the [`FlavaImageModel`].
        text_embeddings (`mindspore.Tensor` of shape `(batch_size, output_dim)`, *optional*, returned when
            `input_ids` are present):
            The text embeddings which are basically the pooled output of [`FlavaTextModel`].
        text_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids` are present):
            The output of the [`FlavaTextModel`].
        multimodal_embeddings (`mindspore.Tensor` of shape `(batch_size, output_dim)`, *optional*, returned when
            `input_ids` and `pixel_values` are present and `skip_unmasked_multimodal_encoder` is `None` or `False`):
            The multimodal embeddings which are basically the pooled output of [`FlavaMultimodalModel`].
        multimodal_output (`BaseModelOutputWithPooling`, returned when `input_ids` and `pixel_values` are present and
            `skip_unmasked_multimodal_encoder` is `None` or `False`):
            The output of the [`FlavaMultimodalModel`].

        image_masked_embeddings (`mindspore.Tensor` of shape `(batch_size, output_dim)`, *optional*, returned when
            `pixel_values` are present):
            The image embeddings which are basically the pooled output of [`FlavaImageModel`]. Uses `bool_masked_pos`
            to create masked images.
        image_masked_output (`BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present):
            The output of the [`FlavaImageModel`]. Uses `bool_masked_pos` to create masked images.
        text_masked_embeddings (`mindspore.Tensor` of shape `(batch_size, output_dim)`, *optional*, returned when
            `input_ids_masked` are present):
            The text embeddings which are basically the pooled output of [`FlavaTextModel`].
        text_masked_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids_masked` are present):
            The output of the [`FlavaTextModel`].
        multimodal_masked_embeddings (`mindspore.Tensor` of shape `(batch_size, output_dim)`, *optional*,
            returned when `input_ids` and `pixel_values` are present):
            The multimodal embeddings which are basically the pooled output of [`FlavaMultimodalModel`].
        multimodal_masked_output (`BaseModelOutputWithPooling`, returned when `input_ids_masked` and `pixel_values`
            are present):
            The output of the [`FlavaMultimodalModel`].

        mim_logits (`mindspore.Tensor` of shape `(batch_size, num_image_patches, image_vocab_size)` or of shape
            `(total_masked_patches, image_vocab_size)` , *optional*, returned when `pixel_values` are present
            and `input_ids_masked` are not):
            The logits for MIM unimodal loss. Uses `bool_masked_pos` to get masked patches. The flattened output is
            returned when `bool_masked_pos` has some of the patches masked.
        mlm_logits (`mindspore.Tensor` of shape `(batch_size, text_seq_length, text_vocab_size)` or of shape
            `(total_masked_seq_length, text_vocab_size)`, *optional*, returned when `input_ids_masked` are present
            and `pixel_values` are not):
            The logits for MLM unimodal loss. The flattened output is returned when `input_ids_masked` has some of
            the tokens masked.
        itm_logits (`mindspore.Tensor` of shape `(batch_size, 2)`, *optional*, returned when `input_ids_masked` and
            `pixel_values` are present):
            The logits for ITM loss. Note that ITM loss is calculated on masked pairs in FLAVA.
        mmm_image_logits (`mindspore.Tensor` of shape `(batch_size, num_image_patches, image_vocab_size)` or of shape
            `(total_masked_patches, image_vocab_size)`, *optional*, returned when `pixel_values` and `input_ids_masked`
            are present):
            The logits for MMM image multimodal loss. Uses `bool_masked_pos` to get masked patches. The flattened
            output is returned when `bool_masked_pos` has some of the patches masked.
        mmm_text_logits (`mindspore.Tensor` of shape `(batch_size, text_seq_length, text_vocab_size)` or of shape
            `(total_masked_seq_length, text_vocab_size)`, *optional*, returned when `pixel_values` and
            `input_ids_masked` are present):
            The logits for MMM text multimodal loss. The flattened output is returned when `input_ids_masked` has
            some of the tokens masked.
        contrastive_logits_per_image (`mindspore.Tensor` of shape `(image_batch_size, text_batch_size)`):
            The scaled dot product scores between `image_embeddings` and `text_embeddings` but passed through FLAVA's
            `image_projection` and `text_projection` layers respectively. This represents the image-text similarity
            scores. This is calculated on unmasked images and texts.
        contrastive_logits_per_text (`mindspore.Tensor` of shape `(text_batch_size, image_batch_size)`):
            The scaled dot product scores between `text_embeddings` and `image_embeddings` but passed through FLAVA's
            `text_projection` and `image_projection` layers respectively. This is calculated on unmasked images and
            texts.
    """

    loss: Optional[mindspore.Tensor] = None
    loss_info: FlavaLosses = None
    image_embeddings: Optional[mindspore.Tensor] = None
    image_output: Optional[BaseModelOutputWithPooling] = None
    text_embeddings: Optional[mindspore.Tensor] = None
    text_output: Optional[BaseModelOutputWithPooling] = None
    multimodal_embeddings: Optional[mindspore.Tensor] = None
    multimodal_output: Optional[BaseModelOutputWithPooling] = None
    image_masked_embeddings: Optional[mindspore.Tensor] = None
    image_masked_output: Optional[BaseModelOutputWithPooling] = None
    text_masked_embeddings: Optional[mindspore.Tensor] = None
    text_masked_output: Optional[BaseModelOutputWithPooling] = None
    multimodal_masked_embeddings: Optional[mindspore.Tensor] = None
    multimodal_masked_output: Optional[BaseModelOutputWithPooling] = None
    mim_logits: Optional[mindspore.Tensor] = None
    mlm_logits: Optional[mindspore.Tensor] = None
    itm_logits: Optional[mindspore.Tensor] = None
    contrastive_logits_per_image: Optional[mindspore.Tensor] = None
    contrastive_logits_per_text: Optional[mindspore.Tensor] = None
    mmm_image_logits: Optional[mindspore.Tensor] = None
    mmm_text_logits: Optional[mindspore.Tensor] = None

    def to_tuple(self) -> Tuple[Any]:
        transformer_outputs = [
            "text_output",
            "image_output",
            "multimodal_output",
            "text_masked_output",
            "image_masked_output",
            "multimodal_masked_output",
        ]
        return tuple(self[k] if k not in transformer_outputs else getattr(self, k).to_tuple() for k in self.keys())

mindnlp.transformers.models.flava.modeling_flava.FlavaImageEmbeddings

Bases: Module

Construct the CLS token, position and patch embeddings. Optionally, also the mask token.

Source code in mindnlp/transformers/models/flava/modeling_flava.py
class FlavaImageEmbeddings(nn.Module):
    """
    Construct the CLS token, position and patch embeddings. Optionally, also the mask token.
    """

    def __init__(self, config: FlavaImageConfig, use_mask_token: bool = False) -> None:
        super().__init__()

        use_mask_token = use_mask_token or config.mask_token
        self.cls_token = mindspore.Parameter(ops.zeros((1, 1, config.hidden_size)), 'cls_token')
        self.mask_token = mindspore.Parameter(ops.zeros((1, 1, config.hidden_size)), 'mask_token') if use_mask_token else None
        self.patch_embeddings = PatchEmbeddings(
            image_size=config.image_size,
            patch_size=config.patch_size,
            num_channels=config.num_channels,
            embed_dim=config.hidden_size,
        )
        num_patches = self.patch_embeddings.num_patches
        self.position_embeddings = mindspore.Parameter(ops.zeros((1, num_patches + 1, config.hidden_size)), 'position_embeddings')
        self.dropout = nn.Dropout(p=config.hidden_dropout_prob)
        self.config = config

    def interpolate_pos_encoding(self, embeddings: mindspore.Tensor, height: int, width: int) -> mindspore.Tensor:
        """
        This method allows to interpolate the pre-trained position encodings, to be able to use the model on higher
        resolution images.

        Source:
        https://github.com/facebookresearch/dino/blob/de9ee3df6cf39fac952ab558447af1fa1365362a/image_transformer.py#L174
        """

        npatch = embeddings.shape[1] - 1
        num_pos = self.position_embeddings.shape[1] - 1
        if npatch == num_pos and height == width:
            return self.position_embeddings
        class_pos_embed = self.position_embeddings[:, 0]
        patch_pos_embed = self.position_embeddings[:, 1:]
        dim = embeddings.shape[-1]
        num_h_patches = height // self.config.patch_size
        num_w_patches = width // self.config.patch_size
        # we add a small number to avoid floating point error in the interpolation
        # see discussion at https://github.com/facebookresearch/dino/issues/8
        num_h_patches, num_w_patches = num_h_patches + 0.1, num_w_patches + 0.1
        patch_pos_embed = ops.interpolate(
            patch_pos_embed.reshape(1, int(math.sqrt(num_pos)), int(math.sqrt(num_pos)), dim).permute(0, 3, 1, 2),
            scale_factor=(num_h_patches / math.sqrt(num_pos), num_w_patches / math.sqrt(num_pos)),
            mode="bicubic",
            align_corners=False,
        )
        if int(num_h_patches) != patch_pos_embed.shape[-2] or int(num_w_patches) != patch_pos_embed.shape[-1]:
            raise ValueError(
                f"Number of patches for images ({int(num_h_patches), int(num_w_patches)}) don't match the "
                f"shape of position embedding ({patch_pos_embed.shape[-2], patch_pos_embed.shape[-1]})"
            )
        patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
        return ops.cat([class_pos_embed.unsqueeze(0), patch_pos_embed], axis=1)

    def forward(
        self,
        pixel_values: mindspore.Tensor,
        bool_masked_pos: Optional[mindspore.Tensor] = None,
        interpolate_pos_encoding: bool = False,
    ) -> mindspore.Tensor:
        batch_size, num_channels, height, width = pixel_values.shape
        embeddings = self.patch_embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)

        batch_size, seq_len, _ = embeddings.shape
        if bool_masked_pos is not None:
            mask_tokens = self.mask_token.broadcast_to((batch_size, seq_len, -1))
            # B X H X W = B X HW
            if bool_masked_pos.dim() == 3:
                bool_masked_pos = bool_masked_pos.view(bool_masked_pos.shape[0], -1)
            # replace the masked visual tokens by mask_tokens
            mask = bool_masked_pos.unsqueeze(-1)
            mask = ops.Cast()(mask, mask_tokens.dtype)
            embeddings = embeddings * (1.0 - mask) + mask_tokens * mask

        # add the [CLS] token to the embedded patch tokens
        cls_tokens = self.cls_token.broadcast_to((batch_size, -1, -1))
        embeddings = ops.cat((cls_tokens, embeddings), axis=1)

        # add positional encoding to each token
        if interpolate_pos_encoding:
            embeddings = embeddings + self.interpolate_pos_encoding(embeddings, height, width)
        else:
            embeddings = embeddings + self.position_embeddings

        embeddings = self.dropout(embeddings)

        return embeddings
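
A brief usage sketch (hedged: it assumes the default `FlavaImageConfig`, i.e. 224x224 images, 16x16 patches and hidden size 768, so 196 patch tokens plus one CLS token):

>>> import mindspore
>>> from mindspore import ops
>>> from mindnlp.transformers.models.flava.configuration_flava import FlavaImageConfig
>>> from mindnlp.transformers.models.flava.modeling_flava import FlavaImageEmbeddings
...
>>> embeddings = FlavaImageEmbeddings(FlavaImageConfig())
>>> pixel_values = ops.zeros((1, 3, 224, 224), mindspore.float32)
>>> embeddings(pixel_values).shape   # (batch_size, 196 + 1, hidden_size) under the assumed defaults
(1, 197, 768)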

mindnlp.transformers.models.flava.modeling_flava.FlavaImageEmbeddings.interpolate_pos_encoding(embeddings, height, width)

This method interpolates the pre-trained position encodings so that the model can be used on higher-resolution images.

Source: https://github.com/facebookresearch/dino/blob/de9ee3df6cf39fac952ab558447af1fa1365362a/image_transformer.py#L174

Source code in mindnlp/transformers/models/flava/modeling_flava.py
def interpolate_pos_encoding(self, embeddings: mindspore.Tensor, height: int, width: int) -> mindspore.Tensor:
    """
    This method allows to interpolate the pre-trained position encodings, to be able to use the model on higher
    resolution images.

    Source:
    https://github.com/facebookresearch/dino/blob/de9ee3df6cf39fac952ab558447af1fa1365362a/image_transformer.py#L174
    """

    npatch = embeddings.shape[1] - 1
    num_pos = self.position_embeddings.shape[1] - 1
    if npatch == num_pos and height == width:
        return self.position_embeddings
    class_pos_embed = self.position_embeddings[:, 0]
    patch_pos_embed = self.position_embeddings[:, 1:]
    dim = embeddings.shape[-1]
    num_h_patches = height // self.config.patch_size
    num_w_patches = width // self.config.patch_size
    # we add a small number to avoid floating point error in the interpolation
    # see discussion at https://github.com/facebookresearch/dino/issues/8
    num_h_patches, num_w_patches = num_h_patches + 0.1, num_w_patches + 0.1
    patch_pos_embed = ops.interpolate(
        patch_pos_embed.reshape(1, int(math.sqrt(num_pos)), int(math.sqrt(num_pos)), dim).permute(0, 3, 1, 2),
        scale_factor=(num_h_patches / math.sqrt(num_pos), num_w_patches / math.sqrt(num_pos)),
        mode="bicubic",
        align_corners=False,
    )
    if int(num_h_patches) != patch_pos_embed.shape[-2] or int(num_w_patches) != patch_pos_embed.shape[-1]:
        raise ValueError(
            f"Number of patches for images ({int(num_h_patches), int(num_w_patches)}) don't match the "
            f"shape of position embedding ({patch_pos_embed.shape[-2], patch_pos_embed.shape[-1]})"
        )
    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
    return ops.cat([class_pos_embed.unsqueeze(0), patch_pos_embed], axis=1)
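
As a concrete illustration of the resizing above (hedged: the numbers assume the default 224x224 training resolution with 16x16 patches):

>>> 224 // 16, (224 // 16) ** 2    # grid side and number of stored patch position embeddings
(14, 196)
>>> 480 // 16, (480 // 16) ** 2    # grid a 480x480 input produces, i.e. the target of the bicubic interpolation
(30, 900)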

mindnlp.transformers.models.flava.modeling_flava.FlavaImageModel

Bases: FlavaPreTrainedModel

Source code in mindnlp/transformers/models/flava/modeling_flava.py
class FlavaImageModel(FlavaPreTrainedModel):
    config_class = FlavaImageConfig
    # This override allows us to load FlavaImageModel from FlavaModel/FlavaForPreTraining checkpoints.
    base_model_prefix = "flava.image_model"
    main_input_name = "pixel_values"

    def __init__(self, config: FlavaImageConfig, add_pooling_layer: bool = True):
        super().__init__(config)

        self.config = config

        self.embeddings = FlavaImageEmbeddings(config)
        self.encoder = FlavaEncoder(config)

        self.layernorm = nn.LayerNorm([config.hidden_size], eps=config.layer_norm_eps)
        self.pooler = FlavaPooler(config) if add_pooling_layer else None

        self.post_init()

    def get_input_embeddings(self) -> nn.Module:
        return self.embeddings.patch_embeddings

    def set_input_embeddings(self, value: nn.Module):
        self.embeddings.patch_embeddings = value

    def _prune_heads(self, heads_to_prune: Dict[int, List[int]]) -> None:
        """
        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
        class PreTrainedModel
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    def forward(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        bool_masked_pos: Optional[mindspore.Tensor] = None,
        interpolate_pos_encoding: Optional[bool] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, BaseModelOutputWithPooling]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if pixel_values is None:
            raise ValueError("You have to specify pixel_values")

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

        embedding_output = self.embeddings(
            pixel_values, bool_masked_pos=bool_masked_pos, interpolate_pos_encoding=interpolate_pos_encoding
        )

        encoder_outputs = self.encoder(
            embedding_output,
            attention_mask=attention_mask,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]
        sequence_output = self.layernorm(sequence_output)
        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

        if not return_dict:
            return (sequence_output, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPooling(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )
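
A usage sketch mirroring the FlavaModel examples further down (hedged: it assumes the facebook/flava-full checkpoint and the same processor conventions; the standalone load works thanks to the base_model_prefix override above):

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor
>>> from mindnlp.transformers.models.flava.modeling_flava import FlavaImageModel
...
>>> model = FlavaImageModel.from_pretrained("facebook/flava-full")
>>> processor = AutoProcessor.from_pretrained("facebook/flava-full")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(images=image, return_tensors="pt")
...
>>> outputs = model(**inputs)
>>> outputs.last_hidden_state.shape   # (batch_size, 196 patches + 1 CLS token, hidden_size)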

mindnlp.transformers.models.flava.modeling_flava.FlavaLayer

Bases: Module

This corresponds to the Block class in the timm implementation.

Source code in mindnlp/transformers/models/flava/modeling_flava.py
class FlavaLayer(nn.Module):
    """This corresponds to the Block class in the timm implementation."""

    def __init__(self, config: FlavaPossibleConfigs) -> None:
        super().__init__()
        self.chunk_size_feed_forward = config.chunk_size_feed_forward
        self.seq_len_dim = 1
        self.attention = FlavaAttention(config)
        self.intermediate = FlavaIntermediate(config)
        self.output = FlavaOutput(config)

        # TODO: Check fp32 layer norm possiblity
        self.layernorm_before = nn.LayerNorm([config.hidden_size], eps=config.layer_norm_eps)
        self.layernorm_after = nn.LayerNorm([config.hidden_size], eps=config.layer_norm_eps)

    def forward(
        self,
        hidden_states: mindspore.Tensor,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        output_attentions: bool = False,
    ) -> Union[Tuple[mindspore.Tensor, mindspore.Tensor], Tuple[mindspore.Tensor]]:
        self_attention_outputs = self.attention(
            self.layernorm_before(hidden_states),  # in ViT, layernorm is applied before self-attention
            attention_mask=attention_mask,
            head_mask=head_mask,
            output_attentions=output_attentions,
        )
        attention_output = self_attention_outputs[0]
        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights

        # first residual connection
        hidden_states = attention_output + hidden_states

        # in ViT, layernorm is also applied after self-attention
        layer_output = self.layernorm_after(hidden_states)
        layer_output = self.intermediate(layer_output)

        # second residual connection is done here
        layer_output = self.output(layer_output, hidden_states)

        outputs = (layer_output,) + outputs

        return outputs

mindnlp.transformers.models.flava.modeling_flava.FlavaLosses dataclass

Bases: ModelOutput

Class representing pretraining losses from FLAVA model

Source code in mindnlp/transformers/models/flava/modeling_flava.py
@dataclass
class FlavaLosses(ModelOutput):
    """
    Class representing pretraining losses from FLAVA model

    Args:
        mim (`mindspore.Tensor` of shape `(1,)`, *optional*, returned when `mim_labels` and `pixel_values` are present,
            `input_ids_masked` is absent and `mim_weight` > 0.): Masked Image Modeling loss as used in BeIT calculated
            only for unimodal image data.
        mlm (`mindspore.Tensor` of shape `(1,)`, *optional*, returned when `mlm_labels` and `input_ids_masked` are
            present, `pixel_values` is absent and `mlm_weight` > 0.): Masked Language Modeling loss as used in BERT
            calculated only for unimodal text data.
        itm (`mindspore.Tensor` of shape `(1,)`, *optional*, returned when `itm_labels`, `input_ids_masked`,
            `pixel_values` are present and `itm_weight` > 0.):
            Image Text Matching (ITM) loss calculated for paired image-text data. Note that ITM loss is calculated on
            masked pairs in FLAVA.
        global_contrastive (`mindspore.Tensor` of shape `(1,)`, *optional*, returned when `input_ids` and `pixel_values`
            are present and `global_contrastive_weight` > 0.):
            Contrastive loss for image-text similarity similar to CLIP but calculated globally for paired image-text
            data. This is calculated on unmasked images and texts.
        mmm_image (`mindspore.Tensor` of shape `(1,)`, *optional*, returned when `mim_labels`, `pixel_values` and
            `input_ids_masked` are present and `mmm_image_weight` > 0.):
            Masked Multimodal Modeling loss's image component calculated on paired image-text data.
        mmm_text (`mindspore.Tensor` of shape `(1,)`, *optional*, returned when `mlm_labels`, `pixel_values` and
            `input_ids_masked` are present and `mmm_text_weight` > 0.):
            Masked Multimodal Modeling loss's text component calculated on paired image-text data.
    """

    mim: Optional[mindspore.Tensor] = None
    mlm: Optional[mindspore.Tensor] = None
    itm: Optional[mindspore.Tensor] = None
    global_contrastive: Optional[mindspore.Tensor] = None
    mmm_image: Optional[mindspore.Tensor] = None
    mmm_text: Optional[mindspore.Tensor] = None

    def all_none(self) -> bool:
        all_none = True
        for v in self.values():
            if v is not None:
                all_none = False
                break
        return all_none
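
A small inspection sketch (hedged: it assumes `outputs` is a `FlavaForPreTrainingOutput` obtained with `return_loss=True` and the relevant labels passed in):

>>> losses = outputs.loss_info                 # a FlavaLosses instance
>>> computed = {name: value for name, value in losses.items() if value is not None}
>>> losses.all_none()                          # True only when no individual loss could be computed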

mindnlp.transformers.models.flava.modeling_flava.FlavaModel

Bases: FlavaPreTrainedModel

Source code in mindnlp/transformers/models/flava/modeling_flava.py
class FlavaModel(FlavaPreTrainedModel):
    config_class = FlavaConfig

    def __init__(self, config: FlavaConfig):
        super().__init__(config)

        if not isinstance(config.text_config, FlavaTextConfig):
            raise ValueError(
                "config.text_config is expected to be of type FlavaTextConfig but is of type"
                f" {type(config.text_config)}."
            )

        if not isinstance(config.image_config, FlavaImageConfig):
            raise ValueError(
                "config.image_config is expected to be of type FlavaImageConfig but is of type"
                f" {type(config.image_config)}."
            )

        if not isinstance(config.multimodal_config, FlavaMultimodalConfig):
            raise ValueError(
                "config.multimodal_config is expected to be of type FlavaMultimodalConfig but "
                + f"is of type {type(config.multimodal_config)}."
            )

        text_config = config.text_config
        image_config = config.image_config
        multimodal_config = config.multimodal_config

        self.projection_dim = config.projection_dim
        self.text_hidden_size = text_config.hidden_size
        self.image_hidden_size = image_config.hidden_size
        self.mm_hidden_size = multimodal_config.hidden_size

        self.text_model = FlavaTextModel(text_config)
        self.image_model = FlavaImageModel(image_config)
        self.multimodal_model = FlavaMultimodalModel(multimodal_config)

        self.image_projection = nn.Linear(self.image_hidden_size, self.projection_dim)
        self.text_projection = nn.Linear(self.text_hidden_size, self.projection_dim)
        self.logit_scale = mindspore.Parameter(mindspore.Tensor([self.config.logit_scale_init_value]), name="logit_scale")

        self.image_to_mm_projection = nn.Linear(self.image_hidden_size, self.mm_hidden_size)
        self.text_to_mm_projection = nn.Linear(self.text_hidden_size, self.mm_hidden_size)
        # Initialize weights and apply final processing
        self.post_init()

    def get_text_features(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> mindspore.Tensor:
        r"""
        Returns:
            text_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
            applying the projection layer to the pooled output of [`FlavaTextModel`].

        Example:
            ```python
            >>> from transformers import AutoProcessor, FlavaModel
            ... 
            >>> model = FlavaModel.from_pretrained("{0}")
            >>> processor = AutoProcessor.from_pretrained("{0}")
            ... 
            >>> inputs = processor(
            ...     text=["a photo of a cat", "a photo of a dog"], max_length=77, padding="max_length", return_tensors="pt"
            ... )
            >>> text_features = model.get_text_features(**inputs)
        ```
        """.format(_CHECKPOINT_FOR_DOC)

        text_outputs = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = text_outputs[0]  # last_hidden_state
        text_features = self.text_projection(pooled_output)

        return text_features

    def get_image_features(
        self,
        pixel_values: Optional[mindspore.Tensor] = None,
        bool_masked_pos: Optional[mindspore.Tensor] = None,
        interpolate_pos_encoding: Optional[bool] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> mindspore.Tensor:
        r"""
        Returns:
            image_features (`mindspore.Tensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
            applying the projection layer to the pooled output of [`FlavaImageModel`].

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, FlavaModel
            ... 
            >>> model = FlavaModel.from_pretrained("{0}")
            >>> processor = AutoProcessor.from_pretrained("{0}")
            ... 
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ... 
            >>> inputs = processor(images=image, return_tensors="pt")
            ... 
            >>> image_features = model.get_image_features(**inputs)
            ```
        """.format(_CHECKPOINT_FOR_DOC)

        image_outputs = self.image_model(
            pixel_values=pixel_values,
            bool_masked_pos=bool_masked_pos,
            attention_mask=attention_mask,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            interpolate_pos_encoding=interpolate_pos_encoding,
            return_dict=return_dict,
        )

        pooled_output = image_outputs[0]  # last_hidden_state
        image_features = self.image_projection(pooled_output)

        return image_features


    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        pixel_values: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        bool_masked_pos: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        image_attention_mask: Optional[mindspore.Tensor] = None,
        skip_multimodal_encoder: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: bool = True,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, FlavaOutput]:
        r"""

        Returns:
            `Union[Tuple, FlavaOutput]`

        Example:
            ```python
            >>> from PIL import Image
            >>> import requests
            >>> from transformers import AutoProcessor, FlavaModel
            ...
            >>> model = FlavaModel.from_pretrained("facebook/flava-full")
            >>> processor = AutoProcessor.from_pretrained("facebook/flava-full")
            ...
            >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
            >>> image = Image.open(requests.get(url, stream=True).raw)
            ...
            >>> inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt", padding=True)
            ...
            >>> outputs = model(**inputs)
            ...
            >>> image_embeddings = outputs.image_embeddings
            >>> text_embeddings = outputs.text_embeddings
            >>> multimodal_embeddings = outputs.multimodal_embeddings
            ...
            >>> outputs.image_embeddings.shape
            torch.Size([1, 197, 768])
            >>> text_embeddings.shape
            torch.Size([1, 7, 768])
            >>> multimodal_embeddings.shape
            torch.Size([1, 205, 768])
            ```
        """

        return_dict = return_dict if return_dict is not None else self.config.return_dict
        if not output_hidden_states:
            raise ValueError("FLAVA model requires hidden states to work. Please set `output_hidden_states=True`")
        image_embeddings = None
        image_states = None
        image_mm_projection = None
        image_output = None
        if pixel_values is not None:
            image_output = self.image_model(
                pixel_values=pixel_values,
                bool_masked_pos=bool_masked_pos,
                attention_mask=image_attention_mask,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )
            image_embeddings, image_states = image_output[0], image_output[2]
            # Note that these states don't use final layernorm in the transformer model
            image_mm_projection = self.image_to_mm_projection(image_states[-1])

        text_embeddings = None
        text_states = None
        text_mm_projection = None
        text_output = None
        if input_ids is not None:
            text_output = self.text_model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                position_ids=position_ids,
                token_type_ids=token_type_ids,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
                return_dict=return_dict,
            )

            text_embeddings, text_states = text_output[0], text_output[2]
            # Note that these states don't use final layernorm in the transformer model
            text_mm_projection = self.text_to_mm_projection(text_states[-1])

        multimodal_embeddings = None
        multimodal_output = None
        if image_mm_projection is not None and text_mm_projection is not None and not skip_multimodal_encoder:
            if attention_mask is not None:
                batch_size, seq_len, _ = image_mm_projection.shape
                if self.multimodal_model.use_cls_token:
                    seq_len += 1
                attention_mask_image = ops.ones((batch_size, seq_len))
                attention_multimodal = ops.cat([ops.cast(attention_mask_image, mindspore.int64), attention_mask], axis=1)
                # mindspore.Tensor(np.concatenate((attention_mask_image.asnumpy(), attention_mask.asnumpy()), axis=1))
            else:
                attention_multimodal = None
            multimodal_input = ops.cat([image_mm_projection, text_mm_projection], axis=1)
            multimodal_output = self.multimodal_model(
                multimodal_input, attention_mask=attention_multimodal, return_dict=return_dict
            )
            # multimodal_input = ops.concat([image_mm_projection, text_mm_projection], axis=1)
            # multimodal_output = self.multimodal_model(
            #     ops.concat([image_mm_projection, text_mm_projection], axis=1), attention_mask=attention_multimodal, return_dict=return_dict
            # )

            multimodal_embeddings = multimodal_output[0]

        if not return_dict:
            return (
                image_embeddings,
                image_output,
                text_embeddings,
                text_output,
                multimodal_embeddings,
                multimodal_output,
            )

        return FlavaModelOutput(
            image_embeddings=image_embeddings,
            image_output=image_output,
            text_embeddings=text_embeddings,
            text_output=text_output,
            multimodal_embeddings=multimodal_embeddings,
            multimodal_output=multimodal_output,
        )
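
Building on get_text_features and get_image_features above, a hedged retrieval-style sketch (it assumes `model` and `processor` are loaded as in the docstring examples and that `text_inputs`/`image_inputs` come from the processor; the CLS slice and L2 normalization mirror the contrastive branch of FlavaForPreTraining):

>>> from mindspore import ops
...
>>> text_features = model.get_text_features(**text_inputs)[:, 0, :]
>>> image_features = model.get_image_features(**image_inputs)[:, 0, :]
>>> text_features = ops.L2Normalize(axis=-1)(text_features)
>>> image_features = ops.L2Normalize(axis=-1)(image_features)
>>> similarity = ops.matmul(image_features, text_features.T)   # image-text cosine similarities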

mindnlp.transformers.models.flava.modeling_flava.FlavaModel.forward(input_ids=None, pixel_values=None, attention_mask=None, token_type_ids=None, bool_masked_pos=None, position_ids=None, image_attention_mask=None, skip_multimodal_encoder=None, output_attentions=None, output_hidden_states=True, return_dict=None)

RETURNS DESCRIPTION
Union[Tuple, FlavaOutput]

Union[Tuple, FlavaOutput]

Example
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, FlavaModel
...
>>> model = FlavaModel.from_pretrained("facebook/flava-full")
>>> processor = AutoProcessor.from_pretrained("facebook/flava-full")
...
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
...
>>> inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt", padding=True)
...
>>> outputs = model(**inputs)
...
>>> image_embeddings = outputs.image_embeddings
>>> text_embeddings = outputs.text_embeddings
>>> multimodal_embeddings = outputs.multimodal_embeddings
...
>>> outputs.image_embeddings.shape
torch.Size([1, 197, 768])
>>> text_embeddings.shape
torch.Size([1, 7, 768])
>>> multimodal_embeddings.shape
torch.Size([1, 205, 768])
Source code in mindnlp/transformers/models/flava/modeling_flava.py
def forward(
    self,
    input_ids: Optional[mindspore.Tensor] = None,
    pixel_values: Optional[mindspore.Tensor] = None,
    attention_mask: Optional[mindspore.Tensor] = None,
    token_type_ids: Optional[mindspore.Tensor] = None,
    bool_masked_pos: Optional[mindspore.Tensor] = None,
    position_ids: Optional[mindspore.Tensor] = None,
    image_attention_mask: Optional[mindspore.Tensor] = None,
    skip_multimodal_encoder: Optional[bool] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: bool = True,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, FlavaOutput]:
    r"""

    Returns:
        `Union[Tuple, FlavaOutput]`

    Example:
        ```python
        >>> from PIL import Image
        >>> import requests
        >>> from transformers import AutoProcessor, FlavaModel
        ...
        >>> model = FlavaModel.from_pretrained("facebook/flava-full")
        >>> processor = AutoProcessor.from_pretrained("facebook/flava-full")
        ...
        >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
        >>> image = Image.open(requests.get(url, stream=True).raw)
        ...
        >>> inputs = processor(text=["a photo of a cat"], images=image, return_tensors="pt", padding=True)
        ...
        >>> outputs = model(**inputs)
        ...
        >>> image_embeddings = outputs.image_embeddings
        >>> text_embeddings = outputs.text_embeddings
        >>> multimodal_embeddings = outputs.multimodal_embeddings
        ...
        >>> outputs.image_embeddings.shape
        torch.Size([1, 197, 768])
        >>> text_embeddings.shape
        torch.Size([1, 7, 768])
        >>> multimodal_embeddings.shape
        torch.Size([1, 205, 768])
        ```
    """

    return_dict = return_dict if return_dict is not None else self.config.return_dict
    if not output_hidden_states:
        raise ValueError("FLAVA model requires hidden states to work. Please set `output_hidden_states=True`")
    image_embeddings = None
    image_states = None
    image_mm_projection = None
    image_output = None
    if pixel_values is not None:
        image_output = self.image_model(
            pixel_values=pixel_values,
            bool_masked_pos=bool_masked_pos,
            attention_mask=image_attention_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        image_embeddings, image_states = image_output[0], image_output[2]
        # Note that these states don't use final layernorm in the transformer model
        image_mm_projection = self.image_to_mm_projection(image_states[-1])

    text_embeddings = None
    text_states = None
    text_mm_projection = None
    text_output = None
    if input_ids is not None:
        text_output = self.text_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            token_type_ids=token_type_ids,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        text_embeddings, text_states = text_output[0], text_output[2]
        # Note that these states don't use final layernorm in the transformer model
        text_mm_projection = self.text_to_mm_projection(text_states[-1])

    multimodal_embeddings = None
    multimodal_output = None
    if image_mm_projection is not None and text_mm_projection is not None and not skip_multimodal_encoder:
        if attention_mask is not None:
            batch_size, seq_len, _ = image_mm_projection.shape
            if self.multimodal_model.use_cls_token:
                seq_len += 1
            attention_mask_image = ops.ones((batch_size, seq_len))
            attention_multimodal = ops.cat([ops.cast(attention_mask_image, mindspore.int64), attention_mask], axis=1)
            # mindspore.Tensor(np.concatenate((attention_mask_image.asnumpy(), attention_mask.asnumpy()), axis=1))
        else:
            attention_multimodal = None
        multimodal_input = ops.cat([image_mm_projection, text_mm_projection], axis=1)
        multimodal_output = self.multimodal_model(
            multimodal_input, attention_mask=attention_multimodal, return_dict=return_dict
        )
        # multimodal_input = ops.concat([image_mm_projection, text_mm_projection], axis=1)
        # multimodal_output = self.multimodal_model(
        #     ops.concat([image_mm_projection, text_mm_projection], axis=1), attention_mask=attention_multimodal, return_dict=return_dict
        # )

        multimodal_embeddings = multimodal_output[0]

    if not return_dict:
        return (
            image_embeddings,
            image_output,
            text_embeddings,
            text_output,
            multimodal_embeddings,
            multimodal_output,
        )

    return FlavaModelOutput(
        image_embeddings=image_embeddings,
        image_output=image_output,
        text_embeddings=text_embeddings,
        text_output=text_output,
        multimodal_embeddings=multimodal_embeddings,
        multimodal_output=multimodal_output,
    )

mindnlp.transformers.models.flava.modeling_flava.FlavaModelOutput dataclass

Bases: ModelOutput

Output from FlavaModel containing embeddings and outputs from individual encoders.

Note that the image_embeddings and text_embeddings returned are similar to a transformer's pooled output. If you want embeddings for contrastive loss or retrieval, apply a FLAVA model's image_projection and text_projection layers to image_embeddings and text_embeddings respectively.

PARAMETER DESCRIPTION
image_output

The output of the [FlavaImageModel].

TYPE: `BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present DEFAULT: None

text_output

The output of the [FlavaTextModel].

TYPE: `BaseModelOutputWithPooling`, *optional*, returned when `input_ids` are present DEFAULT: None

multimodal_output

The output of the [FlavaMultimodalModel].

TYPE: `BaseModelOutputWithPooling`, returned when `input_ids` and `pixel_values` are present and `skip_multimodal_encoder` is `None` or `False` DEFAULT: None

Source code in mindnlp/transformers/models/flava/modeling_flava.py
@dataclass
class FlavaModelOutput(ModelOutput):
    """
    Output from FlavaModel containing embeddings and outputs from individual encoders.

    Note that `image_embeddings` and `text_embeddings` returned are similar to pooled output returned from a
    transformer. If you want embeddings for contrastive loss or retrieval use a FLAVA model's `image_projection` and
    `text_projection` layers on `image_embeddings` and `text_embeddings` respectively.

    Args:
        image_embeddings (`mindspore.Tensor` of shape `(batch_size, output_dim)`, *optional*,
            returned when `pixel_values` are present):
            The image embeddings which are basically the pooled output of [`FlavaImageModel`].
        image_output (`BaseModelOutputWithPooling`, *optional*, returned when `pixel_values` are present):
            The output of the [`FlavaImageModel`].
        text_embeddings (`mindspore.Tensor` of shape `(batch_size, output_dim)`, *optional*,
            returned when `input_ids` are present):
            The text embeddings which are basically the pooled output of [`FlavaTextModel`].
        text_output (`BaseModelOutputWithPooling`, *optional*, returned when `input_ids` are present):
            The output of the [`FlavaTextModel`].
        multimodal_embeddings (`mindspore.Tensor` of shape `(batch_size, output_dim)`, *optional*,
            returned when `input_ids` and `pixel_values` are present and `skip_multimodal_encoder` is `None` or `False`):
            The multimodal embeddings which are basically the pooled output of [`FlavaMultimodalModel`].
        multimodal_output (`BaseModelOutputWithPooling`, returned when `input_ids` and `pixel_values` are present and `skip_multimodal_encoder` is `None` or `False`):
            The output of the [`FlavaMultimodalModel`].
    """

    image_embeddings: Optional[mindspore.Tensor] = None
    image_output: Optional[BaseModelOutputWithPooling] = None
    text_embeddings: Optional[mindspore.Tensor] = None
    text_output: Optional[BaseModelOutputWithPooling] = None
    multimodal_embeddings: Optional[mindspore.Tensor] = None
    multimodal_output: Optional[BaseModelOutputWithPooling] = None

    def to_tuple(self) -> Tuple[Any]:
        return tuple(
            self[k] if k not in ["text_output", "image_output", "multimodal_output"] else getattr(self, k).to_tuple()
            for k in self.keys()
        )

mindnlp.transformers.models.flava.modeling_flava.FlavaMultimodalModel

Bases: FlavaPreTrainedModel

Source code in mindnlp/transformers/models/flava/modeling_flava.py
class FlavaMultimodalModel(FlavaPreTrainedModel):
    config_class = FlavaMultimodalConfig
    # This override allows us to load FlavaMultimodalModel from FlavaModel/FlavaForPreTraining checkpoints.
    base_model_prefix = "flava.multimodal_model"
    main_input_name = "hidden_states"

    def __init__(self, config: FlavaMultimodalConfig, add_pooling_layer=True):
        super().__init__(config)
        self.config = config
        self.use_cls_token = self.config.use_cls_token
        if self.use_cls_token:
            self.cls_token = mindspore.Parameter(ops.zeros((1, 1, config.hidden_size)), name="cls_token")

        self.encoder = FlavaEncoder(config)

        self.layernorm = nn.LayerNorm([config.hidden_size], eps=config.layer_norm_eps)
        self.pooler = FlavaPooler(config) if add_pooling_layer else None

        self.post_init()

    def _prune_heads(self, heads_to_prune: Dict[int, List[int]]) -> None:
        """
        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
        class PreTrainedModel
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    def forward(
        self,
        hidden_states: mindspore.Tensor,
        attention_mask: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, BaseModelOutputWithPooling]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict


        batch_size, seq_length, _ = hidden_states.shape

        if self.use_cls_token:
            cls_tokens = self.cls_token.broadcast_to((batch_size, -1, -1))
            hidden_states = ops.cat([cls_tokens, hidden_states], axis=1)
            seq_length += 1

        if attention_mask is None:
            attention_mask = ops.ones((batch_size, seq_length))

        # Prepare head mask if needed
        # 1.0 in head_mask indicate we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
        extended_attention_mask: mindspore.Tensor = self.get_extended_attention_mask(
            attention_mask, (batch_size, seq_length)
        )

        encoder_outputs = self.encoder(
            hidden_states,
            attention_mask=extended_attention_mask,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]
        sequence_output = self.layernorm(sequence_output)
        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

        if not return_dict:
            return (sequence_output, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPooling(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )
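
Tying this back to the FlavaModel example above: the multimodal encoder receives the concatenation of the projected image states and the projected text states, and prepends its own CLS token when use_cls_token is enabled. A quick check against the shapes in that example:

>>> image_tokens, text_tokens = 197, 7   # from the FlavaModel example (196 patches + CLS, 7 text tokens)
>>> image_tokens + text_tokens + 1       # plus the multimodal CLS token added above
205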

mindnlp.transformers.models.flava.modeling_flava.FlavaPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in mindnlp/transformers/models/flava/modeling_flava.py
class FlavaPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """

    config_class = FlavaConfig
    base_model_prefix = "flava"
    supports_gradient_checkpointing = True

    def _init_weights(self, cell: Union[nn.Linear, nn.Conv2d, nn.LayerNorm]) -> None:
        """Initialize the weights"""
        if isinstance(cell, (nn.Linear, nn.Conv2d)):
            cell.weight.set_data(initializer(Normal(self.config.initializer_range),
                                                    cell.weight.shape, cell.weight.dtype))
            if cell.bias:
                cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
        elif isinstance(cell, nn.Embedding):
            weight = np.random.normal(0.0, self.config.initializer_range, cell.weight.shape)
            if cell.padding_idx:
                weight[cell.padding_idx] = 0

            cell.weight.set_data(Tensor(weight, cell.weight.dtype))
        elif isinstance(cell, nn.LayerNorm):
            cell.weight.set_data(initializer('ones', cell.weight.shape, cell.weight.dtype))
            cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
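
As a standalone illustration (not part of the model code), the normal-distribution weight initialization performed by _init_weights can be reproduced with MindSpore's initializer helper; the standard deviation corresponds to config.initializer_range:

>>> import mindspore
>>> from mindspore.common.initializer import initializer, Normal
...
>>> # build a (768, 768) weight tensor the way _init_weights does, with std 0.02
>>> weight = initializer(Normal(sigma=0.02), (768, 768), mindspore.float32)
>>> weight.shape
(768, 768)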

mindnlp.transformers.models.flava.modeling_flava.FlavaSelfOutput

Bases: Module

The residual connection is defined in FlavaLayer (same as ViTLayer) instead of here (as is the case with other models), due to the layernorm applied before each block.

Source code in mindnlp/transformers/models/flava/modeling_flava.py
class FlavaSelfOutput(nn.Module):
    """
    The residual connection is defined in FlavaLayer (same as ViTLayer) instead of here (as is the case with other
    models), due to the layernorm applied before each block.
    """

    def __init__(self, config: FlavaPossibleConfigs) -> None:
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(p=config.hidden_dropout_prob)

    def forward(self, hidden_states: mindspore.Tensor, input_tensor: mindspore.Tensor) -> mindspore.Tensor:
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)

        return hidden_states
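
Because the residual add lives in FlavaLayer, FlavaSelfOutput returns only the projected-and-dropped-out attention output. A minimal sketch of the calling pattern (the multiplication below is just a stand-in for the real attention/self-output computation):

>>> import mindspore
>>> from mindspore import ops
...
>>> hidden_states = ops.ones((1, 4, 768), mindspore.float32)
>>> attention_output = hidden_states * 0.5             # stand-in for FlavaSelfOutput(...)
>>> # the residual connection is applied by the caller (FlavaLayer), not by FlavaSelfOutput
>>> hidden_states = attention_output + hidden_states
>>> hidden_states.shape
(1, 4, 768)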

mindnlp.transformers.models.flava.modeling_flava.FlavaTextEmbeddings

Bases: Module

Construct the embeddings from word, position and token_type embeddings.

Source code in mindnlp/transformers/models/flava/modeling_flava.py
class FlavaTextEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = nn.LayerNorm([config.hidden_size], eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(p=config.hidden_dropout_prob)
        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
        self.position_ids = ops.arange(config.max_position_embeddings).broadcast_to((1, -1))
        self.token_type_ids = ops.zeros(self.position_ids.shape, dtype=mindspore.int64)

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
    ):
        input_shape = input_ids.shape
        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = self.position_ids[:, :seq_length]

        # Setting the token_type_ids to the registered buffer in the constructor, where it is all zeros. This usually
        # occurs when token_type_ids is auto-generated; the registered buffer helps users trace the model without
        # passing token_type_ids, and solves issue #5664
        if token_type_ids is None:
            if hasattr(self, "token_type_ids"):
                buffered_token_type_ids = self.token_type_ids[:, :seq_length]
                buffered_token_type_ids_expanded = buffered_token_type_ids.broadcast_to((input_shape[0], seq_length))
                token_type_ids = buffered_token_type_ids_expanded
            else:
                token_type_ids = ops.zeros(input_shape, dtype=mindspore.int64)

        inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = inputs_embeds + token_type_embeddings
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings += position_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings
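
A minimal usage sketch; it assumes FlavaTextConfig is re-exported from mindnlp.transformers (as in HF transformers) and uses placeholder token ids. With the default configuration the embeddings are 768-dimensional:

>>> import mindspore
>>> from mindspore import ops
>>> from mindnlp.transformers import FlavaTextConfig
>>> from mindnlp.transformers.models.flava.modeling_flava import FlavaTextEmbeddings
...
>>> config = FlavaTextConfig()
>>> embeddings = FlavaTextEmbeddings(config)
>>> input_ids = ops.zeros((1, 8), mindspore.int64)  # 8 placeholder token ids
>>> embeddings(input_ids).shape
(1, 8, 768)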

mindnlp.transformers.models.flava.modeling_flava.FlavaTextModel

Bases: FlavaPreTrainedModel

Source code in mindnlp/transformers/models/flava/modeling_flava.py
class FlavaTextModel(FlavaPreTrainedModel):
    config_class = FlavaTextConfig
    # This override allows us to load FlavaTextModel from FlavaModel/FlavaForPreTraining checkpoints.
    base_model_prefix = "flava.text_model"

    def __init__(self, config: FlavaTextConfig, add_pooling_layer: bool = True):
        super().__init__(config)
        self.config = config

        self.embeddings = FlavaTextEmbeddings(config)
        self.encoder = FlavaEncoder(config)

        self.layernorm = nn.LayerNorm([config.hidden_size], eps=config.layer_norm_eps)
        self.pooler = FlavaPooler(config) if add_pooling_layer else None

        self.post_init()

    def get_input_embeddings(self) -> nn.Embedding:
        return self.embeddings.word_embeddings

    def set_input_embeddings(self, value: nn.Module):
        self.embeddings.word_embeddings = value

    def _prune_heads(self, heads_to_prune: Dict[int, List[int]]) -> None:
        """
        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer}. See the
        base class PreTrainedModel.
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    def forward(
        self,
        input_ids: Optional[mindspore.Tensor] = None,
        attention_mask: Optional[mindspore.Tensor] = None,
        token_type_ids: Optional[mindspore.Tensor] = None,
        position_ids: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, BaseModelOutputWithPooling]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_ids is None:
            raise ValueError("You have to specify input_ids")

        input_shape = input_ids.shape

        if attention_mask is None:
            attention_mask = ops.ones(input_shape)

        # Prepare head mask if needed
        # 1.0 in head_mask indicates we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
        extended_attention_mask: mindspore.Tensor = self.get_extended_attention_mask(
            attention_mask, input_shape
        )

        embedding_output = self.embeddings(
            input_ids=input_ids,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
        )

        encoder_outputs = self.encoder(
            embedding_output,
            attention_mask=extended_attention_mask,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]
        sequence_output = self.layernorm(sequence_output)
        pooled_output = self.pooler(sequence_output) if self.pooler is not None else None

        if not return_dict:
            return (sequence_output, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPooling(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )
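
A hedged usage sketch; it assumes FlavaTextModel and BertTokenizer are re-exported from mindnlp.transformers, and the checkpoint name and return_tensors="ms" convention follow the rest of the library rather than being guaranteed here:

>>> from mindnlp.transformers import BertTokenizer, FlavaTextModel
...
>>> tokenizer = BertTokenizer.from_pretrained("facebook/flava-full")
>>> model = FlavaTextModel.from_pretrained("facebook/flava-full")
>>> inputs = tokenizer(["a photo of a cat"], return_tensors="ms", padding=True)
>>> outputs = model(**inputs)
>>> outputs.last_hidden_state.shape  # (batch_size, sequence_length, hidden_size)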

mindnlp.transformers.models.flava.modeling_flava.PatchEmbeddings

Bases: Module

Image to Patch Embedding.

Source code in mindnlp/transformers/models/flava/modeling_flava.py
class PatchEmbeddings(nn.Module):
    """
    Image to Patch Embedding.
    """

    def __init__(
        self,
        image_size: int = 224,
        patch_size: Union[int, Tuple[int, int]] = 16,
        num_channels: int = 3,
        embed_dim: int = 768,
    ):
        super().__init__()
        if not isinstance(image_size, collections.abc.Iterable):
            image_size = (image_size, image_size)
        if not isinstance(patch_size, collections.abc.Iterable):
            patch_size = (patch_size, patch_size)
        num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0])
        self.image_size = image_size
        self.patch_size = patch_size
        self.num_patches = num_patches

        self.projection = nn.Conv2d(num_channels, embed_dim, kernel_size=patch_size, stride=patch_size, pad_mode="valid", bias=True)

    def forward(self, pixel_values: mindspore.Tensor, interpolate_pos_encoding: bool = False) -> mindspore.Tensor:
        batch_size, num_channels, height, width = pixel_values.shape
        if not interpolate_pos_encoding:
            if height != self.image_size[0] or width != self.image_size[1]:
                raise ValueError(
                    f"Input image size ({height}*{width}) doesn't match model"
                    f" ({self.image_size[0]}*{self.image_size[1]})."
                )
        x = self.projection(pixel_values).flatten(start_dim=2).swapaxes(1, 2)
        return x
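
With the defaults above, a 224x224 image split into 16x16 patches yields (224 // 16) * (224 // 16) = 196 patches, each projected to a 768-dimensional embedding. A short sketch with dummy pixel values:

>>> import mindspore
>>> from mindspore import ops
>>> from mindnlp.transformers.models.flava.modeling_flava import PatchEmbeddings
...
>>> patch_embed = PatchEmbeddings(image_size=224, patch_size=16, num_channels=3, embed_dim=768)
>>> patch_embed.num_patches
196
>>> pixel_values = ops.zeros((1, 3, 224, 224), mindspore.float32)
>>> patch_embed(pixel_values).shape
(1, 196, 768)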

mindnlp.transformers.models.flava.processing_flava

Image/Text processor class for FLAVA

mindnlp.transformers.models.flava.processing_flava.FlavaProcessor

Bases: ProcessorMixin

Constructs a FLAVA processor which wraps a FLAVA image processor and a FLAVA tokenizer into a single processor.

[FlavaProcessor] offers all the functionalities of [FlavaImageProcessor] and [BertTokenizerFast]. See the [~FlavaProcessor.__call__] and [~FlavaProcessor.decode] for more information.

PARAMETER DESCRIPTION
image_processor

The image processor is a required input.

TYPE: [`FlavaImageProcessor`], *optional* DEFAULT: None

tokenizer

The tokenizer is a required input.

TYPE: [`BertTokenizerFast`], *optional* DEFAULT: None

Source code in mindnlp/transformers/models/flava/processing_flava.py
class FlavaProcessor(ProcessorMixin):
    r"""
    Constructs a FLAVA processor which wraps a FLAVA image processor and a FLAVA tokenizer into a single processor.

    [`FlavaProcessor`] offers all the functionalities of [`FlavaImageProcessor`] and [`BertTokenizerFast`]. See the
    [`~FlavaProcessor.__call__`] and [`~FlavaProcessor.decode`] for more information.

    Args:
        image_processor ([`FlavaImageProcessor`], *optional*): The image processor is a required input.
        tokenizer ([`BertTokenizerFast`], *optional*): The tokenizer is a required input.
    """

    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "FlavaImageProcessor"
    tokenizer_class = ("BertTokenizer", "BertTokenizerFast")

    def __init__(self, image_processor=None, tokenizer=None, **kwargs):
        feature_extractor = None
        if "feature_extractor" in kwargs:
            warnings.warn(
                "The `feature_extractor` argument is deprecated and will be removed in v5, use `image_processor`"
                " instead.",
                FutureWarning,
            )
            feature_extractor = kwargs.pop("feature_extractor")

        image_processor = image_processor if image_processor is not None else feature_extractor
        if image_processor is None:
            raise ValueError("You need to specify an `image_processor`.")
        if tokenizer is None:
            raise ValueError("You need to specify a `tokenizer`.")

        super().__init__(image_processor, tokenizer)
        self.current_processor = self.image_processor

    def __call__(
        self,
        images: Optional[ImageInput] = None,
        text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = False,
        max_length: Optional[int] = None,
        stride: int = 0,
        pad_to_multiple_of: Optional[int] = None,
        return_image_mask: Optional[bool] = None,
        return_codebook_pixels: Optional[bool] = None,
        return_token_type_ids: Optional[bool] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs,
    ):
        """
        This method uses [`FlavaImageProcessor.__call__`] to prepare image(s) for the model, and
        [`BertTokenizerFast.__call__`] to prepare text for the model.

        Please refer to the docstring of the above two methods for more information.
        """

        if text is None and images is None:
            raise ValueError("You have to specify either text or images. Both cannot be none.")

        if text is not None:
            encoding = self.tokenizer(
                text=text,
                add_special_tokens=add_special_tokens,
                padding=padding,
                truncation=truncation,
                max_length=max_length,
                stride=stride,
                pad_to_multiple_of=pad_to_multiple_of,
                return_token_type_ids=return_token_type_ids,
                return_attention_mask=return_attention_mask,
                return_overflowing_tokens=return_overflowing_tokens,
                return_special_tokens_mask=return_special_tokens_mask,
                return_offsets_mapping=return_offsets_mapping,
                return_length=return_length,
                verbose=verbose,
                return_tensors=return_tensors,
                **kwargs,
            )
        if images is not None:
            image_features = self.image_processor(
                images,
                return_image_mask=return_image_mask,
                return_codebook_pixels=return_codebook_pixels,
                return_tensors=return_tensors,
                **kwargs,
            )

        if text is not None and images is not None:
            encoding.update(image_features)
            return encoding
        elif text is not None:
            return encoding
        else:
            return BatchEncoding(data={**image_features}, tensor_type=return_tensors)

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    def model_input_names(self):
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))

    @property
    def feature_extractor_class(self):
        warnings.warn(
            "`feature_extractor_class` is deprecated and will be removed in v5. Use `image_processor_class` instead.",
            FutureWarning,
        )
        return self.image_processor_class

    @property
    def feature_extractor(self):
        warnings.warn(
            "`feature_extractor` is deprecated and will be removed in v5. Use `image_processor` instead.",
            FutureWarning,
        )
        return self.image_processor

mindnlp.transformers.models.flava.processing_flava.FlavaProcessor.__call__(images=None, text=None, add_special_tokens=True, padding=False, truncation=False, max_length=None, stride=0, pad_to_multiple_of=None, return_image_mask=None, return_codebook_pixels=None, return_token_type_ids=None, return_attention_mask=None, return_overflowing_tokens=False, return_special_tokens_mask=False, return_offsets_mapping=False, return_length=False, verbose=True, return_tensors=None, **kwargs)

This method uses [FlavaImageProcessor.__call__] to prepare image(s) for the model, and [BertTokenizerFast.__call__] to prepare text for the model.

Please refer to the docstring of the above two methods for more information.

Source code in mindnlp/transformers/models/flava/processing_flava.py
def __call__(
    self,
    images: Optional[ImageInput] = None,
    text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
    add_special_tokens: bool = True,
    padding: Union[bool, str, PaddingStrategy] = False,
    truncation: Union[bool, str, TruncationStrategy] = False,
    max_length: Optional[int] = None,
    stride: int = 0,
    pad_to_multiple_of: Optional[int] = None,
    return_image_mask: Optional[bool] = None,
    return_codebook_pixels: Optional[bool] = None,
    return_token_type_ids: Optional[bool] = None,
    return_attention_mask: Optional[bool] = None,
    return_overflowing_tokens: bool = False,
    return_special_tokens_mask: bool = False,
    return_offsets_mapping: bool = False,
    return_length: bool = False,
    verbose: bool = True,
    return_tensors: Optional[Union[str, TensorType]] = None,
    **kwargs,
):
    """
    This method uses [`FlavaImageProcessor.__call__`] to prepare image(s) for the model, and
    [`BertTokenizerFast.__call__`] to prepare text for the model.

    Please refer to the docstring of the above two methods for more information.
    """

    if text is None and images is None:
        raise ValueError("You have to specify either text or images. Both cannot be none.")

    if text is not None:
        encoding = self.tokenizer(
            text=text,
            add_special_tokens=add_special_tokens,
            padding=padding,
            truncation=truncation,
            max_length=max_length,
            stride=stride,
            pad_to_multiple_of=pad_to_multiple_of,
            return_token_type_ids=return_token_type_ids,
            return_attention_mask=return_attention_mask,
            return_overflowing_tokens=return_overflowing_tokens,
            return_special_tokens_mask=return_special_tokens_mask,
            return_offsets_mapping=return_offsets_mapping,
            return_length=return_length,
            verbose=verbose,
            return_tensors=return_tensors,
            **kwargs,
        )
    if images is not None:
        image_features = self.image_processor(
            images,
            return_image_mask=return_image_mask,
            return_codebook_pixels=return_codebook_pixels,
            return_tensors=return_tensors,
            **kwargs,
        )

    if text is not None and images is not None:
        encoding.update(image_features)
        return encoding
    elif text is not None:
        return encoding
    else:
        return BatchEncoding(data={**image_features}, tensor_type=return_tensors)
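
A hedged end-to-end sketch of the processor; it assumes FlavaProcessor is re-exported from mindnlp.transformers, and the checkpoint name, example image URL, and return_tensors="ms" convention are assumptions rather than guarantees:

>>> import requests
>>> from PIL import Image
>>> from mindnlp.transformers import FlavaProcessor
...
>>> processor = FlavaProcessor.from_pretrained("facebook/flava-full")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(images=image, text=["a photo of two cats"], return_tensors="ms", padding=True)
>>> # image-only and text-only calls also work; passing neither raises a ValueError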

mindnlp.transformers.models.flava.processing_flava.FlavaProcessor.batch_decode(*args, **kwargs)

This method forwards all its arguments to BertTokenizerFast's [~PreTrainedTokenizer.batch_decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/flava/processing_flava.py
def batch_decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
    refer to the docstring of this method for more information.
    """
    return self.tokenizer.batch_decode(*args, **kwargs)

mindnlp.transformers.models.flava.processing_flava.FlavaProcessor.decode(*args, **kwargs)

This method forwards all its arguments to BertTokenizerFast's [~PreTrainedTokenizer.decode]. Please refer to the docstring of this method for more information.

Source code in mindnlp/transformers/models/flava/processing_flava.py
def decode(self, *args, **kwargs):
    """
    This method forwards all its arguments to BertTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
    the docstring of this method for more information.
    """
    return self.tokenizer.decode(*args, **kwargs)