audio_spectrogram_transformer

mindnlp.transformers.models.audio_spectrogram_transformer.configuration_audio_spectrogram_transformer.AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP = {'MIT/ast-finetuned-audioset-10-10-0.4593': 'https://hf-mirror.com/MIT/ast-finetuned-audioset-10-10-0.4593/resolve/main/config.json'} module-attribute

mindnlp.transformers.models.audio_spectrogram_transformer.configuration_audio_spectrogram_transformer.ASTConfig

Bases: PretrainedConfig

This is the configuration class to store the configuration of a [ASTModel]. It is used to instantiate an AST model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the AST MIT/ast-finetuned-audioset-10-10-0.4593 architecture.

Configuration objects inherit from [PretrainedConfig] and can be used to control the model outputs. Read the documentation from [PretrainedConfig] for more information.

PARAMETER DESCRIPTION
hidden_size

Dimensionality of the encoder layers and the pooler layer.

TYPE: `int`, *optional*, defaults to 768 DEFAULT: 768

num_hidden_layers

Number of hidden layers in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

num_attention_heads

Number of attention heads for each attention layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 12 DEFAULT: 12

intermediate_size

Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.

TYPE: `int`, *optional*, defaults to 3072 DEFAULT: 3072

hidden_act

The non-linear activation function (function or string) in the encoder and pooler. If string, "gelu", "relu", "selu" and "gelu_new" are supported.

TYPE: `str` or `function`, *optional*, defaults to `"gelu"` DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

attention_probs_dropout_prob

The dropout ratio for the attention probabilities.

TYPE: `float`, *optional*, defaults to 0.0 DEFAULT: 0.0

initializer_range

The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

TYPE: `float`, *optional*, defaults to 0.02 DEFAULT: 0.02

layer_norm_eps

The epsilon used by the layer normalization layers.

TYPE: `float`, *optional*, defaults to 1e-12 DEFAULT: 1e-12

patch_size

The size (resolution) of each patch.

TYPE: `int`, *optional*, defaults to 16 DEFAULT: 16

qkv_bias

Whether to add a bias to the queries, keys and values.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

frequency_stride

Frequency stride to use when patchifying the spectrograms.

TYPE: `int`, *optional*, defaults to 10 DEFAULT: 10

time_stride

Temporal stride to use when patchifying the spectrograms.

TYPE: `int`, *optional*, defaults to 10 DEFAULT: 10

max_length

Temporal dimension of the spectrograms.

TYPE: `int`, *optional*, defaults to 1024 DEFAULT: 1024

num_mel_bins

Frequency dimension of the spectrograms (number of Mel-frequency bins).

TYPE: `int`, *optional*, defaults to 128 DEFAULT: 128

Example
>>> from transformers import ASTConfig, ASTModel
...
>>> # Initializing an AST MIT/ast-finetuned-audioset-10-10-0.4593 style configuration
>>> configuration = ASTConfig()
...
>>> # Initializing a model (with random weights) from the MIT/ast-finetuned-audioset-10-10-0.4593 style configuration
>>> model = ASTModel(configuration)
...
>>> # Accessing the model configuration
>>> configuration = model.config
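
With these defaults the 128 x 1024 log-Mel spectrogram is cut into a grid of 16 x 16 patches using the frequency and time strides above. A quick sketch of that arithmetic, reusing the `configuration` from the example (the grid formula `(dim - patch_size) // stride + 1` is the usual ViT-style patching and is stated here as an assumption, since the embedding code is not shown on this page):

>>> freq_patches = (configuration.num_mel_bins - configuration.patch_size) // configuration.frequency_stride + 1
>>> time_patches = (configuration.max_length - configuration.patch_size) // configuration.time_stride + 1
>>> freq_patches, time_patches, freq_patches * time_patches
(12, 101, 1212)
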
Source code in mindnlp/transformers/models/audio_spectrogram_transformer/configuration_audio_spectrogram_transformer.py
class ASTConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`ASTModel`]. It is used to instantiate an AST
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the AST
    [MIT/ast-finetuned-audioset-10-10-0.4593](https://hf-mirror.com/MIT/ast-finetuned-audioset-10-10-0.4593)
    architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"selu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        patch_size (`int`, *optional*, defaults to 16):
            The size (resolution) of each patch.
        qkv_bias (`bool`, *optional*, defaults to `True`):
            Whether to add a bias to the queries, keys and values.
        frequency_stride (`int`, *optional*, defaults to 10):
            Frequency stride to use when patchifying the spectrograms.
        time_stride (`int`, *optional*, defaults to 10):
            Temporal stride to use when patchifying the spectrograms.
        max_length (`int`, *optional*, defaults to 1024):
            Temporal dimension of the spectrograms.
        num_mel_bins (`int`, *optional*, defaults to 128):
            Frequency dimension of the spectrograms (number of Mel-frequency bins).

    Example:
        ```python
        >>> from transformers import ASTConfig, ASTModel
        ...
        >>> # Initializing a AST MIT/ast-finetuned-audioset-10-10-0.4593 style configuration
        >>> configuration = ASTConfig()
        ...
        >>> # Initializing a model (with random weights) from the MIT/ast-finetuned-audioset-10-10-0.4593 style configuration
        >>> model = ASTModel(configuration)
        ...
        >>> # Accessing the model configuration
        >>> configuration = model.config
        ```
    """
    model_type = "audio-spectrogram-transformer"

    def __init__(
        self,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.0,
        attention_probs_dropout_prob=0.0,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        patch_size=16,
        qkv_bias=True,
        frequency_stride=10,
        time_stride=10,
        max_length=1024,
        num_mel_bins=128,
        **kwargs,
    ):
        """
        Initializes an instance of ASTConfig.

        Args:
            self: The object itself.
            hidden_size (int, optional): The size of the hidden layers. Defaults to 768.
            num_hidden_layers (int, optional): The number of hidden layers. Defaults to 12.
            num_attention_heads (int, optional): The number of attention heads. Defaults to 12.
            intermediate_size (int, optional): The size of the intermediate layer. Defaults to 3072.
            hidden_act (str, optional): The activation function for the hidden layers. Defaults to 'gelu'.
            hidden_dropout_prob (float, optional): The dropout probability for the hidden layers. Defaults to 0.0.
            attention_probs_dropout_prob (float, optional): The dropout probability for the attention probabilities. Defaults to 0.0.
            initializer_range (float, optional): The range for parameter initialization. Defaults to 0.02.
            layer_norm_eps (float, optional): The epsilon value for layer normalization. Defaults to 1e-12.
            patch_size (int, optional): The size of the patch. Defaults to 16.
            qkv_bias (bool, optional): Whether to include bias in the query, key, and value tensors. Defaults to True.
            frequency_stride (int, optional): The stride for frequency. Defaults to 10.
            time_stride (int, optional): The stride for time. Defaults to 10.
            max_length (int, optional): The maximum length. Defaults to 1024.
            num_mel_bins (int, optional): The number of Mel bins. Defaults to 128.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(**kwargs)

        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.intermediate_size = intermediate_size
        self.hidden_act = hidden_act
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.patch_size = patch_size
        self.qkv_bias = qkv_bias
        self.frequency_stride = frequency_stride
        self.time_stride = time_stride
        self.max_length = max_length
        self.num_mel_bins = num_mel_bins

mindnlp.transformers.models.audio_spectrogram_transformer.configuration_audio_spectrogram_transformer.ASTConfig.__init__(hidden_size=768, num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072, hidden_act='gelu', hidden_dropout_prob=0.0, attention_probs_dropout_prob=0.0, initializer_range=0.02, layer_norm_eps=1e-12, patch_size=16, qkv_bias=True, frequency_stride=10, time_stride=10, max_length=1024, num_mel_bins=128, **kwargs)

Initializes an instance of ASTConfig.

PARAMETER DESCRIPTION
self

The object itself.

hidden_size

The size of the hidden layers. Defaults to 768.

TYPE: int DEFAULT: 768

num_hidden_layers

The number of hidden layers. Defaults to 12.

TYPE: int DEFAULT: 12

num_attention_heads

The number of attention heads. Defaults to 12.

TYPE: int DEFAULT: 12

intermediate_size

The size of the intermediate layer. Defaults to 3072.

TYPE: int DEFAULT: 3072

hidden_act

The activation function for the hidden layers. Defaults to 'gelu'.

TYPE: str DEFAULT: 'gelu'

hidden_dropout_prob

The dropout probability for the hidden layers. Defaults to 0.0.

TYPE: float DEFAULT: 0.0

attention_probs_dropout_prob

The dropout probability for the attention probabilities. Defaults to 0.0.

TYPE: float DEFAULT: 0.0

initializer_range

The range for parameter initialization. Defaults to 0.02.

TYPE: float DEFAULT: 0.02

layer_norm_eps

The epsilon value for layer normalization. Defaults to 1e-12.

TYPE: float DEFAULT: 1e-12

patch_size

The size of the patch. Defaults to 16.

TYPE: int DEFAULT: 16

qkv_bias

Whether to include bias in the query, key, and value tensors. Defaults to True.

TYPE: bool DEFAULT: True

frequency_stride

The stride for frequency. Defaults to 10.

TYPE: int DEFAULT: 10

time_stride

The stride for time. Defaults to 10.

TYPE: int DEFAULT: 10

max_length

The maximum length. Defaults to 1024.

TYPE: int DEFAULT: 1024

num_mel_bins

The number of Mel bins. Defaults to 128.

TYPE: int DEFAULT: 128

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/configuration_audio_spectrogram_transformer.py
def __init__(
    self,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
    initializer_range=0.02,
    layer_norm_eps=1e-12,
    patch_size=16,
    qkv_bias=True,
    frequency_stride=10,
    time_stride=10,
    max_length=1024,
    num_mel_bins=128,
    **kwargs,
):
    """
    Initializes an instance of ASTConfig.

    Args:
        self: The object itself.
        hidden_size (int, optional): The size of the hidden layers. Defaults to 768.
        num_hidden_layers (int, optional): The number of hidden layers. Defaults to 12.
        num_attention_heads (int, optional): The number of attention heads. Defaults to 12.
        intermediate_size (int, optional): The size of the intermediate layer. Defaults to 3072.
        hidden_act (str, optional): The activation function for the hidden layers. Defaults to 'gelu'.
        hidden_dropout_prob (float, optional): The dropout probability for the hidden layers. Defaults to 0.0.
        attention_probs_dropout_prob (float, optional): The dropout probability for the attention probabilities. Defaults to 0.0.
        initializer_range (float, optional): The range for parameter initialization. Defaults to 0.02.
        layer_norm_eps (float, optional): The epsilon value for layer normalization. Defaults to 1e-12.
        patch_size (int, optional): The size of the patch. Defaults to 16.
        qkv_bias (bool, optional): Whether to include bias in the query, key, and value tensors. Defaults to True.
        frequency_stride (int, optional): The stride for frequency. Defaults to 10.
        time_stride (int, optional): The stride for time. Defaults to 10.
        max_length (int, optional): The maximum length. Defaults to 1024.
        num_mel_bins (int, optional): The number of Mel bins. Defaults to 128.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(**kwargs)

    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.intermediate_size = intermediate_size
    self.hidden_act = hidden_act
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.initializer_range = initializer_range
    self.layer_norm_eps = layer_norm_eps
    self.patch_size = patch_size
    self.qkv_bias = qkv_bias
    self.frequency_stride = frequency_stride
    self.time_stride = time_stride
    self.max_length = max_length
    self.num_mel_bins = num_mel_bins

mindnlp.transformers.models.audio_spectrogram_transformer.feature_extraction_audio_spectrogram_transformer.ASTFeatureExtractor

Bases: SequenceFeatureExtractor

Constructs an Audio Spectrogram Transformer (AST) feature extractor.

This feature extractor inherits from [~feature_extraction_sequence_utils.SequenceFeatureExtractor] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.

This class extracts mel-filter bank features from raw speech using TorchAudio if installed or using numpy otherwise, pads/truncates them to a fixed length and normalizes them using a mean and standard deviation.

PARAMETER DESCRIPTION
feature_size

The feature dimension of the extracted features.

TYPE: `int`, *optional*, defaults to 1 DEFAULT: 1

sampling_rate

The sampling rate at which the audio files should be digitized, expressed in hertz (Hz).

TYPE: `int`, *optional*, defaults to 16000 DEFAULT: 16000

num_mel_bins

Number of Mel-frequency bins.

TYPE: `int`, *optional*, defaults to 128 DEFAULT: 128

max_length

Maximum length to which to pad/truncate the extracted features.

TYPE: `int`, *optional*, defaults to 1024 DEFAULT: 1024

do_normalize

Whether or not to normalize the log-Mel features using mean and std.

TYPE: `bool`, *optional*, defaults to `True` DEFAULT: True

mean

The mean value used to normalize the log-Mel features. Uses the AudioSet mean by default.

TYPE: `float`, *optional*, defaults to -4.2677393 DEFAULT: -4.2677393

std

The standard deviation value used to normalize the log-Mel features. Uses the AudioSet standard deviation by default.

TYPE: `float`, *optional*, defaults to 4.5689974 DEFAULT: 4.5689974

return_attention_mask

Whether or not [~ASTFeatureExtractor.__call__] should return attention_mask.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False
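
A minimal usage sketch with the AudioSet defaults (the one-second random waveform and the top-level `from mindnlp.transformers import ASTFeatureExtractor` import are illustrative assumptions; the fully qualified module path above also works):

>>> import numpy as np
>>> from mindnlp.transformers import ASTFeatureExtractor
...
>>> feature_extractor = ASTFeatureExtractor()
>>> waveform = np.random.randn(16000).astype(np.float32)  # 1 second of mono audio at 16 kHz
>>> inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="np")
>>> inputs["input_values"].shape  # (batch, max_length, num_mel_bins)
(1, 1024, 128)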

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py
class ASTFeatureExtractor(SequenceFeatureExtractor):
    r"""
    Constructs a Audio Spectrogram Transformer (AST) feature extractor.

    This feature extractor inherits from [`~feature_extraction_sequence_utils.SequenceFeatureExtractor`] which contains
    most of the main methods. Users should refer to this superclass for more information regarding those methods.

    This class extracts mel-filter bank features from raw speech using TorchAudio if installed or using numpy
    otherwise, pads/truncates them to a fixed length and normalizes them using a mean and standard deviation.

    Args:
        feature_size (`int`, *optional*, defaults to 1):
            The feature dimension of the extracted features.
        sampling_rate (`int`, *optional*, defaults to 16000):
            The sampling rate at which the audio files should be digitalized expressed in hertz (Hz).
        num_mel_bins (`int`, *optional*, defaults to 128):
            Number of Mel-frequency bins.
        max_length (`int`, *optional*, defaults to 1024):
            Maximum length to which to pad/truncate the extracted features.
        do_normalize (`bool`, *optional*, defaults to `True`):
            Whether or not to normalize the log-Mel features using `mean` and `std`.
        mean (`float`, *optional*, defaults to -4.2677393):
            The mean value used to normalize the log-Mel features. Uses the AudioSet mean by default.
        std (`float`, *optional*, defaults to 4.5689974):
            The standard deviation value used to normalize the log-Mel features. Uses the AudioSet standard deviation
            by default.
        return_attention_mask (`bool`, *optional*, defaults to `False`):
            Whether or not [`~ASTFeatureExtractor.__call__`] should return `attention_mask`.
    """
    model_input_names = ["input_values", "attention_mask"]

    def __init__(
        self,
        feature_size=1,
        sampling_rate=16000,
        num_mel_bins=128,
        max_length=1024,
        padding_value=0.0,
        do_normalize=True,
        mean=-4.2677393,
        std=4.5689974,
        return_attention_mask=False,
        **kwargs,
    ):
        """
        Initializes an instance of the ASTFeatureExtractor class.

        Args:
            self: The instance of the class.
            feature_size (int, optional): The size of the input features. Defaults to 1.
            sampling_rate (int, optional): The sampling rate of the audio. Defaults to 16000.
            num_mel_bins (int, optional): The number of mel bins for mel filtering. Defaults to 128.
            max_length (int, optional): The maximum length of the input. Defaults to 1024.
            padding_value (float, optional): The value used for padding sequences. Defaults to 0.0.
            do_normalize (bool, optional): Whether to normalize the input features. Defaults to True.
            mean (float, optional): The mean value for input feature normalization. Defaults to -4.2677393.
            std (float, optional): The standard deviation value for input feature normalization. Defaults to 4.5689974.
            return_attention_mask (bool, optional): Whether to return attention mask. Defaults to False.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(feature_size=feature_size, sampling_rate=sampling_rate, padding_value=padding_value, **kwargs)
        self.num_mel_bins = num_mel_bins
        self.max_length = max_length
        self.do_normalize = do_normalize
        self.mean = mean
        self.std = std
        self.return_attention_mask = return_attention_mask

        mel_filters = mel_filter_bank(
            num_frequency_bins=256,
            num_mel_filters=self.num_mel_bins,
            min_frequency=20,
            max_frequency=sampling_rate // 2,
            sampling_rate=sampling_rate,
            norm=None,
            mel_scale="kaldi",
            triangularize_in_mel_space=True,
        )

        self.mel_filters = np.pad(mel_filters, ((0, 1), (0, 0)))
        self.window = window_function(400, "hann", periodic=False)

    def _extract_fbank_features(
        self,
        waveform: np.ndarray,
        max_length: int,
    ) -> np.ndarray:
        """
        Get mel-filter bank features using TorchAudio. Note that TorchAudio requires 16-bit signed integers as inputs
        and hence the waveform should not be normalized before feature extraction.
        """
        # waveform = waveform * (2**15)  # Kaldi compliance: 16-bit signed integers
        waveform = np.squeeze(waveform)
        fbank = spectrogram(
            waveform,
            self.window,
            frame_length=400,
            hop_length=160,
            fft_length=512,
            power=2.0,
            center=False,
            preemphasis=0.97,
            mel_filters=self.mel_filters,
            log_mel="log",
            mel_floor=1.192092955078125e-07,
            remove_dc_offset=True,
        ).T

        fbank = mindspore.Tensor.from_numpy(fbank)

        n_frames = fbank.shape[0]
        difference = max_length - n_frames

        # pad or truncate, depending on difference
        if difference > 0:
            pad_module = mindspore.nn.ZeroPad2d((0, 0, 0, difference))
            fbank = pad_module(fbank)
        elif difference < 0:
            fbank = fbank[0:max_length, :]

        fbank = fbank.numpy()

        return fbank

    def normalize(self, input_values: np.ndarray) -> np.ndarray:
        """
        Normalize the input values using the mean and standard deviation stored in the ASTFeatureExtractor instance.

        Args:
            self (ASTFeatureExtractor): An instance of the ASTFeatureExtractor class.
                It holds the mean and standard deviation values necessary for normalization.
            input_values (np.ndarray): A NumPy array containing the input values to be normalized.
                The shape of the array must be compatible with the mean and standard deviation arrays stored in the ASTFeatureExtractor instance.

        Returns:
            np.ndarray: A NumPy array with the normalized values.
                The normalization is performed by subtracting the mean value and dividing by twice the standard deviation value.

        Raises:
            None
        """
        return (input_values - (self.mean)) / (self.std * 2)

    def __call__(
        self,
        raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
        sampling_rate: Optional[int] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs,
    ) -> BatchFeature:
        """
        Main method to featurize and prepare for the model one or several sequence(s).

        Args:
            raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`):
                The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
                values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not
                stereo, i.e. single float per timestep.
            sampling_rate (`int`, *optional*):
                The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
                `sampling_rate` at the forward call to prevent silent errors.
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors instead of list of python integers. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return Numpy `np.ndarray` objects.
        """
        if sampling_rate is not None:
            if sampling_rate != self.sampling_rate:
                raise ValueError(
                    f"The model corresponding to this feature extractor: {self} was trained using a sampling rate of"
                    f" {self.sampling_rate}. Please make sure that the provided `raw_speech` input was sampled with"
                    f" {self.sampling_rate} and not {sampling_rate}."
                )
        else:
            logger.warning(
                "It is strongly recommended to pass the `sampling_rate` argument to this function. "
                "Failing to do so can result in silent errors that might be hard to debug."
            )

        is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
        if is_batched_numpy and len(raw_speech.shape) > 2:
            raise ValueError(f"Only mono-channel audio is supported for input to {self}")
        is_batched = is_batched_numpy or (
            isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
        )

        if is_batched:
            raw_speech = [np.asarray(speech, dtype=np.float32) for speech in raw_speech]
        elif not is_batched and not isinstance(raw_speech, np.ndarray):
            raw_speech = np.asarray(raw_speech, dtype=np.float32)
        elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
            raw_speech = raw_speech.astype(np.float32)

        # always return batch
        if not is_batched:
            raw_speech = [raw_speech]

        # extract fbank features and pad/truncate to max_length
        features = [self._extract_fbank_features(waveform, max_length=self.max_length) for waveform in raw_speech]

        # convert into BatchFeature
        padded_inputs = BatchFeature({"input_values": features})

        # make sure list is in array format
        input_values = padded_inputs.get("input_values")
        if isinstance(input_values[0], list):
            padded_inputs["input_values"] = [np.asarray(feature, dtype=np.float32) for feature in input_values]

        # normalization
        if self.do_normalize:
            padded_inputs["input_values"] = [self.normalize(feature) for feature in input_values]

        if return_tensors is not None:
            padded_inputs = padded_inputs.convert_to_tensors(return_tensors)

        return padded_inputs

mindnlp.transformers.models.audio_spectrogram_transformer.feature_extraction_audio_spectrogram_transformer.ASTFeatureExtractor.__call__(raw_speech, sampling_rate=None, return_tensors=None, **kwargs)

Main method to featurize and prepare for the model one or several sequence(s).

PARAMETER DESCRIPTION
raw_speech

The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not stereo, i.e. single float per timestep.

TYPE: `np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`

sampling_rate

The sampling rate at which the raw_speech input was sampled. It is strongly recommended to pass sampling_rate at the forward call to prevent silent errors.

TYPE: `int`, *optional* DEFAULT: None

return_tensors

If set, will return tensors instead of list of python integers. Acceptable values are:

  • 'tf': Return TensorFlow tf.constant objects.
  • 'pt': Return PyTorch torch.Tensor objects.
  • 'np': Return Numpy np.ndarray objects.

TYPE: `str` or [`~utils.TensorType`], *optional* DEFAULT: None
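
A hedged sketch of a batched call, reusing the `feature_extractor` from the sketch above (the random waveforms are illustrative; passing `sampling_rate` explicitly avoids the warning emitted in the code below):

>>> import numpy as np
>>> clips = [np.random.randn(16000).astype(np.float32), np.random.randn(8000).astype(np.float32)]
>>> inputs = feature_extractor(clips, sampling_rate=16000, return_tensors="np")
>>> inputs["input_values"].shape  # each clip is padded/truncated to max_length frames
(2, 1024, 128)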

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py
def __call__(
    self,
    raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
    sampling_rate: Optional[int] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    **kwargs,
) -> BatchFeature:
    """
    Main method to featurize and prepare for the model one or several sequence(s).

    Args:
        raw_speech (`np.ndarray`, `List[float]`, `List[np.ndarray]`, `List[List[float]]`):
            The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
            values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not
            stereo, i.e. single float per timestep.
        sampling_rate (`int`, *optional*):
            The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
            `sampling_rate` at the forward call to prevent silent errors.
        return_tensors (`str` or [`~utils.TensorType`], *optional*):
            If set, will return tensors instead of list of python integers. Acceptable values are:

            - `'tf'`: Return TensorFlow `tf.constant` objects.
            - `'pt'`: Return PyTorch `torch.Tensor` objects.
            - `'np'`: Return Numpy `np.ndarray` objects.
    """
    if sampling_rate is not None:
        if sampling_rate != self.sampling_rate:
            raise ValueError(
                f"The model corresponding to this feature extractor: {self} was trained using a sampling rate of"
                f" {self.sampling_rate}. Please make sure that the provided `raw_speech` input was sampled with"
                f" {self.sampling_rate} and not {sampling_rate}."
            )
    else:
        logger.warning(
            "It is strongly recommended to pass the `sampling_rate` argument to this function. "
            "Failing to do so can result in silent errors that might be hard to debug."
        )

    is_batched_numpy = isinstance(raw_speech, np.ndarray) and len(raw_speech.shape) > 1
    if is_batched_numpy and len(raw_speech.shape) > 2:
        raise ValueError(f"Only mono-channel audio is supported for input to {self}")
    is_batched = is_batched_numpy or (
        isinstance(raw_speech, (list, tuple)) and (isinstance(raw_speech[0], (np.ndarray, tuple, list)))
    )

    if is_batched:
        raw_speech = [np.asarray(speech, dtype=np.float32) for speech in raw_speech]
    elif not is_batched and not isinstance(raw_speech, np.ndarray):
        raw_speech = np.asarray(raw_speech, dtype=np.float32)
    elif isinstance(raw_speech, np.ndarray) and raw_speech.dtype is np.dtype(np.float64):
        raw_speech = raw_speech.astype(np.float32)

    # always return batch
    if not is_batched:
        raw_speech = [raw_speech]

    # extract fbank features and pad/truncate to max_length
    features = [self._extract_fbank_features(waveform, max_length=self.max_length) for waveform in raw_speech]

    # convert into BatchFeature
    padded_inputs = BatchFeature({"input_values": features})

    # make sure list is in array format
    input_values = padded_inputs.get("input_values")
    if isinstance(input_values[0], list):
        padded_inputs["input_values"] = [np.asarray(feature, dtype=np.float32) for feature in input_values]

    # normalization
    if self.do_normalize:
        padded_inputs["input_values"] = [self.normalize(feature) for feature in input_values]

    if return_tensors is not None:
        padded_inputs = padded_inputs.convert_to_tensors(return_tensors)

    return padded_inputs

mindnlp.transformers.models.audio_spectrogram_transformer.feature_extraction_audio_spectrogram_transformer.ASTFeatureExtractor.__init__(feature_size=1, sampling_rate=16000, num_mel_bins=128, max_length=1024, padding_value=0.0, do_normalize=True, mean=-4.2677393, std=4.5689974, return_attention_mask=False, **kwargs)

Initializes an instance of the ASTFeatureExtractor class.

PARAMETER DESCRIPTION
self

The instance of the class.

feature_size

The size of the input features. Defaults to 1.

TYPE: int DEFAULT: 1

sampling_rate

The sampling rate of the audio. Defaults to 16000.

TYPE: int DEFAULT: 16000

num_mel_bins

The number of mel bins for mel filtering. Defaults to 128.

TYPE: int DEFAULT: 128

max_length

The maximum length of the input. Defaults to 1024.

TYPE: int DEFAULT: 1024

padding_value

The value used for padding sequences. Defaults to 0.0.

TYPE: float DEFAULT: 0.0

do_normalize

Whether to normalize the input features. Defaults to True.

TYPE: bool DEFAULT: True

mean

The mean value for input feature normalization. Defaults to -4.2677393.

TYPE: float DEFAULT: -4.2677393

std

The standard deviation value for input feature normalization. Defaults to 4.5689974.

TYPE: float DEFAULT: 4.5689974

return_attention_mask

Whether to return attention mask. Defaults to False.

TYPE: bool DEFAULT: False

RETURNS DESCRIPTION

None.

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py
def __init__(
    self,
    feature_size=1,
    sampling_rate=16000,
    num_mel_bins=128,
    max_length=1024,
    padding_value=0.0,
    do_normalize=True,
    mean=-4.2677393,
    std=4.5689974,
    return_attention_mask=False,
    **kwargs,
):
    """
    Initializes an instance of the ASTFeatureExtractor class.

    Args:
        self: The instance of the class.
        feature_size (int, optional): The size of the input features. Defaults to 1.
        sampling_rate (int, optional): The sampling rate of the audio. Defaults to 16000.
        num_mel_bins (int, optional): The number of mel bins for mel filtering. Defaults to 128.
        max_length (int, optional): The maximum length of the input. Defaults to 1024.
        padding_value (float, optional): The value used for padding sequences. Defaults to 0.0.
        do_normalize (bool, optional): Whether to normalize the input features. Defaults to True.
        mean (float, optional): The mean value for input feature normalization. Defaults to -4.2677393.
        std (float, optional): The standard deviation value for input feature normalization. Defaults to 4.5689974.
        return_attention_mask (bool, optional): Whether to return attention mask. Defaults to False.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(feature_size=feature_size, sampling_rate=sampling_rate, padding_value=padding_value, **kwargs)
    self.num_mel_bins = num_mel_bins
    self.max_length = max_length
    self.do_normalize = do_normalize
    self.mean = mean
    self.std = std
    self.return_attention_mask = return_attention_mask

    mel_filters = mel_filter_bank(
        num_frequency_bins=256,
        num_mel_filters=self.num_mel_bins,
        min_frequency=20,
        max_frequency=sampling_rate // 2,
        sampling_rate=sampling_rate,
        norm=None,
        mel_scale="kaldi",
        triangularize_in_mel_space=True,
    )

    self.mel_filters = np.pad(mel_filters, ((0, 1), (0, 0)))
    self.window = window_function(400, "hann", periodic=False)

mindnlp.transformers.models.audio_spectrogram_transformer.feature_extraction_audio_spectrogram_transformer.ASTFeatureExtractor.normalize(input_values)

Normalize the input values using the mean and standard deviation stored in the ASTFeatureExtractor instance.

PARAMETER DESCRIPTION
self

An instance of the ASTFeatureExtractor class. It holds the mean and standard deviation values necessary for normalization.

TYPE: ASTFeatureExtractor

input_values

A NumPy array containing the input values to be normalized. The shape of the array must be compatible with the mean and standard deviation arrays stored in the ASTFeatureExtractor instance.

TYPE: ndarray

RETURNS DESCRIPTION
ndarray

np.ndarray: A NumPy array with the normalized values. The normalization is performed by subtracting the mean value and dividing by twice the standard deviation value.
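
For example, with the AudioSet defaults (`mean = -4.2677393`, `std = 4.5689974`) a log-Mel value of 0.0 is mapped to roughly 0.467, while a value equal to the mean is mapped to 0.0. A standalone NumPy sketch of the same expression used in the source below:

>>> import numpy as np
>>> mean, std = -4.2677393, 4.5689974
>>> np.round((np.array([0.0, mean]) - mean) / (std * 2), 3)
array([0.467, 0.   ])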

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py
def normalize(self, input_values: np.ndarray) -> np.ndarray:
    """
    Normalize the input values using the mean and standard deviation stored in the ASTFeatureExtractor instance.

    Args:
        self (ASTFeatureExtractor): An instance of the ASTFeatureExtractor class.
            It holds the mean and standard deviation values necessary for normalization.
        input_values (np.ndarray): A NumPy array containing the input values to be normalized.
            The shape of the array must be compatible with the mean and standard deviation arrays stored in the ASTFeatureExtractor instance.

    Returns:
        np.ndarray: A NumPy array with the normalized values.
            The normalization is performed by subtracting the mean value and dividing by twice the standard deviation value.

    Raises:
        None
    """
    return (input_values - (self.mean)) / (self.std * 2)

mindnlp.transformers.models.audio_spectrogram_transformer.modeling_audio_spectrogram_transformer.AUDIO_SPECTROGRAM_TRANSFORMER_PRETRAINED_MODEL_ARCHIVE_LIST = ['MIT/ast-finetuned-audioset-10-10-0.4593'] module-attribute

mindnlp.transformers.models.audio_spectrogram_transformer.modeling_audio_spectrogram_transformer.ASTForAudioClassification

Bases: ASTPreTrainedModel

ASTForAudioClassification implements an audio classification model built on the AST (Audio Spectrogram Transformer) architecture. This class inherits from ASTPreTrainedModel and provides methods for initializing the model from a configuration and for running the forward pass on audio classification tasks.

ATTRIBUTE DESCRIPTION
num_labels

Number of labels for the audio classification task.

audio_spectrogram_transformer

Instance of ASTModel for processing audio input.

classifier

Instance of ASTMLPHead for classification using the model's pooled output.
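
A hedged end-to-end sketch with random weights (the top-level `from mindnlp.transformers import ...` imports, the random spectrogram batch, and the default `return_dict` behaviour are assumptions, not taken from this page):

>>> import numpy as np
>>> import mindspore
>>> from mindnlp.transformers import ASTConfig, ASTForAudioClassification
...
>>> model = ASTForAudioClassification(ASTConfig())  # randomly initialized
>>> input_values = mindspore.Tensor(np.random.randn(1, 1024, 128).astype(np.float32))
>>> outputs = model(input_values)
>>> outputs.logits.shape  # (batch_size, config.num_labels)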

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py
class ASTForAudioClassification(ASTPreTrainedModel):

    """
    ASTForAudioClassification is a class that implements a model for audio classification using the AST (Audio Spectrogram Transformer) architecture.
    This class inherits from ASTPreTrainedModel and provides methods for initializing the model with a configuration, and forwarding the model for audio classification tasks.

    Attributes:
        num_labels: Number of labels for the audio classification task.
        audio_spectrogram_transformer: Instance of ASTModel for processing audio input.
        classifier: Instance of ASTMLPHead for classification using the model's pooled output.
    """
    def __init__(self, config: ASTConfig) -> None:
        """
        Initializes an instance of the ASTForAudioClassification class.

        Args:
            self: The instance of the class.
            config (ASTConfig):
                The configuration object containing the necessary parameters for ASTForAudioClassification initialization.

                - num_labels (int): The number of labels/classes for audio classification.

        Returns:
            None.

        Raises:
            None.

        """
        super().__init__(config)

        self.num_labels = config.num_labels
        self.audio_spectrogram_transformer = ASTModel(config)

        # Classifier head
        self.classifier = ASTMLPHead(config)

        # Initialize weights and apply final processing
        self.post_init()

    def forward(
        self,
        input_values: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        labels: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[tuple, SequenceClassifierOutput]:
        r"""
        Args:
            labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
                Labels for computing the audio classification/regression loss. Indices should be in `[0, ...,
                config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
                `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.audio_spectrogram_transformer(
            input_values,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        pooled_output = outputs[1]
        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                if self.num_labels == 1:
                    loss = F.mse_loss(logits.squeeze(), labels.squeeze())
                else:
                    loss = F.mse_loss(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss = F.cross_entropy(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss = F.binary_cross_entropy_with_logits(logits, labels)

        if not return_dict:
            output = (logits,) + outputs[2:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

mindnlp.transformers.models.audio_spectrogram_transformer.modeling_audio_spectrogram_transformer.ASTForAudioClassification.__init__(config)

Initializes an instance of the ASTForAudioClassification class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object containing the necessary parameters for ASTForAudioClassification initialization.

  • num_labels (int): The number of labels/classes for audio classification.

TYPE: ASTConfig

RETURNS DESCRIPTION
None

None.

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py
def __init__(self, config: ASTConfig) -> None:
    """
    Initializes an instance of the ASTForAudioClassification class.

    Args:
        self: The instance of the class.
        config (ASTConfig):
            The configuration object containing the necessary parameters for ASTForAudioClassification initialization.

            - num_labels (int): The number of labels/classes for audio classification.

    Returns:
        None.

    Raises:
        None.

    """
    super().__init__(config)

    self.num_labels = config.num_labels
    self.audio_spectrogram_transformer = ASTModel(config)

    # Classifier head
    self.classifier = ASTMLPHead(config)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.audio_spectrogram_transformer.modeling_audio_spectrogram_transformer.ASTForAudioClassification.forward(input_values=None, head_mask=None, labels=None, output_attentions=None, output_hidden_states=None, return_dict=None)

PARAMETER DESCRIPTION
labels

Labels for computing the audio classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss); if config.num_labels > 1 a classification loss is computed (Cross-Entropy).

TYPE: `mindspore.Tensor` of shape `(batch_size,)`, *optional* DEFAULT: None

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py
def forward(
    self,
    input_values: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    labels: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[tuple, SequenceClassifierOutput]:
    r"""
    Args:
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the audio classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    outputs = self.audio_spectrogram_transformer(
        input_values,
        head_mask=head_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )

    pooled_output = outputs[1]
    logits = self.classifier(pooled_output)

    loss = None
    if labels is not None:
        if self.config.problem_type is None:
            if self.num_labels == 1:
                self.config.problem_type = "regression"
            elif self.num_labels > 1 and labels.dtype in (mindspore.int64, mindspore.int32):
                self.config.problem_type = "single_label_classification"
            else:
                self.config.problem_type = "multi_label_classification"

        if self.config.problem_type == "regression":
            if self.num_labels == 1:
                loss = F.mse_loss(logits.squeeze(), labels.squeeze())
            else:
                loss = F.mse_loss(logits, labels)
        elif self.config.problem_type == "single_label_classification":
            loss = F.cross_entropy(logits.view(-1, self.num_labels), labels.view(-1))
        elif self.config.problem_type == "multi_label_classification":
            loss = F.binary_cross_entropy_with_logits(logits, labels)

    if not return_dict:
        output = (logits,) + outputs[2:]
        return ((loss,) + output) if loss is not None else output

    return SequenceClassifierOutput(
        loss=loss,
        logits=logits,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
    )

mindnlp.transformers.models.audio_spectrogram_transformer.modeling_audio_spectrogram_transformer.ASTModel

Bases: ASTPreTrainedModel

ASTModel is the bare Audio Spectrogram Transformer (AST) encoder. This class inherits from ASTPreTrainedModel and includes methods for initializing the model, getting the input embeddings, pruning attention heads, and forwarding the model's output.

ATTRIBUTE DESCRIPTION
config

An instance of ASTConfig containing configuration parameters for the model.

embeddings

An instance of ASTEmbeddings for handling AST embeddings.

encoder

An instance of ASTEncoder for encoding AST inputs.

layernorm

A layer normalization module with specified hidden size and epsilon.

METHOD DESCRIPTION
__init__

Initializes the ASTModel with the given configuration.

get_input_embeddings

Returns the patch embeddings used by the model.

_prune_heads

Prunes specified attention heads in the model.

forward

Constructs the model output based on input values and optional arguments.

The forward method handles input processing, encoding, and output generation based on the specified parameters, while head pruning removes selected attention heads from the encoder. ASTModel serves as the base encoder used by task-specific heads such as ASTForAudioClassification.
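
As the source below shows, the pooled output is the element-wise average of the first two tokens of the layer-normalized sequence output (the class and distillation embeddings in the AST design). A hedged sketch with random weights (the top-level import and the expected shapes are assumptions based on the default configuration):

>>> import numpy as np
>>> import mindspore
>>> from mindnlp.transformers import ASTConfig, ASTModel
...
>>> model = ASTModel(ASTConfig())  # randomly initialized
>>> spectrogram = mindspore.Tensor(np.random.randn(1, 1024, 128).astype(np.float32))
>>> outputs = model(spectrogram)
>>> outputs.last_hidden_state.shape, outputs.pooler_output.shape  # expected (1, 1214, 768) and (1, 768): 1212 patches + 2 special tokens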

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py
class ASTModel(ASTPreTrainedModel):

    """
    ASTModel is a class representing a model for Abstract Syntax Trees (AST) processing.
    This class inherits from ASTPreTrainedModel and includes methods for initializing the model, getting input embeddings,
    pruning heads, and forwarding the model's output.

    Attributes:
        config: An instance of ASTConfig containing configuration parameters for the model.
        embeddings: An instance of ASTEmbeddings for handling AST embeddings.
        encoder: An instance of ASTEncoder for encoding AST inputs.
        layernorm: A layer normalization module with specified hidden size and epsilon.

    Methods:
        __init__: Initializes the ASTModel with the given configuration.
        get_input_embeddings: Returns the patch embeddings used by the model.
        _prune_heads: Prunes specified attention heads in the model.
        forward: Constructs the model output based on input values and optional arguments.

    The forward method handles input processing, encoding, and output generation based on the specified parameters.
    Pruning heads allows for fine-tuning the attention mechanism of the model.
    Overall, ASTModel provides a comprehensive solution for AST-based tasks.
    """
    def __init__(self, config: ASTConfig) -> None:
        """
        Initializes an instance of the ASTModel class.

        Args:
            self: The instance of the class.
            config (ASTConfig): The configuration object for the ASTModel.
                t provides necessary settings and hyperparameters for the model.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(config)
        self.config = config

        self.embeddings = ASTEmbeddings(config)
        self.encoder = ASTEncoder(config)

        self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self) -> ASTPatchEmbeddings:
        """
        Retrieve the input embeddings from the AST model.

        Args:
            self (ASTModel): The instance of the ASTModel class.
                It represents the current object of the ASTModel.
                This parameter is required as the method is an instance method.

        Returns:
            ASTPatchEmbeddings: An instance of ASTPatchEmbeddings representing the input embeddings.
                The returned ASTPatchEmbeddings object contains the patch embeddings related to the input.

        Raises:
            None
        """
        return self.embeddings.patch_embeddings

    def _prune_heads(self, heads_to_prune: Dict[int, List[int]]) -> None:
        """
        Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
        class PreTrainedModel
        """
        for layer, heads in heads_to_prune.items():
            self.encoder.layer[layer].attention.prune_heads(heads)

    def forward(
        self,
        input_values: Optional[mindspore.Tensor] = None,
        head_mask: Optional[mindspore.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        '''
        This method forwards the ASTModel by processing the input values through the model's layers.

        Args:
            self (ASTModel): The instance of the ASTModel.
            input_values (Optional[mindspore.Tensor]): The input values to be processed by the model. Default is None.
            head_mask (Optional[mindspore.Tensor]): The head mask for controlling the attention in the encoder layers. Default is None.
            output_attentions (Optional[bool]): Whether to output attentions. Default is None.
            output_hidden_states (Optional[bool]): Whether to output hidden states. Default is None.
            return_dict (Optional[bool]): Whether to return a dict. Default is None.

        Returns:
            Union[Tuple, BaseModelOutputWithPooling]: The forwarded output, which can be a tuple or BaseModelOutputWithPooling object.

        Raises:
            ValueError: If input_values is None, a ValueError is raised with the message 'You have to specify input_values'.
        '''
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        if input_values is None:
            raise ValueError("You have to specify input_values")

        # Prepare head mask if needed
        # 1.0 in head_mask indicates we keep the head
        # attention_probs has shape bsz x n_heads x N x N
        # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
        # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
        head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

        embedding_output = self.embeddings(input_values)

        encoder_outputs = self.encoder(
            embedding_output,
            head_mask=head_mask,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        sequence_output = encoder_outputs[0]
        sequence_output = self.layernorm(sequence_output)

        # average the [CLS] and distillation token states (AST prepends both, DeiT-style) to form the pooled output
        pooled_output = (sequence_output[:, 0] + sequence_output[:, 1]) / 2

        if not return_dict:
            return (sequence_output, pooled_output) + encoder_outputs[1:]

        return BaseModelOutputWithPooling(
            last_hidden_state=sequence_output,
            pooler_output=pooled_output,
            hidden_states=encoder_outputs.hidden_states,
            attentions=encoder_outputs.attentions,
        )
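
The following is a minimal usage sketch, not part of the original source: it runs a randomly initialized ASTModel on a dummy log-Mel spectrogram whose shape follows the default ASTConfig values (max_length=1024 time frames, num_mel_bins=128). The top-level import path and the use of numpy to build the input are assumptions made for illustration.

>>> import numpy as np
>>> import mindspore
>>> from mindnlp.transformers import ASTConfig, ASTModel
...
>>> config = ASTConfig()
>>> model = ASTModel(config)
>>> # dummy spectrogram batch of shape (batch_size, max_length, num_mel_bins)
>>> input_values = mindspore.Tensor(np.random.randn(1, config.max_length, config.num_mel_bins), mindspore.float32)
>>> outputs = model(input_values)
>>> # last_hidden_state: (batch_size, num_patches + 2, hidden_size); pooler_output: (batch_size, hidden_size)
>>> outputs.last_hidden_state.shape, outputs.pooler_output.shape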

mindnlp.transformers.models.audio_spectrogram_transformer.modeling_audio_spectrogram_transformer.ASTModel.__init__(config)

Initializes an instance of the ASTModel class.

PARAMETER DESCRIPTION
self

The instance of the class.

config

The configuration object for the ASTModel. It provides the necessary settings and hyperparameters for the model.

TYPE: ASTConfig

RETURNS DESCRIPTION
None

None.

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py
def __init__(self, config: ASTConfig) -> None:
    """
    Initializes an instance of the ASTModel class.

    Args:
        self: The instance of the class.
        config (ASTConfig): The configuration object for the ASTModel.
            It provides the necessary settings and hyperparameters for the model.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(config)
    self.config = config

    self.embeddings = ASTEmbeddings(config)
    self.encoder = ASTEncoder(config)

    self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

    # Initialize weights and apply final processing
    self.post_init()

mindnlp.transformers.models.audio_spectrogram_transformer.modeling_audio_spectrogram_transformer.ASTModel.forward(input_values=None, head_mask=None, output_attentions=None, output_hidden_states=None, return_dict=None)

Runs the forward pass of the ASTModel, processing the input values through the embedding, encoder, and final layer-normalization layers.

PARAMETER DESCRIPTION
self

The instance of the ASTModel.

TYPE: ASTModel

input_values

The input spectrogram features, typically of shape (batch_size, max_length, num_mel_bins). Default is None.

TYPE: Optional[Tensor] DEFAULT: None

head_mask

The head mask for controlling the attention in the encoder layers. Default is None.

TYPE: Optional[Tensor] DEFAULT: None

output_attentions

Whether to output attentions. Default is None.

TYPE: Optional[bool] DEFAULT: None

output_hidden_states

Whether to output hidden states. Default is None.

TYPE: Optional[bool] DEFAULT: None

return_dict

Whether to return a dict. Default is None.

TYPE: Optional[bool] DEFAULT: None

RETURNS DESCRIPTION
Union[Tuple, BaseModelOutputWithPooling]

Union[Tuple, BaseModelOutputWithPooling]: The model output, returned as a plain tuple or as a BaseModelOutputWithPooling object depending on return_dict.

RAISES DESCRIPTION
ValueError

If input_values is None, a ValueError is raised with the message 'You have to specify input_values'.

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py
def forward(
    self,
    input_values: Optional[mindspore.Tensor] = None,
    head_mask: Optional[mindspore.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPooling]:
    '''
    Runs the forward pass of the ASTModel, processing the input values through the embedding, encoder, and final layer-normalization layers.

    Args:
        self (ASTModel): The instance of the ASTModel.
        input_values (Optional[mindspore.Tensor]): The input spectrogram features, typically of shape (batch_size, max_length, num_mel_bins). Default is None.
        head_mask (Optional[mindspore.Tensor]): The head mask for controlling the attention in the encoder layers. Default is None.
        output_attentions (Optional[bool]): Whether to output attentions. Default is None.
        output_hidden_states (Optional[bool]): Whether to output hidden states. Default is None.
        return_dict (Optional[bool]): Whether to return a dict. Default is None.

    Returns:
        Union[Tuple, BaseModelOutputWithPooling]: The model output, returned as a plain tuple or as a BaseModelOutputWithPooling object depending on return_dict.

    Raises:
        ValueError: If input_values is None, a ValueError is raised with the message 'You have to specify input_values'.
    '''
    output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
    output_hidden_states = (
        output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
    )
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict

    if input_values is None:
        raise ValueError("You have to specify input_values")

    # Prepare head mask if needed
    # 1.0 in head_mask indicates we keep the head
    # attention_probs has shape bsz x n_heads x N x N
    # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
    # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
    head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)

    embedding_output = self.embeddings(input_values)

    encoder_outputs = self.encoder(
        embedding_output,
        head_mask=head_mask,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
    )
    sequence_output = encoder_outputs[0]
    sequence_output = self.layernorm(sequence_output)

    # average the [CLS] and distillation token states (AST prepends both, DeiT-style) to form the pooled output
    pooled_output = (sequence_output[:, 0] + sequence_output[:, 1]) / 2

    if not return_dict:
        return (sequence_output, pooled_output) + encoder_outputs[1:]

    return BaseModelOutputWithPooling(
        last_hidden_state=sequence_output,
        pooler_output=pooled_output,
        hidden_states=encoder_outputs.hidden_states,
        attentions=encoder_outputs.attentions,
    )
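
As a hypothetical illustration of the head_mask convention noted in the comments above (1.0 keeps a head, 0.0 masks it), the sketch below masks the last attention head in every encoder layer. It reuses config, model, and input_values from the earlier example, and the (num_hidden_layers, num_attention_heads) mask shape is one of the two layouts the code accepts.

>>> import numpy as np
>>> import mindspore
>>> mask = np.ones((config.num_hidden_layers, config.num_attention_heads), dtype=np.float32)
>>> mask[:, -1] = 0.0  # zero out the last head in every layer
>>> head_mask = mindspore.Tensor(mask)
>>> outputs = model(input_values, head_mask=head_mask, output_attentions=True)
>>> # one attention map per encoder layer is returned when output_attentions=True
>>> len(outputs.attentions)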

mindnlp.transformers.models.audio_spectrogram_transformer.modeling_audio_spectrogram_transformer.ASTModel.get_input_embeddings()

Retrieve the input embeddings from the AST model.

PARAMETER DESCRIPTION
self

The instance of the ASTModel class. It represents the current object of the ASTModel. This parameter is required as the method is an instance method.

TYPE: ASTModel

RETURNS DESCRIPTION
ASTPatchEmbeddings

An instance of ASTPatchEmbeddings representing the input embeddings. The returned ASTPatchEmbeddings object contains the patch embeddings related to the input.

TYPE: ASTPatchEmbeddings

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py
def get_input_embeddings(self) -> ASTPatchEmbeddings:
    """
    Retrieve the input embeddings from the AST model.

    Args:
        self (ASTModel): The instance of the ASTModel class.
            It represents the current object of the ASTModel.
            This parameter is required as the method is an instance method.

    Returns:
        ASTPatchEmbeddings: An instance of ASTPatchEmbeddings representing the input embeddings.
            The returned ASTPatchEmbeddings object contains the patch embeddings related to the input.

    Raises:
        None
    """
    return self.embeddings.patch_embeddings
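
A short sketch of what this accessor returns, reusing the model from the earlier example; the type-name check is illustrative only.

>>> patch_embeddings = model.get_input_embeddings()
>>> type(patch_embeddings).__name__
'ASTPatchEmbeddings'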

mindnlp.transformers.models.audio_spectrogram_transformer.modeling_audio_spectrogram_transformer.ASTPreTrainedModel

Bases: PreTrainedModel

An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained models.

Source code in mindnlp/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py
class ASTPreTrainedModel(PreTrainedModel):
    """
    An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
    models.
    """
    config_class = ASTConfig
    base_model_prefix = "audio_spectrogram_transformer"
    main_input_name = "input_values"
    supports_gradient_checkpointing = True

    def _init_weights(self, cell):
        """Initialize the weights"""
        if isinstance(cell, (nn.Linear, nn.Conv2d)):
            cell.weight.set_data(initializer(Normal(self.config.initializer_range),
                                                    cell.weight.shape, cell.weight.dtype))
            if cell.bias is not None:
                cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))

        elif isinstance(cell, nn.LayerNorm):
            cell.weight.set_data(initializer('ones', cell.weight.shape, cell.weight.dtype))
            cell.bias.set_data(initializer('zeros', cell.bias.shape, cell.bias.dtype))
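
To make the initialization scheme above concrete, here is a standalone sketch of the same initializer calls used in _init_weights, built only from mindspore.common.initializer; the shapes are illustrative and the snippet is not part of the original source.

>>> from mindspore import dtype as mstype
>>> from mindspore.common.initializer import initializer, Normal
...
>>> # Linear/Conv2d weights are drawn from Normal(sigma=config.initializer_range), biases are zeroed
>>> weight = initializer(Normal(0.02), (768, 768), mstype.float32)
>>> bias = initializer('zeros', (768,), mstype.float32)
>>> # LayerNorm weights are set to ones, biases to zeros
>>> ln_weight = initializer('ones', (768,), mstype.float32)
>>> ln_bias = initializer('zeros', (768,), mstype.float32)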