wav2vec2_with_lm

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm

Speech processor class for Wav2Vec2

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2DecoderWithLMOutput dataclass

Bases: ModelOutput

Output type of [Wav2Vec2DecoderWithLM], with transcription.

PARAMETER DESCRIPTION
text

Decoded logits in text form. Usually the speech transcription.

TYPE: list of `str` or `str`

logit_score

Total logit score of the beams associated with produced text.

TYPE: list of `float` or `float` DEFAULT: None

lm_score

Fused lm_score of the beams associated with produced text.

TYPE: list of `float` DEFAULT: None

word_offsets

Offsets of the decoded words. In combination with sampling rate and model downsampling rate word offsets can be used to compute time stamps for each word.

TYPE: list of `List[Dict[str, Union[int, str]]]` or `List[Dict[str, Union[int, str]]]` DEFAULT: None
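
A minimal sketch of turning `word_offsets` into word-level timestamps (hedged; it assumes `outputs` was produced by [`~Wav2Vec2ProcessorWithLM.decode`] with `output_word_offsets=True`, and that `model` and `processor` are the CTC model and this processor):

>>> time_offset = model.config.inputs_to_logits_ratio / processor.feature_extractor.sampling_rate
>>> word_times = [
...     {
...         "word": d["word"],
...         "start_time": round(d["start_offset"] * time_offset, 2),
...         "end_time": round(d["end_offset"] * time_offset, 2),
...     }
...     for d in outputs.word_offsets
... ]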

Source code in mindnlp/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
@dataclass
class Wav2Vec2DecoderWithLMOutput(ModelOutput):
    """
    Output type of [`Wav2Vec2DecoderWithLM`], with transcription.

    Args:
        text (list of `str` or `str`):
            Decoded logits in text form. Usually the speech transcription.
        logit_score (list of `float` or `float`):
            Total logit score of the beams associated with produced text.
        lm_score (list of `float`):
            Fused lm_score of the beams associated with produced text.
        word_offsets (list of `List[Dict[str, Union[int, str]]]` or `List[Dict[str, Union[int, str]]]`):
            Offsets of the decoded words. In combination with sampling rate and model downsampling rate word offsets
            can be used to compute time stamps for each word.
    """
    text: Union[List[List[str]], List[str], str]
    logit_score: Union[List[List[float]], List[float], float] = None
    lm_score: Union[List[List[float]], List[float], float] = None
    word_offsets: Union[List[List[ListOfDict]], List[ListOfDict], ListOfDict] = None

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2ProcessorWithLM

Bases: ProcessorMixin

Constructs a Wav2Vec2 processor which wraps a Wav2Vec2 feature extractor, a Wav2Vec2 CTC tokenizer and a decoder with language model support into a single processor for language model boosted speech recognition decoding.

PARAMETER DESCRIPTION
feature_extractor

An instance of [Wav2Vec2FeatureExtractor]. The feature extractor is a required input.

TYPE: [`Wav2Vec2FeatureExtractor`]

tokenizer

An instance of [Wav2Vec2CTCTokenizer]. The tokenizer is a required input.

TYPE: [`Wav2Vec2CTCTokenizer`]

decoder

An instance of [pyctcdecode.BeamSearchDecoderCTC]. The decoder is a required input.

TYPE: `pyctcdecode.BeamSearchDecoderCTC`
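
A hedged sketch of the usual way to obtain all three components at once via [`~Wav2Vec2ProcessorWithLM.from_pretrained`] (assuming the class is exported from `mindnlp.transformers`; the repo id is the one used in the examples later on this page):

>>> from mindnlp.transformers import Wav2Vec2ProcessorWithLM
>>> processor = Wav2Vec2ProcessorWithLM.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
>>> processor.feature_extractor, processor.tokenizer, processor.decoder  # the three wrapped components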

Source code in mindnlp/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
class Wav2Vec2ProcessorWithLM(ProcessorMixin):
    r"""
    Constructs a Wav2Vec2 processor which wraps a Wav2Vec2 feature extractor, a Wav2Vec2 CTC tokenizer and a decoder
    with language model support into a single processor for language model boosted speech recognition decoding.

    Args:
        feature_extractor ([`Wav2Vec2FeatureExtractor`]):
            An instance of [`Wav2Vec2FeatureExtractor`]. The feature extractor is a required input.
        tokenizer ([`Wav2Vec2CTCTokenizer`]):
            An instance of [`Wav2Vec2CTCTokenizer`]. The tokenizer is a required input.
        decoder (`pyctcdecode.BeamSearchDecoderCTC`):
            An instance of [`pyctcdecode.BeamSearchDecoderCTC`]. The decoder is a required input.
    """
    feature_extractor_class = "Wav2Vec2FeatureExtractor"
    tokenizer_class = "Wav2Vec2CTCTokenizer"

    def __init__(
        self,
        feature_extractor: "FeatureExtractionMixin",
        tokenizer: "PreTrainedTokenizerBase",
        decoder: "BeamSearchDecoderCTC",
    ):
        """
        Initializes a Wav2Vec2ProcessorWithLM object.

        Args:
            self: The object instance.
            feature_extractor (FeatureExtractionMixin): The feature extractor used for processing input audio data.
            tokenizer (PreTrainedTokenizerBase): The tokenizer used for tokenizing input text data.
            decoder (BeamSearchDecoderCTC): The decoder used for decoding the model's output.

        Returns:
            None. This method does not return any value.

        Raises:
            ValueError: If the provided 'decoder' parameter is not an instance of BeamSearchDecoderCTC.
            ValueError: If there are missing tokens in the decoder's alphabet that are present in the tokenizer's
                vocabulary.
        """
        from pyctcdecode import BeamSearchDecoderCTC

        super().__init__(feature_extractor, tokenizer)
        if not isinstance(decoder, BeamSearchDecoderCTC):
            raise ValueError(f"`decoder` has to be of type {BeamSearchDecoderCTC.__class__}, but is {type(decoder)}")

        # make sure that decoder's alphabet and tokenizer's vocab match in content
        missing_decoder_tokens = self.get_missing_alphabet_tokens(decoder, tokenizer)
        if len(missing_decoder_tokens) > 0:
            raise ValueError(
                f"The tokens {missing_decoder_tokens} are defined in the tokenizer's "
                "vocabulary, but not in the decoder's alphabet. "
                f"Make sure to include {missing_decoder_tokens} in the decoder's alphabet."
            )

        self.decoder = decoder
        self.current_processor = self.feature_extractor
        self._in_target_context_manager = False

    def save_pretrained(self, save_directory):
        """
        Save the Wav2Vec2ProcessorWithLM instance and the associated language model to the specified directory.

        Args:
            self (Wav2Vec2ProcessorWithLM): The instance of the Wav2Vec2ProcessorWithLM class.
            save_directory (str): The directory path where the processor and language model will be saved. 

        Returns:
            None.

        Raises:
            OSError: If the save_directory cannot be accessed or does not exist.
            ValueError: If the save_directory is not a valid directory path.
            TypeError: If the save_directory parameter is not a string.
        """
        super().save_pretrained(save_directory)
        self.decoder.save_to_dir(save_directory)

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
        r"""
        Instantiate a [`Wav2Vec2ProcessorWithLM`] from a pretrained Wav2Vec2 processor.

        <Tip>

        This class method is simply calling Wav2Vec2FeatureExtractor's
        [`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`], Wav2Vec2CTCTokenizer's
        [`~tokenization_utils_base.PreTrainedTokenizerBase.from_pretrained`], and
        [`pyctcdecode.BeamSearchDecoderCTC.load_from_hf_hub`].

        Please refer to the docstrings of the methods above for more information.

        </Tip>

        Args:
            pretrained_model_name_or_path (`str` or `os.PathLike`):
                This can be either:

                - a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on
                hf-mirror.com.
                - a path to a *directory* containing a feature extractor file saved using the
                [`~SequenceFeatureExtractor.save_pretrained`] method, e.g., `./my_model_directory/`.
                - a path or url to a saved feature extractor JSON *file*, e.g.,
                `./my_model_directory/preprocessor_config.json`.
            **kwargs
                Additional keyword arguments passed along to both [`SequenceFeatureExtractor`] and
                [`PreTrainedTokenizer`]
        """
        requires_backends(cls, "pyctcdecode")
        from pyctcdecode import BeamSearchDecoderCTC

        feature_extractor, tokenizer = super()._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs) # pylint: disable=unbalanced-tuple-unpacking

        if os.path.isdir(pretrained_model_name_or_path) or os.path.isfile(pretrained_model_name_or_path):
            decoder = BeamSearchDecoderCTC.load_from_dir(pretrained_model_name_or_path)
        else:
            # BeamSearchDecoderCTC has no auto class
            kwargs.pop("_from_auto", None)
            # snapshot_download has no `trust_remote_code` flag
            kwargs.pop("trust_remote_code", None)

            # make sure that only relevant filenames are downloaded
            language_model_filenames = os.path.join(BeamSearchDecoderCTC._LANGUAGE_MODEL_SERIALIZED_DIRECTORY, "*")
            alphabet_filename = BeamSearchDecoderCTC._ALPHABET_SERIALIZED_FILENAME
            allow_patterns = [language_model_filenames, alphabet_filename]

            decoder = BeamSearchDecoderCTC.load_from_hf_hub(
                pretrained_model_name_or_path, allow_patterns=allow_patterns, **kwargs
            )

        # set language model attributes
        for attribute in ["alpha", "beta", "unk_score_offset", "score_boundary"]:
            value = kwargs.pop(attribute, None)

            if value is not None:
                cls._set_language_model_attribute(decoder, attribute, value)

        # make sure that decoder's alphabet and tokenizer's vocab match in content
        missing_decoder_tokens = cls.get_missing_alphabet_tokens(decoder, tokenizer)
        if len(missing_decoder_tokens) > 0:
            raise ValueError(
                f"The tokens {missing_decoder_tokens} are defined in the tokenizer's "
                "vocabulary, but not in the decoder's alphabet. "
                f"Make sure to include {missing_decoder_tokens} in the decoder's alphabet."
            )

        return cls(feature_extractor=feature_extractor, tokenizer=tokenizer, decoder=decoder)

    @staticmethod
    def _set_language_model_attribute(decoder: "BeamSearchDecoderCTC", attribute: str, value: float):
        """
        Sets the specified attribute of the language model within the Wav2Vec2ProcessorWithLM using the given decoder.

        Args:
            decoder (BeamSearchDecoderCTC): The decoder object used to access the language model.
            attribute (str): The name of the attribute to be set.
            value (float): The value to be assigned to the specified attribute.

        Returns:
            None.

        Raises:
            None.
        """
        setattr(decoder.model_container[decoder._model_key], attribute, value)

    @property
    def language_model(self):
        """
        This method returns the language model associated with the Wav2Vec2ProcessorWithLM instance.

        Args:
            self: The Wav2Vec2ProcessorWithLM instance.

        Returns:
            The language model stored in the decoder's model container.

        Raises:
            None.
        """
        return self.decoder.model_container[self.decoder._model_key]

    @staticmethod
    def get_missing_alphabet_tokens(decoder, tokenizer):
        """
        This method 'get_missing_alphabet_tokens' is defined in the class 'Wav2Vec2ProcessorWithLM' and is responsible
        for identifying missing alphabet tokens by comparing the tokenizer's vocabulary with the decoder's alphabet
        labels.

        Args:
            decoder (object): The decoder object used for decoding tokens. It should be of type 'Decoder' and is
                required as an input parameter for the method.
            tokenizer (object): The tokenizer object used for tokenizing input data. It should be of type 'Tokenizer'
                and is required as an input parameter for the method.

        Returns:
            set: This method returns a set of missing tokens from the tokenizer's vocabulary that are not present in
                the decoder's alphabet labels. If no missing tokens are found, it returns an empty set.

        Raises:
            No specific exceptions are documented to be raised by this method. However, potential exceptions may
            include AttributeError if the attributes accessed from the decoder or tokenizer objects do not exist,
            or TypeError if the input parameters are not of the expected types.
        """
        from pyctcdecode.alphabet import BLANK_TOKEN_PTN, UNK_TOKEN, UNK_TOKEN_PTN

        # we need to make sure that all of the tokenizer's tokens, except the special tokens,
        # are present in the decoder's alphabet. Retrieve the missing alphabet tokens
        # from the decoder
        tokenizer_vocab_list = list(tokenizer.get_vocab().keys())

        # replace special tokens
        for i, token in enumerate(tokenizer_vocab_list):
            if BLANK_TOKEN_PTN.match(token):
                tokenizer_vocab_list[i] = ""
            if token == tokenizer.word_delimiter_token:
                tokenizer_vocab_list[i] = " "
            if UNK_TOKEN_PTN.match(token):
                tokenizer_vocab_list[i] = UNK_TOKEN

        # are any of the extra tokens not special tokenizer tokens?
        missing_tokens = set(tokenizer_vocab_list) - set(decoder._alphabet.labels)

        return missing_tokens

    def __call__(self, *args, **kwargs):
        """
        When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor's
        [`~Wav2Vec2FeatureExtractor.__call__`] and returns its output. If used in the context
        [`~Wav2Vec2ProcessorWithLM.as_target_processor`] this method forwards all its arguments to
        Wav2Vec2CTCTokenizer's [`~Wav2Vec2CTCTokenizer.__call__`]. Please refer to the docstring of the above two
        methods for more information.
        """
        # For backward compatibility
        if self._in_target_context_manager:
            return self.current_processor(*args, **kwargs)

        if "raw_speech" in kwargs:
            warnings.warn("Using `raw_speech` as a keyword argument is deprecated. Use `audio` instead.")
            audio = kwargs.pop("raw_speech")
        else:
            audio = kwargs.pop("audio", None)
        sampling_rate = kwargs.pop("sampling_rate", None)
        text = kwargs.pop("text", None)
        if len(args) > 0:
            audio = args[0]
            args = args[1:]

        if audio is None and text is None:
            raise ValueError("You need to specify either an `audio` or `text` input to process.")

        if audio is not None:
            inputs = self.feature_extractor(audio, *args, sampling_rate=sampling_rate, **kwargs)
        if text is not None:
            encodings = self.tokenizer(text, **kwargs)

        if text is None:
            return inputs
        if audio is None:
            return encodings
        inputs["labels"] = encodings["input_ids"]
        return inputs

    def pad(self, *args, **kwargs):
        """
        When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor's
        [`~Wav2Vec2FeatureExtractor.pad`] and returns its output. If used in the context
        [`~Wav2Vec2ProcessorWithLM.as_target_processor`] this method forwards all its arguments to
        Wav2Vec2CTCTokenizer's [`~Wav2Vec2CTCTokenizer.pad`]. Please refer to the docstring of the above two methods
        for more information.
        """
        # For backward compatibility
        if self._in_target_context_manager:
            return self.current_processor.pad(*args, **kwargs)

        input_features = kwargs.pop("input_features", None)
        labels = kwargs.pop("labels", None)
        if len(args) > 0:
            input_features = args[0]
            args = args[1:]

        if input_features is not None:
            input_features = self.feature_extractor.pad(input_features, *args, **kwargs)
        if labels is not None:
            labels = self.tokenizer.pad(labels, **kwargs)

        if labels is None:
            return input_features
        if input_features is None:
            return labels
        input_features["labels"] = labels["input_ids"]
        return input_features

    def batch_decode(
        self,
        logits: np.ndarray,
        pool: Optional[Pool] = None,
        num_processes: Optional[int] = None,
        beam_width: Optional[int] = None,
        beam_prune_logp: Optional[float] = None,
        token_min_logp: Optional[float] = None,
        hotwords: Optional[Iterable[str]] = None,
        hotword_weight: Optional[float] = None,
        alpha: Optional[float] = None,
        beta: Optional[float] = None,
        unk_score_offset: Optional[float] = None,
        lm_score_boundary: Optional[bool] = None,
        output_word_offsets: bool = False,
        n_best: int = 1,
    ):
        """
        Batch decode output logits to audio transcription with language model support.

        <Tip>

        This function makes use of Python's multiprocessing. Currently, multiprocessing is available only on Unix
        systems (see this [issue](https://github.com/kensho-technologies/pyctcdecode/issues/65)).

        If you are decoding multiple batches, consider creating a `Pool` and passing it to `batch_decode`. Otherwise,
        `batch_decode` will be very slow since it will create a fresh `Pool` for each call. See usage example below.

        </Tip>

        Args:
            logits (`np.ndarray`):
                The logits output vector of the model representing the log probabilities for each token.
            pool (`multiprocessing.Pool`, *optional*):
                An optional user-managed pool. If not set, one will be automatically created and closed. The pool
                should be instantiated *after* `Wav2Vec2ProcessorWithLM`. Otherwise, the LM won't be available to the
                pool's sub-processes.

                <Tip>

                Currently, only pools created with a 'fork' context can be used. If a 'spawn' pool is passed, it will
                be ignored and sequential decoding will be used instead.

                </Tip>

            num_processes (`int`, *optional*):
                If `pool` is not set, number of processes on which the function should be parallelized over. Defaults
                to the number of available CPUs.
            beam_width (`int`, *optional*):
                Maximum number of beams at each step in decoding. Defaults to pyctcdecode's DEFAULT_BEAM_WIDTH.
            beam_prune_logp (`int`, *optional*):
                Beams that are much worse than the best beam will be pruned. Defaults to pyctcdecode's DEFAULT_PRUNE_LOGP.
            token_min_logp (`int`, *optional*):
                Tokens below this logp are skipped unless they are the argmax of the frame. Defaults to pyctcdecode's
                DEFAULT_MIN_TOKEN_LOGP.
            hotwords (`List[str]`, *optional*):
                List of words with extra importance; can be OOV for the LM.
            hotword_weight (`int`, *optional*):
                Weight factor for hotword importance. Defaults to pyctcdecode's DEFAULT_HOTWORD_WEIGHT.
            alpha (`float`, *optional*):
                Weight for language model during shallow fusion
            beta (`float`, *optional*):
                Weight for length score adjustment during scoring
            unk_score_offset (`float`, *optional*):
                Amount of log score offset for unknown tokens
            lm_score_boundary (`bool`, *optional*):
                Whether to have kenlm respect boundaries when scoring
            output_word_offsets (`bool`, *optional*, defaults to `False`):
                Whether or not to output word offsets. Word offsets can be used in combination with the sampling rate
                and model downsampling rate to compute the time-stamps of transcribed words.
            n_best (`int`, *optional*, defaults to `1`):
                Number of best hypotheses to return. If `n_best` is greater than 1, the returned `text` will be a list
                of lists of strings, `logit_score` will be a list of lists of floats, and `lm_score` will be a list of
                lists of floats, where the length of the outer list will correspond to the batch size and the length of
                the inner list will correspond to the number of returned hypotheses. The value should be >= 1.

                <Tip>

                Please take a look at the Example of [`~Wav2Vec2ProcessorWithLM.decode`] to better understand how to
                make use of `output_word_offsets`. [`~Wav2Vec2ProcessorWithLM.batch_decode`] works the same way with
                batched output.

                </Tip>

        Returns:
            [`~models.wav2vec2.Wav2Vec2DecoderWithLMOutput`].

        Example:
            See [Decoding multiple audios](#decoding-multiple-audios).
        """
        from pyctcdecode.constants import (
            DEFAULT_BEAM_WIDTH,
            DEFAULT_HOTWORD_WEIGHT,
            DEFAULT_MIN_TOKEN_LOGP,
            DEFAULT_PRUNE_LOGP,
        )

        # set defaults
        beam_width = beam_width if beam_width is not None else DEFAULT_BEAM_WIDTH
        beam_prune_logp = beam_prune_logp if beam_prune_logp is not None else DEFAULT_PRUNE_LOGP
        token_min_logp = token_min_logp if token_min_logp is not None else DEFAULT_MIN_TOKEN_LOGP
        hotword_weight = hotword_weight if hotword_weight is not None else DEFAULT_HOTWORD_WEIGHT

        # reset params at every forward call. It's just a `set` method in pyctcdecode
        self.decoder.reset_params(
            alpha=alpha, beta=beta, unk_score_offset=unk_score_offset, lm_score_boundary=lm_score_boundary
        )

        # create multiprocessing pool and list numpy arrays
        # filter out logits padding
        logits_list = [array[(array != -100.0).all(axis=-1)] for array in logits]

        # create a pool if necessary while also using it as a context manager to close itself
        if pool is None:
            # fork is safe to use only on Unix, see "Contexts and start methods" section on
            # multiprocessing's docs (https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods)
            default_context = get_start_method()

            if default_context == "fork":
                cm = pool = get_context().Pool(num_processes)
            else:
                logger.warning(
                    "Parallel batch decoding is not currently supported in this platform. "
                    "Falling back to sequential decoding."
                )
                cm = nullcontext()
        else:
            # pool is managed by the user, so we don't need to close it
            cm = nullcontext()

            if num_processes is not None:
                logger.warning(
                    "Parameter `num_process` was passed, but it will be ignored since `pool` was also specified."
                )

        # pyctcdecode
        with cm:
            decoded_beams = self.decoder.decode_beams_batch(
                pool=pool,
                logits_list=logits_list,
                beam_width=beam_width,
                beam_prune_logp=beam_prune_logp,
                token_min_logp=token_min_logp,
                hotwords=hotwords,
                hotword_weight=hotword_weight,
            )

        # extract text and scores
        batch_texts, logit_scores, lm_scores, word_offsets = [], [], [], []

        for d in decoded_beams:
            batch_texts.append([beam[0] for beam in d])
            logit_scores.append([beam[-2] for beam in d])
            lm_scores.append([beam[-1] for beam in d])

            # word_offsets.append([{"word": t[0], "start_offset": t[1][0], "end_offset": t[1][1]} for t in d[0][1]])

            word_offsets.append(
                [
                    [
                        {"word": word, "start_offset": start_offset, "end_offset": end_offset}
                        for word, (start_offset, end_offset) in beam[1]
                    ]
                    for beam in d
                ]
            )

        word_offsets = word_offsets if output_word_offsets else None

        if n_best == 1:
            return Wav2Vec2DecoderWithLMOutput(
                text=[hyps[0] for hyps in batch_texts],
                logit_score=[hyps[0] for hyps in logit_scores],
                lm_score=[hyps[0] for hyps in lm_scores],
                word_offsets=[hyps[0] for hyps in word_offsets] if word_offsets is not None else None,
            )
        return Wav2Vec2DecoderWithLMOutput(
            text=[hyps[:n_best] for hyps in batch_texts],
            logit_score=[hyps[:n_best] for hyps in logit_scores],
            lm_score=[hyps[:n_best] for hyps in lm_scores],
            word_offsets=[hyps[:n_best] for hyps in word_offsets] if word_offsets is not None else None,
        )

    def decode(
        self,
        logits: np.ndarray,
        beam_width: Optional[int] = None,
        beam_prune_logp: Optional[float] = None,
        token_min_logp: Optional[float] = None,
        hotwords: Optional[Iterable[str]] = None,
        hotword_weight: Optional[float] = None,
        alpha: Optional[float] = None,
        beta: Optional[float] = None,
        unk_score_offset: Optional[float] = None,
        lm_score_boundary: Optional[bool] = None,
        output_word_offsets: bool = False,
        n_best: int = 1,
    ):
        """
        Decode output logits to audio transcription with language model support.

        Args:
            logits (`np.ndarray`):
                The logits output vector of the model representing the log probabilities for each token.
            beam_width (`int`, *optional*):
                Maximum number of beams at each step in decoding. Defaults to pyctcdecode's DEFAULT_BEAM_WIDTH.
            beam_prune_logp (`int`, *optional*):
                A threshold to prune beams with log-probs less than best_beam_logp + beam_prune_logp. The value should
                be <= 0. Defaults to pyctcdecode's DEFAULT_PRUNE_LOGP.
            token_min_logp (`int`, *optional*):
                Tokens with log-probs below token_min_logp are skipped unless they have the maximum log-prob for an
                utterance. Defaults to pyctcdecode's DEFAULT_MIN_TOKEN_LOGP.
            hotwords (`List[str]`, *optional*):
                List of words with extra importance which can be missing from the LM's vocabulary, e.g. ["huggingface"]
            hotword_weight (`int`, *optional*):
                Weight multiplier that boosts hotword scores. Defaults to pyctcdecode's DEFAULT_HOTWORD_WEIGHT.
            alpha (`float`, *optional*):
                Weight for language model during shallow fusion
            beta (`float`, *optional*):
                Weight for length score adjustment during scoring
            unk_score_offset (`float`, *optional*):
                Amount of log score offset for unknown tokens
            lm_score_boundary (`bool`, *optional*):
                Whether to have kenlm respect boundaries when scoring
            output_word_offsets (`bool`, *optional*, defaults to `False`):
                Whether or not to output word offsets. Word offsets can be used in combination with the sampling rate
                and model downsampling rate to compute the time-stamps of transcribed words.
            n_best (`int`, *optional*, defaults to `1`):
                Number of best hypotheses to return. If `n_best` is greater than 1, the returned `text` will be a list
                of strings, `logit_score` will be a list of floats, and `lm_score` will be a list of floats, where the
                length of these lists will correspond to the number of returned hypotheses. The value should be >= 1.

                <Tip>

                Please take a look at the example below to better understand how to make use of `output_word_offsets`.

                </Tip>

        Returns:
            [`~models.wav2vec2.Wav2Vec2DecoderWithLMOutput`].

        Example:
            ```python
            >>> # Let's see how to retrieve time steps for a model
            >>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC
            >>> from datasets import load_dataset
            >>> import datasets
            >>> import torch
            ...
            >>> # import model, feature extractor, tokenizer
            >>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
            >>> processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
            ...
            >>> # load first sample of English common_voice
            >>> dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train", streaming=True)
            >>> dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
            >>> dataset_iter = iter(dataset)
            >>> sample = next(dataset_iter)
            ...
            >>> # forward sample through model to get greedily predicted transcription ids
            >>> input_values = processor(sample["audio"]["array"], return_tensors="pt").input_values
            >>> with torch.no_grad():
            ...     logits = model(input_values).logits[0].cpu().numpy()
            ...
            >>> # retrieve word stamps (analogous commands for `output_char_offsets`)
            >>> outputs = processor.decode(logits, output_word_offsets=True)
            >>> # compute `time_offset` in seconds as product of downsampling ratio and sampling_rate
            >>> time_offset = model.config.inputs_to_logits_ratio / processor.feature_extractor.sampling_rate
            ...
            >>> word_offsets = [
            ...     {
            ...         "word": d["word"],
            ...         "start_time": round(d["start_offset"] * time_offset, 2),
            ...         "end_time": round(d["end_offset"] * time_offset, 2),
            ...     }
            ...     for d in outputs.word_offsets
            ... ]
            >>> # compare word offsets with audio `en_train_0/common_voice_en_19121553.mp3` online on the dataset viewer:
            >>> # https://hf-mirror.com/datasets/mozilla-foundation/common_voice_11_0/viewer/en
            >>> word_offsets[:4]
            [{'word': 'THE', 'start_time': 0.68, 'end_time': 0.78},
             {'word': 'TRACK', 'start_time': 0.88, 'end_time': 1.1},
             {'word': 'APPEARS', 'start_time': 1.18, 'end_time': 1.66},
             {'word': 'ON', 'start_time': 1.86, 'end_time': 1.92}]
            ```
        """
        from pyctcdecode.constants import (
            DEFAULT_BEAM_WIDTH,
            DEFAULT_HOTWORD_WEIGHT,
            DEFAULT_MIN_TOKEN_LOGP,
            DEFAULT_PRUNE_LOGP,
        )

        # set defaults
        beam_width = beam_width if beam_width is not None else DEFAULT_BEAM_WIDTH
        beam_prune_logp = beam_prune_logp if beam_prune_logp is not None else DEFAULT_PRUNE_LOGP
        token_min_logp = token_min_logp if token_min_logp is not None else DEFAULT_MIN_TOKEN_LOGP
        hotword_weight = hotword_weight if hotword_weight is not None else DEFAULT_HOTWORD_WEIGHT

        # reset params at every forward call. It's just a `set` method in pyctcdecode
        self.decoder.reset_params(
            alpha=alpha, beta=beta, unk_score_offset=unk_score_offset, lm_score_boundary=lm_score_boundary
        )

        # pyctcdecode
        decoded_beams = self.decoder.decode_beams(
            logits,
            beam_width=beam_width,
            beam_prune_logp=beam_prune_logp,
            token_min_logp=token_min_logp,
            hotwords=hotwords,
            hotword_weight=hotword_weight,
        )

        word_offsets = None
        if output_word_offsets:
            word_offsets = [
                [
                    {"word": word, "start_offset": start_offset, "end_offset": end_offset}
                    for word, (start_offset, end_offset) in beam[2]
                ]
                for beam in decoded_beams
            ]
        logit_scores = [beam[-2] for beam in decoded_beams]

        lm_scores = [beam[-1] for beam in decoded_beams]

        hypotheses = [beam[0] for beam in decoded_beams]

        if n_best > len(decoded_beams):
            logger.info(
                "N-best size is larger than the number of generated hypotheses, all hypotheses will be returned."
            )

        if n_best == 1:
            return Wav2Vec2DecoderWithLMOutput(
                text=hypotheses[0],
                logit_score=logit_scores[0],
                lm_score=lm_scores[0],
                word_offsets=word_offsets[0] if word_offsets is not None else None,
            )
        return Wav2Vec2DecoderWithLMOutput(
            text=hypotheses[:n_best],
            logit_score=logit_scores[:n_best],
            lm_score=lm_scores[:n_best],
            word_offsets=word_offsets[:n_best] if word_offsets is not None else None,
        )

    @contextmanager
    def as_target_processor(self):
        """
        Temporarily sets the processor for processing the target. Useful for encoding the labels when fine-tuning
        Wav2Vec2.
        """
        warnings.warn(
            "`as_target_processor` is deprecated and will be removed in v5 of Transformers. You can process your "
            "labels by using the argument `text` of the regular `__call__` method (either in the same call as "
            "your audio inputs, or in a separate call."
        )
        self._in_target_context_manager = True
        self.current_processor = self.tokenizer
        yield
        self.current_processor = self.feature_extractor
        self._in_target_context_manager = False

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2ProcessorWithLM.language_model property

This method returns the language model associated with the Wav2Vec2ProcessorWithLM instance.

PARAMETER DESCRIPTION
self

The Wav2Vec2ProcessorWithLM instance.

RETURNS DESCRIPTION

The language model stored in the decoder's model container.
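
A small sketch of inspecting the wrapped language model (hedged; the returned object is whatever pyctcdecode stored in the decoder's model container, typically a kenlm-backed model exposing the attributes set in [`~Wav2Vec2ProcessorWithLM.from_pretrained`]):

>>> lm = processor.language_model
>>> lm.alpha, lm.beta  # shallow-fusion and length-score weights; can also be overridden per decode() call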

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2ProcessorWithLM.__call__(*args, **kwargs)

When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor's [~Wav2Vec2FeatureExtractor.__call__] and returns its output. If used in the context [~Wav2Vec2ProcessorWithLM.as_target_processor] this method forwards all its arguments to Wav2Vec2CTCTokenizer's [~Wav2Vec2CTCTokenizer.__call__]. Please refer to the docstring of the above two methods for more information.

Source code in mindnlp/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
def __call__(self, *args, **kwargs):
    """
    When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor's
    [`~Wav2Vec2FeatureExtractor.__call__`] and returns its output. If used in the context
    [`~Wav2Vec2ProcessorWithLM.as_target_processor`] this method forwards all its arguments to
    Wav2Vec2CTCTokenizer's [`~Wav2Vec2CTCTokenizer.__call__`]. Please refer to the docstring of the above two
    methods for more information.
    """
    # For backward compatibility
    if self._in_target_context_manager:
        return self.current_processor(*args, **kwargs)

    if "raw_speech" in kwargs:
        warnings.warn("Using `raw_speech` as a keyword argument is deprecated. Use `audio` instead.")
        audio = kwargs.pop("raw_speech")
    else:
        audio = kwargs.pop("audio", None)
    sampling_rate = kwargs.pop("sampling_rate", None)
    text = kwargs.pop("text", None)
    if len(args) > 0:
        audio = args[0]
        args = args[1:]

    if audio is None and text is None:
        raise ValueError("You need to specify either an `audio` or `text` input to process.")

    if audio is not None:
        inputs = self.feature_extractor(audio, *args, sampling_rate=sampling_rate, **kwargs)
    if text is not None:
        encodings = self.tokenizer(text, **kwargs)

    if text is None:
        return inputs
    if audio is None:
        return encodings
    inputs["labels"] = encodings["input_ids"]
    return inputs
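
A hedged sketch of the two call modes described above; `audio` is a placeholder waveform, and passing both `audio` and `text` attaches the tokenized text as `labels`:

>>> import numpy as np
>>> audio = np.zeros(16_000, dtype=np.float32)  # hypothetical one second of silence at 16 kHz
>>> inputs = processor(audio, sampling_rate=16_000)                     # feature-extractor path
>>> batch = processor(audio, sampling_rate=16_000, text="hello world")  # adds "labels" from the tokenizer
>>> list(batch.keys())  # typically includes "input_values" and "labels"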

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2ProcessorWithLM.__init__(feature_extractor, tokenizer, decoder)

Initializes a Wav2Vec2ProcessorWithLM object.

PARAMETER DESCRIPTION
self

The object instance.

feature_extractor

The feature extractor used for processing input audio data.

TYPE: FeatureExtractionMixin

tokenizer

The tokenizer used for tokenizing input text data.

TYPE: PreTrainedTokenizerBase

decoder

The decoder used for decoding the model's output.

TYPE: BeamSearchDecoderCTC

RETURNS DESCRIPTION

None. This method does not return any value.

RAISES DESCRIPTION
ValueError

If the provided 'decoder' parameter is not an instance of BeamSearchDecoderCTC.

ValueError

If there are missing tokens in the decoder's alphabet that are present in the tokenizer's vocabulary.

Source code in mindnlp/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
def __init__(
    self,
    feature_extractor: "FeatureExtractionMixin",
    tokenizer: "PreTrainedTokenizerBase",
    decoder: "BeamSearchDecoderCTC",
):
    """
    Initializes a Wav2Vec2ProcessorWithLM object.

    Args:
        self: The object instance.
        feature_extractor (FeatureExtractionMixin): The feature extractor used for processing input audio data.
        tokenizer (PreTrainedTokenizerBase): The tokenizer used for tokenizing input text data.
        decoder (BeamSearchDecoderCTC): The decoder used for decoding the model's output.

    Returns:
        None. This method does not return any value.

    Raises:
        ValueError: If the provided 'decoder' parameter is not an instance of BeamSearchDecoderCTC.
        ValueError: If there are missing tokens in the decoder's alphabet that are present in the tokenizer's
            vocabulary.
    """
    from pyctcdecode import BeamSearchDecoderCTC

    super().__init__(feature_extractor, tokenizer)
    if not isinstance(decoder, BeamSearchDecoderCTC):
        raise ValueError(f"`decoder` has to be of type {BeamSearchDecoderCTC.__class__}, but is {type(decoder)}")

    # make sure that decoder's alphabet and tokenizer's vocab match in content
    missing_decoder_tokens = self.get_missing_alphabet_tokens(decoder, tokenizer)
    if len(missing_decoder_tokens) > 0:
        raise ValueError(
            f"The tokens {missing_decoder_tokens} are defined in the tokenizer's "
            "vocabulary, but not in the decoder's alphabet. "
            f"Make sure to include {missing_decoder_tokens} in the decoder's alphabet."
        )

    self.decoder = decoder
    self.current_processor = self.feature_extractor
    self._in_target_context_manager = False
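
A hedged sketch of manual construction from the three components, following the common pyctcdecode recipe of building the decoder from the tokenizer's vocabulary with `build_ctcdecoder` (assuming the tokenizer and feature-extractor classes are exported from `mindnlp.transformers`; the checkpoint name and kenlm path are placeholders, and details such as label ordering or lowercasing depend on the tokenizer and the LM):

>>> from pyctcdecode import build_ctcdecoder
>>> from mindnlp.transformers import Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor, Wav2Vec2ProcessorWithLM
>>> tokenizer = Wav2Vec2CTCTokenizer.from_pretrained("facebook/wav2vec2-base-960h")
>>> feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
>>> vocab = tokenizer.get_vocab()
>>> labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]  # order labels by token id
>>> decoder = build_ctcdecoder(labels=labels, kenlm_model_path="path/to/lm.arpa")
>>> processor = Wav2Vec2ProcessorWithLM(feature_extractor=feature_extractor, tokenizer=tokenizer, decoder=decoder)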

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2ProcessorWithLM.as_target_processor()

Temporarily sets the processor for processing the target. Useful for encoding the labels when fine-tuning Wav2Vec2.

Source code in mindnlp/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
@contextmanager
def as_target_processor(self):
    """
    Temporarily sets the processor for processing the target. Useful for encoding the labels when fine-tuning
    Wav2Vec2.
    """
    warnings.warn(
        "`as_target_processor` is deprecated and will be removed in v5 of Transformers. You can process your "
        "labels by using the argument `text` of the regular `__call__` method (either in the same call as "
        "your audio inputs, or in a separate call."
    )
    self._in_target_context_manager = True
    self.current_processor = self.tokenizer
    yield
    self.current_processor = self.feature_extractor
    self._in_target_context_manager = False
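
Since the context manager is deprecated, a hedged sketch contrasting it with the recommended replacement of passing `text` directly to `__call__` (the transcription string is a placeholder):

>>> # deprecated pattern
>>> with processor.as_target_processor():
...     labels = processor("hello world").input_ids
>>> # recommended pattern
>>> labels = processor(text="hello world").input_ids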

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2ProcessorWithLM.batch_decode(logits, pool=None, num_processes=None, beam_width=None, beam_prune_logp=None, token_min_logp=None, hotwords=None, hotword_weight=None, alpha=None, beta=None, unk_score_offset=None, lm_score_boundary=None, output_word_offsets=False, n_best=1)

Batch decode output logits to audio transcription with language model support.

This function makes use of Python's multiprocessing. Currently, multiprocessing is available only on Unix systems (see this issue).

If you are decoding multiple batches, consider creating a Pool and passing it to batch_decode. Otherwise, batch_decode will be very slow since it will create a fresh Pool for each call. See usage example below.

PARAMETER DESCRIPTION
logits

The logits output vector of the model representing the log probabilities for each token.

TYPE: `np.ndarray`

pool

An optional user-managed pool. If not set, one will be automatically created and closed. The pool should be instantiated after Wav2Vec2ProcessorWithLM. Otherwise, the LM won't be available to the pool's sub-processes.

Currently, only pools created with a 'fork' context can be used. If a 'spawn' pool is passed, it will be ignored and sequential decoding will be used instead.

TYPE: `multiprocessing.Pool`, *optional* DEFAULT: None

num_processes

If pool is not set, number of processes on which the function should be parallelized over. Defaults to the number of available CPUs.

TYPE: `int`, *optional* DEFAULT: None

beam_width

Maximum number of beams at each step in decoding. Defaults to pyctcdecode's DEFAULT_BEAM_WIDTH.

TYPE: `int`, *optional* DEFAULT: None

beam_prune_logp

Beams that are much worse than the best beam will be pruned. Defaults to pyctcdecode's DEFAULT_PRUNE_LOGP.

TYPE: `int`, *optional* DEFAULT: None

token_min_logp

Tokens below this logp are skipped unless they are the argmax of the frame. Defaults to pyctcdecode's DEFAULT_MIN_TOKEN_LOGP.

TYPE: `int`, *optional* DEFAULT: None

hotwords

List of words with extra importance; can be OOV for the LM.

TYPE: `List[str]`, *optional* DEFAULT: None

hotword_weight

Weight factor for hotword importance. Defaults to pyctcdecode's DEFAULT_HOTWORD_WEIGHT.

TYPE: `int`, *optional* DEFAULT: None

alpha

Weight for language model during shallow fusion

TYPE: `float`, *optional* DEFAULT: None

beta

Weight for length score adjustment during scoring

TYPE: `float`, *optional* DEFAULT: None

unk_score_offset

Amount of log score offset for unknown tokens

TYPE: `float`, *optional* DEFAULT: None

lm_score_boundary

Whether to have kenlm respect boundaries when scoring

TYPE: `bool`, *optional* DEFAULT: None

output_word_offsets

Whether or not to output word offsets. Word offsets can be used in combination with the sampling rate and model downsampling rate to compute the time-stamps of transcribed words.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

n_best

Number of best hypotheses to return. If n_best is greater than 1, the returned text will be a list of lists of strings, logit_score will be a list of lists of floats, and lm_score will be a list of lists of floats, where the length of the outer list will correspond to the batch size and the length of the inner list will correspond to the number of returned hypotheses. The value should be >= 1.

Please take a look at the Example of [~Wav2Vec2ProcessorWithLM.decode] to better understand how to make use of output_word_offsets. [~Wav2Vec2ProcessorWithLM.batch_decode] works the same way with batched output.

TYPE: `int`, *optional*, defaults to `1` DEFAULT: 1

RETURNS DESCRIPTION

[~models.wav2vec2.Wav2Vec2DecoderWithLMOutput].

Example

See Decoding multiple audios.
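
A hedged sketch of reusing one user-managed Pool across several calls, as the tip above recommends (the pool must use the 'fork' start method and be created after the processor; all_logits is a placeholder iterable of per-batch np.ndarray logits):

>>> from multiprocessing import get_context
>>> with get_context("fork").Pool(processes=4) as pool:
...     transcriptions = [processor.batch_decode(logits, pool=pool).text for logits in all_logits]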

Source code in mindnlp/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
def batch_decode(
    self,
    logits: np.ndarray,
    pool: Optional[Pool] = None,
    num_processes: Optional[int] = None,
    beam_width: Optional[int] = None,
    beam_prune_logp: Optional[float] = None,
    token_min_logp: Optional[float] = None,
    hotwords: Optional[Iterable[str]] = None,
    hotword_weight: Optional[float] = None,
    alpha: Optional[float] = None,
    beta: Optional[float] = None,
    unk_score_offset: Optional[float] = None,
    lm_score_boundary: Optional[bool] = None,
    output_word_offsets: bool = False,
    n_best: int = 1,
):
    """
    Batch decode output logits to audio transcription with language model support.

    <Tip>

    This function makes use of Python's multiprocessing. Currently, multiprocessing is available only on Unix
    systems (see this [issue](https://github.com/kensho-technologies/pyctcdecode/issues/65)).

    If you are decoding multiple batches, consider creating a `Pool` and passing it to `batch_decode`. Otherwise,
    `batch_decode` will be very slow since it will create a fresh `Pool` for each call. See usage example below.

    </Tip>

    Args:
        logits (`np.ndarray`):
            The logits output vector of the model representing the log probabilities for each token.
        pool (`multiprocessing.Pool`, *optional*):
            An optional user-managed pool. If not set, one will be automatically created and closed. The pool
            should be instantiated *after* `Wav2Vec2ProcessorWithLM`. Otherwise, the LM won't be available to the
            pool's sub-processes.

            <Tip>

            Currently, only pools created with a 'fork' context can be used. If a 'spawn' pool is passed, it will
            be ignored and sequential decoding will be used instead.

            </Tip>

        num_processes (`int`, *optional*):
            If `pool` is not set, number of processes on which the function should be parallelized over. Defaults
            to the number of available CPUs.
        beam_width (`int`, *optional*):
            Maximum number of beams at each step in decoding. Defaults to pyctcdecode's DEFAULT_BEAM_WIDTH.
        beam_prune_logp (`int`, *optional*):
            Beams that are much worse than the best beam will be pruned. Defaults to pyctcdecode's DEFAULT_PRUNE_LOGP.
        token_min_logp (`int`, *optional*):
            Tokens below this logp are skipped unless they are the argmax of the frame. Defaults to pyctcdecode's
            DEFAULT_MIN_TOKEN_LOGP.
        hotwords (`List[str]`, *optional*):
            List of words with extra importance; can be OOV for the LM.
        hotword_weight (`int`, *optional*):
            Weight factor for hotword importance. Defaults to pyctcdecode's DEFAULT_HOTWORD_WEIGHT.
        alpha (`float`, *optional*):
            Weight for language model during shallow fusion
        beta (`float`, *optional*):
            Weight for length score adjustment during scoring
        unk_score_offset (`float`, *optional*):
            Amount of log score offset for unknown tokens
        lm_score_boundary (`bool`, *optional*):
            Whether to have kenlm respect boundaries when scoring
        output_word_offsets (`bool`, *optional*, defaults to `False`):
            Whether or not to output word offsets. Word offsets can be used in combination with the sampling rate
            and model downsampling rate to compute the time-stamps of transcribed words.
        n_best (`int`, *optional*, defaults to `1`):
            Number of best hypotheses to return. If `n_best` is greater than 1, the returned `text` will be a list
            of lists of strings, `logit_score` will be a list of lists of floats, and `lm_score` will be a list of
            lists of floats, where the length of the outer list will correspond to the batch size and the length of
            the inner list will correspond to the number of returned hypotheses. The value should be >= 1.

            <Tip>

            Please take a look at the Example of [`~Wav2Vec2ProcessorWithLM.decode`] to better understand how to
            make use of `output_word_offsets`. [`~Wav2Vec2ProcessorWithLM.batch_decode`] works the same way with
            batched output.

            </Tip>

    Returns:
        [`~models.wav2vec2.Wav2Vec2DecoderWithLMOutput`].

    Example:
        See [Decoding multiple audios](#decoding-multiple-audios).
    """
    from pyctcdecode.constants import (
        DEFAULT_BEAM_WIDTH,
        DEFAULT_HOTWORD_WEIGHT,
        DEFAULT_MIN_TOKEN_LOGP,
        DEFAULT_PRUNE_LOGP,
    )

    # set defaults
    beam_width = beam_width if beam_width is not None else DEFAULT_BEAM_WIDTH
    beam_prune_logp = beam_prune_logp if beam_prune_logp is not None else DEFAULT_PRUNE_LOGP
    token_min_logp = token_min_logp if token_min_logp is not None else DEFAULT_MIN_TOKEN_LOGP
    hotword_weight = hotword_weight if hotword_weight is not None else DEFAULT_HOTWORD_WEIGHT

    # reset params at every forward call. It's just a `set` method in pyctcdecode
    self.decoder.reset_params(
        alpha=alpha, beta=beta, unk_score_offset=unk_score_offset, lm_score_boundary=lm_score_boundary
    )

    # create multiprocessing pool and list numpy arrays
    # filter out logits padding
    logits_list = [array[(array != -100.0).all(axis=-1)] for array in logits]

    # create a pool if necessary while also using it as a context manager to close itself
    if pool is None:
        # fork is safe to use only on Unix, see "Contexts and start methods" section on
        # multiprocessing's docs (https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods)
        default_context = get_start_method()

        if default_context == "fork":
            cm = pool = get_context().Pool(num_processes)
        else:
            logger.warning(
                "Parallel batch decoding is not currently supported in this platform. "
                "Falling back to sequential decoding."
            )
            cm = nullcontext()
    else:
        # pool is managed by the user, so we don't need to close it
        cm = nullcontext()

        if num_processes is not None:
            logger.warning(
                "Parameter `num_process` was passed, but it will be ignored since `pool` was also specified."
            )

    # pyctcdecode
    with cm:
        decoded_beams = self.decoder.decode_beams_batch(
            pool=pool,
            logits_list=logits_list,
            beam_width=beam_width,
            beam_prune_logp=beam_prune_logp,
            token_min_logp=token_min_logp,
            hotwords=hotwords,
            hotword_weight=hotword_weight,
        )

    # extract text and scores
    batch_texts, logit_scores, lm_scores, word_offsets = [], [], [], []

    for d in decoded_beams:
        batch_texts.append([beam[0] for beam in d])
        logit_scores.append([beam[-2] for beam in d])
        lm_scores.append([beam[-1] for beam in d])

        # word_offsets.append([{"word": t[0], "start_offset": t[1][0], "end_offset": t[1][1]} for t in d[0][1]])

        word_offsets.append(
            [
                [
                    {"word": word, "start_offset": start_offset, "end_offset": end_offset}
                    for word, (start_offset, end_offset) in beam[1]
                ]
                for beam in d
            ]
        )

    word_offsets = word_offsets if output_word_offsets else None

    if n_best == 1:
        return Wav2Vec2DecoderWithLMOutput(
            text=[hyps[0] for hyps in batch_texts],
            logit_score=[hyps[0] for hyps in logit_scores],
            lm_score=[hyps[0] for hyps in lm_scores],
            word_offsets=[hyps[0] for hyps in word_offsets] if word_offsets is not None else None,
        )
    return Wav2Vec2DecoderWithLMOutput(
        text=[hyps[:n_best] for hyps in batch_texts],
        logit_score=[hyps[:n_best] for hyps in logit_scores],
        lm_score=[hyps[:n_best] for hyps in lm_scores],
        word_offsets=[hyps[:n_best] for hyps in word_offsets] if word_offsets is not None else None,
    )

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2ProcessorWithLM.decode(logits, beam_width=None, beam_prune_logp=None, token_min_logp=None, hotwords=None, hotword_weight=None, alpha=None, beta=None, unk_score_offset=None, lm_score_boundary=None, output_word_offsets=False, n_best=1)

Decode output logits to audio transcription with language model support.

PARAMETER DESCRIPTION
logits

The logits output vector of the model representing the log probabilities for each token.

TYPE: `np.ndarray`

beam_width

Maximum number of beams at each step in decoding. Defaults to pyctcdecode's DEFAULT_BEAM_WIDTH.

TYPE: `int`, *optional* DEFAULT: None

beam_prune_logp

A threshold to prune beams with log-probs less than best_beam_logp + beam_prune_logp. The value should be <= 0. Defaults to pyctcdecode's DEFAULT_PRUNE_LOGP.

TYPE: `int`, *optional* DEFAULT: None

token_min_logp

Tokens with log-probs below token_min_logp are skipped unless they have the maximum log-prob for an utterance. Defaults to pyctcdecode's DEFAULT_MIN_TOKEN_LOGP.

TYPE: `int`, *optional* DEFAULT: None

hotwords

List of words with extra importance which can be missing from the LM's vocabulary, e.g. ["huggingface"]

TYPE: `List[str]`, *optional* DEFAULT: None

hotword_weight

Weight multiplier that boosts hotword scores. Defaults to pyctcdecode's DEFAULT_HOTWORD_WEIGHT.

TYPE: `int`, *optional* DEFAULT: None

alpha

Weight for language model during shallow fusion

TYPE: `float`, *optional* DEFAULT: None

beta

Weight for length score adjustment during scoring

TYPE: `float`, *optional* DEFAULT: None

unk_score_offset

Amount of log score offset for unknown tokens

TYPE: `float`, *optional* DEFAULT: None

lm_score_boundary

Whether to have kenlm respect boundaries when scoring

TYPE: `bool`, *optional* DEFAULT: None

output_word_offsets

Whether or not to output word offsets. Word offsets can be used in combination with the sampling rate and model downsampling rate to compute the time-stamps of transcribed words.

TYPE: `bool`, *optional*, defaults to `False` DEFAULT: False

n_best

Number of best hypotheses to return. If n_best is greater than 1, the returned text will be a list of strings, logit_score will be a list of floats, and lm_score will be a list of floats, where the length of these lists will correspond to the number of returned hypotheses. The value should be >= 1.

Please take a look at the example below to better understand how to make use of output_word_offsets.

TYPE: `int`, *optional*, defaults to `1` DEFAULT: 1

RETURNS DESCRIPTION

[~models.wav2vec2.Wav2Vec2DecoderWithLMOutput].

Example
>>> # Let's see how to retrieve time steps for a model
>>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC
>>> from datasets import load_dataset
>>> import datasets
>>> import torch
...
>>> # import model, feature extractor, tokenizer
>>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
>>> processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
...
>>> # load first sample of English common_voice
>>> dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train", streaming=True)
>>> dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
>>> dataset_iter = iter(dataset)
>>> sample = next(dataset_iter)
...
>>> # forward sample through model to get greedily predicted transcription ids
>>> input_values = processor(sample["audio"]["array"], return_tensors="pt").input_values
>>> with torch.no_grad():
...     logits = model(input_values).logits[0].cpu().numpy()
...
>>> # retrieve word stamps (analogous commands for `output_char_offsets`)
>>> outputs = processor.decode(logits, output_word_offsets=True)
>>> # compute `time_offset` in seconds as product of downsampling ratio and sampling_rate
>>> time_offset = model.config.inputs_to_logits_ratio / processor.feature_extractor.sampling_rate
...
>>> word_offsets = [
...     {
...         "word": d["word"],
...         "start_time": round(d["start_offset"] * time_offset, 2),
...         "end_time": round(d["end_offset"] * time_offset, 2),
...     }
...     for d in outputs.word_offsets
... ]
>>> # compare word offsets with audio `en_train_0/common_voice_en_19121553.mp3` online on the dataset viewer:
>>> # https://hf-mirror.com/datasets/mozilla-foundation/common_voice_11_0/viewer/en
>>> word_offsets[:4]
[{'word': 'THE', 'start_time': 0.68, 'end_time': 0.78},
 {'word': 'TRACK', 'start_time': 0.88, 'end_time': 1.1},
 {'word': 'APPEARS', 'start_time': 1.18, 'end_time': 1.66},
 {'word': 'ON', 'start_time': 1.86, 'end_time': 1.92}]
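
The example above covers `output_word_offsets`; the sketch below is illustrative only and reuses the `processor` and `logits` from that example to show how `n_best` changes the return type:

>>> # with n_best > 1 each output field becomes a list of hypotheses
>>> outputs = processor.decode(logits, n_best=3)
>>> best_text = outputs.text[0]              # best transcription
>>> best_logit_score = outputs.logit_score[0]
>>> best_lm_score = outputs.lm_score[0]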
Source code in mindnlp/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
def decode(
    self,
    logits: np.ndarray,
    beam_width: Optional[int] = None,
    beam_prune_logp: Optional[float] = None,
    token_min_logp: Optional[float] = None,
    hotwords: Optional[Iterable[str]] = None,
    hotword_weight: Optional[float] = None,
    alpha: Optional[float] = None,
    beta: Optional[float] = None,
    unk_score_offset: Optional[float] = None,
    lm_score_boundary: Optional[bool] = None,
    output_word_offsets: bool = False,
    n_best: int = 1,
):
    """
    Decode output logits to audio transcription with language model support.

    Args:
        logits (`np.ndarray`):
            The logits output vector of the model representing the log probabilities for each token.
        beam_width (`int`, *optional*):
            Maximum number of beams at each step in decoding. Defaults to pyctcdecode's DEFAULT_BEAM_WIDTH.
        beam_prune_logp (`float`, *optional*):
            A threshold to prune beams with log-probs less than best_beam_logp + beam_prune_logp. The value should
            be <= 0. Defaults to pyctcdecode's DEFAULT_PRUNE_LOGP.
        token_min_logp (`float`, *optional*):
            Tokens with log-probs below token_min_logp are skipped unless they have the maximum log-prob for an
            utterance. Defaults to pyctcdecode's DEFAULT_MIN_TOKEN_LOGP.
        hotwords (`List[str]`, *optional*):
            List of words with extra importance which can be missing from the LM's vocabulary, e.g. ["huggingface"]
        hotword_weight (`float`, *optional*):
            Weight multiplier that boosts hotword scores. Defaults to pyctcdecode's DEFAULT_HOTWORD_WEIGHT.
        alpha (`float`, *optional*):
            Weight for language model during shallow fusion
        beta (`float`, *optional*):
            Weight for length score adjustment during scoring
        unk_score_offset (`float`, *optional*):
            Amount of log score offset for unknown tokens
        lm_score_boundary (`bool`, *optional*):
            Whether to have kenlm respect boundaries when scoring
        output_word_offsets (`bool`, *optional*, defaults to `False`):
            Whether or not to output word offsets. Word offsets can be used in combination with the sampling rate
            and model downsampling rate to compute the time-stamps of transcribed words.
        n_best (`int`, *optional*, defaults to `1`):
            Number of best hypotheses to return. If `n_best` is greater than 1, the returned `text` will be a list
            of strings, `logit_score` will be a list of floats, and `lm_score` will be a list of floats, where the
            length of these lists will correspond to the number of returned hypotheses. The value should be >= 1.

            <Tip>

            Please take a look at the example below to better understand how to make use of `output_word_offsets`.

            </Tip>

    Returns:
        [`~models.wav2vec2.Wav2Vec2DecoderWithLMOutput`].

    Example:
        ```python
        >>> # Let's see how to retrieve time steps for a model
        >>> from transformers import AutoTokenizer, AutoProcessor, AutoModelForCTC
        >>> from datasets import load_dataset
        >>> import datasets
        >>> import torch
        ...
        >>> # import model, feature extractor, tokenizer
        >>> model = AutoModelForCTC.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
        >>> processor = AutoProcessor.from_pretrained("patrickvonplaten/wav2vec2-base-100h-with-lm")
        ...
        >>> # load first sample of English common_voice
        >>> dataset = load_dataset("mozilla-foundation/common_voice_11_0", "en", split="train", streaming=True)
        >>> dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
        >>> dataset_iter = iter(dataset)
        >>> sample = next(dataset_iter)
        ...
        >>> # forward sample through model to get greedily predicted transcription ids
        >>> input_values = processor(sample["audio"]["array"], return_tensors="pt").input_values
        >>> with torch.no_grad():
        ...     logits = model(input_values).logits[0].cpu().numpy()
        ...
        >>> # retrieve word stamps (analogous commands for `output_char_offsets`)
        >>> outputs = processor.decode(logits, output_word_offsets=True)
        >>> # compute `time_offset` in seconds as product of downsampling ratio and sampling_rate
        >>> time_offset = model.config.inputs_to_logits_ratio / processor.feature_extractor.sampling_rate
        ...
        >>> word_offsets = [
        ...     {
        ...         "word": d["word"],
        ...         "start_time": round(d["start_offset"] * time_offset, 2),
        ...         "end_time": round(d["end_offset"] * time_offset, 2),
        ...     }
        ...     for d in outputs.word_offsets
        ... ]
        >>> # compare word offsets with audio `en_train_0/common_voice_en_19121553.mp3` online on the dataset viewer:
        >>> # https://hf-mirror.com/datasets/mozilla-foundation/common_voice_11_0/viewer/en
        >>> word_offsets[:4]
        [{'word': 'THE', 'start_time': 0.68, 'end_time': 0.78},
         {'word': 'TRACK', 'start_time': 0.88, 'end_time': 1.1},
         {'word': 'APPEARS', 'start_time': 1.18, 'end_time': 1.66},
         {'word': 'ON', 'start_time': 1.86, 'end_time': 1.92}]
        ```
    """
    from pyctcdecode.constants import (
        DEFAULT_BEAM_WIDTH,
        DEFAULT_HOTWORD_WEIGHT,
        DEFAULT_MIN_TOKEN_LOGP,
        DEFAULT_PRUNE_LOGP,
    )

    # set defaults
    beam_width = beam_width if beam_width is not None else DEFAULT_BEAM_WIDTH
    beam_prune_logp = beam_prune_logp if beam_prune_logp is not None else DEFAULT_PRUNE_LOGP
    token_min_logp = token_min_logp if token_min_logp is not None else DEFAULT_MIN_TOKEN_LOGP
    hotword_weight = hotword_weight if hotword_weight is not None else DEFAULT_HOTWORD_WEIGHT

    # reset params at every forward call. It's just a `set` method in pyctcdecode
    self.decoder.reset_params(
        alpha=alpha, beta=beta, unk_score_offset=unk_score_offset, lm_score_boundary=lm_score_boundary
    )

    # pyctcdecode
    decoded_beams = self.decoder.decode_beams(
        logits,
        beam_width=beam_width,
        beam_prune_logp=beam_prune_logp,
        token_min_logp=token_min_logp,
        hotwords=hotwords,
        hotword_weight=hotword_weight,
    )

    word_offsets = None
    if output_word_offsets:
        word_offsets = [
            [
                {"word": word, "start_offset": start_offset, "end_offset": end_offset}
                for word, (start_offset, end_offset) in beam[2]
            ]
            for beam in decoded_beams
        ]
    logit_scores = [beam[-2] for beam in decoded_beams]

    lm_scores = [beam[-1] for beam in decoded_beams]

    hypotheses = [beam[0] for beam in decoded_beams]

    if n_best > len(decoded_beams):
        logger.info(
            "N-best size is larger than the number of generated hypotheses, all hypotheses will be returned."
        )

    if n_best == 1:
        return Wav2Vec2DecoderWithLMOutput(
            text=hypotheses[0],
            logit_score=logit_scores[0],
            lm_score=lm_scores[0],
            word_offsets=word_offsets[0] if word_offsets is not None else None,
        )
    return Wav2Vec2DecoderWithLMOutput(
        text=hypotheses[:n_best],
        logit_score=logit_scores[:n_best],
        lm_score=lm_scores[:n_best],
        word_offsets=word_offsets[:n_best] if word_offsets is not None else None,
    )

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2ProcessorWithLM.from_pretrained(pretrained_model_name_or_path, **kwargs) classmethod

Instantiate a [Wav2Vec2ProcessorWithLM] from a pretrained Wav2Vec2 processor.

This class method is simply calling Wav2Vec2FeatureExtractor's [~feature_extraction_utils.FeatureExtractionMixin.from_pretrained], Wav2Vec2CTCTokenizer's [~tokenization_utils_base.PreTrainedTokenizerBase.from_pretrained], and [pyctcdecode.BeamSearchDecoderCTC.load_from_hf_hub].

Please refer to the docstrings of the methods above for more information.

PARAMETER DESCRIPTION
pretrained_model_name_or_path

This can be either:

  • a string, the model id of a pretrained feature_extractor hosted inside a model repo on hf-mirror.com.
  • a path to a directory containing a feature extractor file saved using the [~SequenceFeatureExtractor.save_pretrained] method, e.g., ./my_model_directory/.
  • a path or url to a saved feature extractor JSON file, e.g., ./my_model_directory/preprocessor_config.json.

TYPE: `str` or `os.PathLike`

Source code in mindnlp/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
    r"""
    Instantiate a [`Wav2Vec2ProcessorWithLM`] from a pretrained Wav2Vec2 processor.

    <Tip>

    This class method is simply calling Wav2Vec2FeatureExtractor's
    [`~feature_extraction_utils.FeatureExtractionMixin.from_pretrained`], Wav2Vec2CTCTokenizer's
    [`~tokenization_utils_base.PreTrainedTokenizerBase.from_pretrained`], and
    [`pyctcdecode.BeamSearchDecoderCTC.load_from_hf_hub`].

    Please refer to the docstrings of the methods above for more information.

    </Tip>

    Args:
        pretrained_model_name_or_path (`str` or `os.PathLike`):
            This can be either:

            - a string, the *model id* of a pretrained feature_extractor hosted inside a model repo on
            hf-mirror.com.
            - a path to a *directory* containing a feature extractor file saved using the
            [`~SequenceFeatureExtractor.save_pretrained`] method, e.g., `./my_model_directory/`.
            - a path or url to a saved feature extractor JSON *file*, e.g.,
            `./my_model_directory/preprocessor_config.json`.
        **kwargs
            Additional keyword arguments passed along to both [`SequenceFeatureExtractor`] and
            [`PreTrainedTokenizer`]
    """
    requires_backends(cls, "pyctcdecode")
    from pyctcdecode import BeamSearchDecoderCTC

    feature_extractor, tokenizer = super()._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs) # pylint: disable=unbalanced-tuple-unpacking

    if os.path.isdir(pretrained_model_name_or_path) or os.path.isfile(pretrained_model_name_or_path):
        decoder = BeamSearchDecoderCTC.load_from_dir(pretrained_model_name_or_path)
    else:
        # BeamSearchDecoderCTC has no auto class
        kwargs.pop("_from_auto", None)
        # snapshot_download has no `trust_remote_code` flag
        kwargs.pop("trust_remote_code", None)

        # make sure that only relevant filenames are downloaded
        language_model_filenames = os.path.join(BeamSearchDecoderCTC._LANGUAGE_MODEL_SERIALIZED_DIRECTORY, "*")
        alphabet_filename = BeamSearchDecoderCTC._ALPHABET_SERIALIZED_FILENAME
        allow_patterns = [language_model_filenames, alphabet_filename]

        decoder = BeamSearchDecoderCTC.load_from_hf_hub(
            pretrained_model_name_or_path, allow_patterns=allow_patterns, **kwargs
        )

    # set language model attributes
    for attribute in ["alpha", "beta", "unk_score_offset", "score_boundary"]:
        value = kwargs.pop(attribute, None)

        if value is not None:
            cls._set_language_model_attribute(decoder, attribute, value)

    # make sure that decoder's alphabet and tokenizer's vocab match in content
    missing_decoder_tokens = cls.get_missing_alphabet_tokens(decoder, tokenizer)
    if len(missing_decoder_tokens) > 0:
        raise ValueError(
            f"The tokens {missing_decoder_tokens} are defined in the tokenizer's "
            "vocabulary, but not in the decoder's alphabet. "
            f"Make sure to include {missing_decoder_tokens} in the decoder's alphabet."
        )

    return cls(feature_extractor=feature_extractor, tokenizer=tokenizer, decoder=decoder)
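
As a hedged illustration of the loading path above (the Hub repo id is the one used in the `decode` example; a local directory written by `save_pretrained` can be passed in the same way):

# illustrative sketch, not library documentation
from mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm import (
    Wav2Vec2ProcessorWithLM,
)

processor = Wav2Vec2ProcessorWithLM.from_pretrained(
    "patrickvonplaten/wav2vec2-base-100h-with-lm"
)

# LM fusion parameters can also be adjusted per call at decode time,
# e.g. processor.decode(logits, alpha=0.5, beta=1.5)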

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2ProcessorWithLM.get_missing_alphabet_tokens(decoder, tokenizer) staticmethod

Identify tokens in the tokenizer's vocabulary that are missing from the decoder's alphabet, by comparing the tokenizer's vocabulary (with special tokens normalized) against the decoder's alphabet labels.

PARAMETER DESCRIPTION
decoder

The pyctcdecode beam-search decoder whose alphabet labels are checked against the tokenizer's vocabulary.

TYPE: object

tokenizer

The Wav2Vec2 CTC tokenizer whose vocabulary is compared against the decoder's alphabet.

TYPE: object

RETURNS DESCRIPTION
set

A set of tokens from the tokenizer's vocabulary that are not present in the decoder's alphabet labels; the set is empty if the two are compatible.

Source code in mindnlp/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
@staticmethod
def get_missing_alphabet_tokens(decoder, tokenizer):
    """
    Identify tokens from the tokenizer's vocabulary that are missing from the decoder's alphabet.

    Special tokens (blank, word delimiter, unknown) are normalized before the comparison, so only
    genuinely missing tokens are reported.

    Args:
        decoder (`pyctcdecode.BeamSearchDecoderCTC`): The beam-search decoder whose alphabet labels are
            checked against the tokenizer's vocabulary.
        tokenizer (`Wav2Vec2CTCTokenizer`): The tokenizer whose vocabulary is compared against the
            decoder's alphabet.

    Returns:
        set: The tokens present in the tokenizer's vocabulary but absent from the decoder's alphabet
            labels; empty if the two are compatible.
    """
    from pyctcdecode.alphabet import BLANK_TOKEN_PTN, UNK_TOKEN, UNK_TOKEN_PTN

    # we need to make sure that all of the tokenizer's tokens except the special tokens
    # are present in the decoder's alphabet. Retrieve any missing alphabet tokens
    # from the decoder
    tokenizer_vocab_list = list(tokenizer.get_vocab().keys())

    # replace special tokens
    for i, token in enumerate(tokenizer_vocab_list):
        if BLANK_TOKEN_PTN.match(token):
            tokenizer_vocab_list[i] = ""
        if token == tokenizer.word_delimiter_token:
            tokenizer_vocab_list[i] = " "
        if UNK_TOKEN_PTN.match(token):
            tokenizer_vocab_list[i] = UNK_TOKEN

    # are any of the extra tokens not special tokenizer tokens?
    missing_tokens = set(tokenizer_vocab_list) - set(decoder._alphabet.labels)

    return missing_tokens
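
A minimal sketch of using this check directly, assuming `feature_extractor`, `tokenizer` and `decoder` have already been constructed; it mirrors the validation that `from_pretrained` performs before building the processor:

# illustrative sketch with placeholder objects
missing = Wav2Vec2ProcessorWithLM.get_missing_alphabet_tokens(decoder, tokenizer)
if len(missing) > 0:
    raise ValueError(f"Decoder alphabet is missing tokens: {missing}")

processor = Wav2Vec2ProcessorWithLM(
    feature_extractor=feature_extractor, tokenizer=tokenizer, decoder=decoder
)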

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2ProcessorWithLM.pad(*args, **kwargs)

When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor's [~Wav2Vec2FeatureExtractor.pad] and returns its output. If used in the context [~Wav2Vec2ProcessorWithLM.as_target_processor] this method forwards all its arguments to Wav2Vec2CTCTokenizer's [~Wav2Vec2CTCTokenizer.pad]. Please refer to the docstring of the above two methods for more information.

Source code in mindnlp/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
def pad(self, *args, **kwargs):
    """
    When used in normal mode, this method forwards all its arguments to Wav2Vec2FeatureExtractor's
    [`~Wav2Vec2FeatureExtractor.pad`] and returns its output. If used in the context
    [`~Wav2Vec2ProcessorWithLM.as_target_processor`] this method forwards all its arguments to
    Wav2Vec2CTCTokenizer's [`~Wav2Vec2CTCTokenizer.pad`]. Please refer to the docstring of the above two methods
    for more information.
    """
    # For backward compatibility
    if self._in_target_context_manager:
        return self.current_processor.pad(*args, **kwargs)

    input_features = kwargs.pop("input_features", None)
    labels = kwargs.pop("labels", None)
    if len(args) > 0:
        input_features = args[0]
        args = args[1:]

    if input_features is not None:
        input_features = self.feature_extractor.pad(input_features, *args, **kwargs)
    if labels is not None:
        labels = self.tokenizer.pad(labels, **kwargs)

    if labels is None:
        return input_features
    if input_features is None:
        return labels
    input_features["labels"] = labels["input_ids"]
    return input_features
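
As a rough illustration of the branching above (names are placeholders, not library code), a CTC-style collate function might pad audio features and label ids in a single call:

# hypothetical collate function built on `pad`
def collate_fn(features, processor):
    # each element of `features` is assumed to hold raw `input_values` and
    # tokenized target ids under `labels`
    input_features = [{"input_values": f["input_values"]} for f in features]
    label_features = [{"input_ids": f["labels"]} for f in features]

    batch = processor.pad(
        input_features=input_features,
        labels=label_features,
        padding=True,
        return_tensors="np",  # any tensor type accepted by the feature extractor and tokenizer
    )
    # as in the method body above, the padded label `input_ids` end up in the
    # batch under the "labels" key
    return batch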

mindnlp.transformers.models.wav2vec2_with_lm.processing_wav2vec2_with_lm.Wav2Vec2ProcessorWithLM.save_pretrained(save_directory)

Save the Wav2Vec2ProcessorWithLM instance and the associated language model to the specified directory.

PARAMETER DESCRIPTION
self

The instance of the Wav2Vec2ProcessorWithLM class.

TYPE: Wav2Vec2ProcessorWithLM

save_directory

The directory path where the processor and language model will be saved.

TYPE: str

RETURNS DESCRIPTION

None.

RAISES DESCRIPTION
OSError

If the save_directory cannot be accessed or does not exist.

ValueError

If the save_directory is not a valid directory path.

TypeError

If the save_directory parameter is not a string.

Source code in mindnlp/transformers/models/wav2vec2_with_lm/processing_wav2vec2_with_lm.py
def save_pretrained(self, save_directory):
    """
    Save the Wav2Vec2ProcessorWithLM instance and the associated language model to the specified directory.

    Args:
        self (Wav2Vec2ProcessorWithLM): The instance of the Wav2Vec2ProcessorWithLM class.
        save_directory (str): The directory path where the processor and language model will be saved. 

    Returns:
        None.

    Raises:
        OSError: If the save_directory cannot be accessed or does not exist.
        ValueError: If the save_directory is not a valid directory path.
        TypeError: If the save_directory parameter is not a string.
    """
    super().save_pretrained(save_directory)
    self.decoder.save_to_dir(save_directory)
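
A hedged round-trip sketch (the directory name is arbitrary), assuming `processor` is an existing `Wav2Vec2ProcessorWithLM`:

# illustrative sketch: persist the processor together with its decoder/LM files, then reload it
processor.save_pretrained("./wav2vec2_with_lm_processor")
reloaded = Wav2Vec2ProcessorWithLM.from_pretrained("./wav2vec2_with_lm_processor")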