
automatic_speech_recognition

mindnlp.transformers.pipelines.automatic_speech_recognition.AutomaticSpeechRecognitionPipeline

Bases: ChunkPipeline

Pipeline that extracts the spoken text contained in some audio.

The input can be either a raw waveform or an audio file. For audio files, ffmpeg should be installed to support multiple audio formats.

Example
>>> from mindnlp.transformers import pipeline
...
>>> transcriber = pipeline(model="openai/whisper-base")
>>> transcriber("https://hf-mirror.com/datasets/Narsil/asr_dummy/resolve/main/1.flac")
{'text': ' He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick, peppered flour-fatten sauce.'}

Learn more about the basics of using a pipeline in the pipeline tutorial

PARAMETER DESCRIPTION
model

The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from [PreTrainedModel] for PyTorch and [TFPreTrainedModel] for TensorFlow.

TYPE: [`PreTrainedModel`] or [`TFPreTrainedModel`]

feature_extractor

The feature extractor that will be used by the pipeline to encode waveform for the model.

TYPE: [`SequenceFeatureExtractor`] DEFAULT: None

tokenizer

The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from [PreTrainedTokenizer].

TYPE: [`PreTrainedTokenizer`] DEFAULT: None

decoder

PyCTCDecode's BeamSearchDecoderCTC can be passed for language model boosted decoding. See [Wav2Vec2ProcessorWithLM] for more information.

TYPE: `pyctcdecode.BeamSearchDecoderCTC`, *optional* DEFAULT: None

chunk_length_s

The input length of each chunk, in seconds. If chunk_length_s = 0, chunking is disabled (default). A usage sketch follows this parameter list.

For more information on how to effectively use chunk_length_s, please have a look at the ASR chunking blog post.

TYPE: `float`, *optional*, defaults to 0

stride_length_s

The length of the stride on the left and right of each chunk, in seconds. Used only with chunk_length_s > 0. This enables the model to see more context and infer letters better than it could without that context; the pipeline then discards the overlapping stride portions at the end so that the final reconstitution is as accurate as possible.

For more information on how to effectively use stride_length_s, please have a look at the ASR chunking blog post.

TYPE: `float`, *optional*, defaults to `chunk_length_s / 6`

framework

The framework to use, either "ms" for MindSpore or "tf" for TensorFlow. The specified framework must be installed. If no framework is specified, the pipeline will default to the one currently installed. If no framework is specified and both frameworks are installed, it will default to the framework of the model, or to MindSpore if no model is provided.

TYPE: `str`, *optional*

device

Device ordinal for CPU/GPU support. Setting this to None will use the CPU; a positive integer will run the model on the associated device id.

TYPE: Union[`int`, `torch.device`], *optional*

ms_dtype

The data type (dtype) of the computation. Setting this to None will use float32 precision. Set to mindspore.float16 or mindspore.bfloat16 to run the computation in the corresponding half precision.

TYPE: Union[`int`, `torch.dtype`], *optional* DEFAULT: None
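
Below is a minimal usage sketch of chunked long-form transcription with timestamps. It assumes mindnlp.transformers exposes the same pipeline factory as in the example above; the audio file name is a placeholder.

>>> from mindnlp.transformers import pipeline
>>> transcriber = pipeline(
...     task="automatic-speech-recognition",
...     model="openai/whisper-base",
...     chunk_length_s=30,        # split long audio into 30 s chunks
...     stride_length_s=(5, 5),   # 5 s of left/right context around each chunk
... )
>>> out = transcriber("meeting_recording.wav", return_timestamps=True)  # placeholder file name
>>> out["text"]    # full transcription
>>> out["chunks"]  # [{"text": ..., "timestamp": (start_s, end_s)}, ...] when timestamps are requested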

Source code in mindnlp/transformers/pipelines/automatic_speech_recognition.py
class AutomaticSpeechRecognitionPipeline(ChunkPipeline):
    """
    Pipeline that aims at extracting spoken text contained within some audio.

    The input can be either a raw waveform or a audio file. In case of the audio file, ffmpeg should be installed for
    to support multiple audio formats

    Example:
        ```python
        >>> from transformers import pipeline
        ...
        >>> transcriber = pipeline(model="openai/whisper-base")
        >>> transcriber("https://hf-mirror.com/datasets/Narsil/asr_dummy/resolve/main/1.flac")
        {'text': ' He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick, peppered flour-fatten sauce.'}
        ```

    Learn more about the basics of using a pipeline in the [pipeline tutorial](../pipeline_tutorial)

    Arguments:
        model ([`PreTrainedModel`] or [`TFPreTrainedModel`]):
            The model that will be used by the pipeline to make predictions. This needs to be a model inheriting from
            [`PreTrainedModel`] for PyTorch and [`TFPreTrainedModel`] for TensorFlow.
        feature_extractor ([`SequenceFeatureExtractor`]):
            The feature extractor that will be used by the pipeline to encode waveform for the model.
        tokenizer ([`PreTrainedTokenizer`]):
            The tokenizer that will be used by the pipeline to encode data for the model. This object inherits from
            [`PreTrainedTokenizer`].
        decoder (`pyctcdecode.BeamSearchDecoderCTC`, *optional*):
            [PyCTCDecode's
            BeamSearchDecoderCTC](https://github.com/kensho-technologies/pyctcdecode/blob/2fd33dc37c4111417e08d89ccd23d28e9b308d19/pyctcdecode/decoder.py#L180)
            can be passed for language model boosted decoding. See [`Wav2Vec2ProcessorWithLM`] for more information.
        chunk_length_s (`float`, *optional*, defaults to 0):
            The input length for in each chunk. If `chunk_length_s = 0` then chunking is disabled (default).

            <Tip>

            For more information on how to effectively use `chunk_length_s`, please have a look at the [ASR chunking
            blog post](https://hf-mirror.com/blog/asr-chunking).

            </Tip>

        stride_length_s (`float`, *optional*, defaults to `chunk_length_s / 6`):
            The length of stride on the left and right of each chunk. Used only with `chunk_length_s > 0`. This enables
            the model to *see* more context and infer letters better than without this context but the pipeline
            discards the stride bits at the end to make the final reconstitution as perfect as possible.

            <Tip>

            For more information on how to effectively use `stride_length_s`, please have a look at the [ASR chunking
            blog post](https://hf-mirror.com/blog/asr-chunking).

            </Tip>

        framework (`str`, *optional*):
            The framework to use, either `"ms"` for PyTorch or `"tf"` for TensorFlow. The specified framework must be
            installed. If no framework is specified, will default to the one currently installed. If no framework is
            specified and both frameworks are installed, will default to the framework of the `model`, or to PyTorch if
            no model is provided.
        device (Union[`int`, `torch.device`], *optional*):
            Device ordinal for CPU/GPU supports. Setting this to `None` will leverage CPU, a positive will run the
            model on the associated CUDA device id.
        ms_dtype (Union[`int`, `torch.dtype`], *optional*):
            The data-type (dtype) of the computation. Setting this to `None` will use float32 precision. Set to
            `torch.float16` or `torch.bfloat16` to use half-precision in the respective dtypes.

    """
    def __init__(
        self,
        model: "PreTrainedModel",
        feature_extractor: Union["SequenceFeatureExtractor", str] = None,
        tokenizer: Optional[PreTrainedTokenizer] = None,
        decoder: Optional[Union["BeamSearchDecoderCTC", str]] = None,
        ms_dtype: Optional[str] = None,
        **kwargs,
    ):
        """
        This method initializes an instance of AutomaticSpeechRecognitionPipeline.

        Args:
            self: The instance of the class.
            model (PreTrainedModel): The pre-trained model used for speech recognition.
            feature_extractor (Union[SequenceFeatureExtractor, str]): The feature extractor used for processing
                input data. It can be an instance of SequenceFeatureExtractor class or a string.
            tokenizer (Optional[PreTrainedTokenizer]): The tokenizer used for tokenizing input data.
            decoder (Optional[Union[BeamSearchDecoderCTC, str]]): The decoder used for decoding the model predictions.
                It can be an instance of BeamSearchDecoderCTC class or a string.
            ms_dtype (Optional[str]): The data type used for processing input data.

        Returns:
            None.

        Raises:
            None
        """
        # set the model type so we can check we have the right pre- and post-processing parameters
        if model.config.model_type == "whisper":
            self.type = "seq2seq_whisper"
        elif model.__class__.__name__ in MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES.values():
            self.type = "seq2seq"
        elif (
            feature_extractor._processor_class
            and feature_extractor._processor_class.endswith("WithLM")
            and decoder is not None
        ):
            self.decoder = decoder
            self.type = "ctc_with_lm"
        else:
            self.type = "ctc"

        super().__init__(model, tokenizer, feature_extractor, ms_dtype=ms_dtype, **kwargs)

    def __call__(
        self,
        inputs: Union[np.ndarray, bytes, str],
        **kwargs,
    ):
        """
        Transcribe the audio sequence(s) given as inputs to text. See the [`AutomaticSpeechRecognitionPipeline`]
        documentation for more information.

        Args:
            inputs (`np.ndarray` or `bytes` or `str` or `dict`):
                - `str` that is either the filename of a local audio file, or a public URL address to download the
                audio file. The file will be read at the correct sampling rate to get the waveform using
                *ffmpeg*. This requires *ffmpeg* to be installed on the system.
                - `bytes` it is supposed to be the content of an audio file and is interpreted by *ffmpeg* in the same way.
                - (`np.ndarray` of shape (n, ) of type `np.float32` or `np.float64`)
                Raw audio at the correct sampling rate (no further check will be done)
                - `dict` form can be used to pass raw audio sampled at arbitrary `sampling_rate` and let this
                pipeline do the resampling. The dict must be in the format `{"sampling_rate": int, "raw":
                np.array}` with optionally a `"stride": (left: int, right: int)` than can ask the pipeline to
                treat the first `left` samples and last `right` samples to be ignored in decoding (but used at
                inference to provide more context to the model). Only use `stride` with CTC models.
            return_timestamps (*optional*, `str` or `bool`):
                - Only available for pure CTC models (Wav2Vec2, HuBERT, etc) and the Whisper model. Not available for
                other sequence-to-sequence models.
                - For CTC models, timestamps can take one of two formats:

                    - `"char"`: the pipeline will return timestamps along the text for every character in the text. For
                    instance, if you get `[{"text": "h", "timestamp": (0.5, 0.6)}, {"text": "i", "timestamp": (0.7,
                    0.9)}]`, then it means the model predicts that the letter "h" was spoken after `0.5` and before
                    `0.6` seconds.
                    - `"word"`: the pipeline will return timestamps along the text for every word in the text. For
                    instance, if you get `[{"text": "hi ", "timestamp": (0.5, 0.9)}, {"text": "there", "timestamp":
                    (1.0, 1.5)}]`, then it means the model predicts that the word "hi" was spoken after `0.5` and
                    before `0.9` seconds.
                - For the Whisper model, timestamps can take one of two formats:

                    - `"word"`: same as above for word-level CTC timestamps. Word-level timestamps are predicted
                            through the *dynamic-time warping (DTW)* algorithm, an approximation to word-level timestamps
                            by inspecting the cross-attention weights.
                    - `True`: the pipeline will return timestamps along the text for *segments* of words in the text.
                            For instance, if you get `[{"text": " Hi there!", "timestamp": (0.5, 1.5)}]`, then it means the
                            model predicts that the segment "Hi there!" was spoken after `0.5` and before `1.5` seconds.
                            Note that a segment of text refers to a sequence of one or more words, rather than individual
                            words as with word-level timestamps.
            generate_kwargs (`dict`, *optional*):
                The dictionary of ad-hoc parametrization of `generate_config` to be used for the generation call. For a
                complete overview of generate, check the [following
                guide](https://hf-mirror.com/docs/transformers/en/main_classes/text_generation).
            max_new_tokens (`int`, *optional*):
                The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.

        Returns:
            `Dict`:
                A dictionary with the following keys:

                - **text** (`str`): The recognized text.
                - **chunks** (*optional(, `List[Dict]`)
                When using `return_timestamps`, the `chunks` will become a list containing all the various text
                chunks identified by the model, *e.g.* `[{"text": "hi ", "timestamp": (0.5, 0.9)}, {"text":
                "there", "timestamp": (1.0, 1.5)}]`. The original full text can roughly be recovered by doing
                `"".join(chunk["text"] for chunk in output["chunks"])`.
        """
        return super().__call__(inputs, **kwargs)

    def _sanitize_parameters(
        self,
        chunk_length_s=None,
        stride_length_s=None,
        ignore_warning=None,
        decoder_kwargs=None,
        return_timestamps=None,
        return_language=None,
        generate_kwargs=None,
        max_new_tokens=None,
    ):
        """
        This method '_sanitize_parameters' in the class 'AutomaticSpeechRecognitionPipeline' is responsible for
        sanitizing and validating input parameters for the Automatic Speech Recognition pipeline.

        Args:
            self (object): The instance of the class.
            chunk_length_s (float, optional): The length of each audio chunk in seconds. If provided, it is stored in
                the preprocess_params dictionary. Note: Experimental with 'seq2seq' models.
            stride_length_s (float, optional): The stride length between consecutive audio chunks in seconds.
                Stored in preprocess_params.
            ignore_warning (bool, optional): If True, ignores experimental warning when using 'chunk_length_s'
                with 'seq2seq' models.
            decoder_kwargs (dict, optional): Additional keyword arguments for the decoder. Stored in postprocess_params.
            return_timestamps (str or bool, optional): Specifies the type of timestamps to return. Restrictions
                based on the model type.
            return_language (str, optional): Specifies whether to return language information.
                Only available for 'seq2seq_whisper' models.
            generate_kwargs (dict, optional): Additional keyword arguments for model generation.
                If 'max_new_tokens' is defined here, it should not be repeated in the argument list.
            max_new_tokens (int, optional): Maximum number of new tokens to generate. Stored in forward_params.

        Returns:
            tuple:
                A tuple containing three dictionaries - preprocess_params, forward_params, and postprocess_params.
                These dictionaries hold sanitized parameters for different stages of the ASR pipeline.

        Raises:
            ValueError: If 'max_new_tokens' is defined both as an argument and inside 'generate_kwargs'.
            ValueError: If attempting to return timestamps not supported by the model type.
            ValueError: If language information is requested for a model other than 'seq2seq_whisper'.
            Warning: Experimental warning message when using 'chunk_length_s' with 'seq2seq' models.
        """
        # No parameters on this pipeline right now
        preprocess_params = {}
        if chunk_length_s is not None:
            if self.type == "seq2seq" and not ignore_warning:
                logger.warning(
                    "Using `chunk_length_s` is very experimental with seq2seq models. The results will not necessarily"
                    " be entirely accurate and will have caveats. More information:"
                    " https://github.com/huggingface/transformers/pull/20104. Ignore this warning with pipeline(...,"
                    " ignore_warning=True)"
                )
            preprocess_params["chunk_length_s"] = chunk_length_s
        if stride_length_s is not None:
            preprocess_params["stride_length_s"] = stride_length_s

        forward_params = defaultdict(dict)
        if max_new_tokens is not None:
            forward_params["max_new_tokens"] = max_new_tokens
        if generate_kwargs is not None:
            if max_new_tokens is not None and "max_new_tokens" in generate_kwargs:
                raise ValueError(
                    "`max_new_tokens` is defined both as an argument and inside `generate_kwargs` argument, please use"
                    " only 1 version"
                )
            forward_params.update(generate_kwargs)

        postprocess_params = {}
        if decoder_kwargs is not None:
            postprocess_params["decoder_kwargs"] = decoder_kwargs
        if return_timestamps is not None:
            # Check whether we have a valid setting for return_timestamps and throw an error before we perform a forward pass
            if self.type == "seq2seq" and return_timestamps:
                raise ValueError("We cannot return_timestamps yet on non-CTC models apart from Whisper!")
            if self.type == "ctc_with_lm" and return_timestamps != "word":
                raise ValueError("CTC with LM can only predict word level timestamps, set `return_timestamps='word'`")
            if self.type == "ctc" and return_timestamps not in ["char", "word"]:
                raise ValueError(
                    "CTC can either predict character level timestamps, or word level timestamps. "
                    "Set `return_timestamps='char'` or `return_timestamps='word'` as required."
                )
            if self.type == "seq2seq_whisper" and return_timestamps == "char":
                raise ValueError(
                    "Whisper cannot return `char` timestamps, only word level or segment level timestamps. "
                    "Use `return_timestamps='word'` or `return_timestamps=True` respectively."
                )
            forward_params["return_timestamps"] = return_timestamps
            postprocess_params["return_timestamps"] = return_timestamps
        if return_language is not None:
            if self.type != "seq2seq_whisper":
                raise ValueError("Only Whisper can return language for now.")
            postprocess_params["return_language"] = return_language

        return preprocess_params, forward_params, postprocess_params

    def preprocess(self, inputs, chunk_length_s=0, stride_length_s=None):
        """
        This method preprocesses the input data for the AutomaticSpeechRecognitionPipeline.

        Args:
            self (object): The instance of the AutomaticSpeechRecognitionPipeline class.
            inputs (str, bytes, dict, or np.ndarray):
                The input data, which can be in the form of a file path (str), binary data (bytes),
                a dictionary containing audio data and its properties, or a numpy array representing the audio.
            chunk_length_s (float):
                The length of chunks into which the audio data should be divided for processing, in seconds.
                Defaults to 0.
            stride_length_s (float or list):
                The length of stride for chunking the audio data, in seconds.

                - If a single value is provided, it is applied to both the left and right strides.
                - If a list is provided, the first value represents the left stride and the second value represents
                the right stride.
                - If not provided, it defaults to chunk_length_s / 6.

        Returns:
            None: This method yields processed chunks of the input audio data and does not return a single value.

        Raises:
            ValueError: If the input data does not meet the expected format or requirements,
                such as missing keys in the dictionary input, incorrect stride length, or invalid chunk length.
            TypeError: If the type of the input does not match the expected type.
        """
        if isinstance(inputs, str):
            if inputs.startswith("http://") or inputs.startswith("https://"):
                # We need to actually check for a real protocol, otherwise it's impossible to use a local file
                # like http_hf-mirror.com.png
                inputs = requests.get(inputs, timeout=3).content
            else:
                with open(inputs, "rb") as f:
                    inputs = f.read()

        if isinstance(inputs, bytes):
            inputs = ffmpeg_read(inputs, self.feature_extractor.sampling_rate)

        stride = None
        extra = {}
        if isinstance(inputs, dict):
            stride = inputs.pop("stride", None)
            # Accepting `"array"` which is the key defined in `datasets` for
            # better integration
            if not ("sampling_rate" in inputs and ("raw" in inputs or "array" in inputs)):
                raise ValueError(
                    "When passing a dictionary to AutomaticSpeechRecognitionPipeline, the dict needs to contain a "
                    '"raw" key containing the numpy array representing the audio and a "sampling_rate" key, '
                    "containing the sampling_rate associated with that array"
                )

            _inputs = inputs.pop("raw", None)
            if _inputs is None:
                # Remove path which will not be used from `datasets`.
                inputs.pop("path", None)
                _inputs = inputs.pop("array", None)
            in_sampling_rate = inputs.pop("sampling_rate")
            extra = inputs
            inputs = _inputs
            if in_sampling_rate != self.feature_extractor.sampling_rate:
                transform = Resample(orig_freq=in_sampling_rate, new_freq=self.feature_extractor.sampling_rate)
                inputs = transform(inputs)
                ratio = self.feature_extractor.sampling_rate / in_sampling_rate
            else:
                ratio = 1
            if stride is not None:
                if stride[0] + stride[1] > inputs.shape[0]:
                    raise ValueError("Stride is too large for input")

                # Stride needs to get the chunk length here, it's going to get
                # swallowed by the `feature_extractor` later, and then batching
                # can add extra data in the inputs, so we need to keep track
                # of the original length in the stride so we can cut properly.
                stride = (inputs.shape[0], int(round(stride[0] * ratio)), int(round(stride[1] * ratio)))
        if not isinstance(inputs, np.ndarray):
            raise ValueError(f"We expect a numpy ndarray as input, got `{type(inputs)}`")
        if len(inputs.shape) != 1:
            raise ValueError("We expect a single channel audio input for AutomaticSpeechRecognitionPipeline")

        if chunk_length_s:
            if stride_length_s is None:
                stride_length_s = chunk_length_s / 6

            if isinstance(stride_length_s, (int, float)):
                stride_length_s = [stride_length_s, stride_length_s]

            # Carefuly, this variable will not exist in `seq2seq` setting.
            # Currently chunking is not possible at this level for `seq2seq` so
            # it's ok.
            align_to = getattr(self.model.config, "inputs_to_logits_ratio", 1)
            chunk_len = int(round(chunk_length_s * self.feature_extractor.sampling_rate / align_to) * align_to)
            stride_left = int(round(stride_length_s[0] * self.feature_extractor.sampling_rate / align_to) * align_to)
            stride_right = int(round(stride_length_s[1] * self.feature_extractor.sampling_rate / align_to) * align_to)

            if chunk_len < stride_left + stride_right:
                raise ValueError("Chunk length must be superior to stride length")

            yield from chunk_iter(
                inputs, self.feature_extractor, chunk_len, stride_left, stride_right, self.ms_dtype
            )
        else:
            if self.type == "seq2seq_whisper" and inputs.shape[0] > self.feature_extractor.n_samples:
                processed = self.feature_extractor(
                    inputs,
                    sampling_rate=self.feature_extractor.sampling_rate,
                    truncation=False,
                    padding="longest",
                    return_tensors="ms",
                )
            else:
                processed = self.feature_extractor(
                    inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="ms"
                )

            if self.ms_dtype is not None:
                processed = processed.to(dtype=self.ms_dtype)
            if stride is not None:
                if self.type == "seq2seq":
                    raise ValueError("Stride is only usable with CTC models, try removing it !")

                processed["stride"] = stride
            yield {"is_last": True, **processed, **extra}

    def _forward(self, model_inputs, return_timestamps=False, **generate_kwargs):
        """
        Performs the forward pass for Automatic Speech Recognition (ASR) in the AutomaticSpeechRecognitionPipeline class.

        Args:
            self (AutomaticSpeechRecognitionPipeline): The instance of the AutomaticSpeechRecognitionPipeline class.
            model_inputs (dict): A dictionary containing the model inputs.
            return_timestamps (bool, optional): Indicates whether to return token timestamps. Defaults to False.

        Returns:
            dict: A dictionary containing the output of the forward pass.
                The structure of the dictionary depends on the ASR model type.

        Raises:
            ValueError:
                If the model_inputs dictionary does not contain either 'input_features' or 'input_values' key,
                when using a seq2seq or seq2seq_whisper model.

        Note:
            Other exceptions may be raised depending on the underlying ASR model used.

        """
        attention_mask = model_inputs.pop("attention_mask", None)
        stride = model_inputs.pop("stride", None)
        is_last = model_inputs.pop("is_last")

        if self.type in {"seq2seq", "seq2seq_whisper"}:
            encoder = self.model.get_encoder()
            # Consume values so we can let extra information flow freely through
            # the pipeline (important for `partial` in microphone)
            if "input_features" in model_inputs:
                inputs = model_inputs.pop("input_features")
            elif "input_values" in model_inputs:
                inputs = model_inputs.pop("input_values")
            else:
                raise ValueError(
                    "Seq2Seq speech recognition model requires either a "
                    f"`input_features` or `input_values` key, but only has {model_inputs.keys()}"
                )

            # custom processing for Whisper timestamps and word-level timestamps
            if return_timestamps and self.type == "seq2seq_whisper":
                generate_kwargs["return_timestamps"] = return_timestamps
                if return_timestamps == "word":
                    generate_kwargs["return_token_timestamps"] = True
                    generate_kwargs["return_segments"] = True

                    if stride is not None:
                        if isinstance(stride, tuple):
                            generate_kwargs["num_frames"] = stride[0] // self.feature_extractor.hop_length
                        else:
                            generate_kwargs["num_frames"] = [s[0] // self.feature_extractor.hop_length for s in stride]

            if self.type == "seq2seq_whisper" and inputs.shape[-1] > self.feature_extractor.nb_max_frames:
                generate_kwargs["input_features"] = inputs
            else:
                generate_kwargs["encoder_outputs"] = encoder(inputs, attention_mask=attention_mask)

            tokens = self.model.generate(
                attention_mask=attention_mask,
                **generate_kwargs,
            )
            # whisper longform generation stores timestamps in "segments"
            if return_timestamps == "word" and self.type == "seq2seq_whisper":
                if "segments" not in tokens:
                    out = {"tokens": tokens["sequences"], "token_timestamps": tokens["token_timestamps"]}
                else:
                    token_timestamps = [
                        ops.cat([segment["token_timestamps"] for segment in segment_list])
                        for segment_list in tokens["segments"]
                    ]
                    out = {"tokens": tokens["sequences"], "token_timestamps": token_timestamps}
            else:
                out = {"tokens": tokens}
            if self.type == "seq2seq_whisper":
                if stride is not None:
                    out["stride"] = stride

        else:
            inputs = {
                self.model.main_input_name: model_inputs.pop(self.model.main_input_name),
                "attention_mask": attention_mask,
            }
            outputs = self.model(**inputs)
            logits = outputs.logits

            if self.type == "ctc_with_lm":
                out = {"logits": logits}
            else:
                out = {"tokens": logits.argmax(axis=-1)}
            if stride is not None:
                # Send stride to `postprocess`.
                # it needs to be handled there where
                # the pieces are to be concatenated.
                ratio = 1 / self.model.config.inputs_to_logits_ratio
                if isinstance(stride, tuple):
                    out["stride"] = rescale_stride([stride], ratio)[0]
                else:
                    out["stride"] = rescale_stride(stride, ratio)
        # Leftover
        extra = model_inputs
        return {"is_last": is_last, **out, **extra}

    def postprocess(
        self, model_outputs, decoder_kwargs: Optional[Dict] = None, return_timestamps=None, return_language=None
    ):
        """
        Method postprocess in the class AutomaticSpeechRecognitionPipeline.

        Args:
            self: Object instance of the class AutomaticSpeechRecognitionPipeline.
            model_outputs: List of dictionaries representing the outputs from the model.
                Each dictionary contains 'logits' or 'tokens' key with corresponding values.
            decoder_kwargs: Optional dictionary containing keyword arguments for the decoder. Defaults to None.
            return_timestamps: Optional parameter indicating whether to return timestamps.
                Can be None, 'word', or 'char'.
            return_language: Optional parameter specifying the language to return.
                Can be None or a specific language identifier.

        Returns:
            None: The method modifies the model_outputs and decoder_kwargs in place.

        Raises:
            ValueError: If the provided 'model_outputs' format is incorrect.
            AttributeError: If the 'stride' key is missing or improperly defined in the model_outputs dictionary.
            KeyError: If required keys are missing in the model_outputs dictionary.
            TypeError: If the input parameters are of incorrect types or incompatible values.
        """
        # Optional return types
        optional = {}

        final_items = []
        key = "logits" if self.type == "ctc_with_lm" else "tokens"
        stride = None
        for outputs in model_outputs:
            items = outputs[key].numpy()
            stride = outputs.get("stride", None)
            if stride is not None and self.type in {"ctc", "ctc_with_lm"}:
                total_n, left, right = stride
                # Total_n might be < logits.shape[1]
                # because of padding, that's why
                # we need to reforward this information
                # This won't work with left padding (which doesn't exist right now)
                right_n = total_n - right
                items = items[:, left:right_n]
            final_items.append(items)

        if stride and self.type == "seq2seq":
            items = _find_longest_common_sequence(final_items, self.tokenizer)
        elif self.type == "seq2seq_whisper":
            time_precision = self.feature_extractor.chunk_length / self.model.config.max_source_positions
            # Send the chunking back to seconds, it's easier to handle in whisper
            sampling_rate = self.feature_extractor.sampling_rate
            for output in model_outputs:
                if "stride" in output:
                    chunk_len, stride_left, stride_right = output["stride"]
                    # Go back in seconds
                    chunk_len /= sampling_rate
                    stride_left /= sampling_rate
                    stride_right /= sampling_rate
                    output["stride"] = chunk_len, stride_left, stride_right

            text, optional = self.tokenizer._decode_asr(
                model_outputs,
                return_timestamps=return_timestamps,
                return_language=return_language,
                time_precision=time_precision,
            )
        else:
            items = np.concatenate(final_items, axis=1)
            items = items.squeeze(0)

        if self.type == "ctc_with_lm":
            if decoder_kwargs is None:
                decoder_kwargs = {}
            beams = self.decoder.decode_beams(items, **decoder_kwargs)
            text = beams[0][0]
            if return_timestamps:
                # Simply cast from pyctcdecode format to wav2vec2 format to leverage
                # pre-existing code later
                chunk_offset = beams[0][2]
                offsets = []
                for word, (start_offset, end_offset) in chunk_offset:
                    offsets.append({"word": word, "start_offset": start_offset, "end_offset": end_offset})
        elif self.type != "seq2seq_whisper":
            skip_special_tokens = self.type != "ctc"
            text = self.tokenizer.decode(items, skip_special_tokens=skip_special_tokens)
            if return_timestamps:
                offsets = self.tokenizer.decode(
                    items, skip_special_tokens=skip_special_tokens, output_char_offsets=True
                )["char_offsets"]
                if return_timestamps == "word":
                    offsets = self.tokenizer._get_word_offsets(offsets, self.tokenizer.replace_word_delimiter_char)

        if return_timestamps and self.type not in {"seq2seq", "seq2seq_whisper"}:
            chunks = []
            for item in offsets:
                start = item["start_offset"] * self.model.config.inputs_to_logits_ratio
                start /= self.feature_extractor.sampling_rate

                stop = item["end_offset"] * self.model.config.inputs_to_logits_ratio
                stop /= self.feature_extractor.sampling_rate

                chunks.append({"text": item[return_timestamps], "timestamp": (start, stop)})
            optional["chunks"] = chunks

        extra = defaultdict(list)
        for output in model_outputs:
            output.pop("tokens", None)
            output.pop("logits", None)
            output.pop("is_last", None)
            output.pop("stride", None)
            output.pop("token_timestamps", None)
            for k, v in output.items():
                extra[k].append(v)
        return {"text": text, **optional, **extra}

mindnlp.transformers.pipelines.automatic_speech_recognition.AutomaticSpeechRecognitionPipeline.__call__(inputs, **kwargs)

Transcribe the audio sequence(s) given as inputs to text. See the [AutomaticSpeechRecognitionPipeline] documentation for more information.

PARAMETER DESCRIPTION
inputs
  • str: either the filename of a local audio file or a public URL from which to download the audio file. The file will be read at the correct sampling rate to get the waveform using ffmpeg. This requires ffmpeg to be installed on the system.
  • bytes: interpreted as the content of an audio file and decoded by ffmpeg in the same way.
  • np.ndarray of shape (n,) and dtype np.float32 or np.float64: raw audio at the correct sampling rate (no further check will be done).
  • dict: used to pass raw audio sampled at an arbitrary sampling_rate and let this pipeline do the resampling. The dict must be in the format {"sampling_rate": int, "raw": np.array}, optionally with a "stride": (left: int, right: int) entry that asks the pipeline to ignore the first left samples and the last right samples during decoding (while still using them at inference to provide more context to the model). Only use stride with CTC models. A usage sketch of the dict form follows the return description below.

TYPE: `np.ndarray` or `bytes` or `str` or `dict`

return_timestamps
  • Only available for pure CTC models (Wav2Vec2, HuBERT, etc.) and the Whisper model. Not available for other sequence-to-sequence models. (A usage sketch follows this parameter list.)
  • For CTC models, timestamps can take one of two formats:

    • "char": the pipeline will return timestamps along the text for every character in the text. For instance, if you get [{"text": "h", "timestamp": (0.5, 0.6)}, {"text": "i", "timestamp": (0.7, 0.9)}], then it means the model predicts that the letter "h" was spoken after 0.5 and before 0.6 seconds.
    • "word": the pipeline will return timestamps along the text for every word in the text. For instance, if you get [{"text": "hi ", "timestamp": (0.5, 0.9)}, {"text": "there", "timestamp": (1.0, 1.5)}], then it means the model predicts that the word "hi" was spoken after 0.5 and before 0.9 seconds.
  • For the Whisper model, timestamps can take one of two formats:

    • "word": same as above for word-level CTC timestamps. Word-level timestamps are predicted through the dynamic-time warping (DTW) algorithm, an approximation of word-level timestamps obtained by inspecting the cross-attention weights.
    • True: the pipeline will return timestamps along the text for segments of words in the text. For instance, if you get [{"text": " Hi there!", "timestamp": (0.5, 1.5)}], then it means the model predicts that the segment "Hi there!" was spoken after 0.5 and before 1.5 seconds. Note that a segment of text refers to a sequence of one or more words, rather than individual words as with word-level timestamps.

TYPE: *optional*, `str` or `bool`

generate_kwargs

The dictionary of ad-hoc parametrization of generate_config to be used for the generation call. For a complete overview of generate, check the following guide.

TYPE: `dict`, *optional*

max_new_tokens

The maximum number of tokens to generate, ignoring the number of tokens in the prompt.

TYPE: `int`, *optional*
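
A short sketch of requesting word-level timestamps (supported for pure CTC models and Whisper), reusing the transcriber from the class-level example; the actual timestamp values in the output are model-dependent.

>>> out = transcriber(
...     "https://hf-mirror.com/datasets/Narsil/asr_dummy/resolve/main/1.flac",
...     return_timestamps="word",
... )
>>> out["chunks"]  # list of {"text": ..., "timestamp": (start_s, end_s)} entries, one per word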

RETURNS DESCRIPTION

Dict: A dictionary with the following keys:

  • text (str): The recognized text.
  • chunks (optional, List[Dict]): When using return_timestamps, chunks will become a list containing all the various text chunks identified by the model, e.g. [{"text": "hi ", "timestamp": (0.5, 0.9)}, {"text": "there", "timestamp": (1.0, 1.5)}]. The original full text can roughly be recovered by doing "".join(chunk["text"] for chunk in output["chunks"]).
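
A short sketch of the dict input form, where raw audio sampled at an arbitrary rate is passed and resampled by the pipeline. It assumes a transcriber built as in the class-level example; the datasets library and the dummy dataset are used here purely for illustration.

>>> from datasets import load_dataset
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> sample = ds[0]["audio"]  # {"array": np.ndarray, "sampling_rate": int, "path": str}
>>> transcriber({"raw": sample["array"], "sampling_rate": sample["sampling_rate"]})
>>> transcriber(sample)  # dicts with an "array" key (as returned by datasets) are accepted directly
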
Source code in mindnlp/transformers/pipelines/automatic_speech_recognition.py
def __call__(
    self,
    inputs: Union[np.ndarray, bytes, str],
    **kwargs,
):
    """
    Transcribe the audio sequence(s) given as inputs to text. See the [`AutomaticSpeechRecognitionPipeline`]
    documentation for more information.

    Args:
        inputs (`np.ndarray` or `bytes` or `str` or `dict`):
            - `str` that is either the filename of a local audio file, or a public URL address to download the
            audio file. The file will be read at the correct sampling rate to get the waveform using
            *ffmpeg*. This requires *ffmpeg* to be installed on the system.
            - `bytes` it is supposed to be the content of an audio file and is interpreted by *ffmpeg* in the same way.
            - (`np.ndarray` of shape (n, ) of type `np.float32` or `np.float64`)
            Raw audio at the correct sampling rate (no further check will be done)
            - `dict` form can be used to pass raw audio sampled at arbitrary `sampling_rate` and let this
            pipeline do the resampling. The dict must be in the format `{"sampling_rate": int, "raw":
            np.array}` with optionally a `"stride": (left: int, right: int)` than can ask the pipeline to
            treat the first `left` samples and last `right` samples to be ignored in decoding (but used at
            inference to provide more context to the model). Only use `stride` with CTC models.
        return_timestamps (*optional*, `str` or `bool`):
            - Only available for pure CTC models (Wav2Vec2, HuBERT, etc) and the Whisper model. Not available for
            other sequence-to-sequence models.
            - For CTC models, timestamps can take one of two formats:

                - `"char"`: the pipeline will return timestamps along the text for every character in the text. For
                instance, if you get `[{"text": "h", "timestamp": (0.5, 0.6)}, {"text": "i", "timestamp": (0.7,
                0.9)}]`, then it means the model predicts that the letter "h" was spoken after `0.5` and before
                `0.6` seconds.
                - `"word"`: the pipeline will return timestamps along the text for every word in the text. For
                instance, if you get `[{"text": "hi ", "timestamp": (0.5, 0.9)}, {"text": "there", "timestamp":
                (1.0, 1.5)}]`, then it means the model predicts that the word "hi" was spoken after `0.5` and
                before `0.9` seconds.
            - For the Whisper model, timestamps can take one of two formats:

                - `"word"`: same as above for word-level CTC timestamps. Word-level timestamps are predicted
                        through the *dynamic-time warping (DTW)* algorithm, an approximation to word-level timestamps
                        by inspecting the cross-attention weights.
                - `True`: the pipeline will return timestamps along the text for *segments* of words in the text.
                        For instance, if you get `[{"text": " Hi there!", "timestamp": (0.5, 1.5)}]`, then it means the
                        model predicts that the segment "Hi there!" was spoken after `0.5` and before `1.5` seconds.
                        Note that a segment of text refers to a sequence of one or more words, rather than individual
                        words as with word-level timestamps.
        generate_kwargs (`dict`, *optional*):
            The dictionary of ad-hoc parametrization of `generate_config` to be used for the generation call. For a
            complete overview of generate, check the [following
            guide](https://hf-mirror.com/docs/transformers/en/main_classes/text_generation).
        max_new_tokens (`int`, *optional*):
            The maximum numbers of tokens to generate, ignoring the number of tokens in the prompt.

    Returns:
        `Dict`:
            A dictionary with the following keys:

            - **text** (`str`): The recognized text.
            - **chunks** (*optional(, `List[Dict]`)
            When using `return_timestamps`, the `chunks` will become a list containing all the various text
            chunks identified by the model, *e.g.* `[{"text": "hi ", "timestamp": (0.5, 0.9)}, {"text":
            "there", "timestamp": (1.0, 1.5)}]`. The original full text can roughly be recovered by doing
            `"".join(chunk["text"] for chunk in output["chunks"])`.
    """
    return super().__call__(inputs, **kwargs)

mindnlp.transformers.pipelines.automatic_speech_recognition.AutomaticSpeechRecognitionPipeline.__init__(model, feature_extractor=None, tokenizer=None, decoder=None, ms_dtype=None, **kwargs)

This method initializes an instance of AutomaticSpeechRecognitionPipeline.

PARAMETER DESCRIPTION
self

The instance of the class.

model

The pre-trained model used for speech recognition.

TYPE: PreTrainedModel

feature_extractor

The feature extractor used for processing input data. It can be an instance of SequenceFeatureExtractor class or a string.

TYPE: Union[SequenceFeatureExtractor, str] DEFAULT: None

tokenizer

The tokenizer used for tokenizing input data.

TYPE: Optional[PreTrainedTokenizer] DEFAULT: None

decoder

The decoder used for decoding the model predictions. It can be an instance of BeamSearchDecoderCTC class or a string.

TYPE: Optional[Union[BeamSearchDecoderCTC, str]] DEFAULT: None

ms_dtype

The data type used for processing input data.

TYPE: Optional[str] DEFAULT: None

RETURNS DESCRIPTION

None.
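
For reference, a hedged sketch of constructing the pipeline directly from pre-loaded components instead of going through the pipeline() factory; the auto classes are assumed to be available under mindnlp.transformers, mirroring the transformers API.

>>> from mindnlp.transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
>>> from mindnlp.transformers.pipelines.automatic_speech_recognition import AutomaticSpeechRecognitionPipeline
>>> processor = AutoProcessor.from_pretrained("openai/whisper-base")
>>> model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-base")
>>> asr = AutomaticSpeechRecognitionPipeline(
...     model=model,
...     feature_extractor=processor.feature_extractor,
...     tokenizer=processor.tokenizer,
... )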

Source code in mindnlp/transformers/pipelines/automatic_speech_recognition.py
def __init__(
    self,
    model: "PreTrainedModel",
    feature_extractor: Union["SequenceFeatureExtractor", str] = None,
    tokenizer: Optional[PreTrainedTokenizer] = None,
    decoder: Optional[Union["BeamSearchDecoderCTC", str]] = None,
    ms_dtype: Optional[str] = None,
    **kwargs,
):
    """
    This method initializes an instance of AutomaticSpeechRecognitionPipeline.

    Args:
        self: The instance of the class.
        model (PreTrainedModel): The pre-trained model used for speech recognition.
        feature_extractor (Union[SequenceFeatureExtractor, str]): The feature extractor used for processing
            input data. It can be an instance of SequenceFeatureExtractor class or a string.
        tokenizer (Optional[PreTrainedTokenizer]): The tokenizer used for tokenizing input data.
        decoder (Optional[Union[BeamSearchDecoderCTC, str]]): The decoder used for decoding the model predictions.
            It can be an instance of BeamSearchDecoderCTC class or a string.
        ms_dtype (Optional[str]): The data type used for processing input data.

    Returns:
        None.

    Raises:
        None
    """
    # set the model type so we can check we have the right pre- and post-processing parameters
    if model.config.model_type == "whisper":
        self.type = "seq2seq_whisper"
    elif model.__class__.__name__ in MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES.values():
        self.type = "seq2seq"
    elif (
        feature_extractor._processor_class
        and feature_extractor._processor_class.endswith("WithLM")
        and decoder is not None
    ):
        self.decoder = decoder
        self.type = "ctc_with_lm"
    else:
        self.type = "ctc"

    super().__init__(model, tokenizer, feature_extractor, ms_dtype=ms_dtype, **kwargs)

mindnlp.transformers.pipelines.automatic_speech_recognition.AutomaticSpeechRecognitionPipeline.postprocess(model_outputs, decoder_kwargs=None, return_timestamps=None, return_language=None)

Post-process the model outputs into the final transcription, optionally with timestamps and language information.

PARAMETER DESCRIPTION
self

Object instance of the class AutomaticSpeechRecognitionPipeline.

model_outputs

List of dictionaries representing the outputs from the model. Each dictionary contains a 'logits' or 'tokens' key with corresponding values.

decoder_kwargs

Optional dictionary containing keyword arguments for the decoder. Defaults to None.

TYPE: Optional[Dict] DEFAULT: None

return_timestamps

Optional parameter indicating whether to return timestamps. Can be None, 'word', or 'char'.

DEFAULT: None

return_language

Optional parameter specifying the language to return. Can be None or a specific language identifier.

DEFAULT: None

RETURNS DESCRIPTION
Dict

A dictionary with the recognized text under "text", plus an optional "chunks" entry (when timestamps are requested) and any extra fields carried over from the model outputs.

RAISES DESCRIPTION
ValueError

If the provided 'model_outputs' format is incorrect.

AttributeError

If the 'stride' key is missing or improperly defined in the model_outputs dictionary.

KeyError

If required keys are missing in the model_outputs dictionary.

TypeError

If the input parameters are of incorrect types or incompatible values.
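
To illustrate the stride handling performed for CTC models, here is a small self-contained sketch (with made-up shapes) of how overlapping chunk outputs are trimmed with their (total_n, left, right) stride before concatenation:

>>> import numpy as np
>>> chunks = [  # hypothetical per-chunk logits of shape (1, frames, vocab) with their strides
...     (np.random.rand(1, 100, 32), (100, 0, 20)),   # first chunk: keep frames [0:80]
...     (np.random.rand(1, 100, 32), (100, 20, 20)),  # middle chunk: keep frames [20:80]
...     (np.random.rand(1, 100, 32), (100, 20, 0)),   # last chunk: keep frames [20:100]
... ]
>>> trimmed = [logits[:, left:total_n - right] for logits, (total_n, left, right) in chunks]
>>> np.concatenate(trimmed, axis=1).shape  # reconstituted logit sequence
(1, 220, 32)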

Source code in mindnlp/transformers/pipelines/automatic_speech_recognition.py
def postprocess(
    self, model_outputs, decoder_kwargs: Optional[Dict] = None, return_timestamps=None, return_language=None
):
    """
    Method postprocess in the class AutomaticSpeechRecognitionPipeline.

    Args:
        self: Object instance of the class AutomaticSpeechRecognitionPipeline.
        model_outputs: List of dictionaries representing the outputs from the model.
            Each dictionary contains 'logits' or 'tokens' key with corresponding values.
        decoder_kwargs: Optional dictionary containing keyword arguments for the decoder. Defaults to None.
        return_timestamps: Optional parameter indicating whether to return timestamps.
            Can be None, 'word', or 'char'.
        return_language: Optional parameter specifying the language to return.
            Can be None or a specific language identifier.

    Returns:
        None: The method modifies the model_outputs and decoder_kwargs in place.

    Raises:
        ValueError: If the provided 'model_outputs' format is incorrect.
        AttributeError: If the 'stride' key is missing or improperly defined in the model_outputs dictionary.
        KeyError: If required keys are missing in the model_outputs dictionary.
        TypeError: If the input parameters are of incorrect types or incompatible values.
    """
    # Optional return types
    optional = {}

    final_items = []
    key = "logits" if self.type == "ctc_with_lm" else "tokens"
    stride = None
    for outputs in model_outputs:
        items = outputs[key].numpy()
        stride = outputs.get("stride", None)
        if stride is not None and self.type in {"ctc", "ctc_with_lm"}:
            total_n, left, right = stride
            # Total_n might be < logits.shape[1]
            # because of padding, that's why
            # we need to reforward this information
            # This won't work with left padding (which doesn't exist right now)
            right_n = total_n - right
            items = items[:, left:right_n]
        final_items.append(items)

    if stride and self.type == "seq2seq":
        items = _find_longest_common_sequence(final_items, self.tokenizer)
    elif self.type == "seq2seq_whisper":
        time_precision = self.feature_extractor.chunk_length / self.model.config.max_source_positions
        # Send the chunking back to seconds, it's easier to handle in whisper
        sampling_rate = self.feature_extractor.sampling_rate
        for output in model_outputs:
            if "stride" in output:
                chunk_len, stride_left, stride_right = output["stride"]
                # Go back in seconds
                chunk_len /= sampling_rate
                stride_left /= sampling_rate
                stride_right /= sampling_rate
                output["stride"] = chunk_len, stride_left, stride_right

        text, optional = self.tokenizer._decode_asr(
            model_outputs,
            return_timestamps=return_timestamps,
            return_language=return_language,
            time_precision=time_precision,
        )
    else:
        items = np.concatenate(final_items, axis=1)
        items = items.squeeze(0)

    if self.type == "ctc_with_lm":
        if decoder_kwargs is None:
            decoder_kwargs = {}
        beams = self.decoder.decode_beams(items, **decoder_kwargs)
        text = beams[0][0]
        if return_timestamps:
            # Simply cast from pyctcdecode format to wav2vec2 format to leverage
            # pre-existing code later
            chunk_offset = beams[0][2]
            offsets = []
            for word, (start_offset, end_offset) in chunk_offset:
                offsets.append({"word": word, "start_offset": start_offset, "end_offset": end_offset})
    elif self.type != "seq2seq_whisper":
        skip_special_tokens = self.type != "ctc"
        text = self.tokenizer.decode(items, skip_special_tokens=skip_special_tokens)
        if return_timestamps:
            offsets = self.tokenizer.decode(
                items, skip_special_tokens=skip_special_tokens, output_char_offsets=True
            )["char_offsets"]
            if return_timestamps == "word":
                offsets = self.tokenizer._get_word_offsets(offsets, self.tokenizer.replace_word_delimiter_char)

    if return_timestamps and self.type not in {"seq2seq", "seq2seq_whisper"}:
        chunks = []
        for item in offsets:
            start = item["start_offset"] * self.model.config.inputs_to_logits_ratio
            start /= self.feature_extractor.sampling_rate

            stop = item["end_offset"] * self.model.config.inputs_to_logits_ratio
            stop /= self.feature_extractor.sampling_rate

            chunks.append({"text": item[return_timestamps], "timestamp": (start, stop)})
        optional["chunks"] = chunks

    extra = defaultdict(list)
    for output in model_outputs:
        output.pop("tokens", None)
        output.pop("logits", None)
        output.pop("is_last", None)
        output.pop("stride", None)
        output.pop("token_timestamps", None)
        for k, v in output.items():
            extra[k].append(v)
    return {"text": text, **optional, **extra}

mindnlp.transformers.pipelines.automatic_speech_recognition.AutomaticSpeechRecognitionPipeline.preprocess(inputs, chunk_length_s=0, stride_length_s=None)

This method preprocesses the input data for the AutomaticSpeechRecognitionPipeline.

PARAMETER DESCRIPTION
self

The instance of the AutomaticSpeechRecognitionPipeline class.

TYPE: object

inputs

The input data, which can be in the form of a file path (str), binary data (bytes), a dictionary containing audio data and its properties, or a numpy array representing the audio.

TYPE: str, bytes, dict, or np.ndarray

chunk_length_s

The length of chunks into which the audio data should be divided for processing, in seconds. Defaults to 0.

TYPE: float DEFAULT: 0

stride_length_s

The length of stride for chunking the audio data, in seconds.

  • If a single value is provided, it is applied to both the left and right strides.
  • If a list is provided, the first value represents the left stride and the second value represents the right stride.
  • If not provided, it defaults to chunk_length_s / 6.

TYPE: float or list DEFAULT: None

RETURNS DESCRIPTION
None

This method yields processed chunks of the input audio data and does not return a single value.

RAISES DESCRIPTION
ValueError

If the input data does not meet the expected format or requirements, such as missing keys in the dictionary input, incorrect stride length, or invalid chunk length.

TypeError

If the type of the input does not match the expected type.
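
The example below is an illustrative sketch of how such inputs are usually passed to the pipeline's __call__, which forwards them to preprocess. The model name and audio array are placeholders, and the `pipeline` factory import is assumed to follow the HF-style API that mindnlp exposes; neither is taken from this documentation.

import numpy as np
from mindnlp.transformers import pipeline  # assumed HF-style pipeline factory

transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-base")

# `datasets`-style dict input: "raw" (or "array") plus "sampling_rate".
audio = {"raw": np.zeros(16_000, dtype=np.float32), "sampling_rate": 16_000}  # 1 s of silence

# Long-form audio is split into 30 s chunks with a 6 s stride on each side.
result = transcriber(audio, chunk_length_s=30, stride_length_s=[6, 6])
print(result["text"])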

Source code in mindnlp/transformers/pipelines/automatic_speech_recognition.py (lines 457-580)
def preprocess(self, inputs, chunk_length_s=0, stride_length_s=None):
    """
    This method preprocesses the input data for the AutomaticSpeechRecognitionPipeline.

    Args:
        self (object): The instance of the AutomaticSpeechRecognitionPipeline class.
        inputs (str, bytes, dict, or np.ndarray):
            The input data, which can be in the form of a file path (str), binary data (bytes),
            a dictionary containing audio data and its properties, or a numpy array representing the audio.
        chunk_length_s (float):
            The length of chunks into which the audio data should be divided for processing, in seconds.
            Defaults to 0.
        stride_length_s (float or list):
            The length of stride for chunking the audio data, in seconds.

            - If a single value is provided, it is applied to both the left and right strides.
            - If a list is provided, the first value represents the left stride and the second value represents
            the right stride.
            - If not provided, it defaults to chunk_length_s / 6.

    Returns:
        None: This method yields processed chunks of the input audio data and does not return a single value.

    Raises:
        ValueError: If the input data does not meet the expected format or requirements,
            such as missing keys in the dictionary input, incorrect stride length, or invalid chunk length.
        TypeError: If the type of the input does not match the expected type.
    """
    if isinstance(inputs, str):
        if inputs.startswith("http://") or inputs.startswith("https://"):
            # We need to actually check for a real protocol, otherwise it's impossible to use a local file
            # like http_hf-mirror.com.png
            inputs = requests.get(inputs, timeout=3).content
        else:
            with open(inputs, "rb") as f:
                inputs = f.read()

    if isinstance(inputs, bytes):
        inputs = ffmpeg_read(inputs, self.feature_extractor.sampling_rate)

    stride = None
    extra = {}
    if isinstance(inputs, dict):
        stride = inputs.pop("stride", None)
        # Accepting `"array"` which is the key defined in `datasets` for
        # better integration
        if not ("sampling_rate" in inputs and ("raw" in inputs or "array" in inputs)):
            raise ValueError(
                "When passing a dictionary to AutomaticSpeechRecognitionPipeline, the dict needs to contain a "
                '"raw" key containing the numpy array representing the audio and a "sampling_rate" key, '
                "containing the sampling_rate associated with that array"
            )

        _inputs = inputs.pop("raw", None)
        if _inputs is None:
            # Remove path which will not be used from `datasets`.
            inputs.pop("path", None)
            _inputs = inputs.pop("array", None)
        in_sampling_rate = inputs.pop("sampling_rate")
        extra = inputs
        inputs = _inputs
        if in_sampling_rate != self.feature_extractor.sampling_rate:
            transform = Resample(orig_freq=in_sampling_rate, new_freq=self.feature_extractor.sampling_rate)
            inputs = transform(inputs)
            ratio = self.feature_extractor.sampling_rate / in_sampling_rate
        else:
            ratio = 1
        if stride is not None:
            if stride[0] + stride[1] > inputs.shape[0]:
                raise ValueError("Stride is too large for input")

            # Stride needs to get the chunk length here, it's going to get
            # swallowed by the `feature_extractor` later, and then batching
            # can add extra data in the inputs, so we need to keep track
            # of the original length in the stride so we can cut properly.
            stride = (inputs.shape[0], int(round(stride[0] * ratio)), int(round(stride[1] * ratio)))
    if not isinstance(inputs, np.ndarray):
        raise ValueError(f"We expect a numpy ndarray as input, got `{type(inputs)}`")
    if len(inputs.shape) != 1:
        raise ValueError("We expect a single channel audio input for AutomaticSpeechRecognitionPipeline")

    if chunk_length_s:
        if stride_length_s is None:
            stride_length_s = chunk_length_s / 6

        if isinstance(stride_length_s, (int, float)):
            stride_length_s = [stride_length_s, stride_length_s]

        # Careful: this variable will not exist in the `seq2seq` setting.
        # Currently chunking is not possible at this level for `seq2seq` so
        # it's ok.
        align_to = getattr(self.model.config, "inputs_to_logits_ratio", 1)
        chunk_len = int(round(chunk_length_s * self.feature_extractor.sampling_rate / align_to) * align_to)
        stride_left = int(round(stride_length_s[0] * self.feature_extractor.sampling_rate / align_to) * align_to)
        stride_right = int(round(stride_length_s[1] * self.feature_extractor.sampling_rate / align_to) * align_to)

        if chunk_len < stride_left + stride_right:
            raise ValueError("Chunk length must be superior to stride length")

        yield from chunk_iter(
            inputs, self.feature_extractor, chunk_len, stride_left, stride_right, self.ms_dtype
        )
    else:
        if self.type == "seq2seq_whisper" and inputs.shape[0] > self.feature_extractor.n_samples:
            processed = self.feature_extractor(
                inputs,
                sampling_rate=self.feature_extractor.sampling_rate,
                truncation=False,
                padding="longest",
                return_tensors="ms",
            )
        else:
            processed = self.feature_extractor(
                inputs, sampling_rate=self.feature_extractor.sampling_rate, return_tensors="ms"
            )

        if self.ms_dtype is not None:
            processed = processed.to(dtype=self.ms_dtype)
        if stride is not None:
            if self.type == "seq2seq":
                raise ValueError("Stride is only usable with CTC models, try removing it !")

            processed["stride"] = stride
        yield {"is_last": True, **processed, **extra}