text2text_generation

mindnlp.transformers.pipelines.text2text_generation.Text2TextGenerationPipeline

Bases: Pipeline

Pipeline for text to text generation using seq2seq models.

Example
>>> from mindnlp.transformers import pipeline
...
>>> generator = pipeline("text2text-generation", model="t5-base")
>>> generator(
...     "answer: Manuel context: Manuel has created RuPERTa-base with the support of HF-Transformers and Google"
... )
[{'generated_text': 'question: Who created the RuPERTa-base?'}]

Learn more about the basics of using a pipeline in the pipeline tutorial. You can pass text generation parameters to this pipeline to control stopping criteria, decoding strategy, and more. Learn more about text generation parameters in Text generation strategies and Text generation.
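
For example (a minimal, hedged sketch; `t5-base` is an illustrative checkpoint and outputs depend on the model), decoding parameters such as `max_length`, `num_beams`, and `do_sample` are forwarded to the model's `generate` method, and `stop_sequence` controls early stopping:

>>> generator = pipeline("text2text-generation", model="t5-base")
>>> generator(
...     "summarize: MindNLP provides NLP models and pipelines built on MindSpore.",
...     max_length=32,      # cap on the generated length
...     num_beams=4,        # beam-search decoding strategy
...     do_sample=False,    # deterministic decoding
...     stop_sequence=".",  # only single-token stop sequences are fully supported; longer ones fall back to their first token
... )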

This Text2TextGenerationPipeline can currently be loaded from `pipeline` using the following task identifier: "text2text-generation".

The models that this pipeline can use are models that have been fine-tuned on a sequence-to-sequence task. See the up-to-date list of available models on hf-mirror.com/models. For a list of available parameters, see the documentation of the model's `generate` method.

Example
>>> text2text_generator = pipeline("text2text-generation")
>>> text2text_generator("question: What is 42 ? context: 42 is the answer to life, the universe and everything")
Source code in mindnlp/transformers/pipelines/text2text_generation.py
class Text2TextGenerationPipeline(Pipeline):
    """
    Pipeline for text to text generation using seq2seq models.

    Example:
        ```python
        >>> from mindnlp.transformers import pipeline
        ...
        >>> generator = pipeline("text2text-generation", model="t5-base")
        >>> generator(
        ...     "answer: Manuel context: Manuel has created RuPERTa-base with the support of HF-Transformers and Google"
        ... )
        [{'generated_text': 'question: Who created the RuPERTa-base?'}]
        ```

    Learn more about the basics of using a pipeline in the [pipeline tutorial](../pipeline_tutorial). You can pass text
    generation parameters to this pipeline to control stopping criteria, decoding strategy, and more. Learn more about
    text generation parameters in [Text generation strategies](../generation_strategies) and [Text
    generation](text_generation).

    This Text2TextGenerationPipeline can currently be loaded from [`pipeline`] using the following task
    identifier: `"text2text-generation"`.

    The models that this pipeline can use are models that have been fine-tuned on a sequence-to-sequence task. See the
    up-to-date list of available models on
    [hf-mirror.com/models](https://hf-mirror.com/models?filter=text2text-generation). For a list of available
    parameters, see the [following
    documentation](https://hf-mirror.com/docs/transformers/en/main_classes/text_generation#transformers.generation.GenerationMixin.generate)

    Example:
        ```python
        >>> text2text_generator = pipeline("text2text-generation")
        >>> text2text_generator("question: What is 42 ? context: 42 is the answer to life, the universe and everything")
        ```
    """
    # Used in the return key of the pipeline.
    return_name = "generated"

    def __init__(self, *args, **kwargs):
        """
        Initializes an instance of Text2TextGenerationPipeline.

        Args:
            self: An instance of the Text2TextGenerationPipeline class.

        Returns:
            None.

        Raises:
            None.
        """
        super().__init__(*args, **kwargs)

    def _sanitize_parameters(
            self,
            return_tensors=None,
            return_text=None,
            return_type=None,
            clean_up_tokenization_spaces=None,
            truncation=None,
            stop_sequence=None,
            **generate_kwargs,
    ):
        """
        Sanitizes the caller-supplied parameters and splits them into preprocessing, forward-pass, and
        postprocessing parameter dictionaries.

        Args:
            self: The instance of the class.
            return_tensors (bool): Whether to return the generated text as tensors. Default is None.
            return_text (bool): Whether to return the generated text as plain text. Default is None.
            return_type (ReturnType): The type of output to return. Can be either ReturnType.TENSORS or ReturnType.TEXT.
                Default is None.
            clean_up_tokenization_spaces (bool): Whether to clean up tokenization spaces in the generated text.
                Default is None.
            truncation (bool): Whether to truncate the input text during tokenization. Default is None.
            stop_sequence (str): The sequence of tokens that indicates the end of generation. Default is None.

        Returns:
            preprocess_params (dict): A dictionary containing preprocessing parameters.
            forward_params (dict): A dictionary containing forward pass parameters.
            postprocess_params (dict): A dictionary containing postprocessing parameters.

        Raises:
            Warning: When attempting to stop generation on a multiple token sequence, as this feature is not
                supported in transformers.
        """
        preprocess_params = {}
        if truncation is not None:
            preprocess_params["truncation"] = truncation

        forward_params = generate_kwargs

        postprocess_params = {}
        if return_tensors is not None and return_type is None:
            return_type = ReturnType.TENSORS if return_tensors else ReturnType.TEXT
        if return_type is not None:
            postprocess_params["return_type"] = return_type
        if clean_up_tokenization_spaces is not None:
            postprocess_params["clean_up_tokenization_spaces"] = clean_up_tokenization_spaces
        if return_text is not None:
            postprocess_params["return_type"] = ReturnType.TEXT if return_text else ReturnType.TENSORS

        if stop_sequence is not None:
            stop_sequence_ids = self.tokenizer.encode(stop_sequence, add_special_tokens=False)
            if len(stop_sequence_ids) > 1:
                warnings.warn(
                    "Stopping on a multiple token sequence is not yet supported on transformers. The first token of"
                    " the stop sequence will be used as the stop sequence string in the interim."
                )
            generate_kwargs["eos_token_id"] = stop_sequence_ids[0]

        return preprocess_params, forward_params, postprocess_params

    def check_inputs(self, input_length: int, min_length: int, max_length: int):
        """
        Checks whether there might be something wrong with the given input with regard to the model.
        """
        if input_length < min_length:
            logger.warning(
                f"Your min_length is set to {min_length}, but your input_length is only {input_length}. You might "
                "consider decreasing min_length manually, e.g. summarizer('...', min_length=10)"
            )
        if input_length > max_length:
            logger.warning(
                f"Your max_length is set to {max_length}, but your input_length is {input_length}. You might "
                "consider increasing max_length manually, e.g. summarizer('...', max_length=400)"
            )

        return True

    def _parse_and_tokenize(self, *args, truncation):
        """
        Parses and tokenizes the input text, prepending the model's task prefix (if any) and preparing the
        result for model inference.

        Args:
            self: An instance of the Text2TextGenerationPipeline class.
            *args: The input text (str) or list of input texts to tokenize.
            truncation: The truncation strategy forwarded to the tokenizer.

        Returns:
            The tokenized inputs ready for model inference (any token_type_ids produced by the tokenizer
            are removed, since they are not valid generate kwargs).

        Raises:
            ValueError:
                Raised if the tokenizer's pad_token_id is not set when batch input is used or if the input data
                format is incorrect.
        """
        prefix = self.model.config.prefix if self.model.config.prefix is not None else ""
        if isinstance(args[0], list):
            if self.tokenizer.pad_token_id is None:
                raise ValueError("Please make sure that the tokenizer has a pad_token_id when using a batch input")
            args = ([prefix + arg for arg in args[0]],)
            padding = True

        elif isinstance(args[0], str):
            args = (prefix + args[0],)
            padding = False
        else:
            raise ValueError(
                f" `args[0]`: {args[0]} have the wrong format. The should be either of type `str` or type `list`"
            )
        inputs = self.tokenizer(*args, padding=padding, truncation=truncation, return_tensors='ms')
        # This is produced by tokenizers but is an invalid generate kwargs
        if "token_type_ids" in inputs:
            del inputs["token_type_ids"]
        return inputs

    def __call__(self, *args, **kwargs):
        r"""
        Generate the output text(s) using text(s) given as inputs.

        Args:
            args (`str` or `List[str]`):
                Input text for the encoder.
            return_tensors (`bool`, *optional*, defaults to `False`):
                Whether or not to include the tensors of predictions (as token indices) in the outputs.
            return_text (`bool`, *optional*, defaults to `True`):
                Whether or not to include the decoded texts in the outputs.
            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
                Whether or not to clean up the potential extra spaces in the text output.
            truncation (`TruncationStrategy`, *optional*, defaults to `TruncationStrategy.DO_NOT_TRUNCATE`):
                The truncation strategy for the tokenization within the pipeline. `TruncationStrategy.DO_NOT_TRUNCATE`
                (default) will never truncate, but it is sometimes desirable to truncate the input to fit the model's
                max_length instead of throwing an error down the line.
            generate_kwargs:
                Additional keyword arguments to pass along to the generate method of the model (see the generate method
                corresponding to your framework [here](./model#generative-models)).

        Returns:
            A list or a list of lists of `dict`:
                Each result comes as a dictionary with the following keys:

                - **generated_text** (`str`, present when `return_text=True`) -- The generated text.
                - **generated_token_ids** (`mindspore.Tensor`, present when `return_tensors=True`) -- The token
                  ids of the generated text.
        """
        result = super().__call__(*args, **kwargs)
        if (
                isinstance(args[0], list)
                and all(isinstance(el, str) for el in args[0])
                and all(len(res) == 1 for res in result)
        ):
            return [res[0] for res in result]
        return result

    def preprocess(self, inputs, truncation=TruncationStrategy.DO_NOT_TRUNCATE, **kwargs):
        """
        Preprocesses the input text for text-to-text generation.

        Args:
            self (Text2TextGenerationPipeline): The instance of the Text2TextGenerationPipeline class.
            inputs (Union[str, List[str]]): The input text or list of input texts to be preprocessed.
            truncation (TruncationStrategy, optional): The strategy to use for truncating the input text
                if it exceeds the maximum length. Defaults to TruncationStrategy.DO_NOT_TRUNCATE.
            **kwargs: Additional keyword arguments to be passed to the _parse_and_tokenize method.

        Returns:
            The tokenized model inputs produced by `_parse_and_tokenize`.

        Raises:
            ValueError: If `inputs` is neither a string nor a list of strings, or if a batch input is used
                with a tokenizer that has no pad_token_id.
        """
        inputs = self._parse_and_tokenize(inputs, truncation=truncation, **kwargs)
        return inputs

    def _forward(self, model_inputs, **generate_kwargs):
        '''
        Forward pass method for the Text2TextGenerationPipeline class.

        Args:
            self: An instance of the Text2TextGenerationPipeline class.
            model_inputs (dict):
                A dictionary containing the model's input data.

                - input_ids (Tensor): A tensor representing the input sequence.
                  Shape: (batch_size, sequence_length)
                - Additional model-specific input tensors can be included.

        Returns:
            dict: A dictionary with key `output_ids` holding the generated token ids, reshaped to
            (batch_size, num_return_sequences, sequence_length).

        Raises:
            Exception: Any exception raised during model generation; out-of-range input lengths only
                trigger warnings via `check_inputs` and are not raised.
        '''
        in_b, input_length = model_inputs["input_ids"].shape

        self.check_inputs(
            input_length,
            generate_kwargs.get("min_length", self.model.config.min_length),
            generate_kwargs.get("max_length", self.model.config.max_length),
        )
        output_ids = self.model.generate(**model_inputs, **generate_kwargs)
        out_b = output_ids.shape[0]

        output_ids = output_ids.reshape(in_b, out_b // in_b, *output_ids.shape[1:])

        return {"output_ids": output_ids}

    def postprocess(self, model_outputs, return_type=ReturnType.TEXT, clean_up_tokenization_spaces=False):
        """
        Postprocesses the model outputs to generate the final records based on the specified return type.

        Args:
            self (Text2TextGenerationPipeline): The instance of the Text2TextGenerationPipeline class.
            model_outputs (dict): The model outputs containing the generated output_ids.
            return_type (ReturnType): The type of return value to be generated. Defaults to ReturnType.TEXT.
            clean_up_tokenization_spaces (bool): Flag indicating whether to clean up tokenization spaces. Defaults to False.

        Returns:
            list: A list of records containing the processed model outputs based on the specified return type.

        Raises:
            None.
        """
        records = []
        for output_ids in model_outputs["output_ids"][0]:
            if return_type == ReturnType.TENSORS:
                record = {f"{self.return_name}_token_ids": output_ids}
            elif return_type == ReturnType.TEXT:
                record = {
                    f"{self.return_name}_text": self.tokenizer.decode(
                        output_ids,
                        skip_special_tokens=True,
                        clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                    )
                }
            records.append(record)
        return records

mindnlp.transformers.pipelines.text2text_generation.Text2TextGenerationPipeline.__call__(*args, **kwargs)

Generate the output text(s) using text(s) given as inputs.

Parameters:
    args (`str` or `List[str]`):
        Input text for the encoder.
    return_tensors (`bool`, *optional*, defaults to `False`):
        Whether or not to include the tensors of predictions (as token indices) in the outputs.
    return_text (`bool`, *optional*, defaults to `True`):
        Whether or not to include the decoded texts in the outputs.
    clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
        Whether or not to clean up the potential extra spaces in the text output.
    truncation (`TruncationStrategy`, *optional*, defaults to `TruncationStrategy.DO_NOT_TRUNCATE`):
        The truncation strategy for the tokenization within the pipeline. `TruncationStrategy.DO_NOT_TRUNCATE`
        (default) will never truncate, but it is sometimes desirable to truncate the input to fit the model's
        max_length instead of throwing an error down the line.
    generate_kwargs:
        Additional keyword arguments to pass along to the model's generate method.

Returns:
    A list or a list of lists of `dict`: each result comes as a dictionary with the following keys:

    - generated_text (`str`, present when `return_text=True`): The generated text.
    - generated_token_ids (`mindspore.Tensor`, present when `return_tensors=True`): The token ids of the
      generated text.
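
A hedged usage sketch (outputs are illustrative placeholders, not actual model output; `t5-base` is an assumed checkpoint):

>>> generator = pipeline("text2text-generation", model="t5-base")
>>> generator("translate English to German: How are you?", return_text=True)
[{'generated_text': '...'}]
>>> generator("translate English to German: How are you?", return_tensors=True)
[{'generated_token_ids': Tensor(...)}]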
Source code in mindnlp/transformers/pipelines/text2text_generation.py
def __call__(self, *args, **kwargs):
    r"""
    Generate the output text(s) using text(s) given as inputs.

    Args:
        args (`str` or `List[str]`):
            Input text for the encoder.
        return_tensors (`bool`, *optional*, defaults to `False`):
            Whether or not to include the tensors of predictions (as token indices) in the outputs.
        return_text (`bool`, *optional*, defaults to `True`):
            Whether or not to include the decoded texts in the outputs.
        clean_up_tokenization_spaces (`bool`, *optional*, defaults to `False`):
            Whether or not to clean up the potential extra spaces in the text output.
        truncation (`TruncationStrategy`, *optional*, defaults to `TruncationStrategy.DO_NOT_TRUNCATE`):
            The truncation strategy for the tokenization within the pipeline. `TruncationStrategy.DO_NOT_TRUNCATE`
            (default) will never truncate, but it is sometimes desirable to truncate the input to fit the model's
            max_length instead of throwing an error down the line.
        generate_kwargs:
            Additional keyword arguments to pass along to the generate method of the model (see the generate method
            corresponding to your framework [here](./model#generative-models)).

    Returns:
        A list or a list of lists of `dict`:
            Each result comes as a dictionary with the following keys:

            - **generated_text** (`str`, present when `return_text=True`) -- The generated text.
            - **generated_token_ids** (`mindspore.Tensor`, present when `return_tensors=True`) -- The token
              ids of the generated text.
    """
    result = super().__call__(*args, **kwargs)
    if (
            isinstance(args[0], list)
            and all(isinstance(el, str) for el in args[0])
            and all(len(res) == 1 for res in result)
    ):
        return [res[0] for res in result]
    return result

mindnlp.transformers.pipelines.text2text_generation.Text2TextGenerationPipeline.__init__(*args, **kwargs)

Initializes an instance of Text2TextGenerationPipeline.

Parameters:
    *args, **kwargs:
        Forwarded unchanged to the base `Pipeline` constructor.

Returns:
    None.

Source code in mindnlp/transformers/pipelines/text2text_generation.py
def __init__(self, *args, **kwargs):
    """
    Initializes an instance of Text2TextGenerationPipeline.

    Args:
        self: An instance of the Text2TextGenerationPipeline class.

    Returns:
        None.

    Raises:
        None.
    """
    super().__init__(*args, **kwargs)

mindnlp.transformers.pipelines.text2text_generation.Text2TextGenerationPipeline.check_inputs(input_length, min_length, max_length)

Checks whether there might be something wrong with the given input with regard to the model.
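
A minimal sketch of the behavior (grounded in the source below: the method only logs warnings and always returns True):

>>> generator.check_inputs(input_length=5, min_length=10, max_length=50)
True
>>> # logs: "Your min_length is set to 10, but your input_length is only 5. ..."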

Source code in mindnlp/transformers/pipelines/text2text_generation.py
def check_inputs(self, input_length: int, min_length: int, max_length: int):
    """
    Checks whether there might be something wrong with the given input with regard to the model.
    """
    if input_length < min_length:
        logger.warning(
            f"Your min_length is set to {min_length}, but your input_length is only {input_length}. You might "
            "consider decreasing min_length manually, e.g. summarizer('...', min_length=10)"
        )
    if input_length > max_length:
        logger.warning(
            f"Your max_length is set to {max_length}, but your input_length is {input_length}. You might "
            "consider increasing max_length manually, e.g. summarizer('...', max_length=400)"
        )

    return True

mindnlp.transformers.pipelines.text2text_generation.Text2TextGenerationPipeline.postprocess(model_outputs, return_type=ReturnType.TEXT, clean_up_tokenization_spaces=False)

Postprocesses the model outputs to generate the final records based on the specified return type.

Parameters:
    model_outputs (`dict`):
        The model outputs containing the generated output_ids.
    return_type (`ReturnType`, defaults to `ReturnType.TEXT`):
        The type of return value to be generated.
    clean_up_tokenization_spaces (`bool`, defaults to `False`):
        Whether to clean up tokenization spaces.

Returns:
    `list`: A list of records containing the processed model outputs based on the specified return type.
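
For reference, a minimal sketch of the record shapes (keys derive from return_name = "generated"; `output_ids` and the decoded values are illustrative, and `ReturnType` is defined in the pipelines module):

>>> out = {"output_ids": output_ids}  # as produced by _forward: shape (batch, num_return_sequences, length)
>>> generator.postprocess(out, return_type=ReturnType.TEXT)
[{'generated_text': '...'}]
>>> generator.postprocess(out, return_type=ReturnType.TENSORS)
[{'generated_token_ids': Tensor(...)}]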

Source code in mindnlp/transformers/pipelines/text2text_generation.py
def postprocess(self, model_outputs, return_type=ReturnType.TEXT, clean_up_tokenization_spaces=False):
    """
    Postprocesses the model outputs to generate the final records based on the specified return type.

    Args:
        self (Text2TextGenerationPipeline): The instance of the Text2TextGenerationPipeline class.
        model_outputs (dict): The model outputs containing the generated output_ids.
        return_type (ReturnType): The type of return value to be generated. Defaults to ReturnType.TEXT.
        clean_up_tokenization_spaces (bool): Flag indicating whether to clean up tokenization spaces. Defaults to False.

    Returns:
        list: A list of records containing the processed model outputs based on the specified return type.

    Raises:
        None.
    """
    records = []
    for output_ids in model_outputs["output_ids"][0]:
        if return_type == ReturnType.TENSORS:
            record = {f"{self.return_name}_token_ids": output_ids}
        elif return_type == ReturnType.TEXT:
            record = {
                f"{self.return_name}_text": self.tokenizer.decode(
                    output_ids,
                    skip_special_tokens=True,
                    clean_up_tokenization_spaces=clean_up_tokenization_spaces,
                )
            }
        records.append(record)
    return records

mindnlp.transformers.pipelines.text2text_generation.Text2TextGenerationPipeline.preprocess(inputs, truncation=TruncationStrategy.DO_NOT_TRUNCATE, **kwargs)

Preprocesses the input text for text-to-text generation.

Parameters:
    inputs (`Union[str, List[str]]`):
        The input text or list of input texts to be preprocessed.
    truncation (`TruncationStrategy`, defaults to `TruncationStrategy.DO_NOT_TRUNCATE`):
        The strategy used to truncate the input text if it exceeds the model's maximum length.
    **kwargs:
        Additional keyword arguments passed to the _parse_and_tokenize method.

Returns:
    The tokenized model inputs.

Raises:
    ValueError: If inputs is neither a string nor a list of strings, or if a batch input is used with a
        tokenizer that has no pad_token_id.
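
A minimal sketch of calling preprocess directly (normally the pipeline does this for you; `truncation` also accepts the bool/str forms understood by the tokenizer):

>>> model_inputs = generator.preprocess(
...     "summarize: a very long document ...",
...     truncation=True,  # truncate to the model's max length instead of erroring later
... )
>>> sorted(model_inputs.keys())
['attention_mask', 'input_ids']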

Source code in mindnlp/transformers/pipelines/text2text_generation.py
def preprocess(self, inputs, truncation=TruncationStrategy.DO_NOT_TRUNCATE, **kwargs):
    """
    Preprocesses the input text for text-to-text generation.

    Args:
        self (Text2TextGenerationPipeline): The instance of the Text2TextGenerationPipeline class.
        inputs (Union[str, List[str]]): The input text or list of input texts to be preprocessed.
        truncation (TruncationStrategy, optional): The strategy to use for truncating the input text
            if it exceeds the maximum length. Defaults to TruncationStrategy.DO_NOT_TRUNCATE.
        **kwargs: Additional keyword arguments to be passed to the _parse_and_tokenize method.

    Returns:
        The tokenized model inputs produced by `_parse_and_tokenize`.

    Raises:
        ValueError: If `inputs` is neither a string nor a list of strings, or if a batch input is used
            with a tokenizer that has no pad_token_id.
    """
    inputs = self._parse_and_tokenize(inputs, truncation=truncation, **kwargs)
    return inputs