dataset.transforms

Dataset processing transforms.

dataset.transforms.Truncate

Definition

Class Truncate(max_seq_len)

Truncate the input sequence so that it does not exceed the maximum length.

Args:

  • max_seq_len (int): Maximum allowable length.

Raises:

  • TypeError: If max_seq_len is not of type int.
  • TypeError: If text_input is not a text line in a 1-D NumPy array.
  • ValueError: If max_seq_len is less than or equal to 0.
  • RuntimeError: If the data type of the input tensor is not bool, int, float, double, or str.

Examples:

import mindspore.dataset as ds
from mindnlp.dataset.transforms import Truncate

dataset = ds.NumpySlicesDataset(data=[['a', 'b', 'c', 'd', 'e']], shuffle=False)
# Data before 
# ['a' 'b' 'c' 'd' 'e']
truncate = Truncate(max_seq_len=3)
dataset = dataset.map(operations=truncate)
# Data after
# ['a' 'b' 'c']
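
Per the dtype note above, numeric sequences are supported as well. A minimal eager-mode sketch, assuming Truncate can be executed eagerly in the same way as the Lookup and PadTransform examples below:

import numpy as np
from mindnlp.dataset.transforms import Truncate

# Truncate a sequence of token ids to at most 3 elements (eager mode).
truncate_ids = Truncate(max_seq_len=3)
ids = np.array([11, 12, 13, 14, 15])
print(truncate_ids(ids))
# Expected: [11 12 13]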

dataset.transforms.AddToken

Definition

Class AddToken(token, begin=True)

Add a token to the beginning or end of a sequence. This is commonly used to insert special tags into a text sequence, for example to mark the start or end of the sequence in natural language processing tasks.

Args:

  • token (str): The token to be added.
  • begin (bool, optional): Where to insert the token. If True, the token is inserted at the beginning of the sequence; otherwise, it is inserted at the end. Default: True.

Raises:

  • TypeError: If token is not of type str.
  • TypeError: If the input is not a text line in a 1-D ndarray containing strings.
  • TypeError: If begin is not of type bool.

Example:

import mindspore.dataset as ds
from mindnlp.dataset.transforms import AddToken

dataset = ds.NumpySlicesDataset(data={"text": [['a', 'b', 'c', 'd', 'e']]})
# Data before
# ['a' 'b' 'c' 'd' 'e']
add_start_token_op = AddToken(token='<start>', begin=True)
dataset = dataset.map(operations=add_start_token_op)
# Data after
# ['<start>' 'a' 'b' 'c' 'd' 'e']
add_end_token_op = AddToken(token='<end>', begin=False)
dataset = dataset.map(operations=add_end_token_op)
# Data after
# ['<start>' 'a' 'b' 'c' 'd' 'e' '<end>']
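
The transform can also be applied eagerly, like the Lookup and PadTransform examples below; a minimal sketch:

import numpy as np
from mindnlp.dataset.transforms import AddToken

# Append an end tag to a 1-D string array (eager mode).
add_end = AddToken(token='<end>', begin=False)
print(add_end(np.array(['a', 'b', 'c'])))
# Expected: ['a' 'b' 'c' '<end>']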

dataset.transforms.Lookup

Definition

Class Lookup(vocab, unknown_token=None, return_dtype=mstype.int32)

Look up a word and map it to an id according to the input vocabulary table.

Args:

  • vocab (Vocab): A vocabulary object.
  • unknown_token (str, optional): Token used for out-of-vocabulary (OOV) words. If a word is not found in the vocabulary, the lookup result is replaced with the id of unknown_token. If unknown_token is not specified, or is itself not in the vocabulary, looking up an OOV word raises a runtime error. Default: None, meaning no unknown_token is specified.
  • return_dtype (mindspore.dtype, optional): The data type that lookup operation maps string to. Default: mindspore.int32.

Raises:

  • TypeError: If vocab is not of type text.Vocab.
  • TypeError: If unknown_token is not of type string.
  • TypeError: If return_dtype is not of type mindspore.dtype.

Example:

import mindspore.dataset as ds
import mindspore.dataset.text as text
from mindnlp.dataset.transforms import Lookup

# Load vocabulary from list
vocab = text.Vocab.from_list(['a', 'b', 'c', 'd', 'e'])
# Use lookup operation to map token to ids
lookup_op = Lookup(vocab, None)
ids = lookup_op(["b", "c"])
print("lookup: ids", ids)
# lookup: ids [1 2]

dataset.transforms.BasicTokenizer

Definition

Class BasicTokenizer(lower_case=False, py_transform=False)

Tokenize the input UTF-8 encoded string by specific rules.

Args:

  • lower_case (bool, optional): Whether to perform lower-casing on the text. If True, the text is folded to lower case and accented characters are stripped. If False, only normalization is performed on the text. Default: False.
  • py_transform (bool, optional): Whether to use the Python implementation. Default: False.

Raises:

  • TypeError: If lower_case is not of type bool.
  • TypeError: If py_transform is not of type bool.
  • TypeError: If text_input is not a text line in a 1-D NumPy array.
  • RuntimeError: If dtype of input Tensor is not str.

Example:

from mindnlp.dataset.transforms import BasicTokenizer

tokenizer_op = BasicTokenizer()
text = "Welcom to China!"
tokenized_text = tokenizer_op(text)
print("tokenized_text:", tokenized_text)
# tokenized_text: ['Welcome', 'to', 'China', '!']
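
A sketch of the lower_case option described above; the output shown is the expected case folding, not a verified result:

from mindnlp.dataset.transforms import BasicTokenizer

# Fold the text to lower case (and strip accents) before splitting.
tokenizer_op = BasicTokenizer(lower_case=True)
print(tokenizer_op("Welcome to China!"))
# Expected: ['welcome', 'to', 'china', '!']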

dataset.transforms.PadTransform

Definition

Class PadTransform(max_length: int, pad_value: int, return_length: bool = False)

Pad tensor to a fixed length with given padding value.

Args:

  • max_length (int): Maximum length to pad to.
  • pad_value (int): Value to pad the tensor with.
  • return_length (bool): Whether to also return the original sequence length. Default: False.

Raises:

  • TypeError: If max_length is not of type int.
  • TypeError: If pad_value is not of type int.
  • TypeError: If return_length is not of type bool.
  • TypeError: If text_input is not a text line in a 1-D ndarray containing strings.

Example:

import numpy as np
from mindnlp.dataset.transforms import PadTransform

pad_transform = PadTransform(max_length=10, pad_value='\0', return_length=True)
text_input = np.array(['Hello', 'world', 'this', 'is', 'a', 'test'], dtype='object')
text_output, length = pad_transform(text_input)

print("The length of the text sequence before filling:")
print(length) 
# 6
print("Filled text sequence:")
print(text_output)
# ['Hello' 'world' 'this' 'is' 'a' 'test' '\x00' '\x00' '\x00' '\x00']
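
Like the other transforms in this module, PadTransform can also be used in a dataset pipeline; a minimal sketch reusing the string padding shown above:

import mindspore.dataset as ds
from mindnlp.dataset.transforms import PadTransform

dataset = ds.NumpySlicesDataset(data={"text": [['a', 'b', 'c']]})
pad_op = PadTransform(max_length=5, pad_value='\0')
dataset = dataset.map(operations=pad_op, input_columns=["text"])
# Expected column content after mapping:
# ['a' 'b' 'c' '\x00' '\x00']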

dataset.transforms.JiebaTokenizer

Definition

Class JiebaTokenizer(dict_path='', custom_word_freq_dict=None)

Split a Chinese sentence into words using the jieba tokenizer.

Args:

  • dict_path (str, optional): Path to a custom dictionary file that jieba loads to extend its vocabulary, so that domain-specific or specialized words are recognized more reliably. Default: ''.
  • custom_word_freq_dict (dict, optional): Custom words and their frequencies. jieba takes these frequencies into account when segmenting, which makes the segmentation of such words more accurate. Default: None.

Raises:

  • TypeError: If dict_path is not of type str.
  • TypeError: If custom_word_freq_dict is not of type dict.

Methods

dataset.transforms.JiebaTokenizer.tokenize

def tokenize(self, sentence, cut_all=False, HMM=True)

Args:

  • sentence (str): The sentence to be segmented.
  • cut_all (bool, optional): Whether to use full mode. If True, every possible word in the sentence is segmented, which produces more (and possibly redundant) results and is only suitable for specific scenarios. If False, precise mode is used. Default: False.
  • HMM (bool, optional): Whether to use the Hidden Markov Model to recognize new words, such as words not in the dictionary. Enabling it can improve segmentation accuracy, especially for rare or new words. Default: True.

Example:

from mindnlp.dataset.transforms import JiebaTokenizer

tokenizer = JiebaTokenizer()
sentence = "今天天气真好,适合出去玩。"
tokens = tokenizer.tokenize(sentence)
print(tokens)
# ['今天天气', '真', '好', ',', '适合', '出去玩', '。']
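
A sketch of the options described above: full-mode segmentation (cut_all=True) and a custom word-frequency dictionary. The word '出去玩' and its frequency are illustrative, and the exact output depends on the jieba dictionary in use:

from mindnlp.dataset.transforms import JiebaTokenizer

# Provide custom word frequencies so jieba prefers these segmentations.
tokenizer = JiebaTokenizer(custom_word_freq_dict={"出去玩": 5})
sentence = "今天天气真好,适合出去玩。"
# Full mode: segment every word jieba can find, possibly with overlaps.
print(tokenizer.tokenize(sentence, cut_all=True))
# Precise mode (default), with HMM-based new-word recognition enabled.
print(tokenizer.tokenize(sentence, cut_all=False, HMM=True))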