
Data preprocessing

In this tutorial, we will guide you through the preparation of a dataset for training.

Same as in the Quick start, we will use the Large Movie Review Dataset.

To preprocess the dataset, there are two available approaches:

  • The native MindSpore dataset API
  • Modifying the BaseMapFunction API in MindNLP

While the native MindSpore approach gives you more flexibility, the BaseMapFunction approach wraps the code for better readability.

In addition, working in an Ascend or GPU environment brings subtle differences in the processing procedure, mainly due to different handling of dynamic shapes. We will see this as we proceed.

Load and split the dataset

First, load the dataset from Hugging Face repository:

from mindnlp import load_dataset

imdb_ds = load_dataset('imdb', split=['train', 'test'])
imdb_train = imdb_ds['train']
imdb_test = imdb_ds['test']

load_dataset accepts a dataset name for fetching remotely from the Hugging Face repository, as well as a local path pointing to a dataset stored on disk.

The split parameter tells load_dataset which split(s) of the dataset to fetch. Here it fetches both the training split ('train') and the test split ('test').
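If you only need one split, you can pass a single name instead. A minimal sketch, assuming load_dataset mirrors the Hugging Face datasets behaviour of returning a single dataset when split is a string:

# Fetch only the test split (assumed to return a single dataset rather than a collection)
imdb_test_only = load_dataset('imdb', split='test')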

To further split the training dataset into training and validation datasets, use the .split() method. The list of numbers specifies the proportion of data entries going to each split.

imdb_train, imdb_val = imdb_train.split([0.7, 0.3])
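To verify the proportions, you can count the entries in each split with the get_dataset_size method from the MindSpore dataset API:

# With the 25,000 entries in the IMDb training set, this prints roughly 17,500 and 7,500
print(imdb_train.get_dataset_size())
print(imdb_val.get_dataset_size())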

To take a peek at what the dataset looks like, get the first element from the dataset's iterator:

print(next(imdb_train.create_dict_iterator()))
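The printed entry is a dictionary with a 'text' and a 'label' column, roughly of the following shape (the review text and label value here are illustrative):

# {'text': Tensor(shape=[], dtype=String, value='One of the best movies I have seen...'),
#  'label': Tensor(shape=[], dtype=Int64, value=1)}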

Load the tokenizer

A tokenizer converts raw text into a format that the corresponding model can process, which is crucial for natural language processing tasks.

We make use of the AutoTokenizer from MindNLP to fetch and instantiate the appropriate tokenizer for a pre-trained model:

from mindnlp.transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

To get the corresponding tokenizer, you can supply the model name, in this case 'bert-base-cased', to the AutoTokenizer.from_pretrained method. It will then download the tokenizer required by your model.
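To see what the tokenizer produces, you can run it on a short sample sentence. A minimal sketch (the sentence is arbitrary, and the exact ids depend on the model's vocabulary):

sample = tokenizer("Hello, world!")
print(sample['input_ids'])       # token ids, e.g. [101, 8667, 117, 1362, 106, 102]
print(sample['token_type_ids'])  # segment ids, all 0 for a single sentence
print(sample['attention_mask'])  # 1 for real tokens, 0 for padding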

Preprocess with native MindSpore

To preprocess the dataset with native MindSpore, we write a function process_dataset that comprises the crucial steps:

import mindspore
import numpy as np
from mindspore.dataset import GeneratorDataset, transforms

def process_dataset(dataset: GeneratorDataset, tokenizer, max_seq_len=256, batch_size=32, shuffle=False, take_len=None):
    is_ascend = mindspore.get_context('device_target') == 'Ascend'
    # The tokenize function
    def tokenize(text):
        if is_ascend:
            tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=max_seq_len)
        else:
            tokenized = tokenizer(text, truncation=True, max_length=max_seq_len)
        return tokenized['input_ids'], tokenized['token_type_ids'], tokenized['attention_mask']

    # Shuffle the order of the dataset
    if shuffle:
        dataset = dataset.shuffle(buffer_size=batch_size)

    # Select the first several entries of the dataset
    if take_len:
        dataset = dataset.take(take_len)

    # Apply the tokenize function, transforming the 'text' column into the three output columns generated by the tokenizer.
    dataset = dataset.map(operations=[tokenize], input_columns="text", output_columns=['input_ids', 'token_type_ids', 'attention_mask'])
    # Cast the datatype of the 'label' column to int32 and rename the column to 'labels'
    dataset = dataset.map(operations=transforms.TypeCast(mindspore.int32), input_columns="label", output_columns="labels")
    # Batch the dataset with padding.
    if is_ascend:
        dataset = dataset.batch(batch_size)
    else:
        dataset = dataset.padded_batch(batch_size, pad_info={'input_ids': (None, tokenizer.pad_token_id),
                                                             'token_type_ids': (None, 0),
                                                             'attention_mask': (None, 0)})
    return dataset

Here is a breakdown of each step:

  • Tokenization

    The first step is tokenization, which converts the raw text into a format that can be fed into a machine learning model.

Define the tokenize function to process the text in each row of the dataset.

def tokenize(text):
    tokenized = tokenizer(text, truncation=True, max_length=max_seq_len)
    return tokenized['input_ids'], tokenized['token_type_ids'], tokenized['attention_mask']

Then make use of the GeneratorDataset.map API from MindSpore to map the tokenize operation onto all rows in the dataset. It will take the "text" column as input, tokenize it and return "input_ids", "token_type_ids" and "attention_mask" columns as output.

dataset = dataset.map(operations=[tokenize], input_columns="text", output_columns=['input_ids', 'token_type_ids', 'attention_mask'])

  • Type casting

    In some cases, the datatype of a column needs to be cast to a different one. Here, the "label" column in our dataset is originally of type Int64. We create an operation using mindspore.dataset.transforms to cast the datatype to Int32, then map this operation onto each element in the "label" column.

Notice that output_columns is given the name "labels" instead of "label", so the column is renamed as part of the mapping.

from mindspore.dataset import transforms
dataset = dataset.map(operations=transforms.TypeCast(mindspore.int32), input_columns="label", output_columns="labels")

  • Shuffling

    Shuffling the order of dataset entries is important to ensure that the model does not learn the order of the data, which could lead to overfitting. Shuffle the dataset with the shuffle method:

    dataset = dataset.shuffle(buffer_size=batch_size)
    
    Note that normally shuffling should precede the batching step, ensuring that the entry order is randomized within each batch as well.

  • Batching with Padding

    To facilitate batch processing in the model, we group every batch_size rows into one batch. A special requirement when batching a natural language dataset is to ensure that all sequences in a batch are of the same length. This is achieved by padding, which is built into the padded_batch method.

    dataset = dataset.padded_batch(batch_size, pad_info={'input_ids': (None, tokenizer.pad_token_id),
                                                         'token_type_ids': (None, 0),
                                                         'attention_mask': (None, 0)})
    
    So far, padded_batch only works on GPU platforms, which support dynamic tensor shapes. If you are working with Ascend, use the batch method instead. In that case the tokenizer has already padded every sequence to max_seq_len (padding='max_length'), so no extra padding is needed at batching time:

    is_ascend = mindspore.get_context('device_target') == 'Ascend' # Check whether the platform is Ascend
    if is_ascend:
        dataset = dataset.batch(batch_size)


  • Taking a Subset

Sometimes, you might want to train or test on a smaller subset of the data, for example to debug the training process. For this purpose, use the take method, which selects the specified number (take_len) of entries from the dataset:

dataset = dataset.take(take_len)

Now apply the preprocessing function to the dataset:

batch_size = 4 # Size of each batch
processed_dataset_train = process_dataset(imdb_train, tokenizer, batch_size=batch_size, shuffle=True)

Check the processed dataset:

print(next(processed_dataset_train.create_dict_iterator()))
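Each column is now batched. On Ascend, where every sequence is padded to max_seq_len, the printed dictionary should look roughly like the following (shapes and dtypes are illustrative; on GPU the sequence length varies per batch):

# {'input_ids':      Tensor(shape=[4, 256], dtype=Int64),
#  'token_type_ids': Tensor(shape=[4, 256], dtype=Int64),
#  'attention_mask': Tensor(shape=[4, 256], dtype=Int64),
#  'labels':         Tensor(shape=[4], dtype=Int32)}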

Preprocess with BaseMapFunction

An alternative way to preprocess the dataset for training is through the BaseMapFunction from MindNLP. You can subclass BaseMapFunction to create your own mapping function:

import mindspore as ms
from mindnlp.dataset import BaseMapFunction

class ModifiedMapFunction(BaseMapFunction):
    def __call__(self, text, label):
        tokenized = tokenizer(text, max_length=512, padding='max_length', truncation=True)
        labels = label.astype(ms.int32)
        return tokenized['input_ids'], tokenized['token_type_ids'], tokenized['attention_mask'], labels

map_fn = ModifiedMapFunction(['text', 'label'], ['input_ids', 'token_type_ids', 'attention_mask', 'labels'])

The modified map function takes the text and label from each entry, tokenizes the text, casts the label to type Int32 and outputs the input_ids, token_type_ids, attention_mask and labels.

Note that the names of input and output columns are defined only when the map function is instantiated.

You may notice that the map function does not involve the batching operation. This is because the Trainer class offers internal batching functionality, which can be enabled by setting the per_device_train_batch_size parameter in the TrainingArguments object.

Let's now pass the map_fn into the Trainer together with other arguments:

from mindnlp.engine import Trainer, TrainingArguments
from mindnlp.transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)
training_args = TrainingArguments(
    output_dir='../../output',
    per_device_train_batch_size=16
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=imdb_train,
    map_fn=map_fn,
)
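With the map function attached, the dataset is preprocessed on the fly during training, and you can start training as usual:

trainer.train()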