Data preprocessing¶
In this tutorial, we will guide you through the preparation of a dataset for training.
As in the Quick start, we will use the Large Movie Review Dataset.
To preprocess the dataset, there are two available approaches:
* the native MindSpore Dataset API
* modifying the BaseMapFunction API in MindNLP
While the native MindSpore approach gives you more flexibility, the BaseMapFunction approach wraps the code for better readability.
In addition, working in an Ascend or GPU environment introduces subtle differences in the preprocessing procedure, mainly due to different handling of dynamic shapes. We will see this as we proceed.
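If you are not sure which backend your environment uses, you can query it with MindSpore's get_context, the same call the preprocessing code later in this tutorial relies on:
import mindspore
# Prints the configured device target, e.g. 'Ascend', 'GPU' or 'CPU'
print(mindspore.get_context('device_target'))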
Load and split the dataset¶
First, load the dataset from Hugging Face repository:
from mindnlp import load_dataset
imdb_ds = load_dataset('imdb', split=['train', 'test'])
imdb_train = imdb_ds['train']
imdb_test = imdb_ds['test']
load_dataset accepts a dataset name for fetching remotely from the Hugging Face repository, as well as a local path pointing to a dataset stored on disk.
The split parameter tells load_dataset which splits of the dataset to fetch. Here it fetches both the training split ('train') and the test split ('test').
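If you only need one split, a single split name can be passed instead of a list. This is a minimal sketch, assuming split accepts a single string in the same way as the Hugging Face datasets API that load_dataset wraps:
# Fetch only the training split (assumption: a single split name is accepted)
imdb_train_only = load_dataset('imdb', split='train')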
To further split the training dataset into training and validation datasets, use the .split() method. The list of numbers specifies the proportion of data entries going to each split.
imdb_train, imdb_val = imdb_train.split([0.7, 0.3])
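To verify the proportions, you can check the number of entries in each split with the standard MindSpore Dataset API:
# Number of entries that went to the training and validation splits
print(imdb_train.get_dataset_size())
print(imdb_val.get_dataset_size())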
To take a peek at what the dataset looks like, get the first element from the dataset's iterator:
print(next(imdb_train.create_dict_iterator()))
Load the tokenizer¶
A tokenizer converts raw text into a format that the corresponding model can process, which is crucial for natural language processing tasks.
We make use of the AutoTokenizer from MindNLP to fetch and instantiate the appropriate tokenizer for a pre-trained model:
from mindnlp.transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
To get the corresponding tokenizer, supply the model name, in this case 'bert-base-cased', to the AutoTokenizer.from_pretrained method. It will then download the tokenizer required by your model.
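To see what the tokenizer produces, you can run it on a short sample sentence (the text here is arbitrary). Like its Hugging Face counterpart, the BERT tokenizer returns the input_ids, token_type_ids and attention_mask fields that the preprocessing steps below rely on:
sample = tokenizer("The movie was surprisingly good.")
print(sample['input_ids'])       # token ids, including [CLS] and [SEP]
print(sample['token_type_ids'])  # segment ids, all 0 for a single sentence
print(sample['attention_mask'])  # 1 for every real token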
Preprocess with native MindSpore¶
To preprocess the dataset with native MindSpore, we write a function process_dataset
that comprises the crucial steps:
import mindspore
import numpy as np
from mindspore.dataset import GeneratorDataset, transforms
def process_dataset(dataset: GeneratorDataset, tokenizer, max_seq_len=256, batch_size=32, shuffle=False, take_len=None):
    is_ascend = mindspore.get_context('device_target') == 'Ascend'

    # The tokenize function
    def tokenize(text):
        if is_ascend:
            tokenized = tokenizer(text, padding='max_length', truncation=True, max_length=max_seq_len)
        else:
            tokenized = tokenizer(text, truncation=True, max_length=max_seq_len)
        return tokenized['input_ids'], tokenized['token_type_ids'], tokenized['attention_mask']

    # Shuffle the order of the dataset
    if shuffle:
        dataset = dataset.shuffle(buffer_size=batch_size)

    # Select the first several entries of the dataset
    if take_len:
        dataset = dataset.take(take_len)

    # Apply the tokenize function, transforming the 'text' column into the three output columns generated by the tokenizer.
    dataset = dataset.map(operations=[tokenize], input_columns="text", output_columns=['input_ids', 'token_type_ids', 'attention_mask'])

    # Cast the datatype of the 'label' column to int32 and rename the column to 'labels'
    dataset = dataset.map(operations=transforms.TypeCast(mindspore.int32), input_columns="label", output_columns="labels")

    # Batch the dataset with padding.
    if is_ascend:
        dataset = dataset.batch(batch_size)
    else:
        dataset = dataset.padded_batch(batch_size, pad_info={'input_ids': (None, tokenizer.pad_token_id),
                                                             'token_type_ids': (None, 0),
                                                             'attention_mask': (None, 0)})

    return dataset
Here is a breakdown of each step:
Tokenization¶
The first step is tokenization. Tokenization converts the raw text into a format that can be fed into a machine learning model.
Define the tokenize function to process the text in each row of the dataset:
def tokenize(text):
    tokenized = tokenizer(text, truncation=True, max_length=max_seq_len)
    return tokenized['input_ids'], tokenized['token_type_ids'], tokenized['attention_mask']
Then make use of the GeneratorDataset.map API from MindSpore to map the tokenize operation onto all rows in the dataset. It takes the "text" column as input, tokenizes it and returns the "input_ids", "token_type_ids" and "attention_mask" columns as output:
dataset = dataset.map(operations=[tokenize], input_columns="text", output_columns=['input_ids', 'token_type_ids', 'attention_mask'])
Type casting¶
In some cases, the datatype of a column needs to be cast to a different one. Here, the "label" column in our dataset is originally of type Int64. We create an operation using mindspore.dataset.transforms to cast the datatype into Int32, then map this operation onto each element in the "label" column.
Notice that output_columns is given the name "labels" instead of "label", so the column is renamed as part of the mapping.
from mindspore.dataset import transforms
dataset = dataset.map(operations=transforms.TypeCast(mindspore.int32), input_columns="label", output_columns="labels")
Shuffling¶
Shuffling the order of dataset entries is important to ensure that the model does not learn the order of the data, which could lead to overfitting. Shuffle the dataset with the shuffle method:
dataset = dataset.shuffle(buffer_size=batch_size)
Note that shuffling should normally precede the batching step, ensuring that the entry order is randomized within each batch as well.
Batching with Padding¶
To facilitate batch processing in the model, we group every batch_size rows into one batch. A special requirement when batching a natural language dataset is to ensure that all sequences in a batch are of the same length. This is achieved by padding, which is included in the padded_batch method:
dataset = dataset.padded_batch(batch_size, pad_info={'input_ids': (None, tokenizer.pad_token_id),
                                                     'token_type_ids': (None, 0),
                                                     'attention_mask': (None, 0)})
So far, padded_batch only works on GPU platforms that support dynamic tensor shapes. If you are working with Ascend, you need to use the batch method instead. This works because on Ascend the tokenizer is called with padding='max_length', so every sequence is already padded to the same length before batching:
is_ascend = mindspore.get_context('device_target') == 'Ascend'  # Check whether the platform is Ascend
if is_ascend:
    dataset = dataset.batch(batch_size)
Taking a Subset¶
Sometimes you might want to train or test on a smaller subset of the data, for example to debug the training process. For this purpose, use the take method, which selects the specified number (take_len) of entries from the dataset:
dataset = dataset.take(take_len)
Now apply the preprocessing function to the dataset:
batch_size = 4 # Size of each batch
processed_dataset_train = process_dataset(imdb_train, tokenizer, batch_size=batch_size, shuffle=True)
Check the processed dataset:
print(next(processed_dataset_train.create_dict_iterator()))
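The validation and test splits can be processed in the same way. You can also pass take_len to work on a small subset, for example while debugging; the variable names and the value 100 below are arbitrary choices:
processed_dataset_val = process_dataset(imdb_val, tokenizer, batch_size=batch_size)
processed_dataset_test = process_dataset(imdb_test, tokenizer, batch_size=batch_size)
# Optional: a small subset of 100 entries for quick debugging runs
debug_dataset = process_dataset(imdb_train, tokenizer, batch_size=batch_size, take_len=100)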
Preprocess with BaseMapFunction¶
An alternative way to preprocess the dataset for training is through the BaseMapFunction
from MindNLP. You can modify the BaseMapFunction
to create your mapping function:
import mindspore as ms
from mindnlp.dataset import BaseMapFunction
class ModifiedMapFunction(BaseMapFunction):
    def __call__(self, text, label):
        tokenized = tokenizer(text, max_length=512, padding='max_length', truncation=True)
        labels = label.astype(ms.int32)
        return tokenized['input_ids'], tokenized['token_type_ids'], tokenized['attention_mask'], labels

map_fn = ModifiedMapFunction(['text', 'label'], ['input_ids', 'token_type_ids', 'attention_mask', 'labels'])
The modified map function will take the text and label from each entry, tokenize the text, cast the label into type Int32 and output the input_ids, token_type_ids, attention_mask and labels.
Note that the names of the input and output columns are defined only when the map function is instantiated.
You may notice that the map function does not involve any batching operation. This is because the Trainer class offers internal batching functionality, which can be enabled by setting the per_device_train_batch_size parameter in the TrainingArguments object.
Let's now pass the map_fn into the Trainer together with the other arguments:
from mindnlp.engine import Trainer, TrainingArguments
from mindnlp.transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2)
training_args = TrainingArguments(
    output_dir='../../output',
    per_device_train_batch_size=16
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=imdb_train,
    map_fn=map_fn,
)
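With the model, training arguments, dataset and map function in place, training can then be started through the Trainer's train method, mirroring the Hugging Face Trainer API:
# Start fine-tuning; the Trainer applies map_fn and batches the data internally
trainer.train()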