Dataset and transforms

Dataset

mindnlp provides download interfaces for a number of datasets, which can be used to download and load them directly. Grouped by task, the datasets currently included are:

  • Machine Translation

    • IWSLT2016

    • IWSLT2017

    • Multi30k

  • Question Answering

    • SQuAD1

    • SQuAD2

  • Sequence Tagging

    • CoNLL2000Chunking

    • UDPOS

  • Text Classification

    • AG_NEWS

    • AmazonReviewFull

    • AmazonReviewPolarity

    • CoLA

    • DBpedia

    • IMDB

    • MNLI

    • MRPC

    • QNLI

    • QQP

    • RTE

    • SogouNews

    • SST2

    • STSB

    • WNLI

    • YahooAnswers

    • YelpReviewFull

    • YelpReviewPolarity

  • Text Generation

    • LCSTS

    • PennTreebank

    • WikiText2

    • WikiText103

Dataset Loading

There are two ways to load a dataset: by calling the dataset's corresponding interface, or by calling the unified load() interface.

Method 1: Load by corresponding interface

The corresponding interface can be found under mindnlp.dataset. Here, the Multi30k dataset is used as an example:

from mindnlp.dataset import Multi30k

The parameter list and return values can be found in the interface's annotations, or in the corresponding docs on the mindnlp website:

Parameters:

  • root (str) - Directory where the datasets are saved. Default: '~/.mindnlp'.

  • split (str|Tuple[str]) - Split or splits to be returned. Default: ('train', 'valid', 'test').

  • language_pair (Tuple[str]) - Tuple containing src and tgt language. Default: ('de', 'en').

  • proxies (dict) - A dict to identify proxies, for example: {"https": "https://127.0.0.1:7890"}.

Returns:

  • datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

For convenience, we use the default values for all parameters except the first:

multi30k_train, multi30k_valid, multi30k_test = Multi30k("./dataset")

If you only want the train split, you just need to pass it via the split parameter:

multi30k_train = Multi30k(root="./dataset", split='train')

Method 2: Load by unified interface

Alternatively, we can load a dataset through the unified load() interface. The first parameter is a string specifying the dataset:

from mindnlp.dataset import load
multi30k_train, multi30k_valid, multi30k_test = load('multi30k')

The remaining parameters can be passed according to the corresponding interface:

multi30k_train, multi30k_valid, multi30k_test = load('multi30k', root="./dataset")

Customizing Dataset

If you want to use a customized dataset, more information about customizing datasets can be found on the mindspore website.
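
As a rough illustration, a custom dataset can be wrapped with mindspore's GeneratorDataset (a minimal sketch; the sentence pairs are placeholders invented for this example, not Multi30k data):

import mindspore.dataset as ds

# Placeholder German-English pairs, for illustration only
pairs = [
    ("Ein Mann liest ein Buch.", "A man is reading a book."),
    ("Eine Frau trinkt Kaffee.", "A woman is drinking coffee."),
]

def generator():
    # Yield one (de, en) pair per row; column names are assigned below
    for de, en in pairs:
        yield de, en

custom_dataset = ds.GeneratorDataset(generator, column_names=['de', 'en'])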

Dataset Iteration

There are usually multiple columns in a dataset, and you can query the column names using the get_col_names() interface:

dataset_train = multi30k_train  # use the train split loaded above
dataset_train.get_col_names()
['de', 'en']

After the dataset is loaded, the data is obtained iteratively and then sent to the neural network for training. We can use the create_tuple_iterator() or create_dict_iterator() interface to create an iterator for data access. Combining this with the column names interface above:

for de_value, en_value in dataset_train.create_tuple_iterator():
    print(de_value)
    print(en_value)
    break
Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
Two young, White males are outside near many bushes.
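
For comparison, the same loop can be written with create_dict_iterator(), which yields a dict keyed by column name (a minimal sketch):

for batch in dataset_train.create_dict_iterator():
    print(batch['de'])
    print(batch['en'])
    break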

Data Transforms

Common Operations

The most important operation in data transformation is map. map applies a given data transform to each element of a specified column and returns the transformed dataset. Here, BasicTokenizer() is used to tokenize two columns of the dataset, and from_dataset is used to generate the vocabs:

from mindnlp.transforms import BasicTokenizer
import mindspore.dataset.text as text

tokenizer = BasicTokenizer(True)  # True enables lower-casing
dataset_train = dataset_train.map([tokenizer], 'en')
dataset_train = dataset_train.map([tokenizer], 'de')

# Build one vocabulary per language from the tokenized columns
en_vocab = text.Vocab.from_dataset(dataset_train, 'en', special_tokens=['<pad>', '<unk>'], special_first=True)
de_vocab = text.Vocab.from_dataset(dataset_train, 'de', special_tokens=['<pad>', '<unk>'], special_first=True)
vocab = {'en': en_vocab, 'de': de_vocab}
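
As a follow-up sketch (an extra step, not required by the tutorial), the vocabs built above can be applied with text.Lookup to map each token to its index, with unknown tokens falling back to '<unk>':

dataset_train = dataset_train.map([text.Lookup(en_vocab, unknown_token='<unk>')], 'en')
dataset_train = dataset_train.map([text.Lookup(de_vocab, unknown_token='<unk>')], 'de')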

Data Preprocessing in mindnlp

Different datasets in different domains require different processing. In mindnlp, dedicated processing functions are provided to help us process data quickly. As with loading, there are two ways to process data. Using the Multi30k dataset as an example:

Method 1: Process by corresponding interface

The corresponding interface can be found under mindnlp.dataset; its name is the dataset's name followed by an underscore and Process. The vocab in the code below was generated above:

from mindnlp.dataset import Multi30k_Process
dataset_train = Multi30k_Process(dataset_train, vocab=vocab)

The parameter list and return values can be found in the interface's annotations, or in the corresponding docs on the mindnlp website (a sketch overriding the defaults follows the list):

Parameters:

  • dataset (GeneratorDataset) - Multi30k dataset.

  • vocab (Vocab) - Vocabulary object, used to store the mapping between tokens and indices.

  • batch_size (int) - The number of rows each batch is created with. Default: 64.

  • max_len (int) - The max length of a sentence. Default: 500.

  • drop_remainder (bool) - Whether to discard the last batch when it contains fewer than batch_size entries, rather than passing it to the next operation. Default: False.

Returns:

  • dataset (MapDataset) - Dataset after transforms.
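
For instance, the defaults can be overridden explicitly (a sketch; the values here are purely illustrative):

dataset_train = Multi30k_Process(dataset_train, vocab=vocab, batch_size=32, max_len=64, drop_remainder=True)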

Method 2: Process by unified interface

from mindnlp.dataset import process
dataset_train = process('Multi30k', dataset_train, vocab=vocab)

For the complete code, please check out the github repository.

Customizing Preprocess

If you want to preprocess the dataset yourself, please refer to the relevant operations on the mindspore website.