Sequence Tagging
Sequence tagging is the task of assigning a tag to each token of an input sequence. Sequence tagging problems are widely used for information extraction from text, including word segmentation, POS tagging, named entity recognition (NER), chunking, etc.
Take the chunking task as an example:
Text chunking consists of dividing a text into syntactically correlated groups of words. For example, the sentence "He reckons the current account deficit will narrow to only # 1.8 billion in September ." can be divided as follows:
[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ]. (NP: noun phrase; VP: verb phrase; PP: prepositional phrase)
The goal of this task is to develop machine learning methods that, after a training phase, can recognize the chunk segmentation of the test data as accurately as possible.
CoNLL2000Chunking:
The chunking tags in the CoNLL2000Chunking dataset are based on the IOB (Inside, Outside, Beginning) tagging scheme, which is commonly used for chunking tasks. In the IOB scheme, each word in a sentence is labeled with a chunk tag that indicates whether the word begins a chunk, is inside a chunk, or is outside any chunk.
Hint
The CoNLL2000Chunking dataset includes a set of predefined chunk types, such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP). The chunk tags in the dataset are formed by combining the chunk type with the IOB tag, using the format “I-TYPE” for inside words, “B-TYPE” for beginning words, and “O” for outside words. For example, the chunk tag “B-NP” indicates the beginning of a noun phrase, while the chunk tag “I-VP” indicates an inside word in a verb phrase.
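To make the scheme concrete, the following is a small, hypothetical helper (not part of mindnlp) that converts chunk spans into IOB tags for the example sentence above:

```python
def chunks_to_iob(tokens, chunks):
    """chunks: list of (start, end_exclusive, type); tokens outside any chunk get "O"."""
    tags = ["O"] * len(tokens)
    for start, end, chunk_type in chunks:
        tags[start] = f"B-{chunk_type}"          # first word of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{chunk_type}"          # remaining words of the chunk
    return tags

tokens = ["He", "reckons", "the", "current", "account", "deficit"]
chunks = [(0, 1, "NP"), (1, 2, "VP"), (2, 6, "NP")]
print(chunks_to_iob(tokens, chunks))
# ['B-NP', 'B-VP', 'B-NP', 'I-NP', 'I-NP', 'I-NP']
```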
Example:
Sentence    They   refuse  to     permit  us     to     enter  .
Chunk Tag   B-NP   B-VP    B-PP   B-VP    B-NP   B-PP   B-VP   O
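Going the other way, evaluation typically recovers chunk spans from a predicted tag sequence. A minimal sketch of that decoding (a hypothetical helper, not part of mindnlp):

```python
def iob_to_chunks(tags):
    """Recover (start, end_exclusive, type) spans from an IOB tag sequence."""
    chunks, start, ctype = [], None, None
    for i, tag in enumerate(tags):
        # A "B-" tag, an "O" tag, or a mismatched "I-" tag closes the open chunk.
        if tag.startswith("B-") or tag == "O" or (ctype and tag != f"I-{ctype}"):
            if ctype is not None:
                chunks.append((start, i, ctype))
                start, ctype = None, None
        if tag.startswith("B-"):
            start, ctype = i, tag[2:]
    if ctype is not None:                         # close a chunk that runs to the end
        chunks.append((start, len(tags), ctype))
    return chunks

print(iob_to_chunks(["B-NP", "I-NP", "O", "B-VP"]))
# [(0, 2, 'NP'), (3, 4, 'VP')]
```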
The following is an example of training a chunking model with the chunk tags of the CoNLL2000Chunking dataset and a Bi-LSTM+CRF model:
Define Model
First define the Head of the model, then inherit Seq2vecModel from mindnlp.abc and use the CRF in mindnlp.modules to complete the definition of the BiLSTM_CRF model.
import math
from mindspore import nn
from mindspore.common.initializer import Uniform, HeUniform
from mindnlp.abc import Seq2vecModel
from mindnlp.modules import CRF

class Head(nn.Cell):
    """ Head for BiLSTM-CRF model """
    def __init__(self, hidden_dim, num_tags):
        super().__init__()
        weight_init = HeUniform(math.sqrt(5))
        bias_init = Uniform(1 / math.sqrt(hidden_dim * 2))
        # Project encoder features to per-tag scores.
        self.hidden2tag = nn.Dense(hidden_dim, num_tags,
                                   weight_init=weight_init, bias_init=bias_init)

    def construct(self, context):
        return self.hidden2tag(context)

class BiLSTM_CRF(Seq2vecModel):
    """ BiLSTM-CRF model """
    def __init__(self, encoder, head, num_tags):
        super().__init__(encoder, head)
        self.encoder = encoder
        self.head = head
        self.crf = CRF(num_tags, batch_first=True)

    def construct(self, text, seq_length, label=None):
        output, _, _ = self.encoder(text)
        feats = self.head(output)
        # With labels the CRF returns the loss; without, the decoded tag sequence.
        res = self.crf(feats, label, seq_length)
        return res
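The CRF layer's training objective is the negative log-likelihood of the gold tag path under a linear-chain CRF: the path score minus the log-partition over all paths, computed with the forward algorithm. A plain-Python sketch of that computation, using nested lists for a tiny tag set rather than tensors (not mindnlp's actual implementation):

```python
import math

def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of one tag sequence under a linear-chain CRF.
    emissions: [T][K] scores, transitions: [K][K], tags: length-T gold path."""
    T, K = len(emissions), len(emissions[0])
    # Score of the gold path: emission terms plus transition terms.
    score = emissions[0][tags[0]]
    for t in range(1, T):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    # Log-partition via the forward algorithm (log-sum-exp over all paths).
    alpha = list(emissions[0])
    for t in range(1, T):
        alpha = [
            math.log(sum(math.exp(alpha[i] + transitions[i][j]) for i in range(K)))
            + emissions[t][j]
            for j in range(K)
        ]
    log_z = math.log(sum(math.exp(a) for a in alpha))
    return log_z - score
```

With all-zero scores every one of the K^T paths is equally likely, so the loss reduces to log(K^T), which is a handy sanity check.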
Define Hyperparameters
The following are some of the required hyperparameters in the model training process.
embedding_dim = 16
hidden_dim = 32
Data Preprocessing
The dataset is downloaded and preprocessed by calling the dataset interfaces in mindnlp.dataset.
Load datasets:
from mindnlp.dataset import CoNLL2000Chunking
dataset_train, dataset_test = CoNLL2000Chunking()
Initialize the vocab for preprocessing:
from mindspore.dataset import text
vocab = text.Vocab.from_dataset(dataset_train, columns=["words"], freq_range=None, top_k=None,
                                special_tokens=["<pad>", "<unk>"], special_first=True)
Process datasets:
from mindnlp.dataset import CoNLL2000Chunking_Process
dataset_train = CoNLL2000Chunking_Process(dataset=dataset_train, vocab=vocab,
                                          batch_size=32, max_len=80)
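Conceptually, the processing step looks each token up in the vocab, then pads or truncates every sentence to max_len while recording the true sequence length for the CRF. A plain-Python sketch of that idea (not the actual mindnlp implementation; `encode` is a hypothetical helper):

```python
def encode(tokens, token_to_id, max_len, pad_id=0, unk_id=1):
    """Map tokens to ids, truncate/pad to max_len, and return the true length."""
    ids = [token_to_id.get(tok, unk_id) for tok in tokens]   # unknown words -> <unk>
    seq_length = min(len(ids), max_len)
    ids = ids[:max_len] + [pad_id] * (max_len - len(ids))    # pad with <pad> id
    return ids, seq_length

vocab_map = {"<pad>": 0, "<unk>": 1, "They": 2, "refuse": 3, "to": 4}
print(encode(["They", "refuse", "to", "fly"], vocab_map, max_len=6))
# ([2, 3, 4, 1, 0, 0], 4)
```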
Instantiate Model
from mindnlp.modules import RNNEncoder
embedding = nn.Embedding(vocab_size=len(vocab.vocab()), embedding_size=embedding_dim,
padding_idx=vocab.tokens_to_ids("<pad>"))
lstm_layer = nn.LSTM(embedding_dim, hidden_dim // 2, bidirectional=True, batch_first=True)
encoder = RNNEncoder(embedding, lstm_layer)
head = Head(hidden_dim, 23)  # 23 chunk tags: 11 chunk types in B-/I- form, plus O
net = BiLSTM_CRF(encoder, head, 23)
Define Optimizer
from mindspore import ops
optimizer = nn.SGD(net.trainable_params(), learning_rate=0.01, weight_decay=1e-4)
# Differentiate the loss w.r.t. the optimizer's parameters (grad_position=None).
grad_fn = ops.value_and_grad(net, None, optimizer.parameters)
Define Train Step
def train_step(data, seq_length, label):
    """ train step """
    loss, grads = grad_fn(data, seq_length, label)
    loss = ops.depend(loss, optimizer(grads))
    return loss
Training Process
Now that we have completed all the preparations, we can begin to train the model.
from tqdm import tqdm
size = dataset_train.get_dataset_size()
steps = size
with tqdm(total=steps) as t:
    for batch, (data, seq_length, label) in enumerate(dataset_train.create_tuple_iterator()):
        loss = train_step(data, seq_length, label)
        t.set_postfix(loss=loss)
        t.update(1)
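At inference time, calling the model without labels makes the CRF decode the highest-scoring tag path with the Viterbi algorithm. A plain-Python sketch of that decoding step, again using small lists instead of tensors (not mindnlp's actual implementation):

```python
def viterbi_decode(emissions, transitions):
    """Best tag path under a linear-chain CRF.
    emissions: [T][K] scores, transitions: [K][K]; returns a length-T tag list."""
    T, K = len(emissions), len(emissions[0])
    score = list(emissions[0])
    back = []                                     # backpointers per time step
    for t in range(1, T):
        new_score, ptr = [], []
        for j in range(K):
            best_i = max(range(K), key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score, back = new_score, back + [ptr]
    best = max(range(K), key=lambda j: score[j])  # best final tag
    path = [best]
    for ptr in reversed(back):                    # follow backpointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi_decode([[1, 0], [0, 1]], [[0, 0], [0, 0]]))
# [0, 1]
```

Once decoded, the predicted tag ids can be mapped back to IOB strings and scored against the gold chunks, e.g. with chunk-level precision/recall/F1 as in the original CoNLL-2000 shared task.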