
Using Pre-trained Language Model to Enhance Active Learning for Sentence Matching

Published: 30 December 2021


Abstract

Active learning is an effective method to substantially alleviate the problem of expensive annotation cost for data-driven models. Recently, pre-trained language models have been demonstrated to be powerful for learning language representations. In this article, we demonstrate that the pre-trained language model can also utilize its learned textual characteristics to enrich criteria of active learning. Specifically, we provide extra textual criteria with the pre-trained language model to measure instances, including noise, coverage, and diversity. With these extra textual criteria, we can select more efficient instances for annotation and obtain better results. We conduct experiments on both English and Chinese sentence matching datasets. The experimental results show that the proposed active learning approach can be enhanced by the pre-trained language model and obtain better performance.


1 INTRODUCTION

Sentence matching is an important task of natural language processing. It aims to judge the relation between two sentences, such as whether the two sentences express the same meaning. Over the past few years, deep learning as a data-driven technique has achieved state-of-the-art performance on sentence matching [6, 14, 17, 28, 44, 48]. However, these data-driven models inevitably need large amounts of manual annotation, which is expensive. If large amounts of labeled data cannot be provided, then the advantages of deep learning diminish significantly.

To alleviate the above problem, active learning has been proposed to achieve better performance with fewer labeled training instances [34]. Instead of randomly selecting candidate instances, active learning measures all unlabeled instances based on some criteria and is thus able to select more effective instances for subsequent annotation and training [11, 16, 38, 46, 51]. Nevertheless, previous active learning for natural language processing typically relies on an entropy-based uncertainty criterion [34] and neglects the characteristics of textual data. For example, in the question answering task, there may be many questions that share similar textual expression and intent. If we neglect this textual similarity, then we may select redundant instances, which have little effect on training the classifier and waste annotation cost. Hence, how to use textual criteria to measure candidate instances remains a challenge.

Recently, pre-trained language models [9, 30, 31, 49] have been shown to be effective for improving many natural language processing (NLP) tasks. Typically, learned language representations of pre-trained language models contain textual characteristics of data. Therefore, pre-trained language models may be useful to enrich criteria of active learning with the textual characteristics. To this end, this article proposes a new active learning approach for sentence matching. It employs criteria from a pre-trained language model to capture textual characteristics and then utilizes these extra textual criteria to enhance active learning.

Specifically, the proposed active learning approach simultaneously measures the uncertainty, noise, coverage, and diversity of a sentence pair instance, as shown in Figure 1. Uncertainty, as a standard criterion, indicates the classification uncertainty of instances, and the classifier prefers instances with high uncertainty to improve its discriminative ability. Noise, coverage, and diversity are our proposed linguistic criteria, which are based on language characteristics from a pre-trained language model. Noise indicates how much potential noise there is in an instance. Noise may degrade classifier performance, so instances with more potential noise should have lower priority to be added into the training set. Coverage indicates whether the language expression of an instance is easy to model, and the classifier needs instances with low coverage (such as low-frequency professional expressions) to enrich representation learning. Diversity indicates how diverse the instances are. The classifier should avoid selecting similar and redundant instances with low diversity, because they have little effect on training the classifier and waste annotation cost, whereas diverse instances help the model learn more varied textual expressions and matching patterns. In the end, the above four criteria are combined to select the most effective instances for annotation and training.

Fig. 1.

Fig. 1. Illustration of our proposed active learning approach. Noise, coverage, and diversity are extra textual criteria from the pre-trained language model to enhance active learning.

Besides, homophones are common in Chinese data; e.g., both "钱" (money) and "前" (front) have the same pronunciation ("qian"). Wrong homophones may be caused by typing errors, speech recognition errors, and so on. Chinese sentence matching inevitably suffers from this problem, which affects the model. For example, we sampled 100 examples from a Chinese dataset and found that 6 of them contain wrong homophones. However, if we simply drop these instances with wrong homophones according to the noise criterion, then the model is unable to deal with natural test samples that contain the same wrong homophones (i.e., that have the same error distribution and error forms). To solve this problem, we improve the noise criterion. Specifically, we also use the pre-trained language model to recognize instances with possible Chinese wrong homophones and allow the model to train on these instances for robustness.

In brief, our main contributions include the following:

  • We demonstrate that the amount of labeled training data for sentence matching can also be substantially reduced with active learning.

  • We propose to use a pre-trained language model to enhance active learning for sentence matching, which provides extra criteria to capture textual characteristics.

  • For the Chinese sentence matching task, we propose to utilize the pre-trained language model to recognize instances with possible Chinese wrong homophones and allow the model to train on them for robustness.

  • The experimental results on both English and Chinese sentence matching datasets show that pre-trained language models are able to not only directly improve downstream tasks but also help measure instances to enhance active learning.

This journal paper is an extended version of the conference paper [2]. The new content in this article includes a more detailed description of the method, a new contribution for the Chinese setting together with corresponding experiments and discussion, experiments on an alternative algorithm for the diversity rank, a comparative study of different thresholds for the coverage criterion, a comparison among different pre-trained language models, and a visual analysis of the instance selection process.


2 RELATED WORK

2.1 Active Learning

Active learning aims to reduce the annotation cost of data-driven techniques. It selects more efficient instances based on some criteria and achieves better performance with fewer labeled training instances. Typically, previous studies consider a pool-based setting [41, 52], where there is a large pool of available unlabeled data and the task is to draw a limited number of examples to be labeled so as to maximize classifier performance. Tong et al. [41] suggest a margin-based selection criterion, and References [35, 36, 52] combine multiple criteria for NLP tasks. With the development of deep learning, active learning plays an important role in low-resource settings, such as text classification [51], sequence tagging tasks like named entity recognition [11, 38], entity resolution [16], implicit discourse relation recognition [46], and so on. In this article, we propose extra criteria that capture textual characteristics ignored by previous methods and enhance active learning. Besides, Siddhant et al. [39] provide a large-scale empirical study of deep active learning, Padmakumar et al. [27] incorporate active learning into interactive tasks, and Fang et al. [12] reframe active learning as a reinforcement learning problem with a data selection policy.

2.2 Sentence Matching

Sentence matching is to judge the relation between two sentences and has been widely utilized in many natural language processing tasks, such as question matching [15] and natural language inference [4, 10, 45]. Earlier approaches mainly relied on conventional methods [32, 43]. Deep learning-based methods have since substantially improved the performance of sentence matching. They are divided into two types. The first type is sentence-encoding-based [7, 8, 26, 37], where sentences are encoded into independent sentence representations for classification. The other type is alignment-based [6, 14, 17, 28, 44, 48], which establishes word-level alignment and dependency between the two sentences to capture their relationship. Active learning is a model-agnostic policy, and thus we do not focus on which model to use.

2.3 Pre-trained Language Model

Recently, pre-trained language models have learned effective unsupervised language representations from large-scale free text and dramatically improved the performance of many natural language processing tasks. Some studies proposed to learn context-independent representations for each word [25, 29]. Peters et al. [30] learned contextual token representations, which depend on the representation of the entire input sentence. Radford et al. [31] used a unidirectional Transformer [42] to learn representations by autoregressive language modeling. Devlin et al. [9] used a bidirectional Transformer to learn representations by autoencoding language modeling. Yang et al. [49] leveraged the advantages of both autoregressive and autoencoding language modeling while avoiding their limitations. In this article, we employ a pre-trained language model to provide extra textual criteria for enhancing active learning, which is beneficial for obtaining more efficient instances.


3 PRELIMINARIES

3.1 Sentence Matching

Given a sentence pair as input, the goal of the sentence matching task is to judge the relation between the two sentences, such as whether they have the same intent. Formally, we have two sentences A = [\(a_1\), \(a_2\),...,\(a_{l_A}\)] and B = [\(b_1\), \(b_2\),...,\(b_{l_B}\)], where \(a_i\) and \(b_j\) are the ith and jth words, respectively, in the two input sentences, and \(l_A\) and \(l_B\) denote their lengths.

Through a shared word embedding matrix \({\bf W}_e\) \(\in\) \(\mathbb {R}^{n_e \times d}\), we can obtain the word embeddings of the two input sentences \({\bf a}\) = [\({\bf e}(a_1)\), \({\bf e}(a_2)\),..., \({\bf e}(a_{l_A}\))] and \({\bf b}\) = [\({\bf e}(b_1)\), \({\bf e}(b_2)\),..., \({\bf e}(b_{l_B})\)], where \(n_e\) denotes the vocabulary size, d denotes the embedding size, and \({\bf e}(a_i)\) and \({\bf e}(b_j)\) denote the word embeddings of the ith and jth words in the corresponding sentences. A model M for sentence matching predicts a label \(\hat{y}\) according to \({\bf a}\) and \({\bf b}\). For testing, we select the label with the highest probability in the prediction distribution \(P(y_i|{\bf a},{\bf b};\theta _M)\) as output, where \(\theta _M\) denotes the parameters of the model M and \(y_i\) is a possible label. For training, the model M is optimized by minimizing the following cross entropy loss: (1) \[\begin{equation} Loss=-\log P(y|{\bf a},{\bf b};\theta _M), \end{equation}\] where y denotes the gold label.

3.2 Active Learning

In this article, we follow the pool-based active learning setting [41, 52], where there is a small set of labeled data P and a large pool of available unlabeled data Q. P is used to train a classifier/model and absorbs new instances from Q. The task of active learning is to select instances in Q according to some criteria and then label them and add them into P, so as to maximize the performance of the classifier/model while minimizing the expensive annotation cost. In the criteria for instance selection, a measure is used to score all candidate instances in Q, and instances maximizing this measure are selected into P.

The pipeline of active learning is shown in Algorithm 1. The instance selection process in active learning is iterative, and the process will repeat until a fixed annotation budget is reached. At each round, there are limited n instances to be selected and labeled for subsequent training.

With the same size of labeled dataset P, the criteria for instance selection in active learning determine the classifier/model performance. Commonly, the criteria mainly rely on the uncertainty criterion (uncertainty sampling) [41, 52], where instances near decision boundaries have priority to be selected. A standard uncertainty criterion uses entropy, which is defined as follows: (2) \[\begin{equation} Ent(x_i)=-\sum _{k}P(y_i=k|x_i;\theta _M)\log P(y_i=k|x_i;\theta _M), \end{equation}\] where k indexes all possible labels and \(x_i\) is a candidate instance that consists of a sentence pair (A, B) in the available unlabeled dataset Q.
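To make the entropy criterion concrete, the following minimal sketch scores and ranks candidates by Equation (2). It assumes the classifier exposes a predicted probability distribution for each candidate pair; the function and variable names are illustrative rather than taken from the original implementation.

```python
import numpy as np

def entropy_scores(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Equation (2) for a batch of candidates.

    probs: array of shape (num_candidates, num_labels) holding
           P(y_i = k | x_i; theta_M) for every candidate sentence pair.
    """
    return -np.sum(probs * np.log(probs + eps), axis=1)

def uncertainty_order(probs: np.ndarray) -> np.ndarray:
    """Candidate indices sorted from most to least uncertain."""
    return np.argsort(-entropy_scores(probs))

# Example with three candidates and three possible labels.
probs = np.array([[0.34, 0.33, 0.33],   # near-uniform -> high entropy
                  [0.90, 0.05, 0.05],   # confident    -> low entropy
                  [0.60, 0.30, 0.10]])
print(uncertainty_order(probs))  # -> [0 2 1]
```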


4 METHODOLOGY

We motivate our proposed method with the idea that the textual characteristics learned by pre-trained language models are useful to enrich the criteria of active learning. To achieve this, we employ a pre-trained language model to provide textual criteria for active learning. During active learning, a sentence matching model M is trained at each round with the labeled instances obtained from active learning. Active learning is a model-agnostic policy and focuses on instance selection to reduce annotation cost; thus we choose the model in Reference [9] as the classifier M and fix it for all active learning baselines.

4.1 Pre-trained Language Model

The pre-trained language model is trained by large-scale unsupervised text in advance and is independent from the sentence matching model. In this article, we use the autoencoding language model BERT [9] as the pre-trained language model, because it is the most widely used. When pre-training the autoencoding language model, tokens are randomly selected to be masked, and then these masked tokens are reconstructed again based on the context.

From the pre-trained language model, we can obtain two kinds of information, which can be useful to provide extra textual criteria for active learning. One is the loss of reconstructing a masked token. Given a sentence A (the same applies to B), we can obtain the cross entropy loss \(s_{a_i}\) of reconstructing the ith word \(a_i\) by masking only \(a_i\) and predicting \(a_i\) again. This is illustrated in Figure 2. We can obtain the reconstruction losses of all tokens by masking them in turn; a sketch of this procedure is given after Figure 2. The other is the word embeddings of the tokens in the sentence \({\bf a}\) = [\({\bf e}(a_1)\), \({\bf e}(a_2),\ldots ,{\bf e}(a_{l_A}\))]. Here we use the last embedding layer in BERT, which provides contextual representations.

Fig. 2.

Fig. 2. Reconstruction loss of a token from the pre-trained language model.
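As an illustration of how the per-token reconstruction loss \(s_{a_i}\) can be obtained, the sketch below masks one token at a time with a BERT masked language model and reads off the cross entropy of recovering it. It uses the Hugging Face `transformers` interface and is only a rough approximation of the procedure described above, not the authors' code.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def reconstruction_losses(sentence: str):
    """Mask each token in turn and return (token, loss) pairs, i.e., s_{a_i}."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    losses = []
    for pos in range(1, input_ids.size(0) - 1):   # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        loss = torch.nn.functional.cross_entropy(
            logits.unsqueeze(0), input_ids[pos].unsqueeze(0))
        token = tokenizer.convert_ids_to_tokens([input_ids[pos].item()])[0]
        losses.append((token, loss.item()))
    return losses

print(reconstruction_losses("What's the nest way to learn Japanese?"))
```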

4.2 Criteria for Instance Selection

In our proposed active learning, there are four criteria for instance selection, including uncertainty, noise, coverage, and diversity. Uncertainty is the standard criterion of active learning. Noise, coverage, and diversity are proposed textual criteria from the pre-trained language model to capture textual characteristics. In each criterion, we can obtain the rank of candidate instances, which is used for the final instance selection.

(1) Uncertainty: The uncertainty criterion indicates the classification uncertainty of an instance and is the standard criterion in active learning. Instances with high uncertainty are more helpful for optimizing the classifier and are thus worthier of selection. The uncertainty is computed as the entropy, and we can obtain the uncertainty rank \(rank_{uncertain}(x_i)\) for the ith instance in Q according to the entropy. Specifically, (3) \[\begin{equation} rank_{uncertain}(x_i)\propto -Ent(x_i), \end{equation}\] where \(Ent(x_i)\) is defined in Equation (2).

(2) Noise: The noise criterion indicates how much potential noise there is in an instance. Intuitively, instances with noise may degrade the labeled data P and harm classifier performance. Thus, we tend to select noiseless instances; instances with more potential noise should have lower priority to be added into P. Sentences in noisy instances usually have rare textual expression, and their generating probabilities are low. In other words, a noisy token is hard to reconstruct from the context by the pre-trained language model. For example, in the sentence “What’s the nest way to learn Japanese?,” “nest” is a spelling mistake (it should be “best”) and has a higher reconstruction loss. According to the above assumption, we can formulate the noise criterion based on the losses of reconstructing masked tokens. It is defined as follows: (4) \[\begin{align} rank_{noise}(x_i)&\propto \frac{1}{P(A)+P(B)}, \end{align}\] (5) \[\begin{align} P(A)=\prod _{i=1}^{l_A}P(a_i|a_1 &\dots a_{i-1})\propto \frac{l_A}{\sum _{i \in l_A}s_{a_i}}, \end{align}\] (6) \[\begin{align} P(B)=\prod _{i=1}^{l_B}P(b_i|b_1 &\dots b_{i-1})\propto \frac{l_B}{\sum _{i \in l_B}s_{b_i}}, \end{align}\] where \(P(A)\) and \(P(B)\) denote the probabilities of generating the corresponding sentences, \(rank_{noise}(x_i)\) denotes the noise rank of the ith instance in Q, and \(s_{a_i}\) and \(s_{b_i}\) are the reconstruction losses of the corresponding words (\(a_i\) and \(b_i\)) from the pre-trained language model.
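Given the per-token reconstruction losses from the previous section, the noise score of Equations (4)–(6) reduces to simple averaging. A minimal sketch (the function names are ours):

```python
def prob_proxy(losses):
    """Proxy for the generating probability in Eqs. (5)-(6): the lower the
    average reconstruction loss, the higher the probability."""
    return len(losses) / sum(losses)

def noise_score(losses_a, losses_b):
    """Proportional to rank_noise in Eq. (4); larger values mean a noisier
    instance, which is pushed toward the end of the selection order."""
    return 1.0 / (prob_proxy(losses_a) + prob_proxy(losses_b))

# Hypothetical per-token losses for two versions of the same sentence pair.
clean = noise_score([0.3, 0.8, 0.5], [0.4, 0.6])
noisy = noise_score([0.3, 9.7, 0.5], [0.4, 0.6])   # one badly garbled token
assert noisy > clean
```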

In particular, to handle the common case of wrong homophones in Chinese data, which may be caused by typing errors, speech recognition errors, and so on, we improve the noise criterion for the Chinese setting. Specifically, we reconstruct with BERT the noisy tokens whose reconstruction losses exceed 10.0. If a newly reconstructed token (among the top 5 candidates generated from the context) has a similar phonetic transcription to the original noisy token, then we regard it as a possible wrong homophone and replace the original token's reconstruction loss with that of the new token in the noise rank. As a result, the instance is more likely to be kept so that the model can fit it for robustness.
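A minimal sketch of this homophone check is given below. It assumes `pypinyin` for the phonetic transcription and a helper that returns the top-5 tokens reconstructed by the masked language model at the noisy position together with their losses; the threshold of 10.0 follows the description above, while the helper output format and the tone-less pinyin comparison are our own simplifications.

```python
from pypinyin import lazy_pinyin

LOSS_THRESHOLD = 10.0  # tokens above this reconstruction loss are treated as noise

def same_pronunciation(tok_a: str, tok_b: str) -> bool:
    """True if two Chinese tokens share the same (tone-less) pinyin."""
    return lazy_pinyin(tok_a) == lazy_pinyin(tok_b)

def adjust_loss_for_homophone(orig_token, orig_loss, top5_candidates):
    """If a high-loss token looks like a wrong homophone, replace its loss with
    the loss of the homophone candidate so the noise criterion does not discard
    the instance.

    top5_candidates: list of (token, reconstruction_loss) pairs predicted by
    the masked language model at this position (assumed helper output).
    """
    if orig_loss <= LOSS_THRESHOLD:
        return orig_loss
    for cand_token, cand_loss in top5_candidates:
        if same_pronunciation(orig_token, cand_token):
            return cand_loss   # likely a wrong homophone: keep the instance trainable
    return orig_loss
```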

(3) Coverage: The coverage criterion indicates whether the textual expression of an instance is easy to model. On the one hand, some tokens such as stop words are frequent and easy to model (high coverage). On the other hand, the classifier needs instances with low coverage (such as low-frequency professional expressions) to enrich representation learning. In such cases, we tend to select instances with low coverage. Harder instances such as low-frequency sentences usually have low generating probabilities; in other words, it is more difficult to predict masked tokens for low-coverage textual expressions. For example, in the sentence “Can I carry needle in domestic flight?,” “domestic” is a relatively low-frequency, professional token compared with others like “in” and has a higher reconstruction loss. Thus, we can employ reconstruction losses to capture low-coverage textual expressions. The criterion is defined as follows: (7) \[\begin{align} rank_{coverage}(x_i)&\propto \frac{1}{\frac{\sum _{j \in l_A}c_{a_j}s_{a_j}}{\sum _{j \in l_A}c_{a_j}}+\frac{\sum _{j \in l_B}c_{b_j}s_{b_j}}{\sum _{j \in l_B}c_{b_j}}}, \end{align}\] (8) \[\begin{align} &c_{a_j}=\left\lbrace \begin{array}{ll} {0} &if \ s_{a_j} \gt \beta \\ {1} &others \end{array}, \right. \end{align}\] (9) \[\begin{align} &c_{b_j}=\left\lbrace \begin{array}{ll} {0} &if \ s_{b_j} \gt \beta \\ {1} &others \end{array}, \right. \end{align}\] where \(\beta\) denotes a threshold. If the reconstruction loss of a token is too high, then it is more likely to be noise. Thus, if the reconstruction loss of a token (\(a_j\) or \(b_j\)) exceeds \(\beta\), then we ignore it via the corresponding weight (\(c_{a_j}\) or \(c_{b_j}\)). After observing only a few instances, we set \(\beta\) to 10.0 for both English and Chinese.
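Using the same reconstruction losses, the coverage score of Equations (7)–(9) can be sketched as follows (names are illustrative; a smaller score means a harder, lower-coverage expression, which is preferred):

```python
BETA = 10.0  # threshold from the text; losses above it are treated as noise

def masked_mean_loss(losses, beta=BETA):
    """Mean reconstruction loss over tokens with loss <= beta,
    i.e., the c-weighted averages in Eqs. (8)-(9)."""
    kept = [s for s in losses if s <= beta]
    return sum(kept) / len(kept) if kept else 0.0

def coverage_score(losses_a, losses_b, eps=1e-12):
    """Proportional to rank_coverage in Eq. (7); low-coverage (hard)
    instances obtain small scores and are selected first."""
    return 1.0 / (masked_mean_loss(losses_a) + masked_mean_loss(losses_b) + eps)
```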

(4) Diversity: The diversity criterion indicates how diverse the instances are. On the one hand, similar and redundant instances are not efficient for training a sentence matching model. On the other hand, diverse instances can help learn more varied textual expressions and matching patterns. Thus, we tend to select more diverse instances to be labeled into the labeled dataset P. The diversity criterion is based on two parts: the instance representation \({\bf v}_i\) \(\in\) \(\mathbb {R}^d\) and the diversity rank \(rank_{diversity}(x_i)\), where \(rank_{diversity}(x_i)\) depends on \({\bf v}_i\).

For instance representation, we employ a vector to represent a pair of sentences. The common approach is to use the sum of the word embeddings of the two sentences. However, a simple sum operation can hardly capture the difference between the two sentences. In fact, in sentence matching, it is the difference between two sentences that determines whether they share the same meaning; two sentences must have the same content if their characters are exactly the same. To model the difference between the two input sentences, we propose a novel Levenshtein Distance-based approach for instance representation.

Specifically, we employ the subtraction of word embeddings between the “Delete Sequence” \(L_D\) and the “Insert Sequence” \(L_I\) for instance representation. \(L_D\) and \(L_I\) are derived from the Levenshtein Distance: when we transform sentence A into sentence B by deleting and inserting tokens, these tokens are added into the “Delete Sequence” and “Insert Sequence,” respectively. This is illustrated in Figure 3, and a sketch of the construction is given after Figure 3. Besides, the word embeddings in the subtraction are weighted. Intuitively, some meaningless tokens have little effect on content, and they should have less weight. In addition, these meaningless tokens (such as prepositions) are typically easier to predict and have lower reconstruction losses. Thus, we can employ the reconstruction losses of tokens to formulate the weights. Finally, the representation of an instance is defined as follows: (10) \[\begin{align} {\bf v}_i=\Big |\sum _{j \in L_I}&w_{b_j}{\bf e}(b_j)-\sum _{j \in L_D}w_{a_j}{\bf e}(a_j)\Big |, \end{align}\] (11) \[\begin{align} &w_{a_j}=\frac{s_{a_j}}{\sum _{k \in l_A}s_{a_k}}, \end{align}\] (12) \[\begin{align} &w_{b_j}=\frac{s_{b_j}}{\sum _{k \in l_B}s_{b_k}}, \end{align}\] where \(s_{a_j}\) and \(s_{b_j}\) are the reconstruction losses of the jth words of A and B, respectively, and \(w_{a_j}\) and \(w_{b_j}\) denote the weights of the tokens in the subtraction; tokens with lower reconstruction loss have less weight. Considering the symmetry of the two input sentences, we apply an absolute value operation to every element of the representation.

Fig. 3.

Fig. 3. Illustration of “Delete Sequence” and “Insert Sequence.”
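The sketch below illustrates one way to build this representation: a standard Levenshtein alignment with backtrace yields the Delete and Insert sequences, and Equations (10)–(12) are then applied with the reconstruction-loss weights. It is an illustrative reading of the description above (in particular, a substitution is treated as a deletion plus an insertion), not the authors' implementation.

```python
import numpy as np

def delete_insert_sequences(tokens_a, tokens_b):
    """Token positions deleted from A and inserted from B along a Levenshtein
    alignment; a substitution contributes to both sequences."""
    la, lb = len(tokens_a), len(tokens_b)
    d = np.zeros((la + 1, lb + 1), dtype=int)
    d[:, 0], d[0, :] = np.arange(la + 1), np.arange(lb + 1)
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = 0 if tokens_a[i - 1] == tokens_b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # delete a_i
                          d[i, j - 1] + 1,          # insert b_j
                          d[i - 1, j - 1] + cost)   # match / substitute
    deletes, inserts, i, j = [], [], la, lb
    while i > 0 or j > 0:
        if i > 0 and j > 0 and tokens_a[i - 1] == tokens_b[j - 1] \
                and d[i, j] == d[i - 1, j - 1]:
            i, j = i - 1, j - 1                      # match: nothing recorded
        elif i > 0 and j > 0 and d[i, j] == d[i - 1, j - 1] + 1:
            deletes.append(i - 1); inserts.append(j - 1); i, j = i - 1, j - 1
        elif i > 0 and d[i, j] == d[i - 1, j] + 1:
            deletes.append(i - 1); i -= 1
        else:
            inserts.append(j - 1); j -= 1
    return deletes, inserts

def instance_representation(emb_a, emb_b, losses_a, losses_b, tokens_a, tokens_b):
    """Equation (10): weighted subtraction between Insert and Delete sequences."""
    deletes, inserts = delete_insert_sequences(tokens_a, tokens_b)
    w_a = np.asarray(losses_a) / np.sum(losses_a)   # Eq. (11)
    w_b = np.asarray(losses_b) / np.sum(losses_b)   # Eq. (12)
    vec_i, vec_d = np.zeros(emb_b.shape[1]), np.zeros(emb_a.shape[1])
    for j in inserts:
        vec_i += w_b[j] * emb_b[j]
    for j in deletes:
        vec_d += w_a[j] * emb_a[j]
    return np.abs(vec_i - vec_d)
```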

Then, we obtain the diversity rank based on the instance representation. To select diverse instances, we want to select representative ones and make them as different from each other as possible. Thus, we employ the k-means clustering algorithm for the diversity rank. Specifically, we divide instances into n clusters by k-means and obtain, for each cluster, a representative instance that is closest to the cluster center. These representative instances \(O_{diverse}\) are considered to have more diversity and thus are more likely to be labeled into the labeled dataset P for subsequent training. This is formulated as follows: (13) \[\begin{align} rank_{diversity}(x_i)&=\left\lbrace \begin{array}{ll} {0} &if \ x_i \in O_{diverse}\\ {n} &others \end{array}\!\!\!. \right. \end{align}\] In each round, we can obtain the diversity rank \(rank_{diversity}(x_i)\) for the ith instance in Q.
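A possible sketch of this clustering step with scikit-learn's `KMeans` (illustrative names; the cluster count equals the per-round budget n as described above):

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_representatives(reps: np.ndarray, n: int) -> np.ndarray:
    """Cluster the instance representations into n groups and return the index
    of the instance closest to each cluster center (the set O_diverse)."""
    km = KMeans(n_clusters=n, n_init=10, random_state=0).fit(reps)
    chosen = []
    for c in range(n):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(reps[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])
    return np.array(chosen)

def diversity_rank(reps: np.ndarray, n: int) -> np.ndarray:
    """Equation (13): rank 0 for representatives, n for all other candidates."""
    ranks = np.full(len(reps), n)
    ranks[diverse_representatives(reps, n)] = 0
    return ranks
```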

4.3 Instance Selection

When selecting instances, we use rank combination to combine the criteria into an overall rank. Specifically, we sequentially employ \(rank_{uncer}\), \(rank_{diver}\), \(rank_{cover}\), and \(rank_{noise}\) to select the top \(8n\), \(4n\), \(2n\), and n candidate instances, as illustrated in Figure 4. This can be written as \(rank(x_i)=rank_{uncertain}(x_i)\rightarrow rank_{diversity}(x_i)\rightarrow rank_{coverage}(x_i)\rightarrow rank_{noise}(x_i)\), where \(rank(x_i)\) denotes the final rank of a candidate instance in the unlabeled dataset Q. After ranking all candidate instances, we select the top n instances, label them, and add them into the labeled dataset P for subsequent training. A sketch of this sequential filtering is given after Figure 4.

Fig. 4.

Fig. 4. Illustration of sequential rank combination for the final instance selection.
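A minimal sketch of the sequential filtering in Figure 4, assuming each `rank_*` function returns candidate instances ordered from best to worst under its criterion; the names are illustrative:

```python
def sequential_selection(candidates, n,
                         rank_uncertain, rank_diversity, rank_coverage, rank_noise):
    """Apply the four criteria in sequence, keeping the top 8n, 4n, 2n, and
    finally n candidates, as illustrated in Figure 4."""
    pool = rank_uncertain(candidates)[:8 * n]
    pool = rank_diversity(pool)[:4 * n]
    pool = rank_coverage(pool)[:2 * n]
    return rank_noise(pool)[:n]
```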


5 EXPERIMENTS

5.1 Configuration

The number of instances to select at every round, n, is 100, and we perform 25 rounds of active learning, i.e., there is a total of 2,500 labeled instances for training in the end. The batch size is 16 for English and 32 for Chinese, and Adam [18] is used for optimization. We evaluate performance by calculating accuracy and learning curves on a held-out test set (classes are fairly balanced in the datasets) after all rounds.

5.2 Datasets

We conduct experiments on three English datasets and two Chinese datasets. Table 1 provides their statistics.

            training   validation     test
SNLI         549,367        9,842    9,824
MultiNLI     392,702        9,815    9,832
Quora        384,348       10,000   10,000
LCQMC        238,766        8,802   12,500
BQ           100,000        1,000    1,000

Table 1. Statistics of Datasets for Sentence Matching

  • SNLI: an English natural language inference corpus based on image captioning [4].

  • MultiNLI: an English natural language inference corpus with greater linguistic difficulty and diversity [45].

  • Quora: an English question matching corpus from the online question answering forum Quora [15].

  • LCQMC: an open-domain Chinese question matching corpus from the community question answering website Baidu Knows [22].

  • BQ: a domain-specific Chinese question matching corpus from online bank customer service logs [5].

5.3 Comparisons

To verify the effectiveness of our method, the following active learning approaches are compared:

  • Random sampling (Random): At each round, it randomly selects instances for annotation and training.

  • Uncertainty sampling (Entropy): It is the standard approach based on entropy [41, 52].

  • Expected Gradient Length (EGL): It aims to select instances expected to result in the greatest change to the gradients of tokens [35, 51].

  • Discriminative Active Learning (DAL): It poses active learning as a binary classification task, selecting instances to label in such a way as to make the labeled set and the unlabeled pool indistinguishable [13].

  • Core-Set (CORE): It selects instances such that a model learned over the selected subset is competitive for the remaining data points [33].

  • Gradient Lower Bounds (BADGE): It selects instances that are diverse and of high magnitude when represented in a hallucinated gradient space [1].

  • Cold-Start (COLD): It improves the above BADGE with a cold-start strategy [50].

  • Pre-trained language model sampling (LM): It is our proposed active learning approach.

5.4 Overall Results

Table 2 and Figure 5 (panels 1–5) show the accuracy and learning curves, respectively, of the different approaches on the five datasets. Overall, our proposed approach achieves better performance on both the English and Chinese datasets. This shows that the extra textual criteria from the pre-trained language model are helpful and demonstrates that a pre-trained language model can capture textual characteristics and provide more efficient instances for subsequent training. Besides, we can see that active learning approaches always outperform random sampling with the same number of instances. These results demonstrate that the annotation cost for training the sentence matching model can be substantially reduced by active learning. As a gradient-based active learning approach, EGL performs worse than standard active learning, which shows that it is not suitable for sentence matching. Intuitively, a single token may determine the polarity in a text classification task, but the sentence matching task needs to focus on the relations between pairs of words, and a single token cannot reflect such relations. Moreover, Entropy performs well compared with the other baselines most of the time. We attribute this to the type of task: the other active learning methods focus on classification of a single sample, whereas every instance in sentence matching consists of two sentences. Thus, these methods may not fit the task and perform worse than standard Entropy.

Fig. 5.

Fig. 5. Panels 1–5 are learning curves of overall comparisons on the five datasets. Panel 6 is about learning curves on four SNLI subsets to show the relation between size of unlabeled data and accuracy.

           Random   Entropy   EGL     DAL     CORE    BADGE   COLD    LM
SNLI        77.90     79.80   77.86   79.76   79.30   79.55   78.75   80.99
MultiNLI    67.83     70.27   66.80   70.56   69.40   69.29   68.65   71.79
Quora       79.01     80.21   77.91   79.68   80.55   80.69   79.48   81.79
LCQMC       82.04     83.25   80.35   82.72   82.82   82.66   81.92   84.29
BQ          71.44     73.60   71.59   73.29   73.67   73.88   72.81   74.73

Table 2. Accuracy of Different Approaches

5.5 Ablation Study

To demonstrate the effectiveness of the extra textual criteria from the pre-trained language model, we separately combine the different criteria with the standard uncertainty criterion. “Ent” denotes the standard uncertainty criterion, “E+Noi” denotes combining the uncertainty criterion with the noise criterion, “E+Cov” denotes combining the uncertainty criterion with the coverage criterion, “E+Div” denotes combining the uncertainty criterion with the diversity criterion, and “E+All” denotes the complete setting combining all criteria.

Table 3 and Figure 6 show the accuracy and learning curves, respectively. We observe that each combined criterion performs better than the single standard uncertainty criterion, indicating that each textual criterion from the pre-trained language model is effective. Besides, in Figure 6, although “E+Div” sometimes performs better at later rounds, “E+All” obtains better performance at early rounds. This demonstrates that the noise and coverage criteria are clearly helpful at early rounds, while the diversity criterion alone is not enough to measure instances at early rounds; thus, combining all criteria can speed up fitting the model. However, we also see that the diversity of instances becomes more important when the model is already relatively workable at later rounds. Therefore, “E+All” is more useful when the annotation budget is small.

Fig. 6.

Fig. 6. Learning curves of combining different proposed textual criteria with the uncertainty criterion on SNLI dataset.

Ent     E+Cov   E+Noi   E+Div   E+All
79.80   80.99   81.11   81.45   80.99

Table 3. Accuracy of Combining Different Proposed Textual Criteria with the Uncertainty Criterion on SNLI Dataset

5.6 Discussion

(1) Effectiveness of the improved noise criterion for Chinese wrong homophones: For the case of Chinese wrong homophones, we improve the noise criterion with the pre-trained language model. To demonstrate its effectiveness, we conduct experiments on the Chinese sentence matching datasets. Table 4 reports the results.

                     E+Noi                   E+All
          Ent    original  improved     original  improved
LCQMC    83.25      83.72     83.99        84.29     84.60
BQ       73.60      73.85     74.43        74.73     75.43

Table 4. Results of the Original Noise Criterion and the Improved Noise Criterion


We find that the improved noise criterion performs better than the original noise criterion, which demonstrates the effectiveness of our method. Besides, the improvement on BQ is more obvious than that on LCQMC. The reason may be that Chinese wrong homophones are more common in BQ.

(2) Size of unlabeled dataset versus accuracy: Additionally, we conduct experiments to report the relation between the size of the unlabeled dataset and the accuracy of the classifier. Specifically, we choose the SNLI dataset and construct four subsets of different sizes, including 5%, 10%, 50%, and 100% of the original dataset, and then observe their learning curves.

The results are reported in Figure 5 (panel 6). When the size of the unlabeled dataset is small, the superiority of the pre-trained model-based approach is not very obvious. However, as the size increases, the performance of the other approaches improves little, whereas our approach improves substantially and significantly outperforms the others, which demonstrates that our approach has an advantage for larger dataset sizes. The main reason may be that the effect of the diversity criterion is more significant for a larger unlabeled dataset: with more candidate instances, it has more chances to avoid selecting similar and redundant instances, accelerating convergence during training.

(3) Effectiveness of different instance representation methods: We examine the effectiveness of different instance representation methods in the diversity criterion. We compare our method with four approaches on the SNLI dataset: (a) using the first word embedding layer in BERT as context-independent representations (Uncontext); (b) using the subtraction between sentence vectors from an auto-encoder (AE) [19]; (c) using the subtraction between sentence vectors from a topic model (Topic) [3]; and (d) using the subtraction between sentence vectors from Skip-Thoughts (Skip) [20].

Table 5 and Figure 7 show the accuracy and learning curves, respectively. We can see that contextual representations are better than context-independent representations; intuitively, contextual representations are more precise, especially when dealing with polysemy. Next, we find that our proposed method outperforms the sentence vector-based methods (Topic, AE, and Skip), possibly because BERT used more data to learn language representations.

Fig. 7.

Fig. 7. Learning curves of different instance representation methods.

Entropy   Uncontext   AE      Topic   Skip    LM
79.80     80.63       80.42   80.54   80.71   80.99

Table 5. Accuracy of Different Instance Representation Methods

(4) Effectiveness of the subtraction operation on Levenshtein Distance: Here we demonstrate the effectiveness of the operation that uses the subtraction of word embeddings between the “Delete Sequence” and “Insert Sequence” in the diversity criterion. We compare it with four approaches on the SNLI dataset: (a) using the sum of the word embeddings of the whole sentence pair (Sum); (b) directly using the subtraction of the word embeddings of the two sentences in a pair without the “Delete Sequence” and “Insert Sequence” (Sub); (c) without weights for word embeddings (Nowei); and (d) without the absolute value operation for symmetry (Noabs).

Table 6 and Figure 8 show the accuracy and learning curves, respectively. We can see that the subtraction operation is better than the sum operation, demonstrating that subtraction better captures the difference between an input sentence pair and provides a better instance representation for the diversity rank. Besides, the results without the “Delete Sequence” and “Insert Sequence” are a little worse, verifying their effectiveness. The results without the weighting of word embeddings are worse, showing that weighting different tokens is useful. Moreover, the results without the absolute value operation for symmetry are also worse, demonstrating the necessity of this operation.

Fig. 8.

Fig. 8. Learning curves of subtraction operation on Levenshtein Distance.

Entropy   Sum     Sub     Nowei   Noabs   LM
79.80     80.35   80.67   80.29   80.44   80.99

Table 6. Accuracy of Subtraction Operation on Levenshtein Distance

(5) Effectiveness of the threshold for the coverage criterion: For the coverage criterion, the threshold \(\beta\) in Equations (8) and (9) decides which tokens are regarded as noise. Here we show the effects of three different thresholds (3.0, 5.0, and 10.0); the experimental results are reported in Table 7.

            threshold-3.0   threshold-5.0   threshold-10.0
SNLI            80.46           80.77           80.99
MultiNLI        70.92           71.56           71.79
Quora           81.36           81.55           81.79
LCQMC           83.94           84.07           84.29
BQ              74.18           74.64           74.73

Table 7. The Experiments of Different Thresholds for Coverage Criterion

From these results, we find that a larger threshold is better for the coverage criterion. This suggests that noisy tokens indeed have high reconstruction losses and that we may lose some worthy instances if we adopt a lower threshold.

(6) Effectiveness of the k-means clustering algorithm for diversity rank: In our active learning approach, we employ the k-means clustering algorithm to select representative instances from the instance representations. Inspired by recent work on prototype learning [40, 47], here we also implement a prototype learning-based algorithm as a replacement for the k-means clustering algorithm and compare the effectiveness of the two strategies.

Prototype learning refers to the prototype as the representative point of a class in feature space. Inspired by this, we introduce a prototype learning-based selection algorithm. For each sentence matching class, the algorithm calculates the distance between each candidate instance and the corresponding class prototype and then yields a list of instances of that class sorted by their distance to the prototype. Intuitively, the closer a candidate instance is to the prototype, the more representative it is. According to the sorted lists of candidate instances, the first n candidate instances across all classes are selected to be labeled and added into the labeled dataset P. Formally, the prototype of a class is calculated as the weighted sum of the instance representations of all candidate instances, where the weight is based on the prediction probability of the corresponding class.
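A minimal sketch of this prototype-based alternative is given below, assuming the per-class prediction probabilities of the current classifier are available for every candidate; taking n instances per class is one reading of the description above, and all names are illustrative.

```python
import numpy as np

def class_prototypes(reps: np.ndarray, probs: np.ndarray) -> np.ndarray:
    """Prototype of each class: prediction-probability-weighted sum of the
    candidate instance representations."""
    # reps: (num_candidates, d); probs: (num_candidates, num_classes)
    weights = probs / probs.sum(axis=0, keepdims=True)
    return weights.T @ reps                      # (num_classes, d)

def prototype_selection(reps: np.ndarray, probs: np.ndarray, n: int) -> np.ndarray:
    """For every class, pick the n candidates closest to its prototype."""
    protos = class_prototypes(reps, probs)
    selected = []
    for c in range(protos.shape[0]):
        dists = np.linalg.norm(reps - protos[c], axis=1)
        selected.extend(np.argsort(dists)[:n].tolist())
    return np.unique(np.array(selected))
```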

We conduct experiments on the SNLI dataset to compare the two algorithms for the diversity rank under the “E+Div” and “E+All” criteria, both of which contain the diversity criterion. The “E+Div” criterion combines the uncertainty criterion with the diversity criterion, and “E+All” combines the uncertainty criterion with all criteria. The experimental results are reported in Table 8. We observe that the k-means clustering algorithm performs better than the prototype learning-based algorithm. The results demonstrate that the k-means clustering algorithm is more effective for the diversity rank and is able to select more diverse and representative instances for active learning. This is because the prototype learning-based algorithm can only capture the points around the prototype of a class and ignores the diversity of the instance representations. In contrast, the instances selected by the k-means clustering algorithm are more scattered, which brings more diversity.

                 E+Div                    E+All
           k-means  prototype      k-means  prototype
SNLI        81.45     80.32         80.99     80.16
MultiNLI    71.66     70.45         71.79     70.38
Quora       81.59     80.24         81.79     80.50
LCQMC       83.70     83.51         84.29     83.66
BQ          74.65     73.86         74.73     74.13

Table 8. Comparison between k-means Clustering Algorithm and Prototype Learning-based Algorithm for Instance Selection in the Diversity Criterion

5.7 Comparison among Different Pre-trained Language Models

In our method, we use BERT [9] as the pre-trained language model to enhance active learning, but our method is also easily compatible with other pre-trained language models. To explore the effect of different pre-trained language models, we conduct additional experiments in which other pre-trained language models replace BERT, including (1) RoBERTa [23], which modifies BERT with better hyperparameter choices; (2) ALBERT [21], which improves BERT with lower memory consumption and faster training speed; and (3) XLNet [49], which uses autoregressive pretraining without “mask” tokens (we put all context before the target token to simulate masking in our method). We implement them with publicly released resources. The results are shown in Table 9. We can see that different pre-trained language models perform notably differently. For example, XLNet performs best on the English datasets, and RoBERTa performs best on the Chinese datasets. The results indicate that the choice of pre-trained language model is important and that a better pre-trained language model can further benefit active learning.

           BERT    RoBERTa   ALBERT   XLNet
SNLI       80.99   81.45     80.39    82.58
MultiNLI   71.79   72.65     71.20    72.87
Quora      81.79   82.44     80.47    83.12
LCQMC      84.29   84.36     83.34    82.31
BQ         74.73   75.32     74.17    74.08

Table 9. Accuracy with Different Pre-trained Language Models

5.8 Visualization of Instance Selection

To better understand how the active learning approach selects valuable candidate instances for subsequent annotation and training, we conduct experiments on SNLI dataset to show the visualization of instance selection in the active learning.

Specifically, we sample a round as an example and employ t-SNE [24] to plot the embeddings for each candidate sentence pair instance (i.e., Levenshtein Distance-based instance representation in the diversity rank). Then, we highlight the selected candidate instances during the sequential rank combination (\(rank_{uncertain}(x_i)\rightarrow\) final \(rank(x_i)\)) to observe the instance selection process.
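A rough sketch of this visualization step with scikit-learn's `TSNE` and Matplotlib (the highlight colors follow Figures 9 and 10; everything else is illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_selection(reps, kept_by_uncertainty, final_selection):
    """Project the instance representations to 2D and highlight the candidates
    kept after the uncertainty rank and after the final combined rank."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(reps)
    plt.scatter(coords[:, 0], coords[:, 1], s=4, c="lightgray", label="candidates")
    plt.scatter(coords[kept_by_uncertainty, 0], coords[kept_by_uncertainty, 1],
                s=10, c="pink", label="after uncertainty rank")
    plt.scatter(coords[final_selection, 0], coords[final_selection, 1],
                s=14, c="red", label="final selection")
    plt.legend()
    plt.show()
```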

Figures 9 and 10 show the visualization of instance selection during the rank combination. We can see how the final selected instances are decided by the sequential rank combination. First, the uncertainty criterion filters out most candidate instances. Then, with the k-means clustering algorithm in the diversity criterion, our method selects representative instances that are widely dispersed in the visualization map, which indicates that it tends to select diverse instances and can avoid selecting redundant ones. Besides diversity, other factors also affect instance selection: the final selected instances are not evenly dispersed in the space, because the noise and coverage criteria introduce further differences among them.

Fig. 9.

Fig. 9. Pink points are top instances selected in the uncertainty rank.

Fig. 10.

Fig. 10. Red points are top instances selected in the final sequentially combined rank.


6 CONCLUSION

In this article, we propose a new active learning approach for sentence matching. Besides the standard uncertainty criterion, it employs a pre-trained language model to provide extra textual criteria, which capture textual characteristics of candidate instances and enhance active learning. We conduct experiments on both English and Chinese sentence matching datasets. The experimental results show that our proposed approach can effectively improve the performance of active learning for sentence matching.


REFERENCES

  [1] Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. 2019. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv:1906.03671. Retrieved from https://arxiv.org/abs/1906.03671.
  [2] Guirong Bai, Shizhu He, Kang Liu, Jun Zhao, and Zaiqing Nie. 2020. Pre-trained language model based active learning for sentence matching. In Proceedings of the 28th International Conference on Computational Linguistics. 1495–1504. https://doi.org/10.18653/v1/2020.coling-main.130
  [3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003), 993–1022.
  [4] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv:1508.05326. Retrieved from https://arxiv.org/abs/1508.05326.
  [5] Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, and Buzhou Tang. 2018. The BQ corpus: A large-scale domain-specific Chinese corpus for sentence semantic equivalence identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 4946–4951. https://doi.org/10.18653/v1/D18-1536
  [6] Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2016. Enhanced LSTM for natural language inference. arXiv:1609.06038. Retrieved from https://arxiv.org/abs/1609.06038.
  [7] Jihun Choi, Kang Min Yoo, and Sang-goo Lee. 2018. Learning to compose task-specific tree structures. In Proceedings of the AAAI Annual Conference on Artificial Intelligence (AAAI'18).
  [8] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv:1705.02364. Retrieved from https://arxiv.org/abs/1705.02364.
  [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https://arxiv.org/abs/1810.04805.
  [10] Qianlong Du, Chengqing Zong, and Keh-Yih Su. 2020. Conducting natural language inference with word-pair-dependency and local context. ACM Trans. Asian Low-Resourc. Lang. Inf. Process. 19, 3 (2020), 1–23.
  [11] Alexander Erdmann, David Joseph Wrisley, Benjamin Allen, Christopher Brown, Sophie Cohen-Bodénès, Micha Elsner, Yukun Feng, Brian Joseph, Béatrice Joyeux-Prunel, and Marie-Catherine de Marneffe. 2019. Practical, efficient, and customizable active learning for named entity recognition in the digital humanities. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2223–2234. https://doi.org/10.18653/v1/N19-1231
  [12] Meng Fang, Yuan Li, and Trevor Cohn. 2017. Learning how to active learn: A deep reinforcement learning approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 595–605. https://doi.org/10.18653/v1/D17-1063
  [13] Daniel Gissin and Shai Shalev-Shwartz. 2019. Discriminative active learning. arXiv:1907.06347. Retrieved from https://arxiv.org/abs/1907.06347.
  [14] Yichen Gong, Heng Luo, and Jian Zhang. 2017. Natural language inference over interaction space. arXiv:1709.04348. Retrieved from https://arxiv.org/abs/1709.04348.
  [15] Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2017. First Quora Dataset Release: Question Pairs. Retrieved from Data.quora.com.
  [16] Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-resource deep entity resolution with transfer and active learning. arXiv:1906.08042. Retrieved from https://arxiv.org/abs/1906.08042.
  [17] Seonhoon Kim, Inho Kang, and Nojun Kwak. 2019. Semantic sentence matching with densely-connected recurrent and co-attentive information. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6586–6593.
  [18] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Comput. Sci. (2014).
  [19] Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv:1312.6114. Retrieved from https://arxiv.org/abs/1312.6114.
  [20] Ryan Kiros, Yukun Zhu, Russ R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems. 3294–3302.
  [21] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv:1909.11942. Retrieved from https://arxiv.org/abs/1909.11942.
  [22] Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, and Buzhou Tang. 2018. LCQMC: A large-scale Chinese question matching corpus. In Proceedings of the 27th International Conference on Computational Linguistics. 1952–1962.
  [23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692. Retrieved from https://arxiv.org/abs/1907.11692.
  [24] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 11 (2008), 2579–2605.
  [25] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
  [26] Yixin Nie and Mohit Bansal. 2017. Shortcut-stacked sentence encoders for multi-domain inference. arXiv:1708.02312. Retrieved from https://arxiv.org/abs/1708.02312.
  [27] Aishwarya Padmakumar, Peter Stone, and Raymond Mooney. 2018. Learning a policy for opportunistic active learning. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP'18). 1347–1357.
  [28] Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv:1606.01933. Retrieved from https://arxiv.org/abs/1606.01933.
  [29] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'14). 1532–1543.
  [30] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv:1802.05365. Retrieved from https://arxiv.org/abs/1802.05365.
  [31] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding with Unsupervised Learning. Technical Report. OpenAI.
  [32] Lorenza Romano, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli. 2006. Investigating a generic paraphrase-based approach for relation extraction. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL'06). 409–416.
  [33] Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. arXiv:1708.00489. Retrieved from https://arxiv.org/abs/1708.00489.
  [34] Burr Settles. 2009. Active Learning Literature Survey. Technical Report. University of Wisconsin—Madison Department of Computer Sciences.
  [35] Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. 1070–1079.
  [36] Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, and Chew Lim Tan. 2004. Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL'04). 589–596.
  [37] Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Sen Wang, and Chengqi Zhang. 2018. Reinforced self-attention network: A hybrid of hard and soft attention for sequence modeling. arXiv:1801.10296. Retrieved from https://arxiv.org/abs/1801.10296.
  [38] Yanyao Shen, Hyokun Yun, Zachary C. Lipton, Yakov Kronrod, and Animashree Anandkumar. 2017. Deep active learning for named entity recognition. arXiv:1707.05928. Retrieved from https://arxiv.org/abs/1707.05928.
  [39] Aditya Siddhant and Zachary C. Lipton. 2018. Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP'18). 2904–2909.
  [40] Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems. 4077–4087.
  [41] Simon Tong and Daphne Koller. 2001. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 2 (Nov. 2001), 45–66.
  [42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
  [43] Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL'07). 22–32.
  [44] Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. arXiv:1702.03814. Retrieved from https://arxiv.org/abs/1702.03814.
  [45] Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv:1704.05426. Retrieved from https://arxiv.org/abs/1704.05426.
  [46] Yang Xu, Yu Hong, Huibin Ruan, Jianmin Yao, Min Zhang, and Guodong Zhou. 2018. Using active learning to expand training data for implicit discourse relation recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 725–731.
  [47] Hong-Ming Yang, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. 2018. Robust classification with convolutional prototype learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3474–3482.
  [48] Liu Yang, Qingyao Ai, Jiafeng Guo, and W. Bruce Croft. 2016. aNMM: Ranking short answer texts with attention-based neural matching model. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 287–296.
  [49] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, Vol. 32.
  [50] Michelle Yuan, Hsuan-Tien Lin, and Jordan Boyd-Graber. 2020. Cold-start active learning through self-supervised language modeling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 7935–7948.
  [51] Ye Zhang, Matthew Lease, and Byron Wallace. 2017. Active discriminative text representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
  [52] Jingbo Zhu, Huizhen Wang, Tianshun Yao, and Benjamin K. Tsou. 2008. Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). 1137–1144.
