
Using Pre-trained Language Model to Enhance Active Learning for Sentence Matching

Published: 30 December 2021


Abstract

Active learning is an effective method to substantially alleviate the problem of expensive annotation cost for data-driven models. Recently, pre-trained language models have been demonstrated to be powerful for learning language representations. In this article, we demonstrate that the pre-trained language model can also utilize its learned textual characteristics to enrich criteria of active learning. Specifically, we provide extra textual criteria with the pre-trained language model to measure instances, including noise, coverage, and diversity. With these extra textual criteria, we can select more efficient instances for annotation and obtain better results. We conduct experiments on both English and Chinese sentence matching datasets. The experimental results show that the proposed active learning approach can be enhanced by the pre-trained language model and obtain better performance.


1 INTRODUCTION

Sentence matching is an important task of natural language processing. It aims to judge the relation between two sentences, such as whether the two sentences express the same meaning. Over the past few years, deep learning as a data-driven technique has achieved state-of-the-art performance on sentence matching [6, 14, 17, 28, 44, 48]. However, these data-driven models inevitably need large amounts of manual annotation, which is expensive. If large amounts of labeled data cannot be provided, then the advantages of deep learning diminish significantly.

To alleviate the above problem, active learning has been proposed to achieve better performance with fewer labeled training instances [34]. Instead of randomly selecting candidate instances, active learning measures all unlabeled instances based on some criteria and is thus able to select more effective instances for subsequent annotation and training [11, 16, 38, 46, 51]. Nevertheless, previous active learning for natural language processing typically relies on an entropy-based uncertainty criterion [34] and neglects the characteristics of textual data. For example, in the question answering task, there may be many questions that share similar textual expression and intent. If we neglect this textual similarity, then we may select redundant instances, which have little effect on training the classifier and waste annotation cost. Hence, how to use textual criteria to measure candidate instances remains a challenge.

Recently, pre-trained language models [9, 30, 31, 49] have been shown to be effective for improving many natural language processing (NLP) tasks. Typically, learned language representations of pre-trained language models contain textual characteristics of data. Therefore, pre-trained language models may be useful to enrich criteria of active learning with the textual characteristics. To this end, this article proposes a new active learning approach for sentence matching. It employs criteria from a pre-trained language model to capture textual characteristics and then utilizes these extra textual criteria to enhance active learning.

Specifically, the proposed active learning approach simultaneously measures the uncertainty, noise, coverage, and diversity of a sentence pair instance, as shown in Figure 1. Uncertainty, as a standard criterion, indicates the classification uncertainty of instances, and the classifier prefers instances with high uncertainty to improve its discriminative ability. Noise, coverage, and diversity are our proposed linguistic criteria, which are based on language characteristics from a pre-trained language model. Noise indicates how much potential noise there is in an instance. Noise may degrade classifier performance, so instances with more potential noise should have lower priority to be added into the training set. Coverage indicates whether the language expression of an instance is easy to model, and the classifier needs instances with low coverage (such as low-frequency professional expressions) to enrich representation learning. Diversity indicates how diverse the instances are. The classifier should avoid selecting similar and redundant instances with low diversity, because they have little effect on training the classifier and waste annotation cost, whereas diverse instances help the model learn more varied textual expressions and matching patterns. In the end, the above four criteria are combined to select the most effective instances for annotation and training.

Fig. 1.

Fig. 1. Illustration of our proposed active learning approach. Noise, coverage, and diversity are extra textual criteria from the pre-trained language model to enhance active learning.

Besides, homophones are common in Chinese data; e.g., both "钱" (money) and "前" (front) have the same pronunciation ("qian"). Wrong homophones may be caused by typing errors, speech recognition errors, and so on. Chinese sentence matching inevitably suffers from this problem, which affects the model. For example, we sampled 100 examples from a Chinese dataset and found that 6 of them contain wrong homophones. However, if we simply drop these instances with wrong homophones according to the noise criterion, then the model is unable to deal with natural test samples that contain the same wrong homophones (i.e., that have the same error distribution and error forms). To solve this problem, we improve the noise criterion. Specifically, we also use the pre-trained language model to recognize instances with possible Chinese wrong homophones and allow the model to train on these instances for robustness.

In brief, our main contributions include the following:

  • We demonstrate that the amount of labeled training data for sentence matching can also be substantially reduced with active learning.

  • We propose to use a pre-trained language model to enhance active learning for sentence matching, which provides extra criteria to capture textual characteristics.

  • For the Chinese sentence matching task, we propose to utilize the pre-trained language model to recognize instances with possible Chinese wrong homophones and allow the model to train on them for robustness.

  • The experimental results on both English and Chinese sentence matching datasets show that pre-trained language models are able to not only directly improve downstream tasks but also help measure instances to enhance active learning.

This journal paper is an extended version of the conference paper [2]. The new content in this article includes a more detailed description of the method, a new contribution for the Chinese setting together with corresponding experiments and discussion, experiments on an alternative algorithm for the diversity rank, a comparative study of different thresholds for the coverage criterion, a comparison among different pre-trained language models, and a visual analysis of the instance selection process.


2 RELATED WORK

2.1 Active Learning

Active learning aims to reduce the annotation cost of data-driven techniques. It selects more efficient instances based on some criteria and achieves better performance with fewer labeled training instances. Typically, previous studies consider a pool-based setting [41, 52], where there is a large pool of available unlabeled data and the task is to draw a limited number of examples to be labeled so as to maximize classifier performance. Tong et al. [41] suggest a margin-based selection criterion, and References [35, 36, 52] combine multiple criteria for NLP tasks. With the development of deep learning, active learning plays an important role in low-resource settings, such as text classification [51], sequence tagging tasks like named entity recognition [11, 38], entity resolution [16], implicit discourse relation recognition [46], and so on. In this article, we propose extra criteria that capture textual characteristics ignored by previous methods and enhance active learning. Besides, Siddhant et al. [39] provide a large-scale empirical study of deep active learning, Padmakumar et al. [27] incorporate active learning into interactive tasks, and Fang et al. [12] reframe active learning as a reinforcement learning problem with a data selection policy.

2.2 Sentence Matching

Sentence matching is to judge the relation between two sentences and has been widely utilized in many natural language processing tasks, such as question matching [15] and natural language inference [4, 10, 45]. Earlier approaches mainly relied on conventional methods [32, 43]. Deep learning-based methods have since substantially improved the performance of sentence matching. They are divided into two types. The first type is sentence-encoding-based [7, 8, 26, 37], where sentences are encoded into independent sentence representations for classification. The other type is alignment-based [6, 14, 17, 28, 44, 48], which establishes word-level alignment and dependency between the two sentences to capture their relationship. Active learning is a model-agnostic policy, and thus we do not focus on which model to use.

2.3 Pre-trained Language Model

Recently, pre-trained language models have learned effective unsupervised language representations from large-scale free text and dramatically improved the performance of many natural language processing tasks. Some studies proposed to learn context-independent representations for each word [25, 29]. Peters et al. [30] learned contextual token representations, which depend on the representation of the entire input sentence. Radford et al. [31] used a unidirectional Transformer [42] to learn representations by autoregressive language modeling. Devlin et al. [9] used a bidirectional Transformer to learn representations by autoencoding language modeling. Yang et al. [49] leveraged the advantages of both autoregressive and autoencoding language modeling while avoiding their limitations. In this article, we employ a pre-trained language model to provide extra textual criteria for enhancing active learning, which is beneficial for obtaining more efficient instances.


3 PRELIMINARIES

3.1 Sentence Matching

Given a sentence pair as input, the goal of the sentence matching task is to judge the relation between the two sentences, such as whether they have the same intent. Formally, we have two sentences A = [\(a_1\), \(a_2\),...,\(a_{l_A}\)] and B = [\(b_1\), \(b_2\),...,\(b_{l_B}\)], where \(a_i\) and \(b_j\) are the ith and jth words, respectively, in the two input sentences, and \(l_A\) and \(l_B\) denote their lengths.

Through a shared word embedding matrix \({\bf W}_e\) \(\in\) \(\mathbb {R}^{n_e \times d}\), we can obtain the word embeddings of the two input sentences \({\bf a}\) = [\({\bf e}(a_1)\), \({\bf e}(a_2)\),..., \({\bf e}(a_{l_A}\))] and \({\bf b}\) = [\({\bf e}(b_1)\), \({\bf e}(b_2)\),..., \({\bf e}(b_{l_B})\)], where \(n_e\) denotes the vocabulary size, d denotes the embedding size, and \({\bf e}(a_i)\) and \({\bf e}(b_j)\) denote the word embeddings of the ith and jth words in the corresponding sentences. A model M for sentence matching predicts a label \(\hat{y}\) according to \({\bf a}\) and \({\bf b}\). For testing, we select the label with the highest probability in the prediction distribution \(P(y_i|{\bf a},{\bf b};\theta _M)\) as output, where \(\theta _M\) denotes the parameters of the model M and \(y_i\) is a possible label. For training, the model M is optimized by minimizing the following cross entropy loss: (1) \[\begin{equation} Loss=-\log P(y|{\bf a},{\bf b};\theta _M), \end{equation}\] where y denotes the gold label.

3.2 Active Learning

In this article, we follow the pool-based active learning setting [41, 52], where there is a small set of labeled data P and a large pool of available unlabeled data Q. P is used to train a classifier/model and absorbs new instances from Q. The task of active learning is to select instances in Q according to some criteria and then label them and add them into P, so as to maximize the performance of the classifier/model while minimizing the expensive annotation cost. In the criteria for instance selection, a measure is used to score all candidate instances in Q, and instances maximizing this measure are selected into P.

The pipeline of active learning is shown in Algorithm 1. The instance selection process in active learning is iterative, and the process will repeat until a fixed annotation budget is reached. At each round, there are limited n instances to be selected and labeled for subsequent training.

With the same size of labeled dataset P, the criteria for instance selection in active learning determine the classifier/model performance. Commonly, the criteria mainly rely on the uncertainty criterion (uncertainty sampling) [41, 52], where instances near decision boundaries have priority to be selected. A standard uncertainty criterion uses entropy, which is defined as follows: (2) \[\begin{equation} Ent(x_i)=-\sum _{k}P(y_i=k|x_i;\theta _M)\log P(y_i=k|x_i;\theta _M), \end{equation}\] where k indexes all possible labels and \(x_i\) is a candidate instance that consists of a sentence pair (A, B) in the available unlabeled dataset Q.
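To make the entropy criterion concrete, the following minimal sketch scores and ranks candidates by Equation (2). It assumes the classifier exposes a predicted probability distribution for each candidate pair; the function and variable names are illustrative rather than taken from the original implementation.

```python
import numpy as np

def entropy_scores(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Equation (2) for a batch of candidates.

    probs: array of shape (num_candidates, num_labels) holding
           P(y_i = k | x_i; theta_M) for every candidate sentence pair.
    """
    return -np.sum(probs * np.log(probs + eps), axis=1)

def uncertainty_order(probs: np.ndarray) -> np.ndarray:
    """Candidate indices sorted from most to least uncertain."""
    return np.argsort(-entropy_scores(probs))

# Example with three candidates and three possible labels.
probs = np.array([[0.34, 0.33, 0.33],   # near-uniform -> high entropy
                  [0.90, 0.05, 0.05],   # confident    -> low entropy
                  [0.60, 0.30, 0.10]])
print(uncertainty_order(probs))  # -> [0 2 1]
```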


4 METHODOLOGY

We motivate our proposed method with the idea that the textual characteristics learned by pre-trained language models are useful to enrich the criteria of active learning. To achieve this, we employ a pre-trained language model to provide textual criteria for active learning. During active learning, a sentence matching model M is trained at each round with the labeled instances obtained from active learning. Active learning is a model-agnostic policy and focuses on instance selection to reduce annotation cost; thus we choose the model in Reference [9] as the classifier M and fix it for all active learning baselines.

4.1 Pre-trained Language Model

The pre-trained language model is trained by large-scale unsupervised text in advance and is independent from the sentence matching model. In this article, we use the autoencoding language model BERT [9] as the pre-trained language model, because it is the most widely used. When pre-training the autoencoding language model, tokens are randomly selected to be masked, and then these masked tokens are reconstructed again based on the context.

From the pre-trained language model, we can obtain two kinds of information, which can be useful to provide extra textual criteria for active learning. One is the loss of reconstructing a masked token. Given a sentence A (the same applies to B), we can obtain the cross entropy loss \(s_{a_i}\) of reconstructing the ith word \(a_i\) by masking only \(a_i\) and predicting \(a_i\) again. This is illustrated in Figure 2. We can obtain the reconstruction losses of all tokens by masking them in turn; a sketch of this procedure is given after Figure 2. The other is the word embeddings of the tokens in the sentence \({\bf a}\) = [\({\bf e}(a_1)\), \({\bf e}(a_2),\ldots ,{\bf e}(a_{l_A}\))]. Here we use the last embedding layer in BERT, which provides contextual representations.

Fig. 2.

Fig. 2. Reconstruction loss of a token from the pre-trained language model.
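As an illustration of how the per-token reconstruction loss \(s_{a_i}\) can be obtained, the sketch below masks one token at a time with a BERT masked language model and reads off the cross entropy of recovering it. It uses the Hugging Face `transformers` interface and is only a rough approximation of the procedure described above, not the authors' code.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def reconstruction_losses(sentence: str):
    """Mask each token in turn and return (token, loss) pairs, i.e., s_{a_i}."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    losses = []
    for pos in range(1, input_ids.size(0) - 1):   # skip [CLS] and [SEP]
        masked = input_ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, pos]
        loss = torch.nn.functional.cross_entropy(
            logits.unsqueeze(0), input_ids[pos].unsqueeze(0))
        token = tokenizer.convert_ids_to_tokens([input_ids[pos].item()])[0]
        losses.append((token, loss.item()))
    return losses

print(reconstruction_losses("What's the nest way to learn Japanese?"))
```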

4.2 Criteria for Instance Selection

In our proposed active learning, there are four criteria for instance selection, including uncertainty, noise, coverage, and diversity. Uncertainty is the standard criterion of active learning. Noise, coverage, and diversity are proposed textual criteria from the pre-trained language model to capture textual characteristics. In each criterion, we can obtain the rank of candidate instances, which is used for the final instance selection.

(1) Uncertainty: The uncertainty criterion indicates the classification uncertainty of an instance and is the standard criterion in active learning. Instances with high uncertainty are more helpful for optimizing the classifier and are thus worthier of selection. The uncertainty is computed as the entropy, and we can obtain the uncertainty rank \(rank_{uncertain}(x_i)\) for the ith instance in Q according to the entropy. Specifically, (3) \[\begin{equation} rank_{uncertain}(x_i)\propto -Ent(x_i), \end{equation}\] where \(Ent(x_i)\) is defined in Equation (2).

(2) Noise: The noise criterion indicates how much potential noise there is in an instance. Intuitively, instances with noise may degrade the labeled data P and harm classifier performance. Thus, we tend to select noiseless instances; instances with more potential noise should have lower priority to be added into P. Sentences in noisy instances usually have rare textual expression, and their generating probabilities are low. In other words, a noisy token is hard to reconstruct from the context by the pre-trained language model. For example, in the sentence “What’s the nest way to learn Japanese?,” “nest” is a spelling mistake (it should be “best”) and has a higher reconstruction loss. According to the above assumption, we can formulate the noise criterion based on the losses of reconstructing masked tokens. It is defined as follows: (4) \[\begin{align} rank_{noise}(x_i)&\propto \frac{1}{P(A)+P(B)}, \end{align}\] (5) \[\begin{align} P(A)=\prod _{i=1}^{l_A}P(a_i|a_1 &\dots a_{i-1})\propto \frac{l_A}{\sum _{i \in l_A}s_{a_i}}, \end{align}\] (6) \[\begin{align} P(B)=\prod _{i=1}^{l_B}P(b_i|b_1 &\dots b_{i-1})\propto \frac{l_B}{\sum _{i \in l_B}s_{b_i}}, \end{align}\] where \(P(A)\) and \(P(B)\) denote the probabilities of generating the corresponding sentences, \(rank_{noise}(x_i)\) denotes the noise rank of the ith instance in Q, and \(s_{a_i}\) and \(s_{b_i}\) are the reconstruction losses of the corresponding words (\(a_i\) and \(b_i\)) from the pre-trained language model.
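Given the per-token reconstruction losses from the previous section, the noise score of Equations (4)–(6) reduces to simple averaging. A minimal sketch (the function names are ours):

```python
def prob_proxy(losses):
    """Proxy for the generating probability in Eqs. (5)-(6): the lower the
    average reconstruction loss, the higher the probability."""
    return len(losses) / sum(losses)

def noise_score(losses_a, losses_b):
    """Proportional to rank_noise in Eq. (4); larger values mean a noisier
    instance, which is pushed toward the end of the selection order."""
    return 1.0 / (prob_proxy(losses_a) + prob_proxy(losses_b))

# Hypothetical per-token losses for two versions of the same sentence pair.
clean = noise_score([0.3, 0.8, 0.5], [0.4, 0.6])
noisy = noise_score([0.3, 9.7, 0.5], [0.4, 0.6])   # one badly garbled token
assert noisy > clean
```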

In particular, to handle the common case of wrong homophones in Chinese data, which may be caused by typing errors, speech recognition errors, and so on, we improve the noise criterion for the Chinese setting. Specifically, we reconstruct with BERT the noisy tokens whose reconstruction losses exceed 10.0. If a newly reconstructed token (among the top 5 candidates generated from the context) has a similar phonetic transcription to the original noisy token, then we regard it as a possible wrong homophone and replace the original token's reconstruction loss with that of the new token in the noise rank. As a result, the instance is more likely to be kept so that the model can fit it for robustness.
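A minimal sketch of this homophone check is given below. It assumes `pypinyin` for the phonetic transcription and a helper that returns the top-5 tokens reconstructed by the masked language model at the noisy position together with their losses; the threshold of 10.0 follows the description above, while the helper output format and the tone-less pinyin comparison are our own simplifications.

```python
from pypinyin import lazy_pinyin

LOSS_THRESHOLD = 10.0  # tokens above this reconstruction loss are treated as noise

def same_pronunciation(tok_a: str, tok_b: str) -> bool:
    """True if two Chinese tokens share the same (tone-less) pinyin."""
    return lazy_pinyin(tok_a) == lazy_pinyin(tok_b)

def adjust_loss_for_homophone(orig_token, orig_loss, top5_candidates):
    """If a high-loss token looks like a wrong homophone, replace its loss with
    the loss of the homophone candidate so the noise criterion does not discard
    the instance.

    top5_candidates: list of (token, reconstruction_loss) pairs predicted by
    the masked language model at this position (assumed helper output).
    """
    if orig_loss <= LOSS_THRESHOLD:
        return orig_loss
    for cand_token, cand_loss in top5_candidates:
        if same_pronunciation(orig_token, cand_token):
            return cand_loss   # likely a wrong homophone: keep the instance trainable
    return orig_loss
```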

(3) Coverage: The coverage criterion indicates whether the textual expression of an instance is easy to model. On the one hand, some tokens such as stop words are frequent and easy to model (high coverage). On the other hand, the classifier needs instances with low coverage (such as low-frequency professional expressions) to enrich representation learning. In such cases, we tend to select instances with low coverage. Harder instances such as low-frequency sentences usually have low generating probabilities; in other words, it is more difficult to predict masked tokens for low-coverage textual expressions. For example, in the sentence “Can I carry needle in domestic flight?,” “domestic” is a relatively low-frequency, professional token compared with others like “in” and has a higher reconstruction loss. Thus, we can employ reconstruction losses to capture low-coverage textual expressions. The criterion is defined as follows: (7) \[\begin{align} rank_{coverage}(x_i)&\propto \frac{1}{\frac{\sum _{j \in l_A}c_{a_j}s_{a_j}}{\sum _{j \in l_A}c_{a_j}}+\frac{\sum _{j \in l_B}c_{b_j}s_{b_j}}{\sum _{j \in l_B}c_{b_j}}}, \end{align}\] (8) \[\begin{align} &c_{a_j}=\left\lbrace \begin{array}{ll} {0} &if \ s_{a_j} \gt \beta \\ {1} &others \end{array}, \right. \end{align}\] (9) \[\begin{align} &c_{b_j}=\left\lbrace \begin{array}{ll} {0} &if \ s_{b_j} \gt \beta \\ {1} &others \end{array}, \right. \end{align}\] where \(\beta\) denotes a threshold. If the reconstruction loss of a token is too high, then it is more likely to be noise. Thus, if the reconstruction loss of a token (\(a_j\) or \(b_j\)) exceeds \(\beta\), then we ignore it via the corresponding weight (\(c_{a_j}\) or \(c_{b_j}\)). After observing only a few instances, we set \(\beta\) to 10.0 for both English and Chinese.
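Using the same reconstruction losses, the coverage score of Equations (7)–(9) can be sketched as follows (names are illustrative; a smaller score means a harder, lower-coverage expression, which is preferred):

```python
BETA = 10.0  # threshold from the text; losses above it are treated as noise

def masked_mean_loss(losses, beta=BETA):
    """Mean reconstruction loss over tokens with loss <= beta,
    i.e., the c-weighted averages in Eqs. (8)-(9)."""
    kept = [s for s in losses if s <= beta]
    return sum(kept) / len(kept) if kept else 0.0

def coverage_score(losses_a, losses_b, eps=1e-12):
    """Proportional to rank_coverage in Eq. (7); low-coverage (hard)
    instances obtain small scores and are selected first."""
    return 1.0 / (masked_mean_loss(losses_a) + masked_mean_loss(losses_b) + eps)
```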

(4) Diversity: The diversity criterion indicates how diverse the instances are. On the one hand, similar and redundant instances are not efficient for training a sentence matching model. On the other hand, diverse instances can help learn more varied textual expressions and matching patterns. Thus, we tend to select more diverse instances to be labeled into the labeled dataset P. The diversity criterion is based on two parts: the instance representation \({\bf v}_i\) \(\in\) \(\mathbb {R}^d\) and the diversity rank \(rank_{diversity}(x_i)\), where \(rank_{diversity}(x_i)\) depends on \({\bf v}_i\).

For instance representation, we employ a vector to represent a pair of sentences. The common approach is to use the sum of the word embeddings of the two sentences. However, a simple sum operation can hardly capture the difference between the two sentences. In fact, in sentence matching, it is the difference between two sentences that determines whether they share the same meaning; two sentences must have the same content if their characters are exactly the same. To model the difference between the two input sentences, we propose a novel Levenshtein Distance-based approach for instance representation.

Specifically, we employ the subtraction of word embeddings between the “Delete Sequence” \(L_D\) and the “Insert Sequence” \(L_I\) for instance representation. \(L_D\) and \(L_I\) are derived from the Levenshtein Distance: when we transform sentence A into sentence B by deleting and inserting tokens, these tokens are added into the “Delete Sequence” and “Insert Sequence,” respectively. This is illustrated in Figure 3, and a sketch of the construction is given after Figure 3. Besides, the word embeddings in the subtraction are weighted. Intuitively, some meaningless tokens have little effect on content, and they should have less weight. In addition, these meaningless tokens (such as prepositions) are typically easier to predict and have lower reconstruction losses. Thus, we can employ the reconstruction losses of tokens to formulate the weights. Finally, the representation of an instance is defined as follows: (10) \[\begin{align} {\bf v}_i=\Big |\sum _{j \in L_I}&w_{b_j}{\bf e}(b_j)-\sum _{j \in L_D}w_{a_j}{\bf e}(a_j)\Big |, \end{align}\] (11) \[\begin{align} &w_{a_j}=\frac{s_{a_j}}{\sum _{k \in l_A}s_{a_k}}, \end{align}\] (12) \[\begin{align} &w_{b_j}=\frac{s_{b_j}}{\sum _{k \in l_B}s_{b_k}}, \end{align}\] where \(s_{a_j}\) and \(s_{b_j}\) are the reconstruction losses of the jth words of A and B, respectively, and \(w_{a_j}\) and \(w_{b_j}\) denote the weights of the tokens in the subtraction; tokens with lower reconstruction loss have less weight. Considering the symmetry of the two input sentences, we apply an absolute value operation to every element of the representation.

Fig. 3.

Fig. 3. Illustration of “Delete Sequence” and “Insert Sequence.”
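The sketch below illustrates one way to build this representation: a standard Levenshtein alignment with backtrace yields the Delete and Insert sequences, and Equations (10)–(12) are then applied with the reconstruction-loss weights. It is an illustrative reading of the description above (in particular, a substitution is treated as a deletion plus an insertion), not the authors' implementation.

```python
import numpy as np

def delete_insert_sequences(tokens_a, tokens_b):
    """Token positions deleted from A and inserted from B along a Levenshtein
    alignment; a substitution contributes to both sequences."""
    la, lb = len(tokens_a), len(tokens_b)
    d = np.zeros((la + 1, lb + 1), dtype=int)
    d[:, 0], d[0, :] = np.arange(la + 1), np.arange(lb + 1)
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            cost = 0 if tokens_a[i - 1] == tokens_b[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # delete a_i
                          d[i, j - 1] + 1,          # insert b_j
                          d[i - 1, j - 1] + cost)   # match / substitute
    deletes, inserts, i, j = [], [], la, lb
    while i > 0 or j > 0:
        if i > 0 and j > 0 and tokens_a[i - 1] == tokens_b[j - 1] \
                and d[i, j] == d[i - 1, j - 1]:
            i, j = i - 1, j - 1                      # match: nothing recorded
        elif i > 0 and j > 0 and d[i, j] == d[i - 1, j - 1] + 1:
            deletes.append(i - 1); inserts.append(j - 1); i, j = i - 1, j - 1
        elif i > 0 and d[i, j] == d[i - 1, j] + 1:
            deletes.append(i - 1); i -= 1
        else:
            inserts.append(j - 1); j -= 1
    return deletes, inserts

def instance_representation(emb_a, emb_b, losses_a, losses_b, tokens_a, tokens_b):
    """Equation (10): weighted subtraction between Insert and Delete sequences."""
    deletes, inserts = delete_insert_sequences(tokens_a, tokens_b)
    w_a = np.asarray(losses_a) / np.sum(losses_a)   # Eq. (11)
    w_b = np.asarray(losses_b) / np.sum(losses_b)   # Eq. (12)
    vec_i, vec_d = np.zeros(emb_b.shape[1]), np.zeros(emb_a.shape[1])
    for j in inserts:
        vec_i += w_b[j] * emb_b[j]
    for j in deletes:
        vec_d += w_a[j] * emb_a[j]
    return np.abs(vec_i - vec_d)
```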

Then, we obtain the diversity rank based on the instance representation. To select diverse instances, we want to select representative ones and make them as different from each other as possible. Thus, we employ the k-means clustering algorithm for the diversity rank. Specifically, we divide instances into n clusters by k-means and obtain, for each cluster, a representative instance that is closest to the cluster center. These representative instances \(O_{diverse}\) are considered to have more diversity and thus are more likely to be labeled into the labeled dataset P for subsequent training. This is formulated as follows: (13) \[\begin{align} rank_{diversity}(x_i)&=\left\lbrace \begin{array}{ll} {0} &if \ x_i \in O_{diverse}\\ {n} &others \end{array}\!\!\!. \right. \end{align}\] In each round, we can obtain the diversity rank \(rank_{diversity}(x_i)\) for the ith instance in Q.
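A possible sketch of this clustering step with scikit-learn's `KMeans` (illustrative names; the cluster count equals the per-round budget n as described above):

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_representatives(reps: np.ndarray, n: int) -> np.ndarray:
    """Cluster the instance representations into n groups and return the index
    of the instance closest to each cluster center (the set O_diverse)."""
    km = KMeans(n_clusters=n, n_init=10, random_state=0).fit(reps)
    chosen = []
    for c in range(n):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(reps[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])
    return np.array(chosen)

def diversity_rank(reps: np.ndarray, n: int) -> np.ndarray:
    """Equation (13): rank 0 for representatives, n for all other candidates."""
    ranks = np.full(len(reps), n)
    ranks[diverse_representatives(reps, n)] = 0
    return ranks
```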

4.3 Instance Selection

When selecting instances, we use rank combination to combine the criteria into an overall rank. Specifically, we sequentially employ \(rank_{uncer}\), \(rank_{diver}\), \(rank_{cover}\), and \(rank_{noise}\) to select the top \(8n\), \(4n\), \(2n\), and n candidate instances, as illustrated in Figure 4. This can be written as \(rank(x_i)=rank_{uncertain}(x_i)\rightarrow rank_{diversity}(x_i)\rightarrow rank_{coverage}(x_i)\rightarrow rank_{noise}(x_i)\), where \(rank(x_i)\) denotes the final rank of a candidate instance in the unlabeled dataset Q. After ranking all candidate instances, we select the top n instances, label them, and add them into the labeled dataset P for subsequent training. A sketch of this sequential filtering is given after Figure 4.

Fig. 4.

Fig. 4. Illustration of sequential rank combination for the final instance selection.
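A minimal sketch of the sequential filtering in Figure 4, assuming each `rank_*` function returns candidate instances ordered from best to worst under its criterion; the names are illustrative:

```python
def sequential_selection(candidates, n,
                         rank_uncertain, rank_diversity, rank_coverage, rank_noise):
    """Apply the four criteria in sequence, keeping the top 8n, 4n, 2n, and
    finally n candidates, as illustrated in Figure 4."""
    pool = rank_uncertain(candidates)[:8 * n]
    pool = rank_diversity(pool)[:4 * n]
    pool = rank_coverage(pool)[:2 * n]
    return rank_noise(pool)[:n]
```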


5 EXPERIMENTS

5.1 Configuration

The number of instances to select at every round, n, is 100, and we perform 25 rounds of active learning, i.e., there is a total of 2,500 labeled instances for training in the end. The batch size is 16 for English and 32 for Chinese, and Adam [18] is used for optimization. We evaluate performance by calculating accuracy and learning curves on a held-out test set (classes are fairly balanced in the datasets) after all rounds.

5.2 Datasets

We conduct experiments on three English datasets and two Chinese datasets. Table 1 provides their statistics.

            training   validation     test
SNLI         549,367        9,842    9,824
MultiNLI     392,702        9,815    9,832
Quora        384,348       10,000   10,000
LCQMC        238,766        8,802   12,500
BQ           100,000        1,000    1,000

Table 1. Statistics of Datasets for Sentence Matching

  • SNLI: an English natural language inference corpus based on image captioning [4].

  • MultiNLI: an English natural language inference corpus with greater linguistic difficulty and diversity [45].

  • Quora: an English question matching corpus from the online question answering forum Quora [15].

  • LCQMC: an open-domain Chinese question matching corpus from the community question answering website Baidu Knows [22].

  • BQ: a domain-specific Chinese question matching corpus from online bank customer service logs [5].

5.3 Comparisons

To verify the effectiveness of our method, the following active learning approaches are compared:

  • Random sampling (Random): At each round, it randomly selects instances for annotation and training.

  • Uncertainty sampling (Entropy): It is the standard approach based on entropy [41, 52].

  • Expected Gradient Length (EGL): It aims to select instances expected to result in the greatest change to the gradients of tokens [35, 51].

  • Discriminative Active Learning (DAL): It poses active learning as a binary classification task, selecting instances to label in such a way as to make the labeled set and the unlabeled pool indistinguishable [13].

  • Core-Set (CORE): It selects instances such that a model learned over the selected subset is competitive for the remaining data points [33].

  • Gradient Lower Bounds (BADGE): It selects instances that are diverse and of high magnitude when represented in a hallucinated gradient space [1].

  • Cold-Start (COLD): It improves the above BADGE with a cold-start strategy [50].

  • Pre-trained language model sampling (LM): It is our proposed active learning approach.

5.4 Overall Results

Table 2 and Figure 5 (panels 1–5) show the accuracy and learning curves, respectively, of the different approaches on the five datasets. Overall, our proposed approach achieves better performance on both the English and Chinese datasets. This shows that the extra textual criteria from the pre-trained language model are helpful and demonstrates that a pre-trained language model can capture textual characteristics and provide more efficient instances for subsequent training. Besides, we can see that active learning approaches always outperform random sampling with the same number of instances. These results demonstrate that the annotation cost for training the sentence matching model can be substantially reduced by active learning. As a gradient-based active learning approach, EGL performs worse than standard active learning, which shows that it is not suitable for sentence matching. Intuitively, a single token may determine the polarity in a text classification task, but the sentence matching task needs to focus on the relations between pairs of words, and a single token cannot reflect such relations. Moreover, Entropy performs well compared with the other baselines most of the time. We attribute this to the type of task: the other active learning methods focus on classification of a single sample, whereas every instance in sentence matching consists of two sentences. Thus, these methods may not fit the task and perform worse than standard Entropy.

Fig. 5.

Fig. 5. Panels 1–5 are learning curves of overall comparisons on the five datasets. Panel 6 is about learning curves on four SNLI subsets to show the relation between size of unlabeled data and accuracy.

           Random   Entropy   EGL     DAL     CORE    BADGE   COLD    LM
SNLI        77.90     79.80   77.86   79.76   79.30   79.55   78.75   80.99
MultiNLI    67.83     70.27   66.80   70.56   69.40   69.29   68.65   71.79
Quora       79.01     80.21   77.91   79.68   80.55   80.69   79.48   81.79
LCQMC       82.04     83.25   80.35   82.72   82.82   82.66   81.92   84.29
BQ          71.44     73.60   71.59   73.29   73.67   73.88   72.81   74.73

Table 2. Accuracy of Different Approaches

5.5 Ablation Study

To demonstrate the effectiveness of the extra textual criteria from the pre-trained language model, we separately combine the different criteria with the standard uncertainty criterion. “Ent” denotes the standard uncertainty criterion, “E+Noi” denotes combining the uncertainty criterion with the noise criterion, “E+Cov” denotes combining the uncertainty criterion with the coverage criterion, “E+Div” denotes combining the uncertainty criterion with the diversity criterion, and “E+All” denotes the complete setting combining all criteria.

Table 3 and Figure 6 show the accuracy and learning curves, respectively. We observe that each combined criterion performs better than the single standard uncertainty criterion, indicating that each textual criterion from the pre-trained language model is effective. Besides, in Figure 6, although “E+Div” sometimes performs better at later rounds, “E+All” obtains better performance at early rounds. This demonstrates that the noise and coverage criteria are clearly helpful at early rounds, while the diversity criterion alone is not enough to measure instances at early rounds; thus, combining all criteria can speed up fitting the model. However, we also see that the diversity of instances becomes more important when the model is already relatively workable at later rounds. Therefore, “E+All” is more useful when the annotation budget is small.

Fig. 6.

Fig. 6. Learning curves of combining different proposed textual criteria with the uncertainty criterion on SNLI dataset.

Ent     E+Cov   E+Noi   E+Div   E+All
79.80   80.99   81.11   81.45   80.99

Table 3. Accuracy of Combining Different Proposed Textual Criteria with the Uncertainty Criterion on SNLI Dataset

5.6 Discussion

(1) Effectiveness of the improved noise criterion for Chinese wrong homophones: For the case of Chinese wrong homophones, we improve the noise criterion with the pre-trained language model. To demonstrate its effectiveness, we conduct experiments on the Chinese sentence matching datasets. Table 4 reports the results.

                     E+Noi                   E+All
          Ent    original  improved     original  improved
LCQMC    83.25      83.72     83.99        84.29     84.60
BQ       73.60      73.85     74.43        74.73     75.43

Table 4. Results of the Original Noise Criterion and the Improved Noise Criterion


We find that the improved noise criterion performs better than the original noise criterion, which demonstrates the effectiveness of our method. Besides, the improvement on BQ is more obvious than that on LCQMC. The reason may be that Chinese wrong homophones are more common in BQ.

(2) Size of unlabeled dataset versus accuracy: Additionally, we conduct experiments to report the relation between the size of the unlabeled dataset and the accuracy of the classifier. Specifically, we choose the SNLI dataset and construct four subsets of different sizes, including 5%, 10%, 50%, and 100% of the original dataset, and then observe their learning curves.

The results are reported in Figure 5 (panel 6). When the size of the unlabeled dataset is small, the superiority of the pre-trained model-based approach is not very obvious. However, as the size increases, the performance of the other approaches improves little, whereas our approach improves substantially and significantly outperforms the others, which demonstrates that our approach has an advantage for larger dataset sizes. The main reason may be that the effect of the diversity criterion is more significant for a larger unlabeled dataset: with more candidate instances, it has more chances to avoid selecting similar and redundant instances, accelerating convergence during training.

(3) Effectiveness of different instance representation methods: We examine the effectiveness of different instance representation methods in the diversity criterion. We compare our method with four approaches on the SNLI dataset: (a) using the first word embedding layer in BERT as context-independent representations (Uncontext); (b) using the subtraction between sentence vectors from an auto-encoder (AE) [19]; (c) using the subtraction between sentence vectors from a topic model (Topic) [3]; and (d) using the subtraction between sentence vectors from Skip-Thoughts (Skip) [20].

Table 5 and Figure 7 show the accuracy and learning curves, respectively. We can see that contextual representations are better than context-independent representations; intuitively, contextual representations are more precise, especially when dealing with polysemy. Next, we find that our proposed method outperforms the sentence vector-based methods (Topic, AE, and Skip), possibly because BERT used more data to learn language representations.

Fig. 7.

Fig. 7. Learning curves of different instance representation methods.

Entropy   Uncontext   AE      Topic   Skip    LM
79.80     80.63       80.42   80.54   80.71   80.99

Table 5. Accuracy of Different Instance Representation Methods

(4) Effectiveness of the subtraction operation on Levenshtein Distance: Here we demonstrate the effectiveness of the operation that uses the subtraction of word embeddings between the “Delete Sequence” and “Insert Sequence” in the diversity criterion. We compare it with four approaches on the SNLI dataset: (a) using the sum of the word embeddings of the whole sentence pair (Sum); (b) directly using the subtraction of the word embeddings of the two sentences in a pair without the “Delete Sequence” and “Insert Sequence” (Sub); (c) without weights for word embeddings (Nowei); and (d) without the absolute value operation for symmetry (Noabs).

Table 6 and Figure 8 show the accuracy and learning curves, respectively. We can see that the subtraction operation is better than the sum operation, demonstrating that subtraction better captures the difference between an input sentence pair and provides a better instance representation for the diversity rank. Besides, the results without the “Delete Sequence” and “Insert Sequence” are a little worse, verifying their effectiveness. The results without the weighting of word embeddings are worse, showing that weighting different tokens is useful. Moreover, the results without the absolute value operation for symmetry are also worse, demonstrating the necessity of this operation.

Fig. 8.

Fig. 8. Learning curves of subtraction operation on Levenshtein Distance.

Entropy   Sum     Sub     Nowei   Noabs   LM
79.80     80.35   80.67   80.29   80.44   80.99

Table 6. Accuracy of Subtraction Operation on Levenshtein Distance

(5) Effectiveness of the threshold for the coverage criterion: For the coverage criterion, the threshold \(\beta\) in Equations (8) and (9) decides which tokens are regarded as noise. Here we show the effects of three different thresholds (3.0, 5.0, and 10.0); the experimental results are reported in Table 7.

            threshold-3.0   threshold-5.0   threshold-10.0
SNLI            80.46           80.77           80.99
MultiNLI        70.92           71.56           71.79
Quora           81.36           81.55           81.79
LCQMC           83.94           84.07           84.29
BQ              74.18           74.64           74.73

Table 7. The Experiments of Different Thresholds for Coverage Criterion

From these results, we find that a larger threshold is better for the coverage criterion. This suggests that noisy tokens indeed have high reconstruction losses and that we may lose some worthy instances if we adopt a lower threshold.

(6) Effectiveness of the k-means clustering algorithm for diversity rank: In our active learning approach, we employ the k-means clustering algorithm to select representative instances from the instance representations. Inspired by recent work on prototype learning [40, 47], here we also implement a prototype learning-based algorithm as a replacement for the k-means clustering algorithm and compare the effectiveness of the two strategies.

Prototype learning refers to the prototype as the representative point of a class in feature space. Inspired by this, we introduce a prototype learning-based selection algorithm. For each sentence matching class, the algorithm calculates the distance between each candidate instance and the corresponding class prototype and then yields a list of instances of that class sorted by their distance to the prototype. Intuitively, the closer a candidate instance is to the prototype, the more representative it is. According to the sorted lists of candidate instances, the first n candidate instances across all classes are selected to be labeled and added into the labeled dataset P. Formally, the prototype of a class is calculated as the weighted sum of the instance representations of all candidate instances, where the weight is based on the prediction probability of the corresponding class.
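A minimal sketch of this prototype-based alternative is given below, assuming the per-class prediction probabilities of the current classifier are available for every candidate; taking n instances per class is one reading of the description above, and all names are illustrative.

```python
import numpy as np

def class_prototypes(reps: np.ndarray, probs: np.ndarray) -> np.ndarray:
    """Prototype of each class: prediction-probability-weighted sum of the
    candidate instance representations."""
    # reps: (num_candidates, d); probs: (num_candidates, num_classes)
    weights = probs / probs.sum(axis=0, keepdims=True)
    return weights.T @ reps                      # (num_classes, d)

def prototype_selection(reps: np.ndarray, probs: np.ndarray, n: int) -> np.ndarray:
    """For every class, pick the n candidates closest to its prototype."""
    protos = class_prototypes(reps, probs)
    selected = []
    for c in range(protos.shape[0]):
        dists = np.linalg.norm(reps - protos[c], axis=1)
        selected.extend(np.argsort(dists)[:n].tolist())
    return np.unique(np.array(selected))
```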

We conduct experiments on the SNLI dataset to compare the two algorithms for the diversity rank under the “E+Div” and “E+All” criteria, both of which contain the diversity criterion. The “E+Div” criterion combines the uncertainty criterion with the diversity criterion, and “E+All” combines the uncertainty criterion with all criteria. The experimental results are reported in Table 8. We observe that the k-means clustering algorithm performs better than the prototype learning-based algorithm. The results demonstrate that the k-means clustering algorithm is more effective for the diversity rank and is able to select more diverse and representative instances for active learning. This is because the prototype learning-based algorithm can only capture the points around the prototype of a class and ignores the diversity of the instance representations. In contrast, the instances selected by the k-means clustering algorithm are more scattered, which brings more diversity.

                 E+Div                    E+All
           k-means  prototype      k-means  prototype
SNLI        81.45     80.32         80.99     80.16
MultiNLI    71.66     70.45         71.79     70.38
Quora       81.59     80.24         81.79     80.50
LCQMC       83.70     83.51         84.29     83.66
BQ          74.65     73.86         74.73     74.13

Table 8. Comparison between k-means Clustering Algorithm and Prototype Learning-based Algorithm for Instance Selection in the Diversity Criterion

5.7 Comparison among Different Pre-trained Language Models

In our method, we use BERT [9] as the pre-trained language model to enhance active learning, but our method is also easily compatible with other pre-trained language models. To explore the effect of different pre-trained language models, we conduct additional experiments in which other pre-trained language models replace BERT, including (1) RoBERTa [23], which modifies BERT with better hyperparameter choices; (2) ALBERT [21], which improves BERT with lower memory consumption and faster training speed; and (3) XLNet [49], which uses autoregressive pretraining without “mask” tokens (we put all context before the target token to simulate masking in our method). We implement them with publicly released resources. The results are shown in Table 9. We can see that different pre-trained language models perform notably differently. For example, XLNet performs best on the English datasets, and RoBERTa performs best on the Chinese datasets. The results indicate that the choice of pre-trained language model is important and that a better pre-trained language model can further benefit active learning.

           BERT    RoBERTa   ALBERT   XLNet
SNLI       80.99   81.45     80.39    82.58
MultiNLI   71.79   72.65     71.20    72.87
Quora      81.79   82.44     80.47    83.12
LCQMC      84.29   84.36     83.34    82.31
BQ         74.73   75.32     74.17    74.08

Table 9. Accuracy with Different Pre-trained Language Models

5.8 Visualization of Instance Selection

To better understand how the active learning approach selects valuable candidate instances for subsequent annotation and training, we conduct experiments on SNLI dataset to show the visualization of instance selection in the active learning.

Specifically, we sample a round as an example and employ t-SNE [24] to plot the embeddings for each candidate sentence pair instance (i.e., Levenshtein Distance-based instance representation in the diversity rank). Then, we highlight the selected candidate instances during the sequential rank combination (\(rank_{uncertain}(x_i)\rightarrow\) final \(rank(x_i)\)) to observe the instance selection process.
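A rough sketch of this visualization step with scikit-learn's `TSNE` and Matplotlib (the highlight colors follow Figures 9 and 10; everything else is illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_selection(reps, kept_by_uncertainty, final_selection):
    """Project the instance representations to 2D and highlight the candidates
    kept after the uncertainty rank and after the final combined rank."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(reps)
    plt.scatter(coords[:, 0], coords[:, 1], s=4, c="lightgray", label="candidates")
    plt.scatter(coords[kept_by_uncertainty, 0], coords[kept_by_uncertainty, 1],
                s=10, c="pink", label="after uncertainty rank")
    plt.scatter(coords[final_selection, 0], coords[final_selection, 1],
                s=14, c="red", label="final selection")
    plt.legend()
    plt.show()
```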

Figures 9 and 10 show the visualization of instance selection during the rank combination. We can see how the final selected instances are decided by the sequential rank combination. First, the uncertainty criterion filters out most candidate instances. Then, with the k-means clustering algorithm in the diversity criterion, our method selects representative instances that are widely dispersed in the visualization map, which indicates that it tends to select diverse instances and can avoid selecting redundant ones. Besides diversity, other factors also affect instance selection: the final selected instances are not evenly dispersed in the space, because the noise and coverage criteria introduce further differences among them.

Fig. 9.

Fig. 9. Pink points are top instances selected in the uncertainty rank.

Fig. 10.

Fig. 10. Red points are top instances selected in the final sequentially combined rank.


6 CONCLUSION

In this article, we propose a new active learning approach for sentence matching. Besides the standard uncertainty criterion, it employs a pre-trained language model to provide extra textual criteria, which capture textual characteristics of candidate instances and enhance active learning. We conduct experiments on both English and Chinese sentence matching datasets. The experimental results show that our proposed approach can effectively improve the performance of active learning for sentence matching.


REFERENCES

  [1] Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. 2019. Deep batch active learning by diverse, uncertain gradient lower bounds. arXiv:1906.03671. Retrieved from https://arxiv.org/abs/1906.03671.
  [2] Guirong Bai, Shizhu He, Kang Liu, Jun Zhao, and Zaiqing Nie. 2020. Pre-trained language model based active learning for sentence matching. In Proceedings of the 28th International Conference on Computational Linguistics. 1495–1504. https://doi.org/10.18653/v1/2020.coling-main.130
  [3] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003), 993–1022.
  [4] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv:1508.05326. Retrieved from https://arxiv.org/abs/1508.05326.
  [5] Jing Chen, Qingcai Chen, Xin Liu, Haijun Yang, Daohe Lu, and Buzhou Tang. 2018. The BQ corpus: A large-scale domain-specific Chinese corpus for sentence semantic equivalence identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 4946–4951. https://doi.org/10.18653/v1/D18-1536
  [6] Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2016. Enhanced LSTM for natural language inference. arXiv:1609.06038. Retrieved from https://arxiv.org/abs/1609.06038.
  [7] Jihun Choi, Kang Min Yoo, and Sang-goo Lee. 2018. Learning to compose task-specific tree structures. In Proceedings of the AAAI Annual Conference on Artificial Intelligence (AAAI'18).
  [8] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv:1705.02364. Retrieved from https://arxiv.org/abs/1705.02364.
  [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Retrieved from https://arxiv.org/abs/1810.04805.
  [10] Qianlong Du, Chengqing Zong, and Keh-Yih Su. 2020. Conducting natural language inference with word-pair-dependency and local context. ACM Trans. Asian Low-Resourc. Lang. Inf. Process. 19, 3 (2020), 1–23.
  [11] Alexander Erdmann, David Joseph Wrisley, Benjamin Allen, Christopher Brown, Sophie Cohen-Bodénès, Micha Elsner, Yukun Feng, Brian Joseph, Béatrice Joyeux-Prunel, and Marie-Catherine de Marneffe. 2019. Practical, efficient, and customizable active learning for named entity recognition in the digital humanities. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2223–2234. https://doi.org/10.18653/v1/N19-1231
  [12] Meng Fang, Yuan Li, and Trevor Cohn. 2017. Learning how to active learn: A deep reinforcement learning approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 595–605. https://doi.org/10.18653/v1/D17-1063
  [13] Daniel Gissin and Shai Shalev-Shwartz. 2019. Discriminative active learning. arXiv:1907.06347. Retrieved from https://arxiv.org/abs/1907.06347.
  [14] Yichen Gong, Heng Luo, and Jian Zhang. 2017. Natural language inference over interaction space. arXiv:1709.04348. Retrieved from https://arxiv.org/abs/1709.04348.
  [15] Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2017. First Quora Dataset Release: Question Pairs. Retrieved from Data.quora.com.
  [16] Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low-resource deep entity resolution with transfer and active learning. arXiv:1906.08042. Retrieved from https://arxiv.org/abs/1906.08042.
  [17] Seonhoon Kim, Inho Kang, and Nojun Kwak. 2019. Semantic sentence matching with densely-connected recurrent and co-attentive information. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 6586–6593.
  [18] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Comput. Sci. (2014).
  [19] Diederik P. Kingma and Max Welling. 2013. Auto-encoding variational Bayes. arXiv:1312.6114. Retrieved from https://arxiv.org/abs/1312.6114.
  [20] Ryan Kiros, Yukun Zhu, Russ R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems. 3294–3302.
  [21] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv:1909.11942. Retrieved from https://arxiv.org/abs/1909.11942.
  [22] Xin Liu, Qingcai Chen, Chong Deng, Huajun Zeng, Jing Chen, Dongfang Li, and Buzhou Tang. 2018. LCQMC: A large-scale Chinese question matching corpus. In Proceedings of the 27th International Conference on Computational Linguistics. 1952–1962.
  [23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692. Retrieved from https://arxiv.org/abs/1907.11692.
  [24] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 11 (2008), 2579–2605.
  [25] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
  [26] Yixin Nie and Mohit Bansal. 2017. Shortcut-stacked sentence encoders for multi-domain inference. arXiv:1708.02312. Retrieved from https://arxiv.org/abs/1708.02312.
  [27] Aishwarya Padmakumar, Peter Stone, and Raymond Mooney. 2018. Learning a policy for opportunistic active learning. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP'18). 1347–1357.
  [28] Ankur P. Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A decomposable attention model for natural language inference. arXiv:1606.01933. Retrieved from https://arxiv.org/abs/1606.01933.
  [29] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP'14). 1532–1543.
  [30] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv:1802.05365. Retrieved from https://arxiv.org/abs/1802.05365.
  [31] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding with Unsupervised Learning. Technical Report. OpenAI.
  [32] Lorenza Romano, Milen Kouylekov, Idan Szpektor, Ido Dagan, and Alberto Lavelli. 2006. Investigating a generic paraphrase-based approach for relation extraction. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL'06). 409–416.
  [33] Ozan Sener and Silvio Savarese. 2018. Active learning for convolutional neural networks: A core-set approach. arXiv:1708.00489. Retrieved from https://arxiv.org/abs/1708.00489.
  [34] Burr Settles. 2009. Active Learning Literature Survey. Technical Report. University of Wisconsin—Madison Department of Computer Sciences.
  [35] Burr Settles and Mark Craven. 2008. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. 1070–1079.
  [36] Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, and Chew Lim Tan. 2004. Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL'04). 589–596.
  [37] Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Sen Wang, and Chengqi Zhang. 2018. Reinforced self-attention network: A hybrid of hard and soft attention for sequence modeling. arXiv:1801.10296. Retrieved from https://arxiv.org/abs/1801.10296.
  [38] Yanyao Shen, Hyokun Yun, Zachary C. Lipton, Yakov Kronrod, and Animashree Anandkumar. 2017. Deep active learning for named entity recognition. arXiv:1707.05928. Retrieved from https://arxiv.org/abs/1707.05928.
  [39] Aditya Siddhant and Zachary C. Lipton. 2018. Deep Bayesian active learning for natural language processing: Results of a large-scale empirical study. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP'18). 2904–2909.
  [40] Jake Snell, Kevin Swersky, and Richard Zemel. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems. 4077–4087.
  [41] Simon Tong and Daphne Koller. 2001. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 2 (Nov. 2001), 45–66.
  [42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
  [43] Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL'07). 22–32.
  [44] Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. arXiv:1702.03814. Retrieved from https://arxiv.org/abs/1702.03814.
  [45] Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv:1704.05426. Retrieved from https://arxiv.org/abs/1704.05426.
  [46] Yang Xu, Yu Hong, Huibin Ruan, Jianmin Yao, Min Zhang, and Guodong Zhou. 2018. Using active learning to expand training data for implicit discourse relation recognition. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. 725–731.
  [47] Hong-Ming Yang, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. 2018. Robust classification with convolutional prototype learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3474–3482.
  [48] Liu Yang, Qingyao Ai, Jiafeng Guo, and W. Bruce Croft. 2016. aNMM: Ranking short answer texts with attention-based neural matching model. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 287–296.
  [49] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, Vol. 32.
  [50] Michelle Yuan, Hsuan-Tien Lin, and Jordan Boyd-Graber. 2020. Cold-start active learning through self-supervised language modeling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 7935–7948.
  [51] Ye Zhang, Matthew Lease, and Byron Wallace. 2017. Active discriminative text representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
  [52] Jingbo Zhu, Huizhen Wang, Tianshun Yao, and Benjamin K. Tsou. 2008. Active learning with sampling by uncertainty and density for word sense disambiguation and text classification. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008). 1137–1144.
