Neurocomputing

Volume 500, 21 August 2022, Pages 791-798

A novel multi-domain machine reading comprehension model with domain interference mitigation

https://doi.org/10.1016/j.neucom.2022.05.102

Abstract

Machine reading comprehension (MRC), an important task in natural language processing (NLP), aims to automatically answer a question after reading a passage. Dominant studies in this area mainly focus on domain-specific models. However, domain-specific models trained only on single-domain data often cannot achieve satisfactory performance. Although using data from other domains can bring some improvement, building a separate MRC model for each domain also makes deployment more difficult in practice. In this paper, we propose a multi-domain MRC model based on knowledge distillation (KD) with domain interference mitigation. Specifically, we employ KD to train a joint model by simultaneously using the multi-domain data and the output distributions of all domain-specific models. In this way, our joint model can better exploit multi-domain data while enabling simpler deployment. Moreover, to deal with the gradient conflicts caused by using data from different domains, we measure domain-level gradient similarity, based on which we propose an improved PCGrad (short for projecting conflicting gradients) algorithm with an adaptive learning rate. The algorithm mitigates domain interference to improve our joint model across domains. Experimental results and in-depth analysis on a set of benchmark datasets demonstrate the effectiveness of our joint model, and mitigating domain interference further improves its overall performance.

Introduction

As an important task in natural language processing (NLP), machine reading comprehension (MRC) aims to answer questions about a given passage. It has been employed in many real industrial applications, such as search engines, and has become a research hotspot in the NLP field [1], [2], [3], [4], [5], [6], [7], [8]. Nowadays, dominant MRC models are almost all either extractive or generative, where the former attract more attention due to their simplicity and computational efficiency.

In recent years, as the basis of MRC studies, many datasets of different domains have been constructed, e.g., SQuAD [9] collected from Wikipedia and NewsQA [10] built from news passages. The data of different domains usually vary greatly. To handle datasets of various domains, most existing approaches resort to building domain-specific models [11], [12]. However, the training data of a single domain are often limited and thus insufficient on their own to train a domain-specific model with satisfactory performance.

To address this issue, a previous study explores data augmentation to enhance domain-specific models [13]. Meanwhile, more researchers are committed to employing transfer learning to exploit multi-domain data for domain-specific models [14], [15], [16], [17]. Despite their success to some extent, building domain-specific models makes deployment more difficult, especially when a large number of domains are involved. One exception is Xu et al. [18], who explore a multi-task framework to train a unified model using a sample re-weighting technique. Notably, such an approach only focuses on data-level knowledge transfer, ignoring the beneficial knowledge contained in domain-specific models.

In this paper, inspired by the successful applications of knowledge distillation (KD) [19] in multilingual neural machine translation [20] and multi-task learning [21], we propose a KD-based multi-domain MRC model with domain interference mitigation. Specifically, we first individually train domain-specific MRC models on their own training data. Then, on the mixed-domain data, we use KD to transfer the knowledge of the domain-specific models to the joint model. During this process, the joint model is simultaneously supervised by the signal of the training data and the output distributions of the domain-specific models. More importantly, since training samples from different domains contribute to the parameter update of the joint model differently and can produce conflicting gradients, we introduce the PCGrad algorithm [22], originally used in multi-task learning, to modulate gradients from different domains. However, in the original PCGrad algorithm, the updating step size of model parameters is independent of the domain-level gradient similarities, which can harm model optimization. Intuitively, if there is a large conflict between the gradients of different domains, an excessive updating step size will hurt model performance on some domains. Therefore, the smaller the gradient similarity, the smaller the updating step size should be; conversely, the larger the gradient similarity, the larger the updating step size can be. To model this intuition, we propose to measure the gradient similarity between domains in terms of two factors: gradient direction and gradient magnitude. Then, based on the overall gradient similarity, we refine the conventional PCGrad algorithm by adaptively adjusting the learning rate used to optimize model parameters. Compared to [18], which only transfers knowledge at the data level, our joint model also leverages model-level knowledge transfer for MRC. Moreover, mitigating domain interference with our improved PCGrad algorithm further enhances the performance of our joint model.
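To make this intuition concrete, the following PyTorch sketch illustrates one way to combine PCGrad-style gradient projection with a learning rate scaled by a domain-level similarity score built from gradient direction (cosine) and magnitude. It is a minimal sketch under our own assumptions, not the authors' exact formulation: the similarity formula, the mapping from similarity to step size, and the function names are illustrative only.

```python
import torch

def cosine_sim(g1, g2):
    # Direction similarity between two flattened (detached) gradient vectors.
    return torch.dot(g1, g2) / (g1.norm() * g2.norm() + 1e-12)

def magnitude_sim(g1, g2):
    # Magnitude similarity in [0, 1]: 1 when the norms match, -> 0 when they differ greatly.
    n1, n2 = g1.norm(), g2.norm()
    return (2 * n1 * n2) / (n1 ** 2 + n2 ** 2 + 1e-12)

def pcgrad_adaptive_step(domain_grads, base_lr):
    """domain_grads: list of flattened, detached gradient tensors, one per domain.
    Returns (merged_gradient, adapted_learning_rate). Illustrative sketch only."""
    projected = [g.clone() for g in domain_grads]
    sims = []
    for i, g_i in enumerate(projected):
        for j, g_j in enumerate(domain_grads):
            if i == j:
                continue
            dot = torch.dot(g_i, g_j)
            if dot < 0:
                # Conflicting gradients: project g_i onto the normal plane of g_j (PCGrad).
                g_i -= dot / (g_j.norm() ** 2 + 1e-12) * g_j
            # Record a combined direction/magnitude similarity of the original gradients.
            sims.append(cosine_sim(domain_grads[i], g_j) * magnitude_sim(domain_grads[i], g_j))
    merged = torch.stack(projected).sum(dim=0)
    # Map the mean similarity from [-1, 1] to a factor in [0, 1]: strong conflict
    # (low similarity) shrinks the step, high similarity keeps the full step.
    scale = (torch.stack(sims).mean() + 1.0) / 2.0 if sims else torch.tensor(1.0)
    return merged, base_lr * scale.item()
```

The key design point this sketch illustrates is that projection removes the conflicting gradient component, while the similarity-based scale additionally shrinks the update when domains disagree strongly.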

The contributions of our work lie in the following points:

  • We introduce knowledge distillation to construct a multi-domain model for MRC. To the best of our knowledge, our work is the first attempt to explore knowledge transfer at both data and model levels in MRC.

  • We propose to refine the conventional PCGrad algorithm by adaptively adjusting its learning rate according to our designed gradient similarity metric, which mainly depends on the similarities in gradient direction and magnitude.

  • Experimental results show that our multi-domain model surpasses several commonly-used baselines, including domain-specific models. In particular, further analysis demonstrates the effectiveness of our improved PCGrad algorithm.

Related work

In this work, we propose a KD-based multi-domain model with domain interference mitigation for MRC. Our related work mainly includes the following two aspects:

Transfer Learning in MRC. Transfer learning is an effective approach to improving performance on a target task/domain by absorbing the knowledge gained in another one. In MRC, some domains lack abundant training data to build satisfactory domain-specific models. As pre-trained word embeddings [23], [24] and pre-trained language models [25], [26]

BERT-based MRC

In this section, we will describe the BERT-based MRC model [26], which is chosen as our basic model due to its competitive performance.

Fig. 1 shows the basic architecture of the BERT-based MRC model. As for the model input, let q = q_1, q_2, ..., q_{|q|} be the input question and p = p_1, p_2, ..., p_{|p|} be the input passage. These two sequences are packed as x = [CLS], q, [SEP], p, [SEP], where [CLS] is a special token for sentence-level classification and [SEP] is a separator token. Thus, the total length of input
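As a concrete illustration of this input packing and extractive span prediction, here is a minimal sketch using the Hugging Face transformers library with a publicly available SQuAD-fine-tuned BERT checkpoint. The library and checkpoint are assumptions for illustration only; the paper does not state that they were used.

```python
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

# Publicly available BERT checkpoint fine-tuned on SQuAD (assumed for illustration).
name = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizerFast.from_pretrained(name)
model = BertForQuestionAnswering.from_pretrained(name)

question = "Where was the treaty signed?"
passage = "The treaty was signed in Paris in 1783, formally ending the war."

# The tokenizer packs the two sequences as [CLS] question [SEP] passage [SEP].
inputs = tokenizer(question, passage, return_tensors="pt", truncation=True, max_length=384)

with torch.no_grad():
    outputs = model(**inputs)

# Pick the most probable start/end token positions and decode the answer span.
start = outputs.start_logits.argmax(dim=-1).item()
end = outputs.end_logits.argmax(dim=-1).item()
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```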

Our model

In this section, we first describe the construction of our multi-domain MRC model based on knowledge distillation. Then, we introduce how to effectively train our model using the proposed improved PCGrad algorithm with adaptive learning rate.
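As a rough sketch of how such a KD objective for extractive MRC can be implemented, the following PyTorch function combines a cross-entropy loss on the gold start/end positions with a KL-divergence loss against a domain-specific teacher's start/end distributions. The weighting scheme, temperature, and function name are assumptions, not the authors' reported settings.

```python
import torch
import torch.nn.functional as F

def kd_mrc_loss(student_start_logits, student_end_logits,
                teacher_start_logits, teacher_end_logits,
                start_positions, end_positions,
                temperature=2.0, alpha=0.5):
    """Hard-label span loss plus distillation loss against the domain-specific
    teacher of the domain each sample comes from. alpha and temperature are
    assumed hyperparameters, not values from the paper."""
    # Supervision from the training data (gold start/end positions).
    ce = (F.cross_entropy(student_start_logits, start_positions)
          + F.cross_entropy(student_end_logits, end_positions)) / 2.0

    # Supervision from the teacher's output distributions (soft targets).
    kd = (F.kl_div(F.log_softmax(student_start_logits / temperature, dim=-1),
                   F.softmax(teacher_start_logits / temperature, dim=-1),
                   reduction="batchmean")
          + F.kl_div(F.log_softmax(student_end_logits / temperature, dim=-1),
                     F.softmax(teacher_end_logits / temperature, dim=-1),
                     reduction="batchmean")) / 2.0

    return (1 - alpha) * ce + alpha * (temperature ** 2) * kd
```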

Datasets

Our experimental datasets mainly include:

  • SQuAD [9]. As one of the most commonly-used MRC datasets, it contains more than 100K instances. The context paragraphs are collected from Wikipedia and the corresponding questions are created by humans, where the answers can be found in the context paragraphs. In particular, we use SQuAD v1.1 in our experiments.

  • NewsQA (News) [10]. It is a challenging dataset that contains about 120K question–answer pairs. Its paragraphs are from CNN articles and

Conclusion

In this paper, we have proposed introducing KD to train a multi-domain joint model for MRC, which allows the joint model to simultaneously learn from the training data and absorb the beneficial knowledge of all domain-specific models. Moreover, considering that the distributions of multiple MRC domains are often different, we propose an improved PCGrad algorithm with an adaptive learning rate to better mitigate domain interference.

On several commonly-used datasets, experimental results and in-depth

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The project was supported by National Natural Science Foundation of China (No. 62036004, No. 61672440), Natural Science Foundation of Fujian Province of China (No. 2020J06001), and Youth Innovation Fund of Xiamen (No. 3502Z20206059). We also thank the reviewers for their insightful comments.

References (56)

  • R. Kadlec et al., Text understanding with the attention sum reader network.
  • Y. Cui et al., Attention-over-attention neural networks for reading comprehension, ACL (2017).
  • M.J. Seo et al., Bidirectional attention flow for machine comprehension.
  • M. Hu, Y. Peng, F. Wei, Z. Huang, D. Li, N. Yang, M. Zhou, Attention-guided answer distillation for machine reading...
  • A.W. Yu et al., QANet: Combining local convolution with global self-attention for reading comprehension, ICLR (2018).
  • K. Liu et al., A robust adversarial training approach to machine reading comprehension, AAAI (2020).
  • Z. Zhang, J. Yang, H. Zhao, Retrospective reader for machine reading comprehension, CoRR...
  • P. Sen et al., What do models learn from question answering datasets?
  • P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: EMNLP,...
  • A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, K. Suleman, NewsQA: A machine comprehension dataset,...
  • H. Wang, Z. Gan, X. Liu, J. Liu, J. Gao, H. Wang, Adversarial domain adaptation for machine reading comprehension, in:...
  • Y. Cao et al., Unsupervised domain adaptation on reading comprehension.
  • N. Duan et al., Question generation for question answering.
  • S. Min et al., Question answering through transfer learning from large fine-grained supervision data.
  • Y. Chung et al., Supervised and unsupervised transfer learning for question answering.
  • A. Talmor et al., MultiQA: An empirical investigation of generalization and transfer in reading comprehension.
  • X. Liu et al., An iterative multi-source mutual knowledge transfer framework for machine reading comprehension, IJCAI (2020).
  • Y. Xu et al., Multi-task learning with sample re-weighting for machine reading comprehension.
  • G.E. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, CoRR...
  • X. Tan, Y. Ren, D. He, T. Qin, Z. Zhao, T. Liu, Multilingual neural machine translation with knowledge distillation,...
  • K. Clark et al., BAM! Born-again multi-task networks for natural language understanding, ACL (2019).
  • T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, C. Finn, Gradient surgery for multi-task learning, in: NeurIPS,...
  • Q.V. Le, T. Mikolov, Distributed representations of sentences and documents, in: ICML,...
  • J. Pennington et al., GloVe: Global vectors for word representation.
  • M.E. Peters et al., Deep contextualized word representations.
  • J. Devlin et al., BERT: Pre-training of deep bidirectional transformers for language understanding, NAACL-HLT (2019).
  • D. Golub, P. Huang, X. He, L. Deng, Two-stage synthesis networks for transfer learning in machine comprehension, in:...
  • C. Wang et al., Meta fine-tuning neural language models for multi-domain text mining, EMNLP (2020).

    Chulun Zhou received the B.S. degree from Xiamen University, Xiamen, China, in 2019. He is currently a Master candidate in Xiamen University, Xiamen, China. His research interests include natural language processing, text generation and neural machine translation.

    Zhihao Wang was born in 1993. He received the MA.Eng degree in Xiamen University, and is studying for his Ph.D. in Xiamen University. His research interests include natural language processing, neural machine translation and fuzzy clustering.

    Shaojie He was born in 1999. He is now a graduate student in the major of signal and information processing at University of Chinese Academy of Sciences. His research interests include deep learning and natural language processing.

    Haiying Zhang received the Ph.D. degree from the Harbin Institute of Technology, Harbin, China. She is currently an Associate Professor with Xiamen University, Xiamen, China. Her research interests include artificial intelligence and deep learning.

    Jinsong Su was born in 1982. He received the Ph.D. degree in Chinese Academy of Sciences, and is now a professor in Xiamen University. His research interests include natural language processing, neural machine translation and text generation. He has served as the Area Co-Chair of the NLPCC 2018, EMNLP 2019, EMNLP 2020, NLPCC 2020, ACL 2021.

1 Equal contribution.
