A novel multi-domain machine reading comprehension model with domain interference mitigation
Introduction
As an important task in natural language processing (NLP), machine reading comprehension (MRC) aims to answer questions about a given passage. It has been employed in many real industrial applications, such as search engines, and has become a research hotspot in the NLP field [1], [2], [3], [4], [5], [6], [7], [8]. Nowadays, dominant MRC models are almost exclusively either extractive or generative, where the former attract more attention due to their simplicity and computational efficiency.
In recent years, as the basis of MRC studies, many datasets from different domains have been constructed, e.g., SQuAD [9], collected from Wikipedia, and NewsQA [10], built from news passages. The data of different domains usually vary greatly. To handle datasets of various domains, most existing approaches resort to building domain-specific models [11], [12]. However, the training data of a single domain are often limited, and thus insufficient on their own to train a domain-specific model with satisfying performance.
To address this issue, previous studies explore data augmentation to enhance domain-specific models [13]. Meanwhile, more researchers have committed to employing transfer learning to exploit multi-domain data for domain-specific models [14], [15], [16], [17]. Despite their partial success, building domain-specific models makes deployment more difficult, especially when a large number of domains are involved. A notable exception is Xu et al. [18], who explore a multi-task framework that trains a unified model using a sample re-weighting technique. However, such an approach only focuses on data-level knowledge transfer, ignoring the beneficial knowledge of domain-specific models.
In this paper, inspired by the successful applications of knowledge distillation (KD) [19] in multilingual neural machine translation [20] and multi-task learning [21], we propose a KD-based multi-domain MRC model with domain interference mitigation. Specifically, we first individually train domain-specific MRC models on their own training data. Then, using the mixed-domain data, we apply KD to transfer the knowledge of the domain-specific models to the joint model. During this process, the joint model is simultaneously supervised by the signal of the training data and the output distributions of the domain-specific models. More importantly, since training samples from different domains contribute differently to the parameter updates of the joint model and can produce conflicting gradients, we introduce the PCGrad algorithm [22], originally used in multi-task learning scenarios, to modulate gradients from different domains. However, in the original PCGrad algorithm, the updating step size of model parameters is independent of the domain-level gradient similarities, which negatively affects model optimization. Intuitively, if there is a great conflict between the gradients of different domains, an excessive updating step size will hurt model performance on some domains. Therefore, the smaller the gradient similarity, the smaller the updating step size should be; conversely, the larger the gradient similarity, the larger the updating step size can be. To model this intuition, we propose to measure the gradient similarity between domains in terms of two factors: gradient direction and gradient magnitude. Then, based on the overall gradient similarity, we refine the conventional PCGrad algorithm by adaptively adjusting the learning rate used to optimize model parameters. Compared to [18], which only transfers knowledge at the data level, our joint model leverages model-level knowledge transfer for MRC.
Moreover, mitigating domain interference with our improved PCGrad algorithm further enhances the performance of our joint model.
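The refined PCGrad procedure described above can be sketched as follows. This is a minimal illustration only: it assumes cosine similarity for gradient direction and a norm ratio for gradient magnitude, combined by simple averaging; the paper's exact similarity metric and scaling are not reproduced here.

```python
import numpy as np

def pcgrad_adaptive(domain_grads, base_lr):
    """Sketch of PCGrad with an adaptive step size (hypothetical
    implementation; the paper's exact metric may differ).

    domain_grads: list of flattened per-domain gradient vectors.
    Returns (combined_gradient, adjusted_learning_rate).
    """
    projected = [g.copy() for g in domain_grads]
    # PCGrad: project each domain's gradient onto the normal plane of
    # any conflicting gradient (negative dot product) from another domain.
    for i, g_i in enumerate(projected):
        for j, g_j in enumerate(domain_grads):
            if i == j:
                continue
            dot = g_i @ g_j
            if dot < 0:  # conflicting directions
                g_i -= dot / (g_j @ g_j) * g_j
    combined = np.mean(projected, axis=0)

    # Adaptive step size: scale the learning rate by an overall
    # similarity score combining gradient direction (cosine) and
    # gradient magnitude agreement (norm ratio), mapped into [0, 1].
    sims = []
    for i in range(len(domain_grads)):
        for j in range(i + 1, len(domain_grads)):
            g_i, g_j = domain_grads[i], domain_grads[j]
            n_i, n_j = np.linalg.norm(g_i), np.linalg.norm(g_j)
            cos = g_i @ g_j / (n_i * n_j + 1e-12)
            mag = min(n_i, n_j) / (max(n_i, n_j) + 1e-12)
            sims.append(0.5 * (cos + 1.0) * mag)
    lr = base_lr * float(np.mean(sims))
    return combined, lr
```

With identical per-domain gradients the step size stays at the base learning rate, while strongly conflicting gradients both get deconflicted by projection and shrink the effective learning rate, matching the intuition above.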
The contribution of our work lies in the following points:
- We introduce knowledge distillation to construct a multi-domain model for MRC. To the best of our knowledge, our work is the first attempt to explore knowledge transfer at both the data and model levels in MRC.
- We propose to refine the conventional PCGrad algorithm by adaptively adjusting its learning rate according to our designed gradient similarity metric, which mainly depends on the similarities in gradient direction and magnitude.
- Experimental results show that our multi-domain model surpasses several commonly-used baselines, including domain-specific models. In particular, further analysis demonstrates the effectiveness of our improved PCGrad algorithm.
Related work
In this work, we propose a KD-based multi-domain model with domain interference mitigation for MRC. Our related work mainly involves the following two aspects:
Transfer Learning in MRC. Transfer learning is an effective approach to improving performance on a target task/domain by absorbing the knowledge gained in another. In MRC, some domains lack abundant training data to build satisfying domain-specific models. As pre-trained word embeddings [23], [24] and pre-trained language models [25], [26]
BERT-based MRC
In this section, we describe the BERT-based MRC model [26], which we choose as our basic model due to its competitive performance.
Fig. 1 shows the basic architecture of the BERT-based MRC model. As for the model input, let Q be the input question and P be the input passage. These two sequences are packed as [CLS] Q [SEP] P [SEP], where [CLS] is a specific token for sentence-level classification and [SEP] is a separator token. Thus, the total length of the input sequence is |Q| + |P| + 3.
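A minimal sketch of this input packing is shown below, assuming pre-tokenized inputs; a real system would use BERT's WordPiece tokenizer, map tokens to vocabulary ids, and handle passages longer than the maximum length with a sliding window.

```python
def pack_inputs(question_tokens, passage_tokens, max_len=384):
    """Pack a question and passage into a single BERT-style input:
    [CLS] question [SEP] passage [SEP].
    Token strings are illustrative; ids and WordPiece handling omitted.
    """
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
    # Segment ids distinguish the two sequences: 0 for the question
    # part (including [CLS] and the first [SEP]), 1 for the passage part.
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    # Truncate to the maximum input length supported by the encoder.
    return tokens[:max_len], segment_ids[:max_len]
```

Note that the three special tokens account for the "+ 3" in the total input length.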
Our model
In this section, we first describe the construction of our multi-domain MRC model based on knowledge distillation. Then, we introduce how to effectively train our model using the proposed improved PCGrad algorithm with adaptive learning rate.
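The KD objective used to train the joint model can be roughly illustrated as below. This is a sketch under assumptions: the interpolation weight alpha, temperature T, and the hard/soft loss combination are standard KD choices, not the paper's exact specification, and the loss is shown for one boundary (start or end) of the answer span.

```python
import numpy as np

def softmax(x, T=1.0):
    # Temperature-scaled, numerically stable softmax over the last axis.
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_mrc_loss(student_logits, teacher_logits, gold_positions, alpha=0.5, T=2.0):
    """Hypothetical KD loss for span-extraction MRC.

    student_logits / teacher_logits: (batch, seq_len) position logits
    from the joint model and the sample's domain-specific teacher.
    gold_positions: (batch,) gold answer-boundary indices.
    """
    n = student_logits.shape[0]
    # Hard loss: cross-entropy against the annotated answer position.
    probs = softmax(student_logits)
    hard = -np.mean(np.log(probs[np.arange(n), gold_positions] + 1e-12))
    # Soft loss: KL divergence from the teacher's softened distribution,
    # scaled by T^2 to keep gradient magnitudes comparable.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = np.mean(
        np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    ) * T * T
    return alpha * hard + (1.0 - alpha) * soft
```

When the student matches its teacher exactly, the soft term vanishes and only the supervised signal from the training data remains, reflecting the dual supervision described above.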
Datasets
Our experimental datasets mainly include:
- SQuAD [9]. As one of the most commonly-used MRC datasets, it contains more than 100K instances. The context paragraphs are collected from Wikipedia and the corresponding questions are created by humans, where the answers can be found in the context paragraphs. In particular, we use SQuAD v1.1 in our experiments.
- NewsQA (News) [10]. It is a challenging dataset that contains about 120K question–answer pairs. Its paragraphs are from CNN articles and
Conclusion
In this paper, we have proposed to introduce KD to train a multi-domain joint model for MRC, which allows the joint model to simultaneously learn from the training data and absorb beneficial knowledge from all domain-specific models. Moreover, considering that the distributions of multiple MRC domains often differ, we propose an improved PCGrad algorithm with an adaptive learning rate to better mitigate domain interference.
On several commonly-used datasets, experimental results and in-depth
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The project was supported by National Natural Science Foundation of China (No. 62036004, No. 61672440), Natural Science Foundation of Fujian Province of China (No. 2020J06001), and Youth Innovation Fund of Xiamen (No. 3502Z20206059). We also thank the reviewers for their insightful comments.
References (56)
- et al., Text understanding with the attention sum reader network
- et al., Attention-over-attention neural networks for reading comprehension, ACL (2017)
- et al., Bidirectional attention flow for machine comprehension
- M. Hu, Y. Peng, F. Wei, Z. Huang, D. Li, N. Yang, M. Zhou, Attention-guided answer distillation for machine reading...
- et al., QANet: Combining local convolution with global self-attention for reading comprehension, ICLR (2018)
- et al., A robust adversarial training approach to machine reading comprehension, AAAI (2020)
- Z. Zhang, J. Yang, H. Zhao, Retrospective reader for machine reading comprehension, CoRR...
- et al., What do models learn from question answering datasets?
- P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: EMNLP,...
- A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, K. Suleman, NewsQA: A machine comprehension dataset,...
- Unsupervised domain adaptation on reading comprehension
- Question generation for question answering
- Question answering through transfer learning from large fine-grained supervision data
- Supervised and unsupervised transfer learning for question answering
- MultiQA: An empirical investigation of generalization and transfer in reading comprehension
- An iterative multi-source mutual knowledge transfer framework for machine reading comprehension, IJCAI
- Multi-task learning with sample re-weighting for machine reading comprehension
- BAM! Born-again multi-task networks for natural language understanding, ACL
- GloVe: Global vectors for word representation
- Deep contextualized word representations
- BERT: Pre-training of deep bidirectional transformers for language understanding, NAACL-HLT
- Meta fine-tuning neural language models for multi-domain text mining, EMNLP
Chulun Zhou received the B.S. degree from Xiamen University, Xiamen, China, in 2019. He is currently a Master candidate in Xiamen University, Xiamen, China. His research interests include natural language processing, text generation and neural machine translation.
Zhihao Wang was born in 1993. He received the MA.Eng degree from Xiamen University, where he is currently pursuing his Ph.D. His research interests include natural language processing, neural machine translation and fuzzy clustering.
Shaojie He was born in 1999. He is currently a graduate student majoring in signal and information processing at the University of Chinese Academy of Sciences. His research interests include deep learning and natural language processing.
Haiying Zhang received the Ph.D. degree from the Harbin Institute of Technology, Harbin, China. She is currently an Associate Professor at Xiamen University, Xiamen, China. Her research interests include artificial intelligence and deep learning.
Jinsong Su was born in 1982. He received the Ph.D. degree from the Chinese Academy of Sciences and is now a professor at Xiamen University. His research interests include natural language processing, neural machine translation and text generation. He has served as an Area Co-Chair of NLPCC 2018, EMNLP 2019, EMNLP 2020, NLPCC 2020 and ACL 2021.
1 Equal contribution.