Abstract
Machine Reading Comprehension (MRC) has achieved impressive answer-inference performance in recent years, but the trustworthiness and reliability of deployed systems are rarely considered. In real-world applications, however, it is crucial to estimate predictive uncertainty, i.e., how likely a prediction is to be wrong, so that the system can abstain from low-confidence predictions and thereby earn trust. Prior studies measure predictive uncertainty in a post-processing manner, e.g., by using heuristic softmax probabilities or by training a calibrator on top of a trained MRC model. However, they only calibrate confidence and ignore the domain-adaptation relationship. To address these limitations, this paper presents TrustMRC, a non-post-processing trustworthy MRC system that leverages (1) a conditional calibration strategy to obtain reliable uncertainty estimates, and (2) a conditional adversarial learning strategy to learn transferable representations under domain shift. On the one hand, to estimate predictive uncertainty, a conditional calibration module is proposed to predict whether the output of the answer prediction module is correct, combined with an additional Expected Calibration Error (ECE) constraint that makes the confidence more reliable. On the other hand, to handle domain shift, TrustMRC designs a conditional adversarial learning strategy that learns transferable representations through a domain discriminator with uncertainty constraints, taking both input alignment and uncertainty alignment into account. Moreover, because TrustMRC completes answer prediction and uncertainty prediction in a single end-to-end framework, the two sub-tasks can benefit from each other via multi-task learning. For trustworthiness-aware MRC evaluation, EM-coverage and F1-coverage curves are used instead of the traditional EM and F1 metrics.
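The ECE constraint above refers to the standard Expected Calibration Error: bin predictions by confidence and average the gap between per-bin accuracy and per-bin mean confidence, weighted by bin size. The following is a minimal NumPy sketch of that metric with the usual equal-width binning; it illustrates the quantity being constrained, not TrustMRC's exact implementation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sum over bins of (fraction of samples in bin) * |accuracy - confidence|.

    confidences: predicted confidence in [0, 1] for each example.
    correct: 1 if the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            bin_acc = correct[mask].mean()      # empirical accuracy in this bin
            bin_conf = confidences[mask].mean() # average stated confidence
            ece += mask.mean() * abs(bin_acc - bin_conf)
    return ece

# An overconfident toy model: every prediction claims 0.99 confidence,
# but only half of the answers are actually correct.
ece_overconfident = expected_calibration_error([0.99] * 10, [1] * 5 + [0] * 5)
```

A well-calibrated model drives this quantity toward zero, which is why it is a natural auxiliary constraint on the calibration module's confidence output.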
The experimental results on SQuAD 1.1, Natural Questions, and NewsQA datasets indicate that TrustMRC can make reliable predictions under domain shift settings.
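The EM-coverage curve used for evaluation can be sketched as follows: rank predictions by confidence and, for each coverage level (the fraction of most-confident predictions retained), report the average EM over the retained set. The sketch below is an illustrative construction of that curve under this assumption, not the paper's exact evaluation code; the F1-coverage curve is obtained identically with per-example F1 scores in place of EM.

```python
import numpy as np

def em_coverage_curve(confidences, em_scores):
    """Return (coverage, em_at_coverage) arrays for a risk-coverage-style plot.

    confidences: model confidence per example.
    em_scores: per-example exact-match score (0 or 1).
    """
    order = np.argsort(-np.asarray(confidences, dtype=float))  # most confident first
    em_sorted = np.asarray(em_scores, dtype=float)[order]
    n = len(em_sorted)
    coverage = np.arange(1, n + 1) / n                 # fraction of examples kept
    em_at_coverage = np.cumsum(em_sorted) / np.arange(1, n + 1)  # running average EM
    return coverage, em_at_coverage

# Toy usage: a trustworthy model keeps EM high at low coverage because its
# confident predictions are the correct ones.
cov, em = em_coverage_curve([0.9, 0.8, 0.1], [1, 1, 0])
```

A model whose curve stays high as coverage grows is one whose confidence ranking is reliable, which is exactly the property a trustworthiness-aware evaluation should reward.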
Acknowledgment
This work was funded by the National Natural Science Foundation of China (Grant No. 62173195), and partly supported by the seed fund of the Tsinghua University (Department of Computer Science and Technology)-Siemens Ltd., China Joint Research Center for Industrial Intelligence and Internet of Things.
Cite this article
Wu, Z., Xu, H. Trustworthy machine reading comprehension with conditional adversarial calibration. Appl Intell 53, 14298–14315 (2023). https://doi.org/10.1007/s10489-022-04235-3