
Trustworthy machine reading comprehension with conditional adversarial calibration


Abstract

Machine Reading Comprehension (MRC) has achieved impressive answer inference performance in recent years, but the trustworthiness and reliability of deployed systems are rarely considered. In real-world applications, however, it is crucial to estimate predictive uncertainty, i.e., how likely a prediction is to be wrong, so that the system can abstain from low-confidence predictions and behave trustworthily. Prior studies measure predictive uncertainty in a post-processing manner, for example by using heuristic softmax probabilities or by training a separate calibrator on top of a trained MRC model. However, these approaches only calibrate confidence and ignore the domain adaptation relationship. To address these limitations, this paper presents TrustMRC, a non-post-processing trustworthy MRC system that leverages (1) a conditional calibration strategy to obtain reliable uncertainty estimates, and (2) a conditional adversarial learning strategy to learn transferable representations under domain shift. On the one hand, to estimate predictive uncertainty, a conditional calibration module is proposed to predict whether the output of the answer prediction module is correct; it is combined with an additional Expected Calibration Error (ECE) constraint that makes the confidence more reliable. On the other hand, to handle domain shift, TrustMRC designs a conditional adversarial learning strategy that learns transferable representations through a domain discriminator with uncertainty constraints, taking both input and uncertainty alignment into account. Moreover, TrustMRC is a non-post-processing model that performs answer prediction and uncertainty prediction in an end-to-end framework, so the two sub-tasks can benefit from each other via multi-task learning. Instead of the traditional EM and F1 metrics, EM-coverage and F1-coverage curves are used for trustworthiness-aware MRC evaluation. Experimental results on the SQuAD 1.1, Natural Questions, and NewsQA datasets indicate that TrustMRC makes reliable predictions under domain shift.
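The abstract refers to an Expected Calibration Error (ECE) constraint and to EM-coverage / F1-coverage curves for trustworthiness-aware evaluation. As a rough aid, the following is a minimal sketch, assuming per-example confidences and per-example EM or F1 scores are available, of how these two quantities are commonly computed; the function names, binning scheme, and NumPy implementation are illustrative assumptions and are not taken from the paper.

import numpy as np

def expected_calibration_error(confidences, correctness, n_bins=10):
    # Standard ECE: bin predictions by confidence and accumulate the
    # weighted gap between mean confidence and empirical accuracy per bin.
    confidences = np.asarray(confidences, dtype=float)
    correctness = np.asarray(correctness, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correctness[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

def coverage_curve(confidences, per_example_scores):
    # EM-coverage / F1-coverage style curve: rank examples by confidence
    # (most confident first) and report the running mean of the per-example
    # score (EM or F1) over the retained fraction at each coverage level.
    order = np.argsort(-np.asarray(confidences, dtype=float))
    scores = np.asarray(per_example_scores, dtype=float)[order]
    n = len(scores)
    coverage = np.arange(1, n + 1) / n
    avg_score = np.cumsum(scores) / np.arange(1, n + 1)
    return coverage, avg_score

A trustworthy system should keep the score high as coverage grows; for example, coverage_curve(model_confidences, per_example_f1) yields the points of an F1-coverage curve, and the area under it summarizes selective-prediction quality.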



Acknowledgment

This work was funded by the National Natural Science Foundation of China (Grant No. 62173195) and partly supported by the seed fund of the Tsinghua University (Department of Computer Science and Technology)–Siemens Ltd. (China) Joint Research Center for Industrial Intelligence and Internet of Things.

Author information


Corresponding author

Correspondence to Hua Xu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, Z., Xu, H. Trustworthy machine reading comprehension with conditional adversarial calibration. Appl Intell 53, 14298–14315 (2023). https://doi.org/10.1007/s10489-022-04235-3


  • DOI: https://doi.org/10.1007/s10489-022-04235-3
