Rethinking Label Smoothing on Multi-Hop Question Answering

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14232)

Included in the following conference series: Chinese Computational Linguistics (CCL 2023)

Abstract

Multi-Hop Question Answering (MHQA) is a significant area in question answering, requiring multiple reasoning components, including document retrieval, supporting sentence prediction, and answer span extraction. In this work, we present the first application of label smoothing to the MHQA task, aiming to enhance generalization in MHQA systems while mitigating overfitting to answer spans and reasoning paths in the training set. We introduce a novel label smoothing technique, F1 Smoothing, which incorporates uncertainty into the learning process and is specifically tailored for Machine Reading Comprehension (MRC) tasks. Moreover, we employ a Linear Decay Label Smoothing Algorithm (LDLA) in conjunction with curriculum learning to progressively reduce uncertainty throughout training. Experiments on the HotpotQA dataset confirm the effectiveness of our approach in improving generalization, yielding significant gains and new state-of-the-art performance on the HotpotQA leaderboard.
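
At its core, the LDLA schedule mentioned above is label smoothing whose smoothing coefficient is annealed toward zero as training progresses. The sketch below is a minimal, hedged illustration of that general idea, not the authors' released implementation; the PyTorch framing, function names, and schedule endpoints are assumptions made for illustration.

import torch
import torch.nn.functional as F

def label_smoothed_nll(logits, target, epsilon):
    # Cross-entropy against the smoothed target (1 - epsilon) * one_hot + epsilon * uniform.
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    uniform = -log_probs.mean(dim=-1)
    return ((1.0 - epsilon) * nll + epsilon * uniform).mean()

def linear_decay_epsilon(step, total_steps, eps_start=0.1, eps_end=0.0):
    # Linearly anneal the smoothing coefficient so the supervision becomes
    # sharper (less uncertain) as training proceeds, in a curriculum-like fashion.
    frac = min(step / max(total_steps, 1), 1.0)
    return eps_start + frac * (eps_end - eps_start)

# Toy usage: 4 examples, 10 classes.
logits = torch.randn(4, 10)
target = torch.randint(0, 10, (4,))
for step in (0, 500, 1000):
    eps = linear_decay_epsilon(step, total_steps=1000)
    print(step, round(eps, 3), label_smoothed_nll(logits, target, eps).item())

When the coefficient reaches zero, the loss reduces to ordinary cross-entropy, which is the end point of the decay.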

Y. Wang—Equal contribution.


Acknowledgement

We would like to express our heartfelt thanks to the students and teachers of Fudan Natural Language Processing Lab. Their thoughtful suggestions, viewpoints, and enlightening discussions have made significant contributions to this work. We also greatly appreciate the strong support from Huawei Poisson Lab for our work, and their invaluable advice. We are sincerely grateful to the anonymous reviewers and the domain chairs, whose constructive feedback played a crucial role in enhancing the quality of our research. This work was supported by the National Key Research and Development Program of China (No.2022CSJGG0801), National Natural Science Foundation of China (No.62022027) and CAAI-Huawei MindSpore Open Fund.

Author information

Corresponding author

Correspondence to Xipeng Qiu.

7 Appendix A

To reduce the complexity introduced by the nested loops in the F1 Smoothing method, we optimize Eq. (12) and Eq. (13). We use \(L_a=e^{*}-s^{*}+1\) and \(L_p=e-s+1\) to denote the lengths of the gold answer and the predicted answer, respectively.

$$\begin{aligned} q_s(t|x)=\sum _{\xi =t}^{L-1} \text {F1}\left( (t,\xi ),a_{\text {gold}}\right) . \end{aligned}$$
(16)

If \(t < s^{*}\), the distribution is

$$\begin{aligned} q_s(t|x)= \sum _{\xi =s^{*}}^{e^{*}} \frac{2(\xi -s^{*}+1)}{L_p+L_a} + \sum _{\xi =e^{*}+1}^{L-1} \frac{2L_a}{L_p+L_a}, \end{aligned}$$
(17)

else if \(s^{*} \le t \le e^{*}\), we have the following distribution

$$\begin{aligned} q_s(t|x)=\sum _{\xi =t}^{e^{*}} \frac{2L_p}{L_p+L_a} + \sum _{\xi =e^{*}+1}^{L-1} \frac{2(e^{*}-t+1)}{L_p+L_a}. \end{aligned}$$
(18)

In Eqs. 17 and 18, \(L_p=\xi -t+1\). (For \(t > e^{*}\), every candidate span starting at \(t\) is disjoint from the gold answer, so \(q_s(t|x)=0\).)

We can get \(q_e(t|x)\) similarly. If \(t > e^{*}\),

$$\begin{aligned} q_e(t|x)= \sum _{\xi =s^{*}}^{e^{*}} \frac{2(e^{*}-\xi +1)}{L_p+L_a} + \sum _{\xi =0}^{s^{*}-1} \frac{2L_a}{L_p+L_a}, \end{aligned}$$
(19)

else if \(s^{*} \le t \le e^{*}\),

$$\begin{aligned} q_e(t|x)= \sum _{\xi =s^{*}}^{t} \frac{2L_p}{L_p+L_a} + \sum _{\xi =0}^{s^{*}-1} \frac{2(t-s^{*}+1)}{L_p+L_a}. \end{aligned}$$
(20)

In Eqs. 19 and 20, \(L_p=t-\xi +1\). (Symmetrically, \(q_e(t|x)=0\) for \(t < s^{*}\).)
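
For reference, the quantity that Eqs. (16)-(20) evaluate in closed form can also be written as a direct (quadratic-time) enumeration over candidate spans. The sketch below is an illustrative reference implementation under our own assumptions (NumPy, hypothetical function names, and an explicit normalization step); the closed-form expressions above exist precisely to avoid the inner loop.

import numpy as np

def span_f1(pred_start, pred_end, gold_start, gold_end):
    # Token-level F1 between two inclusive spans: 2 * overlap / (L_p + L_a).
    overlap = max(0, min(pred_end, gold_end) - max(pred_start, gold_start) + 1)
    if overlap == 0:
        return 0.0
    l_pred = pred_end - pred_start + 1
    l_gold = gold_end - gold_start + 1
    return 2.0 * overlap / (l_pred + l_gold)

def f1_smoothed_targets(seq_len, gold_start, gold_end):
    # q_s(t|x): for each candidate start t, sum F1((t, xi), gold) over ends xi >= t (Eq. 16).
    # q_e(t|x): symmetrically, for each candidate end t, sum F1 over starts xi <= t.
    q_s = np.zeros(seq_len)
    q_e = np.zeros(seq_len)
    for t in range(seq_len):
        q_s[t] = sum(span_f1(t, xi, gold_start, gold_end) for xi in range(t, seq_len))
        q_e[t] = sum(span_f1(xi, t, gold_start, gold_end) for xi in range(t + 1))
    # Normalize so each vector is a proper target distribution (assumed step).
    return q_s / q_s.sum(), q_e / q_e.sum()

# Toy check: an 8-token context with gold span [3, 5].
q_s, q_e = f1_smoothed_targets(seq_len=8, gold_start=3, gold_end=5)
print(np.round(q_s, 3))
print(np.round(q_e, 3))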


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Yin, Z. et al. (2023). Rethinking Label Smoothing on Multi-Hop Question Answering. In: Sun, M., et al. Chinese Computational Linguistics. CCL 2023. Lecture Notes in Computer Science, vol. 14232. Springer, Singapore. https://doi.org/10.1007/978-981-99-6207-5_5

  • DOI: https://doi.org/10.1007/978-981-99-6207-5_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-6206-8

  • Online ISBN: 978-981-99-6207-5

  • eBook Packages: Computer Science, Computer Science (R0)
