Abstract
In a low-resource scenario, the lack of annotated data can be an obstacle not only to training a robust system, but also to evaluating and comparing different approaches before deploying the best one for a given setting. We propose to dynamically find the best approach for a given setting by taking advantage of the feedback naturally present in the scenario at hand (when it exists). To this end, we present a novel application of online learning algorithms, in which we frame the choice of the best approach as a multi-armed bandit problem. Our proof of concept is a retrieval-based conversational agent, in which the answer selection criteria available to the agent are the competing approaches (arms). In our experiment, an adversarial multi-armed bandit approach converges to the performance of the best criterion after just three interaction turns, which suggests the appropriateness of our approach for a low-resource conversational agent.
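As an illustration of this framing, the following minimal sketch treats each answer selection criterion as a bandit arm: at each turn the agent draws a criterion, answers with it, and feeds the user's reaction back to the policy as a reward. This is our own illustration under stated assumptions (hypothetical criterion functions and a generic draw()/update() policy interface), not the implementation evaluated in the paper.

```python
class BanditAnswerSelector:
    """Frames answer selection as a multi-armed bandit problem.

    `criteria` is a list of functions mapping a user query to an
    answer (the competing approaches, i.e., the arms); `policy` is
    any bandit policy exposing draw()/update(), e.g., the EXP3 or
    UCB1 sketches given after the Notes below.
    """

    def __init__(self, criteria, policy):
        self.criteria = criteria
        self.policy = policy

    def respond(self, query):
        # Draw one arm (criterion) and answer the query with it.
        arm = self.policy.draw()
        return arm, self.criteria[arm](query)

    def observe(self, arm, feedback):
        # `feedback` is the naturally occurring reward signal,
        # e.g., 1 if the user accepted the answer, 0 otherwise.
        self.policy.update(arm, feedback)
```

In this reading, no annotated data is needed up front: the bandit policy learns online which criterion performs best from feedback alone.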
This work was supported by: Fundação para a Ciência e a Tecnologia (FCT) under reference UIDB/50021/2020 (INESC-ID multi-annual funding), as well as under the HOTSPOT project with reference PTDC/CCI-COM/7203/2020; Air Force Office of Scientific Research under award number FA9550-19-1-0020; P2020 program, supervised by Agência Nacional de Inovação (ANI), under the project CMU-PT Ref. 045909 (MAIA). Vânia Mendonça was funded by an FCT grant, ref. SFRH/BD/121443/2016.
Notes
1. See Boussaha et al. [5] for a review of recent retrieval-based systems.
2. However, we are not using generation and/or deep learning.
3. For EXP3, we rounded each arm's reward to an integer value, to avoid exploding weight values, and we set \(\eta = \sqrt{\frac{8\log K}{T}}\), following Mendonça et al. [19] (see the sketch after these notes).
4. For UCB, we consider the estimated cost \(\hat{Q}(k)\) as the “weight” for arm \(k\).
5. We kept SSS’s default configuration of \(N = 20\) candidates.
6. We use an updated version of the corpus reported by Oliveira et al. [20], which includes more question variants for each answer.
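Notes 3 and 4 refer to the two bandit policies compared in the paper. The sketch below is our reading of those notes, not the authors' code: EXP3 [2] with rewards rounded to integers and \(\eta = \sqrt{8\log K / T}\), and UCB1 [1] with the estimate \(\hat{Q}(k)\) plus the usual confidence bonus. The exact update forms are assumptions on our part.

```python
import math
import random

class EXP3:
    """Adversarial bandit policy (Auer et al. [2]); a sketch of the
    configuration described in note 3, not the authors' exact setup."""

    def __init__(self, n_arms, horizon):
        self.K = n_arms
        # Note 3 (following Mendonça et al. [19]), read here as
        # eta = sqrt((8 log K) / T).
        self.eta = math.sqrt(8.0 * math.log(n_arms) / horizon)
        self.weights = [1.0] * n_arms

    def draw(self):
        total = sum(self.weights)
        probs = [w / total for w in self.weights]
        return random.choices(range(self.K), weights=probs)[0]

    def update(self, arm, reward):
        # Note 3: round the reward to an integer so that the
        # exponential weights do not explode.
        reward = round(reward)
        p = self.weights[arm] / sum(self.weights)
        # Importance-weighted reward estimate for the played arm only.
        self.weights[arm] *= math.exp(self.eta * reward / p)

class UCB1:
    """Stochastic bandit policy (Auer et al. [1]); per note 4, the
    running estimate Q-hat(k) plays the role of arm k's "weight"."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms
        self.q_hat = [0.0] * n_arms
        self.t = 0

    def draw(self):
        self.t += 1
        for k, n in enumerate(self.counts):
            if n == 0:  # play every arm once before using the bound
                return k
        return max(range(len(self.q_hat)),
                   key=lambda k: self.q_hat[k]
                   + math.sqrt(2.0 * math.log(self.t) / self.counts[k]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.q_hat[arm] += (reward - self.q_hat[arm]) / self.counts[arm]
```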
References
Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2–3), 235–256 (2002). https://doi.org/10.1023/A:1013689704352
Auer, P., Cesa-Bianchi, N., Freund, Y., Schapire, R.E.: Gambling in a rigged casino: the adversarial multi-armed bandit problem. In: Proceedings of the Annual Symposium on Foundations of Computer Science, pp. 322–331 (1995). https://doi.org/10.1109/sfcs.1995.492488
Banchs, R.E., Li, H.: IRIS: a chat-oriented dialogue system based on the vector space model. In: Proceedings of the ACL 2012 System Demonstrations, pp. 37–42. Association for Computational Linguistics, Stroudsburg (2012). http://dl.acm.org/citation.cfm?id=2390470.2390477
Biermann, A.W., Long, P.M.: The composition of messages in speech-graphics interactive systems. In: International Symposium on Spoken Dialogue, pp. 97–100 (1996). http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.55.721&rep=rep1&type=pdf
Boussaha, B.E.A., Hernandez, N., Jacquin, C., Morin, E.: Deep retrieval-based dialogue systems: a short review. Technical report (2019). http://arxiv.org/abs/1907.12878
Brill, E., Dumais, S., Banko, M.: An analysis of the AskMSR question-answering system. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP 2002, pp. 257–264. Association for Computational Linguistics, USA (2002). https://doi.org/10.3115/1118693.1118726
Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning and Games. Cambridge University Press, Cambridge (2006)
Chen, Q., Wang, W.: Sequential neural networks for noetic end-to-end response selection. In: Proceedings of the 7th Dialog System Technology Challenge (DSTC7) (2019). https://doi.org/10.1016/j.csl.2020.101072
Gašić, M., Jurčiček, F., Thomson, B., Yu, K., Young, S.: On-line policy optimisation of spoken dialogue systems via live interaction with human subjects. In: Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2011), pp. 312–317 (2011). https://doi.org/10.1109/ASRU.2011.6163950
Genevay, A., Laroche, R.: Transfer learning for user adaptation in spoken dialogue systems. In: Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS, pp. 975–983 (2016)
Jaccard, P.: The distribution of the flora in the alpine zone. New Phytol. 11(2), 37–50 (1912)
Lai, T.L., Robbins, H.: Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1), 4–22 (1985). https://doi.org/10.1016/0196-8858(85)90002-8
Levin, E., Pieraccini, R., Eckert, W.: A stochastic model of human-machine interaction for learning dialog strategies. IEEE Trans. Speech Audio Process. 8 (2000)
Lin, J.: An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst. 25(2), 6-es (2007). https://doi.org/10.1145/1229179.1229180
Liu, B., Yu, T., Lane, I., Mengshoel, O.J.: Customized nonlinear bandits for online response selection in neural conversation models. In: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 5245–5252 (2018)
Magarreiro, D., Coheur, L., Melo, F.S.: Using subtitles to deal with out-of-domain interactions. In: SemDial 2014 - DialWatt (2014)
McCandless, M., Hatcher, E., Gospodnetic, O.: Lucene in Action, Second Edition: Covers Apache Lucene 3.0. Manning Publications Co., Greenwich, CT, USA (2010)
Mendonça, V., Melo, F.S., Coheur, L., Sardinha, A.: A conversational agent powered by online learning. In: Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems (AAMAS 2017), pp. 1637–1639. International Foundation for Autonomous Agents and Multiagent Systems, São Paulo, Brazil (2017). http://dl.acm.org/citation.cfm?id=3091282.3091388
Mendonça, V., Melo, F.S., Coheur, L., Sardinha, A.: Online learning for conversational agents. In: Oliveira, E., Gama, J., Vale, Z., Lopes Cardoso, H. (eds.) EPIA 2017. LNCS (LNAI), vol. 10423, pp. 739–750. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-65340-2_60
Oliveira, H.G., et al.: AIA-BDE: a corpus of FAQs in Portuguese and their variations. In: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pp. 5442–5449 (2020)
Robbins, H.: Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58(5), 527–535 (1952). https://doi.org/10.1090/S0002-9904-1952-09620-8
Roller, S., et al.: Recipes for building an open-domain chatbot. Technical report (2020). http://arxiv.org/abs/2004.13637
Serban, I.V., et al.: A deep reinforcement learning chatbot. Technical report (2018)
Singh, S., Litman, D., Kearns, M., Walker, M.: Optimizing dialogue management with reinforcement learning: experiments with the NJFun system. J. Artif. Intell. Res. 16, 105–133 (2002)
Su, P.H., et al.: On-line active reward learning for policy optimisation in spoken dialogue systems. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2431–2441. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/P16-1230, http://aclweb.org/anthology/P16-1230
Upadhyay, S., Agarwal, M., Bouneffouf, D., Khazaeni, Y.: A bandit approach to posterior dialog orchestration under a budget. In: 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) (2018)
Wang, C.C., Kulkarni, S.R., Poor, H.V.: Bandit problems with side observations. IEEE Trans. Autom. Control 50(3), 338–355 (2005). https://doi.org/10.1109/TAC.2005.844079
Yu, Z., Xu, Z., Black, A.W., Rudnicky, A.I.: Strategy and policy learning for non-task-oriented conversational systems. In: Proceedings of the SIGDIAL 2016 Conference, pp. 404–412 (2016)
Zhang, Z., Li, J., Zhu, P., Zhao, H., Liu, G.: Modeling multi-turn conversation with deep utterance aggregation. In: Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), pp. 3740–3752 (2018). http://arxiv.org/abs/1806.09102
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Mendonça, V., Coheur, L., Sardinha, A. (2021). One Arm to Rule Them All: Online Learning with Multi-armed Bandits for Low-Resource Conversational Agents. In: Marreiros, G., Melo, F.S., Lau, N., Lopes Cardoso, H., Reis, L.P. (eds.) Progress in Artificial Intelligence. EPIA 2021. Lecture Notes in Computer Science (LNAI), vol. 12981. Springer, Cham. https://doi.org/10.1007/978-3-030-86230-5_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86229-9
Online ISBN: 978-3-030-86230-5