Automatic Evaluation of Non-task Oriented Dialog Systems by Using Sentence Embeddings Projections and Their Dynamics

Part of the book series: Lecture Notes in Electrical Engineering ((LNEE,volume 704))

Abstract

Human-machine interaction through open-domain conversational agents has grown considerably in recent years. These social conversational agents tackle the hard task of maintaining a meaningful, engaging, long-term conversation with human users by selecting or generating the most contextually appropriate response to a human prompt. Unfortunately, there are no well-defined criteria or automatic metrics for deciding which answer is best to provide. The traditional approach is to ask humans to evaluate each turn, or the whole dialogue, along a given dimension (e.g., naturalness, originality, appropriateness, syntax, engagingness). In this paper, we present our initial efforts toward an explainable metric: we project sentence embeddings and measure different distances between human-chatbot, human-human, and chatbot-chatbot turns on two different sets of dialogues. Our preliminary results offer insights for visually and intuitively distinguishing between good and bad dialogues.
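The pipeline the abstract describes (embed each turn, project the embeddings into two dimensions, then measure distances between prompts and responses) can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: the random vectors stand in for real sentence embeddings from a learned encoder, and a simple PCA projection stands in for the t-SNE/UMAP-style projections available in tools such as the TensorFlow Projector. The helper names `project_2d` and `turn_distances` are hypothetical.

```python
import numpy as np

def project_2d(embeddings):
    """Project sentence embeddings to 2-D via PCA (a stand-in for t-SNE/UMAP)."""
    X = embeddings - embeddings.mean(axis=0)
    # SVD-based PCA: the top-2 right singular vectors span the projection plane.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

def turn_distances(prompts, responses):
    """Cosine distance between each prompt embedding and its paired response."""
    p = prompts / np.linalg.norm(prompts, axis=1, keepdims=True)
    r = responses / np.linalg.norm(responses, axis=1, keepdims=True)
    return 1.0 - np.sum(p * r, axis=1)

# Toy example: 5 turn pairs with random 16-dim "sentence embeddings".
rng = np.random.default_rng(0)
prompts = rng.normal(size=(5, 16))
responses = rng.normal(size=(5, 16))

coords = project_2d(np.vstack([prompts, responses]))  # (10, 2) points to plot
dists = turn_distances(prompts, responses)            # (5,) prompt-response distances
```

Plotting `coords` with lines joining consecutive turns gives the dialogue-evolution view of Appendix A, while the per-pair distances relate to the coherence view of Appendix B.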


Notes

  1. https://projector.tensorflow.org/.



Acknowledgements

This work has been funded by the Spanish Ministry of Economy and Competitiveness (Artificial Intelligence Techniques and Assistance to Autonomous Navigation, reference DPI 2017-86915-C3-3-R). It has also received funding from RoboCity2030-DIH-CM, Madrid Robotics Digital Innovation Hub, S2018/NMT-4331, funded by "Programas de Actividades I+D en la Comunidad de Madrid" and co-funded by Structural Funds of the EU. It has also been supported by the Spanish projects AMIC (MINECO, TIN2017-85854-C4-4-R) and CAVIAR (MINECO, TEC2017-84593-C2-1-R). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P5000 used for this research.


Corresponding author

Correspondence to Mario Rodríguez-Cantelar.


Appendices

Appendix A: Dialog Evolution Figures

Fig. 1

Examples of two-dimensional projections of the dialogue evolution (Evo.) of the prompts and responses for the “good” human-chatbot (H-C) dialogues (top left, ids: p29-r35 in Table 2), “good” human-human (H-H) dialogues (top right, ids: p1-r7 in Table 3), “bad” human-chatbot dialogues (bottom left, ids: p1-r7 in Table 2) and “bad” human-human dialogues (bottom right, ids: p1-r8 in Table 4). The solid lines indicate the human’s prompts or the prompts for the first human in the H-H case. The dashed lines indicate the chatbot’s answers or the answers for the second human in the H-H case.

Appendix B: Dialog Coherence Figures

Fig. 2

Examples of two-dimensional projections of the dialogue coherence (Coh.) between the prompts and responses for the “good” human-chatbot (H-C) dialogues (top left, ids: p29-r35 in Table 2), “good” human-human (H-H) dialogues (top right, ids: p1-r7 in Table 3), “bad” human-chatbot dialogues (bottom left, ids: p1-r7 in Table 2) and “bad” human-human dialogues (bottom right, ids: p1-r8 in Table 4). The solid lines indicate the human’s prompts or the prompts for the first human in the H-H case. The dashed lines indicate the chatbot’s answers or the answers for the second human in the H-H case.

Appendix C: Human-Chatbot Conversation

Table 2 Human-Chatbot (H-C) conversation using a bi-GRU Seq2Seq approach. The total number of turn pairs is 59. For each chatbot’s turn, a subjective human score was obtained. The number next to each message is the identifier (id).

Appendix D: Human-Human Conversations

Table 3 Examples of “good” Human-Human (H-H) conversations extracted from the Persona-Chat dataset. The number next to each message is the identifier (id).
Table 4 Examples of “bad” Human-Human (H-H) conversations extracted from the Persona-Chat dataset. The number next to each message is the identifier (id).


Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Rodríguez-Cantelar, M., D’Haro, L.F., Matía, F. (2021). Automatic Evaluation of Non-task Oriented Dialog Systems by Using Sentence Embeddings Projections and Their Dynamics. In: D'Haro, L.F., Callejas, Z., Nakamura, S. (eds) Conversational Dialogue Systems for the Next Decade. Lecture Notes in Electrical Engineering, vol 704. Springer, Singapore. https://doi.org/10.1007/978-981-15-8395-7_6


  • DOI: https://doi.org/10.1007/978-981-15-8395-7_6

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-8394-0

  • Online ISBN: 978-981-15-8395-7

  • eBook Packages: Engineering (R0)
