
Deep AM-FM: Toolkit for Automatic Dialogue Evaluation

Conversational Dialogue Systems for the Next Decade

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 704)

Abstract

There have been many studies on human-machine dialogue systems. To evaluate them accurately and fairly, many resort to human grading of system outputs. Unfortunately, this is time-consuming and expensive. The study of AM-FM (Adequacy Metric - Fluency Metric) suggests an automatic evaluation metric that achieves good performance in terms of correlation with human judgements. The AM-FM framework measures the quality of dialogue generation along two dimensions with the help of gold references: (1) the semantic closeness of the generated response to the corresponding gold references; (2) the syntactic quality of the sentence construction. However, the original formulations of both the adequacy and fluency metrics face technical limitations. The latent semantic indexing (LSI) approach to AM modeling does not scale to large amounts of data, and its bag-of-words sentence representation fails to capture contextual information. As for FM modeling, the n-gram language model implementation cannot capture long-term dependencies. Many deep learning approaches, such as the long short-term memory (LSTM) network and transformer-based architectures, address these issues well: they provide more context-aware sentence representations than the LSI approach and achieve much lower perplexity on benchmark datasets than the n-gram language model. In this paper, we propose deep AM-FM, a DNN-based implementation of the framework, and demonstrate that it achieves promising improvements in both Pearson and Spearman correlation with human evaluation on the benchmark DSTC6 End-to-End Conversation Modeling task, compared to the original implementation and other popular automatic metrics.
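To make the two dimensions concrete, the following is a minimal sketch of an AM-FM-style scorer, assuming off-the-shelf pretrained models from the Hugging Face transformers library (bert-base-uncased for adequacy embeddings, gpt2 for fluency). The authors' actual toolkit (see Note 1 below) trains its own models, so the model choices, mean pooling, and probability-ratio formulation here are illustrative assumptions, not the paper's exact method:

    # Illustrative AM-FM-style scorer; NOT the authors' toolkit
    # (https://github.com/e0397123/AM-FM-PM.git). Model choices
    # (bert-base-uncased, gpt2) and pooling are assumptions.
    import math
    import torch
    from transformers import (AutoModel, AutoTokenizer,
                              GPT2LMHeadModel, GPT2TokenizerFast)

    enc_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2")

    def embed(sentence):
        # Mean-pooled final-layer token states as a sentence embedding.
        inputs = enc_tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, dim)
        return hidden.mean(dim=1).squeeze(0)

    def norm_prob(sentence):
        # Length-normalised sentence probability: exp(-mean token NLL),
        # so longer sentences are not unfairly penalised.
        ids = lm_tok(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            nll = lm(ids, labels=ids).loss.item()  # mean neg. log-likelihood
        return math.exp(-nll)

    def deep_am_fm(response, references, lam=0.5):
        # AM: best cosine similarity between response and any gold reference.
        r = embed(response)
        am = max(torch.cosine_similarity(r, embed(ref), dim=0).item()
                 for ref in references)
        # FM: closeness of the response's normalised LM probability to the
        # nearest reference's, expressed as a ratio in [0, 1].
        p_r = norm_prob(response)
        fm = max(min(p_r, norm_prob(ref)) / max(p_r, norm_prob(ref))
                 for ref in references)
        # Weighted combination of the two dimensions.
        return lam * am + (1 - lam) * fm

    print(deep_am_fm("i am doing well , thanks !",
                     ["i am fine , thank you .",
                      "pretty good , thanks for asking ."]))

In practice, a weight like lam would be tuned against human judgements rather than fixed at 0.5.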


Notes

  1. https://github.com/e0397123/AM-FM-PM.git.

  2. http://workshop.colips.org/dstc6/index.html.

  3. R: reference, H: system response, j: system index, i: test case index, k: reference index (see the sketch following these notes).

  4. Refer to https://github.com/dialogtekgeek/DSTC6-End-to-End-Conversation-Modeling.git for the data collection process.
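Using the indices from Note 3, a plausible per-response form of the framework's scores is the following. This is a hedged reconstruction consistent with the adequacy-fluency literature, not equations quoted from this chapter; e(.) denotes a sentence-embedding function and P_LM a length-normalised language-model probability, both assumed forms:

    % AM: semantic closeness of system j's response on test case i
    % to its best-matching gold reference.
    \[ \mathrm{AM}_{ij} = \max_{k}\, \cos\bigl(e(H_{ij}),\, e(R_{ik})\bigr) \]
    % FM: ratio of normalised LM probabilities between the response
    % and the closest reference, bounded in [0, 1].
    \[ \mathrm{FM}_{ij} = \max_{k}\,
       \frac{\min\bigl(P_{\mathrm{LM}}(H_{ij}),\, P_{\mathrm{LM}}(R_{ik})\bigr)}
            {\max\bigl(P_{\mathrm{LM}}(H_{ij}),\, P_{\mathrm{LM}}(R_{ik})\bigr)} \]
    % Combined score: a weighted average of the two dimensions.
    \[ \mathrm{AMFM}_{ij} = \lambda\,\mathrm{AM}_{ij} + (1-\lambda)\,\mathrm{FM}_{ij},
       \qquad \lambda \in [0,1] \]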


Acknowledgements

This research was carried out under the collaboration programme between the Electrical & Computer Engineering Department, National University of Singapore, and Robert Bosch (SEA) Pte Ltd. It is also supported by the National Research Foundation Singapore under its AI Singapore Programme (Award Number: AISG-GC-2019-002). The work leading to these results has also been supported by the AMIC (MINECO, TIN2017-85854-C4-4-R) and CAVIAR (MINECO, TEC2017-84593-C2-1-R) projects, partially funded by the European Union.

Author information

Correspondence to Chen Zhang.


Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Zhang, C., D’Haro, L.F., Banchs, R.E., Friedrichs, T., Li, H. (2021). Deep AM-FM: Toolkit for Automatic Dialogue Evaluation. In: D'Haro, L.F., Callejas, Z., Nakamura, S. (eds) Conversational Dialogue Systems for the Next Decade. Lecture Notes in Electrical Engineering, vol 704. Springer, Singapore. https://doi.org/10.1007/978-981-15-8395-7_5


  • DOI: https://doi.org/10.1007/978-981-15-8395-7_5

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-8394-0

  • Online ISBN: 978-981-15-8395-7

  • eBook Packages: Engineering (R0)
