Abstract
With synthetic data generation, the required amount of human-generated training data can be reduced significantly. In this work, we explore the use of automatic paraphrasing models such as GPT-2 and CVAE to augment template phrases for task-oriented dialogue systems while preserving the slots. Additionally, we systematically analyze how far manually annotated training data can be reduced. We extrinsically evaluate the performance of a natural language understanding system on augmented data at various levels of data availability, reducing manually written templates by up to 75% while preserving the same level of accuracy. We further point out that typical NLG quality metrics such as BLEU or utterance similarity are not suitable for assessing the intrinsic quality of NLU paraphrases, and that public task-oriented NLU datasets such as ATIS and SNIPS have severe limitations.
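The slot-preserving setup the abstract describes can be sketched as a delexicalize → paraphrase → relexicalize pipeline: slot values are masked with placeholder tokens before paraphrasing so the generator cannot alter them, and candidates that drop or duplicate a placeholder are discarded. The sketch below is illustrative, not the paper's implementation; the function names are hypothetical, and the paraphrase model (e.g. GPT-2) is stubbed out as a plain callable.

```python
def delexicalize(utterance: str, slots: dict[str, str]) -> str:
    """Replace each slot value with a placeholder token, e.g. 'Boston' -> '[city]'."""
    for name, value in slots.items():
        utterance = utterance.replace(value, f"[{name}]")
    return utterance


def relexicalize(template: str, slots: dict[str, str]) -> str:
    """Restore the original slot values into a paraphrased template."""
    for name, value in slots.items():
        template = template.replace(f"[{name}]", value)
    return template


def paraphrase_with_slots(utterance: str, slots: dict[str, str], paraphraser):
    """Paraphrase an utterance while guaranteeing every slot survives intact.

    `paraphraser` is any callable mapping a delexicalized string to a list of
    candidate paraphrases (in practice a fine-tuned LM such as GPT-2).
    """
    protected = delexicalize(utterance, slots)
    candidates = paraphraser(protected)
    # Keep only candidates in which each placeholder occurs exactly once.
    kept = [c for c in candidates
            if all(c.count(f"[{name}]") == 1 for name in slots)]
    return [relexicalize(c, slots) for c in kept]
```

With a toy paraphraser, `paraphrase_with_slots("book a flight to Boston", {"city": "Boston"}, p)` returns only paraphrases that still contain "Boston", filtering out any candidate that lost the slot.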
Notes
1. Dataset source: https://www.kaggle.com/siddhadev/atis-dataset-clean.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Vogel, L., Flek, L. (2022). Investigating Paraphrasing-Based Data Augmentation for Task-Oriented Dialogue Systems. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2022. Lecture Notes in Computer Science(), vol 13502. Springer, Cham. https://doi.org/10.1007/978-3-031-16270-1_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-16269-5
Online ISBN: 978-3-031-16270-1
eBook Packages: Computer Science, Computer Science (R0)