
Investigating Paraphrasing-Based Data Augmentation for Task-Oriented Dialogue Systems

Conference paper in Text, Speech, and Dialogue (TSD 2022)

Abstract

With synthetic data generation, the amount of human-generated training data required can be reduced significantly. In this work, we explore the use of automatic paraphrasing models such as GPT-2 and CVAE to augment template phrases for task-oriented dialogue systems while preserving the slots. Additionally, we systematically analyze how far manually annotated training data can be reduced. We extrinsically evaluate the performance of a natural language understanding system trained on augmented data at various levels of data availability, reducing manually written templates by up to 75% while preserving the same level of accuracy. We further point out that typical NLG quality metrics such as BLEU or utterance similarity are not suitable to assess the intrinsic quality of NLU paraphrases, and that public task-oriented NLU datasets such as ATIS and SNIPS have severe limitations.
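The slot-preserving augmentation described in the abstract can be sketched as a delexicalize/paraphrase/relexicalize pipeline: slot values are replaced by placeholder tokens before paraphrasing so the model cannot alter them, then restored afterwards. The helper names, placeholder format, and the hypothetical model output below are illustrative assumptions, not the paper's implementation:

```python
def delexicalize(utterance: str, slots: dict) -> str:
    """Replace slot values with placeholder tokens so a paraphrasing
    model cannot alter them (e.g. 'Berlin' -> '<city>')."""
    for slot_name, value in slots.items():
        utterance = utterance.replace(value, f"<{slot_name}>")
    return utterance

def relexicalize(template: str, slots: dict) -> str:
    """Re-insert the original slot values into a paraphrased template."""
    for slot_name, value in slots.items():
        template = template.replace(f"<{slot_name}>", value)
    return template

slots = {"city": "Berlin", "date": "tomorrow"}
original = "book a flight to Berlin for tomorrow"
template = delexicalize(original, slots)  # "book a flight to <city> for <date>"
# ... a paraphrasing model (e.g. GPT-2) would rewrite the template here ...
paraphrase = "I need a flight to <city> leaving <date>"  # hypothetical model output
augmented = relexicalize(paraphrase, slots)
print(augmented)  # "I need a flight to Berlin leaving tomorrow"
```

The augmented utterance keeps the original slot annotations valid, which is what makes such paraphrases usable as extra NLU training data.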


Notes

  1. Dataset Source: https://www.kaggle.com/siddhadev/atis-dataset-clean.
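The abstract's claim that BLEU is unsuitable for judging paraphrase quality is easy to illustrate: a faithful paraphrase with little lexical overlap scores low. The following is a simplified, pure-Python sentence-level BLEU (clipped n-gram precision up to bigrams with a brevity penalty), not the smoothed variants used by standard toolkits such as NLTK or sacreBLEU:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference: str, hypothesis: str, max_n: int = 2) -> float:
    """Simplified sentence-level BLEU: geometric mean of clipped
    n-gram precisions up to max_n, times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        hyp_counts = Counter(ngrams(hyp, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# A faithful paraphrase with little word overlap scores low:
print(sentence_bleu("book a flight to berlin",
                    "i would like to fly to berlin"))  # ≈ 0.22
```

Despite expressing the same intent with the same slot value, the paraphrase is penalized for diverging from the reference wording, which is precisely the lexical diversity that makes it valuable as augmentation data.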


Author information

Correspondence to Liane Vogel.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Vogel, L., Flek, L. (2022). Investigating Paraphrasing-Based Data Augmentation for Task-Oriented Dialogue Systems. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2022. Lecture Notes in Computer Science, vol 13502. Springer, Cham. https://doi.org/10.1007/978-3-031-16270-1_39


  • DOI: https://doi.org/10.1007/978-3-031-16270-1_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16269-5

  • Online ISBN: 978-3-031-16270-1

  • eBook Packages: Computer Science, Computer Science (R0)
