Abstract
Spoken language understanding (SLU) in human-machine conversational systems is the process of interpreting the semantic meaning conveyed by a user’s spoken utterance. Traditional SLU approaches transform the word string transcribed by an automatic speech recognition (ASR) system into a semantic label that determines the machine’s subsequent response. However, the robustness of SLU can suffer in a conversation-based language learning system due to ambient noise, heavily accented pronunciation, ungrammatical utterances, etc. To address these issues, this paper proposes an end-to-end (E2E) modeling approach for SLU and evaluates the semantic labeling performance of a bidirectional LSTM-RNN (BLSTM-RNN) with input at three different levels: acoustic (filterbank features), phonetic (subphone posteriorgrams), and lexical (ASR hypotheses). Experimental results on spoken responses collected in a dialog application designed for English learners to practice job interviewing skills show that multi-level BLSTM-RNNs can exploit complementary information from the three levels to improve semantic labeling. An analysis of utterances containing out-of-vocabulary (OOV) words, which are common in a conversation-based dialog system, further indicates that subphone posteriorgrams outperform ASR hypotheses in this condition and that incorporating the lower-level features is advantageous for the final SLU performance.
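The abstract describes combining complementary information from acoustic, phonetic, and lexical inputs. One common way such a combination can be realized (a minimal illustrative sketch, not the paper's actual BLSTM-RNN architecture; all labels and probabilities below are hypothetical) is late fusion of the per-level semantic-label posteriors:

```python
# Illustrative sketch: score-level (late) fusion of semantic-label
# posteriors produced by three hypothetical classifiers, one per
# input level (acoustic, phonetic, lexical).

def fuse_posteriors(posteriors_per_level, weights=None):
    """Combine per-level label posteriors by a weighted average.

    posteriors_per_level: list of dicts mapping label -> probability.
    weights: optional per-level weights (default: uniform).
    """
    n = len(posteriors_per_level)
    if weights is None:
        weights = [1.0 / n] * n
    fused = {}
    for w, post in zip(weights, posteriors_per_level):
        for label, p in post.items():
            fused[label] = fused.get(label, 0.0) + w * p
    return fused

def predict(posteriors_per_level, weights=None):
    """Return the label with the highest fused posterior."""
    fused = fuse_posteriors(posteriors_per_level, weights)
    return max(fused, key=fused.get)

# Hypothetical example: the lexical model is misled by an ASR error on
# an OOV word, but the acoustic and phonetic levels compensate.
acoustic = {"greeting": 0.6, "job_history": 0.4}
phonetic = {"greeting": 0.8, "job_history": 0.2}
lexical  = {"greeting": 0.2, "job_history": 0.8}
print(predict([acoustic, phonetic, lexical]))  # -> greeting
```

This toy example shows how lower-level inputs can recover a label the word-level hypothesis alone would miss, which mirrors the OOV behavior reported in the abstract.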
Cite this article
Qian, Y., Ubale, R., Lange, P. et al. Spoken Language Understanding of Human-Machine Conversations for Language Learning Applications. J Sign Process Syst 92, 805–817 (2020). https://doi.org/10.1007/s11265-019-01484-3