Abstract
Spoken language understanding (SLU) in human-machine conversational systems is the process of interpreting the semantic meaning conveyed by a user’s spoken utterance. Traditional SLU approaches transform the word string transcribed by an automatic speech recognition (ASR) system into a semantic label that determines the machine’s subsequent response. However, the robustness of SLU can suffer in a conversation-based language learning system due to ambient noise, heavily accented pronunciation, ungrammatical utterances, etc. To address these issues, this paper proposes an end-to-end (E2E) modeling approach for SLU and evaluates the semantic labeling performance of a bidirectional LSTM-RNN (BLSTM-RNN) with input at three different levels: acoustic (filterbank features), phonetic (subphone posteriorgrams), and lexical (ASR hypotheses). Experimental results on spoken responses collected in a dialog application designed for English learners to practice job interviewing skills show that multi-level BLSTM-RNNs can exploit complementary information from the three levels to improve semantic labeling. An analysis of utterances containing out-of-vocabulary (OOV) words, which are common in a conversation-based dialog system, further indicates that subphone posteriorgrams outperform ASR hypotheses in this condition and that incorporating the lower-level features is advantageous for the final SLU performance.
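The abstract describes combining complementary information from acoustic, phonetic, and lexical inputs. One common way such a combination can be realized (a minimal illustrative sketch, not the paper's actual BLSTM-RNN architecture; all labels and probabilities below are hypothetical) is late fusion of the per-level semantic-label posteriors:

```python
# Illustrative sketch: score-level (late) fusion of semantic-label
# posteriors produced by three hypothetical classifiers, one per
# input level (acoustic, phonetic, lexical).

def fuse_posteriors(posteriors_per_level, weights=None):
    """Combine per-level label posteriors by a weighted average.

    posteriors_per_level: list of dicts mapping label -> probability.
    weights: optional per-level weights (default: uniform).
    """
    n = len(posteriors_per_level)
    if weights is None:
        weights = [1.0 / n] * n
    fused = {}
    for w, post in zip(weights, posteriors_per_level):
        for label, p in post.items():
            fused[label] = fused.get(label, 0.0) + w * p
    return fused

def predict(posteriors_per_level, weights=None):
    """Return the label with the highest fused posterior."""
    fused = fuse_posteriors(posteriors_per_level, weights)
    return max(fused, key=fused.get)

# Hypothetical example: the lexical model is misled by an ASR error on
# an OOV word, but the acoustic and phonetic levels compensate.
acoustic = {"greeting": 0.6, "job_history": 0.4}
phonetic = {"greeting": 0.8, "job_history": 0.2}
lexical  = {"greeting": 0.2, "job_history": 0.8}
print(predict([acoustic, phonetic, lexical]))  # -> greeting
```

This toy example shows how lower-level inputs can recover a label the word-level hypothesis alone would miss, which mirrors the OOV behavior reported in the abstract.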
Cite this article
Qian, Y., Ubale, R., Lange, P. et al. Spoken Language Understanding of Human-Machine Conversations for Language Learning Applications. J Sign Process Syst 92, 805–817 (2020). https://doi.org/10.1007/s11265-019-01484-3