Aggregated Multimodal Bidirectional Recurrent Model for Audiovisual Speech Recognition

Wen, Yu; Yao, Ke; Tian, Chunlin; Wu, Yao; Zhang, Zhongmin; Shi, Yaning; Tian, Yin; Yang, Jin; Wang, Peiqi

doi:10.1007/978-3-030-00021-9_35


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11068)

Included in the following conference series:

International Conference on Cloud Computing and Security


Abstract

Audiovisual Speech Recognition (AVSR), one of the most common applications of multimodal learning, uses both video and audio information to perform robust automatic speech recognition. Traditionally, AVSR was treated as a problem of inference and projection, which placed many restrictions on its capability. With further study, deep neural networks (DNNs) have become an important part of the toolkit for classification tasks such as automatic speech recognition, image classification, and natural language processing. AVSR often uses DNN models, including Multimodal Deep Autoencoders (MDAEs), the Multimodal Deep Belief Network (MDBN), and the Multimodal Deep Boltzmann Machine (MDBM), which generally outperform the traditional methods. However, these DNN models have several shortcomings. First, they cannot balance modal fusion and temporal fusion, and some have no temporal fusion at all. Second, their architectures are not end-to-end. In addition, their training and testing are cumbersome. We designed a DNN model, the AggregateD MultImodaL BidirectionAl RecurrenT ModEl (DILATE), to overcome these weaknesses. DILATE can be trained and tested simultaneously, and it is also easy to train and prevents overfitting automatically. Experiments show that DILATE is superior to traditional methods and other DNN models on some benchmark datasets.
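The two kinds of fusion named in the abstract can be illustrated with a small sketch. The abstract does not describe the internals of DILATE, so the code below is not the authors' model. It is a generic illustration under stated assumptions: modal fusion is done by concatenating per-frame audio and video features, and temporal fusion by a plain tanh recurrent pass run in both directions. All names (`fuse_and_encode`, `rnn_pass`, `make_matrix`, `matvec`), the feature sizes, and the random untrained weights are hypothetical.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (illustrative, untrained weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_pass(inputs, w_in, w_rec, hidden_size):
    """Run a plain tanh RNN over a sequence and return every hidden state."""
    state = [0.0] * hidden_size
    states = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, state))]
        state = [math.tanh(p) for p in pre]
        states.append(state)
    return states


def fuse_and_encode(audio_frames, video_frames, hidden_size=4, seed=0):
    """Assumed pipeline: concatenation fusion, then a bidirectional RNN."""
    if len(audio_frames) != len(video_frames):
        raise ValueError("audio and video must have the same number of frames")
    rng = random.Random(seed)
    # Modal fusion: join the audio and video feature vectors of each frame.
    fused = [a + v for a, v in zip(audio_frames, video_frames)]
    input_size = len(fused[0])
    # The forward and backward directions use separate weights.
    fwd = rnn_pass(fused, make_matrix(hidden_size, input_size, rng),
                   make_matrix(hidden_size, hidden_size, rng), hidden_size)
    bwd = rnn_pass(fused[::-1], make_matrix(hidden_size, input_size, rng),
                   make_matrix(hidden_size, hidden_size, rng), hidden_size)[::-1]
    # Temporal fusion: each frame gets past (forward) and future (backward) context.
    return [f + b for f, b in zip(fwd, bwd)]


audio = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1], [0.4, 0.4, 0.2]]  # 3 frames, 3 audio features
video = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]                  # 3 frames, 2 visual features
encoded = fuse_and_encode(audio, video)
print(len(encoded), len(encoded[0]))  # 3 frames, 8 values per frame (4 forward + 4 backward)
```

In a trained system the per-frame outputs would feed a classifier or sequence loss; that stage is omitted here because the abstract does not say which one DILATE uses.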

Yu Wen (female, born in 1980) holds a master's degree and is a lecturer. Her research interests are information resource management and computer networks.




Author information

Authors and Affiliations

Experimental Training Base, National University of Defence Technology, Xian, 710106, Shaanxi, People’s Republic of China
Yu Wen, Ke Yao, Yao Wu, Zhongmin Zhang, Yaning Shi, Yin Tian, Jin Yang & Peiqi Wang
University of Chinese Academy of Sciences, 19A Yuquanlu, Beijing, 100049, People’s Republic of China
Chunlin Tian


Corresponding author

Correspondence to Yu Wen.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Xingming Sun
Nanjing University of Information Science and Technology, Nanjing, China
Zhaoqing Pan
Department of Computer Science, Purdue University, West Lafayette, IN, USA
Elisa Bertino


About this paper

Cite this paper

Wen, Y. et al. (2018). Aggregated Multimodal Bidirectional Recurrent Model for Audiovisual Speech Recognition. In: Sun, X., Pan, Z., Bertino, E. (eds) Cloud Computing and Security. ICCCS 2018. Lecture Notes in Computer Science, vol 11068. Springer, Cham. https://doi.org/10.1007/978-3-030-00021-9_35


DOI: https://doi.org/10.1007/978-3-030-00021-9_35
Published: 26 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00020-2
Online ISBN: 978-3-030-00021-9
eBook Packages: Computer Science, Computer Science (R0)
