Abstract
Audiovisual Speech Recognition (AVSR) is an application of multimodal machine learning related to speech recognition, lipreading systems and video classification. Recent work has devoted increasing effort to Deep Neural Networks (DNNs) for AVSR, and DNN models including the Multimodal Deep Autoencoder, Multimodal Deep Belief Network and Multimodal Deep Boltzmann Machine perform well in experiments owing to their better generalization and nonlinear transformation. However, these DNN models have several disadvantages: (1) they mainly deal with modal fusion while ignoring temporal fusion; (2) traditional methods fail to consider the connection among frames in the modal fusion; (3) these models are not end-to-end structures. We propose a deep temporal architecture that performs not only classical modal fusion but also temporal modal fusion and temporal fusion. Furthermore, we study overfitting and learning with small sample sizes in AVSR and propose a set of useful training strategies. Experiments on three datasets (AVLetters, AVLetters2 and AVDigits) show the superiority of our model and the necessity of the training strategies.
1 Introduction
Automatic Speech Recognition (ASR) has a plethora of applications and is a significant component of communication between humans and computers. A growing body of research has steadily pushed its accuracy up [2, 27]. However, ASR has limitations, including poor robustness to noise and disturbance. Furthermore, an illusion occurs when the auditory component of one sound is paired with the visual component of another sound, leading to the perception of a third sound [15] (this is called the McGurk effect [13]). Researchers have therefore turned to Audiovisual Speech Recognition (AVSR), which uses both audio and visual information to tackle such problems and strengthen ASR systems [1, 17]. AVSR is, more generally, an application of multimodal machine learning. Much AVSR research focuses on multimodal machine learning, which enhances robustness and adaptability by using multiple modalities, and it has broadened related fields including Multimodal Deep Learning [17], Affective Computing [20] and Lipreading [5].
Traditional AVSR methods usually use multimodal extensions of Hidden Markov Models (HMMs) and statistical models such as Canonical Correlation Analysis (CCA) [11]. Recent work on Deep Neural Networks (DNNs) has verified their effectiveness for multimodal representation and fusion. DNN models have two main differences from, and advantages over, the classical methods. First, they exploit the strength of deep learning in feature extraction and refinement: Convolutional Neural Networks (CNNs) for extracting image features and unsupervised deep learning algorithms for refining audio features [14, 18, 28]. Second, they employ DNN models for fusion or recognition because of their better generalization and nonlinear transformation.
All these methods achieved better performance in AVSR. However, the existing traditional and DNN methods cannot satisfy the demand for higher recognition accuracy, and they show three primary disadvantages: (1) they mainly deal with modal fusion while ignoring temporal fusion; (2) traditional methods fail to consider the connection among frames in the modal fusion; (3) these models are not end-to-end structures.
To relieve these disadvantages, we propose a deep temporal architecture for audiovisual speech recognition. The architecture is sketched in Fig. 1. Our method has five phases: data pre-processing, modal fusion, temporal modal fusion, temporal fusion and the recognition network. The details of data pre-processing are not drawn in Fig. 1 because it is not the main part of the model. First, we extract the lip visual features and sound spectrogram features in the pre-processing phase. The modal fusion jointly fuses the lip features and sound spectrogram features of different frames. The temporal modal fusion considers the connection among frames in a video and further learns the joint representations. The temporal fusion encodes the modal-fused features of a video into a single feature vector in preparation for recognition. The proposed model takes the significant temporal information into account and helps obtain a more semantic representation from both modalities. In practice, overfitting is serious because the training set is small, a setting known as learning with small sample sizes. We therefore adopt a set of training strategies to fight overfitting, including visual data augmentation, aural data augmentation, multimodal data augmentation and other techniques.
In summary, the main contributions of this paper are as follows:
(1) An end-to-end deep temporal architecture that mixes unsupervised and supervised learning is proposed for AVSR. The model chiefly considers temporal information, whose value is demonstrated by the experimental results.
(2) We study overfitting and learning with small sample sizes in AVSR, and employ a set of training strategies to fight overfitting.
In the following sections, we first survey related work on AVSR in Sect. 2. In Sect. 3, we briefly review the two main components of our model, the Multimodal Deep Autoencoder (MDAE) [17] and Long Short-Term Memory (LSTM) [7], and then present the deep temporal architecture and training strategies for AVSR. In Sect. 4, we conduct AVSR and cross modality speech recognition experiments to evaluate the model and training strategies on three datasets: AVLetters (Patterson et al. 2002), AVLetters2 (Cox et al. 2008) and AVDigits (Di Hu et al. 2015); the results are then presented and discussed. Section 5 concludes the paper.
2 Related Work
This section reviews multimodal models for AVSR in historical order.
2.1 Traditional AVSR Systems
Research on multimodal processing and interaction has a long history. Humans understand the multimodal world in a seemingly effortless manner, although the brain dedicates vast information processing resources to the corresponding tasks [11]. Computers, however, find this problem difficult. Modal fusion, which has attracted numerous investigators for a long time, usually benefits discriminative tasks because it provides rich and hierarchical information.
AVSR in particular has been studied for some years, and a great deal of the work has focused on multimodal fusion. There are three levels of multimodal fusion: early fusion, intermediate fusion and late fusion. Early fusion concatenates video and audio features into a single descriptor. Late fusion is done at the decision stage, which resembles ensemble learning. Intermediate fusion, also called hybrid fusion, lies between early and late fusion [3, 11]. Early and late fusion are easy to understand and to put into practice, but both are too simple to capture the rich information in, and the correlation between, aural and visual features.
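The difference between early and late fusion can be illustrated with a minimal Python sketch. The feature values, the class scores and the weighted-average combination rule are illustrative assumptions and are not taken from any specific AVSR system.

```python
def early_fusion(video_feat, audio_feat):
    """Early fusion: concatenate per-frame features into one descriptor."""
    return list(video_feat) + list(audio_feat)


def late_fusion(video_scores, audio_scores, video_weight=0.5):
    """Late fusion: combine per-class scores from two separate classifiers.

    The weighted average used here is one common choice (an assumption).
    """
    if len(video_scores) != len(audio_scores):
        raise ValueError("both classifiers must score the same classes")
    w = video_weight
    return [w * v + (1.0 - w) * a for v, a in zip(video_scores, audio_scores)]


# Hypothetical features and class scores for a single frame / utterance.
video_feat = [0.2, 0.7, 0.1]
audio_feat = [0.9, 0.4]
descriptor = early_fusion(video_feat, audio_feat)

video_scores = [0.6, 0.3, 0.1]
audio_scores = [0.2, 0.7, 0.1]
fused_scores = late_fusion(video_scores, audio_scores)
predicted_class = max(range(len(fused_scores)), key=fused_scores.__getitem__)
```

In early fusion a single downstream model sees the joint descriptor, whereas in late fusion each modality is classified on its own and only the decisions are merged, which is why neither captures cross-modal correlation at intermediate levels.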
In the early years, investigators attempted to use probabilistic models, which depend on strong prior assumptions. The typical models are multistream HMMs (mHMMs) [16], which have shown a strong ability to model sequence data [22]. However, mHMMs do not work well: they build a hybrid fusion structure and map aural and visual features jointly to a low-dimensional space, but they are shallow models, and distinct modalities have different types of information representation. Research with this approach has concentrated on distinct feature extraction and different feature combinations [6].
2.2 Deep Learning for AVSR
Deep learning provides strong representation ability and capacity [4], especially CNNs for image feature extraction and representation [9, 10, 30, 32]. For instance, [18] uses visual features from a pre-trained CNN and aural features refined by a denoising autoencoder, and then applies traditional models for fusion and recognition. Such models mainly make use of the rich and robust feature representations of DNNs.
Other methods are based on multimodal fusion with DNNs.
[17] used MDAE; [25] used the Multimodal Deep Belief Network (MDBN); [26] used the Multimodal Deep Boltzmann Machine (MDBM); [28] used Deep Bottleneck Features (DBNF); [8] used Recurrent Temporal Multimodal Restricted Boltzmann Machines (RTMRBM). These models do achieve more accurate results. However, the majority of them are unsupervised, and most are based on Boltzmann Machines, which are considered difficult to train because of the partition function [4]. Besides, they involve several independent training and testing processes, and they lose temporal information in the modal fusion.
3 Proposed Method
In this section, we first briefly review the two main components of our model, MDAE and LSTM, and then present the deep temporal architecture for AVSR. Finally, we propose a set of training strategies for fighting overfitting and learning with small sample sizes.
3.1 MDAE and LSTM
MDAE: For the multimodal fusion task, [17] proposed two types of MDAE: the Bimodal Deep Autoencoder and the Video-Only Deep Autoencoder (a similar model can be drawn for the audio-only setting). The former reconstructs both modalities given audio and video, and is used for AVSR. The latter reconstructs both modalities given only one modality, and is used for cross modality speech recognition; that is, multiple modalities are available during training, but only data from a single modality is provided during testing [17]. The MDAE in [17] uses greedy RBM pre-training, so the data must be encoded as binary values.
LSTM: LSTM is a suitable algorithm for modeling sequence data. There are two widely known issues with properly training a vanilla RNN: the vanishing and the exploding gradient [19]. LSTM is a favorable off-the-shelf solution that does not require special RNN training tricks. It uses gates to mitigate gradient vanishing and exploding. A typical LSTM has three gates (input, output and forget), which preserve the sequence information and make Backpropagation Through Time (BPTT) easier.
3.2 Deep Temporal Architecture for AVSR
Our DNN architecture consists of four primary components: modal fusion, temporal modal fusion, temporal fusion and the recognition network.
Modal Fusion: MDAE is chosen as the main tool for modal fusion (Fig. 2). To eliminate truncation error, our model has no binary encoding process. We perform early fusion by concatenating video and audio features as the input and then reconstructing them. The shared representation is extracted as the jointly fused modal representation. With the growth of computation power, the MDAE is easy to train. The loss function of MDAE is:
where n is the number of features after pre-processing, \(x_i\) is the reconstruction and \(y_i\) is the original data.
Temporal Modal Fusion: LSTM is used for the temporal modal fusion because it takes temporal factors into account. The LSTM reduces the number of attributes and fuses the features more deeply, which simplifies the subsequent temporal fusion and recognition. Two LSTMs are used to accomplish the temporal modal fusion (Fig. 3).
The mechanism of temporal modal fusion is as follows:
Within the temporal modal fusion, W, U and b are the parameter matrices and vector, \(c_t\) is the cell state vector, \(f_t\) is the forget gate vector, \(i_t\) is the input gate vector and \(o_t\) is the output gate vector. As in a typical LSTM, the inner parts aim to prevent vanishing and exploding gradients and to pass on sequence information. \(x_t\) is the modal fused input and \(h_t\) is the temporal modal fused output.
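For reference, the standard LSTM update is consistent with the symbols defined here. The form below is the common textbook formulation and is given as an assumption about the notation, with \(\sigma\) the logistic sigmoid, \(\odot\) the element-wise product and subscripts on W, U and b naming the gate they belong to; the authors' exact equations (2)–(6) may be ordered or parameterised differently.

\[
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\]

The additive update of \(c_t\) is what lets gradients flow across many time steps: when \(f_t\) is close to one, the cell state is carried forward almost unchanged.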
Temporal Fusion: Traditional methods usually concatenate the different frames of one video after modal fusion, but this requires tuning extra hyperparameters, which modern learning architectures try to avoid. Here, another LSTM and mean pooling are used to map the audiovisual features after temporal modal fusion into a well-fused feature vector (shown in Fig. 4). The mechanism of the temporal fusion LSTM is roughly the same as (2)–(6), and the mean pooling works as follows:
where m is the number of frames in a video, \(h_t\) is the temporal modal fused vector, \(z_i\) is the output of temporal fusion.
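The pooling step can be sketched in a few lines of Python, assuming it is an element-wise average over the m frames, i.e. \(z_i = \frac{1}{m}\sum_{t=1}^{m} h_{t,i}\), which matches the symbol definitions given here. The function name and the toy values are hypothetical.

```python
def mean_pool(frame_vectors):
    """Average m per-frame vectors h_1..h_m element-wise into one vector z."""
    m = len(frame_vectors)
    if m == 0:
        raise ValueError("a video must contain at least one frame")
    dim = len(frame_vectors[0])
    if any(len(h) != dim for h in frame_vectors):
        raise ValueError("all frame vectors must have the same length")
    return [sum(h[i] for h in frame_vectors) / m for i in range(dim)]


# Hypothetical LSTM outputs for a 3-frame video with 2-dimensional states.
h = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
z = mean_pool(h)  # z[i] = (1/m) * sum over t of h[t][i]
```

Because the average is taken over however many frames are present, the output dimension does not depend on video length, so no per-video frame count has to be fixed in advance.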
Recognition Network: Our model uses a feed-forward network with batch normalization to map the features for recognition. It has eight layers, repeating in turn a fully connected layer, an activation layer with Rectified Linear Units (ReLU) and a batch normalization layer. The last layer is a standard softmax layer with a squared multi-label margin loss function:
where n is the number of features in a video, \(p_i\) is the output of the recognition network, \(q_i\) is the ground truth.
3.3 Training Strategies
AVSR samples are usually insufficient, so sequence learning overfits easily. Hence we use a set of training strategies for learning with small sample sizes and against overfitting.
Visual Data Augmentation: As mentioned before, insufficient visual data commonly causes serious overfitting in AVSR. We apply visual data augmentation to the extracted lip videos, using color jittering, small-angle rotation and random scaling. The augmentation also improves generalization to color changes, spatial variation and differences in image quality.
Aural Data Augmentation: To improve the model's tolerance to audio noise and prevent overfitting, we add white Gaussian noise in the training phase.
Multimodal Data Augmentation: Once video and audio have been fused by the modal fusion, we randomly shift the fused features of each video up or down by a few frames to augment the data. As a benefit, the trained model generalizes better to variance in the time domain.
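A minimal Python sketch of such a temporal shift is given below. The source does not say how positions vacated by the shift are filled; repeating the edge frame is an assumption made here, and the function names are hypothetical. Since the features are already fused, audio and video move together and stay aligned.

```python
import random


def shift_frames(frames, offset):
    """Shift a frame sequence in time, keeping its length.

    Positive offsets delay the content, negative offsets advance it.
    Vacated positions are filled by repeating the edge frame (an assumption).
    """
    n = len(frames)
    if n == 0:
        return []
    shifted = []
    for t in range(n):
        src = min(max(t - offset, 0), n - 1)  # clamp to the valid range
        shifted.append(frames[src])
    return shifted


def augment_by_shift(frames, max_shift, rng=random):
    """Return a copy of the sequence shifted by a random offset."""
    offset = rng.randint(-max_shift, max_shift)
    return shift_frames(frames, offset)


# Hypothetical fused features: one label per frame for readability.
fused = ["f0", "f1", "f2", "f3", "f4"]
delayed = shift_frames(fused, 2)
advanced = shift_frames(fused, -1)
augmented = augment_by_shift(fused, max_shift=2, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible across training runs.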
Other Techniques: Common DNN training strategies including dropout [24] and early stopping [21] are also used in our model.
4 Experiments and Discussion
In this section, we show the results of the proposed model compared with some state-of-the-art methods.
4.1 Datasets
We conducted the experiments on three datasets: AVLetters (Patterson et al. 2002), AVLetters2 (Cox et al. 2008) and AVDigits (Di Hu et al. 2015).
AVLetters. 10 volunteers say the letters A to Z three times each. The dataset provides pre-extracted lip regions of 60 \(\times \) 80 pixels and Mel-Frequency Cepstral Coefficients (MFCC).
AVLetters2. 5 volunteers say the letters A to Z seven times each. The dataset provides raw video and audio in different folders.
AVDigits. 6 speakers say the digits 0 to 9 nine times each. The video and audio are not pre-extracted; the dataset provides raw video clips of 1 s to 2 s in length.
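The dataset descriptions imply the following utterance counts, which make the small-sample setting concrete. The short Python calculation below assumes every speaker recorded every class the stated number of times; the released datasets may differ slightly.

```python
# (speakers, classes, repetitions per class) as stated in the dataset descriptions
datasets = {
    "AVLetters": (10, 26, 3),
    "AVLetters2": (5, 26, 7),
    "AVDigits": (6, 10, 9),
}

utterances = {
    name: speakers * classes * reps
    for name, (speakers, classes, reps) in datasets.items()
}

for name, count in utterances.items():
    print(f"{name}: {count} utterances")
```

Under this assumption each dataset has fewer than a thousand utterances (780, 910 and 540 respectively), which is small for training sequence models.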
4.2 Data Pre-processing
If the video and audio are not separated, we separate them from each other, and then truncate the video and audio to the same length.
Pre-processing of Video. First, the off-the-shelf Viola-Jones algorithm [29] is used to extract the Region-of-Interest surrounding the mouth. The region is resized to \(224\times 224\) pixels, and the aforementioned visual data augmentation strategy is used to double the lip visual data. The features are taken from the last fully connected layer of the pre-trained VGG-16 [23]. Finally, the features are centered and reduced to 100 principal components with PCA whitening.
Pre-processing of Audio. The aural data is doubled by adding white Gaussian noise with an SNR of 5 dB (signal power: noise power = 5:1), as mentioned in Sect. 3.3. The audio features are extracted as a spectrogram with a 20 ms Hamming window and 10 ms overlap. The spectral coefficient vector is obtained from 251 points of the Fast Fourier Transform and reduced to 50 principal components by PCA.
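The noise injection can be sketched in plain Python as follows. Under the usual definition \(\mathrm{SNR_{dB}} = 10\log_{10}(P_{signal}/P_{noise})\), a value of 5 dB corresponds to a power ratio of about 3.16, while a 5:1 ratio corresponds to about 6.99 dB, so the sketch takes the linear power ratio as an explicit parameter rather than guessing which figure was intended. The function names, the sine test signal and the sampling rate are hypothetical.

```python
import math
import random


def db_to_power_ratio(snr_db):
    """Convert an SNR in decibels to a linear signal-to-noise power ratio."""
    return 10.0 ** (snr_db / 10.0)


def power_ratio_to_db(ratio):
    """Convert a linear signal-to-noise power ratio to decibels."""
    return 10.0 * math.log10(ratio)


def add_white_gaussian_noise(signal, power_ratio, rng=random):
    """Add zero-mean Gaussian noise whose expected power is signal power / ratio."""
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_std = math.sqrt(signal_power / power_ratio)
    return [s + rng.gauss(0.0, noise_std) for s in signal]


# Hypothetical test signal: a 440 Hz sine sampled at 16 kHz for one second.
clean = [math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(16000)]
noisy = add_white_gaussian_noise(clean, power_ratio=5.0, rng=random.Random(0))

ratio_for_5_db = db_to_power_ratio(5.0)   # about 3.16
db_for_5_to_1 = power_ratio_to_db(5.0)    # about 6.99
```

Keeping one clean and one noisy copy of every utterance doubles the aural training data, as described in the text.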
Visual and Aural Features Combination. Once video and audio are both prepared, four contiguous audio frames correspond to one video frame at each time step.
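The alignment can be sketched in Python as below. The per-frame dimensions (100 visual and 50 aural principal components) come from the pre-processing description; the video-first concatenation order, the dropping of incomplete trailing steps and the frame counts are assumptions for illustration.

```python
def align_audio_to_video(video_frames, audio_frames, audio_per_video=4):
    """Build one joint vector per time step.

    Each step concatenates one video feature vector with the next
    `audio_per_video` contiguous audio feature vectors. Trailing frames
    that do not fill a complete step are dropped (an assumption).
    """
    steps = min(len(video_frames), len(audio_frames) // audio_per_video)
    joint = []
    for t in range(steps):
        start = t * audio_per_video
        audio_block = []
        for frame in audio_frames[start:start + audio_per_video]:
            audio_block.extend(frame)
        joint.append(list(video_frames[t]) + audio_block)
    return joint


# 100 visual and 50 aural principal components per frame, as in the
# pre-processing description. The frame counts are hypothetical.
video = [[0.0] * 100 for _ in range(10)]
audio = [[0.0] * 50 for _ in range(41)]
steps = align_audio_to_video(video, audio)
step_dim = len(steps[0])  # 100 + 4 * 50
```

With these dimensions each time step would carry 100 + 4 × 50 = 300 values into the modal fusion network.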
4.3 Implementation Details
The modal fusion network has eight layers, and its loss is binary cross entropy. Once the modal fusion has finished, the shared representation is augmented with the multimodal data augmentation strategy of Sect. 3.3, randomly shifting ten frames up or down, and the fused data is reshaped for temporal modal fusion. The reshaped fused features are then sent to the temporal modal fusion network, which continues to fuse the existing shared representation of audiovisual information. The temporal fusion network maps the features of the several frames in a video to one feature vector. Finally, every video has one low-dimensional, well-fused vector ready for recognition. The recognition network has two nonlinear layers to increase the nonlinearity.
The temporal modal fusion, temporal fusion and recognition network are trained with one loss function. The model thus needs two training steps, unsupervised training followed by supervised training. At evaluation time a reshape layer connects the two parts, so evaluation and testing take a single step.
4.4 Quantitative Evaluation
To evaluate the proposed model, we conducted AVSR and cross modality speech recognition experiments on the multimodal data. Experiments were also conducted to evaluate the necessity of the training strategies.
Evaluation of AVSR and Training Strategies: We evaluate our method on the AVSR task, comparing it with MDAE, MDBN and RTMRBM on AVLetters2 and AVDigits. The training strategies of Sect. 3.3 are evaluated in the same experiment. The quantitative results demonstrate the superiority of our model for AVSR and the necessity of the training strategies. In addition, in our practice the model often converges slowly without the training strategies (Table 1).
Evaluation of Cross Modality Speech Recognition: One purpose of multimodal deep learning is to learn better single modality representations given unlabeled data from multiple modalities [17]. In the cross modality learning experiments, we evaluate the accuracy of one modality (e.g. V) when multiple modalities (e.g. V and A) are given during learning. We compare the model with MDAE, CRBM [1] and some single modality models, including Multiscale Spatial Analysis [12] and Local Binary Pattern [31], on AVLetters. The experiments show that proper cross modality speech recognition is better than single modal speech recognition, and that our method performs better than other related multimodal models in cross modality speech recognition (Table 2).
5 Conclusion
A deep temporal architecture and a set of training strategies are proposed for the AVSR task in this paper. Once the model has been trained, it is simple to apply to AVSR in practice. Our method considers modal fusion, temporal modal fusion and temporal fusion; together these enhance robustness and the ability to model sequences. The experimental results suggest that the deep temporal architecture achieves better AVSR and cross modality speech recognition results on three datasets. In addition, our training strategies effectively reduce overfitting, as the experiments show.
References
1. Amer, M.R., Siddiquie, B., Khan, S., Divakaran, A., Sawhney, H.: Multimodal fusion using dynamic hybrid models, pp. 556–563 (2014)
2. Amodei, D., Anubhai, R., Battenberg, E., Case, C.J., Casper, J., Catanzaro, B., Chen, J., Chrzanowski, M., Coates, A., Diamos, G., et al.: Deep speech 2: end-to-end speech recognition in English and Mandarin, pp. 173–182 (2015)
3. Atrey, P.K., Hossain, M.A., El Saddik, A., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: a survey. Multimed. Syst. 16(6), 345–379 (2010)
4. Bengio, Y., Goodfellow, I.J., Courville, A.: Deep Learning. MIT Press, Cambridge (2015). http://www.iro.umontreal.ca/bengioy/dlbook
5. Chung, J.S., Senior, A., Vinyals, O., Zisserman, A.: Lip reading sentences in the wild (2016)
6. Galatas, G., Potamianos, G., Makedon, F.: Audio-visual speech recognition incorporating facial depth information captured by the Kinect. In: 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pp. 2714–2717. IEEE (2012)
7. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
8. Hu, D., Li, X., et al.: Temporal multimodal learning in audiovisual speech recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3574–3582 (2016)
9. Kruthiventi, S.S., Ayush, K., Babu, R.V.: DeepFix: a fully convolutional neural network for predicting human eye fixations. IEEE Trans. Image Process. (2017)
10. Lu, X., Zheng, X., Yuan, Y.: Remote sensing scene classification by unsupervised representation learning. IEEE Trans. Geosci. Remote Sens. (2017)
11. Maragos, P., Potamianos, A., Gros, P.: Multimodal Processing and Interaction: Audio, Video, Text, vol. 33. Springer Science & Business Media, Heidelberg (2008). https://doi.org/10.1007/978-0-387-76316-3
12. Matthews, I., Cootes, T.F., Bangham, J.A., Cox, S., Harvey, R.: Extraction of visual features for lipreading. IEEE Trans. Pattern Anal. Mach. Intell. 24(2), 198–213 (2002)
13. McGurk, H., MacDonald, J.: Hearing lips and seeing voices. Nature 264, 746–748 (1976)
14. Mroueh, Y., Marcheret, E., Goel, V.: Deep multimodal learning for audio-visual speech recognition, pp. 2130–2134 (2015)
15. Nath, A.R., Beauchamp, M.S.: A neural basis for interindividual differences in the McGurk effect, a multisensory speech illusion. NeuroImage 59(1), 781–787 (2012)
16. Nefian, A.V., Liang, L., Pi, X., Liu, X., Murphy, K.: Dynamic Bayesian networks for audio-visual speech recognition. EURASIP J. Adv. Sig. Process. 2002(11), 1–15 (2002)
17. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: Proceedings of the 28th International Conference on Machine Learning, ICML 2011, pp. 689–696 (2011)
18. Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H.G., Ogata, T.: Audio-visual speech recognition using deep learning. Appl. Intell. 42(4), 722–737 (2015)
19. Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: ICML (3), vol. 28, pp. 1310–1318 (2013)
20. Poria, S., Cambria, E., Bajpai, R., Hussain, A.: A review of affective computing: from unimodal analysis to multimodal fusion. Inf. Fusion 37, 98–125 (2017)
21. Prechelt, L.: Early stopping - but when? In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. LNCS, vol. 7700, 2nd edn, pp. 53–67. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35289-8_5
22. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
23. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
24. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
25. Srivastava, N., Salakhutdinov, R.: Learning representations for multimodal data with deep belief nets. In: International Conference on Machine Learning Workshop (2012)
26. Srivastava, N., Salakhutdinov, R.R.: Multimodal learning with deep Boltzmann machines. In: Advances in Neural Information Processing Systems, pp. 2222–2230 (2012)
27. Sutskever, I., Vinyals, O., Le, Q.: Sequence to sequence learning with neural networks, pp. 3104–3112 (2014)
28. Tamura, S., Ninomiya, H., Kitaoka, N., Osuga, S., Iribe, Y., Takeda, K., Hayamizu, S.: Audio-visual speech recognition using deep bottleneck features and high-performance lipreading. In: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 575–582. IEEE (2015)
29. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, vol. 1, pp. I–511. IEEE (2001)
30. Xia, G.S., Hu, J., Hu, F., Shi, B., Bai, X., Zhong, Y., Zhang, L., Lu, X.: AID: a benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. (2017)
31. Zhao, G., Barnard, M., Pietikainen, M.: Lipreading with local spatiotemporal descriptors. IEEE Trans. Multimed. 11(7), 1254–1265 (2009)
32. Zheng, X., Yuan, Y., Lu, X.: Dimensionality reduction by spatial-spectral preservation in selected bands. IEEE Trans. Geosci. Remote Sens. (2017)
© 2017 Springer Nature Singapore Pte Ltd.
Cite this paper
Tian, C., Yuan, Y., Lu, X. (2017). Deep Temporal Architecture for Audiovisual Speech Recognition. In: Yang, J., et al. Computer Vision. CCCV 2017. Communications in Computer and Information Science, vol 771. Springer, Singapore. https://doi.org/10.1007/978-981-10-7299-4_54
DOI: https://doi.org/10.1007/978-981-10-7299-4_54
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7298-7
Online ISBN: 978-981-10-7299-4
eBook Packages: Computer Science (R0)