1 Introduction

Automatic Speech Recognition (ASR) has a plethora of applications and plays a significant role in communication between humans and computers. Increasing research in this area has pushed accuracy steadily higher [2, 27]. However, ASR still has limitations, including poor robustness to noise and interference. Furthermore, an illusion occurs when the auditory component of one sound is paired with the visual component of another sound, leading to the perception of a third sound [15]; this is called the McGurk effect [13]. Therefore, researchers have turned to Audiovisual Speech Recognition (AVSR), which utilizes both audio and visual information to tackle such problems and strengthen ASR systems [1, 17]. More generally, AVSR is an application of multimodal machine learning. Much research in AVSR focuses on multimodal machine learning, which enhances robustness and adaptability by using multiple modalities and broadens the horizons of fields including Multimodal Deep Learning [17], Affective Computing [20] and Lipreading [5].

Traditional AVSR methods usually use multimodal extensions of Hidden Markov Models (HMMs) and statistical models such as Canonical Correlation Analysis (CCA) [11]. Recent related work on Deep Neural Networks (DNNs) has verified their efficiency for multimodal representation and fusion. DNN models have two main differences and advantages compared with the classical methods. First, they exploit the superior feature extraction and refinement of deep learning: Convolutional Neural Networks (CNNs) for extracting image features and unsupervised deep learning algorithms for refining audio features [14, 18, 28]. Second, they employ DNN models for fusion or recognition because of their better generalization and nonlinear transformation ability.

Fig. 1. The simplified architecture of our model

All these methods achieved better performance in AVSR. However, the existing traditional and DNN methods cannot satisfy the demand for higher recognition accuracy and show three primary disadvantages. (1) They mainly deal with modal fusion while ignoring temporal fusion. (2) Traditional methods fail to consider the connection among frames in the modal fusion. (3) These models are not end-to-end structures.

To relieve these disadvantages, we propose a deep temporal architecture for audiovisual speech recognition. The architecture of the proposed method is shown in simplified form in Fig. 1. There are five phases in our method: data pre-processing, modal fusion, temporal modal fusion, temporal fusion and the recognition network. The details of data pre-processing are not drawn in Fig. 1 as it is not the main part of the model. First, we extract lip visual features and sound spectrogram features in the pre-processing phase. The modal fusion jointly fuses the lip features and sound spectrogram features of different frames. The temporal modal fusion considers the connection among the frames of a video and further learns the joint representations. The temporal fusion encodes the modal-fused features of a video into a single feature vector in preparation for recognition. The proposed model thus takes the significant temporal information into account and helps obtain a more semantic representation of both modalities. In practice, overfitting is serious because the training set is not sufficiently large, a setting known as learning with small sample sizes; hence we adopt a set of training strategies to fight overfitting, including visual data augmentation, aural data augmentation, multimodal data augmentation and other techniques.

In summary, the main contributions of this paper are as follows:

  1. An end-to-end deep temporal architecture mixing unsupervised with supervised learning is advanced for AVSR. The model chiefly considers the significance of temporal information, whose value is demonstrated by the experimental results.

  2. We study overfitting and learning with small sample sizes in AVSR. A set of training strategies is employed to fight overfitting.

In the following sections, we first survey the related work on AVSR in Sect. 2. In Sect. 3, we briefly review the two main components of our model, the Multimodal Deep Autoencoder (MDAE) [17] and Long Short-Term Memory (LSTM) [7], and then advance the deep temporal architecture and training strategies for AVSR. In Sect. 4, we conduct AVSR and cross-modality speech recognition experiments to evaluate the model and training strategies on three datasets, AVLetters (Patterson et al. 2002), AVLetters2 (Cox et al. 2008) and AVDigits (Di Hu et al. 2015), and then present and discuss the results. Section 5 concludes this paper.

2 Related Work

This section reviews multimodal models related to AVSR in historical order.

2.1 Traditional AVSR Systems

Research on multimodal processing and interaction has a long history. Humans understand the multimodal world in a seemingly effortless manner, although the brain devotes vast information processing resources to the corresponding tasks [11]. For computers, however, this problem is difficult. Modal fusion, which has attracted numerous investigators for a long time, usually benefits discriminative tasks because of its rich and hierarchical information.

AVSR in particular has been studied for years, and a great deal of work has focused on multimodal fusion. There are three levels of multimodal fusion: early fusion, intermediate fusion and late fusion. Early fusion concatenates the video and audio features into a single descriptor. Late fusion is performed at the decision level, which resembles ensemble learning. Intermediate fusion, also called hybrid fusion, lies between early and late fusion [3, 11]. Early and late fusion are uncomplicated to understand and easy to put into practice, but both are too straightforward to capture the abundant information and correlation between aural and visual features.

In the early years, investigators attempted to use probabilistic models, but these depend on strong prior assumptions. The typical models are multistream HMMs (mHMMs) [16], which were shown to have a strong ability to model sequence data [22]. However, mHMMs do not work well: although they build a hybrid fusion structure and map aural and visual features jointly to a low-dimensional space, they are shallow models, and the distinct modalities have different types of information representations. With this approach, researchers are mainly interested in distinct feature extraction and different feature combinations [6].

2.2 Deep Learning for AVSR

Deep learning provides strong representation ability and capacity [4], especially CNNs for image feature extraction and representation [9, 10, 30, 32]. For instance, [18] uses visual features from a pre-trained CNN and aural features refined by a denoising autoencoder, and then traditional models are used for fusion and recognition. As has been argued, these models mainly make use of the rich and robust feature representation of DNNs.

Other methods perform the multimodal fusion itself with DNNs, at the early, intermediate or late fusion levels described in Sect. 2.1.

[17] used the MDAE; [25] used the Multimodal Deep Belief Network (MDBN); [26] used the Multimodal Deep Boltzmann Machine (MDBM); [28] used Deep Bottleneck Features (DBNF); [8] used the Recurrent Temporal Multimodal Restricted Boltzmann Machine (RTMRBM). They indeed achieve more accurate results. But the majority of them are unsupervised models, and most are based on Boltzmann Machines, which are considered difficult to train because of the partition function [4]. Besides, they involve several independent training and testing processes, and they give rise to the loss of temporal information in the modal fusion.

3 Proposed Method

In this section, we first briefly review the two main components of our model, the MDAE and LSTM, and then advance the deep temporal architecture for AVSR. Finally, we propose a set of training strategies for fighting overfitting and learning with small sample sizes.

Fig. 2. Bimodal DAE and video-only DAE in our model

3.1 MDAE and LSTM

MDAE: For the multimodal fusion task, [17] advanced two types of MDAE: the Bimodal Deep Autoencoder and the Video-Only Deep Autoencoder (a similar model can be drawn for the audio-only setting). The first reconstructs both modalities given audio and video and is used for AVSR; the second reconstructs both modalities given only one modality and is used for cross-modality speech recognition, that is, multiple modalities are available during training, but during the testing phase only data from a single modality is provided [17]. The MDAE in [17] has an RBM greedy pre-training process, so the data has to be encoded as binary.

LSTM: LSTM is a suitable algorithm for modeling sequence data. There are two widely known issues with properly training a vanilla RNN: the vanishing and the exploding gradient [19]. Rather than relying on special training tricks for RNNs, LSTM is an off-the-shelf, favorable solution. It uses gates to avoid gradient vanishing and exploding. A typical LSTM has three gates, an input gate, an output gate and a forget gate, which preserve the sequence information and make Backpropagation Through Time (BPTT) easier.

3.2 Deep Temporal Architecture for AVSR

Our DNN architecture consists of four primary components: modal fusion, temporal modal fusion, temporal fusion and the recognition network.

Modal Fusion: The MDAE is chosen as the main tool of the modal fusion (Fig. 2). In our model, in order to eliminate truncation error, there is no binary encoding process. We perform the early fusion by concatenating the video and audio features as the input and then reconstructing them. The shared representation is extracted as the jointly fused modal representation. With the growth of computational power, the MDAE is easy to train. The loss function of the MDAE is:

$$\begin{aligned} loss(x, y)=-\frac{1}{n}\sum ^{n}_{i=1}[y_{i}log(x_i)+(1-y_i)log(1-x_i)], \end{aligned}$$
(1)

where n is the number of features after pre-processing, \(x_i\) is the reconstruction and \(y_i\) is the original data.
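
To make the modal fusion concrete, below is a minimal PyTorch sketch of a bimodal deep autoencoder of the kind described above. The layer widths and the names AUDIO_DIM, VIDEO_DIM and SHARED_DIM are illustrative assumptions, not the configuration used in the paper, and the features are assumed to be scaled into [0, 1] so that the binary cross-entropy of Eq. (1) applies.

```python
import torch
import torch.nn as nn

AUDIO_DIM, VIDEO_DIM, SHARED_DIM = 200, 100, 64  # assumed sizes


class BimodalDAE(nn.Module):
    """Concatenated audio and video features are encoded into a shared
    representation and both modalities are reconstructed from it."""

    def __init__(self):
        super().__init__()
        in_dim = AUDIO_DIM + VIDEO_DIM
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, SHARED_DIM), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(SHARED_DIM, 256), nn.ReLU(),
            nn.Linear(256, in_dim), nn.Sigmoid(),  # outputs in (0, 1) for BCE
        )

    def forward(self, audio, video):
        x = torch.cat([audio, video], dim=1)   # early fusion by concatenation
        shared = self.encoder(x)               # joint modal-fused representation
        recon = self.decoder(shared)
        return shared, recon


# Reconstruction loss as in Eq. (1): binary cross-entropy between the
# reconstruction and the original concatenated features.
model = BimodalDAE()
audio, video = torch.rand(8, AUDIO_DIM), torch.rand(8, VIDEO_DIM)
shared, recon = model(audio, video)
loss = nn.functional.binary_cross_entropy(recon, torch.cat([audio, video], dim=1))
```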

Temporal Modal Fusion: LSTM is used for the temporal modal fusion, since it takes the temporal factors into account. With the reduction of attributes and the deeper fusion of the features after the LSTM, the subsequent temporal fusion and recognition become simpler. Two LSTMs are used to accomplish the temporal modal fusion (Fig. 3).

Fig. 3. Temporal modal fusion structure

The mechanism of temporal modal fusion is as follows:

$$\begin{aligned} i_t = sigmoid(W_{i}x_t+U_{i}c_{t-1}+b_i) \end{aligned}$$
(2)
$$\begin{aligned} f_t = sigmoid(W_{f}x_t+U_{f}c_{t-1}+b_f) \end{aligned}$$
(3)
$$\begin{aligned} o_t = sigmoid(W_{o}x_t+U_{o}c_{t-1}+b_o) \end{aligned}$$
(4)
$$\begin{aligned} c_t = f_t\circ c_{t-1}+i_t \circ tanh(W_c x_t +b_c) \end{aligned}$$
(5)
$$\begin{aligned} h_t = o_{t}\circ tanh(c_t), \end{aligned}$$
(6)

Within the temporal modal fusion, W, U and b are the parameter matrices and vectors, \(c_t\) is the cell state vector, \(f_t\) is the forget gate vector, \(i_t\) is the input gate vector and \(o_t\) is the output gate vector. These inner parts aim to prevent vanishing and exploding gradients and to pass the sequence information along, as in a typical LSTM. \(x_t\) is the modal-fused input and \(h_t\) is the temporal-modal-fused output.
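
As a rough illustration, the temporal modal fusion can be realized with a stacked PyTorch LSTM over the per-frame modal-fused features, as in the sketch below. The two-layer configuration and the dimensions are assumptions; note also that nn.LSTM implements the standard formulation and omits the \(c_{t-1}\) terms that appear inside the gates of Eqs. (2)-(4).

```python
import torch
import torch.nn as nn

FUSED_DIM, HIDDEN_DIM = 64, 48  # assumed sizes

# Two stacked LSTMs over the sequence of modal-fused frame features.
temporal_modal_fusion = nn.LSTM(input_size=FUSED_DIM, hidden_size=HIDDEN_DIM,
                                num_layers=2, batch_first=True)

fused_frames = torch.randn(8, 20, FUSED_DIM)      # (batch, frames, features)
h_seq, (h_n, c_n) = temporal_modal_fusion(fused_frames)
# h_seq holds h_t for every frame: the temporal-modal-fused outputs.
```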

Temporal Fusion: Traditional methods usually concatenate the different frames of a video after modal fusion. But this requires hyperparameters to tune, which is not recommended in modern learning architectures. Here, another LSTM and mean pooling are used to map the audiovisual features after temporal modal fusion into a well-fused feature vector (shown in Fig. 4). The mechanism of the temporal fusion LSTM is roughly the same as (2)–(6), and the mean pooling works as follows:

Fig. 4. Temporal fusion structure

$$\begin{aligned} z_i = \frac{1}{m}\sum ^m_{t=1}h_t, \end{aligned}$$
(7)

where m is the number of frames in a video, \(h_t\) is the temporal-modal-fused vector and \(z_i\) is the output of the temporal fusion.
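
A minimal sketch of the temporal fusion step under the same assumptions as above: one more LSTM pass followed by mean pooling over the frame axis, as in Eq. (7). The dimensions are again placeholders.

```python
import torch
import torch.nn as nn

HIDDEN_DIM, TEMPORAL_DIM = 48, 32  # assumed sizes

temporal_fusion = nn.LSTM(input_size=HIDDEN_DIM, hidden_size=TEMPORAL_DIM,
                          batch_first=True)

h_seq = torch.randn(8, 20, HIDDEN_DIM)    # temporal-modal-fused frame features
out, _ = temporal_fusion(h_seq)
z = out.mean(dim=1)                       # Eq. (7): mean over the m frames
# z is one well-fused feature vector per video, ready for recognition.
```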

Recognition Network: Our model uses a feed-forward network with batch normalization to map the features for recognition. It has eight layers: a fully connected layer, an activation layer with Rectified Linear Units (ReLU) and a batch normalization layer, repeated in turn. The last layer is a general softmax layer with a squared multi-label margin loss function:

$$\begin{aligned} loss(p,q)=\frac{1}{n}\sum ^{n}_{i=1}max\{0,1-p_i+q_i\}^2, \end{aligned}$$
(8)

where n is the number of output units, \(p_i\) is the output of the recognition network and \(q_i\) is the ground truth.
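
The recognition network can be sketched as below. The hidden width and class count are assumptions, and the loss is Eq. (8) written out directly rather than a built-in criterion, so this is one plausible reading of the description above rather than the exact network used in the paper.

```python
import torch
import torch.nn as nn

TEMPORAL_DIM, HIDDEN, NUM_CLASSES = 32, 128, 26  # assumed sizes (e.g. 26 letters)

# Fully connected -> ReLU -> batch normalization blocks, repeated, with a
# softmax output layer, following the eight-layer description above.
recognition_net = nn.Sequential(
    nn.Linear(TEMPORAL_DIM, HIDDEN), nn.ReLU(), nn.BatchNorm1d(HIDDEN),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(), nn.BatchNorm1d(HIDDEN),
    nn.Linear(HIDDEN, NUM_CLASSES), nn.Softmax(dim=1),
)


def squared_margin_loss(p, q):
    """Eq. (8): mean squared hinge between the network output p and the
    target q, with q assumed one-hot and of the same shape as p."""
    return torch.clamp(1.0 - p + q, min=0.0).pow(2).mean()


z = torch.randn(8, TEMPORAL_DIM)
q = torch.zeros(8, NUM_CLASSES)
q[:, 0] = 1.0                             # dummy one-hot targets
loss = squared_margin_loss(recognition_net(z), q)
```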

3.3 Training Strategies

The samples available for AVSR are usually insufficient, and therefore sequence learning easily overfits. Hence we utilize a set of training strategies for learning with small sample sizes and fighting overfitting.

Visual Data Augmentation: As mentioned before, insufficient visual data commonly causes serious overfitting in AVSR. We apply visual data augmentation to the extracted lip videos. Color jittering, small-angle rotation and random scaling are adopted to augment the data. The augmentation also enhances generalization to color changes, spatial variation and differences in image quality.
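
A sketch of the visual augmentation using torchvision transforms; the jitter, rotation and scaling ranges below are illustrative values rather than the ones used in the paper.

```python
from torchvision import transforms

# Color jittering, small-angle rotation and random scaling of the lip frames.
visual_augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(degrees=5),
    transforms.RandomAffine(degrees=0, scale=(0.9, 1.1)),
])
# Applying visual_augment to each lip image yields an extra, perturbed copy
# of the visual training data.
```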

Aural Data Augmentation: To improve the model's tolerance to audio noise and prevent overfitting, we add white Gaussian noise in the training phase.
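
A minimal sketch of the aural augmentation, assuming a one-dimensional waveform; the function name and the default SNR are illustrative.

```python
import numpy as np


def add_white_noise(signal, snr_db=5.0):
    """Add white Gaussian noise to a 1-D waveform at the given SNR in dB."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise
```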

Multimodal Data Augmentation: After video and audio have been fused in the modal fusion, we randomly shift the fused features of each video by a few frames forwards or backwards to augment the data. As a benefit, the trained model generalizes better to variation in the time domain.
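
A sketch of the multimodal augmentation as a random temporal shift of the fused frame sequence; the maximum shift and the edge-padding choice are assumptions.

```python
import numpy as np


def random_temporal_shift(fused_frames, max_shift=2):
    """Shift a (frames, features) array a few frames forward or backward,
    repeating the edge frames so the sequence length is unchanged."""
    shift = np.random.randint(-max_shift, max_shift + 1)
    if shift == 0:
        return fused_frames
    shifted = np.roll(fused_frames, shift, axis=0)
    if shift > 0:
        shifted[:shift] = fused_frames[0]     # pad the start with the first frame
    else:
        shifted[shift:] = fused_frames[-1]    # pad the end with the last frame
    return shifted
```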

Other Techniques: Common DNN training strategies, including dropout [24] and early stopping [21], are also used in our model.

4 Experiments and Discussion

In this section, we present the results of the proposed model compared with several state-of-the-art methods.

4.1 Datasets

We conducted the experiments on three datasets: AVLetters (Patterson et al. 2002), AVLetters2 (Cox et al. 2008) and AVDigits (Di Hu et al. 2015).

AVLetters. 10 volunteers saying the letters A to Z three times each. The dataset provides pre-extracted lip regions of 60 \(\times \) 80 pixels and Mel-Frequency Cepstrum Coefficients (MFCC).

AVLetters2. 5 volunteers spoke the letters A to Z seven times each. The dataset provides raw video and audio in separate folders.

AVDigits. 6 speakers saying the digits 0 to 9, nine times each. It does not pre-extract the video and audio, and provides raw videos of varying length, from 1 s to 2 s.

4.2 Data Pre-processing

If the video and audio are not already separated, we separate them from each other and then truncate the video and audio to the same length.

Pre-processing of Video. First, the off-the-shelf Viola-Jones algorithm [29] is used to extract the region of interest surrounding the mouth. The region is resized to \(224\times 224\) pixels, and the aforementioned visual data augmentation strategy is used to double the lip visual data. The features are obtained from the last fully connected layer of a pre-trained VGG-16 [23]. Finally, the features are reduced to 100 principal components with PCA whitening and centered.
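
A hedged sketch of this visual pipeline: an OpenCV Haar-cascade (Viola-Jones) detector for the mouth region, VGG-16 features from torchvision and PCA whitening from scikit-learn. The cascade file name is a hypothetical placeholder, the first detection is taken without any filtering, and the torchvision weights API assumes version 0.13 or later.

```python
import cv2
import torch
import numpy as np
from torchvision import models, transforms
from sklearn.decomposition import PCA

# Hypothetical cascade file; any Viola-Jones mouth detector plays this role.
mouth_cascade = cv2.CascadeClassifier("haarcascade_mcs_mouth.xml")

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()
to_tensor = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


def frame_to_feature(frame_bgr):
    """Detect the mouth region, resize it to 224x224 and take VGG-16 features."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    x, y, w, h = mouth_cascade.detectMultiScale(gray)[0]   # first detection
    roi = cv2.resize(frame_bgr[y:y + h, x:x + w], (224, 224))
    roi = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        feat = vgg(to_tensor(roi).unsqueeze(0))            # last FC layer output
    return feat.squeeze(0).numpy()


# PCA whitening to 100 components over all extracted frame features, e.g.:
# features = np.stack([frame_to_feature(f) for f in frames])
# reduced = PCA(n_components=100, whiten=True).fit_transform(features)
```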

Pre-processing of Audio. The aural data are doubled by adding white Gaussian noise at an SNR of 5 dB, as mentioned in Sect. 3.3. The audio features are extracted as a spectrogram with a 20 ms Hamming window and 10 ms overlap. The spectral coefficient vector is obtained with 251 points of the Fast Fourier Transform and reduced to 50 principal components by PCA.
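
A sketch of the spectrogram extraction with SciPy, assuming a 16 kHz waveform. The FFT length below is an assumption chosen so that the one-sided spectrum has 251 coefficients, matching the 251 spectral points mentioned above.

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.decomposition import PCA

FS = 16000  # assumed sampling rate


def audio_to_features(waveform):
    """Spectrogram with a 20 ms Hamming window and 10 ms overlap; a 500-point
    FFT yields 251 one-sided spectral coefficients per frame."""
    nperseg = int(0.020 * FS)          # 20 ms window
    noverlap = int(0.010 * FS)         # 10 ms overlap
    _, _, spec = spectrogram(waveform, fs=FS, window="hamming",
                             nperseg=nperseg, noverlap=noverlap, nfft=500)
    return spec.T                      # (frames, 251 spectral coefficients)


# PCA down to 50 principal components over all audio frames, e.g.:
# reduced = PCA(n_components=50).fit_transform(np.vstack(
#     [audio_to_features(w) for w in waveforms]))
```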

Visual and Aural Feature Combination. When both video and audio are prepared, four contiguous audio frames are aligned with one video frame at each time step.
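
One simple way to realize the four-to-one alignment is to stack every four contiguous audio frames next to their video frame, as in the sketch below, which assumes both streams are already trimmed to matching lengths.

```python
import numpy as np


def combine_frames(video_feats, audio_feats):
    """Concatenate each video frame with its four contiguous audio frames.
    video_feats: (m, d_v); audio_feats: (4 * m, d_a), already length-matched."""
    m = video_feats.shape[0]
    audio_grouped = audio_feats[:4 * m].reshape(m, -1)   # (m, 4 * d_a)
    return np.concatenate([video_feats, audio_grouped], axis=1)
```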

4.3 Implementation Details

The modal fusion network has eight layers and its loss is the binary cross-entropy. When the modal fusion has finished, the shared representation is augmented with the multimodal data augmentation strategy of Sect. 3.3, randomly shifting the fused features ten frames forwards or backwards, and the fused data are reshaped for temporal modal fusion. The reshaped fused features are then sent to the temporal modal fusion network, which continues fusing the existing shared representation of the audiovisual information. The temporal fusion network maps the features of the several frames in a video to a single feature vector. Finally, every video has one low-dimensional, well-fused vector ready for recognition. The recognition network has two nonlinear layers to increase the nonlinearity.

The temporal modal fusion, temporal fusion and recognition networks are trained with a single loss function. Thus the model needs two training steps, unsupervised training followed by supervised training; a reshape layer connects them directly, so evaluation and testing take only one step.

4.4 Quantitative Evaluation

To evaluate the proposed model, we conducted AVSR and cross-modality speech recognition experiments on the multimodal data. At the same time, experiments were conducted to evaluate the necessity of the training strategies.

Evaluation of AVSR and Training Strategies: We evaluate our method on the AVSR task against MDAE, MDBN and RTMRBM on AVLetters2 and AVDigits. Moreover, the training strategies of Sect. 3.3 are evaluated in this experiment. The quantitative results demonstrate the superiority of our model for AVSR and the necessity of the training strategies. In addition, in our experience it is not uncommon for the model to converge slowly without the training strategies (Table 1).

Table 1. AVSR performance on AVLetters2 and AVDigits. The results indicate that our model performs better than MDBN, MDAE and RTMRBM, and that the model with the training strategies of Sect. 3.3 is better than the model without them.
Table 2. Cross-modality speech recognition performance. The results for the V modality suggest that cross-modality speech recognition (MDAE, CRBM, our method) is better than single-modality speech recognition (Multiscale Spatial Analysis, Local Binary Pattern). The V-modality and A-modality experiments show that our method performs better than the other models in cross-modality speech recognition.

Evaluation of Cross-Modality Speech Recognition: One purpose of multimodal deep learning is to learn better single-modality representations given unlabeled data from multiple modalities [17]. In the cross-modality learning experiments, we evaluate the accuracy on one modality (e.g. V) when multiple modalities (e.g. V and A) are given during learning. We compare the model with MDAE, CRBM [1] and some single-modality models, including Multiscale Spatial Analysis [12] and Local Binary Patterns [31], on AVLetters. The experiments show that proper cross-modality speech recognition is better than single-modality speech recognition, and moreover that our method performs better than the other related multimodal models in cross-modality speech recognition (Table 2).

5 Conclusion

A deep temporal architecture and a set of training strategies are proposed for the AVSR task in this paper. Once the model has been trained, it is simple to perform AVSR in practice. Our method considers modal fusion, temporal modal fusion and temporal fusion; such fusion enhances the robustness and the ability to model sequences. The experimental results suggest that the deep temporal architecture achieves better AVSR and cross-modality speech recognition results on the three datasets. Besides, our training strategies efficiently weaken the overfitting, as the experiments show.