
1 Introduction

The emergence of smart devices is opening doors to a range of applications such as the e-lodgment of service requests, e-transfer of payments, and e-banking. The viability of such applications, however, requires a robust and error-free biometric system to identify the end users. This is a challenging task due to the significant session variability that may be contained in the captured data [1, 2]. Recently, audio-visual person recognition on mobile phones has gained significant attention. For example, two evaluation competitions were organized in 2013 for speaker [3] and face [4] recognition on the MOBIO [5] dataset. In those competitions, state-of-the-art face and speaker recognition techniques were evaluated. A majority of the speaker recognition systems in [3] used Total Variability Modeling (TVM) [6] to learn a total variability matrix (T), which was then used to extract i-vectors. Recently, in [7, 8], TVM was used for both speaker and face verification to achieve some of the top performing results. This motivates the use of i-vectors as features for audio-visual identification.

Fig. 1. Right: a deep belief network; left: a deep Boltzmann machine with undirected edges between the layers.

In addition, learning high-level representations using deep architectures with multiple layers of non-linear information processing has recently gained popularity in the areas of image, audio and speech processing [9, 10]. For example, as shown in the right panel of Fig. 1, a Deep Belief Network (DBN) is a generative architecture built by stacking multiple layers of restricted Boltzmann machines (RBMs) [11]. A DBN can be converted into a discriminative network, referred to as a DBN-DNN in [9], by adding a top label layer and applying the standard back-propagation algorithm. While DBN-DNNs have been used extensively for speech recognition [9], they have also been applied to speaker recognition [12]. In [13], a DBN was used as a pseudo-i-vector extractor, followed by Probabilistic Linear Discriminant Analysis (PLDA) [14, 15] for classification. Such greedy layer-wise learning of DBNs, however, limits the network to a single bottom-up pass. It also ignores top-down influence during inference, which may lead to failures in modeling variability in the case of ambiguous inputs. This motivates the use of Deep Boltzmann Machines (DBMs) as an alternative to DBNs.

A DBM is a variant of the Boltzmann machine that not only retains the multi-layer architecture but also incorporates top-down feedback (Fig. 1, left panel). Hence, a DBM has the potential to learn complex internal representations and to deal more robustly with ambiguous inputs (e.g., image or speech) [16]. Similar to the DBN-DNN, a DBM can be converted into a discriminative network, referred to as a DBM-DNN [17]. Although DBMs have been used for a variety of classification tasks (e.g., handwritten digit and object recognition [16], query detection [18], phone recognition [17], and multi-modal learning [19]), multi-modal person identification using DBMs has not been well studied. In this paper, we propose to use DBM-DNNs for i-vector based audio-visual person identification on mobile phone data (details in Sect. 4). As opposed to the DBM-DNN in [17] (used for speech recognition), the DBM-DNNs presented in this paper do not use the hidden representations as additional input. Rather than using DBMs for learning hierarchical representations, we use them to learn a set of initial parameters (weights and biases) for the DNNs.

In summary, our contributions in this paper are as follows: (a) We use DBM-DNNs for i-vector based audio-visual person identification. To the best of our knowledge, this is the first application of DBM-DNNs with i-vectors as inputs. (b) We show that a higher accuracy can be achieved with the DBM-DNN compared to the cosine distance classifier [6] commonly used in the literature to evaluate i-vector based systems (see Fig. 2), as well as the state-of-the-art DBN-DNN. (c) We study three configurations of the DBM-DNN. Our experimental results show that two hidden layers of 800 units each achieved the best accuracy with 400 dimensional i-vectors.

2 Background

In this section, we briefly present the theoretical background of DBMs and the i-vector extraction using TVM.

2.1 Deep Boltzmann Machines

A deep Boltzmann machine is formed by stacking multiple layers of Boltzmann machines, as shown in the left panel of Fig. 1. In a DBM, each layer captures higher-order correlations between the activities of the hidden units in the layer below. In [16], some key aspects of DBMs were highlighted: (i) the potential to learn complex internal representations, (ii) the ability to build high-level representations from a large supply of unlabeled data, with only a small amount of labeled data needed to slightly fine-tune the model, and (iii) more robust handling of ambiguous inputs (e.g., image and speech). Therefore, DBMs are considered a promising tool for solving object and speech/speaker recognition problems [20, 21].

Consider a two-layer DBM with no within-layer connections, Gaussian visible units (e.g., speech, image) \(v \in \mathbb {R}^D \) and binary hidden units \(h \in \{0,1\}^P\). The energy of the state \(\{v,h^1,h^2\}\) can then be defined as:

$$\begin{aligned} E(v,h^1,h^2|\theta ) = \sum _{i=1}^{D} \frac{(v_i - b_i)^2}{2\sigma _i^2} - \sum _{i=1}^{D} \sum _{j=1}^{P_1} \frac{v_i }{\sigma _i^2}h_j^1W_{ij}^1 - \sum _{n=1}^{2} \sum _{j=1}^{P_n}c_j^n h_j^n - \sum _{j=1}^{P_1}\sum _{k=1}^{P_{2}} h_j^1 h_k^2 W_{jk}^2, \end{aligned}$$
(1)

where b and \(c^n\) represent the biases of the visible and n-th hidden layer, respectively; \(\sigma _i\) is the standard deviation of the i-th visible unit; \(W^n\) represents the synaptic connection weights between the n-th hidden layer and the previous layer; and D and \(P_n\) represent the number of units in the visible layer and in the n-th hidden layer, respectively. Here, \(\theta =\{b,c^1,c^2,W^1,W^2\}\) represents the set of parameters.
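To make Eq. (1) concrete, the following is a minimal NumPy sketch of the energy computation. All variable names are illustrative; the experiments in this paper were run with the Deepmat toolbox, not with this code.

```python
# Energy of a two-layer Gaussian-Bernoulli DBM, as in Eq. (1).
# Illustrative sketch only; names are ours, not from the paper's code.
import numpy as np

def dbm_energy(v, h1, h2, b, c1, c2, W1, W2, sigma):
    """E(v, h1, h2 | theta) for a DBM with Gaussian visible units.

    v        : (D,)   real-valued visible vector (e.g., an i-vector)
    h1, h2   : (P1,), (P2,) binary hidden layers
    b, sigma : (D,)   visible biases and standard deviations
    c1, c2   : (P1,), (P2,) hidden biases
    W1       : (D, P1) visible-to-hidden weights
    W2       : (P1, P2) hidden-to-hidden weights
    """
    quad = np.sum((v - b) ** 2 / (2.0 * sigma ** 2))  # quadratic visible term
    vis_hid = (v / sigma ** 2) @ W1 @ h1              # visible-hidden interaction
    bias = c1 @ h1 + c2 @ h2                          # hidden bias terms
    hid_hid = h1 @ W2 @ h2                            # layer-1/layer-2 interaction
    return quad - vis_hid - bias - hid_hid
```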

DBMs can be trained with the stochastic maximization of the log-likelihood function. The partial-derivative of the log-likelihood function is:

$$\begin{aligned} \frac{\partial \mathcal {L}(\theta |v)}{\partial \theta } = \bigg \langle \frac{\partial E(v,h|\theta ) }{\partial \theta }\bigg \rangle _{model} -\bigg \langle \frac{\partial E(v_{(t)},h|\theta ) }{\partial \theta }\bigg \rangle _{data}, \end{aligned}$$
(2)

where \(\langle .\rangle _{data}\) and \(\langle .\rangle _{model}\) denote the expectations over the data distribution \(P(h|\{v_{(t)}\},\theta )\) and the model distribution \(P(v,h | \theta )\), respectively. The training set \(\{v_{(t)}\}_{t = 1,\ldots ,T}\) contains T samples. Although the update rules are well defined, they are intractable to compute exactly. A variational approximation is commonly used to compute the expectation over the data distribution, and various persistent sampling methods (e.g., [16, 22, 23]) are used to compute the expectation over the model distribution. A greedy layer-wise approach [16] or a two-stage pre-training algorithm [24] can be used to initialize the parameters of a DBM.
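As an illustration of how Eq. (2) is used in practice, here is a minimal single-sample sketch of one gradient-ascent step combining a mean-field approximation for the data term with one sweep of persistent Gibbs sampling (PCD) for the model term. For brevity it assumes binary visible units (the model above uses Gaussian visibles), and all names are our own.

```python
# One stochastic gradient step for a two-layer binary DBM (Eq. (2)):
# mean-field for <.>_data, one persistent Gibbs sweep for <.>_model.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbm_grad_step(v, chain, W1, W2, b, c1, c2, lr=1e-3, mf_iters=10):
    # --- data term: mean-field posterior q(h1, h2 | v) ---
    mu2 = sigmoid(c2)                                  # crude initialization
    for _ in range(mf_iters):                          # fixed-point updates
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T + c1)
        mu2 = sigmoid(mu1 @ W2 + c2)

    # --- model term: one Gibbs sweep on the persistent chain ---
    vs, h1s, h2s = chain
    h1s = (np.random.rand(*h1s.shape) < sigmoid(vs @ W1 + h2s @ W2.T + c1)) * 1.0
    h2s = (np.random.rand(*h2s.shape) < sigmoid(h1s @ W2 + c2)) * 1.0
    vs = (np.random.rand(*vs.shape) < sigmoid(h1s @ W1.T + b)) * 1.0

    # --- ascend the log-likelihood: data minus model expectations ---
    W1 += lr * (np.outer(v, mu1) - np.outer(vs, h1s))
    W2 += lr * (np.outer(mu1, mu2) - np.outer(h1s, h2s))
    b += lr * (v - vs)
    c1 += lr * (mu1 - h1s)
    c2 += lr * (mu2 - h2s)
    return (vs, h1s, h2s)                              # updated chain state
```

In practice the updates are averaged over mini-batches and several persistent chains, as described in Sect. 5.1.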

Fig. 2. Steps followed by the cosine distance classifier and the DBM-DNNs to obtain the matching score matrices S\(_\text {cosine}\) and S\(_\text {DBM-DNN}\), respectively.

2.2 Total Variability Modeling (TVM)

Inter-session variability (ISV) modeling [25] and joint factor analysis (JFA) [26] are two session variability modeling techniques widely used for session compensation. Total Variability Modeling (TVM) [6] overcomes the high-dimensionality issue of ISV and JFA. In TVM, each sample in the training set is treated as if it comes from a distinct subject. TVM uses factor analysis as a front-end processing step to extract low-dimensional i-vectors. The TVM training process assumes that the j-th sample of subject i can be represented by the Gaussian Mixture Model (GMM) mean super-vector

$$\begin{aligned} \mu _{i,j} = m + Tw_{i,j} \end{aligned}$$
(3)

where m is the speaker- and session-independent mean super-vector obtained from a Universal Background Model (UBM), T is the low-rank total variability matrix, and \(w_{i,j}\) is the i-vector representation.
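Given Eq. (3), the i-vector of a sample is the posterior mean of \(w_{i,j}\), computed from the sample's Baum-Welch statistics against the UBM. Below is a minimal sketch of this standard closed-form computation; the paper itself uses the MSR Identity Toolbox, and the array layout here is our own assumption.

```python
# Closed-form i-vector extraction: w = (I + T' S^-1 N T)^-1 T' S^-1 F.
# Illustrative sketch; the paper's experiments use the MSR Identity Toolbox.
import numpy as np

def extract_ivector(T, Sigma_inv, N, F):
    """
    T         : (C*D, R) total variability matrix (R = 400 here)
    Sigma_inv : (C*D,)   inverse of the stacked diagonal UBM covariances
    N         : (C*D,)   zeroth-order stats (each mixture count repeated D times)
    F         : (C*D,)   first-order stats, centered on the UBM means m
    Returns the posterior mean of w, i.e. the raw i-vector.
    """
    R = T.shape[1]
    TtS = T.T * Sigma_inv                   # (R, C*D): T' Sigma^{-1}
    L = np.eye(R) + (TtS * N) @ T           # posterior precision matrix
    return np.linalg.solve(L, TtS @ F)      # posterior mean = i-vector
```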

The factor analysis process in TVM thus extracts a low-dimensional representation of each sample, known as an i-vector. An i-vector in its raw form captures the subject-specific information needed for discrimination as well as detrimental session variability. Hence, session compensation (e.g., whitening and i-vector length normalization) and scoring (e.g., PLDA or cosine distance) are performed as separate processes (see Fig. 2).
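A minimal sketch of these two compensation steps and of cosine scoring is given below; the exact whitening transform is our assumption, since the paper does not spell out these implementation details.

```python
# Whitening, length normalization and cosine scoring of raw i-vectors.
# The whitening transform here (Cholesky factor of the inverse
# background covariance) is an assumed, common choice.
import numpy as np

def whiten(W_bg, w):
    # W_bg: (N, R) background i-vectors; w: (R,) raw i-vector
    mu = W_bg.mean(axis=0)
    L = np.linalg.cholesky(np.linalg.inv(np.cov(W_bg, rowvar=False)))
    return L.T @ (w - mu)

def length_normalize(w):
    return w / np.linalg.norm(w)

def cosine_score(w_test, w_enrolled):
    # with both vectors whitened and length-normalized, the dot
    # product equals the cosine similarity
    return float(w_test @ w_enrolled)
```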

Fig. 3. Left: DBM\(_\text {face}\) architecture; right: DBM\(_\text {speech}\) architecture. We use 400 dimensional raw i-vectors as inputs to the DBMs.

3 DBM-DNN Classification

In this section, we present the DBM-DNNs for audio-visual person identification. We train two DBMs (DBM\(_\text {speech}\) and DBM\(_\text {face}\)), as shown in Fig. 3, in an unsupervised fashion using raw i-vectors extracted from the unlabeled samples of the background subjects. The steps followed in our proposed framework are: (i) DBM pre-training, (ii) DBM fine-tuning, (iii) discriminative training of the DBM-DNNs for classification, and (iv) fusion.

In the first step, we use the two-stage (Stage 1 and 2) pre-training algorithm presented in [24]. In Stage 1, the even-numbered layers of a DBM are trained as RBMs stacked on top of each other, as is common practice when training a DBN. In Stage 2, a model that predicts the variational parameters given the visible vector is trained; this is done by learning a joint distribution over the visible and hidden vectors using an RBM. In the second step, the initial set of parameters is fine-tuned using a layer-by-layer approach similar to that in [16], except that the visible units at the bottom layer and the hidden units at the top layer are not repeated. This allows the DBMs to adjust the parameters of all the layers (both even- and odd-numbered) in one go.

In the third step, we use the learned DBM parameters to initialize deep neural networks (DBM-DNNs) with a top label layer. The top label layer of a DBM-DNN has as many units as the number of enrolled subjects. The bottom layers have exactly the same architecture as their corresponding DBM. Here, the connection weights between the top layer and the one immediately below it are randomly initialized. After initialization, the networks are discriminatively fine-tuned using a small set of labeled training data and the standard back-propagation algorithm. Finally, in the fourth step, we combine the outputs of the DBM-DNNs using sum fusion, which for an identity j is given by:

$$\begin{aligned} f_j = \sum _{m} p_m(v_m,j) \end{aligned}$$
(4)

where m is the modality index (in our case, \(m=1\) represents DBM-DNN\(_{\text {speech}}\) and \(m=2\) represents DBM-DNN\(_{\text {face}}\)) and \(p_m(v_m,j)\) represents the probability of the input \(v_m\) belonging to person j (i.e., the value assigned by the j-th output node of the DBM-DNN for \(v_m\)). For a given set of observation vectors \(o = \{v_1,v_2\}\), a decision is made in favor of the j-th identity if \(f_j\) is the maximum in the fused score vector \(f = [f_1, f_2, \ldots , f_N]\), where N is the number of target subjects.
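The fusion and decision steps amount to only a few lines; the sketch below assumes each DBM-DNN outputs a softmax posterior vector over the N enrolled subjects.

```python
# Sum-rule fusion (Eq. (4)) and the final identification decision.
import numpy as np

def identify(p_speech, p_face):
    """p_speech, p_face: (N,) output posteriors of DBM-DNN_speech and
    DBM-DNN_face for one audio-visual observation o = {v1, v2}."""
    f = p_speech + p_face       # equally weighted sum fusion, Eq. (4)
    return int(np.argmax(f))    # index j of the identified subject
```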

Fig. 4. DBM-DNN\(_\text {speech}\) (left) and DBM-DNN\(_\text {face}\) (right) are initialized with the generative weights of DBM\(_\text {speech}\) and DBM\(_\text {face}\), respectively, and discriminatively fine-tuned using the standard back-propagation algorithm. The output scores are fused using the sum rule.

4 Database and Features

In our experiments, we used the MOBIO dataset, a collection of videos with speech captured using mobile devices. There are videos from 150 subjects (50 females and 100 males) captured in 12 different sessions over a one-and-a-half-year period. Each session contains 11–21 videos with significant pose and illumination variations (see Fig. 5) as well as different environmental noise. We divided the subjects into two sets: (a) background (50 subjects: 37 males and 13 females picked at random to retain the gender ratio of the dataset) and (b) target (the remaining 100 subjects). This split was repeated 10 times to ensure the experimental results were not based on a single held-out set of data; therefore, each experimental result presented in this paper is the mean of 10 evaluations. In each evaluation, we used the audio and visual data from the background subjects for: (a) building a UBM and learning the total variability matrix, and (b) the unsupervised training of a DBM. We picked 5 samples each from a set of 6 randomly selected sessions (out of 12) from all the target subjects as the training data. Similarly, we used 5 samples each from the remaining 6 sessions as the test data.

4.1 Speech Features

We used the speech enhancement and voice activity detection algorithms in the VOICEBOX toolbox [27] to preprocess the speech signals. Frames were then extracted from each silence-removed speech signal with a window size of 20 ms and a frame shift of 10 ms. Next, 12 cepstral coefficients were derived and augmented with the log energy, forming a 13-dimensional static feature vector. The delta and acceleration coefficients were appended to form the final 13 static + 13 delta + 13 acceleration = 39-dimensional mel-frequency cepstral coefficient (MFCC) feature vector per frame.
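For illustration, an equivalent 39-dimensional feature matrix can be computed with librosa (the paper uses the VOICEBOX toolbox instead, and its 12 cepstra + log energy are only approximated here by librosa's first 13 MFCCs):

```python
# Approximate reproduction of the 39-dimensional MFCC features using
# librosa; "sample.wav" is a hypothetical silence-removed recording.
import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=None)
n_fft = int(0.020 * sr)                        # 20 ms analysis window
hop = int(0.010 * sr)                          # 10 ms frame shift
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop)
delta = librosa.feature.delta(mfcc)            # delta coefficients
delta2 = librosa.feature.delta(mfcc, order=2)  # acceleration coefficients
features = np.vstack([mfcc, delta, delta2])    # (39, n_frames)
```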

4.2 Visual Features

Each image is rotated, scaled and cropped to a size of 64\(\times \)80 pixels, such that the eyes are 16 pixels from the top and separated by 33 pixels. Each cropped image is then photometrically normalized using the Tan-Triggs algorithm [28]. We extracted 12\(\times \)12 pixel blocks from each preprocessed image using exhaustive overlap, which yields 3657 blocks per image. Then, the 44 lowest-frequency 2D discrete cosine transform (2D-DCT) coefficients [29], excluding the zero-frequency coefficient, were extracted from each normalized (zero mean and unit variance) image block. The resulting 2D-DCT feature vectors were also normalized to zero mean and unit variance in each dimension with respect to the other feature vectors of the image.
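A minimal sketch of this block-based 2D-DCT extraction is shown below. We assume "lowest frequency" refers to zig-zag order; for an 80\(\times \)64 crop, exhaustive overlap yields (80 - 12 + 1) \(\times \) (64 - 12 + 1) = 69 \(\times \) 53 = 3657 blocks, matching the count above.

```python
# Block-based 2D-DCT features: overlapping 12x12 blocks, per-block
# normalization, and the 44 lowest-frequency coefficients (zig-zag
# order, DC term dropped). Zig-zag ordering is our assumption.
import numpy as np
from scipy.fftpack import dct

def zigzag_indices(n):
    # order (row, col) pairs by anti-diagonal, i.e. by spatial frequency
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1], rc[0]))

def block_dct_features(img, block=12, n_coeffs=44):
    idx = zigzag_indices(block)[1:n_coeffs + 1]    # skip the DC coefficient
    feats = []
    for r in range(img.shape[0] - block + 1):      # exhaustive overlap
        for c in range(img.shape[1] - block + 1):
            patch = img[r:r + block, c:c + block].astype(float)
            patch = (patch - patch.mean()) / (patch.std() + 1e-8)
            d = dct(dct(patch, axis=0, norm='ortho'), axis=1, norm='ortho')
            feats.append([d[i, j] for i, j in idx])
    return np.asarray(feats)                       # (3657, 44) for 80x64 input
```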

Fig. 5. Top row: appearance, background and illumination variations in video frames from different sessions of the MOBIO database; bottom row: faces detected in the video frames.

5 Results and Analysis

In this section, we present the implementation details, experimental results and analysis. We carried out our experiments on the MOBIO dataset and compared the performance of the DBM-DNNs against the DBN-DNN and the cosine distance classifier.

5.1 Implementation

We evaluate the identification accuracy of the DBM-DNNs and compare it with state-of-the-art classifiers, namely the cosine distance classifier [6] and the DBN-DNN. In our experiments, i-vectors were extracted using the MSR Identity Toolbox [30]. We learned 512-mixture gender-independent UBMs for each modality. The rank of the TVM subspace was set to 400 (a value commonly used in the literature) and five iterations of total variability modeling were carried out. The DBM-DNNs and DBN-DNNs presented in this paper were implemented using the Deepmat toolbox [31] with the same set of learning parameters. The input data was subdivided into mini-batches, and the connection weights between the units of two layers were updated after each mini-batch. During the pre-training phase, the parameters of each RBM were learned using contrastive divergence (CD) with the adaptive learning rate and enhanced gradient techniques of [32]. The learning rates for Stages 1 and 2 were set to 0.05 and 0.01, respectively. Then, persistent contrastive divergence (PCD) and the enhanced gradient were used to fine-tune the DBM parameters with a learning rate of 0.001. In each step, the model was trained for 50 epochs with a mini-batch size of 100.
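As a reference for the pre-training step, the following is a minimal CD-1 mini-batch update for a binary RBM, i.e. the basic step behind the procedure above but without the adaptive learning rate and enhanced gradient refinements of [32]:

```python
# One CD-1 mini-batch update for a binary RBM (basic form, without the
# enhanced-gradient / adaptive-learning-rate techniques of [32]).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_cd1_update(V, W, b, c, lr=0.05):
    """V: (batch, D) mini-batch; W: (D, P); b: (D,); c: (P,)."""
    ph = sigmoid(V @ W + c)                      # positive phase
    h = (np.random.rand(*ph.shape) < ph) * 1.0   # sample hidden states
    pv = sigmoid(h @ W.T + b)                    # one-step reconstruction
    ph2 = sigmoid(pv @ W + c)                    # negative phase
    n = V.shape[0]
    W += lr * (V.T @ ph - pv.T @ ph2) / n
    b += lr * (V - pv).mean(axis=0)
    c += lr * (ph - ph2).mean(axis=0)
```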

Table 1. Rank-1 identification rate
Fig. 6. Cumulative match characteristics (CMC) curves for the speech modality (left), the face modality (middle) and their fusion (right).

We evaluated the performance of the cosine distance classifier, the DBN-DNN (two hidden layers with 400 units each) and three configurations of the DBM-DNN. Their rank-1 identification rates are reported in Table 1. The overall rank-1 identification rate obtained using the cosine distance classifier is 0.946, which is significantly better than the identification rates of the individual modalities (0.775 for speech and 0.733 for face). The results in Table 1 show that the deep learning methods significantly improved the identification accuracy. We carried out our experiments using DBM-DNNs and DBN-DNNs with two hidden layers, and used three configurations for the DBM-DNNs: 400-400, 800-800, and 1200-1200, where the numbers denote the units in each hidden layer. Our experimental results in Table 1 show that the DBM-DNNs with 800 units in each hidden layer performed better, in terms of both individual and overall identification rates, than those with 400 or 1200 units.

In Fig. 6, we also report the Cumulative Match Characteristics (CMC) curves for the individual modalities and their fusion performed using the equally weighted sum rule. The curves shown correspond to the best DBM-DNN configuration (800-800) from Table 1. It can be seen that the DBM-DNN consistently outperformed the cosine distance classifier and the DBN-DNN. This is because the posteriors obtained using the DBM-DNNs discriminate between the target subjects better than those of the DBN-DNNs and the scores of the cosine distance classifier.
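For reference, the rank-1 rates in Table 1 and the CMC curves in Fig. 6 can be computed from a score matrix as sketched below, where each row of S holds the fused (or single-modality) scores of one test sample against all enrolled subjects:

```python
# CMC curve from a score matrix S (n_tests, n_subjects); labels[i] is
# the column index of the true identity of test sample i.
import numpy as np

def cmc(S, labels):
    ranks = []
    for scores, true_id in zip(S, labels):
        order = np.argsort(scores)[::-1]             # best match first
        ranks.append(int(np.where(order == true_id)[0][0]))
    ranks = np.asarray(ranks)
    # curve[k] = fraction of tests identified within the top (k+1)
    # matches; the rank-1 identification rate is curve[0]
    return np.array([(ranks <= k).mean() for k in range(S.shape[1])])
```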

6 Conclusion

In this paper, we presented the first use of DBM-DNNs for i-vector based audio-visual person identification. We compared the performance of the DBM-DNNs with the state-of-the-art DBN-DNN and the cosine distance classifier commonly used in the literature to evaluate i-vector based systems. Our experiments were carried out on the challenging MOBIO dataset. Experimental results show that the DBM-DNNs achieved higher accuracies than the DBN-DNN and the cosine distance classifier. We also studied three different configurations of the DBM-DNN; a network with two hidden layers of 800 units each performed best when presented with 400-dimensional i-vectors. The fact that DBMs incorporate top-down feedback in the learning process enables them to learn a good generative model of the underlying data. Studying the performance of DBM-DNNs and DBN-DNNs under various environmental noise conditions remains an interesting direction for future work.