1 Introduction

Many real-life events are inherently multimodal, with each modality containing information useful for detecting or recognizing the event. Despite this, a significant amount of work focuses on modeling and recognizing events using a single modality, neglecting other sources of information even when they are available. While this may be sufficient for certain problems, it is inadequate when the events to be detected are complex and subtle. Humans combine cues from multiple modalities to reason about specific events. Therefore, when multiple information-rich modalities are present, it becomes important to jointly interpret and reason about the information from each of them. While jointly modeling multiple modalities, the temporal structure within and across modalities also needs to be accounted for. Following the human cognitive system, we propose to address multimodal fusion with a biologically inspired model, namely Conditional Restricted Boltzmann Machines (CRBMs) (Taylor et al. 2011).

Discriminative models focus on maximizing the separation between classes; however, they are often uninterpretable or require heavy reverse engineering (Zeiler and Fergus 2014). On the other hand, generative models focus solely on modeling distributions and are often unable to incorporate higher-level knowledge. Hybrid models address these problems by combining the advantages of discriminative and generative models: they encode higher-level knowledge while modeling the distribution from a discriminative perspective. We propose a novel hybrid model that allows us to recognize classes, correlate features, and generate input data.

CRBMs are non-linear generative models for time-series data. They use an undirected model with binary latent variables connected to a number of visible variables. A CRBM-based generative model enables modeling short-term multimodal phenomena and also allows us to deal with missing data by generating it within or across modalities. To acquire the benefits of a discriminative classifier, we propose a hybrid model that enhances the CRBM with a discriminative component based on the work of Larochelle and Bengio (2008), leading to superior classification performance while still modeling temporal dynamics. We evaluate our approach on multiple audio-visual datasets and show that our results are comparable or superior to state-of-the-art approaches.

Our Contributions We extend our initial work (Amer et al. 2014) and propose a new general hybrid model. Compared to Amer et al. (2014), we contribute the following:

  • We propose a new jointly trained hybrid model that combines the advantages of temporal generative and discriminative models, forming an extensible, formal multimodal fusion framework for classifying multimodal data.

  • We evaluate our model on two realistic datasets and are the first to report generation accuracy on both of them, since prior work used only discriminative models.

  • We extensively evaluate how additional training data and additional model parameters affect our performance.

Paper Organization In Sect. 2 we discuss prior work. In Sect. 3 we give a brief background of similar models that motivate our approach, followed by a description of our hybrid model. In Sect. 4 we describe the inference algorithm. In Sect. 5 we specify our learning algorithm. In Sect. 6 we show quantitative results of our approach, followed by the conclusion in Sect. 7.

2 Prior Work

In this section we first review literature on multimodal fusion, then hybrid models, and finally temporal, energy-based deep learning.

Multimodal Fusion Deep networks have been used for multimodal fusion: Srivastava and Salakhutdinov (2012) fuse tags and images, while Ngiam et al. (2011) fuse audio spectrograms and images. These models are generative; however, they operate on static data. Other work focused on temporal (audio-video) datasets such as the AVEC dataset (Glodek et al. 2011). Representative work on AVEC includes generative models such as Hidden Markov Model (HMM) based methods (Glodek et al. 2011) and discriminative models such as Conditional Random Fields (CRFs) (Ramirez et al. 2011). Each of these approaches lacks the advantages of the other, whereas hybrid models learn a joint representation that combines the benefits of both discriminative and generative models. Recently a new challenging temporal dataset, ChaLearn (Escalera et al. 2014), was released for evaluating multimodal gesture recognition. The most successful algorithm in the challenge was a discriminative deep learning model based on Convolutional Neural Networks (CNNs) (Neverova et al. 2014). While this approach achieves the best results (even compared to our approach), it ignores the temporal aspect of the data and models each modality with a heavily engineered, modality-specific architecture. Our experiments show that jointly modeling modalities using a generative deep learning architecture substantially improves classification over Amer et al. (2014), achieves results comparable to Neverova et al. (2014) without architecture engineering, and additionally provides generation, which their approach does not explore. In this paper, we focus on modeling multimodal data using a hybrid model.

Hybrid Models These models consist of a generative component, which usually learns a feature representation from low-level input, and a discriminative component for higher-level reasoning. Recent work has empirically shown that generative models which learn a rich feature representation tend to outperform discriminative models that rely solely on hand-crafted features (Perina et al. 2012). Hybrid models can be divided into three groups: joint methods (Larochelle and Bengio 2008; Druck and McCallum 2010), iterative methods (Sminchisescu et al. 2006; Fujino et al. 2008), and staged methods (Li et al. 2011; Ranzato et al. 2011; Perina et al. 2012). Joint methods optimize a single objective function consisting of both the generative and discriminative components used to learn a joint representation; they are usually learned using methods such as variational learning. Iterative methods, similar to joint methods, learn a shared representation layer using an iterative learning approach, such as Expectation Maximization, where the representations are updated using updates from both the discriminative and the generative components. Staged methods differ from joint and iterative methods in that the generative and discriminative components are trained separately in stages: generative representations are learned in an unsupervised manner, followed by discriminative components learned with supervision using the generative representations as new input.

Representation Learning Deep learning has been successfully applied to many problems (Bengio 2009). Restricted Boltzmann Machines (RBMs) form the building blocks of energy-based deep networks (Hinton et al. 2006; Salakhutdinov and Hinton 2006). In Hinton et al. (2006) and Salakhutdinov and Hinton (2006), the networks are trained using the contrastive divergence (CD) algorithm (Hinton 2002), which demonstrated the ability of deep networks to efficiently capture the distributions over the features and to learn complex representations. RBMs can be stacked together to form deeper networks known as Deep Boltzmann Machines (DBMs), which capture more complex representations. Recently, temporal models based on deep networks have been proposed, capable of modeling a temporally richer set of problems. These include Conditional RBMs (CRBMs) (Taylor et al. 2011) and Temporal RBMs (TRBMs) (Sutskever and Hinton 2007; Sutskever et al. 2008; Hausler and Susemihl 2012). CRBMs have been successfully used in both visual and audio domains: they have been used for modeling human motion (Taylor et al. 2011), tracking 3D human pose (Taylor et al. 2010), and phone recognition (Mohamed and Hinton 2009). TRBMs have been applied to transferring 2D and 3D point clouds (Zeiler et al. 2011), transition-based dependency parsing (Garg and Henderson 2011), and polyphonic music generation (Lewandowski et al. 2012).

Fig. 1

This figure illustrates the progression of models described in Sect. 3. a RBM, b CRBM, and c MMCRBM are generative models that can be trained in an unsupervised manner. d DRBM, e DCRBM, and f MMDCRBM are their discriminative counterparts, trained in a supervised manner. The extension from the left column to the right column lies in adding a discriminative component (Larochelle and Bengio 2008) to the generative models. The extension across the rows is a progression from static models, to dynamic models, to multimodal dynamic models

3 Model

Using a hybrid model allows us to combine the benefits of generative models, such as the ability to fill in missing data, with the benefits of a discriminative model, leading to a stronger classifier than purely generative models.

Rather than immediately defining our MMDCRBM model, we discuss a sequence of models, gradually increasing in complexity, so that the different components of our hybrid model can be understood in isolation. We start with the basic RBM model (Sect. 3.1), then we extend the RBM to the temporal CRBM model (Sect. 3.2), then we extend the CRBM to the multimodal MMCRBM model (Sect. 3.3). Then we make each of those three models discriminative: DRBM (Sect. 3.4), DCRBM (Sect. 3.5), and finally MMDCRBM (Sect. 3.6).

3.1 Restricted Boltzmann Machines (RBMs)

RBMs (Salakhutdinov and Hinton 2006), shown in Fig. 1a, define a probability distribution \(p_{\text {R}}\) as a Gibbs distribution (1), where \(\mathbf{v}\) is a vector of visible nodes, \(\mathbf{h}\) is a vector of hidden nodes, \(E_{\text {R}}\) is the energy function, and Z is the partition function. The parameters \({\varvec{\theta }}\) to be learned are the biases \(\mathbf{a}\) and \(\mathbf{b}\) for \(\mathbf{v}\) and \(\mathbf{h}\) respectively and the weights \({ W}\). The RBM is fully connected between layers, with no lateral connections. This architecture implies that \(\mathbf{v}\) and \(\mathbf{h}\) are each factorial given the other vector, which allows for the exact computation of \(p_{\text {R}}(\mathbf{v}|\mathbf{h})\) and \(p_{\text {R}}(\mathbf{h}|\mathbf{v})\).

$$\begin{aligned} \begin{array}{rcl} p_{\text {R}}(\mathbf{v},\mathbf{h})&{}=&{}\frac{\exp [-E_{\text {R}}(\mathbf{v},\mathbf{h})]}{Z({\varvec{\theta }})},\\ \\ Z({\varvec{\theta }})&{}=&{}\sum _{\mathbf{v},\mathbf{h}}\exp [-E_{\text {R}}(\mathbf{v},\mathbf{h})],\\ \\ {\varvec{\theta }}&{}=&{} \Bigg [ \begin{matrix} \{\mathbf{a},\mathbf{b}\}&{}\text {-bias},\\ \{{ W}\}&{}\text {-fully connected}.\\ \end{matrix} \Bigg ] \end{array} \end{aligned}$$
(1)

For binary-valued data, \(v_i\) is defined using a logistic function. For real-valued data, \(v_i\) is defined as a multivariate Gaussian with unit covariance. The binary-valued hidden layer \(h_j\) is defined using a logistic function such that the hidden layer becomes sparse (Taylor et al. 2011; Sutskever and Hinton 2007). The probability distributions over \(\mathbf{v}\) and \(\mathbf{h}\) are defined in (2).

$$\begin{aligned} \begin{array}{rcll} p_{\text {R}}(v_{i} = 1 |\mathbf{h})&{}=&{}\sigma (a_{i}+\sum _{j} h_{j} w_{ij}),\quad &{} \text {Binary,}\\ \\ p_{\text {R}}(v_{i}|\mathbf{h})&{}=&{}\mathcal {N}(a_{i}+\sum _{j} h_{j}w_{ij},1),\quad &{} \text {Real,}\\ \\ p_{\text {R}}(h_{j} = 1 |\mathbf{v})&{}=&{}\sigma (b_{j}+\sum _{i} v_{i} w_{ij}),\quad &{} \text {Binary.} \end{array} \end{aligned}$$
(2)

The energy function \(E_{\text {R}}\) for binary \(v_i\) is defined as in (3).

$$\begin{aligned} E_{\text {R}}(\mathbf{v},\mathbf{h})=-\sum _{i} a_{i} v_{i}- \sum _{j} b_{j} h_{j}- \sum _{i,j} v_{i}w_{ij} h_{j}, \end{aligned}$$
(3)

while the energy function \(E_{\text {R}}\) is slightly modified to allow for real-valued \(\mathbf{v}\), as shown in (4).

$$\begin{aligned} E_{\text {R}}(\mathbf{v},\mathbf{h})=-\sum _{i} \frac{(a_{i}-v_{i})^2}{2} - \sum _{j} b_{j} h_{j}- \sum _{i,j} v_{i}w_{ij} h_{j} \end{aligned}$$
(4)
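To make the notation concrete, here is a minimal NumPy sketch of the conditionals in (2) and the Gaussian-visible energy in (4). The layer sizes, zero biases, and random weight initialization are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the RBM conditionals (2) and Gaussian-visible energy (4).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_v, n_h = 8, 5                      # visible / hidden sizes (arbitrary)
a = np.zeros(n_v)                    # visible biases
b = np.zeros(n_h)                    # hidden biases
W = 0.01 * rng.standard_normal((n_v, n_h))

def p_h_given_v(v):
    # p(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij)
    return sigmoid(b + v @ W)

def mean_v_given_h(h):
    # Real-valued visibles: v_i | h ~ N(a_i + sum_j h_j w_ij, 1)
    return a + W @ h

def energy(v, h):
    # E_R(v, h) for Gaussian visibles, eq. (4)
    return 0.5 * np.sum((a - v) ** 2) - b @ h - v @ W @ h

v = rng.standard_normal(n_v)
h = (p_h_given_v(v) > rng.random(n_h)).astype(float)   # one Gibbs half-step
print(energy(v, h), mean_v_given_h(h))
```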

3.2 Conditional Restricted Boltzmann Machines (CRBMs)

CRBMs (Taylor et al. 2011), shown in Fig. 1b, are a natural extension of RBMs for modeling short-term temporal dependencies. A CRBM is an RBM which, at time t, takes into account history from the previous time instances \(t-N,\ldots ,t-1\). This is done by treating the previous time instances as additional inputs, which does not complicate inference. Some approximations have been made to facilitate efficient training and inference; more details are available in Taylor et al. (2011). A CRBM defines a probability distribution \(p_{\text {C}}\) as a Gibbs distribution (5).

$$\begin{aligned} \begin{array}{c} p_{\text {C}}(\mathbf{v}_{t},\mathbf{h}_{t}|\mathbf{v}_{<t})=\frac{\exp [-E_{\text {C}}(\mathbf{v}_{t},\mathbf{h}_{t}|\mathbf{v}_{<t})]}{Z({\varvec{\theta }})},\\ \\ Z({\varvec{\theta }})=\sum _{\mathbf{v},\mathbf{h}}\exp [-E_{\text {C}}(\mathbf{v}_{t},\mathbf{h}_{t}|\mathbf{v}_{<t})]\\ \\ {\varvec{\theta }}= \left[ \begin{matrix} \{\mathbf{a},\mathbf{b}\}&{}\text {-bias},\\ \{{ A},{ B}\}&{}\text {-auto regressive},\\ \{{ W}\}&{}\text {-fully connected}.\\ \end{matrix} \right] \end{array} \end{aligned}$$
(5)

The visible vectors from the previous N time instances, denoted as \(\mathbf{v}_{<t}\), influence the current visible and hidden vectors. The probability distributions are defined in (6).

$$\begin{aligned} \begin{array}{c} p_{\text {C}}(v_{i}|\mathbf{h},\mathbf{v}_{<t})=\mathcal {N}(c_i+ \sum _{j} h_{j}w_{ij},1),\\ \\ p_{\text {C}}(h_{j} = 1 |\mathbf{v},\mathbf{v}_{<t})=\sigma (d_j + \sum _{i} v_{i} w_{ij}). \end{array} \end{aligned}$$
(6)

where,

$$\begin{aligned} \begin{array}{c} c_{i}= a_{i} + \sum _{p}A_{pi} v_{p,<t},\\ \\ d_{j}=b_{j} + \sum _{p}B_{pj} v_{p,<t}. \end{array} \end{aligned}$$
(7)

The new energy function \(E_{\text {C}}(\mathbf{v}_{t},\mathbf{h}_{t}|\mathbf{v}_{<t})\) in (8) is defined in a manner similar to that of the RBM (4).

$$\begin{aligned} \begin{array}{rcl} E_{\text {C}}(\mathbf{v}_{t},\mathbf{h}_{t}|\mathbf{v}_{<t})&{}=&{}-\sum _{i} (c_{i}-v_{i,t})^2/2 - \sum _{j} d_{j} h_{j,t}\\ &{}&{}- \sum _{i,j} v_{i,t} w_{ij} h_{j,t}, \end{array} \end{aligned}$$
(8)

Note that A and B are matrices of concatenated vectors of previous time instances of \(\mathbf{a}\) and \(\mathbf{b}\).
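As a concrete illustration, the following NumPy sketch computes the dynamic biases of (7) and the conditionals of (6), with the history stored as a flattened concatenation of the N previous visible frames. All sizes and initializations are illustrative assumptions.

```python
# Minimal sketch of the CRBM dynamic biases (7) and conditionals (6).
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_v, n_h, N = 8, 5, 3                           # visibles, hiddens, history length
a, b = np.zeros(n_v), np.zeros(n_h)
W = 0.01 * rng.standard_normal((n_v, n_h))      # visible-hidden weights
A = 0.01 * rng.standard_normal((N * n_v, n_v))  # autoregressive visible -> visible
B = 0.01 * rng.standard_normal((N * n_v, n_h))  # autoregressive visible -> hidden

def dynamic_biases(v_hist):
    # c_i = a_i + sum_p A_pi v_{p,<t},  d_j = b_j + sum_p B_pj v_{p,<t}
    return a + v_hist @ A, b + v_hist @ B

def p_h_given_v(v_t, v_hist):
    _, d = dynamic_biases(v_hist)
    return sigmoid(d + v_t @ W)                 # eq. (6), hidden activation

def mean_v_given_h(h_t, v_hist):
    c, _ = dynamic_biases(v_hist)
    return c + W @ h_t                          # eq. (6), Gaussian mean

v_hist = rng.standard_normal(N * n_v)           # flattened v_{t-N..t-1}
v_t = rng.standard_normal(n_v)
h_t = (p_h_given_v(v_t, v_hist) > rng.random(n_h)).astype(float)
print(mean_v_given_h(h_t, v_hist))
```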

3.3 Multimodal Conditional Restricted Boltzmann Machines (MMCRBMs)

The extension of the CRBM to the multimodal case is straightforward, as shown in Fig. 1c. We define a Gibbs distribution over a multimodal network of stacked CRBMs, letting \(p_{\text {MC}}(\mathbf{v}^{1:M}_{t},\mathbf{h}^{1:M}_{t},\mathbf{h}^{F}_{t}|\mathbf{v}_{<t}^{1:M})\) denote the distribution (9). This is similar to the approach proposed in Srivastava and Salakhutdinov (2012) and Ngiam et al. (2011), except that we use CRBMs as our main building block instead of RBMs or auto-encoders, which enables us to model the temporal nature of time-series data.

$$\begin{aligned} \begin{array}{l} p_{\text {MC}}(\mathbf{v}^{1:M}_{t},\mathbf{h}^{1:M}_{t},\mathbf{h}^{F}_{t}|\mathbf{v}_{<t}^{1:M})=\frac{\exp [-E_{\text {MC}}(\mathbf{v}^{1:M}_{t},\mathbf{h}^{1:M}_{t},\mathbf{h}^{F}_{t}|\mathbf{v}^{1:M}_{<t})]}{Z({\varvec{\theta }})},\\ \\ Z({\varvec{\theta }})=\sum _{\mathbf{v},\mathbf{h}} \exp [-E_{\text {MC}}(\mathbf{v}^{1:M}_{t},\mathbf{h}^{1:M}_{t},\mathbf{h}^{F}_{t}|\mathbf{v}^{1:M}_{<t})], \\ \\ {\varvec{\theta }}= \left[ \begin{matrix} \{\mathbf{a}^{1:M},\mathbf{b}^{1:M},\mathbf{e}\} &{}\text {-bias},\\ \{{ A}^{1:M},{ B}^{1:M},{ C^{1:M}}\} &{}\text {-auto regressive},\\ \{{ W}^{1:M},{ W}^F\}&{}\text {-fully connected}. \end{matrix} \right] \end{array} \end{aligned}$$
(9)

The probability distributions are defined in (10).

$$\begin{aligned} \begin{array}{l} p_{\text {MC}}(v^{m}_{i,t}|\mathbf{h}^{m}_{t},\mathbf{v}^{m}_{<t})=\mathcal {N}(c^{m}_{i} +\sum _{j} h^{m}_{j,t}w^{m}_{ij},1),\\ \\ p_{\text {MC}}(h^{m}_{j,t} = 1 |\mathbf{h}^{F}_{t},\mathbf{v}^{m}_{t},\mathbf{v}^{m}_{<t})= \sigma (d^{m}_{j} +\sum _k h^{F}_{k,t} w^{F}_{jk} + \sum _{i} v^{m}_{i,t} w^{m}_{ij}),\\ \\ p_{\text {MC}}(h^{F}_{k,t} = 1|\mathbf{h}_{t}^{1:M},\mathbf{h}_{<t}^{1:M})=\sigma (f_k +\sum _{m,j} h_{j,t}^{m} w^{F}_{jk}). \end{array} \end{aligned}$$
(10)

where,

$$\begin{aligned} \begin{array}{rcccl} c^{m}_{i} &{}=&{} a^{m}_{i} &{}+&{} \sum _{p}A^{m}_{pi} v^{m}_{p,<t},\\ \\ d^{m}_{j} &{}=&{} b^{m}_{j} &{}+&{} \sum _{p}B^{m}_{pj} v^{m}_{p,<t} ,\\ \\ f_{k} &{}=&{} e_{k} &{}+&{} \sum _{m,r}C_{rk}^{m} h^{m}_{r,<t}. \end{array} \end{aligned}$$
(11)

For a multimodal CRBM, we define the joint representation (fusion) layer to be the top layer. The multimodal energy \(E_{\text {MC}}(\mathbf{v}^{1:M}_{t},\mathbf{h}^{1:M}_{t},\mathbf{h}^{F}_{t}|\mathbf{v}^{1:M}_{<t})\) is decomposed into two parts as shown in (12). The first part is the single-modality energy, defined over the CRBM of each modality m; it consists of unary terms representing the bias of each layer and a pairwise term relating the nodes of two layers. The second part is the fusion energy for the joint representation, where \(\mathbf{h}^{F}_{t}\) is the fusion hidden layer.

$$\begin{aligned} \begin{array}{c} E_{\text {MC}}(\mathbf{v}^{1:M}_{t},\mathbf{h}^{1:M}_{t},\mathbf{h}^{F}_{t}|\mathbf{v}^{1:M}_{<t})=\underbrace{\sum _{m} E_{\text {C}}(\mathbf{v}^{m}_{t},\mathbf{h}^{m}_{t}|\mathbf{v}^{m}_{<t})}_{\text {Unimodal}}\\ \\ \underbrace{-\sum _{k} f_{k} h^{F}_{k,t}-\sum _{j,k}h^{1:M}_{j,t}w^{F}_{jk} h^{F}_{k,t}}_{\text {Fusion}}. \end{array} \end{aligned}$$
(12)
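The following NumPy sketch illustrates the fusion step of (10)-(11): the current per-modality hidden vectors are concatenated and drive the fusion hiddens \(\mathbf{h}^F\) through \(W^F\), with an autoregressive bias from past hidden states. The modality names, layer sizes, and the concatenated form of the \(C^{1:M}\) matrices are illustrative assumptions.

```python
# Minimal sketch of the MMCRBM fusion-layer activation, eqs. (10)-(11).
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_h = {"mocap": 30, "audio": 20}                 # hidden sizes per modality (assumed)
n_f, N = 40, 3                                   # fusion hiddens, history length
n_cat = sum(n_h.values())

e = np.zeros(n_f)                                # fusion bias
W_F = 0.01 * rng.standard_normal((n_cat, n_f))   # hidden-to-fusion weights
C = 0.01 * rng.standard_normal((N * n_cat, n_f)) # autoregressive hidden-to-fusion (concatenated C^{1:M})

def fusion_activation(h_modalities, h_hist):
    # h_modalities: dict of current per-modality hidden vectors h^{1:M}_t
    # h_hist: flattened concatenation of the N previous per-modality hiddens
    h_cat = np.concatenate([h_modalities[m] for m in n_h])
    f = e + h_hist @ C                           # eq. (11): dynamic fusion bias
    return sigmoid(f + h_cat @ W_F)              # eq. (10): p(h^F_k = 1 | h^{1:M})

h_now = {m: rng.random(n_h[m]) for m in n_h}
h_hist = rng.random(N * n_cat)
print(fusion_activation(h_now, h_hist).shape)    # (40,)
```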

3.4 Discriminative Restricted Boltzmann Machines (DRBMs)

DRBMs, shown in Fig. 1d, are a natural extension of RBMs based on the model of Larochelle and Bengio (2008); they add a discriminative term for classification. The DRBM defines a probability distribution \(p_{\text {DR}}\) as a Gibbs distribution (13).

$$\begin{aligned} \begin{array}{rcl} p_{\text {DR}}(\mathbf{y},\mathbf{v},\mathbf{h})&{}=&{}\frac{\exp [-E_{\text {DR}}( \mathbf{y},\mathbf{v},\mathbf{h})]}{Z({\varvec{\theta }})},\\ \\ Z({\varvec{\theta }})&{}=&{}\sum _{\mathbf{y},\mathbf{v},\mathbf{h}}\exp [-E_{\text {DR}}(\mathbf{y},\mathbf{v},\mathbf{h})],\\ \\ {\varvec{\theta }}&{}=&{} \left[ \begin{matrix} \{\mathbf{a},\mathbf{b},\mathbf{s}\}&{}\text {-bias},\\ \{{ W},{ U}\}&{}\text {-fully connected}. \end{matrix} \right] \end{array} \end{aligned}$$
(13)

The probability distribution over the visible layer follows the same form as in (2). The hidden layer \(\mathbf{h}\) is now defined as a function of the labels y and the visible nodes \(\mathbf{v}\), and a new probability distribution for the classifier relates the label y to the hidden nodes \(\mathbf{h}\), as in (14).

$$\begin{aligned} \begin{array}{rcl} p_{\text {DR}}(v_{i}|\mathbf{h})&{}=&{}\mathcal {N}(a_{i}+\sum _{j} h_{j}w_{ij},1),\\ \\ p_{\text {DR}}(h_{j} = 1 |y_l,\mathbf{v})&{}=&{}\sigma (b_{j}+ u_{jl}+\sum _{i} v_{i} w_{ij}),\\ \\ p_{\text {DR}}(y_l|\mathbf{h})&{}=&{}\frac{\exp [s_l+\sum _j u_{jl}h_j]}{\sum _{l^*}\exp [s_{l^*}+\sum _j u_{jl^*}h_j]}. \end{array} \end{aligned}$$
(14)

The new energy function \(E_{\text {DR}}\) is defined in (15),

$$\begin{aligned} \begin{array}{l} E_{\text {DR}}(\mathbf{y},\mathbf{v},\mathbf{h})= \underbrace{E_{\text {R}}(\mathbf{v},\mathbf{h})}_{\text {Generative}} \underbrace{- \sum _{l} s_{l} y_{l} - \sum _{j,l} h_{j}u_{jl} y_{l}}_{\text {Discriminative}} \end{array} \end{aligned}$$
(15)
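To make the discriminative coupling concrete, the sketch below evaluates the DRBM conditionals in (14): the hidden units receive an extra input \(u_{jl}\) from the active label, and the label posterior given the hiddens is a softmax. Sizes and initializations are illustrative assumptions.

```python
# Minimal sketch of the DRBM conditionals in (14).
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_v, n_h, n_y = 8, 5, 3
a, b, s = np.zeros(n_v), np.zeros(n_h), np.zeros(n_y)
W = 0.01 * rng.standard_normal((n_v, n_h))   # visible-hidden weights
U = 0.01 * rng.standard_normal((n_h, n_y))   # hidden-label weights

def p_h_given_yv(label, v):
    # p(h_j = 1 | y_l, v) = sigma(b_j + u_{jl} + sum_i v_i w_ij)
    return sigmoid(b + U[:, label] + v @ W)

def p_y_given_h(h):
    # p(y_l | h) = softmax_l(s_l + sum_j u_{jl} h_j)
    logits = s + h @ U
    logits -= logits.max()                   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

v = rng.standard_normal(n_v)
h = p_h_given_yv(label=1, v=v)               # mean-field hidden activation
print(p_y_given_h(h))
```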

3.5 Discriminative Conditional Restricted Boltzmann Machines (DCRBMs)

In the same way the RBM can be extended to the DRBM by adding a discriminative term, the CRBM can be extended to the DCRBM (Fig. 1e). DCRBMs are based on the model in Larochelle and Bengio (2008), generalized to account for temporal phenomena using CRBMs. DCRBMs are a simpler version of the Factored Conditional Restricted Boltzmann Machines (Taylor et al. 2011) and Gated Restricted Boltzmann Machines (Memisevic and Hinton 2007); both of these models incorporate labels when learning representations, but they use a more complicated potential involving factored three-way connections. DCRBMs define the probability distribution \(p_{\text {DC}}\) as a Gibbs distribution (16).

$$\begin{aligned} \begin{array}{c} p_{\text {DC}}(\mathbf{y}_{t},\mathbf{v}_{t},\mathbf{h}_{t}|\mathbf{v}_{<t};{\varvec{\theta }})=\frac{\exp [-E_{\text {DC}}( \mathbf{y}_{t},\mathbf{v}_{t},\mathbf{h}_{t}|\mathbf{v}_{<t})]}{Z({\varvec{\theta }})},\\ \\ Z({\varvec{\theta }})=\sum _{\mathbf{y},\mathbf{v},\mathbf{h}}\exp [-E_{\text {DC}}(\mathbf{y}_{t},\mathbf{v}_{t},\mathbf{h}_{t}|\mathbf{v}_{<t})],\\ \\ {\varvec{\theta }}= \left[ \begin{matrix} \{\mathbf{a},\mathbf{b},\mathbf{s}\}&{}\text {-bias},\\ \{{ A},{ B}\}&{}\text {-auto regressive},\\ \{{ W},{ U}\}&{}\text {-fully connected}. \end{matrix} \right] \end{array} \end{aligned}$$
(16)

The probability distribution over the visible layer follows the same form as in (14). The hidden layer \(\mathbf{h}\) is defined as a function of the labels y and the visible nodes \(\mathbf{v}\), and a new probability distribution for the classifier relates the label y to the hidden nodes \(\mathbf{h}\) (17).

$$\begin{aligned} \begin{array}{l} p_{\text {DC}}(v_{i,t}|\mathbf{h}_{t},\mathbf{v}_{<t})=\mathcal {N}(c_i + \sum _{j} h_{j,t}w_{ij},1),\\ \\ p_{\text {DC}}(h_{j,t} = 1 |y_{l,t},\mathbf{v}_{t},\mathbf{v}_{<t})= \sigma (d_j+ u_{jl} + \sum _{i} v_{i,t} w_{ij}),\\ \\ p_{\text {DC}}(y_{l,t}|\mathbf{h}_{t})=\frac{\exp [s_l+\sum _j u_{jl}h_{j,t}]}{\sum _{l^*}\exp [s_{l^*}+\sum _j u_{jl^*}h_{j,t}]}. \end{array} \end{aligned}$$
(17)

where,

$$\begin{aligned} \begin{array}{c} c_{i}= a_{i} + \sum _{p}A_{pi} v_{p,<t},\\ \\ d_{j}= b_{j} + \sum _{p}B_{pj} v_{p,<t}. \end{array} \end{aligned}$$
(18)

The new energy function \(E_{\text {DC}}\) is defined in (19), similarly to that of the DRBM (15).

$$\begin{aligned} \begin{array}{c} E_{\text {DC}}(\mathbf{y}_{t},\mathbf{v}_{t},\mathbf{h}_{t}|\mathbf{v}_{<t})= \underbrace{E_{\text {C}}(\mathbf{v}_{t},\mathbf{h}_{t}|\mathbf{v}_{<t})}_{\text {Generative}}\\ \\ \underbrace{- \sum _{l} s_{l} y_{l,t} - \sum _{j,l} h_{j,t} u_{jl} y_{l,t}}_{\text {Discriminative}} \end{array} \end{aligned}$$
(19)

3.6 Multimodal Discriminative Conditional Restricted Boltzmann Machines (MMDCRBMs)

In the same way CRBMs can be extended to MMCRBMs, we can naturally extend DCRBMs to MMDCRBMs. An MMDCRBM combines a collection of unimodal DCRBMs, one for each visible modality. The hidden representations produced by the unimodal DCRBMs are then treated as the visible vector of a single fusion DCRBM. The result is an MMDCRBM model that relates multiple temporal modalities to a classification label. MMDCRBMs define the probability distribution \(p_{\text {MDC}}\) as a Gibbs distribution (20).

$$\begin{aligned} \begin{array}{l} p_{\text {MDC}}(\mathbf{y}_{t},\mathbf{v}^{1:M}_{t},\mathbf{h}^{1:M}_{t},\mathbf{h}^{F}_{t}|\mathbf{v}^{1:M}_{<t})=\\ \quad \quad \quad \exp [-E_{\text {MDC}}(\mathbf{y}_{t},\mathbf{v}^{1:M}_{t},\mathbf{h}^{1:M}_{t},\mathbf{h}^{F}_{t}|\mathbf{v}^{1:M}_{<t})]/Z({\varvec{\theta }}),\\ \\ Z({\varvec{\theta }})=\sum _{\mathbf{y},\mathbf{v},\mathbf{h}}\exp [-E_{\text {MDC}}(\mathbf{y}_{t},\mathbf{v}^{1:M}_{t},\mathbf{h}^{1:M}_{t},\mathbf{h}^{F}_{t}|\mathbf{v}^{1:M}_{<t})],\\ \\ {\varvec{\theta }}= \left[ \begin{matrix} \{\mathbf{a}^{1:M},\mathbf{b}^{1:M},\mathbf{e}, \mathbf{s}\} &{}\text {-bias},\\ \{{ A}^{1:M},{ B}^{1:M},{ C}^{1:M}\} &{}\text {-auto regressive},\\ \{{ W}^{1:M},{ W}^F, { U}^{1:M},{ U}^{F}\} &{}\text {-fully connected}. \end{matrix} \right] \end{array} \end{aligned}$$
(20)

The probability distribution over the visible layer follows the same form as in (14). The hidden layer \(\mathbf{h}\) is defined as a function of the labels y and the visible nodes \(\mathbf{v}\). A new probability distribution for the classifier, relating the label y to the hidden nodes \(\mathbf{h}\), is defined in (21).

$$\begin{aligned} \begin{array}{l} p_{\text {MDC}}(v^{m}_{i,t}|\mathbf{h}^{m}_{t},\mathbf{v}^{m}_{<t})=\mathcal {N}(c^{m}_{i} + \sum _{j} h^{m}_{j,t}w^{m}_{ij},1),\\ \\ p_{\text {MDC}}(h^{m}_{j,t} = 1 |y_{l,t},\mathbf{v}^{m}_{t},\mathbf{v}^{m}_{<t})= \sigma (d^{m}_{j} + u^{m}_{jl} + \sum _{i} v^{m}_{i,t} w^{m}_{ij}),\\ \\ p_{\text {MDC}}(y_{l,t}|\mathbf{h}_{t}^{m})=\frac{\exp [s_l+\sum _j u^{m}_{jl}h_{j,t}^{m}]}{\sum _{l^*} \exp [s_{l^*}+\sum _j u^{m}_{jl^*}h_{j,t}^{m}]},\\ \\ p_{\text {MDC}}(h^{F}_{k,t} = 1 |y_{l,t},\mathbf{h}^{1:M}_{t},\mathbf{h}^{1:M}_{<t})= \sigma (f_{k} + u^{F}_{kl} + \sum _{m,j} h^{m}_{j,t} w^{F}_{jk}),\\ \\ p_{\text {MDC}}(y_{l,t}|\mathbf{h}_{t}^{F})= \frac{\exp [s_l+\sum _k u^{F}_{kl}h_{k,t}^{F}]}{\sum _{l^*} \exp [s_{l^*}+\sum _k u^{F}_{kl^*}h_{k,t}^{F}]}. \end{array} \end{aligned}$$
(21)

where,

$$\begin{aligned} \begin{array}{rcccl} c^{m}_{i} &{}=&{} a^{m}_{i} &{}+&{} \sum _{p}A^{m}_{p,i} v^{m}_{p,<t},\\ \\ d^{m}_{j} &{}=&{} b^{m}_{j} &{}+&{} \sum _{p}B^{m}_{p,j} v^{m}_{p,<t},\\ \\ f_{k} &{}=&{} e_{k} &{}+&{} \sum _{m,r}C_{r,k}^{m} h^{m}_{r,<t}. \end{array} \end{aligned}$$
(22)

The new energy function \(E_{\text {MDC}}\) is defined in (23), similarly to that of the DRBM (15).

$$\begin{aligned} \begin{array}{c} E_{\text {MDC}}(\mathbf{y}_{t},\mathbf{v}^{1:M}_{t},\mathbf{h}^{1:M}_{t},\mathbf{h}^{F}_{t}|\mathbf{v}^{1:M}_{<t})= \underbrace{E_{\text {MC}}(\mathbf{v}^{1:M}_{t},\mathbf{h}^{1:M}_{t},\mathbf{h}^{F}_{t}|\mathbf{v}^{1:M}_{<t})}_{\text {Generative}} \\ \underbrace{- \sum _{l} s_{l} y_{l,t} - \sum _{k,l} h^{F}_{k,t} u^{F}_{kl} y_{l,t} - \sum _{j,l,m} h^{m}_{j,t} u^{m}_{jl} y_{l,t}}_{\text {Discriminative}} \end{array} \end{aligned}$$
(23)
Fig. 2

This figure specifies the inference algorithm. We first classify the unimodal data by activating the corresponding hidden layers \(\mathbf{h}^{m}_{t}\) as shown in (a), followed by classifying the multimodal data by activating the fusion layer \(\mathbf{h}^{F}_{t}\) as shown in (b)

Algorithm 1 Classification
Algorithm 2 Generation

4 Inference

Classification To perform classification at time t in the MMDCRBM, given \(\mathbf{v}_{<t}^{1:M}\) and \(\mathbf{v}_{t}^{1:M}\), we use a bottom-up approach, computing a cost for each possible label \(\mathbf{y}_{t}\) and then choosing the label with the least cost. Ideally we would like the cost for label \(\mathbf{y}_{t}\) to be the free energy \(-\log p_{\text {MDC}}(\mathbf{y}_{t},\mathbf{v}_{t}^{1:M}|\mathbf{v}_{<t}^{1:M})\) computed by marginalizing over \(\mathbf{h}_{<t}^{1:M}\), \(\mathbf{h}_{t}^{1:M}\), and \(\mathbf{h}_{t}^{F}\), but this is intractable due to the hidden-hidden edges.

Because this preferred cost is intractable, we use an approximate procedure instead. First, for each modality m, the expected value of \(\mathbf{h}_{t}^{m}\) is computed according to (21). Then, the cost associated with the candidate label is the free energy in the fusion DCRBM, namely \(-\log p_{\text {DC}}(\mathbf{y}_{t},\mathbf{h}_{t}^{1:M}|\mathbf{h}_{<t}^{1:M})\), computed by marginalizing over only \(\mathbf{h}_{t}^{F}\). Since this marginalization does not involve any hidden-hidden edges, it is tractable: the sum over exponentially many terms can be eliminated algebraically. Figure 2 illustrates our inference. The details of the classification procedure are given in Algorithm 1.
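As a hedged illustration of this scoring rule, the sketch below scores each candidate label by the negative free energy of the fusion DCRBM, where the sum over the binary fusion hiddens \(\mathbf{h}^F\) collapses into a softplus term. The exact bookkeeping of Algorithm 1 may differ; this is a sketch under the standard discriminative-RBM free-energy form, with illustrative sizes.

```python
# Hedged sketch of label scoring via the fusion-layer free energy (Sect. 4).
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
softplus = lambda x: np.logaddexp(0.0, x)

n_in, n_f, n_y, N = 50, 40, 21, 3            # concat. unimodal hiddens, fusion hiddens, labels, history
s = np.zeros(n_y)                            # label biases
e = np.zeros(n_f)                            # fusion biases
W_F = 0.01 * rng.standard_normal((n_in, n_f))
U_F = 0.01 * rng.standard_normal((n_f, n_y))
C = 0.01 * rng.standard_normal((N * n_in, n_f))

def classify(h_cat, h_hist):
    # h_cat: concatenated expected unimodal hiddens h^{1:M}_t (step 1 of Sect. 4)
    # h_hist: concatenated unimodal hiddens from the N previous frames
    f = e + h_hist @ C                       # dynamic fusion bias, eq. (22)
    pre = f + h_cat @ W_F                    # input to each fusion hidden
    # Negative free energy of (y_l, h_cat): s_l + sum_k softplus(pre_k + u^F_{kl}).
    scores = s + softplus(pre[:, None] + U_F).sum(axis=0)
    return int(np.argmax(scores)), scores

h_cat = sigmoid(rng.standard_normal(n_in))   # stand-in for expected unimodal hiddens
h_hist = sigmoid(rng.standard_normal(N * n_in))
label, scores = classify(h_cat, h_hist)
print(label)
```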

Generation To perform unimodal generation for modality m at time t, given \(\mathbf{v}_{<t}^{m}\) and \(\mathbf{y}_{t}\), we initialize \(\mathbf{v}_{t}^{m}\) to \(\mathbf{v}_{t-1}^{m}\) and then sample the distribution \(p_{\text {DC}}\) (17) using Gibbs sampling. Each Gibbs cycle samples \(p_{\text {DC}}(\mathbf{h}_{t}^{m} | \mathbf{y}_{t}, \mathbf{v}_{t}^{m}, \mathbf{v}_{<t}^{m})\) and then samples \(p_{\text {DC}}(\mathbf{v}_{t}^{m} | \mathbf{h}_{t}^{m}, \mathbf{v}_{<t}^{m})\). In the last Gibbs cycle, \(\mathbf{h}_{t}^{m}\) is assigned its expected value under the distribution instead of being sampled. We used 50 Gibbs cycles. The details of the generation procedure are given in Algorithm 2.
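A hedged sketch of this generation loop for a single frame follows. It alternates between the hidden and visible conditionals of the unimodal DCRBM (17)-(18), takes the mean hidden activation on the final cycle, and uses the Gaussian mean for the visibles (a common simplification). Parameter shapes, initializations, and the oldest-to-newest ordering of the history are illustrative assumptions.

```python
# Hedged sketch of unimodal generation (Algorithm 2) for one frame.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_v, n_h, n_y, N = 84, 30, 21, 10
a, b = np.zeros(n_v), np.zeros(n_h)
W = 0.01 * rng.standard_normal((n_v, n_h))
U = 0.01 * rng.standard_normal((n_h, n_y))
A = 0.01 * rng.standard_normal((N * n_v, n_v))
B = 0.01 * rng.standard_normal((N * n_v, n_h))

def generate_frame(label, v_hist, n_gibbs=50):
    c = a + v_hist @ A                       # dynamic visible bias, eq. (18)
    d = b + v_hist @ B                       # dynamic hidden bias, eq. (18)
    v_t = v_hist[-n_v:].copy()               # initialize v_t to v_{t-1} (history assumed oldest first)
    for step in range(n_gibbs):
        p_h = sigmoid(d + U[:, label] + v_t @ W)          # eq. (17)
        # Sample the hiddens, except on the last cycle where the mean is used.
        h_t = p_h if step == n_gibbs - 1 else (p_h > rng.random(n_h)).astype(float)
        v_t = c + W @ h_t                    # mean of the Gaussian visibles
    return v_t

v_hist = rng.standard_normal(N * n_v)        # flattened v_{t-N..t-1}
print(generate_frame(label=3, v_hist=v_hist).shape)
```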

5 Learning

Learning our model is done using contrastive divergence (CD) (Hinton 2002), where \(\langle \cdot \rangle _{data}\) is the expectation with respect to the data distribution and \(\langle \cdot \rangle _{recon}\) is the expectation with respect to the reconstructed data. Learning proceeds in two steps, a bottom-up pass and a top-down pass, using the sampling equations from (21).

Bottom-up The reconstruction is generated by first sampling the unimodal layers \(p(h^{m}_{t,j}=1|\mathbf{v}_{t}^{m},\mathbf{v}^{m}_{<t},y_l)\) for all hidden nodes in parallel, followed by sampling the fusion layer \(p(h^{F}_{t,k}=1|\mathbf{h}_{t}^{1:M},\mathbf{h}^{1:M}_{<t},y_l)\). This corresponds to the classification procedure (Algorithm 1).

Top-down The unimodal layers are generated using the activated fusion layer, \(p(h^{m}_{t,j}=1|\mathbf{h}_{t}^{F},y_l)\), followed by sampling the visible nodes \(p(v^{m}_{t,i}|\mathbf{h}_{t}^{m},\mathbf{v}^{m}_{<t})\) for all visible nodes in parallel. This corresponds to the generation procedure (Algorithm 2). The gradient updates are given in (24).

$$\begin{aligned} \begin{array}{lclcl} \varDelta a^{m}_{i}&{}\propto &{}\langle v^{m}_{i}\rangle _{data} &{}-&{} \langle v^{m}_{i}\rangle _{recon},\\ \varDelta b^{m}_{j}&{}\propto &{}\langle h^{m}_{j}\rangle _{data} &{}-&{} \langle h^{m}_{j}\rangle _{recon},\\ \varDelta e_{k}&{}\propto &{}\langle h^{F}_{k}\rangle _{data} &{}-&{} \langle h^{F}_{k}\rangle _{recon},\\ \varDelta s_{l}&{}\propto &{}\langle y_{l}\rangle _{data} &{}-&{} \langle y_{l}\rangle _{recon},\\ \varDelta A^{m}_{p,i,<t}&{}\propto &{} v^{m}_{p,<t}(\langle v^{m}_{i,t}\rangle _{data} &{}-&{} \langle v^{m}_{i,t}\rangle _{recon}),\\ \varDelta B^{m}_{p,j,<t}&{}\propto &{} v^{m}_{p,<t}(\langle h^{m}_{j,t}\rangle _{data} &{}-&{} \langle h^{m}_{j,t}\rangle _{recon}),\\ \varDelta C^{m}_{r,k,<t}&{}\propto &{} h^{m}_{r,<t}(\langle h^{F}_{k,t}\rangle _{data} &{}-&{} \langle h^{F}_{k,t}\rangle _{recon}),\\ \varDelta w^{m}_{i,j}&{}\propto &{}\langle v^{m}_{i}h^{m}_{j}\rangle _{data} &{}-&{} \langle v^{m}_{i}h^{m}_{j}\rangle _{recon},\\ \varDelta w^{F}_{j,k}&{}\propto &{}\langle h^{m}_{j}h^{F}_{k}\rangle _{data} &{}-&{} \langle h^{m}_{j}h^{F}_{k}\rangle _{recon},\\ \varDelta u^{m}_{j,l}&{}\propto &{}\langle y_{l}h^{m}_{j}\rangle _{data} &{}-&{} \langle y_{l}h^{m}_{j}\rangle _{recon},\\ \varDelta u^{F}_{k,l}&{}\propto &{}\langle y_{l}h^{F}_{k}\rangle _{data} &{}-&{} \langle y_{l}h^{F}_{k}\rangle _{recon}. \end{array} \end{aligned}$$
(24)
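The following sketch illustrates one contrastive-divergence update for a single unimodal CRBM block, following the \(\langle \cdot \rangle_{data} - \langle \cdot \rangle_{recon}\) pattern of (24): positive statistics from the data, negative statistics from a one-step reconstruction. The learning rate and shapes are illustrative assumptions; the fusion and label terms follow the same outer-product pattern and are omitted for brevity.

```python
# Hedged sketch of a CD-1 update for one unimodal CRBM block, mirroring (24).
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_v, n_h, N, lr = 84, 30, 10, 1e-3
a, b = np.zeros(n_v), np.zeros(n_h)
W = 0.01 * rng.standard_normal((n_v, n_h))
A = 0.01 * rng.standard_normal((N * n_v, n_v))
B = 0.01 * rng.standard_normal((N * n_v, n_h))

def cd1_step(v_t, v_hist):
    global a, b, W, A, B
    c = a + v_hist @ A
    d = b + v_hist @ B
    # Positive phase (bottom-up): expected hiddens given the data.
    h_data = sigmoid(d + v_t @ W)
    # Negative phase (top-down): reconstruct the visibles, then the hiddens again.
    h_sample = (h_data > rng.random(n_h)).astype(float)
    v_recon = c + W @ h_sample
    h_recon = sigmoid(d + v_recon @ W)
    # Parameter updates: <.>_data - <.>_recon differences as in (24).
    W += lr * (np.outer(v_t, h_data) - np.outer(v_recon, h_recon))
    a += lr * (v_t - v_recon)
    b += lr * (h_data - h_recon)
    A += lr * np.outer(v_hist, v_t - v_recon)
    B += lr * np.outer(v_hist, h_data - h_recon)

cd1_step(rng.standard_normal(n_v), rng.standard_normal(N * n_v))
print(np.abs(W).mean())
```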

6 Experiments

In Sect. 6.1 we describe the datasets we use for evaluation; in Sect. 6.2 we specify the implementation details; in Sect. 6.3 we explain how we selected the model parameters; finally, in Sect. 6.4 we present our results.

6.1 Datasets

We focus our analysis on temporal multimodal datasets built from raw sensor data. From the relevant datasets found in the literature, we decided to evaluate our approach on two realistic datasets and three toy datasets that highlight the contribution of our approach.

The two realistic datasets are: the Tower Game dataset (Salter et al. 2015), which captures an interaction between two humans, with the goal of classifying entrainment; the dataset is captured using a Kinect sensor, and for it we evaluate classification and generation accuracy using mocap-mocap multimodal data. The ChaLearn dataset (Escalera et al. 2014) is captured in a similar manner to the Tower Game dataset except that it also provides audio; for it we evaluate classification and generation accuracy using mocap-audio multimodal data.

The three toy datasets are: AVEC (Schuller et al. 2011), an audio-visual dataset for single-person affect analysis; AVLetters (Matthews et al. 2002), which consists of 10 speakers uttering the letters A to Z, three times each; and CUAVE (Patterson et al. 2002), which consists of 36 speakers uttering the digits 0 to 9. AVEC, AVLetters, and CUAVE are relatively simple datasets for the task we address.

Other relevant datasets include the Multimodal Dyadic Behavior dataset (Rehg et al. 2013), which focuses on analyzing dyadic social interactions between adults and children in a developmental context; this dataset was not fully released. The Mimicry database (Sun et al. 2011) focuses on studying social interactions between humans with the aim of analyzing mimicry in human-human interactions; it was collected in an unstructured format where the two humans talk to each other about different subjects. We were unable to gain access to it because we are not an educational institution.

6.1.1 Realistic Datasets

The Tower Game dataset (Salter et al. 2015) is based on a simple tower-building game often used in social psychology to elicit different kinds of interactive behaviors from the participants. It is typically played between two people working with a small fixed number of simple toy blocks that can be stacked to form various kinds of towers. We chose tower games because they force the players to engage and communicate with each other in order to achieve the objectives of the game, thereby evoking behaviors such as joint attention and entrainment. The game, due to its simplicity, allows for total control over the variables of an interaction. Due to the small number of blocks involved, the number of potential moves (actions) is limited, and since the game involves interacting with physical objects, joint attention is mediated through concrete objects. Furthermore, only two players are involved, ensuring that we stay in the realm of dyadic interactions. The data consists of 112 videos divided into 1213 ten-second segments, annotated for the presence or absence of these behaviors in each segment. Entrainment is the alignment in the behavior of two individuals; it involves simultaneous movement, tempo similarity, and coordination. Each measure was rated as low, medium, or high for the entire 10-s segment. 70% of the data was used for training and 30% for testing. In this dataset we treat each person's skeletal data as a modality, and our goal is to model mocap-mocap representations.

The ChaLearn dataset (Escalera et al. 2014) consists of a set of Italian gestures, featured in a challenge in 2014. The dataset was designed to evaluate user-independent continuous gesture recognition performance. It consists of 13,858 gestures from a vocabulary of 20 Italian cultural signs performed by 27 unique users: vattene, ok, vieniqui, cosatifarei, perfetto, basta, furbo, prendere, cheduepalle, noncenepiu, chevuoi, fame, daccordo, tantotempo, seipazzo, buonissimo, combinato, messidaccordo, freganiente, sonostufo. The dataset was recorded with Kinect sensors and includes the skeleton model, user mask, RGB, and depth images. It consists of 450 development, 250 validation, and 240 test videos. Each gesture is labeled with its ground-truth gesture type and its start and end timestamps. There are a total of 7754 instances for development, 3362 for validation, and 2742 for testing. The dataset was featured in Track 3 (Gesture Recognition) of the ChaLearn 2014 Looking at People competition, which emphasized multi-modal automatic learning of the 20 gestures performed by several different users, with the aim of user-independent continuous gesture localization. We followed the setup of Neverova et al. (2014) by using their augmented dataset, which contains audio. In this dataset our goal is to model mocap-audio representations.

6.1.2 Toy Datasets

To compare against the prior work of Amer et al. (2014) and Ngiam et al. (2011), we evaluate our approach on the three toy datasets used in their experiments: AVEC (Schuller et al. 2011), AVLetters (Matthews et al. 2002), and CUAVE (Patterson et al. 2002). The AVEC dataset (Schuller et al. 2011) is an audio-visual dataset for single-person affect analysis. It involves users interacting with emotionally stereotyped virtual characters operated by a human. The visual data contains mainly the face of the user interacting with the character; the audio data consists of recordings of the user's utterances and is synchronized with the video. The dataset has been annotated with binary labels for four affective dimensions: Activation, Expectation, Power, and Valence. We use the AVEC dataset to compare against Ramirez et al. (2011) and Glodek et al. (2011). The dataset is divided into 31 sequences for training and 32 sequences for testing. The AVLetters dataset (Matthews et al. 2002) consists of 10 speakers uttering the letters A to Z, three times each. The dataset also provides pre-extracted \(60 \times 80\) patches of lip regions along with audio features (MFCC features of 483 dimensions). The dataset is divided into 2/3 of the sequences for training and 1/3 for testing. The CUAVE dataset (Patterson et al. 2002) consists of 36 speakers uttering the digits 0 to 9. The dataset provides the aligned face of each speaker at size \(75 \times 50\), as well as the audio spectrogram and MFCC features of dimensionality 534. The dataset is divided into 1/2 for training and 1/2 for testing. We follow the same experimental setup as in Ngiam et al. (2011).

6.2 Implementation Details

To pre-process the mocap data of the ChaLearn and Tower Game datasets, we followed the same approach as Neverova et al. (2014) by forming a body-centric transformation of the skeletons generated by the Kinect sensors. We use the 11 joints of the upper body (of both players in the Tower Game), since the tower game almost entirely involves upper-body actions and the gestures are performed with the upper body. We used the raw joint locations normalized with respect to a selected origin point, and the same descriptor as Neverova et al. (2014) and Zanfir et al. (2013). The descriptor consists of 84 dimensions based on the normalized joint locations, inclination angles formed by all triples of anatomically connected joints, azimuth angles between projections of the second bone and the vector on the plane perpendicular to the orientation of the first bone, and bending angles between a basis vector perpendicular to the torso and the joint positions.

To pre-process the audio stream of ChaLearn, we followed the same approach as Neverova et al. (2014), which uses feature learning within a convolutional architecture. First, a short-time Fourier transform is applied to the raw audio signal to obtain a spectrogram. Second, the spectrogram is transformed to the Mel scale to produce 40 filterbanks. Finally, the filterbanks are fed to a one-layer convolutional network combined with two fully connected layers, resulting in a 40-dimensional feature. For the ChaLearn dataset we trained one multi-class model on the 20 gestures plus background. For the Tower Game dataset we trained three multi-class models, one for each of the labels Tempo Similarity, Coordination, and Simultaneous Movement, since they can co-occur, each with three possible values {low, medium, high}.
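As a hedged sketch of the first two audio pre-processing steps (STFT followed by a 40-band Mel filterbank), the snippet below uses librosa; the frame and hop sizes and the file name are illustrative assumptions, and the one-layer convolutional network of Neverova et al. (2014) is not reproduced here.

```python
# Hedged sketch of the STFT + 40-band Mel filterbank audio features.
import librosa

def mel_filterbanks(wav_path, sr=16000, n_mels=40, n_fft=512, hop_length=160):
    signal, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel).T        # (frames, 40) log-Mel features

# features = mel_filterbanks("gesture_clip.wav")   # hypothetical file name
```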

Fig. 3

This figure shows the sensitivity of our model's average classification accuracy to the number of hidden nodes and auto-regressive edges on the ChaLearn dataset. a–c show the sensitivity of our audio, mocap, and multimodal classifiers, respectively, to the number of hidden nodes and auto-regressive edges

The AVEC dataset comes with pre-computed audio and video features; refer to Schuller et al. (2011) for details. We apply PCA to the extracted features, reducing the audio features to 100 dimensions and the video features to 32 dimensions. For each modality we use a CRBM with temporal order \(N=5\), with the first hidden layer being over-complete, consisting of 150 nodes, and the multimodal fusion layer consisting of 300 nodes. For the AVLetters dataset, following the same setup as in Ngiam et al. (2011), we reduce the dimensionality of the audio features to 100 dimensions using PCA whitening and the video features (lip region) to 32 dimensions. Similarly for the CUAVE dataset, we reduce the dimensionality of the audio features to 100 dimensions using PCA whitening and the video features (lip region) to 32 dimensions. For AVEC we trained four classifiers, one for each of the labels Activation, Expectation, Power, and Valence. For AVLetters we trained a multi-class classifier with 26 classes, and for CUAVE a multi-class classifier with 10 classes.

6.3 Model Selection

We tuned our model parameters on the ChaLearn dataset using a grid search. We varied the number of hidden nodes per layer over \(\{10,20,30,50,70,100,200\}\) and the auto-regressive order over \(\{5,10\}\), resulting in a total of 2744 models trained on the development set, which we used to classify the validation set (a sketch of this search loop is shown after the configurations below). Figure 3 shows the average classification accuracy of the different models (per hidden layer) and the different delays. The best performing model on ChaLearn has the following configuration:

$$\begin{aligned} \begin{array}{lccccccccc} \text {Mocap: }&{} v&{}=&{}84,&{} h^m&{}=&{}30,&{}<t&{}=&{}10,\\ \text {Audio: }&{} v&{}=&{}40,&{} h^m&{}=&{}200,&{}<t&{}=&{}5,\\ \text {Multimodal: }&{} h^{1:M}&{}=&{}230,&{} h^{F}&{}=&{}200,&{} <t&{}=&{}5. \end{array} \end{aligned}$$

The best performing model on the Tower Game dataset has the following configuration:

$$\begin{aligned} \begin{array}{lccccccccc} \text {Mocap-1: }&{} v&{}=&{}84,&{} h^m&{}=&{}30,&{}<t&{}=&{}10,\\ \text {Mocap-2: }&{} v&{}=&{}84,&{} h^m&{}=&{}30,&{}<t&{}=&{}10,\\ \text {Multimodal: }&{} h^{1:M}&{}=&{}60,&{} h^{F}&{}=&{}200,&{} <t&{}=&{}5.\\ \end{array} \end{aligned}$$

The best performing model on the AVEC, AVLetters, and CUAVE datasets has the following configuration:

$$\begin{aligned} \begin{array}{lccccccccc} \text {Visual: }&{} v&{}=&{}32,&{} h^m&{}=&{}150,&{}<t&{}=&{}10,\\ \text {Audio: } &{} v&{}=&{}100,&{} h^m&{}=&{}150,&{}<t&{}=&{}5,\\ \text {Multimodal: }&{} h^{1:M}&{}=&{}300,&{} h^{F}&{}=&{}300,&{} <t&{}=&{}5.\\ \end{array} \end{aligned}$$
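The grid search of Sect. 6.3 can be sketched as follows: each of the three layers (audio, mocap, fusion) chooses from 7 hidden sizes and 2 autoregressive orders, giving \(14^3 = 2744\) configurations trained on the development set and scored on the validation set. `train_model` and `validation_accuracy` are hypothetical stand-ins for the actual training and evaluation pipeline.

```python
# Illustrative sketch of the model-selection grid search (Sect. 6.3).
import random
from itertools import product

hidden_sizes = [10, 20, 30, 50, 70, 100, 200]
ar_orders = [5, 10]
layer_options = list(product(hidden_sizes, ar_orders))   # 14 options per layer

def train_model(audio_cfg, mocap_cfg, fusion_cfg):
    # Stand-in for training an MMDCRBM on the development set.
    return {"audio": audio_cfg, "mocap": mocap_cfg, "fusion": fusion_cfg}

def validation_accuracy(model):
    # Stand-in for classifying the validation set with the trained model.
    return random.random()

best_acc, best_model = -1.0, None
for audio_cfg, mocap_cfg, fusion_cfg in product(layer_options, layer_options, layer_options):
    model = train_model(audio_cfg, mocap_cfg, fusion_cfg)   # 14^3 = 2744 models
    acc = validation_accuracy(model)
    if acc > best_acc:
        best_acc, best_model = acc, model
print(best_model)
```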
Fig. 4

This figure shows the sensitivity of our model's average classification accuracy to the amount of training data on the ChaLearn dataset. a–c show the sensitivity of our audio, mocap, and multimodal classifiers, respectively, to more training data

6.4 Quantitative Results

To evaluate our model we use three different metrics: (1) classification, (2) generation, and (3) localization. Note that for the ChaLearn dataset only localization results were reported by other participants (Escalera et al. 2014). We are the first to report generation results on this dataset, since all previous work used discriminative classifiers. For the Tower Game dataset (Salter et al. 2015) we report classification accuracy as well as generation error. For AVEC (Schuller et al. 2011), AVLetters (Matthews et al. 2002), and CUAVE (Patterson et al. 2002) we report average classification accuracy, which is the commonly used performance metric.

Table 1 Average classification accuracy on simultaneous movement
Table 2 Average classification accuracy on coordination
Table 3 Average classification accuracy on tempo similarity

6.4.1 Classification

On the ChaLearn dataset we evaluated our average classification accuracy with respect to different training-set sizes. Figure 4 shows our sensitivity with respect to the amount of training data used; we report the average over all 2744 trained models as well as the best performing classifier. Our approach achieves relatively good results using only 25% of the training data, and it reaches its best average classification accuracy at the multimodal layer using only 50% of the data, which shows how effectively our model learns from limited data. Our best configuration achieves 80.5%, 83.1%, and 98.5% average classification accuracy for audio, mocap, and multimodal, respectively.

Table 4 Classification accuracy on CUAVE dataset
Table 5 Classification accuracy on AVLetters dataset
Table 6 Average classification accuracy on AVEC dataset (Schuller et al. 2011)
Table 7 Localization accuracy on ChaLearn dataset

In the Tower Game dataset each label can take the value low, medium, or high. The data is split into a training set consisting of 70% of the instances and a test set consisting of the remaining 30%. We performed 5-fold cross validation to guarantee unbiased results. Tables 1, 2, and 3 show our average classification accuracy on the Tower Game dataset using different feature and baseline combinations as well as the results from our MMDCRBM model. The evaluation is done with respect to the six annotators \(\{A_1, A_2, \ldots , A_6\}\) as well as the mean annotation. We compare our approach against the baselines presented in Salter et al. (2015), where a set of first-order static and dynamic handcrafted skeleton features is extracted. The static features are computed per frame and consist of relationships between all pairs of joints of a single actor, and between all pairs of joints of both actors. The dynamic features are extracted per window (a set of 300 frames); in each window, first- and second-order dynamics (velocities and accelerations) of each joint are computed, as well as relative velocities and accelerations of pairs of joints per actor and across actors. The dimensionality of their static and dynamic features is 257,400. To reduce the dimensionality they used Principal Component Analysis (PCA) (100 D) and Bag-of-Words (BoW) (100 and 300 D) (Niebles et al. 2008). The MMDCRBM model outperforms all the other models for each of the three measures across all annotators, demonstrating its effectiveness at detecting these entrainment measures. Furthermore, the MMDCRBM model outperforms the PCA and BoW features derived from the high-dimensional handcrafted features, demonstrating its ability to learn a rich representation starting from the raw skeleton features.

For the CUAVE dataset (Patterson et al. 2002), Table 4 shows the classification performance for visual speech recognition. Note that the models of Gurban and Thiran (2009), Lucey and Sridharan (2006), and Papandreou et al. (2009) use a pre-processing step that is substantially more complex than ours; we use the same pre-processing as Ngiam et al. (2011), which extracts bounding boxes while ignoring orientation and perspective changes. Table 5 shows the classification performance for visual speech recognition on the AVLetters dataset (Cox et al. 2008). Our hybrid model shows a substantial improvement over the state-of-the-art, which includes hand-engineered features (Matthews et al. 2002; Zhao and Barnard 2009) as well as the staged hybrid models CRF-CRBM (Amer et al. 2014) and SVM-RBM (Ngiam et al. 2011). On the AVEC dataset we evaluated the average classification accuracy in Table 6. Again, our MMDCRBM model outperformed all other models, followed by the staged CRF-CRBM and CRF-RBM models (Amer et al. 2014).

6.4.2 Localization

We used the Jaccard Index to evaluate the localization performance on the ChaLearn dataset for the continuous sequences provided in the test set. The Jaccard Index was proposed by the challenge organizers to standardize the comparison between approaches and is defined in (25), where \(A_{s,n}\) is the ground truth label for gesture n in sequence s and \(B_{s,n}\) is the predicted label. This index evaluates the area of overlap as well as the correctness of the predicted label. Table 7 shows the localization results on the ChaLearn dataset. Note that for localization we used a simple scanning-window approach with no smoothing or post-processing, and we achieve 7th position using audio and mocap and 14th position using mocap only. Our jointly trained model outperforms the staged CRF-CRBM model (Amer et al. 2014), which further confirms that joint training improves the learned representation. Also note that our approach uses only 10% of the parameters used by Neverova et al. (2014).

$$\begin{aligned} J_{s,n} =\frac{A_{s,n} \cap B_{s,n}}{A_{s,n} \cup B_{s,n}} \end{aligned}$$
(25)
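A minimal sketch of the frame-level Jaccard index in (25), computed from per-frame label masks for one gesture in one sequence, is shown below; the input masks are illustrative.

```python
# Minimal sketch of the Jaccard index (25) for one gesture in one sequence.
import numpy as np

def jaccard_index(gt_mask, pred_mask):
    # gt_mask, pred_mask: boolean arrays marking the frames of gesture n in sequence s
    intersection = np.logical_and(gt_mask, pred_mask).sum()
    union = np.logical_or(gt_mask, pred_mask).sum()
    return intersection / union if union > 0 else 0.0

gt = np.array([0, 1, 1, 1, 0, 0], dtype=bool)
pred = np.array([0, 0, 1, 1, 1, 0], dtype=bool)
print(jaccard_index(gt, pred))    # 2 / 4 = 0.5
```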
Fig. 5

Unimodal average generation error on the Tower Game dataset. The results are averaged over 100 instances per class. The plot shows the average generation error (y-axis) for the full visible layer as the generated window size (x-axis) varies

Fig. 6

Unimodal average generation error on the ChaLearn dataset. The results are averaged over all 21 classes and over 100 instances per class. The plot shows the average generation error (y-axis) for the full visible layer as the generated window size (x-axis) varies

6.4.3 Generation

We evaluated the mocap generation error of our DCRBM. Given the class label and initial history data, our goal is to generate the full visible layer (i.e. the raw features) for that label. This task allows us to visualize what the classifier has learned. We sample the hidden representation \(h_{t}^{m}\) and then generate frames using 50 Gibbs cycles, where the last Gibbs cycle uses the mean of the hidden representation \(h_{t}^{m}\) instead of a sample. The generation error is calculated using (26) and is averaged over 100 instances.

For the Tower Game dataset the sequence size is 300 frames, so we vary the generated window size from 0 to 300 frames, as shown in Fig. 5. The generation error is relatively low \(({<}0.1)\) in all cases except Tempo Similarity, demonstrating the effectiveness of the DCRBM model for generating data. Tempo Similarity measures the similarity in the rate of motion of the two players; when data from both players is missing, generating their raw features based on whether their rate of motion is similar is extremely under-constrained. Also, the error is similar across the different levels (strengths) of each measure, indicating that the model is relatively stable. For the ChaLearn dataset the average validation sequence size is 40 frames, so we vary the generated window size from 0 to 40, as shown in Fig. 6. The error is somewhat higher for the ChaLearn dataset since the gestures are structured; we visualized the generated frames and found that the model identifies the most discriminative pose of the gesture and locks onto it. Finally, the error increases with the length of the generated sequence, which is expected as the variation in the ground-truth sequences increases with length. Note that our approach is the only generative approach evaluated on either of these datasets.

$$\begin{aligned} \text {Generation Error}=\left( \frac{\Vert \mathbf {v}_\text {Generated}-\mathbf {v}_\text {Groundtruth}\Vert }{\Vert \mathbf {v}_\text {Groundtruth}\Vert }\right) ^2 \end{aligned}$$
(26)
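For completeness, a minimal sketch of the relative generation error in (26), comparing a generated visible sequence against the ground truth, is:

```python
# Minimal sketch of the relative generation error, eq. (26).
import numpy as np

def generation_error(v_generated, v_groundtruth):
    num = np.linalg.norm(v_generated - v_groundtruth)
    den = np.linalg.norm(v_groundtruth)
    return (num / den) ** 2

print(generation_error(np.array([1.0, 2.1, 2.9]), np.array([1.0, 2.0, 3.0])))
```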

7 Conclusion

We have proposed a hybrid model comprising temporal generative and discriminative components for classifying sequential data from multiple heterogeneous modalities. Our research resulted in two main models: the DCRBM, which combines temporal, discriminative, and generative concerns in a single RBM-based model; and the MMDCRBM, which fuses multiple DCRBMs, enabling the learning of a rich fused feature representation combining multiple modalities. We employ an energy-based temporal generative model, which enables us to learn a joint representation that models short-term temporal characteristics while also allowing us to handle missing data. An extensive experimental evaluation on two realistic datasets and three toy datasets demonstrates the superiority of our approach over the state-of-the-art. These models are competitive with feedforward neural networks while using far fewer parameters and being generative. Furthermore, we reduced the number of parameters to 10% of the best performing method while using only two modalities.