Abstract
Several information fusion methods have been developed to increase the recognition accuracy of multimodal systems. Canonical correlation analysis (CCA), cross-modal factor analysis (CFA) and their kernel versions are known as successful fusion techniques, but they cannot capture the variability of the data. Probabilistic CCA (PCCA) has been suggested as a linear fusion method to capture input variability. A new kernel PCCA (KPCCA) is proposed here to capture both the nonlinear correlations between sources and the input variability. The performance of KPCCA degrades as the number of samples, which determines the size of the kernel matrix, increases. In conventional fusion methods the latent variables of the different modalities are concatenated; consequently, a large-scale covariance matrix must be estimated from only a limited number of samples. To overcome this drawback, a sparse KPCCA (SKPCCA) is introduced, which sparsifies the covariance matrix at the cost of decreasing its rank. In the final stage of the gradual evolution of KPCCA, a new feature fusion scheme is proposed for SKPCCA (FF-SKPCCA) as a second-stage fusion. The proposed method unifies the latent variables of the two modalities into a feature vector of acceptable size. Audio-visual databases, namely M2VTS (for speech recognition) and eNTERFACE and RML (for emotion recognition), are used to assess FF-SKPCCA against state-of-the-art fusion methods. The comparative results indicate the superiority of the proposed method in most cases.
References
Shivappa S, Trivedi M, Rao B (2010) Audiovisual information fusion in human computer interfaces and intelligent environments: A survey. Proc IEEE 98(10):1692–1715
Zeng Z, Pantic M, Roisman G I, Huang T S (2009) A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans Pattern Anal Mach Intell:39–58
Ayadi M E, Kamel M, Karray F (2011) Survey on speech emotion recognition: features, classification schemes and databases. Pattern Recogn 44(3):572–587
Atrey P K, Hossain M A, El Saddik A, Kankanhalli M S (2010) Multimodal fusion for multimedia analysis: a survey. Multimedia Systems:345–379
Galatas G, Potamianos G, Makedon F (2012) Audio-visual speech recognition incorporating facial depth information captured by the Kinect. In: Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pp 2714–2717
Gupta R, Malandrakis N, Xiao B, Guha T, Van Segbroeck M, Black M, Potamianos A, Narayanan S (2014) Multimodal Prediction of Affective Dimensions and Depression in Human-Computer Interactions. In: Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, pp. 33–40. Orlando Florida, USA: ACM
Taouche C, Batouche M C, Berkane M, Taleb-Ahmed A (2014) Multimodal biometric systems. In: International Conference on Multimedia Computing and Systems (ICMCS), pp 301–308
Xu C, Hero A O, Savarese S (2012) Multimodal video indexing and retrieval using directed information. IEEE Trans Multimedia:3–16
Ercan A O, Gamal A E, Guibas L J (2013) Object tracking in the presence of occlusions using multiple cameras: a sensor network approach. ACM trans Sen Netw:16:1–16:36
Wagner J, Andre E, Lingenfelser F, Jonghwa K (2011) Exploring fusion methods for multimodal emotion recognition with missing data. IEEE Trans Affect Comput:206–218
Wang Y, Guan Y (2008) Recognizing human emotional state from audiovisual signals. IEEE Trans Multimedia 10(5):936–946
Wang Y, Guan Y, Venetsanopoulos A N (2012) Kernel cross-modal factor analysis for information fusion with application to bimodal emotion recognition. IEEE Trans Multimedia:597–607
Li B, Qi L, Gao L (2014) Multimodal emotion recognition based on kernel canonical correlation analysis
Hotelling H (1936) Relations between two sets of variates. Biometrika:321–377
Li D, Dimitrova N, Li N, Sethi I K (2003) Multimedia content processing through cross-modal association. In: Proceedings ACM International Conference, pp 604–611
Bredin H, Chollet G (2007) Audio-visual speech synchrony measure for talking-face identity verification. In: Acoustics, Speech and Signal Processing, ICASSP 2007, pp II–233
Abo-Zahhad M, Ahmed S M, Abbas S N (2014) PCG biometric identification system based on feature level fusion using canonical correlation analysis. In: 2014 IEEE 27th Canadian Conference on Electrical and Computer Engineering (CCECE), pp 1–6
Metallinou A, Lee S, Narayanan S (2010) Decision level combination of multiple modalities for recognition and analysis of emotional expression. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pp 2462–2465
Li D, Taskiran C, Dimitrova N, Wang W, Li M, Sethi I K (2005) Cross-modal analysis of audio-visual programs for speaker detection. In: Proceedings IEEE Workshop Multimedia Signal Process., Shanghai, China, pp 1–4
Kumar K, Potamianos G, Navratil J, Marcheret E, Libal V (2011) Audio-visual speech synchrony detection by a family of bimodal linear prediction models. Multibiometrics for Human Identification:31–50
Lai P L, Fyfe C (2000) Kernel and nonlinear canonical correlation analysis. Int J Neural Syst:365–377
Shi Y, Ji H (2014) Kernel canonical correlation analysis for specific radar emitter identification. Electron Lett:1318–1320
Chetty G, Göcke R, Wagner M (2009) Audio-Visual mutual dependency models for biometric liveness checks. AVSP 2009, Norwich, pp. 32–37
Bach F, Jordan M I (2005) A probabilistic interpretation of canonical correlation analysis. Technical Report 688 Department of Statistics, University of California, Berkeley
Archambeau C, Bach F R (2009) Sparse probabilistic projections. Adv Neural Inf Proces Syst 21:73–80
Klami A, Virtanen S, Kaski S (2010) Bayesian exponential family projections for coupled data sources. In: 26th Conference on Uncertainty in Artificial Intelligence (UAI), pp 286–293
Koskinen M, Viinikanoja J, Kurimo M, Klami A, Kaski S, Hari R (2013) Identifying Fragments of natural speech from the listener’s MEG signals. Hum Brain Mapp 34(6):1477–1489
Rudovic O, Petridis S, Pantic M (2013) Bimodal log-linear regression for fusion of audio and visual features. 21st ACM Int Conf Multimedia:789–792
Wu C H, Lin J C, Wei W L (2014) Survey on audiovisual emotion recognition: databases, features, and data fusion strategies. APSIPA Transactions on Signal and Information Processing, e12
Hardoon D, Szedmak S, Shawe-taylor J (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Comput:2639–2664
Blaschko M, Lampert C H (2008) Correlational spectral clustering. IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008:1–8
Golub G H, Hansen P C, O'Leary D P (1999) Tikhonov regularization and total least squares. SIAM J Matrix Anal Appl 21(1):185–194
Rohani R, Sobhanmanesh F, Alizadeh S, Boostani R (2011) Lip processing and modeling based on spatial fuzzy clustering in color images. Int J Fuzzy Syst 13(2):65–73
Hermansky H, Hanson B A, Wakita H (1985) Perceptually-based linear predictive analysis of speech. In: Proceedings IEEE ICASSP, vol 2, pp 509–512
Bartlett A, Evans V, Frenkel I, Hobson C, Sumera E (2004) Digital Hearing Aids [Online]. Available www.clear.rice.edu/elec301/Projects01/dig_hear_aid
Wu C H, Lin J C, Wei W L (2013) Two-level hierarchical alignment for semi-coupled HMM-based audiovisual emotion recognition with temporal course. IEEE Trans Multimedia:1880–1895
Jiang D, Cui Y, Zhang X, Fan P, Gonzalez I, Sahli H (2011) Audiovisual emotion recognition based on triple-stream dynamic Bayesian network models. Affective Computing and Intelligent Interaction:609–618
Sing V, Shokeen V, Singh B (2013) Face detection by haar cascade classifier with simple and complex backgrounds images using opencv implementation. International Journal of Advanced Technology in Engineering and Science:33–38
Lyons M J, Budynek J, Plante A, Akamatsu S (2000) Classifying facial attributes using a 2-D Gabor wavelet representation and discriminant analysis. 4th Int Conf Automatic Face and Gesture Recognition:202–207
Manjunath B S, Ma W Y (1996) Texture features for browsing and retrieval of image data. IEEE Trans Pattern Anal Machine Intell 18(8):837–842
Pigeon S, Vandendorpe L (1997) The M2VTS multimodal face database (release 1.00). In Audio- and Video-Based Biometric Person Authentication. Springer, Berlin Heidelberg, pp 403–409
Martin O, Kotsia I, Macq B, Pitas I (2006) The eNTERFACE'05 audio-visual emotion database. In: Proc. ICDEW, p 8
Ekman P, Friesen W V (1975) Pictures of facial affect. Consulting Psychologists Press
Ekman P (1993) Facial expression and emotion. Am Psychol:384
Sun Q S, Zeng S G, Liu Y, Heng P A, Xia D S (2005) A new method of feature fusion and its application in image recognition. Pattern Recogn:2437–2448
Tipping M E, Bishop C M (1999) Probabilistic principal component analysis. Journal of the Royal Statistical Society B 61(3):611–622
Li Y O, Eichele T, Calhoun V D, Adali T (2012) Group study of simulated driving fMRI data by multiset canonical correlation analysis. Journal of signal processing systems:31–48
Lin J C, Wu C H, Wei W L (2012) Error weighted semi-coupled hidden Markov model for audio-visual emotion recognition. IEEE Trans Multimedia:142–156
D'Mello S, Kory J (2012) Consistent but modest: a meta-analysis on unimodal and multimodal affect detection accuracies from 30 studies. In: Proceedings of the 14th ACM international conference on Multimodal interaction, pp 31–38
Morrison D, Wang R, De Silva L C (2007) Ensemble methods for spoken emotion recognition in call-centres. Speech Commun:98–112
Muramatsu D, Iwama H, Makihara Y, Yagi Y (2013) Multi-view multi-modal person authentication from a single walking image sequence. 2013 International Conference on Biometrics (ICB):1–8
Acknowledgments
The authors acknowledge Dr. Homayounpour, professor at Amirkabir University, for allowing us to use their M2VTS dataset in our experiments.
Appendix A
1.1 A-1. CCA Method
Canonical correlation analysis (CCA) is a statistical method proposed by [14] to find a shared structure between two sources of data. CCA is closely related to the mutual information method [45], but differs in its objective function. A pair of feature vectors with zero means is considered in this method as follows:
where x i and y i are the observation data (original features) of the two modalities, with dimensions p and q, respectively. CCA seeks two transformation matrices W x and W y , with dimensions p×d and q×d respectively, where d ≤ min(p, q). The original features of the two modalities are projected onto the correlation subspace by W x and W y such that the correlation between \(\hat {x}=\mathbf {x}\mathbf {W}_{x}\) and \(\hat {y}=\mathbf {y}\mathbf {W}_{y}\) is maximized. Maximizing the correlation between the projected feature vectors \(\hat {x}\) and \(\hat {y}\) amounts to maximizing the correlation coefficient ρ between them:
where C x y is the cross-covariance matrix of (x, y), and C x x and C y y are the covariance matrices of x and y, respectively.
The above equation can be solved as an eigenvalue problem:
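As a rough illustration of this formulation (a minimal NumPy sketch, not the authors' implementation), CCA reduces to an eigenvalue problem on the covariance matrices; the small ridge term `reg` is an added numerical-stability assumption, not part of the original derivation:

```python
import numpy as np

def cca(X, Y, d=1, reg=1e-6):
    """Minimal CCA sketch: X is (n, p), Y is (n, q), both zero-mean.

    Returns Wx (p, d) and Wy (q, d) maximising the correlation between
    X @ Wx and Y @ Wy.  `reg` is a small ridge term added purely for
    numerical stability (an assumption of this sketch).
    """
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # eigenvalue form: Cxx^{-1} Cxy Cyy^{-1} Cyx wx = rho^2 wx
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    vals, vecs = np.linalg.eig(M)
    top = np.argsort(-vals.real)[:d]
    Wx = vecs[:, top].real
    Wy = np.linalg.solve(Cyy, Cxy.T) @ Wx      # wy is proportional to Cyy^{-1} Cyx wx
    Wy /= np.linalg.norm(Wy, axis=0)
    return Wx, Wy
```

On two views sharing a common latent signal, the projections X @ Wx and Y @ Wy become strongly correlated.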
1.2 A-2. CFA Method
The cross-modal factor analysis (CFA) method was proposed by [15]; the features from the different modalities are treated as two subsets, and the patterns shared between these subsets are discovered. In this method, a pair of normalized feature vectors x and y with zero means is assumed to be linearly projected into a joint space by the transforms W x and W y such that the following criterion is minimized:
where \(\mathbf {W}_{x}^{T}\mathbf {W}_{x}\) and \(\mathbf {W}_{y}^{T}\mathbf {W}_{y}\) are identity matrices and F denotes the Frobenius norm, calculated as \(\left \| \mathbf {W} \right \|_{F}=\sqrt {\sum \nolimits _{ij} w_{ij}^{2}}\).
By solving the above equation for the optimal transformation matrices W x and W y and decomposing the cross-covariance matrix C x y by singular value decomposition (SVD), the following equation is obtained:
Consequently,
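Under the same zero-mean assumptions, the CFA transforms can be read off directly from the SVD of the cross-covariance matrix; the sketch below is illustrative only (the function name `cfa` is an assumption):

```python
import numpy as np

def cfa(X, Y, d=1):
    """CFA sketch: with orthonormality constraints, the transforms that
    minimise ||X Wx - Y Wy||_F come from the SVD of the cross-covariance
    matrix C_xy = S @ diag(lam) @ D.T, taking Wx = S and Wy = D
    (first d columns)."""
    n = X.shape[0]
    Cxy = X.T @ Y / n
    S, _, Dt = np.linalg.svd(Cxy, full_matrices=False)
    return S[:, :d], Dt.T[:, :d]
```

Unlike CCA, no covariance inversion is needed, which makes CFA cheaper and numerically more robust.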
1.3 A-3. Probabilistic CCA
To deal with the uncertainty problem in CCA, probabilistic CCA (PCCA) was introduced by [24]; its projected latent variables provide maximum variance in the joint correlation space. To this end, a Gaussian model is defined for each source of data as follows:
where z is the latent variable shared between the two modalities x and y, and μ and φ are the mean and covariance of each source, respectively. Maximizing these probability functions requires minimizing φ x and φ y . By defining \(\mathbf {O=}\left [ {\begin {array}{*{20}c} \mathbf {x}\\ \mathbf {y}\\ \end {array} } \right ]\), W = [ W x W y ], \(\mu \mathbf {=}\left [ {\begin {array}{*{20}c} \mu _{x}\\ \mu _{y}\\ \end {array} } \right ]\) and \(\varphi \mathbf {=}\left [ {\begin {array}{*{20}c} \varphi _{x} & \mathbf {0}\\ \mathbf {0} & \varphi _{y}\\ \end {array} } \right ]\), both probabilistic functions are merged into the following joint probabilistic function:
They show that the posterior expectations of z given x and y are:
where, W x and W y are the first d canonical directions of x and y. The parameters C x x and C y y are the covariances of x and y, respectively.
However, this new method, named the unified latent variable, can identify a single latent variable given both x and y as:
where, \(P_{d}=M_{x}^{-1}\ast {(M_{y}^{-1})}^{T}\).
Another solution for (22) is based on the expectation-maximization (EM) algorithm. Similarly, probabilistic principal component analysis (PPCA) was proposed by [46], which iterates the following EM steps:
- Expectation-step: finds the sufficient statistics of the latent variables given the currently estimated parameters:
$$\begin{array}{@{}rcl@{}} M_{t}&=&I+ \mathbf{W}_{t}^{T}\varphi_{t}^{-1}\mathbf{W}_{t} \end{array} $$
$$\begin{array}{@{}rcl@{}} E(z_{t})&=& M_{t}^{-1}\mathbf{W}_{t}^{T}\varphi_{t}^{-1}\mathbf{O} \end{array} $$
$$\begin{array}{@{}rcl@{}} E(z_{t}{z_{t}^{T}})&=&M_{t}^{-1}+E(z_{t}){E(z_{t})}^{T} \end{array} $$
(26)
where the subscript t indicates the iteration number.
- Maximization-step: updates the estimated parameters to maximize the likelihood function:
By inserting (26) into (27), this method provides a general solution for the PCCA scheme, which yields the following update equations:
where \(M_{t}=I+\mathbf {W}_{t}^{T}\varphi _{t}^{-1}\mathbf {W}_{t}\).
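The EM iteration above can be sketched as follows. For simplicity, this sketch restricts the noise covariance φ to a diagonal matrix, which is an assumption of the sketch rather than of the exact PCCA model (where φ is block-structured); the function name `pcca_em` is likewise illustrative:

```python
import numpy as np

def pcca_em(O, d=1, n_iter=100, seed=0):
    """EM sketch for the joint latent model O ~ W z + mu + noise.

    O is the (n, p+q) matrix of stacked [x, y] observations; phi is kept
    diagonal here (a simplifying assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    n, D = O.shape
    mu = O.mean(axis=0)
    Oc = O - mu
    C = Oc.T @ Oc / n                       # sample covariance
    W = rng.normal(scale=0.1, size=(D, d))
    phi = np.ones(D)                        # diagonal noise variances
    for _ in range(n_iter):
        # E-step: M = I + W^T phi^{-1} W, E[z] = M^{-1} W^T phi^{-1} o
        Pinv = np.diag(1.0 / phi)
        M = np.eye(d) + W.T @ Pinv @ W
        Minv = np.linalg.inv(M)
        Ez = Oc @ Pinv @ W @ Minv           # (n, d) posterior means
        Ezz = n * Minv + Ez.T @ Ez          # sum over samples of E[z z^T]
        # M-step: update W and phi to maximise the expected likelihood
        W = (Oc.T @ Ez) @ np.linalg.inv(Ezz)
        phi = np.maximum(np.diag(C - W @ (Ez.T @ Oc) / n), 1e-6)
    return W, phi, mu
```

On data generated from a single latent factor, the estimated loading matrix W converges to the true loading direction up to sign and scale.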
1.4 A-4. KCCA Method
Kernel canonical correlation analysis (KCCA) [21] is the kernelized version of CCA: it maps the data into higher-dimensional feature spaces and applies CCA in the kernel space in order to find a nonlinear correlation between the two modalities. Let ϕ and ψ be two mapping functions that map the input data into a higher-dimensional space:
The KCCA seeks to develop the two matrices α and β that are applied in the following equations:
This means that W x and W y are the projections of ϕ(x) and ψ(y) onto α and β, respectively. By substituting ϕ and ψ into (16), the correlation function becomes:
where, K x = E[ϕ(x).ϕ(x)T] and K y = E[ψ(y).ψ(y)T].
This optimization problem can be solved by the generalized eigenvalue decomposition method. When the kernel matrices are non-invertible, a conventional regularization technique can be applied, which yields the following equation [30]:
where, 0≤τ≤1.
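One common regularized formulation of this dual eigenproblem can be sketched as follows; the function name `kcca` and the exact placement of the regularizer τ are illustrative assumptions of this sketch:

```python
import numpy as np

def kcca(Kx, Ky, d=1, tau=0.1):
    """Regularised KCCA sketch: solves
    (Kx + tau*I)^{-1} Ky (Ky + tau*I)^{-1} Kx alpha = rho^2 alpha
    for the dual directions alpha, then recovers beta."""
    n = Kx.shape[0]
    Rx = Kx + tau * np.eye(n)
    Ry = Ky + tau * np.eye(n)
    M = np.linalg.solve(Rx, Ky) @ np.linalg.solve(Ry, Kx)
    vals, vecs = np.linalg.eig(M)
    top = np.argsort(-vals.real)[:d]
    alpha = vecs[:, top].real
    # beta is proportional to (Ky + tau*I)^{-1} Kx alpha
    beta = np.linalg.solve(Ry, Kx) @ alpha
    return alpha, beta
```

With linear kernels Kx = XXᵀ and Ky = YYᵀ this reduces to (regularized) linear CCA; nonlinear kernels such as the RBF kernel recover nonlinear correlations.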
1.5 A-5. KCFA Method
The kernel CFA approach [12] can provide correct information association even when the two modalities are not linearly related. To illustrate this, let X=(ϕ(x 1), ϕ(x 2),…,ϕ(x n ))T and Y =(ψ(y 1),ψ(y 2),…,ψ(y n ))T denote the two matrices whose rows are samples in the nonlinearly mapped feature spaces; then the singular value decomposition \(\mathbf {X}^{T}\mathbf {Y}=\mathbf {S}_{xy}{\Lambda }_{xy}\mathbf {D}_{xy}^{T}\) should be solved through the kernel method. The kernel matrices of the two subsets of features can be computed as K x = X X T and K y = Y Y T. By performing the eigenvalue decomposition on the product of the kernel matrices K x K y , it becomes obvious that
Since the right singular vectors D x y of the SVD of X T Y are the eigenvectors of Y T X X T Y = (X T Y)T(X T Y), Y T β i corresponds to the columns of D x y , which can be further normalized to unit norm as:
For a feature vector y ′ with nonlinear mapping ψ(y ′), the projection can be computed as
Similarly, it can be illustrated that
The left singular vectors S x y are the eigenvectors of X T Y Y T X = (X T Y)(X T Y)T; hence X T α j corresponds to the columns of S x y , which can be normalized to unit norm as:
Letting x ′ be a feature vector in the original domain with nonlinear mapping ϕ(x ′), the feature vector in the cross-modal associated domain can be computed as:
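A rough dual-form sketch of these steps (illustrative only; `kcfa` and `project` are hypothetical helper names, not from the original paper) is:

```python
import numpy as np

def kcfa(Kx, Ky, d=1):
    """KCFA sketch: dual coefficients from the eigen-decomposition of the
    product of the kernel matrices Kx Ky (and Ky Kx for the second view)."""
    vals, A = np.linalg.eig(Kx @ Ky)
    alpha = A[:, np.argsort(-vals.real)[:d]].real
    vals2, B = np.linalg.eig(Ky @ Kx)
    beta = B[:, np.argsort(-vals2.real)[:d]].real
    # the projected bases X^T alpha and Y^T beta are normalised to unit
    # norm in dual form via ||X^T a||^2 = a^T Kx a (Kx is PSD)
    alpha /= np.sqrt(np.sum(alpha * (Kx @ alpha), axis=0))
    beta /= np.sqrt(np.sum(beta * (Ky @ beta), axis=0))
    return alpha, beta

def project(K_new, coefs):
    """Project new samples given their kernel values against the training set."""
    return K_new @ coefs
```

Everything is expressed through kernel evaluations, so the (possibly infinite-dimensional) mappings ϕ and ψ never need to be computed explicitly.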
About this article
Cite this article
Sarvestani, R.R., Boostani, R. FF-SKPCCA: Kernel probabilistic canonical correlation analysis. Appl Intell 46, 438–454 (2017). https://doi.org/10.1007/s10489-016-0823-x