Pattern Recognition Letters

Volume 32, Issue 2, 15 January 2011, Pages 134-144

Canonical correlation analysis using within-class coupling

https://doi.org/10.1016/j.patrec.2010.09.025

Abstract

Fisher’s linear discriminant analysis (LDA) is one of the most popular supervised linear dimensionality reduction methods. Unfortunately, LDA is not suitable for problems where the class labels are not available and only the spatial or temporal association of data samples is implicitly indicative of class membership. In this study, a new strategy for reducing LDA to Hotelling’s canonical correlation analysis (CCA) is proposed. CCA seeks maximally correlated projections between two views of data, and it has long been known to be equivalent to LDA when the data features are used as one view and the class labels as the other. The basic idea of the new equivalence between LDA and CCA, which we call within-class coupling CCA (WCCCA), is to apply CCA to pairs of data samples that are most likely to belong to the same class. We prove the equivalence between LDA and such an application of CCA. With this implicit representation of the class labels, WCCCA is applicable both to regular LDA problems and to problems in which only spatial and/or temporal continuity provides clues to the class labels.

Research highlights

► The samples-versus-class-labels equivalence between LDA and CCA is extended to a samples-versus-samples basis, which can be viewed as accomplishing LDA through an indirect, distributed, and implicit presentation of the categorical class labels.

► The method is applicable both to regular LDA problems and to problems in which the class labels are not explicitly available but can be tracked down in the patterns of the data through spatial and/or temporal continuity, as in splitting a video into scenes (sequences of related frames), segmenting an image into regions that share certain visual characteristics, speech analysis, or biological sequence analysis.

► For demonstration, the ORL face dataset is made into a movie in which consecutive frames are more likely to show the same individual than different ones.

► When a scene change occurs, the movie continues with the pictures of another individual, and so on.

► Applied to this movie, the method works just as if LDA had been given the actual class labels.

Introduction

Fisher’s linear discriminant analysis (LDA; Fisher, 1936) and Hotelling’s canonical correlation analysis (CCA; Hotelling, 1936) are among the oldest, yet most powerful, multivariate data analysis techniques. LDA is one of the most popular supervised dimensionality reduction methods; it incorporates the categorical class labels of the data samples into a search for linear projections of the data that maximize the between-class variance while minimizing the within-class variance (Rencher, 1997, Alpaydin, 2004, Izenman, 2008).

On the other hand, CCA works with two sets of (related) variables, and its goal is to find a linear projection of the first set of variables that maximally correlates with a linear projection of the second set. These sets have recently also been referred to as views or representations (Hardoon et al., 2004). Finding correlated functions (covariates) of the two views of the same phenomenon, while discarding the representation-specific details (noise), is expected to reveal the underlying hidden yet influential semantic factors responsible for the correlation (Hardoon et al., 2004, Becker, 1999, Favorov and Ryder, 2004, Favorov et al., 2003).

Both LDA and CCA were proposed in 1936, and shortly afterwards a direct link between them was shown by Bartlett (1938), as follows. Given a dataset of samples and their class labels, if we take the features of the data samples as one view and the class labels as the other view (a single binary variable suffices for the two-class problem, but a form of 1-of-C coding scheme is typically used for multi-class categorical class labels), this CCA setup is known to be equivalent to LDA (Bartlett, 1938, Hastie et al., 1995). In other words, LDA can simply be said to be a special case of CCA.
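To make this classical equivalence concrete, the following is a minimal sketch (ours, not the paper's) on synthetic data; the toy data, variable names, and the use of scikit-learn's CCA and LDA implementations are illustrative assumptions. Up to sign and scale, the CCA covariates of the feature view should match the LDA projections.

    # Sketch: CCA between features X and one-hot labels L reproduces LDA.
    import numpy as np
    from sklearn.cross_decomposition import CCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    means = np.array([[0., 0, 0, 0, 0], [4, 1, 0, 0, 0], [0, 4, 1, 0, 0]])
    X = np.vstack([m + rng.normal(size=(50, 5)) for m in means])  # 3 Gaussian classes
    y = np.repeat([0, 1, 2], 50)
    L = np.eye(3)[y]                      # 1-of-C (one-hot) label view

    cca = CCA(n_components=2).fit(X, L)   # C classes give C-1 useful covariates
    lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

    Zc, _ = cca.transform(X, L)           # CCA projections of the feature view
    Zl = lda.transform(X)                 # LDA projections
    for i in range(2):
        r = np.corrcoef(Zc[:, i], Zl[:, i])[0, 1]
        print(f"component {i}: |corr(CCA, LDA)| = {abs(r):.3f}")  # expected near 1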

Knowledge of this insightful equivalence between LDA and CCA led researchers to attempt to use CCA to surpass the quality of the LDA projections. These attempts paired the samples with their class labels, using several other forms of representation for the labels (Loog et al., 2005, Barker and Rayens, 2003, Gestel et al., 2001, Johansson, 2001, Sun and Chen, 2007). An interesting example of such a label transformation is the replacement of hard categorical labels with soft labels: in (Sun and Chen, 2007), similarly to the support-vector idea, the aim was to put more weight on the samples near the class boundaries rather than using a common label for all samples of a class; more useful projections were found because more focus was placed on the problematic regions of the input space than on the high-density regions around the class centers. Another example is the image segmentation study in (Loog et al., 2005), which uses image-pixel features and their associated class labels to learn to classify pixels. Their CCA-based method also incorporates the class labels of the neighboring pixels, which can naturally be expected to yield LDA-like (but possibly more informative) projections. The method can be applied to other, non-image, forms of data by accounting for the spatial class-label configuration in the vicinity of every feature vector (Loog et al., 2005).

In this paper, we present another extension of CCA to LDA, along with its equivalence proof. The main idea is to transform the class label of a sample such that it is represented, in a distributed manner, by all the samples of that same class. In other words, CCA is asked to produce correlated outputs (projections) for any pair of samples that belong to the same class; we call this scheme within-class coupling CCA (WCCCA). Despite its increased complexity, this extension of CCA to LDA has various advantages (see Section 4.2 for a detailed list). One important advantage of the WCCCA idea of using samples versus samples as the two views is its ability to perform a form of implicitly supervised LDA (see Section 5.2), since the class labels are sometimes embedded in the patterns of the data rather than being explicitly available, for example in the patterns of spatial and temporal continuity (Becker, 1999, Favorov and Ryder, 2004, Favorov et al., 2003, Borga and Knutsson, 2001, Stone, 1996). Exemplary applications on such data include dividing a video into sequences of related frames (scenes), segmenting an image into regions that share certain visual characteristics, identifying sequences of acoustic frames belonging to the same word in speech analysis, and finding sequences of base pairs or amino acids belonging to the same protein in biological sequence analysis. In such settings, the use of LDA is difficult, if not impossible.
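The pairing itself is easy to sketch. The snippet below is our illustration, not the authors' code; the paper's exact pairing scheme and its proof are in Section 4. It assumes the simplest scheme, coupling every ordered pair of same-class samples, so that the class labels enter only implicitly through the coupling:

    # Sketch: WCCCA views are same-class sample pairs; no label matrix is built.
    import itertools
    import numpy as np
    from sklearn.cross_decomposition import CCA

    def within_class_views(X, y):
        """Stack every ordered same-class pair (x_i, x_j), i != j, as two views."""
        left, right = [], []
        for c in np.unique(y):
            idx = np.flatnonzero(y == c)
            for i, j in itertools.permutations(idx, 2):
                left.append(X[i])
                right.append(X[j])
        return np.asarray(left), np.asarray(right)

    rng = np.random.default_rng(0)
    means = np.array([[0., 0, 0, 0, 0], [4, 1, 0, 0, 0], [0, 4, 1, 0, 0]])
    X = np.vstack([m + rng.normal(size=(50, 5)) for m in means])  # toy data
    y = np.repeat([0, 1, 2], 50)          # used only to form the pairs

    Xl, Xr = within_class_views(X, y)
    wccca = CCA(n_components=2).fit(Xl, Xr)
    Z = wccca.transform(X)                # LDA-like projections of the original data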

The idea of applying CCA, or other forms of mutual information maximization models, between, for example, the consecutive frames of a video or neighboring image patches in order to find correlated functions is not a new one (Becker, 1999, Favorov and Ryder, 2004, Favorov et al., 2003, Borga and Knutsson, 2001, Borga, 1998, Stone, 1996, Kording and Konig, 2000, Phillips et al., 1995, Phillips and Singer, 1997). Many of these attempts are inspired by the learning mechanisms hypothesized to be used by neurons in the cerebral cortex. For example, cortical neurons might tune to correlated functions between their own afferent inputs and the lateral inputs they receive from other neurons with different but functionally related afferent inputs. Thus, groups of neurons receiving such different but related afferent inputs can learn to produce correlated outputs under the contextual guidance of each other (Phillips et al., 1995, Phillips and Singer, 1997). However, whether the correlated functions found this way are good for discrimination has not been mathematically justified. Would such covariates be suitable projections for clustering the frames into scenes or for image segmentation? The results of our study show that such a CCA application is comparable to performing LDA; and since LDA projections maximize the between-class variance and minimize the within-class variance, the covariates found this way are indeed useful, for example, for clustering the frames into scenes.
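As a fully label-free illustration of this temporal-coupling idea (our sketch, not the authors' code; `frames` and all other names are placeholders), consecutive frames can be coupled as the two CCA views, on the assumption that scene changes are rare compared with within-scene transitions:

    # Sketch: couple each frame with its successor and find shared covariates.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    def temporal_views(frames):
        """Two views: every frame paired with its immediate successor."""
        F = np.asarray(frames)
        return F[:-1], F[1:]

    frames = np.random.default_rng(1).normal(size=(500, 32))  # placeholder (T, d) features
    Fl, Fr = temporal_views(frames)
    model = CCA(n_components=4).fit(Fl, Fr)
    Z = model.transform(frames)           # projections for clustering frames into scenes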

This paper is organized as follows. In Sections 2 and 3, we review the CCA and LDA techniques, respectively. In Section 4, we present the WCCCA idea of using CCA on a samples-versus-samples basis and provide the proof of its equivalence to LDA; we also show on a toy example that the theoretically derived equivalence holds in practice, discuss the advantages and disadvantages of this way of performing LDA, and close the section with the nonlinear kernel extension of WCCCA. In Section 5, we present the experimental results on a face database and show that WCCCA can perform the task of LDA even when the images are made into a movie and the class label information is kept only implicitly, through the temporal continuity of the identity of the individual seen in contiguous frames. We conclude in Section 6.

Section snippets

Canonical correlation analysis (CCA)

Canonical correlation analysis (CCA) was introduced by Hotelling (1936) to describe the linear relation between two multidimensional (or two sets of) variables as the problem of finding basis vectors for each set such that the projections of the two variables onto their respective basis vectors are maximally correlated (Hotelling, 1936, Rencher, 1997, Hardoon et al., 2004, Izenman, 2008). These two sets of variables may, for example, correspond to different views of the same semantic object (e.g. …
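The snippet above is truncated before Eq. (1); in standard notation (which may differ from the paper's), the CCA objective for zero-mean views x and y with covariance matrices C_xx, C_yy and cross-covariance C_xy is

    \rho = \max_{\mathbf{w}_x,\,\mathbf{w}_y}
           \frac{\mathbf{w}_x^{\top} C_{xy}\, \mathbf{w}_y}
                {\sqrt{\mathbf{w}_x^{\top} C_{xx}\, \mathbf{w}_x}\,
                 \sqrt{\mathbf{w}_y^{\top} C_{yy}\, \mathbf{w}_y}}

with subsequent basis pairs found subject to being uncorrelated with the previous ones.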

Fisher linear discriminant analysis (LDA)

Fisher linear discriminant analysis (LDA) is a variance-preserving approach whose goal is to find the optimal linear discriminant function (Fisher, 1936, Rencher, 1997, Raudys and Duin, 1998, Alpaydin, 2004, Izenman, 2008). As opposed to unsupervised methods such as principal component analysis (PCA), independent component analysis (ICA), or its two-view counterpart CCA, LDA utilizes the categorical class label information in finding informative projections; it considers maximizing an …
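For reference (again in standard notation rather than necessarily the paper's), LDA seeks the projection w that maximizes the Fisher criterion, the ratio of between-class scatter S_B to within-class scatter S_W:

    J(\mathbf{w}) = \frac{\mathbf{w}^{\top} S_B\, \mathbf{w}}
                         {\mathbf{w}^{\top} S_W\, \mathbf{w}}

whose maximizers are the leading eigenvectors of S_W^{-1} S_B.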

Within-class coupling CCA (WCCCA)

Clearly, for CCA to be applicable to a dataset D, two views are necessary, denoted X and Y in Eq. (1). However, constructing a form of the dummy class-label matrix L in Eq. (17) as the second of the two views is not the only way to create these views. We prove that CCA can be used to perform LDA with a different method of incorporating the class labels of the data samples. Let us create the two views by coupling pairs of samples from the same class (one for each view). For an …

Experimental results

In this section, we present the results obtained using WCCCA in two sets of experiments on the AT&T (ORL) face database, which is composed of 400 grayscale images of 40 different individuals, ten images per person. The images were taken at different times, varying the lighting, the viewing angle (frontal or semi-frontal), the facial expressions (open or closed eyes, smiling or not, etc.), and other facial details (e.g., with or without glasses). All …
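The paper's exact movie-construction protocol is the one described in this section; the following is only our minimal sketch of the idea, with hypothetical names: images of the same person are emitted contiguously, so that identity is conveyed solely by the temporal continuity of the frames.

    # Sketch: turn a labeled image set into a "movie" of per-person scenes.
    import numpy as np

    def make_movie(X, y, rng=np.random.default_rng(0)):
        """Each person becomes one contiguous scene; scene order is random."""
        order = []
        for c in rng.permutation(np.unique(y)):
            order.extend(rng.permutation(np.flatnonzero(y == c)))
        return X[np.asarray(order)]       # class labels now live only in frame order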

Conclusions

Fisher’s linear discriminant analysis (LDA) has two main goals: (1) minimize the within-class variance, and (2) maximize the between-class variance. LDA has long been known to be a special case of Hotelling’s canonical correlation analysis (CCA): performing CCA on a view consisting of the samples (predictive features) versus a second view made up directly of the class labels of those samples yields projections identical to those of LDA. In this paper, …

References (35)

  • Becker, S., 1999. Implicit learning in 3D object recognition: The importance of temporal context. Neural Comput.
  • Borga, M., Knutsson, H., 2001. A Canonical Correlation Approach to Blind Source Separation. Technical Report...
  • Borga, M., 1998. Learning Multidimensional Signal Processing. Ph.D. Thesis, Department of Electrical Engineering,...
  • Farquhar, J.D.R., Hardoon, D.R., Meng, H., Shawe-Taylor, J., Szedmak, S., 2005. Two view learning: SVM-2K, theory and...
  • Favorov, O.V., et al., 2004. SINBAD: A neocortical mechanism for discovering environmental variables and regularities hidden in sensory input. Biological Cybernet.
  • Favorov, O.V., et al. The cortical pyramidal cell as a set of interacting error backpropagating networks: A mechanism for discovering nature’s order.
  • Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugenic.

    The work of O. Kursun was supported by Scientific Research Projects Coordination Unit of Istanbul University under the grant YADOP-5323.
