Pattern Recognition Letters

Volume 32, Issue 2, 15 January 2011, Pages 134-144

Canonical correlation analysis using within-class coupling

https://doi.org/10.1016/j.patrec.2010.09.025

Abstract

Fisher’s linear discriminant analysis (LDA) is one of the most popular supervised linear dimensionality reduction methods. Unfortunately, LDA is not suitable for problems where the class labels are not available and only the spatial or temporal association of data samples is implicitly indicative of class membership. In this study, a new strategy for reducing LDA to Hotelling’s canonical correlation analysis (CCA) is proposed. CCA seeks maximally correlated projections between two views of data, and it has long been known to be equivalent to LDA when the data features are used as one view and the class labels as the other. The basic idea of the new equivalence between LDA and CCA, which we call within-class coupling CCA (WCCCA), is to apply CCA to pairs of data samples that are most likely to belong to the same class. We prove the equivalence between LDA and such an application of CCA. With this implicit representation of the class labels, WCCCA is applicable both to regular LDA problems and to problems in which only spatial and/or temporal continuity provides clues to the class labels.

Research highlights

► The samples-versus-class-labels equivalence between LDA and CCA is extended to a samples-versus-samples basis, which can be viewed as accomplishing LDA through an indirect, distributed, and implicit presentation of the categorical class labels.

► The method is applicable both to regular LDA problems and to problems in which the class labels are not explicitly available but can be tracked down in the patterns of the data through spatial and/or temporal continuity, as in splitting a video into scenes (sequences of related frames), segmenting an image into regions that share certain visual characteristics, speech analysis, or biological sequence analysis.

► For demonstration, the ORL face dataset is made into a movie in which consecutive frames are more likely to show the same individual than different ones.

► When a scene change occurs, the movie continues with the pictures of another individual, and so on.

► Applied to this movie, the method works just as if LDA had been given the actual class labels.

Introduction

Fisher’s linear discriminant analysis (LDA; Fisher, 1936) and Hotelling’s canonical correlation analysis (CCA; Hotelling, 1936) are among the oldest, yet most powerful, multivariate data analysis techniques. LDA is one of the most popular supervised dimensionality reduction methods; it incorporates the categorical class labels of the data samples into a search for linear projections of the data that maximize the between-class variance while minimizing the within-class variance (Rencher, 1997, Alpaydin, 2004, Izenman, 2008).

On the other hand, CCA works with two sets of (related) variables, and its goal is to find a linear projection of the first set of variables that maximally correlates with a linear projection of the second set. These sets have recently also been referred to as views or representations (Hardoon et al., 2004). Finding correlated functions (covariates) of the two views of the same phenomenon, while discarding the representation-specific details (noise), is expected to reveal the underlying hidden yet influential semantic factors responsible for the correlation (Hardoon et al., 2004, Becker, 1999, Favorov and Ryder, 2004, Favorov et al., 2003).

Both LDA and CCA were proposed in 1936, and shortly afterwards a direct link between them was shown by Bartlett (1938), as follows. Given a dataset of samples and their class labels, if we take the features of the data samples as one view and the class labels as the other view (a single binary variable suffices for the two-class problem, but a form of 1-of-C coding scheme is typically used for multi-class categorical class labels), this CCA setup is known to be equivalent to LDA (Bartlett, 1938, Hastie et al., 1995). In other words, LDA can simply be said to be a special case of CCA.
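To make this classical equivalence concrete, the following is a minimal sketch (ours, not the paper's) on synthetic data; the toy data, variable names, and the use of scikit-learn's CCA and LDA implementations are illustrative assumptions. Up to sign and scale, the CCA covariates of the feature view should match the LDA projections.

    # Sketch: CCA between features X and one-hot labels L reproduces LDA.
    import numpy as np
    from sklearn.cross_decomposition import CCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)
    means = np.array([[0., 0, 0, 0, 0], [4, 1, 0, 0, 0], [0, 4, 1, 0, 0]])
    X = np.vstack([m + rng.normal(size=(50, 5)) for m in means])  # 3 Gaussian classes
    y = np.repeat([0, 1, 2], 50)
    L = np.eye(3)[y]                      # 1-of-C (one-hot) label view

    cca = CCA(n_components=2).fit(X, L)   # C classes give C-1 useful covariates
    lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

    Zc, _ = cca.transform(X, L)           # CCA projections of the feature view
    Zl = lda.transform(X)                 # LDA projections
    for i in range(2):
        r = np.corrcoef(Zc[:, i], Zl[:, i])[0, 1]
        print(f"component {i}: |corr(CCA, LDA)| = {abs(r):.3f}")  # expected near 1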

Knowledge of this insightful equivalence between LDA and CCA led researchers to attempt to use CCA to surpass the quality of the LDA projections. These attempts paired the samples with their class labels, using several other forms of representation for the labels (Loog et al., 2005, Barker and Rayens, 2003, Gestel et al., 2001, Johansson, 2001, Sun and Chen, 2007). An interesting example of such a label transformation is the replacement of hard categorical labels with soft labels: in (Sun and Chen, 2007), similarly to the support-vector idea, the aim was to put more weight on the samples near the class boundaries rather than using a common label for all samples of a class; more useful projections were found because more focus was placed on the problematic regions of the input space than on the high-density regions around the class centers. Another example is the image segmentation study in (Loog et al., 2005), which uses image-pixel features and their associated class labels to learn to classify pixels. Their CCA-based method also incorporates the class labels of the neighboring pixels, which can naturally be expected to yield LDA-like (but possibly more informative) projections. The method can be applied to other, non-image, forms of data by accounting for the spatial class-label configuration in the vicinity of every feature vector (Loog et al., 2005).

In this paper, we present another extension of CCA to LDA, along with its equivalence proof. The main idea is to transform the class label of a sample such that it is represented, in a distributed manner, by all the samples of that same class. In other words, CCA is asked to produce correlated outputs (projections) for any pair of samples that belong to the same class; we call this scheme within-class coupling CCA (WCCCA). Despite its increased complexity, this extension of CCA to LDA has various advantages (see Section 4.2 for a detailed list). One important advantage of the WCCCA idea of using samples versus samples as the two views is its ability to perform a form of implicitly supervised LDA (see Section 5.2), since the class labels are sometimes embedded in the patterns of the data rather than being explicitly available, for example in the patterns of spatial and temporal continuity (Becker, 1999, Favorov and Ryder, 2004, Favorov et al., 2003, Borga and Knutsson, 2001, Stone, 1996). Exemplary applications on such data include dividing a video into sequences of related frames (scenes), segmenting an image into regions that share certain visual characteristics, identifying sequences of acoustic frames belonging to the same word in speech analysis, and finding sequences of base pairs or amino acids belonging to the same protein in biological sequence analysis. In such settings, the use of LDA is difficult, if not impossible.
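The pairing itself is easy to sketch. The snippet below is our illustration, not the authors' code; the paper's exact pairing scheme and its proof are in Section 4. It assumes the simplest scheme, coupling every ordered pair of same-class samples, so that the class labels enter only implicitly through the coupling:

    # Sketch: WCCCA views are same-class sample pairs; no label matrix is built.
    import itertools
    import numpy as np
    from sklearn.cross_decomposition import CCA

    def within_class_views(X, y):
        """Stack every ordered same-class pair (x_i, x_j), i != j, as two views."""
        left, right = [], []
        for c in np.unique(y):
            idx = np.flatnonzero(y == c)
            for i, j in itertools.permutations(idx, 2):
                left.append(X[i])
                right.append(X[j])
        return np.asarray(left), np.asarray(right)

    rng = np.random.default_rng(0)
    means = np.array([[0., 0, 0, 0, 0], [4, 1, 0, 0, 0], [0, 4, 1, 0, 0]])
    X = np.vstack([m + rng.normal(size=(50, 5)) for m in means])  # toy data
    y = np.repeat([0, 1, 2], 50)          # used only to form the pairs

    Xl, Xr = within_class_views(X, y)
    wccca = CCA(n_components=2).fit(Xl, Xr)
    Z = wccca.transform(X)                # LDA-like projections of the original data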

The idea of applying CCA, or other forms of mutual information maximization models, between, for example, the consecutive frames of a video or neighboring image patches in order to find correlated functions is not a new one (Becker, 1999, Favorov and Ryder, 2004, Favorov et al., 2003, Borga and Knutsson, 2001, Borga, 1998, Stone, 1996, Kording and Konig, 2000, Phillips et al., 1995, Phillips and Singer, 1997). Many of these attempts are inspired by the learning mechanisms hypothesized to be used by neurons in the cerebral cortex. For example, cortical neurons might tune to correlated functions between their own afferent inputs and the lateral inputs they receive from other neurons with different but functionally related afferent inputs. Thus, groups of neurons receiving such different but related afferent inputs can learn to produce correlated outputs under the contextual guidance of each other (Phillips et al., 1995, Phillips and Singer, 1997). However, whether the correlated functions found this way are good for discrimination has not been mathematically justified. Would such covariates be suitable projections for clustering the frames into scenes or for image segmentation? The results of our study show that such a CCA application is comparable to performing LDA; and since LDA projections maximize the between-class variance and minimize the within-class variance, the covariates found this way are indeed useful, for example, for clustering the frames into scenes.
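As a fully label-free illustration of this temporal-coupling idea (our sketch, not the authors' code; `frames` and all other names are placeholders), consecutive frames can be coupled as the two CCA views, on the assumption that scene changes are rare compared with within-scene transitions:

    # Sketch: couple each frame with its successor and find shared covariates.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    def temporal_views(frames):
        """Two views: every frame paired with its immediate successor."""
        F = np.asarray(frames)
        return F[:-1], F[1:]

    frames = np.random.default_rng(1).normal(size=(500, 32))  # placeholder (T, d) features
    Fl, Fr = temporal_views(frames)
    model = CCA(n_components=4).fit(Fl, Fr)
    Z = model.transform(frames)           # projections for clustering frames into scenes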

This paper is organized as follows. In Sections 2 and 3, we review the CCA and LDA techniques, respectively. In Section 4, we present the WCCCA idea of using CCA on a samples-versus-samples basis and provide the proof of its equivalence to LDA; we also show on a toy example that the theoretically derived equivalence holds in practice, discuss the advantages and disadvantages of this way of performing LDA, and close the section with the nonlinear kernel extension of WCCCA. In Section 5, we present the experimental results on a face database and show that WCCCA can perform the task of LDA even when the images are made into a movie and the class label information is kept only implicitly, through the temporal continuity of the identity of the individual seen in contiguous frames. We conclude in Section 6.

Section snippets

Canonical correlation analysis (CCA)

Canonical correlation analysis (CCA) was introduced by Hotelling (1936) to describe the linear relation between two multidimensional (or two sets of) variables as the problem of finding basis vectors for each set such that the projections of the two variables onto their respective basis vectors are maximally correlated (Hotelling, 1936, Rencher, 1997, Hardoon et al., 2004, Izenman, 2008). These two sets of variables may, for example, correspond to different views of the same semantic object (e.g. …
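The snippet above is truncated before Eq. (1); in standard notation (which may differ from the paper's), the CCA objective for zero-mean views x and y with covariance matrices C_xx, C_yy and cross-covariance C_xy is

    \rho = \max_{\mathbf{w}_x,\,\mathbf{w}_y}
           \frac{\mathbf{w}_x^{\top} C_{xy}\, \mathbf{w}_y}
                {\sqrt{\mathbf{w}_x^{\top} C_{xx}\, \mathbf{w}_x}\,
                 \sqrt{\mathbf{w}_y^{\top} C_{yy}\, \mathbf{w}_y}}

with subsequent basis pairs found subject to being uncorrelated with the previous ones.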

Fisher linear discriminant analysis (LDA)

Fisher linear discriminant analysis (LDA) is a variance-preserving approach whose goal is to find the optimal linear discriminant function (Fisher, 1936, Rencher, 1997, Raudys and Duin, 1998, Alpaydin, 2004, Izenman, 2008). As opposed to unsupervised methods such as principal component analysis (PCA), independent component analysis (ICA), or its two-view counterpart CCA, LDA utilizes the categorical class label information in finding informative projections; it considers maximizing an …
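For reference (again in standard notation rather than necessarily the paper's), LDA seeks the projection w that maximizes the Fisher criterion, the ratio of between-class scatter S_B to within-class scatter S_W:

    J(\mathbf{w}) = \frac{\mathbf{w}^{\top} S_B\, \mathbf{w}}
                         {\mathbf{w}^{\top} S_W\, \mathbf{w}}

whose maximizers are the leading eigenvectors of S_W^{-1} S_B.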

Within-class coupling CCA (WCCCA)

Clearly, for CCA to be applicable to a dataset D, two views are necessary, denoted X and Y in Eq. (1). However, constructing a form of the dummy class-label matrix L in Eq. (17) as the second of the two views is not the only way to create these views. We prove that CCA can be used to perform LDA with a different method of incorporating the class labels of the data samples. Let us create the two views by coupling pairs of samples from the same class (one for each view). For an …

Experimental results

In this section, we present the results obtained using WCCCA in two sets of experiments on the AT&T (ORL) face database, which is composed of 400 grayscale images of 40 different individuals, ten images per person. The images were taken at different times, varying the lighting, the viewing angle (frontal or semi-frontal), the facial expressions (open or closed eyes, smiling or not, etc.), and other facial details (e.g., with or without glasses). All …
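The paper's exact movie-construction protocol is the one described in this section; the following is only our minimal sketch of the idea, with hypothetical names: images of the same person are emitted contiguously, so that identity is conveyed solely by the temporal continuity of the frames.

    # Sketch: turn a labeled image set into a "movie" of per-person scenes.
    import numpy as np

    def make_movie(X, y, rng=np.random.default_rng(0)):
        """Each person becomes one contiguous scene; scene order is random."""
        order = []
        for c in rng.permutation(np.unique(y)):
            order.extend(rng.permutation(np.flatnonzero(y == c)))
        return X[np.asarray(order)]       # class labels now live only in frame order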

Conclusions

Fisher’s linear discriminant analysis (LDA) has two main goals: (1) minimize the within-class variance, and (2) maximize the between-class variance. LDA has long been known to be a special case of Hotelling’s canonical correlation analysis (CCA): performing CCA on a view consisting of the samples (predictive features) versus a second view made up directly of the class labels of those samples yields projections identical to those of LDA. In this paper, …

References (35)

  • Becker, S., 1999. Implicit learning in 3D object recognition: The importance of temporal context. Neural Comput.
  • Borga, M., Knutsson, H., 2001. A Canonical Correlation Approach to Blind Source Separation. Technical Report...
  • Borga, M., 1998. Learning Multidimensional Signal Processing. Ph.D. Thesis, Department of Electrical Engineering,...
  • Farquhar, J.D.R., Hardoon, D.R., Meng, H., Shawe-Taylor, J., Szedmak, S., 2005. Two view learning: SVM-2K, theory and...
  • Favorov, O.V., et al., 2004. SINBAD: A neocortical mechanism for discovering environmental variables and regularities hidden in sensory input. Biological Cybernet.
  • Favorov, O.V., et al. The cortical pyramidal cell as a set of interacting error backpropagating networks: A mechanism for discovering nature’s order.
  • Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugenic.

    The work of O. Kursun was supported by Scientific Research Projects Coordination Unit of Istanbul University under the grant YADOP-5323.
