Information Sciences, Volumes 385–386, April 2017, Pages 338–352

Canonical correlation analysis networks for two-view image recognition

https://doi.org/10.1016/j.ins.2017.01.011

Abstract

In recent years, deep learning has attracted increasing attention in the machine learning and artificial intelligence communities. Many deep network architectures, such as deep neural networks (DNNs), the convolutional neural network (CNN), the wavelet scattering network (ScatNet) and the principal component analysis network (PCANet), have been proposed. Among them, PCANet is particularly effective and has achieved promising performance in image classification tasks such as face, object and handwritten digit recognition. However, PCANet can only handle data that are represented by single-view features. In this paper, we present a canonical correlation analysis network (CCANet) for image classification in which images are represented by two-view features. The CCANet learns two-view multistage filter banks by the canonical correlation analysis (CCA) method and constructs a cascaded convolutional deep network. The filters are then combined with binarization and block-wise histogram operations to form the final deep architecture. In addition, we introduce a variation of CCANet, dubbed RandNet-2, in which the filter banks are randomly generated. Extensive experiments are conducted using the ETH-80, Yale-B and USPS databases for object, face and handwritten digit classification, respectively. The experimental results demonstrate that the CCANet algorithm is more effective than PCANet, RandNet-1 and RandNet-2.

Introduction

In real-world image classification tasks, a crucial problem is intra-class variability, which arises from variations in lighting, rotation and deformation. Numerous efforts have been made to eliminate this variability within image classes, ranging from low-level features to deep learning network structures. Although many handcrafted low-level image features, such as local binary patterns (LBPs) [20], salient features [40] and the scale-invariant feature transform (SIFT) [19], can effectively extract the shape and texture of a digital image, applying them directly to new data sets is difficult [1], [28]. Thus, new domain knowledge, such as multiview techniques [21], [29], hashing algorithms [41], dictionary learning [23], [25], [44], manifold learning [24], [36] and subspace selection [35], is usually needed when generalizing manually designed features to new tasks, including image classification [22], [28], [39], action retrieval [29], image super-resolution [25] and efficient image search [21].

Deep learning (DL) has developed rapidly in recent years, and many types of deep network algorithms have been successfully applied to image recognition tasks [2], [3], [15], [30], [33], [37], [38], [39], [42], [43]. The main idea of a deep network structure is to use features at different levels to represent different degrees of abstract image semantics, such as pixels, edges, motifs, parts, objects and scenes [15]. These layered features, which are learned from training data rather than handcrafted as low-level features are, can effectively guarantee invariance to intra-class variability. Representative deep learning methods include DNNs [6], [33], CNN [5], [10], [11], [14], [17], [18], [32], [34], ScatNet [2], [27], [31] and PCANet [3].

Deep neural networks (DNNs) [6], [33] employ a hierarchical structure to extract a multistage representation of the data. Hinton et al. [6] utilized complementary priors to derive a fast, greedy algorithm that can learn the network parameters rapidly. Sun et al. [33] proposed two very deep neural networks based on a stacked convolution architecture [32] and inception layers [34] for face recognition.

A convolutional neural network (CNN) [5], [10], [11], [14], [17], [18], [32], [34] incorporates a convolution structure in each trainable stage, which is usually composed of three layers: a convolutional filter layer, a nonlinearity layer and a feature pooling layer. In a convolutional layer, the filter kernels are generally learned by stochastic gradient descent (SGD) [14], and each filter detects a particular feature of the input image; therefore, the output of each convolutional layer shifts in correspondence with translations of the input image [15]. In the CNN method, parameter tuning is a time-consuming task that requires specific techniques. Krizhevsky et al. [11] designed an expert-crafted network for a large image dataset that contains 650,000 neurons and 60 million parameters to train. Additionally, high recognition accuracy requires an adequately deep structure [32], [34]. For example, Simonyan et al. [32] studied the influence of the depth of convolutional networks in large-scale image recognition tasks and obtained the best results with models of 16–19 layers. Convolution-based deep networks lack an explicit mathematical explanation because of their nonlinear processing.

The wavelet scattering network (ScatNet) [2], [27], [31] is the first such algorithm with a distinct mathematical basis. Bruna et al. [2] implemented a scattering transform with a deep convolutional network composed of a cascaded wavelet transform and a modulus pooling operator. In contrast to a CNN, ScatNet uses prefixed filters, namely wavelet operators; therefore, the filters are obtained without learning [2], [27], [31]. Although the filter bank is predetermined, the experimental results of ScatNet are remarkable and superior to those of DNNs and CNN in some vision-based recognition tasks, including handwritten digit recognition, texture discrimination [2], [31] and object classification [27]. However, when this prefixed structure is extended to face recognition, in which the intra-class variation is significant, the results are not satisfactory [3].

Chan et al. [3] built the principal component analysis network (PCANet), which employs cascaded PCA to learn two layers of filter banks, followed by binarization and block-wise histograms to pool the final feature. The architecture of PCANet is very simple, with few parameters to tune in the training stage. This seemingly naive structure performs comparably to, and more commonly better than, well-designed low-level features, DNNs, CNN and ScatNet on several well-known databases, including LFW, MultiPIE, Extended Yale-B, AR, FERET and MNIST [3].

These deep learning methods can only handle cases in which the input images are represented by a single view. To handle two-view cases and achieve more robust performance, we propose a canonical correlation analysis network (CCANet) in this paper. Two-view multistage filter banks are learned by the CCA method, which finds the principal filters by maximizing the correlation between the projected two-view variables. Thus, the filters can reflect more comprehensive information about the same object than those of PCANet. Fig. 1 illustrates the framework of a two-convolutional-stage CCANet. In the output stage of CCANet, binarization is adopted as the nonlinear process instead of a rectified sigmoid function [15] or ReLU function [11], and a block-wise histogram method is employed to form the final feature representation. Our proposed CCANet model has three significant advantages. (1) CCANet can simultaneously consider two-view features of one image, which is more robust against intra-class variance in classification tasks than the use of a single view. (2) CCANet has fewer convolutional stages than typical convolutional neural networks [11], [14], [34]; an unsupervised learning method is adopted instead of the backpropagation algorithm of a typical CNN [12], [13], and the number of parameters in CCANet is small. (3) We also introduce a variation of the CCANet, named RandNet-2, which replaces the filter banks in the CCANet structure with randomly generated ones (the filter entries are i.i.d. Gaussian). To verify the effectiveness of the proposed CCANet and RandNet-2, we conduct extensive experiments using the ETH-80 database for object recognition, the Yale-B database for face verification and the USPS database for handwritten digit classification.
The experimental results demonstrate that CCANet achieves higher recognition accuracy than representative deep learning methods, including PCANet, for object, face and handwritten digit recognition.
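The core CCA step that produces the two filter banks can be illustrated with a minimal numerical sketch. This is not the paper's exact pipeline, which operates on vectorized image patches from each view; the function name `cca_filters`, the data layout and the regularization term `reg` are our own assumptions:

```python
import numpy as np

def cca_filters(X, Y, L, reg=1e-4):
    """Return L pairs of CCA projection vectors for two zero-mean views.

    X: (dx, N) view-1 data, Y: (dy, N) view-2 data; columns are samples.
    """
    N = X.shape[1]
    Cxx = X @ X.T / N + reg * np.eye(X.shape[0])   # view-1 covariance
    Cyy = Y @ Y.T / N + reg * np.eye(Y.shape[0])   # view-2 covariance
    Cxy = X @ Y.T / N                              # cross-covariance
    # Whiten each view, then take the top singular directions of the
    # whitened cross-covariance: these maximize the correlation between
    # the projected two-view variables.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    return Wx.T @ U[:, :L], Wy.T @ Vt[:L].T        # filters for each view
```

In a CCANet-like setting, the columns of the two returned matrices would be reshaped into small convolution kernels for the two views; here they are plain projection vectors.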

The remainder of this paper is arranged as follows: several types of related networks are described in Section 2. Section 3 presents details of the proposed CCANet. The experimental results are provided in Section 4. The conclusions are presented in Section 5.

Section snippets

Related works

In this section, we summarize several networks related to the CCANet, including the principal component analysis network (PCANet), the two-dimensional principal component analysis network (2DPCANet), the discrete cosine transform network (DCTNet), the kernel principal component analysis network (KPCANet) and the stacked principal component analysis network (SPCANet). Assume that N training images {Ii}i=1N are given, each of size m × n, and that the number of filters in the ith convolutional stage is Li.

PCANet [3]
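As a rough sketch of the filter learning this snippet refers to: in PCANet's ith stage, the Li filters are the leading eigenvectors of the covariance of mean-removed, vectorized image patches. The function name and data layout below are our own assumptions:

```python
import numpy as np

def pca_filters(patches, L):
    """patches: (d, M) mean-removed vectorized patches; returns (d, L).

    The L leading eigenvectors of the patch covariance serve as filters,
    which PCANet then reshapes into small convolution kernels.
    """
    cov = patches @ patches.T / patches.shape[1]
    vals, vecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    return vecs[:, ::-1][:, :L]           # top-L principal directions
```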

Canonical correlation analysis networks

CCANet extracts two different view features of one object to generate the final representation, which yields higher recognition accuracy than a single view. The CCANet architecture can be divided into two parts. The first part consists of cascaded convolutional stages, in which the optimized two-view multistage filter banks are learned by the CCA method. The second part is the feature pooling stage, in which all filtered images are integrated into a feature vector, which
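The feature pooling stage described above follows PCANet's output stage: the L2 filtered maps of each image are Heaviside-binarized, packed into one integer map, and summarized by block-wise histograms. A minimal sketch, with the function name, block size and non-overlapping stride as our own defaults:

```python
import numpy as np

def pool_features(maps, block, stride=None):
    """maps: list of L2 same-shape filtered images; returns a 1-D feature."""
    # Binarize each map and pack the bits into one integer-valued map.
    T = np.zeros_like(maps[0], dtype=np.int64)
    for l, O in enumerate(maps):
        T += (O > 0).astype(np.int64) << l     # 2^l * Heaviside(O_l)
    if stride is None:
        stride = block                          # non-overlapping blocks
    n_bins = 2 ** len(maps)
    h, w = T.shape
    feats = []
    for i in range(0, h - block + 1, stride):
        for j in range(0, w - block + 1, stride):
            hist, _ = np.histogram(T[i:i + block, j:j + block],
                                   bins=np.arange(n_bins + 1))
            feats.append(hist)
    return np.concatenate(feats)
```

Each block contributes a histogram with 2^L2 bins, and the concatenation over all blocks forms the pooled feature vector for one filtered image.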

Experiments

In this section, we test the proposed CCANet and its variation RandNet-2 using several public databases, including ETH-80 [16], Yale-B + extended Yale-B [4] and USPS [8], for object recognition, face recognition and handwritten digit recognition, respectively. For convenience, we use RandNet-1 and RandNet-2 to denote randomly generated filter banks in the PCANet architecture and the CCANet architecture, respectively. In the ETH-80 database, we extract different color

Conclusions

Deep learning (DL) has proven to be a successful technique in the machine learning and artificial intelligence areas through abundant practical applications; however, traditional DL-related methods, such as DNN, CNN, ScatNet and PCANet, cannot address the situation in which sample images are represented by two-view features. In this paper, we propose the canonical correlation analysis network (CCANet) to overcome this problem. In the CCANet architecture, two-view multistage filter banks are learned by a

Acknowledgment

This study was supported by the National Natural Science Foundation of China under Grants 61671048, 61301242, 61271407, 61572486, 61402458, 614002-1567, and 6140051238; the Fundamental Research Funds for the Central Universities, China University of Petroleum (East China), under Grants 14CX02203A and YCXJ2016075; the Yunnan Natural Science Funds under Grant 2016FB105; the Guangdong Natural Science Funds under Grants 2014A030310252 and 2015A030-313744; the Shenzhen Technology Project under

References (44)

  • H. Hotelling

    Relations between two sets of variates

    Biometrika

    (1936)
  • J.J. Hull

    A database for handwritten text recognition research

    IEEE Trans. Pattern Anal. Mach. Intell.

    (1994)
  • Z. Jia et al.

    2DPCANet: dayside aurora classification based on deep learning

    Proceedings of the CCF Chinese Conference on Computer Vision

    (2015)
  • K. Kavukcuoglu et al.

    Learning convolutional feature hierarchies for visual recognition

    Advances in Neural Information Processing Systems

    (2010)
  • A. Krizhevsky et al.

    Imagenet classification with deep convolutional neural networks

    Advances in Neural Information Processing Systems

    (2012)
  • B.B. Le Cun et al.

    Handwritten digit recognition with a back-propagation network

    Advances in Neural Information Processing Systems

    (1990)
  • Y. LeCun et al.

    Backpropagation applied to handwritten zip code recognition

    Neural Comput.

    (1989)
  • Y. LeCun et al.

    Gradient-based learning applied to document recognition

    Proc. IEEE

    (1998)
  • Y. LeCun et al.

    Convolutional networks and applications in vision

    Proceedings of International Symposium on Circuits and Systems, ISCAS

    (2010)
  • B. Leibe et al.

    Analyzing appearance and contour based methods for object categorization

    Proceedings of 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition

    (2003)
  • B. Leng et al.

    3D object understanding with 3D convolutional neural networks

    Inf. Sci.

    (2015)
  • H. Li et al.

    A convolutional neural network cascade for face detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2015)