Canonical correlation analysis networks for two-view image recognition
Introduction
In real-world image classification tasks, a crucial problem is intra-class variability, which derives from variations in lighting, rotation and deformation. Numerous efforts have been made to eliminate the variability within image classes, ranging from low-level features to deep learning network structures. Although many handcrafted low-level image features, such as local binary patterns (LBPs) [20], salient features [40] and scale-invariant feature transforms (SIFT) [19], can effectively extract the shape and texture features of a digital image, their direct application to new data sets is difficult [1], [28]. Thus, new domain knowledge, such as multiview techniques [21], [29], hashing algorithms [41], dictionary learning [23], [25], [44], manifold learning [24], [36] and subspace selection [35], is usually needed when generalizing manually designed features to new tasks, including image classification [22], [28], [39], action retrieval [29], image super-resolution [25] and efficient image search [21].
Deep learning (DL) has rapidly developed in recent years, and many types of deep learning network-related algorithms have been successfully applied to image recognition tasks [2], [3], [15], [30], [33], [37], [38], [39], [42], [43]. The main idea of a deep network structure is to use features at different levels to represent different degrees of abstract semantics of an image, such as pixels, edges, motifs, parts, objects, and scenes [15]. These layered features, which are learned from training data rather than handcrafted as low-level features are, can effectively guarantee invariance to intra-class variability. Representative deep learning methods include DNNs [6], [33], CNNs [5], [10], [11], [14], [17], [18], [32], [34], ScatNet [2], [27], [31] and PCANet [3].
Deep neural networks (DNNs) [6], [33] employ a hierarchical structure to extract a multistage representation of data. Hinton et al. [6] utilized complementary priors to derive a fast, greedy algorithm that can rapidly learn parameters. Sun et al. [33] proposed two very deep neural networks based on stacked convolution architectures [32] and inception layers [34] for face recognition.
A convolutional neural network (CNN) [5], [10], [11], [14], [17], [18], [32], [34] incorporates a convolution structure in each trainable stage, which is usually composed of three layers: a convolutional filter layer, a nonlinearity layer, and a feature pooling layer. In a convolutional layer, the filter kernels are generally learned by stochastic gradient descent (SGD) [14], and each filter detects a particular feature of the input image. Therefore, the output of each convolutional layer shifts correspondingly with translations of the input image [15]. In the CNN method, parameter tuning is a time-consuming task that requires specific techniques. Krizhevsky et al. [11] designed a network for a large image dataset that contains 650,000 neurons and 60 million parameters to train. Additionally, high recognition accuracy is ensured by a sufficiently deep structure [32], [34]. For example, Simonyan et al. [32] studied the influence of the depth of convolutional networks on large-scale image recognition tasks, and the best results were obtained when the model structure contained 16–19 layers. However, convolution-based deep networks lack an explicit mathematical explanation due to the nonlinear processing.
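The three-layer stage described above (convolutional filtering, nonlinearity, feature pooling) can be sketched as follows. This is a minimal illustration with hypothetical filter values, using ReLU and max pooling as the nonlinearity and pooling choices, not a reproduction of any specific CNN in the cited works:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2-D correlation of a single-channel image with one filter kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv_stage(image, kernels, pool=2):
    """One CNN stage: filter bank -> ReLU nonlinearity -> max pooling."""
    maps = []
    for k in kernels:
        fmap = np.maximum(conv2d_valid(image, k), 0.0)   # nonlinearity layer
        h, w = fmap.shape
        fmap = fmap[:h - h % pool, :w - w % pool]        # crop to pool multiple
        pooled = fmap.reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))
        maps.append(pooled)
    return np.stack(maps)

rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12))
kernels = rng.standard_normal((4, 3, 3))   # a toy bank of L1 = 4 filters
features = conv_stage(image, kernels)
print(features.shape)   # (4, 5, 5)
```

In a real CNN the kernel values would be learned by SGD rather than drawn at random; the sketch only shows the data flow of one stage.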
The wavelet scattering network (ScatNet) [2], [27], [31] is the first such algorithm with a distinct mathematical basis. Bruna et al. [2] implemented a scattering transform with a deep convolutional network composed of a cascaded wavelet transform and a modulus pooling operator. Compared with a CNN, ScatNet uses prefixed filters that are wavelet operators; therefore, the filters in ScatNet are obtained without learning [2], [27], [31]. Although the filter bank is predetermined, the experimental results of ScatNet are remarkable and are superior to those of DNNs and CNNs in some vision-based recognition tasks, including handwritten digit recognition, texture discrimination [2], [31] and object classification [27]. However, when this pre-fixed structure is extended to face recognition, in which the intra-class variation is significant, the results are not satisfactory [3].
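A single scattering step, wavelet convolution followed by a modulus, can be sketched with prescribed (not learned) complex Gabor-like filters. The filter construction below is a simplified stand-in for the wavelet operators of ScatNet, chosen only to show that the filter bank is fixed in advance:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def gabor_kernel(size, theta, freq=0.4):
    """A fixed complex Gabor-like wavelet: prescribed, never learned."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    u = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * (size / 3.0) ** 2))
    return envelope * np.exp(2j * np.pi * freq * u)

def scatter_layer(image, thetas, size=7):
    """One scattering step: |x * psi_theta| for each orientation theta."""
    windows = sliding_window_view(image, (size, size))
    maps = []
    for theta in thetas:
        k = gabor_kernel(size, theta)
        fmap = np.abs(np.einsum('ijkl,kl->ij', windows, k))  # modulus pooling
        maps.append(fmap)
    return np.stack(maps)

rng = np.random.default_rng(1)
image = rng.standard_normal((16, 16))
maps = scatter_layer(image, thetas=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4])
print(maps.shape)   # (4, 10, 10)
```

The modulus makes each output map nonnegative and approximately invariant to small translations, which is the property the cascaded structure builds on.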
Chan et al. [3] built a principal component analysis network (PCANet) that employs cascaded PCA to learn two layers of filter banks, followed by binarization and block-wise histogramming to pool the final feature. The architecture of PCANet is very simple, without numerous parameters to tune in the training stage. This seemingly naive structure performs on par with, and often better than, well-designed low-level features, DNNs, CNNs and ScatNet on several well-known databases, including LFW, MultiPIE, Extended Yale-B, AR, FERET and MNIST [3].
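The filter-learning step of a PCA-based network can be sketched as follows: collect all k × k patches from the training images, remove the mean of each patch, and take the leading principal components as convolution filters. This is a minimal single-stage sketch under toy data, not the full two-layer PCANet of [3]:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def pca_filters(images, k=5, num_filters=4):
    """Learn PCANet-style filters: top principal components of
    mean-removed k x k patches collected over all training images."""
    patches = []
    for img in images:
        p = sliding_window_view(img, (k, k)).reshape(-1, k * k)
        p = p - p.mean(axis=1, keepdims=True)   # patch-wise mean removal
        patches.append(p)
    X = np.concatenate(patches, axis=0)         # (num_patches, k*k)
    # Principal components = leading eigenvectors of X^T X
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:num_filters]]
    return top.T.reshape(num_filters, k, k)     # each row -> one k x k filter

rng = np.random.default_rng(2)
train = rng.standard_normal((10, 20, 20))       # N = 10 toy training images
W = pca_filters(train, k=5, num_filters=4)      # L1 = 4 learned filters
print(W.shape)   # (4, 5, 5)
```

Because the filters are eigenvectors of a patch covariance matrix, they form an orthonormal set; the second stage of PCANet repeats the same procedure on the first stage's filter outputs.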
These deep learning network-related methods can only handle circumstances in which the input images are represented by a single view. To handle two-view cases and achieve more robust performance, we propose a canonical correlation analysis network (CCANet) in this paper. Two-view multistage filter banks are learned by the CCA method, which finds the principal filters by maximizing the correlation between the projected two-view variables. Thus, the filters reflect more comprehensive information about the same object than those of PCANet. Fig. 1 illustrates the framework of a two-convolutional-stage CCANet. In the output stage of CCANet, binarization is adopted as the nonlinear process instead of a rectified sigmoid function [15] or ReLU function [11], and a block-wise histogram method is employed to form the final feature representation. Our proposed CCANet model has three significant advantages. (1) CCANet can simultaneously consider two-view features of one image, which is more robust than the use of a single view in classification tasks with respect to intra-class variance. (2) CCANet has fewer convolutional stages than typical convolutional neural networks [11], [14], [34]; an unsupervised learning method is adopted in CCANet instead of the backpropagation algorithm of a typical CNN [12], [13], and the number of parameters in CCANet is small. (3) We also introduce a variation of the CCANet, named RandNet-2, which employs randomly generated filter banks (the filter elements are i.i.d. Gaussian) to replace the filter banks in the CCANet structure. To verify the effectiveness of the proposed CCANet and RandNet-2, we conduct extensive experiments using the ETH-80 database for object recognition, the Yale-B database for face verification and the USPS database for handwritten digit classification.
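The core CCA computation, finding paired projection directions that maximize the correlation between two views, can be sketched via the standard whitening-plus-SVD solution. The data here are synthetic two-view vectors sharing a latent signal; this illustrates classical CCA only, not the full patch-based filter-bank learning of CCANet:

```python
import numpy as np

def cca_directions(X, Y, num_pairs=3, reg=1e-6):
    """Classical CCA: find projection pairs (w_x, w_y) maximizing the
    correlation between w_x^T X and w_y^T Y.  X, Y: (dim, samples)."""
    X = X - X.mean(axis=1, keepdims=True)
    Y = Y - Y.mean(axis=1, keepdims=True)
    n = X.shape[1]
    Cxx = X @ X.T / n + reg * np.eye(X.shape[0])   # view-1 covariance
    Cyy = Y @ Y.T / n + reg * np.eye(Y.shape[0])   # view-2 covariance
    Cxy = X @ Y.T / n                              # cross-covariance
    # Whiten each view; the SVD of the whitened cross-covariance
    # then yields the canonical pairs and correlations.
    Lx = np.linalg.cholesky(np.linalg.inv(Cxx))
    Ly = np.linalg.cholesky(np.linalg.inv(Cyy))
    U, s, Vt = np.linalg.svd(Lx.T @ Cxy @ Ly)
    Wx = Lx @ U[:, :num_pairs]          # directions for view 1
    Wy = Ly @ Vt.T[:, :num_pairs]       # directions for view 2
    return Wx, Wy, s[:num_pairs]        # s holds the canonical correlations

rng = np.random.default_rng(3)
Z = rng.standard_normal((2, 200))                                # shared latent signal
X = rng.standard_normal((6, 2)) @ Z + 0.1 * rng.standard_normal((6, 200))
Y = rng.standard_normal((6, 2)) @ Z + 0.1 * rng.standard_normal((6, 200))
Wx, Wy, corrs = cca_directions(X, Y, num_pairs=2)
print(corrs)   # leading canonical correlations, close to 1 here
```

In CCANet the columns of X and Y would be vectorized image patches from the two views, and the resulting direction pairs would be reshaped into the two-view convolution filter banks.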
The experimental results demonstrate that CCANet achieves higher recognition accuracy than representative deep learning network-related methods, including PCANet, for object, face and handwritten digit recognition.
The remainder of this paper is arranged as follows: several types of related networks are described in Section 2. Section 3 presents details of the proposed CCANet. The experimental results are provided in Section 4. The conclusions are presented in Section 5.
Related works
In this section, we summarize several networks related to the CCANet, including the principal component analysis network (PCANet), two-dimensional principal component analysis network (2DPCANet), discrete cosine transform network (DCTNet), kernel principal component analysis network (KPCANet) and stacked principal component analysis network (SPCANet). Assume that N training images, each of size m × n, are given, and that the number of filters in the ith convolutional stage is Li.
PCANet [3]
Canonical correlation analysis networks
CCANet extracts two different view features of one object to generate the final expression, which yields higher recognition accuracy than the accuracy with a single view. The CCANet architecture can be divided into two parts. The first part consists of cascaded convolutional stages. In this part, the optimized two-view multistage filter banks are learned by the CCA method. The second part is the feature pooling stage. In this part, all filtered images are integrated into a feature vector, which
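The feature pooling stage mentioned above, binarization of the filtered maps followed by block-wise histograms, can be sketched as follows. This is an illustrative single-map version under toy data; block size and overlap are hypothetical choices, not those fixed by the paper:

```python
import numpy as np

def pool_features(filtered_maps, block=4):
    """Output-stage sketch: binarize L filtered maps, fuse them into one
    integer-valued map, then concatenate block-wise histograms."""
    L = len(filtered_maps)
    binary = [(m > 0).astype(np.int64) for m in filtered_maps]  # Heaviside binarization
    fused = sum(b << i for i, b in enumerate(binary))           # values in [0, 2^L)
    h, w = fused.shape
    feats = []
    for i in range(0, h - block + 1, block):        # non-overlapping blocks
        for j in range(0, w - block + 1, block):
            hist, _ = np.histogram(fused[i:i + block, j:j + block],
                                   bins=2 ** L, range=(0, 2 ** L))
            feats.append(hist)
    return np.concatenate(feats)

rng = np.random.default_rng(4)
maps = rng.standard_normal((3, 8, 8))    # L = 3 filtered images of one input
f = pool_features(list(maps), block=4)
print(f.shape)   # 4 blocks x 2^3 bins = (32,)
```

Treating the L binary maps as bit planes of a single integer image is what makes the histogram capture joint filter responses rather than each filter in isolation.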
Experiments
In this section, we test the proposed CCANet and its variant RandNet-2 using several public databases, including ETH-80 [16], Yale-B + extended Yale-B [4] and USPS [8], for object recognition, face recognition and handwritten digit recognition, respectively. For convenience, we use RandNet-1 and RandNet-2 to denote randomly generated filter banks in the PCANet architecture and the CCANet architecture, respectively. In the ETH-80 database, we extract different color
Conclusions
Deep learning (DL) has proven to be a successful technique in machine learning and artificial intelligence, as demonstrated by abundant practical applications. However, traditional DL-related methods, such as DNNs, CNNs, ScatNet and PCANet, cannot address the situation in which sample images are represented by two-view features. In this paper, we propose canonical correlation analysis networks (CCANet) to overcome this problem. In the CCANet architecture, two-view multistage filter banks are learned by a
Acknowledgment
This study was supported by the National Natural Science Foundation of China under Grants 61671048, 61301242, 61271407, 61572486, 61402458, 614002-1567, and 6140051238; the Fundamental Research Funds for the Central Universities, China University of Petroleum (East China), under Grants 14CX02203A and YCXJ2016075; the Yunnan Natural Science Funds under Grant 2016FB105; the Guangdong Natural Science Funds under Grants 2014A030310252 and 2015A030-313744; the Shenzhen Technology Project under
References (44)
- et al., Ga-sift: a new scale invariant feature transform for multispectral image using geometric algebra, Inf. Sci. (2014)
- et al., Extended local binary patterns for face recognition, Inf. Sci. (2016)
- et al., Local receptive field constrained deep networks, Inf. Sci. (2016)
- et al., Statistical quantization for similarity search, Comput. Vis. Image Underst. (2014)
- et al., Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
- et al., Invariant scattering convolution networks, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
- et al., PCANet: a simple deep learning baseline for image classification?, IEEE Trans. Image Process. (2015)
- et al., Correlation metric for generalized feature extraction, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
- et al., Maxout networks, Proceedings of the International Conference on Machine Learning, ICML (3) (2013)
- et al., A fast learning algorithm for deep belief nets, Neural Comput. (2006)
- Relations between two sets of variates, Biometrika
- A database for handwritten text recognition research, IEEE Trans. Pattern Anal. Mach. Intell.
- 2DPCANet: dayside aurora classification based on deep learning, Proceedings of the CCF Chinese Conference on Computer Vision
- Learning convolutional feature hierarchies for visual recognition, Advances in Neural Information Processing Systems
- ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems
- Handwritten digit recognition with a back-propagation network, Advances in Neural Information Processing Systems
- Backpropagation applied to handwritten zip code recognition, Neural Comput.
- Gradient-based learning applied to document recognition, Proc. IEEE
- Convolutional networks and applications in vision, Proceedings of the International Symposium on Circuits and Systems, ISCAS
- Analyzing appearance and contour based methods for object categorization, Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
- 3D object understanding with 3D convolutional neural networks, Inf. Sci.
- A convolutional neural network cascade for face detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition