Pattern Recognition Letters

Volume 33, Issue 9, 1 July 2012, Pages 1196-1204

Image classification by multimodal subspace learning

https://doi.org/10.1016/j.patrec.2012.02.002

Abstract

In recent years we have witnessed a surge of interest in subspace learning for image classification. However, previous methods suffer from limited accuracy because they do not consider multiple features of the images. For instance, a color image can be represented by a set of visual features describing its color, texture and shape. Based on the “Patch Alignment” Framework, we develop a new subspace learning method, termed Semi-Supervised Multimodal Subspace Learning (SS-MMSL), which encodes features from different modalities to build a meaningful subspace. In particular, the new method adopts the discriminative information of the labeled data to construct local patches and aligns these patches to obtain the optimal low dimensional subspace for each modality. For local patch construction, the data distribution revealed by unlabeled data is utilized to enhance the subspace learning. To find a low dimensional subspace wherein the distribution of each modality is sufficiently smooth, SS-MMSL adopts an alternating and iterative optimization algorithm to explore the complementary characteristics of different modalities. The iterative procedure reaches the global minimum of the criterion owing to the criterion's strong convexity. Our experiments on image classification and cartoon retrieval demonstrate the validity of the proposed method.

Highlights

► We encode different features to build a physically meaningful subspace.
► We adopt discriminative information to obtain the optimal low-dimensional subspace for each view.
► We utilize unlabeled data to enhance the subspace learning.
► We use alternating optimization to explore complementary characteristics of different features.

Introduction

In many applications of computer vision and multimedia management, images are represented in several different ways; such data are termed multimodal data. A typical example is a color image, which has features from different modalities (Tao et al., 2006, Bian and Tao, 2010), e.g., color, texture, and shape. Classifying images into meaningful categories is a challenging and important task. Many methods (Vapnik, 1998, Domingos and Pazzani, 1996) have been applied to image classification in order to better organize, represent and browse images, and to improve the performance of related applications such as Content Based Image Retrieval (CBIR) (Liu, 2004), image annotation and image indexing (Datta et al., 2008). In these methods, an image of n1 × n2 pixels is described by a feature vector of n1 × n2 dimensions. Because of the high dimensionality of such data (Donoho, 2000), these methods do not work well in practice. Previous studies show that the performance of image classification can be improved significantly in a low dimensional subspace (Belhumeur et al., 1997). Representative subspace learning methods include Principal Component Analysis (PCA) (Belhumeur et al., 1997), Linear Discriminant Analysis (LDA) (Belhumeur et al., 1997), Locally Linear Embedding (LLE) (Roweis and Saul, 2000), Laplacian Eigenmaps (LE) (Belkin and Niyogi, 2002, Belkin and Niyogi, 2003) and Local Tangent Space Alignment (LTSA) (Zhang and Zha, 2005). However, these methods cannot deal with data represented by multiple feature spaces.
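To make the dimensionality issue concrete, the following sketch (illustrative only, with hypothetical image sizes and random data rather than the datasets used in this paper) flattens n1 × n2 images into vectors and projects them onto a low dimensional PCA subspace, i.e., the single-modality setting addressed by the methods cited above.

```python
import numpy as np

# Hypothetical example: 200 grayscale images of 64 x 48 pixels.
rng = np.random.default_rng(0)
n1, n2, n_images = 64, 48, 200
images = rng.random((n_images, n1, n2))

# Each image becomes a feature vector of n1 * n2 = 3072 dimensions.
X = images.reshape(n_images, n1 * n2)

# PCA: project onto the top-k principal directions.
k = 20
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
W = Vt[:k].T                 # (n1 * n2, k) projection matrix
X_low = X_centered @ W       # (n_images, k) low dimensional representation
print(X.shape, "->", X_low.shape)
```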

Although much progress has been made in multimodal data learning, including classification (Zien and Ong, 2007), clustering (Bickel and Scheffer, 2004) and feature selection (Zhao and Liu, 2008), little progress has been made in multimodal subspace learning. A possible solution is to concatenate the vectors from different modalities into a single new vector and to apply dimensionality reduction directly to the concatenated vector. However, this solution has its own problem: the features describe different aspects of an image’s properties, that is, they are intrinsically embedded in heterogeneous feature spaces. Concatenation ignores the diversity of the modalities and thus cannot efficiently explore their complementary nature. Another solution is the Distributed Multiple-view Subspace Learning (DMSL) proposed in Long et al. (2008). DMSL performs a subspace learning algorithm on each modality independently and then, based on the obtained low dimensional representations, learns a common low dimensional representation that is as close as possible to each of them. Although DMSL allows different subspace learning algorithms to be selected for different modalities, the original data are invisible to the final learning process, and thus the complementary nature cannot be explored effectively.
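For reference, the sketch below (with hypothetical feature dimensions, not those used in our experiments) illustrates the concatenation strategy; the mismatch in dimension and scale between the modalities is precisely what prevents this baseline from exploiting their complementary nature.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
# Hypothetical modalities with different dimensions and scales.
color_hist = rng.random((n, 64))            # color histogram
correlogram = rng.random((n, 144)) * 10.0   # color correlogram (larger scale)
edge_hist = rng.random((n, 18)) * 0.1       # edge direction histogram

# Naive concatenation: one long vector per image.  Distance-based
# subspace learning on X_concat is dominated by the modality with the
# largest dimension and scale, so the smaller modalities contribute
# little, which is the drawback discussed above.
X_concat = np.hstack([color_hist, correlogram, edge_hist])
print(X_concat.shape)   # (100, 226)
```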

In this paper, we propose a novel method, termed Semi-Supervised Multimodal Subspace Learning (SS-MMSL), which learns a unified low dimensional subspace over all modalities simultaneously. We consider the problem of multimodal subspace learning for image classification based on the “Patch Alignment” Framework (PAF) (Zhang et al., 2009; Guan et al., 2011a, Guan et al., 2011b; Wang et al., 2011). According to PAF, subspace learning proceeds in two stages: local patch construction and whole alignment. We apply the discriminative information revealed by the labeled data in building local patches. In addition, we introduce the unlabeled data (Belkin et al., 2005, Zhu et al., 2003, Zheng et al., 2008, Belkin and Niyogi, 2004) into the local patch construction and incorporate them in the whole alignment stage to obtain the optimal low dimensional subspace for each modality. To find a unified low dimensional subspace wherein the distribution of each modality is sufficiently smooth, we derive an iterative algorithm based on alternating optimization (Bezdek and Hathaway, 2002) that obtains a group of appropriate weights capturing the complementary nature of the modalities. Our experimental results on image classification and cartoon retrieval demonstrate the effectiveness of the proposed method.
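The precise SS-MMSL criterion is given in Section 2; purely as an illustration of the alternating optimization idea, the sketch below assumes a generic weighted objective sum_m alpha_m^r tr(Y^T L_m Y) with a shared n × d embedding Y, per-modality n × n alignment matrices L_m, an exponent r > 1 and the constraint sum_m alpha_m = 1. These modelling choices are assumptions made for the sketch and are not the exact formulation derived in this paper.

```python
import numpy as np

def alternating_weights(L_list, dim, r=5, n_iter=20):
    """Sketch of alternating optimization under the assumed objective
    sum_m alpha_m**r * trace(Y.T @ L_m @ Y), with sum_m alpha_m = 1."""
    m = len(L_list)
    alpha = np.full(m, 1.0 / m)
    for _ in range(n_iter):
        # Step 1: with the weights fixed, the optimal Y (Y.T @ Y = I) spans
        # the eigenvectors of the weighted matrix with smallest eigenvalues.
        L = sum(a**r * Lm for a, Lm in zip(alpha, L_list))
        _, eigvecs = np.linalg.eigh(L)
        Y = eigvecs[:, :dim]
        # Step 2: with Y fixed, each weight has a closed-form update
        # obtained from the Lagrangian of the constrained problem.
        costs = np.array([np.trace(Y.T @ Lm @ Y) for Lm in L_list])
        inv = (1.0 / np.maximum(costs, 1e-12)) ** (1.0 / (r - 1))
        alpha = inv / inv.sum()
    return Y, alpha

# Toy usage with three random symmetric alignment matrices over 50 samples.
rng = np.random.default_rng(2)
L_list = []
for _ in range(3):
    A = rng.random((50, 50))
    L_list.append(A @ A.T)   # symmetric positive semidefinite
Y, alpha = alternating_weights(L_list, dim=5)
print(Y.shape, alpha)
```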

The rest of this paper is organized as follows. In Section 2, we present the proposed Semi-Supervised Multimodal Subspace Learning (SS-MMSL) method and the solution to image classification using SS-MMSL. Experimental results are presented in Section 3. The application to cartoon retrieval is described in Section 4. Finally, conclusions are drawn in Section 5.

Section snippets

Semi-supervised multimodal subspace learning

In this section, we present a novel Semi-Supervised Multimodal Subspace Learning (SS-MMSL) method based on the “Patch Alignment” Framework. The workflow of SS-MMSL is shown in Fig. 1. First, SS-MMSL extracts multiple features, including the color histogram, color correlogram and edge direction histogram, for each image. Subsequently, SS-MMSL builds the so-called local patches of both labeled and unlabeled data in each modality. Based on these patches, the whole alignment is performed to obtain a low dimensional subspace.
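A minimal sketch of the feature extraction stage is given below, assuming simple NumPy implementations of a joint RGB color histogram and an edge direction histogram; the color correlogram used in our experiments involves spatial co-occurrence statistics and is omitted here for brevity.

```python
import numpy as np

def color_histogram(img, bins=8):
    """RGB image (H, W, 3) with values in [0, 1] -> joint color histogram."""
    hist, _ = np.histogramdd(img.reshape(-1, 3), bins=(bins, bins, bins),
                             range=((0, 1), (0, 1), (0, 1)))
    hist = hist.ravel()
    return hist / hist.sum()

def edge_direction_histogram(img, bins=18):
    """Histogram of gradient directions of the grayscale image."""
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    angles = np.arctan2(gy, gx)                      # in [-pi, pi]
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)

# Hypothetical image standing in for a database image.
rng = np.random.default_rng(3)
img = rng.random((48, 64, 3))
features = {"color_hist": color_histogram(img),
            "edge_dir_hist": edge_direction_histogram(img)}
print({name: f.shape for name, f in features.items()})
```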

Experiments

In this section, we compare the effectiveness of the proposed SS-MMSL method with that of Feature Concatenation based Subspace Learning (FCSL), DMSL (Long et al., 2008), the Average performance of Single-modality Subspace Learning (ASSL) and the Best performance of Single-modality Subspace Learning (BSSL) in image classification. The image classification experiments are conducted with LIBSVM (Fan et al., 2005) in the constructed low dimensional subspace.
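As a rough outline of this evaluation protocol, the sketch below uses synthetic low dimensional representations and scikit-learn's SVC, which wraps libsvm, as a stand-in for the LIBSVM package; the data, dimensions and parameters are placeholders rather than the settings used in our experiments.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder low dimensional representations (e.g., 30 dimensions)
# produced by a subspace learning step, with synthetic labels.
rng = np.random.default_rng(4)
X_low = rng.random((300, 30))
y = rng.integers(0, 5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_low, y, test_size=0.3,
                                          random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # SVC wraps libsvm
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```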

Application in cartoon retrieval

As an extension of our approach, in this section, we demonstrate that SS-MMSL performs well in cartoon retrieval by conducting experiments on the cartoon database (Yu et al., 2007, Zhuang et al., 2008, Yang et al., 2009). The key issue is how to evaluate the shape similarities among cartoon images. It is important to carefully choose the features of cartoon images, e.g., in (Juan and Bodenheimer, 2004, Yu et al., 2011a), the edge feature is adopted to estimate the similarities. More recently,
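As one concrete (and deliberately simple) way to compare such edge features, the sketch below scores two L1-normalized edge direction histograms by histogram intersection; this is only an illustrative choice of similarity measure and not necessarily the one adopted in the cited work or in our system.

```python
import numpy as np

def edge_hist_similarity(h1, h2):
    """Histogram intersection between two L1-normalized edge direction
    histograms; 1.0 means identical distributions, 0.0 means no overlap."""
    return float(np.minimum(h1, h2).sum())

# Hypothetical edge direction histograms of two cartoon characters.
rng = np.random.default_rng(5)
h1 = rng.random(18); h1 /= h1.sum()
h2 = rng.random(18); h2 /= h2.sum()
print("similarity:", edge_hist_similarity(h1, h2))
```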

Conclusions

This paper presents a new Semi-Supervised Multimodal Subspace Learning (SS-MMSL) method for image classification. By integrating the discriminative information from the labeled data into the local patch construction and utilizing the data distribution revealed by the unlabeled data, a new objective function for subspace learning is constructed that obtains the advantages of both the “Patch Alignment” Framework and semi-supervised learning. Moreover, an iterative algorithm based on alternating optimization is derived to explore the complementary characteristics of the different modalities.

Acknowledgements

This work is supported by the grant of the National Natural Science Foundation of China (No. 61100104), the Grant of the National Defense Basic Scientific Research Program of China (No. B1420110155), the Specialized Research Fund for the Doctoral Program of Higher Education of China (No. 20110121110020), the Grant of the Fundamental Research Funds for the Central Universities (No. 2011121049), the Singapore National Research Foundation & Interactive Digital Media R&D Program Office Grant (

References (37)

  • P. Belhumeur et al.

    Eigenfaces vs. fisherfaces: Recognition using class specific linear projection

    IEEE Trans. Pattern Anal. Machine Intell.

    (1997)
  • M. Belkin et al.

    Laplacian eigenmaps and spectral techniques for embedding and clustering

    Proc. Adv. Neural Inf. Process. Syst. (NIPS 02)

    (2002)
  • M. Belkin et al.

    Laplacian eigenmaps for dimensionality reduction and data representation

    Neural Comput.

    (2003)
  • M. Belkin et al.

    Semi-supervised learning on Riemannian manifolds

    J. Mach. Learn.

    (2004)
  • Belkin, M., Niyogi, P., Sindhwani, V., 2005. On manifold regularization. In: Proceedings of the International Workshop...
  • Y. Bengio et al.

    Out-of-sample extensions for LLE, Isomap, MDS, Eigenmaps, and spectral clustering

    Adv. Neur. Inform. Process. Syst.

    (2003)
  • J.C. Bezdek et al.

    Some notes on alternating optimization

    Proc. AFSS Internat. Conf. Fuzzy Syst.

    (2002)
  • W. Bian et al.

    Biased discriminant Euclidean embedding for content based image retrieval

    IEEE Trans. Image Process.

    (2010)
  • S. Bickel et al.

    Multi-view clustering

    Proc. Internat. Conf. Mach. Learn. (ICML 04)

    (2004)
  • Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y., 2009. NUS-WIDE: A real-world web image database from...
  • R. Datta et al.

    Image retrieval: Ideas, influences, and trends of the new age

    ACM Comput. Surveys

    (2008)
  • P. Domingos et al.

    Beyond independence: Conditions for the optimality of the simple Bayesian classifier

    Proc. Internat. Conf. Mach. Learn. (ICML 96)

    (1996)
  • Donoho, D.L. 2000. High dimensional data analysis: The curses and blessings of dimensionality. In: Lecture at the...
  • R.E. Fan et al.

    Working set selection using the second order information for training SVM

    J. Mach. Learn. Res.

    (2005)
  • N. Guan et al.

    Manifold regularized discriminative nonnegative matrix factorization with fast gradient descent

    IEEE Trans. Image Process.

    (2011)
  • N. Guan et al.

    Non-negative patch alignment framework

    IEEE Trans. Neural Netw.

    (2011)
  • X. He et al.

    Locality preserving projections

    Adv. Neur. Inform. Process. Syst.

    (2004)
  • I. Jolliffe

    Principal Component Analysis

    (2002)