
Neurocomputing

Volume 333, 14 March 2019, Pages 124-134

A sharing multi-view feature selection method via Alternating Direction Method of Multipliers

https://doi.org/10.1016/j.neucom.2018.12.043

Highlights

  • A sharing multi-view feature selection method is proposed, which combines the specificity of each view with a common shared objective.

  • Features are selected effectively from high-dimensional data via the sharing strategy and ADMM.

  • Computation is sped up by decomposing the large-scale optimization problem into small-scale subproblems.

  • Comparison experiments with several state-of-the-art feature selection methods demonstrate its effectiveness.

Abstract

Matrix-based multi-view feature selection, which can integrate the information of multiple views to select representative features, has attracted wide attention in recent years. In this paper, we propose a novel supervised sharing multi-view feature selection method. The proposed method makes all views share a common penalty that regresses samples to their labels. Meanwhile, it adopts a structured sparsity-inducing norm to enforce sparsity for each view. The proposed method considers not only the complementarity of different views but also the specificity of each view, instead of concatenating all views into high-dimensional vectors. In addition, the proposed model can be decomposed into N small-scale subproblems (where N is the number of views) and solved efficiently via the Alternating Direction Method of Multipliers (ADMM), especially for high-dimensional large-scale data sets. Comparison experiments with several state-of-the-art feature selection methods show the effectiveness of the proposed method.

Introduction

With the development of computer technology, storage technology, and multimedia information technology, the amount of collected text, image, and video data has grown rapidly, which brings considerable computational difficulty because the complexity of computation increases dramatically when dealing with high-dimensional data. Solving these problems quickly and efficiently has therefore become increasingly important in recent years. One effective strategy is to reduce the dimensionality and obtain a low-dimensional representation of the original data, such that the selected features enhance generalization while reducing the chance of over-fitting. Depending on whether the raw features are changed, dimension reduction methods are roughly divided into two categories: feature selection [1], [2], [3] and feature extraction [4]. Feature selection, also known as variable selection, chooses a small subset of informative features from the raw high-dimensional feature set without changing the original representation of the features, which has the advantage of preserving their original semantics.

Different from most of the previous vector-based structured sparsity-inducing feature selection (SSFS) methods, which can only solve binary classification problems, a number of matrix-based SSFS methods that can solve multi-class problems have emerged in recent years. Based on how much label information is used to guide the selection of representative features, these methods can be categorized into three groups: supervised methods [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], semi-supervised methods [22], [23], [24], [25], [26], and unsupervised methods [27], [28], [29], [30], [31], [32]. Supervised feature selection methods utilize all labeled samples, which provide valuable information for selecting a representative feature subset, and such methods are more likely to find useful features than the other two kinds. In addition, according to the machine learning task with which feature selection is combined, matrix-based SSFS methods can also be grouped into multi-task feature selection, multi-label feature selection, multi-view feature selection, and so on. A comprehensive survey of SSFS methods is provided in [33], in which the motivations and mathematical representations of various matrix-based SSFS methods are well elaborated.

A number of novel matrix-based SSFS methods have been proposed. Cai et al. [5] proposed a feature selection method that adopts an $\ell_{2,0}$-norm constraint. Nie et al. [6] proposed a self-weighted supervised discriminative feature selection method that combines $\ell_{2,0}$-norm regularization, row-sparse projection, and an orthogonal constraint. Liu et al. [7] developed a multi-task feature learning method, MTFL, which uses the $\ell_{2,1}$-norm penalty to implement joint feature selection across multiple tasks. Obozinski et al. [8] selected features via joint covariate selection and joint subspace selection. Nie et al. [9] proposed a robust feature selection method by adopting the $\ell_{2,1}$-norm instead of the Frobenius norm as the loss. They also put forward a sparse multi-task regression and feature selection method to identify brain imaging predictors for memory performance [10]. Building on MTFL, Tang and Liu [11] presented a feature selection approach for social media data. Liang et al. [12] established a multi-task feature selection method that adopts the $\ell_{r,p}$-norm ($p = 1$, $r = 1$ or $\infty$) as the penalty. Considering the drawbacks of pre-determined graphs and affinity measurements in the original feature space, Luo et al. [26] proposed a novel semi-supervised feature selection method for video semantic recognition that incorporates the exploration of local structure into joint feature selection to learn an optimal graph. They also proposed an adaptive unsupervised feature selection method that uses an adaptive reconstruction graph to characterize the intrinsic local structure and imposes a rank constraint on the corresponding Laplacian matrix to obtain an ideal neighbor assignment [27].

Features obtained from a single view cannot fully describe the samples, especially for web pages, images, and video data. Each view has its own specificity; for example, a person can be described from multiple views, such as audio, text, and photos, where each view carries its own physical and practical significance. Thus, it has become popular to describe objects through the connections and differences between multiple views, which has motivated the integration of features from different views. Xiao et al. [13] proposed a two-view feature selection method for cross-sensor iris recognition based on $\ell_{2,1}$-norm regularization. Later research [14], [15] extended it to multiple views. Wang et al. [15] proposed a framework of joint $G_1$-norm and $\ell_{2,1}$-norm for multi-view feature selection. They also proposed a multi-view learning method that can handle both clustering and classification tasks by integrating all features and learning the weights of the features corresponding to each cluster through joint structured sparsity-inducing norms [16]. In order to predict the cognitive performance and diagnosis of patients with or without Alzheimer's disease (AD), they [17] recently proposed a novel joint high-order multi-modal multi-task feature learning method that adopts $\ell_{2,1}$-norm regularization, a new group $\ell_1$-norm regularization, and trace norm regularization. A low-rank multi-view feature selection method with graph Laplacian was proposed by embedding the global and local structures into the feature selection framework [18]. In [19], Yan et al. addressed the problem of class ambiguity by presenting a classification model that allows top-$k$ ($k \ge 2$) predictions based on multi-modal feature fusion. Considering the intrinsic correlations of different views, Chang et al. [20] proposed a multi-view feature learning method that takes advantage of common features shared by different views. Liu et al. [21] combined a self-adaptive multi-view feature learning framework with an effective support vector machine (SVM) solver for complex multimedia event detection (MED), which not only fuses the features of all views effectively but also reduces the computational burden. In [29], regularization structure and matrix factorization were applied to unsupervised and supervised feature selection to preserve the intrinsic structure of the raw data.

Multi-view feature selection methods have demonstrated good performance. A common practice when building such methods is to concatenate the vectors from multiple views into a new high-dimensional vector. However, this may increase both the computational complexity and the chance of over-fitting. Noticing these limitations, some unsupervised multi-view schemes [30], [31], [32] have been presented to learn corresponding weights for different views.

In this paper, to make full use of multi-view information and efficiently handle feature selection on high-dimensional data, we propose a novel supervised sharing multi-view feature selection method based on a shared penalty and individual regularization. On one hand, we make all views share a common penalty, which regresses each sample to its label. On the other hand, we adopt $\ell_{2,1}$-norm regularization for each view to make the corresponding transformation matrix sparse, instead of simply concatenating multiple views into high-dimensional vectors. The model is thus established by considering not only the complementarity of different views, but also the specificity of each view. The model can be decomposed into N (where N is the number of views) subproblems in the process of optimization, and solved efficiently by the Alternating Direction Method of Multipliers (ADMM) [34].
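
Section 2 gives the precise formulation; as a reading aid, a minimal sketch of such a sharing model, consistent with the description above (the transformation matrices $W^v$, the regularization parameters $\lambda_v$, and the least-squares form of the shared penalty are our assumptions rather than the authors' exact notation), is

$$\min_{W^1, \ldots, W^N} \Big\| \sum_{v=1}^{N} X^v W^v - Y \Big\|_F^2 + \sum_{v=1}^{N} \lambda_v \|W^v\|_{2,1},$$

where $X^v$ collects the samples of the $v$th view, $Y$ is the label matrix, the first term is the penalty shared by all views, and each $\|W^v\|_{2,1}$ term sparsifies the rows of the corresponding transformation matrix.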

In summary, the proposed method has the following advantages:

• It combines the specificity of each view with the common objective: each view adjusts its transformation matrix to minimize its individual regularization as well as the shared penalty term.

• It employs the structured sparsity-inducing $\ell_{2,1}$-norm to implement the feature sparsity of each view (the norm is defined after this list).

• It can be implemented by decomposing the large-scale optimization problem into small-scale subproblems based on the sharing strategy, which effectively reduces computational complexity, especially for high-dimensional data sets.

• The comparison experiments with several state-of-the-art feature selection methods on multiple data sets show the effectiveness of the proposed method.
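
As noted in the second bullet above, the structured sparsity-inducing norm used here is the $\ell_{2,1}$-norm. For a transformation matrix $W \in \mathbb{R}^{d \times c}$ with rows $w_1, \ldots, w_d$, it is defined as

$$\|W\|_{2,1} = \sum_{i=1}^{d} \|w_i\|_2 = \sum_{i=1}^{d} \sqrt{\sum_{j=1}^{c} w_{ij}^2}.$$

Minimizing this norm drives entire rows of $W$ toward zero, so each feature is kept or discarded jointly across all classes, and the surviving features can be ranked by their row norms.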

The paper is organized as follows. In Section 2, we propose the sharing multi-view feature selection model and utilize ADMM to obtain its solution. Besides, we analyze the computational complexity of the new algorithm. After presenting extensive experiments and analysis in Section 3, we conclude the paper in Section 4.

Section snippets

Model

In this subsection, we propose a new sharing multi-view feature selection model. Given a data set $X$ belonging to $c$ classes with $n$ samples and $d$ features from $N$ views, we have the matrix $X = [X^1, X^2, \ldots, X^N] \in \mathbb{R}^{n \times d}$, where $X^v \in \mathbb{R}^{n \times d_v}$ represents the set of samples from the $v$th view, and $d = \sum_{v=1}^{N} d_v$ with $d_v$ being the feature dimension of the $v$th view. Let $x_j \in \mathbb{R}^d$ denote the $j$th row of $X$. The label matrix is $Y = (y_1, \ldots, y_n)^T \in \mathbb{R}^{n \times c}$, where $y_j = (y_{j1}, \ldots, y_{jc})^T \in \mathbb{R}^c$ is the label vector related to the sample $x_j$. If $x_j$ belongs to the $p$th class, then $y_{jp} = 1$ and the other entries of $y_j$ are 0.
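
To make the sharing strategy concrete, the following is a minimal sketch (our illustration, not the authors' released implementation) of how the objective sketched in the Introduction can be solved with the standard sharing formulation of ADMM [34]. The function names, the single regularization parameter lam, the penalty parameter rho, and the inner proximal-gradient solver are all illustrative assumptions:

    import numpy as np

    def prox_l21(V, tau):
        """Row-wise proximal operator of tau*||.||_{2,1} (group soft-thresholding)."""
        norms = np.linalg.norm(V, axis=1, keepdims=True)
        return V * np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))

    def update_view(X, Z, U, W, lam, rho, n_inner=50):
        """Approximately solve the per-view subproblem
           min_W  lam*||W||_{2,1} + (rho/2)*||X W - Z + U||_F^2
        by proximal gradient descent (ISTA)."""
        step = 1.0 / (rho * np.linalg.norm(X, 2) ** 2 + 1e-12)  # 1/Lipschitz
        for _ in range(n_inner):
            grad = rho * X.T @ (X @ W - Z + U)
            W = prox_l21(W - step * grad, step * lam)
        return W

    def sharing_admm(Xs, Y, lam=0.1, rho=1.0, n_iter=100):
        """Xs: list of n x d_v view matrices X^v; Y: n x c one-hot label matrix.
        Approximately solves
           min  ||sum_v X^v W^v - Y||_F^2 + lam * sum_v ||W^v||_{2,1}
        with auxiliary variables Z^v ~ X^v W^v and a shared scaled dual U."""
        n, c = Y.shape
        N = len(Xs)
        Ws = [np.zeros((X.shape[1], c)) for X in Xs]
        Zs = [np.zeros((n, c)) for _ in Xs]
        U = np.zeros((n, c))
        for _ in range(n_iter):
            # N independent small-scale subproblems, one per view (parallelizable)
            Ws = [update_view(X, Z, U, W, lam, rho)
                  for X, Z, W in zip(Xs, Zs, Ws)]
            A = [X @ W for X, W in zip(Xs, Ws)]
            A_mean = sum(A) / N
            # Collapsed Z-update via the averaging trick: Z_bar minimizes
            # ||N*Z_bar - Y||_F^2 + (N*rho/2)*||Z_bar - (A_mean + U)||_F^2.
            Z_bar = (2.0 * Y + rho * (A_mean + U)) / (2.0 * N + rho)
            Zs = [a - A_mean + Z_bar for a in A]
            U = U + A_mean - Z_bar  # dual ascent on the average residual
        return Ws  # rank the features of view v by the row norms of Ws[v]

The point that matches the paper's claim is that the $W^v$-updates decouple into $N$ independent small-scale subproblems, one per view, while the coupling introduced by the shared penalty is handled entirely by an averaged $Z$-update with a closed form.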

Experiments

In this section, the validity of the proposed method is evaluated through a series of experiments on widely used data sets. After the experimental setup is given, we carefully analyze and compare the performance of the proposed algorithm with several state-of-the-art feature selection methods.

Conclusion

In this paper, we have proposed a supervised sharing multi-view feature selection method, which offers advantages in both accuracy and training speed on high-dimensional data sets. The proposed method combines the shared penalty term and $\ell_{2,1}$-norm regularization for each view to form the global objective, which takes not only the complementarity of different views but also the specificity of each view into account. The proposed method can be efficiently solved via ADMM by decomposing the large-scale optimization problem into small-scale subproblems.

Acknowledgments

The work is supported by the National Natural Science Foundation of China (Grant nos. 61872368, U1536121, 11171346). The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.


References (39)

  • I. Guyon et al., Feature Extraction: Foundations and Applications (Studies in Fuzziness and Soft Computing), 2006.

  • X. Cai et al., Exact top-$k$ feature selection via $\ell_{2,0}$-norm constraint, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2013.

  • R. Zhang et al., Self-weighted supervised discriminative feature selection, IEEE Trans. Neural Netw. Learn. Syst., 2017.

  • J. Liu et al., Multi-task feature learning via efficient $\ell_{2,1}$-norm minimization, in: Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2009.

  • G. Obozinski et al., Joint covariate selection and joint subspace selection for multiple classification problems, Stat. Comput., 2010.

  • F. Nie et al., Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization, in: Proceedings of the International Conference on Neural Information Processing Systems, 2010.

  • H. Wang et al., Sparse multi-task regression and feature selection to identify brain imaging predictors for memory performance, in: Proceedings of the IEEE International Conference on Computer Vision, 2011.

  • J. Tang et al., Feature selection with linked data in social media, in: Proceedings of the SIAM International Conference on Data Mining, 2012.

  • Y. Liang et al., Exploring regularized feature selection for person specific face verification, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011.

Qiang Lin was born in 1990 in China. He received the B.S. and M.S. degrees from China Agricultural University in 2012 and 2015, respectively. He is now a Ph.D. student at China Agricultural University. His research direction is large-scale machine learning.

Yiming Xue was born in 1968. He is now an associate professor and Master's supervisor in the College of Information and Electrical Engineering, China Agricultural University. His research interests include multimedia processing, multimedia security, and VLSI design.

Juan Wen received the B.E. degree in information engineering and the Ph.D. degree in signal and information processing from Beijing University of Posts and Telecommunications. She is now a lecturer at China Agricultural University. Her research interests include artificial intelligence, information hiding, and natural language processing.

Ping Zhong is a professor and Ph.D. supervisor in the College of Science, China Agricultural University, Beijing, China. She has published many papers. Her research interests include machine learning, support vector machines, and steganalysis.
