Feature assessment and ranking for classification with nonlinear sparse representation and approximate dependence analysis
Introduction
Selecting salient features that preserve or improve the performance of data mining and decision-making is a problem of growing significance because of the increasing size, dimensionality, and complexity of real-world datasets in numerous domains, e.g., financial decision-making and credit scoring [e.g., Refs. 1, 2], image processing [e.g., Refs. 3, 4, 5], and cancer diagnosis [e.g., Refs. 6, 7]. Taking credit scoring in peer-to-peer lending as an example, automatically detecting which customers are likely to default on loan repayment using machine learning methods has been an active research topic for years, as it can help financial experts make profitable decisions while fulfilling regulatory requirements. However, the enormous number of financial features, a manifestation of the curse of dimensionality, not only increases the computational cost but also impairs the performance of machine learning methods, owing to the inclusion of redundant and noisy information. Feature selection is thus applied to reduce the dimensionality of the feature space and to boost the performance of machine learning methods. Effective feature selection methods have been widely recognized for their capabilities in facilitating data acquisition, increasing learning efficiency, removing noise, and reducing overfitting.
Feature selection methods can be broadly classified into three distinct types: filter methods [e.g., Ref. 8], wrapper methods [e.g., Ref. 9], and embedded methods [e.g., Ref. 2]. Filter methods apply classifier-independent metrics, such as information theoretic metrics [10] and the ℓp-norm (0 < p ≤ 2) [11], to evaluate and select features; they therefore generally incur a relatively low computational cost compared with the other two types. Wrapper methods evaluate and select feature subsets based on the classification accuracy of a specific classifier, so their performance is generally sensitive to the classifier they use [12]. In addition, the computational cost of wrapper methods is relatively high because each candidate feature subset is used to train the classifier. Embedded methods are classifiers in which feature selection is integrated into the learning process [2]. They are also classifier-specific, which limits their applicability to other classifiers. Filter methods are usually preferable to the other two types because of their advantages in several respects, e.g., strong generalization across different classifiers and high computational efficiency [13].
Filter methods developed in the early stage, such as mutual information maximization (MIM) [14], evaluate and rank features solely in terms of the relationship between each feature and the class (i.e., class-relevance). Such simplicity renders them highly efficient even in present-day applications. However, they have a severe drawback: features are evaluated individually, and the potential correlations among them, which may strongly influence classification performance, are not considered. Since it has been shown that simply combining class-relevant features cannot guarantee reasonable performance of the learning method, a natural improvement is to additionally consider dependencies among features [10, 15, 16]. Because combinations of multiple correlations must be taken into account and the resulting formulated problems are often nonconvex, it appears infeasible to obtain a globally optimal feature subset within polynomial time unless P = NP. Moreover, a reliable estimation of the joint distribution among features requires a large number of samples, whereas in most real-world cases, samples are insufficient for even a medium-scale joint estimation. Therefore, a number of existing feature selection methods decompose the objective of feature selection into multiple sub-objectives, such as maximizing class-relevance and minimizing feature inner-correlations (e.g., redundancy) [17], and apply heuristic search strategies and approximate dependence analysis (e.g., pairwise correlation analysis) to each sub-objective to obtain satisfactory solutions, i.e., well-qualified feature subsets [e.g., Refs. 8, 18] or feature rankings whose top-ranked features are salient for data representation [e.g., Refs. 10, 19, 20]. Moreover, explicitly decomposing the objective of feature selection into multiple sub-objectives can improve the interpretability of the results for real-world applications; for example, practitioners and empirical researchers can detect potential collinearity by conducting redundancy analysis. However, most of the heuristic strategies employed in these methods are intuitive but theoretically unsound, and approximate strategies such as pairwise correlation analysis may introduce significant bias in measuring feature dependencies. These deficiencies result in unstable performance of such methods.
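To make the decomposed objective concrete, the following minimal sketch (assuming discrete-valued features and scikit-learn's `mutual_info_score`; all names are illustrative, not a specific method's implementation) ranks features by the mRMR-style difference criterion, i.e., class-relevance minus average pairwise redundancy with the already-selected features:

```python
# Illustrative mRMR-style greedy ranking: relevance minus average redundancy.
# Assumes X holds discrete feature values; mutual_info_score estimates I(.;.)
# from empirical counts.
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_rank(X, y, k):
    """Return indices of k features, ranked greedily by I(f_j; C) minus the
    mean pairwise I(f_j; f_s) over the already-selected features f_s."""
    n_features = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]   # start from the most relevant feature
    candidates = set(range(n_features)) - set(selected)
    while candidates and len(selected) < k:
        def score(j):
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
            return relevance[j] - redundancy  # mRMR difference criterion
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Note that the pairwise redundancy term here is precisely the kind of approximation whose bias, as discussed above, can grow significant on large-scale feature sets.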
Recently, sparse representation techniques have attracted increasing attention in numerous domains because they aim to obtain a small group of patterns that optimally recover the target, which can be formulated using an ℓ0-norm objective or regularization term [21]. In the context of feature selection, the aim of sparse representation is to determine a small number of features that preserve the classification accuracy [22, 23, 24]. The most significant advantage of sparse representation-based feature selection is that it provides a unified and analytically solvable optimization framework for feature selection. Recent feature selection methods based on sparse representation utilize a variety of sparse models, such as models with the ℓ1-norm [25], ℓ2-norm [24], and ℓ2,p-norm (0 < p ≤ 1) [26], to select representative features. However, minimizing an ℓp-norm (0 ≤ p < 1) regularized objective has been proved to be strongly NP-hard [21]. Although several relaxations exist, e.g., the ℓ1- or ℓ2-norm as convex approximations, the required matrix computation incurs excessive execution time and space costs, which hinders the application of such methods to large-scale datasets. In addition, existing sparse representation-based feature selection methods do not explicitly handle feature inner-correlations, i.e., redundancy and complementarity [27], which can severely affect classification performance. This renders them akin to black boxes: it is infeasible to determine exactly whether sufficient effort has been undertaken to handle feature inner-correlations.
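For intuition about the ℓ0-norm formulation, here is a hedged sketch of plain orthogonal matching pursuit (OMP), the classic greedy solver for min ‖y − Dw‖₂ s.t. ‖w‖₀ ≤ k; in the feature selection setting, the columns of D play the role of features and y encodes the class. This is textbook OMP, not the kernelized variant developed later in the paper:

```python
# A textbook OMP sketch for the l0-constrained problem min ||y - D w||_2
# subject to ||w||_0 <= k. Variable names are illustrative.
import numpy as np

def omp(D, y, k):
    """Greedily pick k columns of D that best reconstruct y."""
    residual = y.astype(float).copy()
    support, w = [], None
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # atom most correlated with residual
        support.append(j)
        Ds = D[:, support]
        w, *_ = np.linalg.lstsq(Ds, y, rcond=None)  # orthogonal projection step
        residual = y - Ds @ w
    return support, w
```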
Considering these issues, in this study, we propose a novel feature ranking method wherein sparse representation and information theoretic metrics are integrated to discriminate salient features, taking advantage of both sparse representation and feature inner-correlation analysis. More specifically, the proposed method not only utilizes the optimization framework of sparse representation, but also conducts dependence analysis to explicitly handle feature inner-correlations such as redundancy and complementarity, in such a way as to obtain salient and interpretable results for classification modeling. To our knowledge, this work is the first attempt to select features by combining sparse representation techniques and information theoretic dependence analysis. The main contributions of this work that distinguish it from the extant literature are threefold:
- A nonlinear sparse representation method is applied to identify representative feature clusters;
- conditional mutual information is used to identify the initial point for kernel orthogonal matching pursuit (OMP), thereby taking feature dependence into account; and
- a novel approximate dependence analysis strategy is proposed to eliminate redundancy and preserve complementarity, which effectively prevents the significant bias caused by pairwise correlation analysis on large-scale feature sets.
The remainder of the paper is organized as follows: Section 2 reviews related work. Section 3 briefly describes feature inner-correlations in the context of information theory, the principle of sparse representation, and the overall framework of the proposed method. Section 4 proposes a feature ranking approach based on kernel OMP. Section 5 then proposes a novel approximate redundancy–complementarity analysis. The complete proposed method is presented in Section 6. Section 7 presents experimental results and discussions evaluating the effectiveness of the proposed method against representative feature selection methods on nine well-known datasets. Finally, Section 8 presents concluding remarks and directions for future work.
Section snippets
Feature selection with dependence analysis
The main objective of existing feature evaluation criteria that consider feature dependencies is to identify a set of class-relevant and complementary features with minimal redundancy among them. In general, feature dependencies comprise feature redundancy and feature complementarity [20, 28], wherein redundancy has attracted significantly more attention than complementarity because of its detriment to classification performance. In order to identify class-relevance and redundancy, a
Redundancy and complementarity from information theoretic perspective
As mentioned previously, the objective of feature selection can be factorized into interpretable sub-objectives that are necessary for dependence analysis using information theoretic metrics. In this section, we introduce some necessary, albeit fundamental, information theoretic metrics that will be employed by the proposed method in what follows. Entropy, an essential quantitative description of information, is used to measure the extent of uncertainty for the distribution of a
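As a concrete companion to the metrics introduced in this section, the following sketch gives plug-in (maximum-likelihood) estimators of entropy, mutual information, and conditional mutual information over discrete variables; these follow standard identities, though they are not necessarily the estimators used in the paper's experiments:

```python
# Plug-in estimators for discrete variables, built from empirical frequencies.
import numpy as np
from collections import Counter

def entropy(x):
    """H(X) estimated from empirical frequencies."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(y, z)))
            - entropy(list(zip(x, y, z))) - entropy(z))
```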
Feature clustering by sparse representation
Sparse representation techniques can effectively select a small subset of the features to represent the class. That is, features are separated into two groups: one group contains the selected relevant features for class representation, and the other contains the rest. Thus, sparse representation techniques are preeminent choices as clustering methods for class-relevance analysis. This task can be effectively formulated using model (11). Regarding the nonlinearity of the feature space, the kernel
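One plausible way to kernelize the pursuit for this clustering step (an assumption-laden sketch with an RBF kernel, not necessarily the paper's exact NOMP) is to treat each feature column as an atom, compute correlations in the induced RKHS via the Gram matrix, and greedily grow the support:

```python
# Hypothetical RBF-kernel OMP over feature atoms; illustrative only.
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_omp(X, y, k, gamma=1.0):
    """Select k feature columns of X whose RKHS images best match phi(y)."""
    n = X.shape[1]
    atoms = [X[:, j] for j in range(n)]
    K = np.array([[rbf(a, b, gamma) for b in atoms] for a in atoms])  # Gram matrix
    ky = np.array([rbf(a, y, gamma) for a in atoms])                  # <phi(d_j), phi(y)>
    support, alpha = [], None
    for _ in range(k):
        # Correlation with the RKHS residual phi(y) - sum_i alpha_i phi(d_i).
        corr = ky - (K[:, support] @ alpha if support else 0.0)
        corr[support] = 0.0                        # never reselect an atom
        support.append(int(np.argmax(np.abs(corr))))
        # Projection coefficients from the sub-Gram system K_SS alpha = k_S.
        Ks = K[np.ix_(support, support)]
        alpha = np.linalg.solve(Ks + 1e-8 * np.eye(len(support)), ky[support])
    return support
```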
Approximate dependence analysis
As mentioned previously, feature selection methods with sparse representation do not explicitly handle redundancy and complementarity, and it is hard to know exactly whether such methods undertake sufficient efforts to handle feature inner-correlations. In contrast, information theoretic feature evaluation strategies generally achieve remarkable performance in redundancy and complementarity analysis and are thus included in this work to cover the deficiency of sparse representation
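A common information theoretic way to tell redundancy from complementarity between a candidate feature X and a selected feature Y with respect to the class C is the sign of I(X,Y;C) − I(X;C) − I(Y;C): negative values indicate redundancy, positive values indicate complementarity (synergy). The self-contained sketch below follows this convention; it is illustrative only, since exhaustive pairwise checks are precisely what the proposed approximate analysis avoids:

```python
# Sign of the interaction term: < 0 => redundant, > 0 => complementary.
import numpy as np
from collections import Counter

def _H(v):
    n = len(v)
    return -sum((c / n) * np.log2(c / n) for c in Counter(v).values())

def _I(x, y):
    return _H(list(x)) + _H(list(y)) - _H(list(zip(x, y)))

def interaction(x, y, c):
    """I(X,Y;C) - I(X;C) - I(Y;C) over discrete samples x, y, c."""
    return _I(list(zip(x, y)), c) - _I(x, c) - _I(y, c)
```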
Proposed method
The proposed method given in Algorithm 3, called feature selection with sparse representation and dependence analysis (SRDA), dynamically combines NOMP and dependence analysis. We also provide a toy example of an iteration of the proposed method on a dataset with 16 features, which is shown in Fig. 3.
In order to further utilize the high capability of MI in identifying relevant features, we marginally modify the NOMP shown in Algorithm 2 by using conditional mutual information to determine the initial
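A hypothetical sketch of such an initialization (names and the plug-in CMI estimator are illustrative, not the paper's code): instead of starting the pursuit from the atom with the largest residual correlation, start from the unselected feature with the largest conditional mutual information given the already-selected set:

```python
# Pick the initial atom by conditional MI rather than residual correlation.
import numpy as np
from collections import Counter

def _H(v):
    n = len(v)
    return -sum((c / n) * np.log2(c / n) for c in Counter(v).values())

def _cmi(x, y, z):
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
    return (_H(list(zip(x, z))) + _H(list(zip(y, z)))
            - _H(list(zip(x, y, z))) - _H(list(z)))

def initial_atom(X, c, selected):
    """Index of the unselected feature j maximizing I(f_j; C | selected)."""
    z = (list(zip(*(X[:, s] for s in selected)))
         if selected else [0] * X.shape[0])   # constant Z: CMI reduces to MI
    best_j, best = None, -np.inf
    for j in range(X.shape[1]):
        if j in selected:
            continue
        score = _cmi(list(X[:, j]), list(c), z)
        if score > best:
            best_j, best = j, score
    return best_j
```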
Experiments and discussion
Four representative information theoretic feature selection methods, namely, MIM [14], mRMR [17], FOU [10], and JMI [29], and an ℓ2,p-norm regularized discriminative feature selection method [DFS, 26] are compared with the proposed SRDA. Three representative classifiers, namely, k-nearest neighbor [kNN, 49], naïve Bayes classifier [NBC, 50], and random forest [51], are selected to generate the classification error rate on the datasets represented by the selected features, because of
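The evaluation protocol described here can be sketched as follows (a hedged illustration using scikit-learn; the 10-fold cross-validation, hyperparameters, and function names are assumptions, not the paper's exact setup): train each baseline classifier on the top-m ranked features and record the classification error rate as m grows:

```python
# Hypothetical evaluation loop: error rate of each classifier on top-m features.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

def error_rates(X, y, ranking, max_m=30):
    """ranking: feature indices ordered by the selector under evaluation."""
    classifiers = {
        "kNN": KNeighborsClassifier(n_neighbors=3),
        "NBC": GaussianNB(),
        "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
    }
    results = {name: [] for name in classifiers}
    for m in range(1, max_m + 1):
        cols = ranking[:m]
        for name, clf in classifiers.items():
            acc = cross_val_score(clf, X[:, cols], y, cv=10).mean()
            results[name].append(1.0 - acc)   # classification error rate
    return results
```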
Conclusions
In this paper, a novel feature selection method is proposed to discriminate the salient features for classification utilizing both sparse representation and information theoretic dependence analysis. Specifically, each iteration of the proposed method first applies a nonlinear sparse representation approach to determine the representative feature cluster and then conducts approximate dependence analysis to eliminate redundancy in such a manner as to obtain the final selected features. This
Acknowledgments
We would like to thank the associate editor and three anonymous reviewers for their constructive comments and suggestions. This study was supported in part by the National Natural Science Foundation of China under Grants 71702066, 71802192, 61703319, and 71772077, in part by the China Postdoctoral Science Foundation under Grant 2017M612856, in part by the Humanity and Social Science Youth Foundation of the Ministry of Education of China under Grant 18YJC630137, and in part by the National Key R&D Program of China
Yishi Zhang received a Bachelor Degree in Computer Science from University of Electronic Science and Technology of China in 2009, and a Master Degree and his Ph.D. in Software Architecture and Management Science and Engineering from Huazhong University of Science and Technology in 2011 and 2016, respectively. He is now a research fellow in School of Management at Jinan University, and a postdoc in Joseph M. Katz Graduate School of Business at University of Pittsburgh. His research interests involve dimensionality reduction, topic modeling, and business intelligence.
Qi Zhang received a Bachelor Degree in Mechanical and Electrical Integration, a Master Degree in Technological Economics and Management, and her Ph.D. in Management Science and Engineering, all from Wuhan University of Technology. She is now an associate professor in the School of Economics and Management, China University of Geosciences (Wuhan). Her research interests involve business intelligence and big data analytics in the field of digital business.
Zhijun Chen received a Bachelor Degree in Mechanical Engineering and Automation from Wuhan University of Technology in 2009, and a Master Degree and his Ph.D. in Transportation Engineering and Automotive Engineering from Wuhan University of Technology, in 2012 and 2016, respectively. He is now an associate professor in Intelligent Transport Systems Research Center, Wuhan University of Technology. His research interests involve traffic safety, vehicle behavior recognition, machine learning, and big data analytics in intelligent transportation systems.
Jennifer Shang is a full professor in the area of Business Analytics at the Joseph M. Katz Graduate School of Business, University of Pittsburgh. She received her Ph.D. in Operations Management from the University of Texas at Austin. She has published in various journals, including Management Science, Information Systems Research, Marketing Science, and the European Journal of Operational Research. She has won the EMBA Distinguished Teaching Award and several Excellence-in-Teaching Awards from the MBA/EMBA programs at the Katz Business School.
Haiying Wei received a Bachelor Degree in Statistics from Renmin University of China in 1986, a Master Degree in Finance from Jinan University in 1997, and her Ph.D. in Management Science and Engineering from Huazhong University of Science and Technology in 2006. At present, she is a professor in School of Management at Jinan University, and serves as an executive member of China Institutions for Higher Learning. Her research interests involve statistical analysis and its application in marketing.