Feature assessment and ranking for classification with nonlinear sparse representation and approximate dependence analysis
Introduction
Selecting salient features that preserve or improve the performance of data mining and decision-making is a problem of growing significance because of the increasing size, dimensionality, and complexity of real-world datasets in numerous domains, e.g., financial decision-making and credit scoring [e.g., Refs. 1, 2], image processing [e.g., Refs. 3, 4, 5], and cancer diagnosis [e.g., Refs. 6, 7]. Taking credit scoring in peer-to-peer lending as an example, automatically detecting which customers are likely to default on loan repayment using machine learning methods has been an active research topic for years, as it can help financial experts make profitable decisions while fulfilling regulatory requirements. However, the enormous number of financial features, a manifestation of the curse of dimensionality, not only increases the computational cost but also impairs the performance of machine learning methods, owing to the inclusion of redundant and noisy information. Feature selection is thus applied to reduce the dimensionality of the feature space and to boost the performance of machine learning methods. Effective feature selection methods have been widely recognized for their capabilities in facilitating data acquisition, increasing learning efficiency, removing noise, and reducing overfitting.
Feature selection methods can be broadly classified into three distinct types: filter methods [e.g., Ref. 8], wrapper methods [e.g., Ref. 9], and embedded methods [e.g., Ref. 2]. Filter methods apply classifier-independent metrics, such as information theoretic metrics [10] and the ℓp-norm (0 < p ≤ 2) [11], to evaluate and select features; they therefore generally incur a relatively low computational cost compared with the other two types. Wrapper methods evaluate and select feature subsets based on the classification accuracy of a specific classifier, so their performance is generally sensitive to the classifier they use [12]. In addition, the computational cost of wrapper methods is relatively high because each candidate feature subset is used to train the classifier. Embedded methods are classifiers in which feature selection is integrated into the learning process [2]. They are also classifier-specific, which limits their applicability to other classifiers. Filter methods are usually preferable to the other two types because of their advantages in several respects, e.g., strong generalization across different classifiers and high computational efficiency [13].
Filter methods developed in the early stage, such as mutual information maximization (MIM) [14], evaluate and rank features solely in terms of the relationship between each feature and the class (i.e., class-relevance). Such simplicity renders them highly efficient even in present-day applications. However, they have a severe drawback: features are evaluated individually, and the potential correlations among them, which may strongly influence classification performance, are not considered. Since it has been shown that simply combining class-relevant features cannot guarantee reasonable performance of the learning method, a natural improvement is to additionally consider dependencies among features [10, 15, 16]. Because combinations of multiple correlations must be taken into account and the resulting formulated problems are often nonconvex, it appears infeasible to obtain a globally optimal feature subset within polynomial time unless P = NP. Moreover, a reliable estimation of the joint distribution among features requires a large number of samples, whereas in most real-world cases, samples are insufficient for even a medium-scale joint estimation. Therefore, a number of existing feature selection methods decompose the objective of feature selection into multiple sub-objectives, such as maximizing class-relevance and minimizing feature inner-correlations (e.g., redundancy) [17], and apply heuristic search strategies and approximate dependence analysis (e.g., pairwise correlation analysis) to each sub-objective to obtain satisfactory solutions, i.e., well-qualified feature subsets [e.g., Refs. 8, 18] or feature rankings whose top-ranked features are salient for data representation [e.g., Refs. 10, 19, 20]. Moreover, explicitly decomposing the objective of feature selection into multiple sub-objectives can improve the interpretability of the results for real-world applications; for example, practitioners and empirical researchers can detect potential collinearity by conducting redundancy analysis. However, most of the heuristic strategies employed in these methods are intuitive but theoretically unsound, and approximate strategies such as pairwise correlation analysis may introduce significant bias in measuring feature dependencies. These deficiencies result in unstable performance of such methods.
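To make the decomposed objective concrete, the following minimal sketch (assuming discrete-valued features and scikit-learn's `mutual_info_score`; all names are illustrative, not a specific method's implementation) ranks features by the mRMR-style difference criterion, i.e., class-relevance minus average pairwise redundancy with the already-selected features:

```python
# Illustrative mRMR-style greedy ranking: relevance minus average redundancy.
# Assumes X holds discrete feature values; mutual_info_score estimates I(.;.)
# from empirical counts.
import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_rank(X, y, k):
    """Return indices of k features, ranked greedily by I(f_j; C) minus the
    mean pairwise I(f_j; f_s) over the already-selected features f_s."""
    n_features = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, j], y) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]   # start from the most relevant feature
    candidates = set(range(n_features)) - set(selected)
    while candidates and len(selected) < k:
        def score(j):
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, s]) for s in selected])
            return relevance[j] - redundancy  # mRMR difference criterion
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Note that the pairwise redundancy term here is precisely the kind of approximation whose bias, as discussed above, can grow significant on large-scale feature sets.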
Recently, sparse representation techniques have attracted increasing attention in numerous domains because they aim to obtain a small group of patterns that optimally recover the target, which can be formulated using an ℓ0-norm objective or regularization term [21]. In the context of feature selection, the aim of sparse representation is to determine a small number of features that preserve the classification accuracy [22, 23, 24]. The most significant advantage of sparse representation-based feature selection is that it provides a unified and analytically solvable optimization framework for feature selection. Recent feature selection methods based on sparse representation utilize a variety of sparse models, such as models with the ℓ1-norm [25], ℓ2-norm [24], and ℓ2,p-norm (0 < p ≤ 1) [26], to select representative features. However, minimizing an ℓp-norm (0 ≤ p < 1) regularized objective has been proved to be strongly NP-hard [21]. Although several relaxations exist, e.g., the ℓ1- or ℓ2-norm as convex approximations, the required matrix computation incurs excessive execution time and space costs, which hinders the application of such methods to large-scale datasets. In addition, existing sparse representation-based feature selection methods do not explicitly handle feature inner-correlations, i.e., redundancy and complementarity [27], which can severely affect classification performance. This renders them akin to black boxes: it is infeasible to determine exactly whether sufficient effort has been undertaken to handle feature inner-correlations.
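For intuition about the ℓ0-norm formulation, here is a hedged sketch of plain orthogonal matching pursuit (OMP), the classic greedy solver for min ‖y − Dw‖₂ s.t. ‖w‖₀ ≤ k; in the feature selection setting, the columns of D play the role of features and y encodes the class. This is textbook OMP, not the kernelized variant developed later in the paper:

```python
# A textbook OMP sketch for the l0-constrained problem min ||y - D w||_2
# subject to ||w||_0 <= k. Variable names are illustrative.
import numpy as np

def omp(D, y, k):
    """Greedily pick k columns of D that best reconstruct y."""
    residual = y.astype(float).copy()
    support, w = [], None
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # atom most correlated with residual
        support.append(j)
        Ds = D[:, support]
        w, *_ = np.linalg.lstsq(Ds, y, rcond=None)  # orthogonal projection step
        residual = y - Ds @ w
    return support, w
```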
Considering these issues, in this study, we propose a novel feature ranking method wherein sparse representation and information theoretic metrics are integrated to discriminate salient features, taking advantage of both sparse representation and feature inner-correlation analysis. More specifically, the proposed method not only utilizes the optimization framework of sparse representation, but also conducts dependence analysis to explicitly handle feature inner-correlations such as redundancy and complementarity, in such a way as to obtain salient and interpretable results for classification modeling. To our knowledge, this work is the first attempt to select features by combining sparse representation techniques and information theoretic dependence analysis. The main contributions of this work that distinguish it from the extant literature are threefold:
- A nonlinear sparse representation method is applied to identify representative feature clusters;
- conditional mutual information is used to identify the initial point for kernel orthogonal matching pursuit (OMP), thereby taking feature dependence into account; and
- a novel approximate dependence analysis strategy is proposed to eliminate redundancy and preserve complementarity, which effectively prevents the significant bias caused by pairwise correlation analysis on large-scale feature sets.
The remainder of the paper is organized as follows: Section 2 reviews related work. Section 3 briefly describes feature inner-correlations in the context of information theory, the principle of sparse representation, and the overall framework of the proposed method. Section 4 proposes a feature ranking approach based on kernel OMP. Section 5 then proposes a novel approximate redundancy–complementarity analysis. The complete proposed method is presented in Section 6. Section 7 presents experimental results and discussions evaluating the effectiveness of the proposed method against representative feature selection methods on nine well-known datasets. Finally, Section 8 presents concluding remarks and directions for future work.
Section snippets
Feature selection with dependence analysis
The main objective of existing feature evaluation criteria that consider feature dependencies is to identify a set of class-relevant and complementary features with minimal redundancy among them. In general, feature dependencies comprise feature redundancy and feature complementarity [20, 28], wherein redundancy has attracted significantly more attention than complementarity because of its detriment to classification performance. In order to identify class-relevance and redundancy, a
Redundancy and complementarity from information theoretic perspective
As mentioned previously, the objective of feature selection can be factorized into interpretable sub-objectives that are necessary for dependence analysis using information theoretic metrics. In this section, we introduce some necessary, albeit fundamental, information theoretic metrics that will be employed by the proposed method in what follows. Entropy, an essential quantitative description of information, is used to measure the extent of uncertainty for the distribution of a
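As a concrete companion to the metrics introduced in this section, the following sketch gives plug-in (maximum-likelihood) estimators of entropy, mutual information, and conditional mutual information over discrete variables; these follow standard identities, though they are not necessarily the estimators used in the paper's experiments:

```python
# Plug-in estimators for discrete variables, built from empirical frequencies.
import numpy as np
from collections import Counter

def entropy(x):
    """H(X) estimated from empirical frequencies."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(y, z)))
            - entropy(list(zip(x, y, z))) - entropy(z))
```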
Feature clustering by sparse representation
Sparse representation techniques can effectively select a small subset of the features to represent the class. That is, features are separated into two groups: one group contains the selected relevant features for class representation, and the other contains the rest. Thus, sparse representation techniques are preeminent choices as clustering methods for class-relevance analysis. This task can be effectively formulated using model (11). Regarding the nonlinearity of the feature space, the kernel
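One plausible way to kernelize the pursuit for this clustering step (an assumption-laden sketch with an RBF kernel, not necessarily the paper's exact NOMP) is to treat each feature column as an atom, compute correlations in the induced RKHS via the Gram matrix, and greedily grow the support:

```python
# Hypothetical RBF-kernel OMP over feature atoms; illustrative only.
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_omp(X, y, k, gamma=1.0):
    """Select k feature columns of X whose RKHS images best match phi(y)."""
    n = X.shape[1]
    atoms = [X[:, j] for j in range(n)]
    K = np.array([[rbf(a, b, gamma) for b in atoms] for a in atoms])  # Gram matrix
    ky = np.array([rbf(a, y, gamma) for a in atoms])                  # <phi(d_j), phi(y)>
    support, alpha = [], None
    for _ in range(k):
        # Correlation with the RKHS residual phi(y) - sum_i alpha_i phi(d_i).
        corr = ky - (K[:, support] @ alpha if support else 0.0)
        corr[support] = 0.0                        # never reselect an atom
        support.append(int(np.argmax(np.abs(corr))))
        # Projection coefficients from the sub-Gram system K_SS alpha = k_S.
        Ks = K[np.ix_(support, support)]
        alpha = np.linalg.solve(Ks + 1e-8 * np.eye(len(support)), ky[support])
    return support
```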
Approximate dependence analysis
As mentioned previously, feature selection methods with sparse representation do not explicitly handle redundancy and complementarity, and it is hard to know exactly whether such methods undertake sufficient efforts to handle feature inner-correlations. In contrast, information theoretic feature evaluation strategies generally achieve remarkable performance in redundancy and complementarity analysis and are thus included in this work to cover the deficiency of sparse representation
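A common information theoretic way to tell redundancy from complementarity between a candidate feature X and a selected feature Y with respect to the class C is the sign of I(X,Y;C) − I(X;C) − I(Y;C): negative values indicate redundancy, positive values indicate complementarity (synergy). The self-contained sketch below follows this convention; it is illustrative only, since exhaustive pairwise checks are precisely what the proposed approximate analysis avoids:

```python
# Sign of the interaction term: < 0 => redundant, > 0 => complementary.
import numpy as np
from collections import Counter

def _H(v):
    n = len(v)
    return -sum((c / n) * np.log2(c / n) for c in Counter(v).values())

def _I(x, y):
    return _H(list(x)) + _H(list(y)) - _H(list(zip(x, y)))

def interaction(x, y, c):
    """I(X,Y;C) - I(X;C) - I(Y;C) over discrete samples x, y, c."""
    return _I(list(zip(x, y)), c) - _I(x, c) - _I(y, c)
```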
Proposed method
The proposed method given in Algorithm 3, called feature selection with sparse representation and dependence analysis (SRDA), dynamically combines NOMP and dependence analysis. We also provide a toy example of an iteration of the proposed method on a dataset with 16 features, which is shown in Fig. 3.
In order to further utilize the high capability of MI in identifying relevant features, we marginally modify the NOMP shown in Algorithm 2 by using conditional mutual information to determine the initial
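A hypothetical sketch of such an initialization (names and the plug-in CMI estimator are illustrative, not the paper's code): instead of starting the pursuit from the atom with the largest residual correlation, start from the unselected feature with the largest conditional mutual information given the already-selected set:

```python
# Pick the initial atom by conditional MI rather than residual correlation.
import numpy as np
from collections import Counter

def _H(v):
    n = len(v)
    return -sum((c / n) * np.log2(c / n) for c in Counter(v).values())

def _cmi(x, y, z):
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
    return (_H(list(zip(x, z))) + _H(list(zip(y, z)))
            - _H(list(zip(x, y, z))) - _H(list(z)))

def initial_atom(X, c, selected):
    """Index of the unselected feature j maximizing I(f_j; C | selected)."""
    z = (list(zip(*(X[:, s] for s in selected)))
         if selected else [0] * X.shape[0])   # constant Z: CMI reduces to MI
    best_j, best = None, -np.inf
    for j in range(X.shape[1]):
        if j in selected:
            continue
        score = _cmi(list(X[:, j]), list(c), z)
        if score > best:
            best_j, best = j, score
    return best_j
```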
Experiments and discussion
Four representative information theoretic feature selection methods, namely, MIM [14], mRMR [17], FOU [10], and JMI [29], and an ℓ2,p-norm regularized discriminative feature selection method [DFS, 26] are compared with the proposed SRDA. Three representative classifiers, namely, k-nearest neighbor [kNN, 49], naïve Bayes classifier [NBC, 50], and random forest [51], are selected to generate the classification error rate on the datasets represented by the selected features, because of
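The evaluation protocol described here can be sketched as follows (a hedged illustration using scikit-learn; the 10-fold cross-validation, hyperparameters, and function names are assumptions, not the paper's exact setup): train each baseline classifier on the top-m ranked features and record the classification error rate as m grows:

```python
# Hypothetical evaluation loop: error rate of each classifier on top-m features.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

def error_rates(X, y, ranking, max_m=30):
    """ranking: feature indices ordered by the selector under evaluation."""
    classifiers = {
        "kNN": KNeighborsClassifier(n_neighbors=3),
        "NBC": GaussianNB(),
        "RF":  RandomForestClassifier(n_estimators=100, random_state=0),
    }
    results = {name: [] for name in classifiers}
    for m in range(1, max_m + 1):
        cols = ranking[:m]
        for name, clf in classifiers.items():
            acc = cross_val_score(clf, X[:, cols], y, cv=10).mean()
            results[name].append(1.0 - acc)   # classification error rate
    return results
```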
Conclusions
In this paper, a novel feature selection method is proposed to discriminate the salient features for classification utilizing both sparse representation and information theoretic dependence analysis. Specifically, each iteration of the proposed method first applies a nonlinear sparse representation approach to determine the representative feature cluster and then conducts approximate dependence analysis to eliminate redundancy in such a manner as to obtain the final selected features. This
Acknowledgments
We would like to thank the associate editor and three anonymous reviewers for their constructive comments and suggestions. This study was supported in part by the National Natural Science Foundation of China under Grants 71702066, 71802192, 61703319, and 71772077, in part by the China Postdoctoral Science Foundation under Grant 2017M612856, in part by the Humanity and Social Science Youth Foundation of the Ministry of Education of China under Grant 18YJC630137, and in part by the National Key R&D Program of China
Yishi Zhang received a Bachelor Degree in Computer Science from University of Electronic Science and Technology of China in 2009, and a Master Degree and his Ph.D. in Software Architecture and Management Science and Engineering from Huazhong University of Science and Technology in 2011 and 2016, respectively. He is now a research fellow in School of Management at Jinan University, and a postdoc in Joseph M. Katz Graduate School of Business at University of Pittsburgh. His research interests involve dimensionality reduction, topic modeling, and business intelligence.
Qi Zhang received a Bachelor Degree in Mechanical and Electrical Integration, a Master Degree in Technological Economics and Management, and her Ph.D. in Management Science and Engineering, all from Wuhan University of Technology. She is now an associate professor in the School of Economics and Management, China University of Geosciences (Wuhan). Her research interests involve business intelligence and big data analytics in the field of digital business.
Zhijun Chen received a Bachelor Degree in Mechanical Engineering and Automation from Wuhan University of Technology in 2009, and a Master Degree and his Ph.D. in Transportation Engineering and Automotive Engineering from Wuhan University of Technology, in 2012 and 2016, respectively. He is now an associate professor in Intelligent Transport Systems Research Center, Wuhan University of Technology. His research interests involve traffic safety, vehicle behavior recognition, machine learning, and big data analytics in intelligent transportation systems.
Jennifer Shang is a full professor in the area of Business Analytics at the Joseph M. Katz Graduate School of Business, University of Pittsburgh. She received her Ph.D. in Operations Management from the University of Texas at Austin. She has published in various journals, including Management Science, Information Systems Research, Marketing Science, and the European Journal of Operational Research. She has won the EMBA Distinguished Teaching Award and several Excellence-in-Teaching Awards from the MBA/EMBA programs at the Katz Business School.
Haiying Wei received a Bachelor Degree in Statistics from Renmin University of China in 1986, a Master Degree in Finance from Jinan University in 1997, and her Ph.D. in Management Science and Engineering from Huazhong University of Science and Technology in 2006. At present, she is a professor in School of Management at Jinan University, and serves as an executive member of China Institutions for Higher Learning. Her research interests involve statistical analysis and its application in marketing.