Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification
Introduction
With the rapid development of DNA sequencing technology, researchers can obtain large amounts of gene expression data from various tissue samples, thus providing technical support for studying tumor pathogenesis at the molecular level [13]. Medical data mining is one of the main research directions of data mining technology and represents a key technology for cancer classification and a bioinformatics research hot spot [35]. Mining gene expression data to identify disease genes, protein functions and disease diagnoses is of great significance; therefore, gene selection is a research focus of tumor recognition and classification [21]. Due to the high costs of experiments, the sample sizes of gene expression datasets remain in the hundreds, which is small compared to the tens of thousands of genes involved [39]. Although gene expression datasets are high-dimensional, only a few of the dimensions are beneficial for classification [13]. This high dimensionality poses a considerable challenge for classification [31]. Thus, the few beneficial genes must be selected from huge amounts of gene expression data.
Feature selection, as a data mining preprocessing technique, is a dimensionality reduction method that attempts to retain informative attributes in high-dimensional data, and attribute reduction in rough sets has been recognized as an important feature selection method [4], [26]. Feature selection has three main approaches: filter, wrapper and embedded methods [15]. Filter methods are typically employed as preprocessing methods that are independent of the classifier and use feature-ranking techniques as the basis for feature selection. Wrapper methods evaluate the goodness of each candidate feature subset by estimating the accuracy of a specific classifier [15]. However, wrapper methods are not only sensitive to the choice of classifier but also tend to require considerable runtimes; hence, they are not extensively used in microarray tasks, and few works in the field have employed them. Compared with wrapper methods, embedded methods integrate feature selection into the training process to reduce the total time required for reclassifying subsets [7]. In this paper, our feature selection method is based on the filter approach, in which a heuristic search algorithm is used to find an optimal feature subset with neighborhood rough sets for gene expression datasets.
Granular computing is an effective technology for uncertainty analysis, and attribute reduction is a fundamental research topic and an important application of granular computing [8], [33], [37], [41]. Traditional rough set-based attribute reduction methods are established on an equivalence relation, so they are compatible only with categorical datasets, not continuous numerical datasets [30], [43]. To overcome this drawback, Hu et al. [17] established a neighborhood rough set model that processes both numerical and categorical datasets via a neighborhood relation. Over the last few years, many reduction methods based on neighborhood relations have been investigated [5], [10], [31]. For instance, Chen et al. [5] studied a gene selection algorithm using neighborhood rough sets and a joint entropy measure. Fan et al. [10] introduced a max-decision neighborhood rough set model to design an attribute reduction algorithm. Sun et al. [31] described a gene selection approach based on the Fisher linear discriminant and neighborhood rough sets. Most of the abovementioned feature selection algorithms that use neighborhood rough set models rely on the monotonicity of evaluation functions for heuristic searches [17]. However, feature selection based on the monotonicity of the evaluation functions has some issues. For example, when the classification performance of the original dataset is poor, the corresponding evaluation functions have low measured values; therefore, these methods cannot yield good reduction results. To remedy this defect, Li et al. [20] presented a nonmonotonic attribute reduction algorithm for the decision-theoretic rough set model. The ideas of nonmonotonic reduction in [20] inspired us to investigate a new feature selection method based on neighborhood rough sets in this paper. It is known that a gene expression dataset can be granulated by using neighborhood parameters. Thus, some neighborhood entropy measures based on neighborhood rough sets can be further studied, and the monotonicity or nonmonotonicity of the neighborhood entropy-based uncertainty measures can be proved. Therefore, a nonmonotonic feature selection algorithm is presented to address the abovementioned problems.
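To make this granulation concrete, the following minimal sketch (assuming a Euclidean distance and a single neighborhood radius delta, which is one common choice among several in neighborhood rough set models) computes the delta-neighborhood of every sample in a toy expression matrix:

```python
import numpy as np

def neighborhoods(X, delta):
    """delta-neighborhood of each sample: delta_B(x) = {y : ||x - y|| <= delta}."""
    # pairwise Euclidean distances between all samples
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return [np.flatnonzero(dist[i] <= delta) for i in range(X.shape[0])]

# toy data: 4 samples described by 2 hypothetical genes
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.9, 1.0], [1.0, 1.0]])
nbrs = neighborhoods(X, delta=0.2)
print([list(g) for g in nbrs])  # [[0, 1], [0, 1], [2, 3], [2, 3]]
```

Shrinking delta yields finer granules (smaller neighborhoods), while enlarging it coarsens the granulation; this radius is exactly the neighborhood parameter by which the dataset is granulated.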
Note that the reduction calculation of neighborhood decision systems is a key problem in neighborhood rough sets. In addition, the reduct sets of an information system need to be obtained to further extract rule-like knowledge from the system [42]. In practical decision-making applications, the certainty factor and the object coverage factor of rules are two important standards for evaluating the decision-making ability of decision systems [43]. However, some existing reduction methods cannot objectively reflect changes in the decision-making ability of classification. The credibility and coverage degrees are known to efficiently reflect the classification ability of conditional attributes with respect to the decision attribute [35]; therefore, conditional attributes with higher credibility and coverage degrees are more important with respect to the decision attribute. Until now, the literature has not considered these degrees in neighborhood rough sets, which inspires our investigation of new measures that fully reflect the classification performance and decision-making ability of neighborhood decision systems. Consequently, new uncertainty measures and an effective heuristic search algorithm must be investigated. Moreover, the concepts of coverage degree and credibility degree should be introduced into neighborhood decision systems as measures of the classification ability of conditional attributes with respect to the decision attribute, and the credibility and coverage degrees based on the neighborhood relation should then be integrated into neighborhood entropy measures to demonstrate the decision-making ability of attributes in neighborhood decision systems.
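For reference, in classical rough set theory the certainty (credibility) and coverage factors of a decision rule are standardly defined as follows; the neighborhood versions studied here replace the equivalence class $[x]_B$ with the neighborhood $\delta_B(x)$ (this is the textbook formulation, not necessarily the authors' exact definition):

```latex
\operatorname{cer}\!\left([x]_B \Rightarrow D_j\right) = \frac{\left|[x]_B \cap D_j\right|}{\left|[x]_B\right|},
\qquad
\operatorname{cov}\!\left([x]_B \Rightarrow D_j\right) = \frac{\left|[x]_B \cap D_j\right|}{\left|D_j\right|}
```

where $D_j$ is a decision class; a rule with both factors close to 1 is both reliable (rarely wrong) and representative (covers many objects of its class).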
Available pretreatment methods for dimensionality reduction include principal component analysis, the most common linear dimension-reduction method. However, in many real-world datasets, the low-dimensional structure hidden in high-dimensional data is nonlinear, and linear reduction is not effective for mapping such data. Locally linear embedding (LLE) approximates the input data with a low-dimensional surface and reduces its dimensionality by learning a mapping of that surface [45]. Unfortunately, LLE has the drawback of a high computational cost [24]. The Fisher score, a common attribute relevance criterion, is a supervised learning technique with many advantages, such as few calculations, high accuracy and strong operability, and it can efficiently reduce computational complexity [47]. However, the Fisher score method occasionally selects redundant attributes, which affects the classification result [14]. This phenomenon inspires us to combine the Fisher score with neighborhood rough sets to reduce the initial dimensions and improve the classification performance of high-dimensional gene expression datasets. First, appropriate genes are selected to form a candidate gene subset, and some neighborhood entropy-based uncertainty measures are studied to address the uncertainty and noise of gene expression datasets. To fully reflect the decision-making ability of attributes, a neighborhood credibility degree and a neighborhood coverage degree are introduced into the decision neighborhood entropy and mutual information, both of which are nonmonotonic. A heuristic nonmonotonic feature selection algorithm with the Fisher score in neighborhood decision systems is then designed to improve the classification performance of gene expression datasets. The experimental results for several gene expression datasets show that our proposed method can find optimal reduct sets with few genes and high classification accuracy.
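A minimal sketch of the Fisher score as a per-gene ranking criterion (between-class scatter over within-class scatter; the exact weighting used in the paper may differ):

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature Fisher score: sum_k n_k (mu_k - mu)^2 / sum_k n_k var_k.
    Higher scores indicate genes that better separate the classes."""
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2   # between-class scatter
        den += len(Xc) * Xc.var(axis=0)                # within-class scatter
    return num / den

# toy data: gene 0 separates the two classes, gene 1 is noise
X = np.array([[1.0, 5.0], [1.1, 3.0], [3.0, 4.9], [3.1, 3.1]])
y = np.array([0, 0, 1, 1])
scores = fisher_score(X, y)
print(scores.argmax())  # 0 -> gene 0 is ranked first
```

Ranking all genes by this score and keeping the top ones produces the candidate gene subset on which the neighborhood-entropy search then operates.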
The remainder of this paper is organized as follows. Section 2 reviews some basic concepts. Section 3 investigates some neighborhood entropy-based uncertainty measures and develops a nonmonotonic feature selection approach with Fisher score for gene expression data classification. Section 4 shows and analyzes the experimental results. Finally, Section 5 summarizes this study.
Previous knowledge
In this section, we briefly review several basic concepts of decision systems, information entropy measures and neighborhood rough sets described in previous studies [17], [22], [25].
Neighborhood entropy-based uncertainty measures
In recent years, some correlative concepts of neighborhood entropy have been defined to measure the uncertainty of numerical data [16]. Based on information entropy and its variants, several feature selection algorithms with monotonicity have been proposed for the analysis of real-valued data [5], [16], [31]. However, when the classification performance of the original dataset is poor, the corresponding evaluation functions have lower measured values; thus, monotonic attribute reduction methods
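One widely used form of neighborhood entropy (following Hu et al.'s definition; the measures developed in this paper may add credibility and coverage terms on top of it) replaces equivalence-class probabilities with neighborhood sizes. A minimal sketch, assuming a Euclidean distance as the metric:

```python
import numpy as np

def neighborhood_entropy(X, delta):
    """NE_delta(B) = -(1/n) * sum_i log2(|delta_B(x_i)| / n),
    where delta_B(x_i) is the delta-neighborhood of sample x_i."""
    n = X.shape[0]
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    sizes = (dist <= delta).sum(axis=1)  # |delta_B(x_i)|; always >= 1 (self)
    return -np.log2(sizes / n).mean()

X = np.random.RandomState(0).rand(50, 3)
# finer granulation (smaller delta) can only increase the entropy
assert neighborhood_entropy(X, 0.1) >= neighborhood_entropy(X, 0.5)
# when every neighborhood covers the whole dataset, the entropy vanishes
print(neighborhood_entropy(X, 10.0) == 0.0)  # True
```

The entropy is maximal when every neighborhood contains only the sample itself (log2 n) and zero when every neighborhood covers all n samples.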
Nonmonotonic feature selection in neighborhood decision systems
In neighborhood rough sets, many evaluation functions for feature subsets have been developed, and heuristic reduction algorithms based on the monotonicity of these evaluation functions have been established [10], [17]. However, in the decision-theoretic rough set model, the positive region of a decision attribute does not satisfy monotonicity; i.e., as the conditional attributes increase, the positive region of the decision attribute may decrease. Thus, these existing feature
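The paper's own search procedure is not shown in this snippet; as a generic illustration of the kind of heuristic forward search such algorithms use, the sketch below greedily adds whichever feature most improves a user-supplied (possibly nonmonotonic) evaluation function and stops when no candidate improves it. The evaluation function here is a hypothetical stand-in, not the paper's measure:

```python
def forward_greedy_select(features, evaluate):
    """Heuristic forward search: add the feature with the largest gain in
    evaluate(subset); stop when no remaining feature improves the score.
    Because evaluate need not be monotonic, the stopping test is an
    improvement check rather than a fixed threshold."""
    selected, remaining = [], list(features)
    best = evaluate(selected)
    while remaining:
        score, f = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best:          # no candidate improves the subset
            break
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected

# hypothetical measure rewarding the subset {0, 2} and penalizing extras
evaluate = lambda s: len({0, 2} & set(s)) - 0.5 * len(set(s) - {0, 2})
print(sorted(forward_greedy_select(range(4), evaluate)))  # [0, 2]
```

With a nonmonotonic measure, adding a feature can lower the score, so the search terminates as soon as every remaining candidate would do so.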
Experiment preparation
In this section, the performance of our gene selection algorithm given in Section 3.2 is demonstrated. The gene expression datasets shown in Table 2 are described in detail as follows:
(1) A brain tumor [18] occurs when abnormal cells form within the brain. Brain tumors may produce symptoms that vary depending on the part of the brain involved. The Brain_Tumor2 gene expression dataset contains 10,367 genes and 50 samples with four subtypes.
(2) Colon cancer [27] is the development of cancer in the colon
Conclusion
Reducing the redundant or irrelevant genes of gene expression datasets can effectively decrease the cost of cancer classification. In this paper, a feature selection algorithm using neighborhood entropy-based uncertainty measures is proposed to improve the classification performance of gene expression data. The neighborhood entropy-based uncertainty measures are first investigated to measure the uncertainty and exclude the noise in gene expression datasets. Then, the neighborhood credibility
Declaration of the availability of data and materials
The datasets used during the study are available at the Kent Ridge Biomedical Dataset Repository and WEKA Collections of Datasets. (Last accessed: December 28, 2018) https://leo.ugr.es/elvira/DBCRepository/
Declaration of Competing Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Nos. 61772176, 61402153, 61370169, 61672332), the Project Funded by China Postdoctoral Science Foundation (No. 2016M602247), the Plan of Scientific Innovation Talent of Henan Province (No. 184100510003), the Key Scientific and Technological Project of Henan Province (No. 182102210362), the Young Scholar Program of Henan Province (No. 2017GGJS041), and the Natural Science Foundation of Henan Province (No.
References (50)
- et al., Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments, Appl. Soft Comput. (2016)
- et al., A fuzzy based feature selection from independent component subspace for machine learning classification of microarray data, Genomics Data (2016)
- et al., Distributed feature selection: an application to microarray data classification, Appl. Soft Comput. (2015)
- et al., Feature selection for imbalanced data based on neighborhood rough sets, Inform. Sci. (2019)
- et al., Gene selection for tumor classification using neighborhood rough sets and entropy measures, J. Biomed. Inform. (2017)
- et al., Feature weighting and selection with a Pareto-optimal trade-off between relevancy and redundancy, Pattern Recogn. Lett. (2017)
- et al., A group incremental feature selection for classification using rough set theory based genetic algorithm, Appl. Soft Comput. (2018)
- et al., A novel hybrid genetic algorithm with granular information for feature selection and optimization, Appl. Soft Comput. (2018)
- et al., Attribute reduction based on max-decision neighborhood rough set model, Knowl.-Based Syst. (2018)
- et al., Feature selection considering two types of feature relevancy and feature interdependency, Expert Syst. Appl. (2018)
- Neighborhood rough set based heterogeneous feature subset selection, Inform. Sci.
- A hybrid feature selection algorithm for gene expression data classification, Neurocomputing
- Attribute reduction in generalized one-sided formal contexts, Inform. Sci.
- A hybrid gene selection method for microarray recognition, Biocybern. Biomed. Eng.
- A two-stage gene selection method for biomarker discovery from microarray data for cancer classification, Chemometr. Intell. Lab.
- Feature selection using rough entropy-based uncertainty measures in incomplete decision systems, Knowl.-Based Syst.
- Rough set theory and knowledge acquisition
- Feature genes selection using supervised locally linear embedding and correlation coefficient for microarray classification, Comput. Math. Method M.
- Multiple comparisons among means, J. Am. Stat. Assoc.
- Binary feature selection with conditional mutual information, J. Mach. Learn. Res.
- A comparison of alternative tests of significance for the problem of m rankings, Ann. Math. Stat.
- Haploinsufficient gene selection in cancer, Science
- A combined Fisher and Laplacian score for feature selection in QSAR based drug design using compounds with known and unknown activities, J. Comput. Aid. Mol. Des.
- An efficient gene selection technique for cancer recognition based on neighborhood mutual information, Int. J. Mach. Learn. Cyb.
- Feature clustering based support vector machine recursive feature elimination for gene selection, Appl. Intell.
Cited by (196)
- Online group streaming feature selection based on fuzzy neighborhood granular ball rough sets, Expert Systems with Applications (2024)
- A flexible non-monotonic discretization method for pre-processing in supervised learning, Pattern Recognition Letters (2024)
- LSFSR: Local label correlation-based sparse multilabel feature selection with feature redundancy, Information Sciences (2024)
- Feature selection algorithm using neighborhood equivalence tolerance relation for incomplete decision systems, Applied Soft Computing (2024)
- A new method for feature selection based on weighted k-nearest neighborhood rough set, Expert Systems with Applications (2024)
- Optimal gene therapy network: Enhancing cancer classification through advanced AI-driven gene expression analysis, e-Prime - Advances in Electrical Engineering, Electronics and Energy (2024)