Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification
Introduction
With the rapid development of DNA sequencing technology, researchers can obtain large amounts of gene expression data from various tissue samples, thus providing technical support for studying tumor pathogenesis at the molecular level [13]. Medical data mining is one of the main research directions of data mining technology and represents a key technology for cancer classification and a bioinformatics research hot spot [35]. Mining gene expression data to identify disease genes, protein functions and disease diagnoses is of great significance; therefore, gene selection is a research focus of tumor recognition and classification [21]. Due to the high costs of experiments, the sample sizes of gene expression datasets remain in the hundreds, which is small compared to the tens of thousands of genes involved [39]. Although gene expression datasets are high-dimensional, only a few of the dimensions are beneficial for classification [13]. This high dimensionality poses a considerable challenge for classification [31]. Thus, the few beneficial genes must be selected from huge amounts of gene expression data.
Feature selection, as a data mining preprocessing technique, is a dimensionality reduction method that attempts to retain informative attributes in high-dimensional data, and attribute reduction in rough sets has been recognized as an important feature selection method [4], [26]. Feature selection has three main approaches: filter, wrapper and embedded methods [15]. Filter methods are typically employed as preprocessing methods that are independent of the classifier and use feature-ranking techniques as the basis for feature selection. Wrapper methods evaluate the goodness of each candidate feature subset by estimating the accuracy of a specific classifier [15]. However, wrapper methods are not only sensitive to the choice of classifier but also tend to require considerable runtimes; hence, they are not extensively used in microarray tasks, and few works in the field have employed them. Compared with wrapper methods, embedded methods integrate feature selection into the training process to reduce the total time required for reclassifying subsets [7]. In this paper, our feature selection method is based on the filter approach, in which a heuristic search algorithm is used to find an optimal feature subset with neighborhood rough sets for gene expression datasets.
Granular computing is an effective technology for uncertainty analysis, and attribute reduction is a fundamental research topic and an important application of granular computing [8], [33], [37], [41]. Traditional rough set-based attribute reduction methods are established on an equivalence relation, so they are compatible only with categorical datasets, not continuous numerical datasets [30], [43]. To overcome this drawback, Hu et al. [17] established a neighborhood rough set model that processes both numerical and categorical datasets via a neighborhood relation. Over the last few years, many reduction methods based on neighborhood relations have been investigated [5], [10], [31]. For instance, Chen et al. [5] studied a gene selection algorithm using neighborhood rough sets and a joint entropy measure. Fan et al. [10] introduced a max-decision neighborhood rough set model to design an attribute reduction algorithm. Sun et al. [31] described a gene selection approach based on the Fisher linear discriminant and neighborhood rough sets. Most of the abovementioned feature selection algorithms that use neighborhood rough set models rely on the monotonicity of evaluation functions for heuristic searches [17]. However, feature selection based on the monotonicity of the evaluation functions has some issues. For example, when the classification performance of the original dataset is poor, the corresponding evaluation functions have low measured values; therefore, these methods cannot yield good reduction results. To remedy this defect, Li et al. [20] presented a nonmonotonic attribute reduction algorithm for the decision-theoretic rough set model. The ideas of nonmonotonic reduction in [20] inspired us to investigate a new feature selection method based on neighborhood rough sets in this paper. It is known that a gene expression dataset can be granulated by using neighborhood parameters. Thus, some neighborhood entropy measures based on neighborhood rough sets can be further studied, and the monotonicity or nonmonotonicity of the neighborhood entropy-based uncertainty measures can be proved. Therefore, a nonmonotonic feature selection algorithm is presented to address the abovementioned problems.
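To make this granulation concrete, the following minimal sketch (assuming a Euclidean distance and a single neighborhood radius delta, which is one common choice among several in neighborhood rough set models) computes the delta-neighborhood of every sample in a toy expression matrix:

```python
import numpy as np

def neighborhoods(X, delta):
    """delta-neighborhood of each sample: delta_B(x) = {y : ||x - y|| <= delta}."""
    # pairwise Euclidean distances between all samples
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return [np.flatnonzero(dist[i] <= delta) for i in range(X.shape[0])]

# toy data: 4 samples described by 2 hypothetical genes
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.9, 1.0], [1.0, 1.0]])
nbrs = neighborhoods(X, delta=0.2)
print([list(g) for g in nbrs])  # [[0, 1], [0, 1], [2, 3], [2, 3]]
```

Shrinking delta yields finer granules (smaller neighborhoods), while enlarging it coarsens the granulation; this radius is exactly the neighborhood parameter by which the dataset is granulated.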
Note that the reduction calculation of neighborhood decision systems is a key problem in neighborhood rough sets. In addition, the reduct sets of an information system need to be obtained to further extract rule-like knowledge from the system [42]. In practical decision-making applications, the certainty factor and the object coverage factor of rules are two important standards for evaluating the decision-making ability of decision systems [43]. However, some existing reduction methods cannot objectively reflect changes in the decision-making ability of classification. The credibility and coverage degrees are known to efficiently reflect the classification ability of conditional attributes with respect to the decision attribute [35]; therefore, conditional attributes with higher credibility and coverage degrees are more important with respect to the decision attribute. Until now, the literature has not considered these degrees in neighborhood rough sets, which inspires our investigation of new measures that fully reflect the classification performance and decision-making ability of neighborhood decision systems. Consequently, new uncertainty measures and an effective heuristic search algorithm must be investigated. Moreover, the concepts of coverage degree and credibility degree should be introduced into neighborhood decision systems as measures of the classification ability of conditional attributes with respect to the decision attribute, and the credibility and coverage degrees based on the neighborhood relation should then be integrated into neighborhood entropy measures to demonstrate the decision-making ability of attributes in neighborhood decision systems.
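For reference, in classical rough set theory the certainty (credibility) and coverage factors of a decision rule are standardly defined as follows; the neighborhood versions studied here replace the equivalence class $[x]_B$ with the neighborhood $\delta_B(x)$ (this is the textbook formulation, not necessarily the authors' exact definition):

```latex
\operatorname{cer}\!\left([x]_B \Rightarrow D_j\right) = \frac{\left|[x]_B \cap D_j\right|}{\left|[x]_B\right|},
\qquad
\operatorname{cov}\!\left([x]_B \Rightarrow D_j\right) = \frac{\left|[x]_B \cap D_j\right|}{\left|D_j\right|}
```

where $D_j$ is a decision class; a rule with both factors close to 1 is both reliable (rarely wrong) and representative (covers many objects of its class).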
Available pretreatment methods for dimensionality reduction include principal component analysis, the most common linear dimension-reduction method. However, in many real-world datasets, the low-dimensional structure hidden in high-dimensional data is nonlinear, and linear reduction is not effective for mapping such data. Locally linear embedding (LLE) approximates the input data with a low-dimensional surface and reduces its dimensionality by learning a mapping of that surface [45]. Unfortunately, LLE has the drawback of a high computational cost [24]. The Fisher score, a common attribute relevance criterion, is a supervised learning technique with many advantages, such as few calculations, high accuracy and strong operability, and it can efficiently reduce computational complexity [47]. However, the Fisher score method occasionally selects redundant attributes, which affects the classification result [14]. This phenomenon inspires us to combine the Fisher score with neighborhood rough sets to reduce the initial dimensions and improve the classification performance of high-dimensional gene expression datasets. First, appropriate genes are selected to form a candidate gene subset, and some neighborhood entropy-based uncertainty measures are studied to address the uncertainty and noise of gene expression datasets. To fully reflect the decision-making ability of attributes, a neighborhood credibility degree and a neighborhood coverage degree are introduced into the decision neighborhood entropy and mutual information, both of which are nonmonotonic. A heuristic nonmonotonic feature selection algorithm with the Fisher score in neighborhood decision systems is then designed to improve the classification performance of gene expression datasets. The experimental results for several gene expression datasets show that our proposed method can find optimal reduct sets with few genes and high classification accuracy.
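A minimal sketch of the Fisher score as a per-gene ranking criterion (between-class scatter over within-class scatter; the exact weighting used in the paper may differ):

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature Fisher score: sum_k n_k (mu_k - mu)^2 / sum_k n_k var_k.
    Higher scores indicate genes that better separate the classes."""
    mu = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2   # between-class scatter
        den += len(Xc) * Xc.var(axis=0)                # within-class scatter
    return num / den

# toy data: gene 0 separates the two classes, gene 1 is noise
X = np.array([[1.0, 5.0], [1.1, 3.0], [3.0, 4.9], [3.1, 3.1]])
y = np.array([0, 0, 1, 1])
scores = fisher_score(X, y)
print(scores.argmax())  # 0 -> gene 0 is ranked first
```

Ranking all genes by this score and keeping the top ones produces the candidate gene subset on which the neighborhood-entropy search then operates.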
The remainder of this paper is organized as follows. Section 2 reviews some basic concepts. Section 3 investigates some neighborhood entropy-based uncertainty measures and develops a nonmonotonic feature selection approach with Fisher score for gene expression data classification. Section 4 shows and analyzes the experimental results. Finally, Section 5 summarizes this study.
Previous knowledge
In this section, we briefly review several basic concepts of decision systems, information entropy measures and neighborhood rough sets described in previous studies [17], [22], [25].
Neighborhood entropy-based uncertainty measures
In recent years, some correlative concepts of neighborhood entropy have been defined to measure the uncertainty of numerical data [16]. Based on information entropy and its variants, several feature selection algorithms with monotonicity have been proposed for the analysis of real-valued data [5], [16], [31]. However, when the classification performance of the original dataset is poor, the corresponding evaluation functions have lower measured values; thus, monotonic attribute reduction methods
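One widely used form of neighborhood entropy (following Hu et al.'s definition; the measures developed in this paper may add credibility and coverage terms on top of it) replaces equivalence-class probabilities with neighborhood sizes. A minimal sketch, assuming a Euclidean distance as the metric:

```python
import numpy as np

def neighborhood_entropy(X, delta):
    """NE_delta(B) = -(1/n) * sum_i log2(|delta_B(x_i)| / n),
    where delta_B(x_i) is the delta-neighborhood of sample x_i."""
    n = X.shape[0]
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    sizes = (dist <= delta).sum(axis=1)  # |delta_B(x_i)|; always >= 1 (self)
    return -np.log2(sizes / n).mean()

X = np.random.RandomState(0).rand(50, 3)
# finer granulation (smaller delta) can only increase the entropy
assert neighborhood_entropy(X, 0.1) >= neighborhood_entropy(X, 0.5)
# when every neighborhood covers the whole dataset, the entropy vanishes
print(neighborhood_entropy(X, 10.0) == 0.0)  # True
```

The entropy is maximal when every neighborhood contains only the sample itself (log2 n) and zero when every neighborhood covers all n samples.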
Nonmonotonic feature selection in neighborhood decision systems
In neighborhood rough sets, many evaluation functions for feature subsets have been developed, and heuristic reduction algorithms based on the monotonicity of these evaluation functions have been established [10], [17]. However, in the decision-theoretic rough set model, the positive region of a decision attribute does not satisfy monotonicity; i.e., as the conditional attributes increase, the positive region of the decision attribute may decrease. Thus, these existing feature
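The paper's own search procedure is not shown in this snippet; as a generic illustration of the kind of heuristic forward search such algorithms use, the sketch below greedily adds whichever feature most improves a user-supplied (possibly nonmonotonic) evaluation function and stops when no candidate improves it. The evaluation function here is a hypothetical stand-in, not the paper's measure:

```python
def forward_greedy_select(features, evaluate):
    """Heuristic forward search: add the feature with the largest gain in
    evaluate(subset); stop when no remaining feature improves the score.
    Because evaluate need not be monotonic, the stopping test is an
    improvement check rather than a fixed threshold."""
    selected, remaining = [], list(features)
    best = evaluate(selected)
    while remaining:
        score, f = max((evaluate(selected + [f]), f) for f in remaining)
        if score <= best:          # no candidate improves the subset
            break
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected

# hypothetical measure rewarding the subset {0, 2} and penalizing extras
evaluate = lambda s: len({0, 2} & set(s)) - 0.5 * len(set(s) - {0, 2})
print(sorted(forward_greedy_select(range(4), evaluate)))  # [0, 2]
```

With a nonmonotonic measure, adding a feature can lower the score, so the search terminates as soon as every remaining candidate would do so.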
Experiment preparation
In this section, the performance of our gene selection algorithm given in Section 3.2 is demonstrated. The gene expression datasets shown in Table 2 are described in detail as follows:
(1) A brain tumor [18] occurs when abnormal cells form within the brain. Brain tumors may produce symptoms that vary depending on the part of the brain involved. The Brain_Tumor2 gene expression dataset contains 10,367 genes and 50 samples with four subtypes.
(2) Colon cancer [27] is the development of cancer in the colon
Conclusion
Reducing the redundant or irrelevant genes of gene expression datasets can effectively decrease the cost of cancer classification. In this paper, a feature selection algorithm using neighborhood entropy-based uncertainty measures is proposed to improve the classification performance of gene expression data. The neighborhood entropy-based uncertainty measures are first investigated to measure the uncertainty and exclude the noise in gene expression datasets. Then, the neighborhood credibility
Declaration of the availability of data and materials
The datasets used during the study are available at the Kent Ridge Biomedical Dataset Repository and WEKA Collections of Datasets. (Last accessed: December 28, 2018) https://leo.ugr.es/elvira/DBCRepository/
Declaration of Competing Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (Nos. 61772176, 61402153, 61370169, 61672332), the Project Funded by China Postdoctoral Science Foundation (No. 2016M602247), the Plan of Scientific Innovation Talent of Henan Province (No. 184100510003), the Key Scientific and Technological Project of Henan Province (No. 182102210362), the Young Scholar Program of Henan Province (No. 2017GGJS041), and the Natural Science Foundation of Henan Province (No.
References (50)
- et al., Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments, Appl. Soft Comput. (2016)
- et al., A fuzzy based feature selection from independent component subspace for machine learning classification of microarray data, Genomics Data (2016)
- et al., Distributed feature selection: an application to microarray data classification, Appl. Soft Comput. (2015)
- et al., Feature selection for imbalanced data based on neighborhood rough sets, Inform. Sci. (2019)
- et al., Gene selection for tumor classification using neighborhood rough sets and entropy measures, J. Biomed. Inform. (2017)
- et al., Feature weighting and selection with a Pareto-optimal trade-off between relevancy and redundancy, Pattern Recogn. Lett. (2017)
- et al., A group incremental feature selection for classification using rough set theory based genetic algorithm, Appl. Soft Comput. (2018)
- et al., A novel hybrid genetic algorithm with granular information for feature selection and optimization, Appl. Soft Comput. (2018)
- et al., Attribute reduction based on max-decision neighborhood rough set model, Knowl.-Based Syst. (2018)
- et al., Feature selection considering two types of feature relevancy and feature interdependency, Expert Syst. Appl. (2018)
- Neighborhood rough set based heterogeneous feature subset selection, Inform. Sci.
- A hybrid feature selection algorithm for gene expression data classification, Neurocomputing
- Attribute reduction in generalized one-sided formal contexts, Inform. Sci.
- A hybrid gene selection method for microarray recognition, Biocybern. Biomed. Eng.
- A two-stage gene selection method for biomarker discovery from microarray data for cancer classification, Chemometr. Intell. Lab.
- Feature selection using rough entropy-based uncertainty measures in incomplete decision systems, Knowl.-Based Syst.
- Rough set theory and knowledge acquisition
- Feature genes selection using supervised locally linear embedding and correlation coefficient for microarray classification, Comput. Math. Method M.
- Multiple comparisons among means, J. Am. Stat. Assoc.
- Binary feature selection with conditional mutual information, J. Mach. Learn. Res.
- A comparison of alternative tests of significance for the problem of m rankings, Ann. Math. Stat.
- Haploinsufficient gene selection in cancer, Science
- A combined Fisher and Laplacian score for feature selection in QSAR based drug design using compounds with known and unknown activities, J. Comput. Aid. Mol. Des.
- An efficient gene selection technique for cancer recognition based on neighborhood mutual information, Int. J. Mach. Learn. Cyb.
- Feature clustering based support vector machine recursive feature elimination for gene selection, Appl. Intell.
Cited by (196)
- Online group streaming feature selection based on fuzzy neighborhood granular ball rough sets, Expert Systems with Applications (2024)
- A flexible non-monotonic discretization method for pre-processing in supervised learning, Pattern Recognition Letters (2024)
- LSFSR: Local label correlation-based sparse multilabel feature selection with feature redundancy, Information Sciences (2024)
- Feature selection algorithm using neighborhood equivalence tolerance relation for incomplete decision systems, Applied Soft Computing (2024)
- A new method for feature selection based on weighted k-nearest neighborhood rough set, Expert Systems with Applications (2024)
- Optimal gene therapy network: Enhancing cancer classification through advanced AI-driven gene expression analysis, e-Prime - Advances in Electrical Engineering, Electronics and Energy (2024)