Gaussian kernel based gene selection in a single cell gene decision space
Introduction
Uncertainty exists nearly everywhere. Uncertainty measurement is a significant issue in many fields, such as pattern recognition [3], image processing [16], medical diagnosis [32] and data mining [8].
RST (rough set theory) is a mathematical tool for handling uncertainty. An IS (information system) on the basis of RST was presented by Pawlak [18]. Most applications of RST are related to an IS.
To measure the uncertainty of a system, Shannon [21] introduced the notion of information entropy. Many contributions have been made in this respect. Wierman [27] measured the uncertainty in RST. Wang et al. [30] gave a novel entropy measure for the uncertainty of general fuzzy relations. Liang et al. [15] investigated information entropy in a complete IS and in an incomplete IS. Zhang et al. [38] investigated uncertainty measures in a fully fuzzy IS. Yu et al. [33] studied uncertainty measurement in a hybrid IS with images. Singh et al. [24] proposed uncertainty measurement for a set-valued IS. Zhang et al. [37] explored uncertainty measurement in a categorical IS.
Feature selection, or attribute reduction, is an important data processing technique in machine learning that can effectively remove redundant features and improve the accuracy of classification. Many researchers have studied feature selection. The details are described as follows.
Cornelis et al. [5] introduced the concept of fuzzy decision reducts, dependent on an increasing attribute subset measure. Experimental results demonstrated the potential of fuzzy decision reducts to discover shorter attribute subsets, leading to decision models with better coverage and comparable or even higher accuracy.
Jia et al. [12] defined the intra-class similarity for objects in the same decision class and the inter-class similarity for objects in different decision classes, and proposed similarity-based feature selection that maximizes intra-class similarity and minimizes inter-class similarity from the clustering perspective. Using a heuristic search strategy, they designed the corresponding feature selection algorithm. The experimental results indicated that the designed algorithm can significantly improve classification performance.
Li et al. [14] defined information entropy for heterogeneous data, and put forward the notions of joint information entropy, conditional information entropy and mutual information entropy. They applied information entropy to perform feature selection for heterogeneous data and proposed two information entropy based feature selection algorithms. Experimental analysis and comparisons illustrated the efficiency of the proposed algorithms.
The dependency in a neighborhood rough set considers only the classification information contained in the lower approximation of the decision and ignores the upper approximation. Wang et al. [30] constructed a class of uncertainty measures, called decision self-information, for feature selection. These measures consider the uncertainty information in both the lower and the upper approximations. They designed a greedy algorithm for feature selection. The experimental results verified that the designed algorithm often chooses fewer features and improves the classification accuracy in most cases.
Wang et al. [28] proposed a new form of conditional entropy, defined feature selection based on local conditional entropy, and introduced an ensemble strategy into the heuristic search for feature subsets. The experimental results showed that feature selection based on local conditional entropy is superior to feature selection based on traditional conditional entropy, since the former may yield attributes with higher classification accuracy. This study suggested new trends for feature selection and provided guidelines for designing new measures and related algorithms.
Wang et al. [31] presented a new fuzzy rough set model for categorical data by introducing a variable parameter to control the similarity between objects. This model used the iterative computation strategy to define fuzzy rough approximations and dependency functions. They applied the presented model to feature selection for categorical data. The experimental results indicated that the proposed algorithm is more effective than some existing algorithms.
Zeng et al. [38] developed a new hybrid distance in a hybrid IS based on the value difference metric, constructed a novel fuzzy rough set by combining the hybrid distance and the Gaussian kernel, and analyzed the updating mechanisms for feature selection as the feature set varies. They presented a fuzzy rough set approach for incremental feature selection on a hybrid IS and proposed two corresponding incremental algorithms. Finally, they conducted extensive experiments on eight datasets from UCI. The results showed that the incremental approaches significantly outperform non-incremental feature selection approaches in computational time.
Feature selection based on rough set theory is summarized in Table 1.
With the rapid development of microarray and gene sequencing technologies, a huge amount of scRNA-seq (single cell RNA-seq) data has arisen. Owing to technical and sampling limitations, scRNA-seq data are characterized by high noise, high sparsity, high dimensionality and uncertainty, which makes it difficult to determine cell types and subtypes. To address these problems, appropriate genes must be selected from a huge number of genes. Selecting the best genes with traditional statistical analysis and machine learning methods often runs into the curse of dimensionality. Thus, we need to reduce the complexity of scRNA-seq data and to establish a gene selection theory. Information entropy based on the Gaussian kernel is an effective method for attribute reduction or gene selection.
If the information values in an IS are real numbers, then it is called an RVIS (real-valued information system). An RVIS with decision attributes is called an RVDIS (real-valued decision information system). If the objects, attributes and information values in an RVDIS are respectively cells, genes and gene expression values, then this RVDIS is said to be a gene space. If the gene expression data in a gene space are scRNA-seq data, then this gene space is referred to as a single cell gene decision space. This study presents two gene selection algorithms in terms of information entropy and information granularity in a single cell gene decision space.
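The definitions above can be pictured as a small data structure. The following is a minimal sketch only: the class name `GeneDecisionSpace`, its field names and the `subspace` helper are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GeneDecisionSpace:
    """Hypothetical container for a single cell gene decision space."""
    cells: List[str]               # objects: one entry per cell
    genes: List[str]               # conditional attributes: one entry per gene
    expression: List[List[float]]  # expression[i][j]: value of gene j in cell i
    cell_type: List[str]           # decision attribute: one label per cell

    def __post_init__(self):
        n = len(self.cells)
        if len(self.expression) != n or len(self.cell_type) != n:
            raise ValueError("one expression row and one label per cell are required")
        if any(len(row) != len(self.genes) for row in self.expression):
            raise ValueError("each expression row needs one value per gene")

    def subspace(self, selected: List[str]) -> "GeneDecisionSpace":
        """Restrict the space to a subset of genes, keeping all cells and labels."""
        idx = [self.genes.index(g) for g in selected]
        rows = [[row[j] for j in idx] for row in self.expression]
        return GeneDecisionSpace(self.cells, list(selected), rows, self.cell_type)
```

A subspace keeps every cell and its label but only the chosen gene columns, which is the setting in which gene selection operates.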
As a common method of attribute reduction, the traditional rough set model can only deal with categorical data. For scRNA-seq data with real-valued or noisy entries, the commonly used discretization preprocessing may lose part of the information in the data, which reduces the accuracy of classification. This study does not discretize the original scRNA-seq data, but utilizes fuzzy sets and fuzzy relations to design the distances among cells. Furthermore, this paper also investigates the fuzzy -equivalence relation and uncertainty measures of a single cell gene decision space.
The flowchart of this paper is shown in Fig. 1. The remaining part of this paper is organized as follows. Section 2 recalls the notion of fuzzy relations and defines a single cell gene decision space. Section 3 proposes the fuzzy -equivalence relation induced by a single cell gene decision subspace in terms of the Gaussian kernel method. Section 4 studies entropy measures for a single cell gene decision space. Section 5 investigates gene selection in a single cell gene decision space based on fuzzy conditional entropy. Section 6 performs some numerical experiments to verify the performance of the proposed gene selection algorithms. Section 7 summarizes this paper.
Preliminaries
In this part, the notion of fuzzy relations is recalled and a single cell gene decision space is defined.
Throughout this paper, U denotes a finite set, $I$ means the unit interval $[0, 1]$, $I^U$ indicates the set of all fuzzy sets on U and $I^{U \times U}$ shows the set of all fuzzy relations on U. Put
Fuzzy -equivalence relation induced by a single cell gene decision subspace
This section defines the fuzzy -equivalence relation induced by a single cell gene decision subspace in terms of Gaussian kernel.
The Gaussian kernel $G(u, v) = \exp\left(-\frac{\|u - v\|^2}{2\delta^2}\right)$ measures the similarity between the points u and v, where $\|u - v\|$ is the Euclidean distance between the points u and v, and $\delta$ is a threshold.
Gaussian kernel method is a significant methodology in artificial intelligence.
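As an illustration, the similarity among cells can be computed as a fuzzy relation matrix. This sketch assumes the standard Gaussian kernel form exp(-||u - v||^2 / (2 * delta^2)); the parameter name `delta` and the function names are assumptions made for illustration, not the paper's notation.

```python
import math


def gaussian_kernel(u, v, delta):
    """Similarity of two expression vectors: exp(-||u - v||^2 / (2 * delta^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * delta ** 2))


def fuzzy_relation(cells, delta):
    """Pairwise similarity matrix over cells; entry [i][j] lies in (0, 1]."""
    n = len(cells)
    return [[gaussian_kernel(cells[i], cells[j], delta) for j in range(n)]
            for i in range(n)]


# Two cells described by two gene expression values each.
R = fuzzy_relation([[0.0, 0.0], [3.0, 4.0]], delta=5.0)
```

The resulting matrix is reflexive (each cell has similarity 1 with itself) and symmetric. A larger `delta` pushes all similarities toward 1, so the threshold controls how coarsely the cells are grouped.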
If the data distribution or information value distribution follows the circular distribution, then the
Entropy measure for a single cell gene decision space
This part studies entropy measure for a single cell gene decision space.
Stipulate . Definition 4.1 For a single cell gene decision space , given and . Then fuzzy information entropy of P is defined as
In (4.1), the fuzzy information entropy is controlled by . Proposition 4.2 For a single cell gene decision space , given and . Then Moreover, if , then ; if , then achieves the minimum value 0. Proof Since
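The exact formula of Definition 4.1 is not recoverable from this excerpt. As a hedged illustration, the sketch below uses one form of fuzzy information entropy that is common in the fuzzy rough set literature, H(R) = -(1/n) * sum_i log2(|[u_i]_R| / n), where |[u_i]_R| is the row sum of the fuzzy relation matrix. Both the formula and the function name are assumptions and may differ from the paper's definition.

```python
import math


def fuzzy_information_entropy(relation):
    """Assumed form: H(R) = -(1/n) * sum_i log2(|[u_i]_R| / n)."""
    n = len(relation)
    # |[u_i]_R| is the cardinality of the fuzzy granule of cell i (the row sum).
    # For a reflexive relation every row sum is at least 1, so the log is defined.
    return -sum(math.log2(sum(row) / n) for row in relation) / n


# Finest relation: every cell is similar only to itself.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Coarsest relation: all cells are fully similar to each other.
universal = [[1.0] * 4 for _ in range(4)]

h_fine = fuzzy_information_entropy(identity)     # log2(4) under this assumed form
h_coarse = fuzzy_information_entropy(universal)  # 0 under this assumed form
```

Under this assumed form the entropy ranges from 0 for the coarsest relation up to log2(n) for the finest one, which is consistent with the minimum value 0 mentioned in Proposition 4.2.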
Gene selection in a single cell gene decision space based on fuzzy conditional entropy
In this section, several gene selection algorithms in a single cell gene decision space are presented, and their cost-effectiveness is discussed in terms of time complexity and space complexity. Definition 5.1 For a single cell gene decision space , given and . Then P is referred to as a coordination subset of C relative to d, if .
In this paper, the family of all coordination subsets of C relative to d is denoted by . Definition 5.2 For a single cell gene decision space , given
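The paper's algorithms are not reproduced in this excerpt. The following is only a sketch of the general idea of entropy-guided gene selection, under stated assumptions: the fuzzy relation of a gene subset is a Gaussian kernel on the selected columns, the decision relation is crisp (1 for equal labels, 0 otherwise), the conditional entropy has the form -(1/n) * sum_i log2(|[u_i]_P ∩ [u_i]_d| / |[u_i]_P|), and genes are added greedily while the entropy keeps dropping. All function names, `delta` and `tol` are hypothetical.

```python
import math


def conditional_entropy(X, labels, genes, delta):
    """Assumed fuzzy conditional entropy of the decision given a gene subset."""
    n = len(X)
    total = 0.0
    for i in range(n):
        granule = 0.0  # |[u_i]_P|: total similarity of cell i to all cells
        agree = 0.0    # similarity mass shared with cells of the same type
        for j in range(n):
            sq = sum((X[i][g] - X[j][g]) ** 2 for g in genes)
            r = math.exp(-sq / (2.0 * delta ** 2))
            granule += r
            if labels[i] == labels[j]:
                agree += r
        total += math.log2(agree / granule)
    return -total / n


def greedy_gene_selection(X, labels, delta=1.0, tol=1e-6):
    """Add the gene that lowers the entropy most; stop when the gain is below tol."""
    remaining = list(range(len(X[0])))
    selected = []
    current = conditional_entropy(X, labels, selected, delta)
    while remaining:
        best_gene, best_h = min(
            ((g, conditional_entropy(X, labels, selected + [g], delta))
             for g in remaining),
            key=lambda pair: pair[1])
        if current - best_h <= tol:
            break
        selected.append(best_gene)
        remaining.remove(best_gene)
        current = best_h
    return selected, current


# Gene 0 separates the two cell types; gene 1 is constant and carries no information.
X = [[0.0, 1.0], [0.1, 1.0], [5.0, 1.0], [5.1, 1.0]]
labels = ["a", "a", "b", "b"]
selected, final_h = greedy_gene_selection(X, labels)
```

This sketch recomputes all pairwise similarities for every candidate gene, so each entropy evaluation is quadratic in the number of cells. It is meant to show the selection loop, not an efficient implementation.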
Experimental analyses
In this section, some numerical experiments are performed to verify the performance of the proposed gene selection algorithms. These experiments first run the two proposed gene selection algorithms on several publicly available datasets, and then apply three classifiers, namely KNN (K-Nearest Neighbor), SVM (Support Vector Machine) and ELM (Extreme Learning Machine) [11], to evaluate the performance of the two gene selection algorithms with respect to the classification accuracy (ACC).
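The evaluation idea can be sketched for the KNN case alone. The excerpt does not state the validation protocol, so leave-one-out evaluation, the function name `knn_loo_accuracy` and the toy data below are assumptions for illustration. SVM and ELM are not covered here.

```python
import math
from collections import Counter


def knn_loo_accuracy(X, labels, genes, k=3):
    """Leave-one-out KNN accuracy using only the selected gene columns."""
    n = len(X)
    correct = 0
    for i in range(n):
        query = [X[i][g] for g in genes]
        # Distances from the held-out cell to every other cell, nearest first.
        dists = sorted(
            (math.dist(query, [X[j][g] for g in genes]), j)
            for j in range(n) if j != i)
        votes = Counter(labels[j] for _, j in dists[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / n


# Toy data: gene 0 separates the cell types, gene 1 is misleading.
X = [[0.0, 9.0], [0.2, 1.0], [0.4, 5.0], [5.0, 1.2], [5.2, 9.1], [5.4, 5.1]]
labels = ["a", "a", "a", "b", "b", "b"]
acc_informative = knn_loo_accuracy(X, labels, genes=[0], k=1)
acc_misleading = knn_loo_accuracy(X, labels, genes=[1], k=1)
```

Comparing the accuracy obtained on a selected gene subset with the accuracy on other subsets gives a simple way to judge whether the selection kept the informative genes.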
Conclusions
In this study, a single cell gene decision space has been introduced and the fuzzy -equivalence relation on each subspace has been constructed. Three fuzzy entropy measures in a single cell gene decision space have been defined. Relationships among these entropy measures have been discussed. Based on -fuzzy conditional information entropy, the inner-significance and outer-significance in a single cell gene decision space have been introduced. Two gene selection algorithms based on them have
CRediT authorship contribution statement
Zhaowen Li: Methodology, Software, Investigation. Junhong Feng: Data curation, Writing – original draft, Investigation. Jie Zhang: Writing – original draft, Investigation. Fang Liu: Software, Validation. Pei Wang: Software, Validation. Ching-Feng Wen: Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The authors would like to thank the editors and the anonymous reviewers for their valuable comments and suggestions, which have helped immensely in improving the quality of the paper. This work is supported by the National Natural Science Foundation of China (11971420), the Natural Science Foundation of Guangxi (2021GXNSFC, AD19245102, 2020GXNSFAA159155), the Natural Science Foundation of Yulin (202125001), the Key Laboratory of Software Engineering in Guangxi University for Nationalities (2021-18XJSY-03) and
References (39)
- et al. Fusion of local normalization and Gabor entropy weighted genes for face identification. Pattern Recogn. (2014)
- et al. Attribute selection with fuzzy decision reducts. Inf. Sci. (2010)
- et al. Extreme learning machine: Theory and applications. Neurocomputing (2006)
- et al. Single cell transcriptomes reveal characteristic features of human pancreatic islet cell types. EMBO Rep. (2016)
- et al. Color smoothing for RGB-D data using entropy information. Appl. Soft Comput. (2016)
- et al. Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification. Inf. Sci. (2019)
- et al. Uncertainty measures for general fuzzy relations. Fuzzy Sets Syst. (2019)
- et al. A three-way decision method based on Gaussian kernel in a hybrid information system with images: An application in medical diagnosis. Appl. Soft Comput. (2019)
- et al. New uncertainty measurement for categorical data based on fuzzy information structures: an application in attribute reduction. Inf. Sci. (2021)
- et al. A fuzzy rough set approach for incremental feature selection on hybrid information systems. Fuzzy Sets Syst. (2015)