
Information Sciences

Volume 610, September 2022, Pages 1029-1057

Gaussian kernel based gene selection in a single cell gene decision space

https://doi.org/10.1016/j.ins.2022.08.050

Abstract

An information system is a database that shows relationships between objects and attributes. A real-valued information system is an information system whose information values are real numbers. A real-valued information system with decision attributes is referred to as a real-valued decision information system. If the objects, conditional attributes and information values in a real-valued decision information system are cells, genes and gene expression values, respectively, then this information system is said to be a gene decision space. In a gene decision space, one works directly with gene expression data. If the gene expression data in a gene decision space are single cell RNA-seq data, then this space is referred to as a single cell gene decision space. This paper explores gene selection in a single cell gene decision space in terms of the Gaussian kernel. First, the distance between two cells in each subspace of a single cell gene decision space is constructed. Next, the fuzzy Tcos-equivalence relation on the cell set is obtained in terms of the Gaussian kernel. After that, measures of uncertainty for a single cell gene decision space are investigated. Lastly, gene selection algorithms in a single cell gene decision space are presented in terms of the proposed information entropy and information granularity. The presented algorithms are tested on several publicly available single cell RNA-seq data sets. Experimental results reveal that the presented algorithms can select appropriate genes related to classification and significantly improve classification performance.

Introduction

Uncertainty exists nearly everywhere. Uncertainty measurement is a significant issue in many fields, such as pattern recognition [3], image processing [16], medical diagnosis [32] and data mining [8].

RST (rough set theory) is a mathematical tool for dealing with uncertainty. The IS (information system), the data model on which RST operates, was presented by Pawlak [18]. Most applications of RST are related to an IS.

To handle the uncertainty of a system, Shannon [21] introduced the notion of information entropy. Several contributions have been made in this respect. Wierman [27] measured uncertainty in RST. Wang et al. [30] gave a novel entropy measure for the uncertainty of general fuzzy relations. Liang et al. [15] investigated information entropy in complete and incomplete ISs. Zhang et al. [38] investigated uncertainty measures in a fully fuzzy IS. Yu et al. [33] studied uncertainty measurement in a hybrid IS with images. Singh et al. [24] proposed uncertainty measurement in the setting of a set-valued IS. Zhang et al. [37] explored uncertainty measurement in a categorical IS.

Feature selection, or attribute reduction, is an important data-processing technique in machine learning that can effectively remove redundant attributes and improve classification accuracy. Many researchers have studied feature selection; the details are described as follows.

Cornelis et al. [5] introduced the concept of fuzzy decision reducts, dependent on an increasing attribute subset measure. Experimental results demonstrated the potential of fuzzy decision reducts to discover shorter attribute subsets, leading to decision models with better coverage and comparable, or even higher, accuracy.

Jia et al. [12] defined the intra-class similarity for objects in the same decision class and the inter-class similarity for objects in different decision classes, and proposed similarity-based feature selection by maximizing intra-class similarity and minimizing inter-class similarity from the clustering perspective. Using a heuristic search strategy, they designed the corresponding feature selection algorithm. The experimental results indicated that the designed algorithm can significantly improve classification performance.

Li et al. [14] defined information entropy for heterogeneous data, and put forward the notions of joint information entropy, conditional information entropy and mutual information entropy. They applied information entropy to perform feature selection for heterogeneous data and proposed two information entropy based feature selection algorithms. Experimental analysis and comparisons illustrated the efficiency of the proposed algorithms.

The dependency in a neighborhood rough set considers only the classification information contained in the lower approximation of the decision while ignoring the upper approximation. Wang et al. [30] constructed a class of uncertainty measures, decision self-information, for feature selection. These measures consider the uncertainty information in both the lower and upper approximations. They designed a greedy algorithm for feature selection. The experimental results verified that the designed algorithm often chooses fewer features and improves classification accuracy in most cases.

Wang et al. [28] proposed a new form of conditional entropy, defined local conditional entropy based feature selection and introduced an ensemble strategy into the heuristic search for the feature subset. The experimental results showed that local conditional entropy based feature selection is superior to traditional conditional entropy based feature selection, as the former may provide attributes with higher classification accuracy. This study suggested new directions for feature selection and provided guidelines for designing new measures and related algorithms.

Wang et al. [31] presented a new fuzzy rough set model for categorical data by introducing a variable parameter to control the similarity between objects. This model used the iterative computation strategy to define fuzzy rough approximations and dependency functions. They applied the presented model to feature selection for categorical data. The experimental results indicated that the proposed algorithm is more effective than some existing algorithms.

Zeng et al. [38] developed a new hybrid distance in a hybrid IS based on the value difference metric, constructed a novel fuzzy rough set by combining the hybrid distance and the Gaussian kernel, and analyzed the updating mechanisms for feature selection under variation of the feature set. They presented a fuzzy rough set approach for incremental feature selection on a hybrid IS and proposed two corresponding incremental algorithms. Finally, they conducted extensive experiments on eight datasets from UCI. The results showed that the incremental approaches significantly outperform the non-incremental approaches in computational time for feature selection.

Feature selection based on rough set theory is summarized in Table 1.

With the booming development of microarray and gene sequencing technologies, a huge amount of scRNA-seq (single cell RNA-seq) data has arisen. Owing to technology and sampling limitations, scRNA-seq data are characterized by high noise, high sparsity, high dimensionality and uncertainty, which makes it difficult to determine cell types and subtypes. To address these problems, appropriate genes need to be selected from a huge number of genes. Selecting the best genes with traditional statistical analysis and machine learning methods often leads to the curse of dimensionality. Thus, we need to reduce the complexity of scRNA-seq data and establish a gene selection theory. Information entropy based on the Gaussian kernel is an effective method for attribute reduction or gene selection.

If the information values in an IS are real numbers, then it is called an RVIS (real-valued information system). An RVIS with decision attributes is called an RVDIS (real-valued decision information system). If the objects, attributes and information values in an RVDIS are cells, genes and gene expression values, respectively, then this RVDIS is said to be a gene decision space. If the gene expression data in a gene decision space are scRNA-seq data, then this gene decision space is referred to as a single cell gene decision space. This study presents two gene selection algorithms in terms of information entropy and information granularity in a single cell gene decision space.

As a common method of attribute reduction, the traditional rough set model can only deal with categorical data. For scRNA-seq data with real-valued or noisy values, the commonly used discretization preprocessing may lose part of the information in the data itself, which reduces classification accuracy. This study does not discretize the original scRNA-seq data, but utilizes fuzzy sets and fuzzy relations to design the distances among cells. Furthermore, this paper also investigates the fuzzy Tcos-equivalence relation and uncertainty measures of a single cell gene decision space.

The flowchart of this paper is shown in Fig. 1. The remainder of this paper is organized as follows. Section 2 recalls the notion of fuzzy relations and defines a single cell gene decision space. Section 3 proposes the fuzzy Tcos-equivalence relation induced by a single cell gene decision subspace in terms of the Gaussian kernel method. Section 4 studies entropy measures for a single cell gene decision space. Section 5 investigates gene selection in a single cell gene decision space based on fuzzy conditional entropy. Section 6 performs numerical experiments to verify the performance of the proposed gene selection algorithms. Section 7 summarizes this paper.


Preliminaries

In this part, the notion of fuzzy relations is recalled and a single cell gene decision space is defined.

Throughout this paper, $U$ denotes a finite set, $I$ means the unit interval $[0,1]$, $I^U$ indicates the set of all fuzzy sets on $U$, and $I^{U\times U}$ shows the set of all fuzzy relations on $U$. Put $U=\{u_1,u_2,\ldots,u_n\}$.
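For concreteness, a single cell gene decision space can be stored as an expression matrix together with a decision (cell-type) vector. The following is a minimal assumed representation used only for illustration; the variable and gene names are hypothetical, not the paper's notation.

```python
# A minimal assumed representation of a single cell gene decision space
# (U, C ∪ {d}): rows of `expr` are cells (objects in U), columns are genes
# (conditional attributes in C), and `d` holds the decision (cell-type) labels.
import numpy as np

expr = np.array([[0.0, 5.1, 0.3],    # cell u1
                 [0.2, 4.8, 0.0],    # cell u2
                 [3.7, 0.1, 2.9],    # cell u3
                 [3.5, 0.0, 3.1]])   # cell u4
genes = ["gene_A", "gene_B", "gene_C"]               # hypothetical gene names
d = np.array(["type1", "type1", "type2", "type2"])   # decision attribute d

# A subspace (U, P ∪ {d}) with P ⊆ C is simply a column selection.
P = [0, 2]
subspace = expr[:, P]
```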

Fuzzy Tcos-equivalence relation induced by a single cell gene decision subspace

This section defines the fuzzy Tcos-equivalence relation induced by a single cell gene decision subspace in terms of Gaussian kernel.

The Gaussian kernel
$$G(u,v)=\exp\!\left(-\frac{\|u-v\|^{2}}{2\lambda^{2}}\right)\qquad(\lambda\in(0,1])$$
measures the similarity between the points $u$ and $v$, where $\|u-v\|$ is the Euclidean distance between the points $u$ and $v$, and $\lambda$ is a threshold.
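As an illustration only (not the authors' code), the sketch below computes the Gaussian-kernel similarity between every pair of cells from their gene expression vectors; the data values and names are hypothetical.

```python
# A minimal sketch: Gaussian-kernel similarity between every pair of cells,
# where each row of X is one cell's gene expression vector.
import numpy as np

def gaussian_similarity(X: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Return G with G[i, j] = exp(-||x_i - x_j||^2 / (2 * lam**2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b, computed for all pairs at once
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)   # guard against round-off negatives
    return np.exp(-sq_dists / (2.0 * lam ** 2))

# Toy usage: 4 cells described by 3 genes (hypothetical values).
X = np.array([[0.0, 1.2, 0.3],
              [0.1, 1.1, 0.4],
              [2.0, 0.2, 1.5],
              [1.9, 0.1, 1.4]])
print(np.round(gaussian_similarity(X, lam=0.5), 3))
```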

Gaussian kernel method is a significant methodology in artificial intelligence.

If the data distribution or information value distribution follows the circular distribution, then the

Entropy measure for a single cell gene decision space

This part studies entropy measure for a single cell gene decision space.

Stipulate $0\log_2 0 = 0$.

Definition 4.1

For a single cell gene decision space $(U, C\cup\{d\})$, given $P\subseteq C$ and $\lambda\in(0,1]$. Then the fuzzy information entropy of $P$ is defined as
$$H_\lambda(P)=-\sum_{i=1}^{n}\frac{|S_{G_P^{\lambda}}(u_i)|}{n}\log_2\frac{|S_{G_P^{\lambda}}(u_i)|}{n}.\qquad(4.1)$$

In (4.1), the fuzzy information entropy $H_\lambda(P)$ is controlled by $\lambda$.
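As a numerical check only, the sketch below evaluates the entropy of Definition 4.1 from a precomputed fuzzy similarity matrix (for instance the Gaussian-kernel matrix of Section 3), assuming that $|S_{G_P^{\lambda}}(u_i)|$ is the fuzzy cardinality (row sum) of the similarity class of $u_i$; this reading is our assumption, not a quotation of the paper.

```python
# A minimal sketch of Definition 4.1, assuming G is the fuzzy similarity matrix
# on the gene subset P and |S_{G_P^λ}(u_i)| is the i-th row sum of G.
import numpy as np

def fuzzy_information_entropy(G: np.ndarray) -> float:
    """H_λ(P) = -Σ_i (|S(u_i)|/n) log2(|S(u_i)|/n), with 0·log2(0) := 0."""
    n = G.shape[0]
    class_sizes = G.sum(axis=1) / n           # |S_{G_P^λ}(u_i)| / n for each cell
    nz = class_sizes > 0                      # honour the stipulation 0·log2(0) = 0
    return float(-(class_sizes[nz] * np.log2(class_sizes[nz])).sum())

# Extremes consistent with Proposition 4.2: the identity relation gives log2(n),
# the universal (all-ones) relation gives 0.
n = 4
print(fuzzy_information_entropy(np.eye(n)))        # ≈ log2(4) = 2.0
print(fuzzy_information_entropy(np.ones((n, n))))  # 0.0
```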

Proposition 4.2

For a single cell gene decision space $(U, C\cup\{d\})$, given $P\subseteq C$ and $\lambda\in(0,1]$. Then
$$0\leqslant H_\lambda(P)\leqslant\log_2 n.$$

Moreover, if $G_P^{\lambda}$ is the identity relation on $U$, then $H_\lambda(P)=\log_2 n$; if $G_P^{\lambda}=\omega$ (the universal relation), then $H_\lambda(P)$ achieves the minimum value 0.

Proof

Since i,j

Gene selection in a single cell gene decision space based on fuzzy conditional entropy

In this section, several gene selection algorithms in a single cell gene decision space are presented, and their cost effectiveness is discussed in terms of time complexity and space complexity.

Definition 5.1

For a single cell gene decision space $(U, C\cup\{d\})$, given $P\subseteq C$ and $\lambda\in(0,1]$. Then $P$ is referred to as a coordination subset of $C$ relative to $d$ if $H_\lambda(P\,|\,d)=H_\lambda(C\,|\,d)$.

In this paper, the family of all coordination subsets of $C$ relative to $d$ is denoted by $co_d(C)$.
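The paper's actual algorithms rely on inner and outer significance derived from the λ-fuzzy conditional information entropy; as a rough illustration only, the sketch below shows a generic greedy forward search that stops once the chosen subset is a coordination subset in the sense of Definition 5.1. The conditional entropy is passed in as a parameter because its exact formula is not reproduced here, and all names in the sketch are hypothetical.

```python
# A hypothetical greedy forward search (not the paper's Algorithm): repeatedly
# add the gene that most reduces the λ-fuzzy conditional entropy, and stop once
# H_λ(P|d) reaches H_λ(C|d), i.e. once P is a coordination subset (Definition 5.1).
from typing import Callable, List, Sequence
import numpy as np

def greedy_gene_selection(X: np.ndarray,
                          y: np.ndarray,
                          cond_entropy: Callable[[np.ndarray, Sequence[int], np.ndarray, float], float],
                          lam: float = 0.5,
                          eps: float = 1e-6) -> List[int]:
    """`cond_entropy(X, genes, y, lam)` is assumed to compute H_λ(genes | d)."""
    all_genes = list(range(X.shape[1]))
    target = cond_entropy(X, all_genes, y, lam)       # H_λ(C|d): the stopping level
    selected: List[int] = []
    while len(selected) < len(all_genes):
        if abs(cond_entropy(X, selected, y, lam) - target) <= eps:
            break                                     # coordination subset reached
        remaining = [g for g in all_genes if g not in selected]
        # Add the single gene whose inclusion yields the smallest conditional entropy.
        best = min(remaining, key=lambda g: cond_entropy(X, selected + [g], y, lam))
        selected.append(best)
    return selected
```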

Definition 5.2

For a single cell gene decision space $(U, C\cup\{d\})$, given $P$

Experimental analyses

In this section, numerical experiments are performed to verify the performance of the proposed gene selection algorithms. These experiments first apply the two proposed gene selection algorithms to several publicly available datasets, and then use three classifiers, namely KNN (K-Nearest Neighbor), SVM (Support Vector Machine) and ELM (Extreme Learning Machine) [11], to evaluate the performance of the two gene selection algorithms with respect to classification accuracy (ACC).
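As an assumed illustration of the evaluation protocol (the paper's exact experimental settings are not reproduced here), the sketch below computes cross-validated ACC with scikit-learn's KNN and SVM on a reduced gene set; ELM [11] has no scikit-learn implementation and is omitted.

```python
# A minimal evaluation sketch, assuming scikit-learn classifiers stand in for
# the paper's KNN and SVM; `selected` is the list of gene indices returned by
# a gene selection algorithm (hypothetical variable names).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def evaluate_selection(X: np.ndarray, y: np.ndarray, selected: list, cv: int = 5) -> dict:
    Xs = X[:, selected]                       # keep only the selected genes
    return {
        "KNN": cross_val_score(KNeighborsClassifier(n_neighbors=5), Xs, y, cv=cv).mean(),
        "SVM": cross_val_score(SVC(kernel="rbf"), Xs, y, cv=cv).mean(),
    }
```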

Conclusions

In this study, a single cell gene decision space has been defined and the fuzzy Tcos-equivalence relation on each subspace has been constructed. Three fuzzy entropy measures in a single cell gene decision space have been defined, and the relationships among these entropy measures have been discussed. Based on the λ-fuzzy conditional information entropy, the inner significance and outer significance in a single cell gene decision space have been introduced. Two gene selection algorithms based on them have been proposed.

CRediT authorship contribution statement

Zhaowen Li: Methodology, Software, Investigation. Junhong Feng: Data curation, Writing – original draft, Investigation. Jie Zhang: Writing – original draft, Investigation. Fang Liu: Software, Validation. Pei Wang: Software, Validation. Ching-Feng Wen: Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The authors would like to thank the editors and the anonymous reviewers for their valuable comments and suggestions, which have helped immensely in improving the quality of the paper. This work is supported by the National Natural Science Foundation of China (11971420), the Natural Science Foundation of Guangxi (2021GXNSFC, AD19245102, 2020GXNSFAA159155), the Natural Science Foundation of Yulin (202125001), the Key Laboratory of Software Engineering in Guangxi University for Nationalities (2021-18XJSY-03) and

References (39)

  • J. Zhang et al., Feature selection in a neighborhood decision information system with application to scRNA data classification, Appl. Soft Comput. (2021).
  • F.D. Biase et al., Cell fate inclination within 2-cell and 4-cell mouse embryos revealed by single-cell RNA sequencing, Genome Res. (2014).
  • F. Buettner et al., Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells, Nat. Biotechnol. (2015).
  • W. Chung et al., Single-cell RNA-seq enables comprehensive tumour and immune cell profiling in primary breast cancer, Nat. Commun. (2017).
  • O. Dunn, Multiple comparisons among means, J. Am. Stat. Assoc. (1961).
  • J. Derrac, S. García, D. Molina, F. Herrera, A practical tutorial on the use of nonparametric statistical tests as a...
  • J.H. Dai et al., Conditional entropy for incomplete decision systems and its application in data mining, Int. J. Gen. Syst. (2012).
  • M. Friedman, A comparison of alternative tests of significance for the problem of m rankings, Ann. Math. Stat. (1940).
  • A. Grover et al., Single-cell RNA sequencing reveals molecular and functional platelet bias of aged haematopoietic stem cells, Nat. Commun. (2016).