Inverse projection group sparse representation for tumor classification: A low rank variation dictionary approach
Introduction
With the rapid development of gene chip technology, high-dimensional microarray gene expression data can be obtained quickly and accurately. Microarray technology has changed how researchers approach the molecular classification of human tumors [1], [2], [3], and the effective analysis of microarray data has therefore attracted great attention in recent years. However, high-dimensional microarray data are characterized by small sample sizes (few patients), high redundancy [4] and class imbalance [5], [6], [7], all of which pose challenges for tumor classification.
In fact, only a few genes in high-dimensional microarray data are involved in a particular biological process [8]. Hence, a small set of informative genes needs to be selected before further analysis; gene selection aims to remove irrelevant and redundant genes [9]. Rank-based gene selection methods, such as the signal-to-noise ratio (SNR) [10], are promising and attractive because of their simplicity and stability. Other gene selection methods exist as well: for example, a two-stage hybrid gene selection method [11] combines three commonly used ranking methods with LASSO (Least Absolute Shrinkage and Selection Operator), and a decision curve analysis-based statistical index for gene selection [12] takes the clinical misdiagnosis rate into account. Low-rank decomposition splits a matrix into a low-rank component, whose rank is much smaller than that of the original matrix, and a sparse component [13], [14]. Liu et al. [15] directly used low-rank decomposition to extract gene subsets from the original microarray data. On the whole, these gene selection methods all operate directly on the original microarray data, whether on all genes or on a selected informative gene set.
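As a minimal sketch of the rank-based selection mentioned above, the following Python snippet scores each gene with one common form of the signal-to-noise ratio, (mean_a - mean_b) / (std_a + std_b), and keeps the top-ranked genes. The function names, the toy data and this particular SNR definition are illustrative assumptions; the exact statistic used in [10] may differ.

```python
from statistics import mean, stdev


def snr_scores(class_a, class_b):
    """Return one signal-to-noise score per gene.

    class_a, class_b: lists of samples; each sample is a list of expression
    values with one entry per gene. Each class needs at least two samples.
    """
    n_genes = len(class_a[0])
    scores = []
    for g in range(n_genes):
        a = [sample[g] for sample in class_a]
        b = [sample[g] for sample in class_b]
        spread = stdev(a) + stdev(b)
        # Guard against constant genes, where both deviations are zero.
        scores.append((mean(a) - mean(b)) / spread if spread > 0 else 0.0)
    return scores


def top_genes(class_a, class_b, k):
    """Indices of the k genes with the largest absolute SNR."""
    scores = snr_scores(class_a, class_b)
    order = sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)
    return order[:k]


# Toy data (hypothetical values): 3 samples per class, 3 genes.
tumor = [[5.0, 1.0, 2.0], [6.0, 1.1, 2.0], [7.0, 0.9, 2.0]]
normal = [[1.0, 1.0, 2.0], [2.0, 1.2, 2.0], [3.0, 0.8, 2.0]]
print(top_genes(tumor, normal, 1))
```

In this toy example only the first gene separates the two classes, so it is ranked first; the third gene is constant and receives a score of zero.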
Effective tumor classification plays an important role in clinical diagnosis, treatment and prognosis analysis. Sparse representation based classification (SRC) [16] achieves competitive results without a learning stage. However, the success of SRC depends on having sufficient labeled training samples. To address the shortage of training samples, inverse projection-based sparse representation methods have been proposed for face recognition [17] and tumor recognition [11], [12], where they were shown to outperform standard sparse representation methods. The reason is that the former fully exploit the information embedded in unlabeled samples, whereas the latter rely on having enough labeled training samples and ignore the large number of available unlabeled samples. The drawback is that these inverse projection-based sparse representation methods use the $\ell_1$-constraint or $\ell_2$-constraint, while microarray data exhibit group sparsity. Exploiting this intrinsic characteristic is an interesting and promising direction. In addition, these methods are all based on the original data or the corresponding feature data.
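For reference, the standard SRC decision rule discussed above is usually written as follows. This is the textbook formulation, not necessarily the exact variant used in [16]; the symbols are assumptions introduced here: $X$ is the matrix of labeled training samples, $y$ a test sample, $\varepsilon$ a noise tolerance, and $\delta_i(\cdot)$ keeps only the coefficients associated with category $i$.

```latex
\hat{\alpha} = \arg\min_{\alpha} \|\alpha\|_{1}
\quad \text{subject to} \quad \|y - X\alpha\|_{2} \le \varepsilon,
\qquad
\mathrm{label}(y) = \arg\min_{i} \, \|y - X\,\delta_{i}(\hat{\alpha})\|_{2}
```

The rule makes the dependence on labeled data explicit: every category needs enough columns in $X$ to reconstruct $y$ well, which is the limitation that inverse projection methods try to relax by representing training samples with the test set instead.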
For tumor classification, the expression profiles of most genes are flat and are regarded as non-differentially expressed, whereas only a small number of genes are relevant to a specific biological process and are called differentially expressed genes. In low-rank decomposition, the low-rank part corresponds to what samples of the same category have in common, while the sparse noise part corresponds to the differences among samples of the same category. From this point of view, there is an inherent similarity between non-differential expression and the low-rank part, and between differential expression and the sparse noise part.
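One common way to formalize the low-rank plus sparse split described above is the robust PCA program below. It is given only as an illustrative assumption: the decomposition actually used in [13] may differ. Here $X$ denotes the data matrix, $L$ the low-rank part, $S$ the sparse noise part, $\|\cdot\|_{*}$ the nuclear norm (sum of singular values), and $\lambda > 0$ a trade-off parameter.

```latex
\min_{L,\,S} \; \|L\|_{*} + \lambda \|S\|_{1}
\quad \text{subject to} \quad X = L + S
```

Under this reading, $L$ would collect the expression pattern shared by the samples, and $S$ the sample-specific deviations.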
Motivated by these works, we present a tumor classification framework based on an inverse projection-based group sparse representation (IPGSR) model with a low-rank variation dictionary (LRVD), called the LRVD-IPGSR model for short. We focus on constructing an improved group sparsity-based inverse projection sparse representation model from a new perspective. Fig. 1 gives the flowchart of LRVD-IPGSR model-based tumor classification. The main contributions are as follows.
(1) From a new perspective, an LRVD is constructed for tumor classification from the sparse noise part of the low-rank decomposition [13], rather than from the microarray data themselves. The LRVD focuses on the variations between different categories obtained from typical training samples, and forms an independent dictionary. Notably, the LRVD not only recognizes existing categories well, but can also be automatically updated and extended with new categories. Unlike other informative gene selection methods, the LRVD does not select an informative gene subset from the original microarray data, but instead mines the intrinsic variation information between different categories.
(2) An IPGSR model is constructed by introducing the $\ell_{2,1}$-regularization constraint [18] into the inverse projection sparse representation model [11], [12]. The IPGSR model captures the group sparsity of microarray data and makes full use of the unlabeled data at the same time.
(3) An LRVD-IPGSR model is proposed by integrating the IPGSR model and the LRVD. The methodology covers model construction, feasibility, stability, optimization and convergence analysis.
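The group sparsity referred to in contribution (2) is commonly measured with a mixed $\ell_{2,1}$ norm, that is, the sum of the Euclidean norms of the coefficient groups. The short Python sketch below assumes that this is the regularizer meant and that groups correspond to categories; the function names and toy coefficients are hypothetical. It shows why such a penalty prefers coefficients concentrated in one group over coefficients scattered across groups, even when their $\ell_1$ norms are equal.

```python
import math


def l21_norm(coeffs, groups):
    """Sum of the Euclidean norms of the coefficient groups.

    coeffs: flat list of representation coefficients.
    groups: list of index lists, one list per group (e.g. per category).
    """
    return sum(math.sqrt(sum(coeffs[i] ** 2 for i in idx)) for idx in groups)


def l1_norm(coeffs):
    """Sum of absolute values, for comparison with the group penalty."""
    return sum(abs(c) for c in coeffs)


groups = [[0, 1], [2, 3]]
concentrated = [3.0, 4.0, 0.0, 0.0]  # energy confined to the first group
scattered = [3.0, 0.0, 4.0, 0.0]     # same l1 norm, spread over both groups

print(l1_norm(concentrated), l1_norm(scattered))    # 7.0 7.0
print(l21_norm(concentrated, groups))               # 5.0
print(l21_norm(scattered, groups))                  # 7.0
```

Both vectors have the same $\ell_1$ norm, but the concentrated one has the smaller group penalty, so minimizing it favors representations that use few groups.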
The remainder of this paper is organized as follows. Section 2 presents the methodology. Experiments and analysis are given in Section 3. Finally, conclusions are drawn in Section 4.
Section snippets
Construction of the IPGSR model
To improve classification accuracy and stability when few training samples are available, the inverse projection of [11], [12], [17] is adopted. Furthermore, the group sparse constraint [18] is introduced to make full use of the intrinsic characteristics of microarray data.
Suppose $X = [X_1, X_2, \ldots, X_c]$ is a training set, where $X_i$ denotes the samples of the $i$th category and $c$ is the number of categories. Each training sample can be represented by the test set
Experimental results and analysis
The effectiveness of the proposed method is demonstrated using six measures on nine public microarray datasets. Accuracy measures classification performance as the percentage of correctly classified samples. Sensitivity measures the ability to avoid missed diagnoses as the rate of correctly classified positive samples, and specificity measures the ability to avoid misdiagnoses as the rate of correctly classified negative samples. For any test, there is usually a trade-off
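The three measures defined above can be computed directly from confusion-matrix counts. The sketch below assumes binary labels with 1 as the positive (tumor) class; the function name and the toy labels are hypothetical, and the remaining measures used in the paper are not covered.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity and specificity for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        # Share of all samples that were classified correctly.
        "accuracy": (tp + tn) / len(pairs),
        # Share of positive samples that were detected (no missed diagnosis).
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        # Share of negative samples that were cleared (no misdiagnosis).
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Toy labels (hypothetical): 4 positive and 6 negative samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these toy labels, three of four positives and four of six negatives are classified correctly, so sensitivity is 0.75, specificity is about 0.67 and accuracy is 0.7.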
Conclusions
In this paper, one major innovation is the low-rank variation dictionary, which is presented from a novel perspective; the other lies in the construction of the inverse projection-based inverse space sparse representation classification. Experiments show that the proposed model can not only fully mine the useful information contained in unlabeled data, but also identify tumors simply and effectively with the help of the low-rank variation dictionary.
There remain some interesting
CRediT authorship contribution statement
Xiaohui Yang was responsible for the modeling and for writing the article. Xiaoying Jiang and Chenxi Tian implemented the experiments. Pei Wang provided suggestions on the biological analysis and the article title. Funa Zhou and Hamido Fujita provided suggestions on the algorithms and on preparing the article.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The authors would like to thank https://tumorgenome.nih.gov/ and https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS4766 for their datasets. This work was supported by the National Natural Science Foundation of China (11701144, 41771375), the Natural Science Foundation of Henan Province (202102310087) and the Open Fund of the Key Laboratory of Intelligent Perception and Image Understanding of the Ministry of Education, Xidian University (IPIU2019010).
References (41)
- et al., An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme, Knowl.-Based Syst. (2018)
- et al., Extracting gene regulation information for cancer classification, Pattern Recognit. (2007)
- et al., An integrated inverse space sparse representation framework for tumor classification, Pattern Recognit. (2019)
- et al., Low-rank local tangent space embedding for subspace clustering, Inform. Sci. (2020)
- et al., Pseudo-full-space representation based classification for robust face recognition, Signal Process. Image Commun. (2018)
- et al., Group sparse representation based classification for multi-feature multimodal biometrics, Inf. Fusion (2016)
- et al., Gene selection with guided regularized random forest, Pattern Recognit. (2013)
- et al., Mapping microarray gene expression data into dissimilarity spaces for tumor classification, Inform. Sci. (2015)
- et al., Incremental wrapper-based gene selection from microarray data for cancer classification, Pattern Recognit. (2006)
- et al., Gene boosting for cancer classification based on gene expression profiles, Pattern Recognit. (2009)
- Ensembles of random sphere cover classifiers, Pattern Recognit.
- Benefits of dimension reduction in penalized regression methods for high-dimensional grouped data: a case study in low sample size, Bioinformatics
- Defining the galaxy of gene expression in breast cancer, Breast Cancer Res.
- Gene expression profiling predicts clinical outcome of breast cancer, Nature
- Evaluating microarray-based classifiers: an overview, J. Cancer Inform.
- Deep learning fault diagnosis method based on global optimization GAN for unbalanced data
- Genome-wide reprogramming of primary and secondary metabolism, protein synthesis, cellular growth processes, and the regulatory infrastructure of Arabidopsis in response to nitrogen, Plant Physiol.
- Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science
- Inverse projection representation and category contribution rate for robust tumor recognition, IEEE/ACM Trans. Comput. Biol. Bioinform.