Inverse projection group sparse representation for tumor classification: A low rank variation dictionary approach

doi:10.1016/j.knosys.2020.105768

Knowledge-Based Systems

Volume 196, 21 May 2020, 105768

Abstract

Sparse representation based classification (SRC) achieves good results on recognition problems when sufficient training samples are available per subject. Tumor classification, however, is a typical small-sample problem. In this paper, an inverse projection group sparse representation (IPGSR) model is presented for tumor classification, built on a low rank variation dictionary (LRVD); the combined model is called LRVD-IPGSR for short. First, an IPGSR model is constructed that makes full use of the existing training and test samples and of the group sparsity of genetic data. Furthermore, from a new viewpoint, an LRVD is constructed to improve the performance of IPGSR-based tumor classification. The LRVD is constructed independently by detecting and utilizing the variations between normal subjects and typical patients, rather than directly using the genetic data or their corresponding feature data, so it does not change with those data. The LRVD can also be automatically updated and extended to accommodate new types of diseases. Finally, the LRVD-IPGSR model is analyzed in terms of feasibility, stability, optimization and convergence. The performance of the LRVD-IPGSR model-based tumor classification framework is verified on eight microarray gene expression datasets, which cover early diagnosis, tumor type recognition and postoperative metastasis.

Introduction

With the rapid development of gene chip technology, high-dimensional microarray gene expression data can be obtained quickly and accurately. Microarray technology has changed how people think about the molecular classification of human tumors [1], [2], [3], and the effective analysis of microarray data has therefore attracted great attention in recent years. However, high-dimensional microarray data are characterized by small numbers of samples (patients), high redundancy [4] and class imbalance [5], [6], [7], which pose challenges for tumor classification.

In fact, only a few genes in high-dimensional microarray data are involved in any particular biological process [8]. Hence, a small set of informative genes needs to be selected before further analysis, where gene selection aims to remove irrelevant and redundant genes [9]. Rank-based gene selection methods, such as the signal-to-noise ratio (SNR) [10], are attractive because of their simplicity and stability. Other gene selection methods exist as well: for example, a two-stage hybrid gene selection method [11] combines three commonly used ranking methods with LASSO (Least Absolute Shrinkage and Selection Operator), and a decision curve analysis-based statistical index for gene selection [12] takes the clinical misdiagnosis rate into account. Low-rank decomposition splits a matrix into a low-rank component, whose rank is much smaller than that of the original matrix, and a sparse component [13], [14]. Liu et al. [15] directly used low-rank decomposition to extract gene subsets from the original microarray data. On the whole, these gene selection methods all operate directly on the original microarray data, whether on all genes or on a selected informative gene set.
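The rank-based selection mentioned above can be illustrated with a minimal sketch. It assumes the commonly used two-class definition of SNR, (mean of class 1 minus mean of class 0) divided by the sum of the two class standard deviations; whether [10] uses exactly this form is an assumption, and the function names `snr_scores` and `top_genes` and the synthetic data are hypothetical.

```python
import numpy as np

def snr_scores(X, y):
    """Per-gene signal-to-noise ratio for a two-class problem.

    X: (n_samples, n_genes) expression matrix; y: labels in {0, 1}.
    Assumed definition: (mean_1 - mean_0) / (std_1 + std_0).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    mean_diff = pos.mean(axis=0) - neg.mean(axis=0)
    spread = pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1)
    eps = 1e-12  # avoids division by zero for genes that are constant in both classes
    return mean_diff / (spread + eps)

def top_genes(X, y, k):
    """Indices of the k genes with the largest absolute SNR, best first."""
    scores = np.abs(snr_scores(X, y))
    return np.argsort(scores)[::-1][:k]

# Synthetic illustration (not data from the paper): 20 samples, 50 genes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 50))
y_demo = np.array([0] * 10 + [1] * 10)
X_demo[y_demo == 1, 7] += 5.0  # make gene 7 strongly differential
selected = top_genes(X_demo, y_demo, k=5)
```

Ranking by the absolute score treats up-regulated and down-regulated genes symmetrically, which is one possible design choice rather than a requirement of the method.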

Effective tumor classification plays an important role in clinical diagnosis, treatment and prognosis analysis. Sparse representation based classification (SRC) [16] achieves competitive results without any learning step. However, the success of SRC depends on having sufficient labeled training samples. To address the problem of few training samples, a family of inverse projection-based sparse representation methods has been proposed for face recognition [17] and tumor recognition [11], [12]; these works show that inverse projection-based sparse representation methods outperform standard sparse representation methods. The reason is that the former exploit the information embedded in unlabeled samples, whereas the latter rely on sufficient labeled training samples and ignore the large number of existing unlabeled samples. One shortcoming is that these inverse projection sparse representation-based methods rely on the $l_{2}$-constraint or the $l_{1}$-constraint, while microarray data exhibit group sparsity. How to take advantage of this intrinsic characteristic is an interesting and promising question. In addition, these methods are all based on the original data or the corresponding feature data.
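For readers unfamiliar with standard SRC, the following is a minimal sketch of the usual recipe: code a test sample over the labeled training dictionary with an $l_{1}$ penalty, then assign the class whose atoms give the smallest reconstruction residual. The ISTA solver, the parameter values and the names `ista_l1` and `src_predict` are illustrative assumptions, not the implementation of [16].

```python
import numpy as np

def ista_l1(D, y, lam=0.05, n_iter=500):
    """Approximately solve min_a 0.5*||y - D a||_2^2 + lam*||a||_1 with ISTA."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - y))                        # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft-thresholding
    return a

def src_predict(D, labels, y, lam=0.05):
    """Classify y by the smallest class-wise reconstruction residual."""
    labels = np.asarray(labels)
    D = D / np.linalg.norm(D, axis=0, keepdims=True)  # unit-norm training atoms
    a = ista_l1(D, y, lam)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(y - D[:, labels == c] @ a[labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]

# Synthetic illustration: two classes, five labeled training samples each.
rng = np.random.default_rng(0)
base = rng.normal(size=(30, 2))                        # one prototype per class
labels = np.array([0] * 5 + [1] * 5)
D = base[:, labels] + 0.1 * rng.normal(size=(30, 10))  # training samples as columns
y = base[:, 1] + 0.1 * rng.normal(size=30)             # test sample drawn near class 1
predicted = src_predict(D, labels, y)
```

The sketch makes the dependence on labeled data explicit: every atom in `D` needs a label, which is the limitation that the inverse projection methods discussed above aim to relax.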

For tumor classification, on the one hand, the expression profiles of most genes are flat and are regarded as non-differentially expressed; on the other hand, only a small number of genes are relevant to a specific biological process, and these are called differentially expressed genes. In low-rank decomposition, the low-rank part corresponds to what is similar among different samples of the same category, while the sparse noise part corresponds to what differs among different samples of the same category. From this point of view, there is an inherent similarity between non-differential expression and the low-rank part, and between differential expression and the sparse noise part.
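For reference, a widely used convex formulation of such a decomposition is robust principal component analysis. Whether [13] uses exactly this objective is an assumption here, and the symbols $L$, $S$ and $\lambda$ are introduced only for illustration:

$$\min_{L, S} \; \|L\|_{*} + \lambda \|S\|_{1} \quad \text{s.t.} \quad X = L + S,$$

where $\|L\|_{*}$ is the nuclear norm (the sum of the singular values of $L$), which promotes low rank, $\|S\|_{1}$ is the entrywise $l_{1}$ norm, which promotes sparsity, and $\lambda > 0$ balances the two terms. Under the analogy drawn above, $L$ would capture the flat, non-differential expression shared by samples of a category, and $S$ the differential expression.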

Motivated by these works, a tumor classification framework is presented based on an inverse projection-based group sparse representation (IPGSR) model with a low rank variation dictionary (LRVD), called the LRVD-IPGSR model for short. It is worth mentioning that we focus on constructing an improved group sparsity-based inverse projection sparse representation model from a new perspective. Fig. 1 gives the flowchart of LRVD-IPGSR model-based tumor classification. The main contributions are as follows.

(1) From a new perspective, an LRVD is constructed for tumor classification from the sparse noise part of the low-rank decomposition [13], rather than from the microarray data themselves. The LRVD focuses on the variations between different categories obtained from typical training samples, and forms an independent dictionary. It is worth noting that the LRVD not only gives good recognition results on existing categories, but can also be automatically updated and extended with new categories. Unlike other informative gene selection methods, the LRVD does not select an informative gene subset from the original microarray data, but mines the intrinsic variation information between different categories.

(2) An IPGSR model is constructed by introducing the $l_{2, 1}$-regularization constraint [18] into the inverse projection sparse representation model [11], [12]. The IPGSR model can capture the group sparsity of microarray data and make full use of the unlabeled data at the same time.

(3) An LRVD-IPGSR model is proposed by integrating the IPGSR model and the LRVD. The methodology mainly covers model construction and the analysis of feasibility, stability, optimization and convergence.
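To make the $l_{2, 1}$ regularization in contribution (2) concrete, the sketch below codes training samples over the test samples (the inverse projection direction) with a row-wise $l_{2, 1}$ penalty, solved by proximal gradient descent. The objective $\min_A \tfrac{1}{2}\|X - YA\|_F^2 + \lambda\|A\|_{2,1}$, the row-wise grouping, the solver and all names are assumptions chosen for illustration; the paper's actual IPGSR objective and optimization may differ.

```python
import numpy as np

def l21_norm(A):
    """l_{2,1} norm: sum of the Euclidean norms of the rows of A."""
    return np.linalg.norm(A, axis=1).sum()

def prox_l21(A, tau):
    """Proximal operator of tau*||A||_{2,1}: shrink each row toward zero."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * A

def inverse_group_sparse_codes(X, Y, lam=0.1, n_iter=300):
    """Proximal gradient for min_A 0.5*||X - Y A||_F^2 + lam*||A||_{2,1}.

    X: (d, n) training samples as columns; Y: (d, m) test samples as columns.
    Returns A of shape (m, n); column i codes training sample x_i over Y.
    """
    step = 1.0 / (np.linalg.norm(Y, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    A = np.zeros((Y.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = Y.T @ (Y @ A - X)
        A = prox_l21(A - step * grad, lam * step)
    return A

# Synthetic illustration: 8 training samples built from test samples 0 and 3 only.
rng = np.random.default_rng(0)
Y_demo = rng.normal(size=(40, 6))   # 6 unlabeled test samples
C = rng.normal(size=(2, 8))
X_demo = Y_demo[:, [0, 3]] @ C
A = inverse_group_sparse_codes(X_demo, Y_demo, lam=0.5)
active_rows = np.flatnonzero(np.linalg.norm(A, axis=1) > 1e-8)
```

Because the penalty acts on whole rows, a test sample is either used by the training samples jointly or dropped entirely; in this synthetic setup one would expect the non-zero rows of `A` to coincide with the two test samples that generated `X_demo`.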

The remainder of this paper is organized as follows. Section 2 is the methodology. Experiments and analysis are shown in Section 3. Finally, conclusions will be drawn in Section 4.

Section snippets

Construction of the IPGSR model

In order to improve classification accuracy and stability when there are few training samples, the inverse projection in [11], [12], [17] is also adopted. Furthermore, the group sparse constraint [18] is introduced to make full use of the intrinsic characteristic of microarray data.

Suppose $X = [x_{1}, \dots, x_{h_{1}}, \dots, x_{h_{c}}] \in R^{d \times n}$ is a training set and $X_{j} = [x_{h_{j - 1} + 1}, \dots, x_{h_{j}}] \in R^{d \times (h_{j} - h_{j - 1})}$ contains the samples of the $j$th category, where $j = 1, \dots, c$ and $c$ is the number of categories. Each training sample $x_{i} \in X$ can be represented by the test set $Y = [y_{1},$

Experimental results and analysis

The effectiveness of the proposed method is demonstrated using six measures on nine public microarray datasets. Accuracy measures classification performance as the percentage of correctly classified samples; sensitivity measures the ability to avoid missed diagnoses as the rate of correctly classified positive samples; specificity measures the ability to avoid misdiagnoses as the rate of correctly classified negative samples. For any test, there is usually a trade-off
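The three measures defined above can be computed from the confusion counts as in the following minimal sketch. The function name `binary_metrics` and the toy labels are hypothetical, and the remaining measures used in the paper are not covered because this excerpt does not define them.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1  # patient correctly detected
            else:
                fn += 1  # missed diagnosis
        else:
            if p == positive:
                fp += 1  # misdiagnosis
            else:
                tn += 1  # negative correctly recognized
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity

# Toy labels (not results from the paper): 4 positives and 4 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
acc, sens, spec = binary_metrics(y_true, y_pred)
# Counts here: tp=3, fn=1, tn=3, fp=1, so all three measures equal 0.75.
```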

Conclusions

In this paper, one major innovation is a low-rank variation dictionary presented from a novel perspective; another is the construction of inverse projection-based inverse space sparse representation classification. Experiments show that the proposed model can not only fully mine the useful information contained in unlabeled data, but also identify tumors simply and effectively with the help of the low-rank variation dictionary.

There are remaining some interesting

CRediT authorship contribution statement

Xiaohui Yang was responsible for modeling and writing the article. Xiaoying Jiang and Chenxi Tian implemented the experiments. Pei Wang provided suggestions on the biological analysis and the article title. Funa Zhou and Hamido Fujita provided suggestions on the algorithms and on preparing the article.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank https://tumorgenome.nih.gov/ and https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS4766 for their datasets. This work was supported by the National Natural Science Foundation of China (11701144, 41771375), the Natural Science Foundation of Henan Province (202102310087) and the Open Fund of the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University (IPIU2019010).

References (41)

Bi J.J. et al., An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme, Knowl.-Based Syst. (2018)
Wang H.Q. et al., Extracting gene regulation information for cancer classification, Pattern Recognit. (2007)
Yang X.H. et al., An integrated inverse space sparse representation framework for tumor classification, Pattern Recognit. (2019)
Deng T.Q. et al., Low-rank local tangent space embedding for subspace clustering, Inform. Sci. (2020)
Yang X.H. et al., Pseudo-full-space representation based classification for robust face recognition, Signal Process. Image Commun. (2018)
Goswami G. et al., Group sparse representation based classification for multi-feature multimodal biometrics, Inf. Fusion (2016)
Deng H. et al., Gene selection with guided regularized random forest, Pattern Recognit. (2013)
García V. et al., Mapping microarray gene expression data into dissimilarity spaces for tumor classification, Inform. Sci. (2015)
Ruiz R. et al., Incremental wrapper-based gene selection from microarray data for cancer classification, Pattern Recognit. (2006)
Hong J.H. et al., Gene boosting for cancer classification based on gene expression profiles, Pattern Recognit. (2009)
Younsi R. et al., Ensembles of random sphere cover classifiers, Pattern Recognit. (2016)
Ajana S. et al., Benefits of dimension reduction in penalized regression methods for high-dimensional grouped data: a case study in low sample size, Bioinformatics (2019)
Liu E.T. et al., Defining the galaxy of gene expression in breast cancer, Breast Cancer Res. (2002)
van 't Veer L.J. et al., Gene expression profiling predicts clinical outcome of breast cancer, Nature (2002)
Boulesteix A.L. et al., Evaluating microarray-based classifiers: an overview, J. Cancer Inform. (2008)
Zhou F.N. et al., Deep learning fault diagnosis method based on global optimization GAN for unbalanced data (2019)
C.S. Zhang, J.J. Bi, S.X. Xu, E. Ramentol, G.J. Fan, B.J. Qiao, H. Fujita, Multi-imbalance: an open-source...
Scheible W.R. et al., Genome-wide reprogramming of primary and secondary metabolism, protein synthesis, cellular growth processes, and the regulatory infrastructure of Arabidopsis in response to nitrogen, Plant Physiol. (2004)
Golub T.R. et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science (1999)
Yang X.H. et al., Inverse projection representation and category contribution rate for robust tumor recognition, IEEE/ACM Trans. Comput. Biol. Bioinform. (2019)
