Inverse projection group sparse representation for tumor classification: A low rank variation dictionary approach

https://doi.org/10.1016/j.knosys.2020.105768

Abstract

Sparse representation based classification (SRC) achieves good results on recognition problems when sufficient training samples are available per subject. Tumor classification, however, is a typical small-sample problem. In this paper, an inverse projection group sparse representation (IPGSR) model is presented for tumor classification, built on a low rank variation dictionary (LRVD) and called the LRVD-IPGSR model for short. First, an IPGSR model is constructed that makes full use of the existing training and test samples and of the group sparsity of genetic data. Second, from a new viewpoint, an LRVD is constructed to improve the performance of IPGSR-based tumor classification. The LRVD is constructed independently by detecting and utilizing the variations between normal subjects and typical patients, rather than directly using the genetic data or their corresponding feature data and changing with them. The LRVD can also be automatically updated and extended to accommodate new types of diseases. Finally, the feasibility, stability, optimization and convergence of the LRVD-IPGSR model are analyzed. The performance of the LRVD-IPGSR-based tumor classification framework is verified on eight microarray gene expression datasets covering early diagnosis, tumor type recognition and postoperative metastasis.

Introduction

With the rapid development of gene chip technology, high-dimensional microarray gene expression data can be obtained quickly and accurately. Microarray technology has changed how people think about the molecular classification of human tumors [1], [2], [3]. Effective analysis of microarray data has therefore attracted great attention in recent years. However, high-dimensional microarray data are characterized by small sample sizes (few patients), high redundancy [4] and class imbalance [5], [6], [7], which pose challenges for tumor classification.

In fact, only a few genes in high-dimensional microarray data are involved in a particular biological process [8]. Hence, a small set of informative genes needs to be selected before further analysis, where gene selection aims to remove irrelevant and redundant genes [9]. Rank-based gene selection methods, for example the signal-to-noise ratio (SNR) [10], are promising and attractive because of their simplicity and stability. There are other gene selection methods as well: a two-stage hybrid gene selection method [11] combines three commonly used ranking methods with LASSO (Least Absolute Shrinkage and Selection Operator), and a decision curve analysis-based statistical index for gene selection [12] takes the clinical misdiagnosis rate into account. Low-rank decomposition decomposes a matrix into a low-rank component and a sparse component [13], [14], where the low-rank component has a much smaller rank than the original matrix. Liu et al. [15] directly used low-rank decomposition to extract gene subsets from the original microarray data. On the whole, these gene selection methods all operate directly on the original microarray data, whether on all genes or on a selected set of informative genes.
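As a rough illustration of rank-based selection, the sketch below scores each gene with one common form of the signal-to-noise ratio, |mean_pos - mean_neg| / (std_pos + std_neg), and keeps the top-ranked genes. The exact SNR variant used in [10] is not stated in this text, so the formula, the function names `snr_scores` and `top_k_genes`, and the small constant added to the denominator are assumptions made for illustration.

```python
import numpy as np

def snr_scores(X, y):
    """Score genes for a two-class problem.

    X: array of shape (n_samples, n_genes); y: labels in {0, 1}.
    Assumed SNR form: |mean_pos - mean_neg| / (std_pos + std_neg).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    pos, neg = X[y == 1], X[y == 0]
    mean_gap = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    spread = pos.std(axis=0, ddof=1) + neg.std(axis=0, ddof=1)
    # The small constant only guards against division by zero.
    return mean_gap / (spread + 1e-12)

def top_k_genes(X, y, k):
    """Return the indices of the k highest-scoring genes, best first."""
    scores = snr_scores(X, y)
    return np.argsort(scores)[::-1][:k]

# Toy data: four samples, three genes; only the last gene separates the classes.
X_demo = np.array([[1.0, 5.0, 0.0],
                   [1.1, 5.2, 0.1],
                   [0.9, 4.9, 10.0],
                   [1.0, 5.1, 10.2]])
y_demo = np.array([0, 0, 1, 1])
selected = top_k_genes(X_demo, y_demo, k=1)
```

A ranking of this kind treats genes independently, which is part of why it is simple and stable, but it also means redundant genes can receive similar scores.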

Effective tumor classification plays an important role in clinical diagnosis, treatment and prognosis analysis. Sparse representation based classification (SRC) [16] achieves competitive results without a learning stage. However, the success of SRC depends on having sufficient labeled training samples. To address the problem of few training samples, inverse projection-based sparse representation methods have been proposed for face recognition [17] and tumor recognition [11], [12], and they have been shown to outperform standard sparse representation methods. The reason is that the former exploit the information embedded in unlabeled samples, whereas the latter rely on sufficient labeled training samples and ignore the large number of available unlabeled samples. One shortcoming is that these inverse projection-based sparse representation methods use an l2 or l1 constraint, while microarray data exhibit group sparsity. How to exploit this intrinsic characteristic is an interesting and promising question. In addition, these methods are all based on the original data or the corresponding feature data.
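To make the standard SRC baseline concrete, the following minimal sketch codes a test sample over the labeled training samples with an l1 penalty and assigns the class with the smallest class-wise reconstruction residual. It is an illustration under assumptions, not the implementation of [16]: the ISTA solver, the penalty weight `lam`, the iteration count and the function names are all choices made here.

```python
import numpy as np

def ista_lasso(D, x, lam=0.01, n_iter=500):
    """Approximately minimize 0.5*||x - D a||_2^2 + lam*||a||_1 with ISTA."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + 1e-12)  # 1 / Lipschitz constant
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - step * (D.T @ (D @ a - x))                        # gradient step
        a = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft threshold
    return a

def src_predict(D, labels, x, lam=0.01):
    """Classify x by the smallest class-wise residual of its sparse code.

    D: (d, n) training samples as columns; labels: length-n class labels.
    """
    labels = np.asarray(labels)
    D = D / (np.linalg.norm(D, axis=0, keepdims=True) + 1e-12)  # unit-norm atoms
    a = ista_lasso(D, x, lam)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(x - D[:, labels == c] @ a[labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]

# Toy dictionary: two training samples per class in three dimensions.
D_demo = np.array([[1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.1, 0.0, 0.1],
                   [0.0, 0.0, 1.0, 1.0]])
labels_demo = np.array([0, 0, 1, 1])
pred_a = src_predict(D_demo, labels_demo, np.array([2.0, 0.0, 0.0]))
pred_b = src_predict(D_demo, labels_demo, np.array([0.0, 0.0, 3.0]))
```

Because the dictionary consists only of labeled training samples, the quality of the code degrades when each class has few columns, which is the limitation the inverse projection methods aim to relieve.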

For tumor classification, on the one hand, the expression profiles of most genes are flat and are regarded as non-differentially expressed; on the other hand, only a small number of genes are relevant to a specific biological process, and these are called differentially expressed genes. In low-rank decomposition, the low-rank part corresponds to what is similar among different samples of the same category, while the sparse noise part corresponds to what differs among different samples of the same category. From this point of view, there is an inherent similarity between non-differential expression and the low-rank part, and between differential expression and the sparse noise part.
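The split into a low-rank part and a sparse part can be sketched with principal component pursuit, which minimizes the nuclear norm of L plus a weighted l1 norm of S subject to D = L + S. The code below is a generic ADMM solver with commonly used default parameters. It is an assumption that this resembles the decomposition of [13], and it is not the authors' LRVD construction; the function names and defaults are illustrative.

```python
import numpy as np

def soft_threshold(M, tau):
    """Entry-wise shrinkage: proximal operator of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def singular_value_threshold(M, tau):
    """Shrink singular values by tau: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def low_rank_sparse_split(D, lam=None, mu=None, tol=1e-7, max_iter=500):
    """Split D into a low-rank part L and a sparse part S with D ~ L + S."""
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam            # common default
    mu = 0.25 * m * n / (np.abs(D).sum() + 1e-12) if mu is None else mu
    L = np.zeros_like(D)
    S = np.zeros_like(D)
    dual = np.zeros_like(D)   # Lagrange multiplier for the constraint D = L + S
    d_norm = np.linalg.norm(D) + 1e-12
    for _ in range(max_iter):
        L = singular_value_threshold(D - S + dual / mu, 1.0 / mu)
        S = soft_threshold(D - L + dual / mu, lam / mu)
        residual = D - L - S
        dual = dual + mu * residual
        if np.linalg.norm(residual) / d_norm < tol:
            break
    return L, S

# Toy data: a rank-2 matrix with a few large sparse perturbations.
rng = np.random.default_rng(0)
low = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
sparse = np.zeros((20, 15))
sparse[rng.integers(0, 20, 10), rng.integers(0, 15, 10)] = 5.0
D_demo = low + sparse
L_hat, S_hat = low_rank_sparse_split(D_demo)
```

In the analogy drawn above, `L_hat` plays the role of the shared, non-differential expression pattern and `S_hat` the role of the sample-specific, differential part.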

Motivated by these works, a tumor classification framework is presented based on an inverse projection-based group sparse representation (IPGSR) model with a low rank variation dictionary (LRVD), called the LRVD-IPGSR model for short. It is worth mentioning that we focus on constructing an improved group sparsity-based inverse projection sparse representation model from a new perspective. Fig. 1 gives the flowchart of LRVD-IPGSR-based tumor classification. The main contributions are as follows.

(1) From a new perspective, an LRVD is constructed for tumor classification based on the sparse noise part of low-rank decomposition [13], rather than on the microarray data themselves. The LRVD focuses on the variations between different categories obtained from typical training samples, and forms an independent dictionary. It is worth noting that the LRVD not only gives good recognition results on existing categories, but can also be automatically updated and extended with new categories. Unlike other informative gene selection methods, the LRVD does not select an informative gene subset from the original microarray data, but mines the intrinsic variation information between different categories.

(2) An IPGSR model is constructed by introducing the l2,1-regularization constraint [18] into the inverse projection sparse representation model [11], [12]. The IPGSR model captures the group sparsity of microarray data and makes full use of the unlabeled data at the same time.

(3) An LRVD-IPGSR model is proposed by integrating the IPGSR model and the LRVD. The methodology mainly covers model construction and the analysis of feasibility, stability, optimization and convergence.
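For reference, the l2,1 norm behind the group sparsity constraint in contribution (2) is commonly defined as the sum of the l2 norms of the rows of a matrix. The symbols A, m and n below are illustrative, and the row-wise convention is an assumption, since some authors sum over columns instead and the convention of [18] is not restated in this text.

```latex
\|A\|_{2,1} \;=\; \sum_{i=1}^{m} \|a^{i}\|_{2}
           \;=\; \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} A_{ij}^{2}},
\qquad A \in \mathbb{R}^{m \times n},\ a^{i} \text{ the } i\text{th row of } A .
```

Penalizing this norm drives entire rows of A to zero, so coefficients are kept or discarded in groups rather than individually.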

The remainder of this paper is organized as follows. Section 2 presents the methodology. Experiments and analysis are given in Section 3. Finally, conclusions are drawn in Section 4.


Construction of the IPGSR model

In order to improve classification accuracy and stability when there are few training samples, the inverse projection in [11], [12], [17] is also adopted. Furthermore, the group sparse constraint [18] is introduced to make full use of the intrinsic characteristic of microarray data.
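The idea of coding labeled samples over unlabeled ones under a group sparse penalty can be sketched as follows. This is a hedged illustration rather than the IPGSR model itself: the objective 0.5*||X - Y A||_F^2 + lam*||A||_{2,1}, the row-wise grouping, the proximal gradient solver, and the contribution-based decision rule in `classify_by_contribution` are all assumptions, since the full model is not recoverable from this excerpt.

```python
import numpy as np

def prox_l21_rows(A, tau):
    """Row-wise group shrinkage: reduce each row's l2 norm by tau, floored at 0."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A * np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)

def inverse_projection_codes(X, Y, lam=0.1, n_iter=300):
    """Approximately minimize 0.5*||X - Y A||_F^2 + lam*||A||_{2,1}.

    X: (d, n) labeled training samples as columns.
    Y: (d, m) unlabeled test samples as columns, used here as the dictionary.
    Returns A of shape (m, n); column i codes training sample i over Y.
    """
    step = 1.0 / (np.linalg.norm(Y, 2) ** 2 + 1e-12)
    A = np.zeros((Y.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = Y.T @ (Y @ A - X)
        A = prox_l21_rows(A - step * grad, lam * step)
    return A

def classify_by_contribution(A, train_labels):
    """Hypothetical rule: test sample k gets the class whose training
    samples place the largest mean absolute weight on row k of A."""
    train_labels = np.asarray(train_labels)
    classes = np.unique(train_labels)
    contrib = np.stack(
        [np.abs(A[:, train_labels == c]).mean(axis=1) for c in classes], axis=1)
    return classes[np.argmax(contrib, axis=1)]

# Toy data: two classes in two dimensions, one test sample near each class.
X_demo = np.array([[1.0, 0.9, 0.0, 0.1],
                   [0.0, 0.1, 1.0, 0.9]])
train_labels_demo = np.array([0, 0, 1, 1])
Y_demo = np.array([[1.0, 0.05],
                   [0.05, 1.0]])
A_demo = inverse_projection_codes(X_demo, Y_demo)
pred_demo = classify_by_contribution(A_demo, train_labels_demo)
```

The direction of the projection is the point of the sketch: the test samples form the dictionary, so a small labeled set does not limit the number of atoms available for representation.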

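Suppose $X=[x_1,\ldots,x_{h_1},\ldots,x_{h_c}]\in\mathbb{R}^{d\times n}$ is a training set and $X_j=[x_{h_{j-1}+1},\ldots,x_{h_j}]\in\mathbb{R}^{d\times(h_j-h_{j-1})}$ contains the samples of the $j$th category, where $j=1,\ldots,2$ indexes the categories. Each training sample $x_i\in X$ can be represented by the test set $Y=[y_1,$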

Experimental results and analysis

The effectiveness of the proposed method is demonstrated using six measures on nine public microarray datasets. Accuracy measures classification performance as the percentage of correctly classified samples; sensitivity measures the ability to avoid missed diagnoses as the rate of correctly classified positive samples; specificity measures the ability to avoid misdiagnoses as the rate of correctly classified negative samples. For any test, there is usually a trade-off
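The three measures defined above can be computed from the counts of true and false positives and negatives. The sketch below covers only accuracy, sensitivity and specificity, because the remaining measures are not named in this excerpt; the function name `binary_metrics` and the convention that label 1 marks the positive class are assumptions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    nan = float("nan")
    return {
        # Share of all samples that were classified correctly.
        "accuracy": (tp + tn) / len(pairs) if pairs else nan,
        # Share of positive samples that were detected (no missed diagnosis).
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Share of negative samples that were cleared (no misdiagnosis).
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }

# Three patients and two normal subjects; one error in each group.
metrics_demo = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

In this toy case the accuracy is 3/5, the sensitivity is 2/3 and the specificity is 1/2, which shows how a single accuracy figure can hide different error rates on the two classes.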

Conclusions

In this paper, one major innovation is a low-rank variation dictionary presented from a novel perspective; another lies in the construction of the inverse projection-based inverse space sparse representation classification. Experiments show that the proposed model can not only fully mine the useful information contained in unlabeled data, but also identify tumors simply and effectively with the help of the low-rank variation dictionary.

There are remaining some interesting

CRediT authorship contribution statement

Xiaohui Yang is responsible for modeling and writing the article. Xiaoying Jiang and Chenxi Tian carried out the implementation of the experiments. Pei Wang provided suggestions on the biological analysis and the article title. Funa Zhou and Hamido Fujita provided suggestions on the algorithms and on generating the article.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to thank https://tumorgenome.nih.gov/ and https://www.ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS4766 for their datasets. This work has been supported by the National Natural Science Foundation of China (11701144, 41771375), the Natural Science Foundation of Henan Province (202102310087) and the Open Fund of the Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education, Xidian University (IPIU2019010).

References (41)

  • Younsi R. et al., Ensembles of random sphere cover classifiers, Pattern Recognit. (2016)
  • Ajana S. et al., Benefits of dimension reduction in penalized regression methods for high-dimensional grouped data: a case study in low sample size, Bioinformatics (2019)
  • Liu E.T. et al., Defining the galaxy of gene expression in breast cancer, Breast Cancer Res. (2002)
  • van 't Veer L.J. et al., Gene expression profiling predicts clinical outcome of breast cancer, Nature (2002)
  • Boulesteix A.L. et al., Evaluating microarray-based classifiers: an overview, J. Cancer Inform. (2008)
  • Zhou F.N. et al., Deep learning fault diagnosis method based on global optimization GAN for unbalanced data (2019)
  • Zhang C.S., Bi J.J., Xu S.X., Ramentol E., Fan G.J., Qiao B.J., Fujita H., Multi-imbalance: an open-source...
  • Scheible W.R. et al., Genome-wide reprogramming of primary and secondary metabolism, protein synthesis, cellular growth processes, and the regulatory infrastructure of Arabidopsis in response to nitrogen, Plant Physiol. (2004)
  • Golub T.R. et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science (1999)
  • Yang X.H. et al., Inverse projection representation and category contribution rate for robust tumor recognition, IEEE/ACM Trans. Comput. Biol. Bioinform. (2019)