Sparse Manifold Clustering and Embedding to discriminate gene expression profiles of glioblastoma and meningioma tumors

https://doi.org/10.1016/j.compbiomed.2013.08.025

Abstract

The Sparse Manifold Clustering and Embedding (SMCE) algorithm has recently been proposed for simultaneous clustering and dimensionality reduction of data on nonlinear manifolds using sparse representation techniques. In this work, the SMCE algorithm is applied to the differential discrimination of glioblastoma and meningioma tumors by means of their gene expression profiles. Our purpose was to evaluate the robustness of this nonlinear manifold technique for classifying gene expression profiles, which are characterized by the high dimensionality of their representations and the low discrimination power of most of the genes. To this end, we used SMCE to reduce the dimensionality of a preprocessed dataset of 35 single-labeling cDNA microarrays with 11,500 original clones. Afterwards, supervised and unsupervised methodologies were applied to obtain the classification model: the former was based on linear discriminant analysis, the latter on clustering using the SMCE embedding data. The results obtained using both approaches showed that all (100%) of the samples could be correctly classified and that the results of all repetitions but one formed a compatible cluster of predictive labels. Finally, the low-dimensional embedding of the dataset extracted by SMCE revealed large discrimination margins between both classes.

Introduction

Brain tumors are growths of abnormal cells in the tissues of the brain. They are the second fastest growing cause of cancer death among people older than 65 years [1]. In addition, they are the second leading cause of cancer death (after leukemia) in children under fifteen years and in young adults up to the age of thirty-four.

Brain tumors are classified into grades based on their malignancy characteristics. On the one hand, low-grade tumors have low proliferative potential and the possibility of cure following surgical resection alone. On the other hand, high-grade tumors are generally associated with a rapid pre- and post-operative evolution of the disease. Specifically, the most frequent primary brain tumor types are of glial origin (40%), 30% are derived from the meninges and 8% are located in cranial and spinal nerves [2]. In adults over 45 years, the most frequent tumors are meningiomas and glioblastomas. Meningiomas are usually grade I tumors, whereas glioblastomas are the most aggressive tumors (grade IV). Glioblastomas arise from glial cells; they are the most infiltrative tumors and are associated with a poor prognosis [3]. In contrast, meningiomas arise from meningothelial cells; they usually show well-defined edges and remain at the benign stage [4].

Biomedical data from different biological levels offer valuable information for the medical decision process. New biomedical technologies give insight into the origin and prognosis of the illness, moving towards an evidence-based medicine paradigm. Despite the extended use of histopathology as the gold standard for Primary Brain Tumours (PBTs), high-throughput genome sequencing and expression techniques [5] will likely improve the prediction of the clinical course and of the patients' response to therapy [6], [7]. Microarray-based gene expression profiles simultaneously show the messenger RNA expression levels of the genes monitored under a certain condition, such as belonging to a tumor tissue.

Different technologies are available to study gene expression at the transcriptomic level [8], [9]. Single-labeling cDNA microarrays are a cheap technology that is more flexible than any commercial product. This makes them accessible to a wide spectrum of molecular biology research laboratories. A challenging problem of high-throughput genome techniques is their high dimensionality, in terms of the number of variables in the profiles [10], [11]. For example, the initial dimension of the gene-expression profiles studied in this work is 11,500. Moreover, most of those variables have little discrimination power, and therefore do not give relevant information for the design of predictive models for classification. Hence, robust feature reduction methodologies are required to obtain gene-expression signatures that can be used to visualize the datasets, study differential gene expression, design predictive models and identify new molecular subtypes.

Nonlinear manifold techniques have recently arisen as a generalization of the classical linear multivariate techniques for feature extraction and reduction. They establish a correspondence between a high-dimensional space and a lower-dimensional one based on topological relationships. Sparse Manifold Clustering and Embedding (SMCE) is a new nonlinear manifold algorithm proposed for simultaneous clustering and dimensionality reduction of data on nonlinear manifolds using sparse representation techniques.

Several studies have applied machine learning techniques, including manifold learning techniques, to discriminate gene expression profiles or next-generation sequencing data from tumours. Fuller et al. [12] used cDNA array technology and multidimensional scaling (MDS) to profile the gene expression of 30 primary human glioma tissue samples comprising 4 different glioma subtypes: glioblastoma (GM, WHO grade IV), anaplastic astrocytoma (AA, WHO grade III), anaplastic oligodendroglioma (AO, WHO grade III), and oligodendroglioma (OL, WHO grade II). Marko et al. [13] applied different unsupervised methods to integrated genomic, transcriptomic, and morphologic data to reveal a molecular classification of low-grade gliomas. Zang and Zang [14] tested a supervised orthogonal discriminant projection for tumor classification using gene expression data in five public tumor datasets. Huang and Feng [15] proposed a parameter-free semi-supervised local Fisher discriminant analysis (pSELF) to map gene expression data into a low-dimensional space for tumor classification. They tested the method on the SRBCT, DLBCL, and Brain Tumor gene expression data sets. Siu and Hing [16] applied the locally linear embedding (LLE) method to project high-dimensional genomic data from the 1000 Genomes Project and a PHASE III Mexico dataset of the HapMap into a low-dimensional space in order to identify population substructures by common and rare variants. Li et al. [17] developed their relational multimanifold coclustering, based on symmetric nonnegative matrix trifactorization, and tested it on document, image, and gene expression data sets.

In this work, the performance of the SMCE algorithm in classifying glioblastoma and meningioma tumors by means of their gene expression profiles has been evaluated. Glioblastomas and meningiomas have been chosen because they are two diagnoses that arise from different types of cells and show opposite aggressiveness. This leads us to expect good classification results, which gives us the opportunity to focus on the capability of the method to reduce the dimensionality of the representation and to evaluate its robustness with respect to changes in the samples.

To this end, an Affymetrix-based preprocessing was applied to a dataset of 35 single-labeling cDNA microarrays with 11,500 original clones. Next, SMCE was applied directly to the matrix of pre-processed gene-expression values. Afterwards, either a linear discriminant analysis or a clustering algorithm was applied to the SMCE embedded data in order to produce the classification model. The evaluation, based on a bootstrap strategy, showed the stability of the models. Moreover, visual inspection of the embedding confirmed the discrimination capability of the approach.
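The supervised branch of this pipeline (LDA on a 2D embedding, evaluated by bootstrap) can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the paper: the text does not specify the bootstrap scheme, so out-of-bag evaluation is assumed here; the LDA uses equal class priors and a tiny ridge term for numerical safety; and the names `fit_lda_2d`, `predict` and `bootstrap_accuracy` are hypothetical. Only the Python standard library is used.

```python
import random


def fit_lda_2d(X, y):
    """Fit a two-class Fisher LDA on 2D points; return (w, threshold)."""
    groups = {c: [p for p, label in zip(X, y) if label == c] for c in (0, 1)}
    means = {
        c: (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
        for c, g in groups.items()
    }
    # Pooled within-class scatter matrix [[sxx, sxy], [sxy, syy]].
    sxx = sxy = syy = 0.0
    for c, g in groups.items():
        mx, my = means[c]
        for px, py in g:
            sxx += (px - mx) ** 2
            sxy += (px - mx) * (py - my)
            syy += (py - my) ** 2
    ridge = 1e-9  # assumption: small regularizer so the 2x2 inverse exists
    sxx += ridge
    syy += ridge
    det = sxx * syy - sxy * sxy
    dx = means[1][0] - means[0][0]
    dy = means[1][1] - means[0][1]
    # Discriminant direction w = S^{-1} (m1 - m0).
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    # Equal priors assumed: threshold at the projected midpoint of the means.
    mid = ((means[0][0] + means[1][0]) / 2, (means[0][1] + means[1][1]) / 2)
    return w, w[0] * mid[0] + w[1] * mid[1]


def predict(model, p):
    """Return 1 if p projects above the threshold, else 0."""
    w, threshold = model
    return 1 if w[0] * p[0] + w[1] * p[1] > threshold else 0


def bootstrap_accuracy(X, y, n_rounds=200, seed=0):
    """Mean out-of-bag accuracy over bootstrap resamples of (X, y)."""
    rng = random.Random(seed)
    n = len(X)
    scores = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]
        # Skip resamples that lack a class or leave nothing to test on.
        if len({y[i] for i in idx}) < 2 or not oob:
            continue
        model = fit_lda_2d([X[i] for i in idx], [y[i] for i in idx])
        hits = sum(predict(model, X[i]) == y[i] for i in oob)
        scores.append(hits / len(oob))
    return sum(scores) / len(scores)
```

Here `X` would hold the 2D embedding coordinates of the samples and `y` their labels coded as 0 and 1 (for example GM and MM). The spread of the per-round scores, not only their mean, is what indicates the stability of the model.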

The next section describes the dataset preparation process. Afterwards, Section 3 briefly describes the SMCE method and Section 4 presents the simulation experiments. Section 5 presents the results obtained and discusses their relevance. Finally, Section 6 presents the conclusions of the paper.

Dataset preparation

The dataset used in this study is composed of samples extracted from 35 frozen biopsies carried out at the Hospital Prínceps d'Espanya (Bellvitge, Spain). The preparation of the samples is fully described in [18]. Of the 35 samples, 18 were histopathologically diagnosed as Meningiomas (MM) and 17 as Glioblastomas (GM). The samples consisted of single-labeling cDNA microarrays based on the human CNIO oncochip, which is a 12 K cDNA Clone Set microarray that contains 11,500 cDNA

SMCE algorithm

In this section, we review the Sparse Manifold Clustering and Embedding (SMCE) algorithm [22]. Given a set of N data points $\{y_i \in \mathbb{R}^D\}_{i=1}^{N}$ lying in multiple low-dimensional manifolds $\{M_j\}_{j=1}^{n}$, the SMCE algorithm clusters the data into the underlying manifolds and finds low-dimensional representations of the data in each manifold. More specifically, it assumes that the data can be divided into multiple classes where each class can be characterized by a mapping from an unknown low-dimensional space
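To make the sparse-representation step more concrete, the sketch below builds the per-point quantities that such a program operates on. It assumes the formulation in which each point y_i chooses a few neighbors by minimizing λ‖Q_i c_i‖₁ + ½‖X_i c_i‖₂² subject to the entries of c_i summing to one, where the columns of X_i are unit vectors pointing from y_i to the other points and Q_i is a diagonal matrix of normalized distances. This excerpt does not state the program, so treat that as an assumption. The function name `smce_point_inputs` and the parameter `k_max` (restricting candidates to the nearest points, and normalizing over those only) are hypothetical, and the sparse solver itself is not shown.

```python
import math


def smce_point_inputs(points, i, k_max):
    """Build the per-point quantities of the sparse program for point i.

    Returns (neighbors, directions, penalties):
      neighbors  - indices of the k_max nearest distinct points
      directions - unit vectors (y_j - y_i) / ||y_j - y_i||  (columns of X_i)
      penalties  - ||y_j - y_i|| divided by the sum of the kept distances,
                   so nearer neighbors are cheaper to select (diagonal of Q_i)
    """
    yi = points[i]
    # Distance from y_i to every other point; exact duplicates are skipped
    # because their direction vector is undefined.
    candidates = sorted(
        (math.dist(yi, yj), j)
        for j, yj in enumerate(points)
        if j != i and math.dist(yi, yj) > 0.0
    )[:k_max]
    total = sum(d for d, _ in candidates)
    neighbors = [j for _, j in candidates]
    directions = [
        tuple((a - b) / d for a, b in zip(points[j], yi))
        for d, j in candidates
    ]
    penalties = [d / total for d, _ in candidates]
    return neighbors, directions, penalties
```

Because the penalties grow with distance, the ℓ1 term favors nearby points, which is what lets the nonzero coefficients indicate neighbors lying on the same manifold.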

Experiments

The SMCE algorithm was applied to the pre-processed dataset obtained after applying the pipeline described in Section 2. No information about the class was taken into account during the dimensionality reduction. Two dimensions were selected for the embedding representation. Preliminary trials with more dimensions (3, 4 and 5) obtained similar results. Hence, a 2D embedding was selected, following the maximum parsimony criterion and because it allows visual inspection. Two parameters must be tuned in SMCE:

Visual inspection of the reduced dimensionality obtained by SMCE

Fig. 1 shows a scatter plot of a 2D embedding obtained using SMCE with λ=10 and KMax=12. The figure shows the actual classes of the samples (GM, blue dots; MM, red dots) in the embedding space. It is worth noting how the neighborhood relationships between the patterns are preserved in the reduced-dimensionality space obtained by the SMCE technique.

SMCE dimensionality reduction and LDA classification

After the dimensionality reduction, a linear discriminant analysis (LDA) was applied to obtain the classification model in the embedding

Conclusions

In this work, we have evaluated the capability of Sparse Manifold Clustering and Embedding (SMCE) to classify two types of brain tumors using gene expression profiles. The SMCE method has been used for dimensionality reduction and for unsupervised classification. Our results provide an objective evaluation of the SMCE algorithm on a real medical diagnosis problem based on gene-expression profiles. Satisfactory results have been obtained using both supervised and unsupervised approaches.

References (25)

  • L. Shi et al. The microarray quality control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat. Biotechnol. (2010)
  • J.C. Roden et al. Mining gene expression data by interpreting principal components. BMC Bioinformatics (2006)
Juan M. García-Gómez received his Ph.D. in Computer Engineering in 2009. He is Profesor Contratado Doctor at UPVLC. In 2007, he was a Visiting Researcher at ESAT, Katholieke Universiteit Leuven. His research interests are mainly in Machine Learning on Biomedical data. From 2004 to 2008, he coordinated the P:R WP of the FP6 EU project eTumour. PIC from UPVLC in the HELP4MOOD project (FP7-ICT-2009-4; 248765), under negotiation in November, 2011. He has headed, among others, a national project to develop a CDSS for risk assessment of outpatients with diabetes and a project for the evaluation of the Health System for Chronic diseases based on clinical guidelines at IBSALUT (Balearic Islands, ES). He is the author of more than 25 articles published in specialized journals (including Nucleic Research, Bioinformatics, NMR in Biomedicine, Applied Intelligence, and LNCS, among others).

Juan Gómez-Sanchís received a B.Sc. degree in Physics (2000) and a B.Sc. degree in Electronics Engineering (2003) from the University of Valencia. He joined the Public Research Institute IVIA in 2004, where he developed his Ph.D. in hyperspectral computer vision systems applied to agriculture. He joined the Department of Electronics Engineering at the University of Valencia in 2008, where he currently works as an assistant professor on automation applications using machine learning and manifold learning.

Pablo Escandell-Montero received the B.Eng. degree in Telecommunications in 2006 and the M.Eng. degree in Electronics in 2009, both from the University of Valencia, Spain. Currently, he is working towards the Ph.D. degree in the IDAL research group at the same university. His research interests include machine learning and its applications, in particular reinforcement learning and approximate dynamic programming.

Elies Fuster-Garcia was born in Alcoi in 1980. He finished his studies in Physics in 2005 at Universitat de València Estudi General. Then, he carried out his Ph.D. studies in applied Physics at Universitat Politècnica de València. Between 2007 and 2009 he was technical editor of Modelling in Science Education and Learning (MSEL), and between 2009 and 2010 he was an assistant professor at Universitat Internacional Valenciana in the Master in Astronomy and Astrophysics. In 2010, he was a Visiting Researcher at the University of London (St George's Hospital) for three months. Finally, he received his Ph.D. degree in September 2012, on biomedical signal analysis in classification problems. In recent years he has been involved in twelve research projects in the biomedical field, including three EU projects.

Emilio Soria-Olivas received his B.Sc. in Physics in 1982 and his Ph.D. degree in Electronic Engineering in 1997 from the Universitat de Valencia (Spain). Since 1994 he has been with the Department of Ingenieria Electronica at the University of Valencia, where he belongs to the GPDS (Digital Signal Processing Group). He is an Associate Professor. His research activities include advanced signal processing using neural networks and fuzzy systems.

This work was supported by the University of Valencia through project UV-INV-AE11-41271.
