Subtype dependent biomarker identification and tumor classification from gene expression profiles
Introduction
Tumor metastasis and subsequent mortality place a heavy social and fiscal burden on society. Early diagnosis of tumors is more cost-effective and plays a significant role in better management, treatment, and outcomes [1]. Traditional diagnostic methods include cell-based observational and biochemical examination in an organ-based context, both of which rely on vast and varied domain knowledge from pathological research. Guidelines and standards of care have progressed, yet they retain inherent disadvantages of bias, time, and limited accuracy. Gene mutation, with subsequent loss of function or alteration of molecular pathways, is a defining occurrence in most metastatic events, and measuring differential gene expression patterns in tumor cells relative to a normal population is increasingly accepted as a way to diagnose cancer, define treatments, and predict outcomes in personalized cancer care plans [2].
The rapid development and wide use of microarray technology enable simultaneous measurement of expression perturbations across thousands of genes under multiple experimental conditions. These early multivariate analyses have increased our capacity to identify disease genes, drug targets, and tumor subtypes [2], [3], [4], [5]. Accordingly, various methods of analysis, including machine learning algorithms, have been created to compare gene expression profiles. The intrinsic nature of these microarray data collections is usually characterized by high dimensionality (with thousands of gene observations over time and context) combined with a small sample size of specimens or patients, which limits the statistical power for clinical use [6]. This combination of many candidate predictors and few samples often causes pattern profiles to be overfit, so predictions suffer from poor generalization capacity [7]. There are studies suggesting that a few important genes are associated with a specific classification of cancer subtypes and may (ideally) be submitted for Food and Drug Administration (FDA) validation and used for diagnosis [8]. Also, the affected gene space often consists of a large number of noisy and redundant genes, which can diminish the performance of a classifier [9], [10], [11]. For example, the k-nearest-neighbor algorithm is sensitive to irrelevant features in classification [12]. One feasible way to mitigate this problem is to select a subset of discriminant genes from the original gene space by filtering out noisy and redundant genes with effective feature selection methods [13], [14].
Gene selection, also known as feature selection and variable selection, is defined as a process of selecting a small subset of genes that contains the most discriminant information with well-defined evaluation metrics [6]. In addition to reducing the dimensionality of original gene space, effective gene selection methods bring us significant enhancements of quality measures for defining gene sets that validate the drug targets in biological and medical research. These enhancements include better generalization capacity of the constructed classifier, reducing the classifier training time, and improving the interpretability of obtained biomarkers [15].
Depending on whether a classifier is used to evaluate the quality of a candidate feature during the selection process, existing feature selection methods can be broadly divided into four categories: (1) filter methods, (2) wrapper methods, (3) embedded methods, and (4) hybrid methods [16], [17]. Filter methods are independent of any classification model and measure the quality of a feature, or a subset of features, using only the intrinsic properties of the training samples. They combine flexibly with various classifiers, have lower computational complexity, and tend to generalize better [16]. Commonly used metrics in filter methods include distance, consistency, dependency, and information theory-based metrics [18]. Distance-based methods define separability as the metric and try to find the features that best discriminate the target class; one such method is the ReliefF algorithm [19]. Consistency-based methods use the inconsistency rate as the criterion and seek a subset of features with better consistency, as in the Focus and LVF algorithms [20]. Dependency-based methods evaluate the importance of candidate features with statistical theory; available methods include the Pearson correlation coefficient, partial least squares, and the Fisher score [21], [22]. Information theory-based feature selectors are efficient and effective because of their capacity to capture higher-order statistics of the data and to reflect non-linear relationships between variables [23]. Consequently, researchers have proposed a number of feature selectors from the viewpoint of mutual information, including information gain, minimum redundancy maximum relevance (mRMR) [24], and the fast correlation based filter (FCBF) [25]. In contrast, wrapper-based methods rely on a specific learning algorithm to evaluate the quality of each candidate feature subset.
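As a concrete illustration of one such filter criterion, the following is a minimal sketch of Fisher-score ranking in plain Python (an illustrative implementation for this discussion, not the code used in this study): each gene is scored by its between-class scatter divided by its within-class scatter, and genes are returned in decreasing order of discriminative power.

```python
from statistics import mean, pvariance

def fisher_scores(samples, labels):
    """Rank features by the Fisher criterion: between-class scatter
    over within-class scatter, computed independently per feature."""
    classes = sorted(set(labels))
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        col = [s[j] for s in samples]
        mu = mean(col)                      # overall mean of feature j
        between = within = 0.0
        for c in classes:
            vals = [s[j] for s, y in zip(samples, labels) if y == c]
            between += len(vals) * (mean(vals) - mu) ** 2
            within += len(vals) * pvariance(vals)
        scores.append(between / within if within > 0 else float("inf"))
    # indices sorted by decreasing discriminative power
    return sorted(range(n_features), key=lambda j: -scores[j])
```

Because the score is computed per feature, this is a single-feature-ranking filter: it is fast, but it ignores redundancy between genes.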
These methods often use the classification error rate or classification accuracy as the evaluation criterion [26], [27], [28]. Owing to the specific interaction between the selected features and the learning algorithm, wrapper methods tend to obtain better classification results, but at the cost of high time complexity [27]. Embedded methods are essentially a special case of wrapper methods that is more tightly coupled with a specified learning algorithm. Feature subsets are generated while the classifier is being trained, which usually makes embedded methods more tractable and time-efficient than wrapper methods; decision trees and the Lasso are two typical embedded cases [29], [30]. In addition, hybrid schemes have been proposed that combine the advantages of filter and wrapper methods [31], [32]. Essentially, a filter is first used to remove a large number of noisy and redundant features from the original feature space, and a wrapper method is then used to find a discriminant feature subset within the reduced space [33].
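The hybrid filter-then-wrapper pipeline can be sketched as follows, assuming a simple variance filter for stage 1 and a leave-one-out 1-nearest-neighbor wrapper with greedy forward selection for stage 2 (both are stand-ins chosen for brevity, not the methods evaluated in this study):

```python
import math

def loo_accuracy_1nn(samples, labels, feats):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier
    restricted to the candidate feature subset `feats`."""
    correct = 0
    for i, (xi, yi) in enumerate(zip(samples, labels)):
        best_d, pred = math.inf, None
        for k, (xk, yk) in enumerate(zip(samples, labels)):
            if k == i:
                continue
            d = sum((xi[j] - xk[j]) ** 2 for j in feats)
            if d < best_d:
                best_d, pred = d, yk
        correct += int(pred == yi)
    return correct / len(samples)

def hybrid_select(samples, labels, keep=2, max_size=3):
    """Stage 1 (filter): keep the `keep` highest-variance features.
    Stage 2 (wrapper): greedy forward selection guided by LOO accuracy."""
    def variance(j):
        col = [s[j] for s in samples]
        m = sum(col) / len(col)
        return sum((v - m) ** 2 for v in col) / len(col)
    pool = sorted(range(len(samples[0])), key=variance, reverse=True)[:keep]
    selected, best_acc = [], 0.0
    while pool and len(selected) < max_size:
        cand = max(pool, key=lambda j: loo_accuracy_1nn(samples, labels, selected + [j]))
        acc = loo_accuracy_1nn(samples, labels, selected + [cand])
        if acc <= best_acc:   # stop when no candidate improves accuracy
            break
        selected.append(cand)
        pool.remove(cand)
        best_acc = acc
    return selected
```

The filter stage keeps the wrapper's expensive search tractable on high-dimensional data, which is exactly the motivation for the hybrid scheme described above.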
According to the final output style, we can group existing feature selection methods into feature ranking and feature subset selection categories. Feature ranking methods return a list of the original features in descending order of each feature's predictive power [34]. We must then specify how many features to retain after ranking; alternatively, the optimal size of the feature subset can be determined with the help of a learning algorithm. Feature ranking methods comprise single feature ranking and multiple feature ranking methods. The former evaluates the quality of each candidate feature individually and ignores the redundancy and interaction between features [19]; such methods often fail to obtain a feature subset of high quality. Multiple feature ranking methods take the relationship between the candidate feature and previously selected features into account during selection [25], typically adopting a sequential forward or backward selection scheme to rank the original features [6]. Unlike feature ranking methods, feature subset selection methods explicitly or implicitly consider the relevance and redundancy between features and directly return a feature subset, without a further step to determine the optimal size [25].
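A multiple-feature-ranking scheme in the spirit of mRMR can be sketched as below. Note that mRMR proper scores relevance and redundancy with mutual information; this sketch substitutes the absolute Pearson correlation as a cheap proxy (an assumption made purely for illustration):

```python
def pearson(x, y):
    """Plain Pearson correlation; returns 0 for a constant variable."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def mrmr_rank(samples, labels, k):
    """Greedy sequential forward ranking: maximize relevance to the
    class label minus mean redundancy with already-selected features."""
    cols = list(zip(*samples))
    rel = [abs(pearson(c, labels)) for c in cols]
    selected = [max(range(len(cols)), key=lambda j: rel[j])]
    while len(selected) < k:
        rest = [j for j in range(len(cols)) if j not in selected]
        def score(j):
            red = sum(abs(pearson(cols[j], cols[s])) for s in selected)
            return rel[j] - red / len(selected)
        selected.append(max(rest, key=score))
    return selected
```

Unlike single-feature ranking, the redundancy penalty prevents a near-duplicate of an already-selected gene from being chosen next, even when it is highly relevant on its own.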
Currently, a wealth of feature selection methods is available [35], [36], [37], [38], but most of them seek a common subset of genes for all subtypes within a defined pathology and fail to reflect the unique molecular characteristics of each subtype. In fact, a unique subset of genes is likely to exist within each tumor subtype. Identifying these molecularly defined tumor subtypes will increase the clinical efficacy of treatments guided by such predictive biomarkers [2], [39]. Obtaining subtype-dependent biomarkers helps in designing personalized treatment plans, which have been shown to reduce toxicity and side effects while significantly slowing tumor progression. These biomarkers also accelerate structural and cell-based refinement in drug development research on these molecular subtypes, reducing the time and cost of bringing drugs to the clinic.
Studies from related fields have proposed selecting a possibly different feature subset for each class. For example, de Lannoy et al. propose a method that performs class-specific feature selection in multiclass support vector machines and experimentally validate its performance [40]. Zhou and Wang use a class separability measure to select different feature subsets for different classes and compare their method with class-independent feature selection on several biomedical datasets using a support vector machine [41]. A major limitation of these methods is their reliance on a particular classifier, which limits their applicability. To alleviate this problem, Pineda-Bautista et al. propose a class-specific feature selection method that can be used with any classifier, employing a classifier ensemble to classify unseen samples; their experimental results on low-dimensional datasets show the effectiveness of the proposed method [42]. However, classifying new test samples under an ensemble framework without utilizing the confidence of each sub-classifier may lead to poor decisions when votes conflict. Moreover, the aim of these studies is to return multiple feature subsets for feature analysis and classification model construction; few studies, to the best of our knowledge, explore the fusion of multiple class-specific feature subsets and evaluate the effectiveness of the combined features in classification. Furthermore, they conduct experiments on low-dimensional datasets without considering the more difficult case characterized by high dimensionality and small sample sizes. Accordingly, in this study, we propose to select gene profiles that are associated with tumor subtypes, enabling us to define unique genes for each tumor subtype as well as common genes shared by all subtypes.
This enhances the performance in classifying different tumor subtypes and further reduces the chance of overfitting in future algorithms. The main contributions of this study are as follows:
- 1)
We propose a general framework for subtype-dependent biomarker identification that returns a filtered profile of genes for each tumor subtype. We then provide another gene selection framework, called fusion-based gene selection, that merges the obtained subtype-dependent gene profiles and returns a single defining gene profile. We also present the corresponding classification (training and testing) model, associated with the subtype-dependent method, for distinguishing different tumor subtypes.
- 2)
Under each framework, we implement three specific gene selection algorithms with the Fisher score, mRMR, and FCBF as building blocks, respectively. We detail how to obtain the optimal feature subset for both feature-ranking-based and feature-subset-based selection methods.
- 3)
We integrate three classification models with different metrics, namely support vector machine, naïve Bayes, and k-nearest-neighbor, into the framework to construct classifiers, and detail how to estimate the confidence that a sample belongs to a specific class in order to resolve voting conflicts.
- 4)
We test the proposed methods on six benchmark microarray datasets that contain multiple tumor subtypes and compare the performance of support vector machine, naïve Bayes, and k-nearest-neighbor. Extensive experiments demonstrate the superiority of subtype-dependent feature selection over subtype-independent feature selection, and of the support vector machine over naïve Bayes and k-nearest-neighbor, in obtaining a feature subset of high quality.
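As a rough, illustrative caricature of the pipeline in contributions 1)-3): select a one-vs-rest gene profile per subtype, optionally fuse the profiles into a single profile, and classify a new sample by the most confident sub-model. The gap-based filter and centroid-distance confidence below are simplistic stand-ins (assumptions for illustration), not the Fisher score/mRMR/FCBF selectors or the SVM/naïve Bayes/kNN classifiers actually used in this study:

```python
def select_for_subtype(samples, labels, subtype, k):
    """One-vs-rest filter: rank genes by the absolute gap between the
    subtype mean and the rest-of-data mean (a crude relevance proxy)."""
    def gap(j):
        ins = [s[j] for s, y in zip(samples, labels) if y == subtype]
        outs = [s[j] for s, y in zip(samples, labels) if y != subtype]
        return abs(sum(ins) / len(ins) - sum(outs) / len(outs))
    return sorted(range(len(samples[0])), key=gap, reverse=True)[:k]

class SubtypeDependentClassifier:
    """One sub-model per subtype, each on its own gene subset; prediction
    picks the most confident sub-model instead of raw majority voting."""
    def fit(self, samples, labels, k=2):
        self.models = {}
        for c in set(labels):
            genes = select_for_subtype(samples, labels, c, k)
            members = [s for s, y in zip(samples, labels) if y == c]
            centroid = [sum(m[j] for m in members) / len(members) for j in genes]
            self.models[c] = (genes, centroid)
        return self

    def fused_genes(self):
        """Fusion framework: merge the subtype-dependent profiles into a
        single gene profile (here, simply their union)."""
        return sorted({j for genes, _ in self.models.values() for j in genes})

    def predict(self, x):
        # confidence = negative squared distance to the subtype centroid in
        # that subtype's own gene subspace (naive and uncalibrated; the
        # study estimates confidence per classifier instead)
        def confidence(c):
            genes, cen = self.models[c]
            return -sum((x[j] - m) ** 2 for j, m in zip(genes, cen))
        return max(self.models, key=confidence)
```

The key design point this sketch conveys is that each subtype's sub-model operates in its own gene subspace, and conflicts between sub-models are resolved by comparing confidences rather than by unweighted votes.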
The paper is structured as follows. Section II details the proposed subtype-dependent biomarker identification framework and its fusion version, and presents the corresponding subtype-dependent classification model. Section III describes the experimental data, the three baseline feature selectors, the support vector machine classifier, and the evaluation metrics. Section IV presents the experimental results. The last section concludes the study.
Section snippets
Subtype dependent gene selection and tumor classification framework
In this section, we first present the two proposed gene selection frameworks for biomarker identification: subtype dependent framework and its fusion version. We then show the classifier training and testing model under the subtype dependent framework.
Experimental setup
In this section, we first describe the microarray data used in our experiments, and then introduce three gene selection methods that are building blocks of the two proposed frameworks. Finally, we present the widely used support vector machine classifier, and give the evaluation metrics to measure the performance of the proposed methods.
Selection of discriminant genes
In this section, we report the selected genes of each feature selector from two aspects: the number of selected genes, and the relations between genes selected with subtype dependent and independent methods.
Conclusion
Tumor progression is a social and economic problem that affects the quality of life of a large number of individuals; accurately distinguishing tumor subtypes therefore contributes to better management, treatment, and outcomes. Microarray technology provides a way to identify disease genes and classify tumor subtypes, but the intrinsic nature of microarray data, characterized by high dimensionality and small sample sizes, limits its capacity. Correspondingly, researchers have put forward a
Acknowledgment
This work was partially supported by the China Postdoctoral Science Foundation (No. 2016M592046), the National Natural Science Foundation of China (No. 71661167004), the Fundamental Research Funds for the Central Universities (No. JZ2016HGBH1053), the "111 Project" of the Ministry of Education and State Administration of Foreign Experts Affairs (Grant No. B14025), and the Science and Technology Innovation Project of Foshan City, China (Grant No. 2015IT100095).
References (48)
- et al., Wrapper-based gene selection with Markov blanket, Comput. Biol. Med. (2017)
- Gene expression correlates of clinical prostate cancer behavior, Cancer Cell (2002)
- et al., Accelerating wrapper-based feature selection with K-nearest-neighbor, Knowl.-Based Syst. (2015)
- et al., Filter versus wrapper gene selection approaches in DNA microarray domains, Artif. Intell. Med. (2004)
- et al., Incremental wrapper-based gene selection from microarray data for cancer classification, Pattern Recognit. (2006)
- et al., Consistency-based search in feature selection, Artif. Intell. (2003)
- et al., Wrappers for feature subset selection, Artif. Intell. (1997)
- et al., A wrapper method for feature selection using support vector machines, Inf. Sci. (2009)
- et al., Decision forest for classification of gene expression data, Comput. Biol. Med. (2010)
- et al., Hybridising harmony search with a Markov blanket for gene selection problems, Inf. Sci. (2014)
- Improving PLS–RFE based gene selection for microarray data classification, Comput. Biol. Med.
- General framework for class-specific feature selection, Expert Syst. Appl.
- Predicting hypertension without measurement: a non-invasive, questionnaire-based approach, Expert Syst. Appl.
- Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science
- Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer, Cancer Res.
- Biomarker identification and cancer classification based on microarray data using Laplace Naive Bayes model with mean shrinkage, IEEE/ACM Trans. Comput. Biol. Bioinform.
- Gene selection for cancer classification using support vector machines, Mach. Learn.
- Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer, Proc. Natl. Acad. Sci.
- How many genes are needed for a discriminant microarray data analysis?
- Microarray data mining: facing the challenges, SIGKDD Explorations
- A survey and comparative study of statistical tests for identifying differential expression from microarray data, IEEE/ACM Trans. Comput. Biol. Bioinform.
- Robust biomarker identification for cancer diagnosis with ensemble feature selection methods, Bioinformatics
- An introduction to variable and feature selection, J. Mach. Learn. Res.
- Toward integrating feature selection algorithms for classification and clustering, IEEE Trans. Knowl. Data Eng.