Subtype dependent biomarker identification and tumor classification from gene expression profiles
Introduction
Tumor metastasis and subsequent mortality place a heavy social and fiscal burden on society. Early diagnosis of tumors is more cost-effective and plays a significant role in better management, treatment, and outcomes [1]. Traditional diagnostic methods include cell-based observational and biochemical examination in an organ-based context, both of which rely on vast and varied domain knowledge from pathological research. Guidelines and standards of care have progressed, yet they retain inherent disadvantages of bias, time, and limited accuracy. Gene mutation, with subsequent loss of function or alteration of molecular pathways, is a defining occurrence in most metastatic events, and measuring differential gene expression patterns in tumor cells relative to a normal population is increasingly accepted as a way to diagnose cancer, define treatments, and predict outcomes in personalized cancer care plans [2].
The rapid development and wide use of microarray technology enable simultaneous measurement of expression perturbations across thousands of genes under multiple experimental conditions. These early multivariate analyses have increased our capacity to identify disease genes, drug targets, and tumor subtypes [2], [3], [4], [5]. Accordingly, various methods of analysis, including machine learning algorithms, have been created to compare gene expression profiles. The intrinsic nature of these microarray data collections is usually characterized by high dimensionality (with thousands of gene observations over time and context) combined with a small sample size of specimens or patients, which limits the statistical power for clinical use [6]. This combination of many candidate predictors and few samples often causes pattern profiles to be overfit, so predictions suffer from poor generalization capacity [7]. There are studies suggesting that a few important genes are associated with a specific classification of cancer subtypes and may (ideally) be submitted for Food and Drug Administration (FDA) validation and used for diagnosis [8]. Also, the affected gene space often consists of a large number of noisy and redundant genes, which can diminish the performance of a classifier [9], [10], [11]. For example, the k-nearest-neighbor algorithm is sensitive to irrelevant features in classification [12]. One feasible way to mitigate this problem is to select a subset of discriminant genes from the original gene space by filtering out noisy and redundant genes with effective feature selection methods [13], [14].
Gene selection, also known as feature selection and variable selection, is defined as a process of selecting a small subset of genes that contains the most discriminant information with well-defined evaluation metrics [6]. In addition to reducing the dimensionality of original gene space, effective gene selection methods bring us significant enhancements of quality measures for defining gene sets that validate the drug targets in biological and medical research. These enhancements include better generalization capacity of the constructed classifier, reducing the classifier training time, and improving the interpretability of obtained biomarkers [15].
Depending on whether a classifier is used to evaluate the quality of a candidate feature during the selection process, existing feature selection methods can be broadly divided into four categories: (1) filter methods, (2) wrapper methods, (3) embedded methods, and (4) hybrid methods [16], [17]. Filter methods are independent of any classification model and measure the quality of a feature, or a subset of features, using only the intrinsic properties of the training samples. They combine flexibly with various classifiers, have lower computational complexity, and tend to generalize better [16]. Commonly used metrics in filter methods include distance, consistency, dependency, and information theory-based metrics [18]. Distance-based methods define separability as the metric and try to find the features that best discriminate the target class; one such method is the ReliefF algorithm [19]. Consistency-based methods use the inconsistency rate as the criterion and seek a subset of features with better consistency, as in the Focus and LVF algorithms [20]. Dependency-based methods evaluate the importance of candidate features with statistical theory; available methods include the Pearson correlation coefficient, partial least squares, and the Fisher score [21], [22]. Information theory-based feature selectors are efficient and effective because of their capacity to capture higher-order statistics of the data and to reflect non-linear relationships between variables [23]. Consequently, researchers have proposed a number of feature selectors from the viewpoint of mutual information, including information gain, minimum redundancy maximum relevance (mRMR) [24], and the fast correlation based filter (FCBF) [25]. In contrast, wrapper-based methods rely on a specific learning algorithm to evaluate the quality of each candidate feature subset.
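As a concrete illustration of one such filter criterion, the following is a minimal sketch of Fisher-score ranking in plain Python (an illustrative implementation for this discussion, not the code used in this study): each gene is scored by its between-class scatter divided by its within-class scatter, and genes are returned in decreasing order of discriminative power.

```python
from statistics import mean, pvariance

def fisher_scores(samples, labels):
    """Rank features by the Fisher criterion: between-class scatter
    over within-class scatter, computed independently per feature."""
    classes = sorted(set(labels))
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        col = [s[j] for s in samples]
        mu = mean(col)                      # overall mean of feature j
        between = within = 0.0
        for c in classes:
            vals = [s[j] for s, y in zip(samples, labels) if y == c]
            between += len(vals) * (mean(vals) - mu) ** 2
            within += len(vals) * pvariance(vals)
        scores.append(between / within if within > 0 else float("inf"))
    # indices sorted by decreasing discriminative power
    return sorted(range(n_features), key=lambda j: -scores[j])
```

Because the score is computed per feature, this is a single-feature-ranking filter: it is fast, but it ignores redundancy between genes.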
These methods often use the classification error rate or classification accuracy as the evaluation criterion [26], [27], [28]. Owing to the specific interaction between the selected features and the learning algorithm, wrapper methods tend to obtain better classification results, but at the cost of high time complexity [27]. Embedded methods are essentially a special case of wrapper methods that is more tightly coupled with a specified learning algorithm. Feature subsets are generated while the classifier is being trained, which usually makes embedded methods more tractable and time-efficient than wrapper methods; decision trees and the Lasso are two typical embedded cases [29], [30]. In addition, hybrid schemes have been proposed that combine the advantages of filter and wrapper methods [31], [32]. Essentially, a filter is first used to remove a large number of noisy and redundant features from the original feature space, and a wrapper method is then used to find a discriminant feature subset within the reduced space [33].
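The hybrid filter-then-wrapper pipeline can be sketched as follows, assuming a simple variance filter for stage 1 and a leave-one-out 1-nearest-neighbor wrapper with greedy forward selection for stage 2 (both are stand-ins chosen for brevity, not the methods evaluated in this study):

```python
import math

def loo_accuracy_1nn(samples, labels, feats):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier
    restricted to the candidate feature subset `feats`."""
    correct = 0
    for i, (xi, yi) in enumerate(zip(samples, labels)):
        best_d, pred = math.inf, None
        for k, (xk, yk) in enumerate(zip(samples, labels)):
            if k == i:
                continue
            d = sum((xi[j] - xk[j]) ** 2 for j in feats)
            if d < best_d:
                best_d, pred = d, yk
        correct += int(pred == yi)
    return correct / len(samples)

def hybrid_select(samples, labels, keep=2, max_size=3):
    """Stage 1 (filter): keep the `keep` highest-variance features.
    Stage 2 (wrapper): greedy forward selection guided by LOO accuracy."""
    def variance(j):
        col = [s[j] for s in samples]
        m = sum(col) / len(col)
        return sum((v - m) ** 2 for v in col) / len(col)
    pool = sorted(range(len(samples[0])), key=variance, reverse=True)[:keep]
    selected, best_acc = [], 0.0
    while pool and len(selected) < max_size:
        cand = max(pool, key=lambda j: loo_accuracy_1nn(samples, labels, selected + [j]))
        acc = loo_accuracy_1nn(samples, labels, selected + [cand])
        if acc <= best_acc:   # stop when no candidate improves accuracy
            break
        selected.append(cand)
        pool.remove(cand)
        best_acc = acc
    return selected
```

The filter stage keeps the wrapper's expensive search tractable on high-dimensional data, which is exactly the motivation for the hybrid scheme described above.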
According to the final output style, we can group existing feature selection methods into feature ranking and feature subset selection categories. Feature ranking methods return a list of the original features in descending order of each feature's predictive power [34]. We must then specify how many features to retain after ranking; alternatively, the optimal size of the feature subset can be determined with the help of a learning algorithm. Feature ranking methods comprise single feature ranking and multiple feature ranking methods. The former evaluates the quality of each candidate feature individually and ignores the redundancy and interaction between features [19]; such methods often fail to obtain a feature subset of high quality. Multiple feature ranking methods take the relationship between the candidate feature and previously selected features into account during selection [25], typically adopting a sequential forward or backward selection scheme to rank the original features [6]. Unlike feature ranking methods, feature subset selection methods explicitly or implicitly consider the relevance and redundancy between features and directly return a feature subset, without a further step to determine the optimal size [25].
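A multiple-feature-ranking scheme in the spirit of mRMR can be sketched as below. Note that mRMR proper scores relevance and redundancy with mutual information; this sketch substitutes the absolute Pearson correlation as a cheap proxy (an assumption made purely for illustration):

```python
def pearson(x, y):
    """Plain Pearson correlation; returns 0 for a constant variable."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def mrmr_rank(samples, labels, k):
    """Greedy sequential forward ranking: maximize relevance to the
    class label minus mean redundancy with already-selected features."""
    cols = list(zip(*samples))
    rel = [abs(pearson(c, labels)) for c in cols]
    selected = [max(range(len(cols)), key=lambda j: rel[j])]
    while len(selected) < k:
        rest = [j for j in range(len(cols)) if j not in selected]
        def score(j):
            red = sum(abs(pearson(cols[j], cols[s])) for s in selected)
            return rel[j] - red / len(selected)
        selected.append(max(rest, key=score))
    return selected
```

Unlike single-feature ranking, the redundancy penalty prevents a near-duplicate of an already-selected gene from being chosen next, even when it is highly relevant on its own.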
Currently, a wealth of feature selection methods is available [35], [36], [37], [38], but most of them seek a common subset of genes for all subtypes within a defined pathology and fail to reflect the unique molecular characteristics of each subtype. In fact, a unique subset of genes is likely to exist within each tumor subtype. Identifying these molecularly defined tumor subtypes will increase the clinical efficacy of treatments guided by such predictive biomarkers [2], [39]. Obtaining subtype-dependent biomarkers helps in designing personalized treatment plans, which have been shown to reduce toxicity and side effects while significantly slowing tumor progression. These biomarkers also accelerate structural and cell-based refinement in drug development research on these molecular subtypes, reducing the time and cost of bringing drugs to the clinic.
Studies from related fields have proposed selecting a possibly different feature subset for each class. For example, de Lannoy et al. propose a method that performs class-specific feature selection in multiclass support vector machines and experimentally validate its performance [40]. Zhou and Wang use a class separability measure to select different feature subsets for different classes and compare their method with class-independent feature selection on several biomedical datasets using a support vector machine [41]. A major limitation of these methods is their reliance on a particular classifier, which limits their applicability. To alleviate this problem, Pineda-Bautista et al. propose a class-specific feature selection method that can be used with any classifier, employing a classifier ensemble to classify unseen samples; their experimental results on low-dimensional datasets show the effectiveness of the proposed method [42]. However, classifying new test samples under an ensemble framework without utilizing the confidence of each sub-classifier may lead to poor decisions when votes conflict. Moreover, the aim of these studies is to return multiple feature subsets for feature analysis and classification model construction; few studies, to the best of our knowledge, explore the fusion of multiple class-specific feature subsets and evaluate the effectiveness of the combined features in classification. Furthermore, they conduct experiments on low-dimensional datasets without considering the more difficult case characterized by high dimensionality and small sample sizes. Accordingly, in this study, we propose to select gene profiles that are associated with tumor subtypes, enabling us to define unique genes for each tumor subtype as well as common genes shared by all subtypes.
This enhances the performance in classifying different tumor subtypes and further reduces the chance of overfitting in future algorithms. The main contributions of this study are as follows:
- 1)
We propose a general framework for subtype-dependent biomarker identification that returns a filtered profile of genes for each tumor subtype. We then provide another gene selection framework, called fusion-based gene selection, that merges the obtained subtype-dependent gene profiles and returns a single defining gene profile. We also present the corresponding classification (training and testing) model, associated with the subtype-dependent method, for distinguishing different tumor subtypes.
- 2)
Under each framework, we implement three specific gene selection algorithms with the Fisher score, mRMR, and FCBF as building blocks, respectively. We detail how to obtain the optimal feature subset for both feature-ranking-based and feature-subset-based selection methods.
- 3)
We integrate three classification models with different metrics, namely support vector machine, naïve Bayes, and k-nearest-neighbor, into the framework to construct classifiers, and detail how to estimate the confidence that a sample belongs to a specific class in order to resolve voting conflicts.
- 4)
We test the proposed methods on six benchmark microarray datasets that contain multiple tumor subtypes and compare the performance of support vector machine, naïve Bayes, and k-nearest-neighbor. Extensive experiments demonstrate the superiority of subtype-dependent feature selection over subtype-independent feature selection, and of the support vector machine over naïve Bayes and k-nearest-neighbor, in obtaining a feature subset of high quality.
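As a rough, illustrative caricature of the pipeline in contributions 1)-3): select a one-vs-rest gene profile per subtype, optionally fuse the profiles into a single profile, and classify a new sample by the most confident sub-model. The gap-based filter and centroid-distance confidence below are simplistic stand-ins (assumptions for illustration), not the Fisher score/mRMR/FCBF selectors or the SVM/naïve Bayes/kNN classifiers actually used in this study:

```python
def select_for_subtype(samples, labels, subtype, k):
    """One-vs-rest filter: rank genes by the absolute gap between the
    subtype mean and the rest-of-data mean (a crude relevance proxy)."""
    def gap(j):
        ins = [s[j] for s, y in zip(samples, labels) if y == subtype]
        outs = [s[j] for s, y in zip(samples, labels) if y != subtype]
        return abs(sum(ins) / len(ins) - sum(outs) / len(outs))
    return sorted(range(len(samples[0])), key=gap, reverse=True)[:k]

class SubtypeDependentClassifier:
    """One sub-model per subtype, each on its own gene subset; prediction
    picks the most confident sub-model instead of raw majority voting."""
    def fit(self, samples, labels, k=2):
        self.models = {}
        for c in set(labels):
            genes = select_for_subtype(samples, labels, c, k)
            members = [s for s, y in zip(samples, labels) if y == c]
            centroid = [sum(m[j] for m in members) / len(members) for j in genes]
            self.models[c] = (genes, centroid)
        return self

    def fused_genes(self):
        """Fusion framework: merge the subtype-dependent profiles into a
        single gene profile (here, simply their union)."""
        return sorted({j for genes, _ in self.models.values() for j in genes})

    def predict(self, x):
        # confidence = negative squared distance to the subtype centroid in
        # that subtype's own gene subspace (naive and uncalibrated; the
        # study estimates confidence per classifier instead)
        def confidence(c):
            genes, cen = self.models[c]
            return -sum((x[j] - m) ** 2 for j, m in zip(genes, cen))
        return max(self.models, key=confidence)
```

The key design point this sketch conveys is that each subtype's sub-model operates in its own gene subspace, and conflicts between sub-models are resolved by comparing confidences rather than by unweighted votes.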
The paper is structured as follows. Section II details the proposed subtype-dependent biomarker identification framework and its fusion version, and presents the corresponding subtype-dependent classification model. Section III describes the experimental data, the three baseline feature selectors, the support vector machine classifier, and the evaluation metrics. Section IV presents the experimental results. The last section concludes the study.
Section snippets
Subtype dependent gene selection and tumor classification framework
In this section, we first present the two proposed gene selection frameworks for biomarker identification: subtype dependent framework and its fusion version. We then show the classifier training and testing model under the subtype dependent framework.
Experimental setup
In this section, we first describe the microarray data used in our experiments, and then introduce three gene selection methods that are building blocks of the two proposed frameworks. Finally, we present the widely used support vector machine classifier, and give the evaluation metrics to measure the performance of the proposed methods.
Selection of discriminant genes
In this section, we report the selected genes of each feature selector from two aspects: the number of selected genes, and the relations between genes selected with subtype dependent and independent methods.
Conclusion
Tumor progression is a social and economic problem that affects the quality of life of a large number of individuals; accurately distinguishing tumor subtypes therefore contributes to better management, treatment, and outcomes. Microarray technology provides a way to identify disease genes and classify tumor subtypes, but the intrinsic nature of microarray data, characterized by high dimensionality and small sample sizes, limits its capacity. Correspondingly, researchers have put forward a
Acknowledgment
This work was partially supported by the China Postdoctoral Science Foundation (No. 2016M592046), the National Natural Science Foundation of China (No. 71661167004), the Fundamental Research Funds for the Central Universities (No. JZ2016HGBH1053), the "111 Project" of the Ministry of Education and State Administration of Foreign Experts Affairs (Grant No. B14025), and the Science and Technology Innovation Project of Foshan City, China (Grant No. 2015IT100095).
References (48)
- et al., Wrapper-based gene selection with Markov blanket, Comput. Biol. Med. (2017)
- Gene expression correlates of clinical prostate cancer behavior, Cancer Cell (2002)
- et al., Accelerating wrapper-based feature selection with K-nearest-neighbor, Knowl.-Based Syst. (2015)
- et al., Filter versus wrapper gene selection approaches in DNA microarray domains, Artif. Intell. Med. (2004)
- et al., Incremental wrapper-based gene selection from microarray data for cancer classification, Pattern Recognit. (2006)
- et al., Consistency-based search in feature selection, Artif. Intell. (2003)
- et al., Wrappers for feature subset selection, Artif. Intell. (1997)
- et al., A wrapper method for feature selection using support vector machines, Inf. Sci. (2009)
- et al., Decision forest for classification of gene expression data, Comput. Biol. Med. (2010)
- et al., Hybridising harmony search with a Markov blanket for gene selection problems, Inf. Sci. (2014)
- Improving PLS–RFE based gene selection for microarray data classification, Comput. Biol. Med.
- General framework for class-specific feature selection, Expert Syst. Appl.
- Predicting hypertension without measurement: a non-invasive, questionnaire-based approach, Expert Syst. Appl.
- Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science
- Analysis of gene expression identifies candidate markers and pharmacological targets in prostate cancer, Cancer Res.
- Biomarker identification and cancer classification based on microarray data using Laplace Naive Bayes model with mean shrinkage, IEEE/ACM Trans. Comput. Biol. Bioinform.
- Gene selection for cancer classification using support vector machines, Mach. Learn.
- Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer, Proc. Natl. Acad. Sci.
- How many genes are needed for a discriminant microarray data analysis?
- Microarray data mining: facing the challenges, SIGKDD Explorations
- A survey and comparative study of statistical tests for identifying differential expression from microarray data, IEEE/ACM Trans. Comput. Biol. Bioinform.
- Robust biomarker identification for cancer diagnosis with ensemble feature selection methods, Bioinformatics
- An introduction to variable and feature selection, J. Mach. Learn. Res.
- Toward integrating feature selection algorithms for classification and clustering, IEEE Trans. Knowl. Data Eng.