Applied Soft Computing

Volume 81, August 2019, 105538

A novel filter–wrapper hybrid greedy ensemble approach optimized using the genetic algorithm to reduce the dimensionality of high-dimensional biomedical datasets

https://doi.org/10.1016/j.asoc.2019.105538

Highlights

  • Design of a filter–wrapper hybrid greedy ensemble selection approach to kindle an optimal subspace.

  • Leveraging effective search strategies to learn the values of penalty parameters heuristically.

  • Benchmarking results indicate the efficacy, flexibility, and robustness of the proposed approach.

  • The proposed ensemble approach, when optimized using GA, outperformed various selection methods.

Abstract

The predictive accuracy achievable on high-dimensional biomedical datasets is often diminished by many irrelevant and redundant molecular disease diagnosis features. Dimensionality reduction aims at finding a feature subspace that preserves the predictive accuracy while eliminating noise and curtailing the high computational cost of training. The applicability of a particular feature selection technique is heavily reliant on the ability of that technique to match the problem structure and to capture the inherent patterns in the data. In this paper, we propose a novel filter–wrapper hybrid ensemble feature selection approach based on the weighted occurrence frequency and a penalty scheme, to obtain the most discriminative and instructive feature subspace. The proposed approach engenders an optimal feature subspace by greedily combining the feature subspaces obtained from various predetermined base feature selection techniques. Furthermore, the base feature subspaces are penalized based on specific performance-dependent penalty parameters. We leverage effective heuristic search strategies, including greedy parameter-wise optimization and the Genetic Algorithm (GA), to optimize the subspace ensembling process. The effectiveness, robustness, and flexibility of the proposed hybrid greedy ensemble approach in comparison with the base feature selection techniques, and with prolific filter and state-of-the-art wrapper methods, are justified by empirical analysis on three distinct high-dimensional biomedical datasets. Experimental validation revealed that the proposed greedy approach, when optimized using GA, outperformed the selected base feature selection techniques by 4.17%–15.14% in terms of prediction accuracy.

Introduction

The need for efficient analytical methodologies in healthcare applications has led to unparalleled development in the fields of biomedicine and bioinformatics over the past decade [1], [2]. Research in these fields frequently encounters the supervised classification of disease data (e.g., microarray gene data, lung cancer data, and others) [1], [3], [4]. Advances in wet-lab technology are increasing the volume of data with a large number of dimensions [5]. For example, microarray gene profiling [5], [6], [7] aims at measuring the expression levels of tens of thousands of genes, yielding tens of thousands of features. Over the last decade, owing to the availability of high-dimensional biomedical data, numerous feature selection methods have become viable means of providing robust representations of the data in low-dimensional spaces [8], [9]. With high-dimensional data, standard statistical methods suffer from the curse of dimensionality [10], [11], signifying a drastic rise in classification error and computational complexity. This makes it essential to select a feature subspace before classification is undertaken [12], [13], [14]. Feature selection, therefore, is not the end goal of data analysis but a preliminary step toward finding the most informative and discriminative feature subset that optimally represents the given data.

Dimensionality reduction can provide better insight into causal relationships, reduce computational complexity, and engender more reliable estimates [15], [16]. There are numerous methods to achieve dimensionality reduction, including feature selection based on information gain and minimum Redundancy Maximum Relevance (mRMR). Real-world datasets vary, implying that no single feature selection technique is best suited for all datasets [17]. The effectiveness of a feature selection technique depends on its ability to match the problem structure and retain only those features that describe the inherent patterns within the data. The choice of such a technique is usually heuristic and intuition-based, and the challenge for the machine learner is to select the technique that works best for a given dataset. A naive approach would be to try every technique in a predetermined set and keep the one with the best performance; this is computationally expensive and often infeasible. An alternative approach is to perform a heuristic selection, further explored using evolutionary computational algorithms [18]. This approach requires an arbitrary amount of computation time, and the obtained solution might not converge to the actual optimum within a limited number of iterations [19], [20].
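As a minimal sketch of one of the filter techniques named above, the following selects features by information gain (mutual information between each feature and the class label). The synthetic dataset and the subspace size k are illustrative assumptions, not values from the paper.

```python
# Sketch: filter-style feature selection via information gain
# (mutual information), one of the techniques mentioned above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Illustrative high-dimensional data: few samples, many features.
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

# Keep the k features with the highest mutual information with y.
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # (100, 500) -> (100, 20)
```

Filter methods like this score each feature independently of any classifier, which is exactly the limitation (classifier-independence) discussed for filter approaches later in the paper.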

Early works [21], [22], [23] aimed at using filter approaches to determine the most optimal feature subspace. These approaches are heavily reliant on the correlation between the features and are independent of the classifier which limits their accuracy. Min et al. [24] developed a backtracking and heuristic search algorithm to search for optimal feature subspaces. The authors showed that the performance of the evolutionary computing algorithm was similar to backtracking but with lower computational time. More recently, Masood et al. [25] proposed wrapper and hybrid algorithms which used an incremental search on an ordered set of features and Extreme Learning Machine (ELM) classifier to select the best feature subspace. A hybrid genetic algorithm with feature granulation was developed by Dong et al. [26] for feature selection. Tu et al. [27] proposed a multi-strategy ensemble grey wolf optimizer with three search strategies and demonstrated its effectiveness in selecting optimal features. From the existing literature, it is evident that hybrid and wrapper feature selection methods overcome the limitations of filter methods. Moreover, evolutionary computing algorithms are widely used in feature selection because of their population-based mechanism and domain adaptability.

Although most state-of-the-art methods aim at effectively determining an optimal feature subspace, they are either extremely data specific or utilize heuristic-based approaches requiring an arbitrary amount of time with no guarantee of convergence. Furthermore, heuristic search methods using swarm intelligence seldom use correlation measures to guide the search process. To address these problems, we propose a novel ensemble selection approach that uses a set of (five) predetermined feature selection techniques on a representative sample of the dataset to generate multiple feature subspaces. These subspaces are then evaluated using (three) different supervised classification algorithms. The features in the subspaces obtained from the set of chosen feature selection techniques are then penalized based on the evaluation scores, to form an optimal subset of features selected greedily. The penalty factors that affect the choice of features in the hybrid subset are optimized using greedy parameter-wise optimization and the Genetic Algorithm (GA). Moreover, the penalty factors are modeled so as to select a smaller and more instructive feature subspace. Since feature selection is performed on a sample of the dataset rather than the entire dataset, the computational cost is relatively low. Furthermore, the values of the penalty factors that affect the choice of features in the final feature subspace are determined heuristically, limiting the convergence problems that arise when the features themselves are selected heuristically. Table 1 shows the comparison of this work with the existing state-of-the-art methods in effective feature selection. The key contributions of this work are summarized below:
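The penalized, weighted occurrence-frequency idea described above can be sketched as follows. Features proposed by several base selectors are scored by how often they occur, weighted by each base subspace's evaluation score minus its penalty. The scoring rule, numbers, and function names here are illustrative assumptions; the paper's exact penalty scheme may differ.

```python
# Sketch of greedy, penalty-weighted ensembling of base feature subspaces.
from collections import defaultdict

def greedy_ensemble(base_subspaces, scores, penalties, k):
    """Combine base feature subspaces into one subspace of size k.

    base_subspaces: list of feature-index lists, one per base selector
    scores:         evaluation score of each base subspace (e.g. accuracy)
    penalties:      penalty factor applied to each base subspace
    """
    weight = defaultdict(float)
    for subspace, score, penalty in zip(base_subspaces, scores, penalties):
        for f in subspace:
            weight[f] += score - penalty  # penalized weighted frequency
    # Greedily keep the k highest-weighted features.
    return sorted(weight, key=weight.get, reverse=True)[:k]

subspaces = [[0, 1, 2, 5], [1, 2, 7], [2, 5, 9]]
ensembled = greedy_ensemble(subspaces, [0.9, 0.8, 0.7], [0.1, 0.1, 0.3], k=3)
print(ensembled)  # → [2, 1, 5]
```

Feature 2, proposed by all three selectors, accumulates the most weight, so it survives the greedy cut; features proposed only by heavily penalized selectors are dropped.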

  • Design of a filter–wrapper hybrid ensemble selection approach that kindles an optimal feature subspace by greedily combining the subspaces generated by various predetermined feature selection techniques based on specific performance dependent penalty parameters.

  • Leveraging heuristic search strategies such as greedy parameter-wise optimization and GA to determine the optimal values of the penalty factors, which affect how different feature subspaces are ensembled to engender an optimal feature subspace.

  • We present detailed benchmarking results of our hybrid greedy ensemble feature selection approach on three distinct high-dimensional biomedical datasets. Our experimental results indicate the efficiency and robustness of the proposed approach over the base feature selection methods, and other prolific filter and wrapper methods.
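The GA-based penalty optimization named in the second contribution can be sketched as a search over penalty-factor vectors. The fitness function below is a stand-in toy (the real fitness would be the accuracy of the ensembled subspace under those penalties), and all GA settings are illustrative assumptions, not the paper's configuration.

```python
# Sketch: a minimal genetic algorithm over penalty-factor vectors.
import random

random.seed(0)
N_FACTORS, POP, GENS = 5, 20, 30

def fitness(penalties):
    # Stand-in for "accuracy of the ensembled subspace under these
    # penalties"; this toy is peaked at all penalties equal to 0.5.
    return -sum((p - 0.5) ** 2 for p in penalties)

def mutate(ind, rate=0.2):
    # Gaussian perturbation, clipped to [0, 1].
    return [min(1.0, max(0.0, p + random.gauss(0, 0.1)))
            if random.random() < rate else p for p in ind]

def crossover(a, b):
    cut = random.randrange(1, N_FACTORS)  # one-point crossover
    return a[:cut] + b[cut:]

pop = [[random.random() for _ in range(N_FACTORS)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:POP // 2]                      # truncation selection
    pop = elite + [mutate(crossover(random.choice(elite),
                                    random.choice(elite)))
                   for _ in range(POP - len(elite))]

best = max(pop, key=fitness)
print([round(p, 2) for p in best])
```

With the real fitness, each evaluation would involve ensembling the base subspaces under the candidate penalties and scoring the result with a classifier, which is why the paper performs it on a sample rather than the full dataset.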

The remainder of the paper is structured as follows: Section 2 provides an overview of the existing works and reviews their evaluation approaches, advantages, and limitations. Section 3 presents the statistics of the datasets used and addresses the fundamentals of the utilized feature selection algorithms, classification algorithms, and GA. The proposed greedy methodology is presented in Section 4 and the same is evaluated empirically in Section 5. In Section 6, a sensitivity analysis is presented to assess the performance of the results. Finally, Section 7 concludes this paper with highlights on future research possibilities.

Section snippets

Related work

An extensive body of research on the effective determination of the most descriptive feature subspace is available in the literature [28], [29]. This section reviews a few significant dimensionality reduction approaches to provide an overview of the existing state-of-the-art methods built on large biomedical datasets.

Feature selection approaches can be categorized into four categories including filter, wrapper, embedded, and hybrid models. In the field of biomedicine,

Materials and methods

The experimental data consists of three biomedical datasets, which are first described. All the datasets used are split into three mutually exclusive and collectively exhaustive homogeneous samples using stratified random sampling [46]. Stratified random sampling guarantees adequate representation of all the classes in the data, maintaining homogeneity within each stratum and heterogeneity between strata. The
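The stratified three-way split described above might be sketched as follows; the sample sizes and split proportions are illustrative assumptions, not the paper's actual values.

```python
# Sketch: splitting a dataset into three mutually exclusive,
# class-balanced samples via stratified random sampling.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=90, n_classes=3, n_informative=5,
                           random_state=0)

# Carve off S1 first, then split the remainder into S2 and S3;
# stratify=y keeps the class proportions identical in every sample.
X1, X_rest, y1, y_rest = train_test_split(X, y, test_size=2/3,
                                          stratify=y, random_state=0)
X2, X3, y2, y3 = train_test_split(X_rest, y_rest, test_size=0.5,
                                  stratify=y_rest, random_state=0)
print(len(y1), len(y2), len(y3))  # 30 30 30
```

Because each split is stratified, every sample preserves the original class distribution, which is what guards against under-representing a rare disease class in any of the three samples.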

Proposed novel filter–wrapper hybrid greedy ensemble approach for optimal feature selection

The proposed filter–wrapper hybrid feature selection approach uses three samples that are derived from the dataset using stratified random sampling [46]. Division of population into strata reduces the computational complexity and the sampling error. The first sample (S1) is used in selecting features from the predetermined feature selection technique(s) (five here). The feature space of the second sample (S2) is then reduced to the set of features selected using S1. Then, S2 is evaluated using
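A minimal sketch of the select-on-S1, evaluate-on-S2 step described above follows. The specific selector and classifier are illustrative assumptions (the paper uses five base selectors and three classifiers), and the evaluation protocol here is simplified.

```python
# Sketch of the wrapper step: the subspace chosen on sample S1 is used
# to reduce sample S2, which is then scored with a supervised classifier.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=120, n_features=200, random_state=0)
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, stratify=y,
                                  random_state=0)

selector = SelectKBest(f_classif, k=10).fit(X1, y1)   # select on S1
X2_reduced = selector.transform(X2)                   # reduce S2 to subspace
score = LogisticRegression(max_iter=1000).fit(
    X2_reduced, y2).score(X2_reduced, y2)             # evaluate on S2
print(round(score, 2))
```

In the proposed approach, this evaluation score is what feeds the penalty scheme: base subspaces that score poorly on S2 are penalized more heavily during ensembling.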

Experimental results and discussion

In this section, we report a detailed benchmarking of our filter–wrapper hybrid greedy ensemble approach on three high-dimensional biomedical datasets. We first describe the implementation setup, the working environment, and the validation procedure used. Then we discuss the parameter setup, its effect on the proposed system, and the performance of the proposed model, followed by its complexity analysis and training details. Finally, we elucidate the implications of using our proposed

Sensitivity analysis

The experimental results highlight the effectiveness and robustness of the proposed approach over the base selection methods. To further analyze the obtained results, a sensitivity analysis was performed. Sensitivity analysis helps in making decisions concerning more than one solution to a given problem [70]; it measures the extent to which the optimal solution is sensitive to changes in one or more input parameters. The Kolmogorov–Smirnov test of normality revealed
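The Kolmogorov–Smirnov normality check mentioned above might look like the following; the vector of accuracy scores is illustrative synthetic data, not the paper's results.

```python
# Sketch: Kolmogorov–Smirnov test of normality on a vector of
# (illustrative) accuracy scores.
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(0)
scores = rng.normal(loc=0.85, scale=0.02, size=30)

# Standardize and compare against the standard normal CDF.
z = (scores - scores.mean()) / scores.std(ddof=1)
stat, p = kstest(z, "norm")
print(round(stat, 3), round(p, 3))
```

A large p-value fails to reject normality, which in turn determines whether parametric or non-parametric tests are appropriate for comparing the methods.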

Conclusions, limitations, and future directions

Feature selection in the field of biomedicine and bioinformatics is indispensable. In this study, we proposed a penalty based filter–wrapper hybrid greedy ensemble approach to facilitate optimal feature selection. The proposed approach greedily selects the features from the subspaces obtained from the predetermined base selection methods. Specific performance dependent penalty parameters were used to penalize the base feature subspaces essential to achieve the optimal ensembling of those

Declaration of competing interest

The authors confirm that there are no known potential conflicts of interest associated with this publication.


References (70)

  • Ma, Shuangge, et al., Penalized feature selection and classification in bioinformatics, Brief. Bioinform. (2008)
  • Tomar, Divya, et al., A survey on data mining approaches for healthcare, Int. J. Bio-Sci. Bio-Technol. (2013)
  • Dai, Yongqiang, et al., Feature selection of high-dimensional biomedical data using improved SFLA for disease diagnosis
  • Abeel, Thomas, et al., Robust biomarker identification for cancer diagnosis with ensemble feature selection methods, Bioinformatics (2009)
  • Li, Jinyan, et al., Mean-entropy discretized features are effective for classifying high-dimensional biomedical data
  • Bolón-Canedo, Verónica, et al., Statistical dependence measure for feature selection in microarray datasets
  • Li, Tao, et al., A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression, Bioinformatics (2004)
  • Saeys, Yvan, et al., A review of feature selection techniques in bioinformatics, Bioinformatics (2007)
  • He, Matthew, et al., Mathematics of Bioinformatics: Theory, Methods and Applications, Vol. 19 (2011)
  • Bellman, Richard E., Adaptive Control Processes: A Guided Tour, Vol. 2045 (2015)
  • Kalina, Jan, et al., Dimensionality reduction methods for biomedical data, Lékař Tech.-Clin. Technol. (2018)
  • Pechenizkiy, Mykola, et al., Local dimensionality reduction and supervised learning within natural clusters for biomedical data analysis, IEEE Trans. Inf. Technol. Biomed. (2006)
  • Phyu, Thu Zar, et al., Performance comparison of feature selection methods
  • Jain, Anil K., et al., Statistical pattern recognition: A review, IEEE Trans. Pattern Anal. Mach. Intell. (2000)
  • Tang, Jiliang, et al., Feature selection for classification: A review
  • Frank, Eibe, et al., Data mining in bioinformatics using Weka, Bioinformatics (2004)
  • Jović, Alan, et al., A review of feature selection methods with applications
  • Abd-Alsabour, Nadia, A review on evolutionary feature selection
  • Gutjahr, Walter J., A generalized convergence result for the graph-based ant system metaheuristic, Probab. Engrg. Inform. Sci. (2003)
  • Ekwevugbe, Tobore, et al., Real-time building occupancy sensing for supporting demand driven HVAC operations (2013)
  • Zhang, Rui, et al., Information-theoretic environment features selection for occupancy detection in open office spaces
  • Sánchez Sorzano, Carlos Oscar, Javier Vargas, A. Pascual Montano, A survey of dimensionality reduction techniques, arXiv
  • Agarwal, S., et al., Dimensionality reduction methods classical and recent trends: a survey, IJCTA (2016)
  • Kim, Sung-Kyu, et al., miTarget: microRNA target gene prediction using a support vector machine, BMC Bioinform. (2006)
  • Salzberg, Steven L., et al., Microbial gene identification using interpolated Markov models, Nucl. Acids Res. (1998)

    Tushaar Gangavarapu is with the Department of Information Technology at National Institute of Technology Karnataka, Surathkal, India. He is serving as the NITK Student Ambassador for the Intel AI group and CodeNation since 2016. He is a Student Member of the IEEE, Human Centered Computing Group, and Healthcare Analytics and Language Engineering research lab. He worked on Deep NLP-based Customer Assistance and Accessibility for the Blind, at Amazon. His current research interests include Bioinformatics, Healthcare Analytics, Psychological Trait Modeling, Learning Sciences, and Social Network Analysis. He is currently focused on understanding, modeling, and prediction of human behavior and affective outcomes.

    Nagamma Patil is with the Department of Information Technology at National Institute of Technology Karnataka, Surathkal, India. She received her Ph.D. degree in Computer Science and Engineering from Indian Institute of Technology Roorkee, India. Her research interests include Data Mining, Soft Computing, Big Data Analytics, Machine Learning, and Bioinformatics. She received a grant from the Vision Group on Science and Technology, Government of Karnataka, in 2018 for her work on protein structure prediction. She has over 35 research publications in reputed and peer-reviewed International Journals and Conference Proceedings.
