Genetic algorithm based cancerous gene identification from microarray data using ensemble of filter methods

  • Original Article
  • Medical & Biological Engineering & Computing

Abstract

Microarray datasets play a crucial role in cancer detection. However, their high dimensionality makes classification challenging, because many of the features are irrelevant or redundant. Feature selection is therefore indispensable in this field, as it removes such unneeded features. Because selecting the optimal subset of features is an NP-hard problem, meta-heuristic search techniques are used to cope with it. In this paper, we propose a two-stage model for feature selection in microarray datasets. The gene rankings produced by different filter methods are quite diverse, and the effectiveness of each ranking depends on the dataset. In the first stage, we therefore build an ensemble of filter methods by taking the union and the intersection of the top-n features ranked by ReliefF, chi-square, and symmetrical uncertainty. This ensemble combines the information from all three rankings in a single subset. In the second stage, we apply a genetic algorithm (GA) to the union and to the intersection to fine-tune the selection; the union performs better than the intersection. We show that the model is classifier independent by evaluating it with three classifiers: multi-layer perceptron (MLP), support vector machine (SVM), and K-nearest neighbor (K-NN). We have tested the model on five cancer datasets: colon, lung, leukemia, SRBCT, and prostate. Experimental results show that our model outperforms state-of-the-art methods.
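The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three rankings are made-up lists of gene indices standing in for ReliefF, chi-square, and symmetrical uncertainty output, `toy_fitness` is a hypothetical stand-in for classifier accuracy, and the GA operators (truncation selection, uniform crossover, bit-flip mutation, elitism) and all parameter values are assumptions chosen for brevity.

```python
import random


def ensemble_candidates(rankings, n):
    """Stage 1: union and intersection of the top-n genes of each ranking."""
    tops = [set(r[:n]) for r in rankings]
    return sorted(set.union(*tops)), sorted(set.intersection(*tops))


def genetic_search(candidates, fitness, pop_size=20, generations=30,
                   p_mut=0.05, seed=0):
    """Stage 2: a simple GA over bit masks of the candidate genes.

    All parameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)

    def decode(mask):
        # Map a bit mask back to the list of selected gene indices.
        return [g for g, bit in zip(candidates, mask) if bit]

    def score(mask):
        return fitness(decode(mask))

    pop = [[rng.randint(0, 1) for _ in candidates] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=score, reverse=True)
        parents = ranked[:pop_size // 2]      # truncation selection
        children = [ranked[0][:]]             # elitism: keep the best mask
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover: each bit comes from either parent.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Bit-flip mutation.
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
    return decode(max(pop, key=score))


# Made-up rankings (best gene first); placeholders for real filter output.
relieff = [5, 2, 9, 1, 7, 3]
chi_square = [2, 5, 1, 8, 9, 0]
sym_uncertainty = [2, 7, 5, 4, 1, 6]

union, intersection = ensemble_candidates(
    [relieff, chi_square, sym_uncertainty], n=3)

# Hypothetical fitness: reward two "informative" genes, penalize subset size.
# In practice this would be a classifier's cross-validated accuracy.
INFORMATIVE = {2, 5}


def toy_fitness(genes):
    return len(INFORMATIVE & set(genes)) - 0.1 * len(genes)


selected = genetic_search(union, toy_fitness)
print("union:", union)
print("intersection:", intersection)
print("GA-selected subset of the union:", selected)
```

In a real experiment the fitness function would train and validate one of the classifiers named in the abstract on the selected genes, which is what makes the second stage a wrapper step on top of the filter ensemble.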

Author information

Corresponding author

Correspondence to Manosij Ghosh.

Ethics declarations

Competing interests

None of the authors has any competing interests in the manuscript.

About this article

Cite this article

Ghosh, M., Adhikary, S., Ghosh, K.K. et al. Genetic algorithm based cancerous gene identification from microarray data using ensemble of filter methods. Med Biol Eng Comput 57, 159–176 (2019). https://doi.org/10.1007/s11517-018-1874-4

  • DOI: https://doi.org/10.1007/s11517-018-1874-4
