Abstract
When the number of features is very large relative to the number of samples, feature selection is a useful preprocessing step that eliminates irrelevant and redundant features from the final feature subset. DNA microarray data are one recent application area: the number of dimensions grows quickly, which calls for further research on feature selection. Modeling the feature search space as a graph makes the features easier to visualize and allows graph-theoretic concepts to be used in the selection process. In this paper, a filter-based feature selection algorithm using graph techniques, named Symmetric Uncertainty Class-Feature Association Map (SU-CFAM), is proposed for reducing the dimensionality of a dataset. In the first step, it uses Symmetric Uncertainty to represent the feature search space as a graph. After partitioning the graph into several clusters with a community detection algorithm, SU-CFAM constructs an adjacency matrix for each cluster and selects the final subset using the concept of a maximal independent set. The performance of SU-CFAM is compared with five well-known feature selection approaches using three classifiers: SVM, DT, and NB. Experiments on fifteen public DNA microarray datasets show that SU-CFAM can achieve better classification performance than the other methods.
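The two central ingredients named above, Symmetric Uncertainty and a maximal independent set, can be illustrated with a minimal sketch. It is not the SU-CFAM implementation. It assumes discrete (already discretized) feature values, skips the community detection step and treats all features as a single cluster, and uses a simple greedy pass to build a maximal independent set. The function names and the `threshold` parameter are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy
    return 2.0 * mutual_info / (hx + hy)


def greedy_maximal_independent_set(adjacency, order):
    """Visit nodes in `order`; keep a node if none of its neighbours is kept."""
    chosen = []
    for node in order:
        if all(not adjacency[node][other] for other in chosen):
            chosen.append(node)
    return chosen


def select_features(features, labels, threshold=0.5):
    """Hypothetical single-cluster selection.

    Two features are linked when their mutual SU reaches `threshold`
    (an assumed redundancy cut-off). Features are then visited in order
    of decreasing SU with the class labels.
    """
    names = list(features)
    relevance = {f: symmetric_uncertainty(features[f], labels) for f in names}
    adjacency = {
        a: {
            b: a != b
            and symmetric_uncertainty(features[a], features[b]) >= threshold
            for b in names
        }
        for a in names
    }
    order = sorted(names, key=lambda f: relevance[f], reverse=True)
    return greedy_maximal_independent_set(adjacency, order)


if __name__ == "__main__":
    toy = {"g1": [0, 0, 1, 1], "g2": [0, 0, 1, 1], "g3": [0, 1, 0, 1]}
    print(select_features(toy, labels=[0, 0, 1, 1]))
```

In the toy data, `g1` and `g2` are identical and therefore linked, so only one of them can enter the independent set. Because the set is maximal, every feature that is not linked to a kept feature is also retained, regardless of its relevance to the class.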
Cite this article
Bakhshandeh, S., Azmi, R. & Teshnehlab, M. Symmetric uncertainty class-feature association map for feature selection in microarray dataset. Int. J. Mach. Learn. & Cyber. 11, 15–32 (2020). https://doi.org/10.1007/s13042-019-00932-7