Abstract
Text classification is a popular research topic in data mining, and many classification methods have been proposed. Feature selection is an important technique for text classification: it reduces dimensionality, removes irrelevant data, increases learning accuracy, and improves the comprehensibility of results. In recent years, datasets in many applications have grown in both the number of instances and the number of features, and classical feature selection methods handle such large-scale data poorly because of their computational cost. To address this issue, this paper proposes a parallel feature selection method based on MapReduce. Specifically, mutual information based on Rényi entropy is used to measure the relationship between feature variables and the class variable, and maximum mutual information theory is then employed to choose the most informative combination of features. The selection process is implemented on MapReduce, making it efficient and scalable for large-scale problems. Finally, a practical example demonstrates the efficiency of the proposed method.
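To make the scoring step concrete, the following is a minimal sketch of mutual-information-based feature ranking using Rényi entropy. It is not the paper's implementation: the function names, the order-α parameter, and the additive approximation MI(X; Y) ≈ H_α(X) + H_α(Y) − H_α(X, Y) are illustrative assumptions, and the features are assumed to be discrete.

```python
import numpy as np
from collections import Counter

def renyi_entropy(labels, alpha=2.0):
    """Rényi entropy of order alpha for a discrete sample (bits)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    if alpha == 1.0:  # Shannon entropy as the alpha -> 1 limit
        return -np.sum(p * np.log2(p))
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

def renyi_mutual_information(x, y, alpha=2.0):
    """MI approximated via the additive form H(X) + H(Y) - H(X, Y)."""
    joint = list(zip(x, y))  # pair up samples to estimate the joint entropy
    return (renyi_entropy(x, alpha) + renyi_entropy(y, alpha)
            - renyi_entropy(joint, alpha))

def select_features(X, y, k, alpha=2.0):
    """Score each discrete feature column against the class labels
    and keep the indices of the k most informative ones."""
    scores = [renyi_mutual_information(X[:, j], y, alpha)
              for j in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda j: scores[j],
                  reverse=True)[:k]

# Toy example: feature 0 mirrors the class label, feature 1 is noise.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
print(select_features(X, y, k=1))  # -> [0]
```

In the parallel setting described by the abstract, the per-feature scoring loop in `select_features` is the natural map phase (each mapper scores a disjoint block of feature columns), and the reduce phase merges the partial score lists and keeps the global top-k.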
About this article
Cite this article
Li, Z., Lu, W., Sun, Z. et al. A parallel feature selection method study for text classification. Neural Comput & Applic 28 (Suppl 1), 513–524 (2017). https://doi.org/10.1007/s00521-016-2351-3