A parallel feature selection method study for text classification

  • Original Article
  • Published in Neural Computing and Applications

Abstract

Text classification is a popular research topic in data mining, and many classification methods have been proposed. Feature selection is an important technique for text classification because it reduces dimensionality, removes irrelevant data, increases learning accuracy, and improves the comprehensibility of results. In recent years, datasets in many applications have grown substantially in both the number of instances and the number of features. As a result, classical feature selection methods do not scale well to such datasets because of their computational cost. To address this issue, this paper proposes a parallel feature selection method based on MapReduce. Specifically, mutual information based on Renyi entropy is used to measure the relationship between feature variables and the class variable. Maximum mutual information theory is then employed to choose the most informative combination of feature variables. The selection process is implemented on MapReduce, making it efficient and scalable for large-scale problems. Finally, a practical example demonstrates the efficiency of the proposed method.
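
The abstract's two main ingredients, a Renyi-entropy-based mutual information score and a map/reduce-style division of the scoring work, can be illustrated with a short sketch. The Python below is a minimal illustration only, not the authors' implementation: the names renyi_entropy, renyi_mi, ALPHA, and TOP_K are assumptions, it ranks individual features rather than searching for the best combination of features, and the map and reduce steps are simulated locally rather than run on a MapReduce cluster.

```python
# Minimal sketch (assumptions noted above), not the paper's implementation.
from collections import Counter
from functools import reduce

import numpy as np

ALPHA = 2.0   # Renyi order; alpha -> 1 recovers Shannon entropy (illustrative default)
TOP_K = 100   # number of features to keep (illustrative)

def renyi_entropy(values, alpha=ALPHA):
    """Renyi entropy H_alpha(X) = log(sum_i p_i^alpha) / (1 - alpha) of a discrete sample."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return np.log((p ** alpha).sum()) / (1.0 - alpha)

def renyi_mi(feature, labels, alpha=ALPHA):
    """One common Renyi-based MI surrogate: I_alpha(X;Y) = H_alpha(X) + H_alpha(Y) - H_alpha(X,Y)."""
    joint = list(zip(feature, labels))
    return (renyi_entropy(feature, alpha)
            + renyi_entropy(labels, alpha)
            - renyi_entropy(joint, alpha))

# Map step: score every feature (column) in one block of the feature matrix.
def map_score_block(block, labels, alpha=ALPHA):
    return {name: renyi_mi(column, labels, alpha) for name, column in block.items()}

# Reduce step: merge partial score dictionaries produced by different workers.
def reduce_merge(scores_a, scores_b):
    merged = dict(scores_a)
    merged.update(scores_b)
    return merged

def select_top_k(feature_blocks, labels, k=TOP_K):
    # In a real deployment each block would be scored by a separate mapper;
    # here the "mappers" simply run one after another in a list comprehension.
    partial = [map_score_block(block, labels) for block in feature_blocks]
    scores = reduce(reduce_merge, partial, {})
    return sorted(scores, key=scores.get, reverse=True)[:k]

if __name__ == "__main__":
    labels = [0, 1, 0, 1, 1, 0]
    blocks = [{"f1": [1, 2, 1, 2, 2, 1], "f2": [0, 0, 1, 1, 0, 1]},
              {"f3": [5, 5, 5, 5, 5, 5]}]
    print(select_top_k(blocks, labels, k=2))  # -> ['f1', 'f2']
```

In this toy run, f1 mirrors the class labels exactly and receives the highest score, while the constant feature f3 scores zero, which is the behaviour a mutual-information-style criterion is meant to capture.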



Author information

Corresponding author

Correspondence to Zhao Li.

About this article

Cite this article

Li, Z., Lu, W., Sun, Z. et al. A parallel feature selection method study for text classification. Neural Comput & Applic 28 (Suppl 1), 513–524 (2017). https://doi.org/10.1007/s00521-016-2351-3

