Abstract
Text classification is a popular research topic in data mining, and many classification methods have been proposed. Feature selection is an important technique for text classification: it reduces dimensionality, removes irrelevant data, increases learning accuracy, and improves the comprehensibility of results. In recent years, datasets in many applications have grown in both the number of instances and the number of features, and classical feature selection methods handle such large-scale data poorly because of their computational cost. To address this issue, this paper proposes a parallel feature selection method based on MapReduce. Specifically, mutual information based on Rényi entropy is used to measure the relationship between feature variables and the class variable, and maximum mutual information theory is then employed to choose the most informative combination of features. The selection process is implemented on MapReduce, making it efficient and scalable for large-scale problems. Finally, a practical example demonstrates the efficiency of the proposed method.
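To make the scoring step concrete, the following is a minimal sketch of mutual-information-based feature ranking using Rényi entropy. It is not the paper's implementation: the function names, the order-α parameter, and the additive approximation MI(X; Y) ≈ H_α(X) + H_α(Y) − H_α(X, Y) are illustrative assumptions, and the features are assumed to be discrete.

```python
import numpy as np
from collections import Counter

def renyi_entropy(labels, alpha=2.0):
    """Rényi entropy of order alpha for a discrete sample (bits)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    if alpha == 1.0:  # Shannon entropy as the alpha -> 1 limit
        return -np.sum(p * np.log2(p))
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

def renyi_mutual_information(x, y, alpha=2.0):
    """MI approximated via the additive form H(X) + H(Y) - H(X, Y)."""
    joint = list(zip(x, y))  # pair up samples to estimate the joint entropy
    return (renyi_entropy(x, alpha) + renyi_entropy(y, alpha)
            - renyi_entropy(joint, alpha))

def select_features(X, y, k, alpha=2.0):
    """Score each discrete feature column against the class labels
    and keep the indices of the k most informative ones."""
    scores = [renyi_mutual_information(X[:, j], y, alpha)
              for j in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda j: scores[j],
                  reverse=True)[:k]

# Toy example: feature 0 mirrors the class label, feature 1 is noise.
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
y = np.array([0, 0, 1, 1])
print(select_features(X, y, k=1))  # -> [0]
```

In the parallel setting described by the abstract, the per-feature scoring loop in `select_features` is the natural map phase (each mapper scores a disjoint block of feature columns), and the reduce phase merges the partial score lists and keeps the global top-k.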
About this article
Cite this article
Li, Z., Lu, W., Sun, Z. et al. A parallel feature selection method study for text classification. Neural Comput & Applic 28 (Suppl 1), 513–524 (2017). https://doi.org/10.1007/s00521-016-2351-3