DOI:10.1145/3011141.3011190
short-paper

Standard measure and SVM measure for feature selection and their performance effect for text classification

Published: 28 November 2016

Abstract

This paper compares the prediction performance of document classification under a variety of feature selection measures. Empirical experiments were conducted on the re0 dataset with 10 feature selection measures and an SVM classifier. The results confirm that feature selection based on the SVM-score proposed by Sakai and Hirokawa (2012) outperforms the standard measures when the number of features is small. In fact, 100 words are enough to achieve performance similar to that obtained with all words. The reason for the good performance of this feature selection is that the SVM-score captures not only the characteristic words of positive samples but also those of negative samples.
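
The idea of selecting words from both ends of an SVM ranking can be sketched as follows. This is only an illustration under stated assumptions, not the procedure from the paper: it assumes the score of a word is its weight in a linear SVM, it trains that SVM with a plain subgradient method, and the toy corpus, function names, and hyperparameters are all hypothetical. The exact definition of the SVM-score in Sakai and Hirokawa (2012) may differ.

```python
# Toy corpus (hypothetical): +1 = "grain" topic, -1 = "oil" topic.
docs = [
    ("wheat grain export", 1),
    ("grain harvest wheat", 1),
    ("wheat price grain", 1),
    ("oil crude barrel", -1),
    ("crude oil price", -1),
    ("barrel oil export", -1),
]

# Binary bag-of-words features over a sorted vocabulary.
vocab = sorted({word for text, _ in docs for word in text.split()})
X = [[1.0 if word in text.split() else 0.0 for word in vocab] for text, _ in docs]
y = [label for _, label in docs]


def train_linear_svm(X, y, epochs=200, lam=0.01, lr=0.1):
    """Fit w, b for hinge loss + L2 penalty by plain subgradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            violated = margin < 1.0
            for j, xj in enumerate(xi):
                hinge_grad = -yi * xj if violated else 0.0
                w[j] -= lr * (lam * w[j] + hinge_grad)
            if violated:
                b += lr * yi
    return w, b


def select_by_weight(vocab, w, n):
    """Return the n/2 most positive and the n/2 most negative words."""
    ranked = sorted(zip(vocab, w), key=lambda pair: pair[1])
    half = n // 2
    negative = [word for word, _ in ranked[:half]]
    positive = [word for word, _ in ranked[::-1][:half]]
    return positive, negative


# Assumption: a word's score is its learned weight in the linear SVM.
w, b = train_linear_svm(X, y)
positive, negative = select_by_weight(vocab, w, n=4)
print("positive-side words:", positive)
print("negative-side words:", negative)
```

Words that occur in both classes, such as "export" and "price" in this toy corpus, end up with weights between the two extremes, so a selection that keeps both ends of the ranking retains the words characteristic of the negative class as well as those of the positive class.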

References

[1]
Cataltepe, Z. and Aygun, E., An Improvement of Centroid-Based Classification Algorithm for Text Classification, In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering Workshop, pp. 952-956, 2007
[2]
Cover, T. M. and Thomas, J. A., Elements of Information Theory, Wiley-Interscience, 1991
[3]
Forman, G., An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research, Vol. 3, pp. 1289-1305, 2003
[4]
Han, E-H. and Karypis, G., Centroid-Based Document Classification: Analysis & Experimental Results, In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD '00, pp. 424-431, 2000
[5]
Joachims, T., A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, pp. 143-151, 1997
[6]
Joachims, T., Learning to Classify Text Using Support Vector Machines, Kluwer Academic Publishers, 2002
[7]
Joachims, T., A Support Vector Method for Multivariate Performance Measures, In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pp. 377-384, 2005
[8]
Joachims, T., Training Linear SVMs in Linear Time, In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06, pp. 217-226, 2006
[9]
Joachims, T. and Yu, C. J., Sparse Kernel SVMs via Cutting-Plane Training, Machine Learning, Vol. 76, No. 2-3, pp. 179-193, 2009
[10]
Kohavi, R., A study of cross-validation and bootstrap for accuracy estimation and model selection, In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Vol. 2, IJCAI '95, pp. 1137-1143, 1995
[11]
Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J. and Liu, H., Feature Selection: A Data Perspective, arXiv preprint, http://arxiv.org/abs/1601.07996, 2016
[12]
Lewis, D. D., Reuters-21578 text categorization test collection, ReadMe file of Distribution 1.0, 1997
[13]
Malik, H. H. and Kender, J. R., Classifying high-dimensional text and web data using very short patterns, In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pp. 923-928, 2008
[14]
McCallum, A. and Nigam, K., A Comparison of Event Models for Naive Bayes Text Classification, In AAAI-98 Workshop on Learning for Text Categorization, pp. 41-48, 1998
[15]
Park, H. and Kwon, H-C., Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification, IEICE Transactions on Information and Systems, Vol. E94.D, No. 4, pp. 855-865, 2011
[16]
Sakai, T. and Hirokawa, S., Feature words that classify problem sentence in scientific article, In Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services, IIWAS '12, pp. 360-367, 2012
[17]
Salton, G. and Buckley, C., Term Weighting Approaches in Automatic Text Retrieval, Information Processing and Management, Vol. 24, No. 5, pp. 513-523, 1988
[18]
Shang, W., Huang, H., Zhu, H., Lin, Y., Qu, Y. and Wang, Z., A novel feature selection algorithm for text categorization, Expert Systems with Applications, Vol. 33, No. 1, pp. 1-5, 2007
[19]
Singhal, A., Buckley, C. and Mitra, M., Pivoted document length normalization, In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, ACM, New York, pp. 21-29, 1996
[20]
Sondakh, D. E., A Comparative Study of Classification Algorithms: k-Folds and Holdout as Accuracy Estimation Methods, International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 6, No. 1, pp. 22-29, 2016
[21]
Xu, B., Guo, X., Ye, Y. and Cheng, J., An Improved Random Forest Classifier for Text Categorization, Journal of Computers, Vol. 7, No. 12, pp. 2913-2920, 2012
[22]
Yang, Y. and Pedersen, J. O., A Comparative Study on Feature Selection in Text Categorization, In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, pp. 412-420, 1997

Published In

iiWAS '16: Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services
November 2016
528 pages
ISBN:9781450348072
DOI:10.1145/3011141

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. SVM
  2. feature selection

Qualifiers

  • Short-paper

Conference

iiWAS '16

Cited By

  • (2022) Knowledge Graph Civil Aviation Question Answering Based on Deep Learning. 2022 China Automation Congress (CAC), pp. 600-604. DOI:10.1109/CAC57257.2022.10054717. Online publication date: 25-Nov-2022
  • (2019) Integrating multiple methods to enhance medical data classification. Evolving Systems. DOI:10.1007/s12530-019-09272-x. Online publication date: 6-Feb-2019
  • (2019) Ties Between Mined Structural Patterns in Program and Their Identifier Names. Integrated Uncertainty in Knowledge Modelling and Decision Making, pp. 335-346. DOI:10.1007/978-3-030-14815-7_28. Online publication date: 7-Mar-2019
  • (2018) Automated Evaluation of Students Comments Regarding Correct Concepts and Misconceptions of Convex Lenses. 2018 7th International Congress on Advanced Applied Informatics (IIAI-AAI), pp. 273-277. DOI:10.1109/IIAI-AAI.2018.00059. Online publication date: Jul-2018
  • (2018) Is SVM+FS Better to Satisfy Decision by Majority? Recent Advances on Soft Computing and Data Mining, pp. 261-271. DOI:10.1007/978-3-319-72550-5_26. Online publication date: 12-Jan-2018
  • (2017) Machine learning is better than human to satisfy decision by majority. Proceedings of the International Conference on Web Intelligence, pp. 694-701. DOI:10.1145/3106426.3106520. Online publication date: 23-Aug-2017
  • (2017) Classification of Imbalanced Documents by Feature Selection. Proceedings of the International Conference on Compute and Data Analysis, pp. 228-232. DOI:10.1145/3093241.3093246. Online publication date: 19-May-2017
  • (2017) A comparative study of ensemble back-propagation neural network for the regression problems. 2017 2nd International Conference on Information Technology (INCIT), pp. 1-6. DOI:10.1109/INCIT.2017.8257853. Online publication date: Nov-2017
  • (2016) Generation of Sentence Template Graph from SOAP Format Medical Documents. 2016 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 159-162. DOI:10.1109/CSCI.2016.0037. Online publication date: Dec-2016
