short-paper

Standard measure and SVM measure for feature selection and their performance effect for text classification

Authors:

Takanori Yamashita,

Sachio Hirokawa

iiWAS '16: Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services

Pages 262 - 266

https://doi.org/10.1145/3011141.3011190

Published: 28 November 2016

Abstract

This paper compares the prediction performance of document classification under a variety of feature selection measures. Empirical experiments were conducted on the dataset re0 with 10 measures for feature selection and with SVM. The results confirm that feature selection based on the SVM-score proposed by Sakai and Hirokawa (2012) outperforms the standard measures when the number of features is small. In fact, 100 words are enough to achieve performance similar to that obtained with all words. The reason for the good performance of this feature selection is that the SVM-score captures not only the characteristic words of positive samples but also those of negative samples.
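The idea of selecting words by SVM-score can be illustrated with a minimal sketch. It assumes that the SVM-score of a word is its weight in a trained linear SVM and that the top n words are taken from both the positive and the negative end of the ranking; the exact definition is given in Sakai and Hirokawa (2012) and may differ in detail. The toy corpus, the Pegasos-style trainer without a bias term, and all function names are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical toy corpus: +1 = grain news, -1 = finance news.
DOCS = [
    ["wheat", "crop", "export"],
    ["wheat", "harvest", "price"],
    ["wheat", "crop", "harvest"],
    ["bank", "rate", "price"],
    ["bank", "loan", "rate"],
    ["bank", "loan", "export"],
]
LABELS = [1, 1, 1, -1, -1, -1]


def train_linear_svm(docs, labels, lam=0.01, epochs=200, seed=0):
    """Train a bias-free linear SVM on binary bag-of-words (Pegasos-style SGD)."""
    rng = random.Random(seed)
    weights = {word: 0.0 for doc in docs for word in doc}
    order = list(range(len(docs)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            words = set(docs[i])
            margin = labels[i] * sum(weights[w] for w in words)
            # L2 regularisation: shrink every weight; the step size is 1 / (lam * t).
            for w in weights:
                weights[w] *= 1.0 - 1.0 / t
            # Hinge-loss subgradient step, applied only when the margin is violated.
            if margin < 1.0:
                for w in words:
                    weights[w] += labels[i] / (lam * t)
    return weights


def select_by_svm_score(weights, n):
    """Return (top-n most positive words, top-n most negative words)."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:n], ranked[-n:][::-1]


def predict(doc, weights, features):
    """Classify a document using only the selected feature words."""
    score = sum(weights[w] for w in set(doc) if w in features)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    weights = train_linear_svm(DOCS, LABELS)
    positive, negative = select_by_svm_score(weights, n=2)
    selected = set(positive) | set(negative)
    print("positive feature words:", positive)
    print("negative feature words:", negative)
    for doc, label in zip(DOCS, LABELS):
        print(doc, "true:", label, "predicted:", predict(doc, weights, selected))
```

Because the ranking is read from both ends, words typical of the negative class are kept alongside words typical of the positive class, which is the property the abstract credits for the good performance with few features.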

References

[1]

Cataltepe, Z. and Aygun, E., An Improvement of Centroid-Based Classification Algorithm for Text Classification. In Proceedings of 2007 IEEE 23rd International Conference on the Data Engineering Workshop, pp.952--956, 2007

[2]

Cover, T. M. and Thomas, J. A., Elements of Information Theory, Wiley-Interscience, 1991

[3]

Forman, G., An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research, Vol.3, pp.1289--1305, 2003

Han, E-H. and Karypis, G., Centroid-Based Document Classification: Analysis & Experimental Results. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery PKDD '00, pp.424--431, 2000

[4]

Joachims, T., A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, In Proceedings of the Fourteenth International Conference on Machine Learning ICML '97, pp.143--151, 1997

[5]

Joachims, T., Learning to Classify Text Using Support Vector Machines, Kluwer Academic Publishers, 2002


[6]

Joachims, T., A Support Vector Method for Multivariate Performance Measures, In Proceedings of the 22nd international conference on Machine learning ICML '05, pp.377--384, 2005

[7]

Joachims, T., Training Linear SVMs in Linear Time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining KDD '06, pp.217--226, 2006

[8]

Joachims, T. and Yu, C. J., Sparse Kernel SVMs via Cutting-Plane Training, Journal of Machine Learning, Vol.76, No.2--3, pp.179--193, 2009

[9]

Kohavi, R., A study of cross-validation and bootstrap for accuracy estimation and model selection, In Proceedings of the 14th international joint conference on Artificial intelligence - Vol.2, IJCAI '95, pp.1137--1143, 1995

[10]

Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J. and Liu, H., Feature Selection: A Data Perspective, arXiv preprint, http://arxiv.org/abs/1601.07996, 2016

[11]

Lewis, D. D., Reuters-21578 text categorization test collection, ReadMe file of Distribution 1.0, 1997

[12]

Malik, H. H. and Kender, J. R., Classifying high-dimensional text and web data using very short patterns, In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pp.923--928, 2008

[13]

McCallum, A. and Nigam, K., A Comparison of Event Models for Naive Bayes Text Classification, In AAAI-98 Workshop on Learning for Text Categorization, pp.41--48, 1998

[14]

Park, H., and Kwon, H-C., Improved Gini-Index Algorithm to Correct Feature-Selection Bias in Text Classification, Journal of IEICE Transactions on Information and Systems, Vol.E94.D, No.4, pp.855--865, 2011

[15]

Sakai, T., and Hirokawa, S., Feature words that classify problem sentence in scientific article, In Proceedings of the 14th International Conference on Information Integration and Web-based Applications & Services IIWAS '12, pp.360--367, 2012

[16]

Salton, G. and Buckley, C., Term Weighting Approaches in Automatic Text Retrieval, Journal of Information Processing and Management, Vol.24, No.5, pp.513--523, 1988

[17]

Shang, W., Huang, H., Zhu, H., Lin, Y., Qu, Y., and Wang, Z., A novel feature selection algorithm for text categorization, Expert Systems with Applications, Vol.33, No.1, pp.1--5, 2007


[18]

Singhal, A., Buckley, C. and Mitra, M., Pivoted document length normalization, In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '96, ACM, New York, pp.21--29, 1996

[19]

Sondakh, D. E., A Comparative Study of Classification Algorithms: k-Folds and Holdout as Accuracy Estimation Methods, International Journal of Advanced Research in Computer Science and Software Engineering, Vol.6, No.1, pp.22--29, 2016

[20]

Xu, B., Guo, X., Ye, Y., and Cheng, J., An Improved Random Forest classifier for Text Categorization, Journal of Computers, Vol.7, No.12, pp.2913--2920, 2012

[21]

Yang, Y. and Pedersen, J. O., A Comparative Study on Feature Selection in Text Categorization, In Proceedings of the Fourteenth International Conference on Machine Learning ICML '97, pp.412--420, 1997


Index Terms

Standard measure and SVM measure for feature selection and their performance effect for text classification
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering



Published In


iiWAS '16: Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services

November 2016

528 pages

ISBN:9781450348072

DOI:10.1145/3011141

General Chair:
Gabriele Anderst-Kotsis
Johannes Kepler University Linz, Austria

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Short-paper

Conference

iiWAS '16

iiWAS '16: 18th International Conference on Information Integration and Web-based Applications and Services

November 28 - 30, 2016

Singapore, Singapore

Article Metrics

Total Citations: 9
Total Downloads: 158
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 01 Mar 2025

Cited By

Yu P, Gong W, Bai Z, Zhao H, Deng W (2022) Knowledge Graph Civil Aviation Question Answering Based on Deep Learning, 2022 China Automation Congress (CAC), pp. 600-604. Online publication date: 25-Nov-2022
https://doi.org/10.1109/CAC57257.2022.10054717
Tarle B, Chintakindi S, Jena S (2019) Integrating multiple methods to enhance medical data classification, Evolving Systems. Online publication date: 6-Feb-2019
https://doi.org/10.1007/s12530-019-09272-x
Mashima Y, Hirokawa S, Takeuchi K (2019) Ties Between Mined Structural Patterns in Program and Their Identifier Names, Integrated Uncertainty in Knowledge Modelling and Decision Making, pp. 335-346. Online publication date: 7-Mar-2019
https://doi.org/10.1007/978-3-030-14815-7_28
Taga M, Onishi T, Hirokawa S (2018) Automated Evaluation of Students Comments Regarding Correct Concepts and Misconceptions of Convex Lenses, 2018 7th International Congress on Advanced Applied Informatics (IIAI-AAI), pp. 273-277. Online publication date: Jul-2018
https://doi.org/10.1109/IIAI-AAI.2018.00059
Lin Y, Yamaguchi K, Mine T, Hirokawa S (2018) Is SVM+FS Better to Satisfy Decision by Majority?, Recent Advances on Soft Computing and Data Mining, pp. 261-271. Online publication date: 12-Jan-2018
https://doi.org/10.1007/978-3-319-72550-5_26
Hirokawa S, Suzuki T, Mine T, Sheth A, Ngonga A, Wang Y, Chang E, Ślęzak D, Franczyk B, Alt R, Tao X (2017) Machine learning is better than human to satisfy decision by majority, Proceedings of the International Conference on Web Intelligence, pp. 694-701. Online publication date: 23-Aug-2017
https://dl.acm.org/doi/10.1145/3106426.3106520
Adachi Y, Onimura N, Yamashita T, Hirokawa S (2017) Classification of Imbalanced Documents by Feature Selection, Proceedings of the International Conference on Compute and Data Analysis, pp. 228-232. Online publication date: 19-May-2017
https://dl.acm.org/doi/10.1145/3093241.3093246
Kajornrit J, Chaipornkaew P (2017) A comparative study of ensemble back-propagation neural network for the regression problems, 2017 2nd International Conference on Information Technology (INCIT), pp. 1-6. Online publication date: Nov-2017
https://doi.org/10.1109/INCIT.2017.8257853
Onimura N, Yamashita T, Nakayama N, Soejima H, Hirokawa S (2016) Generation of Sentence Template Graph from SOAP Format Medical Documents, 2016 International Conference on Computational Science and Computational Intelligence (CSCI), pp. 159-162. Online publication date: Dec-2016
https://doi.org/10.1109/CSCI.2016.0037
