Representative Term Based Feature Selection Method for SVM Based Document Classification

Kang, YunHee

doi:10.1007/11552413_9


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3681)

Included in the following conference series:

International Conference on Knowledge-Based and Intelligent Information and Engineering Systems


Abstract

This paper describes a classifier for web documents in the field of information technology. It uses a Support Vector Machine (SVM) to learn a model constructed from the training sets and their representative terms. Automatic text classification is needed to reduce information overload when handling very large numbers of documents. An SVM is a model whose output is calculated as a weighted sum of kernel function outputs. The basic idea is to exploit, with simple statistical methods, the distribution of representative terms in the thematically coherent texts of each category. Documents in the categories are represented in the vector-space model using a feature selection scheme based on TF-IDF. To the feature selection we apply a category factor, which represents the effect of a term within a category. Experiments show the categorization results and their correlation with vector length.
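The feature selection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not define the category factor, so the definition below (the share of a term's documents that fall in the given category) is an assumption made only for illustration, as are the function names `score_terms` and `representative_terms` and the toy corpus. It is not the paper's formula.

```python
import math
from collections import Counter


def score_terms(docs_by_category):
    """Score each term per category as tf * idf * category factor.

    docs_by_category maps a category name to a list of tokenized documents.
    The category factor used here is a hypothetical stand-in: the fraction
    of the documents containing the term that belong to the category.
    """
    all_docs = [doc for docs in docs_by_category.values() for doc in docs]
    n_docs = len(all_docs)
    # Document frequency of each term over the whole collection.
    df = Counter(term for doc in all_docs for term in set(doc))

    scores = {}
    for category, docs in docs_by_category.items():
        # Term frequency within the category.
        tf = Counter(term for doc in docs for term in doc)
        # Document frequency restricted to this category.
        cat_df = Counter(term for doc in docs for term in set(doc))
        scores[category] = {}
        for term, count in tf.items():
            idf = math.log(n_docs / df[term])
            category_factor = cat_df[term] / df[term]  # assumed definition
            scores[category][term] = count * idf * category_factor
    return scores


def representative_terms(scores, k):
    """Keep the k highest-scoring terms per category (ties broken by name)."""
    return {
        category: [
            term
            for term, _ in sorted(
                term_scores.items(), key=lambda kv: (-kv[1], kv[0])
            )[:k]
        ]
        for category, term_scores in scores.items()
    }


# Toy corpus: "data" occurs in both categories, so it is weakly discriminative.
docs = {
    "network": [
        ["router", "packet", "data"],
        ["packet", "switch", "data"],
        ["packet", "router"],
    ],
    "database": [
        ["query", "index", "data"],
        ["query", "table", "data"],
    ],
}
scores = score_terms(docs)
selected = representative_terms(scores, k=2)
print(selected)
```

In this toy corpus the term "data" appears in both categories, so both its inverse document frequency and its assumed category factor are low and it is not selected. The selected terms would then define the dimensions of the document vectors passed to an SVM trainer.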



Author information

Authors and Affiliations

Department of Computer and Communication Engineering, Cheonan University, 115 Anseo-dong, Cheonan, 330-704, Korea
YunHee Kang


Editor information

Editors and Affiliations

School of Business, La Trobe University, 3086, Melbourne, Victoria, Australia
Rajiv Khosla
Centre for SMART systems Engineering Research Centre, University of Brighton, Moulsecoomb, BN2 4GJ, Brighton, UK
Robert J. Howlett
School of Electrical and Information Engineering, Knowledge Based Intelligent Engineering Systems Centre, University of South Australia, 5095, Mawson Lakes, SA, Australia
Lakhmi C. Jain


About this paper

Cite this paper

Kang, Y. (2005). Representative Term Based Feature Selection Method for SVM Based Document Classification. In: Khosla, R., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2005. Lecture Notes in Computer Science, vol 3681. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11552413_9


DOI: https://doi.org/10.1007/11552413_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28894-7
Online ISBN: 978-3-540-31983-2
eBook Packages: Computer Science, Computer Science (R0)
