GA Based Optimal Keyword Extraction in an Automatic Chinese Web Document Classification System

Chou, Chih-Hsun; Han, Chin-Chuan; Chen, Ya-Hui

doi:10.1007/978-3-540-74767-3_24


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4743)

Included in the following conference series:

International Symposium on Parallel and Distributed Processing and Applications


Abstract

The main steps in designing an automatic document classification system are feature extraction and classification. This paper proposes a method to improve feature extraction. In this method, a genetic algorithm (GA) is applied to determine the threshold values of four criteria used to extract the representative keywords for each class. The purpose of these four thresholds is to extract as few representative keywords as possible. The keyword extraction method is combined with two classification algorithms, the vector space model (VSM) and the support vector machine (SVM), to examine the performance of the proposed classification system under various extraction conditions.
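As a rough illustration of the idea in the abstract, the following Python sketch evolves four threshold values with a simple GA. It is not the authors' implementation: the abstract does not name the four criteria, the fitness function, or the GA operators, so the per-keyword score tuples, the `evaluate` callback, the `penalty` weight, and the selection, crossover, and mutation settings are all assumptions made for the example.

```python
import random

def extract(scores, thresholds):
    """Keep keywords whose four criterion scores all reach their thresholds.

    `scores` maps keyword -> 4-tuple of values in [0, 1] (hypothetical criteria).
    """
    return [w for w, s in scores.items()
            if all(v >= t for v, t in zip(s, thresholds))]

def fitness(thresholds, scores, evaluate, penalty=0.01):
    """Assumed trade-off: classification quality minus a cost per kept keyword."""
    kept = extract(scores, thresholds)
    if not kept:
        return float("-inf")  # an empty keyword set cannot classify anything
    return evaluate(kept) - penalty * len(kept)

def run_ga(scores, evaluate, pop_size=20, generations=30,
           mutation_rate=0.2, seed=0):
    """Evolve a chromosome of four thresholds; returns the best one found."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(4)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda c: fitness(c, scores, evaluate),
                        reverse=True)
        parents = ranked[:pop_size // 2]      # truncation selection
        children = [ranked[0][:]]             # elitism: carry over the best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, 3)           # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(4):                # Gaussian mutation, clipped to [0, 1]
                if rng.random() < mutation_rate:
                    child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = children
    return max(pop, key=lambda c: fitness(c, scores, evaluate))
```

In a full system, `evaluate` would train and score a VSM or SVM classifier on the documents represented by the kept keywords; any callable that returns a quality score for a keyword list can stand in for it when experimenting with the sketch.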



Author information

Authors and Affiliations

Department of Computer Science and Information Engineering, Chung Hua University, No.707, Sec.2, WuFu Rd., Hsinchu, 300 Taiwan, R.O.C.
Chih-Hsun Chou & Ya-Hui Chen
Department of Computer Science and Information Engineering, National United University, Miaoli, Taiwan, R.O.C.
Chin-Chuan Han


Editor information

Parimala Thulasiraman, Xubin He, Tony Li Xu, Mieso K. Denko, Ruppa K. Thulasiram, Laurence T. Yang


About this paper

Cite this paper

Chou, CH., Han, CC., Chen, YH. (2007). GA Based Optimal Keyword Extraction in an Automatic Chinese Web Document Classification System. In: Thulasiraman, P., He, X., Xu, T.L., Denko, M.K., Thulasiram, R.K., Yang, L.T. (eds) Frontiers of High Performance Computing and Networking ISPA 2007 Workshops. ISPA 2007. Lecture Notes in Computer Science, vol 4743. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74767-3_24


DOI: https://doi.org/10.1007/978-3-540-74767-3_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74766-6
Online ISBN: 978-3-540-74767-3
eBook Packages: Computer Science, Computer Science (R0)
