Abstract
The main steps in designing an automatic document classification system are feature extraction and classification. This paper proposes a method to improve feature extraction. A genetic algorithm (GA) is applied to determine the threshold values of four criteria used to extract representative keywords for each class. The four thresholds are tuned so that as few representative keywords as possible are extracted. The keyword extraction method is combined with two classification algorithms, the vector space model (VSM) and the support vector machine (SVM), to examine the performance of the proposed classification system under various extraction conditions.
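The idea of searching for four thresholds with a GA can be illustrated by a minimal sketch. It is not the authors' implementation: the abstract does not name the four criteria, so each candidate keyword simply carries four hypothetical scores in [0, 1], and the fitness function (prefer fewer keywords, subject to an assumed minimum count) stands in for whatever objective the paper actually uses, which would presumably involve classification accuracy. All function names and parameter values are assumptions made for illustration.

```python
import random

N_CRITERIA = 4  # the abstract mentions four criteria but does not name them


def make_candidates(n, rng):
    """Hypothetical toy data: n candidate keywords, four scores each in [0, 1]."""
    return [tuple(rng.random() for _ in range(N_CRITERIA)) for _ in range(n)]


def select_keywords(candidates, thresholds):
    """Keep the candidates whose four scores all meet their thresholds."""
    return [c for c in candidates
            if all(s >= t for s, t in zip(c, thresholds))]


def fitness(thresholds, candidates, min_keywords=5):
    """Stand-in objective (an assumption): fewer keywords is better,
    but keeping fewer than min_keywords is treated as infeasible."""
    kept = len(select_keywords(candidates, thresholds))
    if kept < min_keywords:
        return -1.0
    return 1.0 - kept / len(candidates)


def evolve(candidates, pop_size=30, generations=40, mutation_rate=0.2, seed=0):
    """Simple GA: truncation selection, one-point crossover, Gaussian mutation."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(N_CRITERIA)]
                  for _ in range(pop_size)]
    population[0] = [0.0] * N_CRITERIA  # always-feasible baseline: keep everything

    def score(individual):
        return fitness(individual, candidates)

    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]  # the best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, N_CRITERIA - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < mutation_rate:
                i = rng.randrange(N_CRITERIA)
                child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        population = parents + children
    return max(population, key=score)


# Usage on the toy data
candidates = make_candidates(200, random.Random(42))
best = evolve(candidates)
print("thresholds:", [round(t, 3) for t in best])
print("keywords kept:", len(select_keywords(candidates, best)))
```

Because the better half of each generation is carried over unchanged, the best fitness never decreases from one generation to the next. In a real system the selected keywords would then be passed to a classifier such as VSM or SVM, and the fitness would be evaluated on that classifier's results rather than on keyword count alone.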
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chou, CH., Han, CC., Chen, YH. (2007). GA Based Optimal Keyword Extraction in an Automatic Chinese Web Document Classification System. In: Thulasiraman, P., He, X., Xu, T.L., Denko, M.K., Thulasiram, R.K., Yang, L.T. (eds) Frontiers of High Performance Computing and Networking ISPA 2007 Workshops. ISPA 2007. Lecture Notes in Computer Science, vol 4743. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74767-3_24
DOI: https://doi.org/10.1007/978-3-540-74767-3_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-74766-6
Online ISBN: 978-3-540-74767-3
eBook Packages: Computer Science (R0)