GA Based Optimal Keyword Extraction in an Automatic Chinese Web Document Classification System

  • Conference paper
Frontiers of High Performance Computing and Networking ISPA 2007 Workshops (ISPA 2007)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4743)

Abstract

Designing an automatic document classification system involves two main steps: feature extraction and classification. This paper proposes a method to improve feature extraction. In this method, a genetic algorithm (GA) is applied to determine the threshold values of four criteria used to extract representative keywords for each class. The purpose of these four thresholds is to extract as few representative keywords as possible. The keyword extraction method is combined with two classification algorithms, the vector space model (VSM) and the support vector machine (SVM), to examine the performance of the proposed classification system under various extraction conditions.
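The idea of letting a GA tune four extraction thresholds can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the abstract does not name the four criteria, so each keyword simply carries four placeholder scores in [0, 1]; the fitness function is a toy stand-in (coverage of a hand-picked set of useful keywords minus a penalty per extracted keyword) rather than the classifier-based evaluation the paper presumably uses; and the names `select_keywords`, `fitness`, and `evolve` are hypothetical.

```python
import random

# Hypothetical per-keyword scores for one class. The four values per keyword
# are placeholders for the paper's four (unnamed) extraction criteria.
stats = {
    "股票": (0.9, 0.8, 0.7, 0.9),
    "基金": (0.8, 0.9, 0.6, 0.8),
    "的": (0.9, 0.1, 0.1, 0.2),
    "我們": (0.7, 0.2, 0.3, 0.1),
}
# Assumed ground truth: keywords that really represent the class.
useful = {"股票", "基金"}


def select_keywords(stats, thresholds):
    """Keep keywords whose four scores all reach the matching threshold."""
    return [word for word, scores in stats.items()
            if all(s >= t for s, t in zip(scores, thresholds))]


def fitness(thresholds, stats, useful, penalty=0.05):
    """Toy fitness: coverage of useful keywords minus a cost per keyword kept."""
    chosen = select_keywords(stats, thresholds)
    coverage = len(useful & set(chosen)) / len(useful)
    return coverage - penalty * len(chosen)


def mutate(genes, rng, rate, sigma=0.1):
    """Add Gaussian noise to some genes, clamping each to [0, 1]."""
    return [min(1.0, max(0.0, g + rng.gauss(0.0, sigma)))
            if rng.random() < rate else g
            for g in genes]


def evolve(stats, useful, pop_size=20, generations=40, rate=0.2, seed=0):
    """Evolve a vector of four thresholds with a simple elitist GA."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(4)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda t: fitness(t, stats, useful), reverse=True)
        parents = pop[:pop_size // 2]          # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, 3)            # one-point crossover
            children.append(mutate(a[:cut] + b[cut:], rng, rate))
        pop = parents + children
    return max(pop, key=lambda t: fitness(t, stats, useful))


best = evolve(stats, useful)
print("thresholds:", [round(t, 2) for t in best])
print("keywords:", select_keywords(stats, best))
```

The per-keyword penalty is one simple way to encode the stated goal of extracting as few representative keywords as possible. In a fuller system the fitness would more plausibly be computed from the accuracy of the downstream VSM or SVM classifier on held-out documents.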

Editor information

Parimala Thulasiraman, Xubin He, Tony Li Xu, Mieso K. Denko, Ruppa K. Thulasiram, Laurence T. Yang

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chou, CH., Han, CC., Chen, YH. (2007). GA Based Optimal Keyword Extraction in an Automatic Chinese Web Document Classification System. In: Thulasiraman, P., He, X., Xu, T.L., Denko, M.K., Thulasiram, R.K., Yang, L.T. (eds) Frontiers of High Performance Computing and Networking ISPA 2007 Workshops. ISPA 2007. Lecture Notes in Computer Science, vol 4743. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74767-3_24

  • DOI: https://doi.org/10.1007/978-3-540-74767-3_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74766-6

  • Online ISBN: 978-3-540-74767-3

  • eBook Packages: Computer Science, Computer Science (R0)
