Abstract
This paper describes document classification methods for highly inflectional/derivational languages that form monolithic compound noun terms, such as Dutch and Korean. The system has three phases: (1) morphological analysis with the Korean morphological analyzer HAM (Kang, 1993); (2) compound noun phrase analysis of the HAM output, followed by extraction of terms whose syntactic categories are noun, name (proper noun), verb, or adjective; and (3) document classification based on preferred class score heuristics. The paper compares several classification methods: a simple heuristic method, and preferred class score heuristics that employ two factors, ICF (inverted class frequency) and IDF (inverted document frequency), with or without term frequency weighting. The approach is deliberately simple: it requires no learning algorithm, in contrast to a vector space model with complex training and classification procedures such as cosine similarity measurement. In experiments, the system correctly classified 95.7% of the 720 training documents and, depending on the method, 63.8% to 71.3% of 80 randomly chosen test documents.
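As a rough illustration of what a class-frequency-based score might look like, the sketch below weights each term by an inverted class frequency and lets it vote for a single "preferred" class. The abstract does not give the paper's formulas, so everything here is an assumption made for illustration: the definition ICF(t) = log(number of classes / number of classes containing t) is chosen by analogy with IDF, the preferred class of a term is taken to be the class in which it occurs most often in the training data, and the names `train`, `classify`, and `TRAINING` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Build ICF weights and a preferred class per term.

    labelled_docs: iterable of (class_label, list_of_terms).
    Assumed definitions (not taken from the paper):
      ICF(t)       = log(n_classes / number of classes containing t)
      preferred(t) = class in which t occurs most often
    """
    class_tf = defaultdict(Counter)
    for label, terms in labelled_docs:
        class_tf[label].update(terms)

    n_classes = len(class_tf)
    class_freq = Counter()
    for counts in class_tf.values():
        class_freq.update(counts.keys())  # one count per class containing the term

    icf = {t: math.log(n_classes / cf) for t, cf in class_freq.items()}
    # Ties are broken alphabetically by iterating over sorted labels.
    preferred = {
        t: max(sorted(class_tf), key=lambda label: class_tf[label][t])
        for t in class_freq
    }
    return icf, preferred


def classify(terms, icf, preferred, use_tf=True):
    """Sum ICF votes per preferred class; optionally weight by term frequency."""
    scores = Counter()
    for term, count in Counter(terms).items():
        if term not in preferred:
            continue  # term never seen in training
        scores[preferred[term]] += icf[term] * (count if use_tf else 1)
    if not scores:
        return None
    return max(sorted(scores), key=scores.get)


# Toy training data, invented purely for this example.
TRAINING = [
    ("sports", ["goal", "team", "match"]),
    ("finance", ["stock", "market", "team"]),
    ("sports", ["goal", "match"]),
]
ICF, PREFERRED = train(TRAINING)
print(classify(["goal", "goal", "goal", "stock", "market"], ICF, PREFERRED))
print(classify(["goal", "goal", "goal", "stock", "market"], ICF, PREFERRED, use_tf=False))
```

In this toy setting the `use_tf` flag changes the outcome: with term frequency weighting, the repeated term dominates, while without it each distinct term votes once. An IDF variant would replace the class count with a document count in the same structure.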
References
Allan, J., Leuski, A., Swan, R., Byrd, D.: Evaluating combinations of ranked lists and visualizations of inter-document similarity. Information Processing and Management. 37 (2001) 435–458
Apte, C., Damerau, F., Weiss, M.: Automated Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems. 12(3) (1994) 233–251
Arppe, A.: Term Extraction from Unrestricted Text. http://www.lingsoft.fi/doc/nptool/term-extraction. (1995)
Brasethvik, T., Gulla, J.: Natural Language Analysis for Semantic Document Modeling. Data & Knowledge Engineering. 38 (2001) 45–62
Cohen, W., Singer, Y.: Context-Sensitive Learning Methods for Text Categorization. ACM Transactions on Information Systems. 17(2) (1999) 141–173
Earley, J.: An Efficient Context-Free Parsing Algorithm. CACM. 13(2) (1970) 94–102
Fuketa, M., Lee, S., Tsuji, T., Okada, M., Aoe, J.: A Document Classification Method by Using Field Association Words. Information Sciences. 126 (2000) 57–70
Han, K., Sun, B., Han, S., Rim, K.: A Study on Development of Automatic Categorization System for Internet Documents. KIPS Journal. 7(9) (2000) 2867–2875
Hirschberg, D.S.: Algorithms for the Longest Common Subsequence Problem. Journal of the ACM. 24(4) (1977) 664–675
Joachims, T.: A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Proceedings of International Conference on Machine Learning (ICML97). (1997) 143–151
Kang, S.: Korean Morphological Analysis Using Syllable Information and Multi-word Unit Information. Ph.D thesis. Seoul National University (1993)
Kang, S.: Korean Morphological Analysis Program for Linux OS. http://nlp.kookmin.ac.kr. (2001)
Lewis, D., Jones, K.S.: Natural Language Processing for Information Retrieval. Communications of the ACM. 39(1) (1996) 92–101
Li, Y., Jain, A.: Classification of Text Documents. The Computer Journal. 41(8) (1998) 537–546
Moon, Y., Min, K.: Verifying Appropriateness of the Semantic Networks and Integration for the Selectional Restriction Relation. Proceedings of the 2000 MIS/OA International Conference. Seoul, Korea (2000) 535–539
Mostafa, J., Lam, W.: Automatic classification using supervised learning in a medical document filtering application. Information Processing and Management. 36 (2000) 415–444
Salton, G., Singhal, A., Mitra, M., Buckley, C.: Automatic Text Structuring and Summarization. Information Processing and Management. 33(2) (1997) 193–207
Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval. (1999) 42–49
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Min, K., Wilson, W.H., Moon, YJ. (2002). Preferred Document Classification for a Highly Inflectional/Derivational Language. In: McKay, B., Slaney, J. (eds) AI 2002: Advances in Artificial Intelligence. AI 2002. Lecture Notes in Computer Science, vol 2557. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36187-1_2
DOI: https://doi.org/10.1007/3-540-36187-1_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00197-3
Online ISBN: 978-3-540-36187-9
eBook Packages: Springer Book Archive