ABSTRACT
This paper evaluates text categorization using characters, bigrams, words, and hybrid terms, each also augmented with mined terms. Classifiers using hybrid terms did not achieve better classification performance. Adding new terms to the dictionary through data mining techniques improves the performance of character-based classifiers. In a naïve comparison between the Pat-tree classifier and our best classifier, the Pat-tree classifier has the higher precision (77%), while our best classifier has the higher recall (72%) and the lower storage requirement (13%).
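To make the term types concrete, here is a minimal sketch of how character, bigram, and hybrid terms might be extracted from a string. The abstract does not describe the extraction procedure, so everything here is an assumption made for illustration: the function names, the `max_word_len` parameter, the greedy longest-match lookup, and the choice to index text not covered by the dictionary as overlapping bigrams are all hypothetical.

```python
def char_terms(text):
    """Single-character terms, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


def bigram_terms(text):
    """Overlapping two-character terms."""
    chars = char_terms(text)
    return [a + b for a, b in zip(chars, chars[1:])]


def _run_to_bigrams(run):
    """Index an unmatched run as bigrams (a lone character stays as is)."""
    if len(run) == 1:
        return list(run)
    return [a + b for a, b in zip(run, run[1:])]


def hybrid_terms(text, dictionary, max_word_len=4):
    """Assumed hybrid scheme: dictionary words where they match
    (greedy longest match), bigrams over everything else."""
    chars = char_terms(text)
    terms, run, i = [], [], 0
    while i < len(chars):
        match = None
        # Try the longest candidate first, down to two characters.
        for n in range(min(max_word_len, len(chars) - i), 1, -1):
            candidate = "".join(chars[i:i + n])
            if candidate in dictionary:
                match = candidate
                break
        if match:
            terms.extend(_run_to_bigrams(run))
            run = []
            terms.append(match)
            i += len(match)
        else:
            run.append(chars[i])
            i += 1
    terms.extend(_run_to_bigrams(run))
    return terms
```

Under this sketch, adding a mined term amounts to inserting a new string into `dictionary`, which changes how the surrounding text is split into hybrid terms.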
- Text categorization using hybrid (mined) terms (poster session)