Abstract
Text categorization is the process of automatically assigning predefined categories to free-text documents. Although a large number of text classification algorithms exist, most of them are either inefficient or too complex. In this paper, we propose the concept of category memberships, which represent the degrees to which words belong to categories. Based on category memberships, we present a simple but efficient algorithm. To evaluate the new algorithm, we conducted experiments on the Newsgroup_18828 text collection, comparing it with Naive Bayes and k-NN. The experimental results show that our algorithm outperforms Naive Bayes and k-NN if a suitable category membership function is adopted.
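The abstract does not state the membership function or the decision rule, so the following is only an illustrative sketch of how a membership-based classifier might look. It assumes (hypothetically) that the membership of a word in a category is the fraction of that word's training occurrences that fall in the category, and that a document is assigned to the category with the largest summed membership over its words. The function names `train_memberships` and `classify` are invented for this example and are not taken from the paper.

```python
from collections import Counter, defaultdict


def train_memberships(labeled_docs):
    """Estimate word-to-category memberships from (tokens, category) pairs.

    Assumption for illustration: membership(word, category) is the share of
    the word's training occurrences that appear in that category.
    """
    word_cat_counts = defaultdict(Counter)  # word -> Counter(category -> count)
    for tokens, category in labeled_docs:
        for tok in tokens:
            word_cat_counts[tok][category] += 1

    memberships = {}
    for word, counts in word_cat_counts.items():
        total = sum(counts.values())
        memberships[word] = {c: n / total for c, n in counts.items()}
    return memberships


def classify(tokens, memberships):
    """Return the category with the highest summed membership, or None."""
    scores = Counter()
    for tok in tokens:
        # Words never seen in training contribute nothing.
        for category, degree in memberships.get(tok, {}).items():
            scores[category] += degree
    if not scores:
        return None
    return max(scores, key=scores.get)


training = [
    (["ball", "goal", "team"], "sport"),
    (["vote", "party", "team"], "politics"),
    (["goal", "match"], "sport"),
]
memberships = train_memberships(training)
prediction = classify(["goal", "team"], memberships)
```

Under this assumed function, training is a single pass over the corpus and classification is one table lookup per word, which is one way an approach of this kind could be both simple and efficient. Other membership functions would slot into `train_memberships` without changing `classify`.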
© 2005 Springer-Verlag Berlin Heidelberg
Cite this paper
Deng, ZH., Tang, SW., Zhang, M. (2005). An Efficient Text Categorization Algorithm Based on Category Memberships. In: Wang, L., Jin, Y. (eds) Fuzzy Systems and Knowledge Discovery. FSKD 2005. Lecture Notes in Computer Science, vol 3613. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11539506_48
DOI: https://doi.org/10.1007/11539506_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28312-6
Online ISBN: 978-3-540-31830-9
eBook Packages: Computer Science, Computer Science (R0)