Abstract
In this paper, we present a linear text classification algorithm called CRF. Using category relevance factors, CRF computes feature vectors for the training documents that belong to the same category. From these feature vectors, CRF induces a profile vector for each category. For new unlabelled documents, CRF uses a modified cosine measure to compute the similarity between each document and each category, and assigns each document to the categories with the highest similarity scores. Only the profile vectors, not the vectors of all training documents, take part in computing these similarities. We evaluated our algorithm on a subset of the Reuters-21578 and 20_newsgroups text collections and compared it against k-NN and SVM. Experimental results show that CRF outperforms k-NN and is competitive with SVM.
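The pipeline summarised in the abstract (relevance-weighted feature vectors, one profile vector per category, cosine-style matching against profiles only) can be illustrated with a minimal sketch. The abstract does not give the formula for the category relevance factor or for the modified cosine measure, so the code below substitutes assumptions: a smoothed log ratio of in-category to out-of-category document frequency as the relevance factor, the mean of the weighted vectors as the profile, and the plain cosine as the similarity. All function names and the toy data are hypothetical, and the sketch assigns a single best category per document.

```python
import math
from collections import Counter


def relevance_factors(docs, labels, smoothing=0.5):
    """ASSUMED weighting, not the paper's definition: log ratio of a term's
    smoothed document frequency inside a category to that outside it."""
    vocab = {t for d in docs for t in d}
    factors = {}
    for c in set(labels):
        inside = [set(d) for d, y in zip(docs, labels) if y == c]
        outside = [set(d) for d, y in zip(docs, labels) if y != c]
        for t in vocab:
            p_in = (sum(t in d for d in inside) + smoothing) / (len(inside) + 2 * smoothing)
            p_out = (sum(t in d for d in outside) + smoothing) / (len(outside) + 2 * smoothing)
            factors[(t, c)] = math.log(p_in / p_out)
    return factors


def train_profiles(docs, labels):
    """One profile vector per category: the mean of the relevance-weighted
    term-frequency vectors of that category's training documents."""
    factors = relevance_factors(docs, labels)
    sizes = Counter(labels)
    profiles = {c: {} for c in sizes}
    for d, y in zip(docs, labels):
        for t, tf in Counter(d).items():
            profiles[y][t] = profiles[y].get(t, 0.0) + tf * factors[(t, y)] / sizes[y]
    return profiles


def cosine(u, v):
    """Plain cosine similarity, standing in for the modified cosine measure."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(doc, profiles):
    """Compare the document with each category profile only, never with
    individual training documents, and return the best-scoring category."""
    tf = Counter(doc)
    return max(profiles, key=lambda c: cosine(tf, profiles[c]))


# Toy, made-up training set of tokenised documents.
docs = [
    ["wheat", "grain", "export"],
    ["grain", "harvest", "wheat"],
    ["stock", "market", "shares"],
    ["market", "trade", "stock"],
]
labels = ["grain", "grain", "finance", "finance"]
profiles = train_profiles(docs, labels)
print(classify(["wheat", "harvest"], profiles))  # prints "grain"
```

Because classification touches one vector per category rather than every training document, the cost of scoring a new document grows with the number of categories, which is the contrast with k-NN that the abstract draws.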
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Deng, ZH., Tang, SW., Yang, DQ., Zhang, M., Wu, XB., Yang, M. (2002). A Linear Text Classification Algorithm Based on Category Relevance Factors. In: Lim, E.P., et al. Digital Libraries: People, Knowledge, and Technology. ICADL 2002. Lecture Notes in Computer Science, vol 2555. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36227-4_9
DOI: https://doi.org/10.1007/3-540-36227-4_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00261-1
Online ISBN: 978-3-540-36227-2
eBook Packages: Springer Book Archive