Abstract
In this paper, we introduce efficient methods for incremental multi-label categorization of documents. We use concept clumping to efficiently categorize news articles into a hierarchical structure of categories. Concept clumping is a phenomenon of local coherence occurring in the data that has previously been used for fast, incremental e-mail classification. We extend the definition of clumping and introduce additional clumping metrics specifically for multi-label document categorization. We present three methods for incremental multi-label categorization that exploit concept clumping and make use of thresholding techniques and a new term-category weight boosting method. Our methods are tested on the Reuters (RCV1) news corpus, and the accuracy obtained is comparable to that of some well-known machine learning methods trained in batch mode, but at much lower computation time.
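The paper itself defines the clumping metrics and boosting method; as a purely illustrative sketch of the general setting (incremental multi-label categorization with per-term statistics and relative thresholding, not the authors' clumping-based algorithm), a minimal categorizer might look like this:

```python
from collections import defaultdict


class IncrementalMultiLabelCategorizer:
    """Illustrative incremental multi-label text categorizer.

    Maintains simple term-category co-occurrence counts and assigns
    every category whose score is close to the top-scoring one.
    This is a generic sketch, NOT the clumping-based method of the paper.
    """

    def __init__(self, threshold=0.2):
        self.threshold = threshold                      # relative score cut-off
        self.term_cat = defaultdict(lambda: defaultdict(int))  # term -> {category: count}
        self.cat_total = defaultdict(int)               # category -> total term count

    def predict(self, terms):
        """Return the set of categories whose score is within
        `threshold` of the best score (relative thresholding)."""
        scores = defaultdict(float)
        for t in terms:
            for cat, n in self.term_cat.get(t, {}).items():
                # weight each term by its relative frequency in the category
                scores[cat] += n / self.cat_total[cat]
        if not scores:
            return set()
        top = max(scores.values())
        return {c for c, s in scores.items() if s >= self.threshold * top}

    def update(self, terms, categories):
        """Incremental training step: fold one labelled document in."""
        for cat in categories:
            for t in terms:
                self.term_cat[t][cat] += 1
                self.cat_total[cat] += 1
```

A document is first classified with `predict`, then its true labels are fed back through `update`, so the model evolves document by document rather than being retrained in batch, which is what makes the incremental setting cheap.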
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Krzywicki, A., Wobcke, W. (2011). Exploiting Concept Clumping for Efficient Incremental News Article Categorization. In: Tang, J., King, I., Chen, L., Wang, J. (eds) Advanced Data Mining and Applications. ADMA 2011. Lecture Notes in Computer Science(), vol 7120. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25853-4_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25852-7
Online ISBN: 978-3-642-25853-4