Abstract
In this paper we propose TreeBoost.MH, an algorithm for multi-label Hierarchical Text Categorization (HTC) that is a hierarchical variant of AdaBoost.MH. TreeBoost.MH embodies several intuitions that had previously arisen within HTC, e.g. that both feature selection and the selection of negative training examples should be performed “locally”, i.e. by paying attention to the topology of the classification scheme. It also embodies the novel intuition that the weight distribution that boosting algorithms update at every boosting round should likewise be updated “locally”. We present the results of experiments with TreeBoost.MH on two HTC benchmarks, and analytically discuss its computational cost.
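To make the notion of “local” selection of negative training examples more concrete, the following Python sketch shows one possible reading for a tree-structured classification scheme: the negatives for a category are drawn only from documents that belong to its parent category but not to the category itself. This is an illustrative assumption, not necessarily the exact policy used by TreeBoost.MH; the function names (`descendants`, `positives`, `local_training_sets`) and the toy hierarchy are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def descendants(parent: Dict[str, Optional[str]], category: str) -> Set[str]:
    """Return `category` together with every category below it in the tree."""
    found = {category}
    changed = True
    while changed:
        changed = False
        for child, par in parent.items():
            if par in found and child not in found:
                found.add(child)
                changed = True
    return found


def positives(parent: Dict[str, Optional[str]],
              doc_labels: Dict[str, Set[str]],
              category: str) -> Set[str]:
    """Documents labelled with `category` or with any of its descendants."""
    subtree = descendants(parent, category)
    return {doc for doc, labels in doc_labels.items() if labels & subtree}


def local_training_sets(parent: Dict[str, Optional[str]],
                        doc_labels: Dict[str, Set[str]],
                        category: str) -> Tuple[Set[str], Set[str]]:
    """Return (positives, local negatives) for `category`.

    Assumption: negatives come only from documents that are positive for the
    parent category; for a top-level category the pool is the whole corpus.
    """
    pos = positives(parent, doc_labels, category)
    par = parent[category]
    pool = set(doc_labels) if par is None else positives(parent, doc_labels, par)
    return pos, pool - pos


# Hypothetical toy hierarchy: child -> parent (None marks a top-level category).
parent = {"sports": None, "politics": None, "soccer": "sports", "tennis": "sports"}
doc_labels = {
    "d1": {"soccer"},
    "d2": {"tennis"},
    "d3": {"politics"},
    "d4": {"soccer", "tennis"},  # multi-label document
}

pos, neg = local_training_sets(parent, doc_labels, "soccer")
print(sorted(pos), sorted(neg))
```

In this toy example the politics document `d3` is never used as a negative example for `soccer`, because it does not belong to the parent category `sports`; only the sibling-only document `d2` is.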
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Esuli, A., Fagni, T., Sebastiani, F. (2006). TreeBoost.MH: A Boosting Algorithm for Multi-label Hierarchical Text Categorization. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_2
DOI: https://doi.org/10.1007/11880561_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45774-9
Online ISBN: 978-3-540-45775-6
eBook Packages: Computer Science, Computer Science (R0)