ABSTRACT
The explosion of online content has made managing that content non-trivial. Web-related tasks such as web page categorization, news filtering, query categorization, and tag recommendation often involve constructing multi-label categorization systems on a large scale. Existing multi-label classification methods either do not scale or perform unsatisfactorily. In this work, we propose MetaLabeler, which automatically determines the relevant set of labels for each instance without intensive human involvement or expensive cross-validation. Extensive experiments on benchmark data show that MetaLabeler tends to outperform existing methods. Moreover, MetaLabeler scales to millions of multi-labeled instances and can be deployed easily. This enables us to apply MetaLabeler to a large-scale query categorization problem at Yahoo!, yielding a significant improvement in performance.
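The abstract does not spell out how the relevant label set is chosen. One plausible reading, stated here as an assumption and not as the authors' exact procedure, is that a ranker assigns a score to every label and a separate meta-model predicts how many labels to keep for each instance. The minimal sketch below illustrates only that final selection step; the function name, the scores, and the predicted count are all hypothetical.

```python
from typing import Dict, List


def select_top_k_labels(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring labels, highest score first."""
    if k < 0:
        raise ValueError("k must be non-negative")
    # Rank label names by their score in descending order.
    ranked = sorted(scores, key=lambda label: scores[label], reverse=True)
    return ranked[:k]


# Hypothetical per-label scores for one query (assumed to come from
# some per-label scoring model; not specified in the abstract).
example_scores = {"sports": 1.3, "finance": -0.4, "travel": 0.7, "health": -1.1}

# Hypothetical output of a separate model that predicts how many
# labels apply to this instance (an assumption for illustration).
predicted_label_count = 2

selected = select_top_k_labels(example_scores, predicted_label_count)
print(selected)
```

Choosing the cut-off per instance in this way avoids tuning a separate score threshold for every label, which is one way to read the abstract's claim of avoiding expensive cross-validation.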
Index Terms
- Large scale multi-label classification via metalabeler