research-article

A large-scale active learning system for topical categorization on the web

Authors:
Suju Rajan, Yahoo! Labs, Sunnyvale, CA, USA
Dragomir Yankov, Yahoo! Labs, Sunnyvale, CA, USA
Scott J. Gaffney, Yahoo! Labs, Sunnyvale, CA, USA
Adwait Ratnaparkhi, Yahoo! Labs, Sunnyvale, CA, USA

WWW '10: Proceedings of the 19th international conference on World wide web, April 2010, Pages 791–800. https://doi.org/10.1145/1772690.1772771

Published: 26 April 2010

ABSTRACT

Many web applications, such as ad matching systems, vertical search engines, and page categorization systems, require the identification of a particular type or class of pages on the Web. The sheer number and diversity of the pages on the Web, however, make it hard to obtain a good sample of the class of interest. In this paper, we describe a successfully deployed end-to-end system that starts from a biased training sample and makes use of several state-of-the-art machine learning algorithms working in tandem, including a powerful active learning component, in order to achieve a good classification system. The system is evaluated on traffic from a real-world ad-matching platform and is shown to achieve high categorization effectiveness with a significant reduction in editorial effort and labeling time.

Index Terms

1. Computing methodologies
  1. Machine learning

Published in
WWW '10: Proceedings of the 19th international conference on World wide web
April 2010
1407 pages
ISBN: 9781605587998
DOI: 10.1145/1772690
General Chairs: Michael Rappa (North Carolina State University, USA), Paul Jones (University of North Carolina at Chapel Hill, USA)
Program Chairs: Juliana Freire (University of Utah, USA), Soumen Chakrabarti (Indian Institute of Technology, India)
Copyright © 2010 International World Wide Web Conference Committee (IW3C2)
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
active learning
classification with imbalanced datasets
svms
web scale performance evaluation
Acceptance Rates
Overall acceptance rate: 1,899 of 8,196 submissions (23%)