DOI: 10.1145/1772690.1772771
research-article

A large-scale active learning system for topical categorization on the web

Published: 26 April 2010

ABSTRACT

Many web applications, such as ad matching systems, vertical search engines, and page categorization systems, require the identification of a particular type or class of pages on the Web. The sheer number and diversity of the pages on the Web, however, make the problem of obtaining a good sample of the class of interest hard. In this paper, we describe a successfully deployed end-to-end system that starts from a biased training sample and makes use of several state-of-the-art machine learning algorithms working in tandem, including a powerful active learning component, in order to achieve a good classification system. The system is evaluated on traffic from a real-world ad-matching platform and is shown to achieve high categorization effectiveness with a significant reduction in editorial effort and labeling time.


Published in

WWW '10: Proceedings of the 19th international conference on World wide web
April 2010
1407 pages
ISBN: 9781605587998
DOI: 10.1145/1772690

Copyright © 2010 International World Wide Web Conference Committee (IW3C2)

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
