research-article

Improving Researcher Homepage Classification with Unlabeled Data

Published: 19 October 2015
Abstract

A classifier that determines whether a webpage is relevant to a specified set of topics is a key component of focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changing content on the Web? We investigate this question in the context of identifying researcher homepages. We show experimentally that classifiers trained on existing datasets of academic homepages underperform on “non-homepages” present on current-day academic websites. As an alternative to obtaining labeled datasets to retrain classifiers for the new content, in this article we ask the following question: “How can we effectively use the unlabeled data readily available from academic websites to improve researcher homepage classification?”

We design novel URL-based features and use them in conjunction with content-based features for representing homepages. Within the co-training framework, these sets of features can be treated as complementary views enabling us to effectively use unlabeled data and obtain remarkable improvements in homepage identification on the current-day academic websites. We also propose a novel technique for “learning a conforming pair of classifiers” that mimics co-training. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We argue that this loss formulation provides insights for understanding co-training and can be used even in the absence of a validation dataset.
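As a rough illustration of the co-training idea described above, the sketch below trains one classifier on URL tokens and one on content tokens, and lets each view label the unlabeled page it is most confident about. It is not the authors' implementation: the tiny naive Bayes learner, the function names (`TinyNB`, `fit_views`, `co_train`), and the toy tokens are all assumptions made for the example.

```python
import math
from collections import Counter


class TinyNB:
    """Multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, docs, labels):
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.prior}
        for tokens, label in zip(docs, labels):
            self.counts[label].update(tokens)
        self.vocab = set().union(*self.counts.values())
        return self

    def predict_proba(self, tokens):
        n = sum(self.prior.values())
        logp = {}
        for c, counts in self.counts.items():
            total = sum(counts.values()) + len(self.vocab)
            # Tokens never seen in training are ignored.
            logp[c] = math.log(self.prior[c] / n) + sum(
                math.log((counts[t] + 1) / total) for t in tokens if t in self.vocab
            )
        top = max(logp.values())
        z = sum(math.exp(v - top) for v in logp.values())
        return {c: math.exp(v - top) / z for c, v in logp.items()}


def fit_views(labeled):
    """Train one classifier per view: index 0 = URL tokens, 1 = content tokens."""
    labels = [y for _, y in labeled]
    return [TinyNB().fit([x[v] for x, _ in labeled], labels) for v in (0, 1)]


def co_train(labeled, unlabeled, rounds=5):
    """Each round, every view moves its most confident pooled page to the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        for v, clf in enumerate(fit_views(labeled)):
            if not pool:
                break
            scored = []
            for i, x in enumerate(pool):
                probs = clf.predict_proba(x[v])
                label = max(probs, key=probs.get)
                scored.append((probs[label], i, label))
            _, i, label = max(scored)
            labeled.append((pool.pop(i), label))
    return fit_views(labeled), labeled


# Hypothetical toy data: (url_tokens, content_tokens); label 1 = homepage, 0 = other.
seed = [
    ((["~", "jsmith"], ["research", "publications", "teaching"]), 1),
    ((["courses", "cs101"], ["syllabus", "homework", "exam"]), 0),
]
pool = [
    (["~", "adoe"], ["publications", "students", "research"]),
    (["courses", "cs202"], ["homework", "grading", "exam"]),
]
(url_clf, content_clf), grown = co_train(seed, pool)
```

Because each view only ever sees its own tokens, a page that one view labels confidently can supply new evidence to the other view in the next round, which is the intuition behind treating URL and content features as complementary.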

Our next set of findings pertains to the evaluation of other state-of-the-art techniques for classifying homepages. First, we apply feature selection (FS) and feature hashing (FH) techniques independently and in conjunction with co-training to academic homepages. FS is a well-known technique for removing redundant and unnecessary features from the data representation, whereas FH uses hash functions to encode features efficiently. We show that FS can be effectively combined with co-training to obtain further improvements in identifying homepages. However, with hashed feature representations we observe a performance degradation, possibly due to feature collisions.
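To make the notion of feature collisions concrete, here is a minimal sketch of feature hashing, assuming an unsigned variant with a SHA-256-based bucket function chosen only so that results are reproducible. The helper names (`bucket`, `hash_features`, `collisions`) and the token list are hypothetical and are not taken from the article.

```python
import hashlib
from collections import Counter


def bucket(token, n_buckets):
    """Stable bucket index for a token (SHA-256 is an assumption, used for determinism)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_features(tokens, n_buckets):
    """Sparse count vector: bucket index -> number of tokens hashed there."""
    return Counter(bucket(t, n_buckets) for t in tokens)


def collisions(vocab, n_buckets):
    """How many distinct tokens land in a bucket that another token already occupies."""
    distinct = set(vocab)
    return len(distinct) - len({bucket(t, n_buckets) for t in distinct})


# Hypothetical content tokens from an academic page.
tokens = ["research", "publications", "homework", "syllabus", "cv", "teaching"]
vec = hash_features(tokens, n_buckets=4)
```

With six distinct tokens and only four buckets, at least two tokens must share a bucket, so their counts become indistinguishable to the classifier. Increasing `n_buckets` makes such collisions less likely at the cost of a larger feature vector.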

Finally, we evaluate other semisupervised algorithms for homepage classification. We show that although several algorithms are effective in using information from the unlabeled instances, co-training that explicitly harnesses the feature split in the underlying instances outperforms approaches that combine content and URL features into a single view.

    • Published in

      ACM Transactions on the Web  Volume 9, Issue 4
      October 2015
      114 pages
      ISSN:1559-1131
      EISSN:1559-114X
      DOI:10.1145/2830542
      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 19 October 2015
      • Accepted: 1 April 2015
      • Revised: 1 February 2015
      • Received: 1 January 2014
      Qualifiers

      • research-article
      • Research
      • Refereed
