Improving Researcher Homepage Classification with Unlabeled Data

Authors:
Sujatha Das Gollapalli, The Pennsylvania State University
Cornelia Caragea, University of North Texas, TX, USA
Prasenjit Mitra, The Pennsylvania State University, PA, USA
C. Lee Giles, The Pennsylvania State University, PA, USA

ACM Transactions on the Web, Volume 9, Issue 4, Article No. 17, pp. 1–32. https://doi.org/10.1145/2767135

Published: 19 October 2015

Abstract

A classifier that determines whether a webpage is relevant to a specified set of topics is a key component of focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages in the face of changing content on the Web? We investigate this question in the context of identifying researcher homepages. We show experimentally that classifiers trained on existing datasets of academic homepages underperform on “non-homepages” present on current-day academic websites. As an alternative to obtaining labeled datasets to retrain classifiers for the new content, in this article we ask the following question: “How can we effectively use the unlabeled data readily available from academic websites to improve researcher homepage classification?”

We design novel URL-based features and use them in conjunction with content-based features for representing homepages. Within the co-training framework, these sets of features can be treated as complementary views, enabling us to effectively use unlabeled data and obtain remarkable improvements in homepage identification on current-day academic websites. We also propose a novel technique for “learning a conforming pair of classifiers” that mimics co-training. Our algorithm seeks to minimize a loss (objective) function quantifying the difference in predictions from the two views afforded by co-training. We argue that this loss formulation provides insights for understanding co-training and can be used even in the absence of a validation dataset.
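As a rough illustration of the two-view idea described above, the following minimal sketch runs a co-training loop over a content view and a URL view. It is not the authors' implementation: the URL tokenizer, the has_tilde feature, the tiny naive Bayes base learner, the example.edu URLs, and the toy pages are all assumptions made for this example. The view_disagreement function is only a simple stand-in for a loss that measures the difference in predictions between the two views; the article's actual objective is not given in the abstract.

```python
import math
import re
from collections import Counter


def url_tokens(url):
    """Hypothetical URL view: lowercase alphanumeric tokens plus a tilde flag."""
    tokens = [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]
    if "~" in url:
        tokens.append("has_tilde")
    return tokens


class NaiveBayes:
    """Tiny multinomial naive Bayes over token lists with labels 0 and 1."""

    def fit(self, docs, labels):
        self.vocab = {t for d in docs for t in d}
        self.counts = {0: Counter(), 1: Counter()}
        self.n_docs = Counter(labels)
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc)
        return self

    def prob_positive(self, doc):
        n_total = sum(self.n_docs.values())
        v = len(self.vocab) + 1  # Laplace smoothing; +1 leaves mass for unseen tokens
        log_p = {}
        for y in (0, 1):
            total = sum(self.counts[y].values())
            lp = math.log((self.n_docs[y] + 1) / (n_total + 2))
            for t in doc:
                lp += math.log((self.counts[y][t] + 1) / (total + v))
            log_p[y] = lp
        m = max(log_p.values())
        e0, e1 = math.exp(log_p[0] - m), math.exp(log_p[1] - m)
        return e1 / (e0 + e1)


def train_views(labeled):
    """Fit one classifier per view: index 0 is content, index 1 is URL."""
    labels = [y for _, y in labeled]
    return [NaiveBayes().fit([x[v] for x, _ in labeled], labels) for v in (0, 1)]


def co_train(labeled, unlabeled, rounds=3, per_round=1):
    """Each round, each view pseudo-labels its most confident pool items."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        clfs = train_views(labeled)
        for v in (0, 1):
            # Confidence = distance of the predicted probability from 0.5.
            ranked = sorted(pool, key=lambda x: abs(clfs[v].prob_positive(x[v]) - 0.5),
                            reverse=True)
            for x in ranked[:per_round]:
                labeled.append((x, int(clfs[v].prob_positive(x[v]) >= 0.5)))
                pool.remove(x)
    return train_views(labeled)


def predict(clfs, content, url):
    """Average the two views' probabilities that the page is a homepage."""
    return (clfs[0].prob_positive(content) + clfs[1].prob_positive(url_tokens(url))) / 2


def view_disagreement(clfs, pool):
    """Mean squared gap between the two views' predictions on unlabeled pages."""
    gaps = [(clfs[0].prob_positive(c) - clfs[1].prob_positive(u)) ** 2 for c, u in pool]
    return sum(gaps) / len(gaps)


# Toy data (hypothetical): each page is a (content tokens, URL tokens) pair.
labeled = [
    ((["publications", "research", "students", "cv"],
      url_tokens("http://cs.example.edu/~alice/")), 1),
    ((["course", "syllabus", "homework", "exam"],
      url_tokens("http://cs.example.edu/courses/cs101/")), 0),
]
unlabeled = [
    (["research", "publications", "teaching"], url_tokens("http://cs.example.edu/~bob/")),
    (["homework", "exam", "schedule"], url_tokens("http://cs.example.edu/courses/cs202/")),
]
clfs = co_train(labeled, unlabeled)
print(predict(clfs, ["publications", "cv"], "http://ee.example.edu/~carol/"))
print(view_disagreement(clfs, unlabeled))
```

In this sketch each view adds its most confident pseudo-labeled pages to a shared training set, so evidence from the URL view can teach the content view and vice versa. The disagreement value can be computed on unlabeled pages alone, which is the property the abstract highlights for settings without a validation dataset.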

Our next set of findings pertains to the evaluation of other state-of-the-art techniques for classifying homepages. First, we apply feature selection (FS) and feature hashing (FH) techniques independently and in conjunction with co-training to academic homepages. FS is a well-known technique for removing redundant and unnecessary features from the data representation, whereas FH is a technique that uses hash functions for efficient encoding of features. We show that FS can be effectively combined with co-training to obtain further improvements in identifying homepages. With hashed feature representations, however, we observe a performance degradation, possibly due to feature collisions.
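To make the feature hashing idea and the collision effect concrete, here is a small sketch, assuming an unsigned hashing scheme built on MD5 from the Python standard library. The function names, bucket counts, and tokens are hypothetical and are not taken from the article.

```python
import hashlib


def bucket(token, n_buckets):
    """Map a token to a bucket index with a hash that is stable across runs."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_features(tokens, n_buckets):
    """Encode a token list as a fixed-length count vector."""
    vec = [0] * n_buckets
    for t in tokens:
        vec[bucket(t, n_buckets)] += 1
    return vec


def count_collisions(vocab, n_buckets):
    """Number of distinct tokens that end up sharing a bucket with another token."""
    distinct = set(vocab)
    used = {bucket(t, n_buckets) for t in distinct}
    return len(distinct) - len(used)


tokens = ["research", "publications", "cv", "research"]
print(hash_features(tokens, 8))
print(count_collisions(["research", "publications", "cv", "courses", "syllabus"], 2))
```

The vector length is fixed by n_buckets regardless of vocabulary size, which is what makes the encoding efficient. When n_buckets is small relative to the vocabulary, distinct tokens are forced into the same bucket and the classifier can no longer tell them apart, which is one plausible reading of the degradation noted above.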

Finally, we evaluate other semisupervised algorithms for homepage classification. We show that although several algorithms are effective in using information from the unlabeled instances, co-training, which explicitly harnesses the feature split in the underlying instances, outperforms approaches that combine content and URL features into a single view.

Index Terms

1. Information systems
  1. Information retrieval

Published in
ACM Transactions on the Web Volume 9, Issue 4
October 2015
114 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/2830542
Editors:
Brian D. Davison, Lehigh University, USA
Marianne Winslett, University of Illinois at Urbana-Champaign
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 October 2015
- Accepted: 1 April 2015
- Revised: 1 February 2015
- Received: 1 January 2014
Author Tags
Researcher homepage classification
co-training
conforming classifiers
unlabeled data
Qualifiers
- research-article
- Research
- Refereed