Abstract
A classifier that determines whether a webpage is relevant to a specified set of topics is a key component of focused crawling. Can a classifier that is tuned to perform well on training datasets continue to filter out irrelevant pages as content on the Web changes? We investigate this question in the context of identifying researcher homepages. We show experimentally that classifiers trained on existing datasets of academic homepages underperform on the “non-homepages” present on current-day academic websites. As an alternative to obtaining labeled datasets to retrain classifiers for the new content, in this article we ask the following question: “How can we effectively use the unlabeled data readily available from academic websites to improve researcher homepage classification?”
We design novel URL-based features and use them in conjunction with content-based features for representing homepages. Within the co-training framework, these two sets of features can be treated as complementary views, enabling us to use unlabeled data effectively and obtain remarkable improvements in homepage identification on current-day academic websites. We also propose a novel technique for “learning a conforming pair of classifiers” that mimics co-training. Our algorithm seeks to minimize a loss (objective) function quantifying the difference between the predictions from the two views afforded by co-training. We argue that this loss formulation provides insights for understanding co-training and can be used even in the absence of a validation dataset.
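The two-view idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the URL tokenizer (alphabetic tokens plus "~"), the tiny naive Bayes learner, the toy pages under example.edu, and the simplified loop in which each view labels the one unlabeled page it is most confident about. The `disagreement` function is only one possible reading of "a loss quantifying the difference in predictions from the two views" (a mean squared gap); the article's actual features and objective are not given in this abstract.

```python
import math
import re
from collections import Counter

def url_tokens(url):
    # URL view (hypothetical features): alphabetic tokens plus "~".
    return re.findall(r"[a-z]+|~", url.lower())

def text_tokens(text):
    # Content view: lowercase word tokens from the page text.
    return re.findall(r"[a-z]+", text.lower())

VIEWS = (url_tokens, text_tokens)  # index 0 = URL view, index 1 = content view

class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.prior}
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc)
        self.vocab_size = len(set().union(*self.counts.values())) + 1
        return self

    def prob(self, doc):
        # Return P(class | doc) as a dict, normalised in log space.
        n = sum(self.prior.values())
        logp = {}
        for c, cnt in self.counts.items():
            denom = sum(cnt.values()) + self.vocab_size
            logp[c] = math.log(self.prior[c] / n) + sum(
                math.log((cnt[t] + 1) / denom) for t in doc)
        top = max(logp.values())
        z = sum(math.exp(v - top) for v in logp.values())
        return {c: math.exp(v - top) / z for c, v in logp.items()}

def train_views(labeled):
    # One classifier per view, both trained on the same labeled pool.
    ys = [y for _, _, y in labeled]
    return [NaiveBayes().fit([tok(ex[i]) for ex in labeled], ys)
            for i, tok in enumerate(VIEWS)]

def co_train(labeled, unlabeled, rounds=2):
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for i, tok in enumerate(VIEWS):
            if not pool:
                break
            clf = train_views(labeled)[i]
            # Each view labels the pool page it is most confident about.
            best = max(pool, key=lambda ex: max(clf.prob(tok(ex[i])).values()))
            probs = clf.prob(tok(best[i]))
            pool.remove(best)
            labeled.append((best[0], best[1], max(probs, key=probs.get)))
    return train_views(labeled), labeled

def disagreement(clfs, pages, positive="home"):
    # Mean squared gap between the two views' P(positive | page).
    gaps = [(clfs[0].prob(url_tokens(u))[positive]
             - clfs[1].prob(text_tokens(t))[positive]) ** 2 for u, t in pages]
    return sum(gaps) / len(gaps)

# Toy data (invented for this sketch): (url, text, label) and (url, text).
LABELED = [
    ("http://cs.example.edu/~alice/", "alice professor research publications teaching", "home"),
    ("http://cs.example.edu/courses/cs101/syllabus.html", "course syllabus homework schedule grading", "other"),
]
UNLABELED = [
    ("http://cs.example.edu/~bob/index.html", "bob associate professor research interests publications"),
    ("http://cs.example.edu/courses/cs202/schedule.html", "lecture schedule homework exams grading"),
    ("http://ee.example.edu/~carol/", "carol research publications students"),
    ("http://cs.example.edu/news/seminar.html", "seminar schedule announcement"),
]

clfs, final = co_train(LABELED, UNLABELED)
for url, _, label in final[len(LABELED):]:
    print(label, url)
print("disagreement:", round(disagreement(clfs, UNLABELED), 4))
```

Because `disagreement` needs only unlabeled pages, a quantity of this kind can be tracked without a validation set, which is in the spirit of the claim above.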
Our next set of findings pertains to the evaluation of other state-of-the-art techniques for classifying homepages. First, we apply feature selection (FS) and feature hashing (FH) techniques, both independently and in conjunction with co-training, to academic homepages. FS is a well-known technique for removing redundant and unnecessary features from the data representation, whereas FH uses hash functions to encode features efficiently. We show that FS can be effectively combined with co-training to obtain further improvements in identifying homepages. With hashed feature representations, however, we observe a performance degradation, possibly due to feature collisions.
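To make the notion of a feature collision concrete, here is a minimal sketch of feature hashing. The hash function (MD5, chosen here only because it is stable across runs), the bucket count, and the token list are all assumptions for illustration; the abstract does not say which hash function or vector size the experiments used.

```python
import hashlib
from collections import defaultdict

def bucket(token, n_buckets):
    # Map a token to a bucket index with a stable hash (MD5 is an assumption).
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def hash_features(tokens, n_buckets):
    # Encode a token list as a fixed-length count vector.
    vec = [0] * n_buckets
    for tok in tokens:
        vec[bucket(tok, n_buckets)] += 1
    return vec

def collisions(vocab, n_buckets):
    # Group distinct tokens that end up sharing a bucket.
    groups = defaultdict(list)
    for tok in sorted(set(vocab)):
        groups[bucket(tok, n_buckets)].append(tok)
    return {b: toks for b, toks in groups.items() if len(toks) > 1}

# Hypothetical vocabulary: eight distinct tokens squeezed into four buckets.
VOCAB = ["professor", "research", "publications", "teaching",
         "syllabus", "homework", "schedule", "seminar"]

print(hash_features(VOCAB, 4))
print(collisions(VOCAB, 4))
```

With more distinct tokens than buckets, at least two tokens must share a bucket, so the classifier can no longer tell them apart. A token that signals a homepage and one that signals a course page may be merged into a single feature, which is one plausible mechanism for the degradation noted above.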
Finally, we evaluate other semisupervised algorithms for homepage classification. We show that although several algorithms are effective in using information from the unlabeled instances, co-training, which explicitly harnesses the feature split in the underlying instances, outperforms approaches that combine content and URL features into a single view.
Index Terms
- Improving Researcher Homepage Classification with Unlabeled Data