Abstract
With the rapid growth of the World Wide Web, focused crawling has become increasingly important. The goal of focused crawling is to seek out and collect the pages that are relevant to a predefined set of topics. Determining the relevance of a page to a specific topic has usually been treated as a classification problem. However, when training such classifiers, it is often difficult to select negative samples. This difficulty arises because collecting a set of pages relevant to a specific topic is not, by nature, a classification process.
In this paper, we propose a novel focused crawling method that uses only positive samples to represent a given topic as a hyperplane, which is obtained from a modified Proximal Support Vector Machine. The distance from a page to the hyperplane is used to prioritize the order in which pages are visited. We evaluate the proposed method on the WebKB data set and on the Web. The promising results suggest that the proposed method is more effective for the focused crawling problem than traditional approaches.
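The prioritization step can be illustrated with a minimal sketch. It assumes the topic hyperplane, written here as a weight vector `w` and an offset `gamma`, has already been obtained, and it does not reproduce the modified Proximal Support Vector Machine itself. It also assumes that pages closer to the hyperplane should be visited first. The names `distance_to_hyperplane` and `Frontier`, and the use of plain lists as page feature vectors, are hypothetical choices made for illustration.

```python
import heapq
import math


def distance_to_hyperplane(x, w, gamma):
    """Euclidean distance from feature vector x to the plane w . x = gamma."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return abs(dot - gamma) / norm


class Frontier:
    """Crawl frontier ordered by distance to the topic hyperplane.

    Assumption: a smaller distance means the page is more likely to be
    on topic, so it is popped (visited) earlier.
    """

    def __init__(self, w, gamma):
        self.w = w
        self.gamma = gamma
        self._heap = []      # min-heap of (distance, url) pairs
        self._seen = set()   # URLs already queued, to avoid duplicates

    def push(self, url, features):
        if url in self._seen:
            return
        self._seen.add(url)
        d = distance_to_hyperplane(features, self.w, self.gamma)
        heapq.heappush(self._heap, (d, url))

    def pop(self):
        """Return the queued URL with the smallest distance."""
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)
```

In a crawler loop, each fetched page would have its out-links pushed with their feature vectors, and the next page to fetch would be taken with `pop()`. How a link's features are derived (for example from anchor text or the parent page) is left open here.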
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Choi, Y., Kim, K., Kang, M. (2005). A Focused Crawling for the Web Resource Discovery Using a Modified Proximal Support Vector Machines. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2005. ICCSA 2005. Lecture Notes in Computer Science, vol 3480. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11424758_20
DOI: https://doi.org/10.1007/11424758_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25860-5
Online ISBN: 978-3-540-32043-2
eBook Packages: Computer Science, Computer Science (R0)