An Extended Method for Finding Related Web Pages with Focused Crawling Techniques

Furuse, Kazutaka; Ohmura, Hiroaki; Chen, Hanxiong; Kitagawa, Hiroyuki

doi:10.1007/978-3-642-23863-5_3


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6882)

Included in the following conference series:

International Conference on Knowledge-Based and Intelligent Information and Engineering Systems


Abstract

This paper proposes an extended mechanism for efficiently finding related web pages, built by incorporating focused crawling techniques.

One successful method for finding related web pages is Kleinberg’s HITS algorithm, which determines the web pages related to a given set of web pages by calculating hub and authority scores. Although this method is effective at extracting closely related web pages, it has a limitation: the score calculation considers only the web pages that are directly connected to the given web pages.
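The hub and authority calculation can be illustrated with a minimal sketch. It assumes the neighborhood graph is already available as a list of directed `(source, target)` links; the function names `hits` and `normalize`, the fixed iteration count, and the toy page names are illustrative assumptions, not details taken from the paper.

```python
from math import sqrt


def normalize(scores):
    """Scale a score dict to unit Euclidean length (no-op if all zero)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub score: sum of the authority scores of pages linked to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Toy graph (hypothetical pages): three hubs, two candidate authorities.
edges = [("h1", "a"), ("h1", "b"), ("h2", "a"), ("h3", "a")]
hub, auth = hits(edges)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph the page with the most incoming links from hubs ends up with the highest authority score, and the page linking to both candidates ends up with the highest hub score. Only pages that appear in `edges` can receive a score, which is the limitation described above.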

The proposed method extends the HITS algorithm by enlarging the neighborhood graph used for the score calculation. By navigating links forward and backward, pages that are not directly connected to the given web pages are included in the neighborhood graph. Because the navigation uses focused crawling techniques, the proposed method effectively collects promising pages that contribute to improving the accuracy of the scores. Moreover, unrelated pages are filtered out to avoid topic drift in the course of the navigation. Consequently, the proposed method successfully finds related pages, since the scores are calculated on adequately extended neighborhood graphs. The effectiveness and efficiency of the proposed method are confirmed by the results of experiments performed with real data sets.
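The abstract does not give the crawling procedure, so the following is only a rough sketch of the general idea under stated assumptions: the link structure is an in-memory dict rather than the live web, relevance is supplied by a caller-provided function, pages are expanded best-first, and the `threshold`, `max_depth`, and `max_pages` parameters are hypothetical. It is not the authors' algorithm.

```python
import heapq


def expand_neighborhood(seeds, out_links, relevance,
                        threshold=0.5, max_depth=2, max_pages=100):
    """Grow a neighborhood graph from seed pages, best-first.

    out_links maps a page to the pages it links to; relevance(page)
    returns a score in [0, 1]. All parameters are illustrative.
    """
    # Build the backward-link index so links can be followed both ways.
    in_links = {}
    for src, targets in out_links.items():
        for dst in targets:
            in_links.setdefault(dst, set()).add(src)

    collected = set(seeds)
    # Heap entries are (-score, depth, page): most relevant page first.
    frontier = [(-1.0, 0, seed) for seed in seeds]
    heapq.heapify(frontier)

    while frontier and len(collected) < max_pages:
        _, depth, page = heapq.heappop(frontier)
        if depth >= max_depth:
            continue
        neighbors = set(out_links.get(page, ())) | in_links.get(page, set())
        for candidate in sorted(neighbors - collected):
            score = relevance(candidate)
            if score < threshold:
                continue  # drop unrelated pages to limit topic drift
            collected.add(candidate)
            heapq.heappush(frontier, (-score, depth + 1, candidate))
            if len(collected) >= max_pages:
                break

    # Keep only the links whose two endpoints were both collected.
    edges = [(src, dst) for src, targets in out_links.items()
             for dst in targets if src in collected and dst in collected]
    return collected, edges


# Toy web (hypothetical pages and relevance scores).
toy_links = {"seed": ["a"], "a": ["b", "spam"], "c": ["seed"],
             "b": [], "spam": ["far"], "far": []}
toy_scores = {"seed": 1.0, "a": 0.9, "b": 0.8, "c": 0.7,
              "spam": 0.1, "far": 0.9}
pages, graph_edges = expand_neighborhood(["seed"], toy_links, toy_scores.get)
```

Here page `b` is two links away from the seed yet still enters the graph, while `spam` is rejected by the relevance filter, so `far` is never reached through it. The returned edge list is the kind of enlarged neighborhood graph on which hub and authority scores could then be computed.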

References

The Open Directory Project, http://www.dmoz.org/
Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific web resource discovery. In: Proceedings of the Eighth International Conference on World Wide Web, WWW 1999, pp. 1623–1640. Elsevier North-Holland, Inc., New York (1999)
Chakrabarti, S., Dom, B., Raghavan, P., Rajagopalan, S., Gibson, D., Kleinberg, J.: Automatic resource compilation by analyzing hyperlink structure and associated text. In: Proceedings of the Seventh International Conference on World Wide Web 7, WWW 7, pp. 65–74. Elsevier Science Publishers B. V., Amsterdam (1998)
Järvelin, K., Kekäläinen, J.: IR evaluation methods for retrieving highly relevant documents. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2000, pp. 41–48 (2000)
Kleinberg, J.: Authoritative sources in a hyperlinked environment. Journal of the ACM 46, 604–632 (1999)
Liu, B.: Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer, Heidelberg (2007)
Micarelli, A., Gasparetti, F.: Adaptive focused crawling. In: Brusilovsky, P., Kobsa, A., Nejdl, W. (eds.) Adaptive Web 2007. LNCS, vol. 4321, pp. 231–262. Springer, Heidelberg (2007)
Olston, C., Najork, M.: Web crawling. Foundations and Trends in Information Retrieval 4, 175–246 (2010)

Author information

Authors and Affiliations

Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, Japan
Kazutaka Furuse, Hiroaki Ohmura, Hanxiong Chen & Hiroyuki Kitagawa

Editor information

Editors and Affiliations

Institute of Integrated Sensor Systems, University of Kaiserslautern, Erwin-Schroedinger-str. 12, 67663, Kaiserslautern, Germany
Andreas König
Knowledge-Based Systems Group, Department of Computer Science, University of Kaiserslautern, P.O. Box 3049, 67653, Kaiserslautern, Germany
Andreas Dengel
School of Business, University of Applied Sciences Northwestern Switzerland, Riggenbachstr. 16, 4600, Olten, Switzerland
Knut Hinkelmann
Graduate School of Engineering, Osaka Prefecture University, 1-1 Gakuen-cho, 599-8531, Sakai, Osaka, Japan
Koichi Kise
KES International, P.O. Box 2115, BN43 9AF, Shoreham-by-sea, UK
Robert J. Howlett
University of South Australia, Adelaide, 5095, Mawson Lakes, SA, Australia
Lakhmi C. Jain

About this paper

Cite this paper

Furuse, K., Ohmura, H., Chen, H., Kitagawa, H. (2011). An Extended Method for Finding Related Web Pages with Focused Crawling Techniques. In: König, A., Dengel, A., Hinkelmann, K., Kise, K., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2011. Lecture Notes in Computer Science, vol 6882. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23863-5_3

DOI: https://doi.org/10.1007/978-3-642-23863-5_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23862-8
Online ISBN: 978-3-642-23863-5
eBook Packages: Computer Science, Computer Science (R0)
