Novel approaches to crawling important pages early

Alam, Md. Hijbul; Ha, JongWoo; Lee, SangKeun

doi:10.1007/s10115-012-0535-4

Regular Paper
Published: 09 September 2012

Volume 33, pages 707–734 (2012)

Knowledge and Information Systems


Abstract

Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering that prioritize important pages, using the well-known PageRank as the importance metric. To score URLs, the proposed algorithms use various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment on publicly available data sets to examine the effect of each feature on crawl ordering and to evaluate the performance of many algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces running time by a factor of up to three while improving effectiveness by 5% in cumulative PageRank.
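
The idea of biasing crawl order toward important pages can be illustrated with a small sketch. The code below is not the FPR-title-host or RankMass algorithm, whose details are not given in this abstract; it is a generic best-first frontier, written under the assumption that a URL's priority is the damped score passed to it by already-crawled pages that link to it (a simple use of partial link structure). The names `crawl_order`, `cumulative_importance`, the `damping` parameter, and the toy graph are all hypothetical.

```python
import heapq
from collections import defaultdict
from itertools import accumulate


def crawl_order(graph, seeds, budget, damping=0.85):
    """Return the order in which a best-first crawler would fetch URLs.

    graph   -- dict mapping a URL to the list of URLs it links to
               (stands in for fetching and parsing a page; an assumption)
    seeds   -- starting URLs, which share an initial score of 1.0
    budget  -- maximum number of pages to fetch
    damping -- fraction of a page's score passed on to its out-links
    """
    score = defaultdict(float)
    for s in seeds:
        score[s] = 1.0 / len(seeds)

    # Max-heap via negated scores; ties are broken by URL string order.
    frontier = [(-score[s], s) for s in seeds]
    heapq.heapify(frontier)

    crawled, seen = [], set()
    while frontier and len(crawled) < budget:
        neg, url = heapq.heappop(frontier)
        # Skip entries already crawled or superseded by a higher score.
        if url in seen or -neg < score[url]:
            continue
        seen.add(url)
        crawled.append(url)

        out_links = graph.get(url, [])
        for target in out_links:
            if target in seen:
                continue
            # Each crawled page splits its damped score among its out-links.
            score[target] += damping * score[url] / len(out_links)
            heapq.heappush(frontier, (-score[target], target))
    return crawled


def cumulative_importance(order, importance):
    """Running sum of an importance value (e.g. PageRank) along a crawl order."""
    return list(accumulate(importance.get(url, 0.0) for url in order))


# Toy link graph (hypothetical): "h" is linked from both "a" and "b".
toy_graph = {"s": ["a", "b", "c"], "a": ["h"], "b": ["h"], "c": ["t"]}

if __name__ == "__main__":
    print(crawl_order(toy_graph, ["s"], budget=4))
```

On the toy graph, the page `h` receives score from two crawled pages and is therefore fetched before `c`, whereas a breadth-first crawler would fetch `c` first. Plotting `cumulative_importance` against the number of pages fetched gives the kind of curve on which crawl orderings are compared by cumulative PageRank.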



Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (No. 2012M3C4A7033344), and by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (No. 2011-0010325).

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Korea University, Seoul, 136-701, Korea
Md. Hijbul Alam, JongWoo Ha & SangKeun Lee

Corresponding author

Correspondence to SangKeun Lee.

About this article

Cite this article

Alam, M.H., Ha, J. & Lee, S. Novel approaches to crawling important pages early. Knowl Inf Syst 33, 707–734 (2012). https://doi.org/10.1007/s10115-012-0535-4

Received: 10 November 2010
Revised: 03 May 2012
Accepted: 11 August 2012
Published: 09 September 2012
Issue Date: December 2012
DOI: https://doi.org/10.1007/s10115-012-0535-4
