Abstract
Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering that prioritize important pages, using the well-known PageRank as the importance metric. To score URLs, the proposed algorithms use various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment on publicly available data sets to examine the effect of each feature on crawl ordering and to evaluate the performance of the algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces running time by up to a factor of three while improving cumulative PageRank by 5%.
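As a rough illustration of what crawl ordering and the cumulative PageRank measure mean, the following minimal sketch crawls a toy link graph by always fetching the highest-scoring URL in the frontier, then sums the PageRank of the pages in the order they were fetched. It is not the FPR-title-host or RankMass algorithm: the function names, the toy graph, and the PageRank values are hypothetical, and the score is fixed when a URL is first discovered, whereas the algorithms in the paper estimate scores from features such as partial link structure.

```python
import heapq

def crawl_order(graph, seeds, score, budget):
    """Return the fetch order of a greedy crawl (hypothetical helper).

    graph  : dict mapping a URL to the list of URLs it links to
    seeds  : iterable of start URLs
    score  : callable giving an estimated importance for a URL
    budget : maximum number of pages to fetch
    """
    # heapq is a min-heap, so scores are negated to pop the largest first.
    frontier = [(-score(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    order = []
    while frontier and len(order) < budget:
        _, url = heapq.heappop(frontier)
        order.append(url)
        for target in graph.get(url, []):
            if target not in seen:
                seen.add(target)
                # Assumption: the score is computed once, at discovery time.
                heapq.heappush(frontier, (-score(target), target))
    return order

def cumulative_pagerank(order, pagerank):
    """Running sum of PageRank over pages in the order they were fetched."""
    total, curve = 0.0, []
    for url in order:
        total += pagerank.get(url, 0.0)
        curve.append(total)
    return curve

# Toy data (made up for illustration only).
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
pagerank = {"a": 0.1, "b": 0.2, "c": 0.4, "d": 0.3}

# An "oracle" score that already knows the true PageRank, versus a constant
# score, under which ties are broken by URL string order.
informed = crawl_order(graph, ["a"], lambda u: pagerank[u], budget=3)
uninformed = crawl_order(graph, ["a"], lambda u: 0.0, budget=3)

print(informed, cumulative_pagerank(informed, pagerank))
print(uninformed, cumulative_pagerank(uninformed, pagerank))
```

With the same budget of three pages, the informed ordering accumulates more PageRank than the uninformed one, which is the sense in which a crawl ordering is judged more effective here.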
Acknowledgments
This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (No. 2012M3C4A7033344), and by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science and Technology (No. 2011-0010325).
Cite this article
Alam, M.H., Ha, J. & Lee, S. Novel approaches to crawling important pages early. Knowl Inf Syst 33, 707–734 (2012). https://doi.org/10.1007/s10115-012-0535-4
DOI: https://doi.org/10.1007/s10115-012-0535-4