
Novel approaches to crawling important pages early

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Web crawlers are essential to many Web applications, such as Web search engines, Web archives, and Web directories, which maintain Web pages in their local repositories. In this paper, we study the problem of crawl scheduling that biases crawl ordering toward important pages. We propose a set of crawling algorithms for effective and efficient crawl ordering that prioritize important pages, using the well-known PageRank as the importance metric. To score URLs, the proposed algorithms use various features, including partial link structure, inter-host links, page titles, and topic relevance. We conduct a large-scale experiment on publicly available data sets to examine the effect of each feature on crawl ordering and to evaluate the performance of the algorithms. The experimental results verify the efficacy of our schemes. In particular, compared with the representative RankMass crawler, the FPR-title-host algorithm reduces running time by up to a factor of three while improving effectiveness by 5% in cumulative PageRank.
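To make the idea of importance-biased crawl ordering concrete, the following is a minimal sketch, not the FPR-title-host algorithm or any other method from the paper. It assumes a toy link graph held in memory, scores frontier URLs only from the partial link structure seen so far (each crawled page passes a damped, equal share of its score to its outlinks), and measures the result as the cumulative PageRank of the crawled pages. The function names, the propagation rule, and the damping value are illustrative assumptions.

```python
import heapq
from collections import defaultdict


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outlinks."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Pages without outlinks spread their rank uniformly over all pages.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for p, targets in graph.items():
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
        rank = new
    return rank


def crawl_order(graph, seeds, budget, damping=0.85):
    """Fetch up to `budget` pages, always taking the highest-scored frontier URL.

    Assumption: a URL's score is the damped score mass it has received from
    already-crawled pages, i.e. it depends only on the partial link structure.
    """
    score = defaultdict(float)
    for s in seeds:
        score[s] = 1.0 / len(seeds)
    heap = [(-score[s], s) for s in seeds]  # max-heap via negated scores
    heapq.heapify(heap)
    crawled, seen = [], set()
    while heap and len(crawled) < budget:
        neg, url = heapq.heappop(heap)
        if url in seen or -neg < score[url]:
            continue  # already crawled, or a stale entry with an outdated score
        seen.add(url)
        crawled.append(url)
        targets = graph.get(url, [])
        if targets:
            share = damping * score[url] / len(targets)
            for t in targets:
                if t not in seen:
                    score[t] += share
                    heapq.heappush(heap, (-score[t], t))
    return crawled


def cumulative_pagerank(crawled, rank):
    """Sum of the true PageRank values of the pages crawled so far."""
    return sum(rank[p] for p in crawled)


# Hypothetical four-page graph used only for illustration.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(graph)
order = crawl_order(graph, seeds=["a"], budget=3)
coverage = cumulative_pagerank(order, rank)
```

A crawl ordering is better under this metric when `coverage` grows faster as the budget increases. In this sketch the score updates cost one heap push per discovered link, which hints at why cheaper URL scoring can reduce running time relative to recomputing importance over the whole crawled graph.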

Acknowledgments

This research was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2012M3C4A7033344) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (No. 2011-0010325).

Corresponding author

Correspondence to SangKeun Lee.

About this article

Cite this article

Alam, M.H., Ha, J. & Lee, S. Novel approaches to crawling important pages early. Knowl Inf Syst 33, 707–734 (2012). https://doi.org/10.1007/s10115-012-0535-4

  • DOI: https://doi.org/10.1007/s10115-012-0535-4
