Evaluating topic-driven web crawlers

Published: 1 September 2001
DOI: 10.1145/383952.383995

ABSTRACT

Due to limited bandwidth, storage, and computational resources, and to the dynamic nature of the Web, search engines cannot index every Web page, and even the covered portion of the Web cannot be monitored continuously for changes. Therefore it is essential to develop effective crawling strategies to prioritize the pages to be indexed. The issue is even more important for topic-specific search engines, where crawlers must make additional decisions based on the relevance of visited pages. However, it is difficult to evaluate alternative crawling strategies because relevant sets are unknown and the search space is changing. We propose three different methods to evaluate crawling strategies. We apply the proposed metrics to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.
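To make the idea of prioritizing pages by relevance concrete, the following is a minimal sketch of a similarity-ranked (best-first) crawl frontier. It is an illustration under stated assumptions, not the algorithm evaluated in the paper: the function names `cosine` and `best_first_crawl`, the in-memory `pages` and `links` dictionaries standing in for the Web, the raw term-frequency vectors, and the rule that an unvisited link inherits the topic similarity of the page it was found on are all hypothetical choices made for this example.

```python
import heapq
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_first_crawl(pages, links, seeds, topic, budget):
    """Visit up to `budget` pages, always expanding the best-scored URL.

    Assumption: a link's priority is the topic similarity of the page
    on which the link was found (the target is not yet downloaded).
    """
    topic_vec = Counter(topic.lower().split())
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        score = cosine(topic_vec, Counter(pages[url].lower().split()))
        for out in links.get(url, []):
            if out not in seen and out in pages:
                seen.add(out)
                heapq.heappush(frontier, (-score, out))
    return visited


# Toy "Web": page text and outgoing links (made-up data).
pages = {
    "seed": "web crawler index",
    "a": "topic driven web crawler evaluation",
    "b": "cooking recipes pasta",
    "c": "crawler evaluation metrics",
    "d": "pasta sauce",
}
links = {"seed": ["a", "b"], "a": ["c"], "b": ["d"]}

order = best_first_crawl(pages, links, ["seed"], "web crawler evaluation", 4)
print(order)
```

With a budget of four pages, this toy crawl follows the on-topic branch before the off-topic one and never reaches page `d`. A real crawler would also need fetching, parsing, politeness delays, and a better term-weighting scheme.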


Published in

SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
September 2001
454 pages
ISBN: 1581133316
DOI: 10.1145/383952

Copyright © 2001 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SIGIR '01 paper acceptance rate: 47 of 201 submissions, 23%
Overall acceptance rate: 792 of 3,983 submissions, 20%
