ABSTRACT
Due to limited bandwidth, storage, and computational resources, and to the dynamic nature of the Web, search engines cannot index every Web page, and even the covered portion of the Web cannot be monitored continuously for changes. It is therefore essential to develop effective crawling strategies that prioritize the pages to be indexed. The issue is even more important for topic-specific search engines, where crawlers must make additional decisions based on the relevance of visited pages. However, alternative crawling strategies are difficult to evaluate because the relevant sets are unknown and the search space is changing. We propose three different methods to evaluate crawling strategies, and we apply the proposed metrics to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.