Article

Evaluating topic-driven web crawlers

Authors:
Filippo Menczer, Univ. of Iowa, Iowa City
Gautam Pant, Univ. of Iowa, Iowa City
Padmini Srinivasan, Univ. of Iowa, Iowa City
Miguel E. Ruiz, Textwise, Syracuse, NY
SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, September 2001, Pages 241–249. https://doi.org/10.1145/383952.383995

Published: 01 September 2001

ABSTRACT

Due to limited bandwidth, storage, and computational resources, and to the dynamic nature of the Web, search engines cannot index every Web page, and even the covered portion of the Web cannot be monitored continuously for changes. Therefore it is essential to develop effective crawling strategies to prioritize the pages to be indexed. The issue is even more important for topic-specific search engines, where crawlers must make additional decisions based on the relevance of visited pages. However, it is difficult to evaluate alternative crawling strategies because relevant sets are unknown and the search space is changing. We propose three different methods to evaluate crawling strategies. We apply the proposed metrics to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.
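To make the idea of prioritizing pages by topical relevance concrete, here is a minimal sketch of a best-first crawl driven by similarity ranking. It is an illustration under stated assumptions, not the authors' implementation: the function names, the in-memory `TOY_WEB` graph that stands in for real page fetching, and the rule that a link inherits the similarity score of the page it was found on are all hypothetical choices made for this example.

```python
import heapq
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_first_crawl(seeds, topic, pages, max_pages):
    """Visit up to max_pages pages, always expanding the best frontier URL.

    pages maps url -> (text, outlinks); it is a toy stand-in for fetching.
    Assumption: a link's priority is the topic similarity of its parent page.
    """
    query = Counter(topic.lower().split())
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in pages:
            continue  # unknown URL: nothing to fetch in the toy graph
        text, outlinks = pages[url]
        score = cosine(query, Counter(text.lower().split()))
        visited.append((url, score))
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited


# Hypothetical five-page web used only for this example.
TOY_WEB = {
    "seed": ("web crawler overview", ["a", "b"]),
    "a": ("topic driven web crawler evaluation", ["c"]),
    "b": ("cooking recipes pasta", ["d"]),
    "c": ("focused crawler metrics", []),
    "d": ("more pasta recipes", []),
}

order = [url for url, _ in best_first_crawl(["seed"], "web crawler", TOY_WEB, 4)]
print(order)
```

With a budget of four pages, the link found on the off-topic page is never expanded, which is the kind of behavior an evaluation of crawling strategies has to measure.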

Index Terms

Evaluating topic-driven web crawlers
1. Computing methodologies
  1. Artificial intelligence
    1. Search methodologies
      1. Discrete space search
      2. Game tree search
2. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results
    2. Information retrieval query processing

Published in
SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
September 2001
454 pages
ISBN: 1581133316
DOI: 10.1145/383952
Chairmen:
Donald H. Kraft, Louisiana State Univ.
W. Bruce Croft, University of Massachusetts (For the Americas)
David J. Harper, The Robert Gordon University (For Europe and Africa)
Justin Zobel, RMIT University (For Asia and Australasia)
Copyright © 2001 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
InfoSpiders
PageRank
Web information retrieval
best-first search
focused crawlers
performance metrics
topic driven crawling
Qualifiers
- Article
Conference

Acceptance Rates
SIGIR '01 Paper Acceptance Rate: 47 of 201 submissions, 23%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
Article Metrics
- Total Citations: 131
- Total Downloads: 2,037
- Downloads (Last 12 months): 30
- Downloads (Last 6 weeks): 2