Evaluation Methods for Focused Crawling

  • Conference paper
AI*IA 2001: Advances in Artificial Intelligence (AI*IA 2001)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2175)

Abstract

The exponential growth of documents available on the World Wide Web makes it increasingly difficult to discover relevant information on a specific topic. In this context, growing interest is emerging in focused crawling, a technique that dynamically browses the Internet by choosing directions that maximize the probability of discovering relevant pages, given a specific topic. Predicting the relevance of a document before seeing its contents (i.e., relying on the parent pages only) is one of the central problems in focused crawling because it can save significant bandwidth resources. In this paper, we study three different evaluation functions for predicting the relevance of a hyperlink with respect to the target topic. We show that classification based on the anchor text is more accurate than classification based on the whole page. Moreover, we introduce a method that combines both the anchor and the whole parent document, using a Bayesian representation of the Web graph structure. The latter method obtains further accuracy improvements.
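
The idea of scoring a hyperlink from its anchor text before fetching the target can be illustrated with a minimal sketch. It is not the evaluation functions studied in the paper, whose details are not given in the abstract: the naive Bayes classifier, the class name `AnchorTextNB`, the function `focused_crawl`, the toy training pairs, and the in-memory link graph are all assumptions made for illustration. The sketch only shows how a relevance estimate computed from the parent page's anchor text could order a best-first crawl frontier.

```python
import heapq
import math
from collections import Counter


class AnchorTextNB:
    """Hypothetical two-class multinomial naive Bayes over anchor-text tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}

    def fit(self, examples):
        # examples: iterable of (anchor_text, is_relevant) pairs;
        # both classes are assumed to appear at least once.
        for text, label in examples:
            self.counts[label].update(text.lower().split())
            self.docs[label] += 1
        return self

    def prob_relevant(self, text):
        vocab = set(self.counts[True]) | set(self.counts[False])
        total_docs = self.docs[True] + self.docs[False]
        log_score = {}
        for label in (True, False):
            total = sum(self.counts[label].values())
            score = math.log(self.docs[label] / total_docs)  # class prior
            for tok in text.lower().split():
                num = self.counts[label][tok] + self.alpha
                den = total + self.alpha * len(vocab)
                score += math.log(num / den)
            log_score[label] = score
        # Normalise the two log scores into a probability of relevance.
        top = max(log_score.values())
        exp = {k: math.exp(v - top) for k, v in log_score.items()}
        return exp[True] / (exp[True] + exp[False])


def focused_crawl(graph, seeds, clf, budget):
    """Best-first traversal of a toy link graph.

    graph maps a URL to a list of (anchor_text, target_url) pairs.
    Links are scored from the anchor text alone, so the target page
    is never inspected before it is chosen.
    """
    frontier = [(-1.0, url) for url in seeds]  # negate: heapq is a min-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for anchor, target in graph.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-clf.prob_relevant(anchor), target))
    return visited


# Toy data, invented for this example only.
clf = AnchorTextNB().fit([
    ("machine learning course", True),
    ("neural network tutorial", True),
    ("football scores", False),
    ("celebrity gossip news", False),
])
graph = {
    "seed": [("football news", "sports"), ("learning tutorial", "ml")],
    "ml": [("neural network papers", "nn")],
    "sports": [("football scores today", "scores")],
}
order = focused_crawl(graph, ["seed"], clf, budget=3)
print(order)
```

With a budget of three fetches, the frontier spends its requests on the links whose anchors look on-topic and leaves the low-scoring ones unvisited, which is the bandwidth saving the abstract refers to.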

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Passerini, A., Frasconi, P., Soda, G. (2001). Evaluation Methods for Focused Crawling. In: Esposito, F. (eds) AI*IA 2001: Advances in Artificial Intelligence. AI*IA 2001. Lecture Notes in Computer Science, vol 2175. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45411-X_4

  • DOI: https://doi.org/10.1007/3-540-45411-X_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42601-1

  • Online ISBN: 978-3-540-45411-3

  • eBook Packages: Springer Book Archive
