Agents, Crawlers, and Web Retrieval

  • Conference paper
Cooperative Information Agents VI (CIA 2002)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2446)

Abstract

In this paper we survey crawlers, a specific type of agent used by search engines. We also explore their relation to generic agents and how agent technology, or variants of it, could help to develop search engines that are more effective, efficient, and scalable.

Funded by Millennium Nucleus Center for Web Research, Mideplan, Chile.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Baeza-Yates, R., Piquer, J.M. (2002). Agents, Crawlers, and Web Retrieval. In: Klusch, M., Ossowski, S., Shehory, O. (eds) Cooperative Information Agents VI. CIA 2002. Lecture Notes in Computer Science, vol 2446. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45741-0_1

  • DOI: https://doi.org/10.1007/3-540-45741-0_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44173-1

  • Online ISBN: 978-3-540-45741-1

  • eBook Packages: Springer Book Archive
