Agents, Crawlers, and Web Retrieval

Baeza-Yates, Ricardo; Piquer, José Miguel

doi:10.1007/3-540-45741-0_1

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2446)

Included in the following conference series:

International Workshop on Cooperative Information Agents

Abstract

In this paper we survey crawlers, a specific type of agent used by search engines. We also explore their relation to generic agents and how agent technology, or variants of it, could help to develop search engines that are more effective, efficient, and scalable.

Funded by Millennium Nucleus Center for Web Research, Mideplan, Chile.

Author information

Authors and Affiliations

Center for Web Research, Dept. of Computer Science, University of Chile, Blanco Encalada, 2120, Santiago, Chile
Ricardo Baeza-Yates & José Miguel Piquer

Editor information

Editors and Affiliations

German Research Center for Artificial Intelligence, DFKI GmbH, Stuhlsatzenhausweg 3, 66123, Saarbrücken, Germany
Matthias Klusch
School of Engineering (ESCET), University Rey Juan Carlos, Campus de Mostoles, Calle Tulipan s/n, 28933, Madrid, Spain
Sascha Ossowski
IBM - Haifa Research Labs, Haifa University, Mount Carmel, 31905, Haifa, Israel
Onn Shehory

About this paper

Cite this paper

Baeza-Yates, R., Piquer, J.M. (2002). Agents, Crawlers, and Web Retrieval. In: Klusch, M., Ossowski, S., Shehory, O. (eds) Cooperative Information Agents VI. CIA 2002. Lecture Notes in Computer Science, vol 2446. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45741-0_1

DOI: https://doi.org/10.1007/3-540-45741-0_1
Published: 02 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44173-1
Online ISBN: 978-3-540-45741-1
eBook Packages: Springer Book Archive
