Das MINERVA-Projekt: Datenbankselektion für Peer-to-Peer-Websuche

  • Original Article
Informatik - Forschung und Entwicklung

Summary

This article presents MINERVA, a prototype implementation of a distributed search engine based on a peer-to-peer (P2P) architecture. MINERVA builds on distributed hash tables, a technique widely used in the P2P world, and uses them to construct a distributed directory. Peers in our approach correspond to fully autonomous users with their own local search capabilities who are willing to make their local knowledge and local search capabilities available as part of a collaboration. We formalize our system architecture and describe the central problem of efficiently finding promising peers within the federation for a given query. To this end, we draw on existing methods and adapt them to our system context. We present experiments on real data that compare several of these approaches. These experiments show that the quality of the approaches varies, which underlines the importance and impact of a powerful method for selecting good databases. Our experiments suggest that a small number of carefully selected databases typically already delivers a large share of all relevant results of the overall system.

Abstract

This paper presents the MINERVA project, which prototypes a distributed search engine based on P2P techniques. MINERVA is layered on top of a Chord-style overlay network and uses a powerful crawling, indexing, and search engine on every autonomous peer. We formalize our system model and identify the problem of efficiently selecting promising peers for a query as a pivotal issue. We revisit existing approaches to the database selection problem and adapt them to our system environment. Measurements on real-world data compare different selection strategies. The experiments show significant performance differences between the strategies and underline the importance of a judicious peer selection strategy. The experiments also present first evidence that a small number of carefully selected peers already provides the vast majority of all relevant results.
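The idea of a directory layered on a Chord-style ring, followed by peer selection for a query, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not taken from the article: the names `Directory`, `publish`, `lookup`, and `select_peers` are hypothetical, the ring is simulated in a single process, the per-term statistic is a plain document count, and the scoring rule (sum of document counts over the query terms) is a simple stand-in for the selection strategies the paper compares.

```python
import hashlib
from bisect import bisect_left
from collections import defaultdict

RING_BITS = 32  # assumption: a small identifier space, chosen for illustration


def ring_id(key: str) -> int:
    """Hash a string onto the identifier ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << RING_BITS)


class Directory:
    """In-process simulation of a term directory spread over a hash ring."""

    def __init__(self, peer_names):
        self.ring = sorted((ring_id(name), name) for name in peer_names)
        self.ids = [pid for pid, _ in self.ring]
        # posts[directory_peer][term] maps publishing peer -> document count
        self.posts = defaultdict(lambda: defaultdict(dict))

    def responsible_peer(self, term: str) -> str:
        """Return the successor of the term's hash, wrapping around the ring."""
        idx = bisect_left(self.ids, ring_id(term)) % len(self.ring)
        return self.ring[idx][1]

    def publish(self, peer: str, term: str, doc_count: int) -> None:
        """Store a peer's per-term statistic at the responsible directory peer."""
        self.posts[self.responsible_peer(term)][term][peer] = doc_count

    def lookup(self, term: str) -> dict:
        """Fetch all published statistics for one term."""
        return dict(self.posts[self.responsible_peer(term)].get(term, {}))


def select_peers(directory: Directory, query_terms, k: int):
    """Rank peers by summed document counts (a hypothetical scoring rule)."""
    scores = defaultdict(int)
    for term in query_terms:
        for peer, doc_count in directory.lookup(term).items():
            scores[peer] += doc_count
    # Sort by descending score, breaking ties by peer name for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [peer for peer, _ in ranked[:k]]


directory = Directory(["peer-a", "peer-b", "peer-c"])
directory.publish("peer-a", "chord", 12)
directory.publish("peer-b", "chord", 3)
directory.publish("peer-b", "search", 40)
directory.publish("peer-c", "search", 5)
print(select_peers(directory, ["chord", "search"], k=2))  # ['peer-b', 'peer-a']
```

In this sketch the query initiator contacts one directory peer per query term and then forwards the query only to the `k` best-scoring peers, which is the step whose quality the abstract identifies as pivotal.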

Additional information

CR Subject Classification

H.4, H.3.3, H.3.4

About this article

Cite this article

Bender, M., Michel, S., Weikum, G. et al. Das MINERVA-Projekt: Datenbankselektion für Peer-to-Peer-Websuche. Informatik Forsch. Entw. 20, 152–166 (2005). https://doi.org/10.1007/s00450-005-0205-9

  • DOI: https://doi.org/10.1007/s00450-005-0205-9
