Summary
This article presents MINERVA, a prototype implementation of a distributed search engine based on a peer-to-peer (P2P) architecture. MINERVA builds on distributed hash tables, a technique widely used in the P2P world, and uses them to construct a distributed directory. Peers in our approach correspond to fully autonomous users with their own local search facilities who are willing to make their local knowledge and local search capabilities available as part of a collaboration. We formalize our system architecture and describe the central problem of efficiently finding promising peers for a given query within the federation. To this end, we draw on existing methods and adapt them to our system context. We present experiments on real data that compare several of these approaches. These experiments show that the quality of the approaches varies, underscoring the importance and impact of a powerful method for selecting good databases. Our experiments indicate that a small number of carefully selected databases typically already delivers a large share of all relevant results of the overall system.
Abstract
This paper presents the MINERVA project, which prototypes a distributed search engine based on P2P techniques. MINERVA is layered on top of a Chord-style overlay network and uses a powerful crawling, indexing, and search engine on every autonomous peer. We formalize our system model and identify the problem of efficiently selecting promising peers for a query as a pivotal issue. We revisit existing approaches to the database selection problem and adapt them to our system environment. We compare different selection strategies in measurements on real-world data. The experiments show significant performance differences between the strategies and demonstrate the importance of a judicious peer selection strategy. They also provide initial evidence that a small number of carefully selected peers already provides the vast majority of all relevant results.
CR Subject Classification
H.4, H.3.3, H.3.4
Cite this article
Bender, M., Michel, S., Weikum, G. et al. Das MINERVA-Projekt: Datenbankselektion für Peer-to-Peer-Websuche. Informatik Forsch. Entw. 20, 152–166 (2005). https://doi.org/10.1007/s00450-005-0205-9
DOI: https://doi.org/10.1007/s00450-005-0205-9