Abstract
A number of approaches to query processing in Peer-to-Peer information systems efficiently retrieve relevant information from distributed peers. However, very few of them take the overlap between peers into account: because the most popular resources (e.g., documents or files) are often present at most of the peers, a large fraction of the documents eventually received by the query initiator are duplicates. We develop a technique based on the notion of global document occurrences (GDO) that, when processing a query, penalizes frequent documents increasingly as more and more peers contribute their local results. We argue that the additional effort to create and maintain the GDO information is reasonably low, as the necessary information can be piggybacked onto existing communication. Early experiments indicate that our approach significantly decreases the number of peers that have to be involved in a query to reach a certain level of recall and thus decreases user-perceived latency and the waste of network resources.
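The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical penalty in which a document's score is multiplied by the probability that none of the peers that have already contributed holds it, under the simplifying assumption that documents are spread independently and uniformly over peers. The function name `gdo_penalized_score` and all parameter names are illustrative.

```python
def gdo_penalized_score(base_score: float, gdo: int, num_peers: int,
                        peers_contributed: int) -> float:
    """Damp a document's score by its global document occurrences (GDO).

    Assumed model (not taken from the paper): a document held by `gdo`
    of `num_peers` peers is missing from a given peer with probability
    1 - gdo / num_peers, so after `peers_contributed` peers have answered
    it is still unseen with probability (1 - gdo / num_peers) ** k.
    """
    if num_peers <= 0:
        raise ValueError("num_peers must be positive")
    if not 0 <= gdo <= num_peers:
        raise ValueError("gdo must lie between 0 and num_peers")
    if peers_contributed < 0:
        raise ValueError("peers_contributed must be non-negative")

    holding_fraction = gdo / num_peers
    # The penalty grows with every additional peer that has contributed.
    novelty = (1.0 - holding_fraction) ** peers_contributed
    return base_score * novelty


if __name__ == "__main__":
    # A popular document (on 5 of 10 peers) versus a rare one (on 1 of 10).
    for k in range(4):
        popular = gdo_penalized_score(1.0, gdo=5, num_peers=10, peers_contributed=k)
        rare = gdo_penalized_score(1.0, gdo=1, num_peers=10, peers_contributed=k)
        print(k, popular, rare)
```

With this assumed penalty, the popular document loses weight much faster than the rare one as more peers contribute, which mirrors the behaviour the abstract describes: frequent documents are penalized increasingly as the query progresses.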
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Papapetrou, O., Michel, S., Bender, M., Weikum, G. (2005). On the Usage of Global Document Occurrences in Peer-to-Peer Information Systems. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2005: CoopIS, DOA, and ODBASE. OTM 2005. Lecture Notes in Computer Science, vol 3760. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11575771_21
DOI: https://doi.org/10.1007/11575771_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29736-9
Online ISBN: 978-3-540-32116-3
eBook Packages: Computer Science, Computer Science (R0)