Abstract
Peer-to-peer (P2P) systems have recently been proposed for providing search and information retrieval facilities over distributed data sources, including web data. Terms and their document frequencies are the main building blocks of retrieval and as such need to be computed, aggregated, and distributed throughout the system. This is a tedious task, because skewed document distributions mean that the local view of each peer may not reflect the global document collection. Moreover, central assembly of the total information is not feasible, due to the prohibitive cost of storage and maintenance, and also because of issues related to digital rights management. In this paper, we propose an efficient approach for aggregating the document frequencies of carefully selected terms based on a hierarchical overlay network. To this end, we examine unsupervised feature selection techniques at the individual peer level, in order to identify only a limited set of the most important terms for aggregation. We provide a theoretical analysis to compute the cost of our approach, and we conduct experiments on two document collections, in order to measure the quality of the aggregated document frequencies.
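The general idea of selecting terms at each peer and summing their document frequencies up a hierarchy can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the selection criterion or the overlay protocol, so the sketch assumes that each peer simply keeps its k terms with the highest local document frequency, and that the overlay is a plain tree of nested dictionaries. The names `local_df`, `select_terms`, `aggregate`, and the `docs`/`children` keys are hypothetical.

```python
from collections import Counter


def local_df(documents):
    """Document frequency per term for one peer; each document is a list of tokens."""
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each term at most once per document
    return df


def select_terms(df, k):
    """Keep the k terms with the highest local document frequency.

    Assumption: a stand-in for the unsupervised feature selection step;
    the paper may use a different criterion.
    """
    return Counter(dict(df.most_common(k)))


def aggregate(node, k):
    """Sum the selected term counts of a node and all of its descendants.

    Assumption: a node is a dict with optional "docs" and "children" keys,
    standing in for one peer (or super-peer) in a hierarchical overlay.
    """
    total = select_terms(local_df(node.get("docs", [])), k)
    for child in node.get("children", []):
        total.update(aggregate(child, k))  # Counter.update adds counts
    return total


# Two leaf peers under one root that holds no documents itself.
peer_a = {"docs": [["p2p", "search"], ["p2p", "index"], ["p2p", "search"]]}
peer_b = {"docs": [["p2p", "web"], ["web"]]}
root = {"children": [peer_a, peer_b]}

global_df = aggregate(root, k=2)
print(dict(global_df))
```

Because each peer forwards only its selected terms, the aggregated counts are approximations: a term dropped at one peer (here `index` at `peer_a`) is missing from, or undercounted in, the global view. How close such estimates come to the true document frequencies is the kind of quality question the abstract says the experiments address.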
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Neumayer, R., Doulkeridis, C., Nørvåg, K. (2009). Aggregation of Document Frequencies in Unstructured P2P Networks. In: Vossen, G., Long, D.D.E., Yu, J.X. (eds) Web Information Systems Engineering - WISE 2009. WISE 2009. Lecture Notes in Computer Science, vol 5802. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04409-0_8
DOI: https://doi.org/10.1007/978-3-642-04409-0_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04408-3
Online ISBN: 978-3-642-04409-0
eBook Packages: Computer Science, Computer Science (R0)