Abstract
Text clustering is an established technique for improving quality in information retrieval, in both centralized and distributed environments. However, in highly distributed environments such as peer-to-peer networks, current clustering algorithms fail to scale. Our algorithm for peer-to-peer clustering achieves high scalability by using a probabilistic approach for assigning documents to clusters. It enables a peer to compare each of its documents with only a few selected clusters, without significant loss of clustering quality. The algorithm offers probabilistic guarantees for the correctness of each document assignment to a cluster. Extensive experimental evaluation with up to 100,000 peers and 1 million documents demonstrates the scalability and effectiveness of the algorithm.
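The abstract does not say how the few candidate clusters are selected. As a rough illustration only, the following sketch assumes that each cluster is indexed under its most frequent terms and that a document is compared only with clusters sharing one of its own top terms. The names `build_index`, `assign`, and `top_k`, the toy centroids, and the use of cosine similarity are all hypothetical choices made for this example and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(centroids, top_k=2):
    """Map each term to the clusters that list it among their top_k terms."""
    index = defaultdict(set)
    for cluster_id, centroid in centroids.items():
        for term, _ in Counter(centroid).most_common(top_k):
            index[term].add(cluster_id)
    return index


def assign(doc, centroids, index, top_k=2):
    """Compare doc only with clusters that share one of its top_k terms.

    Returns (best_cluster_id, number_of_clusters_compared), or (None, 0)
    when no indexed cluster shares a top term with the document.
    """
    candidates = set()
    for term, _ in Counter(doc).most_common(top_k):
        candidates |= index.get(term, set())
    if not candidates:
        return None, 0
    best = max(candidates, key=lambda c: cosine(doc, centroids[c]))
    return best, len(candidates)


# Toy data (hypothetical): three cluster centroids and one document.
centroids = {
    "sports": {"match": 5, "team": 4, "goal": 2},
    "finance": {"market": 6, "stock": 5, "bank": 1},
    "science": {"cell": 4, "gene": 3, "lab": 1},
}
index = build_index(centroids)
doc = {"team": 3, "goal": 2, "coach": 1}
result = assign(doc, centroids, index)
unmatched = assign({"weather": 2}, centroids, index)
```

In this toy run the document is compared with one cluster instead of three. A filter like this can also miss the truly best cluster; the sketch makes no attempt to bound that error, whereas the abstract states that the actual algorithm gives probabilistic guarantees for each assignment.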
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Papapetrou, O., Siberski, W., Fuhr, N. (2010). Text Clustering for Peer-to-Peer Networks with Probabilistic Guarantees. In: Gurrin, C., et al. Advances in Information Retrieval. ECIR 2010. Lecture Notes in Computer Science, vol 5993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12275-0_27
DOI: https://doi.org/10.1007/978-3-642-12275-0_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12274-3
Online ISBN: 978-3-642-12275-0
eBook Packages: Computer Science, Computer Science (R0)